Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion
What makes a video?
Scene tells where
The scene specifies where the video takes place.
Subject tells who
The subject specifies who appears.
Motion tells how
The motion specifies how the subject/scene moves.
Tri-Prompting unifies all three in one video diffusion model.
A unified result with controllable scene, subject, and motion.
Our contributions are:
Unified framework. Jointly control scene, subject, and motion in one video diffusion model.
Dual-conditioned motion and multi-view subject consistency. Separate foreground and background motion while preserving subject identity across views.
Novel applications and strong results. Enable 3D-aware (multi-view) subject insertion and manipulation, with competitive performance against DaS and Phantom.
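The dual-conditioned motion design above feeds the model two separate streams: 3D tracking points for the background scene and downsampled RGB cues for the foreground subject. As a rough illustration of how such inputs might be assembled, here is a minimal NumPy sketch; the function name, tensor layout, downsampling choice, and normalization are all assumptions for illustration, not the paper's actual module interface.

```python
import numpy as np

def build_motion_conditions(fg_rgb, bg_tracks, target_hw=(32, 32)):
    """Assemble two motion-condition streams (hypothetical layout).

    fg_rgb:    (T, H, W, 3) foreground frames, downsampled to coarse RGB cues.
    bg_tracks: (T, N, 3)    3D tracking points for the background scene.
    Returns a dict with the two conditioning tensors; the real module's
    input format is not specified on this page.
    """
    T, H, W, _ = fg_rgb.shape
    th, tw = target_hw
    # Naive strided (nearest-neighbor) downsampling of the foreground RGB.
    ys = np.linspace(0, H - 1, th).astype(int)
    xs = np.linspace(0, W - 1, tw).astype(int)
    fg_cues = fg_rgb[:, ys][:, :, xs]  # (T, th, tw, 3)
    # Normalize 3D track coordinates to [0, 1] per axis for the motion branch.
    lo = bg_tracks.min(axis=(0, 1))
    hi = bg_tracks.max(axis=(0, 1))
    bg_norm = (bg_tracks - lo) / np.maximum(hi - lo, 1e-8)
    return {"foreground_rgb": fg_cues, "background_points": bg_norm}
```

Keeping the two streams separate is what lets foreground motion be controlled independently of background (camera) motion, per the contribution above.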
Video Results
Tri-Prompting enables two novel workflows.
Multi-view Subject Insertion
Insert a new subject into a scene and control its motion.
Multi-view Subject Manipulation
Control an existing subject in an image.
1. Cowboy Insertion across Scene, Subject, and Motion
Across scene
Across subject
Across motion
2. Subject Manipulation in Fictional and Real-World Scenes
3. Subject Insertion Applications in Fictional Scenes
Anime
Game
Movie
4. Subject Insertion Applications in Real-world Scenes
Human
Animal
Vehicle
Quantitative Results
Tri-Prompting introduces a unified setting for scene, subject, and motion control, while achieving competitive or superior performance against specialized baselines.
Abstract
Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video generation. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To balance controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into arbitrary scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
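The abstract mentions an inference ControlNet scale schedule that trades controllability against visual realism. The page does not give the schedule itself, but a common pattern is to scale the ControlNet residuals more strongly in early (high-noise) denoising steps, where structure is decided, and decay the scale in later steps so the base model can refine realism. The sketch below assumes a simple linear decay; the function names and endpoint values are illustrative, not the paper's actual schedule.

```python
import numpy as np

def controlnet_scale_schedule(num_steps, start=1.0, end=0.2):
    """Per-step ControlNet conditioning scales for inference.

    Hypothetical linear decay: strong conditioning early to lock in
    scene/motion structure, weaker conditioning late to let the base
    diffusion model restore visual realism.
    """
    return np.linspace(start, end, num_steps)

def apply_controlnet(residuals, scale):
    # Scale the ControlNet residual features before they are added to the
    # base denoiser's features (standard conditioning-scale usage).
    return [scale * r for r in residuals]

scales = controlnet_scale_schedule(num_steps=50)
```

At each denoising step `t`, one would call `apply_controlnet(residuals, scales[t])`; any monotone schedule (cosine, step-wise) fits the same interface.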
BibTeX
@article{Trip,
  title={Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion},
  author={Zhou, Zhenghong and Zhan, Xiaohang and Chen, Zhiqin and Kim, Soo Ye and Zhao, Nanxuan and Zheng, Haitian and Liu, Qing and Zhang, He and Lin, Zhe and Zhou, Yuqian and Luo, Jiebo},
  journal={arXiv preprint arXiv:2603.15614},
  year={2026}
}