Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion
What makes a video?
Scene tells where
The scene specifies where the video takes place.
Subject tells who
The subject specifies who appears.
Motion tells how
The motion specifies how the subject/scene moves.
Tri-Prompting unifies all three in one video diffusion model.
A unified result with controllable scene, subject, and motion.
Our contributions are:
Unified framework. Jointly control scene, subject, and motion in one video diffusion model.
Dual-conditioned motion and multi-view subject consistency. Separate foreground and background motion while preserving subject identity across views.
Novel applications and strong results. Enable 3D-aware (multi-view) subject insertion and manipulation, with competitive performance against DaS and Phantom.
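The dual-conditioned motion design above feeds the model two separate streams: 3D tracking points for the background scene and downsampled RGB cues for the foreground subject. As a rough illustration of how such inputs might be assembled, here is a minimal NumPy sketch; the function name, tensor layout, downsampling choice, and normalization are all assumptions for illustration, not the paper's actual module interface.

```python
import numpy as np

def build_motion_conditions(fg_rgb, bg_tracks, target_hw=(32, 32)):
    """Assemble two motion-condition streams (hypothetical layout).

    fg_rgb:    (T, H, W, 3) foreground frames, downsampled to coarse RGB cues.
    bg_tracks: (T, N, 3)    3D tracking points for the background scene.
    Returns a dict with the two conditioning tensors; the real module's
    input format is not specified on this page.
    """
    T, H, W, _ = fg_rgb.shape
    th, tw = target_hw
    # Naive strided (nearest-neighbor) downsampling of the foreground RGB.
    ys = np.linspace(0, H - 1, th).astype(int)
    xs = np.linspace(0, W - 1, tw).astype(int)
    fg_cues = fg_rgb[:, ys][:, :, xs]  # (T, th, tw, 3)
    # Normalize 3D track coordinates to [0, 1] per axis for the motion branch.
    lo = bg_tracks.min(axis=(0, 1))
    hi = bg_tracks.max(axis=(0, 1))
    bg_norm = (bg_tracks - lo) / np.maximum(hi - lo, 1e-8)
    return {"foreground_rgb": fg_cues, "background_points": bg_norm}
```

Keeping the two streams separate is what lets foreground motion be controlled independently of background (camera) motion, per the contribution above.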
Video Results
Tri-Prompting enables two novel workflows.
Multi-view Subject Insertion
Insert a new subject into a scene and control its motion.
Multi-view Subject Manipulation
Control an existing subject in an image.
1. Cowboy Insertion across Scene, Subject, and Motion
Across scene
Across subject
Across motion
2. Subject Manipulation in Fictional and Real-World Scenes
3. Subject Insertion Applications in Fictional Scenes
Anime
Game
Movie
4. Subject Insertion Applications in Real-world Scenes
Human
Animal
Vehicle
Quantitative Results
Tri-Prompting introduces a unified setting for scene, subject, and motion control, while achieving competitive or superior performance against specialized baselines.
Abstract
Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video generation. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To balance controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into arbitrary scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
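The abstract mentions an inference ControlNet scale schedule that trades controllability against visual realism. The page does not give the schedule itself, but a common pattern is to scale the ControlNet residuals more strongly in early (high-noise) denoising steps, where structure is decided, and decay the scale in later steps so the base model can refine realism. The sketch below assumes a simple linear decay; the function names and endpoint values are illustrative, not the paper's actual schedule.

```python
import numpy as np

def controlnet_scale_schedule(num_steps, start=1.0, end=0.2):
    """Per-step ControlNet conditioning scales for inference.

    Hypothetical linear decay: strong conditioning early to lock in
    scene/motion structure, weaker conditioning late to let the base
    diffusion model restore visual realism.
    """
    return np.linspace(start, end, num_steps)

def apply_controlnet(residuals, scale):
    # Scale the ControlNet residual features before they are added to the
    # base denoiser's features (standard conditioning-scale usage).
    return [scale * r for r in residuals]

scales = controlnet_scale_schedule(num_steps=50)
```

At each denoising step `t`, one would call `apply_controlnet(residuals, scales[t])`; any monotone schedule (cosine, step-wise) fits the same interface.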
BibTeX
@article{Trip,
  title={Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion},
  author={Zhou, Zhenghong and Zhan, Xiaohang and Chen, Zhiqin and Kim, Soo Ye and Zhao, Nanxuan and Zheng, Haitian and Liu, Qing and Zhang, He and Lin, Zhe and Zhou, Yuqian and Luo, Jiebo},
  journal={arXiv preprint arXiv:2603.15614},
  year={2026}
}