Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

1Adobe Research   2University of Rochester
*Work done during an internship at Adobe.
Corresponding authors.
arXiv

What makes a video?

Scene tells where

The scene specifies where the video takes place.

Subject tells who

The subject specifies who appears.

Motion tells how

The motion specifies how the subject/scene moves.

Tri-Prompting unifies all three in one video diffusion model.

Tri-Prompting result combining scene, subject, and motion

A unified result with controllable scene, subject, and motion.

Our contributions are:

Unified framework. Jointly control scene, subject, and motion in one video diffusion model.

Dual-conditioned motion and multi-view subject consistency. Separate foreground and background motion while preserving subject identity across views.

Novel applications and strong results. Enable 3D-aware (multi-view) subject insertion and manipulation, with competitive performance against DaS and Phantom.
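The dual-conditioned motion idea above can be sketched as follows. This is a minimal, illustrative toy in NumPy, not the authors' implementation: the function names, feature shapes, and fusion-by-concatenation are all assumptions; the paper only specifies that background motion is driven by 3D tracking points and foreground subjects by downsampled RGB cues.

```python
import numpy as np

# Hypothetical sketch of dual-conditioned motion: background motion is
# encoded from 3D tracking points, foreground subjects from downsampled
# RGB cues, and the two streams are fused into one conditioning vector.
# All names and shapes here are illustrative assumptions.

def encode_background(track_points: np.ndarray) -> np.ndarray:
    """Flatten 3D tracking points (N, 3) into a feature vector."""
    return np.tanh(track_points.reshape(-1))

def encode_foreground(rgb: np.ndarray, factor: int = 4) -> np.ndarray:
    """Downsample an RGB frame (H, W, 3) by striding, then flatten."""
    return rgb[::factor, ::factor].reshape(-1) / 255.0

def fuse_conditions(bg_feat: np.ndarray, fg_feat: np.ndarray) -> np.ndarray:
    """Concatenate the two condition streams for the motion module."""
    return np.concatenate([bg_feat, fg_feat])

points = np.zeros((8, 3))       # toy 3D tracking points for the background
frame = np.zeros((16, 16, 3))   # toy RGB frame for the foreground subject
cond = fuse_conditions(encode_background(points), encode_foreground(frame))
print(cond.shape)  # (72,): 8*3 background dims + 4*4*3 foreground dims
```

In practice both streams would pass through learned encoders inside the diffusion model; the point of the sketch is only the separation of background (geometry-driven) and foreground (appearance-driven) conditions.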

Video Results

Tri-Prompting enables two novel workflows.

Multi-view Subject Insertion

Insertion demo 1

Insert a new subject into a scene and control its motion.

Multi-view Subject Manipulation

Manipulation demo 1

Control an existing subject in an image.

Note: Due to lossy GIF compression, banding artifacts or flickering may appear in these GIFs; they are not present in the original videos.

1. Cowboy Insertion across Scene, Subject, and Motion

Cowboy demo
Cowboy across scene 1
Cowboy across scene 2

Across scene

Cowboy across subject bear
Cowboy across subject dinosaur

Across subject

Cowboy motion circle
Cowboy motion left

Across motion


2. Subject Manipulation in Fictional and Real-World Scenes

Subject manipulation results: four examples, each showing the input background alongside two manipulated views of the subject.

3. Subject Insertion Applications in Fictional Scenes

Anime bread
Anime cat

Anime

Game bear
Game woman

Game

Movie man
Movie swordsman

Movie


4. Subject Insertion Applications in Real-World Scenes

Real world astronaut
Real world human

Human

Real world cat
Real world dog

Animal

Real world car
Real world boat

Vehicle

Quantitative Results

Tri-Prompting introduces a unified setting for scene, subject, and motion control, while achieving competitive or superior performance against specialized baselines.

Main quantitative results comparing Tri-Prompting with DaS and Phantom

Abstract

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video generation. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To balance controllability and visual realism, we further propose an inference-time ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into arbitrary scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
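The inference-time ControlNet scale schedule mentioned in the abstract can be illustrated with a small sketch. The paper only states that a schedule balances controllability and realism; the cosine decay below, from a strong scale early (when structure is formed) to a weak scale late (when fine texture is refined), is one plausible instantiation and not necessarily the authors' exact formula. The function name and the `s_max`/`s_min` endpoints are assumptions.

```python
import math

# Hypothetical ControlNet conditioning-scale schedule for inference:
# decay the scale over denoising steps so early steps follow the control
# signal closely (structure) and late steps favor realism (texture).
def controlnet_scale(step: int, num_steps: int,
                     s_max: float = 1.0, s_min: float = 0.2) -> float:
    """Cosine decay from s_max at step 0 to s_min at the final step."""
    t = step / max(num_steps - 1, 1)  # progress in [0, 1]
    return s_min + 0.5 * (s_max - s_min) * (1.0 + math.cos(math.pi * t))

schedule = [round(controlnet_scale(i, 10), 3) for i in range(10)]
print(schedule[0], schedule[-1])  # 1.0 0.2
```

In a typical diffusion pipeline this scalar would multiply the ControlNet residuals added to the base model at each denoising step.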

Method

Tri-Prompting method overview

Click to open the full-resolution image.

BibTeX

@article{Trip,
  title={Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion},
  author={Zhou, Zhenghong and Zhan, Xiaohang and Chen, Zhiqin and Kim, Soo Ye and Zhao, Nanxuan and Zheng, Haitian and Liu, Qing and Zhang, He and Lin, Zhe and Zhou, Yuqian and Luo, Jiebo},
  journal={arXiv preprint arXiv:2603.15614},
  year={2026}
}