nielsr (HF Staff) committed
Commit 4fbf53b · verified · 1 Parent(s): e684092

Update model card for Tora2

This PR updates the model card to reflect the new model, Tora2, as presented in the paper [Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation](https://huggingface.co/papers/2507.05963).

Specifically, it:
- Updates the model title and abstract to reflect Tora2.
- Updates the paper link to the official Hugging Face paper page for Tora2.
- Updates the project page link to the dedicated Tora2 project page.
- Enriches the metadata with structured links for the paper, project page, and GitHub repository.
- Adds new tags (`tora2`, `multi-entity-video-generation`) for better discoverability.
- Integrates comprehensive usage sections (Installation, Inference, Training, etc.) directly from the GitHub repository README to provide a more self-contained and useful resource on the Hub.
- Updates the citation to reflect the Tora2 paper.

Files changed (1)
  1. README.md +222 -18
README.md CHANGED
@@ -1,30 +1,36 @@
  ---
- license: other
- language:
- - en
  base_model:
  - THUDM/CogVideoX-5b
- pipeline_tag: text-to-video
  library_name: diffusers
  tags:
  - video
  - video-generation
  - cogvideox
  - alibaba
  ---
  <div align="center">

  <img src="icon.jpg" width="250"/>

- <h2><center>Tora: Trajectory-oriented Diffusion Transformer for Video Generation</h2>

  Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang

  \* equal contribution
  <br>

- <a href='https://arxiv.org/abs/2407.21705'><img src='https://img.shields.io/badge/ArXiv-2407.21705-red'></a>
- <a href='https://ali-videoai.github.io/tora_video/'><img src='https://img.shields.io/badge/Project-Page-Blue'></a>
  <a href="https://github.com/alibaba/Tora"><img src='https://img.shields.io/badge/Github-Link-orange'></a>
  <a href='https://www.modelscope.cn/studios/xiaoche/Tora'><img src='https://img.shields.io/badge/🤖_ModelScope-ZH_demo-%23654dfc'></a>
  <a href='https://www.modelscope.cn/studios/Alibaba_Research_Intelligence_Computing/Tora_En'><img src='https://img.shields.io/badge/🤖_ModelScope-EN_demo-%23654dfc'></a>
@@ -36,15 +42,16 @@ Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu
  <a href='https://huggingface.co/Alibaba-Research-Intelligence-Computing/Tora_T2V_diffusers'><img src='https://img.shields.io/badge/🤗_HuggingFace-T2V_weights(diffusers)-%23ff9e0e'></a>
  </div>

-
- ## Please visit our [Github repo](https://github.com/alibaba/Tora) for more details.

  ## 💡 Abstract

- Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos following trajectories. Our design aligns seamlessly with DiT’s scalability, allowing precise control of video content’s dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora’s excellence in achieving high motion fidelity, while also meticulously simulating the movement of physical world.

  ## 📣 Updates

  - `2025/01/06` 🔥🔥We released Tora Image-to-Video, including inference code and model weights.
  - `2024/12/13` SageAttention2 and model compilation are supported in diffusers version. Tested on the A10, these approaches speed up every inference step by approximately 52%, except for the first step.
  - `2024/12/09` 🔥🔥Diffusers version of Tora and the corresponding model weights are released. Inference VRAM requirements are reduced to around 5 GiB. Please refer to [this](diffusers-version/README.md) for details.
@@ -56,6 +63,21 @@ Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable
  - `2024/08/27` We released our v2 paper including appendix.
  - `2024/07/31` We submitted our paper on arXiv and released our project page.

  ## 🎞️ Showcases

  https://github.com/user-attachments/assets/949d5e99-18c9-49d6-b669-9003ccd44bf1
@@ -66,6 +88,190 @@ https://github.com/user-attachments/assets/4026c23d-229d-45d7-b5be-6f3eb9e4fd50

  All videos are available in this [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/showcases.zip)

  ## 🤝 Acknowledgements

  We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:
@@ -84,13 +290,11 @@ Special thanks to the contributors of these libraries for their hard work and de
  ## 📚 Citation

  ```bibtex
- @misc{zhang2024toratrajectoryorienteddiffusiontransformer,
-   title={Tora: Trajectory-oriented Diffusion Transformer for Video Generation},
-   author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
-   year={2024},
-   eprint={2407.21705},
-   archivePrefix={arXiv},
-   primaryClass={cs.CV},
-   url={https://arxiv.org/abs/2407.21705},
  }
  ```
 
  ---
  base_model:
  - THUDM/CogVideoX-5b
+ language:
+ - en
  library_name: diffusers
+ license: other
+ pipeline_tag: text-to-video
  tags:
  - video
  - video-generation
  - cogvideox
  - alibaba
+ - tora2
+ - multi-entity-video-generation
+ paper: https://huggingface.co/papers/2507.05963
+ url: https://ali-videoai.github.io/Tora2_page/
+ repo_code: https://github.com/alibaba/Tora
  ---
+
  <div align="center">

  <img src="icon.jpg" width="250"/>

+ <h2><center>Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation</h2>

  Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang

  \* equal contribution
  <br>

+ <a href='https://huggingface.co/papers/2507.05963'><img src='https://img.shields.io/badge/Paper-2507.05963-red'></a>
+ <a href='https://ali-videoai.github.io/Tora2_page/'><img src='https://img.shields.io/badge/Project-Page-Blue'></a>
  <a href="https://github.com/alibaba/Tora"><img src='https://img.shields.io/badge/Github-Link-orange'></a>
  <a href='https://www.modelscope.cn/studios/xiaoche/Tora'><img src='https://img.shields.io/badge/🤖_ModelScope-ZH_demo-%23654dfc'></a>
  <a href='https://www.modelscope.cn/studios/Alibaba_Research_Intelligence_Computing/Tora_En'><img src='https://img.shields.io/badge/🤖_ModelScope-EN_demo-%23654dfc'></a>
  <a href='https://huggingface.co/Alibaba-Research-Intelligence-Computing/Tora_T2V_diffusers'><img src='https://img.shields.io/badge/🤗_HuggingFace-T2V_weights(diffusers)-%23ff9e0e'></a>
  </div>

+ This is the official repository for the paper "Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation".

  ## 💡 Abstract

+ Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation.

  ## 📣 Updates

+ - `2025/07/08` 🔥🔥 Our latest work, [Tora2](https://ali-videoai.github.io/Tora2_page/), has been accepted by ACM MM25. Tora2 builds on Tora with design improvements, enabling enhanced appearance and motion customization for multiple entities.
+ - `2025/05/24` We open-sourced a LoRA-finetuned model of [Wan](https://github.com/Wan-Video/Wan2.1). It turns things in the image into fluffy toys. Check this out: https://github.com/alibaba/wan-toy-transform
  - `2025/01/06` 🔥🔥We released Tora Image-to-Video, including inference code and model weights.
  - `2024/12/13` SageAttention2 and model compilation are supported in diffusers version. Tested on the A10, these approaches speed up every inference step by approximately 52%, except for the first step.
  - `2024/12/09` 🔥🔥Diffusers version of Tora and the corresponding model weights are released. Inference VRAM requirements are reduced to around 5 GiB. Please refer to [this](diffusers-version/README.md) for details.
  - `2024/08/27` We released our v2 paper including appendix.
  - `2024/07/31` We submitted our paper on arXiv and released our project page.
 
+ ## 📑 Table of Contents
+
+ - [🎞️ Showcases](#%EF%B8%8F-showcases)
+ - [✅ TODO List](#-todo-list)
+ - [🧨 Diffusers version](#-diffusers-version)
+ - [🐍 Installation](#-installation)
+ - [📦 Model Weights](#-model-weights)
+ - [🔄 Inference](#-inference)
+ - [🖥️ Gradio Demo](#%EF%B8%8F-gradio-demo)
+ - [🧠 Training](#-training)
+ - [🎯 Troubleshooting](#-troubleshooting)
+ - [🤝 Acknowledgements](#-acknowledgements)
+ - [📄 Our previous work](#-our-previous-work)
+ - [📚 Citation](#-citation)
+
  ## 🎞️ Showcases

  https://github.com/user-attachments/assets/949d5e99-18c9-49d6-b669-9003ccd44bf1

  All videos are available in this [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/showcases.zip)

+ ## ✅ TODO List
+
+ - [x] Release our inference code and model weights
+ - [x] Provide a ModelScope Demo
+ - [x] Release our training code
+ - [x] Release diffusers version and optimize the GPU memory usage
+ - [x] Release complete version of Tora
+
+ ## 🧨 Diffusers version
+
+ Please refer to [the diffusers version](diffusers-version/README.md) for details.
+
+ ## 🐍 Installation
+
+ Please make sure your Python version is between 3.10 and 3.12, inclusive.
+
+ ```bash
+ # Clone this repository.
+ git clone https://github.com/alibaba/Tora.git
+ cd Tora
+
+ # Install PyTorch (we use PyTorch 2.4.0) and torchvision following the official instructions: https://pytorch.org/get-started/previous-versions/. For example:
+ conda create -n tora python==3.10
+ conda activate tora
+ conda install pytorch==2.4.0 torchvision==0.19.0 pytorch-cuda=12.1 -c pytorch -c nvidia
+
+ # Install requirements
+ cd modules/SwissArmyTransformer
+ pip install -e .
+ cd ../../sat
+ pip install -r requirements.txt
+ cd ..
+ ```
+
+ ## 📦 Model Weights
+
+ ### Folder Structure
+
+ ```
+ Tora
+ └── sat
+     └── ckpts
+         ├── t5-v1_1-xxl
+         │   ├── model-00001-of-00002.safetensors
+         │   └── ...
+         ├── vae
+         │   └── 3d-vae.pt
+         ├── tora
+         │   ├── i2v
+         │   │   └── mp_rank_00_model_states.pt
+         │   └── t2v
+         │       └── mp_rank_00_model_states.pt
+         └── CogVideoX-5b-sat # for training stage 1
+             └── mp_rank_00_model_states.pt
+ ```
+
+ ### Download Links
+
+ *Note: Downloading the `tora` weights requires following the [CogVideoX License](CogVideoX_LICENSE).* You can choose one of the following options: HuggingFace, ModelScope, or the native links.
+ After downloading the model weights, you can put them in the `Tora/sat/ckpts` folder.
+
+ #### HuggingFace
+
+ ```bash
+ # This can be faster
+ pip install "huggingface_hub[hf_transfer]"
+ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Alibaba-Research-Intelligence-Computing/Tora --local-dir ckpts
+ ```
+
+ or
+
+ ```bash
+ # use git
+ git lfs install
+ git clone https://huggingface.co/Alibaba-Research-Intelligence-Computing/Tora
+ ```
+
+ #### ModelScope
+
+ - SDK
+
+ ```python
+ from modelscope import snapshot_download
+ model_dir = snapshot_download('xiaoche/Tora')
+ ```
+
+ - Git
+
+ ```bash
+ git clone https://www.modelscope.cn/xiaoche/Tora.git
+ ```
+
+ #### Native
+
+ - Download the VAE and T5 model following [CogVideo](https://github.com/THUDM/CogVideo/blob/main/sat/README.md#2-download-model-weights) (a scripted alternative is sketched after this list):
+   - VAE: https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
+   - T5: [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder), [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer)
+ - Tora t2v model weights: [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/mp_rank_00_model_states.pt). Downloading this weight requires following the [CogVideoX License](CogVideoX_LICENSE).
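
If you prefer to script the T5 download rather than fetch the folders by hand, the minimal sketch below uses `huggingface_hub` to pull only the `text_encoder` and `tokenizer` folders from `THUDM/CogVideoX-2b`. The download directory name is an assumption for illustration; the files may still need rearranging to match the `t5-v1_1-xxl` layout shown in the folder structure above, and the VAE and Tora checkpoints still come from the links in this list.

```python
# Sketch: fetch the CogVideoX T5 text encoder and tokenizer with huggingface_hub.
# Assumption: "sat/ckpts/t5_download" is just an illustrative target directory;
# arrange the downloaded files to match the t5-v1_1-xxl layout shown above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="THUDM/CogVideoX-2b",
    allow_patterns=["text_encoder/*", "tokenizer/*"],  # skip the rest of the repo
    local_dir="sat/ckpts/t5_download",
)
```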
+
+ ## 🔄 Inference
+
+ ### Text to Video
+
+ Inference requires around 30 GiB of GPU memory (tested on an NVIDIA A100).
+
+ ```bash
+ cd sat
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/t2v --output-dir samples --point_path trajs/coaster.txt --input-file assets/text/t2v/examples.txt
+ ```
+
+ You can change `--input-file` and `--point_path` to your own prompt and trajectory point files. Please note that the trajectory is drawn on a 256x256 canvas (a sketch for generating such a file follows below).
+
+ Replace `$N_GPU` with the number of GPUs you want to use.
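
As a minimal illustration of a trajectory point file, the sketch below writes a simple left-to-right arc on the 256x256 canvas mentioned above. The one-point-per-line layout is an assumption for illustration only; open a bundled example such as `trajs/coaster.txt` and match its exact format before passing a generated file to `--point_path`.

```python
# Sketch: write a toy trajectory on the 256x256 canvas used by Tora.
# Assumption: "x, y" per line is illustrative only -- mirror the layout of
# trajs/coaster.txt from the repository. Run this from the sat/ directory.
import math

points = []
for i in range(64):                      # 64 points swept from left to right
    x = 16 + i * (224 / 63)              # stay inside the 256x256 canvas
    y = 128 + 48 * math.sin(i / 63 * math.pi)  # gentle arc in the vertical axis
    points.append((round(x), round(y)))

with open("trajs/my_traj.txt", "w") as f:
    for x, y in points:
        f.write(f"{x}, {y}\n")

# Then sample with: --point_path trajs/my_traj.txt
```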
+
+ ### Image to Video
+
+ ```bash
+ cd sat
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora_i2v.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/i2v --output-dir samples --point_path trajs/sawtooth.txt --input-file assets/text/i2v/examples.txt --img_dir assets/images --image2video
+ ```
+
+ The first-frame images should be placed in the directory given by `--img_dir`. The names of these images should be specified in the corresponding text prompt in `--input-file`, separated by `@@`.
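
To make the `@@` pairing concrete, here is a minimal sketch that assembles a hypothetical `--input-file` from a list of image/prompt pairs. The ordering of the image name and the prompt around `@@` is an assumption in this sketch; mirror the bundled `assets/text/i2v/examples.txt` if it disagrees.

```python
# Sketch: build an image-to-video prompt file pairing first-frame images with prompts.
# Assumption: "<image name>@@<prompt>" per line is illustrative; check
# assets/text/i2v/examples.txt in the repository for the authoritative layout.
pairs = [
    ("dog.jpg", "A corgi sprinting across a sunlit meadow, shallow depth of field."),
    ("boat.jpg", "A small wooden boat drifting along a winding river at dusk."),
]

with open("assets/text/i2v/my_examples.txt", "w") as f:
    for image_name, prompt in pairs:
        f.write(f"{image_name}@@{prompt}\n")

# dog.jpg and boat.jpg must exist in the folder passed via --img_dir.
```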
+
+ ### Recommendations for Text Prompts
+
+ For text prompts, we highly recommend using GPT-4 to enhance the details (a sketch is given after the links below). Simple prompts may negatively impact both visual quality and motion control effectiveness.
+
+ You can refer to the following resources for guidance:
+
+ - [CogVideoX Documentation](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py)
+ - [OpenSora Scripts](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/inference.py)
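
One possible way to follow the recommendation above is to ask a chat model to expand a terse prompt into a richly detailed description before sampling. The sketch below uses the `openai` Python client; the system instruction and model name are assumptions, and the CogVideoX `convert_demo.py` linked above remains the reference approach.

```python
# Sketch: expand a short prompt into a detailed video description with GPT-4.
# Assumptions: the `openai` package is installed, OPENAI_API_KEY is set, and the
# "gpt-4" model is available to you; swap in whatever chat model you actually use.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "Rewrite the user's short video idea as one richly detailed prompt: "
    "describe the subject, its motion, the camera, lighting, and background "
    "in a single paragraph."
)

def enhance_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(enhance_prompt("a red kite flying over the sea"))
```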
+
+ ## 🖥️ Gradio Demo
+
+ Usage:
+
+ ```bash
+ cd sat
+ python app.py --load ckpts/tora/t2v
+ ```
+
+ ## 🧠 Training
+
+ ### Data Preparation
+
+ Following [this guide](https://github.com/THUDM/CogVideo/blob/main/sat/README.md#preparing-the-dataset), structure the dataset as follows:
+
+ ```
+ .
+ ├── labels
+ │   ├── 1.txt
+ │   ├── 2.txt
+ │   ├── ...
+ └── videos
+     ├── 1.mp4
+     ├── 2.mp4
+     ├── ...
+ ```
+
+ Training data examples are in `sat/training_examples`. A small pairing check is sketched below.
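
A quick way to catch mismatches before launching a run is to verify that every video has a caption file and vice versa. This is a generic sanity-check sketch, not a repository utility; it relies only on the `labels/` and `videos/` layout shown above.

```python
# Sketch: sanity-check the labels/ and videos/ pairing described above.
# Generic helper (not from the Tora repository): every N.mp4 should have a matching N.txt.
from pathlib import Path

def check_dataset(root: str) -> None:
    root_path = Path(root)
    label_ids = {p.stem for p in (root_path / "labels").glob("*.txt")}
    video_ids = {p.stem for p in (root_path / "videos").glob("*.mp4")}

    missing_labels = sorted(video_ids - label_ids)
    missing_videos = sorted(label_ids - video_ids)
    if missing_labels:
        print(f"videos without a caption file: {missing_labels}")
    if missing_videos:
        print(f"caption files without a video: {missing_videos}")
    if not missing_labels and not missing_videos:
        print(f"OK: {len(video_ids)} video/caption pairs found")

check_dataset(".")  # point this at your dataset root
```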
+
+ ### Text to Video
+
+ Training requires around 60 GiB of GPU memory (tested on an NVIDIA A100).
+
+ Replace `$N_GPU` with the number of GPUs you want to use.
+
+ - Stage 1
+
+ ```bash
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_dense.yaml --experiment-name "t2v-stage1"
+ ```
+
+ - Stage 2
+
+ ```bash
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_sparse.yaml --experiment-name "t2v-stage2"
+ ```
+
+ ## 🎯 Troubleshooting
+
+ ### 1. ValueError: Non-consecutive added token...
+
+ Upgrade the transformers package to 4.44.2. See [this issue](https://github.com/THUDM/CogVideo/issues/213).
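
To confirm the fix took effect, a quick version check (generic Python, not a repository utility) is shown below.

```python
# Sketch: confirm the installed transformers version matches the pin recommended above.
from importlib.metadata import version

installed = version("transformers")
if installed != "4.44.2":
    print(f"transformers {installed} is installed; the fix above recommends 4.44.2")
else:
    print("transformers 4.44.2 is installed")
```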
274
+
275
  ## 🤝 Acknowledgements
276
 
277
  We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:
 
290
  ## 📚 Citation
291
 
292
  ```bibtex
293
+ @article{zhang2025tora2,
294
+ title={Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation},
295
+ author={Zhang, Zhenghao and Liao, Junchao and Li, Menghao and Dai, Zuozhuo and Qiu, Bingxue and Zhu, Siyu and Qin, Long and Wang, Weizhi},
296
+ journal={arXiv preprint arXiv:2507.05963},
297
+ year={2025},
298
+ url={https://huggingface.co/papers/2507.05963},
 
 
299
  }
300
  ```