Vidi: Large Multimodal Models for Video Understanding and Editing

Homepage: https://bytedance.github.io/vidi-website/

GitHub: https://github.com/bytedance/vidi

We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing (VUE) scenarios. The first release focuses on temporal retrieval (TR), i.e., identifying the time ranges in input videos corresponding to a given text query.
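
To make the temporal retrieval task concrete, the sketch below models the output as a list of time ranges (in seconds) and scores a prediction against reference ranges with temporal intersection-over-union. This is only an illustration: the `TimeRange` class, the helper functions, the example query, and the choice of IoU as a score are assumptions made here, not the data format or metric used by the Vidi codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRange:
    """A closed interval [start, end] in seconds (hypothetical type)."""
    start: float
    end: float

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("end must not precede start")

    @property
    def duration(self) -> float:
        return self.end - self.start


def merge_ranges(ranges):
    """Merge overlapping or touching ranges into a sorted, disjoint list."""
    merged = []
    for r in sorted(ranges, key=lambda r: r.start):
        if merged and r.start <= merged[-1].end:
            last = merged[-1]
            merged[-1] = TimeRange(last.start, max(last.end, r.end))
        else:
            merged.append(r)
    return merged


def total_duration(ranges):
    """Total length covered by a set of ranges, counting overlaps once."""
    return sum(r.duration for r in merge_ranges(ranges))


def temporal_iou(predicted, ground_truth):
    """Intersection-over-union between two sets of time ranges."""
    pred_len = total_duration(predicted)
    gt_len = total_duration(ground_truth)
    union = total_duration(list(predicted) + list(ground_truth))
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    intersection = pred_len + gt_len - union
    return intersection / union if union > 0 else 0.0


# Hypothetical query with made-up predicted and reference ranges.
query = "a person opens the door"
predicted = [TimeRange(10.0, 20.0), TimeRange(40.0, 50.0)]
ground_truth = [TimeRange(15.0, 25.0), TimeRange(40.0, 45.0)]
iou = temporal_iou(predicted, ground_truth)
```

Treating the answer as a set of ranges rather than a single span allows for queries that match several disjoint moments in one video.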

This checkpoint is the first release, which targets temporal retrieval.

The inference and evaluation code is available at https://github.com/bytedance/vidi.

Citation

If you find Vidi useful for your research and applications, please cite using this BibTeX:

@article{Vidi2025vidi2,
          title={Vidi2: Large Multimodal Models for Video 
                  Understanding and Creation},
          author={{Vidi Team} and Celong Liu and Chia-Wen Kuo and
                  Chuang Huang and Dawei Du and Fan Chen and
                  Guang Chen and Haoji Zhang and Haojun Zhao and
                  Lingxi Zhang and Lu Guo and Lusha Li and
                  Longyin Wen and Qihang Fan and Qingyu Chen and
                  Rachel Deng and Sijie Zhu and Stuart Siew and
                  Tong Jin and Weiyan Tao and Wen Zhong and
                  Xiaohui Shen and Xin Gu and Zhenfang Chen and
                  Zuhua Lin},
          journal={arXiv preprint arXiv:2511.19529},
          year={2025}
}

@article{Vidi2025vidi,
          title={Vidi: Large Multimodal Models for Video 
                  Understanding and Editing},
          author={{Vidi Team} and Celong Liu and Chia-Wen Kuo and
                  Dawei Du and Fan Chen and Guang Chen and
                  Jiamin Yuan and Lingxi Zhang and Lu Guo and
                  Lusha Li and Longyin Wen and Qingyu Chen and
                  Rachel Deng and Sijie Zhu and Stuart Siew and
                  Tong Jin and Wei Lu and Wen Zhong and
                  Xiaohui Shen and Xin Gu and Xing Mei and
                  Xueqiong Qu and Zhenfang Chen},
          journal={arXiv preprint arXiv:2504.15681},
          year={2025}
}

Safetensors

Model size: 9B params

Tensor type: BF16

Model tree for bytedance-research/Vidi-7B

Base model: mistralai/Mistral-7B-v0.3

Finetuned: mistralai/Mistral-7B-Instruct-v0.3

Finetuned: this model