cjwbw / controlvideo

Training-free Controllable Text-to-Video Generation

Demo API Examples Versions (91710b3f)


View more examples

Run time and cost

Predictions run on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 4 minutes. The predict time for this model varies significantly based on the inputs.


Official PyTorch implementation of “ControlVideo: Training-free Controllable Text-to-Video Generation”

ControlVideo adapts ControlNet to the video counterpart without any finetuning, aiming to directly inherit its high-quality and consistent generation


If you make use of our work, please cite our paper.

  title={ControlVideo: Training-free Controllable Text-to-Video Generation},
  author={Zhang, Yabo and Wei, Yuxiang and Jiang, Dongsheng and Zhang, Xiaopeng and Zuo, Wangmeng and Tian, Qi},
  journal={arXiv preprint arXiv:2305.13077},


This work repository borrows heavily from Diffusers, ControlNet, Tune-A-Video, and RIFE.

There are also many interesting works on video generation: Tune-A-Video, Text2Video-Zero, Follow-Your-Pose, Control-A-Video, et al.