chenxwh / text2video-zero

Text-to-Image Diffusion Models are Zero-Shot Video Generators

  • Public
  • 40.6K runs



Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 85 seconds.



Official code for Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

Our method, Text2Video-Zero, enables zero-shot video generation using (i) a textual prompt (see rows 1, 2), (ii) a prompt combined with guidance from poses or edges (see lower right), and (iii) Video Instruct-Pix2Pix, i.e., instruction-guided video editing (see lower left). Results are temporally consistent and closely follow the guidance and textual prompts.


The code is published under the CreativeML Open RAIL-M license. The license provided in this repository applies to all additions and contributions we make upon the original Stable Diffusion code. The original Stable Diffusion code is under the CreativeML Open RAIL-M license, which can be found here.


If you use our work in your research, please cite our publication:

    title={Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators},
    author={Khachatryan, Levon and Movsisyan, Andranik and Tadevosyan, Vahram and Henschel, Roberto and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
    journal={arXiv preprint arXiv:2303.13439},