ali-vilab / i2vgen-xl

RESEARCH/NON-COMMERCIAL USE ONLY: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

  • Public
  • 74.5K runs
  • GitHub
  • Paper



Run time and cost

This model runs on Nvidia A100 (40GB) GPU hardware. Predictions typically complete within 3 minutes. The predict time for this model varies significantly based on the inputs.




VGen is an open-source video synthesis codebase developed by the Tongyi Lab of Alibaba Group, featuring state-of-the-art video generative models. This repository includes implementations of the following methods:

I2VGen-xl: High-quality image-to-video synthesis via cascaded diffusion models


If this repo is useful to you, please cite our corresponding technical paper.

  title={I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models},
  author={Zhang, Shiwei and Wang, Jiayu and Zhang, Yingya and Zhao, Kang and Yuan, Hangjie and Qing, Zhiwu and Wang, Xiang  and Zhao, Deli and Zhou, Jingren},
  booktitle={arXiv preprint arXiv:2311.04145},


We would like to express our gratitude for the contributions of several previous works to the development of VGen. This includes, but is not limited to Composer, ModelScopeT2V, Stable Diffusion, OpenCLIP, WebVid-10M, LAION-400M, Pidinet and MiDaS. We are committed to building upon these foundations in a way that respects their original contributions.


This open-source model is trained with using WebVid-10M and LAION-400M datasets and is intended for RESEARCH/NON-COMMERCIAL USE ONLY.