pollinations / tune-a-video

About Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation




Run time and cost

This model runs on NVIDIA A100 (40GB) GPU hardware. Predictions typically complete within 9 minutes, but prediction time varies significantly with the inputs.



This repository is the official implementation of Tune-A-Video.

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou




pip install -r requirements.txt

Installing xformers is highly recommended for better memory efficiency and speed on GPUs. To enable it, set enable_xformers_memory_efficient_attention=True (the default).
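Because xformers is an optional dependency, a script may want to confirm that it is importable before turning the flag on. The following is a minimal sketch and not part of this repository: the helper names xformers_available and resolve_xformers_flag are hypothetical, and only the Python standard library is used.

```python
import importlib.util


def xformers_available() -> bool:
    """Return True if the optional xformers package can be found."""
    return importlib.util.find_spec("xformers") is not None


def resolve_xformers_flag(requested: bool = True) -> bool:
    """Honor the requested setting only when xformers is actually installed."""
    return requested and xformers_available()


if __name__ == "__main__":
    # The resolved value could be used for enable_xformers_memory_efficient_attention.
    print("use xformers:", resolve_xformers_flag(True))
```

With this check, a missing xformers install degrades to the regular attention path instead of being requested anyway.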


You can download the pre-trained Stable Diffusion models (e.g., Stable Diffusion v1-4):

git lfs install
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4

Alternatively, you can use a personalized DreamBooth model (e.g., mr-potato-head):

git lfs install
git clone https://huggingface.co/sd-dreambooth-library/mr-potato-head


To fine-tune the text-to-image diffusion models for text-to-video generation, run this command:

accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"
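To fine-tune on several videos, the same command can be generated once per config file. The sketch below only assembles and prints the commands; it assumes the configs live in a configs directory as in the command above, and the helper names build_train_command and commands_for_configs are hypothetical.

```python
import shlex
from pathlib import Path


def build_train_command(config_path: str) -> list[str]:
    """Build the argument list for one fine-tuning run."""
    return [
        "accelerate",
        "launch",
        "train_tuneavideo.py",
        f"--config={config_path}",
    ]


def commands_for_configs(config_dir: str) -> list[list[str]]:
    """Return one command per YAML file in config_dir, in sorted order."""
    paths = sorted(Path(config_dir).glob("*.yaml"))
    return [build_train_command(p.as_posix()) for p in paths]


if __name__ == "__main__":
    for cmd in commands_for_configs("configs"):
        # Printed rather than executed; subprocess.run(cmd, check=True) would launch it.
        print(shlex.join(cmd))
```

Keeping each command as a list avoids shell-quoting problems if a config path contains spaces.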


Once the training is done, run inference:

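import torch

from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.util import save_videos_grid

# Path to the fine-tuned model; the UNet weights are read from its "unet" subfolder.
model_id = "path-to-your-trained-model"

# Load the fine-tuned 3D UNet in half precision and move it to the GPU.
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float16).to('cuda')
# Build the pipeline from the base Stable Diffusion weights, swapping in the fine-tuned UNet.
pipe = TuneAVideoPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16).to("cuda")

prompt = "a panda is surfing"
# Generate an 8-frame clip at 512x512 resolution.
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

# Save the generated frames as a GIF named after the prompt.
save_videos_grid(video, f"{prompt}.gif")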
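The snippet above uses the raw prompt as the output file name, which can be awkward for prompts that contain commas, periods, or spaces. A minimal sketch of one way to derive a safer name follows; prompt_to_filename is a hypothetical helper and not part of tuneavideo.

```python
import re


def prompt_to_filename(prompt: str, ext: str = "gif") -> str:
    """Turn a free-form prompt into a lowercase, hyphen-separated file name."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    # Fall back to a generic name if the prompt had no usable characters.
    return f"{slug or 'video'}.{ext}"


if __name__ == "__main__":
    for text in ["a panda is surfing", "a raccoon is surfing, cartoon style."]:
        print(prompt_to_filename(text))
```

The result could then be passed as the second argument, for example save_videos_grid(video, prompt_to_filename(prompt)).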


Fine-tuning on Stable Diffusion

  • [Training] a man is surfing.
  • a panda is surfing.
  • Iron Man is surfing in the desert.
  • a raccoon is surfing, cartoon style.

Fine-tuning on DreamBooth

  • sks mr potato head.
  • sks mr potato head, wearing a pink hat, is surfing.
  • sks mr potato head, wearing sunglasses, is surfing.
  • sks mr potato head is surfing in the forest.


@article{wu2022tuneavideo,
    title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2212.11565},
    year={2022}
}