jd7h / open-sora-512

Open-Sora: Democratizing Efficient Video Production for All. This is the 16x512x512 video generation variant.

Run time and cost

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 6 minutes. The predict time for this model varies significantly based on the inputs.

Readme

This demo implements the 16x512x512 inference example from the Open-Sora README.

Open-Sora: Democratizing Efficient Video Production for All

We present Open-Sora, an initiative dedicated to efficiently producing high-quality video and making the model, tools, and content accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video production. With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the realm of content creation.

Open-Sora is still at an early stage and under active development.

📰 News

  • [2024.03.18] 🔥 We release Open-Sora 1.0, a fully open-source project for video generation. Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with Colossal-AI acceleration, inference, and more. Our provided checkpoints can produce 2s 512x512 videos with only 3 days of training.
  • [2024.03.04] Open-Sora provides training with a 46% cost reduction.

🔆 New Features/Updates

  • 📍 Open-Sora-v1 released. Model weights are available here. With only 400K video clips and 200 H800 days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos.
  • ✅ Three-stage training from an image diffusion model to a video diffusion model. We provide the weights for each stage.
  • ✅ Support for training acceleration, including an accelerated transformer, faster T5 and VAE, and sequence parallelism. Open-Sora improves training speed by 55% when training on 64x512x512 videos. Details are in acceleration.md.
  • ✅ We provide a data preprocessing pipeline, including downloading, video cutting, and captioning tools. Our data collection plan can be found in datasets.md.
  • ✅ We find that the VQ-VAE from VideoGPT produces low-quality results and thus adopt a better VAE from Stability-AI. We also find that patching in the time dimension degrades quality. See our report for more discussion.
  • ✅ We investigate different architectures including DiT, Latte, and our proposed STDiT. Our STDiT achieves a better trade-off between quality and speed. See our report for more discussions.
  • ✅ Support for CLIP and T5 text conditioning.
  • ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet & UCF101). See command.md for more instructions.
  • ✅ Support inference with official weights from DiT, Latte, and PixArt.
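
The image-as-one-frame-video idea and the choice not to patch along the time axis can be illustrated with a small token-count sketch. It assumes the VAE downsamples each spatial side by 8 and that the transformer uses 2×2 spatial patches with a temporal patch size of 1. These factors and the function name `token_count` are illustrative assumptions, not values stated in this README.

```python
def token_count(frames, height, width, vae_down=8, patch_hw=2, patch_t=1):
    """Number of transformer tokens for a clip of frames x height x width pixels.

    vae_down, patch_hw, and patch_t are assumed values for illustration only.
    """
    spatial_stride = vae_down * patch_hw
    if height % spatial_stride or width % spatial_stride:
        raise ValueError("height and width must be divisible by vae_down * patch_hw")
    if frames % patch_t:
        raise ValueError("frames must be divisible by patch_t")
    tokens_per_frame = (height // spatial_stride) * (width // spatial_stride)
    return (frames // patch_t) * tokens_per_frame


# An image is treated as a video with a single frame.
image_tokens = token_count(1, 256, 256)
video_256_tokens = token_count(16, 256, 256)
video_512_tokens = token_count(16, 512, 512)

print(image_tokens, video_256_tokens, video_512_tokens)
```

Under these assumptions the same code path handles both images and videos, and moving from 256×256 to 512×512 quadruples the sequence length for a fixed frame count.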

Model Weights

| Resolution | Data | #iterations | Batch Size | GPU days (H800) | URL |
|---|---|---|---|---|---|
| 16×256×256 | 366K | 80k | 8×64 | 117 | :link: |
| 16×256×256 | 20K HQ | 24k | 8×64 | 45 | :link: |
| 16×512×512 | 20K HQ | 20k | 2×64 | 35 | :link: |

Our model's weights are partially initialized from PixArt-α. The number of parameters is 724M. More information about training can be found in our report. More about the dataset can be found in dataset.md. HQ means high quality.
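
The figures in the weights table can be cross-checked with a little arithmetic. The sketch below assumes that a batch size written as 8×64 means 8 samples per GPU on 64 GPUs; that reading is an interpretation, not something the table states. It derives the total GPU days, the approximate wall-clock time, and the number of samples processed in each stage.

```python
# Rows copied from the table: (name, iterations, per-GPU batch, GPUs, GPU days).
# Reading "8×64" as per-GPU batch × GPU count is an assumption.
STAGES = [
    ("16x256x256", 80_000, 8, 64, 117),
    ("16x256x256 HQ", 24_000, 8, 64, 45),
    ("16x512x512 HQ", 20_000, 2, 64, 35),
]

total_gpu_days = sum(gpu_days for _, _, _, _, gpu_days in STAGES)
samples_seen = [iters * per_gpu * gpus for _, iters, per_gpu, gpus, _ in STAGES]
# Assumes all stages run on the same 64 GPUs, one after another.
total_wall_clock_days = total_gpu_days / 64

for (name, _, _, gpus, gpu_days), samples in zip(STAGES, samples_seen):
    print(f"{name}: {samples:,} samples, ~{gpu_days / gpus:.2f} days on {gpus} GPUs")

print(f"total: {total_gpu_days} GPU days, ~{total_wall_clock_days:.1f} wall-clock days")
```

Under that reading, the three stages sum to 197 GPU days, in line with the roughly 200 H800 days quoted above, and to about 3 days of wall-clock time on 64 GPUs.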

LIMITATION: Our model is trained on a limited budget. The quality and text alignment are relatively poor. The model performs especially poorly when generating human beings and cannot follow detailed instructions. We are working on improving the quality and text alignment.

Acknowledgement

  • DiT: Scalable Diffusion Models with Transformers.
  • OpenDiT: An acceleration for DiT training. We adopt valuable acceleration strategies for the training process from OpenDiT.
  • PixArt: An open-source DiT-based text-to-image model.
  • Latte: An attempt to efficiently train DiT for video.
  • StabilityAI VAE: A powerful image VAE model.
  • CLIP: A powerful text-image embedding model.
  • T5: A powerful text encoder.
  • LLaVA: A powerful image captioning model based on Yi-34B.

We are grateful for their exceptional work and generous contribution to open source.

Citation

@software{opensora,
  author = {Zangwei Zheng and Xiangyu Peng and Yang You},
  title = {Open-Sora: Democratizing Efficient Video Production for All},
  month = {March},
  year = {2024},
  url = {https://github.com/hpcaitech/Open-Sora}
}

Zangwei Zheng and Xiangyu Peng contributed equally to this work during their internship at HPC-AI Tech.