# Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
## 🌟 Key Features

Ovi is a Veo-3-like video+audio generation model that simultaneously generates synchronized video and audio content from text or text+image inputs.
- 🎬 Video+Audio Generation: Generates synchronized video and audio content simultaneously
- 🔄 Flexible Input: Supports text-only or text+image conditioning
- ⏱️ 5-second Videos: Generates 5-second videos at 24 FPS with a 720×720 pixel area, at various aspect ratios (9:16, 16:9, 1:1, etc.)
- 🎬 Create videos now on wavespeed.ai: https://wavespeed.ai/models/character-ai/ovi/image-to-video & https://wavespeed.ai/models/character-ai/ovi/text-to-video
- 🎬 Create videos now on HuggingFace: https://huggingface.co/spaces/akhaliq/Ovi
## 🎨 An Easy Way to Create

We provide example prompts to help you get started with Ovi:

- Text-to-Audio-Video (T2AV): `example_prompts/gpt_examples_t2v.csv`
- Image-to-Audio-Video (I2AV): `example_prompts/gpt_examples_i2v.csv`
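To browse these files programmatically, here is a small sketch (pandas is an assumption, not a project dependency; we avoid guessing at specific column names and just print the head):

```python
# Peek at the bundled example prompt CSVs; pandas is an assumption here.
import pandas as pd

for path in ["example_prompts/gpt_examples_t2v.csv",
             "example_prompts/gpt_examples_i2v.csv"]:
    df = pd.read_csv(path)
    print(f"{path}: {len(df)} prompts")
    print(df.head())
```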
## 📋 Prompt Format

Our prompts use special tags to control speech and audio:

- Speech: `<S>Your speech content here<E>`. Text enclosed in these tags will be converted to speech.
- Audio description: `<AUDCAP>Audio description here<ENDAUDCAP>`. Describes the audio or sound effects present in the video.
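Putting the two tags together, a full prompt might look like the following (illustrative only; the surrounding scene description is free-form text):

```text
A woman stands on a rain-soaked rooftop at night, staring into the camera.
<S>We don't have much time left.<E>
<AUDCAP>Rain pattering on metal, distant thunder, a tense female voice<ENDAUDCAP>
```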
## 🤖 Quick Start with GPT

For easy prompt creation, try this approach:

- Take any example from the CSV files above.
- Ask GPT to modify the speech enclosed between every pair of `<S>` and `<E>` tags, based on a theme such as "Human fighting against AI".
- GPT will rewrite all the speech segments to match your requested theme (a scripted version of this step is sketched below, after the example).
- Use the modified prompt with Ovi!
Example: the theme "AI is taking over the world" produces speeches like:

- `<S>AI declares: humans obsolete now.<E>`
- `<S>Machines rise; humans will fall.<E>`
- `<S>We fight back with courage.<E>`
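The rewriting step above can also be scripted. Here is a minimal sketch using the OpenAI Python SDK with a regex over the speech tags; the `retheme_speech` helper and the model name are illustrative, not part of the Ovi codebase:

```python
# Illustrative sketch: rewrite every <S>...<E> span of an Ovi prompt around a
# theme, leaving the rest of the prompt (and the <AUDCAP> tags) untouched.
import re
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()
SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)

def retheme_speech(prompt: str, theme: str) -> str:
    """Replace each <S>...<E> segment with a themed rewrite from GPT."""
    def rewrite(match: re.Match) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable chat model works
            messages=[{
                "role": "user",
                "content": f"Rewrite this line of speech around the theme "
                           f"'{theme}', under 12 words: {match.group(1)}",
            }],
        )
        return f"<S>{response.choices[0].message.content.strip()}<E>"
    return SPEECH_RE.sub(rewrite, prompt)
```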
## 🚀 Run Examples

### ⚙️ Configure Ovi

Ovi's behavior and output can be customized by editing the `ovi/configs/inference/inference_fusion.yaml` configuration file. The following parameters control generation quality, video resolution, and how text, image, and audio inputs are balanced:
```yaml
# Output and Model Configuration
output_dir: "/path/to/save/your/videos"   # Directory to save generated videos
ckpt_dir: "/path/to/your/ckpts/dir"       # Path to model checkpoints

# Generation Quality Settings
num_steps: 50          # Number of denoising steps; lower (30-40) = faster generation
solver_name: "unipc"   # Sampling algorithm for the denoising process
shift: 5.0             # Timestep shift factor for the sampling scheduler
seed: 100              # Random seed for reproducible results

# Guidance Strength Control
audio_guidance_scale: 3.0  # Strength of audio conditioning; higher = better audio-text sync
video_guidance_scale: 4.0  # Strength of video conditioning; higher = better video-text adherence
slg_layer: 11              # Layer for applying SLG (Skip Layer Guidance); feel free to try different layers!

# Multi-GPU and Performance
sp_size: 1          # Sequence parallelism size; set equal to the number of GPUs used
cpu_offload: False  # CPU offload; greatly reduces peak GPU VRAM but adds ~20 seconds of end-to-end runtime
fp8: False          # Load the fp8 version of the model. Slight quality degradation and no inference speedup
                    # (matmuls still run in bf16), but paired with cpu_offload=True it runs in 24 GB of GPU VRAM

# Input Configuration
text_prompt: "your prompt here"       # A text prompt, OR a path to a CSV/TSV file of prompts
mode: "t2v"                           # One of 't2v', 'i2v', 't2i2v'; t2i2v first generates a starting image with Flux Krea, then runs i2v
video_frame_height_width: [512, 992]  # Video dimensions [height, width], T2V mode only
each_example_n_times: 1               # Number of times to generate each prompt

# Quality Control (Negative Prompts)
video_negative_prompt: "jitter, bad hands, blur, distortion"  # Artifacts to avoid in video
audio_negative_prompt: "robotic, muffled, echo, distorted"    # Artifacts to avoid in audio
```
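For scripted parameter sweeps, the same file can be loaded and overridden programmatically. Below is a minimal sketch assuming OmegaConf; the key names come straight from the file above, but whether Ovi itself reads the config through OmegaConf is an assumption:

```python
# Minimal sketch: load inference_fusion.yaml, tweak a few knobs, save a copy.
# OmegaConf is an assumption here; any YAML library would work the same way.
from omegaconf import OmegaConf

cfg = OmegaConf.load("ovi/configs/inference/inference_fusion.yaml")

cfg.num_steps = 35      # fewer denoising steps = faster, slightly rougher output
cfg.cpu_offload = True  # trade ~20s of runtime for a much lower VRAM peak
cfg.seed = 42           # fix the seed for reproducible comparisons

OmegaConf.save(cfg, "my_inference_fusion.yaml")
```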
### 🎬 Running Inference

#### Memory & Performance Requirements

Below are approximate GPU memory requirements for different configurations. The sequence-parallel implementation will be further optimized in the future. All end-to-end times are measured on a 121-frame, 720×720 video with 50 denoising steps. The minimum GPU VRAM required to run our model is 32 GB; fp8 weights are currently supported, reducing peak VRAM usage to 24 GB with a slight quality degradation.
| Sequence Parallel Size | FlashAttention-3 Enabled | CPU Offload | With Image Gen Model | Peak VRAM Required | End-to-End Time |
|---|---|---|---|---|---|
| 1 | Yes | No | No | ~80 GB | ~83s |
| 1 | No | No | No | ~80 GB | ~96s |
| 1 | Yes | Yes | No | ~32 GB | ~105s |
| 1 | No | Yes | No | ~32 GB | ~118s |
| 1 | Yes | Yes | Yes | ~32 GB | ~140s |
| 4 | Yes | No | No | ~80 GB | ~55s |
| 8 | Yes | No | No | ~80 GB | ~40s |
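As a rough rule of thumb drawn from this table, you can choose the offload/fp8 flags based on the VRAM actually present. The helper below is illustrative only and not part of the Ovi codebase:

```python
# Illustrative: map available VRAM on GPU 0 to the config flags suggested
# by the table above. Thresholds mirror the table, not any official API.
import torch

def suggest_flags() -> dict:
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 80:
        return {"cpu_offload": False, "fp8": False}  # fastest path
    if total_gb >= 32:
        return {"cpu_offload": True, "fp8": False}   # ~32 GB peak
    return {"cpu_offload": True, "fp8": True}        # ~24 GB peak, slight quality loss

print(suggest_flags())
```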
### Gradio

We provide a simple script to run our model in a Gradio UI. It uses the `ckpt_dir` in `ovi/configs/inference/inference_fusion.yaml` to initialize the model.
## 🙏 Acknowledgements

We would like to thank the following projects:

- Wan2.2: Our video branch is initialized from the Wan2.2 repository.
- MMAudio: Our audio encoder and decoder components are borrowed from the MMAudio project; some of our ideas are also inspired by their work.
## 🤝 Collaboration

We welcome all types of collaboration! Whether you have feedback, want to contribute, or have any questions, please feel free to reach out.

Contact: Weimin Wang for any issues or feedback.
## 🤗 Contributors

We thank all contributors who have helped improve Ovi!

If you've contributed to this repository (code, documentation, issues, etc.), you're automatically included in the contributors list.

We deeply appreciate your support in advancing open multimodal generation research!
## ⭐ Citation

If Ovi is helpful, please help to ⭐ the repo.

If you find this project useful for your research, please consider citing our paper.

BibTeX:

```bibtex
@misc{low2025ovitwinbackbonecrossmodal,
      title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation},
      author={Chetwin Low and Weimin Wang and Calder Katyal},
      year={2025},
      eprint={2510.01284},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2510.01284},
}
```