Audio to video
Generate videos from audio with Lightricks’s audio-to-video model. This model uses your audio as the primary creative control, letting sound shape timing, motion, and performance from the first frame.
How it works
Audio-to-video generation starts with sound instead of treating it as an afterthought. When you provide an audio file—whether it’s dialogue, music, or sound effects—the model uses that audio to drive the video generation. Speech cadence determines pacing. Musical energy influences motion and camera behavior. Scene changes happen where the sound demands them.
This is part of Lightricks’s LTX-2 system, which generates synchronized audio and video in a single pass. The model uses an asymmetric dual-stream transformer architecture with 19 billion parameters, processing both modalities together to create naturally aligned output.
What you can create
The model generates high-definition video clips (up to 4K) whose length and motion are driven by your audio input. You can use:
- Voice recordings and dialogue to create character performances with natural lip sync
- Music tracks to generate videos that match the energy and rhythm of the sound
- Sound effects and ambient audio to build atmospheric scenes
- Any audio file as the starting point for video generation
You can also provide an optional reference image to anchor a character or scene, and add a short text prompt to guide visual style. But the audio stays in control.
Inputs
- audio: Your audio file (required). This is the primary input and drives the generation.
- image: An optional reference image to anchor a character, scene, or visual starting point.
- prompt: An optional text description to guide visual style and content. Keep it short and let the audio do the heavy lifting.
- duration: Video length in seconds (6, 8, or 10). The model matches the video to your audio timing.
- resolution: Output resolution: 720p, 1080p (Full HD), 1440p (QHD), or 2160p (4K/UHD).
- seed: Random seed for reproducible generations. Use the same seed with the same inputs to get consistent results.
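If you call the model through the Replicate Python client, these inputs map directly onto the `input` dictionary. The sketch below is illustrative only: the model slug and file names are placeholders, and the exact output type may differ, so check the model page and its API tab for the real identifiers.

```python
# Illustrative sketch of calling an audio-to-video model with the Replicate
# Python client. The model slug and file paths are placeholders.
import replicate

output = replicate.run(
    "lightricks/ltx-2-audio-to-video",  # placeholder slug; use the one shown on the model page
    input={
        "audio": open("performance.wav", "rb"),   # required: drives timing and motion
        "image": open("reference.jpg", "rb"),     # optional: anchors a character or scene
        "prompt": "cinematic close-up, soft window light",  # optional: visual style only
        "duration": 8,          # 6, 8, or 10 seconds
        "resolution": "1080p",  # 720p, 1080p, 1440p, or 2160p
        "seed": 42,             # same seed + same inputs = reproducible results
    },
)
print(output)  # typically a URL or file object pointing to the generated video
```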
Technical details
The model processes audio and video jointly through a diffusion transformer architecture. It uses separate but connected streams for each modality, with bidirectional cross-attention that lets audio information influence video generation and vice versa. This produces synchronized output where visual motion, camera movement, and scene changes align naturally with the audio’s timing and energy.
Video generation happens at up to 50 frames per second, with support for resolutions up to 4K. The model can generate clips from 6 to 10 seconds in length.
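To make the dual-stream idea concrete, here is a minimal schematic of a block with bidirectional cross-attention. This is not the LTX-2 implementation: the real model is asymmetric (the two streams have different capacities), and normalization, feed-forward layers, and diffusion conditioning are omitted here.

```python
# Schematic sketch of a dual-stream block with bidirectional cross-attention.
# Layer sizes and structure are simplified assumptions, not the actual architecture.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Bidirectional cross-attention: each stream attends to the other.
        self.video_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # Self-attention within each modality.
        video = video + self.video_self(video, video, video)[0]
        audio = audio + self.audio_self(audio, audio, audio)[0]
        # Cross-attention: video queries audio tokens, and audio queries video tokens,
        # so timing information can flow in both directions.
        video = video + self.video_from_audio(video, audio, audio)[0]
        audio = audio + self.audio_from_video(audio, video, video)[0]
        return video, audio

# Example: 240 video tokens and 80 audio tokens in a shared 512-dim space.
v, a = torch.randn(1, 240, 512), torch.randn(1, 80, 512)
v_out, a_out = DualStreamBlock()(v, a)
```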
Prompting tips
When writing prompts for audio-to-video generation, keep them minimal. The audio is already doing most of the work. Use your text prompt to specify:
- Visual style (cinematic, animated, photorealistic)
- Setting or environment (forest, city street, living room)
- Camera angle or framing (wide shot, close-up, aerial view)
- Lighting or mood (golden hour, dramatic, soft light)
Avoid describing actions or timing in detail—the audio determines that. Think of the prompt as setting the stage while the audio directs the performance.
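For example, a prompt paired with a recorded monologue might be as short as this (purely illustrative; the style, setting, framing, and lighting come from the list above, and everything else is left to the audio):

```
Photorealistic close-up of a woman at a kitchen table, static camera,
soft morning light.
```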
For longer sequences, you can chain multiple clips together, building full videos modularly while keeping audio in control throughout.
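One way to sketch that chaining pattern, assuming the audio has already been split into 6–10 second files and reusing the placeholder slug from the inputs example; holding the reference image and seed fixed across clips is one way to keep the look consistent, not an official recipe:

```python
# Sketch: one generation call per audio segment, then join the clips afterwards.
# The model slug and file names are placeholders.
import replicate

segments = ["part_01.wav", "part_02.wav", "part_03.wav"]  # pre-split audio
clips = []
for path in segments:
    clips.append(
        replicate.run(
            "lightricks/ltx-2-audio-to-video",  # placeholder slug
            input={
                "audio": open(path, "rb"),
                "image": open("reference.jpg", "rb"),  # same reference for continuity
                "prompt": "cinematic, warm interior lighting",
                "duration": 8,
                "resolution": "1080p",
                "seed": 42,  # fixed seed to keep the look consistent across clips
            },
        )
    )

# Concatenate the resulting clips with a tool of your choice, e.g. ffmpeg's concat demuxer.
```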
Limitations
Quality can vary by language. Speech synthesis may be less precise for underrepresented languages or dialects. The model works best with clear audio that has distinct timing cues.
When generating video without speech, audio quality may be lower than when including dialogue. The model is optimized for audio that contains clear performance cues like speech cadence or musical rhythm.
Prompt following depends on prompting style. The clearer and more literal your text prompt, the better the results. But remember that audio is the primary control—don’t expect the text prompt to override what the audio is telling the model to do.
Learn more
For detailed information about the LTX-2 model family and audio-to-video generation, see Lightricks’s documentation.
You can try this model on the Replicate playground at replicate.com/playground.