lightricks/ltx-2-distilled

The first open source audio-video model



LTX-2 Distilled

Generate synchronized video and audio at production quality with Lightricks’s fast distilled model. This is the speed-optimized version of LTX-2 that generates 4K video with synchronized sound in seconds.

What it does

LTX-2 Distilled generates video and audio together in one pass. Give it a text prompt and it creates a video clip up to 10 seconds long complete with synchronized audio including dialogue, ambient sound, and music.

This is the distilled variant of the full LTX-2 model, built for fast iteration and real-time workflows.

Why it’s different

Most video generation models create silent clips, or generate video and audio separately. LTX-2 creates both at the same time with natural timing and synchronization. The audio isn't an afterthought; it's part of the generation process, so lip sync, ambient sounds, and music all match what's happening on screen.

How to use it

Write your prompt like you’re describing a shot to a cinematographer. Be specific about what’s happening, how the camera moves, what the scene looks like, and what sounds you want. Keep it under 200 words and write it as a single flowing paragraph.

Start with the action. Don’t waste time with preamble like “The video shows…” or “We see…”. Jump straight to what’s happening: “A yellow bird lands on a wooden birdhouse and shares a worm with another bird.”

Include details about:

- What's moving and how it moves
- Camera angles and movements
- Lighting and color
- Environment and setting
- Sound and atmosphere

The model works with both text-to-video and image-to-video generation. When you provide a reference image, the model maintains the lighting, composition, and style while adding motion and sound.

What you’ll get

The model generates video at 1080p by default and can render up to 4K, but 1080p gives the most reliable results. Frame rate depends on your content: fast-moving scenes benefit from higher frame rates up to 50 fps, while static shots work fine at 15 fps.

Videos can be up to 10 seconds long. The model handles synchronized audio throughout, maintaining timing between visual events and sound.

Technical details

LTX-2 Distilled is a 19 billion parameter model with an asymmetric dual-stream architecture: 14 billion parameters for video and 5 billion for audio. The two streams communicate through bidirectional cross-attention layers, which is what makes the synchronized audio-video generation possible.

The model uses a diffusion transformer architecture and runs efficiently on consumer hardware. It’s been quantized to FP8, which cuts the model size by about 30 percent and doubles performance without meaningful quality loss.

Width and height must be divisible by 32. Frame count must be a multiple of 8 plus 1 (for example, 97 or 121). If your dimensions don't match, pad with -1 and crop to the target size.
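A small helper can snap requested dimensions to these constraints before calling the model. The function below is an illustrative sketch, not part of the LTX-2 codebase: it rounds width and height down to multiples of 32 and the frame count down to the nearest value of the form 8k + 1.

```python
def snap_dimensions(width: int, height: int, frames: int) -> tuple[int, int, int]:
    """Round dims to model-valid values: W, H divisible by 32; frames = 8k + 1."""
    w = (width // 32) * 32
    h = (height // 32) * 32
    # Largest frame count <= frames that satisfies frames = 8k + 1
    f = ((frames - 1) // 8) * 8 + 1
    return w, h, f

# 1080 is not divisible by 32, so it snaps down to 1056;
# 121 frames already satisfies 8k + 1 (k = 15) and is kept.
print(snap_dimensions(1920, 1080, 121))  # (1920, 1056, 121)
```

Rounding down rather than up keeps the request within the originally intended size; if you need the exact target dimensions instead, generate at the snapped size and pad/crop as described above.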

Training and customization

You can fine-tune LTX-2 Distilled using LoRA adapters. Training for specific styles, motion patterns, or character likenesses typically takes less than an hour. The model supports control LoRAs for depth, pose, and edge control, plus IC-LoRAs for identity consistency and video-to-video transformations.

The full training code and documentation are available at https://github.com/Lightricks/LTX-2.

Limitations

The model generates plausible content but doesn't provide factual information. It might amplify societal biases present in training data. Prompt following varies with how you write your prompts: clear, specific descriptions work better than vague requests.

Audio quality is highest when generating speech or natural environmental sounds. Abstract audio without speech may be lower quality.

Learn more

Try it yourself on the Replicate Playground at replicate.com/playground.
