Kling v2.6
Generate cinematic videos with synchronized audio from text prompts or images. This model creates video and sound together in a single pass—dialogue, ambient effects, and motion all aligned without separate audio production.
What it does
Kling v2.6 transforms text descriptions or static images into short video clips with native audio. The model generates speech, sound effects, and ambient audio that match the visuals frame-by-frame, so you get lip-synced dialogue and scene-appropriate sound without manual editing.
You can create videos up to 10 seconds long at 1080p resolution in multiple aspect ratios. The model handles both realistic and stylized content, though it’s strongest with photorealistic scenes.
How to use it
The model works with two input types:
Text to video: Describe what you want to see and hear. The model generates both visuals and audio from your description.
Image to video: Upload a still image and add a text prompt describing the motion and audio you want. The model animates your image with synchronized sound.
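The two input modes can be sketched as small payload builders. This is a minimal sketch, not official client code: the field names (`prompt`, `image`, `duration`) are assumptions based on typical Replicate model inputs, so check the model's API schema for the real ones. The resulting dict is what you would pass as `input=` to the Replicate client.

```python
# Hypothetical payload builders for the two input modes. Field names
# ("prompt", "image", "duration") are assumptions; verify them against
# the model's schema on Replicate before use.

def text_to_video_input(prompt: str, duration: int = 5) -> dict:
    """Text to video: the model generates visuals and audio from the prompt."""
    return {"prompt": prompt, "duration": duration}

def image_to_video_input(image_url: str, prompt: str, duration: int = 5) -> dict:
    """Image to video: a still image plus a prompt describing motion and audio."""
    return {"image": image_url, "prompt": prompt, "duration": duration}

t2v = text_to_video_input("A cat plays piano in a sunlit room, soft jazz audio")
i2v = image_to_video_input(
    "https://example.com/still.png",
    "The camera slowly zooms in; wind rustles the trees",
    duration=10,
)
print(t2v)
print(i2v)
```

Either dict would then be submitted as the `input` of a prediction (for example via `replicate.run(...)` in the Python client); the call itself is omitted here since it needs an API token.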
Writing effective prompts
Good prompts guide both the visual content and the audio. Structure your description to include:
- Scene setting: Where and when the action happens, lighting conditions
- Subject details: What characters or objects appear, how they look
- Motion: What happens, how things move, camera behavior
- Audio: Dialogue with quotation marks, ambient sounds, sound effects
Example: A woman walks down a rain-slicked neon street at night, camera slowly tracking behind her. She stops and turns to face the camera, saying "Let's begin." Ambient sound of rain on pavement, distant traffic, soft footsteps.
For dialogue, put the spoken text in quotes and the model will generate matching lip sync. You can specify voice characteristics like “warm female voice” or “confident male narrator.”
Describe ambient sounds and effects explicitly: “coffee shop chatter, espresso machine hissing, rain on windows” gives better results than just “background noise.”
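The prompt structure above (scene, subject, motion, quoted dialogue, explicit ambience) can be assembled programmatically. The helper below is a hypothetical convenience function, not part of any SDK; it just concatenates the pieces in the recommended order, putting dialogue in quotes so the model treats it as speech to lip-sync.

```python
# Hypothetical prompt builder following the structure recommended above:
# scene setting, subject details, motion, quoted dialogue, ambient sound.

def build_prompt(scene: str, subject: str, motion: str,
                 dialogue: str = "", ambience: str = "") -> str:
    segments = [scene, subject, motion]
    if dialogue:
        # Quotation marks mark the text as spoken dialogue to lip-sync.
        segments.append(f'speaking the line "{dialogue}"')
    if ambience:
        # Name the ambient sounds explicitly rather than "background noise".
        segments.append(f"Ambient sound: {ambience}")
    return ". ".join(s for s in segments if s) + "."

print(build_prompt(
    scene="A rain-slicked neon street at night",
    subject="a woman in a long coat",
    motion="walks toward the camera in a slow tracking shot",
    dialogue="Let's begin",
    ambience="rain on pavement, distant traffic, soft footsteps",
))
```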
Parameters
- Duration: Choose 5 or 10 seconds per generation
- Aspect ratio: 16:9 (horizontal), 9:16 (vertical), or 1:1 (square)
- Audio: Toggle native audio on or off
- Negative prompt: Specify what to exclude from the generation
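The accepted values above can be checked client-side so a bad combination fails fast before you spend a generation. The allowed sets mirror the parameter list in this document; the function and field names are assumptions for illustration.

```python
# Client-side validation of the parameters listed above. The accepted
# values come from this README; field names are hypothetical.

VALID_DURATIONS = {5, 10}                      # seconds per generation
VALID_ASPECT_RATIOS = {"16:9", "9:16", "1:1"}  # horizontal, vertical, square

def validate_params(duration: int, aspect_ratio: str,
                    audio: bool = True, negative_prompt: str = "") -> dict:
    """Raise ValueError on out-of-range values, else return the payload."""
    if duration not in VALID_DURATIONS:
        raise ValueError(f"duration must be one of {sorted(VALID_DURATIONS)}")
    if aspect_ratio not in VALID_ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(VALID_ASPECT_RATIOS)}")
    return {
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "audio": audio,
        "negative_prompt": negative_prompt,
    }

print(validate_params(10, "9:16", audio=True, negative_prompt="blurry, text overlays"))
```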
What it’s good for
The model works well for:
- Marketing videos with voiceover narration
- Social media content with dialogue
- Product demonstrations with sound effects
- Character animations with speech
- Cinematic sequences with ambient audio
The native audio makes it particularly useful when you need speech synchronized with character mouth movements, or when ambient sound needs to match on-screen action.
Limitations
- Maximum 10 seconds per generation
- Audio works best in English and Chinese
- Character consistency can vary across multiple generations
- Complex physics interactions may not look fully natural
- Text overlays in the output can be distorted
For projects longer than 10 seconds, you’ll need to generate multiple clips and edit them together.
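One common way to join the clips is FFmpeg's concat demuxer: list the files in a text file, then run `ffmpeg -f concat -safe 0 -i list.txt -c copy out.mp4`. The sketch below only generates that list file's contents; it does not invoke FFmpeg, and it assumes all clips share the same codec and resolution (true if they all come from the same model settings).

```python
# Build the contents of an FFmpeg concat demuxer list file for stitching
# several generated clips. FFmpeg itself is not invoked here.

def concat_list(clip_paths: list) -> str:
    """Return concat-demuxer file contents: one "file 'path'" line per clip."""
    return "\n".join(f"file '{p}'" for p in clip_paths) + "\n"

print(concat_list(["clip1.mp4", "clip2.mp4", "clip3.mp4"]))
```

Write the returned string to e.g. `list.txt`, then the single `ffmpeg` command above produces one continuous video without re-encoding.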
Technical details
The model outputs 1080p video with embedded audio at standard frame rates. It supports vertical video formats for platforms like TikTok and Instagram Reels.
Audio generation includes multiple layers: dialogue or narration, ambient environmental sound, and specific sound effects. These layers are mixed together in the output.
You can try this model on the Replicate Playground at replicate.com/playground