Kling v2.6
Generate cinematic videos with synchronized audio from text prompts or images. This model creates video and sound together in a single pass—dialogue, ambient effects, and motion all aligned without separate audio production.
What it does
Kling v2.6 transforms text descriptions or static images into short video clips with native audio. The model generates speech, sound effects, and ambient audio that match the visuals frame-by-frame, so you get lip-synced dialogue and scene-appropriate sound without manual editing.
You can create videos up to 10 seconds long at 1080p resolution in multiple aspect ratios. The model handles both realistic and stylized content, though it’s strongest with photorealistic scenes.
How to use it
The model works with two input types:
Text to video: Describe what you want to see and hear. The model generates both visuals and audio from your description.
Image to video: Upload a still image and add a text prompt describing the motion and audio you want. The model animates your image with synchronized sound.
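The two input modes can be sketched as small payload builders. This is a minimal sketch, not official client code: the field names (`prompt`, `image`, `duration`) are assumptions based on typical Replicate model inputs, so check the model's API schema for the real ones. The resulting dict is what you would pass as `input=` to the Replicate client.

```python
# Hypothetical payload builders for the two input modes. Field names
# ("prompt", "image", "duration") are assumptions; verify them against
# the model's schema on Replicate before use.

def text_to_video_input(prompt: str, duration: int = 5) -> dict:
    """Text to video: the model generates visuals and audio from the prompt."""
    return {"prompt": prompt, "duration": duration}

def image_to_video_input(image_url: str, prompt: str, duration: int = 5) -> dict:
    """Image to video: a still image plus a prompt describing motion and audio."""
    return {"image": image_url, "prompt": prompt, "duration": duration}

t2v = text_to_video_input("A cat plays piano in a sunlit room, soft jazz audio")
i2v = image_to_video_input(
    "https://example.com/still.png",
    "The camera slowly zooms in; wind rustles the trees",
    duration=10,
)
print(t2v)
print(i2v)
```

Either dict would then be submitted as the `input` of a prediction (for example via `replicate.run(...)` in the Python client); the call itself is omitted here since it needs an API token.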
Writing effective prompts
Good prompts guide both the visual content and the audio. Structure your description to include:
- Scene setting: Where and when the action happens, lighting conditions
- Subject details: What characters or objects appear, how they look
- Motion: What happens, how things move, camera behavior
- Audio: Dialogue with quotation marks, ambient sounds, sound effects
Example: A woman walks down a rain-slicked neon street at night, camera slowly tracking behind her. She stops and turns to face the camera, saying "Let's begin." Ambient sound of rain on pavement, distant traffic, soft footsteps.
For dialogue, put the spoken text in quotes and the model will generate matching lip sync. You can specify voice characteristics like “warm female voice” or “confident male narrator.”
Describe ambient sounds and effects explicitly: “coffee shop chatter, espresso machine hissing, rain on windows” gives better results than just “background noise.”
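The prompt structure above (scene, subject, motion, quoted dialogue, explicit ambience) can be assembled programmatically. The helper below is a hypothetical convenience function, not part of any SDK; it just concatenates the pieces in the recommended order, putting dialogue in quotes so the model treats it as speech to lip-sync.

```python
# Hypothetical prompt builder following the structure recommended above:
# scene setting, subject details, motion, quoted dialogue, ambient sound.

def build_prompt(scene: str, subject: str, motion: str,
                 dialogue: str = "", ambience: str = "") -> str:
    segments = [scene, subject, motion]
    if dialogue:
        # Quotation marks mark the text as spoken dialogue to lip-sync.
        segments.append(f'speaking the line "{dialogue}"')
    if ambience:
        # Name the ambient sounds explicitly rather than "background noise".
        segments.append(f"Ambient sound: {ambience}")
    return ". ".join(s for s in segments if s) + "."

print(build_prompt(
    scene="A rain-slicked neon street at night",
    subject="a woman in a long coat",
    motion="walks toward the camera in a slow tracking shot",
    dialogue="Let's begin",
    ambience="rain on pavement, distant traffic, soft footsteps",
))
```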
Parameters
- Duration: Choose 5 or 10 seconds per generation
- Aspect ratio: 16:9 (horizontal), 9:16 (vertical), or 1:1 (square)
- Audio: Toggle native audio on or off
- Negative prompt: Specify what to exclude from the generation
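The accepted values above can be checked client-side so a bad combination fails fast before you spend a generation. The allowed sets mirror the parameter list in this document; the function and field names are assumptions for illustration.

```python
# Client-side validation of the parameters listed above. The accepted
# values come from this README; field names are hypothetical.

VALID_DURATIONS = {5, 10}                      # seconds per generation
VALID_ASPECT_RATIOS = {"16:9", "9:16", "1:1"}  # horizontal, vertical, square

def validate_params(duration: int, aspect_ratio: str,
                    audio: bool = True, negative_prompt: str = "") -> dict:
    """Raise ValueError on out-of-range values, else return the payload."""
    if duration not in VALID_DURATIONS:
        raise ValueError(f"duration must be one of {sorted(VALID_DURATIONS)}")
    if aspect_ratio not in VALID_ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(VALID_ASPECT_RATIOS)}")
    return {
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "audio": audio,
        "negative_prompt": negative_prompt,
    }

print(validate_params(10, "9:16", audio=True, negative_prompt="blurry, text overlays"))
```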
What it’s good for
The model works well for:
- Marketing videos with voiceover narration
- Social media content with dialogue
- Product demonstrations with sound effects
- Character animations with speech
- Cinematic sequences with ambient audio
The native audio makes it particularly useful when you need speech synchronized with character mouth movements, or when ambient sound needs to match on-screen action.
Limitations
- Maximum 10 seconds per generation
- Audio works best in English and Chinese
- Character consistency can vary across multiple generations
- Complex physics interactions may not look fully natural
- Text overlays in the output can be distorted
For projects longer than 10 seconds, you’ll need to generate multiple clips and edit them together.
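One common way to join the clips is FFmpeg's concat demuxer: list the files in a text file, then run `ffmpeg -f concat -safe 0 -i list.txt -c copy out.mp4`. The sketch below only generates that list file's contents; it does not invoke FFmpeg, and it assumes all clips share the same codec and resolution (true if they all come from the same model settings).

```python
# Build the contents of an FFmpeg concat demuxer list file for stitching
# several generated clips. FFmpeg itself is not invoked here.

def concat_list(clip_paths: list) -> str:
    """Return concat-demuxer file contents: one "file 'path'" line per clip."""
    return "\n".join(f"file '{p}'" for p in clip_paths) + "\n"

print(concat_list(["clip1.mp4", "clip2.mp4", "clip3.mp4"]))
```

Write the returned string to e.g. `list.txt`, then the single `ffmpeg` command above produces one continuous video without re-encoding.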
Technical details
The model outputs 1080p video with embedded audio at standard frame rates. It supports vertical video formats for platforms like TikTok and Instagram Reels.
Audio generation includes multiple layers: dialogue or narration, ambient environmental sound, and specific sound effects. These layers are mixed together in the output.
You can try this model on the Replicate Playground at replicate.com/playground