Seedance 2.0
Generate high-quality video from text, images, video clips, and audio — all in one pass with synchronized sound. Seedance 2.0 is ByteDance’s next-generation video model, built on a unified multimodal architecture that accepts mixed inputs and produces coherent, audio-synced output.
What’s new in 2.0
Seedance 2.0 is a significant upgrade over Seedance 1.5 Pro:
- Multimodal reference inputs — combine up to 9 images, 3 video clips, and 3 audio files in a single generation. Reference them in your prompt as [Image1], [Video1], [Audio1], etc.
- Better motion and physics — more realistic rendering of complex interactions like sports, dancing, and object collisions.
- Video editing and extension — modify existing videos or extend them by providing a reference video and describing what should happen next.
- Intelligent duration — set duration to -1 and let the model pick the best length for the content.
- Adaptive aspect ratio — set aspect ratio to “adaptive” and the model will choose the best fit based on your inputs.
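The new options above can be sketched as a request payload. This is illustrative only: the field names (`images`, `videos`, `audios`, `duration`, `aspect_ratio`) are assumptions, not confirmed API parameters, but the reference limits and the special `-1` / `"adaptive"` values are from the description above.

```python
# Sketch of a Seedance 2.0 request payload. Field names below are
# illustrative assumptions, not confirmed API parameters; the reference
# limits (9 images, 3 videos, 3 audio files) are from the docs.

def build_input(prompt, images=(), videos=(), audios=(),
                duration=-1, aspect_ratio="adaptive"):
    """Assemble a generation request, enforcing the 2.0 reference limits."""
    if len(images) > 9:
        raise ValueError("at most 9 reference images")
    if len(videos) > 3:
        raise ValueError("at most 3 reference video clips")
    if len(audios) > 3:
        raise ValueError("at most 3 reference audio files")
    return {
        "prompt": prompt,
        "images": list(images),
        "videos": list(videos),
        "audios": list(audios),
        "duration": duration,          # -1 lets the model pick the length
        "aspect_ratio": aspect_ratio,  # "adaptive" fits the inputs
    }

payload = build_input(
    "The character from [Image1] dances to the beat of [Audio1].",
    images=["char.png"], audios=["beat.mp3"],
)
```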
What you can create
Text to video
Describe a scene in natural language and get a video with matching audio. The model understands multi-subject interactions, camera movements, and emotional tone. For dialogue, put speech in double quotes in your prompt — the model generates matching lip movements and voice.
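A text-to-video prompt with spoken dialogue might look like the following. The quoting convention is from the docs; the `replicate.run` invocation and model slug in the comment are assumptions shown only as a hypothetical usage pattern.

```python
# A text-to-video prompt with the spoken line in double quotes, so the
# model can generate matching lip movements and voice.
prompt = (
    "A man in a rain-soaked street turns to the camera, slow dolly-in, "
    'warm streetlight glow. He stops and says: "Remember this moment."'
)

# Hypothetical invocation via the Replicate Python client; the model
# slug here is an assumption:
# import replicate
# output = replicate.run("bytedance/seedance-2.0", input={"prompt": prompt})
```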
Image to video
Animate a still image by providing it as the first frame. You can also specify a last frame image to control where the video ends up. The model preserves the look and style of your input image while adding natural motion.
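A minimal sketch of an image-to-video input, with the end frame optional. The field names (`first_frame_image`, `last_frame_image`) are illustrative assumptions, not documented parameters.

```python
# Image-to-video: animate a still by supplying it as the first frame,
# optionally pinning where the video ends up. Field names are
# illustrative assumptions, not confirmed API parameters.

def image_to_video_input(prompt, first_frame, last_frame=None):
    payload = {"prompt": prompt, "first_frame_image": first_frame}
    if last_frame is not None:
        payload["last_frame_image"] = last_frame  # optional end state
    return payload

animate = image_to_video_input(
    "The portrait slowly smiles as wind lifts her hair.",
    first_frame="portrait.png",
)
```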
Multimodal reference
Combine images, videos, and audio as references. For example, provide a reference video for motion style, reference images for character appearance, and reference audio for rhythm — then describe how to combine them. This is powerful for outfit-change videos, product showcases, and music-synced content.
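Because prompts refer to inputs by position (`[Image1]`, `[Video1]`, `[Audio1]`), a simple pre-flight check can catch tokens with no matching file. The token convention is from the docs; the validator itself is just an illustrative sketch.

```python
import re

# Verify that every [ImageN]/[VideoN]/[AudioN] token in a prompt refers
# to a reference file that was actually supplied.

def check_reference_tokens(prompt, n_images=0, n_videos=0, n_audios=0):
    limits = {"Image": n_images, "Video": n_videos, "Audio": n_audios}
    for kind, idx in re.findall(r"\[(Image|Video|Audio)(\d+)\]", prompt):
        if not 1 <= int(idx) <= limits[kind]:
            raise ValueError(f"[{kind}{idx}] has no matching reference file")

check_reference_tokens(
    "The character from [Image1] performs the dance from [Video1].",
    n_images=1, n_videos=1,
)
```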
Video editing
Provide a reference video and describe changes — replace an object, change a background, or alter the style. The model preserves the original motion and camera work while making your edits.
Video extension
Provide a reference video and describe what should happen next. The model continues the scene with consistent characters, environment, and style.
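Editing and extension both pair a reference video with a text instruction; only the prompt differs. The field names in this sketch are illustrative assumptions.

```python
# Video editing vs. extension: same shape of input (a reference video
# plus a prompt), different instruction. Field names are illustrative
# assumptions, not confirmed API parameters.

edit_input = {
    "videos": ["showcase.mp4"],  # referenced as [Video1] in the prompt
    "prompt": ("Replace the perfume bottle in [Video1] with a face cream "
               "jar, keeping the original motion and camera work."),
}

extend_input = {
    "videos": ["scene.mp4"],
    "prompt": ("Continue [Video1]: the hiker reaches the summit as the "
               "sun breaks through the clouds."),
}
```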
Key features
Native audio generation
Audio and video are generated together, not separately. This means dialogue, sound effects, and background music are all synchronized with the visuals from the start. You can turn audio off if you just want silent video.
Character consistency
When using reference images, the model maintains facial features, clothing, and style across the generated video. This makes it possible to create multi-shot narratives with consistent characters.
Precise prompt following
The model handles complex prompts with multiple subjects, specific actions, and detailed camera movements. It understands spatial relationships and sequential actions.
Tips for good results
- Be specific in your prompts — describe camera movements, lighting, mood, and specific actions.
- For dialogue, put the spoken words in double quotes: The man stopped and said: "Remember this moment."
- When using reference inputs, label them in your prompt: “The character from [Image1] performs the dance from [Video1].”
- For video editing, describe what to change and what to keep: “Replace the perfume in [Video1] with the face cream from [Image1], keeping all original motion.”
- Start with shorter durations (5 seconds) while experimenting, then increase once you’re happy with the style.
Supported resolutions
| Resolution | 16:9 | 4:3 | 1:1 | 3:4 | 9:16 | 21:9 |
|---|---|---|---|---|---|---|
| 480p | 864×496 | 752×560 | 640×640 | 560×752 | 496×864 | 992×432 |
| 720p | 1280×720 | 1112×834 | 960×960 | 834×1112 | 720×1280 | 1470×630 |
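The table above can be captured as a lookup if you need exact output dimensions programmatically; the pixel values are taken directly from the table, while the helper itself is just a convenience sketch.

```python
# Output dimensions from the supported-resolutions table, keyed by
# (resolution, aspect_ratio). Values are (width, height) in pixels.
RESOLUTIONS = {
    ("480p", "16:9"): (864, 496),
    ("480p", "4:3"):  (752, 560),
    ("480p", "1:1"):  (640, 640),
    ("480p", "3:4"):  (560, 752),
    ("480p", "9:16"): (496, 864),
    ("480p", "21:9"): (992, 432),
    ("720p", "16:9"): (1280, 720),
    ("720p", "4:3"):  (1112, 834),
    ("720p", "1:1"):  (960, 960),
    ("720p", "3:4"):  (834, 1112),
    ("720p", "9:16"): (720, 1280),
    ("720p", "21:9"): (1470, 630),
}

def dimensions(resolution, aspect_ratio):
    """Look up the output size for a resolution / aspect-ratio pair."""
    return RESOLUTIONS[(resolution, aspect_ratio)]
```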
Learn more
For technical details and architecture, see the official Seedance 2.0 page.
You can try this model on the Replicate Playground.