wan-video/wan-2.7-t2v

Generate videos with audio from text prompts using Alibaba's Wan 2.7 model. 1080p, up to 15 seconds, with audio synchronization.

33 runs

Wan 2.7 text-to-video

Wan 2.7 is a text-to-video model from Alibaba’s Wan family. Describe a scene in natural language and it generates a video with coherent motion, lighting, and synchronized audio.

The model is built on a 27-billion-parameter Mixture-of-Experts architecture and generates video at up to 1080p, with durations from 2 to 15 seconds. It auto-generates matching audio (sound effects, ambient noise), or you can provide your own audio file for voice or music synchronization.

Inputs

  • prompt — Text description of the video to generate (required)
  • negative_prompt — Describes content that should not appear in the video
  • audio — Optional audio file (wav/mp3, 3–30s, ≤15 MB) for voice or music synchronization. If omitted, the model generates matching audio automatically.
  • resolution — 720p or 1080p (default: 1080p)
  • aspect_ratio — 16:9, 9:16, 1:1, 4:3, or 3:4 (default: 16:9)
  • duration — Length in seconds, 2–15 (default: 5)
  • enable_prompt_expansion — Automatically expand short prompts for better results. Improves quality but adds latency (default: true)
  • seed — Random seed for reproducible results
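The parameters above can be collected into a single input payload and sanity-checked client-side before submission. The sketch below is a hypothetical helper (not part of any official client); the field names and ranges come from the input list above, and the defaults match the documented ones.

```python
# Hypothetical helper: build an input dict for the model and reject
# values outside the documented ranges before sending a request.

VALID_RESOLUTIONS = {"720p", "1080p"}
VALID_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}

def build_input(prompt, negative_prompt=None, resolution="1080p",
                aspect_ratio="16:9", duration=5,
                enable_prompt_expansion=True, seed=None):
    """Validate parameters against the documented constraints and
    return the input payload as a dict."""
    if not prompt:
        raise ValueError("prompt is required")
    if resolution not in VALID_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(VALID_RESOLUTIONS)}")
    if aspect_ratio not in VALID_ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(VALID_ASPECT_RATIOS)}")
    if not 2 <= duration <= 15:
        raise ValueError("duration must be between 2 and 15 seconds")
    payload = {
        "prompt": prompt,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
        "duration": duration,
        "enable_prompt_expansion": enable_prompt_expansion,
    }
    # Optional fields are included only when explicitly set.
    if negative_prompt is not None:
        payload["negative_prompt"] = negative_prompt
    if seed is not None:
        payload["seed"] = seed
    return payload
```

The resulting dict can be passed as the `input` of whatever client you use to call the model.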

Tips

  • Be descriptive. Include details about the scene, lighting, camera movement, and action. “A golden retriever running through autumn leaves in a park, camera tracking from the side, warm afternoon light” works much better than “a dog in a park.”
  • Keep durations short. 2–5 second clips tend to produce the most coherent motion and scene consistency.
  • Use negative prompts to reduce common artifacts — try “blurry, distorted, low quality, static.”
  • Enable prompt expansion for short prompts. It fills in visual details that improve generation quality.
  • Pick the right aspect ratio for your use case — 9:16 for vertical/mobile content, 16:9 for widescreen, 1:1 for social.
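A simple way to follow the first tip consistently is to assemble prompts from named parts. This is a hypothetical sketch (`compose_prompt` is not part of any official tooling); it just joins the scene, action, camera, and lighting details the tips recommend into one descriptive string.

```python
# Hypothetical sketch: compose a descriptive prompt from the elements
# the tips recommend (scene, action, camera movement, lighting).

def compose_prompt(scene, action=None, camera=None, lighting=None):
    """Join the supplied details into a comma-separated prompt,
    skipping any that were not provided."""
    parts = [scene] + [p for p in (action, camera, lighting) if p]
    return ", ".join(parts)

prompt = compose_prompt(
    scene="A golden retriever running through autumn leaves in a park",
    camera="camera tracking from the side",
    lighting="warm afternoon light",
)
# Pair with a negative prompt targeting common artifacts.
negative_prompt = "blurry, distorted, low quality, static"
```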

Limitations

  • Complex multi-character scenes with specific interactions can be inconsistent.
  • Text rendering within generated videos is unreliable.
  • Longer durations (10+ seconds) may show motion degradation or scene drift.
  • Precise spatial relationships (“object A is to the left of object B”) are not always followed exactly.