vidu/q3-pro

High-fidelity video generation with text-to-video, image-to-video, and start-end-to-video modes. Up to 16 seconds at 1080p with synchronized audio.


Vidu Q3 Pro

Vidu Q3 Pro generates high-fidelity video from a text prompt alone, or from a prompt combined with one or two reference images. It produces cinematic clips up to 16 seconds long at up to 1080p resolution, with optional synchronized audio: dialogue, sound effects, and ambient sounds generated alongside the video.

What it does

Vidu Q3 Pro creates video in three modes, chosen automatically based on your inputs:

  • Text to video: Describe a scene and the model generates it
  • Image to video: Upload a starting image and a prompt describing the motion
  • Start-end to video: Upload both a starting and ending frame, and the model creates a smooth transition between them

The model handles complex motion, maintains temporal consistency across frames, and produces natural-looking camera movements. When audio is enabled, it generates synchronized sound that matches the visual content.
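
The automatic mode selection can be sketched as a small function. This is only an illustration of the rules stated on this page, not the service's actual implementation; the function name infer_mode and the mode labels are hypothetical.

```python
from typing import Optional


def infer_mode(start_image: Optional[str] = None,
               end_image: Optional[str] = None) -> str:
    """Return the generation mode implied by which images are supplied.

    Hypothetical helper: mirrors the rules described on this page only.
    """
    # An ending frame is only meaningful together with a starting frame.
    if end_image is not None and start_image is None:
        raise ValueError("end_image requires start_image")
    if start_image is None:
        return "text-to-video"
    if end_image is None:
        return "image-to-video"
    return "start-end-to-video"
```

Calling infer_mode() with no images selects text to video; adding only a start image selects image to video; adding both selects start-end to video.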

How to use it

Text to video

Provide a prompt describing your scene. Set aspect_ratio (default 16:9) to control the framing; this parameter applies only in text-to-video mode.

Image to video

Upload a start_image along with a prompt describing what should happen. The model animates your image into video. Supported formats: PNG, JPEG, WebP.

Start-end to video

Upload both start_image and end_image with a prompt. The model generates a video that transitions smoothly from the first frame to the last. Both images should have similar aspect ratios.
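
If you want to check the two frames before uploading, a rough comparison of their aspect ratios might look like the sketch below. The page does not say how similar the ratios must be, so the 5% tolerance is an assumption, and aspect_ratios_similar is a hypothetical helper that takes (width, height) pairs you have already read from your image files.

```python
from typing import Tuple


def aspect_ratios_similar(start_size: Tuple[int, int],
                          end_size: Tuple[int, int],
                          tolerance: float = 0.05) -> bool:
    """Compare two (width, height) pairs by relative aspect-ratio difference.

    The default tolerance is an assumed value, not a documented limit.
    """
    start_w, start_h = start_size
    end_w, end_h = end_size
    if min(start_w, start_h, end_w, end_h) <= 0:
        raise ValueError("image dimensions must be positive")
    start_ratio = start_w / start_h
    end_ratio = end_w / end_h
    # Relative difference, measured against the larger of the two ratios.
    relative_diff = abs(start_ratio - end_ratio) / max(start_ratio, end_ratio)
    return relative_diff <= tolerance
```

For instance, a 1920x1080 start frame and a 1280x720 end frame share the same ratio, while a 1920x1080 frame paired with a 1080x1920 frame does not.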

Writing effective prompts

  • Be specific about motion: “A woman in a red coat walks through falling snow” works better than “a person outside”
  • Describe camera movement if you want it: “slow dolly shot”, “aerial view pulling back”
  • For audio, describe sounds explicitly: “birds chirping”, “footsteps on gravel”
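
One way to apply these tips consistently is to assemble the prompt from separate pieces. The sketch below is purely illustrative: compose_prompt is a hypothetical helper, and the "Sounds:" label is a formatting choice made here, not a syntax the model is documented to require.

```python
from typing import Optional, Sequence

MAX_PROMPT_CHARS = 5000  # prompt length limit listed under Parameters


def compose_prompt(subject_and_motion: str,
                   camera: Optional[str] = None,
                   sounds: Sequence[str] = ()) -> str:
    """Join a motion description, optional camera move, and optional sounds."""
    parts = [subject_and_motion.strip().rstrip(".")]
    if camera:
        parts.append(camera.strip().rstrip("."))
    if sounds:
        # Name each sound explicitly, as the audio tip suggests.
        parts.append("Sounds: " + ", ".join(sounds))
    prompt = ". ".join(parts) + "."
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt exceeds 5,000 characters")
    return prompt


example = compose_prompt(
    "A woman in a red coat walks through falling snow",
    camera="Slow dolly shot",
    sounds=["footsteps on snow", "wind"],
)
# example == "A woman in a red coat walks through falling snow. "
#            "Slow dolly shot. Sounds: footsteps on snow, wind."
```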

Parameters

  • prompt: Text description of the video (up to 5,000 characters)
  • start_image: Starting frame image (enables image-to-video mode)
  • end_image: Ending frame image (requires start_image, enables start-end mode)
  • duration: Video length in seconds (1–16, default: 5)
  • resolution: Output resolution — 540p, 720p, or 1080p (default: 720p)
  • aspect_ratio: 16:9, 9:16, 3:4, 4:3, or 1:1 (text-to-video only, default: 16:9)
  • audio: Generate synchronized audio (default: true)
  • seed: Random seed for reproducible results
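
The constraints above can be checked locally before a request is sent. The following is a minimal sketch under a few assumptions: build_input is a hypothetical helper, images are passed as URL or path strings, duration is a whole number of seconds, and fields left unset are omitted so that the documented defaults apply. Rejecting aspect_ratio when a start image is present is a choice made here; the service may simply ignore it instead.

```python
from typing import Any, Dict, Optional

RESOLUTIONS = {"540p", "720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "3:4", "4:3", "1:1"}
MAX_PROMPT_CHARS = 5000


def build_input(prompt: str, *,
                start_image: Optional[str] = None,
                end_image: Optional[str] = None,
                duration: Optional[int] = None,
                resolution: Optional[str] = None,
                aspect_ratio: Optional[str] = None,
                audio: Optional[bool] = None,
                seed: Optional[int] = None) -> Dict[str, Any]:
    """Validate parameters against the listed limits and return a payload dict."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt exceeds 5,000 characters")
    if end_image is not None and start_image is None:
        raise ValueError("end_image requires start_image")
    if duration is not None and not 1 <= duration <= 16:
        raise ValueError("duration must be between 1 and 16 seconds")
    if resolution is not None and resolution not in RESOLUTIONS:
        raise ValueError("resolution must be 540p, 720p, or 1080p")
    if aspect_ratio is not None:
        if aspect_ratio not in ASPECT_RATIOS:
            raise ValueError("unsupported aspect_ratio")
        if start_image is not None:
            # Assumption: treat aspect_ratio as an error outside text-to-video.
            raise ValueError("aspect_ratio applies to text-to-video only")
    candidate = {
        "prompt": prompt, "start_image": start_image, "end_image": end_image,
        "duration": duration, "resolution": resolution,
        "aspect_ratio": aspect_ratio, "audio": audio, "seed": seed,
    }
    # Drop unset fields so the documented defaults are used.
    return {key: value for key, value in candidate.items() if value is not None}
```

How the resulting dictionary is submitted depends on the client or HTTP API you use, which this page does not specify.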

Pricing

Billed per second of video output, based on resolution:

Resolution    Price per second
540p          $0.07
720p          $0.15
1080p         $0.16

For example, a 5-second video at 720p costs $0.75.
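
The same arithmetic can be written as a small estimator. This is a sketch that assumes billing is simply the per-second price multiplied by the requested duration; estimate_cost is a hypothetical helper, and the prices are copied from the table above.

```python
from decimal import Decimal

# Per-second prices in USD, taken from the pricing table on this page.
PRICE_PER_SECOND = {
    "540p": Decimal("0.07"),
    "720p": Decimal("0.15"),
    "1080p": Decimal("0.16"),
}


def estimate_cost(duration_seconds: int, resolution: str = "720p") -> Decimal:
    """Estimate the price of one generation in USD."""
    if resolution not in PRICE_PER_SECOND:
        raise ValueError("resolution must be 540p, 720p, or 1080p")
    if not 1 <= duration_seconds <= 16:
        raise ValueError("duration must be between 1 and 16 seconds")
    # Decimal avoids binary floating-point rounding in currency math.
    return PRICE_PER_SECOND[resolution] * duration_seconds


print(estimate_cost(5, "720p"))    # 0.75
print(estimate_cost(16, "1080p"))  # 2.56
```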

What it’s good for

  • Marketing and advertising: Create polished video content from text descriptions or product images
  • Social media: Generate short-form video in vertical, square, or widescreen formats
  • Storyboarding: Quickly visualize scenes from written descriptions
  • Animation: Bring still images to life with natural motion
  • Scene transitions: Use start-end mode to create smooth visual bridges between keyframes

Limitations

  • Maximum 16 seconds per generation
  • Audio generation adds dialogue and sound effects but doesn’t support background music control
  • Complex text rendering within the video may not be reliable
  • Very rapid fine-grained hand movements can sometimes look unnatural