prunaai/p-video-avatar

p-video-avatar is the fastest and cheapest avatar/lipsync video model on the market.

p-video-avatar

Generate talking-head videos from a single portrait image plus either a script or an audio clip. Upload a photo, give it words to say (or audio to lip-sync to), and get back a video of the person speaking.

p-video-avatar is Pruna’s first avatar model on Replicate. It’s optimized for speed and cost — the cheapest avatar/lipsync option in this category — while still producing natural lip movements and facial expressions.

How it works

Give the model two things:

  1. A portrait image (jpg, jpeg, png, or webp) — the face you want to animate.
  2. Either a voice_script (text the model speaks aloud in one of 30 voices and 10 languages) or an audio file (your own recording, which the model lip-syncs to).

If you provide both, the audio file takes precedence and the model lip-syncs to it instead of speaking the voice_script.

The model returns an MP4 video at either 720p or 1080p, with the speech baked into the audio track.
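As a rough illustration of the inputs described above, here is a minimal Python sketch that assembles a request and mirrors the "audio takes precedence" rule on the client side. The field names voice_script and audio come from this README; the "image" field name, the helper names build_input and generate, and the choice to drop voice_script locally when audio is present are assumptions made for illustration, so check the model's input schema before relying on them. The suffix check also assumes the image is a plain path or a URL that ends in its file extension.

```python
from pathlib import PurePosixPath

# Image formats this README lists for the portrait.
ALLOWED_IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def build_input(image, voice_script=None, audio=None):
    """Assemble an input dict (hypothetical helper, not part of any SDK)."""
    suffix = PurePosixPath(image).suffix.lower()
    if suffix not in ALLOWED_IMAGE_SUFFIXES:
        raise ValueError(f"unsupported image format: {suffix or 'none'}")
    if audio is None and not voice_script:
        raise ValueError("provide either voice_script or audio")

    payload = {"image": image}  # "image" is an assumed field name
    if audio is not None:
        # Audio takes precedence, so voice_script is left out here.
        payload["audio"] = audio
    else:
        payload["voice_script"] = voice_script
    return payload


def generate(payload):
    """Run the model with the Replicate Python client.

    Needs the `replicate` package and a REPLICATE_API_TOKEN environment
    variable. The model reference may need a ":<version>" suffix.
    """
    import replicate  # imported lazily so build_input works without it

    return replicate.run("prunaai/p-video-avatar", input=payload)
```

Calling build_input("portrait.png", voice_script="Hello there") yields a script-driven request, and passing audio as well switches the same call to lip-sync mode.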

Voice and language

When you use voice_script, you can pick from 30 named voices (mix of male and female) and 10 output languages, including English, Spanish, French, German, Italian, Portuguese (Brazil), Japanese, Korean, and Hindi.

Use voice_prompt to give style direction to the speaker — for example, "speak with excitement", "calm and measured", or "like a news anchor." Use video_prompt to describe what’s happening in the video itself, like "the person is gesturing with their hands."

Tips for good results

  • Use a clear, front-facing portrait. Heavy angles, occlusion, or low resolution all hurt identity preservation.
  • Keep audio clean. Clear speech with minimal background noise produces tighter lip-sync.
  • Pick the right resolution. 720p costs a little over half as much as 1080p ($0.025 vs. $0.045 per second) and works well for most use cases. Use 1080p when you need extra detail.
  • Use voice_prompt for performance direction, not for what to say. Put the words in voice_script.

Pricing

Billing is per second of output video, based on resolution:

  • 720p: $0.025 per second
  • 1080p: $0.045 per second

A 10-second clip costs $0.25 at 720p and $0.45 at 1080p.
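The per-second rates make cost easy to estimate up front. The sketch below multiplies clip length by the listed rate. The function name estimate_cost is hypothetical, and the calculation assumes billing is strictly linear in output duration, with no rounding to whole seconds, which this README does not specify.

```python
from decimal import Decimal

# Per-second rates as listed in this README (USD).
PRICE_PER_SECOND = {
    "720p": Decimal("0.025"),
    "1080p": Decimal("0.045"),
}


def estimate_cost(seconds, resolution="720p"):
    """Return the estimated cost in USD for a clip of the given length."""
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution}")
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    # Convert via str so float inputs like 7.5 do not pick up binary noise.
    return PRICE_PER_SECOND[resolution] * Decimal(str(seconds))


for res in ("720p", "1080p"):
    print(f"{res}: ${estimate_cost(10, res):.2f}")
```

Decimal is used instead of float so that small per-second prices add up exactly when many clips are totalled.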

What you can build

  • Educational content — narrate lessons with a consistent presenter from a single photo.
  • Marketing and social — turn product photos or brand mascots into talking avatars.
  • Localized content — generate the same script in multiple languages from one portrait.
  • Podcast or audiobook visuals — animate a host portrait synced to existing audio.

Try it on the Replicate playground.