zsxkib / v-express

🫦 Realistic facial expression manipulation (lip-syncing) using audio or video

  • Public
  • 1K runs
  • A100 (80GB)
  • GitHub
  • Paper
  • License

Input

reference_image
file

Path to the reference image that will be used as the base for the generated video.

driving_audio
file

Path to the audio file that will be used to drive the motion in the generated video.

use_video_audio
boolean

If True and driving_video is provided, use the audio from the driving video instead of the driving_audio.

Default: false

driving_video
file

Path to the video file that will be used to extract the head motion. If not provided, the generated video will use the motion based on the selected motion_mode.

motion_mode
string

Mode for generating the head motion in the output video. One of "standard", "gentle", "normal", or "fast".

Default: "fast"

reference_attention_weight
number
(minimum: 0, maximum: 1)

Amount of attention to pay to the reference image vs. the driving motion. Higher values will make the generated video adhere more closely to the reference image. Range: 0.0 to 1.0

Default: 0.95

audio_attention_weight
number
(minimum: 0, maximum: 10)

Amount of attention to pay to the driving audio vs. the reference image. Higher values will make the generated video's motion more closely match the driving audio. Range: 0.0 to 10.0

Default: 3

num_inference_steps
integer
(minimum: 1, maximum: 100)

Number of diffusion steps to perform during generation. More steps will generally produce better quality results but will take longer to run. Range: 1 to 100

Default: 25

image_width
integer
(minimum: 64, maximum: 2048)

Width of the generated video frames.

Default: 512

image_height
integer
(minimum: 64, maximum: 2048)

Height of the generated video frames.

Default: 512

frames_per_second
number
(minimum: 1, maximum: 60)

Frame rate of the generated video.

Default: 30

guidance_scale
number
(minimum: 1, maximum: 20)

Guidance scale for the diffusion model. Higher values will result in the generated video following the driving motion and audio more closely.

Default: 3.5

num_context_frames
integer
(minimum: 1, maximum: 24)

Number of context frames to use for motion estimation.

Default: 12

context_stride
integer
(minimum: 1, maximum: 10)

Stride of the context frames.

Default: 1

context_overlap
integer
(minimum: 0, maximum: 24)

Number of overlapping frames between context windows.

Default: 4

num_audio_padding_frames
integer
(minimum: 0, maximum: 10)

Number of audio frames to pad on each side of the driving audio.

Default: 2

seed
integer

Random seed. Leave blank to randomize the seed.
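
Taken together, the schema above corresponds to an input payload like the one below. This is a minimal sketch in Python using the documented defaults; the file URLs are placeholders, not real assets.

# Hypothetical input payload assembled from the schema above.
# File inputs take URLs (or local file handles when using a client library).
input_payload = {
    "reference_image": "https://example.com/portrait.jpg",  # required base photo
    "driving_audio": "https://example.com/speech.wav",      # audio that drives the lips
    "use_video_audio": False,            # default: false
    # "driving_video": "https://example.com/motion.mp4",    # optional head-motion source
    "motion_mode": "fast",               # default: "fast"
    "reference_attention_weight": 0.95,  # range 0-1, default: 0.95
    "audio_attention_weight": 3,         # range 0-10, default: 3
    "num_inference_steps": 25,           # range 1-100, default: 25
    "image_width": 512,                  # range 64-2048, default: 512
    "image_height": 512,                 # range 64-2048, default: 512
    "frames_per_second": 30,             # range 1-60, default: 30
    "guidance_scale": 3.5,               # range 1-20, default: 3.5
    "num_context_frames": 12,            # range 1-24, default: 12
    "context_stride": 1,                 # range 1-10, default: 1
    "context_overlap": 4,                # range 0-24, default: 4
    "num_audio_padding_frames": 2,       # range 0-10, default: 2
    # "seed": 42,                        # omit to randomize
}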

Output

This example was created by a different version, zsxkib/v-express:f3400fd3.

Run time and cost

This model runs on Nvidia A100 (80GB) GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

πŸŽ₯ V-Express: Create Amazing Talking Portrait Videos

Follow me on X @zsakib_ for more AI projects and updates!

🌟 Bring Photos to Life with Talking Videos

V-Express is an amazing AI tool that can turn a single photo into a lifelike talking video. It’s like magic! You can create videos that look and sound just like the person in the picture.

🎭 Unleash Your Creativity

  • Realistic Results: V-Express makes videos that look super real, with mouth movements and facial expressions that match the audio perfectly.
  • Easy to Use: Just give V-Express a photo, an audio clip, and (optionally) a driving video, and it will create an awesome video for you.
  • High-Quality Videos: Our special training method makes sure the videos are top-notch quality.

🎨 Lots of Cool Ways to Use V-Express

You can use V-Express in different ways:

  1. Same Person, Different Scene: Make a talking video that looks like a given video of the same person in a different place.
  2. Still Photo + Audio: Create a video where the person in a still photo talks using any audio you provide.
  3. Mix and Match: Make a video where one person’s movements match another person’s video, and their lips sync with the audio.
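
As a rough sketch, these three scenarios differ mainly in which inputs you supply. The dictionaries below are illustrative Python payloads with placeholder URLs, not a fixed recipe:

# 1. Same person, different scene: take head motion (and audio) from the
#    person's own video, but render it against the reference photo.
same_person = {
    "reference_image": "https://example.com/person.jpg",
    "driving_video": "https://example.com/person_talking.mp4",
    "use_video_audio": True,   # reuse the driving video's audio
}

# 2. Still photo + audio: no driving video, so head motion falls back
#    to the selected motion_mode.
photo_plus_audio = {
    "reference_image": "https://example.com/person.jpg",
    "driving_audio": "https://example.com/any_speech.wav",
    "motion_mode": "normal",
}

# 3. Mix and match: head motion from one person's video, lips synced
#    to a separate audio track.
mix_and_match = {
    "reference_image": "https://example.com/person_a.jpg",
    "driving_video": "https://example.com/person_b_moving.mp4",
    "driving_audio": "https://example.com/narration.wav",
    "use_video_audio": False,  # keep the separate audio track
}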

πŸ› οΈ Try V-Express on Replicate

You can easily make your own talking videos with V-Express on Replicate. Here’s what you need:

  • reference_image: A photo that will be used as the base for the video.
  • driving_audio: An audio clip that will be used to create the talking motion in the video.
  • use_video_audio: If you provide a driving_video, you can choose to use its audio instead of the driving_audio.
  • driving_video: A video that will be used to create the head motion in the generated video. If not provided, the motion will be based on the motion_mode you choose.
  • motion_mode: Choose how fast or slow the head motion should be in the video. You can pick from β€œstandard”, β€œgentle”, β€œnormal”, or β€œfast”.
  • reference_attention_weight: Decide how much the generated video should look like the reference image. A higher value means it will look more like the photo.
  • audio_attention_weight: Choose how much the video’s motion should match the driving audio. A higher value means the motion will match the audio more closely.
  • num_inference_steps: The number of steps V-Express takes to create the video. More steps usually mean better quality, but it will take longer.
  • image_width and image_height: The size of the generated video frames.
  • frames_per_second: The frame rate of the generated video.
  • guidance_scale: A setting that controls how closely the video follows the driving motion and audio. A higher value means it will follow them more closely.
  • num_context_frames, context_stride, and context_overlap: Advanced settings for motion estimation. You can leave these at their default values.
  • num_audio_padding_frames: The number of extra audio frames to use at the start and end of the driving audio.
  • seed: A random number that controls the video generation. If you leave it blank, V-Express will pick a random number for you.
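
Putting it all together, here is a minimal sketch of calling the model with Replicate's Python client (assuming you have run pip install replicate and set REPLICATE_API_TOKEN in your environment; the exact return type can vary by client version, but the output resolves to a link to the generated video):

import replicate
import urllib.request

output = replicate.run(
    "zsxkib/v-express",  # optionally pin a version: "zsxkib/v-express:<version-id>"
    input={
        "reference_image": open("portrait.jpg", "rb"),  # local files or URLs both work
        "driving_audio": open("speech.wav", "rb"),
        "motion_mode": "normal",
        "num_inference_steps": 25,
        "seed": 42,  # fix the seed for reproducible output
    },
)

# Download the generated .mp4 locally.
urllib.request.urlretrieve(str(output), "talking_portrait.mp4")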

Get ready to be amazed by the power of V-Express and create incredible talking videos! πŸŽ‰βœ¨

⚠️ Important Things to Keep in Mind

  • V-Express is a powerful tool that can create videos that look very real. Please use it responsibly and follow all the rules.
  • Don’t use the videos for bad things like spreading fake news or tricking people.
  • Respect people’s privacy and rights. Make sure you have permission before using someone’s photo.
  • The creators of V-Express are not responsible if someone uses the tool in a bad way.

By using V-Express, you promise to use it in a good and responsible way. Let’s make amazing videos while being kind and respectful to everyone! πŸ™Œ

✍️ Citation

@article{wang2024V-Express,
  title={V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation},
  author={Wang, Cong and Tian, Kuan and Zhang, Jun and Guan, Yonghang and Luo, Feng and Shen, Fei and Jiang, Zhiwei and Gu, Qing and Han, Xiao and Yang, Wei},
  journal={arXiv preprint arXiv:2406.02511},
  year={2024}
}