cjwbw / dreamtalk

RESEARCH/NON-COMMERCIAL USE ONLY: diffusion-based audio-driven expressive talking head generation

  • Public
  • 1K runs
  • L40S
  • GitHub
  • Paper
  • License

Input

file (required)

Input image. This specifies the input portrait. The resolution should be larger than 256x256; the image will be cropped to 256x256.

file (required)

Input audio file. Supported extensions are wav, mp3, and m4a; mp4 (video with sound) is also compatible.

string

Input style_clip_mat, optional. A .mat file that specifies the reference speaking style.

Default: "data/style_clip/3DMM/M030_front_neutral_level1_001.mat"

string

Input pose, a .mat file that specifies the head pose.

Default: "data/pose/RichardShelby_front_neutral_level1_001.mat"

integer

Maximum length, in seconds, of the generated video.

Default: 1000

integer
(minimum: 1, maximum: 500)

Number of denoising steps.

Default: 10

boolean

Enable cropping the input image. If your portrait is already cropped to 256x256, set this to false.

Default: true
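
To make these parameters concrete, here is a minimal sketch of calling the model with the Replicate Python client. The input field names (image, audio, style_clip, pose, max_gen_len, num_steps, crop_image) are assumptions inferred from the descriptions above, not confirmed names; check the model's API tab for the exact schema and the current version hash.

```python
# Minimal sketch using the Replicate Python client (pip install replicate).
# Field names below are inferred from the parameter descriptions and may
# differ from the model's actual schema.
import replicate

output = replicate.run(
    "cjwbw/dreamtalk",  # append ":<version-hash>" to pin a version
    input={
        "image": open("portrait.png", "rb"),   # portrait larger than 256x256
        "audio": open("speech.wav", "rb"),     # wav, mp3, m4a, or mp4 with sound
        "style_clip": "data/style_clip/3DMM/M030_front_neutral_level1_001.mat",
        "pose": "data/pose/RichardShelby_front_neutral_level1_001.mat",
        "max_gen_len": 30,   # cap the generated video at 30 seconds
        "num_steps": 10,     # denoising steps (1-500)
        "crop_image": True,  # set false if already cropped to 256x256
    },
)
print(output)  # URL of the generated video file
```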

Output

The generated talking-head video.

Run time and cost

This model costs approximately $0.013 to run on Replicate, or 76 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 14 seconds, but prediction time varies significantly with the inputs.
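
Since the model is open source, you can also run the published container yourself and call it over Cog's standard HTTP API. The sketch below assumes a container started locally, e.g. with docker run -p 5000:5000 --gpus=all r8.im/cjwbw/dreamtalk (pin an explicit version digest in practice); the /predictions endpoint and JSON shape come from Cog, while the input field names are the same assumptions as in the example above.

```python
# Sketch of calling a locally running Cog container for this model.
# Cog's HTTP API accepts file inputs as data URIs.
import base64
import requests

def data_uri(path: str, mime: str) -> str:
    """Encode a local file as a data URI for Cog's /predictions endpoint."""
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:5000/predictions",
    json={
        "input": {
            "image": data_uri("portrait.png", "image/png"),  # assumed field name
            "audio": data_uri("speech.wav", "audio/wav"),    # assumed field name
            "num_steps": 10,                                 # assumed field name
        }
    },
    timeout=600,  # generation can take a while on slower hardware
)
resp.raise_for_status()
print(resp.json()["output"])  # the generated video (data URI or URL)
```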

Readme

This model doesn't have a readme.