skallagrimr / latentsync

LatentSync: generate high-quality lip sync animations from images and videos

  • Public
  • 179 runs
  • A100 (80GB)
  • GitHub
  • Weights
  • License

Input

  • file: Input image or video file (supports .jpg, .jpeg, .png, .bmp, .webp, .mp4)
  • file: Input audio file to sync with
  • number: Guidance scale (minimum: 0, maximum: 10). Default: 1
  • integer: Random seed (0 for random). Default: 0
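A minimal sketch of calling this model through the Replicate Python client. The page lists only the field types above, so the input names used below ("video", "audio", "guidance_scale", "seed") are assumptions; check the model's API tab for the exact schema.

```python
import replicate

# Hypothetical input names -- the page lists only field types, so these
# keys are assumptions, not the model's confirmed schema.
output = replicate.run(
    "skallagrimr/latentsync",
    input={
        "video": open("face.mp4", "rb"),    # image or video to animate
        "audio": open("speech.wav", "rb"),  # audio track to sync to
        "guidance_scale": 1,                # 0-10, default 1
        "seed": 0,                          # 0 picks a random seed
    },
)
print(output)
```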

Run time and cost

This model runs on Nvidia A100 (80GB) GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Unofficial Implementation of bytedance/LatentSync

This is an unofficial implementation of bytedance/LatentSync, extended to support both images and videos as input for lip-syncing.

LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

We present LatentSync, an end-to-end lip sync framework based on audio-conditioned latent diffusion models without any intermediate motion representation. This approach diverges from previous diffusion-based lip sync methods that rely on pixel space diffusion or two-stage generation.

Our framework leverages the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. However, we found that diffusion-based lip sync methods tend to exhibit inferior temporal consistency due to inconsistencies in the diffusion process across different frames.

To address this, we propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA utilizes temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground truth frames.
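As a rough illustration, a TREPA-style loss can be written as a distance between temporal features of the generated and ground-truth clips extracted by a frozen self-supervised video backbone. The backbone interface and the plain L2 distance below are assumptions made for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def trepa_loss(video_backbone, generated_frames, gt_frames):
    """Sketch of a TREPA-style temporal alignment loss.

    generated_frames, gt_frames: (B, T, C, H, W) pixel-space clips.
    video_backbone: a frozen self-supervised video model returning
    temporal feature tokens of shape (B, N, D) (placeholder interface).
    """
    with torch.no_grad():
        gt_feats = video_backbone(gt_frames)        # no gradients through the target
    gen_feats = video_backbone(generated_frames)    # gradients flow to the generator
    # Align the two temporal representations; plain L2 used here for brevity.
    return F.mse_loss(gen_feats, gt_feats)
```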

Framework

LatentSync utilizes Whisper to convert mel-spectrograms into audio embeddings, which are then integrated into the U-Net via cross-attention layers.
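A minimal sketch of this conditioning path, assuming Whisper embeddings of shape (B, N_audio, D_audio) and flattened U-Net latent tokens of shape (B, N_latent, D_latent); the module layout and dimensions are illustrative, not the repository's actual classes.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Illustrative audio-conditioned cross-attention block (assumed layout)."""

    def __init__(self, d_latent=320, d_audio=384, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=d_latent, num_heads=n_heads,
            kdim=d_audio, vdim=d_audio, batch_first=True,
        )

    def forward(self, latent_tokens, audio_embeddings):
        # Queries come from the visual latents; keys/values come from the
        # Whisper audio embeddings, so lip motion can follow the audio.
        out, _ = self.attn(latent_tokens, audio_embeddings, audio_embeddings)
        return latent_tokens + out  # residual connection
```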

  • The reference and masked frames are channel-wise concatenated with the noised latents as input to the U-Net.
  • During training, we use a one-step method to estimate clean latents from the predicted noise (see the sketch after this list).
  • These latents are then decoded to obtain the estimated clean frames.
  • TREPA, LPIPS, and SyncNet losses are applied in the pixel space to improve performance.
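The one-step estimate referenced above is the standard DDPM identity x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_bar_t). A sketch under that assumption, decoding with a diffusers-style VAE (the interface and the usual Stable Diffusion scaling factor are assumptions, not the repository's exact code):

```python
import torch

def estimate_clean_frames(noisy_latents, predicted_noise, alphas_cumprod, t, vae):
    """Sketch: one-step clean-latent estimate, then decode to pixel space."""
    # x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_bar_t)
    alpha_bar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    clean_latents = (
        noisy_latents - torch.sqrt(1.0 - alpha_bar_t) * predicted_noise
    ) / torch.sqrt(alpha_bar_t)
    # Decode with a diffusers-style AutoencoderKL; 0.18215 is the usual
    # Stable Diffusion latent scaling factor (assumed here).
    frames = vae.decode(clean_latents / 0.18215).sample
    # Pixel-space losses (TREPA, LPIPS, SyncNet) are then computed on `frames`.
    return frames
```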