Unofficial Implementation of bytedance/LatentSync
This is an unofficial implementation of bytedance/LatentSync. This version has been improved to support both images and videos as input for lip-syncing.
LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync
We present LatentSync, an end-to-end lip sync framework based on audio-conditioned latent diffusion models without any intermediate motion representation. This approach diverges from previous diffusion-based lip sync methods that rely on pixel space diffusion or two-stage generation.
Our framework leverages the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. However, we found that diffusion-based lip sync methods tend to exhibit inferior temporal consistency due to inconsistencies in the diffusion process across different frames.
To address this, we propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA utilizes temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground truth frames.
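As a rough illustration of the idea (a sketch, not the authors' exact implementation), a TREPA-style loss can be written as the distance between temporal features that a frozen self-supervised video backbone extracts from the generated and ground-truth clips. The `video_backbone` argument below is a hypothetical stand-in for such a large-scale video model.

```python
import torch
import torch.nn.functional as F

def trepa_style_loss(video_backbone, generated_frames, gt_frames):
    """Align temporal representations of generated and ground-truth clips.

    video_backbone: a frozen self-supervised video model (hypothetical stand-in
        for the large-scale video model used by TREPA) that maps a clip of
        shape (B, C, T, H, W) to a feature tensor.
    generated_frames, gt_frames: decoded pixel-space clips, shape (B, C, T, H, W).
    """
    with torch.no_grad():
        gt_feats = video_backbone(gt_frames)       # target features, no gradient
    gen_feats = video_backbone(generated_frames)   # gradients flow back to the generator
    return F.mse_loss(gen_feats, gt_feats)
```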
Framework
LatentSync utilizes Whisper to convert mel spectrograms into audio embeddings, which are then integrated into the U-Net via cross-attention layers.
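As a minimal sketch of this conditioning (using standard PyTorch modules, not the repository's actual layer names, and with illustrative dimensions), the visual latent tokens act as queries while the Whisper audio embeddings act as keys and values:

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Hypothetical cross-attention block: visual latents attend to audio embeddings."""

    def __init__(self, latent_dim=320, audio_dim=384, num_heads=8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_dim)  # project Whisper features
        self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, visual_tokens, audio_embeddings):
        # visual_tokens: (B, N_vis, latent_dim), a flattened U-Net feature map
        # audio_embeddings: (B, N_audio, audio_dim) from Whisper
        audio = self.audio_proj(audio_embeddings)
        attended, _ = self.attn(query=visual_tokens, key=audio, value=audio)
        return self.norm(visual_tokens + attended)  # residual connection
```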
- The reference and masked frames are channel-wise concatenated with noised latents as input to the U-Net.
- During training, we use a one-step method to estimate clean latents from the predicted noise (see the sketch after this list).
- These latents are then decoded to obtain the estimated clean frames.
- TREPA, LPIPS, and SyncNet losses are applied in the pixel space to improve performance.
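The one-step estimate referenced above is the standard epsilon-prediction identity from diffusion models; a sketch under that assumption (tensor names and the channel layout are illustrative, not the repository's exact code):

```python
import torch

def unet_input(noised_latents, masked_latents, reference_latents):
    """Channel-wise concatenation of the conditioning latents (illustrative layout).

    All tensors: (B, C_latent, T, H, W) in VAE latent space.
    """
    return torch.cat([noised_latents, masked_latents, reference_latents], dim=1)

def estimate_clean_latents(noised_latents, predicted_noise, alphas_cumprod, t):
    """One-step clean-latent estimate from the predicted noise:
        x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
    """
    alpha_bar_t = alphas_cumprod[t].view(-1, 1, 1, 1, 1)  # broadcast over (B, C, T, H, W)
    return (noised_latents - (1 - alpha_bar_t).sqrt() * predicted_noise) / alpha_bar_t.sqrt()
```

The estimated clean latents are then decoded by the VAE so that the TREPA, LPIPS, and SyncNet losses can be computed on pixel-space frames, as described in the list above.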