zsxkib / memo

MEMO is a state-of-the-art open-weight model for audio-driven talking video generation.

  • Public
  • 666 runs
  • A100 (80GB)
  • GitHub
  • Weights
  • Paper
  • License

Input

*file
image

Input image (e.g. PNG/JPG).

*file

Input audio (e.g. WAV/MP3).

integer
(minimum: 1, maximum: 200)

Diffusion inference steps. Default: 20

number
(minimum: 1, maximum: 20)

Classifier-free guidance scale. Default: 3.5

integer

Set a random seed (None or 0 for random). Default: 0

Including resolution and 3 more...
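
For convenience, the sketch below shows one way this model could be called from the Replicate Python client. Only the "image" field name appears above; the other input keys ("audio", "num_inference_steps", "guidance_scale", "seed") are assumptions based on the parameter descriptions and should be checked against the model's API schema before use.

# Minimal sketch using the Replicate Python client (pip install replicate).
# Assumes REPLICATE_API_TOKEN is set in the environment. All input keys
# except "image" are assumed names; verify them in the model's API schema.
import replicate

output = replicate.run(
    "zsxkib/memo",
    input={
        "image": open("portrait.png", "rb"),   # input image (PNG/JPG)
        "audio": open("speech.wav", "rb"),     # input audio (WAV/MP3), assumed key
        "num_inference_steps": 20,             # diffusion steps, 1-200 (assumed key)
        "guidance_scale": 3.5,                 # classifier-free guidance, 1-20 (assumed key)
        "seed": 0,                             # 0 or None for a random seed (assumed key)
    },
)
print(output)  # typically a URL to the generated talking-head video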

Output

Run time and cost

This model runs on Nvidia A100 (80GB) GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

MEMO is a state-of-the-art model for generating expressive talking head videos from a single image and audio input. The model excels at maintaining audio-lip synchronization, identity consistency, and producing natural, audio-aligned expressions.

Description

MEMO can generate highly realistic talking head videos with the following capabilities:

  • Audio-Driven Animation: Generate talking videos from a single portrait image and an audio clip
  • Multi-Language Support: Works with various languages including English, Mandarin, Spanish, Japanese, Korean, and Cantonese
  • Versatile Image Input: Handles different image styles including portraits, sculptures, digital art, and animations
  • Audio Flexibility: Compatible with different audio types including speech, singing, and rap
  • Expression Control: Generates natural facial expressions aligned with audio emotional content
  • Identity Preservation: Maintains consistent identity throughout generated videos
  • Head Pose Variation: Supports various head poses while maintaining stability

Model Details

The model uses two key components:

  • Memory-guided temporal module for enhanced long-term identity consistency
  • Emotion-aware audio module for better audio-video synchronization and natural expressions
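
To make the first component concrete, here is a minimal illustrative sketch, not the authors' implementation, of the memory-guided idea: cache features from recent frames and let the current frame attend over them, which is what helps keep identity stable over long clips. All class and variable names are hypothetical.

# Illustrative sketch only, not MEMO's actual code: a temporal block that
# attends over a running memory of past-frame features to stabilize identity.
import torch
import torch.nn as nn

class MemoryGuidedTemporalBlock(nn.Module):  # hypothetical name
    def __init__(self, dim=512, num_heads=8, memory_len=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory_len = memory_len
        self.memory = None  # running cache of past-frame tokens

    def forward(self, frame_tokens):
        # frame_tokens: (batch, tokens, dim) features of the current frame
        if self.memory is None:
            memory = frame_tokens  # first frame attends to itself
        else:
            memory = torch.cat([self.memory, frame_tokens], dim=1)
        out, _ = self.attn(self.norm(frame_tokens), memory, memory)
        # keep only the most recent tokens as memory for the next frame
        self.memory = memory[:, -self.memory_len:, :].detach()
        return frame_tokens + out

# Process frames sequentially so each frame can attend to the cached memory.
block = MemoryGuidedTemporalBlock()
for _ in range(4):
    frame = block(torch.randn(1, 64, 512))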

Ethical Considerations

This model is released for research purposes only. Users must:

  • Not create misleading, defamatory, or privacy-infringing content
  • Avoid using the model for political misinformation, impersonation, harassment, or fraud
  • Ensure proper authorization for input materials (images and audio)
  • Comply with copyright laws, especially regarding public figures’ likeness
  • Review generated content to ensure it meets ethical guidelines

Citation

@article{zheng2024memo,
  title={MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation},
  author={Longtao Zheng and Yifan Zhang and Hanzhong Guo and Jiachun Pan and Zhenxiong Tan and Jiahao Lu and Chuanxin Tang and Bo An and Shuicheng Yan},
  journal={arXiv preprint arXiv:2412.04448},
  year={2024}
}

Acknowledgements

This work builds upon high-quality open-source talking video datasets (HDTF, VFHQ, CelebV-HQ, MultiTalk, and MEAD) and pioneering works like EMO and Hallo.

License

Apache 2.0


Follow me on Twitter/X