# MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
MEMO is a state-of-the-art diffusion model for generating expressive talking head videos from a single image and an audio input. It maintains audio-lip synchronization and identity consistency while producing natural, audio-aligned expressions.
## Description
MEMO can generate highly realistic talking head videos with the following capabilities:
- Audio-Driven Animation: Generate talking videos from a single portrait image and an audio clip
- Multi-Language Support: Works with various languages including English, Mandarin, Spanish, Japanese, Korean, and Cantonese
- Versatile Image Input: Handles different image styles including portraits, sculptures, digital art, and animations
- Audio Flexibility: Compatible with different audio types including speech, singing, and rap
- Expression Control: Generates natural facial expressions aligned with audio emotional content
- Identity Preservation: Maintains consistent identity throughout generated videos
- Head Pose Variation: Supports various head poses while maintaining stability
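As a rough starting point for the audio-driven animation described above, the snippet below sketches how one might invoke an inference entry point on a portrait image and an audio clip. The script name (`inference.py`), config path, and flags are assumptions modeled on common patterns in similar repositories; verify them against the actual MEMO codebase before use.

```python
# Hypothetical wrapper around the (assumed) MEMO inference script.
# The script name, config path, and flag names below are illustrative
# assumptions, not confirmed parts of the MEMO repository.
import subprocess

def generate_talking_video(image_path: str, audio_path: str, output_dir: str) -> None:
    """Run the assumed inference script on one portrait image and one audio clip."""
    subprocess.run(
        [
            "python", "inference.py",
            "--config", "configs/inference.yaml",
            "--input_image", image_path,
            "--input_audio", audio_path,
            "--output_dir", output_dir,
        ],
        check=True,
    )

if __name__ == "__main__":
    generate_talking_video("portrait.png", "speech.wav", "outputs/")
```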
## Model Details
The model uses two key components:
- A memory-guided temporal module for enhanced long-term identity consistency
- An emotion-aware audio module for better audio-video synchronization and natural expressions
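To make the first component concrete, here is a minimal, self-contained PyTorch sketch of the idea behind memory-guided temporal attention: tokens from the current frame attend over a small running memory of past-frame summaries, which is what lets identity cues persist across long generations. Everything here (class name, dimensions, FIFO memory update) is an illustrative assumption, not the paper's implementation.

```python
# Conceptual sketch of memory-guided temporal attention (NOT the MEMO
# authors' implementation): each frame attends over a running memory of
# pooled past-frame features to keep identity consistent over long clips.
import torch
import torch.nn as nn

class MemoryGuidedTemporalAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, memory_len: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory_len = memory_len
        # Memory starts empty and grows up to memory_len pooled frame summaries.
        self.register_buffer("memory", torch.zeros(1, 0, dim))

    def reset(self) -> None:
        """Clear the memory between clips; it is per-video state."""
        self.memory = self.memory.new_zeros(1, 0, self.memory.shape[-1])

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, tokens, dim) features of the current frame.
        if self.memory.shape[1] == 0:
            context = frame_tokens  # first frame: plain self-attention
        else:
            past = self.memory.expand(frame_tokens.shape[0], -1, -1)
            context = torch.cat([past, frame_tokens], dim=1)
        out, _ = self.attn(frame_tokens, context, context)
        # Update the memory with a pooled summary of the current frame (FIFO).
        summary = frame_tokens.mean(dim=1, keepdim=True).detach()
        self.memory = torch.cat([self.memory, summary[:1]], dim=1)[:, -self.memory_len:]
        return out

if __name__ == "__main__":
    block = MemoryGuidedTemporalAttention()
    for _ in range(3):  # three consecutive frames of a clip
        out = block(torch.randn(2, 8, 256))
    print(out.shape)  # torch.Size([2, 8, 256])
```

The emotion-aware audio module would condition the same backbone on emotion features extracted from the audio; its details are specific to the paper and are not sketched here.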
## Ethical Considerations
This model is released for research purposes only. Users must:
- Not create misleading, defamatory, or privacy-infringing content
- Avoid using the model for political misinformation, impersonation, harassment, or fraud
- Ensure proper authorization for input materials (images and audio)
- Comply with copyright laws, especially regarding public figures’ likeness
- Review generated content to ensure it meets ethical guidelines
## Citation
```bibtex
@article{zheng2024memo,
  title={MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation},
  author={Longtao Zheng and Yifan Zhang and Hanzhong Guo and Jiachun Pan and Zhenxiong Tan and Jiahao Lu and Chuanxin Tang and Bo An and Shuicheng Yan},
  journal={arXiv preprint arXiv:2412.04448},
  year={2024}
}
```
## Acknowledgements
This work builds upon high-quality open-source talking video datasets (HDTF, VFHQ, CelebV-HQ, MultiTalk, and MEAD) and pioneering works like EMO and Hallo.
## License
Apache 2.0