MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

MEMO is a state-of-the-art, open-weight model for generating expressive talking head videos from a single portrait image and an audio clip. It maintains audio-lip synchronization and identity consistency while producing natural, audio-aligned expressions.

Description

MEMO can generate highly realistic talking head videos with the following capabilities:

  • Audio-Driven Animation: Generate talking videos from a single portrait image and an audio clip (see the usage sketch after this list)
  • Multi-Language Support: Works with various languages including English, Mandarin, Spanish, Japanese, Korean, and Cantonese
  • Versatile Image Input: Handles different image styles including portraits, sculptures, digital art, and animations
  • Audio Flexibility: Compatible with different audio types including speech, singing, and rap
  • Expression Control: Generates natural facial expressions aligned with audio emotional content
  • Identity Preservation: Maintains consistent identity throughout generated videos
  • Head Pose Variation: Supports various head poses while maintaining stability
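
The image-plus-audio workflow above maps directly onto a Replicate prediction. Below is a minimal sketch using the Replicate Python client; the "image" and "audio" input names and the unpinned "zsxkib/memo" reference are illustrative assumptions, so check this model's API schema for the exact field names and version.

# Minimal sketch of calling MEMO through the Replicate Python client.
# Assumptions: the "zsxkib/memo" identifier and the "image"/"audio" input
# names are illustrative -- consult the model's API tab for the real schema.
# Requires `pip install replicate` and a REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    "zsxkib/memo",
    input={
        "image": open("portrait.png", "rb"),  # single reference portrait
        "audio": open("speech.wav", "rb"),    # driving speech/singing clip
    },
)

# The output points to the generated talking head video.
print(output)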

Model Details

The model uses two key components:

  • Memory-guided temporal module for enhanced long-term identity consistency
  • Emotion-aware audio module for better audio-video synchronization and natural expressions
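
To make these two components concrete, here is a deliberately simplified sketch (not the released implementation) of how such a pipeline could be organized: features of recently generated frames sit in a rolling memory bank that conditions each new frame for identity consistency, while per-chunk audio features condition the expression. All shapes, function names, and the toy "denoise" update are assumptions for illustration only.

# Conceptual sketch only -- a simplified stand-in for a memory-guided
# temporal module and an audio-conditioned denoising loop. Shapes, names,
# and the linear "denoise" blend are assumptions, not MEMO's actual code.
import numpy as np

FEAT_DIM = 64
MEMORY_LEN = 16  # how many past-frame features the rolling memory keeps

def encode_audio(audio_chunk: np.ndarray) -> np.ndarray:
    """Toy audio feature: simple statistics standing in for a learned, emotion-aware encoder."""
    return np.concatenate([
        np.full(FEAT_DIM // 2, audio_chunk.mean()),
        np.full(FEAT_DIM // 2, audio_chunk.std()),
    ])

def denoise_frame(noisy: np.ndarray, identity: np.ndarray,
                  memory: list[np.ndarray], audio_feat: np.ndarray) -> np.ndarray:
    """One toy denoising step conditioned on identity, the memory bank, and audio."""
    memory_ctx = np.mean(memory, axis=0) if memory else identity
    # Blend the conditioning signals; a real diffusion model would use
    # learned cross-attention over memory and audio features here.
    return 0.5 * noisy + 0.2 * identity + 0.2 * memory_ctx + 0.1 * audio_feat

def generate(identity: np.ndarray, audio_chunks: list[np.ndarray]) -> list[np.ndarray]:
    memory: list[np.ndarray] = []
    frames = []
    for chunk in audio_chunks:
        audio_feat = encode_audio(chunk)
        noisy = np.random.randn(FEAT_DIM)            # start each frame from noise
        frame = denoise_frame(noisy, identity, memory, audio_feat)
        frames.append(frame)
        memory = (memory + [frame])[-MEMORY_LEN:]    # rolling memory bank of past frames
    return frames

if __name__ == "__main__":
    identity = np.random.randn(FEAT_DIM)               # reference-image embedding
    audio = [np.random.randn(1600) for _ in range(8)]  # 8 audio chunks
    print(len(generate(identity, audio)), "frame features generated")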

Ethical Considerations

This model is released for research purposes only. Users must:

  • Not create misleading, defamatory, or privacy-infringing content
  • Avoid using the model for political misinformation, impersonation, harassment, or fraud
  • Ensure proper authorization for input materials (images and audio)
  • Comply with copyright laws, especially regarding public figures’ likeness
  • Review generated content to ensure it meets ethical guidelines

Citation

@article{zheng2024memo,
  title={MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation},
  author={Longtao Zheng and Yifan Zhang and Hanzhong Guo and Jiachun Pan and Zhenxiong Tan and Jiahao Lu and Chuanxin Tang and Bo An and Shuicheng Yan},
  journal={arXiv preprint arXiv:2412.04448},
  year={2024}
}

Acknowledgements

This work builds upon high-quality open-source talking video datasets (HDTF, VFHQ, CelebV-HQ, MultiTalk, and MEAD) and pioneering works like EMO and Hallo.

License

Apache 2.0
