tencent/hunyuanvideo-foley

(Research & Non-commercial use only) Text-Video-to-Audio Synthesis: Generate realistic audio from video and text descriptions

Run time and cost

This model costs approximately $0.011 to run on Replicate, or 90 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 12 seconds.
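
As a rough illustration of the arithmetic above, the sketch below derives runs per dollar and a batch estimate from the quoted per-run cost. It assumes the approximate $0.011 figure holds for your inputs, which it may not; the function names are illustrative only.

```python
import math

# Approximate per-run cost quoted above; the real cost varies with inputs.
COST_PER_RUN_USD = 0.011


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Whole number of runs that fit into one dollar."""
    return math.floor(1.0 / cost_per_run)


def batch_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Estimated cost in USD for num_runs predictions, rounded to cents."""
    return round(num_runs * cost_per_run, 2)


print(runs_per_dollar())  # whole runs per dollar at the quoted rate
print(batch_cost(1000))   # estimated USD for 1000 predictions
```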

Readme

Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Professional-grade AI sound effect generation for video content creators


👥 Authors

Sizhe Shan<sup>1,2</sup> • Qiulin Li<sup>1,3</sup> • Yutao Cui<sup>1</sup> • Miles Yang<sup>1</sup> • Yuehai Wang<sup>2</sup> • Qun Yang<sup>3</sup> • Jin Zhou<sup>1†</sup> • Zhao Zhong<sup>1</sup>

🏢 <sup>1</sup>Tencent Hunyuan • 🎓 <sup>2</sup>Zhejiang University • ✈️ <sup>3</sup>Nanjing University of Aeronautics and Astronautics

*Equal contribution • †Project lead


✨ Key Highlights

- 🎭 **Multi-scenario Sync**: High-quality audio synchronized with complex video scenes
- 🧠 **Multi-modal Balance**: Perfect harmony between visual and textual information
- 🎵 **48kHz Hi-Fi Output**: Professional-grade audio generation with crystal clarity

📄 Abstract

**🚀 Tencent Hunyuan** open-sources **HunyuanVideo-Foley**, an end-to-end video sound effect generation model! *A professional-grade AI tool designed for video content creators, applicable to scenarios including short video creation, film production, advertising, and game development.*

🎯 Core Highlights

- **🎬 Multi-scenario Audio-Visual Synchronization**: Generates high-quality audio that is synchronized and semantically aligned with complex video scenes, improving realism and immersion for film, TV, and gaming applications.
- **⚖️ Multi-modal Semantic Balance**: Balances visual and textual information when composing sound effects, avoiding generation driven by one modality alone, and supports personalized dubbing requirements.
- **🎵 High-fidelity Audio Output**: A self-developed 48kHz audio VAE reconstructs sound effects, music, and vocals with high fidelity, enabling professional-grade audio generation quality.

**🏆 SOTA Performance Achieved** *HunyuanVideo-Foley reports new state-of-the-art results across multiple evaluation benchmarks in audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching, leading the compared open-source solutions on most metrics.*

![Performance Overview](assets/pan_chart.png)
*📊 Performance comparison across different evaluation metrics*

🔧 Technical Architecture

📊 Data Pipeline Design

![Data Pipeline](assets/data_pipeline.png)
*🔄 Comprehensive data processing pipeline for high-quality text-video-audio datasets*

The **TV2A (Text-Video-to-Audio)** task is a complex multimodal generation problem that requires large-scale, high-quality datasets. Our data pipeline systematically identifies and excludes unsuitable content so that the resulting model has robust, generalizable audio generation capabilities.

🏗️ Model Architecture

![Model Architecture](assets/model_arch.png)
*🧠 HunyuanVideo-Foley hybrid architecture with multimodal and unimodal transformer blocks*

**HunyuanVideo-Foley** employs a sophisticated hybrid architecture:

- **🔄 Multimodal Transformer Blocks**: Process visual-audio streams simultaneously
- **🎵 Unimodal Transformer Blocks**: Focus on audio stream refinement
- **👁️ Visual Encoding**: Pre-trained encoder extracts visual features from video frames
- **📝 Text Processing**: Semantic features extracted via pre-trained text encoder
- **🎧 Audio Encoding**: Latent representations with Gaussian noise perturbation
- **⏰ Temporal Alignment**: Synchformer-based frame-level synchronization with gated modulation
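
To make the "Gaussian noise perturbation" of audio latents more concrete, here is a minimal sketch in plain Python. It assumes a linear interpolation between the clean latent and standard Gaussian noise, as is common in flow-matching-style training; this README does not state the schedule the model actually uses, and `perturb_latent` is a hypothetical helper, not part of the repository.

```python
import random
from typing import List, Optional


def perturb_latent(latent: List[float], t: float,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Blend a clean latent with standard Gaussian noise.

    ASSUMPTION: x_t = (1 - t) * x_0 + t * eps with eps ~ N(0, 1).
    The schedule used by the actual model is not given in this README.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    rng = rng or random.Random()
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]


clean = [0.5, -1.2, 0.3]   # hypothetical audio latent values
rng = random.Random(0)     # seeded so repeated runs give the same noise
print(perturb_latent(clean, 0.0, rng))  # t = 0 keeps the clean latent
print(perturb_latent(clean, 0.5, rng))  # halfway between latent and noise
```

At `t = 0` the latent is unchanged and at `t = 1` it is pure noise; a generative model of this kind is trained to move from the noisy end back toward the clean latent.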

📈 Performance Benchmarks

🎬 MovieGen-Audio-Bench Results

> *Objective and subjective evaluation results. HunyuanVideo-Foley obtains the best score on 8 of the 10 metrics.*

| 🏆 **Method** | **PQ** ↑ | **PC** ↓ | **CE** ↑ | **CU** ↑ | **IB** ↑ | **DeSync** ↓ | **CLAP** ↑ | **MOS-Q** ↑ | **MOS-S** ↑ | **MOS-T** ↑ |
|:-------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:|:------------:|:------------:|:------------:|
| FoleyCrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
| V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
| Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
| MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
| ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
| **HunyuanVideo-Foley (ours)** | **6.59** | **2.74** | **3.88** | **6.13** | **0.35** | **0.74** | **0.33** | **4.14±0.68** | **4.12±0.77** | **4.15±0.75** |

🎯 Kling-Audio-Eval Results

> *Objective evaluation results. HunyuanVideo-Foley obtains the best score on 7 of the 11 metrics.*

| 🏆 **Method** | **FD_PANNs** ↓ | **FD_PASST** ↓ | **KL** ↓ | **IS** ↑ | **PQ** ↑ | **PC** ↓ | **CE** ↑ | **CU** ↑ | **IB** ↑ | **DeSync** ↓ | **CLAP** ↑ |
|:-------------:|:--------------:|:--------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:|
| FoleyCrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 |
| V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 |
| Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 |
| MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 |
| ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 |
| **HunyuanVideo-Foley (ours)** | **6.07** | **202.12** | **1.89** | **8.30** | **6.12** | **2.76** | **3.22** | **5.53** | **0.38** | **0.54** | **0.24** |

**🎉 Outstanding Results!** HunyuanVideo-Foley achieves the best score on most evaluation metrics (8 of 10 on MovieGen-Audio-Bench and 7 of 11 on Kling-Audio-Eval), including PQ, IB, and DeSync on both benchmarks, demonstrating improvements in audio quality, synchronization, and semantic alignment.
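
The per-metric comparison can be checked mechanically. The sketch below copies the Kling-Audio-Eval values from the table above and picks the best method for each metric according to its arrow (↑ higher is better, ↓ lower is better). The helper name `best_per_metric` is illustrative, and the first baseline is written with its published name, FoleyCrafter.

```python
# True means higher is better (↑); False means lower is better (↓).
HIGHER_IS_BETTER = {
    "FD_PANNs": False, "FD_PASST": False, "KL": False, "IS": True,
    "PQ": True, "PC": False, "CE": True, "CU": True,
    "IB": True, "DeSync": False, "CLAP": True,
}

# One list per method, in the same order as the keys of HIGHER_IS_BETTER.
SCORES = {
    "FoleyCrafter": [22.30, 322.63, 2.47, 7.08, 6.05, 2.91, 3.28, 5.44, 0.22, 1.23, 0.22],
    "V-AURA": [33.15, 474.56, 3.24, 5.80, 5.69, 3.98, 3.13, 4.83, 0.25, 0.86, 0.13],
    "Frieren": [16.86, 293.57, 2.95, 7.32, 5.72, 2.55, 2.88, 5.10, 0.21, 0.86, 0.16],
    "MMAudio": [9.01, 205.85, 2.17, 9.59, 5.94, 2.91, 3.30, 5.39, 0.30, 0.56, 0.27],
    "ThinkSound": [9.92, 228.68, 2.39, 6.86, 5.78, 3.23, 3.12, 5.11, 0.22, 0.67, 0.22],
    "HunyuanVideo-Foley": [6.07, 202.12, 1.89, 8.30, 6.12, 2.76, 3.22, 5.53, 0.38, 0.54, 0.24],
}


def best_per_metric(scores, higher_is_better):
    """Return {metric: name of the method with the best value}."""
    best = {}
    for i, metric in enumerate(higher_is_better):
        pick = max if higher_is_better[metric] else min
        best[metric] = pick(scores, key=lambda name: scores[name][i])
    return best


best = best_per_metric(SCORES, HIGHER_IS_BETTER)
ours = "HunyuanVideo-Foley"
wins = [metric for metric, name in best.items() if name == ours]
print(f"{ours} is best on {len(wins)} of {len(best)} metrics: {wins}")
for metric, name in best.items():
    if name != ours:
        print(f"{metric}: best is {name}")
```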

🚀 Quick Start

📦 Installation

**🔧 System Requirements**

- **CUDA**: 12.4 or 11.8 recommended
- **Python**: 3.8+
- **OS**: Linux (primary support)

Step 1: Clone Repository

# 📥 Clone the repository
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley
cd HunyuanVideo-Foley

Step 2: Environment Setup

💡 **Tip**: We recommend using [Conda](https://docs.anaconda.com/free/miniconda/index.html) for Python environment management.

# 🔧 Install dependencies
pip install -r requirements.txt

Step 3: Download Pretrained Models

🔗 **Download model weights from Hugging Face**

# using git-lfs
git clone https://huggingface.co/tencent/HunyuanVideo-Foley

# using huggingface-cli
huggingface-cli download tencent/HunyuanVideo-Foley

💻 Usage

🎬 Single Video Generation

Generate Foley audio for a single video file with a text description:

python3 infer.py \
    --model_path PRETRAINED_MODEL_PATH_DIR \
    --config_path ./configs/hunyuanvideo-foley-xxl.yaml \
    --single_video video_path \
    --single_prompt "audio description" \
    --output_dir OUTPUT_DIR
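
When calling the script from Python, for example inside a larger pipeline, the same command can be assembled as an argument list so that prompts containing spaces or quotes need no manual escaping. This is a sketch that uses only the flags shown above; `build_single_video_command` is a hypothetical helper, and the paths are the same placeholders as in the command.

```python
import shlex
import subprocess
from typing import List

DEFAULT_CONFIG = "./configs/hunyuanvideo-foley-xxl.yaml"


def build_single_video_command(model_path: str, video_path: str,
                               prompt: str, output_dir: str,
                               config_path: str = DEFAULT_CONFIG) -> List[str]:
    """Assemble the infer.py argument list for one video."""
    return [
        "python3", "infer.py",
        "--model_path", model_path,
        "--config_path", config_path,
        "--single_video", video_path,
        "--single_prompt", prompt,
        "--output_dir", output_dir,
    ]


cmd = build_single_video_command(
    model_path="PRETRAINED_MODEL_PATH_DIR",  # placeholder, as above
    video_path="video_path",                 # placeholder, as above
    prompt="footsteps on gravel",            # hypothetical description
    output_dir="OUTPUT_DIR",                 # placeholder, as above
)
print(shlex.join(cmd))  # shell-quoted form, for inspection
# subprocess.run(cmd, check=True)  # uncomment to run from the repository root
```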

📂 Batch Processing

Process multiple videos using a CSV file with video paths and descriptions:

python3 infer.py \
    --model_path PRETRAINED_MODEL_PATH_DIR \
    --config_path ./configs/hunyuanvideo-foley-xxl.yaml \
    --csv_path assets/test.csv \
    --output_dir OUTPUT_DIR
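
A batch file can be generated with Python's `csv` module, which handles quoting when a description contains commas. The sketch below is an assumption-laden example: the column names `video` and `prompt` are placeholders that this README does not confirm, so match them to the header of `assets/test.csv` in the repository before use. The clip paths and descriptions are hypothetical.

```python
import csv
import os
import tempfile
from typing import Iterable, Tuple


def write_manifest(rows: Iterable[Tuple[str, str]], csv_path: str,
                   video_col: str = "video",
                   prompt_col: str = "prompt") -> int:
    """Write (video path, description) pairs to a CSV file.

    ASSUMPTION: the default column names are placeholders; check them
    against assets/test.csv. Returns the number of data rows written.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([video_col, prompt_col])  # header row
        for video, prompt in rows:
            writer.writerow([video, prompt])
            count += 1
    return count


examples = [  # hypothetical clips and descriptions
    ("clips/door.mp4", "a wooden door creaks open, then slams shut"),
    ("clips/rain.mp4", "heavy rain on a tin roof"),
]
out_path = os.path.join(tempfile.mkdtemp(), "my_batch.csv")
n = write_manifest(examples, out_path)
print(f"wrote {n} rows to {out_path}")
```

The resulting file path would then be passed to `--csv_path` in place of `assets/test.csv`.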

🌐 Interactive Web Interface

Launch a Gradio web interface for interactive use:

export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
python3 gradio_app.py

*🚀 Then open your browser and navigate to the local URL printed in the terminal to start generating Foley audio!*

📚 Citation

If you find **HunyuanVideo-Foley** useful for your research, please consider citing our paper:

@misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
      title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation}, 
      author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
      year={2025},
      eprint={2508.16930},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2508.16930}, 
}

🙏 Acknowledgements

**We extend our heartfelt gratitude to the open-source community!**

- 🎨 **[Stable Diffusion 3](https://huggingface.co/stabilityai/stable-diffusion-3-medium)**: Foundation diffusion models
- ⚡ **[FLUX](https://github.com/black-forest-labs/flux)**: Advanced generation techniques
- 🎵 **[MMAudio](https://github.com/hkchengrex/MMAudio)**: Multimodal audio generation
- 🤗 **[HuggingFace](https://huggingface.co)**: Platform & diffusers library
- 🗜️ **[DAC](https://github.com/descriptinc/descript-audio-codec)**: High-Fidelity Audio Compression
- 🔗 **[Synchformer](https://github.com/v-iashin/Synchformer)**: Audio-Visual Synchronization

**🌟 Special thanks to all researchers and developers who contribute to the advancement of AI-generated audio and multimodal learning!**

🔗 Connect with Us

[![GitHub](https://img.shields.io/badge/GitHub-Follow-black?style=for-the-badge&logo=github)](https://github.com/Tencent-Hunyuan) [![Twitter](https://img.shields.io/badge/Twitter-Follow-blue?style=for-the-badge&logo=twitter)](https://twitter.com/TencentHunyuan) [![Hunyuan](https://img.shields.io/badge/Website-HunyuanAI-green?style=for-the-badge&logo=hunyuan)](https://hunyuan.tencent.com/)

© 2025 Tencent Hunyuan. All rights reserved. | Made with ❤️ for the AI community