Audio🔹Separator
A powerful audio separation tool that extracts vocals and instrumental tracks from audio files using advanced MDX-Net models. Built with both a Gradio web interface and Replicate API support.
Features
🎵 Dual Stem Separation
- Extract vocals or instrumental tracks from any audio file
- High-quality separation powered by MDX-Net models
🎚️ Audio Effects Processing
- Apply professional vocal effects including reverb, compression, and EQ
- Customizable effect parameters for instrumental tracks
- Independent effect chains for vocals and background
💻 Multiple Interfaces
- Web-based Gradio interface for easy local use
- Replicate API integration for programmatic access
- Command-line support via Cog
🔧 Format Support
- Input: MP3, WAV, FLAC, and other common formats
- Output: WAV or MP3
- Automatic stereo conversion and normalization
⚡ Performance
- GPU acceleration with CUDA 12.1 support
- CPU fallback for systems without GPU
- Optimized model inference with ONNX Runtime
Quick Start
Prerequisites
- Python 3.11+
- FFmpeg
- CUDA 12.1 (optional, for GPU acceleration)
- PyTorch 2.5.1+
Installation
1. Clone the repository
git clone https://huggingface.co/spaces/r3gm/Audio_separator
cd Audio_separator
2. Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
3. Install dependencies
pip install -r requirements.txt
Usage
Web Interface (Gradio)
python app.py
Then open your browser to http://localhost:7860
Replicate API (Cog)
Local testing:
cog predict -i audio=@your_audio.mp3 -i extract_vocals=true -i output_format=wav
Build and deploy:
cog build
cog push r8.im/your-username/audio-separator
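Once the model is pushed, it can also be called from Python. The sketch below assumes the `replicate` client package is installed and the `REPLICATE_API_TOKEN` environment variable is set. The helper names `build_inputs` and `separate` are illustrative and not part of this project, and the model reference is the placeholder from the push command above; Replicate may require a `:version` suffix on it.

```python
VALID_FORMATS = ("wav", "mp3")


def build_inputs(extract_vocals=True, output_format="wav"):
    """Validate the options and return the non-file inputs for the model."""
    if output_format not in VALID_FORMATS:
        raise ValueError(f"output_format must be one of {VALID_FORMATS}")
    return {"extract_vocals": bool(extract_vocals), "output_format": output_format}


def separate(audio_path, model_ref="your-username/audio-separator", **options):
    """Run the pushed model on Replicate (hypothetical helper)."""
    # Imported lazily so build_inputs can be used without the client installed.
    import replicate

    inputs = build_inputs(**options)
    with open(audio_path, "rb") as audio_file:
        # replicate.run uploads the file handle and waits for the prediction.
        return replicate.run(model_ref, input={"audio": audio_file, **inputs})
```

A call such as `separate("song.mp3", extract_vocals=False, output_format="mp3")` would request the instrumental track as MP3.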
API Reference
Cog Predict Endpoint
Inputs:
- audio (Path): Input audio file
- extract_vocals (bool):
  - true (default): Extract and process vocal track
  - false: Extract instrumental track
- output_format (str):
  - wav (default): Uncompressed WAV format
  - mp3: Compressed MP3 format
Output:
- Returns the separated audio file in the requested format
Example:
cog predict \
-i audio=@song.mp3 \
-i extract_vocals=true \
-i output_format=wav
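For batch jobs it can be convenient to drive the same command from a script. The following is a minimal sketch that assumes `cog` is on the PATH and that it is run from the repository root; `cog_predict_command` and `run_cog_predict` are hypothetical helper names, not part of this project.

```python
import shlex
import subprocess


def cog_predict_command(audio_path, extract_vocals=True, output_format="wav"):
    """Build the argument list for the cog predict call shown above."""
    if output_format not in ("wav", "mp3"):
        raise ValueError("output_format must be 'wav' or 'mp3'")
    return [
        "cog", "predict",
        "-i", f"audio=@{audio_path}",
        "-i", f"extract_vocals={'true' if extract_vocals else 'false'}",
        "-i", f"output_format={output_format}",
    ]


def run_cog_predict(audio_path, **options):
    """Run cog predict and raise CalledProcessError if it fails."""
    command = cog_predict_command(audio_path, **options)
    print("Running:", shlex.join(command))
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.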
Audio Effects
Vocal Effects (Applied when extract_vocals=true)
- Reverb: Room-like ambience (room_size: 0.15, damping: 0.7)
- Compressor: Dynamic range control (threshold: -15dB, ratio: 4.0)
- Gain: Volume normalization (0dB)
- Highpass Filter: Remove unwanted low frequencies
Instrumental Effects (Applied when extract_vocals=false)
- Highpass Filter: Remove very low frequencies
- Lowpass Filter: Clean up high frequencies
- Reverb: Add space and depth
- Compressor: Smooth dynamic response
- Gain: Volume adjustment
Architecture
Core Components
MDX Model (predict.py / app.py)
- ONNX-based neural network for stem separation
- Operates on 44.1kHz stereo audio
- Processes audio in chunks for memory efficiency
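To illustrate chunked processing, the sketch below splits a track into sample ranges that a model could handle one at a time. The chunk size and overlap are hypothetical values chosen for illustration; the actual chunking used by the MDX code in this project may differ.

```python
def chunk_ranges(total_samples, chunk_size, overlap=0):
    """Return (start, end) sample ranges that together cover total_samples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    ranges = []
    start = 0
    while start < total_samples:
        end = min(start + chunk_size, total_samples)
        ranges.append((start, end))
        if end == total_samples:
            break  # the last chunk reached the end of the track
        start += step
    return ranges


SAMPLE_RATE = 44100  # the model operates on 44.1kHz audio
ranges = chunk_ranges(3 * 60 * SAMPLE_RATE, 10 * SAMPLE_RATE)
```

With these illustrative numbers, a 3-minute track is split into 18 ten-second ranges, so only one chunk has to be held in GPU memory at a time.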
Audio Processing Pipeline
1. Load and normalize input audio
2. Convert to stereo WAV if needed
3. Run MDX separation model
4. Apply vocal or instrumental effects
5. Convert to requested output format
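Step 2 of the pipeline can be sketched with a direct FFmpeg call. This is an assumption about how the conversion could be done, not a copy of the project's code; the helper names are hypothetical, while `-y`, `-i`, `-ac`, and `-ar` are standard FFmpeg options.

```python
import subprocess


def ffmpeg_to_stereo_wav_command(src, dst, sample_rate=44100):
    """Build an FFmpeg command that writes src as a stereo WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),           # input file in any format FFmpeg can read
        "-ac", "2",               # force two channels (stereo)
        "-ar", str(sample_rate),  # resample to the target rate
        str(dst),
    ]


def convert_to_stereo_wav(src, dst, sample_rate=44100):
    """Run the conversion and raise CalledProcessError on failure."""
    subprocess.run(ffmpeg_to_stereo_wav_command(src, dst, sample_rate), check=True)
    return dst
```

The default of 44100 matches the sample rate the separation model operates on.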
Supported Models
- UVR-MDX-NET-Voc_FT.onnx: Vocal separation model (primary)
- Additional models are downloaded automatically from GitHub releases
Dependencies
- PyTorch: Deep learning framework
- ONNX Runtime: Model inference with GPU support
- Librosa: Audio analysis and I/O
- SoundFile: WAV file handling
- Pedalboard: Audio effects processing
- Gradio: Web interface
- FFmpeg: Format conversion
Configuration
Application Settings
All default parameters are configurable through Gradio sliders:
Vocal Effects:
- Reverb room size: 0.15
- Reverb damping: 0.7
- Reverb wet level: 0.2
- Compressor threshold: -15dB
- Compressor ratio: 4.0
- Compressor attack: 1.0ms
- Compressor release: 100ms
- Gain: 0dB
Instrumental Effects:
- Highpass filter: 80Hz
- Lowpass filter: 18000Hz
- Reverb room size: 0.3
- Reverb damping: 0.6
- Compressor threshold: -20dB
- Compressor ratio: 3.0
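The defaults above map naturally onto Pedalboard plugins. The sketch below is an illustration under assumptions: the dictionary keys and function names are hypothetical, the plugin order is a guess, and only parameters listed above are set (the vocal highpass cutoff and the instrumental gain are not listed, so they are left out). It requires the `pedalboard` package.

```python
VOCAL_DEFAULTS = {
    "reverb_room_size": 0.15, "reverb_damping": 0.7, "reverb_wet_level": 0.2,
    "compressor_threshold_db": -15.0, "compressor_ratio": 4.0,
    "compressor_attack_ms": 1.0, "compressor_release_ms": 100.0,
    "gain_db": 0.0,
}
INSTRUMENTAL_DEFAULTS = {
    "highpass_hz": 80.0, "lowpass_hz": 18000.0,
    "reverb_room_size": 0.3, "reverb_damping": 0.6,
    "compressor_threshold_db": -20.0, "compressor_ratio": 3.0,
}


def with_overrides(defaults, overrides=None):
    """Return a copy of defaults updated with slider-style overrides."""
    return {**defaults, **(overrides or {})}


def build_vocal_board(overrides=None):
    from pedalboard import Compressor, Gain, Pedalboard, Reverb  # lazy import

    p = with_overrides(VOCAL_DEFAULTS, overrides)
    return Pedalboard([
        Reverb(room_size=p["reverb_room_size"], damping=p["reverb_damping"],
               wet_level=p["reverb_wet_level"]),
        Compressor(threshold_db=p["compressor_threshold_db"], ratio=p["compressor_ratio"],
                   attack_ms=p["compressor_attack_ms"], release_ms=p["compressor_release_ms"]),
        Gain(gain_db=p["gain_db"]),
    ])


def build_instrumental_board(overrides=None):
    from pedalboard import Compressor, HighpassFilter, LowpassFilter, Pedalboard, Reverb

    p = with_overrides(INSTRUMENTAL_DEFAULTS, overrides)
    return Pedalboard([
        HighpassFilter(cutoff_frequency_hz=p["highpass_hz"]),
        LowpassFilter(cutoff_frequency_hz=p["lowpass_hz"]),
        Reverb(room_size=p["reverb_room_size"], damping=p["reverb_damping"]),
        Compressor(threshold_db=p["compressor_threshold_db"], ratio=p["compressor_ratio"]),
    ])
```

A board is applied by calling it with the samples and sample rate, for example `processed = build_vocal_board()(audio, 44100)`, where `audio` is a float32 NumPy array.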
Files
- app.py: Gradio web interface with full effect controls
- predict.py: Cog-compatible prediction endpoint for Replicate
- utils.py: Utility functions for file handling and logging
- cog.yaml: Cog configuration for containerized deployment
- requirements.txt: Python package dependencies
- packages.txt: System package dependencies
- pre-requirements.txt: Pre-installation requirements
Performance
GPU Requirements
- NVIDIA GPU with CUDA 12.1 support
- Minimum 4GB VRAM recommended
- Tested on A40, RTX 3090, RTX 4090
Processing Times (Approximate)
- 3-minute song: 15-30 seconds (GPU)
- 3-minute song: 2-5 minutes (CPU)
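The GPU path with CPU fallback can be expressed as an ONNX Runtime provider list. This is a minimal sketch assuming the `onnxruntime` (or `onnxruntime-gpu`) package; `pick_providers` and `load_session` are hypothetical helper names, and the project's own session setup may differ.

```python
def pick_providers(available):
    """Prefer CUDA when it is available and always keep CPU as the fallback."""
    providers = []
    if "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    providers.append("CPUExecutionProvider")
    return providers


def load_session(model_path):
    """Create an inference session for an ONNX model file."""
    # Imported lazily so pick_providers can be tested without onnxruntime.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

ONNX Runtime tries the providers in the order given, so listing the CPU provider last keeps the same code usable on machines without a GPU.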
Troubleshooting
FFmpeg not found
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Windows
choco install ffmpeg
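To confirm the fix, a short check can verify that FFmpeg is now visible to Python. This sketch uses only the standard library; `find_tool` and `require_ffmpeg` are illustrative names, not functions from this project.

```python
import shutil


def find_tool(name="ffmpeg"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def require_ffmpeg():
    """Return the FFmpeg path or fail early with a readable message."""
    path = find_tool("ffmpeg")
    if path is None:
        raise RuntimeError(
            "FFmpeg not found on PATH; install it with one of the commands above."
        )
    return path
```

Calling `require_ffmpeg()` at startup surfaces a missing install immediately, before any audio conversion is attempted.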
CUDA out of memory
- Reduce audio length or use CPU processing
- Close other GPU applications
Poor separation quality
- Ensure input audio is clear and centered
- Try with different audio sources
- Model works best with 44.1kHz stereo audio
Original Source
Based on the Hugging Face Space: r3gm/Audio_separator
Original repository: https://huggingface.co/spaces/r3gm/Audio_separator/tree/main
This project was adapted into a Replicate Cog using Claude with the following requirements:
- Simplified inputs (audio, extract_vocals, output_format)
- Replicate-compatible prediction interface
- Same defaults as the original app (reverb_room_size: 0.15, reverb_damping: 0.7)
License
MIT License - See LICENSE file for details
Citation
If you use this project in your research or work, please cite:
@misc{audio_separator,
title={Audio🔹Separator},
author={r3gm and contributors},
year={2024},
howpublished={\url{https://huggingface.co/spaces/r3gm/Audio_separator}}
}
Support
For issues, questions, or contributions:
1. Check existing issues on the GitHub repository
2. Create a new issue with a detailed description
3. Include sample audio and exact error messages
4. Specify your system configuration (GPU, OS, Python version)
Acknowledgments
- MDX-Net model architecture and weights
- Pedalboard for audio effects
- Librosa for audio processing
- Gradio for the web interface
- Replicate for Cog framework