Clone your voice using open-source models
Realistic Voice Cloning (RVC) is a voice-to-voice model that can transform any input voice into a target voice.
Here’s an example of Morgan Freeman as Hannibal Lecter:
You can try it out with some pre-trained voices here.
In this blog post we’ll show you how to create your own RVC voice model on whatever voice you want. We’ll create a dataset, tune the model, then make some examples, all using Replicate.
At a high level, the process is:
- Create a training dataset: Use the zsxkib/create-rvc-dataset model to generate a dataset of speech audio files from a YouTube video URL.
- Train your voice model: Use the replicate/train-rvc-model model to create a fine-tuned RVC model based on your dataset.
- Run inference: Finally, use the zsxkib/realistic-voice-cloning model to create new speech audio (or even songs) featuring your voice.
Prerequisites
To follow this guide, you’ll need:
- a YouTube video to use as the source of your audio
- a Replicate account and API token
Step 0: Set up your environment
You can run all the models in this guide using Replicate’s web interface, or using Replicate’s API with the programming language of your choice. We have official client libraries for JavaScript, Python, and other languages like Go and Swift.
We’ve also created a Google Colab Notebook that contains all the code you need for this guide.
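If you go the API route with Python, the official client reads your token from the REPLICATE_API_TOKEN environment variable. Here is a minimal sketch of a check you could run before anything else; the helper name get_api_token is ours, not part of the client library.

```python
import os


def get_api_token(env=os.environ):
    """Return the Replicate API token from the environment, or raise a clear error.

    get_api_token is a hypothetical helper for this guide; the Replicate client
    itself only needs REPLICATE_API_TOKEN to be set.
    """
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "Set REPLICATE_API_TOKEN before running the examples in this guide."
        )
    return token
```

Calling get_api_token() at the top of a script fails fast with a readable message instead of an authentication error halfway through a run.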
Step 1: Create a training dataset
The first step in voice model training is constructing a quality dataset of audio files. You can do this manually by collecting your own speech-based audio files, but that's time-consuming and error-prone.
To simplify the process of creating a training dataset, we’ve built a model at zsxkib/create-rvc-dataset that will automatically generate a dataset from a YouTube video URL.
Running the model will:
- download the YouTube audio
- isolate the target voice and remove background noise or music
- split the audio into 10 second chunks
- return a zip file of the samples to use for tuning
To run this model, you’ll need to provide:
- youtube_url: a link to the YouTube video containing the voice you're aiming to clone
- audio_name: a unique name for your dataset
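With the Python client, those two inputs could be passed along roughly like this. Treat it as a sketch: build_dataset_input and create_dataset are hypothetical helper names, and the version id in model_ref is a placeholder you'd copy from the model page on Replicate.

```python
def build_dataset_input(youtube_url: str, audio_name: str) -> dict:
    """Build the input payload for the dataset model from the two fields above."""
    if not youtube_url.startswith(("http://", "https://")):
        raise ValueError("youtube_url must be a full http(s) link")
    if not audio_name.strip():
        raise ValueError("audio_name must not be empty")
    return {"youtube_url": youtube_url, "audio_name": audio_name.strip()}


def create_dataset(model_ref: str, youtube_url: str, audio_name: str):
    """Run the dataset model on Replicate and return whatever it outputs.

    model_ref is expected to look like "zsxkib/create-rvc-dataset:<version-id>";
    the version id is a placeholder, not a real value.
    """
    # Imported lazily so the payload helper works without the client installed.
    # Needs `pip install replicate` and REPLICATE_API_TOKEN in the environment.
    import replicate

    return replicate.run(model_ref, input=build_dataset_input(youtube_url, audio_name))
```

Keeping the payload in its own function means you can check your inputs locally before spending any compute.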
Inside the output zip file you'll find a dataset/<audio_name>/ folder packed with .wav files, each named split_<i>.wav. Listen to a few of them to check that your training data is as expected.
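Before listening, you can do a quick sanity check on the zip layout. The sketch below uses only Python's standard library and assumes <i> is an integer index; list_dataset_clips is a hypothetical helper name.

```python
import re
import zipfile


def list_dataset_clips(zip_file, audio_name: str) -> list:
    """Return the split_<i>.wav entries under dataset/<audio_name>/, sorted by index.

    zip_file can be a path or a file-like object. We assume <i> is an integer.
    """
    pattern = re.compile(rf"^dataset/{re.escape(audio_name)}/split_(\d+)\.wav$")
    clips = []
    with zipfile.ZipFile(zip_file) as zf:
        for name in zf.namelist():
            match = pattern.match(name)
            if match:
                clips.append((int(match.group(1)), name))
    # Sort numerically so split_2.wav comes before split_10.wav.
    return [name for _, name in sorted(clips)]
```

An empty result usually means the audio_name you passed here doesn't match the one you used when creating the dataset.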
Step 2: Train your voice model
The next step is training the RVC model with your dataset. You’ll use the replicate/train-rvc-model model to do this.
To start training, you'll need to provide:
- dataset_zip: the URL or direct upload of your dataset zip file
- sample_rate: the audio sampling rate, usually 48k
- version: the RVC version to use; v2 is higher quality
- f0method: the method used to extract the fundamental frequency (pitch) of the speech, with rmvpe_gpu as the default
- epoch: the number of complete passes over the training data. Set this to 80 for best results.
- batch_size: the number of data points processed in each step. We recommend setting this to 7.
After running this, you’ll get a zip file containing your newly trained RVC model.
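Here's a sketch of the training inputs as a Python payload, using the values recommended above as defaults. build_training_input is a hypothetical helper, and the exact value types (for example, whether batch_size is sent as a number or a string) are an assumption; check the model's API schema on Replicate before relying on them.

```python
def build_training_input(
    dataset_zip: str,
    sample_rate: str = "48k",
    version: str = "v2",
    f0method: str = "rmvpe_gpu",
    epoch: int = 80,
    batch_size: int = 7,
) -> dict:
    """Build the input payload for the training model.

    Defaults mirror the recommendations in this guide. Value types are assumed;
    verify them against the model's input schema.
    """
    if epoch < 1:
        raise ValueError("epoch must be at least 1")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return {
        "dataset_zip": dataset_zip,
        "sample_rate": sample_rate,
        "version": version,
        "f0method": f0method,
        "epoch": epoch,
        "batch_size": batch_size,
    }


# Usage (needs the replicate package, an API token, and a real version id):
#   import replicate
#   output = replicate.run("replicate/train-rvc-model:<version-id>",
#                          input=build_training_input("https://example.com/dataset.zip"))
```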
Step 3: Generate audio with your new voice model
With your RVC model now fine-tuned, the final step is to run it using the zsxkib/realistic-voice-cloning model.
- Upload your starting audio file (or pass in a URL via the API). This can be a song or a speech-only audio clip.
- In the rvc_model field, select CUSTOM
- Set custom_rvc_model_download_url to the URL of your trained model
- Configure additional parameters as needed to tweak your output. For example, pitch_change can change voices from male to female, and vice versa. Experiment with index_rate, reverb_size, and pitch_change to control aspects of the AI's voice in the final output. The right combination will give you the most natural-sounding voice.
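If you're calling the model through the API, the settings above might be assembled like this. It's a sketch under a few assumptions: build_inference_input is a hypothetical helper, and because this guide doesn't name the audio input field, the helper takes that field name as an argument; look it up in the model's API schema.

```python
def build_inference_input(
    model_url: str,
    audio_field: str,
    audio_url: str,
    pitch_change=None,
    index_rate=None,
    reverb_size=None,
) -> dict:
    """Build the input payload for running a custom RVC model.

    audio_field is the name of the model's audio input; it is not given in this
    guide, so pass the name shown in the model's API schema.
    """
    payload = {
        audio_field: audio_url,
        "rvc_model": "CUSTOM",
        "custom_rvc_model_download_url": model_url,
    }
    # Only send the tweaks you set, so the model's own defaults apply otherwise.
    optional = {
        "pitch_change": pitch_change,
        "index_rate": index_rate,
        "reverb_size": reverb_size,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

Leaving unset parameters out of the payload makes it easier to change one setting at a time and hear what it does.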
What’s next?
At this point you should now have a reusable clone of your own voice. You can use it to create new audio files, bedtime stories, or even songs.