Realistic Voice Cloning (RVC) is a voice-to-voice model that can transform any input voice into a target voice.
Here’s an example of Morgan Freeman as Hannibal Lecter:
You can try it out with some pre-trained voices here.
In this blog post we’ll show you how to create your own RVC voice model for whatever voice you want. We’ll create a dataset, tune the model, then make some examples, all using Replicate.
At a high level, the process is:

1. Build a dataset of audio clips of the voice you want to clone.
2. Train an RVC model on that dataset.
3. Run the trained model to convert new audio into the cloned voice.
To follow this guide, you’ll need:

- A Replicate account (and an API token if you want to use the API)
- A YouTube video featuring the voice you want to clone
You can run all the models in this guide using Replicate's web interface, or using Replicate's API with the programming language of your choice. We have official client libraries for JavaScript, Python, and other languages like Go and Swift.
We've also created a Google Colab Notebook that contains all the code you need for this guide.
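If you'd rather script the steps than click through the web UI, here's a minimal Python setup sketch that the later snippets in this post assume (the install and token steps are shown as comments; the token comes from your Replicate account settings):

```python
# One-time setup for the Python snippets below (a sketch):
#   pip install replicate
#   export REPLICATE_API_TOKEN=r8_...   # token from your Replicate account settings

import replicate

# Each step in this guide is then a single call of the form:
#   output = replicate.run("<owner>/<model>", input={...})
# where the output is a URL (or file object, depending on client version)
# pointing at the model's result.
```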
The first step in voice model training is constructing a quality dataset of audio files. You can do this manually by collecting your own speech-based audio files, but it's time-consuming and error prone.
To simplify the process of creating a training dataset, we’ve built a model at zsxkib/create-rvc-dataset that will automatically generate a dataset from a YouTube video URL.
Running the model will:

- Pull the audio track from your YouTube video
- Split it into short clips and save them as `.wav` files
- Package the clips into a zip file ready for training
To run this model, you'll need to provide:

- `youtube_url`: a link to the YouTube video containing the voice you're aiming to clone
- `audio_name`: a unique name for your dataset

Inside the output zip file you'll find a `dataset/<audio_name>/` folder packed with `.wav` files, each tagged as `split_<i>.wav`. Take a listen to some of them to check your training data is as expected.
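If you're using the API, the dataset step might look something like this with the Python client (a sketch; the YouTube URL and dataset name are placeholders, and you may need to pin the exact model version shown on the model page):

```python
import replicate

# Build a training dataset from a YouTube video of the target voice.
dataset_zip = replicate.run(
    "zsxkib/create-rvc-dataset",  # append ":<version-id>" if your client needs a pinned version
    input={
        "youtube_url": "https://www.youtube.com/watch?v=XXXXXXXXXXX",  # placeholder: video of the voice to clone
        "audio_name": "my-voice",                                      # placeholder: unique dataset name
    },
)

# The result is a zip of split_<i>.wav files under dataset/my-voice/.
print(dataset_zip)
```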
The next step is training the RVC model with your dataset. You’ll use the replicate/train-rvc-model model to do this.
To start training, you'll need to provide:

- `dataset_zip`: the URL or direct upload of your dataset zip file
- `sample_rate`: the audio sampling rate; usually it's 48k
- `version`: the RVC version to use; v2 is higher quality
- `f0method`: the method used to extract the speech's pitch (fundamental frequency), with `rmvpe_gpu` as the default
- `epoch`: the number of complete passes over the training data. Set this to 80 for best results.
- `batch_size`: the number of data points processed in each step. We recommend setting this to 7.
After running this, you’ll get a zip file containing your newly trained RVC model.
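Via the API, that training call might look roughly like this (a sketch; the dataset URL is a placeholder for the zip from the previous step, and exact option strings such as the sample-rate format are worth double-checking on the model page):

```python
import replicate

# Train an RVC model on the dataset produced in the previous step.
trained_model_zip = replicate.run(
    "replicate/train-rvc-model",  # append ":<version-id>" if needed
    input={
        "dataset_zip": "https://.../my-voice-dataset.zip",  # placeholder: URL of your dataset zip
        "sample_rate": "48k",       # audio sampling rate (check the model page for the exact option format)
        "version": "v2",            # v2 is higher quality
        "f0method": "rmvpe_gpu",    # default method for extracting the speech's pitch
        "epoch": 80,                # complete passes over the training data
        "batch_size": 7,            # data points processed per step
    },
)

# The output is a zip containing your newly trained RVC model.
print(trained_model_zip)
```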
With your RVC model now fine-tuned, the final step is to run it using the zsxkib/realistic-voice-cloning model. To set it up:

- In the `rvc_model` field, select `CUSTOM`
- Set `custom_rvc_model_download_url` to the URL of your trained model

`pitch_change` can change voices from male to female, and vice versa. Experiment with `index_rate`, `reverb_size`, and `pitch_change` to control aspects of the AI's voice in the final output. The right combination will give you the most natural-sounding voice.

At this point you should have a reusable clone of your chosen voice. You can use it to create new audio files, bedtime stories, or even songs.
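If you're driving this last step through the API, the final conversion might look something like the sketch below. The model name, `rvc_model`, `custom_rvc_model_download_url`, `pitch_change`, `index_rate`, and `reverb_size` are the fields mentioned in this post; the audio-input field name (`song_input` here), the example values, and the URLs are assumptions to verify against the model page.

```python
import replicate

# Convert an input recording into the cloned voice using your trained RVC model.
output = replicate.run(
    "zsxkib/realistic-voice-cloning",  # append ":<version-id>" if needed
    input={
        "song_input": "https://example.com/speech-to-convert.mp3",  # assumed field name; placeholder audio URL
        "rvc_model": "CUSTOM",
        "custom_rvc_model_download_url": "https://.../my-voice-rvc.zip",  # placeholder: URL of your trained model zip
        "pitch_change": "no-change",  # example value; the post notes this can shift male/female — check accepted values
        "index_rate": 0.5,            # example value; tune to control how the cloned voice sounds
        "reverb_size": 0.15,          # example value; amount of room ambience
    },
)

# The output points at the converted audio in the cloned voice.
print(output)
```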