Clone your voice using open-source models

Posted by @zsxkib and @fofr

Realistic Voice Cloning (RVC) is a voice-to-voice model that can transform any input voice into a target voice.

Here’s an example of Morgan Freeman as Hannibal Lecter:

You can try it out with some pre-trained voices here.

In this blog post we’ll show you how to create your own RVC voice model on whatever voice you want. We’ll create a dataset, tune the model, then make some examples, all using Replicate.

At a high level, the process is:

  1. Create a training dataset: Use the zsxkib/create-rvc-dataset model to generate a dataset of speech audio files from a YouTube video URL.
  2. Train your voice model: Use the replicate/train-rvc-model model to create a fine-tuned RVC model based on your dataset.
  3. Run inference: Finally, use the zsxkib/realistic-voice-cloning model to create new speech audio (or even songs) featuring your voice.

Prerequisites

To follow this guide, you’ll need:

  • A Replicate account

Step 0: Set up your environment

You can run all the models in this guide using Replicate’s web interface, or using Replicate’s API with the programming language of your choice. We have official client libraries for JavaScript, Python, and other languages like Go, Swift, and Elixir.

We’ve also created a Google Colab Notebook that contains all the code you need for this guide.

Step 1: Create a training dataset

The first step in voice model training is constructing a quality dataset of audio files. You can do this manually by collecting your own speech-based audio files, but that’s time-consuming and error-prone.

To simplify the process of creating a training dataset, we’ve built a model at zsxkib/create-rvc-dataset that will automatically generate a dataset from a YouTube video URL.

Running the model will:

  • download the YouTube audio
  • isolate the target voice and remove background noise or music
  • split the audio into 10-second chunks
  • return a zip file of the samples to use for tuning

To run this model, you’ll need to provide:

  • youtube_url: a link to the YouTube video containing the voice you’re aiming to clone
  • audio_name: a unique name for your dataset

Inside the output zip file you’ll find a dataset/<audio_name>/ folder packed with .wav files, each named split_<i>.wav. Listen to a few of them to check that your training data sounds as expected.
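If you’re using the API, the dataset step above can be sketched with the Replicate Python client. This is a minimal sketch, not a definitive implementation: the YouTube URL is a placeholder, and `<version-id>` stands in for a pinned model version you’d copy from the model’s page on Replicate. The API call is guarded so the sketch runs without a token configured.

```python
# Sketch: build a training dataset from a YouTube video with
# zsxkib/create-rvc-dataset, via the Replicate Python client.
import os

dataset_input = {
    "youtube_url": "https://www.youtube.com/watch?v=XXXXXXXXXXX",  # placeholder URL
    "audio_name": "my-voice",  # a unique name for your dataset
}

# Only call the API when a token is configured.
if os.environ.get("REPLICATE_API_TOKEN"):
    import replicate

    output = replicate.run(
        "zsxkib/create-rvc-dataset:<version-id>",  # pin a version from the model page
        input=dataset_input,
    )
    print(output)  # URL of the dataset zip
```

The output URL points at the zip file of `.wav` chunks described above; save it, since you’ll pass it to the training step next.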

Step 2: Train your voice model

The next step is training the RVC model with your dataset. You’ll use the replicate/train-rvc-model model to do this.

To start training, you’ll need to provide:

  • dataset_zip: The URL (or direct upload) of your dataset zip file
  • sample_rate: The audio sample rate; 48k is typical
  • version: The RVC version to use; v2 gives higher quality
  • f0method: The method used to extract speech “formants”; rmvpe_gpu is the default
  • epoch: The number of complete passes over the training data; set this to 80 for best results
  • batch_size: The number of data points processed in each step; we recommend setting this to 7

After running this, you’ll get a zip file containing your newly trained RVC model.
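The training call can be sketched the same way. Again this is a hedged sketch: the dataset URL is a placeholder for the zip you got in Step 1, `<version-id>` is a pinned version from the model page, and the exact accepted values for fields like sample_rate are listed in the model’s input schema on Replicate.

```python
# Sketch: fine-tune an RVC model with replicate/train-rvc-model,
# using the parameters recommended above.
import os

training_input = {
    "dataset_zip": "https://example.com/dataset.zip",  # placeholder: zip URL from Step 1
    "sample_rate": "48k",       # typical audio sample rate
    "version": "v2",            # v2 is higher quality
    "f0method": "rmvpe_gpu",    # default formant-extraction method
    "epoch": 80,                # complete passes over the training data
    "batch_size": 7,            # data points processed per step
}

# Only call the API when a token is configured.
if os.environ.get("REPLICATE_API_TOKEN"):
    import replicate

    trained = replicate.run(
        "replicate/train-rvc-model:<version-id>",  # pin a version from the model page
        input=training_input,
    )
    print(trained)  # URL of the trained model zip
```

Keep the returned URL: it’s the custom model you’ll plug into the inference step.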

Step 3: Generate audio with your new voice model

With your RVC model now fine-tuned, the final step is to run it using the zsxkib/realistic-voice-cloning model.

  1. Upload your starting audio file (or pass in a URL via the API). This can be a song or a speech-only audio clip.
  2. In the rvc_model field, select CUSTOM
  3. Set custom_rvc_model_download_url to the URL of your trained model
  4. Configure additional parameters as needed to tweak your output. For example, pitch_change can shift a voice from male to female, and vice versa. Experiment with index_rate, reverb_size, and pitch_change to control aspects of the cloned voice in the final output; the right combination will give you the most natural-sounding result.
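Via the API, the inference step above looks like the sketch below. The input audio URL is a placeholder, and the name of the audio-input field (here assumed to be song_input) should be checked against the model’s input schema; rvc_model and custom_rvc_model_download_url are the fields named in the steps above.

```python
# Sketch: generate audio with your custom voice via
# zsxkib/realistic-voice-cloning.
import os

inference_input = {
    "song_input": "https://example.com/my-audio.mp3",  # placeholder; field name assumed
    "rvc_model": "CUSTOM",  # tells the model to load your own RVC model
    "custom_rvc_model_download_url": "https://example.com/model.zip",  # placeholder: zip URL from Step 2
    # Optional tuning knobs mentioned above:
    # pitch_change, index_rate, reverb_size
}

# Only call the API when a token is configured.
if os.environ.get("REPLICATE_API_TOKEN"):
    import replicate

    audio = replicate.run(
        "zsxkib/realistic-voice-cloning:<version-id>",  # pin a version from the model page
        input=inference_input,
    )
    print(audio)  # URL of the generated audio
```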

What’s next?

At this point you should have a reusable clone of your own voice. You can use it to create new audio files, bedtime stories, or even songs.

Have fun and let us know what you make on X and Discord.