Change voice for spoken text

Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 2 seconds, although the prediction time varies significantly with the inputs.


Given a recording of spoken text, this model changes the voice of the speaker to that of another person.

Model description

This model adopts the end-to-end framework of VITS for high-quality waveform reconstruction and proposes strategies for extracting clean content information without text annotation. It disentangles content information by imposing an information bottleneck on WavLM features, and it proposes spectrogram-resize-based data augmentation to improve the purity of the extracted content information.
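The spectrogram-resize idea can be illustrated with a minimal NumPy sketch. It is not the model's actual implementation: the function name `vertical_spectrogram_resize`, the `(n_mels, n_frames)` layout, the linear interpolation, and the padding strategy (repeating the top frequency bin, optionally with Gaussian noise) are all assumptions made for illustration. The sketch stretches or squeezes a mel-spectrogram along the frequency axis and then crops or pads it back to its original number of bins, which distorts speaker-dependent cues while leaving the timing of the content unchanged.

```python
import numpy as np


def vertical_spectrogram_resize(mel, ratio, noise_std=0.0, rng=None):
    """Resize a (n_mels, n_frames) mel-spectrogram along the frequency axis.

    Hypothetical helper for illustration only. After resizing by `ratio`,
    the result is cropped (ratio > 1) or padded (ratio < 1) so the output
    has the same shape as the input.
    """
    n_mels, n_frames = mel.shape
    new_bins = max(1, int(round(n_mels * ratio)))

    # Linear interpolation along frequency, one frame (column) at a time.
    src = np.arange(n_mels)
    dst = np.linspace(0, n_mels - 1, new_bins)
    resized = np.stack(
        [np.interp(dst, src, mel[:, t]) for t in range(n_frames)], axis=1
    )

    if new_bins >= n_mels:
        # Stretched: drop the surplus high-frequency bins.
        return resized[:n_mels]

    # Squeezed: pad the top by repeating the highest bin (assumed strategy),
    # optionally perturbed with Gaussian noise.
    pad = np.repeat(resized[-1:, :], n_mels - new_bins, axis=0)
    if noise_std > 0:
        if rng is None:
            rng = np.random.default_rng()
        pad = pad + rng.normal(0.0, noise_std, size=pad.shape)
    return np.concatenate([resized, pad], axis=0)


# Usage: both outputs keep the input shape (80 mel bins, 100 frames).
mel = np.random.default_rng(0).random((80, 100))
squeezed = vertical_spectrogram_resize(mel, 0.9)
stretched = vertical_spectrogram_resize(mel, 1.1)
print(squeezed.shape, stretched.shape)
```

In a full augmentation pipeline the resized spectrogram would presumably be turned back into a waveform by a vocoder before content features are extracted; that step is omitted here.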

Model types

There are three model types available:

  1. FreeVC-s, the proposed model that uses a non-pretrained speaker encoder
  2. FreeVC, the proposed model that uses a pretrained speaker encoder
  3. FreeVC (24k), the same as FreeVC but at 24 kHz