Given any spoken utterance, convert the voice of the speaker to that of another person.
This model adopts the end-to-end framework of VITS for high-quality waveform reconstruction and proposes strategies for extracting clean content information without text annotation. It disentangles content information by imposing an information bottleneck on WavLM features, and it uses spectrogram-resize-based data augmentation to improve the purity of the extracted content information.
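The spectrogram-resize augmentation mentioned above can be illustrated with a minimal sketch. It assumes the augmentation stretches or squeezes a mel-spectrogram along the frequency axis and then crops or pads it back to the original number of bins. The function name `vertical_spectrogram_resize`, the ratio values, and the padding strategy are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np


def vertical_spectrogram_resize(mel, ratio, noise_std=0.0, rng=None):
    """Resize a mel-spectrogram along the frequency axis, keeping its shape.

    mel:       array of shape (n_bins, n_frames)
    ratio:     > 1 stretches the frequency axis, < 1 squeezes it (assumed convention)
    noise_std: standard deviation of optional Gaussian noise added to padded bins
    """
    n_bins, n_frames = mel.shape
    new_bins = max(1, int(round(n_bins * ratio)))

    # Sample the original frequency axis at new_bins evenly spaced positions,
    # using linear interpolation independently for every frame.
    src_pos = np.linspace(0, n_bins - 1, new_bins)
    orig_pos = np.arange(n_bins)
    resized = np.stack(
        [np.interp(src_pos, orig_pos, mel[:, t]) for t in range(n_frames)],
        axis=1,
    )

    if new_bins >= n_bins:
        # Stretched (or unchanged): cut the bins that no longer fit.
        return resized[:n_bins]

    # Squeezed: pad the top of the frequency axis by repeating the highest bin.
    pad = np.repeat(resized[-1:, :], n_bins - new_bins, axis=0)
    if noise_std > 0:
        rng = rng if rng is not None else np.random.default_rng()
        pad = pad + rng.normal(0.0, noise_std, size=pad.shape)
    return np.concatenate([resized, pad], axis=0)


# Example: an 80-bin, 5-frame dummy spectrogram keeps its shape after resizing.
mel = np.arange(80 * 5, dtype=float).reshape(80, 5)
squeezed = vertical_spectrogram_resize(mel, ratio=0.85)
stretched = vertical_spectrogram_resize(mel, ratio=1.15)
```

Under these assumptions, resizing shifts formant and pitch positions (speaker-related cues) while leaving the timing of the content untouched, which is the intuition behind using it to make extracted content features less speaker-dependent.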
There are three model types available:
1. FreeVC-s, the proposed model with a non-pretrained speaker encoder
2. FreeVC, the proposed model with a pretrained speaker encoder
3. FreeVC (24k), the same as FreeVC but at 24 kHz