wendison / vqmivc

One-shot (any-to-any) Voice Conversion

  • Public
  • 6.3K runs
  • GitHub
  • Paper
  • License

This model can't be run on Replicate because it was built with a version of Cog or Python that is no longer supported. Consider opening an issue on the model's GitHub repository to see if it can be updated to a recent version of Cog.

Run time and cost

This model costs approximately $0.0020 per run on Replicate, or about 500 runs per $1, though the exact cost varies with your inputs. It is also open source, so you can run it on your own computer with Docker.

This model runs on CPU hardware. Predictions typically complete within 21 seconds, but predict time varies significantly with the inputs.

Readme

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion (Interspeech 2021)

This paper proposes a speech representation disentanglement framework for one-shot (any-to-any) voice conversion, which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference. Vector quantization with contrastive predictive coding (VQCPC) is used for content encoding, and mutual information (MI) is introduced as the correlation metric during training. Reducing the inter-dependencies of the content, speaker, and pitch representations in an unsupervised manner achieves their proper disentanglement.
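The MI term is the distinctive part of the training objective. Below is a minimal sketch of a CLUB-style MI upper bound between two embeddings, hypothetically named z_c (content) and z_s (speaker), with illustrative dimensions; it is an assumption-laden illustration of the technique, not the authors' implementation:

import torch
import torch.nn as nn

class CLUBEstimator(nn.Module):
    # Variational network q(z_s | z_c): a diagonal Gaussian with
    # learned mean and log-variance (names/sizes are illustrative).
    def __init__(self, c_dim, s_dim, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(c_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, s_dim))
        self.logvar = nn.Sequential(nn.Linear(c_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, s_dim))

    def log_likelihood(self, z_c, z_s):
        # Gaussian log-likelihood up to an additive constant,
        # which cancels in the CLUB difference below.
        mu, logvar = self.mu(z_c), self.logvar(z_c)
        return (-0.5 * (logvar + (z_s - mu) ** 2 / logvar.exp())).sum(-1)

    def mi_upper_bound(self, z_c, z_s):
        # CLUB: E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)];
        # negative pairs come from shuffling z_s within the batch.
        positive = self.log_likelihood(z_c, z_s)
        negative = self.log_likelihood(z_c, z_s[torch.randperm(z_s.size(0))])
        return (positive - negative).mean()

estimator = CLUBEstimator(c_dim=64, s_dim=128)   # dimensions are illustrative
z_c, z_s = torch.randn(8, 64), torch.randn(8, 128)
loss_mi = estimator.mi_upper_bound(z_c, z_s)     # minimized by the encoders

In a setup like this, the estimator is fit by maximum likelihood on paired embeddings, while the encoders are updated to minimize the bound, pushing the representations toward independence.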

Citation

If you use this code in your research, please star our repo and cite our paper:

@article{wang2021vqmivc,
  title={VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion},
  author={Wang, Disong and Deng, Liqun and Yeung, Yu Ting and Chen, Xiao and Liu, Xunying and Meng, Helen},
  journal={arXiv preprint arXiv:2106.10132},
  year={2021}
}

Acknowledgements:

  • The content encoder is borrowed from VectorQuantizedCPC, which also inspired the within-utterance negative sampling for CPC;
  • The speaker encoder is borrowed from AdaIN-VC;
  • The decoder is modified from AutoVC;
  • Estimation of mutual information is modified from CLUB;
  • Speech feature extraction is based on espnet and Pyworld (a sketch of Pyworld-based pitch extraction follows this list).
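As an illustration of the last point, here is a minimal sketch of F0 (pitch) extraction with pyworld; the file name and frame period are assumptions, and this is a common pyworld recipe rather than the repo's exact feature pipeline:

import numpy as np
import soundfile as sf
import pyworld

wav, sr = sf.read("reference.wav")   # hypothetical input file
wav = wav.astype(np.float64)         # pyworld expects float64 samples

# Coarse F0 with DIO, then refinement with StoneMask.
f0, timeaxis = pyworld.dio(wav, sr, frame_period=10.0)
f0 = pyworld.stonemask(wav, f0, timeaxis, sr)

# Log-scale voiced frames; unvoiced frames (f0 == 0) stay zero.
voiced = f0 > 0
log_f0 = np.zeros_like(f0)
log_f0[voiced] = np.log(f0[voiced])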