nosenseni / diff-svc

Singing voice conversion via a diffusion model.

Singing voice conversion (SVC) is a promising technique that can enrich human-computer interaction by endowing a computer with the ability to produce high-fidelity, expressive singing voice. In this paper, we propose DiffSVC, an SVC system based on the denoising diffusion probabilistic model. DiffSVC uses phonetic posteriorgrams (PPGs) as content features. A denoising module is trained in DiffSVC; it takes the corrupted mel spectrogram produced by the diffusion/forward process, together with the corresponding step information, as input and predicts the added Gaussian noise. PPGs, fundamental frequency (F0) features and loudness features serve as auxiliary inputs to assist the denoising process. Experiments show that DiffSVC achieves superior conversion performance, in terms of both naturalness and voice similarity, over current state-of-the-art SVC approaches.
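
To make the training procedure concrete, here is a minimal PyTorch sketch of the simplified DDPM objective the abstract describes: corrupt the clean mel spectrogram with the forward process, then train the denoiser to predict the added noise given the step index and the auxiliary features. The `denoiser` callable, its keyword arguments, and the tensor shapes are assumptions for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, mel, ppg, f0, loudness, alpha_bars):
    """One simplified DDPM training step (hypothetical sketch).

    `denoiser` is any network mapping (noisy mel, step index, conditioning)
    to a noise estimate; its architecture is left unspecified.
    `alpha_bars` is a 1-D tensor of cumulative noise-schedule products.
    `mel` is assumed to have shape (batch, n_mels, frames).
    """
    batch = mel.size(0)
    # Sample a diffusion step t uniformly at random for each example.
    t = torch.randint(0, len(alpha_bars), (batch,), device=mel.device)
    a_bar = alpha_bars[t].view(batch, 1, 1)

    # Forward (diffusion) process: corrupt the clean mel spectrogram,
    # x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps, eps ~ N(0, I).
    eps = torch.randn_like(mel)
    noisy_mel = a_bar.sqrt() * mel + (1.0 - a_bar).sqrt() * eps

    # The denoising module predicts the added Gaussian noise, conditioned
    # on the step index and the auxiliary content/prosody features.
    eps_hat = denoiser(noisy_mel, t, ppg=ppg, f0=f0, loudness=loudness)

    # Standard simplified DDPM objective: L2 between true and predicted noise.
    return F.mse_loss(eps_hat, eps)
```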

Singing plays an important role in daily human life, serving information transmission, emotional expression and entertainment. Singing voice conversion (SVC) aims to convert the voice in a singing signal to that of a target singer without changing the underlying content and melody. Endowing machines with the ability to produce high-fidelity, expressive singing voice opens new avenues for human-computer interaction, and SVC is one possible way to achieve this.

Most recent SVC systems train a content encoder to extract content features from a source singing signal and a conversion model to transform those content features into either acoustic features or a waveform. One class of SVC approaches jointly trains the content encoder and the conversion model as an auto-encoder [1, 2]. Another class trains them separately: an automatic speech recognition (ASR) model serves as the content encoder, which can be an end-to-end model, as in [3, 4], or a hybrid HMM-DNN model, as in [5]. The conversion model can be the generator of a generative adversarial network (GAN) [3, 4], which generates a waveform directly from content features, or a regression model, which transforms content features into spectral features (e.g., mel spectrograms) and relies on a separately trained neural vocoder to generate the waveform. In this paper, we focus on the latter class and introduce the recently emerged diffusion probabilistic model into the conversion model.
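
As a rough illustration of the two-stage pipeline described above, the sketch below wires an ASR content encoder, a regression conversion model, and a separately trained neural vocoder into a single conversion call. Every callable name here is a hypothetical placeholder, not this repository's API.

```python
import torch

@torch.no_grad()
def convert(source_wav, asr_encoder, conversion_model, vocoder,
            extract_f0, extract_loudness):
    """Two-stage SVC inference (hypothetical sketch).

    All five callables are placeholders: an ASR content encoder producing
    PPGs, a regression conversion model producing mel spectrograms, a
    separately trained neural vocoder, and F0/loudness extractors.
    """
    # Stage 1: speaker-independent content features from the source singing.
    ppg = asr_encoder(source_wav)            # phonetic posteriorgrams
    f0 = extract_f0(source_wav)              # melody is carried over unchanged
    loudness = extract_loudness(source_wav)

    # Stage 2: regress from content/prosody features to target-voice mel
    # spectrograms, then synthesize the waveform with the neural vocoder.
    mel = conversion_model(ppg, f0=f0, loudness=loudness)
    return vocoder(mel)
```

Note that keeping F0 and loudness from the source signal is what preserves the melody, while the PPGs carry the phonetic content independent of the source singer's voice.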