adirik / hierspeechpp

Zero-shot speech synthesizer for text-to-speech and voice conversion

  • Public
  • 4.6K runs
  • GitHub
  • Paper
  • License

Input

input_text
string

Text input to the model. If provided, it will be used for the speech content of the output.

input_sound
file

Sound input to the model in .wav format. If provided, it will be used for the speech content of the output.

target_voice
file (required)

A voice clip in .wav format containing the speaker to synthesize.

denoise_ratio
number
(minimum: 0, maximum: 1)

Noise control: 0 means no noise reduction, 1 means maximum noise reduction. If noise reduction is desired, a value between 0.6 and 0.8 is recommended.

Default: 0

text_to_vector_temperature
number
(minimum: 0, maximum: 1)

Temperature for the text-to-vector model. A larger value yields slightly more random output.

Default: 0.33

number
(minimum: 0, maximum: 1)

Temperature for the voice conversion model. A larger value yields slightly more random output.

Default: 0.33

output_sample_rate
integer

Sample rate of the output audio file.

Default: 16000

scale_output_volume
boolean

Scale normalization. If set to true, the output audio is scaled to match the input sound, if provided.

Default: false

seed
integer

Random seed to use for reproducibility.
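
For reference, here is a minimal sketch of how these inputs combine in the two supported modes, using the parameter names listed in the Readme below (file values shown as local paths for brevity):

# Text-to-speech: speech content comes from text
{"input_text": "Hello there!", "target_voice": "speaker.wav"}

# Voice conversion: speech content comes from reference audio
{"input_sound": "content.wav", "target_voice": "speaker.wav"}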

Output

(Audio player: example output. This output was created using a different version of the model, adirik/hierspeechpp:4e41ecdd.)

Run time and cost

This model costs approximately $0.058 to run on Replicate, or 17 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 60 seconds, though predict time varies significantly with the inputs.

Readme

HierSpeech++

HierSpeech++ is a zero-shot speech synthesis model: given text (or reference speech) and a short clip of a target voice, it generates speech in that voice. See the original repository and paper for details.

API Usage

To use the model, provide the text you would like to synthesize and a sound file of your target voice as input. Alternatively, provide reference speech (.mp3 or .wav) instead of text to supply the speech content. The API returns an .mp3 file with the generated speech; see the usage sketch below.

Input parameters are as follows:
- input_text: (optional) text input to the model. If provided, it will be used for the speech content of the output.
- input_sound: (optional) sound input to the model. If provided, it will be used for the speech content of the output.
- target_voice: a voice clip containing the speaker to synthesize.
- denoise_ratio: noise control. 0 means no noise reduction, 1 means maximum noise reduction. If noise reduction is desired, a value between 0.6 and 0.8 is recommended.
- text_to_vector_temperature: temperature for the text-to-vector model. A larger value yields slightly more random output.
- output_sample_rate: sample rate of the output audio file.
- scale_output_volume: scale normalization. If set to true, the output audio will be scaled according to the input sound if provided.
- seed: random seed to use for reproducibility.
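
For example, here is a minimal sketch of a text-to-speech call using the Replicate Python client. It assumes the replicate package is installed and REPLICATE_API_TOKEN is set in your environment; the version hash is a placeholder, so copy the current one from this page before running.

import replicate

output = replicate.run(
    "adirik/hierspeechpp:<version-hash>",  # placeholder: use the current version hash
    input={
        "input_text": "Hello! This is a zero-shot synthesis test.",
        "target_voice": open("speaker.wav", "rb"),  # short clip of the voice to clone
        "denoise_ratio": 0.7,          # 0.6-0.8 recommended when denoising
        "output_sample_rate": 16000,
        "seed": 42,
    },
)
print(output)  # the generated audio (a URL or file object, depending on client version)

For voice conversion instead of text-to-speech, omit input_text and pass input_sound with the reference speech to convert.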

References

@article{Lee2023HierSpeechBT,
  title={HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis},
  author={Sang-Hoon Lee and Haram Choi and Seung-Bin Kim and Seong-Whan Lee},
  journal={ArXiv},
  year={2023},
  volume={abs/2311.12454},
  url={https://api.semanticscholar.org/CorpusID:265308903}
}