cjwbw /voicecraft:1f110ca2
Input schema
The fields you can use to run this model with an API. If you don’t give a value for a field its default value will be used.
| Field | Type | Default value | Description |
|---|---|---|---|
| task | string (enum) | zero-shot text-to-speech | Choose a task. Options: speech_editing-substitution, speech_editing-insertion, speech_editing-deletion, zero-shot text-to-speech |
| voicecraft_model | string (enum) | giga330M_TTSEnhanced.pth | Choose a model. Options: giga830M.pth, giga330M.pth, giga330M_TTSEnhanced.pth |
| orig_audio | string | | Original audio file. The WhisperX small.en model will be used for transcription |
| orig_transcript | string | | Optionally provide the transcript of the input audio. Leave it blank to use the Whisper model below to generate the transcript. An inaccurate transcription may lead to errors in TTS or speech editing |
| whisper_model | string (enum) | whisper-base.en | If orig_transcript is not provided above, choose a Whisper or WhisperX model (WhisperX includes extra alignment steps). Options: whisper-base.en, whisper-small.en, whisper-medium.en, whisperx-base.en, whisperx-small.en, whisperx-medium.en. An inaccurate transcription may lead to errors in TTS or speech editing. You can modify the generated transcript and provide it directly as orig_transcript |
| target_transcript | string | | Transcript of the target audio file |
| cut_off_sec | number | 3.01 | Only used for the zero-shot text-to-speech task: the first seconds of the original audio used as the voice reference. 3 sec of reference is generally enough for high-quality voice cloning, but longer is generally better; try e.g. 3–6 sec |
| kvcache | integer | 1 | Set to 0 to use less VRAM, at the cost of slower inference |
| left_margin | number | 0.08 | Margin to the left of the editing segment |
| right_margin | number | 0.08 | Margin to the right of the editing segment |
| temperature | number | 1 | Adjusts randomness of outputs: values greater than 1 are more random, 0 is deterministic. Changing this is not recommended |
| top_p | number | 0.8 (max: 1) | When decoding text, samples from the top p percentage of most likely tokens; lower it to ignore less likely tokens |
| stop_repetition | integer | -1 | -1 means do not adjust the probability of silence tokens. If there are long silences or unnaturally stretched words, increase sample_batch_size to 2, 3, or even 4 |
| sample_batch_size | integer | 4 | The higher the number, the faster the output will be. Under the hood, the model will generate this many samples and choose the shortest one |
| seed | integer | | Random seed. Leave blank to randomize the seed |
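The fields above can be passed as an `input` dict to the Replicate Python client. A minimal sketch for the default zero-shot text-to-speech task follows; it assumes the `replicate` package is installed and a `REPLICATE_API_TOKEN` is set, the audio URL is a hypothetical placeholder, and "1f110ca2" is the abbreviated version id shown on this page (use the full id from the version list in practice).

```python
import os

# Zero-shot TTS: the model clones the voice heard in the first
# cut_off_sec seconds of orig_audio, then speaks target_transcript.
# Omitted fields fall back to the defaults in the table above.
input_payload = {
    "task": "zero-shot text-to-speech",
    "voicecraft_model": "giga330M_TTSEnhanced.pth",
    "orig_audio": "https://example.com/reference.wav",  # placeholder URL
    "target_transcript": "Hello, this is a cloned voice.",
    "cut_off_sec": 3.01,
    "top_p": 0.8,
    "sample_batch_size": 4,
}

# Only attempt the remote call when credentials are available.
if os.environ.get("REPLICATE_API_TOKEN"):
    import replicate

    output = replicate.run("cjwbw/voicecraft:1f110ca2", input=input_payload)
    print(output["generated_audio"])
    print(output["whisper_transcript_orig_audio"])
```

For a speech-editing task, you would instead set `task` to one of the speech_editing-* options and supply a `target_transcript` that differs from the original in the edited span, keeping `left_margin`/`right_margin` at their defaults unless edits sound clipped.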
Output schema
The shape of the response you’ll get when you run this model with an API.
Schema
{
  "properties": {
    "generated_audio": {
      "format": "uri",
      "title": "Generated Audio",
      "type": "string"
    },
    "whisper_transcript_orig_audio": {
      "title": "Whisper Transcript Orig Audio",
      "type": "string"
    }
  },
  "required": ["whisper_transcript_orig_audio", "generated_audio"],
  "title": "ModelOutput",
  "type": "object"
}
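A response can be sanity-checked against this schema before use. The sketch below uses a made-up `sample_output` dict for illustration; a real run returns a URI for generated_audio plus the Whisper transcript of the input audio.

```python
# Required fields taken from the "required" list in the schema above.
REQUIRED = ["whisper_transcript_orig_audio", "generated_audio"]

# Hypothetical example response; real values come from the API.
sample_output = {
    "generated_audio": "https://example.com/output.wav",  # placeholder URI
    "whisper_transcript_orig_audio": "but when I had approached so near to them",
}

# Verify all required fields are present.
missing = [key for key in REQUIRED if key not in sample_output]
assert not missing, f"response missing required fields: {missing}"

# generated_audio is declared as a string with format "uri".
assert sample_output["generated_audio"].startswith("http")
```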