geopti/chatterbox-multilingual:a0459af8
Input schema
The fields you can use to run this model with an API. If you don’t give a value for a field, its default value is used.
| Field | Type | Default value | Description |
|---|---|---|---|
| text | string | | The text you want spoken. Can be a single sentence or a long paragraph; long inputs are automatically split into chunks. |
| language | None | en | Language of the text. Use the two-letter code (en=English, fr=French, de=German, es=Spanish, ja=Japanese, zh=Chinese, ar=Arabic, el=Greek, etc.). |
| audio_prompt | string | | Optional reference voice clip (.wav/.mp3). The output will mimic this voice. If left empty, a default voice is used. |
| cfg_weight | number | 0.5 (Max: 1) | How closely the speech follows the text. Higher = sticks to the text more strictly. Lower = more freedom (but can hallucinate or get stuck). |
| exaggeration | number | 0.5 (Max: 1) | How expressive the voice is. Higher = more emotional / dramatic. Lower = more flat / neutral. |
| temperature | number | 0.8 (Max: 2) | Randomness of the voice. Higher = more variation between runs. Lower = more consistent / robotic. |
| repetition_penalty | number | 2 (Min: 1, Max: 5) | Penalty for repeating the same sounds. Higher = less repetition. |
| top_p | number | 1 (Max: 1) | Top-p (nucleus) sampling. Restricts the model to the most likely tokens. 1.0 = no restriction. |
| pause_between_sentences | number | 0.1 (Max: 5) | Length of the silence (in seconds) inserted between sentences. |
| max_words_per_chunk | integer | 60 (Min: 10, Max: 200) | Long texts are split into chunks before generation. This is the max number of words per chunk. Smaller = safer for tricky languages, but slower. |
| repeated_token_threshold | integer | 3 (Min: 2, Max: 10) | If the model repeats the same sound this many times in a row, the chunk is cut off (prevents the model from getting stuck looping). Raise this if too much real speech is being cut. |
| garbage_trim_buffer | integer | 25 (Max: 200) | Number of audio frames kept after the model finishes saying the sentence (each frame = ~40 ms). Lower = trims garbage tails more aggressively but may cut off the last syllable. |
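The limits listed in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the model's API: `build_input` and `trim_buffer_seconds` are hypothetical helper names, the bounds are copied from the table above, and fields with no listed minimum are left unbounded below. It only builds and validates the input dictionary; it makes no network call.

```python
# Documented bounds from the input table: name -> (min, max).
# None means the table lists no bound on that side.
BOUNDS = {
    "cfg_weight": (None, 1),
    "exaggeration": (None, 1),
    "temperature": (None, 2),
    "repetition_penalty": (1, 5),
    "top_p": (None, 1),
    "pause_between_sentences": (None, 5),
    "max_words_per_chunk": (10, 200),
    "repeated_token_threshold": (2, 10),
    "garbage_trim_buffer": (None, 200),
}

# Fields the table lists without numeric bounds.
UNBOUNDED_FIELDS = {"language", "audio_prompt"}

FRAME_SECONDS = 0.040  # the table gives roughly 40 ms per audio frame


def build_input(text, **overrides):
    """Hypothetical helper: build an input dict, rejecting out-of-range values."""
    if not text:
        raise ValueError("text must be a non-empty string")
    payload = {"text": text}
    for name, value in overrides.items():
        if name in BOUNDS:
            low, high = BOUNDS[name]
            if low is not None and value < low:
                raise ValueError(f"{name}={value} is below the minimum {low}")
            if high is not None and value > high:
                raise ValueError(f"{name}={value} is above the maximum {high}")
        elif name not in UNBOUNDED_FIELDS:
            raise ValueError(f"unknown field: {name}")
        payload[name] = value
    return payload


def trim_buffer_seconds(frames):
    """Approximate duration covered by garbage_trim_buffer, at ~40 ms per frame."""
    return frames * FRAME_SECONDS


payload = build_input("Bonjour tout le monde.", language="fr", temperature=0.6)
print(payload)
print(trim_buffer_seconds(25))  # duration of the default buffer, in seconds
```

Fields omitted from the call are simply left out of the dictionary, so the defaults in the table apply on the server side.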
Output schema
The shape of the response you’ll get when you run this model with an API.
{'format': 'uri', 'title': 'Output', 'type': 'string'}
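The schema above says the response is a single string in URI format, presumably pointing at the generated audio. A caller may want a loose sanity check before trying to download it. The sketch below assumes nothing about the actual host or file type; `is_uri_string` is a hypothetical helper and the example URL is a placeholder, not real model output.

```python
from urllib.parse import urlparse


def is_uri_string(value):
    """Loose check: a string with a scheme and either a host or a path."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return bool(parts.scheme) and bool(parts.netloc or parts.path)


# Placeholder values for illustration only.
print(is_uri_string("https://example.com/output.wav"))
print(is_uri_string(None))
```

This is deliberately weaker than full URI validation: it only rejects values that are clearly not URIs, such as a missing response or a bare word with no scheme.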