geopti/chatterbox-multilingual:a0459af8

Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field, its default value is used.

| Field | Type | Default value | Description |
| --- | --- | --- | --- |
| text | string | | The text you want spoken. Can be a single sentence or a long paragraph; long inputs are automatically split into chunks. |
| language | string | en | Language of the text. Use the two-letter code (en=English, fr=French, de=German, es=Spanish, ja=Japanese, zh=Chinese, ar=Arabic, el=Greek, etc.). |
| audio_prompt | string | | Optional reference voice clip (.wav/.mp3). The output will mimic this voice. If left empty, a default voice is used. |
| cfg_weight | number | 0.5 (max: 1) | How closely the speech follows the text. Higher = sticks to the text more strictly. Lower = more freedom (but can hallucinate or get stuck). |
| exaggeration | number | 0.5 (max: 1) | How expressive the voice is. Higher = more emotional / dramatic. Lower = more flat / neutral. |
| temperature | number | 0.8 (max: 2) | Randomness of the voice. Higher = more variation between runs. Lower = more consistent / robotic. |
| repetition_penalty | number | 2 (min: 1, max: 5) | Penalty for repeating the same sounds. Higher = less repetition. |
| top_p | number | 1 (max: 1) | Top-p (nucleus) sampling. Restricts the model to the most likely tokens. 1.0 = no restriction. |
| pause_between_sentences | number | 0.1 (max: 5) | Length of the silence (in seconds) inserted between sentences. |
| max_words_per_chunk | integer | 60 (min: 10, max: 200) | Long texts are split into chunks before generation. This is the maximum number of words per chunk. Smaller = safer for tricky languages, but slower. |
| repeated_token_threshold | integer | 3 (min: 2, max: 10) | If the model repeats the same sound this many times in a row, the chunk is cut off (prevents the model from getting stuck in a loop). Raise this if too much real speech is being cut. |
| garbage_trim_buffer | integer | 25 (max: 200) | Number of audio frames kept after the model finishes saying the sentence (each frame = ~40 ms). Lower = trims garbage tails more aggressively but may cut off the last syllable. |
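
The defaults and ranges listed above can be checked on the client side before a request is sent. The following is a minimal sketch, not part of the model's API: `build_input`, `DEFAULTS`, and `RANGES` are hypothetical names, and a bound that the schema does not list is left as `None` rather than guessed.

```python
# Defaults copied from the input schema above.
DEFAULTS = {
    "language": "en",
    "cfg_weight": 0.5,
    "exaggeration": 0.5,
    "temperature": 0.8,
    "repetition_penalty": 2,
    "top_p": 1,
    "pause_between_sentences": 0.1,
    "max_words_per_chunk": 60,
    "repeated_token_threshold": 3,
    "garbage_trim_buffer": 25,
}

# (min, max) pairs copied from the schema; None means "not listed there".
RANGES = {
    "cfg_weight": (None, 1),
    "exaggeration": (None, 1),
    "temperature": (None, 2),
    "repetition_penalty": (1, 5),
    "top_p": (None, 1),
    "pause_between_sentences": (None, 5),
    "max_words_per_chunk": (10, 200),
    "repeated_token_threshold": (2, 10),
    "garbage_trim_buffer": (None, 200),
}


def build_input(text, **overrides):
    """Hypothetical helper: merge overrides into the defaults and check ranges."""
    if not text:
        raise ValueError("text must be a non-empty string")
    unknown = set(overrides) - set(DEFAULTS) - {"audio_prompt"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    payload = {"text": text, **DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        value = payload[name]
        if low is not None and value < low:
            raise ValueError(f"{name}={value} is below the minimum {low}")
        if high is not None and value > high:
            raise ValueError(f"{name}={value} is above the maximum {high}")
    return payload


payload = build_input("Bonjour tout le monde.", language="fr", max_words_per_chunk=40)
```

The resulting dictionary has the shape of the input schema and could be passed as the `input` argument of whichever API client you use. As a point of reference for `garbage_trim_buffer`, the default of 25 frames at roughly 40 ms each keeps about one second of audio after the sentence ends.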

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema
{"type": "string", "title": "Output", "format": "uri"}
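
Because the output is a single URI string, a client typically has to fetch the audio itself. The sketch below shows one way to handle that, assuming the URI is an http(s) URL; `output_filename`, `download_output`, the fallback file name, and the example URL are all hypothetical, and the audio format is not stated by the schema.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def output_filename(output_uri, fallback="output_audio"):
    """Derive a local file name from the output URI (assumes http/https)."""
    parsed = urlparse(output_uri)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported URI scheme: {parsed.scheme!r}")
    # Use the last path segment, or the fallback if the path has none.
    return PurePosixPath(parsed.path).name or fallback


def download_output(output_uri, dest_dir="."):
    """Download the generated audio and return the local path."""
    target = Path(dest_dir) / output_filename(output_uri)
    urlretrieve(output_uri, str(target))  # performs a network request
    return target


# Placeholder URL for illustration only; a real run returns its own URI.
name = output_filename("https://example.com/files/abc/output.wav")
```

Output URLs from hosted APIs often expire, so it is usually safer to download the file soon after the run completes than to store the URI.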