cjwbw/voicecraft:a99ecb8e

Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field, its default value will be used.
| Field | Type | Default value | Description |
|---|---|---|---|
| task | None | speech_editing-substitution | Choose a task. For zero-shot text-to-speech, you also need to specify the cut_off_sec of the original audio to use for zero-shot generation and the transcript up to cut_off_sec. |
| orig_audio | string | | Original audio file. |
| orig_transcript | string | | Transcript of the original audio file. You can use models such as https://replicate.com/openai/whisper and https://replicate.com/vaibhavs10/incredibly-fast-whisper to get the transcript (and correct it if it's not accurate). |
| target_transcript | string | | Transcript of the target audio file. |
| cut_off_sec | number | 3.01 | Required for the zero-shot text-to-speech task. The number of seconds from the start of the original audio used as the reference for zero-shot text-to-speech (TTS). 3 sec of reference is generally enough for high-quality voice cloning, but longer is generally better; try e.g. 3~6 sec. |
| orig_transcript_until_cutoff_time | string | | Required for the zero-shot text-to-speech task. Transcript of the original audio file up to the cut_off_sec specified above. This process will be improved and automated later. |
| temperature | number | 1 (Min: 0.01, Max: 5) | Adjusts randomness of outputs. Values greater than 1 are more random; values approaching 0 are closer to deterministic. |
| top_p | number | 0.8 (Max: 1) | When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens. |
| stop_repetition | integer | -1 | -1 means do not adjust the probability of silence tokens. If there are long silences or unnaturally stretched words, increase sample_batch_size to 2, 3 or even 4. |
| sampling_rate | integer | 16000 | Specify the sampling rate of the audio codec. |
| seed | integer | | Random seed. Leave blank to randomize the seed. |
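As a rough illustration of how the fields above fit together, the following sketch merges caller-supplied values onto the documented defaults and checks the documented bounds (temperature between 0.01 and 5, top_p at most 1). The helper name `build_input`, the choice to treat the three transcript/audio fields as required, and the example URL are assumptions made for illustration; they are not part of the model's API.

```python
from typing import Any, Dict

# Defaults copied from the input schema table. "seed" is deliberately left
# out so that a blank seed is sent and the seed is randomized.
DEFAULTS: Dict[str, Any] = {
    "task": "speech_editing-substitution",
    "cut_off_sec": 3.01,
    "temperature": 1,
    "top_p": 0.8,
    "stop_repetition": -1,
    "sampling_rate": 16000,
}

# Fields a caller may override, in addition to the three positional ones.
OPTIONAL_FIELDS = set(DEFAULTS) | {"orig_transcript_until_cutoff_time", "seed"}


def build_input(
    orig_audio: str,
    orig_transcript: str,
    target_transcript: str,
    **overrides: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: build an input dict and check the schema bounds."""
    unknown = set(overrides) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    payload: Dict[str, Any] = {
        **DEFAULTS,
        **overrides,
        "orig_audio": orig_audio,
        "orig_transcript": orig_transcript,
        "target_transcript": target_transcript,
    }

    # Bounds taken from the "Default value" column of the table.
    if not 0.01 <= payload["temperature"] <= 5:
        raise ValueError("temperature must be between 0.01 and 5")
    if payload["top_p"] > 1:
        raise ValueError("top_p must be at most 1")
    return payload


# Placeholder URL and transcripts, for illustration only.
example = build_input(
    "https://example.com/input.wav",
    "hello world",
    "hello there",
    temperature=0.7,
)
```

The resulting dictionary is the kind of value an API client would send as the model input; the actual request is omitted here because it needs an API token and the full version identifier.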
            
              
                
              
            
Output schema

The shape of the response you’ll get when you run this model with an API.

Schema

{'format': 'uri', 'title': 'Output', 'type': 'string'}
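Since the schema describes the output as a single string in URI format, a client may want a quick sanity check before trying to download the result. The sketch below uses only the Python standard library; the function name `looks_like_uri_output` and the example URL are hypothetical, and the check is a loose heuristic rather than full URI validation.

```python
from urllib.parse import urlparse


def looks_like_uri_output(value: object) -> bool:
    """Hypothetical check: is the value a string that parses as a URI?"""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    # Require a scheme (e.g. "https") plus either a host or a path.
    return bool(parsed.scheme) and bool(parsed.netloc or parsed.path)


# Placeholder URL, for illustration only.
ok = looks_like_uri_output("https://example.com/output.wav")
```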