Prediction

Official model

minimax/speech-02-hd

fah37y17x9rme0cpmqt973z6dr

Status

Succeeded

Source

Web

Total duration

5.0s

Created

10 months ago by minimax

Webhook

–

Input

text: <#0.7#>An Introduction to Minimax Speech-02 <#0.7#> Minimax's Speech-02 series are text-to-speech models that create natural-sounding voices with emotional expression. These models support more than 30 languages. According to the Artificial Analysis Speech Arena, Speech-02-HD is currently rated as the best text-to-speech model available today, while Speech-02-Turbo ranks third. With Replicate's platform, you can access these powerful models easily. <#0.7#> Model Options <#0.7#> You can choose between two models: Speech-02-HD: Designed for high-quality voiceovers and audiobooks when premium audio quality matters. Speech-02-Turbo: A more affordable option that processes faster, making it ideal for real-time applications. Both models can work with cloned voices. Voice cloning requires at least 10 seconds of audio and takes approximately 30 seconds to train. Each voice can be customized with adjustments to pitch, speed, and volume to achieve a natural sound. These models are available through Replicate's platform, where you can try them in an interactive playground. <#0.7#> Potential Applications <#0.7#> With these text-to-speech models, you can create: Virtual assistants with natural-sounding voices, studio-quality audiobooks and voiceovers, language learning tools featuring native pronunciation, multilingual customer service bots, and audio content that improves accessibility. <#0.7#> Emotion Control Features <#0.7#> Minimax's emotion control system offers two approaches for adding feeling to voices. The auto-detect mode automatically determines the appropriate emotional tone based on your text. Alternatively, manual controls allow you to specify exactly which emotion you want to convey. This flexibility helps your voices sound natural and engaging across various use cases, whether for entertainment, education, or business purposes. <#0.7#> Language Support <#0.7#> The models support more than 30 languages and accents. 
You can work with various English variants including US, UK, Australian, and Indian English. Asian language support includes Mandarin, Cantonese, Japanese, Korean, Vietnamese, and Indonesian. European languages like French, German, Spanish, Portuguese, Turkish, Russian, and Ukrainian are also supported. <#0.7#> Using the API <#0.7#> You can run these models using either JavaScript or Python with Replicate's client libraries. The process involves two main steps: first cloning a voice using an audio sample, then using that cloned voice for text-to-speech generation. To get started, you'll need to obtain an API token from your Replicate account. Once set up, you can clone voices using audio files in MP3, M4A, or WAV format. These files should be between 10 seconds and 5 minutes long and less than 20MB in size. After cloning a voice, you can use the generated voice ID to create text-to-speech with your preferred emotional style. <#0.7#> Pricing Information <#0.7#> The text-to-speech models are priced based on input and output tokens, where one token equals one character. The turbo model costs $30 per million characters, while the HD model costs $50 per million characters. Voice cloning has a separate cost of $3 per voice. <#0.7#> Stay Connected <#0.7#> To keep up with the latest developments, you can follow Replicate on their social media channels and join their Discord community for updates and discussions. Happy creating with these powerful text-to-speech capabilities!
voice_id: Wise_Woman
speed: 1.15
volume: 1
pitch: 0
emotion: happy
english_normalization: true
sample_rate: 32000
bitrate: 128000
channel: mono
language_boost: English

{
  "bitrate": 128000,
  "channel": "mono",
  "emotion": "happy",
  "english_normalization": true,
  "language_boost": "English",
  "pitch": 0,
  "sample_rate": 32000,
  "speed": 1.15,
  "text": "<#0.7#>An Introduction to Minimax Speech-02 <#0.7#>\nMinimax's Speech-02 series are text-to-speech models that create natural-sounding voices with emotional expression. These models support more than 30 languages. According to the Artificial Analysis Speech Arena, Speech-02-HD is currently rated as the best text-to-speech model available today, while Speech-02-Turbo ranks third. With Replicate's platform, you can access these powerful models easily.\n<#0.7#> Model Options <#0.7#>\nYou can choose between two models: Speech-02-HD: Designed for high-quality voiceovers and audiobooks when premium audio quality matters. Speech-02-Turbo: A more affordable option that processes faster, making it ideal for real-time applications. Both models can work with cloned voices. Voice cloning requires at least 10 seconds of audio and takes approximately 30 seconds to train. Each voice can be customized with adjustments to pitch, speed, and volume to achieve a natural sound. These models are available through Replicate's platform, where you can try them in an interactive playground.\n<#0.7#> Potential Applications <#0.7#>\nWith these text-to-speech models, you can create: Virtual assistants with natural-sounding voices, studio-quality audiobooks and voiceovers, language learning tools featuring native pronunciation, multilingual customer service bots, and audio content that improves accessibility.\n<#0.7#> Emotion Control Features <#0.7#>\nMinimax's emotion control system offers two approaches for adding feeling to voices. The auto-detect mode automatically determines the appropriate emotional tone based on your text. Alternatively, manual controls allow you to specify exactly which emotion you want to convey. This flexibility helps your voices sound natural and engaging across various use cases, whether for entertainment, education, or business purposes.\n<#0.7#> Language Support <#0.7#>\nThe models support more than 30 languages and accents. You can work with various English variants including US, UK, Australian, and Indian English. Asian language support includes Mandarin, Cantonese, Japanese, Korean, Vietnamese, and Indonesian. European languages like French, German, Spanish, Portuguese, Turkish, Russian, and Ukrainian are also supported.\n<#0.7#> Using the API <#0.7#>\nYou can run these models using either JavaScript or Python with Replicate's client libraries. The process involves two main steps: first cloning a voice using an audio sample, then using that cloned voice for text-to-speech generation. To get started, you'll need to obtain an API token from your Replicate account. Once set up, you can clone voices using audio files in MP3, M4A, or WAV format. These files should be between 10 seconds and 5 minutes long and less than 20MB in size. After cloning a voice, you can use the generated voice ID to create text-to-speech with your preferred emotional style.\n<#0.7#> Pricing Information <#0.7#>\nThe text-to-speech models are priced based on input and output tokens, where one token equals one character. The turbo model costs $30 per million characters, while the HD model costs $50 per million characters. Voice cloning has a separate cost of $3 per voice.\n<#0.7#> Stay Connected <#0.7#>\nTo keep up with the latest developments, you can follow Replicate on their social media channels and join their Discord community for updates and discussions. Happy creating with these powerful text-to-speech capabilities!",
  "voice_id": "Wise_Woman",
  "volume": 1
}
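The `text` input above wraps each section title in `<#0.7#>` markers. Assuming these are pause markers whose number is a pause length in seconds (this page does not state that), a script in the same shape could be assembled as in the sketch below. The helper name `build_script` and the sample sections are hypothetical and only illustrate the string layout.

```python
def build_script(sections, pause=0.7):
    """Join (title, body) pairs, wrapping each title in <#pause#> markers.

    Assumption: <#x#> is read by the model as a pause of x seconds.
    """
    marker = f"<#{pause}#>"
    parts = []
    for title, body in sections:
        # Title on its own line between two markers, body on the next line.
        parts.append(f"{marker} {title} {marker}\n{body}")
    return "\n".join(parts)


# Hypothetical sample content, shortened from the input above.
script = build_script(
    [
        ("Model Options", "You can choose between two models."),
        ("Pricing Information", "Voice cloning has a separate cost."),
    ]
)
print(script)
```

The resulting string can be passed as the `text` field of the input; the other fields are unaffected.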

Install Replicate’s Node.js client library:

npm install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=r8_Xx0**********************************

This is your API token. Keep it to yourself.

Import and set up the client:

import Replicate from "replicate";
import fs from "node:fs";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run minimax/speech-02-hd using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const input = {
  bitrate: 128000,
  channel: "mono",
  emotion: "happy",
  english_normalization: true,
  language_boost: "English",
  pitch: 0,
  sample_rate: 32000,
  speed: 1.15,
  text: "<#0.7#>An Introduction to Minimax Speech-02 <#0.7#>\nMinimax's Speech-02 series are text-to-speech models that create natural-sounding voices with emotional expression. These models support more than 30 languages. According to the Artificial Analysis Speech Arena, Speech-02-HD is currently rated as the best text-to-speech model available today, while Speech-02-Turbo ranks third. With Replicate's platform, you can access these powerful models easily.\n<#0.7#> Model Options <#0.7#>\nYou can choose between two models: Speech-02-HD: Designed for high-quality voiceovers and audiobooks when premium audio quality matters. Speech-02-Turbo: A more affordable option that processes faster, making it ideal for real-time applications. Both models can work with cloned voices. Voice cloning requires at least 10 seconds of audio and takes approximately 30 seconds to train. Each voice can be customized with adjustments to pitch, speed, and volume to achieve a natural sound. These models are available through Replicate's platform, where you can try them in an interactive playground.\n<#0.7#> Potential Applications <#0.7#>\nWith these text-to-speech models, you can create: Virtual assistants with natural-sounding voices, studio-quality audiobooks and voiceovers, language learning tools featuring native pronunciation, multilingual customer service bots, and audio content that improves accessibility.\n<#0.7#> Emotion Control Features <#0.7#>\nMinimax's emotion control system offers two approaches for adding feeling to voices. The auto-detect mode automatically determines the appropriate emotional tone based on your text. Alternatively, manual controls allow you to specify exactly which emotion you want to convey. This flexibility helps your voices sound natural and engaging across various use cases, whether for entertainment, education, or business purposes.\n<#0.7#> Language Support <#0.7#>\nThe models support more than 30 languages and accents. You can work with various English variants including US, UK, Australian, and Indian English. Asian language support includes Mandarin, Cantonese, Japanese, Korean, Vietnamese, and Indonesian. European languages like French, German, Spanish, Portuguese, Turkish, Russian, and Ukrainian are also supported.\n<#0.7#> Using the API <#0.7#>\nYou can run these models using either JavaScript or Python with Replicate's client libraries. The process involves two main steps: first cloning a voice using an audio sample, then using that cloned voice for text-to-speech generation. To get started, you'll need to obtain an API token from your Replicate account. Once set up, you can clone voices using audio files in MP3, M4A, or WAV format. These files should be between 10 seconds and 5 minutes long and less than 20MB in size. After cloning a voice, you can use the generated voice ID to create text-to-speech with your preferred emotional style.\n<#0.7#> Pricing Information <#0.7#>\nThe text-to-speech models are priced based on input and output tokens, where one token equals one character. The turbo model costs $30 per million characters, while the HD model costs $50 per million characters. Voice cloning has a separate cost of $3 per voice.\n<#0.7#> Stay Connected <#0.7#>\nTo keep up with the latest developments, you can follow Replicate on their social media channels and join their Discord community for updates and discussions. Happy creating with these powerful text-to-speech capabilities!",
  voice_id: "Wise_Woman",
  volume: 1
};

const output = await replicate.run("minimax/speech-02-hd", { input });

// To access the file URL:
console.log(output.url()); //=> "http://example.com"

// To write the file to disk (the output is MP3 audio, not an image):
await fs.promises.writeFile("output.mp3", output);

To learn more, take a look at the guide on getting started with Node.js.

Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=r8_Xx0**********************************

This is your API token. Keep it to yourself.

Import the client:

import replicate

Run minimax/speech-02-hd using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

output = replicate.run(
    "minimax/speech-02-hd",
    input={
        "bitrate": 128000,
        "channel": "mono",
        "emotion": "happy",
        "english_normalization": True,
        "language_boost": "English",
        "pitch": 0,
        "sample_rate": 32000,
        "speed": 1.15,
        "text": "<#0.7#>An Introduction to Minimax Speech-02 <#0.7#>\nMinimax's Speech-02 series are text-to-speech models that create natural-sounding voices with emotional expression. These models support more than 30 languages. According to the Artificial Analysis Speech Arena, Speech-02-HD is currently rated as the best text-to-speech model available today, while Speech-02-Turbo ranks third. With Replicate's platform, you can access these powerful models easily.\n<#0.7#> Model Options <#0.7#>\nYou can choose between two models: Speech-02-HD: Designed for high-quality voiceovers and audiobooks when premium audio quality matters. Speech-02-Turbo: A more affordable option that processes faster, making it ideal for real-time applications. Both models can work with cloned voices. Voice cloning requires at least 10 seconds of audio and takes approximately 30 seconds to train. Each voice can be customized with adjustments to pitch, speed, and volume to achieve a natural sound. These models are available through Replicate's platform, where you can try them in an interactive playground.\n<#0.7#> Potential Applications <#0.7#>\nWith these text-to-speech models, you can create: Virtual assistants with natural-sounding voices, studio-quality audiobooks and voiceovers, language learning tools featuring native pronunciation, multilingual customer service bots, and audio content that improves accessibility.\n<#0.7#> Emotion Control Features <#0.7#>\nMinimax's emotion control system offers two approaches for adding feeling to voices. The auto-detect mode automatically determines the appropriate emotional tone based on your text. Alternatively, manual controls allow you to specify exactly which emotion you want to convey. This flexibility helps your voices sound natural and engaging across various use cases, whether for entertainment, education, or business purposes.\n<#0.7#> Language Support <#0.7#>\nThe models support more than 30 languages and accents. You can work with various English variants including US, UK, Australian, and Indian English. Asian language support includes Mandarin, Cantonese, Japanese, Korean, Vietnamese, and Indonesian. European languages like French, German, Spanish, Portuguese, Turkish, Russian, and Ukrainian are also supported.\n<#0.7#> Using the API <#0.7#>\nYou can run these models using either JavaScript or Python with Replicate's client libraries. The process involves two main steps: first cloning a voice using an audio sample, then using that cloned voice for text-to-speech generation. To get started, you'll need to obtain an API token from your Replicate account. Once set up, you can clone voices using audio files in MP3, M4A, or WAV format. These files should be between 10 seconds and 5 minutes long and less than 20MB in size. After cloning a voice, you can use the generated voice ID to create text-to-speech with your preferred emotional style.\n<#0.7#> Pricing Information <#0.7#>\nThe text-to-speech models are priced based on input and output tokens, where one token equals one character. The turbo model costs $30 per million characters, while the HD model costs $50 per million characters. Voice cloning has a separate cost of $3 per voice.\n<#0.7#> Stay Connected <#0.7#>\nTo keep up with the latest developments, you can follow Replicate on their social media channels and join their Discord community for updates and discussions. Happy creating with these powerful text-to-speech capabilities!",
        "voice_id": "Wise_Woman",
        "volume": 1
    }
)

# To access the file URL:
print(output.url())
#=> "http://example.com"

# To write the file to disk (the output is MP3 audio, not an image):
with open("output.mp3", "wb") as file:
    file.write(output.read())

To learn more, take a look at the guide on getting started with Python.

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=r8_Xx0**********************************

This is your API token. Keep it to yourself.

Run minimax/speech-02-hd using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "input": {
      "bitrate": 128000,
      "channel": "mono",
      "emotion": "happy",
      "english_normalization": true,
      "language_boost": "English",
      "pitch": 0,
      "sample_rate": 32000,
      "speed": 1.15,
      "text": "<#0.7#>An Introduction to Minimax Speech-02 <#0.7#>\\nMinimax\'s Speech-02 series are text-to-speech models that create natural-sounding voices with emotional expression. These models support more than 30 languages. According to the Artificial Analysis Speech Arena, Speech-02-HD is currently rated as the best text-to-speech model available today, while Speech-02-Turbo ranks third. With Replicate\'s platform, you can access these powerful models easily.\\n<#0.7#> Model Options <#0.7#>\\nYou can choose between two models: Speech-02-HD: Designed for high-quality voiceovers and audiobooks when premium audio quality matters. Speech-02-Turbo: A more affordable option that processes faster, making it ideal for real-time applications. Both models can work with cloned voices. Voice cloning requires at least 10 seconds of audio and takes approximately 30 seconds to train. Each voice can be customized with adjustments to pitch, speed, and volume to achieve a natural sound. These models are available through Replicate\'s platform, where you can try them in an interactive playground.\\n<#0.7#> Potential Applications <#0.7#>\\nWith these text-to-speech models, you can create: Virtual assistants with natural-sounding voices, studio-quality audiobooks and voiceovers, language learning tools featuring native pronunciation, multilingual customer service bots, and audio content that improves accessibility.\\n<#0.7#> Emotion Control Features <#0.7#>\\nMinimax\'s emotion control system offers two approaches for adding feeling to voices. The auto-detect mode automatically determines the appropriate emotional tone based on your text. Alternatively, manual controls allow you to specify exactly which emotion you want to convey. This flexibility helps your voices sound natural and engaging across various use cases, whether for entertainment, education, or business purposes.\\n<#0.7#> Language Support <#0.7#>\\nThe models support more than 30 languages and accents. You can work with various English variants including US, UK, Australian, and Indian English. Asian language support includes Mandarin, Cantonese, Japanese, Korean, Vietnamese, and Indonesian. European languages like French, German, Spanish, Portuguese, Turkish, Russian, and Ukrainian are also supported.\\n<#0.7#> Using the API <#0.7#>\\nYou can run these models using either JavaScript or Python with Replicate\'s client libraries. The process involves two main steps: first cloning a voice using an audio sample, then using that cloned voice for text-to-speech generation. To get started, you\'ll need to obtain an API token from your Replicate account. Once set up, you can clone voices using audio files in MP3, M4A, or WAV format. These files should be between 10 seconds and 5 minutes long and less than 20MB in size. After cloning a voice, you can use the generated voice ID to create text-to-speech with your preferred emotional style.\\n<#0.7#> Pricing Information <#0.7#>\\nThe text-to-speech models are priced based on input and output tokens, where one token equals one character. The turbo model costs $30 per million characters, while the HD model costs $50 per million characters. Voice cloning has a separate cost of $3 per voice.\\n<#0.7#> Stay Connected <#0.7#>\\nTo keep up with the latest developments, you can follow Replicate on their social media channels and join their Discord community for updates and discussions. Happy creating with these powerful text-to-speech capabilities!",
      "voice_id": "Wise_Woman",
      "volume": 1
    }
  }' \
  https://api.replicate.com/v1/models/minimax/speech-02-hd/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.
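The curl request above sends `Prefer: wait` so that the call blocks until the prediction finishes. As a rough sketch of the alternative, the Python below polls a prediction's `get` URL until it reaches a terminal status. The status names, the helper names `fetch_prediction` and `wait_for_prediction`, and the polling interval are assumptions for illustration; check the HTTP API reference before relying on them.

```python
import json
import time
import urllib.request

# Assumed terminal statuses; "succeeded" is the one shown on this page.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def fetch_prediction(get_url, token):
    """GET a prediction's JSON document using a bearer token."""
    request = urllib.request.Request(
        get_url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_prediction(get_url, token, fetch=fetch_prediction,
                        interval=1.0, max_polls=60):
    """Poll until the prediction is in a terminal status, then return it."""
    for _ in range(max_polls):
        prediction = fetch(get_url, token)
        if prediction["status"] in TERMINAL_STATUSES:
            return prediction
        time.sleep(interval)
    raise TimeoutError(f"prediction still running after {max_polls} polls")
```

The `fetch` parameter exists only so the loop can be exercised without network access; in normal use the default is left in place and `get_url` is the `urls.get` value from the prediction JSON.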

Output

{
  "id": "fah37y17x9rme0cpmqt973z6dr",
  "model": "minimax/speech-02-hd",
  "version": "hidden",
  "input": {
    "bitrate": 128000,
    "channel": "mono",
    "emotion": "happy",
    "english_normalization": true,
    "language_boost": "English",
    "pitch": 0,
    "sample_rate": 32000,
    "speed": 1.15,
    "text": "<#0.7#>An Introduction to Minimax Speech-02 <#0.7#>\nMinimax's Speech-02 series are text-to-speech models that create natural-sounding voices with emotional expression. These models support more than 30 languages. According to the Artificial Analysis Speech Arena, Speech-02-HD is currently rated as the best text-to-speech model available today, while Speech-02-Turbo ranks third. With Replicate's platform, you can access these powerful models easily.\n<#0.7#> Model Options <#0.7#>\nYou can choose between two models: Speech-02-HD: Designed for high-quality voiceovers and audiobooks when premium audio quality matters. Speech-02-Turbo: A more affordable option that processes faster, making it ideal for real-time applications. Both models can work with cloned voices. Voice cloning requires at least 10 seconds of audio and takes approximately 30 seconds to train. Each voice can be customized with adjustments to pitch, speed, and volume to achieve a natural sound. These models are available through Replicate's platform, where you can try them in an interactive playground.\n<#0.7#> Potential Applications <#0.7#>\nWith these text-to-speech models, you can create: Virtual assistants with natural-sounding voices, studio-quality audiobooks and voiceovers, language learning tools featuring native pronunciation, multilingual customer service bots, and audio content that improves accessibility.\n<#0.7#> Emotion Control Features <#0.7#>\nMinimax's emotion control system offers two approaches for adding feeling to voices. The auto-detect mode automatically determines the appropriate emotional tone based on your text. Alternatively, manual controls allow you to specify exactly which emotion you want to convey. This flexibility helps your voices sound natural and engaging across various use cases, whether for entertainment, education, or business purposes.\n<#0.7#> Language Support <#0.7#>\nThe models support more than 30 languages and accents. You can work with various English variants including US, UK, Australian, and Indian English. Asian language support includes Mandarin, Cantonese, Japanese, Korean, Vietnamese, and Indonesian. European languages like French, German, Spanish, Portuguese, Turkish, Russian, and Ukrainian are also supported.\n<#0.7#> Using the API <#0.7#>\nYou can run these models using either JavaScript or Python with Replicate's client libraries. The process involves two main steps: first cloning a voice using an audio sample, then using that cloned voice for text-to-speech generation. To get started, you'll need to obtain an API token from your Replicate account. Once set up, you can clone voices using audio files in MP3, M4A, or WAV format. These files should be between 10 seconds and 5 minutes long and less than 20MB in size. After cloning a voice, you can use the generated voice ID to create text-to-speech with your preferred emotional style.\n<#0.7#> Pricing Information <#0.7#>\nThe text-to-speech models are priced based on input and output tokens, where one token equals one character. The turbo model costs $30 per million characters, while the HD model costs $50 per million characters. Voice cloning has a separate cost of $3 per voice.\n<#0.7#> Stay Connected <#0.7#>\nTo keep up with the latest developments, you can follow Replicate on their social media channels and join their Discord community for updates and discussions. Happy creating with these powerful text-to-speech capabilities!",
    "voice_id": "Wise_Woman",
    "volume": 1
  },
  "logs": "Generating speech with model speech-02-hd\nGenerated speech in 4.93sec\nEach character is 1 token\nTokens: 3442",
  "output": "https://replicate.delivery/xezq/viVRmeWzrnQdeEFhgp4MIpBc1OOVLPB54ssLhqWVD1oou8pUA/tmprjh2sov4.mp3",
  "data_removed": false,
  "error": null,
  "source": "web",
  "status": "succeeded",
  "created_at": "2025-05-06T14:16:03.818Z",
  "started_at": "2025-05-06T14:16:03.82683Z",
  "completed_at": "2025-05-06T14:16:08.776906Z",
  "urls": {
    "cancel": "https://api.replicate.com/v1/predictions/fah37y17x9rme0cpmqt973z6dr/cancel",
    "get": "https://api.replicate.com/v1/predictions/fah37y17x9rme0cpmqt973z6dr",
    "stream": "https://stream.replicate.com/v1/files/bcwr-5xl2r64v4zsqktxbdvnfc43h47r2e4kgbnki3r7lj3ging3jwqja",
    "web": "https://replicate.com/p/fah37y17x9rme0cpmqt973z6dr"
  },
  "metrics": {
    "input_token_count": 3442,
    "output_token_count": 1,
    "predict_time": 4.950075966,
    "time_to_first_token": 0.008864575999999999,
    "tokens_per_second": 0.2020185128189403,
    "total_time": 4.958906
  }
}
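The timing metrics in this response can be cross-checked against its own timestamps. The sketch below assumes that `predict_time` corresponds to `completed_at` minus `started_at` and `total_time` to `completed_at` minus `created_at`; the response does not define them, but the numbers agree to the microsecond. The helper `parse_ts` is hypothetical and pads the fractional seconds by hand because the timestamps carry three, five, and six fractional digits.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse a UTC timestamp such as '2025-05-06T14:16:03.82683Z'."""
    base, _, fraction = value.rstrip("Z").partition(".")
    # Pad or trim the fraction to exactly six digits (microseconds).
    micros = int(fraction.ljust(6, "0")[:6]) if fraction else 0
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


# Values copied from the prediction JSON above.
created = parse_ts("2025-05-06T14:16:03.818Z")
started = parse_ts("2025-05-06T14:16:03.82683Z")
completed = parse_ts("2025-05-06T14:16:08.776906Z")

predict_time = (completed - started).total_seconds()
total_time = (completed - created).total_seconds()
queue_time = (started - created).total_seconds()

print(f"predict_time: {predict_time:.6f} s")
print(f"total_time:   {total_time:.6f} s")
print(f"queue_time:   {queue_time:.6f} s")
```

Under these assumptions the prediction spent under 9 milliseconds queued and nearly all of its roughly 5 seconds generating audio.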

Generated in

5.0 seconds

Input tokens

3.4K

Output tokens

1

Tokens per second

0.20 tokens / second

Time to first token

9 milliseconds
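
The input text quotes $50 per million characters for the HD model and $30 for Turbo, and the logs report one token per character. Taking those quoted rates as given (they come from the narrated text, not from a verified price list), the cost of this prediction's 3,442 input tokens can be estimated as below. The function name `estimate_cost_usd` is hypothetical, and the estimate covers input characters only.

```python
# USD per million characters, as quoted in the input text (assumed, not verified).
PRICE_PER_MILLION_USD = {"speech-02-hd": 50, "speech-02-turbo": 30}


def estimate_cost_usd(char_count, model):
    """Estimate the cost of char_count characters at one token per character."""
    return char_count * PRICE_PER_MILLION_USD[model] / 1_000_000


input_tokens = 3442  # "input_token_count" from the metrics above
hd_cost = estimate_cost_usd(input_tokens, "speech-02-hd")
turbo_cost = estimate_cost_usd(input_tokens, "speech-02-turbo")
print(f"HD: ${hd_cost:.4f}  Turbo: ${turbo_cost:.4f}")
```

At the quoted HD rate this run comes to about $0.17; the same text on Turbo would be about $0.10.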
