minimax/speech-2.6-hd

MiniMax Speech 2.6 HD delivers studio-quality multilingual text-to-audio on Replicate with nuanced prosody, subtitle export, and premium voices

MiniMax Speech 2.6 HD on Replicate

Models

  • Speech-2.6-HD: Next-generation high-definition model with improved realism and expressive control
  • Speech-2.6-Turbo: Enhanced low-latency model optimized for live and interactive applications
  • Speech-02-HD: Optimized for high-fidelity applications like voiceovers and audiobooks
  • Speech-02-Turbo: Designed for real-time applications with low latency
  • Voice-Cloning: Clone voices for use with speech-02-hd and speech-02-turbo

MiniMax Speech 2.6 HD is the flagship text-to-audio model from MiniMax, tuned for premium voiceover work, audiobooks, marketing content, and any scenario that demands maximum fidelity and vocal nuance. It ships on Replicate with the same easy REST API as the Turbo model, plus full support for 40+ languages, 300+ voices, and custom voice cloning.

Why use the HD variant?

  • 🎙 Studio-grade prosody – crisper articulation, better breath control, and smoother phrasing than 2.6 Turbo.
  • 🧠 Emotion intelligence – “auto” matches the tone to your script, or pick precise emotions like calm, fluent, or surprised.
  • 🌐 Global language coverage – the same multilingual, dialect boost, and subtitle support as Turbo.
  • 🧾 Subtitles on tap – set subtitle_enable to true for sentence-timestamped .titles files (great for captions or QA).
  • 💼 Predictable billing – $0.10 per 1,000 input tokens (token_input_count), zero cost for outputs.

Upgrading from Speech 2.0 HD?
Expect noticeably richer performances. The API schema is unchanged, but the per-character price is 4× higher. Consider offering both HD generations so customers can pick the fidelity that matches their budget.

Quick start

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.replicate.com/v1/models/minimax/speech-2.6-hd/predictions \
  -d '{
    "input": {
      "text": "Welcome to the MiniMax Speech 2.6 HD voice studio.",
      "voice_id": "English_expressive_narrator",
      "emotion": "calm",
      "audio_format": "flac",
      "subtitle_enable": true
    }
  }'

Outputs include a hosted audio file (e.g., FLAC) plus subtitle metadata when requested.
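
The same request can be assembled from Python using only the standard library. This is a minimal sketch, not an official client: the helper names build_request and create_prediction are hypothetical, and the URL assumes Replicate's model-scoped predictions endpoint for minimax/speech-2.6-hd.

```python
import json
import urllib.request

# Assumption: Replicate's model-scoped endpoint for creating predictions.
API_URL = "https://api.replicate.com/v1/models/minimax/speech-2.6-hd/predictions"


def build_request(text, token, **options):
    """Build (but do not send) a POST request for a speech prediction.

    Extra keyword arguments (voice_id, emotion, audio_format, ...) are
    passed through unchanged as input parameters.
    """
    payload = {"input": {"text": text, **options}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_prediction(request, timeout=60):
    """Send the prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

To mirror the curl call, pass the script as text, read the token from the REPLICATE_API_TOKEN environment variable, and supply voice_id="English_expressive_narrator", emotion="calm", audio_format="flac", and subtitle_enable=True as keyword arguments. Keeping request construction separate from sending makes the payload easy to inspect or log before any billable call is made.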

Input parameters

| Name | Type | Default | Description |
|---|---|---|---|
| text | string | – | Up to 10,000 characters. Supports <#seconds#> pause markers and multi-paragraph scripts. |
| voice_id | string | Wise_Woman | Any MiniMax system or cloned voice ID. |
| speed | float | 1.0 | Range 0.5–2.0. |
| volume | float | 1.0 | Range 0–10. |
| pitch | int | 0 | Semitone shift, −12 to +12. |
| emotion | string | auto | auto, happy, sad, angry, fearful, disgusted, surprised, calm, fluent, neutral. |
| english_normalization | bool | false | Enables advanced number/date handling for English text. |
| sample_rate | int | 32000 | 8000–44100 Hz. |
| bitrate | int | 128000 | 32000, 64000, 128000, or 256000 (MP3 only). |
| audio_format | string | mp3 | Choose mp3, wav, flac, or pcm. FLAC/WAV recommended for post-production. |
| channel | string | mono | mono or stereo. |
| subtitle_enable | bool | false | Return MiniMax subtitle metadata (sentence-level timestamps). |
| language_boost | string | null | Boost recognition for any supported language, or set Automatic. |
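
Checking inputs locally against the documented ranges can catch mistakes before a billable request is sent. The following is a rough sketch under stated assumptions: validate_input is a hypothetical helper, the numeric ranges are treated as inclusive at both ends, and the server may apply stricter or different rules.

```python
# Allowed values copied from the parameter list above.
EMOTIONS = {
    "auto", "happy", "sad", "angry", "fearful",
    "disgusted", "surprised", "calm", "fluent", "neutral",
}
AUDIO_FORMATS = {"mp3", "wav", "flac", "pcm"}
BITRATES = {32000, 64000, 128000, 256000}
CHANNELS = {"mono", "stereo"}
MAX_TEXT_CHARS = 10_000


def validate_input(params):
    """Return a list of problems; an empty list means the input looks valid.

    Assumption: ranges are inclusive, and missing keys fall back to the
    documented defaults.
    """
    problems = []
    text = params.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text is required")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")
    if not 0.5 <= params.get("speed", 1.0) <= 2.0:
        problems.append("speed must be between 0.5 and 2.0")
    if not 0 <= params.get("volume", 1.0) <= 10:
        problems.append("volume must be between 0 and 10")
    if not -12 <= params.get("pitch", 0) <= 12:
        problems.append("pitch must be between -12 and +12")
    if params.get("emotion", "auto") not in EMOTIONS:
        problems.append("unknown emotion")
    if not 8000 <= params.get("sample_rate", 32000) <= 44100:
        problems.append("sample_rate must be between 8000 and 44100")
    if params.get("bitrate", 128000) not in BITRATES:
        problems.append("unsupported bitrate")
    if params.get("audio_format", "mp3") not in AUDIO_FORMATS:
        problems.append("unsupported audio_format")
    if params.get("channel", "mono") not in CHANNELS:
        problems.append("channel must be mono or stereo")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.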

Output

You receive:

  • A hosted audio file in the requested format (valid for 24 hours by default).
  • Metadata containing character counts, duration, bitrate, etc.
  • Optional .titles subtitle JSON when subtitle_enable is true.
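
Because the audio is delivered as a hosted file, a client usually has to wait for the prediction to finish before downloading it. The sketch below polls until a terminal status is reached. It assumes the prediction object carries a status field with the terminal values succeeded, failed, and canceled; wait_for_prediction and the fetch callable are hypothetical names, and fetch stands in for whatever call retrieves the current prediction state.

```python
import time

# Assumption: these are the statuses after which a prediction stops changing.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_prediction(fetch, interval=1.0, max_attempts=60, sleep=time.sleep):
    """Call fetch() repeatedly until the prediction reaches a terminal status.

    fetch is any zero-argument callable returning the prediction as a dict.
    Raises TimeoutError if no terminal status is seen within max_attempts.
    """
    for _ in range(max_attempts):
        prediction = fetch()
        if prediction.get("status") in TERMINAL_STATUSES:
            return prediction
        sleep(interval)
    raise TimeoutError(f"prediction not finished after {max_attempts} attempts")
```

Since the hosted file expires, it is worth downloading the audio and any subtitle file to your own storage as soon as the prediction succeeds.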

Pricing on Replicate

  • $0.10 per 1,000 input tokens (token_input_count)
  • $0.00 per output token

Because the metric comes straight from MiniMax’s character counter, you can estimate costs by multiplying character count × $0.0001.
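
As a worked example of that arithmetic, the sketch below estimates the cost of a script before it is submitted. It assumes Python's len() is a close enough stand-in for MiniMax's character counter, which may treat pause markers or some scripts differently; estimate_cost_usd is a hypothetical helper.

```python
from decimal import Decimal

# Published rate: $0.10 per 1,000 input tokens (characters).
PRICE_PER_1000_TOKENS = Decimal("0.10")


def estimate_cost_usd(text):
    """Estimate the USD cost of synthesizing text.

    Assumption: one billed token per Python character (len(text)).
    Decimal avoids binary floating-point rounding in money amounts.
    """
    return PRICE_PER_1000_TOKENS * len(text) / 1000
```

Under these assumptions, a 2,500-character script costs about $0.25, and a full 10,000-character input costs about $1.00.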

Ideal use cases

  • Narrated product demos, audiobooks, podcasts, and marketing assets
  • Localization pipelines needing multiple languages with consistent delivery
  • Dialogue tracks for games or animated content
  • Accessibility overlays (read-aloud, captioned videos, screenreader augmentations)

Additional resources

For interactive R&D or low-latency deployments, use the Turbo sibling model. For premier-quality voiceovers that stand up to post-production, Speech 2.6 HD is the better fit.