Official

playht / play-dialog

End-to-end AI speech model designed for natural-sounding conversational speech synthesis, with support for context-aware prosody, intonation, and emotional expression.

  • Public
  • 24K runs
  • $0.001 per second of audio
  • Commercial use

Input

string (required)

Text for speech generation

string

Voice to use for generation

Default: "Angelo (Young male US conversational voice)"

string

Optional second voice to use for generation

Default: "None"

string

The language of the text to be spoken.

Default: "english"

string

The emotion to use for the generated audio.

Default: "None"

number
(minimum: 0.1, maximum: 1.5)

The temperature parameter controls variance. Lower temperatures give more predictable results; higher temperatures let each run vary more, so the voice may sound less like the baseline voice. Values between 1.02 and 1.05 give good results.

Default: 1.02

string

A prompt to guide the style of the output generated by the first voice.

Default: ""

string

A prompt to guide the style of the output generated by the second voice.

Default: ""

string

The prefix to indicate the start of a turn in a multi-turn dialogue for the first voice.

Default: "Voice 1:"

string

The prefix to indicate the start of a turn in a multi-turn dialogue for the second voice.

Default: "Voice 2:"
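
The two turn prefixes above mark where each speaker's turn begins in the input text. As a minimal sketch, the helper below assembles a two-speaker script from a list of turns. The function name `format_dialogue`, the one-turn-per-line layout, and the single space after each prefix are assumptions made for illustration; the listing only states that the prefix indicates the start of a turn.

```python
def format_dialogue(turns, prefix_1="Voice 1:", prefix_2="Voice 2:"):
    """Join (speaker, line) pairs into one script string.

    `speaker` is 1 or 2. Each turn is placed on its own line and led by
    that speaker's turn prefix (layout is an assumption, see lead-in).
    """
    prefixes = {1: prefix_1, 2: prefix_2}
    lines = []
    for speaker, text in turns:
        if speaker not in prefixes:
            raise ValueError(f"speaker must be 1 or 2, got {speaker!r}")
        lines.append(f"{prefixes[speaker]} {text.strip()}")
    return "\n".join(lines)


# Example: a short exchange using the default prefixes from the listing.
script = format_dialogue([
    (1, "Did you catch the launch yesterday?"),
    (2, "I did, the demo was great."),
])
print(script)
```

If custom prefixes are passed to the model, the same strings would need to be used when building the text so the two stay in sync.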

integer
(minimum: 1, maximum: 60)

The number of seconds of conditioning to use from the selected voice. Lower values generate audio less similar to the cloned voice, but lead to more model stability and expressiveness. Higher values create output more similar to the cloned voice, but can lead to model instability and reduced expressiveness.

Default: 20

integer
(minimum: 1, maximum: 60)

The number of seconds of conditioning to use from the second selected voice.

Default: 20

integer

Random seed. Set it for reproducible generation.
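
The numeric limits listed above (temperature between 0.1 and 1.5, conditioning seconds between 1 and 60) can be checked before a request is sent. The sketch below builds an input dictionary and validates those ranges. The parameter names are not shown in this listing, so the keys `text`, `temperature`, `voice_conditioning_seconds`, and `seed` are hypothetical placeholders, as is the helper name `build_input`.

```python
def build_input(text, temperature=1.02, conditioning_seconds=20, seed=None):
    """Build a request payload, enforcing the ranges given in the listing.

    All dictionary keys are hypothetical; match them to the model's
    actual input schema before use.
    """
    if not text:
        raise ValueError("text is required")
    if not 0.1 <= temperature <= 1.5:
        raise ValueError("temperature must be between 0.1 and 1.5")
    if not isinstance(conditioning_seconds, int) or not 1 <= conditioning_seconds <= 60:
        raise ValueError("conditioning_seconds must be an integer from 1 to 60")

    payload = {
        "text": text,                                        # hypothetical key
        "temperature": temperature,                          # hypothetical key
        "voice_conditioning_seconds": conditioning_seconds,  # hypothetical key
    }
    # Only include the seed when the caller wants reproducible output.
    if seed is not None:
        payload["seed"] = seed                               # hypothetical key
    return payload


payload = build_input("Hello there.", seed=7)
print(payload)
```

Leaving the seed out keeps runs non-deterministic, while the defaults mirror the ones stated above (1.02 and 20).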

Output


Pricing

Model pricing for playht/play-dialog. Looking for volume pricing? Get in touch.

$1.00 per thousand seconds of output audio (around 1,000 seconds for $1)

Official models are always on, maintained, and have predictable pricing. Learn more.

Check out our docs for more information about how pricing works on Replicate.
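
The listed rate works out to $0.001 per second of output audio, so a rough cost estimate is a single multiplication. The sketch below uses `Decimal` to avoid floating-point rounding in currency arithmetic. The helper name `estimate_cost` is hypothetical, and the calculation assumes the flat per-second rate shown here with no volume discount.

```python
from decimal import Decimal

# Rate taken from the listing: $0.001 per second of output audio.
PRICE_PER_SECOND = Decimal("0.001")


def estimate_cost(audio_seconds):
    """Return the estimated cost in US dollars for the given audio length."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    # Convert via str so float inputs such as 2.5 stay exact.
    return PRICE_PER_SECOND * Decimal(str(audio_seconds))


# Example: 90 seconds of generated dialogue.
print(estimate_cost(90))
```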

Readme

PlayDialog

PlayDialog is an end-to-end AI speech model designed for natural-sounding conversational speech synthesis, with support for context-aware prosody, intonation, and emotional expression.

Features

Core Speech Model

  • Context-aware speech generation using Adaptive Speech Contextualizer (ASC) architecture
  • Prosody and intonation control based on conversation history
  • Emotion-aware speech synthesis
  • Support for streaming responses from LLMs via WebSocket
  • Trained on hundreds of millions of real conversation samples
  • Complementary to Play 3.0 mini (which supports 30+ languages with low latency)

Performance

  • 2:1 preference ratio in blind testing against leading competitors (n=600)
  • Strong performance in expressiveness metrics
  • Optimized for conversation flow and natural speech patterns

Support

For technical support or sales inquiries, contact our support team.