Use text guidance to control how the speech sounds: describe the tone of voice, emotion, speaking speed, or background noise, then generate speech that matches the description. For example:

A female speaker with a slightly low-pitched, quite monotone voice delivers her words at a slightly faster-than-average pace in a confined space with very clear audio.
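
A description like the one above can also be assembled from separate attributes, which makes it easier to vary one property at a time. The following is only a minimal sketch: the `VoiceDescription` class, its field names, and the sentence template are hypothetical and not part of any model's API. Passing the resulting string to a speech model is left out, since that call depends on the model being used.

```python
from dataclasses import dataclass


@dataclass
class VoiceDescription:
    """Hypothetical container for the attributes of a sound description."""

    speaker: str = "female"
    pronoun: str = "her"
    pitch: str = "slightly low-pitched"
    expressiveness: str = "quite monotone"
    pace: str = "slightly faster-than-average"
    environment: str = "a confined space"
    audio_quality: str = "very clear"

    def to_prompt(self) -> str:
        # Fill a fixed sentence template with the chosen attributes.
        # The template mirrors the example description; it is an assumption,
        # not a format required by any particular model.
        return (
            f"A {self.speaker} speaker with a {self.pitch}, "
            f"{self.expressiveness} voice delivers {self.pronoun} words "
            f"at a {self.pace} pace in {self.environment} "
            f"with {self.audio_quality} audio."
        )


# The defaults reproduce the example description.
default_prompt = VoiceDescription().to_prompt()
print(default_prompt)

# Change a single property, keeping everything else fixed.
slow_prompt = VoiceDescription(pace="slow").to_prompt()
print(slow_prompt)
```

Keeping the attributes separate means two descriptions can differ in exactly one property, which helps when comparing how a single change (for example, pace) affects the generated speech.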

