chenxwh / cosyvoice2-0.5b

Scalable Streaming Speech Synthesis with Large Language Models


Run time and cost

This model costs approximately $0.0071 to run on Replicate, or 140 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 8 seconds. The predict time for this model varies significantly based on the inputs.
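
For programmatic use, the model can be called through Replicate's HTTP API or official client libraries. Below is a minimal sketch with the Python client; the input field names are illustrative assumptions (check the model's API schema on Replicate for the exact parameters), and REPLICATE_API_TOKEN must be set in your environment.

    # Minimal sketch: calling the model via the Replicate Python client.
    # The input keys below are assumptions for illustration; consult the
    # model's API schema on Replicate for the actual parameter names.
    import replicate

    output = replicate.run(
        "chenxwh/cosyvoice2-0.5b",
        input={
            "tts_text": "Hello, this is a test of CosyVoice 2.",   # assumed key
            "prompt_text": "Transcript of the reference clip.",    # assumed key
            "prompt_audio": open("reference.wav", "rb"),           # assumed key
        },
    )

    # The client returns a URL or file-like handle to the synthesized audio.
    print(output)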

Readme

CosyVoice 2.0

Compared to version 1.0, CosyVoice 2.0 delivers more accurate, more stable, and faster speech generation with better overall quality.

Multilingual

  • Supported Languages: Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
  • Cross-lingual & Mix-lingual: Supports zero-shot voice cloning for cross-lingual and code-switching scenarios (see the sketch after this list).
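
Zero-shot cloning needs only a short reference clip of the target voice. A minimal sketch with the open-source CosyVoice repo (github.com/FunAudioLLM/CosyVoice), assuming the pretrained CosyVoice2-0.5B weights are in pretrained_models/ and reference.wav is a 16 kHz recording of the voice to clone:

    # Sketch: zero-shot cross-lingual cloning with the open-source repo.
    # Paths and the reference clip are placeholders.
    import torchaudio
    from cosyvoice.cli.cosyvoice import CosyVoice2
    from cosyvoice.utils.file_utils import load_wav

    cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
    prompt_speech_16k = load_wav('reference.wav', 16000)

    # Synthesize text in a language other than the reference clip's;
    # code-switching (mixed-language) text works the same way.
    for i, out in enumerate(cosyvoice.inference_cross_lingual(
            'Hello, this sentence is spoken in the cloned voice.',
            prompt_speech_16k, stream=False)):
        torchaudio.save(f'cross_lingual_{i}.wav', out['tts_speech'],
                        cosyvoice.sample_rate)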

Ultra-Low Latency

  • Bidirectional Streaming Support: CosyVoice 2.0 integrates offline and streaming modeling in a single framework.
  • Rapid First-Packet Synthesis: Achieves first-packet latency as low as 150 ms while maintaining high-quality audio output (see the streaming sketch below).
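
In the repo's Python API, streaming is exposed through the same inference calls. A sketch below, assuming the setup from the previous example: with stream=True the generator yields audio chunks as they are produced, so playback can begin well before the full utterance is synthesized.

    # Sketch: streaming zero-shot synthesis. Each yielded chunk can be
    # pushed to an audio device immediately instead of buffered to disk.
    import torch
    import torchaudio
    from cosyvoice.cli.cosyvoice import CosyVoice2
    from cosyvoice.utils.file_utils import load_wav

    cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
    prompt_speech_16k = load_wav('reference.wav', 16000)

    chunks = []
    for out in cosyvoice.inference_zero_shot(
            'Streaming delivers the first audio packet in about 150 ms.',
            'Transcript of the reference clip.',
            prompt_speech_16k, stream=True):
        chunks.append(out['tts_speech'])   # (1, n_samples) tensor per chunk

    torchaudio.save('streamed.wav', torch.cat(chunks, dim=1),
                    cosyvoice.sample_rate)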

High Accuracy

  • Improved Pronunciation: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
  • Benchmark Achievements: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation benchmark.

Strong Stability

  • Timbre Consistency: Maintains a reliable, consistent voice in zero-shot and cross-lingual speech synthesis.
  • Cross-lingual Synthesis: Markedly improved over version 1.0.

Natural Experience

  • Enhanced Prosody and Sound Quality: Improved alignment of synthesized audio raises MOS scores from 5.4 to 5.53.
  • Emotional and Dialectal Flexibility: Supports finer-grained emotional control and accent adjustment (see the instruction-control sketch below).
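
In the open-source repo, this style control is exposed through instruction-conditioned inference. A sketch below, again assuming the CosyVoice2-0.5B setup from the earlier examples; the instruction wording itself is illustrative.

    # Sketch: instruction-controlled synthesis. A natural-language
    # instruction steers emotion, style, or dialect, while the reference
    # clip still supplies the voice timbre.
    import torchaudio
    from cosyvoice.cli.cosyvoice import CosyVoice2
    from cosyvoice.utils.file_utils import load_wav

    cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
    prompt_speech_16k = load_wav('reference.wav', 16000)

    for i, out in enumerate(cosyvoice.inference_instruct2(
            'It has been so long since we last met; I have missed you.',
            'Speak in a warm, gently excited tone.',   # emotion instruction
            prompt_speech_16k, stream=False)):
        torchaudio.save(f'instruct_{i}.wav', out['tts_speech'],
                        cosyvoice.sample_rate)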