jichengdu/cosyvoice | Run with an API on Replicate

CosyVoice2-0.5B: Scalable Streaming Speech Synthesis with Large Language Models

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.
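Since the page advertises running the model through the Replicate API, a minimal Python sketch of such a call might look like the following. It assumes the third-party `replicate` client is installed and that `REPLICATE_API_TOKEN` is set in the environment. The input field names (`text`, `prompt_audio`) are hypothetical placeholders, not confirmed by this page; the model's API tab lists the real input schema. Depending on the client version, the model reference may also need a `:<version>` suffix.

```python
from typing import Any, Dict, Optional

# Model reference taken from the page title. Some client versions expect
# "owner/name:<version>" instead; the version hash is not given on this page.
MODEL_REF = "jichengdu/cosyvoice"


def build_input(text: str, prompt_audio_url: Optional[str] = None) -> Dict[str, Any]:
    """Assemble an input payload for the model.

    NOTE: the keys "text" and "prompt_audio" are hypothetical placeholders;
    check the model's published input schema for the actual field names.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    payload: Dict[str, Any] = {"text": text}
    if prompt_audio_url is not None:
        # Assumed: a reference recording used for zero-shot voice cloning.
        payload["prompt_audio"] = prompt_audio_url
    return payload


def synthesize(text: str, prompt_audio_url: Optional[str] = None) -> Any:
    """Send one synthesis request via the Replicate Python client."""
    # Imported lazily so build_input can be used without the package installed.
    import replicate  # third-party: pip install replicate

    return replicate.run(MODEL_REF, input=build_input(text, prompt_audio_url))
```

Keeping payload construction separate from the network call makes it possible to validate inputs locally before spending GPU time on a request.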

Readme

CosyVoice 2.0-0.5B

Multilingual Support

Supported Languages: Chinese, English, Japanese, Korean, Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
Cross-lingual & Mixed-lingual: Supports zero-shot voice cloning for cross-language and code-switching scenarios.

Ultra-Low Latency

Bidirectional Streaming Support: CosyVoice 2.0 integrates offline and streaming modeling technologies.
Rapid First Packet Synthesis: Achieves first-packet latency as low as 150 ms while maintaining high-quality audio output.
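First-packet latency is the time between issuing a request and receiving the first chunk of audio. The sketch below shows one way to measure it for any lazily evaluated stream of audio chunks. The helper names `first_packet_latency` and `fake_stream` are illustrative only and are not part of CosyVoice; the fake stream stands in for a real streaming synthesis call.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def first_packet_latency(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Consume a chunk stream and return (seconds until first chunk, total bytes).

    The stream must be lazy (for example a generator), so that work only
    starts once iteration begins. Returns (None, 0) if no chunk arrives.
    """
    start = clock()
    first: Optional[float] = None
    total = 0
    for chunk in stream:
        if first is None:
            first = clock() - start  # time to first packet
        total += len(chunk)
    return first, total


def fake_stream(delay_s: float = 0.01, chunks: int = 3, size: int = 320) -> Iterator[bytes]:
    """Stand-in for a streaming synthesizer: yields silent chunks after a delay."""
    for _ in range(chunks):
        time.sleep(delay_s)
        yield b"\x00" * size


latency, total_bytes = first_packet_latency(fake_stream())
print(f"first packet after {latency * 1000:.1f} ms, {total_bytes} bytes total")
```

A measurement taken this way over a network includes transport and queueing time, so it will generally exceed a model-side latency figure such as the 150 ms quoted above.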

High Accuracy

Improved Pronunciation: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
Benchmark Achievements: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
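For readers unfamiliar with the metric, character error rate (CER) is the character-level edit distance between a reference text and a hypothesis, divided by the reference length. The sketch below is a generic illustration, not the Seed-TTS evaluation script: it assumes the hypothesis transcript is already available as a string, and the numbers passed to `relative_reduction` are made-up values chosen only to show what a reduction in the 30% to 50% range looks like.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which an error rate dropped relative to the old rate."""
    return (old_rate - new_rate) / old_rate


# One wrong character out of four gives a CER of 0.25.
print(cer("你好世界", "你好世届"))
# Illustrative numbers only: an error rate falling from 10% to 6%
# is a relative reduction of about 40%.
print(round(relative_reduction(0.10, 0.06), 2))
```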

Strong Stability

Timbre Consistency: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
Cross-language Synthesis: Shows significant improvements compared to version 1.0.

Natural Experience

Enhanced Prosody and Sound Quality: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
Emotional and Dialectal Flexibility: Now supports more granular emotional controls and accent adjustments.

Model created over 1 year ago