VibeVoice: A Novel Framework for Long-Form, Multi-Speaker Text-to-Speech (2025)
VibeVoice is an innovative TTS framework that generates up to 90 minutes of coherent, expressive multi-speaker speech from text. It uses ultra-compressed acoustic tokenization and LLM-driven diffusion decoding to maintain speaker identity and natural flow across long-form audio. Ethical safeguards such as disclaimers and watermarking are built in, and the model is currently released for research use only.
Overview
VibeVoice pushes the boundaries of TTS by handling multi-speaker long-form generation (podcast/audiobook length) with remarkable quality and consistency.
Key Contributions
Ultra-low frame rate tokenizer: Compresses thousands of raw audio samples into just a few acoustic tokens per second (roughly 80× fewer tokens than conventional neural codecs) while preserving audio fidelity; see the arithmetic sketch after this list.
LLM + diffusion architecture: Pairs a pretrained LLM with a diffusion-based decoder to generate fluent, context-aware acoustic features.
Multi-speaker and long-form capability: Sustains up to four distinct speaker voices across a continuous 90-minute output.
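As a back-of-the-envelope illustration of the compression at work, the sketch below computes how many raw samples each acoustic token covers and how the token rate compares to a conventional codec. The 24 kHz sample rate, 7.5 Hz token rate, and 600 tokens/sec baseline are illustrative assumptions, not figures taken from this summary.

```python
# Back-of-the-envelope compression arithmetic for an ultra-low frame rate
# tokenizer. All rates here are illustrative assumptions.
SAMPLE_RATE_HZ = 24_000   # raw audio samples per second (assumed)
TOKEN_RATE_HZ = 7.5       # acoustic tokens per second (assumed)
BASELINE_RATE_HZ = 600    # tokens/sec for a typical neural codec (assumed)

samples_per_token = SAMPLE_RATE_HZ / TOKEN_RATE_HZ
print(f"samples per token: {samples_per_token:,.0f}")  # 3,200

relative_compression = BASELINE_RATE_HZ / TOKEN_RATE_HZ
print(f"~{relative_compression:.0f}x fewer tokens than the baseline")  # ~80x
```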
Method (high-level)
Text tokens and speaker/style embeddings are fed into a pretrained LLM, which produces contextual hidden states. A diffusion head conditioned on those hidden states synthesizes the next acoustic token, which is fed back into the LLM's input sequence, so generation proceeds autoregressively. The ultra-low frame rate tokenizer defines this compressed acoustic space: its encoder maps raw audio into tokens, and its decoder reconstructs generated tokens back into speech. A minimal sketch of the loop follows.
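The sketch below is a minimal, self-contained illustration of the LLM-plus-diffusion-head pattern described above, not the released implementation: ToyLLM, DiffusionHead, the single-line denoising update, and all dimensions are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class ToyLLM(nn.Module):
    """Stand-in for the pretrained LLM: any module mapping a sequence of
    embeddings (B, T, D) to contextual hidden states (B, T, D)."""
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(x)
        return out

class DiffusionHead(nn.Module):
    """Hypothetical diffusion head: refines a Gaussian sample into one
    continuous acoustic token, conditioned on an LLM hidden state."""
    def __init__(self, dim: int, steps: int = 10):
        super().__init__()
        self.denoiser = nn.Linear(2 * dim, dim)
        self.steps = steps

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x = torch.randn(h.size(0), h.size(-1))
        for _ in range(self.steps):  # toy iterative denoising
            x = x - 0.1 * self.denoiser(torch.cat([h, x], dim=-1))
        return x

@torch.no_grad()
def generate(llm: nn.Module, head: DiffusionHead,
             prompt: torch.Tensor, n_tokens: int) -> torch.Tensor:
    """Autoregressive loop: the LLM contextualizes the text/speaker prompt
    plus previously generated acoustic tokens; the diffusion head emits the
    next token, which is appended to the sequence."""
    seq, acoustic = prompt, []
    for _ in range(n_tokens):
        h = llm(seq)[:, -1, :]                 # hidden state at last position
        tok = head(h)                          # next acoustic token (B, D)
        acoustic.append(tok)
        seq = torch.cat([seq, tok.unsqueeze(1)], dim=1)
    return torch.stack(acoustic, dim=1)        # (B, n_tokens, D)

# Tiny smoke test with random "text + speaker" prompt embeddings.
D = 64
tokens = generate(ToyLLM(D), DiffusionHead(D), torch.randn(2, 8, D), n_tokens=5)
print(tokens.shape)  # torch.Size([2, 5, 64])
```

In the real system the generated tokens would finally pass through the tokenizer's decoder to produce a waveform; that stage is omitted here.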
Results & Capabilities
Supports generation of up to 90 minutes of multi-speaker speech while keeping speaker identities consistent; the token-budget sketch at the end of this section shows why the compressed token rate makes such lengths feasible.
Larger model variants (e.g., 7B parameters) deliver richer, more expressive audio, at the cost of a shorter maximum duration of roughly 45 minutes.
Achieves high quality on short-utterance benchmarks, with competitive scores for naturalness and clarity.
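To see why the compressed token rate enables such long outputs, the arithmetic below computes the token budget a 90-minute session implies, reusing the same illustrative 7.5 tokens/sec rate as above and an assumed ~600 tokens/sec conventional codec for comparison.

```python
# Token budget for 90 minutes of audio (rates are illustrative assumptions).
TOKEN_RATE_HZ = 7.5   # ultra-low frame rate tokenizer (assumed)
CODEC_RATE_HZ = 600   # typical neural codec rate (assumed)
MINUTES = 90

low_rate_tokens = MINUTES * 60 * TOKEN_RATE_HZ
codec_tokens = MINUTES * 60 * CODEC_RATE_HZ
print(f"{low_rate_tokens:,.0f} acoustic tokens at 7.5 tokens/sec")  # 40,500
print(f"{codec_tokens:,} tokens at a conventional codec rate")      # 3,240,000
```

At the ultra-low rate the entire 90-minute session stays within the reach of a long-context LLM window; at conventional codec rates it would not come close.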