VibeVoice: A Novel Framework for Long-Form, Multi-Speaker Text-to-Speech (2025)

VibeVoice is an innovative TTS framework that generates up to 90 minutes of coherent, multi‑speaker expressive speech from text. It uses ultra-compressed acoustic tokenization and LLM-driven diffusion decoding to maintain speaker identity and natural flow across long-form audio. Ethical safeguards like disclaimers and watermarking are built in, and the model is currently released for research use only.

[Figure: Illustration of multi-speaker AI text-to-speech — clown, astronaut, and farmer characters with voice]

Overview

VibeVoice pushes the boundaries of TTS by handling multi-speaker long-form generation (podcast/audiobook length) with remarkable quality and consistency.

Key Contributions

  • Ultra-low frame rate tokenizer: Compresses thousands of audio samples per second into just a few tokens (~80× compression) while preserving audio fidelity.

  • LLM + diffusion architecture: Leverages a pretrained LLM combined with diffusion-based decoding to generate fluent, context-aware acoustic features.

  • Multi-speaker and long-form capability: Sustains up to four distinct speaker voices across a continuous 90-minute output.
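The long-form claim above rests on simple arithmetic: at a few acoustic tokens per second, even a 90-minute session stays within a modern LLM's context window. A quick sketch (the 7.5 tokens/s rate is a hypothetical figure for illustration; the paper's exact rate may differ):

```python
def context_tokens(minutes: float, tokens_per_second: float) -> int:
    """Number of acoustic tokens the LLM must attend over for a session."""
    return int(minutes * 60 * tokens_per_second)

# At a hypothetical 7.5 tokens/s, 90 minutes of audio is ~40k tokens,
# i.e. a sequence length a pretrained LLM can realistically handle.
print(context_tokens(90, 7.5))
```

By contrast, raw 24 kHz samples over 90 minutes would be ~130 million values, which is why aggressive tokenization is the enabling step here.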

Method (high-level)

Text and speaker/style embeddings are fed into a pretrained LLM. Its output is processed through a diffusion head that synthesizes the acoustic tokens. The ultra-low frame rate tokenizer maps acoustic signals into compressed representations that the diffusion decoder reconstructs into speech.
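The pipeline described above can be sketched as a toy end-to-end loop. Every class and function name here is a hypothetical stand-in, not the actual VibeVoice API; the shapes and the ~80× upsampling factor merely mirror the numbers in this article:

```python
import numpy as np

rng = np.random.default_rng(0)

def llm_hidden_states(text_tokens, speaker_emb):
    # Stand-in for the pretrained LLM: maps text tokens plus a speaker
    # embedding to one hidden state per compressed acoustic frame.
    n_frames = len(text_tokens) * 2              # toy text-to-frame ratio
    return rng.standard_normal((n_frames, 64)) + speaker_emb

def diffusion_head(hidden, steps=10):
    # Stand-in for diffusion decoding: iteratively denoises a latent
    # toward a target conditioned on the LLM hidden states.
    x = rng.standard_normal(hidden.shape)
    for _ in range(steps):
        x = x - 0.1 * (x - hidden)               # toy denoising update
    return x

def tokenizer_decode(latents, upsample=80):
    # Stand-in for the tokenizer's decoder: expands each compressed
    # frame back into many waveform samples (~80x, per the text above).
    return np.repeat(latents.mean(axis=1), upsample)

speaker = rng.standard_normal(64)
hidden = llm_hidden_states([1, 2, 3, 4], speaker)  # 4 text tokens -> 8 frames
latents = diffusion_head(hidden)
audio = tokenizer_decode(latents)
print(audio.shape)  # (640,)
```

The point of the sketch is the division of labor: the LLM handles long-range context and speaker identity, while the diffusion head and tokenizer decoder handle acoustic detail.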

Results & Capabilities

  • Supports generation of up to 90 minutes of multi-speaker speech while keeping speaker identities consistent.

  • Larger model variants (e.g., 7B parameters) deliver richer, more expressive audio, at a reduced maximum duration of roughly 45 minutes.

  • Achieves high quality on short utterance benchmarks, with top-tier metrics for naturalness and clarity.
