Microsoft Releases VibeVoice: An Open-Source TTS Model for Multi-Speaker Conversational Audio

Liang Wei

Updated: August 27, 2025

Microsoft Research has introduced VibeVoice, an open-source framework for generating expressive, long-form, multi-speaker conversational audio from text. The model targets scenarios such as podcasts and addresses challenges that traditional Text-to-Speech (TTS) systems face with scalability, speaker consistency, and natural turn-taking.

Core Innovations

VibeVoice incorporates continuous speech tokenizers, both acoustic and semantic, that operate at a frame rate of 7.5 Hz. These tokenizers preserve audio fidelity while keeping computation tractable for very long sequences. The model uses a next-token diffusion framework in which a Large Language Model (LLM) processes textual context and dialogue flow, while a diffusion head produces fine-grained acoustic detail.
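To make that interplay concrete, the following is a minimal numpy sketch of a next-token diffusion loop: a toy autoregressive model emits a hidden state each step, and a small DDPM sampler conditioned on that state produces the next acoustic latent, which is fed back as context for the next step. The shapes, helper names, and untrained random weights here are illustrative assumptions, not VibeVoice's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 64, 8, 50                         # hidden size, latent size, diffusion steps

# Toy random weights standing in for trained networks (illustrative only).
W_lm  = rng.normal(0, 0.1, (H + D, H))      # "LM": maps [hidden, prev latent] -> hidden
W_eps = rng.normal(0, 0.1, (H + D + 1, D))  # noise predictor: [hidden, x_t, t] -> eps

betas = np.linspace(1e-4, 0.02, T)          # standard DDPM noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def lm_step(hidden, prev_latent):
    """One autoregressive step of the toy language model."""
    return np.tanh(np.concatenate([hidden, prev_latent]) @ W_lm)

def predict_eps(hidden, x_t, t):
    """Toy conditional noise predictor for the diffusion head."""
    inp = np.concatenate([hidden, x_t, [t / T]])
    return np.tanh(inp @ W_eps)

def ddpm_sample(hidden):
    """DDPM reverse process: denoise a random latent, conditioned on the LM state."""
    x = rng.normal(size=D)
    for t in reversed(range(T)):
        eps = predict_eps(hidden, x, t)
        coef = (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                           # add noise at every step except the last
            x += np.sqrt(betas[t]) * rng.normal(size=D)
    return x

# Generate a few acoustic latents autoregressively: each frame produced by the
# diffusion head becomes part of the LM context for the following frame.
hidden, latent = np.zeros(H), np.zeros(D)
frames = []
for _ in range(5):
    hidden = lm_step(hidden, latent)
    latent = ddpm_sample(hidden)
    frames.append(latent)
print(np.stack(frames).shape)               # (5, 8): five latent frames
```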

According to the technical report, VibeVoice can generate speech up to 90 minutes long with up to four distinct speakers, surpassing prior models that typically support only one or two.

Training and Architecture Details

The model is built around a transformer-based LLM integrated with acoustic and semantic tokenizers and a diffusion-based decoding head.

  1. LLM: Utilizes Qwen2.5-1.5B for this release.
  2. Acoustic Tokenizer: Based on a σ-VAE variant with a mirror-symmetric encoder-decoder structure, including seven stages of modified Transformer blocks. It achieves 3200x downsampling from 24 kHz input audio, with encoder and decoder components each comprising approximately 340 million parameters.
  3. Semantic Tokenizer: Features an encoder that mirrors the acoustic tokenizer's architecture (excluding VAE components) and is trained using an Automatic Speech Recognition (ASR) proxy task.
  4. Diffusion Head: A lightweight module with four layers and about 123 million parameters, conditioned on LLM hidden states. It predicts acoustic VAE features via Denoising Diffusion Probabilistic Models (DDPM), employing Classifier-Free Guidance (CFG) and DPM-Solver (including variants) during inference.
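The classifier-free guidance cited for the diffusion head is a standard technique: the noise predictor runs once with and once without the conditioning signal, and the two estimates are extrapolated by a guidance weight. Below is a hedged sketch of that formula; the stand-in predictor and the weight w = 1.3 are arbitrary choices for illustration, not values from the report.

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_eps(x_t, t, cond):
    """Stand-in noise predictor; cond=None emulates the dropped condition."""
    base = np.sin(x_t + t)
    return base if cond is None else base + 0.1 * cond

def guided_eps(x_t, t, cond, w=1.3):
    """Classifier-free guidance: push the unconditional estimate toward the
    conditional one by guidance weight w (w=0 is unconditional, w=1 is
    purely conditional, w>1 over-emphasizes the condition)."""
    eps_u = predict_eps(x_t, t, cond=None)
    eps_c = predict_eps(x_t, t, cond=cond)
    return eps_u + w * (eps_c - eps_u)

x_t = rng.normal(size=8)    # a noisy acoustic latent
cond = rng.normal(size=8)   # e.g. a projection of the LLM hidden state
print(guided_eps(x_t, 10, cond))
```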

The context window is extended progressively during training, up to 65,536 tokens. Training proceeds in stages: the acoustic and semantic tokenizers are pre-trained separately; then, during VibeVoice training, they are frozen and only the LLM and diffusion head are updated. A curriculum learning schedule scales input sequence lengths from 4k to 16k, 32k, and finally 64k tokens. Text is handled by the LLM's default tokenizer, while audio passes through the acoustic and semantic tokenizers.
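These figures are mutually consistent: the 3200x downsampling of 24 kHz audio yields the 7.5 Hz frame rate, and at 7.5 Hz a 90-minute session occupies 90 × 60 × 7.5 = 40,500 latent frames, which fits inside the final 65,536-token context window. The sketch below checks that arithmetic and shows a hypothetical curriculum schedule; only the sequence lengths come from the report, while the steps-per-stage values are assumptions.

```python
# Context-budget arithmetic from the reported figures.
frame_rate_hz = 24_000 / 3_200            # 3200x downsampling of 24 kHz audio = 7.5 Hz
frames_90_min = 90 * 60 * frame_rate_hz   # latent frames needed for 90 minutes
print(frame_rate_hz, frames_90_min)       # 7.5  40500.0  -> within a 65,536 context

# Hypothetical curriculum mirroring the reported 4k -> 64k stages;
# the steps-per-stage values are illustrative assumptions.
curriculum = [
    {"max_seq_len": 4_096,  "steps": 10_000},
    {"max_seq_len": 16_384, "steps": 10_000},
    {"max_seq_len": 32_768, "steps": 10_000},
    {"max_seq_len": 65_536, "steps": 10_000},
]
for stage in curriculum:
    print(f"train with sequences up to {stage['max_seq_len']} tokens")
```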

Intended Uses and Restrictions

VibeVoice is restricted to research purposes focused on highly realistic audio dialogue generation, as outlined in the technical report.

Out-of-scope applications include any use that violates laws or regulations (including trade compliance) or breaches the MIT License; the model is also not intended for generating text transcripts. Prohibited scenarios encompass:

  1. Voice impersonation without explicit, recorded consent, such as cloning voices for satire, advertising, ransom, social engineering, or bypassing authentication.
  2. Creating disinformation or impersonating real people or events.
  3. Real-time or low-latency voice conversion, like in telephone or video-conference applications.
  4. Generating output in unsupported languages; the model is trained only on English and Chinese data, and other languages may yield unintelligible or offensive results.
  5. Producing background ambience, Foley effects, or music, as VibeVoice is speech-only and does not generate coherent non-speech audio.

Risks, Limitations, and Recommendations

The model may produce unexpected, biased, or inaccurate outputs, inheriting any biases, errors, or omissions from its base LLM (Qwen2.5-1.5B). Key risks include the potential for deepfakes and disinformation, where high-quality synthetic speech could be misused for impersonation, fraud, or spreading false information. Users are advised to verify transcript reliability, ensure content accuracy, and avoid misleading applications.

Additional limitations:

  1. Supports only English and Chinese; other languages may lead to unexpected audio.
  2. Focuses exclusively on speech synthesis, excluding background noise, music, or other effects.
  3. Does not model or generate overlapping speech in conversations.
  4. Microsoft does not recommend deploying VibeVoice in commercial or real-world settings without additional testing and development. It is intended solely for research and should be used responsibly.

To address misuse risks, the following measures are implemented:

  1. An audible disclaimer (e.g., “This segment was generated by AI”) is automatically embedded in every synthesized audio file.
  2. An imperceptible watermark is added to verify provenance.
  3. Inference requests are logged (hashed) for detecting abuse patterns, with aggregated statistics published quarterly.
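As an illustration of the third measure, a salted hash lets a service log enough to spot repeated or coordinated requests without retaining raw prompts. The sketch below uses only Python's standard library; the logged fields and salting scheme are assumptions, not Microsoft's actual pipeline.

```python
import hashlib
import json
import time

SALT = b"rotate-me-regularly"   # assumption: salting resists precomputed-hash lookups

def log_inference(prompt: str, user_id: str) -> dict:
    """Store only salted hashes of the request, never the raw text."""
    record = {
        "prompt_sha256": hashlib.sha256(SALT + prompt.encode("utf-8")).hexdigest(),
        "user_sha256": hashlib.sha256(SALT + user_id.encode("utf-8")).hexdigest(),
        "timestamp": int(time.time()),
    }
    # Identical prompt digests recurring across many users can flag
    # coordinated misuse without exposing what was actually synthesized.
    print(json.dumps(record))
    return record

log_inference("Speaker 1: Hello and welcome to the show.", "user-42")
```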

Users must source datasets legally and ethically, including securing rights and anonymizing data as needed, while considering privacy concerns. It is recommended to disclose AI usage when sharing generated content.

About the Author

Liang Wei is our AI correspondent from China.
