
Bland Introduces a New Approach to Text-to-Speech: The First Voice AI to Cross the Uncanny Valley

Chinedu Chimamora


Updated:
June 7, 2025

Bland has released a new text-to-speech (TTS) system that diverges from conventional pipeline-based models in favor of a large language model (LLM) architecture designed to predict audio directly from text input. Rather than improving traditional steps like phoneme conversion and waveform synthesis, the system generates speech holistically, capturing prosody, timing, and emotional tone in a single model output.


Key Technical Differences

Unlike traditional TTS systems that break down speech generation into multiple stages, Bland's architecture treats speech as a generative task. The model directly predicts audio token sequences conditioned on text input, bypassing intermediate representations such as phonemes or prosody vectors.

  1. Architecture: The system builds on a transformer-based decoder model, adapted to predict audio tokens rather than text. It uses a SNAC (Multi-Scale Neural Audio Codec) tokenizer to convert continuous audio into discrete tokens across multiple resolution levels, enabling the capture of both coarse and fine acoustic details.
  2. Training Format: Training is conducted using paired examples of text transcripts and corresponding audio token sequences. These are formatted similarly to chat-style prompts to take advantage of few-shot learning capabilities; a minimal sketch of this formatting follows the list.
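
To make the text-to-audio-token framing concrete, the sketch below shows one way such paired examples could be flattened into a chat-style sequence for a decoder-only model. Bland has not published its prompt format, so the special-token names, the SNAC token notation, and the helper function are assumptions for illustration only.

# Illustrative sketch, not Bland's published format: token markers and the
# helper below are assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    text: str                 # transcript for one utterance
    audio_tokens: List[int]   # discrete codes from a SNAC-style codec

def build_training_sequence(examples: List[Example]) -> str:
    """Flatten (text, audio-token) pairs into one chat-style sequence.

    The decoder is trained to continue the sequence, i.e. to predict the audio
    tokens that follow each transcript, so inference reduces to prompting with
    text and sampling audio tokens.
    """
    parts = []
    for ex in examples:
        audio = " ".join(f"<a_{t}>" for t in ex.audio_tokens)  # audio codes treated as vocabulary tokens
        parts.append(f"<|user|> {ex.text} <|assistant|> {audio} <|end|>")
    return "\n".join(parts)

# Few-shot prompt: prior (text, audio) pairs condition the voice and style,
# and the final transcript awaits its audio-token continuation.
prompt = build_training_sequence([Example("Hi, thanks for calling.", [412, 87, 903])])
prompt += "\n<|user|> How can I help you today? <|assistant|>"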


Data Scale and Quality

The effectiveness of the model is supported by a large proprietary dataset that exceeds the size and quality of most publicly available TTS datasets. It includes:

  1. Multi-million-hour corpus of two-channel conversational recordings
  2. Time-aligned utterance-level transcriptions
  3. Speaker role metadata
  4. Markers for conversational context (e.g., turn-taking, interruptions)
  5. Terminology spanning multiple industries

This dataset enables the model to learn conversational patterns and speaker dynamics with a higher degree of nuance.
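
As a rough illustration of what one record in such a corpus might carry, the sketch below combines the properties listed above (two channels, time-aligned utterances, speaker roles, turn-taking and interruption markers, domain terminology). The field names and layout are hypothetical; Bland has not published its dataset schema.

# Hypothetical record layout, for illustration only.
record = {
    "call_id": "example-0001",
    "channels": {"left": "agent", "right": "customer"},   # speaker role metadata
    "utterances": [
        {"speaker": "agent",    "start_s": 0.00, "end_s": 2.10,
         "text": "Thanks for calling, how can I help?"},
        {"speaker": "customer", "start_s": 1.95, "end_s": 4.30,   # overlap marks an interruption
         "text": "Hi, yes, I'm calling about my invoice.",
         "events": ["interruption"]},
    ],
    "domain": "billing",   # industry-specific terminology tag
}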

Style and Sound Control

The system enables voice style control and sound integration through a few key mechanisms:

  1. In-context learning: Few-shot examples guide the model toward specific voices or styles.
  2. Special tokens: Inputs can include markers such as <excited> or <calm> to influence speech delivery.
  3. Effect markers: Non-speech sounds like <barking> can be reproduced when examples are provided during training.

The model generalizes stylistic and contextual cues from examples, without requiring exhaustive manual labeling.
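
The sketch below shows how in-context examples and a style marker might be combined into a single prompt. The <excited> and <calm> markers come from the article; the prompt layout and the helper function are assumptions meant only to illustrate the mechanism.

# Sketch of style control via in-context examples and special tokens
# (prompt layout and helper are assumptions, not Bland's documented API).
def build_styled_prompt(reference_pairs, text, style=None):
    """Prepend few-shot (transcript, audio-token) pairs, then request new audio.

    reference_pairs: list of (transcript, audio_token_string) in the target voice.
    style: optional marker such as "<excited>" or "<calm>".
    """
    lines = [f"<|user|> {t} <|assistant|> {a} <|end|>" for t, a in reference_pairs]
    styled_text = f"{style} {text}" if style else text
    lines.append(f"<|user|> {styled_text} <|assistant|>")   # model continues with audio tokens
    return "\n".join(lines)

prompt = build_styled_prompt(
    [("Welcome back!", "<a_12> <a_907> <a_44>")],
    "Your order has shipped.",
    style="<excited>",
)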


Voice Blending and Effect Reproduction

Voice blending is achieved by providing context examples from multiple voice styles. The output reflects characteristics based on the prominence of each voice in the prompt. Similarly, non-verbal sounds are produced by associating textual labels with acoustic patterns during training.
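
One way to read "prominence of each voice in the prompt" is that repeating a voice's examples more often biases the output toward it. The sketch below illustrates that reading; the weighting scheme, token strings, and helper name are illustrative assumptions.

# Sketch of voice blending through prompt composition (assumed mechanism).
voice_a = [("Good morning.", "<a_31> <a_8> <a_220>")]    # examples in voice A
voice_b = [("Good morning.", "<a_77> <a_502> <a_19>")]   # examples in voice B

def blend_prompt(examples_a, examples_b, weight_a, text):
    """Weight a voice by repeating its examples more often in the context."""
    context = examples_a * weight_a + examples_b
    lines = [f"<|user|> {t} <|assistant|> {a} <|end|>" for t, a in context]
    lines.append(f"<|user|> {text} <|assistant|>")
    return "\n".join(lines)

prompt = blend_prompt(voice_a, voice_b, weight_a=2, text="Thanks for holding.")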


Bland’s new TTS engine represents a shift toward unified speech generation driven by large language models. By training the model to predict audio directly from text in context-rich formats, it enables a more cohesive and expressive synthesis process. The system’s effectiveness is underpinned by large-scale, well-structured training data and architectural enhancements that support prosodic, stylistic, and emotional nuance. While still being refined, this approach presents a new direction for voice AI rooted in pattern learning rather than piecemeal processing.


