MoCha: Bringing Cinematic Conversations to AI-Generated Characters

Jason Calloway

Updated: April 3, 2025

In the world of AI-generated video, the focus has long been on creating visually striking scenes. However, character-driven storytelling with natural dialogue, emotion, and full-body performance remains a complex challenge. This is where MoCha makes a notable contribution.


Developed by researchers from Meta’s GenAI team in collaboration with the University of Waterloo, MoCha (short for Movie-Grade Talking Character) is designed to generate full-body, speech-driven character videos using only text and audio. Rather than emphasizing visual effects alone, MoCha targets narrative coherence and performance.


Moving Beyond Talking Heads

Many existing models for speech-driven video generation focus primarily on the face, commonly referred to as “talking heads.” While these are suitable for basic communication, they fall short for scenarios requiring expressive body language or multi-character interactions. Key limitations include:

  1. Simplified or repetitive lip and facial movements
  2. Limited or absent body motion
  3. Lack of structured support for conversations involving multiple characters

MoCha is built to address these constraints. It generates synchronized lip motion, expressive facial behavior, and full-body gestures without requiring additional inputs like reference images or pose skeletons. It also supports structured, turn-based multi-character dialogues using a novel character tagging system.


How It Works

MoCha is based on a diffusion transformer (DiT) architecture that jointly processes spoken audio and descriptive text to synthesize coherent video frames. A central innovation is the Speech-Video Window Attention mechanism, which improves lip-sync accuracy by allowing video tokens to attend to localized audio segments. This structure helps balance short-term speech-driven movements (like lip and facial motion) with broader body actions guided by the prompt.
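To make the idea concrete, here is a minimal sketch of windowed cross-attention in which each video token attends only to audio tokens near its aligned time step; the tensor shapes, window size, and function names are illustrative assumptions rather than MoCha's actual implementation.

```python
# Minimal sketch of speech-video window attention: each video token may only
# attend to audio tokens inside a local window around its aligned time step.
# Shapes, window size, and names are assumptions, not MoCha's real code.
import torch
import torch.nn.functional as F


def speech_video_window_mask(num_video_tokens: int,
                             num_audio_tokens: int,
                             window: int = 2) -> torch.Tensor:
    """Boolean mask of shape (V, A): True where attention is allowed."""
    mask = torch.zeros(num_video_tokens, num_audio_tokens, dtype=torch.bool)
    for v in range(num_video_tokens):
        # Map the video token to its nearest audio position, then open a window.
        center = round(v * (num_audio_tokens - 1) / max(num_video_tokens - 1, 1))
        lo, hi = max(0, center - window), min(num_audio_tokens, center + window + 1)
        mask[v, lo:hi] = True
    return mask


def windowed_cross_attention(video_q: torch.Tensor,   # (V, d)
                             audio_k: torch.Tensor,   # (A, d)
                             audio_v: torch.Tensor,   # (A, d)
                             window: int = 2) -> torch.Tensor:
    """Cross-attention where out-of-window audio positions are masked out."""
    d = video_q.shape[-1]
    scores = video_q @ audio_k.T / d ** 0.5                     # (V, A)
    mask = speech_video_window_mask(*scores.shape, window=window)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ audio_v                  # (V, d)
```

The localized mask is what keeps lip and facial motion tied to the speech happening at that moment, while prompt-driven body motion is handled by the rest of the model.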


For generating conversations between multiple characters, MoCha uses structured prompts that define each character once, then refer to them using tags across video clips. This avoids redundant character descriptions and supports consistency across scenes.
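As a concrete, hypothetical illustration of this tagging idea, the sketch below assembles a two-character prompt in Python; the tag syntax, character names, and clip descriptions are assumptions for illustration, not MoCha's actual prompt format.

```python
# Hypothetical structured prompt: define each character once, then reuse tags
# across clips so descriptions are not repeated for every turn of dialogue.
characters = {
    "Person1": "A woman in a red coat, standing in a rainy street at night.",
    "Person2": "An older man holding an umbrella, facing her.",
}

clips = [
    "[Person1] speaks anxiously, gesturing with both hands.",
    "[Person2] listens, then replies calmly while lowering the umbrella.",
]

prompt = "\n".join(
    [f"{tag}: {desc}" for tag, desc in characters.items()] + clips
)
print(prompt)
```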


A Layered Training Process

MoCha’s training process is progressive. It starts with simpler close-up shots where speech-video alignment is most critical and gradually introduces medium and wide shots with more complex body motion and multi-character dynamics. This curriculum-based approach helps the model learn nuanced character behavior.
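As a rough sketch of what such a curriculum could look like in code, the snippet below steps through staged training configurations; the stage names, shot mixes, and focus labels are invented for illustration and are not taken from the paper.

```python
# Hypothetical curriculum schedule: earlier stages use close-up shots where
# speech-video alignment matters most, later stages add wider framing.
CURRICULUM = [
    {"stage": 1, "shots": ["close-up"],           "focus": "lip and facial sync"},
    {"stage": 2, "shots": ["close-up", "medium"], "focus": "upper-body gestures"},
    {"stage": 3, "shots": ["medium", "wide"],     "focus": "full-body, multi-character"},
]

for cfg in CURRICULUM:
    # train_one_stage(model, data_for(cfg["shots"]))  # placeholder for a real loop
    print(f"Stage {cfg['stage']}: {cfg['shots']} -> {cfg['focus']}")
```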


To overcome the limited availability of large-scale speech-annotated video datasets, MoCha is trained jointly on speech-and-text-labeled video (ST2V) and text-only-labeled video (T2V). This joint training strategy improves generalization to diverse actions and environments.
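A minimal sketch of that joint-training idea is shown below, assuming ST2V samples carry (video, text, audio) triples and T2V samples carry (video, text) pairs; the sampling ratio and the empty audio placeholder are illustrative assumptions, not the paper's actual conditioning scheme.

```python
# Sketch of mixing speech+text (ST2V) and text-only (T2V) data in one loop.
import random


def sample_training_example(st2v_data, t2v_data, p_st2v=0.5):
    """Draw from ST2V with probability p_st2v, otherwise from T2V.
    Text-only samples carry an empty audio condition so the same model
    can consume both kinds of data."""
    if random.random() < p_st2v:
        video, text, audio = random.choice(st2v_data)
    else:
        video, text = random.choice(t2v_data)
        audio = None  # placeholder: no speech conditioning for T2V samples
    return video, text, audio
```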


Evaluation and Results

To evaluate its performance, the researchers introduced MoCha-Bench, a benchmark specifically for talking character video generation. Human evaluations assessed performance across five dimensions:

  1. Lip-sync accuracy
  2. Facial expression naturalness
  3. Body motion realism
  4. Alignment with the input text
  5. Overall visual quality

On average, MoCha outperformed existing methods (such as SadTalker, Hallo3, and AniPortrait) across all metrics. Its scores approached, though did not reach, the top of the rating scale for cinematic quality, particularly for lip-sync accuracy and facial expression.


MoCha offers a structured approach to AI-generated character performance, with potential use cases in animation, digital storytelling, educational video content, and virtual assistants. By focusing on synchronized, full-body character generation from only text and audio, it demonstrates an advance in scalable, narrative-driven AI video synthesis.

To explore MoCha in action, visit the official demo page.

Tags: Artificial Intelligence, Data Visualization, Emerging Trends

About the Author

Jason Calloway

Jason Calloway is an AI correspondent based in the United States.
