Alibaba Releases Open-Source Wan2.2-S2V Model for Speech-to-Video Generation

Leo Silva

Updated: August 27, 2025

Alibaba has introduced Wan2.2-S2V, an open-source model focused on creating digital human videos from speech inputs. This Speech-to-Video tool animates portrait photos into avatars that can speak, sing, and perform actions.

As part of the Wan2.2 video generation series, the model produces animated videos using a single image and an audio clip. It supports various framing options, such as portrait, bust, and full-body views. The system generates character actions and environmental elements based on prompt instructions, allowing creators to align visuals with specific narrative needs.

The model uses audio-driven animation to create performances, including dialogue and music, and can manage multiple characters in a scene. It accommodates different avatar types, from cartoons and animals to stylized figures.

The model outputs video at 480P and 720P, covering both social media content and professional use.

Example Prompt and Application

Given a prompt such as "In the video, a man is walking alongside the railway tracks, singing to express his emotions as he goes. A train slowly passes by him," the model generates a video that matches the description.

Technical Features

Wan2.2-S2V combines text-guided global motion control with audio-driven local movements, so characters can perform across a range of scenarios. To keep long videos stable, it compresses historical frames into a compact latent representation, which also reduces computational cost.
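The article does not describe how the history compression works internally. As a rough illustration of the general idea only, the toy Python sketch below keeps the most recent frame vectors untouched and averages older ones in groups, so the history grows much more slowly than the video. The function names, the averaging scheme, and the `keep_recent` and `group_size` parameters are all assumptions made for this example, not Wan2.2-S2V's actual method.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def compress_history(
    frames: List[Vector], keep_recent: int = 2, group_size: int = 4
) -> List[Vector]:
    """Toy history compression (hypothetical, not the model's real scheme).

    The last `keep_recent` frames are kept as they are; older frames are
    averaged in chunks of `group_size`, so they take up less space.
    """
    if keep_recent < 0 or group_size < 1:
        raise ValueError("keep_recent must be >= 0 and group_size >= 1")
    split = max(len(frames) - keep_recent, 0)
    older, recent = frames[:split], frames[split:]
    pooled = [
        mean_vector(older[start:start + group_size])
        for start in range(0, len(older), group_size)
    ]
    return pooled + recent


if __name__ == "__main__":
    # Ten stand-in "frame latents", each a 2-element vector.
    history = [[float(i), float(i) * 2] for i in range(10)]
    print(compress_history(history))
    # [[1.5, 3.0], [5.5, 11.0], [8.0, 16.0], [9.0, 18.0]]
```

In this toy version, ten frames shrink to four entries while the two newest frames stay exact, which mirrors the trade-off the article describes: less to compute, with recent context preserved.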

The model was trained on a large-scale audio-visual dataset tailored to film and television contexts, using a multi-resolution approach for flexible formats, including vertical short videos and horizontal productions.
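One common way to handle mixed vertical and horizontal footage in training is aspect-ratio bucketing, where each clip is assigned to the predefined resolution whose shape is closest to its own. The article does not say whether Wan2.2-S2V does this, so the Python sketch below is only an illustration; the bucket sizes and the `nearest_bucket` helper are hypothetical.

```python
import math
from typing import List, Tuple

Size = Tuple[int, int]  # (width, height) in pixels

# Hypothetical 720P-class buckets: horizontal, vertical, and square.
HYPOTHETICAL_BUCKETS: List[Size] = [(1280, 720), (720, 1280), (960, 960)]


def nearest_bucket(
    width: int, height: int, buckets: List[Size] = HYPOTHETICAL_BUCKETS
) -> Size:
    """Return the bucket whose aspect ratio is closest to width:height.

    Ratios are compared in log space so that 2:1 and 1:2 are treated as
    equally far from a square frame.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(
        buckets,
        key=lambda size: abs(math.log(size[0] / size[1]) - target),
    )


if __name__ == "__main__":
    print(nearest_bucket(1080, 1920))  # vertical short video
    print(nearest_bucket(1920, 1080))  # horizontal production
```

Grouping clips this way lets a single model see both vertical and horizontal layouts without cropping every clip to one fixed shape.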

Availability and Impact

The Wan2.2-S2V model is available for download on Hugging Face, GitHub, and Alibaba Cloud’s ModelScope community. Alibaba previously open-sourced Wan2.1 models in February 2025 and Wan2.2 models in July 2025. The Wan series has accumulated over 6.9 million downloads on Hugging Face and ModelScope.

About the Author

Leo Silva

Leo Silva is an Air correspondent from Brazil.