Introducing SAPO: A New Approach for More Stable Reinforcement Learning in AI Models

Aremi Olu

Updated: December 12, 2025

A new research paper from the Qwen team introduces a method aimed at improving a core technique used to train large language models (LLMs). The method, named Soft Adaptive Policy Optimization (SAPO), addresses specific challenges in reinforcement learning (RL) training.

The Challenge in Current RL Training

Reinforcement learning is used to refine AI models, helping them improve at tasks like solving math problems or writing code. A common technique involves sampling multiple model responses to a single prompt, comparing their quality, and using that feedback to update the model. A persistent issue in this process is managing "importance ratios," which measure how much the model being trained has deviated from the model that generated the training data. When these ratios fluctuate too much—a common occurrence in large or specialized models—the training updates can become noisy and unstable.
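For intuition, here is a minimal sketch of how a per-token importance ratio is typically computed from the log-probabilities of the current policy and the policy that generated the samples. The variable names and numbers are illustrative assumptions, not taken from the paper.

```python
import math

# Illustrative only: per-token importance ratios compare the policy being
# trained ("new") with the policy that generated the responses ("old").
# The log-probabilities below are made-up values for a 4-token response.
old_logprobs = [-1.20, -0.45, -2.10, -0.80]   # log pi_old(token | context)
new_logprobs = [-1.05, -0.50, -1.30, -0.85]   # log pi_new(token | context)

# ratio_t = pi_new(token_t) / pi_old(token_t) = exp(logp_new - logp_old)
ratios = [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]
print(ratios)  # values far from 1.0 indicate the trained model has drifted
```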

Current methods like GRPO and GSPO try to control this by using "hard clipping," which completely ignores gradients (the signals used for learning) if they fall outside a set range. This approach has two main drawbacks:

· It can discard useful learning signals.

· It creates a difficult trade-off: a narrow clipping range may ignore too much information, while a wide range can let in destabilizing noise.
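To make that trade-off concrete, the sketch below shows the standard PPO-style hard clipping that GRPO-family methods build on: ratios outside a fixed band contribute no gradient at all. The clipping range and the helper function are illustrative assumptions, not code from the paper.

```python
import torch

def hard_clipped_objective(ratios, advantages, eps=0.2):
    """PPO-style hard clipping: ratios outside [1 - eps, 1 + eps] are cut off,
    so tokens that drift too far contribute no gradient at all."""
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the minimum discards the (possibly informative) unclipped signal
    # whenever it would push the policy further outside the trusted band.
    return torch.min(unclipped, clipped).mean()

ratios = torch.tensor([0.95, 1.05, 1.60])           # third token has drifted a lot
advantages = torch.tensor([1.0, 1.0, 1.0])
loss = -hard_clipped_objective(ratios, advantages)  # negate to minimize
```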


How SAPO Proposes a Different Path

The SAPO method replaces the hard clipping mechanism with a smooth, adaptive function. Instead of an abrupt cutoff, SAPO gradually reduces the influence of an update signal as it becomes less trustworthy. The design includes several features, illustrated by a simplified sketch after this list:

· Continuous Adjustment: It avoids sudden discontinuities in how learning signals are handled.

· Token-Level Focus: If only a few words in a long response are problematic, SAPO can down-weight just those parts, whereas methods like GSPO might discard the learning from the entire sequence.

· Asymmetric Design: It applies different sensitivity levels for positive and negative feedback, which the researchers found improves training stability for models with large vocabularies.
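As an intuition for replacing the hard cutoff with smooth down-weighting, here is a hedged sketch of a sigmoid-style soft gate applied per token. The gate shape, the temperature values, and the way asymmetry is handled for positive versus negative advantages are illustrative assumptions, not the exact formulation from the paper.

```python
import torch

def soft_adaptive_weight(ratios, advantages, tau_pos=0.3, tau_neg=0.15):
    """Illustrative soft gate: instead of zeroing gradients outside a band,
    smoothly shrink a token's weight as its ratio drifts away from 1.0.
    Different temperatures for positive vs. negative advantages mimic the
    asymmetric design described in the article (values are assumptions)."""
    drift = (ratios - 1.0).abs()
    tau = torch.where(advantages >= 0, torch.tensor(tau_pos), torch.tensor(tau_neg))
    # Sigmoid-shaped gate: ~1.0 when the ratio is near 1, decaying continuously
    # (no discontinuity) as the token drifts further from the sampling policy.
    return 2.0 * torch.sigmoid(-drift / tau)

ratios = torch.tensor([1.02, 0.70, 1.60])      # only the drifted tokens shrink
advantages = torch.tensor([1.0, 1.0, -1.0])
weights = soft_adaptive_weight(ratios, advantages)
objective = (weights * ratios * advantages).mean()
```

Because the gate acts per token, a single drifted token in a long response is down-weighted on its own, while well-behaved tokens keep their full learning signal.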

Reported Experimental Outcomes

According to the paper, SAPO was tested against existing methods like GSPO and GRPO on mathematical reasoning and multimodal tasks. The reported findings indicate:

· Improved Stability: SAPO maintained more stable training over longer periods.

· Performance Gains: On benchmarks including AIME25 (math) and LiveCodeBench v6 (coding), models trained with SAPO achieved higher scores.

· Architectural Flexibility: The method showed consistent benefits across both standard and Mixture-of-Experts model architectures.

Implications and Availability

The research suggests SAPO could offer a more practical and stable component for future RL training pipelines. By providing a smoother way to balance learning signals and stability, it may help in the development of more capable and reliable models.

For complete details, including theoretical analysis and full experimental results, the technical paper "Soft Adaptive Policy Optimization" is available on arXiv.


About the Author

Aremi Olu

Aremi Olu is an AI news correspondent from Nigeria.
