INTELLECT-2: A 32B Parameter Model Trained Through Globally Distributed Reinforcement Learning

omar ali

Updated: May 14, 2025

Prime Intellect has released INTELLECT-2, a 32-billion parameter language model trained using a novel decentralized approach to reinforcement learning (RL). Unlike traditional RL training, which relies on centralized GPU clusters, INTELLECT-2 was developed through asynchronous RL across a global network of permissionless compute contributors. This release demonstrates the feasibility of distributed RL for large language models (LLMs).


Training Infrastructure

To enable this distributed training, Prime Intellect developed several open-source components:

  1. PRIME-RL: A framework for asynchronous RL that decouples rollout generation, model training, and weight broadcasting (a minimal sketch of this decoupled loop appears below). It supports training over diverse, unreliable networks, using PyTorch FSDP2 for training, vLLM for inference, and the GENESYS schema for verification.
  2. SHARDCAST: A library that efficiently distributes large model weight files to decentralized inference workers via an HTTP-based tree-topology network.
  3. TOPLOC: A locality-sensitive hashing method to verify model outputs, detecting unauthorized changes or precision errors across varied GPU hardware (a toy illustration of this verification idea appears below). Rollouts are validated using signed URLs and on-chain events, ensuring security and transparency.
  4. Protocol Testnet: A Rust-based system that coordinates global compute resources, enabling nodes to auto-register, undergo hardware checks, and receive task assignments.

These components enable INTELLECT-2’s training across a dynamic, global compute network.
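To make the decoupling concrete, the following is a minimal single-process sketch of the idea: rollout generation, training, and weight broadcasting run as independent loops that communicate only through queues, so a slow or dropped inference worker never stalls the training step. All function and variable names here are illustrative placeholders, not the PRIME-RL API.

    import queue
    import threading

    rollout_queue = queue.Queue(maxsize=64)  # completed rollouts from inference workers
    weight_queue = queue.Queue(maxsize=2)    # fresh policy checkpoints awaiting broadcast

    def generate_rollout(weights, task):
        # Placeholder for an inference call (vLLM in the real system).
        return {"task": task, "completion": "...", "reward": 1.0}

    def train_step(batch):
        # Placeholder for an optimizer step (FSDP2 in the real system).
        return {"version": len(batch)}

    def broadcast(weights):
        # Placeholder for shipping weights to remote workers (SHARDCAST's role).
        pass

    def rollout_loop(tasks, weights):
        for task in tasks:                       # uses whatever weights it last received
            rollout_queue.put(generate_rollout(weights, task))

    def train_loop(batch_size=4, steps=2):
        for _ in range(steps):
            batch = [rollout_queue.get() for _ in range(batch_size)]
            weight_queue.put(train_step(batch))  # training never waits on broadcasting

    def broadcast_loop(steps=2):
        for _ in range(steps):
            broadcast(weight_queue.get())

    threads = [
        threading.Thread(target=rollout_loop, args=(range(8), {"version": 0})),
        threading.Thread(target=train_loop),
        threading.Thread(target=broadcast_loop),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()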
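The verification idea behind TOPLOC can also be illustrated with a toy example: rather than requiring bit-identical outputs, which different GPU types will not produce, a verifier compares a locality-sensitive fingerprint of the claimed and recomputed outputs, tolerating small numerical drift while rejecting substituted outputs. This is not TOPLOC's actual algorithm; the fingerprint, names, and thresholds below are hypothetical.

    import numpy as np

    def fingerprint(logits: np.ndarray, k: int = 8) -> np.ndarray:
        # Top-k token indices per position: stable under small numerical noise.
        return np.argsort(logits, axis=-1)[:, -k:]

    def verify(claimed: np.ndarray, recomputed: np.ndarray, min_overlap: float = 0.9) -> bool:
        # Accept the rollout if the top-k index sets overlap enough on average.
        a, b = fingerprint(claimed), fingerprint(recomputed)
        overlaps = [len(set(x) & set(y)) / a.shape[1] for x, y in zip(a, b)]
        return float(np.mean(overlaps)) >= min_overlap

    rng = np.random.default_rng(0)
    logits = rng.normal(size=(16, 1000))                   # 16 positions, 1000-token vocab
    noisy = logits + 1e-3 * rng.normal(size=logits.shape)  # cross-hardware precision drift
    forged = rng.normal(size=(16, 1000))                   # an unrelated, substituted output
    print(verify(logits, noisy))    # True: small drift is tolerated
    print(verify(logits, forged))   # False: a different output is rejected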


Training Approach

INTELLECT-2’s training used 285,000 verifiable tasks focused on mathematics and coding, sourced from NuminaMath-1.5, Deepscaler, and SYNTHETIC-1. The training recipe included:

  1. A binary task reward combined with a length reward, letting users allocate a “thinking token” budget at inference time to control how much the model reasons.
  2. Two-step asynchronous RL to overlap policy weight broadcasting with inference and training, reducing communication delays.
  3. Two-sided GRPO clipping, which stabilizes training by bounding how far any single update can move the policy (a sketch follows this list).
  4. Advanced data filtering, combining offline and online methods, to select challenging tasks and improve learning efficiency.
  5. Aggressive gradient clipping to manage escalating gradient norms at scale.
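As a rough sketch of items 3 and 5, a token-level clipped policy-gradient step with group-relative advantages and aggressive gradient-norm clipping might look like the following in PyTorch. The constants, names, and the exact form of the two-sided clip are illustrative, not the precise INTELLECT-2 recipe (the technical report gives the exact formulation).

    import torch

    def grpo_loss(logp_new, logp_old, rewards, eps=0.2, delta=4.0):
        # Group-relative advantage: each sample's reward minus the group mean.
        adv = rewards - rewards.mean()
        ratio = torch.exp(logp_new - logp_old)
        # Standard PPO-style clipped surrogate ...
        surrogate = torch.minimum(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
        # ... plus a second bound so samples with negative advantage cannot
        # drive arbitrarily large updates when the ratio explodes ("two-sided").
        surrogate = torch.where(adv < 0, torch.maximum(surrogate, delta * adv), surrogate)
        return -surrogate.mean()

    policy = torch.nn.Linear(16, 4)                       # stand-in for the 32B policy
    opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)
    x = torch.randn(8, 16)
    logp_new = torch.log_softmax(policy(x), dim=-1)[:, 0]
    logp_old = (logp_new + 0.01 * torch.randn(8)).detach()
    rewards = torch.randint(0, 2, (8,)).float()           # binary task reward
    loss = grpo_loss(logp_new, logp_old, rewards)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.1)  # aggressive clipping
    opt.step()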

Prime Intellect conducted two experiments: TARGET-SHORT, optimizing for efficient reasoning with shorter target lengths, and TARGET-LONG, the primary run with longer targets. In both runs, communication overlapped with computation, and the model improved its task rewards on mathematics and coding problems, though reductions in the length penalty were slower than in preliminary tests.
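As a rough illustration of how the binary task reward and the length reward can interact (the exact shaping used for TARGET-SHORT and TARGET-LONG is described in the technical report; the helper name and constants below are hypothetical), a shaped reward might penalize only the tokens generated beyond a per-prompt thinking budget:

    def shaped_reward(correct: bool, num_tokens: int, target_tokens: int,
                      penalty_per_token: float = 0.001, max_penalty: float = 0.5) -> float:
        # Binary task reward: the verifier says the answer is right or wrong.
        task_reward = 1.0 if correct else 0.0
        # Length reward: penalize only the tokens beyond the thinking budget.
        overshoot = max(0, num_tokens - target_tokens)
        return task_reward - min(max_penalty, penalty_per_token * overshoot)

    print(shaped_reward(True, 2_000, target_tokens=4_000))   # within budget -> 1.0
    print(shaped_reward(True, 6_000, target_tokens=4_000))   # 2k over budget -> 0.5

In a shaping like this, correct but over-budget completions earn less reward, which pushes the policy toward respecting the requested thinking-token budget.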


Performance and Limitations

INTELLECT-2, built on the QwQ-32B model, achieved modest performance improvements on mathematics and coding benchmarks. However, as QwQ-32B was already extensively trained with RL, broad generalized gains were limited. Further improvements may require higher-quality datasets or stronger base models like Qwen3.


Open-Source Contributions

Prime Intellect has open-sourced INTELLECT-2, its code, and data to support research in decentralized training. The model is available on Hugging Face, with a chat interface at chat.primeintellect.ai and a technical report at primeintellect.ai/intellect-2.


Future Directions

Prime Intellect plans to enhance INTELLECT-2 by increasing inference compute, integrating tools like web search and Python interpreters for multi-turn RL, crowdsourcing RL tasks, and exploring model merging via techniques like DiLoCo. These efforts aim to advance open-source, decentralized AI.
