
Ai2 Releases DataDecide: A Benchmark Suite for Smarter Pretraining Dataset Selection


Leo Silva

Updated:
April 20, 2025

The Allen Institute for AI (Ai2) has introduced DataDecide, a large-scale suite of pretrained models and evaluations aimed at helping AI researchers and developers make informed decisions about pretraining datasets. By openly sharing results from over 30,000 model checkpoints, Ai2 provides a unique resource for understanding how small-scale experiments can predict downstream performance in larger models.


What is DataDecide?

DataDecide consists of language models pretrained on 25 different corpora that vary in source, level of deduplication, and filtering, and that span up to 100 billion tokens. The models range in size from 4 million to 1 billion parameters, covering 14 sizes in total. Ai2 evaluated each model on 10 multiple-choice downstream tasks to investigate how dataset choices made at small scale translate to performance at larger scales.
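
For readers who want to poke at the suite, the sketch below shows how one of these checkpoints might be loaded with the Hugging Face transformers library. The repository id and revision tag are hypothetical placeholders, not confirmed names from the release; consult Ai2's published materials for the actual identifiers.

```python
# Minimal sketch: loading a DataDecide checkpoint with Hugging Face
# transformers. The repo id and revision below are hypothetical
# placeholders; the real names are listed in Ai2's release.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "allenai/DataDecide-example-150M"  # hypothetical identifier
revision = "step-10000"  # intermediate checkpoints can be pinned via a revision tag

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)

prompt = "The study found that"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```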


Key Insights from the Study

  1. Simple predictions are surprisingly strong: Ranking datasets by their performance at a single small model size (e.g., 150M parameters) predicted which datasets would perform best at 1B parameters with roughly 80% accuracy. Interestingly, this simple approach performed on par with more complex scaling-law-based methods (see the first sketch after this list).
  2. Checkpoints don’t need to be final: Intermediate training checkpoints proved just as reliable as fully trained ones for ranking datasets, offering compute savings without any loss in predictive power.
  3. Benchmark variability matters: Some benchmarks, such as MMLU and ARC Easy, gave highly predictable results with significantly less compute. Others, like HellaSwag or tasks across the broader OLMES set, were less reliable predictors at small scales.
  4. Better metrics lead to better predictions: For code-related tasks like MBPP and HumanEval, continuous metrics, specifically those based on the raw or character-normalized likelihood of answers, significantly improved prediction accuracy compared to traditional discrete accuracy scores (see the second sketch after this list).
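
To make the first insight concrete, here is a minimal sketch of the pairwise ranking comparison behind it: for every pair of pretraining datasets, check whether the ordering given by small-model scores matches the ordering at the target scale. The corpus names and scores below are invented placeholders, not DataDecide results.

```python
from itertools import combinations

# Hypothetical benchmark scores per pretraining corpus, at a small proxy
# scale (e.g., 150M parameters) and at the target scale (e.g., 1B).
small_scores = {"corpus_a": 0.41, "corpus_b": 0.38, "corpus_c": 0.45}
large_scores = {"corpus_a": 0.55, "corpus_b": 0.49, "corpus_c": 0.58}

def decision_accuracy(small, large):
    """Fraction of corpus pairs where the small-scale ranking agrees
    with the large-scale ranking."""
    pairs = list(combinations(small, 2))
    agree = sum(
        (small[a] > small[b]) == (large[a] > large[b]) for a, b in pairs
    )
    return agree / len(pairs)

print(f"decision accuracy: {decision_accuracy(small_scores, large_scores):.2f}")
```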
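
And a minimal sketch of the character-normalized likelihood metric from the fourth insight: score each candidate answer by the model's total log-probability of its tokens, divided by the answer's character length, then pick the highest-scoring option. The function name is illustrative, and the code assumes that tokenizing prompt + answer yields the prompt's tokens as a prefix, which holds for typical tokenizers on simple concatenations.

```python
import torch
import torch.nn.functional as F

def char_normalized_loglik(model, tokenizer, prompt, answer):
    """Total log-probability of the answer tokens given the prompt,
    divided by the answer's character length. This yields a continuous
    score per option rather than a discrete right/wrong outcome."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the logits predicts token i + 1 of the input.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    total = sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    )
    return total / len(answer)

# Rank multiple-choice options by the normalized score:
# best = max(options, key=lambda a: char_normalized_loglik(model, tokenizer, q, a))
```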


What This Means for Developers

Ai2’s findings suggest that developers can make sound pretraining decisions without needing to run full-scale experiments. By selecting more predictable benchmarks (like MMLU or ARC Easy), using character-normalized likelihood for code tasks, and ranking datasets based on a single model size, teams can achieve high decision accuracy while reducing costs.


Why It Matters

Choosing the right pretraining dataset is a critical but resource-intensive step in LLM development. DataDecide shows that well-designed, small-scale experiments can provide strong signals for making larger decisions, helping democratize model development by lowering the barrier to experimentation.


Explore and Extend

DataDecide is publicly available, including models, evaluation data, and code. Researchers are encouraged to build on the work by evaluating additional benchmarks, testing new metrics, or refining prediction strategies.


Check out the full paper and resources here



