METR: A New Lens for Measuring AI's Ability to Complete Long Tasks
A new study from METR proposes a refreshingly grounded approach to evaluating AI capabilities: measuring the length of the tasks an AI can complete with 50% reliability, where "long" is defined by how much time a human expert would take to do the same task.
Their research reveals a striking trend: over the last six years, the duration of tasks AI agents can autonomously complete has been doubling approximately every seven months. At present, even the most capable AI agents, such as Claude 3.7 Sonnet, reach this 50% reliability threshold only on tasks that take a human about an hour. Tasks requiring several hours or days remain largely out of reach.
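To make the doubling trend concrete, here is a minimal arithmetic sketch. It assumes the seven-month doubling simply continues unchanged, which the study does not guarantee, and the function names are illustrative rather than taken from METR's code.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling period reported in the article


def months_to_grow(factor, doubling_months=DOUBLING_MONTHS):
    """Months needed for the time horizon to grow by `factor`,
    assuming steady exponential doubling (an assumption, not a forecast)."""
    return doubling_months * math.log2(factor)


def projected_horizon(start_hours, months_ahead, doubling_months=DOUBLING_MONTHS):
    """Time horizon in hours after `months_ahead` months of steady doubling."""
    return start_hours * 2.0 ** (months_ahead / doubling_months)


# Going from a one-hour horizon to an eight-hour workday is three doublings.
workday_months = months_to_grow(8.0)

# A tenfold change in the horizon corresponds to log2(10) doublings.
tenfold_months = months_to_grow(10.0)

print(f"1 hour -> 8 hours: {workday_months:.1f} months")
print(f"10x change: {tenfold_months:.1f} months")
```

Under this assumption, a tenfold change in the horizon corresponds to a little under two years, because ten is only slightly more than three doublings.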
This insight helps to reconcile a common contradiction in AI today: models score highly on academic benchmarks and narrowly defined tasks, yet they still struggle to automate meaningful portions of day-to-day knowledge work. The bottleneck is not a lack of intelligence, but a limitation in sustaining longer sequences of decisions and actions.
METR’s method of evaluation involves fitting, for each model, a curve of success probability against the time human experts need for each task. This allows the extraction of a "time horizon" for each model: the human task duration at which the model still succeeds with a set reliability. Their analysis shows a consistent, exponential improvement in this time horizon, even when tested across diverse tasks and data sources.
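The curve-fitting idea can be illustrated with a small sketch. This is not METR's published analysis code: it assumes a plain logistic curve over the base-2 logarithm of human task time, fits it with simple gradient ascent, and uses made-up attempt data. All names and numbers below are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit p(success) = sigmoid(a + b * (x - mean_x)) by gradient ascent
    on the log-likelihood. Centering x keeps the fit well conditioned."""
    n = len(xs)
    mean_x = sum(xs) / n
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            xc = x - mean_x
            err = y - sigmoid(a + b * xc)
            grad_a += err
            grad_b += err * xc
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b, mean_x


def time_horizon(a, b, mean_x, reliability=0.5):
    """Human task duration (minutes) at which the fitted curve
    crosses the requested success probability."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** (mean_x + (logit - a) / b)


# Hypothetical data: human time per task (minutes) and how many of
# four attempts the model passed. Not real measurements.
human_minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
passes_of_4 = [4, 4, 4, 3, 3, 1, 1, 0, 0, 0]

xs, ys = [], []
for minutes, passes in zip(human_minutes, passes_of_4):
    for attempt in range(4):
        xs.append(math.log2(minutes))
        ys.append(1.0 if attempt < passes else 0.0)

a, b, mean_x = fit_logistic(xs, ys)
horizon_50 = time_horizon(a, b, mean_x, reliability=0.5)
horizon_80 = time_horizon(a, b, mean_x, reliability=0.8)
print(f"50% time horizon: {horizon_50:.1f} minutes")
print(f"80% time horizon: {horizon_80:.1f} minutes")
```

Asking for a higher reliability gives a shorter horizon, because the fitted success probability falls as tasks get longer.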
"The steepness of the trend means that our forecasts about when different capabilities will arrive are relatively robust even to large errors in measurement or in the comparisons between models and humans. For example, if the absolute measurements are off by a factor of 10x, that only changes the arrival time by around 2 years."
METR has published the full methodology, analysis code, and datasets, inviting others to replicate or build on the work (https://github.com/METR/vivaria). With broad implications for evaluation design, AI forecasting, and responsible development, this new metric could become a key tool in understanding the real-world utility and limits of AI.
About the Author
Leo Silva
Leo Silva is an Air correspondent from Brazil.