AGI-Elo: How Far Are We From Mastering A Task?

Shuo Sun1,3, Yimin Zhao1, Christina Dao Wen Lee1, Jiawei Sun1, Chengran Yuan1, Zefan Huang1,3, Dongen Li1,3, Justin KW Yeoh1, Alok Prakash3, Thomas W. Malone2,3, Marcelo H. Ang Jr.1,3

1National University of Singapore, 2Massachusetts Institute of Technology, 3Singapore-MIT Alliance for Research and Technology
Vision

Image Classification

Visualization of the estimated test case rating distribution and agent ratings on six distinct datasets. The percentile curve shows the cumulative percentage of test cases up to each rating level. For each agent, the portion of test cases, and of the percentile curve, lying to the right of its rating represents the fraction of the dataset that remains difficult for that agent (below 50% confidence of success).

Visualization of the predicted (theoretical) agent performance, derived from the rating differences between agents and test cases, versus the empirical performance obtained on each dataset.
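
The quantities in these captions (and in the analogous figures below) can be computed directly from ratings. Here is a minimal sketch in Python, assuming the standard Elo expected-score formula with a 400-point scale; the paper's exact formulation may differ, and all ratings below are synthetic. The predicted dataset-level performance is the mean expected score over test cases, and the fraction that "remains difficult" is the share of cases rated above the agent (expected score below 50%).

    import numpy as np

    def expected_score(agent_rating, case_ratings, scale=400.0):
        # Standard Elo expected score: the agent's probability of "winning"
        # (solving) a test case, a logistic function of the rating difference.
        return 1.0 / (1.0 + 10.0 ** ((case_ratings - agent_rating) / scale))

    # Synthetic example: one agent against 10,000 test cases.
    rng = np.random.default_rng(0)
    agent_rating = 1650.0
    case_ratings = rng.normal(loc=1500.0, scale=200.0, size=10_000)

    # Predicted (theoretical) performance on the dataset: mean expected score.
    predicted = expected_score(agent_rating, case_ratings).mean()

    # Fraction of the dataset that remains difficult: cases rated above the
    # agent, i.e. expected score below 0.5.
    difficult = (case_ratings > agent_rating).mean()

    print(f"predicted performance: {predicted:.3f}")
    print(f"fraction remaining difficult: {difficult:.3f}")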

Object Detection

Visualization of the estimated test case rating distribution and agent ratings on object detection datasets. The percentile curve shows the cumulative percentage of test cases up to each rating level. For each agent, the portion of test cases, and of the percentile curve, lying to the right of its rating represents the fraction of the dataset that remains difficult for that agent (below 50% confidence of success).

Visualization of the predicted (theoretical) agent performance, derived from the rating differences between agents and test cases, versus the empirical performance obtained on object detection datasets.

Language

Question Answering

Visualization of the estimated test case rating distribution and agent ratings on question answering datasets. The percentile curve shows the cumulative percentage of test cases up to each rating level. For each agent, the portion of test cases, and of the percentile curve, lying to the right of its rating represents the fraction of the dataset that remains difficult for that agent (below 50% confidence of success).

Visualization of the predicted (theoretical) agent performance, derived from the rating differences between agents and test cases, versus the empirical performance obtained on question answering datasets.

Code Generation

Visualization of the estimated test case rating distribution and agent ratings on code generation datasets. The percentile curve shows the cumulative percentage of test cases up to each rating level. For each agent, the portion of test cases, and of the percentile curve, lying to the right of its rating represents the fraction of the dataset that remains difficult for that agent (below 50% confidence of success).

Visualization of the predicted (theoretical) agent performance, derived from the rating differences between agents and test cases, versus the empirical performance obtained on code generation datasets.

Action

Motion Prediction

Visualization of the estimated test case rating distribution and agent ratings on motion prediction datasets. The percentile curve shows the cumulative percentage of test cases up to each rating level. For each agent, the portion of test cases, and of the percentile curve, lying to the right of its rating represents the fraction of the dataset that remains difficult for that agent (below 50% confidence of success).

Visualization of the predicted (theoretical) agent performance, derived from the rating differences between agents and test cases, versus the empirical performance obtained on motion prediction datasets.

Motion Planning

Visualization of the estimated test case rating distribution and agent ratings on motion planning datasets. The percentile curve shows the cumulative percentage of test cases up to each rating level. For each agent, the portion of test cases, and of the percentile curve, lying to the right of its rating represents the fraction of the dataset that remains difficult for that agent (below 50% confidence of success).

Visualization of the predicted (theoretical) agent performance, derived from the rating differences between agents and test cases, versus the empirical performance obtained on motion planning datasets.

Abstract

As the field progresses toward Artificial General Intelligence (AGI), there is a pressing need for more comprehensive and insightful evaluation frameworks that go beyond aggregate performance metrics. This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. Unlike existing metrics that focus solely on models, our approach allows for fine-grained, difficulty-aware evaluations through competitive interactions between models and tasks, capturing both the long-tail distribution of real-world challenges and the competency gap between current models and full task mastery. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains. The resulting rating distributions offer novel perspectives and interpretable insights into task difficulty, model progression, and the outstanding challenges that remain on the path to achieving full AGI task mastery. To provide a qualitative evaluation of test case difficulty, we randomly sample test cases from each rating level for every dataset/task and present them in the Hugging Face Dataset Viewer of this collection for visual comparison.
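
The rating system can be pictured as Elo-style matches between a model and a test case: each evaluation is a competitive interaction whose outcome (solved or not, or a graded score) moves the two ratings in opposite directions. The sketch below assumes a plain symmetric Elo update with an illustrative K-factor; the paper's actual update rule and hyperparameters may differ.

    def elo_update(agent_rating, case_rating, outcome, k=16.0, scale=400.0):
        # outcome: 1.0 if the agent solved the test case, 0.0 if it failed,
        # or a partial score in between. A plain symmetric Elo update is
        # assumed here; the paper's variant may differ.
        expected = 1.0 / (1.0 + 10.0 ** ((case_rating - agent_rating) / scale))
        delta = k * (outcome - expected)
        return agent_rating + delta, case_rating - delta

    # Toy run: an agent fails an evenly matched case, so the case gains rating.
    agent, case = 1500.0, 1500.0
    agent, case = elo_update(agent, case, outcome=0.0)
    print(agent, case)  # 1492.0 1508.0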


In this paper, we address long-standing questions regarding the current capabilities of AGI and humans on challenging tasks by proposing a standardized framework to quantitatively assess task difficulty, evaluate AGI competency, and identify gaps to task mastery.
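
As one rough illustration of the "gap to task mastery" idea, the gap can be read off the fitted ratings as the rating increase an agent would need before even high-difficulty cases become at most an even match. The difficulty quantile and 50%-confidence threshold below are assumptions made for illustration, not the paper's definition.

    import numpy as np

    def mastery_gap(agent_rating, case_ratings, quantile=0.99):
        # Rating points the agent would need to gain so that a case at the
        # given difficulty quantile is at most an even match (>= 50% expected
        # success). The 99th-percentile threshold is an illustrative choice.
        target = np.quantile(case_ratings, quantile)
        return max(0.0, target - agent_rating)

    rng = np.random.default_rng(0)
    case_ratings = rng.normal(1500.0, 200.0, size=10_000)
    print(f"rating points to mastery: {mastery_gap(1650.0, case_ratings):.0f}")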

BibTeX


        @misc{sun2025agielofarmasteringtask,
          title={AGI-Elo: How Far Are We From Mastering A Task?}, 
          author={Shuo Sun and Yimin Zhao and Christina Dao Wen Lee and Jiawei Sun and Chengran Yuan and Zefan Huang and Dongen Li and Justin KW Yeoh and Alok Prakash and Thomas W. Malone and Marcelo H. Ang Jr},
          year={2025},
          eprint={2505.12844},
          archivePrefix={arXiv},
          primaryClass={cs.AI},
          url={https://arxiv.org/abs/2505.12844}, 
        }