ARC-AGI-2
Category: Reasoning
About
ARC-AGI-2 tests AI systems on novel abstract visual pattern-matching tasks designed to measure fluid intelligence. Humans score close to 100%, while frontier models score well below that, making ARC-AGI-2 a key AGI milestone benchmark.
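Concretely, each ARC task presents a few input/output grid pairs demonstrating a hidden transformation rule, and the system must produce the correct output grid for a held-out test input. A minimal sketch of the public ARC task JSON format (grids are lists of lists of integers 0-9); the `solve` stub is a hypothetical placeholder, not a real solver:

```python
import json

def solve(train_pairs, test_input):
    # Placeholder: a real system must infer the hidden rule from the
    # train pairs. Echoing the input back will score 0 on real tasks.
    return test_input

def score_task(path):
    """Score one ARC task file: exact match required on every test grid."""
    with open(path) as f:
        # {"train": [{"input": grid, "output": grid}, ...], "test": [...]}
        task = json.load(f)
    correct = sum(
        solve(task["train"], pair["input"]) == pair["output"]
        for pair in task["test"]
    )
    return correct / len(task["test"])
```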
Evaluation Stats
- Total Models: 8
- Organizations: 3
- Verified Results: 0
- Self-Reported: 1
Benchmark Details
- Max Score: 100
Performance Overview
Score distribution and top performers

Score Distribution (8 models)
- Top Score: 77.1%
- Average Score: 46.8%
- High Performers (80%+): 0

Top Organizations (by average score)
| Rank | Organization | Models | Avg Score |
|---|---|---|---|
| 1 | OpenAI | 1 | 54.2% |
| 2 | Google DeepMind | 3 | 47.3% |
| 3 | Anthropic | 4 | 44.6% |
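The average above is the arithmetic mean of the eight leaderboard scores, and the per-organization figures are means over each lab's models. A quick check, with scores and the model-to-organization mapping taken from the tables below:

```python
scores = [77.1, 68.8, 58.3, 54.2, 37.6, 33.6, 31.1, 13.6]  # all 8 leaderboard scores
print(round(sum(scores) / len(scores), 1))  # 46.8 -> matches Average Score

# The leaderboard's unnamed 13.6% entry is attributed to Anthropic here;
# that attribution is an inference, but it is the only assignment consistent
# with Anthropic's reported 4-model average of 44.6%.
orgs = {
    "OpenAI": [54.2],
    "Google DeepMind": [77.1, 33.6, 31.1],
    "Anthropic": [68.8, 58.3, 37.6, 13.6],
}
for org, org_scores in orgs.items():
    print(org, round(sum(org_scores) / len(org_scores), 1))  # 54.2, 47.3, 44.6
```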
Leaderboard
8 models ranked by performance on ARC-AGI-2
| Rank | Model | Date | License | Score |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Feb 19, 2026 | Proprietary | 77.1% |
| 2 | Claude Opus 4.6 | Feb 1, 2026 | Proprietary | 68.8% |
| 3 | Claude Sonnet 4.6 | Feb 17, 2026 | Proprietary | 58.3% |
| 4 | GPT-5.2 | Dec 11, 2025 | Proprietary | 54.2% |
| 5 | Claude Opus 4.5 | Nov 1, 2025 | Proprietary | 37.6% |
| 6 | Gemini 3 Flash | Dec 17, 2025 | Proprietary | 33.6% |
| 7 | Gemini 3 Pro | Nov 18, 2025 | Proprietary | 31.1% |
| 8 | — | Sep 29, 2025 | Proprietary | 13.6% |
Additional Metrics
Extended metrics for top models on ARC-AGI-2
| Model | Score | Cost/Task | Author | ARC-AGI-1 Score | System Type |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 77.1% | $0.962 | Google DeepMind | 98% | CoT |
| Claude Opus 4.6 | 68.8% | $2.25 | Anthropic | 86% | CoT |
| Claude Sonnet 4.6 | 58.3% | $2.72 | Anthropic | 86% | CoT |
| GPT-5.2 | 54.2% | $8.99 | OpenAI | 81.2% | CoT |
| Claude Opus 4.5 | 37.6% | $2.40 | Anthropic | 80% | CoT |
| Gemini 3 Flash | 33.6% | $0.231 | Google DeepMind | 84.7% | CoT |
| Gemini 3 Pro | 31.1% | $77.16 | Google DeepMind | 87.5% | CoT |
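Score and Cost/Task together give a rough cost-efficiency picture. A small sketch ranking these models by score per dollar, with the numbers copied from the table above:

```python
# (model, ARC-AGI-2 score %, USD cost per task), from the table above
models = [
    ("Gemini 3.1 Pro", 77.1, 0.962),
    ("Claude Opus 4.6", 68.8, 2.25),
    ("Claude Sonnet 4.6", 58.3, 2.72),
    ("GPT-5.2", 54.2, 8.99),
    ("Claude Opus 4.5", 37.6, 2.40),
    ("Gemini 3 Flash", 33.6, 0.231),
    ("Gemini 3 Pro", 31.1, 77.16),
]
# Sort by percentage points of score obtained per dollar spent, descending.
for name, score, cost in sorted(models, key=lambda m: m[1] / m[2], reverse=True):
    print(f"{name:18s} {score / cost:7.1f} points per dollar")
```

By this measure Gemini 3 Flash is the most cost-efficient entry (about 145 points per dollar) despite its mid-pack score, while Gemini 3 Pro's high per-task cost puts it last.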