Humanity's Last Exam
Reasoning
About
Humanity's Last Exam is a 2,500-question PhD-level benchmark spanning the most challenging academic disciplines, designed as a near-impossible final test for frontier AI.
Evaluation Stats
Total Models: 15
Organizations: 4
Verified Results: 0
Self-Reported: 0
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers
Score Distribution (15 models)
Top Score: 37.5%
Average Score: 23.7%
High Performers (80%+): 0
Top Organizations
| Rank | Organization | Models | Average Score |
|---|---|---|---|
| #1 | Moonshot AI | 1 | 24.4% |
| #2 | OpenAI | 7 | 24.3% |
| #3 | Anthropic | 4 | 23.3% |
| #4 | Google DeepMind | 3 | 22.5% |
Leaderboard
15 models ranked by performance on Humanity's Last Exam
| Date | License | Score |
|---|---|---|
| Nov 18, 2025 | Proprietary | 37.5% |
| Dec 11, 2025 | Proprietary | 36.6% |
| Feb 17, 2026 | Proprietary | 33.2% |
| Aug 7, 2025 | Proprietary | 31.6% |
| Nov 24, 2025 | Proprietary | 30.8% |
| Aug 7, 2025 | Proprietary | 25.3% |
| Jan 27, 2026 | MIT | 24.4% |
| Nov 12, 2025 | Proprietary | 23.7% |
| Aug 7, 2025 | Proprietary | 19.4% |
| Apr 16, 2025 | Proprietary | 19.2% |
Showing 1 to 10 of 15 models
Additional Metrics
Extended metrics for top models on Humanity's Last Exam
| Model | Score (%) | Calibration Error (%) |
|---|---|---|
| Gemini 3 Pro | 37.5 | 57 |
| GPT-5.2 | 36.6 | 45 |
| GPT-5 Pro | 31.6 | 49 |
| Claude Opus 4.5 | 30.8 | 56 |
| GPT-5 | 25.3 | 50 |
| Kimi K2.5 | 24.4 | 67 |
| GPT-5.1 Thinking | 23.7 | 55 |
| GPT-5 Mini | 19.4 | 65 |
| o3 | 19.2 | 39 |
| Gemini 2.5 Pro | 17.8 | 70 |
| Claude Sonnet 4.5 | 17.7 | 65 |
| o4 mini | 14.3 | 59 |
| Gemini 2.5 Flash | 12.1 | 80 |
| Claude Opus 4.1 | 11.5 | 71 |
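
The per-organization averages reported in the Performance Overview can be roughly cross-checked against the scores in the table above. The following is a minimal sketch, not the site's actual method: the `organization` helper and its name-prefix mapping are assumptions made for illustration, since the table itself does not list organizations.

```python
from collections import defaultdict

# Scores (percent) copied from the Additional Metrics table.
SCORES = {
    "Gemini 3 Pro": 37.5,
    "GPT-5.2": 36.6,
    "GPT-5 Pro": 31.6,
    "Claude Opus 4.5": 30.8,
    "GPT-5": 25.3,
    "Kimi K2.5": 24.4,
    "GPT-5.1 Thinking": 23.7,
    "GPT-5 Mini": 19.4,
    "o3": 19.2,
    "Gemini 2.5 Pro": 17.8,
    "Claude Sonnet 4.5": 17.7,
    "o4 mini": 14.3,
    "Gemini 2.5 Flash": 12.1,
    "Claude Opus 4.1": 11.5,
}


def organization(model: str) -> str:
    """Hypothetical helper: infer the organization from the model name prefix.

    Assumption: this mapping is not part of the table; it is added here
    only so the scores can be grouped.
    """
    if model.startswith("Gemini"):
        return "Google DeepMind"
    if model.startswith("Claude"):
        return "Anthropic"
    if model.startswith("Kimi"):
        return "Moonshot AI"
    return "OpenAI"  # remaining rows are GPT-* and o-series models


def org_averages(scores: dict[str, float]) -> dict[str, float]:
    """Mean score per organization, rounded to one decimal place."""
    groups: dict[str, list[float]] = defaultdict(list)
    for model, score in scores.items():
        groups[organization(model)].append(score)
    return {org: round(sum(vals) / len(vals), 1) for org, vals in groups.items()}


averages = org_averages(SCORES)
for org, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{org}: {avg}%")
```

The table lists 14 of the 15 ranked models, so an organization whose model count here is lower than in the overview (Anthropic has three rows here but four models in the overview) will produce a different average.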