Humanity's Last Exam

Category: Reasoning
About

Humanity's Last Exam is a 2,500-question PhD-level benchmark spanning the most challenging academic disciplines, designed as a near-impossible final test for frontier AI.

Evaluation Stats
Total Models: 15
Organizations: 4
Verified Results: 0
Self-Reported: 1
Benchmark Details
Max Score: 100
Sub-benchmarks: 1
Sub-benchmarks
1 related benchmark
Performance Overview
Score distribution and top performers

Score Distribution (15 models)
Top Score: 40.0%
Average Score: 25.6%
High Performers (80%+): 0

Top Organizations

#1 Anthropic: 4 models, 30.4% average score
#2 Moonshot AI: 1 model, 24.4% average score
#3 OpenAI: 7 models, 24.3% average score
#4 Google DeepMind: 3 models, 22.5% average score
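The organization percentages above appear to be simple means of each organization's per-model scores. A minimal sketch that checks this, assuming the model-to-organization grouping below (inferred from which lab released each model, not stated explicitly on this page) and using the per-model scores from the Additional Metrics table:

```python
# Assumed grouping of per-model scores by releasing organization.
# Anthropic's fourth score (33.2) is the leaderboard entry whose name
# is not captured in the metrics table.
scores = {
    "Anthropic": [40.0, 33.2, 30.8, 17.7],
    "Moonshot AI": [24.4],
    "OpenAI": [36.6, 31.6, 25.3, 23.7, 19.4, 19.2, 14.3],
    "Google DeepMind": [37.5, 17.8, 12.1],
}

for org, s in scores.items():
    # Mean score per organization, rounded to one decimal place.
    print(f"{org}: {sum(s) / len(s):.1f}%")
```

The printed means (30.4%, 24.4%, 24.3%, 22.5%) match the Top Organizations figures, which supports reading them as averages rather than best scores.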
Leaderboard
15 models ranked by performance on Humanity's Last Exam
Rank  Model             Date          License      Score
1     Claude Opus 4.6   Feb 1, 2026   Proprietary  40.0%
2     Gemini 3 Pro      Nov 18, 2025  Proprietary  37.5%
3     GPT-5.2           Dec 11, 2025  Proprietary  36.6%
4     (name not shown)  Feb 17, 2026  Proprietary  33.2%
5     GPT-5 Pro         Aug 7, 2025   Proprietary  31.6%
6     Claude Opus 4.5   Nov 1, 2025   Proprietary  30.8%
7     GPT-5             Aug 7, 2025   Proprietary  25.3%
8     Kimi K2.5         Jan 1, 2026   MIT          24.4%
9     GPT-5.1 Thinking  Nov 1, 2025   Proprietary  23.7%
10    GPT-5 Mini        Aug 1, 2025   Proprietary  19.4%

Showing 1 to 10 of 15 models
Additional Metrics
Extended metrics for top models on Humanity's Last Exam
Model               Score   Calib. Error
Claude Opus 4.6     40.0    44
Gemini 3 Pro        37.5    57
GPT-5.2             36.6    45
GPT-5 Pro           31.6    49
Claude Opus 4.5     30.8    56
GPT-5               25.3    50
Kimi K2.5           24.4    67
GPT-5.1 Thinking    23.7    55
GPT-5 Mini          19.4    65
o3                  19.2    39
Gemini 2.5 Pro      17.8    70
Claude Sonnet 4.5   17.7    65
o4 mini             14.3    59
Gemini 2.5 Flash    12.1    80
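The page does not define "Calib. Error"; calibration error generically measures the gap between a model's stated confidence and its actual accuracy, so a lower number means better-calibrated confidence. A minimal sketch of one common formulation, expected calibration error (ECE) with equal-width confidence bins (the benchmark's exact formula may differ):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the bin-size-weighted mean of |accuracy - confidence|.

    confidences: per-answer confidence in [0, 1]
    correct: per-answer booleans (answer was right/wrong)
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Assign each prediction to an equal-width confidence bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# Toy example: a model that claims 90% confidence but is right only
# half the time has a calibration error of 40 (on a 0-100 scale).
confs = [0.9] * 10
correct = [True] * 5 + [False] * 5
print(f"{expected_calibration_error(confs, correct) * 100:.0f}")  # prints 40
```

On this reading, a score like Gemini 2.5 Flash's 80 indicates confidence that is far out of line with actual accuracy, while o3's 39 is the best-calibrated of the listed models.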