Humanity's Last Exam

Reasoning
About

Humanity's Last Exam is a 2,500-question PhD-level benchmark spanning the most challenging academic disciplines, designed as a near-impossible final test for frontier AI.

Evaluation Stats
Total Models: 15
Organizations: 4
Verified Results: 0
Self-Reported: 0
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers

Score Distribution (15 models)

Top Score: 37.5%
Average Score: 23.7%
High Performers (80%+): 0

Top Organizations

#1  Moonshot AI       1 model    24.4%
#2  OpenAI            7 models   24.3%
#3  Anthropic         4 models   23.3%
#4  Google DeepMind   3 models   22.5%
Leaderboard
15 models ranked by performance on Humanity's Last Exam
Date          License      Score
Nov 18, 2025  Proprietary  37.5%
Dec 11, 2025  Proprietary  36.6%
Feb 17, 2026  Proprietary  33.2%
Aug 7, 2025   Proprietary  31.6%
Nov 24, 2025  Proprietary  30.8%
Aug 7, 2025   Proprietary  25.3%
Jan 27, 2026  MIT          24.4%
Nov 12, 2025  Proprietary  23.7%
Aug 7, 2025   Proprietary  19.4%
Apr 16, 2025  Proprietary  19.2%
Showing 1 to 10 of 15 models
Additional Metrics
Extended metrics for top models on Humanity's Last Exam
Model              Score (%)  Calib. Error (%)
Gemini 3 Pro       37.5       57
GPT-5.2            36.6       45
GPT-5 Pro          31.6       49
Claude Opus 4.5    30.8       56
GPT-5              25.3       50
Kimi K2.5          24.4       67
GPT-5.1 Thinking   23.7       55
GPT-5 Mini         19.4       65
o3                 19.2       39
Gemini 2.5 Pro     17.8       70
Claude Sonnet 4.5  17.7       65
o4 mini            14.3       59
Gemini 2.5 Flash   12.1       80
Claude Opus 4.1    11.5       71
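The Performance Overview figures can be rechecked from the scores listed above. A minimal Python sketch, assuming the 15th model is the 33.2% leaderboard entry whose name does not appear in the metrics table:

```python
# Scores (%) from the Additional Metrics table, plus the one leaderboard
# entry (33.2%) not listed there -- 15 models total (an assumption).
scores = {
    "Gemini 3 Pro": 37.5, "GPT-5.2": 36.6, "GPT-5 Pro": 31.6,
    "Claude Opus 4.5": 30.8, "GPT-5": 25.3, "Kimi K2.5": 24.4,
    "GPT-5.1 Thinking": 23.7, "GPT-5 Mini": 19.4, "o3": 19.2,
    "Gemini 2.5 Pro": 17.8, "Claude Sonnet 4.5": 17.7, "o4 mini": 14.3,
    "Gemini 2.5 Flash": 12.1, "Claude Opus 4.1": 11.5,
    "(unlisted leaderboard model)": 33.2,
}

top = max(scores.values())                           # best single score
avg = round(sum(scores.values()) / len(scores), 1)   # mean over 15 models
high = sum(s >= 80 for s in scores.values())         # models at 80% or above

print(f"Top score: {top}%")               # Top score: 37.5%
print(f"Average score: {avg}%")           # Average score: 23.7%
print(f"High performers (80%+): {high}")  # High performers (80%+): 0
```

Under that assumption the computed values reproduce the overview exactly: top 37.5%, average 23.7%, and no model at or above 80%.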