GPQA Diamond
Science
About
GPQA Diamond is a benchmark of 198 PhD-level multiple-choice questions in biology, chemistry, and physics, authored by domain experts and designed to be unsolvable without deep expertise.
Evaluation Stats
Total Models: 22
Organizations: 5
Verified Results: 0
Self-Reported: 2
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers
Score Distribution: 22 models
Top Score: 93.2%
Average Score: 76.7%
High Performers (80%+): 11

Top Organizations
| Rank | Organization | Models | Score |
|---|---|---|---|
| #1 | xAI | 1 | 87.5% |
| #2 | Google DeepMind | 3 | 82.5% |
| #3 | Anthropic | 7 | 80.1% |
| #4 | OpenAI | 10 | 73.0% |
| #5 | DeepSeek | 1 | 62.4% |
Leaderboard
22 models ranked by performance on GPQA Diamond
| Date | License | Score |
|---|---|---|
| Dec 11, 2025 | Proprietary | 93.2% |
| Nov 18, 2025 | Proprietary | 91.9% |
| Feb 1, 2026 | Proprietary | 91.3% |
| Feb 17, 2026 | Proprietary | 89.9% |
| Nov 1, 2025 | Proprietary | 88.1% |
| Jul 10, 2025 | Proprietary | 87.5% |
| Aug 7, 2025 | Proprietary | 87.3% |
| Nov 1, 2025 | Proprietary | 87.0% |
| Sep 29, 2025 | Proprietary | 83.4% |
| Aug 5, 2025 | Proprietary | 80.9% |
Showing 1 to 10 of 22 models
Additional Metrics
Extended metrics for top models on GPQA Diamond
| Model | Score |
|---|---|
| GPT-5.2 | 93.2 |
| Gemini 3 Pro | 91.9 |
| GPT-5.1 | 88.1 |
| Grok 4 | 87.5 |
| GPT-5 | 87.3 |
| Claude Opus 4.5 | 87.0 |
| Claude Sonnet 4.5 | 83.4 |
| Claude Opus 4.1 | 80.9 |
| GPT-OSS-20B | 80.1 |
| GPT-OSS-120B | 78.3 |
| Gemini 2.5 Flash | 78.3 |
| Gemini 2.5 Pro | 77.2 |
| Claude Haiku 4.5 | 73.0 |
| o3 | 69.1 |
| GPT-4.1 nano | 66.0 |
| GPT-4.1 mini | 64.2 |
| DeepSeek-R1 | 62.4 |
| GPT-4o | 56.1 |
| Claude 3.5 Sonnet | 55.0 |
| GPT-4.1 | 48.1 |
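
The overview figures can be cross-checked against the scores listed on this page. The following is a minimal sketch, not the site's own method: it assumes the average is an unweighted mean over all 22 models and that a high performer is any model scoring at least 80. The `summarize` helper and its `threshold` parameter are hypothetical names introduced for illustration. Two leaderboard scores (91.3 and 89.9) have no model name on this page, so they are kept as unnamed entries.

```python
from statistics import mean

# Scores copied from the Additional Metrics table (20 named models).
named_scores = {
    "GPT-5.2": 93.2, "Gemini 3 Pro": 91.9, "GPT-5.1": 88.1, "Grok 4": 87.5,
    "GPT-5": 87.3, "Claude Opus 4.5": 87.0, "Claude Sonnet 4.5": 83.4,
    "Claude Opus 4.1": 80.9, "GPT-OSS-20B": 80.1, "GPT-OSS-120B": 78.3,
    "Gemini 2.5 Flash": 78.3, "Gemini 2.5 Pro": 77.2, "Claude Haiku 4.5": 73.0,
    "o3": 69.1, "GPT-4.1 nano": 66.0, "GPT-4.1 mini": 64.2,
    "DeepSeek-R1": 62.4, "GPT-4o": 56.1, "Claude 3.5 Sonnet": 55.0,
    "GPT-4.1": 48.1,
}

# Leaderboard rows whose model names are not given on this page.
unnamed_scores = [91.3, 89.9]


def summarize(scores, threshold=80.0):
    """Return overview statistics for a list of percentage scores.

    Assumption: the average is an unweighted mean, rounded to one decimal.
    """
    return {
        "models": len(scores),
        "top_score": max(scores),
        "average": round(mean(scores), 1),
        "high_performers": sum(1 for s in scores if s >= threshold),
    }


summary = summarize(list(named_scores.values()) + unnamed_scores)
print(summary)
```

Under these assumptions the sketch gives 22 models, a top score of 93.2, an average of 76.7, and 11 models at or above 80, which agrees with the Performance Overview figures.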