MATH
About
MATH is a comprehensive mathematical reasoning benchmark of 12,500 challenging problems drawn from high school mathematics competitions. Created by Hendrycks et al., the dataset tests AI models' advanced mathematical capabilities across seven subjects, including algebra, geometry, number theory, counting and probability, and precalculus. Its competition-level problems require multi-step reasoning, measuring deep mathematical understanding and problem-solving skill.
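As a concrete illustration, the sketch below loads and inspects the benchmark with the Hugging Face `datasets` library. The dataset id `hendrycks/competition_math` and the `problem`, `solution`, `level`, and `type` field names are assumptions based on common public releases of the dataset, not details confirmed by this page.

```python
# Minimal sketch of inspecting MATH, assuming it is published on the
# Hugging Face Hub as "hendrycks/competition_math" with "problem",
# "solution", "level", and "type" fields (id and schema are assumptions).
from datasets import load_dataset

math_test = load_dataset("hendrycks/competition_math", split="test")

example = math_test[0]
print(example["type"])      # subject, e.g. "Algebra"
print(example["level"])     # difficulty, e.g. "Level 5"
print(example["problem"])   # competition-style problem statement
print(example["solution"])  # step-by-step solution ending in \boxed{...}
```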
Evaluation Stats
Total Models: 64
Organizations: 11
Verified Results: 0
Self-Reported: 62
Benchmark Details
Max Score: 1
Language: English (en)
Performance Overview
Score distribution and top performers
Score Distribution: 64 models
Top Score: 97.9%
Average Score: 67.0%
High Performers (80%+): 15

Top Organizations
Rank | Organization | Models | Score
---|---|---|---
#1 | Moonshot AI | 2 | 79.7%
#2 | DeepSeek | 1 | 74.7%
#3 | OpenAI | 9 | 74.3%
#4 | Amazon | 3 | 73.1%
#5 | Alibaba Cloud / Qwen Team | 11 | 69.1%
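The overview figures above are simple aggregates of per-model scores. A minimal sketch of how such stats could be derived, given that the benchmark's max score is 1 and results are displayed as percentages; the score values below are placeholders, not actual leaderboard data:

```python
# Sketch of deriving the overview stats from per-model scores in [0, 1]
# (max score on this benchmark is 1; the values here are hypothetical).
scores = [0.979, 0.964, 0.897, 0.670, 0.412]  # placeholder examples

top_score = max(scores)
average_score = sum(scores) / len(scores)
high_performers = sum(1 for s in scores if s >= 0.80)

print(f"Top Score: {top_score:.1%}")
print(f"Average Score: {average_score:.1%}")
print(f"High Performers (80%+): {high_performers}")
```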
Leaderboard
64 models ranked by performance on MATH
Release Date | License | Score
---|---|---
Jan 30, 2025 | Proprietary | 97.9%
Dec 17, 2024 | Proprietary | 96.4%
Dec 1, 2024 | Proprietary | 89.7%
Sep 5, 2025 | Proprietary | 89.1%
Mar 12, 2025 | Gemma | 89.0%
Feb 5, 2025 | Proprietary | 86.8%
May 1, 2024 | Proprietary | 86.5%
Sep 12, 2024 | Proprietary | 85.5%
Aug 7, 2025 | Proprietary | 84.7%
Mar 12, 2025 | Gemma | 83.8%
Showing 1 to 10 of 64 models
...