MATH

About

MATH is a comprehensive mathematical reasoning benchmark of 12,500 challenging problems drawn from high school mathematics competitions. Created by Hendrycks et al., it tests AI models' advanced mathematical capabilities across seven subjects, including algebra, geometry, number theory, and precalculus. Because the problems are competition-level and require multi-step reasoning, MATH measures deep mathematical understanding and problem-solving skill.

Evaluation Stats
Total Models: 64
Organizations: 11
Verified Results: 0
Self-Reported: 62
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (64 models)
Top Score: 97.9%
Average Score: 67.0%
High Performers (80%+): 15

Top Organizations
#1 Moonshot AI (2 models): 79.7%
#2 DeepSeek (1 model): 74.7%
#3 OpenAI (9 models): 74.3%
#4 Amazon (3 models): 73.1%
#5 Alibaba Cloud / Qwen Team (11 models): 69.1%
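The overview numbers above are straightforward aggregates over per-model scores. Here is a minimal sketch of how such a summary can be computed; the `summarize_scores` helper is illustrative (not from the site's code), and the sample list reuses only the ten scores visible in the leaderboard below rather than all 64 models.

```python
def summarize_scores(scores: list[float]) -> dict:
    """Summarize benchmark scores (percentages) the way the
    Performance Overview does: count, top, average, and 80%+ count."""
    return {
        "models": len(scores),
        "top": max(scores),
        "average": round(sum(scores) / len(scores), 1),
        "high_performers": sum(1 for s in scores if s >= 80.0),
    }

# Top-10 leaderboard scores only, for illustration (not the full 64-model set)
sample = [97.9, 96.4, 89.7, 89.1, 89.0, 86.8, 86.5, 85.5, 84.7, 83.8]
stats = summarize_scores(sample)
```

Note that a top-10 sample naturally overstates the average (88.9% here vs. 67.0% over all 64 models), since the long tail of weaker models is excluded.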
Leaderboard
64 models ranked by performance on MATH

Date | License | Score
Jan 30, 2025 | Proprietary | 97.9%
Dec 17, 2024 | Proprietary | 96.4%
Dec 1, 2024 | Proprietary | 89.7%
Sep 5, 2025 | Proprietary | 89.1%
Mar 12, 2025 | Gemma | 89.0%
Feb 5, 2025 | Proprietary | 86.8%
May 1, 2024 | Proprietary | 86.5%
Sep 12, 2024 | Proprietary | 85.5%
Aug 7, 2025 | Proprietary | 84.7%
Mar 12, 2025 | Gemma | 83.8%

Showing 1 to 10 of 64 models
Resources