MATH

About

MATH is a comprehensive mathematical reasoning benchmark of 12,500 challenging problems drawn from high school mathematics competitions. Created by Hendrycks et al., it tests AI models' advanced mathematical capabilities across seven subjects, including algebra, geometry, number theory, and precalculus. Because the problems are competition-level and require multi-step reasoning, MATH measures deep mathematical understanding and problem-solving skill.

Evaluation Stats
Total Models: 64
Organizations: 11
Verified Results: 0
Self-Reported: 62
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (64 models)
Top Score: 97.9%
Average Score: 67.0%
High Performers (80%+): 15

Top Organizations
#1 Moonshot AI (2 models): 79.7%
#2 DeepSeek (1 model): 74.7%
#3 OpenAI (9 models): 74.3%
#4 Amazon (3 models): 73.1%
#5 Alibaba Cloud / Qwen Team (11 models): 69.1%
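The overview numbers above are straightforward aggregates over per-model scores. Here is a minimal sketch of how such a summary can be computed; the `summarize_scores` helper is illustrative (not from the site's code), and the sample list reuses only the ten scores visible in the leaderboard below rather than all 64 models.

```python
def summarize_scores(scores: list[float]) -> dict:
    """Summarize benchmark scores (percentages) the way the
    Performance Overview does: count, top, average, and 80%+ count."""
    return {
        "models": len(scores),
        "top": max(scores),
        "average": round(sum(scores) / len(scores), 1),
        "high_performers": sum(1 for s in scores if s >= 80.0),
    }

# Top-10 leaderboard scores only, for illustration (not the full 64-model set)
sample = [97.9, 96.4, 89.7, 89.1, 89.0, 86.8, 86.5, 85.5, 84.7, 83.8]
stats = summarize_scores(sample)
```

Note that a top-10 sample naturally overstates the average (88.9% here vs. 67.0% over all 64 models), since the long tail of weaker models is excluded.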
Leaderboard
64 models ranked by performance on MATH

Date | License | Score
Jan 30, 2025 | Proprietary | 97.9%
Dec 17, 2024 | Proprietary | 96.4%
Dec 1, 2024 | Proprietary | 89.7%
Sep 5, 2025 | Proprietary | 89.1%
Mar 12, 2025 | Gemma | 89.0%
Feb 5, 2025 | Proprietary | 86.8%
May 1, 2024 | Proprietary | 86.5%
Sep 12, 2024 | Proprietary | 85.5%
Aug 7, 2025 | Proprietary | 84.7%
Mar 12, 2025 | Gemma | 83.8%

Showing 1 to 10 of 64 models
Resources