GSM8K

text
About

GSM8K (Grade School Math 8K) is a mathematical reasoning benchmark of 8,500 linguistically diverse grade school math word problems. Created by OpenAI and Surge AI, it tests whether a model can solve problems that take two to eight steps of basic arithmetic. GSM8K serves as a baseline evaluation of mathematical reasoning and problem solving in language models.
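As a rough illustration of how a score on this benchmark can be computed, the sketch below grades completions by exact match on the final number. It relies on the dataset convention that each reference solution ends with a `####` marker followed by the final answer. The helper names and the "last number in the completion" extraction rule are assumptions made for this example, not the method used for the results reported on this page.

```python
import re
from typing import Optional, Sequence

# Hypothetical helpers; only the "####" marker comes from the dataset format.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _last_number(text: str) -> Optional[float]:
    """Parse the last number in text, ignoring thousands separators."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def reference_answer(solution: str) -> Optional[float]:
    """Read the final answer that follows the '####' marker."""
    return _last_number(solution.split("####")[-1])


def predicted_answer(completion: str) -> Optional[float]:
    """Assumption: the model's final answer is the last number it wrote."""
    return _last_number(completion)


def accuracy(completions: Sequence[str], solutions: Sequence[str]) -> float:
    """Fraction of problems answered correctly, in the range 0 to 1."""
    if not solutions or len(completions) != len(solutions):
        raise ValueError("need equally sized, non-empty inputs")
    correct = 0
    for completion, solution in zip(completions, solutions):
        predicted = predicted_answer(completion)
        expected = reference_answer(solution)
        if predicted is not None and predicted == expected:
            correct += 1
    return correct / len(solutions)


# Made-up toy items, not taken from the dataset.
toy_solutions = ["3 + 4 = 7\n#### 7", "12 * 100 = 1200\n#### 1,200", "5 - 2 = 3\n#### 3"]
toy_completions = ["3 + 4 = 7, so the answer is 7.", "She has 1,200 cents.", "I am not sure."]
toy_score = accuracy(toy_completions, toy_solutions)
```

The result is a fraction between 0 and 1. Real evaluation harnesses differ in prompting, answer extraction, and normalization, which is one reason self-reported scores are not always directly comparable.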

Evaluation Stats
Total Models: 46
Organizations: 15
Verified Results: 0
Self-Reported: 46
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

Models: 46
Top Score: 97.3%
Average Score: 87.8%
High Performers (80%+): 38
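The summary figures above are simple aggregates over per-model scores. The sketch below shows how they could be derived, using only the ten scores visible on the first page of the leaderboard. Because the remaining 36 scores are not shown, its average and high-performer count will not match the 46-model figures. The function name and the 80% default threshold parameter are assumptions for illustration.

```python
from statistics import mean
from typing import Dict, Sequence

# The ten scores visible on the first leaderboard page (percent).
visible_scores = [97.3, 97.1, 97.0, 96.8, 96.4, 96.4, 95.9, 95.9, 95.8, 95.1]


def summarize(scores: Sequence[float], threshold: float = 80.0) -> Dict[str, float]:
    """Compute the top score, mean score, and count of scores at or above threshold."""
    return {
        "top": max(scores),
        "average": round(mean(scores), 1),
        "high_performers": sum(1 for s in scores if s >= threshold),
    }


summary = summarize(visible_scores)
```

Running the same function over all 46 scores would be needed to reproduce the page's reported average and high-performer count.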

Top Organizations

#1 OpenAI: 2 models, 97.0%
#2 DeepSeek: 1 model, 95.1%
#3 Moonshot AI: 2 models, 94.7%
#4 Amazon: 3 models, 93.9%
#5 Anthropic: 5 models, 93.8%
+
+
+
+
Leaderboard
46 models ranked by performance on GSM8K

Date            License                        Score
Jul 11, 2025    MIT                            97.3%
Dec 17, 2024    Proprietary                    97.1%
Feb 27, 2025    Proprietary                    97.0%
Jul 23, 2024    Llama 3.1 Community License    96.8%
Oct 22, 2024    Proprietary                    96.4%
Jun 21, 2024    Proprietary                    96.4%
Mar 12, 2025    Gemma                          95.9%
Sep 19, 2024    Apache 2.0                     95.9%
Sep 19, 2024    Qwen                           95.8%
May 8, 2024     deepseek                       95.1%
Showing 1 to 10 of 46 models
Resources