GSM8K
text
About
GSM8K (Grade School Math 8K) is a mathematical reasoning benchmark featuring 8,500 linguistically diverse grade school math word problems requiring multi-step reasoning. Created by OpenAI and Surge AI, this dataset tests models' ability to solve 2-8 step arithmetic problems using basic operations. GSM8K serves as a fundamental evaluation for mathematical reasoning and problem-solving capabilities in language models.
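As a rough illustration of how a GSM8K score can be computed, the sketch below uses exact-match accuracy on the final numeric answer. It relies on the dataset's convention that each reference solution ends with `#### <answer>`. The function names (`reference_answer`, `predicted_answer`, `accuracy`) and the "take the last number in the model output" heuristic are assumptions made for this example. The results listed on this page are self-reported, so their exact scoring procedures may differ.

```python
import re
from typing import List, Optional

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the text after the final '####' marker of a reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(output: str) -> Optional[str]:
    """Heuristic (an assumption): treat the last number in the output as the answer."""
    numbers = _NUMBER.findall(output)
    return numbers[-1].replace(",", "") if numbers else None


def is_correct(output: str, solution: str) -> bool:
    """Compare predicted and reference answers numerically, so '72.0' matches '72'."""
    predicted = predicted_answer(output)
    if predicted is None:
        return False
    try:
        return float(predicted) == float(reference_answer(solution))
    except ValueError:
        return False


def accuracy(outputs: List[str], solutions: List[str]) -> float:
    """Fraction of problems answered correctly, on a 0-to-1 scale."""
    correct = sum(is_correct(o, s) for o, s in zip(outputs, solutions))
    return correct / len(solutions)


if __name__ == "__main__":
    outputs = ["48 + 24 = 72 clips. The answer is 72.", "About 1,250 dollars."]
    solutions = ["48/2 = 24\n48 + 24 = 72\n#### 72", "#### 1300"]
    print(f"accuracy = {accuracy(outputs, solutions):.2f}")
```

The result is a fraction between 0 and 1, which matches the page's Max Score of 1. Multiplying by 100 gives the percentage form used in the leaderboard.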
Evaluation Stats
Total Models: 46
Organizations: 15
Verified Results: 0
Self-Reported: 46
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution
46 models
Top Score
97.3%
Average Score
87.8%
High Performers (80%+)
38
Top Organizations
#1 OpenAI, 2 models, 97.0%
#2 DeepSeek, 1 model, 95.1%
#3 Moonshot AI, 2 models, 94.7%
#4 Amazon, 3 models, 93.9%
#5 Anthropic, 5 models, 93.8%
Leaderboard
46 models ranked by performance on GSM8K
Release Date | License | Score
---|---|---
Jul 11, 2025 | MIT | 97.3%
Dec 17, 2024 | Proprietary | 97.1%
Feb 27, 2025 | Proprietary | 97.0%
Jul 23, 2024 | Llama 3.1 Community License | 96.8%
Oct 22, 2024 | Proprietary | 96.4%
Jun 21, 2024 | Proprietary | 96.4%
Mar 12, 2025 | Gemma | 95.9%
Sep 19, 2024 | Apache 2.0 | 95.9%
Sep 19, 2024 | Qwen | 95.8%
May 8, 2024 | deepseek | 95.1%
Showing 1 to 10 of 46 models