GSM8K


About

GSM8K (Grade School Math 8K) is a mathematical reasoning benchmark of about 8,500 linguistically diverse grade school math word problems. Created by OpenAI and Surge AI, the dataset tests a model's ability to solve problems that take 2 to 8 steps using basic arithmetic operations (addition, subtraction, multiplication, division). GSM8K is widely used as a baseline evaluation of multi-step mathematical reasoning in language models.
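Scores on this page are fractions of problems answered correctly (maximum 1). As a rough illustration of how such a score could be computed, the sketch below does exact-match grading on the final number. It assumes reference solutions end with a `#### <number>` line, as in the public GSM8K release, and that model outputs are formatted the same way; the function names and the sample strings are hypothetical, and real evaluation harnesses may extract answers differently.

```python
import re
from typing import List, Optional

# Assumption: the final answer follows a "####" marker, e.g. "#### 72".
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(text: str) -> Optional[str]:
    """Return the number after the last '####' marker, or None if absent."""
    matches = FINAL_ANSWER.findall(text)
    if not matches:
        return None
    # Drop thousands separators so "1,234" compares equal to "1234".
    return matches[-1].replace(",", "")


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose final number equals the reference."""
    correct = 0
    for pred, ref in zip(predictions, references):
        gold = extract_final_answer(ref)
        guess = extract_final_answer(pred)
        if gold is not None and guess is not None and float(guess) == float(gold):
            correct += 1
    return correct / len(references) if references else 0.0


if __name__ == "__main__":
    # Illustrative strings, not taken from the dataset.
    refs = ["48 + 24 = 72\n#### 72", "3 * 2 = 6\n#### 6"]
    preds = ["The total is 72.\n#### 72", "I get 5.\n#### 5"]
    print(accuracy(preds, refs))
```

Comparing parsed numbers rather than raw strings keeps "72" and "72.0" equivalent; a stricter or looser matching rule would shift reported scores slightly.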

Evaluation Stats

Total Models: 46

Organizations: 15

Verified Results: 0

Self-Reported: 46

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

46 models

Top Score: 97.3%

Average Score: 87.8%

High Performers (80%+)

Top Organizations

#1	OpenAI	2 models	97.0%
#2	DeepSeek	1 model	95.1%
#3	Moonshot AI	2 models	94.7%
#4	Amazon	3 models	93.9%
#5	Anthropic	5 models	93.8%

Leaderboard

46 models ranked by performance on GSM8K

Rank	Model	Organization	Date	License	Score
#01	Kimi K2 Instruct	Moonshot AI	Jul 11, 2025	MIT	97.3%
#02	o1	OpenAI	Dec 17, 2024	Proprietary	97.1%
#03	GPT-4.5	OpenAI	Feb 27, 2025	Proprietary	97.0%
#04	Llama 3.1 405B Instruct	Meta	Jul 23, 2024	Llama 3.1 Community License	96.8%
#05	Claude 3.5 Sonnet	Anthropic	Oct 22, 2024	Proprietary	96.4%
#06	Claude 3.5 Sonnet	Anthropic	Jun 21, 2024	Proprietary	96.4%
#07	Gemma 3 27B	Google	Mar 12, 2025	Gemma	95.9%
#08	Qwen2.5 32B Instruct	Alibaba Cloud / Qwen Team	Sep 19, 2024	Apache 2.0	95.9%
#09	Qwen2.5 72B Instruct	Alibaba Cloud / Qwen Team	Sep 19, 2024	Qwen	95.8%
#10	DeepSeek-V2.5	DeepSeek	May 8, 2024	deepseek	95.1%

Showing 1 to 10 of 46 models
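The page does not say how the per-organization scores under Top Organizations are aggregated. The sketch below assumes a simple mean over each organization's models and uses only the ten rows visible here, so organizations with models outside the top 10 (for example Moonshot AI, listed with 2 models) will not reproduce the page's figures. The function name is hypothetical.

```python
from collections import defaultdict

# The ten visible leaderboard rows: (model, organization, score in percent).
ROWS = [
    ("Kimi K2 Instruct", "Moonshot AI", 97.3),
    ("o1", "OpenAI", 97.1),
    ("GPT-4.5", "OpenAI", 97.0),
    ("Llama 3.1 405B Instruct", "Meta", 96.8),
    ("Claude 3.5 Sonnet", "Anthropic", 96.4),
    ("Claude 3.5 Sonnet", "Anthropic", 96.4),
    ("Gemma 3 27B", "Google", 95.9),
    ("Qwen2.5 32B Instruct", "Alibaba Cloud / Qwen Team", 95.9),
    ("Qwen2.5 72B Instruct", "Alibaba Cloud / Qwen Team", 95.8),
    ("DeepSeek-V2.5", "DeepSeek", 95.1),
]


def mean_score_by_org(rows):
    """Group scores by organization and return the mean of each group.

    Assumption: a plain arithmetic mean; the page's actual method is unknown.
    """
    by_org = defaultdict(list)
    for _model, org, score in rows:
        by_org[org].append(score)
    return {org: sum(scores) / len(scores) for org, scores in by_org.items()}


if __name__ == "__main__":
    means = mean_score_by_org(ROWS)
    # Print organizations from highest to lowest mean score.
    for org, mean in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{org}\t{mean:.2f}%")
```

For the two organizations whose model counts are fully covered by the visible rows, OpenAI (2 models) and DeepSeek (1 model), the means come out close to the values listed under Top Organizations.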

...

Resources

Research Paper