SWE-Bench Verified
About
SWE-bench Verified is a human-validated subset of the original SWE-bench benchmark, comprising 500 carefully reviewed samples for evaluating AI models' software engineering capabilities. It tests a model's ability to generate patches that resolve real GitHub issues, covering bug fixing, code generation, and related software development tasks, with improved reliability and reduced evaluation noise compared to the full SWE-bench set.
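For orientation, the sketch below shows one way to load and inspect the benchmark instances using the Hugging Face datasets library. The dataset id and field names are assumptions based on the original SWE-bench schema (the Verified split is commonly published as "princeton-nlp/SWE-bench_Verified"); adjust them if the hub listing differs.

```python
from datasets import load_dataset

# Assumed hub id for the human-validated 500-sample split.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 instances

# Each instance pairs a real GitHub issue with the repository state it was
# filed against; the field names below follow the original SWE-bench schema
# and are assumptions, not guaranteed by this page.
example = ds[0]
print(example["instance_id"])                 # unique task identifier
print(example["repo"], example["base_commit"])  # repo and commit to check out
print(example["problem_statement"][:300])     # the issue text shown to the model
```

A model under evaluation is given the issue text and repository snapshot, produces a patch, and is scored on whether the repository's tests pass after the patch is applied.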
Evaluation Stats
Total Models: 34
Organizations: 7
Verified Results: 0
Self-Reported: 34
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 34 models
Top Score: 74.9%
Average Score: 53.2%
High Performers (80%+): 0

Top Organizations
#1 Moonshot AI (1 model): 65.8%
#2 Zhipu AI (3 models): 63.3%
#3 Anthropic (6 models): 63.3%
#4 Mistral AI (2 models): 57.6%
#5 Google (5 models): 49.1%
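The overview figures above are simple aggregates over per-model scores. A minimal sketch of the arithmetic, using only the ten scores visible in the leaderboard excerpt below (so the outputs will differ from the page-wide statistics, which cover all 34 models):

```python
# Illustrative only: the top-10 scores from the leaderboard excerpt,
# not the full 34-model result set used for the page-wide stats.
scores = [74.9, 74.5, 74.5, 72.7, 72.5, 70.3, 69.1, 68.1, 68.0, 67.8]

top_score = max(scores)
average_score = sum(scores) / len(scores)
high_performers = sum(1 for s in scores if s >= 80.0)

print(f"Top score: {top_score:.1f}%")          # 74.9%
print(f"Average score: {average_score:.1f}%")  # average of the visible subset
print(f"Models at 80%+: {high_performers}")    # 0
```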
Leaderboard
34 models ranked by performance on SWE-Bench Verified
Release Date | License | Score
---|---|---
Aug 7, 2025 | Proprietary | 74.9%
Sep 15, 2025 | Proprietary | 74.5%
Aug 5, 2025 | Proprietary | 74.5%
May 22, 2025 | Proprietary | 72.7%
May 22, 2025 | Proprietary | 72.5%
Feb 24, 2025 | Proprietary | 70.3%
Apr 16, 2025 | Proprietary | 69.1%
Apr 16, 2025 | Proprietary | 68.1%
Sep 30, 2025 | MIT | 68.0%
Sep 29, 2025 | MIT | 67.8%
Showing 1 to 10 of 34 models