SWE-rebench

Coding
About

SWE-rebench evaluates LLM coding agents on real-world software engineering tasks; the primary metric is resolved rate, the fraction of task instances the agent successfully fixes.
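A minimal sketch of how a resolved-rate metric like this is typically computed (the field names below are illustrative, not the actual SWE-rebench harness schema):

```python
# Sketch of a "resolved rate" metric: the share of task instances whose
# final evaluation passed. The 'resolved' flag is a hypothetical field,
# not the actual SWE-rebench result schema.

def resolved_rate(results: list[dict]) -> float:
    """results: one dict per task instance, with a boolean 'resolved' flag."""
    if not results:
        return 0.0
    return sum(r["resolved"] for r in results) / len(results)

runs = [{"resolved": True}, {"resolved": False},
        {"resolved": True}, {"resolved": True}]
print(f"{resolved_rate(runs):.1%}")  # 3 of 4 tasks resolved -> 75.0%
```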

Evaluation Stats
Total Models: 18
Organizations: 9
Verified Results: 0
Self-Reported: 0
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers

Score Distribution

Models: 18
Top Score: 51.7%
Average Score: 43.7%
High Performers (80%+): 0

Top Organizations

#1 Anthropic: 3 models, average score 47.5%
#2 Google DeepMind: 2 models, average score 46.7%
#3 OpenAI: 5 models, average score 46.3%
#4 Zhipu AI: 2 models, average score 41.7%
#5 Moonshot AI: 2 models, average score 40.8%
Leaderboard
18 models ranked by performance on SWE-rebench
Model             | Release Date | License      | Score
Claude Opus 4.6   | Feb 5, 2026  | Proprietary  | 51.7%
GPT-5.2           | Dec 11, 2025 | Proprietary  | 51.0%
GPT-5.1 Codex Max | Nov 19, 2025 | Proprietary  | 48.5%
Claude Sonnet 4.5 | Sep 29, 2025 | Proprietary  | 47.1%
Gemini 3 Flash    | Dec 17, 2025 | Proprietary  | 46.7%
Gemini 3 Pro      | Nov 18, 2025 | Proprietary  | 46.7%
GPT-5.2 Codex     | Jan 14, 2026 | Proprietary  | 45.0%
GPT-5 Codex       | Sep 23, 2025 | Proprietary  | 44.0%
Kimi K2 Thinking  | Nov 6, 2025  | Modified MIT | 43.8%
Claude Opus 4.5   | Nov 24, 2025 | Proprietary  | 43.8%

Showing 1 to 10 of 18 models
Additional Metrics
Extended metrics for top models on SWE-rebench
Model             | Score | SEM   | Cost  | Tokens  | Pass@5 | Cached Tokens
Claude Opus 4.6   | 51.7  | 0.42% | $0.93 | 1031373 | 58.3%  | 94.3%
GPT-5.2           | 51.0  | 1.04% | $0.76 | 981139  | 60.4%  | 68.3%
GPT-5.1 Codex Max | 48.5  | 1.13% | $0.73 | 1239950 | 56.3%  | 67.1%
Claude Sonnet 4.5 | 47.1  | 1.69% | $0.94 | 1924648 | 60.4%  | 96.4%
Gemini 3 Flash    | 46.7  | 1.41% | $0.32 | 2173478 | 54.2%  | 77.5%
Gemini 3 Pro      | 46.7  | 2.04% | $0.59 | 1221222 | 58.3%  | 84.6%
GPT-5.2 Codex     | 45.0  | 1.69% | $0.46 | 579616  | 54.2%  | 66.1%
GPT-5 Codex       | 44.0  | 2.46% | $0.29 | 580361  | 55.3%  | 86.9%
Kimi K2 Thinking  | 43.8  | 1.47% | $0.42 | 2242684 | 58.3%  | 95.1%
Claude Opus 4.5   | 43.8  | 0.93% | $1.19 | 1426974 | 58.3%  | 95.3%
GPT-5.1 Codex     | 42.9  | 1.25% | $0.64 | 1790759 | 50%    | 84.2%
GLM 5             | 42.1  | 1.21% | $0.45 | 1426726 | 50%    | 84.1%
GLM-4.7           | 41.3  | 2.12% | $0.27 | 1866019 | 56.3%  | 94.1%
Qwen3 Coder Next  | 40.0  | 1.21% | $0.49 | 2341400 | 64.6%  | 97.6%
Minimax M 2.5     | 39.6  | 0.66% | $0.09 | 1391598 | 56.3%  | 89.5%
Kimi K2.5         | 37.9  | 1.21% | $0.18 | 1156152 | 50%    | 90.2%
DeepSeek-V3.2     | 37.5  | 1.14% | $0.15 | 2120848 | 45.8%  | 85.1%
Devstral-2-123B   | 37.5  | 2.19% | $0.09 | 1743224 | 52.1%  | 96.6%
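As a rough illustration of how the SEM and Pass@5 columns could be derived, here is a sketch under two assumptions that are mine, not stated by the leaderboard: SEM is the standard error of a proportion, and Pass@5 is the share of tasks resolved in at least one of five independent runs.

```python
import math

def sem(p: float, n: int) -> float:
    """Standard error of a proportion: sqrt(p*(1-p)/n).
    Assumption about how the SEM column is derived; p is the resolved
    rate as a fraction, n the number of task instances."""
    return math.sqrt(p * (1 - p) / n)

def pass_at_k(per_task_runs: list[list[bool]]) -> float:
    """Fraction of tasks resolved in at least one of the k runs.
    Assumption about how the Pass@5 column is derived."""
    return sum(any(runs) for runs in per_task_runs) / len(per_task_runs)

# Example: 4 tasks x 5 runs each (synthetic data, not leaderboard results)
runs = [
    [True, False, False, True, False],    # resolved at least once
    [False, False, False, False, False],  # never resolved
    [True, True, True, True, True],       # always resolved
    [False, False, True, False, False],   # resolved once
]
print(f"pass@5 = {pass_at_k(runs):.1%}")  # 3 of 4 tasks -> 75.0%
print(f"SEM at p=0.5, n=100: {sem(0.5, 100):.3f}")  # 0.050
```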
Resources