SWE-rebench

Coding
About

SWE-rebench evaluates LLM coding agents on real-world software engineering tasks; the primary metric is resolved rate, the fraction of task instances the agent successfully fixes.
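A minimal sketch of how a resolved-rate metric like this is typically computed (the field names below are illustrative, not the actual SWE-rebench harness schema):

```python
# Sketch of a "resolved rate" metric: the share of task instances whose
# final evaluation passed. The 'resolved' flag is a hypothetical field,
# not the actual SWE-rebench result schema.

def resolved_rate(results: list[dict]) -> float:
    """results: one dict per task instance, with a boolean 'resolved' flag."""
    if not results:
        return 0.0
    return sum(r["resolved"] for r in results) / len(results)

runs = [{"resolved": True}, {"resolved": False},
        {"resolved": True}, {"resolved": True}]
print(f"{resolved_rate(runs):.1%}")  # 3 of 4 tasks resolved -> 75.0%
```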

Evaluation Stats
Total Models: 18
Organizations: 9
Verified Results: 0
Self-Reported: 0
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers

Score Distribution

Models: 18
Top Score: 51.7%
Average Score: 43.7%
High Performers (80%+): 0

Top Organizations

#1 Anthropic: 3 models, average score 47.5%
#2 Google DeepMind: 2 models, average score 46.7%
#3 OpenAI: 5 models, average score 46.3%
#4 Zhipu AI: 2 models, average score 41.7%
#5 Moonshot AI: 2 models, average score 40.8%
Leaderboard
18 models ranked by performance on SWE-rebench
Model             | Release Date | License      | Score
Claude Opus 4.6   | Feb 5, 2026  | Proprietary  | 51.7%
GPT-5.2           | Dec 11, 2025 | Proprietary  | 51.0%
GPT-5.1 Codex Max | Nov 19, 2025 | Proprietary  | 48.5%
Claude Sonnet 4.5 | Sep 29, 2025 | Proprietary  | 47.1%
Gemini 3 Flash    | Dec 17, 2025 | Proprietary  | 46.7%
Gemini 3 Pro      | Nov 18, 2025 | Proprietary  | 46.7%
GPT-5.2 Codex     | Jan 14, 2026 | Proprietary  | 45.0%
GPT-5 Codex       | Sep 23, 2025 | Proprietary  | 44.0%
Kimi K2 Thinking  | Nov 6, 2025  | Modified MIT | 43.8%
Claude Opus 4.5   | Nov 24, 2025 | Proprietary  | 43.8%

Showing 1 to 10 of 18 models
Additional Metrics
Extended metrics for top models on SWE-rebench
Model             | Score | SEM   | Cost  | Tokens  | Pass@5 | Cached Tokens
Claude Opus 4.6   | 51.7  | 0.42% | $0.93 | 1031373 | 58.3%  | 94.3%
GPT-5.2           | 51.0  | 1.04% | $0.76 | 981139  | 60.4%  | 68.3%
GPT-5.1 Codex Max | 48.5  | 1.13% | $0.73 | 1239950 | 56.3%  | 67.1%
Claude Sonnet 4.5 | 47.1  | 1.69% | $0.94 | 1924648 | 60.4%  | 96.4%
Gemini 3 Flash    | 46.7  | 1.41% | $0.32 | 2173478 | 54.2%  | 77.5%
Gemini 3 Pro      | 46.7  | 2.04% | $0.59 | 1221222 | 58.3%  | 84.6%
GPT-5.2 Codex     | 45.0  | 1.69% | $0.46 | 579616  | 54.2%  | 66.1%
GPT-5 Codex       | 44.0  | 2.46% | $0.29 | 580361  | 55.3%  | 86.9%
Kimi K2 Thinking  | 43.8  | 1.47% | $0.42 | 2242684 | 58.3%  | 95.1%
Claude Opus 4.5   | 43.8  | 0.93% | $1.19 | 1426974 | 58.3%  | 95.3%
GPT-5.1 Codex     | 42.9  | 1.25% | $0.64 | 1790759 | 50%    | 84.2%
GLM 5             | 42.1  | 1.21% | $0.45 | 1426726 | 50%    | 84.1%
GLM-4.7           | 41.3  | 2.12% | $0.27 | 1866019 | 56.3%  | 94.1%
Qwen3 Coder Next  | 40.0  | 1.21% | $0.49 | 2341400 | 64.6%  | 97.6%
Minimax M 2.5     | 39.6  | 0.66% | $0.09 | 1391598 | 56.3%  | 89.5%
Kimi K2.5         | 37.9  | 1.21% | $0.18 | 1156152 | 50%    | 90.2%
DeepSeek-V3.2     | 37.5  | 1.14% | $0.15 | 2120848 | 45.8%  | 85.1%
Devstral-2-123B   | 37.5  | 2.19% | $0.09 | 1743224 | 52.1%  | 96.6%
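As a rough illustration of how the SEM and Pass@5 columns could be derived, here is a sketch under two assumptions that are mine, not stated by the leaderboard: SEM is the standard error of a proportion, and Pass@5 is the share of tasks resolved in at least one of five independent runs.

```python
import math

def sem(p: float, n: int) -> float:
    """Standard error of a proportion: sqrt(p*(1-p)/n).
    Assumption about how the SEM column is derived; p is the resolved
    rate as a fraction, n the number of task instances."""
    return math.sqrt(p * (1 - p) / n)

def pass_at_k(per_task_runs: list[list[bool]]) -> float:
    """Fraction of tasks resolved in at least one of the k runs.
    Assumption about how the Pass@5 column is derived."""
    return sum(any(runs) for runs in per_task_runs) / len(per_task_runs)

# Example: 4 tasks x 5 runs each (synthetic data, not leaderboard results)
runs = [
    [True, False, False, True, False],    # resolved at least once
    [False, False, False, False, False],  # never resolved
    [True, True, True, True, True],       # always resolved
    [False, False, True, False, False],   # resolved once
]
print(f"pass@5 = {pass_at_k(runs):.1%}")  # 3 of 4 tasks -> 75.0%
print(f"SEM at p=0.5, n=100: {sem(0.5, 100):.3f}")  # 0.050
```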
Resources