SWE Bench Verified

Coding
About

A subset of 500 software engineering problems drawn from real GitHub issues, each validated by human annotators.
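For readers who want to inspect the underlying tasks, here is a minimal sketch of loading the benchmark in Python. It assumes the dataset is published under the Hugging Face ID "princeton-nlp/SWE-bench_Verified" and exposes the standard SWE-bench fields ("repo", "problem_statement"); check the dataset card for the exact schema.

```python
# Minimal sketch: load SWE Bench Verified and inspect one task.
# Assumption: the dataset lives under the Hugging Face ID
# "princeton-nlp/SWE-bench_Verified" with standard SWE-bench fields.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 problems

task = ds[0]
print(task["repo"])                     # source GitHub repository
print(task["problem_statement"][:200])  # the issue text a model must resolve
```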

Evaluation Stats
Total Models: 21
Organizations: 9
Verified Results: 0
Self-Reported: 0
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers

Score Distribution (21 models)
Top Score: 80.9%
Average Score: 76.5%
High Performers (80%+): 4

Top Organizations (by average score across their models)

#1 MiniMax: 1 model, 80.2%
#2 Google DeepMind: 2 models, 78.0%
#3 Anthropic: 5 models, 77.8%
#4 Moonshot AI: 1 model, 76.8%
#5 OpenAI: 7 models, 76.0%
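These overview numbers are straightforward aggregates, and they can be recomputed from the scores elsewhere on this page: the 20 scores in the Additional Metrics table below plus the one remaining leaderboard entry at 79.6%. A quick sketch:

```python
# Recompute the Performance Overview stats from scores on this page.
# 20 scores come from the Additional Metrics table; 79.6 is the one
# leaderboard entry not listed there (21 models in total).
scores = [
    80.9, 80.8, 80.2, 80.0, 79.6, 78.0, 78.0, 77.8, 76.8,
    76.3, 76.3, 76.3, 74.9, 74.5, 74.5, 74.4, 73.8, 73.7,
    73.4, 73.3, 73.1,
]

top = max(scores)                                      # 80.9
avg = sum(scores) / len(scores)                        # ~76.5
high_performers = sum(1 for s in scores if s >= 80.0)  # 4

print(f"Top Score: {top:.1f}%")
print(f"Average Score: {avg:.1f}%")
print(f"High Performers (80%+): {high_performers}")
```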
Leaderboard
21 models ranked by performance on SWE Bench Verified
Model | Date | License | Score
Claude Opus 4.5 | Nov 24, 2025 | Proprietary | 80.9%
Claude Opus 4.6 | Feb 5, 2026 | Proprietary | 80.8%
Minimax M 2.5 | Feb 12, 2026 | MIT | 80.2%
GPT-5.2 | Dec 11, 2025 | Proprietary | 80.0%
— | Feb 17, 2026 | Proprietary | 79.6%
Gemini 3 Flash | Dec 17, 2025 | Proprietary | 78.0%
Gemini 3 Pro | Nov 18, 2025 | Proprietary | 78.0%
GLM 5 | Feb 11, 2026 | MIT | 77.8%
Kimi K2.5 | Jan 27, 2026 | MIT | 76.8%
— | Nov 12, 2025 | Proprietary | 76.3%
Showing 1 to 10 of 21 models
Additional Metrics
Extended metrics for top models on SWE Bench Verified
Model | Score | Cost (in / out) | Size | Context
Claude Opus 4.5 | 80.9 | $5.00 / $25.00 | — | 200K
Claude Opus 4.6 | 80.8 | $5.00 / $25.00 | — | 200K
Minimax M 2.5 | 80.2 | $0.30 / $1.20 | 230B | 1.0M
GPT-5.2 | 80.0 | $1.75 / $14.00 | — | 400K
Gemini 3 Flash | 78.0 | $0.50 / $3.00 | — | 1.0M
Gemini 3 Pro | 78.0 | $2.00 / $12.00 | — | 1.0M
GLM 5 | 77.8 | $1.00 / $3.20 | 744B | 200K
Kimi K2.5 | 76.8 | $0.60 / $2.50 | 1.0T | 262K
GPT-5.1 Instant | 76.3 | $1.25 / $10.00 | — | 400K
GPT-5.1 | 76.3 | $1.25 / $10.00 | — | 400K
GPT-5.1 Thinking | 76.3 | $1.25 / $10.00 | — | 400K
GPT-5 | 74.9 | $1.25 / $10.00 | — | 400K
Claude Opus 4.1 | 74.5 | $15.00 / $75.00 | — | 200K
GPT-5 Codex | 74.5 | — | — | —
Step-3.5-Flash | 74.4 | $0.10 / $0.40 | 196B | 66K
GLM-4.7 | 73.8 | $0.60 / $2.20 | 358B | 205K
GPT-5.1 Codex | 73.7 | $1.25 / $10.00 | — | 400K
MiMo-V2-Flash | 73.4 | $0.10 / $0.30 | 309B | 256K
Claude Haiku 4.5 | 73.3 | $1.00 / $5.00 | — | 200K
DeepSeek-V3.2 Thinking | 73.1 | $0.28 / $0.42 | 685B | 131K
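One practical use of this table is weighing score against price. The sketch below ranks a handful of the listed models by score per dollar of output price, assuming (as is conventional for such tables, though not stated here) that the two cost figures are prices per 1M input and output tokens:

```python
# Rank a few models from the table by score per dollar of output price.
# Assumption: the two cost figures are $/1M input and output tokens;
# models without listed prices (e.g. GPT-5 Codex) are skipped.
models = {
    # name: (score, input $/1M tok, output $/1M tok)
    "Claude Opus 4.5": (80.9, 5.00, 25.00),
    "Minimax M 2.5": (80.2, 0.30, 1.20),
    "GPT-5.2": (80.0, 1.75, 14.00),
    "Gemini 3 Flash": (78.0, 0.50, 3.00),
    "Step-3.5-Flash": (74.4, 0.10, 0.40),
    "DeepSeek-V3.2 Thinking": (73.1, 0.28, 0.42),
}

# Score per output dollar: a crude value metric, higher is better.
ranked = sorted(
    ((score / out_price, name) for name, (score, _, out_price) in models.items()),
    reverse=True,
)
for value, name in ranked:
    print(f"{name}: {value:.1f} score points per output $")
```

By this crude metric the cheap open-weight models dominate: output prices on this page span more than two orders of magnitude, while scores sit within about eight points of each other.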