SWE-Bench Verified
About
SWE-bench Verified is a human-validated subset of the original SWE-bench benchmark, comprising 500 carefully reviewed samples for evaluating AI models' software engineering capabilities. It tests a model's ability to generate patches that resolve real GitHub issues, covering bug fixing, code generation, and related software development tasks, with improved reliability and reduced evaluation noise compared to the full SWE-bench set.
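For orientation, the sketch below shows one way to load and inspect the benchmark instances using the Hugging Face datasets library. The dataset id and field names are assumptions based on the original SWE-bench schema (the Verified split is commonly published as "princeton-nlp/SWE-bench_Verified"); adjust them if the hub listing differs.

```python
from datasets import load_dataset

# Assumed hub id for the human-validated 500-sample split.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 instances

# Each instance pairs a real GitHub issue with the repository state it was
# filed against; the field names below follow the original SWE-bench schema
# and are assumptions, not guaranteed by this page.
example = ds[0]
print(example["instance_id"])                 # unique task identifier
print(example["repo"], example["base_commit"])  # repo and commit to check out
print(example["problem_statement"][:300])     # the issue text shown to the model
```

A model under evaluation is given the issue text and repository snapshot, produces a patch, and is scored on whether the repository's tests pass after the patch is applied.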
Evaluation Stats
Total Models: 34
Organizations: 7
Verified Results: 0
Self-Reported: 34
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 34 models
Top Score: 74.9%
Average Score: 53.2%
High Performers (80%+): 0

Top Organizations
#1 Moonshot AI (1 model): 65.8%
#2 Zhipu AI (3 models): 63.3%
#3 Anthropic (6 models): 63.3%
#4 Mistral AI (2 models): 57.6%
#5 Google (5 models): 49.1%
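The overview figures above are simple aggregates over per-model scores. A minimal sketch of the arithmetic, using only the ten scores visible in the leaderboard excerpt below (so the outputs will differ from the page-wide statistics, which cover all 34 models):

```python
# Illustrative only: the top-10 scores from the leaderboard excerpt,
# not the full 34-model result set used for the page-wide stats.
scores = [74.9, 74.5, 74.5, 72.7, 72.5, 70.3, 69.1, 68.1, 68.0, 67.8]

top_score = max(scores)
average_score = sum(scores) / len(scores)
high_performers = sum(1 for s in scores if s >= 80.0)

print(f"Top score: {top_score:.1f}%")          # 74.9%
print(f"Average score: {average_score:.1f}%")  # average of the visible subset
print(f"Models at 80%+: {high_performers}")    # 0
```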
Leaderboard
34 models ranked by performance on SWE-Bench Verified
Release Date | License | Score
---|---|---
Aug 7, 2025 | Proprietary | 74.9%
Sep 15, 2025 | Proprietary | 74.5%
Aug 5, 2025 | Proprietary | 74.5%
May 22, 2025 | Proprietary | 72.7%
May 22, 2025 | Proprietary | 72.5%
Feb 24, 2025 | Proprietary | 70.3%
Apr 16, 2025 | Proprietary | 69.1%
Apr 16, 2025 | Proprietary | 68.1%
Sep 30, 2025 | MIT | 68.0%
Sep 29, 2025 | MIT | 67.8%
Showing 1 to 10 of 34 models