SWE-Bench Verified
About
SWE-bench Verified is a human-validated subset of the original SWE-bench, comprising 500 carefully reviewed samples for evaluating AI models' software engineering capabilities. The benchmark tests a model's ability to generate patches that resolve real GitHub issues, covering bug fixing, code generation, and broader software development tasks, with improved reliability and reduced evaluation noise compared to the full SWE-bench set.
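For readers who want to inspect the tasks directly, here is a minimal sketch that loads the benchmark with the Hugging Face datasets library. It assumes the dataset is published on the Hub as princeton-nlp/SWE-bench_Verified and exposes the standard SWE-bench fields (instance_id, repo, problem_statement); treat the exact identifiers as assumptions to verify against the dataset card.

```python
# Minimal sketch: load SWE-bench Verified and inspect one task instance.
# Assumes the dataset is hosted on the Hugging Face Hub as
# "princeton-nlp/SWE-bench_Verified" with the standard SWE-bench schema.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 verified instances

example = ds[0]
# Each instance pairs a real GitHub issue with the repository state
# it applies to; a submitted patch is judged by the repo's tests.
print(example["instance_id"])               # unique task identifier
print(example["repo"])                      # source repository
print(example["problem_statement"][:200])   # the GitHub issue text
```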
Evaluation Stats
Total Models: 36
Organizations: 8
Verified Results: 0
Self-Reported: 36
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 36 models
Top Score: 74.9%
Average Score: 54.3%
High Performers (80%+): 0

Top Organizations
#1 xAI (1 model): 70.8%
#2 Moonshot AI (1 model): 65.8%
#3 Anthropic (7 models): 64.7%
#4 Zhipu AI (3 models): 63.3%
#5 Mistral AI (2 models): 57.6%
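These overview numbers are simple aggregates over the per-model scores. A minimal sketch of the computation, using only the ten scores reproduced in the leaderboard below (the full 36-model list is not shown on this page, so the average here differs from the 54.3% reported above):

```python
# Minimal sketch: derive the overview aggregates from per-model scores.
# `scores` holds resolution rates in percent; only the top 10 of the
# 36 leaderboard entries are reproduced here.
scores = [74.9, 74.5, 74.5, 73.3, 72.7, 72.5, 70.8, 70.3, 69.1, 68.1]

top_score = max(scores)
average_score = sum(scores) / len(scores)
high_performers = sum(1 for s in scores if s >= 80.0)

print(f"Top Score: {top_score:.1f}%")                 # 74.9%
print(f"Average Score: {average_score:.1f}%")         # 72.1% (top 10 only)
print(f"High Performers (80%+): {high_performers}")   # 0
```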
Leaderboard
36 models ranked by performance on SWE-Bench Verified
| Model | Release Date | License | Score |
|---|---|---|---|
| | Aug 7, 2025 | Proprietary | 74.9% |
| | Sep 15, 2025 | Proprietary | 74.5% |
| | Aug 5, 2025 | Proprietary | 74.5% |
| | Oct 15, 2025 | Proprietary | 73.3% |
| | May 22, 2025 | Proprietary | 72.7% |
| | May 22, 2025 | Proprietary | 72.5% |
| | Aug 28, 2025 | Proprietary | 70.8% |
| | Feb 24, 2025 | Proprietary | 70.3% |
| | Apr 16, 2025 | Proprietary | 69.1% |
| | Apr 16, 2025 | Proprietary | 68.1% |
Showing 1 to 10 of 36 models