SWE-Bench Verified

About

SWE-bench Verified is a human-validated subset of the original SWE-bench, consisting of 500 carefully verified samples for evaluating AI models' software engineering capabilities. The benchmark tests a model's ability to generate patches that resolve real GitHub issues, covering bug fixing, code generation, and broader software development tasks, with improved reliability and reduced evaluation noise relative to the full benchmark.
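
For readers who want to inspect the benchmark directly, the sketch below shows one way to browse the 500 verified instances with the Hugging Face datasets library. The dataset path and field names follow the public SWE-bench release and are assumptions here; adjust them if the hosting location or schema differs.

    # Minimal sketch: browse SWE-bench Verified instances.
    # Assumes the dataset is published as "princeton-nlp/SWE-bench_Verified"
    # on the Hugging Face Hub with a "test" split.
    from datasets import load_dataset

    dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    print(len(dataset))  # expected: 500 verified instances

    example = dataset[0]
    # Each instance pairs a real GitHub issue with the repository state it
    # must be fixed against; a model is scored on whether its generated
    # patch makes the associated failing tests pass.
    print(example["repo"])               # source repository, e.g. "astropy/astropy"
    print(example["instance_id"])        # unique identifier for the issue
    print(example["problem_statement"])  # the GitHub issue text given to the model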

Evaluation Stats
Total Models: 34
Organizations: 7
Verified Results: 0
Self-Reported: 34
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

Models: 34
Top Score: 74.9%
Average Score: 53.2%
High Performers (80%+): 0
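
The summary figures above can be reproduced from the per-model scores. The sketch below shows the arithmetic on a stand-in list covering only the top ten leaderboard entries reproduced on this page; the real computation would run over all 34 self-reported scores.

    # Sketch of how the distribution summary is derived from per-model scores.
    # The list is a placeholder (top 10 entries only), not the full leaderboard.
    scores = [74.9, 74.5, 74.5, 72.7, 72.5, 70.3, 69.1, 68.1, 68.0, 67.8]

    top_score = max(scores)
    average_score = sum(scores) / len(scores)
    high_performers = sum(1 for s in scores if s >= 80.0)

    print(f"Top score: {top_score:.1f}%")
    print(f"Average score: {average_score:.1f}%")
    print(f"High performers (80%+): {high_performers}")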

Top Organizations

#1 Moonshot AI (1 model): 65.8%
#2 Zhipu AI (3 models): 63.3%
#3 Anthropic (6 models): 63.3%
#4 Mistral AI (2 models): 57.6%
#5 Google (5 models): 49.1%
Leaderboard
34 models ranked by performance on SWE-Bench Verified
Date           License      Score
Aug 7, 2025    Proprietary  74.9%
Sep 15, 2025   Proprietary  74.5%
Aug 5, 2025    Proprietary  74.5%
May 22, 2025   Proprietary  72.7%
May 22, 2025   Proprietary  72.5%
Feb 24, 2025   Proprietary  70.3%
Apr 16, 2025   Proprietary  69.1%
Apr 16, 2025   Proprietary  68.1%
Sep 30, 2025   MIT          68.0%
Sep 29, 2025   MIT          67.8%
Showing 1 to 10 of 34 models
Resources