SWE-Bench Verified

About

SWE-bench Verified is a human-validated subset of the original SWE-bench, featuring 500 carefully verified samples for evaluating AI models' software engineering capabilities. The benchmark tests a model's ability to generate patches that resolve real GitHub issues, covering bug fixing, code generation, and software development tasks, with improved reliability and reduced evaluation noise compared to the full SWE-bench set.
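For readers who want to inspect the task instances directly, a minimal sketch using the Hugging Face datasets library is below; the dataset ID princeton-nlp/SWE-bench_Verified and the field names in the comments are assumptions based on the public release and may differ across versions.

```python
from datasets import load_dataset

# Load the 500-instance verified split (dataset ID assumed from the
# public Hugging Face release; field names may vary across versions).
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 verified task instances

# Each instance pairs a real GitHub issue with a repository snapshot;
# a model must emit a patch that makes the issue's failing tests pass.
sample = ds[0]
print(sample["repo"])               # source repository
print(sample["problem_statement"])  # the GitHub issue text
```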

Evaluation Stats
Total Models: 36
Organizations: 8
Verified Results: 0
Self-Reported: 36
Benchmark Details
Max Score: 1
Language: English
Performance Overview
Score distribution and top performers

Score Distribution (36 models)
Top Score: 74.9%
Average Score: 54.3%
High Performers (80%+): 0
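These summary figures follow directly from the per-model scores. As a quick illustration, the snippet below recomputes them from a score list; the values shown are placeholders, not the full 36-model leaderboard.

```python
# Placeholder scores, not the actual 36-model leaderboard data.
scores = [74.9, 74.5, 74.5, 73.3, 72.7]

top_score = max(scores)
average_score = sum(scores) / len(scores)
# "High performers" are models at or above the 80% threshold.
high_performers = sum(1 for s in scores if s >= 80.0)

print(f"Top Score: {top_score:.1f}%")
print(f"Average Score: {average_score:.1f}%")
print(f"High Performers (80%+): {high_performers}")
```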

Top Organizations

#1 xAI: 1 model, 70.8%
#2 Moonshot AI: 1 model, 65.8%
#3 Anthropic: 7 models, 64.7%
#4 Zhipu AI: 3 models, 63.3%
#5 Mistral AI: 2 models, 57.6%
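The per-organization figures appear to aggregate over each organization's submitted models; the page does not state the aggregation, so the sketch below assumes a per-organization average, with hypothetical score data.

```python
from collections import defaultdict

# (organization, score) pairs; placeholder values, not the real leaderboard.
models = [
    ("xAI", 70.8),
    ("Moonshot AI", 65.8),
    ("Anthropic", 74.9), ("Anthropic", 72.7),  # an org can field many models
]

# Group scores by organization.
by_org = defaultdict(list)
for org, score in models:
    by_org[org].append(score)

# Rank organizations by average score (assumed metric), highest first.
ranking = sorted(
    ((org, sum(s) / len(s), len(s)) for org, s in by_org.items()),
    key=lambda row: row[1],
    reverse=True,
)
for rank, (org, avg, n) in enumerate(ranking, start=1):
    print(f"#{rank} {org}: {n} model(s), {avg:.1f}%")
```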
Leaderboard
36 models ranked by performance on SWE-Bench Verified
Rank  Release Date  License      Score
1     Aug 7, 2025   Proprietary  74.9%
2     Sep 15, 2025  Proprietary  74.5%
3     Aug 5, 2025   Proprietary  74.5%
4     Oct 15, 2025  Proprietary  73.3%
5     May 22, 2025  Proprietary  72.7%
6     May 22, 2025  Proprietary  72.5%
7     Aug 28, 2025  Proprietary  70.8%
8     Feb 24, 2025  Proprietary  70.3%
9     Apr 16, 2025  Proprietary  69.1%
10    Apr 16, 2025  Proprietary  68.1%

Showing 1 to 10 of 36 models
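Since all 36 results are self-reported rather than independently verified, scores can be reproduced locally with the SWE-bench evaluation harness. A hedged sketch follows; the swebench module path, flags, and file names are assumptions drawn from the project's README and may vary by version.

```python
import subprocess

# Run the SWE-bench harness on a predictions file. The module path and
# flags below are assumptions based on the SWE-bench README and may
# differ across swebench versions.
subprocess.run([
    "python", "-m", "swebench.harness.run_evaluation",
    "--dataset_name", "princeton-nlp/SWE-bench_Verified",
    "--predictions_path", "predictions.jsonl",  # hypothetical predictions file
    "--max_workers", "4",
    "--run_id", "verified-eval",                # hypothetical run label
], check=True)
```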
Resources