LLM Leaderboard for Agentic Coders
| Metric | Count | Description |
| --- | --- | --- |
| Total Models | 73 | AI models tracked |
| Organizations | 17 | Companies & labs |
| Providers | 17 | API providers |
| Benchmarks | 13 | Evaluation metrics |

Top Models by SWE-rebench
| Release date | SWE-rebench |  |  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Feb 5, 2026 | 51.7% | 69.9% | 80.8% | 64.6% | - | 60.0% | - | 0.9% | - | - | - | - | - |
| Dec 11, 2025 | 51.0% | 54.0% | 80.0% | 38.5% | - | 58.5% | 49.7% | 92.4% | 27.8% | 60.6% | - | - | - |
| Nov 19, 2025 | 48.5% | 60.4% | - | - | - | - | - | - | - | - | - | - | - |
| Sep 29, 2025 | 47.1% | 46.5% | - | - | - | 54.5% | 42.5% | - | 13.7% | 43.8% | -1.0% | 62.9% | 84.7% |
SWE-bench Dominance Timeline
Models that achieved the highest SWE-bench score at the time of their release
Aug 2025 – Feb 2026: Claude Opus 4.1 → GPT-5 → GPT-5.1 Thinking → Claude Opus 4.5
Organizations: Anthropic (44.6%), OpenAI (55.4%)
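
If you want to rebuild this progression from raw scores, the sketch below computes it as a running maximum over release-ordered entries: a model enters the timeline only if its SWE-bench score beats every earlier release. The names, dates, and scores in the snippet are illustrative placeholders, not the leaderboard's underlying data.

```python
from datetime import date

# Hypothetical release records: (model, release date, SWE-bench score %).
# Values are illustrative placeholders, not the leaderboard's data.
releases = [
    ("Claude Opus 4.1", date(2025, 8, 5), 74.5),
    ("GPT-5", date(2025, 8, 7), 74.9),
    ("GPT-5.1 Thinking", date(2025, 11, 13), 76.3),
    ("Claude Opus 4.5", date(2025, 11, 24), 80.9),
]

def dominance_timeline(records):
    """Keep each model that set a new best SWE-bench score at its release."""
    timeline, best = [], float("-inf")
    for model, released, score in sorted(records, key=lambda r: r[1]):
        if score > best:  # new state of the art at release time
            timeline.append((model, released, score))
            best = score
    return timeline

for model, released, score in dominance_timeline(releases):
    print(f"{released:%b %Y}: {model} ({score:.1f}%)")
```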
Coding Categories Performance
Model performance across different coding domains and specializations
Note: These rankings reflect performance on available benchmarks for each model. Rankings do not necessarily indicate absolute superiority in a category, as most models have not been evaluated on all benchmarks.
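To make that caveat concrete, here is a minimal sketch of one plausible aggregation, assuming a category score is the mean over whichever of the category's benchmarks a model has actually been run on. The function and input layout are assumptions for illustration, not this leaderboard's actual scoring code.

```python
def category_ranking(scores):
    """Rank models by mean score over the benchmarks they were evaluated on.

    `scores` maps model -> {benchmark: percent}. Models missing a benchmark
    are averaged over fewer entries, which is why a #1 rank here does not
    imply superiority on every benchmark in the category.
    """
    rows = []
    for model, by_bench in scores.items():
        if not by_bench:
            continue  # skip models with no evaluations in this category
        mean = sum(by_bench.values()) / len(by_bench)
        rows.append((model, mean, len(by_bench)))
    # Sort by mean score descending; benchmark count is reported, not ranked on.
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Hypothetical inputs: model-a ran two benchmarks, model-b only one.
ranking = category_ranking({
    "model-a": {"swe-bench-verified": 81.0, "terminal-bench": 59.0},
    "model-b": {"swe-bench-verified": 80.0},
})
for rank, (model, mean, n) in enumerate(ranking, start=1):
    print(f"#{rank} {model}: {mean:.0f}% ({n} benchmark{'s' if n != 1 else ''})")
```

In this toy input, model-b outranks model-a despite a lower score on their shared benchmark, simply because it was averaged over fewer evaluations; that is exactly why the benchmark count is shown next to each rank below.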
Agentic Coding
Assesses autonomous agents for code editing, issue resolution, and tool-using workflows.
| Rank | Model | Organization | Score | Benchmarks |
| --- | --- | --- | --- | --- |
| #1 | Claude Opus 4.5 | Anthropic | 81% | 1 |
| #2 | Claude Opus 4.6 | Anthropic | 81% | 1 |
| #3 | MiniMax M2.5 | MiniMax | 80% | 1 |
| #4 | GPT-5.2 | OpenAI | 80% | 1 |
| #5 | Gemini 3 Flash | Google DeepMind | 78% | 1 |
Repository-Level Coding
Assesses understanding and modification of code across full repositories.
| Rank | Model | Organization | Score | Benchmarks |
| --- | --- | --- | --- | --- |
| #1 | Claude Opus 4.5 | Anthropic | 81% | 1 |
| #2 | Claude Opus 4.6 | Anthropic | 81% | 1 |
| #3 | MiniMax M2.5 | MiniMax | 80% | 1 |
| #4 | GPT-5.2 | OpenAI | 80% | 1 |
| #5 | Gemini 3 Flash | Google DeepMind | 78% | 1 |