LLM Leaderboard for Agentic Coders
| Metric | Count | Description |
|--------|-------|-------------|
| Total Models | 97 | AI models tracked |
| Organizations | 19 | Companies & labs |
| Providers | 17 | API providers |
| Benchmarks | 21 | Evaluation metrics |
Top Models by SWE-rebench
[Table: per-benchmark scores for the current top models, one row per release date (Feb 14–19, 2026); model names and benchmark column headers not recoverable]
SWE-bench Dominance Timeline
Models that achieved the highest SWE-bench score at the time of their release
[Timeline chart, Aug 2025 – Feb 2026: Claude Opus 4.5, Claude Sonnet 4.5, GPT-5, GPT-5 Codex]
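The timeline reads as a running maximum over release dates: a model appears only if its SWE-bench score exceeded every earlier model's score at the time it shipped. Below is a minimal sketch of that computation; the dates and scores are placeholders for illustration, not the leaderboard's actual values.

```python
from datetime import date

# Illustrative (release_date, model, SWE-bench score) tuples; these numbers
# are placeholders, not the leaderboard's underlying data.
releases = [
    (date(2025, 8, 7), "GPT-5", 74.9),
    (date(2025, 9, 15), "GPT-5 Codex", 75.0),
    (date(2025, 9, 29), "Claude Sonnet 4.5", 77.2),
    (date(2025, 11, 24), "Claude Opus 4.5", 80.9),
]

def dominance_timeline(releases):
    """Keep each model that set a new highest score as of its release date."""
    best = float("-inf")
    leaders = []
    for released, model, score in sorted(releases):
        if score > best:  # new state of the art at release time
            best = score
            leaders.append((released, model, score))
    return leaders

for released, model, score in dominance_timeline(releases):
    print(f"{released}: {model} ({score:.1f}%)")
```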
Organizations
Anthropic (71.8%), OpenAI (28.2%)
Coding Categories Performance
Model performance across different coding domains and specializations
Note: These rankings reflect each model's performance on the benchmarks it has actually been evaluated on. They do not necessarily indicate absolute superiority in a category, since most models have not been run on every benchmark.
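In other words, a category rank is driven only by whichever benchmarks a model has results for. Here is a hedged sketch of that aggregation, assuming a simple mean over available benchmark scores; the site's exact weighting is not documented here, and the model/benchmark names below are hypothetical.

```python
from statistics import mean

# Hypothetical per-model benchmark scores for one category; a missing key
# means the model was never evaluated on that benchmark.
scores = {
    "model-a": {"bench-1": 0.91},
    "model-b": {"bench-1": 0.90, "bench-2": 0.84},
    "model-c": {"bench-2": 0.88},
}

def category_ranking(scores):
    """Rank models by mean score over whatever benchmarks each one has."""
    rows = [
        (model, mean(results.values()), len(results))
        for model, results in scores.items()
        if results  # skip models with no evaluations in this category
    ]
    # Highest average first; ties broken by broader benchmark coverage.
    return sorted(rows, key=lambda r: (-r[1], -r[2]))

for rank, (model, avg, n) in enumerate(category_ranking(scores), start=1):
    print(f"#{rank} {model}: {avg:.0%} ({n} benchmark{'s' if n != 1 else ''})")
```

Note that a model with a single strong benchmark result can outrank one with a slightly lower average across several benchmarks, which is exactly the caveat the note above flags.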
Python Coding
Focuses on generating, completing, and debugging Python code.
| Rank | Model | Organization | Score | Benchmarks |
|------|-------|--------------|-------|------------|
| #1 | Llama-3.3 Nemotron Super 49B | NVIDIA | 91% | 1 |
| #2 | Qwen2.5-Coder 32B Instruct | Alibaba / Qwen | 90% | 1 |
| #3 | Qwen2.5 72B Instruct | Alibaba / Qwen | 88% | 1 |
| #4 | Llama 3.1 Nemotron Nano 8B | NVIDIA | 85% | 1 |
| #5 | Qwen2.5 32B Instruct | Alibaba / Qwen | 84% | 1 |
Agentic Coding
Assesses autonomous agents for code editing, issue resolution, and tool-using workflows.
| Rank | Model | Organization | Score | Benchmarks |
|------|-------|--------------|-------|------------|
| #1 | Claude Opus 4.5 | Anthropic | 81% | 1 |
| #2 | Claude Opus 4.6 | Anthropic | 81% | 1 |
| #3 | Gemini 3.1 Pro | Google DeepMind | 81% | 1 |
| #4 | Minimax M 2.5 | MiniMax | 80% | 1 |
| #5 | GPT-5.2 | OpenAI | 80% | 1 |
Repository-Level Coding
Involves understanding and modifying code in full repositories.
| Rank | Model | Organization | Score | Benchmarks |
|------|-------|--------------|-------|------------|
| #1 | Claude Opus 4.5 | Anthropic | 81% | 1 |
| #2 | Claude Opus 4.6 | Anthropic | 81% | 1 |
| #3 | Gemini 3.1 Pro | Google DeepMind | 81% | 1 |
| #4 | Minimax M 2.5 | MiniMax | 80% | 1 |
| #5 | GPT-5.2 | OpenAI | 80% | 1 |