LLM Leaderboard for Agentic Coders

Total Models: 97 (AI models tracked)
Organizations: 19 (Companies & labs)
Providers: 17 (API providers)
Benchmarks: 21 (Evaluation metrics)

Top Models by SWE-rebench

[Chart: SWE-rebench scores for recent model releases, Feb 14–19, 2026]
Feb 14, 2026--76.5%------------------
SWE-bench Dominance Timeline
Models that achieved the highest SWE-bench score at the time of their release (Aug 2025 to Feb 2026):
- Claude Opus 4.5 (Anthropic)
- Claude Sonnet 4.5 (Anthropic)
- GPT-5 (OpenAI)
- GPT-5 Codex (OpenAI)
Share of the period by organization: Anthropic 71.8%, OpenAI 28.2%

Coding Categories Performance

Model performance across different coding domains and specializations

Note: These rankings reflect performance on available benchmarks for each model. Rankings do not necessarily indicate absolute superiority in a category, as most models have not been evaluated on all benchmarks.
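The caveat above can be made concrete with a small sketch. Assuming hypothetical score data (the names and numbers below are illustrative, not the leaderboard's actual dataset or pipeline), one plausible way to produce such per-category rankings is to average each model's scores over whichever benchmarks it has been evaluated on:

```python
# Hypothetical per-model benchmark scores (percent). Models may be
# missing entries for benchmarks they were never evaluated on, which
# is exactly why these rankings are not directly comparable.
scores = {
    "model-a": {"bench-1": 91.0},
    "model-b": {"bench-1": 84.0, "bench-2": 88.0},
}

def rank(models: dict[str, dict[str, float]]) -> list[tuple[str, float, int]]:
    """Rank models by mean score over their *available* benchmarks.

    Returns (model, mean_score, benchmark_count) tuples, best first.
    A model scored on a single favorable benchmark can outrank one
    averaged over several harder benchmarks.
    """
    rows = [
        (name, sum(s.values()) / len(s), len(s))
        for name, s in models.items()
        if s  # skip models with no benchmark results at all
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)

for i, (name, mean, n) in enumerate(rank(scores), start=1):
    print(f"#{i} {name}: {mean:.0f}% ({n} benchmark{'s' if n != 1 else ''})")
```

Here "model-a" ranks first on one benchmark even though "model-b" was evaluated more broadly, which is the pattern the note warns about.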

Python Coding

Focuses on generating, completing, and debugging Python code.

#1 Llama-3.3 Nemotron Super 49B (NVIDIA): 91% (1 benchmark)
#2 Qwen2.5-Coder 32B Instruct (Alibaba / Qwen): 90% (1 benchmark)
#3 Qwen2.5 72B Instruct (Alibaba / Qwen): 88% (1 benchmark)
#4 Llama 3.1 Nemotron Nano 8B (NVIDIA): 85% (1 benchmark)
#5 Qwen2.5 32B Instruct (Alibaba / Qwen): 84% (1 benchmark)
Agentic Coding

Assesses autonomous agents for code editing, issue resolution, and tool-using workflows.

#1 Claude Opus 4.5 (Anthropic): 81% (1 benchmark)
#2 Claude Opus 4.6 (Anthropic): 81% (1 benchmark)
#3 Gemini 3.1 Pro (Google DeepMind): 81% (1 benchmark)
#4 Minimax M 2.5 (MiniMax): 80% (1 benchmark)
#5 GPT-5.2 (OpenAI): 80% (1 benchmark)
Repository-Level Coding

Involves understanding and modifying code across full repositories.

#1 Claude Opus 4.5 (Anthropic): 81% (1 benchmark)
#2 Claude Opus 4.6 (Anthropic): 81% (1 benchmark)
#3 Gemini 3.1 Pro (Google DeepMind): 81% (1 benchmark)
#4 Minimax M 2.5 (MiniMax): 80% (1 benchmark)
#5 GPT-5.2 (OpenAI): 80% (1 benchmark)