LLM Leaderboard for Agentic Coders

Total Models: 73 (AI models tracked)
Organizations: 17 (companies & labs)
Providers: 17 (API providers)
Benchmarks: 13 (evaluation metrics)

Top Models by SWE-rebench

Release date    SWE-rebench score
Feb 5, 2026     51.7%
Dec 11, 2025    51.0%
Nov 19, 2025    48.5%
Sep 29, 2025    47.1%

[Model names and the remaining per-benchmark columns were not recoverable from the source table; only each entry's release date and SWE-rebench score are shown.]
SWE-bench Dominance Timeline
Models that achieved the highest SWE-bench score at the time of their release
Aug 2025 to Feb 2026, most recent first:

Claude Opus 4.5 (Anthropic)
GPT-5.1 Thinking (OpenAI)
GPT-5 (OpenAI)
Claude Opus 4.1 (Anthropic)

Dominance share by organization: OpenAI 55.4%, Anthropic 44.6%
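The timeline is a running-maximum selection: a model appears only if its SWE-bench score at release beat every earlier entry, and each organization's share is the portion of the window during which its model held the top spot. A minimal sketch of that computation, using illustrative (date, model, organization, score) records rather than the leaderboard's actual data:

```python
from datetime import date

# Illustrative release records: (date, model, organization, SWE-bench score).
# Dates and scores are assumptions for the sketch, not the leaderboard's data.
releases = [
    (date(2025, 8, 5), "Claude Opus 4.1", "Anthropic", 74.5),
    (date(2025, 8, 7), "GPT-5", "OpenAI", 75.0),
    (date(2025, 11, 13), "GPT-5.1 Thinking", "OpenAI", 77.0),
    (date(2025, 11, 24), "Claude Opus 4.5", "Anthropic", 80.9),
]

def dominance_timeline(releases):
    """Models that set a new SWE-bench high-water mark at release time."""
    timeline, best = [], float("-inf")
    for day, model, org, score in sorted(releases):
        if score > best:  # strictly better than every earlier release
            timeline.append((day, model, org, score))
            best = score
    return timeline

timeline = dominance_timeline(releases)

# Each leader holds the top spot until the next takeover; an organization's
# share is its held days over the whole charted window.
window_end = date(2026, 2, 5)  # assumed end of the chart
boundaries = [d for d, *_ in timeline] + [window_end]
held = {}
for (day, model, org, score), nxt in zip(timeline, boundaries[1:]):
    held[org] = held.get(org, 0) + (nxt - day).days
    print(f"{day}  {model} ({org})  {score}%")

total = sum(held.values())
for org, days in held.items():
    print(f"{org}: {100 * days / total:.1f}% of the window")
```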

Coding Categories Performance

Model performance across different coding domains and specializations

Note: Rankings reflect performance on the benchmarks each model has actually been evaluated on. Because most models have not been run on every benchmark, a higher rank does not necessarily indicate absolute superiority in a category.
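Concretely, a category score here behaves like an average over whatever benchmark scores a model has, so a model evaluated on one benchmark can outrank a model evaluated on many. A minimal sketch of that aggregation, with made-up model names and scores (the mean-over-available-benchmarks rule is an assumption, not the site's documented method):

```python
# Made-up per-model benchmark scores; None marks a benchmark the model
# has not been evaluated on.
scores = {
    "model-a": {"swe-bench": 0.81, "swe-rebench": None, "terminal-bench": None},
    "model-b": {"swe-bench": 0.78, "swe-rebench": 0.52, "terminal-bench": 0.45},
}

def category_score(benchmarks):
    """Mean over the available benchmarks only."""
    vals = [v for v in benchmarks.values() if v is not None]
    return sum(vals) / len(vals) if vals else None

ranked = sorted(
    ((name, category_score(b), sum(v is not None for v in b.values()))
     for name, b in scores.items()),
    key=lambda row: row[1],
    reverse=True,
)
for rank, (name, score, n) in enumerate(ranked, start=1):
    unit = "benchmark" if n == 1 else "benchmarks"
    print(f"#{rank} {name}: {score:.0%} ({n} {unit})")
# model-a ranks first on a single benchmark even though model-b has broader
# coverage, which is exactly the caveat the note above describes.
```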

Agentic Coding

Assesses autonomous agents for code editing, issue resolution, and tool-using workflows.

#1 Claude Opus 4.5 (Anthropic): 81% (1 benchmark)
#2 Claude Opus 4.6 (Anthropic): 81% (1 benchmark)
#3 Minimax M 2.5 (MiniMax): 80% (1 benchmark)
#4 GPT-5.2 (OpenAI): 80% (1 benchmark)
#5 Gemini 3 Flash (Google DeepMind): 78% (1 benchmark)
Repository-Level Coding

Involves understanding and modifying code in full repositories.

#1 Claude Opus 4.5 (Anthropic): 81% (1 benchmark)
#2 Claude Opus 4.6 (Anthropic): 81% (1 benchmark)
#3 Minimax M 2.5 (MiniMax): 80% (1 benchmark)
#4 GPT-5.2 (OpenAI): 80% (1 benchmark)
#5 Gemini 3 Flash (Google DeepMind): 78% (1 benchmark)