LLM Leaderboard for Agentic Coders

Total Models

AI models tracked

Organizations

Companies & labs

Providers

API providers

Benchmarks

Evaluation metrics

Top Models by SWE-rebench


#01Gemini 3.1 Pro	Google DeepMind	Feb 19, 2026	-	74.8%	80.6%	77.1%	-	59.7%	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
#02Claude Sonnet 4.6	Anthropic	Feb 17, 2026	-	59.1%	79.6%	58.3%	74.7%	63.3%	-	81.7%	89.9%	49.0%	33.2%	-	61.3%	89.3%	-	74.5%	75.6%	72.5%	91.7%	97.9%	-
#03Qwen3.5-397B-A17B	Alibaba / Qwen	Feb 16, 2026	-	-	76.4%	-	-	54.5%	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
#04Seed 2.0 Pro	ByteDance	Feb 14, 2026	-	-	76.5%	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-

SWE-bench Dominance Timeline

Models that achieved the highest SWE-bench score at the time of their release

Claude Opus 4.5

Claude Sonnet 4.5

GPT-5

GPT-5 Codex

Aug 2025Mar 2026

Organizations

Anthropic (74.4%)

OpenAI (25.6%)

Coding Categories Performance

Model performance across different coding domains and specializations

Note: These rankings reflect performance on available benchmarks for each model. Rankings do not necessarily indicate absolute superiority in a category, as most models have not been evaluated on all benchmarks.

Python Coding

Focuses on generating, completing, and debugging Python code.

Llama-3.3 Nemotron Super 49BNVIDIA

91%

1 benchmarks

Qwen2.5-Coder 32B InstructAlibaba / Qwen

90%

1 benchmarks

Qwen2.5 72B InstructAlibaba / Qwen

88%

1 benchmarks

Llama 3.1 Nemotron Nano 8BNVIDIA

85%

1 benchmarks

Qwen2.5 32B InstructAlibaba / Qwen

84%

1 benchmarks

Agentic Coding

Assesses autonomous agents for code editing, issue resolution, and tool-using workflows.

Claude Opus 4.5Anthropic

81%

1 benchmarks

Claude Opus 4.6Anthropic

81%

1 benchmarks

Gemini 3.1 ProGoogle DeepMind

81%

1 benchmarks

Minimax M 2.5MiniMax

80%

1 benchmarks

GPT-5.2OpenAI

80%

1 benchmarks

Repository-Level Coding

Involves understanding and modifying code in full repositories.

Claude Opus 4.5Anthropic

81%

1 benchmarks

Claude Opus 4.6Anthropic

81%

1 benchmarks

Gemini 3.1 ProGoogle DeepMind

81%

1 benchmarks

Minimax M 2.5MiniMax

80%

1 benchmarks

GPT-5.2OpenAI

80%

1 benchmarks