LLM Leaderboard for Builders

Total Models: 165 (AI models tracked)
Organizations: 17 (companies & labs)
Providers: 20 (API providers)
Benchmarks: 342 (evaluation metrics)

SWE-bench Dominance Timeline

Models that achieved the highest SWE-bench score at the time of their release (most recent first):

Claude Sonnet 4.5 (Anthropic)
GPT-5 (OpenAI)
Claude Opus 4.1 (Anthropic)
Claude Sonnet 4 (Anthropic)
Claude Opus 4 (Anthropic)
Claude 3.7 Sonnet (Anthropic)
DeepSeek-V3.1 (DeepSeek)
Claude 3.5 Sonnet (Anthropic)
o1-preview (OpenAI)
GPT-4o (OpenAI)
DeepSeek-V2.5 (DeepSeek)

Recent top scores: 77.2% (Sep 29, 2025), 74.9% (Aug 7, 2025), 74.5% (Aug 5, 2025), 74.5% (Sep 15, 2025)
Timeline span: May 2024 to Oct 2025
Share of timeline by organization: Anthropic (48.8%), OpenAI (25.1%), DeepSeek (26.1%)
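The dominance timeline above can be derived from release-dated scores by keeping a running maximum: a model enters the timeline only if it beats every model released before it. A minimal sketch, using hypothetical model names, dates, and scores (not the leaderboard's actual data):

```python
from datetime import date

# Hypothetical release records: (model, release_date, swe_bench_score).
# Names, dates, and scores are illustrative only.
releases = [
    ("model-c", date(2025, 2, 1), 0.62),
    ("model-a", date(2024, 5, 1), 0.40),
    ("model-b", date(2024, 9, 1), 0.35),  # never leads: below model-a
    ("model-d", date(2025, 8, 1), 0.75),
]

def dominance_timeline(releases):
    """Models that held the highest score at the time of their release."""
    best = float("-inf")
    leaders = []
    for model, released, score in sorted(releases, key=lambda r: r[1]):
        if score > best:  # strictly better than every earlier release
            best = score
            leaders.append((model, released.isoformat(), score))
    return leaders

print(dominance_timeline(releases))
```

Here model-b never appears in the timeline even though it outscores older baselines on paper, because model-a had already posted a higher score by its release date.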

Coding Categories Performance

Model performance across different coding domains and specializations

Note: These rankings reflect performance on available benchmarks for each model. Rankings do not necessarily indicate absolute superiority in a category, as most models have not been evaluated on all benchmarks.
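One plausible way to produce rankings like those below, assuming a category score is the mean over whichever benchmarks a model has results for (an assumption; the page does not state its exact aggregation method):

```python
# Hypothetical per-benchmark scores; None means the model was not evaluated
# on that benchmark. Names and values are illustrative only.
scores = {
    "model-a": {"bench-1": 0.95, "bench-2": None},
    "model-b": {"bench-1": 0.90, "bench-2": 0.80},
}

def category_rank(scores):
    """Rank models by mean score over the benchmarks they were evaluated on."""
    rows = []
    for model, by_bench in scores.items():
        evaluated = [s for s in by_bench.values() if s is not None]
        if evaluated:
            rows.append((model, sum(evaluated) / len(evaluated), len(evaluated)))
    # Higher mean first; note a model with one benchmark can outrank
    # one evaluated on many, which is why the caveat above matters.
    return sorted(rows, key=lambda r: r[1], reverse=True)

for rank, (model, mean, n) in enumerate(category_rank(scores), start=1):
    print(f"#{rank} {model}: {mean:.0%} ({n} benchmark{'s' if n > 1 else ''})")
```

Under this scheme model-a ranks first on a single benchmark despite model-b having broader coverage, which is exactly the caveat the note describes.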

Python Coding

Focuses on generating, completing, and debugging Python code.

#1 Kimi K2 0905 (Moonshot AI): 95% (1 benchmark)
#2 Claude 3.5 Sonnet (Anthropic): 94% (1 benchmark)
#3 GPT-5 (OpenAI): 93% (1 benchmark)
#4 Kimi K2 Instruct (Moonshot AI): 93% (1 benchmark)
#5 Phi 4 Reasoning (Microsoft): 93% (1 benchmark)
Web Development

Evaluates coding in JavaScript/TypeScript for web frameworks like Next.js.

#1 GPT-5 (OpenAI): 88% (1 benchmark)
#2 Gemini 2.5 Pro Preview 06-05 (Google): 82% (1 benchmark)
#3 o3 (OpenAI): 81% (1 benchmark)
#4 Gemini 2.5 Pro (Google): 77% (1 benchmark)
#5 Qwen2.5 32B Instruct (Alibaba Cloud / Qwen Team): 75% (1 benchmark)
Terminal/Command Line Tasks

Tests command-line operations, scripting, and system interactions.

#1 Qwen2.5 VL 7B Instruct (Alibaba Cloud / Qwen Team): 60% (1 benchmark)
#2 Claude Sonnet 4.5 (Anthropic): 56% (2 benchmarks)
#3 Claude Opus 4.1 (Anthropic): 43% (1 benchmark)
#4 GLM-4.6 (Zhipu AI): 41% (1 benchmark)
#5 Claude Opus 4 (Anthropic): 39% (1 benchmark)
Agentic Coding

Assesses autonomous agents for code editing, issue resolution, and tool-using workflows.

#1 GPT-5 (OpenAI): 81% (2 benchmarks)
#2 Gemini 2.5 Pro Preview 06-05 (Google): 75% (2 benchmarks)
#3 o3 (OpenAI): 75% (2 benchmarks)
#4 GPT-5 Codex (OpenAI): 75% (1 benchmark)
#5 DeepSeek-V3.2-Exp (DeepSeek): 71% (2 benchmarks)
Multilingual Coding

Covers code generation across multiple programming languages.

#1 GPT-5 (OpenAI): 88% (1 benchmark)
#2 Gemini 2.5 Pro Preview 06-05 (Google): 82% (1 benchmark)
#3 o3 (OpenAI): 81% (1 benchmark)
#4 Gemini 2.5 Pro (Google): 75% (2 benchmarks)
#5 Qwen2.5 32B Instruct (Alibaba Cloud / Qwen Team): 75% (1 benchmark)
Coding Contests

Simulates competitive programming problems from platforms like LeetCode or Codeforces.

#1 GPT OSS 120B (OpenAI): 87% (1 benchmark)
#2 GPT OSS 20B (OpenAI): 84% (1 benchmark)
#3 Grok-3 Mini (xAI): 80% (1 benchmark)
#4 Grok-3 (xAI): 79% (1 benchmark)
#5 Grok-4 Heavy (xAI): 79% (1 benchmark)
Repository-Level Coding

Involves understanding and modifying code in full repositories.

#1 Phi-3.5-MoE-instruct (Microsoft): 85% (1 benchmark)
#2 Phi-3.5-mini-instruct (Microsoft): 77% (1 benchmark)
#3 GPT-5 (OpenAI): 75% (1 benchmark)
#4 GPT-5 Codex (OpenAI): 75% (1 benchmark)
#5 Claude Opus 4.1 (Anthropic): 75% (1 benchmark)
Function/Tool Calling

Evaluates API usage, function invocation, and tool integration in code.

#1 Llama 3.1 405B Instruct (Meta): 72% (3 benchmarks)
#2 Qwen3 235B A22B (Alibaba Cloud / Qwen Team): 71% (1 benchmark)
#3 Qwen3 32B (Alibaba Cloud / Qwen Team): 70% (1 benchmark)
#4 Qwen3 30B A3B (Alibaba Cloud / Qwen Team): 69% (1 benchmark)
#5 Llama 3.1 70B Instruct (Meta): 68% (3 benchmarks)
Math Reasoning for Coding

Tests mathematical problem-solving, which underpins algorithmic thinking in coding.

#1 Kimi K2 Instruct (Moonshot AI): 97% (1 benchmark)
#2 GPT-4.5 (OpenAI): 97% (1 benchmark)
#3 o3-mini (OpenAI): 95% (2 benchmarks)
#4 o1 (OpenAI): 94% (3 benchmarks)
#5 Mistral Large 2 (Mistral AI): 93% (1 benchmark)