LLM Leaderboard for Builders

Total Models: 165 (AI models tracked)
Organizations: 17 (companies & labs)
Providers: 20 (API providers)
Benchmarks: 342 (evaluation metrics)

SWE-bench Dominance Timeline

Models that achieved the highest SWE-bench score at the time of their release (most recent first):

Claude Sonnet 4.5 (Anthropic)
GPT-5 (OpenAI)
Claude Opus 4.1 (Anthropic)
Claude Sonnet 4 (Anthropic)
Claude Opus 4 (Anthropic)
Claude 3.7 Sonnet (Anthropic)
DeepSeek-V3.1 (DeepSeek)
Claude 3.5 Sonnet (Anthropic)
o1-preview (OpenAI)
GPT-4o (OpenAI)
DeepSeek-V2.5 (DeepSeek)

Recent top scores: 77.2% (Sep 29, 2025), 74.9% (Aug 7, 2025), 74.5% (Aug 5, 2025), 74.5% (Sep 15, 2025)
Timeline span: May 2024 to Oct 2025
Share of timeline by organization: Anthropic (48.8%), OpenAI (25.1%), DeepSeek (26.1%)
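The dominance timeline above can be derived from release-dated scores by keeping a running maximum: a model enters the timeline only if it beats every model released before it. A minimal sketch, using hypothetical model names, dates, and scores (not the leaderboard's actual data):

```python
from datetime import date

# Hypothetical release records: (model, release_date, swe_bench_score).
# Names, dates, and scores are illustrative only.
releases = [
    ("model-c", date(2025, 2, 1), 0.62),
    ("model-a", date(2024, 5, 1), 0.40),
    ("model-b", date(2024, 9, 1), 0.35),  # never leads: below model-a
    ("model-d", date(2025, 8, 1), 0.75),
]

def dominance_timeline(releases):
    """Models that held the highest score at the time of their release."""
    best = float("-inf")
    leaders = []
    for model, released, score in sorted(releases, key=lambda r: r[1]):
        if score > best:  # strictly better than every earlier release
            best = score
            leaders.append((model, released.isoformat(), score))
    return leaders

print(dominance_timeline(releases))
```

Here model-b never appears in the timeline even though it outscores older baselines on paper, because model-a had already posted a higher score by its release date.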

Coding Categories Performance

Model performance across different coding domains and specializations

Note: These rankings reflect performance on available benchmarks for each model. Rankings do not necessarily indicate absolute superiority in a category, as most models have not been evaluated on all benchmarks.
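One plausible way to produce rankings like those below, assuming a category score is the mean over whichever benchmarks a model has results for (an assumption; the page does not state its exact aggregation method):

```python
# Hypothetical per-benchmark scores; None means the model was not evaluated
# on that benchmark. Names and values are illustrative only.
scores = {
    "model-a": {"bench-1": 0.95, "bench-2": None},
    "model-b": {"bench-1": 0.90, "bench-2": 0.80},
}

def category_rank(scores):
    """Rank models by mean score over the benchmarks they were evaluated on."""
    rows = []
    for model, by_bench in scores.items():
        evaluated = [s for s in by_bench.values() if s is not None]
        if evaluated:
            rows.append((model, sum(evaluated) / len(evaluated), len(evaluated)))
    # Higher mean first; note a model with one benchmark can outrank
    # one evaluated on many, which is why the caveat above matters.
    return sorted(rows, key=lambda r: r[1], reverse=True)

for rank, (model, mean, n) in enumerate(category_rank(scores), start=1):
    print(f"#{rank} {model}: {mean:.0%} ({n} benchmark{'s' if n > 1 else ''})")
```

Under this scheme model-a ranks first on a single benchmark despite model-b having broader coverage, which is exactly the caveat the note describes.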

Python Coding

Focuses on generating, completing, and debugging Python code.

#1 Kimi K2 0905 (Moonshot AI): 95% (1 benchmark)
#2 Claude 3.5 Sonnet (Anthropic): 94% (1 benchmark)
#3 GPT-5 (OpenAI): 93% (1 benchmark)
#4 Kimi K2 Instruct (Moonshot AI): 93% (1 benchmark)
#5 Phi 4 Reasoning (Microsoft): 93% (1 benchmark)
Web Development

Evaluates coding in JavaScript/TypeScript for web frameworks like Next.js.

#1 GPT-5 (OpenAI): 88% (1 benchmark)
#2 Gemini 2.5 Pro Preview 06-05 (Google): 82% (1 benchmark)
#3 o3 (OpenAI): 81% (1 benchmark)
#4 Gemini 2.5 Pro (Google): 77% (1 benchmark)
#5 Qwen2.5 32B Instruct (Alibaba Cloud / Qwen Team): 75% (1 benchmark)
Terminal/Command Line Tasks

Tests command-line operations, scripting, and system interactions.

#1 Qwen2.5 VL 7B Instruct (Alibaba Cloud / Qwen Team): 60% (1 benchmark)
#2 Claude Sonnet 4.5 (Anthropic): 56% (2 benchmarks)
#3 Claude Opus 4.1 (Anthropic): 43% (1 benchmark)
#4 GLM-4.6 (Zhipu AI): 41% (1 benchmark)
#5 Claude Opus 4 (Anthropic): 39% (1 benchmark)
Agentic Coding

Assesses autonomous agents for code editing, issue resolution, and tool-using workflows.

#1 GPT-5 (OpenAI): 81% (2 benchmarks)
#2 Gemini 2.5 Pro Preview 06-05 (Google): 75% (2 benchmarks)
#3 o3 (OpenAI): 75% (2 benchmarks)
#4 GPT-5 Codex (OpenAI): 75% (1 benchmark)
#5 DeepSeek-V3.2-Exp (DeepSeek): 71% (2 benchmarks)
Multilingual Coding

Covers code generation across multiple programming languages.

#1 GPT-5 (OpenAI): 88% (1 benchmark)
#2 Gemini 2.5 Pro Preview 06-05 (Google): 82% (1 benchmark)
#3 o3 (OpenAI): 81% (1 benchmark)
#4 Gemini 2.5 Pro (Google): 75% (2 benchmarks)
#5 Qwen2.5 32B Instruct (Alibaba Cloud / Qwen Team): 75% (1 benchmark)
Coding Contests

Simulates competitive programming problems from platforms like LeetCode or Codeforces.

#1 GPT OSS 120B (OpenAI): 87% (1 benchmark)
#2 GPT OSS 20B (OpenAI): 84% (1 benchmark)
#3 Grok-3 Mini (xAI): 80% (1 benchmark)
#4 Grok-3 (xAI): 79% (1 benchmark)
#5 Grok-4 Heavy (xAI): 79% (1 benchmark)
Repository-Level Coding

Involves understanding and modifying code in full repositories.

#1 Phi-3.5-MoE-instruct (Microsoft): 85% (1 benchmark)
#2 Phi-3.5-mini-instruct (Microsoft): 77% (1 benchmark)
#3 GPT-5 (OpenAI): 75% (1 benchmark)
#4 GPT-5 Codex (OpenAI): 75% (1 benchmark)
#5 Claude Opus 4.1 (Anthropic): 75% (1 benchmark)
Function/Tool Calling

Evaluates API usage, function invocation, and tool integration in code.

#1 Llama 3.1 405B Instruct (Meta): 72% (3 benchmarks)
#2 Qwen3 235B A22B (Alibaba Cloud / Qwen Team): 71% (1 benchmark)
#3 Qwen3 32B (Alibaba Cloud / Qwen Team): 70% (1 benchmark)
#4 Qwen3 30B A3B (Alibaba Cloud / Qwen Team): 69% (1 benchmark)
#5 Llama 3.1 70B Instruct (Meta): 68% (3 benchmarks)
Math Reasoning for Coding

Tests mathematical problem-solving, which underpins algorithmic thinking in coding.

#1 Kimi K2 Instruct (Moonshot AI): 97% (1 benchmark)
#2 GPT-4.5 (OpenAI): 97% (1 benchmark)
#3 o3-mini (OpenAI): 95% (2 benchmarks)
#4 o1 (OpenAI): 94% (3 benchmarks)
#5 Mistral Large 2 (Mistral AI): 93% (1 benchmark)