ZeroEval

zeroeval.com
Platform Stats
Total Models: 20
Organizations: 7
Verified Benchmarks: 0
Multimodal Models: 14
Pricing Overview
Avg Input (per 1M tokens): $2.57
Avg Output (per 1M tokens): $12.65
Cheapest Model: $0.05
Premium Model: $15.00
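Since all prices are quoted per 1M tokens, the cost of a single request is a simple pro-rata calculation. A minimal sketch, using the platform-average rates above as example values (the function name and token counts are illustrative, not part of ZeroEval):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one request, given per-1M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# e.g. 10k input + 2k output tokens at the platform-average rates:
cost = request_cost(10_000, 2_000, 2.57, 12.65)
print(f"${cost:.4f}")  # prints "$0.0510"
```

The same formula applies to any model on the page; substitute its listed input and output rates.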
Supported Features
Number of models supporting each feature
Web Search: 4
Function Calling: 20
Structured Output: 20
Code Execution: 5
Batch Inference: 20
Fine-tuning: 3
Input Modalities
Models supporting different input types
Text: 20 (100%)
Image: 14 (70%)
Audio: 1 (5%)
Video: 3 (15%)
Models Overview
Top performers and pricing distribution

Pricing Distribution

Input pricing per 1M tokens
$0-1: 11 models
$1-5: 7 models
$15+: 2 models

Top Performing Models

By benchmark average
#1 Kimi K2 0905: 84.0%
#2 Claude Sonnet 4.5: 75.8%
#3 Claude 3.7 Sonnet: 74.1%
#4 Claude Opus 4.1: 72.7%
#5 GPT-5: 70.1%

Most Affordable Models

GPT-5 nano: $0.05/1M
GPT OSS 20B: $0.10/1M
GPT OSS 120B: $0.15/1M

Available Models

20 models available through ZeroEval

#01 Anthropic: Claude Sonnet 4.5
Claude Sonnet 4.5 is the best coding model in the world. It is the strongest model for building complex agents, the best model at using computers, and shows substantial gains in reasoning and math. Highest intelligence across most tasks, with exceptional agent and coding capabilities.
Sep 29, 2025
Proprietary
Benchmarks: 77.2%, -, -, -, -
#02 OpenAI: GPT-5
GPT-5 is our flagship model for coding, reasoning, and agentic tasks across domains. The best model for coding and agentic tasks with higher reasoning capabilities and medium speed.
Aug 7, 2025
Proprietary
Benchmarks: 74.9%, 88.0%, 93.4%, -, -
#03 Anthropic: Claude Opus 4.1
Claude Opus 4.1 is a hybrid reasoning model that pushes the frontier for coding and AI agents, featuring a 200K context window. It delivers superior performance and precision for real-world coding and agentic tasks, handling complex multi-step problems with rigor and attention to detail. With extended thinking capabilities, it offers instant responses or extended step-by-step thinking visible through user-friendly summaries. It advances state-of-the-art coding performance to 74.5% on SWE-bench Verified, excels at agentic search and research, and produces human-quality content with exceptional writing abilities. It supports 32K output tokens and adapts to specific coding styles while delivering exceptional quality for extensive generation and refactoring projects.
Aug 5, 2025
Proprietary
Benchmarks: 74.5%, -, -, -, -
#04 Anthropic: Claude Sonnet 4
Claude Sonnet 4, part of the Claude 4 family, is a significant upgrade to Claude Sonnet 3.7. It excels in coding (72.7% on SWE-bench) and reasoning, responding more precisely to instructions. Sonnet 4 offers an optimal mix of capability and practicality, with enhanced steerability, and supports extended thinking with tool use.
May 22, 2025
Proprietary
Benchmarks: 72.7%, -, -, -, -
#05 Anthropic: Claude Opus 4
Claude Opus 4 is Anthropic's most powerful model and the world's best coding model, part of the Claude 4 family. It delivers sustained performance on complex, long-running tasks and agent workflows. Opus 4 excels at coding, advanced reasoning, and can use tools (like web search) during extended thinking. It supports parallel tool execution and has improved memory capabilities.
May 22, 2025
Proprietary
Benchmarks: 72.5%, -, -, -, -
#06 Anthropic: Claude 3.7 Sonnet
The most intelligent Claude model and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. Shows particularly strong improvements in coding and front-end web development.
Feb 24, 2025
Proprietary
Benchmarks: 70.3%, -, -, -, -
#07 Z.ai: GLM-4.6
GLM-4.6 is the latest version of Z.ai's flagship model, bringing significant improvements over GLM-4.5. Key features include: 200K token context window (expanded from 128K), superior coding performance with better real-world application in Claude Code/Cline/Roo Code/Kilo Code, advanced reasoning with tool use during inference, stronger agent capabilities, and refined writing aligned with human preferences. GLM-4.6 achieves competitive performance with DeepSeek-V3.2-Exp and Claude Sonnet 4, reaching near parity with Claude Sonnet 4 (48.6% win rate) on CC-Bench real-world coding tasks.
Sep 30, 2025
MIT
Benchmarks: 68.0%, -, -, -, -
#08 DeepSeek: DeepSeek-V3.2-Exp
DeepSeek-V3.2-Exp is an experimental iteration introducing DeepSeek Sparse Attention (DSA) to improve long-context training and inference efficiency while keeping output quality on par with V3.1. It explores fine-grained sparse attention for extended sequence processing.
Sep 29, 2025
MIT
Benchmarks: 67.8%, 74.5%, -, 74.1%, -
#09 Z.ai: GLM-4.5
GLM-4.5 is an Agentic, Reasoning, and Coding (ARC) foundation model designed for intelligent agents, featuring 355 billion total parameters with 32 billion active parameters using MoE architecture. Trained on 23T tokens through multi-stage training, it is a hybrid reasoning model that provides two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. The model unifies agentic, reasoning, and coding capabilities with 128K context length support. It achieves exceptional performance with a score of 63.2 across 12 industry-standard benchmarks, placing 3rd among all proprietary and open-source models. Released under MIT open-source license allowing commercial use and secondary development.
Jul 28, 2025
MIT
Benchmarks: 64.2%, -, -, 72.9%, -
#10 Google: Gemini 2.5 Pro
Our most intelligent AI model, built for the agentic era. Gemini 2.5 Pro leads on common benchmarks with enhanced reasoning, multimodal capabilities (text, image, video, audio input), and a 1M token context window.
May 20, 2025
Proprietary
Benchmarks: 63.2%, 76.5%, -, -, -
Showing 1 to 10 of 20 models