HumanEval
About
HumanEval is a code generation benchmark developed by OpenAI consisting of 164 programming problems that evaluate a model's ability to generate functional Python code from natural language descriptions. Each problem supplies a function signature and docstring, and a generated solution counts as correct only if it passes the problem's predefined unit tests, making HumanEval a standard evaluation of programming competency and algorithmic reasoning.
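
To make the evaluation protocol concrete, the sketch below shows a HumanEval-style task in Python: the model receives a function signature plus a docstring and must produce a body, which is then executed against predefined unit tests. The is_palindrome problem, the completion text, and the check function are illustrative stand-ins, not items or code from the actual benchmark harness.

# Illustrative HumanEval-style task (a made-up stand-in, not a real benchmark item).
# The model sees PROMPT and must generate COMPLETION; the assembled function is
# then run against unit tests, and the problem counts as solved only if all pass.

PROMPT = '''
def is_palindrome(text: str) -> bool:
    """Return True if `text` reads the same forwards and backwards."""
'''

# Example model output (hypothetical completion, not from any listed model).
COMPLETION = '''
    cleaned = text.lower()
    return cleaned == cleaned[::-1]
'''


def check(candidate) -> bool:
    """Unit tests of the kind a HumanEval problem ships with."""
    try:
        assert candidate("Level") is True
        assert candidate("hello") is False
        assert candidate("") is True
        return True
    except AssertionError:
        return False


# Assemble and execute the candidate solution. The real harness sandboxes this
# step with timeouts; plain exec() is used here purely for illustration.
namespace = {}
exec(PROMPT + COMPLETION, namespace)
print("solved:", check(namespace["is_palindrome"]))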
Evaluation Stats
Total Models: 63
Organizations: 12
Verified Results: 0
Self-Reported: 62
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (63 models)
Top Score: 94.5%
Average Score: 80.6%
High Performers (80%+): 42

Top Organizations
#1 Moonshot AI (2 models): 93.9%
#2 DeepSeek (1 model): 89.0%
#3 IBM (3 models): 87.3%
#4 Alibaba Cloud / Qwen Team (10 models): 86.1%
#5 Amazon (3 models): 85.2%
Leaderboard
63 models ranked by performance on HumanEval
Rank | Release Date | License | Score
---|---|---|---
1 | Sep 5, 2025 | Proprietary | 94.5%
2 | Oct 22, 2024 | Proprietary | 93.7%
3 | Aug 7, 2025 | Proprietary | 93.4%
4 | Jul 11, 2025 | MIT | 93.3%
5 | Sep 19, 2024 | Apache 2.0 | 92.7%
6 | Sep 12, 2024 | Proprietary | 92.4%
7 | Jun 21, 2024 | Proprietary | 92.0%
8 | Jul 24, 2024 | Mistral Research License | 92.0%
9 | Feb 28, 2025 | Apache 2.0 | 91.5%
10 | May 13, 2024 | Proprietary | 90.2%
Showing 1 to 10 of 63 models
...