HumanEval

About

HumanEval is a foundational code generation benchmark developed by OpenAI. It consists of 164 programming problems that evaluate a model's ability to generate functional Python code from natural language descriptions (function signatures and docstrings). A solution counts as correct only if it passes the problem's predefined unit tests, which makes the benchmark a standard measure of programming competency and algorithmic reasoning.
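As a rough illustration of the evaluation flow described above (not this site's or OpenAI's actual harness), the Python sketch below checks a single made-up HumanEval-style problem: the model's completion is appended to the prompt, executed, and then run against the problem's unit tests. The toy problem, the completion, and the passes_unit_tests helper are assumptions for illustration; the real harness samples many completions per problem and executes them in a sandbox.

# Minimal sketch of HumanEval-style functional-correctness checking.
# The problem fields (prompt / entry_point / test) mirror the public
# HumanEval dataset format; the toy problem below is invented.

problem = {
    "prompt": (
        "def add(a: int, b: int) -> int:\n"
        '    """Return the sum of a and b."""\n'
    ),
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}

# A candidate completion as a model might generate it (function body only).
completion = "    return a + b\n"

def passes_unit_tests(problem: dict, completion: str) -> bool:
    """Execute prompt + completion, then run the problem's check() on it."""
    namespace: dict = {}
    try:
        exec(problem["prompt"] + completion, namespace)  # define the function
        exec(problem["test"], namespace)                 # define check()
        namespace["check"](namespace[problem["entry_point"]])
        return True
    except Exception:
        return False

print(passes_unit_tests(problem, completion))  # True for this toy example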

Evaluation Stats
Total Models: 63
Organizations: 12
Verified Results: 0
Self-Reported: 62
Benchmark Details
Max Score: 1
Language: en
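The maximum score of 1 reflects a normalized pass rate: HumanEval results are conventionally reported as pass@k, and leaderboards of this kind typically show pass@1 as a percentage (an assumption here, since the page does not say which k it uses). The sketch below implements the standard unbiased pass@k estimator from the original HumanEval paper, where n completions are sampled per problem and c of them pass the unit tests.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn from n samples (c of which are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with made-up numbers: 200 samples per problem, 140 correct.
print(pass_at_k(200, 140, 1))               # 0.7 (i.e. 70% pass@1)
print(round(pass_at_k(200, 140, 10), 4))    # higher, since any of 10 tries may pass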
Performance Overview
Score distribution and top performers

Score Distribution

63 models
Top Score: 94.5%
Average Score: 80.6%
High Performers (80%+): 42

Top Organizations

#1 Moonshot AI (2 models): 93.9%
#2 DeepSeek (1 model): 89.0%
#3 IBM (3 models): 87.3%
#4 Alibaba Cloud / Qwen Team (10 models): 86.1%
#5 Amazon (3 models): 85.2%
Leaderboard
63 models ranked by performance on HumanEval
Date           License                    Score
Sep 5, 2025    Proprietary                94.5%
Oct 22, 2024   Proprietary                93.7%
Aug 7, 2025    Proprietary                93.4%
Jul 11, 2025   MIT                        93.3%
Sep 19, 2024   Apache 2.0                 92.7%
Sep 12, 2024   Proprietary                92.4%
Jun 21, 2024   Proprietary                92.0%
Jul 24, 2024   Mistral Research License   92.0%
Feb 28, 2025   Apache 2.0                 91.5%
May 13, 2024   Proprietary                90.2%
Showing 1 to 10 of 63 models
...
Resources