HumanEval
About
HumanEval is a code generation benchmark developed by OpenAI consisting of 164 programming problems that evaluate a model's ability to generate functional Python code from natural language descriptions. Each problem supplies a function signature and docstring, and a generated solution counts as correct only if it passes the problem's predefined unit tests, making HumanEval a standard evaluation of programming competency and algorithmic reasoning.
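
To make the evaluation protocol concrete, the sketch below shows a HumanEval-style task in Python: the model receives a function signature plus a docstring and must produce a body, which is then executed against predefined unit tests. The is_palindrome problem, the completion text, and the check function are illustrative stand-ins, not items or code from the actual benchmark harness.

# Illustrative HumanEval-style task (a made-up stand-in, not a real benchmark item).
# The model sees PROMPT and must generate COMPLETION; the assembled function is
# then run against unit tests, and the problem counts as solved only if all pass.

PROMPT = '''
def is_palindrome(text: str) -> bool:
    """Return True if `text` reads the same forwards and backwards."""
'''

# Example model output (hypothetical completion, not from any listed model).
COMPLETION = '''
    cleaned = text.lower()
    return cleaned == cleaned[::-1]
'''


def check(candidate) -> bool:
    """Unit tests of the kind a HumanEval problem ships with."""
    try:
        assert candidate("Level") is True
        assert candidate("hello") is False
        assert candidate("") is True
        return True
    except AssertionError:
        return False


# Assemble and execute the candidate solution. The real harness sandboxes this
# step with timeouts; plain exec() is used here purely for illustration.
namespace = {}
exec(PROMPT + COMPLETION, namespace)
print("solved:", check(namespace["is_palindrome"]))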
Evaluation Stats
Total Models: 63
Organizations: 12
Verified Results: 0
Self-Reported: 62
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (63 models)
Top Score: 94.5%
Average Score: 80.6%
High Performers (80%+): 42

Top Organizations
#1 Moonshot AI (2 models): 93.9%
#2 DeepSeek (1 model): 89.0%
#3 IBM (3 models): 87.3%
#4 Alibaba Cloud / Qwen Team (10 models): 86.1%
#5 Amazon (3 models): 85.2%
Leaderboard
63 models ranked by performance on HumanEval
Rank | Release Date | License | Score
---|---|---|---
1 | Sep 5, 2025 | Proprietary | 94.5%
2 | Oct 22, 2024 | Proprietary | 93.7%
3 | Aug 7, 2025 | Proprietary | 93.4%
4 | Jul 11, 2025 | MIT | 93.3%
5 | Sep 19, 2024 | Apache 2.0 | 92.7%
6 | Sep 12, 2024 | Proprietary | 92.4%
7 | Jun 21, 2024 | Proprietary | 92.0%
8 | Jul 24, 2024 | Mistral Research License | 92.0%
9 | Feb 28, 2025 | Apache 2.0 | 91.5%
10 | May 13, 2024 | Proprietary | 90.2%
Showing 1 to 10 of 63 models
...