HumanEval-ER

About

HumanEval-ER is an enhanced variant of the HumanEval benchmark that adds error recovery testing: in addition to generating functional code, models must identify, understand, and correct programming mistakes. The benchmark therefore measures both code generation and debugging capability, giving a more comprehensive programming evaluation.
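
As an illustration only (not an actual HumanEval-ER problem), an error recovery task typically pairs a buggy implementation with unit tests that the repaired code must pass. The sketch below uses a hypothetical task of that shape; the function names and tests are invented for this example.

# Hypothetical error-recovery style task (illustrative, not from HumanEval-ER).
# The model is shown buggy code plus failing tests and must return a fix.

def buggy_running_max(nums):
    """Return the running maximum of a list.
    Bug: the comparison direction and the initial value are inverted."""
    result = []
    current = float("inf")
    for n in nums:
        if n < current:   # bug: should be n > current
            current = n
        result.append(current)
    return result

def fixed_running_max(nums):
    """Corrected implementation the model is expected to produce."""
    result = []
    current = float("-inf")
    for n in nums:
        if n > current:
            current = n
        result.append(current)
    return result

# Unit tests used to check the repair.
assert fixed_running_max([1, 3, 2, 5]) == [1, 3, 3, 5]
assert fixed_running_max([]) == []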

Evaluation Stats
Total Models: 1
Organizations: 1
Verified Results: 0
Self-Reported: 1
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

1 model
Top Score: 81.1%
Average Score: 81.1%
High Performers (80%+): 1

Top Organizations

#1 Moonshot AI (1 model): 81.1%
Leaderboard
1 model ranked by performance on HumanEval-ER
Jul 11, 2025 | License: MIT | Score: 81.1%
Resources