HumanEval-ER

About

HumanEval-ER is an enhanced variant of the HumanEval benchmark that adds error recovery testing: in addition to generating functional code, models must identify, understand, and correct programming mistakes. The benchmark therefore measures both code generation and debugging capability, giving a more comprehensive programming evaluation.
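
As an illustration only (not an actual HumanEval-ER problem), an error recovery task typically pairs a buggy implementation with unit tests that the repaired code must pass. The sketch below uses a hypothetical task of that shape; the function names and tests are invented for this example.

# Hypothetical error-recovery style task (illustrative, not from HumanEval-ER).
# The model is shown buggy code plus failing tests and must return a fix.

def buggy_running_max(nums):
    """Return the running maximum of a list.
    Bug: the comparison direction and the initial value are inverted."""
    result = []
    current = float("inf")
    for n in nums:
        if n < current:   # bug: should be n > current
            current = n
        result.append(current)
    return result

def fixed_running_max(nums):
    """Corrected implementation the model is expected to produce."""
    result = []
    current = float("-inf")
    for n in nums:
        if n > current:
            current = n
        result.append(current)
    return result

# Unit tests used to check the repair.
assert fixed_running_max([1, 3, 2, 5]) == [1, 3, 3, 5]
assert fixed_running_max([]) == []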

Evaluation Stats
Total Models: 1
Organizations: 1
Verified Results: 0
Self-Reported: 1
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

1 model
Top Score: 81.1%
Average Score: 81.1%
High Performers (80%+): 1

Top Organizations

#1 Moonshot AI (1 model): 81.1%
Leaderboard
1 model ranked by performance on HumanEval-ER
Jul 11, 2025 | License: MIT | Score: 81.1%
Resources