EvalPlus

About

EvalPlus is a rigorous code-evaluation framework that augments HumanEval and MBPP with roughly 80x and 35x more test cases respectively, enabling a more thorough assessment of LLM coding ability. The suite comprises HumanEval+ and MBPP+, which test functional correctness against the expanded test suites, and EvalPerf, which measures the efficiency of generated code. Together these components give a precise picture of the quality and performance of LLM-generated code across diverse programming scenarios.
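A minimal sketch of how a model is typically run against HumanEval+ with the pip-installable evalplus package; generate_solution is a hypothetical placeholder for whatever call produces code from your model, not part of EvalPlus itself.

```python
# Sketch: produce a samples.jsonl for EvalPlus scoring (assumes `pip install evalplus`).
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Hypothetical stand-in: query your LLM with the problem prompt and
    # return the completed Python solution as a string.
    raise NotImplementedError

samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is then run from the command line, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
# which reports pass@1 on both the base HumanEval tests and the
# extended HumanEval+ test suite.
```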

Evaluation Stats
Total Models: 4
Organizations: 2
Verified Results: 0
Self-Reported: 4
Benchmark Details
Max Score: 100
Language: English (en)
Performance Overview
Score distribution and top performers

Score Distribution (4 models)
Top Score: 80.3%
Average Score: 76.8%
High Performers (80%+): 1

Top Organizations

#1 Moonshot AI (1 model): 80.3%
#2 Alibaba Cloud / Qwen Team (3 models): 75.6%
Leaderboard
4 models ranked by performance on EvalPlus
Release Date    License          Score
Jul 11, 2025    MIT              80.3%
Jul 23, 2024    tongyi-qianwen   79.0%
Apr 29, 2025    Apache 2.0       77.6%
Jul 23, 2024    Apache 2.0       70.3%
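The 76.8% average reported in the Performance Overview is simply the mean of these four leaderboard scores; a quick check (a minimal sketch, nothing EvalPlus-specific):

```python
# Sanity check: the reported average is the mean of the four leaderboard scores.
scores = [80.3, 79.0, 77.6, 70.3]
average = sum(scores) / len(scores)
print(round(average, 1))  # 76.8
```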
Resources