EvalPlus

About

EvalPlus is a rigorous code-evaluation framework that augments HumanEval and MBPP with roughly 80x and 35x more test cases respectively, enabling a more thorough assessment of LLM coding ability. The suite comprises HumanEval+ and MBPP+, which test functional correctness against the expanded test suites, and EvalPerf, which measures the efficiency of generated code. Together these components give a precise picture of the quality and performance of LLM-generated code across diverse programming scenarios.
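A minimal sketch of how a model is typically run against HumanEval+ with the pip-installable evalplus package; generate_solution is a hypothetical placeholder for whatever call produces code from your model, not part of EvalPlus itself.

```python
# Sketch: produce a samples.jsonl for EvalPlus scoring (assumes `pip install evalplus`).
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Hypothetical stand-in: query your LLM with the problem prompt and
    # return the completed Python solution as a string.
    raise NotImplementedError

samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is then run from the command line, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
# which reports pass@1 on both the base HumanEval tests and the
# extended HumanEval+ test suite.
```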

Evaluation Stats
Total Models: 4
Organizations: 2
Verified Results: 0
Self-Reported: 4
Benchmark Details
Max Score: 100
Language: English (en)
Performance Overview
Score distribution and top performers

Score Distribution (4 models)
Top Score: 80.3%
Average Score: 76.8%
High Performers (80%+): 1

Top Organizations

#1 Moonshot AI (1 model): 80.3%
#2 Alibaba Cloud / Qwen Team (3 models): 75.6%
Leaderboard
4 models ranked by performance on EvalPlus
Release Date    License          Score
Jul 11, 2025    MIT              80.3%
Jul 23, 2024    tongyi-qianwen   79.0%
Apr 29, 2025    Apache 2.0       77.6%
Jul 23, 2024    Apache 2.0       70.3%
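The 76.8% average reported in the Performance Overview is simply the mean of these four leaderboard scores; a quick check (a minimal sketch, nothing EvalPlus-specific):

```python
# Sanity check: the reported average is the mean of the four leaderboard scores.
scores = [80.3, 79.0, 77.6, 70.3]
average = sum(scores) / len(scores)
print(round(average, 1))  # 76.8
```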
Resources