EvalPlus
About
EvalPlus is a rigorous code evaluation framework that augments HumanEval and MBPP with 80x and 35x more test cases, respectively, for more reliable assessment of LLM coding ability. The benchmark family comprises HumanEval+ and MBPP+, which test functional correctness against the expanded test suites, and EvalPerf, which measures the efficiency of generated code. Together, these components give a precise picture of the quality and performance of LLM-generated code across diverse programming tasks.
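For readers who want to reproduce a score, the typical workflow with the evalplus Python package looks roughly like the sketch below. This is a minimal illustration, not the authoritative recipe: `generate_one_completion` is a hypothetical placeholder for your own model call, and the exact package API and CLI flags should be checked against the EvalPlus documentation.

```python
# Minimal sketch, assuming the `evalplus` package is installed (pip install evalplus).
# `generate_one_completion` is a hypothetical stand-in for whatever call
# produces a solution string from your model.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call your LLM here and return the completed solution code.
    raise NotImplementedError

# Build one sample per HumanEval+ task and dump them to a JSONL file.
samples = [
    {"task_id": task_id, "solution": generate_one_completion(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is then run from the command line, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
# which reports pass@k on both the base HumanEval tests and the extra
# HumanEval+ tests.
```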
Evaluation Stats
Total Models: 4
Organizations: 2
Verified Results: 0
Self-Reported: 4
Benchmark Details
Max Score: 100
Language: English (en)
Performance Overview
Score distribution and top performers
Score Distribution: 4 models
Top Score: 80.3%
Average Score: 76.8%
High Performers (80%+): 1

Top Organizations
#1 Moonshot AI: 1 model, 80.3%
#2 Alibaba Cloud / Qwen Team: 3 models, 75.6%
Leaderboard
4 models ranked by performance on EvalPlus
Released | License | Score
---|---|---
Jul 11, 2025 | MIT | 80.3%
Jul 23, 2024 | tongyi-qianwen | 79.0%
Apr 29, 2025 | Apache 2.0 | 77.6%
Jul 23, 2024 | Apache 2.0 | 70.3%