MBPP EvalPlus

About

MBPP EvalPlus is an augmented version of the MBPP benchmark, developed as part of the EvalPlus framework, with substantially expanded test coverage for more rigorous code evaluation. It tests AI models' Python programming capabilities against larger test suites, giving a more thorough assessment of code correctness and edge-case handling than the original MBPP.
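To illustrate why expanded test coverage matters, here is a minimal sketch of a solution that passes a small base test set but fails once edge cases are added. It does not use the EvalPlus API; the names `passes_all`, `candidate_max`, `base_tests`, and `extended_tests` are hypothetical and exist only for this example.

```python
from typing import Callable, Iterable, Tuple

# A test case is (positional arguments, expected result).
TestCase = Tuple[tuple, object]


def passes_all(solution: Callable, tests: Iterable[TestCase]) -> bool:
    """Return True only if the solution matches every expected result."""
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed test.
            return False
    return True


def candidate_max(nums):
    """Hypothetical model output: intended to return the largest element."""
    best = 0  # Bug: wrong starting value when every element is negative.
    for n in nums:
        if n > best:
            best = n
    return best


# A small base suite, similar in spirit to a handful of original tests.
base_tests = [(([1, 5, 3],), 5), (([2, 2],), 2)]

# An extended suite that adds edge cases (all-negative input, a lone zero).
extended_tests = base_tests + [(([-4, -1, -9],), -1), (([0],), 0)]

print(passes_all(candidate_max, base_tests))      # True
print(passes_all(candidate_max, extended_tests))  # False
```

The same candidate counts as correct under the base suite and incorrect under the extended one, which is the kind of gap an augmented benchmark is meant to expose.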

Evaluation Stats

Total Models: 2

Organizations: 1

Verified Results: 0

Self-Reported: 2

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

2 models

Top Score

88.6%

Average Score

88.1%

High Performers (80%+)

Top Organizations

#1 Meta

2 models

88.1%

Leaderboard

2 models ranked by performance on MBPP EvalPlus

Rank	Model	Organization	Release Date	License	Score
#01	Llama 3.1 405B Instruct	Meta	Jul 23, 2024	Llama 3.1 Community License	88.6%
#02	Llama 3.3 70B Instruct	Meta	Dec 6, 2024	Llama 3.3 Community License Agreement	87.6%
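The summary figures on this page can be reproduced from the two leaderboard scores. The sketch below assumes the average is a simple mean rounded to one decimal place and that "high performer" means a score of 80% or above; the variable names are illustrative.

```python
# Scores (in percent) taken from the leaderboard rows above.
scores = {
    "Llama 3.1 405B Instruct": 88.6,
    "Llama 3.3 70B Instruct": 87.6,
}

# Top performer: the model with the highest score.
top_model = max(scores, key=scores.get)
top_score = scores[top_model]

# Average score: simple mean, rounded to one decimal place (assumed).
average = round(sum(scores.values()) / len(scores), 1)

# Models at or above the 80% threshold.
high_performers = [model for model, s in scores.items() if s >= 80.0]

print(top_model, top_score)
print(average)
print(len(high_performers))
```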

Resources

Research Paper