MBPP

About

MBPP (Mostly Basic Python Problems) is a foundational code generation benchmark featuring 974 Python programming tasks with natural language descriptions and test cases. Created by Google Research, this dataset evaluates AI models' ability to synthesize basic Python code from prompts, testing fundamental programming skills and algorithmic thinking through straightforward coding problems with automated verification.
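Each MBPP task pairs a short natural-language prompt with a reference solution and a list of assert-style test cases; a model's completion is scored by executing it against those asserts. The sketch below is a minimal illustration of that verification loop, assuming the public Hugging Face copy of MBPP and its `text`, `code`, and `test_list` fields (field names may differ in other mirrors), and it is not a hardened harness: executing untrusted model output this way should be sandboxed in practice.

# Minimal sketch of MBPP-style automated verification.
# Assumes the Hugging Face "mbpp" dataset with fields such as
# `text` (prompt), `code` (reference solution), and `test_list`
# (assert statements); adapt the names if your copy differs.
from datasets import load_dataset

def passes_tests(candidate_code: str, test_list: list[str]) -> bool:
    """Run candidate code, then each assert; any exception counts as failure."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function(s)
        for test in test_list:
            exec(test, namespace)         # e.g. an "assert some_func(...) == ..." line
        return True
    except Exception:
        return False

ds = load_dataset("mbpp", split="test")
task = ds[0]
print(task["text"])                       # natural-language problem statement

# Here the reference solution is graded against its own tests; in a real
# evaluation the model-generated completion would take its place.
print(passes_tests(task["code"], task["test_list"]))

A full evaluation simply repeats this loop over all tasks and reports the fraction that pass, which is how the percentage scores in the leaderboard below are derived.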

Evaluation Stats

Total Models: 31
Organizations: 6
Verified Results: 0
Self-Reported: 31
Benchmark Details

Max Score: 100
Language: en
Performance Overview

Score distribution and top performers

Score Distribution (31 models)
Top Score: 91.3%
Average Score: 73.0%
High Performers (80%+): 11

Top Organizations

#1 NVIDIA (2 models, 87.9%)
#2 Alibaba Cloud / Qwen Team (11 models, 81.2%)
#3 Microsoft (2 models, 75.2%)
#4 Mistral AI (3 models, 74.2%)
#5 Meta (2 models, 72.7%)
Leaderboard
31 models ranked by performance on MBPP

Date          License                        Score
Mar 18, 2025  Llama 3.1 Community License    91.3%
Sep 19, 2024  Apache 2.0                     90.2%
Sep 19, 2024  Qwen                           88.2%
Mar 18, 2025  Llama 3.1 Community License    84.6%
Sep 19, 2024  Apache 2.0                     84.0%
Feb 28, 2025  Apache 2.0                     84.0%
Sep 19, 2024  Apache 2.0                     83.5%
Sep 19, 2024  Apache 2.0                     82.0%
Apr 29, 2025  Apache 2.0                     81.4%
Aug 23, 2024  MIT                            80.8%

Showing 1 to 10 of 31 models
Resources