MBPP
About
MBPP (Mostly Basic Python Problems) is a foundational code generation benchmark of 974 Python programming tasks, each pairing a natural language description with test cases. Created by Google Research, MBPP evaluates a model's ability to synthesize basic Python code from a prompt; the bundled test cases allow generated solutions to be verified automatically, making the benchmark a test of fundamental programming skill and algorithmic thinking on straightforward problems.
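For illustration, a minimal sketch of what such a task and its automated verification can look like. The prompt, candidate solution, and asserts below are hypothetical stand-ins, not entries taken from the dataset, and the harness is a simplified assumption about how pass/fail checking typically works.

```python
# Hypothetical MBPP-style task: a prompt, a model-generated candidate solution,
# and assert-based test cases used for automated verification.

prompt = "Write a function to find the minimum element in a list of integers."

# A candidate solution a model might generate for the prompt above.
candidate_code = """
def find_min(nums):
    smallest = nums[0]
    for n in nums[1:]:
        if n < smallest:
            smallest = n
    return smallest
"""

# Illustrative test cases; real MBPP tasks ship similar assert statements.
test_cases = [
    "assert find_min([3, 1, 2]) == 1",
    "assert find_min([-5, 0, 7]) == -5",
    "assert find_min([42]) == 42",
]

def passes_all_tests(code: str, tests: list[str]) -> bool:
    """Run the candidate code, then each assert; any exception means the task is unsolved."""
    namespace: dict = {}
    try:
        exec(code, namespace)      # define the candidate function
        for test in tests:
            exec(test, namespace)  # raises AssertionError on a wrong answer
    except Exception:
        return False
    return True

print(passes_all_tests(candidate_code, test_cases))  # True for this candidate
```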
Evaluation Stats
Total Models: 31
Organizations: 6
Verified Results: 0
Self-Reported: 31
Benchmark Details
Max Score: 100
Language: English (en)
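The "Max Score: 100" entry indicates scores are reported on a percentage scale. A minimal sketch, assuming a model's score is simply the share of problems whose generated solution passes all of its test cases (individual submissions may compute or sample this differently):

```python
# Sketch of score aggregation, assuming the reported score is the percentage of
# problems solved (an assumption, not a documented formula for this leaderboard).

def benchmark_score(problem_passed: list[bool]) -> float:
    """Return the percentage of solved problems on a 0-100 scale."""
    return 100.0 * sum(problem_passed) / len(problem_passed)

# Example: solving 889 of the 974 MBPP problems yields roughly 91.3%.
results = [True] * 889 + [False] * (974 - 889)
print(round(benchmark_score(results), 1))  # 91.3
```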
Performance Overview
Score distribution and top performers
Score Distribution: 31 models
Top Score: 91.3%
Average Score: 73.0%
High Performers (80%+): 11

Top Organizations
#1 NVIDIA: 2 models, 87.9%
#2 Alibaba Cloud / Qwen Team: 11 models, 81.2%
#3 Microsoft: 2 models, 75.2%
#4 Mistral AI: 3 models, 74.2%
#5 Meta: 2 models, 72.7%
Leaderboard
31 models ranked by performance on MBPP
Date | License | Score
---|---|---
Mar 18, 2025 | Llama 3.1 Community License | 91.3%
Sep 19, 2024 | Apache 2.0 | 90.2%
Sep 19, 2024 | Qwen | 88.2%
Mar 18, 2025 | Llama 3.1 Community License | 84.6%
Sep 19, 2024 | Apache 2.0 | 84.0%
Feb 28, 2025 | Apache 2.0 | 84.0%
Sep 19, 2024 | Apache 2.0 | 83.5%
Sep 19, 2024 | Apache 2.0 | 82.0%
Apr 29, 2025 | Apache 2.0 | 81.4%
Aug 23, 2024 | MIT | 80.8%
Showing the top 10 of 31 models.