MBPP

Coding
About

MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals and standard library functionality.
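As a rough illustration of how a problem of this kind is typically checked, the sketch below runs a candidate solution against a few assert-style test cases and reports a pass rate. The task shown is hypothetical (it is not an entry from the dataset), the field names `text` and `test_list` follow the public dataset's layout as I understand it, and `passes` and `pass_rate` are helper names invented for this example. This page does not state the exact evaluation protocol behind the listed scores.

```python
# Hypothetical MBPP-style task (illustrative only, not an actual dataset entry).
task = {
    "text": "Write a function to find the largest element in a list.",
    "test_list": [
        "assert largest([1, 5, 3]) == 5",
        "assert largest([-2, -7]) == -2",
        "assert largest([4]) == 4",
    ],
}

# A candidate solution, as a model might return it.
candidate = """
def largest(items):
    return max(items)
"""


def passes(candidate_src, tests):
    """Return True if the candidate defines code that satisfies every test."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # define the candidate's functions
        for test in tests:
            exec(test, namespace)        # each test is a bare assert statement
    except Exception:
        # Any failure (syntax error, runtime error, failed assert) counts as a miss.
        return False
    return True


def pass_rate(results):
    """Percentage of problems solved, given one boolean per problem."""
    return 100.0 * sum(results) / len(results)


if __name__ == "__main__":
    solved = passes(candidate, task["test_list"])
    print(f"solved: {solved}, pass rate: {pass_rate([solved]):.1f}%")
```

Running model-generated code with `exec` is only reasonable inside a sandbox; a real harness would also add timeouts and process isolation.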

Evaluation Stats
Total Models: 20
Organizations: 6
Verified Results: 0
Self-Reported: 0
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers

Score Distribution

Models: 20
Top Score: 91.3%
Average Score: 80.3%
High Performers (80%+): 11
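The summary figures above can be reproduced from the 20 scores listed under Additional Metrics. The sketch below is a sanity check, not the site's own computation; it assumes the average is a plain arithmetic mean rounded to one decimal and that "High Performers" means a score of at least 80.

```python
from statistics import mean

# Scores copied from the Additional Metrics table on this page, in listed order.
scores = [
    91.3, 90.2, 88.2, 84.6, 84.0, 84.0, 83.5, 82.0, 81.4, 80.8,
    80.2, 79.2, 78.2, 77.6, 76.0, 74.7, 74.4, 73.2, 73.0, 69.6,
]

top_score = max(scores)
# Assumption: simple mean over all models, rounded to one decimal place.
average_score = round(mean(scores), 1)
# Assumption: the 80%+ threshold is inclusive.
high_performers = sum(1 for s in scores if s >= 80.0)

if __name__ == "__main__":
    print(f"models: {len(scores)}")
    print(f"top score: {top_score}%")
    print(f"average score: {average_score}%")
    print(f"high performers (80%+): {high_performers}")
```

Under those assumptions the computed values agree with the figures shown in this section.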

Top Organizations

#1 NVIDIA | 2 models | 87.9%
#2 Alibaba / Qwen | 10 models | 82.6%
#3 Microsoft | 1 model | 80.8%
#4 Meta AI | 1 model | 77.6%
#5 Google DeepMind | 3 models | 74.5%
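The organization ranking appears to be a per-organization mean of model scores. The sketch below checks that reading against the scores listed under Additional Metrics. The model-to-organization grouping is my own assumption (the page does not show it), the sixth organization is inferred to be Mistral AI from the three remaining models, and `org_averages` is a helper name made up for this example.

```python
from collections import defaultdict

# (organization, score) pairs. The grouping is an assumption based on model
# names; scores come from the Additional Metrics table on this page.
entries = [
    ("NVIDIA", 91.3), ("NVIDIA", 84.6),
    ("Alibaba / Qwen", 90.2), ("Alibaba / Qwen", 88.2),
    ("Alibaba / Qwen", 84.0), ("Alibaba / Qwen", 84.0),
    ("Alibaba / Qwen", 83.5), ("Alibaba / Qwen", 82.0),
    ("Alibaba / Qwen", 81.4), ("Alibaba / Qwen", 80.2),
    ("Alibaba / Qwen", 79.2), ("Alibaba / Qwen", 73.2),
    ("Microsoft", 80.8),
    ("Meta AI", 77.6),
    ("Google DeepMind", 76.0), ("Google DeepMind", 74.4),
    ("Google DeepMind", 73.0),
    ("Mistral AI", 78.2), ("Mistral AI", 74.7), ("Mistral AI", 69.6),
]


def org_averages(rows):
    """Group scores by organization and rank by mean score, highest first."""
    grouped = defaultdict(list)
    for org, score in rows:
        grouped[org].append(score)
    ranked = [(org, sum(vals) / len(vals), len(vals)) for org, vals in grouped.items()]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


if __name__ == "__main__":
    for rank, (org, avg, count) in enumerate(org_averages(entries), start=1):
        print(f"#{rank} {org}: {count} model(s), mean {avg:.2f}")
```

With this grouping the means land within rounding distance of the listed percentages, which is consistent with, but does not prove, a simple per-organization average.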
Leaderboard
20 models ranked by performance on MBPP
Date | License | Score
Mar 1, 2025 | Apache 2.0 | 91.3%
Nov 12, 2024 | Apache 2.0 | 90.2%
Sep 19, 2024 | Apache 2.0 | 88.2%
Jan 6, 2025 | Apache 2.0 | 84.6%
Sep 19, 2024 | Apache 2.0 | 84.0%
Mar 1, 2025 | Apache 2.0 | 84.0%
Nov 12, 2024 | Apache 2.0 | 83.5%
Sep 19, 2024 | Apache 2.0 | 82.0%
Apr 28, 2025 | Apache 2.0 | 81.4%
Aug 22, 2024 | MIT | 80.8%
Showing 1 to 10 of 20 models
Additional Metrics
Extended metrics for top models on MBPP
Model | Score | Cost | Size | Context
Llama-3.3 Nemotron Super 49B | 91.3 | - | 50B | -
Qwen2.5-Coder 32B Instruct | 90.2 | $0.09 $0.09 | 32B | 128K
Qwen2.5 72B Instruct | 88.2 | $0.35 $0.40 | 73B | 131K
Llama 3.1 Nemotron Nano 8B | 84.6 | - | 8B | -
Qwen2.5 32B Instruct | 84.0 | - | 33B | -
Qwen2.5-VL 32B Instruct | 84.0 | - | 34B | -
Qwen2.5-Coder 7B Instruct | 83.5 | - | 7B | -
Qwen2.5 14B Instruct | 82.0 | - | 15B | -
Qwen3-235B-A22B | 81.4 | $0.10 $0.10 | 235B | 128K
Phi-3.5-MoE Instruct | 80.8 | - | 60B | -
Qwen2 72B Instruct | 80.2 | - | 72B | -
Qwen2.5 7B Instruct | 79.2 | $0.30 $0.30 | 8B | 131K
Codestral 22B | 78.2 | - | 22B | -
Llama 4 Maverick | 77.6 | $0.17 $0.60 | 400B | 1.0M
Gemini Diffusion | 76.0 | - | - | -
Mistral Small 3.1 24B Instruct | 74.7 | - | 24B | -
Gemma 3 27B | 74.4 | $0.10 $0.20 | 27B | 131K
Qwen2.5-Omni-7B | 73.2 | - | 7B | -
Gemma 3 12B | 73.0 | $0.05 $0.10 | 12B | 131K
Mistral Small 3 24B | 69.6 | - | 24B | -