BigCodeBench

About

BigCodeBench is a comprehensive programming benchmark featuring practical and challenging coding tasks designed to evaluate Large Language Models' true programming capabilities. Constructed through collaboration between human experts and LLMs, it assesses code generation, function calling, and complex instruction following across diverse programming scenarios. The benchmark uses calibrated Pass@1 scoring with greedy decoding and offers both Complete and Instruct evaluation variants.
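As a minimal sketch of how the Pass@1 metric works: BigCodeBench's scoring is based on the standard unbiased pass@k estimator, which with greedy decoding (one sample per task) reduces to the fraction of tasks whose single generated solution passes all tests. The per-task outcomes below are hypothetical, for illustration only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per task,
    c of them pass the tests, k is the evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding, n = k = 1, so pass@1 is simply
# "did the single sample pass?", averaged over all tasks.
results = [True, False, True]  # hypothetical per-task test outcomes
score = sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)
print(f"Pass@1 = {score:.1%}")
```

Benchmark scores such as the 45.4% below are this average reported as a percentage over the full task set.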

Evaluation Stats
Total Models: 2
Organizations: 2
Verified Results: 0
Self-Reported: 2
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (2 models)

Top Score: 45.4%
Average Score: 43.2%
High Performers (80%+): 0

Top Organizations

#1 Google (1 model): 45.4%
#2 Alibaba Cloud / Qwen Team (1 model): 41.0%
Leaderboard
2 models ranked by performance on BigCodeBench (release date, license, score)

#1: May 20, 2025, Proprietary, 45.4%
#2: Sep 19, 2024, Apache 2.0, 41.0%
Resources