BigCodeBench

About

BigCodeBench is a comprehensive programming benchmark featuring practical and challenging coding tasks designed to evaluate Large Language Models' true programming capabilities. Constructed through collaboration between human experts and LLMs, it assesses code generation, function calling, and complex instruction following across diverse programming scenarios. The benchmark uses calibrated Pass@1 scoring with greedy decoding and offers both Complete and Instruct evaluation variants.
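As a minimal sketch of how the Pass@1 metric works: BigCodeBench's scoring is based on the standard unbiased pass@k estimator, which with greedy decoding (one sample per task) reduces to the fraction of tasks whose single generated solution passes all tests. The per-task outcomes below are hypothetical, for illustration only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per task,
    c of them pass the tests, k is the evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding, n = k = 1, so pass@1 is simply
# "did the single sample pass?", averaged over all tasks.
results = [True, False, True]  # hypothetical per-task test outcomes
score = sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)
print(f"Pass@1 = {score:.1%}")
```

Benchmark scores such as the 45.4% below are this average reported as a percentage over the full task set.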

Evaluation Stats
Total Models: 2
Organizations: 2
Verified Results: 0
Self-Reported: 2
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (2 models)

Top Score: 45.4%
Average Score: 43.2%
High Performers (80%+): 0

Top Organizations

#1 Google (1 model): 45.4%
#2 Alibaba Cloud / Qwen Team (1 model): 41.0%
Leaderboard
2 models ranked by performance on BigCodeBench (release date, license, score)

#1: May 20, 2025, Proprietary, 45.4%
#2: Sep 19, 2024, Apache 2.0, 41.0%
Resources