BBH
About
BBH (BIG-Bench Hard) is a reasoning benchmark of 23 tasks selected from the original BIG-Bench suite, chosen because prior language models failed to surpass average human-rater performance on them. The tasks demand multi-step reasoning across diverse domains, and the benchmark is widely used to measure how much chain-of-thought prompting improves a model's performance, making it a rigorous test of advanced logical capabilities.
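Scoring is plain per-task accuracy (hence the max score of 1, shown as a percentage on the leaderboard). Below is a minimal sketch of a per-task evaluation loop, assuming the community `lukaemon/bbh` mirror on Hugging Face (with `input`/`target` fields and a `test` split) and a hypothetical `generate()` helper wrapping the model under evaluation:

```python
# Minimal sketch of scoring one BBH task; the dataset name, field names,
# and generate() helper are assumptions, not the leaderboard's harness.
from datasets import load_dataset

def evaluate_task(task_name, generate):
    """Exact-match accuracy on one of the 23 BBH tasks (in [0, 1])."""
    data = load_dataset("lukaemon/bbh", task_name, split="test")
    correct = 0
    for example in data:
        # Chain-of-thought prompting: elicit step-by-step reasoning.
        # Real harnesses extract the final answer before exact-matching;
        # this sketch compares the stripped generation directly.
        prompt = f"{example['input']}\nLet's think step by step."
        answer = generate(prompt)  # hypothetical model call
        correct += answer.strip() == example["target"].strip()
    return correct / len(data)
```

A model's overall BBH score is then the average of these per-task accuracies across all 23 tasks.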
Evaluation Stats
Total Models: 8
Organizations: 3
Verified Results: 0
Self-Reported: 8
Benchmark Details
Max Score: 1 (reported as a percentage)
Language: English (en)
Performance Overview
Score distribution and top performers

Score Distribution (8 models)
Top Score: 88.9%
Average Score: 83.4%
High Performers (80%+): 6

Top Organizations
#1 DeepSeek: 84.3% (1 model)
#2 Alibaba Cloud / Qwen Team: 83.5% (4 models)
#3 Amazon: 82.9% (3 models)
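The aggregate figures above follow directly from the eight leaderboard scores listed below; a short sketch of the arithmetic:

```python
# Reproduces the overview statistics from the leaderboard scores below.
scores = [88.9, 86.9, 84.5, 84.3, 82.4, 82.4, 79.5, 78.2]

top_score = max(scores)                         # 88.9
average = sum(scores) / len(scores)             # 83.3875, shown as 83.4%
high_performers = sum(s >= 80 for s in scores)  # 6 models at 80% or above

print(f"Top Score: {top_score:.1f}%")
print(f"Average Score: {average:.1f}%")
print(f"High Performers (80%+): {high_performers}")
```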
Leaderboard
8 models ranked by performance on BBH
Release Date | License | Score
---|---|---
Apr 29, 2025 | Apache 2.0 | 88.9%
Nov 20, 2024 | Proprietary | 86.9%
Sep 19, 2024 | Apache 2.0 | 84.5%
May 8, 2024 | deepseek | 84.3%
Nov 20, 2024 | Proprietary | 82.4%
Jul 23, 2024 | tongyi-qianwen | 82.4%
Nov 20, 2024 | Proprietary | 79.5%
Sep 19, 2024 | Apache 2.0 | 78.2%