BBH

About

BBH (BIG-Bench Hard) is a challenging reasoning benchmark of 23 demanding tasks selected from the original BIG-Bench suite, chosen because prior language models failed to surpass average human-rater performance on them. The benchmark focuses on multi-step reasoning and is commonly used to evaluate the effectiveness of chain-of-thought prompting. Spanning diverse domains, BBH serves as a rigorous test of a language model's ability to carry out sophisticated logical reasoning.
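BBH is typically scored by exact-match accuracy: the model produces a chain-of-thought completion, the final answer is extracted, and it is compared against the reference. A minimal sketch of that scoring step, assuming the common convention that completions end with "So the answer is X." (the function names here are illustrative, not part of any official harness):

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer from a chain-of-thought completion.

    Assumes the BBH-style convention that the completion ends with
    'So the answer is X.' Falls back to the whole completion otherwise.
    """
    match = re.search(r"answer is\s*(.+?)\.?\s*$", completion.strip(), re.IGNORECASE)
    return match.group(1).strip() if match else completion.strip()

def exact_match_accuracy(completions, targets):
    """Fraction of completions whose extracted answer equals the target."""
    hits = sum(extract_answer(c) == t for c, t in zip(completions, targets))
    return hits / len(targets)
```

For example, `extract_answer("Let's think step by step. ... So the answer is (B).")` yields `"(B)"`, which is then compared verbatim to the gold label.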

Evaluation Stats
Total Models: 8
Organizations: 3
Verified Results: 0
Self-Reported: 8
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

Models: 8
Top Score: 88.9%
Average Score: 83.4%
High Performers (80%+): 6
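The summary figures above can be recomputed directly from the eight scores listed in the leaderboard on this page (the score list is transcribed from the page itself, not from an official source):

```python
# The eight BBH scores as shown in the leaderboard below (percent).
scores = [88.9, 86.9, 84.5, 84.3, 82.4, 82.4, 79.5, 78.2]

top_score = max(scores)                           # 88.9
average = round(sum(scores) / len(scores), 1)     # 83.4
high_performers = sum(s >= 80.0 for s in scores)  # 6, i.e. scores of 80%+

print(top_score, average, high_performers)
```

The average of 83.4% and the count of six high performers match the figures reported in the overview.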

Top Organizations

#1 DeepSeek (1 model): 84.3%
#2 Alibaba Cloud / Qwen Team (4 models): 83.5%
#3 Amazon (3 models): 82.9%
Leaderboard
8 models ranked by performance on BBH
Date           License          Score
Apr 29, 2025   Apache 2.0       88.9%
Nov 20, 2024   Proprietary      86.9%
Sep 19, 2024   Apache 2.0       84.5%
May 8, 2024    deepseek         84.3%
Nov 20, 2024   Proprietary      82.4%
Jul 23, 2024   tongyi-qianwen   82.4%
Nov 20, 2024   Proprietary      79.5%
Sep 19, 2024   Apache 2.0       78.2%

(Model names and links were not preserved in this extract; rows list release date, license, and BBH score only.)
Resources