BBH
About
BBH (BIG-Bench Hard) is a reasoning benchmark of 23 tasks selected from the original BIG-Bench suite, chosen because prior language models failed to surpass average human-rater performance on them. The tasks demand multi-step reasoning across diverse domains, and the benchmark is widely used to measure how much chain-of-thought prompting improves a model's performance, making it a rigorous test of advanced logical capabilities.
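Scoring is plain per-task accuracy (hence the max score of 1, shown as a percentage on the leaderboard). Below is a minimal sketch of a per-task evaluation loop, assuming the community `lukaemon/bbh` mirror on Hugging Face (with `input`/`target` fields and a `test` split) and a hypothetical `generate()` helper wrapping the model under evaluation:

```python
# Minimal sketch of scoring one BBH task; the dataset name, field names,
# and generate() helper are assumptions, not the leaderboard's harness.
from datasets import load_dataset

def evaluate_task(task_name, generate):
    """Exact-match accuracy on one of the 23 BBH tasks (in [0, 1])."""
    data = load_dataset("lukaemon/bbh", task_name, split="test")
    correct = 0
    for example in data:
        # Chain-of-thought prompting: elicit step-by-step reasoning.
        # Real harnesses extract the final answer before exact-matching;
        # this sketch compares the stripped generation directly.
        prompt = f"{example['input']}\nLet's think step by step."
        answer = generate(prompt)  # hypothetical model call
        correct += answer.strip() == example["target"].strip()
    return correct / len(data)
```

A model's overall BBH score is then the average of these per-task accuracies across all 23 tasks.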
Evaluation Stats
Total Models: 8
Organizations: 3
Verified Results: 0
Self-Reported: 8
Benchmark Details
Max Score: 1 (reported as a percentage)
Language: English (en)
Performance Overview
Score distribution and top performers

Score Distribution (8 models)
Top Score: 88.9%
Average Score: 83.4%
High Performers (80%+): 6

Top Organizations
#1 DeepSeek: 84.3% (1 model)
#2 Alibaba Cloud / Qwen Team: 83.5% (4 models)
#3 Amazon: 82.9% (3 models)
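The aggregate figures above follow directly from the eight leaderboard scores listed below; a short sketch of the arithmetic:

```python
# Reproduces the overview statistics from the leaderboard scores below.
scores = [88.9, 86.9, 84.5, 84.3, 82.4, 82.4, 79.5, 78.2]

top_score = max(scores)                         # 88.9
average = sum(scores) / len(scores)             # 83.3875, shown as 83.4%
high_performers = sum(s >= 80 for s in scores)  # 6 models at 80% or above

print(f"Top Score: {top_score:.1f}%")
print(f"Average Score: {average:.1f}%")
print(f"High Performers (80%+): {high_performers}")
```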
Leaderboard
8 models ranked by performance on BBH
Release Date | License | Score
---|---|---
Apr 29, 2025 | Apache 2.0 | 88.9%
Nov 20, 2024 | Proprietary | 86.9%
Sep 19, 2024 | Apache 2.0 | 84.5%
May 8, 2024 | deepseek | 84.3%
Nov 20, 2024 | Proprietary | 82.4%
Jul 23, 2024 | tongyi-qianwen | 82.4%
Nov 20, 2024 | Proprietary | 79.5%
Sep 19, 2024 | Apache 2.0 | 78.2%