BIG-Bench Hard
About
BIG-Bench Hard selects 23 of the most challenging tasks from the original BIG-Bench suite, focusing on scenarios where language models traditionally struggle to match human performance. This curated subset emphasizes complex reasoning, multi-step problem solving, and sophisticated cognitive tasks. The benchmark is specifically designed to evaluate chain-of-thought prompting effectiveness and assess the limits of current language model reasoning capabilities.
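To make the chain-of-thought evaluation concrete, the sketch below scores a single BBH task by exact match. It is purely illustrative: it assumes the file layout of the public BIG-Bench Hard release (a bbh/<task>.json file with "input"/"target" example pairs and a cot-prompts/<task>.txt few-shot prefix), and `generate` is a placeholder for whatever model API is being evaluated.

```python
import json
import re

def load_task(task_name, bbh_dir="bbh", cot_dir="cot-prompts"):
    # Assumed layout of the public BIG-Bench Hard release:
    # bbh/<task>.json holds {"examples": [{"input": ..., "target": ...}, ...]}
    # and cot-prompts/<task>.txt holds the few-shot chain-of-thought exemplars.
    with open(f"{bbh_dir}/{task_name}.json") as f:
        examples = json.load(f)["examples"]
    with open(f"{cot_dir}/{task_name}.txt") as f:
        cot_prefix = f.read()
    return examples, cot_prefix

def exact_match_accuracy(task_name, generate):
    # `generate` is a placeholder: any callable that takes a prompt string
    # and returns the model's completion text (hosted API, local model, ...).
    examples, cot_prefix = load_task(task_name)
    correct = 0
    for ex in examples:
        prompt = f"{cot_prefix}\n\nQ: {ex['input']}\nA: Let's think step by step."
        completion = generate(prompt)
        # Grade by exact match on the text after "the answer is", one common
        # way chain-of-thought answers are extracted for BBH-style scoring.
        match = re.search(r"the answer is (.+?)\.?\s*$",
                          completion.strip(), flags=re.IGNORECASE)
        prediction = match.group(1).strip() if match else completion.strip()
        correct += prediction == ex["target"]
    return correct / len(examples)

# Example: accuracy = exact_match_accuracy("date_understanding", my_model_call)
```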
Evaluation Stats
Total Models: 21
Organizations: 4
Verified Results: 0
Self-Reported: 21
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 21 models
Top Score: 93.1%
Average Score: 71.2%
High Performers (80%+): 8
(These aggregates follow directly from the per-model scores; a short sketch after the organization breakdown shows the arithmetic.)

Top Organizations
#1 Anthropic: 5 models, 85.9%
#2 Microsoft: 3 models, 72.8%
#3 Google: 10 models, 65.4%
#4 IBM: 3 models, 64.7%
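For readers who want to reproduce the aggregates above from the raw leaderboard, they reduce to simple summary statistics over per-model scores. The sketch below uses a small hypothetical list of records; the actual leaderboard has 21 entries, and each organization's figure is the mean over its models.

```python
from collections import defaultdict

# Hypothetical per-model records; the real leaderboard has 21 entries.
results = [
    {"organization": "Anthropic", "score": 0.931},
    {"organization": "Google",    "score": 0.876},
    {"organization": "Microsoft", "score": 0.791},
    {"organization": "IBM",       "score": 0.647},
]

top_score = max(r["score"] for r in results)
average_score = sum(r["score"] for r in results) / len(results)
high_performers = sum(r["score"] >= 0.80 for r in results)

# Per-organization averages, as in the "Top Organizations" breakdown.
by_org = defaultdict(list)
for r in results:
    by_org[r["organization"]].append(r["score"])
org_averages = {org: sum(scores) / len(scores) for org, scores in by_org.items()}

print(f"Top score: {top_score:.1%}")
print(f"Average score: {average_score:.1%}")
print(f"High performers (80%+): {high_performers}")
print(org_averages)
```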
Leaderboard
21 models ranked by performance on BIG-Bench Hard
| Release Date | License | Score |
|---|---|---|
| Oct 22, 2024 | Proprietary | 93.1% |
| Jun 21, 2024 | Proprietary | 93.1% |
| May 1, 2024 | Proprietary | 89.2% |
| Mar 12, 2025 | Gemma | 87.6% |
| Feb 29, 2024 | Proprietary | 86.8% |
| Mar 12, 2025 | Gemma | 85.7% |
| May 1, 2024 | Proprietary | 85.5% |
| Feb 29, 2024 | Proprietary | 82.9% |
| Aug 23, 2024 | MIT | 79.1% |
| Mar 13, 2024 | Proprietary | 73.7% |
Showing the top 10 of 21 models.