BIG-Bench Hard

About

BIG-Bench Hard selects 23 of the most challenging tasks from the original BIG-Bench suite, focusing on scenarios where language models traditionally struggle to match human performance. This curated subset emphasizes complex reasoning, multi-step problem solving, and sophisticated cognitive tasks. The benchmark is specifically designed to evaluate chain-of-thought prompting effectiveness and assess the limits of current language model reasoning capabilities.
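Scores on this page are reported against a maximum of 1 and shown as percentages. BBH results are conventionally computed as exact-match accuracy on each task's final answer, then averaged (unweighted) across tasks. A minimal sketch of that scoring, assuming model predictions have already been collected per task; the task names and the per_task structure below are hypothetical placeholders, not this leaderboard's actual harness:

from statistics import mean

def exact_match(prediction: str, target: str) -> bool:
    """Case- and whitespace-insensitive comparison of the final answer string."""
    return prediction.strip().lower() == target.strip().lower()

def score_task(predictions: list[str], targets: list[str]) -> float:
    """Accuracy on a single BBH task (max score 1.0)."""
    assert len(predictions) == len(targets)
    return mean(exact_match(p, t) for p, t in zip(predictions, targets))

def score_benchmark(per_task: dict[str, tuple[list[str], list[str]]]) -> float:
    """Unweighted average of per-task accuracies, as BBH leaderboards typically report."""
    return mean(score_task(preds, targets) for preds, targets in per_task.values())

# Hypothetical example with two tasks; the real benchmark spans 23.
per_task = {
    "boolean_expressions": (["True", "False"], ["True", "True"]),
    "date_understanding": (["(A)"], ["(A)"]),
}
print(f"{score_benchmark(per_task):.1%}")  # 75.0%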

Evaluation Stats
Total Models: 21
Organizations: 4
Verified Results: 0
Self-Reported: 21
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (21 models)
Top Score: 93.1%
Average Score: 71.2%
High Performers (80%+): 8

Top Organizations

#1 Anthropic: 5 models, 85.9%
#2 Microsoft: 3 models, 72.8%
#3 Google: 10 models, 65.4%
#4 IBM: 3 models, 64.7%
Leaderboard
21 models ranked by performance on BIG-Bench Hard
Date          License      Score
Oct 22, 2024  Proprietary  93.1%
Jun 21, 2024  Proprietary  93.1%
May 1, 2024   Proprietary  89.2%
Mar 12, 2025  Gemma        87.6%
Feb 29, 2024  Proprietary  86.8%
Mar 12, 2025  Gemma        85.7%
May 1, 2024   Proprietary  85.5%
Feb 29, 2024  Proprietary  82.9%
Aug 23, 2024  MIT          79.1%
Mar 13, 2024  Proprietary  73.7%
Showing 1 to 10 of 21 models
Resources