BIG-Bench Hard

About

BIG-Bench Hard selects 23 of the most challenging tasks from the original BIG-Bench suite, focusing on scenarios where language models traditionally struggle to match human performance. This curated subset emphasizes complex reasoning, multi-step problem solving, and sophisticated cognitive tasks. The benchmark is specifically designed to evaluate chain-of-thought prompting effectiveness and assess the limits of current language model reasoning capabilities.
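Scores on this page are reported against a maximum of 1 and shown as percentages. BBH results are conventionally computed as exact-match accuracy on each task's final answer, then averaged (unweighted) across tasks. A minimal sketch of that scoring, assuming model predictions have already been collected per task; the task names and the per_task structure below are hypothetical placeholders, not this leaderboard's actual harness:

from statistics import mean

def exact_match(prediction: str, target: str) -> bool:
    """Case- and whitespace-insensitive comparison of the final answer string."""
    return prediction.strip().lower() == target.strip().lower()

def score_task(predictions: list[str], targets: list[str]) -> float:
    """Accuracy on a single BBH task (max score 1.0)."""
    assert len(predictions) == len(targets)
    return mean(exact_match(p, t) for p, t in zip(predictions, targets))

def score_benchmark(per_task: dict[str, tuple[list[str], list[str]]]) -> float:
    """Unweighted average of per-task accuracies, as BBH leaderboards typically report."""
    return mean(score_task(preds, targets) for preds, targets in per_task.values())

# Hypothetical example with two tasks; the real benchmark spans 23.
per_task = {
    "boolean_expressions": (["True", "False"], ["True", "True"]),
    "date_understanding": (["(A)"], ["(A)"]),
}
print(f"{score_benchmark(per_task):.1%}")  # 75.0%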

Evaluation Stats
Total Models: 21
Organizations: 4
Verified Results: 0
Self-Reported: 21
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (21 models)
Top Score: 93.1%
Average Score: 71.2%
High Performers (80%+): 8

Top Organizations

#1 Anthropic: 5 models, 85.9%
#2 Microsoft: 3 models, 72.8%
#3 Google: 10 models, 65.4%
#4 IBM: 3 models, 64.7%
Leaderboard
21 models ranked by performance on BIG-Bench Hard
Date          License      Score
Oct 22, 2024  Proprietary  93.1%
Jun 21, 2024  Proprietary  93.1%
May 1, 2024   Proprietary  89.2%
Mar 12, 2025  Gemma        87.6%
Feb 29, 2024  Proprietary  86.8%
Mar 12, 2025  Gemma        85.7%
May 1, 2024   Proprietary  85.5%
Feb 29, 2024  Proprietary  82.9%
Aug 23, 2024  MIT          79.1%
Mar 13, 2024  Proprietary  73.7%
Showing 1 to 10 of 21 models
Resources