BoolQ

About

BoolQ is a question-answering benchmark of 15,942 naturally occurring yes/no questions that test reading comprehension and natural language inference. The questions are unprompted and unconstrained, so they reflect real-world information-seeking scenarios. Each question is paired with a supporting passage, and the model must answer yes or no, which makes the task a binary classification problem that requires deep text understanding. It resembles existing natural language inference tasks but uses practical, realistic queries.
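Since the task is binary, a score between 0 and 1 can be read as plain accuracy over yes/no answers. The sketch below illustrates that reading; it assumes accuracy is the reported metric (the page does not say so explicitly), and the helper names `normalize_answer` and `boolq_accuracy` are hypothetical, not part of any official evaluation harness.

```python
from typing import Optional, Sequence


def normalize_answer(text: str) -> Optional[bool]:
    """Map a free-form model reply to True (yes), False (no), or None if unclear."""
    words = text.strip().lower().split()
    first = words[0].strip(".,!:;") if words else ""
    if first in {"yes", "true"}:
        return True
    if first in {"no", "false"}:
        return False
    return None  # unparseable replies are scored as wrong below


def boolq_accuracy(predictions: Sequence[str], gold: Sequence[bool]) -> float:
    """Fraction of replies whose normalized answer matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(normalize_answer(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical replies and labels, for illustration only.
replies = ["Yes", "no.", "Maybe", "No, it is not"]
labels = [True, False, True, True]
print(boolq_accuracy(replies, labels))  # 0.5
```

Counting unparseable replies as errors is one design choice among several; a real harness might instead compare the model's likelihoods for "yes" and "no".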

Evaluation Stats

Total Models: 9

Organizations: 2

Verified Results: 0

Self-Reported: 9

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

9 models

Top Score

84.8%

Average Score

81.0%

High Performers (80%+)

Top Organizations

#1 Microsoft

3 models

81.3%

#2 Google

6 models

80.8%

Leaderboard

9 models ranked by performance on BoolQ

Rank	Model	Organization	Date	License	Score
#01	Gemma 2 27B	Google	Jun 27, 2024	Gemma	84.8%
#02	Phi-3.5-MoE-instruct	Microsoft	Aug 23, 2024	MIT	84.6%
#03	Gemma 2 9B	Google	Jun 27, 2024	Gemma	84.2%
#04	Gemma 3n E4B	Google	Jun 26, 2025	Proprietary	81.6%
#05	Gemma 3n E4B Instructed LiteRT Preview	Google	May 20, 2025	Gemma	81.6%
#06	Phi 4 Mini	Microsoft	Feb 1, 2025	MIT	81.2%
#07	Phi-3.5-mini-instruct	Microsoft	Aug 23, 2024	MIT	78.0%
#08	Gemma 3n E2B	Google	Jun 26, 2025	Proprietary	76.4%
#09	Gemma 3n E2B Instructed LiteRT (Preview)	Google	May 20, 2025	Gemma	76.4%
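The summary figures on this page (top score, average score, per-organization averages) can be recomputed from the leaderboard rows as a sanity check. The sketch below is a minimal example that assumes each summary is an unweighted mean rounded to one decimal place; the variable names are illustrative, and the count of models at 80% or above is derived from the rows, not quoted from the page.

```python
from collections import defaultdict
from statistics import mean

# (model, organization, score in percent), copied from the leaderboard rows.
rows = [
    ("Gemma 2 27B", "Google", 84.8),
    ("Phi-3.5-MoE-instruct", "Microsoft", 84.6),
    ("Gemma 2 9B", "Google", 84.2),
    ("Gemma 3n E4B", "Google", 81.6),
    ("Gemma 3n E4B Instructed LiteRT Preview", "Google", 81.6),
    ("Phi 4 Mini", "Microsoft", 81.2),
    ("Phi-3.5-mini-instruct", "Microsoft", 78.0),
    ("Gemma 3n E2B", "Google", 76.4),
    ("Gemma 3n E2B Instructed LiteRT (Preview)", "Google", 76.4),
]

scores = [score for _, _, score in rows]
top_score = max(scores)
average_score = round(mean(scores), 1)
high_performers = sum(score >= 80.0 for score in scores)

# Group scores by organization, then average each group.
by_org = defaultdict(list)
for _, org, score in rows:
    by_org[org].append(score)
org_averages = {org: round(mean(vals), 1) for org, vals in by_org.items()}

print(top_score)        # 84.8
print(average_score)    # 81.0
print(high_performers)  # 6
print(org_averages)     # {'Google': 80.8, 'Microsoft': 81.3}
```

Under these assumptions the recomputed values match the Top Score, Average Score, and Top Organizations figures shown above.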

Resources

Research Paper