BoolQ

text
About

BoolQ is a question answering benchmark of 15,942 naturally occurring yes/no questions that test reading comprehension and natural language inference. The questions are unprompted and unconstrained, so they reflect real-world information-seeking behavior. Each question calls for a binary (yes/no) classification that depends on deep text understanding, which makes the task similar to existing natural language inference tasks but built from practical, realistic queries.
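Because each question has a yes/no answer, scoring reduces to binary classification accuracy. The following is a minimal sketch of how such a score could be computed, not the benchmark's official scorer: the helper names `parse_yes_no` and `accuracy` are hypothetical, and the rule of reading only the first word of the model output is an assumption made for illustration.

```python
from typing import Optional, Sequence


def parse_yes_no(output: str) -> Optional[bool]:
    """Map free-form model output to True (yes), False (no), or None.

    Assumption: only the first word of the output is inspected.
    """
    tokens = output.strip().lower().split()
    if not tokens:
        return None
    first = tokens[0].strip(".,!:;\"'")
    if first in {"yes", "true"}:
        return True
    if first in {"no", "false"}:
        return False
    return None  # unparseable output is later counted as incorrect


def accuracy(predictions: Sequence[str], labels: Sequence[bool]) -> float:
    """Fraction of predictions whose parsed answer matches the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(parse_yes_no(p) == gold for p, gold in zip(predictions, labels))
    return correct / len(labels)


# Illustrative inputs only, not real BoolQ data.
example_score = accuracy(["Yes.", "no", "maybe", "True"], [True, False, True, False])
print(example_score)  # 0.5
```

The result is a fraction between 0 and 1, which matches the maximum score of 1 listed under Benchmark Details; multiplying by 100 gives the percentage form used on the leaderboard.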

Evaluation Stats
Total Models: 9
Organizations: 2
Verified Results: 0
Self-Reported: 9
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

9 models
Top Score: 84.8%
Average Score: 81.0%
High Performers (80%+): 6

Top Organizations

#1 Microsoft, 3 models, 81.3%
#2 Google, 6 models, 80.8%
Leaderboard
9 models ranked by performance on BoolQ
Date | License | Score
Jun 27, 2024 | Gemma | 84.8%
Aug 23, 2024 | MIT | 84.6%
Jun 27, 2024 | Gemma | 84.2%
Jun 26, 2025 | Proprietary | 81.6%
May 20, 2025 | Gemma | 81.6%
Feb 1, 2025 | MIT | 81.2%
Aug 23, 2024 | MIT | 78.0%
Jun 26, 2025 | Proprietary | 76.4%
May 20, 2025 | Gemma | 76.4%
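The summary figures in the Performance Overview can be checked against the nine leaderboard scores. The sketch below recomputes them with the standard library. The grouping by organization is an assumption: the page does not state which entry belongs to which organization, so the MIT-licensed entries are attributed to Microsoft and the Gemma and Proprietary entries to Google, which agrees with the reported counts of 3 and 6 models.

```python
from statistics import mean

# Scores in percent, taken from the leaderboard rows.
# Assumption: the license column identifies the organization.
scores_by_org = {
    "Google": [84.8, 84.2, 81.6, 81.6, 76.4, 76.4],   # Gemma / Proprietary
    "Microsoft": [84.6, 81.2, 78.0],                  # MIT
}

all_scores = [s for scores in scores_by_org.values() for s in scores]

top_score = max(all_scores)
average_score = round(mean(all_scores), 1)
high_performers = sum(s >= 80.0 for s in all_scores)
org_averages = {org: round(mean(s), 1) for org, s in scores_by_org.items()}

print(top_score)        # 84.8
print(average_score)    # 81.0
print(high_performers)  # 6
print(org_averages)     # {'Google': 80.8, 'Microsoft': 81.3}
```

Under that grouping, the recomputed values match the reported top score, average score, high-performer count, and per-organization figures.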
Resources