FlenQA

text
About

FlenQA (Flexible Length Question Answering) is a benchmark designed to isolate the effect of input length on the reasoning performance of language models. It contains 12,000 questions with True/False labels. Each question is presented in multiple context versions of different lengths, created by embedding the relevant information within longer, irrelevant text. Comparing accuracy across these versions shows how well a model maintains its reasoning as the input grows.
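The construction described above can be illustrated with a minimal Python sketch. It is not the benchmark's actual generation or scoring code: the function names `build_versions` and `accuracy_by_length` are hypothetical, the padding strategy (random insertion of filler sentences) is a simplification, and measuring length in whitespace-separated words is an assumption, since the description does not state the unit.

```python
import random
from collections import defaultdict


def word_count(sentences):
    """Total number of whitespace-separated words in a list of sentences."""
    return sum(len(s.split()) for s in sentences)


def build_versions(key_sentences, filler_sentences, target_lengths, seed=0):
    """Return {target_length: context} for one question (hypothetical helper).

    Every context contains all key sentences in their original order.
    Irrelevant filler sentences are inserted at random positions until the
    context reaches the target word count or the filler pool runs out.
    """
    rng = random.Random(seed)
    versions = {}
    for target in target_lengths:
        context = list(key_sentences)
        pool = list(filler_sentences)
        rng.shuffle(pool)
        while pool and word_count(context) < target:
            position = rng.randint(0, len(context))
            context.insert(position, pool.pop())
        versions[target] = " ".join(context)
    return versions


def accuracy_by_length(records):
    """Group (length, predicted, gold) records by length and compute accuracy."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for length, predicted, gold in records:
        totals[length] += 1
        correct[length] += int(predicted == gold)
    return {length: correct[length] / totals[length] for length in sorted(totals)}
```

Because the key sentences are identical in every version, any drop in accuracy between the shortest and longest contexts can be attributed to length rather than to a change in the underlying question.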

Evaluation Stats
Total Models: 2
Organizations: 1
Verified Results: 0
Self-Reported: 2
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

Models: 2
Top Score: 97.9%
Average Score: 97.8%
High Performers (80%+): 2

Top Organizations

#1 Microsoft: 2 models, 97.8%
Leaderboard
2 models ranked by performance on FlenQA
Date | License | Score
Apr 30, 2025 | MIT | 97.9%
Apr 30, 2025 | MIT | 97.7%
Resources