FlenQA
text
About
FlenQA (Flexible Length Question Answering) is a benchmark designed to isolate the impact of input length on AI reasoning performance. Featuring 12,000 questions with True/False labels, this dataset creates multiple context versions by embedding relevant information within longer, irrelevant texts. FlenQA evaluates how well models maintain reasoning accuracy as input length increases, testing long-context reasoning capabilities.
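The construction described above, where the same relevant facts are embedded in progressively longer irrelevant text, can be illustrated with a short sketch. This is not the benchmark authors' procedure: the function names `build_padded_context` and `accuracy_by_length`, the word-based length target, and the random placement of relevant sentences are all assumptions chosen for illustration.

```python
import random


def build_padded_context(relevant, filler, target_words, seed=0):
    """Embed relevant sentences, in their original order, among filler
    sentences until the context holds at least target_words words.

    Hypothetical helper: the padding source and placement strategy are
    assumptions, not the benchmark's actual method.
    """
    if not filler:
        raise ValueError("filler must contain at least one sentence")
    rng = random.Random(seed)
    padding = []
    word_count = sum(len(s.split()) for s in relevant)
    i = 0
    # Cycle through filler sentences until the length target is met.
    while word_count < target_words:
        sentence = filler[i % len(filler)]
        padding.append(sentence)
        word_count += len(sentence.split())
        i += 1
    # Pick one insertion slot per relevant sentence; sorting keeps their order.
    positions = sorted(rng.randint(0, len(padding)) for _ in relevant)
    context = list(padding)
    # Insert from the back so earlier slot indices remain valid.
    for pos, sentence in reversed(list(zip(positions, relevant))):
        context.insert(pos, sentence)
    return " ".join(context)


def accuracy_by_length(results):
    """Group (target_words, predicted, gold) triples by length and
    return the fraction of correct True/False predictions per length."""
    totals = {}
    for length, predicted, gold in results:
        correct, n = totals.get(length, (0, 0))
        totals[length] = (correct + (predicted == gold), n + 1)
    return {length: correct / n for length, (correct, n) in totals.items()}
```

Calling `build_padded_context` with the same `relevant` sentences and several `target_words` values yields multiple versions of one question that differ only in length, so any drop reported by `accuracy_by_length` at larger lengths can be attributed to the added padding rather than to a harder question.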
Evaluation Stats
Total Models: 2
Organizations: 1
Verified Results: 0
Self-Reported: 2
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 2 models
Top Score: 97.9%
Average Score: 97.8%
High Performers (80%+): 2
Top Organizations
#1 Microsoft: 2 models, 97.8%
Leaderboard
2 models ranked by performance on FlenQA
Date | License | Score
---|---|---
Apr 30, 2025 | MIT | 97.9%
Apr 30, 2025 | MIT | 97.7%