FlenQA

text

About

FlenQA (Flexible Length Question Answering) is a benchmark designed to isolate the effect of input length on the reasoning performance of language models. It contains 12,000 questions with True/False labels. Each question comes in multiple context versions, created by embedding the relevant information within longer, irrelevant text. Comparing accuracy across these versions shows how well a model maintains its reasoning as input length increases.
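As a rough illustration of the setup described above, the following sketch pads a few key sentences with irrelevant filler until a target length is reached, then reports True/False accuracy per target length. The helper names (`build_padded_context`, `accuracy_by_length`), the whitespace-based token count, and the toy predictor are assumptions made for this example; they are not taken from the FlenQA release.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (context, question, True/False label, target length) -- hypothetical sample layout
Sample = Tuple[str, str, bool, int]


def build_padded_context(
    key_sentences: List[str],
    filler_sentences: List[str],
    target_tokens: int,
    seed: int = 0,
) -> str:
    """Scatter filler among the key sentences until the context has
    at least `target_tokens` whitespace-separated tokens."""
    rng = random.Random(seed)
    context = list(key_sentences)
    n_tokens = sum(len(s.split()) for s in context)
    i = 0
    while n_tokens < target_tokens and filler_sentences:
        filler = filler_sentences[i % len(filler_sentences)]
        # Inserting at a random position keeps the key sentences in their
        # original relative order while spreading them through the text.
        context.insert(rng.randint(0, len(context)), filler)
        n_tokens += len(filler.split())
        i += 1
    return " ".join(context)


def accuracy_by_length(
    samples: List[Sample],
    predict: Callable[[str, str], bool],
) -> Dict[int, float]:
    """Group samples by target length and compute accuracy for each group."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for context, question, label, length in samples:
        total[length] += 1
        if predict(context, question) == label:
            correct[length] += 1
    return {length: correct[length] / total[length] for length in total}


if __name__ == "__main__":
    key = ["Ann is taller than Bob.", "Bob is taller than Cal."]
    filler = ["The weather was mild that week.", "A train left the station at noon."]
    question = "Is Ann taller than Cal?"
    samples = [
        (build_padded_context(key, filler, n), question, True, n)
        for n in (20, 100, 500)
    ]
    # Toy predictor that always answers True; a real run would call a model.
    print(accuracy_by_length(samples, lambda context, q: True))
```

Because the same question is asked at every length, any drop in accuracy from the shortest to the longest version can be attributed to the added text rather than to the difficulty of the question.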

Evaluation Stats

Total Models: 2

Organizations: 1

Verified Results: 0

Self-Reported: 2

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

2 models

Top Score

97.9%

Average Score

97.8%

High Performers (80%+)

Top Organizations

#1 Microsoft

2 models

97.8%

Leaderboard

2 models ranked by performance on FlenQA

Rank	Model	Organization	Date	License	Score
#01	Phi 4 Reasoning Plus	Microsoft	Apr 30, 2025	MIT	97.9%
#02	Phi 4 Reasoning	Microsoft	Apr 30, 2025	MIT	97.7%

Resources

Research Paper