PhiBench


About

PhiBench is Microsoft's internal benchmark for evaluating language models on complex reasoning and language understanding. It serves as a key evaluation metric in the development of the Phi model series, testing logical reasoning, problem-solving, and language comprehension across a range of domains and difficulty levels.

Evaluation Stats

Total Models: 3

Organizations: 1

Verified Results: 0

Self-Reported: 3

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

3 models

Top Score: 74.2%

Average Score: 67.0%

High Performers (80%+): 0

Top Organizations

#1 Microsoft: 3 models, 67.0%

Leaderboard

3 models ranked by performance on PhiBench

Rank	Model	Organization	Release Date	License	Score
#01	Phi 4 Reasoning Plus	Microsoft	Apr 30, 2025	MIT	74.2%
#02	Phi 4 Reasoning	Microsoft	Apr 30, 2025	MIT	70.6%
#03	Phi 4	Microsoft	Dec 12, 2024	MIT	56.2%
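
The summary figures under Performance Overview can be reproduced from the three leaderboard rows. The following is a minimal Python sketch, not part of any PhiBench tooling: the `summarize` helper and its `high_threshold` parameter are hypothetical names, and it assumes that the average is an unweighted mean and that "80%+" means a score of at least 80.

```python
# Scores copied from the leaderboard rows (in percent).
scores = {
    "Phi 4 Reasoning Plus": 74.2,
    "Phi 4 Reasoning": 70.6,
    "Phi 4": 56.2,
}


def summarize(scores, high_threshold=80.0):
    """Return top score, unweighted mean, and count of scores >= threshold.

    Hypothetical helper for illustration; the threshold semantics are an
    assumption based on the "High Performers (80%+)" label.
    """
    values = list(scores.values())
    return {
        "top": max(values),
        "average": round(sum(values) / len(values), 1),
        "high_performers": sum(v >= high_threshold for v in values),
    }


summary = summarize(scores)
print(summary)
```

Under these assumptions the mean of the three scores rounds to 67.0, matching the listed Average Score, and no model reaches the 80% threshold.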

Resources

Research Paper