PhiBench

About

PhiBench is Microsoft's internal reasoning benchmark. It evaluates how well language models handle complex reasoning tasks and language understanding, and it serves as a key evaluation metric in the development of the Phi model series. It tests logical reasoning, problem-solving, and sophisticated language comprehension across various domains and difficulty levels.

Evaluation Stats
Total Models: 3
Organizations: 1
Verified Results: 0
Self-Reported: 3
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

Models: 3
Top Score: 74.2%
Average Score: 67.0%
High Performers (80%+): 0

Top Organizations

#1 Microsoft (3 models, 67.0%)
Leaderboard
3 models ranked by performance on PhiBench
Date          License  Score
Apr 30, 2025  MIT      74.2%
Apr 30, 2025  MIT      70.6%
Dec 12, 2024  MIT      56.2%
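
The summary figures in the Performance Overview can be checked against the three leaderboard scores. The following is a minimal sketch, assuming the average is an unweighted mean of the listed scores and that "High Performers" counts scores at or above 80%. The variable names are illustrative and not part of any PhiBench tooling.

```python
from statistics import mean

# Scores copied from the leaderboard, in percent.
scores = [74.2, 70.6, 56.2]

# Assumption: the page's "Average Score" is a plain unweighted mean.
top_score = max(scores)
average_score = round(mean(scores), 1)

# Assumption: "High Performers (80%+)" means score >= 80.0.
high_performers = sum(1 for s in scores if s >= 80.0)

print(f"Top score: {top_score}%")
print(f"Average score: {average_score}%")
print(f"High performers (80%+): {high_performers}")
```

Under these assumptions the computed values agree with the figures reported on the page: a top score of 74.2%, an average of 67.0%, and no models at or above 80%.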
Resources