PhiBench
text
About
PhiBench is Microsoft's internal benchmark for evaluating language models on complex reasoning and language understanding. It serves as a key evaluation metric in the development of the Phi model series, testing logical reasoning, problem solving, and sophisticated language comprehension across a range of domains and difficulty levels.
Evaluation Stats
Total Models: 3
Organizations: 1
Verified Results: 0
Self-Reported: 3
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 3 models
Top Score: 74.2%
Average Score: 67.0%
High Performers (80%+): 0
Top Organizations
#1 Microsoft: 3 models, 67.0%
Leaderboard
3 models ranked by performance on PhiBench
Date | License | Score
---|---|---
Apr 30, 2025 | MIT | 74.2%
Apr 30, 2025 | MIT | 70.6%
Dec 12, 2024 | MIT | 56.2%
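
The summary figures in the Performance Overview can be cross-checked against the three leaderboard scores. The sketch below is illustrative only: it assumes the average is an unweighted arithmetic mean rounded to one decimal place and that the "80%+" cutoff is inclusive, neither of which the page states. The variable names are hypothetical.

```python
from statistics import mean

# Scores exactly as listed in the leaderboard, in percent.
scores = [74.2, 70.6, 56.2]

# Assumption: "High Performers (80%+)" means a score of 80.0 or higher.
HIGH_PERFORMER_THRESHOLD = 80.0

top_score = max(scores)
# Assumption: the page reports an unweighted mean rounded to one decimal.
average_score = round(mean(scores), 1)
high_performers = sum(1 for s in scores if s >= HIGH_PERFORMER_THRESHOLD)

print(f"Top Score: {top_score}%")
print(f"Average Score: {average_score}%")
print(f"High Performers (80%+): {high_performers}")
```

Under these assumptions the computed values agree with the reported top score of 74.2%, average of 67.0%, and zero high performers.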