Microsoft

Phi 4 Reasoning

Zero-eval
#1 HumanEval+
#2 FlenQA
#2 OmniMath
+1 more

by Microsoft

About

Phi 4 Reasoning is a language model developed by Microsoft and built on the Phi 4 base model. Across 11 benchmarks it averages 75.1% (self-reported scores), with its strongest results on FlenQA (97.7%), HumanEval+ (92.9%), and IFEval (83.4%). It is released under the MIT license, which permits commercial use, including in enterprise applications. It was announced and released on April 30, 2025.

Timeline
Announced: Apr 30, 2025
Released: Apr 30, 2025
Knowledge Cutoff: Mar 1, 2025
Specifications
Training Tokens: 16.0B
License & Family
License: MIT
Base Model: Phi 4
Performance Overview
Performance metrics and category breakdown

Overall Performance

Benchmarks: 11
Average Score: 75.1%
Best Score: 97.7%
High Performers (80%+): 3
All Benchmark Results for Phi 4 Reasoning
Complete list of benchmark scores with detailed information
Benchmark     Modality   Score   Percent   Source
FlenQA        text       0.98    97.7%     Self-reported
HumanEval+    text       0.93    92.9%     Self-reported
IFEval        text       0.83    83.4%     Self-reported
OmniMath      text       0.77    76.6%     Self-reported
AIME 2024     text       0.75    75.3%     Self-reported
MMLU-Pro      text       0.74    74.3%     Self-reported
Arena Hard    text       0.73    73.3%     Self-reported
PhiBench      text       0.71    70.6%     Self-reported
GPQA          text       0.66    65.8%     Self-reported
AIME 2025     text       0.63    62.9%     Self-reported
Showing 1 to 10 of 11 benchmarks
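
The summary figures reported for the model (best score, number of results at 80% or above, and the 75.1% average) can be cross-checked against the ten listed results. The sketch below is only an illustration: the variable names are hypothetical, and it assumes the reported average is a plain unweighted mean over all 11 benchmarks, which the page does not state. Under that assumption, the eleventh, unlisted score can be estimated, but only to within the rounding of the published average.

```python
# Percent scores for the ten benchmarks listed on this page (self-reported).
listed_scores = {
    "FlenQA": 97.7,
    "HumanEval+": 92.9,
    "IFEval": 83.4,
    "OmniMath": 76.6,
    "AIME 2024": 75.3,
    "MMLU-Pro": 74.3,
    "Arena Hard": 73.3,
    "PhiBench": 70.6,
    "GPQA": 65.8,
    "AIME 2025": 62.9,
}

REPORTED_AVERAGE = 75.1   # average across all 11 benchmarks, as reported
TOTAL_BENCHMARKS = 11

values = list(listed_scores.values())

# Best score and count of results at or above the 80% threshold.
best = max(values)
high_performers = sum(1 for v in values if v >= 80.0)

# Mean over the ten listed results only (not the reported 11-benchmark average).
listed_mean = sum(values) / len(values)

# ASSUMPTION: the reported average is an unweighted mean of 11 scores.
# The estimate is approximate because the reported average is rounded.
implied_missing = REPORTED_AVERAGE * TOTAL_BENCHMARKS - sum(values)

print(f"Best score: {best:.1f}%")
print(f"High performers (80%+): {high_performers}")
print(f"Mean of the ten listed scores: {listed_mean:.2f}%")
print(f"Estimated unlisted score: about {implied_missing:.1f}%")
```

The best score and the high-performer count match the reported summary. The ten listed scores average higher than 75.1%, which is consistent with the unlisted eleventh result being the lowest of the set.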