
Phi 4 Reasoning
Zero-eval
#1 HumanEval+
#2 FlenQA
#2 OmniMath
+1 more
by Microsoft
About
Phi 4 Reasoning is a language model developed by Microsoft. It achieves strong performance, with an average score of 75.1% across 11 benchmarks, and excels particularly in FlenQA (97.7%), HumanEval+ (92.9%), and IFEval (83.4%). It is licensed for commercial use, making it suitable for enterprise applications. Released in 2025, it represents Microsoft's latest advancement in AI technology.
Timeline
Announced: Apr 30, 2025
Released: Apr 30, 2025
Knowledge Cutoff: Mar 1, 2025
Specifications
Training Tokens: 16.0B
License & Family
License: MIT
Base Model: Phi 4
Performance Overview
Performance metrics and category breakdown
Overall Performance
11 benchmarks
Average Score: 75.1%
Best Score: 97.7%
High Performers (80%+): 3
All Benchmark Results for Phi 4 Reasoning
Complete list of benchmark scores with detailed information
Benchmark | Modality | Raw Score | Percentage | Source
FlenQA | text | 0.98 | 97.7% | Self-reported
HumanEval+ | text | 0.93 | 92.9% | Self-reported
IFEval | text | 0.83 | 83.4% | Self-reported
OmniMath | text | 0.77 | 76.6% | Self-reported
AIME 2024 | text | 0.75 | 75.3% | Self-reported
MMLU-Pro | text | 0.74 | 74.3% | Self-reported
Arena Hard | text | 0.73 | 73.3% | Self-reported
PhiBench | text | 0.71 | 70.6% | Self-reported
GPQA | text | 0.66 | 65.8% | Self-reported
AIME 2025 | text | 0.63 | 62.9% | Self-reported
Showing 1 to 10 of 11 benchmarks
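Since only 10 of the 11 benchmark scores are listed above, the stated 75.1% overall average cannot be reproduced from the table alone. As a sanity check, a minimal sketch (assuming the 75.1% figure is an unweighted mean over all 11 benchmarks) averages the 10 listed scores and backs out what the unlisted 11th score would have to be:

```python
from statistics import mean

# The 10 benchmark percentages listed in the table above.
listed_scores = [97.7, 92.9, 83.4, 76.6, 75.3, 74.3, 73.3, 70.6, 65.8, 62.9]

# Mean over the listed benchmarks only.
listed_avg = mean(listed_scores)  # ~77.3%

# If the overall average of 75.1% is an unweighted mean over 11 benchmarks,
# the unlisted 11th score is the total implied by that average minus the
# sum of the listed scores. (Assumption: unweighted mean, no rounding quirks.)
implied_11th = 75.1 * 11 - sum(listed_scores)  # ~53.3%

print(f"Average of listed 10: {listed_avg:.1f}%")
print(f"Implied 11th score:   {implied_11th:.1f}%")
```

Under that unweighted-mean assumption, the hidden 11th benchmark would score around 53%, which is consistent with the listed average (77.3%) sitting above the stated overall average (75.1%).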