Phi-4-multimodal-instruct
Multimodal
Zero-eval
#1 ScienceQA Visual
#1 BLINK
#1 InterGPS
+2 more
by Microsoft
About
Phi-4-multimodal-instruct extends Phi-4 to multiple input modalities, accepting text, images, and audio. Built to carry Phi-4's efficiency into multimodal applications, it demonstrates that compact models can successfully integrate diverse information types.
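The model is available on Hugging Face, so a typical way to try it is through transformers. The sketch below is a minimal example, assuming the model id microsoft/Phi-4-multimodal-instruct and the <|user|>/<|image_1|>/<|assistant|> chat markup from the model's Hugging Face card; verify both against the official documentation before depending on them.

```python
# Minimal sketch: image + text query against Phi-4-multimodal-instruct.
# Model id and prompt markup are taken from the Hugging Face release and
# should be verified there; this is not an official usage recipe.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

image = Image.open("chart.png")  # any local test image
prompt = "<|user|><|image_1|>Summarize this chart.<|end|><|assistant|>"

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, dropping the echoed prompt.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```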
Pricing Range
Input (per 1M): $0.05
Output (per 1M): $0.10
Providers: 1
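At these rates, per-request cost is simple arithmetic. A small worked example, with hypothetical token counts chosen only for illustration:

```python
# Worked example of the listed rates: $0.05 per 1M input tokens and
# $0.10 per 1M output tokens. The request sizes below are hypothetical.
INPUT_USD_PER_M = 0.05
OUTPUT_USD_PER_M = 0.10

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1e6

# 2,000 prompt tokens plus 500 generated tokens:
print(f"${request_cost(2_000, 500):.6f}")  # -> $0.000150
```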
Timeline
Announced: Feb 1, 2025
Released: Feb 1, 2025
Knowledge Cutoff: Jun 1, 2024
Specifications
Training Tokens: 5.0T
Capabilities: Multimodal
License & Family
License: MIT
Performance Overview
Performance metrics and category breakdown

Overall Performance (15 benchmarks)
Average Score: 72.0%
Best Score: 97.5%
High Performers (80%+): 7

Performance Metrics
Max Context Window: 256.0K
Avg Throughput: 25.0 tok/s
Avg Latency: 1ms
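Taken at face value, the throughput and latency figures give a rough response-time estimate. The additive model below is a simplification (real serving time varies with provider, load, and prompt length), so treat it as a back-of-envelope sketch only:

```python
# Back-of-envelope response time from the listed averages: 1 ms latency
# and 25 tok/s generation throughput. The additive model is a simplification.
AVG_LATENCY_S = 0.001    # 1 ms
THROUGHPUT_TOK_S = 25.0  # tokens per second

def estimated_seconds(output_tokens: int) -> float:
    return AVG_LATENCY_S + output_tokens / THROUGHPUT_TOK_S

print(f"{estimated_seconds(500):.1f}s for 500 tokens")  # -> 20.0s for 500 tokens
```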
All Benchmark Results for Phi-4-multimodal-instruct
Complete list of benchmark scores with detailed information

| Benchmark | Category | Score | Percentage | Source |
|---|---|---|---|---|
| ScienceQA Visual | multimodal | 0.97 | 97.5% | Self-reported |
| DocVQA | multimodal | 0.93 | 93.2% | Self-reported |
| MMBench | multimodal | 0.87 | 86.7% | Self-reported |
| POPE | multimodal | 0.86 | 85.6% | Self-reported |
| OCRBench | multimodal | 0.84 | 84.4% | Self-reported |
| AI2D | multimodal | 0.82 | 82.3% | Self-reported |
| ChartQA | multimodal | 0.81 | 81.4% | Self-reported |
| TextVQA | multimodal | 0.76 | 75.6% | Self-reported |
| InfoVQA | multimodal | 0.73 | 72.7% | Self-reported |
| MathVista | multimodal | 0.62 | 62.4% | Self-reported |

Showing 10 of 15 benchmarks
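Since only 10 of the 15 benchmarks are listed, the 72.0% overall average cannot be recomputed from this table alone. A quick partial check over the visible rows, assuming the percentages above are exact:

```python
# Mean of the 10 benchmark percentages shown in the table above. The page's
# 72.0% overall average spans all 15 benchmarks, so this partial mean differs.
shown = [97.5, 93.2, 86.7, 85.6, 84.4, 82.3, 81.4, 75.6, 72.7, 62.4]
print(f"{sum(shown) / len(shown):.1f}%")  # -> 82.2%
```

If the 72.0% figure is exact, the five unlisted benchmarks must average roughly (15 × 72.0 - 821.8) / 5 ≈ 51.6%.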