
Phi-3.5-vision-instruct
Multimodal · Zero-eval
#1 ScienceQA · #1 POPE · #2 InterGPS
by Microsoft
About
Phi-3.5-vision-instruct is a multimodal language model developed by Microsoft. It achieves strong performance, with an average score of 68.3% across 9 benchmarks, and it does particularly well on ScienceQA (91.3%), POPE (86.1%), and MMBench (81.9%). As a multimodal model, it can process and understand text and image inputs together. Its MIT license permits commercial use, making it suitable for enterprise applications. Released in 2024, it represents Microsoft's latest advancement in AI technology.
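
For reference, a minimal inference sketch is shown below. It assumes the model is pulled from Hugging Face under the repo id "microsoft/Phi-3.5-vision-instruct" and follows the prompt conventions documented there; the image URL is a placeholder, and this is not the official example.

```python
# Minimal sketch: load Phi-3.5-vision-instruct via transformers and ask a
# question about one image. Assumes the Hugging Face repo id
# "microsoft/Phi-3.5-vision-instruct"; the image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# trust_remote_code is required because the repo ships custom model/processor code.
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# One image plus a text question; "<|image_1|>" marks where the image embeds.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
messages = [{"role": "user", "content": "<|image_1|>\nWhat does this chart show?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens and decode only the generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```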
Timeline
Announced: Aug 23, 2024
Released: Aug 23, 2024
Specifications
Training Tokens: 500.0B
Capabilities
Multimodal
License & Family
License: MIT
Performance Overview
Performance metrics and category breakdown
Overall Performance: 9 benchmarks
Average Score: 68.3%
Best Score: 91.3% (ScienceQA)
High Performers (80%+): 4
All Benchmark Results for Phi-3.5-vision-instruct
Complete list of benchmark scores with detailed information
Benchmark | Category | Score (0-1) | Score (%) | Source
ScienceQA | multimodal | 0.91 | 91.3% | Self-reported
POPE | multimodal | 0.86 | 86.1% | Self-reported
MMBench | multimodal | 0.82 | 81.9% | Self-reported
ChartQA | multimodal | 0.82 | 81.8% | Self-reported
AI2D | multimodal | 0.78 | 78.1% | Self-reported
TextVQA | multimodal | 0.72 | 72.0% | Self-reported
MathVista | multimodal | 0.44 | 43.9% | Self-reported
MMMU | multimodal | 0.43 | 43.0% | Self-reported
InterGPS | text | 0.36 | 36.3% | Self-reported
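
The headline numbers in the Performance Overview follow directly from this table. A short sketch reproducing them, with the scores copied from the rows above:

```python
# Reproduce the overview stats (average, best, 80%+ count) from the table.
scores = {
    "ScienceQA": 91.3, "POPE": 86.1, "MMBench": 81.9,
    "ChartQA": 81.8, "AI2D": 78.1, "TextVQA": 72.0,
    "MathVista": 43.9, "MMMU": 43.0, "InterGPS": 36.3,
}

average = sum(scores.values()) / len(scores)
best_name, best = max(scores.items(), key=lambda kv: kv[1])
high_performers = [name for name, s in scores.items() if s >= 80.0]

print(f"Average score: {average:.1f}%")                    # 68.3%
print(f"Best score: {best:.1f}% ({best_name})")            # 91.3% (ScienceQA)
print(f"High performers (80%+): {len(high_performers)}")   # 4
```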