Phi-3.5-vision-instruct

Multimodal
Zero-eval
#1 ScienceQA
#1 POPE
#2 InterGPS

by Microsoft

About

Phi-3.5 Vision was developed as a multimodal variant of Phi-3.5, designed to understand and reason about both images and text. Built to extend the Phi family's efficiency into vision-language tasks, it enables compact multimodal AI for practical applications.

Timeline
Announced: Aug 23, 2024
Released: Aug 23, 2024
Specifications
Training Tokens: 500.0B
Capabilities
Multimodal
License & Family
License: MIT
Performance Overview
Performance metrics and category breakdown

Overall Performance (9 benchmarks)

Average Score: 68.3%
Best Score: 91.3%
High Performers (80%+): 4
All Benchmark Results for Phi-3.5-vision-instruct
Complete list of benchmark scores with detailed information
Benchmark    Category     Score    Source
ScienceQA    multimodal   91.3%    Self-reported
POPE         multimodal   86.1%    Self-reported
MMBench      multimodal   81.9%    Self-reported
ChartQA      multimodal   81.8%    Self-reported
AI2D         multimodal   78.1%    Self-reported
TextVQA      multimodal   72.0%    Self-reported
MathVista    multimodal   43.9%    Self-reported
MMMU         multimodal   43.0%    Self-reported
InterGPS     text         36.3%    Self-reported
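The summary figures in the Performance Overview follow directly from the per-benchmark scores listed above. A minimal sketch that recomputes them, assuming an unweighted mean across all nine benchmarks and an 80% cutoff for "high performers":

```python
# Self-reported benchmark scores for Phi-3.5-vision-instruct,
# copied from the table above (percentages).
scores = {
    "ScienceQA": 91.3,
    "POPE": 86.1,
    "MMBench": 81.9,
    "ChartQA": 81.8,
    "AI2D": 78.1,
    "TextVQA": 72.0,
    "MathVista": 43.9,
    "MMMU": 43.0,
    "InterGPS": 36.3,
}

# Unweighted mean over all benchmarks, rounded to one decimal place.
average = round(sum(scores.values()) / len(scores), 1)

# Single best benchmark score.
best = max(scores.values())

# Number of benchmarks at or above the 80% threshold.
high_performers = sum(1 for s in scores.values() if s >= 80.0)

print(f"Average Score: {average}%")            # 68.3%
print(f"Best Score: {best}%")                  # 91.3%
print(f"High Performers (80%+): {high_performers}")  # 4
```

The recomputed values match the overview exactly, which confirms the average is a simple unweighted mean rather than a category-weighted one.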