TextVQA

multimodal
About

TextVQA is a visual question answering benchmark of 28,408 images and 45,336 questions that require models to read and reason about text in images. It tests whether a model can combine optical character recognition with visual reasoning to answer questions about text-bearing images drawn from diverse real-world scenes.

Evaluation Stats
Total Models: 15
Organizations: 7
Verified Results: 0
Self-Reported: 15
Benchmark Details
Max Score: 1
Language: en
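
This page does not say how the per-question score behind the maximum of 1 is computed. VQA-style benchmarks commonly use a soft accuracy in which a prediction earns full credit when at least three human annotators gave the same answer. The sketch below assumes that convention; the function name `vqa_soft_accuracy` and the simple lowercase/strip normalization are illustrative choices, not something taken from this page.

```python
def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Soft VQA accuracy for one question, in the range [0, 1].

    Assumption: credit is min(matching annotators / 3, 1). Official
    evaluators usually apply heavier answer normalization than this.
    """
    pred = prediction.strip().lower()
    # Count annotators whose normalized answer equals the prediction.
    matches = sum(1 for answer in human_answers if answer.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


# Hypothetical question with ten annotator answers.
answers = ["stop"] * 2 + ["halt"] * 8
partial_credit = vqa_soft_accuracy("stop", answers)  # two matches: partial credit
full_credit = vqa_soft_accuracy("halt", answers)     # eight matches: capped at 1
```

Averaging this per-question value over all questions gives a benchmark score between 0 and 1, which a leaderboard can then display as a percentage.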
Performance Overview
Score distribution and top performers

Score Distribution

Models: 15
Top Score: 85.5%
Average Score: 77.0%
High Performers (80%+): 8

Top Organizations

#1 Alibaba Cloud / Qwen Team: 3 models, 84.9%
#2 DeepSeek: 3 models, 82.8%
#3 Amazon: 2 models, 80.8%
#4 xAI: 1 model, 78.1%
#5 Microsoft: 2 models, 73.8%
Leaderboard
15 models ranked by performance on TextVQA
Rank  Released        License          Score
1     Aug 29, 2024    tongyi-qianwen   85.5%
2     Jan 26, 2025    Apache 2.0       84.9%
3     Mar 27, 2025    Apache 2.0       84.4%
4     Dec 13, 2024    deepseek         84.2%
5     Dec 13, 2024    deepseek         83.4%
6     Nov 20, 2024    Proprietary      81.5%
7     Dec 13, 2024    deepseek         80.7%
8     Nov 20, 2024    Proprietary      80.2%
9     Apr 12, 2024    Proprietary      78.1%
10    Feb 1, 2025     MIT              75.6%
Showing 1 to 10 of 15 models
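
The summary figures in the Performance Overview can be partly cross-checked against the ten scores listed here. The sketch below recomputes the top score and the count of models at 80% or above from the visible rows. It also estimates the mean of the five models not shown, assuming the reported 77.0% average is a plain mean over all 15 models. The variable names are illustrative.

```python
# Scores of the ten models visible on this page, in ranked order.
visible_scores = [85.5, 84.9, 84.4, 84.2, 83.4, 81.5, 80.7, 80.2, 78.1, 75.6]

REPORTED_AVERAGE = 77.0  # "Average Score" from the overview
TOTAL_MODELS = 15        # "Total Models" from the overview

top_score = max(visible_scores)
high_performers = sum(1 for score in visible_scores if score >= 80.0)
visible_mean = sum(visible_scores) / len(visible_scores)

# Assumption: the reported average is an unweighted mean over all 15 models,
# so the hidden models account for the rest of the total.
hidden_count = TOTAL_MODELS - len(visible_scores)
hidden_mean = (REPORTED_AVERAGE * TOTAL_MODELS - sum(visible_scores)) / hidden_count

print(f"top score: {top_score:.1f}%")
print(f"high performers (80%+): {high_performers}")
print(f"mean of visible rows: {visible_mean:.2f}%")
print(f"implied mean of hidden rows: {hidden_mean:.1f}%")
```

The top score (85.5%) and the count of high performers (8) match the overview. Because the list is ranked, the five hidden models score no higher than 75.6%, so none of them can add to that count. The implied hidden mean of about 67.3% is only approximate, since the reported average is rounded to one decimal place.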
Resources