TextVQA
multimodal
About
TextVQA is a visual question answering benchmark of 45,336 questions over 28,408 images that requires models to read and reason about text present in images. It tests a model's ability to combine optical character recognition with visual reasoning to answer questions about text-containing images drawn from diverse real-world scenes.
Evaluation Stats
Total Models: 15
Organizations: 7
Verified Results: 0
Self-Reported: 15
Benchmark Details
Max Score: 1
Language: en
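This page does not say how individual answers are scored. TextVQA results are commonly reported with the standard VQA accuracy metric, in which a predicted answer earns min(matches / 3, 1) against the human-provided answers for a question, which is consistent with a maximum score of 1. The sketch below assumes that metric and a very simple normalization (lowercasing and trimming whitespace); the function name `vqa_accuracy` and the sample answers are hypothetical, and official evaluation scripts apply more normalization than shown here.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Score one prediction against the human answers for one question.

    Assumption: standard VQA accuracy, min(number of matching answers / 3, 1).
    Normalization here is only lowercase + strip, which is simpler than
    what official evaluation code does.
    """
    normalized = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    # Hypothetical annotations: 2 annotators read "stop", 8 read "yield".
    answers = ["stop"] * 2 + ["yield"] * 8
    print(vqa_accuracy("Stop", answers))   # partial credit (2 of 3 needed matches)
    print(vqa_accuracy("yield", answers))  # full credit, capped at 1.0
    print(vqa_accuracy("go", answers))     # no credit
```

A benchmark-level score under this assumption would be the mean of these per-question values, reported as a percentage on the leaderboard.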
Performance Overview
Score distribution and top performers
Score Distribution: 15 models
Top Score: 85.5%
Average Score: 77.0%
High Performers (80%+): 8
Top Organizations
#1 Alibaba Cloud / Qwen Team: 3 models, 84.9%
#2 DeepSeek: 3 models, 82.8%
#3 Amazon: 2 models, 80.8%
#4 xAI: 1 model, 78.1%
#5 Microsoft: 2 models, 73.8%
Leaderboard
15 models ranked by performance on TextVQA
Date | License | Score
---|---|---
Aug 29, 2024 | tongyi-qianwen | 85.5%
Jan 26, 2025 | Apache 2.0 | 84.9%
Mar 27, 2025 | Apache 2.0 | 84.4%
Dec 13, 2024 | deepseek | 84.2%
Dec 13, 2024 | deepseek | 83.4%
Nov 20, 2024 | Proprietary | 81.5%
Dec 13, 2024 | deepseek | 80.7%
Nov 20, 2024 | Proprietary | 80.2%
Apr 12, 2024 | Proprietary | 78.1%
Feb 1, 2025 | MIT | 75.6%
Showing 1 to 10 of 15 models
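
The summary figures in the Performance Overview can be partly cross-checked against the ten rows visible in the leaderboard. The sketch below is a minimal check, not the site's own computation: it recomputes the top score and the count of models at or above 80%, averages the three rows carrying the `deepseek` license (assuming, without confirmation from the page, that these are the three DeepSeek models), and estimates what the five rows not shown would have to average for the overall mean to be 77.0%. The names `VISIBLE_ROWS`, `summarize`, `mean_for_license`, and `implied_hidden_mean` are hypothetical.

```python
from statistics import mean

# (date, license, score in percent) for the ten rows visible on this page.
# Model names are not present in the table, so rows are identified by position.
VISIBLE_ROWS = [
    ("Aug 29, 2024", "tongyi-qianwen", 85.5),
    ("Jan 26, 2025", "Apache 2.0", 84.9),
    ("Mar 27, 2025", "Apache 2.0", 84.4),
    ("Dec 13, 2024", "deepseek", 84.2),
    ("Dec 13, 2024", "deepseek", 83.4),
    ("Nov 20, 2024", "Proprietary", 81.5),
    ("Dec 13, 2024", "deepseek", 80.7),
    ("Nov 20, 2024", "Proprietary", 80.2),
    ("Apr 12, 2024", "Proprietary", 78.1),
    ("Feb 1, 2025", "MIT", 75.6),
]


def summarize(rows, threshold=80.0):
    """Return the top score and the number of rows scoring >= threshold."""
    scores = [score for _, _, score in rows]
    return {
        "top_score": max(scores),
        "high_performers": sum(1 for score in scores if score >= threshold),
    }


def mean_for_license(rows, license_name):
    """Average score of the rows whose license label equals license_name."""
    return mean(score for _, lic, score in rows if lic == license_name)


def implied_hidden_mean(rows, total_models, overall_mean):
    """Mean the unseen rows would need so that all rows average overall_mean."""
    visible_sum = sum(score for _, _, score in rows)
    hidden_count = total_models - len(rows)
    return (total_models * overall_mean - visible_sum) / hidden_count


if __name__ == "__main__":
    print(summarize(VISIBLE_ROWS))
    print(round(mean_for_license(VISIBLE_ROWS, "deepseek"), 1))
    print(implied_hidden_mean(VISIBLE_ROWS, total_models=15, overall_mean=77.0))
```

On the visible rows this gives a top score of 85.5 and eight models at 80% or above, matching the overview, and the `deepseek` rows average to about 82.8, which suggests the per-organization figures are means over each organization's models. Because the reported 77.0% is itself rounded, the implied average for the five hidden rows (roughly 67%) is only approximate.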