TextVQA

multimodal

About

TextVQA is a visual question answering benchmark of 45,336 questions over 28,408 images that requires models to read and reason about text present in images. It tests whether a model can combine optical character recognition with visual reasoning to answer questions about text-containing images across diverse real-world scenes.
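This page does not say how an individual answer is scored. As a rough sketch, assuming the soft accuracy commonly used for VQA-style benchmarks (an answer earns full credit once at least three human annotators gave it), a per-question score could look like the following. The function name, the sample answers, and the simple lowercase/strip normalization are illustrative assumptions, not the official evaluator.

```python
from collections import Counter


def soft_vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA-style soft accuracy: min(matching annotators / 3, 1).

    Assumption: answers are compared after lowercasing and trimming only;
    a real evaluator typically applies more normalization.
    """
    def normalize(text: str) -> str:
        return text.strip().lower()

    counts = Counter(normalize(answer) for answer in human_answers)
    matches = counts[normalize(prediction)]
    return min(matches / 3.0, 1.0)


# Hypothetical annotations for one question (ten human answers).
answers = ["stop"] * 2 + ["stop sign"] * 8

print(soft_vqa_accuracy("stop sign", answers))  # matched by 8 annotators
print(soft_vqa_accuracy("STOP", answers))       # matched by 2 annotators
print(soft_vqa_accuracy("yield", answers))      # matched by none
```

Averaging this per-question value over all questions gives a number between 0 and 1, which is consistent with the maximum score of 1 listed under Benchmark Details and the percentages shown in the leaderboard.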

Evaluation Stats

Total Models: 15

Organizations: 7

Verified Results: 0

Self-Reported: 15

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution (15 models)

Top Score: 85.5%

Average Score: 77.0%

High Performers (80%+)

Top Organizations

Rank	Organization	Models	Score
#1	Alibaba Cloud / Qwen Team	3	84.9%
#2	DeepSeek	3	82.8%
#3	Amazon	2	80.8%
#4	xAI	1	78.1%
#5	Microsoft	2	73.8%
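The page does not state how an organization's score is derived. The figures above are consistent with a plain mean of that organization's model scores, and the sketch below checks this assumption against the ten leaderboard rows shown on this page. The names `visible_rows` and `mean_score_by_org` are illustrative.

```python
from collections import defaultdict

# (organization, score in percent) for the ten leaderboard rows shown on this page.
visible_rows = [
    ("Alibaba Cloud / Qwen Team", 85.5),
    ("Alibaba Cloud / Qwen Team", 84.9),
    ("Alibaba Cloud / Qwen Team", 84.4),
    ("DeepSeek", 84.2),
    ("DeepSeek", 83.4),
    ("Amazon", 81.5),
    ("DeepSeek", 80.7),
    ("Amazon", 80.2),
    ("xAI", 78.1),
    ("Microsoft", 75.6),
]


def mean_score_by_org(rows):
    """Group scores by organization and return each group's unrounded mean."""
    grouped = defaultdict(list)
    for org, score in rows:
        grouped[org].append(score)
    return {org: sum(scores) / len(scores) for org, scores in grouped.items()}


org_means = mean_score_by_org(visible_rows)
for org, mean in sorted(org_means.items(), key=lambda item: item[1], reverse=True):
    print(f"{org}: {mean:.2f}%")
```

Under this assumption the means for Alibaba Cloud / Qwen Team and DeepSeek round to the listed 84.9% and 82.8%. Amazon's mean is 80.85%, which sits on a rounding boundary next to the listed 80.8%. Microsoft is listed with two models, but only one of them appears in the visible rows, so its 73.8% cannot be checked from this page alone.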

Leaderboard

15 models ranked by performance on TextVQA

Rank	Model	Organization	Released	License	Score
#01	Qwen2-VL-72B-Instruct	Alibaba Cloud / Qwen Team	Aug 29, 2024	tongyi-qianwen	85.5%
#02	Qwen2.5 VL 7B Instruct	Alibaba Cloud / Qwen Team	Jan 26, 2025	Apache 2.0	84.9%
#03	Qwen2.5-Omni-7B	Alibaba Cloud / Qwen Team	Mar 27, 2025	Apache 2.0	84.4%
#04	DeepSeek VL2	DeepSeek	Dec 13, 2024	deepseek	84.2%
#05	DeepSeek VL2 Small	DeepSeek	Dec 13, 2024	deepseek	83.4%
#06	Nova Pro	Amazon	Nov 20, 2024	Proprietary	81.5%
#07	DeepSeek VL2 Tiny	DeepSeek	Dec 13, 2024	deepseek	80.7%
#08	Nova Lite	Amazon	Nov 20, 2024	Proprietary	80.2%
#09	Grok-1.5V	xAI	Apr 12, 2024	Proprietary	78.1%
#10	Phi-4-multimodal-instruct	Microsoft	Feb 1, 2025	MIT	75.6%

Showing 1 to 10 of 15 models
