MMBench

Multilingual

multimodal

About

MMBench is a systematically designed bilingual benchmark for robust evaluation of large vision-language models. It features meticulously curated multiple-choice questions in both English and Chinese, employing a rigorous CircularEval strategy to ensure accurate assessment. The benchmark evaluates multimodal perception and reasoning abilities across diverse visual understanding tasks, providing comprehensive evaluation of VLMs' capabilities with well-designed quality control schemes.

Evaluation Stats

Total Models7

Organizations3

Verified Results0

Self-Reported7

Benchmark Details

Max Score1

Language

Performance Overview

Score distribution and top performers

Score Distribution

7 models

Top Score

88.0%

Average Score

81.4%

High Performers (80%+)

Top Organizations

#1Alibaba Cloud / Qwen Team

2 models

86.1%

#2Microsoft

2 models

84.3%

#3DeepSeek

3 models

76.4%

Leaderboard

7 models ranked by performance on MMBench

			License
#01Qwen2.5 VL 72B Instruct	Alibaba Cloud / Qwen Team	Jan 26, 2025	tongyi-qianwen	88.0%
#02Phi-4-multimodal-instruct	Microsoft	Feb 1, 2025	MIT	86.7%
#03Qwen2.5 VL 7B Instruct	Alibaba Cloud / Qwen Team	Jan 26, 2025	Apache 2.0	84.3%
#04Phi-3.5-vision-instruct	Microsoft	Aug 23, 2024	MIT	81.9%
#05DeepSeek VL2 Small	DeepSeek	Dec 13, 2024	deepseek	80.3%
#06DeepSeek VL2	DeepSeek	Dec 13, 2024	deepseek	79.6%
#07DeepSeek VL2 Tiny	DeepSeek	Dec 13, 2024	deepseek	69.2%

Resources

Research Paper