MLVU

multimodal

About

MLVU (Multi-task Long Video Understanding Benchmark) is a comprehensive benchmark designed to evaluate Multimodal Large Language Models on long video understanding tasks. It features videos of varying lengths across diverse genres including movies, surveillance footage, egocentric videos, cartoons, and game videos. The benchmark assesses key capabilities like temporal reasoning, event understanding, and context modeling across extended video sequences, revealing significant performance challenges for current models.

Evaluation Stats

Total Models: 1

Organizations: 1

Verified Results: 0

Self-Reported: 1

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

1 model

Top Score

70.2%

Average Score

70.2%

High Performers (80%+)

Top Organizations

#1 Alibaba Cloud / Qwen Team

1 model

70.2%

Leaderboard

1 model ranked by performance on MLVU

Rank	Model	Organization	Date	License	Score
#1	Qwen2.5 VL 7B Instruct	Alibaba Cloud / Qwen Team	Jan 26, 2025	Apache 2.0	70.2%
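
The summary figures on this page (total models, organizations, top score, average score, high performers) can be recomputed from the leaderboard rows. The following is a minimal sketch, not the site's actual implementation: the `Entry` class, the `summarize` function, and the field names are hypothetical, and it assumes scores are stored as fractions of the max score of 1 and displayed as percentages.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Entry:
    """One leaderboard row (hypothetical structure for illustration)."""
    model: str
    organization: str
    score: float  # assumed to be a fraction of the max score (0.0 to 1.0)


def summarize(entries, high_threshold=0.80):
    """Derive the overview statistics from a non-empty list of entries."""
    scores = [e.score for e in entries]
    return {
        "total_models": len(entries),
        "organizations": len({e.organization for e in entries}),
        "top_score": max(scores),
        "average_score": mean(scores),
        # Count of models at or above the "High Performers (80%+)" cutoff.
        "high_performers": sum(s >= high_threshold for s in scores),
    }


# The single row currently listed on the leaderboard.
entries = [
    Entry("Qwen2.5 VL 7B Instruct", "Alibaba Cloud / Qwen Team", 0.702),
]
summary = summarize(entries)

# Format the fractional scores as percentages, as the page displays them.
top_score_text = f"{summary['top_score']:.1%}"
average_score_text = f"{summary['average_score']:.1%}"
```

With a single entry, the top score and the average score coincide, which is why both appear as the same value in the overview.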

Resources

Research Paper