MLVU

multimodal
About

MLVU (Multi-task Long Video Understanding Benchmark) is a comprehensive benchmark designed to evaluate Multimodal Large Language Models on long video understanding tasks. It features videos of varying lengths across diverse genres including movies, surveillance footage, egocentric videos, cartoons, and game videos. The benchmark assesses key capabilities like temporal reasoning, event understanding, and context modeling across extended video sequences, revealing significant performance challenges for current models.

Evaluation Stats
Total Models: 1
Organizations: 1
Verified Results: 0
Self-Reported: 1
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

1 model
Top Score: 70.2%
Average Score: 70.2%
High Performers (80%+): 0

Top Organizations

#1 Alibaba Cloud / Qwen Team
1 model
70.2%
Leaderboard
1 model ranked by performance on MLVU
Jan 26, 2025
License: Apache 2.0
Score: 70.2%
Resources