MM-MT-Bench
multimodal
About
A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot manner.
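As a rough illustration of the LLM-as-a-judge protocol described above, the sketch below shows how per-turn judge ratings might be parsed and aggregated into a percentage score. This is a hypothetical minimal sketch, not the official MM-MT-Bench harness: the prompt wording, the 1-10 rating scale, and the function names are all assumptions.

```python
import re

# Hypothetical judge prompt template (an assumption, not the official one).
JUDGE_PROMPT = (
    "Rate the assistant's reply to the final user turn on a scale of 1-10.\n"
    "Conversation:\n{conversation}\n\nAnswer with 'Rating: <n>'."
)

def parse_rating(judge_output: str) -> int:
    """Extract the numeric rating from the judge model's free-form output."""
    match = re.search(r"Rating:\s*(\d+)", judge_output)
    if not match:
        raise ValueError(f"no rating found in: {judge_output!r}")
    return int(match.group(1))

def score_dialogue(turn_ratings: list[int]) -> float:
    """Average per-turn ratings and rescale a 1-10 scale to 0-100."""
    avg = sum(turn_ratings) / len(turn_ratings)
    return (avg - 1) / 9 * 100

# Example: two judged turns of one multi-turn dialogue.
ratings = [parse_rating("Rating: 8"),
           parse_rating("The reply is helpful. Rating: 7")]
print(round(score_dialogue(ratings), 1))
```

In a real harness, each dialogue turn would be sent to a judge model with the conversation so far, and dialogue-level scores would be averaged across the benchmark to produce the percentages shown below.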
Evaluation Stats
Total Models: 3
Organizations: 2
Verified Results: 0
Self-Reported: 3
Benchmark Details
Max Score: 100
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 3 models
Top Score: 74.0%
Average Score: 46.8%
High Performers (80%+): 0

Top Organizations
#1 Mistral AI (2 models, average 67.3%)
#2 Alibaba Cloud / Qwen Team (1 model, average 6.0%)
Leaderboard
3 models ranked by performance on MM-MT-Bench
Release Date | License | Score
---|---|---
Nov 18, 2024 | Mistral Research License (MRL) for research; Mistral Commercial License for commercial use | 74.0%
Sep 17, 2024 | Apache 2.0 | 60.5%
Mar 27, 2025 | Apache 2.0 | 6.0%