MT-Bench
About
MT-Bench is a challenging multi-turn conversation benchmark designed to evaluate the conversational and instruction-following abilities of large language models. Its complex dialogue scenarios test whether a model can maintain context across turns, respond coherently throughout a conversation, and follow instructions in realistic interaction settings.
Evaluation Stats
Total Models: 11
Organizations: 4
Verified Results: 0
Self-Reported: 11
Benchmark Details
Max Score: 100
Language: English (en)
Performance Overview
Score distribution and top performers
Score Distribution: 11 models
Top Score: 93.5%
Average Score: 78.8%
High Performers (80%+): 9

Top Organizations
#1 DeepSeek (1 model): 90.2%
#2 Alibaba Cloud / Qwen Team (3 models): 88.4%
#3 Mistral AI (4 models): 82.4%
#4 NVIDIA (3 models): 60.6%
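The overview statistics above can be recomputed directly from the leaderboard rows. A minimal sketch, using the ten scores visible in the leaderboard (the eleventh model is hidden by pagination, so the mean computed here covers only the 10 shown models and differs from the site's 78.8% average over all 11):

```python
# Scores taken from the ten leaderboard rows shown on this page.
# Note: these are 10 of the 11 models; one is paginated away.
scores = [93.5, 91.7, 90.2, 87.5, 86.3, 84.1, 83.5, 83.0, 81.0, 76.8]

top_score = max(scores)
average = sum(scores) / len(scores)
high_performers = sum(1 for s in scores if s >= 80.0)  # matches the "80%+" bucket

print(f"Top score: {top_score:.1f}%")
print(f"Average over shown models: {average:.2f}%")
print(f"High performers (80%+): {high_performers}")
```

The top score (93.5%) and high-performer count (9) match the overview figures; the average does not, which is expected given the missing eleventh row.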
Leaderboard
11 models ranked by performance on MT-Bench
Release Date | License | Score
---|---|---
Sep 19, 2024 | Qwen | 93.5%
Mar 18, 2025 | Llama 3.1 Community License | 91.7%
May 8, 2024 | deepseek | 90.2%
Sep 19, 2024 | Apache 2.0 | 87.5%
Jul 24, 2024 | Mistral Research License | 86.3%
Jul 23, 2024 | Apache 2.0 | 84.1%
Jan 30, 2025 | Apache 2.0 | 83.5%
Oct 16, 2024 | Mistral Research License | 83.0%
Mar 18, 2025 | Llama 3.1 Community License | 81.0%
Sep 17, 2024 | Apache 2.0 | 76.8%
Showing 1 to 10 of 11 models