MT-Bench

About

MT-Bench is a challenging multi-turn conversation benchmark designed to evaluate the conversational and instruction-following abilities of large language models. It consists of two-turn questions in which the second turn builds on the first, testing whether a model can maintain context, follow evolving instructions, and stay coherent across turns; answers are graded by a strong LLM judge.
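As a rough illustration of how such an evaluation runs, here is a minimal Python sketch of a two-turn question graded by an LLM judge. This is not the official harness: `model` and `judge` are placeholder callables you would back with real API calls, and the question and judge prompt are simplified stand-ins in the style of MT-Bench.

```python
import re
from typing import Callable, Dict, List

# Two-turn question in the style of MT-Bench (the real benchmark ships 80
# such questions across eight categories; this one is illustrative).
QUESTION: List[str] = [
    "Compose an engaging travel blog post about a recent trip to Hawaii.",
    "Rewrite your previous response. Start every sentence with the letter A.",
]

# Simplified stand-in for a single-answer grading prompt.
JUDGE_PROMPT = (
    "Please act as an impartial judge and rate the assistant's answer to the "
    "user's question on a scale of 1 to 10. Reply with 'Rating: [[N]]'.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def evaluate(
    model: Callable[[List[Dict[str, str]]], str],  # chat history -> reply
    judge: Callable[[str], str],                   # judge prompt -> verdict text
) -> float:
    """Run one multi-turn question and return the mean judge rating (1-10)."""
    history: List[Dict[str, str]] = []
    ratings: List[float] = []
    for turn in QUESTION:
        history.append({"role": "user", "content": turn})
        reply = model(history)  # full history is sent, so context carries over
        history.append({"role": "assistant", "content": reply})
        verdict = judge(JUDGE_PROMPT.format(question=turn, answer=reply))
        match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
        ratings.append(float(match.group(1)) if match else 0.0)
    return sum(ratings) / len(ratings)

# Smoke test with trivial stand-ins; real use would wrap model/judge API calls.
print(evaluate(lambda h: "Aloha!", lambda p: "Rating: [[6]]"))  # -> 6.0
```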

Evaluation Stats
Total Models: 11
Organizations: 4
Verified Results: 0
Self-Reported: 11
Benchmark Details
Max Score: 100
Language: English (en)
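For context on the scale: MT-Bench answers are natively rated from 1 to 10 by the judge model, so the percentage scores on this page appear to be raw judge ratings scaled by ten. A minimal sketch of that assumed conversion:

```python
def to_percent(rating: float) -> float:
    """Assumed conversion: 1-10 MT-Bench judge rating -> this page's 0-100 scale."""
    return round(rating * 10.0, 1)

# e.g. a 9.35 average judge rating would render as the 93.5% top score below
assert to_percent(9.35) == 93.5
```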
Performance Overview
Score distribution and top performers

Score Distribution (11 models)
Top Score: 93.5%
Average Score: 78.8%
High Performers (80%+): 9

Top Organizations

#1DeepSeek
1 model
90.2%
#2Alibaba Cloud / Qwen Team
3 models
88.4%
#3Mistral AI
4 models
82.4%
#4NVIDIA
3 models
60.6%
Leaderboard
11 models ranked by performance on MT-Bench

Rank  Date          License                      Score
1     Sep 19, 2024  Qwen                         93.5%
2     Mar 18, 2025  Llama 3.1 Community License  91.7%
3     May 8, 2024   deepseek                     90.2%
4     Sep 19, 2024  Apache 2.0                   87.5%
5     Jul 24, 2024  Mistral Research License     86.3%
6     Jul 23, 2024  Apache 2.0                   84.1%
7     Jan 30, 2025  Apache 2.0                   83.5%
8     Oct 16, 2024  Mistral Research License     83.0%
9     Mar 18, 2025  Llama 3.1 Community License  81.0%
10    Sep 17, 2024  Apache 2.0                   76.8%

Showing 1 to 10 of 11 models
Resources