MT-Bench
About
MT-Bench is a challenging multi-turn conversation benchmark designed to evaluate the conversational and instruction-following abilities of large language models. Its complex dialogue scenarios test whether a model can maintain context across turns, respond coherently throughout a conversation, and follow instructions in realistic interaction settings.
Evaluation Stats
Total Models: 11
Organizations: 4
Verified Results: 0
Self-Reported: 11
Benchmark Details
Max Score: 100
Language: English (en)
Performance Overview
Score distribution and top performers
Score Distribution: 11 models
Top Score: 93.5%
Average Score: 78.8%
High Performers (80%+): 9

Top Organizations
#1 DeepSeek (1 model): 90.2%
#2 Alibaba Cloud / Qwen Team (3 models): 88.4%
#3 Mistral AI (4 models): 82.4%
#4 NVIDIA (3 models): 60.6%
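The overview statistics above can be recomputed directly from the leaderboard rows. A minimal sketch, using the ten scores visible in the leaderboard (the eleventh model is hidden by pagination, so the mean computed here covers only the 10 shown models and differs from the site's 78.8% average over all 11):

```python
# Scores taken from the ten leaderboard rows shown on this page.
# Note: these are 10 of the 11 models; one is paginated away.
scores = [93.5, 91.7, 90.2, 87.5, 86.3, 84.1, 83.5, 83.0, 81.0, 76.8]

top_score = max(scores)
average = sum(scores) / len(scores)
high_performers = sum(1 for s in scores if s >= 80.0)  # matches the "80%+" bucket

print(f"Top score: {top_score:.1f}%")
print(f"Average over shown models: {average:.2f}%")
print(f"High performers (80%+): {high_performers}")
```

The top score (93.5%) and high-performer count (9) match the overview figures; the average does not, which is expected given the missing eleventh row.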
Leaderboard
11 models ranked by performance on MT-Bench
Release Date | License | Score
---|---|---
Sep 19, 2024 | Qwen | 93.5%
Mar 18, 2025 | Llama 3.1 Community License | 91.7%
May 8, 2024 | deepseek | 90.2%
Sep 19, 2024 | Apache 2.0 | 87.5%
Jul 24, 2024 | Mistral Research License | 86.3%
Jul 23, 2024 | Apache 2.0 | 84.1%
Jan 30, 2025 | Apache 2.0 | 83.5%
Oct 16, 2024 | Mistral Research License | 83.0%
Mar 18, 2025 | Llama 3.1 Community License | 81.0%
Sep 17, 2024 | Apache 2.0 | 76.8%
Showing 1 to 10 of 11 models