MultiChallenge
About
MultiChallenge is a pioneering benchmark for evaluating large language models on realistic multi-turn conversations with human users. It features authentic conversation scenarios that test a model's ability to maintain coherent dialogue, track context across multiple exchanges, and provide helpful responses in complex conversational situations.
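For concreteness, here is a minimal sketch of how a multi-turn scenario of this kind might be replayed against a model and judged. The `Scenario` structure, `query_model`, and `judge_response` helpers are hypothetical placeholders, not MultiChallenge's actual harness.

```python
# Hypothetical sketch of replaying a multi-turn scenario against a model.
# `query_model` and `judge_response` stand in for a real chat API and an
# automated judge; neither is part of the actual MultiChallenge harness.
from dataclasses import dataclass

@dataclass
class Scenario:
    turns: list[str]   # successive user messages in the conversation
    rubric: str        # what a passing final response must satisfy

def query_model(history: list[dict]) -> str:
    """Placeholder for a chat-completion call against the model under test."""
    raise NotImplementedError

def judge_response(response: str, rubric: str) -> bool:
    """Placeholder for an automated pass/fail judgment against the rubric."""
    raise NotImplementedError

def run_scenario(scenario: Scenario) -> bool:
    history: list[dict] = []
    for user_turn in scenario.turns:
        history.append({"role": "user", "content": user_turn})
        reply = query_model(history)  # the model sees the full history
        history.append({"role": "assistant", "content": reply})
    # Only the final response is judged; the earlier turns set up the
    # context the model must have tracked to answer correctly.
    return judge_response(history[-1]["content"], scenario.rubric)
```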
Evaluation Stats
Total Models: 7
Organizations: 2
Verified Results: 0
Self-Reported: 7
Benchmark Details
Max Score: 1 (leaderboard scores are reported as percentages of this maximum)
Language: English (en)
Performance Overview
Score distribution and top performers
Score Distribution: 7 models
Top Score: 54.1%
Average Score: 40.1%
High Performers (80%+): 0

Top Organizations
#1 Moonshot AI: 2 models, 54.1% average score
#2 OpenAI: 5 models, 34.6% average score
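These overview numbers can be reproduced from the seven scores in the leaderboard below. The following is a small Python check, under the assumption (consistent with the per-organization model counts above) that the two MIT-licensed 54.1% entries are Moonshot AI's and the five proprietary entries are OpenAI's; the arithmetic also shows that the per-organization figures are averages rather than top scores.

```python
# Recompute the overview statistics from the seven leaderboard scores.
# Assumption: the two MIT-licensed 54.1% entries belong to Moonshot AI
# and the five proprietary entries to OpenAI, matching the model counts
# listed in the Top Organizations section.
moonshot = [54.1, 54.1]
openai = [43.8, 39.9, 38.3, 35.8, 15.0]
scores = moonshot + openai

print(f"Top Score: {max(scores):.1f}%")                            # 54.1%
print(f"Average Score: {sum(scores) / len(scores):.1f}%")          # 40.1%
print(f"High Performers (80%+): {sum(s >= 80 for s in scores)}")   # 0
print(f"Moonshot AI average: {sum(moonshot) / len(moonshot):.1f}%")  # 54.1%
print(f"OpenAI average: {sum(openai) / len(openai):.1f}%")           # 34.6%
```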
Leaderboard
7 models ranked by performance on MultiChallenge
Release Date | License | Score
---|---|---
Sep 5, 2025 | MIT | 54.1%
Jul 11, 2025 | MIT | 54.1%
Feb 27, 2025 | Proprietary | 43.8%
Jan 30, 2025 | Proprietary | 39.9%
Apr 14, 2025 | Proprietary | 38.3%
Apr 14, 2025 | Proprietary | 35.8%
Apr 14, 2025 | Proprietary | 15.0%