Scale MultiChallenge

text

About

Scale-MultiChallenge is a pioneering benchmark evaluating large language models on realistic multi-turn conversations with human users, identifying four key challenge categories requiring accurate instruction-following, context allocation, and in-context reasoning. This comprehensive evaluation reveals that even frontier models achieve less than 50% accuracy, highlighting significant gaps in conversational AI capabilities.

Evaluation Stats

Total Models4

Organizations1

Verified Results0

Self-Reported4

Benchmark Details

Max Score1

Language

Performance Overview

Score distribution and top performers

Score Distribution

4 models

Top Score

69.6%

Average Score

52.3%

High Performers (80%+)

Top Organizations

#1OpenAI

4 models

52.3%

Leaderboard

4 models ranked by performance on Scale MultiChallenge

			License
#01GPT-5	OpenAI	Aug 7, 2025	Proprietary	69.6%
#02o3	OpenAI	Apr 16, 2025	Proprietary	56.5%
#03o4-mini	OpenAI	Apr 16, 2025	Proprietary	43.0%
#04GPT-4o	OpenAI	Aug 6, 2024	Proprietary	40.3%

Resources

Research Paper