MMLU-ProX

About

MMLU-ProX is a multilingual extension of the MMLU-Pro benchmark. It carries over MMLU-Pro's reasoning-focused, ten-option multiple-choice questions and translates them into a wide range of languages, so that models' knowledge and reasoning can be evaluated consistently across languages. This page tracks results on the English (en) split.

Evaluation Stats
  Total Models: 8
  Organizations: 2
  Verified Results: 0
  Self-Reported: 8
Benchmark Details
  Max Score: 1
  Language: en
Performance Overview
Score distribution and top performers

Score Distribution
  Models: 8
  Top Score: 81.0%
  Average Score: 46.5%
  High Performers (80%+): 1

Top Organizations
  #1 Alibaba Cloud / Qwen Team: 4 models, 79.0% avg
  #2 Google: 4 models, 14.0% avg
Leaderboard
8 models ranked by performance on MMLU-ProX

  Rank  Date          License      Score
  #1    Jul 25, 2025  Apache 2.0   81.0%
  #2    Jul 22, 2025  Apache 2.0   79.4%
  #3    Sep 10, 2025  Apache 2.0   78.7%
  #4    Sep 10, 2025  Apache 2.0   76.7%
  #5    May 20, 2025  Gemma        19.9%
  #6    Jun 26, 2025  Proprietary  19.9%
  #7    May 20, 2025  Gemma         8.1%
  #8    Jun 26, 2025  Proprietary   8.1%
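As a cross-check, the summary numbers above (top score, average score, high-performer count, and per-organization averages) can be recomputed from the eight leaderboard scores. The grouping of rows by organization is an assumption inferred from the license column (Apache 2.0 for the Qwen Team models, Gemma/Proprietary for the Google models), consistent with the "4 models" counts in Top Organizations:

```python
# Per-model scores taken directly from the leaderboard rows, top to bottom.
scores = [81.0, 79.4, 78.7, 76.7, 19.9, 19.9, 8.1, 8.1]

top_score = max(scores)                           # 81.0% top score
average = sum(scores) / len(scores)               # ~46.5% average score
high_performers = sum(s >= 80.0 for s in scores)  # 1 model at 80%+

# Assumed grouping: first four rows (Apache 2.0) = Alibaba Cloud / Qwen Team,
# last four rows (Gemma / Proprietary) = Google.
qwen_avg = sum(scores[:4]) / 4    # ~79.0%, matching the Top Organizations entry
google_avg = sum(scores[4:]) / 4  # ~14.0%, matching the Top Organizations entry
```

The recomputed values agree with the reported summary statistics to within the page's one-decimal rounding.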
Resources