Arena Hard

text

About

Arena Hard is an automatic evaluation benchmark for instruction-tuned Large Language Models that achieves 98.6% correlation with human preferences from Chatbot Arena. Built from live crowdsourced data, it provides 3x higher model performance separation compared to MT-Bench through pairwise comparisons using strong judge models. The benchmark uses the Bradley-Terry model for scoring and offers cost-effective, fast evaluation with frequent updates from real-world user interactions.

Evaluation Stats

Total Models21

Organizations7

Verified Results0

Self-Reported21

Benchmark Details

Max Score1

Language

Performance Overview

Score distribution and top performers

Score Distribution

21 models

Top Score

95.6%

Average Score

65.2%

High Performers (80%+)

Top Organizations

#1NVIDIA

1 model

88.3%

#2Alibaba Cloud / Qwen Team

5 models

82.7%

#3DeepSeek

1 model

76.2%

#4Mistral AI

3 models

67.2%

#5Microsoft

6 models

55.9%

Leaderboard

21 models ranked by performance on Arena Hard

			License
#01Qwen3 235B A22B	Alibaba Cloud / Qwen Team	Apr 29, 2025	Apache 2.0	95.6%
#02Qwen3 32B	Alibaba Cloud / Qwen Team	Apr 29, 2025	Apache 2.0	93.8%
#03Qwen3 30B A3B	Alibaba Cloud / Qwen Team	Apr 29, 2025	Apache 2.0	91.0%
#04Llama-3.3 Nemotron Super 49B v1	NVIDIA	Mar 18, 2025	Llama 3.1 Community License	88.3%
#05Mistral Small 3 24B Instruct	Mistral AI	Jan 30, 2025	Apache 2.0	87.6%
#06Qwen2.5 72B Instruct	Alibaba Cloud / Qwen Team	Sep 19, 2024	Qwen	81.2%
#07Phi 4 Reasoning Plus	Microsoft	Apr 30, 2025	MIT	79.0%
#08DeepSeek-V2.5	DeepSeek	May 8, 2024	deepseek	76.2%
#09Phi 4	Microsoft	Dec 12, 2024	MIT	75.4%
#10Phi 4 Reasoning	Microsoft	Apr 30, 2025	MIT	73.3%

Showing 1 to 10 of 21 models

Resources

Research Paper