Arena-Hard v2

About

Arena-Hard v2 is an enhanced version of the Arena-Hard benchmark, featuring improved evaluation methodology and expanded test coverage for instruction-tuned large language models. Building on the original's high correlation with human preferences, it incorporates refined judge models, updated evaluation criteria, and additional challenge categories. The benchmark retains the cost-effective automated evaluation approach while providing better model discrimination and closer alignment with human judgment.
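Concretely, Arena-Hard-style benchmarks score a model by having a strong judge model compare its answers against a fixed baseline's answers on a set of hard prompts, then aggregating the pairwise verdicts into a win rate. The sketch below illustrates that general pattern in Python; the judge prompt wording, verdict labels, score mapping, and judge model name are illustrative assumptions, not the benchmark's exact implementation.

```python
# Minimal sketch of pairwise LLM-as-judge scoring (illustrative only, not the
# official Arena-Hard v2 pipeline). Assumes the `openai` package is installed
# and OPENAI_API_KEY is set; the judge model and label set are assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Compare the two assistant
answers to the user question and output exactly one verdict label:
A>>B, A>B, A=B, B>A, or B>>A.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

# Map verdict labels to a win score for model A against baseline B.
LABEL_SCORE = {"A>>B": 1.0, "A>B": 0.75, "A=B": 0.5, "B>A": 0.25, "B>>A": 0.0}


def judge_pair(question: str, answer_a: str, answer_b: str,
               judge_model: str = "gpt-4.1") -> float:
    """Ask the judge model for a verdict and convert it to a score."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    verdict = response.choices[0].message.content.strip()
    # Fall back to a tie if the judge returns an unexpected label.
    return LABEL_SCORE.get(verdict, 0.5)


def win_rate(records: list[dict]) -> float:
    """Average pairwise score over (question, answer_a, answer_b) records."""
    scores = [judge_pair(r["question"], r["answer_a"], r["answer_b"])
              for r in records]
    return sum(scores) / len(scores)
```

Pipelines of this kind typically also judge each pair in both answer orders to mitigate position bias and report bootstrapped confidence intervals over the aggregate score.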

Evaluation Stats
Total Models: 4
Organizations: 1
Verified Results: 0
Self-Reported: 4
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (4 models)

Top Score: 82.7%
Average Score: 76.0%
High Performers (80%+): 1

Top Organizations

#1 Alibaba Cloud / Qwen Team: 4 models, average score 76.0%
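
As a quick sanity check, the overview numbers above follow directly from the four self-reported scores in the leaderboard below:

```python
# Recompute the overview stats from the four leaderboard scores below.
scores = [82.7, 79.7, 79.2, 62.3]  # self-reported Arena-Hard v2 scores (%)

top = max(scores)                               # 82.7
average = sum(scores) / len(scores)             # 75.975, reported as 76.0
high_performers = sum(s >= 80 for s in scores)  # 1

print(f"Top score: {top:.1f}%")
print(f"Average score: {average:.1f}%")
print(f"High performers (80%+): {high_performers}")
```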
Leaderboard
4 models ranked by performance on Arena-Hard v2
Rank  Date          License     Score
#1    Sep 10, 2025  Apache 2.0  82.7%
#2    Jul 25, 2025  Apache 2.0  79.7%
#3    Jul 22, 2025  Apache 2.0  79.2%
#4    Sep 10, 2025  Apache 2.0  62.3%
Resources