Arena-Hard v2
About
Arena-Hard v2 is an enhanced version of the Arena-Hard benchmark, featuring improved evaluation methodology and expanded test coverage for instruction-tuned large language models. Building on the original's high correlation with human preferences, it incorporates refined judge models, updated evaluation criteria, and additional challenge categories. The benchmark keeps the cost-effective automated evaluation approach while offering better model discrimination and closer alignment with human judgment.
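The core mechanism behind this automated approach is pairwise judging: a judge model compares each candidate answer against a fixed baseline answer, and the candidate's score is derived from its win rate. The sketch below illustrates that loop under stated assumptions only: it uses the OpenAI Python client, a placeholder judge model name (`gpt-4.1`), and simplified prompt wording and scoring, none of which are the benchmark's exact configuration.

```python
# Minimal sketch of Arena-Hard-style pairwise judging. The judge model
# name, prompt wording, and verdict handling are illustrative assumptions,
# not the benchmark's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_SYSTEM = (
    "You are an impartial judge. Compare the two assistant answers to the "
    "user's question and reply with exactly one verdict label: "
    "[[A>>B]], [[A>B]], [[A=B]], [[B>A]], or [[B>>A]]."
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model for a pairwise verdict between two answers."""
    resp = client.chat.completions.create(
        model="gpt-4.1",  # assumed judge; the real harness may use another
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": (
                f"Question:\n{question}\n\n"
                f"Answer A:\n{answer_a}\n\n"
                f"Answer B:\n{answer_b}"
            )},
        ],
    )
    return resp.choices[0].message.content.strip()

def win_rate(questions, candidate_answers, baseline_answers) -> float:
    """Score a candidate as its win rate against the baseline.

    The candidate is always shown as answer A here; ties count as half
    a win.
    """
    points = 0.0
    for q, cand, base in zip(questions, candidate_answers, baseline_answers):
        verdict = judge(q, cand, base)
        if "A>" in verdict:        # matches [[A>>B]] and [[A>B]]
            points += 1.0
        elif "A=B" in verdict:     # tie
            points += 0.5
    return points / len(questions)
```

The full harness additionally repeats each comparison with the answer order swapped to control position bias and aggregates verdicts statistically (Bradley-Terry with bootstrapped confidence intervals in the original release); the sketch keeps only the core judging step.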
Evaluation Stats
Total Models: 4
Organizations: 1
Verified Results: 0
Self-Reported: 4
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 4 models
Top Score: 82.7%
Average Score: 76.0%
High Performers (80%+): 1

Top Organizations
#1 Alibaba Cloud / Qwen Team (4 models, 76.0% average)
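As a quick sanity check, these overview figures can be recomputed from the four rounded scores in the leaderboard below; a minimal sketch (figures derived from unrounded underlying scores could differ slightly):

```python
# Recompute the Performance Overview figures from the leaderboard's
# (rounded) scores.
scores = [82.7, 79.7, 79.2, 62.3]  # from the leaderboard below

print(f"Top score:       {max(scores):.1f}%")                # 82.7%
print(f"Average score:   {sum(scores) / len(scores):.1f}%")  # 76.0%
print(f"High performers: {sum(s >= 80.0 for s in scores)}")  # 1 (80%+)
```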
Leaderboard
4 models ranked by performance on Arena-Hard v2
| Rank | Release Date | License | Score |
|---|---|---|---|
| 1 | Sep 10, 2025 | Apache 2.0 | 82.7% |
| 2 | Jul 25, 2025 | Apache 2.0 | 79.7% |
| 3 | Jul 22, 2025 | Apache 2.0 | 79.2% |
| 4 | Sep 10, 2025 | Apache 2.0 | 62.3% |