AlignBench

Multilingual

text

About

AlignBench is a comprehensive multi-dimensional benchmark for evaluating Large Language Model alignment, particularly in Chinese. It assesses how well AI systems interpret, internalize, and execute human instructions through 683 real-scenario queries across 8 categories. The benchmark evaluates literal, implicit, and intentional compliance, complex conflict resolution, multi-turn dialogue consistency, and adaptive reinterpretation capabilities in scenarios with profound ambiguity.

Evaluation Stats

Total Models4

Organizations2

Verified Results0

Self-Reported4

Benchmark Details

Max Score1

Language

Performance Overview

Score distribution and top performers

Score Distribution

4 models

Top Score

81.6%

Average Score

76.9%

High Performers (80%+)

Top Organizations

#1DeepSeek

1 model

80.4%

#2Alibaba Cloud / Qwen Team

3 models

75.7%

Leaderboard

4 models ranked by performance on AlignBench

			License
#01Qwen2.5 72B Instruct	Alibaba Cloud / Qwen Team	Sep 19, 2024	Qwen	81.6%
#02DeepSeek-V2.5	DeepSeek	May 8, 2024	deepseek	80.4%
#03Qwen2.5 7B Instruct	Alibaba Cloud / Qwen Team	Sep 19, 2024	Apache 2.0	73.3%
#04Qwen2 7B Instruct	Alibaba Cloud / Qwen Team	Jul 23, 2024	Apache 2.0	72.1%

Resources

Research Paper