AlpacaEval 2.0

text

About

AlpacaEval 2.0 is an automated evaluation benchmark for instruction-following language models that achieves 0.98 correlation with ChatBot Arena while being fast, cheap, and reliable. It evaluates models' ability to follow general user instructions using GPT-4 as an auto-annotator, comparing responses to reference outputs. The benchmark features length-controlled win rates, costs less than $10 in OpenAI credits, and runs in under 3 minutes.

Evaluation Stats

Total Models4

Organizations2

Verified Results0

Self-Reported4

Benchmark Details

Max Score1

Language

Performance Overview

Score distribution and top performers

Score Distribution

4 models

Top Score

62.7%

Average Score

52.8%

High Performers (80%+)

Top Organizations

#1IBM

3 models

53.5%

#2DeepSeek

1 model

50.5%

Leaderboard

4 models ranked by performance on AlpacaEval 2.0

			License
#01Granite 3.3 8B Base	IBM	Apr 16, 2025	Apache 2.0	62.7%
#02Granite 3.3 8B Instruct	IBM	Apr 16, 2025	Apache 2.0	62.7%
#03DeepSeek-V2.5	DeepSeek	May 8, 2024	deepseek	50.5%
#04IBM Granite 4.0 Tiny Preview	IBM	May 2, 2025	Apache 2.0	35.2%

Resources

Research Paper