WildBench

About

WildBench is an automated evaluation framework that benchmarks large language models on 1,024 challenging, real-world user queries selected from more than one million human-chatbot conversation logs. It reports two metrics, WB-Reward and WB-Score, both of which correlate strongly with human evaluation, and it measures how LLMs perform on authentic user interactions and complex real-world tasks.

Evaluation Stats

Total Models: 4

Organizations: 2

Verified Results: 0

Self-Reported: 4

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution (4 models)

Top Score: 65.3%

Average Score: 52.1%

High Performers (80%+): 0

Top Organizations

#1 Mistral AI: 2 models, 58.8% average

#2 AI21 Labs: 2 models, 45.5% average

Leaderboard

4 models ranked by performance on WildBench

Rank	Model	Organization	Release Date	License	Score
#01	Mistral Small 3.2 24B Instruct	Mistral AI	Jun 20, 2025	Apache 2.0	65.3%
#02	Mistral Small 3 24B Instruct	Mistral AI	Jan 30, 2025	Apache 2.0	52.2%
#03	Jamba 1.5 Large	AI21 Labs	Aug 22, 2024	Jamba Open Model License	48.5%
#04	Jamba 1.5 Mini	AI21 Labs	Aug 22, 2024	Jamba Open Model License	42.4%
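
The summary figures on this page (top score, average score, high-performer count, and per-organization averages) can be re-derived from the four leaderboard rows. The sketch below is one way to do that, not the site's actual method. It assumes that the organization figures are plain means of each organization's model scores and that values are rounded half-up to one decimal place, which is consistent with the displayed numbers. The names `LEADERBOARD`, `one_decimal`, `average`, and `summarize` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# Rows copied from the leaderboard: (model, organization, score in percent).
# Decimal keeps the arithmetic exact, so rounding behaves predictably.
LEADERBOARD = [
    ("Mistral Small 3.2 24B Instruct", "Mistral AI", Decimal("65.3")),
    ("Mistral Small 3 24B Instruct", "Mistral AI", Decimal("52.2")),
    ("Jamba 1.5 Large", "AI21 Labs", Decimal("48.5")),
    ("Jamba 1.5 Mini", "AI21 Labs", Decimal("42.4")),
]


def one_decimal(value):
    """Round to one decimal place, half-up (an assumption about the page)."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def average(values):
    """Arithmetic mean of a non-empty list of Decimals, rounded for display."""
    return one_decimal(sum(values) / len(values))


def summarize(rows, high_threshold=Decimal("80")):
    """Recompute the page's summary statistics from leaderboard rows."""
    scores = [score for _, _, score in rows]

    # Group scores by organization to compute per-organization means.
    by_org = defaultdict(list)
    for _, org, score in rows:
        by_org[org].append(score)

    return {
        "top_score": max(scores),
        "average_score": average(scores),
        "high_performers": sum(1 for s in scores if s >= high_threshold),
        "org_averages": {org: average(vals) for org, vals in by_org.items()},
    }


summary = summarize(LEADERBOARD)
for key, value in summary.items():
    print(key, value)
```

Under these assumptions the recomputed values agree with the page: a top score of 65.3%, an average of 52.1%, no models at or above 80%, and organization averages of 58.8% for Mistral AI and 45.5% for AI21 Labs.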
