AGIEval

About

AGIEval is a human-centric benchmark that evaluates foundation models on real standardized exams, including college entrance tests (SAT, Gaokao), the law school admission test (LSAT), math competitions, and professional qualification exams. The benchmark is bilingual (English and Chinese) and assesses four core capabilities: understanding, knowledge, reasoning, and calculation. By measuring models against human performance on authentic exams rather than purpose-built datasets, AGIEval offers a meaningful yardstick for progress toward Artificial General Intelligence.

Evaluation Stats
Total Models: 5
Organizations: 3
Verified Results: 0
Self-Reported: 5
Benchmark Details
Max Score: 1 (scores are normalized to [0, 1] and displayed as percentages)
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (5 models)
Top Score: 65.8%
Average Score: 54.3%
High Performers (80%+): 0
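
The summary numbers above can be recomputed directly from the five scores in the leaderboard at the bottom of this page. A minimal Python sketch; the score list is transcribed from that leaderboard, not fetched from anywhere:

    # Scores (in percent) transcribed from the AGIEval leaderboard below.
    scores = [65.8, 55.1, 52.8, 49.3, 48.3]

    top_score = max(scores)                           # 65.8
    average = sum(scores) / len(scores)               # 54.26 -> reported as 54.3%
    high_performers = sum(s >= 80.0 for s in scores)  # 0

    print(f"Top score:       {top_score:.1f}%")
    print(f"Average score:   {average:.1f}%")
    print(f"High performers: {high_performers}")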

Top Organizations

#1 Mistral AI: 2 models, 57.0%
#2 Google: 2 models, 54.0%
#3 IBM: 1 model, 49.3%
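
The per-organization figures are simple means over each organization's models. In the sketch below, the row-to-organization mapping is inferred from the license column and the per-org model counts and averages above; the source page does not label each leaderboard row with an organization:

    # Per-organization score averages. The grouping is an inference:
    # the two Gemma-licensed rows are attributed to Google, and the
    # Apache 2.0 (65.8%) and Mistral Research License rows to Mistral AI,
    # which reproduces the stated averages.
    org_scores = {
        "Mistral AI": [65.8, 48.3],
        "Google":     [55.1, 52.8],
        "IBM":        [49.3],
    }

    ranked = sorted(org_scores.items(),
                    key=lambda kv: -sum(kv[1]) / len(kv[1]))
    for rank, (org, scores) in enumerate(ranked, start=1):
        avg = sum(scores) / len(scores)
        print(f"#{rank} {org}: {len(scores)} model(s), {avg:.1f}%")
    # -> Mistral AI 57.0%, Google 54.0%, IBM 49.3%, matching the page.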
Leaderboard
5 models ranked by performance on AGIEval

Rank  Release Date  License                    Score
#1    Jan 30, 2025  Apache 2.0                 65.8%
#2    Jun 27, 2024  Gemma                      55.1%
#3    Jun 27, 2024  Gemma                      52.8%
#4    Apr 16, 2025  Apache 2.0                 49.3%
#5    Oct 16, 2024  Mistral Research License   48.3%