HellaSwag

text

About

HellaSwag is a commonsense reasoning benchmark that evaluates AI models' ability to complete natural language scenarios with the most plausible ending. Created using Adversarial Filtering, this dataset features video caption segments with multiple-choice options where models must select logical conclusions. HellaSwag tests models' understanding of everyday situations and common sense reasoning through challenging adversarial examples.

Evaluation Stats

Total Models24

Organizations10

Verified Results0

Self-Reported24

Benchmark Details

Max Score1

Language

Performance Overview

Score distribution and top performers

Score Distribution

24 models

Top Score

95.4%

Average Score

82.4%

High Performers (80%+)

Top Organizations

#1OpenAI

1 model

95.3%

#2Anthropic

3 models

90.1%

#3Cohere

1 model

88.6%

#4NVIDIA

1 model

85.6%

#5Mistral AI

1 model

83.5%

Leaderboard

24 models ranked by performance on HellaSwag

			License
#01Claude 3 Opus	Anthropic	Feb 29, 2024	Proprietary	95.4%
#02GPT-4	OpenAI	Jun 13, 2023	Proprietary	95.3%
#03Gemini 1.5 Pro	Google	May 1, 2024	Proprietary	93.3%
#04Claude 3 Sonnet	Anthropic	Feb 29, 2024	Proprietary	89.0%
#05Command R+	Cohere	Aug 30, 2024	CC BY-NC	88.6%
#06Qwen2 72B Instruct	Alibaba Cloud / Qwen Team	Jul 23, 2024	tongyi-qianwen	87.6%
#07Gemini 1.5 Flash	Google	May 1, 2024	Proprietary	86.5%
#08Gemma 2 27B	Google	Jun 27, 2024	Gemma	86.4%
#09Claude 3 Haiku	Anthropic	Mar 13, 2024	Proprietary	85.9%
#10Llama 3.1 Nemotron 70B Instruct	NVIDIA	Oct 1, 2024	Llama 3.1 Community License	85.6%

Showing 1 to 10 of 24 models

Resources

Research Paper