HellaSwag

About

HellaSwag is a commonsense reasoning benchmark that evaluates a model's ability to complete an everyday scenario with the most plausible ending. Each example presents a context drawn from video captions (ActivityNet) or WikiHow articles, followed by four candidate endings; the model must select the one that logically continues the situation. The incorrect endings were built with Adversarial Filtering: machine-generated completions were iteratively filtered until they reliably fooled contemporary models while remaining easy for humans to reject, making the dataset a challenging test of everyday common sense.
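In practice, HellaSwag is scored as four-way multiple choice: the model assigns a log-likelihood to each candidate ending given the context, the highest-scoring ending is taken as its answer, and accuracy against the gold label is reported. Below is a minimal sketch of that protocol; the choice of gpt2 as the scoring model and the "Rowan/hellaswag" Hugging Face dataset id are illustrative assumptions, and real evaluation harnesses typically also report a length-normalized variant of this score.

```python
# Minimal sketch of log-likelihood scoring on HellaSwag.
# Assumptions (not part of the benchmark definition): gpt2 as the
# scoring model, the "Rowan/hellaswag" dataset id, and joining
# context and ending with a single space.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_logprob(context: str, ending: str) -> float:
    """Sum of token log-probabilities assigned to `ending` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift: position i predicts token i+1. Assumes the context
    # tokenization is a prefix of the full sequence (typical for BPE here).
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = ctx_ids.shape[1] - 1  # first position predicting an ending token
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()

example = load_dataset("Rowan/hellaswag", split="validation")[0]
scores = [ending_logprob(example["ctx"], e) for e in example["endings"]]
pred = max(range(4), key=lambda i: scores[i])
print("predicted:", pred, "gold:", example["label"])
```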

Evaluation Stats

Total Models: 24
Organizations: 10
Verified Results: 0
Self-Reported: 24
Benchmark Details

Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (24 models)

Top Score: 95.4%
Average Score: 82.4%
High Performers (80%+): 16

Top Organizations

#1 OpenAI (1 model): 95.3%
#2 Anthropic (3 models): 90.1%
#3 Cohere (1 model): 88.6%
#4 NVIDIA (1 model): 85.6%
#5 Mistral AI (1 model): 83.5%
Leaderboard
24 models ranked by performance on HellaSwag
Release Date | License | Score
Feb 29, 2024 | Proprietary | 95.4%
Jun 13, 2023 | Proprietary | 95.3%
May 1, 2024 | Proprietary | 93.3%
Feb 29, 2024 | Proprietary | 89.0%
Aug 30, 2024 | CC BY-NC | 88.6%
Jul 23, 2024 | tongyi-qianwen | 87.6%
May 1, 2024 | Proprietary | 86.5%
Jun 27, 2024 | Gemma | 86.4%
Mar 13, 2024 | Proprietary | 85.9%
Oct 1, 2024 | Llama 3.1 Community License | 85.6%
Showing 1 to 10 of 24 models
Resources