WinoGrande


About

WinoGrande is a large-scale commonsense reasoning benchmark of roughly 44,000 pronoun resolution problems, designed to be harder for machine learning models than the original Winograd Schema Challenge. The dataset was filtered with the AfLite algorithm to systematically reduce dataset-specific biases, so the evaluation tests whether a model can resolve ambiguous pronouns that require world knowledge and commonsense reasoning.
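As a rough illustration of how such a problem is scored, the sketch below assumes the commonly distributed item layout (a sentence with a `_` blank, two candidate fillers, and an answer of "1" or "2"). The two example items are the classic trophy/suitcase pair, used here only for illustration, and `toy_score` is a hypothetical stand-in for a real model's plausibility score, not any model's actual method.

```python
from typing import Callable, Dict, List

Item = Dict[str, str]


def fill_blank(sentence: str, option: str) -> str:
    """Substitute a candidate filler for the first '_' blank."""
    return sentence.replace("_", option, 1)


def predict(item: Item, score: Callable[[str], float]) -> str:
    """Return "1" or "2", whichever filled-in sentence scores higher."""
    s1 = score(fill_blank(item["sentence"], item["option1"]))
    s2 = score(fill_blank(item["sentence"], item["option2"]))
    return "1" if s1 >= s2 else "2"


def accuracy(items: List[Item], score: Callable[[str], float]) -> float:
    """Fraction of items answered correctly (so the maximum is 1)."""
    correct = sum(predict(item, score) == item["answer"] for item in items)
    return correct / len(items)


# Illustrative twin sentences: one changed word flips the correct answer.
items = [
    {
        "sentence": "The trophy doesn't fit into the brown suitcase because _ is too large.",
        "option1": "trophy",
        "option2": "suitcase",
        "answer": "1",
    },
    {
        "sentence": "The trophy doesn't fit into the brown suitcase because _ is too small.",
        "option1": "trophy",
        "option2": "suitcase",
        "answer": "2",
    },
]


def toy_score(text: str) -> float:
    """Hypothetical scorer: a placeholder for a model's sentence likelihood."""
    return 1.0 if "trophy is too large" in text else 0.0


result = accuracy(items, toy_score)
```

Because the toy scorer only recognizes one surface pattern, it gets the first item right and the second wrong. Twin sentences of this kind are meant to expose exactly that sort of shallow cue matching.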

Evaluation Stats

Total Models: 19

Organizations: 8

Verified Results: 0

Self-Reported: 19

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

19 models

Top Score

87.5%

Average Score

77.0%

High Performers (80%+)

Top Organizations

Rank	Organization	Models	Score
#1	OpenAI	1	87.5%
#2	Cohere	1	85.4%
#3	NVIDIA	1	84.5%
#4	Alibaba Cloud / Qwen Team	4	80.2%
#5	Mistral AI	2	76.0%

Leaderboard

19 models ranked by performance on WinoGrande

Rank	Model	Organization	Date	License	Score
#01	GPT-4	OpenAI	Jun 13, 2023	Proprietary	87.5%
#02	Command R+	Cohere	Aug 30, 2024	CC BY-NC	85.4%
#03	Qwen2 72B Instruct	Alibaba Cloud / Qwen Team	Jul 23, 2024	tongyi-qianwen	85.1%
#04	Llama 3.1 Nemotron 70B Instruct	NVIDIA	Oct 1, 2024	Llama 3.1 Community License	84.5%
#05	Gemma 2 27B	Google	Jun 27, 2024	Gemma	83.7%
#06	Qwen2.5 32B Instruct	Alibaba Cloud / Qwen Team	Sep 19, 2024	Apache 2.0	82.0%
#07	Phi-3.5-MoE-instruct	Microsoft	Aug 23, 2024	MIT	81.3%
#08	Qwen2.5-Coder 32B Instruct	Alibaba Cloud / Qwen Team	Sep 19, 2024	Apache 2.0	80.8%
#09	Gemma 2 9B	Google	Jun 27, 2024	Gemma	80.6%
#10	Mistral NeMo Instruct	Mistral AI	Jul 18, 2024	Apache 2.0	76.8%

Showing 1 to 10 of 19 models
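The summary figures can be partly cross-checked against the ten rows shown here. The sketch below is only a sanity check under that limitation: it uses the visible scores, so its mean covers the top ten and is not the page's 19-model average. The variable names are illustrative.

```python
# Scores (in percent) of the ten leaderboard rows visible on this page.
visible_scores = [87.5, 85.4, 85.1, 84.5, 83.7, 82.0, 81.3, 80.8, 80.6, 76.8]

# Count of visible models at or above the 80% "high performer" threshold.
high_performers = sum(score >= 80.0 for score in visible_scores)

# Mean of the visible rows only; the remaining nine models are not listed.
visible_mean = round(sum(visible_scores) / len(visible_scores), 2)

# Highest visible score, to compare with the reported top score.
top_score = max(visible_scores)
```

Since the list is ranked by performance, the nine models not shown should score no higher than the tenth row, so the count of models at 80% or above among the visible rows would also hold for the full set of 19.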

Resources

Research Paper