PIQA

text

About

PIQA (Physical Interaction Question Answering) is a benchmark for physical commonsense reasoning. Each question states an everyday goal and offers two candidate solutions, and the model must choose the more physically plausible one. The questions draw inspiration from instructables.com and favor atypical but practical solutions to physical problems, so the benchmark evaluates a model's grasp of real-world physics and commonsense reasoning in physical scenarios.
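As a rough illustration of how a benchmark of this kind is typically scored, the sketch below computes accuracy over a few made-up items. The field names (`goal`, `sol1`, `sol2`, `label`), the example items, and the `pick_longer` baseline are assumptions for illustration only; they are not taken from this page or from any model's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    goal: str   # the physical task to accomplish
    sol1: str   # first candidate solution
    sol2: str   # second candidate solution
    label: int  # index of the correct solution: 0 for sol1, 1 for sol2


def accuracy(items: List[Item], predict: Callable[[Item], int]) -> float:
    """Fraction of items where predict() returns the labeled solution."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item.label)
    return correct / len(items)


# Made-up examples, not drawn from the benchmark itself.
ITEMS = [
    Item("Keep a cutting board from sliding",
         "Put a damp towel under it",
         "Put a sheet of ice under it", 0),
    Item("Remove a stripped screw",
         "Press harder with the same bit",
         "Place a rubber band between bit and screw", 1),
    Item("Cool a drink quickly",
         "Wrap it in a wet paper towel and freeze briefly",
         "Wrap it in a dry wool sock", 0),
]


def pick_longer(item: Item) -> int:
    """Toy baseline (hypothetical): choose the longer solution text."""
    return 0 if len(item.sol1) >= len(item.sol2) else 1


score = accuracy(ITEMS, pick_longer)
print(f"{score:.1%}")
```

Accuracy computed this way falls between 0 and 1, which is consistent with a maximum score of 1 being displayed as a percentage.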

Evaluation Stats

Total Models: 9

Organizations: 2

Verified Results: 0

Self-Reported: 9

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

9 models

Top Score: 88.6%

Average Score: 81.3%

High Performers (80%+): 6 models
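The summary figures can be checked against the leaderboard with a few lines of arithmetic. This is a minimal sketch that assumes the average is an unweighted mean of the nine listed scores, rounded to one decimal place; the variable names are illustrative.

```python
from statistics import mean

# Scores (in percent) copied from the leaderboard on this page.
SCORES = [88.6, 83.2, 81.7, 81.0, 81.0, 81.0, 78.9, 78.9, 77.6]

top_score = max(SCORES)
# Assumption: the page reports an unweighted mean rounded to one decimal.
average_score = round(mean(SCORES), 1)
# Count models scoring at or above the 80% threshold.
high_performers = sum(1 for s in SCORES if s >= 80.0)

print(top_score, average_score, high_performers)
```

Under that assumption the result agrees with the 88.6% top score and 81.3% average reported here, and six of the nine models reach 80% or higher.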

Top Organizations

#1 Microsoft: 3 models, 82.4%

#2 Google: 6 models, 80.8%
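The organization figures appear to be per-organization averages of the leaderboard scores. The sketch below groups the rows by organization and averages them; treating the figure as an unweighted mean rounded to one decimal is an assumption, and the names `ROWS`, `summary`, and `ranked` are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (model, organization, score in percent), copied from the leaderboard.
ROWS = [
    ("Phi-3.5-MoE-instruct", "Microsoft", 88.6),
    ("Gemma 2 27B", "Google", 83.2),
    ("Gemma 2 9B", "Google", 81.7),
    ("Gemma 3n E4B", "Google", 81.0),
    ("Gemma 3n E4B Instructed LiteRT Preview", "Google", 81.0),
    ("Phi-3.5-mini-instruct", "Microsoft", 81.0),
    ("Gemma 3n E2B Instructed LiteRT (Preview)", "Google", 78.9),
    ("Gemma 3n E2B", "Google", 78.9),
    ("Phi 4 Mini", "Microsoft", 77.6),
]

# Collect each organization's scores.
by_org = defaultdict(list)
for _model, org, score in ROWS:
    by_org[org].append(score)

# Map organization -> (model count, mean score rounded to one decimal).
summary = {org: (len(s), round(mean(s), 1)) for org, s in by_org.items()}

# Rank organizations by mean score, highest first.
ranked = sorted(summary.items(), key=lambda kv: kv[1][1], reverse=True)

for position, (org, (count, avg)) in enumerate(ranked, start=1):
    print(f"#{position} {org}: {count} models, {avg}%")
```

With this grouping, Microsoft's three models average 82.4% and Google's six average 80.8%, matching the ranking shown.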

Leaderboard

9 models ranked by performance on PIQA

Rank	Model	Organization	Release Date	License	Score
#01	Phi-3.5-MoE-instruct	Microsoft	Aug 23, 2024	MIT	88.6%
#02	Gemma 2 27B	Google	Jun 27, 2024	Gemma	83.2%
#03	Gemma 2 9B	Google	Jun 27, 2024	Gemma	81.7%
#04	Gemma 3n E4B	Google	Jun 26, 2025	Proprietary	81.0%
#05	Gemma 3n E4B Instructed LiteRT Preview	Google	May 20, 2025	Gemma	81.0%
#06	Phi-3.5-mini-instruct	Microsoft	Aug 23, 2024	MIT	81.0%
#07	Gemma 3n E2B Instructed LiteRT (Preview)	Google	May 20, 2025	Gemma	78.9%
#08	Gemma 3n E2B	Google	Jun 26, 2025	Proprietary	78.9%
#09	Phi 4 Mini	Microsoft	Feb 1, 2025	MIT	77.6%
