POPE

multimodal

About

POPE (Polling-based Object Probing Evaluation) is a benchmark for evaluating object hallucination in large vision-language models. It polls a model with yes/no questions about whether specific objects appear in an image, testing whether the model correctly confirms objects that are present and rejects objects that are absent. The results indicate how well a model's answers are grounded in the image and how prone it is to hallucinating non-existent objects.
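As a rough illustration of how polling-style results can be scored, the sketch below treats each probe as a binary presence/absence decision and derives the usual classification metrics from it. The names `PollResult` and `score_poll` are hypothetical, the answers are assumed to be already parsed into booleans, and this page does not say which metric the leaderboard percentages correspond to, so treat it as an assumption-laden sketch rather than the benchmark's official scoring code.

```python
from dataclasses import dataclass


@dataclass
class PollResult:
    """One yes/no probe about a single object in a single image (hypothetical type)."""
    object_present: bool  # ground truth: is the object actually in the image?
    model_said_yes: bool  # model's answer, assumed already parsed to a boolean


def score_poll(results):
    """Compute accuracy, precision, recall, F1 and yes-ratio over a list of probes."""
    total = len(results)
    if total == 0:
        raise ValueError("need at least one poll result")

    # Confusion-matrix counts, with "yes, the object is present" as the positive class.
    tp = sum(r.object_present and r.model_said_yes for r in results)
    fp = sum((not r.object_present) and r.model_said_yes for r in results)
    fn = sum(r.object_present and (not r.model_said_yes) for r in results)
    tn = sum((not r.object_present) and (not r.model_said_yes) for r in results)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of probes answered "yes"; a high value hints at a bias toward affirming objects.
        "yes_ratio": (tp + fp) / total,
    }


if __name__ == "__main__":
    # Made-up probes for illustration only, not benchmark data.
    demo = [
        PollResult(object_present=True, model_said_yes=True),
        PollResult(object_present=True, model_said_yes=False),
        PollResult(object_present=False, model_said_yes=True),   # a hallucinated object
        PollResult(object_present=False, model_said_yes=False),
    ]
    print(score_poll(demo))
```

A false positive here (answering "yes" for an absent object) is the hallucination case the benchmark is meant to expose, which is why precision and the yes-ratio are worth reading alongside plain accuracy.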

Evaluation Stats

Total Models: 2

Organizations: 1

Verified Results: 0

Self-Reported: 2

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

2 models

Top Score

86.1%

Average Score

85.9%

High Performers (80%+): 2

Top Organizations

#1 Microsoft

2 models

85.9%

Leaderboard

2 models ranked by performance on POPE

Rank	Model	Organization	Release Date	License	Score
#01	Phi-3.5-vision-instruct	Microsoft	Aug 23, 2024	MIT	86.1%
#02	Phi-4-multimodal-instruct	Microsoft	Feb 1, 2025	MIT	85.6%

Resources

Research Paper