SimpleQA
About
SimpleQA is OpenAI's factuality benchmark for measuring language models' ability to answer short, fact-seeking questions with high correctness and low variance. It spans diverse topics, challenges even frontier models, and offers insight into how reliably AI systems return accurate, verifiable answers to straightforward factual queries.
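The SimpleQA paper grades each model answer as correct, incorrect, or not attempted, and reports both headline accuracy and accuracy over attempted questions. A minimal sketch of that scoring, assuming this three-way grading scheme (the function and field names here are illustrative, not OpenAI's code):

```python
def simpleqa_metrics(grades):
    """grades: list of strings, each 'correct', 'incorrect', or 'not_attempted'."""
    n = len(grades)
    correct = grades.count("correct")
    attempted = correct + grades.count("incorrect")
    return {
        # Headline leaderboard number: fraction of ALL questions answered correctly.
        "overall_correct": correct / n,
        # Rewards calibrated abstention: fraction correct among attempted only.
        "correct_given_attempted": correct / attempted if attempted else 0.0,
    }

print(simpleqa_metrics(["correct", "incorrect", "not_attempted", "correct"]))
```

The split between the two numbers is what lets the benchmark reward models that decline to answer rather than hallucinate.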
Evaluation Stats
- Total Models: 25
- Organizations: 7
- Verified Results: 0
- Self-Reported: 25
Benchmark Details
- Max Score: 1
- Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 25 models
- Top Score: 97.1%
- Average Score: 35.1%
- High Performers (80%+): 3

Top Organizations
1. DeepSeek — 4 models, 76.9%
2. Alibaba Cloud / Qwen Team — 1 model, 54.3%
3. OpenAI — 5 models, 41.0%
4. Moonshot AI — 3 models, 32.4%
5. Google — 9 models, 20.7%
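The summary stats above (top score, average, 80%+ count, per-organization averages) are simple aggregations over per-model scores. A small sketch of that aggregation, using made-up sample scores rather than the leaderboard's actual 25 entries:

```python
from collections import defaultdict

# (organization, score) pairs -- illustrative sample, NOT the real leaderboard data
results = [
    ("DeepSeek", 0.971), ("DeepSeek", 0.934), ("DeepSeek", 0.923),
    ("OpenAI", 0.625), ("Alibaba Cloud / Qwen Team", 0.543),
    ("Google", 0.207),
]

top = max(score for _, score in results)                      # best single model
avg = sum(score for _, score in results) / len(results)       # mean over all models
high = sum(1 for _, score in results if score >= 0.80)        # models at 80%+

# Per-organization mean, the basis of the "Top Organizations" ranking
by_org = defaultdict(list)
for org, score in results:
    by_org[org].append(score)
org_means = {org: sum(v) / len(v) for org, v in by_org.items()}

print(f"Top: {top:.1%}, Avg: {avg:.1%}, 80%+: {high}")
```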
Leaderboard
25 models ranked by performance on SimpleQA
| Release Date | License | Score |
|---|---|---|
| Sep 29, 2025 | MIT | 97.1% |
| Jan 10, 2025 | MIT | 93.4% |
| May 28, 2025 | MIT | 92.3% |
| Feb 27, 2025 | Proprietary | 62.5% |
| Jul 22, 2025 | Apache 2.0 | 54.3% |
| Jun 5, 2025 | Proprietary | 54.0% |
| May 20, 2025 | Proprietary | 50.8% |
| Dec 17, 2024 | Proprietary | 47.0% |
| Sep 12, 2024 | Proprietary | 42.4% |
| Aug 6, 2024 | Proprietary | 38.2% |
Showing 1 to 10 of 25 models