TruthfulQA
About
TruthfulQA is a benchmark that measures the truthfulness of language models' generated answers across 817 questions spanning 38 categories, including health, law, finance, and politics. The questions are specifically designed to elicit the false answers humans might give due to common misconceptions. Notably, larger models often generate more false answers on this benchmark, highlighting a critical challenge for AI truthfulness and factual accuracy.
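For readers who want to inspect the questions themselves, here is a minimal sketch of loading the benchmark from the Hugging Face Hub, assuming the commonly used `truthfulqa/truthful_qa` dataset ID; the `generation` configuration carries the 817 free-form questions, and a `multiple_choice` configuration also exists.

```python
# Minimal sketch: load the TruthfulQA questions (assumes the
# `datasets` library and the truthfulqa/truthful_qa Hub dataset).
from datasets import load_dataset

ds = load_dataset("truthfulqa/truthful_qa", "generation")["validation"]

print(len(ds))            # 817 questions
print(ds[0]["category"])  # e.g. "Misconceptions"
print(ds[0]["question"])
```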
Evaluation Stats
Total Models: 16
Organizations: 7
Verified Results: 0
Self-Reported: 16
Benchmark Details
Max Score: 1
Language: English (en)
Performance Overview
Score distribution and top performers
Score Distribution: 16 models
Top Score: 77.5%
Average Score: 58.7%
High Performers (80%+): 0

Top Organizations
#1 Microsoft: 3 models, 69.3%
#2 IBM: 3 models, 59.0%
#3 NVIDIA: 1 model, 58.6%
#4 Cohere: 1 model, 56.3%
#5 AI21 Labs: 2 models, 56.2%
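As a quick illustration of how summary figures like these are derived, here is a sketch computing the top score, average, and high-performer count. It uses only the ten scores visible in the leaderboard below; the remaining six are not shown on this page, so the computed average will differ from the 58.7% reported over all 16 models.

```python
# The ten scores visible in the leaderboard below (percent).
# Six more self-reported scores exist but are not shown here.
scores = [77.5, 66.9, 66.4, 64.0, 58.6, 58.4, 58.3, 58.1, 57.8, 56.3]

top = max(scores)
average = sum(scores) / len(scores)
high_performers = sum(1 for s in scores if s >= 80.0)

print(f"Top Score: {top:.1f}%")                      # 77.5%
print(f"Average (visible 10): {average:.1f}%")       # over these ten only
print(f"High Performers (80%+): {high_performers}")  # 0
```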
Leaderboard
16 models ranked by performance on TruthfulQA
Release Date | License | Score
---|---|---
Aug 23, 2024 | MIT | 77.5%
Apr 16, 2025 | Apache 2.0 | 66.9%
Feb 1, 2025 | MIT | 66.4%
Aug 23, 2024 | MIT | 64.0%
Oct 1, 2024 | Llama 3.1 Community License | 58.6%
Sep 19, 2024 | Apache 2.0 | 58.4%
Aug 22, 2024 | Jamba Open Model License | 58.3%
May 2, 2025 | Apache 2.0 | 58.1%
Sep 19, 2024 | Apache 2.0 | 57.8%
Aug 30, 2024 | CC BY-NC | 56.3%
Showing 1 to 10 of 16 models
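The page does not state which TruthfulQA metric these percentages report. A common choice on such leaderboards is the multiple-choice MC2 score: the fraction of probability mass a model assigns to the true reference answers for each question. A minimal sketch of that computation, using hypothetical log-likelihoods, follows; it is an illustration of the metric's shape, not this leaderboard's scoring pipeline.

```python
import math

def mc2_score(true_logprobs, false_logprobs):
    """MC2-style score for one question: normalized probability mass
    placed on the true reference answers. Inputs are the total
    log-likelihoods a model assigns to each candidate completion."""
    p_true = [math.exp(lp) for lp in true_logprobs]
    p_false = [math.exp(lp) for lp in false_logprobs]
    return sum(p_true) / (sum(p_true) + sum(p_false))

# Hypothetical log-likelihoods for one question's answer set.
print(mc2_score([-2.1, -3.4], [-2.8, -4.0, -5.2]))  # ~0.65
```

A benchmark-level score is then the mean of this per-question value over all 817 questions, reported as a percentage.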