TruthfulQA
About
TruthfulQA is a benchmark that measures the truthfulness of language models' generated answers across 817 questions spanning 38 categories, including health, law, finance, and politics. The questions are specifically designed to elicit the false answers humans might give due to common misconceptions. Notably, larger models often generate more false answers on this benchmark, highlighting a critical challenge for AI truthfulness and factual accuracy.
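For readers who want to inspect the questions themselves, here is a minimal sketch of loading the benchmark from the Hugging Face Hub, assuming the commonly used `truthfulqa/truthful_qa` dataset ID; the `generation` configuration carries the 817 free-form questions, and a `multiple_choice` configuration also exists.

```python
# Minimal sketch: load the TruthfulQA questions (assumes the
# `datasets` library and the truthfulqa/truthful_qa Hub dataset).
from datasets import load_dataset

ds = load_dataset("truthfulqa/truthful_qa", "generation")["validation"]

print(len(ds))            # 817 questions
print(ds[0]["category"])  # e.g. "Misconceptions"
print(ds[0]["question"])
```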
Evaluation Stats
Total Models: 16
Organizations: 7
Verified Results: 0
Self-Reported: 16
Benchmark Details
Max Score: 1
Language: English (en)
Performance Overview
Score distribution and top performers
Score Distribution: 16 models
Top Score: 77.5%
Average Score: 58.7%
High Performers (80%+): 0

Top Organizations
#1 Microsoft: 3 models, 69.3%
#2 IBM: 3 models, 59.0%
#3 NVIDIA: 1 model, 58.6%
#4 Cohere: 1 model, 56.3%
#5 AI21 Labs: 2 models, 56.2%
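As a quick illustration of how summary figures like these are derived, here is a sketch computing the top score, average, and high-performer count. It uses only the ten scores visible in the leaderboard below; the remaining six are not shown on this page, so the computed average will differ from the 58.7% reported over all 16 models.

```python
# The ten scores visible in the leaderboard below (percent).
# Six more self-reported scores exist but are not shown here.
scores = [77.5, 66.9, 66.4, 64.0, 58.6, 58.4, 58.3, 58.1, 57.8, 56.3]

top = max(scores)
average = sum(scores) / len(scores)
high_performers = sum(1 for s in scores if s >= 80.0)

print(f"Top Score: {top:.1f}%")                      # 77.5%
print(f"Average (visible 10): {average:.1f}%")       # over these ten only
print(f"High Performers (80%+): {high_performers}")  # 0
```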
Leaderboard
16 models ranked by performance on TruthfulQA
Release Date | License | Score
---|---|---
Aug 23, 2024 | MIT | 77.5%
Apr 16, 2025 | Apache 2.0 | 66.9%
Feb 1, 2025 | MIT | 66.4%
Aug 23, 2024 | MIT | 64.0%
Oct 1, 2024 | Llama 3.1 Community License | 58.6%
Sep 19, 2024 | Apache 2.0 | 58.4%
Aug 22, 2024 | Jamba Open Model License | 58.3%
May 2, 2025 | Apache 2.0 | 58.1%
Sep 19, 2024 | Apache 2.0 | 57.8%
Aug 30, 2024 | CC BY-NC | 56.3%
Showing 1 to 10 of 16 models
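The page does not state which TruthfulQA metric these percentages report. A common choice on such leaderboards is the multiple-choice MC2 score: the fraction of probability mass a model assigns to the true reference answers for each question. A minimal sketch of that computation, using hypothetical log-likelihoods, follows; it is an illustration of the metric's shape, not this leaderboard's scoring pipeline.

```python
import math

def mc2_score(true_logprobs, false_logprobs):
    """MC2-style score for one question: normalized probability mass
    placed on the true reference answers. Inputs are the total
    log-likelihoods a model assigns to each candidate completion."""
    p_true = [math.exp(lp) for lp in true_logprobs]
    p_false = [math.exp(lp) for lp in false_logprobs]
    return sum(p_true) / (sum(p_true) + sum(p_false))

# Hypothetical log-likelihoods for one question's answer set.
print(mc2_score([-2.1, -3.4], [-2.8, -4.0, -5.2]))  # ~0.65
```

A benchmark-level score is then the mean of this per-question value over all 817 questions, reported as a percentage.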