COLLIE

text

About

COLLIE is a framework for systematically evaluating constrained text generation in large language models. The benchmark tests whether a model can generate text that satisfies compositional constraints spanning diverse generation levels (word, sentence, paragraph, passage) and modeling challenges. Its pipeline covers four stages: constraint structure specification, example extraction, instruction rendering, and evaluation of model output against the specified constraints.
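To make the idea of a compositional constraint concrete, here is a minimal Python sketch. It is not COLLIE's actual API: the helper names (`word_count_equals`, `last_word_is`, `all_of`, `satisfaction_rate`) are hypothetical, and scoring by the fraction of outputs that satisfy the constraint is an assumption made for illustration.

```python
from typing import Callable, List

# A constraint is modeled here as a predicate over generated text
# (an assumption for this sketch, not the benchmark's own representation).
Constraint = Callable[[str], bool]


def word_count_equals(n: int) -> Constraint:
    """Constraint: the text contains exactly n whitespace-separated words."""
    return lambda text: len(text.split()) == n


def last_word_is(word: str) -> Constraint:
    """Constraint: the final word, ignoring trailing punctuation and case, is `word`."""
    def check(text: str) -> bool:
        words = text.split()
        return bool(words) and words[-1].strip(".,!?;:").lower() == word.lower()
    return check


def all_of(*constraints: Constraint) -> Constraint:
    """Compose constraints: the text must satisfy every one of them."""
    return lambda text: all(c(text) for c in constraints)


def satisfaction_rate(outputs: List[str], constraint: Constraint) -> float:
    """Fraction of outputs that satisfy the constraint (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(constraint(o) for o in outputs) / len(outputs)


# Hypothetical composed constraint: a five-word sentence ending in "dog".
constraint = all_of(word_count_equals(5), last_word_is("dog"))
outputs = ["I walked my old dog.", "The dog barked."]
rate = satisfaction_rate(outputs, constraint)
print(f"{rate:.0%}")
```

Modeling each constraint as a predicate keeps composition simple: a conjunction of checks is itself a check, so larger constraint structures can be built from small ones.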

Evaluation Stats

Total Models: 8

Organizations: 1

Verified Results: 0

Self-Reported: 8

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

8 models

Top Score

99.0%

Average Score

74.0%

High Performers (80%+)

3 models

Top Organizations

#1 OpenAI

8 models

74.0%

Leaderboard

8 models ranked by performance on COLLIE

Rank	Model	Organization	Release Date	License	Score
#01	GPT-5	OpenAI	Aug 7, 2025	Proprietary	99.0%
#02	o3-mini	OpenAI	Jan 30, 2025	Proprietary	98.7%
#03	o3	OpenAI	Apr 16, 2025	Proprietary	98.4%
#04	GPT-4.5	OpenAI	Feb 27, 2025	Proprietary	72.3%
#05	GPT-4.1	OpenAI	Apr 14, 2025	Proprietary	65.8%
#06	GPT-4o	OpenAI	Aug 6, 2024	Proprietary	61.0%
#07	GPT-4.1 mini	OpenAI	Apr 14, 2025	Proprietary	54.6%
#08	GPT-4.1 nano	OpenAI	Apr 14, 2025	Proprietary	42.5%
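
The summary figures reported earlier on this page (top score, average score, high performers) can be reproduced from the leaderboard rows. The Python sketch below is only an illustration: the scores are copied from the table above, and the variable names are hypothetical.

```python
# Self-reported COLLIE scores (percent), copied from the leaderboard above.
scores = {
    "GPT-5": 99.0,
    "o3-mini": 98.7,
    "o3": 98.4,
    "GPT-4.5": 72.3,
    "GPT-4.1": 65.8,
    "GPT-4o": 61.0,
    "GPT-4.1 mini": 54.6,
    "GPT-4.1 nano": 42.5,
}

# Top performer and its score.
top_model = max(scores, key=scores.get)
top_score = scores[top_model]

# Unweighted mean over all listed models.
average = sum(scores.values()) / len(scores)

# Models at or above the 80% "high performer" threshold.
high_performers = [name for name, s in scores.items() if s >= 80.0]

print(f"Top: {top_model} ({top_score:.1f}%)")
print(f"Average: {average:.1f}%")
print(f"High performers (80%+): {len(high_performers)}")
```

The average is an unweighted mean over the eight listed models, which matches the 74.0% shown in the overview.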

Resources

Research Paper