ZebraLogic


About

ZebraLogic is a logical reasoning benchmark of 1,000 logic grid puzzles (Zebra puzzles) ranging in size from 2x2 to 6x6. It tests the ability of large language models to solve constraint satisfaction problems: the model must deduce the unique assignment of values that satisfies a set of logical clues. Results are reported as both puzzle-level and cell-wise accuracy, across easy and hard puzzles.
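The difference between the two accuracy measures can be shown with a minimal sketch. It assumes a 2x2 puzzle means two houses with two attributes each, that cell-wise accuracy is the fraction of grid cells filled in correctly, and that a puzzle counts as solved only when every cell is correct. The function names and the grid representation are hypothetical, chosen for illustration; the benchmark's own scoring code may differ.

```python
from typing import Dict, Tuple

# A grid maps (house, attribute) to the assigned value.
# This representation is an assumption made for the example.
Grid = Dict[Tuple[str, str], str]


def cell_accuracy(predicted: Grid, truth: Grid) -> float:
    """Fraction of cells in the reference grid that the prediction matches."""
    correct = sum(1 for cell, value in truth.items() if predicted.get(cell) == value)
    return correct / len(truth)


def puzzle_solved(predicted: Grid, truth: Grid) -> bool:
    """True only if every cell in the reference grid is predicted correctly."""
    return all(predicted.get(cell) == value for cell, value in truth.items())


# Reference solution for a toy 2x2 puzzle (two houses, two attributes).
truth: Grid = {
    ("House 1", "Name"): "Alice",
    ("House 1", "Pet"): "Cat",
    ("House 2", "Name"): "Bob",
    ("House 2", "Pet"): "Dog",
}

# A prediction that gets the names right but swaps the pets.
predicted: Grid = {
    ("House 1", "Name"): "Alice",
    ("House 1", "Pet"): "Dog",
    ("House 2", "Name"): "Bob",
    ("House 2", "Pet"): "Cat",
}

print(cell_accuracy(predicted, truth))   # 0.5
print(puzzle_solved(predicted, truth))   # False
```

Under these assumptions a model can earn partial credit on cell-wise accuracy while still failing the puzzle, which is why the two numbers are reported separately.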

Evaluation Stats

Total Models: 3

Organizations: 2

Verified Results: 0

Self-Reported: 3

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

3 models

Top Score

95.0%

Average Score

91.0%

High Performers (80%+)

Top Organizations

#1 Alibaba Cloud / Qwen Team

1 model

95.0%

#2 Moonshot AI

2 models

89.0%

Leaderboard

3 models ranked by performance on ZebraLogic

Rank	Model	Organization	Release Date	License	Score
#01	Qwen3-235B-A22B-Instruct-2507	Alibaba Cloud / Qwen Team	Jul 22, 2025	Apache 2.0	95.0%
#02	Kimi K2 Instruct	Moonshot AI	Jul 11, 2025	MIT	89.0%
#03	Kimi K2-Instruct-0905	Moonshot AI	Sep 5, 2025	MIT	89.0%
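The summary figures on this page can be reproduced from the three leaderboard rows. The sketch below assumes the average score is a plain mean over all listed models and that an organization's score is the mean of its models' scores. The page does not state how it aggregates per organization, and with both Moonshot AI models at 89.0% the mean and the best score coincide.

```python
from collections import defaultdict

# (model, organization, score in percent), copied from the leaderboard above.
rows = [
    ("Qwen3-235B-A22B-Instruct-2507", "Alibaba Cloud / Qwen Team", 95.0),
    ("Kimi K2 Instruct", "Moonshot AI", 89.0),
    ("Kimi K2-Instruct-0905", "Moonshot AI", 89.0),
]

scores = [score for _, _, score in rows]

top_score = max(scores)
average_score = round(sum(scores) / len(scores), 1)
high_performers = sum(1 for score in scores if score >= 80.0)

# Group scores by organization, then average each group (assumed aggregation).
by_org = defaultdict(list)
for _, org, score in rows:
    by_org[org].append(score)
org_average = {org: round(sum(vals) / len(vals), 1) for org, vals in by_org.items()}

print(top_score)        # 95.0
print(average_score)    # 91.0
print(high_performers)  # 3
print(org_average)
```

The computed top score (95.0%) and average (91.0%) match the values shown in the Performance Overview.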

Resources

Research Paper