TAU2-Bench Retail

Agents

About

TAU2-Bench Retail evaluates conversational AI agents on customer service tasks in a retail environment. It uses a dual-control framework in which both the agent and the user hold tools, and it tests policy adherence, tool use, and task success across returns, exchanges, and order management scenarios.
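
To make the dual-control idea concrete, here is a minimal toy sketch in Python. It is not the benchmark's actual code or API: the `RetailEnv` class, both tool names, the confirmation rule, and the success check are hypothetical and chosen only to show two parties holding separate tools over one shared state.

```python
from dataclasses import dataclass, field


@dataclass
class RetailEnv:
    """Hypothetical shared state that both the user and the agent can change."""
    order_status: str = "delivered"
    return_confirmed_by_user: bool = False
    log: list = field(default_factory=list)

    def user_confirm_return(self) -> None:
        # User-side tool (assumed): the simulated user confirms the return.
        self.return_confirmed_by_user = True
        self.log.append("user:confirm_return")

    def agent_process_return(self) -> bool:
        # Agent-side tool (assumed). Illustrative policy rule: the agent
        # may only process a return after the user has confirmed it.
        if not self.return_confirmed_by_user:
            self.log.append("agent:process_return (policy violation)")
            return False
        self.order_status = "return_requested"
        self.log.append("agent:process_return")
        return True


def task_succeeded(env: RetailEnv) -> bool:
    # Success here means the goal state was reached with no policy violation.
    policy_ok = not any("violation" in entry for entry in env.log)
    return policy_ok and env.order_status == "return_requested"


# Compliant episode: the user confirms first, then the agent acts.
compliant = RetailEnv()
compliant.user_confirm_return()
compliant.agent_process_return()

# Violating episode: the agent acts before the user has confirmed.
violating = RetailEnv()
violating.agent_process_return()
violating.user_confirm_return()
violating.agent_process_return()

print(task_succeeded(compliant), task_succeeded(violating))
```

In this toy setup the second episode reaches the same final order status but still fails, which illustrates why policy adherence and task success are scored as separate concerns.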

Evaluation Stats

Total Models: 6

Organizations: 3

Verified Results: 0

Self-Reported: 6

Benchmark Details

Max Score: 100

Performance Overview

Score distribution and top performers

Score Distribution (6 models)

Top Score: 91.9%

Average Score: 87.7%

High Performers (80%+): 6 of 6 models

Top Organizations

#1 Anthropic: 4 models, 89.7% average

#2 Google DeepMind: 1 model, 85.3% average

#3 OpenAI: 1 model, 82.0% average

Leaderboard

6 models ranked by performance on TAU2-Bench Retail

Rank	Model	Organization	Release Date	License	Score
1	Claude Opus 4.6	Anthropic	Feb 1, 2026	Proprietary	91.9%
2	Claude Sonnet 4.6	Anthropic	Feb 17, 2026	Proprietary	91.7%
3	Claude Opus 4.5	Anthropic	Nov 1, 2025	Proprietary	88.9%
4	Claude Sonnet 4.5	Anthropic	Sep 29, 2025	Proprietary	86.2%
5	Gemini 3 Pro	Google DeepMind	Nov 18, 2025	Proprietary	85.3%
6	GPT-5.2	OpenAI	Dec 11, 2025	Proprietary	82.0%
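
The summary figures on this page can be recomputed from the six leaderboard rows. The sketch below assumes that the per-organization figures are plain means of each organization's model scores and that values are rounded half up to one decimal place. The `RESULTS` list and all function names are illustrative; `Decimal` is used so that rounding does not depend on binary floating-point representation.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# (model, organization, score in percent), copied from the leaderboard rows.
RESULTS = [
    ("Claude Opus 4.6", "Anthropic", Decimal("91.9")),
    ("Claude Sonnet 4.6", "Anthropic", Decimal("91.7")),
    ("Claude Opus 4.5", "Anthropic", Decimal("88.9")),
    ("Claude Sonnet 4.5", "Anthropic", Decimal("86.2")),
    ("Gemini 3 Pro", "Google DeepMind", Decimal("85.3")),
    ("GPT-5.2", "OpenAI", Decimal("82.0")),
]


def round1(value: Decimal) -> Decimal:
    """Round to one decimal place, half up (assumed rounding rule)."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def mean(values: list) -> Decimal:
    return sum(values) / len(values)


def summarize(results, threshold=Decimal("80")):
    scores = [score for _, _, score in results]
    by_org = defaultdict(list)
    for _, org, score in results:
        by_org[org].append(score)
    return {
        "top_score": max(scores),
        "average_score": round1(mean(scores)),
        # Number of models at or above the high-performer threshold.
        "high_performers": sum(1 for s in scores if s >= threshold),
        "org_averages": {org: round1(mean(v)) for org, v in by_org.items()},
    }


SUMMARY = summarize(RESULTS)
print(SUMMARY)
```

Under these assumptions the script reproduces the 91.9% top score, the 87.7% average, and the organization averages of 89.7%, 85.3%, and 82.0%, with all six models at or above the 80% mark.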
