TAU2-Bench Retail

Agents
About

TAU2-Bench Retail evaluates conversational AI agents on customer service tasks in a retail environment using a dual-control framework where both the agent and user hold tools, testing policy adherence, tool use, and task success across returns, exchanges, and order management scenarios.

Evaluation Stats
Total Models: 6
Organizations: 3
Verified Results: 0
Self-Reported: 6
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers

Score Distribution

6 models
Top Score: 91.9%
Average Score: 87.7%
High Performers (80%+): 6

Top Organizations

#1 Anthropic (4 models): 89.7%
#2 Google DeepMind (1 model): 85.3%
#3 OpenAI (1 model): 82.0%
Leaderboard
6 models ranked by performance on TAU2-Bench Retail
Date          License      Score
Feb 1, 2026   Proprietary  91.9%
Feb 17, 2026  Proprietary  91.7%
Nov 1, 2025   Proprietary  88.9%
Sep 29, 2025  Proprietary  86.2%
Nov 18, 2025  Proprietary  85.3%
Dec 11, 2025  Proprietary  82.0%
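
The summary figures in the Performance Overview can be cross-checked against the six leaderboard scores above. The following is a minimal sketch, not the site's actual method: the function names `summarize` and `one_decimal` are hypothetical, and it assumes the average is a plain arithmetic mean rounded to one decimal place and that "80%+" means a score of at least 80.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the six leaderboard rows, in percent.
SCORES = [Decimal(s) for s in ("91.9", "91.7", "88.9", "86.2", "85.3", "82.0")]


def one_decimal(value):
    """Round a Decimal to one decimal place (half-up; an assumption)."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def summarize(scores, threshold=Decimal("80")):
    """Recompute the overview statistics from a list of scores."""
    return {
        "total_models": len(scores),
        "top_score": max(scores),
        # Assumed to be an unweighted mean over all listed models.
        "average_score": one_decimal(sum(scores) / len(scores)),
        # Assumed to count scores greater than or equal to the threshold.
        "high_performers": sum(1 for s in scores if s >= threshold),
    }


summary = summarize(SCORES)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Decimal arithmetic is used here so that the one-decimal rounding is not affected by binary floating-point representation of values such as 91.9.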