Tau2 Retail

About

TAU2-retail is the retail component of the τ²-Bench framework, which evaluates conversational agents in e-commerce and retail customer service settings. The benchmark tests an agent's ability to handle product queries, order processing, customer support interactions, and retail-specific policies in a structured dual-control environment modeled on real-world retail customer service scenarios.

Evaluation Stats

Total Models: 10

Organizations: 4

Verified Results: 0

Self-Reported: 10

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution: 10 models

Top Score: 83.2%

Average Score: 71.7%

High Performers (80%+): 3 models

Top Organizations

Rank	Organization	Models	Average Score
#1	Anthropic	1	83.2%
#2	OpenAI	3	74.9%
#3	Moonshot AI	2	70.6%
#4	Alibaba Cloud / Qwen Team	4	67.1%

Leaderboard

10 models ranked by performance on Tau2 Retail

Rank	Model	Organization	Release Date	License	Score
#01	Claude Haiku 4.5	Anthropic	Oct 15, 2025	Proprietary	83.2%
#02	GPT-5	OpenAI	Aug 7, 2025	Proprietary	81.1%
#03	o3	OpenAI	Apr 16, 2025	Proprietary	80.2%
#04	Qwen3-235B-A22B-Thinking-2507	Alibaba Cloud / Qwen Team	Jul 25, 2025	Apache 2.0	71.9%
#05	Qwen3-235B-A22B-Instruct-2507	Alibaba Cloud / Qwen Team	Jul 22, 2025	Apache 2.0	71.3%
#06	Kimi K2 Instruct	Moonshot AI	Jul 11, 2025	MIT	70.6%
#07	Kimi K2-Instruct-0905	Moonshot AI	Sep 5, 2025	MIT	70.6%
#08	Qwen3-Next-80B-A3B-Thinking	Alibaba Cloud / Qwen Team	Sep 10, 2025	Apache 2.0	67.8%
#09	GPT-4o	OpenAI	Aug 6, 2024	Proprietary	63.4%
#10	Qwen3-Next-80B-A3B-Instruct	Alibaba Cloud / Qwen Team	Sep 10, 2025	Apache 2.0	57.3%
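
The summary figures on this page can be cross-checked against the leaderboard rows. The page does not say how its aggregates are computed, so the sketch below assumes simple unweighted means rounded to one decimal place. The function name `summarize` and the `threshold` parameter are illustrative, not part of τ²-Bench or of this site.

```python
from collections import defaultdict
from statistics import mean

# (model, organization, score in percent), copied from the leaderboard above.
ROWS = [
    ("Claude Haiku 4.5", "Anthropic", 83.2),
    ("GPT-5", "OpenAI", 81.1),
    ("o3", "OpenAI", 80.2),
    ("Qwen3-235B-A22B-Thinking-2507", "Alibaba Cloud / Qwen Team", 71.9),
    ("Qwen3-235B-A22B-Instruct-2507", "Alibaba Cloud / Qwen Team", 71.3),
    ("Kimi K2 Instruct", "Moonshot AI", 70.6),
    ("Kimi K2-Instruct-0905", "Moonshot AI", 70.6),
    ("Qwen3-Next-80B-A3B-Thinking", "Alibaba Cloud / Qwen Team", 67.8),
    ("GPT-4o", "OpenAI", 63.4),
    ("Qwen3-Next-80B-A3B-Instruct", "Alibaba Cloud / Qwen Team", 57.3),
]


def summarize(rows, threshold=80.0):
    """Recompute the page's summary statistics (assumed: unweighted means)."""
    scores = [score for _, _, score in rows]

    # Group scores by organization so each group can be averaged.
    by_org = defaultdict(list)
    for _, org, score in rows:
        by_org[org].append(score)
    org_average = {org: round(mean(vals), 1) for org, vals in by_org.items()}

    return {
        "top": max(scores),
        "average": round(mean(scores), 1),
        "high_performers": sum(score >= threshold for score in scores),
        # Organizations ordered from highest to lowest mean score.
        "org_average": dict(
            sorted(org_average.items(), key=lambda kv: kv[1], reverse=True)
        ),
    }


summary = summarize(ROWS)
for key, value in summary.items():
    print(key, value)
```

Under that assumption, the recomputed top score, overall average, and per-organization averages agree with the values shown in the Performance Overview and Top Organizations sections.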
