TAU-bench Retail

About

TAU-bench Retail is the retail subset of the TAU-bench benchmark. It tests how AI agents perform in e-commerce and retail customer service scenarios that come with specialized APIs and business policies. Models must handle product inquiries, order management, returns, and other customer support workflows while using tools accurately and following the applicable retail policies.

Evaluation Stats

Total Models: 22

Organizations: 4

Verified Results: 0

Self-Reported: 22

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

22 models

Top Score

86.2%

Average Score

67.5%

High Performers (80%+): 5

Top Organizations

Rank	Organization	Models	Average Score
#1	Zhipu AI	2	78.8%
#2	Anthropic	7	76.0%
#3	Alibaba Cloud / Qwen Team	3	66.1%
#4	OpenAI	10	59.8%

Leaderboard

22 models ranked by performance on TAU-bench Retail

Rank	Model	Organization	Release Date	License	Score
#01	Claude Sonnet 4.5	Anthropic	Sep 29, 2025	Proprietary	86.2%
#02	Claude Opus 4.1	Anthropic	Aug 5, 2025	Proprietary	82.4%
#03	Claude Opus 4	Anthropic	May 22, 2025	Proprietary	81.4%
#04	Claude 3.7 Sonnet	Anthropic	Feb 24, 2025	Proprietary	81.2%
#05	Claude Sonnet 4	Anthropic	May 22, 2025	Proprietary	80.5%
#06	GLM-4.5	Zhipu AI	Jul 28, 2025	MIT	79.7%
#07	GLM-4.5-Air	Zhipu AI	Jul 28, 2025	MIT	77.9%
#08	o4-mini	OpenAI	Apr 16, 2025	Proprietary	71.8%
#09	o1	OpenAI	Dec 17, 2024	Proprietary	70.8%
#10	Qwen3-Next-80B-A3B-Thinking	Alibaba Cloud / Qwen Team	Sep 10, 2025	Apache 2.0	69.6%

Showing 1 to 10 of 22 models
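
The per-organization figures under Top Organizations look like plain averages of each organization's model scores. This is an assumption, since the page does not state the formula. The minimal sketch below recomputes them from the ten visible leaderboard rows. The names `ROWS`, `org_summary`, and `high_performers` are hypothetical. Only Zhipu AI has all of its models among the visible rows, so it is the only organization whose published average can be checked this way. The other organizations have models beyond row 10, and their visible-row averages will differ from the published ones.

```python
from collections import defaultdict
from statistics import mean

# The ten visible leaderboard rows: (model, organization, score in percent).
ROWS = [
    ("Claude Sonnet 4.5", "Anthropic", 86.2),
    ("Claude Opus 4.1", "Anthropic", 82.4),
    ("Claude Opus 4", "Anthropic", 81.4),
    ("Claude 3.7 Sonnet", "Anthropic", 81.2),
    ("Claude Sonnet 4", "Anthropic", 80.5),
    ("GLM-4.5", "Zhipu AI", 79.7),
    ("GLM-4.5-Air", "Zhipu AI", 77.9),
    ("o4-mini", "OpenAI", 71.8),
    ("o1", "OpenAI", 70.8),
    ("Qwen3-Next-80B-A3B-Thinking", "Alibaba Cloud / Qwen Team", 69.6),
]


def org_summary(rows):
    """Return {organization: (model_count, mean_score)} for the given rows.

    Assumes an unweighted arithmetic mean, rounded to one decimal place.
    """
    by_org = defaultdict(list)
    for _model, org, score in rows:
        by_org[org].append(score)
    return {org: (len(scores), round(mean(scores), 1))
            for org, scores in by_org.items()}


def high_performers(rows, threshold=80.0):
    """Return the names of models scoring at or above the threshold."""
    return [model for model, _org, score in rows if score >= threshold]


summary = org_summary(ROWS)
print(summary["Zhipu AI"])          # (2, 78.8), matching the published figure
print(len(high_performers(ROWS)))   # 5
```

Because the leaderboard is sorted by score and row 6 is already below 80%, the count of models at 80% or above is complete even though only 10 of the 22 rows are visible.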
