τ-bench

About

τ-bench evaluates LLM agents on realistic customer service tasks (retail, airline, telecom), measuring sustained performance across multiple turns with Pass^k metrics.
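Pass^k measures reliability rather than one-shot ability: it is the probability that an agent succeeds on all k independent trials of the same task, averaged over tasks. A minimal sketch of the standard unbiased estimator (given c observed successes out of n trials per task, pass^k is estimated as C(c, k) / C(n, k)); the function names here are illustrative, not from the benchmark code:

```python
from math import comb

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Unbiased estimate of the probability that k independent
    trials of one task ALL succeed, given num_successes out of
    num_trials observed successes."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    return comb(num_successes, k) / comb(num_trials, k)

def benchmark_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass^k across tasks; each entry is (trials, successes)."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)

# Example: three tasks, 4 trials each, with 4, 3, and 2 successes.
tasks = [(4, 4), (4, 3), (4, 2)]
print(round(benchmark_pass_hat_k(tasks, 2), 3))  # → 0.556
```

Because all k trials must succeed, Pass^k decreases as k grows, which is why the Pass^2 through Pass^4 columns below are monotonically non-increasing for each model.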

Evaluation Stats
Total Models: 10
Organizations: 6
Verified Results: 0
Self-Reported: 0
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers

Score Distribution (10 models)

Top Score: 85.4%
Average Score: 69.3%
High Performers (80%+): 4

Top Organizations

1. Google DeepMind (1 model): 85.4%
2. DeepSeek (1 model): 80.4%
3. Anthropic (2 models): 73.3%
4. Alibaba / Qwen (1 model): 72.0%
5. Moonshot AI (1 model): 64.3%
Leaderboard
10 models ranked by performance on τ-bench

| Rank | Model | Date | License | Score |
|------|-------|------|---------|-------|
| 1 | Gemini 3 Pro | Nov 18, 2025 | Proprietary | 85.4% |
| 2 | Claude Sonnet 4.5 | Sep 29, 2025 | Proprietary | 84.7% |
| 3 | DeepSeek-V3.2 | Dec 1, 2025 | MIT | 80.4% |
| 4 | GPT-5 | Aug 7, 2025 | Proprietary | 80.0% |
| 5 | Qwen3-Max | Sep 5, 2025 | Proprietary | 72.0% |
| 6 | Kimi K2 | Jul 11, 2025 | Modified MIT | 64.3% |
| 7 | Claude 3.7 Sonnet | Feb 24, 2025 | Proprietary | 61.8% |
| 8 | o4 mini | Apr 16, 2025 | Proprietary | 56.9% |
| 9 | GPT-4.1 | Apr 14, 2025 | Proprietary | 54.7% |
| 10 | GPT-4.1 mini | Apr 14, 2025 | Proprietary | 53.0% |
Additional Metrics
Extended metrics for top models on τ-bench
| Model | Score | Organization | Pass^2 | Pass^3 | Pass^4 | User Simulator |
|-------|-------|--------------|--------|--------|--------|----------------|
| Gemini 3 Pro | 85.4 | Sierra | 0% | 0% | 0% | Gemini-3.0-Pro |
| Claude Sonnet 4.5 | 84.7 | Anthropic | 0% | 0% | 0% | |
| DeepSeek-V3.2 | 80.4 | DeepSeek | 0% | 0% | 0% | DeepSeek-V3.2 |
| GPT-5 | 80.0 | Sierra | 73% | 68% | 64% | gpt-4.1-2025-04-14 |
| Qwen3-Max | 72.0 | Qwen | 66.7% | 0% | 54.8% | |
| Kimi K2 | 64.3 | Moonshot AI | 0% | 0% | 0% | |
| Claude 3.7 Sonnet | 61.8 | Sierra | 56.5% | 52.9% | 49.7% | gpt-4.1-2025-04-14 |
| o4 mini | 56.9 | Sierra | 48.3% | 42.6% | 38% | gpt-4.1-2025-04-14 |
| GPT-4.1 | 54.7 | Sierra | 46.5% | 41.4% | 36.9% | gpt-4.1-2025-04-14 |
| GPT-4.1 mini | 53.0 | Sierra | 42.5% | 35.2% | 30.3% | gpt-4.1-2025-04-14 |
Resources