τ-bench
Agents
About
τ-bench evaluates LLM agents on realistic customer service tasks (retail, airline, telecom), measuring sustained performance across multiple turns with Pass^k metrics.
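Pass^k measures the probability that an agent solves the same task on all k of k independent trials, so it rewards consistency rather than one lucky run. Given c successes observed out of n trials per task, the standard unbiased per-task estimate is C(c, k) / C(n, k), averaged over tasks. A minimal sketch of that estimator (function name and example data are illustrative):

```python
from math import comb

def pass_hat_k(successes_per_task, n, k):
    """Estimate Pass^k: the chance that k i.i.d. trials of a task all
    succeed, averaged over tasks. With c successes out of n trials,
    the unbiased per-task estimate is C(c, k) / C(n, k)."""
    return sum(comb(c, k) / comb(n, k) for c in successes_per_task) / len(successes_per_task)

# Example: 4 trials per task; one task solved every time, one solved twice.
# Per-task estimates are 1.0 and C(2,2)/C(4,2) = 1/6, averaging to ~0.583.
print(round(pass_hat_k([4, 2], n=4, k=2), 3))  # → 0.583
```

Note that Pass^k is monotonically non-increasing in k: requiring more consecutive successes can only lower the score.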
Evaluation Stats
Total Models: 10
Organizations: 6
Verified Results: 0
Self-Reported: 0
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers
Score Distribution
Models: 10
Top Score: 85.4%
Average Score: 69.3%
High Performers (80%+): 4

Top Organizations
| Rank | Organization | Models | Avg. Score |
|---|---|---|---|
| 1 | Google DeepMind | 1 | 85.4% |
| 2 | DeepSeek | 1 | 80.4% |
| 3 | Anthropic | 2 | 73.3% |
| 4 | Alibaba / Qwen | 1 | 72.0% |
| 5 | Moonshot AI | 1 | 64.3% |
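The summary figures above follow directly from the ten leaderboard scores; a quick arithmetic check:

```python
# The ten τ-bench scores from the leaderboard below.
scores = [85.4, 84.7, 80.4, 80.0, 72.0, 64.3, 61.8, 56.9, 54.7, 53.0]

top = max(scores)                               # 85.4
average = round(sum(scores) / len(scores), 1)   # 693.2 / 10 → 69.3
high_performers = sum(s >= 80 for s in scores)  # four models at 80% or above

print(top, average, high_performers)  # → 85.4 69.3 4
```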
Leaderboard
10 models ranked by performance on τ-bench
| Rank | Model | Date | License | Score |
|---|---|---|---|---|
| 1 | Gemini 3 Pro | Nov 18, 2025 | Proprietary | 85.4% |
| 2 | Claude Sonnet 4.5 | Sep 29, 2025 | Proprietary | 84.7% |
| 3 | DeepSeek-V3.2 | Dec 1, 2025 | MIT | 80.4% |
| 4 | GPT-5 | Aug 7, 2025 | Proprietary | 80.0% |
| 5 | Qwen3-Max | Sep 5, 2025 | Proprietary | 72.0% |
| 6 | Kimi K2 | Jul 11, 2025 | Modified MIT | 64.3% |
| 7 | Claude 3.7 Sonnet | Feb 24, 2025 | Proprietary | 61.8% |
| 8 | o4-mini | Apr 16, 2025 | Proprietary | 56.9% |
| 9 | GPT-4.1 | Apr 14, 2025 | Proprietary | 54.7% |
| 10 | GPT-4.1 mini | Apr 14, 2025 | Proprietary | 53.0% |
Additional Metrics
Extended metrics for top models on τ-bench
| Model | Score | Source | Pass^2 | Pass^3 | Pass^4 | User Simulator |
|---|---|---|---|---|---|---|
| Gemini 3 Pro | 85.4 | Sierra | n/a | n/a | n/a | Gemini-3.0-Pro |
| Claude Sonnet 4.5 | 84.7 | Anthropic | n/a | n/a | n/a | |
| DeepSeek-V3.2 | 80.4 | DeepSeek | n/a | n/a | n/a | DeepSeek-V3.2 |
| GPT-5 | 80.0 | Sierra | 73% | 68% | 64% | gpt-4.1-2025-04-14 |
| Qwen3-Max | 72.0 | Qwen | 66.7% | n/a | 54.8% | |
| Kimi K2 | 64.3 | Moonshot AI | n/a | n/a | n/a | |
| Claude 3.7 Sonnet | 61.8 | Sierra | 56.5% | 52.9% | 49.7% | gpt-4.1-2025-04-14 |
| o4-mini | 56.9 | Sierra | 48.3% | 42.6% | 38% | gpt-4.1-2025-04-14 |
| GPT-4.1 | 54.7 | Sierra | 46.5% | 41.4% | 36.9% | gpt-4.1-2025-04-14 |
| GPT-4.1 mini | 53.0 | Sierra | 42.5% | 35.2% | 30.3% | gpt-4.1-2025-04-14 |
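If trial outcomes were independent across runs, Pass^k would decay geometrically as (Pass^1)^k. GPT-5's reported values decay noticeably more slowly than that, which suggests its successes concentrate on a consistent subset of tasks rather than being scattered randomly. A quick comparison (observed figures taken from the table above):

```python
p1 = 0.80  # GPT-5's Pass^1 (its τ-bench score) from the leaderboard
observed = {2: 0.73, 3: 0.68, 4: 0.64}  # GPT-5's Pass^k from the table above

for k, obs in observed.items():
    iid = p1 ** k  # what Pass^k would be if trials were independent
    print(f"k={k}: observed {obs:.2f} vs i.i.d. baseline {iid:.3f}")
# Observed Pass^k exceeds the i.i.d. baseline at every k
# (0.73 > 0.640, 0.68 > 0.512, 0.64 > 0.410).
```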