TAU-bench Airline


About

TAU-bench Airline is the airline subset of the TAU-bench benchmark. It tests how well AI agents handle airline customer service scenarios using domain-specific APIs and policies. Tasks cover flight bookings, cancellations, customer inquiries, and airline-specific procedures, and they require the agent to use tools accurately while following industry regulations and guidelines.

Evaluation Stats

Total Models: 20

Organizations: 4

Verified Results: 0

Self-Reported: 20

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution: 20 models

Top Score: 70.0%

Average Score: 47.8%

High Performers (80%+): 0

Top Organizations

#1	Zhipu AI	2 models	60.6%
#2	Anthropic	7 models	53.3%
#3	Alibaba Cloud / Qwen Team	3 models	46.3%
#4	OpenAI	8 models	40.5%

Leaderboard

20 models ranked by performance on TAU-bench Airline

Rank	Model	Organization	Date	License	Score
#01	Claude Sonnet 4.5	Anthropic	Sep 29, 2025	Proprietary	70.0%
#02	GLM-4.5-Air	Zhipu AI	Jul 28, 2025	MIT	60.8%
#03	GLM-4.5	Zhipu AI	Jul 28, 2025	MIT	60.4%
#04	Claude Sonnet 4	Anthropic	May 22, 2025	Proprietary	60.0%
#05	Claude Opus 4	Anthropic	May 22, 2025	Proprietary	59.6%
#06	Claude 3.7 Sonnet	Anthropic	Feb 24, 2025	Proprietary	58.4%
#07	Claude Opus 4.1	Anthropic	Aug 5, 2025	Proprietary	56.0%
#08	GPT-4.5	OpenAI	Feb 27, 2025	Proprietary	50.0%
#09	o1	OpenAI	Dec 17, 2024	Proprietary	50.0%
#10	GPT-4.1	OpenAI	Apr 14, 2025	Proprietary	49.4%

Showing 1 to 10 of 20 models
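
The per-organization figures under Top Organizations appear to be the plain mean of each organization's model scores. That is an assumption, since the page does not state how they are computed. The sketch below groups the ten visible leaderboard rows by organization and averages them. The names `VISIBLE_ROWS` and `mean_score_by_org` are illustrative only and are not part of any TAU-bench tooling.

```python
from collections import defaultdict

# The ten leaderboard rows visible on this page: (model, organization, score in %).
# The remaining ten of the 20 models are not shown, so most groups are partial.
VISIBLE_ROWS = [
    ("Claude Sonnet 4.5", "Anthropic", 70.0),
    ("GLM-4.5-Air", "Zhipu AI", 60.8),
    ("GLM-4.5", "Zhipu AI", 60.4),
    ("Claude Sonnet 4", "Anthropic", 60.0),
    ("Claude Opus 4", "Anthropic", 59.6),
    ("Claude 3.7 Sonnet", "Anthropic", 58.4),
    ("Claude Opus 4.1", "Anthropic", 56.0),
    ("GPT-4.5", "OpenAI", 50.0),
    ("o1", "OpenAI", 50.0),
    ("GPT-4.1", "OpenAI", 49.4),
]


def mean_score_by_org(rows):
    """Group scores by organization and return {org: (model_count, mean_score)}."""
    scores = defaultdict(list)
    for _model, org, score in rows:
        scores[org].append(score)
    return {org: (len(vals), sum(vals) / len(vals)) for org, vals in scores.items()}


averages = mean_score_by_org(VISIBLE_ROWS)

# Print organizations from highest to lowest mean over the visible rows.
for org, (count, mean) in sorted(averages.items(), key=lambda kv: -kv[1][1]):
    print(f"{org}: {count} visible models, mean {mean:.1f}%")
```

Only Zhipu AI has both of its models in the visible rows, and its mean of 60.8% and 60.4% comes to 60.6%, matching the listed figure. The Anthropic and OpenAI means computed here cover only their top-ranked models, so they are higher than the listed 7-model and 8-model averages.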
