Tau2 Airline

About

Tau2 Airline is the airline domain of the τ²-Bench evaluation framework. It tests conversational agents in airline customer service scenarios within a dual-control environment, where both the agent and the user can act on the shared environment. The benchmark assesses an agent's ability to handle flight-related tasks and customer support requests while staying consistent and accurate across multiple conversation turns.

Evaluation Stats

Total Models: 10

Organizations: 4

Verified Results: 0

Self-Reported: 10

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

10 models

Top Score

64.8%

Average Score

55.8%

High Performers (80%+)

Top Organizations

#1 Anthropic

1 model

63.6%

#2 OpenAI

3 models

57.6%

#3 Moonshot AI

2 models

56.5%

#4 Alibaba Cloud / Qwen Team

4 models

52.0%

Leaderboard

10 models ranked by performance on Tau2 Airline

Rank	Model	Organization	Release Date	License	Score
#01	o3	OpenAI	Apr 16, 2025	Proprietary	64.8%
#02	Claude Haiku 4.5	Anthropic	Oct 15, 2025	Proprietary	63.6%
#03	GPT-5	OpenAI	Aug 7, 2025	Proprietary	62.6%
#04	Qwen3-Next-80B-A3B-Thinking	Alibaba Cloud / Qwen Team	Sep 10, 2025	Apache 2.0	60.5%
#05	Qwen3-235B-A22B-Thinking-2507	Alibaba Cloud / Qwen Team	Jul 25, 2025	Apache 2.0	58.0%
#06	Kimi K2 Instruct	Moonshot AI	Jul 11, 2025	MIT	56.5%
#07	Kimi K2-Instruct-0905	Moonshot AI	Sep 5, 2025	MIT	56.5%
#08	Qwen3-Next-80B-A3B-Instruct	Alibaba Cloud / Qwen Team	Sep 10, 2025	Apache 2.0	45.5%
#09	GPT-4o	OpenAI	Aug 6, 2024	Proprietary	45.5%
#10	Qwen3-235B-A22B-Instruct-2507	Alibaba Cloud / Qwen Team	Jul 22, 2025	Apache 2.0	44.0%
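
The page does not say how its summary figures are derived. As a minimal sketch, assuming they are plain unweighted means over the leaderboard rows, the Python below recomputes the top score, the average score, the count of models at 80% or above, and the per-organization averages. The names `ROWS`, `mean`, and `summarize` are hypothetical helpers introduced for illustration, not part of τ²-Bench.

```python
from collections import defaultdict

# (model, organization, score in percent), copied from the leaderboard rows.
ROWS = [
    ("o3", "OpenAI", 64.8),
    ("Claude Haiku 4.5", "Anthropic", 63.6),
    ("GPT-5", "OpenAI", 62.6),
    ("Qwen3-Next-80B-A3B-Thinking", "Alibaba Cloud / Qwen Team", 60.5),
    ("Qwen3-235B-A22B-Thinking-2507", "Alibaba Cloud / Qwen Team", 58.0),
    ("Kimi K2 Instruct", "Moonshot AI", 56.5),
    ("Kimi K2-Instruct-0905", "Moonshot AI", 56.5),
    ("Qwen3-Next-80B-A3B-Instruct", "Alibaba Cloud / Qwen Team", 45.5),
    ("GPT-4o", "OpenAI", 45.5),
    ("Qwen3-235B-A22B-Instruct-2507", "Alibaba Cloud / Qwen Team", 44.0),
]


def mean(values):
    """Unweighted arithmetic mean (assumed aggregation, not confirmed by the page)."""
    return sum(values) / len(values)


def summarize(rows, high_threshold=80.0):
    """Recompute the summary statistics from (model, organization, score) rows."""
    scores = [score for _, _, score in rows]

    # Group scores by organization.
    by_org = defaultdict(list)
    for _, org, score in rows:
        by_org[org].append(score)

    # Rank organizations by their mean score, highest first.
    organizations = sorted(
        ((org, mean(vals), len(vals)) for org, vals in by_org.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    return {
        "top": max(scores),
        "average": mean(scores),
        "high_performers": sum(score >= high_threshold for score in scores),
        "organizations": organizations,
    }


summary = summarize(ROWS)

if __name__ == "__main__":
    print(f"Top score: {summary['top']:.1f}%")
    print(f"Average score: {summary['average']:.2f}%")
    print(f"Models at 80%+: {summary['high_performers']}")
    for rank, (org, avg, count) in enumerate(summary["organizations"], start=1):
        print(f"#{rank} {org}: {count} model(s), mean {avg:.2f}%")
```

Under this assumption, the recomputed means agree with the displayed Average Score and Top Organizations figures to within rounding, and the organization order matches the ranking shown on the page.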

Resources

Research Paper