Terminal-Bench

text

About

Terminal-Bench is a comprehensive evaluation framework measuring AI agents' mastery of terminal and command-line interface tasks across diverse domains including scientific workflows, network configuration, data analysis, and cybersecurity. Featuring 80 hand-crafted, human-verified tasks with dedicated Docker environments, this benchmark tests agents' ability to complete complex terminal-based operations in realistic computing environments.

Evaluation Stats

Total Models14

Organizations4

Verified Results0

Self-Reported14

Benchmark Details

Max Score1

Language

Performance Overview

Score distribution and top performers

Score Distribution

14 models

Top Score

50.0%

Average Score

34.4%

High Performers (80%+)

Top Organizations

#1Anthropic

6 models

40.7%

#2Zhipu AI

3 models

36.0%

#3Moonshot AI

2 models

27.5%

#4DeepSeek

3 models

24.9%

Leaderboard

14 models ranked by performance on Terminal-Bench

			License
#01Claude Sonnet 4.5	Anthropic	Sep 29, 2025	Proprietary	50.0%
#02Claude Opus 4.1	Anthropic	Aug 5, 2025	Proprietary	43.3%
#03Claude Haiku 4.5	Anthropic	Oct 15, 2025	Proprietary	41.0%
#04GLM-4.6	Zhipu AI	Sep 30, 2025	MIT	40.5%
#05Claude Opus 4	Anthropic	May 22, 2025	Proprietary	39.2%
#06DeepSeek-V3.2-Exp	DeepSeek	Sep 29, 2025	MIT	37.7%
#07GLM-4.5	Zhipu AI	Jul 28, 2025	MIT	37.5%
#08Claude Sonnet 4	Anthropic	May 22, 2025	Proprietary	35.5%
#09Claude 3.7 Sonnet	Anthropic	Feb 24, 2025	Proprietary	35.2%
#10DeepSeek-V3.1	DeepSeek	Jan 10, 2025	MIT	31.3%

Showing 1 to 10 of 14 models