Terminal-Bench

About

Terminal-Bench is a comprehensive evaluation framework that measures AI agents' mastery of terminal and command-line tasks across diverse domains, including scientific workflows, network configuration, data analysis, and cybersecurity. The benchmark comprises 80 hand-crafted, human-verified tasks, each with its own dedicated Docker environment, and tests an agent's ability to complete complex terminal operations in realistic computing environments.
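To make "dedicated Docker environments" concrete, the sketch below shows one hypothetical way a single terminal task could be run in an isolated container and verified. It is not the actual Terminal-Bench harness: the base image, the agent command, and the verification check are all placeholder assumptions, and it requires a local Docker daemon plus the `docker` Python SDK.

```python
# Hypothetical sketch of a single terminal-task evaluation loop.
# Assumes Docker is running locally and the `docker` Python SDK is installed
# (`pip install docker`); the image, command, and check below are placeholders,
# not actual Terminal-Bench task definitions.
import docker

client = docker.from_env()

# Start a dedicated, throwaway container for one task.
container = client.containers.run(
    "ubuntu:22.04",            # placeholder base image
    command="sleep infinity",  # keep the container alive for exec calls
    detach=True,
)

try:
    # In a real run the agent would propose shell commands; here one is hard-coded.
    agent_command = "mkdir -p /data && echo done > /data/result.txt"
    container.exec_run(["bash", "-lc", agent_command])

    # Task verification: check that the expected artifact exists with the expected content.
    exit_code, output = container.exec_run(["bash", "-lc", "cat /data/result.txt"])
    passed = exit_code == 0 and output.decode().strip() == "done"
    print("task passed" if passed else "task failed")
finally:
    container.stop()
    container.remove()
```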

Evaluation Stats
Total Models: 13
Organizations: 4
Verified Results: 0
Self-Reported: 13
Benchmark Details
Max Score: 1
Language: English
Performance Overview
Score distribution and top performers

Score Distribution (13 models)
Top Score: 50.0%
Average Score: 33.9%
High Performers (80%+): 0

Top Organizations

#1 Anthropic (5 models): 40.6%
#2 Zhipu AI (3 models): 36.0%
#3 Moonshot AI (2 models): 27.5%
#4 DeepSeek (3 models): 24.9%
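
The organization rankings above appear to be simple aggregates, a model count and an average score, over each organization's self-reported entries. A minimal sketch of that kind of aggregation is shown below; the model names and scores are placeholders, not the actual leaderboard data.

```python
# Minimal sketch of the per-organization aggregation behind "Top Organizations".
# Model names and scores below are placeholders, not real leaderboard entries.
from collections import defaultdict
from statistics import mean

results = [
    ("model-a", "Anthropic", 50.0),
    ("model-b", "Anthropic", 43.3),
    ("model-c", "Zhipu AI", 40.5),
    ("model-d", "Moonshot AI", 30.0),
]

by_org: dict[str, list[float]] = defaultdict(list)
for _model, org, score in results:
    by_org[org].append(score)

# Rank organizations by their average model score, best first.
ranking = sorted(by_org.items(), key=lambda kv: mean(kv[1]), reverse=True)
for rank, (org, scores) in enumerate(ranking, start=1):
    print(f"#{rank} {org}: {len(scores)} models, {mean(scores):.1f}%")
```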
Leaderboard
13 models ranked by performance on Terminal-Bench

Rank | Date         | License     | Score
1    | Sep 29, 2025 | Proprietary | 50.0%
2    | Aug 5, 2025  | Proprietary | 43.3%
3    | Sep 30, 2025 | MIT         | 40.5%
4    | May 22, 2025 | Proprietary | 39.2%
5    | Sep 29, 2025 | MIT         | 37.7%
6    | Jul 28, 2025 | MIT         | 37.5%
7    | May 22, 2025 | Proprietary | 35.5%
8    | Feb 24, 2025 | Proprietary | 35.2%
9    | Jan 10, 2025 | MIT         | 31.3%
10   | Jul 11, 2025 | MIT         | 30.0%

Showing 1 to 10 of 13 models