Terminal-Bench
About
Terminal-Bench is an evaluation framework that measures AI agents' proficiency with terminal and command-line tasks across domains including scientific workflows, network configuration, data analysis, and cybersecurity. It comprises 80 hand-crafted, human-verified tasks, each with a dedicated Docker environment, and tests agents' ability to complete complex terminal operations in realistic computing environments.
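The per-task isolation described above can be illustrated with a small, hypothetical harness: each task gets its own container, the agent's shell commands run inside it, and a task-specific verification command decides pass/fail. This is a minimal sketch of that pattern, not the actual Terminal-Bench tooling; the image name, commands, and check below are placeholders.

```python
# Hypothetical per-task evaluation sketch (placeholder names, not the real harness):
# start a fresh container, run the agent's commands in it, then run a check command.
import subprocess

def run_task(image: str, agent_commands: list[str], check_cmd: str) -> bool:
    """Run agent commands in a fresh container for one task, then verify."""
    container = subprocess.run(
        ["docker", "run", "-d", image, "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        for cmd in agent_commands:
            # Execute each agent-proposed shell command inside the container.
            subprocess.run(["docker", "exec", container, "sh", "-c", cmd], check=False)
        # The task's verification command exits 0 on success.
        result = subprocess.run(["docker", "exec", container, "sh", "-c", check_cmd])
        return result.returncode == 0
    finally:
        # Always tear down the container so tasks stay isolated.
        subprocess.run(["docker", "rm", "-f", container], capture_output=True)

if __name__ == "__main__":
    # Toy task: the check passes only if the agent created the expected file.
    passed = run_task(
        image="ubuntu:22.04",                      # placeholder task image
        agent_commands=["touch /tmp/report.txt"],  # commands an agent might emit
        check_cmd="test -f /tmp/report.txt",       # task-specific verification
    )
    print("pass" if passed else "fail")
```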
Evaluation Stats
Total Models: 14
Organizations: 4
Verified Results: 0
Self-Reported: 14
Benchmark Details
Max Score: 1
Language: English
Performance Overview
Score distribution and top performers
Score Distribution: 14 models
Top Score: 50.0%
Average Score: 34.4%
High Performers (80%+): 0

Top Organizations

| Rank | Organization | Models | Avg. Score |
|---|---|---|---|
| 1 | Anthropic | 6 | 40.7% |
| 2 | Zhipu AI | 3 | 36.0% |
| 3 | Moonshot AI | 2 | 27.5% |
| 4 | DeepSeek | 3 | 24.9% |
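The summary figures above are simple aggregates over the per-model scores. Below is a minimal sketch of that computation using only the ten scores visible in the leaderboard table, so its output differs slightly from the page's 14-model figures.

```python
# Illustrative aggregates over the ten scores listed in the leaderboard below.
scores = [50.0, 43.3, 41.0, 40.5, 39.2, 37.7, 37.5, 35.5, 35.2, 31.3]

top_score = max(scores)                           # "Top Score"
average_score = sum(scores) / len(scores)         # "Average Score"
high_performers = sum(s >= 80.0 for s in scores)  # "High Performers (80%+)"

print(f"Top score:       {top_score:.1f}%")
print(f"Average score:   {average_score:.1f}%")
print(f"High performers: {high_performers}")
```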
Leaderboard
14 models ranked by performance on Terminal-Bench
| Release Date | License | Score |
|---|---|---|
| Sep 29, 2025 | Proprietary | 50.0% |
| Aug 5, 2025 | Proprietary | 43.3% |
| Oct 15, 2025 | Proprietary | 41.0% |
| Sep 30, 2025 | MIT | 40.5% |
| May 22, 2025 | Proprietary | 39.2% |
| Sep 29, 2025 | MIT | 37.7% |
| Jul 28, 2025 | MIT | 37.5% |
| May 22, 2025 | Proprietary | 35.5% |
| Feb 24, 2025 | Proprietary | 35.2% |
| Jan 10, 2025 | MIT | 31.3% |
Showing the top 10 of 14 models.