Terminal Bench 2.0

Coding
About

Terminal Bench 2.0 evaluates AI agents on terminal-based tasks to measure real-world command-line proficiency.

Evaluation Stats
Total Models: 12
Organizations: 4
Verified Results: 0
Self-Reported: 0
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers

Score Distribution

12 models
Top Score: 66.5%
Average Score: 57.0%
High Performers (80%+): 0
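
The summary figures above can be recomputed from the scores listed elsewhere on this page. The sketch below is only an illustration: it assumes the average is a plain unweighted mean, and that the two models not shown in the first ten leaderboard rows are the ones scoring 41.3 and 30.0 in the Additional Metrics table. The variable names are illustrative, not part of any Terminal Bench tooling.

```python
from statistics import mean

# Scores in percent: the ten leaderboard rows shown on this page, plus
# 41.3 and 30.0 from the Additional Metrics table (assumed to be the two
# models that fall outside the first ten leaderboard rows).
scores = [66.5, 65.4, 64.7, 64.7, 64.3, 60.4, 59.8, 59.1, 56.2, 51.0, 41.3, 30.0]

HIGH_PERFORMER_THRESHOLD = 80.0  # the page's "80%+" cutoff

top_score = max(scores)
average_score = mean(scores)  # unweighted mean; an assumption about the page's method
high_performers = sum(1 for s in scores if s >= HIGH_PERFORMER_THRESHOLD)

print(f"models: {len(scores)}")
print(f"top score: {top_score:.1f}%")
print(f"average score: {average_score:.2f}%")
print(f"high performers: {high_performers}")
```

Under these assumptions the mean comes out near 56.95, consistent with the 57.0% shown once rounded to one decimal place.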

Top Organizations

#1 Google DeepMind: 2 models, 60.3%
#2 OpenAI: 5 models, 59.5%
#3 Anthropic: 4 models, 58.8%
#4 MiniMax: 1 model, 30.0%
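
As a rough cross-check, the per-organization figures can be approximated by grouping the rows of the Additional Metrics table by model organization and averaging. This is a sketch under stated assumptions: the page does not say how the organization figure is computed (an unweighted mean is assumed here), the table labels the organization "Google" rather than "Google DeepMind", and the table lists only 11 of the 12 ranked models, so Anthropic has three rows here instead of four and its mean may differ slightly from the figure above.

```python
from collections import defaultdict
from statistics import mean

# (model organization, score) pairs copied from the Additional Metrics table.
# Only 11 of the 12 ranked models appear there, so one row is missing.
rows = [
    ("OpenAI", 66.5), ("Anthropic", 65.4), ("OpenAI", 64.7), ("OpenAI", 64.7),
    ("Google", 64.3), ("OpenAI", 60.4), ("Anthropic", 59.8), ("Google", 56.2),
    ("Anthropic", 51.0), ("OpenAI", 41.3), ("MiniMax", 30.0),
]

# Collect each organization's scores.
by_org = defaultdict(list)
for org, score in rows:
    by_org[org].append(score)

# Unweighted mean per organization (an assumption about the page's method).
org_means = {org: mean(vals) for org, vals in by_org.items()}

# Rank organizations from highest to lowest mean score.
ranking = sorted(org_means.items(), key=lambda kv: kv[1], reverse=True)
for rank, (org, avg) in enumerate(ranking, start=1):
    print(f"#{rank} {org}: {len(by_org[org])} models, mean {avg:.2f}%")
```

The resulting order (Google, OpenAI, Anthropic, MiniMax) matches the ranking shown above.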
Leaderboard
12 models ranked by performance on Terminal Bench 2.0
Date          License      Score
Jan 14, 2026  Proprietary  66.5%
Feb 5, 2026   Proprietary  65.4%
Feb 5, 2026   Proprietary  64.7%
Dec 11, 2025  Proprietary  64.7%
Dec 17, 2025  Proprietary  64.3%
Nov 19, 2025  Proprietary  60.4%
Nov 24, 2025  Proprietary  59.8%
Feb 17, 2026  Proprietary  59.1%
Nov 18, 2025  Proprietary  56.2%
Sep 29, 2025  Proprietary  51.0%
Showing 1 to 10 of 12 models
Additional Metrics
Extended metrics for top models on Terminal Bench 2.0
Model              Score  Date        Agent           Agent Org       Model Org
GPT-5.2 Codex      66.5   2026-02-12  CodeBrain-1     LangChain       OpenAI
Claude Opus 4.6    65.4   2026-02-05  Droid           Factory         Anthropic
GPT-5.3 Codex      64.7   2026-02-05  Terminus 2      Terminal Bench  OpenAI
GPT-5.2            64.7   2025-12-12  Terminus 2      Terminal Bench  OpenAI
Gemini 3 Flash     64.3   2025-12-23  Junie CLI       JetBrains       Google
GPT-5.1 Codex Max  60.4   2025-11-24  Codex CLI       OpenAI          OpenAI
Claude Opus 4.5    59.8   2025-12-22  Goose           Block           Anthropic
Gemini 3 Pro       56.2   2026-01-06  Ante            Antigma Labs    Google
Claude Sonnet 4.5  51.0   2025-12-24  OpenHands       OpenHands       Anthropic
GPT-5 Codex        41.3   2025-11-03  Mini-SWE-Agent  Princeton       OpenAI
MiniMax M2.1       30.0   2025-11-01  Terminus 2      Terminal Bench  MiniMax
Resources