OSWorld
multimodal
About
OSWorld benchmarks multimodal agents on 369 real-world computer tasks across multiple operating systems, testing their ability to interact with real web and desktop applications. The tasks cover file I/O, cross-application workflows, and complex GUI interactions, so the benchmark measures practical computer use in real environments rather than simulated ones.
Evaluation Stats
Total Models: 3
Organizations: 2
Verified Results: 0
Self-Reported: 3
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution
3 models
Top Score
61.4%
Average Score
25.4%
High Performers (80%+)
0
Top Organizations
#1 Anthropic
1 model
61.4%
#2 Alibaba Cloud / Qwen Team
2 models
7.4%
Leaderboard
3 models ranked by performance on OSWorld
Date | License | Score
---|---|---
Sep 29, 2025 | Proprietary | 61.4%
Jan 26, 2025 | tongyi-qianwen | 8.8%
Feb 28, 2025 | Apache 2.0 | 5.9%
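
The summary figures in the Performance Overview can be rederived from the three leaderboard scores. The sketch below is illustrative only and rests on two assumptions. First, the organization labels are inferred from the Top Organizations panel, because the table does not name the models. Second, averages are assumed to be rounded half-up to one decimal place. The names `rows` and `mean_pct` are hypothetical helpers, not part of any OSWorld tooling.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores (percent) copied from the leaderboard table.
# Organization labels are an assumption inferred from the
# "Top Organizations" panel; the table has no model column.
rows = [
    ("Anthropic", Decimal("61.4")),
    ("Alibaba Cloud / Qwen Team", Decimal("8.8")),
    ("Alibaba Cloud / Qwen Team", Decimal("5.9")),
]


def mean_pct(values):
    """Mean of the values, rounded half-up to one decimal (assumed rule)."""
    values = list(values)
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Overall summary figures.
top_score = max(score for _, score in rows)
average_score = mean_pct(score for _, score in rows)
high_performers = sum(1 for _, score in rows if score >= 80)

# Per-organization averages.
org_average = {
    org: mean_pct(s for o, s in rows if o == org)
    for org in {org for org, _ in rows}
}

print(f"Top score: {top_score}%")
print(f"Average score: {average_score}%")
print(f"High performers (80%+): {high_performers}")
for org, avg in sorted(org_average.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{org}: {avg}%")
```

Under these assumptions the computed values agree with the page: a 61.4% top score, a 25.4% average, no models at or above 80%, and a 7.4% average across the two Alibaba Cloud / Qwen Team entries.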