OSWorld

multimodal
About

OSWorld benchmarks multimodal agents on 369 real-world computer tasks across multiple operating systems, testing their ability to interact with real web and desktop applications. The tasks cover file I/O operations, cross-application workflows, and complex GUI interactions, so the benchmark measures practical computer use in real environments rather than simulated ones.

Evaluation Stats
Total Models: 3
Organizations: 2
Verified Results: 0
Self-Reported: 3
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

Models: 3
Top Score: 61.4%
Average Score: 25.4%
High Performers (80%+): 0

Top Organizations

#1 Anthropic: 1 model, 61.4%
#2 Alibaba Cloud / Qwen Team: 2 models, 7.4%
Leaderboard
3 models ranked by performance on OSWorld
Date | License | Score
Sep 29, 2025 | Proprietary | 61.4%
Jan 26, 2025 | tongyi-qianwen | 8.8%
Feb 28, 2025 | Apache 2.0 | 5.9%
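
The summary figures on this page can be checked against the three leaderboard scores. The sketch below is illustrative only: the `entries` list, the `summarize` helper, and the pairing of each score with an organization are assumptions inferred from the "Top Organizations" summary, not data the page states row by row. It also assumes the per-organization figure is the mean of that organization's model scores, which matches the numbers shown (7.35, displayed as 7.4%) but is not confirmed by the page.

```python
from statistics import mean

# (organization, score in percent). The organization labels are inferred from
# the "Top Organizations" summary; the leaderboard rows do not name them.
entries = [
    ("Anthropic", 61.4),
    ("Alibaba Cloud / Qwen Team", 8.8),
    ("Alibaba Cloud / Qwen Team", 5.9),
]

HIGH_PERFORMER_THRESHOLD = 80.0  # the page's "80%+" cutoff


def summarize(rows):
    """Recompute the page's summary statistics from (org, score) rows."""
    scores = [score for _, score in rows]

    # Group scores by organization (assumed aggregation: per-org mean).
    by_org = {}
    for org, score in rows:
        by_org.setdefault(org, []).append(score)

    return {
        "models": len(scores),
        "top_score": max(scores),
        "average_score": mean(scores),
        "high_performers": sum(s >= HIGH_PERFORMER_THRESHOLD for s in scores),
        "org_average": {org: mean(vals) for org, vals in by_org.items()},
    }


summary = summarize(entries)
print(summary)
```

Under these assumptions the overall mean is (61.4 + 8.8 + 5.9) / 3 ≈ 25.37, consistent with the 25.4% average shown, and no score reaches the 80% threshold.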
Resources