OSWorld
Agents
About
OSWorld evaluates multimodal AI agents on real computer tasks across web browsers, office suites, and OS interfaces using GUI interaction, with success rate as the primary metric.
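As a rough illustration of the primary metric, the sketch below computes a success rate as the percentage of tasks an agent completes. It assumes each task yields a simple pass/fail outcome; the `success_rate` helper and the sample outcomes are hypothetical and are not taken from the OSWorld harness.

```python
def success_rate(outcomes):
    """Return the percentage of tasks marked successful.

    `outcomes` is a list of booleans, one per attempted task
    (an assumed pass/fail representation, for illustration only).
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run: four tasks, three of them completed.
outcomes = [True, False, True, True]
rate = success_rate(outcomes)
print(f"{rate:.1f}%")
```

Under this reading, a leaderboard score of 72.7% would mean the agent completed roughly 73 of every 100 tasks.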
Evaluation Stats
Total Models: 13
Organizations: 5
Verified Results: 0
Self-Reported: 4
Benchmark Details
Max Score: 100
Performance Overview
Score distribution and top performers
Score Distribution (13 models)
Top Score: 72.7%
Average Score: 49.4%
High Performers (80%+): 0

Top Organizations
| Rank | Organization | Models | Score |
|---|---|---|---|
| 1 | Anthropic | 5 | 63.4% |
| 2 | Moonshot AI | 1 | 63.3% |
| 3 | ByteDance | 3 | 51.7% |
| 4 | OpenAI | 2 | 30.6% |
| 5 | Alibaba / Qwen | 2 | 22.7% |
Leaderboard
13 models ranked by performance on OSWorld
| Date | License | Score |
|---|---|---|
| Feb 1, 2026 | Proprietary | 72.7% |
| Feb 17, 2026 | Proprietary | 72.5% |
| Nov 1, 2025 | Proprietary | 66.3% |
| Jan 1, 2026 | MIT | 63.3% |
| Dec 18, 2025 | Proprietary | 61.9% |
| Sep 29, 2025 | Proprietary | 61.4% |
| Sep 4, 2025 | Apache 2.0 | 53.1% |
| May 14, 2025 | Proprietary | 43.9% |
| Jan 22, 2026 | Apache 2.0 | 41.6% |
| Jan 22, 2025 | Proprietary | 40.0% |
Showing 1 to 10 of 13 models
Additional Metrics
Extended metrics for top models on OSWorld
| Model | Score | Max Steps | Model Type | Organization |
|---|---|---|---|---|
| Kimi K2.5 | 63.3 | 100 | General model | Moonshot AI |
| Seed-1.8 | 61.9 | 100 | General model | ByteDance Seed |
| Claude Sonnet 4.5 | 61.4 | 100 | General model | Anthropic |
| UI-TARS-2 | 53.1 | 100 | General model | ByteDance Seed |
| Claude Sonnet 4 | 43.9 | 50 | General model | Anthropic |
| Qwen3-VL Flash | 41.6 | 100 | General model | Qwen Team, Alibaba Group |
| Doubao 1.5 Vision Pro | 40.0 | 100 | General model | ByteDance Seed |
| o3 | 23.0 | 100 | General model | OpenAI |
| Qwen2.5-VL 32B Instruct | 3.9 | 15 | General model | Alibaba Cloud, Qwen Team |