OSWorld

Agents

About

OSWorld evaluates multimodal AI agents on real computer tasks across web browsers, office suites, and OS interfaces using GUI interaction, with success rate as the primary metric.
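As a rough illustration of the metric, success rate can be read as the percentage of attempted tasks that the agent completes. The sketch below assumes each task yields a simple pass/fail outcome; the function name and the sample outcomes are hypothetical and are not taken from the OSWorld harness.

```python
def success_rate(outcomes):
    """Return the percentage of tasks marked successful.

    `outcomes` is a list of booleans, one per attempted task
    (hypothetical input format, assumed for illustration only).
    """
    if not outcomes:
        raise ValueError("no task outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run: 3 of 4 tasks completed.
example_outcomes = [True, False, True, True]
example_rate = success_rate(example_outcomes)
```

A score on this page such as 72.7% would then correspond to roughly 73 of every 100 tasks being completed, under the pass/fail assumption above.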

Evaluation Stats

Total Models: 13

Organizations: 5

Verified Results: 0

Self-Reported: 4

Benchmark Details

Max Score: 100

Performance Overview

Score distribution and top performers

Score Distribution

13 models

Top Score: 72.7%

Average Score: 49.4%

High Performers (80%+): 0

Top Organizations

Rank	Organization	Models	Avg. Score
#1	Anthropic	5	63.4%
#2	Moonshot AI	1	63.3%
#3	ByteDance	3	51.7%
#4	OpenAI	2	30.6%
#5	Alibaba / Qwen	2	22.7%

Leaderboard

13 models ranked by performance on OSWorld

Rank	Model	Organization	Released	License	Score
#01	Claude Opus 4.6	Anthropic	Feb 1, 2026	Proprietary	72.7%
#02	Claude Sonnet 4.6	Anthropic	Feb 17, 2026	Proprietary	72.5%
#03	Claude Opus 4.5	Anthropic	Nov 1, 2025	Proprietary	66.3%
#04	Kimi K2.5	Moonshot AI	Jan 1, 2026	MIT	63.3%
#05	Seed-1.8	ByteDance	Dec 18, 2025	Proprietary	61.9%
#06	Claude Sonnet 4.5	Anthropic	Sep 29, 2025	Proprietary	61.4%
#07	UI-TARS-2	ByteDance	Sep 4, 2025	Apache 2.0	53.1%
#08	Claude Sonnet 4	Anthropic	May 14, 2025	Proprietary	43.9%
#09	Qwen3-VL Flash	Alibaba / Qwen	Jan 22, 2026	Apache 2.0	41.6%
#10	Doubao 1.5 Vision Pro	ByteDance	Jan 22, 2025	Proprietary	40.0%

Showing 1 to 10 of 13 models
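The per-organization figures under Top Organizations appear to be consistent with an unweighted mean of each organization's model scores. The page does not state how they are computed, so that is an assumption. The sketch below checks it for the three organizations whose models all appear in the ten leaderboard rows shown; the function name `org_averages` is illustrative.

```python
from collections import defaultdict
from statistics import mean

# Scores (%) copied from the leaderboard rows shown on this page.
# Only organizations with every model visible in the top 10 are included.
rows = [
    ("Claude Opus 4.6", "Anthropic", 72.7),
    ("Claude Sonnet 4.6", "Anthropic", 72.5),
    ("Claude Opus 4.5", "Anthropic", 66.3),
    ("Kimi K2.5", "Moonshot AI", 63.3),
    ("Seed-1.8", "ByteDance", 61.9),
    ("Claude Sonnet 4.5", "Anthropic", 61.4),
    ("UI-TARS-2", "ByteDance", 53.1),
    ("Claude Sonnet 4", "Anthropic", 43.9),
    ("Doubao 1.5 Vision Pro", "ByteDance", 40.0),
]


def org_averages(rows):
    """Group scores by organization and return the mean, rounded to 0.1.

    Assumes an unweighted mean per organization (not confirmed by the page).
    """
    by_org = defaultdict(list)
    for _model, org, score in rows:
        by_org[org].append(score)
    return {org: round(mean(scores), 1) for org, scores in by_org.items()}


averages = org_averages(rows)
```

Under that assumption, the computed values for Anthropic, Moonshot AI, and ByteDance match the 63.4%, 63.3%, and 51.7% listed. OpenAI and Alibaba / Qwen are left out because not all of their models appear in the rows shown.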

Additional Metrics

Extended metrics for top models on OSWorld

Model	Score	Max Steps	Model Type	Organization
Kimi K2.5	63.3	100	General model	Moonshot AI
Seed-1.8	61.9	100	General model	ByteDance Seed
Claude Sonnet 4.5	61.4	100	General model	Anthropic
UI-TARS-2	53.1	100	General model	ByteDance Seed
Claude Sonnet 4	43.9	50	General model	Anthropic
Qwen3-VL Flash	41.6	100	General model	Qwen Team, Alibaba Group
Doubao 1.5 Vision Pro	40.0	100	General model	ByteDance Seed
o3	23.0	100	General model	OpenAI
Qwen2.5-VL 32B Instruct	3.9	15	General model	Alibaba Cloud, Qwen Team
