BrowseComp-zh

Multilingual

text

About

BrowseComp-zh is a high-difficulty benchmark built to evaluate LLM agents on the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Each question is reverse-engineered from a short, objective, easily verifiable answer, so solving it requires multi-step reasoning and reconciling information across sources rather than basic retrieval. The benchmark targets the linguistic, infrastructural, and censorship-related complexities of Chinese web environments.

Evaluation Stats

Total Models: 3

Organizations: 1

Verified Results: 0

Self-Reported: 3

Benchmark Details

Max Score: 1

Language

Performance Overview

Score distribution and top performers

Score Distribution

3 models

Top Score

49.2%

Average Score

44.3%

High Performers (80%+): 0

Top Organizations

#1 DeepSeek

3 models

44.3%

Leaderboard

3 models ranked by performance on BrowseComp-zh

Rank	Model	Organization	Released	License	Score
#01	DeepSeek-V3.1	DeepSeek	Jan 10, 2025	MIT	49.2%
#02	DeepSeek-V3.2-Exp	DeepSeek	Sep 29, 2025	MIT	47.9%
#03	DeepSeek-R1-0528	DeepSeek	May 28, 2025	MIT	35.7%
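
The summary figures in the Performance Overview can be reproduced from the three leaderboard rows. The sketch below assumes the average is a simple unweighted mean of the listed scores, which matches the 44.3% shown. The `summarize` helper and the 80% threshold parameter are illustrative names, not part of any official tooling for this benchmark.

```python
from statistics import mean

# Scores (in percent) copied from the leaderboard rows above.
scores = {
    "DeepSeek-V3.1": 49.2,
    "DeepSeek-V3.2-Exp": 47.9,
    "DeepSeek-R1-0528": 35.7,
}


def summarize(scores, high_threshold=80.0):
    """Return top model, top score, mean score, and count of high performers.

    Assumption: the page's "Average Score" is an unweighted mean
    rounded to one decimal place.
    """
    values = list(scores.values())
    return {
        "top_model": max(scores, key=scores.get),
        "top_score": max(values),
        "average": round(mean(values), 1),
        "high_performers": sum(v >= high_threshold for v in values),
    }


summary = summarize(scores)
print(summary)
```

Under that assumption the mean works out to 132.8 / 3, which rounds to 44.3, and no model reaches the 80% threshold.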

Resources

Research Paper