BrowseComp-zh

Multilingual
text
About

A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.

Evaluation Stats
Total Models: 3
Organizations: 1
Verified Results: 0
Self-Reported: 3
Benchmark Details
Max Score: 1
Language: zh
Performance Overview
Score distribution and top performers

Score Distribution

3 models
Top Score: 49.2%
Average Score: 44.3%
High Performers (80%+): 0

Top Organizations

#1 DeepSeek, 3 models, 44.3%
Leaderboard
3 models ranked by performance on BrowseComp-zh
Date          License  Score
Jan 10, 2025  MIT      49.2%
Sep 29, 2025  MIT      47.9%
May 28, 2025  MIT      35.7%
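
The summary figures in the Performance Overview can be checked against the three leaderboard scores. The following is a minimal sketch, assuming the average is a simple unweighted mean and that "High Performers" counts scores at or above 80%. The variable names are illustrative and not part of the leaderboard itself.

```python
# Scores as listed in the leaderboard, in percent.
scores = [49.2, 47.9, 35.7]

# Assumption: the top score is the maximum of the listed scores.
top_score = max(scores)

# Assumption: the average is an unweighted mean of the listed scores.
average_score = sum(scores) / len(scores)

# Assumption: a "high performer" is any model scoring 80% or above.
high_performers = sum(1 for s in scores if s >= 80.0)

print(f"Top Score: {top_score:.1f}%")
print(f"Average Score: {average_score:.1f}%")
print(f"High Performers (80%+): {high_performers}")
```

Under these assumptions the computed values match the page: the mean of the three scores rounds to 44.3%, and no model reaches the 80% threshold.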
Resources