BrowseComp-zh
Multilingual
text
About
A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.
Evaluation Stats
Total Models: 3
Organizations: 1
Verified Results: 0
Self-Reported: 3
Benchmark Details
Max Score: 1
Language: zh
Performance Overview
Score distribution and top performers
Score Distribution: 3 models
Top Score: 49.2%
Average Score: 44.3%
High Performers (80%+): 0
Top Organizations
#1 DeepSeek: 3 models, 44.3%
Leaderboard
3 models ranked by performance on BrowseComp-zh
Date | License | Score
---|---|---
Jan 10, 2025 | MIT | 49.2%
Sep 29, 2025 | MIT | 47.9%
May 28, 2025 | MIT | 35.7%
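
The summary figures in the Performance Overview can be reproduced from the three leaderboard scores. The following is a minimal sketch, assuming the average is a plain arithmetic mean rounded to one decimal place and that "High Performers" counts scores at or above 80%; the variable names are illustrative and not part of any published tooling.

```python
from statistics import mean

# Scores copied from the leaderboard rows, in percent.
scores_pct = [49.2, 47.9, 35.7]

# Top score is the maximum of the listed results.
top_score = max(scores_pct)

# Assumption: the page reports an unweighted mean, rounded to one decimal.
average_score = round(mean(scores_pct), 1)

# Assumption: "High Performers (80%+)" counts scores >= 80.
high_performers = sum(1 for s in scores_pct if s >= 80.0)

print(f"Top: {top_score}%  Average: {average_score}%  80%+: {high_performers}")
```

Under these assumptions the computed values agree with the figures shown on the page (49.2%, 44.3%, and 0).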