All Benchmarks

Explore all 21 benchmarks for evaluating language models across different capabilities and domains.

Coding      25  11  80.9%  76.7%
Agents      23   5  84.0%  20.3%
Science     22   5  93.2%  76.7%
Coding      20   6  91.3%  80.3%
Tool Use    19   8  62.3%  35.0%
Coding      18   9  51.7%  43.7%
Multimodal  16   9  78.4%  30.1%
Humanity's Last Exam (1 sub-benchmark)
Reasoning   15   4  40.0%  25.6%
Coding      15   5  74.8%  54.9%
Agents      13   5  72.7%  49.4%
Showing 1 to 10 of 21 benchmarks