All Benchmarks

Explore all 20 benchmarks for evaluating language models across different capabilities and domains

Coding       21   9   80.9%   76.4%
Agents       19   5   74.7%    8.3%
Tool Use     18   8   62.3%   33.7%
Coding       18   9   51.7%   43.7%
Multimodal   16   9   78.4%   30.1%
Reasoning    15   4   37.5%   21.7%
Coding       12   4   69.9%   56.3%
Finance      11   5   63.3%   56.1%
Agents       11   4   49.7%   33.3%
Agents       10   6   85.4%   69.3%
Showing 1 to 10 of 20 benchmarks