All Benchmarks
Explore all 21 benchmarks for evaluating language models across different capabilities and domains.
| Properties | Links | ||||||
|---|---|---|---|---|---|---|---|
| | Coding | 25 | 11 | 80.9% | 76.7% | | |
| | Agents | 23 | 5 | 84.0% | 20.3% | | |
| | Science | 22 | 5 | 93.2% | 76.7% | | |
| | Coding | 20 | 6 | 91.3% | 80.3% | | |
| | Tool Use | 19 | 8 | 62.3% | 35.0% | | |
| | Coding | 18 | 9 | 51.7% | 43.7% | | |
| | Multimodal | 16 | 9 | 78.4% | 30.1% | | |
| Humanity's Last Exam (1 sub-benchmark) | Reasoning | 15 | 4 | 40.0% | 25.6% | | |
| | Coding | 15 | 5 | 74.8% | 54.9% | | |
| | Agents | 13 | 5 | 72.7% | 49.4% | | |
Showing 1 to 10 of 21 benchmarks