ACEBench

About

ACEBench is a comprehensive benchmark for evaluating Large Language Models' tool-usage capabilities across three scenarios: Normal (basic tool usage), Special (ambiguous instructions), and Agent (multi-agent interactions). It covers 8 major domains and 68 sub-domains, testing LLMs' decision-making and reasoning when integrated with various tools over 1-8 dialogue turns that simulate real-world contexts. The benchmark addresses limitations in existing evaluations by providing granular error analysis.

Evaluation Stats
Total Models: 2
Organizations: 1
Verified Results: 0
Self-Reported: 2
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution: 2 models
Top Score: 76.5%
Average Score: 76.5%
High Performers (80%+): 0

Top Organizations

#1 Moonshot AI: 2 models, 76.5%
Leaderboard
2 models ranked by performance on ACEBench
Date | License | Score
Jul 11, 2025 | MIT | 76.5%
Sep 5, 2025 | MIT | 76.5%