BFCL

About

BFCL (Berkeley Function Calling Leaderboard) is a comprehensive benchmark evaluating Large Language Models' ability to accurately call functions and use tools. It assesses both single-turn and multi-turn interactions, measuring hallucination rates and format sensitivity in function calling scenarios. The benchmark uses real-world data to test tool usage capabilities, providing critical evaluation for AI systems designed to interact with external APIs and tools.
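At its core, a single-turn function-calling check compares a model's predicted call against a reference call. The sketch below is a minimal, hypothetical illustration of that idea (exact-match on name and arguments), not BFCL's actual evaluation harness, which uses more sophisticated AST-based matching:

```python
# Minimal sketch of a single-turn function-call check.
# The call format (a dict with "name" and "arguments") is an assumption
# for illustration, not BFCL's actual data schema.

def calls_match(expected: dict, predicted: dict) -> bool:
    """Return True when the function name and all arguments match exactly."""
    return (
        expected["name"] == predicted["name"]
        and expected["arguments"] == predicted["arguments"]
    )

expected = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "celsius"}}
good = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "celsius"}}
bad = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "fahrenheit"}}

print(calls_match(expected, good))  # True
print(calls_match(expected, bad))   # False
```

A real harness would also need to handle type coercion, optional parameters, and the hallucination cases the benchmark measures (calling a function that was never offered).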

Evaluation Stats
Total Models: 10
Organizations: 3
Verified Results: 0
Self-Reported: 10
Benchmark Details
Max Score: 1
Language: English (en)
Performance Overview
Score distribution and top performers

Score Distribution (10 models)
Top Score: 88.5%
Average Score: 71.7%
High Performers (80%+): 2

Top Organizations
#1 Meta (3 models): 83.1%
#2 Alibaba Cloud / Qwen Team (4 models): 69.2%
#3 Amazon (3 models): 63.7%
Leaderboard
10 models ranked by performance on BFCL

Rank  Release Date  License                      Score
1     Jul 23, 2024  Llama 3.1 Community License  88.5%
2     Jul 23, 2024  Llama 3.1 Community License  84.8%
3     Jul 23, 2024  Llama 3.1 Community License  76.1%
4     Apr 29, 2025  Apache 2.0                   70.8%
5     Apr 29, 2025  Apache 2.0                   70.3%
6     Apr 29, 2025  Apache 2.0                   69.1%
7     Nov 20, 2024  Proprietary                  68.4%
8     Nov 20, 2024  Proprietary                  66.6%
9     Mar 5, 2025   Apache 2.0                   66.4%
10    Nov 20, 2024  Proprietary                  56.2%
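The summary statistics in the Performance Overview can be reproduced from the ten self-reported scores listed above. Note that the per-organization grouping below is inferred from the licenses and release dates shown in the table, since the rows do not name organizations directly:

```python
# Reproduce the summary statistics from the ten leaderboard scores (percent).
scores = [88.5, 84.8, 76.1, 70.8, 70.3, 69.1, 68.4, 66.6, 66.4, 56.2]

print(max(scores))                   # top score: 88.5
print(sum(scores) / len(scores))     # average: ~71.7, matching the overview
print(sum(s >= 80 for s in scores))  # high performers (80%+): 2

# Per-organization averages; grouping inferred from license/date columns,
# not stated explicitly in the leaderboard rows.
orgs = {
    "Meta": [88.5, 84.8, 76.1],
    "Alibaba Cloud / Qwen Team": [70.8, 70.3, 69.1, 66.4],
    "Amazon": [68.4, 66.6, 56.2],
}
for name, group in orgs.items():
    print(name, sum(group) / len(group))  # ~83.1, ~69.2, ~63.7 respectively
```

All three organization averages agree with the Top Organizations figures above to one decimal place.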
Resources