BFCL

About

BFCL (Berkeley Function Calling Leaderboard) is a comprehensive benchmark evaluating Large Language Models' ability to accurately call functions and use tools. It assesses both single-turn and multi-turn interactions, measuring hallucination rates and format sensitivity in function calling scenarios. The benchmark uses real-world data to test tool usage capabilities, providing critical evaluation for AI systems designed to interact with external APIs and tools.
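At its core, a single-turn function-calling check compares a model's predicted call against a reference call. The sketch below is a minimal, hypothetical illustration of that idea (exact-match on name and arguments), not BFCL's actual evaluation harness, which uses more sophisticated AST-based matching:

```python
# Minimal sketch of a single-turn function-call check.
# The call format (a dict with "name" and "arguments") is an assumption
# for illustration, not BFCL's actual data schema.

def calls_match(expected: dict, predicted: dict) -> bool:
    """Return True when the function name and all arguments match exactly."""
    return (
        expected["name"] == predicted["name"]
        and expected["arguments"] == predicted["arguments"]
    )

expected = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "celsius"}}
good = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "celsius"}}
bad = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "fahrenheit"}}

print(calls_match(expected, good))  # True
print(calls_match(expected, bad))   # False
```

A real harness would also need to handle type coercion, optional parameters, and the hallucination cases the benchmark measures (calling a function that was never offered).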

Evaluation Stats
Total Models: 10
Organizations: 3
Verified Results: 0
Self-Reported: 10
Benchmark Details
Max Score: 1
Language: English (en)
Performance Overview
Score distribution and top performers

Score Distribution (10 models)
Top Score: 88.5%
Average Score: 71.7%
High Performers (80%+): 2

Top Organizations
#1 Meta (3 models): 83.1%
#2 Alibaba Cloud / Qwen Team (4 models): 69.2%
#3 Amazon (3 models): 63.7%
Leaderboard
10 models ranked by performance on BFCL

Rank  Release Date  License                      Score
1     Jul 23, 2024  Llama 3.1 Community License  88.5%
2     Jul 23, 2024  Llama 3.1 Community License  84.8%
3     Jul 23, 2024  Llama 3.1 Community License  76.1%
4     Apr 29, 2025  Apache 2.0                   70.8%
5     Apr 29, 2025  Apache 2.0                   70.3%
6     Apr 29, 2025  Apache 2.0                   69.1%
7     Nov 20, 2024  Proprietary                  68.4%
8     Nov 20, 2024  Proprietary                  66.6%
9     Mar 5, 2025   Apache 2.0                   66.4%
10    Nov 20, 2024  Proprietary                  56.2%
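The summary statistics in the Performance Overview can be reproduced from the ten self-reported scores listed above. Note that the per-organization grouping below is inferred from the licenses and release dates shown in the table, since the rows do not name organizations directly:

```python
# Reproduce the summary statistics from the ten leaderboard scores (percent).
scores = [88.5, 84.8, 76.1, 70.8, 70.3, 69.1, 68.4, 66.6, 66.4, 56.2]

print(max(scores))                   # top score: 88.5
print(sum(scores) / len(scores))     # average: ~71.7, matching the overview
print(sum(s >= 80 for s in scores))  # high performers (80%+): 2

# Per-organization averages; grouping inferred from license/date columns,
# not stated explicitly in the leaderboard rows.
orgs = {
    "Meta": [88.5, 84.8, 76.1],
    "Alibaba Cloud / Qwen Team": [70.8, 70.3, 69.1, 66.4],
    "Amazon": [68.4, 66.6, 56.2],
}
for name, group in orgs.items():
    print(name, sum(group) / len(group))  # ~83.1, ~69.2, ~63.7 respectively
```

All three organization averages agree with the Top Organizations figures above to one decimal place.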
Resources