BFCL
About
BFCL (Berkeley Function Calling Leaderboard) is a comprehensive benchmark evaluating Large Language Models' ability to accurately call functions and use tools. It assesses both single-turn and multi-turn interactions, measuring hallucination rates and format sensitivity in function calling scenarios. The benchmark uses real-world data to test tool usage capabilities, providing critical evaluation for AI systems designed to interact with external APIs and tools.
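Function calling in this setting means the model is given tool schemas and must emit a structured call with the right function name and arguments. As an illustrative sketch only (the tool name, fields, and checker below are hypothetical, not BFCL's actual dataset format or grader), a schema-style tool definition and a minimal validity check might look like:

```python
# Hypothetical tool schema in the JSON-Schema style commonly used for
# function calling; names and fields are illustrative, not BFCL's format.
get_weather = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# A structured call a model might emit in response to a user request.
model_call = {
    "name": "get_weather",
    "arguments": {"city": "Berkeley", "unit": "celsius"},
}

def is_valid_call(call, schema):
    """Minimal check: correct function name, all required arguments
    present, and no arguments outside the declared properties."""
    if call["name"] != schema["name"]:
        return False
    props = schema["parameters"]["properties"]
    required = schema["parameters"].get("required", [])
    args = call["arguments"]
    return all(r in args for r in required) and all(a in props for a in args)

print(is_valid_call(model_call, get_weather))  # True
```

A benchmark in this style scores whether calls like `model_call` name the right tool and satisfy its schema; real evaluation also checks argument values, multi-turn state, and whether the model hallucinates tools that were never offered.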
Evaluation Stats
Total Models: 10
Organizations: 3
Verified Results: 0
Self-Reported: 10
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution: 10 models
Top Score: 88.5%
Average Score: 71.7%
High Performers (80%+): 2

Top Organizations
#1 Meta: 3 models, 83.1%
#2 Alibaba Cloud / Qwen Team: 4 models, 69.2%
#3 Amazon: 3 models, 63.7%
Leaderboard
10 models ranked by performance on BFCL

Release Date | License | Score
---|---|---
Jul 23, 2024 | Llama 3.1 Community License | 88.5%
Jul 23, 2024 | Llama 3.1 Community License | 84.8%
Jul 23, 2024 | Llama 3.1 Community License | 76.1%
Apr 29, 2025 | Apache 2.0 | 70.8%
Apr 29, 2025 | Apache 2.0 | 70.3%
Apr 29, 2025 | Apache 2.0 | 69.1%
Nov 20, 2024 | Proprietary | 68.4%
Nov 20, 2024 | Proprietary | 66.6%
Mar 5, 2025 | Apache 2.0 | 66.4%
Nov 20, 2024 | Proprietary | 56.2%
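The summary statistics above follow directly from the ten per-model scores in the table. A quick sketch that recomputes them (the organization grouping assumes each organization's rows are identified by license and release date, matching the Top Organizations counts):

```python
# Per-model scores from the leaderboard table, in percent, in table order.
scores = [88.5, 84.8, 76.1, 70.8, 70.3, 69.1, 68.4, 66.6, 66.4, 56.2]

top_score = max(scores)                              # 88.5
average = sum(scores) / len(scores)                  # ~71.7
high_performers = sum(1 for s in scores if s >= 80)  # 2

# Grouping assumption: Meta = Llama-licensed rows, Alibaba Cloud / Qwen
# Team = Apache 2.0 rows, Amazon = Proprietary rows.
org_scores = {
    "Meta": [88.5, 84.8, 76.1],
    "Alibaba Cloud / Qwen Team": [70.8, 70.3, 69.1, 66.4],
    "Amazon": [68.4, 66.6, 56.2],
}
org_avgs = {org: sum(v) / len(v) for org, v in org_scores.items()}

print(top_score, round(average, 1), high_performers)
```

Running this reproduces the reported figures: top score 88.5%, average about 71.7%, two models at or above 80%, and organization averages of roughly 83.1%, 69.2%, and 63.7%.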