Salesforce AI Research and UC Berkeley have unveiled BFCL Audio, a new benchmark designed to evaluate how accurately AI models handle audio-native function calls. In a blog post announcing the benchmark, the collaborators detailed how this extension of the existing BFCL (Berkeley Function Calling Leaderboard) framework addresses challenges in real-world voice interactions, particularly for enterprise applications where accuracy is paramount.
The initiative stems from a gap the team identified in evaluating AI models' ability to reliably execute zero-shot function calls, a problem it first tackled in 2022 with the Gorilla OpenFunctions models. The original BFCL benchmark evolved through several versions, from abstract syntax tree (AST)-based evaluation to multi-turn and agentic settings, becoming a foundational tool for text-based function calling. BFCL Audio now extends that evaluation to the voice domain, acknowledging that real-world products rarely operate in pure text.
Voice interfaces are ubiquitous, appearing in phone support, in-car assistants, smart homes, wearables, and accessibility tools. In these scenarios, AI agents must balance natural, low-latency dialogue with precise action execution. For enterprises, automating customer support and call centers demands flawless function calling; a misheard account number or incorrect appointment time can lead to significant customer frustration and financial loss.
