Salesforce AI Research and UC Berkeley have unveiled BFCL Audio, a new benchmark designed to rigorously evaluate how precisely AI models handle audio-native function calls. In a blog announcement, the collaborators detailed how this extension of the existing BFCL (Berkeley Function-Calling Leaderboard) framework addresses critical challenges in real-world voice interactions, particularly for enterprise applications where accuracy is paramount.
The initiative stems from a recognized gap in evaluating AI models' ability to reliably execute zero-shot function calls, a problem the team first tackled in 2023 with the Gorilla OpenFunctions models. The original BFCL benchmark evolved through several versions, from AST-based evaluation to multi-turn and agentic settings, becoming a foundational tool for text-based function calling. BFCL Audio now extends that evaluation to the voice domain, acknowledging that real-world products rarely operate in pure text.
Voice interfaces are ubiquitous, appearing in phone support, in-car assistants, smart homes, wearables, and accessibility tools. In these scenarios, AI agents must balance natural, low-latency dialogue with precise action execution. For enterprises, automating customer support and call centers demands flawless function calling; a misheard account number or incorrect appointment time can lead to significant customer frustration and financial loss.
The Precision Problem in Voice AI
The report highlights two primary architectural paths for voice agents: End-to-End (E2E) speech-to-speech systems and Cascaded ASR → LLM → TTS pipelines. E2E models offer natural prosody and low latency, unifying reasoning over acoustics and semantics. However, they often lack tool-call precision without additional structure and have limited model availability. Cascaded systems, while leveraging mature text LLM stacks and offering modularity, face a critical bottleneck: ASR errors. Even minor transcription mistakes can be catastrophic for function calling, where APIs demand exact matches. The LLM in a cascaded system never "hears" the waveform, losing crucial acoustic cues that could recover intent.
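To make the contrast concrete, here is a minimal sketch of a single cascaded turn. The stage names (`transcribe`, `call_llm`, `synthesize`) and their canned outputs are illustrative stand-ins, not APIs from the benchmark; the point is that the LLM stage reasons over the ASR transcript alone, which is exactly where the bottleneck described above arises.

```python
# Minimal sketch of one turn in a cascaded voice agent. All three stages are
# hypothetical stand-ins for real ASR, tool-calling LLM, and TTS services.
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


def transcribe(audio: bytes) -> str:
    """ASR stage: waveform in, text out. The LLM below never sees the audio,
    so any error introduced here is unrecoverable downstream."""
    return "book a table for two at seven pm"  # canned output for the sketch


def call_llm(transcript: str, tools: list[dict]) -> ToolCall:
    """LLM stage: chooses a tool and arguments from the transcript alone."""
    return ToolCall(name="book_table", arguments={"party_size": 2, "time": "19:00"})


def synthesize(text: str) -> bytes:
    """TTS stage: renders the confirmation text back to speech."""
    return text.encode("utf-8")  # placeholder for an actual waveform


def cascaded_turn(audio: bytes, tools: list[dict]) -> bytes:
    transcript = transcribe(audio)           # ASR
    tool_call = call_llm(transcript, tools)  # LLM reasons over text only
    reply = f"Calling {tool_call.name} with {json.dumps(tool_call.arguments)}"
    return synthesize(reply)                 # TTS


print(cascaded_turn(b"\x00", tools=[]))
```

An E2E system collapses these three stages into one model, which is why it retains acoustic cues but, as the report notes, tends to lose tool-call precision without additional structure.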
Audio introduces systematic shifts compared to typed input, including conversational fillers, acoustic artifacts, accents, background noise, and the challenge of homophones or named entities (e.g., "John" vs. "Jon," "final report.pdf" vs. "finalReport.pdf"). These factors contribute to non-trivial word error rates from ASR systems, directly impacting the reliability of subsequent function calls.
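A toy example, not drawn from the benchmark, shows why these shifts are so damaging: a function backed by an exact-match lookup rejects a transcript that a human would consider a perfectly reasonable rendering of the request.

```python
# Toy illustration of why exact-match APIs make small ASR slips catastrophic:
# a transcription that "sounds right" still fails argument validation.
FILES = {"finalReport.pdf": b"..."}


def open_file(filename: str) -> bytes:
    # Real APIs typically demand exact identifiers; there is no fuzzy match.
    if filename not in FILES:
        raise KeyError(f"No such file: {filename!r}")
    return FILES[filename]


open_file("finalReport.pdf")        # exact identifier: succeeds
try:
    open_file("final report.pdf")   # plausible ASR rendering of the same name
except KeyError as err:
    print(err)                      # fails even though the request sounded correct
```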
To create BFCL Audio, existing text BFCL queries were paraphrased into natural, conversational phrasing and then rendered as synthetic audio using a variety of TTS engines to diversify the inputs. For cascaded systems, each audio sample was pre-transcribed by three different ASR systems (OpenAI, ElevenLabs, Deepgram) to expose sensitivity to the ASR choice. Together, the paraphrasing, TTS variety, and multiple transcripts make for a robust evaluation environment.
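One plausible way to picture a resulting sample is sketched below; the field names are assumptions for illustration, not the benchmark's released schema. Each entry pairs the original text query and its spoken paraphrase with the synthetic audio and the three pre-computed transcripts.

```python
# Illustrative shape of one BFCL Audio sample for cascaded evaluation.
# Field names here are assumptions, not the benchmark's actual schema.
from dataclasses import dataclass, field


@dataclass
class AudioSample:
    sample_id: str
    text_query: str              # original BFCL text query
    spoken_paraphrase: str       # conversational rewording fed to TTS
    audio_path: str              # synthetic audio from one TTS engine
    transcripts: dict[str, str] = field(default_factory=dict)
    # Pre-computed transcripts keyed by ASR provider, so a cascaded system's
    # sensitivity to the ASR choice can be measured directly.


sample = AudioSample(
    sample_id="bfcl_audio_0001",
    text_query="Get the weather in Berkeley, CA",
    spoken_paraphrase="hey, could you check what the weather's like in Berkeley?",
    audio_path="audio/bfcl_audio_0001.wav",
    transcripts={
        "openai": "hey could you check what the weather is like in berkeley",
        "elevenlabs": "hey, could you check what the weather's like in Berkeley?",
        "deepgram": "hey could you check what the weather's like in berkley",
    },
)
```

Note how even in this hypothetical example the three transcripts disagree on punctuation, casing, and the spelling of "Berkeley", precisely the kind of variance the benchmark is designed to surface.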
The evaluation protocol includes a system prompt to inform models they are operating in an audio setting, advising them to be robust to ASR errors and clarify ambiguous information. A key innovation is a clarification mechanism, where an LLM judge and a simulated user support spelling or disambiguation queries without penalizing the model for asking. This allows agents to confirm critical details before acting, preventing reckless tool calls. Clarification turns are ignored when computing the final function-calling score, ensuring they enable correctness rather than inflate scores.
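The scoring rule can be sketched as follows. The turn structure and labels are assumptions for illustration, not the benchmark's actual implementation, but they capture the stated principle: clarification turns are filtered out before function-calling accuracy is computed.

```python
# Sketch of the scoring idea: clarification turns are excluded before
# function-call accuracy is computed, so asking for a spelling is neither
# penalized nor rewarded on its own. Turn labels are illustrative only.
from dataclasses import dataclass


@dataclass
class Turn:
    kind: str          # "clarification" or "tool_call"
    correct: bool = False


def function_calling_score(turns: list[Turn]) -> float:
    """Accuracy over tool-call turns only; clarification turns are ignored."""
    tool_calls = [t for t in turns if t.kind == "tool_call"]
    if not tool_calls:
        return 0.0
    return sum(t.correct for t in tool_calls) / len(tool_calls)


# An agent that first asks to confirm a spelling and then calls the tool
# correctly scores the same as one that got it right immediately.
turns = [Turn(kind="clarification"), Turn(kind="tool_call", correct=True)]
assert function_calling_score(turns) == 1.0
```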
Initial results from BFCL Audio reveal a significant performance drop. Pipelined systems typically see a 10–20% decrease relative to text-mode BFCL, largely due to failures in handling entity dictation. E2E models degrade even more sharply against the original text baseline. While E2E systems excel in naturalness and responsiveness for general conversation, they currently underperform pipelined systems on function-calling precision, pointing to a weakness in their multimodal function-calling capabilities. This underscores the ongoing challenge of building voice AI that is both natural and reliably precise.

