Databricks GPT-5.5 Outperforms GPT-4 on OfficeQA Benchmark

Databricks Research Engineer Arnav Singhvi reveals GPT-5.5, a new AI model achieving state-of-the-art results on the OfficeQA benchmark and outperforming GPT-4.

Arnav Singhvi, Research Engineer at Databricks, discusses GPT-5.5 performance.
Image credit: Databricks / OpenAI YouTube

In the rapidly evolving world of large language models, Databricks has unveiled a significant advancement with GPT-5.5. Arnav Singhvi, a Research Engineer at Databricks, presented findings that showcase GPT-5.5 outperforming its predecessor, GPT-4, on the OfficeQA benchmark. This development signals a notable leap in the accuracy and performance of AI models designed for complex reasoning and task completion.

The conversation, led by Singhvi, highlighted the challenges and triumphs in developing more capable AI agents. Databricks, a prominent company in data analytics and AI, is positioning itself at the forefront of this progress by refining models that can better understand and interact with real-world data and tasks.

The full discussion can be found on OpenAI's YouTube channel.

Introducing GPT-5.5 with Databricks — from OpenAI's YouTube channel

Arnav Singhvi and Databricks' AI Push

Arnav Singhvi, a Research Engineer at Databricks, is a key figure in the company's pursuit of advanced AI solutions. His work focuses on enhancing the capabilities of AI models, particularly in areas like agent reasoning and natural language understanding. Databricks, founded by the creators of Apache Spark, has established itself as a leader in unified data analytics platforms, and its expansion into sophisticated AI model development, including the work on GPT-5.5, underscores its commitment to providing comprehensive AI capabilities to its users.

GPT-5.5: A New Benchmark for Accuracy

The core of Singhvi's presentation revolved around the performance of GPT-5.5 on the OfficeQA benchmark, a dataset designed to evaluate the ability of AI models to answer questions based on complex documents. Singhvi reported that GPT-5.5 achieved a 46% reduction in errors compared to GPT-5.4. This substantial improvement places GPT-5.5 at the top of current benchmarks for agent performance, with an accuracy rate exceeding 50% on the tested tasks.

Specifically, in the 'Oracle PDF + web_search' scenario, where models answer questions using targeted document analysis alongside web searches, GPT-5.5 achieved 64.66% accuracy, a marked increase over GPT-5.4's 57.14%. The 'Full Agent Workflow' scenario, which involves running agents end-to-end in a 'Codex agent harness,' also saw significant gains, with GPT-5.5 reaching 52.63% accuracy compared to GPT-5.4's 36.09%.

Singhvi emphasized that these metrics were derived from 133 OfficeQA Pro questions, with prior baselines from the OfficeQA Pro paper and new results from current OfficeQA evaluations. The substantial performance leap suggests that GPT-5.5 is not merely an incremental update but a considerable step forward in AI agent capabilities.
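As an illustrative aside, the relative error reduction implied by a pair of accuracy figures can be computed as below. The function name is ours, and note that per-scenario reductions computed this way need not match an aggregate figure reported across the whole benchmark:

```python
def relative_error_reduction(old_acc: float, new_acc: float) -> float:
    """Fraction of the old model's errors eliminated by the new model."""
    old_err = 1.0 - old_acc
    new_err = 1.0 - new_acc
    return (old_err - new_err) / old_err

# Per-scenario reductions implied by the reported OfficeQA accuracies:
oracle = relative_error_reduction(0.5714, 0.6466)    # Oracle PDF + web_search
workflow = relative_error_reduction(0.3609, 0.5263)  # Full Agent Workflow
print(f"Oracle PDF + web_search: {oracle:.1%}")
print(f"Full Agent Workflow:     {workflow:.1%}")
```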

Advancements in Agent Reasoning

A significant factor contributing to GPT-5.5's enhanced performance is its improved ability in agent reasoning, particularly in parsing complex and messy documents. Singhvi noted that Databricks' approach involves using multi-agent setups where one agent performs the parsing and another agent acts as a supervisor. This architecture allows for more robust handling of user queries.

"We rely on custom parsing at Databricks and having these multi-agent setups that can perform parsing within their agent harnesses," Singhvi explained. This strategy is crucial for tackling real-world scenarios where user inputs are often unstructured and require sophisticated interpretation.

The effectiveness of GPT-5.5 in this area is demonstrated by its ability to perform a 'stepwise' function, which allows it to break down complex tasks into manageable steps. This contrasts with earlier models that struggled with such multi-stage reasoning, often failing to parse digits correctly or requiring manual intervention.
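A stepwise loop of this kind can be sketched as follows; the plan/execute split and the helper names are our assumptions about what such a harness might look like, with stand-in functions where a real agent would call a model or tools:

```python
def plan_steps(task: str) -> list[str]:
    """Stand-in planner: a real agent would ask the model to decompose the task."""
    return [f"step {i} of '{task}'" for i in range(1, 4)]


def execute_step(step: str, context: list[str]) -> str:
    """Stand-in executor: a real agent would run tool or model calls here."""
    return f"result of ({step}) given {len(context)} prior results"


def run_stepwise(task: str) -> list[str]:
    """Break a complex task into steps and execute them one at a time,
    feeding earlier results into later steps."""
    results: list[str] = []
    for step in plan_steps(task):
        results.append(execute_step(step, results))
    return results


for line in run_stepwise("total the figures across three scanned invoices"):
    print(line)
```

Because each step sees the accumulated results of earlier steps, a failure can be isolated to one stage rather than invalidating the whole run, which is what distinguishes this from single-shot prompting.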

The Role of Databricks' Platform

Singhvi also touched upon how Databricks' platform facilitates the development and deployment of such advanced AI agents. The platform provides tools and infrastructure that enable users to build and manage custom agent workflows. This includes the ability to integrate various models, such as those from OpenAI, and orchestrate their interactions through Databricks' agent framework.

The demonstration of the 'openai-agents-sdk-agent' on the Databricks platform illustrated how users can deploy custom agents and interact with them. This showcase highlighted the practical application of these models in creating intelligent assistants capable of answering queries about the platform itself, such as "What is Databricks?"

The ability to use GPT-5.5 as a supervisor model for these custom agent workflows is a key differentiator. It allows for more sophisticated control and reasoning within the agent's operation, leading to more accurate and reliable outcomes. This is particularly valuable for enterprise applications where precision and efficiency are paramount.

Future Implications for AI Development

The performance gains reported for GPT-5.5 have significant implications for the future of AI development. The ability to handle complex document parsing and execute multi-step reasoning more effectively opens doors for a wider range of applications, from advanced customer support bots to sophisticated data analysis tools.

As AI models become more capable of understanding and processing unstructured data, their utility in enterprise settings will continue to grow. Databricks' focus on providing a unified platform for data and AI development, coupled with advancements like GPT-5.5, positions them to capitalize on this trend. The ongoing competition among AI labs and tech companies to develop more powerful and efficient models suggests that we can expect further breakthroughs in the near future.
