Abed Matini, Senior Backend Developer at Ogilvy, recently presented "Bypassing the Multimodal Tax: Hybrid RAG, SQL RRF & UI Telemetry" at the AI Engineer World's Fair 2026. The talk explored the challenges of building robust, efficient Retrieval-Augmented Generation (RAG) systems, and solutions to them, particularly for systems that handle multimodal data and complex user queries.
Understanding the Multimodal Tax
Matini began by defining the "multimodal tax" as the inherent cost and complexity associated with processing and integrating various data formats, such as text, images, and structured data, within a single RAG system. He highlighted two primary problems that document chatbots face: the cost of reading documents multiple times and the scattering of search capabilities across too many tools.
He elaborated on the cost factor, explaining that cloud vision APIs can charge between 500 and 1,000 tokens per page just to convert a PDF into text. That cost is incurred before a user even asks a question, and it grows with document length. For instance, a 200-page manual would consume 100,000 tokens at the lower rate and up to 200,000 at the upper one. Tables are often particularly difficult and costly to process accurately.
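The arithmetic behind that estimate can be sketched in a few lines. This is a minimal illustration, not something shown in the talk: the function name `ingestion_token_range` and its parameters are hypothetical, and the only figures taken from the text are the 500 and 1,000 tokens-per-page rates and the 200-page manual.

```python
def ingestion_token_range(
    pages: int,
    low_per_page: int = 500,    # lower per-page rate quoted in the talk
    high_per_page: int = 1000,  # upper per-page rate quoted in the talk
) -> tuple[int, int]:
    """Return the (low, high) token cost of converting a document to text.

    Hypothetical helper: it only multiplies page count by a per-page rate
    and ignores extra costs such as retries or table handling.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * low_per_page, pages * high_per_page


low, high = ingestion_token_range(200)
print(f"{low:,} to {high:,} tokens")  # prints "100,000 to 200,000 tokens"
```

Because the cost scales linearly with page count, any step that re-reads the same document multiplies this figure again, which is the first of the two problems described above.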
