Shashi Jagtap, Founder of Superagentic AI, presented a novel approach called TurboQuant, designed to significantly enhance the efficiency of AI agents by turbocharging their retrieval capabilities. The core problem addressed by TurboQuant lies in the substantial memory footprint and computational cost associated with large language models (LLMs) and their retrieval mechanisms, particularly the KV cache and vector embeddings. Jagtap explained how traditional methods of compression often lead to a drop in quality or require extensive retraining, a trade-off that TurboQuant aims to overcome.
Related startups
The Memory Challenge in AI Agents
Jagtap highlighted the critical issue of memory consumption in AI agents, especially those relying on retrieval-augmented generation (RAG) systems. Every token processed by an agent is cached in memory, forming the KV cache, which grows with each interaction. As the context window expands, so does the memory demand. On Mac devices, this problem is exacerbated as the model, cache, and vector index compete for a shared pool of RAM, often leading to performance degradation.
Embeddings, which are essentially lists of numbers representing data, are typically stored at 32-bit precision. However, Jagtap pointed out that for search operations, only the relative proximity of vectors matters, and a much lower precision, such as 3 to 4 bits, is often sufficient. Storing these embeddings at full 32-bit precision results in significant memory wastage, with estimates suggesting up to 5x more memory is used than necessary.
