Superlinked's Filip Makraduli on Small Model Inference Infrastructure

Filip Makraduli of Superlinked discusses the critical need for robust small model inference infrastructure, highlighting Superlinked's open-source solution.

Filip Makraduli presenting on small model inference infrastructure.
Image credit: AI Engineer Europe · AI Engineer

Filip Makraduli, representing Superlinked, took the stage to discuss a critical gap in the AI ecosystem: the lack of robust infrastructure for small model inference. He highlighted that while many teams focus on building powerful models, the systems to deploy and run them efficiently are often an afterthought. This oversight, Makraduli explained, leads to wasted resources and suboptimal performance, especially for AI agents and complex workflows.


Bridging the Infrastructure Gap

Makraduli began by framing the problem: the AI community's tendency to overlook the practicalities of inference, particularly for smaller models. He noted that his own previous work focused heavily on model training and evaluation, but he soon realized the crucial importance of the underlying infrastructure. This realization led him to join Superlinked, a company focused on building this essential layer for AI search and document processing.

The 'Yin' and 'Yang' of Inference

He introduced a helpful analogy of the 'yin' and 'yang' of model inference. The 'yin' represents the models themselves – their performance, accuracy, and the advancements in areas like open-source model development. The 'yang,' conversely, is the infrastructure required to make these models work effectively in production, encompassing aspects like routing, autoscaling, monitoring, and deployment.

Makraduli presented data showing the rapid growth of open-source models, highlighting that this ecosystem is not slowing down. However, he emphasized that the performance of these models is heavily reliant on the quality of the inference infrastructure. He pointed to a graph illustrating how model performance can degrade significantly with increased context length, a phenomenon known as 'context rot.' This underscores the need for sophisticated infrastructure to manage context effectively.
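
To make the context-length point concrete, the sketch below shows one simple way an inference layer might enforce a per-model context budget before dispatching a request, truncating or flagging oversized inputs rather than letting quality silently degrade. This is a generic illustration, not Superlinked's implementation; the model names and token limits are assumptions.

```python
# Illustrative only: a pre-inference context-budget guard.
# Model names and token limits are assumptions, not Superlinked's values.
from dataclasses import dataclass

@dataclass
class ContextBudget:
    max_tokens: int    # hard limit the model accepts
    safe_tokens: int   # threshold beyond which quality tends to degrade ("context rot")

BUDGETS = {
    "bge-m3": ContextBudget(max_tokens=8192, safe_tokens=4096),
    "qwen1.5-4b": ContextBudget(max_tokens=32768, safe_tokens=8192),
}

def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 characters per token); a real system would use the model's tokenizer.
    return max(1, len(text) // 4)

def guard_context(model: str, text: str) -> str:
    budget = BUDGETS[model]
    n = estimate_tokens(text)
    if n > budget.max_tokens:
        # Truncate to the hard limit; a production system might chunk and rerank instead.
        return text[: budget.max_tokens * 4]
    if n > budget.safe_tokens:
        print(f"warning: {model} context of ~{n} tokens may degrade quality")
    return text
```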

From Theory to Production: Superlinked's Solution

Superlinked's approach to this problem is to build a comprehensive inference engine that supports a wide array of models and architectures. Makraduli showcased a diagram illustrating their production cluster, which is designed for efficiency and scalability. This cluster features multiple pools of GPUs (L4, A100, H100) to accommodate different model requirements. Key components include:

  • Routers: Handle per-model and per-LoRA affinity, route requests to the appropriate GPU pool, and manage high concurrency.
  • Model Pools: Dedicated groups of GPUs for specific models like Stella, Glider, bge-m3, reranker, Qwen1.5-4B, and Erlit.
  • LRU Eviction: A mechanism to swap out models from memory based on their least recently used status, optimizing GPU utilization (a minimal sketch of this routing-plus-eviction logic follows the list).
  • Monitoring: Integration with Prometheus for performance tracking.
  • Scalability: KEDA for scale-to-zero capabilities, ensuring cost-efficiency when models are not in use.
  • Deployment: The entire infrastructure is managed through tools like Terraform and Helm, allowing for streamlined deployment on cloud platforms like AWS and GCP.
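
Taken together, the routing and eviction items above reduce to a small amount of core logic. The following Python sketch is purely illustrative, not Superlinked's open-source code: requests are routed to a GPU pool by model name, a model is loaded on first use, and when a pool's memory budget would be exceeded the least recently used model is evicted. Pool assignments, VRAM figures, and model sizes are assumptions.

```python
# Illustrative sketch of per-model pool routing with LRU eviction.
# Pool assignments and memory figures are assumptions, not Superlinked's config.
from collections import OrderedDict

# Which GPU pool serves which model (per-model affinity).
POOL_FOR_MODEL = {
    "stella": "l4-pool",
    "bge-m3": "l4-pool",
    "glider": "l4-pool",
    "reranker": "a100-pool",
    "qwen1.5-4b": "h100-pool",
}

class GpuPool:
    def __init__(self, name: str, vram_gb: int):
        self.name = name
        self.vram_gb = vram_gb
        self.loaded = OrderedDict()  # model name -> size in GB, kept in LRU order

    def ensure_loaded(self, model: str, size_gb: int) -> None:
        if model in self.loaded:
            self.loaded.move_to_end(model)  # mark as most recently used
            return
        # Evict least recently used models until the new one fits.
        while self.loaded and sum(self.loaded.values()) + size_gb > self.vram_gb:
            evicted, _ = self.loaded.popitem(last=False)
            print(f"[{self.name}] evicting {evicted}")
        self.loaded[model] = size_gb
        print(f"[{self.name}] loaded {model}")

POOLS = {
    "l4-pool": GpuPool("l4-pool", vram_gb=24),
    "a100-pool": GpuPool("a100-pool", vram_gb=80),
    "h100-pool": GpuPool("h100-pool", vram_gb=80),
}

def route(model: str, size_gb: int) -> GpuPool:
    pool = POOLS[POOL_FOR_MODEL[model]]
    pool.ensure_loaded(model, size_gb)
    return pool

# Example: hot models stay resident; loading a large model evicts the coldest one.
route("stella", 2)
route("bge-m3", 8)
route("stella", 2)
route("glider", 16)  # exceeds the 24 GB pool, so bge-m3 (least recently used) is evicted
```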

Makraduli highlighted that Superlinked has open-sourced both the model inference layer and the production cluster, providing valuable tools for the community. He stressed that this approach allows developers to easily integrate various models and adapt their inference strategies without needing to rebuild infrastructure from scratch.

The Importance of Adaptability

He also touched upon the diversity of model architectures, noting that each has unique requirements for optimization. For instance, models like BERT, Qwen2, and ColBERT have different normalization, attention mechanisms, and output structures. Superlinked's solution aims to provide a universal engine that can adapt to these differences, rather than requiring separate, specialized inference backends for each model. This adaptability is key to supporting the rapid evolution of LLMs and ensuring that developers can efficiently leverage the best models for their specific tasks.
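
As an illustration of what that adaptability involves in practice, the sketch below dispatches on model family to apply a common output-handling convention for each: mean pooling with L2 normalization for BERT-style encoders, last-token pooling for decoder-style embedders such as Qwen2-based models, and per-token multi-vector output for ColBERT-style late interaction. These are common defaults rather than the speaker's exact specification, and individual models can deviate from them.

```python
# Illustrative dispatch over model families; conventions vary per model,
# so treat these choices as common defaults rather than a universal rule.
import torch
import torch.nn.functional as F

def pool_outputs(family: str, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """hidden: (batch, seq, dim) last hidden states; mask: (batch, seq) attention mask."""
    if family == "bert":
        # Mean-pool over valid tokens, then L2-normalize (typical for BERT-style embedders).
        m = mask.unsqueeze(-1).float()
        emb = (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)
        return F.normalize(emb, p=2, dim=-1)
    if family == "qwen2":
        # Last-token pooling (common for decoder-style embedding models).
        last = mask.sum(dim=1) - 1  # index of the final valid token per sequence
        emb = hidden[torch.arange(hidden.size(0)), last]
        return F.normalize(emb, p=2, dim=-1)
    if family == "colbert":
        # Late interaction: keep one normalized vector per token instead of one per text.
        return F.normalize(hidden, p=2, dim=-1) * mask.unsqueeze(-1)
    raise ValueError(f"unknown model family: {family}")
```

In a universal engine of the kind described, this per-family branching is exactly the layer that would otherwise force teams to maintain separate, specialized inference backends.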

Ultimately, Makraduli's message was clear: building effective AI agents and workflows requires a strong foundation in inference infrastructure. By focusing on efficiency, scalability, and adaptability, Superlinked is contributing to the advancement of the AI field, making it easier for developers to harness the power of smaller, more specialized models.
