Filip Makraduli, representing Superlinked, took the stage to discuss a critical gap in the AI ecosystem: the lack of robust infrastructure for small model inference. He highlighted that while many teams focus on building powerful models, the systems to deploy and run them efficiently are often an afterthought. This oversight, Makraduli explained, leads to wasted resources and suboptimal performance, especially for AI agents and complex workflows.
Bridging the Infrastructure Gap
Makraduli began by framing the problem: the AI community's tendency to overlook the practicalities of inference, particularly for smaller models. He noted that his own previous work focused heavily on model training and evaluation, but he soon realized the crucial importance of the underlying infrastructure. This realization led him to join Superlinked, a company focused on building this essential layer for AI search and document processing.
The 'Yin' and 'Yang' of Inference
He introduced a helpful analogy of the 'yin' and 'yang' of model inference. The 'yin' represents the models themselves – their performance, accuracy, and the advancements in areas like open-source model development. The 'yang,' conversely, is the infrastructure required to make these models work effectively in production, encompassing aspects like routing, autoscaling, monitoring, and deployment.
Makraduli presented data showing the rapid growth of open-source models, highlighting that this ecosystem is not slowing down. However, he emphasized that the performance of these models is heavily reliant on the quality of the inference infrastructure. He pointed to a graph illustrating how model performance can degrade significantly with increased context length, a phenomenon known as 'context rot.' This underscores the need for sophisticated infrastructure to manage context effectively.
From Theory to Production: Superlinked's Solution
Superlinked's approach to this problem is to build a comprehensive inference engine that supports a wide array of models and architectures. Makraduli showcased a diagram illustrating their production cluster, which is designed for efficiency and scalability. This cluster features multiple pools of GPUs (L4, A100, H100) to accommodate different model requirements. Key components include:
- Routers: Handle per-model and LoRA affinity, route requests to the appropriate GPU pool, and manage high concurrency.
- Model Pools: Dedicated groups of GPUs for specific models like Stella, Glider, bge-m3, reranker, Qwen1.5-4B, and Erlit.
- LRU Eviction: Evicts the least recently used models from GPU memory to make room for newly requested ones, keeping GPU utilization high.
- Monitoring: Integration with Prometheus for performance tracking.
- Scalability: KEDA for scale-to-zero capabilities, ensuring cost-efficiency when models are not in use.
- Deployment: The entire infrastructure is managed through tools like Terraform and Helm, allowing for streamlined deployment on cloud platforms like AWS and GCP.
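The LRU eviction described above can be sketched in a few lines. This is an illustrative model only, not Superlinked's actual implementation: the `ModelPool` class, its capacity limit, and the `_load` placeholder are all hypothetical, standing in for real GPU weight loading.

```python
from collections import OrderedDict


class ModelPool:
    """Illustrative LRU cache for models resident on a GPU pool.

    When loading another model would exceed capacity, the least
    recently used model is evicted to free memory for the new one.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity        # max models resident at once
        self._resident = OrderedDict()  # model name -> loaded handle, in LRU order

    def get(self, name: str):
        """Return a loaded model, loading (and possibly evicting) as needed."""
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # drop the least recently used model
        handle = self._load(name)
        self._resident[name] = handle
        return handle

    def _load(self, name: str):
        # Placeholder for the real work: reading weights onto the GPU.
        return f"<weights:{name}>"


pool = ModelPool(capacity=2)
pool.get("bge-m3")
pool.get("reranker")
pool.get("bge-m3")      # touch: bge-m3 becomes most recently used
pool.get("Qwen1.5-4B")  # capacity reached, so "reranker" (the LRU entry) is evicted
```

In a real cluster the eviction step would also have to synchronize with in-flight requests and actually free GPU memory, but the bookkeeping follows this same pattern.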
Makraduli highlighted that Superlinked has open-sourced both the model inference layer and the production cluster, providing valuable tools for the community. He stressed that this approach allows developers to easily integrate various models and adapt their inference strategies without needing to rebuild infrastructure from scratch.
The Importance of Adaptability
He also touched upon the diversity of model architectures, noting that each has unique requirements for optimization. For instance, models like BERT, Qwen2, and ColBERT have different normalization, attention mechanisms, and output structures. Superlinked's solution aims to provide a universal engine that can adapt to these differences, rather than requiring separate, specialized inference backends for each model. This adaptability is key to supporting the rapid evolution of LLMs and ensuring that developers can efficiently leverage the best models for their specific tasks.
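To make the architectural differences concrete, here is a minimal sketch of how one engine can dispatch output handling per architecture instead of shipping a separate backend for each. The function names and the `OUTPUT_HEADS` dispatch table are hypothetical, not Superlinked's API; the pooling behaviors themselves (CLS-token pooling for BERT-style encoders, normalized mean pooling for bge-m3-style dense embeddings, per-token multi-vector output for ColBERT-style late interaction) reflect how those model families are commonly used.

```python
import numpy as np

# token_embs has shape (seq_len, dim): one embedding per input token.

def cls_pool(token_embs: np.ndarray) -> np.ndarray:
    """BERT-style: take the first ([CLS]) token's embedding."""
    return token_embs[0]

def mean_pool_normalized(token_embs: np.ndarray) -> np.ndarray:
    """bge-m3-style dense output: mean over tokens, then L2-normalize."""
    pooled = token_embs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def multi_vector(token_embs: np.ndarray) -> np.ndarray:
    """ColBERT-style: keep one normalized vector per token (late interaction)."""
    return token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)

# One engine, many architectures: dispatch on a per-model config entry
# rather than hard-coding a single output format.
OUTPUT_HEADS = {
    "bert": cls_pool,
    "bge-m3": mean_pool_normalized,
    "colbert": multi_vector,
}

def postprocess(arch: str, token_embs: np.ndarray) -> np.ndarray:
    return OUTPUT_HEADS[arch](token_embs)
```

Note how the three heads even disagree on output shape: the first two return a single vector of shape `(dim,)`, while the ColBERT-style head returns a matrix of shape `(seq_len, dim)`. A universal engine has to accommodate exactly this kind of variation.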
Ultimately, Makraduli's message was clear: building effective AI agents and workflows requires a strong foundation in inference infrastructure. By focusing on efficiency, scalability, and adaptability, Superlinked is contributing to the advancement of the AI field, making it easier for developers to harness the power of smaller, more specialized models.
