Philip Kiely on AI Inference: Cost, Scale, and the Path Forward

Philip Kiely of Baseten discusses the critical challenges of AI inference and the strategies for serving AI models efficiently and cost-effectively.

Philip Kiely, Head of AI Education at Baseten, speaking on AI inference challenges.
Image credit: TWIML AI Podcast

In a recent TWIML AI Podcast episode, Philip Kiely, Head of AI Education at Baseten, joined host Sam Charrington to discuss the intricacies of AI inference. Kiely, who has spent over four years in the AI space, shared his insights on the challenges and opportunities in making AI models efficient and accessible for real-world applications. The conversation highlighted the critical differences between AI training and inference, emphasizing the need for specialized approaches to optimize the latter.

Kiely noted that while the AI field has seen tremendous progress in model training, the subsequent step of deploying these models for inference—where they are used to make predictions on new data—is often overlooked. This phase presents unique challenges related to cost, latency, and scalability, particularly as AI models become larger and more complex.

The full discussion, "How to Engineer AI Inference Systems" (episode 766), can be found on TWIML's YouTube channel.


The Inference Challenge

The discussion's central thesis concerned the difficulties unique to AI inference. "If you think about medicine, for example, it can take decades for research to reach a pharmacy," Kiely explained. "Even within AI, if you want to train a model off of a new technique, it can still take weeks or months to find the exact right way to express that technique. But with inference, the timeline is often hours." He elaborated that inference, unlike training, needs to happen in real time or near real time to be useful for many applications.

This demand for speed and efficiency means that companies need to carefully consider how their models are deployed. "Inference is often the bottleneck," Kiely stated. "It's where the rubber meets the road. If your inference is too slow or too expensive, your product simply won't be viable." He contrasted this with the training phase, which, while computationally intensive, can often be done asynchronously and with more tolerance for longer processing times.

Optimizing for Inference

The conversation then delved into the strategies and techniques companies employ to optimize their AI models for inference. Kiely highlighted several key areas:

  • Model Quantization: This involves reducing the precision of a model's weights and activations, often from 32-bit floating-point numbers down to 8-bit integers or even lower. "Quantization is a really powerful technique," Kiely said. "It can significantly reduce model size and memory footprint, leading to faster inference and lower computational costs." He noted that the process requires careful tuning to minimize accuracy loss; a minimal sketch follows this list.
  • Hardware Acceleration: Kiely discussed the importance of leveraging specialized hardware, such as GPUs, TPUs, and custom ASICs designed specifically for AI workloads. "The hardware is critical," he stated. "You can have the most optimized model, but if your hardware isn't up to par, you're still going to face significant latency and cost issues." A device-placement sketch also appears after this list.
  • Efficient Model Architectures: The choice of model architecture itself plays a crucial role. Kiely pointed to the trend toward smaller, more efficient models like MobileNets and EfficientNets, which are designed with inference performance in mind, and touched on techniques like model pruning and knowledge distillation for creating smaller, faster versions of larger, more accurate models (see the distillation sketch below).
  • Optimized Inference Engines: Software libraries and runtimes such as TensorFlow Lite, ONNX Runtime, and TensorRT are essential for deploying models efficiently across hardware platforms. Kiely emphasized that these tools bridge the gap between model development and production deployment; an ONNX export sketch rounds out the examples below.
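
To make quantization concrete, here is a minimal sketch (my illustration, not from the podcast) of symmetric per-tensor int8 quantization in NumPy. Production systems typically use calibrated, per-channel schemes from toolkits like TensorRT or PyTorch's quantization APIs.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    scale = float(np.abs(w).max()) / 127.0  # map the largest weight onto the int8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(256, 256).astype(np.float32)  # toy fp32 weight matrix
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale  # dequantize to measure the error

print(f"memory: {w.nbytes} -> {q.nbytes} bytes (4x smaller)")
print(f"max abs error: {np.abs(w - w_hat).max():.5f} (about scale/2 = {scale / 2:.5f})")
```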
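
On the hardware side, a hedged PyTorch sketch of the basic workflow: the same module runs on CPU or GPU, and on a CUDA device reduced-precision autocast lets inference exploit tensor cores. The model and shapes are illustrative only.

```python
import torch
import torch.nn as nn

# Illustrative model; any eval-mode module works the same way.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).eval().to(device)
x = torch.randn(32, 1024, device=device)

# inference_mode skips autograd bookkeeping; autocast runs matmuls in
# reduced precision (fp16 on GPU, bf16 on CPU) where it is numerically safe.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.inference_mode(), torch.autocast(device_type=device, dtype=amp_dtype):
    y = model(x)

print(y.dtype, y.device)
```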
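
For knowledge distillation, a minimal sketch of the standard Hinton-style loss (again my example, not Kiely's): a small student is trained to match the teacher's temperature-softened logits while still fitting the ground-truth labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL (teacher -> student) with hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps KL gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature**2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)  # in practice, from the frozen large model
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```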
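
And the inference-engine step often looks like the sketch below: export a trained PyTorch model to the framework-neutral ONNX format, then serve it with ONNX Runtime. This is a generic workflow, not anything Baseten-specific from the episode.

```python
import torch
import torch.nn as nn
import onnxruntime as ort

# Export a toy model to ONNX...
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
example = torch.randn(1, 4)
torch.onnx.export(model, example, "model.onnx",
                  input_names=["x"], output_names=["y"])

# ...then run it with ONNX Runtime, which selects optimized kernels per device.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
(y,) = session.run(["y"], {"x": example.numpy()})
print(y)
```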

The Role of Baseten

Kiely also offered insights into how Baseten itself addresses these challenges. Baseten offers a platform designed to simplify the deployment and management of machine learning models. "Our goal is to democratize AI inference," Kiely explained. "We provide tools and infrastructure that allow developers to deploy their models quickly and efficiently, regardless of their specific hardware or software stack." He highlighted Baseten's ability to handle various model formats and frameworks, as well as its features for monitoring and scaling inference workloads.
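
For flavor, here is a hedged sketch of what packaging a model for Baseten can look like with the company's open-source Truss project. The class and method names follow the Truss convention as I understand it from its documentation, and are worth verifying against the current docs before deploying.

```python
# model/model.py -- a sketch of the Truss packaging convention (assumed, not
# taken from the podcast): one class with a one-time load() and a per-request
# predict(), both invoked by the serving runtime.

class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once at startup: load weights, warm caches, etc.
        self._model = lambda text: text[::-1]  # placeholder standing in for a real model

    def predict(self, model_input: dict) -> dict:
        # Called for every inference request.
        return {"output": self._model(model_input["text"])}
```

The deployment mechanics (building, hardware provisioning, and the monitoring and scaling Kiely described) are then handled by the platform.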

He also touched upon the importance of understanding the entire AI lifecycle, from data collection and model training to deployment and monitoring. "Inference isn't just about running a model; it's about building a complete, reliable, and cost-effective AI-powered product," he stated.

Future Trends in AI Inference

Looking ahead, Kiely anticipates continued advancements in AI inference optimization. He pointed to the growing importance of edge AI, where models are deployed on devices rather than in the cloud, necessitating even more efficient and compact models. "The edge presents a whole new set of challenges and opportunities," he noted. "You're dealing with limited computational resources, power constraints, and the need for real-time performance."
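
As one concrete way models get squeezed down for the edge, here is a hedged TensorFlow Lite conversion sketch (my illustration; the episode does not walk through this). The converter's default optimizations include quantization and produce a compact flatbuffer suitable for on-device inference.

```python
import tensorflow as tf

# Toy Keras model standing in for whatever needs to run on-device.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables default quantization
tflite_bytes = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)
print(f"tflite model size: {len(tflite_bytes)} bytes")
```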

Kiely also highlighted the potential of new research areas like tinyML and the ongoing development of more efficient quantization techniques. "The field is moving incredibly fast," he concluded. "There's a constant drive to push the boundaries of what's possible in AI inference, making it more accessible, affordable, and performant for a wider range of applications." The conversation underscored that efficient inference is not just a technical challenge but a critical business imperative for companies aiming to succeed in the AI-driven economy.
