Multimodal Large Language Models (MLLMs) are now powering robotic navigation systems, and they’re compact enough to run at 10 frames per second on edge hardware. Vayu Robotics is building one to power autonomous delivery robots, with plans to expand its use across the full gamut of autonomous robots and vehicles.
The advent of LLMs brought a slew of use cases for enterprise and media-related tasks. By combining them with sensor arrays, synthetic data, and cutting-edge RAG-like techniques, Vayu Robotics has fashioned them into a high-powered operating system for robotic perception, reasoning, and navigation.
As a mark of validation, the Palo Alto-based startup just signed a deal with an e-commerce company to deploy 2,500 of its delivery robots.
“The goal for Vayu is to be the force to drive all the machines in the world,” Nitish Srivastava, co-founder and CTO of Vayu Robotics, told StartupHub.ai in an exclusive interview. “The reason why foundation models work here is because it’s largely a data problem.”
The corner cases of data
One area where traditional neural networks struggle is the corner-case problem, which Srivastava calls “the long tail problem.” While deep learning can manage most scenarios—navigating from point A to point B—it’s the long tail of rare, complex situations that poses the greatest challenge. “The hardness lies in the tail,” he says, referring to unexpected events that are difficult to predict and classify.
Vayu tackles this challenge by leveraging foundation models that generalize well across diverse scenarios. Srivastava explains, “If you wanted to build a really good text summarization model, you’d take an LLM trained to do next token prediction on all the data in the world, and it would outperform a model trained solely for that task. Similarly, a mobility foundation model trained on multiple navigation domains and robot form factors will do better at navigating a car because it has a better general representation space; it understands how to move around things. Foundation models will give us the power to solve all robotics problems.”
At a high level, the focus is on understanding what’s happening around the robot—this is "System 1," which involves quick, instinctive decisions such as recognizing obstacles. The more complex aspect, "System 2," deals with deeper reasoning and problem-solving, like determining how to navigate around unexpected obstacles or how objects in the environment might behave. Srivastava notes, “our mobility foundation model is solving System 2 problems, like identifying a new object in front of you and predicting its movement and behavior.”
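This two-layer split can be sketched in code. The following is a toy illustration, not Vayu’s actual system: a fast reflexive layer handles familiar obstacles, and anything it cannot classify is deferred to a slower deliberative layer (which, in an MLLM-based stack, would be a model query). All names here (`system1`, `system2`, `Detection`) are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    label: str         # object class, e.g. "pedestrian" or "unknown"
    distance_m: float  # estimated distance from the robot

KNOWN_CLASSES = {"pedestrian", "cyclist", "car", "curb"}

def system1(det: Detection) -> Optional[str]:
    """Fast, reflexive layer: react instantly to close, familiar obstacles."""
    if det.label in KNOWN_CLASSES:
        return "brake" if det.distance_m < 2.0 else "continue"
    return None  # unfamiliar object: defer to the deliberative layer

def system2(det: Detection) -> str:
    """Slow, deliberative layer: reason about novel objects.

    Stubbed here; a real implementation might query a multimodal model
    to identify the object and predict its behavior."""
    return "slow_and_replan"  # conservative default for unknowns

def decide(det: Detection) -> str:
    """Route each detection: System 1 first, System 2 as fallback."""
    action = system1(det)
    return action if action is not None else system2(det)
```

The point of the sketch is the routing, not the stub logic: the cheap layer runs on every frame, and the expensive reasoning layer is only invoked for the long-tail cases it cannot handle.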
