The true power of a large language model is only realized when it can interact with the world, yet deploying these advanced "brains" efficiently and scalably remains a significant challenge. Amit Maraj, a Developer Advocate at Google Cloud, recently demonstrated a practical solution on the Google Cloud Tech channel, showing how to connect an AI agent to a cloud-hosted Large Language Model (LLM) on Google Cloud Run. His presentation walked through the architecture and implementation of decoupling the LLM from the agent, a critical step for achieving independent scaling and cost optimization in production environments.
Maraj’s demonstration centered on taking a powerful, GPU-accelerated LLM—specifically, Gemma 270M deployed on Cloud Run—and giving it a conversational interface. He began by humorously likening the raw LLM to "the world's most expensive pet rock" when it exists in isolation, devoid of an agent to harness its capabilities. The core objective was to build an agent that could interact with users and leverage the LLM’s intelligence, all while ensuring that the computational demands of the LLM did not constrain the scalability or cost-efficiency of the agent layer.
A pivotal insight highlighted throughout the demonstration is the strategic imperative to decouple the LLM “brain” from its agent. "We want to decouple our LLM brain from our agent so we can scale and build it independently," Maraj explained. This architectural separation allows developers to manage and scale the resource-intensive LLM independently of the lighter-weight agent logic. For founders and AI professionals, this translates directly into optimized resource allocation and reduced operational overhead, as the agent service can scale quickly and cheaply without needing dedicated GPUs, while the LLM backend can be provisioned based on its specific computational demands.
The agent itself is constructed using Google’s Agent Development Kit (ADK), providing the conversational logic necessary to facilitate user interaction. The ADK simplifies the creation of agents, allowing developers to focus on defining the agent's persona and goals rather than low-level infrastructure concerns. In this specific example, the agent was configured to act as a friendly and knowledgeable zoo tour guide named Jen, tasked with making zoo visits more fun and educational by answering visitor questions.
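As a rough sketch, an agent along these lines can be defined with the ADK's Python API. The agent name, instruction text, and Gemma model tag below are illustrative rather than taken from the video, and the model wiring via LiteLLM is covered in the next section:

```python
# agent.py -- minimal sketch of an ADK agent with a zoo-guide persona.
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

root_agent = Agent(
    name="jen_zoo_guide",
    # LiteLLM routes requests to the self-hosted Gemma service; the model tag
    # shown here is an assumed Ollama-style identifier, not confirmed by the demo.
    model=LiteLlm(model="ollama_chat/gemma3:270m"),
    description="A friendly zoo tour guide.",
    instruction=(
        "You are Jen, a friendly and knowledgeable zoo tour guide. "
        "Answer visitors' questions to make their visit more fun and educational."
    ),
)
```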
Crucially, the connection between the ADK agent and the cloud-hosted LLM is managed through LiteLLM, a library praised by Maraj as "a fantastic library for connecting to hundreds of different model APIs with a unified interface." This abstraction layer is a significant boon for developers, offering a consistent API regardless of the underlying LLM provider or model architecture. It mitigates vendor lock-in and simplifies model experimentation, allowing teams to swap out LLMs with minimal code changes, a valuable flexibility in the rapidly evolving AI landscape.
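LiteLLM's appeal is that the call shape stays the same across providers; only the model identifier (and, for self-hosted backends, an `api_base`) changes. A minimal sketch, assuming an Ollama-served Gemma endpoint (the URL below is a placeholder, not the demo's actual service):

```python
from litellm import completion

# The same unified interface works for hosted APIs and self-hosted backends;
# swapping models is a matter of changing the model string and api_base.
response = completion(
    model="ollama_chat/gemma3:270m",                  # self-hosted Gemma behind Ollama
    api_base="https://gemma-service-example.a.run.app",  # placeholder Cloud Run URL
    messages=[{"role": "user", "content": "What do red pandas eat?"}],
)
print(response.choices[0].message.content)
```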
Deployment of both the LLM and the ADK agent occurs on Google Cloud Run, a fully managed compute platform, and the difference in resource allocation underscores the decoupling strategy. While the Gemma LLM service requires GPU acceleration for inference, the ADK agent service is deployed with significantly fewer resources: less memory, fewer CPUs, and, critically, "no GPU." The agent service can stay this lightweight because its job is limited to managing session state, handling web requests, and routing them to the LLM backend.
Seamless communication between these two distinct services is achieved via environment variables. Maraj emphasized the importance of the `OLLAMA_API_BASE` variable, which directly passes the URL of the Gemma service to the agent. "When the agent calls LiteLLM, it will use this URL to send the request to our GPU backend. This is how the two services talk to each other," he clarified. This elegant solution ensures that the agent knows exactly where to send its requests for LLM processing, maintaining the logical connection while physical resources remain separated.
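In code, this wiring can be as simple as reading the variable at agent startup and handing it to LiteLLM. A sketch, assuming the ADK's `LiteLlm` wrapper forwards extra keyword arguments such as `api_base` to the underlying LiteLLM call:

```python
import os
from google.adk.models.lite_llm import LiteLlm

# Cloud Run injects OLLAMA_API_BASE into the agent container at deploy time,
# pointing at the GPU-backed Gemma service. LiteLLM recognizes this variable
# for Ollama-served models, but passing it explicitly makes the dependency clear.
gemma_url = os.environ["OLLAMA_API_BASE"]

remote_gemma = LiteLlm(model="ollama_chat/gemma3:270m", api_base=gemma_url)
```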
The demonstration concluded with a successful test of the deployed agent in the ADK’s built-in web UI. When asked about red pandas' diet or why poison dart frogs are brightly colored, the agent, powered by the remote Gemma LLM, provided accurate and detailed responses. This real-time interaction validates the decoupled architecture: the agent can query the remote LLM and deliver answers to users without hosting the model itself, laying the groundwork for scaling each service independently as traffic grows.
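Outside the web UI, the same agent can be exercised programmatically. The following is a rough sketch using the ADK's in-memory runner; the exact runner and session APIs have shifted between ADK releases, and the `agent` module import refers to the hypothetical file sketched earlier, so treat this as illustrative only:

```python
import asyncio
from google.adk.runners import InMemoryRunner
from google.genai import types

from agent import root_agent  # the zoo-guide agent sketched above (hypothetical module)

runner = InMemoryRunner(agent=root_agent, app_name="zoo_guide")

async def ask(question: str) -> None:
    # Create a session, send one user message, and print the final reply.
    session = await runner.session_service.create_session(
        app_name="zoo_guide", user_id="visitor"
    )
    message = types.Content(role="user", parts=[types.Part(text=question)])
    async for event in runner.run_async(
        user_id="visitor", session_id=session.id, new_message=message
    ):
        if event.is_final_response():
            print(event.content.parts[0].text)

asyncio.run(ask("Why are poison dart frogs so brightly colored?"))
```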

