Google's Gemini Computer Use is an AI that clicks

Google is taking another major step beyond the chatbot, releasing a new model designed to give AI agents eyes and hands to directly control software. Unveiled today, the Gemini 2.5 Computer Use model allows AI to interact with graphical user interfaces (GUIs) just like a person would, by looking at the screen, clicking buttons, typing in forms, and navigating complex websites.

The model is available in preview for developers via the Gemini API. It isn't about connecting to clean, structured APIs; it's about tackling the messy reality of the web, where most tasks still require direct human interaction. Google's demos show the model pulling pet details from one website, entering them into a separate CRM, and then booking an appointment: a multi-step, cross-application task of the kind that has long been the holy grail for AI assistants.

The system works in an iterative loop. A developer's agent sends the model a user's goal along with a screenshot of the current application window. Gemini Computer Use analyzes the image and the goal, then returns a specific action, like `click(x, y)` or `type("text")`. The agent executes that command, takes a new screenshot, and sends it back to the model for the next step. This continues until the task is complete.
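The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Google's client code: `Action`, `FakeBrowser`, `scripted_model`, and `run_agent` are all hypothetical names, and the "model" here is a hard-coded script standing in for a real call to the Gemini API. A real agent would also send actual screenshots rather than placeholder bytes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """One step proposed by the model (hypothetical shape)."""
    name: str                      # e.g. "click", "type", or "done"
    args: dict = field(default_factory=dict)


class FakeBrowser:
    """Stand-in for a real browser; it only records what it is told to do."""

    def __init__(self) -> None:
        self.log: list = []

    def screenshot(self) -> bytes:
        # Placeholder bytes; a real agent would capture the window here.
        return f"screen-{len(self.log)}".encode()

    def execute(self, action: Action) -> None:
        self.log.append((action.name, action.args))


def scripted_model(goal: str, screenshot: bytes, history: List[Action]) -> Action:
    """Assumed stand-in for the model: replays a fixed plan, then says 'done'."""
    plan = [
        Action("click", {"x": 120, "y": 340}),
        Action("type", {"text": "Rex"}),
        Action("done"),
    ]
    return plan[len(history)]


def run_agent(
    goal: str,
    browser: FakeBrowser,
    model: Callable[[str, bytes, List[Action]], Action],
    max_steps: int = 10,
) -> List[Action]:
    """Screenshot -> ask model -> execute -> repeat, until the model says 'done'."""
    history: List[Action] = []
    for _ in range(max_steps):
        shot = browser.screenshot()
        action = model(goal, shot, history)
        if action.name == "done":
            return history
        browser.execute(action)
        history.append(action)
    raise RuntimeError("step budget exhausted before the task completed")
```

The `max_steps` cap is a design choice of this sketch rather than something the source describes; it simply keeps a confused agent from looping forever.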

The race for a true AI assistant

Google is making aggressive performance claims, positioning Gemini Computer Use as a direct challenger to similar efforts from OpenAI and Anthropic. According to benchmarks shared by Google DeepMind, the model outperforms leading alternatives on web and mobile control benchmarks, including Online-Mind2Web and WebVoyager. More importantly, Google claims it does so with lower latency, a critical factor for agents that need to feel responsive rather than sluggish. One chart shows Gemini achieving over 70% accuracy on a key benchmark in around 225 seconds, while competitors lag in both speed and accuracy.

This is the core of the AI agent race: creating a tool that can reliably and quickly perform complex digital chores. Early testers are already putting it to work. The AI assistant Poke.com claims the model is "50% faster and better" than other solutions, while the automation service Autotab saw an 18% performance boost on its hardest evaluations. Even Google's internal payments team is using it to salvage over 60% of its previously failing UI tests, turning a multi-day fix into an autonomous recovery.

But giving an AI the keys to your computer introduces a minefield of safety and security risks. An agent that can fill out forms can also empty a bank account. An agent that can read a screen can also exfiltrate sensitive data. Google says it has built safety features directly into the model and is providing developers with critical guardrails.

These include a per-step safety service that assesses every proposed action before it's executed. Developers can also provide system instructions that force the model to ask for user confirmation before taking high-stakes actions, like making a purchase or sending a message. While the model is primarily optimized for web browsers, with some promise on mobile, Google notes it is not yet ready for OS-level control.
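To make the two guardrails concrete, here is a small client-side sketch of how an agent might gate each proposed action. It is an analogue under stated assumptions, not Google's safety service: the `HIGH_STAKES` set, the `gate_action` function, and both callback parameters are hypothetical, and in the real system the confirmation behavior is requested through system instructions rather than hand-written checks like these.

```python
from typing import Callable

# Assumption: which action names count as high-stakes is the developer's call.
HIGH_STAKES = {"purchase", "send_message"}


def gate_action(
    action_name: str,
    safety_check: Callable[[str], bool],
    confirm: Callable[[str], bool],
) -> str:
    """Decide what to do with one proposed action before it runs.

    safety_check: hypothetical per-step check; returns False to block.
    confirm: hypothetical prompt to the user; returns True if they approve.
    """
    if not safety_check(action_name):
        return "blocked"
    if action_name in HIGH_STAKES and not confirm(action_name):
        return "declined"
    return "execute"


# Usage: an ordinary click passes straight through, while a purchase
# goes ahead only if the user approves it.
allow_all = lambda name: True
user_says_no = lambda name: False
print(gate_action("click", allow_all, user_says_no))
print(gate_action("purchase", allow_all, user_says_no))
```

Running the check before execution, rather than after, mirrors the article's point that every proposed action is assessed before it touches the screen.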

This release moves the AI agent from a theoretical extension of the chatbot to a practical, if still nascent, tool for developers. The promise is a future where you can simply ask your AI to handle complex digital tasks. The challenge will be ensuring that in giving AI the ability to click for us, we don't open the door to it doing things we never intended.

© 2025 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.