LinkedIn's AI Tester Sees Bugs

LinkedIn's new AI QA Agent uses generative AI and VLMs to autonomously test apps, finding bugs and democratizing quality assurance for non-engineers.

Figure: The QA Agent's decision loop integrates planning, grounding, execution, and evaluation for robust testing. (Image: LinkedIn Engineering)

LinkedIn is rethinking software quality with its new AI Quality Assurance (QA) Agent. The platform, used by 1.3 billion members across iOS, Android, and Web, has to behave correctly across countless permutations of user state. Traditional testing methods struggle to keep pace, especially as agentic coding tools increase the volume of code changes.

LinkedIn's UI is not one app but thousands of combinations of user type, language, and experiments, so features can regress silently. Employee bug reports help, but manual exploration doesn't scale. This led LinkedIn to build an autonomous digital tester.


This QA Agent leverages generative AI and Vision-Language Models (VLMs) to perceive and interact with applications like a human. It can execute complex end-to-end workflows across platforms. Early results show it has flagged over 200 valid bugs and caught critical regressions in revenue-impacting areas.

Crucially, the agent allows product and engineering teams to author tests using natural language. This shifts quality assurance from a developer-only task to a shared responsibility.

Beyond Scripts: Vision-Language Models

Traditional automation relies on brittle code selectors. VLMs, however, 'see' the screen, understanding text, icons, and hierarchy. This decouples testing from underlying code changes.

The QA Agent employs a multi-model approach for different cognitive tasks. One model handles high-level planning, deciding the next action based on screenshots and goals. Another model performs analytical reasoning for error detection and bug reporting.

A third, fine-tuned model handles visual grounding, translating natural language instructions like 'Tap the Apply button' into precise screen coordinates for execution.

Agent Architecture: System 1 and System 2

To balance AI costs with performance, LinkedIn adopted a hybrid architecture inspired by human cognition: System 1 (fast, intuitive) and System 2 (slow, deliberate). This split determines when the agent replays a known path and when it calls a model.

System 1 uses deterministic replay. If a test ran successfully before and the UI hasn't changed, the agent follows a stored path using element signatures. This is fast and cost-effective.

System 2 activates when the UI changes or System 1 fails. VLMs take over, with the planner generating a structured plan, including fallback actions and backtracking needs. The grounding model then finds coordinates for execution.
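The hand-off between the two systems can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not LinkedIn's implementation: `Step`, `run_step`, the signature strings, and the `slow_path` callable (a stand-in for the planner and grounding models) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Coords = Tuple[int, int]


@dataclass
class Step:
    instruction: str        # natural language, e.g. "Tap the Apply button"
    element_signature: str  # hypothetical identifier stored from a past successful run


def run_step(
    step: Step,
    screen_signatures: Dict[str, Coords],          # signatures visible right now
    slow_path: Callable[[str], Optional[Coords]],  # stand-in for planner + grounding VLMs
) -> Tuple[str, Coords]:
    """Return (mode, coordinates) for one step, preferring the cheap path."""
    # System 1: deterministic replay when the stored signature is still on screen.
    coords = screen_signatures.get(step.element_signature)
    if coords is not None:
        return ("system1", coords)
    # System 2: the UI changed, so fall back to model-driven grounding.
    coords = slow_path(step.instruction)
    if coords is None:
        raise LookupError(f"could not ground: {step.instruction!r}")
    return ("system2", coords)


if __name__ == "__main__":
    step = Step("Tap the Apply button", "btn:apply")
    print(run_step(step, {"btn:apply": (10, 20)}, lambda text: None))
    print(run_step(step, {}, lambda text: (30, 40)))
```

The point of the shape is cost: the model-backed callable is only invoked when the lookup misses, so an unchanged UI never triggers a model call.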

A feedback loop biases the planner toward previously successful natural language instructions for specific tasks.
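One simple way to picture such a feedback loop is a per-task tally of instructions that worked, used to reorder the planner's candidates. The sketch below assumes exactly that; the class name `InstructionMemory` and its methods are hypothetical, and the article does not say how LinkedIn stores or weights this history.

```python
from collections import defaultdict
from typing import Dict, List


class InstructionMemory:
    """Counts which natural language instructions succeeded for each task."""

    def __init__(self) -> None:
        self._wins: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def record_success(self, task: str, instruction: str) -> None:
        self._wins[task][instruction] += 1

    def rank(self, task: str, candidates: List[str]) -> List[str]:
        """Order candidates by past successes; ties keep the planner's order."""
        wins = self._wins.get(task, {})
        # sorted() is stable, so unseen instructions stay in their original order.
        return sorted(candidates, key=lambda c: -wins.get(c, 0))


if __name__ == "__main__":
    memory = InstructionMemory()
    memory.record_success("apply_to_job", "Tap the 'Easy Apply' button")
    print(memory.rank("apply_to_job", ["Tap Apply", "Tap the 'Easy Apply' button"]))
```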

Evaluators as Guardrails

To prevent the agent from getting stuck or hallucinating, several evaluators act as guardrails. The View Tree Evaluator checks for meaningful UI changes post-action.

The Tracking Log Evaluator verifies that expected analytics events fire correctly. The Action History Evaluator ensures the sequence of actions aligns with the test's high-level intent.
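Two of these checks are mechanical enough to sketch in Python. The version below is an assumption about how they could work, not a description of LinkedIn's evaluators: the view tree is modeled as nested dicts and lists, `VOLATILE_KEYS` is a made-up set of noise fields to ignore, and the event names are invented. The Action History Evaluator is left out because it depends on a model's judgment.

```python
import hashlib
import json
from typing import Any, List

# Assumed noise fields that change on every frame and should not count as a UI change.
VOLATILE_KEYS = {"timestamp", "animation_frame"}


def _strip(node: Any) -> Any:
    """Recursively drop volatile keys from a nested dict/list view tree."""
    if isinstance(node, dict):
        return {k: _strip(v) for k, v in node.items() if k not in VOLATILE_KEYS}
    if isinstance(node, list):
        return [_strip(v) for v in node]
    return node


def tree_fingerprint(tree: Any) -> str:
    canonical = json.dumps(_strip(tree), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def view_tree_changed(before: Any, after: Any) -> bool:
    """True when the action produced a meaningful change in the view tree."""
    return tree_fingerprint(before) != tree_fingerprint(after)


def missing_events(expected: List[str], fired: List[str]) -> List[str]:
    """Expected analytics events that never fired, in their expected order."""
    fired_set = set(fired)
    return [event for event in expected if event not in fired_set]
```

A tap that leaves the fingerprint unchanged is a hint that the agent hit nothing useful, and a non-empty result from `missing_events` points at a tracking regression even when the screen looks correct.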

An Error Detection Pipeline filters false positives. It first detects potential errors on screenshots, then uses a second LLM call for verification before reporting.
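The detect-then-verify pattern is easy to express as a filter. In this hedged Python sketch, `detect` and `verify` are placeholder callables standing in for the two model calls; their names and signatures are assumptions, and the fake implementations exist only to make the example runnable.

```python
from typing import Callable, List, Tuple


def confirmed_errors(
    screenshots: List[str],
    detect: Callable[[str], List[str]],  # stage 1: propose possible errors per screenshot
    verify: Callable[[str, str], bool],  # stage 2: independent second check
) -> List[Tuple[str, str]]:
    """Report only the candidates that survive the second verification call."""
    reports = []
    for shot in screenshots:
        for candidate in detect(shot):
            if verify(shot, candidate):
                reports.append((shot, candidate))
    return reports


if __name__ == "__main__":
    # Fake stages for illustration: the detector over-reports, the verifier is stricter.
    def fake_detect(shot: str) -> List[str]:
        return ["blank area", "error banner"] if shot == "checkout.png" else []

    def fake_verify(shot: str, candidate: str) -> bool:
        return candidate == "error banner"

    print(confirmed_errors(["feed.png", "checkout.png"], fake_detect, fake_verify))
```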

This rigorous, multi-stage verification is vital for maintaining trust with engineering teams.

Democratizing Quality: A New Operating Model

Perhaps the most significant impact of the QA Agent is cultural. By abstracting away testing complexity, the agent lets non-engineers own product quality.

Users can record interactions, and an LLM synthesizes these into semantic, natural language instructions. A tap at specific coordinates becomes "Tap the 'Easy Apply' button," for instance.
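To make the coordinate-to-instruction step concrete, here is a deterministic Python stand-in. It is not the LLM synthesis the article describes: it assumes the recorder also captures on-screen elements with bounding boxes, and it picks the smallest element containing the tap. `Element` and `describe_tap` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    label: str
    role: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom

    def area(self) -> int:
        return (self.right - self.left) * (self.bottom - self.top)


def describe_tap(x: int, y: int, elements: List[Element]) -> Optional[str]:
    """Turn a raw tap into a semantic instruction, or None if nothing was hit."""
    hits = [e for e in elements if e.contains(x, y)]
    if not hits:
        return None
    # The smallest enclosing element is the most specific target.
    target = min(hits, key=Element.area)
    return f"Tap the '{target.label}' {target.role}"


if __name__ == "__main__":
    screen = [
        Element("Job card", "card", 0, 0, 400, 300),
        Element("Easy Apply", "button", 20, 200, 180, 250),
    ]
    print(describe_tap(50, 220, screen))
```

The value of the semantic form is that it survives layout changes: the button can move, but the instruction still names it.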

A Human-in-the-Loop process allows creators to review and approve these synthesized plans, ensuring accurate test intent before execution.

This workflow seeds the agent's memory with known-good paths, enabling rapid, deterministic execution while retaining adaptability.

This approach democratizes quality at LinkedIn, creating a discipline where non-technical stakeholders can author resilient, agent-driven tests.

Rigorous Evaluation

An autonomous agent must be precise to maintain trust. LinkedIn prioritized precision over recall, ensuring reported bugs are reliable.
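The trade-off can be shown with a few lines of Python. The candidate list and confidence scores below are made-up numbers for illustration only, not LinkedIn's results, and the idea of a single reporting threshold is an assumption about how precision might be favored.

```python
from typing import List, Tuple


def precision_recall(candidates: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """candidates: (confidence score, is it a real bug). Report only score >= threshold."""
    tp = sum(1 for score, real in candidates if score >= threshold and real)
    fp = sum(1 for score, real in candidates if score >= threshold and not real)
    fn = sum(1 for score, real in candidates if score < threshold and real)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative data only: three real bugs and two false alarms.
CANDIDATES = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.40, False)]

if __name__ == "__main__":
    print(precision_recall(CANDIDATES, 0.8))  # strict: every report is real, one bug missed
    print(precision_recall(CANDIDATES, 0.5))  # lenient: all bugs found, one false alarm
```

With the strict threshold every reported bug is real but one real bug goes unreported; the lenient threshold finds all three at the cost of a false alarm, which is the kind of noise that erodes engineers' trust.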

Evaluating the agent against a live, constantly changing app proved problematic. To address this, the team developed a golden dataset framework.

Human reviewers annotate agent runs, creating a frozen snapshot of application states, screenshots, and expected actions. Evaluations then replay against this captured state, not the live app.

This method intercepts device commands, returning pre-recorded data and comparing agent actions against the recorded signatures. This ensures deterministic and reproducible evaluations.
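A replay harness of this kind might look like the following Python sketch. All names (`GoldenStep`, `ReplayDevice`, `evaluate`) are hypothetical, the agent is reduced to a function from screenshot to action signature, and the harness always advances along the recorded path even after a wrong action, which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenStep:
    screenshot: str          # recorded state served in place of the live app
    expected_signature: str  # human-annotated correct action for that state


class ReplayDevice:
    """Intercepts device commands: serves frozen screenshots, records actions."""

    def __init__(self, steps: List[GoldenStep]) -> None:
        self._steps = steps
        self._index = 0
        self.actions: List[str] = []

    def done(self) -> bool:
        return self._index >= len(self._steps)

    def screenshot(self) -> str:
        return self._steps[self._index].screenshot

    def perform(self, signature: str) -> None:
        # Nothing reaches a real device; the action is only logged.
        self.actions.append(signature)
        self._index += 1


def evaluate(agent: Callable[[str], str], steps: List[GoldenStep]) -> float:
    """Fraction of steps where the agent's action matches the annotation."""
    device = ReplayDevice(steps)
    while not device.done():
        device.perform(agent(device.screenshot()))
    matches = sum(a == s.expected_signature for a, s in zip(device.actions, steps))
    return matches / len(steps) if steps else 1.0
```

Because the device never consults the live app, two runs of the same agent version over the same golden steps produce the same score, which is what makes regressions in the agent itself measurable.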

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.