Meta's Nishant Gupta on Evaluating Agentic AI Systems

Nishant Gupta from Meta's Superintelligence Labs discusses the shift from accuracy-based evaluation to reliability-focused methods for agentic AI systems.

Nishant Gupta, Tech Lead at Meta, presents on production evaluations for agentic AI systems. · AI Engineer

Nishant Gupta, Tech Lead at Meta's Superintelligence Labs, recently spoke about a critical but often overlooked area: production evaluations for agentic AI systems. In his presentation, Gupta argued that traditional evaluation methods, designed for static models, are no longer sufficient for dynamic, complex agentic workflows.

The Illusion vs. Reality of AI Evaluation

Gupta opened with a common misconception in AI evaluation: a high benchmark accuracy score can create an illusion of reliability. He contrasted "The Illusion" of a single benchmark score, such as 90% accuracy, with "The Reality," a graph showing degraded production behavior and unpredictable reliability gaps. The discrepancy arises because benchmarks often miss failure modes that stay invisible until the system meets real users in real-world, dynamic environments.

The core of the issue, Gupta explained, is that AI systems have evolved faster than the methods used to evaluate them. While benchmarks measure model capabilities in isolated, static datasets, agentic systems operate through complex workflows involving tool usage, planning, and interaction with dynamic contexts. Consequently, evaluating these systems requires a fundamental shift in focus from mere output accuracy to the overall behavior and reliability of the entire workflow.

The Paradigm Shift: Output vs. Behavior

This shift is characterized by a change in evaluation goals. Traditional LLM evaluation focuses on output accuracy, using static datasets and single-path processing, with failure modes often simplified to hallucination. In contrast, agent evaluation must prioritize workflow behavior, operate within dynamic contexts, handle multi-path and tool-dependent execution, and account for cascading workflow failures.

Gupta elaborated on the anatomy of agentic failure, presenting a pyramid structure that starts with foundational issues like memory and safety, progressing through reasoning and planning errors, tool usage failures, and culminating in apex-level coordination conflicts in multi-agent systems. He argued that many teams still focus solely on hallucination as the primary failure mode, overlooking the more complex, systemic failures that emerge in production.

From Accuracy to Reliability: The SRE Mindset

To address these shortcomings, Gupta advocated for adopting a Site Reliability Engineering (SRE) mindset. Instead of solely focusing on accuracy, SREs prioritize system reliability, which encompasses a broader set of metrics including task success, tool success, planning quality, latency, cost, safety, human satisfaction, and recovery rate. This holistic view acknowledges that business success is driven by reliable outcomes, not just accurate outputs.
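As a rough illustration of what tracking several of these metrics together could look like, here is a minimal Python sketch. The `RunRecord` fields and the `scorecard` function are hypothetical names chosen for this example; the talk does not prescribe a schema, and metrics such as planning quality, safety, and human satisfaction are left out because they need a judge or a human rating.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run. All field names are assumptions made for this sketch."""
    task_succeeded: bool
    tool_calls: int
    tool_failures: int
    recovered_failures: int  # tool failures the agent recovered from
    latency_s: float
    cost_usd: float


def scorecard(runs):
    """Aggregate per-run records into a small reliability scorecard."""
    n = len(runs)
    total_calls = sum(r.tool_calls for r in runs)
    total_failures = sum(r.tool_failures for r in runs)
    recovered = sum(r.recovered_failures for r in runs)
    return {
        "task_success": sum(r.task_succeeded for r in runs) / n,
        # With no tool calls or no failures, report a perfect rate.
        "tool_success": 1 - total_failures / total_calls if total_calls else 1.0,
        "recovery_rate": recovered / total_failures if total_failures else 1.0,
        "mean_latency_s": sum(r.latency_s for r in runs) / n,
        "mean_cost_usd": sum(r.cost_usd for r in runs) / n,
    }


runs = [
    RunRecord(True, tool_calls=4, tool_failures=1, recovered_failures=1,
              latency_s=2.0, cost_usd=0.10),
    RunRecord(False, tool_calls=6, tool_failures=1, recovered_failures=0,
              latency_s=4.0, cost_usd=0.30),
]
report = scorecard(runs)
print(report)
```

In this toy data, tool success is high while task success is only half, which is the kind of gap a single accuracy number would hide.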

The evaluation signal hierarchy illustrates this transition. At the base are benchmarks, which are high-volume but offer low operational value. Above them are scenario evaluations, which are more targeted but still limited in scope. The most valuable signals come from production telemetry, which has lower volume but maximum signal value, capturing real-time interactions and system behavior.

Offline Evals and Production Streams

Gupta then detailed two key evaluation methodologies: offline scenario-driven simulation and online production streams.

Offline evals take the form of scenario-driven simulation: agents run inside a sandbox that simulates tools and execution steps. This allows controlled testing and the collection of metrics such as completion rate, tool correctness, plan quality, and simulated cost. The key takeaway is that evaluation should be scenario-driven rather than prompt-driven, covering the entire workflow instead of isolated prompts.
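A scenario-driven harness of this kind might be sketched as follows. This is an illustrative sketch, not the setup shown in the talk: `Scenario`, `Sandbox`, `run_scenarios`, the two fake tools, and the deliberately buggy `toy_agent` are all hypothetical, and plan quality is omitted because it would need a separate judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    goal: str
    expected_tools: list  # tool names a correct run should call, in order
    check: Callable       # judges the final sandbox state, returns bool


class Sandbox:
    """Simulated tools: records calls and cost, touches no real system."""

    def __init__(self, tools, cost_per_call=0.01):
        self.tools = tools
        self.cost_per_call = cost_per_call
        self.calls = []
        self.state = {}

    def call(self, tool, **kwargs):
        self.calls.append(tool)
        return self.tools[tool](self.state, **kwargs)


def run_scenarios(agent, scenarios, tools):
    results = []
    for s in scenarios:
        box = Sandbox(tools)  # fresh state per scenario
        agent(s.goal, box)
        results.append({
            "scenario": s.name,
            "completed": s.check(box.state),
            "tool_correct": box.calls == s.expected_tools,
            "simulated_cost": len(box.calls) * box.cost_per_call,
        })
    return results


# Fake tools and a toy agent, for illustration only.
def lookup_order(state, order_id):
    state["order"] = {"id": order_id, "status": "delivered"}
    return state["order"]


def issue_refund(state, order_id):
    state["refunded"] = order_id
    return True


def toy_agent(goal, box):
    order = box.call("lookup_order", order_id="A1")
    if "order" in goal:  # deliberate bug: refunds on any order question
        box.call("issue_refund", order_id=order["id"])


TOOLS = {"lookup_order": lookup_order, "issue_refund": issue_refund}
SCENARIOS = [
    Scenario("refund", "Refund order A1", ["lookup_order", "issue_refund"],
             lambda st: st.get("refunded") == "A1"),
    Scenario("status", "What is the status of order A1?", ["lookup_order"],
             lambda st: "order" in st and "refunded" not in st),
]

results = run_scenarios(toy_agent, SCENARIOS, TOOLS)
completion_rate = sum(r["completed"] for r in results) / len(results)
print(results, completion_rate)
```

The status scenario fails here because the agent issues an unwanted refund. The fault lies in the workflow rather than in any single response, which is what checking tool calls and final state is meant to surface.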

Online evals treat the production stream as the largest evaluation dataset, in which every user interaction is a signal. User interactions pass through an evaluation gateway that collects metadata and telemetry and stores them in an analytics database. This continuous feedback loop is crucial for identifying and addressing issues as they arise.
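One way to picture such a gateway is a thin wrapper around the agent that logs every interaction. The sketch below uses Python's standard `sqlite3` module as a stand-in for the analytics database; the `EvalGateway` class, its table schema, and the `echo_agent` function are assumptions made for illustration, not details from the presentation.

```python
import json
import sqlite3
import time
import uuid


class EvalGateway:
    """Wraps an agent callable and logs each interaction with telemetry."""

    def __init__(self, agent, db_path=":memory:"):
        self.agent = agent
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS interactions ("
            "id TEXT PRIMARY KEY, ts REAL, user_input TEXT, output TEXT, "
            "ok INTEGER, latency_s REAL, metadata TEXT)"
        )

    def handle(self, user_input, **metadata):
        start = time.perf_counter()
        try:
            output, ok = self.agent(user_input), 1
        except Exception as exc:
            # Sketch only: a real gateway would likely re-raise after logging.
            output, ok = repr(exc), 0
        latency = time.perf_counter() - start
        self.db.execute(
            "INSERT INTO interactions VALUES (?, ?, ?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), time.time(), user_input, output, ok,
             latency, json.dumps(metadata)),
        )
        self.db.commit()
        return output

    def success_rate(self):
        (rate,) = self.db.execute(
            "SELECT AVG(ok) FROM interactions").fetchone()
        return rate


def echo_agent(text):
    """Hypothetical agent that fails on one specific input."""
    if text == "boom":
        raise RuntimeError("tool timeout")
    return f"handled: {text}"


gateway = EvalGateway(echo_agent)
gateway.handle("hi", model_version="v1")
gateway.handle("boom", model_version="v1")
print(gateway.success_rate())
```

Storing the free-form metadata as JSON keeps the schema small; a production system would more likely promote fields such as model version or tool name to indexed columns.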

Human-in-the-Loop and Agent Drift

Gupta stressed the importance of human-in-the-loop (HITL) calibration, emphasizing that humans should be treated as evaluators, not merely fallback systems. HITL review, triggered by automated alerts from online telemetry, provides critical feedback on correctness, usefulness, trust, and safety, which in turn calibrates evaluation pipelines and identifies specific failure points.

He also highlighted the problem of agent drift, where system reliability gradually degrades over time due to model updates, prompt changes, tool API changes, or user behavior shifts. This drift, often a silent killer, necessitates continuous monitoring and proactive evaluation.
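A simple way to turn telemetry into a drift alert is to compare a rolling success rate against a known baseline and flag the system for human review when the gap exceeds a tolerance. The sketch below assumes that approach; `DriftMonitor`, its window size, and its thresholds are hypothetical and not taken from the talk.

```python
from collections import deque


class DriftMonitor:
    """Flags drift when the rolling success rate falls below the baseline."""

    def __init__(self, baseline_rate, window=50, tolerance=0.05):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # keeps only the latest outcomes

    def record(self, success):
        """Add one outcome; return True if a human review should be triggered."""
        self.recent.append(bool(success))
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data for a stable estimate yet
        rate = sum(self.recent) / len(self.recent)
        return rate < self.baseline_rate - self.tolerance


monitor = DriftMonitor(baseline_rate=0.9, window=10, tolerance=0.05)
alerts = [monitor.record(True) for _ in range(10)]  # healthy period
alerts.append(monitor.record(False))                # rolling rate 0.9
alerts.append(monitor.record(False))                # rolling rate 0.8
print(alerts)
```

A fixed threshold is the crudest possible detector. It ignores sampling noise, so small windows will produce false alarms, and a statistical test over larger windows would be a more robust choice.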

Architectural Imperatives for Reliable Agents

Finally, Gupta outlined five key architectural imperatives for building reliable agentic systems:

  1. Offline benchmarks are necessary but insufficient.
  2. Agentic systems must be evaluated as full workflows.
  3. Production telemetry is the ultimate evaluation signal.
  4. Reliability always supersedes raw model accuracy.
  5. Evals are no longer tests; they are core infrastructure.

He concluded with the statement: "You can't improve what you don't continuously evaluate." It sums up the case for a robust, ongoing evaluation framework that keeps agentic AI systems safe, reliable, and effective in production environments.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.