Building reliable AI applications is fundamentally different from traditional software development, a crucial insight shared by Dmitry Kuchin of Multinear at the AI Engineer World's Fair in San Francisco. Kuchin, a seasoned startup co-founder and CTO with over 15 years of experience, including executive roles at ZoomInfo, Lemonade, and Meta, presented his practical tactics for achieving production-level reliability in generative AI. Having built over 50 GenAI projects, Kuchin highlighted a critical disconnect in the current approach to AI development.
While creating a Proof of Concept (POC) for an AI application can seem straightforward, achieving production-level reliability is a significant hurdle. GenAI, by its very nature, is non-deterministic, meaning identical inputs can yield different outputs. This inherent variability necessitates continuous experimentation, as changes to code, prompts, or data can impact results in unpredictable ways. Many practitioners, accustomed to predictable software development lifecycles, often stumble when trying to scale AI solutions beyond initial demonstrations.
A common pitfall, Kuchin observed, is the reliance on traditional data science metrics. As he plainly stated, metrics like "groundness, factuality, [and] bias... don't translate into reliable real-world performance." These abstract measures fail to answer crucial questions: Does the solution work as intended for the user? Do the latest changes actually improve it from a practical standpoint?
The solution, according to Kuchin, lies in a paradigm shift: "You need to reverse engineer your metrics." Instead of starting with generic data science metrics, teams must begin with realistic user scenarios and create evaluations that precisely mimic actual user experiences. This demands testing specific criteria directly tied to business outcomes, rather than universal evaluations that offer little actionable insight. For instance, a customer support bot should be evaluated not just on factual accuracy, but on its ability to prevent escalation to human agents.
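As a concrete illustration (not Kuchin's own code), an evaluation for that support bot might score the business outcome directly: did the bot resolve the request without handing off to a human? The escalation phrases, scenario list, and `run_support_bot` stub below are assumptions made for the sketch.

```python
# Hypothetical sketch: scoring a support bot on a business outcome
# (resolving the issue without escalating) rather than an abstract metric.
# `run_support_bot` and the scenario list are illustrative stand-ins.

ESCALATION_PHRASES = ("transfer you to an agent", "contact our support team")

def resolved_without_escalation(bot_reply: str) -> bool:
    """Business-level check: did the bot handle the request itself?"""
    reply = bot_reply.lower()
    return not any(phrase in reply for phrase in ESCALATION_PHRASES)

def run_support_bot(user_message: str) -> str:
    """Stand-in for the real application; replace with the actual bot call."""
    return "Let me look into that billing issue for you right away."

scenarios = [
    "I was double-charged for my subscription last month.",
    "How do I reset the password on my account?",
]

deflection_rate = sum(
    resolved_without_escalation(run_support_bot(msg)) for msg in scenarios
) / len(scenarios)
print(f"Resolved without human escalation: {deflection_rate:.0%}")
```

The point is that the pass/fail criterion mirrors what the business actually cares about, so a change in the score is directly actionable.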
The practical development process for GenAI, therefore, diverges from conventional software. It begins with building a basic POC, followed immediately by defining the first set of practical evaluations. Running these evaluations helps pinpoint specific failures or unexpected behaviors. This leads to an iterative loop of educated adjustments across various components: the underlying model, the application code, the prompts used, the training data, and crucially, the evaluations themselves. This "rinse-repeat" cycle is vital for continuously improving performance and, critically, for catching regressions before they impact users.
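A minimal sketch of that loop, assuming a generic `EvalCase` structure and a `pipeline` callable rather than any specific framework's API, might look like this:

```python
# Minimal sketch of the evaluate-adjust loop: run the evaluations after every
# change and compare against the previous run to catch regressions.
# EvalCase and the pipeline callable are assumptions, not a real framework API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    user_input: str
    passes: Callable[[str], bool]  # scenario-specific success criterion

def run_evals(pipeline: Callable[[str], str],
              cases: list[EvalCase]) -> dict[str, bool]:
    """Map each scenario to pass/fail for the current version of the app."""
    return {case.user_input: case.passes(pipeline(case.user_input))
            for case in cases}

def regressions(previous: dict[str, bool],
                current: dict[str, bool]) -> list[str]:
    """Scenarios that passed before a change but fail now."""
    return [k for k, ok in current.items()
            if previous.get(k, False) and not ok]
```

After each educated adjustment, the evaluations are rerun and the regression list is checked before the change is kept, with the evaluation set itself versioned and refined alongside the application.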
The ultimate goal of this rigorous process is to establish a "reliable baseline." With this benchmark in place, teams can confidently optimize their AI solutions. This confidence enables informed decisions on whether to switch to a smaller, more cost-effective model like "4o-mini instead of 4o," explore alternative architectural approaches like GraphRAG, or simplify complex agentic logic, all while knowing their core reliability is maintained.
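In practice, that decision can be as simple as comparing each candidate's pass rate against the baseline. The numbers and threshold below are invented for illustration; only the "4o-mini instead of 4o" and GraphRAG examples come from the talk.

```python
# Sketch: using the reliable baseline to judge optimizations. Scores and the
# allowed drop are made-up numbers for illustration.

BASELINE_PASS_RATE = 0.92   # current production setup against the eval suite
ALLOWED_DROP = 0.02         # reliability we are willing to trade for cost

def acceptable(candidate_pass_rate: float) -> bool:
    return candidate_pass_rate >= BASELINE_PASS_RATE - ALLOWED_DROP

candidates = {
    "gpt-4o-mini (cheaper model)": 0.91,
    "GraphRAG variant": 0.88,
    "simplified agent logic": 0.93,
}
for name, pass_rate in candidates.items():
    verdict = "safe to ship" if acceptable(pass_rate) else "below baseline"
    print(f"{name}: {pass_rate:.0%} -> {verdict}")
```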
While the overarching approach remains unified, the specific evaluations differ significantly across various AI solutions. A support bot might leverage an LLM as a judge, while a text-to-SQL solution requires mocking database queries to ensure correct output. A call center classifier might rely on simple matching, whereas robust guardrails necessitate a combination of evaluation methods to cover diverse scenarios.
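The evaluator implementations themselves might look quite different, as in the sketch below: the LLM judge is stubbed, and the text-to-SQL check runs against an in-memory mock database with an assumed schema.

```python
# Sketch of how the evaluator varies by solution type. The judge is a stub;
# the text-to-SQL check uses a mocked SQLite database with an assumed schema.

import sqlite3

def exact_match(predicted_label: str, expected_label: str) -> bool:
    """Call-center classifier: simple matching is sufficient."""
    return predicted_label.strip().lower() == expected_label.strip().lower()

def llm_judge(answer: str, rubric: str) -> bool:
    """Support bot: ask a judge model whether the answer meets the rubric.
    Stubbed here; in practice this would call a model and parse its verdict."""
    raise NotImplementedError("plug in a judge-model call")

def sql_rows_match(generated_sql: str, expected_rows: list[tuple]) -> bool:
    """Text-to-SQL: execute the generated query against a mocked database
    and compare the returned rows with the expected result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 25.0)])
    try:
        return conn.execute(generated_sql).fetchall() == expected_rows
    finally:
        conn.close()
```

Guardrails would typically combine several of these checks, since no single method covers every failure mode.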
To truly build dependable AI applications, organizations must evaluate them the way their users actually interact with them; abstract metrics are often misleading. Frequent, detailed evaluation is what makes rapid progress possible: it surfaces issues quickly and catches regressions before they reach users. This iterative process, combined with structured feedback, is the path to genuine AI explainability and reliability.

