OpenAI's Codex Powers Self-Improving Tax Software

OpenAI's Codex is powering a new generation of self-improving tax software, demonstrating significant gains in accuracy and efficiency through an AI-driven feedback loop.

May 28 at 1:38 AM · 9 min read

Figure: The three-part loop that enables AI agents to self-improve. (Source: OpenAI News)

Visual TL;DR. Production Failures lead to Engineer-Driven Fixes. Engineer-Driven Fixes are addressed by OpenAI Codex. OpenAI Codex enables the Three-Part Loop. The Three-Part Loop builds the Tax AI System. The Tax AI System enables Autonomous Enhancement. Autonomous Enhancement leads to Measurable Gains and expands to New Domains.

Production Failures: real-world software falters in unpredictable ways after deployment
Engineer-Driven Fixes: weeks fixing bugs based on user feedback and engineer translation
OpenAI Codex: advanced agentic capabilities powering self-improvement
Three-Part Loop: robust evaluation infrastructure and direct access to domain experts
Tax AI System: streamlines complex tax return preparation for accounting firms
Autonomous Enhancement: transforms real-world usage into actionable signals for self-improvement
Measurable Gains: significant gains in accuracy and efficiency demonstrated
New Domains: expanding self-improving capabilities to new application areas


Real-world software often falters in unpredictable ways after deployment. Teams typically spend weeks fixing bugs based on user feedback, a process that relies on engineers to translate those issues into product improvements. However, by leveraging advanced agentic capabilities like those found in Codex, coupled with robust evaluation infrastructure and direct access to domain experts, it’s now possible to build systems that self-improve.

Over six months, OpenAI engineers and researchers partnered with Thrive Holdings to develop Tax AI for Crete’s accounting firms. This system aims to streamline the preparation of complex tax returns, moving beyond a purely engineer-driven improvement cycle. Tax AI transforms real-world usage into actionable signals for autonomous enhancement.

The accounting firms processed tens of thousands of tax returns, involving millions of documents. For complex filings, data entry alone can consume eight hours per return, often complicated by messy data sources and manual calculations. Tax AI processed 7,000 returns in its pilot phase, automating significant portions of the 1040 and 1041 tax return preparation.

Crucially, Tax AI has demonstrably improved since its initial deployment. The system now saves practitioners about a third of their time, drafts returns with up to 97% accuracy, and increases throughput by approximately 50%.
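As a rough consistency check on these figures, assume (this is an assumption, not something the article states) that throughput is limited only by practitioner time per return, written here as t. A one-third time saving would then imply roughly the reported throughput gain:

```latex
\[
\frac{\text{throughput}_{\text{after}}}{\text{throughput}_{\text{before}}}
= \frac{1 / \bigl(t\,(1 - \tfrac{1}{3})\bigr)}{1 / t}
= \frac{1}{2/3}
= 1.5
\]
```

Under that assumption, a ratio of 1.5 corresponds to an increase of about 50%.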

Measurable Self-Improvement

Accuracy is measured by the percentage of returns completed correctly without subsequent correction. At launch, only 25% of returns had at least 75% of their fields completed correctly. Within six weeks, that share rose to 86%, with even faster growth at the 90% and 100% completion thresholds.
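A threshold metric of this kind can be sketched as follows. This is a minimal illustration, not the actual Tax AI evaluation code: the function name `share_meeting_threshold` is hypothetical and the per-return accuracies are made-up numbers.

```python
from typing import Sequence


def share_meeting_threshold(field_accuracies: Sequence[float], threshold: float) -> float:
    """Return the fraction of returns whose field accuracy is at least `threshold`."""
    if not field_accuracies:
        raise ValueError("need at least one return")
    meeting = sum(1 for accuracy in field_accuracies if accuracy >= threshold)
    return meeting / len(field_accuracies)


# Made-up per-return accuracies (fraction of fields drafted correctly); not real data.
launch = [0.40, 0.55, 0.80, 0.60]
later = [0.78, 0.92, 1.00, 0.70]

for threshold in (0.75, 0.90, 1.00):
    before = share_meeting_threshold(launch, threshold)
    after = share_meeting_threshold(later, threshold)
    print(f"threshold {threshold:.2f}: {before:.0%} -> {after:.0%}")
```

Reporting the same metric at several thresholds shows whether gains come from returns that are merely usable or from returns that need no correction at all.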

Initially, Tax AI handled simpler documents like W-2s and 1099s. As the tax season progressed, it successfully tackled more complex returns involving K-1s and intricate schedules. Each expansion into more challenging tasks yielded greater time savings per return.

This continuous progress is fueled by a co-engineered approach centered on three pillars: expert practitioner feedback, detailed production traces, and a Codex-driven iteration loop utilizing tailored evaluations. This methodology aims to accelerate product development in domains where expert insight is paramount.

The Problem of Production Failures

As Tax AI tackled more complex tax preparation tasks, such as those involving K-1s or rental property schedules, the core challenge became making production failures visible, understandable, and actionable. Early corrections by practitioners lacked full context, making it difficult for engineers to pinpoint root causes like extraction errors, mapping issues, or simple workflow noise.

Without a structured feedback mechanism, engineers struggled to identify the most critical areas for improvement. The existing system lacked the signals needed to direct development effectively.

Our Approach: A Three-Part Loop

The solution involved designing the system around three core principles:

Stay close to practitioners: Their expertise is vital for guiding product learning and identifying high-impact areas.
Build the product to create evidence: Capture the complete workflow from source material to final submission, including expert corrections.
Create a Codex-driven improvement loop: Use structured production issues to generate tailored evaluations that Codex can address, accelerating development.

The rental property example illustrates this loop in action, showing how a practitioner's correction evolves into a structured finding, then an evaluation target, and finally a Codex-scoped engineering task.

Rental Property Example

Extracting rental property income, reported on Schedule E, presents a complex challenge. The system must read varied source materials, extract relevant fields, and maintain traceability for practitioner review.

A practitioner correction reveals a failure: Differences between the AI's predicted value and the filed return are now captured as structured data. This transforms the review process from a post-failure step into a continuous learning cycle.
Product traces turn corrections into evaluations: The system preserves the full workflow, enabling detailed failure investigation. Practitioner corrections are processed to capture differences, group recurring issues, and define clear evaluation targets for Codex.
The finding becomes a hill to climb for Codex: These targeted evaluations allow Codex to investigate root causes, implement fixes, and validate changes. This automated process turns recurring practitioner corrections into measurable engineering tasks.
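The first step of this loop, capturing the difference between the drafted value and the filed return as structured data, might look roughly like the sketch below. The `Correction` type, the `diff_return` function, and the field names are all hypothetical; the article does not describe the real schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Correction:
    """One field where the filed return differs from the AI draft (hypothetical schema)."""
    return_id: str
    field: str
    predicted: Optional[str]  # None if the draft left the field empty
    filed: Optional[str]      # None if the practitioner removed the field


def diff_return(return_id: str, predicted: Dict[str, str], filed: Dict[str, str]) -> List[Correction]:
    """Compare draft and filed values field by field and keep only the differences."""
    corrections = []
    for name in sorted(set(predicted) | set(filed)):
        p, f = predicted.get(name), filed.get(name)
        if p != f:
            corrections.append(Correction(return_id, name, p, f))
    return corrections


# Illustrative values only; field names are invented for this example.
predicted = {"schedule_e.rents_received": "24000", "schedule_e.property_taxes": "3100"}
filed = {
    "schedule_e.rents_received": "24000",
    "schedule_e.property_taxes": "3400",
    "schedule_e.insurance": "900",
}

corrections = diff_return("R-001", predicted, filed)
for c in corrections:
    print(c)
```

Keeping both the predicted and the filed value on each record is what lets a later step tell a missed field apart from a misread one.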

This end-to-end loop ensures that production evidence fuels continuous improvement, with actionable patterns becoming bounded evaluations for Codex and ambiguous cases routed back to product teams.

How to Use Codex to Build This Loop

The pattern of using production artifacts and traces to enhance agent capabilities is broadly applicable. Providing Codex with reviewed findings, source traces, expected outputs, and relevant code can significantly improve its performance over time.

This approach builds on principles of making tasks legible to AI, providing scoped context, and integrating human review. A practitioner correction only becomes a Codex task after repeated issues are identified and grouped into actionable findings.
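The grouping step described above can be sketched as a simple count-and-threshold filter. The `promote_findings` function, the `(field, cause)` key, and the cutoff of three occurrences are assumptions chosen for illustration; the article does not say how issues are actually grouped or when they are promoted.

```python
from collections import Counter
from typing import Iterable, List, Tuple

IssueKey = Tuple[str, str]  # (field name, suspected cause); a hypothetical grouping key


def promote_findings(corrections: Iterable[IssueKey], min_occurrences: int = 3) -> List[Tuple[IssueKey, int]]:
    """Group corrections by key and keep only those seen at least `min_occurrences` times."""
    counts = Counter(corrections)
    return [(key, n) for key, n in counts.most_common() if n >= min_occurrences]


# Made-up observations; one tuple per practitioner correction.
observed = [
    ("schedule_e.property_taxes", "extraction"),
    ("schedule_e.property_taxes", "extraction"),
    ("schedule_e.property_taxes", "extraction"),
    ("schedule_e.insurance", "mapping"),
    ("k1.ordinary_income", "mapping"),
    ("k1.ordinary_income", "mapping"),
]

findings = promote_findings(observed)
print(findings)
```

One-off corrections stay below the cutoff, so they would be left for human review rather than turned into engineering tasks.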

This automation is applied to a bounded layer of the product responsible for extraction and mapping. Engineers retain oversight of architecture and product strategy, while practitioners guide the improvement loop through their existing workflows.

For Codex, this means receiving scoped engineering tasks with clear evidence and validation gates, rather than vague alerts. The context for a typical task includes the code repository, evaluation datasets, and relevant documentation.
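One way to picture a scoped task with validation gates is as a small record that bundles the finding, its evidence, the part of the code base the agent may touch, and the checks a change must pass. The `ScopedTask` class and every field in it are hypothetical; this is not a Codex API, only a sketch of the idea.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ScopedTask:
    """A bounded engineering task handed to a coding agent (hypothetical structure)."""
    finding: str
    evidence: List[str]       # e.g. identifiers of production traces
    allowed_paths: List[str]  # the bounded layer the agent may modify
    gates: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def failed_gates(self) -> List[str]:
        """Names of the validation gates that do not currently pass."""
        return [name for name, check in self.gates.items() if not check()]

    def ready_to_merge(self) -> bool:
        """A change is ready only if at least one gate exists and all gates pass."""
        return bool(self.gates) and not self.failed_gates()


# Illustrative task; the gate functions are stand-ins for real evaluation runs.
task = ScopedTask(
    finding="Property taxes misread on Schedule E source documents",
    evidence=["trace:R-001", "trace:R-017"],
    allowed_paths=["extraction/", "mapping/"],
    gates={
        "targeted_eval_passes": lambda: True,
        "regression_suite_passes": lambda: False,
    },
)

print(task.failed_gates())
print(task.ready_to_merge())
```

Requiring at least one gate means a task with no defined validation can never be treated as done, which mirrors the contrast the article draws between scoped tasks and vague alerts.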

Expanding to New Domains

The self-improvement loop is not limited to rental properties; it's a reusable pattern for enhancing agent capabilities across various domains. This iterative process, driven by real-world usage and expert feedback, allows for continuous, measurable advancements in AI systems.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
