OpenAI has shared its attempts at solving problems from the First Proof challenge, a rigorous math competition designed to test AI's ability to generate verifiable, end-to-end proofs in specialized domains. This initiative offers a glimpse into the evolving capabilities of advanced AI models for complex research tasks.
The research lab submitted proof attempts for all ten First Proof problems, which are known for requiring deep expertise and extended reasoning, with some problems having resisted human solution for years. According to OpenAI News, experts believe at least five of the AI-generated proofs (problems 4, 5, 6, 9, and 10) are likely correct, while several others are still under review. OpenAI initially judged its attempt at problem 2 to be correct, but revised that assessment after community analysis found the proof flawed.
Testing Frontier Reasoning
OpenAI views frontier challenges of this kind as crucial for evaluating next-generation AI, going beyond traditional benchmarks. Such challenges stress-test a model's capacity for sustained reasoning, abstraction, handling ambiguity, and producing arguments that withstand expert scrutiny. The company noted that while benchmarks are useful, they often miss the nuances of complex research.
Iterative Development and Human Oversight
The model ran with limited human intervention; operators occasionally suggested retry strategies or relayed clarifications based on expert feedback. A back-and-forth process with ChatGPT was used for verification, formatting, and style refinement. OpenAI acknowledged that the rapid sprint was not a perfectly controlled evaluation and said it looks forward to running more rigorous experiments.
This work builds on previous milestones, including gold-medal performance on the International Mathematical Olympiad and GPT-5's role in accelerating scientific research, as detailed in the company's research publications. OpenAI aims to incorporate these advanced reasoning capabilities into future public models.