AI Models Now Predict the Future, Almost

The race to build AI systems that can predict world events is heating up. Mantic, a company using the Tinker platform, has demonstrated that fine-tuning Large Language Models (LLMs) for forecasting tasks can elevate their performance to levels comparable with top-tier, general-purpose models.

Fine-Tuning for Foresight

The prevailing strategy for AI forecasting has relied on off-the-shelf LLMs like Gemini 3 or GPT-5, augmented with specialized context-gathering techniques. These models, while powerful, were not inherently designed for prediction.

Mantic’s research focused on "judgmental forecasting": predictions that require human-like research and reasoning. This kind of forecasting matters in domains like geopolitics and economics, where traditional statistical methods fall short. Drawing inspiration from the book Superforecasting, the team explored whether models explicitly trained for forecasting could outperform their generalist counterparts.

Using reinforcement learning on approximately 10,000 binary questions (e.g., "Will event X occur before date Y?"), Mantic fine-tuned the open-weight model gpt-oss-120b. The training rewarded the model for assigning higher probabilities to the outcomes that actually occurred.
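To make the reward idea concrete, here is a minimal sketch of how a probability forecast on a resolved binary question could be scored. The article does not say which scoring rule Mantic used, so the log score and Brier score below are assumptions chosen because they are standard proper scoring rules; the function names and the clipping value `eps` are hypothetical.

```python
import math


def log_score_reward(prob_yes: float, outcome: bool, eps: float = 1e-6) -> float:
    """Log of the probability assigned to what actually happened.

    Assumption: a log scoring rule; the source does not name the rule used.
    The forecast is clipped away from 0 and 1 so the reward stays finite.
    """
    p = min(max(prob_yes, eps), 1.0 - eps)
    return math.log(p if outcome else 1.0 - p)


def brier_reward(prob_yes: float, outcome: bool) -> float:
    """Negative squared error between the forecast and the 0/1 outcome.

    Assumption: an alternative proper scoring rule, shown for comparison.
    """
    target = 1.0 if outcome else 0.0
    return -((prob_yes - target) ** 2)


if __name__ == "__main__":
    # Hypothetical question that resolved "yes": a more confident correct
    # forecast earns a higher reward under both rules.
    for p in (0.55, 0.75, 0.95):
        print(p, log_score_reward(p, True), brier_reward(p, True))
```

Both rules give the highest expected reward to a forecaster who reports its true belief, which is why rules of this kind are a common choice for scoring forecasts.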

The fine-tuned gpt-oss-120b marginally outperformed leading LLMs in head-to-head contests. The improvement was most pronounced when the model was also given pre-researched context.

This work on gpt-oss-120b fine-tuning demonstrates that specialized training can indeed unlock new levels of predictive capability.

The Ensemble Advantage

In Mantic’s experiments, the fine-tuned gpt-oss-120b proved to be a critical component of the best-performing model ensembles. Frontier LLMs like Grok 4 and GPT-5 offer strong standalone performance, but they often make similar predictions. The fine-tuned model's predictions differed from theirs, which made the combined forecast more robust and accurate.

An ensemble combining the fine-tuned gpt-oss-120b with Gemini 3 Pro, GPT-5, and Grok 4 outperformed any single model. Notably, the fine-tuned model and Grok 4 were deemed the least replaceable, highlighting their distinct contributions.
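One way to picture "least replaceable" is a leave-one-out comparison: score the full ensemble, then score it again with each member removed and see how much worse it gets. The sketch below is an illustration under stated assumptions, not Mantic's method. The article does not say how forecasts were combined or how replaceability was measured, so the simple average, the Brier score, and all function names here are hypothetical, and the numbers in the usage section are made up.

```python
from statistics import mean


def brier(probs, outcomes):
    """Mean squared error between forecasts and 0/1 outcomes (lower is better)."""
    return mean((p - o) ** 2 for p, o in zip(probs, outcomes))


def ensemble(forecasts, members):
    """Average the named members' probabilities question by question.

    Assumption: an unweighted mean; the source does not describe the combiner.
    """
    n_questions = len(forecasts[members[0]])
    return [mean(forecasts[m][i] for m in members) for i in range(n_questions)]


def leave_one_out(forecasts, outcomes):
    """Change in ensemble Brier score when each member is dropped.

    A larger positive value means the ensemble gets worse without that
    member, i.e. the member is harder to replace.
    """
    names = list(forecasts)
    full_score = brier(ensemble(forecasts, names), outcomes)
    deltas = {}
    for name in names:
        rest = [m for m in names if m != name]
        deltas[name] = brier(ensemble(forecasts, rest), outcomes) - full_score
    return deltas


if __name__ == "__main__":
    # Made-up forecasts on three resolved questions, for illustration only.
    outcomes = [1, 0, 1]
    forecasts = {
        "generalist_a": [0.6, 0.4, 0.6],
        "generalist_b": [0.6, 0.4, 0.6],
        "fine_tuned": [0.9, 0.1, 0.9],
    }
    for name, delta in leave_one_out(forecasts, outcomes).items():
        print(f"{name}: {delta:+.4f}")
```

In this toy data the two generalists make identical predictions, so dropping either one costs nothing, while dropping the model that disagrees with them (and is right) hurts the ensemble. That is the intuition behind a distinct model being hard to replace.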

This approach extends the findings from earlier research on models like gpt-oss-120b, showcasing the power of domain-specific adaptation.

Beyond General Models

The implications for LLM forecasting accuracy are significant. By training models on specific forecasting tasks, Mantic has shown it's possible to create specialized AI forecasters that rival, and in some cases surpass, the capabilities of general-purpose LLMs.

This research suggests a future where highly accurate, automated forecasting systems could dramatically improve decision-making across various sectors.

The study also underscores the value of comparing AI systems rigorously, much like in the ongoing efforts to audit LLM agent skill integrity.

Future Directions

Mantic plans to expand this research by training larger models, adapting to different question formats (like numerical or multiple-choice predictions), and integrating real-time information retrieval into the forecasting loop.

© 2026 StartupHub.ai. All rights reserved.