The development of high-performance Large Language Model (LLM) search agents, a key frontier capability, has been largely confined to industrial labs because of one significant bottleneck: the scarcity of transparent, high-quality training data. This gap has stifled broader research innovation. Addressing it directly, the OpenSeeker project introduces a fully open-source search agent, releasing both the model and its training data to open up this domain to the wider community.
Fact-Grounded Synthesis for Scalable Reasoning
OpenSeeker's core innovation lies in its ability to generate complex, multi-hop reasoning tasks at scale. By reverse-engineering the web graph through topological expansion and entity obfuscation, the system synthesizes fact-grounded, controllable Question Answering (QA) data, allowing precise control over task coverage and complexity and directly tackling the data scarcity that has plagued the research community.
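The source does not spell out the synthesis algorithm, but the two named ingredients, topological expansion over a fact graph and entity obfuscation, can be sketched in miniature. Everything below (the toy fact graph, the function names, the obfuscation heuristic) is an illustrative assumption, not OpenSeeker's actual pipeline:

```python
import random

# Toy fact graph: entity -> list of (relation, entity) edges.
# Purely illustrative data, not drawn from OpenSeeker.
FACTS = {
    "Marie Curie": [("born_in", "Warsaw"), ("won", "Nobel Prize in Physics")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("borders", "Germany")],
}

def expand_chain(graph, seed, hops, rng):
    """Topological expansion: walk `hops` edges outward from `seed`,
    collecting a chain of (subject, relation, object) facts."""
    chain, node = [], seed
    for _ in range(hops):
        edges = graph.get(node)
        if not edges:
            break
        rel, nxt = rng.choice(edges)
        chain.append((node, rel, nxt))
        node = nxt
    return chain

def obfuscate(entity, graph):
    """Entity obfuscation: replace a name with an indirect description
    built from one of its own facts, so the question cannot be answered
    by simple string matching."""
    rel, obj = graph[entity][0]
    return f"the entity that {rel.replace('_', ' ')} {obj}"

def synthesize_qa(graph, seed, hops=2, rng_seed=0):
    """Compose a multi-hop question from an obfuscated seed; the answer
    is the entity reached at the end of the fact chain."""
    rng = random.Random(rng_seed)
    chain = expand_chain(graph, seed, hops, rng)
    question = (
        f"Starting from {obfuscate(seed, graph)}, "
        + ", then ".join(f"follow '{rel}'" for _, rel, _ in chain)
        + ". What do you reach?"
    )
    return question, chain[-1][2]
```

Longer chains yield harder questions, which is one plausible reading of the "controllable complexity" claim: the hop count becomes a difficulty knob.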
Denoising Trajectories for Enhanced Action Quality
Complementing the data synthesis, OpenSeeker employs a denoised trajectory synthesis mechanism: a retrospective summarization pass that refines the action sequences generated by teacher LLMs. By stripping unproductive detours from these trajectories, the system raises the quality of the demonstrated actions, yielding more efficient and effective search behavior.
Frontier Performance with Open Access
The impact of these innovations shows clearly in the benchmark results. Trained on a remarkably small dataset of just 11.7k synthesized samples, the OpenSeeker search agent achieves state-of-the-art results across multiple benchmarks, including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, it outperforms other open-source agents such as DeepDive by a substantial margin (29.5% vs. 15.3% on BrowseComp) and even surpasses industrial competitors such as Tongyi DeepResearch on BrowseComp-ZH (48.4% vs. 46.7%). The full open-sourcing of the training dataset and model weights marks a shift toward a more transparent and collaborative research ecosystem.