The composition of pretraining data is the invisible architect of a Large Language Model's (LLM's) capabilities and limitations. Yet this 'digital DNA' is rarely disclosed, which hinders independent auditing and makes model behavior and provenance hard to trace. To address this, researchers introduce Data Mixture Surgery (DMS), a formalization of the task of estimating the domain-level distribution of an LLM's pretraining corpus using only the text the model generates.
Reverse-Engineering the Training Corpus
The core innovation, LLMSurgeon, reframes data mixture analysis as an inverse problem. It assumes a label-shift scenario, in which the proportions of domains differ between labeled reference text and the model's generated text while text within each domain looks alike. Rather than simply aggregating a domain classifier's outputs, LLMSurgeon estimates a calibrated 'soft' confusion matrix that accounts for systematic confusion between domains, then uses it to recover the latent mixture prior. The result is an estimate of what data shaped the LLM, obtained without direct access to that data.
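To make the inverse-problem framing concrete, here is a minimal sketch of a generic label-shift correction. It is not the authors' implementation: the article does not say how LLMSurgeon calibrates its classifier or solves for the prior, so the least-squares solve, the clip-and-renormalize step, and all function names below are assumptions chosen for illustration. The idea is that if column j of a soft confusion matrix C holds the classifier's average output on text truly from domain j, then the average output on generated text is approximately C times the unknown mixture, which can be inverted.

```python
import numpy as np

def soft_confusion_matrix(probs, labels, num_domains):
    """C[i, j] = mean predicted probability of domain i over
    held-out texts whose true domain is j (hypothetical helper)."""
    C = np.zeros((num_domains, num_domains))
    for j in range(num_domains):
        C[:, j] = probs[labels == j].mean(axis=0)
    return C

def estimate_mixture(C, target_probs):
    """Recover the mixture prior pi from mean(target_probs) ~= C @ pi.
    Assumption: plain least squares, then clip negatives and
    renormalize onto the probability simplex."""
    mu = target_probs.mean(axis=0)
    pi, *_ = np.linalg.lstsq(C, mu, rcond=None)
    pi = np.clip(pi, 0.0, None)
    return pi / pi.sum()

# Toy setup (made up for illustration): a classifier that always
# outputs column j of T for text from domain j, so it systematically
# leaks probability mass between domains.
T = np.array([[0.8, 0.3, 0.1],
              [0.1, 0.6, 0.2],
              [0.1, 0.1, 0.7]])

# Labeled held-out text: 10 samples per domain.
held_labels = np.repeat([0, 1, 2], 10)
held_probs = T.T[held_labels]

# "Generated" text whose true mixture is 50% / 30% / 20%.
true_mixture = np.array([0.5, 0.3, 0.2])
target_labels = np.repeat([0, 1, 2], [50, 30, 20])
target_probs = T.T[target_labels]

C = soft_confusion_matrix(held_probs, held_labels, num_domains=3)
naive = target_probs.mean(axis=0)            # simple aggregation
estimated = estimate_mixture(C, target_probs)  # confusion-corrected

print("naive:    ", np.round(naive, 3))
print("corrected:", np.round(estimated, 3))
```

In this toy setup, simple aggregation yields roughly 0.51 / 0.27 / 0.22 because the classifier's confusion bleeds into the estimate, while the corrected estimate recovers 0.5 / 0.3 / 0.2. Real classifiers are noisy, so a constrained solver would likely be preferable to clipping in practice.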
A Verifiable Benchmark for Transparency
To evaluate DMS and LLMSurgeon, the authors developed LLMScan, a recipe-verifiable evaluation suite built from open-source LLMs with known pretraining mixtures. Because the true recipe is known, each recovered mixture can be checked against ground truth under standardized, reproducible conditions. The authors report that LLMSurgeon recovers these mixtures with high fidelity, a significant step toward practical, post-hoc auditing of foundation models.
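As a rough illustration of what checking a recovered mixture against a known recipe could look like, the sketch below scores an estimate with mean absolute error and total variation distance. The article does not state which metrics LLMScan uses, so both metrics, the function name, the domain names, and the numbers are assumptions made up for the example, not results from the paper.

```python
import numpy as np

def mixture_errors(estimated, reference):
    """Compare two domain mixtures (hypothetical scoring helper).
    Both inputs are normalized to sum to 1 before comparison."""
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(reference, dtype=float)
    est = est / est.sum()
    ref = ref / ref.sum()
    abs_diff = np.abs(est - ref)
    return {
        "mae": float(abs_diff.mean()),
        "total_variation": float(0.5 * abs_diff.sum()),
    }

# Made-up recipe and estimate, for illustration only.
known_recipe = {"web": 0.60, "code": 0.25, "papers": 0.15}
recovered = {"web": 0.55, "code": 0.30, "papers": 0.15}

# Align the two mixtures on a shared, sorted domain order.
domains = sorted(known_recipe)
scores = mixture_errors(
    [recovered[d] for d in domains],
    [known_recipe[d] for d in domains],
)
print(scores)
```

Total variation distance ranges from 0 (identical mixtures) to 1 (disjoint support), which makes scores comparable across models with different numbers of domains.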