
LLMSurgeon: Estimating an LLM’s training‑data mix from its outputs

May 29, 2026 | arXiv: 2605.30348v1

Researchers introduce a method to recover the mix of data domains that shaped a large language model (LLM) using only the text the model generates. They call the problem Data Mixture Surgery (DMS) and present LLMSurgeon, a practical framework that aims to estimate what proportion of an LLM’s training material came from each domain, such as Wikipedia, code, or web text.

The core idea is to treat the problem as an inverse puzzle. LLMSurgeon first trains an external classifier on known, labeled reference data for each domain. It then feeds the target model neutral prompts to elicit representative generated text and labels those outputs with the frozen classifier. Instead of simply counting labels, the method computes a calibrated “soft” confusion matrix that captures how often the classifier mistakes one domain for another. LLMSurgeon then solves a constrained inverse problem that uses this matrix to correct the classifier’s systematic bias and recover the latent domain proportions the model reflects.
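
The correction step can be illustrated with a small sketch. This is not the authors’ implementation: the summary above does not specify the objective, the calibration procedure, or the solver, so the code assumes a least-squares objective over the probability simplex solved by projected gradient descent. The function names, the three domains, and every number are hypothetical and chosen only for illustration.

```python
def mean_vector(rows):
    """Average a list of equal-length probability vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def soft_confusion(probs_by_domain):
    """Row i = mean classifier probability vector on held-out texts of domain i."""
    return [mean_vector(rows) for rows in probs_by_domain]


def project_to_simplex(v):
    """Euclidean projection onto {p : p >= 0, sum(p) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def estimate_mixture(confusion, observed, steps=2000, lr=None):
    """Minimise ||C^T p - q||^2 over the simplex by projected gradient descent.

    confusion[i][j] ~ P(classifier says j | text really from domain i)
    observed[j]     ~ mean classifier probability of domain j on generated text
    """
    k = len(confusion)
    if lr is None:
        lr = 1.0 / (2.0 * k)  # conservative step size for a row-stochastic C
    p = [1.0 / k] * k  # start from the uniform mixture
    for _ in range(steps):
        predicted = [sum(p[i] * confusion[i][j] for i in range(k)) for j in range(k)]
        residual = [predicted[j] - observed[j] for j in range(k)]
        grad = [2.0 * sum(confusion[i][j] * residual[j] for j in range(k))
                for i in range(k)]
        p = project_to_simplex([p[i] - lr * grad[i] for i in range(k)])
    return p


# Hypothetical three-domain example (all numbers are made up).
domains = ["wikipedia", "code", "web"]
confusion = [
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
]
true_mix = [0.5, 0.3, 0.2]
# What a naive average of classifier outputs would report for this mixture.
observed = [sum(true_mix[i] * confusion[i][j] for i in range(3)) for j in range(3)]
estimate = estimate_mixture(confusion, observed)
print("naive:    ", dict(zip(domains, (round(x, 3) for x in observed))))
print("corrected:", dict(zip(domains, (round(x, 3) for x in estimate))))
```

In this toy setup, simply averaging classifier outputs would report a mixture of about 0.44 / 0.345 / 0.215, because some probability mass leaks between domains, while the true mixture is 0.5 / 0.3 / 0.2. Inverting through the confusion matrix under the non-negativity and sum-to-one constraints removes that leakage. With real, noisy classifier outputs the recovered mixture would only approximate the truth.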

To test the approach, the authors build LLMScan, a benchmark made from eight open‑source models with transparent training recipes and sizes ranging roughly from 1 billion to 65 billion parameters. LLMScan includes multiple auditing resolutions (coarse, mid, fine) and models such as LLaMA‑1, OLMo, Amber, Pythia, and GPT‑Neo. In these controlled tests with fixed protocols, LLMSurgeon recovers domain mixtures with high fidelity and outperforms simple aggregation baselines that rely on pointwise membership signals.

Why this matters: the composition of a model’s pretraining data affects its behavior, strengths, and blind spots. Knowing the domain mix can help researchers and auditors investigate bias, copyright risk, or why a model behaves differently on some topics. LLMSurgeon offers a post‑hoc way to probe that “digital DNA” without access to the model’s training files or weights.