How large language models can make randomized trials more precise
This paper tests whether large language models (LLMs) can help make randomized controlled trials (RCTs) more precise. The authors build a practical pipeline that uses LLM-generated predictions as an extra measured variable in an RCT analysis. They show this can reduce the uncertainty in trial estimates, especially when a trial lacks good numeric predictors or already includes text data, such as paper abstracts, that LLMs can read well.
At a high level, the approach is simple. The researchers asked an LLM to predict each trial unit’s outcome and treated that prediction as a new covariate (a measured characteristic) alongside the usual trial data. They then used a design-based estimator, adapted from a recent method, to combine the LLM prediction with the trial data. Crucially, the imputed outcomes that enter the estimator must be constructed so that each unit’s imputation is independent of that unit’s own treatment assignment, for example by using leave-one-out or out-of-bag predictions that never train on the unit being imputed. That independence is what keeps the trial estimate unbiased.
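To make the leave-one-out idea concrete, here is a minimal Python sketch. For each unit it fits a simple linear model of the outcome on the LLM prediction using only the other units, then forms a design-based estimate. The function names, the linear fit, and the Bernoulli randomization with known probability p are all assumptions made for illustration. This summary does not spell out the paper's exact estimator, so the code should not be read as the authors' implementation.

```python
import numpy as np


def loo_imputations(y, t, x):
    """Leave-one-out imputations of each unit's treated and control outcomes.

    y: observed outcomes, t: 0/1 treatment indicators,
    x: LLM predictions used as a single covariate (all illustrative names).
    """
    n = len(y)
    t_hat = np.empty(n)
    c_hat = np.empty(n)
    for i in range(n):
        others = np.arange(n) != i  # unit i never informs its own imputation
        for arm, out in ((1, t_hat), (0, c_hat)):
            mask = others & (t == arm)
            # Assumed model: straight-line fit of outcome on the LLM prediction.
            slope, intercept = np.polyfit(x[mask], y[mask], 1)
            out[i] = intercept + slope * x[i]
    return t_hat, c_hat


def loo_adjusted_estimate(y, t, x, p):
    """Average treatment effect estimate under Bernoulli(p) assignment."""
    t_hat, c_hat = loo_imputations(y, t, x)
    # Weighted blend of the two imputed potential outcomes.
    m_hat = (1 - p) * t_hat + p * c_hat
    # Inverse-probability-style contribution of each unit's residual.
    contrib = np.where(t == 1, (y - m_hat) / p, -(y - m_hat) / (1 - p))
    return contrib.mean()
```

Because each imputation is computed without unit i's own outcome or assignment, it is independent of that assignment under this design, which is the property the unbiasedness argument relies on. On noise-free data where the outcome is exactly linear in the prediction with a constant treatment effect, this sketch recovers that effect.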
The payoff is precision. Trials are often small, or their most informative features (such as text) are hard to model with the trial’s own data. If the LLM prediction is correlated with the outcome, adding it as a single covariate can capture that complex structure without training a large model inside the trial. The authors lay out the math for both the estimator and a variance formula that uses the prediction errors in the treated and control groups to quantify precision gains.
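As a hedged illustration of how prediction errors in the two arms can translate into a variance estimate, the sketch below uses one conservative form. It follows from the algebra of a Bernoulli(p) design, with the cross term bounded by the Cauchy-Schwarz inequality. The function name and this particular form are assumptions. The paper's own formula is not reproduced in this summary and may differ in detail.

```python
import numpy as np


def loo_variance_estimate(y, t, t_hat, c_hat, p):
    """Conservative variance estimate from arm-specific prediction errors.

    t_hat / c_hat are imputed treated / control outcomes that were built
    independently of each unit's own assignment (illustrative names).
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t)
    t_hat = np.asarray(t_hat, dtype=float)
    c_hat = np.asarray(c_hat, dtype=float)
    # Mean squared error of the treated-outcome imputations among treated units.
    mse_t = np.mean((y[t == 1] - t_hat[t == 1]) ** 2)
    # Mean squared error of the control-outcome imputations among control units.
    mse_c = np.mean((y[t == 0] - c_hat[t == 0]) ** 2)
    n = len(y)
    bracket = (
        (1 - p) / p * mse_t
        + p / (1 - p) * mse_c
        + 2 * np.sqrt(mse_t * mse_c)  # Cauchy-Schwarz bound on the cross term
    )
    return bracket / n
```

The smaller the prediction errors in either arm, the smaller this estimate, which is why a covariate that tracks the outcome well tightens the trial estimate.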
They test the pipeline on three studies. The first is a natural experiment on judges and recidivism with 1,003 defendants. The second is a school trial of a new algebra curriculum with 19,053 students. The third is an RCT that randomized 3,245 scientific papers to open access, where the authors analyze a subset of 1,248 papers across five journals for which they could get abstracts. In the first two examples their method produced little improvement. In the abstracts example, which uses text well suited to LLMs, the method substantially improved precision. In one case the improvement was equivalent to increasing the sample size by nearly 60 percent.
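The sample-size comparison rests on a standard piece of arithmetic: the variance of an average-effect estimate shrinks roughly in proportion to 1/n, so a variance ratio maps directly to an equivalent change in sample size. A small sketch follows. The function name is hypothetical, and the numbers in the usage comment are chosen only to illustrate the conversion, not taken from the paper's tables.

```python
def equivalent_sample_size_increase(var_baseline, var_adjusted):
    """Fractional sample-size increase that would match the precision gain.

    Assumes variance scales as 1/n, so the same precision without adjustment
    would need n * (var_baseline / var_adjusted) units.
    """
    if var_baseline <= 0 or var_adjusted <= 0:
        raise ValueError("variances must be positive")
    return var_baseline / var_adjusted - 1.0


# Illustrative only: an adjusted variance of 1.0 against a baseline of 1.6
# corresponds to an increase of about 0.6, i.e. roughly 60 percent more units.
increase = equivalent_sample_size_increase(1.6, 1.0)
```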