Statistical Methodology

New statistical framework uses trial data and real-world records to pick patients who likely benefit from a treatment

July 1, 2026 | arXiv: 2606.31954v1

This paper presents a method to identify which patients are likely to benefit from a treatment while accounting for uncertainty. The authors target situations where a randomized controlled trial (RCT) gives a reliable estimate of the average effect, but clinicians want to know which individual patients will actually benefit. The method aims to avoid selecting patients based only on point estimates, which can produce many false positives when many candidates are screened at once.

The researchers reformulate the selection task as a multiple testing problem. The quantity of interest is the conditional average treatment effect (CATE), the expected difference in outcome with and without treatment for a patient with particular characteristics. For each candidate patient they test whether that patient’s CATE exceeds a clinically meaningful threshold, and they build a conformal p-value for each test. Conformal inference is a data-driven way to measure uncertainty that does not rely on strong distributional assumptions and can give valid statements in finite samples under an exchangeability condition.
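To make the idea of a conformal p-value concrete, here is a minimal sketch of the generic rank-based construction. It is not the paper's specific recipe: the summary does not say how the authors define the score for the benefit-threshold hypothesis, so the function name `conformal_p_values` and the convention that larger scores are "more extreme" are assumptions made for illustration.

```python
import numpy as np

def conformal_p_values(calib_scores, test_scores):
    """Generic conformal p-values (illustrative, not the paper's score).

    For each test score s, returns (1 + #{calibration scores >= s}) / (n + 1).
    If the test score is exchangeable with the calibration scores, this
    p-value is valid in finite samples.
    """
    calib = np.sort(np.asarray(calib_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = calib.size
    # searchsorted(side="left") counts calibration scores strictly below s,
    # so n minus that count is the number of scores >= s.
    n_ge = n - np.searchsorted(calib, test, side="left")
    return (1.0 + n_ge) / (n + 1.0)

# Hypothetical scores: four calibration patients, three candidates.
p = conformal_p_values([1.0, 2.0, 3.0, 4.0], [0.0, 2.5, 5.0])
```

A candidate whose score exceeds every calibration score receives the smallest possible p-value, 1 / (n + 1), so the size of the calibration set limits how strong the evidence for any single patient can be.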

After forming these p-values using calibration from the RCT, the method applies the Benjamini–Hochberg (BH) procedure to control the false discovery rate (FDR). False discovery rate control means limiting the expected proportion of wrongly declared beneficiaries among all selected patients. Because the framework is model-agnostic, the authors say it can be paired with many prediction methods, from standard machine-learning models to newer tabular foundation models such as TabPFN and TabICL.
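The BH step itself is standard and can be sketched in a few lines. The version below is a plain textbook implementation, assuming the p-values have already been computed; the function name `benjamini_hochberg` and the default level `alpha=0.1` are illustrative choices, not values taken from the paper.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Return the sorted indices selected by the BH step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    if m == 0:
        return np.array([], dtype=int)
    order = np.argsort(p)
    # BH compares the k-th smallest p-value with alpha * k / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(p[order] <= thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)
    # Step-up rule: keep everything up to the largest rank that passes,
    # even if some smaller ranks failed their own threshold.
    k = passed[-1]
    return np.sort(order[: k + 1])

# Hypothetical p-values for four candidate patients.
selected = benjamini_hochberg([0.01, 0.04, 0.03, 0.5], alpha=0.1)
```

The set of selected patients depends on all p-values jointly, which is why the procedure controls the proportion of false selections rather than the error for each patient separately.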

A practical feature is the use of external data, for example real-world data (RWD), to improve model training when the RCT is small. The external data are used only to train flexible treatment-effect models. The conformal calibration step that produces p-values stays anchored in the RCT data. This separation is intended to gain efficiency from extra data while preserving the validity of the selection step for the trial population.
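The division of roles between the two data sources can be sketched as follows. This is a simplified illustration under several assumptions that do not come from the paper: a linear T-learner (one least-squares outcome model per arm) stands in for the flexible treatment-effect model, part of the RCT is pooled with the external data for training, and confounding in the external data is ignored. All function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

def train_cate_model(X, t, y):
    """T-learner stand-in: CATE estimate = treated model minus control model."""
    coef1 = fit_linear(X[t == 1], y[t == 1])
    coef0 = fit_linear(X[t == 0], y[t == 0])
    return lambda X_new: predict_linear(coef1, X_new) - predict_linear(coef0, X_new)

def split_and_train(rct, rwd, calib_frac=0.5, seed=0):
    """Train on external data plus an RCT fold; hold out RCT rows for calibration.

    rct and rwd are (X, t, y) tuples. The held-out RCT indices are returned
    so that calibration scores can be computed from trial data only.
    """
    X_rct, t_rct, y_rct = rct
    X_rwd, t_rwd, y_rwd = rwd
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y_rct))
    n_cal = int(round(calib_frac * len(y_rct)))
    cal_idx, train_idx = perm[:n_cal], perm[n_cal:]
    # External rows enter the training set only, never the calibration set.
    X_train = np.vstack([X_rct[train_idx], X_rwd])
    t_train = np.concatenate([t_rct[train_idx], t_rwd])
    y_train = np.concatenate([y_rct[train_idx], y_rwd])
    return train_cate_model(X_train, t_train, y_train), cal_idx
```

Because the fitted model never sees the held-out RCT rows, scores computed on those rows remain exchangeable with scores for new patients from the trial population, whatever the quality of the external data. Poor external data would cost power in this design, not validity.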