Checks on gravitational-wave population models can miss bad fits when individual events are weakly measured
This paper tests how well a common statistical check finds problems in models used to describe populations of merging black holes seen in gravitational waves. The authors focus on a tool called posterior predictive checks (PPCs), which compare catalogs predicted by a model with the catalog of events actually observed. The authors show that PPCs can fail when single events give only weak information about a parameter, for example the angles between the black-hole spins and the orbit (spin tilts).
The team compared several kinds of PPCs. “Event-level” PPCs use posterior draws for each event. Posterior draws are samples from the probability distribution for a parameter after combining the data and a prior belief. “Data-level” PPCs instead use point estimates from the data, specifically maximum likelihood values, which do not depend on the prior. The team also tested two new event-level variants called partial PPCs and split PPCs. The tests used simulated catalogs designed to resemble LIGO–Virgo–KAGRA (LVK) O3-era data, and the authors then applied the same checks to the recent GWTC-4 catalog. To judge fit they computed posterior predictive p-values (pT) from several catalog-level numbers: the mean tilt, the tilt standard deviation, a ratio of counts in angle bins, and the fraction of events with large tilt.
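To make the idea of a posterior predictive p-value concrete, here is a minimal Python sketch. It is not the authors' code. It assumes a common convention in which the p-value is the fraction of model-predicted catalogs whose summary statistic is at least as large as the observed one; the paper's exact definition may differ. The function names, the use of tilt angles in radians, and the 90-degree threshold for a "large" tilt are all illustrative assumptions.

```python
import numpy as np

# Hypothetical catalog-level statistics. Each takes an array of tilt angles
# in radians, one value per event. The definitions are assumptions made for
# illustration, not the paper's exact choices.
def mean_tilt(tilts):
    return float(np.mean(tilts))

def tilt_std(tilts):
    return float(np.std(tilts))

def large_tilt_fraction(tilts, threshold=np.pi / 2):
    # Assumed definition of "large": tilt above 90 degrees.
    return float(np.mean(np.asarray(tilts) > threshold))

def posterior_predictive_p_value(observed, replicated, statistic):
    """Fraction of replicated catalogs with statistic >= the observed value.

    observed   : 1-D array of per-event values (e.g. maximum likelihood tilts)
    replicated : iterable of 1-D arrays, each one catalog predicted by the model
    statistic  : function mapping a catalog to a single number
    """
    t_obs = statistic(observed)
    t_rep = np.array([statistic(catalog) for catalog in replicated])
    return float(np.mean(t_rep >= t_obs))

if __name__ == "__main__":
    # Toy data only: random numbers standing in for real catalogs.
    rng = np.random.default_rng(0)
    observed = rng.uniform(0.0, np.pi, size=50)
    replicated = [rng.uniform(0.0, np.pi / 2, size=50) for _ in range(200)]
    for stat in (mean_tilt, tilt_std, large_tilt_fraction):
        p = posterior_predictive_p_value(observed, replicated, stat)
        print(stat.__name__, p)
```

Under this convention, a p-value very close to 0 or 1 means the observed catalog looks unlike what the model predicts for that statistic, while a value near the middle gives no sign of trouble. In an event-level PPC the `observed` array would hold posterior draws for each event, and in a data-level PPC it would hold maximum likelihood values.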
Their main finding is that PPCs based on data-level quantities (maximum likelihood points) are always at least as good, and often better, at spotting a wrong model than any event-level PPC. Event-level PPCs can say a model fits well even when it does not, if the per-event measurements are dominated by the prior. Partial PPCs work better when they target a feature the model already predicts well, and worse when the model cannot capture that feature. Split PPCs, which split the catalog into parts used for fitting and testing, were the least informative of the variants they tried.