Bug in gsynth R package led to much smaller reported uncertainties and changed results in three APSR papers
This paper documents a software error in the gsynth R package that could make analysts look far more certain than they should be. The problem affected versions of gsynth released before the 1.3.1 update on CRAN in December 2025. When users selected two options together — the parametric bootstrap for inference and the IFE-EM estimator (an Interactive Fixed Effects estimator that uses an Expectation-Maximization fitting step) — the package produced standard errors that were often far too small. The authors show that this behavior could lead to many false positive findings in realistic data settings.
At a technical level, the bug came from how the parametric bootstrap was implemented. A bootstrap is a resampling method used to estimate how much an estimate would vary from sample to sample. The published algorithm that gsynth intended to follow (Xu 2017) adds simulated out-of-sample prediction errors to fitted values. The buggy implementation instead reused in-sample residuals, that is, the differences between observed outcomes and their fitted values on the same data. Because the model is fitted to make those residuals small, they understate how far its predictions miss on data it has not seen. Using in-sample residuals therefore mechanically lowers the variation across bootstrap samples and so shrinks reported standard errors.
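The gap between in-sample residuals and out-of-sample prediction errors can be seen in a toy example. The Python sketch below is not gsynth code and does not reproduce the IFE-EM estimator or the bootstrap in Xu (2017). It assumes a simple straight-line fit on simulated data, and the helper names (fit_line, in_sample_residuals, leave_one_out_errors) are hypothetical, chosen only for illustration. It compares residuals from a fit on all points with errors from predicting each point after refitting without it.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def in_sample_residuals(xs, ys):
    """Residuals from a model fitted on the same points it is judged on."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def leave_one_out_errors(xs, ys):
    """Prediction error for each point from a model fitted without it."""
    errors = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append(ys[i] - (a + b * xs[i]))
    return errors


def root_mean_square(values):
    return (sum(v * v for v in values) / len(values)) ** 0.5


# Simulated data (an assumption for illustration, not data from the paper).
rng = random.Random(0)
xs = [float(i) for i in range(12)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

resid = in_sample_residuals(xs, ys)
loo = leave_one_out_errors(xs, ys)
rms_in = root_mean_square(resid)
rms_loo = root_mean_square(loo)

print(f"RMS in-sample residual:      {rms_in:.3f}")
print(f"RMS leave-one-out error:     {rms_loo:.3f}")
```

For a least-squares fit like this one, each leave-one-out error is at least as large in magnitude as the matching in-sample residual, so a bootstrap that draws from the residuals would draw from a distribution that is too narrow.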
To test how harmful this was, the authors ran an empirical Monte Carlo study. They applied randomly assigned placebo treatments to a set of state-level panel datasets and recorded how often gsynth reported statistically significant effects when there should be none. The package’s historical behavior produced high false positive rates. The authors also reanalyzed three papers published in the American Political Science Review that had used the affected code path. They compared (1) the original published results using the historical gsynth implementation, (2) results using a corrected implementation of the parametric bootstrap that restores the omitted leave-one-out step, which supplies the out-of-sample prediction errors, and (3) results using the Generalized Synthetic Control method (GSC) as described in Xu (2017). Correcting the implementation made most previously significant findings statistically insignificant. Using GSC made every finding in these checks statistically insignificant.
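The logic of a placebo test can be sketched in a few lines. The Python example below is a toy under stated assumptions, not the authors' Monte Carlo design: it uses simulated noise rather than state-level panels, a simple mean rather than gsynth, and a hypothetical se_scale parameter whose value of 0.5 is arbitrary and not a figure from the paper. It counts how often a true effect of zero is declared significant at the 5% level when the reported standard error is honest and when it is understated.

```python
import random
import statistics


def placebo_false_positive_rate(se_scale, n_units=30, n_trials=2000, seed=1):
    """Share of placebo trials flagged as significant at the 5% level.

    se_scale multiplies the honest standard error. Values below 1 mimic
    a procedure that understates uncertainty (an illustrative assumption).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Placebo world: the true effect is zero, so outcomes are pure noise.
        effects = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
        estimate = statistics.fmean(effects)
        honest_se = statistics.stdev(effects) / n_units ** 0.5
        reported_se = se_scale * honest_se
        # Two-sided test using the normal critical value 1.96.
        if abs(estimate / reported_se) > 1.96:
            rejections += 1
    return rejections / n_trials


honest_rate = placebo_false_positive_rate(se_scale=1.0)
shrunk_rate = placebo_false_positive_rate(se_scale=0.5)  # 0.5 is arbitrary

print(f"False positive rate, honest SE:      {honest_rate:.3f}")
print(f"False positive rate, understated SE: {shrunk_rate:.3f}")
```

With an honest standard error the rejection rate should sit near the nominal 5%. Shrinking the standard error leaves the estimates unchanged but inflates every test statistic, so more placebo effects cross the significance threshold.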