A practical version of Wald’s sequential test for faster, safer online A/B tests
Many teams run online A/B tests but check the results repeatedly and act on the first significant reading. That “peeking” inflates the false positive rate above its nominal level. The paper revives an old sequential method, Wald’s Sequential Probability Ratio Test (SPRT), and adapts it for modern, high-volume A/B testing. The goal is to let experiments stop early when a treatment clearly works (efficacy) or clearly does not help (futility), while keeping both false positives and false negatives under control.
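A small simulation can make the peeking problem concrete. The sketch below is not from the paper. It assumes unit-variance normal metrics, no true effect, equal-sized arms, and a two-sided z-test at the usual 1.96 cutoff, and the function and parameter names are hypothetical. It compares how often a test rejects at any of several looks with how often it rejects at the final look only.

```python
import numpy as np

def peeking_false_positive_rate(n_sims=2000, n_looks=20, n_per_look=100,
                                z_crit=1.96, seed=0):
    """Simulate A/A tests (no true effect) and return two rejection rates:
    (rejected at ANY look, rejected at the FINAL look only)."""
    rng = np.random.default_rng(seed)
    # Assumption: each look adds n_per_look users per arm with unit variance,
    # so the summed treatment-minus-control difference for one look
    # is N(0, 2 * n_per_look) under the null.
    increments = rng.normal(0.0, np.sqrt(2.0 * n_per_look),
                            size=(n_sims, n_looks))
    cum_diff = np.cumsum(increments, axis=1)
    n_cum = n_per_look * np.arange(1, n_looks + 1)
    # z-statistic of the cumulative difference at every look.
    z = cum_diff / np.sqrt(2.0 * n_cum)
    any_look = (np.abs(z) > z_crit).any(axis=1).mean()
    final_look = (np.abs(z[:, -1]) > z_crit).mean()
    return float(any_look), float(final_look)

if __name__ == "__main__":
    peek_rate, fixed_rate = peeking_false_positive_rate()
    print(f"reject at any look: {peek_rate:.3f}, at final look: {fixed_rate:.3f}")
```

Under these assumptions the any-look rate comes out well above the final-look rate, which is the inflation that sequential methods such as SPRT are designed to prevent.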
The authors introduce three main ideas. First, SPRT-z replaces Hajnal’s exact sequential t-test with a Z-statistic that uses a normal approximation. This is much faster to compute for large experiments. Second, Scale-Free Horizon Calibration (SFHC) is a Monte Carlo bisection procedure on a standardized Z scale that picks a maximum sample size. That cap limits how long an experiment can run while trying to keep the planned power to detect a business-relevant minimum detectable effect (MDE). Third, a Brownian Median Unbiased Estimator and matching confidence intervals correct the upward bias that appears when you stop early for a good result.
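The first idea can be illustrated with the textbook form of Wald's test under a normal approximation. This is a minimal sketch, not the authors' SPRT-z implementation. It assumes a one-sided test of zero effect against an effect equal to the MDE, an estimated difference `diff_hat` with standard error `se`, and Wald's approximate stopping thresholds. All function and parameter names are hypothetical.

```python
import math

def wald_log_thresholds(alpha, beta):
    """Wald's approximate log-likelihood-ratio thresholds for target
    Type I error alpha and Type II error beta."""
    upper = math.log((1 - beta) / alpha)   # cross this: stop for efficacy
    lower = math.log(beta / (1 - alpha))   # cross this: stop for futility
    return lower, upper

def sprt_z_decision(diff_hat, se, mde, alpha=0.05, beta=0.20):
    """Decide at one look, treating diff_hat as N(true effect, se**2).

    The log-likelihood ratio compares H1: effect == mde with H0: effect == 0.
    """
    llr = (mde * diff_hat - 0.5 * mde ** 2) / se ** 2
    lower, upper = wald_log_thresholds(alpha, beta)
    if llr >= upper:
        return "efficacy", llr
    if llr <= lower:
        return "futility", llr
    return "continue", llr

if __name__ == "__main__":
    # Hypothetical numbers: an MDE of 0.2 and a standard error of 0.1.
    for observed in (0.5, 0.1, -0.1):
        print(observed, sprt_z_decision(diff_hat=observed, se=0.1, mde=0.2))
```

Each look needs only a running estimate and its standard error, which is why a Z-based statistic is cheap to evaluate at scale.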
At a high level, the workflow enforces a practical data rule so that the math works: each user’s metric is counted only after a fixed observation window (for example, 7 days), so the new data at each check are effectively independent of the data from previous checks. The normal approximation (the “Z” in SPRT-z) avoids expensive exact t-distribution calculations. Because SFHC works on the standardized scale, the runtime cap it finds does not depend on the metric’s variance, so one calibration can serve many metrics and traffic scales. The bias correction uses Brownian motion, an idealized continuous-time random-walk approximation, to simulate stopping behavior and produce a median-unbiased estimator, that is, one that is equally likely to overestimate or underestimate the true effect across the possible stopping situations.
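The general idea of calibrating a cap by Monte Carlo bisection on a standardized scale can be sketched as follows. This is a toy illustration, not the authors' SFHC procedure, and it rests on several assumptions. The log-likelihood ratio is modeled as a discretized Brownian motion observed at evenly spaced looks. Standardized information is defined as `tau = mde**2 / se**2`, so under the alternative each look adds a normal increment with mean `d_tau / 2` and variance `d_tau`. Reaching the cap without crossing a boundary counts as no efficacy. All names are hypothetical.

```python
import math
import numpy as np

def simulate_efficacy_steps(n_paths, max_steps, d_tau, alpha, beta, seed=0):
    """Simulate standardized LLR paths under H1 (true effect == MDE).

    Returns, per path, the look at which the efficacy boundary is crossed
    before the futility boundary, or max_steps + 1 if that never happens.
    """
    rng = np.random.default_rng(seed)
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    increments = rng.normal(0.5 * d_tau, math.sqrt(d_tau),
                            size=(n_paths, max_steps))
    llr = np.cumsum(increments, axis=1)
    up, down = llr >= upper, llr <= lower
    never = max_steps + 1
    first_up = np.where(up.any(axis=1), up.argmax(axis=1) + 1, never)
    first_down = np.where(down.any(axis=1), down.argmax(axis=1) + 1, never)
    return np.where(first_up < first_down, first_up, never)

def calibrate_horizon(efficacy_steps, max_steps, target_power):
    """Bisection for the smallest number of looks whose truncated power
    reaches target_power. Reusing the same simulated paths for every
    candidate cap keeps the power estimate monotone in the cap."""
    def power(k):
        return float(np.mean(efficacy_steps <= k))
    if power(max_steps) < target_power:
        raise ValueError("target power is not reachable within max_steps")
    lo, hi = 1, max_steps
    while lo < hi:
        mid = (lo + hi) // 2
        if power(mid) >= target_power:
            hi = mid
        else:
            lo = mid + 1
    return lo

def tau_to_n_per_arm(tau, sigma, mde):
    """Convert standardized information to users per arm, assuming two
    equal arms with common standard deviation sigma (se**2 = 2*sigma**2/n)."""
    return 2.0 * sigma ** 2 * tau / mde ** 2

if __name__ == "__main__":
    d_tau, max_steps = 0.25, 200
    steps = simulate_efficacy_steps(5000, max_steps, d_tau, 0.05, 0.20)
    k = calibrate_horizon(steps, max_steps, target_power=0.80)
    print("cap in looks:", k, "cap in tau:", k * d_tau)
    print("users per arm:", tau_to_n_per_arm(k * d_tau, sigma=2.0, mde=0.5))
```

In this sketch the metric's variance enters only in the final conversion from `tau` to users per arm, so the bisection itself can be run once and reused across metrics.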
Why this matters to practitioners: the method provides a way to stop early both when a change clearly helps users and when it clearly does not. It also ties decisions to a business-relevant MDE and gives explicit control over Type I error (false positives) and Type II error (false negatives). In simulations reported by the authors, this workflow controlled Type I and II errors, reduced sample size compared to fixed-horizon testing, and reduced estimation bias from early stopping, with confidence interval coverage close to the nominal level in most scenarios. The paper also notes the method has been deployed in production at a large software company.