One-step gradient delay need not stop large-scale asynchronous pipeline training of LLMs
This paper looks at a practical problem in training very large language models (LLMs). Training across many GPUs often uses pipeline parallelism, which cuts a model into stages. Running such pipelines synchronously leaves some GPUs idle during “pipeline bubbles.” Asynchronous pipeline parallelism (AsyncPP) avoids this idle time but computes updates from slightly out-of-date gradients, a problem called gradient staleness. The authors show that a one-step gradient delay, when handled properly, is not a fundamental barrier to good training quality.
The team ran experiments comparing several optimizers and strategies under a one-step delay produced by a schedule called PipeDream-2BW. Unlike the older PipeDream schedule, PipeDream-2BW guarantees a constant one-step delay across pipeline stages. They report results on medium models (135M and 360M parameters) and scale up to a 10-billion-parameter Mixture-of-Experts (MoE) model trained on 200 billion tokens. Their main empirical finding is that the choice of optimizer matters considerably: AdamW, a common optimizer, degrades substantially under a one-step delay, while a newer optimizer called Muon is much more robust. They also test an optimizer-agnostic correction inspired by Error Feedback and provide a theoretical convergence analysis for Muon with and without this correction.
At a high level, pipeline parallelism splits the model so that different GPUs hold different layers. Synchronous schedules wait so that every stage sees the same parameters, but the waiting creates idle time. AsyncPP removes the waiting and applies updates as they arrive, so a gradient can be one step out of date. PipeDream-2BW controls this by batching updates so the delay is always exactly one step. The authors study the delayed-update rule (an update based on the previous step’s gradient) and a simple Error-Feedback-style correction that mixes recent delayed updates to reduce the mismatch caused by staleness.
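The delayed-update rule can be illustrated with a minimal sketch. It is not the authors' setup: it assumes plain gradient descent (neither AdamW nor Muon) on a hypothetical one-dimensional objective f(x) = 0.5 * x**2, and the names `grad` and `run` are made up for this illustration. With `delay=1`, each update uses the gradient evaluated at the previous step's parameters, which mimics a constant one-step staleness. The Error-Feedback-style correction is not shown.

```python
def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * x**2 (an assumption for illustration).
    return x


def run(steps, lr, delay):
    """Gradient descent where each update uses parameters from `delay` steps ago.

    delay=0 is the usual synchronous update; delay=1 mimics a one-step-stale gradient.
    Returns the full list of iterates, including the initial padding entries.
    """
    x = 1.0
    history = [x] * (delay + 1)  # pad so a stale iterate exists at the first step
    for _ in range(steps):
        stale_x = history[-(delay + 1)]  # parameters the gradient is computed on
        x = x - lr * grad(stale_x)       # update applied to the current parameters
        history.append(x)
    return history


if __name__ == "__main__":
    for lr in (0.1, 1.5):
        sync_final = run(steps=50, lr=lr, delay=0)[-1]
        stale_final = run(steps=50, lr=lr, delay=1)[-1]
        print(f"lr={lr}: synchronous={sync_final:.3e}, one-step delay={stale_final:.3e}")
```

On this toy problem both variants converge at a small step size, while at a larger step size the synchronous run still converges and the delayed run diverges. This shows only that staleness can narrow the range of stable step sizes for plain gradient descent on a quadratic; it says nothing about how AdamW or Muon behave at scale.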