
Spatially Speculative Decoding (SSD) speeds up autoregressive image generation by up to 13.3×

June 19, 2026 | arXiv: 2606.20543v1

Autoregressive image models build pictures by predicting one token at a time, as a language model does with words. That approach flattens a two‑dimensional image into a one‑dimensional sequence of tokens. The flattening discards the natural neighborhood structure of images and creates a large compute and memory bottleneck at generation time.
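To make the loss of neighborhood structure concrete, here is a minimal sketch. It assumes row‑major (raster) flattening, which the summary above does not specify, and the function names are hypothetical. It shows how far apart two adjacent grid positions end up in the flattened sequence.

```python
def flatten_index(row, col, width):
    """Position of grid cell (row, col) in a row-major 1D sequence (assumed ordering)."""
    return row * width + col


def neighbor_gaps(row, col, height, width):
    """Distance in the 1D sequence from (row, col) to its right and below neighbors."""
    here = flatten_index(row, col, width)
    gaps = {}
    if col + 1 < width:
        gaps["right"] = flatten_index(row, col + 1, width) - here
    if row + 1 < height:
        gaps["below"] = flatten_index(row + 1, col, width) - here
    return gaps


# In a 32-token-wide grid, the token directly below is 32 steps away in the sequence.
print(neighbor_gaps(0, 0, height=32, width=32))
```

Under this ordering the right neighbor is always one step away, while the neighbor directly below is a full row away, even though both are equally close in the image.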

The paper introduces Spatially Speculative Decoding (SSD). Instead of predicting only the next token in the 1D order, the model is trained to predict the token to the right and the token below the current position at the same time. This simple change aligns the model’s predictions with the true 2D layout of images and lets the system reuse computations across nearby pixels.
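As a rough illustration of what "predict the token to the right and the token below" could mean as a training target, the sketch below builds both targets for every position in a small token grid. This is not the authors' implementation: the function name, the use of `None` at the borders, and the toy grid are all assumptions made for the example.

```python
def spatial_targets(grid):
    """For each grid position, return (right_token, below_token).

    `grid` is a list of rows of token ids. Border positions with no
    right or below neighbor get None (an assumption of this sketch).
    """
    height = len(grid)
    width = len(grid[0])
    targets = []
    for r in range(height):
        row_targets = []
        for c in range(width):
            right = grid[r][c + 1] if c + 1 < width else None
            below = grid[r + 1][c] if r + 1 < height else None
            row_targets.append((right, below))
        targets.append(row_targets)
    return targets


# Toy 2x3 grid of token ids.
tokens = [
    [10, 11, 12],
    [20, 21, 22],
]
for row in spatial_targets(tokens):
    print(row)
```

Each position carries two targets instead of one, which is the sense in which the predictions follow the 2D layout. How SSD uses these predictions during decoding is not described in the summary, so it is left out here.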

On standard generation tests, SSD speeds up autoregressive image generation by as much as 13.3× while keeping high image quality. The authors report these results on two benchmarks, DPG‑Bench and GenEval. In plain terms, SSD generates images of comparable quality much faster by exploiting the fact that nearby pixels are strongly related.

The work matters because it targets a practical bottleneck: inference memory and compute. By respecting image geometry instead of flattening it, SSD opens a route to faster, higher‑resolution autoregressive generators. That could help move these models closer to real‑time applications and make them cheaper to run.

There are important caveats. The speedup is reported as “up to” 13.3×, so it is a best‑case number and will likely depend on the model, image size, and hardware. The results in the abstract come from specific benchmarks (DPG‑Bench and GenEval), and broader tests are needed to confirm the gains in other settings. SSD is presented as a new framework, so further work will be needed to understand its limits and to integrate it into production systems.