Large language models struggle to follow long step-by-step arithmetic procedures
Researchers tested whether large language models (LLMs) actually carry out a sequence of steps when asked to do so, or merely guess a plausible final answer. They built a controlled benchmark that presents a model with a short algorithm written in plain language and two numbers as inputs, then asks it to return the final numeric result. The goal was to separate true step-by-step execution from cases where a model arrives at a correct answer through shortcuts or luck.
The benchmark uses simple arithmetic only. Each example starts by setting S1 = x and S2 = y, then defines a sequence of intermediate variables S3, S4, ... by applying +, −, ×, or ÷ to earlier values. The model must follow the written steps exactly and return the last value rounded to three decimal places. The authors varied how many steps the algorithm had (from 5 up to 95), how far back a step might need to reach for an earlier intermediate value (they call this "look-back," up to seven steps), the input ranges ([0,1], [1,10], [10,100]), and whether operations were mixed or all the same. They produced 55 datasets with 55,000 total examples and tested 14 different models.
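To make the setup concrete, here is a minimal Python sketch of how such an example could be generated. Only the S1/S2 naming scheme, the four operations, the look-back limit, and the three-decimal rounding come from the paper's description; the function name, sampling choices, and the division-by-zero guard are assumptions for illustration.

```python
import random

def make_procedure(num_steps, lookback, low, high, ops="+-*/"):
    """Generate a benchmark-style procedure and its ground-truth answer.

    Hypothetical generator: the S1/S2 naming, four operations, look-back
    limit, and three-decimal rounding follow the paper's description;
    everything else here is an assumption.
    """
    values = [round(random.uniform(low, high), 3),
              round(random.uniform(low, high), 3)]
    lines = [f"S1 = {values[0]}", f"S2 = {values[1]}"]
    for i in range(2, num_steps):
        # Operands may come from at most `lookback` steps before this one.
        a = random.randrange(max(0, i - lookback), i)
        b = random.randrange(max(0, i - lookback), i)
        op = random.choice(ops)
        x, y = values[a], values[b]
        if op == "+":
            v = x + y
        elif op == "-":
            v = x - y
        elif op == "*":
            v = x * y
        else:
            v = x / y if y != 0 else x  # assumed guard against division by zero
        values.append(v)
        lines.append(f"S{i + 1} = S{a + 1} {op} S{b + 1}")
    # The model is asked for the last value, rounded to three decimal places.
    return "\n".join(lines), round(values[-1], 3)

prompt, answer = make_procedure(num_steps=10, lookback=7, low=1, high=10)
print(prompt)
print("expected:", answer)
```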
The results show that performance falls as procedures get longer or require retrieving older intermediate values. Average first-answer accuracy fell from 61% on 5-step procedures to 20% on 95-step procedures, and increasing the look-back depth from one step to seven reduced accuracy by 18.43 percentage points. The authors also scored execution traces, not just final answers: exact-step execution (every step carried out correctly) dropped from 70.88% to 46.84%, while "under-execution" (stopping early or skipping steps) rose from 24.25% to 50.87%. Common failure modes included missing or premature answers, self-correction after an initial mistake, under-executed traces, and invented extra steps.
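To illustrate how a trace-level analysis like this could work, here is a small sketch that compares a model's reported intermediate values against the ground-truth trace and assigns a label matching the failure modes above. The paper's actual scoring rules are not reproduced here, so the tolerance, label names, and function signature are assumptions.

```python
def classify_trace(model_values, true_values, tol=1e-3):
    """Illustrative trace scoring (not the paper's exact metric):
    compare the model's intermediate values to the ground-truth trace."""
    matched = sum(
        1 for m, t in zip(model_values, true_values)
        if abs(m - t) <= tol
    )
    if len(model_values) == len(true_values) and matched == len(true_values):
        return "exact execution"   # every step present and correct
    if len(model_values) < len(true_values):
        return "under-execution"   # stopped early or skipped steps
    if len(model_values) > len(true_values):
        return "over-execution"    # invented extra steps
    return "mis-execution"         # right length, wrong values

print(classify_trace([2.0, 5.0], [2.0, 5.0, 7.0]))  # -> "under-execution"
```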
This matters because many real tasks require strict adherence to a sequence of actions. In settings such as precise calculations, rule-based decision making, or automated workflows, users expect the steps to be followed exactly. The study shows that a correct final answer can hide deeper problems: a model might give the right number while failing to track intermediate states or to honor the exact procedure.