Large language models can carry out parts of statistical proofs but struggle to pick the right strategy
Researchers studied how widely available large language models (LLMs) help with a hard kind of expert work: developing statistical proofs. They found that these models can execute technical steps reliably when the problem is stated precisely and a clear strategy is given. But the same models become unreliable when the task is open‑ended, requires a long chain of reasoning, or asks the model to choose and adapt a proof strategy on its own.
To reach this conclusion, the team used statistical proof development as a “window” into human–AI collaboration. They ran four paired case studies, which together covered eight research‑level proof problems from everyday statistical research. The problems came from high‑dimensional inference, extreme‑value theory, transfer learning, and differential privacy. The authors tested three general‑purpose models (GPT‑5.4 Thinking, Gemini 3.1 Pro, and Claude Opus 4.6) and report that the qualitative results were consistent across all three. Most of the proof problems were taken from ongoing projects without public solutions; in one high‑dimensional case, the models reached a correct result by a different route than the existing paper.
At a high level, the paper identifies an “execution–strategy gap.” Current LLMs can carry out a supplied strategy: compute a step, manipulate an expression, or fill in a technical argument when the assumptions and targets are clear. They do poorly at the earlier, more open tasks that statisticians face: turning a real‑world scientific question into a statistical model, choosing which of many domain‑specific strategies to use, and adapting that strategy to the details of the problem. The authors argue that these earlier steps require deep expertise in both the statistical literature and the real‑world context, so AI does not remove the need for that expertise; it relocates it to where human judgment matters most.