When are the most likely answers from large language models actually correct?

June 26, 2026 | arXiv: 2606.27359v1

The researchers ask a simple but important question: when does the probability a language model assigns to a full answer line up with whether that answer is correct? By “sequence probability” they mean the model’s conditional probability for a continuation given a prompt. In plain terms, it is how likely the model thinks a whole response is after seeing the question.
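For an autoregressive model, the probability of a whole response factorizes over its tokens, so log p(answer | prompt) is the sum of the per-token log-probabilities. The following is a minimal sketch of that bookkeeping, not the authors' code. The function names and the toy token probabilities are hypothetical, and the sketch assumes the unnormalized sum; the summary does not say whether the paper applies any length normalization.

```python
import math


def sequence_log_prob(token_log_probs):
    """Return log p(answer | prompt) as the sum of per-token log-probabilities."""
    return sum(token_log_probs)


def sequence_prob(token_log_probs):
    """Return p(answer | prompt); this underflows for long answers, so prefer logs."""
    return math.exp(sequence_log_prob(token_log_probs))


# Hypothetical per-token probabilities for a three-token answer (illustrative only).
token_probs = [0.5, 0.5, 0.25]
token_log_probs = [math.log(p) for p in token_probs]

log_p = sequence_log_prob(token_log_probs)
p = sequence_prob(token_log_probs)
```

Working in log space is the usual choice here, since the product of many small token probabilities quickly becomes too small to represent as a float.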

To study this, the team measured the link between sequence probability and correctness at four levels: across decoding methods, across hyperparameter settings within each method, across different prompt-answer pairs in a dataset, and across repeated responses to the same prompt. Their experiments covered eight decoding methods, including local methods (low-temperature, top-k, and top-p sampling) and global methods (beam search, Best-of-N, and power sampling), along with 14 model variants from several families and six benchmark datasets.
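The local methods named above all reshape the next-token distribution before sampling. As a rough illustration, assuming the standard textbook definitions of temperature scaling, top-k, and top-p (nucleus) filtering rather than the paper's exact implementation, they can be sketched on a toy distribution as follows. The logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    filtered = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]


logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical next-token logits
base = softmax(logits)
sharp = softmax(logits, temperature=0.5)
topk = top_k_filter(base, k=2)
topp = top_p_filter(base, p=0.9)
```

Each of these pushes sampling toward higher-probability tokens one step at a time, which is why they are called local. Global methods such as beam search or Best-of-N instead score or select among whole candidate responses.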

Across prompt-answer pairs inside the same dataset, higher sequence probability was often a good sign of correctness. The paper reports consistently positive correlations on several datasets. For example, the math dataset MATH500 showed a strong positive link. But this pattern was not universal: one dataset (IFEval) showed a negative correlation, and results depended on the model family. Models that had extra post-training tended to show more positive correlations, while base models were more mixed.
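One simple way to quantify this kind of dataset-level link is to correlate each answer's sequence log-probability with a 0/1 correctness label. The sketch below uses a plain Pearson correlation, which is an assumption: the summary does not state which correlation statistic the paper uses. The data are invented for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one (log-probability, correctness) pair per prompt.
log_probs = [-2.0, -3.5, -8.0, -12.0, -4.0, -10.0]
correct = [1, 1, 0, 0, 1, 0]

r = pearson(log_probs, correct)  # positive when likelier answers tend to be correct
```

A positive value of `r` on a dataset means likelier answers tend to be the correct ones across prompts. It says nothing by itself about whether raising probability for a given prompt would make the answer correct.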

Importantly, the positive signal does not mean you can always improve accuracy by forcing the model to pick higher-probability answers. The authors find that changing decoding methods or tuning hyperparameters to increase sequence probability does not reliably raise correctness. Likewise, for repeated answers to the same prompt, sequence probability is not a dependable indicator of which individual response is correct. The within-sample correlations (comparing multiple responses to the same prompt) were often weak or centered near zero, which limits the usefulness of probability-weighted aggregation methods.
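To make the aggregation point concrete, here is a small sketch contrasting a plain majority vote with a probability-weighted vote over repeated responses to one prompt. The weighting scheme (summing sequence probabilities per distinct final answer) is one assumed variant, not necessarily a method evaluated in the paper, and the responses and log-probabilities are hypothetical.

```python
import math
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def probability_weighted_vote(answers, log_probs):
    """Return the final answer with the largest summed sequence probability."""
    peak = max(log_probs)  # shift by the max log-prob for numerical stability
    weights = defaultdict(float)
    for answer, log_prob in zip(answers, log_probs):
        weights[answer] += math.exp(log_prob - peak)
    return max(weights, key=weights.get)


# Hypothetical repeated responses to a single prompt (illustrative only).
answers = ["42", "42", "42", "17"]
log_probs = [-9.0, -9.5, -10.0, -5.0]

by_count = majority_vote(answers)
by_weight = probability_weighted_vote(answers, log_probs)
```

In this toy case the two rules disagree: three responses agree on "42", but the single response "17" has a much higher sequence probability and dominates the weighted vote. If within-prompt probability carries little information about correctness, as the weak correlations suggest, that extra weight has little to justify it.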