When are the most likely answers from large language models actually correct? | arXiv News