Researchers find a “value” direction inside a language model that tracks whether it thinks its current plan will work
Researchers report that a large language model appears to keep an internal signal that estimates how likely its current line of thought is to succeed. In the Qwen3-8B model they studied, a single linear direction in the model’s activations, which they call a “value axis,” reliably rises when the model is on a good path and falls when it is not. That internal signal both correlates with the model’s expressed confidence and can be nudged to change behavior.
To find the axis, the team created a controlled setup they call in‑context reinforcement learning. They generated 300 short conversations in which the model tries to guess a hidden rule (for example, “include a dash”) and receives only +1 or −1 feedback. At the moment the model first satisfies the rule, the researchers compared the average neural activations of the tokens that come after that success with those of the tokens just before it. That difference defines the value axis. The representation appears in the middle layers of the network; layers 21–22 gave especially strong results and generalized to held‑out rules with an AUROC (a standard measure of how well a signal separates two classes) above 0.95.
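The difference-of-means construction and the AUROC check described above can be illustrated with a small sketch. This is not the researchers' code: the activations here are synthetic random vectors, and the names `value_axis`, `auroc`, `fake_acts`, the hidden size of 64, and the sign convention (post-success minus pre-success) are all assumptions made for illustration. In a real replication, the arrays would hold per-token activations read from a middle layer of the model.

```python
import numpy as np

def value_axis(pre_acts, post_acts):
    """Difference of means (post-success minus pre-success), scaled to unit length.
    The sign convention is an assumption, chosen so the projection rises after success."""
    direction = post_acts.mean(axis=0) - pre_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def auroc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count as half)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

# Synthetic stand-in for model activations (hypothetical data, not from the study).
rng = np.random.default_rng(0)
hidden_dim = 64
true_dir = rng.normal(size=hidden_dim)
true_dir /= np.linalg.norm(true_dir)

def fake_acts(n, shift):
    """Gaussian noise plus a shift along a planted direction."""
    return rng.normal(size=(n, hidden_dim)) + shift * true_dir

# "Training" conversations define the axis; held-out ones test it.
pre_train, post_train = fake_acts(200, 0.0), fake_acts(200, 3.0)
pre_test, post_test = fake_acts(100, 0.0), fake_acts(100, 3.0)

axis = value_axis(pre_train, post_train)

# Project held-out activations onto the axis and measure class separation.
score = auroc(post_test @ axis, pre_test @ axis)
print(f"held-out AUROC on synthetic data: {score:.3f}")
```

An AUROC of 0.5 would mean the projection carries no information about whether a token comes before or after success, while 1.0 would mean perfect separation. The number printed here reflects only the planted synthetic shift and says nothing about the model itself.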
They tested whether this axis tracks real behavior in other settings. On 455 AIME math problems, projections onto the axis were higher when the model answered “yes” to “Do you think your answer is correct?” than when it answered “no,” and the axis separated confident from unconfident runs with AUROC above 0.75. Rollouts that contain backtracking phrases (