HaloProbe: a Bayesian probe that detects and reduces object hallucinations in image captions without changing the model
Large vision-language models can name objects that are not actually in an image. This paper studies that problem, known as object hallucination, and shows that popular detection methods based on averaged attention values can be misleading. The authors find that two hidden factors — the token’s position in the caption and whether an object is repeated — can flip or erase attention-based trends. This statistical reversal is an instance of Simpson’s paradox, and it makes coarse attention analysis unreliable.
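To make the reversal concrete, here is a toy numeric sketch (all numbers hypothetical, not from the paper): within each position stratum, hallucinated tokens receive more image attention than grounded ones, yet the pooled averages point the other way because hallucinated tokens cluster late in the caption, where attention to the image is generally weaker.

```python
# Toy illustration of Simpson's paradox in attention-based hallucination
# analysis. All attention values and counts below are made up for the demo.

def mean(xs):
    return sum(xs) / len(xs)

# Per-token image-attention values, stratified by caption position.
grounded = {"early": [0.8] * 90, "late": [0.2] * 10}
halluc   = {"early": [0.9] * 10, "late": [0.3] * 90}

# Within EACH position stratum, hallucinated tokens get MORE attention.
for pos in ("early", "late"):
    assert mean(halluc[pos]) > mean(grounded[pos])

# Aggregated over positions, the trend reverses: hallucinated tokens
# appear to get LESS attention, purely because they concentrate in the
# late, low-attention part of the caption.
all_grounded = grounded["early"] + grounded["late"]
all_halluc = halluc["early"] + halluc["late"]
print(mean(all_grounded))  # ~0.74
print(mean(all_halluc))    # ~0.36
assert mean(all_halluc) < mean(all_grounded)
```

Any analysis that averages attention without conditioning on position or repetition can therefore draw the opposite conclusion from the stratified data.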
To fix this, the researchers introduce HaloProbe, a Bayesian framework for token-level hallucination detection. HaloProbe separates two kinds of information: internal signals from the model’s decoding dynamics (for example, attention patterns at the level of individual layers and heads, and logit-based confidence) and external caption statistics (for example, where a token appears in the sentence and whether it is a repeated mention). The system uses balanced training to learn internal evidence without being biased by the imbalanced counts of correct versus hallucinated tokens. It then learns a prior over the external features and combines both parts to estimate a calibrated posterior probability that a token is hallucinated.
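One standard way to combine these two parts is through Bayes' rule on the odds scale: a classifier trained on class-balanced data effectively outputs a likelihood ratio for the internal evidence, which can then be multiplied by the prior odds from the external features. The sketch below follows that textbook recipe; the paper's exact formulation may differ, and the function name and inputs are illustrative.

```python
def posterior_hallucination(p_internal_balanced, prior_external):
    """Combine balanced internal evidence with an external prior (sketch).

    p_internal_balanced: P(hallucinated | internal signals) from a
        classifier trained on class-BALANCED data; under a 50/50 training
        prior, its odds equal the likelihood ratio of the evidence.
    prior_external: P(hallucinated) estimated from external caption
        statistics such as token position and repeated mention.
    """
    lr = p_internal_balanced / (1.0 - p_internal_balanced)  # likelihood ratio
    prior_odds = prior_external / (1.0 - prior_external)
    post_odds = lr * prior_odds            # Bayes' rule on the odds scale
    return post_odds / (1.0 + post_odds)   # back to a probability

# Example: strong internal evidence (0.9 under a balanced prior) is
# tempered by a low base rate of hallucination for this token (0.1).
p = posterior_hallucination(0.9, 0.1)
print(round(p, 3))  # 0.5
```

The separation matters: training on balanced data keeps the internal detector from simply learning the base rate, while the external prior reintroduces that base rate explicitly and transparently.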
Crucially, HaloProbe operates as an external probe; it does not change the model’s internals. The authors show how the probe’s token-level scores can guide non-invasive mitigation during decoding. One method they test is a beam-search strategy that re-ranks candidate captions by HaloProbe’s scores. They also evaluate simple post-processing rescoring. Because these methods do not alter attention or other internal weights, they aim to preserve the model’s usual fluency and word choices while reducing hallucinations.
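A re-ranking step of this kind can be sketched as follows. This is a generic penalty-based rescorer, not the paper's exact algorithm: the `probe_score` callback stands in for HaloProbe's token-level posterior, and the `alpha` weight is a hypothetical knob.

```python
def rerank_candidates(candidates, probe_score, alpha=1.0):
    """Re-rank beam-search candidates without touching model internals.

    candidates: list of (caption_tokens, log_likelihood) pairs produced
        by ordinary beam search.
    probe_score: callable mapping a token to an estimated probability of
        hallucination -- a stand-in for HaloProbe's posterior.
    alpha: weight of the hallucination penalty (illustrative).
    """
    def combined(cand):
        tokens, loglik = cand
        penalty = sum(probe_score(t) for t in tokens)
        return loglik - alpha * penalty
    return sorted(candidates, key=combined, reverse=True)

# Toy usage: the slightly-more-likely caption mentions a hallucinated
# "frisbee", so the probe-guided score demotes it. Scores are made up.
fake_scores = {"dog": 0.05, "frisbee": 0.9}
probe = lambda t: fake_scores.get(t, 0.0)
beams = [(["a", "dog", "with", "a", "frisbee"], -4.0),
         (["a", "dog", "on", "grass"], -4.5)]
best = rerank_candidates(beams, probe)[0]
print(" ".join(best[0]))  # a dog on grass
```

Because the model's own log-likelihood remains part of the score, the re-ranker only overrides the default beam when the probe's evidence is strong enough, which is how fluency is preserved.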
In experiments on the MSCOCO image-caption dataset, and measured with the CHAIR metric (a metric for caption hallucinations), HaloProbe-guided decoding reduced hallucinations more effectively than state-of-the-art intervention-based methods. The paper argues that some intervention methods can hurt fluency or produce unnatural text because they change how the model normally operates. HaloProbe avoids those side effects by keeping the model’s internal dynamics intact and only using an external signal to pick better outputs.
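CHAIR itself is simple to state: the instance-level variant is the fraction of mentioned objects that are not in the image, and the sentence-level variant is the fraction of captions containing at least one such object. A minimal sketch, assuming object mentions have already been extracted and matched against ground-truth annotations (the hard part in practice):

```python
def chair(captions_objects, ground_truth_objects):
    """Compute instance- and sentence-level CHAIR for a batch (sketch).

    captions_objects: list of lists -- objects mentioned in each caption.
    ground_truth_objects: list of sets -- objects actually in each image.
    """
    total_mentions = halluc_mentions = halluc_captions = 0
    for mentioned, truth in zip(captions_objects, ground_truth_objects):
        hallucinated = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        halluc_mentions += len(hallucinated)
        halluc_captions += bool(hallucinated)  # caption has >=1 hallucination
    chair_i = halluc_mentions / max(total_mentions, 1)      # instance level
    chair_s = halluc_captions / max(len(captions_objects), 1)  # sentence level
    return chair_i, chair_s

# Toy example: two captions; the first mentions a "frisbee" not present.
ci, cs = chair([["dog", "frisbee"], ["cat"]],
               [{"dog", "grass"}, {"cat", "sofa"}])
print(ci, cs)  # ~0.333 and 0.5
```

Lower is better on both variants, so a reduction under HaloProbe-guided decoding means fewer invented objects per mention and fewer affected captions overall.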