Many reported weaknesses of image-capable large language models come from bad tests and labels, not the models themselves
This paper looks at why multimodal large language models (MLLMs) sometimes seem worse than traditional image classifiers. MLLMs are models that can read text and look at images. The authors show that how researchers test these models and the quality of the labels used for comparison can make performance look much better or much worse than it really is.
To find out what goes wrong, the team audited the common evaluation procedures and fixed key problems. They found three important issues: model answers that did not exactly match the provided class list were simply thrown away, multiple-choice tests used weak “distractor” options that made results look artificially high, and an “open-world” test failed only because model outputs were mapped to classes badly. They also measured often-overlooked settings such as batch size, the order in which images are presented, and which text encoder the model uses; those choices changed accuracy by a noticeable amount.
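To illustrate the answer-matching issue, here is a minimal sketch (not the authors' actual pipeline) of mapping a free-form model answer to the closest class name instead of discarding it when it fails an exact string match. The helper name `map_answer_to_class` and the fuzzy-matching cutoff are illustrative assumptions:

```python
import difflib

def map_answer_to_class(answer, class_names, cutoff=0.6):
    """Map a free-form model answer to the closest class name,
    rather than throwing away answers that don't match exactly.
    Hypothetical helper; cutoff is an assumed tolerance."""
    norm = answer.strip().lower()
    names = [c.lower() for c in class_names]
    # Exact match after normalization.
    if norm in names:
        return class_names[names.index(norm)]
    # Otherwise fall back to fuzzy string matching.
    matches = difflib.get_close_matches(norm, names, n=1, cutoff=cutoff)
    return class_names[names.index(matches[0])] if matches else None

classes = ["tabby cat", "tiger cat", "Egyptian cat"]
print(map_answer_to_class("a tabby-cat", classes))  # "tabby cat"
print(map_answer_to_class("zzzz", classes))         # None
```

Even a simple normalization step like this changes which answers count as valid, which is exactly the kind of evaluation choice the paper shows can swing measured accuracy.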
The authors created ReGT, a multilabel reannotation covering 625 ImageNet-1k classes, to study the effect of label quality. With corrected labels, MLLM scores improved by as much as 10.8% on some measures, substantially narrowing the gap between MLLMs and supervised vision models. The paper also reports that models relying less on supervised training were the most sensitive to label noise: they gain or lose the most when annotations are fixed or corrupted.
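The effect of a multilabel reannotation on scoring can be sketched as follows: instead of requiring the prediction to equal one canonical label, a prediction counts as correct if it appears in the image's set of valid labels. This is a generic sketch of multilabel top-1 accuracy, not the paper's exact metric:

```python
def multilabel_accuracy(predictions, label_sets):
    """Top-1 accuracy where a prediction is correct if it appears in
    the image's set of valid labels (multilabel ground truth),
    rather than having to equal a single canonical label."""
    correct = sum(1 for pred, labels in zip(predictions, label_sets)
                  if pred in labels)
    return correct / len(predictions)

# Toy example: "notebook" is wrong under single-label scoring if the
# canonical label is "laptop", but correct under multilabel scoring.
preds = ["laptop", "notebook", "desk"]
truths = [{"laptop", "notebook"}, {"laptop", "notebook"}, {"desk"}]
print(multilabel_accuracy(preds, truths))  # 1.0
```

Scoring against label sets like this is one way corrected annotations can raise measured accuracy without any change to the model itself.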
The team tested whether MLLMs could help people fix labels. In a controlled case study, human annotators either confirmed or incorporated MLLM predictions in about half of the difficult cases they reviewed. This suggests MLLMs could be useful tools for large-scale dataset curation, not just for automated labeling.