Many reported weaknesses of image-capable large language models come from bad tests and labels, not the models themselves
This paper looks at why multimodal large language models (MLLMs) sometimes seem worse than traditional image classifiers. MLLMs are models that can read text and look at images. The authors show that how researchers test these models and the quality of the labels used for comparison can make performance look much better or much worse than it really is.
To find out what goes wrong, the team audited the common evaluation procedures and fixed key problems. They found three important issues: model answers that did not exactly match the provided class list were simply thrown away, multiple-choice tests used weak “distractor” options that made results look artificially high, and an “open-world” test failed only because model outputs were mapped to classes badly. They also measured often-overlooked settings such as batch size, the order in which images are presented, and which text encoder the model uses; those choices changed accuracy by a noticeable amount.
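To illustrate the answer-matching issue, here is a minimal sketch (not the authors' actual pipeline) of mapping a free-form model answer to the closest class name instead of discarding it when it fails an exact string match. The helper name `map_answer_to_class` and the fuzzy-matching cutoff are illustrative assumptions:

```python
import difflib

def map_answer_to_class(answer, class_names, cutoff=0.6):
    """Map a free-form model answer to the closest class name,
    rather than throwing away answers that don't match exactly.
    Hypothetical helper; cutoff is an assumed tolerance."""
    norm = answer.strip().lower()
    names = [c.lower() for c in class_names]
    # Exact match after normalization.
    if norm in names:
        return class_names[names.index(norm)]
    # Otherwise fall back to fuzzy string matching.
    matches = difflib.get_close_matches(norm, names, n=1, cutoff=cutoff)
    return class_names[names.index(matches[0])] if matches else None

classes = ["tabby cat", "tiger cat", "Egyptian cat"]
print(map_answer_to_class("a tabby-cat", classes))  # "tabby cat"
print(map_answer_to_class("zzzz", classes))         # None
```

Even a simple normalization step like this changes which answers count as valid, which is exactly the kind of evaluation choice the paper shows can swing measured accuracy.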
The authors created ReGT, a multilabel reannotation covering 625 ImageNet-1k classes, to study the effect of label quality. With corrected labels, MLLM scores improved by as much as 10.8% on some measures, substantially narrowing the gap between MLLMs and supervised vision models. The paper also reports that models relying less on supervised training were the most sensitive to label noise: they gain or lose the most when annotations are fixed or corrupted.
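The effect of a multilabel reannotation on scoring can be sketched as follows: instead of requiring the prediction to equal one canonical label, a prediction counts as correct if it appears in the image's set of valid labels. This is a generic sketch of multilabel top-1 accuracy, not the paper's exact metric:

```python
def multilabel_accuracy(predictions, label_sets):
    """Top-1 accuracy where a prediction is correct if it appears in
    the image's set of valid labels (multilabel ground truth),
    rather than having to equal a single canonical label."""
    correct = sum(1 for pred, labels in zip(predictions, label_sets)
                  if pred in labels)
    return correct / len(predictions)

# Toy example: "notebook" is wrong under single-label scoring if the
# canonical label is "laptop", but correct under multilabel scoring.
preds = ["laptop", "notebook", "desk"]
truths = [{"laptop", "notebook"}, {"laptop", "notebook"}, {"desk"}]
print(multilabel_accuracy(preds, truths))  # 1.0
```

Scoring against label sets like this is one way corrected annotations can raise measured accuracy without any change to the model itself.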
The team tested whether MLLMs could help people fix labels. In a controlled case study, human annotators either confirmed or incorporated MLLM predictions in about half of the difficult cases they reviewed. This suggests MLLMs could be useful tools for large-scale dataset curation, not just for automated labeling.