Many reported weaknesses of image-capable large language models come from bad tests and labels, not the models themselves | arXiv News