Training vision-language models in stages fixes sight before thought and improves results | arXiv News