Training vision-language models in stages fixes sight before thought and improves results

May 20, 2026 · arXiv: 2605.20177v1

This paper argues that many failures of vision-language models come not from weak thinking but from poor visual perception. The authors show that if a model mis-sees an image at the start, longer chains of reasoning rarely fix the mistake. They therefore separate learning into distinct stages so the model learns to “see” reliably before it learns to reason about images.

The team split post-training into three parts: visual perception, visual reasoning, and textual reasoning. They built a perception-focused dataset by turning image captions into targeted perception questions, then keeping only examples where the caption implies the answer but the model, answering from the image alone, fails. For the perception stage they favor Reinforcement Learning with Verifiable Rewards (RLVR) over caption-based Supervised Fine-Tuning (SFT). RLVR rewards answers that can be checked automatically and keeps the model on-policy during learning.
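The filtering rule and the checkable reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `PerceptionExample` fields, the exact-match `verifiable_reward`, and the two answering callables (`answer_from_caption`, `answer_from_image`) are all hypothetical stand-ins for whatever models and answer checkers the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PerceptionExample:
    """One caption-derived perception question (field names are assumptions)."""
    image_id: str
    caption: str
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so answers compare loosely."""
    return " ".join(text.lower().split())


def verifiable_reward(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized prediction matches the reference, else 0.0.

    Exact match is an assumed checker; a real setup may use a richer verifier.
    """
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def filter_hard_perception(
    examples: Iterable[PerceptionExample],
    answer_from_caption: Callable[[str, str], str],
    answer_from_image: Callable[[str, str], str],
) -> List[PerceptionExample]:
    """Keep examples answerable from the caption but missed from the image."""
    kept = []
    for ex in examples:
        caption_pred = answer_from_caption(ex.caption, ex.question)
        image_pred = answer_from_image(ex.image_id, ex.question)
        caption_ok = verifiable_reward(caption_pred, ex.answer) == 1.0
        image_ok = verifiable_reward(image_pred, ex.answer) == 1.0
        if caption_ok and not image_ok:
            kept.append(ex)
    return kept
```

In this sketch the same `verifiable_reward` function serves two roles: it decides which examples survive filtering, and it could act as the scalar reward during the RLVR perception stage, since each kept question has a reference answer that can be checked.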

Across several open models and benchmarks, they report clear gains. A diagnostic check found that 86.9% of incorrect answers from a tested model on visual math tasks were due to perception mistakes. Staged post-training gave both better accuracy and shorter reasoning traces: one model saw a 1.46-point increase in math reasoning accuracy and a 20.8% shorter average reasoning trace compared with merged training. On the WeMath benchmark, they raised Qwen3-VL-8B from 50.9% to 56.1% accuracy (a 5.2-point gain), and they report a 7.43-point gain for Qwen2.5-VL-7B when adding the perception stage. Their staged Qwen3-VL-8B also reached 75.9% on MathVista and 74.5% on RealWorldQA. They further show that replacing RLVR with SFT in the perception stage caused drops of 8.1% and 1.6% on WeMath for different models, indicating that RLVR was more effective in their setup.

Why this matters: the work suggests that building a strong visual foundation reduces wasted reasoning effort. Better perception made reasoning shorter and more accurate, and the authors propose a new “capability-based” curriculum that is orthogonal to the usual idea of ordering data by difficulty. Combining capability staging with difficulty-based curricula gave extra gains in their tests.
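One way to picture how capability staging and difficulty ordering combine is the sketch below. It groups samples by capability stage, then sorts each stage by difficulty. The stage order follows the order in which the article lists the three parts, and the easy-to-hard sort, the `Sample` fields, and the `build_curriculum` helper are assumptions made for illustration rather than details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed stage order: perception first, then the two reasoning stages.
STAGE_ORDER = ["visual_perception", "visual_reasoning", "textual_reasoning"]


@dataclass
class Sample:
    """A training sample tagged with a capability and a difficulty score."""
    sample_id: str
    capability: str
    difficulty: float  # hypothetical score; higher means harder


def build_curriculum(samples: List[Sample]) -> List[List[Sample]]:
    """Return one list per stage, each sorted from easiest to hardest."""
    stages: Dict[str, List[Sample]] = {name: [] for name in STAGE_ORDER}
    for sample in samples:
        if sample.capability not in stages:
            raise ValueError(f"unknown capability: {sample.capability}")
        stages[sample.capability].append(sample)
    # Capability decides the stage; difficulty decides order within a stage.
    return [
        sorted(stages[name], key=lambda s: s.difficulty)
        for name in STAGE_ORDER
    ]
```

Because the two orderings act on different axes, either can be changed independently: the within-stage sort could be dropped or reversed without touching the stage sequence, which is the sense in which the capability-based curriculum is orthogonal to difficulty ordering.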