ThinkJEPA blends a dense video predictor with a vision–language “thinker” to forecast longer-range hand movements
This paper presents ThinkJEPA, a method that combines two ways of understanding video to predict future states for tasks like hand manipulation. The authors start from latent world models, which forecast future scenes in a compact representation rather than by generating full images. Such models (for example, V-JEPA2) handle fine-grained motion well but typically use a short window of densely sampled frames. That short view can push predictors toward local, low-level motion and away from the longer-term, semantic cues about what will happen next.
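The core idea of latent forecasting can be sketched in a few lines. The toy sketch below is illustrative only: the encoder, predictor, shapes, and names are placeholders I invented, not code from V-JEPA2 or ThinkJEPA. The key point it demonstrates is that the training signal is a distance in latent space, not pixel space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a latent world model (all shapes and names are
# illustrative assumptions, not taken from the paper).
D = 16          # latent dimension
T_ctx = 8       # densely sampled context frames

def encode(frame):
    """Stub encoder: map a 'frame' (here a raw vector) to a compact latent."""
    return np.tanh(frame)  # placeholder for a learned encoder

def predict(context_latents, W):
    """Stub predictor: forecast the next latent from pooled context latents."""
    return np.tanh(context_latents.mean(axis=0) @ W)

frames = rng.normal(size=(T_ctx + 1, D))   # context frames + one future frame
W = rng.normal(size=(D, D)) * 0.1          # toy predictor weights

ctx = np.stack([encode(f) for f in frames[:-1]])
target = encode(frames[-1])                # latent of the future frame
pred = predict(ctx, W)

# The loss compares predicted and target latents directly; no pixels
# are ever generated.
latent_loss = np.mean((pred - target) ** 2)
print(round(float(latent_loss), 4))
```

Because the comparison happens in a compact embedding, rollouts are cheap relative to generating full frames, which is the efficiency argument the paper leans on.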
To add longer-term knowledge, the team taps large vision–language models (VLMs). VLMs are trained on image–text pairs and can reason about objects, actions, and general world knowledge across more widely spaced frames. But VLMs alone make poor dense predictors: compute limits push them toward a few uniformly sampled frames, their internal features are shaped toward producing language (which can lose fine spatial detail), and adapting them to small, action-conditioned datasets can be difficult.
ThinkJEPA addresses these trade-offs with a dual-temporal design. One branch is a dense JEPA-style predictor that keeps the high-frame-rate cues needed for precise motion and contact. The other branch is a VLM "thinker" that reads uniformly sampled frames spanning a longer window, offering semantic guidance about what comes next. To transfer the VLM's reasoning into the dense predictor, the authors build a hierarchical pyramid representation extraction module. This module pulls features from several depths inside the VLM (not just the final, language-oriented layer), bundles them into guidance signals, and injects them into the JEPA predictor via layer-wise modulation, so the predictor can draw on both fine motion cues and higher-level knowledge.
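A minimal sketch of what layer-wise modulation could look like, assuming a FiLM-style scale-and-shift scheme (the paper says "layer-wise modulation" without specifying the exact form; the FiLM interpretation, the dimensions, and every name below are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes; the real module's dimensions are not given here.
D_vlm, D_pred, L = 32, 16, 3   # VLM feature dim, predictor dim, predictor depth

# A "pyramid" of features taken from several depths inside the VLM,
# not just its final language-oriented layer.
vlm_pyramid = [rng.normal(size=D_vlm) for _ in range(L)]

# One projection per predictor layer maps a VLM feature to (scale, shift).
proj = [rng.normal(size=(D_vlm, 2 * D_pred)) * 0.05 for _ in range(L)]

def modulate(h, feat, W):
    """FiLM-style modulation (assumed form): h -> (1 + scale) * h + shift."""
    gamma_beta = feat @ W
    scale, shift = gamma_beta[:D_pred], gamma_beta[D_pred:]
    return (1.0 + scale) * h + shift

h = rng.normal(size=D_pred)      # a hidden state inside the dense predictor
for layer in range(L):
    h = np.tanh(h)               # stand-in for the predictor layer itself
    h = modulate(h, vlm_pyramid[layer], proj[layer])

print(h.shape)
```

The design point is that each predictor layer receives guidance matched to a particular VLM depth, so shallow layers can absorb spatial cues while deeper layers absorb more abstract, semantic ones.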
In experiments on 3D hand-trajectory prediction, ThinkJEPA outperformed a strong VLM-only baseline (the paper cites Qwen3-VL, an open-source model with a "thinking" capability) and a JEPA predictor baseline. The authors report that their method yields more robust long-horizon rollouts, meaning the predicted sequences stay reasonable farther into the future. These improvements matter because better long-range forecasts in a compact latent space can make downstream tasks, such as planning and control in manipulation, more reliable and cheaper to compute than pixel-level generation.