MuRF: Combine low- and high-resolution views at inference to improve vision foundation models
This paper introduces MuRF, short for Multi-Resolution Fusion. The idea is simple. Instead of feeding a single resized image to a pre-trained Vision Foundation Model (VFM), MuRF runs the same image at several resolutions through the frozen model and fuses the resulting features. That gives a single representation that keeps both the global scene and the fine details.
To build this representation the authors make an image pyramid: several resized versions of the input. Each version is passed through a frozen VFM encoder (the paper focuses on DINOv2 but also tests a contrastive model called SigLIP2). Because the encoder is resolution-aware, the spatial feature map it returns grows with the input size. MuRF upsamples those maps to a common grid and concatenates them along the channel direction. The authors chose channel-wise concatenation over averaging to avoid blending scale-specific signals, so coarse and fine features remain distinct.
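The pipeline above (pyramid, frozen encoder, upsample, channel-wise concatenation) can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: `dummy_encoder` is a stand-in for a frozen VFM like DINOv2 (it just average-pools 14x14 patches into toy 3-channel "features"), and the nearest-neighbour resizing is a dependency-free placeholder for real interpolation.

```python
import numpy as np

def dummy_encoder(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Stand-in for a frozen VFM encoder: maps an (H, W, 3) image to an
    (H//patch, W//patch, C) feature map by average-pooling patches.
    A placeholder for illustration, not the real model."""
    H, W, _ = image.shape
    h, w = H // patch, W // patch
    patches = image[:h * patch, :w * patch].reshape(h, patch, w, patch, 3)
    return patches.mean(axis=(1, 3))  # toy (h, w, 3) "features"

def resize_image(image: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize to (size, size) -- crude but dependency-free."""
    H, W, _ = image.shape
    ys = np.arange(size) * H // size
    xs = np.arange(size) * W // size
    return image[ys][:, xs]

def upsample_features(fmap: np.ndarray, grid: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a feature map to a (grid, grid) layout."""
    h, w, _ = fmap.shape
    ys = np.arange(grid) * h // grid
    xs = np.arange(grid) * w // grid
    return fmap[ys][:, xs]

def murf_features(image: np.ndarray, resolutions=(224, 448, 896), patch=14):
    """MuRF sketch: encode an image pyramid with the same frozen encoder,
    upsample every feature map to the finest grid, then concatenate
    along the channel axis so each scale stays distinct."""
    maps = [dummy_encoder(resize_image(image, r), patch) for r in resolutions]
    grid = max(m.shape[0] for m in maps)  # finest spatial grid
    maps = [upsample_features(m, grid) for m in maps]
    return np.concatenate(maps, axis=-1)  # (grid, grid, C * num_scales)

rng = np.random.default_rng(0)
img = rng.random((896, 896, 3)).astype(np.float32)
fused = murf_features(img)
print(fused.shape)  # (64, 64, 9): 896/14 = 64 tokens per side, 3 scales x 3 channels
```

With a real ViT encoder the per-scale channel count C would be the model's embedding dimension, so the fused map has C times the number of scales channels; the key design point is that concatenation (rather than summing or averaging) preserves which scale each feature came from.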
The team applied MuRF without changing the pre-trained backbone, training only simple, lightweight heads on top of the fused features for each task. For dense prediction tasks like semantic segmentation and depth estimation, linear heads trained on MuRF features beat heads trained on single-scale features. As a visual encoder for Multimodal Large Language Models (MLLMs) in Visual Question Answering (VQA), MuRF gave the language model richer visual context that combined scene-level cues with small details. In unsupervised anomaly detection, MuRF helped detect both large structural defects and tiny surface flaws, reaching strong performance on the challenging MVTec AD 2 benchmark (TEST_priv,mix split).
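A "linear head" on fused features amounts to a single per-token linear map over the concatenated channels. The following numpy sketch shows the idea for segmentation-style prediction; the shapes, class count, and weight initialization here are illustrative assumptions, not details from the paper.

```python
import numpy as np

def linear_seg_head(fused: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy linear segmentation head: one matrix multiply per spatial
    location of the fused feature map.
    fused: (H, W_grid, C), W: (C, num_classes), b: (num_classes,)."""
    logits = fused @ W + b          # (H, W_grid, num_classes)
    return logits.argmax(axis=-1)   # per-token class map

rng = np.random.default_rng(1)
fused = rng.random((64, 64, 9))       # e.g. MuRF output: 3 scales x 3 toy channels
W = rng.standard_normal((9, 21))      # 21 classes, chosen arbitrarily for the demo
b = np.zeros(21)
pred = linear_seg_head(fused, W, b)
print(pred.shape)  # (64, 64)
```

Because the head sees all scales at once in the channel dimension, it can weight coarse and fine features independently per class, which is one plausible reason linear probes on fused features outperform single-scale ones.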
Why this matters: many modern VFMs are trained to accept variable image sizes, but inference often uses a single fixed scale. That single-scale choice can lose either global context or fine detail. MuRF is a training-free, architecture-agnostic way to recover both kinds of information at inference time. It avoids tiling approaches that break image continuity, and it does so without retraining the foundation model.