GreenRFM: a radiology foundation model that aims for high performance with far less compute
This paper introduces GreenRFM, a pre-training approach for medical imaging models that focuses on using supervision more wisely instead of just making models bigger. The authors report that GreenRFM matches or beats larger, resource-heavy radiology foundation models while using far less compute. They offer two practical setups: a performant version that they say can be trained on a single 24 GB GPU within 24 hours, and a lightweight version that they report can match benchmark performance when trained on a 6 GB laptop-class graphics card in about 4 hours.
The core idea is a shift from brute-force scaling to what the authors call principled supervision design. They package their design into four principles abbreviated MUST: More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning supervision. Practically, they use Large Language Models (LLMs) to turn free-text radiology reports into structured, higher-quality labels (a “silver-standard”), they inject explicit supervision into every component of the system, they pre-train vision and text parts separately to make sure each learns clear medical concepts, and they align the pre-training objectives with real clinical tasks.
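To make the "silver-standard" distillation step concrete, here is a minimal sketch of how an LLM's structured output for one report might be parsed into dense labels that keep uncertainty and missingness explicit. The finding names, the JSON schema, and the value encoding (1 present, 0 absent, "U" uncertain, None not mentioned) are all illustrative assumptions, not the paper's actual format.

```python
import json

# Hypothetical label schema: each finding maps to 1 (present), 0 (absent),
# "U" (uncertain), or None (not mentioned in the report).
FINDINGS = ["pneumothorax", "effusion", "consolidation"]

def distill_report(llm_output: str) -> dict:
    """Parse a (hypothetical) LLM JSON response into a dense label dict,
    keeping uncertainty and missingness explicit instead of dropping them."""
    raw = json.loads(llm_output)
    labels = {}
    for finding in FINDINGS:
        value = raw.get(finding)      # absent key -> None (not mentioned)
        if value in (1, 0, "U"):
            labels[finding] = value
        else:
            labels[finding] = None    # unparseable value -> treat as missing
    return labels

# Example LLM response for one report (values are illustrative).
response = '{"pneumothorax": 0, "effusion": "U"}'
print(distill_report(response))
# -> {'pneumothorax': 0, 'effusion': 'U', 'consolidation': None}
```

Every image then gets a label for every finding, which is what makes the supervision "ubiquitous": uncertain and unmentioned findings become explicit signals rather than discarded examples.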
At a high level the method works in two stages. First, the team distills noisy reports into structured diagnostic labels using LLMs while modeling uncertainty and missing data. That creates dense and semantically meaningful supervision without expensive manual curation. Second, they pre-train the image and text encoders independently on these labels and only then perform cross-modal alignment so the image and language representations line up. The authors demonstrate this using a streamlined 3D ResNet-18 backbone (a compact form of neural network suited for volumetric scans) and report that this design can reduce compute needs by orders of magnitude compared with very large transformer models.
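The final cross-modal alignment step can be sketched as a standard contrastive objective: after the image and text encoders are pre-trained independently, embeddings of matched scan/report pairs are pulled together and mismatched pairs pushed apart. The sketch below uses a generic InfoNCE-style loss on toy vectors; the embedding values, temperature, and loss choice are assumptions for illustration, not the paper's exact objective.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (guarding against the zero vector)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

def contrastive_loss(image_emb, text_embs, match_idx, temperature=0.1):
    """InfoNCE-style loss: the image should score its own report highest
    among a batch of candidate report embeddings."""
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[match_idx] / sum(exps))

# Toy embeddings standing in for outputs of the already pre-trained encoders.
image_emb = [0.9, 0.1, 0.2]                      # from the 3D image encoder
reports = [[0.8, 0.2, 0.1], [0.0, 1.0, 0.0]]     # matched report, mismatched report
loss = contrastive_loss(image_emb, reports, match_idx=0)
print(f"{loss:.4f}")  # small loss: the matched pair is already most similar
```

Because both encoders arrive at this stage already grounded in the distilled diagnostic labels, alignment only has to reconcile two semantically meaningful spaces, which is part of how the authors argue the compute budget stays small.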