
BenchX: 85,355 CT scans show tumor‑detection AI can fail for underrepresented groups and imaging protocols

June 24, 2026 | arXiv: 2606.24883v1

This paper introduces BenchX, a large, open benchmark built to test how well artificial intelligence (AI) models detect and locate tumors in medical images. The team assembled 85,355 computed tomography (CT) scans and ran 12 tumor‑detection models against them. Their main finding: models that score well on average can perform much worse on rare or underrepresented patient groups and under certain scanning conditions, for example when tumors are very small or when images come from different contrast phases.

To make this comparison possible, the researchers organized scans by key factors that can change model performance: tumor size and location, patient subgroups (age, sex, race), and imaging protocol (how the scan was taken). They used large language models (LLMs) to read and extract subgroup and protocol information from clinical records, which helped scale the work and keep the analysis reproducible. With those labels in place, they ran the 12 models and measured how detection and localization accuracy varied across the different groups.

At a high level, BenchX looks for patterns of inconsistency. Rather than reporting only an overall accuracy number, it breaks performance into slices: small versus large tumors, different body locations, and subgroups such as younger patients or female African American patients. The benchmark shows that state‑of‑the‑art models tuned for average performance can miss more tumors or localize them less accurately in rare or underrepresented subgroups.
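
To illustrate what slicing means in practice, here is a minimal Python sketch of per-slice detection sensitivity. It is not the BenchX code: the record fields (`size`, `phase`, `detected`), the toy values, and the helper `sensitivity_by_slice` are all hypothetical, chosen only to show how an acceptable overall number can hide a weak slice.

```python
from collections import defaultdict

# Hypothetical per-tumor records: slice labels plus whether a model found the tumor.
# Field names and values are illustrative assumptions, not BenchX data.
records = [
    {"size": "small", "phase": "arterial", "detected": False},
    {"size": "small", "phase": "venous", "detected": True},
    {"size": "small", "phase": "arterial", "detected": False},
    {"size": "large", "phase": "venous", "detected": True},
    {"size": "large", "phase": "venous", "detected": True},
    {"size": "large", "phase": "arterial", "detected": True},
    {"size": "large", "phase": "venous", "detected": True},
    {"size": "large", "phase": "venous", "detected": True},
]


def sensitivity_by_slice(records, key):
    """Return {slice value: fraction of tumors detected} for one stratification key."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["detected"])
    return {value: hits[value] / totals[value] for value in totals}


# Overall sensitivity: detected tumors divided by all tumors.
overall = sum(r["detected"] for r in records) / len(records)

# The same metric, broken out by tumor size and by contrast phase.
by_size = sensitivity_by_slice(records, "size")
by_phase = sensitivity_by_slice(records, "phase")

# The slice with the lowest sensitivity, and how far it falls below the average.
worst_slice = min(by_size, key=by_size.get)
gap = overall - by_size[worst_slice]

print(f"overall: {overall:.2f}")
print(f"by size: {by_size}")
print(f"by phase: {by_phase}")
print(f"worst size slice: {worst_slice} (gap {gap:.2f})")
```

On this toy data the overall sensitivity is 0.75, yet the small-tumor slice detects only one tumor in three. That is the kind of gap a single aggregate score conceals. A real evaluation would also need a matching rule for localization (for example an overlap threshold) and enough cases per slice for the estimates to be stable.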

Why this matters: AI tools are moving into clinical settings, and uneven performance can have real effects on patient care. If a model routinely works well for the most common cases but fails more often for certain demographic groups or scan types, those patients may get worse or delayed diagnoses. BenchX gives researchers and clinicians a concrete resource to measure these problems and to test model improvements across the kinds of variation that appear in real practice.