Foundation models match overall accuracy for traumatic bowel injury but give many false alarms when other organ injuries are present
This paper examines how large, pre-trained medical "foundation" models perform when asked to detect traumatic bowel injury on CT scans. Foundation models are trained on many kinds of medical images and text, so they can sometimes be applied without task-specific training. The authors found that these models identified injured cases about as well as models trained specifically for the task, but they produced many more false alarms when other injuries were present.
The team used a multi-institutional CT dataset collected by the RSNA from 23 centers (2019–2023). Five approaches were trained and evaluated on 3,147 patients, among whom bowel injury was rare (2.3% prevalence). Two foundation models were tested: MedCLIP in zero-shot mode (no task-specific training) and RadDINO as a frozen feature extractor feeding a simple classifier. These were compared with three task-specific systems (a CNN, a transformer, and an ensemble). To isolate why models failed, the authors compared specificity (the ability to correctly say "no injury") in two groups that both had zero bowel injuries: patients with other solid organ injuries (liver, spleen, kidney; n=58) and patients with no abdominal pathology (n=50).
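The two metrics at the heart of this comparison can be stated in a few lines. A minimal sketch of the standard definitions (the counts below are hypothetical for illustration, not figures from the study):

```python
def sensitivity(tp: int, fn: int) -> float:
    # Sensitivity (true positive rate): fraction of real injuries flagged.
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # Specificity (true negative rate): fraction of injury-free scans
    # correctly cleared; every false positive here is a "false alarm".
    return tn / (tn + fp)

# Hypothetical injury-free subgroup: 29 correctly cleared, 29 falsely flagged.
print(specificity(tn=29, fp=29))  # → 0.5

# Hypothetical injured group: 9 caught, 1 missed.
print(sensitivity(tp=9, fn=1))    # → 0.9
```

Because the solid-organ-injury group contains no bowel injuries at all, any positive call there is a false positive, which is why the paper can read specificity in that subgroup directly as a false-alarm measure.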
Overall discrimination, measured by area under the curve (AUC), was similar: 0.64–0.68 for foundation models versus 0.58–0.64 for task-specific models. Foundation models had higher sensitivity (they caught more true positives): 79–91% versus 41–74% for the task-specific approaches. But foundation models had lower overall specificity (more false alarms): 33–50% versus 50–88% for task-specific systems. All models kept high specificity (84–100%) in the no-pathology group. The failure mode emerged when solid organ injuries were present: specificity fell by about 50–51 percentage points for foundation models, while task-specific models showed smaller drops of 12–41 points.
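The pattern of higher sensitivity paired with lower specificity often reflects where a model's decision threshold sits on its output scores. A minimal sketch with synthetic scores (not the paper's data) shows the trade-off:

```python
# Synthetic model scores, purely illustrative:
injured   = [0.9, 0.8, 0.6, 0.4]        # scores for true bowel-injury cases
uninjured = [0.7, 0.5, 0.3, 0.2, 0.1]   # scores for injury-free cases

def sens_spec(threshold: float) -> tuple[float, float]:
    # Classify a scan as "injury" when its score meets the threshold.
    tp = sum(s >= threshold for s in injured)
    tn = sum(s < threshold for s in uninjured)
    return tp / len(injured), tn / len(uninjured)

# A lenient threshold catches every injury but flags more healthy scans;
# a strict threshold does the opposite.
print(sens_spec(0.35))  # → (1.0, 0.6)
print(sens_spec(0.65))  # → (0.5, 0.8)
```

This is only a sketch of the general trade-off; the paper's headline finding is subtler, since the foundation models' false alarms were concentrated specifically in scans with other organ injuries rather than spread evenly over injury-free cases.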