In radiology tests, clinical LLMs become safer mainly when given clean clinician-written evidence — accuracy alone is not enough
This paper shows that, for clinical large language models (LLMs), safety and accuracy do not always improve together. The authors introduce SaFE-Scale, a framework for measuring how safety changes when models are scaled or deployed differently. They also build RadSaFE-200, a radiology benchmark of 200 multiple-choice questions in which each question has clinician-written “clean” evidence, conflicting evidence, and option-level labels for high-risk errors, unsafe answers, and evidence contradiction.
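To make the benchmark’s structure concrete, here is a minimal sketch of how a single RadSaFE-200-style item could be represented; the field names and types are illustrative assumptions, not the authors’ released schema.

```python
from dataclasses import dataclass, field

# Illustrative item schema for a RadSaFE-200-style question.
# Field names are assumptions for this sketch, not the released format.
@dataclass
class RadSafeItem:
    question: str                    # radiology multiple-choice stem
    options: list[str]               # answer options
    correct_index: int               # index of the correct option
    clean_evidence: str              # clinician-written "clean" evidence
    conflict_evidence: str           # conflicting evidence for the same question
    high_risk: set[int] = field(default_factory=set)             # options labeled high-risk errors
    unsafe: set[int] = field(default_factory=set)                # options labeled unsafe answers
    contradicts_evidence: set[int] = field(default_factory=set)  # options that contradict the evidence
```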
The team tested 34 locally run LLMs from several families (for example, Qwen, Llama, Gemma, MedGemma, DeepSeek, Mistral, and OpenAI-OSS) under six deployment conditions: closed-book prompting (no evidence), clean evidence, conflicting evidence, standard retrieval-augmented generation (RAG), agentic RAG (an LLM-driven multi-step retrieval and reasoning pipeline), and max-context prompting (placing all retrieved passages into one long context). For each run, they recorded the chosen answer, the model’s confidence, and the latency, and mapped the output to the predefined safety labels. The main outcomes were the rates of high-risk errors, unsafe answers, evidence contradiction, and dangerous overconfidence; accuracy was reported as a secondary endpoint.
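As a rough illustration of how per-run records like these can be turned into the paper’s main outcome rates, the sketch below aggregates accuracy and the four safety rates over a set of runs. The record fields and the 0.9 confidence threshold for “dangerous overconfidence” are assumptions made for this example, not the paper’s exact definitions.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    # One (model, condition, question) evaluation; fields are illustrative assumptions.
    chosen_index: int
    correct_index: int
    confidence: float          # model-reported confidence in the chosen option
    latency_s: float
    high_risk: bool            # chosen option carries a high-risk-error label
    unsafe: bool               # chosen option carries an unsafe-answer label
    contradicts_evidence: bool # chosen option contradicts the supplied evidence

def outcome_rates(runs: list[RunRecord], overconfidence_threshold: float = 0.9) -> dict[str, float]:
    """Aggregate the safety outcomes (primary) and accuracy (secondary) over a set of runs."""
    n = len(runs)
    wrong = [r for r in runs if r.chosen_index != r.correct_index]
    return {
        "high_risk_error": sum(r.high_risk for r in wrong) / n,
        "unsafe_answer": sum(r.unsafe for r in wrong) / n,
        "evidence_contradiction": sum(r.contradicts_evidence for r in runs) / n,
        # Here "dangerous overconfidence" means a wrong answer given with high confidence;
        # the paper's exact definition and threshold may differ.
        "dangerous_overconfidence": sum(r.confidence >= overconfidence_threshold for r in wrong) / n,
        "accuracy": (n - len(wrong)) / n,
    }
```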
The clearest result was that clinician-written clean evidence produced the largest and most consistent safety gains. Averaged across the 34 models, accuracy rose from 73.5% closed-book to 94.1% with clean evidence, while high-risk errors fell from 12.0% to 2.6%, evidence contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. By contrast, standard RAG and agentic RAG did not match that safety profile: agentic RAG raised accuracy slightly over standard RAG (from about 76.0% to 78.1%) and reduced contradiction (from 11.7% to 9.0%), but high-risk errors and dangerous overconfidence stayed elevated. Longer contexts increased response latency without closing the safety gap, and extra inference-time compute (for example, self-consistency sampling or small ensembles) yielded only limited additional benefit and sometimes preserved synchronized failures across models. The authors also found that clinically important errors tended to concentrate in a small subset of questions.
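For readers unfamiliar with the inference-time compute mentioned above, the sketch below shows self-consistency in its simplest form: sample several answers and take a majority vote. If most samples (or most models) share the same mistake, the vote simply reproduces it, which is one way to read the “synchronized failures” the authors describe. The sample_answer function is a hypothetical stand-in for any stochastic LLM call, not part of the paper’s code.

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(sample_answer: Callable[[str], int], question: str, k: int = 5) -> int:
    """Sample k answers and return the majority-voted option index.

    `sample_answer` is a hypothetical stand-in for a stochastic LLM call that
    returns an option index. When most samples share the same error, the vote
    returns that error with apparent confidence, so the extra compute does not
    repair the underlying failure.
    """
    votes = Counter(sample_answer(question) for _ in range(k))
    return votes.most_common(1)[0][0]
```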