cs.CY | English | Published

Calibrated Bayesian uncertainty flags underserved patients in a simulated clinical AI study

June 9, 2026 | arXiv: 2606.09789v1

This paper describes a single, integrated approach to two problems in clinical artificial intelligence (AI): how to make predictions that carry a principled measure of confidence, and how to use that confidence to reveal groups the model serves less well. The authors build an end-to-end Bayesian deep learning system that produces risk scores together with separate estimates of aleatoric and epistemic uncertainty. They test the idea on a simulated patient cohort and report that higher “epistemic” uncertainty (the model’s lack of knowledge) consistently marks understudied groups such as rural patients and those with low socioeconomic status.

At a high level, the system maps three kinds of patient data into a shared 16-dimensional probabilistic representation: electronic health record features (32 numbers), imaging features (128 numbers), and short clinical text embeddings (64 numbers). Each data type has its own variational encoder that outputs both a mean and a variance. The per-modality variances are converted into precisions (inverse variances) and used to form a precision-weighted fusion, which lets the model rely more on confident data sources and less on noisy or missing ones. The model separates aleatoric uncertainty (noise that can’t be reduced) from epistemic uncertainty (lack of knowledge that could be reduced with more data). Epistemic uncertainty is estimated with repeated stochastic forward passes (Monte Carlo dropout), while aleatoric uncertainty is given by a dedicated output head.
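The fusion and uncertainty split described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes the fusion takes the common product-of-Gaussians form (fused precision is the sum of per-modality precisions, fused mean is the precision-weighted average), and that epistemic uncertainty is summarised as the variance of predicted probabilities across stochastic passes. Neither formula is spelled out in the summary, and the function names `precision_weighted_fusion` and `decompose_uncertainty` are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple


def precision_weighted_fusion(
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
    eps: float = 1e-8,
) -> Tuple[List[float], List[float]]:
    """Fuse per-modality diagonal Gaussians into one latent Gaussian.

    Assumption: product-of-Gaussians style fusion. Each argument holds one
    vector per modality, all of the same latent dimension.
    """
    dim = len(means[0])
    fused_mean: List[float] = []
    fused_var: List[float] = []
    for d in range(dim):
        # Precision = inverse variance; eps guards against division by zero.
        precisions = [1.0 / (v[d] + eps) for v in variances]
        total = sum(precisions)
        weighted = sum(p * m[d] for p, m in zip(precisions, means))
        fused_mean.append(weighted / total)
        fused_var.append(1.0 / total)
    return fused_mean, fused_var


def decompose_uncertainty(
    mc_probs: Sequence[float],
    mc_aleatoric: Sequence[float],
) -> Tuple[float, float]:
    """Split uncertainty for one patient given T stochastic forward passes.

    Assumption: epistemic = variance of the predicted probability across
    passes; aleatoric = mean of the dedicated head's output across passes.
    """
    epistemic = pvariance(mc_probs)
    aleatoric = sum(mc_aleatoric) / len(mc_aleatoric)
    return epistemic, aleatoric


# Toy 2-dimensional latent (the paper uses 16 dimensions).
ehr_mean, img_mean, txt_mean = [1.0, 0.0], [3.0, 0.0], [2.0, 0.0]
ehr_var, img_var, txt_var = [0.5, 1.0], [1.0, 1.0], [1.0, 1.0]

mean, var = precision_weighted_fusion(
    [ehr_mean, img_mean, txt_mean], [ehr_var, img_var, txt_var]
)
print("fused mean:", mean)
print("fused variance:", var)
```

In the toy example, the first latent dimension of the EHR encoder has half the variance of the other two modalities, so it receives twice their weight and the fused mean is pulled toward its value. A missing modality could be handled under the same assumption by assigning it a very large variance, which drives its weight toward zero.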

The training objective combines three terms: a standard prediction loss (binary cross-entropy), a Kullback–Leibler regularisation term for the latent distributions, and an uncertainty-calibration penalty that encourages the model’s stated uncertainty to match its errors. The authors report the loss weights they used (KL weight = 0.001, uncertainty penalty weight = 0.1) and standard optimisation details (Adam, learning rate 1×10^-3, 50 epochs, batch size 32, dropout p = 0.3). Calibration was measured with Expected Calibration Error (ECE = 0.096) on a held-out set of 300 patients, and overall held-out accuracy was 85.7%.
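As a rough illustration of how the three loss terms and the calibration metric fit together, the sketch below combines them in plain Python. It rests on several assumptions not stated in the summary: the KL term is taken against a standard normal prior, the uncertainty-calibration penalty is treated as an already-computed number because its exact form is not given, and ECE uses 10 equal-width confidence bins. All function names are hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(
    probs: Sequence[float], labels: Sequence[int], eps: float = 1e-7
) -> float:
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def kl_to_standard_normal(mean: Sequence[float], var: Sequence[float]) -> float:
    """KL( N(mean, diag(var)) || N(0, I) ); the prior is an assumption."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mean, var))


def total_loss(
    bce: float,
    kl: float,
    unc_penalty: float,
    kl_weight: float = 0.001,  # KL weight reported in the text
    unc_weight: float = 0.1,   # uncertainty penalty weight reported in the text
) -> float:
    """Weighted sum of the three terms; unc_penalty is supplied by the caller."""
    return bce + kl_weight * kl + unc_weight * unc_penalty


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE for a binary classifier with equal-width confidence bins."""
    conf_sum = [0.0] * n_bins
    correct_sum = [0.0] * n_bins
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        correct_sum[idx] += 1.0 if pred == y else 0.0
    # (n_b / N) * |accuracy_b - confidence_b| simplifies to |correct - conf| / N.
    n = len(probs)
    return sum(abs(c - s) for c, s in zip(correct_sum, conf_sum)) / n


probs = [0.95, 0.95, 0.05, 0.65]
labels = [1, 0, 0, 1]
print("BCE:", binary_cross_entropy(probs, labels))
print("ECE:", expected_calibration_error(probs, labels))
```

An ECE of 0 would mean that, within every bin, average confidence equals observed accuracy. The reported value of 0.096 corresponds to an average gap of roughly ten percentage points between the two, weighted by bin size.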