
Diffusion model can create realistic heart‑sound clips but misses some abnormal cues

June 2, 2026 · arXiv: 2606.02448v1

This paper explores whether a recent class of generative models, called diffusion models, can make realistic heart‑sound recordings. The recordings here are phonocardiograms (PCGs) — audio traces of heartbeats used in clinical listening. The authors trained a conditional diffusion model to produce short, four‑second PCG clips and then judged the output with signal checks, a machine classifier, and a small expert listening test.

The team used the PhysioNet / Computing in Cardiology Challenge 2016 collection of 3,240 recordings. After preprocessing, which included band‑pass filtering (20–500 Hz) and quality control, they split the data into 16,749 non‑overlapping four‑second clips (12,827 labeled normal and 3,922 labeled abnormal). Each clip was converted to a log‑mel spectrogram, represented as a 1×128×128 time–frequency image, and a compact 2D U‑Net denoiser was trained on these images as a class‑conditional diffusion model. The model used classifier‑free guidance to steer generation toward the normal or abnormal label.
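To make two of these steps concrete, here is a minimal sketch of cutting a recording into non‑overlapping four‑second clips and of the standard classifier‑free guidance rule. It is an illustration, not the authors' code. The function names, the 2,000 Hz sample rate, the choice to drop a short final tail, and the guidance scale are all assumptions; the summary does not state them.

```python
from typing import List, Sequence


def split_into_clips(signal: Sequence[float], sample_rate: int,
                     clip_seconds: float = 4.0) -> List[List[float]]:
    """Cut a recording into non-overlapping fixed-length clips.

    Assumption: a final tail shorter than one clip is dropped.
    """
    clip_len = int(round(sample_rate * clip_seconds))
    n_clips = len(signal) // clip_len
    return [list(signal[i * clip_len:(i + 1) * clip_len]) for i in range(n_clips)]


def guided_noise(eps_cond: Sequence[float], eps_uncond: Sequence[float],
                 guidance_scale: float) -> List[float]:
    """Classifier-free guidance on two noise predictions.

    Starts from the unconditional prediction and moves toward the
    class-conditional one. A scale of 0 gives the unconditional
    prediction, 1 gives the conditional one, and values above 1
    push further toward the requested class.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Hypothetical example: a 9-second recording at an assumed 2,000 Hz.
sample_rate = 2000
recording = [0.0] * (sample_rate * 9)
clips = split_into_clips(recording, sample_rate)
print(len(clips), len(clips[0]))  # 2 8000 (the last 1 s is dropped)

# Toy noise predictions standing in for the U-Net's two outputs.
eps = guided_noise([1.0, 2.0], [0.0, 1.0], guidance_scale=3.0)
print(eps)  # [3.0, 4.0]
```

In a real sampler the two predictions would come from the same denoiser, run once with the class label and once with the label masked out, and the combined prediction would be used at every denoising step.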

To measure whether generated sounds looked physiologically plausible, the authors designed three lightweight, physiology‑inspired metrics. The envelope‑autocorrelation rhythm score estimates how regular the heartbeat rhythm appears. The amplitude‑based “explosion” score flags sudden, implausible bursts. The dominant cycle lag measures the typical heartbeat cycle duration. These metrics are proxies for plausibility, not direct clinical measurements.
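The sketch below shows one way metrics of this kind could be computed. It is a rough illustration under stated assumptions, not the paper's definitions. The frame‑averaged envelope, the 0.3 to 1.5 second search range for the cycle lag, and the "fraction of samples above 6× RMS" burst rule are all hypothetical choices. The code uses only the Python standard library for portability; NumPy or SciPy would be the usual choice in practice.

```python
import math
from typing import List, Sequence, Tuple


def envelope(signal: Sequence[float], frame_len: int) -> List[float]:
    """Crude amplitude envelope: mean absolute value per non-overlapping frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(abs(v) for v in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def rhythm_and_cycle(signal: Sequence[float], sample_rate: int, frame_len: int = 10,
                     min_cycle_s: float = 0.3, max_cycle_s: float = 1.5
                     ) -> Tuple[float, float]:
    """Return (rhythm score, dominant cycle lag in seconds).

    The rhythm score is the largest normalized autocorrelation of the
    mean-removed envelope within the assumed range of cycle lengths;
    the dominant cycle lag is the lag at which that maximum occurs.
    """
    env = envelope(signal, frame_len)
    if not env:
        return 0.0, 0.0
    mean = sum(env) / len(env)
    centered = [v - mean for v in env]
    energy = sum(v * v for v in centered)
    if energy == 0.0:
        return 0.0, 0.0  # flat envelope: no measurable rhythm
    frames_per_second = sample_rate / frame_len
    min_lag = max(1, int(min_cycle_s * frames_per_second))
    max_lag = min(len(env) - 1, int(max_cycle_s * frames_per_second))
    scores = {
        lag: sum(centered[i] * centered[i + lag] for i in range(len(env) - lag)) / energy
        for lag in range(min_lag, max_lag + 1)
    }
    if not scores:
        return 0.0, 0.0
    best_lag = max(scores, key=scores.get)
    return scores[best_lag], best_lag / frames_per_second


def explosion_score(signal: Sequence[float], rms_factor: float = 6.0) -> float:
    """Fraction of samples whose magnitude exceeds rms_factor times the clip RMS."""
    if not signal:
        return 0.0
    rms = math.sqrt(sum(v * v for v in signal) / len(signal))
    if rms == 0.0:
        return 0.0
    return sum(1 for v in signal if abs(v) > rms_factor * rms) / len(signal)


# Synthetic 4 s test clip: a 50 ms, 50 Hz burst every 0.8 s (not real PCG data).
sample_rate = 1000
clip = [0.0] * (4 * sample_rate)
for start in range(100, len(clip), 800):
    for i in range(50):
        clip[start + i] = math.sin(2 * math.pi * 50 * i / sample_rate)

score, cycle = rhythm_and_cycle(clip, sample_rate)
print(round(cycle, 2))  # 0.8, the spacing between the synthetic beats

spiky = list(clip)
spiky[2000] = 20.0  # inject one implausible spike
print(explosion_score(clip), explosion_score(spiky))  # 0.0 0.00025
```

Comparing the distributions of these three numbers between real and generated clips is the kind of check the authors describe. Any thresholds would need tuning on real recordings.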

On these signal checks, synthetic clips had dominant cycle durations similar to those of real clips but showed reduced envelope periodicity and more transient burstiness. In other words, the synthesized beats had roughly the right spacing, but their rhythm was less steady and they contained more short, spiky artifacts than real recordings.