
PHAST-Net: a neural method that produces cleaner, high-resolution time–frequency views of sound

June 23, 2026 · arXiv: 2606.23665v1

Researchers introduce PHAST-Net, a neural network that turns a set of wavelet-based measurements into clean time–frequency pictures of sound. These pictures include familiar spectrograms (how frequency content changes over time) and less familiar tempo- and meter-based views such as tempograms and metrograms. The method is designed to give high resolution while suppressing common cross-term artifacts that make interpretation difficult.

The system starts from a specially chosen collection of wavelet transforms called CLAWT (Continuous Log-frequency Adaptive Wavelet Transform). The authors select this collection using an analysis based on Cohen’s class (a family of bilinear time–frequency distributions, each defined by its kernel function), so that the transforms cover different curvatures in the log-frequency time–frequency plane. A neural network with attention layers learns to combine this constellation of inputs and produce the target Ideal Time–Frequency Representations (ITFRs). The attention helps the network suppress cross-terms, the spurious interference patterns that appear when multiple signal components interact.
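To make the idea of attention over a constellation of transforms concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the function `attention_fuse`, the `(K, F, T)` input layout, and the score matrix `w_score` are all assumptions chosen for illustration, and the random array stands in for real CLAWT magnitudes.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(stack, w_score):
    """Fuse K transforms of shape (K, F, T) into one (F, T) map.

    stack   : non-negative magnitudes of K transforms (hypothetical input)
    w_score : (K, K) matrix standing in for learned attention parameters
    """
    # At every time-frequency bin, map the K magnitudes to K logits.
    logits = np.einsum("jk,kft->jft", w_score, stack)
    # Attention weights sum to 1 over the K transforms at every bin.
    weights = softmax(logits, axis=0)
    # The output is a per-bin convex combination of the inputs.
    fused = (weights * stack).sum(axis=0)
    return fused, weights

rng = np.random.default_rng(0)
K, F, T = 4, 32, 50
stack = np.abs(rng.normal(size=(K, F, T)))   # placeholder for real transforms

# Toy parameters (an assumption, not a trained model): negative diagonal
# scores make the weights favour whichever transform is smallest at a bin.
w_score = -5.0 * np.eye(K)
fused, weights = attention_fuse(stack, w_score)
```

With this toy choice of `w_score` the fusion behaves like a soft minimum over the inputs, so energy that shows up in only some of the transforms is attenuated. A trained network would learn its own, far richer weighting.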

PHAST-Net is made “physics-informed” by an auxiliary reprojection loss during training. That loss penalises the network whenever its predictions, combined with the same Cohen-class kernels, fail to re-create the original CLAWT measurements. In plain terms, the network is trained not just to match a desired output image but also to stay consistent with the forward transforms that produced the inputs. According to the authors, this helps keep energy consistent, avoids pathological sparsity in targets, and stabilises learning.
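The structure of such a two-term objective can be sketched in a few lines of NumPy. Everything specific here is an assumption made for illustration: the forward model is taken to be a circular 2-D convolution of the predicted representation with a normalised Gaussian (a stand-in for a Cohen-class kernel), and the names `gaussian_kernel`, `smooth`, `total_loss`, and the weight `lam` are hypothetical rather than taken from the paper.

```python
import numpy as np

def gaussian_kernel(shape, sigma_f, sigma_t):
    # Hypothetical smoothing kernel, centred at index (0, 0) with wrap-around
    # so that it can be used directly in an FFT-based circular convolution.
    F, T = shape
    f = np.fft.fftfreq(F) * F          # integer offsets 0, 1, ..., -1
    t = np.fft.fftfreq(T) * T
    g = np.exp(-0.5 * (f[:, None] / sigma_f) ** 2
               - 0.5 * (t[None, :] / sigma_t) ** 2)
    return g / g.sum()                 # unit sum preserves total energy

def smooth(tfr, kernel):
    # Assumed forward model: circular 2-D convolution computed via the FFT.
    return np.real(np.fft.ifft2(np.fft.fft2(tfr) * np.fft.fft2(kernel)))

def total_loss(pred, target, measurements, kernels, lam=0.1):
    # Main term: match the target representation.
    fit = np.mean((pred - target) ** 2)
    # Auxiliary term: smoothing the prediction with each kernel should
    # re-create the corresponding input measurement.
    reproj = np.mean([np.mean((smooth(pred, k) - m) ** 2)
                      for k, m in zip(kernels, measurements)])
    return fit + lam * reproj, fit, reproj

F, T = 32, 64
ideal = np.zeros((F, T))
ideal[10, :] = 1.0                                   # a stationary component
ideal[np.arange(5, 25), np.arange(20, 40)] = 1.0     # a short rising ridge

kernels = [gaussian_kernel((F, T), sf, st)
           for sf, st in [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]]
measurements = [smooth(ideal, k) for k in kernels]   # synthetic "inputs"

loss_ideal, fit_ideal, reproj_ideal = total_loss(ideal, ideal, measurements, kernels)
loss_zero, fit_zero, reproj_zero = total_loss(np.zeros_like(ideal), ideal,
                                              measurements, kernels)
```

In this toy setup the all-zero prediction, an extreme case of over-sparse output, is penalised by the reprojection term because it cannot reproduce the measured energy, while the ideal representation drives both terms to zero.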

The approach also includes specialised variants. Harmonic PHAST-Net focuses on isolating the fundamental harmonic structure of signals, producing representations that emphasise the fundamental frequency in speech and music. Spline-PHAST-Net goes further by turning detected time–frequency ridges into continuous spline curves. Those spline trajectories can be re-rendered on arbitrary time–frequency grids and support signal reconstruction from the extracted features.
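The ridge-to-spline idea can be illustrated with a small NumPy sketch. This is not Spline-PHAST-Net itself: ridge detection is reduced to a per-frame argmax, the curve is a cubic Hermite spline with finite-difference tangents (Catmull-Rom style) rather than whatever spline family the authors use, and the helpers `hermite_spline` and `render` are hypothetical names. Signal reconstruction is not covered.

```python
import numpy as np

def hermite_spline(knots_t, knots_y, t_new):
    """Evaluate a C1 cubic Hermite spline through uniformly spaced knots."""
    h = knots_t[1] - knots_t[0]
    # Tangents by finite differences (central inside, one-sided at the ends).
    m = np.gradient(knots_y, h)
    # Locate the segment that contains each query point.
    i = np.clip(((t_new - knots_t[0]) // h).astype(int), 0, len(knots_t) - 2)
    s = (t_new - knots_t[i]) / h
    # Standard cubic Hermite basis functions.
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return (h00 * knots_y[i] + h10 * h * m[i]
            + h01 * knots_y[i + 1] + h11 * h * m[i + 1])

def render(curve, n_bins, width=1.0):
    # Paint a trajectory (in bin units) as a Gaussian ridge on a new grid.
    rows = np.arange(n_bins)[:, None]
    return np.exp(-0.5 * ((rows - curve[None, :]) / width) ** 2)

# Synthetic time-frequency image with one slowly varying ridge.
F, T = 64, 100
frames = np.arange(T)
true_ridge = 20 + 15 * np.sin(2 * np.pi * frames / T)
tfr = render(true_ridge, F, width=1.5)

# Step 1: crude ridge detection, one frequency bin per frame.
ridge = tfr.argmax(axis=0).astype(float)

# Step 2: keep every 9th frame as a spline knot (frames 0, 9, ..., 99).
knots_t = frames[::9].astype(float)
knots_y = ridge[::9]

# Step 3: evaluate the continuous curve on a 4x finer time grid and
# re-render it on a grid with twice as many frequency bins.
fine_t = np.linspace(0, 99, 397)
curve = hermite_spline(knots_t, knots_y, fine_t)
image = render(2.0 * curve, n_bins=128)
```

Because the trajectory is stored as a handful of knots rather than as pixels, the same curve can be evaluated at any time resolution and painted onto any frequency grid, which is the property the paragraph above describes.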