Repurposing a speech classifier to steer diffusion-based speech generation
This paper explores a simpler way to steer a modern speech generator. The authors start from a conventionally trained noise-conditioned speech classifier and attach a small generative subnetwork to it. By keeping the classifier frozen and training only the new subnetwork, they aim to do conditional speech generation with one backbone instead of two separate models. They report that this approach gives high-quality audio while cutting memory use and computation compared with the usual two-model pipeline.
The core idea builds on diffusion models. In diffusion generation, a model learns to undo noise that was gradually added to real data. That reversal requires the “score” function, the gradient of the log-density of the noised data with respect to the input, which most systems approximate with a U-Net style network. A common way to steer generation toward a desired label is classifier guidance, which adds a classifier’s input gradients to the score so that samples are pushed toward a target class. Classifier guidance therefore usually needs two networks: a trained score model and a separate noise-conditioned classifier. The authors instead freeze a pretrained noise-conditioned classifier and attach a lightweight decoder-style adapter, called ScoreSubnet, that learns to predict the score by reusing the classifier’s intermediate feature maps and certain backpropagated gradient signals.
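To make the classifier-guidance step concrete, here is a minimal NumPy sketch. It is a toy under stated assumptions and not the authors' implementation: the classifier is a hypothetical linear softmax model (W, b), the unconditional score is that of a standard normal, and the names guided_score and scale are illustrative. It shows only the general rule that the guided score is the unconditional score plus a scaled gradient of log p(y | x).

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D vector of logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def classifier_log_prob_grad(x, W, b, y):
    """Gradient of log p(y | x) w.r.t. x for a linear softmax classifier.

    Assumption: logits = W @ x + b. A real system would obtain this
    gradient by backpropagating through a deep noise-conditioned classifier.
    """
    probs = np.exp(log_softmax(W @ x + b))
    return W[y] - probs @ W

def guided_score(x, unconditional_score, W, b, y, scale=1.0):
    """Classifier guidance: score(x) + scale * grad_x log p(y | x)."""
    return unconditional_score(x) + scale * classifier_log_prob_grad(x, W, b, y)

# Toy setup: 3 classes, 4-dimensional input, standard-normal prior (score = -x).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.normal(size=4)
standard_normal_score = lambda v: -v

s = guided_score(x, standard_normal_score, W, b, y=1, scale=2.0)
print(s.shape)
```

Setting scale to 0 recovers the unconditional score, and larger values push samples harder toward the target class, typically at some cost in diversity.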
Technically, they work in log-Mel filterbank feature space. The classifier is trained with cross-entropy on noisy inputs so that it outputs time-conditioned class probabilities. During ScoreSubnet training, two kinds of signal are extracted from the frozen classifier: its forward activations (feature taps), and gradient taps obtained by backpropagating a Joint Energy-based Model (JEM) style marginal log density, which in the JEM formulation is the log-sum-exp of the class logits up to a normalizing constant. These signals are normalized, fused via attention, and fed to a small decoder that is trained with Denoising Score Matching (DSM) to predict the diffusion score. At generation time, the learned score stands in for the unknown score in the reverse stochastic differential equation and can be combined with standard classifier guidance to produce class-conditional samples. Waveforms are then synthesized from the log-Mel features with a pretrained HiFi-GAN vocoder.
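The two training signals in this paragraph can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the classifier is again a hypothetical linear model, the noising is the simple additive form x_noisy = x0 + sigma * eps, and the DSM loss is weighted by sigma squared. The summary does not state the paper's actual SDE, loss weighting, or network, so this is not the authors' setup.

```python
import numpy as np

def jem_log_density(logits):
    """Unnormalized JEM-style log p(x): log-sum-exp of the class logits."""
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum())

def jem_input_gradient(x, W, b):
    """Gradient of logsumexp(W @ x + b) w.r.t. x (toy linear classifier).

    For a deep classifier this would come from backpropagation; the
    intermediate gradients along the way are the 'gradient taps'.
    """
    logits = W @ x + b
    probs = np.exp(logits - jem_log_density(logits))  # softmax
    return probs @ W

def dsm_loss(score_fn, x0, sigma, rng):
    """Sigma^2-weighted denoising score matching loss at one noise level."""
    eps = rng.standard_normal(x0.shape)
    x_noisy = x0 + sigma * eps
    target = -eps / sigma  # score of the Gaussian perturbation kernel
    residual = score_fn(x_noisy) - target
    return sigma**2 * np.mean(np.sum(residual**2, axis=1))

# Toy check: for x0 ~ N(0, I), the noised data is N(0, (1 + sigma^2) I),
# so the exact score is -x / (1 + sigma^2) and should beat a zero predictor.
sigma = 0.5
x0 = np.random.default_rng(0).standard_normal((20000, 4))
true_score = lambda x: -x / (1.0 + sigma**2)
zero_score = lambda x: np.zeros_like(x)

loss_true = dsm_loss(true_score, x0, sigma, np.random.default_rng(1))
loss_zero = dsm_loss(zero_score, x0, sigma, np.random.default_rng(1))
print(loss_true < loss_zero)
```

In the approach described above, score_fn would be the ScoreSubnet decoder fed with the fused feature and gradient taps, and only its parameters would be updated while the classifier stays frozen.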