MeloDISinger: editing sung lyrics while keeping melody and timing unchanged
This paper introduces MeloDISinger, a system that edits the lyrics of recorded singing while leaving the original melody, the overall timing, and all unedited regions intact. The goal is practical: change words in a recorded vocal without shifting the song’s rhythm or pitch and without re-recording. The authors focus on two strict requirements that prior work sometimes breaks: staying locked to the original melody, and preserving the total duration of each edited region so the voice stays in sync with the accompaniment.
The main technical idea is a new duration predictor called MeloDRP (Melody-aware Duration Ratio Predictor). Instead of predicting an absolute length for each phoneme, MeloDRP predicts duration ratios inside each edited span. The ratios divide up the span’s original time budget among the new phonemes, so the span’s total length is preserved by construction. To make these ratios follow the song’s melody, MeloDRP combines phonetic cues (information about the sounds and the word and syllable boundaries) with a pseudo-MIDI representation of the original melody. The pseudo-MIDI is extracted from the recorded singing rather than from a written score, because live performances often differ from sheet music. The model is also trained with supervision that encourages soft alignments between phonemes and notes rather than forcing a strict one-to-one match.
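The "preserved by construction" property can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the softmax normalization, the one-frame minimum per phoneme, and the largest-remainder rounding are all assumptions chosen for illustration. The sketch turns unconstrained per-phoneme scores into ratios, then splits a fixed frame budget so that the integer durations always sum to the original span length.

```python
import math

def ratios_from_scores(scores):
    """Softmax: map unconstrained scores to positive ratios that sum to 1.
    (Assumed normalization; the paper summary does not specify one.)"""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def allocate_frames(ratios, span_frames, min_frames=1):
    """Split span_frames among phonemes in proportion to ratios.

    Every phoneme first receives min_frames (a hypothetical floor so that no
    phoneme vanishes); the remaining frames are shared by largest-remainder
    rounding, which keeps the total exactly equal to span_frames.
    """
    n = len(ratios)
    if span_frames < n * min_frames:
        raise ValueError("span is too short for the number of phonemes")
    spare = span_frames - n * min_frames
    total = sum(ratios)
    ideal = [spare * r / total for r in ratios]      # fractional shares
    frames = [min_frames + int(x) for x in ideal]    # floor of each share
    leftover = span_frames - sum(frames)
    # Hand the leftover frames to the largest fractional remainders.
    order = sorted(range(n), key=lambda i: ideal[i] - int(ideal[i]), reverse=True)
    for i in order[:leftover]:
        frames[i] += 1
    return frames

if __name__ == "__main__":
    scores = [0.0, 1.0, 0.5, -0.5]       # made-up predictor outputs
    frames = allocate_frames(ratios_from_scores(scores), span_frames=87)
    print(frames, sum(frames))           # the sum equals the 87-frame budget
```

Whatever scores the predictor produces, the span length cannot drift, which is the practical difference from predicting absolute durations and hoping they add up.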
To turn the predicted durations and pitches into audio, the authors use an audio-infilling decoder built on a flow-matching model. During training, the decoder sees audio with randomly masked regions and learns to fill in only those regions. At inference, it starts from noise and generates mel-spectrogram frames for the edited parts, then merges them with the unedited frames from the original audio. This preserves the non-edited context exactly and creates smooth transitions at edit boundaries. The pitch contour (F0) of the edited regions is predicted by a separate module adapted from prior work.
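The inference procedure described above (start from noise, integrate a learned flow, then paste the result into the original) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's decoder: `euler_sample`, `infill`, and `make_toy_velocity` are hypothetical names, the velocity field is a hand-written stand-in for the trained network, and the conditioning on phonemes, F0, and surrounding audio that a real decoder would use is omitted. Mel-spectrograms are represented as plain lists of frames.

```python
import random

def euler_sample(velocity, x0, steps=16):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = [row[:] for row in x0]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [[xi + dt * vi for xi, vi in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x

def infill(original_mel, edit_mask, velocity, steps=16, seed=0):
    """Generate frames from Gaussian noise, then keep them only where
    edit_mask is True; every other frame is copied from the original."""
    rng = random.Random(seed)
    n_bins = len(original_mel[0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(n_bins)] for _ in original_mel]
    generated = euler_sample(velocity, noise, steps)
    return [gen if edited else orig[:]
            for gen, orig, edited in zip(generated, original_mel, edit_mask)]

def make_toy_velocity(target):
    """Stand-in for the learned network: a straight-line flow toward `target`.
    A trained model would predict this field from its conditioning inputs."""
    def velocity(x, t):
        return [[(ti - xi) / (1.0 - t) for xi, ti in zip(xr, tr)]
                for xr, tr in zip(x, target)]
    return velocity

if __name__ == "__main__":
    original = [[float(i)] * 3 for i in range(6)]        # 6 frames, 3 mel bins
    mask = [False, False, True, True, False, False]      # frames 2-3 are edited
    target = [[10.0] * 3 for _ in range(6)]              # made-up "new" content
    edited = infill(original, mask, make_toy_velocity(target))
    for frame in edited:
        print([round(v, 3) for v in frame])
```

Because the unedited frames are copied rather than regenerated, they match the source exactly; only the masked frames come from the sampler.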