
Masking sEMG and lipreading together reduces errors in silent speech synthesis by up to 14 points

June 9, 2026 | arXiv: 2606.09667v1

Researchers report a way to make silent speech systems more accurate and more robust by training a model to use both muscle signals and video of the mouth. Silent speech interfaces try to recover speech when a person cannot produce sound. The team combined surface electromyography (sEMG), which measures electrical signals from facial muscles at the skin surface, with video-based lipreading. They trained with a technique called modality masking, in which one input type is sometimes hidden so that the model learns to rely on the other when needed.

To build the system, the authors trained a multimodal speech synthesizer that fuses sEMG and lipreading. A transformer-style encoder learns joint representations of the two input streams, and the model was then fine-tuned and tested under degraded conditions. The degradations were chosen to be realistic, for example a lower video frame rate or a reduced transmission bitrate. During training, the authors applied temporal adaptive masking, which randomly blanks part or all of a modality to encourage the network to learn complementary cues across modalities.
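The masking idea can be illustrated with a small sketch. This is not the authors' implementation: the function name `mask_modalities`, the probabilities `p_full` and `p_span`, and the `max_span_frac` limit are all assumptions made for illustration, and the "adaptive" scheduling the paper refers to is not modeled here. The sketch only shows the basic mechanism of zeroing all or part of one input stream while leaving the other intact.

```python
import random

def mask_modalities(emg, video, rng, p_full=0.2, p_span=0.3, max_span_frac=0.5):
    """Return copies of (emg, video) with at most one stream partly or fully zeroed.

    emg, video: non-empty lists of frames, each frame a list of floats.
    p_full, p_span, max_span_frac: illustrative values, not taken from the paper.
    """
    # Copy so the caller's training example is never modified in place.
    emg = [list(frame) for frame in emg]
    video = [list(frame) for frame in video]

    # Choose one stream to degrade; the other always stays intact.
    target = rng.choice([emg, video])
    n = len(target)

    u = rng.random()
    if u < p_full:
        # Hide the whole modality.
        start, span = 0, n
    elif u < p_full + p_span:
        # Hide one contiguous span of frames.
        span = rng.randint(1, max(1, int(max_span_frac * n)))
        start = rng.randint(0, n - span)
    else:
        # Leave this example unmasked.
        start, span = 0, 0

    for t in range(start, start + span):
        target[t] = [0.0] * len(target[t])
    return emg, video
```

In a training loop, a function like this would be called once per example before the encoder sees the inputs, for instance `emg_in, video_in = mask_modalities(emg, video, random.Random())`. Zero-filling is one simple way to hide an input; a learned mask token is a common alternative.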

The multimodal, masked approach outperformed single-modality systems. In multi-speaker tests, the fused model reached around 76% phone-level accuracy (phones are the basic speech sounds) and a word error rate (WER) of about 40%. Compared with the strongest unimodal baseline, WER fell by up to 14 absolute percentage points. The authors also found that masking during training was key to these gains, and that it made the model more robust than task-specific data augmentation designed for particular degradations.
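For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between the reference and the recognized transcript, divided by the number of reference words. The sketch below shows that standard definition. The function names are illustrative, and the paper's exact scoring setup, such as its text normalization, is not described in this summary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```

"Absolute percentage points" means the raw difference between two WER values. As a purely hypothetical illustration, a drop from 50% to 40% WER is 10 absolute points but a 20% relative reduction.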

A closer look at errors showed that the two modalities bring different strengths. sEMG provided especially useful information for vowels and for some consonant groups such as affricates. The model still struggled more with certain sounds, notably some plosive and nasal distinctions. This phone-level analysis helps explain where the fusion helps and where more work is needed.
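A phone-level breakdown of this kind can be sketched as grouping aligned reference and predicted phones by sound class and computing accuracy per class. The example below is an assumption-laden illustration: the `PHONE_CLASS` table covers only a small subset of ARPAbet symbols, the function name is hypothetical, and the toy pairs are made up to exercise the code, not taken from the paper.

```python
from collections import defaultdict

# Partial ARPAbet-to-class map, for illustration only.
PHONE_CLASS = {
    "AA": "vowel", "IY": "vowel", "UW": "vowel", "EH": "vowel",
    "CH": "affricate", "JH": "affricate",
    "P": "plosive", "B": "plosive", "T": "plosive",
    "D": "plosive", "K": "plosive", "G": "plosive",
    "M": "nasal", "N": "nasal", "NG": "nasal",
}

def accuracy_by_class(pairs):
    """Per-class accuracy from aligned (reference_phone, predicted_phone) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, pred in pairs:
        cls = PHONE_CLASS.get(ref, "other")  # unknown symbols fall into "other"
        total[cls] += 1
        correct[cls] += int(ref == pred)
    return {cls: correct[cls] / total[cls] for cls in total}

# Toy aligned pairs (invented), with a voicing confusion and a nasal confusion.
toy_pairs = [("AA", "AA"), ("CH", "CH"), ("P", "B"), ("P", "P"), ("M", "N"), ("N", "N")]
print(accuracy_by_class(toy_pairs))
```

Running the same breakdown for each unimodal model and for the fused model would show which sound classes benefit most from fusion.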