New method lets speech‑aware large language models handle multi‑speaker audio by conditioning the encoder on diarization
Researchers introduce a way to make spoken large language models (SLMs) work better on recordings with many speakers. Instead of changing the language model part, they change only the acoustic encoder so it focuses on one speaker at a time. The resulting system, called Dixtral, combines a diarization‑conditioned Whisper encoder (DiCoW) with a preexisting SLM (Voxtral) while keeping the language decoder frozen.
The key idea is to give the encoder a diarization mask that describes each time frame relative to one target speaker: silence, the target speaking alone, non‑target speech only, or overlap between the target and other speakers. The encoder uses these four per‑frame probabilities to modify its internal layers with a technique the authors call frame‑level diarization‑dependent transformations. A small adapter then maps the conditioned acoustic features into the LLM’s embedding space, and the frozen decoder consumes them for transcription, summarization, or question answering.
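The conditioning step can be illustrated with a minimal NumPy sketch. It assumes the diarization output is a (speakers, frames) matrix of speech probabilities and that each of the four classes has its own affine transform, blended per frame by the mask. The function names `stno_mask` and `fddt`, the class order, and the independence assumption used to combine speakers are illustrative choices, not details confirmed by the article.

```python
import numpy as np

def stno_mask(activity: np.ndarray, target: int) -> np.ndarray:
    """Build a (frames, 4) mask: [silence, target only, non-target only, overlap].

    activity: (speakers, frames) speech probabilities in [0, 1]
    (assumed diarization output format).
    """
    tgt = activity[target]
    others = np.delete(activity, target, axis=0)
    # Probability that at least one other speaker is active,
    # treating speakers as independent (a simplifying assumption).
    any_other = 1.0 - np.prod(1.0 - others, axis=0)
    silence = (1.0 - tgt) * (1.0 - any_other)
    target_only = tgt * (1.0 - any_other)
    non_target = (1.0 - tgt) * any_other
    overlap = tgt * any_other
    return np.stack([silence, target_only, non_target, overlap], axis=1)

def fddt(hidden: np.ndarray, mask: np.ndarray,
         weights: np.ndarray, biases: np.ndarray) -> np.ndarray:
    """Blend four class-specific affine transforms frame by frame.

    hidden: (frames, dim), mask: (frames, 4),
    weights: (4, dim, dim), biases: (4, dim).
    """
    # Apply every class transform to every frame: (4, frames, dim).
    per_class = np.einsum("fd,cde->cfe", hidden, weights) + biases[:, None, :]
    # Weight each transformed frame by its class probability and sum.
    return np.einsum("fc,cfe->fe", mask, per_class)

# Two speakers, four frames: silence, speaker 0, both, speaker 1.
activity = np.array([[0.0, 1.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0, 1.0]])
mask = stno_mask(activity, target=0)

dim = 3
hidden = np.arange(4 * dim, dtype=float).reshape(4, dim)
identity = np.stack([np.eye(dim)] * 4)   # hypothetical initialisation
conditioned = fddt(hidden, mask, identity, np.zeros((4, dim)))
```

Because the four probabilities sum to one in every frame, identity weights and zero biases leave the features unchanged; a trained model would presumably learn transforms that treat target and non-target frames differently.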
The team tested Dixtral on several multi‑talker datasets: AMI, NOTSOFAR‑1, LibriSpeechMix, and Mixer6. They measured speaker‑attributed transcription error with concatenated minimum‑permutation word error rate (cpWER), where lower is better. Across the benchmarks Dixtral achieved a macro‑average cpWER of 15.4%, compared to 44.4% for Gemini 3.0 Flash, 35.2% for VibeVoice, and 31.4% for Voxtral Mini Transcribe V2. Those gaps correspond to the absolute cpWER reductions the paper reports against these systems: about 29.0, 19.8, and 16.0 percentage points, respectively. On a new long‑form multi‑speaker QA and summarization benchmark built from NOTSOFAR‑1, Dixtral matched Gemini in zero‑shot content understanding, and after fine‑tuning it surpassed both Gemini and Voxtral (operating on close‑talk audio) on all evaluated tasks.
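For readers unfamiliar with the metric, cpWER can be sketched as follows: concatenate each speaker's utterances, try every assignment of hypothesis speakers to reference speakers, and keep the assignment with the fewest word errors. This is a brute‑force illustration, not the evaluation code used in the paper; the function names are hypothetical, and practical toolkits typically solve the assignment more efficiently.

```python
from itertools import permutations

def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def cpwer(refs: list, hyps: list) -> float:
    """refs, hyps: one list of utterance strings per speaker."""
    # Concatenate every speaker's utterances into one word sequence.
    ref_words = [" ".join(utts).split() for utts in refs]
    hyp_words = [" ".join(utts).split() for utts in hyps]
    # Pad with empty streams so both sides have the same speaker count.
    n = max(len(ref_words), len(hyp_words))
    ref_words += [[]] * (n - len(ref_words))
    hyp_words += [[]] * (n - len(hyp_words))
    total_ref = sum(len(r) for r in ref_words)
    # Brute force over speaker assignments (factorial in n, fine for small n).
    best = min(
        sum(edit_distance(r, hyp_words[p]) for r, p in zip(ref_words, perm))
        for perm in permutations(range(n))
    )
    return best / total_ref  # assumes the reference is not empty

refs = [["hello world", "how are you"], ["good morning"]]
hyps = [["good morning"], ["hello word", "how are you"]]  # speakers swapped, one typo
score = cpwer(refs, hyps)
```

In this toy case the swapped speaker labels cost nothing, because the metric searches over permutations, and the single substituted word yields one error out of seven reference words.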
This approach matters for two reasons. First, it avoids retraining the large language decoder; such retraining can cause “catastrophic forgetting” of reasoning and summarization skills, whereas a frozen decoder lets the SLM keep its original language and reasoning abilities while gaining multi‑speaker handling. Second, treating speakers separately reduces decoding cost as models scale: decoding each of the S speakers as its own shorter transcript can be cheaper than producing one long interleaved transcript.
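One way to see why per‑speaker decoding can be cheaper is a toy count of attention query‑key pairs during autoregressive generation. The sketch below is an illustration under strong assumptions, not a cost model from the paper: it assumes attention dominates decoding cost, that the S speakers produce transcripts of equal length, and it ignores the cost of encoding and prefilling the audio once per speaker. The function names and token counts are hypothetical.

```python
def causal_attention_pairs(prefix_len: int, new_tokens: int) -> int:
    """Count query-key pairs when generating new_tokens after a prefix.

    The t-th generated token (1-based) attends to the whole prefix
    plus the t tokens generated so far, itself included.
    """
    return new_tokens * prefix_len + new_tokens * (new_tokens + 1) // 2

def decoding_cost(prefix_len: int, total_tokens: int, num_speakers: int):
    """Return (interleaved, separate) attention-pair counts."""
    per_speaker = total_tokens // num_speakers
    # One long transcript covering all speakers.
    interleaved = causal_attention_pairs(prefix_len, total_tokens)
    # One shorter transcript per speaker, each with the same prefix length.
    separate = num_speakers * causal_attention_pairs(prefix_len, per_speaker)
    return interleaved, separate

# Illustrative numbers only: 1,200 output tokens shared by 4 speakers.
no_prefix = decoding_cost(prefix_len=0, total_tokens=1200, num_speakers=4)
with_prefix = decoding_cost(prefix_len=1500, total_tokens=1200, num_speakers=4)
```

With no prefix, this count gives 720,600 pairs for the interleaved transcript versus 180,600 for four separate ones. The per‑speaker passes are also independent, so they could in principle run as a batch, although the repeated audio encoding that this sketch leaves out would offset part of the saving.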