Signal Processing

A voice agent that changes when it speaks based on an assigned role

June 12, 2026 | arXiv: 2606.13544v1

Researchers present ModeratorLM, a new speech large language model that decides when a voice assistant should speak in group conversations by following an explicit role. The idea is simple: give the agent a short description of the role it should play (for example, a quiet listener or a firm facilitator) and let the model use that role to guide both the decision to take the floor and any textual response. The team says this is the first role-conditioned voice agent designed specifically for multi‑party settings with overlapping speech and shifting floor control.

The system works on short audio chunks. A speech encoder turns each incoming chunk into a vector, and those vectors are fed to a large language model (LLM). For each chunk, the model either emits a control token plus a text reply (meaning “I will take the floor and say this”) or emits nothing (meaning “do not take the floor now”). A second variant, called ModeratorLM‑Think, lets the LLM first write a short internal reasoning trace, similar to a chain‑of‑thought, about whether to speak, and only then make the decision. These traces are supervised during training to teach more deliberate turn‑taking.
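The per-chunk decision described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the control token string `<speak>`, the `Decision` type, and the functions `parse_output` and `run_agent` are hypothetical names, and `toy_encode` and `toy_generate` are trivial stand-ins for the real speech encoder and LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Hypothetical control token; the article does not name the real one.
TAKE_FLOOR = "<speak>"


@dataclass
class Decision:
    chunk_index: int
    speaks: bool
    reply: Optional[str]


def parse_output(chunk_index: int, output: str) -> Decision:
    """Control token plus text means take the floor; anything else means stay silent."""
    if output.startswith(TAKE_FLOOR):
        return Decision(chunk_index, True, output[len(TAKE_FLOOR):].strip())
    return Decision(chunk_index, False, None)


def run_agent(
    role: str,
    chunks: Sequence[Sequence[float]],
    encode: Callable[[Sequence[float]], List[float]],
    generate: Callable[[str, List[List[float]]], str],
) -> List[Decision]:
    """Encode each audio chunk, then ask the model once per chunk whether to speak."""
    history: List[List[float]] = []
    decisions: List[Decision] = []
    for i, chunk in enumerate(chunks):
        history.append(encode(chunk))          # one vector per chunk
        output = generate(role, history)       # role conditions the decision
        decisions.append(parse_output(i, output))
    return decisions


# Toy stand-ins (assumptions for illustration only).
def toy_encode(chunk: Sequence[float]) -> List[float]:
    # Mean absolute amplitude as a one-dimensional "embedding".
    return [sum(abs(x) for x in chunk) / len(chunk)]


def toy_generate(role: str, history: List[List[float]]) -> str:
    # Speak only if the role is a facilitator and the latest chunk is near-silent.
    if "facilitator" in role and history[-1][0] < 0.1:
        return TAKE_FLOOR + " Let's hear from someone else."
    return ""


chunks = [[0.5, -0.4], [0.0, 0.01], [0.3, 0.2]]
decisions = run_agent("firm facilitator", chunks, toy_encode, toy_generate)
print([d.speaks for d in decisions])  # [False, True, False]
```

Running the same chunks with the role "quiet listener" yields no floor-taking at all, which is the kind of role-dependent behavior the article describes.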

To train the models, the authors built RolePlayConv, a synthetic spoken dataset of roughly 75,000 conversations. Each conversation has three to six speakers and is conditioned on one of 125 detailed assistant roles. Turns are kept short (fewer than 15 words) to mimic real spoken exchanges, and the text conversations are converted to speech with a text‑to‑speech model. The authors also added reasoning traces to many assistant turns so that the thinking variant can be supervised during training.
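To make the dataset constraints concrete, here is a hedged sketch of what one record and a sanity check might look like. The field names (`role_description`, `speaker`, `text`, `reasoning`), the `"assistant"` speaker label, and the `validate` function are all assumptions for illustration; the article does not describe the actual RolePlayConv schema. The sketch also assumes that the three-to-six speaker count includes the assistant.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_SPEAKERS, MAX_SPEAKERS = 3, 6   # "three to six speakers"
WORD_LIMIT = 15                     # turns have "fewer than 15 words"
ASSISTANT = "assistant"             # hypothetical speaker label


@dataclass
class Turn:
    speaker: str
    text: str
    reasoning: Optional[str] = None  # optional trace, assistant turns only


@dataclass
class Conversation:
    role_description: str
    turns: List[Turn]


def validate(conv: Conversation) -> List[str]:
    """Return a list of constraint violations (empty if the record looks fine)."""
    problems: List[str] = []
    n_speakers = len({t.speaker for t in conv.turns})
    if not MIN_SPEAKERS <= n_speakers <= MAX_SPEAKERS:
        problems.append(f"expected 3-6 speakers, found {n_speakers}")
    for i, turn in enumerate(conv.turns):
        n_words = len(turn.text.split())
        if n_words >= WORD_LIMIT:
            problems.append(f"turn {i} has {n_words} words")
        if turn.reasoning is not None and turn.speaker != ASSISTANT:
            problems.append(f"turn {i} has a reasoning trace but is not an assistant turn")
    return problems


good = Conversation(
    role_description="A firm facilitator who keeps the meeting on schedule.",
    turns=[
        Turn("alice", "I think we should ship on Friday."),
        Turn("bob", "Friday is too early for QA."),
        Turn(ASSISTANT, "Let's hear QA's estimate before deciding.",
             reasoning="Two speakers disagree; a facilitator should redirect."),
    ],
)
bad = Conversation(
    role_description="A quiet listener.",
    turns=[Turn("alice", " ".join(["word"] * 15)), Turn("bob", "Okay.")],
)

print(validate(good))  # []
print(validate(bad))
```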

The team evaluated ModeratorLM on their synthetic RolePlayConv data and on a real meeting corpus (NOTSOFAR‑1). Compared with baselines that do not use role conditioning, they report large gains: turn‑taking precision improved by over 40% and recall by more than 70%, with far fewer false‑positive interruptions. The backbone model in these experiments was Qwen3‑4B, and training included a stage that aligned the speech encoder to the LLM using about 90,000 hours of publicly available speech data for automatic speech recognition (ASR). During fine‑tuning, the speech encoder was kept frozen and the LLM was adapted with a low‑rank method (about 13.4 million trainable parameters).
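For readers unfamiliar with the metrics, the sketch below shows one simple way to compute turn‑taking precision, recall, and the number of false‑positive interruptions from per‑chunk decisions. It assumes exact per‑chunk matching between predicted and reference decisions; the article does not state the paper's actual matching criteria (for example, any timing tolerance), and the function name `turn_taking_scores` is hypothetical.

```python
from typing import Sequence, Tuple


def turn_taking_scores(
    predicted: Sequence[bool], reference: Sequence[bool]
) -> Tuple[float, float, int]:
    """Return (precision, recall, false_positives) for take-the-floor decisions.

    predicted[i] / reference[i] are True when the agent did / should take
    the floor at chunk i. Assumes exact per-chunk alignment.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum(r and not p for p, r in zip(predicted, reference))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp


reference = [False, True, False, False, True, False]
predicted = [False, True, True, False, False, False]
precision, recall, false_positives = turn_taking_scores(predicted, reference)
print(precision, recall, false_positives)  # 0.5 0.5 1
```

In this toy example the agent speaks correctly once, interrupts once when it should have stayed silent (a false positive), and misses one opportunity to speak, giving precision and recall of 0.5 each.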