Study measures how much sound knowledge lives inside language models and how that affects audio AI
This paper asks a simple question: how much do large language models (LLMs) already know about sounds from text-only training, and does that knowledge help when they are turned into audio-capable systems? The authors test many LLMs in three ways: by asking them sound-related questions directly, by giving them rich text captions of audio, and by fine-tuning them together with an audio encoder to build large audio-language models (LALMs).
To support the tests, the team built AKB-2000, a curated auditory knowledge benchmark with 2,000 questions. The questions span six broad categories (Music, Sound, Paralinguistic voice cues, Phonetic, Audio Quality, and Technical) and 48 subcategories. The benchmark was created with LLM-assisted question generation and then verified by humans. The researchers evaluated 12 open-weight LLMs from four model families (Qwen, Llama, OLMo, Phi) and five proprietary models (for example GPT, Gemini, and Claude). They ran three evaluations: direct probing on AKB-2000, a cascade setup in which an audio captioner converts sound into text that a text-only LLM then answers questions about, and an audio-grounded setup in which each LLM is fine-tuned into a LALM using the DeSTA self-distillation framework.
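The cascade setup described above can be sketched as a two-step pipeline. The function names and prompt format below are illustrative assumptions, not the authors' actual code; the stubs stand in for a real audio captioning model and a real text-only LLM.

```python
# Minimal sketch of the cascade evaluation: caption the audio, then let a
# text-only LLM answer a multiple-choice question about the caption.
# caption_audio and ask_llm are hypothetical stand-ins for real model calls.

def caption_audio(audio_path: str) -> str:
    """Stub: a real system would invoke an audio captioning model here."""
    return "A violin plays a sustained note with slight vibrato."

def ask_llm(prompt: str) -> str:
    """Stub: a real system would query a text-only LLM here."""
    return "A"  # the letter of the model's chosen option

def cascade_answer(audio_path: str, question: str, options: list[str]) -> str:
    # Step 1: turn the audio into a rich text caption.
    caption = caption_audio(audio_path)
    # Step 2: the LLM answers from the caption alone, never hearing the audio.
    prompt = (
        f"Audio description: {caption}\n"
        f"Question: {question}\n"
        + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
        + "\nAnswer with the letter only."
    )
    return ask_llm(prompt)

print(cascade_answer("clip.wav", "Which instrument is playing?",
                     ["Violin", "Trumpet", "Drum kit"]))
```

The key design point is that the LLM's auditory knowledge must do all the work in step 2: any acoustic detail the captioner omits is lost to the downstream model.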
Their main findings are empirical but clear. Auditory knowledge varies substantially across model families: in the experiments, Qwen models consistently outperformed Llama models across many settings. Performance on the text-only AKB-2000 benchmark was strongly correlated with performance after audio grounding. In one controlled test, swapping only the base LLM while keeping the fine-tuning recipe identical changed the final LALM's accuracy by more than 10 absolute percentage points.
The paper also highlights important limits. LLMs struggle with phonological tasks (questions about fine-grained speech sounds), pointing to a genuine gap left by text-only pretraining. Another notable result is that a simple cascade approach, where audio is first captioned and the caption is then fed to a text-only LLM, can match or even beat some end-to-end LALMs. That suggests current end-to-end systems may be bottlenecked more by their audio encoders than by the LLMs' own reasoning.