WiFo-MiSAC: a single foundation model that fuses wireless signals, radar, and maps for sensing and communication
This paper introduces WiFo-MiSAC, a foundation model designed to process and fuse heterogeneous wireless and sensing data in one unified system. The goal is to let one model handle different kinds of inputs—radio channel measurements, radar returns, and map data—so it can be adapted to many downstream tasks such as predicting antenna beams or estimating communication channels.
The researchers build a unified Transformer-style backbone that first converts each sensor stream into a common token representation. They propose a Shared–Specific Disentangled Mixture-of-Experts (SS-DMoE) architecture that separates modality-specific details from modality-shared, environment-level information. For pre-training they combine masked reconstruction (filling in hidden pieces of the data) with contrastive alignment (pulling the representations of related signals closer together). Their experiments use three modalities: CSI (channel state information), FMCW (frequency-modulated continuous-wave) radar, and map data. They also assemble a large multimodal dataset with over one billion complex-valued CSI entries and more than 200k time-synchronized CSI–radar–map triplets for training and evaluation.
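To make the two pre-training objectives concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the choice of mean squared error for reconstruction, the one-directional InfoNCE form of the contrastive term, the temperature of 0.1, and the equal weighting of the two terms are all assumptions made for illustration.

```python
import math

def masked_mse(target, prediction, mask):
    """Mean squared error computed only on tokens that were hidden (mask[i] is True)."""
    errors = [
        sum((t - p) ** 2 for t, p in zip(tok_t, tok_p)) / len(tok_t)
        for tok_t, tok_p, m in zip(target, prediction, mask) if m
    ]
    return sum(errors) / len(errors) if errors else 0.0

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Contrastive loss: anchors[i] should match positives[i], not positives[j != i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def pretraining_loss(target, prediction, mask, emb_a, emb_b, weight=1.0):
    """Hypothetical combined objective: reconstruction plus weighted alignment."""
    return masked_mse(target, prediction, mask) + weight * info_nce(emb_a, emb_b)
```

In this sketch, emb_a and emb_b would be embeddings of the same scene from two modalities (for example CSI and radar), so the contrastive term is small when matching pairs are more similar to each other than to the other samples in the batch.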
At a high level, SS-DMoE pairs modality-specialized expert groups with a group of shared experts that learn the common “synesthesia-of-machines” representation. This design is meant to let the model capture fine-grained, token-level links across modalities without interference when a sensor differs or is missing. The combined masked-reconstruction and contrastive objectives encourage both local, token-level alignment and global consistency across modalities, which the authors say improves performance on tasks like beam prediction and channel estimation.
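The shared-plus-specific idea can be illustrated with a toy layer in plain Python. This is a sketch under stated assumptions, not the SS-DMoE design itself: the class names, the linear experts, the dense softmax gating (instead of sparse top-k routing), and the additive combination of the two branches are all hypothetical choices made to keep the example short.

```python
import math
import random

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class ExpertGroup:
    """A group of linear experts whose outputs are mixed by a softmax router."""
    def __init__(self, rng, num_experts, dim):
        self.experts = [random_matrix(rng, dim, dim) for _ in range(num_experts)]
        self.router = random_matrix(rng, num_experts, dim)

    def __call__(self, token):
        gates = softmax(matvec(self.router, token))  # one gate weight per expert
        outputs = [matvec(expert, token) for expert in self.experts]
        return [sum(g * out[d] for g, out in zip(gates, outputs))
                for d in range(len(token))]

class SharedSpecificMoE:
    """Every token uses the shared group; only its own modality's specific group."""
    def __init__(self, modalities, dim, num_experts=2, seed=0):
        rng = random.Random(seed)
        self.shared = ExpertGroup(rng, num_experts, dim)
        self.specific = {m: ExpertGroup(rng, num_experts, dim) for m in modalities}

    def __call__(self, token, modality):
        shared_part = self.shared(token)
        specific_part = self.specific[modality](token)
        combined = [s + p for s, p in zip(shared_part, specific_part)]
        return shared_part, specific_part, combined

layer = SharedSpecificMoE(["csi", "radar", "map"], dim=4)
token = [0.1, -0.2, 0.3, 0.4]
shared_csi, specific_csi, out_csi = layer(token, "csi")
shared_radar, specific_radar, out_radar = layer(token, "radar")
```

Because each modality has its own expert group, a token from a missing modality simply never reaches that group, while the shared group still sees tokens from every modality that is present. In this toy version that is the only mechanism limiting cross-modality interference.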
Why this matters: prior systems tended to be built for one task or one sensor setup, with hand-designed fusion steps that break when sensors change or data are noisy. WiFo-MiSAC is task-agnostic and fuses heterogeneous inputs early, which the authors report leads to strong few-shot adaptation (learning new tasks with little data) and easier integration of new sensor types. This could provide a scalable backbone for future integrated sensing-and-communication systems, a direction motivated by visions for next-generation wireless networks.