DiV‑INR compresses video at extremely low bitrates by combining compact neural representations with diffusion models
This paper introduces DiV‑INR, a new way to compress video when you only have a tiny number of bits to spend (under 0.05 bits per pixel). The key idea is to pair two recent tools from machine learning: implicit neural representations (INRs), which can store a video in a very compact neural form, and pre‑trained diffusion models, which are generative models trained on large video datasets and can fill in realistic image details. The authors aim for better perceptual quality — images that look right to people — rather than only optimizing pixel‑by‑pixel error.
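To make the bitrate regime concrete, bits per pixel is simply the bits spent per second divided by the pixels displayed per second. A quick back-of-the-envelope check (the numbers here are illustrative, not taken from the paper):

```python
def bits_per_pixel(bitrate_bps: float, width: int, height: int, fps: float) -> float:
    """Bits per pixel = bits spent per second / pixels displayed per second."""
    return bitrate_bps / (width * height * fps)

# Illustrative example: a 1080p clip at 30 fps encoded at 3 Mbps
bpp = bits_per_pixel(3_000_000, 1920, 1080, 30)
print(f"{bpp:.3f} bpp")  # ~0.048 bpp, i.e. inside the <0.05 bpp regime
```

At 1080p/30fps, anything under roughly 3 Mbps already falls into the extremely low bitrate regime the paper targets.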
Instead of storing ordinary intra‑coded keyframes, DiV‑INR stores bit‑efficient neural representations. These INRs are trained to estimate intermediate latent features that the diffusion model can use as conditioning signals. The system jointly optimizes the INR weights together with small, parameter‑efficient adapter modules added to the diffusion model. This lets the method convey video‑specific information with only a small extra parameter cost while still relying on the powerful generative priors of the diffusion model.
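The joint-optimization idea can be sketched in a heavily simplified form. The stand-ins below are toy NumPy vectors, not the paper's actual components (the real system trains a coordinate-based INR and adapter modules inside a frozen diffusion network); the point is only the training-loop structure, where two small sets of weights are updated together while the large backbone stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical): the real components are a compact INR and
# parameter-efficient adapters inside a frozen, pre-trained diffusion model.
inr_weights = rng.normal(size=4)      # trainable: compact video representation
adapter_weights = rng.normal(size=4)  # trainable: small adapter modules
backbone = rng.normal(size=4)         # frozen: pre-trained generative prior

target = np.ones(4)  # toy stand-in for the frame to reconstruct

def loss_and_grads(inr, adapter):
    # Stand-in reconstruction loss: the frozen backbone combines the INR
    # conditioning with the adapter output to "generate" a frame.
    pred = backbone + inr + adapter
    err = pred - target
    loss = float(np.mean(err ** 2))
    grad = 2 * err / err.size  # the same gradient flows to both trainables
    return loss, grad, grad

lr = 0.1
for step in range(200):
    loss, g_inr, g_adapter = loss_and_grads(inr_weights, adapter_weights)
    inr_weights -= lr * g_inr          # jointly updated...
    adapter_weights -= lr * g_adapter  # ...while `backbone` is never touched

print(f"final loss: {loss:.6f}")
```

Only the INR and adapter weights ever receive gradient updates, which mirrors the paper's small extra parameter cost on top of the frozen generative prior.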
At a high level, the diffusion model uses the INR conditioning to generate frames. The authors report that the model first composes the scene layout and object identities and then refines textures; that is, it follows a semantic‑to‑visual hierarchy, getting the overall scene and objects right before adding fine visual detail. The evaluation focuses on perceptual quality, using the perceptual similarity metrics LPIPS and DISTS together with FID (Fréchet Inception Distance), which measures distributional similarity between generated and reference frames.
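FID differs from per-pixel or per-pair metrics: it fits Gaussians to deep features of the real and generated frame sets and reports the Fréchet distance between those Gaussians. A minimal sketch of that core computation, assuming SciPy for the matrix square root (the feature statistics here are toy inputs, not Inception features):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians (the core of FID):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    covmean = np.asarray(covmean).real  # drop tiny imaginary numerical noise
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical distributions -> distance 0; a shifted mean -> the squared shift.
mu = np.zeros(2)
sigma = np.eye(2)
print(frechet_distance(mu, sigma, mu, sigma))                      # ~0.0
print(frechet_distance(mu, sigma, np.array([3.0, 0.0]), sigma))    # 9.0
```

Because it compares distributions rather than aligned frame pairs, FID rewards outputs that are realistic even when they differ pixel-by-pixel from the source, which is exactly the regime a generative codec operates in.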
The experiments run on standard video test sets (UVG, MCL‑JCV, and JVET Class‑B) and target the extremely low bitrate regime (<0.05 bpp). On perceptual measures the paper reports substantial gains: improvements of up to 0.214 in BD‑LPIPS and up to 91.14 in BD‑FID relative to HEVC (High Efficiency Video Coding). The authors also report that DiV‑INR outperforms VVC (Versatile Video Coding) and prior strong neural and INR‑only codecs on the same perceptual metrics.
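A BD (Bjøntegaard delta) number summarizes how much better one codec's quality is than another's, averaged over their overlapping bitrate range. The reference computation fits piecewise-cubic curves to the rate-quality points; the sketch below uses a simplified linear interpolation over log-bitrate, and the rate/quality points are made up for illustration, not taken from the paper:

```python
import numpy as np

def bd_metric(rates_a, vals_a, rates_b, vals_b, n=200):
    """Simplified Bjøntegaard delta: average quality gap of codec B over
    codec A across the overlapping (log-scale) bitrate range.

    The standard BD computation uses piecewise-cubic fits; plain linear
    interpolation on log10(bitrate) is used here for brevity.
    """
    la, lb = np.log10(rates_a), np.log10(rates_b)
    lo, hi = max(la.min(), lb.min()), min(la.max(), lb.max())
    xs = np.linspace(lo, hi, n)
    gap = np.interp(xs, lb, vals_b) - np.interp(xs, la, vals_a)
    return float(gap.mean())

# Illustrative points: codec B scores a constant 0.1 better at matched
# bitrates, so the BD value comes out as +0.1.
rates = np.array([100.0, 200.0, 400.0, 800.0])
quality_a = np.array([0.50, 0.60, 0.70, 0.80])
quality_b = quality_a + 0.1
print(bd_metric(rates, quality_a, rates, quality_b))  # 0.1
```

Read through this lens, a BD‑LPIPS gain of 0.214 means that, averaged over the tested bitrate range, DiV‑INR's LPIPS score is 0.214 better than HEVC's at matched bitrates.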