Genomic language models can memorize DNA — new framework measures privacy risk
Researchers tested how much genomic language models (GLMs) can memorize specific DNA sequences and how that creates privacy risks. They built a three-part evaluation that probes memorization in complementary ways: a perplexity check (which measures how surprised a model is by a sequence), canary sequence extraction (planting known “canary” DNA strings in training data and trying to recover them), and membership inference (testing whether an attacker can tell if a particular sequence was used to train the model). The three results are combined into a single worst-case memorization score.
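The three signals and their worst-case combination can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the uniform model, function names, and the loss-threshold form of the membership test are all assumptions for demonstration.

```python
import math

# Toy stand-in for a GLM: a function giving log P(next_base | context).
# A uniform model assigns log(1/4) to every base; a model that has
# memorized a sequence assigns it sharper (higher) probabilities.
def uniform_logp(context, base):
    return math.log(0.25)

def perplexity(seq, logp):
    """Per-base perplexity: exp of the mean negative log-likelihood.
    Low perplexity on a held-out sequence suggests memorization."""
    nll = -sum(logp(seq[:i], seq[i]) for i in range(len(seq))) / len(seq)
    return math.exp(nll)

def membership_inference(seq, logp, threshold):
    """Simple loss-threshold attack (one common form of membership
    inference): flag the sequence as a training member if the model's
    perplexity on it falls below a calibrated threshold."""
    return perplexity(seq, logp) < threshold

def worst_case_score(perplexity_risk, extraction_risk, membership_risk):
    """Combine the three per-test risk scores pessimistically: the
    overall memorization score is the worst (highest) of the three."""
    return max(perplexity_risk, extraction_risk, membership_risk)

canary = "ACGTACGTACGT"
print(round(perplexity(canary, uniform_logp), 3))  # 4.0 for a uniform model
```

Under the uniform model every DNA sequence has per-base perplexity exactly 4 (one of four equally likely bases), which makes it a convenient baseline: any model scoring a canary well below 4 is carrying sequence-specific information.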
To make the tests controlled and precise, the team planted canary sequences at different repetition rates into both synthetic and real genomic datasets. They ran experiments on four GLM architectures that represent common design choices in the field: masked language models, long-range convolutional models, state-space models, and a lightweight transformer baseline. They also compared full fine-tuning with a parameter-efficient method called LoRA (low-rank adaptation), which updates fewer model parameters during fine-tuning.
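Canary planting itself is straightforward to sketch. The helper below is hypothetical (the paper's exact insertion procedure is not described here); it just shows the core idea of injecting a known string at a controlled repetition rate so recovery can later be tested.

```python
import random

def plant_canaries(dataset, canary, repetitions, seed=0):
    """Insert `repetitions` exact copies of a known canary sequence
    into a list of training sequences, shuffled to random positions.
    Varying `repetitions` lets you measure how memorization scales
    with how often a sequence appears in training data."""
    rng = random.Random(seed)  # fixed seed keeps the experiment reproducible
    planted = list(dataset) + [canary] * repetitions
    rng.shuffle(planted)
    return planted

data = ["ACGT" * 5, "TTGACA" * 3, "GGCCGGCC"]
canary = "GATTACAGATTACA"
training_set = plant_canaries(data, canary, repetitions=4)
print(training_set.count(canary))  # 4
```

After training on `training_set`, the extraction test would prompt the model with a prefix of the canary and check whether it completes the rest, repeating this across repetition rates to chart the dose-response curve.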
The experiments used four genomic datasets of increasing biological complexity: synthetic zero-order sequences, prokaryotic and eukaryotic reference genomes, and a curated multi-species benchmark. Across these settings the authors found measurable memorization. How much a model memorizes depended on the architecture, how it was trained, and how often a sequence appeared in the training data. The paper reports that duplication of training sequences drives memorization: a scaling behavior previously observed in text models also appears in the genomic setting.
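"Zero-order" synthetic sequences are the simplest of those datasets: each base is drawn independently and uniformly, so there is no real biological structure for a model to learn, and any sequence-specific recall must come from memorization. A minimal generator (an illustrative sketch, not the authors' data pipeline) looks like this:

```python
import random

def zero_order_sequence(length, seed=None):
    """Generate zero-order synthetic DNA: every base is sampled
    i.i.d. and uniformly from {A, C, G, T}. With no higher-order
    dependencies, generalization cannot explain a model predicting
    such a sequence well; only memorization can."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

seq = zero_order_sequence(100, seed=42)
print(len(seq))  # 100
```

This makes the synthetic setting a clean control: memorization measured there is free of the confound that real genomes contain repeated motifs a model could legitimately generalize over.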