Genomic language models can memorize DNA — new framework measures privacy risk
Researchers tested how much genomic language models (GLMs) can memorize specific DNA sequences and how that creates privacy risks. They built a three-part evaluation that probes memorization in complementary ways: a perplexity check (which measures how surprised a model is by a sequence), canary sequence extraction (planting known “canary” DNA strings in training data and trying to recover them), and membership inference (testing whether an attacker can tell if a particular sequence was used to train the model). The three results are combined into a single worst-case memorization score.
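The three signals and their worst-case combination can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the uniform model, function names, and the loss-threshold form of the membership test are all assumptions for demonstration.

```python
import math

# Toy stand-in for a GLM: a function giving log P(next_base | context).
# A uniform model assigns log(1/4) to every base; a model that has
# memorized a sequence assigns it sharper (higher) probabilities.
def uniform_logp(context, base):
    return math.log(0.25)

def perplexity(seq, logp):
    """Per-base perplexity: exp of the mean negative log-likelihood.
    Low perplexity on a held-out sequence suggests memorization."""
    nll = -sum(logp(seq[:i], seq[i]) for i in range(len(seq))) / len(seq)
    return math.exp(nll)

def membership_inference(seq, logp, threshold):
    """Simple loss-threshold attack (one common form of membership
    inference): flag the sequence as a training member if the model's
    perplexity on it falls below a calibrated threshold."""
    return perplexity(seq, logp) < threshold

def worst_case_score(perplexity_risk, extraction_risk, membership_risk):
    """Combine the three per-test risk scores pessimistically: the
    overall memorization score is the worst (highest) of the three."""
    return max(perplexity_risk, extraction_risk, membership_risk)

canary = "ACGTACGTACGT"
print(round(perplexity(canary, uniform_logp), 3))  # 4.0 for a uniform model
```

Under the uniform model every DNA sequence has per-base perplexity exactly 4 (one of four equally likely bases), which makes it a convenient baseline: any model scoring a canary well below 4 is carrying sequence-specific information.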
To make the tests controlled and precise, the team planted canary sequences at different repetition rates into both synthetic and real genomic datasets. They ran experiments on four GLM architectures that represent common design choices in the field: masked language models, long-range convolutional models, state-space models, and a lightweight transformer baseline. They also compared full fine-tuning with a parameter-efficient method called LoRA (low-rank adaptation), which updates fewer model parameters during fine-tuning.
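Canary planting itself is straightforward to sketch. The helper below is hypothetical (the paper's exact insertion procedure is not described here); it just shows the core idea of injecting a known string at a controlled repetition rate so recovery can later be tested.

```python
import random

def plant_canaries(dataset, canary, repetitions, seed=0):
    """Insert `repetitions` exact copies of a known canary sequence
    into a list of training sequences, shuffled to random positions.
    Varying `repetitions` lets you measure how memorization scales
    with how often a sequence appears in training data."""
    rng = random.Random(seed)  # fixed seed keeps the experiment reproducible
    planted = list(dataset) + [canary] * repetitions
    rng.shuffle(planted)
    return planted

data = ["ACGT" * 5, "TTGACA" * 3, "GGCCGGCC"]
canary = "GATTACAGATTACA"
training_set = plant_canaries(data, canary, repetitions=4)
print(training_set.count(canary))  # 4
```

After training on `training_set`, the extraction test would prompt the model with a prefix of the canary and check whether it completes the rest, repeating this across repetition rates to chart the dose-response curve.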
The experiments used four genomic datasets of increasing biological complexity: synthetic zero-order sequences, prokaryotic and eukaryotic reference genomes, and a curated multi-species benchmark. Across these settings the authors found measurable memorization. How much a model memorizes depended on the architecture, how it was trained, and how often a sequence appeared in the training data. The paper reports that duplication of training sequences drives memorization: a scaling behavior previously observed in text models also appears in the genomic setting.
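"Zero-order" synthetic sequences are the simplest of those datasets: each base is drawn independently and uniformly, so there is no real biological structure for a model to learn, and any sequence-specific recall must come from memorization. A minimal generator (an illustrative sketch, not the authors' data pipeline) looks like this:

```python
import random

def zero_order_sequence(length, seed=None):
    """Generate zero-order synthetic DNA: every base is sampled
    i.i.d. and uniformly from {A, C, G, T}. With no higher-order
    dependencies, generalization cannot explain a model predicting
    such a sequence well; only memorization can."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

seq = zero_order_sequence(100, seed=42)
print(len(seq))  # 100
```

This makes the synthetic setting a clean control: memorization measured there is free of the confound that real genomes contain repeated motifs a model could legitimately generalize over.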