DNA embeddings can leak sequence data: study shows some model outputs can be inverted to recover DNA
Researchers tested whether dense numerical summaries of DNA sequences — called embeddings — can be turned back into the original DNA. Embeddings are vectors that foundation models produce to represent sequences in a compact form. These representations are often shared through Embeddings-as-a-Service (EaaS) so groups can run downstream analyses without sharing raw genomes. The study shows that sharing embeddings does not always protect sequence privacy.
The team evaluated three modern DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). They tested two common ways of sharing embeddings. The first keeps one vector for every token position the model outputs (per-token embeddings). The second averages the token vectors into a single fixed-length vector per sequence (mean-pooled embeddings). To test privacy, the authors trained a decoder that takes an intercepted embedding and attempts to reconstruct the original DNA. This is a training-based model inversion attack. Reconstruction quality was measured with nucleotide accuracy and Levenshtein similarity, a score based on the edit distance between two sequences.
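To make the two sharing modes and the two metrics concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, embeddings are plain lists of floats, and both scores are normalized by the longer sequence's length, a choice the study summary does not specify.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average per-token vectors (length x dim) into one vector of size dim."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def nucleotide_accuracy(original: str, reconstructed: str) -> float:
    """Fraction of positions with matching bases.

    Assumption: positions missing from the shorter string count as errors.
    """
    longest = max(len(original), len(reconstructed))
    if longest == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(original, reconstructed))
    return matches / longest


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Mean pooling collapses a sequence of any length into one vector of the same size, which is why it discards positional detail that per-token embeddings keep. The two metrics can also diverge: if a reconstruction of ACGT drops one base and returns AGT, position-by-position accuracy in this sketch falls to 0.25, while Levenshtein similarity stays at 0.75 because only one edit separates the two strings.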
The results were stark. Per-token embeddings allowed near-perfect sequence reconstruction across all three models. Mean-pooled embeddings were safer but not secure: reconstruction got worse as sequences grew longer, yet it stayed well above random guessing. Evo 2 and NTv2 were the most vulnerable. For shorter sequences, reconstructions from those two models reached similarities above 90%. DNABERT-2 was the most resilient; its Byte Pair Encoding (BPE) tokenization, which splits sequences into variable-length tokens, made inversion harder. The authors also found that how closely two embeddings match tends to predict how similar the underlying DNA sequences are. That correlation can help attackers judge when reconstruction is likely to work.
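One way to check a relationship like this on your own data is to compare embedding similarity with sequence similarity across pairs of sequences. The sketch below is hypothetical: the summary does not say which measures the authors used, so cosine similarity and Pearson correlation are assumptions here, and the function names are illustrative.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two embedding vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # undefined for zero vectors; treated as no similarity
    return dot / (norm_u * norm_v)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # correlation is undefined when one list is constant
    return cov / math.sqrt(var_x * var_y)
```

In use, you would compute `cosine_similarity` for each pair of embeddings, compute a sequence similarity score for the matching pair of DNA strings, and pass the two lists to `pearson`. A value near 1 would indicate that embedding closeness tracks sequence closeness in your dataset.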