Cryptic, widespread DNA damage is commonly interpreted as sequence variation.
A single-nucleotide change can lead to serious illness. Geneticists aim their most sophisticated detection tools at low-frequency genetic variants—those that may be present in a small number of cancerous cells—and at hard-to-detect variants in genomic regions with low sequencing coverage. But recent work led by Thomas Evans Jr. and Laurence Ettwiller at New England BioLabs reveals that many of the presumed successes in the mutation hunt are in fact decoys, a result of DNA damage.
Those who study the genetic remnants of woolly mammoths or archived clinical samples know that DNA damage can distort the output of DNA sequencers. Evans, Ettwiller and their team tested whether this extends to high-quality DNA. They took advantage of the fact that damage usually occurs on only a single DNA strand, which can be distinguished by the two strand-specific reads generated by paired-end sequencing.
For example, the oxidation of guanine to 7,8-dihydro-8-oxoguanine (8-oxo-dG), which results in a G-to-T switch after amplification, would change only one read, whereas a true sequence variant would have a T in one read and an A (its complement) at the same position in the second read.
The researchers computed a global imbalance value (GIV) score for each variant type, a measure of how unevenly that type of change is distributed between the two reads of a pair. The scores agreed well with experimentally induced damage or repair in test DNA. Nearly three-quarters of the data sets in The Cancer Genome Atlas (TCGA) that they analyzed had GIV scores indicating widespread 8-oxo-dG damage, which occurs during sequencing-library preparation. Over 40% of germline DNA sequenced in the 1000 Genomes Project also had high GIV scores, meaning that erroneous variant calls are pervasive in public repositories.
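The idea behind a read-imbalance score can be illustrated with a short sketch. This is a simplified illustration rather than the authors' published formula: the function name `strand_imbalance`, the input format and the handling of zero counts are all assumptions made for this example. It assumes each substitution has already been tallied per read in that read's own orientation, so that a true variant is expected to contribute about equally to read 1 and read 2, whereas damage inflates one side.

```python
from collections import Counter


def strand_imbalance(calls, substitution="G>T"):
    """Ratio of read-1 to read-2 counts for one substitution type.

    `calls` is an iterable of (read_number, substitution) tuples, where
    read_number is 1 or 2. Hypothetical input format, for illustration only.
    A ratio near 1 suggests balanced (likely genuine) variation; a ratio
    well above 1 suggests an excess in read 1, as expected from damage.
    """
    counts = Counter(read for read, sub in calls if sub == substitution)
    read1, read2 = counts[1], counts[2]
    if read2 == 0:
        # Assumption: report infinity when only read 1 shows the change,
        # and 1.0 (no evidence of imbalance) when neither read does.
        return float("inf") if read1 else 1.0
    return read1 / read2


# Balanced calls: the same number of G>T changes in both reads.
balanced = [(1, "G>T")] * 10 + [(2, "G>T")] * 10
# Skewed calls: three times as many G>T changes in read 1 as in read 2.
skewed = [(1, "G>T")] * 30 + [(2, "G>T")] * 10

print(strand_imbalance(balanced))  # 1.0
print(strand_imbalance(skewed))    # 3.0
```

Any cutoff for calling a sample damaged would have to come from the published method, not from this sketch.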
To better assess the effect of DNA damage, the researchers sequenced a targeted set of cancer genes. Using typical library preparation, they identified 195 low-to-moderate-level G-to-T and C-to-A variants—approximately one variant per gene—whereas treatment with a DNA-repair enzyme resulted in only 12 variants. The false positive rate of one in two somatic variants estimated for the majority of TCGA tumor samples is cause for concern. Computational filtering is not an ideal solution, as it still misses errors and reduces the sensitivity to detect true variants. The researchers write that improvements will need to come from better library-preparation techniques, such as the addition of DNA-repair enzymes.
Chen, L. et al. DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science 355, 752–756 (2017).