Accurate somatic mutation detection from single-cell DNA sequencing is challenging due to amplification-related artifacts. To reduce this artifact burden, an improved amplification technique, primary template-directed amplification (PTA), was recently introduced. We analyzed whole-genome sequencing data from 52 PTA-amplified single neurons using SCAN2, a new genotyper we developed to leverage mutation signatures and allele balance in identifying somatic single-nucleotide variants (SNVs) and small insertions and deletions (indels) in PTA data. Our analysis confirms an increase in nonclonal somatic mutation in single neurons with age, but revises the estimated rate of this accumulation to 16 SNVs per year. We also identify artifacts in other amplification methods. Most importantly, we show that somatic indels increase by at least three per year per neuron and are enriched in functional regions of the genome such as enhancers and promoters. Our data suggest that indels in gene-regulatory elements have a considerable effect on genome integrity in human neurons.
All MDA-amplified single neurons and matched bulks listed in Supplementary Table 2 were downloaded from dbGaP, accession no. phs001485.v1.p1. Only neurons from the PFCs of individuals for which additional PTA data were generated were used. Raw sequencing read data for PTA-amplified human neurons can be downloaded from dbGaP, accession no. phs001485.v3.p1. PTA-amplified mESC kindred cells and bulks can be downloaded from the National Center for Biotechnology Information’s Sequence Read Archive, accession no. PRJNA832209.
Poduri, A., Evrony, G. D., Cai, X. & Walsh, C. A. Somatic mutation, genomic variation, and neurological disease. Science 341, 43–51 (2013).
Lodato, M. et al. Somatic mutation in single human neurons tracks developmental and transcriptional history. Science 350, 94–98 (2015).
Martincorena, I. et al. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886 (2015).
Jaiswal, S. et al. Clonal hematopoiesis and risk of atherosclerotic cardiovascular disease. N. Engl. J. Med. 377, 111–121 (2017).
Blokzijl, F. et al. Tissue-specific mutation accumulation in human adult stem cells during life. Nature 538, 260–264 (2016).
Lodato, M. et al. Aging and neurodegeneration are associated with increased mutations in single human neurons. Science 359, 555–559 (2018).
Martincorena, I. et al. Somatic mutant clones colonize the human esophagus with age. Science 362, 911–917 (2018).
Lee-Six, H. et al. The landscape of somatic mutation in normal colorectal epithelial cells. Nature 574, 532–537 (2019).
Franco, I. et al. Somatic mutagenesis in satellite cells associates with human skeletal muscle aging. Nat. Commun. 9, 800 (2018).
Franco, I. et al. Whole genome DNA sequencing provides an atlas of somatic mutagenesis in healthy human cells and identifies a tumor-prone cell type. Genome Biol. 20, 285 (2019).
Woodworth, M. B., Girskis, K. M. & Walsh, C. A. Building a lineage from single cells: genetic techniques for cell lineage tracking. Nat. Rev. Genet. 18, 230–244 (2017).
Evrony, G., Lee, E., Park, P. J. & Walsh, C. A. Resolving rates of mutation in the brain using single-neuron genomics. eLife 5, e12966 (2016).
Zhang, C. Z. et al. Calibrating genomic and allelic coverage bias in single-cell sequencing. Nat. Commun. 6, 6822 (2015).
Luquette, L. J. et al. Identification of somatic mutations in single cell DNA-seq using a spatial model of allelic imbalance. Nat. Commun. 10, 3908 (2019).
Bohrson, C. et al. Linked-read analysis identifies mutations in single-cell DNA sequencing data. Nat. Genet. 51, 749–754 (2019).
Gonzalez-Pena, V. et al. Accurate genomic variant detection in single cells with primary template-directed amplification. Proc. Natl Acad. Sci. USA 118, e2024176118 (2021).
Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
Zafar, H., Wang, Y., Nakhleh, L., Navin, N. & Chen, K. Monovar: single-nucleotide variant detection in single cells. Nat. Methods 13, 505–507 (2016).
Singer, J., Kuipers, J., Jahn, K. & Beerenwinkel, N. Single-cell mutation identification via phylogenetic inference. Nat. Commun. 9, 5144 (2018).
Miller, M. B. et al. Somatic genomic changes in single Alzheimer’s disease neurons. Nature 604, 714–722 (2022).
McConnell, M. J. et al. Mosaic copy number variation in human neurons. Science 342, 632–637 (2013).
Chronister, W. D. et al. Neurons with complex karyotypes are rare in aged human neocortex. Cell Rep. 26, 825–835 (2019).
Dong, X. et al. Accurate identification of single-nucleotide variants in whole-genome-amplified single cells. Nat. Methods 14, 491–493 (2017).
Petljak, M. et al. Characterizing mutational signatures in human cancer cell lines reveals episodic APOBEC mutagenesis. Cell 176, 1282–1294 (2019).
Gymrek, M. PCR-free library preparation greatly reduces stutter noise at short tandem repeats. Preprint at bioRxiv https://doi.org/10.1101/043448 (2016).
Lasken, R. S. & Stockwell, T. B. Mechanism of chimera formation during the multiple displacement amplification reaction. BMC Biotechnol. https://doi.org/10.1186/1472-6750-7-19 (2007).
Yoshida, K. et al. Tobacco smoking and somatic mutations in human bronchial epithelium. Nature 578, 266–272 (2020).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92 (2012).
Reid, D. et al. Incorporation of a nucleoside analog maps genome repair sites in postmitotic human neurons. Science 372, 91–94 (2021).
Wu, W. et al. Neuronal enhancers are hotspots for DNA single-strand break repair. Nature 593, 440–444 (2021).
Madabhushi, R. et al. Activity-induced DNA breaks govern the expression of neuronal early-response genes. Cell 161, 1592–1605 (2015).
Roadmap Epigenomics Consortium, Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Nott, A. et al. Brain cell type-specific enhancer-promoter interactome maps and disease-risk association. Science 366, 1134–1139 (2019).
Hauberg, M. et al. Common schizophrenia risk variants are enriched in open chromatin regions of human glutamatergic neurons. Nat. Commun. 11, 5581 (2020).
Alt, F. W. & Schwer, B. DNA double-strand breaks as drivers of neural genomic change, function, and disease. DNA Repair 71, 158–163 (2018).
Xing, D. et al. Accurate SNV detection in single cells by transposon-based whole-genome amplification of complementary strands. Proc. Natl Acad. Sci. USA 118, e2013106118 (2021).
Abascal, F. et al. Somatic mutation landscapes at single-molecule resolution. Nature 593, 405–410 (2021).
Evrony, G. D. et al. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 151, 483–496 (2012).
Baslan, T. et al. Genome-wide copy number analysis of single cells. Nat. Protoc. 7, 1024–1041 (2012).
Garvin, T. et al. Interactive analysis and assessment of single-cell copy-number variations. Nat. Methods 12, 1058–1060 (2015).
Bergstrom, E. N. et al. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genom. 20, 685 (2019).
Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48 (2015).
Alexandrov, L. SigProfiler. MATLAB Central File Exchange https://www.mathworks.com/matlabcentral/fileexchange/38724-sigprofiler (2020).
Luquette, L. SCAN2_PTA_paper_2022. Zenodo https://doi.org/10.5281/zenodo.6532827 (2022).
We thank R. S. Hill, R. Mathieu and L. (Sahithi) Cheemalamarri at the Boston Children’s Hospital & Harvard Stem Cell Institute Flow Cytometry Research Facility, the Research Computing group at Harvard Medical School, and the Boston Children’s Hospital Intellectual and Developmental Disabilities Research Center Molecular Genetics Core for assistance. Human tissue was obtained from the NIH Neurobiobank at the University of Maryland, and we thank the donors and families for their invaluable contributions to the advancement of science. This work was supported by the Bioinformatics and Integrative Genomics training grant (no. T32HG002295 to L.J.L.), grant nos. K08 AG065502 and T32 HL007627 (to M.B.M.), the Brigham and Women’s Hospital Program for Interdisciplinary Neuroscience through a gift from Lawrence and Tiina Rand (to M.B.M.), the donors of the Alzheimer’s Disease Research program of the BrightFocus Foundation (no. A20201292F to M.B.M.), the Doris Duke Charitable Foundation Clinical Scientist Development Award (no. 2021183 to M.B.M.), PRMRP Discovery Award (no. W81XWH2010028 to Z.Z.), the Edward R. and Anne G. Lefler Center postdoctoral fellowship (to Z.Z.), and grant nos. R00 AG054748 (to M.A.L.), R01 AG070921 (to C.A.W.) and R01NS032457 and U01MH106883 (to P.J.P. and C.A.W.), and the Allen Discovery Center program, a Paul G. Allen Frontiers Group advised program of the Paul G. Allen Family Foundation (to C.A.W.). C.A.W. is an investigator at the Howard Hughes Medical Institute. The funders had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.
The authors declare the following competing interests: C.G. is Director and cofounder and J.W. is CEO and cofounder of Bioskryb, Inc., the manufacturer of PTA kits used in the present study. C.A.W. is a consultant for Maze Therapeutics (cash, equity), Third Rock Ventures (cash) and Flagship Pioneering (cash), none of which have any relevance to the present study. The remaining authors declare no competing interests.
Peer review information
Nature Genetics thanks Ruben van Boxtel and Federico Abascal for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
a. Genome-wide allele balance (binned in 100 kb windows) for 3 typical PTA cells from the same individual. b. Allele balance for cells in (a) plotted against each other. c-d. Allele balance averaged across the cohort of 52 PTA cells (c) or 75 MDA cells (d); that is, each point represents the average allele balance for a single 100 kb window. A small number of regions show consistent allelic imbalance across many amplifications (arrows). e. Correlation of allele balance profiles between all pairs of PTA cells. Correlation is generally low; cells from the same individual show slightly higher correlations; and a single individual (4638) shows an atypically strong correlation.
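The pairwise comparison in panel e can be illustrated with a correlation between per-window allele balance profiles. The following is a minimal sketch, assuming each cell is represented by a list of allele balance values, one per 100 kb window, over the same windows. The function names, the toy values and the choice of Pearson correlation are assumptions for illustration and are not taken from the SCAN2 code.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(profiles):
    """Correlate every pair of cells; keys are (cell_a, cell_b) tuples."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Hypothetical allele balance values for five windows in three cells.
profiles = {
    "cell_A": [0.50, 0.42, 0.61, 0.48, 0.55],
    "cell_B": [0.47, 0.58, 0.52, 0.39, 0.51],
    "cell_C": [0.50, 0.42, 0.61, 0.48, 0.55],  # identical to cell_A
}
corr = pairwise_correlations(profiles)
```

In this toy input, `cell_A` and `cell_C` share a profile and therefore correlate perfectly, whereas the other pairs do not.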
sSNVs were simulated using the synthetic diploid (SD) X chromosome approach (Methods). Sensitivity is the fraction of known spike-ins recovered and false positives (FPs) are defined as calls that are neither known spike-ins nor somatic mutations endogenous to the haploid X chromosomes used to create each SD. Each point in a-d represents a single SD simulation with 10-250 spike-ins. a-b. Comparison of SCAN2 and SCAN-SNV sensitivity (a; lines are R loess() fits) and false discovery rates (b; lines are linear regression fits to FDR ~ 1/mutations per Mb). c-d. Comparison to other single cell SNV genotypers. c. Sensitivity vs. false positives per megabase of analyzed sequence. d. False discovery rate vs. the number of spike-ins per megabase. Lines are parameterized by mean sensitivity S and false positive rate per megabase F measured across all points: FDR = F / (F + xS). SCcaller standard uses a calling threshold of α = 0.05 while stringent calling uses α = 0.01. e-f. Performance of SCAN2 mutation signature-based rescue as a function of the number of sSNVs available for learning the true mutation signature. Sensitivity (e) and false discovery rate (f) are shown relative to the sensitivity or false discovery rate of the same SD simulation using the maximum sSNV catalog of 4,666 sSNVs. ε = 0.0001 was added to all quantities to avoid division by zero. Solid lines are fitted by R’s loess() function. g. Effect of mutation signature of spike-ins on SCAN2 sensitivity. Each point is the average sensitivity of 9 SD simulations with 1000 spike-ins from a single COSMIC SBS signature. Mutation signatures are characterized by their similarity to the PTA SNV artifact signature. Solid line: linear regression on all points except PTAerr. SBS30 (h) is the most similar COSMIC signature to the PTA SNV artifact signature (PTAerr) (i).
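The curve in panel d, FDR = F / (F + xS), can be evaluated directly. The sketch below assumes x is the number of spike-ins per megabase, S the mean sensitivity and F the false positive rate per megabase, as stated in the caption. The function name and the numeric values are hypothetical and are not measurements from the study.

```python
def expected_fdr(x, sensitivity, fp_per_mb):
    """Expected false discovery rate, FDR = F / (F + x * S).

    x           -- true mutations (spike-ins) per megabase
    sensitivity -- mean sensitivity S, between 0 and 1
    fp_per_mb   -- false positive calls per megabase, F
    """
    if x < 0 or fp_per_mb < 0 or not 0 <= sensitivity <= 1:
        raise ValueError("inputs out of range")
    true_calls = x * sensitivity  # expected true calls per megabase
    total_calls = fp_per_mb + true_calls
    if total_calls == 0:
        raise ValueError("FDR is undefined when no calls are expected")
    return fp_per_mb / total_calls


# Hypothetical values: S = 0.5 and F = 0.1 false positives per Mb.
fdr_low_burden = expected_fdr(0.1, 0.5, 0.1)   # few true mutations per Mb
fdr_high_burden = expected_fdr(1.0, 0.5, 0.1)  # ten times more true mutations
```

With F and S held fixed, the FDR falls as the true mutation density x rises, which is why a low-burden sample is harder to genotype cleanly than a high-burden one.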
a-b. SBS spectra of somatic SNVs called in 4 single cells from the untreated clone. C > A mutations (blue peaks) are characteristic of COSMIC SBS18 and the mutation signature of SNVs acquired during clonal expansion (ref. 5). These peaks persist in the clonally unsupported SNVs (b), suggesting that the method for classifying true positives is overly conservative. c. Spectra for SNVs called in the 4 single cells taken from an aristolochic acid (AAI)-treated clone.
a-c. SCAN2 and other callers were applied to simulated indels using the synthetic diploid (SD) X chromosome spike-in approach (Methods). SDs received 10, 25 or 50 indel spike-ins each, which correspond, respectively, to genome-wide burdens of approximately 170 (intermediate), 430 (high) and 850 (very high) somatic indels. Performance was measured by the average number of indels called per SD (a), the fraction of false positives per indel call set (b) and the fraction of spike-ins recovered (c). Tested methods were SCAN2 (with and without signature-based rescue), GATK HaplotypeCaller, GATK HaplotypeCaller with filtration by SCAN2’s cross-sample recurrent artifact filter and an adaptation of SCAN-SNV’s somatic SNV discovery approach to indels. Boxplot whiskers, the furthest outlier ≤1.5 times the interquartile range from the box; box, 25th and 75th percentiles; centre bar, median; n = 9 SDs per boxplot. d. Distribution of indel lengths among all simulated indels (black) and VAF-based SCAN2 indel calls (red). e. Spike-in indel sensitivity by length for VAF-based SCAN2 calls. f. Sensitivity for VAF-based SCAN2 indel calling stratified by the 83-dimensional indel classification scheme used by COSMIC indel signatures (ID83). Dotted outlines: sensitivity before applying cross-subject filtration. g. ID83-stratified indel sensitivity for SCAN2 calls with signature-based rescue.
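The boxplot convention described above (box at the 25th and 75th percentiles, whiskers reaching the furthest point within 1.5 times the interquartile range of the box) can be sketched as follows. This is an illustration only: the function name and the nine toy values are hypothetical, and the quartile rule used here (`method="inclusive"`) is an assumption that may differ slightly from the plotting software used for the figure.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Box, median and whisker positions under the 1.5 x IQR convention."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points that lie inside the fences.
    lower_whisker = min(v for v in data if v >= low_fence)
    upper_whisker = max(v for v in data if v <= high_fence)
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }


# Nine hypothetical per-SD values, mirroring n = 9 SDs per boxplot.
summary = boxplot_summary([2, 3, 3, 4, 4, 5, 5, 6, 20])
```

In this toy input the value 20 lies beyond the upper fence, so it is drawn as an outlier and the upper whisker stops at 6.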
Single human neurons were analyzed by LiRA (ref. 15), a specific but lower sensitivity approach for calling somatic SNVs. a-b. SCAN2 and LiRA extrapolations for the total (not called) sSNV burden per diploid Gb of human sequence from MDA- (a) and PTA-amplified (b) single neurons. Solid lines: y=x. c. Linear regression estimates for the number of sSNVs accumulated per neuron per year from several sources and analyses. Horizontal bars represent 95% C.I.s produced by confint applied to an lmer fit by the lme4 R package; centre points from fixef applied to the same fits. (1) LiRA rates taken from ref. 6, which used a larger set of n=91 MDA-amplified PFC neurons; (2) LiRA rates taken from ref. 6 using n=73 of the 75 MDA-amplified PFC neurons from subjects analyzed in this study (the two excluded neurons are 5087pfc-Rp3C5, an extreme outlier, and 4638-MDA-14); (3) rerun of LiRA on n=74 MDA-amplified neurons in (2) using the same input provided to SCAN2; (4) SCAN2 on n=74 MDA-amplified neurons; (5) LiRA on n=34 PTA-amplified neurons from donors also analyzed in ref. 6 (N.B. LiRA’s higher rate estimate in (c) occurs despite lower burden estimations in (b) due to differences in model intercepts: SCAN2 intercept=95.83, LiRA intercept=17.63); (6) SCAN2 on all n=52 PTA-amplified neurons generated here. d. LiRA classification of SCAN2 calls where reads linked to nearby germline heterozygous SNPs are available (black: likely true sSNVs, red: possible false positives). PASS is the highest quality LiRA class. UNCERTAIN_CALL and LOW_POWER indicate a lack of linking reads to make a confident call, but no evidence of artifactual status is detected. All other classes (red) are interpreted as false positives. Percentages show the fraction of all false positive classes among SCAN2 calls. e-f. Raw mutation spectra for SCAN2 calls made without (e) and with (f) mutation signature-based calling, stratified by LiRA classification. The similarities between PASS and the two lower quality UNCERTAIN_CALL and LOW_POWER classes suggest that the majority of UNCERTAIN_CALL and LOW_POWER SCAN2 calls are true mutations. Confident false positives (FILTERED_FPs) possess a C > T dominated signature with a lack of C > Ts at CpGs.
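The percentages in panel d can be illustrated by a simple tally. The sketch below assumes one LiRA class label per SCAN2 call and treats PASS, UNCERTAIN_CALL and LOW_POWER as likely true, with every other class counted as a false positive, as the caption describes. The label strings are copied from the caption and may not match LiRA's output files exactly; the function name and the toy counts are hypothetical.

```python
from collections import Counter

# Classes the caption interprets as likely true sSNVs (assumed label spelling).
LIKELY_TRUE = {"PASS", "UNCERTAIN_CALL", "LOW_POWER"}


def false_positive_fraction(classes):
    """Fraction of calls whose LiRA class is not in LIKELY_TRUE."""
    counts = Counter(classes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classified calls")
    false_positives = sum(
        n for label, n in counts.items() if label not in LIKELY_TRUE
    )
    return false_positives / total


# Hypothetical classification of ten SCAN2 calls.
toy_classes = (
    ["PASS"] * 6 + ["UNCERTAIN_CALL"] * 2 + ["LOW_POWER"] + ["FILTERED_FP"]
)
fp_fraction = false_positive_fraction(toy_classes)
```

Only calls with linked reads near a germline heterozygous SNP receive a class at all, so this fraction describes the classifiable subset rather than every call.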
a. Spectrum of 1541 indels from PTA neurons from this study, same as Fig. 4e. b-d. Somatic indel spectra from other studies: clonally expanded single skeletal muscle stem cells (b), clonally expanded single kidney (excluding hypermutated kidney cells, designated KT2 in the original study), epidermis and fat cells (c) and clonally expanded bronchial epithelial cells from children and never-smokers (d). e. COSMIC signatures with clock-like or age-associated annotations. f. Non-aging COSMIC signatures with >5% contribution to single neurons. g. Per-neuron COSMIC signature fits, corrected for ID83 sensitivity (Methods). Correlation (ρ) between age and exposure and P-value of two-sided t-test for correlation=0 (p) are shown for each COSMIC signature. P-values were not adjusted for multiple comparisons. Colors correspond to subject IDs as shown in Fig. 4. Note that y-axes are not the same scale.
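For reference, a two-sided t-test of correlation = 0 is conventionally based on the statistic below. This assumes ρ is a Pearson correlation computed over n neurons, which the caption does not state explicitly, so it should be read as the textbook form rather than a confirmed description of the analysis.

```latex
t = \rho \sqrt{\frac{n - 2}{1 - \rho^{2}}},
\qquad t \sim t_{n-2} \quad \text{under } H_0 : \rho = 0
```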
a. Absolute sensitivity for spatial measurements that divide the genome into roughly equally sized deciles (median GTEx expression for a single tissue type, brain BA9 prefrontal cortex, and phyloP 100way conservation). b-c. Relative sensitivities: sensitivity inside the tested region divided by sensitivity in the complement of that region. Enhancers and promoters from Nott et al. 2019, ATAC-seq from Hauberg et al. 2020, DNA repair hotspots from Wu et al. 2021 and Reid et al. 2021, H3K27ac peaks from Roadmap Epigenomics. Each point represents one PTA neuron; crosses represent the 7 PTA neurons sequenced to 60x, circles represent 30x depth samples. Boxplot whiskers, the furthest outlier ≤1.5 times the interquartile range from the box; box, 25th and 75th percentiles; centre bar, median.
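The relative sensitivity in panels b-c is a ratio of two fractions. A minimal sketch, assuming sensitivity is measured as recovered sites over testable sites inside the region and in the rest of the genome; the function name, argument names and counts are hypothetical.

```python
def relative_sensitivity(recovered_in, total_in, recovered_out, total_out):
    """Sensitivity inside a region divided by sensitivity outside it."""
    if total_in <= 0 or total_out <= 0:
        raise ValueError("both regions need at least one testable site")
    sens_in = recovered_in / total_in
    sens_out = recovered_out / total_out
    if sens_out == 0:
        raise ValueError("relative sensitivity is undefined when outside sensitivity is 0")
    return sens_in / sens_out


# Hypothetical neuron: 40 of 100 sites recovered inside enhancers,
# 50 of 100 recovered in the rest of the genome.
rel = relative_sensitivity(40, 100, 50, 100)
```

A value below 1 means calls are harder to make inside the region, so a raw mutation count there would understate the true density unless it is corrected.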
Enrichment analysis of ChromHMM states from 127 tissues from the Roadmap Epigenomics Project. Active regions include 1_Tss, 4_Tx, 5_TxWk, 6_EnhG and 7_Enh; inactive states include 9_Het and 14_ReprPCWk. Red points, brain tissue regardless of significance level; black points, non-brain tissue; grey points, enrichment not significant at the P < 0.1 level. No correction for multiple hypothesis testing was applied.
Extended Data Fig. 9 Patterns of mutation enrichment persist at increasing sequencing depth thresholds.
Analyses presented in Fig. 5 rerun using mutations supported by at least 10, 15, 20, 25 and 30 reads; permutations used for enrichment analysis are also restricted to the subset of the genome with the corresponding sequencing depth. GABA, GABAergic neurons; GLU, glutamatergic neurons; OLIG, oligodendrocytes; MGAS, microglia and astrocytes. Error bars: 95% bootstrapping confidence intervals. For panels a-d, each plot presents an analysis at one depth cutoff; for panels e-i, each plot contains the full range of depth cutoffs, as indicated on the x-axis. Error bars in d-i represent bootstrap 95% C.I.s using n=10,000 bootstrap samples; centre points are the observed mutation count divided by the mean mutation count of the bootstrap samples.
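A percentile bootstrap of the kind that produces such error bars can be sketched as below. This is a generic illustration, not the procedure from the paper's Methods: it assumes one 0/1 flag per mutation (inside the tested region or not) and a fixed `expected_count`, a hypothetical stand-in for the denominator that the caption defines as the mean mutation count of the bootstrap samples. All names and values are hypothetical.

```python
import random


def bootstrap_ratio_ci(in_region, expected_count, n_boot=10_000, seed=0):
    """Observed/expected ratio with a 95% percentile bootstrap interval.

    in_region      -- list of 0/1 flags, one per mutation
    expected_count -- assumed expected number of mutations in the region
    """
    if expected_count <= 0 or not in_region:
        raise ValueError("need mutations and a positive expected count")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(in_region)
    ratios = []
    for _ in range(n_boot):
        resample = rng.choices(in_region, k=n)  # sample with replacement
        ratios.append(sum(resample) / expected_count)
    ratios.sort()
    lower = ratios[int(0.025 * n_boot)]
    upper = ratios[int(0.975 * n_boot) - 1]
    centre = sum(in_region) / expected_count
    return centre, lower, upper


# Hypothetical input: 200 mutations, 30 inside the region, 20 expected.
flags = [1] * 30 + [0] * 170
centre, lower, upper = bootstrap_ratio_ci(flags, expected_count=20, n_boot=2000)
```

An interval that excludes 1 would indicate enrichment or depletion under these assumptions; the caption's analyses additionally restrict both mutations and the null to regions meeting each depth cutoff.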
About this article
Cite this article
Luquette, L.J., Miller, M.B., Zhou, Z. et al. Single-cell genome sequencing of human neurons identifies somatic point mutation and indel enrichment in regulatory elements. Nat Genet 54, 1564–1571 (2022). https://doi.org/10.1038/s41588-022-01180-2