Machine learning reveals bilateral distribution of somatic L1 insertions in human neurons and glia

Abstract

Retrotransposons can cause somatic genome variation in the human nervous system, which is hypothesized to have relevance to brain development and neuropsychiatric disease. However, the detection of individual somatic mobile element insertions presents a difficult signal-to-noise problem. Using a machine-learning method (RetroSom) and deep whole-genome sequencing, we analyzed L1 and Alu retrotransposition in sorted neurons and glia from human brains. We characterized two brain-specific L1 insertions in neurons and glia from a donor with schizophrenia. The L1 insertions were anatomically distributed in neurons and glia across both hemispheres, indicating that retrotransposition occurred during early embryogenesis. Both insertions were within the introns of genes (CNNM2 and FRMD4A) inside genomic loci associated with neuropsychiatric disorders. Proof-of-principle experiments revealed that these L1 insertions significantly reduced gene expression. These results demonstrate that RetroSom has broad applications for studies of brain development and may provide insight into the possible pathological effects of somatic retrotransposition.

Fig. 1: Project overview and machine-learning method.
Fig. 2: Benchmarking in independent test datasets.
Fig. 3: Discovery and experimental validation of somatic L1-1 and L1-2.
Fig. 4: L1-1 and L1-2 have wide anatomical distribution in glia, as well as in neurons.
Fig. 5: Somatic L1 insertions occur in genomic regions of high functional potential.
Fig. 6: Intronic L1 insertions suppress EGFP reporter activities.

Data availability

WGS data from the six donors (Fig. 1a,b) have been deposited in the Sequence Read Archive under BioProject ID: PRJNA541510.

Microscope image collection for the reporter assay is available from Figshare under collection 5182676 (https://doi.org/10.6084/m9.figshare.c.5182676.v1). The source data for the genome-mixing experiment (Fig. 2c) are deposited in the NIMH Data Archive (https://nda.nih.gov/) under collection 2,458, experiment 1,072. The data are not publicly available because they contain information that could compromise research participant consent, but will be available from the corresponding author upon reasonable request. Source data are provided with this paper.

Code availability

The supplementary software file contains the following scripts:

R scripts for plotting the main figures (Figs. 1–6).

R scripts for the machine-learning modeling of L1 and Alu supporting reads (RFI-IV).

Perl/shell scripts for the visualization of MEI supporting reads (RetroVis).

An actively maintained RetroSom pipeline is available at https://github.com/XiaoweiZhuJJ/RetroSom.

Acknowledgements

We thank W. H. Wong, J. Chao, A. Z. Wang and N. Bosch for constructive comments on the manuscript. We thank J. E. Kleinman, T. H. Hyde and D.W. from Lieber Institute for Brain Development for providing the BSMN common brain tissue and L. Fasching from Yale University for extracting the BSMN common brain DNA. This work utilized computing resources provided by the Stanford Genetics Bioinformatics Service Center. Funding: this work was supported by Eureka Grant R01MH094740 from the NIMH and the Stanford Schizophrenia Genetics Research Fund. The mixing-genome DNA sequencing and BSMN common brain sequencing data were generated as part of the BSMN Consortium and supported by: U01MH106874, U01MH106876, U01MG106882, U01MH106883, U01MH106883, U01MH106884, U01MH106891, U01MH106891, U01MH106891, U01MH106892, U01MH106893, and U01MH108898 awarded to N.S., F.M.V., F.G., C.W., P.P., J.P., A.C., J.V.M., D.W. and J.G. B.Z. is funded by the National Heart, Lung, and Blood Institute grant T32 HL110952. A.E.U. was a Tashia and John Morgridge Faculty Fellow of the Stanford Child Health Research Institute. The Urban laboratory receives funding through the Jaswa Innovator Award and from B. Blackie and W. Mclvor. We acknowledge helpful discussions with B. Blackie and W. Mclvor. Flow cytometry sorting was performed on an instrument in the Stanford shared fluorescence-activated cell sorting facility obtained under an NIH S10 Shared Instrument Grant (S10RR025518-01).

Author information

Contributions

X.Z. coordinated the project, wrote the manuscript, and designed the model and the computational framework, with initial advice from A.F. and D.P. X.Z., B.Z. and R.P. designed and carried out the MEI validation experiments. K.G., C.T., C.A.T., S.S., B.A.B. and H.V. provided the tissue samples. J.M., A.A. and F.M.V. provided the clone sequencing data. X.Z. and B.Z. generated the genome-mixing data. A.K. performed the transfection in the reporter assays and X.Z. quantified the data. L.D. advised on the polygenic risk score analysis. J.V.M. contributed to the interpretation of the somatic L1 sequences. A.E.U. conceived the original idea. D.F.L. and A.E.U. supervised the project. All authors provided critical feedback and helped shape the research, analysis and manuscript.

Corresponding author

Correspondence to Alexander E. Urban.

Ethics declarations

Competing interests

J.V.M. is an inventor on patent US6150160, is a paid consultant for Gilead Sciences, serves on the scientific advisory board of Tessera Therapeutics (also receives consultant fees and has equity options), and currently serves on the American Society of Human Genetics Board of Directors. C.A.T. is or has been a deputy editor for the American Psychiatric Association; an ad hoc consultant for Astellas, Merck and Lundbeck; a council member for the Brain & Behavior Research Foundation, the National Academy of Medicine and the National Alliance on Mental Illness; and a reviewer for the NIMH. She is an advisor for Karuna Therapeutics and owns its stock.

Additional information

Peer review information Nature Neuroscience thanks Geoffrey Faulkner, Alysson Muotri and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Classification of supporting reads from putative mobile element insertions.

a, We simulated the relationship between the detectable mosaicism of somatic MEIs and the number of supporting reads in bulk sequencing by considering the range of coordinates for the putative supporting reads for either the upstream or downstream junction (see Fig. 1d). Blue, segment of supporting read that maps to flanking sequence; red, segment of read that maps to ME consensus; gray, the insert segment between the two paired-end reads. b, A detailed flowchart describing the framework behind RetroSom. We labeled putative supporting reads as true or false insertions based on the inheritance pattern and built a set of random forest models to classify them based on various sequencing features (see Supplementary Table 3). c, The distribution of true L1 (left) and Alu (right) insertions among 11 offspring is similar to a theoretical binomial distribution (red line). The peaks around N = 11 represent additional MEIs that are homozygous in one of the parents and transmitted to all 11 offspring. d, To avoid missing values, we categorized L1 PE supporting reads into 8 subgroups depending on their mapping locations on the L1Hs (L1 human specific) consensus sequence. e, The performance of random forest classification in all 8 L1 PE read sub-models, ranked based on their average F1 score (harmonic mean of sensitivity and precision) from 11× cross-validation (n = 11 tests). f and g, Model selection and evaluation with 11× cross-validation: f, precision-recall curve; g, area under the precision-recall curve (AUPR, n = 11 tests). The boundaries of the boxplots indicate the 25th percentile (above) and the 75th percentile (below), the black line within the box marks the median. Whiskers above and below the box indicate the 10th and 90th percentiles.
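Two quantities named in this caption, the theoretical binomial distribution in panel c and the F1 score in panel e, can be illustrated with a minimal Python sketch. It is not taken from the supplementary R scripts. The transmission probability of 0.5 is an assumption here (an insertion heterozygous in one parent); the caption states only that the red line is a theoretical binomial distribution. The function names are hypothetical.

```python
from math import comb

N_OFFSPRING = 11  # number of offspring, as stated in the caption


def binomial_pmf(n, p):
    """Probability that exactly k of n offspring inherit an insertion, for k = 0..n.

    p is the per-offspring transmission probability. p = 0.5 is an assumption
    (insertion heterozygous in one parent), not a value given in the caption.
    """
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def f1_score(sensitivity, precision):
    """Harmonic mean of sensitivity and precision (0.0 when both are zero)."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)


# Expected share of heterozygous insertions seen in k of the 11 offspring.
expected_distribution = binomial_pmf(N_OFFSPRING, 0.5)
```

Under this assumption, insertions transmitted to all 11 offspring should be rare, which is consistent with the caption attributing the peak at N = 11 to insertions that are homozygous in one parent.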

Extended Data Fig. 2 Benchmarking Alu insertions in independent test datasets.

a, Performance in detecting germline Alu insertions from sequencing data of clonally expanded fetal brain cells. Gray, clones from donor “316” sequenced with whole genome amplification (316WGA, n=10 clones); brown, the rest of the “316” datasets (316 noWGA, n=5 clones); blue, clones from donor “320” (n=52 clones). The boundaries of the boxplots indicate the 25th percentile (above) and the 75th percentile (below), the black line within the box marks the median. Whiskers above and below the box indicate the 10th and 90th percentiles. b, Performance in detecting germline Alu insertions from sequencing libraries prepared with or without PCR. Light blue/green, PCR-free libraries for sample “Heart” (light blue circle, n=1) and “Neuron” (light green triangle, n=1); Dark blue/green, PCR-based libraries for “Heart” (dark blue circle, n=6) and “Neuron” (dark green triangle, n=6). c-e, Performance in detecting somatic MEIs simulated by mixing six genomic DNA samples, at proportions of 0.04% to 25%, with that of NA12878, at various sequencing depths (gray, 50×; brown, 100×; blue, 200×; green, 400×).

Extended Data Fig. 3 Discovery and experimental validation of insertion L1-3.

a, We identified a somatic L1 insertion (L1-3, red arrow) in one clone, “BG clone16,” with 17 supporting reads. b, L1-3 is inserted into an intron of gene EVC2. Blue, segment of supporting read that maps to the flanking sequence; red, segment of read that maps to ME consensus. c, PCR (n=1 replicate) surrounding L1-3 produced a unique band in BG clone16, as well as a lower band in all tested samples, representing the product from the DNA without the insertion. d, DdPCR (n=2 replicates) detects the upstream junction in 22.54% of the cells in BG clone16. e, DdPCR (n=2 replicates) detects the downstream junction in 24.16% of the cells in BG clone16. f and g, L1-3 is absent in 6 bulk tissues (n=4 replicates): BG ventricular zone/subventricular zone (BG VZ/SVZ), BG cortex (BG CX), FR VZ/SVZ, FR CX, occipital cortex, and spleen. The error bars represent the 95% confidence intervals of the mosaicism level in BG clone16. h, The full sequence of L1-3: black, flanking sequence; red, inserted L1 sequence; purple, target site duplication; brown, mismatches to the L1Hs consensus. i, Sequencing depth and reads around L1-3 junction in BG clone16. Mismatch bases are indicated by color: green, A; blue, C; brown, G; red, T.

Extended Data Fig. 4 Postprocessing of putative somatic MEIs.

a, Procedure for manual curation of putative somatic MEIs. To further remove false positive MEIs, especially for Alu insertions, we implemented manual inspections for each putative insertion. We first check the neighboring regions in both the UCSC and IGV browsers and remove calls that are from regions of potential mapping errors or CNVs. We also remove calls that are found in datasets of other donors. We then apply a novel visualization tool, RetroVis, to quickly screen out calls with questionable supporting read positions. We further inspect the read sequences to check for unwarranted transduction and similarity between different supporting reads. Finally, we design nested PCR and ddPCR to validate the insertions and quantify their respective levels of mosaicism using DNA from the same tissue. In a RetroVis plot, black lines represent human genome location (top) and the inferred segment of the inserted mobile element (for example, L1) (bottom). A paired-end supporting read is represented by a blue arrow and a red (+ strand insertion) or purple (-strand insertion) arrow connected by a dashed line. A split-read supporting read (spanning an insertion junction) is plotted as a blue arrow (reference segment) connected to an empty rectangle (mobile element segment), with a red or purple arrow below. The positions of the blue segments and red/purple segments reflect the insertion coordinates in the human reference genome and mobile element consensus. b-j, Examples of likely false positive insertions examined by manual curation. Blue, flanking sequence; red, mobile element sequence (+ strand insertion). b, Merging different MEIs into one. c, PCR duplicates. d, All ME ends are mapped to identical coordinates at the 3’ end of the L1Hs sequence. e, All anchor ends are mapped to identical coordinates in flanking sequences. f, Lacking target site duplication. g, A truncated 3’ end indicates a false insertion or an endonuclease-independent retrotransposition. 
h, Two supporting reads mapping to the same ME location but having a low sequence similarity. i, When the split-read supporting read is mapped partially to the ME consensus (red, locus 2) and fully to another reference genome element (green and red, locus 1), the additional sequence (green) is transduced to the new location. Transduction in Alu insertions, or 5’ transduction in 5’-truncated L1 insertions, indicates a false insertion. j, The supporting reads suggest that the ME is inserted in the + strand, yet the 3’ end is closer to the upstream flank and the 5’ end is closer to the downstream flank. This conflict indicates a false insertion or a 5’ inversion in L1 retrotransposition.

Extended Data Fig. 5 Summary of the validation experiments.

a, We used droplet digital PCR (ddPCR) to confirm the presence of detected somatic L1s in the DNA from combined cells and to measure the tissue allele frequency, and nested PCR to sequence the junctions (the 1st nested PCR is the reaction containing both ends of the insertion; the 2nd nested PCR then uses the product of the 1st as template and targets the upstream or downstream junction). b, We applied nested PCR to amplify the 5’ and 3’ junctions for L1-1 and L1-2 with overlapping primers, and then used overlap extension PCR (OE-PCR) to obtain the full sequence of L1-1 and L1-2. Control DNA was amplified from DNA without the L1 insertion (NA12878) using primer iii and primer vi. The amplified DNA (L1 or control) was cloned into a constitutively spliced intron in an enhanced green fluorescence protein (EGFP) reporter, pGint. c, An example of how biased PCR amplification favoring the pre-integration (insertion-) site blocks the amplification of the post-integration (insertion+) site even at relatively high tissue allele frequencies. We titrated the L1-1 template from GL1-1 plasmid in NA12878 genomic DNA at allele frequencies of 92.4%, 64.6%, 20.7%, 3.59% and 0.53%, and then tested PCR amplification with external primers using PhusionTaq or DreamTaq polymerases, and 30 or 60 PCR cycles (n=1 replicate for each PCR cycle). d, We designed a droplet-based full length PCR to reduce bias and amplify the post-integration site. We prepared 8 droplet PCR reactions from the genomic DNA of brain or controls: 7 reactions were combined for gel electrophoresis and the last reaction was tested for the probe fluorescence (for example, again ddPCR). NA12878 genomic DNA was used as a negative control and the known L1-1 or L1-2 templates were tested as positive controls. e, The placement of primers (P1+P2) and probe used in the droplet-based full length PCR for L1-1 and L1-2. Primers P3+P2 and P3+P4 were used in a second PCR to re-amplify the full length insertion of L1-1 and L1-2, respectively.

Extended Data Fig. 6 Experimental validation of L1-1.

a, We used droplet digital PCR (ddPCR) to measure the frequency, nested PCR to sequence the junctions, cloning with overlap extension PCR (OE-PCR) to obtain the full length insertion sequence, and droplet-based full length PCR followed by gel electrophoresis or fluorescence read-out to amplify the post-integration site (see Extended Data Fig. 5d). TSD, target site duplication; up, upstream junction; dn, downstream junction. b, DdPCR detected a clear signal for L1-1 in the genomic DNA from right hemisphere superior temporal gyrus, in both neurons (n=8 replicates) and glia (n=8 replicates), but not in the fibroblast (n=8 replicates). Green, droplets containing only RPP30 (internal control); Blue, droplets containing only the L1 junction template; Orange, droplets containing both L1 and RPP30 templates; Black, droplets containing neither L1 nor RPP30 templates. We used NA12878 DNA as a negative control and synthesized DNA with the target L1 junction as a positive control. c, The full sequence of L1-1 based on OE-PCR. Black, flanking sequence; red, inserted L1 sequence; purple, target site duplication; cyan, L1Hs specific alleles; brown, mismatch to the L1Hs consensus. d, Nested PCR results showed L1-1 upstream and downstream junctions amplified specifically in the genomic DNA of right STG (RSTG) but not in NA12878. This experiment was repeated four times and always showed the same results. Yellow arrow, product of pre-integration site in the 1st nested PCR (934 bp); yellow rectangle, gel extraction from the 1st PCR to serve as template in 2nd PCRs; red arrow, upstream junction in 2nd nested PCR (336 bp); blue arrow, downstream junction in 2nd nested PCR (594 bp); NA12878, negative control.
e and f, Gel electrophoresis from three independent replicate experiments of the droplet-based full length PCR confirmed the amplification of the L1-1 post-integration site in glia from two brain anatomical regions: LOP (left hemisphere occipital cortex, proximal to STG) and LSTG2 (a second sample from left hemisphere superior temporal gyrus). NA12878, negative control; L1-1, positive control with known L1-1 junction from plasmid GL1-1. e, Replicate experiment 1. f, Replicate experiments 2 and 3. g, Fluorescence readout of the droplet-based full length PCR was quantified based on a standard curve where L1-1 template (from plasmid GL1-1) is mixed with NA12878 at 4 different allele frequencies: 10.83%, 19.54%, 24.27% and 32.69%. The ratio of positive droplets is positively correlated with the L1-1 template frequency (Pearson’s r=0.99). The blue line marks the linear trend and the surrounding gray area marks the 95% confidence intervals. h, Fluorescence readout (n=2 anatomical regions) of the droplet-based full length PCR confirms the presence of L1-1 in the tested glial cells but shows no signal in the fibroblasts. The results are displayed in 2 dimensions for clearer illustration, with no internal control used for the signal on the X-axis. The ratio of L1-1 positive droplets (blue) over the total number of droplets is indicated in each ddPCR experiment.
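The standard-curve quantification in panel g can be sketched as an ordinary least-squares fit of positive-droplet ratio against template allele frequency, which is then inverted to estimate the frequency in a test sample. This is a minimal Python illustration under stated assumptions, not the authors' analysis code. The four template frequencies come from the caption. The droplet ratios are hypothetical placeholders, because the measured values are not given in this text, and the function names are likewise hypothetical.

```python
from math import sqrt


def fit_standard_curve(freqs, ratios):
    """Least-squares line ratio = slope * freq + intercept, plus Pearson's r."""
    n = len(freqs)
    if n < 2 or n != len(ratios):
        raise ValueError("need at least two paired observations")
    mean_x = sum(freqs) / n
    mean_y = sum(ratios) / n
    sxx = sum((x - mean_x) ** 2 for x in freqs)
    syy = sum((y - mean_y) ** 2 for y in ratios)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(freqs, ratios))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / sqrt(sxx * syy)
    return slope, intercept, pearson_r


def estimate_frequency(ratio, slope, intercept):
    """Invert the fitted line to read an allele frequency off the curve."""
    return (ratio - intercept) / slope


# Template allele frequencies (%) listed in the caption.
STANDARD_FREQS = [10.83, 19.54, 24.27, 32.69]
# HYPOTHETICAL positive-droplet ratios, for illustration only.
HYPOTHETICAL_RATIOS = [0.011, 0.021, 0.025, 0.034]

slope, intercept, pearson_r = fit_standard_curve(STANDARD_FREQS, HYPOTHETICAL_RATIOS)
```

A sample's positive-droplet ratio would then be passed to `estimate_frequency` with the fitted `slope` and `intercept`. Estimates outside the range of the standards (here roughly 11% to 33%) rely on extrapolation and should be treated with caution.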

Extended Data Fig. 7 Experimental validation of L1-2.

a, We used droplet digital PCR (ddPCR) to measure the frequency, nested PCR to sequence the junctions, cloning with overlap extension PCR (OE-PCR) to obtain the full-length insertion sequence, droplet-based full-length PCR followed by gel electrophoresis or fluorescence ddPCR to amplify the post-integration site, and ddPCR using a TaqMan probe crossing its 5’-junction (see Extended Data Fig. 5d). TSD, target site duplication; up, upstream junction; dn, downstream junction. b, ddPCR detected a clear signal for L1-2 in the genomic DNA from right hemisphere superior temporal gyrus, in both neurons (n=10 replicates) and glia (n=10 replicates), but not in the fibroblasts (n=10 replicates). Green, droplets containing only RPP30 (internal control); blue, droplets containing only the L1 junction template; orange, droplets containing both L1 and RPP30 templates; black, droplets containing neither L1 nor RPP30 templates. We used NA12878 DNA as a negative control and synthesized DNA with the target L1 junction as a positive control. c, The full sequence of L1-2 based on OE-PCR. Black, flanking sequence; red, inserted L1 sequence; purple, target site duplication; cyan, L1Hs-specific alleles; brown, mismatch to the L1Hs consensus. d, Nested PCR results showed that the L1-2 upstream and downstream junctions amplified specifically in the genomic DNA of right STG (RSTG) but not in NA12878. This experiment was repeated four times and always showed the same results. Notably, we used two different sets of primers in the first PCR for the upstream and downstream junctions. Yellow arrow, product of pre-integration site in the 1st nested PCR (L1-2 up, 266 bp; L1-2 dn, 561 bp); yellow rectangle, gel extraction from the 1st PCR to serve as template in 2nd PCRs; red arrow, upstream junction in 2nd nested PCR (263 bp); blue arrow, downstream junction in 2nd nested PCR (215 bp); NA12878, negative control.
e, Gel electrophoresis of the droplet-based full-length PCR confirmed the amplification of the L1-2 post-integration site in neurons from the right hemisphere occipital cortex, distal to STG (ROD). NA12878, negative control; L1-2, positive control with known L1-2 junction from L1-2 OE-PCR (see Extended Data Fig. 5b). The droplet-based full-length PCR experiment was repeated and showed similar results. f, Fluorescence readout (n=1 replicate) of the droplet-based full-length PCR confirms the presence of L1-2 in neurons from ROD but shows no signal in the fibroblasts. The results are displayed in 2 dimensions for clearer illustration, with no internal control used for the signal on the X-axis. The ratio of L1-2 positive droplets (blue) over the total number of droplets is indicated in each ddPCR experiment. The quantification of the L1-2 frequency is based on a standard curve where L1-2 template (from L1-2 OE-PCR) is mixed with NA12878 at allele frequencies of 7.25% and 13.51%. Source data
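The four droplet classes in panel b (L1 only, RPP30 only, both, neither) are the inputs to a standard ddPCR calculation. The legend does not state how the frequency was derived from the counts, so the sketch below assumes the usual Poisson correction, in which the mean number of template copies per droplet is -ln(1 - p) for a positive fraction p. The function names and droplet counts are hypothetical.

```python
from math import log


def copies_per_droplet(n_positive, n_total):
    """Poisson-corrected mean template copies per droplet: -ln(1 - p)."""
    if n_total <= 0 or not 0 <= n_positive < n_total:
        raise ValueError("need 0 <= n_positive < n_total")
    return -log(1 - n_positive / n_total)


def l1_per_rpp30(n_l1_only, n_rpp30_only, n_both, n_neither):
    """Ratio of L1 junction copies to RPP30 (internal control) copies."""
    total = n_l1_only + n_rpp30_only + n_both + n_neither
    # A droplet is L1-positive if it holds the L1 template, alone or with RPP30.
    lam_l1 = copies_per_droplet(n_l1_only + n_both, total)
    lam_rpp30 = copies_per_droplet(n_rpp30_only + n_both, total)
    return lam_l1 / lam_rpp30


# HYPOTHETICAL droplet counts, for illustration only.
ratio = l1_per_rpp30(n_l1_only=30, n_rpp30_only=4000, n_both=10, n_neither=11960)
print(f"L1 junction copies per RPP30 copy: {ratio:.4f}")
```

Converting this ratio into a fraction of cells carrying the insertion requires further assumptions, such as the copy number of RPP30 and of the insertion per cell, which are not made here.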

Extended Data Fig. 8 Spatial distribution and poly(A) length of L1-1 and L1-2.

a, Anatomical brain regions studied in donor 12004: 1 and 1’, superior temporal gyrus (BA22, both sides); 2, prefrontal cortex distal (BA9, both sides); 3, prefrontal cortex proximal (BA46, both sides); 4, motor cortex distal (BA4, both sides); 5, motor cortex proximal (BA6, both sides); 6, parietal cortex distal (BA7, both sides); 7, parietal cortex proximal (BA39, both sides); 8, occipital cortex distal (BA19, both sides); 9, occipital cortex proximal (BA19, both sides); 10, putamen (both sides); 11, cerebellum (both sides). The tissue for deep whole-genome sequencing is from right superior temporal gyrus (1’). The tissues that were dissected from both hemispheres were bilaterally symmetrical. The metric unit on the ruler is the centimeter. b, The levels of mosaicism in neurons are highly correlated with levels in glia. Red, L1-1; green, L1-2. c, Poly(A) lengths of L1-1 and L1-2 were estimated as the lengths supported by the highest numbers of GL1-1 and GL1-2 clones (see Supplementary 8b). The variation among clones was likely the result of PCR stutter around low-complexity templates67. d, Poly(A) length distribution in 22 previously reported de novo and disease-causing L1 retrotranspositions. The poly(A) lengths of L1-1 and L1-2 are at the 18.2 and 13.6 percentiles, respectively, of this distribution. Source data
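The percentiles in panel d are numerically consistent with ranks in a set of 22 values, since 4/22 ≈ 18.2% and 3/22 ≈ 13.6%. The legend does not say which percentile definition was used, so the sketch below assumes the share of reference values strictly below the query. The reference lengths are hypothetical placeholders, not the published poly(A) lengths.

```python
def percentile_rank(value, reference):
    """Percentage of reference values strictly below `value` (assumed definition)."""
    below = sum(1 for r in reference if r < value)
    return 100 * below / len(reference)


# 22 HYPOTHETICAL poly(A) lengths (bp): 10, 15, ..., 115. Not the published data.
reference = list(range(10, 120, 5))

for query in (30, 25):
    print(f"length {query}: percentile {percentile_rank(query, reference):.1f}")
```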

Extended Data Fig. 9 The genomic locus with L1-1 insertion.

L1-1 is inserted in a 2.6 kb promoter flanking region (ENSR00000032826) that is hypothesized to regulate the expression of nearby genes68. The chromatin states are shown for a subset of human cell lines: light gray, heterochromatin; light green, weakly transcribed; yellow, weak/poised enhancer; orange, strong enhancer; light red, weak promoter; bright red, strong promoter. L1-1 is inserted in a linkage disequilibrium (LD) block, based on the common SNPs that are highly correlated (R² > 0.6, green line) with the closest common SNP to L1-1, rs1890185. This LD block is highlighted in red, and contains 72 lead SNPs associated with 10 diseases or disorders and 28 measurements or other traits69, including 13 risk SNPs from 11 schizophrenia studies (triangle). We categorized all traits under 11 terms based on the Experimental Factor Ontology70. The significantly associated SNPs, indexed from number 1 to 72, are documented in detail in Supplementary Table 6.

Extended Data Fig. 10 Fluorescence quantification in the reporter assay.

a,b, Original photos of the representative images in Fig. 6d,e. c, Raw fluorescence intensities (green and red) used in the statistical analysis in Fig. 6f,g were in the range of 0–3035 for green fluorescence and 0–3613 for red, with no saturated pixels (>4000). Each cell is represented by the average pixel intensity (dot) and the maximum and minimum pixel intensities (bar). Red, Gcont-1; cyan, GL1-1; green, Gcont-2; purple, GL1-2. d, Measurement of the green fluorescence, red fluorescence and brightfield of three cells. C1, live cell; C2, dead cell; C3, dead cell. Each image is representative of the green and red fluorescence images in well 1 to well 5 for any reporters (total=60). e, Representative images of the GFP fluorescence from each of the control and L1-1 reporters in the single transfection experiment (2 wells and 3 images per well, see Fig. 6c). The maximum signal intensities are adjusted from 4095 to 1000 in (d) and (e) to illustrate the cells with weak fluorescence. Source data
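The per-cell summary in panel c (average, minimum and maximum pixel intensity, with a check for saturated pixels) can be sketched as follows. The >4000 saturation threshold is taken from the legend; the function names and pixel values are hypothetical, and the sketch assumes each cell has already been segmented into a flat list of pixel intensities.

```python
SATURATION_THRESHOLD = 4000  # pixels above this count as saturated (from the legend)


def summarize_cell(pixels):
    """Return (mean, min, max) pixel intensity for one segmented cell."""
    if not pixels:
        raise ValueError("cell has no pixels")
    return sum(pixels) / len(pixels), min(pixels), max(pixels)


def has_saturated_pixels(pixels, threshold=SATURATION_THRESHOLD):
    """True if any pixel intensity exceeds the saturation threshold."""
    return any(p > threshold for p in pixels)


# HYPOTHETICAL pixel intensities for two cells, for illustration only.
cells = {
    "cell_1": [120, 340, 3035, 15],
    "cell_2": [0, 88, 410, 950],
}

for name, pixels in cells.items():
    mean, lowest, highest = summarize_cell(pixels)
    flag = "saturated" if has_saturated_pixels(pixels) else "ok"
    print(f"{name}: mean={mean:.1f} min={lowest} max={highest} [{flag}]")
```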

Supplementary information

Supplementary Information

Supplementary Notes 1–6, Supplementary Tables 1, 2 and 7 and Supplementary Figs 1–5.

Reporting Summary

Supplementary Software

R scripts for plotting Figs. 1–6, random forest modeling and RetroVis (MEI visualization tool).

Supplementary Tables 3–6.

Source data

Source Data Fig. 1

Statistical source data for Fig. 1d,f–h.

Source Data Fig. 2

Statistical source data for Fig. 2a–d.

Source Data Fig. 3

Statistical source data for Fig. 3b,e.

Source Data Fig. 4

Statistical source data for Fig. 4a,b.

Source Data Fig. 6

Statistical source data for Fig. 6f–h.

Source Data Extended Data Fig. 1

Statistical source data for Extended Data Fig. 1c,e–g.

Source Data Extended Data Fig. 2

Statistical source data for Extended Data Fig. 2a–d.

Source Data Extended Data Fig. 3

Statistical source data for Extended Data Fig. 3a and 3g.

Source Data Extended Data Fig. 3

Unprocessed gel for Extended Data Fig. 3c.

Source Data Extended Data Fig. 5

Unprocessed gel for Extended Data Fig. 5c.

Source Data Extended Data Fig. 6

Statistical source data for Extended Data Fig. 6g.

Source Data Extended Data Fig. 6

Unprocessed gel for Extended Data Fig. 6d–f.

Source Data Extended Data Fig. 7

Unprocessed gel for Extended Data Fig. 7d,e.

Source Data Extended Data Fig. 8

Statistical source data for Extended Data Fig. 8b–d.

Source Data Extended Data Fig. 10

Statistical source data for Extended Data Fig. 10c.

Cite this article

Zhu, X., Zhou, B., Pattni, R. et al. Machine learning reveals bilateral distribution of somatic L1 insertions in human neurons and glia. Nat Neurosci (2021). https://doi.org/10.1038/s41593-020-00767-4
