Mapping and characterization of structural variation in 17,795 human genomes

Abstract

A key goal of whole-genome sequencing for studies of human genetics is to interrogate all forms of variation, including single-nucleotide variants, small insertion or deletion (indel) variants and structural variants. However, tools and resources for the study of structural variants have lagged behind those for smaller variants. Here we used a scalable pipeline¹ to map and characterize structural variants in 17,795 deeply sequenced human genomes. We publicly release site-frequency data to create the largest, to our knowledge, whole-genome-sequencing-based structural variant resource so far. On average, individuals carry 2.9 rare structural variants that alter coding regions; these variants affect the dosage or structure of 4.2 genes and account for 4.0–11.2% of rare high-impact coding alleles. Using a computational model, we estimate that structural variants account for 17.2% of rare alleles genome-wide, with predicted deleterious effects that are equivalent to loss-of-function coding alleles; approximately 90% of such structural variants are noncoding deletions (mean 19.1 per genome). We report 158,991 ultra-rare structural variants and show that 2% of individuals carry ultra-rare megabase-scale structural variants, nearly half of which are balanced or complex rearrangements. Finally, we infer the dosage sensitivity of genes and noncoding elements, and reveal trends that relate to element class and conservation. This work will help to guide the analysis and interpretation of structural variants in the era of whole-genome sequencing.

Fig. 1: The public version of the B38 callset derived from 14,623 samples.
Fig. 2: Burden of rare gene-altering SVs.
Fig. 3: Estimation of genome-wide burden of high-impact functional alleles.
Fig. 4: Dosage sensitivity of functional annotations.

Data availability

The sequencing data can be accessed through dbGaP (https://www.ncbi.nlm.nih.gov/gap) under the accession numbers provided in Supplementary Table 7. PacBio long-read data used for SV validation can be accessed through the Sequence Read Archive (SRA), under the accession numbers provided in Supplementary Table 2. The set of high-confidence HGSVC long-read-derived SV calls, validated by our independent PacBio data and used as a truth set, can be found in Supplementary File 3. Supplementary Files 1–4 can be found at https://github.com/hall-lab/sv_paper_042020.

Code availability

Custom code used in the long-read validation can be found here: https://github.com/abelhj/long-read-validation/tree/master.

References

1. Larson, D. E. et al. svtools: population-scale analysis of structural variation. Bioinformatics 35, 4782–4787 (2019).
2. Weischenfeldt, J., Symmons, O., Spitz, F. & Korbel, J. O. Phenotypic impact of genomic structural variation: insights from and for human disease. Nat. Rev. Genet. 14, 125–138 (2013).
3. Stranger, B. E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848–853 (2007).
4. Chiang, C. et al. The impact of structural variation on human gene expression. Nat. Genet. 49, 692–699 (2017).
5. Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
6. Sebat, J. et al. Strong association of de novo copy number mutations with autism. Science 316, 445–449 (2007).
7. Weiss, L. A. et al. Association between microdeletion and microduplication at 16p11.2 and autism. N. Engl. J. Med. 358, 667–675 (2008).
8. Turner, T. N. et al. Genomic patterns of de novo mutation in simplex autism. Cell 171, 710–722 (2017).
9. Werling, D. M. et al. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nat. Genet. 50, 727–736 (2018).
10. Brandler, W. M. et al. Paternally inherited cis-regulatory structural variants are associated with autism. Science 360, 327–331 (2018).
11. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455, 237–241 (2008).
12. Walsh, T. et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320, 539–543 (2008).
13. McCarthy, S. E. et al. Microduplications of 16p11.2 are associated with schizophrenia. Nat. Genet. 41, 1223–1227 (2009).
14. Marshall, C. R. et al. Contribution of copy number variants to schizophrenia from a genome-wide study of 41,321 subjects. Nat. Genet. 49, 27–35 (2017).
15. Craddock, N. et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713–720 (2010).
16. Kathiresan, S. et al. Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants. Nat. Genet. 41, 334–341 (2009).
17. MacDonald, J. R., Ziman, R., Yuen, R. K., Feuk, L. & Scherer, S. W. The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 42, D986–D992 (2014).
18. Bragin, E. et al. DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence and copy-number variation. Nucleic Acids Res. 42, D993–D1000 (2014).
19. Lappalainen, I. et al. dbVar and DGVa: public archives for genomic structural variation. Nucleic Acids Res. 41, D936–D941 (2013).
20. Hehir-Kwa, J. Y. et al. A high-quality human reference panel reveals the complexity and distribution of genomic structural variants. Nat. Commun. 7, 12989 (2016).
21. Maretty, L. et al. Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature 548, 87–91 (2017).
22. Sudmant, P. H. et al. Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015).
23. Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
24. Chiang, C. et al. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat. Methods 12, 966–968 (2015).
25. Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 4038 (2018).
26. Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).
27. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
28. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
29. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
30. Huang, Y. F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).
31. McLaren, W. et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, 2069–2070 (2010).
32. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
33. Ewing, A. D. et al. Retrotransposition of gene transcripts leads to structural variation in mammalian genomes. Genome Biol. 14, R22 (2013).
34. Schrider, D. R. et al. Gene copy-number polymorphism caused by retrotransposition in humans. PLoS Genet. 9, e1003242 (2013).
35. Abyzov, A. et al. Analysis of variable retroduplications in human populations suggests coupling of retrotransposition to cell division. Genome Res. 23, 2042–2052 (2013).
36. Cooper, G. M. et al. A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846 (2011).
37. Hook, E. B. & Hamerton, J. L. in Population Cytogenetics: Studies in Humans (eds Hook, E. B. & Porter, I. H.) 63–79 (Academic Press, 1977).
38. Forabosco, A., Percesepe, A. & Santucci, S. Incidence of non-age-dependent chromosomal abnormalities: a population-based study on 88965 amniocenteses. Eur. J. Hum. Genet. 17, 897–903 (2009).
39. Malhotra, A. et al. Breakpoint profiling of 64 cancer genomes reveals numerous complex rearrangements spawned by homology-independent mechanisms. Genome Res. 23, 762–776 (2013).
40. Conrad, D. F. et al. Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat. Genet. 42, 385–391 (2010).
41. Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res. 20, 623–635 (2010).
42. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 (2011).
43. Kidd, J. M. et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837–847 (2010).
44. Quinlan, A. R. & Hall, I. M. Characterizing complex structural variation in germline and somatic genomes. Trends Genet. 28, 43–53 (2012).
45. Ruderfer, D. M. et al. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes. Nat. Genet. 48, 1107–1111 (2016).
46. Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154 (2010).
47. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
48. Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
49. Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013).
50. Visel, A., Minovitsky, S., Dubchak, I. & Pennacchio, L. A. VISTA Enhancer Browser—a database of tissue-specific human enhancers. Nucleic Acids Res. 35, D88–D92 (2007).
51. Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database 2017, bax028 (2017).
52. Lesurf, R. et al. ORegAnno 3.0: a community-driven resource for curated regulatory annotation. Nucleic Acids Res. 44, D126–D132 (2016).
53. Abyzov, A., Urban, A. E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21, 974–984 (2011).
54. Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
55. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
56. Rodriguez, J. M. et al. APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Res. 41, D110–D117 (2013).
57. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
58. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/201178v3 (2018).
59. Tan, A., Abecasis, G. R. & Kang, H. M. Unified representation of genetic variants. Bioinformatics 31, 2202–2204 (2015).
60. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
61. Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).
62. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
63. Ganel, L., Abel, H. J. & Hall, I. M. SVScore: an impact prediction tool for structural variation. Bioinformatics 33, 1083–1085 (2017).
64. Derrien, T. et al. Fast computation and applications of genome mappability. PLoS One 7, e30377 (2012).
65. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
66. Griffith, O. L. et al. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res. 36, D107–D113 (2008).
67. Bejerano, G. et al. Ultraconserved elements in the human genome. Science 304, 1321–1325 (2004).
68. Yip, K. Y. et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 13, R48 (2012).
69. Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).
70. Ashoor, H., Kleftogiannis, D., Radovanovic, A. & Bajic, V. B. DENdb: database of integrated human enhancers. Database 2015, bav085 (2015).
71. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
72. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
73. Faust, G. G. & Hall, I. M. YAHA: fast and flexible long-read alignment with optimal breakpoint detection. Bioinformatics 28, 2417–2424 (2012).

Acknowledgements

We thank staff at the NHGRI for supporting this effort. This study was funded by NHGRI CCDG awards to Washington University in St Louis (UM1 HG008853), the Broad Institute of MIT and Harvard (UM1 HG008895), Baylor College of Medicine (UM1 HG008898) and New York Genome Center (UM1 HG008901); an NHGRI GSP Coordinating Center grant to Rutgers (U24 HG008956); and a Burroughs Wellcome Fund Career Award to I.M.H. Additional data production at Washington University in St Louis was funded by a separate NHGRI award (5U54HG003079). We thank S. Sunyaev for comments on the manuscript; T. Teshiba for coordinating samples for FINRISK and EUFAM sequencing; and the staff and participants of the ARIC study for their contributions; and we acknowledge all individuals who were involved in the collection of samples that were analysed for this study. Data production for EUFAM was funded by 4R01HL113315-05; the Metabolic Syndrome in Men (METSIM) study was supported by grants to M. Laakso from the Academy of Finland (no. 321428), the Sigrid Juselius Foundation, the Finnish Foundation for Cardiovascular Research, Kuopio University Hospital and the Centre of Excellence of Cardiovascular and Metabolic Diseases supported by the Academy of Finland; data collection for the CEPH pedigrees was funded by the George S. and Dolores Doré Eccles Foundation and NIH grants GM118335 and GM059290; study recruitment at Washington University in St Louis was funded by the DDRCC (NIDDK P30 DK052574) and the Helmsley Charitable Trust; study recruitment at Cedars-Sinai was supported by the F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, NIH/NIDDK grants P01 DK046763 and U01 DK062413 and the Helmsley Charitable Trust; study recruitment at Intermountain Medical Center was funded by the Dell Loy Hansen Heart Foundation; the Late Onset Alzheimer's Disease (LOAD) study was funded by grants to T. Foroud (U24AG021886, U24AG056270, U24AG026395 and R01AG041797); the Atherosclerosis Risk in Communities (ARIC) study was funded by the NHLBI (HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700004I and HHSN268201700005I); and the PAGE programme is funded by the NHGRI with co-funding from the NIMHD (U01HG007416, U01HG007417, U01HG007397, U01HG007376 and U01HG007419). Samples from the BioMe Biobank were provided by The Charles Bronfman Institute for Personalized Medicine at the Icahn School of Medicine at Mount Sinai. The Hispanic Community Health Study/Study of Latinos was carried out as a collaborative study supported by the NHLBI (N01-HC65233, N01-HC65234, N01-HC65235, N01-HC65236 and N01-HC65237), with contributions from the NIMHD, NIDCD, NIDCR, NIDDK, NINDS and NIH ODS. The Multiethnic Cohort (MEC) study is funded through the NCI (R37CA54281, R01 CA63, P01CA33619, U01CA136792 and U01CA98758). For the Stanford Global Reference Panel, individuals from Puno, Peru were recruited by J. Baker and C. Bustamante, with funding from the Burroughs Wellcome Fund, and individuals from Rapa Nui (Easter Island) were recruited by K. Sandoval Mendoza and A. Moreno Estrada, with funding from the Charles Rosenkranz Prize for Health Care Research in Developing Countries. The Women’s Health Initiative (WHI) programme is funded by the NHLBI (HHSN268201100046C, HHSN268201100001C, HHSN268201100002C, HHSN268201100003C, HHSN268201100004C and HHSN271201100004C). The GALA II study and E. G. Burchard are supported by the Sandler Family Foundation, the American Asthma Foundation, the RWJF Amos Medical Faculty Development Program, the Harry Wm. and Diana V. Hind Distinguished Professor in Pharmaceutical Sciences II, the NHLBI (R01HL117004, R01HL128439, R01HL135156 and X01HL134589), the NIEHS (R01ES015794, R21ES24844), the NIMHD (P60MD006902, R01MD010443, RL5GM118984) and the Tobacco-Related Disease Research Program (24RT-0025). We acknowledge the following GALA II co-investigators for recruitment of individuals, sample processing and quality control: C. Eng, S. Salazar, S. Huntsman, D. Hu, A. C. Y. Mak, L. Caine, S. Thyne, H. J. Farber, P. C. Avila, D. Serebrisky, W. Rodriguez-Cintron, J. R. Rodriguez-Santana, R. Kumar, L. N. Borrell, E. Brigino-Buenaventura, A. Davis, M. A. LeNoir, K. Meade, S. Sen and F. Lurmann, and we thank the staff and participants who contributed to the GALA II study.

Author information

Contributions

I.M.H. conceived and directed the study. D.E.L. and H.J.A. developed the final version of the SV calling pipeline, constructed the SV callsets and performed the data analyses. C.C. and R.M.L. helped design the SV calling pipeline. A.A.R. contributed to long-read validation. I.D. was instrumental in the migration of workflows to the Google Cloud Platform. K.L.K. assisted with data management. E.S.L., B.M.N. and N.O.S. provided input on population genetic analyses. W.J.S., D.M.M., E.S.L., B.M.N., M.C.Z., C.R., T.C.M., S.B., S.K.D., I.M.H. and N.O.S. directed data production, processing and management at their respective sites, and edited the manuscript. Members of the NHGRI CCDG consortium provided samples, produced sequencing data and coordinated and administered data-sharing efforts. H.J.A., D.E.L. and I.M.H. wrote the manuscript.

Corresponding author

Correspondence to Ira M. Hall.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 SV mapping pipeline.

SVs are detected within each sample using LUMPY. Breakpoint probability distributions are used to merge and refine the position of detected SVs within a cohort, followed by parallelized re-genotyping and copy-number annotation. Samples are merged into a single cohort-level VCF file, and variant types are reclassified and genotypes refined with svtools using the combined breakpoint genotype and read-depth information. Finally, sample-level quality control (QC) and variant confidence scoring are conducted to produce the final callset.

Extended Data Fig. 2 The B37 callset.

a, Variant counts (y axis) for each sample (x axis) in the callset, ordered by cohort. Large (>1 kb) variants are shown in dark shades and smaller variants in light shades. b, Variant counts per sample, ordered by self-reported ancestry according to the colour scheme on the right. Abbreviations as in Fig. 1a. Note that African-ancestry samples show more variant calls, as expected. c, Table showing the number of variant calls by variant type and frequency class, and Mendelian error rate by variant type. d, Histogram of allele count for each variant class, showing alleles with counts ≤ 100. e, Linkage disequilibrium of each variant class as represented by maximum R² value to nearby SNVs, for n = 1,581 samples. Note that these distributions mirror those from our previous SV callset for GTEx⁴, which was characterized extensively in the context of expression quantitative trait loci.

Extended Data Fig. 3 The B38 callset.

a, Variant counts (y axis) for each sample (x axis) in the callset, ordered by cohort. Large (>1 kb) variants are shown in dark shades and smaller variants in light shades. b, Variant counts per sample, ordered by self-reported ancestry according to the colour scheme on the right. Abbreviations as in Fig. 1a. Note that African-ancestry samples show more variant calls, as expected. Note also that there is some residual variability in variant counts owing to differences in data from each sequencing centre, but that this is mainly limited to small tandem duplications (see a), primarily at STRs. c, SV length distribution by variant class. d, Distribution of the number of singleton SVs detected in samples from different ancestry groups. Only groups with ≥1,000 samples in the B38 callset are shown, and each group was subsampled down to 1,000 individuals before recalculation of the allele frequency. e, Histogram showing the resolution of SV breakpoint calls, as defined by the length of the 95% confidence interval of the breakpoint-containing region defined by LUMPY, after cross-sample merging and refinement using svtools. Data are from n = 360,614 breakpoints, 2 per variant. f, Distribution of the number of SVs detected per sample in WGS data from each sequencing centre (x axis) for African and non-African (non-AFR) samples, showing all variants (left), and those larger (middle) and smaller (right) than 1 kb in size. Per-centre counts are as follows: centre A, 1,527 AFR, 2,080 non-AFR; centre B, 408 AFR, 2,745 non-AFR; centre C, 2,953 AFR, 2,226 non-AFR; centre D, 150 AFR, 2,534 non-AFR. g, Plots of Mendelian error (ME) rate (y axis) by MSQ for each variant class. Dot size is determined by point density (right) and the thresholds used to distinguish high- and low-confidence SVs are shown by the vertical lines. All box plots indicate the median (centre line) and the first and third quartiles (box limits); whiskers extend 1.5 × IQR.

Extended Data Fig. 4 PCA for the B37 callset.

PCA was performed using a linkage disequilibrium-pruned subset of high-confidence DEL and MEI variants, with MAF > 1%. Self-reported ancestry is shown using the colour scheme on the right, with abbreviations as in Fig. 1a.

Extended Data Fig. 5 Validation of SV calls by PacBio long reads in nine control samples.

n = 9,905 variants. a, Validation rates in variant carriers (y axis) versus validation rates in non-carriers (that is, false validations; x axis) for each method of determining variant overlap, for a range of supporting-read-count thresholds. Ultra-rare variants (n = 133) are shown separately on the right. For each variant overlap method, each data point represents a distinct read-count threshold (≥1, 2, 3, 5, 10, 15 or 20 PacBio reads) that was used to determine validation of SV calls by long-read alignments. Two methods were used for determining overlap between SV coordinates and long-read alignments while accounting for positional uncertainty: (1) BEDTools pairtopair, requiring overlap between the pair of breakpoint intervals predicted by short-read SV mapping and the pair of breakpoint intervals predicted by long-read alignment, allowing 100 bp or 200 bp of ‘slop’; and (2) BEDTools intersect, requiring 90% or 95% reciprocal overlap between the coordinates spanned by the SV predicted by short-read SV mapping and the SV predicted by long-read alignment. Here, we plot the first criteria by themselves, and in pairwise combination with the latter (see key on the right of the figure). Note that Supplementary Table 3 is based on the ‘100 bp slop or 90% reciprocal overlap’ method, requiring at least two PacBio reads. b, Validation rates by frequency class for variant carriers and non-carriers with increasing PacBio supporting-read thresholds, shown using the same overlap method as in Supplementary Table 3. Variant counts per frequency class are as follows: ultra-rare, n = 133; rare, n = 734; low frequency, n = 1,361; common, n = 7,677.
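The two overlap criteria described in this legend can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the validation code used in the study: the `SVCall` record, its `span` definition and all function names are hypothetical, coordinates are treated as half-open intervals on a single chromosome, slop is applied to the short-read call only, and the sketch does not call BEDTools itself.

```python
from typing import NamedTuple, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


class SVCall(NamedTuple):
    """Hypothetical record: one SV described by its two breakpoint intervals."""
    left: Interval
    right: Interval

    @property
    def span(self) -> Interval:
        # Assumption of this sketch: the SV spans from the start of the
        # left breakpoint interval to the end of the right one.
        return (self.left[0], self.right[1])


def overlap_len(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def reciprocal_overlap(a: Interval, b: Interval, min_frac: float = 0.9) -> bool:
    """True if the shared region covers at least min_frac of BOTH intervals."""
    len_a, len_b = a[1] - a[0], b[1] - b[0]
    if len_a <= 0 or len_b <= 0:
        return False
    shared = overlap_len(a, b)
    return shared >= min_frac * len_a and shared >= min_frac * len_b


def breakpoints_match(call: SVCall, read: SVCall, slop: int = 100) -> bool:
    """True if both breakpoint intervals of `call`, padded by `slop` bp on
    each side, overlap the corresponding intervals of `read`."""
    def pad(iv: Interval) -> Interval:
        return (max(0, iv[0] - slop), iv[1] + slop)

    return (overlap_len(pad(call.left), read.left) > 0
            and overlap_len(pad(call.right), read.right) > 0)


def count_supporting_reads(call: SVCall, reads: Sequence[SVCall],
                           slop: int = 100, min_frac: float = 0.9) -> int:
    """Count long-read alignments that support `call` under either criterion."""
    return sum(
        1 for r in reads
        if breakpoints_match(call, r, slop)
        or reciprocal_overlap(call.span, r.span, min_frac)
    )


def is_validated(call: SVCall, reads: Sequence[SVCall], min_reads: int = 2) -> bool:
    """Apply a supporting-read-count threshold (two reads by default)."""
    return count_supporting_reads(call, reads) >= min_reads
```

Raising `min_reads` in this sketch corresponds to moving along one curve in panel a: stricter thresholds lower both the carrier validation rate and the non-carrier (false validation) rate.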

Extended Data Fig. 6 Mendelian inheritance analysis in a set of three-generation CEPH pedigrees comprising 409 parent–offspring trios.

a, Example structure of a single CEPH pedigree indicating nomenclature of the parental (P0), first (F1) and second (F2) generations. b, Transmission rate of SVs from different allele-frequency classes including SVs that are unique to a single family (ultra-rare), rare (<1%), low frequency (‘low’; 1–5%) and common (>5%). c, Table showing the number and rate of Mendelian errors by allele-frequency class. d, Table showing the number and rate of Mendelian errors for SVs that are unique to a single family, for each SV type.
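As a rough illustration of the quantities reported in panels b and c, the following Python sketch checks Mendelian consistency and computes a transmission rate for biallelic autosomal genotypes coded as alternate-allele counts (0, 1 or 2). The function names, the genotype coding and the choice of denominator are assumptions made for this example; the legend does not specify how the study computed these rates.

```python
from itertools import product
from typing import Iterable, Set, Tuple

Trio = Tuple[int, int, int]  # (father, mother, child) alternate-allele counts


def transmissible(gt: int) -> Set[int]:
    """Alternate-allele counts (0 or 1) that a parent with genotype gt can pass on."""
    if gt not in (0, 1, 2):
        raise ValueError(f"genotype must be 0, 1 or 2, got {gt!r}")
    return {0: {0}, 1: {0, 1}, 2: {1}}[gt]


def is_mendelian_consistent(father: int, mother: int, child: int) -> bool:
    """True if the child's genotype can arise from one allele of each parent."""
    possible = {a + b for a, b in product(transmissible(father), transmissible(mother))}
    return child in possible


def mendelian_error_rate(trios: Iterable[Trio]) -> float:
    """Fraction of inconsistent trios among those with at least one
    non-reference genotype (this denominator is an assumption of the sketch)."""
    informative = [t for t in trios if any(g > 0 for g in t)]
    if not informative:
        return float("nan")
    errors = sum(not is_mendelian_consistent(*t) for t in informative)
    return errors / len(informative)


def transmission_rate(trios: Iterable[Trio]) -> float:
    """Fraction of children carrying the variant when exactly one parent is
    heterozygous and the other is homozygous reference (expected to be near 0.5)."""
    eligible = [t for t in trios if sorted(t[:2]) == [0, 1]]
    if not eligible:
        return float("nan")
    return sum(child > 0 for _, _, child in eligible) / len(eligible)
```

A transmission rate well below 0.5 in a frequency class would point to false-positive calls in parents, whereas Mendelian errors flag variants seen in a child but in neither parent, or genotype combinations that cannot be inherited.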

Extended Data Fig. 7 Comparison of SV calls and genotypes to the 1KG phase 3 callset.

a, Number of known and novel SVs in the B37 (left) and B38 (right) callsets, shown by frequency class. b, Table showing the genotypes (GT) reported in our B38 callset⁵ (rows) versus the 1KG callset (columns) at SVs identified by both studies among the five samples included in both callsets. c, Table showing genotype concordance by SV type including the fraction of concordant calls and Cohen’s κ coefficient. d, Distribution of correlation (R²) between genotype information determined by breakpoint-spanning reads and estimates of copy number (CN) determined by read-depth analysis for the SVs shown in b, c when genotype information between the B38 and the 1KG callset is concordant (left) or discordant (middle, right). At sites with discordant genotypes, correlation with copy-number information is typically higher for genotypes from the B38 callset (middle) than the 1KG callset (right).
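For readers unfamiliar with the statistics in panel c, the sketch below computes the fraction of concordant genotype calls and Cohen's κ, which corrects the observed agreement p_o for the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The function names and the genotype strings are illustrative assumptions, not taken from the study's code.

```python
from collections import Counter
from typing import Hashable, Sequence


def concordance(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of positions at which the two callsets report the same genotype."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two sets of categorical calls on the same items."""
    n = len(a)
    p_o = concordance(a, b)  # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the marginal frequencies, summed over categories.
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both callsets use a single identical category
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical genotypes for five sample-site pairs.
ours = ["0/0", "0/1", "1/1", "0/1", "0/0"]
other = ["0/0", "0/1", "1/1", "0/0", "0/0"]
print(concordance(ours, other))   # fraction of concordant calls
print(cohens_kappa(ours, other))  # chance-corrected agreement
```

κ is more informative than raw concordance when one genotype class dominates, because two callsets that mostly report homozygous reference would agree often by chance alone.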

Extended Data Fig. 8 Ultra-rare SVs in the B38 callset.

n = 14,623. a, Histogram showing the number of ultra-rare SVs per individual (ultra-rare is defined as singleton variants private to a single individual or nuclear family). b, Histogram showing the number of genes affected by ultra-rare SVs larger than 1 Mb in size.

Extended Data Fig. 9 Correlations between dosage sensitivity scores for CNV in the combined callset.

n = 17,795. a, Results for deletion variants. The ExAC score is the published ExAC DEL intolerance score⁴⁵; the CCDG score is similarly calculated from our data, using CCDG deletions; pLI is the published loss-of-function intolerance score from ExAC²⁷; ‘HI.Z’ is the negative of the inverse-normal transformed haploinsufficiency score from DECIPHER⁴⁶; ‘ave.ccdg.exac’ is the arithmetic mean of the CCDG and ExAC DEL intolerance scores; and ‘ave.ccdg.hi’ is the arithmetic mean of the CCDG and HI-Z scores. The correlations shown are Spearman rank correlations (rho); P values are calculated by two-sided Spearman rank correlation test; and N represents the number of genes included in the test. b, Results for duplication variants, using the same naming conventions as in a.
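The two operations named in this legend, averaging a pair of per-gene scores and computing a Spearman rank correlation, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: scores are held in dictionaries keyed by gene, only genes present in both dictionaries are used, ties receive average ranks, and the function names are hypothetical. It does not compute the P values mentioned in the legend.

```python
from math import sqrt
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_score(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Arithmetic mean of two per-gene scores over the genes present in both."""
    return {gene: (a[gene] + b[gene]) / 2 for gene in a.keys() & b.keys()}


def correlate_scores(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman rho between two per-gene scores over their shared genes."""
    genes = sorted(a.keys() & b.keys())
    return spearman_rho([a[g] for g in genes], [b[g] for g in genes])
```

In this sketch, the N reported alongside each correlation would correspond to the number of shared genes passed to `correlate_scores`.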

Extended Data Table 1 Ancestry, ethnicity and continental origin of the samples analysed in this study

Supplementary information

Supplementary Information

This file contains: (1) Supplementary Note: SV Callset Quality Assessment; (2) Supplementary References; (3) Supplementary Table legends (Supplementary Table files provided separately in Excel format); (4) Supplementary Files; and (5) Consortium Members: NHGRI Centers for Common Disease Genomics.

Reporting Summary

Supplementary Table 1

Description of which callset and sample subsets were used for each of the major analyses in the study. See main Supplementary Information PDF for full legend.

Supplementary Table 2

Description of PacBio long-read datasets used for SV validation analyses. See main Supplementary Information PDF for full legend.

Supplementary Table 3

SV validation rate analysis using split-read mapping with deep coverage (>60x) PacBio long-read WGS data. See main Supplementary Information PDF for full legend.

Supplementary Table 4

SV detection sensitivity analysis based on long-read SV calls from the Human Genome Structural Variation Consortium (HGSVC). See main Supplementary Information PDF for full legend.

Supplementary Table 5

Number of variants represented in Fig. 2d. See main Supplementary Information PDF for full legend.

Supplementary Table 6

Number of variants for each category in Fig. 3a and 3b. See main Supplementary Information PDF for full legend.

Supplementary Table 7

Data Availability. See main Supplementary Information PDF for full legend.

About this article

Cite this article

Abel, H.J., Larson, D.E., Regier, A.A. et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583, 83–89 (2020). https://doi.org/10.1038/s41586-020-2371-0
