Mapping and characterization of structural variation in 17,795 human genomes

Abel, Haley J.; Larson, David E.; Regier, Allison A.; Chiang, Colby; Das, Indraniel; Kanchi, Krishna L.; Layer, Ryan M.; Neale, Benjamin M.; Salerno, William J.; Reeves, Catherine; Buyske, Steven; Matise, Tara C.; Muzny, Donna M.; Zody, Michael C.; Lander, Eric S.; Dutcher, Susan K.; Stitziel, Nathan O.; Hall, Ira M.

doi:10.1038/s41586-020-2371-0

Article
Published: 27 May 2020

Nature volume 583, pages 83–89 (2020)

Abstract

A key goal of whole-genome sequencing for studies of human genetics is to interrogate all forms of variation, including single-nucleotide variants, small insertion or deletion (indel) variants and structural variants. However, tools and resources for the study of structural variants have lagged behind those for smaller variants. Here we used a scalable pipeline¹ to map and characterize structural variants in 17,795 deeply sequenced human genomes. We publicly release site-frequency data to create the largest, to our knowledge, whole-genome-sequencing-based structural variant resource so far. On average, individuals carry 2.9 rare structural variants that alter coding regions; these variants affect the dosage or structure of 4.2 genes and account for 4.0–11.2% of rare high-impact coding alleles. Using a computational model, we estimate that structural variants account for 17.2% of rare alleles genome-wide, with predicted deleterious effects that are equivalent to loss-of-function coding alleles; approximately 90% of such structural variants are noncoding deletions (mean 19.1 per genome). We report 158,991 ultra-rare structural variants and show that 2% of individuals carry ultra-rare megabase-scale structural variants, nearly half of which are balanced or complex rearrangements. Finally, we infer the dosage sensitivity of genes and noncoding elements, and reveal trends that relate to element class and conservation. This work will help to guide the analysis and interpretation of structural variants in the era of whole-genome sequencing.

**Fig. 1: The public version of the B38 callset derived from 14,623 samples.**

**Fig. 2: Burden of rare gene-altering SVs.**

**Fig. 3: Estimation of genome-wide burden of high-impact functional alleles.**

**Fig. 4: Dosage sensitivity of functional annotations.**

Data availability

The sequencing data can be accessed through dbGaP (https://www.ncbi.nlm.nih.gov/gap) under the accession numbers provided in Supplementary Table 7. PacBio long-read data used for SV validation can be accessed through the Sequence Read Archive (SRA), under the accession numbers provided in Supplementary Table 2. The set of high-confidence HGSVC long-read-derived SV calls, validated by our independent PacBio data and used as a truth set, can be found in Supplementary File 3. Supplementary Files 1–4 can be found at https://github.com/hall-lab/sv_paper_042020.

Code availability

Custom code used in the long-read validation can be found here: https://github.com/abelhj/long-read-validation/tree/master.

References

Larson, D. E. et al. svtools: population-scale analysis of structural variation. Bioinformatics 35, 4782–4787 (2019).
Weischenfeldt, J., Symmons, O., Spitz, F. & Korbel, J. O. Phenotypic impact of genomic structural variation: insights from and for human disease. Nat. Rev. Genet. 14, 125–138 (2013).
Stranger, B. E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848–853 (2007).
Chiang, C. et al. The impact of structural variation on human gene expression. Nat. Genet. 49, 692–699 (2017).
Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
Sebat, J. et al. Strong association of de novo copy number mutations with autism. Science 316, 445–449 (2007).
Weiss, L. A. et al. Association between microdeletion and microduplication at 16p11.2 and autism. N. Engl. J. Med. 358, 667–675 (2008).
Turner, T. N. et al. Genomic patterns of de novo mutation in simplex autism. Cell 171, 710–722 (2017).
Werling, D. M. et al. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nat. Genet. 50, 727–736 (2018).
Brandler, W. M. et al. Paternally inherited cis-regulatory structural variants are associated with autism. Science 360, 327–331 (2018).
International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455, 237–241 (2008).
Walsh, T. et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320, 539–543 (2008).
McCarthy, S. E. et al. Microduplications of 16p11.2 are associated with schizophrenia. Nat. Genet. 41, 1223–1227 (2009).
Marshall, C. R. et al. Contribution of copy number variants to schizophrenia from a genome-wide study of 41,321 subjects. Nat. Genet. 49, 27–35 (2017).
Craddock, N. et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713–720 (2010).
Kathiresan, S. et al. Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants. Nat. Genet. 41, 334–341 (2009).
MacDonald, J. R., Ziman, R., Yuen, R. K., Feuk, L. & Scherer, S. W. The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 42, D986–D992 (2014).
Bragin, E. et al. DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence and copy-number variation. Nucleic Acids Res. 42, D993–D1000 (2014).
Lappalainen, I. et al. dbVar and DGVa: public archives for genomic structural variation. Nucleic Acids Res. 41, D936–D941 (2013).
Hehir-Kwa, J. Y. et al. A high-quality human reference panel reveals the complexity and distribution of genomic structural variants. Nat. Commun. 7, 12989 (2016).
Maretty, L. et al. Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature 548, 87–91 (2017).
Sudmant, P. H. et al. Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015).
Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
Chiang, C. et al. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat. Methods 12, 966–968 (2015).
Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 4038 (2018).
Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Huang, Y. F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).
McLaren, W. et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, 2069–2070 (2010).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Ewing, A. D. et al. Retrotransposition of gene transcripts leads to structural variation in mammalian genomes. Genome Biol. 14, R22 (2013).
Schrider, D. R. et al. Gene copy-number polymorphism caused by retrotransposition in humans. PLoS Genet. 9, e1003242 (2013).
Abyzov, A. et al. Analysis of variable retroduplications in human populations suggests coupling of retrotransposition to cell division. Genome Res. 23, 2042–2052 (2013).
Cooper, G. M. et al. A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846 (2011).
Hook, E. B. & Hamerton, J. L. in Population Cytogenetics: Studies in Humans (eds Hook, E. B. & Porter, I. H.) 63–79 (Academic Press, 1977).
Forabosco, A., Percesepe, A. & Santucci, S. Incidence of non-age-dependent chromosomal abnormalities: a population-based study on 88965 amniocenteses. Eur. J. Hum. Genet. 17, 897–903 (2009).
Malhotra, A. et al. Breakpoint profiling of 64 cancer genomes reveals numerous complex rearrangements spawned by homology-independent mechanisms. Genome Res. 23, 762–776 (2013).
Conrad, D. F. et al. Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat. Genet. 42, 385–391 (2010).
Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res. 20, 623–635 (2010).
Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 (2011).
Kidd, J. M. et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837–847 (2010).
Quinlan, A. R. & Hall, I. M. Characterizing complex structural variation in germline and somatic genomes. Trends Genet. 28, 43–53 (2012).
Ruderfer, D. M. et al. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes. Nat. Genet. 48, 1107–1111 (2016).
Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154 (2010).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013).
Visel, A., Minovitsky, S., Dubchak, I. & Pennacchio, L. A. VISTA Enhancer Browser—a database of tissue-specific human enhancers. Nucleic Acids Res. 35, D88–D92 (2007).
Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database 2017, bax028 (2017).
Lesurf, R. et al. ORegAnno 3.0: a community-driven resource for curated regulatory annotation. Nucleic Acids Res. 44, D126–D132 (2016).
Abyzov, A., Urban, A. E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21, 974–984 (2011).
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Rodriguez, J. M. et al. APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Res. 41, D110–D117 (2013).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/201178v3 (2018).
Tan, A., Abecasis, G. R. & Kang, H. M. Unified representation of genetic variants. Bioinformatics 31, 2202–2204 (2015).
McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Ganel, L., Abel, H. J. & Hall, I. M. SVScore: an impact prediction tool for structural variation. Bioinformatics 33, 1083–1085 (2017).
Derrien, T. et al. Fast computation and applications of genome mappability. PLoS One 7, e30377 (2012).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
Griffith, O. L. et al. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res. 36, D107–D113 (2008).
Bejerano, G. et al. Ultraconserved elements in the human genome. Science 304, 1321–1325 (2004).
Yip, K. Y. et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 13, R48 (2012).
Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).
Ashoor, H., Kleftogiannis, D., Radovanovic, A. & Bajic, V. B. DENdb: database of integrated human enhancers. Database 2015, bav085 (2015).
Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Faust, G. G. & Hall, I. M. YAHA: fast and flexible long-read alignment with optimal breakpoint detection. Bioinformatics 28, 2417–2424 (2012).

Acknowledgements

We thank staff at the NHGRI for supporting this effort. This study was funded by NHGRI CCDG awards to Washington University in St Louis (UM1 HG008853), the Broad Institute of MIT and Harvard (UM1 HG008895), Baylor College of Medicine (UM1 HG008898) and New York Genome Center (UM1 HG008901); an NHGRI GSP Coordinating Center grant to Rutgers (U24 HG008956); and a Burroughs Wellcome Fund Career Award to I.M.H. Additional data production at Washington University in St Louis was funded by a separate NHGRI award (5U54HG003079). We thank S. Sunyaev for comments on the manuscript; T. Teshiba for coordinating samples for FINRISK and EUFAM sequencing; and the staff and participants of the ARIC study for their contributions; and we acknowledge all individuals who were involved in the collection of samples that were analysed for this study. Data production for EUFAM was funded by 4R01HL113315-05; the Metabolic Syndrome in Men (METSIM) study was supported by grants to M. Laakso from the Academy of Finland (no. 321428), the Sigrid Juselius Foundation, the Finnish Foundation for Cardiovascular Research, Kuopio University Hospital and the Centre of Excellence of Cardiovascular and Metabolic Diseases supported by the Academy of Finland; data collection for the CEPH pedigrees was funded by the George S. and Dolores Doré Eccles Foundation and NIH grants GM118335 and GM059290; study recruitment at Washington University in St Louis was funded by the DDRCC (NIDDK P30 DK052574) and the Helmsley Charitable Trust; study recruitment at Cedars-Sinai was supported by the F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, NIH/NIDDK grants P01 DK046763 and U01 DK062413 and the Helmsley Charitable Trust; study recruitment at Intermountain Medical Center was funded by the Dell Loy Hansen Heart Foundation; the Late Onset Alzheimer's Disease (LOAD) study was funded by grants to T. Foroud (U24AG021886, U24AG056270, U24AG026395 and R01AG041797); the Atherosclerosis Risk in Communities (ARIC) study was funded by the NHLBI (HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700004I and HHSN268201700005I); and the PAGE programme is funded by the NHGRI with co-funding from the NIMHD (U01HG007416, U01HG007417, U01HG007397, U01HG007376 and U01HG007419). Samples from the BioMe Biobank were provided by The Charles Bronfman Institute for Personalized Medicine at the Icahn School of Medicine at Mount Sinai. The Hispanic Community Health Study/Study of Latinos was carried out as a collaborative study supported by the NHLBI (N01-HC65233, N01-HC65234, N01-HC65235, N01-HC65236 and N01-HC65237), with contributions from the NIMHD, NIDCD, NIDCR, NIDDK, NINDS and NIH ODS. The Multiethnic Cohort (MEC) study is funded through the NCI (R37CA54281, R01 CA63, P01CA33619, U01CA136792 and U01CA98758). For the Stanford Global Reference Panel, individuals from Puno, Peru were recruited by J. Baker and C. Bustamante, with funding from the Burroughs Wellcome Fund, and individuals from Rapa Nui (Easter Island) were recruited by K. Sandoval Mendoza and A. Moreno Estrada, with funding from the Charles Rosenkranz Prize for Health Care Research in Developing Countries. The Women’s Health Initiative (WHI) programme is funded by the NHLBI (HHSN268201100046C, HHSN268201100001C, HHSN268201100002C, HHSN268201100003C, HHSN268201100004C and HHSN271201100004C). The GALA II study and E. G. Burchard are supported by the Sandler Family Foundation, the American Asthma Foundation, the RWJF Amos Medical Faculty Development Program, the Harry Wm. and Diana V. Hind Distinguished Professor in Pharmaceutical Sciences II, the NHLBI (R01HL117004, R01HL128439, R01HL135156 and X01HL134589), the NIEHS (R01ES015794, R21ES24844), the NIMHD (P60MD006902, R01MD010443, RL5GM118984) and the Tobacco-Related Disease Research Program (24RT-0025). We acknowledge the following GALA II co-investigators for recruitment of individuals, sample processing and quality control: C. Eng, S. Salazar, S. Huntsman, D. Hu, A. C. Y. Mak, L. Caine, S. Thyne, H. J. Farber, P. C. Avila, D. Serebrisky, W. Rodriguez-Cintron, J. R. Rodriguez-Santana, R. Kumar, L. N. Borrell, E. Brigino-Buenaventura, A. Davis, M. A. LeNoir, K. Meade, S. Sen and F. Lurmann, and we thank the staff and participants who contributed to the GALA II study.

Author information

Richard K. Wilson
Present address: Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, USA
These authors contributed equally: Haley J. Abel, David E. Larson

Authors and Affiliations

McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
Haley J. Abel, David E. Larson, Allison A. Regier, Colby Chiang, Indraniel Das, Krishna L. Kanchi, Elizabeth Appelbaum, Lei Chen, Ryan Christ, Lisa Cook, Matthew Cordes, Laura Courtney, Tracie Deluca, Catrina Fronick, Lucinda Fulton, Robert Fulton, Liron Ganel, Bo Ji, Chul Joo Kang, Adam E. Locke, Amy Ly, Joanne Nelson, Jennifer Ponce, Jason Waligorski, Richard K. Wilson, Erica Young, Susan K. Dutcher, Nathan O. Stitziel & Ira M. Hall
Department of Genetics, Washington University School of Medicine, St Louis, MO, USA
Haley J. Abel, David E. Larson, Susan K. Dutcher, Nathan O. Stitziel & Ira M. Hall
Department of Medicine, Washington University School of Medicine, St Louis, MO, USA
Allison A. Regier, Adam E. Locke, Rodney D. Newberry, Erica Young, Nathan O. Stitziel & Ira M. Hall
BioFrontiers Institute, University of Colorado, Boulder, CO, USA
Ryan M. Layer
Department of Computer Science, University of Colorado, Boulder, CO, USA
Ryan M. Layer
Broad Institute of MIT and Harvard, Cambridge, MA, USA
Benjamin M. Neale, Eric Banks, Mark J. Daly, Patrick T. Ellinor, Yossi Farjoun, Stacy Gabriel, Namrata Gupta, Daniel Howrigan, Sek Kathiresan, Amit Khera, Robert Maier, Aarno Palotie, Samuli Ripatti, Christine Stevens, Kathleen Tibbetts, Charlotte Tolonen & Eric S. Lander
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Benjamin M. Neale
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
Benjamin M. Neale
Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
William J. Salerno, Eric Boerwinkle, Huyen Dinh, Harsha Doddapaneni, Richard A. Gibbs, Megan L. Grove, Yi Han, Jianhong Hu, Ziad Khan, Olga Krasheninina, Vipin Menon, Ginger A. Metcalf, Zeineen Momin, Caitlin Nessner, Jireh Santibanez, Kimberly Walker & Donna M. Muzny
New York Genome Center, New York, NY, USA
Catherine Reeves, Toby Bloom, Robert B. Darnell, Shailu Gargeya, Goren Germer, Lily Khaira, Tuuli Lappalainen, Tom Maniatis, Guiseppe Narzisi, Michael Wigler, Lara Winterkorn & Michael C. Zody
Department of Statistics, Rutgers University, Piscataway, NJ, USA
Steven Buyske
Department of Genetics, Rutgers University, Piscataway, NJ, USA
Jinchuan Xing, Yeting Zhang & Tara C. Matise
Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
Eric S. Lander
Department of Systems Biology, Harvard Medical School, Boston, MA, USA
Eric S. Lander
Department of Biostatistics and Center for Statistical Genetics, University of Michigan, School of Public Health, Ann Arbor, MI, USA
Goncalo R. Abecasis, Michael Boehnke & Hyun Min Kang
Department of Genetics, Stanford University, Stanford, CA, USA
Julie Baker, Carlos D. Bustamante & Genevieve Wojcik
Department of Psychiatry and Behavioral Sciences, University of Washington, Seattle, WA, USA
Raphael A. Bernier
Human Genetics Center and Department of Epidemiology, University of Texas Health Science Center, Houston, TX, USA
Eric Boerwinkle, Goo Jun, Paul De Vries & Bing Yu
The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Erwin P. Bottinger, Judy H. Cho, Eimear E. Kenny & Ruth J. F. Loos
Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
Steven R. Brant
Department of Bioengineering, University of California, San Francisco, San Francisco, CA, USA
Esteban G. Burchard
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Judy H. Cho & Eimear E. Kenny
Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Judy H. Cho
MRC/BHF Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
Rajiv Chowdhury
Intermountain Heart Institute, Intermountain Medical Center, Murray, UT, USA
Michael J. Cutler
Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
Mark J. Daly, Aarno Palotie & Samuli Ripatti
Analytical and Translational Genetics Unit, Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
Mark J. Daly, Daniel Howrigan, Robert Maier, Aarno Palotie & Samuli Ripatti
Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Scott M. Damrauer
Laboratory of Molecular Neuro-Oncology, The Rockefeller University, New York, NY, USA
Robert B. Darnell
Howard Hughes Medical Institute, The Rockefeller University, New York, NY, USA
Robert B. Darnell
Department of Genome Science, University of Washington, Seattle, WA, USA
Evan E. Eichler & Tychele Turner
Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
Evan E. Eichler
Division of Cardiology, Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
Patrick T. Ellinor, Sek Kathiresan & Amit Khera
National Laboratory of Genomics for Biodiversity (LANGEBIO), CINVESTAV, Irapuato, Mexico
Andres M. Estrada & Karla S. Mendoza
National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Adam Felsenfeld, Carolyn Hutter, Heidi Sofia, Taylorlyn Stephan & Vivian Ota Wang
Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA
Tatiana Foroud
Center for Neurobehavioral Genetics, Jane and Terry Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles, Los Angeles, CA, USA
Nelson B. Freimer
Program in Neurogenetics, Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
Daniel H. Geschwind
Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
Daniel H. Geschwind
Institute of Precision Health, University of California, Los Angeles, Los Angeles, CA, USA
Daniel H. Geschwind
Institute for Genomic Medicine, Columbia University Medical Center, New York, NY, USA
David B. Goldstein
Department of Genetics and Development, Columbia University Medical Center, New York, NY, USA
David B. Goldstein
Department of Preventative Medicine, University of Southern California, Los Angeles, CA, USA
Christopher A. Haiman
Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
Ivan Iossifov & Michael Wigler
Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
Lynn B. Jorde & Aaron Quinlan
Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA, USA
John Kane & Clive Pullinger
Center for Genomic Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
Sek Kathiresan & Amit Khera
The Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Eimear E. Kenny
Center for Statistical Genetics, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Eimear E. Kenny
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Charles Kooperberg & Ulrike Peters
Department of Medicine, Duke University, Durham, NC, USA
William E. Kraus & Svati H. Shah
Department of Pediatrics, Emory University School of Medicine, Atlanta, GA, USA
Subra Kugathasan
Institute of Clinical Medicine, Internal Medicine, University of Eastern Finland, Kuopio, Finland
Markku Laakso
Department of Systems Biology, Columbia University, New York, NY, USA
Tuuli Lappalainen
Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
Tom Maniatis
Cancer Center, University of Hawaii, Honolulu, HI, USA
Loic Le Marchand
Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
Gregory M. Marcus
Department of Neurology, Columbia University, New York, NY, USA
Richard P. Mayeux
F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
Dermot P. B. McGovern & Stephan R. Targan
Department of Epidemiology, University of North Carolina, Chapel Hill, NC, USA
Kari E. North
Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Daniel J. Rader
Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA
Stephen S. Rich
Department of Medicine, Vanderbilt University, Nashville, TN, USA
Dan M. Roden & M. Benjamin Shoemaker
National Institute for Health and Welfare, Helsinki, Finland
Veikko Salomaa
Research Programs Unit, Diabetes and Obesity, University of Helsinki, Helsinki, Finland
Marja-Riitta Taskinen
Heart and Lung Centre, Helsinki University Hospital, Helsinki, Finland
Marja-Riitta Taskinen

Consortia

NHGRI Centers for Common Disease Genomics

Goncalo R. Abecasis, Elizabeth Appelbaum, Julie Baker, Eric Banks, Raphael A. Bernier, Toby Bloom, Michael Boehnke, Eric Boerwinkle, Erwin P. Bottinger, Steven R. Brant, Esteban G. Burchard, Carlos D. Bustamante, Lei Chen, Judy H. Cho, Rajiv Chowdhury, Ryan Christ, Lisa Cook, Matthew Cordes, Laura Courtney, Michael J. Cutler, Mark J. Daly, Scott M. Damrauer, Robert B. Darnell, Tracie Deluca, Huyen Dinh, Harsha Doddapaneni, Evan E. Eichler, Patrick T. Ellinor, Andres M. Estrada, Yossi Farjoun, Adam Felsenfeld, Tatiana Foroud, Nelson B. Freimer, Catrina Fronick, Lucinda Fulton, Robert Fulton, Stacy Gabriel, Liron Ganel, Shailu Gargeya, Goren Germer, Daniel H. Geschwind, Richard A. Gibbs, David B. Goldstein, Megan L. Grove, Namrata Gupta, Christopher A. Haiman, Yi Han, Daniel Howrigan, Jianhong Hu, Carolyn Hutter, Ivan Iossifov, Bo Ji, Lynn B. Jorde, Goo Jun, John Kane, Chul Joo Kang, Hyun Min Kang, Sek Kathiresan, Eimear E. Kenny, Lily Khaira, Ziad Khan, Amit Khera, Charles Kooperberg, Olga Krasheninina, William E. Kraus, Subra Kugathasan, Markku Laakso, Tuuli Lappalainen, Adam E. Locke, Ruth J. F. Loos, Amy Ly, Robert Maier, Tom Maniatis, Loic Le Marchand, Gregory M. Marcus, Richard P. Mayeux, Dermot P. B. McGovern, Karla S. Mendoza, Vipin Menon, Ginger A. Metcalf, Zeineen Momin, Guiseppe Narzisi, Joanne Nelson, Caitlin Nessner, Rodney D. Newberry, Kari E. North, Aarno Palotie, Ulrike Peters, Jennifer Ponce, Clive Pullinger, Aaron Quinlan, Daniel J. Rader, Stephen S. Rich, Samuli Ripatti, Dan M. Roden, Veikko Salomaa, Jireh Santibanez, Svati H. Shah, M. Benjamin Shoemaker, Heidi Sofia, Taylorlyn Stephan, Christine Stevens, Stephan R. Targan, Marja-Riitta Taskinen, Kathleen Tibbetts, Charlotte Tolonen, Tychele Turner, Paul De Vries, Jason Waligorski, Kimberly Walker, Vivian Ota Wang, Michael Wigler, Richard K. Wilson, Lara Winterkorn, Genevieve Wojcik, Jinchuan Xing, Erica Young, Bing Yu & Yeting Zhang

Contributions

I.M.H. conceived and directed the study. D.E.L. and H.J.A. developed the final version of the SV calling pipeline, constructed the SV callsets and performed the data analyses. C.C. and R.M.L. helped design the SV calling pipeline. A.A.R. contributed to long-read validation. I.D. was instrumental in the migration of workflows to the Google Cloud Platform. K.L.K. assisted with data management. E.S.L., B.M.N. and N.O.S. provided input on population genetic analyses. W.J.S., D.M.M., E.S.L., B.M.N., M.C.Z., C.R., T.C.M., S.B., S.K.D., I.M.H. and N.O.S. directed data production, processing and management at their respective sites, and edited the manuscript. Members of the NHGRI CCDG consortium provided samples, produced sequencing data and coordinated and administered data-sharing efforts. H.J.A., D.E.L. and I.M.H. wrote the manuscript.

Corresponding author

Correspondence to Ira M. Hall.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 SV mapping pipeline.

SVs are detected within each sample using LUMPY. Breakpoint probability distributions are used to merge and refine the position of detected SVs within a cohort, followed by parallelized re-genotyping and copy-number annotation. Samples are merged into a single cohort-level VCF file, variant types are reclassified and genotypes are refined with svtools using the combined breakpoint genotype and read-depth information. Finally, sample-level quality control (QC) and variant confidence scoring are conducted to produce the final callset.

Extended Data Fig. 2 The B37 callset.

a, Variant counts (y axis) for each sample (x axis) in the callset, ordered by cohort. Large (>1 kb) variants are shown in dark shades and smaller variants in light shades. b, Variant counts per sample, ordered by self-reported ancestry according to the colour scheme on the right. Abbreviations as in Fig. 1a. Note that African-ancestry samples show more variant calls, as expected. c, Table showing the number of variant calls by variant type and frequency class, and Mendelian error rate by variant type. d, Histogram of allele count for each variant class, showing alleles with counts ≤ 100. e, Linkage disequilibrium of each variant class as represented by maximum R² value to nearby SNVs, for n = 1,581 samples. Note that these distributions mirror those from our previous SV callset for GTEx⁴, which was characterized extensively in the context of expression quantitative trait loci.

Extended Data Fig. 3 The B38 callset.

a, Variant counts (y axis) for each sample (x axis) in the callset, ordered by cohort. Large (>1 kb) variants are shown in dark shades and smaller variants in light shades. b, Variant counts per sample, ordered by self-reported ancestry according to the colour scheme on the right. Abbreviations as in Fig. 1a. Note that African-ancestry samples show more variant calls, as expected. Note also that there is some residual variability in variant counts owing to differences in data from each sequencing centre, but that this is mainly limited to small tandem duplications (see a), primarily at STRs. c, SV length distribution by variant class. d, Distribution of the number of singleton SVs detected in samples from different ancestry groups. Only groups with ≥1,000 samples in the B38 callset are shown, and each group was subsampled down to 1,000 individuals before recalculation of the allele frequency. e, Histogram showing the resolution of SV breakpoint calls, as defined by the length of the 95% confidence interval of the breakpoint-containing region defined by LUMPY, after cross-sample merging and refinement using svtools. Data are from n = 360,614 breakpoints, 2 per variant. f, Distribution of the number of SVs detected per sample in WGS data from each sequencing centre (x axis) for African and non-African (non-AFR) samples, showing all variants (left), and those larger (middle) and smaller (right) than 1 kb in size. Per-centre counts are as follows: centre A, 1,527 AFR, 2,080 non-AFR; centre B, 408 AFR, 2,745 non-AFR; centre C, 2,953 AFR, 2,226 non-AFR; centre D, 150 AFR, 2,534 non-AFR. g, Plots of Mendelian error (ME) rate (y axis) by MSQ for each variant class. Dot size is determined by point density (right) and the thresholds used to distinguish high- and low-confidence SVs are shown by the vertical lines. All box plots indicate the median (centre line) and the first and third quartiles (box limits); whiskers extend 1.5 × IQR.

Extended Data Fig. 4 PCA for the B37 callset.

PCA was performed using a linkage disequilibrium-pruned subset of high-confidence DEL and MEI variants, with MAF > 1%. Self-reported ancestry is shown using the colour scheme on the right, with abbreviations as in Fig. 1a.

Extended Data Fig. 5 Validation of SV calls by PacBio long reads in nine control samples.

n = 9,905 variants. a, Validation rates in variant carriers (y axis) versus validation rates in non-carriers (that is, false validations; x axis) for each method of determining variant overlap, for a range of supporting-read-count thresholds. Ultra-rare variants (n = 133) are shown separately on the right. For each variant overlap method, each data point represents a distinct read-count threshold (≥1, 2, 3, 5, 10, 15 or 20 PacBio reads) that was used to determine validation of SV calls by long-read alignments. Two methods were used for determining overlap between SV coordinates and long-read alignments while accounting for positional uncertainty: (1) BEDTools pairtopair, requiring overlap between the pair of breakpoint intervals predicted by short-read SV mapping and the pair of breakpoint intervals predicted by long-read alignment, allowing 100 bp or 200 bp of ‘slop’; and (2) BEDTools intersect, requiring 90% or 95% reciprocal overlap between the coordinates spanned by the SV predicted by short-read SV mapping and the SV predicted by long-read alignment. Here, we plot the first criteria by themselves, and in pairwise combination with the latter (see key on the right of the figure). Note that Supplementary Table 3 is based on the ‘100 bp slop or 90% reciprocal overlap’ method, requiring at least two PacBio reads. b, Validation rates by frequency class for variant carriers and non-carriers with increasing PacBio supporting-read thresholds, shown using the same overlap method as in Supplementary Table 3. Variant counts per frequency class are as follows: ultra-rare, n = 133; rare, n = 734; low frequency, n = 1,361; common, n = 7,677.
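The two overlap criteria described in this legend can be illustrated with a minimal Python sketch. It is a simplified stand-in for the interval logic, not the BEDTools implementation or the authors' code: the `SVCall` container, the function names and the half-open coordinate convention are assumptions made for illustration. The defaults mirror the '100 bp slop or 90% reciprocal overlap' rule with at least two supporting PacBio reads.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # half-open (start, end); an assumed convention


@dataclass(frozen=True)
class SVCall:
    """Hypothetical container for one SV prediction (not a BEDTools type)."""
    left: Interval   # interval containing the left breakpoint
    right: Interval  # interval containing the right breakpoint

    @property
    def span(self) -> Interval:
        # Coordinates spanned by the SV, from the left to the right breakpoint.
        return (self.left[0], self.right[1])


def overlaps(a: Interval, b: Interval, slop: int = 0) -> bool:
    """True if a, padded by `slop` bp on both sides, overlaps b."""
    return a[0] - slop < b[1] and b[0] < a[1] + slop


def breakpoints_match(short: SVCall, read: SVCall, slop: int) -> bool:
    """Both breakpoint intervals must overlap, allowing `slop` bp."""
    return (overlaps(short.left, read.left, slop)
            and overlaps(short.right, read.right, slop))


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Smaller of the two fractions of a and b covered by their intersection."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return min(shared / (a[1] - a[0]), shared / (b[1] - b[0]))


def is_validated(short: SVCall, read_calls: Iterable[SVCall],
                 slop: int = 100, min_ro: float = 0.90,
                 min_reads: int = 2) -> bool:
    """Validate a short-read call if enough long reads meet either criterion."""
    support = sum(
        1 for rc in read_calls
        if breakpoints_match(short, rc, slop)
        or reciprocal_overlap(short.span, rc.span) >= min_ro
    )
    return support >= min_reads
```

Raising `min_reads` in such a sketch corresponds to moving along the read-count thresholds plotted in a and b: stricter thresholds lower both the carrier validation rate and the false validation rate in non-carriers.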

Extended Data Fig. 6 Mendelian inheritance analysis in a set of three-generation CEPH pedigrees comprising 409 parent–offspring trios.

a, Example structure of a single CEPH pedigree indicating nomenclature of the parental (P₀), first (F₁) and second (F₂) generations. b, Transmission rate of SVs from different allele-frequency classes including SVs that are unique to a single family (ultra-rare), rare (<1%), low frequency (‘low’; 1–5%) and common (>5%). c, Table showing the number and rate of Mendelian errors by allele-frequency class. d, Table showing the number and rate of Mendelian errors for SVs that are unique to a single family, for each SV type.

Extended Data Fig. 7 Comparison of SV calls and genotypes to the 1KG phase 3 callset.

a, Number of known and novel SVs in the B37 (left) and B38 (right) callsets, shown by frequency class. b, Table showing the genotypes (GT) reported in our B38 callset⁵ (rows) versus the 1KG callset (columns) at SVs identified by both studies among the five samples included in both callsets. c, Table showing genotype concordance by SV type including the fraction of concordant calls and Cohen’s κ coefficient. d, Distribution of correlation (R²) between genotype information determined by breakpoint-spanning reads and estimates of copy number (CN) determined by read-depth analysis for the SVs shown in b, c when genotype information between the B38 and the 1KG callset is concordant (left) or discordant (middle, right). At sites with discordant genotypes, correlation with copy-number information is typically higher for genotypes from the B38 callset (middle) than the 1KG callset (right).
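For readers unfamiliar with Cohen's κ, the statistic reported in c, the following Python sketch shows how it can be computed from a genotype contingency table such as the one in b. The function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table (rows: rater 1, columns: rater 2)."""
    total = sum(sum(row) for row in table)
    # Observed agreement: fraction of calls on the diagonal.
    observed = sum(table[i][i] for i in range(len(table))) / total
    # Agreement expected by chance, from the row and column marginals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Made-up counts for two genotype classes, for illustration only.
example = [[20, 5],
           [10, 15]]
kappa = cohens_kappa(example)  # observed 0.7, expected 0.5
```

Unlike the raw fraction of concordant calls, κ discounts the agreement expected by chance, which matters when one genotype class (for example, homozygous reference) dominates the table.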

Extended Data Fig. 8 Ultra-rare SVs in the B38 callset.

n = 14,623. a, Histogram showing the number of ultra-rare SVs per individual (ultra-rare is defined as singleton variants private to a single individual or nuclear family). b, Histogram showing the number of genes affected by ultra-rare SVs larger than 1 Mb in size.

Extended Data Fig. 9 Correlations between dosage sensitivity scores for CNV in the combined callset.

n = 17,795. a, Results for deletion variants. The ExAC score is the published ExAC DEL intolerance score⁴⁵; the CCDG score is similarly calculated from our data, using CCDG deletions; pLI is the published loss-of-function intolerance score from ExAC²⁷; ‘HI.Z’ is the negative of the inverse-normal transformed haploinsufficiency score from DECIPHER⁴⁶; ‘ave.ccdg.exac’ is the arithmetic mean of the CCDG and ExAC DEL intolerance scores; and ‘ave.ccdg.hi’ is the arithmetic mean of the CCDG and HI-Z scores. The correlations shown are Spearman rank correlations (rho); P values are calculated by two-sided Spearman rank correlation test; and N represents the number of genes included in the test. b, Results for duplication variants, using the same naming conventions as in a.

Extended Data Table 1 Ancestry, ethnicity and continental origin of the samples analysed in this study

Supplementary information

Supplementary Information

This file contains: (1) Supplementary Note: SV Callset Quality Assessment; (2) Supplementary References; (3) Supplementary Table legends (Supplementary Table files provided separately in Excel format); (4) Supplementary Files; and (5) Consortium Members: NHGRI Centers for Common Disease Genomics.

Reporting Summary

Supplementary Table 1

Description of which callset and sample subsets were used for each of the major analyses in the study. See main Supplementary Information PDF for full legend.

Supplementary Table 2

Description of PacBio long-read datasets used for SV validation analyses. See main Supplementary Information PDF for full legend.

Supplementary Table 3

SV validation rate analysis using split-read mapping with deep coverage (>60x) PacBio long-read WGS data. See main Supplementary Information PDF for full legend.

Supplementary Table 4

SV detection sensitivity analysis based on long-read SV calls from the Human Genome Structural Variation Consortium (HGSVC). See main Supplementary Information PDF for full legend.

Supplementary Table 5

Number of variants represented in Fig. 2d. See main Supplementary Information PDF for full legend.

Supplementary Table 6

Number of variants for each category in Fig. 3a and 3b. See main Supplementary Information PDF for full legend.

Supplementary Table 7

Data Availability. See main Supplementary Information PDF for full legend.

About this article

Cite this article

Abel, H.J., Larson, D.E., Regier, A.A. et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583, 83–89 (2020). https://doi.org/10.1038/s41586-020-2371-0

Received: 29 December 2018
Accepted: 18 May 2020
Published: 27 May 2020
Issue Date: 02 July 2020
DOI: https://doi.org/10.1038/s41586-020-2371-0
