An integrated map of structural variation in 2,504 human genomes

Journal name: Nature
Volume: 526
Pages: 75–81
DOI: doi:10.1038/nature15394

Abstract

Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.

At a glance

Figures

  1. Figure 1: Phase 3 integrated SV callset.

    a, Novelty based on overlap of our SV set with DGV (ref. 19; upper panel, broken down by SV class), of collapsed CNVRs with earlier 1000 Genomes Project releases (refs 6, 8; middle panel) and of our SV set with refs 6, 8 (bottom panel). b, Size distribution of ascertained SVs (bin width is uniform in log-scale). DEL, biallelic deletion; DUP, biallelic duplication; INV, inversion; INS, non-reference insertion (including MEIs and NUMTs). c, Breakpoint precision of assembled deletions stratified by VAF (results from the split-read caller Pindel, ref. 23, shown separately). d, SV allele sharing across continental groups. e, LD properties of biallelic SV classes.

  2. Figure 2: SV functional impact.

    a, Relative enrichment or depletion of genomic elements within breakpoint-resolved deletions binned by VAF. TF, transcription factor binding site; nc, noncoding. RVIS ranges from 0 to 100 (low < 20, medium 20–50, high ≥ 50). *, no element intersected. b, Enrichment/depletion of genomic elements within different SV classes, compared with breakpoint-resolved deletions. c, Manhattan plot of DUSP22-eQTL. Inset, boxplots of association between copy-number genotype and expression. d, Manhattan plot of ZNF43-eQTL. e, Enrichment of SV-containing haplotypes at previously reported GWAS hits (error bars show s.e.m.).

  3. Figure 3: SV complexity at different scales.

    a, PSG locus with clustered SVs. Population copy-number state histograms are shown for two example SVs. b, Schemes depicting assembled complex deletions. c, Smaller-scale complex deletions identified with Pindel (ref. 23). Flanking sequences are shown for reference (REF) and alternate (ALT) alleles, in addition to insertions at the breakpoints. Proximal stretches matching the insertion are labelled in red (forward) and green (reverse complement). Blue, insertions lacking nearby matches. d, Alignment dot plots depicting inversions (inverted sequences are in red within each dot plot). Adjacent schemes depict allelic structures for REF and ALT. e, Inversion complexity summarized.

  4. Extended Data Fig. 1: Construction of the SV release and intensity rank sum validation.

    a, Approach used for constructing our SV release set. b, Intensity rank sum (IRS) validation results for deletions in different size bins. c, IRS validation results for deletions in variant allele frequency (VAF) bins. d, IRS results for duplications in different size bins. e, IRS validation results for duplications in VAF bins. Based on Affymetrix SNP6 array probes, the IRS FDR for all SV length and VAF bins was ≤5.4%, requiring at least 100 SVs per bin with an IRS assigned P-value.

  5. Extended Data Fig. 2: This figure shows the number of SV sites in our phase 3 release relative to allele frequency expressed in terms of allele count.

    SVs down to an allele count of 1 (corresponding to VAF = 0.0002) are represented in our phase 3 SV set (with the exception of mCNVs, denoted ‘CNV’ in this figure, which are defined as sites of multi-allelic variation thus requiring allele count ≥2, hence no mCNV sites are ascertained for allele count = 1).

  6. Extended Data Fig. 3: Size and population distribution of different SV classes.

    a, Variants ascertained in the 1000 Genomes Project pilot phase (ref. 6; light grey) as well as the recent publication of SVs ascertained by PacBio sequencing in the CHM1 genome (ref. 14; grey) are displayed for comparison in this SV size distribution figure (INS, used as abbreviation for MEIs and NUMTs in this display item). b, Population distribution of SV allele sharing across continental groups for different SV classes. c, Cumulative distributions of the number of events as a function of size by SV class.

  7. Extended Data Fig. 4: LD properties of various SV classes.

    a, LD properties of deletions, broken down by continental group and shown as a function of VAF. b, LD properties of duplications. c, LD properties of Alu, L1 and SVA mobile element insertions. d, LD properties of inversions (with breakdown for two independent inversion sets generated with our inversion discovery algorithm Delly; that is, CINV = one-sided inversions with support for one breakpoint; INV = two-sided inversions with support for both breakpoints; these two sets are combined into the joint phase 3 SV group inversion set).

  8. Extended Data Fig. 5: Population genetic properties of SVs.

    a, Deletion heterozygosity and homozygosity among human populations for a subset of high-confidence deletions. Populations from the African continental group (AFR) exhibit the highest levels of heterozygosity and thus diversity among humans, but show the overall lowest level of deletion homozygosity among all continental groups. By comparison, East Asian populations exhibit the lowest levels of deletion heterozygosity and the highest levels of homozygosity. Het., heterozygous; Hom., homozygous. b, VAF distribution of major SV classes. Bi-allelic duplications represent a notable outlier, showing a striking depletion of common alleles, which can be explained by the propensity of genomic sites of duplication to undergo recurrent rearrangement (see main text). As a consequence, most common duplications are classified as multi-allelic variants (that is, mCNVs). c, The number of base pairs (bp) differing among individuals within and between continental groups for deletions (upper panel) and SNPs (middle panel) contrasted with the ratio of deletion bp differences to SNP bp differences (‘deletion bp/SNP bp’) among groups (lower panel). Non-African groups exhibit a higher ‘deletion bp/SNP bp’ compared to Africans. d, Neighbour-joining tree of populations constructed from MEIs (homoplasy-free markers) to provide a (simplified) view of population ancestry. The tree is labelled with the number of lineage-specific MEIs (Alu:L1:SVA). e, Classification of ancestry in AFR/AMR and AMR admixed populations using homoplasy-free ancestry informative MEI markers. Colour usage follows the same scheme as in Fig. 1d, except in the case of AFR individuals, which use both the colour in Fig. 1d and another colour that is unrelated to any other figure to indicate additional substructure within this group.

  9. Extended Data Fig. 6: Principal component analysis and population stratification of SVs.

    a, Principal component analysis (PCA) plot of principal components 1 and 2 for deletions. b, PCA plot of principal components 3 and 4 for deletions. c, PCA plot of principal components 1 and 2 for MEIs. d, PCA plot of principal components 3 and 4 for MEIs. e, The five most highly population-stratified deletions intersecting protein-coding genes based on VST. f, The five most highly population-stratified duplications and multi-allelic copy number variants (mCNVs) intersecting protein-coding genes based on VST. For abbreviations, see Supplementary Table 1.

  10. Extended Data Fig. 7: Enrichment of functional elements intersecting SVs.

    a, Shadow figure of Fig. 2a. Overlap enrichment analysis of deletions (with resolved breakpoints) versus genomic elements, using the partial overlap statistic, with deletions categorized into VAF bins. b, Similar to a, except that the engulf overlap statistic is used instead of the partial overlap statistic. The engulf overlap statistic is the count of genomic elements (for example, CDS) that are fully embedded in at least one SV interval (for example, deletions). *, no intersected element observed within the data set. c, Similar to a and b, with the enrichment/depletion analysis pursued for common SNPs as well as rarer single nucleotide polymorphisms/variants (SNVs). Common SNV alleles show the highest levels of depletion for investigated genomic elements. d, Overlap enrichment analysis of various SV types versus genomic elements, using the partial overlap statistic.

  11. Extended Data Fig. 8: SV-eQTL analysis.

    a, SV-centric eQTL analysis of coding SVs. Shown is the proportion of coding SVs that are eQTLs as a function of the minimum VAF and the expression quartile. b, Total number of coding SVs for corresponding filters. Common SVs (VAF > 0.2) in highly expressed genes (>75% quantile) are very likely to correspond to SV-eQTLs (54%, see also Supplementary Table 8). c, For all genes with significant eQTLs (FDR < 10%), shown are raw P-values considering only SNPs (x axis) or only SVs (y axis). Genes with (strict lead) SV-eQTLs are shown in red. Genes with a SNP lead eQTL that is in linkage with an SV (r² > 0.5) are shown in orange. SNP lead eQTLs without an SV in LD are shown in blue. d, Relative eQTL effect sizes for genic and intergenic SV eQTLs (n = 239) either with an SV-eQTL or an LD tagged SV (in log abundance scale). Shown are regression trends for both genic and intergenic SV eQTLs. For genic eQTLs, a clear relationship between SV size and effect size is found. For example, genic SVs >10 kb have threefold larger effect sizes compared to genic SVs < 1 kb; P = 0.004; t-test.

  12. Extended Data Fig. 9: SV clustering and breakpoint analysis.

    a–c, Extensive clustering of recurrent SVs into CNVRs appears unrelated to the extent of segmental duplications (a) and correlates only partially with SNP diversity (b) and GC content (c). Breakdown of SV mechanism classifications based on criteria from two earlier studies (refs 6, 40). Shown are results for deletions with nucleotide resolved breakpoints. BreakSeq was used for mechanism inference. d, 1KG_P3: breakdown for our 1000 Genomes Project phase 3 SV callset using classification criteria from ref. 6. e, Conrad_2010: summary of mechanism classification results published in ref. 40. f, Mills_2011: summary of mechanism classification results published in ref. 6. g, 1KG_P3_Conrad: breakdown for our 1000 Genomes Project phase 3 SV callset using classification criteria from ref. 40. Mechanism classification was pursued using four different categories. Blue, non-allelic homologous recombination (NAHR); green, mobile elements inserted into the reference genomes (appearing deleted in this analysis); red, non-homology-based rearrangement mechanisms (NHR), such as NHEJ, microhomology-mediated end-joining and microhomology-mediated break-induced replication (involving blunt-ended deletion breakpoints or breakpoints with microhomology); purple, expansion or shrinkage of variable numbers of tandem repeats (VNTRs). TEI, transposable element insertion (equivalent to MEI). h, Distribution of lengths of micro-homology (MH) for complex SVs, measured between deletion and corresponding template site boundaries. Simple deletions, which based on BreakSeq were inferred to be formed by a non-homology-based SV formation mechanism, such as NHEJ and microhomology-mediated break-induced replication (Supplementary Table 3), are shown as an additional control (here denoted ‘blunt NH deletions’). i, Origins of inserted sequences in complex deletions inferred by split read analysis. This figure depicts examples for each class shown in Supplementary Table 13.

  13. Extended Data Fig. 10: Examples of inversions identified in the SV release.

    a–e, Five classifications of inversions verified using PacBio and MinION reads are represented: Simple Inversion (a), inv-dup (b), inv-del (c), MultiDel with Inv (here abbreviated as inv-2dels) (d) and complex (e). f, Several further examples of inverted duplications (inv-dup), the most common form of inversion-associated SV identified in the phase 3 release set. The figure depicts DNA sequence alignment dotplots (same arrangement as in Fig. 3), with the y axis referring to PacBio DNA single molecule sequencing reads and the x axis referring to the reference genome assembly (hg19). Inverted sequences are highlighted in red. Sequence analysis suggests that these inverted duplications are not typically associated with retrotransposition.
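
The two overlap statistics named in the legend of Extended Data Fig. 7 can be illustrated with a minimal Python sketch. The interval representation (half-open start/end pairs on a single chromosome), the function names and the toy coordinates are assumptions made for illustration and are not taken from the study's pipeline. The legend defines only the engulf statistic; the partial statistic is assumed here to count elements that share at least one base with any SV.

```python
from typing import List, Tuple

# Assumed representation: half-open [start, end) intervals on one chromosome.
Interval = Tuple[int, int]


def partial_overlap_count(elements: List[Interval], svs: List[Interval]) -> int:
    """Count elements sharing at least one base with any SV (assumed definition)."""
    return sum(
        any(e_start < s_end and s_start < e_end for s_start, s_end in svs)
        for e_start, e_end in elements
    )


def engulf_overlap_count(elements: List[Interval], svs: List[Interval]) -> int:
    """Count elements fully embedded in at least one SV interval."""
    return sum(
        any(s_start <= e_start and e_end <= s_end for s_start, s_end in svs)
        for e_start, e_end in elements
    )


# Hypothetical toy data: three coding segments and two deletions.
cds = [(100, 200), (250, 300), (900, 950)]
deletions = [(50, 260), (1000, 1100)]

partial = partial_overlap_count(cds, deletions)
engulf = engulf_overlap_count(cds, deletions)
print(partial, engulf)
```

In this toy case the first segment lies entirely inside a deletion and the second is only clipped by it, so the engulf count is smaller than the partial count. An enrichment or depletion estimate would additionally require a comparison against a null expectation, which this sketch does not attempt.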

References

  1. Weischenfeldt, J., Symmons, O., Spitz, F. & Korbel, J. O. Phenotypic impact of genomic structural variation: insights from and for human disease. Nature Rev. Genet. 14, 125–138 (2013)
  2. Malhotra, D. & Sebat, J. CNVs: harbingers of a rare variant revolution in psychiatric genetics. Cell 148, 1223–1241 (2012)
  3. Hastings, P. J., Lupski, J. R., Rosenberg, S. M. & Ira, G. Mechanisms of change in gene copy number. Nature Rev. Genet. 10, 551–564 (2009)
  4. Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nature Rev. Genet. 12, 363–376 (2011)
  5. Wellcome Trust Case Control Consortium. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713–720 (2010)
  6. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 (2011)
  7. Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010)
  8. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)
  9. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010)
  10. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 (2010)
  11. Kidd, J. M. et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837–847 (2010)
  12. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420–426 (2007)
  13. Pang, A. W. et al. Towards a comprehensive structural variation map of an individual human genome. Genome Biol. 11, R52 (2010)
  14. Chaisson, M. J. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015)
  15. Teague, B. et al. High-resolution human genome structure by single-molecule analysis. Proc. Natl Acad. Sci. USA 107, 10848–10853 (2010)
  16. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature http://dx.doi.org/10.1038/nature15393 (this issue)
  17. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010)
  18. Hach, F. et al. mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic Acids Res. 42, W494–W500 (2014)
  19. MacDonald, J. R., Ziman, R., Yuen, R. K., Feuk, L. & Scherer, S. W. The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 42, D986–D992 (2014)
  20. Stewart, C. et al. A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet. 7, e1002236 (2011)
  21. Martínez-Fundichely, A. et al. InvFEST, a database integrating information of polymorphic inversions in the human genome. Nucleic Acids Res. 42, D1027–D1032 (2014)
  22. Pendleton, M. et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nature Methods 12, 780–786 (2015)
  23. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009)
  24. Kloosterman, W. P. et al. Characteristics of de novo structural changes in the human genome. Genome Res. 25, 792–801 (2015)
  25. McCarroll, S. A. et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nature Genet. 40, 1166–1174 (2008)
  26. Locke, D. P. et al. Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am. J. Hum. Genet. 79, 275–290 (2006)
  27. Handsaker, R. E. et al. Large multiallelic copy number variations in humans. Nature Genet. 47, 296–303 (2015)
  28. Simons, Y. B., Turchin, M. C., Pritchard, J. K. & Sella, G. The deleterious mutation load is insensitive to recent population history. Nature Genet. 46, 220–224 (2014)
  29. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006)
  30. Stefansson, H. et al. A common inversion under selection in Europeans. Nature Genet. 37, 129–137 (2005)
  31. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012)
  32. McVicker, G., Gordon, D., Davis, C. & Green, P. Widespread genomic signatures of natural selection in hominid evolution. PLoS Genet. 5, e1000471 (2009)
  33. Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013)
  34. Stranger, B. E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848–853 (2007)
  35. Schlattl, A., Anders, S., Waszak, S. M., Huber, W. & Korbel, J. O. Relating CNVs to transcriptome data at fine resolution: assessment of the effect of variant size, type, and overlap with functional regions. Genome Res. 21, 2004–2013 (2011)
  36. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013)
  37. Moore, T. & Dveksler, G. S. Pregnancy-specific glycoproteins: complex gene families regulating maternal-fetal interactions. Int. J. Dev. Biol. 58, 273–280 (2014)
  38. Girirajan, S. et al. Relative burden of large CNVs on a range of neurodevelopmental phenotypes. PLoS Genet. 7, e1002334 (2011)
  39. International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005)
  40. Conrad, D. F. et al. Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nature Genet. 42, 385–391 (2010)


Author information

  1. These authors contributed equally to this work.

    • Peter H. Sudmant,
    • Tobias Rausch,
    • Eugene J. Gardner,
    • Robert E. Handsaker,
    • Alexej Abyzov,
    • John Huddleston,
    • Yan Zhang &
    • Kai Ye
  2. These authors jointly supervised this work.

    • Ryan E. Mills,
    • Mark B. Gerstein,
    • Ali Bashir,
    • Oliver Stegle,
    • Scott E. Devine,
    • Charles Lee,
    • Evan E. Eichler &
    • Jan O. Korbel

Affiliations

  1. Department of Genome Sciences, University of Washington, 3720 15th Avenue NE, Seattle, Washington 98195-5065, USA

    • Peter H. Sudmant,
    • John Huddleston,
    • Fereydoun Hormozdiari,
    • Maika Malig,
    • Mark J. P. Chaisson,
    • Bradley J. Nelson &
    • Evan E. Eichler
  2. European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Meyerhofstrasse 1, 69117 Heidelberg, Germany

    • Tobias Rausch,
    • Markus Hsi-Yang Fritz,
    • Adrian M. Stütz,
    • Sascha Meiers,
    • Benjamin Raeder,
    • Andreas Schlattl,
    • Andreas Untergasser,
    • Thomas Zichner &
    • Jan O. Korbel
  3. Institute for Genome Sciences, University of Maryland School of Medicine, 801 W Baltimore Street, Baltimore, Maryland 21201, USA

    • Eugene J. Gardner &
    • Scott E. Devine
  4. Department of Genetics, Harvard Medical School, Boston, 25 Shattuck Street, Boston, Massachusetts 02115, USA

    • Robert E. Handsaker,
    • Seva Kashin &
    • Steven A. McCarroll
  5. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, Massachusetts 02142, USA

    • Robert E. Handsaker,
    • Seva Kashin &
    • Steven A. McCarroll
  6. Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic, 200 First Street SW, Rochester, Minnesota 55905, USA

    • Alexej Abyzov &
    • Taejeong Bae
  7. Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA

    • John Huddleston &
    • Evan E. Eichler
  8. Program in Computational Biology and Bioinformatics, Yale University, BASS 432 & 437, 266 Whitney Avenue, New Haven, Connecticut 06520, USA

    • Yan Zhang,
    • Jieming Chen,
    • Xinmeng Jasmine Mu,
    • Jing Zhang &
    • Mark B. Gerstein
  9. Department of Molecular Biophysics and Biochemistry, School of Medicine, Yale University, 266 Whitney Avenue, New Haven, Connecticut 06520, USA

    • Yan Zhang,
    • Jing Zhang &
    • Mark B. Gerstein
  10. The Genome Institute, Washington University School of Medicine, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA

    • Kai Ye &
    • Li Ding
  11. Department of Genetics, Washington University in St Louis, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA

    • Kai Ye &
    • Li Ding
  12. Department of Biostatistics and Center for Statistical Genetics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, USA

    • Goo Jun &
    • Jeffrey M. Kidd
  13. Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler St., Houston, Texas 77030, USA

    • Goo Jun
  14. Department of Biological Sciences, Louisiana State University, 202 Life Sciences Building, Baton Rouge, Louisiana 70803, USA

    • Miriam K. Konkel,
    • Jerilyn A. Walker &
    • Mark A. Batzer
  15. The Jackson Laboratory for Genomic Medicine, 10 Discovery 263 Farmington Avenue, Farmington, Connecticut 06030, USA

    • Ankit Malhotra,
    • Eliza Cerveira,
    • Mallory Romanovitch,
    • Chengsheng Zhang &
    • Charles Lee
  16. Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, North Carolina 28223, USA

    • Xinghua Shi &
    • Andrew Quitadamo
  17. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

    • Francesco Paolo Casale,
    • Laura Clarke,
    • Paul Flicek,
    • Xiangqun Zheng-Bradley,
    • Oliver Stegle &
    • Jan O. Korbel
  18. Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, Connecticut 06520, USA

    • Jieming Chen
  19. Department of Computational Medicine & Bioinformatics, University of Michigan, 500 S. State Street, Ann Arbor, Michigan 48109, USA

    • Gargi Dayama &
    • Ryan E. Mills
  20. The University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, Texas 77030, USA

    • Ken Chen,
    • Zechen Chong,
    • Xian Fan &
    • Wanding Zhou
  21. The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK

    • Klaudia Walter &
    • Shane McCarthy
  22. Department of Biology, Boston College, 355 Higgins Hall, 140 Commonwealth Avenue, Chestnut Hill, Massachusetts 02467, USA

    • Erik Garrison &
    • Gabor Marth
  23. Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Bronx, New York 10461, USA

    • Adam Auton &
    • Yu Kong
  24. Bina Technologies, Roche Sequencing, 555 Twin Dolphin Drive, Redwood City, California 94065, USA

    • Hugo Y. K. Lam
  25. Cancer Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, Massachusetts 02142, USA

    • Xinmeng Jasmine Mu
  26. Department of Computer Engineering, Bilkent University, 06800 Ankara, Turkey

    • Can Alkan,
    • Elif Dal &
    • Fatma Kahveci
  27. University of California San Diego (UCSD), 9500 Gilman Drive, La Jolla, California 92093, USA

    • Danny Antaki,
    • Madhusudan Gujral,
    • Amina Noor &
    • Jonathan Sebat
  28. National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA

    • Peter Chines
  29. Department of Medicine, Washington University in St Louis, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA

    • Li Ding
  30. Siteman Cancer Center, 660 South Euclid Avenue, St Louis, Missouri 63110, USA

    • Li Ding
  31. Department of Human Genetics, University of Michigan, 1241 Catherine Street, Ann Arbor, Michigan 48109, USA

    • Sarah Emery,
    • Jeffrey M. Kidd &
    • Ryan E. Mills
  32. Molecular Epidemiology, Leiden University Medical Center, Leiden 2300RA, The Netherlands

    • Eric-Wubbo Lameijer
  33. Baylor College of Medicine, 1 Baylor Plaza, Houston, Texas 77030, USA

    • Richard A. Gibbs,
    • Min Wang &
    • Fuli Yu
  34. The Department of Physiology and Biophysics and the HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, 1305 York Avenue, Weill Cornell Medical College, New York, New York 10065, USA

    • Christopher E. Mason
  35. The Feil Family Brain and Mind Research Institute, 413 East 69th St, Weill Cornell Medical College, New York, New York 10065, USA

    • Christopher E. Mason
  36. University of Oxford, 1 South Parks Road, Oxford OX3 9DS, UK

    • Androniki Menelaou
  37. Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, 3584 CG, The Netherlands

    • Androniki Menelaou
  38. Department of Genetics and Genomic Sciences, Icahn School of Medicine, New York School of Natural Sciences, 1428 Madison Avenue, New York, New York 10029, USA

    • Donna M. Muzny,
    • Matthew Pendleton,
    • Eric E. Schadt,
    • Robert Sebra &
    • Ali Bashir
  39. Institute for Virus Research, Kyoto University, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan

    • Nicholas F. Parrish
  40. Center for Biomarker Research and Precision Medicine, Virginia Commonwealth University, 1112 East Clay Street, McGuire Hall, Richmond, Virginia 23298-0581, USA

    • Andrey A. Shabalin
  41. Zentrum für Molekulare Biologie, University of Heidelberg, Im Neuenheimer Feld 282, 69120 Heidelberg, Germany

    • Andreas Untergasser
  42. Department of Computer Science, Yale University, 51 Prospect Street, New Haven, Connecticut 06511, USA

    • Mark B. Gerstein
  43. Department of Graduate Studies – Life Sciences, Ewha Womans University, Ewhayeodae-gil, Seodaemun-gu, Seoul 120-750, South Korea

    • Charles Lee

Consortia

  1. The 1000 Genomes Project Consortium

  2. A list of participants and their affiliations appears in the Supplementary Information.

Contributions

SV discovery & genotyping: R.E.H., P.H.S., T.R., E.J.G., A.Ab., K.Y., F.H., K.C., G.D., K.W., M.H.-Y.F., S.K., C.A., S.A.M., R.E.M., K.Y., M.B.G., S.E.D., E.E.E., J.O.K.; SV merging & haplotype-integration: T.R., R.E.H., M.H.-Y.F., E.G., A.Me., S.McC.; SV validation: R.E.H., A.Ab., G.J., M.H.-Y.F., A.M.S., M.K.K., A.Ma., S.K., M.M., M.J.P.C., S.M., P.C., S.E., J.M.K., B.R., J.A.W., F.Y., T.Z., M.A.B., R.E.M., A.B., C.L., E.E.E., J.O.K.; additional analyses: A.Au., C.E.M., E.C., E.D., E.-W.L., F.K., J.H., Y.Z., X.S., F.P.C., M.M., M.J.P.C., G.M., S.M., D.A., T.B., J.C., Z.C., L.D., X.F., M.G., J.M.K., H.Y.K.L., Y.K., X.J.M., B.J.N., A.N., R.A.G., M.P., M.R., R.S., D.M.M., M.W., N.F.P., A.Q., E.E.S., A.S., A.A.S., A.U., C.Z., J.Z., W.Z., J.S., O.S.; data management & archiving: L.C., X.Z.-B., P.F.; display items: P.H.S., T.R., E.J.G., A.A., Y.Z., J.H., M.H.-Y.F., K.Y., M.B.G., A.B., O.S., R.E.M., S.E.D., E.E.E., J.O.K.; organization of Supplementary Material: G.D., J.O.K., P.H.S., R.E.M.; SV Analysis group co-chairs: C.L., E.E.E., J.O.K.; manuscript writing: P.H.S., T.R., E.J.G., J.H., R.E.M., M.B.G., O.S., S.E.D., E.E.E., J.O.K.

Competing financial interests

E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc. and is a consultant for Kunming University of Science and Technology (KUST) as part of the 1000 China Talent Program. P.F. is on the SAB of Omicia, Inc.

Corresponding authors

Correspondence to:

Sequencing data, archive accessions and supporting datasets including GRCh37 variant call files comprising the extended SV Analysis Group release set, a ‘readme’ describing differences to the phase 3 marker paper variant release (ref. 16), and a GRCh38 version of our callset, are available at http://www.1000genomes.org/phase-3-structural-variant-dataset. DGV archive accession: estd219.

Author details

Extended data figures and tables

Extended Data Figures

  1. Extended Data Figure 1: Construction of the SV release and intensity rank sum validation. (306 KB)

    a, Approach used for constructing our SV release set. b, Intensity rank sum (IRS) validation results for deletions in different size bins. c, IRS validation results for deletions in variant allele frequency (VAF) bins. d, IRS results for duplications in different size bins. e, IRS validation results for duplications in VAF bins. Based on Affymetrix SNP6 array probes, the IRS FDR for all SV length and VAF bins was ≤5.4%, requiring at least 100 SVs per bin with an IRS assigned P-value.

  2. Extended Data Figure 2: This figure shows the number of SV sites in our phase 3 release relative to allele frequency expressed in terms of allele count. (236 KB)

    SVs down to an allele count of 1 (corresponding to VAF = 0.0002) are represented in our phase 3 SV set (with the exception of mCNVs, denoted ‘CNV’ in this figure, which are defined as sites of multi-allelic variation thus requiring allele count ≥2, hence no mCNV sites are ascertained for allele count = 1).

  3. Extended Data Figure 3: Size and population distribution of different SV classes. (460 KB)

    a, Variants ascertained in the 1000 Genomes Project pilot phase6 (light grey) as well as the recent publication of SVs ascertained by PacBio sequencing in the CHM1 genome14 (grey) are displayed for comparison in this SV size distribution figure (INS, used as abbreviation for MEIs and NUMTs in this display item). b, Population distribution of SV allele sharing across continental groups for different SV classes. c, Cumulative distributions of the number of events as a function of size by SV class.

  4. Extended Data Figure 4: LD properties of various SV classes. (313 KB)

    a, LD properties of deletions, broken down by continental group and shown as a function of VAF. b, LD properties of duplications. c, LD properties of Alu, L1 and SVA mobile element insertions. d, LD properties of inversions (with breakdown for two independent inversion sets generated with our inversion discovery algorithm Delly; that is, CINV = one-sided inversions with support for one breakpoint; INV = two-sided inversions with support for both breakpoints; these two sets are combined into the joint phase3 SV group inversion set).

  5. Extended Data Figure 5: Population genetic properties of SVs. (346 KB)

    a, Deletion heterozygosity and homozygosity among human populations for a subset of high-confidence deletions. Populations from the African continental group (AFR) exhibit the highest levels of heterozygosity and thus diversity among humans, but show the overall lowest level of deletion homozygosity among all continental groups. By comparison, East Asian populations exhibit the lowest levels of deletion heterozygosity and the highest levels of homozygosity. Het., heterozygous; Hom., homozygous. b, VAF distribution of major SV classes. Bi-allelic duplications represent a notable outlier, showing a striking depletion of common alleles, which can be explained by the propensity of genomic sites of duplication to undergo recurrent rearrangement (see main text). As a consequence, most common duplications are classified as multi-allelic variants (that is, mCNVs). c, The number of base pairs (bp) differing among individuals within and between continental groups for deletions (upper panel) and SNPs (middle panel) contrasted with the ratio of deletion bp differences to SNP bp differences (‘deletion bp/SNP bp’) among groups (lower panel). Non-African groups exhibit a higher ‘deletion bp/SNP bp’ ratio than African groups. d, Neighbour-joining tree of populations constructed from MEIs (homoplasy-free markers) to provide a (simplified) view of population ancestry. The tree is labelled with the number of lineage-specific MEIs (Alu:L1:SVA). e, Classification of ancestry in AFR/AMR and AMR admixed populations using homoplasy-free ancestry informative MEI markers. Colour usage follows the same scheme as in Fig. 1d, except for AFR individuals, which are shown both in the colour used in Fig. 1d and in an additional colour, unrelated to any other figure, that indicates further substructure within this group.

  6. Extended Data Figure 6: Principal component analysis and population stratification of SVs. (719 KB)

    a, Principal component analysis (PCA) plot of principal components 1 and 2 for deletions. b, PCA plot of principal components 3 and 4 for deletions. c, PCA plot of principal components 1 and 2 for MEIs. d, PCA plot of principal components 3 and 4 for MEIs. e, The five most highly population-stratified deletions intersecting protein-coding genes based on VST. f, The five most highly population-stratified duplications and multi-allelic copy number variants (mCNVs) intersecting protein-coding genes based on VST. For abbreviations, see Supplementary Table 1.
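Panels e and f rank SVs by VST. As an illustration only, the sketch below uses a commonly cited definition, VST = (VT − VS) / VT, where VT is the variance in copy number across all individuals and VS is the population-size-weighted mean of the within-population variances. Whether the study used exactly this estimator (for example population versus sample variance) is an assumption here, and the function name and toy data are hypothetical.

```python
# Minimal sketch of a VST calculation under an assumed, commonly used definition:
#   VST = (VT - VS) / VT
# VT: variance of copy number over all individuals pooled together.
# VS: mean within-population variance, weighted by population size.
from statistics import pvariance


def vst(copy_numbers_by_population: dict[str, list[float]]) -> float:
    """Return VST for one site given per-population copy-number genotypes."""
    pooled = [cn for cns in copy_numbers_by_population.values() for cn in cns]
    total_variance = pvariance(pooled)
    if total_variance == 0:
        return 0.0  # monomorphic site: no stratification to measure
    within_variance = sum(
        len(cns) * pvariance(cns) for cns in copy_numbers_by_population.values()
    ) / len(pooled)
    return (total_variance - within_variance) / total_variance


# Toy data (hypothetical): copy number fixed at 2 in one group and 4 in another.
fully_stratified = {"POP_A": [2, 2, 2, 2], "POP_B": [4, 4, 4, 4]}
print(vst(fully_stratified))
```

Under this definition, values near 1 indicate that almost all copy-number variance lies between populations, and values near 0 indicate that populations share similar copy-number distributions.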

  7. Extended Data Figure 7: Enrichment of functional elements intersecting SVs. (338 KB)

    a, Shadow figure of Fig. 2a. Overlap enrichment analysis of deletions (with resolved breakpoints) versus genomic elements, using the partial overlap statistic, with deletions categorized into VAF bins. b, Similar to a, except that the engulf overlap statistic is used instead of the partial overlap statistic. The engulf overlap statistic is the count of genomic elements (for example, CDS) that are fully embedded in at least one SV interval (for example, a deletion). *No intersecting element observed within the data set. c, Similar to a and b, with the enrichment/depletion analysis pursued for common SNPs as well as rarer single nucleotide variants (SNVs). Common SNV alleles show the highest levels of depletion for the investigated genomic elements. d, Overlap enrichment analysis of various SV types versus genomic elements, using the partial overlap statistic.
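The difference between the two overlap statistics can be sketched as follows. The engulf definition follows the caption (elements fully embedded in at least one SV interval). The partial definition used here, counting elements that share at least one base with any SV interval, is an assumption, as are the half-open coordinate convention, the function names and the toy intervals. The enrichment test itself (comparison against a null expectation) is not shown.

```python
# Minimal sketch of the two element-versus-SV overlap counts.
# Intervals are (start, end) pairs on one chromosome, half-open: [start, end).
Interval = tuple[int, int]


def partial_overlap_count(elements: list[Interval], svs: list[Interval]) -> int:
    """Count elements sharing at least one base with any SV (assumed definition)."""
    return sum(
        any(e_start < s_end and s_start < e_end for s_start, s_end in svs)
        for e_start, e_end in elements
    )


def engulf_overlap_count(elements: list[Interval], svs: list[Interval]) -> int:
    """Count elements fully embedded in at least one SV interval."""
    return sum(
        any(s_start <= e_start and e_end <= s_end for s_start, s_end in svs)
        for e_start, e_end in elements
    )


# Toy data (hypothetical): three coding exons and two deletions.
exons = [(100, 200), (300, 400), (500, 600)]
deletions = [(50, 250), (350, 450)]

print(partial_overlap_count(exons, deletions))  # exons 1 and 2 are touched
print(engulf_overlap_count(exons, deletions))   # only exon 1 is fully deleted
```

With these toy intervals the partial count is 2 and the engulf count is 1, which shows why the engulf statistic is the stricter of the two.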

  8. Extended Data Figure 8: SV-eQTL analysis. (364 KB)

    a, SV-centric eQTL analysis of coding SVs. Shown is the proportion of coding SVs that are eQTLs as a function of the minimum VAF and the expression quartile. b, Total number of coding SVs for corresponding filters. Common SVs (VAF > 0.2) in highly expressed genes (>75% quantile) are very likely to correspond to SV-eQTLs (54%, see also Supplementary Table 8). c, For all genes with significant eQTLs (FDR < 10%), shown are raw P-values considering only SNPs (x axis) or only SVs (y axis). Genes with (strict lead) SV-eQTLs are shown in red. Genes with a SNP lead eQTL that is in linkage disequilibrium with an SV (r² > 0.5) are shown in orange. SNP lead eQTLs without an SV in LD are shown in blue. d, Relative eQTL effect sizes for genic and intergenic SV eQTLs (n = 239) either with an SV-eQTL or an LD-tagged SV (in log abundance scale). Shown are regression trends for both genic and intergenic SV eQTLs. For genic eQTLs, a clear relationship between SV size and effect size is found. For example, genic SVs >10 kb have threefold larger effect sizes compared to genic SVs <1 kb; P = 0.004; t-test.
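The three colour classes in panel c can be illustrated with a small sketch. It assumes that r² is computed as the squared Pearson correlation between genotype dosages (0, 1 or 2 copies of the alternative allele) and that a "strict lead" SV-eQTL is a gene whose best SV P-value is lower than its best SNP P-value. Both are assumptions made for illustration, and all names are hypothetical.

```python
# Minimal sketch: genotype-dosage r^2 and the red/orange/blue classes of panel c.
# Assumptions: r^2 is the squared Pearson correlation of dosages (0/1/2), and an
# SV is the "strict lead" variant when its P-value is lower than the best SNP's.
R2_THRESHOLD = 0.5  # LD threshold given in the caption


def genotype_r2(x: list[int], y: list[int]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return cov * cov / (var_x * var_y)


def colour_class(best_snp_p: float, best_sv_p: float, max_r2_to_sv: float) -> str:
    """Assign a gene to the red, orange or blue class described for panel c."""
    if best_sv_p < best_snp_p:
        return "red"      # SV is the lead eQTL variant
    if max_r2_to_sv > R2_THRESHOLD:
        return "orange"   # SNP leads, but an SV is in LD with it
    return "blue"         # SNP leads and no SV is in LD


# Toy genotypes (hypothetical): the SV perfectly tags the lead SNP.
lead_snp = [0, 1, 2, 1]
nearby_sv = [0, 1, 2, 1]
print(colour_class(1e-10, 1e-6, genotype_r2(lead_snp, nearby_sv)))
```

Phased haplotypes would allow a haplotype-based r² instead; the dosage-based version is used here only because it needs no phase information.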

  9. Extended Data Figure 9: SV clustering and breakpoint analysis. (357 KB)

    a–c, Extensive clustering of recurrent SVs into CNVRs appears unrelated to the extent of segmental duplications (a) and correlates only partially with SNP diversity (b) and GC content (c). d–g, Breakdown of SV mechanism classifications based on criteria from two earlier studies (refs 6, 40). Shown are results for deletions with nucleotide-resolved breakpoints. BreakSeq was used for mechanism inference. d, 1KG_P3: breakdown for our 1000 Genomes Project phase 3 SV callset using classification criteria from ref. 6. e, Conrad_2010: summary of mechanism classification results published in ref. 40. f, Mills_2011: summary of mechanism classification results published in ref. 6. g, 1KG_P3_Conrad: breakdown for our 1000 Genomes Project phase 3 SV callset using classification criteria from ref. 40. Mechanism classification was pursued using four different categories. Blue, non-allelic homologous recombination (NAHR); green, mobile elements inserted into the reference genomes (appearing deleted in this analysis); red, non-homology-based rearrangement mechanisms (NHR), such as NHEJ, microhomology-mediated end-joining and microhomology-mediated break-induced replication (involving blunt-ended deletion breakpoints or breakpoints with microhomology); purple, expansion or shrinkage of variable numbers of tandem repeats (VNTRs). TEI, transposable element insertion (equivalent to MEI). h, Distribution of lengths of micro-homology (MH) for complex SVs, measured between the boundaries of deletion sites and their corresponding template sites. Simple deletions, which based on BreakSeq were inferred to be formed by a non-homology-based SV formation mechanism, such as NHEJ and microhomology-mediated break-induced replication (Supplementary Table 3), are shown as an additional control (here denoted ‘blunt NH deletions’). i, Origins of inserted sequences in complex deletions inferred by split read analysis. This figure depicts examples for each class shown in Supplementary Table 13.

  10. Extended Data Figure 10: Examples of inversions identified in the SV release. (486 KB)

    a–e, Five classifications of inversions verified using PacBio and MinION reads are represented: Simple Inversion (a), inv-dup (b), inv-del (c), MultiDel with Inv (here abbreviated as inv-2dels) (d) and complex (e). f, Several further examples of inverted duplications (inv-dup), the most common form of inversion-associated SV identified in the phase 3 release set. The figure depicts DNA sequence alignment dotplots (same arrangement as in Fig. 3), with the y axis referring to PacBio single-molecule DNA sequencing reads and the x axis referring to the reference genome assembly (hg19). Inverted sequences are highlighted in red. Sequence analysis suggests that these inverted duplications are not typically associated with retrotransposition.

Supplementary information

PDF files

  1. Supplementary Information (3 MB)

    This file contains a list of abbreviations used in the study, Supplementary Text, Supplementary References and a full author list of the 1000 Genomes consortium.

Zip files

  1. Supplementary Tables (25.7 MB)

    This zipped file contains Supplementary Tables 1-16 and a Supplementary Table guide.

Additional data