Discovery and genotyping of genome structural polymorphism by sequencing on a population scale

Journal name: Nature Genetics
Volume: 43
Pages: 269–276
DOI: 10.1038/ng.768

Abstract

Accurate and complete analysis of genome variation in large populations will be required to understand the role of genome variation in complex disease. We present an analytical framework for characterizing genome deletion polymorphism in populations using sequence data that are distributed across hundreds or thousands of genomes. Our approach uses population-level concepts to reinterpret the technical features of sequence data that often reflect structural variation. In the 1000 Genomes Project pilot, this approach identified deletion polymorphism across 168 genomes (sequenced at 4× average coverage) with sensitivity and specificity unmatched by other algorithms. We also describe a way to determine the allelic state or genotype of each deletion polymorphism in each genome; the 1000 Genomes Project used this approach to type 13,826 deletion polymorphisms (48–995,664 bp) at high accuracy in populations. These methods offer a way to relate genome structural polymorphism to complex disease in populations.

Figures

  1. A population-aware analytical framework for analyzing Genome STRucture in Populations (Genome STRiP).

    (a) Population-scale sequence data contain two classes of information: technical features of the sequence data within a genome and population-scale patterns that span all the genomes analyzed. Technical features include breakpoint-spanning reads (refs. 2,3), paired-end sequences (refs. 4–6) and local variation in read depth of coverage (refs. 7–9). Genome STRiP combines these with population-scale patterns that span many genomes, including: the sharing of structural alleles by multiple genomes; the pattern of sequence heterogeneity within a population; the substitution of alternative structural alleles for each other; and the haplotype structure of human genome polymorphism. (b) Goals of structural variation (SV) analysis in Genome STRiP. 'Variation discovery' involves identifying the structural alleles that are segregating in a population. The power to observe a variant in any one genome is only partial, but the evidence defining a segregating site can be derived from many genomes at once. 'Population genotyping' requires accurately determining the allelic state of each variant in every diploid genome in a population.

  2. Identifying coherent sets of aberrantly mapping reads from a population of genomes.

    (a) Millions of end-sequence pairs from sequencing libraries show aberrant alignment locations, appearing to span vast genomic distances. Almost all of these observations derive not from true structural variants but from chimeric inserts in molecular sequencing libraries. Data shown are paired-end alignments on chromosome 5 from 41 initial genome sequencing libraries from the 1000 Genomes Project. (b) A set of 'coherently aberrant' end-sequence pairs from many genomes. At this genomic locus, paired-end sequences (sequences of the two ends of the inserts in a molecular library) fall into two classes: (i) end-sequence pairs that show the genomic spacing expected given the insert size distribution of each sequencing library, such as the three read-pair alignments for genome NA07037; and (ii) end-sequence pairs that align to genomic locations unexpectedly far apart but which relate to their expected insert size distributions by a shared correction factor (red arrows). A unifying model in which these eight read pairs from five genomes arise from a shared deletion allele (size of red arrows) converts all of these aberrant read pairs to likely observations. In the right panel, the black tick marks indicate genomic distance between left and right end sequences; the black curves indicate insert size distributions of the molecular library from which each sequence pair was drawn.
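The "shared correction factor" idea in panel b can be illustrated with a small sketch. It is not the Genome STRiP implementation: the normal insert-size model, the function names and the example read pairs are all assumptions made here for illustration. Each read pair is scored against its own library's insert-size distribution, first as observed and then after subtracting one common deletion length.

```python
import math

def insert_logpdf(insert_size, lib_mean, lib_sd):
    """Log density of an insert size under a normal model of one library.
    The normal shape is an assumption made for this sketch."""
    z = (insert_size - lib_mean) / lib_sd
    return -0.5 * z * z - math.log(lib_sd * math.sqrt(2.0 * math.pi))

def total_loglik(pairs, deletion_len):
    """Joint log-likelihood of all read pairs if each one spans a deletion of
    `deletion_len` bp, so the true insert is the reference span minus that length."""
    return sum(insert_logpdf(span - deletion_len, mean, sd)
               for span, mean, sd in pairs)

# Hypothetical data: (span on the reference, library mean insert, library sd), in bp.
pairs = [(2210, 200, 20), (2395, 400, 40), (2190, 200, 20), (2420, 400, 40)]

no_deletion = total_loglik(pairs, 0)          # every pair is an unlikely outlier
shared_deletion = total_loglik(pairs, 2000)   # one 2,000-bp deletion explains all
print(f"log-likelihood gain from the shared model: {shared_deletion - no_deletion:.1f}")
```

Because the libraries have different insert-size distributions, a single deletion length that makes every pair plausible at once is much stronger evidence than any one aberrant pair.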

  3. Evaluating the population-heterogeneity and allele-substitution properties of population-scale sequence data.

    (a) At a candidate deletion locus, the distribution across genomes of 'evidentiary reads' (read pairs suggesting the presence of a deletion allele at a locus) (blue bars) is compared to a null model under which genomes are equally likely, per molecule sequenced, to give rise to such evidentiary reads (green curve). For the locus shown, the distribution of evidentiary reads across genomes differs from the null distribution (P = 1 × 10⁻⁴), confirming that evidentiary sequence data appear differentially within the population at this locus. (b) At another genomic locus, putative structural variation–supporting read pairs arise from many genomes but in a pattern that does not significantly differ from a null distribution based on equal probability per molecule sequenced. Subsequent assays confirmed that this is not a true deletion. (c) Distribution of a population-heterogeneity statistic (from a,b) for read-pair data at 1,420 sites of known deletion polymorphism. (d) Distribution of the same population-heterogeneity statistic from read-pair data at 45,000 candidate deletion loci nominated by read-pair analysis. (e,f) If a putative deletion is real, then genomes with molecular evidence for the deletion allele would be expected to have less evidence for the reference allele ('allelic substitution'). A simple test of allelic substitution is to compare average read depth (across a putative deletion segment) between two subpopulations: the genomes with read-pair evidence for the deletion (blue curve) and the genomes lacking such evidence (black trace). The locus in e was subsequently validated as containing a real deletion; the locus in f was not. (g) Distribution of this 'subpopulation depth ratio' statistic (e,f) for sequence data at 1,420 sites of known deletion polymorphism. (h) Distribution of the same statistic for sequence data at 45,000 candidate deletion loci.
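The two tests in this caption can be sketched as follows. The caption does not state the exact heterogeneity statistic, so a chi-square-style statistic with a Monte Carlo P value is used here as a stand-in. The function names and example counts are hypothetical, and every genome is assumed to have a positive molecule count.

```python
import random

def heterogeneity_stat(evidence_counts, molecules):
    """Chi-square-style distance between observed evidentiary reads per genome
    and the counts expected if reads arise in proportion to molecules sequenced."""
    total_reads = sum(evidence_counts)
    total_mol = sum(molecules)
    stat = 0.0
    for obs, mol in zip(evidence_counts, molecules):
        expected = total_reads * mol / total_mol
        stat += (obs - expected) ** 2 / expected
    return stat

def heterogeneity_pvalue(evidence_counts, molecules, n_sim=2000, seed=0):
    """Monte Carlo P value: redistribute the reads under the null model and
    count how often the simulated statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = heterogeneity_stat(evidence_counts, molecules)
    n_reads = sum(evidence_counts)
    genomes = range(len(molecules))
    hits = 0
    for _ in range(n_sim):
        sim = [0] * len(molecules)
        for g in rng.choices(genomes, weights=molecules, k=n_reads):
            sim[g] += 1
        if heterogeneity_stat(sim, molecules) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)

def subpopulation_depth_ratio(depths, has_evidence):
    """Mean read depth of genomes with read-pair evidence for the deletion,
    divided by the mean depth of genomes without such evidence."""
    with_ev = [d for d, e in zip(depths, has_evidence) if e]
    without = [d for d, e in zip(depths, has_evidence) if not e]
    return (sum(with_ev) / len(with_ev)) / (sum(without) / len(without))

# Hypothetical locus: ten genomes sequenced to equal depth.
molecules = [100] * 10
p_concentrated = heterogeneity_pvalue([6, 5, 0, 0, 0, 0, 0, 0, 0, 0], molecules)
p_spread = heterogeneity_pvalue([1, 1, 1, 1, 1, 1, 1, 1, 1, 2], molecules)
ratio = subpopulation_depth_ratio([0.5, 0.6, 1.0, 1.1, 0.9],
                                  [True, True, False, False, False])
```

Evidence concentrated in a few genomes gives a small P value, as expected for a segregating allele, whereas evidence spread evenly in proportion to sequencing depth does not. A depth ratio well below 1 indicates that genomes carrying the putative deletion also show reduced coverage of the reference allele.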

  4. Deletion polymorphisms identified by Genome STRiP in low-coverage sequence data from 168 genomes.

    (a) Size distribution. Sensitivity for large deletions (>10 kb) is similar to that of the array-based approaches applied in large, population-scale studies (red); sensitivity for deletions smaller than 10 kb is much greater. A strong peak near 300 bp arises from ALU insertion polymorphisms; a smaller peak near 6 kb arises from L1 insertion polymorphisms. Number of evidentiary sequence reads (b) and genomes (c) contributing to each deletion discovery in population-scale sequence data. We identified 1,033 of these deletions (14.7%) with evidentiary pairs from single genomes. (d) Specificity: false discovery rates of ten deletion discovery methods evaluated by the 1000 Genomes Project in the Project's population-scale low-coverage sequence data. (e) Sensitivity: power of the same ten discovery methods in identifying known deletions as a function of the allele frequency of the deletion. (f) Localization of the breakpoints of a common deletion allele using read-pair data from many genomes. The difference between (i) the genomic separation of each read-pair sequence and (ii) the insert-size distribution of the molecular library from which it is drawn (Fig. 2b) allows a likelihood-based estimate of deletion length from each read pair (blue curves). Combining this likelihood information across many genomes (black curve) allows fine-scale localization of the breakpoint. (g) Resolution of breakpoint estimates from Genome STRiP, as estimated using Genome STRiP confidence intervals (red) and comparison to molecularly established breakpoint sequences (blue). (h) Fine-scale localization of a structural variation breakpoint facilitates directed local assembly of the deletion allele from sequence data derived from many genomes.
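Panel f's combination of per-read-pair likelihood curves has a closed form under one simplifying assumption: that each library's insert sizes are normally distributed. The per-pair curves then multiply into a single normal curve whose mean is the precision-weighted average of the individual length estimates. The sketch below illustrates only that special case; the function name and the read pairs are hypothetical, and the actual method may use empirical insert-size distributions.

```python
import math

def combine_length_estimates(pairs):
    """Combine per-pair deletion-length likelihoods, assuming normal insert sizes.

    pairs: iterable of (span on reference, library mean insert, library sd) in bp.
    Returns (combined length estimate, combined standard deviation).
    """
    precision = sum(1.0 / sd ** 2 for _, _, sd in pairs)
    weighted = sum((span - mean) / sd ** 2 for span, mean, sd in pairs)
    return weighted / precision, math.sqrt(1.0 / precision)

# Hypothetical read pairs drawn from two libraries in several genomes.
pairs = [(2210, 200, 20), (2395, 400, 40), (2190, 200, 20), (2420, 400, 40)]
length, sd = combine_length_estimates(pairs)
print(f"estimated deletion length: {length:.1f} bp (sd {sd:.1f} bp)")
```

The combined standard deviation is smaller than that of any single library, which is why pooling read pairs from many genomes sharpens the breakpoint estimate.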

  5. Determining the allelic state (genotype) of 13,826 deletions in 156 genomes.

    (a) Four of the 13,826 deletion polymorphisms analyzed, representing diverse properties in terms of size and alignability of the affected sequence. Gray vertical rectangles indicate a sequence that is repeat masked or otherwise non-alignable. The locus in the bottom row is an ALU insertion polymorphism. (b) Population-scale distribution of read depth across genomes at each of the deletion loci in a. For each locus, normalized measurements of read depth (across the deleted segment) from 156 genomes were fitted to a Gaussian mixture model. Colored squares represent genomes for which genotype could be called at 95% confidence based on read depth. (c) Genotype likelihood from read depth. Each horizontal stripe (corresponding to 1 of the 156 genomes) is divided into three sections with length proportional to the estimated relative likelihood of the sequence data given each genotype model (blue, copy-number 2; green, copy-number 1; orange, copy-number 0). (d) Genotype likelihood based on evidence from read pairs (RP) and breakpoint-spanning reads (BR). At the third locus from top, the absence of an established breakpoint sequence limits inference to read pairs. (e) Genotype likelihood based on integrating evidence from read depth (RD), read pairs (RP) and breakpoint-spanning reads (BR). (f) Genotype likelihood based on integrating evidence from c–e with flanking SNP data in a population haplotype model. (g) Population-scale sequence data at each locus as resolved into genotype classes. Traces indicate average read depth for genomes of each inferred genotype. Orange and green rectangles indicate evidentiary read pairs and breakpoint-spanning reads, colored by the genotype determination for the genome from which they arise.
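Panels c and e describe per-genome genotype likelihoods and their integration across evidence types. The sketch below is a minimal illustration, not the published model. It assumes fixed Gaussian components at normalized depths 0, 0.5 and 1 for copy numbers 0, 1 and 2 (rather than a mixture fitted to the 156 genomes), it treats the evidence sources as independent, and it uses made-up read-pair likelihoods.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def depth_likelihoods(norm_depth, sd=0.15):
    """Likelihood of a normalized read depth under copy number 0, 1 and 2.
    Component means (0.0, 0.5, 1.0) and the shared sd are assumed values."""
    return [normal_pdf(norm_depth, cn / 2.0, sd) for cn in (0, 1, 2)]

def combine(*likelihood_sets):
    """Multiply likelihoods from independent evidence types, then rescale
    to relative likelihoods that sum to 1."""
    combined = [1.0, 1.0, 1.0]
    for lk in likelihood_sets:
        combined = [c * l for c, l in zip(combined, lk)]
    total = sum(combined)
    return [c / total for c in combined]

def call_genotype(relative, threshold=0.95):
    """Return the copy number if its relative likelihood reaches the
    threshold; otherwise return None (no confident call)."""
    best = max(range(3), key=lambda cn: relative[cn])
    return best if relative[best] >= threshold else None

# A genome with ambiguous depth, between the copy-number 1 and 2 clusters.
rd = depth_likelihoods(0.72)
# Hypothetical likelihood of seeing a deletion-supporting read pair for
# copy number 0, 1 and 2 (rare for copy number 2, e.g. a chimeric insert).
rp = [0.5, 0.5, 0.01]

depth_only_call = call_genotype(combine(rd))
integrated_call = call_genotype(combine(rd, rp))
print(depth_only_call, integrated_call)
```

In this example read depth alone cannot separate copy number 1 from 2 at 95% confidence, but a single deletion-supporting read pair tips the combined evidence toward the heterozygous genotype.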

References

  1. 1000 Genomes Project Consortium et al. A map of human genome variation from population scale sequencing. Nature 467, 1061–1073 (2010).
  2. Ye, K., Schulz, M.H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).
  3. Lam, H.Y. et al. Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat. Biotechnol. 28, 47–55 (2010).
  4. Korbel, J.O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420–426 (2007).
  5. Korbel, J.O. et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol. 10, R23 (2009).
  6. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods 6, 677–681 (2009).
  7. Chiang, D.Y. et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat. Methods 6, 99–103 (2009).
  8. Yoon, S., Xuan, Z., Makarov, V., Ye, K. & Sebat, J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 19, 1586–1592 (2009).
  9. Alkan, C. et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet. 41, 1061–1067 (2009).
  10. Mills, R.E. et al. Mapping copy number variation by population scale sequencing. Nature published online, doi:10.1038/nature09708 (3 February 2011).
  11. McCarroll, S.A. et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat. Genet. 40, 1166–1174 (2008).
  12. Conrad, D.F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 (2010).
  13. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 (2005).
  14. Iskow, R.C. et al. Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 141, 1253–1261 (2010).
  15. Huang, C.R. et al. Mobile interspersed repeats are major structural variants in the human genome. Cell 141, 1171–1182 (2010).
  16. Mills, R.E. et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 16, 1182–1190 (2006).
  17. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
  18. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39, 906–913 (2007).
  19. Li, Y., Willer, C., Sanna, S. & Abecasis, G. Genotype imputation. Annu. Rev. Genomics Hum. Genet. 10, 387–406 (2009).
  20. Browning, B.L. & Browning, S.R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223 (2009).
  21. Coin, L.J. et al. cnvHap: an integrative population and haplotype-based multiplatform model of SNPs and CNVs. Nat. Methods 7, 541–546 (2010).
  22. International HapMap3 Consortium et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
  23. International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
  24. McCarroll, S.A. et al. Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn's disease. Nat. Genet. 40, 1107–1112 (2008).
  25. Willer, C.J. et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat. Genet. 41, 25–34 (2009).
  26. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008).

Author information

Affiliations

  1. Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA.

    • Robert E Handsaker,
    • Joshua M Korn,
    • James Nemesh &
    • Steven A McCarroll
  2. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.

    • Robert E Handsaker,
    • Joshua M Korn,
    • James Nemesh &
    • Steven A McCarroll
  3. Stanley Center for Psychiatric Disease Research, Cambridge, Massachusetts, USA.

    • Steven A McCarroll

Contributions

R.E.H., J.M.K., J.N. and S.A.M. conceived the analytical approaches. R.E.H. implemented the algorithms and performed the data analysis. R.E.H. and S.A.M. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (808K)

    Supplementary Figures 1–6, Supplementary Table 1 and Supplementary Note.

Excel files

  1. Supplementary Table 2 (56K)

    Evaluation of genotype likelihood calibration

  2. Supplementary Table 3 (4M)

    tagSNPs identified by Genome STRiP for deletions from the 1000 Genomes Project

  3. Supplementary Table 4 (96K)

    Phenotype associated SNPs in linkage disequilibrium with 1000 Genomes pilot deletions

Additional data