Accurate and complete analysis of genome variation in large populations will be required to understand the role of genome variation in complex disease. We present an analytical framework for characterizing genome deletion polymorphism in populations using sequence data that are distributed across hundreds or thousands of genomes. Our approach uses population-level concepts to reinterpret the technical features of sequence data that often reflect structural variation. In the 1000 Genomes Project pilot, this approach identified deletion polymorphism across 168 genomes (sequenced at 4× average coverage) with sensitivity and specificity unmatched by other algorithms. We also describe a way to determine the allelic state or genotype of each deletion polymorphism in each genome; the 1000 Genomes Project used this approach to type 13,826 deletion polymorphisms (48–995,664 bp) at high accuracy in populations. These methods offer a way to relate genome structural polymorphism to complex disease in populations.
At a glance
Figure 1: A population-aware analytical framework for analyzing Genome STRucture in Populations (Genome STRiP).
(a) Population-scale sequence data contain two classes of information: technical features of the sequence data within a genome and population-scale patterns that span all the genomes analyzed. Technical features include breakpoint-spanning reads (refs. 7, 8, 9), paired-end sequences (refs. 2, 3) and local variation in read depth of coverage (refs. 4, 5, 6). Genome STRiP combines these with population-scale patterns that span many genomes, including: the sharing of structural alleles by multiple genomes; the pattern of sequence heterogeneity within a population; the substitution of alternative structural alleles for each other; and the haplotype structure of human genome polymorphism. (b) Goals of structural variation (SV) analysis in Genome STRiP. 'Variation discovery' involves identifying the structural alleles that are segregating in a population. The power to observe a variant in any one genome is only partial, but the evidence defining a segregating site can be derived from many genomes at once. 'Population genotyping' requires accurately determining the allelic state of each variant in every diploid genome in a population.
Figure 2: Identifying coherent sets of aberrantly mapping reads from a population of genomes.
(a) Millions of end-sequence pairs from sequencing libraries show aberrant alignment locations, appearing to span vast genomic distances. Almost all of these observations derive not from true structural variants but from chimeric inserts in molecular sequencing libraries. Data shown are paired-end alignments on chromosome 5 from 41 initial genome sequencing libraries from the 1000 Genomes Project. (b) A set of 'coherently aberrant' end-sequence pairs from many genomes. At this genomic locus, paired-end sequences (sequences of the two ends of the inserts in a molecular library) fall into two classes: (i) end-sequence pairs that show the genomic spacing expected given the insert size distribution of each sequencing library, such as the three read-pair alignments for genome NA07037; and (ii) end-sequence pairs that align to genomic locations unexpectedly far apart but which relate to their expected insert size distributions by a shared correction factor (red arrows). A unifying model in which these eight read pairs from five genomes arise from a shared deletion allele (size of red arrows) converts all of these aberrant read pairs to likely observations. In the right panel, the black tick marks indicate genomic distance between left and right end sequences; the black curves indicate insert size distributions of the molecular library from which each sequence pair was drawn.
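The coherence idea in panel (b) can be illustrated with a minimal sketch. It assumes that each library's insert sizes are roughly normal, which the caption does not state, and all names and numbers below are hypothetical rather than taken from Genome STRiP. A single shared deletion length is fitted to a set of aberrant read pairs, and the set counts as coherent only if every pair then sits comfortably inside its own library's insert-size distribution.

```python
def shared_deletion_length(read_pairs):
    """Precision-weighted mean of the deletion length implied by each pair.

    read_pairs: list of (observed_span, library_mean, library_sd).
    Assumes normal insert-size distributions (an illustrative simplification).
    """
    weights = [1.0 / (sd * sd) for _, _, sd in read_pairs]
    implied = [span - mean for span, mean, _ in read_pairs]
    return sum(w * d for w, d in zip(weights, implied)) / sum(weights)

def residual_z_scores(read_pairs, deletion_length):
    """Z-score of each pair's span after subtracting the shared deletion."""
    return [(span - deletion_length - mean) / sd for span, mean, sd in read_pairs]

def is_coherent(read_pairs, max_abs_z=3.0):
    """True if one shared deletion turns every aberrant pair into a likely one."""
    length = shared_deletion_length(read_pairs)
    z_scores = residual_z_scores(read_pairs, length)
    return length, all(abs(z) <= max_abs_z for z in z_scores)

# Hypothetical pairs from libraries with different insert sizes, all
# consistent with a deletion of about 1.2 kb.
coherent_pairs = [(1510, 300, 30), (1395, 200, 20), (1702, 500, 50), (1480, 300, 30)]
# Hypothetical chimera-like pairs: each is aberrant, but no single length fits.
chimeric_pairs = [(1510, 300, 30), (5200, 200, 20), (900, 500, 50), (20300, 300, 30)]

coherent_length, coherent_ok = is_coherent(coherent_pairs)
chimeric_length, chimeric_ok = is_coherent(chimeric_pairs)
```

The threshold of three standard deviations is an arbitrary choice for the example; a real implementation would more likely work with each library's empirical insert-size distribution.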
Figure 3: Evaluating the population-heterogeneity and allele-substitution properties of population-scale sequence data.
(a) At a candidate deletion locus, the distribution across genomes of 'evidentiary reads' (read pairs suggesting the presence of a deletion allele at a locus) (blue bars) is compared to a null model under which genomes are equally likely, per molecule sequenced, to give rise to such evidentiary reads (green curve). For the locus shown, the distribution of evidentiary reads across genomes differs from the null distribution (P = 1 × 10⁻⁴), confirming that evidentiary sequence data appear differentially within the population at this locus. (b) At another genomic locus, putative structural variation–supporting read pairs arise from many genomes but in a pattern that does not significantly differ from a null distribution based on equal probability per molecule sequenced. Subsequent assays confirmed that this is not a true deletion. (c) Distribution of a population-heterogeneity statistic (from a, b) for read-pair data at 1,420 sites of known deletion polymorphism. (d) Distribution of the same population-heterogeneity statistic from read-pair data at 45,000 candidate deletion loci nominated by read-pair analysis. (e, f) If a putative deletion is real, then genomes with molecular evidence for the deletion allele would be expected to have less evidence for the reference allele ('allelic substitution'). A simple test of allelic substitution is to compare average read depth (across a putative deletion segment) between two subpopulations: the genomes with read-pair evidence for the deletion (blue curve) and the genomes lacking such evidence (black trace). The locus in e was subsequently validated as containing a real deletion; the locus in f was not. (g) Distribution of this 'subpopulation depth ratio' statistic (e, f) for sequence data at 1,420 sites of known deletion polymorphism. (h) Distribution of the same statistic for sequence data at 45,000 candidate deletion loci.
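Both tests described in this caption can be sketched in a few lines. The caption does not name the heterogeneity statistic, so the example below substitutes a chi-square goodness-of-fit statistic with a Monte Carlo P value; that choice, the function names and the toy data are all assumptions made for illustration. The null model follows the caption: each evidentiary read is assigned to a genome with probability proportional to the number of molecules sequenced from it.

```python
import random
import statistics

def chi_square(observed, expected):
    """Goodness-of-fit statistic (an assumed stand-in for the paper's statistic)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def heterogeneity_p_value(evidence_counts, molecules_sequenced, n_sim=2000, seed=0):
    """Monte Carlo P value for evidentiary reads being clustered in few genomes."""
    rng = random.Random(seed)
    n_genomes = len(evidence_counts)
    total_reads = sum(evidence_counts)
    total_molecules = sum(molecules_sequenced)
    expected = [total_reads * m / total_molecules for m in molecules_sequenced]
    observed_stat = chi_square(evidence_counts, expected)
    hits = 0
    for _ in range(n_sim):
        # Null model: every read lands in a genome in proportion to its coverage.
        draws = rng.choices(list(range(n_genomes)), weights=molecules_sequenced, k=total_reads)
        simulated = [0] * n_genomes
        for genome in draws:
            simulated[genome] += 1
        if chi_square(simulated, expected) >= observed_stat:
            hits += 1
    return (hits + 1) / (n_sim + 1)

def subpopulation_depth_ratio(depths, has_evidence):
    """Mean depth of genomes with read-pair evidence over mean depth of the rest.

    Both groups must be non-empty; statistics.mean raises otherwise.
    """
    carriers = [d for d, e in zip(depths, has_evidence) if e]
    others = [d for d, e in zip(depths, has_evidence) if not e]
    return statistics.mean(carriers) / statistics.mean(others)

# Hypothetical locus: ten equally covered genomes, twelve evidentiary reads.
equal_coverage = [1000] * 10
p_clustered = heterogeneity_p_value([6, 6, 0, 0, 0, 0, 0, 0, 0, 0], equal_coverage)
p_spread = heterogeneity_p_value([1] * 12, [1000] * 12)

# Hypothetical normalized depths across the candidate segment.
depth_ratio = subpopulation_depth_ratio(
    [0.50, 0.60, 0.45, 1.00, 0.95, 1.05, 1.10],
    [True, True, True, False, False, False, False],
)
```

A real deletion should give a small heterogeneity P value together with a depth ratio clearly below 1, whereas an artifact shared evenly across genomes should give neither.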
Figure 4: Deletion polymorphisms identified by Genome STRiP in low-coverage sequence data from 168 genomes.
(a) Size distribution. Sensitivity for large deletions (>10 kb) is similar to that of the array-based approaches applied in large, population-scale studies (red); sensitivity for deletions smaller than 10 kb is much greater. A strong peak near 300 bp arises from ALU insertion polymorphisms; a smaller peak near 6 kb arises from L1 insertion polymorphisms. Number of evidentiary sequence reads (b) and genomes (c) contributing to each deletion discovery in population-scale sequence data. We identified 1,033 of these deletions (14.7%) with evidentiary pairs from single genomes. (d) Specificity: false discovery rates of ten deletion discovery methods evaluated by the 1000 Genomes Project in the Project's population-scale low-coverage sequence data. (e) Sensitivity: power of the same ten discovery methods in identifying known deletions as a function of the allele frequency of the deletion. (f) Localization of the breakpoints of a common deletion allele using read-pair data from many genomes. The difference between (i) the genomic separation of each read-pair sequence and (ii) the insert-size distribution of the molecular library from which it is drawn (Fig. 2b) allows a likelihood-based estimate of deletion length from each read pair (blue curves). Combining this likelihood information across many genomes (black curve) allows fine-scale localization of the breakpoint. (g) Resolution of breakpoint estimates from Genome STRiP, as estimated using Genome STRiP confidence intervals (red) and comparison to molecularly established breakpoint sequences (blue). (h) Fine-scale localization of a structural variation breakpoint facilitates directed local assembly of the deletion allele from sequence data derived from many genomes.
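The likelihood combination in panel (f) can be sketched as follows. The example assumes normal insert-size distributions and a grid of candidate deletion lengths; both are simplifications chosen for illustration, and the function names and read-pair values are hypothetical. Each read pair contributes one log-likelihood curve over deletion length, the curves are summed across pairs, and an approximate 95% interval is taken where the summed curve lies within 1.92 log units of its maximum.

```python
import math

def pair_loglik(span, lib_mean, lib_sd, deletion_length):
    """Log-likelihood of one read pair's span given a candidate deletion length."""
    z = (span - deletion_length - lib_mean) / lib_sd
    return -0.5 * z * z - math.log(lib_sd)

def estimate_length(read_pairs, candidate_lengths):
    """Maximum-likelihood length and an approximate 95% support interval."""
    curve = [
        sum(pair_loglik(span, mean, sd, d) for span, mean, sd in read_pairs)
        for d in candidate_lengths
    ]
    best_index = max(range(len(curve)), key=curve.__getitem__)
    # A drop of 1.92 log units corresponds to a 95% interval for one parameter.
    cutoff = curve[best_index] - 1.92
    supported = [d for d, ll in zip(candidate_lengths, curve) if ll >= cutoff]
    return candidate_lengths[best_index], (supported[0], supported[-1])

# Hypothetical read pairs (observed_span, library_mean, library_sd) drawn
# from different genomes and libraries.
read_pairs = [(1510, 300, 30), (1395, 200, 20), (1702, 500, 50), (1480, 300, 30)]
candidate_lengths = list(range(1000, 1401))

single_best, single_interval = estimate_length(read_pairs[:1], candidate_lengths)
combined_best, combined_interval = estimate_length(read_pairs, candidate_lengths)
```

In this toy data the interval from all four pairs is markedly narrower than the interval from the first pair alone, which is the effect the black curve in panel (f) illustrates.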
Figure 5: Determining the allelic state (genotype) of 13,826 deletions in 156 genomes.
(a) Four of the 13,826 deletion polymorphisms analyzed, representing diverse properties in terms of size and alignability of the affected sequence. Gray vertical rectangles indicate a sequence that is repeat masked or otherwise non-alignable. The locus in the bottom row is an ALU insertion polymorphism. (b) Population-scale distribution of read depth across genomes at each of the deletion loci in a. For each locus, normalized measurements of read depth (across the deleted segment) from 156 genomes were fitted to a Gaussian mixture model. Colored squares represent genomes for which genotype could be called at 95% confidence based on read depth. (c) Genotype likelihood from read depth. Each horizontal stripe (corresponding to 1 of the 156 genomes) is divided into three sections with length proportional to the estimated relative likelihood of the sequence data given each genotype model (blue, copy-number 2; green, copy-number 1; orange, copy-number 0). (d) Genotype likelihood based on evidence from read pairs (RP) and breakpoint-spanning reads (BR). At the third locus from top, the absence of an established breakpoint sequence limits inference to read pairs. (e) Genotype likelihood based on integrating evidence from read depth (RD), read pairs (RP) and breakpoint-spanning reads (BR). (f) Genotype likelihood based on integrating evidence from c–e with flanking SNP data in a population haplotype model. (g) Population-scale sequence data at each locus as resolved into genotype classes. Traces indicate average read depth for genomes of each inferred genotype. Orange and green rectangles indicate evidentiary read pairs and breakpoint-spanning reads, colored by the genotype determination for the genome from which they arise.
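The read-depth genotyping in panels (b) and (c) can be approximated with a small constrained mixture model. The sketch below makes several assumptions that the caption does not state: depth is normalized so that 1.0 is the diploid expectation, component means are fixed at 0, 0.5 and 1.0, all components share one standard deviation, and the fitted mixture weights act as priors when computing per-genome probabilities. All names and data are hypothetical.

```python
import math

MEANS = (0.0, 0.5, 1.0)  # assumed normalized depth for copy number 0, 1, 2

def log_gauss(x, mean, sd):
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def posteriors(x, weights, sd):
    """Relative probability of each copy number for one genome (log-space safe)."""
    logs = [math.log(w) + log_gauss(x, m, sd) for w, m in zip(weights, MEANS)]
    top = max(logs)
    probs = [math.exp(v - top) for v in logs]
    total = sum(probs)
    return [p / total for p in probs]

def genotype_by_depth(depths, n_iter=50, sd=0.15, min_confidence=0.95):
    """Fit weights and a shared sd by EM, then call confident genotypes."""
    weights = [1.0 / 3.0] * 3
    for _ in range(n_iter):
        # E-step: responsibility of each copy-number class for each genome.
        resp = [posteriors(x, weights, sd) for x in depths]
        # M-step: update class weights (floored to keep logs finite) and sd.
        weights = [max(sum(r[k] for r in resp) / len(depths), 1e-6) for k in range(3)]
        variance = sum(
            r[k] * (x - MEANS[k]) ** 2 for x, r in zip(depths, resp) for k in range(3)
        ) / len(depths)
        sd = max(math.sqrt(variance), 0.02)
    calls = []
    for x in depths:
        probs = posteriors(x, weights, sd)
        best = max(range(3), key=probs.__getitem__)
        # Leave the genotype uncalled (None) below the confidence threshold.
        calls.append(best if probs[best] >= min_confidence else None)
    return calls

# Hypothetical normalized depths for ten genomes at one deletion locus.
depths = [1.02, 0.95, 1.10, 0.98, 0.52, 0.47, 0.55, 0.03, 0.00, 0.76]
calls = genotype_by_depth(depths)
```

The last genome, whose depth falls midway between the copy-number 1 and copy-number 2 clusters, stays uncalled at the 95% threshold. Genomes of this kind are where the additional evidence described in panels (d) to (f) would be expected to matter most.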