Haplotype information is essential to the complete description and interpretation of genomes1, genetic diversity2 and genetic ancestry3. Although individual human genome sequencing is increasingly routine4, nearly all such genomes are unresolved with respect to haplotype. Here we combine the throughput of massively parallel sequencing5 with the contiguity information provided by large-insert cloning6 to experimentally determine the haplotype-resolved genome of a South Asian individual. A single fosmid library was split into a modest number of pools, each providing ~3% physical coverage of the diploid genome. Sequencing of each pool yielded reads overwhelmingly derived from only one homologous chromosome at any given location. These data were combined with whole-genome shotgun sequence to directly phase 94% of ascertained heterozygous single nucleotide polymorphisms (SNPs) into long haplotype blocks (N50 of 386 kilobases (kbp)). This method also facilitates the analysis of structural variation, for example, to anchor novel insertions7, 8 to specific locations and haplotypes.
The high quality of the human reference genome derives from the hierarchical sequencing of large-insert clones, such that the assembly corresponding to each clone represents a single haplotype9. One of the first 'personal genomes' exploited clone-based mate pairing and long, accurate Sanger reads to resolve variants into haplotype blocks (N50 of 350 kbp; that is, 50% of resolved sequence is within blocks of at least 350 kbp)1. Although new technologies5 have subsequently enabled >1,000-fold reduction in genome sequencing costs, the short read-lengths and paucity of contiguity information are such that it remains challenging to determine haplotypes at a genome-wide scale. Genomic phase, the assignment of alleles to homologous chromosomes, was determined for SNPs using mate-paired reads on the SOLiD (sequencing by oligonucleotide ligation and detection) platform10 for an individual genome, but only 43% of heterozygous variants were phased, and nearly all in blocks no greater than the insert size, that is, <3.5 kbp10. Experimental limitations on the size and complexity of mate-pair libraries based on in vitro circularization11 make it difficult to improve upon this approach.
An alternative is to infer haplotypes from population-based linkage disequilibrium data or from pedigree analysis. For example, haplotypes were successfully inferred in the YH (YanHuang) genome for variants at which phased CHB/JPT HapMap data were available (CHB, Han Chinese from Beijing, China; JPT, Japanese from Tokyo, Japan)12. The genomes of a family of four have been sequenced and these relationships used to infer inheritance blocks13. Although they can be successful, inferential methods have limitations. Statistical phasing, whether based on genotyping2 or sequencing14, performs poorly when linkage disequilibrium is not high, and for rare variants. Phasing by pedigree analysis requires genome sequencing of many related individuals, increasing costs and limiting practical application.
We describe a cost-effective method for determining long-range haplotypes at a genome-wide scale by massively parallel sequencing of complex, haploid subsets of an individual genome (Fig. 1). We apply this method to the first reported whole-genome sequencing of a human of South Asian ancestry. The Indian subcontinent is home to myriad culturally and genetically diverse groups with distinct population histories15. We selected a female from the HapMap panel of 'Gujarati Indians in Houston' (GIH; NA20847) for sequencing. Notably, the imputation of genotypes for GIH was the least effective of all non-African populations in HapMap (ref. 2).
Genomic DNA from NA20847 was used to construct a single, complex fosmid library, containing clones packaged in phage for infecting Escherichia coli cells (>2 × 10^6 clones with ~37 kbp inserts) (Fig. 1a and Supplementary Methods). We then split a portion of this library into 115 pools, at a density such that each pool contained ~5,000 independent clones. Each pool was expanded by either scraping a single plate of infected cells and inoculating outgrowth culture, or by direct liquid outgrowth after infection. At no point does this method require the isolation of individual colonies. We next constructed 115 barcoded, shotgun sequencing libraries from fosmid DNA isolated from each of the 115 pools16. Libraries indexed with barcodes were combined and sequenced (Illumina GAIIx; PE76 or PE101 reads) to a mean 2.4× depth per haploid clone (Fig. 1b).
Because each pool captures an essentially random ~3% of the 6-gigabase (Gb) diploid genome (that is, ~5,000 fosmids × ~37 kbp inserts), sequence reads from each pool are overwhelmingly (99.1%) derived from only one homologous chromosome or the other at any single location. Upon mapping reads from each pool to the reference assembly, the approximate boundaries of 538,009 individual clones (37.2 ± 4.7 kbp) were identified by read depth (4,678 ± 1,229 clones per pool). Coverage was uniform across the genome (98.6% covered by one or more clones) and within each pool (82% of clones with mean read depth within a tenfold range) (Supplementary Fig. 1).
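The arithmetic behind the ~3% figure, and the reason a pool is effectively haploid at most covered sites, can be sketched with a simple Poisson model. This is an illustration only: the function names, the 3-Gb haploid genome size and the assumption that clones are drawn evenly and independently from the two homologs are ours, and the model should not be expected to reproduce the reported 99.1% exactly, which was measured from reads.

```python
import math


def physical_coverage(n_clones, insert_bp, diploid_genome_bp=6e9):
    """Fraction of the diploid genome spanned by a pool, ignoring clone overlap."""
    return n_clones * insert_bp / diploid_genome_bp


def single_homolog_fraction(n_clones, insert_bp, haploid_genome_bp=3e9):
    """Among sites covered by a pool, the fraction covered by only one homolog.

    Assumption: clone starts are Poisson-distributed and half of the clones
    come from each homolog.
    """
    lam = (n_clones / 2) * insert_bp / haploid_genome_bp  # mean clone depth per homolog
    p = 1.0 - math.exp(-lam)          # P(a given homolog is covered at a site)
    p_any = 1.0 - (1.0 - p) ** 2      # P(at least one homolog is covered)
    p_both = p * p                    # P(both homologs are covered)
    return 1.0 - p_both / p_any


if __name__ == "__main__":
    print(f"physical coverage: {physical_coverage(5000, 37e3):.3f}")
    print(f"single-homolog fraction: {single_homolog_fraction(5000, 37e3):.3f}")
```

Under these assumptions, lowering the number of clones per pool raises the single-homolog fraction at the cost of needing more pools for the same total clone coverage.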
For unphased variation discovery, we performed conventional whole-genome resequencing to 15× depth (Illumina HiSeq; PE50) (Supplementary Table 1 and Supplementary Fig. 2). After alignment to the reference, we called 3.3 × 10^6 SNPs and 3.4 × 10^5 short indels17, 18 (Fig. 1c). Nonreference sensitivity for SNPs (that is, the fraction of HapMap variant genotypes recovered at positions also called in our data) was 91%, and genotype concordance to high-quality HapMap 3 genotypes2 at called positions was 99.2% (n = 1,436,495). Other bulk statistics, including the heterozygous-to-homozygous call ratio, the fraction of called variants previously ascertained in the NCBI SNP database (dbSNP), the transition-to-transversion ratio, and the numbers and classes of coding variants, were consistent with expectations based on previously sequenced non-African genomes (Supplementary Table 2).
Several methods have been described for assembling haplotypes from sequence data1, 19, 20, 21. We adopted a maximum parsimony approach19 to combine the unphased variants from shotgun whole-genome sequencing with haploid genotype calls from sequencing of the 115 pools (Fig. 1d). The resulting assembly incorporated 94% of ascertained heterozygous SNPs into haplotype-resolved blocks, with an N90 of 89 kbp, an N50 of 386 kbp and an N10 of 1 megabase (Mbp) (Fig. 2a). Sixty-two percent of genes were fully encompassed by single blocks, and 73% were covered for over half their length.
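The N90, N50 and N10 statistics quoted above follow the definition given earlier for N50 (50% of resolved sequence lies within blocks of at least that length). A minimal sketch of the calculation follows; the function name `nx` and the choice to weight blocks by their length in base pairs are assumptions made for illustration.

```python
def nx(block_lengths, x=50):
    """Return the largest length L such that blocks of length >= L
    together contain at least x% of the total block length."""
    if not block_lengths:
        raise ValueError("block_lengths must not be empty")
    total = sum(block_lengths)
    running = 0
    # Walk from the longest block down, accumulating covered length.
    for length in sorted(block_lengths, reverse=True):
        running += length
        if running * 100 >= x * total:
            return length
    raise ValueError("x must be between 0 and 100")


if __name__ == "__main__":
    toy_blocks = [10, 20, 30, 40]  # toy block lengths, arbitrary units
    print(nx(toy_blocks, 90), nx(toy_blocks, 50), nx(toy_blocks, 10))
```

By construction N10 ≥ N50 ≥ N90, which matches the ordering of the reported values (1 Mbp, 386 kbp and 89 kbp).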
To evaluate accuracy, we compared our haplotype assembly with HapMap phase predictions for NA20847 (Fig. 2b)2. For pairs of SNPs in exceptionally high-linkage disequilibrium (D′ > 0.90 among GIH), we observed nearly perfect concordance (>99.7%). Because NA20847 was not part of a trio, HapMap predictions rely upon linkage disequilibrium between alleles to predict phase from genotypes. Correspondingly, concordance was reduced to ~71% when D′ < 0.10, which is the case for most (66%) pairwise SNP combinations. Concordance is also reduced when one or both alleles in the pair is rare in GIH (Fig. 2c). Note that our haplotype assembly is experimental and specific to an individual, and therefore completely independent of population-based phenomena such as linkage disequilibrium and allele frequency. Consequently, these trends likely reflect errors in HapMap phasing1.
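For readers less familiar with the D′ statistic used to stratify SNP pairs above, the standard definition can be sketched as follows. The input format (a list of two-locus haplotypes coded 0/1) and the function name are illustrative assumptions; this is not the code used to obtain the HapMap values.

```python
def d_prime(haplotypes):
    """Absolute normalized linkage disequilibrium |D'| for two biallelic sites.

    `haplotypes` is a list of (allele_at_site_1, allele_at_site_2) tuples,
    each allele coded 0 or 1, one tuple per sampled chromosome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n               # freq of allele 1 at site 1
    p_b = sum(b for _, b in haplotypes) / n               # freq of allele 1 at site 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                                  # raw disequilibrium D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # at least one site is monomorphic; D' is undefined, report 0
    return abs(d) / d_max
```

A pair of sites whose alleles always co-occur gives |D′| = 1, whereas sites that assort independently give |D′| = 0; population-based phasing has little information to work with in the latter case.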
South Asian history includes admixture between two ancestral groups, one genetically close to Europeans (ANI) and another more highly diverged from well-ascertained populations (ASI)15. Furthermore, principal components analysis revealed a distinct subgroup of Indian populations in general and GIH in particular, including NA20847, that may harbor substantial genetic ancestry from a third population distinct from ANI and ASI15. We compared haplotype blocks for this individual to HapMap allele frequencies in the GIH and CEPH European (CEU) populations to distinguish 'GIH-like' from 'CEU-like' haplotypes. Notably, novel SNPs are markedly enriched on the most GIH-like haplotypes (Fig. 3). We also scored haplotype blocks against allele frequencies from the 1000 Genomes Project14 (Supplementary Fig. 3). Haplotypes that least resembled all three populations in that study (CEU, CHB/JPT and Yoruba) were also markedly enriched for novel SNPs. We propose that GIH-like blocks and other well-differentiated haplotypes may be derived from more poorly ascertained ancestral populations, and therefore enriched for novel variants. Such haplotypes may represent a valuable source of information about human history on the South Asian subcontinent.
A substantial fraction of the human genome consists of gene-rich segmental duplications and otherwise structurally complex regions that continue to defy accurate diploid consensus assembly within individual genomes. We sought to evaluate whether haplotype-resolved sequencing is useful for the fine-mapping and haplotype-assignment of deletions, inversions and novel contigs.
We used shotgun read depth22, discordant pairing in shotgun data23 and array-based SNP calls2 to estimate copy number and detect 58 deletions (>8 kbp), 15 of which were flanked by segmental duplications. Of these, 48 deletions (83%) were unambiguously confirmed by sequenced fosmid clones spanning the breakpoints, providing fine-scale resolution and confirming 30 as hemizygous (Fig. 4a and Supplementary Table 3). Heterozygous variants in flanking clones allowed for unambiguous incorporation of these deletions into haplotype-resolved blocks.
Inversions are challenging to detect because they are copy-number neutral and frequently mediated by repetitive sequences. As even fosmid end-sequencing tends to overcall inversions6, the added information from interrogating full ~37-kbp inserts may be useful for discriminating true inversions from false positives (Supplementary Fig. 4). Indeed, we observed a number of unambiguous inversions by means of breakpoint-spanning clones (Supplementary Fig. 5). However, larger clones (>100 kbp) may be required to span the large duplication blocks where inversion breakpoints typically map6. NA20847 is heterozygous for the inversion-containing H2 haplotype at the MAPT locus (17q21) (Supplementary Fig. 6). Of note, we properly phased all 287 SNPs that tag the H2 haplotype across a 588-kbp span24.
We also detected common human sequences unrepresented in the reference, that is, the 'pan-genome' (Supplementary Table 4)7, 8. Of 16,904 contigs (total 12.8 Mbp) reported by two recent studies7, 8, we identified 8,993 in NA20847. We exploited the contiguity of fosmids to anchor ~30% of these (Fig. 4b), with 73% agreement (±50 kbp) with a previously anchored subset8. De novo assembly of remaining unmapped reads yielded 2,242 additional contigs after filtering, of which we anchored 396. To validate anchoring accuracy, we simulated novel insertions by deleting 600 intervals (250 bp–10 kbp) in silico from the reference and remapping reads to the modified reference. Unmapped reads were de novo assembled into 5,435 contigs that covered ~61% of simulated insertions. Of these, we predicted anchoring locations for 2,184 with an accuracy of 87%, with the remaining contigs unassigned because of limited clone coverage. The sensitivity and specificity with which novel contigs can be anchored by this approach is likely to improve with increased clone and shotgun coverage.
We recently demonstrated exome sequencing as a strategy for identifying causal variants in Mendelian disorders25, for example, implicating compound heterozygote variants in DHODH in Miller syndrome26. In such studies, phasing reduces the number of candidate genes consistent with a recessive, compound heterozygous model13. For example, in this Gujarati Indian individual, unphased variant data included 44 genes consistent with compound heterozygosity (that is, two or more heterozygous, novel, nonsynonymous or splice-site variants that altered the same gene). But after phase was taken into account, only ten were validated as trans heterozygous, with the remainder having both variants on the same haplotype.
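The filtering step described above can be sketched as follows. The tuple layout `(gene, block_id, haplotype)` and the function name are hypothetical, and the sketch assumes that only variants falling within the same phased block can be compared.

```python
from collections import defaultdict


def trans_heterozygous_genes(variants):
    """Return genes carrying candidate variants on both haplotypes of one block.

    `variants` is an iterable of (gene, block_id, haplotype) tuples, where
    haplotype is 0 or 1 and is only meaningful within its phased block.
    """
    haps_by_gene_block = defaultdict(set)
    for gene, block_id, hap in variants:
        haps_by_gene_block[(gene, block_id)].add(hap)
    # A gene qualifies when, within a single block, both haplotypes carry a variant.
    return sorted({gene for (gene, _), haps in haps_by_gene_block.items()
                   if haps == {0, 1}})


if __name__ == "__main__":
    toy = [
        ("GENE_A", "block1", 0), ("GENE_A", "block1", 1),  # in trans
        ("GENE_B", "block1", 0), ("GENE_B", "block1", 0),  # in cis
        ("GENE_C", "block2", 0), ("GENE_C", "block3", 1),  # phase not comparable
    ]
    print(trans_heterozygous_genes(toy))
```

Genes whose candidate variants fall in different blocks are left unresolved rather than counted as either cis or trans.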
This method requires significantly greater expertise and sample preparation than the haplotype-blind shotgun sequencing of an individual genome: specifically, the construction of a single fosmid library and >100 in vitro shotgun libraries, as compared with constructing one or a few in vitro shotgun libraries. A detailed consideration of the added effort and cost is provided in Supplementary Table 5. In summary, sample preparation can be completed in <2 weeks by a single technician at a cost (~$4,000) that is much greater than that of preparing a single shotgun library, but low relative to the overall cost of whole-genome sequencing. We use an unconventional method based on in vitro transposition16 to significantly reduce the time and effort for producing >100 shotgun libraries. Current costs are primarily driven by commercial reagents for fosmid and shotgun library construction, and may therefore be amenable to optimization16. Furthermore, most steps are compatible with manual scaling and/or automation.
We also note that the total number of bases sequenced here (~87 Gb shotgun, ~110 Gb clone-based) is only modestly higher than for other individual human genomes sequenced to date. To estimate the minimal amount of clone sequencing required, we subsampled our data for either the number of independent clones or the depth of clone library sequencing (Supplementary Fig. 7). The primary effect was a reduction in the length of assembled haplotype blocks, rather than any decay in accuracy. For example, at 80% of clones and 60% of sequencing depth (which is 48% as much clone-based sequencing), the N50 dropped from 386 kbp to 238 kbp. However, most ascertained heterozygous variants remained phased (85.4%), and phasing remained highly concordant with HapMap (>99% at D′ > 0.9). Other optimizations, for example, switching from plate-scraping to direct liquid outgrowth to improve clone uniformity (Supplementary Fig. 1), may further reduce sequencing requirements.
Haplotypes are essential to the information content that defines a diploid human genome, but have heretofore been intractable to genome-wide, experimental determination in the context of massively parallel sequencing. We anticipate that haplotype-resolved genome sequencing will be valuable in a broad range of scenarios, including the following. (i) Population genetics. Haplotype-resolved genome sequencing eliminates the need for population or pedigree-based haplotype inference. This will be most useful in populations that are poorly ascertained (e.g., South Asians) or have low linkage disequilibrium (e.g., Africans), and more generally for rare variants. (ii) Genetic anthropology. For example, the availability of the haplotype-resolved reference and Venter genomes was critical to the observation of a Neanderthal contribution to some modern humans3. (iii) Medical genetics of rare and common phenotypes. Haplotype information can facilitate the analysis of recessive Mendelian disorders13, the determination of the parent of origin for de novo mutations, and the study of complex interactions among multiple SNPs27. (iv) Structural variation in both germline and cancer genomes. Our approach is more comprehensive than long-insert mate-pairing (whether by fosmids6 or in vitro circularization28), as these methods determine the ends of large molecules but are blind to their internal contents. Also, the intermediate level of partitioning provided by fosmids may be more useful than whole chromosome amplification29, as many germline and somatic structural events are intrachromosomal. (v) Allele-specific phenomena. Haplotype information may be essential for understanding the genetic basis of phenomena such as allele-specific expression and methylation30. (vi) De novo genome assembly. Massively parallel sequencing of highly complex pools of minimally redundant haploid clones may facilitate the high-quality de novo assembly of new genomes, an area that continues to be a major challenge for the genomics field despite the falling costs of DNA sequencing11.
Fosmid library pool construction.
High molecular weight genomic DNA (HMW gDNA) was extracted from HapMap lymphoblastoid cell line GM20847 (Coriell) using the Gentra Puregene kit (Qiagen). A single, complex fosmid library (>2 × 10^6 clones) was created using the CopyControl pCC1Fos Fosmid Library Construction kit (Epicentre), as previously described31. After bulk infection, the library was split into 115 pools of ~5,000 clones each. Each pool was then individually expanded, either by scraping plates of infected cells and inoculating outgrowth culture, or by direct liquid outgrowth after infection. Clone DNA was extracted from each pool by alkaline lysis miniprep.
Massively parallel sequencing.
Illumina-compatible shotgun sequencing libraries were prepared from each fosmid clone pool DNA and HMW gDNA using the Nextera DNA Sample Prep Kit (Epicentre), as described16. For each fosmid pool library, a 9-bp barcoded adaptor was added during PCR amplification16. Pool-derived libraries were combined before sequencing (PE76 or PE101 reads, plus index read, on an Illumina GAIIx), and the index read was used to deconvolve the original clone pools from the combined reads. For unphased variant discovery, a single whole-genome shotgun library was sequenced across seven lanes (PE50 reads on an Illumina HiSeq).
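Deconvolving pooled reads by their index read amounts to a lookup from barcode to pool. The sketch below is illustrative only: the barcode sequences and pool names are invented, and exact matching is an assumption, since any tolerance for sequencing errors in the index read is not described here.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_pool):
    """Group read identifiers by clone pool using the index read.

    `reads` is an iterable of (read_id, index_sequence) tuples.
    Reads whose index matches no known barcode are collected under None.
    """
    pools = defaultdict(list)
    for read_id, index_seq in reads:
        pool = barcode_to_pool.get(index_seq.upper())  # None if unrecognized
        pools[pool].append(read_id)
    return dict(pools)


if __name__ == "__main__":
    # Hypothetical 9-bp barcodes, for illustration only.
    barcodes = {"ACGTACGTA": "pool_001", "TTGCAAGCT": "pool_002"}
    reads = [("r1", "ACGTACGTA"), ("r2", "ttgcaagct"),
             ("r3", "NNNNNNNNN"), ("r4", "ACGTACGTA")]
    print(demultiplex(reads, barcodes))
```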
Read mapping and variant discovery.
Basecalling was performed with Illumina RTA v1.8 software. The resulting reads were aligned to the reference assembly (NCBI release GRCh37, UCSC release hg19) using BWA v0.5.8a (ref. 17). The Genome Analysis Toolkit (GATK)18 was used to recalibrate base quality scores, realign reads surrounding putative and known indels, and call single-nucleotide and indel variants from the whole-genome shotgun data. Quality filters were applied based on coverage, base and mapping quality score, and allelic and strand bias. Copy number genotypes were estimated genome-wide by (G+C)-corrected read depth, as previously described32. Deletions >8 kbp were identified by intersecting regions of predicted copy number less than 2 with split-read calls23 and published SNP array-based calls2, requiring calls by two of the three methods.
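The two-of-three rule for deletion calls can be sketched as an interval-clustering step. The sketch makes several assumptions that are not taken from the text: calls are half-open `(start, end)` intervals on a single chromosome, any overlap is enough to cluster calls together, and the >8 kbp size filter is applied to the merged interval.

```python
def consensus_deletions(calls_by_method, min_methods=2, min_length=8000):
    """Merge overlapping calls and keep clusters supported by enough methods.

    `calls_by_method` maps a method name to a list of (start, end) intervals.
    Returns a list of (start, end, supporting_methods) tuples.
    """
    tagged = sorted((start, end, method)
                    for method, calls in calls_by_method.items()
                    for start, end in calls)
    clusters = []  # each cluster is [start, end, set_of_methods]
    for start, end, method in tagged:
        if clusters and start < clusters[-1][1]:
            # Overlaps the current cluster: extend it and record the method.
            clusters[-1][1] = max(clusters[-1][1], end)
            clusters[-1][2].add(method)
        else:
            clusters.append([start, end, {method}])
    return [(s, e, sorted(methods)) for s, e, methods in clusters
            if len(methods) >= min_methods and e - s > min_length]


if __name__ == "__main__":
    toy_calls = {
        "read_depth": [(1000, 12000), (50000, 70000)],
        "split_read": [(1500, 11500)],
        "snp_array": [(100000, 120000)],
    }
    print(consensus_deletions(toy_calls))
```

A stricter variant might require reciprocal overlap between calls; that choice affects how readily nearby but distinct events are merged.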
Clone coordinates were identified within each pool by searching for intervals of length 25–45 kbp with coverage significantly above background. Heterozygous SNP positions ascertained during whole-genome shotgun sequencing were regenotyped within each haploid clone pool. Clones with an excess of heterozygous positions, likely representing overlapping clones drawn from different haplotypes, were discarded. Haplotype blocks were created from overlapping clones using a custom reimplementation of HAPCUT19, a parsimony maximization-based haplotype assembly algorithm. The effects of lower sequence coverage upon haplotype assembly accuracy and block length were simulated by leaving out a random subset of clones and/or reads.
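The first step above, locating clones as intervals of elevated coverage, can be sketched as a scan over binned read depth. The 1-kbp window size and the fixed depth threshold are placeholders chosen for illustration; the statistical criterion actually used for "significantly above background" is not specified here.

```python
def find_clone_intervals(window_depth, window_bp=1000, min_depth=1.0,
                         min_len=25_000, max_len=45_000):
    """Return (start, end) coordinates of candidate clones within one pool.

    `window_depth` holds mean read depth in consecutive, fixed-size windows
    along one chromosome. A candidate clone is a run of windows at or above
    `min_depth` whose total length lies between `min_len` and `max_len`.
    """
    intervals = []
    run_start = None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    for i, depth in enumerate(list(window_depth) + [float("-inf")]):
        if depth >= min_depth:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            length = (i - run_start) * window_bp
            if min_len <= length <= max_len:
                intervals.append((run_start * window_bp, i * window_bp))
            run_start = None
    return intervals


if __name__ == "__main__":
    # Toy depth profile: a 37-kbp run, then a 5-kbp run that is too short.
    toy_depth = [0] * 10 + [3] * 37 + [0] * 5 + [3] * 5 + [0] * 3
    print(find_clone_intervals(toy_depth))
```

Runs longer than the upper bound are dropped in this sketch; such runs would be consistent with overlapping clones, which the text handles separately through the excess-heterozygosity filter.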
Haplotype ancestry analysis.
Phased blocks were divided into sliding windows of variants from HapMap (ref. 2; 20 SNPs/window) or the 1000 Genomes Project14 (200 SNPs/window). For the HapMap-based comparison, similarity to GIH and CEU populations was scored for both haplotypes of NA20847 at every window based on the frequencies of alleles in NA20847 among GIH and CEU. Haplotype windows were then rank-ordered by the difference in similarity scores, such that haplotypes with high-frequency alleles among GIH but not CEU were more highly ranked. The fraction of all detected novel variants (not in dbSNP release 130) was then counted for each haplotype window for NA20847 and, for comparison, for the trio-resolved CEU individual NA12871 (ref. 14) in the same rank-ordered windows. Pairs of homologous haplotype windows were rank-ordered by differential similarity to GIH, and the fraction of novel variants on the GIH-enriched homolog was computed. For the 1000 Genomes-based comparison, haplotype windows were rank-ordered by divergence from CEU, YRI, and CHB+JPT populations and the fraction of novel variants per haplotype window computed for both NA20847 and NA12871 as before.
Pan-genome and novel contig mapping and anchoring.
Whole-genome and clone pool-derived reads that did not align to the human genome reference (GRCh37/hg19) were mapped to novel contigs not present in the human reference genome assembly7, 8 to find contigs covered with ≥50 bp (phred-scaled mapping score ≥Q20). A subset of contigs were anchored by ≥2 reads with mates mapping to the reference. As further evidence of anchoring, intervals were identified in the reference assembly having read depth from clone pools also hitting a given contig but depleted among those pools not hitting that contig. Further novel sequences from NA20847 were assembled de novo from remaining unmapped reads using Velvet33. Contigs aligning to existing pan-genome sequences and contaminating sequences (E. coli, vector backbone, Epstein-Barr virus) were removed and remaining contigs were anchored as above. Sensitivity to detect and accurately anchor novel sequence was simulated by introducing in silico deletions into the reference, de novo assembling corresponding insertion contigs, anchoring as before, and measuring agreement between predicted anchoring location and the known site of simulated deletion.
Short read sequence data have been deposited at the NCBI Sequence Read Archive (SRA) under accession no. 026360. Assembled haplotype blocks and novel contigs are available from: http://krishna.gs.washington.edu/indianGenome/.
- The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).
- International HapMap Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
- A draft sequence of the Neandertal genome. Science 328, 710–722 (2010).
- Anonymous. Human genome: Genomes by the thousand. Nature 467, 1026–1027 (2010).
- Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135–1145 (2008).
- Mapping and sequencing of structural variation from eight human genomes. Nature 453, 56–64 (2008).
- Building the sequence map of the human pan-genome. Nat. Biotechnol. 28, 57–63 (2010).
- Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat. Methods 7, 365–371 (2010).
- International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
- Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 19, 1527–1541 (2009).
- Assembly of large genomes using second-generation sequencing. Genome Res. 20, 1165–1173 (2010).
- The diploid genome sequence of an Asian individual. Nature 456, 60–65 (2008).
- Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328, 636–639 (2010).
- 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
- Reconstructing Indian population history. Nature 461, 489–494 (2009).
- Rapid, low-input, low-bias construction of shotgun fragment libraries by high density in vitro transposition. Genome Biol. 11, R119 (2010).
- Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
- The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
- HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 24, i153–i159 (2008).
- Diploid genome reconstruction of Ciona intestinalis and comparative analysis with Ciona savignyi. Genome Res. 17, 1101–1110 (2007).
- An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res. 18, 1336–1346 (2008).
- Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet. 41, 1061–1067 (2009).
- Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics 26, i350–i357 (2010).
- Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat. Genet. 40, 1076–1083 (2008).
- Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272–276 (2009).
- Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42, 30–35 (2010).
- Complex promoter and coding region beta 2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc. Natl. Acad. Sci. USA 97, 10483–10488 (2000).
- Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420–426 (2007).
- Direct determination of molecular haplotypes by chromosome microdissection. Nat. Methods 7, 299–301 (2010).
- Allele-specific DNA methylation: beyond imprinting. Hum. Mol. Genet. 19, R210–R220 (2010).
- Targeted, haplotype-resolved resequencing of long segments of the human genome. Genomics 86, 759–766 (2005).
- Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010).
- Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
We thank C. Lee and M. Malig for technical assistance, J. Akey, T. O'Connor and P. Green for helpful discussions, D. Reich for ancestry information on NA20847, the U.W. Genome Sciences Genomics Resource Center (GS-GRC) for sequencing and the 1000 Genomes Project for early data release. This work was supported by National Institutes of Health grants AG039173 (J.B.H.) and HG002385 (E.E.E.), a National Science Foundation Graduate Research Fellowship (J.O.K.), a Natural Sciences and Engineering Research Council of Canada Fellowship (P.H.S.) and a fellowship from the Achievement Rewards for College Scientists Foundation (J.B.H.). E.E.E. is an investigator of the Howard Hughes Medical Institute.