  • Technical Report

Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing

Abstract

Haplotype-resolved genome sequencing enables the accurate interpretation of medically relevant genetic variation, deep inferences regarding population history and non-invasive prediction of fetal genomes. We describe an approach for genome-wide haplotyping based on contiguity-preserving transposition (CPT-seq) and combinatorial indexing. Tn5 transposition is used to modify DNA with adaptor and index sequences while preserving contiguity. After DNA dilution and compartmentalization, the transposase is removed, resolving the DNA into individually indexed libraries. The libraries in each compartment, enriched for neighboring genomic elements, are further indexed via PCR. Combinatorial 96-plex indexing at both the transposition and PCR stage enables the construction of phased synthetic reads from each of the nearly 10,000 'virtual compartments'. We demonstrate the feasibility of this method by assembling >95% of the heterozygous variants in a human genome into long, accurate haplotype blocks (N50 = 1.4–2.3 Mb). The rapid, scalable and cost-effective workflow could enable haplotype resolution to become routine in human genome sequencing.
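The N50 quoted for haplotype blocks is the block length at which blocks of that length or longer account for at least half of the total phased span. A minimal sketch of that calculation follows; the function name and the example block lengths are illustrative assumptions, not data from the study.

```python
def n50(lengths):
    """Return the length L such that blocks of length >= L
    cover at least half of the summed block length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical block lengths in base pairs (not taken from the paper).
example_blocks = [2_300_000, 1_400_000, 900_000, 400_000, 100_000]
example_n50 = n50(example_blocks)
```

Because the statistic is weighted by length, a few long blocks dominate it, so it is usually read together with the fraction of heterozygous variants phased.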

Figure 1: The Tn5 transposase maintains the contiguity of target DNA after transposition.
Figure 2: Overview of the CPT-seq workflow.
Figure 3: Demonstration of haplotype read islands.
Figure 4: Summary of phasing results.

Accession codes

Primary accessions

BioProject

References

  1. Bansal, V. et al. The next phase in human genetics. Nat. Biotechnol. 29, 38–39 (2011).

  2. Tewhey, R. et al. The importance of phase information for human genomics. Nat. Rev. Genet. 12, 215–223 (2011).

  3. Fan, H.C. et al. Non-invasive prenatal measurement of the fetal genome. Nature 487, 320–324 (2012).

  4. Kitzman, J.O. et al. Noninvasive whole-genome sequencing of a human fetus. Sci. Transl. Med. 4, 137ra76 (2012).

  5. Sabeti, P.C. et al. Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837 (2002).

  6. Adey, A. et al. The haplotype-resolved genome and epigenome of the aneuploid HeLa cancer cell line. Nature 500, 207–211 (2013).

  7. Tishkoff, S.A. et al. Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science 271, 1380–1387 (1996).

  8. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).

  9. Hosomichi, K. et al. Phase-defined complete sequencing of the HLA genes by next-generation sequencing. BMC Genomics 14, 355 (2013).

  10. Browning, S.R. & Browning, B.L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703–714 (2011).

  11. Bansal, V. et al. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res. 18, 1336–1346 (2008).

  12. He, D. et al. Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 26, i183–i190 (2010).

  13. Kaper, F. et al. Whole-genome haplotyping by dilution, amplification, and sequencing. Proc. Natl. Acad. Sci. USA 110, 5552–5557 (2013).

  14. Kitzman, J.O. et al. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat. Biotechnol. 29, 59–63 (2011).

  15. Peters, B.A. et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487, 190–195 (2012).

  16. Fan, H.C. et al. Whole-genome molecular haplotyping of single cells. Nat. Biotechnol. 29, 51–57 (2011).

  17. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).

  18. Duitama, J. et al. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of single individual haplotyping techniques. Nucleic Acids Res. 40, 2041–2053 (2012).

  19. Suk, E.K. et al. A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res. 21, 1672–1685 (2011).

  20. Lo, C. et al. On the design of clone-based haplotyping. Genome Biol. 14, R100 (2013).

  21. Geraci, F. A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics 26, 2217–2225 (2010).

  22. Caruccio, N. Preparation of next-generation sequencing libraries using Nextera technology: simultaneous DNA fragmentation and adaptor tagging by in vitro transposition. Methods Mol. Biol. 733, 241–255 (2011).

  23. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biol. 11, R119 (2010).

  24. Erlich, Y. et al. DNA Sudoku—harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Res. 19, 1243–1253 (2009).

  25. Duitama, J. et al. in Proc. 1st ACM Int. Conf. Bioinformatics Comput. Biol. 160–169 (ACM (Association for Computing Machinery), New York, 2010).

  26. Kuleshov, V. et al. Whole-genome haplotyping using long reads and statistical methods. Nat. Biotechnol. 32, 261–266 (2014).

  27. Abecasis, G.R. et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).

  28. Conrad, D.F. et al. Variation in genome-wide mutation rates within and between human families. Nat. Genet. 43, 712–714 (2011).

  29. Kamphans, T. et al. Filtering for compound heterozygous sequence variants in non-consanguineous pedigrees. PLoS ONE 8, e70151 (2013).

  30. Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).

  31. Lo, C. et al. Strobe sequence design for haplotype assembly. BMC Bioinformatics 12 (suppl. 1), S24 (2011).

  32. Fu, A.Y. et al. A microfabricated fluorescence-activated cell sorter. Nat. Biotechnol. 17, 1109–1111 (1999).

  33. Hua, Z. et al. Multiplexed real-time polymerase chain reaction on a digital microfluidic platform. Anal. Chem. 82, 2310–2316 (2010).

  34. Adey, A. et al. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res. doi:10.1101/gr.178319.114 (19 October 2014).

  35. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

Acknowledgements

We are thankful to J. Bruand, F. Zhang and A. Kia for help with the data analysis. We are also thankful to I. Goryshin, N. Caruccio and R. Vaidyanathan for discussions at different stages of the project. We also thank S. Norberg, J. Zhang, J. Bernd, T. McSherry, T. Le, P. Diep and G. Roberts for performing sequencing, helping with custom recipes and supporting data transfer. J.S. was supported by grant HG006283 from the National Human Genome Research Institute. A.A. and J.O.K. were supported by graduate research fellowship DGE-0718124 from the National Science Foundation.

Author information

Contributions

F.J.S., S.A. and K.L.G. conceived the study. F.J.S. oversaw the technology development. S.A. led the assay development, performed the experiments and analyzed the data. L.C., C.T., N.P., A.A. and J.O.K. performed experiments. T.R. and E.K. performed data analysis. D.P. developed the analysis pipeline. K.V. developed the single-molecule imaging system and collected images for the single-molecule experiments. S.A., L.C., D.P., M.R., K.L.G., J.S. and F.J.S. co-wrote the manuscript. All authors contributed to the revision and review of the manuscript.

Corresponding author

Correspondence to Frank J Steemers.

Ethics declarations

Competing interests

S.A., D.P., L.C., E.K., T.R., C.T., N.P., K.V., M.R., K.L.G. and F.J.S. declare competing financial interests in the form of stock ownership and paid employment by Illumina, Inc.

Integrated supplementary information

Supplementary Figure 1 Single-molecule imaging of contiguously transposed DNA.

Single-molecule imaging of contiguously transposed DNA using Cy5-labeled transposomes and YOYO-1–labeled DNA (colored as red and blue, respectively). The ‘bead-on-a-string’ configuration of the substrate DNA post-transposition (top panel, with Mg2+) indicates that target DNA is not fragmented after transposition. In the absence of Mg2+, transposome complexes bind to substrate DNA (top panel, without Mg2+) but do not transpose into DNA; therefore, protease treatment does not fragment the DNA pre-exposed to transposomes in the absence of Mg2+ (bottom panel, without Mg2+, with protease). When transposition occurred in the presence of Mg2+ and protease (which digests the transposase), DNA fragments (bottom panel, with Mg2+, with protease).

Supplementary Figure 2 Proof-of-principle example showing the distribution of distance values between tandem alignments with SDS treatment before or after the dilution step.

High-molecular-weight genomic DNA was transposed and either diluted before SDS treatment (post-dilution) or after SDS treatment (pre-dilution). In both cases, 1.2 pg of transposed DNA was used to set up PCR. Amplified libraries were sequenced, and the reads were aligned to the human genome. Aligned reads were sorted across the chromosome on the basis of their alignment matching coordinates, and the distribution of distances between tandem alignments (consecutive aligned reads) was calculated and plotted as a histogram. The number of reads in the pre-dilution case was down-sampled to match the read count of the post-dilution sample. For in silico sampling, alignment coordinates were randomly picked from the genome and sorted, and the tandem distance distribution was plotted. When SDS treatment is carried out after dilution, enrichment is observed for reads that map to proximal regions of the genome (represented by the left peak in the bimodal distribution). When dilution is carried out after SDS treatment, the proximal population is not observed. The distribution for the in silico sampling experiment resembles the pre-dilution case. These results demonstrate that DNA stays intact after transposition and dilution and that proximity information can therefore be extracted from each individual molecule.
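The tandem-distance calculation described in this legend can be sketched as follows. This is a minimal illustration that assumes alignments on a single chromosome are represented only by their start coordinates; the function names, the seed and the uniform in silico sampling are assumptions made for the sketch, not the authors' code.

```python
import random


def tandem_distances(positions):
    """Sort alignment coordinates and return the distance
    between each pair of consecutive alignments."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def in_silico_sample(chrom_length, n_reads, seed=0):
    """Pick alignment coordinates uniformly at random, as a
    stand-in for reads that carry no proximity information."""
    rng = random.Random(seed)
    return [rng.randrange(chrom_length) for _ in range(n_reads)]


# Hypothetical usage: distances for randomly placed reads.
random_distances = tandem_distances(in_silico_sample(1_000_000, 50))
```

Plotting a histogram of these distances on a logarithmic axis would be one way to look for the bimodal shape described above; the proximal peak is expected only when reads from the same molecule share an index.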

Supplementary Figure 3 Design of the two-level (transposon and PCR) indexed templates and sequencing readout scheme.

Universal transposon sequences and indexes (i.e., T5 and T7 indexes) are introduced to the sample during the transposition step. During the PCR step, the overlap between the PCR and transposon oligonucleotides (i.e., Universal connector) is used to introduce universal sequencing primers (i.e., P5 and P7) together with the PCR indexes (i.e., P5 and P7 indexes). There are 8 different P5, 12 different P7, 8 different T5, and 12 different T7 index sequences (see Online Methods and Supplementary Table 4).
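The index counts in this legend multiply out to 96 transposon-level and 96 PCR-level combinations, that is, 96 × 96 = 9,216 virtual compartments. The sketch below enumerates them; the integer labelling of compartments is a hypothetical convention chosen for illustration.

```python
from itertools import product

# Index counts as stated in the legend.
N_T5, N_T7, N_P5, N_P7 = 8, 12, 8, 12


def virtual_compartments():
    """Enumerate every (T5, T7, P5, P7) index combination."""
    return list(product(range(N_T5), range(N_T7),
                        range(N_P5), range(N_P7)))


def compartment_id(t5, t7, p5, p7):
    """Map one index combination to an integer in [0, 9216).
    The ordering is an arbitrary choice made for this sketch."""
    return ((t5 * N_T7 + t7) * N_P5 + p5) * N_P7 + p7


all_compartments = virtual_compartments()
```

Each read pair carries all four indexes, so a single lookup of this kind is enough to assign it to its compartment.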

Supplementary Figure 4 Intensity versus cycle plot for a typical two-level dual-indexing sequencing run.

The order of sequencing reads is as follows: genomic DNA read 1 (cycles 1–51), index 1 (transposon i7, cycles 52–59, and PCR i7, cycles 60–67), index 2 (PCR i5, cycles 68–75, and transposon i5, cycles 76–83) and genomic DNA read 2 (cycles 84–134).
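The cycle layout above can be written down as a small table and used to slice a run of base calls into its parts. The sketch assumes, for illustration only, that all 134 cycles for one cluster are available as a single string; real instrument output is normally delivered as separate reads, and the segment names are hypothetical labels.

```python
# (name, first cycle, last cycle), 1-based and inclusive, per the legend.
SEGMENTS = [
    ("read1", 1, 51),
    ("transposon_i7", 52, 59),
    ("pcr_i7", 60, 67),
    ("pcr_i5", 68, 75),
    ("transposon_i5", 76, 83),
    ("read2", 84, 134),
]


def split_cycles(bases):
    """Split 134 concatenated base calls into named segments."""
    if len(bases) != 134:
        raise ValueError("expected exactly 134 cycles")
    return {name: bases[start - 1:end] for name, start, end in SEGMENTS}


# Hypothetical base calls, chosen so the segments are easy to tell apart.
example = "A" * 51 + "C" * 8 + "G" * 8 + "T" * 8 + "C" * 8 + "A" * 51
parts = split_cycles(example)
```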

Supplementary Figure 5 Pulsed-field gel electrophoresis of genomic DNA samples used in this study.

The NA12878, NA12891 and NA12892 samples were either purchased from Coriell or prepared using the Gentra protocol. All samples were analyzed with a Bio-Rad Pulsed-Field Gel Electrophoresis System using a 1% agarose gel run for 16 h at 14 °C at 170 V with a switch time starting at 1 s and progressing to 6 s.

Supplementary Figure 6 Representative coverage plots for three indexes.

The distribution of aligned sequenced reads is plotted for three indexes, with proximal regions showing as islands across part of chromosome 22. The snapshot was generated with the Integrative Genomics Viewer (IGV) v.2.3 (Broad Institute).

Supplementary Figure 7 Representative distribution of distances between tandem alignment reads for a single index.

A bimodal distribution is observed, with proximal and distal genomic regions segregating into two separate subpopulations. NA12878 genomic DNA, acquired from a Gentra preparation, was processed with the CPT-seq workflow and sequenced on four lanes of a HiSeq 2000. Data were demultiplexed and mapped to the reference human genome (hg19).

Supplementary Figure 8 Distribution of intra-island coverage values.

Haplotyping island boundaries were determined by finding clusters of reads such that the distance between any two consecutive reads did not exceed 15 kb and there were at least five unique read pairs in each cluster. The fraction of each haplotyping island covered by sequencing was calculated, and the distribution is plotted.
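The island-calling rule in this legend (consecutive reads no more than 15 kb apart, at least five unique read pairs per cluster) can be sketched as a single pass over sorted coordinates. The sketch assumes that each unique read pair is represented by one start coordinate within a single index and chromosome; that representation and the function name are illustrative assumptions.

```python
MAX_GAP = 15_000   # maximum distance between consecutive reads (bp)
MIN_READS = 5      # minimum unique read pairs per island


def call_islands(positions, max_gap=MAX_GAP, min_reads=MIN_READS):
    """Return (start, end, n_reads) for each cluster that passes the rule."""
    islands = []
    cluster = []
    # set() collapses duplicate coordinates so only unique reads count.
    for pos in sorted(set(positions)):
        if cluster and pos - cluster[-1] > max_gap:
            if len(cluster) >= min_reads:
                islands.append((cluster[0], cluster[-1], len(cluster)))
            cluster = []
        cluster.append(pos)
    if len(cluster) >= min_reads:
        islands.append((cluster[0], cluster[-1], len(cluster)))
    return islands


# Hypothetical coordinates: one island of five reads, then two stray reads.
example_positions = [1000, 3000, 9000, 20000, 30000, 30000, 100000, 101000]
example_islands = call_islands(example_positions)
```

With island boundaries in hand, intra-island coverage would be the number of bases covered by reads divided by the island length; that step needs read lengths, which this sketch does not model.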

Supplementary Figure 9 Summary of the data analysis pipeline for whole-genome phasing.

Demultiplexed sequencing reads from all 9,216 partitions were aligned to the human reference genome (hg19). Alignment coordinates were used to call haplotyping islands. For each partition, initial haplotyping blocks were generated by phasing heterozygous SNPs using ReFHap (ref. 25). Subsequently, SNPs that were linked by only one data point or showed conflicting calls by multiple islands were removed. Next, 1000 Genomes Project panel data were used to phase additional SNPs.

Supplementary Figure 10 Stitching versus filling imputation.

Data from the 1000 Genomes Project can be used to generate longer haplotyping blocks by connecting smaller blocks (stitching imputation). Alternatively, these data can be used to fill in the gaps for SNPs that are missing and not covered by high-confidence experimental data (filling imputation). We report data with (step III) and without (ReFHap accuracy, step I) imputation (Table 1). Imputation is only used for filling gaps as stitching imputation can potentially result in high long-switch error rates. Therefore, the N50 of assembled haplotyping blocks does not change after the imputation step. M denotes a SNP from the mother, and D denotes a SNP from the father. In the ideal case, a haplotype string will consist of only M or D SNPs.
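Using the M/D notation from this legend, a phased block can be checked by counting the places where parental origin changes along the string. The sketch below is a simplified illustration: it treats every change as one switch and makes no attempt to reproduce the error definitions used for Table 1.

```python
def count_switches(haplotype):
    """Count changes of parental origin along a string of M and D calls.
    Characters other than M or D (for example, unphased gaps) are skipped."""
    calls = [c for c in haplotype if c in ("M", "D")]
    return sum(1 for a, b in zip(calls, calls[1:]) if a != b)


# Hypothetical haplotype strings.
ideal = count_switches("MMMMMMMM")         # no change of origin
long_switch = count_switches("MMMMDDDD")   # one change that persists
isolated = count_switches("MMMDMMMM")      # one stray call, two changes
```

An ideal string gives zero. A single change that persists to the end of the block is the pattern a long switch error would produce, which is the risk the legend associates with stitching imputation.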

Supplementary Figure 11 Sequencing depth, phasing coverage and accuracy.

The percentage of SNPs phased and the accuracy of phasing are plotted as a function of sequencing depth.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11 and Supplementary Tables 1–3. (PDF 1199 kb)

Supplementary Table 4

Transposon sequences and sequencing primers. (XLSX 10 kb)

Supplementary Data Set

Source code files. (ZIP 242 kb)

About this article

Cite this article

Amini, S., Pushkarev, D., Christiansen, L. et al. Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat Genet 46, 1343–1349 (2014). https://doi.org/10.1038/ng.3119

  • DOI: https://doi.org/10.1038/ng.3119
