Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing

Amini, Sasan; Pushkarev, Dmitry; Christiansen, Lena; Kostem, Emrah; Royce, Tom; Turk, Casey; Pignatelli, Natasha; Adey, Andrew; Kitzman, Jacob O; Vijayan, Kandaswamy; Ronaghi, Mostafa; Shendure, Jay; Gunderson, Kevin L; Steemers, Frank J

doi:10.1038/ng.3119

Technical Report
Published: 19 October 2014

Nature Genetics volume 46, pages 1343–1349 (2014)

Abstract

Haplotype-resolved genome sequencing enables the accurate interpretation of medically relevant genetic variation, deep inferences regarding population history and non-invasive prediction of fetal genomes. We describe an approach for genome-wide haplotyping based on contiguity-preserving transposition (CPT-seq) and combinatorial indexing. Tn5 transposition is used to modify DNA with adaptor and index sequences while preserving contiguity. After DNA dilution and compartmentalization, the transposase is removed, resolving the DNA into individually indexed libraries. The libraries in each compartment, enriched for neighboring genomic elements, are further indexed via PCR. Combinatorial 96-plex indexing at both the transposition and PCR stages enables the construction of phased synthetic reads from each of the nearly 10,000 'virtual compartments'. We demonstrate the feasibility of this method by assembling >95% of the heterozygous variants in a human genome into long, accurate haplotype blocks (N50 = 1.4–2.3 Mb). The rapid, scalable and cost-effective workflow could enable haplotype resolution to become routine in human genome sequencing.

**Figure 1: The Tn5 transposase maintains the contiguity of target DNA after transposition.**

**Figure 2: Overview of the CPT-seq workflow.**

**Figure 3: Demonstration of haplotype read islands.**

**Figure 4: Summary of phasing results.**

Accession codes

Primary accessions

BioProject

PRJNA241346

References

1. Bansal, V. et al. The next phase in human genetics. Nat. Biotechnol. 29, 38–39 (2011).
2. Tewhey, R. et al. The importance of phase information for human genomics. Nat. Rev. Genet. 12, 215–223 (2011).
3. Fan, H.C. et al. Non-invasive prenatal measurement of the fetal genome. Nature 487, 320–324 (2012).
4. Kitzman, J.O. et al. Noninvasive whole-genome sequencing of a human fetus. Sci. Transl. Med. 4, 137ra76 (2012).
5. Sabeti, P.C. et al. Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837 (2002).
6. Adey, A. et al. The haplotype-resolved genome and epigenome of the aneuploid HeLa cancer cell line. Nature 500, 207–211 (2013).
7. Tishkoff, S.A. et al. Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science 271, 1380–1387 (1996).
8. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).
9. Hosomichi, K. et al. Phase-defined complete sequencing of the HLA genes by next-generation sequencing. BMC Genomics 14, 355 (2013).
10. Browning, S.R. & Browning, B.L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703–714 (2011).
11. Bansal, V. et al. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res. 18, 1336–1346 (2008).
12. He, D. et al. Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 26, i183–i190 (2010).
13. Kaper, F. et al. Whole-genome haplotyping by dilution, amplification, and sequencing. Proc. Natl. Acad. Sci. USA 110, 5552–5557 (2013).
14. Kitzman, J.O. et al. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat. Biotechnol. 29, 59–63 (2011).
15. Peters, B.A. et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487, 190–195 (2012).
16. Fan, H.C. et al. Whole-genome molecular haplotyping of single cells. Nat. Biotechnol. 29, 51–57 (2011).
17. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).
18. Duitama, J. et al. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of single individual haplotyping techniques. Nucleic Acids Res. 40, 2041–2053 (2012).
19. Suk, E.K. et al. A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res. 21, 1672–1685 (2011).
20. Lo, C. et al. On the design of clone-based haplotyping. Genome Biol. 14, R100 (2013).
21. Geraci, F. A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics 26, 2217–2225 (2010).
22. Caruccio, N. Preparation of next-generation sequencing libraries using Nextera technology: simultaneous DNA fragmentation and adaptor tagging by in vitro transposition. Methods Mol. Biol. 733, 241–255 (2011).
23. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biol. 11, R119 (2010).
24. Erlich, Y. et al. DNA Sudoku—harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Res. 19, 1243–1253 (2009).
25. Duitama, J. et al. in Proc. 1st ACM Int. Conf. Bioinformatics Comput. Biol. 160–169 (ACM (Association for Computing Machinery), New York, 2010).
26. Kuleshov, V. et al. Whole-genome haplotyping using long reads and statistical methods. Nat. Biotechnol. 32, 261–266 (2014).
27. Abecasis, G.R. et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
28. Conrad, D.F. et al. Variation in genome-wide mutation rates within and between human families. Nat. Genet. 43, 712–714 (2011).
29. Kamphans, T. et al. Filtering for compound heterozygous sequence variants in non-consanguineous pedigrees. PLoS ONE 8, e70151 (2013).
30. Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
31. Lo, C. et al. Strobe sequence design for haplotype assembly. BMC Bioinformatics 12 (suppl. 1), S24 (2011).
32. Fu, A.Y. et al. A microfabricated fluorescence-activated cell sorter. Nat. Biotechnol. 17, 1109–1111 (1999).
33. Hua, Z. et al. Multiplexed real-time polymerase chain reaction on a digital microfluidic platform. Anal. Chem. 82, 2310–2316 (2010).
34. Adey, A. et al. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res. doi:10.1101/gr.178319.114 (19 October 2014).
35. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

Acknowledgements

We are thankful to J. Bruand, F. Zhang and A. Kia for help with the data analysis. We are also thankful to I. Goryshin, N. Caruccio and R. Vaidyanathan for discussions at different stages of the project. We also thank S. Norberg, J. Zhang, J. Bernd, T. McSherry, T. Le, P. Diep and G. Roberts for performing sequencing, helping with custom recipes and supporting data transfer. J.S. was supported by grant HG006283 from the National Human Genome Research Institute. A.A. and J.O.K. were supported by graduate research fellowship DGE-0718124 from the National Science Foundation.

Author information

Authors and Affiliations

Illumina, Inc., Advanced Research Group, San Diego, California, USA
Sasan Amini, Dmitry Pushkarev, Lena Christiansen, Emrah Kostem, Tom Royce, Casey Turk, Natasha Pignatelli, Kandaswamy Vijayan, Mostafa Ronaghi, Kevin L Gunderson & Frank J Steemers
Department of Genome Sciences, University of Washington, Seattle, Washington, USA
Andrew Adey, Jacob O Kitzman & Jay Shendure

Contributions

F.J.S., S.A. and K.L.G. conceived the study. F.J.S. oversaw the technology development. S.A. led the assay development, performed the experiments and analyzed the data. L.C., C.T., N.P., A.A. and J.O.K. performed experiments. T.R. and E.K. performed data analysis. D.P. developed the analysis pipeline. K.V. developed the single-molecule imaging system and collected images for the single-molecule experiments. S.A., L.C., D.P., M.R., K.L.G., J.S. and F.J.S. co-wrote the manuscript. All authors contributed to the revision and review of the manuscript.

Corresponding author

Correspondence to Frank J Steemers.

Ethics declarations

Competing interests

S.A., D.P., L.C., E.K., T.R., C.T., N.P., K.V., M.R., K.L.G. and F.J.S. declare competing financial interests in the form of stock ownership and paid employment by Illumina, Inc.

Integrated supplementary information

Supplementary Figure 1 Single-molecule imaging of contiguously transposed DNA.

Single-molecule imaging of contiguously transposed DNA using Cy5-labeled transposomes and YOYO-1–labeled DNA (colored red and blue, respectively). The ‘bead-on-a-string’ configuration of the substrate DNA after transposition (top panel, with Mg²⁺) indicates that the target DNA is not fragmented by transposition alone. In the absence of Mg²⁺, transposome complexes bind to the substrate DNA (top panel, without Mg²⁺) but do not transpose into it; protease treatment therefore does not fragment DNA that was exposed to transposomes without Mg²⁺ (bottom panel, without Mg²⁺, with protease). When transposition occurred in the presence of Mg²⁺ and the sample was treated with protease (which digests the transposase), the DNA fragmented (bottom panel, with Mg²⁺, with protease).

Supplementary Figure 2 Proof-of-principle example showing the distribution of distance values between tandem alignments with SDS treatment before or after the dilution step.

High-molecular-weight genomic DNA was transposed and either diluted before SDS treatment (post-dilution) or after SDS treatment (pre-dilution). In both cases, 1.2 pg of transposed DNA was used to set up PCR. Amplified libraries were sequenced, and the reads were aligned to the human genome. Aligned reads were sorted across the chromosome on the basis of their alignment matching coordinates, and the distribution of distances between tandem alignments (consecutive aligned reads) was calculated and plotted as a histogram. The number of reads in the pre-dilution case was down-sampled to match the read count of the post-dilution sample. For in silico sampling, alignment coordinates were randomly picked from the genome and sorted, and the tandem distance distribution was plotted. When SDS treatment is carried out after dilution, enrichment is observed for reads that map to proximal regions of the genome (represented by the left peak in the bimodal distribution). When dilution is carried out after SDS treatment, the proximal population is not observed. The distribution for the in silico sampling experiment resembles the pre-dilution case. These results demonstrate that DNA stays intact after transposition and dilution and that proximity information can therefore be extracted from each individual molecule.
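
The tandem-distance calculation described in this legend can be sketched in a few lines of Python. This is a minimal illustration for a single chromosome, not the authors' pipeline: the function names, the uniform random model for the in silico sample and the `cutoff` summary statistic are all assumptions introduced here.

```python
import random

def tandem_distances(positions):
    """Distances between consecutive alignments after sorting by coordinate."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def in_silico_distances(n_reads, chrom_length, seed=0):
    """Tandem distances for coordinates drawn uniformly at random (assumed null model)."""
    rng = random.Random(seed)
    return tandem_distances(rng.randrange(chrom_length) for _ in range(n_reads))

def proximal_fraction(distances, cutoff):
    """Fraction of tandem distances at or below a cutoff (hypothetical summary)."""
    if not distances:
        return 0.0
    return sum(d <= cutoff for d in distances) / len(distances)
```

Under these assumptions, a post-dilution library would be expected to show a larger `proximal_fraction` at a kilobase-scale cutoff than either the pre-dilution library or the random sample, which corresponds to the left peak of the bimodal histogram.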

Supplementary Figure 3 Design of the two-level (transposon and PCR) indexed templates and sequencing readout scheme.

Universal transposon sequences and indexes (i.e., T5 and T7 indexes) are introduced to the sample during the transposition step. During the PCR step, the overlap between the PCR and transposon oligonucleotides (i.e., Universal connector) is used to introduce universal sequencing primers (i.e., P5 and P7) together with the PCR indexes (i.e., P5 and P7 indexes). There are 8 different P5, 12 different P7, 8 different T5, and 12 different T7 index sequences (see Online Methods and Supplementary Table 4).
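
The index counts above determine the number of virtual compartments. The sketch below simply enumerates the combinations; the integer labels stand in for the actual index sequences (listed in Supplementary Table 4), and the function name is hypothetical.

```python
from itertools import product

# Index counts taken from the legend; the integer labels are placeholders.
N_T5, N_T7 = 8, 12   # transposon-level indexes: 8 x 12 = 96
N_P5, N_P7 = 8, 12   # PCR-level indexes:        8 x 12 = 96

def virtual_compartments():
    """Every (T5, T7, P5, P7) index combination, one per virtual compartment."""
    return list(product(range(N_T5), range(N_T7), range(N_P5), range(N_P7)))

compartments = virtual_compartments()
print(len(compartments))  # 96 * 96 = 9216
```

The 9,216 combinations match the "nearly 10,000" virtual compartments quoted in the abstract.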

Supplementary Figure 4 Intensity versus cycle plot for a typical two-level dual-indexing sequencing run.

The order of sequencing reads is as follows: genomic DNA read 1 (cycles 1–51), index 1 (transposon i7, cycles 52–59, and PCR i7, cycles 60–67), index 2 (PCR i5, cycles 68–75, and transposon i5, cycles 76–83) and genomic DNA read 2 (cycles 84–134).
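
As an illustration of this readout scheme, the following sketch splits a 134-cycle base string into its six segments. Treating the run as one concatenated string is an assumption made for clarity (sequencers normally report each read separately), and the segment names are hypothetical labels.

```python
# (name, first cycle, last cycle), 1-based and inclusive, as listed in the legend.
SEGMENTS = [
    ("read1", 1, 51),
    ("transposon_i7", 52, 59),
    ("pcr_i7", 60, 67),
    ("pcr_i5", 68, 75),
    ("transposon_i5", 76, 83),
    ("read2", 84, 134),
]

def split_cycles(bases):
    """Split one base call per cycle into genomic reads and the four indexes."""
    if len(bases) != SEGMENTS[-1][2]:
        raise ValueError("expected exactly 134 cycles")
    # Convert 1-based inclusive cycle numbers to Python slice bounds.
    return {name: bases[start - 1:end] for name, start, end in SEGMENTS}
```

The two transposon indexes and the two PCR indexes returned for each cluster together identify its virtual compartment.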

Supplementary Figure 5 Pulsed-field gel electrophoresis of genomic DNA samples used in this study.

The NA12878, NA12891 and NA12892 samples were either purchased from Coriell or prepared using the Gentra protocol. All samples were analyzed with a Bio-Rad Pulsed-Field Gel Electrophoresis System using a 1% agarose gel run for 16 h at 14 °C at 170 V with a switch time starting at 1 s and progressing to 6 s.

Supplementary Figure 6 Representative coverage plots for three indexes.

The distribution of aligned sequenced reads is plotted for three indexes, with proximal regions showing as islands across part of chromosome 22. The snapshot was generated with the Integrative Genomics Viewer (IGV) v.2.3 (Broad Institute).

Supplementary Figure 7 Representative distribution of distances between tandem alignment reads for a single index.

A bimodal distribution is observed, with proximal and distal genomic regions segregating into two separate subpopulations. NA12878 genomic DNA, acquired from a Gentra preparation, was processed with the CPT-seq workflow and sequenced on four lanes of a HiSeq 2000. Data were demultiplexed and mapped to the reference human genome (hg19).

Supplementary Figure 8 Distribution of intra-island coverage values.

Haplotyping island boundaries were determined by finding clusters of reads such that the distance between any two consecutive reads did not exceed 15 kb and there were at least five unique read pairs in each cluster. The fraction of each haplotyping island covered by sequencing was calculated, and the distribution is plotted.
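
The island-calling rule stated here (consecutive reads no more than 15 kb apart, at least five unique read pairs per cluster) can be sketched as a single pass over sorted coordinates. This is an illustrative reading, not the published code: it assumes each read pair is represented by one start coordinate on a single chromosome within one partition, and that pairs with identical starts count as duplicates.

```python
def call_islands(read_starts, max_gap=15_000, min_pairs=5):
    """Cluster read-pair start coordinates into (start, end, n_pairs) islands."""
    positions = sorted(set(read_starts))  # identical starts treated as duplicates (assumption)
    islands, current = [], []
    for pos in positions:
        # A gap larger than max_gap closes the current cluster.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_pairs:
                islands.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_pairs:
        islands.append((current[0], current[-1], len(current)))
    return islands
```

For example, `call_islands([0, 1000, 2000, 3000, 4000, 100000, 101000])` returns `[(0, 4000, 5)]`: the two distant reads form a cluster that is too small to be called as an island.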

Supplementary Figure 9 Summary of the data analysis pipeline for whole-genome phasing.

Demultiplexed sequencing reads from all 9,216 partitions were aligned to the human reference genome (hg19). Alignment coordinates were used to call haplotyping islands. For each partition, initial haplotyping blocks were generated by phasing heterozygous SNPs using ReFHap²⁵. Subsequently, SNPs that were linked by only one data point or showed conflicting calls by multiple islands were removed. Next, 1000 Genomes Project panel data were used to phase additional SNPs.
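
One possible reading of the SNP filtering step is sketched below. The data layout is an assumption: each SNP maps to the list of phase calls (0 or 1) it received from the islands covering it. A SNP is dropped if it is supported by a single call or if its calls disagree. The published pipeline may define these criteria differently.

```python
def filter_snps(calls):
    """Keep SNPs supported by at least two islands that all agree on the phase.

    calls: dict mapping a SNP identifier to a list of phase calls (0 or 1),
    one per island covering that SNP (hypothetical layout).
    """
    kept = {}
    for snp, votes in calls.items():
        if len(votes) < 2:        # linked by only one data point
            continue
        if len(set(votes)) > 1:   # conflicting calls from different islands
            continue
        kept[snp] = votes[0]
    return kept
```

With `{"rs1": [0, 0, 0], "rs2": [1], "rs3": [0, 1], "rs4": [1, 1]}` as input (made-up identifiers), only `rs1` and `rs4` survive the filter.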

Supplementary Figure 10 Stitching versus filling imputation.

Data from the 1000 Genomes Project can be used to generate longer haplotyping blocks by connecting smaller blocks (stitching imputation). Alternatively, these data can be used to fill in the gaps for SNPs that are missing and not covered by high-confidence experimental data (filling imputation). We report data with (step III) and without (ReFHap accuracy, step I) imputation (Table 1). Imputation is only used for filling gaps as stitching imputation can potentially result in high long-switch error rates. Therefore, the N50 of assembled haplotyping blocks does not change after the imputation step. M denotes a SNP from the mother, and D denotes a SNP from the father. In the ideal case, a haplotype string will consist of only M or D SNPs.
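
Two quantities mentioned in this legend, switches within an M/D haplotype string and the N50 of block lengths, can be illustrated as follows. Both functions are simplified sketches under stated assumptions: `count_switches` counts every change of parental origin and does not distinguish long from short switches, and `n50` uses the common definition based on block lengths, which may differ in detail from the one used for Table 1.

```python
def count_switches(haplotype):
    """Number of adjacent SNP pairs whose parental origin (M or D) differs."""
    return sum(a != b for a, b in zip(haplotype, haplotype[1:]))

def n50(block_lengths):
    """Length L such that blocks of length >= L contain at least half the total."""
    total = sum(block_lengths)
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no blocks given")
```

An ideal block such as `"MMMMMM"` has zero switches, whereas `"MMMDDD"` has one. Because filling imputation adds SNPs inside existing blocks without joining them, the block boundaries, and hence the N50, are unchanged by that step.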

Supplementary Figure 11 Sequencing depth, phasing coverage and accuracy.

The percentage of SNPs phased and the accuracy of phasing are plotted as a function of sequencing depth.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11 and Supplementary Tables 1–3. (PDF 1199 kb)

Supplementary Table 4

Transposon sequences and sequencing primers. (XLSX 10 kb)

Supplementary Data Set

Source code files. (ZIP 242 kb)

About this article

Cite this article

Amini, S., Pushkarev, D., Christiansen, L. et al. Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat Genet 46, 1343–1349 (2014). https://doi.org/10.1038/ng.3119

Received: 26 February 2014
Accepted: 24 September 2014
Published: 19 October 2014
Issue Date: December 2014
DOI: https://doi.org/10.1038/ng.3119
