Haplotyping of human chromosomes is a prerequisite for cataloguing the full repertoire of genetic variation. We present a microfluidics-based, linked-read sequencing technology that can phase and haplotype germline and cancer genomes using nanograms of input DNA. This high-throughput platform prepares barcoded libraries for short-read sequencing and computationally reconstructs long-range haplotype and structural variant information. We generate haplotype blocks in a nuclear trio that are concordant with expected inheritance patterns and phase a set of structural variants. We also resolve the structure of the EML4-ALK gene fusion in the NCI-H2228 cancer cell line using phased exome sequencing. Finally, we assign genetic aberrations to specific megabase-scale haplotypes generated from whole-genome sequencing of a primary colorectal adenocarcinoma. This approach resolves haplotype information using up to 100 times less genomic DNA than some methods and enables the accurate detection of structural variants.
- Supplementary Figure 1: Barcode sequencing library and analysis software workflow. (91 KB)
(a) Barcoded primers initiate primer extension in each droplet, followed by (b) pooling of droplets, end-repair, and ligation of the P7 sequencing adaptor. The library is completed by (c) sample-indexing PCR and (d) sequencing on Illumina sequencers. (e) The barcode pipeline builds on established aligners such as BWA and takes either previously called variants or calls from variant callers such as FreeBayes and GATK. It uses linked-reads to enable phasing and structural variant calling. The results are produced in standard file formats such as BAM, VCF, and BEDPE.
- Supplementary Figure 2: Sequencing and phasing performance of NA12878 trio. (156 KB)
(a) The number of reads corresponding to each barcoded oligonucleotide is plotted against its rank to illustrate the uniformity of counts over 100,000 barcodes. (b) Pulsed-field gel electrophoresis of the trio input DNA. NA12878 DNA was run on a separate gel from NA12877 and NA12882, along with 5 kb and 8–48 kb ladders to estimate the size of the input DNA. (c) Gap size distribution of the GemCode NA12878 WGS sample. (d) Coverage vs. GC fraction of barcode libraries from the NA12878 WGS sample. The relative coverage, normalized by the median, is plotted against GC fraction brackets spanning 29% to 60%. (e) Cumulative distribution function of phase block length of the NA12878 trio exome samples. (f) Phasing accuracy of the nuclear trio exome data.
- Supplementary Figure 3: Comparison between barcoded and standard TruSeq libraries. (73 KB)
Coverage distributions of NA12878 from (a) a phased library prepared from 1 ng of genomic DNA and (b) a standard TruSeq library prepared from 100 ng of genomic DNA. (c) Coverage statistics for the NA12878 phased barcoded library versus a standard Illumina TruSeq library.
- Supplementary Figure 4: Barcode overlap of structural variants. (417 KB)
We used non-overlapping 100 kb windows to visualize structural alterations with uniquely mapping, non-duplicated reads. (a) Schematics of barcode overlap in reference (WT), deletion, inversion and tandem duplication. Matrix view of representative barcode overlap patterns for (b) reference, (c) deletion, (d) inversion and (e) tandem duplication events. Barcode overlap of heterozygous (f) inversion and (g) inversion and tandem duplication events in NA12878.
- Supplementary Figure 5: Barcode count analysis of eight deletion candidates in linked-read WGS data from NA12878. (175 KB)
(a) Barcode counts in the regions of five high-scoring deletions. (b) Barcode counts in the intervals covering three low-scoring deletions.
- Supplementary Figure 6: Validation of genomic deletions with targeted sequencing. (188 KB)
We used a targeted sequencing approach called Oligonucleotide Selective-Sequencing (OS-Seq) to validate the breakpoints of the deletions. Four of the five high-ranked candidates had a minimum of 450 reads aligning beyond the opposite breakpoint and at least 90 reads covering the breakpoint. The remaining high-scoring deletion showed additional sequence complexity in the targeted sequencing data. An example of a validated high-scoring deletion is shown. (a) Ribbon plot displaying the location of reads mapped to the breakpoints of a high-scoring deletion. Left, position of reads mapped to the left breakpoint, where red represents probes mapping to the 5′ end of the breakpoint (using coordinates at the bottom of the plot), and blue represents probes mapping to the 3′ end of the breakpoint (using coordinates at the top of the plot). Right, position of reads mapped to the right breakpoint. The y-axis indicates the index of the reads. The pink line represents the mappability of the reads, where 1 indicates unique mapping and 0 indicates mapping to multiple places in the genome. Because the deletion is heterozygous, reads colored red on the left plot come from the wild-type allele, and reads colored blue on the left plot come from the deleted haplotype. The asterisks and arrows denote the locations of primer probes, their direction of capture, and their typical capture distance. (b) Validation of breakpoint structure by soft-clipped read counting. Read 1s are grouped by primer probe (read 2) identity. Soft-clipped reads supporting the breakpoint structure are tallied by each breakpoint's start and end location, and are reported as reads mapping "across" the breakpoint in Supplementary Table 6. (c) IGV screenshots of read alignments from a high-scoring deletion, shown by left and right breakpoints and by Haplotype 1 and Haplotype 2. The deletion on Haplotype 2 is indicated by the absence of reads at the left and right breakpoints of that haplotype. (d) IGV screenshots of read alignments from a low-scoring deletion, shown by left and right breakpoints and by Haplotype 1 and Haplotype 2. Reads are missing from the right breakpoint of both Haplotype 1 and Haplotype 2, suggesting that reads cannot be properly mapped to the breakpoint and that the breakpoint is not accurate.
- Supplementary Figure 7: ALK gene fusions in NA12878 exome and NCI-H2228 WGS data. (91 KB)
Heatmap of barcode overlap of (a) EML4-ALK and (b) ALK-PTPN3 in NA12878 exome (a negative control). Barcode overlap of (c) EML4-ALK and (d) ALK-PTPN3 in NCI-H2228 WGS. (e) RT-PCR data of EML4-ALK and ALK-PTPN3 transcripts in NA12878 and NCI-H2228.
- Supplementary Text and Figures (1,608 KB)
Supplementary Figures 1–7
- Supplementary Information (1,985 KB)
Supplementary Tables 1–6, Supplementary Tables 8–13 and Supplementary Notes 1 and 2
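The barcode-overlap view described for Supplementary Figure 4 (shared barcodes between non-overlapping 100 kb windows) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `window_barcodes` and `overlap_matrix`, the `(chrom, pos, barcode)` input tuples, and the toy reads are all assumptions made for illustration, and the reads are assumed to be already filtered to uniquely mapping, non-duplicated alignments.

```python
from collections import defaultdict

WINDOW = 100_000  # 100 kb windows, matching the Supplementary Figure 4 caption


def window_barcodes(reads, window=WINDOW):
    """Map each (chrom, window index) to the set of barcodes observed there.

    `reads` is an iterable of hypothetical (chrom, pos, barcode) tuples,
    assumed to be pre-filtered to uniquely mapping, non-duplicated reads.
    """
    bins = defaultdict(set)
    for chrom, pos, barcode in reads:
        bins[(chrom, pos // window)].add(barcode)
    return bins


def overlap_matrix(bins, keys):
    """Square matrix of shared-barcode counts between the listed windows."""
    return [[len(bins[a] & bins[b]) for b in keys] for a in keys]


# Toy input (invented for illustration): barcode BC2 links two distant windows.
reads = [
    ("chr2", 50_000, "BC1"),
    ("chr2", 150_000, "BC1"),
    ("chr2", 160_000, "BC2"),
    ("chr2", 950_000, "BC2"),
    ("chr2", 960_000, "BC3"),
]
bins = window_barcodes(reads)
keys = sorted(bins)
matrix = overlap_matrix(bins, keys)
for key, row in zip(keys, matrix):
    print(key, row)
```

In a matrix like this, adjacent windows are expected to share barcodes because they are covered by the same long input molecules, so off-diagonal signal between distant windows is the kind of pattern the caption associates with deletions, inversions and tandem duplications. A real analysis would also need to account for background barcode sharing, which this sketch does not model.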