Haplotyping germline and cancer genomes with high-throughput linked-read sequencing

Journal name:
Nature Biotechnology
Volume:
34,
Pages:
303–311
DOI:
doi:10.1038/nbt.3432

Abstract

Haplotyping of human chromosomes is a prerequisite for cataloguing the full repertoire of genetic variation. We present a microfluidics-based, linked-read sequencing technology that can phase and haplotype germline and cancer genomes using nanograms of input DNA. This high-throughput platform prepares barcoded libraries for short-read sequencing and computationally reconstructs long-range haplotype and structural variant information. We generate haplotype blocks in a nuclear trio that are concordant with expected inheritance patterns and phase a set of structural variants. We also resolve the structure of the EML4-ALK gene fusion in the NCI-H2228 cancer cell line using phased exome sequencing. Finally, we assign genetic aberrations to specific megabase-scale haplotypes generated from whole-genome sequencing of a primary colorectal adenocarcinoma. This approach resolves haplotype information using up to 100 times less genomic DNA than some methods and enables the accurate detection of structural variants.

Figures

  1. Figure 1: Overview of the technology for generating linked reads.

    (a) Gel beads loaded with primers and barcoded oligonucleotides are mixed first with a DNA and enzyme mixture and then with an oil-surfactant solution at a microfluidic 'double-cross' junction. Gel bead–containing droplets flow to a reservoir where the gel beads are dissolved, initiating whole-genome primer extension. The products from all droplets are then pooled. Final library preparation requires shearing the libraries and incorporating Illumina adapters. (b) Top, linked reads of the ALK gene from the NA12878 WGS sample. Lines represent linked reads; dots represent reads; color indicates barcode. Middle, exon boundaries of the ALK gene. Bottom, linked reads of the ALK gene from the NA12878 exome data. Reads from neighboring exons are linked by common barcodes. Only a small fraction of the linked reads is shown.

  2. Figure 2: Phasing performance of NA12878 trio analysis.

    (a) Length-weighted molecule size of the trio WGS data, calculated as the number of molecules in each length bin × the median of that length bin. (b) Cumulative distribution function of phase block length of the trio WGS samples. (c) Phasing accuracy. For all pairs of SNVs in the same phase block, the probability that a pair is phased correctly is plotted as a function of the distance between the two SNVs. Inset, SNV pairs separated by at least 0.1 Mb. (d) Haplotype blocks of LRRK2 from the trio exome libraries, demonstrating Mendelian inheritance. Most of this gene is phased in all trio samples, but the beginning is not. NA12882 (child) inherited haplotype 2 from NA12877 (father) and haplotype 1 from NA12878 (mother). Gray bars represent reference alleles; green bars represent alternative alleles.
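The two calculations named in panels a and c can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: "the median of the length bin" is read here as the median length of the molecules that fall into the bin, a truth haplotype for each SNV (for example from trio inheritance) is assumed to be available, and the function names, bin widths and toy inputs are all hypothetical.

```python
from itertools import combinations
from statistics import median


def length_weighted_histogram(molecule_lengths, bin_width):
    """Map each non-empty length bin to (number of molecules) x (median length)."""
    bins = {}
    for length in molecule_lengths:
        bin_start = (length // bin_width) * bin_width
        bins.setdefault(bin_start, []).append(length)
    # Assumption: "median of the length bin" = median of the lengths in that bin.
    return {start: len(members) * median(members)
            for start, members in sorted(bins.items())}


def pairwise_phasing_accuracy(snvs, distance_bin):
    """snvs: (position, called_haplotype, true_haplotype) tuples from one phase block.

    A pair counts as correct when the called relative phase (same or different
    haplotype) matches the true relative phase.
    """
    totals, correct = {}, {}
    for (p1, c1, t1), (p2, c2, t2) in combinations(snvs, 2):
        key = (abs(p2 - p1) // distance_bin) * distance_bin
        totals[key] = totals.get(key, 0) + 1
        if (c1 == c2) == (t1 == t2):
            correct[key] = correct.get(key, 0) + 1
    return {key: correct.get(key, 0) / totals[key] for key in sorted(totals)}


# Hypothetical toy inputs, not data from the study.
weighted = length_weighted_histogram(
    [12_000, 18_000, 55_000, 58_000, 61_000, 120_000], bin_width=50_000)
accuracy = pairwise_phasing_accuracy(
    [(100, 0, 0), (600, 1, 1), (1500, 0, 1)], distance_bin=1000)
```

Comparing relative phase rather than absolute labels matters because the haplotype 1 and haplotype 2 labels within a phase block are arbitrary.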

  3. Figure 3: Detecting genomic deletions in NA12878.

    (a) Top, heat map of overlapping barcodes plotted for a deletion on chromosome 6 (chr. 6): 78967194–79036419 in NA12878. Black circles indicate overlapping barcodes near the breakpoints. Bottom, heat map of barcodes in the same region for NA12882, shown as a negative control. (b) Linked-read data of the NA12878 WGS sample spanning chr. 6: 78967194–79036419. In haplotype 1 (top), overlapping barcodes are observed only in contiguous regions. In haplotype 2 (bottom), a deletion is shown as a gap in linked reads. In contrast to regions without a deletion, barcodes in the region before the gap overlap with barcodes in the region after the gap. Horizontal lines represent linked reads with the same barcode; dots represent reads; colors indicate barcodes. Dashed vertical black lines represent breakpoints. (c) Summary of eight deletion candidates, including supporting evidence from overlapping barcode (BC) count, phasing of the deletion breakpoints and inheritance support in NA12882. Whereas the five high-scoring SV candidates have support from each type of evidence, two lower-scoring candidates lack support from any type of evidence, including targeted sequencing. Haplotype assignment in one phase block is not necessarily the same as the haplotype assignment in a different phase block. Hap1, haplotype 1; hap2, haplotype 2.
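The barcode evidence described in panel b can be illustrated by counting barcodes that occur on both sides of a candidate deletion. The sketch below is a simplified assumption of how such a count could be made, not the scoring used in the study; the function names, the 10 kb flank size and the reads are hypothetical, and only the breakpoint coordinates come from the caption.

```python
def barcodes_in_window(reads, start, end):
    """Return the set of barcodes with at least one read in [start, end)."""
    return {barcode for position, barcode in reads if start <= position < end}


def flank_barcode_overlap(reads, left_breakpoint, right_breakpoint, flank):
    """Count barcodes seen both just before and just after a candidate deletion."""
    left = barcodes_in_window(reads, left_breakpoint - flank, left_breakpoint)
    right = barcodes_in_window(reads, right_breakpoint, right_breakpoint + flank)
    return len(left & right)


# Hypothetical reads as (position, barcode); breakpoints are taken from the caption.
reads = [
    (78_960_000, "BC_A"), (79_040_000, "BC_A"),  # molecule spanning the gap
    (78_965_000, "BC_B"), (79_037_000, "BC_B"),  # molecule spanning the gap
    (78_966_000, "BC_C"), (78_990_000, "BC_C"),  # molecule from the intact haplotype
    (79_038_000, "BC_D"),                        # molecule on the right flank only
]
shared = flank_barcode_overlap(reads, 78_967_194, 79_036_419, flank=10_000)
```

On a haplotype without the deletion the two flanks are roughly 69 kb apart, so fewer molecules are expected to contribute reads to both windows; a real analysis would compare the count against such a background rather than use it on its own.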

  4. Figure 4: Rearrangement detection of an EML4-ALK fusion from exome sequencing of NCI-H2228.

    (a) Overlap of barcodes between exons 20–28 (e20–e28) of ALK and exons 2–6 (e2–e6) of EML4. (b) Overlap of barcodes between e1–e2 of ALK and e7–e17 of EML4. (c) Overlap of barcodes between e10–e11 of ALK and the 5′ end of PTPN3. Blue bars in a–c represent exons. (d) Barcode counts in ALK of the NCI-H2228 WGS sample. (e) Schematic of the complex chromosomal rearrangement involving ALK, EML4 and PTPN3. Instead of the simple inversion reported in the literature, we observed a deletion, an inversion of ALK on chromosome 2 with EML4 and an insertion of ALK into PTPN3 on chromosome 9. (f) Phasing support around the ALK and PTPN3 breakpoints in the EML4-ALK and ALK-PTPN3 gene fusions. Haplotype assignment in one phase block is not necessarily the same as the haplotype assignment in a different phase block. Chr., chromosome.

  5. Figure 5: Phasing analysis of a primary colon cancer genome and structure of the TP53 driver event.

    (a) Length-weighted molecule size histogram of normal (N) and tumor (T) samples from patient 1532. (b) Cumulative distribution function of phase block length of the normal and tumor tissue. (c) Phased haplotype block showing TP53 C>T mutation in haplotype 2. (d) Minor allele fraction of the tumor sample (relative to the matched normal) on chromosome 17. (e) Barcode count throughout chromosome 17 for tumor (blue) and matched normal (gray) tissue. Red box depicts the TP53 region. (f) Phasing analysis of TP53 between tumor and matched normal tissue. Left, ratio of SNV counts between tumor and normal in the TP53 region. Right, density of SNV ratios of haplotype 1 and haplotype 2. Whereas the SNV density centers around 1 for haplotype 1, most SNV ratios between tumor and normal are 0.5 in haplotype 2, indicating that LOH is on haplotype 2. Chr., chromosome; hap, haplotype.
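The reasoning in panel f (one haplotype with tumor-to-normal ratios near 1, the other near 0.5) can be sketched as a per-haplotype ratio summary. This is a minimal illustration under assumptions: the per-SNV counts are taken as already assigned to haplotypes and already normalized for sequencing depth, and the data layout, function name and numbers are hypothetical rather than taken from the study.

```python
from statistics import median


def haplotype_ratio_medians(snvs):
    """Return the median tumor/normal count ratio for each haplotype.

    snvs: list of dicts shaped like
        {"tumor": {"hap1": int, "hap2": int}, "normal": {"hap1": int, "hap2": int}}
    SNVs with zero normal count on a haplotype are skipped for that haplotype.
    """
    ratios = {"hap1": [], "hap2": []}
    for snv in snvs:
        for hap in ratios:
            normal_count = snv["normal"][hap]
            if normal_count > 0:
                ratios[hap].append(snv["tumor"][hap] / normal_count)
    return {hap: median(values) for hap, values in ratios.items() if values}


# Hypothetical, depth-normalized counts for three heterozygous SNVs.
snvs = [
    {"tumor": {"hap1": 20, "hap2": 10}, "normal": {"hap1": 20, "hap2": 20}},
    {"tumor": {"hap1": 18, "hap2": 8}, "normal": {"hap1": 20, "hap2": 20}},
    {"tumor": {"hap1": 22, "hap2": 12}, "normal": {"hap1": 20, "hap2": 20}},
]
medians = haplotype_ratio_medians(snvs)
```

A haplotype whose median ratio sits well below that of the other haplotype is the candidate for the copy loss; deciding how far below is enough would need a statistical test that this sketch omits.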

  6. Supplementary Fig. 1: Barcode sequencing library and analysis software workflow.

    (a) Barcoded primers are used to initiate primer extension in each droplet, which is followed by (b) pooling of droplets, end-repair and ligation of the P7 sequencing adaptor. The library is completed by (c) sample-indexing PCR and (d) sequencing on Illumina sequencers. (e) The barcode pipeline builds on established aligners such as BWA and on variants that were either called previously or produced by variant callers such as FreeBayes and GATK. It uses linked reads to enable phasing and structural variant calling. The results are produced in standard file formats such as BAM, VCF and BEDPE.

  7. Supplementary Fig. 2: Sequencing and phasing performance of NA12878 trio.

    (a) The number of reads corresponding to each barcoded oligonucleotide is plotted against its rank to illustrate the uniformity of counts over 100,000 barcodes. (b) Pulsed-field gel electrophoresis of the trio input DNA. NA12878 DNA was run on a separate gel from NA12877 and NA12882, along with 5 kb and 8–48 kb ladders to estimate the size of the input DNA. (c) Gap size distribution of the GemCode NA12878 WGS sample. (d) Coverage versus GC fraction of barcode libraries from the NA12878 WGS sample. The relative coverage, normalized by the median, is plotted against GC fraction brackets spanning from 29% to 60%. (e) Cumulative distribution function of phase block length of the NA12878 trio exome samples. (f) Phasing accuracy of the nuclear trio exome data.
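The normalization described in panel d can be sketched as follows. This is an illustration under assumptions, not the study's method: coverage is summarized per GC bracket by its mean, the brackets are 5 percentage points wide, and the window values are hypothetical.

```python
from statistics import mean, median


def relative_coverage_by_gc(windows, bracket_width=5):
    """Return {gc_bracket_start: mean coverage in bracket / overall median coverage}.

    windows: list of (gc_percent, coverage) pairs, one per genomic window.
    """
    overall_median = median(coverage for _, coverage in windows)
    brackets = {}
    for gc_percent, coverage in windows:
        start = (gc_percent // bracket_width) * bracket_width
        brackets.setdefault(start, []).append(coverage)
    return {start: mean(values) / overall_median
            for start, values in sorted(brackets.items())}


# Hypothetical windows as (GC percent, coverage).
windows = [(30, 27), (32, 30), (41, 30), (44, 33), (58, 24), (59, 30)]
relative = relative_coverage_by_gc(windows)
```

Values close to 1 across brackets indicate coverage that is largely independent of GC content; values that fall away from 1 at the extremes indicate GC bias.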

  8. Supplementary Fig. 3: Comparison between barcoded and standard TruSeq libraries.

    Coverage distributions of NA12878 from (a) a phased library prepared from 1 ng of genomic DNA and (b) a standard TruSeq library prepared from 100 ng of genomic DNA. (c) Comparison of coverage statistics between the NA12878 phased barcoded library and a standard Illumina TruSeq library.

  9. Supplementary Fig. 4: Barcode overlap of structural variants.

    We generated non-overlapping windows of 100 kb to visualize structural alterations with uniquely mapping, non-duplicated reads. (a) Schematics of barcode overlap in reference (WT), deletion, inversion and tandem duplication. Matrix view of representative barcode overlap patterns for (b) reference, (c) deletion, (d) inversion and (e) tandem duplication events. Barcode overlap of heterozygous (f) inversion and (g) inversion and tandem duplication events in NA12878.
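A matrix view of barcode overlap like the one described here can be sketched by binning reads into fixed windows and counting shared barcodes for every pair of windows. The sketch assumes the reads have already been filtered to uniquely mapping, non-duplicated reads; the function name and toy reads are hypothetical, and only the 100 kb window size comes from the caption.

```python
from collections import defaultdict


def barcode_overlap_matrix(reads, window_size, n_windows):
    """Return an n_windows x n_windows matrix of shared-barcode counts.

    reads: (position, barcode) pairs on one chromosome, assumed pre-filtered.
    Entry [i][j] is the number of barcodes seen in both window i and window j.
    """
    window_barcodes = defaultdict(set)
    for position, barcode in reads:
        index = position // window_size
        if index < n_windows:
            window_barcodes[index].add(barcode)
    return [[len(window_barcodes[i] & window_barcodes[j])
             for j in range(n_windows)]
            for i in range(n_windows)]


# Hypothetical reads: BC3 links windows 1 and 3, which are not adjacent.
reads = [
    (50_000, "BC1"), (150_000, "BC1"),
    (250_000, "BC2"), (350_000, "BC2"),
    (120_000, "BC3"), (380_000, "BC3"),
]
matrix = barcode_overlap_matrix(reads, window_size=100_000, n_windows=4)
```

In a reference-like region the signal concentrates near the diagonal, because molecules only link nearby windows; off-diagonal entries such as `matrix[1][3]` are the kind of pattern that the schematics associate with a structural event.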

  10. Supplementary Fig. 5: Barcode count analysis of eight deletion candidates in linked-read WGS data from NA12878.

    (a) Barcode counts in the regions of five high-scoring deletions. (b) Barcode counts in the intervals covering three low-scoring deletions.

  11. Supplementary Fig. 6: Validation of genomic deletions with targeted sequencing.

    We used a targeted sequencing approach called Oligonucleotide Selective-Sequencing (OS-Seq) to validate the breakpoints of the deletions. Four of the five high-ranked candidates had a minimum of 450 reads aligning beyond the opposite breakpoint and at least 90 reads covering the breakpoint. The remaining high-scoring deletion was found to have added sequence complexity that was observed in the targeted sequencing data. An example of a validated high-scoring deletion is shown. (a) Ribbon plot displaying the location of reads mapped to the breakpoints of a high-scoring deletion. Left, position of reads mapped to the left breakpoint, where red represents probes mapping to the 5’ end of the breakpoint (using coordinates at the bottom of the plot) and blue represents probes mapping to the 3’ end of the breakpoint (using coordinates at the top of the plot). Right, position of reads mapped to the right breakpoint. The y-axis indicates the index of the reads. The pink line represents the mappability of the reads, where 1 indicates unique mapping and 0 indicates mapping to multiple places in the genome. Because the deletion is heterozygous, reads colored in red on the left plot represent reads from the wild-type allele, and reads colored in blue on the left plot represent reads from the deleted haplotype. The asterisks and arrows denote the locations of primer probes, their direction of capture and their typical capture distance. (b) Validation of breakpoint structure by soft-clipped read counting. Read 1s are grouped based on primer probe (read 2) identity. Soft-clipped reads supporting the breakpoint structure are tallied based on each breakpoint’s start and end location, and are reported as reads mapping “across” the breakpoint in Supplementary Table 6. (c) IGV screenshots of read alignment from a high-scoring deletion, shown by left and right breakpoints and by Haplotype 1 and Haplotype 2. That the deletion involves Haplotype 2 is shown by the reads missing from the left and right breakpoints of that haplotype. (d) IGV screenshots of read alignment from a low-scoring deletion, shown by left and right breakpoints and by Haplotype 1 and Haplotype 2. Reads are missing from the right breakpoint of both Haplotype 1 and Haplotype 2, suggesting that reads cannot be properly mapped to the breakpoint and that the breakpoint is not accurate.

  12. Supplementary Fig. 7: ALK gene fusions in NA12878 exome and NCI-H2228 WGS data.

    Heatmap of barcode overlap of (a) EML4-ALK and (b) ALK-PTPN3 in NA12878 exome (a negative control). Barcode overlap of (c) EML4-ALK and (d) ALK-PTPN3 in NCI-H2228 WGS. (e) RT-PCR data of EML4-ALK and ALK-PTPN3 transcripts in NA12878 and NCI-H2228.

Accession codes

Primary accessions

Sequence Read Archive


Author information

  1. These authors contributed equally to this work.

    • Grace X Y Zheng &
    • Billy T Lau

Affiliations

  1. 10X Genomics, Pleasanton, California, USA.

    • Grace X Y Zheng,
    • Michael Schnall-Levin,
    • Mirna Jarosz,
    • Christopher M Hindson,
    • Sofia Kyriazopoulou-Panagiotopoulou,
    • Donald A Masquelier,
    • Landon Merrill,
    • Jessica M Terry,
    • Patrice A Mudivarti,
    • Paul W Wyatt,
    • Rajiv Bharadwaj,
    • Anthony J Makarewicz,
    • Yuan Li,
    • Phillip Belgrader,
    • Andrew D Price,
    • Adam J Lowe,
    • Patrick Marks,
    • Gerard M Vurens,
    • Paul Hardenbol,
    • Luz Montesclaros,
    • Melissa Luo,
    • Lawrence Greenfield,
    • Alexander Wong,
    • David E Birch,
    • Steven W Short,
    • Keith P Bjornson,
    • Pranav Patel,
    • Sukhvinder Kaur,
    • Glenn K Lockwood,
    • David Stafford,
    • Joshua P Delaney,
    • Indira Wu,
    • Heather S Ordonez,
    • Josephine Y Lee,
    • Kamila Belhocine,
    • Kristina M Giorda,
    • William H Heaton,
    • Geoffrey P McDermott,
    • Zachary W Bent,
    • Francesca Meschi,
    • Nikola O Kondov,
    • Ryan Wilson,
    • Jorge A Bernate,
    • Shawn Gauby,
    • Alex Kindwall,
    • Clara Bermejo,
    • Adrian N Fehr,
    • Adrian Chan,
    • Serge Saxonov,
    • Kevin D Ness &
    • Benjamin J Hindson
  2. Stanford Genome Technology Center, Stanford University, Palo Alto, California, USA.

    • Billy T Lau,
    • John M Bell,
    • Erik S Hopmans,
    • Susan M Grimes &
    • Hanlee P Ji
  3. Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, California, USA.

    • Christina Wood,
    • Stephanie Greer &
    • Hanlee P Ji

Contributions

B.T.L., M.S.-L., M.J., J.M.B., C.M.H., S.K.-P., L. Merrill, R.B., A.J.M., Y.L., A.D.P., A.J.L., P.H., L.G., K.B., P.P., E.S.H., C.W., K.M.G., S.S., K.D.N., B.J.H. and H.P.J. designed the experiments. B.T.L., J.M.B., C.M.H., L. Merrill, J.M.T., P.A.M., P.W.W., R.B., A.J.M., Y.L., P.B., A.D.P., A.J.L., P.M., G.M.V., L. Montesclaros, M.L., L.G., D.E.B., K.B., P.P., E.S.H., C.W., J.P.D., I.W., H.S.O., J.Y.L., Z.W.B., K.M.G., G.P.M., F.M., N.O.K., J.A.B., S.G., C.B., A.N.F., A.C. and B.J.H. conducted the experiments. D.A.M., R.B., A.J.M., S.W.S., S.K., J.A.B., A.K., K.D.N. and B.J.H. designed the instrument. M.S.-L., M.J., C.M.H., P.W.W., R.B., A.J.M., Y.L., A.D.P., A.J.L., P.H., L. Merrill, L.G., K.P.B., P.P., S.K., J.P.D., J.A.B., K.D.N. and B.J.H. designed reagents for phasing. B.T.L., J.M.B., E.S.H. and H.P.J. designed reagents for targeted sequencing analysis. G.X.Y.Z., M.S.-L., S.K.-P., P.M., G.K.L., D.L.S., W.H.H., R.T.W., S.S. and K.D.N. wrote the haplotype analysis algorithms. J.M.B. and S.M.G. wrote the analysis algorithms for short-read sequencing analysis. M.S.-L., P.J.M., A.W., G.K.L., D.L.S., W.H.H. and R.T.W. wrote the analysis software. G.X.Y.Z., B.T.L., M.S.-L., M.J., J.M.B., C.M.H., S.K.-P., J.M.T., R.B., A.J.M., Y.L., P.B., P.M., P.H., L. Merrill, M.L., A.W., K.B., P.P., S.K., J.P.D., I.W., H.S.O., S.M.G., S. Greer, J.Y.L., Z.W.B., K.M.G., W.H.H., G.P.M., F.M., J.A.B., S. Gauby, C.B., A.N.F., A.C., S.S., K.D.N., B.J.H. and H.P.J. analyzed the data. G.X.Y.Z., B.T.L., M.S.-L., M.J., S. Greer, B.J.H. and H.P.J. wrote the manuscript. H.P.J. oversaw the overall genetic experiments and analysis.

Competing financial interests

G.X.Y.Z., M.S.-L., M.J., C.M.H., S.K.-P., D.A.M., L. Merrill, J.M.T., P.A.M., P.W.W., R.B., A.J.M., Y.L., P.B., A.D.P., A.J.L., P.M., G.M.V., P.H., L. Montesclaros, M.L., L.G., A.W., D.E.B., S.W.S., K.P.B., P.P., S.K., G.K.L., D.S., J.P.D., I.W., H.S.O., J.Y.L., Z.W.B., K.M.G., W.H.H., G.P.M., F.M., N.O.K., R.W., J.A.B., S. Gauby, A.K., C.B., A.N.F., A.C., S.S., K.D.N. and B.J.H. are employees of 10X Genomics.

Supplementary information

PDF files

  1. Supplementary Text and Figures (1,608 KB)

    Supplementary Figures 1–7

  2. Supplementary Information (1,985 KB)

    Supplementary Tables 1–6, Supplementary Tables 8–13 and Supplementary Notes 1 and 2

Excel files

  1. Supplementary Table 7 (68,141 KB)
