Article

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions

Nature Biotechnology volume 31, pages 1119–1125 (2013)

Abstract

Genomes assembled de novo from short reads are highly fragmented relative to the finished chromosomes of Homo sapiens and key model organisms generated by the Human Genome Project. To address this problem, we need scalable, cost-effective methods to obtain assemblies with chromosome-scale contiguity. Here we show that genome-wide chromatin interaction data sets, such as those generated by Hi-C, are a rich source of long-range information for assigning, ordering and orienting genomic sequences to chromosomes, including across centromeres. To exploit this finding, we developed an algorithm that uses Hi-C data for ultra-long-range scaffolding of de novo genome assemblies. We demonstrate the approach by combining shotgun fragment and short jump mate-pair sequences with Hi-C data to generate chromosome-scale de novo assemblies of the human, mouse and Drosophila genomes, achieving—for the human genome—98% accuracy in assigning scaffolds to chromosome groups and 99% accuracy in ordering and orienting scaffolds within chromosome groups. Hi-C data can also be used to validate chromosomal translocations in cancer genomes.
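As a rough illustration of the first step named in the abstract (assigning sequences to chromosome groups from Hi-C data), the sketch below clusters contigs by how densely Hi-C read pairs link them. It is a minimal toy under stated assumptions, not the LACHESIS implementation: the function name `cluster_contigs`, the input layout, the length-based normalization and the average-linkage merge rule are all hypothetical choices made for this example, and the contig names and link counts are invented.

```python
from itertools import combinations


def cluster_contigs(lengths, links, n_groups):
    """Greedy average-linkage clustering of contigs by Hi-C link density.

    lengths  : dict mapping contig name -> contig length in bp
    links    : dict mapping (contig, contig) -> number of Hi-C read pairs
               joining them; each pair is assumed to be stored once,
               in either order
    n_groups : number of chromosome groups to stop at (assumed known)
    """
    # Start with every contig in its own cluster.
    clusters = [frozenset([name]) for name in sorted(lengths)]

    def density(a, b):
        # Mean link count between members of two clusters, normalized by
        # the product of contig lengths so long contigs are not favored.
        total = 0.0
        for x in a:
            for y in b:
                n = links.get((x, y), 0) + links.get((y, x), 0)
                total += n / (lengths[x] * lengths[y])
        return total / (len(a) * len(b))

    # Repeatedly merge the two most densely linked clusters.
    while len(clusters) > n_groups:
        a, b = max(combinations(clusters, 2), key=lambda p: density(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]

    return sorted(sorted(c) for c in clusters)


# Toy input (invented): c1/c2 and c3/c4 share many links, others few.
lengths = {"c1": 100, "c2": 100, "c3": 100, "c4": 100}
links = {("c1", "c2"): 50, ("c3", "c4"): 40, ("c1", "c3"): 2, ("c2", "c4"): 1}
groups = cluster_contigs(lengths, links, n_groups=2)
print(groups)
```

The intuition is that sequences on the same chromosome tend to share far more Hi-C links than sequences on different chromosomes, so a merge rule based on link density tends to recover chromosome groups. Ordering and orienting contigs within each group are separate steps that this sketch does not attempt.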

Acknowledgements

We thank F. Ay, E. Eichler, J. Felsenstein, P. Green, L. Hillier, M. van Min, W. Noble, R. Waterston and members of the Shendure lab for helpful discussions. Some of the sequencing data used in this research were derived from a HeLa cell line. Henrietta Lacks, and the HeLa cell line that was established from her tumor cells without her knowledge or consent in 1951, have made significant contributions to scientific progress and advances in human health. We are grateful to Henrietta Lacks, now deceased, and to her surviving family members for their contributions to biomedical research. Our work was supported by grant HG006283 from the National Human Genome Research Institute (NHGRI; to J.S.); a graduate research fellowship DGE-0718124 from the National Science Foundation (to A.A. and J.O.K.); and grant T32HG000035 from the NHGRI (to J.N.B.).

Author information

Affiliations

  1. Department of Genome Sciences, University of Washington, Seattle, Washington, USA.

    • Joshua N Burton
    • Andrew Adey
    • Rupali P Patwardhan
    • Ruolan Qiu
    • Jacob O Kitzman
    • Jay Shendure

Contributions

J.N.B., A.A., J.O.K. and J.S. conceived and designed the study. J.N.B. designed and wrote the LACHESIS software. J.N.B. and R.P.P. performed the de novo assemblies. R.Q. conducted the HeLa Hi-C experiments. A.A. analyzed the HeLa Hi-C data. J.N.B., A.A. and J.S. prepared the manuscript, with input from all authors. J.S. supervised the study.

Competing interests

The authors have filed a provisional patent application on this method. J.S. is a member of the scientific advisory board or serves as a consultant for Adaptive Biotechnologies, Ariosa Diagnostics, Stratos Genomics, GenePeeks, Gen9, Good Start Genetics, Ingenuity Systems and Rubicon Genomics.

Corresponding authors

Correspondence to Joshua N Burton or Jay Shendure.

Supplementary information

PDF files

  1.

    Supplementary Text and Figures

    Supplementary Figures 1–13 and Supplementary Tables 1–6

Zip files

  1.

    Supplementary Data 1

    LACHESIS.tar.gz

Text files

  1.

    Supplementary Data 2

    README.txt

About this article

DOI

https://doi.org/10.1038/nbt.2727