Limitations of next-generation genome sequence assembly

Alkan, Can; Sajjadian, Saba; Eichler, Evan E

doi:10.1038/nmeth.1527

Perspective
Published: 21 November 2010

Nature Methods volume 8, pages 61–65 (2011)

Abstract

High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.

**Figure 1: Summary of *de novo* genome assembly and new sequence analysis.**

Acknowledgements

We thank E. Karakoc and P. Sudmant for helpful discussions, T. Marques-Bonet and J.M. Kidd for providing the nonredundant gene table, and T. Brown for proofreading the manuscript. This work was partly supported by US National Institutes of Health grant HG002385 to E.E.E. E.E.E. receives funds as an Investigator of the Howard Hughes Medical Institute.

Author information

Authors and Affiliations

Department of Genome Sciences, University of Washington School of Medicine and Howard Hughes Medical Institute, Seattle, Washington, USA
Can Alkan, Saba Sajjadian & Evan E Eichler

Contributions

C.A. and E.E.E. conceived the study and wrote the manuscript. C.A. and S.S. analyzed the data.

Corresponding author

Correspondence to Evan E Eichler.

Ethics declarations

Competing interests

E.E.E. is a scientific advisory board member of Pacific Biosciences.

About this article

Cite this article

Alkan, C., Sajjadian, S. & Eichler, E. Limitations of next-generation genome sequence assembly. Nat Methods 8, 61–65 (2011). https://doi.org/10.1038/nmeth.1527

Issue Date: January 2011