Limitations of next-generation genome sequence assembly

Journal name: Nature Methods
Volume: 8
Pages: 61–65
DOI: 10.1038/nmeth.1527

Abstract

High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, relatively little attention has been paid to what is lost when short sequence reads are used alone. We compared the recent de novo assemblies generated with the short oligonucleotide analysis package (SOAP) from the genomes of a Han Chinese individual and a Yoruban individual to experimentally validated genomic features. We found that the de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the assemblies. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.
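
The two kinds of figures quoted in the abstract (a percentage length shortfall and a count of completely missing exons) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the half-open coordinate convention, and the toy intervals are all assumptions made for illustration, and it presumes that assembly contigs have already been aligned to a single reference chromosome.

```python
from typing import List, Tuple

# Hypothetical convention: half-open [start, end) coordinates on one chromosome.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(feature: Interval, merged: List[Interval]) -> int:
    """Count bases of `feature` that fall inside any merged aligned interval."""
    f_start, f_end = feature
    total = 0
    for start, end in merged:
        overlap = min(f_end, end) - max(f_start, start)
        if overlap > 0:
            total += overlap
    return total


def count_completely_missing(features: List[Interval],
                             aligned: List[Interval]) -> int:
    """Count features with zero aligned bases (no assembly sequence at all)."""
    merged = merge_intervals(aligned)
    return sum(1 for f in features if covered_bases(f, merged) == 0)


def percent_shorter(assembly_length: int, reference_length: int) -> float:
    """Length shortfall of the assembly, as a percentage of the reference."""
    return 100.0 * (reference_length - assembly_length) / reference_length


# Toy data (invented): regions of the reference covered by assembly contigs.
aligned_regions = [(0, 100), (90, 250), (400, 500)]
# Toy exons: one well covered, one partly covered, two with no coverage.
exons = [(50, 120), (240, 260), (260, 300), (500, 550)]

missing = count_completely_missing(exons, aligned_regions)
shortfall = percent_shorter(assembly_length=838, reference_length=1000)
```

A linear scan over the merged intervals keeps the sketch short. For genome-scale inputs, such as the 171,751 exon rows mentioned for Supplementary Table 4, a sorted sweep or an interval tree would be the more usual choice.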

Author information

Affiliations

  1. Department of Genome Sciences, University of Washington School of Medicine and Howard Hughes Medical Institute, Seattle, Washington, USA.

    • Can Alkan,
    • Saba Sajjadian &
    • Evan E Eichler

Contributions

C.A. and E.E.E. conceived the study and wrote the manuscript. C.A. and S.S. analyzed the data.

Competing financial interests

E.E.E. is a scientific advisory board member of Pacific Biosciences.

Supplementary information

PDF files

  1. Supplementary Text and Figures (404 KB)

    Supplementary Figures 1–2, Supplementary Table 2, Supplementary Note

Excel files

  1. Supplementary Table 1 (408 KB)

    Contamination found in reported human new sequence insertions from the genomes of two individuals.

  2. Supplementary Table 3 (5 MB)

    Analysis of nonredundant autosomal genes in the YH genome assembly.

  3. Supplementary Table 5 (2 MB)

    Assigned positions of duplicated sequences (YH) to the NCBI build 36 assembly.

Text files

  1. Supplementary Table 4 (12 MB)

    Analysis of nonredundant autosomal coding exons in the YH genome. NOTE: This is a tab-delimited text file with 171,751 rows of data. Confirm that all data will load into your application before proceeding.
