Repetitive DNA and next-generation sequencing: computational challenges and solutions

A Corrigendum to this article was published on 17 January 2012

This article has been updated

Key Points

  • New high-throughput sequencing technologies have spurred explosive growth in the use of sequencing to discover mutations and structural variants in the human genome and in the number of projects to sequence and assemble new genomes.

  • Highly efficient algorithms have been developed to align next-generation sequences to genomes, and these algorithms use a variety of strategies to place repetitive reads.

  • Ambiguous mapping of sequences that are derived from repetitive regions makes it difficult to identify true polymorphisms and to reconstruct transcripts.

  • Short read lengths combined with mapping ambiguities lead to false reports of single-nucleotide polymorphisms, insertions, deletions and other sequence variants.

  • When assembling a genome de novo, repetitive sequences can lead to erroneous rearrangements, deletions, collapsed repeats and other assembly errors.

  • Long-range linking information from paired-end reads can overcome some of the difficulties in short-read assembly.

Abstract

Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed. We discuss the computational problems surrounding repeats and describe strategies used by current bioinformatics systems to solve them.

Figure 1: Ambiguities in read mapping.
Figure 2: Three strategies for mapping multi-reads.
Figure 3: Assembly errors caused by repeats.
Figure 4: Longer paired-end libraries improved assembly contiguity in the repetitive potato genome.

Change history

  • 17 January 2012

    In the above article, Table 1 provided a URL for software called 'SNiPer'. This should have been a URL for software called 'Sniper'. The correct URL (http://kim.bio.upenn.edu/software/sniper.shtml) has been inserted, and in Table 1 and in the two occurrences in the main text, the word 'SNiPer' has been changed to 'Sniper'. Also, references to Figure 3b and Figure 3c have been reversed. The authors and editors apologize for these errors.

Acknowledgements

We thank K. Hansen for useful comments on an earlier draft.

Author information

Corresponding author

Correspondence to Steven L. Salzberg.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

Todd J. Treangen's homepage

Steven L. Salzberg's homepage

RepeatMasker software for screening repeats

Glossary

Next-generation sequencing

(NGS). Any of several technologies that sequence very large numbers of DNA fragments in parallel, producing millions or billions of short reads in a single run of an automated sequencer. By contrast, traditional Sanger sequencing only produces a few hundred reads per run.

Interspersed repeats

Identical or nearly identical DNA sequences that are separated by hundreds, thousands or even millions of nucleotides in the source genome. Repeats can be spread out through the genome by mechanisms such as transposition.

Tandem repeats

DNA repeats (≥2 bp in length) that are adjacent to each other and can involve as few as two copies or many thousands of copies. Centromeres and telomeres are composed largely of tandem repeats.

Short interspersed nuclear elements

(SINEs). Repetitive DNA elements that are typically 100–300 bp in length and spread throughout the genome (such as Alu repeats).

Long interspersed nuclear elements

(LINEs). Repetitive DNA elements that are typically >300 bp in length and spread throughout the genome (such as L1 repeats).

Multi-read

A DNA sequence fragment (a 'read') that aligns to multiple positions in the reference genome and, consequently, creates ambiguity as to which location was the true source of the read.
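
As a rough illustration of this definition, the following Python sketch labels a read as unmapped, unique or a multi-read by counting its exact occurrences in a reference string. It is a toy under stated assumptions, not how any production aligner works: the function names `find_exact_hits` and `classify_read` are hypothetical, only the forward strand is searched, and no mismatches or gaps are allowed.

```python
def find_exact_hits(reference, read):
    """Return every 0-based start position where `read` occurs exactly in `reference`.

    Assumption: exact, forward-strand matching only (no mismatches, no reverse complement).
    """
    if not read:
        raise ValueError("read must be a non-empty string")
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Continue one base further so that overlapping occurrences are also found.
        start = reference.find(read, start + 1)
    return hits


def classify_read(reference, read):
    """Label a read by how many positions it matches in the reference."""
    hits = find_exact_hits(reference, read)
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "multi-read"


# Toy reference containing the 5-mer ACGTC twice.
reference = "AAACGTCCCTTTACGTCGGG"
for read in ("ACGTC", "CCCTTT", "GATTACA"):
    print(read, find_exact_hits(reference, read), classify_read(reference, read))
```

In this toy reference the read ACGTC occurs at two positions, so its true source cannot be determined from the sequence alone.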

Paired-end reads

Reads that are sequenced from both ends of the same DNA fragment. These can be produced by a variety of sequencing protocols, and paired-end preparation is specific to a given sequencing technology. Some recent sequencing vendors use the terms 'paired end' and 'mate pair' to refer to different protocols, but these terms are generally synonymous.

De Bruijn graph

A directed graph data structure representing overlaps between sequences. In the context of genome assembly, DNA sequence reads are broken up into fixed-length subsequences of length k, which are represented as nodes in the graph. Directed edges are created between nodes i and j if the last k–1 nucleotides of i match the first k–1 nucleotides of j. Reads become paths in the graph, and contigs are assembled by following longer paths.
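
The construction described in this entry can be sketched in a few lines of Python. This is a minimal illustration that follows the glossary wording literally (k-mers as nodes, an edge from i to j when the last k–1 nucleotides of i equal the first k–1 nucleotides of j); the function name `de_bruijn_graph` is hypothetical, and real assemblers add error correction, reverse-complement handling and graph simplification that are omitted here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph as an adjacency dict {k-mer: [successor k-mers]}."""
    if k < 2:
        raise ValueError("k must be at least 2")

    # Nodes: every distinct length-k subsequence of every read.
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])

    # Index nodes by their first k-1 nucleotides so successors can be looked up directly.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:k - 1]].add(node)

    # Edge i -> j whenever the last k-1 nucleotides of i match the first k-1 of j.
    return {node: sorted(by_prefix.get(node[1:], ())) for node in sorted(nodes)}


# The 3-mer ACG occurs twice in this read, followed once by T and once by A,
# so the node ACG has two outgoing edges.
graph = de_bruijn_graph(["ACGTACGA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

The branch at ACG is the kind of ambiguity a repeat introduces: an assembler following paths through the graph cannot tell from this node alone which successor to take.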

Contigs

Contiguous stretches of DNA that are constructed by an assembler from the raw reads produced by a sequencing machine.

DNA fragment

In the sequencing process, millions of small fragments are randomly generated from a DNA sample. In paired-end sequencing, both ends of each fragment are sequenced, and the fragment length becomes the 'library' size.

N50

A widely used statistic for assessing the contiguity of a genome assembly. The N50 value is computed by sorting all contigs in an assembly from largest to smallest, then cumulatively adding contig sizes starting with the largest and reporting the size of the contig that makes the total greater than or equal to 50% of the genome size. The N50 value is also used for scaffolds.
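
The calculation can be written out as a short Python sketch. The function name `n50` and its `genome_size` parameter are illustrative; falling back to the total assembly length when the genome size is unknown is an assumption of this sketch (though a common convention), not part of the definition above.

```python
def n50(contig_lengths, genome_size=None):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Returns None if the contigs together cover less than half of `genome_size`.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    # Assumption: with no genome size given, use the total assembly length instead.
    reference_size = genome_size if genome_size is not None else sum(contig_lengths)
    target = reference_size / 2

    running_total = 0
    # Add contigs from largest to smallest until half of the genome is covered.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None


lengths = [80, 70, 50, 40, 30, 20]    # total assembly length: 290
print(n50(lengths))                   # 70, because 80 + 70 = 150 >= 145
print(n50(lengths, genome_size=400))  # 50, because 80 + 70 + 50 = 200 >= 200
```

Because the result depends on the reference size used, N50 values are comparable between assemblies only when they are computed against the same genome size.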

Scaffold

A collection of contigs that are linked together by paired-end information, with gaps separating the contigs.

About this article

Cite this article

Treangen, T., Salzberg, S. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13, 36–46 (2012). https://doi.org/10.1038/nrg3117

  • DOI: https://doi.org/10.1038/nrg3117
