Introduction

‘Next-generation sequencing’ (NGS) platforms have been introduced and have recently become widely available,1, 2 although large-scale sequencing laboratories made significant contributions to the Human Genome Project.3, 4 Despite dramatic improvements, the limitations of the conventional Sanger (or di-deoxy terminator5) strategy created an urgent need for new technologies capable of sequencing human genomes in parallel. Thanks to the recent availability of optical instruments and advances in molecular biology,1 a series of new massively parallel sequencing technologies, the NGS technologies, have tremendously changed this scenario.

Three platforms have been available: the Roche/454 FLX (http://454.com/products-solutions/454-sequencing-system-portfolio.asp), the Illumina/Solexa Genome Analyzer (http://www.illumina.com/pages.ilmn?ID=203) and the Applied Biosystems SOLiD™ System (http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing.html). These methods are all based on a template amplification phase before sequencing. Two newer systems, the Helicos Heliscope™ (www.helicosbio.com) and Pacific Biosciences SMRT (www.pacificbiosciences.com) instruments,6 which avoid the amplification step and use single molecules as templates, have also been introduced recently.

These new technologies are advantageous because of their high throughput, over one billion reads per run, and their significantly lower cost per base,2 which has given great impetus to achieving the goal of the 1000 Genomes Project.7 These characteristics have allowed ultra-deep sequencing to be widely used in biological and medical research. In addition to the traditional applications of DNA sequencing in genome resequencing and SNP discovery, NGS technologies have had a huge and ongoing impact on transcriptome analysis, gene annotation and RNA splice-site identification; metagenomics8 and genome methylation analysis9 have also benefited from these new technologies. New applications are also likely to be unveiled in the coming years.1 The most fundamental steps in almost all of these applications are the mapping of reads to a reference genome and the assembly of reads into the desired DNA sequence for analysis.10

However, certain obstacles stemming from the inherent characteristics of NGS need to be overcome before these technologies can be used even more extensively. Short read lengths (typically 35–400 bp, compared with 650–800 bp for Sanger-based reads), low read accuracy in homopolymer stretches of identical bases and non-uniform confidence in base calling require more efficient software and algorithms to help these new technologies develop further in the immediate future. A large number of tools for mapping and assembling NGS reads have appeared to date. Given the rapid developments in this field, we discuss only the software with which we have first-hand experience and compare their efficiency in terms of sensitivity, accuracy, speed and random-access memory (RAM) requirements.

Mapping

Mapping tools overview

The most important step in NGS analysis is the mapping of reads to the original sequences.1 Alignment, a classical problem in bioinformatics, requires finding the most credible source for the sequenced DNA,11 using knowledge of the species from which the reads were generated. Aside from the shorter reads produced by NGS (compared with those from gel-capillary technology), two fundamental issues must be considered. One is the significantly greater amount of data, which requires optimized memory usage and speed; the other is an error profile that differs from those of previous technologies. These call for algorithms that can extract as much information as possible from the sequencing data.10 Traditional methods such as pure Smith–Waterman dynamic programming, BLAT or BLAST may map the reads in a few days given a large and expensive computer grid; however, such grids are not available to everyone. Some earlier programs that perform well on Sanger sequencing reads have not yet been adapted to the huge volumes of data produced by NGS. Moreover, certain error characteristics of second-generation sequencing, for example the tendency of Roche 454 to produce insertion or deletion errors during homopolymer runs,12 need to be considered when designing analysis tools.
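To make concrete why exhaustive alignment does not scale to NGS volumes, the following minimal Smith–Waterman sketch (with illustrative, assumed scoring values) fills an m x n dynamic-programming matrix for every read–reference pair; at roughly 10^11 cell updates for a single 36 bp read against a 3 Gb genome, the quadratic cost explains the need for the heuristic, index-based aligners discussed below.

```python
# Minimal Smith-Waterman local alignment, O(len(read) * len(reference)) time and space.
# Scoring values (match=2, mismatch=-1, gap=-2) are illustrative assumptions.

def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return best, best_pos

# A single 36 bp read against a 3 Gb genome already costs ~10^11 cell updates,
# which is why seed-and-extend or index-based heuristics are required at NGS scale.
print(smith_waterman("ACGTACGT", "TTACGTACGTTT"))
```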

Many methods have been introduced, and tools or programs based on these algorithms have been reported on an almost weekly basis to meet these challenges.13 Doruk Bozdag and Umit Catalyurek from the Ohio State University proposed six parallelization methods to improve hash/index-based short-sequence mapping: partitioning reads only, partitioning the genome only, partitioning both reads and genome, suffix-based assignment (SBA), SBA after partitioning reads and SBA after partitioning the genome (see Bozdag et al.14 for details of the algorithms). CloudBurst, presented by Schatz et al.,15 is a sensitive parallel seed-and-extend read-mapping algorithm optimized for mapping single-end (SE) reads. BreakDancer, consisting of two complementary algorithms (BreakDancerMax and BreakDancerMini), supports pooled analysis across multiple samples and libraries.16 Clement et al.17 introduced a program called GNUMAP (Genomic Next-generation Universal MAPper), which uses quality scores to obtain more accurate results from fewer (often costly) sequencing runs. Other tools such as PASS,18 SOAP2,19 Bowtie,20 CloudBurst,15 MAQ,21 ZOOM,22 SHRiMP,23 PerM24 and others have also been designed recently for NGS data.
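As a rough illustration of the simplest of these schemes, partitioning reads only, the hypothetical sketch below splits a read set into chunks and maps each chunk independently against the full reference; the map_read function and the chunk size are placeholders, not part of any of the cited tools.

```python
# Sketch of the 'partitioning reads only' parallelization idea: the read set is split
# into chunks and each chunk is mapped independently against the full reference.
# map_read() is a placeholder for any single-read aligner; chunk size is an assumption.
from multiprocessing import Pool

def map_read(read, reference="chrN"):               # placeholder aligner
    return (read, reference, hash(read) % 1000)     # fake position, for illustration only

def map_chunk(reads):
    return [map_read(r) for r in reads]

def parallel_map(reads, n_workers=4, chunk_size=1000):
    chunks = [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]
    with Pool(n_workers) as pool:
        results = pool.map(map_chunk, chunks)
    return [hit for chunk in results for hit in chunk]

if __name__ == "__main__":
    simulated_reads = ["ACGT" * 9 for _ in range(10000)]
    print(len(parallel_map(simulated_reads)))
```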

Some researchers have categorized the tools based on whether the genome or the reads are indexed.1, 25 Programs such as CloudBurst,15 Eland, MAQ,21 RMAP,26 SeqMap,27 SHRiMP23 and ZOOM22 work by constructing hash tables of the short reads and then scanning the reference genome against them. The memory occupancy of these programs depends on the number of reads processed, but scanning the whole genome is time consuming when only a few reads need to be mapped.25 Other programs, such as BFAST,28 Bowtie,20 BWA,25 MOM,29 MosaikAligner (http://bioinformatics.bc.edu/marthlab/Mosaik), NovoAlign (http://www.novocraft.com), SOAP,19 PASS,18 PerM,24 ProbeMatch30 and SSAHA2,31 index the genomic sequence. This kind of software can easily be parallelized with multithreading, at the cost of a larger memory footprint when the reference genome is large, as with the human genome. However, this limitation can be mitigated when more efficient indexing strategies are used, as in Bowtie, SOAP2 and BWA. In fact, indexing the genome and mapping the reads to the index usually occupies a similar amount of RAM as the inverse operation (indexing the reads and scanning the genome against the index).1 A third category, which includes Slider I and Slider II,32 achieves short-read alignment by merge-sorting the subsequences of the genome and the tags from NGS platforms (mainly Illumina/Solexa).
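The genome-indexing strategy can be illustrated with a minimal, deliberately simplified k-mer hash table; the choice of k and the toy reference sequence are assumptions for illustration only.

```python
# Minimal k-mer hash index over a reference sequence, the core of
# 'index the genome, then look up read seeds' aligners. k=12 is an illustrative choice.
from collections import defaultdict

def build_index(reference, k=12):
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_hits(read, index, k=12):
    # Candidate genome positions implied by each exact k-mer seed in the read;
    # a real aligner would extend and score these candidates (seed-and-extend).
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            hits.add(pos - offset)
    return sorted(hits)

reference = "ACGTACGTGGTACCGGTTAACCGGATCGATCGTTGACCA" * 3
index = build_index(reference)
print(seed_hits("GGTACCGGTTAACC", index))   # candidate start positions of the read
```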

With respect to indexing strategies, these mapping tools for NGS can also be divided into two main categories: hash table-based algorithms and trie/Burrows–Wheeler transform (BWT)-based algorithms. The former approach, which basically follows the seed-and-extend paradigm, constituted the first wave of alignment programs. Many improvements have been developed since the very first hash-based algorithm, BLAST, to adapt to the specific characteristics of NGS read mapping. First, the concept of spaced seeds was introduced by Lin et al.22 for the seeding step, and several programs23, 33 have implemented q-gram filters and multiple seed hits while seeding. Another development concerns seed extension, in which CPU SIMD instructions are used to parallelize alignment and dynamic programming is used to accelerate it. Most of the software available now (all the programs mentioned above, excluding Bowtie, BWA and SOAP2) is based on this strategy. The trie-based algorithms efficiently reduce the inexact matching problem to the exact matching problem.34 However, the memory needed to hold the full occurrence array and prefix/suffix array is huge. The introduction of the BWT algorithm35 has significantly reduced the memory required and led to the development of several tools such as SOAP2 and Bowtie. Readers interested in the trie-based algorithms and the BWT concept can refer to Li and Durbin.25
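To give a flavor of how BWT-based aligners such as Bowtie, BWA and SOAP2 achieve compact indexes and fast exact matching, the following toy sketch builds a BWT naively and performs FM-index backward search; real implementations add suffix-array sampling, checkpointing and inexact-match heuristics that are omitted here.

```python
# Minimal Burrows-Wheeler Transform plus FM-index backward search for exact matching.
# This toy version builds the BWT naively (O(n^2 log n)) and is for illustration only.

def bwt(text):
    text += "$"                                   # unique terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_index(bwt_str):
    # C[c]: number of characters in the text lexicographically smaller than c;
    # occ[c][i]: occurrences of c in bwt_str[:i].
    chars = sorted(set(bwt_str))
    C, total = {}, 0
    for c in chars:
        C[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] * (len(bwt_str) + 1) for c in chars}
    for i, ch in enumerate(bwt_str):
        for c in chars:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ

def backward_search(pattern, bwt_str, C, occ):
    lo, hi = 0, len(bwt_str)                      # suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo                                # number of exact occurrences

text = "ACGTACGTACGA"
b = bwt(text)
C, occ = fm_index(b)
print(backward_search("ACGT", b, C, occ))         # prints 2: two exact occurrences
```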

The software mentioned above can also be classified into two groups based on whether the ‘quality scores’ of nucleotides are used during mapping. Quality scores that come with reads from NGS platforms (mainly from Illumina) are, arguably, crucial for reducing the chance of trivial matches during mapping. Most of the tools18, 19, 20, 21, 22, 23, 24, 25, 26, 28 available now use base quality information when performing mapping tasks, although some of them may not fully exploit it to improve mapping accuracy. However, some programs, such as CloudBurst, SeqMap, MOM, ProbeMatch and Slider, use nucleotide information only for short-read alignment. Slider, on the other hand, fully utilizes the short reads’ probability information (given in the prb file from the Illumina Sequence Analyzer) to reduce the alignment problem space.32 More details on the tools mentioned above are given in Table 1.
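The role of quality scores can be illustrated with a small sketch that converts Phred+33-encoded qualities (as in Illumina FASTQ files) into error probabilities and down-weights mismatches at low-confidence bases; the read, qualities and scoring scheme are made up for illustration and do not correspond to any particular aligner.

```python
# Converting Phred+33 ASCII quality strings into error probabilities and using them
# to weight mismatches: the intuition behind quality-aware mapping.
# The example read and qualities below are invented for illustration.

def phred_to_prob(quality_string, offset=33):
    # Phred score Q = -10 * log10(P_error)  =>  P_error = 10 ** (-Q / 10)
    return [10 ** (-(ord(ch) - offset) / 10.0) for ch in quality_string]

def weighted_mismatch_score(read, candidate, quality_string):
    # Penalize mismatches at high-confidence bases more than at low-confidence ones.
    errors = phred_to_prob(quality_string)
    score = 0.0
    for base, ref_base, p_err in zip(read, candidate, errors):
        if base != ref_base:
            score += 1.0 - p_err          # a confident mismatch costs close to 1
    return score

read = "ACGTACGT"
qual = "IIIIIII#"                          # last base has very low quality (Phred 2)
print(weighted_mismatch_score(read, "ACGTACGA", qual))   # low-quality mismatch is cheap
```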

Table 1 Tools for the analysis of next-generation sequencing data

Evaluation of mapping tools

To illustrate the performance of these mapping tools, we consider the following statistics: mapping speed, RAM occupancy, sensitivity (measured as the percentage of reads mapped) and accuracy (the percentage of reads mapped correctly). We evaluated the performance of several tools, namely SOAP_2.2, Bowtie_0.12.5, SeqMap_1.0.13, MOM_0.6, SHRiMP_2.0.1, PASS_v1.2, BWA_0.5.9, RMAP_v2.05, Mosaik_1.1.0021 and SSAHA2_v2.5.3, using both simulated data and real data from the Illumina platform. These tools, in the versions available at the time of our research, are widely used for Illumina read mapping. We first performed a simulation study with the chosen tools and summarized their efficiency in terms of speed, memory usage, sensitivity and accuracy. We then evaluated their mapping capability on real applications, using Illumina reads from the 1000 Genomes Project database (http://www.1000genomes.org/data). Based on the heuristics of each tool, we fixed parameters so that all programs reported their best matches, allowing up to two mismatches.

Evaluation on simulation data

We used dwgsim, a utility for whole-genome Illumina read simulation contained in DNAA_0.1.2 (http://sourceforge.net/projects/dnaa/), to generate Illumina-like short sequences, using the default empirical error model described on DNAA's Whole-Genome Simulation page (http://sourceforge.net/apps/mediawiki/dnaa/index.php?title=Whole_Genome_Simulation). In total, we generated 15 million reads of 76 bp, using the complete human genome (hg18) as a reference. Details of the commands used to run the tools on the simulated data can be found in Supplementary Information S1. Table 2 provides the results of the simulation, with statistics on the number of reads mapped, the number of reads correctly mapped, the time consumed and the RAM required.
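For readers who wish to reproduce this kind of evaluation, the sketch below shows one way the sensitivity and accuracy figures can be tallied from an aligner's SAM output, assuming (as dwgsim-style simulators typically do) that each read name encodes its true chromosome and position; the naming scheme, the tolerance window and the output file name are assumptions.

```python
# Sketch of how sensitivity and accuracy can be tallied from a simulation run.
# Assumes read names of the form "chrom_pos_..." (a dwgsim-style convention) and
# standard SAM records; multiple alignments per read are not de-duplicated here.

def evaluate_sam(sam_path, tolerance=5):
    total = mapped = correct = 0
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):
                continue                      # skip SAM header lines
            fields = line.rstrip("\n").split("\t")
            qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
            total += 1
            if flag & 4:                      # 0x4: read unmapped
                continue
            mapped += 1
            true_chrom, true_pos = qname.split("_")[0], int(qname.split("_")[1])
            if rname == true_chrom and abs(pos - true_pos) <= tolerance:
                correct += 1
    return {"sensitivity": mapped / total, "accuracy": correct / total}

# print(evaluate_sam("bwa_output.sam"))       # hypothetical output file name
```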

Table 2 Results of mapping simulated Illumina reads against human genome sequences (hg18)

From Table 2, we found that for Illumina SE read mapping, SHRiMP provided the highest true mapping percentage (around 99%) among all programs, at the expense of consuming much more time and RAM than the others. BWA, the second most accurate (around 4% less than SHRiMP), ran tremendously faster than SHRiMP and occupied the least memory of all the tools. Other tools, including Bowtie, Mosaik, RMAP, SeqMap and SOAP, all correctly recovered more than 75% of the genuine matches, with SOAP the fastest and Bowtie the most RAM-efficient. The apparently poor sensitivity and accuracy of PASS has, to some extent, been explained in Horner's simulation studies.1 For paired-end (PE) mapping tasks, BWA produced remarkably more valid alignments than the other tools, correctly mapping more than 98% of all reads to the human reference with the least RAM usage and acceptable running time. SSAHA2 behaved similarly to BWA in terms of mapping sensitivity and accuracy; however, it required around four times as much RAM and time as BWA for the same task. Among all the tools stated to support PE reads, Bowtie and RMAP showed markedly lower coverage rates for Illumina mate-pair mapping.

Evaluation on real data

To further compare the behavior of these tools on real applications, we used around 12 million Illumina SE reads of 76 bp (accession: ERR008834) and 17 million pairs of 76 bp reads (accession: SRR043391) from the Sequence Read Archive and aligned them against the whole human genome (assembly NCBI36.1/hg18). Table 3 presents the results of this evaluation. Compared with the results in Table 2, Table 3 indicates that the conclusions from the real applications are generally consistent with the results of the simulation, except that Mosaik performed slightly better than BWA, and SHRiMP did not perform as well in PE mapping as it did in the simulation. Thus, the parameters set in our simulation experiment, such as sequencing error rate, fraction of indels and outer distance between the two ends, appear to have had little effect on capturing the general differences in mapping performance between the selected tools.

Table 3 Results of mapping real Illumina reads against human genome sequences (hg18)

As additional remarks on the experiments described above, several points should be noted: (1) MOM was also tested with our simulated data and with real reads from the 1000 Genomes Project; however, the program does not appear robust to different input file formats, and no clear error messages are given to guide users in resolving the problem. (2) Although a ‘PE’ section has been posted on the PASS website, PASS still appears to be under development for this application. (3) All experiments were run on a 64-bit quad-core Linux system with 32 GB of RAM.

Discussion of mapping tools

Generally speaking, BWA, Mosaik, SHRiMP and SOAP all provide satisfactory mapping results for both SE and PE Illumina reads. BWA uses much less RAM than the others, largely owing to its BWT-based algorithm, whereas SOAP is the fastest of all the tools, which likely stems from its core algorithm (2way-BWT). The differences in mapping sensitivity between these methods can mostly be attributed to the heuristics applied by the different algorithms for detecting imperfectly matching positions.1 The apparently excellent performance of the BWT-based aligners in time consumption and memory occupancy can mainly be attributed to their multithreaded processing and their independence from the number of reads to be aligned.25 Although certain programs, such as SHRiMP, perform elegantly in terms of mapping sensitivity and accuracy, their enormous time and RAM requirements need to be weighed before using them as aligners for large mammalian genomes. They remain an option, however, for mapping to small genomes such as Drosophila.

To date, only a few open-source tools, such as Mosaik, PASS and SSAHA2, are available for 454 mapping, and their sensitivity in recovering mapping positions is not yet satisfactory, which creates an urgent need for novel software supporting the longer 454-like NGS reads (typically 400–1000 bp). Although several programs, such as Mosaik, PASS, Bowtie, SHRiMP and some others, are described as supporting color-space mapping, their ability to match SOLiD-specific reads is rather low, which may be mainly due to the specific design of the ABI output. Algorithms using advanced spaced seeds, as in Noe et al.,36 would be a worthwhile modification for SOLiD mappers. As this review mainly focuses on comparing the capabilities of Illumina aligners, no detailed evaluation results for 454- and SOLiD-supporting tools are provided here. We have, however, performed simple tests on the tools described as 454-capable, namely Mosaik, SSAHA2 and PASS, and on the tools described as color-space-capable, including Mosaik, PASS, Bowtie and SHRiMP, using real 454 and SOLiD reads from the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra). Readers interested in applying these programs to 454 and SOLiD read mapping can refer to Supplementary Information S2 and S3, in which details of the data involved and the results of the experiments are presented, respectively.

Overall, the decision to choose one method over another should depend mostly on the number of reads to be mapped, the reference genome under consideration and the computing equipment available. The final goal of a given experiment may also determine, or help determine, the choice.

Assembly

Assembly strategies

The individual sequencing reads from either Sanger-based technology or the new NGS platforms are significantly shorter than the desired length of the DNA sequence.10 A technique termed ‘assembly’, first designed for cosmids37 and later used in genomic analysis, was introduced in the late 1980s and early 1990s to resolve this problem. The fundamental concept is to group the random fragments of a much longer DNA sequence into contigs, and then contigs into scaffolds, to reconstruct the original DNA sequence. Depending on the focus of the analysis, assembly can be divided into two approaches: the de novo approach and the comparative (resequencing) approach.38

De novo approaches mainly focus on reconstructing genomes that have never been sequenced, whereas for comparative approaches it is sufficient to map the reads to a guide sequence to characterize a newly sequenced organism. De novo methods are irreplaceable, especially for discovering new, previously unknown sequences, which is essential for characterizing the biological diversity of our world, but they are mathematically more complex and require more memory than comparative methods. Two main factors influence the complexity of de novo assembly: the length and the volume of the reads. Shorter reads complicate the layout phase of an assembly (because it is more difficult for de novo assemblers to resolve repeats with short reads), but they are easier to align. More reads impose quadratic or even exponential complexity on the underlying algorithms, but they promise better identification of sequence overlaps. Managing the large volumes of even shorter reads from NGS (typically 35–400 bp, significantly shorter than the traditional 600–800 bp) and fully exploiting the deeper coverage produced by NGS technologies have become the most crucial issues when designing assemblers for NGS.

These challenges have led to considerable efforts to modify the three widely used de novo assembly strategies:10, 39 greedy, overlap-layout-consensus and Eulerian or de Bruijn graph.40 The success of the recently introduced NGS assemblers is mainly due to the development of pragmatic engineering and heuristics on top of these assembly algorithms.39 Some of the tools, such as SSAKE,41 SHARCGS,42 VCAKE43 and QSRA,44 use the greedy graph strategy. Programs applying this algorithm undertake one basic operation, iterative extension: given any read or contig, it is merged with the read or contig with which it shares the largest overlap. Three of these programs (SSAKE, VCAKE and QSRA) have been developed to handle imperfectly matching reads,41, 43, 44 whereas SHARCGS is widely used on uniform-length, high-coverage, unpaired short reads. QSRA, the most recently developed tool in this category, has the advantage of using quality-value scores to help users deal with base-call errors. It provides better performance in terms of speed and output quality44 than the other tools mentioned above. The second category of software, which includes CABOG,45 Edena,46 Newbler47 and Shorty,48 is based on overlap-layout-consensus. This strategy involves three main steps. First, the assembler compares the reads to each other to construct an overlap graph (the overlap discovery stage). Second, the overlap graph is analyzed and the appropriate paths traversing the graph are identified (the layout stage). Third, the consensus sequence is determined through multiple sequence alignment. Among the overlap-layout-consensus-based software, Newbler was specifically designed to handle the ambiguity in the length of 454's homopolymer runs, whereas the other widely used programs in this category, including Shorty, can also be applied to Illumina/Solexa, ABI/SOLiD and Helicos data. CABOG, Newbler and Shorty can manage base-calling errors and repeats with their own specific schemes, whereas Edena was designed for unpaired reads of uniform length. Newbler in particular applies instrument metrics to overcome inaccurate calls caused by homopolymer repeats in 454 data.39 CABOG uses a so-called ‘rocks and stones’ technique,49, 50 whose main procedure can be summarized as ‘unitig-contig-scaffolds’, for base-call correction.45 Shorty innovatively estimates the intercontig distances from the mate pairs using a few seeds of 300–500 bp in length.
The third category of software, based on de Bruijn graph approaches,40 is widely used for assembling data from the Solexa and SOLiD platforms. The tools in this category (such as ABySS,51 ALLPATHS,52 EULER-SR,53 SOAPdenovo54 and Velvet55) apply various heuristic strategies to reduce the complexity of the de Bruijn graph, which frames the assembly problem as finding a path that traverses each edge of the graph exactly once (an Eulerian path). EULER-SR53 mitigates the impact of sequencing errors by constructing de Bruijn graphs with different k-mer sizes and reduces graph complexity by exploiting low-quality read ends and PE constraints. Velvet55 uses an error-avoiding read filter to correct erroneous calls and adopts read threading and a mate-pair-based ‘pebble’ smoothing technique for graph reduction. ABySS is scalable assembly software designed to overcome memory limitations in large-genome assembly by distributing the graph and its computation across a compute grid. ALLPATHS targets large genomes and invokes two preprocessors, a read-correction processor and a ‘unipaths’ creation processor, for erroneous base-call correction and graph simplification. Finally, SOAPdenovo is, so far, the only software amalgamating the de Bruijn graph and overlap-layout-consensus strategies: a contig graph is constructed by the de Bruijn graph method, and its complexity is reduced by cutting transitive edges and isolating contigs involved in multiple paths. Its transitive-link deduction scheme is similar to CABOG's ‘rocks and stones’ method and to Velvet's breadcrumbs and pebble techniques.39 Table 4 gives more details on the assembly programs. Several papers10, 38, 39 have also provided significant insights into the technical strategies and tools for the de novo assembly of short reads.
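To make the de Bruijn graph idea concrete, the following toy sketch builds the graph from overlapping reads and reports contigs as maximal non-branching paths; error correction, mate-pair constraints and the graph-simplification heuristics described above are deliberately omitted, and k and the reads are invented for illustration.

```python
# Minimal de Bruijn graph construction and non-branching path (contig) extraction,
# the core idea behind de Bruijn graph assemblers. Sequencing errors, reverse
# complements and cycles are all ignored in this sketch; k=5 is illustrative.
from collections import defaultdict

def build_graph(reads, k=5):
    graph = defaultdict(set)                  # (k-1)-mer -> set of successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def contigs_from_graph(graph):
    # Walk maximal non-branching paths; each path spells one contig.
    indegree = defaultdict(int)
    for node, succs in graph.items():
        for s in succs:
            indegree[s] += 1
    contigs = []
    for start in list(graph):
        if indegree[start] == 1 and len(graph[start]) == 1:
            continue                          # interior node of a path, not a start
        for nxt in graph[start]:
            contig, node = start + nxt[-1], nxt
            while indegree[node] == 1 and len(graph[node]) == 1:
                node = next(iter(graph[node]))
                contig += node[-1]
            contigs.append(contig)
    return contigs

reads = ["ACGTACGGT", "GTACGGTTA", "CGGTTACCA"]
print(contigs_from_graph(build_graph(reads)))   # one contig spanning the three reads
```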

Table 4 Tools for de novo assembly analysis

Evaluation of assembly tools

The efficiency of assemblers is basically assessed through two measures: the size and the accuracy of the assembled contigs and scaffolds.39 However, N50, one of the most widely used size statistics, is comparable between assemblers only when each is computed with the same combined length value. On the other hand, the accuracy of assemblies is generally difficult to measure, although certain inherent accuracy measures may be available for a specific assembler. In our study, we used six statistics, namely maximum contig length, minimum contig length, average contig length, genomic coverage (measured as the total length of the reads used for constructing contigs divided by the total length of all queries), total processing time and RAM occupancy, to illustrate the trade-offs between contig length and genomic coverage that the assemblers make when dealing with large volumes of short reads. Six widely used assembly tools were evaluated: QSRA,44 SSAKE_v3-5,41 Edena_2.1.1,46 ABySS_1.2.6,56 SOAPdenovo_1.0554 and Velvet_1.0.09.55 Limited by the computer RAM available to us (32 GB), we extracted 1.5 million reads and read pairs from the SE read file ERR008834 and the PE read file SRR043391, respectively, as input queries. The results are shown in Table 5.
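For reference, contig summary statistics (together with the N50 value mentioned above) can be computed from a list of contig lengths as in the following sketch; the contig lengths and query length are invented, and the coverage figure here is simplified to total contig length over total query length rather than the read-based definition used in our experiments.

```python
# Contig summary statistics for an assembly. N50 is the length of the shortest contig
# in the minimal set of longest contigs whose combined length reaches half of the
# total assembly length. All input numbers below are invented for illustration.

def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

def contig_stats(contig_lengths, total_query_length):
    return {
        "max": max(contig_lengths),
        "min": min(contig_lengths),
        "mean": sum(contig_lengths) / len(contig_lengths),
        "N50": n50(contig_lengths),
        "coverage": sum(contig_lengths) / float(total_query_length),  # simplified
    }

print(contig_stats([12000, 8000, 5000, 3000, 1500, 500], total_query_length=50000))
```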

Table 5 Assembly results using real Illumina single-end and paired-end reads from SRA

From Table 5, we see that, in the SE test, SOAPdenovo and QSRA yielded distinctly higher genomic coverage than the other tools, around 60% higher, generally with a larger number of short contigs. By contrast, SSAKE and Edena usually produced longer contigs with much lower genomic coverage. Among all the tools tested, SOAPdenovo and ABySS were the fastest, whereas Edena and QSRA were the most memory-efficient. For mate-pair assemblies, for which QSRA and Edena are not available, SOAPdenovo gave the best performance, with the highest genomic coverage and the lowest time and RAM requirements. ABySS yielded the longest single contigs, whereas the contigs from SSAKE were longer on average. Pop38 and Miller et al.39 have given further insights into the performance of other de novo tools and assembly algorithms for NGS.

Discussion of assembly tools

As an interim conclusion, in our experiments SOAPdenovo offered more satisfactory performance, in terms of speed, memory usage and genomic coverage, than the other tools under both SE and mate-pair conditions, with QSRA performing somewhat less well in single-end read assembly. However, the contigs from both of these programs were usually short. On the other hand, SSAKE and Edena generally produced longer contigs with lower coverage rates. ABySS produced the longest contigs from mate-pair reads, although the average length of its contigs was short. Among the tools tested, Velvet, SSAKE and ABySS required more computer memory for the same task; in our experience, more than 32 GB of memory is needed to handle larger volumes of input reads (for example, more than ten million) with these programs. In addition, compared with the other assemblers, Velvet and SSAKE are more time consuming, which may limit their application in the field of de novo assembly. In summary, all the approaches mentioned above have to strike a balance between contig length and genome coverage.

Nevertheless, the scale of the analysis and the type of assay may dictate the tool(s) to be used. Moreover, the heuristics that a given assembler provides for handling real read errors and genomic repeats, and the computational resources available, may also profoundly influence the program's success in de novo assembly.

Challenges and prospects

Despite the strikingly attractive success of NGS in genomics and post-genomics, three main challenges, which can be summarized as the computational challenge, the developmental challenge and the cross-platform unification challenge, are blocking, and for some time will continue to block, the development of these new technologies from infancy to maturity.

The growing gap between the massive output of NGS platforms and the computational resources available to process and analyze it urgently needs to be bridged. Aligning millions or even billions of reads against a large mammalian genome in a single experiment has become common in today's genomic studies, yet supercomputers with abundant memory to handle such tasks are not available to every user. Running time is also an unavoidable concern when dealing with NGS tasks. Extraordinarily efficient algorithms are therefore urgently needed to reduce computing costs. Parallelization strategies, together with compact indexing schemes such as the BWT algorithm applied by BWA, Bowtie and SOAP2, have been proposed and have helped aligners speed up their execution and reduce their memory requirements without compromising accuracy.14

As NGS technologies continue to change, developers of short-read mapping and assembly software must keep pace with these new techniques. To match or even exceed Sanger sequencers in read length, which has a critical effect on detecting split-mapping signatures and on de novo sequencing, NGS machines are all trying to produce longer reads. Future mappers for short reads, and the NGS tools available now, will therefore need to be adapted to remain compatible with longer reads. Furthermore, unfamiliar data formats from the so-called next–next-generation sequencers, such as the Helicos Heliscope™ and Pacific Biosciences SMRT, the explosive diversity of experiments and the divergent scales of analysis all call for more robust and efficient algorithms that can automatically adjust parameters to specific demands.

Another main challenge faced by developers of NGS mappers and assemblers comes from the lack of standardization across NGS platforms in insert size between mates, error profiles and ‘true match’ benchmarks. Different insert sizes, which are common among NGS platforms, also differ in their power to detect variants.57 Shorter inserts increase sensitivity to smaller events, whereas long inserts offer advantages in detecting larger events.58, 59 Therefore, a combination of multiple libraries with varying insert sizes will be a good choice in future studies.58, 60, 61 Furthermore, because different platforms produce reads with different error models and separate ‘real alignments’ from multiple possible matches using their own criteria, investigators are often confounded when they explore data from several platforms. Thus, a unified standard for determining genuine matches and a critical evaluation of the quality of data from these technologies are urgently needed.62 In addition, considering that NGS users are often confronted with a complicated maze of base-calling, alignment, assembly and analysis tools with frequently incomplete documentation and little guidance on how to compare and validate the outputs, Medvedev et al.57 recommended that new methods should combine the previous approaches and use different types of signatures to support an event.

Nevertheless, NGS approaches are undoubtedly here to stay and will propel the development of bioinformatics in areas such as mapping, assembly, variant detection and other related fields for many years.1, 62 Their advantages in speed and cost62 and their superior ability to detect divergent types of variants56, 59, 60, 61, 63 have ensured their wide application in medical research and diagnostics.64 Moreover, genomics,64 functional genomics,9 proteomics,64 transcriptome analysis,65 epigenetic research66 and the characterization of new viruses67 and bacteria68, 69 all benefited from these technologies immediately after their introduction to the market.

Conclusion

Challenges certainly remain to be addressed for the further development of NGS. More effort is needed, not only in mapping and assembly, but also in the areas of so-called ‘downstream analysis’, such as metagenomics, transcriptome analysis, small RNA detection and other related fields. New considerations and questions will continue to emerge, and novel programs will have to evolve rapidly to keep pace with NGS and with changes in the adoption of these techniques.