Ultra-fast genome comparison for large-scale genomic experiments

In the last decade, a technological shift has occurred in the bioinformatics field: larger genomes can now be sequenced quickly and cost-effectively, resulting in the computational need to efficiently compare large and abundant sequences. Furthermore, detecting conserved similarities across large collections of genomes remains a problem. The size of chromosomes, along with the substantial amount of noise and the number of repeats found in DNA sequences (particularly in mammals and plants), leads to a scenario where executing comparisons and waiting for complete outputs is both time- and resource-consuming. Filtering steps, manual examination and annotation, very long execution times and a high demand for computational resources represent a few of the many difficulties faced in large genome comparisons. In this work, we provide a method designed for comparisons of considerable numbers of very long sequences that employs a heuristic algorithm capable of separating noise and repeats from conserved fragments in pairwise genomic comparisons. We provide a software implementation that computes in linear time using as little as one core and a small, constant memory footprint. The method produces both a previsualization of the comparison and a collection of indices that drastically reduce computational complexity when performing exhaustive comparisons. Last, the method scores the comparison to automate the classification of sequences and produces a list of detected synteny blocks to enable new evolutionary studies.


Glossary

Dotplot: A comparison between two sequences represented in the Cartesian plane.

Hit: An identified match between two k-mers.

Indexing: In the context of k-mers, computing a hash (footprint) of the k-mer and storing it in order to improve posterior retrieval.

Inexact k-mer: A k-mer that has been indexed using a key which represents the k-mer only partially, in order to force collisions.

K-mer: A DNA substring of fixed size k.

Repeats: A DNA string that appears repeatedly in a DNA sequence, typically ranging from a few base pairs to a few thousand.

Search space: The set containing all possible solutions for a given problem.

Subsampling: In the context of k-mers, selecting a subset of all k-mers according to a heuristic.

Synteny: Two or more blocks, present in the species being compared, which conserve relative order.
2. Regions with homologous genes will form fragments, as these include more k-mers to be matched than the previous case (even though not many will be unique).
3. Regions with repeats will form no fragments, as nearly all of their matching k-mers will be removed due to uniqueness.
We performed an experiment to measure the sensitivity of the method with regard to genes found in more closely or distantly related genomes, particularly for Escherichia coli, whose relationships among strains are depicted in [1]. We therefore downloaded the strains B, K-12, O157:H7, Shigella flexneri 2a and CFT073 (in order of relatedness to the B strain). These were all compared to the B strain using first CHROMEISTER and then the pipeline (i.e. GECKO guided by the information generated with CHROMEISTER). Then, all genes found in the comparison for each sequence were counted for the forward strand. The sensitivity was calculated as follows:

S = TP / (TP + FN) = (genes identified) / (total number of genes in the comparison)
Notice that calculating the sensitivity requires a ground-truth reference, which in this case is GECKO. Therefore, the sum of true positives and false negatives, which equals the total number of genes in the comparison, is provided by GECKO, and the results reported by the CHROMEISTER pipeline are calculated against it. Table 1 shows the sensitivity results for the depicted comparisons.
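As a worked example, the sensitivity formula above can be computed as follows (the gene counts are illustrative placeholders, not values from Table 1):

```python
def sensitivity(genes_identified: int, total_genes: int) -> float:
    """S = TP / (TP + FN), where the ground truth (TP + FN) is the
    total number of genes reported by the GECKO reference run."""
    return genes_identified / total_genes

# Illustrative numbers only: the pipeline recovers 950 of the
# 1000 genes reported by GECKO for some comparison.
print(sensitivity(950, 1000))  # 0.95
```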

We can observe from Table 1 that the results are consistent with those reported in [1] and, moreover, that the CHROMEISTER pipeline loses only a relatively small percentage of the signal as the relatedness between sequences decreases.

Gene search comparison with other methods
We performed an experiment similar to the previous one to measure how many genes are detected by other software, particularly NUCMER, GECKO and the CHROMEISTER pipeline (that is, using the information generated by CHROMEISTER to find the fine-grained alignments, or High-scoring Segment Pairs). We must stress that the CHROMEISTER method is intended for detecting large synteny blocks and evolutionary rearrangements quickly and with as few resources as possible; it is therefore not aimed (on its own) at detecting genes. Moreover, its inclusion in the tables would be unfair to the other software methods, since the number of genes reported would have to be calculated as those covered by the detected synteny blocks (which are large and would therefore report many more genes). Table 2 shows the percentage of genes detected divided by the total number of genes present in the sequence. Notice that we only used fragments detected in the forward strand and therefore only accounted for genes that are not in the complementary strand (particularly to simplify the additional programs developed to check this and to reduce execution times by approximately half).
We performed the experiment with the following sequences:

Table 4. Re-run experiment without overlapping regions to detect the number of genes by method and dataset. Each cell shows two numbers: the first is the number of genes detected in the forward strand of the query sequence divided by the total number of genes in that strand and sequence; the second is likewise for the reference sequence.
We can see that the results are much closer overall and that, in some cases, the pipeline even outperforms NUCMER. Notice also that the number of genes in the second Escherichia coli sequence for GECKO is 2% lower compared to the pipeline, which seems contradictory. However, this is due to the high number of repetitions in Escherichia coli that are found by GECKO but not by the pipeline, and which eventually get discarded in the filtering process. This is, in fact, an imperfection of the devised method for removing repeated regions, since it also removes repetitions included in the main signal; in the case of the pipeline these do not get removed because no repetitions are detected.

Indexation and matching procedure
First of all, we would like to stress (as stated in the manuscript) that the method is designed to use as little information as possible, as fast and efficiently as possible, in order to produce a quick overview of the comparison of two genomes. Although the method is a combination of several heuristics (inexactitude, uniqueness, subsampling and traversal of local optima), in this section we discuss the matching procedure, which includes indexing by both inexactitude and uniqueness. See Figure 1 for an example.

Figure 1. Indexation of k-mers using the inexact method with z = 2. The original k-mer (shown at the top) is composed of two parts: (1) in red, the fixed prefix used to map the k-mer to a hash table (key), and (2) in green, the nucleotides used to compute its hash (value). The second part (bottom) shows which k-mers can be matched (and therefore converted into hits if uniqueness is satisfied) to the original k-mer. The third part shows an example k-mer that can be matched, with nucleotides differing from the original shown in yellow.
In short, the matching procedure is aimed at reducing the search space by allowing k-mers to be matched to a representative k-mer of their class (clustering by inexactitude) and then using only the unique prototypes of each class (uniqueness). Loosely speaking, it can be considered an on-the-fly clustering where only centroids are used.
False positives and false negatives due to inexactness

From Figure 1 we can see how the indexing is performed. A prefix (in red) is used to map the k-mer key to a hash table, whereas the individual nucleotides (in green) are used to calculate the hash value. Notice that this approach enables long k-mers (which avoid a large noise-to-signal ratio) to be matched with other k-mers that include single point mutations or have suffered from sequencing errors. This also covers cases where remote sequences that have diverged over millions of years are being compared.
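One plausible sketch of such a prefix/value split is shown below; the prefix length, the sampling of every z-th remaining nucleotide and the use of Python's built-in hash are illustrative assumptions, not CHROMEISTER's exact internal layout:

```python
def inexact_key_value(kmer: str, prefix_len: int = 12, z: int = 2):
    """Split a k-mer into a fixed prefix (hash-table key) and a value
    computed from only every z-th remaining nucleotide, so that k-mers
    differing at the skipped positions collide, i.e. are matched."""
    key = kmer[:prefix_len]
    sampled = kmer[prefix_len::z]  # skipped positions may mutate freely
    # hash() is salted per run for strings but consistent within one run,
    # which is all this sketch needs.
    value = hash(sampled)
    return key, value

a = "ACGTACGTACGTAACCGGTTAACCGGTTACGT"   # a 32-mer
b = list(a)
b[13] = "G"                              # point mutation at a skipped position
b = "".join(b)
print(inexact_key_value(a) == inexact_key_value(b))  # True: still matched
```

A mutation inside the prefix, by contrast, would map the k-mer to a different bucket and the match would be lost, which is the trade-off the z parameter controls.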
Regarding error rates, we can consider two cases: (1) false negatives and (2) false positives. On the one hand (1), a false negative corresponds to not matching two equal k-mers. This case is not possible, since two equal k-mers will have two equal hash values. On the other hand (2), a false positive corresponds to matching two different k-mers. Strictly speaking, the method creates false positives on purpose to account for mutations and other evolutionary events. We have performed an experiment addressing the confusion matrix rates, which is shown in the next sections.

False positives and false negatives due to uniqueness
The condition of uniqueness is very restrictive. It is important to keep in mind that it was designed so that the algorithm could run with nearly no memory requirements and with boosted performance while producing a fast previsualization. In short, the uniqueness condition removes the problem of handling collisions, balancing trees, etc.; i.e. it reduces the problem to a single key-value hash table. These computational benefits come, however, at the expense of a decrease in sensitivity. In particular, due to uniqueness, false positives (matching different k-mers) are not affected (since their matching is only due to inexactness); however, they can be turned into false negatives as explained next, and therefore removed. On the other hand, a great deal of false negatives (not matching equal k-mers) will be generated, since any k-mer that is present twice is considered a repetition (and therefore not matched with any other k-mer to form a hit).
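The uniqueness filter can be sketched as follows (a simplified single-sequence version; the position lookup via find is illustrative):

```python
from collections import Counter

def unique_kmers(sequence: str, k: int):
    """Keep only k-mers that occur exactly once: repeats are discarded,
    which collapses the index to a plain key -> position table with no
    collision handling at all."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: sequence.find(kmer)  # position of the single occurrence
            for kmer, n in counts.items() if n == 1}

idx = unique_kmers("ACGTACGTTT", k=4)
print("ACGT" in idx)  # False: appears twice, treated as a repeat
print("GTTT" in idx)  # True: unique, can seed a hit
```

Because each surviving key maps to exactly one position, no chaining or tree balancing is ever needed, which is the computational benefit described above.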
In any case, we performed an experiment to objectively measure how many hits are lost due to uniqueness.

Table 5. Hits detected using traditional k-mer matching and those that are unique, for each of the three datasets.
As can be observed from Table 5, the number of unique hits is not correlated with the number of total hits. Moreover, we calculated the percentage of final fragments that included a unique hit by extracting the aligned fragments from GECKO. We then removed all repeating regions that had at least 60% overlap, as in previous exercises, to account only for the main signal.

Table 6. Unique hits contained within aligned fragments. The row "Minimum fragment length" represents the length filter (in base pairs) applied in the comparison. The row "Total aligned frags" shows how many fragments were detected by GECKO. The last row, "Total frags containing unique hits", represents how many of these fragments included the unique hits reported above.
As we can see from Table 6, above 80% of the non-overlapping aligned fragments reported by GECKO (of a certain minimum length) include unique hits. From our perspective, this shows that the uniqueness approach (while dramatically reducing the search space) is able to find a significant proportion of seeds that can later spawn the alignments of the main signal.

Regarding error susceptibility of inexact versus traditional k-mers
First, the inexact design is aimed at enabling long k-mers to be matched with other k-mers that have suffered some form of mutation (a higher value of the variable z performs a less restrictive matching, and vice versa). However, in order to measure the impact of inexactness on the matching, we performed an experiment in which we took one sequence, generated its k-mers, then mutated the sequence once (see details below in the description of the experiment) and re-generated its k-mers. Lastly, we matched the initial k-mers with those corresponding to the mutated sequence and counted the rates for the confusion matrix. We repeated the experiment with both traditional k-mer matching and inexact k-mer matching in order to understand:

1. To which degree the traditional method loses sensitivity due to single point mutations compared to the inexact method.
2. To which degree the inexact method matches other k-mers.
For the sake of simplicity, the initial sequence had all k-mers different; i.e. each k-mer should match only once between the sets of k-mers from the non-mutated and mutated sequences. Regarding the dataset, the first kilobase of the Bos taurus chromosome X was extracted. A mutation probability of 0.01 was applied to each base (resulting in an average of 10 mutations per kilobase). We counted the confusion matrix rates as follows: Table 7 shows the confusion matrix for the experiment. The number of true negatives is high, as it counts, for each k-mer on the non-mutated sequence, whether it is matched to any other k-mer of the mutated sequence. Regarding true positives, we can see that the inexact method enables more k-mers to be matched than the traditional exact method, as it is not as sensitive to mutation. Moreover, the number of false positives is zero both in the traditional method (it cannot be otherwise, since the hashing is injective) and in the inexact one (although this depends on the dataset size, since collisions are possible).

Table 7. Confusion matrix for the traditional versus inexact k-mer matching experiment using 1,000 k-mers. The row represents the actual class of the k-mer, whereas the column represents the predicted one.
In short, the experiment shows that inexact k-mer matching can reduce the number of false negatives while increasing true positive rates in a scenario where we allow, and indeed desire, some degree of flexibility in the matching. In cases where exact, restrictive matching is sought, it makes more sense to employ traditional k-mer matching. From our perspective, in the context of genome comparison, where sequences diverged several millions of years ago, it is reasonable to employ the inexact method, since it not only improves detection but also reduces the computational complexity (fewer nucleotides need to be hashed, which is a positive side effect).
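The spirit of this experiment can be sketched as follows (a simplified version: a random sequence instead of Bos taurus chromosome X, positional pairwise comparison instead of full set matching, and an illustrative inexact footprint):

```python
import random

def mutate(seq: str, p: float, rng) -> str:
    """Apply a point mutation to each base with probability p."""
    return "".join(rng.choice("ACGT".replace(c, "")) if rng.random() < p else c
                   for c in seq)

def kmers(seq, k=32):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def sampled(kmer, prefix_len=12, z=2):
    # Illustrative inexact footprint: fixed prefix + every z-th nucleotide.
    return kmer[:prefix_len] + kmer[prefix_len::z]

rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(1000))   # 1 kb random sequence
mut = mutate(seq, 0.01, random.Random(1))                 # ~10 mutations

orig, mutd = kmers(seq), kmers(mut)
exact_tp   = sum(a == b for a, b in zip(orig, mutd))
inexact_tp = sum(sampled(a) == sampled(b) for a, b in zip(orig, mutd))
print(inexact_tp >= exact_tp)  # True: the footprint tolerates some mutations
```

Any pair matched exactly is also matched inexactly, so the inexact true positive count can never be lower; the difference between the two counts is the sensitivity recovered from mutated positions that the footprint ignores.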
Furthermore, we performed a second experiment with nearly 10,000 k-mers (9,713 after removal of equal k-mers, in order to have no false positives) to see the trade-off between true positives, false negatives and, particularly, false positives. The procedure was the same as in the previous experiment; i.e. the same mutation rates were applied and no two k-mers are equal. Table 8 shows the results of the confusion matrix for the nearly 10,000 k-mers.

Table 8. Confusion matrix for the traditional versus inexact k-mer matching experiment using 9,713 k-mers. The row represents the actual class of the k-mer, whereas the column represents the predicted one.
As we can see from Table 8, the number of false negatives is reduced by almost 50% in the case of inexact matching, since it allows slightly varying k-mers to be matched. This can also be appreciated in the number of true positives identified, while the number of false positives rises only to 205. Overall, while the inexact approach is able to improve the true positive rate and decrease the true negative rate at the expense of a slight increase in the false positive rate, we must also point out that these "false positive" matches will rarely have a direct impact on the output since, as explained before, a sufficient number of matches is required around a region in order to form or cluster signal (and the false positives are not only few, but their positions are scattered throughout the comparison).

Pipelined execution
The comparisons were rerun exhaustively with GECKO using the information generated from the previous CHROMEISTER execution. That is, the genomic regions surrounding the detected signals of conserved similarities were compared with GECKO to produce a set of aligned HSPs. Such a set of fragments can then be used for post-processing tasks, for example in GECKO-MGV. Table 3 shows the execution times for the exhaustive rerun compared to the best execution from either CGALN or NUCMER. Notice that in the previous comparison (main manuscript) the parameters of NUCMER were tweaked to favor it (for instance, the alignment extension stage was removed), but for this exhaustive test the parameters are left at their defaults in order to produce high-quality results.

As can be seen, comparisons B19-C9, HX-MX and BC-BC are slower by a few minutes, but still competitive. However, as the search space of the comparison grows (e.g. M2-O1, A1-T1, HC-MC and AC-TC), the pipelined execution outperforms the other methods while providing a higher level of exhaustiveness (see next section) in terms of signal detection and competitive results in terms of gene detection (see section "Sensitivity analysis"). In particular, the case of M2-O1 shows a time reduction that does not follow the growing trend of the search space. This is because the time consumed by the pipelined execution is directly proportional to the number of conserved signals detected, and M2-O1 has the lowest number of similarities. It is also interesting to remark that the pipelined execution required less than 1.5 GB of RAM for every execution, including the full plant genomes. This enables the method to be run even on old computers, tablets or smartphones. The CSV files containing the sets of HSPs are ready to run on the GECKO-MGV server.

Notice that as the length of the compared sequences grows, the amount of information compressed per cell in the hits matrix also grows.
Therefore, it becomes necessary to increase the resolution of the hits matrix in order not to lose detail due to information compression. For instance, in Table 9, the sequences corresponding to A1-T1, HC-MC and AC-TC were executed with a double-sized hits matrix, thus enabling a higher-resolution output and decreasing the number of unnecessary exhaustive comparisons performed by GECKO. The increase in size of the hits matrix does not affect performance and only impacts the size of the output file in secondary storage.
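The coordinate-to-cell mapping underlying this compression can be sketched as follows (the default dimension of 1,000 is an illustrative assumption, not necessarily CHROMEISTER's internal value):

```python
def cell_of(x: int, y: int, len_x: int, len_y: int, dim: int = 1000):
    """Map a hit at genome coordinates (x, y) to a cell of a fixed
    dim x dim hits matrix; each cell compresses len/dim base pairs."""
    i = min(x * dim // len_x, dim - 1)
    j = min(y * dim // len_y, dim - 1)
    return i, j

# With 100 Mbp sequences and a 1000x1000 matrix, each cell spans 100 kbp;
# doubling dim to 2000 halves the compression per cell.
print(cell_of(50_000_000, 50_000_000, 100_000_000, 100_000_000))        # (500, 500)
print(cell_of(50_000_000, 50_000_000, 100_000_000, 100_000_000, 2000))  # (1000, 1000)
```

Since the matrix size is fixed regardless of input length, memory stays constant; only the resolution (base pairs per cell) changes with the sequences being compared.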

Signal comparison
The signals produced by each of the methods mentioned are now compared to our proposal in order to validate and assess the quality of the results. That is, the information produced by CHROMEISTER is pipelined into GECKO to obtain a set of HSPs, and these are compared to the HSPs found by the alternative methods. Coverage is measured relative to the query genome used in the comparison. In this case, Bos taurus Chr 19 was compared against Canis familiaris Chr 9. Figure 2 shows the dotplot comparison as generated by GECKO, NUCMER, CGALN and the pipeline. Notice that HSPs were filtered with a minimum of 300 base pairs and 70% identity. As can be observed, there is consensus among the different methods. These results can also be contrasted with public repositories, such as NARCISSE or CoGe. As can be observed in Figure 2, almost all methods agree on the same visual result. However, NUCMER detects isolated, single HSPs throughout the comparison, even when all thresholds are set equally. Notice that the coverage appears low compared to the representation in the dotplots; this is due to the existing separation between HSPs, which cannot be observed in thumbnail previews. As seen in Table 10, the signal percentage for the pipeline was superior to that of NUCMER and CGALN, and highly similar to that of the full GECKO execution, which is considered the reference as it provides the longest, high-quality and verifiable set of alignments.

Table 10. Comparison of the quality of the signal detected among the different methods. For each method, the total number of base pairs aligned at a minimum of 70% identity and of length above 300 nucleotides is reported. The coverage is computed relative to the size of the query genome. The signal percentage column is computed as the ratio between the coverage from GECKO and that of the rest of the methods. *MEGABLAST is included as a reference without dotplot.

Infrastructure
All the comparisons and executions were performed on our local server at the University of Malaga, with 2x 8-core Intel Broadwell-EP E5-2620 v4 processors at 2.1 GHz.

Table 11. Parameters employed in the LSGR detection.
The number of evolutionary events detected for each comparison with Homo sapiens is displayed in Table 12. Notice, however, that the algorithm employed counts synteny blocks; i.e. more synteny blocks imply more evolutionary events (as long as conserved similarity remains), since blocks have to be broken apart to separate into more synteny blocks.

Correlation between LSGRs and the scoring distance
The number of LSGRs and the scoring distance were calculated for all primate species with respect to Homo sapiens. Figure 3 shows the visual correlation between both functions. The Pearson correlation coefficient was 0.72, which indicates a positive, relatively strong correlation. Notice that the score of Homo sapiens against itself is not 0 due to the huge islands of unidentified nucleotides (see Table 13). These are not computed by CHROMEISTER and can include up to 30,000,000 nucleotides, which can impact the overall score.

Block tracing
The traced blocks appearing in Figure 4 in the main manuscript are obtained from Tables 14 and 15. The proposed method identifies synteny blocks, providing coordinates for each block. Then, if overlapping blocks are found, they are considered related blocks. As can be seen, the first row in both tables is related to larger blocks (in Figure 4 they would represent a block filling almost the whole chromosome); however, just a fragment of it has been selected, as it is the section that is shared with all the other comparisons. Coordinates are scaled from 1 to 1,000 with respect to their original size. Notice that each comparison employs on the x-axis the sequence that was originally on the y-axis of the previous one, hence allowing blocks to be followed from the root to the leaves.

Parallel execution
The parallelisation of CHROMEISTER is immediate under a task-level paradigm. That is, since the computation of each comparison takes a small amount of time, data-level parallelisation is not as efficient as a coarse-grained task-level approach (mostly because the former requires synchronisation techniques that clog up computation and add overhead). Therefore, the current implementation of CHROMEISTER includes a script for all-versus-all comparisons which can be run in parallel. The script is called on one or two folders, and it will compare the sequences within or between the folders in order. Each time the script is called (with the same command line), another process is spawned using one core and comparing the next sequences in the ordered list. The parallelisation is handled at the operating-system level; i.e. each process checks that the current two sequences to be compared are not already being compared by another process.
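This task-level scheme can be sketched as follows (a minimal Python sketch, not the actual script shipped with CHROMEISTER; the file names and the echo placeholder are illustrative):

```python
import itertools
import os
import subprocess
import tempfile

def claim_and_run(fasta_files, workdir):
    """Walk the ordered list of pairwise comparisons and atomically claim
    the next unclaimed pair by creating a marker file (O_CREAT | O_EXCL).
    Any number of independently launched processes can therefore cooperate
    with no shared state beyond the filesystem."""
    for a, b in itertools.combinations(sorted(fasta_files), 2):
        marker = os.path.join(
            workdir, f"{os.path.basename(a)}__{os.path.basename(b)}.lock")
        try:
            fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
        except FileExistsError:
            continue  # pair already claimed by another process
        # Placeholder for the real single-core comparison command:
        subprocess.run(["echo", "comparing", a, b], check=True)

workdir = tempfile.mkdtemp()
claim_and_run(["seqA.fa", "seqB.fa", "seqC.fa"], workdir)
print(len(os.listdir(workdir)))  # 3: all three pairs were claimed
```

Because file creation with O_EXCL is atomic, two processes invoked with the same command line can never claim the same pair, which matches the operating-system-level coordination described above.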

Distribution of hits
The number of random matches that occur between two sequences A and B follows a binomial distribution with mean n * p and variance n * p * (1 - p), where n is the number of events (each comparison that can produce a match or not) and p is the probability of each match. The variables n and p take their values depending on the choice of the k-mer size and the lengths of the query and the database. For instance, in the case of one query sequence and one database sequence of lengths l1 and l2 (respectively), and a choice of k-mer size k, there would be (l1 - k + 1) * (l2 - k + 1) comparisons of k-mers, each having a probability of (1/4)^k of being equal. To accomplish this, the query is considered as a single long sequence, since it is of no use to distinguish between query sequences. Thus the number of k-mers present in a query of Q sequences is the union of the k-mers from each sequence Q_i. On the other hand, each sequence in the database is considered independently, therefore producing S binomial distributions, the result of comparing the hits between the whole query (as one long sequence) and each individual sequence S_i in the database. In particular, for a query of Q sequences, each of length Q_i, and a database composed of S sequences, each of length S_i, and a k-mer choice k, the number of k-mer comparisons n_i between the query and each one of the database sequences will be:

n_i = (sum over i=1..Q of (Q_i - k + 1)) * 2 * (S_i - k + 1)

Notice that the number of k-mers in the reference sequence has to be multiplied by two to consider the reverse complementary strand. Therefore, the resulting binomial distribution is modelled as:

Hits_i ~ B(n_i, (1/4)^k)
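For the single query/single database case, the expected number and variance of random hits follow directly from this model; a sketch:

```python
def random_hit_stats(l1: int, l2: int, k: int):
    """Mean and variance of random k-mer matches between a query of
    length l1 and one database sequence of length l2, modelled as a
    binomial B(n, p): n k-mer comparisons (reference doubled for the
    reverse complementary strand), each matching with p = (1/4)^k."""
    n = (l1 - k + 1) * 2 * (l2 - k + 1)
    p = 0.25 ** k
    return n * p, n * p * (1 - p)

# With k = 32, random matches are vanishingly rare even for 100 Mbp inputs:
mean, var = random_hit_stats(100_000_000, 100_000_000, 32)
print(mean < 1)  # True
```

This is why a long k-mer (such as k = 32) keeps the noise floor negligible: p shrinks as 4^-k far faster than n grows with sequence length.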

Impact of inexact indexing
The impact of the order of z in the indexing function is analyzed. That is, the sequences depicted in the previous section were compared while varying the value of z, and the number of unique hits was recorded. The range used for z was 0 through 7 plus several larger values (missing values such as 8 and 9 produce the same results under the f_z(x) function as their listed previous integer). Figure 4 shows the number of unique and inexact hits found as a function of z. It can be seen that the sequence comparisons (sorted by ascending search space in the legend) show a right-skewed distribution that approximates a normal distribution as the search space grows. Furthermore, the effect of the z value can be divided into four stages that repeat themselves in each comparison, namely:

1. The initial mild growth indicates that few inexactitudes are being allowed in the matching of hits, and therefore few, closely related new inexact hits are matched.
2. The steep growth indicates that the allowed degree of inexactitude in the matching is producing a large number of new unique hit matches, but with less similarity.
3. The steep decay indicates that the inexactitude constraints are saturating the matching of hits; that is, highly different k-mers are being matched as equal, thus reducing the uniqueness of hits (because almost everything tends to produce more than one hit).
4. The ending mild decay appears when the number of allowed inexactitudes completely saturates the matching of hits (everything matches everything with no significant similarity).

By observing Figure 4, we might conclude empirically that the optimal number of mismatches that should be allowed per sequence comparison has a natural distribution that depends on both input sequences (apparently on the search space). Therefore, with the aim of producing high-quality indexing of unique, inexact hits, the value of z should be modified according to the lengths of the sequences, particularly avoiding the saturation of the hits indexing. The same experiment depicted here is also reproduced for plant genomes (see "Plant genomes and uniqueness" below). Notice, however, that the degree of z is flexible and will still produce high-quality results (all following experiments described in this manuscript were run with z = 4).

Plant genomes and uniqueness
The impact of the order of z in the indexing function was analyzed above for mammalian sequences; here the same experiment is repeated for plants. That is, plant genomes were compared while varying the value of z, and the number of unique hits was recorded. The range used for z is the same sequence as in the previous section (missing values such as 8 and 9 produce the same results under the f_z(x) function). Figure 5 shows the number of unique and inexact hits found as a function of z. The following plant sequences and genomes were employed: (1) Brassica oleracea - Brassica rapa, (2) Aegilops tauschii chr. 1 - Triticum aestivum chr. 1, (3) Zea mays - Sorghum bicolor and (4) Aegilops tauschii - Triticum aestivum.

Figure 5. Impact of z on the number of unique inexact hits for plant sequences. The x-axis shows the z value, whereas the y-axis shows the number of unique inexact hits found in each of the comparisons. Sequences are sorted from smaller to larger in the legend.

Further visualization
In this section we include the visualization plots for the comparisons that do not appear in the main manuscript. These include M2-O1, BC-BC, HC-MC and AC-TC (see the manuscript for more information). Only the previsualization plot generated by CHROMEISTER is shown.

Software parameters
Regarding each of the programs:
• CGALN was used in the following way:
○ The maketable program was run with different k-mer sizes such as 18, 19, 20, 30 and 32, but all of them produced a segmentation fault, except for the default size 11.
○ The -r -b flags were set for fast computation.
• NUCMER was used with the following parameters:
○ nucmer -t 1 --noextend --mum -l 32
○ Time included generating the files for plotting and running mummerplot.
• CHROMEISTER was used with default parameters. Time included plotting.

NUCMER CPU usage
Despite using the parameter -t 1 to avoid multithreading, NUCMER utilized more than one core in every test. Therefore, the CPU usage (%) was measured with the time utility and recorded.

Single One vs. One chromosome comparison and software comparison
In this section, we show several comparisons of chromosomes and genomes from standard, state-of-the-art assemblies belonging to mammals and plants. Additionally, CPU time and RAM consumption are measured and compared with CGALN and NUCMER, part of the MUMMER suite.
The results of three comparisons, namely B19-C9, HX-MX and A1-T1, are shown in Figure 10. The correctness of the B19-C9 and HX-MX comparisons is checked against the NARCISSE portal. Notice that the A1-T1 comparison was not found in any public repository, and therefore its inspection was done with respect to NUCMER.

(c) Aegilops tauschii assem. Chr 1 vs. Triticum aestivum pseudo-Chr 1

Figure 10. Comparison of the One vs. One experiment. The x-axis of each image corresponds to the first sequence in each comparison, whereas the y-axis corresponds to the second sequence. The dotplots on the left are produced by the proposed method, whereas those on the right have been extracted from the NARCISSE platform (except for the last one, which was generated with NUCMER). Differences that are not found in the reference are highlighted in red.
It can be observed from Figure 10 that both (a) and (b) are almost equivalent to the NARCISSE resource. However, notice that the NARCISSE resource has undergone several post-processing refinements, and therefore the conserved similarities are observed as much cleaner diagonals than in the case of the proposed method. Still, a few differences can be identified (marked in red). Blocks detected by CHROMEISTER but undetected by NARCISSE are due to the latter computing the pairwise comparison using gene-annotated sets to account for conserved segments; see the Supplementary Material for more details on these blocks. On the other hand, some conserved segments of lesser size detected by NARCISSE are not detected by CHROMEISTER due to its heuristic nature; i.e. blocks are detected based on their number of unique hits instead of their gene content, although these are not mutually exclusive (see section "Sensitivity analysis and impact of the heuristic methodologies"). Finally, the (c) comparison was contrasted with the result of NUCMER, since the comparison did not exist in the considered repository (and had to be manually filtered; see "NUCMER comparison" in the Supplementary Material). Both methods detect the conserved homologies, although CHROMEISTER removes any repetitive element and therefore only the main syntenies are seen, whereas NUCMER shows both repeating regions and homologies, as well as some noise elements.
Regarding computational times, we set a maximum of 100 hours of computing time per execution. Running times, along with RAM consumption, are shown in Table 16 for NUCMER, CGALN and the proposed method. Only one core was used for each comparison (see "Software parameters" in the Supplementary Material for the list of parameters). The parameters of NUCMER were tweaked to favor it, as it would otherwise take several thousand minutes on certain datasets with default parameters.

Table 16. Summarized executions. Each tested program is run against the three single datasets. *NUCMER used more than one core in the computation; the adjusted time, shown in parentheses, equals the reported time consumption multiplied by the CPU usage. **The acronym "DNF" is short for "Did Not Finish"; this flag is set when the execution had not finished after 100 hours of computation time.
As can be observed, the proposed method CHROMEISTER is the fastest on every comparison and uses the least RAM except on the first dataset, B19-C9. CGALN was unable to finish any of the larger comparisons within the time limit. While on the smaller datasets the difference in computation time between NUCMER and CHROMEISTER is not very significant, the gap widens on large comparisons such as HC-MC and AC-TC, where speedups of 12x and 11x are obtained. Regarding RAM consumption, CHROMEISTER uses up to 50 times less than NUCMER on the larger comparisons, regardless of the size of the input sequences, making it suitable for use on any desktop computer.

Proof for new fragments detected between Bos taurus 19 and Canis familiaris 9
The "Single One vs. One chromosome comparison and software comparison" experiment revealed several similarity blocks that were not identified in the reference, i.e. NARCISSE. To verify that these blocks indeed share similarity, we aligned them using GECKO and loaded the comparison, along with the annotation files from GenBank, into the MG-Visualizator. This enables visualizing both the similarity between sequences and whether coding regions are shared. Figure 11. Comparison of Bos taurus chromosome 19 and Canis familiaris chromosome 9 using the MG-Visualizator. Annotation files were included (scattered points around the axes).
We now examine the regions identified by CHROMEISTER, displayed in red in Figure 11. The figure shows no coding regions for Bos taurus chromosome 19, although several are involved for Canis familiaris chromosome 9. This can also be seen for region 2 in Figure 12: no coding regions exist for the sequence on the X-axis, but one exists for the Y-axis. In any case, the two blocks are aligned properly, and the reason they do not appear in NARCISSE is the lack of CDS. The very same scenario can be seen for region 3 in Figure 13: the block once again exists and is aligned between both sequences, yet no coding region is shared between the two.

Proof for new fragments detected between Homo sapiens X and Mus musculus X
The "Single One vs. One chromosome comparison and software comparison" experiment revealed several similarity blocks that were not identified in the reference, i.e. NARCISSE. To verify that these blocks indeed share similarity, we aligned them using GECKO and loaded the comparison, along with the annotation files from GenBank, into the MG-Visualizator. This enables visualizing both the similarity between sequences and whether coding regions are shared. Two regions of interest are shown (see Figure 14), aligned with GECKO and zoomed in to enable visual inspection. Figure 14. Comparison of Homo sapiens chromosome X and Mus musculus chromosome X using the MG-Visualizator. Annotation files were included (scattered points around the axes).
We now examine the regions identified by CHROMEISTER, displayed in red in Figure 14. A fragment of 678 base pairs is detected at the point of interest in region 1. Although the fragment is not very large, it shows that the method is consistent with its reported similarities. The same is observed for region 2 (see Figure 15), where the detected blocks are 840 and 218 base pairs in length. This shows that block length alone does not determine whether a block is skipped; detection rather depends on the uniqueness of the block's composition.

NUCMER comparison
The NUCMER comparison of A1-T1 (Aegilops tauschii assembled chromosome 1 vs. Triticum aestivum pseudo-chromosome 1) produced a delta file of 455 MB. Running GNUPLOT on such a file was not feasible, as it would crash. To obtain the picture representing the dotplot comparison of A1-T1, manual filtering was needed, which included: • Shortening the resulting "fplot" and "rplot" files by keeping only the hits with 100% similarity. The following UNIX command was used for this purpose:
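The exact command is not reproduced here. As a hypothetical sketch of the described shortening (assuming whitespace-separated fplot/rplot records with the percent identity in the third column and blank lines separating segments, as produced by mummerplot), a filter along these lines would keep only the 100%-similarity hits:

```python
# Hypothetical filter for mummerplot-style .fplot/.rplot files: emit
# only the records (groups of non-blank lines) whose data lines all
# report at least `min_identity` in their (assumed) third column.
def filter_plot(lines, min_identity=100.0):
    def keep(record):
        fields = [l.split() for l in record]
        return record and all(float(f[2]) >= min_identity
                              for f in fields if len(f) >= 3)
    record = []
    for line in lines:
        if line.strip():
            record.append(line)
        else:
            if keep(record):
                yield from record
                yield "\n"  # keep the blank separator GNUPLOT expects
            record = []
    if keep(record):
        yield from record
```

Applied as, e.g., `sys.stdout.writelines(filter_plot(open("A1-T1.fplot")))`, this reduces the file to the exact matches only.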

Pipelined execution
Using the script /bin/guidefastas.sh from the branch "inmemory_guided_chrom", available in GECKO's GitHub repository ( https://github.com/estebanpw/gecko/tree/inmemory_guided_chrom ), it is possible to run full comparisons using the information generated by a previsualization comparison with CHROMEISTER. All files were produced with parameters dimension: 1000, length: 200, similarity: 40 and word size: 32 (unless specified otherwise) and are available here. See Figures 18 to 24 for the resulting comparisons.

Random perfect hits
This section addresses why the distribution of random hits can be ignored in the context of genomic comparisons. Assuming independent, uniformly distributed bases, the probability that a fixed k-mer matches at a given position of another sequence is (1/4)^k, and hence the expected number of random perfect hits between two sequences of lengths n and m is approximately n * m * (1/4)^k, which is negligible for the word size used here (k = 32). In the related Poisson model, the count variable represents how many times each chromosome is split: a value of "0" means that the homology of a given chromosome was either found in just one other chromosome or was not found at all, whereas a value of "1" or more indicates that the chromosome's homology was split more than once.
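As a back-of-the-envelope illustration (assuming independent, uniformly distributed bases, and using chromosome sizes of roughly 250 Mbp as an example), the expected number of random perfect hits can be computed directly:

```python
# Expected number of random perfect k-mer hits between two sequences of
# lengths n and m under the uniform i.i.d. base model: each of the
# ~n*m position pairs matches with probability (1/4)**k.
def expected_random_hits(n, m, k):
    return n * m * 0.25 ** k

# Even for two chromosome-sized sequences (~250 Mbp each), a word size
# of k = 32 leaves far less than one spurious perfect hit on average.
print(expected_random_hits(250e6, 250e6, 32))
```

This is why, with k = 32, random perfect hits can be safely disregarded: the expectation stays well below one even at whole-chromosome scale.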
List of copyright permissions for the pictures shown in Figure 3