An alignment-free method to find and visualise rearrangements between pairs of DNA sequences

Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail.

: Smash result when the reference and the target are swapped. "SF" stands for S. flexneri and "EC" for the E. coli bacterial genomes. Smash was ran using T = 1.8, k = 20 and discarding blocks smaller than 20 kb. Smash has the advantage of being fully capable of dealing with data coming directly from high throughput sequencing [1], a characteristic usually not found in methods based on alignments. As an example, Mauve [2], that is known to be one of the most flexible methods, was unable to create a valid map for permutations of blocks as large as 1Mb and even unable to provide any output at all (due to several crashes).
In addition to this, the running times are essentially proportional to the number of matches. Thus, when working with Smash it is irrelevant if the reference sequence is Figure 3: Smash result when the reference and the target are swapped. "HS" denotes the H. sapiens whereas "PT" indicates the P. troglodytes eukaryotic genomes. Smash was ran using T = 1.3, k = 20 and discarding blocks smaller than 1 Mb. Figure 4: Comparison between Smash, Mauve and the VISTA methods on a synthetic sequence (4,000 bases) with a ground truth. Smash was ran with T = 1.5, k = 10 and discarding blocks smaller than 5 bases. VISTA was computed online, for both synteny and alignments.
shuffled (as in a NGS output). In the case of alignment methods this can be a problem (regarding running time and peak memory), even when running in computer clusters.
We have included two synthetic sequences of 4 kb each and made a comparative study Figure 5: Comparison between Smash, Mauve and the VISTA methods on a more complex synthetic sequence (4,000 bases). Smash ran with T = 1.5, k = 10 and discarding blocks smaller than 5 bases. VISTA was computed online (syntheny) and the maps were adapted (colors instead of pointer lines) for better display. First sub-image depicts only the VISTA alignments. Figure 6: Comparison between Smash and Mauve methods on S. flexneri and E. coli bacterial genomes. Smash was ran using T = 1.8, k = 20 and discarding blocks smaller than 20 kb. with respect to other techniques: Mauve [2], and VISTA [3]. The synthetic sequences of Supplementary Figs. 4 and 5 were simulated using [4] and randomly permuted using different block sizes and inversions. The first one ( Supplementary Fig. 4) contains a small number of block permutations and inversions to allow an easy comparison with the provided ground truth. The second one contains overlaping permutations and, therefore, is more complex.
Supplementary Fig. 6 addresses a large-scale comparative study between Shigella flexneri (NC 017328) and Escherichia coli (NC 017638), using Smash and Mauve.
Regarding the eukaryotic example, we have used chromosome 3 of human and orangutan, depicted in Supplementary Fig. 7. Smash spent 871 seconds (for the inside Smash map) and 1291 seconds (for the outside Smash map) using a maximum memory peak of 2.4 GB, while Mauve spent 2633 seconds and used 6.1 GB of memory. The Mauve version used is already the improved (latest) version.
All the experimental results have been computed on a common laptop (running Linux Ubuntu 12.04 LTS with the following hardware characteristics: 4 Intel Core i7-3520M CPU at 2.90GHz, 8 GB of RAM and a SSD of 243.6 GB). Nevertheless, the RAM memory usage never surpassed 3.1 GB, even for the larger chromosomes.
For the human/chimp maps, using k = 20 (the default value) and T = 1.3, Smash needed 212 minutes. We recall that by using properly estimated parameters the elapsed time might substantially decrease.

Translocation detection example
In Supplementary Figs. 8 and 9 we show how that Smash detects a well known translocation between human and gorilla chromosomes 5 and 17 [7]. Scripts that allow reproducing these results are included in the package containing the source code of Smash.
Specifically, in Supplementary Fig. 8 we present the detection results provided by Smash both for the case where the whole concatenated genome of the gorilla is used as reference (top) as well as when used as target (bottom). In the first case (top map), the ≈ 3 GB of reference required ≈ 22 GB of memory to run. We can clearly identify the region with homology to the human chromosome splited in two major blocks, placed in the regions corresponding to the positions of chromosomes 5 and 17 of gorilla (the precise locations can be obtained from the files produced by Smash). The bottom map shows similar results, but at the cost of considerable less resources (just 2 GB of memory and 115 minutes of computing time), because in this case the shortest sequence played the role of reference sequence. Supplementary Fig. 9 provides more detail regarding this translocation, because in this case only chromosomes 5 and 17 of both species were concatenated and hence considered. Figure 8: Detection of a translocation between the whole genome of gorilla, GG (all chromosomes concatenated), and human chromosome 5, HS C5. Smash was ran using the default parameters, twice. In the first case, GG was used as the reference and HS C5 as the target (it required ≈ 22 GB of memory and ≈ 224 minutes), while in the second case HS C5 was used as reference and GG as the target (requiring ≈ 2 GB of memory and ≈ 115 minutes to run).

Additional notes
Smash can be executed in different ways. Suppose that we are interested in comparing the sequences a 1 , a 2 , a 3 with b 1 , b 2 , b 3 , b 4 . Obviously, the tool can be run on each of the (a i , b j ) pairs. Alternatively, we could form a new sequence a by concatenating all the a i , another sequence b by concatenating all the b i , and then run the tool (just once) on the (longer) a and b sequences. There are many other possibilities. For example, we could apply the tool to each of the pairs (a i , b). Or, swapping a and b, one could focus instead on the pairs (b i , a).
Regardless of how one decides to organise the analysis, the tool will detect the relations between the given sequences and will generate appropriate log files that identify with precision the type of relation, the file offsets etc.
The colors used by the tool are visual clues. They are helpful to visualise and understand the nature of the rearrangements, but for very precise results (offsets etc.) one would have to refer to the log files. The results, modulo border effects at the edges of the sequences, will be essentially the same -regardless of the approach used.
Working with concatenated sequences such as (a, b) can be helpful for exploratory analysis, when the nature of the relations is totally unknown. The results of this "large scale analysis" may expose interesting relations between parts of a and b, which can then be analysed in detail by checking specific pairs (a i , b j ).
In our experience, after acquiring some experience with the tool, the user will have no problem in organising the analysis. The best approach depends on the main goal: whether to check for precise relations within two relatively small sequences, between one or more small sequences and a large (possibly concatenated) one, or between two concatenated sequences.

Installation
We provide a binary for each 64 bits operating systems (Linux, OSX, Windows). However, for other purposes, such as source code compilation, use the following installation instructions. Cmake is needed for installation (http://www.cmake.org/), although for Linux there is a possibility to skip its usage. You can download it directly from http://www.cmake.org/cmake/resources/software.html or use an appropriate packet manager. In the following instructions we show the procedure to install, compile and create the information map between human and orangutan chromosome 20:

Step 1
Download, install Smash and resolve conflicts.

Linux:
sudo apt−g e t i n s t a l l cmake wget h t t p s : / / g i t h u b . com/ p r a t a s / smash / a r c h i v e / master . z i p u n z i p master . z i p cd smash−master cmake . make Alternatively, you can install (without cmake and only for Linux) using

Windows:
In windows use cygwin (https://www.cygwin.com/) and make sure that it is included in the installation: cmake, make, zcat, unzip, wget, tr, grep (and any dependencies). If you install the complete cygwin packet then all these will be installed. After, all steps will be the same as in Linux.

Parameters
The Smash program has many options in the interface because there is a wide variety of parameters that can be defined by the user. However, for the detection of the arrangements only two are critical, namely context and threshold. Additional information about these parameters can be found in the paper. By default, Smash has many parameters assigned in order to avoid the estimation, enabling only to set both reference and target files. However, these defaults are only partially estimated to detect rearrangements in primates. Therefore, for other purposes you might need to adjust context and threshold parameters. Moreover, for custom image maps you might also need to set other parameters, such as width.

Options
Parameters Meaning -h It will print the parameters menu (help menu) -V It will print the Smash version number, license type and authors.
-v It will print progress information such as positions of the patterns, times, etc. -f It will force to write over files already created. -c <context> Size of the FCM (Markov) context order (interval [1;28]). Contexts above 14 will be handled with a hash- The positions from all the rearrangements detected on the run (see subsection "Some considerations" for more information).

-o <outFile>
The output SVG image filename. The default uses the concatenation of reference with the target filenames (adding the "svg" extension).
Beware: if the files are not in the working directory this might have problems due to several types of characters (such as '/').
The reference filename.
The target filename. 9

Some considerations
The coordinates of the different blocks, as well as the direction and the respective block code, are registered in the ".pos" file, separated by tab symbols, with the following order and meaning: • type: reference or target; • block code (to link reference with target); • initial position of the block; • final position of the block; • direction of the block (inverted or regular); indicates two similar regions. The first (code:1), of regular type, has similar information in the first 39612 bases of both sequences. The second one (code:2), inverted, shares information, respectively, from position 2287593 to 2366817 and from 2267787 to 2307399, between both sequences. Nevertheless, to read only the target positions, there is allways the possibility of using the output of the program with the "-v" mode (verbose). In this case, it outputs several types of information including, for example, the following: Running IR p a t t e r n 3 with s i z e 13881450 E x t r a c t i n g p a t t e r n [ 2 0 8 0 8 7 5 0 ; 3 4 6 9 0 2 0 0 ] This example shows that there is an inverted region with size 13881450 (with starting position 20808750 and final position 34690200) that is valid, which means that its size is larger or equal than the minimum region (-m flag) that we want to consider.