An alignment-free method to find and visualise rearrangements between pairs of DNA sequences

Pratas, Diogo; Silva, Raquel M.; Pinho, Armando J.; Ferreira, Paulo J.S.G.

doi:10.1038/srep10203

Published: 18 May 2015

Scientific Reports volume 5, Article number: 10203 (2015)

Abstract

The evolution of species is indirectly registered in their genomic structure. The emergence of sequencing technology, and its subsequent advances, provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail.

Introduction

Structural genomic rearrangements are a major source of intra- and inter-species variation. Chromosomal inversions, translocations, fissions and fusions are part of the naturally occurring genetic diversity of individuals, are selectable and can confer environment-dependent advantages¹. Chromosome rearrangements are also associated with disease, namely developmental disorders and cancer. For example, many leukaemia patients present a reciprocal translocation between chromosomes 9 and 22, which gives rise to the Philadelphia chromosome. This produces BCR-ABL fusion proteins that are constitutively active tyrosine kinases, contributing to tumour growth and proliferation². Another striking example is the human inversion polymorphism in the 17q21 region, which contains the neurodegenerative disorder-associated gene MAPT (microtubule associated protein Tau). The directly oriented H1 haplotype is common and is associated with an increased risk of Alzheimer’s and Parkinson’s disease, while the inverted H2 haplotype has higher frequencies in Southwest Asian and Southern European populations, particularly around the Mediterranean³. Recurrent inversions are found in the primate lineage, where the H2 haplotype is the ancestral state, and recent work indicates that Neanderthals and Denisovans also carried the H1 allele⁵.

How changes in genome architecture contribute to speciation, and which macroevolutionary events occurred through time, are questions fundamental to understanding the dynamics of chromosome evolution and hence the origins of species. In addition, chromosome alterations are hallmarks of cancer genomes with diagnostic and prognostic value⁶ and are also used in prenatal and postnatal clinical settings. Several insights into chromosome structure and evolution have traditionally been achieved by cytogenetic procedures such as G-banding, or molecular karyotyping approaches like fluorescence in situ hybridisation (FISH) and, more recently, array-based methods⁷. However, in some groups, such as the great apes, access to samples is often difficult, e.g. for ethical reasons. Also, these approaches can be time-consuming, expensive, or lack resolution, as opposed to computational solutions⁸.

The advent of sequencing technology enabled the analysis of genomic sequences at nucleotide resolution. Nowadays, next-generation sequencing is bringing a substantial increase in the speed, quality and reliability of the results at much lower cost, although there is still room for improvement. The availability of sequenced genomes boosted computational methods into a new era, allowing some expensive and/or lengthy wet lab processes to be complemented by computational approaches⁹.

Scientific insights derived from genomic sequences, including the conserved distribution of genes on the chromosomes of different species (synteny), have been mostly explored using sequence alignments^{10,11,12,13,14,15,16,17,18,19}, while for visualisation a wide variety of strategies have been proposed^{20,21,22,23,24}. Specifically, at a macro level the most popular are Mauve¹³, Cinteny²⁵, Apollo²⁴, MEDEA (http://www.broadinstitute.org/annotation/medea), MizBee²⁶ and Circos²⁷, which are discussed in a recent review²⁸. Although circle-based visualisation is becoming very popular, an ideogram still seems to be the best approach for detecting block alignments and rearrangements across very similar species, such as primates.

We propose a computational method to detect signatures of chromosome evolution. The method is completely alignment-free and is based on the information content of the sequences being compared. The information content itself is estimated using data compression techniques. The resulting stand-alone algorithm depends only on two parameters.

We developed a tool by means of which the proposed method can be tested in practice. The tool has been made publicly available and is described in detail. It is capable of producing an SVG image that shows the correspondence of regions between two sequences. Its performance is demonstrated with the help of several examples. Those involving synthetic sequences are intended to illustrate the underlying principles. More realistic case studies, involving prokaryotic and eukaryotic genomes, are also discussed. In particular, we obtain human/chimpanzee and human/orangutan chromosome maps.

For clarity, the potential and limitations of the tool and some of its design tradeoffs are discussed separately, following the description of the method. This separates limitations that are inherent to the method from those that are by-products of the current implementation and that as such might be removed in future implementations.

Method

Creating models of the data

The immediate goal of a data compression method is to describe data as compactly as possible. The usefulness of data compression as a tool to find structure in data is perhaps less well known^{29,30}.

Nevertheless, this ability is a direct consequence of how data compression works. Compression methods usually rely on statistical data models that estimate the probability of the data symbols along the sequence. Better (i.e., more accurate) statistical models tend to lead to better compressors (i.e., higher compression ratios).

Ultimately, the size of the compressed data can be seen as an estimate of the Kolmogorov (algorithmic) complexity of the original data, a fundamental yet noncomputable complexity measure closely related to information theory³¹.
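As a rough illustration of compressed size acting as a complexity estimate, the sketch below uses the general-purpose zlib compressor from the Python standard library as a stand-in. It is not the DNA-specific modelling described in this paper, and the helper name `compressed_bits_per_symbol` and the two test sequences are assumptions made only for the example.

```python
import random
import zlib


def compressed_bits_per_symbol(sequence: str) -> float:
    """Return the zlib-compressed size of `sequence` in bits per symbol.

    zlib is only a convenient stand-in for a DNA-specific compressor.
    """
    raw = sequence.encode("ascii")
    return 8 * len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)  # fixed seed so that the run is repeatable

# A highly repetitive sequence (low information content).
repetitive = "ACGT" * 25_000
# A pseudo-random sequence over the same alphabet (high information content).
random_like = "".join(rng.choice("ACGT") for _ in range(100_000))

print(f"repetitive : {compressed_bits_per_symbol(repetitive):.3f} bits/symbol")
print(f"random-like: {compressed_bits_per_symbol(random_like):.3f} bits/symbol")
```

The repetitive sequence compresses to a small fraction of a bit per symbol, whereas the pseudo-random one stays close to the 2 bits per symbol expected for four equiprobable symbols.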

Genomic data compression, now more than twenty years old^{32,33,34,35,36,37,38,39,40,41,42,43,44}, has been the subject of recent review articles^{45,46,47}. Typically, the compression methods rely on a combination of models that explore the redundancy found in DNA sequences, usually with models developed to handle high information content (i.e., hard to compress) regions and distinct models to handle low information content (i.e., very compressible) regions.

The method proposed in this paper identifies small-scale or large-scale rearrangements between pairs of sequences called the reference and the target. The method applies to arbitrary sequences and therefore the reference and the target can be as large as an entire chromosome or genome. The goal of the method is to automatically detect regions in the target sequence that have information content similar to regions found in the reference. The method yields a set of segments of the target sequence and, for each of these, the corresponding segment found in the reference sequence.

Both sequences are preprocessed such that their alphabet is $\Theta = \{A, C, G, T\}$. Symbols originally not belonging to $\Theta$ (for example, N’s) are substituted by uniformly distributed symbols from $\Theta$, in order to keep the original length of the sequence. These randomly generated segments are high information content regions and, therefore, will not share information with any other sequence; hence they will not interfere with the matching process.
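The preprocessing step can be sketched as follows. The function name `preprocess`, the fixed seed, and the decision to upper-case the input first (so that lower-case bases are kept rather than randomised) are assumptions of this example, not details taken from the tool.

```python
import random

ALPHABET = "ACGT"


def preprocess(sequence: str, seed: int = 0) -> str:
    """Map a raw sequence onto the alphabet {A, C, G, T}.

    Symbols outside the alphabet (for example N) are replaced by symbols
    drawn uniformly at random, so the length of the sequence is preserved.
    Upper-casing the input first is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return "".join(
        symbol if symbol in ALPHABET else rng.choice(ALPHABET)
        for symbol in sequence.upper()
    )


raw = "ACGTNNNNacgtRY"
clean = preprocess(raw)
print(len(raw), len(clean))  # both lengths are equal
```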

The core of the method involves the estimation of the amount of conditional information that is required to represent a certain region of the target, using exclusively information from the reference. Basically, if $x$ and $y$ are, respectively, the target and reference sequences, we compute a numerical sequence $I(x_i \mid y)$, where $1 \le i \le |x|$ and $|x|$ is the size of the target sequence. For a position $i$ in the target sequence, $I(x_i \mid y)$ measures the number of bits required to represent the symbol located in that position, according to the aforementioned interpretation of conditional information.

To properly estimate $I(x_i \mid y)$, it is crucial to have a good model of the reference sequence $y$. We have chosen finite-context models (FCMs) for this purpose. FCMs are probabilistic models based on the assumption that the information source is Markovian, i.e., that the probability of the next outcome depends only on some finite number of (recent) past outcomes referred to as the context.

The estimated probability distribution at position $i$, $P(s \mid c_i)$ with $s \in \{A, C, G, T\}$, according to the order-$k$ context $c_i = x_{i-k} \dots x_{i-1}$, is calculated with the symbol counts previously computed on the reference sequence $y$, using the estimator

$$P(s \mid c_i) = \frac{N_y(s \mid c_i) + \alpha}{N_y(c_i) + 4\alpha},$$

where $N_y(s \mid c_i)$ represents the number of times that symbol $s$ was found in sequence $y$ having $c_i$ as context and where

$$N_y(c_i) = \sum_{a \in \{A, C, G, T\}} N_y(a \mid c_i)$$

is the total number of events that occurred in $y$ in association with context $c_i$. The parameter $\alpha$ is set to 0.001, forcing the estimator to behave approximately as a maximum likelihood estimator. In practice, this makes the segmentation process easier (see below). The number of bits that is required to represent symbol $x_i$ using exclusively information from the reference sequence is given by

$$I(x_i \mid y) = -\log_2 P(x_i \mid c_i).$$
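A minimal sketch of this estimation step is given below, assuming a four-symbol alphabet and a plain dictionary of counts. The function names `build_counts` and `information_profile` are hypothetical, the first k target symbols are simply skipped because they lack a full context, and a dictionary of this kind is only practical for small k.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def build_counts(reference: str, k: int) -> dict:
    """Count, for each order-k context in `reference`, how often each symbol follows it."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    for i in range(k, len(reference)):
        counts[reference[i - k:i]][reference[i]] += 1
    return counts


def information_profile(target: str, counts: dict, k: int, alpha: float = 0.001) -> list:
    """Bits needed for each target symbol, using only counts learned from the reference.

    Positions 0..k-1 have no complete context and are skipped (an assumption
    of this sketch), so the result has len(target) - k entries.
    """
    profile = []
    for i in range(k, len(target)):
        symbol_counts = counts.get(target[i - k:i])  # None if the context was never seen
        n_symbol = symbol_counts[target[i]] if symbol_counts else 0
        n_total = sum(symbol_counts.values()) if symbol_counts else 0
        probability = (n_symbol + alpha) / (n_total + alpha * len(ALPHABET))
        profile.append(-math.log2(probability))
    return profile


reference = "ACGT" * 50
counts = build_counts(reference, k=3)
print(information_profile("ACGTACGT", counts, k=3))  # shared content: close to 0 bits
print(information_profile("ACGA", counts, k=3))      # unexpected symbol: many bits
```

With a small alpha, symbols that agree with the reference cost almost nothing, symbols that contradict a known context are heavily penalised, and a context never seen in the reference yields the uniform 2 bits per symbol.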

Finding information-similar regions

As explained before, the core idea of the method is to compute, along the target sequence $x$, the amount of information required to represent $x$ using exclusively information from the reference sequence $y$. Therefore, at a first stage, we end up with a numerical information sequence of size $|x|$. Fig. 1 illustrates how the method operates, using synthetic data generated with an appropriate tool⁴⁸. The target was created by manipulating some parts of the reference, as described in the figure. Additional examples are provided in the Supplementary Material file.

Regions where the information sequence $I(x_i \mid y)$ is small indicate a high level of information sharing with $y$. To mark them, we compare a smoothed version of the information sequence with a threshold ($T$). The result is the set of regions of interest of $x$, for the given reference $y$, which are denoted by $R_1, \dots, R_m$.

It remains to find the regions of the reference $y$ which are strongly associated with each $R_j$. To do this we invert the roles of the reference and the target. More precisely, each $R_j$ is now regarded as a reference and $y$ is taken as the target. We thus compute, for each $R_j$, the information sequences $I(y_i \mid R_j)$, from which the regions of $y$ associated with each $R_j$ can be found.
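The thresholding step can be sketched as below. A centred moving average is used here as a stand-in smoothing filter, since this section does not specify the filter; the names `smooth` and `low_information_regions`, and the `window` and `min_length` parameters, are assumptions of the example.

```python
def smooth(profile, window):
    """Centred moving average, computed with a prefix sum (stand-in filter)."""
    prefix = [0.0]
    for value in profile:
        prefix.append(prefix[-1] + value)
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        lo = max(0, i - half)
        hi = min(len(profile), i + half + 1)
        smoothed.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return smoothed


def low_information_regions(profile, threshold, window=1, min_length=1):
    """Return half-open (start, end) intervals where the smoothed profile is below the threshold.

    Intervals shorter than `min_length` are discarded.
    """
    smoothed = smooth(profile, window)
    regions = []
    start = None
    for i, value in enumerate(smoothed):
        if value < threshold and start is None:
            start = i  # a low-information region opens here
        elif value >= threshold and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    if start is not None and len(smoothed) - start >= min_length:
        regions.append((start, len(smoothed)))  # region runs to the end
    return regions


profile = [2.0] * 5 + [0.1] * 10 + [2.0] * 5
print(low_information_regions(profile, threshold=1.0))
```

Running the same function on the information sequence obtained with the roles of reference and target swapped gives the matching regions in the reference.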

The described procedure can find pairs of regions that are similar in the sense of information-sharing, but does not take into account possible inversions. For this purpose, the reference sequence should be reversed and complemented before being loaded into the FCM. Then steps entirely similar to those described above need to be taken. Having done this, both inversions and direct homologies can be segmented in the target sequence.
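For a sequence already restricted to the symbols A, C, G and T, the transformation applied to the reference before the inverted pass can be sketched as follows; the function name is an assumption.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the order of the sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
```

Applying the function twice returns the original sequence, which is a convenient sanity check.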

If both the inverted and direct instances of a region are found to have high information content, then the region shares no information with the rest of the data and therefore it is left unmarked. This is the case with regions that are essentially unique and with unsequenced regions (those that originally contained N’s, that have been replaced with random data).

The tool

Availability

An implementation of the method (Smash) is freely available, under GPL-2 license, at http://bioinformatics.ua.pt/software/smash. Smash is a tool that computes chromosome information maps, with an ideogram output architecture. The colours for each block are automatically calculated using the HSV (Hue, Saturation, Value) colour space, where only the Hue varies. For more information about Smash, see the Supplementary Material, Section “The Smash tool”.
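One way to obtain block colours in which only the Hue varies is sketched below with the `colorsys` module of the Python standard library. The even spacing of hues and the fixed saturation and value constants are assumptions of this example; the text above only states that the Hue component is the one that changes.

```python
import colorsys


def block_colours(n_blocks: int, saturation: float = 0.85, value: float = 0.9) -> list:
    """Return one hex colour per block, varying only the hue.

    Hues are spread evenly over [0, 1); saturation and value stay fixed.
    """
    colours = []
    for j in range(n_blocks):
        red, green, blue = colorsys.hsv_to_rgb(j / n_blocks, saturation, value)
        colours.append(
            "#{:02x}{:02x}{:02x}".format(
                round(red * 255), round(green * 255), round(blue * 255)
            )
        )
    return colours


print(block_colours(6))
```

The resulting strings can be used directly as `fill` attributes in an SVG drawing.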

The threshold T

Smash has a command-line option by means of which the threshold $T$ can be varied within a bounded interval (see the Supplementary Material). The threshold can be regarded as a parameter of the method. In general, the best $T$ is data-dependent. The guiding principle is to choose $T$ so that it selects regions of complexity sufficiently below the average. In practice, this is not difficult to achieve, but some experimentation may be required to obtain the best results.

As a rule, $T$ should be smaller when working with similar species than when working with more distant species. For example, we used a smaller $T$ for the human/chimpanzee pair than for the chicken/turkey pair. When working with entire chromosomes, the threshold can be adjusted to match the degree of divergence encountered.

Model depth

The model depth, described by the parameter $k$, must be an integer in the range [1,28] (as described in the Subsection “Parameters, Options”, option -c). The default value works well for sequences longer than about 1 Mb (1,000,000 symbols). The default also works well for smaller sequences, although in this case the actual performance may depend on how repetitive they are. We have found that there is often little practical need to tune $k$.

The relation between the model depth $k$ and the estimated probabilities (which are directly related to the symbol counts), as well as the capabilities of Markov models in the context of DNA sequence modelling, have been treated in detail elsewhere⁴⁴.

Commutativity

The proposed method is fully commutative, that is, it has the potential to lead to the same results when the reference and the target are swapped. Smash can easily be made commutative as well. However, in most usage scenarios, there is a natural reference sequence. Furthermore, the assumption that one of the two sequences is the reference simplifies the algorithm and leads to time savings. For these two reasons, the current implementation of Smash is approximately commutative, but not exactly so.

To illustrate this, we performed additional experiments using both prokaryotic and eukaryotic genomes. For the prokaryotes, we have used Shigella flexneri (NC_017328) and Escherichia coli (NC_017638). As can be seen in Supplementary Fig. 2, the maps are very similar (apart from some differences in colour and reversed pattern assignment, due to the automatic colouring method used). Nevertheless, it is possible to spot small differences, mainly because we have discarded matched regions smaller than 20 kb. Supplementary Fig. 3, which illustrates the human/chimp pair, shows that at a larger scale these small differences tend to disappear.

Working with distant genomes

Smash does work for more distant genomes than, say, the human/chimpanzee pair studied in detail next. This is shown e.g. by the chicken/turkey map of chromosome 1 included as Supplementary Fig. 1. According to TimeTree⁶², Gallus gallus and Meleagris gallopavo have an estimated divergence time of 44.6 million years (MY), while between Homo sapiens and Pan troglodytes or Pongo abelii the divergence times are estimated as 6.3 MY and 15.7 MY, respectively.

We emphasise, however, that Smash can be applied to pairs of sequences that are even more distant. Regardless of the exact nature of the reference and target, Smash will find the rearrangements present, even if one or both sequences are synthetic (computer generated). This can be useful to develop a better understanding of how Smash works, or for testing purposes. Examples are presented in Supplementary Figs. 4 and 5, where synthetic sequences containing different rearrangements were processed with Smash. For comparison purposes, the output of widely used tools such as Mauve¹³ and VISTA¹⁵ is also provided. In Supplementary Figs. 6 and 7, the methods are compared in real prokaryotic and eukaryotic sequences, respectively.

Working with unassembled sequences or assembling errors

One of the advantages of Smash is that it works even when the reference is not assembled. Therefore, it can be used with references composed of non-assembled reads obtained directly from the NGS sequencers. In fact, although next-generation sequencing made low cost high speed sequencing possible, it also decreased the size of sequencing reads⁶¹. On the other hand, most of the primate assembled sequences use the human genome as a reference. This might be problematic, because of the assumption that humans and the other primates exhibit a high degree of homology, which might not always be true⁵³. Hence, it might be important to measure similarity against non-aligned references.

Figure 2 depicts the results of Smash over chromosome 18 of human and chimp using random permutations of blocks of different sizes, showing its robustness when fragmented references are used. Smash took less than 8 minutes for each computation.

Smash is able to identify regions containing shared information even when one of the sequences is block-permuted, a capability that may be of interest to measure sequence similarity, e.g. when one of the sequences is not assembled, or when there are assembly errors. Obviously, the identification of the precise genomic rearrangements that took place will have to be deferred until final assembly takes place.
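A block-permuted sequence of the kind used in this experiment can be produced with a few lines of code. The sketch below is only an assumption about how such an input might be generated; the function name `block_permute`, the fixed seed and the handling of the final, possibly shorter, block are choices of the example.

```python
import random


def block_permute(sequence: str, block_size: int, seed: int = 0) -> str:
    """Cut `sequence` into consecutive blocks and concatenate them in random order.

    The last block may be shorter than `block_size`. The content of each
    block is untouched; only the order of the blocks changes.
    """
    blocks = [
        sequence[i:i + block_size]
        for i in range(0, len(sequence), block_size)
    ]
    random.Random(seed).shuffle(blocks)
    return "".join(blocks)


original = "ACGTTGCAAC" * 5 + "GGG"
shuffled = block_permute(original, block_size=10)
print(len(original) == len(shuffled))
```

Because every block survives intact, an order-k model with k much smaller than the block size still sees almost all of the original contexts, which is consistent with the robustness reported above.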

Results and Discussion

To illustrate the potential of the proposed method, we show the complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. Additional examples can be found in the Supplementary Material. The Homo sapiens, Pan troglodytes and Pongo abelii reference assembled chromosomes were downloaded from the NCBI. In order to create the human-chimpanzee map, we concatenated chromosomes 2A and 2B of the chimpanzee, ran Smash once per chromosome (totalling 23 runs), manually corrected the associated picture regarding the hypothetical centromere between 2A and 2B, and finally grouped all the maps into one global picture (shown in Fig. 3). A similar process was followed for the human/orangutan map, shown in Fig. 4. The results obtained confirm and extend previous work based on orthologous gene distribution, array comparative genomic hybridisation (array CGH) and FISH approaches^{49,50,51}.

Figure 3 shows the complete information maps between human and chimpanzee genomes, using chromosome pairwise comparisons, which are characterised by several inversions, in chromosomes 1, 4, 5, 7, 12, 15, 17, 18 and Y. All known pericentric inversions were detected by our method with the exception of inversions in chromosomes 9 and 16 that are located in regions with limited available sequence information⁵². The structural rearrangements observed in the chimpanzee Y chromosome agree with previous reports⁵³, where variable copy number and position of Y-specific genes were found among chimpanzees (Pan troglodytes) but not among bonobo (P. paniscus), gorilla (Gorilla gorilla gorilla and G. beringei graueri) or orangutan (Pongo pygmaeus and P. abelii) lineages⁵⁴. In addition, we identify inversions in chromosome 7 (Fig. 5) that were only partially described before⁵⁰. Despite their importance, inversions are traditionally difficult to detect and new experimental approaches have been recently developed to improve the available tools⁵⁵. The two chromosome 7 inversions are located in 7p14.1 and 7q11.23, around the GLI3 and ELN genes, respectively, and both are associated with human disorders. Namely, the Greig cephalopolysyndactyly syndrome is caused by mutations, deletions or rearrangements in the region containing the GLI3 transcription factor that affect the development of the limbs, head and face, and is characterised by the presence of extra fingers or toes⁵⁶. The Williams-Beuren syndrome (WBS) is a neurodevelopmental disease with distinctive facial and behavioural features, as well as several degrees of intellectual disability, caused by deletions of genes including ELN⁵⁷. Curiously, inversion polymorphisms are present in a significant proportion of parents of WBS patients^{57,60}, which is also observed in the 17q21.31 region⁵⁹, suggesting that structural variants enhance some microdeletion syndromes. Given the structural differences observed in these chromosomal regions, one might speculate that they have contributed to evolutionary innovation and the emergence of lineage-specific phenotypes.

Figure 4 depicts the complete information maps between human and orangutan. It shows that orangutan chromosome 1 is in the opposite direction as compared with human. Moreover, there are large inversions in chromosomes 2, 3, 4, 7, 8, 9, 10, 11, 16, 17, 18 and 20. Although there are fewer data available, the results are consistent with previous cytogenetic approaches that identified new rearrangements on the orangutan genome, specifically, a pericentric inversion on chromosome 1, complex rearrangements on chromosome 2 and a subtelomeric deletion on chromosome 19⁶⁰. Also, recent evidence suggests that the orangutan genome maintains the ancestral chromosomal state with observable differences in most chromosomes when compared with humans, including chromosomes 1, 2, 3, 7, 10, 11 and 18⁴⁹.

The method and the implementation described here allow the detection of large-scale and small-scale genomic rearrangements, including balanced translocations and inversions that are not detected by array-CGH, as well as chromosome alterations that are below the limits of microscopy, thus extending the possibilities of genome-wide structure characterisation with a single tool.

In Supplementary Figs. 8 and 9 we provide an example of a translocation between chromosomes 5 and 17 of human and gorilla. As can be seen, after concatenating the sequences, Smash was able to detect a well-known translocation that is regarded as one of the foundations of gorilla speciation⁶³.

Smash compares pairs of sequences. These pairs can be built using single chromosomes, as shown in Figs. 3 and 4, or sets of chromosomes concatenated in a single sequence, as in the example of the translocation shown in Supplementary Figs. 8 and 9. In either case, Smash looks for and reports the position of regions that are similar, from the point of view of information content. Hence, in the examples provided in Figs. 3 and 4, only the regions that are similar in each pair of chromosomes are reported. To have a full view, it would be required either to run Smash on each possible pair of chromosomes (i.e., all possible pairs formed between the set of human chromosomes and the set of chimpanzee chromosomes) or to concatenate the whole genome of each species into a single sequence. Naturally, when very large sequences are involved (for example, entire genomes concatenated), the visualisation granularity is reduced and the computational resource requirements increase. A more detailed discussion can be found in Section 2 of the Supplementary Material.
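The first of the two options, running every possible pair, amounts to a Cartesian product of the two chromosome sets. The sketch below only enumerates the pairs; the chromosome labels are illustrative names rather than file names used by the tool, and `all_chromosome_pairs` is a hypothetical helper.

```python
from itertools import product


def all_chromosome_pairs(reference_chromosomes, target_chromosomes):
    """Return every (reference, target) chromosome pair as a list of tuples."""
    return list(product(reference_chromosomes, target_chromosomes))


# Illustrative labels only: 22 autosomes plus X and Y for human,
# and chromosomes 2A and 2B kept separate for chimpanzee.
human = [str(n) for n in range(1, 23)] + ["X", "Y"]
chimpanzee = ["1", "2A", "2B"] + [str(n) for n in range(3, 23)] + ["X", "Y"]

pairs = all_chromosome_pairs(human, chimpanzee)
print(len(pairs), "pairwise runs would be needed for a full view")
```

The number of runs grows with the product of the two chromosome counts, which is the cost that concatenating each genome into a single sequence trades for reduced visual granularity.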

Conclusion

Chromosome rearrangements can drive adaptation and evolution of novel traits, but they can be deleterious as well. Here, we show that compression-based models are remarkably capable of detecting signatures of genomic chromosomal evolution, namely to determine how information flows between sequences. The method is alignment-free and universal, in the sense that it can accept any input pair of genomic sequences and depends only on two parameters.

A tool that implements the method has been made available for download. General guidelines have been given on how to select the values of its two parameters, which do not affect its performance in an overly sensitive way. Its advantages and limitations have been discussed.

The tool and the ideas that underlie its design may lead to new insights about important genomic questions, since it allows blind unsupervised detection of rearrangements and similarities between genomic sequences. An obvious example is the detection of evolutionary patterns across species, as demonstrated in the examples, but the tool has similar potential for diagnosis and genetic counselling. The detection of rearrangements in cancer genomes at high resolution levels is also considered important, in connection with risk stratification and personalised therapeutics.

Additional Information

How to cite this article: Pratas, D. et al. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci. Rep. 5, 10203; doi: 10.1038/srep10203 (2015).

References

Avelar, A., Perfeito, L., Gordo, I. & Ferreira, M. Genome architecture is a selectable trait that can be maintained by antagonistic pleiotropy. Nat. Commun. 4, 10.1038/ncomms3235 (2013).
Lee, H., Thompson, J., Wang, E. & Wetzler, M. Philadelphia chromosome-positive acute lymphoblastic leukemia. Cancer 117, 1583–1594 (2011).
PubMed Google Scholar
Zody, M. et al. Evolutionary toggling of the MAPT 17q21. 31 inversion region. Nat. Genet. 40, 1076–1083 (2008).
CAS PubMed PubMed Central Google Scholar
Donnelly, M. et al. The distribution and most recent common ancestor of the 17q21 inversion in humans. Am. J. Hum. Gen. 86, 161–171 (2010).
CAS Google Scholar
Setó-Salvia, N. et al. Using the neanderthal and denisova genetic data to understand the common MAPT 17q21 inversion in modern humans. Hum. Biol. 84, 1 (2013).
Google Scholar
Meyerso, M., Gabriel, S. & Getz, G. Advances in understanding cancer genomes through second-generation sequencing. Nat. Rev. Genet. 11, 685–696 (2010).
Google Scholar
Das, K. & Tan, P. Molecular cytogenetics: recent developments and applications in cancer. Clin. Genet. 84, 315–325 (2013).
CAS PubMed Google Scholar
Wang, T. et al. Digital karyotyping. Proc. Natl. Acad. Sci. USA 99, 16156–16161 (2002).
ADS CAS PubMed PubMed Central Google Scholar
Kircher, M. Analysis of high-throughput ancient DNA sequencing data. Methods Mol. Biol. 840, 197–228 (2012).
CAS PubMed Google Scholar
Brudno, M. et al. Glocal alignment: finding rearrangements during alignment. Bioinformatics 19, i54–i62 (2003).
PubMed Google Scholar
Schwartz, S. et al. Human-mouse alignments with blastz. Genome. Res. 13, 103–107 (2003).
CAS PubMed PubMed Central Google Scholar
Dewey, C. N. Aligning multiple whole genomes with mercator and mavid. In Comparative genomics. 221–235 Springer 2008).
Darling, A. E., Mau, B. & Perna, N. T. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLOS ONE 5, e11147 (2010).
ADS PubMed PubMed Central Google Scholar
Dubchak, I., Poliakov, A., Kislyuk, A. & Brudno, M. Multiple whole-genome alignments without a reference organism. Genome. Res. 19, 682–689 (2009).
CAS PubMed PubMed Central Google Scholar
Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M. & Dubchak, I. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32, W273–W279 (2004).
CAS PubMed PubMed Central Google Scholar
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm and yeast genomes. Genome. Res. 15, 1034–1050 (2005).
CAS PubMed PubMed Central Google Scholar
Karolchik, D. et al. Comparative genomic analysis using the ucsc genome browser. In Comparative Genomics, 17–33 Springer- 2008).
Prabhakar, S. et al. Close sequence comparisons are sufficient to identify human cis-regulatory elements. Genome. Res. 16, 855–863 (2006).
CAS PubMed PubMed Central Google Scholar
Gregory, S. G. et al. A physical map of the mouse genome. Nature 418, 743–750 (2002).
ADS CAS PubMed Google Scholar
Haas, B. J., Delcher, A. L., Wortman, J. R. & Salzberg, S. L. Dagchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20, 3643–3646 (2004).
CAS PubMed Google Scholar
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome. Biol. 5, R12 (2004).
PubMed PubMed Central Google Scholar
Ohtsubo, Y., Ikeda-Ohtsubo, W., Nagata, Y. & Tsuda, M. Genomematcher: a graphical user interface for dna sequence comparison. BMC Bioinformatics 9, 376 (2008).
PubMed PubMed Central Google Scholar
Putnam, N. H. et al. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317, 86–94 (2007).
ADS CAS PubMed Google Scholar
Lewis, S. E. et al. Apollo: a sequence annotation editor. Genome. Biol. 3, 1–14 (2002).
Google Scholar
Sinha, A. & Meller, J. Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics 8, 82 (2007).
PubMed PubMed Central Google Scholar
Meyer, M., Munzner, T. & Pfister, H. Mizbee: a multiscale synteny browser. IEEE Trans. Vis. Comput. Graphics 15, 897–904 (2009).
Google Scholar
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome. Res. 19, 1639–1645 (2009).
CAS PubMed PubMed Central Google Scholar
Nielsen, C., Cantor, M., Dubchak, I., Gordon, D. & Wang, T. Visualizing genomes: techniques and challenges. Nat. Methods 7, S5–S15 (2010).
CAS PubMed Google Scholar
Dix, T. I. et al. Comparative analysis of long DNA sequences by per element information content using different contexts. BMC Bioinformatics 8, S10 (2007).
PubMed PubMed Central Google Scholar
Pinho, A. J., Garcia, S. P., Pratas, D. & Ferreira, P. J. S. G. DNA sequences at a glance. PLOS ONE 8, e79922 (2013).
ADS PubMed PubMed Central Google Scholar
Li, M. & Vitányi, P. An introduction to Kolmogorov complexity and its applications Springer 2008).
Grumbach, S. & Tahi, F. Compression of DNA sequences. In Proc. of the DCC, 340–350 Snowbird: Utah, 1993).
Rivals, E., Delahaye, J.-P., Dauchet, M. & Delgrange, O. A guaranteed compression scheme for repetitive DNA sequences. In Proc. of the DCC, 453 Snowbird: Utah, 1996).
Loewenstern, D. & Yianilos, P. N. Significantly lower entropy estimates for natural DNA sequences. In Proc. of the DCC, 151–160 Snowbird: Utah, 1997).
Matsumoto, T., Sadakane, K. & Imai, H. Biological sequence compression algorithms. In Dunker, A. K., Konagaya, A., Miyano, S. & Takagi, T. (eds.) Genome. Inform. Ser. 43–52 (Tokyo, Japan 2000).
Chen, X., Li, M., Ma, B. & Tromp, J. DNACompress: fast and effective DNA sequence compression. Bioinformatics 18, 1696–1698 (2002).
CAS PubMed Google Scholar
Manzini, G. & Rastero, M. A simple and fast DNA compressor. Software: Practice and Experience 34, 1397–1411 (2004).
Google Scholar
Korodi, G. & Tabus, I. An efficient normalized maximum likelihood algorithm for DNA sequence compression. ACM Trans. on Information Systems 23, 3–34 (2005).
Google Scholar
Behzadi, B. & Le Fessant, F. DNA compression challenge revisited. In Combinatorial Pattern Matching: Proc. of CPM-2005, vol. 3537 of LNCS, 190–200 Springer-Verlag 2005).
MATH Google Scholar
Korodi, G. & Tabus, I. Normalized maximum likelihood model of order-1 for the compression of DNA sequences. In Proc. of the DCC, 33–42 Snowbird: Utah, 2007).
Cao, M. D., Dix, T. I., Allison, L. & Mears, C. A simple statistical algorithm for biological sequence compression. In Proc. of the DCC, 43–52 Snowbird: Utah, 2007).
Zhu, Z., Zhou, J., Ji, Z. & Shi, Y. DNA sequence compression using adaptive particle swarm optimization-based memetic algorithm. IEEE Trans. Evol. Comput. 15, 643–658 (2011).
Google Scholar
Pinho, A. J., Pratas, D. & Ferreira, P. J. S. G. Bacteria DNA sequence compression using a mixture of finite-context models. In Proc. of the SSP Nice: France, 2011).
Pinho, A. J., Ferreira, P. J. S. G., Neves, A. J. R. & Bastos, C. A. C. On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS ONE 6, e21588 (2011).
Berger, B., Peng, J. & Singh, M. Computational solutions for omics data. Nat. Rev. Genet. 14, 333–346 (2013).
Deorowicz, S. & Grabowski, S. Data compression for sequencing data. Algorithms Mol. Biol. 8, 25 (2013).
Wandelt, S., Bux, M. & Leser, U. Trends in genome compression. Curr. Bioinform. 9, 315–326 (2013).
Pratas, D., Pinho, A. J. & Rodrigues, J. M. XS: a FASTQ read simulator. BMC Res. Notes 7, 40 (2014).
Hedges, S. B., Dudley, J. & Kumar, S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics 22, 2971–2972 (2006).
Tomkins, J. How genomes are sequenced and why it matters: Implications for studies in comparative genomics of humans and chimpanzees. Answers Res. Journal 4, 81–88 (2011).
Hughes, J. et al. Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature 463, 536–539 (2010).
Farré, M., Micheletti, D. & Ruiz-Herrera, A. Recombination rates and genomic shuffling in human and chimpanzee—a new twist in the chromosomal speciation theory. Mol. Biol. Evol. 30, 853–864 (2013).
Feuk, L. et al. Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLOS Genet. 1, e56 (2005).
Locke, D. et al. Large-scale variation among human and great ape genomes determined by array comparative genomic hybridization. Genome Res. 13, 347–357 (2003).
Church, D. M., Schneider, V. A. et al. Modernizing reference genome assemblies. PLOS Biol. 9, e1001091 (2011).
Greve, G. et al. Y-chromosome variation in hominids: intraspecific variation is limited to the polygamous chimpanzee. PLOS ONE 6, e29311 (2011).
Ray, F. et al. Directional genomic hybridization for chromosomal inversion discovery and detection. Chromosome Res. 21, 165–174 (2013).
Biesecker, L. The Greig cephalopolysyndactyly syndrome. Orphanet J. Rare Dis. 3, 238 (2008).
Cuscó, I. et al. Copy number variation at the 7q11.23 segmental duplications is a susceptibility factor for the Williams-Beuren syndrome deletion. Genome Res. 18, 683–694 (2008).
Osborne, L. et al. A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat. Genet. 29, 321–325 (2001).
Sharp, A. et al. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat. Genet. 38, 1038–1042 (2006).
Weise, A. et al. New aspects of chromosomal evolution in the gorilla and the orangutan. Int. J. Mol. Med. 19, 437–443 (2007).
Samonte, R. V. & Eichler, E. E. Segmental duplications and the evolution of the primate genome. Nat. Rev. Genet. 3, 65–72 (2002).

Acknowledgements

Supported by the European Fund for Regional Development (FEDER) through the Operational Program Competitiveness Factors (COMPETE) and by the Portuguese Foundation for Science and Technology (FCT), in the context of projects PEst-OE/EEI/UI0127/2014 and Incentivo/EEI/UI0127/2014. DP is supported by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 305444 “RD-Connect: An integrated platform connecting registries, biobanks and clinical bioinformatics for rare disease research”. RMS is supported by the project Neuropath (CENTRO-07-ST24-FEDER-002034), co-funded by QREN Mais Centro program and the EU.

Author information

Authors and Affiliations

IEETA/DETI, University of Aveiro, Portugal
Diogo Pratas, Raquel M. Silva, Armando J. Pinho & Paulo J.S.G. Ferreira

Authors

Diogo Pratas
Raquel M. Silva
Armando J. Pinho
Paulo J.S.G. Ferreira

Contributions

D.P., A.P. and P.F. designed the algorithms. D.P. implemented and tested the software. D.P., R.S., A.P. and P.F. designed the experiments and interpreted the results. D.P., R.S., A.P. and P.F. wrote the manuscript. All authors reviewed the manuscript.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplementary Information

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Cite this article

Pratas, D., Silva, R., Pinho, A. et al. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci Rep 5, 10203 (2015). https://doi.org/10.1038/srep10203

Received: 26 August 2014
Accepted: 07 April 2015
Published: 18 May 2015
DOI: https://doi.org/10.1038/srep10203
