Α Quantum Pattern Recognition Method for Improving Pairwise Sequence Alignment

Quantum pattern recognition techniques have recently raised attention as potential candidates in analyzing vast amount of data. The necessity to obtain faster ways to process data is imperative where data generation is rapid. The ever-growing size of sequence databases caused by the development of high throughput sequencing is unprecedented. Current alignment methods have blossomed overnight but there is still the need for more efficient methods that preserve accuracy in high levels. In this work, a complex method is proposed to treat the alignment problem better than its classical counterparts by means of quantum computation. The basic principal of the standard dot-plot method is combined with a quantum algorithm, giving insight into the effect of quantum pattern recognition on pairwise alignment. The central feature of quantum algorithmic -quantum parallelism- and the diffraction patterns of x-rays are synthesized to provide a clever array indexing structure on the growing sequence databases. A completely different approach is considered in contrast to contemporary conventional aligners and a variety of competitive classical counterparts are classified and organized in order to compare with the quantum setting. The proposed method seems to exhibit high alignment quality and prevail among the others in terms of time and space complexity.

inventions of Smith-Waterman 3,4 (S.W.) and Needleman-Wunsch 5 (N.W.) algorithms were the greatest examples in D.P. These algorithms are based on a scoring framework providing optimal alignment solutions. The performance of both S.M. and N.W. algorithms depends on the lengths of the reference S R and the query S Q sequences, L R and L Q respectively, achieving O(L R L Q ) runtime and memory resources. Then, the word methods (a.k.a. k-mer or k-tuple) were introduced to speed-up the alignment process. In the last two decades, combinatorial strategies were invented like seed-and-extend and q-gram filter to optimize the speed and the accuracy of the alignment result.
On the parallel, the field of pattern recognition has been extended over the quantum information and computing settings. In 2001, both R. Schützhold and C.A.Trugenberger set the concept of quantum pattern recognition (QPR). R. Schützhold mentions a quantum algorithm which runs on a probe structure line those in diffraction experiments and detects easily patterns 6 . Trugenberger proposed the quantum analogue of the classical associative memory 7,8 confronting the problem of capacity shortages but probabilistically retrieving incomplete or noisy inputs. R. Zhou and Q. Ding mention in 2008 an alternative way 9 of QPR with 100% probability to retrieve the query pattern which is highly based on an improved version of Grover's search algorithm. Using quantum multi-pattern search algorithm 10 may be possible to achieve pairwise SA but such a technique demands a relatively complex way in producing quantum original states if a high number of patterns exist. In 2009, R. Neigovzen et al. describe a notable QPR scheme 11 which according to its authors combines the idea of a classical Hopfield neural network with adiabatic quantum computation and E. A. Fard et al. present a quantum neural network 12 .
In this study, we expand our previously reported method 13 which seems to outperform its classical counterparts at least in terms of time and computing resources. The reported method explains how it is possible to turn R. Schützhold's pattern recognition schema into a novel alignment approximation by utilizing the central concept of the old but infallible technique of dot-matrix plots. In fact, the crossing points of the dots on the matrix are simulated onto an array structure lattice plane like those in diffraction experiments offering a different approach to the alignment process.
It is assumed that our aligner is a complex device which can handle large sequences on a lattice plane surface made by non-linear Kerr media and run R. Schützhold's QPR algorithm in a quantum computing system. The physical realization of such a device may be an apparatus consisting of a Laue diffraction system with a probing light beam (e.g. X-rays or electrons) and two computational subunits: one with a classical processor and one with a quantum processor. The R. Schützhold's QPR algorithm is analytically described in the Methods section splitting it into five steps. The most complex parts are the quantum black box 14 , or BB, and the quantum Fourier transform (QFT) 14,15 . The classical computing subunit is needed to run calculations of Laue's equations in order to localize and specify candidate line-patterns.

Specification of the QPR Circuit
According to R. Schützhold's QPR algorithm two registers are used as an index structure for the formation of a dot-plot like diffraction pattern. The regX register for the horizontal and the regY register for the vertical indexing. A third one-qubit register is entangled with the other two registers and a single measurement of it let in superposition only the dot cells (reflective cells). Then the QFT enhances the amplitude of the superposition at the proper index coordinates on the plane revealing the existence of prominent line patterns. A few more runs of the algorithms may be needed to secure the detection result and some further classical calculations will locate the pattern. The circuit model of the quantum part of this algorithm is depicted in Fig. 1.
In an over-simplifying way, the action of a Hadamard transform (gate H) on |0〉 in Fig. 1 sets the two coordinate registers into a superposition of the input state. Assuming that the X and Y coordinate registers have n and m qubits respectively, the expression ⊗ H n or ⊗ H m denotes this action. The operator of the H gate is defined in Eq. (1).
The symbol "⊗" represents the tensor product. The coordinate registers come into a superposition by the application of the Hadamard gates which is described by Eq. (2). The BB sub-circuit acts similarly to the black boxes of the milestone algorithms of P. Shor and L. Grover. The application of a unitary operation as BB affects the state of each qubit concurrently and this is quantum parallelism. Actually, this can be viewed as a function f(x) that evaluates for many different values of x simultaneously, as f(x): {0, 1} L → {0, 1}. Starting from the state |ψ 0 〉 = |x〉|0〉 then we see that BB|ψ 0 〉 = |x〉|f(x)⊕0〉 = |x〉|f(x)〉. Since the domain is a binary w-bit string and the range a one-bit, it is possible to consider x as the address of a cell in the rectangular array and the one-bit result f(x) as a flag identifier on whether the cell is a reflective one or not. The unitary operation of the BB may be constructed by phase shift gates R u of π … π π π + − or , , , 2 2 2 n m 2 1 turns as the next unitary operation is the QFT. The R u unitary transformation, a.k.a. phase-Shift, is defined in Eq. (3), leaving unchanged the probability of measuring the orthonormal basis states of the quantum system. Moreover, the BB includes the application of controlled NOT or cNOT gates which are necessary to develop entanglement between the coordinate registers and the one-qubit register allowing for a number of parallel computation paths. The cNOT gate is defined by the operator in Eq. (4).
Right after the measurement, the system |ψ〉 remains in a superposition as in Eq. (5) and entangled, but only for the states that either had f(x) = 0 or f(x) = 1.
The next sub-circuit is the QFT (see the circuit in Supplementary Material). It is the well-known linear transformation that affects the amplitudes and phases of the base states of a qubit-system. Τhe first set of the input qubit states |q 1 〉..|q m 〉, where m < u, represents the qubits of the regY input register and the rest |q m+1 〉..|q u 〉 the qubits of the regX input register. Typically, the number of gates that are required to build QFT circuit grows only as a quadratic function of the number of the input qubits. From bottom up, QFT circuit has 1 + 2 + … + u = u(u + 1)/2 elementary Hadamard and controlled rotation gates. Such an implementation has an exponential speed up over the classical fast Fourier transform. Moreover, the approximate version of QFT known as AQFT can improve further the performance but this study is out of our scope.
proposed sA protocol. The protocol of our method is described in Fig. 2. In the same way to the dot matrix comparison, the user has to input the query S Q and the reference S R sequences and investigate the plane for diagonal line patterns. Reference's length L R may be a relatively small number when a local alignment is conducted or a large one when a global alignment or database searching takes place. If the size of S R is large enough, a fragmentation process splits it into smaller segments for a more efficient management. These segments are also known as window spaces. Then the routineQPR quantum routine runs Ω times per window to find and secure the largest line pattern in the region under comparison and then routineLaue classical routine locates the position of this pattern within this window space. routineLaue also estimates the length of the line pattern L D and the deviation angle -π/2 ≤ ϑ ≤ π/2 from a vertical one. A record with all the relevant information about the detected line is saved and the procedure continues till all the windows are exhausted. Given a line pattern starting at x 0 and y 0 on the panel should keep the following information: record (id, L D , ϑ, [x 0 , x 0 + L D ], [y 0 , y 0 + L D ]). Finally, a list of records is formed for each segment and algorithms like S.W., N.W., FOGSAA, or modified versions of them undertake for the optimal alignment.
Dot matrix analysis provides a great way to compare every position in the matrix. Its usefulness is seen in alignment of proteins, nucleic acid sequences, finding amino acid repeats within a protein (comparing it by itself), database searching or sequence assembly.
The formed diagonals in the dot matrix plane, illustrate the degree of similarity. All the occurring diagonal lines that are formed right after BB's application may not be of the same importance since most of them may emerge from the small alphabet of the sequences which causes a lot of noise (e.g. nucleic acids in DNA has a value of α = 4). Diagonals may not be perfectly shaped up to the degree that the density of dots ρ within a line-width (e.g. L D /2) deviates by an acceptable quantity Δρ from the mean ρ. A threshold value (or cutoff value) for the minimum acceptable length L 0 of detected diagonals should be defined in the beginning of the protocol in order to trim redundant or insignificant matches. But as it is the case in most situations, more than one diagonal line will occur in a typical window space with a small alphabet. Multiple occurring lines in a confined region may be a confusing problem for our probabilistic QPR algorithm in two directions: (a) When line patterns of nearby lengths exist in the same window, it will be difficult to distinguish them in their superposition by QFT, as they form coherently amplitudes of the same measuring probability. The longer the length of a diagonal line in contrast to the other diagonals, the easier the way to detect it. However, this case is usually limited in microsatellite repeats or patterns. (b) When two, or more, line patterns of accepted lengths, >L 0 , exist in the same window, then the lengthiest of them will always shade the others rendering difficult their detection.
www.nature.com/scientificreports www.nature.com/scientificreports/ One way to combat the first problem is to adjust properly the size L W of the window. L 0 should always be a few orders of magnitude smaller than L W . Reducing L W the probability to allow more than one acceptable diagonal in it shrinks. Another way to consider is to include classical heuristic methods, but increasing the computational load.
An option to treat the second problematic area is to invent a way to destroy the amplitudes of the first detected diagonal within a single window space. It can be possible by adding some extra phase gates in the BB to direct properly the quantum Fourier transformation. Another easier and cost-effective option would be to remove the region of the previously detected diagonal in S R , but this solution would omit to take into consideration smaller and acceptable diagonals that occur within the range of the previously detected diagonal. However, for small L W this option may be an effective solution. A third option would be to include again classical methods.
L 0 and L W are two critical parameters which have to be carefully chosen with the aid of statistics. The goal of analysis usually determines the window size L W and the threshold length L 0 of diagonal. The size of the average exon or the average protein structural element, of gene promoter or enzyme active site, are typical factors that affect window's size. The level of noise in a dot matrix for a small α may cause the occurrence of irrelevant diagonals but since the significant diagonals have a good similarity and aggregate many dots then the detection effectiveness is independent of noise. Thus, a densely dot matrix with a high ρ will demand a higher threshold length at the risk to encompass irrelevant diagonals. A small window size would increase dramatically the number of the windows and therefore the application times of routineQPR. An interesting improvement would be to examine first the diagonal windows for a large-scale alignment.
The size of the diagonal will affect the number of measurements that we have to apply. The larger the length, the higher the probability to be measured, as small patterns need more measurements 6 to secure the result. If the size of a line pattern is a small portion β of the array size L Q L R , the probability to locate a line pattern is given by Eq. (6).

Results
The aim of this work is to examine whether the QPR method can be used efficiently as an alternative to the conventional sequence alignment methods. The performance of our QPR protocol is tested in terms of computational resources and run time. In order to evaluate its contribution to the alignment procedure, a simple program was implemented to simulate the behavior of the novel QPR algorithm (see Methods section). Though the adopted algorithm isn't far from reality, quantum computing technology isn't yet mature enough to carry out efficiently the subroutines QFT or BB for multiple qubits (>20). Quantum entanglement fragility and decoherence render unstable quantum computation. The performance of our method is tested and some pairwise alignment examples are presented to demonstrate the precision of our method as a benchmark test with known aligners. For the following results, the supportive data will be found in the Supplementary Material.
Computing resources. The circuit in Fig. 1 provides a clear and compact summary of the quantum computing resources that are needed for this method. The two memory registers regX and regY and a set of single-qubit quantum gates including: Hadamard, cNot and phase-Shift operations, are the fundamental components for the implementation of the proposed routineQPR routine.
In order to conduct a theoretical analysis with numerical treatments, an optimistic specification of the rou-tineQPR 's hardware is assumed which is based on two realistic implementations on different technology platforms. The first one is the superconducting transmon IBM-QE device with limited qubit connectivity, and the other is a fully connected trapped-ion system 16 . Both are limited to 5-qubits, but we do the assumption that we can reach the size of 20 in each register as it is enough to index sequences of more than one million base pairs (bp). According to their specifications the execution of an X, Hadamard and cNOT gate takes 130 ns, 130 ns, and 650 ns for the superconductor and approximately 20000 ns, 20000 ns, and 250000 ns for the ion-trap system, assuming the aforementioned memory size. For a more thorough estimation, the encoded versions of the circuit are compared with the unencoded. The well-studied distance-3 [ [5,1,3]] and [ [7,1,3]] error correcting codes are included in the estimation chart in Fig. 3.
An increment in the qubits of the registers, has a sublinear augmentation on gate operations. Quantum recovery protocols usually add an expensive computing overhead. Error correction takes place after one or two logical gates. Generally, gate operations of the encoding and recovery system occupy more than 90% of the overall system. In this circuit the approximated version of QFT is considered with a degree of 2.
From the aspect of memory allocation, our proposed method needs only a few qubits. Quantum registers can host sequences of size that increases exponentially in the number of qubits. Only a little classical memory will be needed to keep records of the matching points.

Run time estimation.
Given the hardware specification and the QPR algorithm, the expected run time of the routineQPR routine is estimated by dividing the algorithm mainly into three computing steps: the initial superposition by Hadamards, the BB application and the QFT subroutine. The run time escalation for 6, 10, 15 and 20 totally indexing qubits is described in Table 1. Hadamard and phase-Shift gates are supposed that act concurrently and parallel. Noise recovery delay is not included.
As parameter Ω secures pattern recognition and detection, its behavior for more memory qubits is studied. This parameter is critical for the performance of our method, as it may be time consuming for many repetitions of routineQPR(Ω). In the following chart time consumption is calculated for 6, 10, 15 and 20 qubit memory in the range 5 ≤ Ω ≤ 20. The results in Fig. 4 demonstrate that the size of the register affects Ω significantly, thus a good management of this parameter will save much time. However, run times still remain at relatively low levels.  www.nature.com/scientificreports www.nature.com/scientificreports/ used between the E. coli K12 MG1655 (accession number SRR001665 in NCBI SRA) and the E. coli 536 (accession number NC_008253 in GenBank) using the paired-end alignment for each aligner. According to this work, 20.8 million 200-bp-insert-size paired-end reads of the SRR001665 are aligned onto the reference genome NC_008253. The first three columns in Table 1 provide the precision and recall results for CUSHAW, SOAP2 and BWA.
Due to the large size of the sequences our classical simulator took many hours to approximate the alignment quality levels of the classical aligners. Real reads of medium size 200 bp were extracted from the query sequence SRR001665 in order to treat mapping as it would be in a single read sequencing way and not as in paired-end. A sample of ½ million reads covers almost more than ¼ of the initial sample which is a secure amount to approximate and consider our comparison. Then, the reduced reference genome is divided into segments of size 1000 bp and the window size is defined at read's length. Each read is tested over the reference genome and when an acceptable line occurs, the threshold L 0 criterion is fulfilled and the window is recorded as "positive" in order to keep track of the alignment. The threshold value L 0 has an elasticity of 15-20% mismatches in each window. Each time that a positive window occurs, it is recorded as a positive hit by the simulator. Due to some shifted diagonals the width of the examination of the neighboring windows was increased in order to cover more thoroughly problematic regions with highly distorted diagonals. The output results are finally compared with the peer reviewed alignments as they are reported in the relative study 17 .
The SNP calling is evaluated by using the precision and recall measures. Precision is defined as TP/(TP + FP) and recall as TP/(TP + FN). TP stands for true positive and FP stands for false positive, representing a match and mismatch, respectively, with a SNP (Single Nucleotide Polymorphism). FN is a false negative and represents a "correct" SNP that was not identified.
The final comparison results between the four popular aligners and our own method is presented in Table 2. The QPR method seems to be advantageous as the precision and recall measures surpass the other methods having a relatively small standard deviation. By a short research, the average time of a single mapping of a read for the four aligners was estimated, in order to conduct the runtime comparison. Moreover, the runtimes for the single-end (SE) and the paired-end (PE) alignment methods are included as derived by the aforementioned research 17 to demonstrate the superiority of our method. The parameter Ω has a sufficient number of times to detect more than one diagonal lines in a single window.
Computational complexity. To investigate in a uniform way all the established known short read alignment methods, five stages were considered to cover every stage of these strategies. In the first stage all the preliminary operations take place regarding the reference system. In the second and the third stage the generation and the mapping process are examined, respectively. The fourth and the fifth stages deal with the mapping validity of the candidate matching list. Twelve popular aligners plus our own method and its classical counterpart are analyzed to demonstrate the pros and cons of each strategy in Table 3.
In order to allow for a vivid comparison between our method and the various alignment tools, Table 3 summarizes all the ingredient algorithms of modern aligners with their time and space complexities. This comparative elaboration demonstrates the intrinsic computational power of each aligner. Apart from the old dot-matrix method 18 , most of the other aligners are some prominent variations 19,20 of the seed-and-extend strategy: (1) the k-mer exact match seed (BLAST 21    The comparison table shows that the quantum approach outperforms the other classical strategies in terms of computational complexity as at every stage the quantum setting has lower complexity than its classical counterparts. The usage of hash tables or suffix trees impose the preparation of indexing structure which may take time and space. Moreover, their manipulation may not be affordable enough in certain cases. Hash tables turn out to be more efficient than search trees or any other lookup table structure, but hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. A data structure based on Burrows-Wheeler Transform can search stored items in constant time which is the best possible complexity, but in reality, the complexity may range between Θ (1) and O(L Seed ) where L Seed the length of the query seed. The QPR approach is evolved in a dynamic way binding only a few memory qubits. The quantum index structure can host exponentially larger data segments of genome sequences in the increase of the number of qubits.
As the goal of Table 3 is to describe the computing complexities, it is not clear what is the real consumption of resources. The computational overhead to conduct seed generation is unnecessary in the QPR approach since the QFT can detect the line-patterns in a logarithmic time. The repetitive seed generation stages, the various calculations for seed or mapping scoring and filtering and the final alignment processes based on D.P. imply a poorer performance, even if stage 2 provides an advantage with the constant time complexity by the searching methods. However, the S.W. or N.W. or alternative of dynamic programming should be incorporated in our method in order to achieve precision similar to the other classical aligners.
For weak similarities or particular patterns, the QPR approach may fail to detect or interpret correctly positive windows and further study may improve the situation. Obviously, reducing window's size or increasing Ω surely will increase sensitivity at a cost of time reaching a level that will limit the need for dynamic programming. Long reads are the future tendency in NGS techniques and longer inexact or spaced seeds can guarantee an accurate matching detection. In this regime, the quantum algorithm may have a great advantage as it supports the detection of arbitrary long imperfect line-patterns.

Discussion
A sophisticated method is presented which combines dot-matrix plotting and quantum pattern recognition to improve sequence alignment in biology research and all the relevant fields. The results of our experiments lead to the conclusion that there is a great advantage with a method which should be explored further in the future. The implementation of such a method is not far from the reality, since it does not demand a coherent quantum system of many qubits.
The overall time complexity of our method is O(3log 2 L R L Q ). The first term concerns the application of the Hadamard gates, the second the application of the BB and the third the QFT. The final aligning step is common for all the strategies, but it may be limited in dot-matrix approaches. Current aligners seem that they cannot run faster than O(L R L Q ) in the worst case. Moreover, the quantum registers can save much space due to the exponential indexing coverage of the sequences.
Since, the QPR approach isn't directly an alignment algorithm, the combination with classical tools may seem unavoidable but better treatments can be employed to make perfect alignments. Using scoring methods will also improve the alignment result. More benchmarking tests are needed under specific circumstances to illustrate the power of this method. Database searching and sequence assembly may be benefited. Another area to be consider with the QPR method is the multiple sequence alignment.

RoutineQpR.
The primary task of this method is to detect, identify and localize certain linear patterns in an unstructured data set as it is argued by R. Schützhold in his seminal paper 6 . The considered linear pattern in this study is the line pattern. The first part of this method uses quantum computation while the second part uses classical computation.
A full unstructured data set has to be loaded into a lattice plane which can be considered as a rectangular n × m array. This array is assumed to consist of unit cells (e.g. controlled refractors or any non-linear Kerr cells) which are, either completely absorptive, or completely reflective. The reflective cells are also assumed to be www.nature.com/scientificreports www.nature.com/scientificreports/ uniformly distributed in the rectangular array with a density ρ < 1. The position of the reflective points on the plane may be located by shining appropriately focused light beams on the Kerr media and doing calculations in close analogy to diffraction experiments studying crystal structures via Laue diagrams. Symmetrical pattern transformations are assumed to be resilient for an ideal number of times into different directions.
In another way, the array data can be treated by a quantum black box (BB) of the form in Eq. (7). The BB transformation is based on the principal of quantum parallelism and acts as a binomial classifier and selector for the array data. The input x and y quantum states, being in a superposition, can encode simultaneously all the coordinates of the reflective cells in the array. The initial x and y quantum states are superpositions that represent all the possible coordinates as n-and m-qubit strings laying in quantum registers. The zero-state register consists of a single qubit which has to come into entanglement with the other two registers. A single measurement of the entangled qubit will randomly determine which of the groups of the binomial classification will survive, which means that then the x and y states will exclusively become superpositions, either of the reflective, or the absorptive, cells.
Since the unstructured data set is loaded into the array structure, the method is ready to begin. The whole method can be translated into five algorithmic phases:   Table 3. Comparison of Sequence Alignment Strategies in terms of Computational Complexity. Notes:S Q and L Q are the query read sequence and its length. S S and L S are the seed sequence and its length. S R and L R is the reference genome sequence and its length. q is the length of a substring (q-gam) of a target sequence. L = L S L Q , L PWM is the length of a PWM sequence. N m is the matching times, r is the number of the seed strings in a seedtable, k is the allowed mismatches in the search, a is the number of all the different elements of the sequence (a = 4 for DNA seq. and a = 20 for protein seq.). *The outcome of the quantum measurements should be extracted immediately.
www.nature.com/scientificreports www.nature.com/scientificreports/ Input data preparation. The two coordinate registers are initialized into |0〉 state and then the Hadamard transformation is applied. This step is described in Eq. (8). N and M parameters represent the rows and the columns of the rectangular array, respectively.
Black-Box run. The BB transformation is applied and the third one-qubit registered is measured. The current state of the registers is described by Eq. (8). If the outcome of the measurement is 1, the resulting superposition concerns only the reflective cells. Otherwise, the resulting superposition concerns only the absorptive cells (the complementary set) and step 2 is repeated.
At this stage, some useful information about the pattern can be extracted by measuring |k〉. Some values of k may be more prominent as their amplitudes may be increased by the QFT. Line patterns demonstrate a stable repetition of reflective cells developing peaks of |z〉 factor. These values of k are known as wave-numbers and can lead to the detection of the pattern (a.k.a. feature selection). When there isn't a prominent pattern then the measurement of the quantum state k yield just noise.
Pattern localization. This is the only classical part of this method since it makes use of classical computation. By using the equations of Laue diagrams used in diffraction experiments it is possible to find the location of the detected patterns and extract further information about their length and their direction. This step demands a second execution of the algorithm.

Fidelity verification.
A few more repetitions of the algorithm may be needed to ensure the detection of our findings.
RoutineLaue. In a continuation of the QPR algorithm, the key parameters L D and ϑ can be inferred from the measurements in routineQPR, in order to locate the center points z of a particular diagonal which is described by where N is an integer. The Ω measurements [k 1 , …, k Ω ] of routineQPR running for a single window are stored and used in this routine. Some peaks are generated in each row of the pattern The second term in Eqs (12) and (13)  Prevalent peaks in the measurements of k occur for values which satisfy both conditions in Eqs (12) and (13), concurrently. So, the wave-numbers of the potential peaks are given by (2019)  However, Eq. (14) cannot give large peaks standalone. The first few of them may be suppressed, but some potential peaks from Eq. (14) may match conditions (12) and (13). One more time, the same process has to be repeated for the transposed array (NM→)MN. The weave-numbers of the peaks are now since transposing changes the array orientation per π/2. So far, the candidate values L D and ϑ can be approximated. Using the candidate values for L D /cos ϑ from Eq. (14) and L D /sin ϑ from Eq. (15) a mathematical system is created for solution. Therefore, comparing the condition Eqs (12) and (13) and the sets of the suppressed peaks will provide accurately the L D and ϑ values.
For the size of the pattern, the frequency of measuring the peaks at k and k΄ and their width is enough to find fraction β. So, with the aid of L, ϑ and β, the pattern can be localized efficiently by splitting up symmetrically the window into smaller orthogonal windows and running routineQPR again in these smaller windows. A richer analysis of this method can be found in the seminal paper 6 .
Classical simulation of routineQPR. A simple program is implemented by using the bioinformatics toolbox application of Mathworks in the Matlab suite.
The readily available function of seqdotplot allows the creation of a large dot matrix and its manipulation. seqdotplot can load large sequences, take control of the size of the window on the reference genome and set the number of mismatches in it. The functions nwalign and swalign are good simulations of the well-known dynamic alignment algorithms N.W. and S.W., respectively. Whatever is the aim, the main program splits the dot matrix into smaller windows and windows that are directly connected with the main diagonal of the plot are examined. All our results are conducted with this setting ignoring windows of second priority. The majority of the diagonals, a fraction ~50%, in a sub-matrix, or window, are scanned for "1" (dot) values and the summation is kept for each one. When the sum is higher than a threshold, this window is recorded as positive. Finally, a matrix keeps the positive windows to index the regions of similarity. The threshold for the summation is analogous to the size of the window. Due to the limited performance and memory allocation of our program, reasonable sizes of sequences were tested.

Data Availability
All data generated or analyzed during this study are included in this published article (and its Supplementary Information files).