qPMS9: An Efficient Algorithm for Quorum Planted Motif Search

Discovering patterns in biological sequences is a crucial problem. For example, the identification of patterns in DNA sequences has resulted in the determination of open reading frames, identification of gene promoter elements, intron/exon splicing sites, and SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have led to domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, discovery of short functional motifs, etc. In this paper we focus on the identification of an important class of patterns, namely, motifs. We study the (ℓ, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers ℓ and d. It returns all sequences M of length ℓ that occur in each input string, where each occurrence differs from M in at most d positions. Another formulation is quorum PMS (qPMS), where the motif appears in at least q% of the strings. We introduce qPMS9, a parallel exact qPMS algorithm that offers significant runtime improvements on DNA and protein datasets. qPMS9 solves the challenging DNA (ℓ, d)-instances (28, 12) and (30, 13). The source code is available at https://code.google.com/p/qpms9/.

worst case. Thus, it is important to develop algorithms that are efficient in practice. The practical performance of PMS algorithms is typically evaluated on datasets generated as follows (see Refs. 1, 6): 20 DNA/protein strings of length 600 are generated according to the independent identically distributed (i.i.d.) model. Similarly, a random motif (ℓ-mer) M is generated and "planted" at a random location in each input string (or in q% of the input strings for qPMS). Every planted instance of the motif is mutated in exactly d positions.
Definition 1. An (ℓ, d) instance is defined to be a challenging instance if d is the largest integer for which the expected number of motifs of length ℓ that would occur in the input by random chance does not exceed a constant (500 in this paper, the same as in Ref. 7).
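The dataset generation described above can be sketched as follows. This is a minimal illustration, not the generator used by the authors: the function name `plant_instance`, the seed handling, and the choice to round q% of n down to a whole number of strings are assumptions made for the example.

```python
import random

def plant_instance(n=20, m=600, l=13, d=4, q=100, alphabet="ACGT", seed=0):
    """Generate n i.i.d. strings of length m and plant a mutated motif in q% of them."""
    rng = random.Random(seed)
    strings = [[rng.choice(alphabet) for _ in range(m)] for _ in range(n)]
    motif = [rng.choice(alphabet) for _ in range(l)]
    n_planted = (q * n) // 100  # assumption: round down to a whole number of strings
    for idx in rng.sample(range(n), n_planted):
        instance = motif[:]
        # Mutate exactly d distinct positions, each to a different character.
        for pos in rng.sample(range(l), d):
            instance[pos] = rng.choice([c for c in alphabet if c != motif[pos]])
        # Overwrite l characters at a random location with the mutated instance.
        start = rng.randrange(m - l + 1)
        strings[idx][start:start + l] = instance
    return ["".join(s) for s in strings], "".join(motif)

strings, motif = plant_instance()
print(len(strings), len(strings[0]), len(motif))
```

Each planted instance differs from the motif in exactly d positions, but an unrelated ℓ-mer elsewhere in a string may happen to be closer by chance, which is why exact algorithms can report more than one motif.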
Note that in this paper we only address exact algorithms, which find all the existing motifs. Most of the exact PMS algorithms use a combination of two fundamental techniques. One is a sample driven technique and the other is a pattern driven technique. In the sample driven stage, the algorithm selects a tuple of ℓ-mers coming from distinct input strings. In the pattern driven stage, the algorithm generates the common d-neighborhood of the ℓ-mers in the tuple. Each ℓ-mer in this neighborhood becomes a motif candidate. The size of the tuple is usually fixed to a value such as 1 (see e.g. [6, 8, 9]), 2 (see e.g. [10]), 3 (see e.g. [11-14]) or n (see e.g. [1, 15]). In contrast, PMS8 [7] and qPMS9 (this paper) utilize a variable tuple size, which adapts to the problem instance under consideration.
There are many PMS algorithms in the literature. In a previous paper [7] we introduced the PMS8 algorithm. In the same paper we compared PMS8 with all the exact algorithms we could find in the literature of the previous five years, and showed that PMS8 outperforms these algorithms. Since the publication of PMS8, one other exact qPMS algorithm has been published, called TraverStringRef [11]. Therefore, in this paper we compare qPMS9 with PMS8 and TraverStringRef.
The TraverStringRef algorithm [11] is an algorithm for the qPMS problem, based on the earlier qPMS7 algorithm [14]. qPMS7 [14] can solve, for example, the challenging DNA instance (23, 9), whereas TraverStringRef [11] can solve (25, 10), in a reasonable amount of time (no more than two days using commodity processors). In the case of the PMS problem, the PMS8 algorithm [7] can solve the DNA instances (25, 10) on a single-core machine and (26, 11) on a multi-core machine. We have used PMS8 as the basis for the new qPMS9 algorithm. The qPMS9 algorithm extends PMS8 in several ways. First, qPMS9 introduces a search procedure which significantly increases performance by allowing for better pruning of the search space. Second, qPMS9 adds support for solving the qPMS problem, which was lacking in PMS8. We compare qPMS9 with PMS8 [7] and TraverStringRef [11] on several DNA and protein instances.
Another useful notion is that of a d-neighborhood. Given a tuple of ℓ-mers T = (t_1, t_2, …, t_s), the common d-neighborhood of T includes all the ℓ-mers r such that Hd(r, t_i) ≤ d, ∀i = 1..s.
We now define the consensus ℓ-mer and the consensus total distance for a tuple of ℓ-mers. The consensus ℓ-mer for a tuple of ℓ-mers T = (t_1, t_2, …, t_s) is an ℓ-mer p where p[i] is the most common character among (t_1[i], t_2[i], …, t_s[i]), for each i = 1..ℓ. If p is the consensus ℓ-mer for T then the consensus total distance of T is defined as $Cd(T) = \sum_{i=1}^{s} Hd(t_i, p)$. While the consensus string is generally not a motif, the consensus total distance provides a lower bound on the total distance between any motif and a tuple of ℓ-mers.
qPMS9. As indicated previously, most of the motif search algorithms combine a sample driven approach with a pattern driven approach. In the sample driven part, tuples of ℓ-mers (t_1, t_2, …, t_k) are generated, where t_i is an ℓ-mer in s_i. Then, in the pattern driven part, for each tuple, its common d-neighborhood is generated. Every ℓ-mer in the neighborhood is a candidate motif. In PMS8 [7] and qPMS9, the tuple size k is variable. By default, a good value for k is estimated automatically based on the input parameters (see Ref. 7 for details), or k can be user specified.
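As a minimal sketch of the consensus ℓ-mer and the consensus total distance discussed above, the following computes both for a small tuple. The function names are hypothetical, and ties between equally common characters are broken by first occurrence, which is an arbitrary choice made for this example.

```python
from collections import Counter

def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))

def consensus(tup):
    """Column-wise most common character (ties: first one encountered)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*tup))

def consensus_total_distance(tup):
    """Sum of Hamming distances from every l-mer in the tuple to the consensus."""
    p = consensus(tup)
    return sum(hamming(t, p) for t in tup)

T = ("ACGT", "ACGA", "TCGA")
print(consensus(T), consensus_total_distance(T))  # ACGA 2
```

Because the consensus minimizes the number of mismatches in every column independently, no other ℓ-mer can have a smaller total distance to the tuple, which is the lower-bound property used for pruning.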
Tuple Generation. In the sample driven part of PMS8, tuples T = (t_1, t_2, …, t_k), where t_i is an ℓ-mer from string s_i, ∀i = 1..k, are generated based on the following principles. First, if T has a common d-neighborhood, then every subset of T has a common d-neighborhood. Second, for a motif to exist, there has to be at least one ℓ-mer u in each of the remaining strings s_{k+1}, s_{k+2}, …, s_n such that T ∪ {u} has a common d-neighborhood. We call such ℓ-mers u "alive" with respect to tuple T. As we add ℓ-mers to T we update the alive ℓ-mers and reorder the strings in increasing order of the number of alive ℓ-mers. This reordering reduces the running time because it leads to generating fewer tuples overall.
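The alive-ℓ-mer bookkeeping can be illustrated with a simplified filter. This sketch does not use the exact conditions from Ref. 7; it assumes only the pairwise necessary condition that two ℓ-mers can share a d-neighbor only if their Hamming distance is at most 2d, so it may keep some ℓ-mers that a full test would discard. The name `filter_alive` is hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def filter_alive(T, rows, d):
    """Keep l-mers that pass the pairwise test against every member of T.

    rows holds one list of candidate l-mers per remaining string.
    Returns the filtered rows sorted by increasing number of alive l-mers,
    or None when some string has no alive l-mer left (T cannot lead to a motif).
    """
    alive = [[u for u in row if all(hamming(u, t) <= 2 * d for t in T)]
             for row in rows]
    if any(not row for row in alive):
        return None
    return sorted(alive, key=len)

rows = [["AAAT", "TTTT", "AATT"], ["AATA", "CCCC"]]
print(filter_alive(["AAAA"], rows, d=1))  # [['AATA'], ['AAAT', 'AATT']]
```

Returning None as soon as one string has no alive ℓ-mer mirrors the second principle above: the current tuple can be abandoned without exploring its neighborhood.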
In qPMS9 we change the criteria by which the strings are reordered, as follows. Let T be the current tuple of ℓ-mers and let u be an alive ℓ-mer with respect to T. If we add u to T, then the consensus total distance of T cannot decrease. We compute this additional distance Cd(T ∪ {u}) − Cd(T). For each of the remaining strings, we compute the minimum additional distance for any alive ℓ-mer in that string. Then we sort the strings in decreasing order of the minimum additional distance. Therefore, we give priority to the string with the largest minimum additional distance. If two strings have the same minimum additional distance, we give priority to the string with fewer alive ℓ-mers. The intuition is that a larger additional distance could indicate more "diversity" among the ℓ-mers in the tuple, which means smaller common d-neighborhoods. The pseudocode for generating tuples T is given in Figure 1. We invoke the algorithm as GenTuples({}, k, R) where the matrix R contains all the ℓ-mers in all the input strings, grouped as one row per string.
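The reordering rule above can be sketched as a sort key. This is an illustration under stated assumptions rather than the code of Figure 1: the names `order_strings` and `alive_rows` are hypothetical, every row is assumed to contain at least one alive ℓ-mer, and the consensus total distance is recomputed from scratch instead of being updated incrementally.

```python
from collections import Counter

def consensus_total_distance(tup):
    # Per column: tuple size minus the count of the most common character.
    return sum(len(col) - Counter(col).most_common(1)[0][1] for col in zip(*tup))

def order_strings(T, alive_rows):
    """Return row indices, highest priority first.

    Priority: largest minimum additional distance Cd(T + [u]) - Cd(T);
    ties go to the row with fewer alive l-mers.
    """
    T = list(T)
    base = consensus_total_distance(T)

    def key(i):
        min_add = min(consensus_total_distance(T + [u]) - base
                      for u in alive_rows[i])
        return (-min_add, len(alive_rows[i]))

    return sorted(range(len(alive_rows)), key=key)

alive_rows = [["AAAA", "AAAT"], ["CCCC"], ["AATT"]]
print(order_strings(["AAAA"], alive_rows))  # [1, 2, 0]
```

In the example, the string whose only alive ℓ-mer is CCCC adds the most distance (4), so it is tried first; the string containing an exact copy of AAAA adds nothing and is tried last.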
Neighborhood Generation. For every tuple T, obtained as described in the previous section, we generate the common d-neighbors of the ℓ-mers in the tuple. In qPMS9, the neighbor generation uses the same process as in PMS8 [7]. For the sake of completeness, we briefly review the process.
Given a tuple T = (t_1, t_2, …, t_k) of ℓ-mers, we want to generate all ℓ-mers M such that Hd(t_i, M) ≤ d, ∀i = 1..k. We traverse the tree of all possible ℓ-mers. A node at depth r, which represents an r-mer, is not explored deeper if certain pruning conditions are met. Necessary and sufficient conditions for 2 and 3 ℓ-mers to have a common neighbor are given in Ref. 7. The same paper gives necessary conditions for more than 3 ℓ-mers to have a common neighbor. The interested reader is referred to the PMS8 paper [7] for a more in-depth description of neighborhood generation.
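The tree traversal can be sketched with the weakest possible pruning rule: a prefix is abandoned as soon as it already disagrees with some t_i in more than d positions. This is an assumption made to keep the example short; the pruning conditions of Ref. 7 are stronger and are not reproduced here. The function name is hypothetical.

```python
def common_neighborhood(tup, d, alphabet="ACGT"):
    """All l-mers M with Hamming distance at most d to every l-mer in tup."""
    l = len(tup[0])
    out = []

    def dfs(prefix, dists):
        r = len(prefix)  # depth in the tree = length of the current prefix
        if r == l:
            out.append("".join(prefix))
            return
        for c in alphabet:
            # Distances of the extended prefix to the first r + 1 characters of each t_i.
            new = [dist + (t[r] != c) for dist, t in zip(dists, tup)]
            if max(new) <= d:  # otherwise prune: some t_i is already too far
                prefix.append(c)
                dfs(prefix, new)
                prefix.pop()

    dfs([], [0] * len(tup))
    return out

print(common_neighborhood(("ACG", "ACT"), 1))  # ['ACA', 'ACC', 'ACG', 'ACT']
```

With this rule alone every leaf reached is a valid neighbor, but many dead-end subtrees are still entered; stronger conditions cut those subtrees earlier.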
Adding Quorum Support. We extend the algorithm to solve the qPMS problem. In the qPMS problem, when we generate tuples, we may "skip" some of the strings entirely. This translates to the implementation as follows: in the PMS version we successively try every alive ℓ-mer in a given string by adding it to the tuple T and recursively calling the algorithm for the remaining strings. For the qPMS version we have an additional step where, if the value of q permits, we skip the current string and try ℓ-mers from the next string. At all times we keep track of how many strings we have skipped. The pseudocode for this algorithm is given in Figure 2. We invoke the algorithm as QGenerateTuples(n − Q + 1, {}, 0, k, R) where Q = ⌊qn/100⌋ and R contains all the ℓ-mers in all the strings.
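The skipping step can be illustrated with a simplified recursive generator. It is not the pseudocode of Figure 2: the name `q_gen_tuples` and its parameter list are hypothetical, strings are visited in their given order without reordering, and aliveness is approximated by the pairwise necessary condition Hd ≤ 2d. The argument `max_skips` plays the role of the number of strings that the quorum allows us to skip, which would be n − Q.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def q_gen_tuples(rows, k, d, max_skips, T=(), i=0):
    """Yield k-tuples of l-mers taken from distinct strings rows[i:],
    skipping at most max_skips strings along the way."""
    if len(T) == k:
        yield T
        return
    if i == len(rows):
        return
    # PMS step: try every candidate l-mer of the current string.
    for u in rows[i]:
        if all(hamming(u, t) <= 2 * d for t in T):
            yield from q_gen_tuples(rows, k, d, max_skips, T + (u,), i + 1)
    # qPMS step: if the quorum still permits, skip the current string entirely.
    if max_skips > 0:
        yield from q_gen_tuples(rows, k, d, max_skips - 1, T, i + 1)

rows = [["AAA"], ["TTT"], ["AAT"]]
print(list(q_gen_tuples(rows, k=2, d=1, max_skips=0)))  # []
print(list(q_gen_tuples(rows, k=2, d=1, max_skips=1)))
```

With no skips allowed the first two strings are incompatible and no tuple is produced; allowing one skip yields the tuples (AAA, AAT) and (TTT, AAT), each drawn from two of the three strings.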
Parallel Algorithm. In PMS8 [7] the search space is split into m = |s_1| − ℓ + 1 independent subproblems P_1, P_2, …, P_m, where P_i explores the d-neighborhood of ℓ-mer s_1[i..i + ℓ − 1]. In the parallel implementation, processor 0 acts as both a master and a worker; the other processors are workers. Each worker requests a subproblem from the master, solves it, then repeats until all subproblems have been solved. Communication between processors is done using the Message Passing Interface (MPI). In qPMS9, we extend the previous idea to the q version. We split the problem into subproblems P_{1,1}, P_{1,2}, …, P_{1,|s_1|−ℓ+1}, P_{2,1}, P_{2,2}, …, P_{2,|s_2|−ℓ+1}, …, P_{r,1}, P_{r,2}, …, P_{r,|s_r|−ℓ+1} where r = n − Q + 1 and Q = ⌊qn/100⌋. Problem P_{i,j} explores the d-neighborhood of the j-th ℓ-mer in string s_i and searches for ℓ-mers M such that there are Q − 1 instances of M in strings s_{i+1}, …, s_n. Notice that Q is fixed; therefore, subproblems P_{i,j} get progressively easier as i increases.
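The decomposition into subproblems can be made concrete by enumerating the index pairs (i, j). This sketch only lists the work items; it does not model the MPI master/worker exchange. The function name is hypothetical, and Q is taken as q·n/100 rounded down, which is an assumption about the rounding.

```python
def subproblems(lengths, l, q):
    """List the 1-based index pairs (i, j) of subproblems P_{i,j}.

    lengths[i - 1] is the length of string s_i; q is the quorum in percent.
    """
    n = len(lengths)
    Q = (q * n) // 100   # assumption: rounded down
    r = n - Q + 1        # only the first r strings can hold the first instance
    return [(i, j)
            for i in range(1, r + 1)
            for j in range(1, lengths[i - 1] - l + 2)]

print(len(subproblems([600] * 20, l=13, q=100)))  # 588
print(len(subproblems([600] * 20, l=13, q=50)))   # 6468
```

For q = 100% only the ℓ-mers of the first string seed subproblems, which matches the PMS8 split; for q = 50% with 20 strings the first 11 strings each contribute 588 subproblems.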
Test Data Generation. As mentioned in the introduction, PMS algorithms are typically tested on datasets generated as follows. 20 strings of length 600 each are generated according to the i.i.d. model. We choose an ℓ-mer M as a motif and plant modified versions of it in q% of the n strings. Each planted instance is modified in d random positions.
It is useful to estimate how many "spurious" motifs (motifs expected by random chance) will be found in a random sample. For that, we make the following observations. The probability that a random ℓ-mer u is within distance at most d from another ℓ-mer v is $p(\ell, d, \Sigma) = \sum_{i=0}^{d} \binom{\ell}{i} \left(1 - \frac{1}{|\Sigma|}\right)^{i} \left(\frac{1}{|\Sigma|}\right)^{\ell - i}$. The probability that an ℓ-mer is within distance d from any of the ℓ-mers in a string S of length m is: $P(m, \ell, d, \Sigma) = 1 - \left(1 - p(\ell, d, \Sigma)\right)^{m - \ell + 1}$. The probability that an ℓ-mer is within distance d from at least q out of n strings of length m each is: $Q(q, n, m, \ell, \Sigma) = \sum_{i=q}^{n} \binom{n}{i} P(m, \ell, d, \Sigma)^{i} \left(1 - P(m, \ell, d, \Sigma)\right)^{n - i}$. Therefore, the expected number of motifs for a given qPMS instance is $|\Sigma|^{\ell} \, Q(q, n, m, \ell, \Sigma)$. Based on these formulas, we compute for every ℓ the largest value of d such that the number of spurious motifs does not exceed 500. These values are presented in Table 1 for DNA and Table 2 for protein.
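The estimate described above can be sketched numerically. The code assumes a uniform i.i.d. alphabet of size `sigma`, treats the ℓ-mers of a string as independent, and takes q as a count of strings rather than a percentage; the function names are hypothetical, and the results are not guaranteed to reproduce the paper's tables exactly.

```python
from math import comb, expm1, log1p

def expected_spurious(l, d, n=20, m=600, q=20, sigma=4):
    """Expected number of l-mers within distance d of at least q of n random strings."""
    # p: a random l-mer is within distance d of one fixed l-mer.
    p = sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1)) / sigma ** l
    # P = 1 - (1 - p)^(m - l + 1), computed stably for very small p.
    P = 1.0 if p >= 1 else -expm1((m - l + 1) * log1p(-p))
    # Binomial tail: at least q of the n strings contain a close l-mer.
    tail = sum(comb(n, i) * P ** i * (1 - P) ** (n - i) for i in range(q, n + 1))
    return sigma ** l * tail

def challenging_d(l, limit=500, **kw):
    """Largest d for which the expected number of spurious motifs stays within limit."""
    d = 0
    while d + 1 <= l and expected_spurious(l, d + 1, **kw) <= limit:
        d += 1
    return d

print(challenging_d(13))  # 4 with the defaults above
```

For protein data the same functions can be called with `sigma=20`, and the quorum case with a smaller `q`, for example `challenging_d(13, q=10)` for 50% of 20 strings.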

Results
In this section we analyze the running times of PMS8 [7], TraverStringRef [11] and qPMS9 on several synthetic DNA and protein instances. For every instance of the problem we generated 5 datasets as described in the Methods section. For q = 100% we compare all three algorithms; for q = 50% we compare only the algorithms that solve the quorum PMS problem: TraverStringRef and qPMS9. All programs were executed on the Hornet cluster at the University of Connecticut, which is a high-end, 104-node, 1408-core High Performance Computing cluster. For our experiments we used Intel Xeon X5650 Westmere cores. All results refer to single-core execution unless specified otherwise.
In Table 3 we compare the three algorithms on DNA data when q = 100%. In Table 4 we show a similar comparison on protein data.
In Table 5 we compare TraverStringRef and qPMS9 on DNA data when q = 50%. In Table 6 we compare TraverStringRef and qPMS9 on protein data when q = 50%.
In Figure 3 we present the running time of qPMS9 on DNA datasets for all combinations of ℓ and d with ℓ up to 50 and d up to 25, with q = 100%. In Figure 4 we present the running time of qPMS9 on protein datasets for all combinations of ℓ and d with ℓ up to 30 and d up to 21, with q = 100%.

Discussion
We have presented qPMS9, an efficient algorithm for Quorum Planted Motif Search. The algorithm is based on the PMS8 algorithm [7]. qPMS9 includes a new procedure for exploring the search space and adds support for the quorum version of PMS. We compared qPMS9 with two state-of-the-art algorithms and showed that qPMS9 is very competitive. qPMS9 is the first algorithm to solve the challenging DNA instances (28, 12) and (30, 13). qPMS9 can also efficiently solve instances with larger ℓ and d, such as (50, 21) for DNA data or (30, 18) for protein data.
For future work, one of our reviewers kindly pointed out that our approach of filtering ℓ-mers based on Hamming distances could benefit from the work in Ref. 16.