## Abstract

Discovering patterns in biological sequences is a crucial problem. For example, the identification of patterns in DNA sequences has resulted in the determination of open reading frames, identification of gene promoter elements, intron/exon splicing sites, and SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have led to domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, discovery of short functional motifs, etc. In this paper we focus on the identification of an important class of patterns, namely, motifs. We study the (*ℓ*, *d*) motif search problem or Planted Motif Search (PMS). PMS receives as input *n* strings and two integers *ℓ* and *d*. It returns all sequences *M* of length *ℓ* that occur in each input string, where each occurrence differs from *M* in at most *d* positions. Another formulation is quorum PMS (qPMS), where the motif appears in at least *q*% of the strings. We introduce qPMS9, a parallel exact qPMS algorithm that offers significant runtime improvements on DNA and protein datasets. qPMS9 solves the challenging DNA (*ℓ*, *d*)-instances (28, 12) and (30, 13). The source code is available at https://code.google.com/p/qpms9/.

## Introduction

The Planted Motif Search (PMS) problem, also known as the (*ℓ*, *d*)-motif problem, was introduced in Ref. 1 with the aim of detecting motifs and significant conserved regions in a set of DNA or protein sequences. PMS receives as input *n* biological sequences and two integers *ℓ* and *d*. It returns all possible biological sequences *M* of length *ℓ* such that *M* occurs in each of the input strings, and each occurrence differs from *M* in at most *d* positions. Any such *M* is called a motif.

Buhler and Tompa^{2} employed PMS algorithms to find known transcriptional regulatory elements upstream of several eukaryotic genes. In particular, they used orthologous sequences from different organisms upstream of four different genes: preproinsulin, dihydrofolate reductase (DHFR), metallothioneins, and c-fos. These sequences are known to contain binding sites for specific transcription factors. Their algorithm successfully identified the experimentally determined transcription factor binding sites. They also employed their algorithm to solve the ribosome binding site problem for various prokaryotes. Eskin and Pevzner^{3} used their PMS algorithm, MITRA, to find composite regulatory patterns. They applied it to the upstream regions involved in purine metabolism from three *Pyrococcus* genomes. They also tested their algorithm on four sets of *S. cerevisiae* genes which are regulated by two transcription factors such that the transcription factor binding sites occur near each other. Price et al.^{4} employed their PatternBranching PMS technique to find motifs in a sample containing CRP binding sites in *E. coli*, in upstream regions from many organisms of the eukaryotic genes preproinsulin, DHFR, metallothionein, and c-fos, and in a sample of yeast promoter regions.

A problem that is very similar to (*ℓ*, *d*) motif search is the Closest Substring problem. The Closest Substring problem is essentially the PMS problem where the aim is to find the smallest *d* for which there exists at least one motif. These two problems have applications in PCR primer design, genetic probe design, discovering potential drug targets, antisense drug design, finding an unbiased consensus of a protein family, creating diagnostic probes and motif finding (see e.g.^{5}). Therefore, the development of efficient algorithms for solving the PMS problem is of active interest in biology and bioinformatics.

In a practical scenario, instances of the motif may not appear in all of the input strings. This has led to the introduction of a more general formulation of the problem, called quorum PMS (qPMS). In qPMS we are interested in motifs that appear in at least *q* percent of the *n* input strings. Therefore, the PMS problem is the same as qPMS when *q* = 100%.

The Closest Substring problem is NP-Hard^{5}. The Closest Substring problem can be solved by a linear number of calls to PMS. Therefore, there is a polynomial time reduction from Closest Substring to PMS, which means that the PMS problem is also NP-Hard. Because of this, all known exact algorithms have an exponential runtime in the worst case. Thus, it is important to develop efficient algorithms in practice. The practical performance of PMS algorithms is typically evaluated on datasets generated as follows (see refs 1, 6): 20 DNA/protein strings of length 600 are generated according to the independent identically distributed (i.i.d.) model. Similarly, a random motif (*ℓ*-mer) *M* is generated and “planted” at a random location in each input string (or in *q*% of the input strings for qPMS). Every planted instance of the motif is mutated in exactly *d* positions.

**Definition 1.** *An (ℓ, d) instance is defined to be a* **challenging instance** *if d is the largest integer for which the expected number of motifs of length ℓ that would occur in the input by random chance does not exceed a constant* (*500 in this paper, same as in* Ref. 7).

Intuitively the more we increase *d*, the more we increase the search space. However, if we increase *d* too much, we find many motifs just by random chance (spurious motifs). According to the above definition, the challenging instances for PMS are (13, 4), (15, 5), (17, 6), (19, 7), (21, 8), (23, 9), (25, 10), (26, 11), (28, 12), (30, 13), etc.

Note that in this paper we only address exact algorithms, which find all the existing motifs. Most of the exact PMS algorithms use a combination of two fundamental techniques. One is a sample driven technique and the other is a pattern driven technique. In the sample driven stage, the algorithm selects a tuple of *ℓ*-mers coming from distinct input strings. In the pattern driven stage, the algorithm generates the common *d*-neighborhood of the *ℓ*-mers in the tuple. Each such *ℓ*-mer becomes a motif candidate. The size of the tuple is usually fixed to a value such as 1 (see e.g.^{6,8,9}), 2 (see e.g.^{10}), 3 (see e.g.^{11,12,13,14}) or *n* (see e.g.^{1,15}). In contrast, PMS8^{7} and qPMS9 (this paper) utilize a variable tuple size, which adapts to the problem instance under consideration.

There are many PMS algorithms in the literature. In a previous paper^{7} we introduced the PMS8 algorithm. In the same paper we compared PMS8 with all the exact algorithms we could find in the literature of the previous five years, and showed that PMS8 outperforms these algorithms. Since the publication of PMS8, one other exact qPMS algorithm has been published, called TraverStringRef^{11}. Therefore, in this paper we compare qPMS9 with PMS8 and TraverStringRef.

The TraverStringRef algorithm^{11} is an algorithm for the qPMS problem, based on the earlier qPMS7^{14} algorithm. qPMS7^{14} can solve, for example, the challenging DNA instance (23,9) whereas TraverStringRef^{11} can solve (25,10), in a reasonable amount of time (no more than two days using commodity processors). In the case of the PMS problem, the PMS8 algorithm^{7} can solve the DNA instances (25,10), on a single core machine, and (26,11) on a multi-core machine. We have used PMS8 as the basis for the new qPMS9 algorithm. The qPMS9 algorithm extends PMS8 in several ways. First, qPMS9 introduces a search procedure which significantly increases performance by allowing for better pruning of the search space. Second, qPMS9 adds support for solving the qPMS problem, which was lacking in PMS8. We compare qPMS9 with PMS8^{7} and TraverStringRef^{11} on several DNA and protein instances.

## Methods

We start by defining the PMS and qPMS problems more formally. A string of length *ℓ* is called an *ℓ*-mer. Given two *ℓ*-mers *u* and *v*, the number of positions where the two *ℓ*-mers differ is called their Hamming distance and is denoted as *Hd*(*u*, *v*). For any string *T*, we denote the substring of *T* starting at position *i* and ending at position *j* by *T*[*i*..*j*].

**Definition 2.** *The PMS problem: Given n sequences s_{1}, s_{2}, …, s_{n}, over an alphabet Σ, and two integers ℓ and d, identify all ℓ-mers M, M ∈ Σ^{ℓ}, such that ∀i, 1 ≤ i ≤ n, ∃j_{i}, 1 ≤ j_{i} ≤ |s_{i}| − ℓ + 1, such that Hd(M, s_{i}[j_{i}..j_{i} + ℓ − 1]) ≤ d*.

**Definition 3.** *The qPMS problem: same as the PMS problem, however the motif appears in at least q% of the strings, instead of all of them. PMS is a special case of qPMS for which q* = *100%*.
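Definitions 2 and 3 can be illustrated with a brute-force membership check. The following is only a minimal sketch for intuition, not the qPMS9 procedure; the function names `hamming`, `occurs_in` and `is_motif` are hypothetical, and the quorum percentage `q` is assumed to be an integer that is converted to a string count by rounding up.

```python
def hamming(u, v):
    """Hamming distance Hd(u, v) between two equal-length strings."""
    if len(u) != len(v):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(u, v))


def occurs_in(M, s, d):
    """True if some len(M)-mer of s is within Hamming distance d of M."""
    ell = len(M)
    return any(hamming(M, s[j:j + ell]) <= d for j in range(len(s) - ell + 1))


def is_motif(M, strings, d, q=100):
    """True if M occurs, with at most d mismatches, in at least q% of the strings.

    Assumption: q is an integer percentage; the required number of strings
    is q*n/100 rounded up. With q = 100 this is the plain PMS condition.
    """
    needed = (q * len(strings) + 99) // 100
    hits = sum(occurs_in(M, s, d) for s in strings)
    return hits >= needed
```

For example, with the strings `["ACGTACGT", "TTACGAAA", "GGGGACCT"]`, the 4-mer `ACGT` satisfies the PMS condition for *d* = 1, but for *d* = 0 it occurs in only one of the three strings.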

Another useful notion is that of a *d*-neighborhood. Given a tuple of *ℓ*-mers *T* = (*t*_{1}, *t*_{2}, …, *t*_{s}), the common *d*-neighborhood of *T* includes all the *ℓ*-mers *r* such that *Hd*(*r*, *t*_{i}) ≤ *d*, ∀ 1 ≤ *i* ≤ *s*.
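To make the notion of a common *d*-neighborhood concrete, the sketch below enumerates it by exhaustive search over all |Σ|^{ℓ} candidate *ℓ*-mers. This is feasible only for very small *ℓ* and is not how PMS8 or qPMS9 generate neighborhoods; the function name `common_neighborhood` and the default DNA alphabet are assumptions made for illustration.

```python
from itertools import product


def hamming(u, v):
    """Hamming distance between two equal-length sequences of characters."""
    return sum(a != b for a, b in zip(u, v))


def common_neighborhood(T, d, alphabet="ACGT"):
    """All l-mers r with Hd(r, t) <= d for every t in the tuple T.

    Exhaustive enumeration in lexicographic order of `alphabet`;
    the cost grows as len(alphabet) ** len(T[0]).
    """
    ell = len(T[0])
    return [
        "".join(r)
        for r in product(alphabet, repeat=ell)
        if all(hamming(r, t) <= d for t in T)
    ]
```

For instance, the tuple (`AAA`, `AAT`) with *d* = 1 has the common neighborhood {`AAA`, `AAC`, `AAG`, `AAT`}, while (`AAA`, `TTT`) with *d* = 1 has an empty one.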

We now define the consensus *ℓ*-mer and the consensus total distance for a tuple of *ℓ*-mers. The consensus *ℓ*-mer for a tuple of *ℓ*-mers *T* = (*t*_{1}, …, *t*_{k}) is an *ℓ*-mer *u* where *u*[*i*] is the most common character among (*t*_{1}[*i*], *t*_{2}[*i*], …, *t*_{k}[*i*]) for each 1 ≤ *i* ≤ *ℓ*. If *p* is the consensus *ℓ*-mer for *T* then the consensus total distance of *T* is defined as *Cd*(*T*) = ∑_{i=1}^{k} *Hd*(*t*_{i}, *p*). While the consensus string is generally not a motif, the consensus total distance provides a lower bound on the total distance between any motif and a tuple of *ℓ*-mers.
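The consensus *ℓ*-mer and the consensus total distance can be computed column by column. The sketch below assumes that the consensus total distance is the sum of the Hamming distances from the consensus *ℓ*-mer to every member of the tuple, which is the reading consistent with the lower-bound remark above. The function names are hypothetical, and ties between equally frequent characters are broken by first occurrence in the column, which does not affect the total distance.

```python
from collections import Counter


def hamming(u, v):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))


def consensus(T):
    """Consensus l-mer: the most common character in each column of T."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*T))


def consensus_total_distance(T):
    """Assumed definition: sum of Hd(t, p) over t in T, p the consensus of T."""
    p = consensus(T)
    return sum(hamming(p, t) for t in T)
```

For the tuple (`ACGT`, `ACGA`, `TCGA`) the consensus is `ACGA` and the consensus total distance is 1 + 0 + 1 = 2.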

## qPMS9

As indicated previously, most of the motif search algorithms combine a sample driven approach with a pattern driven approach. In the sample driven part, tuples of *ℓ*-mers (*t*_{1}, *t*_{2}, …, *t*_{k}) are generated, where *t*_{i} is an *ℓ*-mer in *s*_{i}. Then, in the pattern driven part, for each tuple, its common *d*-neighborhood is generated. Every *ℓ*-mer in the neighborhood is a candidate motif. In PMS8^{7} and qPMS9, the tuple size *k* is variable. By default, a good value for *k* is estimated automatically based on the input parameters (see Ref. 7 for details), or *k* can be user specified.

## Tuple Generation

In the sample driven part of PMS8, tuples *T* = (*t*_{1}, *t*_{2}, …, *t*_{k}), where *t*_{i} is an *ℓ*-mer from string *s*_{i}, ∀*i* = 1..*k*, are generated based on the following principles. First, if *T* has a common *d*-neighborhood, then every subset of *T* has a common *d*-neighborhood. Second, for a motif to exist, there has to be at least one *ℓ*-mer *u* in each of the remaining strings *s*_{k+1}, *s*_{k+2}, …, *s*_{n} such that *T* ∪ {*u*} has a common *d*-neighborhood. We call such *ℓ*-mers *u* “alive” with respect to tuple *T*. As we add *ℓ*-mers to *T* we update the alive *ℓ*-mers and reorder the strings in increasing order of the number of alive *ℓ*-mers. This reordering reduces the running time because it leads to generating fewer tuples overall.

In qPMS9 we change the criteria by which the strings are reordered, as follows. Let *T* be the current tuple of *ℓ*-mers and let *u* be an alive *ℓ*-mer with respect to *T*. If we add *u* to *T*, then the consensus total distance of *T* increases. We compute this additional distance *Cd*(*T*∪{*u*}) − *Cd*(*T*). For each of the remaining strings, we compute the minimum additional distance for any alive *ℓ*-mer in that string. Then we sort the strings decreasingly by the minimum additional distance. Therefore, we give priority to the string with the largest minimum additional distance. If two strings have the same minimum additional distance, we give priority to the string with fewer alive *ℓ*-mers. The intuition is that larger additional distance could indicate more “diversity” among the *ℓ*-mers in the tuple, which means smaller common *d*-neighborhoods. The pseudocode for generating tuples *T* is given in Figure 1. We invoke the algorithm as *GenTuples*({}, *k*, *R*) where the matrix *R* contains all the *ℓ*-mers in all the input strings, grouped as one row per string.
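The reordering rule described above can be sketched as a sort key. This is an illustrative sketch, not the implementation from Figure 1: it assumes the alive *ℓ*-mers of each remaining string have already been computed and are passed in as the hypothetical argument `alive` (one non-empty list per remaining string), and it assumes the consensus total distance is the sum of Hamming distances to the consensus *ℓ*-mer.

```python
from collections import Counter


def consensus_total_distance(T):
    """Cd(T): per column, count the characters that differ from the majority.

    Equivalent to summing Hamming distances to the consensus l-mer.
    Returns 0 for an empty tuple.
    """
    return sum(len(col) - max(Counter(col).values()) for col in zip(*T))


def order_strings(T, alive):
    """Return indices of the remaining strings in priority order.

    alive[i] is the (assumed non-empty) list of alive l-mers of string i.
    Priority: largest minimum additional distance Cd(T + u) - Cd(T) first;
    ties are broken in favour of the string with fewer alive l-mers.
    """
    base = consensus_total_distance(T)

    def key(i):
        min_add = min(
            consensus_total_distance(list(T) + [u]) - base for u in alive[i]
        )
        return (-min_add, len(alive[i]))

    return sorted(range(len(alive)), key=key)
```

With *T* = (`AAAA`) and alive lists [`AAAT`, `AATT`], [`TTTT`] and [`AAAC`], the minimum additional distances are 1, 4 and 1, so the second string is processed first, followed by the third (fewer alive *ℓ*-mers) and then the first.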

## Neighborhood Generation

For every tuple *T*, obtained as described in the previous section, we generate the common *d*-neighbors of the *ℓ*-mers in the tuple. In qPMS9, the neighbor generation uses the same process as in PMS8^{7}. For the sake of completeness, we briefly review the process.

Given a tuple *T* = (*t*_{1}, *t*_{2}, …, *t*_{k}) of *ℓ*-mers, we want to generate all *ℓ*-mers *M* such that *Hd*(*t*_{i}, *M*) ≤ *d*, ∀*i* = 1..*k*. We traverse the tree of all possible *ℓ*-mers. A node at depth *r*, which represents an *r*-mer, is not explored deeper if certain pruning conditions are met. Necessary and sufficient conditions for 2 and 3 *ℓ*-mers to have a common neighbor are given in Ref. 7. The same paper gives necessary conditions for more than 3 *ℓ*-mers to have a common neighbor. The interested reader is referred to the PMS8 paper^{7} for a more in depth description of neighborhood generation.
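The tree traversal can be illustrated with a much weaker pruning rule than the ones in Ref. 7. In the sketch below, a node at depth *r* is abandoned as soon as its *r*-mer already has more than *d* mismatches against the length-*r* prefix of some tuple member. This prefix test is a simple necessary condition chosen for illustration; it is an assumption of this sketch and not the PMS8 pruning conditions. The function name `common_neighbors` is hypothetical.

```python
def common_neighbors(T, d, alphabet="ACGT"):
    """Depth-first traversal of the tree of l-mers with prefix pruning.

    mismatches[i] holds the Hamming distance between the current prefix
    and the prefix of T[i] of the same length. A branch is cut when any
    of these counts exceeds d, since extending the prefix can never
    reduce a mismatch count.
    """
    ell = len(T[0])
    result = []

    def extend(prefix, mismatches):
        r = len(prefix)
        if r == ell:
            result.append(prefix)
            return
        for c in alphabet:
            updated = [m + (t[r] != c) for m, t in zip(mismatches, T)]
            if max(updated) <= d:
                extend(prefix + c, updated)

    extend("", [0] * len(T))
    return result
```

On small inputs the output agrees with exhaustive enumeration; for example, (`AAA`, `AAT`) with *d* = 1 yields `AAA`, `AAC`, `AAG`, `AAT`. Stronger conditions, such as those in Ref. 7, cut branches earlier because they also account for the positions not yet fixed.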

## Adding Quorum Support

We extend the algorithm to solve the qPMS problem. In the qPMS problem, when we generate tuples, we may “skip” some of the strings entirely. This translates to the implementation as follows: in the PMS version we successively try every alive *ℓ*-mer in a given string by adding it to the tuple *T* and recursively calling the algorithm for the remaining strings. For the qPMS version we have an additional step where, if the value of *q* permits, we skip the current string and try *ℓ*-mers from the next string. At all times we keep track of how many strings we have skipped. The pseudocode for this algorithm is given in Figure 2. We invoke the algorithm as *QGenerateTuples*(*n* − *Q* + 1, {}, 0, *k*, *R*) where *Q* = ⌈*qn*/100⌉ and *R* contains all the *ℓ*-mers in all the strings.
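The skipping step can be sketched as an extra recursive branch. The code below is a simplified illustration and not the pseudocode of Figure 2: it processes the strings in their given order without reordering, it takes the number of strings *Q* that must contain the motif directly as a parameter, and it replaces the alive test with the weaker pairwise necessary condition *Hd*(*u*, *t*) ≤ 2*d* (two *ℓ*-mers with a common *d*-neighbor cannot be further apart than 2*d*). All names are hypothetical.

```python
def hamming(u, v):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))


def q_gen_tuples(strings, ell, d, Q, k):
    """Yield k-tuples of l-mers from distinct strings, skipping at most n - Q.

    Simplifications (assumptions of this sketch): fixed string order and a
    pairwise Hd <= 2d filter instead of the full alive test.
    """
    n = len(strings)
    rows = [[s[j:j + ell] for j in range(len(s) - ell + 1)] for s in strings]

    def rec(i, T, skipped):
        if len(T) == k:
            yield tuple(T)
            return
        if i == n:
            return
        # PMS step: try every compatible l-mer of the current string.
        for u in rows[i]:
            if all(hamming(u, t) <= 2 * d for t in T):
                yield from rec(i + 1, T + [u], skipped)
        # Quorum step: skip the current string if the quorum still permits.
        if skipped < n - Q:
            yield from rec(i + 1, T, skipped + 1)

    yield from rec(0, [], 0)
```

With the strings `AAAA`, `TTTT`, `AAAT`, *ℓ* = 4, *d* = 1 and *k* = 2, no tuple survives when *Q* = 3, whereas *Q* = 2 allows the second string to be skipped and yields the single tuple (`AAAA`, `AAAT`).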

## Parallel Algorithm

In PMS8^{7} the search space is split into *m* = |*s*_{1}| − *ℓ* + 1 independent subproblems *P*_{1}, *P*_{2}, …, *P*_{m}, where *P*_{i} explores the *d*-neighborhood of *ℓ*-mer *s*_{1}[*i*..*i* + *ℓ* − 1]. In the parallel implementation, processor 0 acts as both a master and a worker; the other processors are workers. Each worker requests a subproblem from the master, solves it, then repeats until all subproblems have been solved. Communication between processors is done using the Message Passing Interface (MPI).

In qPMS9, we extend the previous idea to the *q* version. We split the problem into subproblems *P*_{1,1}, *P*_{1,2}, …, *P*_{1,m_{1}}, *P*_{2,1}, *P*_{2,2}, …, *P*_{2,m_{2}}, …, *P*_{r,1}, *P*_{r,2}, …, *P*_{r,m_{r}}, where *r* = *n* − *Q* + 1 and *m*_{i} = |*s*_{i}| − *ℓ* + 1. Problem *P*_{i,j} explores the *d*-neighborhood of the *j*-th *ℓ*-mer in string *s*_{i} and searches for *ℓ*-mers *M* such that there are *Q* − 1 instances of *M* in strings *s*_{i+1}, …, *s*_{n}. Notice that *Q* is fixed, therefore subproblems *P*_{i,j} get progressively easier as *i* increases.
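The decomposition into subproblems can be sketched as a work list that a master process would hand out to workers. The sketch assumes that string *s*_{i} contributes |*s*_{i}| − *ℓ* + 1 subproblems, by analogy with the PMS8 split described above, and it only enumerates the index pairs (*i*, *j*); the MPI communication is not shown. The function name `subproblems` is hypothetical.

```python
def subproblems(lengths, ell, Q):
    """List the index pairs (i, j) of the subproblems P_{i,j}, 1-based.

    lengths[i - 1] is |s_i|. Only the first r = n - Q + 1 strings can hold
    the first occurrence of a motif that appears in at least Q strings.
    """
    n = len(lengths)
    r = n - Q + 1
    return [
        (i, j)
        for i in range(1, r + 1)
        for j in range(1, lengths[i - 1] - ell + 2)
    ]
```

For three strings of lengths 6, 5 and 7 with *ℓ* = 4 and *Q* = 2, this gives *r* = 2 and the five subproblems (1,1), (1,2), (1,3), (2,1), (2,2). Handing them out in this order gives workers the harder subproblems (small *i*) first.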

## Test Data Generation

As mentioned in the introduction, PMS algorithms are typically tested on datasets generated as follows. 20 strings of length 600 each are generated according to the i.i.d. model. We choose an *ℓ*-mer *M* as a motif and plant modified versions of it in *q*% of the *n* strings. Each planted instance is modified in exactly *d* random positions.
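A generator for such datasets might look like the following sketch. It assumes a uniform distribution over the alphabet, an integer percentage *q* rounded up to a string count, and a uniformly random choice of which strings receive a planted instance; these choices, like the function name `generate_instance` and its parameters, are assumptions of the sketch rather than details taken from the paper.

```python
import random


def generate_instance(n=20, m=600, ell=13, d=4, q=100, alphabet="ACGT", seed=None):
    """Return (motif, strings) for a synthetic planted-motif instance.

    Each of the n strings has length m and is drawn i.i.d. uniformly from
    `alphabet`. A random l-mer is planted in q% of the strings (rounded up),
    and every planted copy is mutated in exactly d distinct positions.
    """
    rng = random.Random(seed)
    motif = "".join(rng.choice(alphabet) for _ in range(ell))
    strings = ["".join(rng.choice(alphabet) for _ in range(m)) for _ in range(n)]
    n_planted = (q * n + 99) // 100
    for idx in rng.sample(range(n), n_planted):
        variant = list(motif)
        for pos in rng.sample(range(ell), d):
            # Replace with a different character so the position really changes.
            variant[pos] = rng.choice([c for c in alphabet if c != variant[pos]])
        start = rng.randrange(m - ell + 1)
        s = strings[idx]
        strings[idx] = s[:start] + "".join(variant) + s[start + ell:]
    return motif, strings
```

Because every planted copy differs from the motif in exactly *d* positions, the generated motif is guaranteed to be a valid (*ℓ*, *d*) motif for at least the planted strings.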

It is useful to estimate how many “spurious” motifs (motifs expected by random chance) will be found in a random sample. For that, we make the following observations. The probability that a random *ℓ*-mer *u* is within distance at most *d* from another *ℓ*-mer *v* is

$$p(\ell, d, \Sigma) = \sum_{i=0}^{d} \binom{\ell}{i} \left(1 - \frac{1}{|\Sigma|}\right)^{i} \left(\frac{1}{|\Sigma|}\right)^{\ell - i}$$

The probability that an *ℓ*-mer is within distance *d* from any of the *ℓ*-mers in a string *S* of length *m* is:

$$P(m, \ell, d, \Sigma) = 1 - \left(1 - p(\ell, d, \Sigma)\right)^{m - \ell + 1}$$

The probability that an *ℓ*-mer is within distance *d* from at least *q* out of *n* strings of length *m* each is:

$$Q(q, n, m, \ell, \Sigma) = \sum_{i=q}^{n} \binom{n}{i} P(m, \ell, d, \Sigma)^{i} \left(1 - P(m, \ell, d, \Sigma)\right)^{n - i}$$

Therefore, the expected number of motifs for a given qPMS instance is: |Σ|^{ℓ}*Q*(*q*, *n*, *m*, *ℓ*, Σ). Based on these formulas, we compute for every *ℓ* the largest value of *d* such that the number of spurious motifs does not exceed 500. These values are presented in table 1 for DNA and table 2 for protein.
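The estimate can be reproduced numerically. The sketch below assumes a uniform i.i.d. model: the probability that two random *ℓ*-mers are within distance *d* is a binomial tail with mismatch probability 1 − 1/|Σ|, the *m* − *ℓ* + 1 *ℓ*-mers of a string are treated as independent, and the quorum condition is a binomial tail over the *n* strings. The function names and default parameters (*n* = 20, *m* = 600, threshold 500) are chosen for illustration.

```python
from math import comb


def p_lmer(ell, d, sigma):
    """Probability that two random l-mers are within Hamming distance d."""
    miss = 1 - 1 / sigma
    return sum(comb(ell, i) * miss**i * (1 / sigma) ** (ell - i) for i in range(d + 1))


def p_string(m, ell, d, sigma):
    """Probability that an l-mer is within distance d of some l-mer of a string."""
    return 1 - (1 - p_lmer(ell, d, sigma)) ** (m - ell + 1)


def expected_spurious(ell, d, n=20, m=600, q=100, sigma=4):
    """Expected number of motifs arising by chance in a random instance."""
    needed = (q * n + 99) // 100  # strings that must contain the motif
    P = p_string(m, ell, d, sigma)
    tail = sum(comb(n, i) * P**i * (1 - P) ** (n - i) for i in range(needed, n + 1))
    return sigma**ell * tail


def largest_challenging_d(ell, limit=500, **kwargs):
    """Largest d whose expected number of spurious motifs stays within limit."""
    d = 0
    while d + 1 <= ell and expected_spurious(ell, d + 1, **kwargs) <= limit:
        d += 1
    return d
```

Under these assumptions, `largest_challenging_d(13)` and `largest_challenging_d(15)` evaluate to 4 and 5, matching the DNA challenging instances (13, 4) and (15, 5) listed in the introduction.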

## Results

In this section we analyze the running times of PMS8^{7}, TraverStringRef^{11} and qPMS9 on several synthetic DNA and protein instances. For every instance of the problem we generated 5 datasets as described in the Methods section. For *q* = 100% we compare all three algorithms; for *q* = 50% we compare only the algorithms that solve the quorum PMS problem: TraverStringRef and qPMS9. All programs were executed on the Hornet cluster at the University of Connecticut, which is a high-end, 104-node, 1408-core High Performance Computing cluster. For our experiments we used Intel Xeon X5650 Westmere cores. Results refer to single core execution unless specified otherwise.

In table 3 we compare the three algorithms on DNA data when *q* = 100%. In table 4 we show a similar comparison on protein data.

In table 5 we compare TraverStringRef and qPMS9 on DNA data when *q* = 50%. In table 6 we compare TraverStringRef and qPMS9 on protein data when *q* = 50%.

In Figure 3 we present the running time of qPMS9 on DNA datasets for all combinations of *ℓ* and *d* with *ℓ* up to 50 and *d* up to 25, with *q* = 100%. In Figure 4 we present the running time of qPMS9 on protein datasets for all combinations of *ℓ* and *d* with *ℓ* up to 30 and *d* up to 21, with *q* = 100%.

## Discussion

We have presented qPMS9, an efficient algorithm for Quorum Planted Motif Search. The algorithm is based on the PMS8 algorithm^{7}. qPMS9 includes a new procedure for exploring the search space and adds support for the quorum version of PMS. We compared qPMS9 with two state of the art algorithms and showed that qPMS9 is very competitive. qPMS9 is the first algorithm to solve the challenging DNA instances (28, 12) and (30, 13). qPMS9 can also efficiently solve instances with larger *ℓ* and *d* such as (50, 21) for DNA data or (30, 18) for protein data.

For future work, one of our reviewers kindly pointed out that our approach of filtering *ℓ*-mers based on Hamming distances could benefit from the work in Ref. 16.

## Change history

### Updated online 27 March 2015

A correction has been published and is appended to both the HTML and PDF versions of this paper. The error has not been fixed in the paper.

## References

1. Pevzner, P. A. & Sze, S.-H. Combinatorial approaches to finding subtle signals in DNA sequences. In: *Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, La Jolla / San Diego, CA, USA*, **vol. 8**, 269–278 (AAAI Press, 2000).
2. Buhler, J. & Tompa, M. Finding motifs using random projections. *J. Comp. Biol.* **9**, 225–242 (2002).
3. Eskin, E. & Pevzner, P. A. Finding composite regulatory patterns in DNA sequences. *Bioinformatics* **18**, 354–363 (2002).
4. Price, A., Ramabhadran, S. & Pevzner, P. A. Finding subtle motifs by branching from sample strings. *Bioinformatics* **19**, 149–155 (2003).
5. Kevin Lanctot, J., Li, M., Ma, B., Wang, S. & Zhang, L. Distinguishing string selection problems. *Inform. Comput.* **185**, 41–55 (2003).
6. Davila, J., Balla, S. & Rajasekaran, S. Fast and practical algorithms for planted (l, d) motif search. *IEEE/ACM Trans. Comput. Biol. Bioinf.* **4**, 544–552 (2007).
7. Nicolae, M. & Rajasekaran, S. Efficient sequential and parallel algorithms for planted motif search. *BMC Bioinformatics* **15**, 34 (2014).
8. Rajasekaran, S., Balla, S. & Huang, C.-H. Exact algorithms for planted motif problems. *J. Comp. Biol.* **12**, 1117–1128 (2005).
9. Rajasekaran, S. & Dinh, H. A speedup technique for (l, d)-motif finding algorithms. *BMC Res Notes* **4**, 54 (2011).
10. Yu, Q., Huo, H., Zhang, Y. & Guo, H. Pairmotif: A new pattern-driven algorithm for planted (*l*, *d*) DNA motif search. *PLoS ONE* **7**, e48442 (2012).
11. Tanaka, S. Improved exact enumerative algorithms for the planted (l, d)-motif search problem. *IEEE/ACM Trans. Comput. Biol. Bioinf.* **11**, 361–374 (2014).
12. Dinh, H., Rajasekaran, S. & Kundeti, V. Pms5: an efficient exact algorithm for the (*l*, *d*)-motif finding problem. *BMC Bioinformatics* **12**, 410 (2011).
13. Bandyopadhyay, S., Sahni, S. & Rajasekaran, S. Pms6: A fast algorithm for motif discovery. In: *IEEE 2nd International Conference on Computational Advances in Bio and Medical Sciences, ICCABS 2012, Las Vegas, NV, USA, February 23–25, 2012*, 1–6 (IEEE, 2012).
14. Dinh, H., Rajasekaran, S. & Davila, J. qpms7: A fast algorithm for finding (*l*, *d*)-motifs in DNA and protein sequences. *PLoS ONE* **7**, e41425 (2012).
15. Roy, I. & Aluru, S. Finding motifs in biological sequences using the micron automata processor. In: *2014 IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS 14, Washington, DC, USA*, 415–424 (IEEE, 2014).
16. Peterlongo, P., Pisanti, N., Boyer, F., do Lago, A. P. & Sagot, M.-F. Lossless filter for multiple repetitions with Hamming distance. *JDA* **6**, 497–509 (2008).

## Author information

## Affiliations

### Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

- Marius Nicolae
- Sanguthevar Rajasekaran


## Contributions

M.N. and S.R. designed and analyzed the algorithms. M.N. implemented the algorithms and carried out the empirical experiments. M.N. and S.R. analyzed the empirical results and drafted the manuscript. All authors read and approved the final manuscript.

## Competing interests

The authors declare no competing financial interests.

## Corresponding author

Correspondence to Marius Nicolae.


This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder in order to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/