detectMITE: A novel approach to detect miniature inverted repeat transposable elements in genomes

Miniature inverted repeat transposable elements (MITEs) are prevalent in eukaryotic genomes, including plants and animals. Classified as a type of non-autonomous DNA transposable elements, they play important roles in genome organization and evolution. Comprehensive and accurate genome-wide detection of MITEs in various eukaryotic genomes can improve our understanding of their origins, transposition processes, regulatory mechanisms, and biological relevance with regard to gene structures, expression, and regulation. In this paper, we present a new MATLAB-based program called detectMITE that employs a novel numeric calculation algorithm to replace conventional string matching algorithms in MITE detection, adopts the Lempel-Ziv complexity algorithm to filter out MITE candidates with low complexity, and utilizes the powerful clustering program CD-HIT to cluster similar MITEs into MITE families. Using the rice genome as test data, we found that detectMITE can more accurately, comprehensively, and efficiently detect MITEs on a genome-wide scale than other popular MITE detection tools. Through comparison with the potential MITEs annotated in Repbase, the widely used eukaryotic repeat database, detectMITE has been shown to find known and novel MITEs with a complete structure and full-length copies in the genome. detectMITE is an open source tool (https://sourceforge.net/projects/detectmite).

Several databases (e.g., Repbase 30,31 , P-MITE 29 , BrassicaTED 32 ) provide MITE annotations for different species. As the most widely used database of eukaryotic repetitive and transposable elements, Repbase 30,31 contains different types of repeat elements, including MITEs, from various species. P-MITE 29 is a database for MITEs detected in 41 plant species using MITE-Hunter, MITE Digger and RSPB. BrassicaTED is a specialized database for Brassica species, which contains MITEs, TRIMs (Terminal Repeat Retrotransposon in Miniatures), and SINEs (Short Interspersed Elements).
Generally speaking, there are three main challenges in genome-wide detection of MITEs: (1) the rapid, comprehensive and accurate detection of putative MITE sequences in genomes, (2) the effective filtration of false positive cases from putative MITE candidates, and (3) the efficient clustering of similar MITE sequences into distinctive MITE families. To address these challenges, we developed a novel MATLAB-based program called detectMITE, which employs a complex-number-based numeric calculation to replace conventional string matching algorithms in MITE detection on a genome scale. To filter out false positives, we adopted the Lempel-Ziv complexity algorithm for filtering low-complexity sequences and utilized a filtration strategy that is based on sequence similarity among MITE flanks 21 . detectMITE uses an effective and accurate clustering program called CD-HIT 33,34 to cluster similar MITEs into distinctive MITE families. Our comparative data analysis shows that detectMITE can more comprehensively, accurately, and efficiently detect MITEs on a genome-wide scale than MITE-Hunter, MITE Digger and RSPB, all of which are capable of processing genome-scale inputs.

Methods
Replacing conventional string matching algorithms for inverted repeat detection, we have created findIR 35 , which utilizes prime-number-based numeric calculation and manipulation to identify perfect inverted repeats, and detectIR 36 , which deploys complex-number-based numeric calculation for detecting both perfect and imperfect inverted repeats. Both tools have demonstrated their capability to more efficiently, accurately, and comprehensively detect perfect and imperfect inverted repeats than other popular tools 35,36 . As non-autonomous DNA transposons, MITEs are characterized by their terminal inverted repeats. Consequently, the core algorithm of detectIR in inverted repeat detection has been adopted and modified by detectMITE. Due to special structure requirements of MITEs (as shown in Fig. 1) and other constraints, detectMITE required new functions, including detection of target site duplication, clustering of similar MITE candidates into distinctive MITE families, and reducing false positive cases of MITEs. As shown in Fig. 2, the core algorithm of detectMITE includes the following five main steps: Detection of MITE candidate sequences with TIR and TSD. For a given genome, all sequence fragments that contain a TIR pair at their ends (default length = 10 nt, see Fig. 1), being flanked by a TSD (2-10 nt), and have a length between 50 and 800 nt will be identified in this step. First, each genomic sequence input (i.e., individual chromosome sequences) will be mapped into a numeric vector of complex numbers using the mapping score schema: (A → 1, T → − 1, C → j, G → − j). As the score summation of the subsequence's corresponding vector, the cumulative scores will be calculated for all subsequences with a length of 10 nt. If the sum of the cumulative scores of any two subsequences located within a range of 50~800 nt is C (a complex number), and if the sum of the absolute values of the real part and the imaginary part of C is ≤ 2, then the two subsequences are potential terminal inverted repeats. Next, the potential TSD -a direct repeat pair flanking the TIR pair -will be searched and validated (i.e., the cumulative scores for the two target sites must be exactly the same, and the two target sites have the same length of 2-10 nt). Through robust numerical vector calculation of MATLAB, all subsequences, i.e., MITE candidate sequences with the same length, can be searched exhaustively and validated efficiently. Since numerical calculation enables an efficient and exhaustive search, all putative MITEs that meet the defined criteria will be identified and kept for the downstream analysis. detectIR 36 can detect both perfect inverted repeats with two completely reverse complementary halves (stem) and imperfect inverted repeats with a middle non-palindromic spacer (loop) and non-complementary pairs in the stem. Unfortunately, in its most recent version it cannot detect inverted repeats with indels inside the stem 36 . Correspondingly, detectMITE is also incapable of detecting MITEs with indel(s) in their terminal inverted repeats. Even with this limitation, detectMITE has demonstrated its capability for more accurate and comprehensive detection of MITEs on a genome scale in comparison with three popular tools (see Results).
Filtration of MITE candidates with low complexity. Because low complexity sequences are rare in real MITEs 21 , we need to filter out MITE candidates having low complexity in their sequences. The DUST program 37 integrating BLAST has been often used to identify low complexity sequences [38][39][40] . This program has also been utilized by MITE-Hunter 21 to filter out MITE candidates with low complexity. In detectMITE, we replaced DUST with the Lempel-Ziv complexity algorithm, which is frequently used in biosignal analysis 41,42 . As shown in Supplementary Fig. S1, our Lempel-Ziv complexity analysis for MITEs identified by MITE-Hunter and RSPB indicated that many reported MITEs still have low complexity sequences, which are unlikely to be valid MITEs. In detectMITE, each putative MITE that meets one of the following criteria was filtered out as a false positive: (1) the TIR contains a homopolymer or dinucleotide stretch of a length ≥ 8 nt, (2) the TIR contains low G/C or A/T content (default < 20%), (3) the Lempel-Ziv complexity value of the sequence is less than 0.675, and (4) if the target site length is 2, the target site is not 'TA' . Similar criteria have been adopted by others to reduce false positive cases of MITEs 21,26,28 . Clustering of similar MITEs into MITE families. As transposable elements, MITEs move within genomes, leading to multiple copies distributed along the whole genomes. Accordingly, filtering out putative MITEs with low mobility (i.e., low copy number) in genomes can effectively reduce the false positive cases in MITE detection. The prerequisite for determining and counting the copy number of a specific putative MITE candidate is to cluster identical or highly similar candidate MITEs with full copy lengths together. Among the existing tools for genome-wide MITE detection (e.g., MITE-Hunter 21 , MITE Digger 28 , and RSPB 13 ), blastn-based clustering approaches have been utilized to cluster similar MITEs into MITE families. Because blastn-based clustering is usually time-consuming and reports fragmented sequences 33 , CD-HIT was adopted in detectMITE. CD-HIT adopts short word filters and a greedy strategy to avoid unnecessary comparisons and reduce redundant computations dramatically in clustering 33,34,43 . In detectMITE, candidate MITEs with similarity (i.e., the number of match bases/the length of the shorter sequence) ≥ 80% and coverage rate (i.e., the aligned length/the length of the longer sequence) ≥ 99% will be grouped into the same MITE families. After clustering, MITE families containing few members will be filtered out (i.e., having fewer than 3 members).

Filtration of MITE family members in terms of their flanking sequence similarity. When a MITE
is transposed into different genomic locations, it is less likely that its flanking sequences will also be transmitted together 21,28 . Therefore, within a given MITE family generated from the aforementioned clustering step, we will keep the valid MITE members that have different flanking sequences in order to count the copy number of this family across the entire genome conservatively, reducing false positives. To compare flanking sequences, we extracted 50 nt sequences from both sides of a candidate MITE (see Fig. 1), and conducted pairwise alignments to identify sequence similarity. For a given MITE family, left flanks and right flanks are compared respectively using pairwise alignments; no comparison is conducted between left and right flanks. If two left (or right) flanks share at least 50% similarity (i.e., ≥ 25 bases matched in their pairwise alignments), only one MITE will be kept in this MITE family. Finally, all valid members retained for a given MITE family must have different left and right flanking sequences. As shown in Fig. 2D, a MITE family has 5 full-length copies (putative MITE candidates), left flanking sequences of candidate 1 and candidate 2 have high similarity, and the right flanking sequences of candidate 4 and candidate 5 have high similarity. Candidate 2 and candidate 5 were removed in this step, so the family has 3 full-length valid members that have different left and right flanking sequences.

Selection of the representative sequence for each MITE family with enough members.
After the filtration process in the previous step, the MITE families with at least 3 valid members were retained as valid families, while others were recognized as invalid, false positive cases and filtered out. For each valid MITE family, we will select a representative sequence to represent that family (see Fig. 2E). If a family has n distinctive valid members, the similarity score (i.e., optimal local alignment score) between any two member sequences i and j is score(i, j). The representativeness_score of sequence i is defined as, Here, sequences mean the valid MITE members flanked by a TSD pair. Then, the sequence with the highest score will be selected as the representative sequence. MITE-Hunter uses multiple sequence alignment of each family to generate a consensus sequence to represent the corresponding family. In detectMITE, we use representative sequences to replace consensus sequences that may contain mismatches/indels due to multiple sequence alignment, ensuring that the representative sequences can be unambiguously positioned in the genome. Unlike MITE-Hunter, MITE Digger and RSPB, we use strict criteria when clustering similar MITEs into MITE families. For instance, in MITE-Hunter, a MITE is validated if it has at least three full-length copies characterized by TIRs and flanking TSDs, and all MITE sequences are clustered into MITE families using the 80-80-80 rule from all-against-all blastn results 21 : i.e., two sequences will be classified into the same family if both of them have a length of ≥ 80 nt and share sequence similarity of ≥ 80% in at least 80% aligned sequences. In contrast, RSPB adopts the E-value of ≤ 10 −10 rule to generate final MITE families 13 : i.e., two sequences will be classified into the same family if they have a valid blastn hit with an E-value of ≤ 10 −10 . Apparently, loose clustering criteria in MITE-Hunter, MITE Digger and RSPB tend to cluster similar MITE sequences into a smaller number of MITE families with more members whereas strict clustering criteria in detectMITE would result in a larger number of smaller MITE families. The rationale for us to do this is to retain the completeness and validity of MITE members within a given MITE family as best we can, without losing accurate structural information that can be advantageous in further downstream data analyses including genome annotation. For a given MITE family generated using loose clustering criteria, a representative or consensus sequence cannot always represent faithfully the sequence and structural characteristics of all MITE members within that family. In contrast, the representative sequence of a MITE family generated using strict clustering criteria can be directly used to retrieve its members in the genome with precise boundaries and high sequence similarity.
The entire algorithms were implemented into a package of MATLAB scripts, which require pre-installation of CD-HIT 33,34 . detectMITE is an open-source tool (https://sourceforge.net/projects/detectmite). All the tests were performed using Ubuntu 12.04 (precise) 64-bit platform with Intel Xeon (2.00 GHz) processors, 4 CPU cores and 128 GB RAM.

Results
To test the performance of detectMITE, we used detectMITE to detect MITEs in the Oryza sativa genome (MSU Rice Genome Annotation Project Release 6.1) and compared the detection results using MITE-Hunter, MITE Digger, and RSPB (see Table 1; outputs of each tool are available at: http://sourceforge.net/projects/detectmite/ files/Supplementary_Data.7z).
As shown in Table 1 29 . Among 179415 MITEs reported by PSPB, only 56391 (i.e., 31.4%) were labeled as complete sequences that were supposed to have complete terminal inverted repeats, whereas the others were labeled as partial sequences 29 .
nt. MITE Digger took 15.44 hours to identify 332 MITE families, each of which has a representative sequence 28 . RSPB identified 179415 MITE sequences using more time than MITE Digger, and used blastn (E-value ≤ 10 −10 ) to group them into 497 families 29,13 . Obviously, detectMITE is more efficient than these popular tools. Apparently, RSPB identified many more MITEs than the other three tools, but the majority of its detected MITEs (i.e., 68.6%) lack the complete structure of a typical MITE.
To evaluate MITE detection accuracy, the detection result of detectMITE was compared with both Repbase 30 and the outputs of MITE-Hunter, MITE Digger, and RSPB individually. Since P-MITE database contains a mixture of outputs from MITE-Hunter, MITE Digger, and RSPB 29 , P-MITE is not used for our comparison. Because detect-MITE, MITE-Hunter, and MITE Digger only detect MITEs with complete structures, we filtered out all partial MITE sequences in the output of RSPB and kept 56391 sequences that were labelled as complete sequences for comparison (see Table 1).
Comparison of the outputs of MITE-Hunter, MITE Digger, RSPB and detectMITE with Repbase data. Repbase is a comprehensive repeat database that contains both transposon elements and other repeats such as tandem repeats 30,44 . It has been widely utilized in genome annotation [45][46][47] . As short non-autonomous DNA transposons (Class II), MITEs are not explicitly annotated and labeled in the current Repbase release 30,31 . Therefore, we extracted out all Class II non-autonomous TEs with a length of 50-800 nt from Repbase as our reference dataset for comparison. In Oryza sativa, there are 217 Class II non-autonomous TEs annotated in Repbase, and 162 of them have a length of 50-800 nt. We used blastn (E-value ≤ 10 −10 ) to compare the outputs of MITE-Hunter, MITE Digger, RSPB, and detectMITE with these 162 Repbase reference sequences. The comparison results are shown in Fig. 3 (the relevant data is available at: http://sourceforge.net/projects/detectmite/files/ Supplementary_Data.7z).
As shown in Fig. 3, among 162 Repbase reference sequences, 48, 94, 15, and 49 are not detected by MITE-Hunter, MITE Digger, RSPB, and detectMITE respectively. Obviously, RSPB detected many more sequences in the Repbase data than other three tools. The major reason for this is that the RSPB detection result contains many sequences having short/diverse TIRs (i.e., TIR pairs with a lower degree of pairing), low full-length copy number, or no flanking TSD, which will be compared and discussed in detail in the next section.
For the 49 sequences missed by detectMITE, we manually checked their structures and retrieved their full-length copies in genome using blastn. We found that 5 of them do not have complete TIR structures (ECSR, GLUTEL1LIKE, POP-OL2, TOURIST-XIII and WUJI), 27 have low full-length copy numbers that do not meet our cutoff of ≥ 3 (CASIN, COWARD-2, F1275, HEARTBLEEDING, ID-2, LIER, OSTE23, OSTE26, SEVERIN, STONE, TOURIST-XV, WUWU and STOWAWAY [15,16,19,24,25,26,27,28,29,30- Clearly, almost all MITEs in Repbase can be detected by detectMITE, MITE-Hunter, and RSPB effectively, while MITE Digger missed too many cases (i.e., 94). In other words, the performance of detectMITE, MITE-Hunter, and RSPB in terms of Repbase annotation appears to be comparable. Although RSPB can match more sequences in the Repbase data, many of its so-called "complete" sequences still lack the complete and canonical structure of MITEs and/or do not meet our criteria for being a valid MITE member (see below).

Comparison of detectMITE with MITE-Hunter, MITE Digger, and RSPB individually. Since
MITE-Hunter, MITE Digger, and RSPB are the most popular tools for genome-wide detection of both known and novel MITEs, we compared MITE detection results in the rice genome between detectMITE and each of these three tools individually using blastn (E-value ≤ 10 −10 ). For comparison purposes, the detection results have been divided into three categories: (1) sequences identified by both detectMITE and MITE-Hunter (or MITE Digger, RSPB), (2) sequences identified only by MITE-Hunter (or MITE Digger, RSPB), and (3) sequences identified only by detectMITE. All the relevant data for comparisons are available at http://sourceforge.net/projects/detectmite/ files/Supplementary_Data.7z.
As described previously, different tools use different criteria to cluster similar MITE sequences into distinctive MITE families. In order to make the comparisons more convincing, we conducted all-against-all blastn for all representative sequences of 4790 MITE families detected by detectMITE and adopted the 80-80-80 rule utilized by MITE-Hunter 21 to further cluster these MITE families into super-families. Accordingly, the aforementioned 4790 MITE families were classified into 1821 super-families.
As shown in Fig. 4A, 728 (or 728/1821 ≈ 40%) super-families (i.e., 3397 MITE families) in detectMITE output match with 403 (or 403/578 ≈ 70%) consensus sequences in MITE-Hunter output, while 175 (or 175/578 ≈ 30%) consensus sequences in MITE-Hunter do not match any sequence in detectMITE output and 1093 (or 1093/1821 ≈ 60%) super-families (i.e., 1393 MITE families) in detectMITE output do not match any sequence in MITE-Hunter output. In detailed analysis of these 175 sequences, we found that 76 of them have low full-length copy numbers (< 3), 37 have high mismatch pairs in their TIRs, 23 do not bear a TIR, 1 has high A/T content in its TIR, and only 38 have full-length copies ≥ 3. Among these 38 cases, 21 have at least 3 valid full-length copies that bear good TIRs and TSDs and have distinct flanks. Therefore, this suggests that detectMITE missed 21 cases in comparison with MITE-Hunter, whereas MITE-Hunter missed 1093 super-families (i.e., 1393 MITE families) identified by detectMITE.
In  In Fig. 4C, 1269 (or 1269/1821 ≈ 70%) super-families (i.e., 4021 MITE families) in detectMITE output match the 38244 (or 38244/56391 ≈ 68%) MITE sequences in RSPB output, whereas 18147 (or 18147/56391 ≈ 32%) sequences in RSPB output do not match any sequence in detectMITE output and 552 (or 552/1821 ≈ 30%) super-families (i.e., 769 MITE families) in detectMITE output do not match any sequence in RSPB output. For 18147 sequences unique in RSPB output, we will check if they have more than 3 complete copies in the genome. Through clustering similar sequences by our criteria (i.e., similarity ≥ 80% and coverage rate ≥ 99%), we obtained 13397 groups. Among them, only 795 groups have full-length copy number ≥ 3. Here, we do not consider the low copy number groups. For those groups that have copy number ≥ 3, we generated a multiple sequence alignment for each group and manually checked the alignment quality. Generally, they can be classified into the following categories (see Supplementary Fig. S2): (1) 46 groups do not have clear TIRs, (2) 344 groups contain too many mismatches in TIRs, (3) TIRs of 4 groups have high A/T content, (4) 305 groups have a low number of full-length copies with complete TIRs, and (5) 96 groups have complete TIRs with at least three full-length copies in the genome. For these 96 groups, we further retrieved their flanking sequences in genome and found that only 16 of them have ≥ 3 valid copies with a clear TIR and TSD and distinctive flanks. Therefore, detectMITE only missed 16 cases detected by RSPB, which possess canonical MITE structures and have at least 3 valid full length copies with distinctive flanks in the genome. On the other hand, 552 super-families (i.e., 769 MITE families) reported by detectMITE are completely missed by RSPB.
For the MITE families uniquely detected by detectMITE in individual pair-wise comparisons with MITE-Hunter, MITE Digger, and RSPB respectively, we generated multiple sequence alignment for each family and manually examined the alignment results using BioEdit 48 . We found that all of these families meet our definition of canonical MITEs, having full-length copies ≥ 3, bearing clear TIR, and flanked by TSD (see Supplementary  There are 509 super-families/669 families detected by detectMITE but missed by all three other tools (MITE Digger, MITE-Hunter, and RSPB). Moreover, when we adopted a looser clustering rule -the E-value of ≤ 10 −10 rule used in RSPB (i.e., two sequences will be classified into the same family/group if they have a valid blastn hit with an E-value of ≤ 10 −10 ), the 4790 MITE families detected by detectMITE can be further clustered into 843 groups. Among these groups, 581, 703 and 335 do not have a valid match (E-value of ≤ 10 −10 ) with the outputs of MITE-Hunter, MITE Digger and RSPB respectively. (http://sourceforge.net/projects/detectmite/ files/Supplementary_Data.7z). Therefore, even with these two different loose clustering rules (i.e., the 80-80-80 rule and the E-value of ≤ 10 −10 rule), detectMITE still shows its capability of detecting many more MITEs than MITE-Hunter, MITE Digger and RSPB.
Moreover, the detectMITE output definitely contains fewer false positive cases of MITEs due to the structural requirement (i.e., clear TIRs flanked by TSD) and copy number constraint (i.e., full-length copy number of distinctive valid members with different flanking sequences ≥ 3) that we have enforced in our algorithms. If we examine the MITEs uniquely identified by MITE-Hunter, MITE Digger and RSPB, respectively, in comparison with detectMITE, these tools find many false positive MITEs that lack these important copy number and structural requirements. Among  Supplementary Fig. S5. Since the repeat sequences in TIGR Plant Repeat Database were obtained using homology-based methods that take advantages of GenBank and other public annotations 49 , the likelihood that these matched MITEs are real MITEs is high. Clearly, these results can demonstrate the reliability of detectMITE in finding novel MITEs.

Discussion
To fully elucidate the origins, functions, and biological relevance of MITEs, we need to comprehensively, accurately, and effectively detect the ubiquitous MITEs hidden in eukaryotic genomes. Due to the well-defined structures of MITEs, many tools are available for performing MITE detection. However, the complex organizations and compositions of genomes make the accurate, comprehensive, and effective detection of MITE very challenging. That explains why accurate and effective tools for MITE detection are currently rare. FINDMITE and MUST are structure-based methods for MITE detection, but have high false-positive rates in their outputs and cannot deal with genome-scale inputs 21 . Homology-based methods can only detect known MITEs and are mostly applicable in the discovery of MITEs between closely related genomes 22 . Using both de novo and structure-based approaches, MITE-Hunter and MITE Digger clearly improve the accuracy of genome-wide MITE detection, but can only detect a portion of MITEs hidden in genomes 21,28,29 . RSPB is essentially a mixture of both de novo and homology-based methods, but generates outputs that often include lots of sequences without a typical or complete structure of canonical MITEs. Furthermore, RSPB is time-and resource-consuming in its execution.
From our data analysis using the rice genome, it is clear that detectMITE can more comprehensively and accurately detect MITEs than the three popular tools for MITE detection. detectMITE is faster than MITE Digger, which is considered the most efficient tool in MITE detection so far 29 . As mentioned previously, detectMITE cannot detect MITEs that bear indels in their terminal inverted repeats. Nevertheless, the numerical approach for searching inverted repeats, either perfect ones or imperfect ones with mismatched/non-complementary pairs, can be more exhaustive and comprehensive than conventional string matching approaches 35,36 . This is why detectMITE is capable of detecting many more MITEs with a complete and canonical MITE structure hidden in genomes than popular string matching tools, even with its inability to detect MITEs with indels within TIRs. detectMITE has taken advantage of robust vector calculation power of MATLAB, which explains why detectMITE is very efficient in its detection.
Using the Lempel-Ziv complexity algorithm, detectMITE can identify many low complexity sequences that MITE-Hunter and RSPB cannot find. detectMITE adopted the notion that sequence similarities are only shared in the internal sequences of different members in a MITE family, whereas the flanking sequences are not supposed to be transposed together 21,28 . Then, detectMITE uses a PSA (Pairwise Sequence Alignment) method to find the number of valid full-length members (copies) in a given family that bear different flanking sequences 21  of similar MITE sequences into distinctive MITE families is the most time-consuming and resource-demanding process in MITE detection. detectMITE utilizes the more efficient clustering program CD-HIT to replace blastn and ensures that only highly similar sequences (≥ 80%) with high coverage (≥ 99%) can be clustered together.
As the rice genome is the well-studied genome in MITEs research, we used the rice genome as our test data to evaluate the performance and reliability of detectMITE in MITE detection. In comparison with known MITEs annotated in Repbase, detectMITE missed 2 cases, MITE-Hunter missed 3 cases, and RSPB missed 1 case, demonstrating that detectMITE, MITE-Hunter, and RSPB have comparable abilities in annotating known MITEs accurately. Compared to MITE-Hunter, MITE Digger and RSPB, detectMITE performs with higher efficiency and can detect many MITEs that are missed by these tools, as well as by Repbase (see Figs 3 and 4). Although detectMITE certainly misses some cases when compared with these tools, it can detect many more sequences that meet the criteria of MITEs than MITE-Hunter (1093 super-families vs. 21), MITE Digger (1489 super-families vs. 4), and RSPB (552 super-families vs. 16). Even with loose clustering criteria (i.e., RSPB's E-value of ≤ 10 −10 rule), detect-MITE still demonstrates its advantage of finding more MITEs than its competitors. More importantly, the detection result of detectMITE clearly contains fewer false positives due to the structure constraint (e.g., with clear TIR and TSD) and copy number constraint (at least 3 valid, full-length copies with different flank sequences). This makes detectMITE competitive in MITE detection, since detection results of MITE-Hunter, MITE Digger and RSPB often contain many false positives, requiring tedious manual checks. Furthermore, detectMITE provides information on accurate positions and length of flanking TSDs for each sequence in its output.
In conclusion, we present a novel numeric-calculation-based program detectMITE that can more comprehensively, accurately, and effectively identify MITEs in genomes than other available tools. Without a doubt, detectMITE is a valuable addition to the research community studying MITEs and other transposon elements. Computational methods, however, can only utilize different features of MITEs (e.g. sequence structures and similarities, as well as genome-wide copy numbers) to justify whether a candidate sequence is a valid MITE or not. To determine whether a novel candidate is a genuine MITE or not in reality, further wet-lab experiments are clearly needed. In the future, we will work to improve the core algorithm so that terminal inverted repeats with indels in the paring stem can be detected using numeric calculation approaches. Also, a mixed strategy that integrates homology-based approaches, e.g., blast search for well-defined MITE families detected by detectMITE, can be used to annotate additional potential MITEs in genomes.