MINTmap: fast and exhaustive profiling of nuclear and mitochondrial tRNA fragments from short RNA-seq data

Transfer RNA fragments (tRFs) are an established class of constitutive regulatory molecules that arise from precursor and mature tRNAs. RNA deep sequencing (RNA-seq) has greatly facilitated the study of tRFs. However, the repeat nature of the tRNA templates and the idiosyncrasies of tRNA sequences necessitate the development and use of methodologies that differ markedly from those used to analyze RNA-seq data when studying microRNAs (miRNAs) or messenger RNAs (mRNAs). Here we present MINTmap (for MItochondrial and Nuclear TRF mapping), a method and a software package that was developed specifically for the quick, deterministic and exhaustive identification of tRFs in short RNA-seq datasets. In addition to identifying them, MINTmap is able to unambiguously calculate and report both raw and normalized abundances for the discovered tRFs. Furthermore, to ensure specificity, MINTmap identifies the subset of discovered tRFs that could be originating outside of tRNA space and flags them as candidate false positives. Our comparative analysis shows that MINTmap exhibits superior sensitivity and specificity to other available methods while also being exceptionally fast. The MINTmap codes are available through https://github.com/TJU-CMC-Org/MINTmap/ under an open source GNU GPL v3.0 license.

tRFs are constitutive regulators. Fast accumulating evidence is beginning to uncover significant roles for full-length tRNAs that go well beyond their conventional roles in translating mRNAs 14,15 . Similarly, tRNA-derived fragments, originally thought to be transcriptional noise, are now known to be constitutive in nature and to have expression profiles that depend on many variables (detailed below) in healthy individuals and in cancer patients 2 . We summarize the key findings so far in the following paragraphs.
We have shown that mitochondrially-encoded tRNAs produce a rich repertoire of tRFs 2 just like the nuclearly-encoded tRNAs. We also showed that the locations of cleavage differ distinctly between mitochondrially-encoded and nuclearly-encoded isodecoders of the same isoacceptor, i.e. differ between counterpart sequences. There are multiple cleavage points that occur at specific sites and give rise to quantized tRF lengths: importantly, the resulting tRFs are overlapping and not consecutive on the mature tRNA. These findings suggest that the tRFs are not random products and do not result from post-transcriptional modifications of tRNA sites that might inhibit sequencing. Moreover, we have discovered the new category of internal tRFs or "i-tRFs" (Fig. 1 and also section 'Nomenclature and Structural Categories of tRNA Fragments' below). Our analyses of hundreds of human transcriptomes showed that i-tRFs are produced by both nuclearly-encoded and mitochondrially-encoded tRNAs, and are present in different tissues and in hundreds of individuals 2 . We also showed that i-tRF lengths can vary considerably from as short as 16 nt to as long as 36 nt, and possibly longer 2 . i-tRFs are fully contained within the span of the mature tRNA, can begin and end at a multitude of positions, and can precede, follow, or straddle the anticodon (e.g. see Fig. 2 of our previous publication 2 ). These findings indicate that mature tRNAs produce a richer repertoire of fragments than originally thought.
Using hundreds of human transcriptomes, we showed that mature tRNAs will give rise to distinct tRF profiles in a manner that depends on the tissue at hand, the tissue's state (health vs. cancer) and the disease subtype as well as on a individual's attributes like sex, population origin, and race 2 . The profiles of tRFs also differ across cell types 2,19 . These findings indicate that the production of tRFs is affected by a number of important patient-centered variables. By comparing the tRF profiles of healthy individuals and cancer patients we found that choice of cleavage points and the relative abundances of tRFs remain unchanged across individuals who belong to the same sex, to the same race and to the same population 2 , in complete analogy to our reported findings on miRNA isoforms (isomiRs) 5,6 . These findings further support the constitutive nature of tRFs within human tissues. The cleavage points and relative abundances changed between health and disease, and also across disease subtypes.
From a functional standpoint, the longer tRNA halves have been shown to have roles that range from inhibition of protein synthesis in eukaryotes 8,12 and archaea 16 , to enhancing cell proliferation 11 . On the other hand, the shorter tRNA fragments (i.e. 5′ -tRFs, i-tRFs and 3′ -tRFs) have been shown to have more diverse functions some of which include the regulation of expression of protein-coding genes 2,9,10,19,21,30 . These findings further support the regulatory nature of tRFs. 5′ -tRFs, i-tRFs, and 3′ -tRFs have been shown to enter the RNA interference (RNAi) pathway through Argonaute loading and, thus, to have regulatory roles through direct interaction with targets that are currently unknown 18,19 . On a related note, the production of 5′ -tRFs was shown to be DICER-dependent in mammals and the fruit fly but DICER-independent in yeast 18,19,21,31,32 . Also, our analyses of the human breast cancer cell lines BT-474 (Her2+ ), MDA-MB-231 (ER− /PR− /Her2− ) and MCF7 (ER+ ) showed that whether a given 5′ -tRF, i-tRF or 3′ -tRF will be loaded on Argonaute depends on the cell type 2 . These findings further support the participation of some tRF categories in regulation.
In addition to affecting targets through direct molecular contact, tRFs have also been shown to regulate other transcripts through molecular competition, also known as "decoying. " Specifically, it was recently shown that three i-tRFs that straddle the anticodon (i-tRF GTATCCCCGCCTGTCACG from tRNA AspGTC , i-tRF GGATTCGGCGCTCTCACC from tRNA GluCTC , and i-tRF TAGCATAGCTGCCTTCCA from tRNA GlyTCC ) and are induced by hypoxic-stress can counteract post-transcriptionally the stabilization of oncogenic mRNAs by competing for binding to YBX1, an RNA binding protein 33 . These findings indicate that the newly discovered i-tRFs also have regulatory roles.
With regard to the longer tRNA halves (Fig. 1), their production has been shown to depend on cellular stress 12,13,34,35 . Their production is also modulated by the presence of the sex hormones estrogen and androgen 11 . These findings suggest that the production of the (longer) tRNA halves can be driven by external factors and by context.
In male mice that were subjected to a protein-restricted diet, the abundances of multiple (short) tRFs were found to be altered throughout their reproductive tract and in the mature sperm, compared to control mice 36 . In maturing mouse sperm, 5′ -tRFs from tRNA GlyGCC were shown to suppress protein-coding transcripts that are driven by the endogenous retroelement MERVL 36 . These findings suggest that the production of the (shorter) tRFs can be driven by external factors and that these tRFs affect protein expression.
tRNA halves as well as shorter tRNA fragments have been linked to viral and bacterial infections as well. For example, an 18 nt 3′ -tRF from tRNA LysTTT in HIV-infected human MT4 T-cells was shown to inhibit HIV-1 replication by ~80% and to be dependent on DICER and Argonaute 2 (AGO2) 37 . Also, RSV infections were found to increase the production of tRNA fragments by as much as 20x compared to mock infections 38 : among the fragments found to be up-regulated, a 5′ -tRH from tRNA GluCTC was shown to promote viral replication 39 . Similarly, production of 5′-tRHs, primarily from tRNA ValGTG and tRNA GlyGCC , and of 5′ -tRFs, i-tRFs and 3′ -tRFs increased 2-4x in the lungs of mice following infection by Rickettsia 40 . These findings suggest that tRNA halves and the shorter tRFs play important roles in bacterial and viral infections.
The above multifaceted evidence that has accumulated in a few short years warrants viewing tRFs as a previously unrecognized, bona fide category of ncRNAs. It is reasonable to conjecture that the roles of tRNA fragments likely extend beyond what has been uncovered to date. Additionally, the current knowledge about the biogenesis of tRFs remains very limited, despite the multitude of settings in which these molecules have been encountered. We also note that fragments have also been found that are produced from the precursor tRNA molecule. These fragments are also "tRFs" and functions for a few of them have been reported already 7,9 . The emerging importance of tRFs as regulators and their complex dependencies on the source tissue, the tissue state, and an individual's attributes are making the ability to analyze transcriptomic data (short RNA-seq) of paramount importance.
Scientific RepoRts | 7:41184 | DOI: 10.1038/srep41184 tRNA-halves. We use the Angiogenin cleavage sites that have been reported in the literature to define 5′ -halves and 3′ -halves. Specifically, if A 1 A 2 A 3 is the anticodon triplet and n 1 n 2 A 1 A 2 A 3 n 3 n 4 denotes the anticodon loop and the sequence immediately surrounding the triplet, then 5′ -tRHs begin at position + 1 of a tRNA (or, − 1 in the case of tRNA His ) and terminate at any of the four underlined positions n 1 ∇ n 2 ∇ A 1 ∇ A 2 ∇ A 3 n 3 n 4 (each ∇ denotes a reported Angiogenin cleavage site). Analogously, 3′ -tRHs can only begin at any of the four underlined positions n 1 ∇ n 2 ∇ A 1 ∇ A 2 ∇ A 3 n 3 n 4 and terminate in the first C, the second C, or the A of the non-templated CCA addition. 5′-tRFs. 5′ -tRFs are fragments that begin at the 5′ end of the tRNA (− 1 for tRNA His -see also text) and end in a position other than n 1 , n 2 , A 1 , or A 2 . In other words, a 5′ -tRF can end either before n 1 , or after A 2 in which case it can potentially include positions to the right of n 4 .
i-tRFs. i-tRFs are fragments that begin at either position + 2 or further to the right and end before the first C of the non-templated CCA addition.
3′-tRFs. 3′ -tRFs are fragments that begin at a position other than n 2 , A 1 , A 2 or A 3 and terminate at the first C, the second C, or the A of the non-templated CCA addition. In other words, a 3′ -tRF can begin either after A 3 or before n 2 including potentially positions to the left of n 1 .
We also distinguish between tRNAs whose isodecoders are harbored by the nuclear genome ("nuclearly-encoded") and those harbored by the mitochondrial ("mitochondrially-encoded") genome. Our use of "nuclear" or "mitochondrial" as an adjective will refer to the genomic source of a mature tRNA, of an isodecoder, or of a tRNA fragment, and not to the sub-cellular location of the corresponding transcript. This choice of nomenclature is necessitated by our recent discovery that the nuclear genomes of H. sapiens 41 and of other organisms 42 contain multiple sequences that best resemble the organism's mitochondrial tRNAs. The nuclear genome of H. sapiens harbors 351 "tRNA-lookalikes" whose closest sequence match is one of the 22 human mitochondrial tRNAs. As these tRNA-lookalikes contain sub-sequences of various lengths that are identical to the organism's bona fide mitochondrial tRNAs they represent a source of complication during the mapping of next generation sequencing data.
The problem of tRF discovery in context. Seemingly simple, mapping short RNA-seq datasets to the human genome for the purpose of determining the profiles of the various classes of tRFs is a very involved undertaking. Several factors complicate this task and include: tRNAs are repeat elements themselves; tRNA isodecoders corresponding to the same amino acid can, and usually do, share extended regions of similarity confounding one's attempts to determine a fragment's source; for some short and long fragments it is not possible to unambiguously establish their tRNA nature; the human nuclear genome is riddled with hundreds of sequences that resemble mitochondrially-encoded tRNAs; the human nuclear genome is also riddled with hundreds of partial tRNA sequences whose lengths are ~1/2 of the 72 nt that are characteristic of a mature tRNA; mature tRNAs are post-transcriptionally modified through the addition of extra bases; etc. In our description of MINTmap's implementation below, we describe each of these complicating factors in more detail and describe how each of them is tackled by MINTmap and by other methods.

Results
So far, we have used MINTmap to analyze 843 public short RNA-seq datasets from a variety of cellular contexts 2,11 . We are currently hosting the results of these analyses in MINTbase 1 , which was developed in the context of a parallel project. MINTbase is an unrestricted, interactive web-based resource that allows users to interrogate the data by tRF structural category, isoacceptor, isodecoder, nucleic acid, sequence, genomic position, etc. User queries can be directed at tRFs produced by nuclearly-encoded tRNA, mitochondrially-encoded tRNAs, or both simultaneously. MINTbase is accessible at http://cm.jefferson.edu/MINTbase/. Next, we describe in detail the specific considerations that informed our design of MINTmap and the rationale behind our various design choices. As already mentioned, the sequence space of tRNAs is characterized by a number of idiosyncratic features. These features need to be taken into account explicitly when seeking to identify and quantitate tRFs in short RNA-seq datasets. Where relevant, we also describe the approaches taken by previous methods and discuss in detail the resulting complications. For clarity of presentation, we focus on H. sapiens. We also describe how we achieved exceptional runtime performance despite the computational demands of the task at hand. Design goals. We designed MINTmap with four goals in mind: (1) maximize sensitivity; (2) maximize specificity; (3) identify and report those fragments whose tRNA origin is inherently ambiguous (possible false positives); and (4) make the generated tRF expression profile adequately detailed and easy to incorporate into the users' downstream analyses. MINTmap achieves this through an exhaustive, genome-wide search. MINTmap comprises twelve steps that are described in Methods. The provided detail should permit anyone to re-implement MINTmap at will. We note that the output is automatically saved both as a tab-separated plaintext file and as an html file (see also below). We stress that MINTmap was also designed to ensure data confidentiality. In fact, all of the data are generated locally on the user's machine. The plaintext file allows the user to comprehensively study the mined tRFs without any need to connect to our MINTbase 1 repository. Should the user decide to contribute his or her data to MINTbase, we make this possible through MINTsubmit an automated submission tool that is part of MINTbase and can be accessed from https://cm.jefferson.edu/MINTsubmit/.
Establishing a tRNA reference set. Central to identifying tRFs among the transcripts of a tissue or cell of interest is the definition of the reference tRNA space, i.e. the union of genomic locations that harbor the DNA templates of tRNAs. In order to define the tRNA space, we need a comprehensive collection of tRNA genes for the genome of interest. Several groups currently maintain databases that are updated frequently and enumerate the tRNAs encoded in the nuclear or the mitochondrial genomes of various organisms; these databases include tRNAdb/mitotRNAdb 43 , tRNAdb-CE 44 , Mamit-tRNA 45 , and gtRNAdb 46 . For our work, we favor gtRNAdb because it is comprehensive and cross-links to other repositories such as NCBI Gene (http://www.ncbi.nlm.nih. Scientific RepoRts | 7:41184 | DOI: 10.1038/srep41184 gov/gene) and HGNC (http://www.genenames.org/). Specifically, we use release v1.0 of gtRNAdb 46 and has been stable and vetted over the course of several years. IMPORTANT: We highly recommend that the earlier release, v1.0, of gtRNAdb be used instead of the more recent release v2.0. Indeed, we have found release v2.0 to contain inconsistencies 47 that would directly affect the quality of the reported results.
The current implementation of MINTmap is for the human genome. To allow interested users to extend the codes to other genomes as well, we have designed the approach in a very generic way and describe all its steps in detail in Methods.
We use a reference set of tRNAs that includes the following 640 tRNA sequences: i) The 508 true tRNAs and 102 pseudo-tRNAs from release v1.0 of gtRNAdb 46 . Selenocysteine tRNAs, tRNAs with undetermined anticodon identity, and tRNAs mapping to contigs that are not part of the human genome assembly are not included in the tRNA reference set. ii) The 22 known human mitochondrial tRNA sequences. These are not listed among the tRNA sequences of gtRNAdb. Nonetheless, we strongly recommend the inclusion of mitochondrial tRNAs to the list of reference tRNAs. This recommendation is dictated by our recent discovery that mitochondrial tRNAs, just like their nuclearly-encoded counterparts, also produce tRNA fragments (5′ -tRFs, i-tRFs, 3′ -tRFs, 5′ -tRHs and 3′ -tRHs) with abundances that are modulated by an individual's sex, population origin and race as well as by tissue type and disease subtype 2  Correct profiling of tRFs requires that all isodecoders of an isoacceptor be considered. The tRNA space is characterized by an inherent hierarchy at the top of which lay the amino acids ( Fig. 2A). Many isoacceptors are grouped under a single amino acid (one distinct anticodon per isoacceptor). In turn, many isodecoders are grouped under a single anticodon whose respective DNA templates are spread across the nuclear chromosomes. For a given anticodon, the sequences of its isodecoders generally differ from one another. We highlight this last statement with an example. Figure 2B shows an alignment of the sequences of 20 isodecoders of the anticodon GTC of aspartic acid (Asp). First, we note that, even though these tRNA genes correspond to the same amino acid/anticodon combination, their sequences generally differ from one another. Second, we can recognize specific subsets of highly-similar sequences: e.g. rows 2 through 12 inclusive form one such subset; rows 13 and 14 form another subset; rows 19 and 20 form a third subset; etc. Third, we see that the same short segment can be present in various arrangements in multiple isodecoders of the same amino acid/anticodon combination: e.g. the segments highlighted in yellow, green, and dark blue in Fig. 2B are characteristic such examples.
The above examples show that it is not always possible to determine the source of a sequenced read at the tRNA gene level. For example, reads matching the dark blue subsequence, TTCCCCGACGGGGAG-CCA (all sequences that we use here are shown in 5′ → 3′ orientation), could be produced by any of 14 possible genomic loci. In such cases, one can only associate the respective fragment with the corresponding anticodon (GTC), but not with a specific isodecoder (tRNA gene).
Further complicating matters, a given sequence segment can be shared by tRNA genes from different anticodons of the same amino acid, or, by tRNA genes from different amino acids. Figure 2C shows several such examples. The gray segment (GGGGGTGTAGCTCAGTGGTAGAGC) is present in two isodecoders of tRNA AlaTGC and in isodecoders for tRNA AlaAGC , tRNA AlaCGC , tRNA CysGCA , and tRNA ValAAC respectively. The cyan (ACGAGGCCCCGGGTTC) and yellow (GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTT) segments are shared by two tRNA isoacceptors of Ala (tRNA AlaAGC and tRNA AlaCGC ). Finally, the green segment (GGGGGTGTAGCTCAGTGGTAGAGCA) is shared by an Ala tRNA gene (tRNA AlaTGC ) and a Cys tRNA gene (tRNA CysGCA ).
Another complication that arises from the potential provenance of a given sequence from multiple genomic locations within the tRNA reference set is that of "multiple counting." These shared sequences and the corresponding genomic locations cannot be treated as independent; doing so would bias all downstream analyses as a result. Our strategy for overcoming this problem is to take a "sequence-centric" approach through extensive bookkeeping: we count each such molecule/sequence only once, no matter how many times the molecule's sequence appears on the genome. In other words, MINTmap reports expression at the level of individual molecules/sequences. Let us look at the example of Fig. 2C: MINTmap calculates and reports expression values at the level of GGGGGTGTAGCTCAGTGGTAGAGCA (and not at the level of tRNA AlaTGC and tRNA CysGCA ) while still reporting in its output the fact that this segment can be originating from an isodecoder of either tRNA AlaTGC or tRNA CysGCA .
This sequence-centric approach is necessary because such shared sequence segments are rather frequent in the human tRNA space, as evidenced by the examples of Fig. 2B,C. To facilitate this approach and also capture this ambiguity, we recently introduced a novel labeling scheme for tRFs that is sequence-centric 1 . The new scheme leads to compact tRF labels, requires no brokering service (see also: https://cm.jefferson.edu/MINTcodes/) and Scientific RepoRts | 7:41184 | DOI: 10.1038/srep41184 differs from the genome-centric labeling scheme we proposed originally 2 . In the next section, we describe these two labeling schemes. MINTbase 1 uses both labeling schemes in order to provide the user with flexibility. As it is an intrinsic aim of mapping algorithms to identify the genomic origins of a sequenced read, MINTmap will generate and report all of the potential genomic tRNA origins of a tRF. In addition, MINTmap directly links each such fragment to its corresponding "tRF Summary Record" in MINTbase 1 (http://cm.jefferson.edu/MINTbase/). Figure 3 shows an example tRF Summar y record. The tRF is a 5′ -tRF with sequence GCATTGGTGGTTCAGTGGTAGAATTCTCGC and license plate tRF-30-PNR8YP9LON4V. This tRF is present in two isodecoders of tRNA GlyCCC and eight isodecoders of tRNA GlyGCC and has been found to be very abundant in more than a dozen public datasets. However, as the record indicates, it is potentially a false positive because this sequence also exists outside of the known tRNA space and thus could be transcribed from a locus that is unrelated to tRNAs (see the subsection named 'The need to tag fragments that can originate inside and outside of tRNA space' for a discussion). As mentioned above, the pattern of cleavage points of a given isodecoder, and thus the profile of tRFs it generates, changes as a function of tissue, sex, population origin, race, and disease subtype 2 . Thus, the information listed in the tRF Summary records has the potential to change as more short RNA-seq datasets are added to MINTbase.
A "genome-centric" and a "sequence-centric" scheme for labeling tRFs. We use two complementary schemes to label tRFs. We introduced the first in our earlier work 2 and based it on the genome assembly and the naming convention used by gtRNAdb 46 . Recently, we introduced the second scheme and based it on the actual sequence of a tRF 1 .
The genome-centric scheme augments the notation already in use by gtRNAdb. For example, to refer to a 19-nt-long i-tRF that spans positions 35 through 53 inclusive of the mature tRNA derived from the isodecoder trna5 of tRNA AspGTC ("trna5-AspGTC" in gtRNAdb notation) we use trna5_AspGTC_12_ + _98897281_98897352@35.53.19 as the label. The extended label also includes information about the chromosome ("12"), the strand ("+ "), and the genomic location of the fragment ("98897281" through "98897352" inclusive). The sequence-centric scheme decouples a tRF from its genomic origin and any given genome assembly while acknowledging the fact that tRFs can be found in distinct isodecoders of the same anticodon or of different anticodons (e.g. the tRFs highlighted in Fig. 2B,C). This sequence-centric scheme allows anyone to generate a unique label for their tRF of interest without any requirement for a brokering service such as the one that has been in use for miRNAs 48 . Importantly, since these labels are based on the tRF's sequence, they are universally consistent and do not change with time ("persistent"). These properties give researchers at different institutions the ability to generate the same label for the same tRF without delay, to use the label as a shorthand in a manuscript, to easily compare data across publications, etc. We refer to this unique label as the tRF's "license plate". Any tRF can be mapped to a unique license-plate and any license plate can be mapped back unambiguously to the tRF sequence it represents. The license-plate uses a base-32 system that relies on the following alphanumeric symbols: B, D, 0, E, F, 1, H, I, 2, J, K, 3, L, M, 4, N, O, 5, P, Q, 6, R, S, 7, U, V, 8, W, X, 9, Y, Z (A, C, G, or T are not allowed symbols). As an example, tRF-16-E0PRRND is the label for ACGAGGCCCCGGGTTC, the cyan tRF of Fig. 2C. The label for the tRF GGGGGTGTAGCTCAGTGGTAGAGC would be tRF-24-RK9P4P9LH9. Detailed information on the method used to generate the license plates can be found in our previous publication 1 . The codes that a user can employ to generate license plates are freely available at https://cm.jefferson.edu/MINTcodes/. It is important to note that the algorithm for generating license plates is deterministic and entirely sequence-based. The immediate, practical implication of this is that any user can generate license plates for his/her favorite tRFs at will and without having to rely on "curation" by a centralized team. Also, because the license plate is derived from the tRF's sequence the label will transcend changes to genome assemblies, possible changes to the nomenclature of tRNA genes, additions to the tRNA space, etc. Equally importantly, the scheme allows independent researchers to generate the exact same label for the same tRF, which will in turn ensure that labels remain consistent across publications and independently of the timing of manuscript submission.
The need to tag fragments that can originate inside and outside of tRNA space. We now discuss in more detail the case of sequence segments that can exist both inside and outside of tRNA space (see Fig. 3 for such an example). As we demonstrate below, this phenomenon is pervasive and must be taken into account explicitly during tRF profiling.
The complexity of tRNA space and the presence of repeats in the human genome necessitates disambiguation of the potential origin of a sequenced read before it is allowed to contribute to the expression of a tRF 2,41,42,49 . As an example, the sequences GTTCAATTCCCTGATGGG and GTGGTAGAGCATTTGACT (in red rectangles in Fig. 2B,C respectively) are present in the shown tRNA AspGTC and tRNA CysGCA isodecoders but are not exclusive to tRNA space. Similarly, the sequence GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT which was reported to be abundant in diseased liver 22 exists both inside and outside of tRNA space 49 , and, thus, may not be of tRNA origin necessarily.
The factors that complicate the analysis of tRFs in this regard and must be addressed computationally during tRF profiling include: 1. The human genome harbors numerous full-length tRNA-lookalike sequences. We recently reported that the human nuclear genome harbors numerous previously unreported sequences that resemble the known nuclear and mitochondrial tRNAs 41 . We found 497 such loci that we termed "tRNA-lookalikes. " Specifically, 351 of the 497 are lookalikes of the 22 mitochondrially-encoded tRNAs and the remaining 146 are lookalikes of the nuclearly-encoded tRNAs. The presence of lookalikes of mitochondrial tRNAs in the nuclear genome is not unique to humans: in fact, we found similarly high numbers of lookalikes in other primates and marsupials 42 , an observation that suggests the possibility of a process that is orchestrated and evolutionarily significant. Eight of the 497 lookalikes are full-length, exact copies of mitochondrial tRNAs and we have included them in the "true tRNA space" (see above). Neither the transcriptional characteristics nor the true nature of the lookalikes are known currently (e.g. they could be functioning as tRNAs, producing aberrant transcripts, or not be transcribed at all). Thus, MINTmap will tag any segments that are shared by the tRNA Reference set and the 489 (= 497-8) tRNA-lookalikes as "not-exclusive to tRNA space" (Fig. 3) to alert users to the possibility that the corresponding sequence may be a false positive tRF.
2. The human genome harbors numerous incomplete tRNA sequences. The human nuclear genome also contains hundreds of incomplete tRNA sequences of variable lengths 2,41,49 . Indeed, the "tRNA" class of Repeat-Masker (http://www.repeatmasker.org; Library 2014/01/31) includes 716 genomic entries each of which corresponds to sequence segments that are ≤ 50 nt long 49 . We illustrate this situation in Fig. 4 with the help of five isodecoders of tRNA IleTAT . The genomic DNA of each tRNA gene contains an intron that is excised following transcription. As can be seen from the alignment shown in this Figure, the two exons have virtually identical sequences. However, the first exon (shown in blue) contains a 24 nt segment (black rectangle) that also exists by itself in a different chromosome (chr7:+ :44465584-44465621) and does not form part of a tRNA gene: indeed, the sequence after it (in 5′ → 3′ orientation) does not align with the second exon of tRNA IleTAT . Therefore, any sequenced reads that map exactly (i.e. without any mismatches or insertions/deletions) to this 24 nt segment have an ambiguous genomic origin, cannot be confidently declared bona fide tRFs, and ought to be treated as possible false positives. Reads mapping to segments that are shared by the tRNA Reference set and these incomplete tRNA sequences will be tagged by MINTmap as being "not-exclusive to tRNA space" (Fig. 3) to alert users to the possibility that the corresponding sequence may be a false positive tRF. sequences do not exists on the genome. 31 of these introns range from 12 to 24 nt whereas one pseudo-tRNA has an intron that is 104 nt long. MINTmap incorporates additional bookkeeping to accommodate sequenced reads that correspond to tRFs spanning exon-exon junctions. Intron-spanning fragments that could exist on the genome at a location unrelated to tRNA space are flagged as candidate false positives 49 .
The need to accommodate the non-templated CCA addition. Integral to the maturation of tRNAs is the addition of the trinucleotide "CCA" at the 3′ end of the pre-mature molecule [50][51][52] . The trinucleotide is added post-transcriptionally and is not present in the DNA sequence of the tRNA precursor. Were the sequenced reads to be mapped to an unmodified version of the human genome or to an unmodified version of the tRNA Reference set, the 3′ -tRFs would not be captured accurately, especially if the mapping step did not allow mismatches (see below). Previous studies tackled this matter by either allowing for mismatches at the 3′ end of the mapped reads, by masking these three nucleotides 53 , or by explicitly adding the trinucleotide sequence at the end of the tRNAs [54][55][56] . Of these three approaches, the first two carry the risk of introducing false positives or by making the classification of the fragment ambiguous (3′ -tRF vs. i-tRF). MINTmap uses the third approach. CCA-ending fragments that could exist on the genome at a location unrelated to tRNA space are flagged as candidate false positives 49 .
The need to accommodate the 5′ guanylation of tRNA His . In most eukaryotes, the tRNAs encoding the anticodon for histidine (His) require the post-transcriptional addition of a G nucleotide at their 5′ end before they can be recognized by their cognate tRNA synthetase 57 . This post-transcriptionally added nucleotide is described as the "− 1 nucleotide" in the literature 58,59 . To properly enumerate and quantitate 5′ -tRFs from tRNA His , MINTmap makes explicit provisions to permit sequenced reads to extend to the non-templated − 1 nucleotide of the His isodecoders. Currently, tRNA His is the only tRNA known to admit post-transcriptionally a 5′ addition of a nucleotide.
Mapping exactly, deterministically, and exhaustively. Because of the highly repetitive nature of tRNA sequences and the shared sequence segments that we illustrated above (Fig. 2B,C), MINTmap maps reads exactly (no mismatches, and no insertions or deletions allowed), deterministically (no reliance on probabilistic schemes for identifying read matches), and exhaustively (looks at the full genome and enumerates all possible hits for a read).
MINTmap permits only exact matches thereby greatly improving one's ability to disambiguate the genomic origin of a given read. A common practice encountered in mapping algorithms is to allow for a few mismatches between the sequenced read and the reference genome to accommodate the possibility of sequencing errors. However, as we demonstrated previously 2 , allowing mismatches introduces errors, can generate false positives, and complicates one's efforts to determine a read's genomic origin.
A characteristic example can be seen in Fig. 2C: the sequence GGGGGTGTAGCTCAGTGGTAGAGC is followed by GC in tRNA107-AlaTGC, by AC in tRNA113-AlaTGC, by AT in tRNA7-CysGCA, and by GT in tRNA115-ValAAC. If mismatches were allowed, a sequenced read that extends two nt past the end of GGGGGTGTAGCTCAGTGGTAGAGC would end up supporting a 5′ -tRF from at least one wrong isodecoder.
The exhaustive character of MINTmap is complemented by its deterministic nature. Many alignment algorithms use metrics that evaluate the goodness of an alignment in its genomic context. However, such an approach is not appropriate for a class of transcripts such as the tRFs. Indeed, tRFs are very diverse in terms of length, spanning from as few as 16 nt to as many as 36 nt, or even more. Also, tRFs are very diverse in terms of their GC content, which ranges between ~4% and ~94%. Moreover, many read mappers employ seed-based matching, an approach that increases performance at the expense of sensitivity.
How to achieve high performance. An important consideration for MINTmap is the need to analyze large datasets very speedily despite the competing requirement that the entire genome be examined in the process. MINTmap achieves its exceptional performance with the help of a lookup table that comprises all possible, distinct tRNA fragments with lengths between 16 and 50 nt. The choice of the upper limit of 50 nt is dictated by the currently available experimental evidence, and can be trivially changed to accommodate longer fragments. The lower limit of 16 nt is due to the fact that most of the 14 nt and 15 nt fragments also exist outside of tRNA space 49 -see also http://biorxiv.org/content/early/2016/07/07/061572. The provided lookup table contains information for each tRNA fragment, whether it is found solely in tRNA space (exclusive) or also found in genomic regions outside of tRNA space (ambiguous). Given that the sequence space of tRNAs is known, we began by enumerating all possible sequence fragments that the space can generate. Then, for each such fragment, we searched the entire genome deterministically and exhaustively, recorded whether we could find an exact match outside of tRNA space, or not, and used the information to populate the table.
For the above-mentioned tRNA Reference set there can be no more than 594,972 distinct tRNA fragments with lengths between 16 and 50 nt inclusive. Once the tRNA Reference set has been defined and augmented as described in Methods (introns, CCA, − 1 position in tRNA His ), this lookup table comprising the distinct tRNA fragments can be built "off-line. " The lookup table need only be built once and re-used in all subsequent computations. Each of the 594,972 entries of the lookup table also contains metadata indicating whether the corresponding sequence is exclusive to the tRNA space being used.
Methodological differences with previous methods. Several previous publications approached the profiling of tRNA fragments by mapping sequenced reads on only the space of tRNA sequences 22,[60][61][62] . This is typically done by first creating a database that comprises tRNA sequences, downloaded from e.g. gtRNAdb or tRNAdb, and then mapping to it adapter-trimmed and quality-filtered sequenced reads. However, as we discussed above, such an approach has the potential to generate false positives, lead to a miscalculation of tRF abundances, and misrepresent the profile of the present fragments 2,49 -see also http://biorxiv.org/content/ early/2016/07/07/061572.
MINTmap looks across the complete genome instead, and not solely at tRNA space. This approach allows MINTmap to determine for each sequenced read whether it maps exclusively to tRNA space, or not, and to notify the user accordingly. More specifically, any tRFs that do not map exclusively to tRNA space are flagged as candidate false positives and are reported in their own separate table in the generated output. By taking this approach, MINTmap gives researchers the option to focus on fragments that originate exclusively from tRNA space, or to relax this constraint for improved sensitivity at concomitant increase of the level of noise.
Runtime performance for the typical user. With an eye towards user-friendliness, we focused on helping the typical user, who simply wants to generate tRF expression profiles for their small RNA-seq datasets and does not generally have the time or skills to deal with a complex installation. To this end, we already executed steps 1-8 of Methods on behalf of the users (Methods), built the required lookup table, and include it in the MINTmap code bundle. What remains is for the user to apply their own preprocessing to the RNA-seq dataset at hand (step 9), and then execute steps 10-12 using the provided MINTmap codes.
With the provided MINTmap codes, a user can profile a 100-million-read short RNA-seq dataset (Illumina) and identify and report all present tRFs in under four minutes on a single-core 2.70 GHz Intel Xeon CPU E5-4650. If color-space reads are used, the time required to execute steps 10-12 will be a function of the third party aligner, e.g. Shrimp 63 or BWA 64 , used to execute the mapping step (see Methods for details).
The generated output. The output of MINTmap consists of two tables each listing the expression profiles of identified tRFs. The first table comprises sequences that are exclusive to tRNA space, and, thus, bona fide tRFs. The second table comprises sequences that, at the genome level, exist both inside and outside of tRNA space, and, thus, are candidate false positives.
In terms of format, each of the two tables is provided as (a) plaintext, and, (b) HTML (four output tables in total). The plaintext, tab-separated format allows for easy file manipulations and software import for downstream analyses (see also "Design goals" above for a discussion on data confidentiality). Table 1 shows the first few lines of the two HTML tables (bona fide tRFs and candidate false positives, respectively) that MINTmap generates for the dataset NA06986.6.MI. This dataset was described in an earlier report 65 Fig. 3). We designed and implemented MINTbase to provide users with the capability to interactively explore the tRF space while also serving as a knowledge repository for tRFs in the public domain.
Comparisons between MINTmap and other tRF profiling methods. We compared MINTmap with tRFdb 54 , tDRmapper 55 , and tRFfinder 66 . Table 2 summarizes the attributes of each scheme. Our quantitative findings are summarized in Supp. In all comparisons involving tRFdb, we distinguish between the tRFdb-AllAligned collection of tRFs (see Methods) and the tRFdb-withID collection of tRFs. Only fragments that are supported by reads matching exactly are used for these comparisons. Also, only tRFs that arise from mature tRNA sequences are used for these comparisons. All of the results are listed in Supp. Table 1. Note, however, that the Venn diagrams that we discuss next do not include tRFfinder because tRFfinder reports only six tRFs using its default settings, i.e. it has a very low sensitivity compared to the other three schemes.
We distinguish between fragments that are ≤ 15 nt in length and fragments that are ≥ 16 nt. As we discussed above and demonstrated recently 49 , 92.4% of all 14-mers and 79.0% of all 15-mers that one can form out of mature tRNA sequences have exact genomic copies outside of tRNA space. As an example, we mention the 15-mer tRF from tRNA LeuTAA with sequence ACCCCACTCCTGGTA: it has 275 exact genomic copies outside of tRNA space. Thus, the inclusion of 14-mer and 15-mer tRFs will increase the noise in the output (see also Supp. Table 1 of http://biorxiv.org/content/early/2016/07/07/061572). Allowing for indels/replacements during the mapping of reads further exacerbates the problem. We note that both tDRmapper and tRFdb-AllAligned report 14-mer and 15-mer tRFs, including ACCCCACTCCTGGTA. To not affect the tool comparisons by this prominent noise source, we used only tRFs that are ≥ 16 nt in length.
Sensitivity. As can be seen from the left Venn diagram of Fig. 5, there is a core group of 1,178 bona fide mature-tRNA-derived tRFs (≥ 16 nt) across the nine cell lines that are correctly identified by all three tools, i.e. MINTmap, tRFdb and tDRmapper. These are tRFs that have exact matches in tRNA space and no exact matches outside tRNA space. For tRFdb, we used the tRFdb-AllAligned collection in the Venn diagrams, since tRFdb-AllAligned maximizes tRFdb's sensitivity and the overlap with the other two schemes. MINTmap and tDRmapper can jointly identify an additional 255 tRFs that are not predicted by tRFdb. Moreover, MINTmap can identify a further 352 tRFs that neither tDRmapper nor tRFdb report: 253 of these arise exclusively from mitochondrial tRNAs; 92 arise from either mitochondrial tRNAs or their eight exact copies that are present in the human nuclear genome (see discussion above); the remaining 7 are 5′ -tRFs from the nuclearly-encoded tRNA His . Note: tDRmapper and tRFdb also report a combined total of 395 fragments that are either 5′ -trailer or 3′ -trailer tRFs and originate from the precursor tRNA whereas MINTmap restricts its analysis to only tRFs that overlap the mature tRNA sequence.

Specificity.
To determine specificity, we focus on fragments (≥ 16 nt) whose exact genomic matches occur both in mature tRNA space and in loci that are either partial tRNA sequences (e.g. see Fig. 4), or non-tRNAs. Because of this provenance ambiguity we refer to such fragments as candidate false positives.
By design, MINTmap explicitly flags in its output those tRFs that are candidate false positives. The right Venn diagram of Fig. 5 shows the performance of tDRmapper and tRFdb with respect to candidate false positives. As can be seen, more than 30% of the sequences that would be reported as tRFs by tDRmapper and tRFdb, at a threshold of RPM ≥ 5 and using exact matching only, appear identically within mature tRNAs as well as at genomic loci that are not tRNAs. For comparison purposes, we mention that tDRmapper and tRFdb report 382 tRFs from mature tRNAs that are 14-mers or 15-mers, which is a high number in proportion to the 1178 bona fide tRFs found by all three schemes. As we explained above, 14-mers and 15-mers are nearly ubiquitous outside of tRNA space; thus, it follows that simple removal of 14-mers and 15-mers from the generated output will amount to a very considerable boost in the effective specificity of both tDRmapper and tRFdb. Speed. The nine short RNA-seq datasets we analyzed also provided an opportunity to compare MINTmap's runtime performance to that of tDRmapper, for which codes are available. For the abundance threshold set to RPM ≥ 5, MINTbase is ~10 times faster. Additional experiments using lower RPM thresholds showed an even higher speed advantage for MINTmap, a direct consequence of MINTmap's use of a lookup table for processing. Of the other two tools, tRFfinder provides a web-based job-queuing system so a direct speed comparison is not feasible. Nonetheless, for the nine short RNA-seq datasets discussed here, results were returned within a few hours. tRFdb does not provide a way to profile individual datasets and, thus, runtime speed statistics cannot be obtained.

Optional
Step. building the entire pipeline. A user would not need to perform Methods steps 1-8 (see Methods) from scratch unless he/she needed to replicate the pipeline for another organism, for a different genome assembly, or for a different tRNA Reference set. In such an event, the time requirements would be as follows: • Steps 1-7 (preparatory stage): these depend on the skill of the bioinformatician and would require ~4 hours end-to-end to execute. • Step 8 (actual building of the lookup table): this step would require an estimated ~9600 CPU-core-hours, or 2.5 days using 160 compute cores (2.70 GHz Intel Xeon CPU E5-4650).
Step 9 is not MINTmap-specific; rather it depends on the library preparation of the biological sample and would require ~45 minutes.

Discussion
We presented MINTmap, a fast and exhaustive software resource for profiling tRFs that may be present in short RNA-seq datasets. We expect that practitioners who are interested in studying tRFs using deep-sequencing approaches will find MINTmap to be useful and very flexible. In fact, we sought to enable individual users who do not have the time or the skills needed for an involved software installation. Consequently, the software that we generated and make available can be run standalone on any modern laptop or workstation and is exceptionally fast: it can analyze 100 million adapter-cut-and-quality-trimmed Illumina reads and report the identified tRFs and their abundances in under four minutes. In addition, in this presentation, we provided a very detailed description of all of MINTmap's steps to enable users who might be interested in a site-specific implementation of our approach. We have already used MINTmap to analyze and publish tRF profiles from 843 public short   RNA-seq datasets. We made these tRF profiles publicly available through a companion web-based resource, MINTbase 1 . MINTbase allows users to interactively interrogate and study these tRF profiles across isodecoders, isoacceptors, tissues, diseases, etc. On a related note, MINTbase allows a user to directly submit over the web (https://cm.jefferson.edu/MINTsubmit/) the output of MINTmap on his/her short RNA-seq datasets, thereby contributing the user's respective tRF profiles to this public resource.

Equipment.
Personal computer with an Internet connection. A personal computer (either a desktop or a laptop) will be powerful enough to execute all steps of MINTmap and analyze datasets with one hundred million reads or more. We recommend a system with at least 2 Gb of main memory powered by at least 1 CPU Intel Core i5/i7, or equivalent.
Compute server. For step 8 of MINTmap, we recommend using a multi-core compute setup (compute time decreases in proportion to the number of processing cores). We recommend a compute server with a minimum of 16-cores Intel Xeon, Intel Core i5/i7, or equivalent, with at least 8 Gb of main memory. Nonetheless, it is possible to run step 8 on a personal computer such as the one recommended above: to do so, step 8 needs to be run separately on each chromosome (see below). For the human genome, this would amount to a total of 25 executions of step 8.
Data and other downloads. An Internet connection is required to access the tRNA Reference set (if not using the one in Supp. Table 2). Access to tools such as SAMTools (http://samtools.sourceforge.net/) may also be desired for some of the steps.
The generated output. MINTmap generates its output as both a plain text file and an HTML file. The textual output can be viewed using any standard text editor. The HTML output can be viewed using a web browser; all popular web browsers (Internet Explorer, Firefox, Chrome, and Safari) are currently supported.  their own instances of MINTmap; thus, we outline the steps required to build the table below (see also panels A through E of Fig. 6). We note that the table is generated after the tRNA Reference set is established.
Steps 1-8: create the lookup table 1. Initial tRNA sequence collection: Create a starting collection of the unspliced genomic sequences that encode tRNA isodecoders, in 5′ → 3′ orientation. This collection is referred to as the 'tRNA Reference set' and for the human genome comprises 640 sequences: true tRNAs, pseudo-tRNAs, mitochondrial tRNAs, and exact tRNA-lookalikes (Supp. Table 2 and Fig. 6A). 2. Accommodate exon-exon junctions: Using the above sequence list, remove any intronic sequences so that only the tRNAs' spliced sequences remain (Fig. 6B). 3. Accommodate non-templated CCA addition: Extend each tRNA sequence by appending the CCA trinucleotide (Fig. 6C). 4. Accommodate post-transcriptional modifications at the −1 position: Make 4 additional copies of each sequence, extending the 5′ end to the left by 1 nt with an A, T, C, and G respectively (Fig. 6D). 5. Generate the list of all potential tRF sequences: For each one of the above sequences, enumerate all possible substrings with lengths between 16 and 50 nt inclusive, independently of whether there is already available experimental evidence for the respective fragment or not. Add each one of these substrings to a new list called the 'tRF sequence candidate pool' and remove any duplicate entries before proceeding ( Fig. 6E and Supp. Table 3, column C). 6. Prepare the whole genome search space: Put the sequences of the forward and reverse strands of each chromosome on consecutive lines of a new text file-no labels are necessary. For the human genome there will be a total of 50 lines: two lines for each of chromosomes 1-22, X, and Y and the mitochondrial (MT) genome. CAUTION: All sequences, including those corresponding to the negative strand, must be listed in 5′ → 3′ orientation. 7. Mark tRNA positions within the chromosome (Fig. 7): For each of the 50 sequences above, create an identically-sized corresponding file, internally referred to as "exonic mask file. " Begin by replacing each and every position of the exonic mask file by a '0. ' Next, enumerate all exons of the tRNA Reference, go to the matching chromosome, strand, and genomic coordinates in the exonic mask file and mark each of the corresponding positions with a '1. ' For every genomic positions that would correspond to a "− 1" or "CCA" post-transcriptional addition mark the corresponding position(s) in the exonic mask file with a '2. ' Upon completion, in the exonic mask file, tRNA exons (blue regions in Fig. 7) will have been replaced by trails of 1's, positions corresponding to post-transcriptional additions will have been replaced by one or more 2's (green regions in Fig. 7), and all other positions (yellow regions in Fig. 7) will have been replaced by trails of 0's. CRITICAL STEP: When marking exonic positions, make sure that the correct chromosomal strand sequence gets marked with 1's. Failure to do so will over-represent the tRNA exons. 8. Determine exclusivity to tRNA space, or lack thereof: Check each candidate tRF sequence from the list that was created during step 5 above and identify all of its exact-matches within the search space created in step 6.
In turn for each exact-match that is found, examine the corresponding span of the match in the exonic mask file (from step 7). If any of the positions within the span for the exact-match being examined contains a '0, ' conclude that the candidate tRF is non-exclusive to mature tRNA space, terminate the search for this candidate tRF, and proceed to the next candidate tRF. If none of the positions within the span for the exact-match being examined contains a '0, ' continue with the next match for this candidate tRF. If upon investigation of all exact matches for the tRF being examined no positions are found that contain a '0, ' conclude that this sequence is exclusive to mature tRNA space and thus a bona fide tRF if transcribed ( Fig. 6 and Supp. if users do not provide it, MINTmap will still run correctly but will not report structural type information for the discovered tRFs. CLARIFICATION: In the event that the available computational platform is a personal computer, steps 6 and 7 should be modified to create one file per chromosome (for a total of 25 files in the human genome); each file should contain two lines, one for the forward and one for the reverse strand of the corresponding chromosome. Then, during step 8, each of the candidate tRF sequences should be examined separately in each of the chromosome files before it can be determined whether it is indeed exclusive to tRNA space.
Steps 9-10: generate raw and normalized expression values 9. Preprocess your reads. For each short RNA-seq dataset of interest, perform the desired quality filtering and adapter trimming. CAUTION: Any remaining adapter sequences must be removed from the reads prior to proceeding to the next step. The adapter sequences that will need to be removed depend on the library preparation for the corresponding biological sample. 10. Generate table of tRF abundances. This is the step where the actual tRF profiling takes place. This step should be performed using the option that is appropriate for the short RNA-seq dataset at hand. If the short RNA-seq datase to be analyzed does not contain color space reads, execute: • A1. Process the list of NGS read-sequences that resulted from step 9 and keep only those that exist in the lookup table that was generated above (Fig. 6E and column C of Supp. Table 3). • B1. For each NGS read-sequence that was obtained in 10.A1, generate molecule-level expression information by counting the frequency of the sequence (Fig. 6F) in the RNA-seq dataset from step 9. The resulting table is referred to as the 'count table' . Optionally, sort the count table in order of decreasing abundance. We suggest, and our provided codes already implement this, that normalization in reads-per-million (RPM) be computed in two different ways: (i) an RPM value computed using as a denominator the counts of reads that map either exclusively or non-exclusively to tRNA space; and, (ii) an RPM value computed using as a denominator the count of reads from the FASTQ file that is being processed)-see Table 1 for an example.  If the short RNA-seq dataset to be analyzed does contain color space reads, execute: • A2. First, you must map the sequenced reads at hand using a mapping tool of your choosing that can handle color-space reads. Allow for non-unique mappings and map the reads to all modified tRNA sequences obtained in step 4 above (lookup-table creation). Only map in the 5′ → 3′ orientation relative to each tRNA. CAUTION: It is important that one be exhaustive during this step, so we recommend using mappers with the highest sensitivity possible. • B2. From the collection of mapped reads from 10.A2 only keep those reads that could be mapped with no mismatches, no insertions, no deletions, and no crossovers, and create "(NGS read-label, NGS read-sequence)" pairs. Sort the formed list of pairs, and identify and remove any duplicates that may be present. From the resulting list keep only those of the (NGS read-label, NGS read-sequence) pairs for which the NGS read-sequence exists in the lookup table that was generated above (Fig. 6E and column C of Supp. Table 3). In this final list count the frequency of each distinct NGS read-sequence. The resulting table is referred as the 'count table' (Fig. 6F). Optionally, sort the count table in order of decreasing abundance. CRITICAL STEP: During mapping, a given sequenced read may be found in multiple tRNA genes: the read and its associated NGS read-label will be reported once for each of these hits. By removing duplicate (NGS read-label, NGS read-sequence) pairs we ensure that we do not misrepresent the abundance of the molecule represented by the NGS read-sequence.
Steps 11-12: flag candidate false positives, link to metadata, and report 11. Account for exclusivity to tRNA space: Use the lookup table's exclusivity column (Supp. Table 3, column D) to split the count table, which was generated during either step 10.B1 or step 10.B2, into two distinct files. The first file will contain fragments whose sequences are present exclusively in the tRNA space, and nowhere else on the genome: these are bona fide tRFs. The second file will contain fragments whose sequences are present both inside and outside of tRNA space: these are possible false positive tRFs and we recommend that they be treated as such. Optionally, add a column in each of the two output files that normalizes the reported counts by, e.g., scaling them to an RPM value (see above).

For all tRFs
Determine tRF exclusivity to tRNA space is an exact match found that overlaps with a yellow block? 12. Associate each fragment in the output with relevant meta-data: Add genomic information and other metadata to each of the reported tRFs. The codes that we provide already accomplish this by providing the following information (see Table 1 for a detailed example): number of reads mapping to the respective tRF; RPM values computed in three different ways, one of which is user-defined; a hyperlink to the respective tRF's Summary Record in MINTbase 1 (Fig. 3 shows an example of such a record); information about the isodecoder(s) from which the tRF could be arising, starting and ending location within the isodecoder(s), and chromosome, strand, and genomic coordinates of the isodecoder(s).
Troubleshooting MINTmap. In Table 3, we provide detailed information that will help users troubleshoot problems that they may encounter during execution of the codes. For some of the above-described steps, the table contains multiple troubleshooting entries.
How the comparisons between the various approached were run. We generated results from comparing MINTmap, tRFdb 54 , tDRmapper 55 , and tRFfinder 66 . We note that only MINTmap and tDRmapper are available as downloadable tools that can be used to profile individual datasets. tRFfinder provides a web-interface for users to upload pre-processed data that can then be analyzed. tRFdb does not allow users to profile custom datasets. MINTmap, tRFFinder 66 , and tDRmapper 55 were ran using their default settings. The schemes were evaluated using the datasets listed under GSE16579 67 in the NCBI Gene Expression Omnibus (GEO) repository. To facilitate comparisons, we enforced a normalized threshold of RPM ≥ 5 across all schemes. Because tRFdb contains only pre-computed data, we downloaded from their website the tRNA alignments for the datasets of GSE16579 and sub-selected only the tRFs that satisfied the RPM ≥ 5 threshold.
A tRF would enter our analysis (Supp. Table 1) if and only if (a) it had a length ≥ 16 nt (b) it was reported by at least one of the 4 schemes; (c) it was detected in any of the nine datasets above at the enforced RPM threshold; and, (d) it could be mapped to the genome exactly, i.e. without permitting any indels or replacements.
(Step d) is necessary because MINTbase, tRFdb, and tRFfinder (in its default settings) report only exactly-matching tRFs. Moreover, as we showed in our previous work 2,49 , mismatches produce erroneous results. Lastly, because tRFdb comprises tRFs that are shown in their "alignment view" but are not included in the tRFs to which tRFdb assigned unique identifiers, we reported statistics using two tRF collections: "tRFdb-AllAligned" is those tRFs from tRFdb that are listed in the reported alignments; "tRFdb-withID" is those tRFs from tRFdb that have been assigned a unique tRFdb identifier. Data availability. The available codes are written in Perl and allow a user to process a short RNA-seq dataset that has already been adapter-trimmed and quality-trimmed. MINTmap runs under Linux and OSX and requires Perl and Java. To facilitate use of the codes, we pre-generated and make available the lookup table for the human genome. Optionally, as noted in the help file that accompanies the codes, a custom user-generated lookup table can be employed instead. The output will comprise the identified tRFs and their respective abundances (both raw Step Problem Possible reason Solution The number of candidate sequences is less than those in Supp. Table 3.
Not all post-transcriptional modifications have been carried out.

5
The number of candidate sequences is more than those in Supp. Table 3 Duplicate values have been included.
If a sequence appears multiple times in the reference tRNA space, remove duplicates and report it once.
8 Exclusivity values do not match Column D in Supp. Table 3.
An exhaustive and deterministic mapper was not used.
Use a search/mapper that performs exact string matching and does not rely on alignment scores.

10
No or few entries exist in the sample's count table.
Adapters were not removed from the input dataset.
Remove any adapters from the sequenced reads in the input NGS file 10 If open source MINTmap script used: Error opening input file.
FASTQ file was not provided as input.
Provide the short RNA-seq dataset in FASTQ format (4 lines per read).

10
If open source MINTmap script used: Error opening input file.
Color-space data provided as input and cannot be directly provided to the tool.
Map the data as described in the color-space section of step 10. Table 3. Troubleshooting Table. For several of the steps in Methods, we list potential complications that the user may encounter together with an explanation of the behavior and the recommended solution.
and normalized). The output distinguishes between tRFs that are exclusive to tRNA space, and tRFs that occur inside and outside of tRNA space. The MINTmap codes are freely available under a GNU General Public License v3.0 from https://github.com/TJU-CMC-Org/MINTmap/. We expect to be releasing updated lookup tables and code-updates for MINTmap at regular intervals. The current implementation of MINTmap is for the human genome. MINTmap has been designed very generically, which should allow interested users to easily adapt the codes to other genomes as well.