MiREDiBase, a manually curated database of validated and putative editing events in microRNAs

MicroRNAs (miRNAs) are regulatory small non-coding RNAs that function as translational repressors. MiRNAs are involved in most cellular processes, and their expression and function are governed by several factors. Among these, miRNA editing is an epitranscriptional modification that alters the original nucleotide sequence of selected miRNAs, possibly influencing their biogenesis and target-binding ability. A-to-I and C-to-U RNA editing are recognized as the canonical types, with the A-to-I type being the predominant one. Although some bioinformatics resources have been implemented to collect RNA editing data, a comprehensive resource explicitly dedicated to miRNA editing is still lacking. Here, we present MiREDiBase, a manually curated catalog of editing events in miRNAs. The current version includes 3,059 unique validated and putative editing sites from 626 pre-miRNAs in humans and three non-human primates. Editing events in mature human miRNAs are supplied with miRNA-target predictions and enrichment analysis, while minimum free energy structures are inferred for edited pre-miRNAs. MiREDiBase represents a valuable tool for cell biology and biomedical research and will be continuously updated and expanded at https://ncrnaome.osumc.edu/miredibase.


Introduction
MiRNAs are the most studied class of small non-coding RNAs involved in gene expression regulation. According to the canonical miRNA biogenesis pathway, miRNAs are initially transcribed into primary transcripts (pri-miRNAs) that present hairpin structures and undergo a two-step RNase III-mediated processing 1 . The first step occurs within the nucleus, where the Drosha-DGCR8 enzymatic complex cleaves pri-miRNAs into ~70-nucleotide-long transcripts. These typically maintain the stem-loop conformation and represent the precursors of miRNAs (pre-miRNAs). Pre-miRNAs are then exported to the cytoplasm, where they are ultimately processed by Dicer into ~22-nucleotide-long single-stranded RNAs (mature miRNAs) 1 . These can be found as −5p or −3p forms, depending on which arm of the pre-miRNA hairpin they derive from 1 . To date, it is estimated that more than 1,900 pre-miRNAs are expressed in humans, giving rise to over 2,600 different mature miRNAs 2 .
MiRNAs are important modulators of gene expression 3,4 . The rule underlying their inhibitory activity over translation consists of a thermodynamically stable base pairing between a specific miRNA region, termed "seed region," and a complementary nucleotide sequence of an mRNA, termed "miRNA responsive element" (MRE),

Results
Data collection. The MiREDiBase data processing workflow is depicted in Fig. 1. We first explored the PubMed literature by searching for specific keywords, such as "microRNA editing" and "miRNA editing," restricting the time range to between 2000 and 2019. Retrieved articles were then manually filtered, discarding those not containing information on miRNA editing. Editing events detected or validated by targeted methods were included in the database and considered authentic modifications. Among editing events detected through wide-transcriptome methods, we retained those established as "reliable" or "high-confidence" by the authors, classifying them as putative modifications. Statistical significance was taken into consideration when possible, ultimately retaining only significant editing events. We did not consider enzyme perturbation experiments as validation methods. For putative edited pre-miRNA sequences with no official miRNA name, e.g., "Antisensehsa-mir-451" in Blow et al. 26 , we employed the BLASTN tool to generate alignments between the putative pre-miRNA sequence and miRBase's pre-miRNA sequences (v22) 2,27 . Only perfect matches were retained and assigned their respective official names, as indicated by miRBase. When editing positions were reported as coordinates of previous genomic assemblies (i.e., hg19/GRCh37), these were converted to the hg38/GRCh38 assembly using the University of California Santa Cruz (UCSC) liftOver tool 28 . Editing sites associated with miRBase's dead entries were discarded.
In the second step, we expanded our search by employing the three most prominent online resources for A-to-I events available at present: DARNED 22 , RADAR 23 , and REDIportal 24 . Resources were manually screened, removing editing sites associated with dead entries and opposite strands. Editing sites falling into misassigned miRNAs in the hg19 genomic assembly (i.e., miRNAs of the hsa-mir-548 family and hsa-mir-3134 present in DARNED) were excluded from the database. The retained data were then integrated into the initial dataset.
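As an illustration of the perfect-match criterion applied to the BLASTN alignments, the following minimal sketch filters BLAST+ tabular output (`-outfmt 6`). It is not the authors' pipeline: the interpretation of "perfect" as a full-length, gap-free, 100%-identity alignment is an assumption, and the function name, identifiers, and sequence lengths are invented for the example.

```python
import csv
import io

# Standard column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def perfect_matches(blast_tsv, query_lengths, subject_lengths):
    """Return (query, subject) pairs aligned end to end with no mismatch or gap.

    Assumption: "perfect" means 100% identity, zero mismatches, zero gap
    openings, and an alignment spanning both sequences in full.
    """
    hits = []
    reader = csv.DictReader(io.StringIO(blast_tsv), fieldnames=COLUMNS,
                            delimiter="\t")
    for row in reader:
        aln_len = int(row["length"])
        if (float(row["pident"]) == 100.0
                and int(row["mismatch"]) == 0
                and int(row["gapopen"]) == 0
                and aln_len == query_lengths[row["qseqid"]]
                and aln_len == subject_lengths[row["sseqid"]]):
            hits.append((row["qseqid"], row["sseqid"]))
    return hits


# Toy input: made-up identifiers and lengths, for illustration only.
example = (
    "query_1\thsa-mir-A\t100.000\t72\t0\t0\t1\t72\t1\t72\t1e-35\t134\n"
    "query_2\thsa-mir-B\t98.611\t72\t1\t0\t1\t72\t1\t72\t1e-32\t128\n"
    "query_3\thsa-mir-C\t100.000\t60\t0\t0\t1\t60\t5\t64\t1e-28\t111\n"
)
q_len = {"query_1": 72, "query_2": 72, "query_3": 60}
s_len = {"hsa-mir-A": 72, "hsa-mir-B": 72, "hsa-mir-C": 80}
print(perfect_matches(example, q_len, s_len))
```

In this toy input only the first alignment passes: the second contains a mismatch, and the third covers only part of the subject sequence.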
Database content and statistics. Considering the recent knowledge about genomic differences and similarities among primates, we retained data from Homo sapiens and three non-human primate species (Pan troglodytes, Gorilla gorilla, and Macaca mulatta). In particular, the current version of MiREDiBase includes 2,989 validated and putative unique A-to-I (2,885) and C-to-U (104) editing events occurring in 571 human miRNA transcripts (Fig. 2a,b, see Data Availability section) and 70 unique A-to-I (46) and C-to-U (24) editing events taking place in 55 primate miRNA transcripts (Supplementary Figs. 1, 2a,b, see Data Availability section). Overall, 909 (29.7%) editing events occur outside of the pre-miRNA sequences, 971 (31.7%) within pre-miRNA sequences but outside of the mature sequence, and 1,179 (38.6%) within mature miRNA sequences (Fig. 3a, Supplementary Fig. 3a, see Data Availability section). These data were manually extracted from 51 original papers (Supplementary Table 1), which refer to 256 biological sources (Supplementary Tables 2-5).
Human editing sites in MiREDiBase are distributed throughout the human genome, covering most chromosomes (Fig. 2b). However, of the 2,989 unique editing sites, only 257 (8.6%) have been validated by low-throughput methods or ADAR expression perturbation experiments. The largest shares of such events fall into clustered miRNAs located on chr14 (9.5% A-to-I; 7.7% C-to-U), chrX (9.4% A-to-I; 7.7% C-to-U), chr1 (7.7% A-to-I; 6.7% C-to-U), and chr19 (6.7% A-to-I; 11.5% C-to-U). This phenomenon very likely depends on local structural elements and motifs in these primary transcripts that function as editing inducers 29,30 and deserves more in-depth investigation. For the vast majority of miRNA editing events, functionality remains undetermined. So far, only 24 editing sites (0.8%) have been functionally characterized by appropriate techniques (Fig. 3c). Among these, twelve were demonstrated to impair miRNA biogenesis, seven to cause functional re-targeting, three to cause both impaired biogenesis and functional re-targeting, and two to enhance biogenesis.
Concerning primates, the majority of data refer to macaque (Macaca mulatta), for which our database reports 40 A-to-I and 24 C-to-U editing sites (Supplementary Figs. 1, 2, see Data Availability section). Of these, 26 (65%) A-to-I editing sites are conserved between human and macaque, whereas only 8 (33%) C-to-U sites are conserved between the two species. This difference might suggest that A-to-I editing of miRNA transcripts is more conserved than the C-to-U type; however, it might also be due to the currently low number of C-to-U instances reported for both humans and non-human primates. Only three editing sites each are reported for chimpanzee (Pan troglodytes) and gorilla (Gorilla gorilla), occurring in one pre-miRNA transcript per species (see Data Availability section). None of the editing sites from primates has been validated yet.
When looking at editing sites falling within mature miRNA sequences, data from MiREDiBase reveal two distinct patterns for A-to-I and C-to-U editing in humans (Fig. 3b). For the A-to-I type, the largest group of edited sites (325 out of 1,018, 31.9%) is located at positions 2-5 of the seed region. Other hotspots for A-to-I editing appear to be positions 1, 6-9, and 12, which account for 325 more edited sites. In the case of C-to-U miRNA editing, most modification sites are located outside of the seed region. In particular, 48 out of 104 edited sites (46.2%) are located at positions 10-12 and 15, whereas only 17 edited sites (16.3%) fall within the seed region. A very similar pattern can be observed in macaque (Supplementary Fig. 3b).
To help users interpret and contextualize data, miRNA editing events occurring within pre-miRNA or mature miRNA sequences were supplied with in silico predictions. We computed 2,150 predicted minimum free energy (MFE) pre-miRNA structures using editing sites internal to pre-miRNA sequences, along with 1,018 miRNA-target predictions and enrichment analyses. In both cases, users can compare the edited miRNAs with their respective wild-type versions.
To infer whether local motifs influence A-to-I editing, we investigated the nucleotide composition around the editing sites across the different regions of the miRNA transcripts. Previous works have already demonstrated a neighbor preference for ADAR-mediated editing. Specifically, comparative studies showed that ADARs have a 5′ nearest-neighbor preference of U ~ A > C ~ G 31,32 . A preference for the 3′ nearest neighbor was shown only for ADAR2, consisting of U ~ G > C ~ A 32 . More generally, the UAG triplet has been found to be the most favored, even in miRNAs 33,34 . These results were recently confirmed by a structural study, which demonstrated that ADAR nearest-neighbor preference in humans is mainly determined by nucleotide-amino acid interactions rather than local duplex stability 35 . Analysis of the nearest neighbors of edited adenosines from MiREDiBase revealed a similar pattern, although with notable differences when the miRNA transcripts are subdivided into distinct regions (Supplementary Fig. 4). In humans, edited adenosines falling into the mature sequences and regions of the pre-miRNA (excluding the loop region) showed a 5′ neighbor preference consisting of A ~ U > C > G and a 3′ neighbor preference consisting of G > A ~ C ~ U (Supplementary Fig. 4a). Likewise, the pri-miRNA regions (outside the stem-loop sequence) showed a 3′ neighbor preference consisting of G > A ~ C ~ U but a distinct 5′ neighbor preference, consisting of A ~ C > G > U. The loop region showed a sharply different neighbor preference, consisting of G ~ A > C ~ U at 5′ and C ~ G > A ~ U at 3′. Such differences might indicate that local RNA structures also affect the editing preference during ADAR activity.
Concerning the neighbor preference in editing of macaque miRNA transcripts, our analysis revealed a 5′ neighbor preference consisting of U > C > A > G and a 3′ neighbor preference consisting of G > U > C ~ A, with UAG being the most represented motif (Supplementary Fig. 4b). No analysis was carried out for the other miRNA regions due to the scarcity (pre-miRNA regions) or lack of data.
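A nearest-neighbor analysis of this kind can be sketched as a simple tally of the bases flanking each edited adenosine. The snippet below is only an illustrative sketch: the function name, the input layout (sequence plus 1-based position), and the toy sequences are assumptions and are not taken from MiREDiBase.

```python
from collections import Counter


def neighbor_counts(sites):
    """Tally the 5' and 3' nearest neighbors of edited adenosines.

    `sites` is an iterable of (rna_sequence, position) pairs, where position
    is the 1-based index of the edited A in a 5'->3' sequence.
    """
    five_prime, three_prime, triplets = Counter(), Counter(), Counter()
    for seq, pos in sites:
        seq = seq.upper().replace("T", "U")
        i = pos - 1
        if seq[i] != "A":
            raise ValueError(f"position {pos} is not an adenosine")
        if i > 0:
            five_prime[seq[i - 1]] += 1
        if i < len(seq) - 1:
            three_prime[seq[i + 1]] += 1
        if 0 < i < len(seq) - 1:
            # Triplet centered on the edited adenosine, e.g. UAG.
            triplets[seq[i - 1:i + 2]] += 1
    return five_prime, three_prime, triplets


# Toy sequences invented for illustration.
toy_sites = [
    ("GCUAGCA", 4),  # UAG context
    ("CCUAGGA", 4),  # UAG context
    ("GGAACUU", 3),  # GAA context
]
five, three, tri = neighbor_counts(toy_sites)
print(five.most_common())
print(three.most_common())
print(tri.most_common(1))
```

Running the tally separately on sites grouped by transcript region (mature, pre-miRNA, loop, pri-miRNA) would give region-specific rankings comparable to those described above.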
Finally, biological sources in MiREDiBase can be grouped into three main categories (Table 1). The "normal condition" group (human and primates) accounts for 92 different healthy tissues/organs analyzed for miRNA editing. Among these, 85 were obtained from adult individuals and seven from pre-natal developmental stages (Supplementary Tables 2 and 5). The "adverse condition" group (human only) is broadly represented by tumors, with 60 distinct oncological conditions and 62 different sample subtypes. The neurological disorders include four pathological conditions and six sample subtypes. Inflammatory conditions, cardiovascular diseases, and genetic disorders are currently the least represented classes, with two pathological conditions and three sample subtypes for the first and one pathological condition and one sample type each for the latter two (Supplementary Table 3). The "cell line" group (human only) accounts for 78 commercial cell lines and ten primary human cells cultured in vitro. Of the 78 commercial cell lines, 71 are malignant, while the remaining seven represent non-malignant conditions. Among the ten primary human cells, only one refers to a malignant condition, while nine represent normal conditions (Supplementary Table 4).
User interface and data accessibility. MiREDiBase provides users with an intuitive and straightforward web interface to access data, requiring no bioinformatics skills to perform accurate searches across the database. Users can explore MiREDiBase by interacting with the Search (Fig. 4) or the Compare module. Each module starts with a modal box by which users can filter miRNA editing sites.
The Search module provides four filtering fields: organism (e.g., Human), modification type (e.g., A-to-I editing), genomic region (e.g., chromosome, pre-miRNA, or miRNA), and, optionally, biological source (e.g., BRCA, breast carcinoma). Based on the selected filtering options, the Search module generates a table listing a set of editing sites supplied with essential information. Reported information covers the organism, modification type, chromosome, strand, genomic position, pre-miRNA and mature miRNA relative positions, employed detection strategies, and whether the site is putative or validated. By clicking on the dedicated left-sided buttons, users can drill down to find supplementary information about each editing site. Here, the detection strategies information is expanded, categorizing the editing site as putative (i.e., only detected by high-throughput sequencing methods) or validated (i.e., authenticated by targeted methods), indicating the confidence level for each modification. Additional information covers publications, external resources, biological sources, 2D structures of edited and non-edited pre-miRNAs, miRNA-target predictions, and associated functional enrichment data (Fig. 4), which enable ready access to a putative biological interpretation. The results in each module can be easily downloaded through dedicated buttons.
The Compare module is designed to explore differentially edited sites in adverse vs. normal conditions. It provides a set of essential information supplied with the editing level for each examined condition. Like the Search module, the Compare module allows users to filter editing sites by specifying the organism, modification type, disease, and pre-miRNA.
All miRNAs reported in MiREDiBase are linked to their specific miRBase and RNAcentral 36 web pages. Moreover, A-to-I genomic coordinates were mapped onto the UCSC hg38/GRCh38 genome assembly and are available via the UCSC website. If applicable, editing sites provide links to external RNA editing resources, such as DARNED, RADAR, and REDIportal, to improve miRNA editing research.
To encourage users to familiarize themselves with our tool, MiREDiBase offers, throughout the website, helpers reporting explanations on how to interpret results, along with statistics and complete documentation on how to use each module. Advanced users can instead exploit the RESTful API, which provides a standalone web interface to explore available methods for extracting data, with the opportunity to embed RESTful API HTTP calls within users' code (Supplementary Fig. 5).
The MiREDiBase platform adopts a multi-containerized microservice architecture (Supplementary Fig. 5), which provides user-friendly and efficient ways to access all manually collected data (see Methods section for more details).

Discussion
In early studies of miRNA editing, Sanger sequencing represented the standard method to reliably identify editing events 34,37,38 . However, this low-throughput technique only enabled the detection of a relatively restricted set of editing sites. In later years, the employment of high-throughput sequencing (HTS) technologies and the design of ad hoc bioinformatic pipelines have dramatically improved the computational identification of RNA editing events 39 , including those occurring in miRNAs.
Given the ever-increasing number of editing sites detected at a genome-wide scale, the need to create a comprehensive catalog of such modifications has become imperative. In light of this, Kiran and Baranov published DARNED, the first online repository providing centralized access to published data on RNA editing 22 . DARNED currently includes ~350,000 predicted A-to-I RNA editing sites from humans, mice (Mus musculus), and flies (Drosophila melanogaster), along with a few C-to-U instances. However, only a small portion of these modification events was manually annotated, and no information is provided about editing levels. DARNED's last update dates back to 2012 40 .
In 2013, Ramaswami and Li presented RADAR, a rigorously annotated A-to-I RNA editing database containing manually curated editing sites 23 . Like DARNED, RADAR includes data from humans, mice, and flies and currently accounts for ~1.4 million editing sites, providing useful information such as tissue-specific editing levels, conservation in other model organisms, and genomic context. RADAR does not include C-to-U editing data, and its last update took place in 2014.
In 2017, Picardi and colleagues developed REDIportal, which today is the most extensive collection of RNA editing in humans, including more than 4.5 million A-to-I modification events detected across 55 body sites from thousands of RNA-seq experiments 24 . Moreover, with its last update, REDIportal also includes ∼90,000 putative A-to-I editing events from the mouse brain transcriptome and incorporates CLAIRE, a searchable catalog of RNA editing levels across cell lines 41 .
Although these three online resources are undoubtedly the most authoritative repositories of RNA editing events, none of them is strictly dedicated to miRNA editing. The vast majority of the editing events reported in these databases fall into mRNAs and long non-coding RNAs (lncRNAs), with only a minority occurring in miRNAs. A few online resources have recently been developed that partially focus on the effects of RNA editing on miRNA functionality. For instance, the Editome-Disease Knowledgebase (EDK) 42 is a manually curated database that aims to link experimentally validated RNA editing events in non-coding RNAs to various diseases. However, this database currently contains only 16 validated A-to-I instances in miRNAs and does not provide any information about publications, position of editing sites, or detection/validation methods. The Cancer Editome Atlas (TCEA) is a powerful, user-friendly bioinformatics resource that characterizes more than 192 million editing events at ~4.6 million editing sites from approximately 11,000 samples across 33 cancer types recovered from The Cancer Genome Atlas 25 . However, TCEA is focused on editing events occurring in coding transcripts. From the miRNA standpoint, TCEA only allows users to predict the effects of A-to-I editing in the 3′ UTR of mRNAs in terms of miRNA-mRNA interactions. Analogous considerations apply to miR-EdiTar 43 , a database that exploits DARNED data to predict the potential effects of A-to-I editing on miRNA targeting.
To bridge the gap between the fields of RNA editing and miRNA biology, we developed MiREDiBase, the first database of its kind dedicated explicitly to miRNA modifications. In the current version, MiREDiBase includes more than three thousand A-to-I and C-to-U miRNA editing events manually collected from the literature, occurring in humans and primates. MiREDiBase allows users to consult the RNA secondary structure of both the wild-type and edited pre-miRNAs and infer the possible function of edited mature miRNA, based on the predicted targetome and subsequent functional analysis.
Fig. 4 The MiREDiBase Search module. Users can filter MiREDiBase data by exploiting the specific modal box (a). Then, they can dig into the data by interacting with the filtered editing sites (b). The editing site's details (c) can be navigated by clicking on the red button placed on its left side. Additional resources include the list of biological sources in which the editing site has been identified (d), the thermodynamic comparison of the wild-type and edited pre-miRNA 2D structures (e), the miRNA-target predictions (f), and functional enrichment (g) data. Helpers and downloading buttons are provided throughout the module interface.
To improve the user experience, we implemented a user-friendly interface that allows users to track each search step. Moreover, MiREDiBase includes a "Compare" section, which compares adverse versus normal conditions in a study-specific manner. Finally, the MiREDiBase platform relies on cutting-edge technologies, aiming at providing reliability and continuous operability. The platform represents an orchestration of different containerized services on top of Docker. Each service fulfills a specific purpose, such as a Web Application Service (Vue.js/Quasar, a progressive JavaScript framework), a RESTful API Service (FastAPI, a modern, high-performance web framework for building secure APIs), and a Database Service (MongoDB, a NoSQL document-based database). The platform is designed to provide a smooth and user-friendly experience.
We are aware that the lack of data on more commonly adopted model organisms and the inclusion of C-to-U RNA editing sites represent weaknesses in our work. The choice to include primates rather than other model species in this first release was motivated by the fact that primates present the highest genomic and transcriptomic similarity to humans 44 . Moreover, primates are recognized as excellent candidates to investigate epigenetic control of genome functions and are highly relevant for biomedical studies 44 . The choice to include putative C-to-U miRNA editing events was motivated by the fact that this editing type is considered "canonical" among mammals. However, previous Sanger sequencing validation of putative C-to-U editing sites in miRNAs found no evidence for real C-to-U miRNA editing 15,45 , suggesting that such events were HTS artifacts. On the other hand, Negi et al. recently found and validated C-to-U editing at the fifth position of mature human miR-100, demonstrating that this event was functionally associated with CD4(+) T cell differentiation 20 . Given these controversies, we believe that collecting C-to-U miRNA sites with high consensus will help orient future studies on this topic.
Besides expanding the database with published data, our main future goals are (i) to include editing events from other species, primarily model organisms like Mus musculus and Drosophila melanogaster, and (ii) to add other modification types. We believe that this will help interpret the functional roles of modified miRNA transcripts within the cell system. For example, after analyzing human brain samples for RNA editing events, Paul et al. unexpectedly found that a considerable percentage of miRNA editing events are non-canonical, especially C-to-A and G-to-U 11 . Similar data were reported by Wang and co-workers 46 , raising questions on whether these editing events exert essential functions in neurons and whether specific enzymes can catalyze such modifications. Likewise, miRNA methylation has recently caught the scientific community's attention, having been demonstrated to affect miRNA biogenesis 47 . However, this phenomenon and its potential functional implications remain largely unexplored. With continuous updating, we believe that MiREDiBase will gradually become a valuable resource for researchers in the field of epitranscriptomics, leading to a better understanding of miRNA modification phenomena and their functional consequences.

Methods
Data processing. Each editing event was supplied with essential information recovered from miRBase (v22), including the relative position within pre-miRNA and mature miRNA, genomic position, and pre-miRNA region (5′ or 3′ arm, or loop region). For editing events occurring outside the pre-miRNA sequence, we adopted the notation "pri-miRNA." Editing events were then enriched with metadata manually collected from selected publications. Overall, we extracted eight different information types: detection/validation method, experiment type, biological source, corresponding condition (adverse or normal), comparison (pathological vs. physiological condition), editing level, enzyme affinity, and functional characterization.
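The positional annotation described above can be illustrated with a short sketch that maps a genomic coordinate to strand-aware positions within the pre-miRNA and mature miRNA. This is not the authors' code: it assumes 1-based inclusive coordinates (as in GFF3 annotation), uses simplified region labels (no arm or loop distinction), and all names and coordinates are hypothetical.

```python
def relative_position(genomic_pos, start, end, strand):
    """Map a genomic coordinate to a 1-based position within a feature.

    `start` and `end` are 1-based inclusive coordinates (start <= end).
    Returns None when the site lies outside the feature.
    """
    if not start <= genomic_pos <= end:
        return None
    if strand == "+":
        return genomic_pos - start + 1
    if strand == "-":
        # On the minus strand the transcript starts at the highest coordinate.
        return end - genomic_pos + 1
    raise ValueError("strand must be '+' or '-'")


def annotate_site(genomic_pos, pre_mirna, matures):
    """Return region label and relative positions for one editing site.

    `pre_mirna` is (start, end, strand); `matures` maps a mature miRNA name
    to its (start, end) genomic interval. Sites outside the pre-miRNA are
    labeled "pri-miRNA", following the notation used in the text.
    """
    start, end, strand = pre_mirna
    pre_pos = relative_position(genomic_pos, start, end, strand)
    if pre_pos is None:
        return {"region": "pri-miRNA", "pre_pos": None, "mature": None}
    for name, (m_start, m_end) in matures.items():
        m_pos = relative_position(genomic_pos, m_start, m_end, strand)
        if m_pos is not None:
            return {"region": "mature", "pre_pos": pre_pos,
                    "mature": (name, m_pos)}
    return {"region": "pre-miRNA", "pre_pos": pre_pos, "mature": None}


# Hypothetical minus-strand hairpin; coordinates are invented.
hairpin = (1000, 1079, "-")
arms = {"example-5p": (1050, 1071), "example-3p": (1010, 1031)}
print(annotate_site(1067, hairpin, arms))
print(annotate_site(1040, hairpin, arms))
print(annotate_site(1100, hairpin, arms))
```

In the toy example, genomic position 1067 maps to position 13 of the hairpin and position 5 of the hypothetical 5p arm, while position 1100 lies outside the hairpin and receives the "pri-miRNA" label.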
The "detection method" does not specify the exact method adopted by the authors to identify miRNA editing events. Instead, it indicates which kind of methodological approach (targeted, wide-transcriptome, or both) the authors selected for editing detection. Only in two cases was the method specified, to highlight particularly sensitive and innovative approaches, i.e., miR-mmPCR-seq 15,48 and RIP-seq 49 .
The "validation method" refers to methods confirming sequencing data, especially those obtained by wide-transcriptome approaches, including enzyme knock-down (only ADAR in the current version), knock-out, differential expression, and modification-specific enzymatic cleavage 50 .
The "experiment type" specifies whether, in a particular study, individual editing events were identified in vitro, in vivo, or ex vivo. Editing events obtained by analyzing small RNA-seq data from The Cancer Genome Atlas (TCGA) 51 or Genotype-Tissue Expression (GTEx) atlas 52 were considered as detected in vivo. Editing events obtained by analyzing sequence libraries from the Sequence Read Archive (SRA) database 53 were considered as detected in vitro, in vivo, or ex vivo depending on the library derivation.
The "pathological condition" specifies whether a miRNA editing event was detected in one or multiple diseases. For a given study, physiological and pathological conditions were compared when editing levels for an individual miRNA were simultaneously available for both conditions. In studies with multiple editing level values per miRNA editing site, we considered only the minimum and maximum values, rounding them to the nearest multiple of five (e.g., editing levels of 21.1% and 44% were rounded to 20% and 45%, respectively). When a single value was reported for an individual miRNA, it was converted into an interval of 5% (e.g., if a study reported the editing level as 13% for a specific editing site, the editing level was presented as "from 10% to 15%").
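The rounding scheme for editing levels can be written down as a small sketch that reproduces the two worked examples in the text. The function names are hypothetical, and two details are assumptions not stated in the source: ties are rounded upward, and a single value lying exactly on a multiple of five is assigned to the interval starting at that value (capped at 95-100%).

```python
import math


def round_to_five(level):
    """Round a percentage to the nearest multiple of five (ties round up)."""
    return int(math.floor(level / 5 + 0.5)) * 5


def level_range(levels):
    """Summarize one or more editing levels (in %) as a (low, high) pair."""
    if len(levels) == 1:
        # Single value: report the 5%-wide interval that contains it.
        low = min(int(math.floor(levels[0] / 5)) * 5, 95)
        return low, low + 5
    # Multiple values: keep only the rounded minimum and maximum.
    return round_to_five(min(levels)), round_to_five(max(levels))


print(level_range([21.1, 30.0, 44.0]))  # (20, 45)
print(level_range([13.0]))              # (10, 15)
```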
Information concerning enzyme affinity (only ADARs in the current version) was retrieved when authors carried out enzyme-transfection experiments causing enzyme overexpression. Finally, we annotated all the functionally characterized editing events with information regarding their specific biological function. In the event of functional re-targeting, validation methods were reported along with the set of validated lost and gained targets.
Sequence logo generation. The sequence logos in Supplementary Fig. 4 were generated using the ggseqlogo function from the ggseqlogo R package (v0.1).
Secondary structure prediction analysis. We generated the minimum free energy (MFE) structures for all pre-miRNAs subjected to editing and their wild-type (WT) counterparts. The RNA secondary structures were predicted by employing the RNAfold tool from the ViennaRNA package 54 with default settings. Finally, we considered all editing sites occurring within the mature miRNA region to infer possible miRNA target re-direction as well as diversified biological functions.
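Comparing WT and edited structures requires an edited sequence for each site. The sketch below shows one possible way to build it; representing inosine as G for folding purposes is an assumption (a common convention, since inosine base-pairs like guanosine) and is not confirmed by the text, and the helper name and toy sequence are hypothetical.

```python
# Base written in place of the edited nucleotide.
# Assumption: inosine is represented as G because it pairs like guanosine.
EDIT_PRODUCT = {"A-to-I": ("A", "G"), "C-to-U": ("C", "U")}


def apply_edit(pre_mirna, position, edit_type):
    """Return the edited sequence for a 1-based position in a pre-miRNA."""
    expected, product = EDIT_PRODUCT[edit_type]
    seq = pre_mirna.upper().replace("T", "U")
    if seq[position - 1] != expected:
        raise ValueError(
            f"{edit_type} expects {expected} at position {position}")
    return seq[:position - 1] + product + seq[position:]


# Invented toy sequence, not a real pre-miRNA.
wild_type = "GGCUAGCUUACGAAGCUAGCC"
edited = apply_edit(wild_type, 5, "A-to-I")
print(wild_type)
print(edited)
```

The WT and edited sequences could then both be submitted to RNAfold with default settings, and the resulting MFE structures and energies compared side by side.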

MiRNA-target prediction and functional enrichment analyses.
The miRNA-target prediction analysis, for both edited and WT miRNAs, was performed using our web-based containerized application isoTar 55 , designed to simplify and perform miRNA consensus target prediction and functional enrichment analyses. For miRNA target predictions, we established a minimum consensus of 3. An adjusted P-value < 0.05 was used as the threshold for the functional enrichment analysis.
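The two thresholds can be illustrated with a minimal sketch. It does not reproduce isoTar: it assumes that "minimum consensus of 3" means a target must be predicted by at least three tools, and the tool names, gene symbols, and P-values are placeholders.

```python
from collections import Counter


def consensus_targets(predictions, min_consensus=3):
    """Keep targets predicted by at least `min_consensus` tools.

    `predictions` maps a tool name to the set of target genes it predicts.
    """
    votes = Counter(gene for genes in predictions.values()
                    for gene in set(genes))
    return {gene for gene, n in votes.items() if n >= min_consensus}


def significant_terms(enrichment, alpha=0.05):
    """Keep enrichment terms whose adjusted P-value is below `alpha`."""
    return {term for term, adj_p in enrichment.items() if adj_p < alpha}


# Placeholder tool names, gene symbols, and P-values for illustration only.
wt = {"tool_1": {"GENE_A", "GENE_B"}, "tool_2": {"GENE_A", "GENE_B"},
      "tool_3": {"GENE_A", "GENE_C"}, "tool_4": {"GENE_B"}}
ed = {"tool_1": {"GENE_C", "GENE_B"}, "tool_2": {"GENE_C", "GENE_B"},
      "tool_3": {"GENE_C"}, "tool_4": {"GENE_B"}}

wt_targets = consensus_targets(wt)
ed_targets = consensus_targets(ed)
print("lost:", sorted(wt_targets - ed_targets))
print("gained:", sorted(ed_targets - wt_targets))
print(significant_terms({"term_x": 0.01, "term_y": 0.20}))
```

Taking set differences between the WT and edited consensus sets gives the candidate lost and gained targets that an editing event might produce.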