AnnoMiner is a new web-tool to integrate epigenetics, transcription factor occupancy and transcriptomics data to predict transcriptional regulators

Gene expression regulation requires precise transcriptional programs, led by transcription factors in combination with epigenetic events. Recent advances in epigenomic and transcriptomic techniques provided insight into different gene regulation mechanisms. However, to date it remains challenging to understand how combinations of transcription factors together with epigenetic events control cell-type specific gene expression. We have developed the AnnoMiner web-server, an innovative and flexible tool to annotate and integrate epigenetic, and transcription factor occupancy data. First, AnnoMiner annotates user-provided peaks with gene features. Second, AnnoMiner can integrate genome binding data from two different transcriptional regulators together with gene features. Third, AnnoMiner offers to explore the transcriptional deregulation of genes nearby, or within a specified genomic region surrounding a user-provided peak. AnnoMiner’s fourth function performs transcription factor or histone modification enrichment analysis for user-provided gene lists by utilizing hundreds of public, high-quality datasets from ENCODE for the model organisms human, mouse, Drosophila and C. elegans. Thus, AnnoMiner can predict transcriptional regulators for a studied process without the strict need for chromatin data from the same process. We compared AnnoMiner to existing tools and experimentally validated several transcriptional regulators predicted by AnnoMiner to indeed contribute to muscle morphogenesis in Drosophila. AnnoMiner is freely available at http://chimborazo.ibdm.univ-mrs.fr/AnnoMiner/.

Transcriptional regulation is a highly complex process involving a combination of various molecular players and biochemical mechanisms, such as transcription factors (TFs), histone modifying enzymes, DNA methylases, as well as a structural reorganization of chromatin. Technical advances in analysing the interaction of proteins with DNA (ChIP-seq), to detect open or closed chromatin states (e.g. ATAC-seq, DNase-seq, FAIRE-seq), to detect hypermethylated CpG islands (bisulfite sequencing), or to map higher order chromosomal structural organisation (ChiaPET, Hi-C, 3C-seq) have revolutionized and significantly advanced our understanding of transcriptional regulation during the last decade 1,2 . Among other things, transcriptional enhancers were identified as crucial for regulating spatio-temporal gene expression programs by interacting with target gene promoters, often across large genomic distances (see [3][4][5] and references therein). Furthermore, the genome sequence in the chromatin is not a simple linear thread but organized in 3D, forming compartments, topologically associated domains (TADs) or chromatin loops that can bring distant elements in proximity, all of which can contribute to transcriptional regulation ( 3,6 and references therein). More recent evidence suggests even the presence of dualaction cis-regulatory modules (CRMs) that act as promoters, as well as distal enhancers 7,8 . All these findings suggest that transcriptional regulation is far more complex than initially anticipated and involves a collective effort of specific binding sites in the genome, a complex genome structure and the presence of various transcriptional

Results
The AnnoMiner web server. The AnnoMiner web server is a new tool for the convenient annotation and integration of epigenetic, and transcription factor occupancy data for the wet-lab researcher. Based on an underlying library of java classes developed for the handling of annotated genomic peak or ranges datasets, the interactive graphical user interface of the web application provides functionalities for the upload of datasets, choice of analysis mode and model organism, visualization and download of analysis results (see Fig. 1 for a schematic of AnnoMiner functions and Supplementary Figure S1 for implementation details).
Genomic peak annotation functions. AnnoMiner's three annotation functions (peak annotation, peak integration and nearby genes annotation, Fig. 1) take as input one or more files containing genomic coordinates (in BED format), for example from ChIP-seq, with the aim of finding their associations with annotated gene features. This is done by determining the overlap between the genomic coordinates of a peak and the attributes of annotated gene features. The user can set the following parameters for the search: the minimum required overlap among the gene feature attributes and peak features (in bp or %); whether only the longest isoform of a gene or all its isoforms will be considered; the gene's directionality; the organism and the genome resource. Optionally, a user-provided annotation file can be uploaded, containing for instance differential expression data which will be integrated with the gene lists generated from AnnoMiner annotation. The user-provided dataset is accepted in csv (comma separated values) or tsv (tab separated values) format. It can contain up to 6 columns, without any constraints in content, except the first column has to contain gene IDs to allow integration with AnnoMiner results.
Peak annotation. The peak annotation function computes the total coverage of the user-provided genomic regions (representing the peaks) with the attributes of each annotated gene feature in the genome assembly. AnnoMiner considers already annotated attributes of gene features (5′UTR, CDS and 3′UTR) as well as attributes or gene features provided by the user; in particular the promoter region is fully customizable with respect to the upstream and downstream region of the annotated TSS, as is the 5′ flanking region upstream and 3′ flanking region downstream of the gene body (Fig. 2a). The first result shown by AnnoMiner is a coverage plot, visualizing the total base pair coverage of all peaks with the annotated attributes of the gene features (Fig. 3a). The user next chooses a target region (corresponding to a gene feature attribute) by clicking on one of the bars and on the 'Show Genes!' button ( Fig. 3b). An interactive, sortable and downloadable table of all genes which overlap with the selected target region with peaks from the BED file is returned to the user (Fig. 3c). If the user provides a gene-based, custom annotation file, for instance containing differential expression data, these data will be integrated and displayed in the resulting table (Fig. 3d). While an annotation file will in most cases contain differential expression data based on RNA-seq analysis, it can contain any numerical or even text data. In summary, peak annotation in AnnoMiner annotates genomic regions provided by the user with gene features and their attributes in a flexible and user-centred way.
While we here demonstrate the usability of AnnoMiner's peak annotation function with a TF ChIP-seq dataset, which usually comprises narrow peaks in the vicinity or within gene promoters, it can be also used for any type of genomic peak file. In Supplementary Figure S2 (together with Supplementary Table S1), we show peak annotation for activating, as well as repressing histone modifications during early Drosophila development.
Peak integration. The peak integration function performs the peak annotation analysis, but for up to five genomic regions files (representing peaks from independent TFs, TFs together with HMs or independent HMs), allowing the user to integrate peaks from different transcriptional regulators and identify gene features and their attributes that overlap with them (Fig. 2b). The same algorithm as in peak annotation is used for the annotation of each individual peak file. With this function, genes co-regulated by TFs can be identified, or TF datasets can be integrated with chromatin structure data defined by histone modifications, or other epigenetic information derived by other experimental techniques. The coverage plots of all chosen genomic region files are returned by AnnoMiner (Fig. 4a), and the user chooses a target region for all and clicks on the 'Show Genes!' button ( Fig. 4b).
An interactive, sortable and downloadable table is returned with all genes that have a peak of the transcriptional regulators or epigenetic marks in their selected target regions (Fig. 4c). Custom gene annotation, such as differential expression data can again be provided, which is then integrated and displayed in the results table.
To show that AnnoMiner's peak integration function can be used to integrate three different datasets in one analysis step, we re-analysed a data series published on STAT3 function in different forms of diffuse large B-cell lymphomas (DLBCL, GEO super-series GSE50724 48 ). Two subtypes of DLBCL are known, germinal centre B-cell-like (GCB) and activated B-cell-like (ABC). The ABC type responds only poorly to available therapies and can be often associated with an overexpression of STAT3 48 . The authors had compared STAT3 binding by ChIP-seq analysis between 8 patient-derived cell lines from GCB-and ABC type. They had performed RNA-seq analysis of the same cell lines to retrieve differentially expressed genes between the two subtypes. We made use of these data to identify genes with increased expression levels in the ABC type together with increased STAT3 binding events. We only considered STAT3 peaks that were significantly upregulated (FDR 0.05, fold change 1.25) in ABC-type DLBCL. To demonstrate the added value of Annominer's peak integration function, we used H3K4me3 data from ENCODE from one of the ABC cell lines, OCI-Ly3 (accession: ENCFF763KFL), to limit the search to active promoters. Both peak files showed highest coverage with the direct promoter region of associated genes (Fig. 4a,b). We selected the major peaks for further analysis and integrated resulting peaks with differentially expressed genes from the same study. Of the upregulated unique genes, 42 contained an upregulated STAT3 peak in their promoter, as well as a H3K4me3 histone modification (Supplementary Figure S3a, Supplementary  Table S2). We submitted the list of 42 genes to the EnrichR web-server 44  www.nature.com/scientificreports/ Figure 1. Setup of the AnnoMiner web-server. The user interacts with the web-server via the front-end for data upload and data mining. Genomic peak data (in BED format), annotation data, e.g. from transcriptomic analysis, or a list of co-regulated genes can be uploaded to the server. The user can either annotate peaks (peak annotation), integrate peaks (peak integration) or annotate and analyse nearby genes (nearby genes annotation, long-range interactions) of a peak. AnnoMiner's fourth function allows searching for enriched transcription factor binding or histone modification events in the promoter regions of a list of co-regulated genes ( Table S2). To summarize, AnnoMiner's peak integration function was able to identify genes directly associated with ABC-type diffuse large B-cell lymphomas and potential direct targets for the transcription factor STAT3 by using a single user interface and a single AnnoMiner analysis step.
To demonstrate the usability of AnnoMiner's peak integration function further, we used four different datasets, as well as different types of genomic data. The results are displayed in Supplemental Material: first, (Supplementary Figure S4, Supplementary Table S1), we integrated H3K4me3 peak data during four stages of Drosophila MZT (maternal-to-zygotic transition) and identified its associated genes throughout MZT, which are thus constitutively transcribed from early to late MZT 49 . Second, we integrated ATAC-seq data with ChIP-seq data of an early embryonic TF and genome modifier, GAF/Trl, to identify genes activated by Trl in later stages of MZT 50 (Supplementary Figure S5, Supplementary Table S3).
Nearby genes annotation and long-range interactions. The nearby genes annotation function allows the user to visualize the differential regulation and retrieve the overlapping, as well as the closest 5 gene features up-and downstream of an individual peak (Fig. 2c). This function is most useful for exploring gene regulation in the vicinity of a genomic mutation or deletion in a non-coding region. Next to a BED file with the region of interest containing the mutation or genomic deletion, the user uploads an annotation file containing significantly differentially expressed genes. The resulting interactive AnnoMiner plot depicts the genomic neighbourhood of the peak, with the differential regulation of the overlapping and the five closest genes up-and downstream (Fig. 5a). The user can choose to visualize only the deregulation, or can discriminate between up-and down-regulation of the genes. In the latter case, the colour of the box reflects the direction of differential expression (either blue for up-, red for downregulated, green if an equal number of genes are up-and downregulated or grey if unchanged, (Fig. 5a). By selecting one or more boxes (Fig. 5b), the selected genes, as well as their log2FC and FDR with the user's annotation file will be returned in a table (Fig. 5c).
Alternative to the nearby genes annotation, users can also utilize the long-range interactions function of AnnoMiner to explore the neighbourhood of a peak (Fig. 2d). In this case, a range of base-pairs has to be chosen up-and downstream of the peak, which is then decorated with information on differential expression from the user-provided annotation file (Fig. 5d). After selecting the up-or downstream region (Fig. 5e), the user can retrieve the genes within that region of the peak together with their differential expression data (Fig. 5f). The long-range interactions function will have its best use whenever Hi-C data are available, as these types of data will give insight about the genomic boundaries and thus help to choose the genomic neighbourhood of a peak.
As a proof of concept for the nearby genes annotation function, we used a study that had shown the requirement of long-range enhancers regulating Myc expression for normal facial morphogenesis 51 (GEO dataset GSE52974). In humans, cleft lip or cleft palate (CL/P) is a frequent congenital malformation. This malformation has been associated with risk factors located at a 640 kb noncoding region on chromosome 8. The corresponding region in mouse was studied by Uslu and colleagues and refined to a more specific enhancer region, the medionasal enhancer (MNE 51 ). Deletions within the MNE in mouse led to smaller snouts and abnormalities of nasal and frontal bones amongst other defects. Myc was the only gene observed to be differentially expressed in the vicinity of this deletion. We used the CL/P deletion 8-17 (chr15:62668548-63550550) from Uslu et al. to create a single-peak BED file and uploaded it together with the significantly differentially expressed genes from the re-processed RNA-sequencing data from the same strain compared against control (GEO dataset GSE52974) to test the AnnoMiner nearby genes annotation function (Supplementary Table S4). Indeed, only a single gene is significantly differentially expressed (pink box, Fig. 5a), which is Myc (Fig. 5c). In principle, this function can also be used to explore the expression dynamics within the gene neighbourhood of multiple peaks (see Supplementary Figure S6). However, large-scale long-range gene regulatory data of this type are very sparse and their interpretation remains too complex to be exhaustively analysed with a tool like AnnoMiner. In summary, AnnoMiner's nearby genes annotation or long-range interaction functions help in an easy and quick way to identify deregulated genes in the neighbourhood of one genomic position.
Transcription factor & histone modification enrichment analysis. Transcription factor binding sites inferred from experimentally detected TF peaks in the genome can be used to predict TFs, which potentially co-regulate gene-sets. AnnoMiner's TF & HM enrichment analysis function identifies enriched peaks in the promoter regions of a user-provided gene list, for instance co-regulated genes from a transcriptomic analysis. Any valid identifier is accepted, as AnnoMiner performs gene ID conversion on-the-fly using BioMart 52 . AnnoMiner considers a gene as a potential target, if its promoter overlaps with a TF peak. The user can either choose the promoter region (up-and downstream number of base-pairs from the TSS) or use the DynamicRanges calculated by AnnoMiner, which is based on the distribution of a TF binding event relative to the TSS and therefore specific for each TF (see "Methods", not available for histone modifications). The results of the enrichment analysis are visualized as an interactive bar plot (for the first 10 hits, Fig. 6a), as well as an interactive table. In the table, all available TF ChIP-seq datasets in the AnnoMiner database for the species of interest are ranked according to their Combined Score. Along with this value, AnnoMiner also reports information about the experimental condition, cell line or developmental stage, contingency table values, p-value, enrichment score, FDR and the list of potential targets of the TF (Fig. 6b) in the downloadable version of the table.
As a proof of principle for predicting transcriptional regulators we selected a differential expression dataset from daf-16/FoxO mutants in C. elegans 53 . When uploading the list of all DAF-16A/F targets provided in 53  www.nature.com/scientificreports/ factor (Fig. 6a). Interestingly, pqm-1, which has been shown to bind to daf-16 response elements 54 , was found at 1st and 2nd position by AnnoMiner. Elt-2, a GATA-like transcription factor, which appeared as 3rd most significant hit, has been shown to bind to promoters of some daf-16-regulated genes and to be required for their regulation 55 . To conclude, AnnoMiner's TF & HM enrichment function is a powerful tool for predicting relevant transcription factors co-regulating sets of genes with similar expression patterns.

Performance evaluation of AnnoMiner's TF & HM enrichment analysis function.
We wanted to compare the performance of AnnoMiner's TF & HM enrichment analysis function with other web-tools for TF enrichment analysis. We followed in principle the evaluation protocol proposed by Keenan et al., which used PR-AUCs and ROC-AUCs calculated from the PPROC R-package for estimating performance 43 (for details see also "Methods"). In brief, we took manually curated datasets provided by 43 containing single TF perturbation experiments followed by RNA-seq from Gene Expression Omnibus 56 (GEO). Gene expression data used for benchmarking were restricted to experiments targeting TFs for which AnnoMiner is storing at least one high quality TF ChIP-seq dataset for the human assembly GRCh38, resulting in a total of 75 datasets that we could use for benchmarking. We submitted the list of significantly differentially expressed genes between perturbed TF versus wild-type control (which we hereafter refer to signature gene-sets) from these experiments to perform enrichment analysis using different tools, including AnnoMiner. We used the rank of the perturbed TF in the resulting enrichments of its associated signature gene-set to calculate PR-AUCs and ROC-AUCs. We furthermore calculated the cumulative distribution function for the ranks of each TF across all the experiments it was perturbed in. Only if a TF ranks randomly, the distribution function will be uniform; we performed Anderson-Darling tests to detect deviation from uniformity. We then computed the percentage of perturbed TFs that were correctly ranked within the first percentile to ensure that TFs were ranking high. We chose the following web-tools for comparison: ChEA3, TFEA.ChIP and EnrichR (Table 1). ChEA3 offers the user 2 different methods to rank predicted TFs (meanRank and topRank) and we evaluated both ranking methods. EnrichR offers different resources for enrichment analysis and we used the resources ARCHS4, ChEA 2016, ENCODE 2015, ENCODE and ChEA Consensus and TRRUST 2019 for enrichment analysis, respectively. The number of datasets included in the benchmarking set did depend on the resource tested (see "Methods"). The Anderson-Darling test returned significant results for all web-tools tested (ADtest in Table 1), except for EnrichR in combination with the ENCODE_and_ChEA_Consensus resource. This highlights the ability of all tools to rank the perturbed TF among the top candidates of the results. AnnoMiner outperformed TFEA.ChIP in all categories including percentage recovered TFs in the 1st percentile (7.0 vs 0.0), ROC AU (0.69 vs 0.63) and PR AUC (0.68 vs 0.60). EnrichR differed in performance depending on the resource used. On average, it outperformed AnnoMiner on the percent recovered TFs (9.5 vs 7.0), while AnnoMiner reached slightly higher values in ROC AUCs (0.69 vs 0.66) and PR AUCs (0.69 vs 0.68). ChEA3 performed similar for both ranking methods used and outperformed AnnoMiner in all categories. To summarize, though it cannot reach the performance of ChEA, AnnoMiner outperforms the other evaluated tools in identifying relevant TFs at a high rank the in gene-sets derived from TF perturbation studies. Other than ChEA, however, AnnoMiner is available for all four major model organisms, including the invertebrates Drosophila and C. elegans.
Using AnnoMiner to identify important transcriptional regulators of Drosophila flight muscle morphogenesis and growth. The assess the performance of AnnoMiner further, we tested its TF & HM enrichment analysis function to predict unknown transcriptional regulators for a list of co-regulated genes. We chose a dataset quantifying gene expression dynamic during development of Drosophila melanogaster indirect flight muscles 46 , in which mRNA from indirect flight muscles had been isolated at several time-points correlating with key steps during muscle development (Fig. 7a). We focused on the development of the contractile apparatus called myofibrillogenesis and compared gene expression at 30 h after puparium formation (APF), when myofibrils assemble, with 72 h APF, when myofibrils have matured (Fig. 7a). This comparison revealed 2193 differentially expressed genes that were shown in the original study to be strongly enriched for genes relevant for myofibril and mitochondrial development. We submitted this differential gene list to AnnoMiner (Supplementary Table S5) and searched all the 514 modERN and modENCODE TF ChIP-seq datasets from Drosophila stored in AnnoMiner for a potential enrichment of peaks. This identified 42 unique TFs significantly enriched with an FDR < 0.05. (Supplementary Table S5). The top 10 enriched datasets included Deaf1, Trl, Hr78 and cwo (Fig. 7b). Hence, these are potential transcriptional regulators of flight muscle development. The peaks from a transcription factor (TF) track can be located in the promoter region or 5′ flanking region (peak 1) of gene A, in the coding region (peaks 2 and 3) or in the 3′ flanking region of gene A (peak 4). Peak 4 is at the same time located in the 5′ flanking region/promoter region of gene B. Its association is therefore not strictly clear, which should be considered during peak annotation. (b) The peaks of two transcriptional regulators (two TFs, two Histone Modifications (HMs), or a TF and a HM, etc.) can be integrated based on their overlapping or nearby location in different attributes of a gene feature, as here demonstrated by their common overlap with the promoter region of a gene. (c,d) Activity of a transcriptional regulator, such as an enhancer, is not necessarily linked to the closest gene, but could affect neighbouring genes up-or downstream of the peak. Data from a differential expression analysis (DEA) of nearby genes can be correlated with a genomic peak harbouring a non-coding mutation or a genomic region harbouring a genome deletion. Long-range interactions can also be determined by specific genomic boundaries defining so-called TAD-domains (topologically associating domains).   The table can be sorted according to different values, downloaded, browsed online, or the gene list can be copied to the clipboard for further analysis (e.g. enrichment analysis). The user can also select a new target region or start a completely new analysis. Significant differential peaks from STAT3 (GEO dataset GSE50723) were uploaded for producing this Figure. For (d), differential expression data were uploaded from the associated GEO dataset GSE50721 from super-series GSE50724. www.nature.com/scientificreports/ Optionally, the user can upload a custom annotation file that contains for instance differential expression data (log2FC, FDR, etc.). These data are then shown in the resulting table.

Scientific Reports
The table can be sorted according to different values, downloaded, browsed online, and the list of genes can be copied to the clipboard for further analysis. The user can also select a new target region or start a completely new analysis. Significantly upregulated peaks from STAT3 (GEO dataset GSE50723 from super-series GSE50724), as well as H3K4me3 methylation data from a comparable cell line (GEO dataset GSE86718) were uploaded for producing this Figure; differential expression data were taken from the associated GEO dataset GSE50721 from super-series GSE50724. www.nature.com/scientificreports/ To identify which of the 42 transcriptional regulators may have a function during flight muscle development we next integrated data from an RNAi-screen for muscle function 57 , which had assayed for viability, flight muscle performance and body locomotion after muscle specific knock-down of individual TFs. From the 42 TFs identified by AnnoMiner, knock-down of two TFs resulted in flightless animals (CG14655, cwo), and seven were scored as lethal during development (Trl, Hr78, lola, Vsx2, Pif1B, salr, Hr51) (Supplementary Table S5); 21 TFs did not show a phenotype in this assay and the remaining 13 had not been tested (Fig. 7c).
Trithorax-like and the uncharacterized Zinc-finger protein CG14655 are required for flight muscle morphogenesis. For experimental verification we selected two proteins, Trl (Trithorax-like) and an uncharacterized zincfinger protein called CG14655. We used muscle-specific knock-down to investigate a putative function of both genes in muscle. For Trl knock-down we used 4 independent transgenic RNAi lines driven with muscle-specific Mef2-GAL4. Two of those resulted in pupal lethality and the other two resulted in viable but flightless flies, demonstrating a function of Trl in flight muscle (Supplementary Table S6). For morphological analysis we visualized the myofibrils of flight and leg muscles of mature 90 h APF pupae in wild type and three different Trl knockdown lines. We found that knock-down of Trl caused disordered and frayed myofibrils in flight muscles, whereas leg muscle myofibrils appeared normal (Fig. 7d, Supplementary Figure S7). This shows that Trl is required for normal myofibril development in flight muscle.
To investigate a role of CG14655 during muscle development we also applied muscle-specific knock-down with Mef2-GAL4 and three different RNAi lines, one of which resulted in viable but flightless animals and two other overlapping hairpins resulted in pupal lethality (Supplementary Table S6). Morphological analysis showed that CG14655 knock-down flight muscles displayed abnormal actin accumulations between their myofibrils, suggesting a role for CG14655 in myofibril development of flight muscle. Together, these findings demonstrate the predictive power of AnnoMiner to identify transcriptional regulators by combining chromatin binding and differential expression data.
AnnoMiner helps identify targets co-regulated by Sd and Yki during flight muscle growth in Drosophila. Strikingly, two of the enriched transcriptional regulators in the above comparative flight muscle development dataset were the transcriptional effector of the Hippo pathway in Drosophila called Yorkie (Yki) and its essential Tead co-factor Scalloped (Sd) 58 (Supplementary Table S5). Recently, an essential function for the Hippo pathway promoting flight muscle growth by transcriptional up-regulation of mRNAs coding for sarcomeric proteins, which built the myofibrils, was identified 47 . We wanted to know whether we could identify direct transcriptional targets of Yki/Sd during flight muscle growth. To this end, we integrated mRNA BRB-seq data from developing yki knock-down flight muscle (yki-IR), as well as from flight muscle expressing a constitutive active form of yki (yki-CA) compared to wild type controls (GEO accession GSE158957) with ChIP-seq data from Yki (mod-ENCODE dataset ENCSR422OTX) and Sd (modERN dataset ENCSR591PRH) obtained in fly embryos using AnnoMiner's peak integration function.
Both proteins showed prominent base pair coverage of the TSS regions of their target genes (Fig. 8a). We selected these peaks to retrieve associated genes and then integrated the BRB-seq data for 24 h APF yki knockdown (yki-IR 24 h), as well as 24 h and 32 h APF constitutively active yki, respectively (yki-CA 24 h, yki-CA 32 h; Supplementary Table S7). Upon knock-down of yki, already at 32 h APF, a severe myofibril assembly defect had been observed 47 . Interestingly, AnnoMiner identified Yki and Sd binding sites in the TSS of two genes essential for muscle function and development, which were downregulated in yki knock-down muscles at 24 h APF. These genes code for the sarcomeric proteins Tropomyosin 1 (Tm1) and the Nesprin-family protein Muscle-specific protein 300 kDa (Msp300), which is important to link the myofibrils to the nuclei 59 (Fig. 8b). The gain-of-function yki phenotype (yki-CA) is characterized by premature expression of sarcomeric proteins resulting in muscle fiber hyper-compaction 47 . At 24 h APF scalloped (sd) itself is the only direct target gene of the Yki/Sd complex which is differentially expressed in yki-CA (Fig. 8c). At 32 h AFP, AnnoMiner identified 177 unique genes and in total 541 transcripts as potential direct Yki/Sd targets. Using GO term enrichment analysis by modEnrichR 60 we found many processes and cellular compartments related to muscle development and function among the top enriched terms in these potential direct Yki/Sd targets (Fig. 8d, Supplementary Table S7). To conclude, using AnnoMiner's peak integration function, we identified putative direct targets of the Yki/Sd transcriptional complex Figure 5. The nearby genes annotation function of AnnoMiner. (a) AnnoMiner's nearby gene annotation function shows the differential regulation of 5 up-and down-stream, as well as the overlapping gene of a selected genomic peak. Next to a BED file with a single genomic region (a peak, a SNP or a genomic deletion), the user must upload an annotation file containing information on differential gene expression. Either general differential expression, or up-and down-regulation of neighbouring genes can be displayed. The user can also choose to consider directionality of the gene on the genomic strand. In the chosen example, a single gene is downregulated in the vicinity of the uploaded genomic regions file. (b) By selecting one or more boxes and pressing the 'Show Genes!' button, (c) the identity, as well as the log2FC and FDR of the selected gene(s) is returned in the table. (d) The long-range interactions function returns a plot with the up-and downstream regions from a peak, decorated with differential expression information taken from the user-provided annotation file. In this case, 1000 kb up-and downstream was chosen, whereby a down-regulated gene can be found in the upstream region. (e) After choosing this region and clicking on the 'Show Genes!' button, (f) a table is returned with all genes in the selected region, together with their differential expression data. We chose data from a study that had shown the requirement of long-range enhancers regulating Myc expression for normal facial morphogenesis (GEO dataset GSE52974). Upon deletion of the medionasal enhancer (MNE) in mouse, Myc was the only gene observed to be differentially expressed in the vicinity of this deletion.  www.nature.com/scientificreports/ that showed differential expression upon yki knock-down or yki constitutive activation. Many of these genes are likely important for flight muscle morphogenesis.

Discussion
Here, we introduced AnnoMiner, a web-based, flexible and user-friendly platform for genomic peak annotation and integration, as well as transcription factor enrichment analysis. We illustrated AnnoMiner's peak annotation and integration, as well as the nearby genes annotation functions with specific examples. We confirmed the predictive power of AnnoMiner's TF enrichment function experimentally by identifying important regulators of Drosophila indirect flight muscle development. This was achieved searching for overrepresented TF peaks in promoters of genes differentially regulated during myofibrillogenesis using AnnoMiner. AnnoMiner distinguishes itself from other peak annotation, as well as TF enrichment tools. AnnoMiner's peak annotation and peak integration outputs first a bar plot that shows the overlap of a peak with different gene feature attributes, including up-and downstream regions, the TSS, 5′ and 3′ UTRs and the gene body. This has two advantages: first, the user can visualize the distribution of peaks of the uploaded file with respect to all relevant attributes of annotated gene features in the genome. Second, AnnoMiner allows to interactively choose the target region(s) for which the associated genes are returned. While other tools provide statistics on the peak distribution relative to gene feature attributes (e.g. 28 ) in the output, to our knowledge, AnnoMiner is the only software that allows to easily retrieve specific gene-sets depending on the distribution of the peak coverage over gene feature attributes. The peak integration function offers the same flexibility. Moreover, both functions allow to directly integrate differential gene expression or other numerical data associated to genes with the genomic peak files. The nearby genes annotation and long-range interaction functions, which integrate expression data with peak data, is novel and for the first time, users can in a web-based manner visualize and retrieve genes that are not the nearest neighbours of a peak. It could be useful to integrate genomic, non-coding variants causative for human genetic diseases with disease-associated gene expression data and explore transcriptional activity within TAD domains. AnnoMiner's TF enrichment analysis function offers to treat promoter regions dynamically for each specific TF with its DynamicRanges function. Finally, AnnoMiner is independent of the genomic assembly of the source data, as it on-the-fly translates submitted IDs and uses the ID compatible with the database chosen for gene centred peak annotation, as well as TF enrichment.
We compared AnnoMiner's TF enrichment analysis function to the best-performing software in the field, which includes ChEA3, TFEA.ChIP and EnrichR. AnnoMiner could not reach the accuracy of ChEA3. One of the reasons could be that the dataset of TF-gene association used by ChEA3 supersedes data we retrieve from ENCODE, modENCODE and modERN, as it includes several additional datasets which are at least partially manually curated or generated. This hypothesis is supported by the fact that EnrichR shows differing performance when using different source data, showing lower performance when using the ENCODE data alone compared to the ones from ChEA3. One possible solution could be to add curated data to AnnoMiner's TF enrichment analysis function, for instance from ChIP-Atlas 61 or ReMap 62 . The disadvantage however is the higher cost in curation, as well as the fact that manually curated datasets are typically not available for model organisms such as Drosophila or C. elegans, but are rather restricted to human or mouse, as is the ChEA3 tool. Yet, users can easily add their own data using the scripts provided to fill the mongoDB present in our gitlab repository, when running AnnoMiner locally (https:// gitlab. com/ haber mann_ lab/ AnnoM iner/-/ tree/ master/ scrip ts).
Finally, we predicted and verified potential transcriptional regulators of muscle and myofibril morphogenesis as well as muscle growth using AnnoMiner's TF & HM enrichment analysis function. Amongst those is Trl, a    63 . We showed here that muscle-specific knock-down of Trl using four independent hairpins either leads to pupal lethality or flightlessness. Consistently, we find severely perturbed indirect flight muscles upon Trl knock-down whereas leg muscles appear largely normal. This indicates a preferential function of Trl in flight muscle, however as two hairpins result in pupal lethality, a role of Trl in other body muscles is also likely. A second potential direct transcriptional regulator identified by AnnoMiner is the uncharacterized Zincfinger transcription factor CG14655. Muscle-specific knock-down of CG14655 either results in pupal lethality or flightlessness and causes abnormal accumulations of actin in flight muscles. This again suggests that CG14655 is important for normal myofibril development in flight muscle.
Lastly, we made use of two other transcriptional regulators, Yorkie and Scalloped, which on DNA act in a complex 58 , to identify its direct targets. AnnoMiner identified two direct targets of Yki, which upon loss of yki were downregulated. Both code for important muscle structure proteins, constituents of the sarcomere or linking the sarcomere to the nucleus, and hence could contribute to the severe phenotype observed upon yki knock-down. Gain-of-function of yorkie results in muscle fiber hyper-compaction and premature expression of sarcomeric protein components 47 . Consistently, AnnoMiner identified a number of direct Yki targets with functions related to muscle development and growth. This substantiates a role for Scalloped and its transcriptional co-factor Yorkie during flight muscle growth.
To conclude, the new web-tool AnnoMiner is a user-friendly, intuitive, interactive and highly-flexible platform for genomic peak annotation and peak integration. It is suitable for identification of nearby genes or long-range interactions of a genomic peak, as well as to perform Transcription Factor and Histone Modification enrichment analysis for a list of genes. This manuscript details all AnnoMiner functions and shows its usefulness for annotating and integrating peaks from two different ChIP-seq experiments together with transcriptomics data. Finally, AnnoMiner helped identify several key regulators of indirect flight muscle development and growth in Drosophila, some of which were confirmed experimentally.

Methods
The AnnoMiner software and database. AnnoMiner is a modular software consisting of a library of java classes for retrieval, storage and analysis of annotated genomic peak data together with a web application providing a graphical user interface for a number of predefined analyses.
The AnnoMiner web-sever is a JavaEE web application implemented in the front-end as a single page application using Javascript, jQuery and Bootstrap 4. Queries for the provided analyses are processed by individual servlets that send a response back to the front-end in JSON format (Supplementary Figure S1).
The java library used in the back-end manages the retrieval of data from remote sources at UCSC and modENCODE 36 or alternatively from files in BED or gff format. Retrieved datasets are stored in a MongoDB database. This document-oriented NoSQL database system has been chosen for its flexible data model and runtime speed benefits benchmarked against an SQL database solution. For optimal performance, a custom database connectivity layer has been developed based on the MongoDB java driver. The overall structure of the application is shown in Fig. 1 and Supplementary Figure S1.
AnnoMiner's database currently holds genomic data from ENCODE 33 , modENCODE 36 and modERN 37 . For each model organism, the latest genome assembly is stored. For human, mouse and Drosophila, we also provide the second latest release.
Following the Findability, Accessibility, Interoperability and Reusability (FAIR) principle 64 , documentation and source code of the tool are available on GitLab: https:// gitlab. com/ haber mann_ lab/ annom iner. Using the java library classes from the code repository developers will be able to define custom analyses on annotated genomic ranges datasets and extend the database towards new data sources. The repository also provides executable java classes and python scripts for local maintenance such as to populate a local database with ENCODE, modEN-CODE or modERN data or to reproduce our benchmark analysis. Comparing 30 h to 72 h APF revealed a list of 2193 differentially expressed genes (taken from GSE107247), which was uploaded to AnnoMiner. The TF enrichment analysis function of AnnoMiner identified 42 unique candidate TFs. (b) Output of the top-ten enriched TFs from AnnoMiner. Trl was identified from 2 different ChIP-seq datasets and is highlighted in red. (c) Phenotypic integration with muscle-specific RNAi knock-down data revealed candidate TFs with a potential function in muscle (flightless or lethal). Data on DEGs, phenotypic data from the RNAi screen, as well as annotated AnnoMiner enrichment results are available in Supplementary  Table S5. (d) Trl has a function during flight muscle myofibrillogenesis. Flight and leg muscles of 90 h APF pupae were fixed and stained for actin (phalloidin in green) and titin homolog Sls (anti-Kettin in red). Note that muscle-specific knock-down of Trl using independent hairpins (Trl-IR-1 and IR-2) results in frayed flight muscle myofibrils compared to wild type control but rather normal leg muscle myofibrils. (e) CG14655 has a function in flight muscles. Flight muscles of young adult flies from wild type or CG14655-IR-1 or of 90 h APF pupae from CG14655-IR-2 were stained for actin (phalloidin). Note the prominent actin accumulations present in both CG14655 knock-down flight muscles.  www.nature.com/scientificreports/ introns), 3′UTR, 3′ flanking regions, as well as strandedness of the gene feature. For integrating peaks with gene features, we first calculate the coverage of each peak by testing whether the peak coordinates fall within the coordinates of any attribute of a gene feature. If all isoforms of a gene are chosen, the overlap of the peak with the attributes of each individual isoform will be checked. For calculating the coverage, overlap events are calculated cumulatively and the sums of all overlaps will be divided by the length of the attribute. Depending on which attribute (target region) the user has chosen, only genes whose chosen attribute overlaps (with a minimum % or number of base pairs) with the peak will be displayed in the data table.
Peak integration. The peak integration function is based on the same algorithm as the peak annotation. However, only genes, which meet the overlap criteria of the user-selected target regions with peaks from all provided BED files will be displayed in the resulting data table.
Nearby genes annotation. For the nearby genes annotation function, the 5 neighbouring genes up-and downstream, as well as the overlapping gene of a peak are extracted and displayed to the user. The user can choose whether the directionality of the gene with respect to the peak should be considered. If a gene overlaps with the peak, it will be shown in the Overlap box. Data on differential expression, which needs to be provided by the user in form of an annotation file, is displayed on the returned plot. The user can choose whether or not the directionality of the deregulation (up-or downregulated) is shown in the box plot. Depending on the user input, the nearby genes annotation function returns neighbouring genes of a single peak, or of many peaks.
Long-range interaction. With the long-range interactions function, the user is able to extract and visualize all the genes up-and down-stream the user provided peak (or peaks), which lie within an user-defined genomic window (default 50 kb upstream and 50 kb downstream). The user can choose whether the directionality of the gene with respect to the peak should be considered. Data on differential expression, which needs to be provided by the user in form of an annotation file, is displayed on the returned plot. The user can choose whether or not the directionality of the deregulation (up-or downregulation) is shown in the box plot. In the first case, hovering on the box of interest, the user can extract information regarding the overall number of deregulated genes (coverage). In the second case, the user will extract separate information about the number of up-and down-regulated genes. Depending on the user input, the long-range interactions function returns neighbouring genes of a single peak, or of many peaks within the user-defined genomic window.
Transcription factor enrichment analysis. Transcription factor enrichment analysis in AnnoMiner uses publicly available TF ChIP-seq data from the ENCODE, modENCODE and modERN databases. In order to keep only high-quality data, we undertook a strict selection of available TF ChIP-seq datasets. First, we kept only experiments, for which data had already been pre-analysed following the ENCODE processing pipeline. Second, only experiments with replicates and optimal IDR peaks files (narrowPeak) were retained. Third, we manually inspected each narrowPeak file, evaluating the peak distribution around the gene's TSS and discarded the ones with irregular peak distributions in relation to promoter regions of genes, as we considered those as outliers (Supplementary Figure S8c,d; all outliers csv files that were excluded are available in our gitlab repository at https:// gitlab. com/ haber mann_ lab/ AnnoM iner-paper/-/ tree/ master/ ChIP-seq_ exper iment_ exclu ded). The number of used TF ChIP-seq datasets for each of the model organisms and assemblies is available in Supplementary Tables S8, S9.
Over-representation analysis. For each TF ChIP-seq dataset stored in the AnnoMiner database, we first compute the contingency table of potential targets in both the uploaded gene list (signature gene-set) and in the background list of all genes of the chosen organism. We then apply a one-tailed hypergeometric test followed by Benjamini-Hochberg correction 65 . Furthermore, we calculate the enrichment score as follows: And a combined score based on the p-value and enrichment score according to 66 : Promoter definition. AnnoMiner allows the user to fully customize the promoter region considered for TF enrichment analysis. The user can choose both the upstream and downstream borders to consider without constraints.
DynamicRanges option. The user can also choose dynamic ranges that we pre-calculated for each individual TF based on available ChIP-seq data stored in AnnoMiner. In brief, for each TF ChIP-seq experiment we computed the distances between its peaks and the genes in a range of 20,000 bp upstream of their TSSs. We then binned the closest genes by distances with a resolution of 50 bp. A noise level was computed, representing non-significant bindings, from 5000 to 20,000 bp, as follows: Next, we smoothened the binning using the simple moving average approach (SMA) with a sliding window of size = 3. Intersecting the noise threshold with the SMA curve, we obtained the upstream TSS value. This value score = (listHits ÷ listSize)/ genomeHits ÷ genomeSize combined score = score * −log10 p-value . www.nature.com/scientificreports/ is unique for each TF and represents the distance range in which the TF is more often binding than is expected at random (see Supplementary Figure S8e,f).
Choosing the peak-promoter overlap. The user can define the amount of overlap between a TF and a promoter in percentage or in base pairs (bp) to categorize a gene as a potential target.
Histone modification enrichment analysis. The same function that is used for the Transcription factor enrichment analysis can be used for enriching histone modifications for a gene list. In this case, a Dynami-cRanges-like option is not meaningful. Thus, the user has to manually define the promoter region (default: 2000 bp upstream and 500 bp downstream of a gene's TSS).
Automatic ID conversion. In order to easily allow comparisons between genome versions (ENSEMBL, RefSeq, UCSC, Genecode, Flybase and Wormbase) and not restrict the user to a specific set of gene identifiers, we developed a geneID converter. This function converts on the fly the user's input gene IDs to match the selected resource, without any intervention from the user. The conversion is performed using the latest release of BioMart 52 . BioMart data were downloaded and are stored locally for reasons of speed and are regularly updated. Users can download the ID conversion table to control for genes that were dropped due to the automatic ID conversion process. Benchmarking metrics. Precision-Recall (PR) and Receiver-Operating Characteristic (ROC) areas under the curves (AUC) for the benchmarkingSet were computed using the R package PRROC 67 , following the benchmarking procedure proposed and applied by 43 and 68 . In brief, for each TF, signature gene-sets were derived based on differentially expressed genes of the perturbed TF versus its control and used for TF enrichment analysis. For each dataset, the rank of the perturbed TF was assigned to the positive (foreground (fg)) class, while all other ranks of TFs retrieved for the dataset were assigned to the negative (background (bg)) class; using the PPROC package, precision-recall (PR) and Receiver-Operator Curves (ROC) using continuous interpolation were computed and areas under curve (AUCs) for both were calculated (PR-AUC and ROC-AUC). As the negative bg class greatly exceeded the positive fg class, we down-sampled the negative class to the same size as the positive class. We repeated this procedure 5000 times. We used the approx R function to linearly interpolate between all points from the 5000 ROC and the 5000 PR curves and calculated the mean ROC and PR AUCs values of all bootstrap runs. Furthermore, we analysed the cumulative distribution of the rank values for each TF across all experiments it was perturbed in, which is not expected to be uniform, unless the TF would rank randomly in all its associated datasets. We then performed the Anderson-Darling test using the goftest R package to test for significant deviations from this distribution, indicating a non-random distribution of the TF. Finally, we computed the percentage of perturbed TFs that were correctly ranked within the first percentile to ensure high ranking of the TF. All values are available in Table 1.
Benchmarking against existing tools. We compared the performance of AnnoMiner's TF enrichment function with the performance of the following web-servers: TFEA.ChIP 42  Epigenetic and RNA-seq data used. For analysing STAT3 in acute B-cell lymphomas, we used data from super-series GSE50724 48 . Only peaks which were significantly upregulated in ABC type B-cell lymphoma were selected for AnnoMiner's peak integration (FDR cut-off < = 0.05; FC > = 1.25). We used H3K4me3 data from an ABC cell line stored in the ENCODE database (GSE86718_ENCFF763KFL, replicated peaks BED file) for peak integration with significantly upregulated STAT3 peaks. Human genome version hg19 was used for analysis. RNA-seq data (GSE50721) from the same super-series was used for integration with STAT3 and H3K4me3 peaks, whereby we used statistical results provided by the authors of the study. For demonstrating the nearby genes annotation function, we used data from 51 and GEO dataset GSE52974, describing long-range enhancer elements regulating Myc expression for normal facial morphogenesis, mapped to mouse genome version mm9. For demonstration of the TF & HM enrichment function we used RNA-seq data from a longevity study focusing on daf-16/FoxO mutants in C. elegans ( 53 and GEO dataset GSE72426) and C. elegans genome version ce11/ WBcel245. Analysed RNA-seq data from developing indirect flight muscle (IFM) in Drosophila melanogaster were taken from GSE107247 46  www.nature.com/scientificreports/ formation (APF) provided by the authors. Genes with log2FC in absolute value greater than 2 and FDR smaller than 0.01 were considered as differentially expressed. For integration with phenotypic data, we made use of data published on muscle morphogenesis and function in Drosophila 57 and flight muscle growth in Drosophila 47 .
For integrating gene expression changes of yki knocked-down flight muscles and flight muscles expressing a constitutively active yki compared to wild type, with Yki and Sd protein genome occupancy data, we used BRB-seq data from yki knock-down and yki constitutively active fly strains 47 available from GEO (GSE158957). Optimal IDR threshold narrow peak files for Yorkie (Yki, ENCSR422OTX) and Scalloped (Sd, ENCSR591PRH) ChIP-seq were downloaded from the ENCODE/modERN resource, both for Drosophila genome version dm6.
Drosophila experimental methods. Drosophila strains were grown under standard conditions at 27 °C to enhance GAL4 activity. All RNAi knock-downs were induced with Mef2-GAL4, a GAL4 line specifically expressed during development of all muscle types of the fly 57 . UAS RNAi lines used were from the Vienna 69 and Harvard collections 70 and ordered from VDRC or Bloomington stock centres. See Supplementary Table S6 for all genotypes used. The flight test was done as described in 57 .
Indirect flight muscle morphology was analysed in 3 to 5 days adult males or in 90 h APF pupae as previously published 71 . Briefly, 90 h APF pupae or adult flies were fixed with 4% paraformaldehyde in PBT (PBS, 0.5% triton X-100) and cut into half-thoraces using a sharp microtome blade. The half-thoraces were blocked with 3% normal goat serum in PBT for 30 min and F-actin in the flight and leg muscles was visualised with phalloidin (coupled to Alexa488 or rhodamine, Molecular Probes, 1:1000 in PBT for 2 h or overnight) and Sls was stained with anti-Kettin antibody MAC155 (Babraham Institute, 1:100 in PBT overnight). The stained half-thoraces were imbedded in Vectashield (Biozol) and imaged on an LSM780 or LSM880 confocal microscope. Images were processed using Fiji 72 .