Introduction

Transcriptional regulation is a highly complex process involving a combination of various molecular players and biochemical mechanisms, such as transcription factors (TFs), histone modifying enzymes, DNA methylases, as well as a structural reorganization of chromatin. Technical advances in analysing the interaction of proteins with DNA (ChIP-seq), to detect open or closed chromatin states (e.g. ATAC-seq, DNase-seq, FAIRE-seq), to detect hypermethylated CpG islands (bisulfite sequencing), or to map higher order chromosomal structural organisation (ChiaPET, Hi-C, 3C-seq) have revolutionized and significantly advanced our understanding of transcriptional regulation during the last decade1,2. Among other things, transcriptional enhancers were identified as crucial for regulating spatio-temporal gene expression programs by interacting with target gene promoters, often across large genomic distances (see3,4,5 and references therein). Furthermore, the genome sequence in the chromatin is not a simple linear thread but organized in 3D, forming compartments, topologically associated domains (TADs) or chromatin loops that can bring distant elements in proximity, all of which can contribute to transcriptional regulation (3,6 and references therein). More recent evidence suggests even the presence of dual-action cis-regulatory modules (CRMs) that act as promoters, as well as distal enhancers7,8. All these findings suggest that transcriptional regulation is far more complex than initially anticipated and involves a collective effort of specific binding sites in the genome, a complex genome structure and the presence of various transcriptional regulators. Hence, we need a tool that can ideally integrate all this information to better understand and predict transcriptional regulation.

Techniques such as ChIP-seq, ATAC-seq, Hi-C seq and others involve an NGS (Next Generation Sequencing) step, resulting typically in paired-end reads of isolated chromosomal fragments. Therefore, the first steps after sequencing consists of read mapping, which is usually done using a software such as BowTie29, followed by peak calling. There are a number of tools available for peak calling (reviewed e.g. in10,11,12,13), which include MACS214 for standard peak calling, ChIPdiff15, EpiCenter16 or diffReps17 for differential peak calling. The output of peak callers are the genomic coordinates of epigenetic marks or transcription factor binding sites under study. These coordinates are commonly stored in the Browser Extensible Data (BED) file format, a light-weight, standardized format to share genome coordinates.

The next step to biologically interpret genomic coordinates (also called peaks) is their genomic annotation, a process referred to as peak annotation or gene assignment. A number of peak annotation tools exist. Some of them combine ChIP-seq data analysis (including peak calling) and peak annotation, such as the ChIP-Seq tools and web server (18, web-based), Sole-Search19, CIPHER20, Nebula (21, web-based), PeakAnalyser22, BEDTools23 or HOMER24. Some of them are specific to peak annotation and visualization, such as ChIPseeker25, UROPA26, annoPeak (27, web-based), ChIPseek (28, web-based), PAVIS (29, web-based), Goldmine30, GREAT (31, web-based), or ChIPpeakAnno32. While most of the peak annotators assign peaks to the closest TSS (Transcription start site) or genome feature, others consider up- and down-stream gene features, or provide the overlap with gene feature attributes (such as promoter, 5′UTR, 3′UTR, exon, intron) of the nearest gene18,21,22,25,26,29. The web-tool GREAT31 offers peak annotation and gene assignment in larger genomic regions based on Gene Ontology (GO)-term similarity. What appears missing is a web-based, visual tool that helps experimental biologists to explore and integrate peak data with transcriptomics data in a flexible, user-centred and easy way. AnnoMiner is designed to close this gap.

An enormous community effort has been invoked in the past decade to collect, standardize and present annotated genomic data in form of the ENCODE33,34,35, modENCODE36 and modERN37 resources, with the aim to make sense of the encyclopaedias of genomes. These initiatives have also made it possible to explore available genomic data further and integrate them with each other as well as with user-generated data. These standardized, high-quality data can be used to identify enriched transcription factor binding events in promoters of co-regulated genes, for example from a differential gene expression dataset (e.g. an RNA-seq dataset). This form of data integration predicts possible transcriptional regulators for biological processes under study, and is already widely used in the community. Enrichment analysis is commonly performed by testing for TF overrepresentation in the promoter regions of a user-provided gene list compared to a background list (e.g. considering the entire genome). Most available tools define the promoter region rigidly as a range of upstream and downstream base pairs from the annotated gene transcription start site (TSS) and these parameters are kept fixed for all the TFs tested, without accounting for the differences between individual transcription factors. Following this approach, divers web-based tools have been developed, including VIPER38, DoRothEA39, BART40, oPOSSUM41, TFEA.ChIP42, ChEA343, EnrichR44 or i-cisTarget45. However, it would be advantageous to have a web-based, flexible and user-friendly solution for TF enrichment, considering promoter boundaries specific to each TF and working for the most widely used species (human, mouse, Drosophila and C. elegans).

Here we present AnnoMiner, a flexible, web-based, and user-friendly platform for peak annotation and integration, as well as TF and HM (histone modification) enrichment analysis. AnnoMiner allows users to annotate and integrate multiple genomic regions files with gene feature attributes and with transcriptomic data in an interactive and flexible way. AnnoMiner contains three distinct functions with different genomic peak annotation purposes in mind: first, peak annotation, which can be used to assign attributes of gene features to peaks, including user-defined upstream and downstream regions, TSS, 5′ and 3′UTRs, as well as the coding region; second, peak integration to search for overlapping binding events of up to five transcriptional regulators (e.g. different TMs; different HMs; or combinations of both) with gene feature attributes; these two annotation functions can optionally integrate user provided data, such as results from differential gene expression analysis, allowing to inspect for instance expression data and genomic peak data from multiple transcriptional regulators together. And third, nearby genes annotation or long-range interactions, which helps to identify long range interaction effects on gene expression of a single genomic region. The function for nearby genes annotation and long-range interactions requires the upload of data from differential expression analysis. As a fourth function, AnnoMiner performs TF (Transcription factor) and HM (Histone Modification) enrichment analysis, using all high-quality filtered TF and HM data from ENCODE, modENCODE and modERN, which are stored in an internal database, and a user defined gene lists as input. AnnoMiner’s TF enrichment function offers the DynamicRanges option, which uses promoter regions specific to each TF based on pre-calculated binding densities for each individual TF. We tested the predictive power of AnnoMiner’s TF enrichment function and applied it to flight muscle development in Drosophila focusing on the process of myofibril morphogenesis46. AnnoMiner predicted two potential transcriptional regulators, Trithorax-like (Trl) and the uncharacterized zinc-finger protein CG14655 to play a role during this process. For both, we provide experimental evidence for an essential function in myofibril morphogenesis in flight muscle. Thus, AnnoMiner correctly predicted a new role for Trl, as well as CG14655 as transcriptional regulators in a specific muscle type in Drosophila. Finally, we used AnnoMiner to predict direct transcriptional targets for the likewise enriched TFs Yorkie (Yki) and Scalloped (Sd) required for flight muscle growth47.

Results

The AnnoMiner web server

The AnnoMiner web server is a new tool for the convenient annotation and integration of epigenetic, and transcription factor occupancy data for the wet-lab researcher. Based on an underlying library of java classes developed for the handling of annotated genomic peak or ranges datasets, the interactive graphical user interface of the web application provides functionalities for the upload of datasets, choice of analysis mode and model organism, visualization and download of analysis results (see Fig. 1 for a schematic of AnnoMiner functions and Supplementary Figure S1 for implementation details).

Figure 1
figure 1

Setup of the AnnoMiner web-server. The user interacts with the web-server via the front-end for data upload and data mining. Genomic peak data (in BED format), annotation data, e.g. from transcriptomic analysis, or a list of co-regulated genes can be uploaded to the server. The user can either annotate peaks (peak annotation), integrate peaks (peak integration) or annotate and analyse nearby genes (nearby genes annotation, long-range interactions) of a peak. AnnoMiner’s fourth function allows searching for enriched transcription factor binding or histone modification events in the promoter regions of a list of co-regulated genes (TF & HM enrichment analysis). In the back-end, Java servlets are responsible for data upload, analysis, integration and ID conversion. User data, genome feature data from ENSEMBL, UCSC and NCBI, as well as ENCODE, modENCODE and modERN ChIP-seq data are stored in a MongoDB database. Currently, we host genome feature data from H. sapiens, M. musculus, D. melanogaster, C. elegans and S. cerevisiae. Data from ENCODE, modENCODE and modERN are available in AnnoMiner for human, mouse, Drosopihla and C. elegans.

Genomic peak annotation functions

AnnoMiner’s three annotation functions (peak annotation, peak integration and nearby genes annotation, Fig. 1) take as input one or more files containing genomic coordinates (in BED format), for example from ChIP-seq, with the aim of finding their associations with annotated gene features. This is done by determining the overlap between the genomic coordinates of a peak and the attributes of annotated gene features. The user can set the following parameters for the search: the minimum required overlap among the gene feature attributes and peak features (in bp or %); whether only the longest isoform of a gene or all its isoforms will be considered; the gene’s directionality; the organism and the genome resource. Optionally, a user-provided annotation file can be uploaded, containing for instance differential expression data which will be integrated with the gene lists generated from AnnoMiner annotation. The user-provided dataset is accepted in csv (comma separated values) or tsv (tab separated values) format. It can contain up to 6 columns, without any constraints in content, except the first column has to contain gene IDs to allow integration with AnnoMiner results.

Peak annotation

The peak annotation function computes the total coverage of the user-provided genomic regions (representing the peaks) with the attributes of each annotated gene feature in the genome assembly. AnnoMiner considers already annotated attributes of gene features (5′UTR, CDS and 3′UTR) as well as attributes or gene features provided by the user; in particular the promoter region is fully customizable with respect to the upstream and downstream region of the annotated TSS, as is the 5′ flanking region upstream and 3′ flanking region downstream of the gene body (Fig. 2a). The first result shown by AnnoMiner is a coverage plot, visualizing the total base pair coverage of all peaks with the annotated attributes of the gene features (Fig. 3a). The user next chooses a target region (corresponding to a gene feature attribute) by clicking on one of the bars and on the ‘Show Genes!’ button (Fig. 3b). An interactive, sortable and downloadable table of all genes which overlap with the selected target region with peaks from the BED file is returned to the user (Fig. 3c). If the user provides a gene-based, custom annotation file, for instance containing differential expression data, these data will be integrated and displayed in the resulting table (Fig. 3d). While an annotation file will in most cases contain differential expression data based on RNA-seq analysis, it can contain any numerical or even text data. In summary, peak annotation in AnnoMiner annotates genomic regions provided by the user with gene features and their attributes in a flexible and user-centred way.

Figure 2
figure 2

Different scenarios for the integration of peaks with gene features. (a) The peaks from a transcription factor (TF) track can be located in the promoter region or 5′ flanking region (peak 1) of gene A, in the coding region (peaks 2 and 3) or in the 3′ flanking region of gene A (peak 4). Peak 4 is at the same time located in the 5′ flanking region/promoter region of gene B. Its association is therefore not strictly clear, which should be considered during peak annotation. (b) The peaks of two transcriptional regulators (two TFs, two Histone Modifications (HMs), or a TF and a HM, etc.) can be integrated based on their overlapping or nearby location in different attributes of a gene feature, as here demonstrated by their common overlap with the promoter region of a gene. (c,d) Activity of a transcriptional regulator, such as an enhancer, is not necessarily linked to the closest gene, but could affect neighbouring genes up- or downstream of the peak. Data from a differential expression analysis (DEA) of nearby genes can be correlated with a genomic peak harbouring a non-coding mutation or a genomic region harbouring a genome deletion. Long-range interactions can also be determined by specific genomic boundaries defining so-called TAD-domains (topologically associating domains).

Figure 3
figure 3

Peak annotation function of AnnoMiner. In the peak annotation function, AnnoMiner allows users to annotate peaks from a BED file. (a) After uploading a BED file to the server and choosing the function peak annotation, a coverage plot is produced, showing the covered base-pairs/total length relative to attributes extracted from the gene features of an organism’s genome. These include the 5′ upstream flanking region, the TSS (− 2000 bp and + 500 bp), the 5′UTR, coding sequence (CDS), 3′UTR, as well as the 3′ flanking region, whereby the boundaries for the TSS and flanking regions can be set by the user. (b) The plot is interactive and the user should select the target region (corresponding to the attribute of interest) to be analysed by selecting one or more bars to proceed. (c) When clicking on ‘Show Genes!’, the user retrieves a list of associated genes that fulfils the selected overlap criteria with the peaks. The table can be sorted according to different values, downloaded, browsed online, or the gene list can be copied to the clipboard for further analysis (e.g. enrichment analysis). The user can also select a new target region or start a completely new analysis. Significant differential peaks from STAT3 (GEO dataset GSE50723) were uploaded for producing this Figure. For (d), differential expression data were uploaded from the associated GEO dataset GSE50721 from super-series GSE50724.

While we here demonstrate the usability of AnnoMiner’s peak annotation function with a TF ChIP-seq dataset, which usually comprises narrow peaks in the vicinity or within gene promoters, it can be also used for any type of genomic peak file. In Supplementary Figure S2 (together with Supplementary Table S1), we show peak annotation for activating, as well as repressing histone modifications during early Drosophila development.

Peak integration

The peak integration function performs the peak annotation analysis, but for up to five genomic regions files (representing peaks from independent TFs, TFs together with HMs or independent HMs), allowing the user to integrate peaks from different transcriptional regulators and identify gene features and their attributes that overlap with them (Fig. 2b). The same algorithm as in peak annotation is used for the annotation of each individual peak file. With this function, genes co-regulated by TFs can be identified, or TF datasets can be integrated with chromatin structure data defined by histone modifications, or other epigenetic information derived by other experimental techniques. The coverage plots of all chosen genomic region files are returned by AnnoMiner (Fig. 4a), and the user chooses a target region for all and clicks on the ‘Show Genes!’ button (Fig. 4b). An interactive, sortable and downloadable table is returned with all genes that have a peak of the transcriptional regulators or epigenetic marks in their selected target regions (Fig. 4c). Custom gene annotation, such as differential expression data can again be provided, which is then integrated and displayed in the results table.

Figure 4
figure 4

The peak integration function of AnnoMiner. If a user wants to integrate peaks from two transcriptional regulators and annotate them with gene feature attributes, the peak integration function must be chosen. (a) Two coverage plots of the two selected factors with gene feature attributes are shown to the user, from which a target region for each of them must be chosen (b). (c) When clicking on ‘Show Genes!’, the associated genes are displayed as a table. Optionally, the user can upload a custom annotation file that contains for instance differential expression data (log2FC, FDR, etc.). These data are then shown in the resulting table. The table can be sorted according to different values, downloaded, browsed online, and the list of genes can be copied to the clipboard for further analysis. The user can also select a new target region or start a completely new analysis. Significantly upregulated peaks from STAT3 (GEO dataset GSE50723 from super-series GSE50724), as well as H3K4me3 methylation data from a comparable cell line (GEO dataset GSE86718) were uploaded for producing this Figure; differential expression data were taken from the associated GEO dataset GSE50721 from super-series GSE50724.

To show that AnnoMiner’s peak integration function can be used to integrate three different datasets in one analysis step, we re-analysed a data series published on STAT3 function in different forms of diffuse large B-cell lymphomas (DLBCL, GEO super-series GSE5072448). Two subtypes of DLBCL are known, germinal centre B-cell-like (GCB) and activated B-cell-like (ABC). The ABC type responds only poorly to available therapies and can be often associated with an overexpression of STAT348. The authors had compared STAT3 binding by ChIP-seq analysis between 8 patient-derived cell lines from GCB- and ABC type. They had performed RNA-seq analysis of the same cell lines to retrieve differentially expressed genes between the two subtypes. We made use of these data to identify genes with increased expression levels in the ABC type together with increased STAT3 binding events. We only considered STAT3 peaks that were significantly upregulated (FDR 0.05, fold change 1.25) in ABC-type DLBCL. To demonstrate the added value of Annominer’s peak integration function, we used H3K4me3 data from ENCODE from one of the ABC cell lines, OCI-Ly3 (accession: ENCFF763KFL), to limit the search to active promoters. Both peak files showed highest coverage with the direct promoter region of associated genes (Fig. 4a,b). We selected the major peaks for further analysis and integrated resulting peaks with differentially expressed genes from the same study. Of the upregulated unique genes, 42 contained an upregulated STAT3 peak in their promoter, as well as a H3K4me3 histone modification (Supplementary Figure S3a, Supplementary Table S2). We submitted the list of 42 genes to the EnrichR web-server44 and could identify terms strongly related to cancer, diffuse large B-cell lymphomas of the ABC type, IL10/STAT3 signalling and other relevant terms for the disease under study (Supplementary Figure S3b,c; Supplementary Table S2). Moreover, AnnoMiner identified 28 additional direct targets of STAT3 compared to the ones already described in the original study (Supplementary Table S2). To summarize, AnnoMiner’s peak integration function was able to identify genes directly associated with ABC-type diffuse large B-cell lymphomas and potential direct targets for the transcription factor STAT3 by using a single user interface and a single AnnoMiner analysis step.

To demonstrate the usability of AnnoMiner’s peak integration function further, we used four different datasets, as well as different types of genomic data. The results are displayed in Supplemental Material: first, (Supplementary Figure S4, Supplementary Table S1), we integrated H3K4me3 peak data during four stages of Drosophila MZT (maternal-to-zygotic transition) and identified its associated genes throughout MZT, which are thus constitutively transcribed from early to late MZT49. Second, we integrated ATAC-seq data with ChIP-seq data of an early embryonic TF and genome modifier, GAF/Trl, to identify genes activated by Trl in later stages of MZT50 (Supplementary Figure S5, Supplementary Table S3).

Nearby genes annotation and long-range interactions

The nearby genes annotation function allows the user to visualize the differential regulation and retrieve the overlapping, as well as the closest 5 gene features up- and downstream of an individual peak (Fig. 2c). This function is most useful for exploring gene regulation in the vicinity of a genomic mutation or deletion in a non-coding region. Next to a BED file with the region of interest containing the mutation or genomic deletion, the user uploads an annotation file containing significantly differentially expressed genes. The resulting interactive AnnoMiner plot depicts the genomic neighbourhood of the peak, with the differential regulation of the overlapping and the five closest genes up- and downstream (Fig. 5a). The user can choose to visualize only the deregulation, or can discriminate between up- and down-regulation of the genes. In the latter case, the colour of the box reflects the direction of differential expression (either blue for up-, red for downregulated, green if an equal number of genes are up- and downregulated or grey if unchanged, (Fig. 5a). By selecting one or more boxes (Fig. 5b), the selected genes, as well as their log2FC and FDR with the user’s annotation file will be returned in a table (Fig. 5c).

Figure 5
figure 5

The nearby genes annotation function of AnnoMiner. (a) AnnoMiner’s nearby gene annotation function shows the differential regulation of 5 up- and down-stream, as well as the overlapping gene of a selected genomic peak. Next to a BED file with a single genomic region (a peak, a SNP or a genomic deletion), the user must upload an annotation file containing information on differential gene expression. Either general differential expression, or up- and down-regulation of neighbouring genes can be displayed. The user can also choose to consider directionality of the gene on the genomic strand. In the chosen example, a single gene is downregulated in the vicinity of the uploaded genomic regions file. (b) By selecting one or more boxes and pressing the ‘Show Genes!’ button, (c) the identity, as well as the log2FC and FDR of the selected gene(s) is returned in the table. (d) The long-range interactions function returns a plot with the up- and downstream regions from a peak, decorated with differential expression information taken from the user-provided annotation file. In this case, 1000 kb up- and downstream was chosen, whereby a down-regulated gene can be found in the upstream region. (e) After choosing this region and clicking on the ‘Show Genes!’ button, (f) a table is returned with all genes in the selected region, together with their differential expression data. We chose data from a study that had shown the requirement of long-range enhancers regulating Myc expression for normal facial morphogenesis (GEO dataset GSE52974). Upon deletion of the medionasal enhancer (MNE) in mouse, Myc was the only gene observed to be differentially expressed in the vicinity of this deletion.

Alternative to the nearby genes annotation, users can also utilize the long-range interactions function of AnnoMiner to explore the neighbourhood of a peak (Fig. 2d). In this case, a range of base-pairs has to be chosen up- and downstream of the peak, which is then decorated with information on differential expression from the user-provided annotation file (Fig. 5d). After selecting the up- or downstream region (Fig. 5e), the user can retrieve the genes within that region of the peak together with their differential expression data (Fig. 5f). The long-range interactions function will have its best use whenever Hi-C data are available, as these types of data will give insight about the genomic boundaries and thus help to choose the genomic neighbourhood of a peak.

As a proof of concept for the nearby genes annotation function, we used a study that had shown the requirement of long-range enhancers regulating Myc expression for normal facial morphogenesis51 (GEO dataset GSE52974). In humans, cleft lip or cleft palate (CL/P) is a frequent congenital malformation. This malformation has been associated with risk factors located at a 640 kb noncoding region on chromosome 8. The corresponding region in mouse was studied by Uslu and colleagues and refined to a more specific enhancer region, the medionasal enhancer (MNE51). Deletions within the MNE in mouse led to smaller snouts and abnormalities of nasal and frontal bones amongst other defects. Myc was the only gene observed to be differentially expressed in the vicinity of this deletion. We used the CL/P deletion 8–17 (chr15:62668548–63550550) from Uslu et al. to create a single-peak BED file and uploaded it together with the significantly differentially expressed genes from the re-processed RNA-sequencing data from the same strain compared against control (GEO dataset GSE52974) to test the AnnoMiner nearby genes annotation function (Supplementary Table S4). Indeed, only a single gene is significantly differentially expressed (pink box, Fig. 5a), which is Myc (Fig. 5c). In principle, this function can also be used to explore the expression dynamics within the gene neighbourhood of multiple peaks (see Supplementary Figure S6). However, large-scale long-range gene regulatory data of this type are very sparse and their interpretation remains too complex to be exhaustively analysed with a tool like AnnoMiner. In summary, AnnoMiner’s nearby genes annotation or long-range interaction functions help in an easy and quick way to identify deregulated genes in the neighbourhood of one genomic position.

Transcription factor & histone modification enrichment analysis

Transcription factor binding sites inferred from experimentally detected TF peaks in the genome can be used to predict TFs, which potentially co-regulate gene-sets. AnnoMiner’s TF & HM enrichment analysis function identifies enriched peaks in the promoter regions of a user-provided gene list, for instance co-regulated genes from a transcriptomic analysis. Any valid identifier is accepted, as AnnoMiner performs gene ID conversion on-the-fly using BioMart52. AnnoMiner considers a gene as a potential target, if its promoter overlaps with a TF peak. The user can either choose the promoter region (up- and downstream number of base-pairs from the TSS) or use the DynamicRanges calculated by AnnoMiner, which is based on the distribution of a TF binding event relative to the TSS and therefore specific for each TF (see “Methods”, not available for histone modifications). The results of the enrichment analysis are visualized as an interactive bar plot (for the first 10 hits, Fig. 6a), as well as an interactive table. In the table, all available TF ChIP-seq datasets in the AnnoMiner database for the species of interest are ranked according to their Combined Score. Along with this value, AnnoMiner also reports information about the experimental condition, cell line or developmental stage, contingency table values, p-value, enrichment score, FDR and the list of potential targets of the TF (Fig. 6b) in the downloadable version of the table.

Figure 6
figure 6

TF & HM enrichment analysis function of AnnoMiner. The output of a TF & HM enrichment analysis in AnnoMiner is first a plot of the top 10 enriched transcription factors (a). The identity of the enriched TF with its combined score is shown. Hovering over the bar of a TF will reveal meta-information associated with the ChIP-seq dataset (b). Second and at the same time, a table is returned, which contains the information on all TFs stored in the AnnoMiner database and their enrichments in the submitted gene list. The table is sortable and downloadable. Once downloaded, it includes information on the sample, the target gene for each TF, as well as the score, FDR and combined score. Target gene lists can be copied to the clipboard for further analysis. For demonstrating the TF enrichment analysis function of AnnoMiner, we took advantage of a study identifying direct daf-16/FoxO targets from a longevity study (53 and GEO dataset GSE72426). In this study, the authors looked for isoform-specific targets of daf-16. We combined the list of all DAF-16 isoform targets and submitted it to AnnoMiner for identifying enriched transcription factors. AnnoMiner displays the 10 most enriched TFs, with their combined score indicated in the plot (a) as well as the table of all TFs stored for C. elegans and including enrichment information for the submitted gene list. Daf-16 is the fourth hit; pqm-1, as well as elt-2, which both precede daf-16 in the hitlist, have been shown to bind to daf-16 response elements.

As a proof of principle for predicting transcriptional regulators we selected a differential expression dataset from daf-16/FoxO mutants in C. elegans53. When uploading the list of all DAF-16A/F targets provided in53 to AnnoMiner’s TF & HM enrichment analysis function, daf-16 was the 4th most significantly enriched transcription factor (Fig. 6a). Interestingly, pqm-1, which has been shown to bind to daf-16 response elements54, was found at 1st and 2nd position by AnnoMiner. Elt-2, a GATA-like transcription factor, which appeared as 3rd most significant hit, has been shown to bind to promoters of some daf-16-regulated genes and to be required for their regulation55. To conclude, AnnoMiner’s TF & HM enrichment function is a powerful tool for predicting relevant transcription factors co-regulating sets of genes with similar expression patterns.

Performance evaluation of AnnoMiner’s TF & HM enrichment analysis function

We wanted to compare the performance of AnnoMiner’s TF & HM enrichment analysis function with other web-tools for TF enrichment analysis. We followed in principle the evaluation protocol proposed by Keenan et al., which used PR-AUCs and ROC-AUCs calculated from the PPROC R-package for estimating performance43 (for details see also “Methods”). In brief, we took manually curated datasets provided by43 containing single TF perturbation experiments followed by RNA-seq from Gene Expression Omnibus56 (GEO). Gene expression data used for benchmarking were restricted to experiments targeting TFs for which AnnoMiner is storing at least one high quality TF ChIP-seq dataset for the human assembly GRCh38, resulting in a total of 75 datasets that we could use for benchmarking. We submitted the list of significantly differentially expressed genes between perturbed TF versus wild-type control (which we hereafter refer to signature gene-sets) from these experiments to perform enrichment analysis using different tools, including AnnoMiner. We used the rank of the perturbed TF in the resulting enrichments of its associated signature gene-set to calculate PR-AUCs and ROC-AUCs. We furthermore calculated the cumulative distribution function for the ranks of each TF across all the experiments it was perturbed in. Only if a TF ranks randomly, the distribution function will be uniform; we performed Anderson–Darling tests to detect deviation from uniformity. We then computed the percentage of perturbed TFs that were correctly ranked within the first percentile to ensure that TFs were ranking high.

We chose the following web-tools for comparison: ChEA3, TFEA.ChIP and EnrichR (Table 1). ChEA3 offers the user 2 different methods to rank predicted TFs (meanRank and topRank) and we evaluated both ranking methods. EnrichR offers different resources for enrichment analysis and we used the resources ARCHS4, ChEA 2016, ENCODE 2015, ENCODE and ChEA Consensus and TRRUST 2019 for enrichment analysis, respectively. The number of datasets included in the benchmarking set did depend on the resource tested (see “Methods”). The Anderson–Darling test returned significant results for all web-tools tested (ADtest in Table 1), except for EnrichR in combination with the ENCODE_and_ChEA_Consensus resource. This highlights the ability of all tools to rank the perturbed TF among the top candidates of the results. AnnoMiner outperformed TFEA.ChIP in all categories including percentage recovered TFs in the 1st percentile (7.0 vs 0.0), ROC AU (0.69 vs 0.63) and PR AUC (0.68 vs 0.60). EnrichR differed in performance depending on the resource used. On average, it outperformed AnnoMiner on the percent recovered TFs (9.5 vs 7.0), while AnnoMiner reached slightly higher values in ROC AUCs (0.69 vs 0.66) and PR AUCs (0.69 vs 0.68). ChEA3 performed similar for both ranking methods used and outperformed AnnoMiner in all categories. To summarize, though it cannot reach the performance of ChEA, AnnoMiner outperforms the other evaluated tools in identifying relevant TFs at a high rank the in gene-sets derived from TF perturbation studies. Other than ChEA, however, AnnoMiner is available for all four major model organisms, including the invertebrates Drosophila and C. elegans.

Table 1 Performance values of AnnoMiner compared to other TF enrichment web-tools.

Using AnnoMiner to identify important transcriptional regulators of Drosophila flight muscle morphogenesis and growth

The assess the performance of AnnoMiner further, we tested its TF & HM enrichment analysis function to predict unknown transcriptional regulators for a list of co-regulated genes. We chose a dataset quantifying gene expression dynamic during development of Drosophila melanogaster indirect flight muscles46, in which mRNA from indirect flight muscles had been isolated at several time-points correlating with key steps during muscle development (Fig. 7a). We focused on the development of the contractile apparatus called myofibrillogenesis and compared gene expression at 30 h after puparium formation (APF), when myofibrils assemble, with 72 h APF, when myofibrils have matured (Fig. 7a). This comparison revealed 2193 differentially expressed genes that were shown in the original study to be strongly enriched for genes relevant for myofibril and mitochondrial development. We submitted this differential gene list to AnnoMiner (Supplementary Table S5) and searched all the 514 modERN and modENCODE TF ChIP-seq datasets from Drosophila stored in AnnoMiner for a potential enrichment of peaks. This identified 42 unique TFs significantly enriched with an FDR < 0.05. (Supplementary Table S5). The top 10 enriched datasets included Deaf1, Trl, Hr78 and cwo (Fig. 7b). Hence, these are potential transcriptional regulators of flight muscle development.

Figure 7
figure 7

AnnoMiner identifies transcriptional regulators of flight muscle development. (a) Scheme of Drosophila indirect flight muscle development with different myofibrillogenesis phases highlighted in red. Comparing 30 h to 72 h APF revealed a list of 2193 differentially expressed genes (taken from GSE107247), which was uploaded to AnnoMiner. The TF enrichment analysis function of AnnoMiner identified 42 unique candidate TFs. (b) Output of the top-ten enriched TFs from AnnoMiner. Trl was identified from 2 different ChIP-seq datasets and is highlighted in red. (c) Phenotypic integration with muscle-specific RNAi knock-down data revealed candidate TFs with a potential function in muscle (flightless or lethal). Data on DEGs, phenotypic data from the RNAi screen, as well as annotated AnnoMiner enrichment results are available in Supplementary Table S5. (d) Trl has a function during flight muscle myofibrillogenesis. Flight and leg muscles of 90 h APF pupae were fixed and stained for actin (phalloidin in green) and titin homolog Sls (anti-Kettin in red). Note that muscle-specific knock-down of Trl using independent hairpins (Trl-IR-1 and IR-2) results in frayed flight muscle myofibrils compared to wild type control but rather normal leg muscle myofibrils. (e) CG14655 has a function in flight muscles. Flight muscles of young adult flies from wild type or CG14655-IR-1 or of 90 h APF pupae from CG14655-IR-2 were stained for actin (phalloidin). Note the prominent actin accumulations present in both CG14655 knock-down flight muscles.

To identify which of the 42 transcriptional regulators may have a function during flight muscle development we next integrated data from an RNAi-screen for muscle function57, which had assayed for viability, flight muscle performance and body locomotion after muscle specific knock-down of individual TFs. From the 42 TFs identified by AnnoMiner, knock-down of two TFs resulted in flightless animals (CG14655, cwo), and seven were scored as lethal during development (Trl, Hr78, lola, Vsx2, Pif1B, salr, Hr51) (Supplementary Table S5); 21 TFs did not show a phenotype in this assay and the remaining 13 had not been tested (Fig. 7c).

Trithorax-like and the uncharacterized Zinc-finger protein CG14655 are required for flight muscle morphogenesis

For experimental verification we selected two proteins, Trl (Trithorax-like) and an uncharacterized zinc-finger protein called CG14655. We used muscle-specific knock-down to investigate a putative function of both genes in muscle. For Trl knock-down we used 4 independent transgenic RNAi lines driven with muscle-specific Mef2-GAL4. Two of those resulted in pupal lethality and the other two resulted in viable but flightless flies, demonstrating a function of Trl in flight muscle (Supplementary Table S6). For morphological analysis we visualized the myofibrils of flight and leg muscles of mature 90 h APF pupae in wild type and three different Trl knock-down lines. We found that knock-down of Trl caused disordered and frayed myofibrils in flight muscles, whereas leg muscle myofibrils appeared normal (Fig. 7d, Supplementary Figure S7). This shows that Trl is required for normal myofibril development in flight muscle.

To investigate a role of CG14655 during muscle development we also applied muscle-specific knock-down with Mef2-GAL4 and three different RNAi lines, one of which resulted in viable but flightless animals and two other overlapping hairpins resulted in pupal lethality (Supplementary Table S6). Morphological analysis showed that CG14655 knock-down flight muscles displayed abnormal actin accumulations between their myofibrils, suggesting a role for CG14655 in myofibril development of flight muscle. Together, these findings demonstrate the predictive power of AnnoMiner to identify transcriptional regulators by combining chromatin binding and differential expression data.

AnnoMiner helps identify targets co-regulated by Sd and Yki during flight muscle growth in Drosophila

Strikingly, two of the enriched transcriptional regulators in the above comparative flight muscle development dataset were the transcriptional effector of the Hippo pathway in Drosophila called Yorkie (Yki) and its essential Tead co-factor Scalloped (Sd)58 (Supplementary Table S5). Recently, an essential function for the Hippo pathway promoting flight muscle growth by transcriptional up-regulation of mRNAs coding for sarcomeric proteins, which built the myofibrils, was identified47. We wanted to know whether we could identify direct transcriptional targets of Yki/Sd during flight muscle growth. To this end, we integrated mRNA BRB-seq data from developing yki knock-down flight muscle (yki-IR), as well as from flight muscle expressing a constitutive active form of yki (yki-CA) compared to wild type controls (GEO accession GSE158957) with ChIP-seq data from Yki (modENCODE dataset ENCSR422OTX) and Sd (modERN dataset ENCSR591PRH) obtained in fly embryos using AnnoMiner’s peak integration function.

Both proteins showed prominent base pair coverage of the TSS regions of their target genes (Fig. 8a). We selected these peaks to retrieve associated genes and then integrated the BRB-seq data for 24 h APF yki knockdown (yki-IR 24 h), as well as 24 h and 32 h APF constitutively active yki, respectively (yki-CA 24 h, yki-CA 32 h; Supplementary Table S7). Upon knock-down of yki, already at 32 h APF, a severe myofibril assembly defect had been observed47. Interestingly, AnnoMiner identified Yki and Sd binding sites in the TSS of two genes essential for muscle function and development, which were downregulated in yki knock-down muscles at 24 h APF. These genes code for the sarcomeric proteins Tropomyosin 1 (Tm1) and the Nesprin-family protein Muscle-specific protein 300 kDa (Msp300), which is important to link the myofibrils to the nuclei59 (Fig. 8b). The gain-of-function yki phenotype (yki-CA) is characterized by premature expression of sarcomeric proteins resulting in muscle fiber hyper-compaction47. At 24 h APF scalloped (sd) itself is the only direct target gene of the Yki/Sd complex which is differentially expressed in yki-CA (Fig. 8c). At 32 h AFP, AnnoMiner identified 177 unique genes and in total 541 transcripts as potential direct Yki/Sd targets. Using GO term enrichment analysis by modEnrichR60 we found many processes and cellular compartments related to muscle development and function among the top enriched terms in these potential direct Yki/Sd targets (Fig. 8d, Supplementary Table S7). To conclude, using AnnoMiner’s peak integration function, we identified putative direct targets of the Yki/Sd transcriptional complex that showed differential expression upon yki knock-down or yki constitutive activation. Many of these genes are likely important for flight muscle morphogenesis.

Figure 8
figure 8

The Yorkie and Scalloped transcriptional complex regulates muscle-specific gene expression during flight muscle growth. (a) Using AnnoMiner’s peak integration function with ChIP-seq data from Yorkie (modENCODE ENCFF708VEF) and Scalloped (modERN ENCFF664CDO) we found high base-pair coverage of both TFs in the TSS (− 2000 to + 500 bp of the Transcription Start Site) of putative target genes. We selected these peaks from both TFs and integrated them with flight muscle specific BRB-seq data (47 and GEO dataset GSE158957) isolated from a yorkie loss-of-function muscles at 24 h APF (b) or yorkie gain-of-function muscles at 24 h (c) and 32 h APF (d). The direct Yki/Sd targets Tm1 and Msp300 were downregulated in yorkie loss-of-function at 24 h APF. Only scalloped was identified as upregulated direct target from 24 h yorkie gain of function muscles. At 32 h, 177 genes were identified as potential direct Sd/Yki targets, many of which have muscle-related functions (see also Supplementary Table S7).

Discussion

Here, we introduced AnnoMiner, a web-based, flexible and user-friendly platform for genomic peak annotation and integration, as well as transcription factor enrichment analysis. We illustrated AnnoMiner’s peak annotation and integration, as well as the nearby genes annotation functions with specific examples. We confirmed the predictive power of AnnoMiner’s TF enrichment function experimentally by identifying important regulators of Drosophila indirect flight muscle development. This was achieved searching for overrepresented TF peaks in promoters of genes differentially regulated during myofibrillogenesis using AnnoMiner.

AnnoMiner distinguishes itself from other peak annotation, as well as TF enrichment tools. AnnoMiner’s peak annotation and peak integration outputs first a bar plot that shows the overlap of a peak with different gene feature attributes, including up- and downstream regions, the TSS, 5′ and 3′ UTRs and the gene body. This has two advantages: first, the user can visualize the distribution of peaks of the uploaded file with respect to all relevant attributes of annotated gene features in the genome. Second, AnnoMiner allows to interactively choose the target region(s) for which the associated genes are returned. While other tools provide statistics on the peak distribution relative to gene feature attributes (e.g.28) in the output, to our knowledge, AnnoMiner is the only software that allows to easily retrieve specific gene-sets depending on the distribution of the peak coverage over gene feature attributes. The peak integration function offers the same flexibility. Moreover, both functions allow to directly integrate differential gene expression or other numerical data associated to genes with the genomic peak files. The nearby genes annotation and long-range interaction functions, which integrate expression data with peak data, is novel and for the first time, users can in a web-based manner visualize and retrieve genes that are not the nearest neighbours of a peak. It could be useful to integrate genomic, non-coding variants causative for human genetic diseases with disease-associated gene expression data and explore transcriptional activity within TAD domains. AnnoMiner’s TF enrichment analysis function offers to treat promoter regions dynamically for each specific TF with its DynamicRanges function. Finally, AnnoMiner is independent of the genomic assembly of the source data, as it on-the-fly translates submitted IDs and uses the ID compatible with the database chosen for gene centred peak annotation, as well as TF enrichment.

We compared AnnoMiner’s TF enrichment analysis function to the best-performing software in the field, which includes ChEA3, TFEA.ChIP and EnrichR. AnnoMiner could not reach the accuracy of ChEA3. One of the reasons could be that the dataset of TF—gene association used by ChEA3 supersedes data we retrieve from ENCODE, modENCODE and modERN, as it includes several additional datasets which are at least partially manually curated or generated. This hypothesis is supported by the fact that EnrichR shows differing performance when using different source data, showing lower performance when using the ENCODE data alone compared to the ones from ChEA3. One possible solution could be to add curated data to AnnoMiner’s TF enrichment analysis function, for instance from ChIP-Atlas61 or ReMap62. The disadvantage however is the higher cost in curation, as well as the fact that manually curated datasets are typically not available for model organisms such as Drosophila or C. elegans, but are rather restricted to human or mouse, as is the ChEA3 tool. Yet, users can easily add their own data using the scripts provided to fill the mongoDB present in our gitlab repository, when running AnnoMiner locally (https://gitlab.com/habermann_lab/AnnoMiner/-/tree/master/scripts).

Finally, we predicted and verified potential transcriptional regulators of muscle and myofibril morphogenesis as well as muscle growth using AnnoMiner’s TF & HM enrichment analysis function. Amongst those is Trl, a GAGA transcription factor which contains a BTB/POZ domain, as well as a C2H2 zinc-finger that binds to DNA in a sequence-specific manner. Previous studies suggest that Trl is required to keep promoters nucleosome-free, thus allowing Pol-II-access63. We showed here that muscle-specific knock-down of Trl using four independent hairpins either leads to pupal lethality or flightlessness. Consistently, we find severely perturbed indirect flight muscles upon Trl knock-down whereas leg muscles appear largely normal. This indicates a preferential function of Trl in flight muscle, however as two hairpins result in pupal lethality, a role of Trl in other body muscles is also likely.

A second potential direct transcriptional regulator identified by AnnoMiner is the uncharacterized Zinc-finger transcription factor CG14655. Muscle-specific knock-down of CG14655 either results in pupal lethality or flightlessness and causes abnormal accumulations of actin in flight muscles. This again suggests that CG14655 is important for normal myofibril development in flight muscle.

Lastly, we made use of two other transcriptional regulators, Yorkie and Scalloped, which on DNA act in a complex58, to identify its direct targets. AnnoMiner identified two direct targets of Yki, which upon loss of yki were downregulated. Both code for important muscle structure proteins, constituents of the sarcomere or linking the sarcomere to the nucleus, and hence could contribute to the severe phenotype observed upon yki knock-down. Gain-of-function of yorkie results in muscle fiber hyper-compaction and premature expression of sarcomeric protein components47. Consistently, AnnoMiner identified a number of direct Yki targets with functions related to muscle development and growth. This substantiates a role for Scalloped and its transcriptional co-factor Yorkie during flight muscle growth.

To conclude, the new web-tool AnnoMiner is a user-friendly, intuitive, interactive and highly-flexible platform for genomic peak annotation and peak integration. It is suitable for identification of nearby genes or long-range interactions of a genomic peak, as well as to perform Transcription Factor and Histone Modification enrichment analysis for a list of genes. This manuscript details all AnnoMiner functions and shows its usefulness for annotating and integrating peaks from two different ChIP-seq experiments together with transcriptomics data. Finally, AnnoMiner helped identify several key regulators of indirect flight muscle development and growth in Drosophila, some of which were confirmed experimentally.

Methods

The AnnoMiner software and database

AnnoMiner is a modular software consisting of a library of java classes for retrieval, storage and analysis of annotated genomic peak data together with a web application providing a graphical user interface for a number of predefined analyses.

The AnnoMiner web-sever is a JavaEE web application implemented in the front-end as a single page application using Javascript, jQuery and Bootstrap 4. Queries for the provided analyses are processed by individual servlets that send a response back to the front-end in JSON format (Supplementary Figure S1).

The java library used in the back-end manages the retrieval of data from remote sources at UCSC and modENCODE36 or alternatively from files in BED or gff format. Retrieved datasets are stored in a MongoDB database. This document-oriented NoSQL database system has been chosen for its flexible data model and runtime speed benefits benchmarked against an SQL database solution. For optimal performance, a custom database connectivity layer has been developed based on the MongoDB java driver. The overall structure of the application is shown in Fig. 1 and Supplementary Figure S1.

AnnoMiner’s database currently holds genomic data from ENCODE33, modENCODE36 and modERN37. For each model organism, the latest genome assembly is stored. For human, mouse and Drosophila, we also provide the second latest release.

Following the Findability, Accessibility, Interoperability and Reusability (FAIR) principle64, documentation and source code of the tool are available on GitLab: https://gitlab.com/habermann_lab/annominer. Using the java library classes from the code repository developers will be able to define custom analyses on annotated genomic ranges datasets and extend the database towards new data sources. The repository also provides executable java classes and python scripts for local maintenance such as to populate a local database with ENCODE, modENCODE or modERN data or to reproduce our benchmark analysis.

Genomic peak annotation functions

Peak annotation

For calculating the overlap between peaks and genes, we compare the coordinates of peaks with those of stored gene feature attributes. These include 5′ flanking regions, promoter region, transcription start site (TSS), 5′UTR, coding sequence (CDS, including exons and introns), 3′UTR, 3′ flanking regions, as well as strandedness of the gene feature. For integrating peaks with gene features, we first calculate the coverage of each peak by testing whether the peak coordinates fall within the coordinates of any attribute of a gene feature. If all isoforms of a gene are chosen, the overlap of the peak with the attributes of each individual isoform will be checked. For calculating the coverage, overlap events are calculated cumulatively and the sums of all overlaps will be divided by the length of the attribute. Depending on which attribute (target region) the user has chosen, only genes whose chosen attribute overlaps (with a minimum % or number of base pairs) with the peak will be displayed in the data table.

Peak integration

The peak integration function is based on the same algorithm as the peak annotation. However, only genes, which meet the overlap criteria of the user-selected target regions with peaks from all provided BED files will be displayed in the resulting data table.

Nearby genes annotation

For the nearby genes annotation function, the 5 neighbouring genes up- and down-stream, as well as the overlapping gene of a peak are extracted and displayed to the user. The user can choose whether the directionality of the gene with respect to the peak should be considered. If a gene overlaps with the peak, it will be shown in the Overlap box. Data on differential expression, which needs to be provided by the user in form of an annotation file, is displayed on the returned plot. The user can choose whether or not the directionality of the deregulation (up- or downregulated) is shown in the box plot. Depending on the user input, the nearby genes annotation function returns neighbouring genes of a single peak, or of many peaks.

Long-range interaction

With the long-range interactions function, the user is able to extract and visualize all the genes up- and down-stream the user provided peak (or peaks), which lie within an user-defined genomic window (default 50 kb upstream and 50 kb downstream). The user can choose whether the directionality of the gene with respect to the peak should be considered. Data on differential expression, which needs to be provided by the user in form of an annotation file, is displayed on the returned plot. The user can choose whether or not the directionality of the deregulation (up- or downregulation) is shown in the box plot. In the first case, hovering on the box of interest, the user can extract information regarding the overall number of deregulated genes (coverage). In the second case, the user will extract separate information about the number of up- and down-regulated genes. Depending on the user input, the long-range interactions function returns neighbouring genes of a single peak, or of many peaks within the user-defined genomic window.

Transcription factor enrichment analysis

Transcription factor enrichment analysis in AnnoMiner uses publicly available TF ChIP-seq data from the ENCODE, modENCODE and modERN databases. In order to keep only high-quality data, we undertook a strict selection of available TF ChIP-seq datasets. First, we kept only experiments, for which data had already been pre-analysed following the ENCODE processing pipeline. Second, only experiments with replicates and optimal IDR peaks files (narrowPeak) were retained. Third, we manually inspected each narrowPeak file, evaluating the peak distribution around the gene’s TSS and discarded the ones with irregular peak distributions in relation to promoter regions of genes, as we considered those as outliers (Supplementary Figure S8c,d; all outliers csv files that were excluded are available in our gitlab repository at https://gitlab.com/habermann_lab/AnnoMiner-paper/-/tree/master/ChIP-seq_experiment_excluded). The number of used TF ChIP-seq datasets for each of the model organisms and assemblies is available in Supplementary Tables S8, S9.

Over-representation analysis

For each TF ChIP-seq dataset stored in the AnnoMiner database, we first compute the contingency table of potential targets in both the uploaded gene list (signature gene-set) and in the background list of all genes of the chosen organism. We then apply a one-tailed hypergeometric test followed by Benjamini–Hochberg correction65. Furthermore, we calculate the enrichment score as follows:

$$score = \left( {listHits \div listSize} \right)/\left( {genomeHits \div genomeSize} \right)$$

And a combined score based on the p-value and enrichment score according to66:

$$combined \, score = score \, * \, - log10\left( {p {\text{-}} value} \right).$$

Promoter definition

AnnoMiner allows the user to fully customize the promoter region considered for TF enrichment analysis. The user can choose both the upstream and downstream borders to consider without constraints.

DynamicRanges option

The user can also choose dynamic ranges that we pre-calculated for each individual TF based on available ChIP-seq data stored in AnnoMiner. In brief, for each TF ChIP-seq experiment we computed the distances between its peaks and the genes in a range of 20,000 bp upstream of their TSSs. We then binned the closest genes by distances with a resolution of 50 bp. A noise level was computed, representing non-significant bindings, from 5000 to 20,000 bp, as follows:

$$noise= median + 3std$$

Next, we smoothened the binning using the simple moving average approach (SMA) with a sliding window of size = 3. Intersecting the noise threshold with the SMA curve, we obtained the upstream TSS value. This value is unique for each TF and represents the distance range in which the TF is more often binding than is expected at random (see Supplementary Figure S8e,f).

Choosing the peak-promoter overlap

The user can define the amount of overlap between a TF and a promoter in percentage or in base pairs (bp) to categorize a gene as a potential target.

Histone modification enrichment analysis

The same function that is used for the Transcription factor enrichment analysis can be used for enriching histone modifications for a gene list. In this case, a DynamicRanges-like option is not meaningful. Thus, the user has to manually define the promoter region (default: 2000 bp upstream and 500 bp downstream of a gene’s TSS).

Automatic ID conversion

In order to easily allow comparisons between genome versions (ENSEMBL, RefSeq, UCSC, Genecode, Flybase and Wormbase) and not restrict the user to a specific set of gene identifiers, we developed a geneID converter. This function converts on the fly the user’s input gene IDs to match the selected resource, without any intervention from the user. The conversion is performed using the latest release of BioMart52. BioMart data were downloaded and are stored locally for reasons of speed and are regularly updated. Users can download the ID conversion table to control for genes that were dropped due to the automatic ID conversion process.

Benchmarking the TF enrichment function of AnnoMiner

Benchmarking dataset

Benchmarking data-sets for testing AnnoMiner’s enrichment function in the form of TF regulons were obtained from human single TF perturbation experiments (knock-downs, knockouts, over-expression and chemical inhibition) followed by RNA-seq from Gene Expression Omnibus (GEO56, 160 experiments), manually curated and made available from Keenan et al.43. We retained only experiments targeting TFs for which AnnoMiner is storing at least one high quality TF ChIP-seq dataset for the assembly GRCh38 (hg38), leading to a total of 75 RNA-seq experiments left for benchmarking (benchmarkingSet).

Benchmarking metrics

Precision-Recall (PR) and Receiver-Operating Characteristic (ROC) areas under the curves (AUC) for the benchmarkingSet were computed using the R package PRROC67, following the benchmarking procedure proposed and applied by43 and68. In brief, for each TF, signature gene-sets were derived based on differentially expressed genes of the perturbed TF versus its control and used for TF enrichment analysis. For each dataset, the rank of the perturbed TF was assigned to the positive (foreground (fg)) class, while all other ranks of TFs retrieved for the dataset were assigned to the negative (background (bg)) class; using the PPROC package, precision-recall (PR) and Receiver-Operator Curves (ROC) using continuous interpolation were computed and areas under curve (AUCs) for both were calculated (PR-AUC and ROC-AUC). As the negative bg class greatly exceeded the positive fg class, we down-sampled the negative class to the same size as the positive class. We repeated this procedure 5000 times. We used the approx R function to linearly interpolate between all points from the 5000 ROC and the 5000 PR curves and calculated the mean ROC and PR AUCs values of all bootstrap runs. Furthermore, we analysed the cumulative distribution of the rank values for each TF across all experiments it was perturbed in, which is not expected to be uniform, unless the TF would rank randomly in all its associated datasets. We then performed the Anderson–Darling test using the goftest R package to test for significant deviations from this distribution, indicating a non-random distribution of the TF. Finally, we computed the percentage of perturbed TFs that were correctly ranked within the first percentile to ensure high ranking of the TF. All values are available in Table 1.

Benchmarking against existing tools

We compared the performance of AnnoMiner’s TF enrichment function with the performance of the following web-servers: TFEA.ChIP42, ChEA343 and EnrichR44. The same dataset (benchmarkingSet) was used for all web-tools tested. Only datasets which could be effectively ranked by the web-tool under study were used. We used 57 signature gene-sets to test TFEA.ChIP and all the 75 signature-gene sets to test ChEA3. For EnrichR we evaluated the performance metrics on 5 different resources: ARCHS4, ChEA 2016, ENCODE 2015, ENCODE and ChEA Consensus and TRRUST 2019 using 74, 44, 44, 29 and 51 signature gene-sets, respectively.

Epigenetic and RNA-seq data used

For analysing STAT3 in acute B-cell lymphomas, we used data from super-series GSE5072448. Only peaks which were significantly upregulated in ABC type B-cell lymphoma were selected for AnnoMiner’s peak integration (FDR cut-off <  = 0.05; FC >  = 1.25). We used H3K4me3 data from an ABC cell line stored in the ENCODE database (GSE86718_ENCFF763KFL, replicated peaks BED file) for peak integration with significantly upregulated STAT3 peaks. Human genome version hg19 was used for analysis. RNA-seq data (GSE50721) from the same super-series was used for integration with STAT3 and H3K4me3 peaks, whereby we used statistical results provided by the authors of the study. For demonstrating the nearby genes annotation function, we used data from51 and GEO dataset GSE52974, describing long-range enhancer elements regulating Myc expression for normal facial morphogenesis, mapped to mouse genome version mm9. For demonstration of the TF & HM enrichment function we used RNA-seq data from a longevity study focusing on daf-16/FoxO mutants in C. elegans (53 and GEO dataset GSE72426) and C. elegans genome version ce11/WBcel245.

Analysed RNA-seq data from developing indirect flight muscle (IFM) in Drosophila melanogaster were taken from GSE10724746. We used the DESeq2 comparisons between time-points 72 h and 30 h after puparium formation (APF) provided by the authors. Genes with log2FC in absolute value greater than 2 and FDR smaller than 0.01 were considered as differentially expressed. For integration with phenotypic data, we made use of data published on muscle morphogenesis and function in Drosophila57 and flight muscle growth in Drosophila47.

For integrating gene expression changes of yki knocked-down flight muscles and flight muscles expressing a constitutively active yki compared to wild type, with Yki and Sd protein genome occupancy data, we used BRB-seq data from yki knock-down and yki constitutively active fly strains47 available from GEO (GSE158957). Optimal IDR threshold narrow peak files for Yorkie (Yki, ENCSR422OTX) and Scalloped (Sd, ENCSR591PRH) ChIP-seq were downloaded from the ENCODE/modERN resource, both for Drosophila genome version dm6.

Drosophila experimental methods

Drosophila strains were grown under standard conditions at 27 °C to enhance GAL4 activity. All RNAi knock-downs were induced with Mef2-GAL4, a GAL4 line specifically expressed during development of all muscle types of the fly57. UAS RNAi lines used were from the Vienna69 and Harvard collections70 and ordered from VDRC or Bloomington stock centres. See Supplementary Table S6 for all genotypes used. The flight test was done as described in57.

Indirect flight muscle morphology was analysed in 3 to 5 days adult males or in 90 h APF pupae as previously published71. Briefly, 90 h APF pupae or adult flies were fixed with 4% paraformaldehyde in PBT (PBS, 0.5% triton X-100) and cut into half-thoraces using a sharp microtome blade. The half-thoraces were blocked with 3% normal goat serum in PBT for 30 min and F-actin in the flight and leg muscles was visualised with phalloidin (coupled to Alexa488 or rhodamine, Molecular Probes, 1:1000 in PBT for 2 h or overnight) and Sls was stained with anti-Kettin antibody MAC155 (Babraham Institute, 1:100 in PBT overnight). The stained half-thoraces were imbedded in Vectashield (Biozol) and imaged on an LSM780 or LSM880 confocal microscope. Images were processed using Fiji72.