Gene expression regulation requires precise transcriptional programs, led by transcription factors in combination with epigenetic events. Recent advances in epigenomic and transcriptomic techniques provided insight into different gene regulation mechanisms. However, to date it remains challenging to understand how combinations of transcription factors together with epigenetic events control cell-type specific gene expression. We have developed the AnnoMiner web-server, an innovative and flexible tool to annotate and integrate epigenetic, and transcription factor occupancy data. First, AnnoMiner annotates user-provided peaks with gene features. Second, AnnoMiner can integrate genome binding data from two different transcriptional regulators together with gene features. Third, AnnoMiner offers to explore the transcriptional deregulation of genes nearby, or within a specified genomic region surrounding a user-provided peak. AnnoMiner’s fourth function performs transcription factor or histone modification enrichment analysis for user-provided gene lists by utilizing hundreds of public, high-quality datasets from ENCODE for the model organisms human, mouse, Drosophila and C. elegans. Thus, AnnoMiner can predict transcriptional regulators for a studied process without the strict need for chromatin data from the same process. We compared AnnoMiner to existing tools and experimentally validated several transcriptional regulators predicted by AnnoMiner to indeed contribute to muscle morphogenesis in Drosophila. AnnoMiner is freely available at http://chimborazo.ibdm.univ-mrs.fr/AnnoMiner/.
Transcriptional regulation is a highly complex process involving a combination of various molecular players and biochemical mechanisms, such as transcription factors (TFs), histone modifying enzymes, DNA methylases, as well as a structural reorganization of chromatin. Technical advances in analysing the interaction of proteins with DNA (ChIP-seq), to detect open or closed chromatin states (e.g. ATAC-seq, DNase-seq, FAIRE-seq), to detect hypermethylated CpG islands (bisulfite sequencing), or to map higher order chromosomal structural organisation (ChiaPET, Hi-C, 3C-seq) have revolutionized and significantly advanced our understanding of transcriptional regulation during the last decade1,2. Among other things, transcriptional enhancers were identified as crucial for regulating spatio-temporal gene expression programs by interacting with target gene promoters, often across large genomic distances (see3,4,5 and references therein). Furthermore, the genome sequence in the chromatin is not a simple linear thread but organized in 3D, forming compartments, topologically associated domains (TADs) or chromatin loops that can bring distant elements in proximity, all of which can contribute to transcriptional regulation (3,6 and references therein). More recent evidence suggests even the presence of dual-action cis-regulatory modules (CRMs) that act as promoters, as well as distal enhancers7,8. All these findings suggest that transcriptional regulation is far more complex than initially anticipated and involves a collective effort of specific binding sites in the genome, a complex genome structure and the presence of various transcriptional regulators. Hence, we need a tool that can ideally integrate all this information to better understand and predict transcriptional regulation.
Techniques such as ChIP-seq, ATAC-seq, Hi-C seq and others involve an NGS (Next Generation Sequencing) step, resulting typically in paired-end reads of isolated chromosomal fragments. Therefore, the first steps after sequencing consists of read mapping, which is usually done using a software such as BowTie29, followed by peak calling. There are a number of tools available for peak calling (reviewed e.g. in10,11,12,13), which include MACS214 for standard peak calling, ChIPdiff15, EpiCenter16 or diffReps17 for differential peak calling. The output of peak callers are the genomic coordinates of epigenetic marks or transcription factor binding sites under study. These coordinates are commonly stored in the Browser Extensible Data (BED) file format, a light-weight, standardized format to share genome coordinates.
The next step to biologically interpret genomic coordinates (also called peaks) is their genomic annotation, a process referred to as peak annotation or gene assignment. A number of peak annotation tools exist. Some of them combine ChIP-seq data analysis (including peak calling) and peak annotation, such as the ChIP-Seq tools and web server (18, web-based), Sole-Search19, CIPHER20, Nebula (21, web-based), PeakAnalyser22, BEDTools23 or HOMER24. Some of them are specific to peak annotation and visualization, such as ChIPseeker25, UROPA26, annoPeak (27, web-based), ChIPseek (28, web-based), PAVIS (29, web-based), Goldmine30, GREAT (31, web-based), or ChIPpeakAnno32. While most of the peak annotators assign peaks to the closest TSS (Transcription start site) or genome feature, others consider up- and down-stream gene features, or provide the overlap with gene feature attributes (such as promoter, 5′UTR, 3′UTR, exon, intron) of the nearest gene18,21,22,25,26,29. The web-tool GREAT31 offers peak annotation and gene assignment in larger genomic regions based on Gene Ontology (GO)-term similarity. What appears missing is a web-based, visual tool that helps experimental biologists to explore and integrate peak data with transcriptomics data in a flexible, user-centred and easy way. AnnoMiner is designed to close this gap.
An enormous community effort has been invoked in the past decade to collect, standardize and present annotated genomic data in form of the ENCODE33,34,35, modENCODE36 and modERN37 resources, with the aim to make sense of the encyclopaedias of genomes. These initiatives have also made it possible to explore available genomic data further and integrate them with each other as well as with user-generated data. These standardized, high-quality data can be used to identify enriched transcription factor binding events in promoters of co-regulated genes, for example from a differential gene expression dataset (e.g. an RNA-seq dataset). This form of data integration predicts possible transcriptional regulators for biological processes under study, and is already widely used in the community. Enrichment analysis is commonly performed by testing for TF overrepresentation in the promoter regions of a user-provided gene list compared to a background list (e.g. considering the entire genome). Most available tools define the promoter region rigidly as a range of upstream and downstream base pairs from the annotated gene transcription start site (TSS) and these parameters are kept fixed for all the TFs tested, without accounting for the differences between individual transcription factors. Following this approach, divers web-based tools have been developed, including VIPER38, DoRothEA39, BART40, oPOSSUM41, TFEA.ChIP42, ChEA343, EnrichR44 or i-cisTarget45. However, it would be advantageous to have a web-based, flexible and user-friendly solution for TF enrichment, considering promoter boundaries specific to each TF and working for the most widely used species (human, mouse, Drosophila and C. elegans).
Here we present AnnoMiner, a flexible, web-based, and user-friendly platform for peak annotation and integration, as well as TF and HM (histone modification) enrichment analysis. AnnoMiner allows users to annotate and integrate multiple genomic regions files with gene feature attributes and with transcriptomic data in an interactive and flexible way. AnnoMiner contains three distinct functions with different genomic peak annotation purposes in mind: first, peak annotation, which can be used to assign attributes of gene features to peaks, including user-defined upstream and downstream regions, TSS, 5′ and 3′UTRs, as well as the coding region; second, peak integration to search for overlapping binding events of up to five transcriptional regulators (e.g. different TMs; different HMs; or combinations of both) with gene feature attributes; these two annotation functions can optionally integrate user provided data, such as results from differential gene expression analysis, allowing to inspect for instance expression data and genomic peak data from multiple transcriptional regulators together. And third, nearby genes annotation or long-range interactions, which helps to identify long range interaction effects on gene expression of a single genomic region. The function for nearby genes annotation and long-range interactions requires the upload of data from differential expression analysis. As a fourth function, AnnoMiner performs TF (Transcription factor) and HM (Histone Modification) enrichment analysis, using all high-quality filtered TF and HM data from ENCODE, modENCODE and modERN, which are stored in an internal database, and a user defined gene lists as input. AnnoMiner’s TF enrichment function offers the DynamicRanges option, which uses promoter regions specific to each TF based on pre-calculated binding densities for each individual TF. We tested the predictive power of AnnoMiner’s TF enrichment function and applied it to flight muscle development in Drosophila focusing on the process of myofibril morphogenesis46. AnnoMiner predicted two potential transcriptional regulators, Trithorax-like (Trl) and the uncharacterized zinc-finger protein CG14655 to play a role during this process. For both, we provide experimental evidence for an essential function in myofibril morphogenesis in flight muscle. Thus, AnnoMiner correctly predicted a new role for Trl, as well as CG14655 as transcriptional regulators in a specific muscle type in Drosophila. Finally, we used AnnoMiner to predict direct transcriptional targets for the likewise enriched TFs Yorkie (Yki) and Scalloped (Sd) required for flight muscle growth47.
The AnnoMiner web server
The AnnoMiner web server is a new tool for the convenient annotation and integration of epigenetic, and transcription factor occupancy data for the wet-lab researcher. Based on an underlying library of java classes developed for the handling of annotated genomic peak or ranges datasets, the interactive graphical user interface of the web application provides functionalities for the upload of datasets, choice of analysis mode and model organism, visualization and download of analysis results (see Fig. 1 for a schematic of AnnoMiner functions and Supplementary Figure S1 for implementation details).
Genomic peak annotation functions
AnnoMiner’s three annotation functions (peak annotation, peak integration and nearby genes annotation, Fig. 1) take as input one or more files containing genomic coordinates (in BED format), for example from ChIP-seq, with the aim of finding their associations with annotated gene features. This is done by determining the overlap between the genomic coordinates of a peak and the attributes of annotated gene features. The user can set the following parameters for the search: the minimum required overlap among the gene feature attributes and peak features (in bp or %); whether only the longest isoform of a gene or all its isoforms will be considered; the gene’s directionality; the organism and the genome resource. Optionally, a user-provided annotation file can be uploaded, containing for instance differential expression data which will be integrated with the gene lists generated from AnnoMiner annotation. The user-provided dataset is accepted in csv (comma separated values) or tsv (tab separated values) format. It can contain up to 6 columns, without any constraints in content, except the first column has to contain gene IDs to allow integration with AnnoMiner results.
The peak annotation function computes the total coverage of the user-provided genomic regions (representing the peaks) with the attributes of each annotated gene feature in the genome assembly. AnnoMiner considers already annotated attributes of gene features (5′UTR, CDS and 3′UTR) as well as attributes or gene features provided by the user; in particular the promoter region is fully customizable with respect to the upstream and downstream region of the annotated TSS, as is the 5′ flanking region upstream and 3′ flanking region downstream of the gene body (Fig. 2a). The first result shown by AnnoMiner is a coverage plot, visualizing the total base pair coverage of all peaks with the annotated attributes of the gene features (Fig. 3a). The user next chooses a target region (corresponding to a gene feature attribute) by clicking on one of the bars and on the ‘Show Genes!’ button (Fig. 3b). An interactive, sortable and downloadable table of all genes which overlap with the selected target region with peaks from the BED file is returned to the user (Fig. 3c). If the user provides a gene-based, custom annotation file, for instance containing differential expression data, these data will be integrated and displayed in the resulting table (Fig. 3d). While an annotation file will in most cases contain differential expression data based on RNA-seq analysis, it can contain any numerical or even text data. In summary, peak annotation in AnnoMiner annotates genomic regions provided by the user with gene features and their attributes in a flexible and user-centred way.
While we here demonstrate the usability of AnnoMiner’s peak annotation function with a TF ChIP-seq dataset, which usually comprises narrow peaks in the vicinity or within gene promoters, it can be also used for any type of genomic peak file. In Supplementary Figure S2 (together with Supplementary Table S1), we show peak annotation for activating, as well as repressing histone modifications during early Drosophila development.
The peak integration function performs the peak annotation analysis, but for up to five genomic regions files (representing peaks from independent TFs, TFs together with HMs or independent HMs), allowing the user to integrate peaks from different transcriptional regulators and identify gene features and their attributes that overlap with them (Fig. 2b). The same algorithm as in peak annotation is used for the annotation of each individual peak file. With this function, genes co-regulated by TFs can be identified, or TF datasets can be integrated with chromatin structure data defined by histone modifications, or other epigenetic information derived by other experimental techniques. The coverage plots of all chosen genomic region files are returned by AnnoMiner (Fig. 4a), and the user chooses a target region for all and clicks on the ‘Show Genes!’ button (Fig. 4b). An interactive, sortable and downloadable table is returned with all genes that have a peak of the transcriptional regulators or epigenetic marks in their selected target regions (Fig. 4c). Custom gene annotation, such as differential expression data can again be provided, which is then integrated and displayed in the results table.
To show that AnnoMiner’s peak integration function can be used to integrate three different datasets in one analysis step, we re-analysed a data series published on STAT3 function in different forms of diffuse large B-cell lymphomas (DLBCL, GEO super-series GSE5072448). Two subtypes of DLBCL are known, germinal centre B-cell-like (GCB) and activated B-cell-like (ABC). The ABC type responds only poorly to available therapies and can be often associated with an overexpression of STAT348. The authors had compared STAT3 binding by ChIP-seq analysis between 8 patient-derived cell lines from GCB- and ABC type. They had performed RNA-seq analysis of the same cell lines to retrieve differentially expressed genes between the two subtypes. We made use of these data to identify genes with increased expression levels in the ABC type together with increased STAT3 binding events. We only considered STAT3 peaks that were significantly upregulated (FDR 0.05, fold change 1.25) in ABC-type DLBCL. To demonstrate the added value of Annominer’s peak integration function, we used H3K4me3 data from ENCODE from one of the ABC cell lines, OCI-Ly3 (accession: ENCFF763KFL), to limit the search to active promoters. Both peak files showed highest coverage with the direct promoter region of associated genes (Fig. 4a,b). We selected the major peaks for further analysis and integrated resulting peaks with differentially expressed genes from the same study. Of the upregulated unique genes, 42 contained an upregulated STAT3 peak in their promoter, as well as a H3K4me3 histone modification (Supplementary Figure S3a, Supplementary Table S2). We submitted the list of 42 genes to the EnrichR web-server44 and could identify terms strongly related to cancer, diffuse large B-cell lymphomas of the ABC type, IL10/STAT3 signalling and other relevant terms for the disease under study (Supplementary Figure S3b,c; Supplementary Table S2). Moreover, AnnoMiner identified 28 additional direct targets of STAT3 compared to the ones already described in the original study (Supplementary Table S2). To summarize, AnnoMiner’s peak integration function was able to identify genes directly associated with ABC-type diffuse large B-cell lymphomas and potential direct targets for the transcription factor STAT3 by using a single user interface and a single AnnoMiner analysis step.
To demonstrate the usability of AnnoMiner’s peak integration function further, we used four different datasets, as well as different types of genomic data. The results are displayed in Supplemental Material: first, (Supplementary Figure S4, Supplementary Table S1), we integrated H3K4me3 peak data during four stages of Drosophila MZT (maternal-to-zygotic transition) and identified its associated genes throughout MZT, which are thus constitutively transcribed from early to late MZT49. Second, we integrated ATAC-seq data with ChIP-seq data of an early embryonic TF and genome modifier, GAF/Trl, to identify genes activated by Trl in later stages of MZT50 (Supplementary Figure S5, Supplementary Table S3).
Nearby genes annotation and long-range interactions
The nearby genes annotation function allows the user to visualize the differential regulation and retrieve the overlapping, as well as the closest 5 gene features up- and downstream of an individual peak (Fig. 2c). This function is most useful for exploring gene regulation in the vicinity of a genomic mutation or deletion in a non-coding region. Next to a BED file with the region of interest containing the mutation or genomic deletion, the user uploads an annotation file containing significantly differentially expressed genes. The resulting interactive AnnoMiner plot depicts the genomic neighbourhood of the peak, with the differential regulation of the overlapping and the five closest genes up- and downstream (Fig. 5a). The user can choose to visualize only the deregulation, or can discriminate between up- and down-regulation of the genes. In the latter case, the colour of the box reflects the direction of differential expression (either blue for up-, red for downregulated, green if an equal number of genes are up- and downregulated or grey if unchanged, (Fig. 5a). By selecting one or more boxes (Fig. 5b), the selected genes, as well as their log2FC and FDR with the user’s annotation file will be returned in a table (Fig. 5c).
Alternative to the nearby genes annotation, users can also utilize the long-range interactions function of AnnoMiner to explore the neighbourhood of a peak (Fig. 2d). In this case, a range of base-pairs has to be chosen up- and downstream of the peak, which is then decorated with information on differential expression from the user-provided annotation file (Fig. 5d). After selecting the up- or downstream region (Fig. 5e), the user can retrieve the genes within that region of the peak together with their differential expression data (Fig. 5f). The long-range interactions function will have its best use whenever Hi-C data are available, as these types of data will give insight about the genomic boundaries and thus help to choose the genomic neighbourhood of a peak.
As a proof of concept for the nearby genes annotation function, we used a study that had shown the requirement of long-range enhancers regulating Myc expression for normal facial morphogenesis51 (GEO dataset GSE52974). In humans, cleft lip or cleft palate (CL/P) is a frequent congenital malformation. This malformation has been associated with risk factors located at a 640 kb noncoding region on chromosome 8. The corresponding region in mouse was studied by Uslu and colleagues and refined to a more specific enhancer region, the medionasal enhancer (MNE51). Deletions within the MNE in mouse led to smaller snouts and abnormalities of nasal and frontal bones amongst other defects. Myc was the only gene observed to be differentially expressed in the vicinity of this deletion. We used the CL/P deletion 8–17 (chr15:62668548–63550550) from Uslu et al. to create a single-peak BED file and uploaded it together with the significantly differentially expressed genes from the re-processed RNA-sequencing data from the same strain compared against control (GEO dataset GSE52974) to test the AnnoMiner nearby genes annotation function (Supplementary Table S4). Indeed, only a single gene is significantly differentially expressed (pink box, Fig. 5a), which is Myc (Fig. 5c). In principle, this function can also be used to explore the expression dynamics within the gene neighbourhood of multiple peaks (see Supplementary Figure S6). However, large-scale long-range gene regulatory data of this type are very sparse and their interpretation remains too complex to be exhaustively analysed with a tool like AnnoMiner. In summary, AnnoMiner’s nearby genes annotation or long-range interaction functions help in an easy and quick way to identify deregulated genes in the neighbourhood of one genomic position.
Transcription factor & histone modification enrichment analysis
Transcription factor binding sites inferred from experimentally detected TF peaks in the genome can be used to predict TFs, which potentially co-regulate gene-sets. AnnoMiner’s TF & HM enrichment analysis function identifies enriched peaks in the promoter regions of a user-provided gene list, for instance co-regulated genes from a transcriptomic analysis. Any valid identifier is accepted, as AnnoMiner performs gene ID conversion on-the-fly using BioMart52. AnnoMiner considers a gene as a potential target, if its promoter overlaps with a TF peak. The user can either choose the promoter region (up- and downstream number of base-pairs from the TSS) or use the DynamicRanges calculated by AnnoMiner, which is based on the distribution of a TF binding event relative to the TSS and therefore specific for each TF (see “Methods”, not available for histone modifications). The results of the enrichment analysis are visualized as an interactive bar plot (for the first 10 hits, Fig. 6a), as well as an interactive table. In the table, all available TF ChIP-seq datasets in the AnnoMiner database for the species of interest are ranked according to their Combined Score. Along with this value, AnnoMiner also reports information about the experimental condition, cell line or developmental stage, contingency table values, p-value, enrichment score, FDR and the list of potential targets of the TF (Fig. 6b) in the downloadable version of the table.
As a proof of principle for predicting transcriptional regulators we selected a differential expression dataset from daf-16/FoxO mutants in C. elegans53. When uploading the list of all DAF-16A/F targets provided in53 to AnnoMiner’s TF & HM enrichment analysis function, daf-16 was the 4th most significantly enriched transcription factor (Fig. 6a). Interestingly, pqm-1, which has been shown to bind to daf-16 response elements54, was found at 1st and 2nd position by AnnoMiner. Elt-2, a GATA-like transcription factor, which appeared as 3rd most significant hit, has been shown to bind to promoters of some daf-16-regulated genes and to be required for their regulation55. To conclude, AnnoMiner’s TF & HM enrichment function is a powerful tool for predicting relevant transcription factors co-regulating sets of genes with similar expression patterns.
Performance evaluation of AnnoMiner’s TF & HM enrichment analysis function
We wanted to compare the performance of AnnoMiner’s TF & HM enrichment analysis function with other web-tools for TF enrichment analysis. We followed in principle the evaluation protocol proposed by Keenan et al., which used PR-AUCs and ROC-AUCs calculated from the PPROC R-package for estimating performance43 (for details see also “Methods”). In brief, we took manually curated datasets provided by43 containing single TF perturbation experiments followed by RNA-seq from Gene Expression Omnibus56 (GEO). Gene expression data used for benchmarking were restricted to experiments targeting TFs for which AnnoMiner is storing at least one high quality TF ChIP-seq dataset for the human assembly GRCh38, resulting in a total of 75 datasets that we could use for benchmarking. We submitted the list of significantly differentially expressed genes between perturbed TF versus wild-type control (which we hereafter refer to signature gene-sets) from these experiments to perform enrichment analysis using different tools, including AnnoMiner. We used the rank of the perturbed TF in the resulting enrichments of its associated signature gene-set to calculate PR-AUCs and ROC-AUCs. We furthermore calculated the cumulative distribution function for the ranks of each TF across all the experiments it was perturbed in. Only if a TF ranks randomly, the distribution function will be uniform; we performed Anderson–Darling tests to detect deviation from uniformity. We then computed the percentage of perturbed TFs that were correctly ranked within the first percentile to ensure that TFs were ranking high.
We chose the following web-tools for comparison: ChEA3, TFEA.ChIP and EnrichR (Table 1). ChEA3 offers the user 2 different methods to rank predicted TFs (meanRank and topRank) and we evaluated both ranking methods. EnrichR offers different resources for enrichment analysis and we used the resources ARCHS4, ChEA 2016, ENCODE 2015, ENCODE and ChEA Consensus and TRRUST 2019 for enrichment analysis, respectively. The number of datasets included in the benchmarking set did depend on the resource tested (see “Methods”). The Anderson–Darling test returned significant results for all web-tools tested (ADtest in Table 1), except for EnrichR in combination with the ENCODE_and_ChEA_Consensus resource. This highlights the ability of all tools to rank the perturbed TF among the top candidates of the results. AnnoMiner outperformed TFEA.ChIP in all categories including percentage recovered TFs in the 1st percentile (7.0 vs 0.0), ROC AU (0.69 vs 0.63) and PR AUC (0.68 vs 0.60). EnrichR differed in performance depending on the resource used. On average, it outperformed AnnoMiner on the percent recovered TFs (9.5 vs 7.0), while AnnoMiner reached slightly higher values in ROC AUCs (0.69 vs 0.66) and PR AUCs (0.69 vs 0.68). ChEA3 performed similar for both ranking methods used and outperformed AnnoMiner in all categories. To summarize, though it cannot reach the performance of ChEA, AnnoMiner outperforms the other evaluated tools in identifying relevant TFs at a high rank the in gene-sets derived from TF perturbation studies. Other than ChEA, however, AnnoMiner is available for all four major model organisms, including the invertebrates Drosophila and C. elegans.
Using AnnoMiner to identify important transcriptional regulators of Drosophila flight muscle morphogenesis and growth
The assess the performance of AnnoMiner further, we tested its TF & HM enrichment analysis function to predict unknown transcriptional regulators for a list of co-regulated genes. We chose a dataset quantifying gene expression dynamic during development of Drosophila melanogaster indirect flight muscles46, in which mRNA from indirect flight muscles had been isolated at several time-points correlating with key steps during muscle development (Fig. 7a). We focused on the development of the contractile apparatus called myofibrillogenesis and compared gene expression at 30 h after puparium formation (APF), when myofibrils assemble, with 72 h APF, when myofibrils have matured (Fig. 7a). This comparison revealed 2193 differentially expressed genes that were shown in the original study to be strongly enriched for genes relevant for myofibril and mitochondrial development. We submitted this differential gene list to AnnoMiner (Supplementary Table S5) and searched all the 514 modERN and modENCODE TF ChIP-seq datasets from Drosophila stored in AnnoMiner for a potential enrichment of peaks. This identified 42 unique TFs significantly enriched with an FDR < 0.05. (Supplementary Table S5). The top 10 enriched datasets included Deaf1, Trl, Hr78 and cwo (Fig. 7b). Hence, these are potential transcriptional regulators of flight muscle development.
To identify which of the 42 transcriptional regulators may have a function during flight muscle development we next integrated data from an RNAi-screen for muscle function57, which had assayed for viability, flight muscle performance and body locomotion after muscle specific knock-down of individual TFs. From the 42 TFs identified by AnnoMiner, knock-down of two TFs resulted in flightless animals (CG14655, cwo), and seven were scored as lethal during development (Trl, Hr78, lola, Vsx2, Pif1B, salr, Hr51) (Supplementary Table S5); 21 TFs did not show a phenotype in this assay and the remaining 13 had not been tested (Fig. 7c).
Trithorax-like and the uncharacterized Zinc-finger protein CG14655 are required for flight muscle morphogenesis
For experimental verification we selected two proteins, Trl (Trithorax-like) and an uncharacterized zinc-finger protein called CG14655. We used muscle-specific knock-down to investigate a putative function of both genes in muscle. For Trl knock-down we used 4 independent transgenic RNAi lines driven with muscle-specific Mef2-GAL4. Two of those resulted in pupal lethality and the other two resulted in viable but flightless flies, demonstrating a function of Trl in flight muscle (Supplementary Table S6). For morphological analysis we visualized the myofibrils of flight and leg muscles of mature 90 h APF pupae in wild type and three different Trl knock-down lines. We found that knock-down of Trl caused disordered and frayed myofibrils in flight muscles, whereas leg muscle myofibrils appeared normal (Fig. 7d, Supplementary Figure S7). This shows that Trl is required for normal myofibril development in flight muscle.
To investigate a role of CG14655 during muscle development we also applied muscle-specific knock-down with Mef2-GAL4 and three different RNAi lines, one of which resulted in viable but flightless animals and two other overlapping hairpins resulted in pupal lethality (Supplementary Table S6). Morphological analysis showed that CG14655 knock-down flight muscles displayed abnormal actin accumulations between their myofibrils, suggesting a role for CG14655 in myofibril development of flight muscle. Together, these findings demonstrate the predictive power of AnnoMiner to identify transcriptional regulators by combining chromatin binding and differential expression data.
AnnoMiner helps identify targets co-regulated by Sd and Yki during flight muscle growth in Drosophila
Strikingly, two of the enriched transcriptional regulators in the above comparative flight muscle development dataset were the transcriptional effector of the Hippo pathway in Drosophila called Yorkie (Yki) and its essential Tead co-factor Scalloped (Sd)58 (Supplementary Table S5). Recently, an essential function for the Hippo pathway promoting flight muscle growth by transcriptional up-regulation of mRNAs coding for sarcomeric proteins, which built the myofibrils, was identified47. We wanted to know whether we could identify direct transcriptional targets of Yki/Sd during flight muscle growth. To this end, we integrated mRNA BRB-seq data from developing yki knock-down flight muscle (yki-IR), as well as from flight muscle expressing a constitutive active form of yki (yki-CA) compared to wild type controls (GEO accession GSE158957) with ChIP-seq data from Yki (modENCODE dataset ENCSR422OTX) and Sd (modERN dataset ENCSR591PRH) obtained in fly embryos using AnnoMiner’s peak integration function.
Both proteins showed prominent base pair coverage of the TSS regions of their target genes (Fig. 8a). We selected these peaks to retrieve associated genes and then integrated the BRB-seq data for 24 h APF yki knockdown (yki-IR 24 h), as well as 24 h and 32 h APF constitutively active yki, respectively (yki-CA 24 h, yki-CA 32 h; Supplementary Table S7). Upon knock-down of yki, already at 32 h APF, a severe myofibril assembly defect had been observed47. Interestingly, AnnoMiner identified Yki and Sd binding sites in the TSS of two genes essential for muscle function and development, which were downregulated in yki knock-down muscles at 24 h APF. These genes code for the sarcomeric proteins Tropomyosin 1 (Tm1) and the Nesprin-family protein Muscle-specific protein 300 kDa (Msp300), which is important to link the myofibrils to the nuclei59 (Fig. 8b). The gain-of-function yki phenotype (yki-CA) is characterized by premature expression of sarcomeric proteins resulting in muscle fiber hyper-compaction47. At 24 h APF scalloped (sd) itself is the only direct target gene of the Yki/Sd complex which is differentially expressed in yki-CA (Fig. 8c). At 32 h AFP, AnnoMiner identified 177 unique genes and in total 541 transcripts as potential direct Yki/Sd targets. Using GO term enrichment analysis by modEnrichR60 we found many processes and cellular compartments related to muscle development and function among the top enriched terms in these potential direct Yki/Sd targets (Fig. 8d, Supplementary Table S7). To conclude, using AnnoMiner’s peak integration function, we identified putative direct targets of the Yki/Sd transcriptional complex that showed differential expression upon yki knock-down or yki constitutive activation. Many of these genes are likely important for flight muscle morphogenesis.
Here, we introduced AnnoMiner, a web-based, flexible and user-friendly platform for genomic peak annotation and integration, as well as transcription factor enrichment analysis. We illustrated AnnoMiner’s peak annotation and integration, as well as the nearby genes annotation functions with specific examples. We confirmed the predictive power of AnnoMiner’s TF enrichment function experimentally by identifying important regulators of Drosophila indirect flight muscle development. This was achieved searching for overrepresented TF peaks in promoters of genes differentially regulated during myofibrillogenesis using AnnoMiner.
AnnoMiner distinguishes itself from other peak annotation, as well as TF enrichment tools. AnnoMiner’s peak annotation and peak integration outputs first a bar plot that shows the overlap of a peak with different gene feature attributes, including up- and downstream regions, the TSS, 5′ and 3′ UTRs and the gene body. This has two advantages: first, the user can visualize the distribution of peaks of the uploaded file with respect to all relevant attributes of annotated gene features in the genome. Second, AnnoMiner allows to interactively choose the target region(s) for which the associated genes are returned. While other tools provide statistics on the peak distribution relative to gene feature attributes (e.g.28) in the output, to our knowledge, AnnoMiner is the only software that allows to easily retrieve specific gene-sets depending on the distribution of the peak coverage over gene feature attributes. The peak integration function offers the same flexibility. Moreover, both functions allow to directly integrate differential gene expression or other numerical data associated to genes with the genomic peak files. The nearby genes annotation and long-range interaction functions, which integrate expression data with peak data, is novel and for the first time, users can in a web-based manner visualize and retrieve genes that are not the nearest neighbours of a peak. It could be useful to integrate genomic, non-coding variants causative for human genetic diseases with disease-associated gene expression data and explore transcriptional activity within TAD domains. AnnoMiner’s TF enrichment analysis function offers to treat promoter regions dynamically for each specific TF with its DynamicRanges function. Finally, AnnoMiner is independent of the genomic assembly of the source data, as it on-the-fly translates submitted IDs and uses the ID compatible with the database chosen for gene centred peak annotation, as well as TF enrichment.
We compared AnnoMiner’s TF enrichment analysis function to the best-performing software in the field, which includes ChEA3, TFEA.ChIP and EnrichR. AnnoMiner could not reach the accuracy of ChEA3. One of the reasons could be that the dataset of TF—gene association used by ChEA3 supersedes data we retrieve from ENCODE, modENCODE and modERN, as it includes several additional datasets which are at least partially manually curated or generated. This hypothesis is supported by the fact that EnrichR shows differing performance when using different source data, showing lower performance when using the ENCODE data alone compared to the ones from ChEA3. One possible solution could be to add curated data to AnnoMiner’s TF enrichment analysis function, for instance from ChIP-Atlas61 or ReMap62. The disadvantage however is the higher cost in curation, as well as the fact that manually curated datasets are typically not available for model organisms such as Drosophila or C. elegans, but are rather restricted to human or mouse, as is the ChEA3 tool. Yet, users can easily add their own data using the scripts provided to fill the mongoDB present in our gitlab repository, when running AnnoMiner locally (https://gitlab.com/habermann_lab/AnnoMiner/-/tree/master/scripts).
Finally, we predicted and verified potential transcriptional regulators of muscle and myofibril morphogenesis as well as muscle growth using AnnoMiner’s TF & HM enrichment analysis function. Amongst those is Trl, a GAGA transcription factor which contains a BTB/POZ domain, as well as a C2H2 zinc-finger that binds to DNA in a sequence-specific manner. Previous studies suggest that Trl is required to keep promoters nucleosome-free, thus allowing Pol-II-access63. We showed here that muscle-specific knock-down of Trl using four independent hairpins either leads to pupal lethality or flightlessness. Consistently, we find severely perturbed indirect flight muscles upon Trl knock-down whereas leg muscles appear largely normal. This indicates a preferential function of Trl in flight muscle, however as two hairpins result in pupal lethality, a role of Trl in other body muscles is also likely.
A second potential direct transcriptional regulator identified by AnnoMiner is the uncharacterized Zinc-finger transcription factor CG14655. Muscle-specific knock-down of CG14655 either results in pupal lethality or flightlessness and causes abnormal accumulations of actin in flight muscles. This again suggests that CG14655 is important for normal myofibril development in flight muscle.
Lastly, we made use of two other transcriptional regulators, Yorkie and Scalloped, which on DNA act in a complex58, to identify its direct targets. AnnoMiner identified two direct targets of Yki, which upon loss of yki were downregulated. Both code for important muscle structure proteins, constituents of the sarcomere or linking the sarcomere to the nucleus, and hence could contribute to the severe phenotype observed upon yki knock-down. Gain-of-function of yorkie results in muscle fiber hyper-compaction and premature expression of sarcomeric protein components47. Consistently, AnnoMiner identified a number of direct Yki targets with functions related to muscle development and growth. This substantiates a role for Scalloped and its transcriptional co-factor Yorkie during flight muscle growth.
To conclude, the new web-tool AnnoMiner is a user-friendly, intuitive, interactive and highly-flexible platform for genomic peak annotation and peak integration. It is suitable for identification of nearby genes or long-range interactions of a genomic peak, as well as to perform Transcription Factor and Histone Modification enrichment analysis for a list of genes. This manuscript details all AnnoMiner functions and shows its usefulness for annotating and integrating peaks from two different ChIP-seq experiments together with transcriptomics data. Finally, AnnoMiner helped identify several key regulators of indirect flight muscle development and growth in Drosophila, some of which were confirmed experimentally.
The AnnoMiner software and database
AnnoMiner is a modular software consisting of a library of java classes for retrieval, storage and analysis of annotated genomic peak data together with a web application providing a graphical user interface for a number of predefined analyses.
The java library used in the back-end manages the retrieval of data from remote sources at UCSC and modENCODE36 or alternatively from files in BED or gff format. Retrieved datasets are stored in a MongoDB database. This document-oriented NoSQL database system has been chosen for its flexible data model and runtime speed benefits benchmarked against an SQL database solution. For optimal performance, a custom database connectivity layer has been developed based on the MongoDB java driver. The overall structure of the application is shown in Fig. 1 and Supplementary Figure S1.
AnnoMiner’s database currently holds genomic data from ENCODE33, modENCODE36 and modERN37. For each model organism, the latest genome assembly is stored. For human, mouse and Drosophila, we also provide the second latest release.
Following the Findability, Accessibility, Interoperability and Reusability (FAIR) principle64, documentation and source code of the tool are available on GitLab: https://gitlab.com/habermann_lab/annominer. Using the java library classes from the code repository developers will be able to define custom analyses on annotated genomic ranges datasets and extend the database towards new data sources. The repository also provides executable java classes and python scripts for local maintenance such as to populate a local database with ENCODE, modENCODE or modERN data or to reproduce our benchmark analysis.
Genomic peak annotation functions
For calculating the overlap between peaks and genes, we compare the coordinates of peaks with those of stored gene feature attributes. These include 5′ flanking regions, promoter region, transcription start site (TSS), 5′UTR, coding sequence (CDS, including exons and introns), 3′UTR, 3′ flanking regions, as well as strandedness of the gene feature. For integrating peaks with gene features, we first calculate the coverage of each peak by testing whether the peak coordinates fall within the coordinates of any attribute of a gene feature. If all isoforms of a gene are chosen, the overlap of the peak with the attributes of each individual isoform will be checked. For calculating the coverage, overlap events are calculated cumulatively and the sums of all overlaps will be divided by the length of the attribute. Depending on which attribute (target region) the user has chosen, only genes whose chosen attribute overlaps (with a minimum % or number of base pairs) with the peak will be displayed in the data table.
The peak integration function is based on the same algorithm as the peak annotation. However, only genes, which meet the overlap criteria of the user-selected target regions with peaks from all provided BED files will be displayed in the resulting data table.
Nearby genes annotation
For the nearby genes annotation function, the 5 neighbouring genes up- and down-stream, as well as the overlapping gene of a peak are extracted and displayed to the user. The user can choose whether the directionality of the gene with respect to the peak should be considered. If a gene overlaps with the peak, it will be shown in the Overlap box. Data on differential expression, which needs to be provided by the user in form of an annotation file, is displayed on the returned plot. The user can choose whether or not the directionality of the deregulation (up- or downregulated) is shown in the box plot. Depending on the user input, the nearby genes annotation function returns neighbouring genes of a single peak, or of many peaks.
With the long-range interactions function, the user is able to extract and visualize all the genes up- and down-stream the user provided peak (or peaks), which lie within an user-defined genomic window (default 50 kb upstream and 50 kb downstream). The user can choose whether the directionality of the gene with respect to the peak should be considered. Data on differential expression, which needs to be provided by the user in form of an annotation file, is displayed on the returned plot. The user can choose whether or not the directionality of the deregulation (up- or downregulation) is shown in the box plot. In the first case, hovering on the box of interest, the user can extract information regarding the overall number of deregulated genes (coverage). In the second case, the user will extract separate information about the number of up- and down-regulated genes. Depending on the user input, the long-range interactions function returns neighbouring genes of a single peak, or of many peaks within the user-defined genomic window.
Transcription factor enrichment analysis
Transcription factor enrichment analysis in AnnoMiner uses publicly available TF ChIP-seq data from the ENCODE, modENCODE and modERN databases. In order to keep only high-quality data, we undertook a strict selection of available TF ChIP-seq datasets. First, we kept only experiments, for which data had already been pre-analysed following the ENCODE processing pipeline. Second, only experiments with replicates and optimal IDR peaks files (narrowPeak) were retained. Third, we manually inspected each narrowPeak file, evaluating the peak distribution around the gene’s TSS and discarded the ones with irregular peak distributions in relation to promoter regions of genes, as we considered those as outliers (Supplementary Figure S8c,d; all outliers csv files that were excluded are available in our gitlab repository at https://gitlab.com/habermann_lab/AnnoMiner-paper/-/tree/master/ChIP-seq_experiment_excluded). The number of used TF ChIP-seq datasets for each of the model organisms and assemblies is available in Supplementary Tables S8, S9.
For each TF ChIP-seq dataset stored in the AnnoMiner database, we first compute the contingency table of potential targets in both the uploaded gene list (signature gene-set) and in the background list of all genes of the chosen organism. We then apply a one-tailed hypergeometric test followed by Benjamini–Hochberg correction65. Furthermore, we calculate the enrichment score as follows:
And a combined score based on the p-value and enrichment score according to66:
AnnoMiner allows the user to fully customize the promoter region considered for TF enrichment analysis. The user can choose both the upstream and downstream borders to consider without constraints.
The user can also choose dynamic ranges that we pre-calculated for each individual TF based on available ChIP-seq data stored in AnnoMiner. In brief, for each TF ChIP-seq experiment we computed the distances between its peaks and the genes in a range of 20,000 bp upstream of their TSSs. We then binned the closest genes by distances with a resolution of 50 bp. A noise level was computed, representing non-significant bindings, from 5000 to 20,000 bp, as follows:
Next, we smoothened the binning using the simple moving average approach (SMA) with a sliding window of size = 3. Intersecting the noise threshold with the SMA curve, we obtained the upstream TSS value. This value is unique for each TF and represents the distance range in which the TF is more often binding than is expected at random (see Supplementary Figure S8e,f).
Choosing the peak-promoter overlap
The user can define the amount of overlap between a TF and a promoter in percentage or in base pairs (bp) to categorize a gene as a potential target.
Histone modification enrichment analysis
The same function that is used for the Transcription factor enrichment analysis can be used for enriching histone modifications for a gene list. In this case, a DynamicRanges-like option is not meaningful. Thus, the user has to manually define the promoter region (default: 2000 bp upstream and 500 bp downstream of a gene’s TSS).
Automatic ID conversion
In order to easily allow comparisons between genome versions (ENSEMBL, RefSeq, UCSC, Genecode, Flybase and Wormbase) and not restrict the user to a specific set of gene identifiers, we developed a geneID converter. This function converts on the fly the user’s input gene IDs to match the selected resource, without any intervention from the user. The conversion is performed using the latest release of BioMart52. BioMart data were downloaded and are stored locally for reasons of speed and are regularly updated. Users can download the ID conversion table to control for genes that were dropped due to the automatic ID conversion process.
Benchmarking the TF enrichment function of AnnoMiner
Benchmarking data-sets for testing AnnoMiner’s enrichment function in the form of TF regulons were obtained from human single TF perturbation experiments (knock-downs, knockouts, over-expression and chemical inhibition) followed by RNA-seq from Gene Expression Omnibus (GEO56, 160 experiments), manually curated and made available from Keenan et al.43. We retained only experiments targeting TFs for which AnnoMiner is storing at least one high quality TF ChIP-seq dataset for the assembly GRCh38 (hg38), leading to a total of 75 RNA-seq experiments left for benchmarking (benchmarkingSet).
Precision-Recall (PR) and Receiver-Operating Characteristic (ROC) areas under the curves (AUC) for the benchmarkingSet were computed using the R package PRROC67, following the benchmarking procedure proposed and applied by43 and68. In brief, for each TF, signature gene-sets were derived based on differentially expressed genes of the perturbed TF versus its control and used for TF enrichment analysis. For each dataset, the rank of the perturbed TF was assigned to the positive (foreground (fg)) class, while all other ranks of TFs retrieved for the dataset were assigned to the negative (background (bg)) class; using the PPROC package, precision-recall (PR) and Receiver-Operator Curves (ROC) using continuous interpolation were computed and areas under curve (AUCs) for both were calculated (PR-AUC and ROC-AUC). As the negative bg class greatly exceeded the positive fg class, we down-sampled the negative class to the same size as the positive class. We repeated this procedure 5000 times. We used the approx R function to linearly interpolate between all points from the 5000 ROC and the 5000 PR curves and calculated the mean ROC and PR AUCs values of all bootstrap runs. Furthermore, we analysed the cumulative distribution of the rank values for each TF across all experiments it was perturbed in, which is not expected to be uniform, unless the TF would rank randomly in all its associated datasets. We then performed the Anderson–Darling test using the goftest R package to test for significant deviations from this distribution, indicating a non-random distribution of the TF. Finally, we computed the percentage of perturbed TFs that were correctly ranked within the first percentile to ensure high ranking of the TF. All values are available in Table 1.
Benchmarking against existing tools
We compared the performance of AnnoMiner’s TF enrichment function with the performance of the following web-servers: TFEA.ChIP42, ChEA343 and EnrichR44. The same dataset (benchmarkingSet) was used for all web-tools tested. Only datasets which could be effectively ranked by the web-tool under study were used. We used 57 signature gene-sets to test TFEA.ChIP and all the 75 signature-gene sets to test ChEA3. For EnrichR we evaluated the performance metrics on 5 different resources: ARCHS4, ChEA 2016, ENCODE 2015, ENCODE and ChEA Consensus and TRRUST 2019 using 74, 44, 44, 29 and 51 signature gene-sets, respectively.
Epigenetic and RNA-seq data used
For analysing STAT3 in acute B-cell lymphomas, we used data from super-series GSE5072448. Only peaks which were significantly upregulated in ABC type B-cell lymphoma were selected for AnnoMiner’s peak integration (FDR cut-off < = 0.05; FC > = 1.25). We used H3K4me3 data from an ABC cell line stored in the ENCODE database (GSE86718_ENCFF763KFL, replicated peaks BED file) for peak integration with significantly upregulated STAT3 peaks. Human genome version hg19 was used for analysis. RNA-seq data (GSE50721) from the same super-series was used for integration with STAT3 and H3K4me3 peaks, whereby we used statistical results provided by the authors of the study. For demonstrating the nearby genes annotation function, we used data from51 and GEO dataset GSE52974, describing long-range enhancer elements regulating Myc expression for normal facial morphogenesis, mapped to mouse genome version mm9. For demonstration of the TF & HM enrichment function we used RNA-seq data from a longevity study focusing on daf-16/FoxO mutants in C. elegans (53 and GEO dataset GSE72426) and C. elegans genome version ce11/WBcel245.
Analysed RNA-seq data from developing indirect flight muscle (IFM) in Drosophila melanogaster were taken from GSE10724746. We used the DESeq2 comparisons between time-points 72 h and 30 h after puparium formation (APF) provided by the authors. Genes with log2FC in absolute value greater than 2 and FDR smaller than 0.01 were considered as differentially expressed. For integration with phenotypic data, we made use of data published on muscle morphogenesis and function in Drosophila57 and flight muscle growth in Drosophila47.
For integrating gene expression changes of yki knocked-down flight muscles and flight muscles expressing a constitutively active yki compared to wild type, with Yki and Sd protein genome occupancy data, we used BRB-seq data from yki knock-down and yki constitutively active fly strains47 available from GEO (GSE158957). Optimal IDR threshold narrow peak files for Yorkie (Yki, ENCSR422OTX) and Scalloped (Sd, ENCSR591PRH) ChIP-seq were downloaded from the ENCODE/modERN resource, both for Drosophila genome version dm6.
Drosophila experimental methods
Drosophila strains were grown under standard conditions at 27 °C to enhance GAL4 activity. All RNAi knock-downs were induced with Mef2-GAL4, a GAL4 line specifically expressed during development of all muscle types of the fly57. UAS RNAi lines used were from the Vienna69 and Harvard collections70 and ordered from VDRC or Bloomington stock centres. See Supplementary Table S6 for all genotypes used. The flight test was done as described in57.
Indirect flight muscle morphology was analysed in 3 to 5 days adult males or in 90 h APF pupae as previously published71. Briefly, 90 h APF pupae or adult flies were fixed with 4% paraformaldehyde in PBT (PBS, 0.5% triton X-100) and cut into half-thoraces using a sharp microtome blade. The half-thoraces were blocked with 3% normal goat serum in PBT for 30 min and F-actin in the flight and leg muscles was visualised with phalloidin (coupled to Alexa488 or rhodamine, Molecular Probes, 1:1000 in PBT for 2 h or overnight) and Sls was stained with anti-Kettin antibody MAC155 (Babraham Institute, 1:100 in PBT overnight). The stained half-thoraces were imbedded in Vectashield (Biozol) and imaged on an LSM780 or LSM880 confocal microscope. Images were processed using Fiji72.
Zentner, G. E. & Henikoff, S. High-resolution digital profiling of the epigenome. Nat. Rev. Genet. 15, 814–827 (2014).
ENCODE Project Consortium et al. Perspectives on ENCODE. Nature 583, 693–698 (2020).
Schoenfelder, S. & Fraser, P. Long-range enhancer-promoter contacts in gene expression control. Nat. Rev. Genet. 20, 437–455 (2019).
Spitz, F. Gene regulation at a distance: From remote enhancers to 3D regulatory ensembles. Semin Cell Dev Biol 57, 57–67 (2016).
Krivega, I. & Dean, A. Enhancer and promoter interactions-long distance calls. Curr. Opin. Genet. Dev. 22, 79–85 (2012).
Comoglio, F. et al. Thrombopoietin signaling to chromatin elicits rapid and pervasive epigenome remodeling within poised chromatin architectures. Genome Res. 28, 295–309 (2018).
Mitchelmore, J., Grinberg, N. F., Wallace, C. & Spivakov, M. Functional effects of variation in transcription factor binding highlight long-range gene regulation by epromoters. Nucleic Acids Res. 48, 2866–2879 (2020).
Dao, L. T. M. & Spicuglia, S. Transcriptional regulation by promoters with enhancer function. Transcription 9, 307–314 (2018).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Nakato, R. & Shirahige, K. Recent advances in ChIP-seq analysis: From quality management to whole-genome annotation. Brief. Bioinform. 18, 279–290 (2017).
Furey, T. S. ChIP-seq and beyond: New and improved methodologies to detect and characterize protein–DNA interactions. Nat. Rev. Genet. 13, 840–852 (2012).
Thomas, R., Thomas, S., Holloway, A. K. & Pollard, K. S. Features that define the best ChIP-seq peak calling algorithms. Brief. Bioinform. 18, 441–450 (2017).
Wilbanks, E. G. & Facciotti, M. T. Evaluation of algorithm performance in ChIP-seq peak detection. PLoS ONE 5, e11471 (2010).
Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137–R139 (2008).
Xu, H., Wei, C.-L., Lin, F. & Sung, W.-K. An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data. Bioinformatics 24, 2344–2349 (2008).
Huang, W. et al. Efficiently identifying genome-wide changes with next-generation sequencing data. Nucleic Acids Res. 39, e130 (2011).
Shen, L. et al. diffReps: Detecting differential chromatin modification sites from ChIP-seq data with biological replicates. PLoS ONE 8, e65598 (2013).
Ambrosini, G., Dreos, R., Kumar, S. & Bucher, P. The ChIP-Seq tools and web server: A resource for analyzing ChIP-seq and other types of genomic data. BMC Genomics 17, 938–1015 (2016).
Blahnik, K. R. et al. Sole-Search: An integrated analysis program for peak detection and functional annotation using ChIP-seq data. Nucleic Acids Res. 38, e13 (2010).
Guzman, C. & D’Orso, I. CIPHER: A flexible and extensive workflow platform for integrative next-generation sequencing data analysis and genomic regulatory element prediction. BMC Bioinform. 18, 363–416 (2017).
Boeva, V., Lermine, A., Barette, C., Guillouf, C. & Barillot, E. Nebula—A web-server for advanced ChIP-seq data analysis. Bioinformatics 28, 2517–2519 (2012).
Salmon-Divon, M., Dvinge, H., Tammoja, K. & Bertone, P. PeakAnalyzer: Genome-wide annotation of chromatin binding and modification loci. BMC Bioinform. 11, 415–512 (2010).
Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Yu, G., Wang, L.-G. & He, Q.-Y. ChIPseeker: An R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015).
Kondili, M. et al. UROPA: A tool for Universal RObust Peak Annotation. Sci. Rep. 7, 2593–2612 (2017).
Tang, X. et al. annoPeak: A web application to annotate and visualize peaks from ChIP-seq/ChIP-exo-seq. Bioinformatics 33, 1570–1571 (2017).
Chen, T.-W. et al. ChIPseek, a web-based analysis tool for ChIP data. BMC Genomics 15, 539–613 (2014).
Huang, W., Loganantharaj, R., Schroeder, B., Fargo, D. & Li, L. PAVIS: A tool for Peak Annotation and Visualization. Bioinformatics 29, 3097–3099 (2013).
Bhasin, J. M. & Ting, A. H. Goldmine integrates information placing genomic ranges into meaningful biological contexts. Nucleic Acids Res. 44, 5550–5556 (2016).
McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 28, 495–501 (2010).
Zhu, L. J. et al. ChIPpeakAnno: A bioconductor package to annotate ChIP-seq and ChIP-chip data. BMC Bioinform. 11, 237–310 (2010).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Davis, C. A. et al. The encyclopedia of DNA elements (ENCODE): Data portal update. Nucleic Acids Res. 46, D794–D801 (2018).
Jou, J. et al. The ENCODE portal as an epigenomics resource. Curr. Protoc. Bioinform. 68, e89 (2019).
Gerstein, M. B. et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330, 1775–1787 (2010).
Kudron, M. M. et al. The ModERN resource: Genome-wide binding profiles for hundreds of Drosophila and Caenorhabditis elegans transcription factors. Genetics 208, 937–949 (2018).
Alvarez, M. J. et al. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nat. Genet. 48, 838–847 (2016).
Garcia-Alonso, L. et al. Transcription factor activities enhance markers of drug sensitivity in cancer. Cancer Res. 78, 769–780 (2018).
Wang, Z. et al. BART: A transcription factor prediction tool with query gene sets or epigenomic profiles. Bioinformatics 34, 2867–2869 (2018).
Kwon, A. T., Arenillas, D. J., Worsley Hunt, R. & Wasserman, W. W. oPOSSUM-3: Advanced analysis of regulatory motif over-representation across genes or ChIP-Seq datasets. G3 2, 987–1002 (2012).
Puente-Santamaria, L., Wasserman, W. W. & Del Peso, L. TFEA.ChIP: A tool kit for transcription factor binding site enrichment analysis capitalizing on ChIP-seq datasets. Bioinformatics 35, 5339–5340 (2019).
Keenan, A. B. et al. ChEA3: Transcription factor enrichment analysis by orthogonal omics integration. Nucleic Acids Res. 47, W212–W224 (2019).
Kuleshov, M. V. et al. Enrichr: A comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 44, W90–W97 (2016).
Imrichová, H., Hulselmans, G., Atak, Z. K., Potier, D. & Aerts, S. i-cisTarget 2015 update: Generalized cis-regulatory enrichment analysis in human, mouse and fly. Nucleic Acids Res. 43, W57-64 (2015).
Spletter, M. L. et al. A transcriptomics resource reveals a transcriptional transition during ordered sarcomere morphogenesis in flight muscle. Elife 7, 1361 (2018).
Kaya-Çopur, A. et al. The Hippo pathway controls myofibril assembly and muscle fiber growth by regulating sarcomeric gene expression. Elife 10, 79 (2021).
Hardee, J. et al. STAT3 targets suggest mechanisms of aggressive tumorigenesis in diffuse large B-cell lymphoma. G3 3, 2173–2185 (2013).
Li, X.-Y., Harrison, M. M., Villalta, J. E., Kaplan, T. & Eisen, M. B. Establishment of regions of genomic activity during the Drosophila maternal to zygotic transition. Elife 3, e03737 (2014).
Gaskill, M. M., Gibson, T. J., Larson, E. D. & Harrison, M. M. GAF is essential for zygotic genome activation and chromatin accessibility in the early Drosophila embryo. Elife 10, e66668 (2021).
Uslu, V. V. et al. Long-range enhancers regulating Myc expression are required for normal facial morphogenesis. Nat. Genet. 46, 753–758 (2014).
Durinck, S. et al. BioMart and Bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439–3440 (2005).
Chen, A.T.-Y. et al. Longevity genes revealed by integrative analysis of isoform-specific daf-16/FoxO mutants of Caenorhabditis elegans. Genetics 201, 613–629 (2015).
Tepper, R. G. et al. PQM-1 complements DAF-16 as a key transcriptional regulator of DAF-2-mediated development and longevity. Cell 154, 676–690 (2013).
Zhang, P., Judy, M., Lee, S.-J. & Kenyon, C. Direct and indirect gene regulation by a life-extending FOXO protein in C. elegans: Roles for GATA factors and lipid gene regulators. Cell Metab. 17, 85–100 (2013).
Barrett, T. et al. NCBI GEO: Archive for functional genomics data sets–update. Nucleic Acids Res. 41, D991–D995 (2013).
Schnorrer, F. et al. Systematic genetic analysis of muscle morphogenesis and function in Drosophila. Nature 464, 287–291 (2010).
Wu, S., Liu, Y., Zheng, Y., Dong, J. & Pan, D. The TEAD/TEF family protein Scalloped mediates transcriptional output of the Hippo growth-regulatory pathway. Dev. Cell 14, 388–398 (2008).
Wang, S., Reuveny, A. & Volk, T. Nesprin provides elastic properties to muscle nuclei by cooperating with spectraplakin and EB1. J. Cell Biol. 209, 529–538 (2015).
Kuleshov, M. V. et al. modEnrichr: A suite of gene set enrichment analysis tools for model organisms. Nucleic Acids Res. 47, W183–W190 (2019).
Oki, S. et al. ChIP-Atlas: A data-mining suite powered by full integration of public ChIP-seq data. EMBO Rep. 19, e46255 (2018).
Chèneby, J., Gheorghe, M., Artufel, M., Mathelier, A. & Ballester, B. ReMap 2018: An updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments. Nucleic Acids Res. 46, D267–D275 (2018).
Fuda, N. J. et al. GAGA factor maintains nucleosome-free regions and has a role in RNA polymerase II recruitment to promoters. PLoS Genet. 11, e1005108 (2015).
Wilkinson, M. D. et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018–160019 (2016).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc. Ser. B (Methodol.) 57, 289–300 (1995).
Xiao, Y. et al. A novel significance score for gene selection and ranking. Bioinformatics 30, 801–807 (2014).
Grau, J., Grosse, I. & Keilwagen, J. PRROC: Computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics 31, 2595–2597 (2015).
Garcia-Alonso, L., Holland, C. H., Ibrahim, M. M., Turei, D. & Saez-Rodriguez, J. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Res. 29, 1363–1375 (2019).
Dietzl, G. et al. A genome-wide transgenic RNAi library for conditional gene inactivation in Drosophila. Nature 448, 151–156 (2007).
Ni, J.-Q. et al. A genome-scale shRNA resource for transgenic RNAi in Drosophila. Nat. Methods 8, 405–407 (2011).
Weitkunat, M. & Schnorrer, F. A guide to study Drosophila muscle biology. Methods 68, 2–14 (2014).
Schindelin, J. et al. Fiji: An open-source platform for biological-image analysis. Nat. Methods 9, 676–682 (2012).
We thank Friedhelm Pfeiffer, Assa Yeroslaviz and all team members of the IBDM Computational Biology team for helpful discussion and critical reading of the manuscript. We thank Aynur Kaya-Çopur for sharing data before publication. Fly stocks obtained from the Bloomington Drosophila Stock Center (NIH P40OD018537) and the Vienna Drosophila Resource Center (VDRC) were used in this study.
This work was supported by the French National Research Agency with ANR grant ANR-18-CE45-0016-01 MITO-DYNAMICS awarded to BHH and FS and with ANR-ACHN MUSCLE-FORCES (F.S.); by the excellence initiative Aix-Marseille University A*MIDEX (ANR-11-IDEX-0001-02, F.S.); the European Research Council under the European Union’s Seventh Framework Programme (FP/2007–2013)/ERC Grant 310939 (F.S.); the Human Frontiers Science Program (HFSP, RGP0052/2018, F.S.); the France-BioImaging infrastructure supported by the French National Research Agency (ANR–10–INBS-04-01, Investments for the future); by the “Investissements d'Avenir” French Government program managed by the French National Research Agency (ANR-16-CONV-0001) and from Excellence Initiative of Aix-Marseille University—A*MIDEX”; and by the Max Planck Society, the Centre National de la Recherche Scientifique (CNRS), Aix-Marseille University and the IBDM UMR 7288.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Meiler, A., Marchiano, F., Haering, M. et al. AnnoMiner is a new web-tool to integrate epigenetics, transcription factor occupancy and transcriptomics data to predict transcriptional regulators. Sci Rep 11, 15463 (2021). https://doi.org/10.1038/s41598-021-94805-1