AnnoMiner is a new web-tool to integrate epigenetics, transcription factor occupancy and transcriptomics data to predict transcriptional regulators

Meiler, Arno; Marchiano, Fabio; Haering, Margaux; Weitkunat, Manuela; Schnorrer, Frank; Habermann, Bianca H.

doi:10.1038/s41598-021-94805-1

Article
Open access
Published: 29 July 2021

Scientific Reports volume 11, Article number: 15463 (2021)

Abstract

Gene expression regulation requires precise transcriptional programs, led by transcription factors in combination with epigenetic events. Recent advances in epigenomic and transcriptomic techniques provided insight into different gene regulation mechanisms. However, to date it remains challenging to understand how combinations of transcription factors together with epigenetic events control cell-type specific gene expression. We have developed the AnnoMiner web-server, an innovative and flexible tool to annotate and integrate epigenetic, and transcription factor occupancy data. First, AnnoMiner annotates user-provided peaks with gene features. Second, AnnoMiner can integrate genome binding data from two different transcriptional regulators together with gene features. Third, AnnoMiner offers to explore the transcriptional deregulation of genes nearby, or within a specified genomic region surrounding a user-provided peak. AnnoMiner’s fourth function performs transcription factor or histone modification enrichment analysis for user-provided gene lists by utilizing hundreds of public, high-quality datasets from ENCODE for the model organisms human, mouse, Drosophila and C. elegans. Thus, AnnoMiner can predict transcriptional regulators for a studied process without the strict need for chromatin data from the same process. We compared AnnoMiner to existing tools and experimentally validated several transcriptional regulators predicted by AnnoMiner to indeed contribute to muscle morphogenesis in Drosophila. AnnoMiner is freely available at http://chimborazo.ibdm.univ-mrs.fr/AnnoMiner/.

Introduction

Transcriptional regulation is a highly complex process involving a combination of various molecular players and biochemical mechanisms, such as transcription factors (TFs), histone modifying enzymes, DNA methylases, as well as a structural reorganization of chromatin. Technical advances in analysing the interaction of proteins with DNA (ChIP-seq), in detecting open or closed chromatin states (e.g. ATAC-seq, DNase-seq, FAIRE-seq), in detecting hypermethylated CpG islands (bisulfite sequencing), and in mapping higher order chromosomal structural organisation (ChIA-PET, Hi-C, 3C-seq) have revolutionized and significantly advanced our understanding of transcriptional regulation during the last decade^1,2. Among other things, transcriptional enhancers were identified as crucial for regulating spatio-temporal gene expression programs by interacting with target gene promoters, often across large genomic distances (see^3,4,5 and references therein). Furthermore, the genome sequence in the chromatin is not a simple linear thread but organized in 3D, forming compartments, topologically associated domains (TADs) or chromatin loops that can bring distant elements into proximity, all of which can contribute to transcriptional regulation (see^3,6 and references therein). More recent evidence even suggests the presence of dual-action cis-regulatory modules (CRMs) that act as promoters, as well as distal enhancers^7,8. All these findings suggest that transcriptional regulation is far more complex than initially anticipated and involves a collective effort of specific binding sites in the genome, a complex genome structure and the presence of various transcriptional regulators. Hence, we need a tool that can ideally integrate all this information to better understand and predict transcriptional regulation.

Techniques such as ChIP-seq, ATAC-seq, Hi-C and others involve an NGS (Next Generation Sequencing) step, typically resulting in paired-end reads of isolated chromosomal fragments. Therefore, the first step after sequencing is read mapping, which is usually done using software such as Bowtie2⁹, followed by peak calling. There are a number of tools available for peak calling (reviewed e.g. in^10,11,12,13), which include MACS2¹⁴ for standard peak calling, and ChIPdiff¹⁵, EpiCenter¹⁶ or diffReps¹⁷ for differential peak calling. The output of a peak caller is the set of genomic coordinates of the epigenetic marks or transcription factor binding sites under study. These coordinates are commonly stored in the Browser Extensible Data (BED) file format, a light-weight, standardized format to share genome coordinates.
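To make the BED format concrete, the following minimal Python sketch reads the first columns of a BED file into peak records. It relies only on the public BED convention that the first three columns hold the chromosome, a 0-based start and an exclusive end. The `Peak` type, the `parse_bed` function and the example coordinates are hypothetical names and values chosen for illustration; they are not part of AnnoMiner.

```python
from typing import Iterable, List, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)
    name: str


def parse_bed(lines: Iterable[str]) -> List[Peak]:
    """Parse BED lines into Peak records, skipping header and comment lines."""
    peaks = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and UCSC-style header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Column 4 (name) is optional; fall back to a coordinate-based label.
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        peaks.append(Peak(chrom, start, end, name))
    return peaks


# Illustrative input only; these coordinates are made up.
example = "track name=demo\nchr2L\t1000\t1500\tpeak_1\t250\nchr2L\t7200\t7450\n"
peaks = parse_bed(example.splitlines())
```

Because the end coordinate is exclusive, the length of a peak is simply `end - start`.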

The next step to biologically interpret genomic coordinates (also called peaks) is their genomic annotation, a process referred to as peak annotation or gene assignment. A number of peak annotation tools exist. Some of them combine ChIP-seq data analysis (including peak calling) and peak annotation, such as the ChIP-Seq tools and web server (¹⁸, web-based), Sole-Search¹⁹, CIPHER²⁰, Nebula (²¹, web-based), PeakAnalyser²², BEDTools²³ or HOMER²⁴. Some of them are specific to peak annotation and visualization, such as ChIPseeker²⁵, UROPA²⁶, annoPeak (²⁷, web-based), ChIPseek (²⁸, web-based), PAVIS (²⁹, web-based), Goldmine³⁰, GREAT (³¹, web-based), or ChIPpeakAnno³². While most of the peak annotators assign peaks to the closest TSS (Transcription start site) or genome feature, others consider up- and down-stream gene features, or provide the overlap with gene feature attributes (such as promoter, 5′UTR, 3′UTR, exon, intron) of the nearest gene^{18,21,22,25,26,29}. The web-tool GREAT³¹ offers peak annotation and gene assignment in larger genomic regions based on Gene Ontology (GO)-term similarity. What appears missing is a web-based, visual tool that helps experimental biologists to explore and integrate peak data with transcriptomics data in a flexible, user-centred and easy way. AnnoMiner is designed to close this gap.

An enormous community effort has been made in the past decade to collect, standardize and present annotated genomic data in the form of the ENCODE^33,34,35, modENCODE³⁶ and modERN³⁷ resources, with the aim to make sense of the encyclopaedias of genomes. These initiatives have also made it possible to explore available genomic data further and integrate them with each other as well as with user-generated data. These standardized, high-quality data can be used to identify enriched transcription factor binding events in promoters of co-regulated genes, for example from a differential gene expression dataset (e.g. an RNA-seq dataset). This form of data integration predicts possible transcriptional regulators for biological processes under study, and is already widely used in the community. Enrichment analysis is commonly performed by testing for TF overrepresentation in the promoter regions of a user-provided gene list compared to a background list (e.g. considering the entire genome). Most available tools define the promoter region rigidly as a range of upstream and downstream base pairs from the annotated gene transcription start site (TSS) and these parameters are kept fixed for all the TFs tested, without accounting for the differences between individual transcription factors. Following this approach, diverse web-based tools have been developed, including VIPER³⁸, DoRothEA³⁹, BART⁴⁰, oPOSSUM⁴¹, TFEA.ChIP⁴², ChEA3⁴³, EnrichR⁴⁴ or i-cisTarget⁴⁵. However, it would be advantageous to have a web-based, flexible and user-friendly solution for TF enrichment, considering promoter boundaries specific to each TF and working for the most widely used species (human, mouse, Drosophila and C. elegans).

Here we present AnnoMiner, a flexible, web-based, and user-friendly platform for peak annotation and integration, as well as TF and HM (histone modification) enrichment analysis. AnnoMiner allows users to annotate and integrate multiple genomic regions files with gene feature attributes and with transcriptomic data in an interactive and flexible way. AnnoMiner contains three distinct functions with different genomic peak annotation purposes in mind: first, peak annotation, which can be used to assign attributes of gene features to peaks, including user-defined upstream and downstream regions, TSS, 5′ and 3′UTRs, as well as the coding region; second, peak integration to search for overlapping binding events of up to five transcriptional regulators (e.g. different TFs; different HMs; or combinations of both) with gene feature attributes; these two annotation functions can optionally integrate user-provided data, such as results from differential gene expression analysis, allowing users to inspect, for instance, expression data and genomic peak data from multiple transcriptional regulators together. And third, nearby genes annotation or long-range interactions, which helps to identify long-range interaction effects on gene expression of a single genomic region. The function for nearby genes annotation and long-range interactions requires the upload of data from differential expression analysis. As a fourth function, AnnoMiner performs TF (Transcription factor) and HM (Histone Modification) enrichment analysis, using all high-quality filtered TF and HM data from ENCODE, modENCODE and modERN, which are stored in an internal database, and a user-defined gene list as input. AnnoMiner’s TF enrichment function offers the DynamicRanges option, which uses promoter regions specific to each TF based on pre-calculated binding densities for each individual TF.
We tested the predictive power of AnnoMiner’s TF enrichment function and applied it to flight muscle development in Drosophila focusing on the process of myofibril morphogenesis⁴⁶. AnnoMiner predicted two potential transcriptional regulators, Trithorax-like (Trl) and the uncharacterized zinc-finger protein CG14655 to play a role during this process. For both, we provide experimental evidence for an essential function in myofibril morphogenesis in flight muscle. Thus, AnnoMiner correctly predicted a new role for Trl, as well as CG14655 as transcriptional regulators in a specific muscle type in Drosophila. Finally, we used AnnoMiner to predict direct transcriptional targets for the likewise enriched TFs Yorkie (Yki) and Scalloped (Sd) required for flight muscle growth⁴⁷.

Results

The AnnoMiner web server

The AnnoMiner web server is a new tool for the convenient annotation and integration of epigenetic and transcription factor occupancy data for the wet-lab researcher. Based on an underlying library of Java classes developed for the handling of annotated genomic peak or ranges datasets, the interactive graphical user interface of the web application provides functionalities for the upload of datasets, choice of analysis mode and model organism, and visualization and download of analysis results (see Fig. 1 for a schematic of AnnoMiner functions and Supplementary Figure S1 for implementation details).

Genomic peak annotation functions

AnnoMiner’s three annotation functions (peak annotation, peak integration and nearby genes annotation, Fig. 1) take as input one or more files containing genomic coordinates (in BED format), for example from ChIP-seq, with the aim of finding their associations with annotated gene features. This is done by determining the overlap between the genomic coordinates of a peak and the attributes of annotated gene features. The user can set the following parameters for the search: the minimum required overlap among the gene feature attributes and peak features (in bp or %); whether only the longest isoform of a gene or all its isoforms will be considered; the gene’s directionality; the organism and the genome resource. Optionally, a user-provided annotation file can be uploaded, containing for instance differential expression data which will be integrated with the gene lists generated from AnnoMiner annotation. The user-provided dataset is accepted in csv (comma separated values) or tsv (tab separated values) format. It can contain up to 6 columns, without any constraints in content, except the first column has to contain gene IDs to allow integration with AnnoMiner results.
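The overlap criterion described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not AnnoMiner's implementation: intervals are half-open `(start, end)` pairs on the same chromosome, and the percentage threshold is assumed to be measured relative to the peak length, which the text does not specify. The function names are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open (start, end) on one chromosome


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of base pairs shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def passes_min_overlap(peak: Interval, feature: Interval,
                       min_overlap: float, unit: str = "bp") -> bool:
    """Check a minimum-overlap threshold given in bp or in percent.

    Assumption: a percentage is taken relative to the peak length.
    """
    shared = overlap_bp(peak, feature)
    if shared == 0:
        return False
    if unit == "bp":
        return shared >= min_overlap
    if unit == "%":
        return 100.0 * shared / (peak[1] - peak[0]) >= min_overlap
    raise ValueError("unit must be 'bp' or '%'")


# A 100 bp peak sharing 50 bp with a gene feature attribute.
peak, feature = (100, 200), (150, 300)
shared = overlap_bp(peak, feature)
```

With these values the peak passes a 50 bp or 50% threshold but fails a 60% threshold.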

Peak annotation

The peak annotation function computes the total coverage of the user-provided genomic regions (representing the peaks) with the attributes of each annotated gene feature in the genome assembly. AnnoMiner considers already annotated attributes of gene features (5′UTR, CDS and 3′UTR) as well as attributes or gene features provided by the user; in particular the promoter region is fully customizable with respect to the upstream and downstream region of the annotated TSS, as is the 5′ flanking region upstream and 3′ flanking region downstream of the gene body (Fig. 2a). The first result shown by AnnoMiner is a coverage plot, visualizing the total base pair coverage of all peaks with the annotated attributes of the gene features (Fig. 3a). The user next chooses a target region (corresponding to a gene feature attribute) by clicking on one of the bars and on the ‘Show Genes!’ button (Fig. 3b). An interactive, sortable and downloadable table of all genes which overlap with the selected target region with peaks from the BED file is returned to the user (Fig. 3c). If the user provides a gene-based, custom annotation file, for instance containing differential expression data, these data will be integrated and displayed in the resulting table (Fig. 3d). While an annotation file will in most cases contain differential expression data based on RNA-seq analysis, it can contain any numerical or even text data. In summary, peak annotation in AnnoMiner annotates genomic regions provided by the user with gene features and their attributes in a flexible and user-centred way.
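The coverage computation behind the coverage plot can be illustrated as follows. This is a minimal sketch, assuming half-open coordinates, a 0-based TSS position, and peaks that do not overlap each other (so that summing overlaps does not double-count bases). The function names, the attribute names and all coordinates are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on one chromosome


def promoter_region(tss: int, strand: str, upstream: int, downstream: int) -> Interval:
    """Strand-aware promoter interval around a 0-based TSS position."""
    if strand == "+":
        return (max(0, tss - upstream), tss + downstream)
    if strand == "-":
        # On the minus strand, "upstream" points towards higher coordinates.
        return (max(0, tss - downstream + 1), tss + upstream + 1)
    raise ValueError("strand must be '+' or '-'")


def overlap_bp(a: Interval, b: Interval) -> int:
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def coverage_by_attribute(peaks: List[Interval],
                          attributes: Dict[str, List[Interval]]) -> Dict[str, int]:
    """Total base pairs of peak overlap per gene feature attribute."""
    return {
        name: sum(overlap_bp(p, r) for p in peaks for r in regions)
        for name, regions in attributes.items()
    }


# One hypothetical plus-strand gene with a user-defined promoter of -500/+100 bp.
attributes = {
    "promoter": [promoter_region(1000, "+", 500, 100)],
    "5'UTR": [(1000, 1200)],
    "CDS": [(1200, 2800)],
    "3' flank": [(2900, 3400)],
}
peaks = [(900, 1050), (3000, 3100)]
coverage = coverage_by_attribute(peaks, attributes)
```

In a real analysis, the attribute intervals would be collected over all genes of the chosen genome assembly, and the resulting totals correspond to the bars of the coverage plot.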

While we here demonstrate the usability of AnnoMiner’s peak annotation function with a TF ChIP-seq dataset, which usually comprises narrow peaks in the vicinity or within gene promoters, it can be also used for any type of genomic peak file. In Supplementary Figure S2 (together with Supplementary Table S1), we show peak annotation for activating, as well as repressing histone modifications during early Drosophila development.

Peak integration

The peak integration function performs the peak annotation analysis, but for up to five genomic regions files (representing peaks from independent TFs, TFs together with HMs or independent HMs), allowing the user to integrate peaks from different transcriptional regulators and identify gene features and their attributes that overlap with them (Fig. 2b). The same algorithm as in peak annotation is used for the annotation of each individual peak file. With this function, genes co-regulated by TFs can be identified, or TF datasets can be integrated with chromatin structure data defined by histone modifications, or other epigenetic information derived by other experimental techniques. The coverage plots of all chosen genomic region files are returned by AnnoMiner (Fig. 4a), and the user chooses a target region for all and clicks on the ‘Show Genes!’ button (Fig. 4b). An interactive, sortable and downloadable table is returned with all genes that have a peak of the transcriptional regulators or epigenetic marks in their selected target regions (Fig. 4c). Custom gene annotation, such as differential expression data can again be provided, which is then integrated and displayed in the results table.
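Conceptually, the final step of peak integration is an intersection of gene sets, one per genomic regions file, followed by a join with the optional annotation table. The Python sketch below illustrates this idea only; the function names, dataset labels, gene identifiers and log2 fold changes are hypothetical placeholders and do not come from the study.

```python
from typing import Dict, List, Optional, Set, Tuple


def integrate_gene_sets(gene_sets: Dict[str, Set[str]]) -> Set[str]:
    """Genes that have a peak in the selected target region of every dataset."""
    sets = list(gene_sets.values())
    return set.intersection(*sets) if sets else set()


def attach_annotation(genes: Set[str],
                      annotation: Dict[str, float]) -> List[Tuple[str, Optional[float]]]:
    """Join the integrated genes with user-provided values (None if absent)."""
    return [(gene, annotation.get(gene)) for gene in sorted(genes)]


# Hypothetical per-dataset results of the peak annotation step.
hits = {
    "TF_A": {"gene1", "gene2", "gene3"},
    "H3K4me3": {"gene2", "gene3", "gene4"},
}
log2fc = {"gene2": 1.8, "gene4": -0.7}  # made-up differential expression values

shared = integrate_gene_sets(hits)
table = attach_annotation(shared, log2fc)
```

Genes missing from the annotation file are kept in the table with an empty value, so the peak-based result is not silently filtered by the expression data.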

To show that AnnoMiner’s peak integration function can be used to integrate three different datasets in one analysis step, we re-analysed a data series published on STAT3 function in different forms of diffuse large B-cell lymphomas (DLBCL, GEO super-series GSE50724⁴⁸). Two subtypes of DLBCL are known, germinal centre B-cell-like (GCB) and activated B-cell-like (ABC). The ABC type responds only poorly to available therapies and is often associated with an overexpression of STAT3⁴⁸. The authors had compared STAT3 binding by ChIP-seq analysis between 8 patient-derived cell lines from GCB- and ABC type. They had performed RNA-seq analysis of the same cell lines to retrieve differentially expressed genes between the two subtypes. We made use of these data to identify genes with increased expression levels in the ABC type together with increased STAT3 binding events. We only considered STAT3 peaks that were significantly upregulated (FDR 0.05, fold change 1.25) in ABC-type DLBCL. To demonstrate the added value of AnnoMiner’s peak integration function, we used H3K4me3 data from ENCODE from one of the ABC cell lines, OCI-Ly3 (accession: ENCFF763KFL), to limit the search to active promoters. Both peak files showed highest coverage with the direct promoter region of associated genes (Fig. 4a,b). We selected the major peaks for further analysis and integrated resulting peaks with differentially expressed genes from the same study. Of the upregulated unique genes, 42 contained an upregulated STAT3 peak in their promoter, as well as a H3K4me3 histone modification (Supplementary Figure S3a, Supplementary Table S2). We submitted the list of 42 genes to the EnrichR web-server⁴⁴ and could identify terms strongly related to cancer, diffuse large B-cell lymphomas of the ABC type, IL10/STAT3 signalling and other relevant terms for the disease under study (Supplementary Figure S3b,c; Supplementary Table S2).
Moreover, AnnoMiner identified 28 additional direct targets of STAT3 compared to the ones already described in the original study (Supplementary Table S2). To summarize, AnnoMiner’s peak integration function was able to identify genes directly associated with ABC-type diffuse large B-cell lymphomas and potential direct targets for the transcription factor STAT3 by using a single user interface and a single AnnoMiner analysis step.

To demonstrate the usability of AnnoMiner’s peak integration function further, we used four different datasets, as well as different types of genomic data. The results are displayed in Supplemental Material: first, (Supplementary Figure S4, Supplementary Table S1), we integrated H3K4me3 peak data during four stages of Drosophila MZT (maternal-to-zygotic transition) and identified its associated genes throughout MZT, which are thus constitutively transcribed from early to late MZT⁴⁹. Second, we integrated ATAC-seq data with ChIP-seq data of an early embryonic TF and genome modifier, GAF/Trl, to identify genes activated by Trl in later stages of MZT⁵⁰ (Supplementary Figure S5, Supplementary Table S3).

Nearby genes annotation and long-range interactions

The nearby genes annotation function allows the user to visualize the differential regulation and retrieve the overlapping, as well as the closest 5 gene features up- and downstream of an individual peak (Fig. 2c). This function is most useful for exploring gene regulation in the vicinity of a genomic mutation or deletion in a non-coding region. In addition to a BED file with the region of interest containing the mutation or genomic deletion, the user uploads an annotation file containing significantly differentially expressed genes. The resulting interactive AnnoMiner plot depicts the genomic neighbourhood of the peak, with the differential regulation of the overlapping and the five closest genes up- and downstream (Fig. 5a). The user can choose to visualize only the deregulation, or can discriminate between up- and down-regulation of the genes. In the latter case, the colour of the box reflects the direction of differential expression (blue for up-, red for downregulated, green if an equal number of genes are up- and downregulated, or grey if unchanged; Fig. 5a). By selecting one or more boxes (Fig. 5b), the selected genes, as well as their log2FC and FDR from the user’s annotation file, will be returned in a table (Fig. 5c).

Alternative to the nearby genes annotation, users can also utilize the long-range interactions function of AnnoMiner to explore the neighbourhood of a peak (Fig. 2d). In this case, a range of base-pairs has to be chosen up- and downstream of the peak, which is then decorated with information on differential expression from the user-provided annotation file (Fig. 5d). After selecting the up- or downstream region (Fig. 5e), the user can retrieve the genes within that region of the peak together with their differential expression data (Fig. 5f). The long-range interactions function will have its best use whenever Hi-C data are available, as these types of data will give insight about the genomic boundaries and thus help to choose the genomic neighbourhood of a peak.
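The two ways of exploring the neighbourhood of a peak, by number of genes or by a base-pair range, can be contrasted in a short Python sketch. It assumes half-open coordinates on a single chromosome and uses "upstream" in the genomic sense of lower coordinates, ignoring gene strand. The function names and the toy gene models are hypothetical.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int]  # (name, start, end), half-open, one chromosome
Interval = Tuple[int, int]


def nearby_genes(peak: Interval, genes: List[Gene], k: int = 5):
    """Overlapping genes plus the k closest genes on either side of the peak."""
    p_start, p_end = peak
    overlapping = [g for g in genes if g[1] < p_end and g[2] > p_start]
    upstream = sorted((g for g in genes if g[2] <= p_start),
                      key=lambda g: p_start - g[2])[:k]
    downstream = sorted((g for g in genes if g[1] >= p_end),
                        key=lambda g: g[1] - p_end)[:k]
    return overlapping, upstream, downstream


def genes_in_window(peak: Interval, genes: List[Gene],
                    upstream_bp: int, downstream_bp: int) -> List[Gene]:
    """Genes overlapping a user-chosen base-pair window around the peak."""
    lo, hi = max(0, peak[0] - upstream_bp), peak[1] + downstream_bp
    return [g for g in genes if g[1] < hi and g[2] > lo]


# Toy gene models; coordinates are made up.
genes = [("g1", 0, 100), ("g2", 200, 300), ("g3", 450, 600),
         ("g4", 700, 800), ("g5", 1000, 1100)]
peak = (400, 500)
overlapping, up, down = nearby_genes(peak, genes, k=1)
window = genes_in_window(peak, genes, upstream_bp=150, downstream_bp=250)
```

The gene-count variant always returns the same number of neighbours regardless of gene density, whereas the window variant is the natural choice when domain boundaries, for example from Hi-C data, define the range to inspect.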

As a proof of concept for the nearby genes annotation function, we used a study that had shown the requirement of long-range enhancers regulating Myc expression for normal facial morphogenesis⁵¹ (GEO dataset GSE52974). In humans, cleft lip or cleft palate (CL/P) is a frequent congenital malformation. This malformation has been associated with risk factors located at a 640 kb noncoding region on chromosome 8. The corresponding region in mouse was studied by Uslu and colleagues and refined to a more specific enhancer region, the medionasal enhancer (MNE⁵¹). Deletions within the MNE in mouse led to smaller snouts and abnormalities of nasal and frontal bones amongst other defects. Myc was the only gene observed to be differentially expressed in the vicinity of this deletion. We used the CL/P deletion 8–17 (chr15:62668548–63550550) from Uslu et al. to create a single-peak BED file and uploaded it together with the significantly differentially expressed genes from the re-processed RNA-sequencing data from the same strain compared against control (GEO dataset GSE52974) to test the AnnoMiner nearby genes annotation function (Supplementary Table S4). Indeed, only a single gene is significantly differentially expressed (pink box, Fig. 5a), which is Myc (Fig. 5c). In principle, this function can also be used to explore the expression dynamics within the gene neighbourhood of multiple peaks (see Supplementary Figure S6). However, large-scale long-range gene regulatory data of this type are very sparse and their interpretation remains too complex to be exhaustively analysed with a tool like AnnoMiner. In summary, AnnoMiner’s nearby genes annotation or long-range interaction functions help in an easy and quick way to identify deregulated genes in the neighbourhood of one genomic position.

Transcription factor & histone modification enrichment analysis

Transcription factor binding sites inferred from experimentally detected TF peaks in the genome can be used to predict TFs, which potentially co-regulate gene-sets. AnnoMiner’s TF & HM enrichment analysis function identifies enriched peaks in the promoter regions of a user-provided gene list, for instance co-regulated genes from a transcriptomic analysis. Any valid identifier is accepted, as AnnoMiner performs gene ID conversion on-the-fly using BioMart⁵². AnnoMiner considers a gene as a potential target, if its promoter overlaps with a TF peak. The user can either choose the promoter region (up- and downstream number of base-pairs from the TSS) or use the DynamicRanges calculated by AnnoMiner, which is based on the distribution of a TF binding event relative to the TSS and therefore specific for each TF (see “Methods”, not available for histone modifications). The results of the enrichment analysis are visualized as an interactive bar plot (for the first 10 hits, Fig. 6a), as well as an interactive table. In the table, all available TF ChIP-seq datasets in the AnnoMiner database for the species of interest are ranked according to their Combined Score. Along with this value, AnnoMiner also reports information about the experimental condition, cell line or developmental stage, contingency table values, p-value, enrichment score, FDR and the list of potential targets of the TF (Fig. 6b) in the downloadable version of the table.
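As an illustration of how a contingency table can be turned into a p-value and an FDR, the sketch below uses a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test) and the Benjamini-Hochberg correction. These are standard choices assumed here for illustration; the statistical test and the Combined Score actually used by AnnoMiner are defined in the "Methods" and are not reproduced by this code. All function names and numbers are hypothetical.

```python
from math import comb
from typing import List


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    k: genes in the user list whose promoter overlaps a peak of the TF
    n: size of the user list
    K: genes in the background whose promoter overlaps a peak of the TF
    N: size of the background
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy numbers: 2 of 2 list genes are bound, 3 of 10 background genes are bound.
p = hypergeom_pvalue(k=2, n=2, K=3, N=10)
fdr = benjamini_hochberg([0.01, 0.04, 0.03])
```

One p-value is computed per ChIP-seq dataset in the database, and the multiple-testing correction is then applied across all datasets of the chosen species.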

As a proof of principle for predicting transcriptional regulators we selected a differential expression dataset from daf-16/FoxO mutants in C. elegans⁵³. When uploading the list of all DAF-16A/F targets provided in⁵³ to AnnoMiner’s TF & HM enrichment analysis function, daf-16 was the 4th most significantly enriched transcription factor (Fig. 6a). Interestingly, pqm-1, which has been shown to bind to daf-16 response elements⁵⁴, was found at 1st and 2nd position by AnnoMiner. Elt-2, a GATA-like transcription factor, which appeared as 3rd most significant hit, has been shown to bind to promoters of some daf-16-regulated genes and to be required for their regulation⁵⁵. To conclude, AnnoMiner’s TF & HM enrichment function is a powerful tool for predicting relevant transcription factors co-regulating sets of genes with similar expression patterns.

Performance evaluation of AnnoMiner’s TF & HM enrichment analysis function

We wanted to compare the performance of AnnoMiner’s TF & HM enrichment analysis function with other web-tools for TF enrichment analysis. We followed in principle the evaluation protocol proposed by Keenan et al., which used PR-AUCs and ROC-AUCs calculated with the PRROC R-package for estimating performance⁴³ (for details see also “Methods”). In brief, we took manually curated datasets provided by⁴³ containing single TF perturbation experiments followed by RNA-seq from Gene Expression Omnibus⁵⁶ (GEO). Gene expression data used for benchmarking were restricted to experiments targeting TFs for which AnnoMiner is storing at least one high quality TF ChIP-seq dataset for the human assembly GRCh38, resulting in a total of 75 datasets that we could use for benchmarking. We submitted the list of significantly differentially expressed genes between perturbed TF versus wild-type control (which we hereafter refer to as signature gene-sets) from these experiments to perform enrichment analysis using different tools, including AnnoMiner. We used the rank of the perturbed TF in the resulting enrichments of its associated signature gene-set to calculate PR-AUCs and ROC-AUCs. We furthermore calculated the cumulative distribution function for the ranks of each TF across all the experiments it was perturbed in. The distribution function will be uniform only if a TF ranks randomly; we performed Anderson–Darling tests to detect deviation from uniformity. We then computed the percentage of perturbed TFs that were correctly ranked within the first percentile to ensure that TFs were ranking high.
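Two of the rank-based quantities described above can be sketched in Python: the empirical cumulative distribution of scaled ranks and the percentage of perturbed TFs recovered within the first percentile. This is a simplified illustration, assuming that each rank is scaled by the number of ranked entries in the respective result list; it does not reproduce the PR-AUC, ROC-AUC or Anderson–Darling computations. The function names and rank values are hypothetical.

```python
from typing import List


def scaled_ranks(ranks: List[int], library_sizes: List[int]) -> List[float]:
    """Rank of the perturbed TF divided by the number of ranked entries."""
    return [rank / size for rank, size in zip(ranks, library_sizes)]


def percent_in_top(scaled: List[float], cutoff: float = 0.01) -> float:
    """Percentage of experiments in which the perturbed TF ranks within the cutoff."""
    return 100.0 * sum(s <= cutoff for s in scaled) / len(scaled)


def ecdf(scaled: List[float], x: float) -> float:
    """Empirical cumulative distribution function of the scaled ranks at x."""
    return sum(s <= x for s in scaled) / len(scaled)


# Made-up ranks of the perturbed TF in four experiments, 500 ranked entries each.
scaled = scaled_ranks([1, 4, 50, 200], [500, 500, 500, 500])
top1 = percent_in_top(scaled)       # share of experiments within the top 1%
at_5_percent = ecdf(scaled, 0.05)   # would be about 0.05 under random ranking
```

Under random ranking the empirical distribution would follow the diagonal, so a curve that rises well above it at small scaled ranks indicates that the perturbed TF is recovered near the top.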

We chose the following web-tools for comparison: ChEA3, TFEA.ChIP and EnrichR (Table 1). ChEA3 offers the user 2 different methods to rank predicted TFs (meanRank and topRank) and we evaluated both ranking methods. EnrichR offers different resources for enrichment analysis and we used the resources ARCHS4, ChEA 2016, ENCODE 2015, ENCODE and ChEA Consensus and TRRUST 2019 for enrichment analysis, respectively. The number of datasets included in the benchmarking set depended on the resource tested (see “Methods”). The Anderson–Darling test returned significant results for all web-tools tested (ADtest in Table 1), except for EnrichR in combination with the ENCODE_and_ChEA_Consensus resource. This highlights the ability of all tools to rank the perturbed TF among the top candidates of the results. AnnoMiner outperformed TFEA.ChIP in all categories including percentage recovered TFs in the 1st percentile (7.0 vs 0.0), ROC AUC (0.69 vs 0.63) and PR AUC (0.68 vs 0.60). EnrichR differed in performance depending on the resource used. On average, it outperformed AnnoMiner on the percent recovered TFs (9.5 vs 7.0), while AnnoMiner reached slightly higher values in ROC AUCs (0.69 vs 0.66) and PR AUCs (0.69 vs 0.68). ChEA3 performed similarly for both ranking methods used and outperformed AnnoMiner in all categories. To summarize, though it cannot reach the performance of ChEA3, AnnoMiner outperforms the other evaluated tools in identifying relevant TFs at a high rank in the gene-sets derived from TF perturbation studies. Unlike ChEA3, however, AnnoMiner is available for all four major model organisms, including the invertebrates Drosophila and C. elegans.

Table 1 Performance values of AnnoMiner compared to other TF enrichment web-tools.

Using AnnoMiner to identify important transcriptional regulators of Drosophila flight muscle morphogenesis and growth

To assess the performance of AnnoMiner further, we tested its TF & HM enrichment analysis function to predict unknown transcriptional regulators for a list of co-regulated genes. We chose a dataset quantifying gene expression dynamics during development of Drosophila melanogaster indirect flight muscles⁴⁶, in which mRNA from indirect flight muscles had been isolated at several time-points correlating with key steps during muscle development (Fig. 7a). We focused on the development of the contractile apparatus called myofibrillogenesis and compared gene expression at 30 h after puparium formation (APF), when myofibrils assemble, with 72 h APF, when myofibrils have matured (Fig. 7a). This comparison revealed 2193 differentially expressed genes that were shown in the original study to be strongly enriched for genes relevant for myofibril and mitochondrial development. We submitted this differential gene list to AnnoMiner (Supplementary Table S5) and searched all the 514 modERN and modENCODE TF ChIP-seq datasets from Drosophila stored in AnnoMiner for a potential enrichment of peaks. This identified 42 unique TFs significantly enriched with an FDR < 0.05 (Supplementary Table S5). The top 10 enriched datasets included Deaf1, Trl, Hr78 and cwo (Fig. 7b). Hence, these are potential transcriptional regulators of flight muscle development.

To identify which of the 42 transcriptional regulators may have a function during flight muscle development we next integrated data from an RNAi-screen for muscle function⁵⁷, which had assayed for viability, flight muscle performance and body locomotion after muscle specific knock-down of individual TFs. From the 42 TFs identified by AnnoMiner, knock-down of two TFs resulted in flightless animals (CG14655, cwo), and seven were scored as lethal during development (Trl, Hr78, lola, Vsx2, Pif1B, salr, Hr51) (Supplementary Table S5); 21 TFs did not show a phenotype in this assay and the remaining 13 had not been tested (Fig. 7c).

Trithorax-like and the uncharacterized Zinc-finger protein CG14655 are required for flight muscle morphogenesis

For experimental verification we selected two proteins, Trl (Trithorax-like) and an uncharacterized zinc-finger protein called CG14655. We used muscle-specific knock-down to investigate a putative function of both genes in muscle. For Trl knock-down we used 4 independent transgenic RNAi lines driven with muscle-specific Mef2-GAL4. Two of those resulted in pupal lethality and the other two resulted in viable but flightless flies, demonstrating a function of Trl in flight muscle (Supplementary Table S6). For morphological analysis we visualized the myofibrils of flight and leg muscles of mature 90 h APF pupae in wild type and three different Trl knock-down lines. We found that knock-down of Trl caused disordered and frayed myofibrils in flight muscles, whereas leg muscle myofibrils appeared normal (Fig. 7d, Supplementary Figure S7). This shows that Trl is required for normal myofibril development in flight muscle.

To investigate a role of CG14655 during muscle development we also applied muscle-specific knock-down with Mef2-GAL4 and three different RNAi lines, one of which resulted in viable but flightless animals and two other overlapping hairpins resulted in pupal lethality (Supplementary Table S6). Morphological analysis showed that CG14655 knock-down flight muscles displayed abnormal actin accumulations between their myofibrils, suggesting a role for CG14655 in myofibril development of flight muscle. Together, these findings demonstrate the predictive power of AnnoMiner to identify transcriptional regulators by combining chromatin binding and differential expression data.

AnnoMiner helps identify targets co-regulated by Sd and Yki during flight muscle growth in Drosophila

Strikingly, two of the enriched transcriptional regulators in the above comparative flight muscle development dataset were the transcriptional effector of the Hippo pathway in Drosophila called Yorkie (Yki) and its essential Tead co-factor Scalloped (Sd)⁵⁸ (Supplementary Table S5). Recently, an essential function was identified for the Hippo pathway in promoting flight muscle growth by transcriptional up-regulation of mRNAs coding for sarcomeric proteins, which build the myofibrils⁴⁷. We wanted to know whether we could identify direct transcriptional targets of Yki/Sd during flight muscle growth. To this end, we used AnnoMiner’s peak integration function to integrate mRNA BRB-seq data from developing yki knock-down flight muscle (yki-IR), as well as from flight muscle expressing a constitutively active form of yki (yki-CA), each compared to wild type controls (GEO accession GSE158957), with ChIP-seq data from Yki (modENCODE dataset ENCSR422OTX) and Sd (modERN dataset ENCSR591PRH) obtained in fly embryos.

Both proteins showed prominent base pair coverage of the TSS regions of their target genes (Fig. 8a). We selected these peaks to retrieve associated genes and then integrated the BRB-seq data for 24 h APF yki knock-down (yki-IR 24 h), as well as for 24 h and 32 h APF constitutively active yki (yki-CA 24 h, yki-CA 32 h; Supplementary Table S7). Upon knock-down of yki, a severe myofibril assembly defect had been observed already at 32 h APF⁴⁷. Interestingly, AnnoMiner identified Yki and Sd binding sites in the TSS of two genes essential for muscle function and development, which were downregulated in yki knock-down muscles at 24 h APF. These genes code for the sarcomeric protein Tropomyosin 1 (Tm1) and the Nesprin-family protein Muscle-specific protein 300 kDa (Msp300), which is important to link the myofibrils to the nuclei⁵⁹ (Fig. 8b). The gain-of-function yki phenotype (yki-CA) is characterized by premature expression of sarcomeric proteins resulting in muscle fiber hyper-compaction⁴⁷. At 24 h APF, scalloped (sd) itself is the only direct target gene of the Yki/Sd complex that is differentially expressed in yki-CA (Fig. 8c). At 32 h APF, AnnoMiner identified 177 unique genes and in total 541 transcripts as potential direct Yki/Sd targets. Using GO term enrichment analysis by modEnrichR⁶⁰ we found many processes and cellular compartments related to muscle development and function among the top enriched terms in these potential direct Yki/Sd targets (Fig. 8d, Supplementary Table S7). To conclude, using AnnoMiner’s peak integration function, we identified putative direct targets of the Yki/Sd transcriptional complex that showed differential expression upon yki knock-down or yki constitutive activation. Many of these genes are likely important for flight muscle morphogenesis.
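
Conceptually, this integration step intersects the genes bound by both regulators with the genes that change expression. The Python sketch below illustrates that idea under stated assumptions: the function name `integrate_targets`, the input layout (gene sets per TF, and a dictionary of log2 fold change and adjusted p-value per gene), the 0.05 cutoff and the placeholder gene identifiers are all hypothetical and do not reflect AnnoMiner's code or the data of this study.

```python
def integrate_targets(tf1_genes, tf2_genes, de_table, padj_cutoff=0.05):
    """Return genes bound by both TFs that are differentially expressed.

    tf1_genes, tf2_genes: genes with a peak in the selected region (e.g. TSS).
    de_table: gene -> (log2 fold change, adjusted p-value); assumed layout.
    """
    shared = set(tf1_genes) & set(tf2_genes)
    hits = {}
    for gene in sorted(shared):
        stats = de_table.get(gene)
        if stats is None:
            continue  # gene was not measured in the expression data
        log2fc, padj = stats
        if padj < padj_cutoff:
            hits[gene] = "down" if log2fc < 0 else "up"
    return hits


# Toy input with placeholder identifiers (not real measurements).
tf1_bound = {"geneA", "geneB", "geneC"}
tf2_bound = {"geneB", "geneC", "geneD"}
expression = {
    "geneB": (-1.8, 0.001),
    "geneC": (0.2, 0.60),
    "geneD": (2.1, 0.0001),
}
candidates = integrate_targets(tf1_bound, tf2_bound, expression)
```

In this toy case only `geneB` is both co-bound and significantly changed; `geneD` changes strongly but is bound by only one of the two factors and is therefore excluded.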

Discussion

Here, we introduced AnnoMiner, a web-based, flexible and user-friendly platform for genomic peak annotation and integration, as well as transcription factor enrichment analysis. We illustrated AnnoMiner’s peak annotation and integration, as well as the nearby genes annotation functions with specific examples. We confirmed the predictive power of AnnoMiner’s TF enrichment function experimentally by identifying important regulators of Drosophila indirect flight muscle development. This was achieved by using AnnoMiner to search for overrepresented TF peaks in promoters of genes differentially regulated during myofibrillogenesis.

AnnoMiner distinguishes itself from other peak annotation and TF enrichment tools. AnnoMiner’s peak annotation and peak integration functions first output a bar plot that shows the overlap of a peak with different gene feature attributes, including up- and downstream regions, the TSS, 5′ and 3′ UTRs and the gene body. This has two advantages: first, the user can visualize the distribution of peaks of the uploaded file with respect to all relevant attributes of annotated gene features in the genome. Second, AnnoMiner allows the user to interactively choose the target region(s) for which the associated genes are returned. While other tools provide statistics on the peak distribution relative to gene feature attributes (e.g.²⁸) in the output, to our knowledge, AnnoMiner is the only software that allows users to easily retrieve specific gene sets depending on the distribution of the peak coverage over gene feature attributes. The peak integration function offers the same flexibility. Moreover, both functions allow direct integration of differential gene expression or other numerical data associated with genes with the genomic peak files. The nearby genes annotation and long-range interaction functions, which integrate expression data with peak data, are novel: for the first time, users can visualize and retrieve, in a web-based manner, genes that are not the nearest neighbours of a peak. These functions could be useful to integrate genomic, non-coding variants causative for human genetic diseases with disease-associated gene expression data and to explore transcriptional activity within TADs. AnnoMiner’s TF enrichment analysis function offers to treat promoter regions dynamically for each specific TF with its DynamicRanges function. Finally, AnnoMiner is independent of the genomic assembly of the source data, as it translates submitted IDs on the fly and uses the ID compatible with the database chosen for gene-centred peak annotation, as well as TF enrichment.
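
To make the notion of peak coverage over gene feature attributes concrete, the following Python sketch computes the base pair overlap of one peak with a few regions around one gene. It is a simplified illustration only: the region sizes (`flank`, `tss_window`), the function names and the restriction to four regions (UTRs are omitted) are assumptions and not AnnoMiner's definitions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Base pairs shared by half-open intervals [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def feature_regions(gene_start, gene_end, strand, flank=2000, tss_window=500):
    """Hypothetical strand-aware regions around one gene (half-open coordinates)."""
    if strand == "+":
        tss = gene_start
        return {
            "upstream": (tss - tss_window - flank, tss - tss_window),
            "TSS": (tss - tss_window, tss + tss_window),
            "gene_body": (tss + tss_window, gene_end),
            "downstream": (gene_end, gene_end + flank),
        }
    # Minus strand: the TSS is at the higher coordinate.
    tss = gene_end
    return {
        "upstream": (tss + tss_window, tss + tss_window + flank),
        "TSS": (tss - tss_window, tss + tss_window),
        "gene_body": (gene_start, tss - tss_window),
        "downstream": (gene_start - flank, gene_start),
    }


def peak_coverage(peak, gene):
    """Return region name -> overlapping base pairs for one peak and one gene.

    peak: (start, end); gene: (start, end, strand).
    """
    regions = feature_regions(*gene)
    return {name: overlap(peak[0], peak[1], s, e) for name, (s, e) in regions.items()}
```

Summing such per-region overlaps over all peaks and genes yields the kind of distribution that a bar plot can display, and selecting one region (for example `"TSS"`) with a minimum overlap gives the associated gene set.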

We compared AnnoMiner’s TF enrichment analysis function to the best-performing software in the field, which includes ChEA3, TFEA.ChIP and EnrichR. AnnoMiner could not reach the accuracy of ChEA3. One of the reasons could be that the dataset of TF-gene associations used by ChEA3 is more comprehensive than the data we retrieve from ENCODE, modENCODE and modERN, as it includes several additional datasets which are at least partially manually curated or generated. This hypothesis is supported by the fact that EnrichR performs differently depending on the source data, with lower performance when using the ENCODE data alone than when using the data from ChEA3. One possible solution could be to add curated data to AnnoMiner’s TF enrichment analysis function, for instance from ChIP-Atlas⁶¹ or ReMap⁶². The disadvantage, however, is the higher cost in curation, as well as the fact that manually curated datasets are typically not available for model organisms such as Drosophila or C. elegans, but are rather restricted to human or mouse, as is the ChEA3 tool. Yet, when running AnnoMiner locally, users can easily add their own data using the scripts provided in our GitLab repository to fill the MongoDB database (https://gitlab.com/habermann_lab/AnnoMiner/-/tree/master/scripts).

Finally, we predicted and verified potential transcriptional regulators of muscle and myofibril morphogenesis as well as muscle growth using AnnoMiner’s TF & HM enrichment analysis function. Amongst those is Trl, a GAGA transcription factor which contains a BTB/POZ domain, as well as a C2H2 zinc-finger that binds to DNA in a sequence-specific manner. Previous studies suggest that Trl is required to keep promoters nucleosome-free, thus allowing Pol-II-access⁶³. We showed here that muscle-specific knock-down of Trl using four independent hairpins leads to either pupal lethality or flightlessness. Consistently, we find severely perturbed indirect flight muscles upon Trl knock-down, whereas leg muscles appear largely normal. This indicates a preferential function of Trl in flight muscle; however, as two hairpins result in pupal lethality, a role of Trl in other body muscles is also likely.

A second potential direct transcriptional regulator identified by AnnoMiner is the uncharacterized Zinc-finger transcription factor CG14655. Muscle-specific knock-down of CG14655 either results in pupal lethality or flightlessness and causes abnormal accumulations of actin in flight muscles. This again suggests that CG14655 is important for normal myofibril development in flight muscle.

Lastly, we made use of two other transcriptional regulators, Yorkie and Scalloped, which act in a complex on DNA⁵⁸, to identify their direct targets. AnnoMiner identified two direct targets of Yki, which upon loss of yki were downregulated. Both code for important muscle structure proteins, constituents of the sarcomere or linking the sarcomere to the nucleus, and hence could contribute to the severe phenotype observed upon yki knock-down. Gain-of-function of yorkie results in muscle fiber hyper-compaction and premature expression of sarcomeric protein components⁴⁷. Consistently, AnnoMiner identified a number of direct Yki targets with functions related to muscle development and growth. This substantiates a role for Scalloped and its transcriptional co-factor Yorkie during flight muscle growth.

To conclude, the new web-tool AnnoMiner is a user-friendly, intuitive, interactive and highly flexible platform for genomic peak annotation and peak integration. It is suitable for identifying nearby genes or long-range interactions of a genomic peak, as well as for performing Transcription Factor and Histone Modification enrichment analysis for a list of genes. This manuscript details all AnnoMiner functions and shows its usefulness for annotating and integrating peaks from two different ChIP-seq experiments together with transcriptomics data. Finally, AnnoMiner helped identify several key regulators of indirect flight muscle development and growth in Drosophila, some of which were confirmed experimentally.

Methods

The AnnoMiner software and database

AnnoMiner is a modular software consisting of a library of Java classes for retrieval, storage and analysis of annotated genomic peak data, together with a web application providing a graphical user interface for a number of predefined analyses.

The AnnoMiner web-server is a JavaEE web application implemented in the front-end as a single page application using JavaScript, jQuery and Bootstrap 4. Queries for the provided analyses are processed by individual servlets that send a response back to the front-end in JSON format (Supplementary Figure S1).

The Java library used in the back-end manages the retrieval of data from remote sources at UCSC and modENCODE³⁶ or alternatively from files in BED or GFF format. Retrieved datasets are stored in a MongoDB database. This document-oriented NoSQL database system was chosen for its flexible data model and for its runtime speed benefits when benchmarked against an SQL database solution. For optimal performance, a custom database connectivity layer has been developed based on the MongoDB Java driver. The overall structure of the application is shown in Fig. 1 and Supplementary Figure S1.
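
As an illustration of how BED records map naturally onto a document-oriented data model, the following Python sketch converts BED lines into dictionaries. It is not the AnnoMiner back-end, which is written in Java: the function names and the document field names (`chrom`, `start`, `end`, `name`, `extra`) are assumptions for illustration, and no database connection is made.

```python
def parse_bed_line(line):
    """Convert one tab-separated BED line into a document-like dict.

    The first three BED columns (chromosome, start, end) are mandatory;
    the field names of the returned dict are hypothetical.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    doc = {"chrom": fields[0], "start": int(fields[1]), "end": int(fields[2])}
    if len(fields) > 3:
        doc["name"] = fields[3]
    if len(fields) > 4:
        doc["extra"] = fields[4:]  # kept as raw strings in this sketch
    return doc


def read_bed(lines):
    """Parse an iterable of BED lines, skipping blank, comment and header lines."""
    docs = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        docs.append(parse_bed_line(line))
    return docs
```

A list of such dictionaries could then be handed to a document store in one call, for example with `insert_many` from the pymongo driver, without first having to define a fixed table schema.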

AnnoMiner’s database currently holds genomic data from ENCODE³³, modENCODE³⁶ and modERN³⁷. For each model organism, the latest genome assembly is stored. For human, mouse and Drosophila, we also provide the second latest release.

Following the Findability, Accessibility, Interoperability and Reusability (FAIR) principles⁶⁴, documentation and source code of the tool are available on GitLab: https://gitlab.com/habermann_lab/annominer. Using the Java library classes from the code repository, developers will be able to define custom analyses on annotated genomic ranges datasets and extend the database towards new data sources. The repository also provides executable Java classes and Python scripts for local maintenance, such as populating a local database with ENCODE, modENCODE or modERN data or reproducing our benchmark analysis.