The Signaling Pathways Project: an integrated ‘omics knowledgebase for mammalian cellular signaling pathways

Public transcriptomic and ChIP-Seq datasets have the potential to illuminate facets of transcriptional regulation by mammalian cellular signaling pathways not yet explored in the research literature. Unfortunately, a variety of obstacles prevent routine re-use of these datasets by bench biologists for hypothesis generation and data validation. Here, we designed a web knowledgebase, the Signaling Pathways Project (SPP), which incorporates stable community classifications of three major categories of cellular signaling pathway node (receptors, enzymes and transcription factors) and the bioactive small molecules (BSMs) known to modulate their functions. We then subjected over 10,000 publicly archived transcriptomic or ChIP-Seq experiments to a biocuration pipeline that mapped them to their relevant signaling pathway node, BSM or biosample (tissue or cell line of study). To provide for prediction of pathway node-target transcriptional regulatory relationships, we generated consensus ‘omics signatures, or consensomes, based on the significant differential expression or promoter occupancy of genomic targets across all underlying transcriptomic (expression array and RNA-Seq) or ChIP-Seq experiments. To expose the SPP knowledgebase to biology researchers, we designed a web browser interface that accommodates a variety of routine data mining strategies depending upon the requirements of the end user. Individual dataset pages provide for browsing or filtering, and facilitate integration of SPP with the research literature. Results of single gene, Gene Ontology or user-uploaded gene list queries are displayed in an interactive user interface referred to as the Regulation Report, in which evidence for transcriptional regulation of downstream genomic targets by cellular signaling pathway nodes is compartmentalized in an intuitive manner. 
Consensome queries allow users to evaluate evidence for targets most consistently regulated by a given signaling pathway node family, and allow for detailed inspection of the pharmacology underlying node-target regulatory relationships predicted by the consensomes. Consensomes were validated using alignment with literature-based knowledge, gene target-level integration of transcriptomic and ChIP-Seq data points, and bench experiments that confirmed previously uncharacterized node-gene target regulatory relationships. SPP is freely accessible at https://beta.signalingpathways.org. Social media: @sigpathproject



Introduction
Signaling pathways describe functional interdependencies between distinct classes of molecules that collectively determine the response of a given cell to its afferent metabolic and endocrine signals [1]. The bulk of readily accessible information on these pathways resides in the conventional research literature in the form of peer-reviewed hypothesis-driven research articles, and in knowledgebases that curate such information [2]. Many such articles are based in part upon discovery-scale datasets documenting, for example, the effects of genetic or small molecule perturbations on gene expression in transcriptomic datasets, and DNA promoter region occupancy in cistromics, or ChIP-Seq, datasets. Conventionally, only a small fraction of data points from such datasets are characterized in any level of detail in the associated hypothesis-driven articles. While largely unused initially, the remaining data points in 'omics datasets possess potential collective re-use value for validating experimental data or gathering evidence to model cellular signaling pathways. We and others have described the limited findability, accessibility, interoperability and re-use (FAIR) status of these datasets [3,4]. Although some barriers to the FAIR status of these datasets are being addressed, and a number of useful 'omics dataset-based research resources have been developed [5-14], opportunities exist to further develop the infrastructure enabling routine re-use of public 'omics datasets by bench researchers in the field of mammalian cellular signaling.
We previously described biocuration and web development approaches to enhance the FAIR status of public transcriptomic datasets involving genetic or small molecule perturbations of members of the nuclear receptor (NR) superfamily of ligand-regulated transcription factors [15].
Here we describe a distinct and original knowledgebase, the Signaling Pathways Project (SPP), which expands these FAIR efforts along three dimensions. Firstly, we have encompassed datasets involving genetic and small molecule perturbations of a broad range of cellular signaling pathway modules: receptors, enzymes and transcription factors. Secondly, we have integrated ChIP-Seq datasets, which document genomic occupancy by transcription factors, enzymes and other factors. Thirdly, we have developed a meta-analysis technique that surveys across these datasets to generate consensus ranked signatures, referred to as consensomes, which allow for prediction of signaling pathway node-target regulatory relationships. We validate the consensomes using alignment with literature knowledge, integration of transcriptomic and ChIP-Seq evidence, and bench experimental use cases that confirm signaling pathway node-target regulatory relationships predicted by the consensomes. Finally, we have made the entire data matrix available for routine data browsing, mining and hypothesis generation by the mammalian cell signaling research community at https://beta.signalingpathways.org.

Data model design
The goal of the Signaling Pathways Project (SPP) is to give bench scientists routine access to biocurated public transcriptomic and ChIP-Seq datasets to infer or validate cellular signaling pathways operating within their biological system of interest. Although such pathways are diverse and dynamic in nature, they typically describe functional interdependencies between molecules belonging to three major categories of pathway module: activated transmembrane or intracellular receptors, which initiate the signals; intracellular enzymes, which propagate and modulate the signals; and transcription factors, which give effect to the signals through regulation of gene expression [16]. Accordingly, we first set out to design a knowledgebase that would reflect this modular architecture. To ensure that our efforts were broadly aligned with established community standards, we started by integrating existing, mature classifications for receptors (International Union of Pharmacology, IUPHAR; [17]), enzymes (International Union of Biochemistry and Molecular Biology Enzyme Committee [18]) and transcription factors (TFClass [19]). Table S1 shows representative examples of the hierarchical relationships within each of the signaling pathway module categories. To harmonize and facilitate data mining across different signaling pathway modules, top level categories were subdivided firstly into functional classes, which in turn were subdivided into gene families, to which individual gene products were assigned. Fig. 1 summarizes the major classes and families in each category encoded in the data model. Consistent with terminology in use in the cellular signaling field [1,20], we refer to these individual gene products as nodes. Molecular classes that are relevant to, but less frequently studied in the context of cellular signaling, such as regulatory RNAs, chromatin factors and cytoskeletal components, were assigned to a Co-nodes category. 
Impacting the functions of nodes in all four categories are bioactive small molecules (BSMs), encompassing: physiological ligands for receptors; prescription drugs, which almost exclusively target nodes in the receptor and enzyme categories; synthetic organics, representing experimental compounds and environmental toxicants; and natural products (Table S1). BSM-node mappings were retrieved from an existing pharmacology biocuration initiative, the IUPHAR Guide To Pharmacology [17], or annotated by SPP biocurators de novo with reference to a specific PubMed identifier (PMID).
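The hierarchy described above (category, class, family, node, with BSMs mapped to nodes) can be illustrated with a small in-memory sketch. This is not the SPP schema: the class names `Node` and `DataModel`, their methods, and the category, class and family labels used in the example are assumptions chosen for illustration. Only the gene symbols and BSM abbreviations are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """One gene product (node) placed in the category > class > family hierarchy."""
    symbol: str       # approved gene symbol, e.g. "ESR1"
    category: str     # e.g. receptors, enzymes, transcription factors, co-nodes
    node_class: str   # functional class within the category
    family: str       # gene family within the class


@dataclass
class DataModel:
    """Hypothetical container for nodes and BSM-node mappings."""
    nodes: dict = field(default_factory=dict)        # symbol -> Node
    bsm_targets: dict = field(default_factory=dict)  # BSM name -> set of node symbols

    def add_node(self, node: Node) -> None:
        self.nodes[node.symbol] = node

    def map_bsm(self, bsm: str, symbol: str) -> None:
        # Only allow mappings to nodes that are already classified.
        if symbol not in self.nodes:
            raise KeyError(f"unknown node: {symbol}")
        self.bsm_targets.setdefault(bsm, set()).add(symbol)

    def family_members(self, family: str) -> list:
        return sorted(s for s, n in self.nodes.items() if n.family == family)

    def bsms_for_family(self, family: str) -> list:
        members = set(self.family_members(family))
        return sorted(b for b, targets in self.bsm_targets.items() if targets & members)


# Illustrative labels only; the real SPP class and family names may differ.
model = DataModel()
for sym in ("ESR1", "ESR2"):
    model.add_node(Node(sym, "Receptors", "Nuclear receptors", "Estrogen receptors"))
model.map_bsm("17BE2", "ESR1")
model.map_bsm("17BE2", "ESR2")
model.map_bsm("FULV", "ESR1")

print(model.family_members("Estrogen receptors"))
print(model.bsms_for_family("Estrogen receptors"))
```

Keeping BSM mappings separate from the node hierarchy mirrors the text, in which BSMs act on nodes in any category rather than belonging to one.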

Dataset biocuration
Having defined relationships within each major signaling pathway module, we next designed a dataset biocuration strategy that would classify publicly archived transcriptomic and ChIP-Seq datasets according to the signaling pathway node(s) whose transcriptional functions they were designed to interrogate (Fig. S1). For knowledgebase design purposes, we defined a dataset as a collection of individual experiments encompassed by a specific GEO series (GSE, for transcriptomic datasets) or SRA Project (SRP, for ChIP-Seq datasets).

Transcriptomic datasets
We previously described our efforts to biocurate Gene Expression Omnibus (GEO) transcriptomic datasets pertinent to nuclear receptor signaling as part of the Nuclear Receptor Signaling Atlas [15]. In order to expand this collection to encompass datasets involving perturbation of a broader range of signaling pathway nodes, we carried out a systematic survey of GEO to identify an initial population of transcriptomic datasets constituting a representative cross-section of the various classes of signaling pathway node referred to in Fig. 1. To supplement this effort, we also incorporated datasets from the CREEDS project [21], a crowd-based initiative that systematically identified and annotated single gene and BSM perturbation GEO datasets. From this initial collection of datasets, we next carried out a three-step QC check to retain only datasets that (i) included all files required to calculate gene differential expression values; (ii) contained biological replicates to allow for calculation of associated significance values; and (iii) contained samples that clustered appropriately by principal component analysis. Typically, 20-25% of archived transcriptomic datasets were discarded at this step. The remaining datasets were diverse in design, typically involving genetic (single or multi-node overexpression, knockdown, knockin or knockout) or BSM (physiological ligand, drug, synthetic organic or natural substance; single or multi-BSM; time course; agonist, antagonist or tissue-selective modulator) manipulation of a signaling node across a broad range of human, mouse and rat biosamples. To maximize the amount of biological information extracted from each transcriptomic dataset, we calculated differential expression values for all possible contrasts, and not just those used by the original investigators in their publications.
Next, transcriptomic experiments were mapped where appropriate to approved gene symbols (AGSs) for human, mouse and rat genes, representing genetically perturbed signaling nodes, and/or to unique identifiers for BSMs, as well as to a previously described biosample controlled vocabulary. Gene differential expression values were calculated for each experiment using an industry standard Bioconductor pipeline [15]. Finally, experiments were organized into datasets for which digital object identifiers (DOIs) were minted as previously described [4].
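As a rough illustration of what computing all possible contrasts within a dataset might involve, the sketch below enumerates every unordered pair of experimental conditions that has biological replicates. The function name `enumerate_contrasts`, the minimum of two replicates, and the treatment of contrasts as unordered pairs are assumptions for this example, not a description of the pipeline actually used.

```python
from itertools import combinations


def enumerate_contrasts(samples_by_condition, min_replicates=2):
    """Return all pairwise contrasts between conditions with enough replicates.

    samples_by_condition maps a condition label to its list of sample IDs.
    Conditions with fewer than min_replicates samples are skipped, because a
    significance value cannot be estimated without replication.
    """
    eligible = sorted(
        cond for cond, samples in samples_by_condition.items()
        if len(samples) >= min_replicates
    )
    return list(combinations(eligible, 2))


# Hypothetical dataset layout: three replicated conditions and one singleton.
dataset = {
    "vehicle": ["s1", "s2", "s3"],
    "17BE2": ["s4", "s5", "s6"],
    "17BE2+FULV": ["s7", "s8", "s9"],
    "FULV": ["s10"],  # no replicates, so excluded from all contrasts
}

contrasts = enumerate_contrasts(dataset)
for a, b in contrasts:
    print(f"{a} vs {b}")
```

With n eligible conditions this yields n(n-1)/2 contrasts, which is why a dataset can contribute more experiments than were reported in the original publication.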

ChIP-Seq datasets
In addition to integration of transcriptomic datasets with each other, their integration with related ChIP-Seq datasets was desirable since it would provide for cross-validation of predicted node-target relationships, as well as for more detailed mechanistic modeling of such relationships than would be possible using either 'omics platform individually. The ChIP-Atlas resource [22] supports re-use of ChIP-Seq datasets by carrying out uniform MACS2 peak calling across ChIP-Seq datasets archived in NCBI's Sequence Read Archive (SRA). We therefore next set out to identify and incorporate ChIP-Atlas-processed SRA ChIP-Seq datasets relevant to mammalian signaling pathway nodes. Individual SRA experiments were first mapped to the AGS of the immunoprecipitation (IP) node and any other genetically manipulated nodes (e.g. knockdown or knockout background), to any BSMs represented in the experimental design, and to the biosample in which the experiment was carried out.

Generation of consensomes
An ongoing challenge for the cellular signaling bioinformatics research community is the meaningful integration of the universe of 'omics data points to enable researchers lacking computational expertise to develop focused research hypotheses in a routine and efficient manner. A particularly desirable goal is unbiased meta-analysis to define community consensus reference signatures that allow users to predict regulatory relationships between signaling pathway nodes and their downstream targets. Accordingly, we next set out to design a meta-analysis pipeline that would leverage our biocurational platform to reliably identify signaling pathway node-target gene regulatory relationships in a given biosample context. Since this analysis was designed to establish a consensus across distinct datasets from different laboratories, we referred to it as consensomic analysis, and the resulting node-target rankings as consensomes.

Transcriptomic consensomes
Large-scale meta-analysis of publicly archived transcriptomic datasets is confronted primarily by the sheer heterogeneity of genetic and pharmacological perturbation designs represented in these datasets. We hypothesized that irrespective of the nature of the perturbation impacting a given pathway node, downstream targets with a greater dependence on the integrity of that node would be more likely to be differentially expressed in response to its perturbation than those with a weaker regulatory relationship with the node. Accordingly, to enhance the statistical power of the analysis, we initially binned transcriptomic experiments for meta-analysis on the basis of genetic or pharmacological manipulation of a given signaling node. To further extend statistical power, experiments involving manipulation of all nodes in a defined gene family were combined for meta-analysis. Next, we further classified experiments according to the biosample and species in which they were carried out. Gene target-specific nominal p-values and differential expression values were then aggregated over each defined set of experiments to yield target-specific summaries for a consensome. A more detailed description of the transcriptomic consensome algorithm is contained in File S1.
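The binning step can be pictured as grouping experiments under a (node family, species, biosample) key. The sketch below is an assumption-laden illustration: the record fields, the `bin_experiments` helper, the family labels and the additional "All" bin that pools every biosample for a family are hypothetical, and File S1 remains the authoritative description.

```python
from collections import defaultdict


def bin_experiments(experiments, node_to_family):
    """Group experiment IDs by (family, species, biosample).

    Each experiment is also added to an "All" biosample bin for its family and
    species, so that a family-wide consensome can be computed alongside the
    biosample-specific ones (an assumption made for this sketch).
    """
    bins = defaultdict(list)
    for exp in experiments:
        family = node_to_family[exp["node"]]
        bins[(family, exp["species"], exp["biosample"])].append(exp["id"])
        bins[(family, exp["species"], "All")].append(exp["id"])
    return dict(bins)


# Hypothetical curated experiment records.
node_to_family = {"ESR1": "ERs", "ESR2": "ERs", "AR": "AR"}
experiments = [
    {"id": "e1", "node": "ESR1", "species": "Hs", "biosample": "Mammary gland"},
    {"id": "e2", "node": "ESR2", "species": "Hs", "biosample": "Mammary gland"},
    {"id": "e3", "node": "ESR1", "species": "Hs", "biosample": "Uterus"},
    {"id": "e4", "node": "AR", "species": "Hs", "biosample": "Prostate gland"},
]

bins = bin_experiments(experiments, node_to_family)
for key, ids in sorted(bins.items()):
    print(key, ids)
```

Pooling ESR1 and ESR2 experiments under one family key is what the text means by borrowing statistical power across nodes in a gene family.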
A number of factors determine whether a target will be induced or repressed by manipulation of a given signaling node in any given experiment. These include: node isoform differential expression [23]; cell cycle stage [24]; biosample of study [25]; BSM dose and treatment duration; and perturbation type (loss or gain of function). To avoid these opposing alterations canceling each other out at the target transcript level in the meta-analysis, we converted fold changes to positive fold changes (i.e. max(FC, 1/FC)) so that both induction and repression would be counted as 'altered' in a summary measure of the magnitude of perturbation, which was computed as the geometric mean fold change. In addition, for each target, we counted the number of experiments with gene-specific nominal p-values ≤0.05, and computed the binomial probability, referred to as the consensome p-value (CPV), of observing that many or more nominally significant experiments out of the number of experiments in which the target was assayed, given a true probability of 0.05. Targets were then ranked in consensomes in ascending order of CPV, with the average rank being reported for tied CPVs.
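The summary statistics described above can be written compactly. For a target assayed in n experiments, of which k have nominal p ≤ 0.05, the CPV is the upper binomial tail P(X ≥ k) with X ~ Binomial(n, 0.05). The following is a minimal sketch under stated assumptions: fold changes are taken to be linear-scale and strictly positive, and the function names are illustrative rather than those of the SPP pipeline.

```python
import math


def positive_fold_change(fc):
    """Fold change with direction removed: max(FC, 1/FC). Assumes fc > 0."""
    return max(fc, 1.0 / fc)


def geometric_mean_fc(fold_changes):
    """Geometric mean of the direction-free fold changes for one target."""
    logs = [math.log(positive_fold_change(fc)) for fc in fold_changes]
    return math.exp(sum(logs) / len(logs))


def consensome_p_value(n_significant, n_assayed, p=0.05):
    """Upper binomial tail: P(X >= n_significant) for X ~ Binomial(n_assayed, p)."""
    return sum(
        math.comb(n_assayed, k) * p ** k * (1.0 - p) ** (n_assayed - k)
        for k in range(n_significant, n_assayed + 1)
    )


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are tied
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


# A two-fold induction and a two-fold repression both count as "altered".
print(geometric_mean_fc([2.0, 0.5]))
# A target significant in 30 of 40 experiments gets a far smaller CPV than one
# significant in 3 of 40.
print(consensome_p_value(30, 40), consensome_p_value(3, 40))
print(average_ranks([0.01, 0.5, 0.01, 0.2]))
```

Summing the tail directly, rather than computing one minus the cumulative probability, avoids loss of precision when the CPV is very small.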

ChIP-Seq consensomes
For calculation of ChIP-Seq consensomes, groups of experiments were formed whose IP nodes mapped to a defined node family. These classes were further sorted into meta-analysis classes based on mapping to the same biosample controlled vocabulary used to annotate the transcriptomic datasets [15]. In contrast to the transcriptomic consensomes, which were based upon differential expression (DE) and significance values generated de novo from raw files, MACS2 peak calls and associated significance cut-offs were retrieved in pre-processed form from ChIP-Atlas [22].

Signaling Pathways Project user interface
To make the results of our biocuration and analysis routinely and freely available to the research community, we next developed a web interface for the SPP knowledgebase that would provide for browsing of datasets, as well as for mining of the underlying data points. A comprehensive walkthrough file containing instructions on the use of the SPP interface is shown in File S2.

Browsing of SPP datasets
The full list of SPP datasets can be filtered using any combination of 'omics type, signaling pathway category, class or family, biosample physiological system and organ, or species.
Individual dataset pages enable integration of SPP with the research literature via DOI-driven links from external sites, and provide for citation of datasets to enhance their FAIR status [3,4].
To accommodate users seeking a rapid summary of the targets with the highest differential expression (for transcriptomic datasets, example: analysis of the sperm-specific antigen 2 (Ssfa2)-dependent transcriptome in mouse liver), or highest MACS2 peak value (for ChIP-Seq datasets, example: analysis of the CREBBP cistrome in human embryonic kidney 293 cells), the most highly ranked targets in a given experiment are displayed. The user can toggle between individual experiments using a pull-down menu.

Mining of SPP datasets in Ominer
The SPP query interface, Ominer, allows a user to specify a single gene target, a GO term or a custom gene list in the "Gene(s) of Interest" drop-down, and to dial in additional node and biosample regulatory parameters in subsequent drop-down menus as required (Fig. 2A).
Examples of single gene and GO term queries are shown in Table 1 and Table 2, respectively.
Results are returned in an interface referred to as the Regulation Report, a detailed graphical summary of evidence for transcriptional regulatory relationships between signaling pathway nodes and a genomic target(s) of interest (Fig. 2). Consistent with the hierarchy in Table S1, each Regulation Report category is subdivided into classes (depicted as Category | Class in the UI, Fig. 2, B & C) which are in turn subdivided into families, which in turn contain member nodes, which are themselves mapped to BSMs (Fig. 2, B & C). The transcriptomic Regulation Report displays differential expression levels of a given target in experiments involving genetic (rows labelled with italicized node AGS) or BSM (rows labelled with bold BSM symbol) manipulations of nodes within a given family (Fig. 2B). The cistromics/ChIP-Seq Report displays MACS2 peak values within 10 kb of a given promoter transcriptional start site (TSS) in ChIP-Seq experiments named using the convention IP Node AGS | BSM Symbol | Other Node AGS (Fig. 2C).
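To make the ChIP-Seq naming convention and the 10 kb promoter window concrete, the sketch below splits an experiment name into its three fields and tests whether a peak lies within the window around a TSS. Both helpers are hypothetical. In particular, treating empty fields as absent, and treating "within 10 kb" as a symmetric window on either side of the TSS, are assumptions, and FOXA1 appears in the example only as a placeholder second node.

```python
def parse_experiment_name(name):
    """Split 'IP Node AGS | BSM Symbol | Other Node AGS' into named fields.

    Missing or empty trailing fields are returned as None (an assumption about
    how experiments without a BSM or second node might be labelled).
    """
    parts = [p.strip() for p in name.split("|")]
    parts += [""] * (3 - len(parts))  # pad so that three fields always exist
    ip_node, bsm, other_node = parts[:3]
    return {
        "ip_node": ip_node,
        "bsm": bsm or None,
        "other_node": other_node or None,
    }


def peak_near_tss(peak_position, tss_position, window=10_000):
    """True if the peak lies within `window` bp on either side of the TSS."""
    return abs(peak_position - tss_position) <= window


print(parse_experiment_name("ESR1 | 17BE2 | FOXA1"))
print(parse_experiment_name("ESR1"))
print(peak_near_tss(105_000, 100_000), peak_near_tss(111_000, 100_000))
```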
To accommodate users seeking a perspective on regulation of a target in a specific organ, tissue, cell line or species, users can select the "Biosample" and "Species" views from the dropdown (Fig. 2B). Each data point in either Regulation Report links to a pop-up window containing the essential experimental information (Fig. 2D, upper = transcriptomic, lower = cistromic). This in turn links to a window summarizing the pharmacology of any BSMs used in the experiment (Fig. 2E), or a Fold Change Details window that places the experiment in the context of the parent dataset (Fig. 2F), linking to the full dataset page and associated journal article. The Fold Change Details window also provides for citation of the dataset, an important element of enhancing the FAIR status of 'omics datasets [3].

Consensomes: discovering downstream transcriptional targets of signaling pathway nodes
Table 3 shows examples of the consensomes available in the initial version of the SPP knowledgebase. Consensomes can be accessed through Ominer, in which the user selects "Consensome" from "Gene(s) of Interest", then either "Transcriptomic" or "Cistromic (ChIP-Seq)" from the "'Omics Category" menu. Subsequent menus allow for selection of specific signaling pathway classes or families, physiological system or organ of interest, or species. To accommodate researchers interested in a specific physiological system or organ rather than a specific signaling node, consensomes are also calculated across all experiments mapping to a given physiological system (metabolic, skeletal) and organ (liver, adipose), providing for identification of targets under the control of a broad spectrum of signaling nodes in those organs (Table 3). To maximize their distribution and exposure in third party resources, consensomes can also be accessed by direct DOI-driven links.
Consensomes are displayed in an accessible tabular format in which the default ranking is in ascending order of CPV, although targets can be ranked by any column desired (Fig. 3). To reflect the frequency of differential expression of a target relative to others in a given consensome, the percentile ranking of each target within the consensome is displayed. Targets in the 90th percentile of a given consensome, which are the highest confidence predicted targets for a given node family, are accessible through the web interface, and the entire list of targets is available for download in spreadsheet format for import into custom analysis programs. As previously discussed, to suppress the diversity of experimental designs as a confounding variable in consensome analysis, the direction of differential expression is omitted when calculating the ranked signatures. An appreciation of the pharmacology of a specific node-target gene relationship is essential, however, to allow researchers to place the ranking in a specific biological context and to design subsequent experiments in an informed manner. To accommodate this, the target gene symbols in consensomes link to transcriptomic or cistromic Regulation Reports filtered by family and/or biosample to display those data points that contributed to the calculation of the specific consensome.
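One way to derive a percentile ranking from CPVs, and to select the targets at or above the 90th percentile, is sketched below. The definition used here (the percentage of targets whose CPV is greater than or equal to that of the target in question) is an assumption made for illustration, and the exact formula used by SPP may differ. The gene labels and CPVs are synthetic.

```python
from bisect import bisect_left


def percentile_ranks(cpv_by_target):
    """Percentile per target: share of targets with an equal or larger CPV."""
    sorted_cpvs = sorted(cpv_by_target.values())
    n = len(sorted_cpvs)
    return {
        target: 100.0 * (n - bisect_left(sorted_cpvs, cpv)) / n
        for target, cpv in cpv_by_target.items()
    }


def top_targets(cpv_by_target, cutoff=90.0):
    """Targets at or above the cutoff percentile, in ascending order of CPV."""
    pct = percentile_ranks(cpv_by_target)
    selected = [t for t, p in pct.items() if p >= cutoff]
    return sorted(selected, key=cpv_by_target.get)


# Ten synthetic targets with CPVs from 1e-10 (g0) up to 1e-1 (g9).
cpvs = {f"g{i}": 10.0 ** -(10 - i) for i in range(10)}
print(percentile_ranks(cpvs)["g0"])
print(top_targets(cpvs))
```

Because tied CPVs receive the same percentile under this definition, the cutoff never separates targets that the ranking itself cannot distinguish.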
A useful feature of the consensome table is the ability to filter the list by target gene symbol using the Search box (Fig. 3). Although this can be used for identifying a single gene of interest, it also illuminates potentially biologically significant regulation of targets encoding multiple members of a gene family by a given node. For example, the significant enrichment in the 90th percentile of the human estrogen receptor family (ERs-Hs-All systems) consensome of multiple members of the go-ichi-ni-san (HGNC root symbol GINS; [26]), condensin (NCAP; [27]), minichromosome maintenance (MCM; [28]) and centromere protein (CENP; [29]) families, among others, reflects the profound impact of estrogen receptor signaling on DNA replication and cell division in its target organs.

Validation of consensomes
The design of the transcriptomic consensome analysis was predicated upon three assumptions: firstly, that borrowing statistical power by binning experiments according to their perturbation of a given signaling node was biologically valid; secondly, that omitting direction of differential expression from the analysis would allow for direct interrogation of the strength of the regulatory relationship between a node and a target, independent of the nature of the node perturbation used in an experiment; and thirdly, that ranking targets according to the frequency of their significant differential expression, rather than by fold change, accurately reflected the relative strengths of the regulatory relationship between a given node and its transcriptional targets. We next wished to determine whether these assumptions were legitimate, and to establish whether the consensomes were indeed reliable consensus regulatory signatures for cellular signaling nodes. We designed a consensome validation strategy comprising four components:

Canonical signaling node targets are highly ranked in consensomes
To compare consensome rankings with canonical node-target relationships, we selected the ten top ranked targets in the ER subfamily in human mammary gland (ERs-Hs-MG), the androgen receptor in human prostate gland (AR-Hs-Prostate), the glucocorticoid receptor in the human metabolic system (GR-Hs-Metabolic), and the peroxisome proliferator-activated receptor (PPAR) family in the mouse metabolic system (PPARs-Mm-Metabolic). We then searched the research literature to identify articles in which these genes had been functionally characterized as targets of these receptors. As shown in Table S2, the most highly ranked targets for the AR-Hs-All, ER-Hs-All and PPARs-Mm-Metabolic consensomes were well supported by evidence in the research literature, although overlap with literature knowledge was lower for the GR-Hs-Metabolic consensome (Table S2).

Reciprocal validation of transcriptomic and ChIP-Seq consensomes
Although many of the most highly ranked consensome targets were validated by prior characterization in the research literature, some, such as those in the GR-Hs-Metabolic consensome, were not. What was unclear at this juncture was whether such genes were authentic target genes, and therefore represented gaps in literature knowledge that were filled by the consensomes, or whether they were false positives whose elevated consensome rankings were misleading. To distinguish between these two possibilities, we next wished to determine the extent to which NR node-target relationships predicted by the transcriptomic consensomes were validated by the publicly archived ChIP-Seq datasets involving the corresponding receptors. In the canonical model of NR signaling, upon binding of endogenous ligands such as 17β-estradiol (17BE2) or dihydrotestosterone (DHT), NRs are released from inhibitory heat shock proteins, spontaneously dimerize and translocate to the nucleus, where they interact with specific promoter enhancer elements to regulate expression of target genes [30]. Of the 40 genes selected for literature validation (Table S2), 90% (45/50) were in the 90th percentile or higher in the corresponding ChIP-Seq consensomes, indicating that they are regulated at least in part by direct receptor-enhancer interactions. Interestingly, of the eight transcriptomic consensome-predicted node-target relationships for which no supporting literature evidence was found, all but one were in the 90th percentile or higher of the corresponding ChIP-Seq consensome.

Intersections of transcriptomic consensomes for key hepatic signaling nodes are enriched for targets encoding critical metabolic pathway enzymes
Transcriptional regulation of metabolism by cellular signaling pathways is a well-established paradigm [31]. Consistent with this, a broad range of hepatic pathways impacting metabolism of carbohydrate, lipids, amino acids and other intermediates are under fine transcriptional regulation by a variety of nuclear receptors, including NR1H4/FXR [32], NR3C1/GR [33] and members of the PPAR [34,35] families. If our assertion that consensomes reflected the relative strengths of node-target regulatory relationships was valid, we anticipated that gene targets with elevated rankings across these three hepatic consensomes (and, by implication, strong regulatory relationships with these receptors) would be enriched for targets encoding factors with prominent roles in hepatic metabolism. To test this hypothesis, we first identified genes in the All nodes-Mm-liver TC90 (n = 1999), that is, those genes that were in the top 10% of targets  (Table 4 and bold in Fig. 4B) and/or are deficient or mutated in a known metabolic disorder (Table 4 and marked with an asterisk* in Fig. 4B). Many of these enzymes are historically well characterized, including Acaca, which regulates the rate limiting step in fatty acid synthesis [37] and is deficient in acetyl CoA carboxylase syndrome [38], and Hal, which regulates the initial step in histidine catabolism and is deficient in histidinemia [39]. The critical metabolic roles of other enzymes, however, such as Nnmt [40] and Parp14 [41], have been characterized only much more recently. In addition to enzymes, other nodes that participate in pathways with critical roles in hepatocyte homeostasis and development, such as Il6ra [42], Cebpb [43] and members of the Irf transcription factor family [44], are represented at the intersection of the four consensomes (File S3). 
This analysis demonstrates the ability of organ level consensomes to illuminate factors that are downstream targets of multiple signaling nodes and, by extension, have pivotal, tightly-regulated roles in the function of a given physiological system or organ.
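The intersection analysis amounts to a set operation over the top-decile targets of each consensome. In the sketch below, the helper name `intersect_top_targets`, the consensome labels and the gene labels are all placeholders invented for illustration. They are not results from SPP.

```python
def intersect_top_targets(top_targets_by_consensome):
    """Return targets present in the top-ranked set of every consensome."""
    sets = [set(targets) for targets in top_targets_by_consensome.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


# Placeholder top-decile memberships for four hypothetical consensomes.
top_sets = {
    "consensome_1": ["GeneA", "GeneB", "GeneC"],
    "consensome_2": ["GeneB", "GeneC", "GeneD"],
    "consensome_3": ["GeneB", "GeneE"],
    "consensome_4": ["GeneA", "GeneB", "GeneD"],
}

shared = intersect_top_targets(top_sets)
print(sorted(shared))
```

Targets that survive the intersection are, by construction, highly ranked for every node family considered, which is the property the enrichment analysis relies on.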

Bench validation: elevated consensome rankings predict biological node-target relationships
A primary motivation in developing the SPP resource was to assist researchers in filling gaps in the literature regarding knowledge of cellular signaling pathways. Indeed, in addition to corroborating canonical node-target gene relationships, we found that the node transcriptomic consensomes contained targets that had elevated percentile rankings, but were uncharacterized in the research literature with respect to regulation by that node. Accordingly, we next set out to experimentally validate representative examples of these targets, shown in Table S3, at the bench.

TPD52L1 is a stress fiber-associated factor that supports 17BE2-dependent MCF-7 cell proliferation
We first wished to broadly evaluate the extent to which experimental evidence validated the node-target relationships predicted by the consensomes. To do this, we used Q-PCR to evaluate 17BE2-dependent regulation of a panel of both characterized and uncharacterized ER targets that were highly ranked in the ER-Hs consensome (Fig. 5A). Reflecting their elevated consensome rankings, the expression of all the genes tested was found to be regulated by 17BE2 in either a dose-dependent (GREB1, TPD52L1 and others) or a biphasic (MYC and TFF1) manner, activated and suppressed at physiological or supraphysiological levels of 17BE2, respectively. We next wanted to evaluate the dependence of this regulation on the integrity of nodes in the ER family (ESR1 and ESR2) using the selective ER downregulator fulvestrant (FULV), which blocks the function of these nodes by disrupting their interaction with 17BE2 and inducing their proteasomal degradation [45]. Consistent with the strong ER family node dependence of their regulation predicted by the ER-Hs-mammary gland transcriptomic and ChIP-Seq consensomes, FULV completely abolished 17BE2 induction of all target genes tested (Fig. 5A).
We next selected one of the uncharacterized ER consensome targets for further study. The tumor protein D52-like 1 (TPD52L1) gene encodes a little-studied protein that bears sequence homology to members of the TPD52 family of coiled-coil motif proteins that are overexpressed in a variety of cancers [46]. Despite a ranking in the transcriptomic (ERs-Hs-All-TC CPV = 1E-130, 99th percentile) and ChIP-Seq (ERs-Hs-All-CC 99th percentile) ER consensomes that was comparable to or exceeded that of canonical ER target genes such as GREB1 or MYC, and subsequent experimental bench validation of the ER family-TPD52L1 regulatory relationship (Fig. 5A), no evidence for regulation of TPD52L1 by ER was found in the research literature.
Interestingly, peak cell cycle expression of both TPD52L1 [47] and ESR1 occurs at the G2-M transition, which is also a point at which ESR1 is known to exert control of cell cycle progression [48]. Based upon these observations, we selected TPD52L1 for further validation and characterization in the context of ER signaling. Immunofluorescence analysis of TPD52L1 in MCF-7 cells demonstrated specific 17BE2-dependent association of TPD52L1 with numerous structures, including nucleus, plasma membrane, cytoplasm and stress fiber-like structures (Fig. 5B), which play an important role in mitosis orientation, a critical process in cell division [49].
Having established a potential function for TPD52L1 in regulation of the cell cycle, we hypothesized that its depletion in cells might block this function and retard cell growth.
Consistent with this, and in support of previously reported associations of its family member TPD52 with increased proliferation and invasive capacity [50,51], we found that siRNA-mediated knockdown of TPD52L1 by 80% (data not shown) resulted in a significant decrease in 17BE2-induced proliferation of MCF-7 cells (Fig. 5C). Interestingly, the TPD52L1 transcriptomic Regulation Report showed disruption of TPD52L1 expression in response to manipulation of checkpoint kinases (CHEK1, CHEK2), a cyclin-dependent kinase (CDK9) and members of the MAPK superfamily (ATR, ATM and RAF1), which is consistent with known roles for these enzymes in regulation of the G2/M checkpoint [52][53][54][55][56]. In concert, these observations constitute experimental validation of the biological relationship between ER signaling and TPD52L1 predicted by the ER family-Hs-mammary gland transcriptomic and ChIP-Seq consensomes.

MBOAT2 connects phospholipid metabolism to AR regulation of prostate cell growth
The next bench validation use case illustrates the value of integration of ChIP-Seq data points in gathering evidence to establish the plausibility of a node-target relationship implied by the consensomes. The MBOAT2 gene encodes an enzyme, membrane-bound O-acyl transferase 2, that catalyzes cycles of glycerophospholipid deacylation and reacylation to modulate plasma membrane phospholipid asymmetry and diversity [57]. We noted that the ranking of MBOAT2 in both the AR-Hs-All TC (CPV = 2.2E-35, 99th percentile) and ChIP-Seq (99th percentile) consensomes was comparable to that of canonical and intensively studied AR target genes such as KLK3 and TMPRSS2. In contrast to the large volume of literature on these targets, however, and with the exception of a mention in a couple of androgen expression profiling studies [58,59], the role of MBOAT2 in the context of AR signaling was entirely unstudied. Our attention was drawn to MBOAT2 as a candidate for bench validation as an AR downstream target by a number of different lines of evidence. Notably, in addition to numerous AR binding data points, the MBOAT2 cistromic/ChIP-Seq Regulation Report contained evidence for binding sites within 10 kb of the MBOAT2 TSS for the transcription factors GATA1, FOXA1, MYC and NANOG, all of which have known roles in AR crosstalk [60]. Encouraged by the cistromic/ChIP-Seq evidence corroborating its elevated ranking in the AR-Hs-All TC, we selected MBOAT2 for further validation and characterization. We first wished to test whether MBOAT2 was an AR-regulated gene in cultured prostate cancer cell lines. As shown in Fig. 5D, MBOAT2 was induced in LNCaP prostate epithelial cells in response to treatment with the physiological AR agonist dihydrotestosterone (DHT).
We next determined the effect of depletion of MBOAT2 on LNCaP cell viability and found that, relative to control siRNA treatment, siMBOAT2 significantly increased LNCaP cell numbers at growth day 5 in R1881-treated cells, but not in untreated cells (Fig. 5E).
This result was unexpected to us given the prevailing perception of AR as a driver of prostate tumor growth, but can be rationalized in the context of suppression of growth and support of differentiation by AR in normal prostate luminal epithelium [61]. This process is known to involve induction of the NKX3.1 (AGS: ZBTB16) homeobox transcription factor [62,63], itself the highest ranked gene in the AR-Hs-All transcriptomic consensome, and it can be speculated that induction of MBOAT2 by AR represents an additional component of this process. Such an assertion is supported by the recent characterization of the role of MBOAT2 in chondrogenic differentiation of ATDC5 cells [64], and by the fact that the AR agonist testosterone stimulates the chondrogenic potential of chondrogenic progenitor cells [65].

GR and ERR exert co-ordinate regulation of glycogen metabolism via regulation of protein phosphatase subunit expression
The first two experimental validation studies focused on distinct single node-target regulatory relationships. We next wished to validate the use of consensome intersection analysis to highlight convergence of multiple signaling nodes on targets involved in a common downstream biological process. Interconversion of glucose and glycogen in metabolic organs is under the tandem control of glycogen synthase and glycogen phosphorylase, which respectively promote and restrict the incorporation of glucose into glycogen in response to hormonal and stress regulatory cues [66]. The activity of glycogen synthase is in turn under the control of protein phosphatase 1 (PP1), which converts it from its inactive phosphorylated form to its active dephosphorylated form [67], and 5'AMP-activated protein kinase (AMPK), which catalyzes the reverse reaction [68] (Fig. 6A). Although historical evidence indicates that glucocorticoids promote glycogen storage in the liver through upregulation of glycogen synthase phosphatase activity [69], the underlying mechanism has to date been unclear. Similarly, although members of the ERR subfamily have been shown to promote reprogramming of carbohydrate metabolism in exercising skeletal muscle [70], a direct role for ERRs in controlling glycogen turnover in muscle had not been investigated. Two key regulatory subunit genes relevant to PP1 and AMPK are Ppp1r3c, encoding PTG in the PP1 holoenzyme [71,72], and Prkab2, encoding AMPKβ2 in the AMPK holoenzyme [73,74]. Based on the significant rankings of PPP1R3C in the GR human (CPV = 3.2E-08) and mouse (CPV = 3.4E-10) transcriptomic consensomes, and of both PPP1R3C (CPV = 2.7E-06) and PRKAB2 (CPV = 6.3E-05) in the ERR-Hs-All transcriptomic consensome, we hypothesized that the mechanism by which these receptors controlled carbohydrate metabolism in liver and skeletal muscle might encompass regulation of expression of these two genomic targets.
Based upon Ominer Regulation Report evidence for binding of GR to the Ppp1r3c promoter in mouse liver, we undertook sequence analysis of the murine Ppp1r3c promoter and identified two prominent peaks 5' to the first exon of Ppp1r3c, the more proximal peak of which contained a potential glucocorticoid response element (GRE, Fig. 6B; based on the GRE consensus [75]).
To determine whether Ppp1r3c was upregulated in isolated cells, we treated a hepatoma cell line with the synthetic glucocorticoid dexamethasone (DEX) for 48 h and observed upregulation of Ppp1r3c mRNA (Fig. 6C). As positive controls, we also observed induction of the genes encoding pyruvate carboxylase (Pcx) [76] and Fgf21 [77], established GR targets that have significant rankings in the GR-Mm-All transcriptomic consensome.
We next wished to determine whether the same Ppp1r3c PP1 regulatory subunit gene targeted by GR, as well as the AMPK subunit gene Prkab2, were directly regulated by ERRs. Evidence in the SPP cistromic Regulation Reports for Ppp1r3c and Prkab2 and from IGV analysis of additional datasets (Fig. 6D) supported the presence of one or more Esrra binding sites within 10 kb of the Ppp1r3c and Prkab2 TSSs. To investigate the effect of small molecule manipulation of Esrra on endogenous expression of Prkab2 or Ppp1r3c, we treated C2C12 myotubes (day 3 MT) for 24 h with BSM inhibitors of Esrra (Fig. 6E). Prkab2 was repressed by the Esrra inverse agonist XCT790 [78] (Fig. 6E, right panel) whereas Ppp1r3c transcript expression was unaffected in response to this treatment (Fig. 6E, left panel). We next evaluated the effects on endogenous Ppp1r3c and Prkab2 expression of genetic manipulation of ERR signaling using adenoviral overexpression of Esrra (Fig. 6F). Interestingly, whereas Ppp1r3c was upregulated in response to Esrra gain of function (Fig. 6F, left panel), expression of Prkab2 was not significantly impacted (Fig. 6F, right panel). The differential regulation of the two targets in these experiments suggests that Esrra may be more important for maintaining basal expression of Prkab2 and mediating regulation of Ppp1r3c expression in response to physiologic stimuli. We next assessed whether the expression of Prkab2 was altered in Esrra-deficient skeletal muscle [79]. Consistent with its elevated ERR consensome ranking, basal expression of Prkab2 transcript was reduced by 40% in Esrra-depleted skeletal muscle compared to wild-type tissue (Fig. 6G).
To test if ERRs directly regulate the Prkab2 target, the -2820 to +27 region (relative to the TSS +1) was cloned upstream of a luciferase reporter gene. Based on JASPAR transcription factor binding site prediction software [80], we determined that this region contained a number of high scoring ERR family consensus binding sites (data not shown). Several of the predicted ERRE sites were in close proximity to consensus sites for Gabpa, Creb, and Stat3, which are often clustered with ERREs and are known to facilitate functional interactions between ERR family members and these transcription factors [81]. Consistent with this, the PRKAB2 ChIP-Seq Regulation Report contains evidence for binding of CREB and Stat factors to the PRKAB2 promoter. In transcriptional assays performed in C2C12 myoblasts we observed a similar magnitude of activation of the Prkab2 promoter-reporter in response to co-transfected Esrra or Esrrg (Fig. 6H). We then assessed whether the regulation of Prkab2 by Esrr was impacted by insulin-like growth factor 1 (IGF1), which signals through AKT and MAPK to promote myocyte glucose uptake and glycogen storage [82][83][84]. Treatment of myoblasts for 24 h with IGF1 stimulated Prkab.2.82.Luc activity and further enhanced the activation by both ERR isoforms (Fig. 6H). Collectively, these results validate consensome predictions that genomic targets encoding AMPK and PP1 regulatory subunits are under direct transcriptional regulation by ERRs, supporting further studies into a physiological role for ERRs in regulation of glycogen metabolism in skeletal muscle.

Discussion
Receptors, enzymes and transcription factors connect metabolic signals to their biological endpoints through a series of interdependent interactions that are commonly referred to as "signaling pathways". These three categories of pathway node act as points of convergence and integration on the one hand, and divergence and distribution on the other, to ensure an appropriate response of any given cell to its afferent metabolic cues [1]. The vast majority of information on signaling pathways that is readily available to researchers is canonical in nature and derived from the published literature. Although transcriptomic and ChIP-Seq datasets involving manipulation of these nodes have the potential to provide for the generation of focused hypotheses to resolve mechanistic blind spots in such knowledge, deficits in their management have complicated such re-use [4]. To address this problem, we designed here a knowledgebase, the SPP, which allows bench researchers to routinely evaluate transcriptomic or ChIP-Seq dataset evidence for regulatory relationships between cellular signaling pathway nodes and their downstream targets. To enhance discovery using SPP, we surveyed across these datasets in an unbiased and systematic manner, to generate consensus node-target signatures that would allow researchers to infer and model candidate signaling pathways operating in their biological system of interest. The SPP resource is predicated on the idea that receptors, enzymes, transcription factors and other regulatory nodes are molecular free agents whose function is not necessarily tied to any single context and, by extension, have the theoretical potential to associate in any modular combination in a given cellular context.
The direction and magnitude of regulation of a genomic target by a signaling node are highly contextual considerations that change dynamically in response to a broad spectrum of variables. The principle of the transcriptomic consensomic approach is predicated upon initially suppressing such parameters in favor of a ranking that emphasizes the relative responsiveness of a given gene to regulation by a specific signaling node in a given biosample context. Once a specific consensome has been retrieved, the user can then evaluate evidence across all the underlying data points to develop a well-informed hypothesis that can be mechanistically validated or refined at the bench. We hypothesized that the CPV correlated with the strength of the mechanistic connection between a node and a given target in a given organ context. Put another way, the more exquisitely interdependent a node-target relationship, the more frequently perturbation of the former would impact the latter. We intend the term "consensome" to be interpreted as a general term to refer to establishing consensus across a set of experiments related by perturbation of a given cellular signaling node. The actual method used to establish such consensus is determined by a variety of factors, such as the type of 'omics platform, or more practical considerations such as the format of the available data. Indeed, within this study itself we employed different approaches to generating the transcriptomic and ChIP-Seq consensomes that were influenced by the format of the available data.
The SPP resource is characterized by a unique combination of features. The Regulation Report organizes query results according to community classifications of the major categories of signaling pathway module, allowing researchers to readily place the results in context. In addition, previous transcriptomic meta-analysis approaches in the field of cellular signaling are perturbation-centric, and applied to experiments involving a single unique perturbant [85,86].
Consensomic analysis differs from these approaches in that it is node-centric: that is, it is predicated upon the functional relatedness of any genetic or small molecule manipulation of a given pathway node, and allows experiments to be grouped for meta-analysis accordingly. In doing so, it lends the meta-analysis greater statistical power, and calls potential node-target relationships with a higher degree of confidence than would otherwise be possible. A third unique aspect is that other primary analysis and meta-analysis studies describing integration of transcriptomic and ChIP-Seq datasets, although insightful, are limited in scope and exist only as stand-alone literature studies. Ours is the first meta-analysis to be sustainably integrated into an actively-biocurated public web resource in a manner supporting routine use by researchers lacking formal informatics training. Moreover, the continuous incorporation of newly biocurated datasets and reversioning of consensomes over time will have the effect of iteratively suppressing inter-dataset noise and enhancing the resolution of the true biological signal of a given pathway node in a given biosample. Finally, SPP is to our knowledge the first public resource to provide for gene level integration of transcriptomic and ChIP-Seq datasets mapped to common signaling pathway nodes. Given the highly contextual and nuanced nature of transcription factor-promoter relationships therefore, we anticipate that the ability to dissect experimental factors underlying the consensomes in side-by-side transcriptomic and cistromic/ChIP-Seq Regulation Reports will be of considerable value to users in modeling the specific regulatory mechanisms underlying node-target relationships.
Our resource has a number of limitations, some of which are generic in nature and not specific to SPP. For example, SPP is based upon transcriptomic and ChIP-Seq data since these are by far the most informatically mature and numerous of the various types of archived 'omics data.
Although archived proteomic and metabolomic datasets would improve the resource, those that involve manipulation of mammalian cellular signaling nodes have not yet reached a volume at which their integration would repay the biocurational effort involved. Moreover, in an ideal world, there would exist an even distribution among archived 'omics datasets of node manipulations in organ biosamples. Financial realities dictate, however, that research is directed towards nodes that show the greatest apparent promise for improving human health, resulting in the skewing of research funding towards those molecules, and leaving other families of signaling nodes either poorly characterized or entirely unstudied. Other limitations of the consensomes relate to the design of available archived experiments. For example, certain targets may be regulated by a given node only under specific circumstances (e.g. acute BSM administration) and if such experiments do not exist or are unavailable, these targets would not rank highly in the corresponding node consensome. Moreover, a low ranking for a target in a consensome does not necessarily imply the complete absence of a regulatory relationship, and may reflect the requirement for a quite specific cellular context for such regulation to take place.
Finally, since SPP is based upon transcriptional methodologies, effects exerted by signaling pathway nodes at the protein level, such as enhanced stabilization or degradation of protein, or modulation of the rate of translation, will not be reflected at the mRNA level.
To maximize statistical power, consensomes in the initial version of SPP encompass datasets involving genetic and pharmacological manipulation of nodes within a given family, and many are possible only by incorporating datasets in biosamples representing numerous distinct organs. Future rates of dataset generation and archiving permitting, node- and organ-specific consensomes of reasonable statistical power will become possible, allowing for more detailed dissection of node- and tissue-specific patterns of transcriptional regulation. Validation of the consensomes relied heavily on use of evidence in the literature. In an ideal future, 'omics datasets and the literature would exist in a mutually enhancing relationship, the former providing researchers with insights that are limited in resolution but broad in scope, the latter providing the focused mechanistic and functional detail required to properly interpret and contextualize the node-target relationships. Paramount to such a scenario is equal ease of access to both the literature and 'omics datasets, such that hypotheses can be generated from 'omics datasets as readily and intuitively as abstracts can be accessed through literature search engines.
Moreover, in an era of tightening research budgets, there is a pressing responsibility on the biomedical research community to re-purpose existing assets to allow bench researchers to routinely generate future research hypotheses. An important next step therefore will be to establish interoperability between SPP and knowledgebases such as Reactome [2] that are based upon expert manual curation of the research literature. The high degree of orthogonality between such initiatives will afford users a more complete perspective on cellular signaling pathways than is currently possible.
[Figure legend fragment: see Table 4 for references. Abbreviations: AcCoA, acetyl-CoA; ApoB, apolipoprotein B; FFAs, free fatty acids; G3P, glycerol-3-phosphate; LPA, lysophosphatidic acid; UDP-Glc, uridine diphosphate glucose.]
[Table 4 fragment: Smith-Lemli-Opitz syndrome [105]; Other metabolic pathways]

Statistical analysis
Full descriptions of the statistical analyses for each experiment are included in the descriptions of those experiments below and in the Figure Legends. A full description of the statistical basis of the consensomes is included in File S1.

Data availability
All SPP datasets and consensomes are freely available on the SPP knowledgebase under a Creative Commons Attribution 3.0 license, which provides for sharing, adaptation and both noncommercial and commercial re-use, as long as the resource is cited.

Signaling Pathways Project web application
The Signaling Pathways Project knowledgebase is a gene-centric Java Enterprise Edition 6, web-based application around which other gene, mRNA, protein and BSM data from external databases are collected. All software is freely available at www.github.com/BCM-DLDCC/nursa.
After undergoing semiautomated processing and biocuration as described above, the data and annotations are stored in SPP's Oracle 12c database. RESTful web services expose SPP data, which are served to responsively designed views in the user interface, were created using a Flat  [19]). To resolve differences between these classifications with respect to the number of hierarchical tiers, and to facilitate the design of the data models, each was reduced to four levels (Category, Class, Family and Node) as shown in Table S1. Biosample category mappings were carried out as previously described [15]. To enhance the interoperability of SPP with other databases and pathway resources, small molecule-receptor mappings were based upon those maintained by the International Union of Pharmacology Guide to Pharmacology [17], a pharmacology community biocuration authority.
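The four-level classification (Category, Class, Family and Node) lends itself to a simple record type. The sketch below is one possible representation in Python and is not taken from the SPP code base, which is a Java application: the type name `PathwayNode`, the helper `nodes_by_family` and the category, class and family labels are illustrative placeholders rather than the values in Table S1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class PathwayNode:
    """One signaling pathway node placed in a four-level hierarchy."""
    category: str    # e.g. receptors, enzymes or transcription factors
    node_class: str  # hypothetical label for the second tier
    family: str      # family used to group experiments for a consensome
    node: str        # gene symbol of the individual node


def nodes_by_family(nodes: Iterable[PathwayNode]) -> Dict[str, List[str]]:
    """Group node symbols under their family, sorted for stable output."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in nodes:
        grouped[entry.family].append(entry.node)
    return {family: sorted(members) for family, members in grouped.items()}


# Illustrative entries; the class and family strings are placeholders.
example_nodes = [
    PathwayNode("Receptors", "Nuclear receptors", "Estrogen receptors", "ESR2"),
    PathwayNode("Receptors", "Nuclear receptors", "Estrogen receptors", "ESR1"),
    PathwayNode("Receptors", "Nuclear receptors", "Androgen receptor", "AR"),
]

families = nodes_by_family(example_nodes)
print(families)
```

Fixing the depth at four tiers means every node can be addressed by the same tuple of fields, regardless of how many tiers the source classification originally had.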

Consensomes
For cistromic consensomes, MACS2 peak calls from the ChIP-Atlas resource [22] for all nodes in a defined SPP family were averaged and the targets ranked based upon this value. For transcriptomic consensomes, differential expression values and associated significance measures were generated from appropriate experimental contrasts in GEO Series as previously described [112]. Consensomes were generated on a computer cluster and stored in the SPP Oracle 12c database.
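The averaging and ranking step for cistromic consensomes can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline described in File S1: each experiment is assumed to be a mapping from target gene to a MACS2-derived peak value, a target with no peak call in an experiment is assumed to contribute zero to the average, and the function name and toy values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cistromic_consensome(
    experiments: List[Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Average per-target peak values across experiments and rank targets.

    Assumption: a target absent from an experiment counts as 0 there,
    so the average is taken over all experiments in the family.
    """
    n_experiments = len(experiments)
    if n_experiments == 0:
        return []
    totals: Dict[str, float] = defaultdict(float)
    for experiment in experiments:
        for gene, peak_value in experiment.items():
            totals[gene] += peak_value
    averaged = {gene: total / n_experiments for gene, total in totals.items()}
    # Highest average first; ties broken alphabetically for determinism.
    return sorted(averaged.items(), key=lambda item: (-item[1], item[0]))


# Toy peak values for two experiments mapped to the same node family.
toy_experiments = [
    {"GeneA": 300.0, "GeneB": 100.0},
    {"GeneA": 100.0, "GeneC": 50.0},
]

ranking = cistromic_consensome(toy_experiments)
print(ranking)
```

Treating missing targets as zero penalizes genes bound in only a few experiments; averaging only over experiments in which a peak was called would be an alternative choice with different ranking behavior.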

Maintenance and versioning of consensomes
SPP is continually expanding its base of data points by adding newly biocurated datasets to the resource. Accordingly, a quarterly process identifies all node/family and biosample category combinations represented by datasets added in the previous quarter and calculates new versions of the corresponding consensomes. A statement above the scatterplot and contained in the associated spreadsheet identifies the specific combination of pathway node, biosample (physiological system and organ) and species represented by the consensome, the version and date stamp, and the total number of data points, experiments and datasets on which it is based.
μL. PCR amplification was carried out using the CFX384 qPCR system. Fold induction was calculated using the 2^(-ΔΔCt) method [113], and normalized to 36B4. All data shown is representative of at least three independent experiments. Primer sequences are shown in Table S4.
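For readers unfamiliar with the 2^(-ΔΔCt) calculation referred to above, a minimal worked sketch follows. The function name and the Ct values are hypothetical; the only assumptions are the standard form of the method, in which the target Ct is first normalized to a reference gene (36B4 in the experiments above) and then to the control condition.

```python
def fold_induction(
    ct_target_treated: float,
    ct_reference_treated: float,
    ct_target_control: float,
    ct_reference_control: float,
) -> float:
    """Relative expression of a target gene by the 2^(-ddCt) method."""
    # dCt: target Ct normalized to the reference gene within each condition.
    delta_ct_treated = ct_target_treated - ct_reference_treated
    delta_ct_control = ct_target_control - ct_reference_control
    # ddCt: treated condition relative to the control condition.
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Toy Ct values: the target crosses threshold 3 cycles earlier after
# treatment while the reference gene is unchanged.
example_fold = fold_induction(
    ct_target_treated=22.0,
    ct_reference_treated=18.0,
    ct_target_control=25.0,
    ct_reference_control=18.0,
)
print(example_fold)  # 8.0
```

The calculation assumes roughly 100% amplification efficiency for both target and reference primers, which is the usual premise of this method.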

Bench validation and characterization experiments
Subcellular distribution. MCF-7 cells were kept in 5% CD-CS for 48 h prior to treatment with 10 nM 17BE2 for 24 h. A previously published immunofluorescence protocol was followed [114].
Briefly, cells were fixed in 4% formaldehyde in PEM buffer (
Cell viability assay. On day 5, LNCaP cells in the 6-well plates were briefly trypsinized and collected. Cell viability was then determined using the CellTiter-Glo Luminescent Cell Viability Assay, following the manufacturer's instructions, and a Berthold 96-well plate reading luminometer. PRISM software was used for the statistical analyses.

Validation and characterization of Ppp1r3c in the GR mouse metabolic consensome
Hepa1c cells were grown in DMEM with 10% fetal bovine serum and penicillin, streptomycin and gentamycin (Life Technologies) and treated with vehicle (ethanol) or 250 nM DEX (Sigma) for 48 h. Cells were lysed in TRIzol and total RNA was purified using a PureLink RNA Kit. 250 μg of RNA was reverse transcribed into cDNA using a High Capacity cDNA Reverse Transcription Kit (Life Technologies). Genes were quantified using SYBR Green following the manufacturer's instructions on a QuantStudio 5 qPCR instrument (Applied Biosystems). Gene expression was normalized to an internal control (Rplp0; after evaluating several normalization genes to ensure they were unchanged by treatment). Each experiment was standardized to its own vehicle treatment. Primer sequences used are described in Table S4.

Validation and characterization of Prkab2 in the ERR mouse metabolic consensome
Animals. All animal protocols were approved by the Institutional Animal Care and Use Committee at City of Hope. The ERRα/Esrra-/- mice have been described and were maintained as a hybrid strain (C57BL/6/SvJ129) [117,118]. For baseline comparisons, littermate wild-type and ERRα/Esrra-/- mice were generated from heterozygous breeders to control for strain background. Skeletal muscle (quadriceps) was isolated from 12-week-old fed wild-type and ERRα/Esrra-/- mice during the daytime (1000 to 1200 h), flash frozen and stored at -80 °C until RNA isolation was performed.
described [120]. Transient transfection in C2C12 myocytes using the calcium phosphate method and the plasmid concentrations used have been described [79]. Luciferase activity was assayed in MB 48 h post-transfection or in day 4 MT after changing confluent cells to 2% HS/DMEM. To assess IGF1 activation, MB were changed to SFM -/+ 10 nM recombinant IGF1 one day following transfection and activities were measured after 24 h treatment. Luciferase activity was assayed using Dual-Glo reagents (Promega, Madison, WI) on a Tecan M200 plate reader (Männedorf, Switzerland). Firefly luciferase activity was normalized to that of Renilla luciferase, which was expressed downstream of the minimal thymidine kinase promoter from the pRL-TK-Renilla plasmid.
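The dual-luciferase normalization described above amounts to two successive ratios, sketched here for illustration. The function name and the luminescence readings are hypothetical, and expressing the result relative to a control well (empty vector or vehicle) is an assumption about how relative activity is reported.

```python
def relative_luciferase_activity(
    firefly: float,
    renilla: float,
    control_firefly: float,
    control_renilla: float,
) -> float:
    """Firefly/Renilla ratio of a sample, expressed relative to a control."""
    if renilla <= 0 or control_renilla <= 0:
        raise ValueError("Renilla readings must be positive to normalize.")
    # Step 1: correct each well for transfection efficiency.
    sample_ratio = firefly / renilla
    control_ratio = control_firefly / control_renilla
    if control_ratio == 0:
        raise ValueError("Control firefly/Renilla ratio must be non-zero.")
    # Step 2: express the sample relative to the control condition.
    return sample_ratio / control_ratio


# Toy luminescence readings in arbitrary units.
example_activity = relative_luciferase_activity(
    firefly=6000.0,
    renilla=200.0,
    control_firefly=1000.0,
    control_renilla=100.0,
)
print(example_activity)  # 3.0
```

Dividing by the Renilla signal first removes well-to-well differences in transfection efficiency before conditions are compared.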
Statistical analysis. All cell experiments were performed in three independent trials with 3 replicates per trial. Data are presented as mean (± S.E.M.) relative activity or expression normalized to control (empty vector or vehicle treated condition). Differences between mean values for luciferase activities and real-time PCR analysis were analyzed by a one-way ANOVA