Exploratory Gene Ontology Analysis with Interactive Visualization

Zhu, Junjie; Zhao, Qian; Katsevich, Eugene; Sabatti, Chiara

doi:10.1038/s41598-019-42178-x

Article
Open access
Published: 24 May 2019

Junjie Zhu ORCID: orcid.org/0000-0001-7177-8694¹,
Qian Zhao²,
Eugene Katsevich² &
…
Chiara Sabatti^2,3

Scientific Reports volume 9, Article number: 7793 (2019)

Abstract

The Gene Ontology (GO) is a central resource for functional-genomics research. Scientists rely on the functional annotations in the GO for hypothesis generation and couple it with high-throughput biological data to enhance interpretation of results. At the same time, the sheer number of concepts (>30,000) and relationships (>70,000) presents a challenge: it can be difficult to draw a comprehensive picture of how certain concepts of interest might relate with the rest of the ontology structure. Here we present new visualization strategies to facilitate the exploration and use of the information in the GO. We rely on novel graphical display and software architecture that allow significant interaction. To illustrate the potential of our strategies, we provide examples from high-throughput genomic analyses, including chromatin immunoprecipitation experiments and genome-wide association studies. The scientist can also use our visualizations to identify gene sets that likely experience coordinated changes in their expression and use them to simulate biologically-grounded single cell RNA sequencing data, or conduct power studies for differential gene expression studies using our built-in pipeline. Our software and documentation are available at http://aegis.stanford.edu.

Introduction

Since its inception, the Gene Ontology (GO)¹ has empowered analyses of high-throughput molecular data. Researchers often perform statistical tests using the GO to determine functional enrichments within their discoveries^2,3,4 and recently it has also been used to improve model architectures in machine learning applications^5,6. The GO hinges on two continuously evolving elements: 1) a collection of curated biological terms with semantic hierarchical relationships and 2) annotations that link genes and gene products to specific terms. When associated with a dataset, such as genes identified from differential gene expression testing⁷, a statistical testing strategy can assign each GO term an “enrichment score”, which measures how genes previously annotated with this term are enriched in the data. The large number of interrelated concepts⁸ and the evolving nature of the knowledge base, while assuring the richness of GO information, also represent a challenge in defining appropriate testing strategies⁹, interpreting and displaying results¹⁰, and ensuring their reproducibility¹¹.

Data visualizations, by illustrating the number of terms, rendering the relations between them and displaying term annotations, can alleviate some of the aforementioned issues. The information in GO is organized as a directed acyclic graph (DAG), where each node corresponds to a GO term and edges indicate relations such as “part of”, “a type of”, etc. It is then natural to rely on the DAG structure of the GO for information visualization (Supplementary Fig. 1). Indeed, there are a number of published tools that leverage the structure and can be broadly divided in two groups: ones that are specialized for local graph characteristics, and ones that are optimized for rendering global graph structures.

On the one hand, many existing GO tools focus on displaying a small subset of terms to reveal nuances of the hierarchical relationships. For instance, one of the most commonly-used web interfaces, QuickGO¹² allows querying and browsing a single GO term along with its related terms (and annotations) as a DAG. REVIGO¹³ can present multiple selected terms in a semantic graph via different representations (e.g., its scatter and table view, tree map view, and interactive graph view). Other tools feature displays where nodes and links can be retrieved with more interactive features^14,15,16, or where node enrichment scores can be highlighted with additional customization^17,18,19.

On the other hand, relatively few visualizations aim to simultaneously display the entire GO^20,21, despite extensive literature on large graph or network visualization^22,23,24,25,26. In contrast to small graph displays, these tools are typically not flexible enough to highlight node- or link-specific details because of the sheer number of visual elements. However, they can provide a global view of the graph, such as node clusters or overall hierarchies that are especially helpful for understanding trends of term relationships or enrichment scores in the GO.

Our new open-source software AEGIS (Augmented Exploration of the GO with Interactive Simulations) aims to bridge the merits of both visual approaches, and supports extensive interactions with the information coded in the ontology. We link representations of local and global structures within the GO DAG by adopting the focus-and-context framework, reminiscent of classical principles in visual information system design: overview first, zoom and filter, then details-on-demand²⁷. AEGIS capitalizes on the flexibility and power of efficient search routines to render dynamic displays for real-time interactions. Not only can the users interact with the interface to visualize their results that are output from existing pipelines with increasing degrees of specificity, they can also choose any vantage point to explore the GO with our visualization prior to collecting their data or running their pipelines. During the exploratory process, AEGIS allows them to extract biological information relevant for simulations and hypothesis generation, as well as power calculations for study design.

Results

To increase interpretability and facilitate user interactions with the GO, we reasoned that it is key to both effectively display granular information relative to a set of concepts of interest and provide a means to put them in the larger context of the ontology. AEGIS achieves this objective through focus and context graphs customized for DAGs. It couples a Sugiyama-style graph²⁸ (the focus graph) that renders the hierarchical structure of a selected subDAG with a silhouette view (the context graph) that provides an indication of the overall number of nodes that are similar to or related to the concepts in the subDAG.

Both the focus and the context graphs are customizable in real time thanks to our customized data structure and software system (Methods and Supplementary Figs 2–4). AEGIS complements the focus and context graphs with interactive highlighting options and layouts, including the buoyant layout, a graph drawing strategy that incorporates gene annotation information. The buoyant layout relies on a novel algorithm we developed, and improves the interpretation of the hierarchical levels in the GO: terms that are assigned to the same level share a similar number of annotated genes (Methods, Supplementary Algorithms, and Supplementary Note 1).

Unlike most GO visualizations that are used at the last step of the analysis to illustrate results, our visualization strategies allow researchers to explore the ontology and plan experiments prior to data acquisition. With interactive features complementing the focus and context graphs, they can generate biologically plausible simulated data and run power studies: in fact, interaction in AEGIS is not limited to customization of graph displays, but includes multiple methods to identify a set of concepts of interest, generate hypotheses, and select study parameters (Methods and Supplementary Note 2). Unique to AEGIS, the context graph can conveniently summarize the multiplicity of the tested hypotheses within the hierarchy, and the focus graph can magnify precise positions of the non-null hypotheses which researchers aim to discover. Subsequently, they can visualize the dependencies among the hypotheses and identify potential testing biases. As new testing strategies are emerging^29,30, these insights are valuable for benchmarking their performance and comparing their underlying assumptions.

We next resort to a number of biological examples that showcase the different aspects of our visualization and its applications. While the main results below are renderings of static displays, we refer the reader to detailed video tutorials and documentation of the examples (Supplementary Videos and Supplementary Note 3).

Focus and context graphs

The focus and context graphs are two juxtaposed representations tailored to balance the global structures and local details of a DAG with tens of thousands of nodes and links. The focus graph displays precise term relationships and annotation information via a bona fide graph representation following the Sugiyama style: nodes are arranged in different levels, with the roots at the top and a parent always above its children so that all links are directed downwards (Fig. 1, left-hand side). We will discuss in the next section the different options for assigning a node to a level, but it is important here to remark that levels are at the basis of the coupling of the focus and context displays. The context graph is represented by a bar chart (Fig. 1, right-hand side), capturing the node counts at each level of a larger reference DAG (by default the entire GO). The coupled displays allow the viewer to understand how many nodes are on the same level of a concept of interest, and how many are at higher or lower levels in the hierarchy.
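As a minimal sketch of how the bar-chart silhouette could be tabulated, assume every node already carries a level index. The function name `silhouette` and the toy term names below are hypothetical and are not part of AEGIS.

```python
from collections import Counter

def silhouette(levels, highlighted=()):
    """Return {level: (total_nodes, highlighted_nodes)} for a bar-chart context view."""
    totals = Counter(levels.values())
    marked = Counter(levels[n] for n in highlighted if n in levels)
    return {lvl: (totals[lvl], marked[lvl]) for lvl in sorted(totals)}

# Toy level assignment (hypothetical term names; level 0 is the root level).
levels = {"root": 0, "a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
bars = silhouette(levels, highlighted={"b", "d"})
```

Only counts are stored per level, which is why such a context view stays cheap even when the reference DAG has tens of thousands of nodes.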

While existing approaches that apply focus and context for generic graph visualizations^31,32 can, in principle, be used to visualize DAGs, their main function is to create graph distortion (also known as the fisheye view³³) in the node-link layout of their context to highlight their focus. In contrast, our silhouette representation of the context graph capitalizes on the inherent hierarchical structure of DAGs and yields quantitative level-by-level information.

To provide a concrete illustration, we used AEGIS to create a visualization of 27 significantly enriched GO terms for hair color phenotypes from a genome-wide association study³⁴ (Supplementary Note 3). By displaying the ancestral terms, it becomes apparent that some of these concepts are more closely related than others: the focus graph on the left-hand side of Fig. 1 identifies with roman numerals the most specific of the 27 terms, and groups together the concepts with a most recent common ancestor. Aside from stylistic variations and possible customizations, the focus graph resembles standard representations of the GO DAG^3,12,28. The context graph on the right-hand side, in contrast, is unique to AEGIS: the silhouette representation of the entire “biological process” ontology DAG (~15,000 terms) allows us to indicate the locations of the approximately 1,000 concepts related to the 27 query terms, highlighting, for example, how they include 907 descendants.

The focus and context graphs are achieved without overcrowding the display with numerous objects and without appreciably increasing the computing time. Consequently, the gains in computational and representational efficiency facilitate real-time interactions with the GO DAG. For example, an investigator can explore focus graphs anchored on different terms, or customize selections of context graphs (Methods and Supplementary Videos).

Buoyant layout that incorporates annotation

The visualization in Fig. 1 exemplifies a standard option for assigning a node to a level in a Sugiyama-style graph²⁸: the level represents a node’s longest distance to the roots. We refer to this leveled layout as the root-bound layout. Another standard option assigns each node a level based on its longest distance to the leaves; we call this the leaf-bound layout. Both options are available in AEGIS. The root-bound and leaf-bound layouts achieve spatial compactness by minimizing the number of levels, but also have their limitations. A viewer is naturally prone to interpret terms in the same level as sharing the same degree of specificity. This might be approximately true in some contexts, but it is rather misleading in the case of the GO DAG.
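The two standard level assignments can be sketched as longest-path computations on a toy DAG. The representation (a dict mapping each node to a tuple of its parents) and the function names are assumptions made for illustration only.

```python
from functools import lru_cache

def root_bound_levels(parents):
    """Level of a node = its longest distance to a root (roots get level 0)."""
    @lru_cache(maxsize=None)
    def depth(node):
        ps = parents[node]
        return 0 if not ps else 1 + max(depth(p) for p in ps)
    return {n: depth(n) for n in parents}

def leaf_bound_heights(parents):
    """Height of a node = its longest distance to a leaf (leaves get height 0)."""
    children = {n: [] for n in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)

    @lru_cache(maxsize=None)
    def height(node):
        cs = children[node]
        return 0 if not cs else 1 + max(height(c) for c in cs)
    return {n: height(n) for n in parents}

# Toy DAG with hypothetical node names: d has two parents (c and b).
toy = {"root": (), "a": ("root",), "b": ("root",), "c": ("a",), "d": ("c", "b")}
root_levels = root_bound_levels(toy)
leaf_heights = leaf_bound_heights(toy)
```

In this toy graph, node `b` sits one step below the root in the root-bound layout but one step above the leaves in the leaf-bound layout, so the two layouts place it on different rows even though the graph is the same.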

The ontology is created as a way of representing existing literature and the curation process focuses on term-to-term relationships, not the overall graph structure. While it is always true that descendant terms are more specific than their ancestors, the number of terms separating a concept from the root or leaf nodes is not a reliable indication of its specificity. Some biological processes have been studied in detail, so that a number of different terms exist to describe them, while others are known at a much less granular level. One can imagine a well-studied concept separated from the root node by 10 concepts, while a less-studied concept of comparable specificity might be only two nodes away from the root.

Fortunately, in the case of GO, we have another source of information on the level of specificity of a node: the number of genes that annotate it. AEGIS leverages this in a novel Sugiyama-style graph, the buoyant layout, which enjoys two properties: (1) a parent node is always placed at a level above its children; and (2) a node with fewer annotations is never placed higher than any node with more annotations (Methods and Supplementary Note 1).
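The two properties can be phrased as a small checker. This is only an illustrative sketch: the function name `is_buoyant`, the node names, and all annotation counts other than 448 and 164 are hypothetical, and the quadratic loop is written for readability rather than for a graph the size of the GO.

```python
def is_buoyant(levels, parents, sizes):
    """levels: node -> level index (0 = top); sizes: node -> number of annotated genes."""
    # Property 1: every parent sits strictly above each of its children.
    topological = all(levels[p] < levels[c] for c, ps in parents.items() for p in ps)
    # Property 2: no node with fewer annotations sits strictly above a larger node.
    nodes = list(levels)
    size_ordered = not any(
        sizes[u] < sizes[v] and levels[u] < levels[v] for u in nodes for v in nodes
    )
    return topological and size_ordered

# Toy DAG: "small" is close to the root but has few annotations.
parents = {"root": (), "big": ("root",), "mid": ("big",), "small": ("root",)}
sizes = {"root": 1000, "big": 448, "mid": 300, "small": 164}
root_bound = {"root": 0, "big": 1, "small": 1, "mid": 2}
buoyant = {"root": 0, "big": 1, "mid": 2, "small": 2}
```

Here the root-bound assignment fails the check because `small` (164 annotations) is drawn above `mid` (300 annotations), whereas the second assignment satisfies both properties.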

To contrast the renderings of the different layouts, we re-analyzed a chromatin immunoprecipitation sequencing (ChIP-seq) study^3,35 (Supplementary Note 3), and selected two significant terms to highlight their differences (Fig. 2). In the root-bound layout, for instance, the term “ruffle” (with 164 annotations) is at a higher level, and therefore appears to be more generic than the term “actin cytoskeleton” (with 448 annotations) only because a smaller number of concepts lie between “ruffle” and the root “cellular component”. The buoyant layout, in contrast, preserves a more interpretable node ordering based on the number of annotated genes. We also remark that the buoyant layout is particularly meaningful when coupled with the context graph: each level corresponds to a range of gene annotation counts (Supplementary Videos).

Interactive features and Gene Ontology navigation

By choosing which nodes to display in the focus graphs, against which context to interpret them, the user can navigate the GO and extract annotation information useful for simulations and study designs. This process is further enhanced with many auxiliary interactive features (Methods and Supplementary Videos). Here we start with showing how AEGIS facilitates extracting information from the GO for the purpose of simulations. We consider the task of generating artificial signals from single cell RNA sequencing (scRNA-seq), where one is interested in discovering sub-populations of cells that have distinct expression signatures. Rather than arbitrarily deciding which genes differ across simulated cell types^36,37, AEGIS allows a researcher to leverage the GO to identify a group of genes that are biologically related, specifying a signal structure that is realistic, and model a cellular process the researcher is specifically interested in.

We simulated the transcriptomic profiles of single cells that differ by cell cycle state, a common confounding factor in scRNA-seq data (Fig. 3). Previously, cell cycle genes have been collectively considered at the resolution of more generic GO terms such as the cell cycle process (GO:0022402)³⁸. With AEGIS we can explore and select more detailed sub-processes: specifically we identified three descendants: G1/S transition of mitotic cell cycle (GO:0000082), G2/M transition of mitotic cell cycle (GO:0000086), and mitotic cell cycle arrest (GO:0071850). We then select 10 genes among the ones annotating each of these terms and refer to these 30 genes as the “signature genes”. Using standard procedures (Supplementary Note 3), we can then generate expression data for 120, 150, 100, 300 cells for G1/S transition, G2/M transition, cell cycle arrest, and control, respectively. Each cell type has higher expression of its 10 signature genes, while the baseline cell type does not express any of these transition markers.
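The ground-truth design described above can be laid out as a label vector plus a boolean mask of elevated genes. This sketch covers only the design, not the expression model, which is described in Supplementary Note 3. The gene identifiers are placeholders, since the actual signature genes are not listed here.

```python
# Cell counts per simulated state, as stated in the text.
cell_counts = {"G1/S": 120, "G2/M": 150, "arrest": 100, "control": 300}

# Ten placeholder signature genes per non-control state (names are hypothetical).
signature = {
    state: [f"{state}_gene{i}" for i in range(10)]
    for state in ("G1/S", "G2/M", "arrest")
}
genes = [g for gene_list in signature.values() for g in gene_list]

# One label per cell, and one row per cell marking which genes are elevated.
labels = [state for state, n in cell_counts.items() for _ in range(n)]
elevated = [[g in signature.get(state, ()) for g in genes] for state in labels]
```

A count model can then raise the mean of the entries marked `True`; control cells have an all-`False` row because they express none of the signature genes.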

A biologically sound ground truth can help contrast alternative visualizations of the data, such as PCA³⁹ and t-SNE⁴⁰ (Fig. 3). As we expect, both methods organize the data points in a way consistent with the ground truth cell types, with better cell type segregation under t-SNE. However, our functionally related gene set uncovers how visualizations with PCA can offer additional insights: PCA-based biplots can indicate directions in the data dominated by individual genes. These “gene directions” can in turn suggest “functional directions”, i.e., annotated gene sets that collectively align in the subspaces spanned by the top principal components.

Power analysis for GO enrichment testing

Study design is one of the contexts where it is most useful for an investigator to interact with GO visualizations and AEGIS further enables them with a built-in power simulator, filling in a surprising gap in the literature of enrichment tests^2,18,41,42 (Methods). While enrichment hypotheses are stated and tested at the level of GO terms, their truth status depends on the expression of the genes associated to the concepts: often the same gene contributes to the annotation of multiple GO terms, so there are logical dependencies between enrichment hypotheses. Without actual access to the GO structure, it is difficult to simulate these dependencies, which have important consequences for power and interpretability of results.

AEGIS’ pipeline aids the investigator in multiple ways (Supplementary Fig. 5): 1) By allowing navigation of the GO, it enables the investigator to select ontology terms that are representative of the biological process under study and determine that all or some of the genes in their annotation are differentially expressed—thereby anchoring the power study to biologically relevant hypotheses. 2) Once differentially expressed genes are specified, AEGIS identifies all the non-null enrichment hypotheses, according to one of two standard definitions (e.g., ‘competitive’ or ‘self-contained’⁹, described in Supplementary Note 2), allowing the investigator to evaluate the specificity of the hypotheses tested. 3) The built-in power simulator generates gene expression datasets using the sample size and signal strength specified, carries out testing with a choice of testing and multiplicity correction strategies, and computes power and False Discovery Rate (FDR). 4) The power results are displayed with reference to the relevant GO subDAG, using binder plots. By organizing nodes linearly while still displaying hierarchical information, this visualization facilitates the display of additional node-specific information such as term names, bar plots or heatmaps for the summary statistics from the power analysis. The order of nodes is based on the number of annotations (following the buoyant layout), and edges are rendered with curves that preserve Sugiyama-style graph drawing rules.
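Step 3 reduces to comparing the rejected terms with the known non-null terms in each simulated dataset. The sketch below shows one plausible way to compute these summaries. The helper names are hypothetical, and the FDR is estimated as the average false discovery proportion over repeated simulations.

```python
def power_and_fdp(rejected, non_null):
    """Return (power, false discovery proportion) for one simulated dataset."""
    rejected, non_null = set(rejected), set(non_null)
    true_pos = len(rejected & non_null)
    power = true_pos / len(non_null) if non_null else float("nan")
    fdp = (len(rejected) - true_pos) / max(len(rejected), 1)
    return power, fdp

def summarize(runs, non_null):
    """Average power and FDP over a list of rejection sets (one per simulation)."""
    results = [power_and_fdp(r, non_null) for r in runs]
    n = len(results)
    return sum(p for p, _ in results) / n, sum(f for _, f in results) / n

# Toy example with hypothetical term labels.
non_null = {"a", "b", "c", "d"}
one_run = power_and_fdp({"a", "b", "x"}, non_null)
avg_power, est_fdr = summarize([{"a", "b", "x"}, {"a", "b", "c", "d"}], non_null)
```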

As an example, we conducted a power analysis to guide a study with the goal of confirming the enrichment of heart-specific developmental processes (Supplementary Note 3). We study how power changes with sample size using two different strategies for formulating and testing the enrichment hypotheses: competitive nulls with hypergeometric test, and self-contained nulls with p-values obtained with Simes’ combination rule (Fig. 4 and Supplementary Note 2).

As previously documented⁹, increasing the number of samples leads to monotonically higher power for self-contained tests, but can have limited impact on competitive tests that are based on a “gene sampling” (rather than subject sampling) strategy. On the other hand, inspection of the binder plots reveals that the loss of power of competitive/hypergeometric tests is mostly due to not rejecting rather non-specific hypotheses: they do single out “heart development” (where the signal was planted) and “adult heart development” (whose annotation is very similar).

We note that this strategy can be used to benchmark multiplicity-correction procedures, and we documented the results of a comparison of DAGGER²⁹, a recent sequential testing method, and the Benjamini-Hochberg procedure (Supplementary Note 3).

Discussion

AEGIS employs multiple strategies to navigate the GO and present GO-based discoveries: the focus-and-context graph, the buoyant layout, and the binder plot are novel visualizations that leverage graph-theoretic properties of the GO DAG, extensive customization and interactive displays allow user engagement, and built-in power simulations facilitate study design.

AEGIS is available both as a web-based application for simple GO exploration, and a downloadable software package for hypothesis testing and further customization. Currently, version-controlled visualizations rendered via AEGIS can be downloaded as vector graphics, such that attributes such as colors, annotations and node positions can be customized via common editing tools. The computation back-end and visualization front-end, along with online documentation, tutorials, and videos, are open-source and portable. As such, existing GO analysis pipelines can employ the functionalities of AEGIS to present and disseminate their results. For example, our current pipeline is compatible with emerging GO testing strategies and multiple comparison procedures^3,29,30, and it also has the flexibility to support the analyses of a broader range of data input types with these options. Moreover, the visualization strategies can potentially be transferred to other ontologies³, such as the Plant Ontology and Human Disease Ontology. For ontologies where annotations are not limited to gene sets, we can extend the buoyant layout to encode other side information associated with each term to render interpretable leveled graph layouts using their specific constraints.

Last but not least, the current visualization framework was developed with anecdotal inputs from several researchers who have extensively performed GO analysis; we hope more users can evaluate the program on different web platforms and provide us feedback through a user survey (available on our documentation website). The results from the survey could help improve the front-end design of AEGIS, allow us to include more interactions, and thus extend its usefulness to other kinds of users.

Methods

GO data preprocessing

The analysis in this manuscript is based on the GO version: release-2018-06-21 (format version: 1.2) with gene annotations downloaded from the National Center for Biotechnology Information (NCBI) gene2go table (ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz, release-2018-06-21). The open-source software goatools (https://github.com/tanghaibao/goatools) was used to parse these files. For the associations, all evidence types are included as default³. All relationships (e.g., “is a”, “part of”, “has part”, or “regulates”) for the GO terms were used for the cellular component ontology, but only “is a” relationships were kept for the “biological process” ontology.

To handle the redundancy of GO terms during hypothesis testing, we developed a DAG refinement algorithm, which also preserves the semantic relationships in the original DAG (Supplementary Algorithm 1). Essentially, the algorithm iteratively removes a parent if one of its children is annotated to the exact same gene set. DAG refinement is optional for visualization but is strongly recommended for hypothesis testing, because if a parent and a child share the exact same gene sets, the corresponding test statistics would be identical and unnecessarily increase the burden of testing multiplicity. Further, the user can optionally filter the node-size range to remove overly generic or specific terms, a function commonly used in gene set testing tools³. We found that other works rely on DAG reduction approaches for similar reasons^3,6; however, we were unable to identify a full algorithmic description or justification of the process. Thus, our algorithmic description is also accompanied by its theoretical justification (Supplementary Note 1).
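The refinement idea can be sketched as follows. This is not Supplementary Algorithm 1 itself but a simple quadratic-time illustration. It assumes a parent map from node to a set of parents and reconnects the children of a removed parent to that parent's own parents, so that ancestor relationships survive.

```python
def refine_dag(parents, genes):
    """Remove any parent whose gene set equals that of one of its children.

    parents: node -> set of parent nodes; genes: node -> frozenset of annotated genes.
    Returns a new parent map; the inputs are not modified.
    """
    parents = {n: set(ps) for n, ps in parents.items()}
    changed = True
    while changed:
        changed = False
        for child, ps in list(parents.items()):
            redundant = next((p for p in ps if genes[p] == genes[child]), None)
            if redundant is None:
                continue
            grandparents = parents.pop(redundant)
            # Reattach every child of the removed node to the removed node's parents.
            for qs in parents.values():
                if redundant in qs:
                    qs.discard(redundant)
                    qs |= grandparents
            changed = True
            break  # the map changed, so restart the scan
    return parents

# Toy example (hypothetical names): "a" and its child "b" share the same gene set.
toy_parents = {"root": set(), "a": {"root"}, "b": {"a"}, "c": {"a"}}
toy_genes = {
    "root": frozenset({"g1", "g2", "g3"}),
    "a": frozenset({"g1", "g2"}),
    "b": frozenset({"g1", "g2"}),
    "c": frozenset({"g1"}),
}
refined = refine_dag(toy_parents, toy_genes)
```

After refinement the toy graph keeps `b` and `c` as direct children of `root`, so only one test is run for the gene set shared by `a` and `b`.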

Customization of the context graph

The context graph is the entire GO by default, but it can be narrowed to a subDAG by specifying a set of context anchors: any GO terms can be chosen as anchors, playing one of three roles in identifying the subDAG: root, leaf or waypoint. Selecting root anchors will result in a DAG that includes all their descendants; leaf anchors result in a DAG that only includes their ancestors; and waypoint anchors result in a DAG that includes both their descendants and ancestors (Supplementary Fig. 2). This procedure can be seen as a way to eliminate nodes that have no semantic relationship with the context anchors. The induced subDAG can be further processed using DAG refinement.
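The three anchor roles can be illustrated with a short traversal sketch. The function names are hypothetical. Taking the union of the node sets contributed by different anchors is an assumption of this sketch, since the text does not specify how multiple anchors are combined.

```python
def reachable(start, edges):
    """All nodes reachable from `start` by following `edges` (node -> set of neighbours)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def induced_subdag(parents, roots=(), leaves=(), waypoints=()):
    """Keep anchors plus descendants (roots), ancestors (leaves), or both (waypoints)."""
    children = {n: set() for n in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)
    keep = set()
    for r in roots:
        keep |= {r} | reachable(r, children)
    for leaf in leaves:
        keep |= {leaf} | reachable(leaf, parents)
    for w in waypoints:
        keep |= {w} | reachable(w, children) | reachable(w, parents)
    # Drop edges that point to nodes outside the kept set.
    return {n: parents[n] & keep for n in parents if n in keep}

# Toy DAG with hypothetical names; "d" has two parents.
dag = {"root": set(), "a": {"root"}, "b": {"root"}, "c": {"a"}, "d": {"a", "b"}}
as_root = induced_subdag(dag, roots=["a"])
as_leaf = induced_subdag(dag, leaves=["d"])
as_waypoint = induced_subdag(dag, waypoints=["a"])
```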

Once the context is specified, the focus graph can be similarly specified with focus anchors, which can be defined as waypoint or leaf. If these anchors are not specified, the default is to use the context anchors as leaf anchors. To make sure that the number of nodes in the focus graph is small enough to allow interaction, two constraints are implemented on the sizes of the displayed subDAG: 1) we limit the number of focus anchors; and 2) descendants beyond the immediate children of the focus anchors are displayed only when the size of the resulting focus graph is predicted to be smaller than a threshold. Both these parameters can be adjusted.

AEGIS supports three layouts for the focus graph: root-bound, leaf-bound, or buoyant. The first two are standard topological layouts²⁸, while the buoyant layout is new and described below. Given a context graph, one can change the focus anchors in the focus graph in multiple interactive ways, including key word search, node selection, or random draws of nodes from a specific level (Supplementary Videos).

Buoyant layout and the bubble float algorithm

For graphical displays of the GO DAG, a buoyant layout requires two constraints: each parent node should be placed at a higher level than all of its children (topological constraint) and a node annotated with a smaller number of genes should never be placed higher than any larger node (descending node size constraint). The bubble float algorithm aims to generate the buoyant layout based on the two constraints. It starts by initializing the layout by lexicographically sorting all the nodes according to (1) descending ordering of the node size and (2) ascending ordering of the longest distance to the root. The remainder of the algorithm consists in repeated applications, starting from the bottom of the graph, of a float operation that merges multiple nodes into a layer so long as the two constraints remain satisfied. We show that the bubble float algorithm satisfies both the topological constraint and the descending node size constraint during initialization and at each iteration. In addition, the necessary and sufficient conditions of the buoyant layout are proven theoretically, such that it is guaranteed to exist for any GO DAG (Supplementary Note 1).
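To make the two constraints concrete, here is a simplified greedy construction. It is not the bubble float algorithm of the Supplementary Algorithms: it uses the same lexicographic initial ordering but then fills levels from the top, opening a new level whenever the next node has a parent in the current one. It assumes a parent never has fewer annotated genes than its child, which holds when annotations propagate to ancestors. All names are hypothetical.

```python
def greedy_buoyant_levels(parents, sizes, depth):
    """Assign levels (0 = top) satisfying the topological and descending-size constraints.

    parents: node -> tuple of parents; sizes: node -> number of annotated genes;
    depth: node -> longest distance to the root.
    """
    # Initial ordering: larger nodes first, ties broken by smaller depth.
    order = sorted(parents, key=lambda n: (-sizes[n], depth[n]))
    levels, current, level = {}, set(), 0
    for node in order:
        # A node may not share a level with one of its parents: open a new level.
        if any(p in current for p in parents[node]):
            level += 1
            current = set()
        levels[node] = level
        current.add(node)
    return levels

# Toy DAG (hypothetical names and sizes).
toy_parents = {"root": (), "a": ("root",), "b": ("root",), "c": ("a",), "d": ("b",)}
toy_sizes = {"root": 100, "a": 60, "b": 10, "c": 30, "d": 5}
toy_depth = {"root": 0, "a": 1, "b": 1, "c": 2, "d": 2}
toy_levels = greedy_buoyant_levels(toy_parents, toy_sizes, toy_depth)
```

Because each level is a contiguous block of the size-sorted order, no smaller node can end up above a larger one. In the toy graph this uses four levels where the root-bound layout would use three.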

The resulting number of levels of the buoyant layout is always equal to or greater than that of the root-bound layout (Fig. 2), because the latter only requires the topological constraint. Particular to the GO, this can be advantageous when too many nodes are displayed and could be more spread out by levels. Thus, we also provide a parameter to control the maximum span of node sizes on each level (Supplementary Algorithm 2).

Once the level of each node in the DAG is determined, what remains is the ordering of the nodes in each layer for the focus graph. Computing the optimal ordering of nodes in each level that minimizes edge crossings between layers reduces to the problem of bipartite crossing minimization, which is unfortunately NP-hard⁴³. Yet, there exist many graph drawing heuristics, which we build upon to determine the layout of the focus graph: we group anchors together if they are semantically related, add exclusive ancestor and descendant nodes to the respective groups, and separate these groups of nodes during display (Fig. 1, Supplementary Algorithm 3). The grouping delivers a clearer interpretation of the focus anchors and their relatives. In contrast to the focus graph layout, the rendering of the bar layout of the context graph does not require node ordering at each layer. Because the representation only concerns the node counts, the computation cost for the context graph is negligible.

Software architecture

The specifications of the context and focus graphs, choices of different layouts, and states in the interaction process can be combinatorial. To efficiently support the graphical rendering and interaction, we created a new data hierarchy: a data object includes 1) primary attributes that are invariant across any context and focus representations; 2) secondary attributes that can vary under different choices of the context graph, but are invariant to what the focus graph is; 3) tertiary attributes that can vary with the choice of the focus graph and interactions on the graph. All these attributes collectively define how an object is displayed on the front-end interface, but, because some of the attributes can be expensive to compute, we use this hierarchy to reduce the space and time needed for computations (Supplementary Fig. 3). For example, a term node displayed in the GO can contain several attributes in the front-end display system, including:

1.
Its gene annotations, p-values computed from statistical tests, or node-specific power;
2.
Its hierarchical level in the context graph in root-bound, leaf-bound, and buoyant layouts; and
3.
Its horizontal and vertical coordinates in the focus graph display.

During each frame of rendering in the web browser (due to user interaction or animation for the focus graph), Attribute 3 is more frequently updated on the client-side, but does not require any communication with the back-end server. The heavier computation needed for Attribute 2 takes place on the server-side because it depends on the entire GO structure, but, because the context is typically updated less frequently, the total communication cost is insignificant. Further, computations for Attribute 1 do not change with context or focus graph updates and do not need to be recomputed for layout and interaction purposes, so they can be seen as pre-computations prior to rendering the visualization. As such, statistical results, especially power calculations, in Attribute 1 are computed on the server for each experiment setup, and then displayed on the client for interactive interpretation (as shown in Supplementary Videos). Similarly, updating gene annotations for a specific node requires significant data communication with the GO and gene annotation databases, as well as additional text file parsing (which can take up to minutes depending on Internet connection), but, fortunately, this computation only occurs once when the server starts running the program.
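The three-tier attribute hierarchy might be modelled as below. This is a hypothetical sketch, not the AEGIS data classes: the class name, field names, and update methods are invented to show how a context change invalidates focus-level state while leaving the primary attributes untouched.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Tuple

@dataclass
class TermNode:
    term_id: str
    # Primary: invariant across context and focus choices (computed once).
    genes: FrozenSet[str] = frozenset()
    p_value: Optional[float] = None
    # Secondary: depends on the context graph, keyed by layout name.
    context_level: Dict[str, int] = field(default_factory=dict)
    # Tertiary: depends on the focus graph and on user interaction.
    xy: Optional[Tuple[float, float]] = None

    def update_context(self, levels: Dict[str, int]) -> None:
        """A new context invalidates the focus-level coordinates."""
        self.context_level = dict(levels)
        self.xy = None

    def update_focus(self, x: float, y: float) -> None:
        """Cheap update that touches only the tertiary attribute."""
        self.xy = (x, y)

node = TermNode("GO:0000082", genes=frozenset({"g1", "g2"}), p_value=0.01)
node.update_context({"root_bound": 4, "buoyant": 7})  # level values are made up
node.update_focus(120.0, 35.5)
node.update_context({"root_bound": 3, "buoyant": 5})
```

The design choice illustrated here is that frequent updates only ever rewrite the cheapest tier, while the expensive primary attributes are set once and never recomputed.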

To systematically integrate this data hierarchy, we created new software libraries to define the data classes and perform the core algorithms (Supplementary Fig. 4). We also utilized state-of-the-art Python packages to query and parse GO ontology databases, and incorporated Flask, a Python micro-framework to deliver server-side computations. All of the user interactions were developed based on commonly-used JavaScript packages: jQuery.js and D3.js to support lightweight front-end computations on the client side. Despite the size of a DAG with tens of thousands of nodes, our computational framework and system architecture offer the advantage of minimal time and space required for interactions with the full-scale data. As a result, our application does not require expensive resources to perform all the computations in the background, and reduces the need to be hosted as a centralized web portal. All the development and demonstrations were performed on a MacBook Pro laptop with a 2.8 GHz i7 processor with 8 Virtual Cores and 16GB RAM. The interaction updates in the front-end interface each take less than 0.1 seconds.

Power analysis workflow

The power analysis for GO enrichment testing in RNA-seq studies requires three ingredients: (1) the ground truth: the genes defined a priori as truly differentially expressed, together with whether each GO term is a true null or a non-null hypothesis; (2) the data: the gene expression matrix and the term-specific p-values; and (3) the statistical procedure: the testing methods used to select which terms are significant. The selected ground-truth genes determine whether a GO term is a true null hypothesis, and there are two distinct types of null hypothesis to consider: the self-contained null and the competitive null⁹. We say that a GO term is a self-contained null if no gene associated with the term is truly differentially expressed; a term is a competitive null if the genes associated with the term are at most as often differentially expressed as the genes not associated with it (Supplementary Note 2).
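
The two definitions can be made concrete with a short sketch. The function names and the ten-gene example are hypothetical and are not taken from AEGIS; the competitive comparison is written with cross-multiplied counts so that no division by an empty set can occur.

```python
def is_self_contained_null(term_genes, de_genes):
    """True if no gene annotated to the term is truly differentially expressed."""
    return not (set(term_genes) & set(de_genes))


def is_competitive_null(term_genes, de_genes, all_genes):
    """True if the DE fraction inside the term is at most the DE fraction outside.

    Compares |term & DE| / |term| <= |rest & DE| / |rest| by cross-multiplying.
    """
    term = set(term_genes)
    rest = set(all_genes) - term
    de = set(de_genes)
    return len(term & de) * len(rest) <= len(rest & de) * len(term)


# Hypothetical ground truth: 10 genes, 3 of them truly differentially expressed.
all_genes = {f"g{i}" for i in range(1, 11)}
de_genes = {"g1", "g2", "g3"}
terms = {
    "T1": {"g1", "g2", "g4"},              # 2 of 3 DE inside, 1 of 7 outside
    "T2": {"g3", "g5", "g6", "g7", "g8"},  # 1 of 5 DE inside, 2 of 5 outside
    "T3": {"g9", "g10"},                   # no DE gene inside
}
for name, genes in terms.items():
    print(
        name,
        is_self_contained_null(genes, de_genes),
        is_competitive_null(genes, de_genes, all_genes),
    )
```

In this example T2 is a non-null under the self-contained definition (it contains a differentially expressed gene) but a null under the competitive definition, which shows that the two ground truths can disagree for the same term.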

To generate the data, AEGIS takes the selected truly differentially expressed genes, the effect size β_effect and the sample sizes n_control and n_case as parameters, and generates g-dimensional gene expression matrices for the control samples and the case samples. For simplicity, the gene expression values for the controls are sampled from N(0, I), a multivariate normal distribution with a length-g zero mean vector and a covariance matrix equal to the identity matrix; the cases are similarly sampled from N(µ, I), where µ is a vector whose entries equal β_effect at the truly differentially expressed genes and 0 otherwise. This vanilla model can be extended to more complicated scenarios: for example, the single cell RNA-seq simulation uses the negative binomial model instead and imposes additional zero values in the expression matrix, also known as “drop-outs” (Supplementary Note 3). Further, the assumption that the gene measurements are uncorrelated can be relaxed given knowledge of platform-specific noise patterns.
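
A minimal sketch of the vanilla model follows. The function name `simulate_expression`, the row-per-sample layout and the seeding are assumptions made for illustration; only the distributions N(0, I) and N(µ, I) come from the description above. Because the covariance is the identity, each entry can be drawn independently.

```python
import random


def simulate_expression(n_genes, de_idx, beta_effect, n_control, n_case, seed=0):
    """Draw control rows from N(0, I) and case rows from N(mu, I).

    mu equals beta_effect at the indices in de_idx and 0 elsewhere.
    Rows are samples and columns are genes (a layout chosen for this sketch).
    """
    rng = random.Random(seed)
    de_idx = set(de_idx)
    mu = [beta_effect if j in de_idx else 0.0 for j in range(n_genes)]
    # Identity covariance: every entry is an independent unit-variance normal.
    control = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
               for _ in range(n_control)]
    case = [[rng.gauss(mu[j], 1.0) for j in range(n_genes)]
            for _ in range(n_case)]
    return control, case, mu


def column_mean(matrix, j):
    """Average expression of gene j across the samples in `matrix`."""
    return sum(row[j] for row in matrix) / len(matrix)


control, case, mu = simulate_expression(
    n_genes=6, de_idx=[0, 1], beta_effect=5.0, n_control=200, n_case=200
)
print(mu)
print(round(column_mean(case, 0), 2), round(column_mean(case, 5), 2))
```

With a large effect size the case mean of a differentially expressed gene sits near β_effect, while the mean of a null gene stays near zero.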

For testing, a total of g gene p-values are obtained from two-sample t-tests; these are the ingredients used to determine the GO term-specific p-values. For the self-contained hypothesis, Simes’ combination rule aggregates the p-values of all genes annotated to a GO term to obtain the term p-value; for the competitive hypothesis, the hypergeometric test first selects significant genes using a gene-wide significance threshold, and then tests whether these genes are independent of those annotated to a specific GO term (Supplementary Note 2). In either case, we control the false discovery rate at level q using the Benjamini-Hochberg (BH) procedure. Note that this pipeline is modifiable: other methods, such as Bonferroni correction or DAGGER, can replace BH for multiplicity correction (Supplementary Note 3).
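
The three standard procedures named above can be sketched in a few lines of standard-library Python. These are textbook versions written for illustration, not the AEGIS code; the function names are hypothetical, and the one-sided (enrichment) form of the hypergeometric test is an assumption, since the text does not state the direction of the test.

```python
from math import comb


def simes(pvals):
    """Simes' combined p-value: min over i of n * p_(i) / i, capped at 1."""
    p = sorted(pvals)
    n = len(p)
    return min(1.0, min(n * p_i / i for i, p_i in enumerate(p, start=1)))


def hypergeom_upper_tail(k, n_total, n_annotated, n_selected):
    """P(X >= k) when n_selected genes are drawn without replacement from
    n_total genes, of which n_annotated belong to the term."""
    hi = min(n_annotated, n_selected)
    num = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, n_selected - i)
        for i in range(k, hi + 1)
    )
    return num / comb(n_total, n_selected)


def benjamini_hochberg(pvals, q):
    """Indices rejected by the BH step-up procedure at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if pvals[i] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])


term_p = simes([0.01, 0.04, 0.5])
tail_p = hypergeom_upper_tail(k=2, n_total=10, n_annotated=3, n_selected=3)
rejected = benjamini_hochberg([0.5, 0.001, 0.04, 0.02], q=0.05)
print(term_p, tail_p, rejected)
```

Swapping `benjamini_hochberg` for another multiplicity correction leaves the rest of the pipeline unchanged, which is the sense in which the procedure is modular.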

Without loss of generality, let A₁ be the set of non-null terms for a particular type of null hypothesis, and let R be the set of terms rejected by the appropriate procedure. Then, the (empirical) power and false discovery proportion (FDP) are defined as follows:

$$\mathrm{Power}=\frac{|A_{1}\cap R|}{\max\{|A_{1}|,1\}},\qquad \mathrm{FDP}=\frac{|R\setminus A_{1}|}{\max\{|R|,1\}}$$

where |·| denotes the size of a set, and \ denotes set difference. For a particular power calculation, we fix the ground truth and repeat the data generation process to obtain multiple realizations of Power and FDP, in order to estimate the power as well as the false discovery rate (FDR) of detecting the true non-nulls.
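
The two quantities, and their averages over repeated simulations, can be computed directly from sets. The sketch below uses hypothetical function names and a hand-written list of rejection sets in place of real simulation output.

```python
def power_and_fdp(non_null, rejected):
    """Empirical power and false discovery proportion for one realization."""
    a1, r = set(non_null), set(rejected)
    power = len(a1 & r) / max(len(a1), 1)
    fdp = len(r - a1) / max(len(r), 1)
    return power, fdp


def average_over_replicates(non_null, rejection_sets):
    """Average Power and FDP over replicates that share one ground truth.

    The averaged FDP is an estimate of the FDR.
    """
    results = [power_and_fdp(non_null, r) for r in rejection_sets]
    n = len(results)
    mean_power = sum(p for p, _ in results) / n
    mean_fdp = sum(f for _, f in results) / n
    return mean_power, mean_fdp


non_null = {"t1", "t2", "t3", "t4"}
replicates = [
    {"t1", "t2", "t5"},        # two true discoveries, one false discovery
    set(),                     # nothing rejected
    {"t1", "t2", "t3", "t4"},  # every non-null found, no false discovery
]
print(power_and_fdp(non_null, replicates[0]))
print(average_over_replicates(non_null, replicates))
```

The `max(..., 1)` terms in the denominators make both quantities zero, rather than undefined, when there are no non-null terms or no rejections.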

References

1. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25 (2000).
2. Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102, 15545–15550 (2005).
3. McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nature Biotechnology 28, 495 (2010).
4. Schmid, P. R., Palmer, N. P., Kohane, I. S. & Berger, B. Making sense out of massive data by going beyond differential expression. Proceedings of the National Academy of Sciences 109, 5594–5599 (2012).
5. Kulmanov, M., Khan, M. A. & Hoehndorf, R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 34, 660–668 (2018).
6. Ma, J. et al. Using deep learning to model the hierarchical structure and function of a cell. Nature Methods 15, 290 (2018).
7. Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology 31, 46 (2012).
8. Smith, B. et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25, 1251 (2007).
9. Goeman, J. J. & Bühlmann, P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23, 980–987 (2007).
10. Tomczak, A. et al. Interpretation of biological experiments changes with evolution of the Gene Ontology and its annotations. Scientific Reports 8, 5115 (2018).
11. Haynes, W. A., Tomczak, A. & Khatri, P. Gene annotation bias impedes biomedical research. Scientific Reports 8, 1362 (2018).
12. Binns, D. et al. QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics 25, 3045–3046 (2009).
13. Supek, F., Bošnjak, M., Škunca, N. & Šmuc, T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS ONE 6, e21800 (2011).
14. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 13, 2498–2504 (2003).
15. Sealfon, R. S. G., Hibbs, M. A., Huttenhower, C., Myers, C. L. & Troyanskaya, O. G. GOLEM: an interactive graph-based gene-ontology navigation and analysis tool. BMC Bioinformatics 7, 443 (2006).
16. Hinderer, E. W., Flight, R. M. & Moseley, H. N. B. GOcats: A tool for categorizing Gene Ontology into subgraphs of user-defined concepts. bioRxiv (2018).
17. Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10, 48 (2009).
18. Wang, J., Vasaikar, S., Shi, Z., Greer, M. & Zhang, B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Research 45, W130–W137 (2017).
19. Bindea, G. et al. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics 25, 1091–1093 (2009).
20. Pareja-Tobes, P., Tobes, R., Manrique, M., Pareja, E. & Pareja-Tobes, E. Bio4j: a high-performance cloud-enabled graph-based data platform. bioRxiv (2015).
21. Heberle, H., Carazzolle, M. F., Telles, G. P., Meirelles, G. V. & Minghim, R. CellNetVis: a web tool for visualization of biological networks using force-directed layout constrained by cellular components. BMC Bioinformatics 18, 395 (2017).
22. Bastian, M., Heymann, S. & Jacomy, M. Gephi: An Open Source Software for Exploring and Manipulating Networks. In International AAAI Conference on Weblogs and Social Media (2009).
23. Merico, D., Gfeller, D. & Bader, G. D. How to visually interpret biological data using networks. Nature Biotechnology 27, 921 (2009).
24. van Ham, F. & Perer, A. Search, Show Context, Expand on Demand: Supporting Large Graph Exploration with Degree-of-Interest. IEEE Transactions on Visualization and Computer Graphics 15, 953–960 (2009).
25. Gehlenborg, N. et al. Visualization of omics data for systems biology. Nature Methods 7, S56 (2010).
26. Baryshnikova, A. Systematic Functional Annotation and Visualization of Biological Networks. Cell Systems 2, 412–421 (2016).
27. Shneiderman, B. The eyes have it: A task by data type taxonomy for information visualizations. In The Craft of Information Visualization, 364–371 (Elsevier, 2003).
28. Sugiyama, K., Tagawa, S. & Toda, M. Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics 11, 109–125 (1981).
29. Ramdas, A., Chen, J., Wainwright, M. J. & Jordan, M. I. DAGGER: A sequential algorithm for FDR control on DAGs. arXiv preprint arXiv:1709.10250 (2017).
30. Kerepesi, C., Daróczy, B., Sturm, A., Vellai, T. & Benczúr, A. Prediction and characterization of human ageing-related proteins by using machine learning. Scientific Reports 8, 4094 (2018).
31. Jankun-Kelly, T. J. & Ma, K.-L. MoireGraphs: radial focus + context visualization and interaction for graphs with visual nodes. In IEEE Symposium on Information Visualization 2003 (IEEE Cat. No. 03TH8714), 59–66 (2003).
32. Du, F., Cao, N., Lin, Y.-R., Xu, P. & Tong, H. iSphere: Focus + context sphere visualization for interactive large graph exploration. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2916–2927 (ACM, 2017).
33. Sarkar, M. & Brown, M. H. Graphical fisheye views of graphs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 83–91 (ACM, 1992).
34. Hysi, P. G. et al. Genome-wide association meta-analysis of individuals of European ancestry identifies new loci explaining a substantial fraction of hair color variation and heritability. Nature Genetics 50, 652–656 (2018).
35. Valouev, A. et al. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature Methods 5, 829 (2008).
36. Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nature Methods 15, 539–542 (2018).
37. Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods 15, 255 (2018).
38. Buettner, F. et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology 33, 155 (2015).
39. Jolliffe, I. Principal component analysis. In International Encyclopedia of Statistical Science, 1094–1096 (Springer, 2011).
40. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008).
41. Mi, H. et al. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Research 45, D183–D189 (2017).
42. Reimand, J. et al. g:Profiler - a web server for functional interpretation of gene lists (2016 update). Nucleic Acids Research 44, W83–W89 (2016).
43. Eades, P. & Wormald, N. C. Edge crossings in drawings of bipartite graphs. Algorithmica 11, 379–403 (1994).

Acknowledgements

C.S. and J.Z. were supported by NSF DMS 1712800 and Stanford Discovery Innovation Fund. E.K. was supported by a Hertz Foundation Fellowship. The authors would like to thank Stanislaw Antol for advice on web development.

Author information

Authors and Affiliations

Department of Electrical Engineering, Stanford University, Stanford, CA, USA
Junjie Zhu
Department of Statistics, Stanford University, Stanford, CA, USA
Qian Zhao, Eugene Katsevich & Chiara Sabatti
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Chiara Sabatti

Contributions

J.Z. conceived, designed, and developed the software. Q.Z. developed the written and video tutorials. J.Z. and E.K. performed analysis of the case studies. J.Z. and C.S. wrote the manuscript. All authors discussed the results and commented on the manuscript.

Corresponding authors

Correspondence to Junjie Zhu or Chiara Sabatti.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhu, J., Zhao, Q., Katsevich, E. et al. Exploratory Gene Ontology Analysis with Interactive Visualization. Sci Rep 9, 7793 (2019). https://doi.org/10.1038/s41598-019-42178-x

Received: 26 October 2018
Accepted: 14 March 2019
Published: 24 May 2019
DOI: https://doi.org/10.1038/s41598-019-42178-x
