Graph-based exploitation of gene ontology using GOxploreR for scrutinizing biological significance

Gene ontology (GO) is an eminent knowledge base frequently used for providing biological interpretations for the analysis of genes or gene sets from biological, medical and clinical problems. Unfortunately, the interpretation of such results is challenging due to the large number of GO terms, their hierarchical and connected organization as directed acyclic graphs (DAGs) and the lack of tools allowing to exploit this structural information explicitly. For this reason, we developed the R package GOxploreR. The main features of GOxploreR are (I) easy and direct access to structural features of GO, (II) structure-based ranking of GO-terms, (III) mapping to reduced GO-DAGs including visualization capabilities and (IV) prioritizing of GO-terms. The underlying idea of GOxploreR is to exploit a graph-theoretical perspective of GO as manifested by its DAG-structure and the containing hierarchy levels for cumulating semantic information. That means all these features enhance the utilization of structural information of GO and complement existing analysis tools. Overall, GOxploreR provides exploratory as well as confirmatory tools for complementing any kind of analysis resulting in a list of GO-terms, e.g., from differentially expressed genes or gene sets, GWAS or biomarkers. Our R package GOxploreR is freely available from CRAN.


Methods
In this section, we provide technical information about the main features provided by GOxploreR. First, we discuss how one obtains a directed acyclic graph (DAG) for given GO-terms. Then we discuss organism-specific GO-DAGs and a mapping that converts such a DAG into a reduced GO-DAG. Finally, we discuss an algorithm for prioritizing GO-terms. Determining the Go-DAG. The problem with existing packages is that none provides a function to directly obtain a GO-DAG for a domain, i.e., BP, MF or CC, in the form of an adjacency matrix. Instead, they provide local information which needs to be used for deducing such a tree tediously. For instance, GOdb provides the function GOBPCHILDREN to get the children of a GO term for BP. For the other two domains similar functions are available. The problem is that a children node does not need to be on the next hierarchy level but can jump further down the DAG. For an example see Fig. 1. In this figure, the child of node 2 is node 8 which is located on level 4, i.e., the child jumps from level 1, the location of its parent, to level 4.
The following example demonstrates how one can deduce a GO-DAG from this information. First, we list all children of a GO term (as obtained via the command GOBPCHILDREN). www.nature.com/scientificreports/ The root node is unique and we assign it the level 0, i.e., L(x 1 ) = 0 . The children for the root node receive as first assignment for a level the value L(x 1 ) + 1 = 1 , i.e., We wrote the right-hand side as a set because if such a node appeares again, we just add the new level value to this set. Going through the list of children, we assign each children of a node x i the value L(x i ) + 1.    www.nature.com/scientificreports/ From the last line we see that x 8 appears once on level 2 and once on level 4, which is correct if one looks at Fig. 1. However, there is just one correct level for x 8 and this is level 4. In general, if more than one level is assigned to a node then the correct one is the largest of these values. Such a GO-DAG can be constructed for every domain, i.e., biological process, molecular function and cellular component. In our package, we call the resulting graphs: • g.GO-DAG.BP: A DAG for all GO-terms of biological processes.
• g.GO-DAG.MF: A DAG for all GO-terms of molecular functions. • g.GO-DAG.CC: A DAG for all GO-terms of cellular components.
Organism-specific GO-DAG. Starting from a GO-DAG for a domain, as constructed in the previous section and using a list of all genes from an organisms, we can map these genes to GO-terms. For a particular organism, not all GO-terms may be present but only a subset. Such a subset can then be mapped back to the entire GO-DAG of the knowledge base. This gives a subtree of the general GO-DAG that is organism-specific. Using the function GetDAG(organism = o.name, domain = "BP") one obtains, e.g., a GO-DAG of BPs for the organism given by ' o.name' . For all domains, the following functions can be used: Reduced Go-DAG. Visualizing one of the GO-DAGs determined above (for all GO-terms or for organismspecific GO-terms) is usually challenging because of the size of such graphs containing thousands of GO-terms corresponding to nodes in a graph. For this reason, we derive a simplified GO-DAG, containing only dozens of nodes, that can be easily visualized to obtain a global overview of all used GO-terms.
In order to simplify a GO-DAG, we introduce the following categorization of GO-terms, excluding the root node. This categorization is applied to each level separately: • A GO-term is in category 'leaf node' (LN) if it has no children.
• A GO-term is in category 'regular node' (RN) if all its children are on the next level.
• A GO-term is in category 'jump node' (JN) if it has children and at least one of these is not on the next level.
We apply this categorization for all GO-terms. This results in the mapping GO-term X → GO-term category on level L That means we have functions of the form with c ∈ {LN, RN, JN} and l ∈ N . For instance, from Fig. 1 follows 3 → RN on level 1 and 2 → JN on level 1, which can be written formally as Algorithmically, the implementation is described in 1.  In addition to the node categorization, we need to find the connections between these nodes. This is realized via the implementation shown in Algorithm 2. www.nature.com/scientificreports/ Overall, a GO-DAG is described by an adjacency matrix A and a level function g and analogously, a reduced GO-DAG is described by adjacency matrix B and level function h and C (number of original nodes summarized by a new category).
In Fig. 2 we show a complete example for this mapping. The GO-DAG on the left-hand side has 19 GO terms and the resulting simplified GO-DAG on the right-hand side has only 8 nodes, whereas these nodes correspond to the three GO categories (RN, JN & LN) defined above. As one can see, each level will contain at most 3 nodes because this is the number of different categories. However, it is possible to have even fewer nodes, if a category is absent on a level.
Importantly, this transformation can be applied to any GO-DAG, regardless if this DAG is for all GO terms of, e.g., BPs, or for an organism-specific GO-DAG.
prioritizing lists of Go-terms. In general, the comparison of GO-terms with respect to their biologicalsemantic importance is complex. However, the comparison of GO-terms along a path is much simpler because the higher a level of a GO-term is the more specific is its biological information 26 . That means vertically one wants to traverse a DAG along a path as far down as possible. This implies that the GO-term at the end of a path is most interesting compared to all other GO-terms along this path. This increase in the semantic meaning along vertical paths is exploited by our algorithm for prioritizing lists of GO-terms.  www.nature.com/scientificreports/ Our algorithm applies the above described logic iteratively, by starting from the GO-term at the highest level and searches all shortest paths to the root node. Then all GO-terms along these shortest paths are removed from the list and the procedure starts over. See Fig. 3 for a visualization. In this figure, one shortest path from node 17 to the root node is shown. The pseudo-code of this is shown in Algorithm 3. Here XX corresponds to BP, MF or CC. The algorithm guarantees that for a non-empty list, H, of GO-terms the resulting set, R, containing the prioritized GO-terms consists of at least one GO-term. For instance, say H = {5, 9, 17} . Then our algorithm starts at node 17 and searches all shortest paths to the root. One of these is highlighted in green in Fig. 3. As a result, the nodes 5 and 9 are eliminated because they appear on a lower hierarchy level than node 17. In this case, the final result of our algorithm gives R = {17}.
Overall, our prioritizing algorithm provides a parameter-and assumption-free, non-redundant ranking of GO-terms that exploits only vertical structural information of GO. technical details about Go. For the construction of the various DAGs, we are only utilizing information from GO-basic. The information about this can be obtained from the go-basic.obo file, which can be obtained from the Gene Ontology website (http://geneo ntolo gy.org/docs/downl oad-ontol ogy/). This file contains the basic version of GO and it is guaranteed that the resulting DAG is acyclic and annotations can be propagated through the graph. We would like to note that the relations included in this, i.e., "is_a", "part_of ", regulates, "negatively_regulates" and "positively_regulates" also guarantee transitivity (NB: transitivity is not obeyed by "has_part" relations which are included in GO-core available from the go.obo file via the GeneOntology website).

Results
In the following sections, we highlight some of the features provided by the GOxploreR package and show some example applications.
Structural exploration of Go. In Table 1, we show an overview of the organisms supported by the GOx-ploreR package. Overall, at the moment ten organisms are supported corresponding also to the main organisms within the GO database. The second column in Table 1 shows the option name as used for arguments in functions.
For instance, the following command gives for the gene list ' c(10212, 9833)' containing Entrezgene IDs information about the associated GO-terms and hierarchy levels. www.nature.com/scientificreports/ In case a list of GO-terms is already available the corresponding hierarchy levels can be obtained with the command 'GOTermXXOnLevel' . Here 'XX' is either BP, MF or CC. In the following, 'XX' corresponds always to one of these three domains. For the analysis of enriched GO-terms, one frequently wants to limit such an analysis to more informative GO-terms which are located toward higher hierarchy levels. In order to obtain all GO-terms located on a specific hierarchy level one can use the function 'Level2GOTermXX' . www.nature.com/scientificreports/ > Level2GOTermBP(level = 17, organism = "Human") [1] "GO:2000321" "GO:0010880" "GO:2000320" "GO:0045630" "GO:2000703" [6] "GO:2000734" "GO:0031587" "GO:0045627" "GO:0045629" "GO:0045626" [11] "GO:0021808" "GO:0060315" "GO:0060316" "GO:0021836" "GO:0021972" [16] "GO:0031586" "GO:0021817" "GO:0097379" "GO:0021816" "GO:0097380" It is interesting to highlight that the children of a GO-term in a GO-DAG can 'jump' to different levels. For instance, using the function 'GOTermXX2ChildLevel' gives the GO-terms as well as the corresponding hierarchy levels of these. Here the GO-term "GO:0007635" is on level 3, however, its children are not only on level 4. The reason for this is that in GO there are no cross links on the same level. That means the children of any GO-term are always on a lower level because the terms are more specific. This implies that "GO:0007636" which is located on level 5 has (at least one) parent node located on level 4. In order to find this parent(s) we can use the following. It is important to note that GO does not only provide one DAG but several different ones. The reason for this is that each organism has a specific number of genes, and from these genes one obtains only a subset of all GO-terms that are connected to an organism. In total there are eleven GO-DAGs available from GOxploreR, ten for the organisms and one for all GO-terms.
In order to demonstrate the differences in the GO-terms for different organisms, we show in Fig. 4 the distribution of GO-terms of BP for human (top), zebrafish (middle) and E. coli (bottom). The x-axis corresponds to the hierarchy level of the corresponding GO-DAG of BP. As one can see for human one has a GO-DAG with 19 hierarchy levels whereas for zebrafish one has 16 and for E. coli 14. Furthermore, also the number of GO-terms on these levels is considerably different from each other as can be seen from the counts (number of GO-terms) on the y-axis. In Table 1, we show an overview of the number of levels (column four) and the number of GO-terms of BP (column five) for all ten organisms. For completeness, we want to mention that if one does not specify the organism in the command 'Level2GOTermBP' one can obtain a total number of 29698 GO-terms of BP for all levels.
Structure-based ranking of Go-terms. Maybe the most popular application of GO is the identification of enriched GO-terms for a list of genes. Unfortunately, as a result from such an analysis it is not uncommon to find large numbers of GO-terms making a focused discussion very difficult. However, a GO-DAG provides information that can be utilized for an exploratory analysis of such a list. Specifically, the hierarchy levels of GO-terms can be utilized. Despite the fact that a GO-level is not an absolute indicator for biological specificity it provides still valuable information 26 . Using our function GOTermBPOnLevel gives the GO-levels of BP for a list of GO-terms allowing, e.g., a simple ordering for complementing an enrichment analysis.
For instance, in Fig. 5A, we show results for a list of enriched GO-terms of BP found from an analysis of the breast cancer gene regulatory network 27 . Specifically, the hierarchy levels (x-axis) of these GO-terms (y-axis) are shown in purple. For reasons of comparison, the maximal depth of paths in the GO-DAG passing through these GO-terms is shown in red. As one can see, in all cases, the GO-terms are not at the end of these paths but somewhere situated along the way toward the highest possible (maximal) level that can be reached by passing through the corresponding GO-terms. This information is important because on one-hand one wants to interrogate GO-terms that are biologically specific, i.e., are situated toward the highest hierarchy level of the GO-DAG -for human this would be level 19. On the other-hand not every GO-term is connected to the highest level, i.e., there is no path that would allow to reach the maximal level. Hence, there is a trade-off between absolute and level GO−DAG (GO) ∈ (0, 1] the resulting score is also positive and at most one. Hence, the score, s t , is a product of two probabilities, i.e., s t = p 1 ( max path)p 2 (GO − DAG) allowing to optimize the trade-off between both objectives. The resulting score s t is shown in Fig. 5B. As one can see, the ranking of GO-terms is similar to Fig. 5A but not identical because Fig. 5A considers for the ranking only the relative distance between the actual and the maximal attainable position in a GO-DAG. Hence, both figures provide slightly complementary information. For our example GO:0006614 (SRP-dependent cotranslational protein targeting to membrane) and GO:0006613 (cotranslational protein targeting to membrane) have the highest score, which are interestingly directly connected in the GO-DAG. Overall, in general this information enables an exploratory analysis of GO-terms which complement the obtained p-values from an enrichment analysis.
In GOxploreR, such an analysis can be performed by using the commands distRankingGO and scoreRank-ingGO, i.e., the results in Fig. 5A,B can be obtained by (19)  Reduced Go-DAG. The starting point for many different types of analyses is usually a visualization of the data in order to derive an intuition about the information contained in the data. Unfortunately, for unfiltered GO-terms such a visualization is not feasible because the entire GO-DAG of an organism is too large containing thousands or even tens of thousands of GO-terms (see Table 1). For instance, even the smallest organism with respect to GO-terms of BP consists of 1491 nodes in the corresponding GO-DAG, distributed over 15 hierarchy levels. A graph of such a size cannot be visualized in an insightful way 28 . For this reason, we introduce a so called reduced GO-DAG that allows an easy visualization. The underlying idea of such a reduced GO-DAG is a mapping from GO-terms into three node categories, namely: regular nodes (RN), jump nodes (JN) and leaf nodes (LN). A GO-term is called a 'regular node' (RN) if all its children are on the next level, a GO-term is a 'jump node' (JN) if it has children and at least one of these is not on the next level and a GO-term is a 'leaf node' (LN) if it has no children at all. Such a mapping is obtain by the function getGOcategory.
As an example, Fig. 6A shows the reduced GO-DAG of MF for C. elegans. This GO-DAG contains only 37 category nodes, i.e., RNs, JNs or LNs, which summarize all 2102 GO-terms of MF for this organism on 14 hierarchy levels. That means only category nodes are shown that contain at least one GO-term, allowing a system-wide view of all MFs of C. elegans. Importantly, a reduced GO-DAG has the same number of hierarchy levels as the original GO-DAG because the mapping into category nodes does not effect the hierarchy levels. This holds for all GO-DAG. The following code demonstrates how the information shown in Fig. 6A can be obtained. In case one has a list of GO-terms, one can also perform such a mapping only for this limited number of GO-terms. Furthermore, also a visualization for this sub-set of all GO-terms can be obtained using the function visRDAGMF. Overall, a reduced GO-DAG helps in simplifying the complexity provided by the gene ontology especially with respect to the connectivity between the GO-terms. This enables a general visualization for an exploratory analysis of system-wide information propagation capabilities.
prioritizing Go-terms. Finally, GOxploreR provides a function called prioritizedGOTerms for prioritizing GO-terms. The idea is to go beyond the ordering of GO-terms for a provided list of GO-terms to eliminate selected terms that are capturing redundant and less biologically specific information; see the discussion of Fig. 6B below.
In order to realize an implementation for such a function, we apply the following strategy (see Methods Sec. 2.4 for technical details). Specifically, it is known that the comparison of GO-terms with respect to their biological meaning is complex. However, the comparison of GO-terms that can be found along a path is much simpler because the higher a level of a GO-term, the more specific is its biological information 26 . That means traversing a path vertically toward higher levels increases the biological specificity of GO-terms implying that the GO-term at the end of a path is the most interesting one. Hence, by eliminating all GO-terms that are together on a path, except the one on the highest level, results in a prioritizing of terms with respect to the semantic meaning of GO-terms. The function prioritizedGOTerms implements this strategy. In Fig. 6B, we show visualized of this. Here one path is highlighted containing three GO-terms (GO:1, GO:2, and GO:3) whereas GO:3 has the highest level. This results in an elimination of GO:1 and GO:2. Similarly, all other paths are explored resulting in GO:1 and GO:6 as output of the prioritizing algorithm.
As an example, we investigate a list of GO-terms that was obtained from analyzing a gene regulatory network of S. cerevisiae 29 . The original list contains 30 different GO-terms of BP 29 , each significantly enriched with a significant p-value. Application of our function prioritizedGOTerms for prioritizing GO-terms results in only 5 GO-terms, shown in Table 2. Each of these 5 GO-terms is located on a separate brunch of the underlying GO-DAG between which no paths exist. Hence, despite of a certain similarity of the biological processes, e.g., for metabolic or mitochondrial processes, each of these terms is from a different, separate semantic category because otherwise connections with the DAG would exist. Such an analysis complements available p-values and gives further information on which GO-terms a follow-up analysis could focus on.
Overall, the function prioritizedGOTerms can prioritize a list of GO-terms with information about the semantic information content of a GO-DAG as provided by the level of GO-terms. If desired, a separate visualization could be obtained only for these GO-terms by using the function visRDAGsubMF.

Discussion
In this paper, we introduced the R package GOxploreR and highlighted some of the functionality it provides. Overall, GOxploreR provides functions and algorithms for four different types of analyses. Specifically, GOx-ploreR enables a (1) direct access to structural features of GO, (2) structure-based ranking of GO-terms, (3) mapping to a reduced GO-DAG and (4) prioritizing of GO-terms. www.nature.com/scientificreports/ The first three features of GOxploreR permit an exploratory analysis of GO-terms and GO-DAGs whereas the fourth feature provides a dedicated algorithm for a particular problem. Despite the fact that it is well-known that GO has the structure of a DAG, there are surprisingly few tools allowing a direct assess to structural, i.e., graph-based information of GO. Hence, our features and the corresponding functions help in utilizing this rich source of information which is in our opinion so far largely underexplored. A reason for this lack could be that the conceptual realization and implementation of graph-based algorithms is not straight forward requiring inter-and transdisciplinary knowledge of graphs and the underlying biology. www.nature.com/scientificreports/ One important novelty of GOxploreR is to provide a mapping from a GO-DAG to a reduce GO-DAG. This leads to a tremendous reduction in complexity of graphs because a GO-DAG can contain thousands of nodes, depending on the organism and the domain, i.e., BP, MF or CC. In contrast, a reduced GO-DAG has at most three nodes of the categories, JN (jump node), RN (regular node) or LN (leaf node) on each hierarchy level. The idea behind this mapping is inspired by the detection of differentially expressed genes (DEG) 30 . While the expression level of a gene is continuous, a DEG analysis performs a kind of classification of the expression level into two categories: active and inactive. This allows a reduction in the complexity of the gene expression level by capturing simplified yet essential information. Our mapping from a GO-DAG to a reduce GO-DAG follows a similar strategy by capturing simplified yet essential information of the connection between GO-terms. As far as we know, GOxploreR is the only package that provides such a mapping and reduction in the GO complexity.
Another novelty of the GOxploreR package is to provide visualizations of reduced GO-DAGs. This feature is directly enabled by the tremendous reduction in complexity of the mapping from a GO-DAG to a reduce GO-DAG because the visualization of a DAG containing thousands of nodes (see Table 1) is not feasible. In contrast, a reduce GO-DAG permits such a visualization allowing to obtain an overview of the biological information processing of the entire ontology. Given the novelty of a mapping from a GO-DAG to a reduce GO-DAG other packages that provide also visualization capabilities do not offer this particular visualization.
Finally, the GOxploreR package provides a prioritizing algorithm. The idea of this algorithm is to go beyond the ordering of GO-terms for a given list of GO-terms, and to eliminate GO-terms capturing redundant biological information. For the prioritizing of GO-terms in a list, we utilized the fact that the higher a level of a GO-term is the more specific is its biological information 26 . That means vertically one wants to traverse a DAG as far down as possible because the end of a path is most specific compared to all other GO-terms along this path. Our algorithm applies this logic iteratively by starting from the GO-term at the highest level and searches all (shortest) paths to the root node. Then all GO-terms along these shortest paths are removed from the list and the procedure starts over; see Fig. 1 for a visualization. As a result, one obtains a prioritizing of GO-terms that is a parameter-and assumption-free algorithm which removes redundant GO-terms by exploiting only vertical structural information of a GO-DAG. Hence, the output of our prioritizing algorithm is a non-redundant ranking of GO-terms.
We would like to highlight that there is a crucial difference between our prioritizing algorithm and approaches based on the semantic similarity of genes 31,32 . The difference is that we utilize only vertical information from a GO-DAG. This implies that there is no need for comparing GO-terms horizontally because they cannot be connected by any path (besides over the root node). However, this horizontal comparison is usually the problem since the biological significance of different GO-terms on the same hierarchy level can be different. This simplifies the analysis yet allows the elimination of redundant GO-terms. The resulting list of GO-terms maybe be further reduced, however, not without making additional assumptions, e.g., in the form of semantic similarity measures. A common problem with the latter is that there is not one but many different measures for semantic similarity all of which are non-trivial in their definition and interpretation 33 . In contrast, our prioritizing algorithm is parameter-and assumption-free allowing to remove redundant GO-terms by exploiting only vertical structural information along paths of a GO-DAG. Another fundamental difference between our prioritizing algorithm and semantic similarity measures is that our algorithm focuses on GO-terms and not on genes. This facilitates a general systems view on the underlying problem from which the GO-terms have been obtained as represented by systems biology 34,35 .
In Table 3, we compare the capabilities of the GOxploreR package with other software tools available for analyzing GO. The first column shows the name of the software whereas the remaining columns refer to various features. Specifically, the second column indicates if a software tool is available as an R package and the third column refers to direct assess of structural information provided by a GO-DAG. Examples thereof are the hierarchical level of a GO-term, the GO-terms on a certain hierarchy level or the adjacency matrix of a DAG. The fourth column is about identifying the enrichment of GO terms, whereas the fifth column is about the availability of reduced GO-DAGs and the sixth column refers to a prioritizing algorithm for a list of GO-terms.
As one can see from Table 3, the GOxploreR package is considerably different from all the other software tools, hence, providing novel and complementary analyses functionality. Importantly, GOxploreR is available as R package allowing the easy utilization of it within existing analysis pipelines for their extensions. Hence, GOxploreR does not provide dead-end functionality via web-interfaces but enables future biomedical data science projects 36 . Table 2. Using GOxploreR one can prioritize lists of GO-terms. The table shows results for significant GO-terms from analyzing a gene regulatory network of S. cerevisiae 29 after the application of our prioritizing algorithm. The GO-terms are for BP and complement p-values obtained from an independent enrichment analysis. www.nature.com/scientificreports/ conclusion In this paper, we introduced the R package GOxploreR, available from CRAN (after acceptance of the paper).
GOxploreR is a versatile tool that can be applied to any list of GO-terms from an upstream analysis as a result from studying, e.g., differentially expressed genes, GWAS, biomarkers, gene sets or gene regulatory network studies [7][8][9][10][11][12][13] . Its main features include: 1. A direct access to structural features of GO. 2. A structure-based ranking of GO-terms. 3. A mapping from a GO-DAG to a reduced GO-DAG. 4. A visualization of reducuded GO-DAGs.
5. An algorithm for prioritizing GO-terms.
Given the lack of tools for exploring the DAG-structure of GO from a graph theoretical perspective, GOxploreR complements non-structural analysis tools. Overall, GOxploreR has the potential to enhance studies investigating differentially expressed genes, GWAS (genome-wide association study), biomarkers, gene sets or gene regulatory network studies significantly because the obtained information has a clear interpretation directly derived from the gene ontology knowledge base and is not based on additional assumptions.