Quantitative evaluation of ontology design patterns for combining pathology and anatomy ontologies

Data are increasingly annotated with multiple ontologies to capture rich information about the features of the subject under investigation. Analysis may be performed over each ontology separately, but recently there has been a move to combine multiple ontologies to provide more powerful analytical possibilities. However, it is often not clear how to combine ontologies or how to assess or evaluate the potential design patterns available. Here we use a large and well-characterized dataset of anatomic pathology descriptions from a major study of aging mice. We show how different design patterns based on the MPATH and MA ontologies provide orthogonal axes of analysis, and perform differently in over-representation and semantic similarity applications. We discuss how such a data-driven approach might be used generally to generate and evaluate ontology design patterns.

When evaluating ontology design patterns we can distinguish between internal evaluation and external evaluation. An internal evaluation relies only on the ontologies and the patterns that are applied. It may involve automated reasoning to determine consistency and the number of unsatisfiable classes, as well as several metrics related to the complexity of the expressed knowledge 17 . An external evaluation requires a biological hypothesis and an additional well-characterized dataset, and involves applying the ontology design patterns to address the hypothesis. Common forms of evaluation include the application of semantic similarity as a predictor for a type of biological relation.
Here, we demonstrate that we can devise and evaluate alternative design strategies using background knowledge from a large biological dataset, and that alternative, validated design patterns can open new axes of analysis. Specifically, we show how to combine two ontologies related to anatomy and pathology, the Mouse Anatomy Ontology (MA) 18 and the Mouse Pathology Ontology (MPATH) 19 , through ontology design patterns. We apply several methods to evaluate the ontology design patterns through application-driven data analysis, i.e., an external evaluation of the generated ontologies.
The dataset we use for the external evaluation was derived from a very large aging study of 28 inbred strains of laboratory mice carried out at the Nathan Shock Aging Center at The Jackson Laboratory. Over their natural lifespan, cohorts of mice were subject to periodic necropsy and complete histopathological workup to determine the frequency of spontaneous age-related pathological changes 20,21 .
In our analysis we perform ontology enrichment analysis by strain and sex for different experimental groups, and demonstrate that different ontology design patterns yield different statistical results. We use the impact that the ontology structure has on frequently performed data analyses as additional motivation to provide quantitative measures for evaluating the design patterns. For this purpose, we compute the semantic similarities between the ontology annotations for each individual mouse and apply different clustering methods. We use a cluster purity measure with respect to our original input data and define an area under the purity curve as a quantitative measure that evaluates the quality of the different ontology design patterns. We find that there are differences between the four derived ontologies in all analysis approaches. It is well established that individual mice within an inbred (i.e., genetically homogeneous) strain exhibit a more similar spectrum of disease than mice in different strains 22,23 . We use semantic similarity measures to confirm this observation and show that some of the ontology design patterns generate significantly better results compared to using each ontology individually. Our work introduces a repertoire of quantitative ontology evaluation measures that will be useful in different applications and have the potential to improve ontology interoperability and data analysis.

Methods
Mouse pathology dataset. We used a dataset of spontaneous diseases of aging in the mouse 24 available from the Mouse Phenome Database 25 . The dataset provides 20,885 diagnoses for 1,740 mice and four different experimental datasets, of which we utilize three in this study. The original experimental groups comprised: (i) a longitudinal study (LONG), in which mice were euthanised when they appeared moribund throughout the course of the study; and (ii) a cross-sectional study, consisting of groups in which mice were sacrificed at 6 months old (6 m), 12 months old (12 m), and 20 months old (20 m). Because the 6-month study used a subset of strains and examined limited types of lesions, we did not use these data in our work.
In the cross-sectional study, at least 15 animals were sacrificed at 12 and 20 months (12 m and 20 m), irrespective of health status. After 14 months some strains show considerable attrition due to disease and inter-animal aggression, so at the 20-month time point the same number of mice was not available for every strain 21 .
All animals were subject to complete necropsy and examination, and each diagnosis was specified using classes from MPATH and MA. We used data from 1,595 mice from 28 strains; their counts are shown in Table 1. Strains AKR/J, CAST/EiJ, and SJL/J were excluded because most of the animals died early in the experiment due to well-characterised severe disease or aggressive behavior. Each mouse can have multiple diagnoses, and some mice have no diagnosis; we refer to the latter as healthy mice. As not all strains show high survival rates past 14 months of age, the numbers necropsied for each strain are most variable at 20 months.
Ontologies. We used two ontologies in this work, the Mouse Pathology Ontology (MPATH) 19 and the Mouse Anatomy Ontology (MA) 18 . MPATH describes mouse pathological processes and structures. The version used was released on 2018-01-06 and contains 889 classes. MA describes adult mouse anatomy. We used the version released on 2017-02-07, which contains 3,257 classes. As a preprocessing step in all our analyses, we added an axiom for each class in MA making them all subclasses of a common root class, Mouse anatomical entity.
OQuaRE ontology evaluation measures. The ontology quality requirements and evaluation framework (OQuaRE) 26 adapts the standard for software quality requirements and evaluation (SQuaRE) 27 . OQuaRE aims to provide a standardized method for measuring the quality of ontologies. We used three OQuaRE metrics that describe structural characteristics of ontologies. The tangledness (TMOnto) measure is the ratio of classes with multiple parents to the total number of classes in an ontology (see Equation 1), where C_i is the i-th class in the ontology, N is the total number of classes in the ontology, and DP(C_i) equals one if the class has more than one direct parent and zero otherwise.
$$\mathrm{TMOnto} = \frac{1}{N} \sum_{i=1}^{N} DP(C_i) \qquad (1)$$
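The weighted method count (WMCOnto) is the average depth of the leaf classes (Equation 2), where N_leafs is the number of leaf classes, C_i is the i-th leaf class, and Depth(C_i) is its depth in the subsumption hierarchy. The depth of the subsumption hierarchy (DITOnto) is the maximum depth of a leaf class (Equation 3).
$$\mathrm{WMCOnto} = \frac{1}{N_{\mathrm{leafs}}} \sum_{i=1}^{N_{\mathrm{leafs}}} \mathrm{Depth}(C_i) \qquad (2)$$
$$\mathrm{DITOnto} = \max_{1 \le i \le N_{\mathrm{leafs}}} \mathrm{Depth}(C_i) \qquad (3)$$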
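The three structural metrics can be illustrated on a toy hierarchy. The following is a minimal sketch, not the OQuaRE reference implementation: the class names are hypothetical, and measuring depth as the longest path to a root with the root at depth 0 is an assumption of this example.

```python
from typing import Dict, List

# Hypothetical toy ontology: each class maps to its direct parents.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "organ": ["root"],
    "lesion": ["root"],
    "lung": ["organ"],
    "heart": ["organ"],
    "adenoma": ["lesion"],
    "lung_adenoma": ["lung", "adenoma"],  # two direct parents
}


def depth(cls: str, parents: Dict[str, List[str]]) -> int:
    """Longest path from cls to a root (assumption: a root has depth 0)."""
    if not parents[cls]:
        return 0
    return 1 + max(depth(p, parents) for p in parents[cls])


def leaves(parents: Dict[str, List[str]]) -> List[str]:
    """Classes that are not a direct parent of any other class."""
    has_child = {p for ps in parents.values() for p in ps}
    return [c for c in parents if c not in has_child]


def tm_onto(parents: Dict[str, List[str]]) -> float:
    """Fraction of classes with more than one direct parent."""
    return sum(1 for ps in parents.values() if len(ps) > 1) / len(parents)


def wmc_onto(parents: Dict[str, List[str]]) -> float:
    """Average depth of the leaf classes."""
    leaf_classes = leaves(parents)
    return sum(depth(c, parents) for c in leaf_classes) / len(leaf_classes)


def dit_onto(parents: Dict[str, List[str]]) -> int:
    """Maximum depth of a leaf class."""
    return max(depth(c, parents) for c in leaves(parents))
```

In this toy hierarchy only `lung_adenoma` has two direct parents, so tangledness is 1/7, and the two leaves (`heart`, `lung_adenoma`) have depths 2 and 3.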
Enrichment analysis. We performed enrichment analysis using the tools FUNC 28 and OntoFUNC 4 . FUNC is a software package developed to find significant associations between gene sets and ontological annotations in the Gene Ontology (GO), and OntoFUNC extends FUNC to perform enrichment analysis over ontologies other than GO.
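Over-representation of an ontology class in one strain can be scored with a hypergeometric tail probability. The sketch below is only an illustration of that test and is not the FUNC implementation (which additionally works over the ontology graph and uses random sets); the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) for a hypergeometric variable.

    N: total number of mice, K: mice annotated with the class,
    n: mice in the strain of interest, k: annotated mice in that strain.
    """
    upper = min(K, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 20 mice, 5 carry the class, the strain has 4 mice,
# 3 of which carry the class.
p_value = hypergeom_upper_tail(k=3, K=5, n=4, N=20)
```

A small `p_value` indicates that the class occurs in the strain more often than expected if annotated mice were distributed across strains at random.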
We performed a hypergeometric test using the six different ontologies. We first applied OntoFUNC to each of the ontologies separately to generate the files used by FUNC. We then generated an annotation file for each mouse strain and each tested ontology using Groovy scripts. The annotation file consists of three columns: the individual mouse identifier (ID), a phenotype from the ontology classes, and a binary value that indicates whether or not the mouse belongs to the strain of interest. Healthy mice, which have no phenotypes, are added by assigning them the root of the tested ontology as their phenotype. For FUNC we specified the parameters as follows: the root of each ontology graph was set to owl:Thing, the number of random sets was set to 1,000, and we ensured that each group (case and control) has at least one individual.
Semantic similarity. We calculated mouse-to-mouse groupwise semantic similarity 10 based on the existing ontologies MA and MPATH, as well as on the newly generated ontologies MAP, MAPT, PAM and PAMT. Generation of the new combined ontologies is described in Results. Briefly, MAP and MAPT are built with MA as the primary axis of classification, and PAM and PAMT with MPATH as the primary axis of classification. MAPT and PAMT include additional axioms that base classification on the transitivity (T) of parthood relations.
We used Resnik's similarity measure 29 and the best match average (BMA) strategy implemented in the Semantic Measures Library (SML) 30 . Resnik's similarity measure is based on the information content (IC) of an ontology class. The information content of a given phenotype class in the ontology is defined as the negative logarithm of its occurrence probability 31 (Equation 5). As shown in Equation 4, the probability of each phenotype is calculated as the sum of the probabilities of its subclasses, where n_x is the number of occurrences of the phenotype x in the corpus, {y | y ⊆ x} is the set of subclasses of x, and N is the total number of phenotype classes. The similarity between two phenotypes is then calculated as the information content of their most informative common ancestor (MICA, see Equation 6).
$$p(x) = \frac{\sum_{x' \in \{y \mid y \subseteq x\}} n_{x'}}{N} \qquad (4)$$
$$IC(x) = -\log p(x) \qquad (5)$$
$$\mathrm{sim}_{\mathrm{Resnik}}(x_1, x_2) = IC(\mathrm{MICA}(x_1, x_2)) \qquad (6)$$
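Information content and Resnik similarity can be illustrated on a small hierarchy. This is a minimal sketch rather than the SML implementation: the class names and annotation counts are hypothetical, and normalising by the total number of annotation occurrences (so that the values are probabilities) is an assumption of this example.

```python
from math import log
from typing import Dict, List, Set

# Hypothetical toy hierarchy (class -> direct parents) and annotation counts.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "lesion": ["root"],
    "neoplasm": ["lesion"],
    "adenoma": ["neoplasm"],
    "carcinoma": ["neoplasm"],
    "inflammation": ["lesion"],
}
COUNTS: Dict[str, int] = {"adenoma": 4, "carcinoma": 2, "inflammation": 2}


def ancestors(cls: str) -> Set[str]:
    """All superclasses of cls, including cls itself."""
    seen = {cls}
    stack = [cls]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def probability(cls: str) -> float:
    """Share of annotations made to cls or to any of its subclasses."""
    total = sum(COUNTS.values())  # assumption: normalise by annotation count
    covered = sum(n for c, n in COUNTS.items() if cls in ancestors(c))
    return covered / total


def ic(cls: str) -> float:
    """Information content; undefined for classes that are never annotated."""
    return -log(probability(cls))


def resnik(a: str, b: str) -> float:
    """IC of the most informative common ancestor of a and b."""
    return max(ic(c) for c in ancestors(a) & ancestors(b))
```

Here `adenoma` and `carcinoma` share the ancestor `neoplasm`, which covers 6 of the 8 annotations, so their similarity is -log(0.75); `adenoma` and `inflammation` only share ancestors that cover every annotation, so their similarity is zero.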
Finally, we used a best matching average method (BMA) to compute mouse-to-mouse similarity. We have two sets of ontology-based annotations, those for the first mouse and those for the second. For each annotation in either of the two annotation sets, the BMA method looks for the best match in the other set (the class with the highest similarity) and averages their similarities. In Equation 7, m_1 and m_2 are the numbers of classes associated with mouse_1 and mouse_2, respectively, and p_1i and p_2i refer to the i-th annotation of mouse_1 and mouse_2, respectively.
Clustering and clustering purity. We perform clustering based on the similarity matrices generated by applying semantic similarity to each pair of mice. We applied K-medoids clustering, complete linkage agglomerative clustering, the unweighted pair group method with arithmetic mean (UPGMA), and neighbor joining agglomerative clustering (NJ).
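The best matching average used for mouse-to-mouse similarity can be sketched as follows. This shows one common formulation (the mean of the two directional averages of best matches); whether it matches the SML implementation term for term is an assumption, and the pairwise similarity table is hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical pairwise class similarities; unlisted pairs default to 0.
TOY_SIM: Dict[FrozenSet[str], float] = {
    frozenset({"adenoma", "carcinoma"}): 0.4,
    frozenset({"inflammation", "carcinoma"}): 0.0,
}


def toy_sim(a: str, b: str) -> float:
    """Symmetric lookup in the toy table; identical classes score 1."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset({a, b}), 0.0)


def bma(set1: List[str], set2: List[str], sim: Callable[[str, str], float]) -> float:
    """Best match average between two non-empty annotation sets."""
    # Best match in the other set for every annotation, in both directions.
    best1 = [max(sim(a, b) for b in set2) for a in set1]
    best2 = [max(sim(a, b) for a in set1) for b in set2]
    # Assumption: average each direction separately, then take their mean.
    return 0.5 * (sum(best1) / len(best1) + sum(best2) / len(best2))


mouse1 = ["adenoma", "inflammation"]
mouse2 = ["carcinoma"]
similarity = bma(mouse1, mouse2, toy_sim)
```

In practice `sim` would be a class-level measure such as Resnik similarity; mice without any annotation need separate handling, for example by assigning them the ontology root as described for healthy mice.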
K-medoids clustering is very similar to K-means clustering but has some advantages over K-means: it does not require an observation matrix and can work directly with the similarity matrix, and it may also be less sensitive to outliers.
Hierarchical agglomerative clustering methods start with one cluster per mouse. They then merge the two closest clusters, one pair at a time, and recalculate the distances between the newly generated cluster and the existing ones. In complete linkage, the distance between two clusters is the maximum distance between points in the two clusters 32 . In UPGMA, by contrast, it is the average distance between points in the two clusters 33 .
Neighbor joining 34 is a method used for constructing phylogenetic trees. It starts with a star-like tree and, at every step of the algorithm, picks the pair x and y with the smallest sum of branch lengths S_xy and joins them.
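The difference between the two linkage criteria comes down to how the distance between two clusters is derived from pairwise distances. A minimal sketch, using a hypothetical 4 x 4 distance matrix and clusters given as lists of row indices:

```python
from typing import List

# Hypothetical symmetric distance matrix for four mice.
DIST: List[List[float]] = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]


def complete_linkage(c1: List[int], c2: List[int], dist: List[List[float]]) -> float:
    """Maximum pairwise distance between members of the two clusters."""
    return max(dist[i][j] for i in c1 for j in c2)


def average_linkage(c1: List[int], c2: List[int], dist: List[List[float]]) -> float:
    """Mean pairwise distance between members of the two clusters (UPGMA)."""
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


d_complete = complete_linkage([0, 1], [2, 3], DIST)
d_average = average_linkage([0, 1], [2, 3], DIST)
```

For the clusters {0, 1} and {2, 3} the cross-cluster distances are 4, 5, 3 and 6, so complete linkage reports 6 while UPGMA reports 4.5; complete linkage is therefore driven by the most dissimilar pair of mice.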
After clustering, we measure the quality of the clusters by cluster purity 35 . We calculate purity against the ground truth of mice and the strains to which they belong by assigning each cluster to the most frequent strain in that cluster. The number of correctly assigned mice is then divided by the total number of mice. Let m_i,j be the strain of mouse j in cluster i, max_i the dominant strain in cluster i, and M the total number of mice; the purity is then calculated as in Equations 8 and 9. To compare the clustering results quantitatively, we use the area under the purity curve, computed with the trapezoidal method and divided by the number of mice in the group, as shown in Equation 10, where M is the number of mice in the group and Purity_n is the purity of the clusters when the number of clusters is set to n.
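Purity and the area under the purity curve can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the strain labels are hypothetical, and the curve is assumed to contain one purity value for each number of clusters n = 1, ..., M with unit spacing.

```python
from collections import Counter
from typing import List


def purity(clusters: List[List[str]]) -> float:
    """clusters: one list of strain labels per cluster."""
    total = sum(len(c) for c in clusters)
    # Mice of the dominant strain in each cluster count as correctly assigned.
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return correct / total


def area_under_purity_curve(purities: List[float]) -> float:
    """Trapezoidal area over n = 1..M, divided by M.

    Assumption: purities[n - 1] is the purity obtained with n clusters.
    """
    m = len(purities)
    area = sum((purities[i] + purities[i + 1]) / 2 for i in range(m - 1))
    return area / m


example_purity = purity([["A", "A", "B"], ["B", "B"], ["C"]])
example_area = area_under_purity_curve([0.5, 0.75, 1.0, 1.0])
```

Because purity reaches 1 once every mouse forms its own cluster, a single purity value is not informative; the normalised area summarises how quickly purity rises as the number of clusters grows.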

ROC analysis. A receiver operating characteristic (ROC) curve is an evaluation measure for binary classification problems 36 . ROC curves visualize the trade-off between true positives and false positives at different numerical thresholds. This allows a direct comparison between classifiers without setting a specific threshold 35 . To plot the ROC curve, we compute the true positive rate and false positive rate as shown in Equations 11 and 12. This tests whether the t mice most similar to a given mouse are from the same strain.
(2019) 9:4025 | https://doi.org/10.1038/s41598-019-40368-1 www.nature.com/scientificreports
Rank-based statistics. The ontology-based analyses that we apply in this work are enrichment and semantic similarity. The enrichment analysis provides a p-value for the classes that are over- and under-represented in each mouse strain, and the semantic similarity provides a similarity value for each pair of mice. We can use the p-values to rank classes for each strain by their significance for over- or under-representation, and we can use the similarity scores to rank, for each mouse, all other mice by their pairwise similarity. This is motivated by the practice of considering only the "most significant" or "most similar" entities as relevant, and by the need to determine how the ontology design patterns affect this kind of scenario. We can determine the strength of the effect of the different ontology design patterns from how much the ranks (of the classes in the enrichment analysis, or of the pairs of mice in the semantic similarity) change.
We apply several rank-based statistical measures, specifically Kendall's tau correlation coefficient and the Wilcoxon rank-sum test to determine whether different ontology design patterns provide different ranks. All tests we apply are non-parametric statistical measures.
Kendall's τ rank correlation coefficient is a measure of how well two sets of ranks of the same set of objects are correlated. To calculate Kendall's τ we need to first calculate the number of concordant and discordant pairs. For each pair of objects this method compares the rank of those objects using the alternative ranking algorithms. If the rankings of the two objects are of the same order then they are concordant, if they are different then they are discordant. We used the implementation in SciPy 37 for Kendall's tau-b to perform this test.
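Counting concordant and discordant pairs can be written out directly. The sketch below follows the tau-b definition as documented for SciPy's `kendalltau` (pairs tied in only one ranking enter the denominator, pairs tied in both are ignored); it is a quadratic-time illustration with hypothetical ranks, not a replacement for the library routine.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b for two rankings of the same objects."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: counted nowhere
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # same order in both rankings
        else:
            discordant += 1  # opposite order
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))


# Hypothetical ranks of four classes under two ontology design patterns.
tau = kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4])
```

With four objects there are six pairs; only the pair ranked (2, 3) versus (3, 2) is discordant, giving tau = (5 - 1) / 6.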
The Wilcoxon rank sum test, also known as the Mann-Whitney U-test, is a non-parametric alternative to the t-test for independent samples. We use the implementation in Matlab to perform this test.
Implementation. We use several tools and libraries in this work. The OWL API 4.2.5 38 is used to combine ontologies and implement our design patterns, and the FUNC 0.4.7 28 and OntoFUNC 4 tools are used to perform the enrichment analysis. To compute semantic similarity between mice based on their annotations, we use the Semantic Measures Library (SML) 30 . For the quantitative and statistical analyses (clustering, computing ROC curves and the area under the ROC curves, and calculating purity and rank-based statistics), we use Matlab, R and SciPy. All code required to reproduce our results, and the generated ontologies, are available at https://github.com/bio-ontology-research-group/mpath-ma.

Results
Ontology design patterns to combine anatomy and pathology. In the original application of MPATH, the problem of many lesions potentially occurring in many different anatomical locations was dealt with at the level of annotation, using classes from two or more ontologies to describe each lesion. This avoids creating all possible compound classes, with the inevitable increase in class numbers that would make the ontology cumbersome to use and expensive to compute over. The shortfall of this annotation-based approach lies in the difficulty of using both ontologies separately in any analysis. Creating what is effectively a precomposed compound ontology obviates this problem. There are, however, different possible ways to combine the MPATH and MA ontologies. The key problem we address is how to select the primary axis of classification, i.e., whether the classes in the combined ontology should represent anatomical entities with particular pathological lesions or, alternatively, pathological lesions that affect particular anatomical locations. In the first case, the MA ontology provides the natural backbone of the combined ontology's taxonomy, while in the second case the backbone taxonomy is provided by MPATH. A further challenge is how to incorporate information from the ontologies' axioms into the combined ontology; for example, if a pathological lesion affects the left ventricle, we may also wish to classify this lesion as a lesion affecting the heart (and therefore use anatomical parthood axioms to structure the combined ontology). We combine the MPATH and MA ontologies in a data-driven way, using the OWL API 38 .
We generated the second pattern, which we call MAPT, in the same way, but instead of the definition in Equation 13 we define a MAPT class as in Equation 14, i.e., using the parthood relation. MAP and MAPT are ontologies in which classes combine the observation that a certain anatomical entity, or its parts, have a certain lesion from MPATH.
In MAPT, we reuse the part-of relation from the MA ontology, which is then used to infer that a lesion observed in a part of X is also observed in X. For example, if an adenoma is observed in the lungs, then it is also observed in some part of the respiratory system. Finally, we also add a "contextualization axiom", defined in Equation 15, which asserts that everything (that falls in the domain of our ontology) has a lesion.
The MAP and MAPT ontologies contain 1,575 MAP classes. After generating the axioms that form the ontology, we used the HermiT reasoner 39 to classify the ontology hierarchy.
To generate the third and fourth ontologies, which we call PAM and PAMT, we define two classes from each pair of inputs to the data set in each ontology. For PAM, we define the ?PAM class as in Equation 17, and for PAMT we define the ?PAMT class as in Equation 16. ?PAM and ?PAMT are classes that combine the observation that a certain pathological lesion ?MPATH affected the anatomical site ?MA or a part of it. In ?PAMT classes we reused the part-of relation from the MA ontology for the same reason as illustrated above. We also added a "contextualization axiom", defined in Equation 18, which states that a lesion affecting any anatomical entity is still a lesion of that type. For example, an adenoma that affects the lung is still a type of adenoma. After the PAM, PAMT and MPATH affects classes were asserted, we inferred logical relations using the HermiT reasoner 39 . The PAM and PAMT ontologies contain 1,575 classes each.
OQuaRE evaluation measures. To understand the differences between the new ontologies and the original ones, as well as to evaluate the effect of automated reasoning on the newly generated ontologies, we applied three methods from the OQuaRE ontology evaluation suite 26 . The tangledness (TMOnto) measure describes the ratio of classes with multiple parents to the total number of classes in an ontology, the weighted method count (WMCOnto) is the average depth of the leaf classes, and we further measure the maximum depth of the subsumption hierarchy (DITOnto). Table 2 shows the OQuaRE metrics for these three measures. Our results demonstrate that tangledness in the new classes is higher compared to the original ones, and that the maximum depth of the newly created ontologies is slightly larger than that of the original ontologies. We also limit this analysis to newly created classes only. We find that PAM and PAMT have a higher average depth than MAP and MAPT (WMCOnto PAM = 6.5022, WMCOnto PAMT = 6.7674, WMCOnto MAP = 4.9020 and WMCOnto MAPT = 2.4889). Finally, to illustrate how the structure of the newly generated ontologies is determined through automated reasoning with the HermiT reasoner 39 , in Table 3 we show the three metrics for the original ontologies compared to the inferred ones.
Enrichment analysis. We performed enrichment analysis to find overrepresented lesions and anatomical sites in each strain. We used the original ontologies MA and MPATH and the newly generated ones MAP, MAPT, PAM and PAMT. Each ontology showed a characteristically different rank profile of the overrepresented classes. We used Kendall's rank correlation coefficient τ to quantify how much two ontologies, or two ontology design patterns, differ. We found that changing the primary axis between anatomy and pathology yielded highly
Clustering purity. The driving motivation behind the generation of inbred strains of laboratory mice was that genomically identical individual mice within an inbred strain show closely related phenotypes. The phenotypic relatedness of mice within the same strain allows for the genetic analysis of genes and variants giving rise to those phenotypes 22,24,40,41 . We therefore expect that once individual mice are phenotypically annotated, these annotations should be usable in a classifier to assign each mouse to a specific strain, or at least to cluster together mice of genetically related inbred strains 23 .
Cluster purity provides a method for evaluating whether mice of the same or similar genetic background are predisposed to the same or similar lesions at the same or similar anatomical sites. We first use a similarity measure to generate a mouse-to-mouse disease similarity matrix. To eliminate confounding effects, we separate males and females and the different time points in the study. We then convert the similarity matrices to distance matrices and apply four clustering methods.
The purity metric reflects how well mice of the same strain have been grouped together based solely on their phenotypic similarities, using annotation to each of the six different ontologies (i.e., MA, MPATH, MAP, MAPT, PAM, PAMT). Purity naturally increases with the number of clusters, so we determine the overall performance of the clustering task through the area under the purity curve. Table 4 shows the area under the purity curves for all groups. We observe that PAM and PAMT, which use the MPATH ontology as their primary axis, have the highest average AUC with the UPGMA and complete linkage methods, whereas with the neighbor joining and K-medoids methods MA and MAP, which are based on the anatomy axis, do better on average in all groups.
Semantic similarity. Instead of using phenotype-based clustering and measuring the purity, we can also validate the hypothesis that a similar genetic background results in similar lesions directly, by determining whether mice from the same strain are more phenotypically similar to each other than mice from different strains. In particular, instead of identifying how well different mouse strains separate into clusters, we test globally how much more similar mice in the same strain are to each other than to mice in different strains. For this purpose, we treat phenotypic similarity as a classifier that identifies mice of the same strain as positives and mice of different strains as negatives, and we compute the true and false positive rates. A receiver operating characteristic (ROC) curve is a plot of the true positive rate as a function of the false positive rate. Figure 1 shows the ROC curves 36 for the tested groups of mice.
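Treating similarity as a classifier score can be illustrated with a short ROC computation. This is a minimal sketch with hypothetical similarity scores and labels (1 for a same-strain pair, 0 for a different-strain pair); it assumes distinct scores and does not handle ties, which a library implementation would.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> Tuple[List[float], List[float]]:
    """False and true positive rates as the threshold is lowered pair by pair."""
    positives = sum(labels)
    negatives = len(labels) - positives
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for i in order:
        if labels[i]:
            tp += 1  # same-strain pair ranked above the threshold
        else:
            fp += 1  # different-strain pair ranked above the threshold
        fpr.append(fp / negatives)
        tpr.append(tp / positives)
    return fpr, tpr


def auc(fpr: List[float], tpr: List[float]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2 for i in range(len(fpr) - 1)
    )


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]  # hypothetical pairwise similarities
labels = [1, 1, 0, 1, 0, 0]  # 1 = same strain, 0 = different strain
fpr, tpr = roc_points(scores, labels)
roc_auc = auc(fpr, tpr)
```

The resulting area equals the fraction of (same-strain, different-strain) pair combinations in which the same-strain pair has the higher similarity, here 8 of 9.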
We further quantify the differences between the ROC curves using the area under the ROC curves (ROCAUC) 36 . We then calculated the weighted average across the four groups of mice, weighting by the number of mice in each group. Table 5 shows the ROCAUC for the four groups of mice as well as their weighted average. We find that both MPATH and MA alone achieve a lower average AUC (0.7117 and 0.7290, respectively) than MAP (AUC 0.7454), MAPT (AUC 0.7486), PAM (AUC 0.7481), and PAMT (AUC 0.7462). The newly generated ontologies thus produced a higher average area under the curve than MPATH and MA alone.
To test how the similarities generated by the ontologies for each pair of mice differ, we used a Wilcoxon rank sum test to test whether the differences in ROCAUC are significant. We perform the test for each pair of ontologies and adjust p-values using Bonferroni correction. We find that there are significant differences between the new ontologies and the original ones: mice from the same strain are ranked significantly more similar than mice from different strains when using MAP, MAPT, PAM, and PAMT compared to MPATH (p = 4.5 × 10^-10, p = 7.5 × 10^-9, p = 1.1 × 10^-11, and p = 9.1 × 10^-12, respectively) as well as compared to MA (p = 0.02, p = 0.0272, p = 0.0159 and p = 0.0159, respectively). We also find that the difference between the new ontologies and MPATH is larger than the difference to MA. Among the four ontologies we generated, only the difference between MAP and MAPT as well as between MAPT and PAM/PAMT is significant (p = 0.021 and p = 0.033, respectively).

Discussion
Pattern-based ontology design and evaluation. In the life sciences it is widely accepted that reference ontologies should cover mainly one type of entity, and that multiple, interoperable ontologies can be used to characterize the different facets of a biological phenomenon 1 . Consequently, there is now a large set of ontologies available that can capture a wide range of phenomena 1,8 . As the ontologies are separate and often cover distinct, yet related, concepts, it is common practice to use multiple ontologies in annotating complex datasets. For example, in the annotation of protein functions, which is mainly based on the Gene Ontology (GO) 3 , additional ontologies are used to provide more complete and accurate descriptions: the Cell Type Ontology (CL) 42 or the Uberon anatomy ontology 6 can be used to restrict certain annotations to the context of particular cell types, anatomical structures or developmental processes; the ChEBI ontology of chemical structures 2 can provide accurate information about environmental exposures or stressors; and further ontologies can provide additional modifiers to annotations. Similarly, in the area of systems biology, it is very common to characterize models or the states of biological systems through a combination of multiple different ontologies 43 , and systematically combining these ontologies to formally describe the biological system can significantly extend the utility of individual annotations to separate ontologies 44 . Most importantly, multiple ontologies are widely combined in the area of phenotype descriptions 15 , since phenotypes can involve a wide range of morphological, environmental and processual entities, some of which are very difficult to capture, such as lifestyle and food preferences.
Consequently, it has now become a major challenge to identify ways in which classes from multiple ontologies can be combined systematically so as to comprehensively and accurately characterize biological phenomena while maintaining the interoperability between datasets that ontologies aim to achieve. Ontology design patterns (ODPs) are an approach to provide shared, tested, and well-documented axiomatic patterns which can be applied recurrently in similar situations and therefore maintain interoperability, even when multiple ontologies are used together 11 . ODPs serve a wide range of purposes; the dominant ones in the biomedical sciences are patterns for standardization of content, structure and presentation, with the aim of maximizing the efficiency of maintenance and development 45 . Additional motivations for using ODPs are the support of reasoning and ontology matching. Recently, motivated by the increasing importance of ontology design patterns in achieving and maintaining ontology and dataset interoperability, pattern libraries such as Dead Simple OWL Design Patterns 14 have emerged. These libraries collect design patterns that are intended to be reused throughout the life sciences. There are often different choices in how to combine classes from different ontologies, and these choices depend on the dataset, use case, or application 46 ; some design patterns may be better or worse suited to particular applications. It has thus become a challenge to identify ways to evaluate ontology design patterns and their utility in achieving certain outcomes; at the very least, given two choices of patterns to use (as, for example, in the area of phenotypes), it would be beneficial to determine whether the choices differ significantly or whether they likely lead to the same results.
We see one of our main contributions here as provision of a comprehensive set of evaluation methods for different ontology design patterns, and ways of comparing the effect that different ontology design patterns have on ontology-based data analysis. We use some of the most common ontology-based analysis methods in our evaluation: enrichment analysis and semantic similarity. While enrichment analysis is an exploratory method, we compare the ranks assigned through enrichment analysis to quantify how much ontology design patterns affect relative enrichment estimates. We use semantic similarity to determine how well the design patterns can reproduce a well-established biological hypothesis (i.e., that organisms with the same genotype have similar phenotypes) and quantify the effects through statistical measures, specifically, the receiver operating characteristic (ROC) curve 36 . Furthermore, we use clustering to determine if and how the different patterns make different groups of biological entities separable, and we introduce the area under the cluster purity curve as a quantitative measure.
The ontology design patterns we evaluate here are not only applicable to the two ontologies we use in our work. There is significant work on deciding optimal design patterns for combining anatomy and physiology ontologies into phenotype ontologies 7,16,47,48 , and the ontology design patterns that are commonly used are similar to the patterns we evaluate here. The main difference between generic phenotype ontologies and our work is the use of the Phenotype And Trait Ontology (PATO) 15 instead of MPATH, and the use of different anatomy ontologies such as the cross-species ontology Uberon 6 . Our evaluation strategy can therefore also be applied to other phenotype ontologies. However, different datasets and applications may yield different evaluation results 46 .
The findings from our analysis demonstrate that while we obtain a statistically significant improvement in most analyses when using the combined ontologies, the differences between them are small in magnitude and it is not obvious which design pattern performs better than the others. This finding shows the utility of our metrics-based approach to evaluation and informs the choice of ontology used in subsequent data analysis. In our case it seems that any of the combined ontologies is better than either single ontology, but that the combined ontologies would be expected to perform about equally well. The additional benefits of combining the MA and MPATH ontologies are the ease of data annotation with the combined ontologies (in contrast to annotation with two independent ontologies), access to more comprehensive classes that combine anatomy and pathology, and a more comprehensive characterization of data. As future research, a similar kind of evaluation can be performed on phenotype ontologies that are used to characterize human or mouse phenotypes and employed, for example, to discover gene-disease associations 49 or protein-protein interactions 50 .
While our aim was to provide evaluation procedures and evaluation results that can be transferred to other ontologies and applications, our approach nevertheless has several crucial limitations. Most importantly, our evaluations depend on the semantic similarity measure we employ and may not generalize to other similarity measures 51 ; our evaluation should be repeated if another similarity measure is chosen. Similarly, we demonstrate that the choice of the clustering algorithm changes the results, and while some general trends are observable across all algorithms we tested, the actual performance results are dependent on the algorithm. Our results do not necessarily generalize to other ontologies or even to other datasets, but demonstrate a set of methods, tools and approaches that can be employed to evaluate and test ontology design patterns.
The evaluation methods we introduce can be seen as complementary to evaluation frameworks such as OQuaRE 17 , which provide quantitative statistics and measures for evaluating ontologies intrinsically. Our methods are based on an application in which ontologies are used for the analysis of a specific dataset and can therefore provide an external evaluation.
Our work is related to ontology alignment, in which relations between classes in two or more ontologies are identified. The most common kind of relation identified in ontology alignment is an equivalence relation between two classes, and several ontology alignment systems can also identify subclass relations 52-54 . These approaches are well suited for ontologies that overlap, at least partially, in some of their content, but cannot generally find relations between ontologies that do not share a common domain. In these cases, inductive logic programming systems for OWL knowledge bases such as the DL-Learner framework 55 may be applied to find more common patterns through which classes or instances may be related. Automatically searching for ontology design patterns that optimize particular evaluation metrics (such as cluster purity or area under the ROC curve, as in our approach) could further improve our approach.
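As an illustration of one of the metrics named above, the following is a minimal sketch of cluster purity, assuming the common definition in which each cluster is credited with its majority label. The function name and the toy data are hypothetical and are not taken from our analysis pipeline.

```python
from collections import Counter


def cluster_purity(cluster_ids, true_labels):
    """Fraction of items that carry the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        raise ValueError("need at least one item")
    # Group the true labels (here: strains) by assigned cluster.
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(label)
    # For every cluster, count how many items share its most frequent label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six mice, two clusters, invented strain labels.
clusters = [0, 0, 0, 1, 1, 1]
strains = ["A", "A", "B", "B", "B", "B"]
purity = cluster_purity(clusters, strains)
```

A purity of 1 would mean that every cluster contains mice of a single strain; lower values indicate clusters that mix strains.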
Annotation challenges in histopathology. The formal coding of anatomic pathological observations needs to reflect the type of lesion observed together with its anatomical location and, where necessary, other characteristics such as microscopical anatomical variation, severity, or behavior. Because many lesions can occur in multiple tissues, a precomposed ontology to cover all eventualities runs into problems of combinatorial "bloat", which makes the resulting ontology difficult to use either by humans or in computation. The solution to this challenge has been to annotate with multiple ontologies, in particular anatomy (in the case of mice, the mouse anatomy ontology MA) and pathology (the mouse pathology ontology MPATH), and to use additional classes from PATO 56 and other ontologies when necessary. This approach allows for the coding of almost any lesion, but because the classes are used separately at the level of annotation, it limits the kinds of ontology-based analysis that can be carried out. For example, even simple tasks such as counting specific lesions in given sites, e.g., determining how many mammary gland lesions of all types have been observed, become more challenging.
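The counting task mentioned above can be sketched as follows. This is a minimal illustration that assumes annotations are stored as (mouse, anatomy, pathology) triples and that the anatomy hierarchy is available as a child-to-parents map. All labels, the hierarchy fragment and the function names are hypothetical and are not taken from MA, MPATH or our dataset.

```python
# Hypothetical anatomy fragment: child -> parents (is-a or part-of); not from MA.
ANATOMY_PARENTS = {
    "mammary gland duct": ["mammary gland"],
    "mammary gland": ["integumental system"],
    "liver lobe": ["liver"],
}


def ancestors_or_self(term, parents):
    """Return the term together with everything reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# Hypothetical annotations: (mouse id, anatomy label, pathology label).
annotations = [
    ("m1", "mammary gland duct", "hyperplasia"),
    ("m2", "mammary gland", "adenocarcinoma"),
    ("m3", "liver lobe", "adenoma"),
    ("m4", "mammary gland", "hyperplasia"),
]


def count_lesions_at(site, annotations, parents):
    """Count lesions of any type located at the site or in any of its parts."""
    return sum(
        1 for _, anatomy, _ in annotations
        if site in ancestors_or_self(anatomy, parents)
    )


mammary_count = count_lesions_at("mammary gland", annotations, ANATOMY_PARENTS)
liver_count = count_lesions_at("liver", annotations, ANATOMY_PARENTS)
```

With separately used ontologies this traversal has to be written by hand for each query; a compound ontology would let a reasoner answer the same question from the class hierarchy.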
Here, we formally combine MA and MPATH into a compound ontology using two different design patterns, one in which MA is used as the ontology framework and the other in which MPATH is. We additionally investigate the impact of introducing transitive parthood relationships into the structure of these new ontologies. Generating all possible MA and MPATH combinations is avoided by limiting the number of classes to those required to describe the dataset, plus a small number of structuring classes. We explore design patterns for compound ontologies and evaluate alternative models of representing histopathology data based on anatomical or pathological knowledge. We are able to relate the performance of different patterns to external and independently validated concepts, namely the expectation that phenotype similarity should correlate with genotype similarity.
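The idea of creating only the compound classes that the data require can be sketched as below. This is an assumption-laden illustration, not the axiomatization used to build our ontologies: the class naming scheme, the structuring class names and the function `build_compound_classes` are hypothetical.

```python
def build_compound_classes(observed_pairs, primary="anatomy"):
    """Map each compound class name to a structuring (parent) class.

    observed_pairs: iterable of (anatomy label, pathology label) pairs
    primary: "anatomy" groups classes by site, "pathology" by lesion type
    """
    if primary not in ("anatomy", "pathology"):
        raise ValueError("primary must be 'anatomy' or 'pathology'")
    hierarchy = {}
    # Only pairs that occur in the data become classes, which avoids
    # enumerating the full anatomy-by-pathology product.
    for anatomy, pathology in sorted(set(observed_pairs)):
        name = f"{pathology} in {anatomy}"
        if primary == "anatomy":
            parent = f"lesion of {anatomy}"
        else:
            parent = f"{pathology} at any site"
        hierarchy[name] = parent
    return hierarchy


# Hypothetical observed annotation pairs (one duplicate on purpose).
pairs = [
    ("mammary gland", "hyperplasia"),
    ("liver", "adenoma"),
    ("mammary gland", "hyperplasia"),
    ("liver", "hyperplasia"),
]
anatomy_first = build_compound_classes(pairs, primary="anatomy")
pathology_first = build_compound_classes(pairs, primary="pathology")
```

Both calls produce the same three compound classes; only the parent under which each class is grouped differs, which is the distinction between the two design patterns.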
The first question we address is whether the compound ontologies provide a better description of the data than the single anatomy or pathology ontologies, MA and MPATH, and which of the two designs is better. The second question evaluates the impact of introducing transitivity over parthood relations into the ontology axioms. As we use completely inbred strains of mice, individuals within the same strain have an identical genotype. We assess the relatedness of disease status between individual mice through semantic similarity and globally test inter- and intra-strain similarity. We find that the compound ontologies perform better than the individual MA and MPATH ontologies in establishing that the mice used show closer phenotypic relatedness within, rather than between, strains. Furthermore, using clustering and evaluating cluster purity, we also find that the mice separate better into groups based on their background strain using the combined ontologies compared to using either MA or MPATH alone.
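The comparison of intra- and inter-strain similarity can be illustrated with the sketch below. It uses Jaccard similarity over sets of annotation classes as a simple stand-in; this is an assumption made for brevity and is not the semantic similarity measure used in our analysis. Mouse identifiers, strains and annotations are hypothetical.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def intra_inter_similarity(profiles, strain_of):
    """Return (mean intra-strain, mean inter-strain) pairwise similarity.

    Requires at least one same-strain pair and one cross-strain pair,
    otherwise statistics.mean raises StatisticsError.
    """
    intra, inter = [], []
    for m1, m2 in combinations(sorted(profiles), 2):
        sim = jaccard(profiles[m1], profiles[m2])
        if strain_of[m1] == strain_of[m2]:
            intra.append(sim)
        else:
            inter.append(sim)
    return mean(intra), mean(inter)


# Hypothetical annotation profiles, one set of compound classes per mouse.
profiles = {
    "m1": {"hyperplasia in mammary gland", "adenoma in liver"},
    "m2": {"hyperplasia in mammary gland"},
    "m3": {"fibrosis in kidney"},
    "m4": {"fibrosis in kidney", "adenoma in liver"},
}
strain_of = {"m1": "S1", "m2": "S1", "m3": "S2", "m4": "S2"}
intra_mean, inter_mean = intra_inter_similarity(profiles, strain_of)
```

An ontology that describes the data well is expected to yield a higher intra-strain than inter-strain mean, which can then be checked with a suitable statistical test.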
We next evaluate the primary axis of classification for the combined ontologies, being either MA or MPATH. Evaluating these different ways of combining the ontologies has significant implications not only for our dataset but also in the area of phenotype ontologies, where different phenotype ontologies have been built based on different classification axes 15,16,48,57 . In our evaluation, when we compare the performance of ranks of classes produced by the compound ontologies in enrichment analysis, we find that the primary axis of classification has little effect on the ranks (at least when using the particular aging dataset on which we rely here). We obtained similar results using the evaluation of semantic similarity measures and the clustering.

Conclusions
Using the example of a specific large biological dataset we have shown that the data-driven generation of compound ontologies can yield powerful tools for data analysis. In addition we propose and assess comprehensive evaluation procedures for different design patterns for the resulting ontologies. We believe that these strategies for ontology generation and evaluation of ontology design patterns are generally applicable, and will be of great utility in dealing with the increasingly complex and multi-dimensional annotation of the large biomedical datasets now being widely collected.