Assessing phenotype order in molecular data

Biological entities are key elements of biomedical research. Their definition and their relationships are important in areas such as phylogenetic reconstruction, developmental processes or tumor evolution. Hypotheses about relationships like phenotype order are often postulated based on prior knowledge or belief. Evidence on a molecular level is typically unknown and whether total orders are reflected in the molecular measurements is unclear or not assessed. In this work we propose a method that allows a fast and exhaustive screening for total orders in large datasets. We utilise ordinal classifier cascades to identify discriminable molecular representations of the phenotypes. These classifiers are constrained by an order hypothesis and are highly sensitive to incorrect assumptions. Two new error bounds, which are introduced and theoretically proven, lead to a substantial speed-up and allow the application to large collections of many phenotypes. In our experiments we show that by exhaustively evaluating all possible candidate orders, we are able to identify phenotype orders that best coincide with the high-dimensional molecular profiles.

Biological entities are key elements of biomedical research. Their definition and their relationships are important in areas such as phylogenetic reconstruction, developmental processes or tumor evolution. Hypotheses about relationships like phenotype order are often postulated based on prior knowledge or belief. Evidence on a molecular level is typically unknown and whether total orders are reflected in the molecular measurements is unclear or not assessed. in this work we propose a method that allows a fast and exhaustive screening for total orders in large datasets. We utilise ordinal classifier cascades to identify discriminable molecular representations of the phenotypes. These classifiers are constrained by an order hypothesis and are highly sensitive to incorrect assumptions. two new error bounds, which are introduced and theoretically proven, lead to a substantial speed-up and allow the application to large collections of many phenotypes. in our experiments we show that by exhaustively evaluating all possible candidate orders, we are able to identify phenotype orders that best coincide with the highdimensional molecular profiles.
Assessing the correspondence between observable phenotypes and their underlying molecular background is a challenging task in molecular biology. Even for pairwise comparisons it is not straight forward to confirm hypothesised relations in high-dimensional marker representations.
This becomes even more evident for higher order relations among multiple phenotypes. In this case, local events and global processes might be confused as they both can lead to the same pattern of observable phenotypes. While local -pairwise -events might be reflected by any type of pairwise differences, an overall connecting pattern is required for global processes. An example, for such higher level relations are ordinal phenotypes of type   phenotype phenotype phenotype 1 2 3 as they might occur in developmental processes [1][2][3][4] , like embryogenesis 5 , phylogenetic reconstruction [6][7][8] or diagnostic stagings or gradings [9][10][11][12] . Their observable representations suggest an order of the phenotypes (), which might lead to hypotheses on a connecting "ordinal" relation or process on a molecular level (Fig. 1B). Providing evidence for these hypotheses is quite challenging due to the high dimensionality of molecular profiles. Being defined for univariate categorical variables, the concept of ordinality can be embedded in many different ways in a multivariate real-valued feature representation. There might also be several ordinal relations that coexist in parallel.
In the research field of ordinal classification usually, a known order is used to improve classification performance. The assumption is that the given order between the classes (phenotypes) can be mapped to the given representation and hence also holds in the feature space. In this work, we instead propose a method that can check whether the reflection is provided, by elaborating a performance-based criterion for detecting and comparing ordinal structures in multivariate feature representations. We present an algorithm (CASCADES) that allows for systematic and exhaustive screens through the search space of all phenotype orders. It is applicable for extracting a small set of candidate orders from a feature representation that fulfils a minimal generalisation ability of a predictive model. Based on supervised classification, our method uses the canonical paradigm for learning relationships between raw uninterpreted feature representations and semantically meaningful phenotypes (classes, categories, concepts, etc.) [13][14][15] . Utilising feature representations and class memberships these techniques allow the extraction of phenotype-specific patterns and the construction of phenotype separating boundaries. In this way, classifiers identify characteristics of phenotypes or even learn the key attributes of their concepts. Mainly designed for discrimination the learning processes of classification algorithms often neglect the semantic relationships among classes. Standard training algorithms would neither request nor reconstruct such dependencies explicitly [16][17][18] .
For this reason, we focus on ordinal classifier cascades 19 of binary base classifiers. They are a specialisation of general decision lists [20][21][22] . A predefined order of phenotypes constrains the learning algorithm of an ordinal classifier cascade. Initially designed for guiding the learning process, we showed that wrong assumptions on the class order can lead to severely decreased detection rates of an ordinal classifier cascade 23 .
In our approach (Fig. 1), we separate the training and evaluation of binary base classifiers from the construction of the classifier cascade. For the training, no order information is used, and each base classifier is trained independently. But the assumed phenotype order defines the evaluation sequence of these pairwise base classifiers. In each step of the evaluation a feature space region is labelled as decision region for a specific class, and the remaining space stays unlabelled. If the order is wrong, then samples of classes that are later in the order already lie within this region, whereas if it reflects the order they lie within the unlabelled region. Although trained only pairwise, the base classifiers show good performance when used to distinguish between a class and all following classes in the order.
Here, we utilise this susceptibility as a clear cut criterion for discriminating between class orders that allow a high generalisation ability or not. We provide theoretical upper bounds on the class-wise sensitivities of ordinal classifier cascades, which would enable the proposed algorithm to scale up to large collections of phenotypes. The combination of the pairwise training scheme and these bounds lead to a complexity reduction, as the number of base classifier trainings, in a single train-test experiment, for n classes is reduced from (n − 1)n! to (n − 1)n, and the number of cascade constructions and trainings is in the worst case n! but decreases by the number of cascades that do not pass this bound. We could show the utility of our method to identify reflected orderings in experimental evaluations on artificial data and gene expression profiles of developmental and ageing phenotypes.

Results
We evaluated the ability of the CASCADES algorithm to detect reflected orders in feature space based on artificial and existing gene expression datasets (see Methods). For our analysis a linear support vector machine (SVM) 18 was chosen as a base classifier for the ordinal cascades due to its superior performance 23 . The SVM was imported from the LIBSVM library 24 . Its cost parameter was fixed to a value of one.
The performance of the ordinal cascades as well as their base classifiers were evaluated in 10 × 10 cross-validation (CV) experiments 25 . The 10 × 10 CV experiments were repeated for all class orders (  | |! experiments) and the performance was measured in terms of minimal class-wise sensitivity. All classification experiments were performed with help of the TunePareto software 26 . The analysis on the simple artificial datasets show that our method can distinguish between phenotype orders that are reflected and are not reflected in the two dimensional data. The ordinal assumption is imposed by a common increase in both features. The results for sd = 0.2 are given in Table 1. It can be seen that for d 1 and d 2 the correct order and its inverse are returned. All other possible orders show a minimal class-wise sensitivity lower than 50%. For d 3 (non-ordinal) no order passed the threshold of 50%.
We additionally analysed the performance in dependency of the standard deviation of the artificial data clouds. For the dataset d 1 (linear) and d 2 (curved) the sensitivities under the correct assumption decline with increasing standard deviations ( Supplementary Fig. S2). The corresponding bounds lie above the real sensitivities. For the wrong order, the sensitivity of at least one class drops. In the given example the sensitivities of classes y 1 to y 5 and the corresponding bounds are mainly identical to those of the correct class order but largest changes can be observed for classes y 6 and y 7 . For the non-ordinal dataset d 3 the minimal class-wise sensitivity is zero independent of the standard deviation ( Supplementary Fig. S2).
For each dataset and each setting, all 10! ≈ 3.6⋅10 6 possible class orders are screened and the number of remaining class orders is reported ( Supplementary Fig. S3). Datasets d 1 and d 2 show comparable results. Our method returned at most four candidate cascades for each experiment. With increasing standard deviation the distinction of classes became harder. Candidate cascades could only fulfil lower sensitivity thresholds t. The bounds of all rejected cascades predicted minimal sensitivities below 0.5. With lower thresholds the chance of finding more than two candidate cascades increased. As expected, no candidates were proposed for dataset d 3 (non-ordinal). Evaluating the real minimal class-wise sensitivities of the remaining cascades revealed that additional candidates were rejected.

Gene expression datasets.
Furthermore, experiments on existing gene expression data were performed (see Methods). We chose ordinal multi-class expression data for which the classes correspond to specific points in time of a process. In three datasets (d 4 , d 5 , d 7 ) the classes correspond to developmental stages of Drosophila Screening experiments were performed for sensitivity thresholds t ∈ {1.00,0.95, …, 0.50}. Each screening experiment comprises all  | |! class orders. All candidate cascades that pass a minimal sensitivity threshold bound t ≥ 0.5 are reported. The bound and the real minimal class-wise sensitivities are reported. The highest bounds and minimal class-wise sensitivities are marked in boldface. melanogaster 1 , Danio rerio 2 , and Caenorhabditis elegans 4 . Additionally, d 7 was used with a different labelling. This further labelling was not based on stages but based on the point in time at which the sample was taken. The further dataset in our analysis d 6 comprises transcriptome samples of human muscles 3 . The data was categorised into four classes according to the age (in years) of the participants. For all these datasets it is expected that the assumed order based on the order of points in time is reflected within the expression profiles. To test our method (CASCADES algorithm) on real data for which no order is assumed, we included gene expression profiles from cell lines that are derived from 9 different cancer tissue types 27 .
The results of the real datasets are shown in Table 1. As the performance of the cascade depends on a sensitivity bound which depends only on the performance of the independent base classifiers, first those candidate cascades are reported that pass a sensitivity bound t ≤ 0.5. For the temporal ordinal datasets d 4 -d 8 , the CASCADES algorithm rejected at least 83.3% of all candidate cascades. No candidate passed the CASCADES algorithm for the non-ordinal dataset d 9 . The number of candidates is further depleted by analysing the minimal class-wise sensitivity of the full cascades. For dataset d 4 , the highest minimal class-wise sensitivity (89.4%) was achieved by the correct class order. It was followed by the correct inverse class order (72.3%) and an incorrect class order (71.0%).
Three candidates passed the CASCADES algorithm for dataset d 5 . The highest minimal class-wise sensitivity was achieved by two candidate cascades (85.0%). Both reflect the inverse class order. The first one corresponds to the correct inverse class order. The second one assumes an incorrect class order   embryo embryo embryo 3 1 2 . The third candidate achieved a minimal class-wise sensitivity of 54.7%. A general division between the adult and embryo samples can be observed. The order might be explained by the duration between the different states. Whereas all three embryonic samples cover a range of 10 days after birth, the first adult class comprises samples taken at month 3 and the adult 2 class is 1-2 years after birth. As a result the order assumption 3 might only be reflected ambiguously. Four class orders passed the CASCADES algorithm on dataset d 6 . Two of these candidates dropped out due to a minimal class-wise sensitivity lower than 50.0%. The remaining two achieved minimal class-wise sensitivities of 62.5%. One of these class orders corresponds to the correct class order. The second one proposes a partially consistent class order (  age age 4 3 ). As age 3 and age 4 comprise 10 years each and age 1 and age 2 20 years it can be argued, similar to the cascades of d 5 . The two classes age 3 and age 4 might be too similar under the order assumption leading to comparable results.
For dataset d 7 , the minimal class-wise sensitivity of the correct class order outperformed the performance of other candidate cascades (91.7%). All other candidate cascades achieved a minimal class-wise sensitivity of at most 66.7%. Only one candidate cascade passed the CASCADES algorithm when analysed on the level of points in time (d 8 ). The correct class order gained a minimal class-wise sensitivity of 66.7%, which was achieved for class t 2 . All other classes achieved class-wise sensitivities of at least 80.0%. Especially this dataset shows that our analysis does not aim at improving the classification performance as much as possible, but rather finding the order that outperforms other orders and this independent of a specific performance, as long as the performance is better than 50%.

Discussion
Ordinal relations between phenotypes are often defined on a semantic level. These relations are assumed to be reflected in a given feature representation without evaluating whether these assumptions hold. It might be the case that independent causes lead to ordinal phenotype characteristics or that the order is not reflected in the chosen feature space because the measured features are not responsible for the observed order.
In this work, we present ordinal classification as an example of a supervised learning task that incorporates semantic relations in the training process of classification models. By constraining the learning process, ordinal classification results in a restricted model class, which is no longer able to separate an arbitrary landscape of classes. This property is used to falsify wrong assumptions on the dependencies of the classes and the chosen feature space.
We provide two theoretical upper bounds on the minimal class-wise sensitivity, which are utilised for accelerating the training of ordinal classifier cascades and allow an exhaustive evaluation of all possible class orders. In this way, ordinal classifier cascades are used as an explorative tool to screen for unknown ordinal dependencies. In our experiments, we give examples for up to 10 different classes resulting in the evaluation of over 3.6 million class orders. Although our algorithm requires pairwise training of the ensemble members, both the bounds and the algorithm are independent of the chosen type of base classifier, the binary training type and might be transferred to alternative ensembles.
Our experiments on the artificial data showed that only if any ordinality is reflected in the feature space possible sets of candidate orders are returned. If there is no ordinal sequence reflected no cascade passes the bound of 0.5. No order was detected for non-ordinal datasets. Always the correct order and its reverse were found as dominant order if cascades were returned for the artificial ordinal datasets.
For all datasets independent of the chosen standard deviation, at least 80% of all candidate cascades could be rejected due to minimal class-wise sensitivities lower than 50%. However, although the procedure can reconstruct the correct class order for all datasets, alternative ordinal class structures might be detected. In our experiments, these alternatives differ from the assumed class order in the position of the last two classes. A reason might be the lower number of constraints for these classes.
For biological applications, we evaluated our method on observable ordinal phenotypes for which a reflection in gene expression levels can be assumed. For three different model organisms we analysed developmental stages characterised by their morphology (D. melanogaster) 1 , age (Danio rerio) 2 and number of C-lineage cells (C. elegans) 4 . For C. elegans also the sampling time points were used in the analysis. www.nature.com/scientificreports www.nature.com/scientificreports/ Our screening procedure allowed to reveal ordinal structures within the gene expression profiles of all three model organisms. The hypothesised time relation or its inverse is always included in the set of best-performing cascades. In three out of those four datasets the hypothesised relation dominates with a performance gap towards all other cascades. This strongly indicates a reflection of these orderings in the profiles.
For the Danio rerio dataset (d 5 ) two cascades rank first before a performance gap. There is a swap observed between the two youngest embryo phenotypes. This might be caused either by a data intrinsic reason that those classes are not distinct enough, as staging by days post fertilization has been shown to exhibit high variation in growth rate 28 , or by the technical aspect of the lower number of constraints for later classes.
In contrast to the developmental processes, of which the order of stages is tightly regulated by a genetic program 5,29 , ageing is influenced by multiple factors 30 . Nevertheless we resulted in comparable results for the dataset that measures age related gene expression changes. On the human muscle adaptation dataset 91.7% of all candidate orders were rejected. Among the remaining two candidates the expected order and only one false positive can be found. For non-ordinal phenotypes, as given by the collection of distinct cancer cell lines, no candidate cascades were observed. This indicates that ordinal relations are not a common phenomenon among multiple phenotypes.
Our method can, however, not only be used to confirm proposed hypotheses but also to explore the feature space for potential ordinal structures. This might become relevant if the relation is not easily accessible due to sampling. In surgery, for example, histologically distinguishable tissue regions can be defined in the same biopsy or in single-cell experiments various cell types are extracted from one sample. Within these feature spaces our procedure allows for screening of total ordinal cascades and additionally of ordinal sub-cascades embedded in a larger set of non-ordinal classes. It can hence be used to screen for intrinsic molecular ordinal structures and hypothesis relational axes, which might not be detected in a standard multi-class analysis.

Methods
We will use following notation throughout the description of the methodology behind the algorithm. An object is represented by a vector of real-valued measurements x = (x (1) , …, x (n) ) T ∈ ℝ n . Each object is assumed to be cat-  denotes the space of all class labels. The general classification task will be to identify a function mapping c, a classifier, that allows the accurate prediction of the class labels of new unseen objects  As quality measures, we utilise the conditional prediction rates of c. These estimate the probability of classifier c to predict the class label y j for samples of class y i based on a set of test samples i  . In its basic version, a conditional prediction rate can be calculated as denotes the indicator function. Other (re-) sampling strategies might be used for determining conditional prediction rates. However, they will not alter the theoretical characteristics discussed in this work. We distinguish between three types of conditional prediction rates: 1. sensitivities if y i = y j and ∈ y y , i j , 2. confusions if y i ≠ y j and ∈ y y , i j , 3. external rates if  ∉ y i and  ∈ y j . While (class-wise) sensitivities and confusions build the standard quality measures of a confusion matrix 31 , the external rates describe the classifiers behaviour on foreign classes. They will especially be of interest when dealing with different label spaces.
In the basic multi-class classification scenario, a classifier is typically adapted in a data-driven training procedure based on a set of training samples which can be separated in the chosen feature space. In the ordinal classification scenario it is additionally assumed that the labels in  are totally ordered … .
In this context, the symbol y i ( ) denotes the ith class of the order. We utilise the symbol  to indicate that the order relationship is only known for the label space. It is unclear, if this relationship is reflected by the chosen measurements. Nevertheless, ordinal classifiers rely on this assumption. The order of the classes is utilised for guiding the construction of the decision regions. It is provided as additional information to the training algorithm.
Ordinal classifier cascades. In the following, we will discuss ordinal classifier cascades of type The cascade h i,j can be seen as a late-aggregation multi-classifier system 32 A scheme of this architecture can be found in Fig. 2. For classifying a sample x, the ensemble members c (k) (x) are evaluated sequentially according to the assumed order of classes. If a base classifier c (k) (x) predicts its first class label y (k) , the procedure stops and h i,j (x) = y (k) . If it predicts class label y (k+1) the sample is passed to the subsequent base classifier c (k+1) . This fusion scheme implies following three characteristics on h i,j : 1. Each class y (k) , i < k < j + 1 can be predicted by two base classifiers. 2. The lowest class y (i) can only be predicted by the first classifier c (i) (x). The highest classes y (j+1) can only be predicted by the last classifier c (j) (x). 3. A sample x will only be passed to a base classifier c (k) (x), 1 < k ≤ j, if all its predecessors c (l) (x), l < k, decide for their second class y (l+1) .
Training algorithms for ordinal classifier cascades mainly focus on the training of the base classifiers. In the following, we utilise a pairwise inductive training, in which the training set  k ( ) of a base classifier c (k) consists of the samples of classes y (k) and y (k+1) In a previous study, this type of training showed to induce the highest susceptibility to incorrect assumptions on the class order 23 .
Upper bounds on class-wise sensitivities. The structural properties of the ordinal classifier cascade allow for the construction of upper limits on their empirical class-wise sensitivities. These bounds are based on the training of the cascade's base classifiers and postulated in Theorem 1. Although this theorem is formulated for full cascades, the corresponding bounds can directly be applied for partial cascades.

 be a non-empty set of samples of class y (i) . Then the sensitivity of h for y (i) is limited by
Proof. The theorem is a direct consequence of Lemmata 1 and 2 (see Supplementary). Theorem 1 states that the sensitivities of an ordinal classifier cascade h can be upper bounded by several conditional prediction rates of its base classifiers. For class y (i) , the sensitivity of the cascade is limited by the corresponding sensitivity of its ith base classifier c (i) (Eq. 4). It is also bounded by the predictions of all previous base classifiers c (k) , k < i (Eq. 5). A sample of class y (i) will not be classified correctly, if it is classified as y (k) by c (k) . The sensitivity of the cascade for class y (i) is therefore also limited by the conditional prediction rate of c (k) for predicting class label y (k+1) for samples of class y (i) . A detailed theoretical proof can be found in the Supplementary.

Detection of ordinal class structures.
Ordinal classifier cascades can be used for detecting wrong assumptions about the ordinality of the real class structures. Due to their susceptibility, these classifiers will fail when the real feature structures reflect a different class order or no class order at all. In a screening process, A sensitivity threshold t ≤ p * is used for determining whether an ordinal class structure can be assumed or not. The criterion can be evaluated for each order of the classes in .
The findings of Theorem 1 allow an alternative evaluation of this criterion. As a direct consequence of Theorem 1, the value of p * can again be upper bounded by conditional prediction rates of the base classifiers Ordinal classifier cascades that are based on wrong assumptions about the ordinality of the classes can therefore be sorted out by the training of the corresponding base classifiers. A graphical illustration describing this sorting out based on a four class example and dependent on Eqs 6 and 7 can be found in Supplementary Fig. S4.
Coupled to a pairwise inductive training of the base classifiers (Eq. 3) the bounds of Theorem 1 can reduce complexity of screens for ordinal structures. As the training of a base classifier c (i) is only based on the samples of classes y (i) and y (i+1) , it is no longer dependent on the position of the base classifiers within the cascade h. Cascades trained on different orders of  will therefore consist of common building blocks. The exhaustive training of all | |!  cascades, each consisting of  | | − 1 base classifiers, can therefore be accelerated by precalculating and evaluating all possible   | | − | | ( 1) base classifiers c i,j :ℝ n → {y i , y j }. Note that symbols y i , y j and c i,j no longer rely on an assumed class order.
In any case, the complexity of the exhaustive evaluation is mainly determined by the training and evaluation complexity of the base classifiers. A comparison of the precalculation scheme and a de novo calculation of all cascades in dependency on the numbers of classes | |  can be found in Table 2. For the presented numbers we assume an evaluation via a single training-test split. For three classes, the de novo strategy already requires twice the number of base classifier trainings and evaluations than the precalculation strategy. For ten classes, the de novo strategy trains more than 3⋅10 7 base classifiers while the precalculation scheme only demands 90 base classifiers. The number of base classifier trainings and evaluations might be increased by a constant factor if resampling strategies are applied.
The following quality measures are needed for the application of Theorem 1: . The algorithm is initialised with the full set of labels , an empty set of candidate cascades  = ∅, the chosen threshold t and y i = ∅. The parameter y i will later on indicate the class label selected in the previous recursion. In each recursion, the class labels  ∈ y j are tested as possible extensions of the candidate cascades in . If FC i,j ≥ t and  ∀ ∈ ≥ y r t : SC ( ) r i j , the current base classifier c i,j fulfils the bounds on p * . In this case, class label y j can be added to the current candidate cascade and can be removed from the set of remaining labels . The next base classifier will be chosen by a recursive call CASCADES( y \{ } j  ,  r , y j , t). If the current base classifier does not fulfil the minimal criteria, the corresponding (partial) candidate cascades are erased and an empty set ∅ is returned. All suitable candidate cascades are collected at the end of the recursive call. Although the algorithm CASCADES rejects cascades with too low minimal class-wise sensitivities, the remaining candidates are not guaranteed to fulfil the minimal criterion t ≤ p * . Each of the final candidates must therefore be cross-checked by an evaluation of the full cascade. CASCADES can directly be applied for the evaluation of partial ordinal cascades. By replacing the initial set of class labels  by a subset ′ ⊂  , the algorithm will evaluate all orders of the class labels in ′  .
Datasets. An overview of the characteristics of all datasets can be found in Supplementary Table S1. The datasets d 5 -d 9 were collected from the gene expression omnibus repository 33 (GSE13371, GSE47881, GSE2180, GSE32474) and processed using the robust multi-array average (rma) normalisation as implemented in the affy package 34 . For d 4 the processed data was downloaded.
. As a starting point = m (0, 0) y T 0 was chosen. This dataset has a curved shape.  and evaluated on the samples of all classes in . The sensitivity of c i,j on detecting samples of y i is denoted by FC i,j . The conditional prediction rates of c i,j of predicting class y j for samples of class y r is given by SC i,j (r). The function parameters are the set of class labels , a set of candidate cascades , the threshold t and y i which specifies the class label that was selected in the previous recursion.