Abstract
Biological entities are key elements of biomedical research. Their definition and their relationships are important in areas such as phylogenetic reconstruction, developmental processes or tumor evolution. Hypotheses about relationships like phenotype order are often postulated based on prior knowledge or belief. Evidence on a molecular level is typically unknown, and whether total orders are reflected in the molecular measurements is unclear or not assessed. In this work we propose a method that allows fast and exhaustive screening for total orders in large datasets. We utilise ordinal classifier cascades to identify discriminable molecular representations of the phenotypes. These classifiers are constrained by an order hypothesis and are highly sensitive to incorrect assumptions. Two new error bounds, which are introduced and theoretically proven, lead to a substantial speedup and allow the application to large collections of many phenotypes. In our experiments we show that by exhaustively evaluating all possible candidate orders, we are able to identify phenotype orders that best coincide with the high-dimensional molecular profiles.
Introduction
Assessing the correspondence between observable phenotypes and their underlying molecular background is a challenging task in molecular biology. Even for pairwise comparisons, it is not straightforward to confirm hypothesised relations in high-dimensional marker representations.
This becomes even more evident for higher-order relations among multiple phenotypes. In this case, local events and global processes might be confused, as both can lead to the same pattern of observable phenotypes. While local (pairwise) events might be reflected by any type of pairwise differences, an overall connecting pattern is required for global processes. An example of such higher-level relations are ordinal phenotypes of the type \(phenotype_{1}\,\prec \,phenotype_{2}\,\prec \,phenotype_{3}\), as they might occur in developmental processes^{1,2,3,4}, like embryogenesis^{5}, phylogenetic reconstruction^{6,7,8} or diagnostic stagings or gradings^{9,10,11,12}. Their observable representations suggest an order of the phenotypes (\(\prec \)), which might lead to hypotheses on a connecting “ordinal” relation or process on a molecular level (Fig. 1B). Providing evidence for these hypotheses is quite challenging due to the high dimensionality of molecular profiles. Being defined for univariate categorical variables, the concept of ordinality can be embedded in many different ways in a multivariate real-valued feature representation. There might also be several ordinal relations that coexist in parallel.
In the research field of ordinal classification, a known order is usually used to improve classification performance. The assumption is that the given order between the classes (phenotypes) can be mapped to the given representation and hence also holds in the feature space. In this work, we instead propose a method that can check whether this reflection actually holds, by elaborating a performance-based criterion for detecting and comparing ordinal structures in multivariate feature representations. We present an algorithm (CASCADES) that allows for systematic and exhaustive screens through the search space of all phenotype orders. It is applicable for extracting a small set of candidate orders from a feature representation that fulfils a minimal generalisation ability of a predictive model. Based on supervised classification, our method uses the canonical paradigm for learning relationships between raw, uninterpreted feature representations and semantically meaningful phenotypes (classes, categories, concepts, etc.)^{13,14,15}. Utilising feature representations and class memberships, these techniques allow the extraction of phenotype-specific patterns and the construction of phenotype-separating boundaries. In this way, classifiers identify characteristics of phenotypes or even learn the key attributes of their concepts. Mainly designed for discrimination, the learning processes of classification algorithms often neglect the semantic relationships among classes. Standard training algorithms would neither request nor reconstruct such dependencies explicitly^{16,17,18}.
For this reason, we focus on ordinal classifier cascades^{19} of binary base classifiers. They are a specialisation of general decision lists^{20,21,22}. A predefined order of phenotypes constrains the learning algorithm of an ordinal classifier cascade. Although initially designed for guiding the learning process, we previously showed that wrong assumptions on the class order can lead to severely decreased detection rates of an ordinal classifier cascade^{23}.
In our approach (Fig. 1), we separate the training and evaluation of binary base classifiers from the construction of the classifier cascade. For the training, no order information is used, and each base classifier is trained independently. The assumed phenotype order, however, defines the evaluation sequence of these pairwise base classifiers. In each step of the evaluation, a feature space region is labelled as the decision region for a specific class, and the remaining space stays unlabelled. If the assumed order is wrong, samples of classes that occur later in the order already lie within this labelled region; if the order is reflected, they lie within the unlabelled region. Although trained only pairwise, the base classifiers show good performance when used to distinguish between a class and all following classes in the order.
Here, we utilise this susceptibility as a clear-cut criterion for discriminating between class orders that allow a high generalisation ability and those that do not. We provide theoretical upper bounds on the class-wise sensitivities of ordinal classifier cascades, which enable the proposed algorithm to scale up to large collections of phenotypes. The combination of the pairwise training scheme and these bounds leads to a complexity reduction: the number of base classifier trainings in a single train-test experiment for n classes is reduced from (n − 1)n! to (n − 1)n, and the number of cascade constructions and trainings is in the worst case n! but decreases by the number of cascades that do not pass this bound. We show the utility of our method for identifying reflected orderings in experimental evaluations on artificial data and gene expression profiles of developmental and ageing phenotypes.
Results
We evaluated the ability of the CASCADES algorithm to detect reflected orders in feature space based on artificial and existing gene expression datasets (see Methods). For our analysis a linear support vector machine (SVM)^{18} was chosen as a base classifier for the ordinal cascades due to its superior performance^{23}. The SVM was imported from the LIBSVM library^{24}. Its cost parameter was fixed to a value of one.
The performance of the ordinal cascades as well as of their base classifiers was evaluated in 10 × 10 cross-validation (CV) experiments^{25}. The 10 × 10 CV experiments were repeated for all class orders (\(|{\mathscr{Y}}|!\) experiments) and the performance was measured in terms of the minimal class-wise sensitivity. All classification experiments were performed with the help of the TunePareto software^{26}.
Artificial datasets
We performed experiments on three different kinds of artificial datasets; two of them reflect an ordered sequence of sample clouds in the feature space (d_{1} and d_{2}) and one does not (d_{3}) (Supplementary Fig. S1). Each dataset comprises \(|{\mathscr{Y}}|=10\) classes y_{i} (i = 1, …, 10) of 100 samples each. The instances of class y are drawn independently and identically distributed from a normal distribution \({\bf{x}} \sim {\mathscr{N}}({{\bf{m}}}_{y},sd)\) centred at a class-specific centroid m_{y} ∈ ℝ^{2}. The standard deviation sd was identical for all classes. Experiments were performed for standard deviations sd ∈ {0.1, 0.2, …, 1.0}.
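The sampling scheme can be sketched in a few lines; the sketch below uses the linear centroids \({{\bf{m}}}_{{y}_{i}}={(i,i)}^{T}\) of dataset d_{1} (function name and the fixed seed are our own choices, not part of the published setup):

```python
import numpy as np

def linear_dataset(n_classes=10, n_samples=100, sd=0.2, seed=0):
    """Draw n_samples i.i.d. points per class from N(m_y, sd), with m_y = (i, i)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(1, n_classes + 1):
        centroid = np.array([i, i], dtype=float)
        X.append(rng.normal(loc=centroid, scale=sd, size=(n_samples, 2)))
        y += [i] * n_samples
    return np.vstack(X), np.array(y)

X, y = linear_dataset()
print(X.shape, y.shape)  # (1000, 2) (1000,)
```

The curved dataset d_{2} follows the same scheme with centroids drawn relative to their predecessors instead of lying on a line.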
The analysis on the simple artificial datasets shows that our method can distinguish between phenotype orders that are and are not reflected in the two-dimensional data. The ordinal assumption is imposed by a common increase in both features. The results for sd = 0.2 are given in Table 1. It can be seen that for d_{1} and d_{2} the correct order and its inverse are returned. All other possible orders show a minimal class-wise sensitivity lower than 50%. For d_{3} (non-ordinal), no order passed the threshold of 50%.
We additionally analysed the performance in dependence of the standard deviation of the artificial data clouds. For the datasets d_{1} (linear) and d_{2} (curved), the sensitivities under the correct assumption decline with increasing standard deviations (Supplementary Fig. S2). The corresponding bounds lie above the real sensitivities. For the wrong order, the sensitivity of at least one class drops. In the given example, the sensitivities of classes y_{1} to y_{5} and the corresponding bounds are largely identical to those of the correct class order, but the largest changes can be observed for classes y_{6} and y_{7}. For the non-ordinal dataset d_{3}, the minimal class-wise sensitivity is zero independent of the standard deviation (Supplementary Fig. S2).
For each dataset and each setting, all 10! ≈ 3.6⋅10^{6} possible class orders were screened and the number of remaining class orders is reported (Supplementary Fig. S3). Datasets d_{1} and d_{2} show comparable results. Our method returned at most four candidate cascades per experiment. With increasing standard deviation, the distinction of classes became harder, and candidate cascades could only fulfil lower sensitivity thresholds t. The bounds of all rejected cascades predicted minimal sensitivities below 0.5. With lower thresholds, the chance of finding more than two candidate cascades increased. As expected, no candidates were proposed for dataset d_{3} (non-ordinal). Evaluating the real minimal class-wise sensitivities of the remaining cascades revealed that additional candidates were rejected.
Gene expression datasets
Furthermore, experiments on existing gene expression data were performed (see Methods). We chose ordinal multi-class expression data for which the classes correspond to specific points in time of a process. In three datasets (d_{4}, d_{5}, d_{7}) the classes correspond to developmental stages of Drosophila melanogaster^{1}, Danio rerio^{2}, and Caenorhabditis elegans^{4}. Additionally, d_{7} was used with a different labelling, which was based not on stages but on the point in time at which the sample was taken. A further dataset in our analysis, d_{6}, comprises transcriptome samples of human muscles^{3}. The data was categorised into four classes according to the age (in years) of the participants. For all these datasets it is expected that the assumed order, based on the order of points in time, is reflected within the expression profiles. To test our method (the CASCADES algorithm) on real data for which no order is assumed, we included gene expression profiles from cell lines derived from 9 different cancer tissue types^{27}.
The results for the real datasets are shown in Table 1. As the performance of a cascade is limited by a sensitivity bound that depends only on the performance of the independent base classifiers, we first report those candidate cascades that pass a sensitivity bound t ≤ 0.5. For the temporal ordinal datasets d_{4}–d_{8}, the CASCADES algorithm rejected at least 83.3% of all candidate cascades. No candidate passed the CASCADES algorithm for the non-ordinal dataset d_{9}. The number of candidates is further depleted by analysing the minimal class-wise sensitivity of the full cascades. For dataset d_{4}, the highest minimal class-wise sensitivity (89.4%) was achieved by the correct class order, followed by the inverse of the correct class order (72.3%) and an incorrect class order (71.0%).
Three candidates passed the CASCADES algorithm for dataset d_{5}. The highest minimal class-wise sensitivity (85.0%) was achieved by two candidate cascades, both proceeding in the inverse direction. The first one corresponds to the inverse of the correct class order. The second one assumes an incorrect class order \(embryo_{3}\,\prec \,embryo_{1}\,\prec \,embryo_{2}\). The third candidate achieved a minimal class-wise sensitivity of 54.7%. A general division between the adult and embryo samples can be observed. The order might be explained by the durations between the different states: whereas all three embryonic classes cover a range of 10 days after birth, the first adult class comprises samples taken at month 3 and the adult_{2} class comprises samples taken 1–2 years after birth. As a result, the order assumption \(embryo_{1}\,\prec \,embryo_{2}\,\prec \,embryo_{3}\) might only be reflected ambiguously.
Four class orders passed the CASCADES algorithm on dataset d_{6}. Two of these candidates dropped out due to a minimal class-wise sensitivity lower than 50.0%. The remaining two achieved minimal class-wise sensitivities of 62.5%. One of these class orders corresponds to the correct class order; the second one proposes a partially consistent class order (\(age_{4}\prec age_{3}\)). As age_{3} and age_{4} comprise 10 years each, and age_{1} and age_{2} 20 years each, a similar argument as for the cascades of d_{5} can be made: the two classes age_{3} and age_{4} might be too similar under the order assumption, leading to comparable results.
For dataset d_{7}, the minimal class-wise sensitivity of the correct class order (91.7%) outperformed all other candidate cascades, which achieved minimal class-wise sensitivities of at most 66.7%. Only one candidate cascade passed the CASCADES algorithm when analysed on the level of points in time (d_{8}). The correct class order gained a minimal class-wise sensitivity of 66.7%, which was achieved for class t_{2}; all other classes achieved class-wise sensitivities of at least 80.0%. This dataset in particular shows that our analysis does not aim at maximising classification performance, but rather at finding the order that outperforms all other orders, independent of a specific performance level, as long as the performance is better than 50%.
Discussion
Ordinal relations between phenotypes are often defined on a semantic level. These relations are assumed to be reflected in a given feature representation without evaluating whether these assumptions hold. It might be the case that independent causes lead to ordinal phenotype characteristics or that the order is not reflected in the chosen feature space because the measured features are not responsible for the observed order.
In this work, we present ordinal classification as an example of a supervised learning task that incorporates semantic relations in the training process of classification models. By constraining the learning process, ordinal classification results in a restricted model class, which is no longer able to separate an arbitrary landscape of classes. This property is used to falsify wrong assumptions on the dependencies of the classes and the chosen feature space.
We provide two theoretical upper bounds on the minimal class-wise sensitivity, which are utilised for accelerating the training of ordinal classifier cascades and allow an exhaustive evaluation of all possible class orders. In this way, ordinal classifier cascades are used as an explorative tool to screen for unknown ordinal dependencies. In our experiments, we give examples for up to 10 different classes, resulting in the evaluation of over 3.6 million class orders. Although our algorithm requires pairwise training of the ensemble members, both the bounds and the algorithm are independent of the chosen type of base classifier and the binary training scheme, and might be transferred to alternative ensembles.
Our experiments on the artificial data showed that sets of candidate orders are only returned if an ordinal structure is reflected in the feature space. If no ordinal sequence is reflected, no cascade passes the bound of 0.5; accordingly, no order was detected for the non-ordinal dataset. Whenever cascades were returned for the artificial ordinal datasets, the correct order and its reverse were found as the dominant orders.
For all datasets, independent of the chosen standard deviation, at least 80% of all candidate cascades could be rejected due to minimal class-wise sensitivities lower than 50%. However, although the procedure can reconstruct the correct class order for all datasets, alternative ordinal class structures might be detected. In our experiments, these alternatives differ from the assumed class order in the position of the last two classes. A reason might be the lower number of constraints for these classes.
For biological applications, we evaluated our method on observable ordinal phenotypes for which a reflection in gene expression levels can be assumed. For three different model organisms, we analysed developmental stages characterised by their morphology (D. melanogaster)^{1}, age (Danio rerio)^{2} and number of C-lineage cells (C. elegans)^{4}. For C. elegans, the sampling time points were also used in the analysis.
Our screening procedure allowed us to reveal ordinal structures within the gene expression profiles of all three model organisms. The hypothesised time relation or its inverse is always included in the set of best-performing cascades. In three out of those four datasets, the hypothesised relation dominates with a performance gap towards all other cascades. This strongly indicates a reflection of these orderings in the profiles.
For the Danio rerio dataset (d_{5}), two cascades rank first before a performance gap, with a swap observed between the two youngest embryo phenotypes. This might be caused either by a data-intrinsic reason, namely that those classes are not distinct enough, as staging by days post fertilisation has been shown to exhibit high variation in growth rate^{28}, or by the technical aspect of the lower number of constraints for later classes.
In contrast to developmental processes, in which the order of stages is tightly regulated by a genetic program^{5,29}, ageing is influenced by multiple factors^{30}. Nevertheless, we obtained comparable results for the dataset that measures age-related gene expression changes. On the human muscle adaptation dataset, 91.7% of all candidate orders were rejected. Among the remaining two candidates, the expected order and only one false positive can be found. For non-ordinal phenotypes, as given by the collection of distinct cancer cell lines, no candidate cascades were observed. This indicates that ordinal relations are not a common phenomenon among multiple phenotypes.
Our method can, however, not only be used to confirm proposed hypotheses but also to explore the feature space for potential ordinal structures. This might become relevant if the relation is not easily accessible due to sampling. In surgery, for example, histologically distinguishable tissue regions can be defined in the same biopsy, or in single-cell experiments various cell types are extracted from one sample. Within these feature spaces, our procedure allows for the screening of total ordinal cascades and additionally of ordinal subcascades embedded in a larger set of non-ordinal classes. It can hence be used to screen for intrinsic molecular ordinal structures and hypothesise relational axes, which might not be detected in a standard multi-class analysis.
Methods
We will use the following notation throughout the description of the methodology behind the algorithm. An object is represented by a vector of real-valued measurements x = (x^{(1)}, …, x^{(n)})^{T} ∈ ℝ^{n}. Each object is assumed to be categorisable into one of \(|{\mathscr{Y}}|\ge 2\) predefined classes \(y\in {\mathscr{Y}}\), where \({\mathscr{Y}}={\{{y}_{i}\}}_{i=1}^{|{\mathscr{Y}}|}\) denotes the space of all class labels. The general classification task will be to identify a function mapping c, a classifier, that allows the accurate prediction of the class labels of new unseen objects, \(c\,:\,{{\mathbb{R}}}^{n}\to {\mathscr{Y}}=\{{y}_{1},\ldots ,{y}_{|{\mathscr{Y}}|}\}.\)
As quality measures, we utilise the conditional prediction rates of c. These estimate the probability of classifier c to predict the class label y_{j} for samples of class y_{i} based on a set of test samples \({{\mathscr{X}}}_{i}\). In its basic version, a conditional prediction rate can be calculated as

$${p}_{c}({y}_{j}|{{\mathscr{X}}}_{i})=\frac{1}{|{{\mathscr{X}}}_{i}|}\sum _{{\bf{x}}\in {{\mathscr{X}}}_{i}}{{\mathbb{I}}}_{[c({\bf{x}})={y}_{j}]},$$

where \({{\mathbb{I}}}_{[\cdot ]}\) denotes the indicator function. Other (re-)sampling strategies might be used for determining conditional prediction rates; however, they will not alter the theoretical characteristics discussed in this work. We distinguish between three types of conditional prediction rates: 1. sensitivities if y_{i} = y_{j} and \({y}_{i},{y}_{j}\in {\mathscr{Y}}\), 2. confusions if y_{i} ≠ y_{j} and \({y}_{i},{y}_{j}\in {\mathscr{Y}}\), 3. external rates if \({y}_{i}\notin {\mathscr{Y}}\) and \({y}_{j}\in {\mathscr{Y}}\).
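The empirical rate described above is simply the fraction of test samples of one class that receive a given label; a minimal sketch (the function name is our own):

```python
def conditional_prediction_rate(classifier, X_i, y_j):
    """Estimate p_c(y_j | X_i): the fraction of samples in X_i predicted as y_j."""
    predictions = [classifier(x) for x in X_i]
    return sum(p == y_j for p in predictions) / len(predictions)

# toy binary classifier on a scalar feature
clf = lambda x: "high" if x > 0.5 else "low"
print(conditional_prediction_rate(clf, [0.1, 0.2, 0.9], "low"))  # 2 of 3 samples
```

Depending on whether the true class of the samples in `X_i` equals `y_j`, belongs to the label space, or not, the same function yields a sensitivity, a confusion, or an external rate.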
While (class-wise) sensitivities and confusions build the standard quality measures of a confusion matrix^{31}, the external rates describe the classifier's behaviour on foreign classes. They will especially be of interest when dealing with different label spaces.
In the basic multi-class classification scenario, a classifier is typically adapted in a data-driven training procedure based on a set of training samples \({{\mathscr{S}}}_{tr}={\{({{\bf{x}}}_{i},{y}_{i})\}}_{i=1}^{|{{\mathscr{S}}}_{tr}|}\). The basic assumption of this scenario is pairwise distinct classes \({y}_{1}\ne \ldots \ne {y}_{|{\mathscr{Y}}|},\) which can be separated in the chosen feature space. In the ordinal classification scenario it is additionally assumed that the labels in \({\mathscr{Y}}\) are totally ordered: \({y}_{(1)}\prec \,\ldots \,\prec {y}_{(|{\mathscr{Y}}|)}.\) In this context, the symbol \({y}_{(i)}\) denotes the \(i\)th class of the order. We utilise the symbol \(\prec \) to indicate that the order relationship is only known for the label space; it is unclear whether this relationship is reflected by the chosen measurements. Nevertheless, ordinal classifiers rely on this assumption. The order of the classes is utilised for guiding the construction of the decision regions and is provided as additional information to the training algorithm.
Ordinal classifier cascades
In the following, we will discuss ordinal classifier cascades of type

$${h}_{i,j}:{{\mathbb{R}}}^{n}\to \{{y}_{(i)},\ldots ,{y}_{(j+1)}\}.$$
The cascade h_{i,j} can be seen as a late-aggregation multi-classifier system^{32}, where the indices \(1\le i\le j < |{\mathscr{Y}}|\) indicate the base classifiers of the corresponding ensemble \( {\mathcal E} =\{{c}_{(i)},\ldots ,{c}_{(j)}\}\). The members of the ensemble are designed for separating two neighbouring classes c_{(i)}:ℝ^{n} → {y_{(i)},y_{(i+1)}}.
An ordinal cascade will be called a full cascade if it is designed for predicting all labels of the label space \({\mathscr{Y}}\). Full cascades will be denoted by \(h={h}_{1,|{\mathscr{Y}}|-1}\). Other ordinal cascades will be called partial cascades.
The fusion strategy of an ordinal cascade can be interpreted as a sequence of logical conjunctions of its base classifiers

$${h}_{i,j}({\bf{x}})=\{\begin{array}{ll}{y}_{(k)} & {\rm{if}}\,{c}_{(k)}({\bf{x}})={y}_{(k)}\,{\rm{and}}\,\forall l\in \{i,\ldots ,k-1\}:{c}_{(l)}({\bf{x}})={y}_{(l+1)},\\ {y}_{(j+1)} & {\rm{if}}\,\forall l\in \{i,\ldots ,j\}:{c}_{(l)}({\bf{x}})={y}_{(l+1)}.\end{array}$$
A scheme of this architecture can be found in Fig. 2. For classifying a sample x, the ensemble members c_{(k)}(x) are evaluated sequentially according to the assumed order of classes. If a base classifier c_{(k)}(x) predicts its first class label y_{(k)}, the procedure stops and h_{i,j}(x) = y_{(k)}. If it predicts class label y_{(k+1)}, the sample is passed to the subsequent base classifier c_{(k+1)}. This fusion scheme implies the following three characteristics of h_{i,j}:

1. Each class y_{(k)}, i < k < j + 1, can be predicted by two base classifiers.

2. The lowest class y_{(i)} can only be predicted by the first classifier c_{(i)}(x). The highest class y_{(j+1)} can only be predicted by the last classifier c_{(j)}(x).

3. A sample x will only be passed to a base classifier c_{(k)}(x), i < k ≤ j, if all its predecessors c_{(l)}(x), l < k, decide for their second class y_{(l+1)}.
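The sequential fusion scheme described above can be sketched directly; a minimal sketch in which the base classifiers are arbitrary callables returning one of their two labels (names and the toy thresholds are our own):

```python
def cascade_predict(base_classifiers, labels, x):
    """Evaluate an ordinal cascade h_{i,j}.

    labels: ordered class labels (y_(i), ..., y_(j+1));
    base_classifiers[k] decides between labels[k] and labels[k+1].
    """
    for k, clf in enumerate(base_classifiers):
        if clf(x) == labels[k]:      # classifier claims the sample: stop here
            return labels[k]
    return labels[-1]                # passed through all base classifiers

# toy example: thresholds on a scalar feature define three ordered classes
c1 = lambda x: "y1" if x < 1.0 else "y2"
c2 = lambda x: "y2" if x < 2.0 else "y3"
print(cascade_predict([c1, c2], ["y1", "y2", "y3"], 1.5))  # -> "y2"
```

Note how the three characteristics emerge from the loop: interior labels can be produced by two classifiers, the first and last labels by only one, and a classifier is only reached if all predecessors passed the sample on.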
Training algorithms for ordinal classifier cascades mainly focus on the training of the base classifiers. In the following, we utilise a pairwise inductive training, in which the training set \({{\mathscr{S}}}_{(k)}\) of a base classifier c_{(k)} consists of the samples of classes y_{(k)} and y_{(k+1)}

$${{\mathscr{S}}}_{(k)}=\{({\bf{x}},y)\in {{\mathscr{S}}}_{tr}\,|\,y\in \{{y}_{(k)},{y}_{(k+1)}\}\}.$$
In a previous study, this type of training was shown to induce the highest susceptibility to incorrect assumptions on the class order^{23}.
Upper bounds on classwise sensitivities
The structural properties of ordinal classifier cascades allow for the construction of upper limits on their empirical class-wise sensitivities. These bounds are based on the training of the cascade's base classifiers and postulated in Theorem 1. Although this theorem is formulated for full cascades, the corresponding bounds can directly be applied to partial cascades.
Theorem 1 Let h denote an ordinal classifier cascade \(h:{{\mathbb{R}}}^{n}\to {\mathscr{Y}}=\{{y}_{(1)},\ldots ,{y}_{(|{\mathscr{Y}}|)}\}\) with base classifiers \( {\mathcal E} =\{{c}_{(1)},\ldots ,{c}_{(|{\mathscr{Y}}|-1)}\}\). Let furthermore \({{\mathscr{X}}}_{(i)}\) be a non-empty set of samples of class y_{(i)}. Then the sensitivity of h for y_{(i)} is limited by

$${p}_{h}({y}_{(i)}|{{\mathscr{X}}}_{(i)})\le {p}_{{c}_{(i)}}({y}_{(i)}|{{\mathscr{X}}}_{(i)}),\quad i < |{\mathscr{Y}}|,$$ (4)

$${p}_{h}({y}_{(i)}|{{\mathscr{X}}}_{(i)})\le \mathop{{\rm{\min }}}\limits_{k < i}\,{p}_{{c}_{(k)}}({y}_{(k+1)}|{{\mathscr{X}}}_{(i)}),\quad i > 1.$$ (5)
Proof. The theorem is a direct consequence of Lemmata 1 and 2 (see Supplementary).
Theorem 1 states that the sensitivities of an ordinal classifier cascade h can be upper bounded by several conditional prediction rates of its base classifiers. For class y_{(i)}, the sensitivity of the cascade is limited by the corresponding sensitivity of its ith base classifier c_{(i)} (Eq. 4). It is also bounded by the predictions of all previous base classifiers c_{(k)}, k < i (Eq. 5). A sample of class y_{(i)} will not be classified correctly, if it is classified as y_{(k)} by c_{(k)}. The sensitivity of the cascade for class y_{(i)} is therefore also limited by the conditional prediction rate of c_{(k)} for predicting class label y_{(k+1)} for samples of class y_{(i)}. A detailed theoretical proof can be found in the Supplementary.
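Stated in code, the bound is a minimum over precomputed rates of the base classifiers; a sketch under the assumptions of Theorem 1 (the data layout and function name are our own, not the published notation):

```python
def sensitivity_upper_bound(i, sens, pass_rates):
    """Upper bound on the cascade sensitivity for the i-th class (0-based).

    sens[k]: sensitivity of base classifier c_(k) for its first label y_(k)  (Eq. 4)
    pass_rates[k][i]: rate at which c_(k) predicts y_(k+1) for class y_(i)   (Eq. 5)
    """
    bounds = [sens[i]] if i < len(sens) else []        # the last class has no own classifier
    bounds += [pass_rates[k][i] for k in range(min(i, len(sens)))]
    return min(bounds)

# three classes, two base classifiers (toy numbers)
sens = [0.9, 0.8]
pass_rates = [[0.05, 0.95, 0.90],
              [0.10, 0.10, 0.85]]
print([sensitivity_upper_bound(i, sens, pass_rates) for i in range(3)])  # [0.9, 0.8, 0.85]
```

The bound never requires evaluating the full cascade: all quantities stem from the independently trained binary classifiers.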
Detection of ordinal class structures
Ordinal classifier cascades can be used for detecting wrong assumptions about the ordinality of the real class structures. Due to their susceptibility, these classifiers will fail when the real feature structures reflect a different class order or no class order at all. In a screening process, ordinal classifier cascades can be used for revealing unknown ordinal class structures. We have proposed the minimal class-wise sensitivity p^{*} of an ordinal classifier cascade as a measure of the correctness of the assumed class order, \({p}^{\ast }={{\rm{\min }}}_{1\le i\le |{\mathscr{Y}}|}\,{p}_{h}({y}_{(i)}|{{\mathscr{X}}}_{(i)}).\) A sensitivity threshold t ≤ p^{*} is used for determining whether an ordinal class structure can be assumed or not. The criterion can be evaluated for each order of the classes in \({\mathscr{Y}}\).
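Given a full confusion matrix of a cascade, the criterion p^{*} is simply the smallest class-wise sensitivity on its diagonal; a minimal sketch (function name ours):

```python
def minimal_classwise_sensitivity(confusion):
    """p*: smallest diagonal rate; confusion[i][j] counts class-i samples predicted as j."""
    return min(row[i] / sum(row) for i, row in enumerate(confusion))

# toy 3-class confusion matrix (rows: true class, columns: predicted class)
conf = [[8, 1, 1],
        [2, 7, 1],
        [0, 1, 9]]
print(minimal_classwise_sensitivity(conf))  # 0.7 (class 2 is the weakest)
```

A candidate order is accepted if this value reaches the chosen threshold t.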
The findings of Theorem 1 allow an alternative evaluation of this criterion. As a direct consequence of Theorem 1, the value of p^{*} can again be upper bounded by conditional prediction rates of the base classifiers

$${p}^{\ast }\le \mathop{{\rm{\min }}}\limits_{1\le i < |{\mathscr{Y}}|}\,{p}_{{c}_{(i)}}({y}_{(i)}|{{\mathscr{X}}}_{(i)}),$$ (6)

$${p}^{\ast }\le \mathop{{\rm{\min }}}\limits_{1 < i\le |{\mathscr{Y}}|}\,\mathop{{\rm{\min }}}\limits_{k < i}\,{p}_{{c}_{(k)}}({y}_{(k+1)}|{{\mathscr{X}}}_{(i)}).$$ (7)
Ordinal classifier cascades that are based on wrong assumptions about the ordinality of the classes can therefore be sorted out by the training of the corresponding base classifiers. A graphical illustration describing this sorting out based on a four class example and dependent on Eqs 6 and 7 can be found in Supplementary Fig. S4.
Coupled to a pairwise inductive training of the base classifiers (Eq. 3), the bounds of Theorem 1 can reduce the complexity of screens for ordinal structures. As the training of a base classifier c_{(i)} is only based on the samples of classes y_{(i)} and y_{(i+1)}, it is no longer dependent on the position of the base classifier within the cascade h. Cascades trained on different orders of \({\mathscr{Y}}\) will therefore consist of common building blocks. The exhaustive training of all \(|{\mathscr{Y}}|!\) cascades, each consisting of \(|{\mathscr{Y}}|-1\) base classifiers, can therefore be accelerated by precalculating and evaluating all possible \((|{\mathscr{Y}}|-1)|{\mathscr{Y}}|\) base classifiers c_{i,j}:ℝ^{n} → {y_{i}, y_{j}}. Note that the symbols y_{i}, y_{j} and c_{i,j} no longer rely on an assumed class order.
In any case, the complexity of the exhaustive evaluation is mainly determined by the training and evaluation complexity of the base classifiers. A comparison of the precalculation scheme and a de novo calculation of all cascades in dependence on the number of classes \(|{\mathscr{Y}}|\) can be found in Table 2. For the presented numbers, we assume an evaluation via a single training-test split. For three classes, the de novo strategy already requires twice the number of base classifier trainings and evaluations compared to the precalculation strategy. For ten classes, the de novo strategy trains more than 3⋅10^{7} base classifiers while the precalculation scheme only demands 90. The number of base classifier trainings and evaluations might be increased by a constant factor if resampling strategies are applied.
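The counts compared above follow directly from the two formulas given in the Introduction; a quick numerical check (function names are our own):

```python
from math import factorial

def de_novo_trainings(n):
    """Each of the n! class orders trains its own n - 1 base classifiers."""
    return (n - 1) * factorial(n)

def precalculated_trainings(n):
    """One binary classifier per ordered pair of classes, shared by all orders."""
    return (n - 1) * n

for n in (3, 10):
    print(n, de_novo_trainings(n), precalculated_trainings(n))
```

For n = 3 this yields 12 versus 6 trainings (the factor of two mentioned above) and for n = 10 it yields 32,659,200 versus 90.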
The following quality measures are needed for the application of Theorem 1:

$${{\rm{FC}}}_{i,j}={p}_{{c}_{i,j}}({y}_{i}|{{\mathscr{X}}}_{i}),\qquad {{\rm{SC}}}_{i,j}(r)={p}_{{c}_{i,j}}({y}_{j}|{{\mathscr{X}}}_{r}).$$

Here, FC_{i,j} denotes the class-wise sensitivity of c_{i,j} for predicting its first class label y_{i}. The term SC_{i,j}(r) denotes the conditional prediction rate of c_{i,j} for samples of class y_{r} that are classified as y_{j}. Both quantities can be precalculated and memorised for all binary base classifiers (Table 3).
The CASCADES algorithm
We propose the recursive enumeration scheme \({\rm{CASCADES}}({\mathscr{Y}},{\mathscr{C}},{y}_{i},t)\) for the exhaustive evaluation of all orders of \({\mathscr{Y}}\) (Fig. 3). It can be seen as a filter routine that rejects ordinal cascades that will not achieve a minimal class-wise sensitivity t ≤ p^{*} according to the bounds of Theorem 1. The remaining cascades are returned as a set of candidates \({\mathscr{C}}\).
The CASCADES algorithm is based on the evaluation of an extended confusion table as shown in Table 3. It replaces the training procedure of each base classifier (training-test split or resampling strategy) by looking up \(|{\mathscr{Y}}|-k\) conditional prediction rates, where k is the base classifier's position within the cascade. As there exist \(\left(\genfrac{}{}{0pt}{}{|{\mathscr{Y}}|}{k+1}\right)\) possibilities of constructing subcascades (prefixes) of k base classifiers, at most \(f(|{\mathscr{Y}}|):=\mathop{\sum }\limits_{k=2}^{|{\mathscr{Y}}|}\,k\left(\genfrac{}{}{0pt}{}{|{\mathscr{Y}}|}{k+1}\right)\) comparisons are required in a worst-case scenario. Nevertheless, this number rapidly breaks down by utilising early stopping criteria.
A single cascade is represented as an ordered vector of class labels \(({y}_{(1)},\ldots ,{y}_{(|{\mathscr{Y}}|)})\in {\mathscr{C}}.\) Each candidate cascade is constructed sequentially and extended by a new class label in each recursive call of the algorithm. The construction stops if the performance measures of the current base classifier fall below the chosen threshold t; in this case the candidate cascade is rejected.
The sequential extension of a partial cascade h_{1,k−1} improves the runtime of the exhaustive search. If the kth (candidate) base classifier does not fulfil the minimal criteria, all full ordinal cascades that utilise h_{1,k} as prefix can be withdrawn. This corresponds to \((|{\mathscr{Y}}|-k-1)!\) full ordinal cascades. A single early stopping will reduce the number of lookups by at least \(f(|{\mathscr{Y}}|-k)\).
The algorithm is initialised with the full set of labels \({\mathscr{Y}}\), an empty set of candidate cascades \({\mathscr{C}}=\varnothing \), the chosen threshold t and y_{i} = ∅. The parameter y_{i} indicates the class label selected in the previous recursion. In each recursion, the class labels \({y}_{j}\in {\mathscr{Y}}\) are tested as possible extensions of the candidate cascades in \({\mathscr{C}}\). If FC_{i,j} ≥ t and \(\forall \,{y}_{r}\in {\mathscr{Y}}:{{\rm{SC}}}_{i,j}(r)\ge t\), the current base classifier c_{i,j} fulfils the bounds on p^{*}. In this case, class label y_{j} is added to the current candidate cascade and removed from the set of remaining labels \({\mathscr{Y}}\). The next base classifier is chosen by a recursive call CASCADES(\({\mathscr{Y}}\backslash \{{y}_{j}\}\), \({{\mathscr{C}}}_{r}\), y_{j}, t). If the current base classifier does not fulfil the minimal criteria, the corresponding (partial) candidate cascades are erased and an empty set ∅ is returned. All suitable candidate cascades are collected at the end of the recursive call. Although the CASCADES algorithm rejects cascades with too low minimal class-wise sensitivities, the remaining candidates are not guaranteed to fulfil the minimal criterion t ≤ p^{*}. Each of the final candidates must therefore be cross-checked by an evaluation of the full cascade. CASCADES can directly be applied for the evaluation of partial ordinal cascades: by replacing the initial set of class labels \({\mathscr{Y}}\) by a subset \({\mathscr{Y}}'\subset {\mathscr{Y}}\), the algorithm will evaluate all orders of the class labels in \({\mathscr{Y}}'\).
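The recursion can be sketched as follows; FC and SC are assumed to be precomputed lookup tables as described above (the dictionary-based interface and the toy rates are our own choices, the published algorithm is given in Fig. 3):

```python
def cascades(labels, fc, sc, t, prefix=()):
    """Enumerate all class orders whose Theorem-1 bounds stay above threshold t.

    labels: set of remaining class labels;
    fc[(i, j)]: sensitivity of binary classifier c_{i,j} for its first label y_i;
    sc[(i, j)][r]: rate at which c_{i,j} labels samples of class y_r as y_j.
    """
    if not labels:
        return [prefix]                      # a complete candidate order survived
    candidates = []
    for y_j in sorted(labels):
        if prefix:                           # check the bounds for the new base classifier
            y_i = prefix[-1]
            if fc[(y_i, y_j)] < t:
                continue                     # early stopping: all extensions are withdrawn
            if any(sc[(y_i, y_j)][r] < t for r in labels):
                continue
        candidates += cascades(labels - {y_j}, fc, sc, t, prefix + (y_j,))
    return candidates

# toy example: three classes on an ordinal line, perfectly separable pairwise
pairs = [(i, j) for i in (1, 2, 3) for j in (1, 2, 3) if i != j]
fc = {p: 1.0 for p in pairs}
sc = {(1, 2): {1: 0.0, 2: 1.0, 3: 1.0}, (2, 1): {1: 1.0, 2: 0.0, 3: 0.0},
      (2, 3): {1: 0.0, 2: 0.0, 3: 1.0}, (3, 2): {1: 1.0, 2: 1.0, 3: 0.0},
      (1, 3): {1: 0.0, 2: 0.5, 3: 1.0}, (3, 1): {1: 1.0, 2: 0.5, 3: 0.0}}
print(cascades({1, 2, 3}, fc, sc, t=0.6))  # [(1, 2, 3), (3, 2, 1)]
```

In this toy setting, only the correct order and its inverse survive the bounds; as in the full algorithm, each surviving candidate would still have to be cross-checked by evaluating the full cascade.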
Datasets
An overview of the characteristics of all datasets can be found in Supplementary Table S1. The datasets d_{5}–d_{9} were collected from the Gene Expression Omnibus (GEO) repository^{33} (GSE13371, GSE47881, GSE2180, GSE32474) and processed using the robust multi-array average (RMA) normalisation as implemented in the affy package^{34}. For d_{4} the processed data was downloaded.
Linear dataset (d_{1})
For a first series of experiments, the centroid of the ith class y_{i} is chosen as \({{\bf{m}}}_{{y}_{i}}={(i,i)}^{T}\). In this way, the class centroids lie on a line.
Curved dataset (d_{2})
For the second dataset, the class centroids were chosen depending on their predecessors: \({{\bf{m}}}_{{y}_{i}}={({m}_{{y}_{i}}^{(1)},{m}_{{y}_{i}}^{(2)})}^{T}={({m}_{{y}_{i-1}}^{(1)}+{u}_{i}^{(1)},{m}_{{y}_{i-1}}^{(2)}+{u}_{i}^{(2)})}^{T}\), where \({u}_{i}^{(1)},{u}_{i}^{(2)} \sim {\mathscr{U}}(0.5,2)\). As a starting point, \({{\bf{m}}}_{{y}_{0}}={(0,0)}^{T}\) was chosen. This dataset has a curved shape.
Non-ordinal dataset (d_{3})
The third artificial dataset is designed to be non-ordinal. The centres of the classes are arranged on a predefined two-dimensional grid in the range [1, 4]^{2} (Supplementary Table S2).
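The two ordinal centroid layouts can be reproduced in a few lines. This is a sketch under the assumption that Python's standard uniform generator is an acceptable stand-in; the grid positions of d_{3} are listed in Supplementary Table S2 and are therefore omitted, as is the sampling of data points around the centroids.

```python
import random

def linear_centroids(n_classes):
    """d1: centroids on a line, m_{y_i} = (i, i)^T."""
    return [(float(i), float(i)) for i in range(1, n_classes + 1)]

def curved_centroids(n_classes, seed=0):
    """d2: each centroid offsets its predecessor by u ~ U(0.5, 2),
    starting from m_{y_0} = (0, 0)^T."""
    rng = random.Random(seed)
    m, centroids = (0.0, 0.0), []
    for _ in range(n_classes):
        m = (m[0] + rng.uniform(0.5, 2.0), m[1] + rng.uniform(0.5, 2.0))
        centroids.append(m)
    return centroids
```

Since every step of d_{2} is at least 0.5 in both coordinates, the centroids increase monotonically along both axes, which preserves the class order along the curve.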
Drosophila melanogaster (d_{4})
The Drosophila dataset generated by Arbeitman et al.^{1} consists of gene expression profiles of the fruit fly Drosophila melanogaster. These profiles comprise 4028 measurements and were collected at different points in time during the life cycle of the model organism. They can be categorised according to the developmental stages of Drosophila melanogaster: \(embryo\prec larva\prec pupa\prec adult.\) Overall, the dataset contains profiles for 31 embryos, 10 larvae, 18 pupae and 8 adults. We use the natural order of the developmental stages as ordinal class labels for our experiment.
Danio rerio (d_{5})
The dataset collected by Toyama et al.^{2} consists of gene expression profiles of the pineal glands of zebrafish (Danio rerio). The expression profiles were collected at five different time points: \(embry{o}_{1}\prec embry{o}_{2}\prec embry{o}_{3}\prec adul{t}_{1}\prec adul{t}_{2},\) where embryo_{1}–embryo_{3} were collected 3, 5 and 10 days after birth, and adult_{1} and adult_{2} were collected at an age of 3 months and 1–2 years, respectively. The dataset comprises 12–15 samples for each class. The age of the samples is used as class order.
Human muscle adaptation (d_{6})
Phillips et al.^{3} compared the transcriptome of human muscle cells before and after 20 weeks of supervised resistance-exercise training (RET). The corresponding dataset consists of paired gene expression profiles. For our experiments, the data was categorised into four classes according to the age (in years) of the participants: \(ag{e}_{1}\prec ag{e}_{2}\prec ag{e}_{3}\prec ag{e}_{4}.\) The class labels denote age intervals of \([20;40)\) years (18 samples), \([40;60)\) years (38 samples), \([60;70)\) years (16 samples) and \([70;80)\) years (16 samples). In order to avoid over-optimistic results, we ensured that the profiles of a subject are never used for training and testing the classifier simultaneously.
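The subject-wise separation can be enforced by splitting on subject identifiers rather than on individual samples; a minimal sketch with made-up identifiers:

```python
def subject_wise_split(sample_subjects, test_subjects):
    """Split sample indices so that no subject appears in both sets.

    sample_subjects -- subject id of each sample, in sample order
    test_subjects   -- set of subject ids reserved for the test set
    """
    train = [i for i, s in enumerate(sample_subjects) if s not in test_subjects]
    test = [i for i, s in enumerate(sample_subjects) if s in test_subjects]
    return train, test

# Paired pre/post-RET profiles share a subject id, so both profiles of a
# subject always land on the same side of the split:
subjects = ["s1", "s1", "s2", "s2", "s3", "s3"]
train_idx, test_idx = subject_wise_split(subjects, {"s2"})
```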
Caenorhabditis elegans (d_{7} and d_{8})
Baugh et al.^{4} analysed the influence of the homeodomain protein PAL-1 on the C-lineage-specific gene regulatory network in the model organism C. elegans. They gathered gene expression data of wild-type embryos and of mutant embryos with additional C blastomeres, as well as of mutants without any C blastomeres. For our experiments we used data of the C-cell-free organisms, taken at 10 points in time after the 4-cell stage of the embryo. We labelled these samples in two different ways. In the first experiment (d_{7}), the samples were labelled according to the developmental stages proposed in the original publication: \(stag{e}_{1}\prec stag{e}_{2}\prec stag{e}_{3}\prec stag{e}_{4}\prec stag{e}_{5},\) where the 0 and 23 minute samples were merged into the stage_{1} class, the 41 and 53 minute samples into stage_{2}, and the samples taken at 66, 83 and 101 minutes after the 4-cell stage into the class stage_{3}. stage_{4} consists only of samples taken at 122 minutes, and stage_{5} contains the time points 143 and 186 minutes. In the second experiment (d_{8}), each point in time was treated as a class of its own: \({t}_{1}\,\prec \,{t}_{2}\,\prec \,{t}_{3}\,\prec {t}_{4}\,\prec \,{t}_{5}\,\prec \,{t}_{6}\,\prec {t}_{7}\,\prec \,{t}_{8}\,\prec \,{t}_{9}\,\prec \,{t}_{10}.\)
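The merging of time points into stages described above amounts to a simple lookup table (minutes after the 4-cell stage mapped to stage labels, as listed in the text):

```python
# d7: stage label for each sampling time (minutes after the 4-cell stage)
STAGE_OF_TIMEPOINT = {
    0: 1, 23: 1,
    41: 2, 53: 2,
    66: 3, 83: 3, 101: 3,
    122: 4,
    143: 5, 186: 5,
}

# d8 instead keeps every time point as its own ordinal class:
TIMEPOINTS = sorted(STAGE_OF_TIMEPOINT)  # t_1 ... t_10
```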
Various cancer cell lines (d_{9})
Pfister et al.^{27} collected gene expression profiles from cell lines derived from 9 different cancer tissue types (breast (15 samples), central nervous system (18 samples), colon (21 samples), leukemia (18 samples), melanoma (26 samples), non-small cell lung (26 samples), ovarian (21 samples), prostate (6 samples), renal (23 samples)). In contrast to d_{1}–d_{8}, the classes of this dataset are not assumed to fulfil an ordinal relationship, as each group originates from a different tissue type: line_{1} ≠ line_{2} ≠ line_{3} ≠ line_{4} ≠ line_{5} ≠ line_{6} ≠ line_{7} ≠ line_{8} ≠ line_{9}.
Data Availability
The Drosophila dataset is available from http://flygenome.yale.edu/Lifecycle/. The other datasets are available from the GEO repository https://www.ncbi.nlm.nih.gov/gds: GSE13371, GSE47881, GSE2180, GSE32474.
References
Arbeitman, M. N. et al. Gene expression during the life cycle of Drosophila melanogaster. Science 297, 2270–2275, https://doi.org/10.1126/science.1072152 (2002).
Toyama, R. et al. Transcriptome analysis of the zebrafish pineal gland. Dev. Dyn. 238, 1813–1826, https://doi.org/10.1002/dvdy.21988 (2009).
Phillips, B. E. et al. Molecular networks of human muscle adaptation to exercise and age. PLOS Genet. 9, 1–15, https://doi.org/10.1371/journal.pgen.1003389 (2013).
Baugh, L. R. et al. The homeodomain protein PAL-1 specifies a lineage-specific regulatory network in the C. elegans embryo. Development 132, 1843–1854, https://doi.org/10.1242/dev.01782 (2005).
Long, C., Li, W., Liang, P., Liu, S. & Zuo, Y. Transcriptome comparisons of multi-species identify differential genome activation of mammals embryogenesis. IEEE Access 99, 1–1, https://doi.org/10.1109/ACCESS.2018.2889809 (2018).
Kearney, P. E., Hayward, R. B. & Meijer, H. Evolutionary trees and ordinal assertions. Algorithmica 25, 196–221, https://doi.org/10.1007/PL00008274 (1999).
Guénoche, A. Ordinal properties of tree distances. Discret. Math. 192, 103–117, https://doi.org/10.1016/S0012-365X(98)00068-5 (1998).
Kannan, S. & Warnow, T. Tree Reconstruction from Partial Orders. SIAM J. on Comput. 24, 511–519, https://doi.org/10.1137/S0097539793252195 (1995).
Buchholz, M. et al. Transcriptome analysis of microdissected pancreatic intraepithelial neoplastic lesions. Oncogene 24, 6626–6636, https://doi.org/10.1038/sj.onc.1208804 (2005).
Feldmann, U. & Steudel, I. Methods of ordinal classification applied to medical scoring systems. Stat. Med. 19, 575–586, https://doi.org/10.1002/(SICI)1097-0258(20000229)19:4<575::AID-SIM357>3.0.CO;2-A (2000).
Bender, R. & Grouven, U. Ordinal logistic regression in medical research. J. Royal Coll. Physicians Lond. 31, 546–51 (1997).
Cardoso, J. S., da Costa, J. F. P. & Cardoso, M. J. Modelling ordinal relations with svms: An application to objective aesthetic evaluation of breast cancer conservative treatment. Neural Networks 18, 808–817, https://doi.org/10.1016/j.neunet.2005.06.023 (2005).
Bishop, C. Pattern Recognition and Machine Learning (Springer, New York, 2006).
Hastie, T., Tibshirani, R. & Friedman, J. H. The Elements of Statistical Learning (Springer, New York, 2001).
Webb, A. R. Statistical Pattern Recognition, 2nd edn (John Wiley & Sons Ltd., Chichester, 2002).
Fix, E. & Hodges, J. L. Discriminatory analysis: Nonparametric discrimination: Consistency properties. Tech. Rep. Project 21-49-004, Report Number 4, USAF School of Aviation Medicine, Randolph Field, Texas (1951).
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and Regression Trees. The Wadsworth statistics/probability series (Chapman and Hall/CRC, Boca Raton, 1984).
Vapnik, V. N. Statistical Learning Theory (John Wiley & Sons Ltd., New York, 1998).
Frank, E. & Hall, M. A simple approach to ordinal classification. In Raedt, L. D. & Flach, P. (eds) Machine Learning: ECML 2001, 12th European Conference on Machine Learning, Proceedings, vol. 2167, 145–156, https://doi.org/10.1007/3-540-44795-4_13 (Springer, Berlin, 2001).
Rivest, R. L. Learning decision lists. Mach. Learn. 2, 229–246, https://doi.org/10.1007/BF00058680 (1987).
Kestler, H. A., Lausser, L., Lindner, W. & Palm, G. On the fusion of threshold classifiers for categorization and dimensionality reduction. Comput. Stat. 26, 321–340, https://doi.org/10.1007/s00180-011-0243-7 (2011).
Meyer, L. H. et al. Early Relapse in ALL Is Identified by Time to Leukemia in NOD/SCID Mice and Is Characterized by a Gene Signature Involving Survival Pathways. Cancer Cell 19, 206–217, https://doi.org/10.1016/j.ccr.2010.11.014 (2011).
Lattke, R., Lausser, L., Müssel, C. & Kestler, H. A. Detecting ordinal class structures. In Schwenker, F., Roli, F. & Kittler, J. (eds) Multiple Classifier Systems (MCS 2015), vol. 9132, 100–111, https://doi.org/10.1007/978-3-319-20248-8_9 (Springer International Publishing, Cham, 2015).
Chang, C.C. & Lin, C.J. LIBSVM: A library for support vector machines. ACM Transactions on Intell. Syst. Technol. 2, 27:1–27:27, https://doi.org/10.1145/1961189.1961199 (2011).
Japkowicz, N. & Shah, M. Evaluating Learning Algorithms: A Classification Perspective (Cambridge University Press, New York, 2011).
Müssel, C., Lausser, L., Maucher, M. & Kestler, H. A. Multiobjective parameter selection for classifiers. J. Stat. Softw. 46, 1–27, https://doi.org/10.1007/1179085329 (2012).
Pfister, T. D. et al. Topoisomerase I levels in the NCI-60 cancer cell line panel determined by validated ELISA and microarray analysis and correlation with indenoisoquinoline sensitivity. Mol. Cancer Ther. 8, 1878–1884, https://doi.org/10.1158/1535-7163.MCT-09-0016 (2009).
Parichy, D. M., Elizondo, M. R., Mills, M. G., Gordon, T. N. & Engeszer, R. E. Normal table of postembryonic zebrafish development: Staging by externally visible anatomy of the living fish. Dev. Dyn. 238, 2975–3015, https://doi.org/10.1002/dvdy.22113 (2009).
Dong, P. & Liu, Z. Shaping development by stochasticity and dynamics in gene regulation. Open Biol. 7, 170030, https://doi.org/10.1098/rsob.170030 (2017).
Rodríguez-Rodero, S. et al. Aging Genetics and Aging. Aging Dis. 2, 186–195, http://www.aginganddisease.org/EN/abstract/article_147277.shtml (2011).
Zuo, Y.-C. et al. Predicting peroxidase subcellular location by hybridizing different descriptors of Chou's pseudo amino acid patterns. Anal. Biochem. 458, 14–19, https://doi.org/10.1016/j.ab.2014.04.032 (2014).
Lorena, A. C., de Carvalho, A. C. P. L. F. & Gama, J. M. P. A review on the combination of binary classifiers in multiclass problems. Artif. Intell. Rev. 30, 19–37, https://doi.org/10.1007/s10462-009-9114-9 (2009).
Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210, https://doi.org/10.1093/nar/30.1.207 (2002).
Gautier, L., Cope, L., Bolstad, B. M. & Irizarry, R. A. affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307–315, https://doi.org/10.1093/bioinformatics/btg405 (2004).
Acknowledgements
The research leading to these results has received funding from the German Research Foundation (DFG, SFB 1074 project Z1, and GRK 2254 HEIST), and the Federal Ministry of Education and Research (BMBF, Gerontosys II, Forschungskern SyStaR, id 0315894 A; e:Med, SYMBOLHF, id 01ZX1407A; conFirm, id 01ZX1708C) all to H.A.K.
Author information
Authors and Affiliations
Contributions
L.L. designed the algorithm and conceived the experiments, L.M.S. and L.R.S. performed data acquisition and conducted the experiments, L.L., L.M.S. and L.R.S. analysed the results, L.L., R.S. and F.S. implemented the algorithm, L.L. and F.S. drafted the manuscript, H.A.K. supervised and guided the study. L.L., L.M.S., L.R.S. and H.A.K. wrote the manuscript. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lausser, L., Schäfer, L. M., Schirra, L.-R. et al. Assessing phenotype order in molecular data. Sci Rep 9, 11746 (2019). https://doi.org/10.1038/s41598-019-48150-z