Multi-classifier prediction of knee osteoarthritis progression from incomplete imbalanced longitudinal data

Conventional inclusion criteria used in osteoarthritis clinical trials are not very effective in selecting patients who would benefit the most from a therapy under test. Typically these criteria select majority of patients who show no or limited disease progression during a short evaluation window of the study. As a consequence, less insight on the relative effect of the treatment can be gained from the collected data, and the efforts and resources invested in running the study are not paying off. This could be avoided, if selection criteria were more predictive of the future disease progression. In this article, we formulated the patient selection problem as a multi-class classification task, with classes based on clinically relevant measures of progression (over a time scale typical for clinical trials). Using data from two long-term knee osteoarthritis studies OAI and CHECK, we tested multiple algorithms and learning process configurations (including multi-classifier approaches, cost-sensitive learning, and feature selection), to identify the best performing machine learning models. We examined the behaviour of the best models, with respect to prediction errors and the impact of used features, to confirm their clinical relevance. We found that the model-based selection outperforms the conventional inclusion criteria, reducing by 20-25% the number of patients who show no progression and making the representation of the patient categories more even. This result indicates that our machine learning approach could lead to efficiency improvements in clinical trial design.


Introduction
Knee osteoarthritis (OA) is a chronic degenerative joint disease characterised by cartilage loss and changes in bones underneath it, causing pain and functional disability. The main clinical symptoms of knee OA are pain and stiffness, particularly after exercise, leading to reduced mobility and quality of life, and eventually resulting in knee replacement surgery. OA is one of the leading causes of global disability in people aged 65 and older, and its burden is likely to increase in the future with the ageing of the population and rise in obesity worldwide [1].
OA is a heterogeneous disease where progression spreads over several years with periods of fast changes and periods of stability. A major challenge in OA drug development is effective selection of patients to the clinical trials. In an ideal case, all selected patients would show disease progression within the trial period, and their response to the drug in trial would be properly assessed. However, identification of patients in need of treatment, that is those with a high probability of progression, is an open problem.
To help analyse knee OA progression, the APPROACH consortium is running a 2-year observational study in 5 clinical centres from 4 European countries. One of the study objectives is to discover new markers of disease progression. The consortium recruits patients from centres with existing OA cohorts, and similarly to clinical trials, is interested in selecting only patients who will progress during the observation period.
The traditional approach to patient selection relies on expert knowledge and typically follows a set of consensus criteria defined by the American College of Rheumatology (ACR), mixed with a presence of limited joint damage (so further progression is possible) and significant pain complaints. When these criteria are satisfied, the patient's disease is expected to progress over time. However, the speed with which this will happen is unknown. This is a problem for clinical trials and short-term studies, like APPROACH, in which the observation time is typically limited to about 2 years.
The main hypothesis of this article is that machine learning can be more effective at identifying progressive patients than the traditional approach. We hypothesise that prediction models trained on historical data will be able to differentiate between patients for whom a fast progression happen during the observation period, and patients who show no progression or progress slowly and should not be selected to trials. Throughout the course of this article we examine different algorithms and learning process configurations, to finally develop predictive models for patient selection that outperform the conventional inclusion criteria used in clinical trials.
To train the models, and verify our hypothesis, we use longitudinal data from two large studies running in parallel in Europe and North America: the Cohort Hip and Cohort Knee (CHECK) study [2], and the Osteoarthritis Initiative (OAI) study [3]. We outline a data preprocessing strategy to handle missing values and different attribute types (Section 2.1.1), and we define four classes of patients using clinically relevant measures of OA progression (Section 2.1.2). We set up the experiments that allow us to estimate the typical performance of a model on out-of-sample instances (Section 2.2) and find the best approach to handle the class imbalance present in the data (Section 2.2.3). We choose the best performing algorithm (Section 3.1) and test several of its multi-model / multi-label variants to further improve performance (Section 3.3). We select the most effective configuration of parameters and train the final models on all data and estimate their performance (Section 3.4). Then we interpret the behaviour of these models, by looking at the individual attributes contribution to the model output, and assess their clinical relevance (Section 3.6). Next, we simulate two patient selection scenarios and compare the best model results against the selection with conventional clinical classification criteria (Section 3.7). Finally, we include a discussion on limitations, the experiment design choices, related literature, and future work (Section 4).

Datasets
The CHECK cohort data used in this article were contributed by the CHECK steering committee (available upon request at http://check-onderzoek.nl/). Specifically, we used the clinical and X-ray image assessment (radiographic scoring and KIDA features [4]) data.
The OAI cohort data used in the preparation of this article were obtained from the Osteoarthritis Initiative (available at http://www.oai.ucsf.edu/). Specifically, we used the clinical, X-ray image assessment (semiquantitative readings and joint space width measurements) and outcomes (knee replacements) data.
Both of these cohorts have been studied for over 10 years, and collected longitudinal data with typically yearly updates. For both cohorts we used the time points between the study baseline and the 8 year follow-up, for which the joint space width measurements (used in class definition, see Section 2.1.2) were available (see summary in Table 1). To maximise the size of the training set, instead of using only the baseline and 2 year follow-up time points, for every patient we used all available periods that were at least 2-year long (some periods were longer than two years, e.g. between CHECK time points 2-5 or 5-8). As a consequence, each instance in the training set represented a period, not a patient. We excluded all periods after a knee replacement, to avoid problems with a change in meaning of some attributes (e.g. pain would no longer be related to the knee but to issues with the prosthesis). As we show in Table 1, both datasets contain a relatively large number of attributes and a small number of patients, together with a large proportion of missing values. This introduces a challenge to the machine learning algorithms and we tried to improve this balance with additional preprocessing steps (see below).

Preprocessing
We dropped all attributes with more than 50% missing values and all periods with over 40% missing attributes. These thresholds are quite conservative, as we tried to retain as much data as possible. We also dropped all attributes that did not vary across instances (i.e. had just a single non-null value), and thus were not useful in distinguishing between the classes. Finally, we removed attributes that could be exploited by the model, such as dates, visit numbers, barcodes and patient and staff IDs.
As the CHECK cohort is one of the recruitment sources for the APPROACH consortium, we spent extra time analysing the reasons behind the missing values and fixing them were possible. We filled forward values from the most recent time point, for attributes which values cannot change in the future (e.g. past diseases), and used a default value in place of a missing one where this was a reporting convention (e.g. for presence of rare disorders).
For both datasets, we assumed that all attributes with at most 10 different values are categorical. For CHECK, we additionally went through the cohort variable guide and manually identified ordinal and continuous attributes. This step was not practical for OAI, as its variable guide has almost 4000 pages.
We performed additional preprocessing during the model training. We imputed missing values, using only the values found in the training set (to avoid information leaks from the test set). We performed the imputation with the mode/mean value (for categorical/continuous attributes). We briefly tried other methods (cluster centroids, a vote of nearest neighbours), but as they did not produce better results, we settled for the simplest method.
The final step after imputation was the one-hot encoding of nominal attributes. That is, their replacement with dummy attributes, of which only one is "hot" at a time (set to 1, while others are zero). We encoded all categorical attributes with more than 2 distinct values, unless they were known to be ordinal.

Class definition
The APPROACH consortium decided to use similar patient categorisation to the OAI-based FNIH biomarker study [5], but defined more broadly and bounded in the observation time to 2 years. Patients were split into non-progressive (N), pain (P), structure (S), and combined pain and structure (P+S) progressive categories.
To define the categories, the consortium relied on the measures of pain symptoms and structural damage at the beginning and at the end of a period. Pain was measured using the pain subscale from the WOMAC self-report questionnaire [6], which includes perceived level of pain during 5 different activities: walking, using stairs, in bed, sitting or lying, and standing upright. Structural progression was measured using radiographic readings of minimum joint space width (JSW) across both lateral and medial femorotibial compartments of the knee.
The exact definitions of the categories are given below: • S period -a minimum total JSW must decrease by at least 0.3mm per year.
pain increase of at least 5 WOMAC points per year (∆p ≥ 5) on 0-100 scale, pain at the end of a period must be substantial (p e ≥ 40), for a rapid pain increase (∆p ≥ 10), end pain can be lower (p e ≥ 35), sustained pain must be substantial at both the start and the end of a period (p s ≥ 40 ∩ p e ≥ 40).
For each period, the most affected knee (with greater ∆JSW) and maximum pain (if reported for both knees) were used in the calculation of progression. When we could not measure the progression due to missing values, we excluded the period. This way, the class definition was never based on imputed numbers.
When we applied the above definitions to the datasets, we obtained imbalanced class distributions strongly skewed towards non-progressive periods (see Table 2).

Experimental setup
All experiments were performed using the scikit-learn library [7] and its implementation of the machine learning algorithms. In data preprocessing, analysis and generation of statistics, we used pandas [8], NumPy [9] and SciPy [10]. For data visualisation, we used seaborn [11] and Matplotlib [12].

Measures of performance
The problem of patient selection is similar in its nature to a well-studied task of document retrieval. In this task, the rare relevant documents are mixed with large number of unrelated ones, and the goal is to retrieve a maximum number of relevant documents with the best possible precision. So what matters most, is the method performance on the relevant documents. We, in a similar fashion, are trying to identify the relevant patients who best fit the goals of the study.
The performance in information retrieval is typically measured using the F 1 score, which is a harmonic mean of precision and recall, designed as a measure of classifier performance in presence of rare classes. Precision is the probability that a (randomly selected) retrieved document is relevant. Recall is the probability that a (randomly selected) relevant document has been retrieved. In medical literature precision is known as positive predictive value and recall is equivalent to sensitivity. Although F 1 score has been originally defined for binary classification problems, it can be extended to a multi-class case, by averaging F 1 scores for each class. Throughout this article we use weighted average F 1 score, with weights depending on the instance frequency for each class (to take into account the class imbalance).
See Section 4 for more detailed arguments behind the choice of the performance measure.

Cross-validation
In all experiments we used out-of-sample estimation of the algorithm performance. That is, we kept some of the instances hidden from the algorithm during training, and used them later as an independent test set. Specifically, we followed the standard 10-fold stratified cross-validation (CV) protocol, in which the instances are split into 10 approximately equal-sized parts (folds) and the split preserves the overall class distribution within each fold. Each fold is then used in turn as a test set, and the remaining 9 folds are used as a training set. To score the method performance, rather than averaging the scores across all 10 folds, we pool the out-of-sample predictions together and use it to calculate a single score.
The cross-validation is repeated 10 times with different partitions into folds. As some of the machine learning algorithms are not deterministic, we also repeat the model training (25 times) with different random seeds (the seeds remain constant across folds and cross-validation repeats). We report typical performance of a configuration (algorithm + parameters), as a median score amongst the cross-validation repeats, where the score for each repeat is the median across all trained models.

Class imbalance
To test how well different machine learning algorithms can learn from the data, we initially simplified the problem to a case of balanced classification through down-sampling. We fixed the size of the classes to 150 for CHECK and 600 for OAI, and drew 11 different random samples of 600/2400 instances. For each sample we performed repeated cross-validation (as described in the previous section) using for each fold a fixed-size test set, and a subset of the training set of increasing size (10%, 20%, . . . , 100%), to obtain a learning curve.
We tested six machine learning algorithms with the default parameters: • logistic regression (using one-vs-rest scheme), • multinomial logistic regression using cross-entropy loss with L-BFGS solver, • k nearest neighbours classifier (kNN) using KDTree (default k = 5), • support vector classifier (SVC) using one-vs-rest scheme with linear kernel, • support vector classifier using one-vs-rest scheme with the Radial Basis Function (RBF) kernel • random forest (with 100 trees (default in scikit-learn 0.22)).
For scale-sensitive algorithms (SVC and kNN) all attribute values in the training set were scaled to the [0, 1] range.
At this stage (see Section 3.1), random forest was the best performing algorithm and we focused our further experiments on it. Random forest can be made cost-sensitive by incorporation of class weights to penalise the misclassification of the minority classes (as the weights influence the node split criteria). The cost-sensitive learning is an alternative to up/down sampling techniques that does not introduce artificial instances (as with up-sampling of the minority classes) and does not lose information (as with down-sampling of the majority classes). And specifically for random forest, the algorithm creators have demonstrated that the weighted variant performs better on imbalanced data, than on up/down-sampled ones [13].
To test the difference in performance between the cost-sensitive and the balanced learning, we first performed a repeated cross-validation (as before) using a full imbalanced dataset while incrementally increasing the training set size. Then we kept the imbalanced test sets unchanged, and down-sampled each of the imbalanced folds used to form the training set, to obtain a balanced training set that does not overlap with the imbalanced test set. We repeated this procedure 11 times with different sampling seeds. In the cost sensitive variant, we used weights inversely proportional to the class distribution in the full dataset.
The rationale behind this process is that regardless of the different training sets, the test sets have to remain the same in all cross-validation rounds, so that the performance scores obtained by the two strategies are truly comparable. With experiments set up this way, we are able to examine whether a larger training set is more important to performance than the class balance.

Multi-model methods
As we are trying to solve a multi-class problem, where the class labels are a combination of two clinical criteria (see Section 2.1. 2), we have tested multi-model and multi-label strategies to further improve the performance of random forest. In particular, we first tested (1) a one-vs-rest scheme, in which a combination of 4 independent models is used, each trained to discriminate one class from the rest, and (2) a multi-label classification [14], in which a single model is trained to assign P and S labels independently (rather than to predict the class) that are later mapped to 4 classes. Finally, we combined the two strategies to create (3) a duo classifier that uses two independent models, each trained to predict a single label (P or S). We implemented this classifier as a wrapper class on top of the random forest algorithm that predicts one of the 4 class labels, but at the same time, provides independent P and S probabilities for each instance.
Because we tried multiple models, cross-validated performance of the best configuration is an optimistically biased estimate of the performance of the final model trained on all data. This "multiple induction" problem is conceptually equivalent to multiple hypothesis testing in statistics. To estimate the unbiased performance of the final model, we used a recently proposed bootstrap-based BBC-CV protocol [15]. It is a computationally efficient alternative to the popular nested cross-validation procedure and provides good bias estimation for datasets with 100+ instances.
BBC-CV uses the out-of-sample predictions to (1) select a configuration with best performance on a bootstrapped sample of instances, and (2) score the performance of the selected configuration on the out-of-bootstrap instances only. The returned performance estimate is the average out-of-bootstrap score over all bootstrap iterations.
As we repeat each cross-validation 10 times, we used the most robust variant of the protocol -BBC-CV with repeats. It includes in the estimate the results from all CV-repeats, which reduces the variance introduced by the random partitioning into folds. The number of bootstraps in the protocol was set to 1000.

Recursive feature elimination
To test if a reduced set of features can lead to better performance, we added an inner 3-fold cross-validation loop that selects the best subset of features to use in model training. The inner loop operates on the training folds only. It starts from a full set of features and eliminates the worst, one by one, until only one feature is left. Then a subset of features that maximises inner cross-validation score is selected and used to train the model on the full training fold.

Model interpretation
As each tree in the random forest votes for a class label, it is possible to count how many times each of the features have contributed to the final decision and estimate the feature importance. The problem with the feature importance determined in this way, is that it treats all splits in a tree equally, while the early, close to the root splits, tend to have the most impact.
Therefore, we decided not to use the feature importance provided by the random forest, but to examine each tree using the TreeExplainer class from the SHAP module [16,17]. It provides consistent and locally accurate (per prediction) estimates of feature influence on the model output. It combines ideas from game theory (Shapley sampling values) [18] and local explanations (LIME method) [19] and goes beyond the impact magnitude, providing information on the direction of the influence (probability boost/reduce) in relation to the feature low/high values.

Comparison to the conventional inclusion criteria
To simulate conventional inclusion decisions, we used a logical conjunction of the following three criteria: (1) a combination of the ACR clinical classification criteria for knee OA [20], (2) the Kellgren & Lawrence grade of OA severity [21,22] between 1 and 3 (inclusive), and (3) pain complaints resulting in at least 40 points score on the WOMAC questionnaire. We applied the variant of ACR criteria that uses history, physical examination and radiographic findings. It requires presence of (1) pain in the knee and (2) one of: age over 50, less than 30 minutes of morning stiffness, crepitus (crackling noises) on active motion and osteophytes. We assumed the criteria are satisfied if one of the knees satisfy them.
To simulate selection with machine learning models we used two scenarios: ML-L (based on class labels) and ML-P (based on class probabilities). Both scenarios were based on predictions made by the best configuration of the duo classifier, specifically the median score model from the median cross-validation repeat.
In the ML-L scenario, we selected all instances classified as progressive (predicted to belong to the P, S, or P+S class). This scenario simplifies the task to a binary classification, and makes it comparable to the binary decision made using the conventional inclusion criteria.
In the ML-P scenario, for a more direct comparison, we selected the same number of instances as obtained with the conventional criteria. We used the progression probabilities p(S) and p(P ) returned by the model to three-way sort the instances (in a descending order) by p(P ) + p(S), p(S), and p(P ). Then we selected 1 /3 of instances from each sorted group (to obtain balanced representation), in that exact order, disregarding the duplicates.

Comparison of algorithms on balanced subsets
In the initial experiments on balanced subsets, the best performing algorithm was the random forest. For the CHECK dataset, the other algorithms were competitive only at small training set sizes, and otherwise were trailing 10% and more behind (see Figure 1a). For the OAI dataset, logistic regression and SVC with the RBF kernel were closer, but on the other hand, the performance gap between random forest and the linear SVC or multi-modal regression was as large as 20% (see Figure 1b).  variance in model performance. Secondly, the typical (median) learning curve on the full set had a higher performance at every training set size compared. The difference was especially large in case of the OAI dataset (about 20% in relative numbers). Therefore, in all subsequent experiments we used the full imbalanced training set and the cost-sensitive learning.

Performance of multi-model methods
Figures 3a and 3b compare the performance of multi-label and multi-model strategies, to a single model 4-class random forest (indicated as "single"). Although all the strategies to some degree improved over the single model, the overall performance gain was minor, especially in case of the multi-label and one-vs-rest strategies. The duo classifier emerged as the best option, achieving a median F 1 score improvement of about 2% for CHECK and 1% for OAI. As a result, in subsequent experiments we used the duo classifier.     A general conclusion is that above 400 trees the improvement in performance is very small, and a difference in the maximum tree depth has the largest impact on the score. However, random forest is not over-training easily with more trees, and more trees can be useful (even if they do not improve performance), as they improve the  reliability of the feature importance estimates. On the other hand, with increased depth and larger trees, their interpretability decreases and there is more potential for over-fitting.

Random forest parameter tuning
In subsequent experiments we used the best performing configuration with lowest median absolute deviation, preferring lower depth and less trees in case of ties, in particular: {800 trees, depth 9, entropy criterion} for CHECK, and {1000 trees, depth 10, gini criterion} for OAI.
The expected performance (F 1 score) of the final models trained on all data, estimated with the Bootstrap Bias Corrected Cross-Validation protocol (BBC-CV), was 0.584 -95% CI (0.560, 0.609) for CHECK, and 0.689 -95% CI (0.680, 0.698) for OAI. For both datasets, the estimate is the same (with respect to rounding) as the score of a typical run of the best configuration (median of median runs for each CV-repeat). Table 3 summarises the results of experiments with the recursive feature elimination (RFE) procedure. As the table shows, the use of reduced set of features did not improve the model performance. Its median score was about 2% lower compared to configurations using all features. We counted the frequency with which each feature was selected (out of 100 selection rounds = 10 repeats * 10 folds). For CHECK only minimum JSW (left/right knee), WOMAC pain, WOMAC function, WOMAC total and height of the medial eminence (left/right) were selected 100% of the time (see Figure 10 in Appendix). For OAI this subset was much larger, 181 features were selected every time, therefore not much can be learned there.  We report the median model score for a median CV-repeat and the 95% confidence interval around it (from binomial distribution). For the size of the selected subset of features, we report a range across all CV-repeats.

Feature selection experiments
The advantage of using the model with a subset of features might be in model interpretation, which should become easier, especially with a median of 12 features for CHECK (see Figure 11 in Appendix). It is also an advantage from the clinical perspective, as data collection is costly and sometimes less measurements could be preferred over slightly better performance. However, it would not help much in case of the OAI models, where the median number of selected features was almost 20 times higher (see Figure 12 in Appendix).
The key issue with our RFE experiments is that the performance gains made at the inner cross-validation level did not generalise to the outer level. It is possible that an inner cross-validation with more than 3 folds could identify a more competitive subset of features, but this will come at the expense of substantial extra computational effort. The longest reported RFE experiment already took over 200 CPU days on our HPC cluster (using Intel Xeon E5-2690 v2) and due to the sequential nature of the procedure, its parallelisability is limited. For the P sub-predictor, the four most impactful features are the WOMAC scores (3 sub-scores and the total score). They all reduce the probability of assigning the P label if their value is low and boost that probability if their value is high (see left panel of Figure 6). An example of an opposite direction of influence can be seen for the rfys feature (physical functioning from the SF-36 health survey), where higher values indicate a better health status.  For the S sub-predictor, some of the most impactful features are pain related: DIRKN6 -pain level while walking in the last 7 days (part of the WOMAC questionnaire), and P7RKACV -knee pain severity in the last 7 days. But there are several impactful radiographic features as well, such as: JSW175 -medial JSW at x = 0.175mm, MCAJSW -average medial JSW, or MCMJSW -the minimum medial JSW. In the top 3, we can also find GLCFQCV -glucosamine frequency of use in past 6 months (glucosamine is a popular supplement used by OA patients).

Features impact on model output
A few features in the top make much less sense: KIKBALL -leg used to kick a ball, DFUCOLL -difference in minutes between baseline and follow-up urine collection times, or IMPIXSZ -radiograph pixel size used in conversion to millimetres. This might be a sign of attribute exploitation, as with large number of attributes in OAI and not so many instances, the model might be finding dataset specific patterns, rather than discovering general rules, and perhaps these attributes should be removed from the dataset. Nevertheless, even if taken alone the contribution of a feature is difficult to explain, it might be useful in interaction with other features, e.g. KIKBALL_3.0 indicates a person is ambipedal (has no dominant leg), which might trigger the use of radiographic features from both knees.

Simulated patient selection
We performed a selection from both datasets using the conventional clinical criteria, and compared that to two selection scenarios based on predictions of the best machine learning models: ML-L using the class labels, and ML-P using the class probabilities. In the simpler ML-L scenario, we selected all instances predicted not to be in the non-progressive class (N). In the more refined ML-P scenario, we selected equal number of instances most likely to be in the P+S, S or P class. Tables 4 and 5 summarise results of the selection with the conventional criteria and the ML-L selection scenario. The comparison between the two revealed several issues with the conventional criteria. Firstly, the retrieval of progressive periods was low (18% in total) for both CHECK and OAI, especially in the S category (only 7%). Secondly, the selection focused primarily on the P category, resulting in approximately half of the progressive periods from there. On the other hand, as desired, the percentage of retrieved non-progressive periods was low (5% for CHECK and 7% for OAI).  Table 4: Subset of CHECK periods selected by the conventional clinical criteria and the ML-L scenario. The number of total instances of each category is reported next to the class name. For each category we report an absolute and relative number of included instances, and a recall percentage (how many instances of that category have been retrieved). The "not N" column shows the summarised recall percentage for all progressive instances.
The ML-L selection scenario retrieved over 2 times more progressive periods (≈ 45% in total). In the S category the retrieval was 5 times higher than the conventional criteria result. The balance between the categories has  Table 5: Subset of OAI periods selected by the conventional clinical criteria and the ML-L scenario. The number of total instances of each category is reported next to the class name. For each category we report an absolute and relative number of included instances, and a recall percentage (how many instances of that category have been retrieved). The "not N" column shows the summarised recall percentage for all progressive instances.
improved for CHECK where P and S categories only differed by 2 p.p., but not for OAI, where the S category became dominant. Overall, we see that our machine learning models were less conservative (i.e. have made more non-N predictions) than the conventional criteria, which resulted in retrieving more progressive instances, at the cost of incorporating higher relative percentage of non-progressive ones.
Although in the ML-L scenario, the machine learning had some advantages in recall levels over the conventional criteria, it selected a larger number of non-progressive instances. It also selected 2.5-3 times more instances overall. To make a more direct comparison, in the ML-P scenario we selected the same total number of instances as obtained with the conventional criteria. The selection prioritised the instances more likely to progress and directly used the probabilities provided by the classifier. Table 6 shows the results of the ML-P selection scenario. Not only did it reduce the number of non-progressive instances compared to the conventional criteria (by ≈ 20% for CHECK and ≈ 25% for OAI), but it also increased the balance between the progressive categories (boosting selection from S and P+S, while reducing the bias towards P).  Table 6: Comparison between selection with conventional clinical criteria and the ML-P scenario.

Choice of performance measure
In this work, inspired by the similarity of the patient selection problem to the task of document retrieval, we decided to measure the classification performance with F 1 score. Here we would like to briefly discuss the alternative measures and list main reasons why we did not choose them instead.
In medical research the most common measure is area under the ROC curve (AUC). It is only defined for binary problems and is typically used in cases vs. controls analysis. Its generalisation to multi-class problems, M-score, has been proposed by Hand and Till [23] but it shares many weaknesses with AUC. Hand himself criticised the use of AUC for model comparison. He not only pointed out problems with comparison of the crossing ROC curves (where difference in AUC creates false impression that one curve dominates the other), but also demonstrated the measure incoherence [24] (AUC evaluates different classifiers with a different metric, as it depends on the score distributions, which depend on the classifier). Hand proposed H-measure as a replacement for the AUC, but it has been only defined for binary problems.
Matthew's Correlation Coefficient (MCC) is another measure of binary classification performance [25] that has been extended to handle multi-class problems [26]. Its main merit is in taking into account true negatives (accuracy or F 1 do not), which makes MCC especially useful when negative examples are the minority. Unfortunately, this is not the case in the patient selection task.
Measures based on the error matrix (like F 1 score or MCC), do not take into account the distance in the class probability space (they treat every mistake the same, regardless of its scale). There are several measures that do, each with its own problems. Area under the precision-recall curve (AUPRC) does not generalise to a multi-class case. Log-loss or the Brier distance can handle multi-class problems, but they do not address the class imbalance directly. Perhaps the patient selection task would benefit from a dedicated measure of performance designed to align with the specific recruitment requirements.

Limitations and future work
A clear limitation of the experiment design, revealed in Section 3.6, was the weak preprocessing strategy for this OAI dataset. We did not identify the ordinal attributes and therefore we applied one-hot encoding to every categorical attribute regardless of its semantics. A similar problem repeated for the continuous attributes with low number of unique values, which were treated as categorical and unnecessarily encoded. This led to a construction of less general decision trees, with splits relying on specific attribute values (rather than value ranges), and made the model less trustworthy from a clinical point of view.
A related issue is the clinical relevance of the features the model relies on. It is inevitable that some of the features will be exploited to make shortcut decisions, despite not representing any real knowledge. For that reason, it is important to look "inside" the models and iteratively refine the data representation in the training set, to gradually eliminate the potential for misuse. But this process is not trivial, as models can use hard to explain features (indirectly associated with progression) as a proxy for what is not directly observed. Although we eliminated some of the feature misuse already (e.g. our first OAI models were misusing the image barcodes), still more work needs to be done in this regard, involving further dialogue with the domain experts.
In terms of further improvement of the model performance, it might be possible to achieve better results if the configuration of parameters used to train the duo classifier is not shared between its sub-classifiers. That is, each of the sub-classifiers could have been tuned separately (even with dedicated feature elimination procedure) to maximise its individual performance. Whether that would lead to a better overall performance is a matter of experiment, as it might as well increase the risk of over-training.
Although the results of the probability-based selection are very encouraging, it would be difficult to immediately implement this approach in clinical practice. Mainly because the patients' data are usually collected on a rolling basis (over the course of several months) rather than all at once, and that makes a single selection step impractical. Therefore, an extension to a multi-step approach is needed with which selections on small batches of patients could be performed without sacrificing the overall selection quality.

Related work
Although several long-term OA clinical studies have been completed and their outcomes analysed in detail, very little research has been done on improving the patient selection process. To our best knowledge, this work is a first attempt at building machine learning models that can compete with the established clinical practice.
Our approach differs from most of the analyses found in the literature in two important ways. Firstly, it does not focus on determining the risk factors, but on the prediction of the disease progression. Secondly, it defines the progression within a strict time window and targets the change in fine-grained radiographic measurements (JSW), rather than just a categorical difference in the KL/JSN grade.
Most of the previous works do not focus on disease progression, but analyse OA incidence instead, where a patient can either be diagnosed with OA (typically when KL grade ≥ 2) or be "OA free" (when KL grade ≤ 1). The incidence of disease is then defined, as a change in diagnosis of the same knee between the baseline and the follow-up visit, and is analysed with statistical methods to determine the risk factors (usually odds ratios with univariate analysis of variance, or multivariate logistic regression). Some authors go a bit further and test the logistic regression models on a binary classification task (cases vs. controls) [27,28,29] hand-picking the input variables. However, as Jamshidi et al. point out in their recent perspective article [30], very few authors reach beyond statistical analysis and build machine learning models.
Yoo et al. [31] trained an artificial neural network with 7 inputs and 3 hidden layers to directly predict the KL grade, obtaining AUC > 0.8. However, they only focused on discriminating between KL grade levels at baseline, rather than trying to predict future disease progression. Similar results were obtained with random forest by Minciullo et al. [32] who were able to discriminate between cases and controls with AUC > 0.85, but in the prediction task (same cohort, OA incidence after 84 months) achieved a much lower score (≈ 0.6). Better OA incidence prediction (AUC > 0.8) was reported by Lazzarini et al. [33] who used random forest with an iterative feature elimination heuristic (RGIFE).
These results are not directly comparable, as the models were trained on data from different cohorts. As a consequence, the models operated on a different input, and used inconsistent definition of the outcome (the OA incidence was defined over a period of varying length: 10 [29], 7 [32], or 2.5 [33] years). Moreover, due to the AUC measure incoherence discussed earlier, any comparison between these models would be, at most, approximate.
When it comes to the definition of the progression used in this article, in many aspects it is similar to the definition used by the FNIH OA Biomarkers Consortium (e.g. [5,34]). They likewise defined four categories of patients (N, P, S, and P+S) based on the change in WOMAC/JSW over time, but flexibly allowed the progression to happen at 2, 3 or 4 year follow-up. In contrast to our fixed 2 year time period, this does not select for a fast progression. Furthermore, the analysis performed in these works, is again focused on the risk factors only. In the best case, a test of discriminatory power is performed (without correcting for overfitting) but no independent prediction is attempted. Notable exception is the work by Hafezi-Nejad et al. [35] who used a small artificial neural network with 10 inputs and 1 hidden layer to predict the joint space loss, and with a single training/test set random split and 100 runs, obtained an average AUC of 0.669.

Conclusions
We developed multi-class classification models for patient selection, using data from two existing long-term knee OA studies (CHECK and OAI). Specifically, we focused on predicting the disease progression within a short time window typical for clinical trials.
To find a best performing model we tested multiple algorithms, learning parameter configurations and feature selection techniques. Our best models were obtained with multi-label classification approach, in which a custommade "duo" configuration of two sub-models was separately predicting the pain and structure related labels. Compared to conventional clinical criteria, the selection based on probabilities returned by the best models reduced the number of retrieved non-progressive instances by 20-25%. It also improved the balance between retrieved progressive categories by reducing the selection bias towards the progression on pain.
The overall outcome of the proposed machine learning strategy for the prediction of the OA progression, is that models trained on historical data can improve the effectiveness of patient selection and have a potential to make the clinical trials more efficient. This strategy is currently being used in an actual recruitment process for the APPROACH project, and in two years' time we will be able to assess its real impact.

Contributions
This section describes the roles of all contributors (whether formally listed as autors or named in acknowledgments) using the CRediT taxonomy [36].

Disclaimer
This communication reflects the views of the authors and neither IMI nor the European Union and EFPIA are liable for any use that may be made of the information contained herein.