Multimodal Machine Learning-based Knee Osteoarthritis Progression Prediction from Plain Radiographs and Clinical Data

Knee osteoarthritis (OA) is the most common musculoskeletal disease without a cure, and current treatment options are limited to symptomatic relief. Prediction of OA progression is a challenging and timely issue, and it could, if resolved, accelerate disease-modifying drug development and ultimately help to prevent millions of total joint replacement surgeries performed annually. Here, we present a multi-modal machine learning-based OA progression prediction model that utilises raw radiographic data, clinical examination results and previous medical history of the patient. We validated this approach on an independent test set of 3,918 knee images from 2,129 subjects. Our method yielded an area under the ROC curve (AUC) of 0.79 (0.78–0.81) and an Average Precision (AP) of 0.68 (0.66–0.70). In contrast, a reference approach based on logistic regression yielded an AUC of 0.75 (0.74–0.77) and an AP of 0.62 (0.60–0.64). The proposed method could significantly improve the subject selection process for OA drug-development trials and help the development of personalised therapeutic plans.


1. GBM model that uses Age, Sex, Body-Mass Index (BMI), total Western Ontario and McMaster Universities Arthritis Index (WOMAC) score, and injury and surgery history (model S1).
2. Model S1 with the addition of a KL grade (model 4 in the main text).
5. Model 6 in the main text.
6. Model 7 in the main text.
The experiments were conducted as follows. As mentioned previously, we leveraged the existing fJSW measurements for the data from our train set (OAI). The MOST dataset was not used, as fJSW measurements are not available for it. To simulate independent testing, we kept one data acquisition site out in an external cross-validation loop and trained our model exactly as described in Methods, using the remaining data and 5-fold cross-validation. After training was finished, we performed prediction on the data acquisition site that was kept out in the external cross-validation loop and computed the performance metrics. This procedure was repeated for every data acquisition site in the OAI dataset (5 sites in total), and we eventually averaged the results across the data acquisition sites. The results of the experiment are presented in Supplementary Table S1.
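The leave-one-site-out loop described above can be sketched with scikit-learn's `LeaveOneGroupOut` splitter. This is a minimal illustration, not the study's actual pipeline: the features, labels, site assignments and the GBM configuration below are placeholders.

```python
# Hypothetical sketch of the leave-one-site-out evaluation described above.
# Synthetic data stands in for the OAI features, labels and acquisition sites.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))        # placeholder for age, sex, BMI, WOMAC, etc.
y = rng.integers(0, 2, size=500)     # placeholder progressor / non-progressor labels
sites = rng.integers(0, 5, size=500) # 5 data acquisition sites, as in OAI

logo = LeaveOneGroupOut()
aps = []
for train_idx, test_idx in logo.split(X, y, groups=sites):
    # Train on 4 sites; evaluate on the held-out fifth site.
    model = GradientBoostingClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[test_idx])[:, 1]
    aps.append(average_precision_score(y[test_idx], p))

print(f"AP: {np.mean(aps):.2f} +/- {np.std(aps):.2f}")  # mean and SD across sites
```

Averaging per-site scores (rather than pooling predictions) matches the reporting in Supplementary Table S1, where the mean and standard deviation across sites are shown.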
Supplementary Table S1 shows that the model which performed the best in the main experiments (model 7) also outperformed all the models that included fJSW measurements (models S1, S2 and S3). A secondary observation from this experiment on the OAI dataset is that the performance of all the methods differed from that on the MOST dataset. We point out that these datasets differ (e.g. in the percentage of progressors, see Table S3); therefore, the performance between them cannot be compared directly. Despite this, all the conclusions of our study still hold, as shown in Supplementary Table S1.

Optimal Train Dataset Size
In this experiment, we investigated the relationship between the performance of our Convolutional Neural Network (CNN) on the test set and the size of the training data. Specifically, we sampled 400, 800, 1600 and 3200 knee images from the train set so that each sample has exactly the same distribution of progressors and non-progressors. Subsequently, we trained our CNN exactly as described in Methods and evaluated Average Precision on the test set. These results are shown in Supplementary Figure S3. From this figure, it can be observed that the performance of our model on the test set increases with the amount of training data.
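The class-balanced subsampling above can be reproduced with stratified splitting. The sketch below uses synthetic labels as a stand-in for the actual progressor annotations; it only demonstrates that each subset preserves the progressor ratio of the full train set.

```python
# Hypothetical sketch of the stratified subsampling for the train-set-size experiment.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5000)  # placeholder progressor labels
indices = np.arange(len(labels))

subsets = {}
for n in (400, 800, 1600, 3200):
    # stratify=labels keeps the progressor / non-progressor ratio fixed
    subset_idx, _ = train_test_split(
        indices, train_size=n, stratify=labels, random_state=0)
    subsets[n] = subset_idx

for n, idx in subsets.items():
    print(n, round(labels[idx].mean(), 3))  # ratio is stable across subset sizes
```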

Feature Importance of the Second-Level Model
We utilized two techniques to gain insight into the contribution of each of the factors used in models 6 and 7 in the main text to the final decision. Specifically, we used the SHapley Additive exPlanations (SHAP) technique 2 to explore the feature importance on the test set. We also used the relative predictor importance information naturally available from the GBM after training 3.
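The second of these techniques can be illustrated with the importances a tree-based GBM exposes after fitting. This is a toy sketch, not the study's model: the feature names are assumed, the data is synthetic, and the "CNN probability" column is made deliberately predictive so that it ranks highly, mimicking the pattern reported below. The SHAP analysis would additionally use `shap.TreeExplainer` on the fitted model, omitted here to keep the sketch dependent on scikit-learn only.

```python
# Toy sketch of inspecting second-level GBM feature importance.
# Feature names and data are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
features = ["cnn_prob", "age", "sex", "bmi", "womac", "kl_grade"]  # assumed names
X = rng.normal(size=(400, len(features)))
# Make the CNN probability column genuinely predictive so it dominates:
y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:10s} {imp:.3f}")  # cnn_prob should rank first
```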
The train (Supplementary Figure S4) and test (Supplementary Figure S5) feature importance plots indicate that the predictions produced by our CNN make the highest contribution to the decisions of both models 6 and 7, respectively. Interestingly, both the train and test feature importance plots also indicate the importance of the symptomatic assessment (Western Ontario and McMaster Universities Arthritis Index, WOMAC 4) for the final prediction.

Table S1. Assessment of the added value of our method compared to semi-automatic measurements of fixed Joint Space Width (fJSW). We used the data from the OAI dataset and conducted experiments with nested cross-validation, keeping one data acquisition site out of the dataset and re-training our method and the models described above on the remaining parts. The results in the table show the average performance across the data acquisition sites in the OAI dataset and the standard deviation.