Prediction of total knee replacement using deep learning analysis of knee MRI

Rajamohan, Haresh Rengaraj; Wang, Tianyu; Leung, Kevin; Chang, Gregory; Cho, Kyunghyun; Kijowski, Richard; Deniz, Cem M.

doi:10.1038/s41598-023-33934-1

Download PDF

Article
Open access
Published: 28 April 2023

Prediction of total knee replacement using deep learning analysis of knee MRI

Haresh Rengaraj Rajamohan¹,
Tianyu Wang¹,
Kevin Leung²,
Gregory Chang³,
Kyunghyun Cho^1,2,
Richard Kijowski³ &
…
Cem M. Deniz^3,4

Scientific Reports volume 13, Article number: 6922 (2023) Cite this article

2114 Accesses
3 Citations
3 Altmetric
Metrics details

Subjects

Abstract

Current methods for assessing knee osteoarthritis (OA) do not provide comprehensive information to make robust and accurate outcome predictions. Deep learning (DL) risk assessment models were developed to predict the progression of knee OA to total knee replacement (TKR) over a 108-month follow-up period using baseline knee MRI. Participants of our retrospective study consisted of 353 case–control pairs of subjects from the Osteoarthritis Initiative with and without TKR over a 108-month follow-up period matched according to age, sex, ethnicity, and body mass index. A traditional risk assessment model was created to predict TKR using baseline clinical risk factors. DL models were created to predict TKR using baseline knee radiographs and MRI. All DL models had significantly higher (p < 0.001) AUCs than the traditional model. The MRI and radiograph ensemble model and MRI ensemble model (where TKR risk predicted by several contrast-specific DL models were averaged to get the ensemble TKR risk prediction) had the highest AUCs of 0.90 (80% sensitivity and 85% specificity) and 0.89 (79% sensitivity and 86% specificity), respectively, which were significantly higher (p < 0.05) than the AUCs of the radiograph and multiple MRI models (where the DL models were trained to predict TKR risk using single contrast or 2 contrasts together as input). DL models using baseline MRI had a higher diagnostic performance for predicting TKR than a traditional model using baseline clinical risk factors and a DL model using baseline knee radiographs.

Deep Learning Predicts Total Knee Replacement from Magnetic Resonance Images

Article Open access 14 April 2020

Aniket A. Tolpadi, Jinhee J. Lee, … Sharmila Majumdar

Increasing trend of radiographic features of knee osteoarthritis in rheumatoid arthritis patients before total knee arthroplasty

Article Open access 21 June 2022

Ryutaro Takeda, Takumi Matsumoto, … Sakae Tanaka

Deep learning for large scale MRI-based morphological phenotyping of osteoarthritis

Article Open access 25 May 2021

Nikan K. Namiri, Jinhee Lee, … Valentina Pedoia

Introduction

Knee osteoarthritis (OA) is one of the most prevalent and disabling chronic diseases, occurring in 10% of men and 13% of women over 60 years of age¹ and contributing to more than $27 billion in annual healthcare expenditures in the Unites States alone². Conservative treatment options including weight loss, aerobic activity, and muscle strengthening exercises can alleviate symptoms and potentially slow the rate of disease progression in patients with knee OA^3,4, but are most effective when initiated during the early stages of the disease^5,6. Thus, a preventive strategy that targets patients without advanced knee joint degeneration at high risk for OA progression could optimize the effectiveness of current conservative interventions. In addition, identifying individuals with knee OA at high risk for disease progression would help select the most optimal subjects for inclusion in clinical OA drug trials. This would facilitate the conduct of more affordable and shorter-duration studies involving smaller number of subjects that would expedite the development of new disease modifying OA therapies⁵.

Magnetic resonance imaging (MRI) is a highly useful imaging modality for evaluating knee OA as it can assess all joint structures including cartilage, bone, meniscus, ligament, and synovium that can be sources of pain in patients with the disease⁷. Semi-quantitative measures of structural features of knee joint degeneration on MRI have been shown to be associated with knee OA progression including an increased risk for future total knee replacement (TKR)^{8,9,10,11,12,13}. However, obtaining semi-quantitative parameters requires a reader to assess each individual structural feature in each region of the knee joint using categorical based scoring systems. This is a time-consuming and reader-dependent process, which would be difficult to incorporate into widespread, cost-effective OA risk assessment models. Thus, more efficient and reliable methods are needed to extract useful prognostic information from imaging studies.

Deep learning (DL) is an advanced form of artificial intelligence that has been successfully used for various medical imaging applications. Especially notable is a subclass of DL algorithms termed convolutional neural networks (CNNs), which have dominated the field of computer vision and surpassed human abilities in many important tasks¹⁴. Given enough training data, a CNN could automatically learn a representative subset of features such as structural features of knee joint degeneration on MRI associated with knee OA progression. DL offers an exciting new opportunity for rapid, fully automated extraction of useful prognostic information from imaging studies that could potentially be used to assess the risk of OA progression in every patient being evaluated in clinical practice with knee radiographs and MRI. Our study was performed to develop DL risk assessment models to predict the progression of knee OA to TKR over a 108-month follow-up period using baseline knee MRI. We hypothesize that DL models using baseline MRI would have higher diagnostic performance for predicting TKR than a traditional model using baseline clinical risk factors and a DL model using baseline knee radiographs.

Methods

Subject cohort

Our retrospective study was performed using knees selected from subjects in the publicly available Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis Study (MOST) databases. The OAI database contains demographic and clinical information, radiographs, and MRI examinations from 4796 subjects between 45 and 79 years of age with or at risk for knee OA evaluated at baseline and 12, 24, 36, 48, 60, 72, 84, and 108-month follow-up¹⁵. The MOST database contains the same information and imaging studies from 3026 subjects between 50 and 79 years of age with or at risk for knee OA evaluated at baseline and 15, 30, 60, 84, 144, and 168-month follow-up. The OAI and MOST were approved by the Internal Review Boards at University of California at San Francisco, Boston University Medical Center, and each individual clinical recruitment site and was performed in compliance with the Declaration of Helsinki. All subjects signed written informed consent.

Training and validation group

A training and validation group from the OAI database was selected to train the models that are not biased towards age, BMI, sex, and race, and use only features on radiographs and MRI to predict TKR. A balanced case–control cohort was selected by matching case subjects and control subjects using baseline demographic variables associated with knee OA progression including age, sex, ethnicity, and body mass index (BMI). Case subjects were defined as individuals who underwent a TKR in either knee after the baseline enrollment date, while control subjects were defined as individuals who appeared at the 108-month follow-up visit and had not undergone a TKR in either knee. If a patient underwent TKR in both knees during OAI data collection, the knee that first underwent TKR was included. Each case patient with TKR was matched to a control subject without TKR who was the same age, sex, and ethnicity and with an additional constraint on the baseline BMI within a 10% tolerance. The data set from case–control pairs contained either the left or right knee from each case and control subject.

A total of 353 case–control pairs were identified from the 4796 subjects in the OAI database. Subjects were excluded if they had TKR at baseline, received partial knee replacement over the course of follow-up, were missing baseline or 108-month follow-up information, or did not match with a case or control subject. A summary of the selection of case–control pairs is illustrated in Fig. 1. Study cohort characteristics are summarized in Supplementary Table 1. All 706 matched subjects had baseline standing posterior-anterior knee radiographs, knee MRI examinations, and clinical outcome measures including the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC)¹⁶ and Quality of Life from Knee Injury and Osteoarthritis Outcome Score (KOOS QoL)¹⁷. A subset of 270 subjects had baseline semi-quantitative MRI scores for the severity of cartilage loss and bone marrow edema lesions on 14 knee articular surfaces using the MRI Osteoarthritis Knee Score (MOAKS) system¹⁸ provided by central reading of the National Institute of Health OA Biomarkers Consortium Project.

Total knee replacement risk assessment models

The MRI examinations for all subjects in the training and validation group included coronal intermediate-weighted turbo spin-echo (IW-TSE), sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE), and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) sequences performed on a 3.0T whole-body scanner (Magnetom Trio, Siemens Healthcare, Erlangen, Germany) with the imaging parameters shown in Supplementary Table 2. To determine the best OA risk assessment model, multiple models were developed using analysis of each individual sequence and using two different approaches for combined analysis of multiple sequences. Three separate MRI models were created for the IW-TSE, FS-IW-TSE, and DESS images as each tissue contrast has advantages for evaluating different knee joint structures¹⁹. A multi-input MRI model was also developed by inherently combining information from FS-IW-TSE and DESS images in the same CNN architecture (Fig. 2). All MRI models were created with CNN architectures using conventional residual blocks with extension to 3D. The use of 3D CNN architectures allowed 3D analysis of the DESS sequence without the need to evaluate reformatted images in different planes. The outcome predictions for the models were a confidence value between 0 and 1 indicating the likelihood for TKR. Details regarding the CNN architectures used in the IW-TSE, FS-IW-TSE, DESS, and multi-input MRI models are included in the Supplementary Materials. The source code for this study is available at [Link provided after review].

A radiograph model for predicting TKR was created from a previous study using a CNN architecture adapted from ResNet with 34 layers, which analyzed baseline standing posterior-anterior knee radiographs and provided a confidence value between 0 and 1 as an outcome prediction indicating the likelihood for TKR²⁰. The OAI cohort used in this work is a subset of the cohort from²⁰ and during evaluation, it was ensured that there was no leakage between the training sets in²⁰ and the test set in our work. An MRI ensemble model was created by averaging the outcome predictions of the IW-TSE, FS-IW-TSE, and DESS models. In comparison to the multi-input MRI model that analyzed the different MRI sequences together to predict TKR, the MRI ensemble model averaged the outcome predictions of the models analyzing the different sequences separately. An MRI and radiograph ensemble model was created by averaging the outcome predictions of the ensemble MRI and radiograph models.

A traditional risk assessment model for predicting TKR was created using multi-layer perceptrons, which analyzed baseline clinical risk factors including BMI, WOMAC and contralateral WOMAC, and KOOS QoL^21,22. Input features were normalized to unit mean and zero variance using training dataset statistics. The multi-layer perceptrons passed the input features through 2 fully-connected hidden layers of size 256 with ReLU activation and batch normalization and provided a confidence value between 0 and 1 as an outcome prediction indicating the likelihood for TKR.

Model training and validation

Model training and validation was performed using sevenfold nested cross-validation. The 353 case–control pairs were split equally into 7 parts, with the number of pairs used for model training ranging between 250 and 256 and the number of pairs used for model validation and testing ranging between 47 and 52. The dataset splits were performed randomly using a random data generator in Python (version 2.7, Python Software Foundation, Wilmington, DE). Details regarding model training, evaluation, and dataset splits are included in the Supplementary Materials.

Model evaluation on internal and external testing groups

The models were evaluated on an internal hold-out testing group from the knees of the remaining 4090 subjects in the OAI database that were not involved in model training and validation and who were evaluated using the same MRI protocol and same MRI scanner. Among the remaining 4090 subjects, there were 32 case knees in 27 subjects that underwent TKR and 7891 control knees in 4034 subjects that did not undergo TKR between the baseline and 108-month follow-up periods. The remaining 257 knee in 216 subjects were excluded due to missing baseline or 108-month follow-up information. Study cohort characteristics are summarized in Supplemental Table 3.

The models were also evaluated on an external testing group from the MOST database, which consisted of a 270 case–control pairs of subjects with and without TKR performed between the baseline and 84-month follow-up periods. Case subjects and control subjects were matched using baseline age, sex, ethnicity, and BMI with identical inclusion and exclusion criteria as used for selection of the case–control pairs for the training and validation group. Study cohort characteristics are summarized in Supplemental Table 4. The MRI examinations for all subjects consisted of coronal short-tau inversion recovery (COR STIR) and sagittal fat-suppressed intermediate-weighted turbo spin-echo (SAG FS-IW-TSE) sequences performed on a 1.0T dedicated extremity scanner (ONI MSK Extreme; GE Healthcare, Waukesha, WI) with the imaging parameters shown in Supplementary Table 2.

Analysis of saliency maps

Saliency maps were created for the DESS model that showed the regions of discriminative high activation on the images on which the CNN based its interpretation. The saliency maps were produced by one minus the prediction probability of the DESS model with blocking 32 × 32 × 16 regions sequentially with stride of 8, 8, 4 in the 3 directions. The saliency maps were created for 50 randomly selected subjects and were overlaid on their corresponding DESS reformatted images in axial, coronal, and sagittal planes.

The axial, coronal, and sagittal overlaid saliency maps and DESS images for all image slices through the knee were reviewed by a fellowship trained musculoskeletal radiologist who was blinded to all subject information. The radiologist graded the presence and absence of regions of discriminative high activation in different anatomic locations of the knee including the central bone-cartilage interface and peripheral bone-cartilage interface of the medial and lateral femoral condyles and tibia plateau, medial and lateral meniscus, infrapatellar fat pad, intercondylar notch, and fluid filled joint space.

Statistical analysis

Statistical analysis was performed using R Project for Statistical Computing Software (R version 3.6.0, R-Project.org). Statistical significance was defined as a p value less than 0.05. The Holm method was used to adjust p values to account for multiple comparisons²³.

Model evaluation was performed using sevenfold nested cross-validation on the testing and validation group with additional evaluation performed on an internal hold-out testing group from the OAI database and an external testing group from the MOST database. Models were evaluated for all knees combined and for knees of each individual Kellgren-Laurence (KL) grade for the internal hold-out testing group in the OAI database and for all knees combined for the external testing group in the MOST database. Receiver operator characteristic analysis with areas under the curve (AUC) and area under the precision-recall curve (AUPRC) was used to evaluate the diagnostic performance of the different models developed to predict TKR. 95% confidence intervals for AUCs and AUPRCs were calculated across 5000 bootstrap samples. The Youden index was used to determine optimal model sensitivity and specificity²⁴. Statistical significance of improvements in diagnostic performance of the different models was analyzed by comparing AUC differences between models using the Delong test²⁵.

Univariate and multi-variate conditional logistic regression models were used to determine the ability of different variables to predict TKR including the outcome prediction of the MRI ensemble model, outcome prediction of the radiograph model, and individual clinical and MRI risk factors for all knees combined and for knees with each individual KL grade. The risk values predicted by the models were normalized to zero mean and unit variance to ensure that differences in mean magnitude prediction are not leading to any discrepancies in odds ratio computation. Clinical risk factors included baseline BMI, WOMAC and contralateral WOMAC scores, and KOOS QoL scores, while MRI risk factors included baseline semi-quantitative MOAKS scores for the severity of cartilage loss and bone marrow edema lesions, which were available in a subset of 270 case–control pairs. The number of articular surfaces with cartilage loss and bone marrow edema lesions was used as MRI risk factors as these variables were shown to provide superior diagnostic performance for predicting incident knee OA than the total MOAK scores for cartilage loss and bone marrow edema lesions²⁶. Crude odds ratio (OR) was used to assess the effect of a given variable when it was used as the only predictor, while adjusted OR was used to assess the effect of the given variable adjusted for the effects of all other variables included in the model. Wald test was used to assess the significance of individual variables.

The frequency of regions of discriminative high activation on the saliency maps at each anatomic location of the knee were calculated for case and control subjects. Fisher’s exact test was used to compare differences in the proportions of locations of discriminative high activation in different anatomic locations of the knee between subjects with and without TKR.

Results

Table 1 shows the AUCs and AUPRCs of the models developed to predict TKR evaluated using sevenfold nested cross-validation on the testing and validation group. All DL models using baseline MRI and radiographs had significantly higher (p < 0.001) AUCs for predicting TKR when compared to a traditional machine learning model using clinical risk factors. The MRI and radiograph ensemble model had the highest overall diagnostic performance with an AUC of 0.90 (80% sensitivity and 85% specificity), which was similar (p = 0.12) to the AUC of the MRI ensemble model and significantly higher (p < 0.05) than the AUCs of the radiograph, IW-TSE, FS-IW-TSE, DESS, and multi-input MRI models. The MRI ensemble model had the second highest overall diagnostic performance with an AUC of 0.89 (79% sensitivity and 86% specificity), which was marginally significantly higher (p = 0.06) than the AUC of the DESS model and significantly higher (p < 0.05) than the AUCs of the radiograph, IW-TSE, FS-IW-TSE, and multi-input MRI models. The DESS model had the highest diagnostic performance of all individual MRI models with an AUC of 0.88 (82% sensitivity and 81% specificity), which was similar (p = 0.24) to the AUC of the IW-TSE model and significantly higher (p < 0.05) than the AUCs of the IW-FSE and multi-input MRI models.

Table 1 Receiver operator characteristic analysis with areas under the curve (AUC) and area under the precision-recall curve (AUPRC) evaluating the diagnostic performance of the models to predict total knee replacement (TKR) using sevenfold nested cross-validation on the training a validation group in the OAI database.

Full size table

Table 2 shows the crude and adjusted ORs of the models developed to predict TKR for all knees combined, while Supplemental Table 5 shows the adjusted ORs of the models for knees with each individual KL grade. The outcome prediction of the MRI ensemble model was the variable that had the highest predictive performance in the univariate and multi-variate models with a crude OR and adjusted OR of 17.64 and 10.54, respectively for all knees combined and an adjusted OR of 7.55, 5.84, 3.33, and 3.19 for knees with KL grades of 0, 1, 2, and 3, respectively. For all knees combined, the outcome prediction of the MRI ensemble model, outcome prediction of the radiograph model, BMI, WOMAC score, KOOS QoL score, cartilage MOAK score, and bone marrow edema lesion MOAK score were significant predictors (p < 0.05) of TKR in the univariate model. However, the outcome prediction of the MRI ensemble model and cartilage MOAK score were the only variables that remained statistically significant predictors (p < 0.05) in the multi-variable model.

Table 2 Crude odds ratio (OR) and adjusted OR indicating the ability of different variables to predict total knee replacement (TKR) in univariate and multi-variate conditional logistic regression models for all knees combined.

Full size table

Tables 3 and 4 shows the AUCs and AUPRCs of the models developed to predict TKR evaluated using the internal hold-out testing group in the OAI database and the external testing group in the MOST database, respectively for all knees combined. All models showed a small decrease in diagnostic performance for the internal testing group with AUCs between 0.84 and 0.89 compared to AUCs between 0.85 and 0.90 for the sevenfold nested cross-validation. All models showed a larger decrease in diagnostic performance for the external testing group with AUCs between 0.55 and 0.71. The DESS, FS-IW-TSE, and MRI ensemble models had the highest AUCs between 0.67 and 0.71 for evaluating the sagittal fat suppressed intermediate-weighted TSE images in the MOST database, while the FS-IW-TSE and MRI ensemble models had the highest AUCs of 0.69 and 0.66, respectively for evaluating the coronal STIR images.

Table 3 Receiver operator characteristic analysis with areas under the curve (AUC) and area under the precision-recall curve (AUPRC) evaluating the diagnostic performance of the models to predict total knee replacement (TKR) using the internal hold-out testing group in the OAI database for all knees combined.

Full size table

Table 4 Receiver operator characteristic analysis with areas under the curve (AUC) and area under the precision-recall curve (AUPRC) evaluating the diagnostic performance of the models to predict total knee replacement (TKR) using the external testing group in the MOST database for all knees combined.

Full size table

Table 5 shows the AUCs of the models developed to predict TKR for knees of each individual KL grade evaluated using the internal hold-out testing group in the OAI database. The MRI ensemble model had the highest diagnostic performance with AUCs of 0.66, 0.70, 0.78, 0.74, and 0.87 for predicting TKR in knees with KL grades of 0, 1, 2, 3, and 4, respectively. For all models, the diagnostic performance was highest for knees with a KL grade of 4 and lowest for knees with a KL grade of 0.

Table 5 Receiver operator characteristic analysis with areas under the curve (AUC) evaluating the diagnostic performance of the models to predict total knee replacement (TKR) using the internal hold-out testing group in the OAI database for knees with each individual Kellgren-Laurence (KL) grade.

Full size table

Table 6 shows the frequency of regions of discriminative high activation on the saliency maps at each anatomic location of the knee for case and control subjects. Significantly higher proportion (p < 0.05) of case subjects with TKR had regions of high activation on the peripheral bone-cartilage interfaces of the medial and lateral femoral condyles and tibia plateau, intercondylar notch, and medial meniscus compared to subjects without TKR. Significantly higher proportion (p < 0.05) of control subjects without TKR had regions of high activation on the central bone-cartilage interfaces of the femoral condyles and tibia plateau compared to subjects with TKR (Figs. 3, 4).

Table 6 Frequency of regions of discriminative high activation on the saliency maps at anatomic location of the knee for case subjects with total knee replacement (TKR) and control subjects without TKR computed using the DESS model.

Full size table

Discussion

Our study showed that DL models using baseline MRI had significantly higher (p < 0.05) diagnostic performance for predicting the progression of TKR over a 108-month follow-up period when compared to a traditional model using baseline clinical risk factors and a DL model using baseline radiographs. Furthermore, the outcome prediction of the MRI ensemble model in the univariate and multi-variate analysis had much higher crude and adjusted ORs than the outcome prediction of the radiograph model and clinical and MRI risk factors and was one of few variables that was a significant predictor (p < 0.05) of TKR when accounting for the effects of all other variables in the multivariate analysis. Thus, our findings indicate that DL analysis of the ensemble of baseline IW-TSE, FS-IW-TSE, and DESS images provides the best independent prognostic information regarding the likelihood of TKR. This emphasizes the need to incorporate DL analysis of baseline MRI into knee OA risk assessment models to maximize diagnostic performance, especially when the imaging modality is available for use.

Our study confirmed the findings of a previous study by Tolpadi et al., which also showed a significantly higher (p < 0.05) diagnostic performance for a DL model using baseline DESS images for predicting TKR over a 60-month follow-up period in 4,796 subjects in the OAI when compared to a DL model using baseline radiographs. In this study, a traditional model using clinical risk factors, DL model using radiographs, and DL model using DESS images had AUCs of 0.87, 0.85, and 0.89, respectively for predicting TKR, which were similar to the AUCs in our study of 0.77, 0.86, and 0.88, respectively²⁷. Our study further investigated the diagnostic performance of DL models that used different MRI tissue contrasts and found that the DESS model had higher AUC than the IW-TSE and FS-IW-TSE models. The improved diagnostic performance of the DESS model is likely due to the fact that the CNN analyzed higher resolution, 3D volumetric images with a larger number of slices that provided more image data for model training. In addition, DESS has been shown to have high diagnostic performance for detecting various structural features of knee joint degeneration on MRI including cartilage defects, osteophytes, joint effusion and synovitis, ligament and tendon tears, and bone marrow edema lesions²⁸.

Our study was the first to investigate DL models that used different MRI tissue contrasts and different imaging modalities together to predict TKR. The MRI ensemble model, which averaged the outcome predictions of the IW-TSE, FS-IW-TSE, and DESS models, and MRI and radiograph ensemble models, which averaged the outcome predictions of the MRI ensemble and radiograph models, had the highest diagnostic performance for predicting TKR. Previous studies have also shown that ensemble models outperform the best individual model inside the ensemble, provided that the individual models are uncorrelated^29,30. Our study also investigated a multi-input MRI model that analyzed the IW-FSE and DESS images together in the same CNN architecture. However, the multi-input MRI model had lower diagnostic performance than the DESS and MRI ensemble models, which was likely the result of the increase in the number of model parameters, or the inadequate concatenation fusion strategy used in the model architecture. Future work is needed to investigate alternative approaches for DL analysis of different MRI tissue contrasts and different imaging modalities in the same CNN architecture. For example, knowledge distillation and attention-based fusion approaches have been shown to be superior to ensemble methods in DL prediction models combining information from multi-view images^31,32.

Our study showed a decrease in diagnostic performance when the models were evaluated using an external testing group in the MOST database. This highlights the challenges for widespread application of DL models analyzing baseline MRI, which may be influenced by variability in imaging protocols and scanner types across institutions. However, our results were encouraging as the DESS, FS-IW-TSE, and multi-modality MRI models had moderately high AUCs for predicting TKR for subjects in the MOST database, which could be considered a “worse-case scenario” external testing group with MRI sequences acquired on 1.0T extremity scanners with lower image quality than most clinical MRI protocols performed on 1.5T and 3.0T whole-body scanners. Nevertheless, future research efforts are needed to develop better CNN model architectures that are less sensitive to image fluctuations and new methods of model training using image datasets with more heterogeneous tissue contrasts and image quality to optimize model performance and improve model generalizability.

Our study found differences in the locations of regions of high discriminatory activation on the saliency maps between case and control subjects. Regions of high activation were more common in case subjects with TKR on the peripheral bone-cartilage interfaces, intercondylar notch, and medial meniscus. These likely reflected areas of osteophyte formation, intercondylar synovitis, and meniscal tear and extrusion, which have been previously shown to be risk factors for TKR^{9,10,11,12,13,33}. Our results are different from the results of other studies that have used DL models to predict knee pain progression and future TKR. Chang et al. identified regions of high activation in areas of joint effusion in 89% of subjects with knee pain progression³⁴. Tolpadi et al. found that the risk of TKR decreased when regions of high activation were present on the bone and cartilage, meniscus, and anterior cruciate ligament and increased when regions of high activation were present on the medial patella retinaculum, gastrocnemius tendon, and plantaris muscle²⁷. Differences in our findings and the findings of previous studies may be due multiple factors including differences in CNN architectures, outcome measures for knee OA progression, and methods for saliency map reconstruction and interpretation.

Our DL models could have important future impact in clinical practice and clinical drug trials. The models could improve clinical outcomes and reduce symptoms by identifying patients at high risk for knee OA progression early enough to provide a window of opportunity for disease modification. The models could also provide a paradigm shifting approach that would reduce the size, duration, and costs of future clinical drug trials through exclusive selection of subjects at high risk for knee OA progression, thereby expediting the development of new OA disease modifying therapies. The models could also provide a paradigm shifting approach that would reduce the size, duration, and costs of future clinical drug trials through exclusive selection of subjects at high risk for knee OA progression, thereby expediting the development of new OA disease modifying therapies. Semi-quantitative measures of structural features on MRI have also been shown to be associated with knee OA progression^{8,9,10,11,12,13}. However, obtaining semi-quantitative parameters is time-consuming and reader-dependent, which would make it difficult to incorporate them into widespread, cost-effective OA risk assessment models that could be used in all patients evaluated with MRI in clinical practice. Furthermore, our study has shown that DL models have higher diagnostic performance for predicting future TKR using baseline MRI than semi-quantitative MOAK scores for cartilage loss and bone marrow edema lesions.

Our study has several limitations. Our study defined case subjects as individuals who underwent a TKR over the 108-month follow-up period. Thus, the outcome measure was a binary variable that did not take into account the time after baseline the TKR was performed. Furthermore, the decision to undergo a TKR can be influenced by multiple factors other than the degree of structural joint degeneration on imaging studies such as pain severity, comorbidities, and healthcare access³⁵. Our models were also trained and evaluated using the OAI and MOST databases that were primarily composed of older, overweight, Caucasian subjects. Thus, model generalizability to more age, BMI, race, and ethnic diverse populations needs to be further investigated. In addition, our traditional machine learning model used only a limited number of clinical risk factors although all variables were documented in previous studies to be strongly associated with the progression of knee pain and future TKR^21,22. Finally, our DL models could provide no mechanistic information regarding the factors responsible for progression to TKR.

In conclusion, our study showed that DL models using baseline MRI had significantly higher (p < 0.001) diagnostic performance for predicting TKR over a 108-month follow-up period when compared to a traditional model using baseline clinical risk factors and a DL model using baseline radiographs. Furthermore, DL models ensembling different MRI tissue contrasts and different imaging modalities achieved significantly higher (p < 0.05) diagnostic performance than DL models that used a single MRI tissue contrast or single imaging modality. However, our study also showed a decrease in diagnostic performance of the models when evaluated on the external testing group in the MOST database, which demonstrates the influence of variables such as subject cohort characteristics and MRI protocols on model performance. Additional work is needed to develop new approaches to combine different image datasets in the same CNN architecture to maximize model performance and to investigate new methods to improve model generalizability to more diverse subject populations and image datasets with more heterogeneous tissue contrasts and image quality.

Data availability

The datasets analyzed during the current study are available in the Osteoarthritis Initiative and Multicenter Osteoarthritis Study repositories: https://nda.nih.gov/oai/, https://most.ucsf.edu/multicenter-osteoarthritis-study-most-public-data-sharing. The github repo for our model development and analysis can be found in https://github.com/denizlab/OAI-MRI-TKR.

References

Felson, D. T. & Zhang, Y. An update on the epidemiology of knee and hip osteoarthritis with a view to prevention. Arthritis Rheum. 41(8), 1343–1355 (1998).
Article CAS PubMed Google Scholar
Losina, E. et al. Lifetime medical costs of knee osteoarthritis management in the United States: Impact of extending indications for total knee arthroplasty. Arthritis Care Res. (Hoboken) 67(2), 203–215. https://doi.org/10.1002/acr.22412 (2015).
Article PubMed Google Scholar
Christensen, R., Bartels, E. M., Astrup, A. & Bliddal, H. Effect of weight reduction in obese patients diagnosed with knee osteoarthritis: A systematic review and meta-analysis. Ann. Rheum. Dis. 66(4), 433–439. https://doi.org/10.1136/ard.2006.065904 (2007).
Article PubMed PubMed Central Google Scholar
Roddy, E., Zhang, W. & Doherty, M. Aerobic walking or strengthening exercise for osteoarthritis of the knee? A systematic review. Ann. Rheum. Dis. 64(4), 544–548. https://doi.org/10.1136/ard.2004.028746 (2005).
Article CAS PubMed PubMed Central Google Scholar
Karsdal, M. A. et al. Disease-modifying treatments for osteoarthritis (DMOADs) of the knee and hip: Lessons learned from failures and opportunities for the future. Osteoarthr. Cartil. 24(12), 2013–2021. https://doi.org/10.1016/j.joca.2016.07.017 (2016).
Article CAS Google Scholar
Felson, D. T. & Hodgson, R. Identifying and treating preclinical and early osteoarthritis. Rheum. Dis. Clin. N. Am. 40(4), 699–710. https://doi.org/10.1016/j.rdc.2014.07.012 (2014).
Article Google Scholar
Peterfy, C., Woodworth, T. & Altman, R. Workshop for consensus on osteoarthritis imaging: MRI of the knee. Osteoarthr. Cartil. OARS Osteoarthr. Res. Soc. 14(Suppl 1), 44–45 (2006).
Article Google Scholar
Hunter, D. J. Risk stratification for knee osteoarthritis progression: A narrative review. Osteoarthr. Cartil. 17(11), 1402–1407. https://doi.org/10.1016/j.joca.2009.04.014 (2009).
Article CAS Google Scholar
Raynauld, J. P. et al. Risk factors predictive of joint replacement in a 2-year multicentre clinical trial in knee osteoarthritis using MRI: Results from over 6 years of observation. Ann. Rheum. Dis. 70(8), 1382–1388. https://doi.org/10.1136/ard.2010.146407 (2011).
Article PubMed Google Scholar
Hafezi-Nejad, N., Zikria, B., Eng, J., Carrino, J. A. & Demehri, S. Predictive value of semi-quantitative MRI-based scoring systems for future knee replacement: Data from the osteoarthritis initiative. Skelet. Radiol. 44(11), 1655–1662. https://doi.org/10.1007/s00256-015-2217-2 (2015).
Article Google Scholar
Scher, C., Craig, J. & Nelson, F. Bone marrow edema in the knee in osteoarthrosis and association with total knee arthroplasty within a three-year follow-up. Skelet. Radiol. 37(7), 609–617. https://doi.org/10.1007/s00256-008-0504-x (2008).
Article Google Scholar
Hunter, D. J. et al. Systematic review of the concurrent and predictive validity of MRI biomarkers in OA. Osteoarthr. Cartil. 19(5), 557–588. https://doi.org/10.1016/j.joca.2010.10.029 (2011).
Article CAS Google Scholar
Nielsen, F. K., Egund, N., Jorgensen, A. & Jurik, A. G. Risk factors for joint replacement in knee osteoarthritis; a 15-year follow-up study. BMC Musculoskelet. Disord. 18(1), 510. https://doi.org/10.1186/s12891-017-1871-z (2017).
Article PubMed PubMed Central Google Scholar
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521(7553), 436–444. https://doi.org/10.1038/nature14539 (2015).
Article ADS CAS PubMed Google Scholar
Lester, G. Clinical research in OA–the NIH osteoarthritis initiative. J. Musculoskelet. Neuronal Interact. 8(4), 313–314 (2008).
CAS PubMed Google Scholar
Bellamy, N., Buchanan, W. W., Goldsmith, C. H., Campbell, J. & Stitt, L. W. Validation study of WOMAC: A health status instrument for measuring clinically important patient relevant outcomes to antirheumatic drug therapy in patients with osteoarthritis of the hip or knee. J. Rheumatol. 15(12), 1833–1840 (1988).
CAS PubMed Google Scholar
Roos, E. M., Roos, H. P., Lohmander, L. S., Ekdahl, C. & Beynnon, B. D. Knee Injury and Osteoarthritis Outcome Score (KOOS)—Development of a self-administered outcome measure. J. Orthop. Sports Phys. Ther. 28(2), 88–96. https://doi.org/10.2519/jospt.1998.28.2.88 (1998).
Article CAS PubMed Google Scholar
Hunter, D. J. et al. Evolution of semi-quantitative whole joint assessment of knee OA: MOAKS (MRI Osteoarthritis Knee Score). Osteoarthr. Cartil. 19(8), 990–1002. https://doi.org/10.1016/j.joca.2011.05.004 (2011).
Article CAS Google Scholar
Peterfy, C. G., Schneider, E. & Nevitt, M. The osteoarthritis initiative: Report on the design rationale for the magnetic resonance imaging protocol for the knee. Osteoarthr. Cartil. 16(12), 1433–1441. https://doi.org/10.1016/j.joca.2008.06.016 (2008).
Article CAS Google Scholar
Leung, K. et al. Prediction of total knee replacement and diagnosis of osteoarthritis by using deep learning on knee radiographs: Data from the osteoarthritis initiative. Radiology 296(3), 584–593. https://doi.org/10.1148/radiol.2020192091 (2020).
Article PubMed Google Scholar
Hochberg, M. C., Favors, K. & Sorkin, J. D. Quality of life and radiographic severity of knee osteoarthritis predict total knee arthroplasty: Data from the osteoarthritis initiative. Osteoarthr. Cartil. 21, S11–S11. https://doi.org/10.1016/j.joca.2013.02.044 (2013).
Article Google Scholar
Joseph, G. B. et al. Tool for osteoarthritis risk prediction (TOARP) over 8 years using baseline clinical data, X-ray, and MRI: Data from the osteoarthritis initiative. J. Magn. Reson. Imaging 47(6), 1517–1526. https://doi.org/10.1002/jmri.25892 (2018).
Article PubMed Google Scholar
Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6(2), 65–70 (1979).
MathSciNet MATH Google Scholar
Youden, W. J. Index for rating diagnostic tests. Cancer 3(1), 32–35. https://doi.org/10.1002/1097-0142(1950)3:1%3c32::aid-cncr2820030106%3e3.0.co;2-3 (1950).
Article CAS PubMed Google Scholar
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44(3), 837–845 (1988).
Article CAS PubMed MATH Google Scholar
Sharma, L. et al. Knee tissue lesions and prediction of incident knee osteoarthritis over 7 years in a cohort of persons at higher risk. Osteoarthr. Cartil. 25(7), 1068–1075. https://doi.org/10.1016/j.joca.2017.02.788 (2017).
Article CAS Google Scholar
Tolpadi, A. A., Lee, J. J., Pedoia, V. & Majumdar, S. Deep learning predicts total knee replacement from magnetic resonance images. Sci. Rep. 10(1), 6371. https://doi.org/10.1038/s41598-020-63395-9 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Chaudhari, A. S. et al. Diagnostic accuracy of quantitative multicontrast 5-minute knee MRI using prospective artificial intelligence image quality enhancement. AJR Am. J. Roentgenol. 216(6), 1614–1625. https://doi.org/10.2214/AJR.20.24172 (2021).
Article PubMed Google Scholar
Kuncheva, L. I. & Whitaker, C. J. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach. Learn. 51(2), 181–207 (2003).
Article MATH Google Scholar
Sollich, P. & Krogh, A. Learning with ensembles: How overfitting can be useful. In Advances in Neural Information Processing Systems (eds Touretzky, D. et al.) (MIT Press, 1995).
Google Scholar
Allen-Zhu, Z. & Li, Y. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning (2020).
Zhou, T., Ruan, S., Guo, Y. & Canu, S. A Multi-modality Fusion Network Based on Attention Mechanism for Brain Tumor Segmentation 377–380.
Roemer, F. W. et al. Anatomical distribution of synovitis in knee osteoarthritis and its association with joint effusion assessed on non-enhanced and contrast-enhanced MRI. Osteoarthr. Cartil. 18(10), 1269–1274. https://doi.org/10.1016/j.joca.2010.07.008 (2010).
Article CAS Google Scholar
Chang, G. H. et al. Assessment of knee pain from MR imaging using a convolutional Siamese network. Eur. Radiol. 30(6), 3538–3548. https://doi.org/10.1007/s00330-020-06658-3 (2020).
Article PubMed PubMed Central Google Scholar
Kwoh, C. K. et al. Determinants of patient preferences for total knee replacement: African-Americans and whites. Arthritis Res. Ther. 17, 348. https://doi.org/10.1186/s13075-015-0864-2 (2015).
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

Study supported by National Institutes of Health (R01 AR074453). The OAI is a public-private partnership comprised of five contracts (N01-AR-2-2258; N01-AR-2-2259; N01-AR-2-2260; N01-AR-2-2261; N01-AR-2-2262) funded by the National Institutes of Health, a branch of the Department of Health and Human Services, and conducted by the OAI Study Investigators. Private funding partners include Merck Research Laboratories; Novartis Pharmaceuticals Corporation, GlaxoSmithKline; and Pfizer, Inc. Private sector funding for the OAI is managed by the Foundation for the National Institutes of Health. This manuscript was prepared using an OAI public use data set and does not necessarily reflect the opinions or views of the OAI investigators, the NIH, or the private funding partners.

Author information

Authors and Affiliations

Center for Data Science, New York University, 60 5th Ave, New York, NY, 10011, USA
Haresh Rengaraj Rajamohan, Tianyu Wang & Kyunghyun Cho
Courant Institute of Mathematical Sciences, New York University, 251 Mercer St, New York, NY, 10012, USA
Kevin Leung & Kyunghyun Cho
Department of Radiology, New York University Langone Health, 660 1st Ave, New York, NY, 10016, USA
Gregory Chang, Richard Kijowski & Cem M. Deniz
Bernard and Irene Schwartz Center for Biomedical Imaging, New York University Langone Health, 650 First Avenue, Room 418, New York, NY, 10016, USA
Cem M. Deniz

Authors

Haresh Rengaraj Rajamohan
View author publications
You can also search for this author in PubMed Google Scholar
Tianyu Wang
View author publications
You can also search for this author in PubMed Google Scholar
Kevin Leung
View author publications
You can also search for this author in PubMed Google Scholar
Gregory Chang
View author publications
You can also search for this author in PubMed Google Scholar
Kyunghyun Cho
View author publications
You can also search for this author in PubMed Google Scholar
Richard Kijowski
View author publications
You can also search for this author in PubMed Google Scholar
Cem M. Deniz
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

H.R.R. literature research, 3D CNN implementation, data analysis, manuscript preparation, T.W. literature search, data preparation, 3D CNN implementation, K.L. data preparation, initial statistical analysis, G.C. study concept and manuscript editing, K.C. study design and manuscript editing, R.K. study design, analysis of results, analyzing saliency maps, manuscript editing, C.M.D. study concept and design, experiments, analysis of the results, manuscript preparation.

Corresponding author

Correspondence to Cem M. Deniz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Rajamohan, H.R., Wang, T., Leung, K. et al. Prediction of total knee replacement using deep learning analysis of knee MRI. Sci Rep 13, 6922 (2023). https://doi.org/10.1038/s41598-023-33934-1

Download citation

Received: 12 December 2022
Accepted: 21 April 2023
Published: 28 April 2023
DOI: https://doi.org/10.1038/s41598-023-33934-1

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.