Deep learning for large scale MRI-based morphological phenotyping of osteoarthritis

Osteoarthritis (OA) develops through heterogenous pathophysiologic pathways. As a result, no regulatory agency approved disease modifying OA drugs are available to date. Stratifying knees into MRI-based morphological phenotypes may provide insight into predicting future OA incidence, leading to improved inclusion criteria and efficacy of therapeutics. We trained convolutional neural networks to classify bone, meniscus/cartilage, inflammatory, and hypertrophy phenotypes in knee MRIs from participants in the Osteoarthritis Initiative (n = 4791). We investigated cross-sectional association between baseline morphological phenotypes and baseline structural OA (Kellgren Lawrence grade > 1) and symptomatic OA. Among participants without baseline OA, we evaluated association of baseline phenotypes with 48-month incidence of structural OA and symptomatic OA. The area under the curve of bone, meniscus/cartilage, inflammatory, and hypertrophy phenotype neural network classifiers was 0.89 ± 0.01, 0.93 ± 0.03, 0.96 ± 0.02, and 0.93 ± 0.02, respectively (mean ± standard deviation). Among those with no baseline OA, bone phenotype (OR: 2.99 (95%CI: 1.59–5.62)) and hypertrophy phenotype (OR: 5.80 (95%CI: 1.82–18.5)) each respectively increased odds of developing incident structural OA and symptomatic OA at 48 months. All phenotypes except meniscus/cartilage increased odds of undergoing total knee replacement within 96 months. Artificial intelligence can rapidly stratify knees into structural phenotypes associated with incident OA and total knee replacement, which may aid in stratifying patients for clinical trials of targeted therapeutics.

www.nature.com/scientificreports/ of baseline phenotypes with 48-month structural OA and symptomatic OA. Lastly, we examined associations between phenotypes and undergoing TKR by 96 months from baseline.

Materials and methods
Study participants. We obtained the data from Osteoarthritis Initiative (OAI), which enrolled 4796 participants, aged 45 to 80 years, between 2005 and 2006 at four US centers. Participants in OAI had OA or were at high risk of developing OA in at least one knee at baseline. Participants at each site were assessed annually and evaluated information included questionnaires, physical examination, radiographic imaging, and MRI. Exclusion criteria included rheumatic or other inflammatory arthritis, contraindication to MRI, end-stage knee OA bilaterally, and inability to walk without assistance. National Institute of Arthritis and Musculoskeletal and Skin Disease approved the OAI study; the OAI was carried out in accordance with relevant guidelines and regulations (registered as "Osteoarthritis Initiative (OAI): A Knee Health Study", NCT#00080171, on ClinicalTrials. gov). Participants provided informed consent at each study visit. The full trial protocol, eligibility criteria, and interventions have been previously documented 15-17 . Imaging. MRIs  MRI-based morphological phenotyping. ROAMES is a simplified MRI assessment metric for stratification of knees into morphological phenotypes potentially applicable to determine eligibility for DMOAD trials 10 . Subchondral bone phenotype is defined as knees with bone marrow edema in greater than 66% (MRI Osteoarthritis Knee Score 18  A subset of the knee MR images from OAI were graded according to MOAKS as part of several previous studies and shared publicly, the first being the OA Biomarkers Consortium FNIH Project which studied 600 participants in a case-control study of OA incidence 19 . In 2017, MOAKS readings were also released for four other projects including case-control studies in 574 participants for studying incident lateral compartment OA 20 , in 613 participants for studying incident radiographic OA 21 , and the Pivotal OAI MR Imaging Analyses and a subcohort study of 850 participants with bilaterally normal knees at baseline 22 . Details of these five projects are publicly available (Supplemental File 1). The MOAKS grading was performed by a centralized group under the supervision of two musculoskeletal radiologists with more than nine years of training in semi-quantitative knee OA grading 17 . The radiologists were blinded to the clinical data and case-control status. A total of 2653 unique participants were imaged at either or both of two visits (baseline, 4 years), resulting in 4413 knee MRIs for grading in a total of 3117 unique knees. Baseline demographics for the participants were as follows: Women = 1574, Men = 1074, Age (mean[SD]) = 60.9 [9.0], BMI (mean[SD]) = 28.5 [4.8]), and the baseline Kellgren-Lawrence (KL) grades of the knees were KL0 = 1212, KL1 = 654, KL2 = 626, KL3 = 446, KL4 = 170.
Since ROAMES is a simplification of MOAKS, we used the radiologist MOAKS grades to directly assign ROAMES phenotypes of bone, meniscus/cartilage, inflammatory, and hypertrophy. This subset of OAI images with assigned ROAMES phenotypes was then used as ground truth for training neural network classifiers. The sample size of cases and controls from the ROAMES assigned OAI subset for bone, meniscus/cartilage, inflammatory, and hypertrophy were 532 and 3109, 101 and 3535, 50 and 1906, and 57 and 543, respectively. Every knee was not necessarily graded for each aspect of MOAKS. For example, some knees were graded for osteophytes, while others were not. Knees that did not have MOAKS grades necessary to determine presence or absence of a phenotype could not be used for training the particular neural network for that phenotype, which is why there are different sample sizes among the phenotypes. Knees graded with MOAKS grades to determine presence or absence of more than one phenotype subsequently were used for training each eligible phenotype. Thus, each classifier had knees that had non-exclusive phenotypes (i.e. training bone phenotype possessed cases that also had meniscus/cartilage phenotype). We trained four separate neural networks, so one knee could be a case in training one particular phenotype but may be a control in training a different phenotype classifier. Atrophy, defined as minimal osteophytes with severe cartilage damage, is the fifth ROAMES phenotype and was not included in this study due to low number of cases.
Model training for automated morphological phenotyping. Using the subset of the OAI with radiologist-assigned ROAMES phenotypes, we trained convolutional neural networks (CNNs) from coronal and sagittal knee MRIs to classify the four ROAMES phenotypes of bone, meniscus/cartilage, inflammatory, and hypertrophy. The radiologist-graded images were split into training (70%), validation (10%), and test sets (20%) for each CNN phenotype classifier, preserving the distributions of baseline demographics, radiographic severity, and pain severity. www.nature.com/scientificreports/ Images in these three splits for each phenotype classifier were from distinct, non-overlapping participants. The training set was used to train each of the CNNs with back propagation. Model performance after each training epoch was evaluated over the validation set. Test set was blind to the model until after training to serve as final metric of performance. The CNNs utilized MRNet neural network architecture, which utilizes each slice of the concatenated coronal and sagittal views as input into an ImageNet pre-trained AlexNet for feature extraction 23 . The features from each slice were then pooled and input into a fully connected layer to produce a final binary classification probability assessing the presence or absence of phenotype. We trained one CNN for each phenotype over 80 training epochs with early stopping with following parameters: Adam optimizer, learning rates of 5 × 10 -5 (bone CNN) and 1 × 10 -5 (meniscus/cartilage, inflammatory, and hypertrophy CNNs), empirically-weighted cross-entropy loss to account for class imbalances, and batch size of 1. These model configurations were selected through several iterations of empirical parameter selection based on previously solving similar classification tasks 12,24,25 . Training set augmentation consisted of random two-dimensional translations, rotations, and zooming. We then performed a systematic hyperparameter tuning of these CNNs with a grid search of differing architectures (AlexNet, ResNet, DenseNet), learning rates (1E-4, 1E-5, 1E-6), weight decays (None, 0.01), and dropout rates (0.1, 0.3, 0.5). The highest performing phenotype models from the grid search were compared to the empirically tuned CNNs. McNemar's test was used to compare classification performance on the validation set to determine statistically significant differences between the phenotype classifiers. The higher performing CNN was used to infer on the test set and entire OAI. All CNNs were developed in Pytorch (Facebook, Menlo Park, CA), and computations were performed on NVIDIA (Santa Clara, CA) GeForce GTX Titan X graphics processing units.
Model inference for automated morphological phenotyping of entire OAI dataset. To investigate the associations between morphological phenotypes and knee OA outcomes, the trained CNNs were then utilized to predict morphological phenotypes for the entire cohort's bilateral knee images; specifically, we studied 4971 baseline patients over 8 study time points and obtained images from both knees at each visit. This resulted in a total of 45,300 MRI exams that were analyzed with both coronal and sagittal MRI views. To understand the prognosis effects of each phenotype, we chose one knee per participant and allowed maximum one phenotype per each participant, excluding samples fulfilling more than one phenotype. We chose the knee with greater radiographic severity or a random knee if severity was equal. The predicted morphological phenotypes served as the primary independent variables. Statistical analysis. We compared ROAMES predictions on the test set images from the CNNs with the corresponding ground truth radiologist assigned ROAMES phenotypes, which served to evaluate phenotype classification metrics of the CNNs. Performance measures included area under the curve (AUC), accuracy, sensitivity, and specificity. In these metrics, the true value was the radiologist phenotype and the predicted value was the model phenotype prediction. Standard errors were calculated using bootstrapping principle. One-way ANOVA tests compared training, validation, and test set demographics, radiographic scores, and pain scores.
Baseline characteristic differences were assessed between participants without phenotype and participants with each of the four morphological phenotypes using Kruskal-Wallis test for continuous variables or Chi-square test for categorial variables. Benjamini-Hochberg method was used for P-value adjustment as needed.
The primary outcome was structural OA and symptomatic OA. Structural OA was defined as KL radiographic grading scheme greater than or equal to 2 (presence of definite osteophyte) 26 . Symptomatic OA was defined as the presence of pain, aching, or stiffness in knee joint for most days lasting at least one month in past 12 months 27 . We investigated the association between baseline phenotypes and concurrent structural and symptomatic OA among all participants using logistic regression. In a longitudinal model, we evaluated the association of baseline phenotypes with incidence of structural OA and symptomatic OA at 48 months among participants without OA at baseline using mixed effects logistic regression analyses to account for multiple observation by participants. We additionally assessed the association between phenotypes and undergoing primary TKR after baseline and prior to the 96-month visit using logistic regression. Both cross-sectional and longitudinal model were adjusted for baseline characteristics, including age, sex, race, and body mass index (BMI), by adding these variables as predictors to the regressions. We built an additional TKR logistic regression model adjusted for symptomatic OA and KL by similarly adding baseline symptomatic OA and KL grade as predictors in the model. The definitory time point of phenotype characterization was baseline.
Two-tailed P-values less than 0.05 were considered statistically significant. Statistical analyses were performed in R environment for statistical computing and important packages included lme4 and car 28 .

Results
Automated morphological phenotyping performance. There were no statistically significant differences between participants in the training, validation, and test sets regarding demographics, radiographic scores, and pain scores (Supplemental Table 1 Longitudinal associations between morphological phenotype and OA outcomes. We performed longitudinal analyses in only those subjects who had no OA at baseline and had 48 months follow up assessment. We analyzed 874 and 814 subjects at baseline for structural OA and symptomatic OA, respectively.

Discussion
We built an end-to-end deep learning model to rapidly stratify knees into morphological phenotypes using a large, longitudinal cohort. We examined associations of phenotypes with odds of concurrent OA, obtaining OA within 48 months from baseline, and receiving TKR surgery within 96 months from baseline. All phenotypes, particularly meniscus/cartilage and hypertrophy, were associated with concurrent structural OA. Additionally, all phenotypes increased odds of concurrent symptomatic OA. Among knees with no baseline OA, bone phenotype and hypertrophy phenotype each respectively increased odds of incident structural OA and symptomatic OA in 48 months. All phenotypes except meniscus/cartilage increased odds of undergoing TKR within 96 months after adjustment for baseline KOOS score and KL grade. Identifying phenotypes of knee OA may aid in stratifying patients for clinical trials and guide development of targeted interventions to prevent disease progression 1,30 .
Roemer et al. conducted a study associating ROAMES phenotypes with OA in a cohort of 485 knee MRIs with a priori-defined outcomes from FNIH 11 . They reported knees, with KL grades 2 and 3, possessing bone phenotype at baseline had highest odds of structural OA at either 24, 36, or 48 months (OR 1.87; 95% CI, 1.18-2.97). Neither bone, meniscus/cartilage, nor inflammatory phenotypes increased odds of pain progression over the same study period. Our study similarly determined bone phenotype to increase incident structural OA and that bone and meniscus/cartilage did not increase odds of incident symptomatic OA. However, hypertrophy phenotype did increase odds of symptomatic OA in our study. Roemer et al. did not report hypertrophy phenotype due to sample size constraints, and there is little literature evaluating hypertrophy phenotype in relation to incident OA. Compared to Roemer et al., we did not exclude knees based on KL grade, whereas Roemer et al. excluded all knees with KL less than 2. They also defined structural progression as a decrease in minimal joint space width of at least 0.7 mm in the medial tibiofemoral joint. The authors also utilized Western Ontario and McMaster Universities Osteoarthritis Index to assess symptomatic progression, whereas our study examined minimally detectable change in KOOS. Finally, both studies had different sample sizes and study lengths.
Cross-sectional analysis of baseline characteristics demonstrated a significant proportion of radiographic OA among knees with any phenotype, the highest proportion of which appearing in meniscus/cartilage and hypertrophy. These two phenotypes most overlap with criteria for radiographic OA, defined as definite evidence Table 5. Association between phenotypes and undergoing primary TKR, with and without adjustment for symptomatic OA and KL grade, after baseline and prior to the 96-month visit (n = 3154). The adjustment refers to adding baseline symptomatic OA and KL grade as additional predictors in the logistic regression model. Both models adjusted for age, sex, and BMI. "-", not applicable; OA, Osteoarthritis; CI, confidence interval; BMI, body mass index; TKR, total knee replacement; KL, Kellgren-Lawrence. *Statistically significant at P value < 0.05. www.nature.com/scientificreports/ of osteophytes and joint space narrowing. Knees fulfilling criteria for either phenotype but not structural OA may be reflective of decreased sensitivity of x-ray in detecting osteophytes and cartilage degeneration relative to MRI 31,32 . Limited specificity of the CNNs may also contribute to the discrepancy. Although these two phenotypes generated highest odds of concurrent structural OA, inflammatory phenotype was most associated with concurrent symptomatic OA. Increased odds of effusion-synovitis observed two years prior to incident radiographic OA has been documented; moreover, weight and sex of the subjects can further augment this odds ratio 33 . Our logistic regression model similarly found BMI as predictive of concurrent OA, though we did not find a sexdependent relationship. The majority of subjects without OA at baseline did not have a phenotype. Despite low prevalence, bone phenotype significantly increased odds of incident structural OA at 48 months. It is difficult to put this finding into perspective as odds ratios could not be computed for any other phenotype given their limited sample sizes. Nonetheless, changes in subchondral bone have been reported as biomarkers of incident OA 34 . Specifically, morphological maps of bone shape analyzed by artificial intelligence were found to be predictive of incident OA. Damage to subchondral bone has been hypothesized to be a precursor to subsequent cartilage deterioration and a mediator of early resorptive phases in OA [35][36][37] . Notably, we did not find an association between bone phenotype and incident symptomatic OA at 48 months, but rather there was a relationship with hypertrophy phenotype. Although inflammatory phenotype possessed highest odds of symptomatic OA in cross-sectional analysis, the sample size was too limited to discern conclusions regarding longitudinal effects on symptomatic OA.
In longitudinal analysis of incident TKR in 96 months, KL grade portended highest odds when added to the regression. With this adjustment, all phenotypes except meniscus/cartilage demonstrated increased odds of TKR, suggesting incorporation of phenotypes can further stratify risk among subjects with similar KL scores. Of the phenotypes, inflammatory increased odds of incident TKR more than bone or hypertrophy. A prior study investigating predictive factor of MRI for incident TKR demonstrated tibiofemoral joint cartilage and bone, as well as medial and lateral menisci, were significant structures for accurate predictions by neural networks 38 . The study did not evaluate potential contribution from lesioned synovium or effusions, which should be explored in future works. Longitudinal changes in physical activity and pain have been reported to be unaffected by baseline cartilage damage 39 , which may corroborate our findings that meniscus/cartilage phenotype did not independently increase odds of incident TKR.
Despite relatively satisfactory performance metrics from the CNNs, methods using deep learning are limited. Artificial intelligence may serve as a valuable aid for clinicians and researchers with high workload or limited expertise, but detailed evaluation of relevant pathology by radiologists is inevitably necessary for accurate staging and diagnosis. Other limitations include use of MRI instead of arthroscopy as reference. The grades used for model training are dependent on subjective assessment by a radiologist, and our model can only perform as good as the MRI standard used in training. Moreover, OA is multifactorial, and future model building should include genetic, biochemical, and post-traumatic data. We also did not exclude posterior medial meniscus root tears, osteonecrosis, or malignancies which are typically exclusion criteria in DMOAD trials. In future work, we aim to develop CNNs to automatically detect these pathologies from large study cohorts. Inferring on samples from other studies is particularly important to demonstrate external validity of the CNNs, given our study results were not validated on an external cohort such as the Multicenter Osteoarthritis Study 40 . Another aim is to build a single multi-label classifier to compare with the current approach of a separate classifier for each phenotype. Multi-label models offer generalizability, interpretability, and less overfitting; however, they are limited by the label with the lowest sample size, which in our case was hypertrophy phenotype.
In conclusion, our study underscores the prognostic value of morphological phenotypes for characterizing progression of knee OA. These findings hold implications for improving understanding of OA pathogenesis, which may guide inclusion criteria of DMOAD trials towards MRI-based structural phenotypes. This may improve effectiveness of DMOADs by using individual knee phenotypes to offer patient-specific treatment. Future research can survey individual DMOAD trials to analyze whether specific subgroups of structural phenotypes received increased therapeutic benefits. www.nature.com/scientificreports/ Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.