## Introduction

Osteoarthritis (OA) is the result, and the observable status, of inflammatory processes in a joint leading to functional and anatomical impairments. The resulting status often shows irreversible damage to the joint cartilage and the surrounding bone structures1,2. The knees are the most commonly affected joints in the human body, and knee osteoarthritis (KOA) is more prevalent in females aged 60 years or more than in males of the same age (13% vs 10%)3. Severity of KOA among females older than 55 years is higher than among their male counterparts, and the severity of KOA is higher than that of other types of OA4,5. Approximately one in every six patients consults a general practitioner in the first year of an OA episode4,5. The incidence of KOA has a positive association with age and weight, and the prevalence is higher in younger age groups, particularly among those with obesity6.

Swelling, joint pain, and stiffness are the prominent symptoms, alongside restrictions in movement including walking, stair climbing, and bending7. The symptoms worsen over time and elderly patients are affected more frequently than patients in other age groups. The presence of OA in the knee reduces activity in daily life and eventually leads to disability, which can incur high costs related to loss in productivity8. It is estimated that functional impairment of the knee and the hip is the eleventh highest disability factor9, contributing to a considerable socio-economic burden with an estimated cost of approximately 19,000 Euro per patient per year10. By the year 2020, the estimated prevalence of disability due to arthritis is expected to reach 11.6 million individuals11, which is greater than the estimated risk of disability attributable to cardiovascular diseases or any other medical condition12. Total joint replacement surgery is the most favorable option to treat advanced stage OA. However, diagnosing the status of KOA at an early stage and providing behavioral interventions could be beneficial for prolonging a healthy life for a patient13.

In a review of possible risk factors of KOA, Heidari7 concluded that age, obesity, gender (i.e., female), repetitive knee trauma and kneeling are the most common risk factors for KOA. The common symptoms include pain, functional impairment, swelling and stiffness. The severity of KOA and the pain status is measured based on the Kellgren and Lawrence (KL) scale of 0 to 4 by visual inspection of the knee X-ray images14.

Considering the impact of KOA on disability and the resulting economic burden, there is a need to quantify the severity of KOA during the early stages of development. Knowing the KOA severity level helps in determining appropriate treatment and in monitoring disease progression15. The classical way of quantifying KOA severity is inspection of X-ray images of the knee by a radiologist, who then grades the images according to the KL scale (from 0 for “normal” up to 4 for “severe” stage)14. This approach suffers from high levels of subjectivity as there is no gold standard grading system: the semi-quantitative nature of the KL grading scale creates ambiguity, giving rise to disagreements between raters (for details, see refs. 14,16,17).

To reduce the influence of subjectivity in quantifying KOA severity from X-ray images, computer-aided diagnosis has been very helpful18. To date, the number of available images has been the main limiting factor in training an efficient model19,20,21,22. The Osteoarthritis Initiative (OAI)23 and the Multi-centre Osteoarthritis Study (MOST)24 mitigated this limitation by making data and X-ray images from thousands of patients available. Recently, several researchers have used these resources to develop automatic approaches for quantifying KOA severity by analyzing X-ray images21,25,26,27,28. Although there have been multiple attempts to quantify KOA severity based on an automated analysis of X-ray images, so far there has been no attempt to build a predictive model on a patient’s assessment data such as signs, symptoms, medication and other patient characteristics (hereafter referred to as patient’s questionnaire data) and to compare this approach against the X-ray based prediction. Developing predictive models using patient data other than X-ray images offers additional advantages, such as identifying the variables that contribute most strongly towards predicting the severity of KOA. A good predictive model based on patient’s questionnaire data could reduce treatment costs and could also contribute to a prolonged healthy life of a patient due to early behavioral intervention.

The Osteoarthritis Initiative (OAI) is a multi-center longitudinal study of men and women, sponsored by the National Institutes of Health (NIH), to better understand KOA. Data collected through the OAI can provide useful information about the marginal distributions of relevant patient characteristics, their demographics, signs and symptoms, and medication history. To date, more than 200 scientific publications have used data collected through the OAI, including several attempts to automate the KL grade quantification using X-ray images. However, no study has yet tried to predict KOA severity based on patient questionnaire data. Our primary goal is to compare the accuracy of a statistical model based on patient questionnaire data with the accuracy of X-ray image based modeling in predicting the KOA severity score. In this paper we present several statistical approaches to predict the severity of KOA using patient questionnaire data. Furthermore, we use a convolutional neural network (CNN) model to predict the same outcome using the corresponding X-ray images for the same patients. The performance of both approaches is compared using the root mean squared error (RMSE) calculated on a validation set. As a secondary goal, we identified key variables with the strongest predictive ability, which may be useful for monitoring patients over time and designing early interventions to prolong healthy life in patients of concern.

## Results

### Exploratory analysis

The OAI dataset contains data for 4,796 individuals. After initial pre-processing, 2,951 patients with sufficient data on potential candidate explanatory variables were selected, representing 62% of the original patients. The remaining 38% of individuals did not have enough data for the potential explanatory variables and were not included in the analysis. The list of candidate variables, their labels and type (binary, numeric and categorical) has been provided in a Supplementary Table (Supp. Table 1).

To train and validate the predictive models we used a training and validation data split as shown in Table 1 (roughly a 70–30% split). To make valid comparisons, we used the same validation set for the models developed on patient’s questionnaire data and the model developed using X-ray images26,27. The validation set contained data for both knees of 846 patients, i.e. 1,692 knee-level data points in total. The training set for the predictive models included data from 2,105 patients, i.e. 4,210 knees.

The validation data, which contain roughly 30% of the selected patients, cover the same patients that were used for validation in the X-ray image based modeling. Using the same validation patients makes our results directly comparable with the X-ray based prediction. In addition, we performed cross validation to check the sensitivity of the entire analysis; the cross validation results are consistent with those from the original 70–30 split.
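Because each patient contributes two knees, a split of this kind has to be made at the patient level so that both knees of a patient land in the same subset. The following is a minimal sketch of such a split on synthetic patient IDs; the function name `split_by_patient`, the seed and the 70% fraction are illustrative assumptions and not the authors' code.

```python
import numpy as np

def split_by_patient(patient_ids, train_fraction=0.7, seed=0):
    """Return boolean masks (train, validation) over knee-level rows.

    All rows sharing a patient ID are assigned to the same subset, so the
    two knees of one patient never straddle the split.
    """
    patient_ids = np.asarray(patient_ids)
    unique_ids = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_ids)
    n_train = int(round(train_fraction * len(unique_ids)))
    train_ids = shuffled[:n_train]
    train_mask = np.isin(patient_ids, train_ids)
    return train_mask, ~train_mask

# Synthetic example: 10 hypothetical patients with two knee rows each.
ids = np.repeat(np.arange(10), 2)
train_mask, valid_mask = split_by_patient(ids)
print(train_mask.sum(), valid_mask.sum())
```

Splitting on knee rows instead of patients would let one knee of a patient inform predictions for the other, which tends to make validation error look better than it is.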

Relevant summary statistics for patient characteristics are given in Table 2 for the entire dataset and for the training and the validation subsets. Good balance is evident when comparing the mean and variability for each patient characteristic across the training and validation data and it is plausible that they can be considered as representative samples taken from the same overall population. Maintaining a similar distribution of patient characteristics between the training and validation data is paramount for making reliable inference. We dropped occupation from the patient characteristics table and from subsequent analysis as this variable has more than 30% missing data.

The box-plot (Fig. 1) displays the distribution of several patient characteristics. We added a small random displacement (jitter) along the horizontal axis and used alpha-blending to make the distribution of the points clearer. The level of KOA severity appears to be higher among elderly people. Height, weight and BMI show similar patterns, whereas the distribution of blood pressure measurements does not indicate any obvious pattern with severity score (Fig. 1).

The data contain a mixture of continuous, binary, and categorical predictors. To observe the relationships among these predictor variables we calculated Pearson correlations between continuous predictors, polyserial correlations between continuous and categorical predictors, and polychoric correlations between categorical predictors29. The correlation matrix (Fig. 2) shows the relationships among the predictor variables of interest, where a higher color intensity indicates a stronger correlation between variables. Blue indicates a positive correlation and red indicates a negative correlation. We can observe that the predictor variables are positively correlated with each other to a moderate degree. Patient sex, height and weight show weak negative correlations with the other variables; only sex and height show a strong negative correlation. The upper left block represents correlations among signs and symptoms in the left knee, whereas the lower right block represents correlations among signs and symptoms in the right knee. The lower left block shows the correlations between signs and symptoms of the left knee and those of the right knee. Besides these three blocks, there are some variables that relate to neither knee; rather, they represent medication history and other characteristics (Fig. 2). What is clear from Fig. 2 is the large number of candidate predictors and the presence of multicollinearity among them, which has to be accounted for in any subsequent model.

Among the five levels of KOA severity there were very few patients at severity level 4 (KL grade) in comparison to the other categories. Overall, 42% of patients were at severity level 0, indicating a normal knee, followed by mild severity level 2 with 24% and doubtful severity level 1 with 17%. The distribution of severity level frequencies across the training and validation data is well balanced, indicating that it is plausible that the training and validation data came from the same underlying population (Table 1).

### Model building, evaluation, and comparison

Initially an Elastic Net regression30, a weighted combination of LASSO and Ridge regression, was fitted. An Elastic Net regression model can be used to select variables with high predictive power. The weighting is controlled by the mixing parameter α that controls the amount of mixing between LASSO and Ridge penalties, whereas the parameter λ controls the amount of shrinking in the regression coefficients. To estimate a suitable value for the shrinkage parameter λ we performed repeated cross validation using a fixed α = 0.5, which corresponds to the minimum cross-validation RMSE. Using this value of α the value of λ that also minimizes the RMSE (Fig. 3) was selected. The contribution (i.e. direction and magnitude) of each predictor variable has been extracted from the corresponding estimated regression coefficients (Fig. 4).

A Random Forest31 regression model was then fitted using differing numbers of trees where the RMSE was calculated for each scenario. Based on these evaluations, we found that using 100 trees produced the lowest RMSE in the validation set. We also identified those predictors with highest variable importance in terms of improved predictive ability of the final forest.

The overall RMSE for the Elastic Net regression model is 0.97 and the RMSE for the Random Forest model is 0.94. Both models give higher accuracy for the prediction of severity levels 1 and 2 than for the other levels. The RMSE from the X-ray image based CNN model is 0.77, which is slightly lower than the RMSE from the Elastic Net regression and the Random Forest model. The advantage of using Elastic Net regression over a Random Forest regression model and the X-ray image based CNN model is that we can easily identify the variables that have high predictive power, and also the direction of the contribution of each variable, by looking at the magnitude and sign of their regression coefficients (Fig. 4).

The Elastic Net regression model produced higher prediction accuracy for severity levels 1 and 2 than for the other levels. A similar result is noted in the predictions of the Random Forest model. The overall RMSEs of the Elastic Net and Random Forest regression models are 0.974 and 0.943, respectively. The overall accuracy of the CNN model is higher than that of the Elastic Net and Random Forest regressions. All three models perform worst for KOA severity level 4, as less data is available in that category. Although the X-ray based CNN model predicts KOA severity with relatively higher accuracy, the difference between its RMSE and the RMSEs of the models based on patient’s questionnaire data is small. Table 3 shows the RMSE for the models trained with patient data (Elastic Net and Random Forest) and the model trained with X-ray images (CNN regression).
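The comparison above rests on the RMSE, both overall and within each severity level. As a sketch of how these two quantities can be computed from continuous predictions, the helper names `rmse` and `rmse_by_grade` and the toy numbers below are made up for illustration and are not study data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true grades and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_by_grade(y_true, y_pred):
    """RMSE computed separately within each observed KL grade."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred, dtype=float)
    return {int(g): rmse(y_true[y_true == g], y_pred[y_true == g])
            for g in np.unique(y_true)}

# Made-up grades and continuous predictions, for illustration only.
y_true = [0, 0, 1, 2, 2, 4]
y_pred = [1.0, 1.0, 1.0, 2.0, 3.0, 2.0]
print(rmse(y_true, y_pred))
print(rmse_by_grade(y_true, y_pred))
```

A per-grade breakdown of this kind shows where a model struggles, for example in a sparsely populated grade, which a single overall RMSE hides.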

Both the Elastic Net and Random Forest models allow us to assess the importance of individual predictors for overall predictive ability. Some variables are identified by both models as contributing strongly to the final predictions. However, the variables identified by the Elastic Net are easier to interpret than those selected by the Random Forest model. The sign of a regression coefficient in an Elastic Net regression gives the direction of the contribution: a positive sign indicates an increase in the severity score for increasing values of the predictor, whereas a negative sign indicates a reduction. The direction of the contribution of predictors selected by the Random Forest model is unclear, as its importance measure does not distinguish between the two directions. Figure 4 shows the sign and magnitude of the contribution for each of the selected variables. The identified variables could be proxy indicators of the anatomical structure of the patient’s knee, which ultimately indicates the level of severity.

Since the data have a hierarchical structure (i.e. knees nested within patients), the replicates at the individual level are more appropriately modeled using a mixed effects model with a random effect to capture the correlation between knees within an individual. To explore the random effect of patient level information, a linear mixed effect model was fitted using the predictor variables initially selected by the Elastic Net regression. The study design gives a clear reason to include random effects. In addition, the test for the need of the random effect term, which captures the within-subject correlation between knees, gave a small p-value (less than 0.001). The intra-class correlation coefficient is 0.265, which indicates that the proportion of the variance explained by the random effect component (patient level information) is 26.5%. The overall RMSE for the linear mixed effects model is 0.978, which is almost the same as the RMSE from the Elastic Net regression. However, the predicted severity levels, and more importantly the corresponding uncertainty, are correctly adjusted to account for the within patient correlation.

## Discussion

Judging the impairment for patients with KOA requires a thorough understanding of the disease condition. Expert radiologists or clinicians assess the functional knee impairments and the KOA severity level from the X-ray images. Ideally, the image analysis should give an objective measure of the impairments; however, in reality not all functional impairments show up in anatomical transformations of the knee, and the patho-physiological evaluation relies on the subjective perception of the patient and the physician jointly.

Our primary goal was to explore whether the prediction accuracy of a statistical model based on patient’s questionnaire data is comparable to the prediction accuracy of X-ray image based modeling of KOA severity. We have demonstrated that statistical models using patients’ questionnaire data could predict the KOA severity level with a good level of accuracy (RMSE: 0.974 and 0.943 for the Elastic Net and Random Forest models, respectively). The prediction performance of the statistical models presented in this paper is comparable to that of models using X-ray image data, as assessed by RMSE measures26,27,28. In particular we have demonstrated that functional impairment at severity levels 1 and 2 can be predicted by our statistical models (Elastic Net, Random Forest and LMM) trained on the patients’ assessment data to a level of accuracy similar to that achieved by the CNN model trained on X-ray images. There are very subtle structural variations in knee joints belonging to grade 0 and grade 1 (minimal joint space narrowing (JSN) and osteophyte formation), and these are not fully reflected in the KL grades. Also, there are relatively large overlaps in the JSN measurements for KL grades 0 and 1 compared to the other grades32. These factors make the two grades challenging to distinguish by inspecting X-ray images. In addition, patients at these grades have very similar distributions of characteristics, signs, symptoms and functional impairments. Because the predictors differ only subtly between these KOA levels, the prediction accuracy is affected.

We were able to identify the key variables that contributed most to the predictive ability of our models. These variables can be monitored over time to assess the progression of KOA severity. The strongest indicator variables report on baseline radiographic OA status for the right or left knee (P01LXRKOA, P01RXRKOA) and on treatments such as surgery on the right or left knee (P01KSURGR, P01KSURGL), as well as other reasons to see the doctor (P01ARTDOC). Patient’s sex also plays an important role in predicting KOA severity. The next indicator variables cover medication (P02KPMED) and functional impairments, pain or other symptoms in the right or left knee (V00WPRKN1, V00KSXLKN3, V00KSXRKN5, V00KSXRKN1, V00KSXLKN1, V00WPLKN1, P01KPNREV, P01KPACT30). A final parameter notes whether a doctor “ever said you have rheumatoid arthritis or other inflammatory arthritis” (P01RAIA). The predictor variables that we found to be important for predicting KOA severity were also reported as important risk factors in previous studies7,33.

Importantly, an early behavioral intervention could be developed based on the identified variables to prolong the healthy life of a patient. By observing the identified variables that have higher predictive ability to predict KOA severity, we can identify the subjects who are currently taking medication for pain relief and facing functional difficulty in their daily life. Variables representing limited knee functions in particular are the potential indicators for quantifying KOA severity that could lead to developing targeted interventions for further treatment and medications.

When making predictions the LMM is favored as it is the only approach that correctly adjusts for the hierarchical structure present in the data. It is interesting that the severity levels 1 and 2 can be predicted with good accuracy in all the four models (EN, RF, LMM, and CNN), while the other levels of severity are more challenging to predict. For higher severity levels, i.e. levels 3 and 4, this could be due to the lack of patient data, i.e. the sample sizes at these levels are smaller than for levels 1 and 2 (Fig. 1 and Table 2).

In conclusion, based on the results in this paper, patients’ questionnaire data can predict the KOA severity level with good accuracy, comparable with the prediction based on X-ray images. Patient’s assessment data also enable us to identify some of the key variables that can be used to design early interventions and to monitor patients over the treatment period. The accuracy of the model developed using patient’s assessment data is close to that of the CNN model. Moreover, the statistical models have an edge over the CNN model in that they identify key variables, which help physicians to design interventions and help patients in further treatment.

There is at least one potential limitation in developing statistical models to predict KOA severity: the KL grade score itself is not a gold standard and suffers from subjectivity. The KL grade depends on the perception of the radiologist who inspects the X-ray images. In the model building process, we are therefore effectively using a quasi-gold standard outcome. Considering this limitation, one way to improve the prediction accuracy could be to build a model on the X-ray image data in combination with the patients’ assessment data. Since the prediction of KOA severity based on patient data shows comparable accuracy, it would be interesting to see the performance of a statistical model that combines patient’s questionnaire data with X-ray images.

## Methods

### Data

The data used in this study were obtained from the Osteoarthritis Initiative (OAI), which is available for public access at http://www.oai.ucsf.edu/. The specific dataset used is labeled 0.2.2. This is the data from the multi-center longitudinal and prospective observational study of KOA. We have used the baseline dataset for this work. The description of each variable used in our analysis has been given in the Supplementary Table 1.

### Data pre-processing and descriptive statistics

The baseline dataset contains a large number of variables related to patients’ characteristics, their vital signs, symptoms of KOA, medication history, and functional impairment. At the early stage of the analysis, we manually inspected each of the variables and selected a subset of candidate variables that were clinically relevant and had previously been reported as risk factors for KOA7,33. We inspected the completeness of the data by calculating the percentage of missing values for each variable. The variables that had at least 85% non-missing values were kept for further analysis. If it can be assumed that missing data are missing at random (i.e. missingness is explained by the available covariates), a multiple imputation step is unnecessary when a linear mixed model is used, as the likelihood is correctly specified under this assumption. Moreover, we excluded categorical variables with very low discriminatory power, for example variables in which one category has a very low frequency compared to the others. The reduced set of candidate variables was used for further processing and analysis. The dataset was split into two parts, training and validation sets, by taking a random sample of 70% of the data for training and the remaining 30% for validation. To make valid comparisons, we used the same validation set for the models developed on patient’s questionnaire data and the model developed using X-ray images26,27. The data pre-processing steps are summarized in Fig. 5.
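The completeness rule described above (keep a variable only if at least 85% of its values are present) can be sketched with pandas as follows. The helper name `keep_complete_columns` and the toy column names are hypothetical and do not correspond to OAI variable names.

```python
import numpy as np
import pandas as pd

def keep_complete_columns(df, min_non_missing=0.85):
    """Keep columns whose share of non-missing values meets the threshold."""
    non_missing_share = df.notna().mean()
    return df.loc[:, non_missing_share >= min_non_missing]

# Toy table with hypothetical column names and made-up values.
toy = pd.DataFrame({
    "age": [61, 70, 58, 66, 73, 64, 59, 68, 71, 62],
    "bmi": [27.1, np.nan, 30.4, 25.0, 28.8, 31.2, 26.5, 29.9, 24.7, 27.3],
    "occupation": ["a", None, None, "b", None, "c", None, "a", "b", "c"],
})
filtered = keep_complete_columns(toy)
print(list(filtered.columns))
```

In this toy table `bmi` is 90% complete and is kept, while `occupation` is only 60% complete and is dropped.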

To summarize and explore the explanatory variables, we calculated descriptive statistics: mean and standard deviation for numeric variables, frequency and percentage for categorical variables. The relationship among predictors was also explored by calculating Pearson correlation between numeric variables, polyserial correlations between numeric and categorical variables and polychoric correlation between categorical variables29.

The KL grade score was recorded on an ordinal scale from 0: normal to 4: severe. To model an ordinal outcome, ordinal logistic regression34,35 is the typical approach used. We fitted ordinal logistic regression models, but the prediction performance was poor. In this paper we have treated the severity score as a continuous response to investigate if this would improve predictive ability. Moreover, the data are hierarchical in structure; for each patient we have data for both knees. To capture this structure appropriately we have used a linear mixed effect model incorporating a random effect at the subject level36.

### Elastic Net regression

Elastic Net regression is a combination of ridge regression and LASSO, and this model is appropriate in the presence of correlated predictors30. We denote the outcome variable: KL grade score by Y (considered as a continuous variable) and all predictors by X1, X2, …, Xp. The Elastic Net regression linearly combines L1 and L2 penalties as follows:

$$SSE=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}+\lambda\left[(1-\alpha)\sum_{j=1}^{p}\theta_{j}^{2}+\alpha\sum_{j=1}^{p}|\theta_{j}|\right]\tag{1}$$

The L1 penalty is defined as the sum of the absolute values of the regression coefficients, $$L_{1}=\sum_{j=1}^{p}|\theta_{j}|$$, and the L2 penalty is defined as the sum of the squared values of the regression coefficients, $$L_{2}=\sum_{j=1}^{p}\theta_{j}^{2}$$. The amount of mixing between the two penalty terms is controlled by a mixing parameter α. A value of α = 0 leads to ridge regression, whereas a value of α = 1 leads to LASSO regression. The hyper-parameter λ controls the amount of shrinkage of the regression coefficients for a given value of α. A higher value of λ shrinks the regression coefficients towards zero, and a very small value of λ has little effect on the regression coefficients. Using both L1 and L2 penalties enables us to select variables that have higher predictive power by shrinking some of the regression coefficients to zero, given an appropriate value of the hyper-parameter λ. To estimate the most suitable value for the shrinkage parameter λ we performed repeated cross validation with a fixed value of the mixing parameter, α = 0.5, and chose the value of λ that minimizes the root mean squared error (RMSE). Figure 3 shows the cross-validation results for the selection of λ.
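The tuning procedure described here (mixing parameter fixed at 0.5, shrinkage chosen by repeated cross validation) can be sketched with scikit-learn on synthetic data. This is an illustration under stated assumptions, not the authors' implementation: the data are simulated, and scikit-learn scales its objective differently from Eq. 1, so its `alpha` only plays the role of λ and its `l1_ratio` the role of α, without the numerical values being interchangeable.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import RepeatedKFold

# Synthetic data: only the first two of ten predictors carry signal.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Repeated 5-fold cross validation; the mixing parameter is fixed at 0.5
# and the penalty strength is chosen along an automatically generated path.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
model = ElasticNetCV(l1_ratio=0.5, cv=cv)
model.fit(X, y)

print("selected penalty strength:", model.alpha_)
print("coefficients:", model.coef_.round(2))
```

`ElasticNetCV` selects the penalty by mean squared error, which picks the same value as minimizing the RMSE because the square root is monotone.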

### Random forest

Random Forests (RF) are an ensemble method that combines the predictive ability of multiple tree based models. The RF model is an extension of the original work of Tin Kam Ho37, who developed the algorithm for random decision forests. Leo Breiman31 added the ideas of bagging (bootstrap aggregating) and random variable selection. The principle of a random forest is to combine multiple tree based models into a single model that can achieve better accuracy than its individual counterparts. The method takes a random sample with replacement from the original data, then builds a decision tree based on a random selection of variables at each split in the tree. This process is repeated for multiple trees, and the prediction from each tree is stored. The predicted value is then the mode (for a categorical response) or the mean (for a continuous response) across the forest. The random forest model is popular because it can reduce the variance of single tree models and also mitigates the problem of correlated predictors, as only a random subset of candidate predictor variables is considered at each split.
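As an illustration of the procedure described above, the sketch below fits random forest regressors with different numbers of trees on simulated data and compares their validation RMSE. It uses scikit-learn's `RandomForestRegressor`; the data, the tree counts and the split are assumptions chosen for the example and do not reproduce the study's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated data: the first predictor dominates the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=300)
X_train, X_valid = X[:210], X[210:]
y_train, y_valid = y[:210], y[210:]

# Validation RMSE for several forest sizes.
results = {}
for n_trees in (10, 50, 100):
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    pred = forest.predict(X_valid)
    results[n_trees] = float(np.sqrt(np.mean((y_valid - pred) ** 2)))
print(results)

# For a continuous response the forest prediction is the mean over trees.
per_tree = np.array([tree.predict(X_valid) for tree in forest.estimators_])
print(np.allclose(per_tree.mean(axis=0), forest.predict(X_valid)))

# Impurity-based importances are non-negative and carry no direction.
print(forest.feature_importances_.round(2))
```

The importances sum to one and are all non-negative, which is why they rank predictors but cannot say whether a predictor raises or lowers the predicted score.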

### Linear mixed effect model

There is a clear hierarchical structure in the dataset as we have patient level data along with knee level data. A linear mixed effect model (LMM)36 is an extension of a linear model that accounts for the hierarchical structure in data. The primary benefit of using an LMM in this paper is that the uncertainty in knee level prediction is correctly adjusted for through the introduction of a suitable random effect. This approach correctly accounts for the measurements on both knees collected for each subject. We report the intra-class correlation, which indicates how much of the variation in the dependent variable is due to the random effect component of the LMM. A random effect model can be formulated as:

$$y_{ij}=x_{ij}^{t}\,\beta+u_{ij}^{t}\,\gamma_{i}+\varepsilon_{ij};\quad i=1,2,\cdots,m;\quad j=1,2,\cdots,n_{i}\tag{2}$$

Here yij is the KL grade of the j-th knee of the i-th patient; xij is the fixed-effects covariate vector of the j-th member of cluster i; uij is the random-effects covariate vector of the j-th member of cluster i; γi is the random effect parameter of cluster i; β is the vector of regression coefficients of the fixed effect covariates; εij is the residual error; m is the number of clusters (patients); and ni is the number of members of cluster i (in our case ni = 2, representing the left and right knee).
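For the special case of a single random intercept per patient (an assumption made here for illustration, since the random-effects design is not spelled out in more detail), the intra-class correlation is the share of the total variance attributable to the patient level. Writing σγ² for the variance of the random intercept and σε² for the residual variance, the ICC of 0.265 reported in the Results corresponds to a between-patient variance of roughly 0.36 times the residual variance:

$$\mathrm{ICC}=\frac{\sigma_{\gamma}^{2}}{\sigma_{\gamma}^{2}+\sigma_{\varepsilon}^{2}}=0.265\quad\Longrightarrow\quad\frac{\sigma_{\gamma}^{2}}{\sigma_{\varepsilon}^{2}}=\frac{0.265}{1-0.265}\approx 0.36$$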

### Convolutional neural network

In the machine learning based approach to automatically assessing KOA severity, the first step is to localize the region of interest (ROI), that is, to detect and extract the knee joint regions from the X-ray images, and the next step is to classify the localized knee joints based on KL grades. In our previous study27, we introduced a fully convolutional neural network (FCN) to automatically detect and extract the knee joints, and trained CNNs from scratch to predict KOA severity on both discrete and continuous scales using classification and regression, respectively26,27. We used the baseline X-ray images from the OAI dataset to train the CNN model. After testing different configurations, the network in Table 4 was found to be the best for classifying knee images. The network contains five layers of learned weights: four convolutional layers and a fully connected layer. Each convolutional layer in the network is followed by batch normalization and a ReLU activation layer. After each convolutional stage there is a max pooling layer. The final pooling layer (maxPool4) is followed by a fully connected layer (fc5) with an output shape of 1024 and a softmax dense layer (fc6) with an output shape of 5, representing the five levels of KOA severity. To avoid overfitting, a dropout layer with a dropout ratio of 0.25 is included after the last convolutional layer (conv4) and a dropout layer with a dropout ratio of 0.5 after the fully connected layer (fc5). Also, an L2-norm weight regularization penalty of 0.01 is applied in the last two convolutional layers (conv3 and conv4) and the fully connected layer (fc5). Applying a regularization penalty to other layers increases the training time whilst not introducing significant variation in the learning curves. The network is trained to minimize categorical cross-entropy loss using the Adam optimizer with default parameters: initial learning rate (α) = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e−8. The inputs to the network are knee images of size 200 × 300. This size is selected to approximately preserve the aspect ratio based on the mean aspect ratio (1.6) of all the extracted knee joints.
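To make the role of the optimizer settings concrete, the sketch below implements a single Adam update in NumPy with the default parameters quoted above and applies it to a toy scalar problem. It illustrates the update rule only; the function name `adam_step` and the quadratic objective are assumptions for the example and are unrelated to the CNN training code.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias correction, step t >= 1
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 10001):
    w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, t)
print(w)
```

Because of the bias correction, the very first step has a magnitude close to the learning rate regardless of the gradient scale, which is one reason the default of 0.001 works across differently scaled problems.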

After training, this network achieves an overall root mean squared error of 0.771 on the test data. Figure 6 shows the learning curves obtained whilst training this network. The learning curves show proper convergence of the training and validation losses, with a consistent increase in the training and validation accuracy until they reach constant values.