Development of a model predicting non-satisfaction 1 year after primary total knee replacement in the UK and transportation to Switzerland

We aimed to develop a predictive model for non-satisfaction following primary total knee replacement (TKR) and to assess its transportability to another health care system. Data for model development were obtained from two UK tertiary hospitals. Model transportation data were collected from Geneva University Hospitals in Switzerland. Participants were individuals undergoing primary TKR; the outcome of interest was non-satisfaction with surgery after one year. Multiple imputation and logistic regression modelling with bootstrap backward selection were used to identify predictors of outcome. Model performance was assessed by discrimination and calibration. 64 (14.2%) patients in the UK and 157 (19.9%) in Geneva were non-satisfied with their TKR. Predictors in the UK cohort were worse pre-operative pain and function, current smoking, treatment for anxiety and not having been treated with injected corticosteroids (corrected AUC = 0.65). Transportation to the Geneva cohort showed an AUC of 0.55. Importantly, two UK predictors (treated for anxiety, injected corticosteroids) were not predictive in Geneva. A better model fit was obtained when coefficients were re-estimated in the Geneva sample (AUC = 0.64). The model did not perform well when transported to a different country, but improved when it was re-estimated. This emphasises the need to re-validate the model for each setting/country.

demonstrate usefulness of a model for real clinical practice, to assess whether a single tool can be used worldwide or whether country specific models are required, as demonstrated by the fracture risk assessment tool (FRAX) for predicting osteoporotic fracture 8 . The aim of this study was to develop, validate and assess the transportability of a predictive model for non-satisfaction after primary TKR based on pre-operative factors and surgeon experience.
The percentage of missing values in explanatory variables was <10%, except for educational level (16.2%) and surgeon experience (13.1%). We had complete information for sex, age and body mass index (BMI).
Validation dataset. Model transportation was carried out on 791 (49.3% out of 1654) patients, 157 of whom (19.9%) were non-satisfied with their operation. Mean age and Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) were 72 years (sd: 9 years) and 21 points (sd: 8 points), respectively.
Only educational level, WOMAC score and smoking status had missing values for the validation dataset, with 25.0%, 18.8% and 1.3% missing, respectively. Table 1 shows differences between the UK and Geneva settings. There was a higher proportion of non-satisfied among Geneva patients than UK patients. The Geneva sample had a higher percentage of women, slightly older individuals and more smokers. The UK sample had more obese patients (although this difference was not statistically significant), with more co-morbidities and more treated for depression and knee pain. Educational level, musculoskeletal condition, and proportion treated for anxiety did not differ between the samples. Table 2 presents non-satisfaction events according to candidate predictor category in the UK and the Geneva samples.
Model production and validation. In the UK, being treated for anxiety, being a current smoker, not having been treated with injected corticosteroids, and worse pain and function prior to surgery were related to non-satisfaction. The logistic regression coefficients with their 95% confidence intervals (95% CI) of the variables selected are summarised in the following equation: non-satisfaction probability = 1/(1 + exp(−(−0.19 × man + 0.02 × age at operation + 0.997 × prior treatment for anxiety + 0.93 × current smoker − 1.04 × injection of corticosteroids − 0.37 × standardised OKS − 3.29))) (Table 3). The model showed moderate discriminatory ability for ascertaining true non-satisfied cases against false non-satisfied cases (Area Under the receiver operating characteristic Curve, AUC = 0.69). Bootstrap validation reduced this to a bias-corrected AUC = 0.65 (Fig. 1).
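As a worked illustration of the equation above, the following minimal Python sketch evaluates the predicted probability for one patient. The coefficients are copied from the equation; the function name, the dictionary keys and the example patient are illustrative assumptions and are not part of the study. The Oxford Knee Score (OKS) is assumed to be supplied already standardised.

```python
import math

# Coefficients copied from the published equation; the key names are illustrative.
COEFFICIENTS = {
    "man": -0.19,
    "age_at_operation": 0.02,
    "treated_for_anxiety": 0.997,
    "current_smoker": 0.93,
    "corticosteroid_injection": -1.04,
    "standardised_oks": -0.37,
}
INTERCEPT = -3.29


def non_satisfaction_probability(patient):
    """Return the predicted probability of non-satisfaction for one patient.

    `patient` maps each predictor name to its value: 0/1 for the binary
    predictors, years for age, and an already standardised OKS.
    """
    linear_predictor = INTERCEPT + sum(
        coef * patient[name] for name, coef in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical patient: a 70-year-old woman with an average pre-operative OKS.
example_patient = {
    "man": 0,
    "age_at_operation": 70,
    "treated_for_anxiety": 0,
    "current_smoker": 0,
    "corticosteroid_injection": 0,
    "standardised_oks": 0.0,
}
print(non_satisfaction_probability(example_patient))
```

Setting `current_smoker` to 1 for the same hypothetical patient raises the predicted probability, in line with the positive coefficient.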

Discussion
This is the first time that both the internal validity of a predictive model for non-satisfaction with the outcome of primary TKR and its ability to be transported to a different country have been assessed. We have demonstrated good internal validation within the UK, but poor transportation to Switzerland when using the same model specification. However, re-running the model using the Swiss data to obtain centre-specific regression coefficients substantially improved the transportability. Its moderate performance might help to forecast non-satisfaction 1 year after TKR in the UK, but it would require re-estimation in other countries before attempting international use 5 . The model has only a moderate predictive capacity to identify non-satisfied patients; it therefore has limited usefulness to support clinicians and patients in their decisions to undergo a TKR. Further work is still necessary to identify additional risk factors for non-satisfaction to improve the discriminatory ability of the model. There are many reasons why an internally validated model may fail to transport to a different country, including a different patient case-mix, a different healthcare system (referral patterns, waiting times, and follow-up regimes), surgical training, and the techniques or implants used.
We found more non-satisfied patients in Geneva, which might be related to the higher proportion of younger patients and of women undergoing TKR in Geneva. In fact, lower mean levels on a visual analogue scale for satisfaction were reported for TKR and unicompartmental knee replacement patients under 55 years of age in a UK hospital 9 . Moreover, women under 60 years of age following TKR were less likely to be satisfied than men in a national (USA), multi-centre study 10 . Women also presented higher non-satisfaction than men in a national survey in Sweden 2 . Smoking could be another potential explanation for the lower satisfaction in Geneva as it was a predictor of outcome and differed substantially in prevalence between the UK and Geneva cohorts. The fact that current smokers were less satisfied might be related to stronger residual symptoms 1 year after surgery, higher complication rates, and/or differences in the health care received.
Transportation of the developed model is limited by important differences in BMI and associated co-morbidities. The UK has the highest obesity prevalence in Western Europe, and this is reflected in the comparison of the UK and Geneva samples. Obesity is associated with other diseases; thus UK and Geneva patients are not equivalent in terms of co-morbidities. Interestingly, there were more non-satisfied patients in the group with no co-morbidities in the Geneva sample, possibly pointing to higher expectations about TKR in younger people, who can be expected to have fewer co-morbidities.
In the UK sample an injection of a corticosteroid in the months prior to surgery and anxiety treatment were significant predictors while in Geneva they were not. Waiting times for elective TKR are usually one year in the UK as compared to approximately two months in Geneva. Corticosteroid treatment is employed to reduce pain. The shorter waiting time in Geneva may have made the use of this treatment option less frequent.
Other factors to consider in model transportation are unmeasured differences in health care access and socio-economic status. However, in the context of the present study, patients in both the UK and Switzerland have universal access to care. Moreover, the proportion of patients with high education was similar. Differences between settings could also be due to post-surgical complications, the number of patients sharing rooms or a negative experience with the staff (e.g. a feeling of being treated disrespectfully). Because in this study we restricted the predictor choice to variables known prior to surgery, these factors do not explain poor model transportation here.
Potential differences in non-satisfaction between ethnicities could not be addressed in the model because the vast majority of the patients were white in both the UK and Switzerland. Therefore, this model is not generalizable to non-white people in countries where race is a proxy for socioeconomic status and access to health care is not universal 11 . Additionally, the influence of ethnicity on satisfaction is not clear. For example, of the only two studies enquiring about satisfaction in the USA, only one found that African-Americans were less likely to be satisfied 12,13 .
It was not possible to compare physical activity levels between UK and Geneva patients. Nonetheless, in the Geneva data physical activity levels did not significantly differ between non-satisfied and satisfied patients, either before the onset of osteoarthritis (OA) symptoms (6.9, sd: 2.2 vs. 6.6, sd: 2.3; P = 0.4) or prior to TKR (3.6, sd: 1.6 vs. 3.6, sd: 1.5; P = 0.7). Physical activity was measured using the University of California, Los Angeles (UCLA) Physical Activity Scale, which evaluates level of activity between 1 and 10 (minimum and maximum). Therefore, we would not expect physical activity to be a predictor of non-satisfaction.
Re-estimating the coefficients in the Geneva cohort using the same predictors improved the performance of the model to a level similar to that obtained in the Clinical Outcomes in Arthroplasty study (COASt). This is consistent with the experience of producing the FRAX tool for predicting fractures in osteoporotic populations, where country-specific coefficients were estimated using similar techniques 8 .
Several methodological issues need to be considered. Firstly, the degree of preoperative symptoms (pain and functional disability) was selected as an important predictor of non-satisfaction during the internal validation process. However, different instruments to measure pain and function had been used in the development (OKS) and the validation (WOMAC score) datasets. To address this issue we standardised both scores and observed similar proportions of low scores in the validation dataset. Worse preoperative pain and function scores were related to non-satisfaction. High expectations of recovering full function may lie behind this result 4,14 .
Second, using bootstrapping to avoid over-fitting gave a more accurate but lower estimate of predictive performance. However, transportation to another setting and population further diminished the predictive performance of the model. Transportation illustrates the difficulty in predicting outcomes in other settings 5,15 . This is because internal validation protects only against over-fitting caused by sampling variation, and not against fundamental differences between populations. A possible solution would be to develop predictive models in datasets from multiple settings from the beginning. Then the coefficients would be identical in all settings. Even then the discrimination may vary between settings, e.g. if race was a useful predictor globally this would not help in a racially homogeneous setting such as ours.
Third, there were fewer non-satisfaction events in the development dataset than the minimum of 100 suggested for developing prediction models using logistic regression 5,16 .
Finally, post-operative factors were not included, although it has previously been suggested that they would further improve the prediction of non-satisfaction one year after TKR 17 . This is because including post-operative factors as confounders would reduce the chance of finding an association between the hospital and the outcome, since the patient's post-operative status is potentially attributable to the intervention and to hospital care. In addition, we envisage the model being used in both primary care and pre-operative clinics to assess a patient's risk of a poor outcome, defined by non-satisfaction, at his/her pre-surgery visit. As such, post-operative parameters would not be available to the clinician or the patient when using the model to help inform the decision-making strategy.
We produced and internally validated a model to predict non-satisfaction with outcome after TKR in a UK population. This model did not perform well when transported to a different country, but improved when the model coefficients were re-estimated in the new population. This demonstrated the issues with transporting an internally validated model to a different country, and emphasises the need to re-validate the model for each setting/country.

Material and Methods
Source of data and participants. Development dataset. The COASt study is a prospective, dual-centre
Validation dataset. The Geneva Arthroplasty Registry (GAR) collects information on socio-demographic variables, comorbidities, medication, PROMs (e.g. WOMAC), radiographs and blood samples (subset) in addition to implant- and surgery-related variables. A prospective longitudinal cohort of TKR patients has been recruited since 1998 at the Division of Orthopaedics and Trauma Surgery of the Geneva University Hospitals. The institution is the only public tertiary hospital in the area, serving a population of 500,000 inhabitants 18 . This analysis included TKRs performed between January 2010 and February 2015. Data from both datasets are available for access to recognised academics. There is a standard application form which must be submitted to a data access committee.
Inclusion/exclusion criteria. We included patients with OA and rheumatoid arthritis (RA) aged over 18 years and those competent and willing to consent who underwent primary TKR. We excluded from the study those patients with a history of diseases that could mask the outcome analysed, i.e. multiple sclerosis, leg neuropathy, sciatica, stroke or mini stroke, cerebellar ataxia, knee septic arthritis, knee pseudo-gout, avascular necrosis, polymyalgia, systemic lupus erythematosus, fibromyalgia, Alzheimer's disease, and poliomyelitis.
Validation dataset. GAR contributed 1654 patients undergoing knee replacement. The specific operations carried out were primary TKR (1397 patients), TKR revision (115 patients) and unicompartmental knee replacement (UKR; 28 patients). 114 (7.1%) patients were excluded because they had a disease meeting the exclusion criteria. Therefore 808 (50.4%) patients who completed and returned the one-year follow-up form were included. In turn, we excluded 16 (1.0%) patients with no response for satisfaction 1 year after TKR.
Sample size. The development and validation datasets were convenience samples where we included all patients who answered the satisfaction question.
Outcome: Non-satisfaction. All the patients included in the analysis rated their "overall satisfaction with the outcome of your operation" one year after the surgery. We generated a binary variable grouping satisfied answers (very/somewhat satisfied) versus non-satisfied answers (neither satisfied nor dissatisfied, somewhat/very dissatisfied).
Predictors. Twelve preoperative variables common to COASt and GAR were chosen among those considered relevant by eight surgeons and researchers. Predictors were sex (woman vs. man); age at operation; educational level (higher vs. lower education, i.e. less than university degree); BMI (<35 vs. ≥35 kg/m², World Health Organisation (WHO) obesity class II/III); musculoskeletal condition (OA vs. RA); number of comorbidities (liver, bowel, renal, and lung problems, as well as urine infections, diabetes, heart murmur or rheumatic fever, angina or chest pain, heart attack, history of heart failure, pacemaker fitted, history of hypertension, blood clot, unusual bruising or bleeding, and high cholesterol); treated for anxiety; treated for depression; current smoker; intra-articular corticosteroid injection (last 12 months for COASt, injection for OA any time prior to TKA for GAR); surgeon experience (≥8 vs. <8 training years); and standardised OKS for knee pain and function ((OKS − mean_OKS)/sd_OKS, where sd is the standard deviation). We used standardised WOMAC ((WOMAC − mean_WOMAC)/sd_WOMAC) instead of standardised OKS for the validation dataset because OKS was not available for GAR. Lower scores corresponded to the most severe symptoms and higher scores to the least severe symptoms on the standardised OKS and WOMAC scores. To allow the application of the UK score to Geneva patients, both the OKS and WOMAC scores were standardised to a mean of 0 and a standard deviation of 1.
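The standardisation described above is an ordinary z-score transformation. A minimal Python sketch follows; the function name and the example scores are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which form was used.

```python
import statistics


def standardise(scores):
    """Return z-scores, (score - mean) / sd, for a list of raw scores."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample sd (n - 1); an assumption
    return [(score - mean) / sd for score in scores]


# Hypothetical raw scores for three patients (not study data).
raw_scores = [12, 20, 28]
z_scores = standardise(raw_scores)
print(z_scores)
```

Because each instrument is standardised within its own cohort, a standardised value of 0 denotes the cohort average on either scale, which is what allows a coefficient estimated on the standardised OKS to be applied to a standardised WOMAC score.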
Statistical analysis. Differences between UK patients and Geneva patients were assessed using the χ² test for categorical variables and Student's t-test for continuous variables.

Development and internal validation dataset.
A. First, to develop a risk prediction model, we performed the following steps 19 : Step 1, imputation of missing values: Multiple imputation of the 12 potential predictors of non-satisfaction was used to address potential bias in the analysis as a result of missing values. Keeping the largest possible sample size led to higher statistical power to predict the outcome. Fifty imputed datasets were generated using the twelve potential predictors together with the outcome. Imputation also considered the auxiliary variable "hospital where the surgery took place". Regression coefficients were averaged across the 50 datasets, and the standard error was calculated from the average standard error plus the variability between the imputations (Rubin's rules) 20 .
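The pooling step can be illustrated with a short Python sketch of Rubin's rules for a single coefficient. It uses the standard formulation, in which the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance; the function name and the three example estimates are hypothetical and are not taken from the study.

```python
import math
import statistics


def pool_rubin(estimates, standard_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need matching lists from at least two imputations")
    pooled_estimate = statistics.mean(estimates)
    # Within-imputation variance: mean of the squared standard errors.
    within = statistics.mean([se ** 2 for se in standard_errors])
    # Between-imputation variance: sample variance of the estimates.
    between = statistics.variance(estimates)
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total_variance)


# Hypothetical coefficient estimates from three imputed datasets.
estimate, se = pool_rubin([0.90, 0.95, 1.00], [0.40, 0.42, 0.41])
print(estimate, se)
```

The pooled standard error is always at least as large as the square root of the mean within-imputation variance, reflecting the extra uncertainty due to the missing values.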
Step 2, selection of principal predictors: We generated 200 logistic regression models from 200 bootstrap samples. Bootstrapping is a statistical technique that randomly draws patients, with replacement, from the original sample. Some patients may be duplicated, and other patients from the original data may be omitted in a bootstrap sample; the bootstrap sample size is the same as the total number of observations in the original sample (n = 450). The aim of this technique is to provide an estimate of the sampling variability at our sample size. For this study, each bootstrap sample was drawn with replacement from the combined 50 imputed datasets. Within each bootstrap sample, the 12 predictors were introduced in a logistic regression model, and an automatic backward selection 21 was applied using a significance level equal to 0.157, as recommended by Steyerberg 19 . Sex and age were forced into all the models regardless of their P value because of their biological relevance 22 .
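The resampling in Step 2, and the tally of how often each predictor survives selection, can be sketched in Python as follows. This is a sketch under stated assumptions rather than the study's code: the backward selection itself is not implemented, the function names are hypothetical, and the lists of selected predictors in the usage example are made up. The retention threshold is left as a parameter.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


def inclusion_frequency(selected_per_sample):
    """Fraction of bootstrap models in which each predictor was selected."""
    counts = Counter(
        name for selected in selected_per_sample for name in set(selected)
    )
    total = len(selected_per_sample)
    return {name: count / total for name, count in counts.items()}


def retained_predictors(frequencies, threshold, forced=()):
    """Predictors selected at least `threshold` of the time, plus forced ones."""
    kept = {name for name, freq in frequencies.items() if freq >= threshold}
    return sorted(kept | set(forced))


rng = random.Random(2024)
sample = bootstrap_sample(list(range(450)), rng)  # indices stand in for patients

# Hypothetical selection results from four bootstrap models.
selections = [["smoker", "anxiety"], ["smoker"], ["smoker", "bmi"], ["anxiety"]]
frequencies = inclusion_frequency(selections)
print(retained_predictors(frequencies, threshold=0.6, forced=("sex", "age")))
```

Passing `forced=("sex", "age")` mirrors the decision to keep sex and age in every model regardless of how often they would have been selected.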
Step 3, retention of principal predictors: We retained in the final regression model those variables selected at least 60% of the time. Odds ratios and coefficients with their 95% CI were obtained between each predictor and the outcome using logistic regression.
B. Second, once the principal risk factors were selected, we assessed the performance of the prediction model using discrimination (AUC) and calibration measures. These were represented using discrimination and calibration plots, respectively. The discrimination plot showed the ability of the model to distinguish between non-satisfied patients and satisfied patients. The AUC was estimated from the original COASt sample using the final equation obtained (the model with the variables selected in the previous step). Calibration plots showed the relationship between the predicted and observed probabilities of a patient being non-satisfied. Predicted and observed values were compared for each tenth of predicted risk, ensuring 10 equally sized groups. For each tenth, the observed proportion of non-satisfied patients was obtained together with its 95% Agresti-Coull confidence interval.
C. Third, to test the internal validity of the model, 200 bootstrap samples with replacement combined with multiple imputation were once again used to evaluate bias-corrected estimates of predictive ability. The bias-corrected AUC was estimated using the following steps: 200 random samples (bootstrap samples) were drawn from the full original sample (the imputed COASt dataset of 450 patients). The estimated AUC in each bootstrap model was compared to the estimated AUC in the original full sample. The differences in AUC were averaged, providing an average estimated optimism. Subsequently, we subtracted the estimated optimism from the overfitted AUC of the imputed COASt dataset to obtain a bias-corrected AUC.
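The optimism correction can be illustrated with a minimal Python sketch. It assumes the common formulation in which a model is refitted in each bootstrap sample and its AUC in that sample is compared with its AUC in the original data; the combination with multiple imputation used in the study is not reproduced. All function names are hypothetical, and `fit_direction` is a toy one-variable stand-in for logistic regression used only to make the example runnable.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def fit_direction(X, y):
    """Toy model: score by one variable, oriented so higher means higher risk."""
    positives = [x for x, label in zip(X, y) if label == 1]
    negatives = [x for x, label in zip(X, y) if label == 0]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    sign = 1.0 if mean_pos >= mean_neg else -1.0
    return lambda x: sign * x


def optimism_corrected_auc(X, y, fit, n_boot=200, seed=0):
    """Return (apparent AUC, mean optimism, bias-corrected AUC)."""
    rng = random.Random(seed)
    n = len(y)
    model = fit(X, y)
    apparent = auc([model(x) for x in X], y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # redraw samples that lack one of the outcome classes
        boot_model = fit(Xb, yb)
        auc_in_boot = auc([boot_model(x) for x in Xb], yb)
        auc_in_original = auc([boot_model(x) for x in X], y)
        optimism.append(auc_in_boot - auc_in_original)
    mean_optimism = sum(optimism) / n_boot
    return apparent, mean_optimism, apparent - mean_optimism


# Hypothetical data: one predictor and a binary outcome for eight patients.
X = [0.2, 1.1, 0.7, 1.9, 0.4, 1.5, 1.0, 2.3]
y = [0, 0, 1, 1, 0, 1, 0, 1]
print(optimism_corrected_auc(X, y, fit_direction))
```

Any fitting routine that returns a scoring function can be passed as `fit`, so the same loop would apply to a logistic regression with backward selection repeated inside each bootstrap sample.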
Transportability. The transportability of the model was assessed using data from GAR. We generated 50 imputed datasets for GAR using the same potential predictors previously used in the UK dataset. The equation and the coefficients obtained during the model development were applied to the GAR dataset to obtain an AUC for the Geneva setting. In addition, a calibration plot was produced to assess the degree of agreement between observed and predicted probabilities of the outcome in the GAR sample. As a sensitivity analysis, an AUC was also obtained from a model fitted to the Geneva data with the same predictors identified in the developed model but without fixing the coefficients. That is, the same principal predictors retained for the development model were used to re-estimate new coefficients for the Geneva setting, giving a new logistic regression predicting non-satisfaction for Geneva patients.
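The calibration check (observed versus predicted risk within equally sized groups of predicted risk, with an Agresti-Coull interval for each observed proportion) can be sketched in Python as below. The function names and the example data are hypothetical; clipping the interval to the range 0 to 1 is an added convenience, not something stated in the text.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def agresti_coull(events, n, z=Z_95):
    """Agresti-Coull confidence interval for a proportion, clipped to [0, 1]."""
    n_adj = n + z ** 2
    p_adj = (events + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


def calibration_table(predicted, observed, n_groups=10):
    """Compare mean predicted risk with the observed proportion per risk group."""
    pairs = sorted(zip(predicted, observed))  # order patients by predicted risk
    n = len(pairs)
    table = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        events = sum(outcome for _, outcome in group)
        low, high = agresti_coull(events, len(group))
        table.append({
            "mean_predicted": sum(risk for risk, _ in group) / len(group),
            "observed": events / len(group),
            "low": low,
            "high": high,
        })
    return table


# Hypothetical predicted risks and outcomes (1 = non-satisfied), two groups.
predicted = [0.1] * 10 + [0.9] * 10
observed = [0] * 9 + [1] + [1] * 9 + [0]
for row in calibration_table(predicted, observed, n_groups=2):
    print(row)
```

A well-calibrated model gives rows in which `mean_predicted` falls inside the interval around `observed`; systematic departures in one direction suggest that the intercept or the coefficients need re-estimation in the new setting.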
All tests used were two-tailed. Analyses were conducted using Stata v.13 and SPSS v.22.
Ethics. COASt has been approved by the Oxford REC A (Ethics Reference: 10/H0604/91). The sponsoring organisation of the study is the University Hospitals Southampton NHS Foundation Trust (UHS). The Total Knee Arthroplasty registry has prospectively enrolled all patients undergoing knee replacement since 1998. Ethical approval for the registry (No. CER 05-017 (05-041)) was obtained from the Ethical Committee of the Geneva University Hospitals. Data were collected within the two cohorts as confirmed by the study participants in their written informed consent and as directed by the ICH-GCP (International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use of Good Clinical Practice) guidelines and appropriate local and international legislation.