Development and validation of a scoring system for pre-surgical and early post-surgical prediction of bariatric surgery unsuccess at 2 years

Bariatric surgery (BS) is an effective treatment for morbid obesity. However, a simple and easy-to-use tool for the prediction of BS unsuccess is still lacking. Baseline and follow-up data from 300 consecutive patients who underwent BS were retrospectively collected. Supervised regression and machine-learning techniques were used for model development, in which BS unsuccess at 2 years was defined as a percentage of excess-weight-loss (%EWL) < 50%. Model performances were also assessed considering the percentage of total-weight-loss (%TWL) as the reference parameter. Two scoring systems (NAG-score and ENAG-score) were developed. NAG-score, comprising only pre-surgical data, was structured on a 4.5-point-scale (2 points for neck circumference ≥ 44 cm, 1.5 for age ≥ 50 years, and 1 for fasting glucose ≥ 118 mg/dL). ENAG-score, including also early post-operative data, was structured on a 7-point-scale (3 points for %EWL at 6 months ≤ 45%, 1.5 for neck circumference ≥ 44 cm, 1 for age ≥ 50 years, and 1.5 for fasting glucose ≥ 118 mg/dL). A 3-class-clustering was proposed for clinical application. In conclusion, our study proposed two scoring systems for pre-surgical and early post-surgical prediction of 2-year BS weight-loss, which may be useful to guide the pre-operative assessment, the appropriate balance of patients’ expectations, and the post-operative care.

www.nature.com/scientificreports/
associated with the outcome at univariate analyses may lose their statistical significance when included in multivariate models. Another important issue in predictive models is to establish the relative strength of each considered variable in terms of outcome prediction; model calibration with proper tuning of predictors' relative weights is a crucial step for any subsequent practical application. Several equations for post-operative weight loss prediction have been proposed in recent years 21,31-36 ; however, the retrieved evidence was often limited by a short follow-up time (up to 1 year after BS in most cases) 21,31-33,35 or by a relatively small sample size 21,33,35 . Moreover, post-operative weight loss and model predictors were considered as continuous measures in most cases 31-36 ; this further limits the convenient use of these models in clinical practice, in which dichotomous evaluations, conclusions, or choices are most often needed.
To the best of our knowledge, a simple and easy-to-use estimation tool for pre-surgical and early post-surgical prediction of long-term BS weight loss is still lacking. The aim of the present study was to propose two scoring systems, the NAG-score and the ENAG-score, respectively for pre-BS and early post-BS prediction of weight loss after a 2-year follow-up.

Methods
Study design and patient management. In this retrospective observational study, we collected data from the clinical records of the first 300 patients who underwent BS at the General Surgery Department of the "Città della Salute e della Scienza" Hospital of Turin, University of Turin, from 1 January 2016 onwards and who met the criteria reported below.
Inclusion criteria were: (1) age from 18 to 65 years; (2) BMI ≥ 35 kg/m² with comorbidities or BMI ≥ 40 kg/m², with numerous unsuccessful attempts to lose weight; (3) a minimum 2-year follow-up at our Obesity Unit. Exclusion criteria were: (1) secondary causes of obesity (e.g., hypothalamic diseases, endocrine diseases); (2) associated comorbidities or diseases affecting weight loss (for example, patients who underwent BS in anticipation of transplantation).
Patients gave their informed consent to the processing of their data. The study was approved by the local Ethics Committee (Comitato Etico Interaziendale A.O.U. Città della Salute e della Scienza di Torino-A.O. Ordine Mauriziano-A.S.L. Città di Torino) and was in accordance with the principles of the Declaration of Helsinki.
Patient management and data collection. Pre-operatively, data relative to eating behaviours, previous attempts of weight loss, and presence of comorbidities were collected 28 . Weight, height, waist circumference, and neck circumference were assessed in all patients as anthropometric measures. A fasting blood sample was drawn from all patients, and glucose and lipid values were centrally measured.
The surgical approach consisted of either sleeve gastrectomy (SG) or Roux-en-Y gastric bypass (RYGB). The type of surgery was based on the following patient features: BMI, age, gender, body fat distribution, presence of type 2 diabetes mellitus, hiatal hernia, gastroesophageal reflux disease, patient's expectations/realistic goals, and long-term treatment for a coexisting disease or condition for which absorption and pharmacokinetics of drugs are of major concern. Follow-up visits at our obesity unit took place at 1, 2, 6, 12, and 24 months after BS. Details about patient management and surgical techniques have been previously described 28 .
For the purpose of model creation, BS success was assessed through the percentage of excess-weight-loss (%EWL), which was calculated as:
[(pre-BS weight − weight at the time of visit)/(pre-BS weight − ideal body weight)] × 100.
The weight corresponding to a BMI of 25 kg/m² was considered as the ideal body weight. An unsuccessful weight loss after BS was defined as %EWL < 50%, according to guidelines 1 .
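The outcome definitions can be illustrated with a minimal sketch. The function names and the example patient are hypothetical; the ideal body weight is taken as the weight at a BMI of 25 kg/m², and the total-weight-loss variant (%TWL, the alternative outcome also considered in this study) uses the pre-BS weight as the denominator.

```python
def ideal_weight_kg(height_m, target_bmi=25.0):
    """Weight corresponding to the target BMI (25 kg/m^2 in this study)."""
    return target_bmi * height_m ** 2


def percent_ewl(pre_weight_kg, current_weight_kg, height_m):
    """Percentage of excess weight loss relative to the ideal body weight."""
    excess = pre_weight_kg - ideal_weight_kg(height_m)
    if excess <= 0:
        raise ValueError("pre-surgical weight must exceed the ideal body weight")
    return (pre_weight_kg - current_weight_kg) / excess * 100


def percent_twl(pre_weight_kg, current_weight_kg):
    """Percentage of total weight loss relative to the pre-surgical weight."""
    return (pre_weight_kg - current_weight_kg) / pre_weight_kg * 100


def is_unsuccessful(ewl_percent, threshold=50.0):
    """BS unsuccess as defined for model development: %EWL below 50%."""
    return ewl_percent < threshold


# Hypothetical patient: 1.70 m tall, 130 kg before surgery, 95 kg at the visit.
ewl = percent_ewl(130.0, 95.0, 1.70)
twl = percent_twl(130.0, 95.0)
print(round(ewl, 1), round(twl, 1), is_unsuccessful(ewl))
```

Because the %EWL denominator shrinks as the pre-surgical excess weight decreases, the same absolute loss yields a higher %EWL in a lighter patient, which is the dependency that motivated the alternative %TWL measure.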
The use of %EWL for the evaluation of BS weight-outcome has been recently questioned, mostly due to its dependency on pre-BS weight excess 37,38 . The percentage of total-weight-loss (%TWL) has been proposed as an alternative parameter 37,38 , calculated as:
[(pre-BS weight − weight at the time of visit)/(pre-BS weight)] × 100.
However, there is still no consensus in the literature about which threshold to adopt for the definition of BS unsuccess, since %TWL < 20% was proposed by some authors 37-39 , while %TWL < 25% was suggested by others 40 . Therefore, for the purposes of this study, we opted to use %EWL as the reference parameter for model development, due to the availability of a unanimously recognized cut-off for the definition of BS unsuccess; nevertheless, in order to further strengthen the consistency of our results, final model performances were also assessed considering %TWL as the outcome of choice.
Measurements. Weight was measured with the patient wearing light clothes and no shoes to the nearest 0.1 kg by a digital scale with a capacity of 300 kg (Wunder Sa.Bi.srl). Height was measured to the nearest 0.1 cm with a Stadiometer SECA 220 measuring rod (Hamburg, Germany). Waist and neck circumferences were assessed with a plastic tape measure at the umbilicus level and under the cricoid cartilage, respectively. Type 2 diabetes mellitus and arterial hypertension were diagnosed in accordance with international guidelines. Obstructive sleep apnea (OSA) was suspected in the presence of excessive daytime sleepiness, snoring, and choking or gasping during sleep, enlarged neck circumference, and an intermediate to high-risk score at the STOP-Bang questionnaire 41 . The diagnosis was confirmed by a sleep-expert neurologist by means of further exams, according to international guidelines 42,43 .
Statistical analysis. Patient characteristics were summarized using mean and standard deviation for continuous variables and percent values for categorical data. Relevant predictors of BS outcomes were found through univariate and multivariate logistic regressions, using a stepwise backward selection. Optimal cut-points for continuous variables were found through a supervised machine-learning algorithmic approach; cut-offs for class distinction were automatically derived by the Class-Attribute Contingency Coefficient (CACC) discretization algorithm as those maximizing separation between classes 44 . Multivariate logistic regression was re-applied on discretized variables; integer and half-integer point scores were assigned upon normalization and rounding of regression beta-coefficients. Model calibration was evaluated by the Hosmer-Lemeshow test. A ten-fold cross-validation algorithm was adopted for internal validation, in order to provide an estimate of model performance on unseen data 45,46 : after a random split of the original sample into ten groups, the modelling process was repeated starting from stepwise variable selection in nine of them, and its performance was evaluated in the tenth; the process was then repeated ten times, rotating the validation group at each round; final model performance was obtained as the average performance over the ten iterations 46 . According to the TRIPOD statement 46 , this internal validation approach was preferred to the more classical sample-split approach due to its better reliability in reducing the bias and the variability of performance estimates. The Iterative Dichotomiser 3 (ID3) algorithm 47,48 was applied to cluster risk classes of clinical relevance.
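The rotation scheme of ten-fold cross-validation can be sketched as follows. This is only an illustration of the splitting and averaging logic, not the STATA/R code used in the study; `fit` and `evaluate` are hypothetical callables standing in for the full modelling process (including variable selection) and for the performance measure.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists, rotating the validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random split of the sample
    folds = [indices[i::k] for i in range(k)]     # k groups of near-equal size
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validated_performance(n_samples, fit, evaluate, k=10, seed=0):
    """Average the performance obtained on each held-out fold."""
    scores = []
    for training, validation in k_fold_splits(n_samples, k, seed):
        model = fit(training)                     # modelling repeated on k-1 folds
        scores.append(evaluate(model, validation))
    return sum(scores) / len(scores)


# With 300 patients, each round trains on 270 and validates on 30.
for training, validation in k_fold_splits(300):
    assert len(training) == 270 and len(validation) == 30
```

Repeating the whole modelling process inside each round, rather than only refitting a fixed set of variables, is what keeps the held-out fold genuinely unseen.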
Statistical analysis was performed using STATA 17 (StataCorp, College Station, Texas, USA) and R 4.0.3 (R Foundation for Statistical Computing, Vienna, Austria). For the supervised machine-learning approach used for score creation, the 'arulesCBA', 'partykit', and 'rpart' packages were used.
For continuous variables, the possible presence of significant threshold values was explored through the application of the CACC discretization algorithm; if present, meaningful cut-offs for class distinction were automatically derived as those maximizing separation between classes. Not surprisingly, a meaningful categorization could be found for neck circumference, age, and %EWL at 6 months, i.e., those variables that were already significant when considered as continuous. The optimal cut-points retrieved were ≥ 44 cm for neck circumference (OR = 5.21, 95% CI 2.39-11.36), ≥ 50 years for age (OR = 3.64, 95% CI 2.00-6.64), and ≤ 45% for %EWL at 6 months (OR = 12.21, 95% CI 6.11-24.40). Moreover, a significant dichotomization emerged for two other predictors, i.e., fasting glucose and waist circumference. The optimal cut-points retrieved were ≥ 118 mg/dL for fasting glucose (OR = 2.57, 95% CI 1.33-4.96) and ≥ 142 cm for waist circumference (OR = 2.35, 95% CI 1.07-5.16) (Table 1). The clinical significance of the algorithmically retrieved threshold values was qualitatively substantiated by descriptive analyses, in which the non-linear dependence between the predictors and the outcome was readily evident (Supplementary Fig. 1).
Development and internal validation of pre-surgical score. All pre-surgical variables showing a significant correlation with the outcome at univariate analysis were included in a multivariate regression model. Given the intention of developing a predictive score, continuous variables were included in the model according to their retrieved dichotomizations. After a stepwise backward selection, the variables retaining statistical significance were neck circumference ≥ 44 cm (OR = 4.21, 95% CI 1.85-9.55), age ≥ 50 years (OR = 3.03, 95% CI 1.62-5.67), and fasting glucose ≥ 118 mg/dL (OR = 2.06, 95% CI 1.01-4.18) (Table 2). All other variables (i.e., male sex, OSA, waist circumference ≥ 142 cm) were excluded as non-significant at multivariate analysis.
This model showed a moderate accuracy in the prediction of the outcome (AUC = 0.713). The Hosmer-Lemeshow test did not reveal any significant miscalibration (p = 0.58). Internal validation of the model was performed through ten-fold cross-validation; the final estimate of the model performance on unseen data, obtained as the average AUC over the ten iterations, was equal to 0.695, indicating a negligible overfitting effect.
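For readers less familiar with the AUC, the following minimal sketch computes it as the probability that a randomly chosen patient with the outcome receives a higher predicted value than a randomly chosen patient without it, counting ties as one half. The toy data are hypothetical and are not taken from the study cohort.

```python
def auc(scores, outcomes):
    """Area under the ROC curve from pairwise comparisons (ties count 0.5)."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0      # unsuccessful case ranked higher
            elif p == n:
                concordant += 0.5      # tie, frequent with discrete scores
    return concordant / (len(positives) * len(negatives))


# Hypothetical predicted values and outcomes (1 = unsuccessful weight loss).
toy_scores = [0.0, 2.0, 1.0, 3.0]
toy_outcomes = [0, 0, 1, 1]
print(auc(toy_scores, toy_outcomes))
```

The explicit handling of ties matters once a continuous model is simplified into a point score, since many patients then share the same value.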
In order to simplify its clinical application, the three variables selected by the model were used to develop a discrete-point prediction score; integer and half-integer point scores were assigned upon normalization and rounding of regression beta-coefficients (Table 2). Owing to the variables considered, this score was named "NAG-score" (Neck circumference, Age, Glucose), and was defined by the sum of all three components, on a 4.5-point-scale. Notably, this mild simplification did not lead to a significant reduction in the predictive power of the model since the AUC remained equal to 0.713.
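The point assignment can be illustrated with a short sketch. The exact normalization used by the authors is not detailed in the text, so the rule below is an assumption: each beta-coefficient (the natural logarithm of the multivariate OR) is divided by the smallest one and rounded to the nearest half point. The function name and dictionary keys are hypothetical.

```python
import math


def points_from_odds_ratios(odds_ratios):
    """Assumed rule: beta / smallest beta, rounded to the nearest 0.5."""
    betas = {name: math.log(value) for name, value in odds_ratios.items()}
    smallest = min(betas.values())
    if smallest <= 0:
        raise ValueError("all odds ratios must be greater than 1")
    return {name: round(2 * beta / smallest) / 2 for name, beta in betas.items()}


# Multivariate odds ratios reported for the pre-surgical model.
multivariate_or = {
    "neck_circumference_ge_44_cm": 4.21,
    "age_ge_50_years": 3.03,
    "fasting_glucose_ge_118_mg_dl": 2.06,
}
print(points_from_odds_ratios(multivariate_or))
```

Under this assumed rule the reported odds ratios map to 2, 1.5, and 1 points, which sum to the 4.5-point scale of the NAG-score.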
A descriptive graph of the risk of unsuccessful weight loss after BS according to NAG-score is presented in Fig. 1; the performance of the model in predicting BS unsuccess was evaluated both in terms of %EWL and %TWL, with similar results. In addition, in order to simplify the interpretation of NAG-score and to identify the most clinically relevant risk classes, an algorithmic classification clustering was proposed. The ID3 algorithm clustered the patients into three distinct risk classes (0-1 points, 1.5-2 points, 2.5-4.5 points), with no differences whether considering %EWL < 50% or %TWL < 20% as the reference outcome (Fig. 2); given their clinical correlates, we referred to them as "low-risk", "intermediate-risk", and "high-risk" classes, respectively. The corresponding results are reported in Table 3.
Development and internal validation of early post-surgical score. In order to develop an early post-surgical predictive score, %EWL at 6 months was added to pre-surgical variables in a further multivariate regression model. Again, given the intention of developing a predictive score, continuous variables were included in the model according to their retrieved dichotomizations. After a stepwise backward selection, the variables retaining statistical significance were %EWL at 6 months ≤ 45%, neck circumference ≥ 44 cm, age ≥ 50 years, and fasting glucose ≥ 118 mg/dL (Table 4). All other variables (i.e., male sex, OSA, waist circumference ≥ 142 cm) were excluded as non-significant at multivariate analysis. A significant increase in model accuracy could be noted (AUC = 0.846). The Hosmer-Lemeshow test did not reveal any significant miscalibration (p = 0.68). Internal validation of the model was performed through ten-fold cross-validation; the final estimate of the model performance on unseen data, obtained as the average AUC over the ten iterations, was equal to 0.817, indicating only a modest overfitting effect.
The four variables selected by the model were used to develop a discrete-point prediction score (Table 4), which was named "ENAG-score" (Early loss, Neck circumference, Age, Glucose), and was defined by the sum of all four components, on a 7-point-scale. This mild simplification did not lead to a significant reduction in the predictive power of the model, since the AUC only slightly declined from 0.846 to 0.845.
A descriptive graph of the risk of unsuccessful weight loss after BS according to ENAG-score is presented in Fig. 1; the performance of the model in predicting BS unsuccess was evaluated both in terms of %EWL and %TWL, with similar results. The ID3 algorithm clustered the observations into three distinct and clinically relevant risk classes (0-2.5 points, 3-4.5 points, 5-7 points), with no differences whether considering %EWL < 50% or %TWL < 20% as the reference outcome (Fig. 2); given their clinical correlates, these classes were labelled again as "low-risk", "intermediate-risk", and "high-risk", respectively. As reported in Table 5, when considering %EWL, the low-risk class comprised 199 patients, with a 5.5% risk of unsuccessful weight loss at 2 years from BS. The intermediate-risk class comprised 82 patients, with a 32.9% risk of unsuccessful post-surgical weight loss. The high-risk class comprised 19 patients, with a 94.7% risk of unsuccessful post-surgical weight loss. The stratification performance of the model across different risk classes was overall preserved when adopting %TWL as the reference outcome.
Table 2. Prediction of BS failure, defined as %EWL < 50% at 2 years, by multivariate logistic regression after stepwise backward selection of pre-surgical data; NAG-score point assignment according to multivariate regression coefficients. BS bariatric surgery, CI confidence interval, EWL excess weight loss, OR odds-ratio.
Figure 2. Stratification of BS unsuccess risk, defined either as %EWL < 50% (upper row) or %TWL < 20% (lower row) at 2 years, based on NAG-score (left column) and ENAG-score (right column). Relevant cut-offs for class distinction and patients' clustering were automatically retrieved through the ID3 algorithm. BS bariatric surgery, EWL excess weight loss, ID3 Iterative Dichotomiser 3, TWL total weight loss.
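As a practical illustration, the two scores and their risk classes can be computed as in the sketch below. The point values, cut-offs, and class boundaries are those reported in this paper; the function and parameter names are hypothetical, and the example patient is invented.

```python
def nag_score(neck_cm, age_years, glucose_mg_dl):
    """Pre-surgical NAG-score on a 4.5-point scale."""
    score = 0.0
    score += 2.0 if neck_cm >= 44 else 0.0          # neck circumference
    score += 1.5 if age_years >= 50 else 0.0        # age
    score += 1.0 if glucose_mg_dl >= 118 else 0.0   # fasting glucose
    return score


def enag_score(neck_cm, age_years, glucose_mg_dl, ewl_6m_percent):
    """Early post-surgical ENAG-score on a 7-point scale."""
    score = 0.0
    score += 3.0 if ewl_6m_percent <= 45 else 0.0   # %EWL at 6 months
    score += 1.5 if neck_cm >= 44 else 0.0
    score += 1.0 if age_years >= 50 else 0.0
    score += 1.5 if glucose_mg_dl >= 118 else 0.0
    return score


def nag_risk_class(score):
    """Classes: 0-1 low, 1.5-2 intermediate, 2.5-4.5 high."""
    if score <= 1.0:
        return "low-risk"
    return "intermediate-risk" if score <= 2.0 else "high-risk"


def enag_risk_class(score):
    """Classes: 0-2.5 low, 3-4.5 intermediate, 5-7 high."""
    if score <= 2.5:
        return "low-risk"
    return "intermediate-risk" if score <= 4.5 else "high-risk"


# Hypothetical patient: neck 45 cm, 52 years, glucose 100 mg/dL, 40% EWL at 6 months.
pre = nag_score(45, 52, 100)
post = enag_score(45, 52, 100, 40)
print(pre, nag_risk_class(pre), post, enag_risk_class(post))
```

Since every component is worth a whole or half number of points, only a few totals are possible for each score, which is why three classes are enough to cover the whole scale.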

Discussion
Two multivariate models for pre-surgical and for early post-surgical prediction of BS weight loss after a 2-year follow-up were developed and internally validated. To facilitate their clinical use, two simplified scoring systems (NAG-score and ENAG-score) were derived by assigning integer or half-integer points to each of the included predictors.
The first model, based only on pre-surgical data, showed a moderate overall accuracy for the prediction of the outcome of interest, with an AUC of 0.713. The retrieved score, i.e., the NAG-score, can be a simple and useful tool to stratify different pre-surgical risks for BS failure and may be helpful, during pre-operative assessment, to appropriately balance patients' expectations and to personalize the intensity of the follow-up. In particular, a higher-intensity follow-up might be considered in patients with intermediate-to-high risk of unsuccessful post-surgical weight loss, since these patients may be those who benefit the most from closer dietary and medical counselling 1,4,49,50 .
The second model, which also included early post-operative weight-loss data at 6 months, showed good overall accuracy for the prediction of the outcome of interest, with an AUC of 0.846. The retrieved score, i.e., the ENAG-score, can be of value for early prediction of long-term outcomes of the surgical procedure. This can be useful as a further guide for refining the continuation of follow-up and for personalizing the therapeutic approach; in particular, it may help to avoid therapeutic inertia in patients with an intermediate-to-high probability of unsuccessful post-surgical weight loss. In fact, these patients are those who may benefit the most from stricter counselling and more intensive lifestyle management, and who may be potential candidates for an early start of adjunctive pharmacological treatments 1,4,49,50 .
It is interesting to note that the algorithm of stepwise backward selection led to the inclusion of exactly the same set of pre-surgical factors in both models; in particular, the parameters that were retained as statistically significant were larger neck circumference (≥ 44 cm), older age (≥ 50 years), and higher fasting glucose levels (≥ 118 mg/dL). These results were consistent with previous findings by other authors 7,8,10,11,28 . Even more interestingly, their predictive capacity remained significant after adding to the model a very robust parameter such as early post-operative weight loss; this further supported their relevance as independent predictors of BS outcomes in the longer term. On the other hand, male sex, OSA, and larger waist circumference (≥ 142 cm), though significant at univariate analysis, lost their significance in the multivariate model. This is not surprising, and it is likely a consequence of the multiple collinearities between these variables and those retained in the scores. Nevertheless, the fact that neck circumference performed better than waist circumference in the prediction of BS outcomes is noteworthy. Neck circumference, indeed, relates to oropharyngeal fat infiltration, which narrows the upper respiratory tract, and is a more stable index because, unlike the waist circumference measurement, it is not affected by eating, body position, or respiratory rate 51 . Furthermore, the neck fat depot has been reported to be strongly associated with cardiometabolic and atherosclerotic diseases independent of visceral and whole-body obesity 51-55 . Ectopic fat deposition is dysfunctional and associated with chronic subclinical inflammation, oxidative stress, and endothelial dysfunction, and upper-body subcutaneous fat delivers more free fatty acids than visceral fat into the systemic circulation, thus contributing to increased insulin resistance and other dysmetabolic disorders 53,55,56 . In particular, neck circumference appears to reflect a unique pathologic fat depot with a unique genetic basis, independent of BMI 57 . Finally, increased neck circumference is also a risk factor for OSA, which, in turn, is associated with increased cardiometabolic risk 53 .
Table 4. Prediction of BS failure, defined as %EWL < 50% at 2 years, by multivariate logistic regression after stepwise backward selection of pre-surgical and early post-surgical data; ENAG-score point assignment according to multivariate regression coefficients. BS bariatric surgery, CI confidence interval, EWL excess weight loss, OR odds-ratio.
The main strength of our study was the simplicity of the proposed scoring systems, which was achieved upon the categorization of predictive variables through a robust supervised algorithmic approach. Moreover, the internal validation of our model, together with the assessment of its good calibration, conferred higher consistency to the obtained results, which were further strengthened by the reproducibility of risk-class stratification over two different BS-success defining parameters (%EWL and %TWL).
Our study also had some limitations. First, its retrospective design limited the possibility to take into account some other potential predictors, such as psychosocial data and physical activity levels; their inclusion might have led to a more complete and better-performing scoring system and could have allowed the assessment of causal effects through Mendelian randomization analysis procedures based on genomic background and environmental exposure data 58-62 , which might be the subject of future research. Second, stronger evidence would have been achieved by considering a longer follow-up time; however, the time-point which has been examined (i.e., 2 years after BS) was longer than in other studies proposing equations or scoring systems for the prediction of BS outcomes 21,31-36 , and was considered as a time-point of weight stabilization in most patients 29,30,63,64 . Third, the sample size was not sufficient to develop different scoring systems for SG and RYGB; however, the type of intervention was taken into account during model development, and no significant differences emerged between the two procedures in terms of outcomes. Fourth, our study comprised only Caucasian patients; as there is extensive evidence of weight loss variability among ethnic groups 65-68 , the generalizability of our data to other populations is uncertain.
In conclusion, our study proposed two simple scoring systems (the NAG-score and the ENAG-score) for pre-surgical and early post-surgical prediction of 2-year BS weight loss. The presented data supported their consistency as easy-to-use estimation tools. Further studies are needed to confirm their external validity in different patient cohorts; if confirmed, their application in clinical practice might provide a simple and objective instrument for the evaluation of BS failure risk, which may be useful to guide pre-operative patient assessment, to appropriately balance patients' expectations, and to manage post-operative care more effectively.

Data availability
The dataset analyzed during the current study is available from the corresponding author on reasonable request.