Application of regularized regression to identify novel predictors of mortality in a cohort of hemodialysis patients

Cohort studies often provide a large array of data on study participants. The techniques of statistical learning can allow an efficient way to analyze large datasets in order to uncover previously unknown, clinically relevant predictors of morbidity or mortality. We applied a combination of elastic net penalized Cox regression and stability selection with the aim of identifying novel predictors of mortality in a cohort of prevalent hemodialysis patients. In our analysis we included 475 patients from the “rISk strAtification in end-stage Renal disease” (ISAR) study, who we split into derivation and confirmation cohorts. A wide array of examinations was available for study participants, resulting in over a hundred potential predictors. In the selection approach many of the well established predictors were retrieved in the derivation cohort. Additionally, the serum levels of IL-12p70 and AST were selected as mortality predictors and confirmed in the withheld subgroup. High IL-12p70 levels were specifically prognostic of infection-related mortality. In summary, we demonstrate an approach how statistical learning can be applied to a cohort study to derive novel hypotheses in a data-driven way. Our results suggest a novel role of IL-12p70 in infection-related mortality, while AST is a promising additional biomarker in patients undergoing hemodialysis.

www.nature.com/scientificreports/ A more recent advancement is a combination of regularization with resampling, termed "stability selection", which consists of recalculating the regularized regression many times with random subgroups of the available patient population in order to select the variables which show most consistent association with the outcome. This reportedly results in a more reliable selection in settings with a large number of potential predictors compared to the number of patients 11 .
Here we report on the application of elastic net regularization combined with stability selection for the identification of relevant predictors of mortality in the "ISAR" cohort of prevalent hemodialysis patients 12 .

Results
Characteristics of the study population and collected data. The overall study design and rationale of the ISAR cohort trial have been described in our previous report 12 . Briefly summarized, prevalent hemodialysis patients were recruited in 17 centers in and around Munich, Germany and followed up for a median time of 37 months (interquartile range of 25-49 months). A total of 475 patients were included in the analysis (Fig. 1). Baseline characteristics for this population were previously reported 5 . Mortality events were classified into three categories: cardiovascular, infectious/sepsis and other (or unknown), with 64, 43 and 62 respective events occurring during the follow-up period (Fig. 1). Overall over 100 possible predictors were collected for the patients in the study population including clinical parameters, comorbidities and medications, blood value measurements and static retinal vessel analyses (Table S1).
We randomly separated the total cohort into a derivation and confirmation dataset (n = 317 and 158 respectively) with similar characteristics (Fig. 1). The sets were specifically similar with regard to age, gender, dialysis patient adapted Charlson comorbidity index (CCI) 13 , serum IL-6 and mortality ( Table 1).

Application of a combination of elastic net regularization and stability selection for the identification of mortality predictors.
Variable selection with consecutive stability selection in the derivation dataset resulted in a substantially reduced predictor subset compared to using only the elastic net regression ( Figure S1 and Fig. 2). Most of the predictors passing selection criteria for all-cause mortality or the causespecific events showed significant associations with mortality in the withheld confirmation cohort ( Table 2). Among the comorbidities diabetes mellitus, pulmonary hypertension, atrial fibrillation (AF), history of myocardial infarction, neoplasia, atherosclerosis (other than CHD) and peripheral artery disease (PAD) were selected and confirmed. Most of these have in fact previously been shown to have a significant effect on mortality 13 . Other previously reported predictors were catheter use for dialysis, blood pressure, hemoglobin levels, serum creatinine, CRP 2,14 , use of oral anticoagulation 15 , serum levels of IL-6 and YKL-40, as well as IP-10 5 . We could also observe a detrimental effect of low serum triglycerides in our population as has been reported previously 16 . Interestingly, we also found low levels of AST to be associated with reduced all-cause mortality, while high levels of IL-12p70 were associated with an increased infection-associated mortality ( Table 2). The association of AST with mortality is a relatively novel finding in dialysis patients 17 . The positive association of IL-12p70 with mortality in these patients is to our knowledge a novel finding. We therefore further analyzed these predictors in our study population.  Table 2). In this "low" group, AST levels were in the range of 3-13 U/l and therefore on the lower end or even below the reference range for the normal population (10-50 U/l for males and 10-35 U/l for females). When analyzing mortality binned by AST subgroup in the total study population, this protective effect could be observed for all-cause and cardiovascular mortality, while increased levels of AST (top quintile) were associated with increased mortality for all-cause and "other" mortality causes (Fig. 3a). Modelling mortality using spline fitting on log transformed AST values significantly outperformed a regular Cox model (p = 0.038 for all-cause mortality in a likelihood ratio test on the total population with non-missing AST-values, n = 377, one patient was excluded due to strongly increased AST and ALT as discussed in "Methods"). We additionally observed a numerically improved predictive performance for a spline fitted model as further discussed below (Fig. 5a). In univariate and multivariable Cox regression incorporating spline transformations a similar association as described above for both a protective effect of lower than average AST as well as a detrimental effect of increased AST levels for all-cause mortality as well as cause-specific hazard for cardiovascular mortality and other causes were identified (Fig. 4). Of note, unknown mortality causes were also classified as "other" causes, we therefore expect some degree of overlap with cardiovascular mortality within this group due to e.g. sudden cardiac death. Interestingly, despite a strong correlation of serum AST with ALT levels (Pearson coefficient: 0.62, p < 0.001), ALT levels were not significantly associated with all-cause or cardiovascular mortality (not shown) and did not show a similar protective effect of lower values when spline transformations were considered ( Figure S2).
IL12p70 as a predictor of infection-associated mortality in dialysis patients. High IL-12p70 was one of the most stable predictors associated with mortality due to infection (Fig. 2c), showing a significant Table 1. Matching characteristics of the derivation and confirmation datasets. Nominal variables are reported as counts and percentages, ordinal/continuous variables as median and interquartile range. P-values were calculated using a Chi-squared and a Mann-Whitney test respectively. Dialysis-patient adapted comorbidity index was calculated as described by Liu et al. 13 . D/T, due to.  Vertical dashed lines represent the penalty parameter (ln-lambda) chosen by the stability selection, horizontal dashed lines represent the predefined selection threshold of 60%. Variables above this threshold at the selected penalty were considered stable. "High" and "low" indicates that the value falls within the top or bottom quintile of the total study population (see "Methods").  Table 2). Criteria for stability selection were missed by a small margin for all-cause mortality (Fig. 2a). With the applied assay for IL-12p70, the majority of patients showed an undetectable level of this interleukin ( Figure S3). When analyzing mortality binned by IL-12p70 expression (undetectable, weakly elevated, top quintile) in the total study population, we could observe a selective increase in mortality in the top quintile without according tendency in the weakly elevated group (Fig. 3b).
Added value of AST and IL-12p70 for mortality prediction. Time-dependent receiver-operating characteristics (ROC) analysis for univariate Cox models showed highest areas und the ROC curve (AUC) for AST after logarithmic transformation and spline fitting (Fig. 5a). Using the predefined threshold (top quintile) for IL-12p70 had a similar AUC for all-cause mortality as using transformed full data and when fitting a spline (Fig. 5a). Fitting a spline to transformed IL-12p70 data also did not show a significant improvement in a likelihood ratio test (p = 0.209 for all-cause and p = 0.137 for infection-associated mortality, n = 475). The AUCs for these single predictors were comparable to those of other established predictors such as serum albumin and the use of a catheter for dialysis, though not as high as some of the top predictors for mortality, age and adapted comorbidity score (Fig. 5a).
Analysis of multivariable models showed an additive effect of the new predictors on our previously established model 5 (Fig. 5b, see "Methods" for included predictors). In fact, inclusion of both AST and IL-12p70 (green line in Fig. 5b) was superior to albumin or catheter use and had a similar additive value on AUC as the strong predictors age and adapted CCI within the first year of follow up (orange and blue dashed lines in Fig. 4b left panel).

Discussion
Here, we applied techniques of statistical learning to analyze a cohort study of dialysis patients. Although similar techniques have been previously applied in this population in order to improve mortality prediction [18][19][20] , our approach differs in that our main goal was not to derive an overall prediction model or to address a specific hypothesis, but rather to find novel promising candidates as predictors in a setting with a large amount of available characteristics for a relatively small study population. Many of the well-known and established prognostic factors for mortality were recovered, while a relatively modest number of 317 patients (with 112 events during follow-up) were used for selection among ≈ 110 predictors. We were then able to confirm significant associations in the withheld smaller confirmation group for most of the factors obtained after stability selection using  www.nature.com/scientificreports/ IL-12p70 is the active heterodimer form of IL-12, which consists of the IL-12p35 and IL-12p40 subunits. While IL-12p35 is thought to be specific for IL-12, IL-12p40 also associates with IL-23p19 to form active IL-23. IL-12 activity has been implicated in the immunologic response to pathogens and tumor cells, as well as in autoimmune disease 21,22 . Levels of IL-12p70 are known to be relatively low in the serum of healthy individuals 23 and were below the detection limit for our assay in the majority of the study population. Contradictory reports concerning IL-12 activity exist for dialysis patients. One study, published shortly after the discovery of this cytokine, observed an increased level of IL-12 in dialysis patients compared to healthy controls 24 , a finding that could possibly be explained by increased IL-12 release from peripheral blood mononuclear cells (PBMCs) as an inflammatory response to the dialysis membrane 25 . This study also reported, contrary to our observation, that increased IL-12 levels were associated with improved survival in dialysis patients 24 . However, in a more recent report assessing specifically the levels of IL-12p40 and IL-12p70, a significant increase of IL-12p40 was observed, while no increase was reported for IL-12p70 in dialysis patients vs. healthy controls 26 . The discrepancy between our finding and those of Kimmel et al. 24 could therefore be explained by the specificity for IL-12p70 in our assay. Alternatively, since we categorized data and compared individuals with 20% highest levels to others, the effects of IL-12 might be non-linear, in that they might be protective up to a certain level and detrimental or indicative of an additional underlying disease (e.g. autoimmune) above that level. Further studies employing more sensitive IL-12p70 and IL-12p40 measurements would be of interest to solve this question. Having high levels of IL-12p70 was one of the most stable predictors for increased infection-associated mortality in our derivation cohort. Although a causal relationship cannot be established by the applied approach, this close correlation suggests that it may be a causal factor or that it may be closely linked to a process causing mortality in hemodialysis patients. Further studies into the underlying pathophysiology would be highly interesting to elucidate this relationship and clarify if modulation of IL-12 activity may serve as a therapeutic intervention.
Another stable predictor which was obtained in our analysis is AST and we further looked into this association because until now it has received little attention in dialysis patients. AST is often used in clinical practice to assess liver damage, however unlike ALT it is less specific and is also elevated in other conditions, such as myocardial infarction or skeletal muscle damage 27,28 . Levels of AST have been shown to be predictive of both all-cause as well as specifically cardiovascular mortality (potentially due to an association with the metabolic syndrome) in the general population [29][30][31][32] . While on average a decreased AST activity in dialysis patients compared to the general population was observed long ago 33,34 , we could only find one previous study analyzing its association with mortality in dialysis patients 17 (in this case with all-cause mortality). Our results confirm these findings and suggest a potential association with cardiovascular disease in dialysis patients. An association with liver disease may explain an increased mortality due to "other" causes in patients with high AST values, similar as was observed for increased ALT values ( Figure S2). The finding that ALT levels do not show a similar association with cardiovascular mortality, however, suggests that there may be an alternative relevant pathomechanism or a source of AST elevation outside the liver (such as the heart) which is worthy of further investigation. Although our study is not sufficiently powered to further elucidate the pathophysiology behind AST variation (e.g. only 5% of participants had a known liver disease based on medical records) our results underline the prognostic value of AST and support the previously expressed need for different reference ranges in hemodialysis patients 35 .
When assessing the additional value of AST and IL-12p70 for the prediction of mortality in dialysis patients using time-dependent ROC analysis, we found an additive value of these predictors. Although in the multivariable model the numeric gain was relatively modest, it has to be compared to other established variables, because the model already incorporated many important predictors. We observed a similar effect of AST and IL-12p70 as e.g. for known independent predictors albumin and catheter use for dialysis. The added effect of these two predictors together was similar in the first year to some of the strongest predictors: age and adapted CCI score.
One of the limitations of the applied penalized Cox regression is that it assumes linear associations with the outcome (as does regular Cox regression). This also means that single high-leverage points, which can represent false measurements or extraordinary conditions, may have a strong effect on selection (or omission) of variables. Here we addressed these issues by "dummy coding" quantitative measurements to nominal variables which represent top or lowest quintiles for specific values. Although such thresholding may influence the selection due to loss of information, we in fact observed that most known ordinal/continuous predictors for mortality in dialysis patients passed our selection criteria (i.e. hsCRP, hemoglobin, creatinine, triglycerides, IL-6, YKL-40, AST). However, one notable exception is serum albumin, which was not retained despite a well-documented correlation in the literature. Decreased albumin levels are both observed as part of malnutrition as well as inflammation (anti-acute phase protein), also summarized as dialysis-associated inflammation-malnutrition complex 36 . There is also a known negative correlation between IL-6 and albumin 37,38 . Other retained predictors are also associated with inflammation-malnutrition in ESRD patients, such as hsCRP, creatinine or lipids 36 . It is therefore likely that, next to the thresholding approach, such predictors have additionally diminished the explanatory value of albumin for mortality prediction, as has been reported previously 39 . A similar collinearity of hsCRP with IL-6 40,41 may have reduced the predictive value of the former in the case of infection-associated mortality, yet it was selected as a predictor for all-cause mortality where more events were available and potentially due to its described additional role as a predictor of malnutrition 42 .
We strongly assume that including more patients in the derivation cohort would further improve factor selection, while in our study this dataset was relatively small (i.e. 317 patients). Having a larger cohort would also allow to include variable interactions in order to identify subgroup-specific effects, something that was not accounted for in our approach. Further limitations of our study are missing values for some of the variables and unknown cause of death for some of the patients. We expect that a relevant portion of missing values, as these were set to the reference level (see "Methods"), will reduce the selection stability of a variable and therefore typically lead to a more conservative selection. An additional conceptual limitation is that only the variables at baseline could www.nature.com/scientificreports/ be used in the current analysis. Yet, the evolution of biomarkers over time in dialysis patients and the course of chronic kidney disease may provide additional important insights and improve predictive performance. In summary, we successfully applied techniques of statistical learning as an exploratory approach to highdimensional data from a cohort of dialysis patients. Despite the relatively modest number of patients, we thereby could retrieve most of the established risk markers and additionally identify promising candidates for novel predictors. The described techniques can be considered complementary, yet not as competitors to widely-applied and state of the art hypothesis-driven approaches. Similar techniques, as they become more accessible, may gain importance in routine exploratory study analysis in order to gain additional insights into the underlying risk-determining factors.

Methods
Subjects/study population and data collection. The clinical trial was registered under ClinicalTrials.
gov identifier: NCT01152892. For detailed information please refer to the study protocol 12 . In brief, we included stable HD patients who were at least 18 years of age, had an HD vintage of at least 90 days, and provided written and informed consent. Malignant disease with a life expectancy shorter than 2 years had been the only relevant exclusion criterion. The present study included 475 patients, for whom sufficient unthawed sera were available for cytokine quantification. The study was approved by the ethics committees of the Klinikum rechts der Isar, Technical University Munich, and by the Bavarian State Board of Physicians. It was performed in accordance with the Declaration of Helsinki.
Age, patient characteristics, dialysis modalities, comorbidities and medication were assessed from medical records. Comorbidities were recorded at inclusion and updated during the 2 years follow up as previously described 5,43 . An adapted CCI was calculated according to Liu et. al. 13 (version without ESRD cause).
Cause of death was attributed to the following domains: cardiovascular, infection associated and other (or unknown) of which the cardiovascular domain had been prespecified in the study protocol 12 . For further details also see Lorenz et al. 5 Medical records and interviews of attending physicians, and relatives were used by the ISAR study physician board to ascertain death and assigned each case an underlying cause of death. When no agreement on cause of death could be reached the cause of death was defined as unknown (n = 28).
Serumbiochemistry and functional studies. Serum was collected prior to a midweek dialysis session at inclusion and after the exclusion of active infection. Patients that had undergone surgery within the past 2 weeks were included at a later timepoint. Serum was centrifuged after 30 min of resting at room temperature, aliquoted and stored at -80 °C. Routine laboratory analyses were assessed by International Organization for Standardization-certified laboratories. YKL-40 levels were measured in duplicates using a commercial enzyme-linked immunosorbent assay (ELSIA) kit (R&D Systems, Inc., Minneapolis, MN). The remaining inflammatory mediators were assessed using BD Flex-sets (Becton Dickinson, Heidelberg, Germany) on a FACS Canto II (Becton Dickinson) and BD Diva software (Becton Dickinson) according to manufacturer's instructions. Theoretical limits of detection were indicated by the manufacturer between 0.6 and 1.6 pg/mL. Ambulatory 24-h blood pressure (BP) and pulse wave velocity (PWV) measurements were carried out as previously described 44 using a Mobil-O-Graph 24 h-PWA (I.E.M. GmbH, Stolberg, Germany). Retinal vessel analysis was performed using Static Vessel Analyzer (Imedos Systems UG, Jena, Germany). Fundus images of one eye were gathered and classified into four categories according to quality. Two images with the highest qualitative category were analyzed and means of central retinal arteriolar (CRAE) and central retinal venular equivalent (CRVE) as well as arteriolar-to-venular diameter ratio (AVR) were calculated as previously described 45 . Statistical analyses. Statistical analysis was performed in R version 3.6.3 46 .
In order to be able to test the selected predictors on patients whose data was not used for variable selection, we split the cohort into a derivation and confirmation datasets. To this end we performed 1000 random splits in the ratio of 2/3 to 1/3 and calculated for each split the p-value for difference in the matching variables (age, adapted CCI, gender, IL-6 and mortality as indicated in Table 1) using Chi-squared test for nominal and Mann-Whitney test for ordinal or continuous variables. We then selected the split where the two groups were most likely similar (based on where the lowest p-value of the tests was the highest).
For stable variable selection we applied elastic net regularized regression on the derivation subcohort on all predictors listed in Table S1 simultaneously. In order to address possible non-linearity, ordinal and continuous variables (with the exception of age for which a linear relationship can be expected and which was expressed in decades) were dummy coded as "high" and "low" group by assigning for each patient whether he falls into the top or bottom quintile for the specific variable. Ties were assigned randomly. Values not falling into top or bottom quintile and missing values were assigned 0 in both dummy variables. Missing values in nominal variables were also assigned to the reference category (meaning absence of a potential risk factor).
In a first step we performed elastic net 10 Cox regression separately for all-cause mortality, cardiovascular and infection-associated mortality with 20-fold cross-validation using 15 equally spaced alpha values in the range of 0 to 1 (here alpha = 0 equals ridge regression and alpha = 1 equals lasso regression). We selected for each alpha the lambda with the lowest cross-validation error ( Figure S1a). We then selected for each regression the non-zero alpha value with the lowest cross validation error (vertical dashed lines in Figure S1a). Alpha of zero would not result in any variable selection (as it equals ridge regression) and was therefore not chosen. Next, we used the chosen alpha values in a stability selection approach as described by Meinshausen et Bühlmann 11 and implemented in R by Sill et al. 47 For this preselection approach as a means of hypothesis generation we used 100 subsamples of the derivation cohort, a "weakness" of 0.8 and a (maximal) per-comparison error rate of 10%. All variables passing the selection criteria are reported in Table 2 www.nature.com/scientificreports/ For spline fitting we used penalized splines 48,49 with four degrees of freedom. For the analysis in Fig. 4 and Figure S2 one outlier with AST of > 600 U/l and ALT > 900 U/l (suggestive of acute hepatopathy) was removed.
Time-dependent ROC analysis for Cox regression was performed by first fitting uni- (Fig. 5a) or multivariable (Fig. 5b) models to a bootstrap sample of the total study population and then calculating incident/dynamic AUCs 50 on the out of bootstrap sample. The bootstrap procedure was performed 1000 times and the resulting AUCs were averaged per month. For multivariable analysis missing values were randomly sampled from the available values for the respective variable and a total of 25 thus imputed datasets were generated. AUCs for each of these datasets were calculated as described above (these are presented as dots in Fig. 5b).