A machine learning driven nomogram for predicting chronic kidney disease stages 3–5

Chronic kidney disease (CKD) remains one of the most prominent causes of mortality worldwide, necessitating accurate prediction models for early detection and prevention. In recent years, machine learning (ML) techniques have shown promising outcomes across various medical applications. This study introduces a novel ML-driven nomogram approach for early identification of individuals at risk of developing CKD stages 3–5. This retrospective study employed a dataset comprising clinical and laboratory variables from a large cohort of diagnosed CKD patients. ML algorithms, including feature selection and regression models, were applied to build a predictive model. Among 467 participants, 11.56% developed CKD stages 3–5 over a 9-year follow-up. Several factors, such as age, gender, medical history, and laboratory results, independently exhibited significant associations with CKD (p < 0.05) and were used to create a risk function. The linear regression (LR)-based model achieved an R-score (coefficient of determination) of 0.954079, while the support vector machine (SVM) achieved a slightly lower value. An LR-based nomogram was developed to facilitate risk identification and management. The ML-driven nomogram demonstrated superior performance when compared to traditional prediction models, showing its potential as a clinical tool for the early detection and prevention of CKD. Further studies should focus on refining the model and validating its performance in diverse populations.


Dataset collection and subject information
The present investigation employed a dataset obtained from [7], which included health records of 544 patients collected at Tawam Hospital in Al-Ain city, Abu Dhabi, United Arab Emirates (UAE), between January 1, 2008, and December 31, 2008. Figure 1 shows the flow diagram of the study design and patient selection process.
A total of 467 patients were included according to the inclusion and exclusion criteria, of whom 234 were female and 233 were male, aged 23-89 years. Due to the retrospective nature of the study, the need for informed consent was waived by the Tawam Hospital and UAE University Research Ethics Board, which approved the study protocol under Application No. IRR536/17. The study was performed in accordance with the Declaration of Helsinki. All the patients were UAE citizens over the age of 20 and diagnosed with one or more of the following conditions: coronary heart disease (CHD), pre-hypertension, diabetes mellitus (DM) or prediabetes, vascular diseases, dyslipidemia, smoking, or being overweight or obese.
The data collected include the age of the patients (≤ 49, 50-60, and ≥ 65), sex (female, male), smoking status (no, yes), obesity (no, yes), total cholesterol (TC), triglycerides (TG), estimated glomerular filtration rate (eGFR), glycosylated hemoglobin type A1C (HbA1C), systolic blood pressure (SBP), diastolic blood pressure (DBP), body mass index (BMI), and serum creatinine (SCr). The study also includes disease parameters such as CHD (no, yes), diabetes mellitus (no, yes), hypertension (HTN) (no, yes), dyslipidemia (no, yes), and vascular diseases (no, yes), as well as the use of angiotensin-converting enzyme (ACE) inhibitors and angiotensin II receptor blockers (ARBs) (no, yes). The category within the parentheses in the definitions above serves as the reference group.
Patients were recorded as having CHD if they had evidence of a coronary event, a coronary revascularization operation, or a cardiologist-determined diagnosis. Similarly, patients were categorized as having vascular disease if they had a documented history of cerebrovascular accident or transient ischemic stroke, a documented history of peripheral arterial disease, or a revascularization for peripheral vascular disease.
The exclusion criteria of this study were as follows: (i) eGFR less than 60 mL/min/1.73 m²; (ii) incomplete clinical data; (iii) loss to follow-up. All dataset attributes refer to the patients' initial visits in January 2008, except for the follow-up time variable and EventCKD35 (a binary variable taking the values 0 and 1). The follow-up ended in June 2017. The values 0 and 1 indicate that the patient is in CKD stages 1 or 2, or in stages 3, 4, or 5, respectively. During the follow-up period, 54 patients (11.56%) with CKD stages 3-5 were identified in the entire cohort. In the context of this study, 'time' refers to the duration of follow-up after the patient's diagnosis and initiation of treatment, quantified in survival months. Among these 54 patients, the average duration of follow-up was 50 months, and the minimum observed follow-up period was 3 months.

Diagnostic criteria
The diagnostic criteria for CKD stages 3-5 were defined based on the eGFR and kidney damage, which can be assessed through various diagnostic tests and clinical evaluations. The Kidney Disease Improving Global Outcomes (KDIGO) classification was used to categorize patients into two groups: normal (eGFR ≥ 60 mL/min/1.73 m²) and CKD stages 3-5 (eGFR < 60 mL/min/1.73 m²) [18]. The CKD epidemiology collaboration (CKD-EPI) creatinine equation was used to determine eGFR, as per the definition given below [19]:
$$\mathrm{eGFR} = 141 \times \min\left(\frac{SCr}{\kappa}, 1\right)^{\alpha} \times \max\left(\frac{SCr}{\kappa}, 1\right)^{-1.209} \times 0.993^{\mathrm{Age}} \times 1.018\ [\text{if female}] \times 1.159\ [\text{if Black}] \tag{1}$$
where SCr denotes serum creatinine measured in µmol/L, age is expressed in years, κ is a constant of 0.9 for males and 0.7 for females, α is a constant of −0.411 for males and −0.329 for females, 'min' represents the minimum of SCr/κ and 1, and 'max' represents the maximum of SCr/κ and 1 [19][20][21]. A factor of 1.0 was assigned for ethnicity due to the absence of African-descent subjects in this study.
The BMI ranges used for identifying individuals as overweight and obese are 25-29.9 kg/m² and ≥ 30 kg/m², respectively. According to [22], HTN was defined as SBP over 140 mmHg, DBP over 90 mmHg, or taking medication to treat high blood pressure. Diagnostic standards for dyslipidemia included serum TC values of ≥ 6.21 mmol/L, serum TG levels of ≥ 2.26 mmol/L, or the use of lipid-lowering drugs [23]. The reference ranges for creatinine were 58-96 µmol/L for females and 53-115 µmol/L for males [7]. Patients were considered to have a positive smoking history if they reported either current or past tobacco smoking. The definition of prediabetes and DM followed the guidelines set by the American Diabetes Association (ADA) [24].
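The CKD-EPI creatinine equation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the sketch assumes that the constants κ = 0.9/0.7 apply to creatinine expressed in mg/dL, so a value recorded in µmol/L is first divided by 88.4.

```python
def egfr_ckd_epi_2009(scr_umol_per_l, age_years, is_female, is_black=False):
    """eGFR in mL/min/1.73 m^2 from the 2009 CKD-EPI creatinine equation.

    Assumption: kappa and alpha are defined for creatinine in mg/dL,
    so the input in umol/L is converted first (1 mg/dL = 88.4 umol/L).
    """
    scr_mg_dl = scr_umol_per_l / 88.4
    kappa = 0.7 if is_female else 0.9
    alpha = -0.329 if is_female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209
            * 0.993 ** age_years)
    if is_female:
        egfr *= 1.018
    if is_black:
        egfr *= 1.159  # factor left at 1.0 for the cohort described here
    return egfr


def event_ckd35(egfr):
    """Hypothetical helper: 1 for CKD stages 3-5 (eGFR < 60), else 0."""
    return 1 if egfr < 60.0 else 0
```

For example, `egfr_ckd_epi_2009(79.56, 50, is_female=False)` corresponds to a 50-year-old man whose creatinine equals κ, so only the age term reduces the value below 141.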

Model estimation and selection
To analyze the data, the non-parametric Kaplan-Meier (KM) estimator was first used to measure the amount of time spent in follow-up and to visualize the survival curves. Then, a semi-parametric Cox proportional hazard regression model was employed to describe the impact of the variables on the survival outcome. These methods are briefly detailed here.

Kaplan-Meier method
The KM method is a non-parametric modeling approach established by Kaplan and Meier in 1958 that estimates survival probability based on observed survival times [25]. The general formula for determining the survival probability $\hat{S}(t)$ at time $t$ is as follows:
$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right) \tag{2}$$
where $t_1, t_2, \cdots, t_n$ are the ordered unique event times, and $n_i$ is the number of patients who were 'at risk' just prior to time $t_i$. The variable $d_i$ represents the number of events that occurred at time $t_i$. The estimated probability is a step function that begins with a horizontal line at a survival probability of 1 (100%) and then steps down toward zero as events occur. The KM estimator is used to analyze the survival probability. The survival time, measured in months, was the primary dependent variable. Follow-up time can be interpreted as a time to event (TTE), where the event would be CKD stages 1-2 or CKD stages 3-5.
The non-parametric KM method has a significant drawback: it cannot represent survival probability with a smooth function, which limits its use for prediction. Parametric models such as the exponential and Weibull distribution models can overcome this limitation [26]. They serve as a logical progression from the KM method, bridging the gap and greatly improving the understanding of survival analysis. Moreover, in cases where parametric models are appropriate, they are more exact, more efficient, and more informative than KM. The KM estimation curve was fitted with exponential and Weibull distributions and compared using the AIC (Akaike Information Criterion) and the maximum log-likelihood. A model with a smaller AIC and a higher log-likelihood provides the better fit. The initial analysis showed that the Weibull distribution has a larger log-likelihood of −259.78 and a smaller AIC of 523.56 compared to the exponential model estimates (log-likelihood: −265.49, AIC: 532.98). The Weibull distribution is therefore the superior fit, because it maximizes the log-likelihood while minimizing the AIC.
www.nature.com/scientificreports/
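The reported AIC values can be checked directly from the reported log-likelihoods. The sketch below assumes the standard definition AIC = 2k − 2 ln L and the usual parameter counts (two for the Weibull distribution, one for the exponential); the text does not state the parameter counts explicitly, so they are an assumption here.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Log-likelihoods as reported in the text; parameter counts are assumed.
weibull_aic = aic(-259.78, n_params=2)      # shape and scale
exponential_aic = aic(-265.49, n_params=1)  # rate only

print(round(weibull_aic, 2), round(exponential_aic, 2))
```

Under these assumptions the computed values reproduce the reported 523.56 and 532.98, and the lower Weibull AIC agrees with the model choice described above.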
Figure 2 shows the KM plots for the survival function of CKD patients in stages 3-5 and the fitted curves of both parametric models. The Python programming language (version 3.10.12) and the "lifelines" package were used to estimate the KM curve [27]. The plot displays the time period (follow-up months) on the x-axis and survival probabilities on the y-axis. A notable disparity was observed with regard to patient survival. The exponential distribution survival plot, depicted by the green curve (Fig. 2), exhibits a slight deviation from the KM survival plot represented by the blue curve, whereas the orange curve (Weibull distribution) aligns with it. The smooth rate of decrease of the parametric curve characterizes the survival probability more effectively than the step-wise KM estimate, which drops abruptly only after an event and stays constant between events.
To determine which model provides the best fit, a quantile-quantile (Q-Q) plot (Fig. 3) is used to check how closely the observations cluster along a slope line [28]. The distribution whose Q-Q plot aligns more closely with a straight line provides the better fit to the KM survival curve. If the points deviate significantly from a straight line, the data do not fit the chosen distribution well. From Fig. 3, it can be observed that the Weibull distribution is a good fit, as most of the observed data points cluster along the slope line. Hence, the Weibull distribution model can be used to examine other features affecting CKD patients in stages 3-5, which helps determine the features most strongly associated with patients' survival.
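The contrast between the step-wise KM estimate and a smooth Weibull curve can be sketched without any survival library. This is an illustrative example on made-up data, not the study's lifelines code: the function names and the toy durations are hypothetical, and the Weibull form S(t) = exp(−(t/scale)^shape) is the common two-parameter parameterization, assumed here.

```python
import math
from collections import Counter


def kaplan_meier(durations, events):
    """Product-limit estimate: list of (event time, survival probability)."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        n_at_risk = sum(1 for d in durations if d >= t)  # n_i
        survival *= 1.0 - deaths[t] / n_at_risk          # (1 - d_i / n_i)
        curve.append((t, survival))
    return curve


def weibull_survival(t, shape, scale):
    """Smooth parametric survival function S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


# Toy follow-up times in months; event = 1 means the event was observed.
toy_durations = [3, 5, 5, 8, 12, 12]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_durations, toy_events)

# Shape and scale as reported in the Results section (1.53 and 55.35).
for month in (12, 36, 60):
    print(month, round(weibull_survival(month, 1.53, 55.35), 3))
```

The KM curve only changes at the four observed event times, while the Weibull function returns a probability for any follow-up month.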

Cox proportional hazard model
The Cox proportional hazard model is a semi-parametric method that can be used to analyze survival-time outcomes, also known as time-to-event outcomes, based on one or more predictors [29]. The model has the features of a general regression analysis, which enables the evaluation of a factor's influence on survival time while accounting for other factors. Its functionality is similar to that of the logistic regression model, but instead of predicting a binary outcome, it focuses on time-to-event data. The regression coefficient determines the relative risk linked to the corresponding factor. The logistic regression model is designed to handle only qualitative variables as the dependent variable, such as the outcome of a case (the end event), without incorporating the duration of survival time. The Cox hazard-based model uses survival time and event occurrence as its dependent variables. The Cox proportional hazards model is presented in the following form [30]:
$$h(t, X) = h_0(t)\, e^{g(X)} \tag{3}$$
where $t$ represents the time, $X$ denotes the contributing factors, and $h_0(t)$ is the baseline hazard. The relative risk function, denoted as $g(X) = \beta^{T} X$, depends solely on the $p$ explanatory variables $X = (x_1, x_2, \cdots, x_p)$ and the regression parameter $\beta$. The values $e^{\beta_i}$ are called hazard ratios (HR). A positive value of $\beta_i$, or an HR greater than one, indicates that an increase in the $i$th covariate leads to an increase in the event hazard and thus a decrease in the survival length. In other words, a covariate with an HR over 1 is positively correlated with the likelihood of an occurrence and hence negatively correlated with the duration of survival.

Results and discussion
In this study, a total of 467 participants with eGFR greater than or equal to 60 mL/min/1.73 m² were followed up every 3 months from the baseline visit to June 30, 2017. After the follow-up period, a total of 54 new cases (male: 34; female: 20) of CKD stages 3-5 were identified. There are 233 males and 234 females in this study, and their ages range between 23 and 89 years (Table 1).
The oldest male was 89 years old, and the oldest female was 79 years old. Among 233 males, 199 were in CKD stages 1-2 and 34 were in CKD stages 3-5. Similarly, among 234 females, 214 were in CKD stages 1-2 and 20 were in CKD stages 3-5. The dataset contains a total of 23 features (numerical and categorical) that report demographic, biochemical, and clinical information about the CKD patients. The categorical features include the gender of the patient. Additionally, personal history factors are considered, such as diabetes history, CHD history, vascular disease history, smoking history, HTN history, dyslipidemia (DLD) history, and obesity history. Furthermore, disease-specific medicines, namely DLD medications, diabetes medications, HTN medications, and inhibitors (angiotensin-converting enzyme inhibitors or angiotensin II receptor blockers), are represented as binary values (0, 1).
A descriptive statistical analysis was performed using the mean ± standard deviation (SD) with an unpaired, two-tailed t-test for continuous variables and a frequency distribution with the Chi-squared test for categorical variables, in order to characterize the patients and their medical conditions. The statistical quantitative description of the categorical and numerical features is given in Tables 2 and 3, respectively. The levels of triglycerides (TG), glycosylated hemoglobin type A1C (HbA1C), serum creatinine (SCr), and systolic blood pressure (SBP) in the CKD group were significantly higher than in the non-CKD group, whereas the estimated glomerular filtration rate (eGFR), cholesterol, diastolic blood pressure (DBP), and body mass index (BMI) were lower. The data are expressed as the median, mean, and standard deviation. A p-value less than 0.05 was considered statistically significant. It can be observed from Table 3 that the p-values of covariates such as age, cholesterol, triglycerides, HbA1C, creatinine, eGFR, SBP, and follow-up time are less than 0.05, which indicates that these variables had a significant impact on CKD stages 3-5. The other covariates have no significant influence.
In this study, we employed the KM survival curve fitting approach in combination with the Weibull distribution to analyze and model the survival data. The aim was to determine the "decay rate" with respect to the follow-up time period, which was used as the dependent variable for subsequent regression models. The initial step involved fitting the KM survival curve using the Weibull distribution. We obtained a representation of the survival data by computing the two parameters of the Weibull distribution, γ (shape parameter) and λ (scale parameter). This allowed us to calculate the shape and scale of the survival curve, providing insight into the underlying survival trends. After obtaining the parameters γ = 1.53 and λ = 55.35, we determined the decay rate for the follow-up time. This result was used as the dependent variable in our regression models.
We employed two regression techniques, Support Vector Machine (SVM) [31] and Linear Regression (LR) [32], to investigate the relationship between the decay rate and other relevant features. To identify the most influential features, a feature ranking process was performed, which led to the selection of the top 11 predictors. Using the "SelectKBest" class in Python 3.10.12 with scikit-learn (version 1.2.2), we specifically employed feature ranking to pinpoint the top 10 most relevant features. This method extracts the features with the highest scores under a statistical test, here the chi-squared scoring function. These top 11 features were chosen to enhance both the predictive accuracy of our models and the interpretability of the results. Subsequently, the selected features served as the inputs for our regression models. For the regression analysis, we allocated 70% of the data for training the model and reserved the remaining 30% for testing and validation.
To assess the performance of the regression analyses, different metrics are used, namely R-score (R-squared), mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). The MAE is a metric that measures the average absolute difference between the original and predicted values over the entire dataset. It indicates how close the predictions are to the actual values. The MSE is the average squared difference between the original values and the predicted values, obtained by averaging the squared differences over the dataset. RMSE is the square root of MSE and expresses the prediction error in the units of the target variable. RMSE is a popular metric since it provides a measure of the average magnitude of the prediction errors. R-squared, alternatively referred to as the coefficient of determination, indicates the goodness of fit of the model by measuring how well the predicted values align with the original values. R-squared can be interpreted as the proportion of variability in the dependent variable that is explained by the independent variables. Its value typically ranges between 0 and 1, with a higher value indicating a better fit and 1 representing a perfect fit; on held-out data it can be negative when the model fits worse than the mean of the observations. The scores obtained from both the SVM and Linear Regression models were tabulated and compared in Table 4 in order to select the best prediction model.
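The four evaluation metrics defined above can be written out explicitly. The following is a minimal standard-library sketch on made-up numbers; the function name and the toy values are hypothetical and are not taken from the study's data or code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n    # mean absolute error
    mse = sum(e * e for e in errors) / n     # mean squared error
    rmse = math.sqrt(mse)                    # square root of MSE
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)      # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot               # coefficient of determination
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Toy example: two of four predictions are off by 0.5.
toy_metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(toy_metrics)
```

Because RMSE is the square root of MSE, the two always rank models identically; R-squared adds a scale-free view of the same residuals.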
Based on the comparison results provided in Table 4, linear models exhibit superior performance on this dataset. An optimal regression model minimizes the error, aiming for a value close to zero, while maximizing the variability of the target variable explained by the features, striving for a value close to one. The results indicate that the Linear Regression model outperformed the SVM model, demonstrating better predictive accuracy for the dataset used. Therefore, we consider the linear regression model, which has the lowest RMSE (0.069526) and the highest R² (0.954079), as the final prediction model. The performance of the linear regression model was assessed by comparing the actual observed values with the predicted values. Figure 4 presents the 'Actual vs. Prediction' plot, where each data point represents an observation in the dataset. The x-axis represents the observed values of the dependent variable, while the y-axis corresponds to the predicted values based on the regression model.
It can be observed from the plot that the majority of the data points align along a diagonal line, indicating a reasonably strong linear relationship between the predicted and actual values. This alignment indicates that the model has captured the underlying trends in the data. However, a small number of data points deviate from the diagonal line, indicating a certain level of discrepancy in the predictions. These deviations could be attributed to various factors, such as measurement errors or unaccounted variables that influence the dependent variable. The 'Actual vs. Prediction' plot demonstrates the satisfactory performance of the linear regression model in capturing the relationship between the predictors and the dependent variable. The model's capability to predict values that fall within a reasonable range of the observed values suggests its reliability for making predictions and extracting meaningful insights from the data.
We evaluated our predictive model using the five-fold cross-validation approach. This approach involves partitioning the dataset into five subsets, training the model on four subsets, and evaluating its performance on the remaining subset. This process is repeated five times, ensuring that each subset serves as the validation set exactly once. Table 5 provides a comparison of the cross-validation-based model performance metrics. The cross-validation approach gives a robust assessment of the model's performance, and the results confirm that our predictive model is reliable and effective.
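The five-fold procedure described above can be sketched as an index-splitting routine. This is a standard-library illustration only: the function name, the shuffling seed, and the interleaved fold assignment are assumptions made for the example and are not necessarily the routine used in the study.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training indices, validation indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]   # k nearly equal subsets
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation


# With 467 patients, each fold holds 93 or 94 patients for validation.
for fold_no, (train_idx, val_idx) in enumerate(k_fold_indices(467), start=1):
    print(fold_no, len(train_idx), len(val_idx))
```

Every patient index appears in exactly one validation fold, which is the property that makes the averaged metrics an out-of-sample estimate.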
To estimate the impact of various covariates on CKD stages 3-5, a semi-parametric Cox hazard model was fitted using the 'lifelines' module in Python 3.10.12; the obtained results are presented in Table 6.
The HR and corresponding p-values for each of the twenty-one variables are listed in this table. The HR was used to evaluate the relative risk of a variable. If the HR is greater than one, the variable is positively associated with the likelihood of CKD stages 3-5 and negatively correlated with survival time. If the HR is less than one, the association is in the other direction. It can be observed from Table 6 that the p-values of the covariates history of CHD, DLD medications, and SBP are less than 0.05, which indicates that these variables had a significant impact on CKD stages 3-5. The other covariates have no significant influence. The p-value for history of CHD is < 0.05 and the HR is 4.0603, indicating a strong relationship between the patients' history of CHD and CKD stages 3-5. The variable ranking based on CKD stages 3-5 is illustrated in Fig. 5.
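The link between a Cox regression coefficient and its hazard ratio can be made concrete with the reported HR for history of CHD. The sketch below is illustrative: the function names are hypothetical, and the coefficient is back-calculated from the reported HR of 4.0603 rather than read from Table 6.

```python
import math


def hazard_ratio(beta):
    """Hazard ratio for a one-unit increase in a covariate: exp(beta)."""
    return math.exp(beta)


def relative_hazard(beta, x_a, x_b):
    """Ratio h(t | x_a) / h(t | x_b); the baseline hazard h0(t) cancels."""
    return math.exp(beta * (x_a - x_b))


# Coefficient implied by the reported HR for history of CHD.
beta_chd = math.log(4.0603)

print(round(beta_chd, 4))
print(round(relative_hazard(beta_chd, x_a=1, x_b=0), 4))
```

A positive coefficient gives an HR above one, so under the proportional hazards form a patient with a history of CHD carries roughly four times the hazard of an otherwise identical patient without it, at every follow-up time.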
The figure provides a forest plot reporting the HR and the 95% confidence intervals (CI) of the HR for each covariate included in the Cox proportional hazards model. Only history of CHD, DLD medications, and SBP were found to be significant at the 0.05 cutoff. It is evident from the figure that history of CHD lies on the positive side of log(HR) (HR = 4.0603), meaning a higher hazard and a shorter survival time, while DLD medications lie on the negative side.
The concordance index, or C-index [33], provides a measure of the discriminative ability of the KM estimate and the Cox Proportional Hazards model in our study. The KM estimate achieved a C-index of 1.0, signifying that it distinguished between different outcomes and ordered survival times within our dataset without error. In contrast, the Cox Proportional Hazards model yielded a C-index of 0.7510, indicating a substantial but not flawless discriminatory power. This comparison suggests that the KM estimate outperforms the Cox model in terms of discrimination within our specific context. The KM estimate and the Cox Proportional Hazards model are both important tools in survival analysis, but they serve different purposes and have distinct advantages. Here are some advantages of the KM estimate over the Cox Proportional Hazards model: (i) KM estimates provide a non-parametric way to estimate survival curves and make no assumptions about the underlying hazard function, which can be advantageous when the assumptions of the Cox model do not hold; (ii) KM curves are easily interpretable and can be plotted to visualize survival probabilities over time for different groups or categories, which makes them valuable for descriptive and exploratory analysis; (iii) KM analysis is relatively simple and does not involve the complexities of modeling covariates, so it is a suitable choice when the focus is solely on estimating and comparing survival probabilities between groups; (iv) KM is the method of choice when the primary goal is to examine and describe time-to-event data without modeling covariates, and it is particularly useful for studying event occurrence in clinical trials and observational studies. However, while the KM estimate has these advantages, it is limited in its ability to model the impact of covariates on survival time and does not provide HRs. For such analyses, the Cox proportional hazards model may be more appropriate.
Following the selection of the superior regression model, we extracted the coefficients and intercept values from the model. These coefficients and intercepts were crucial in constructing a nomogram. A nomogram is a graphical representation that provides a simple and intuitive tool for predicting outcomes based on the regression model. It consists of four lines: the point line, the line for the risk factor, the line for the probability, and the line for the total number of points. The process of constructing these lines has been previously explained [34,35].
The point line is built by assigning values ranging from 0 to 100. The linear predictor ($LP_{mn}$) value is determined from a coefficient of the fitted regression model. If an independent attribute $X$ is categorical with $n$ categories, $(n - 1)$ dummy variables are generated. The formula for $LP_{mn}$ is as follows:
$$LP_{mn} = \beta_{mn} \times X_{mn} \tag{4}$$
Using this formula, $Points_{mn}$ are calculated for each risk category, rescaled to the 0-100 point line, and aligned to the respective risk factor lines. Here, $\beta_{mn}$ represents the regression coefficient value for the $n$th category of the $m$th risk factor, and $LP_{*n}$ indicates the LP value of the risk factor with the largest estimated range of attribute values. The probability line indicates the probability value associated with a given total point, which spans the range from 0 to 1. The total point line is derived by cumulatively summing the $Points_{mn}$ values.
The linear predictor of the logistic regression model is represented by the expression $\sum_{m,n} LP_{mn}$. The total number of points corresponding to each value of the probability line can be determined by substituting this expression into the previous one.
In this expression, the value on the probability line, $P(Y = 1 \mid X = x)$, is substituted to construct the total point line. By using the coefficients and intercept value ($\alpha$), a nomogram can be developed as shown in Fig. 6 to aid in clinical decision-making and risk assessment [34].
To predict the risk of CKD stages 3-5 for a patient with the following values: gender = 0, age = 89, history of smoking = 1, DM medications = 1, SBP = 92, and time follow-up = 5 months, each value is assigned to its respective points as illustrated in Fig. 7.
The resulting point values are 38, 100, 20, 0, 28, and 65. These numbers are then summed to get an overall point value of 251, which may be used to assess the risk of CKD stages 3 to 5 by consulting the nomogram's curve. Using these data, we may estimate that this patient has a 0.58% chance of developing CKD stages 3-5. This example demonstrates the practical application of nomograms to predict clinical outcomes. Figure 8 shows the nomogram results indicating the risk scores based on the established logistic regression model during the follow-up periods of 31-50 and 81-95 months, respectively.
Additionally, supplementary Figs. S1, S2, S3, and S4 provide the corresponding results for the follow-up periods of 16-30 months, 51-65 months, 66-80 months, and 96-111 months, respectively. The nomogram assessment considered various factors such as age, gender, medical history, laboratory results, and specific risk factors associated with CKD stages 3-5. By integrating these factors, we generated personalized risk scores for each patient. These risk scores are visually represented in Fig. 9, and the summary of results is provided in supplementary Table ST1.
The plot of patient ID versus risk score for CKD stages 3-5 provides a visual representation of the varying levels of risk associated with individual patients within these stages. The x-axis of the plot corresponds to the patient ID, which is a unique identifier assigned to each patient within the dataset. The patient IDs are organized in ascending order, so the patients' data points are plotted sequentially along the x-axis. The y-axis represents the risk score associated with stages 3-5 of CKD. The risk score is a quantitative measure that evaluates the probability or seriousness of complications associated with CKD. Through an analysis of the plot, one can observe the distribution of risk scores across the patients with CKD stages 3-5. Higher risk scores are typically associated with patients who have a higher probability of developing CKD stages 3-5.

Conclusion
This study presents a novel machine learning-driven nomogram for predicting CKD stages 3-5. The proposed approach offers an accurate and personalized risk assessment tool with the potential to improve early detection and preventive strategies. The integration of machine learning algorithms and comprehensive patient data contributes to the robustness and reliability of the developed nomogram. The proposed nomogram shows strong predictive capacity and may have major clinical implications for diagnosing CKD stages 3-5. Future research needs to focus on the integration of additional data sources and validation through prospective studies, fostering the translation of this nomogram into clinical practice and improving patient outcomes.

Figure 1. Flow diagram of study design and participants selection.

Figure 5. Cox proportional hazard model variable ranking based on log(HR).

Figure 7. An example of Nomogram results for CKD stages 3-5 to predict risk score.

Table 1. Explanation, measurement units, and intervals of each feature of the dataset. ACEI angiotensin-converting enzyme inhibitors, ARB angiotensin II receptor blockers, kg kilogram, mmol millimoles, mmHg millimetre of mercury.

Table 2. Statistical and quantitative description of the category features.

Table 3. Statistical and quantitative description of the numerical features.

Table 4. Comparison of prediction models using MSE, RMSE, MAE and R².

Table 5. Cross-validation-based model performance metrics comparison.

Table 6. Significance of variables under Cox regression analysis; significant estimated coefficients are highlighted.
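The point-summation step of the worked nomogram example (Fig. 7) can be reproduced in a few lines. This sketch only covers what the text states, namely the six point values and their total; the logistic mapping is the generic form of a logistic regression model, and its `alpha` and `linear_predictor` arguments are hypothetical inputs, since the fitted coefficients are not given in the text.

```python
import math

# Points read off the nomogram for the example patient described in the text.
example_points = {
    "gender = 0": 38,
    "age = 89": 100,
    "history of smoking = 1": 20,
    "DM medications = 1": 0,
    "SBP = 92": 28,
    "time follow-up = 5 months": 65,
}

total_points = sum(example_points.values())
print(total_points)


def logistic_probability(alpha, linear_predictor):
    """Generic logistic link: P(Y = 1 | X = x) = 1 / (1 + exp(-(alpha + LP)))."""
    return 1.0 / (1.0 + math.exp(-(alpha + linear_predictor)))
```

The total of 251 matches the value quoted in the worked example; converting it to a probability requires the nomogram's total point line, which depends on the fitted intercept and coefficients.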