Construct a classification decision tree model to select the optimal equation for estimating glomerular filtration rate and estimate it more accurately

Chronic kidney disease (CKD) has become a worldwide public health problem, and accurate assessment of renal function in CKD patients is important for treatment. Although the glomerular filtration rate (GFR) accurately reflects renal function, measuring it is complicated, so endogenous markers are often used to estimate GFR indirectly. However, the accuracy of existing GFR estimation equations is unsatisfactory. To estimate GFR more precisely, we constructed a classification decision tree model that selects the most suitable GFR estimation equation for each CKD patient. We searched the HIS system of the First Affiliated Hospital of Zhejiang Chinese Medicine University for all CKD patients who visited the hospital from December 1, 2018 to December 1, 2021 and underwent Gate's method of 99mTc-DTPA renal dynamic imaging to measure GFR. We collected 518 eligible subjects, who were randomly divided into a training set (70%, 362) and a test set (30%, 156). Using the training set, we built a classification decision tree model that chooses the most accurate of four equations, BIS-2, CKD-EPI(CysC), CKD-EPI(Cr-CysC) and Ruijin, and GFR was then estimated with the selected equation. We used the test set to verify the tree model and compared its GFR estimates with those of 13 other equations. Root mean square error (RMSE), mean absolute error (MAE) and Bland–Altman plots were used to evaluate the accuracy of the different methods. The final classification decision tree model included BSA, BMI, 24-hour urine protein quantity, diabetic nephropathy, age and RASi. In the test set, the RMSE and MAE of GFR estimated by the classification decision tree model were 12.2 and 8.5 respectively, lower than those of the other GFR estimation equations. According to the Bland–Altman plot for the test set, eGFR calculated from this model had the smallest degree of variation. Selecting a GFR estimation equation for each CKD patient with the classification decision tree model, and basing the final estimate on that selection, provided greater accuracy in GFR estimation.

and CKD may become one of the leading causes of death by 2040, ranking behind only ischemic heart disease, stroke, infection and COPD 10,11. Therefore, effectively preventing the progression of CKD has become a pressing medical problem.
Correct assessment of renal function in CKD patients is important for clinical treatment and for predicting patients' outcomes. As an indicator of glomerular filtration function, glomerular filtration rate (GFR) is currently considered the most valuable parameter for assessing renal function. However, measuring GFR in CKD patients is not a simple clinical procedure. At present, the main approach is to measure the dynamic clearance of exogenous markers such as inulin, iohexol, 125I-iothalamate or technetium-99m-diethylenetriaminepentaacetic acid (99mTc-DTPA) to determine GFR. Although inulin clearance is considered 'the gold standard' for determining GFR, the method is too complex and expensive to implement routinely 12. Gate's method of 99mTc-DTPA renal dynamic imaging, which is recommended by the Nephrology Committee of the Society of Nuclear Medicine and serves as a reliable method for determining GFR, is used only in some medical centers and cannot cover all patients 13. Therefore, endogenous markers such as creatinine, cystatin C and β2-microglobulin have been widely used to assess renal function, and various equations have been developed to calculate an estimated glomerular filtration rate (eGFR).
Although estimating GFR from endogenous markers such as creatinine and cystatin C is the conventional approach, it is limited by the accuracy of the estimation equation. If eGFR deviates too much from the actual GFR, it will affect the clinician's judgment of the patient's condition and the therapeutic regimen. For example, Yeli Wang et al. found that the CKD-EPI equation did not reliably estimate GFR in a South Asian population 14. Marco van Londen et al., in a study of living kidney donors, found that none of the current estimation equations accurately estimated the donors' GFR, and that they are likely to underestimate the true rate of GFR decline from 3 months to 5 years after donation 15. In addition, some equations customized for specific ethnic groups showed large deviations in validation populations in subsequent studies 16.
Therefore, we believe that developing new equations for specific populations or based on new endogenous markers may not be an optimal solution. In this study, we attempted to construct a classification decision tree model that selects a more appropriate eGFR equation for each CKD patient and thereby obtains a more accurate estimate of glomerular filtration rate.

Results
Clinical characteristics and demographic data of the patients. We collected 518 eligible subjects, 70% (362) of whom were assigned to the training set to construct the model and 30% (156) to the test set to verify whether the model is accurate and reliable. Most CKD patients in this study were over 50 years old, with an average age of 60.63 years. The number of males was larger than that of females, accounting for 61.88% of the training set and 66.67% of the test set. Kidney transplant patients were excluded from this study. Only about one-third of patients had a renal biopsy with a definite renal pathology diagnosis, and half of them were diagnosed with diabetic nephropathy. Among primary glomerular diseases, IgA nephropathy accounted for the highest proportion, about 20% of all pathologically confirmed patients. Our study also included a small subset (8.11%) of CKD patients who were taking calcium dobesilate. Such patients are usually excluded from the development of GFR estimation equations because calcium dobesilate interferes with creatinine measurement and causes overestimation of glomerular filtration rate. Details of patients' clinical and demographic data are shown in Table 1.
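The 70/30 random division of the 518 subjects can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `split_train_test`, the fixed seed and the use of integer patient indices are hypothetical, since the text states only that subjects were randomly divided.

```python
import random


def split_train_test(patient_ids, train_fraction=0.7, seed=0):
    """Randomly split patient identifiers into a training and a test list.

    The seed is an assumption made here for reproducibility; the source
    does not report how the random division was generated.
    """
    shuffled = list(patient_ids)
    random.Random(seed).shuffle(shuffled)
    # Truncate so that 518 subjects give 362 training subjects.
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers 0..517 standing in for the 518 eligible subjects.
train_ids, test_ids = split_train_test(range(518))
print(len(train_ids), len(test_ids))
```

With 518 subjects, truncating 518 × 0.7 reproduces the reported set sizes of 362 and 156.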

Classification decision tree model and variable importance.
A total of 28 variables, including patient demographic data, past medical history, medication status, renal pathology results and laboratory measurements, were used to construct a classification decision tree model. We then identified 15 relatively important variables and, from these, selected the key variables to build the final classification decision tree model (Fig. 1).
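How the class label for each training patient was derived is not spelled out here. One plausible reading, assumed in the sketch below, is that each patient is labelled with whichever of the four candidate equations gives the smallest absolute error against the measured GFR, and the tree is then trained to predict that label from the clinical variables. The function name `best_equation` and the example values are hypothetical.

```python
# Candidate equations named in the text; the order only matters for ties.
EQUATIONS = ("BIS-2", "CKD-EPI(CysC)", "CKD-EPI(Cr-CysC)", "Ruijin")


def best_equation(measured_gfr, estimates):
    """Return the equation whose estimate is closest to the measured GFR.

    `estimates` maps each equation name to its eGFR for one patient.
    On a tie, the equation listed first in EQUATIONS is returned.
    """
    return min(EQUATIONS, key=lambda name: abs(estimates[name] - measured_gfr))


# Made-up values for one patient, in mL/min per 1.73 m^2.
patient_estimates = {
    "BIS-2": 52.0,
    "CKD-EPI(CysC)": 61.0,
    "CKD-EPI(Cr-CysC)": 57.0,
    "Ruijin": 48.0,
}
label = best_equation(55.0, patient_estimates)  # absolute errors: 3, 6, 2, 7
selected_egfr = patient_estimates[label]
print(label, selected_egfr)  # CKD-EPI(Cr-CysC) 57.0
```

At prediction time the measured GFR is unavailable, so the trained tree would supply the label and the corresponding equation's estimate would be reported as the eGFR.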
BSA, BMI, 24-hour urine protein quantity, diabetic nephropathy, age and RASi were selected to constitute the final classification decision tree model. As shown in Fig. 2, when a patient with CKD was diagnosed with diabetic nephropathy, the BIS-2 equation was recommended directly; otherwise, the patient was assessed based on RASi, BSA, 24-hour urine protein quantity and age. RASi may be a key factor influencing GFR estimation in patients with non-diabetic nephropathy. Meanwhile, the Ruijin equation or the CKD-EPI (Cr-CysC) equation may be more appropriate for patients with low BSA or BMI. The CKD-EPI (CysC) equation may be more suitable for CKD patients with higher BMI and BSA, which may be related to the fact that cystatin C is less affected by metabolic factors than creatinine.

Classification decision tree model estimation of GFR performance. We estimated GFR with 13 known equations and with the classification decision tree model. The results were then compared with the measured GFR normalized to a standard body surface area of 1.73 m². We found that the GFR estimated by the Ruijin, BIS-2, CKD-EPI and abbreviated MDRD equations and by the classification decision tree model approximated the levels measured by 99mTc-DTPA (Table 2; GFR estimated by the tree model and traditional equations for the training set and total population is shown in Supplement 1).
RMSE and MAE were used to evaluate the accuracy of these equations and the classification decision tree model in estimating GFR. The RMSE and MAE of the Ruijin, BIS-2 and CKD-EPI (Cr-CysC) equations were far lower than those of the other 10 equations, indicating more accurate estimation (Table 3). However, our classification decision tree model combined the accurate predictions of the BIS-2, CKD-EPI (CysC), CKD-EPI (Cr-CysC) and Ruijin equations for specific populations, giving a more precise estimation of GFR for CKD patients. MAE better represents the accuracy of GFR estimation by the model, and the MAE of the classification decision tree model in the test set was only 8.5, much lower than that of the other estimation equations (RMSE and MAE for the training set and total population are shown in Supplement 1).
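For reference, the two accuracy metrics can be computed as in the short sketch below, which uses their standard definitions. The function names and the example values are illustrative only and are not taken from the study data.

```python
import math


def rmse(measured, estimated):
    """Root mean square error between measured and estimated GFR."""
    if len(measured) != len(estimated) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - m) ** 2 for m, e in zip(measured, estimated)]
    return math.sqrt(sum(squared) / len(squared))


def mae(measured, estimated):
    """Mean absolute error between measured and estimated GFR."""
    if len(measured) != len(estimated) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - m) for m, e in zip(measured, estimated)) / len(measured)


# Made-up values in mL/min per 1.73 m^2; the errors are 4, -3, 0 and -5.
measured_gfr = [60.0, 45.0, 30.0, 90.0]
estimated_gfr = [64.0, 42.0, 30.0, 85.0]
print(rmse(measured_gfr, estimated_gfr))
print(mae(measured_gfr, estimated_gfr))  # 3.0
```

Because RMSE squares each error before averaging, it penalizes a few large deviations more heavily than MAE does, which is why the two metrics can rank equations differently.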

Comparison of deviation of various methods for estimating GFR.
In the box plot (Fig. 3), the Cockcroft-Gault, MDRD and Chinese modified MDRD equations were far less accurate than the others and tended to overestimate the glomerular filtration rate of CKD patients. Over-optimistic estimates of renal function in CKD patients might seriously affect clinical decision making and bring unpredictable risks to patients. Although the average deviation of the other equations was not remarkably different from that of the classification decision tree model, the estimation bias of the classification decision tree model was more concentrated (variations in eGFR for the different equations in the training set and total population are shown in Supplement 1).

Comparison of degree of variation in GFR estimation bias.
According to the Bland-Altman plots, the eGFR based on the classification decision tree model maintained a high degree of consistency across patients. The Bland-Altman plots also showed that equations with small RMSE and MAE, such as BIS-2, Ruijin and abbreviated MDRD, still had a large deviation (Fig. 4; Bland-Altman plots for the training set and total population are shown in Supplement 1).
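The quantities behind a Bland-Altman plot can be sketched as follows. This assumes the conventional construction (difference plotted against the pairwise mean, with limits of agreement at the mean difference ± 1.96 standard deviations); the text does not state which limits were used, and the function name and example values are hypothetical.

```python
import statistics


def bland_altman(measured, estimated):
    """Return the points and summary lines of a Bland-Altman plot.

    x values are the pairwise means, y values the differences
    (estimated minus measured). The 1.96 multiplier is the conventional
    choice and is an assumption here.
    """
    means = [(m + e) / 2 for m, e in zip(measured, estimated)]
    diffs = [e - m for m, e in zip(measured, estimated)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower_limit": bias - 1.96 * sd,
        "upper_limit": bias + 1.96 * sd,
    }


# Made-up values in mL/min per 1.73 m^2.
result = bland_altman([60.0, 45.0, 30.0, 90.0], [64.0, 42.0, 30.0, 85.0])
print(result["bias"], result["lower_limit"], result["upper_limit"])
```

A narrower band between the two limits corresponds to the "smaller degree of variation" reported for the tree model, independently of whether the mean bias is close to zero.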

Discussion
Chronic kidney disease has become a worldwide public health problem, and most patients with CKD experience irreversible renal impairment and eventually progress to ESRD 1. Protecting renal function is one of the prerequisites for improving the prognosis of CKD patients. In addition, the kidney is a major organ for drug metabolism and excretion, which means that the choice and dosage of drugs must be adjusted according to renal function in CKD patients. Therefore, accurate assessment of renal function in CKD patients is critical.
As the most reliable indicator of renal function, GFR is of great significance for patients and is the best basis for clinical decisions 17. However, actually measuring GFR is a very complex operation. Most of the time, clinicians only estimate GFR by combining serum levels of endogenous markers such as creatinine and cystatin C with GFR estimation equations. This raises a new problem: estimation bias, which may be unacceptably large and lead to misjudgment of patient outcomes 18. How to reduce eGFR bias has always been a topic of interest to nephrologists. Previous studies mainly focused on developing new estimation equations or finding new endogenous markers 16,19. In this study, we integrated several equations and used the classification decision tree model to determine the optimal equation for each patient, obtaining a more accurate eGFR.
There are many explanations for this estimation bias, which can be divided into two categories 20. First, some eGFR equations are seriously over-fitted. They show high accuracy in the populations in which they were developed, but lose robustness in populations outside that range. Currently, more than 80% of GFR estimation equations are based on clinical data from Caucasian or black populations, while only a small percentage include data from Asian populations, so they may show a large bias in Asian countries such as China and Japan or in other ethnic minority areas 14,21-24. In addition to race and genetic specificity, variation in patients' disease status is also one of the main sources of equation bias. Hyperperfusion and hyperfiltration of the glomerulus are common in obese or diabetic patients, which may also lead to inaccurate estimates 25,26. Bassiony et al. 27 found that none of the commonly used formulae for GFR estimation was accurate enough in morbidly obese patients, and that only the 24-hour creatinine excretion rate could be used to estimate renal function indirectly. The unique pathophysiological and hemodynamic characteristics of patients with obstructive nephropathy or transplantation also make many GFR estimation equations unsuitable for them 15,28.
Second, many equations rely on a single endogenous marker to estimate GFR, and its serum concentration is likely to be affected by other factors. For example, calcium dobesilate (calcium hydroxybenzene sulfonate) is given to many CKD patients with diabetic peripheral vascular disease; it can markedly affect serum creatinine measurement and lead to a significant overestimation of eGFR in all creatinine-based estimation equations. In addition, factors such as patient muscle mass, diet, exercise and metabolic level can interfere with endogenous marker levels, contributing to bias in the estimation of GFR. Pottel et al. developed an equation based on creatinine, which was more accurate than others. However, this equation has a large bias in patients with reduced creatinine production, such as those with anorexia, paralysis, malnutrition, proteinuria or hypoalbuminemia 29. Xie et al. also proposed that inflammatory state and thyroid function could affect the estimation of GFR 26.
Therefore, researchers have attempted to use novel endogenous filtration markers, or to combine multiple endogenous markers, to develop eGFR equations for specific populations or multi-parameter estimation equations 30-32. Unfortunately, these methods have not provided more accurate estimates of GFR. Li et al. developed the Xiangya GFR estimation equation for the Chinese CKD population, and it performed much better than the CKD-EPI equation in the validation data 19. However, this equation did not show strong robustness in subsequent studies 16.
The MDRD is a classical multi-parameter equation that includes the creatinine, urea nitrogen, age and serum albumin levels of CKD patients. However, both Hu 33 and our results showed that the accuracy of the MDRD and Chinese modified MDRD equations is not ideal. Moreover, adding too many parameters increases the complexity of an equation, which is inconvenient for clinicians 33. We believe that overfitting and interference from non-renal factors may be present in any GFR estimation equation, and that developing a new equation may not fundamentally eliminate these two problems 17,18. New equations or correction coefficients have been developed for many countries and ethnic populations. For example, the Ruijin and Xiangya equations developed for Chinese patients are greatly improved over the CKD-EPI and MDRD equations developed for Caucasian and black people. Instead of applying the same equation to all CKD patients, we combined multiple equations and chose a suitable and accurate GFR estimation equation for each CKD patient. To do so, we used machine learning to predict the most accurate GFR estimation equation for each CKD patient and then estimated GFR with it. We chose the classification decision tree model to implement this process. Our classification decision tree model indicates that BSA, BMI, 24-hour urine protein quantity, diabetic nephropathy, age and RASi may be vital factors for GFR estimation bias. To achieve a more accurate estimate of GFR, the classification decision tree model was used to classify CKD patients according to these variables and then to select the optimal estimation equation for them to minimize the bias. The MAE and RMSE of the classification decision tree model were far smaller than those of the other 13 equations, which supported our hypothesis. This indicated that the classification decision tree model could combine the advantages of multiple equations and automatically select the appropriate equation to obtain a more accurate eGFR.
However, this study has some limitations. First, the sample size is relatively limited. The 99mTc-DTPA measurement of GFR in CKD patients is not a clinical procedure that can be performed in every hospital, which makes data sources very limited. To ensure the robustness of the model, only subjects with complete clinical data were included, which forced us to exclude many patients. As clinical data accumulate, we look forward to optimizing our model with larger data sets. Second, the lack of an independent external validation set may prevent us from objectively evaluating our model. However, we randomly selected 30% of the total population as the test set, which keeps the validation data separate from the training data. Third, a classification decision tree is only a weak classifier in machine learning. However, it is a classic machine learning model that, unlike "black box" models such as random forests and neural networks, has a decision process that can be clearly displayed, which is also important for clinicians. Finally, all of our data are from mainland China and only the Chinese population is included in our study, so our model may fail to predict accurately in other populations such as Europeans, Americans and Africans. Nevertheless, we believe that the tree model is a reliable method to reduce the deviation in estimating GFR. This method can also be applied to other populations, and adjusting the tree model parameters for different populations may be a new problem in the future. In summary, using a classification decision tree to select the estimation equation and estimate GFR is a novel approach. We used machine learning to combine the advantages of multiple estimation equations and obtain a more accurate estimate of GFR.
Taken together, this study provides an optimized machine learning approach that can efficiently select the appropriate equation and estimate GFR more accurately, which will help nephrologists precisely assess renal function in CKD patients.

Table 2. GFR estimated by decision tree model and traditional equations based on BSA for test set (mean (P25-P75)). a All GFR estimation equations were converted to a uniform unit, mL/min per 1.73 m². b sGFR: GFR was measured by 99mTc-DTPA and converted to the 1.73 m² standard body surface area based on the patient's body surface area.

Methods
Study design and subjects. This is a retrospective study. We searched the HIS system of the First Affiliated Hospital of Zhejiang Chinese Medicine University for all CKD patients who visited the hospital from December 1, 2018 to December 1, 2021 and underwent Gate's method of 99mTc-DTPA renal dynamic imaging to measure GFR. Inclusion criteria were as follows: (1) clinically diagnosed with CKD; (2) 99mTc-DTPA GFR measured at the time of the visit, with creatinine and cystatin C available. Exclusion criteria were as follows: (1) age < 18 years; (2) hemodialysis or peritoneal dialysis treatment within three months before the creatinine and cystatin C measurement; (3) missing critical information, such as age or gender.

Basic information collection.
In this study, we collected the patients' demographic data (age, sex, height, weight, systolic blood pressure, diastolic blood pressure, etc.), conditions related to renal disease (renal biopsy result, glucocorticoid treatment and use of immunosuppressants), previous history (cancer, diabetes, stroke, hyperuricemia, etc.) and medication (diuretics, SGLT2i, RASi, etc.). The Du Bois equation was used to calculate body surface area (BSA).
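As an illustration of the BSA calculation and of the conversion of GFR to the standard 1.73 m² body surface area used throughout the comparisons, a minimal sketch follows. The Du Bois coefficients are the standard published ones (weight in kg, height in cm); the function names and example patient are hypothetical, and the normalization shown is the usual proportional scaling, which is assumed here to match the conversion described for the measured GFR.

```python
def du_bois_bsa(weight_kg, height_cm):
    """Body surface area in m^2 by the Du Bois equation."""
    return 0.007184 * weight_kg ** 0.425 * height_cm ** 0.725


def normalize_gfr(gfr_ml_min, bsa_m2, standard_bsa_m2=1.73):
    """Scale an absolute GFR (mL/min) to mL/min per 1.73 m^2."""
    return gfr_ml_min * standard_bsa_m2 / bsa_m2


# Hypothetical patient: 70 kg, 170 cm, absolute measured GFR of 80 mL/min.
bsa = du_bois_bsa(70.0, 170.0)
normalized = normalize_gfr(80.0, bsa)
print(round(bsa, 2), round(normalized, 1))
```

A patient whose BSA is above 1.73 m² therefore has a normalized GFR lower than the absolute value, and vice versa.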
Laboratory and GFR measurements. Serum creatinine was measured by the sarcosine oxidase method with reagents purchased from Zhongsheng Beikong Biotechnology Co., LTD. (92644093). Cystatin C was measured by latex-enhanced immunoturbidimetry with reagents purchased from Zhejiang Content Biotech Co., Ltd. (20210901). 24-hour urinary protein quantity was measured by the pyrogallol red-molybdate method with reagents purchased from Beijing Leadman Biochemistry Co., Ltd. (21011107). The instrument was Abbott's ARCHITECT C16000 automatic biochemical analysis system (c16000659), and the measurements were performed by the Clinical Laboratory of Zhejiang Hospital of Chinese Medicine. GFR was measured by Gate's method of 99mTc-DTPA renal dynamic imaging 34,35 using a single photon emission computed tomography scanner (INFINIA 17261, GE Healthcare). 99mTc-DTPA was used as the renal dynamic imaging agent at a dose of 185 MBq. Imaging was performed with the patient in the supine position and acquired from the posterior view. 99mTc-DTPA 185 MBq was injected intravenously, and acquisition was started at the same time. Both kidneys were imaged continuously. Acquisition used a low-energy collimator, a 20% window width, a 64 × 64 matrix, a 140 keV energy peak and 1-1.5 magnification. Dynamic acquisition lasted 31 min: the blood perfusion phase was acquired for 1 min at 2 s/frame, and the functional phase for 30 min at 15 s/frame. The radioactivity count of the syringe was measured before and after injection. After imaging was completed, the left and right kidneys and the background were manually delineated using ROI technology to generate time-radioactivity curves and calculate the glomerular filtration rate of both kidneys.

Statistical analysis. Statistical description and data set split. All continuous variables were presented as the mean (P25-P75) and categorical variables were described as n (%). All the patients were randomly divided into two groups: 70% of subjects (362) into the training set and 30% (156) into the test set.

Ethical approval. This study was approved by the Ethics Committee of the First Affiliated Hospital of Zhejiang Chinese Medicine University (2022-KL-030-01). We confirm that all research processes were carried out in accordance with relevant guidelines and regulations and under the supervision of the ethics committee and other regulators. Because our study was a retrospective analysis using data from the hospital information system, no intervention in the subjects was involved. We de-identified all patient information at the beginning of the study to fully protect the rights and interests of patients. The ethics committee therefore waived the requirement for informed consent.

Data availability
All data and models generated or used during the study appear in the submitted article. The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.