An early prediction model for chronic kidney disease

Given the high incidence of chronic kidney disease (CKD) in recent years, a better early prediction model is needed to identify high-risk individuals before end-stage renal disease (ESRD) occurs. We conducted a nested case–control study in 348 subjects (116 cases and 232 controls) from the “Tianjin Medical University Chronic Diseases Cohort”. None of the subjects had CKD at baseline, and they were followed up for 5 years until August 2018. Using multivariate Cox regression analysis, we found five nongenetic risk factors associated with CKD risk. Logistic regression was performed to select single nucleotide polymorphisms (SNPs) obtained from GWAS analysis of the UK Biobank and other databases. We used a logistic regression model and natural logarithm OR value weighting to establish CKD genetic and nongenetic risk prediction models. The final comprehensive prediction model is the arithmetic sum of the two optimal models. The AUC of the prediction model reached 0.894, with a sensitivity of 0.827 and a specificity of 0.801. We found that age, diabetes, and normal high values of urea nitrogen, TGF-β, and ADMA were independent risk factors for CKD. A comprehensive prediction model was also established, which may help identify, at an early stage, the individuals who are most likely to develop CKD.


Results
In this nested case-control study, 348 participants (all with eGFR ≥ 60 mL/(min·1.73 m²) at baseline) were included to build a 5-year risk prediction model for the onset of CKD: 116 cases and 232 controls (Fig. 1). Subjects who reached eGFR < 60 mL/(min·1.73 m²) during the 5-year follow-up were considered cases. The baseline characteristics of the included participants are described in Table 1. The levels of fasting plasma glucose (FPG), total cholesterol (TC), urea nitrogen (BUN), serum creatinine (SCr), total protein (TP), globulin (GLB), systolic blood pressure (SBP), cystatin C (CysC), transforming growth factor-β (TGF-β), and asymmetric dimethylarginine (ADMA) in the CKD group were significantly higher than those in the controls. The CKD group was significantly older than the non-CKD group, and the incidences of type 2 diabetes and hyperuricemia were higher in the CKD group (Table 1). In addition, triglyceride (TG), serum uric acid (SUA) and body mass index (BMI) levels in the CKD group were higher than those in the non-CKD group, but the differences were not statistically significant.

Non-genetic risk factors for CKD.
A Cox proportional hazards regression model showed that age, diabetes mellitus, and normal high values of urea nitrogen, TGF-β, and ADMA were independent risk factors for CKD (Table 2; Supplementary Table S3). We defined age ≥ 60 years as elderly and took the upper quartile of each of the other measurements as its normal high value. Kaplan-Meier survival analyses showed that being elderly, having diabetes, and having normal high values of urea nitrogen, TGF-β, or ADMA were each significantly associated with CKD onset in our cohort (Fig. 2).
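The dichotomization described above (upper quartile as the "normal high" cut-off) can be illustrated with a minimal sketch. The BUN values below are hypothetical, the function names are illustrative, and the quantile interpolation method is an assumption, since the source does not state how the quartile was computed.

```python
import statistics

def upper_quartile_cutoff(values):
    """Return the third quartile (75th percentile) of a list of measurements.

    The 'inclusive' interpolation method is an assumption; the source only
    says that the higher quartile was taken as the normal high value.
    """
    _q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3

def flag_normal_high(values, cutoff):
    """Code each measurement as 1 (at or above the cut-off) or 0 (below it)."""
    return [1 if v >= cutoff else 0 for v in values]

# Hypothetical serum urea nitrogen values in mmol/L (not data from the study).
bun_values = [4.1, 4.8, 5.2, 5.5, 5.9, 6.3, 4.4, 5.0]
bun_cutoff = upper_quartile_cutoff(bun_values)       # approximately 5.6 here
bun_flags = flag_normal_high(bun_values, bun_cutoff)
```

The resulting 0/1 indicators are the kind of binary covariates that enter the Cox model and, later, the nongenetic risk score.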

Internal validation.
In the nested case-control study, bootstrap five-fold cross-validation was carried out for the different prediction models of CKD onset. After the validation results were averaged, the AUC values of the nongenetic, genetic, and comprehensive prediction models of CKD were 0.786, 0.692, and 0.820, respectively.

Discussion
Early prediction of CKD is challenging. Decades of research have shown that diabetic nephropathy, primary glomerulonephritis, hypertension, interstitial nephritis, and polycystic kidney disease can all induce CKD. Awareness of CKD is notoriously low; once CKD has developed, treatment options are usually limited until the last-resort remedies of dialysis and renal transplantation are needed for ESRD. The eGFR is a sensitive indicator of renal function; however, it is not an early predictor of CKD. Although many biomarkers have been tested for CKD, reappraisal in prospective cohort studies with large sample sizes is needed. We therefore sought an early, sensitive, easy-to-perform and cost-effective prediction model. We carried out a nested case-control study for CKD prediction within the "Tianjin Medical University Chronic Disease Cohort" 26,27, which is specific to the local population and facilitates prediction of the 5-year probability of CKD onset in this area. The average age of the subjects was 63 years; thus, these individuals were more likely to develop CKD than younger subjects.
We combined traditional laboratory indicators, multiple biomarkers related to renal function, and SNP loci to develop CKD prediction models. In the NGRS model, we not only included some indicators that were used in other studies, such as diabetes and age 25,28,29, but also employed several biomarkers, especially TGF-β and ADMA, as early CKD predictors.
Although hundreds of associations between CKD and susceptibility genes have been found, and large-sample GWASs have yielded highly significant results, genetic factors provided only a small improvement in the prediction model. For a given SNP, the genetic relative risk (GRR) may be high while its contribution to CKD risk in the general population, and to phenotype variance, remains limited. All 17 SNPs employed in our study were from GWASs of the UK Biobank and other large cohorts; however, the AUC of the genetic risk score (GRS) model was only 0.643, and it gave only a marginal improvement in the AUC of the comprehensive model (from 0.889 to 0.894). A study in Japan likewise showed that genetic predictors do not contribute significantly to the prediction efficiency of a comprehensive prediction model 29.
Several biomarkers were tested and included in our prediction model. The plasma TGF-β level, along with ADMA, provided better predictive value than the more direct glomerular filtration indicator cystatin C. In our previous study, we found that TGF-β pathway genes were highly expressed in renal biopsies from very early stage diabetic nephropathy, long before renal fibrosis and decreased filtration occurred. Indeed, screening for early biomarkers before eGFR decreases may allow CKD to be predicted several years earlier, although early treatment could be another obstacle to overcome.
This study has a few limitations. First, the research on CKD-related biomarkers was carried out in a nested case-control study selected from a chronic disease cohort, and the sample size was relatively small; therefore, the results may be subject to some bias. Second, our risk prediction model focused only on the onset of CKD and did not assess the progression of CKD to renal failure or other complications. Third, participants in the "Tianjin Medical University Chronic Disease Cohort" were mostly teachers and government employees who worked in urban areas. This group tends to be more self-disciplined and to pay more attention to health. Whether our prediction model can be applied to other populations requires external validation. Our future studies will measure more renal function-related biomarkers in larger cohorts to validate and improve the prediction model for CKD.
Table 2. Non-genetic multivariate Cox regression analyses and non-genetic risk models (NGRS). TGF-β transforming growth factor-β, ADMA asymmetric dimethylarginine, BUN urea nitrogen, NGRS non-genetic risk score, HR hazard ratio, CI confidence interval. a Defined as the serum concentration of TGF-β ≥ 1.011 pg/mL. b Defined as the serum concentration of ADMA ≥ 0.019 μmol/L. c Defined as the serum concentration of BUN ≥ 5.9 mmol/L. d Defined as the age of the participants ≥ 60 years.
www.nature.com/scientificreports/
Recently, numerous predictive models have been established and have come into clinical use for decision-making. Among them are several models estimating the risk of prevalent and incident CKD 22,28-31.
However, due to differences in race, lifestyle, and geographic environment, it is still necessary to develop an effective predictive model for chronic kidney disease in different ethnic groups, which can help to identify people with higher CKD risks earlier, thus improving health care by allocating resources to those individuals who benefit most from it while preventing the potential abuse of health care resources by individuals who are at low risk.

Methods
Study design and population. This research was designed as a nested case-control study involving 348 participants from the "Tianjin Medical University Chronic Diseases Cohort". The cohort was established in 2006, with an initial 2068 people attending an annual physical examination. By the end of 2018, a total of 21,750 people had been recruited to the cohort, with the longest follow-up period being 13 years. We collected demographic markers, laboratory markers, and genotyping results for 110 loci (including 380 cases with genome-wide genotyping data). We screened participants who met the following criteria: (i) a follow-up period of at least 5 years; (ii) no CKD at the first physical examination; and (iii) available blood samples and other important information. Of these, 1804 were eligible; 116 were selected as the case group, and 232 were selected as the control group, matched on sex and age (± 3 years). A total of 348 subjects were therefore included. All subjects denied a family history of inherited diseases and the use of nephrotoxic drugs.
This study was reviewed and approved by the Ethics Committee of Tianjin Medical University, and all participants signed informed consent forms.
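The 1:2 matching on sex and age (± 3 years) described in the study design can be sketched as follows. This is only an illustration under stated assumptions: the source does not describe the matching algorithm, so the greedy nearest-age rule, the record layout, and the function name are all hypothetical.

```python
def match_controls(cases, candidates, per_case=2, max_age_gap=3):
    """Greedy matching sketch: for each case, pick up to `per_case` unused
    controls of the same sex whose age differs by at most `max_age_gap` years,
    preferring the closest ages. Not necessarily the authors' procedure."""
    used = set()
    matches = {}
    for case in cases:
        eligible = [
            c for c in candidates
            if c["id"] not in used
            and c["sex"] == case["sex"]
            and abs(c["age"] - case["age"]) <= max_age_gap
        ]
        eligible.sort(key=lambda c: abs(c["age"] - case["age"]))
        chosen = eligible[:per_case]
        used.update(c["id"] for c in chosen)
        matches[case["id"]] = [c["id"] for c in chosen]
    return matches

# Hypothetical records (not study data).
example_cases = [{"id": "c1", "sex": "F", "age": 62}]
example_candidates = [
    {"id": "k1", "sex": "F", "age": 61},
    {"id": "k2", "sex": "M", "age": 62},
    {"id": "k3", "sex": "F", "age": 66},
    {"id": "k4", "sex": "F", "age": 64},
    {"id": "k5", "sex": "F", "age": 59},
]
example_matches = match_controls(example_cases, example_candidates)
```

A greedy pass is order-dependent; randomized or optimal matching would be reasonable alternatives.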
Diagnostic criteria. The diagnostic criteria for CKD were eGFR < 60 mL/(min·1.73 m²) or positive proteinuria (≥ 1+). The glomerular filtration rate was estimated using the simplified Chinese MDRD equation 32.
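As a small illustration of the case definition above, the two criteria can be combined in a single check. The integer coding of the dipstick result (number of plus signs, with 0 for a negative result) and the function name are assumptions made for this sketch; the eGFR value is taken as already computed.

```python
def meets_ckd_criteria(egfr, proteinuria_plus):
    """Return True if either diagnostic criterion is met.

    egfr: estimated GFR in mL/(min·1.73 m²), computed beforehand.
    proteinuria_plus: dipstick grade coded as the number of '+' signs
        (0 = negative); this coding is an assumption of the sketch.
    """
    return egfr < 60 or proteinuria_plus >= 1
```

For example, `meets_ckd_criteria(55, 0)` and `meets_ckd_criteria(90, 1)` both flag CKD, whereas `meets_ckd_criteria(90, 0)` does not.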

Measurements of biomarkers.
After twelve hours of fasting, participants' venous blood samples were collected into nonanticoagulant blood collection tubes at 7:30-9:00 am, incubated at room temperature for half an hour and then centrifuged at 3000 rpm at 4 °C for 10 min to separate serum. The serum was stored at − 80 °C before analysis. Levels of fasting plasma glucose, serum creatinine, urea nitrogen, serum uric acid, total cholesterol, triglyceride, alanine aminotransferase, total protein, albumin, globulin, total bilirubin, and direct bilirubin were determined using a Hitachi automatic biochemical analyzer. Cystatin C (CysC), transforming growth factor beta (TGF-β), asymmetric dimethylarginine (ADMA) and neutrophil gelatinase-associated lipocalin (NGAL) were measured by ELISA kits (Shanghai Huyu Biotechnology Co., LTD).

Selection of CKD-related nongenetic/genetic risk factors.
We entered 21 potential risk factors, including several biomarkers, into a univariate Cox proportional hazards model (Supplementary Table S3); significant factors were then taken as explanatory variables in a multivariate Cox proportional hazards regression model. Finally, we obtained five nongenetic risk factors (Table 2; Fig. 2). After obtaining access to part of the UK Biobank database, we used PLINK to perform genome-wide association studies (GWAS) for renal function-related indicators, including eGFR, SCr, and CysC. The GWAS results are shown in the Manhattan plot (Supplementary Fig. S1). Combined with the results of previous studies, a total of 10 SNP loci on 10 genes were screened (Supplementary Table S1). Meanwhile, after integrating information from GWAS databases, the UCSC Genomic Bioinformatics Database, and GWAS results for kidney function-related phenotypes in Asia or China 35-37, SNP loci with both a high genotype relative risk (GRR) and a high genome-wide polygenic score (GPS) for CKD were selected. Finally, we selected a total of 27 SNP loci from 24 genes to construct a genetic risk model for CKD (Supplementary Table S2). The 27 SNPs selected in this study were genotyped in the 348 nested case-control subjects using a matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF-MS) platform. Hardy-Weinberg equilibrium (HWE) was checked for all 27 SNPs, and we removed 2 SNPs that failed HWE; therefore, genotyping data for 25 SNPs were retained.
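The HWE check mentioned above is commonly done with a one-degree-of-freedom chi-square goodness-of-fit test on genotype counts. The sketch below shows that standard test; the source does not say which test, threshold, or subject group was used, so those details and the function name are assumptions.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test for Hardy-Weinberg equilibrium at a biallelic SNP.

    Takes observed counts of the three genotypes and returns
    (chi-square statistic, p-value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical genotype counts: the first set is in exact equilibrium,
# the second has no heterozygotes at all.
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)
```

A SNP would be dropped when its p-value falls below the chosen threshold; for small expected counts an exact test is usually preferred over the chi-square approximation.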
Table 3. Logistic regression analysis and prediction power comparison of nongenetic (NGRS), genetic (GRS), and comprehensive models for CKD. ROC receiver operating characteristic, OR odds ratio, CI confidence interval, AUC area under curve, NGRS4 nongenetic risk score model 4, GRS14 genetic risk score model 14. a NGRS4 = 1.84 × $S_1$ + 1.137 × $S_2$ + 0.84 × $S_3$ + 0.497 × $S_4$ + 0.603 × $S_5$ ($S_i$ represents the state of the ith nongenetic risk factor; if the individual has the risk factor, the value is 1; if not, the value is 0). $S_1$ = TGF-β normal high value (0: < 1.011 pg/mL; 1: ≥ 1.011 pg/mL), $S_2$ = ADMA normal high value (0: < 0.019 μmol/L; 1: ≥ 0.019 μmol/L), $S_3$ = diabetes (0: unaffected; 1: affected), $S_4$ = BUN normal high value (0: < 5.9 mmol/L; 1: ≥ 5.9 mmol/L), $S_5$ = elderly (0: < 60 years; 1: ≥ 60 years).
Developing prediction models. In this study, genetic risk score (GRS) models and nongenetic risk score (NGRS) models were built using the natural logarithms (β) of the OR values of the different risk factors as weights. The combined effects of the nongenetic or genetic factors were calculated in a weighted way, and the optimal combination was selected to develop the prediction model of CKD. The GRS equation was established based on the different contributions of each candidate SNP site to the pathogenesis of CKD. Each SNP site was considered a potential risk factor for CKD. Different weights for the contribution to the onset of CKD were determined from the OR (or β) values obtained by logistic regression analysis, and several combinations were established and screened for the optimal one. We used a weighted genetic risk score (ωGRS), given in Formula (1):
$$\omega\mathrm{GRS} = \sum_{i=1}^{n} \beta_i G_i \qquad (1)$$
where $\beta_i$ is the weight of the ith SNP, $G_i$ is the number of risk alleles at the ith SNP (taking a value of 0, 1, or 2), and the sum runs over the $n$ SNPs in the model. The weight is the natural logarithm of the odds ratio (OR) of the SNP, that is, its estimated effect (β coefficient). For each individual, ωGRS is therefore the sum of the number of risk alleles weighted by the β value of each SNP site in logistic regression. In the above formula, to fix the weights in advance, we used the log-transformed per-allele risk values from studies with large sample sizes and high reliability (e.g., meta-analyses) as the weights in the actual model construction.
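A minimal sketch of the weighted genetic risk score in Formula (1) follows. The SNP identifiers and odds ratios are hypothetical placeholders rather than values from the study, and the function name is illustrative.

```python
import math

def weighted_grs(risk_allele_counts, odds_ratios):
    """Sum of risk-allele counts weighted by ln(OR), one term per SNP.

    risk_allele_counts: dict mapping SNP id -> 0, 1, or 2 risk alleles.
    odds_ratios: dict mapping SNP id -> per-allele odds ratio.
    """
    if set(risk_allele_counts) != set(odds_ratios):
        raise ValueError("genotypes and odds ratios must cover the same SNPs")
    score = 0.0
    for snp, count in risk_allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"allele count for {snp} must be 0, 1, or 2")
        score += math.log(odds_ratios[snp]) * count
    return score

# Hypothetical SNPs and per-allele odds ratios (placeholders only).
example_ors = {"snp_a": 1.2, "snp_b": 1.5, "snp_c": 1.1}
example_genotype = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
example_grs = weighted_grs(example_genotype, example_ors)
```

Here the score equals 2·ln(1.2) + ln(1.5); a SNP with no risk alleles contributes nothing regardless of its weight.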
The nongenetic risk score model is built on the same principle as the GRS. That is, according to the different contributions of the identified CKD-related nongenetic risk factors (e.g., normal high value of TGF-β, being elderly) to the incidence of CKD, the OR (or β) values from logistic regression analysis are used to determine the weights for the onset of CKD, different combinations are established, and the optimal combination is selected. We used a weighted nongenetic risk score (ωNGRS), given in Formula (2):
$$\omega\mathrm{NGRS} = \sum_{i=1}^{m} \beta_i S_i \qquad (2)$$
where $\beta_i$ is the weight of the ith nongenetic risk factor for the risk of developing CKD, $S_i$ is the state of the ith nongenetic risk factor (1 if the individual has the risk factor, 0 if not), and the sum runs over the $m$ nongenetic risk factors in the model. The weight β is the natural logarithm of the OR value obtained by logistic regression analysis of each risk factor; the β values used in this study were those of each nongenetic risk factor in the logistic regression analysis. For every individual, ωNGRS is therefore the sum of the risk factor states weighted by their β values.
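As a worked illustration of the nongenetic score, the sketch below evaluates NGRS4 using the coefficients from the Table 3 footnote and the cut-offs from the Table 2 footnotes. The function and argument names are illustrative, and the example individual is hypothetical.

```python
# Coefficients as listed in the Table 3 footnote for NGRS4.
NGRS4_WEIGHTS = {
    "tgf_beta_high": 1.84,
    "adma_high": 1.137,
    "diabetes": 0.84,
    "bun_high": 0.497,
    "elderly": 0.603,
}

def ngrs4(tgf_beta, adma, diabetes, bun, age):
    """Compute NGRS4 from raw measurements.

    tgf_beta in pg/mL, adma in μmol/L, bun in mmol/L, age in years;
    diabetes is True/False. Cut-offs follow the Table 2 footnotes.
    """
    states = {
        "tgf_beta_high": tgf_beta >= 1.011,
        "adma_high": adma >= 0.019,
        "diabetes": bool(diabetes),
        "bun_high": bun >= 5.9,
        "elderly": age >= 60,
    }
    return sum(NGRS4_WEIGHTS[name] * int(present) for name, present in states.items())

# Hypothetical individual: high TGF-β, diabetic, aged 65, other factors absent.
example_score = ngrs4(tgf_beta=1.2, adma=0.010, diabetes=True, bun=5.0, age=65)
```

For this individual the score is 1.84 + 0.84 + 0.603 = 3.283; the maximum possible NGRS4, with all five factors present, is 4.917.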
The comprehensive risk scoring model integrates the optimal GRS model and the optimal NGRS model as the arithmetic sum of the two, as given in Formula (3):
$$\text{Comprehensive risk score} = \omega\mathrm{GRS} + \omega\mathrm{NGRS} \qquad (3)$$
Prediction model evaluation. The constructed GRS model, NGRS model and comprehensive prediction model were evaluated using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. MedCalc software was used to determine the optimal cut-off point of the ROC curve and the sensitivity and specificity at that cut-off point, which together characterize the predictive performance of the constructed CKD prediction models. The GRS model, NGRS model and comprehensive prediction model were internally validated in the nested case-control study using bootstrap five-fold cross-validation. All data analyses were performed using SPSS 21.0 software. Statistical significance was determined with a threshold P value of < 0.05. All methods were performed in accordance with the relevant guidelines and regulations.
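The evaluation quantities described above can be sketched in a few lines. This illustration computes the AUC as the probability that a case scores higher than a control and picks a cut-off by maximizing the Youden index; the source does not state which criterion was used to choose the optimal cut-off, so the Youden index is an assumption, and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_cutoff_youden(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing sens + spec - 1.

    A subject is predicted positive when score >= cutoff.
    """
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        youden = sens + spec - 1.0
        if best is None or youden > best[0]:
            best = (youden, cutoff, sens, spec)
    return best[1], best[2], best[3]

# Hypothetical risk scores with case (1) / control (0) labels.
example_scores = [0.5, 1.2, 2.1, 2.9, 3.3, 4.0, 4.4]
example_labels = [0, 0, 0, 1, 0, 1, 1]
example_auc = roc_auc(example_scores, example_labels)
example_cutoff, example_sens, example_spec = best_cutoff_youden(example_scores, example_labels)
```

With these values the AUC is 11/12 and the selected cut-off is 2.9 (sensitivity 1.0, specificity 0.75). In a cross-validation setting the same functions would be applied to each held-out fold and the AUCs averaged.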

Conclusion
Age, diabetes, and normal high values of urea nitrogen, TGF-β, and ADMA are independent indicators of CKD incidence. A comprehensive prediction model was established, although the genetic factors analyzed in our study yielded limited predictive value for CKD incidence. With early prediction, timely and appropriate intervention can be applied to keep the disease from worsening or becoming irreversible.