Introduction

Chronic kidney disease (CKD) affects 15% of the U.S. population1. Progression of CKD is associated with a high risk of medical complications including cardiovascular disease2, bone and metabolic disease3,4, and frailty4. Patients who progress to kidney failure need to consider initiation of dialysis or a kidney transplant. The cost of care for patients with advanced CKD adds a significant burden to the healthcare system5. Anticipating how rapidly a person with CKD will progress to kidney failure and discovering biomarkers of CKD progression and potential therapeutic targets for slowing CKD progression remain high priorities6.

Proteins regulate biological processes and integrate the effects of genes with those of the environment, age, comorbidities, behaviors, and drugs7,8,9. Multiprotein models predict the risk of developing diseases and their clinical outcomes as well as or better than traditional clinical models7,8,9. The 4-variable Kidney Failure Risk Equation (KFRE)4,10, the most commonly used tool for predicting CKD progression to kidney failure, consists of estimated glomerular filtration rate (eGFR), age, sex, and albuminuria. Whereas KFRE is highly predictive of progression to kidney failure with a c-statistic of ~0.88 at 5 years10, among its components only albuminuria is readily modifiable with treatment. Personalized prognostic equations for CKD progression that consist of modifiable biological factors could be used to monitor responses to medical treatments. For example, a prognostic equation for cardiovascular risk that consisted of modifiable protein risk factors accurately predicted which patients remained at high risk for poor outcomes and might benefit from more specialized therapies11.

Niewczas and colleagues examined 194 circulating inflammatory proteins in a total of 525 participants with type 1 and type 2 diabetes and identified a kidney risk inflammatory signature (KRIS), consisting of 17 proteins enriched for the tumor necrosis factor receptor superfamily members, that was associated with a 10-year risk of kidney failure12. More recently, the same group of investigators measured 1129 plasma proteins in a total of 358 participants with type 1 and type 2 diabetes and identified 3 proteins associated with a lower risk that are potentially protective against the progression of CKD to kidney failure13.

In this study, we have utilized SomaScan V.4.0 (SomaLogic, Boulder, CO), a large-scale aptamer proteomic platform that measures nearly 5000 distinct plasma proteins simultaneously, to conduct the largest proteomic analysis of CKD progression to date. Our derivation cohort consisted of 3235 participants of the Chronic Renal Insufficiency Cohort (CRIC). By design, CRIC includes nearly equal numbers of participants with and without diabetes14. The validation cohort consisted of 578 participants with CKD (eGFR<60 ml/min/1.73 m2) from the Atherosclerosis Risk in Communities Study (ARIC). Our goals were (1) to discover numerous plasma proteins that are markers or mediators of CKD progression; (2) to identify biological pathways leading to CKD progression; (3) to elucidate whether protein markers of CKD progression differ by diabetic status or other clinical factors; and (4) to build a multiprotein prognostic model for CKD progression that is highly predictive and includes factors potentially more modifiable than those in the KFRE. A summary of the study design is illustrated in Fig. 1.

Fig. 1: Summary of study design and results.

The study design, including derivation and validation for risk models and individual proteins, and selection of proteins for pathway and Mendelian randomization, are illustrated above. Our results include novel risk models for CKD progression as well as biological insights into potential causal mediators of kidney disease.

Results

CRIC cohort and renal outcomes

Detailed baseline characteristics are found in Supplementary Data 1. In brief, among the 3235 CRIC participants included in the analysis of the primary outcome, 10-year kidney failure/50% eGFR decline, mean (±SD) age was 59 (±11) years, eGFR was 43 (±17) ml/min/1.73 m2, 45% were women, and by design, nearly 50% had a history of diabetes. There were a total of 1139 (35%) events, including 998 (31%) kidney failure events, over median (IQR) 6.0 (2.6–10.0) years. Participants who reached the primary outcome were older, more likely to be male, black, diabetic, and have a lower baseline eGFR, higher albuminuria, and history of CVD. For the secondary outcome of 4-year eGFR slope, the median (IQR) eGFR slope was −1.01 (−2.18, 0.27) ml/min/1.73 m2 per year; 316 (9.74%) had eGFR slope ≤ −3 (ml/min/1.73 m2)/year.

ARIC validation cohort for the primary renal outcome

The validation cohort comprised 578 ARIC participants with eGFR <60 ml/min/1.73 m2 at ARIC Visit 3, all of whose samples were assayed with the same version of SomaScan. These ARIC participants had a mean age of 64 years, a lower prevalence of diabetes (32%), and higher mean eGFR (48 ml/min/1.73 m2). There were 85 (15%) events for the primary renal outcome, kidney failure or a 50% decline in eGFR, including 80 kidney failure events (Supplementary Data 2).

Associations of individual proteins with the primary outcome in CRIC and ARIC

Associations of individual proteins with the primary outcome (≥50% eGFR decline or kidney failure within 10 years) are visualized in Fig. 2 as volcano plots, shown unadjusted (Fig. 2A), adjusted for eGFR (Fig. 2B) and fully adjusted (Fig. 2C). Among the 4638 proteins investigated, in fully adjusted analyses, 330 proteins (7.1% of all proteins measured) were associated with the primary renal outcome at FDR significance (q < 0.05). We identified numerous proteins associated with a higher risk of the primary outcome. Whereas only 1 of the previously reported 17 KRIS proteins had fully adjusted log2 HR > 2, 14 additional proteins with fully adjusted HRs between 2 and 5 were identified in this study (Fig. 2C). The top 20 proteins with the largest HR per log2, listed with their biological functions and current drugs that target them, are shown in Table 1. We identified numerous proteins associated with lower risk (HR < 1), referred to in the literature as potentially protective, which are shown in Fig. 2 and Supplementary Data 6. Protein associations are shown as HR per MAD unit in the Supplementary Data (Supplementary Data 3–6, 9–12).
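The FDR threshold used above (q < 0.05) can be illustrated with a short sketch. The text does not name the FDR procedure, so the Benjamini-Hochberg step-up method is assumed here; the function name `benjamini_hochberg` and the toy P values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the original order of p_values.

    Assumption: the study's FDR procedure is not stated in the text; BH is
    used here only as a common choice.
    """
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Toy P values (hypothetical, for illustration only).
toy_p = [0.001, 0.5, 0.02, 0.04]
toy_q = benjamini_hochberg(toy_p)
significant = [q < 0.05 for q in toy_q]

# Share of proteins passing FDR in the fully adjusted analysis (from the text).
share_significant = 330 / 4638
print(significant, f"{100 * share_significant:.1f}%")
```

In the toy example the fourth P value (0.04) is nominally significant but does not pass q < 0.05, which is the behaviour that distinguishes FDR control from an unadjusted threshold.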

Fig. 2: Volcano plots of individual protein associations with the primary outcome.

Associations of 4638 proteins with the primary outcome unadjusted (A), eGFR adjusted (B) and after full adjustment for age, gender, race, eGFR, log[urine protein to creatinine ratio], systolic blood pressure, diabetes, smoking status, body mass index, and cardiovascular disease history (C). 17 proteins in the Kidney Risk Inflammatory Signature, and 3 proteins previously found to be associated with lower risk in patients with DM are labeled as blue dots. P values are two-sided. FDR false discovery rate<0.05. FRAS1 Fraser extracellular matrix complex subunit 1, WAP whey acidic protein.

Table 1 Individual proteins associated with the CKD progression primary outcome in both CRIC and ARIC studies

Higher-risk markers in CRIC and ARIC were enriched with members of the ephrin family (5 of 20 top proteins) and bone morphogenetic proteins (BMPs) (4 of 20 top proteins). Through a search of the Druggable Target Database15, 8 of these 20 proteins are currently druggable targets (Table 1). The three proteins with the lowest HRs that passed the criteria for validation in ARIC are Cartilage intermediate layer protein 2 (CILP2), C1GALT1-specific chaperone 1, and albumin. Potential roles for these proteins in the biology of CKD progression are delineated in “Discussion”.

Pathway analysis of proteins associated with CKD progression in CRIC

We performed an overrepresentation analysis with the IPA tool to define the canonical pathways linked to the primary outcome. There were 1516 proteins that were associated with the primary outcome of CKD progression at an FDR of 5%, after adjustment for eGFR, and we compared this subset of proteins to the 4638 background proteins measured by SomaScan. The top ten canonical pathways are listed in Table 2. Ephrin signaling was again prominent, represented as the Ephrin A pathway. There was also significant enrichment for proteins that link inflammation and metabolic processes (LXR/RXR activation), matrix metalloprotease inhibition, hepatic fibrosis, and intrinsic prothrombin activation pathway. In “Discussion”, we focus on potential roles for ephrin signaling, BMPs and prothrombin activation pathways in worsening kidney disease.
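The overrepresentation test described above can be sketched as a right-tailed hypergeometric (one-sided Fisher) test of a pathway's overlap with the 1516 associated proteins against the 4638-protein background. This is an assumption about the test form and not a reproduction of the IPA tool; the function name and the pathway size and overlap in the example are hypothetical.

```python
from math import comb


def overrepresentation_p(background, in_pathway, selected, overlap):
    """Right-tailed hypergeometric P(X >= overlap).

    background: proteins measured (4638 in the text)
    in_pathway: background proteins annotated to the pathway
    selected:   proteins associated with the outcome (1516 in the text)
    overlap:    associated proteins that are also in the pathway
    """
    max_overlap = min(in_pathway, selected)
    total = comb(background, selected)
    tail = sum(
        comb(in_pathway, k) * comb(background - in_pathway, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical pathway: 40 annotated proteins, 25 of them among the 1516.
# By chance alone about 40 * 1516 / 4638, i.e. roughly 13, would be expected.
p_example = overrepresentation_p(4638, 40, 1516, 25)
print(f"P = {p_example:.2e}")
```

Restricting the background to the proteins actually measured, as the text does, matters because the SomaScan menu is not a random sample of the proteome.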

Table 2 Canonical pathways among 1516 proteins associated with CKD progression

Mendelian randomization

Using CRIC genotype data, we identified one or more pQTLs for 23 of our 76 selected protein risk factors. Within the eGFR database, we found significant MR associations for four proteins (listed with SNP and P value): protein delta homolog 2/EGFL9 (rs2125739, 1.9 × 10⁻⁶), low-density lipoprotein receptor-related protein 11/LRP-11 (rs9689036, 1.56 × 10⁻⁴), Interleukin-1 receptor type 2/IL-1 sRII (rs2310170, 1.3 × 10⁻³), and alpha-1 microglobulin (rs10982054, 1.7 × 10⁻³). Within the CKDi25 GWAS, one variant was significant: EGF-containing fibulin-like extracellular matrix protein 1 (also known as fibulin 3, FBLN3) (rs6755214, 1.3 × 10⁻³). None of the variants we tested had MR associations in the Rapid3 database. Using the deCODE database, 54 of 76 proteins were linked to cis pQTLs. Significant MR associations were confirmed for LRP-11 in the eGFR GWAS, and for FBLN3 in the CKDi25 GWAS. In addition, significant MR associations were found for matrix metalloproteinase 7/MMP-7 (Rapid3 GWAS: instrumental variable (IV) comprising 7 SNPs, P = 7.0 × 10⁻⁴), leukocyte immunoglobulin-like receptor subfamily B member 1/ILT-2 (Rapid3: IV comprising 5 SNPs, P = 1.6 × 10⁻⁶) and matrix remodeling associated 7/MXRA7 (Rapid3: IV comprising 2 SNPs, P = 1.2 × 10⁻⁵) (Fig. 3). MR associations with nominal P < 0.05 were observed for several proteins including two BMP antagonists, follistatin-related protein 3 (FSTL3) and twisted gastrulation protein homolog 1 (TWSG1), as well as CILP2, a protein associated with a lower risk of CKD progression that inhibits fibrosis, all of which replicated in ARIC. Two proteins that passed MR with significance after adjustment for multiple testing are druggable targets: IL-1 sRII and MMP-7 (Druggable Target Database)15 (Supplementary Data 7).
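For readers less familiar with MR, the sketch below shows two standard estimators: the Wald ratio for a single-SNP instrument and a fixed-effect inverse-variance-weighted (IVW) estimate for a multi-SNP instrument. This section does not state which estimators were used, so both are assumptions; all function names and numeric inputs are hypothetical.

```python
from math import erfc, sqrt


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-SNP causal estimate with a first-order standard error."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se


def ivw_estimate(betas_exposure, betas_outcome, ses_outcome):
    """Fixed-effect inverse-variance-weighted estimate across several SNPs."""
    weights = [1.0 / se ** 2 for se in ses_outcome]
    numerator = sum(
        w * bx * by for w, bx, by in zip(weights, betas_exposure, betas_outcome)
    )
    denominator = sum(w * bx ** 2 for w, bx in zip(weights, betas_exposure))
    return numerator / denominator, sqrt(1.0 / denominator)


def two_sided_p(estimate, se):
    """Two-sided P value from a normal approximation."""
    return erfc(abs(estimate / se) / sqrt(2.0))


# Hypothetical SNP effects on protein level (exposure) and on eGFR (outcome).
single_est, single_se = wald_ratio(0.5, 0.1, 0.02)
multi_est, multi_se = ivw_estimate([0.5, 0.5], [0.1, 0.1], [0.02, 0.02])
print(single_est, single_se, two_sided_p(single_est, single_se))
print(multi_est, multi_se)
```

With two identical hypothetical SNPs the IVW estimate equals the Wald ratio and the standard error shrinks by a factor of √2, which is why multi-SNP instruments such as the 7-SNP MMP-7 instrument can be more precise.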

Fig. 3: Mendelian randomization of proteins associated with CKD progression.

Eight proteins were linked to pQTLs that had significant Mendelian randomization associations after correction for multiple tests in at least one GWAS for cross-sectional kidney function (eGFR) or for CKD progression (Rapid3 or CKDi25). HR (95% CI) per log2 of protein are shown, for the outcome of ESRD or 50% decline in eGFR, in CRIC, with adjustment for eGFR. pQTLs were identified in CRIC and in deCODE (deCODE marked with *). The three GWAS used are eGFR (red), Rapid3 (blue), and CKDi25 (green). Mendelian randomization associations are shown as red points (% difference in eGFR), or blue and green points that represent odds ratio. If an association met significance after multiple testing, P value is in bold. All P values are two-sided.

Impact of diabetic status for 20 previously reported risk factors

We visualized HRs of all proteins we found to be significantly associated with the primary outcome in a scatterplot, comparing participants with vs. without DM. HRs were similar in direction and effect size for DM vs. non-DM for the primary outcome (rho = 0.68) and the kidney failure outcome alone (rho = 0.68) (Supplementary Information Fig. 1). In addition, we examined 17 KRIS proteins reported by Niewczas et al. to predict the progression of CKD to kidney failure in cohorts of patients with diabetes as well as three proteins associated with lower risk12,13. After full adjustment, six of these 20 proteins (five KRIS proteins and one lower-risk protein) replicated in CRIC at P < 0.0025 (0.05/20) for the outcome of kidney failure among participants with DM. Three of the five KRIS proteins that replicated were members of the tumor necrosis factor receptor superfamily: members 1A, 1B (measured with two aptamers) and 19. One of the three lower-risk proteins, angiopoietin-1, replicated in the CRIC participants with DM. There was a significant interaction of DM with Interleukin-18 receptor 1 and angiopoietin-1: HR per log2 [95% CI] for Interleukin-18 receptor 1 in DM 1.35 [1.08, 1.69] vs non-DM 0.77 [0.57, 1.1], P for interaction 0.008; HR [95% CI] for angiopoietin-1 in DM 0.78 [0.69, 0.87] vs non-DM 1.07 [0.90, 1.3], P for interaction <0.001 (Supplementary Data 8).
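The Bonferroni threshold and the DM vs non-DM comparison above can be approximated from the reported stratum-specific HRs. The sketch below recovers each log-HR standard error from its 95% CI and applies a z-test to the difference. This is a rough heuristic based on rounded published values, not the authors' interaction analysis, and it is not expected to reproduce the reported interaction P values; the function names are hypothetical.

```python
from math import erfc, log, sqrt

# Bonferroni threshold stated in the text: 0.05 divided by 20 proteins.
BONFERRONI_ALPHA = 0.05 / 20


def log_hr_and_se(hr, ci_low, ci_high):
    """Log hazard ratio and its SE, back-calculated from a 95% CI."""
    return log(hr), (log(ci_high) - log(ci_low)) / (2 * 1.96)


def approx_interaction_p(stratum_a, stratum_b):
    """Two-sided z-test for a difference between two independent log HRs."""
    beta_a, se_a = log_hr_and_se(*stratum_a)
    beta_b, se_b = log_hr_and_se(*stratum_b)
    z = (beta_a - beta_b) / sqrt(se_a ** 2 + se_b ** 2)
    return erfc(abs(z) / sqrt(2.0))


# (HR, lower CI, upper CI) for DM and non-DM, as reported in the text.
il18r1_p = approx_interaction_p((1.35, 1.08, 1.69), (0.77, 0.57, 1.1))
angpt1_p = approx_interaction_p((0.78, 0.69, 0.87), (1.07, 0.90, 1.3))
print(BONFERRONI_ALPHA, il18r1_p, angpt1_p)
```

Both approximations point in the same direction as the reported results (opposite-signed associations in the two strata), which is the qualitative point of the paragraph.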

Association of individual proteins with short-term kidney function decline

Twenty individual proteins associated with the highest and 20 with the lowest risk of 4-year eGFR decline are listed in Supplementary Data 9 and 10, respectively. Among those predicting a faster decline, 15/20 were also among the top 20 protein risk factors for the primary outcome, and 15/20 had associations with eGFR decline that remained significant after full adjustment at FDR < 0.05. These risk factors for eGFR decline included ephrin receptors and tumor necrosis factor receptor 1A. Less well-known risk factors included brorin and erythropoietin receptors, both of which were successfully validated in ARIC for the primary outcome. Among proteins predicting a slower decline, 15/20 were listed among protein factors with the lowest hazard ratio for the primary outcome. Three of these proteins remained significant by FDR q value after full adjustment: mitochondrial superoxide dismutase, fibroblast growth factor 9, and follistatin-related protein 5. For further description of specific proteins, see “Discussion”.

Risk prediction models for primary and secondary outcomes in CRIC

In the 80% training set of CRIC participants, we derived a 65-protein model for the primary outcome (≥ 50% eGFR decline or kidney failure within 10 years) using elastic net regression. The β-coefficients, adjusted HRs, and relevant drug for each protein included in the risk model are listed in Supplementary Data 11. In the 20% CRIC testing set, the model yielded a C-statistic of 0.862 (95% CI: 0.835, 0.888), with similar discrimination to the refit KFRE. Similarly, we derived a 20-protein model for the secondary outcome (4-year eGFR slope) in the 80% training set. The β-coefficients, adjusted HRs, and available drug for each protein included in the risk model are listed in Supplementary Data 12. In the 20% testing set, the C-statistic (95% CI) for the 20-protein model was 0.728 (0.708, 0.748), similar to the refit KFRE (0.744 (0.725, 0.763)). Hybrid clinical-protein models for both primary and secondary outcomes showed incremental, statistically significant improvement in discrimination over the refit KFRE (Fig. 4). Calibration was excellent for both protein risk models (10-year model: model-based calibration P > 0.1 for all except Q1, with only eight events; dichotomous slope calibration P ≥ 0.05 for all except Q1) (Supplementary Fig. 2). Both protein models show a broad dynamic range of prediction, with the ratio of quintile 5/quintile 1 of predicted as well as observed risk being 20 for the primary outcome and 10 for the binary eGFR slope outcome. Through a search of the Druggable Target Database15, 14 of the 65 proteins included in the risk model for the primary outcome, and 3 of 20 in the model for the secondary outcome are currently druggable targets (Supplementary Data 11 and 12).

Fig. 4: Risk models for primary and secondary outcomes.

For the primary outcome, C-statistics and concordance are calculated in the CRIC testing set, N = 577 participants with 186 events. For eGFR slope, N = 571 participants, P values calculated by two-sided concordance testing. Points represent C-statistic for primary outcome, and concordance for the secondary outcome. The whiskers represent the standard error. KFRE Kidney Failure Risk Equation, ACR albumin-to-creatinine ratio, eGFR estimated glomerular filtration rate.

Sensitivity analyses

The 65-protein model for the primary outcome showed similar discrimination in subgroups of diabetes status, race, and eGFR (P for interaction >0.1 for all) (Supplementary Fig. 3). We examined whether discrimination varied by the length of follow-up. Time-dependent AUCs for the 65-protein model and other risk models for follow-up periods ranging between 1 and 15 years show higher AUCs in the short term (Supplementary Fig. 4). The hybrid model incrementally surpassed the KFRE at 5 years and 15 years: 5-year hybrid 0.90 (0.87, 0.93) vs KFRE 0.89 (0.86, 0.91) (P = 0.01); 15-year hybrid 0.87 (0.84, 0.89) vs KFRE 0.85 (0.82, 0.88) (P = 0.0005). We additionally explored creating distinct multiprotein risk models for different time horizons and found higher c-statistics for shorter time horizons for both KFRE and protein models (Supplementary Data 13).

We also evaluated the effect of calculating eGFR using a creatinine-based race-free equation on the performance of the risk models. Discrimination of the protein model, as well as for KFRE and hybrid models, for the primary outcome, was unchanged: protein model C-statistic (SE) 0.862 (0.014) in our initial analysis, 0.860 (0.014) using race-free equation. Discrimination of the protein model (and other models) for eGFR slope was slightly lower using the race-free eGFR: C-statistic (SE) 0.728 (0.010) in our initial analysis, 0.684 (0.010) using the race-free equation (Supplementary Data 14). In additional sensitivity analyses, we evaluated whether KFRE with coefficients refit to our cohort performed better for the primary outcome than the original KFRE equation. In the 20% testing set, C-statistics were similar between the original KFRE10 (0.844 (95% CI: 0.817, 0.872)), and the refit KFRE (0.855 (95% CI: 0.828, 0.882)). For the outcome of 10-year kidney failure alone (without the endpoint of 50% decline in eGFR), the C-statistics (95% CI) were also similar: original KFRE 0.892 (95% CI: 0.870, 0.915); refit KFRE 0.894 (95% CI: 0.872, 0.916).

Validation of the 65-protein model in ARIC

The validation cohort comprised 578 ARIC participants with eGFR<60 ml/min/1.73 m2 at ARIC Visit 3. The C-statistic (95% CI) for the 65-protein model in ARIC validation was 0.840 (0.785–0.896). Overall, the calibration of the 65-protein model in ARIC was fair (GND chi2 = 9.2, P = 0.06). Calibration was good in the 4th and 5th quintiles of predicted risk, which had 12 and 57 outcome events, respectively, but may have suffered from few events in the lowest three quintiles (each having <8 events), potentially leading to discrepant predicted vs observed estimates (Supplementary Data 2).

Discussion

In this study of proteomics of CKD progression, we quantified 4638 unique plasma proteins in 3249 participants of CRIC and validated our findings in 578 participants from ARIC with CKD, comprising a total of nearly 18 million individual protein measurements. We identified over 500 proteins associated with CKD progression after adjustment for eGFR and 100 proteins after extensive covariate adjustment at the Bonferroni-corrected statistical significance threshold. Individual protein and canonical pathway analyses highlight potential roles of ephrin signaling, BMP antagonists, and prothrombin activation. We identified 8 plasma proteins with potentially causal significant associations by MR; 5 of these have not been previously identified by MR, and 3 are currently druggable targets. Applying machine learning, we developed proteomic risk models for long- and short-term CKD progression with a similar excellent predictive utility to the refit KFRE clinical model but, in contrast to the refit KFRE, the protein models consist of modifiable risk factors11.

Four of the top twenty proteins identified in CRIC, validated in ARIC, and associated with a higher risk of CKD progression are antagonists of BMPs, also known as growth differentiation factors (GDFs) (Table 1). BMPs were originally discovered as constituents of bone extract that cause ectopic bone formation when implanted in rats16. More than 30 BMPs form a subgroup of the transforming growth factor-β (TGF-β) superfamily, with diverse skeletal and extraskeletal functions17. BMP antagonists include Gremlin, sclerostin, follistatin, noggin, and brorin, and there is evidence that these antagonists play a role in modulating the extracellular matrix (ECM)18. There has been interest in BMP antagonists for treating renal disease: for example, Gremlin, an antagonist of BMP2 and 4, may be protective of diabetic nephropathy in experimental models19. TWSG1, a protein risk factor also found to be an independent risk factor for CKD progression20, with a nominally significant MR association in CKDi25 in our study, is an antagonist of BMP 7, which is produced in the kidney and is protective against renal fibrosis and other types of renal injury in experimental models21. FSTL3 is another BMP antagonist previously shown to predict CKD progression20, which we found to have a nominal (P < 0.05) association by MR in CKDi25 GWAS. FSTL3, a 30 kDa protein, is an antagonist of BMP2 and 4 (both of which promote bone formation and other processes), GDF8 (a growth factor for skeletal muscle), and GDF11 (a factor negatively associated with age-related left ventricular hypertrophy)22,23,24. FSTL3 is renally cleared, and its hepatic production may be increased in renal disease25. Thus, there is plausibility to our findings that members of the BMP family are involved in CKD progression.

Five of the top 20 proteins identified in CRIC and validated in ARIC associated with a higher risk of CKD progression are members of the Ephrin family (Table 1) and Ephrin signaling was among the top canonical pathways identified in our study (Table 2). Ephrin receptors interact with vascular endothelial growth factor to control angiogenesis26, and CKD is characterized by microvascular disease and capillary rarefaction within the kidney. Ephrin type-B receptor 4 stimulates angiogenesis after kidney injury to enhance recovery27. Ephrin-B2 knockout mice are protected from renal fibrosis in a renal ischemia model, suggesting that ephrin-B2 facilitates renal fibrosis28.

The prominence of the canonical pathway of prothrombin activation among proteins associated with CKD progression in our study might be explained by interactions of thrombin with protease-activated receptors that are found in several cell types in the kidney29. Thrombin may have direct effects on the kidney via protease-activated receptor 1 (PAR1), which is activated by thrombin and found in several different cell types. PAR1 deficiency is protective against diabetic nephropathy in animal models30.

We also found several proteins that are associated with a lower risk of CKD progression (Supplementary Data 4 and 6). C1GALT1-specific chaperone 1 was associated with a lower risk of the primary and secondary outcome of CKD progression in CRIC and passed the criteria for validation in ARIC for the primary outcome. This protein facilitates protein glycosylation and platelet activation. CILP2 was associated with a lower risk of CKD progression in CRIC and ARIC and had a nominally significant MR association (P < 0.05). CILP1 levels are increased in the myocardium after infarction, and CILP1 is thought to protect against fibrosis in the myocardium by inhibiting TGFβ31. CILP2 could have a similar anti-fibrotic effect in the kidney.

Our MR analysis revealed eight potentially causal mediators of CKD progression that were significant after adjustment for multiple tests in one or more renal GWAS. Five of these proteins, to our knowledge, have not been shown previously to have MR associations: EGFL9 is an antagonist of the NOTCH pathway32 which has roles in kidney development and disease33. LRP-11 is a membrane protein related to lipid metabolism. MXRA7 is an extracellular matrix protein. IL-1 sRII and ILT-2 are immunologic receptors, and both are currently druggable targets.

Given that approximately half of CRIC participants have diabetes mellitus, our study provides an opportunity for characterizing differences in proteins that predict the progression of diabetic vs. nondiabetic kidney disease. We validated several proteins previously found to predict higher or lower risk of kidney failure among individuals with CKD all of whom had diabetes12,13 (Supplementary Information Fig. 1 and Supplementary Data 8). Yet, overall, we found that many proteins predict CKD progression similarly, irrespective of diabetes status, suggesting shared mechanisms of progression of diabetic and nondiabetic CKD. However, two proteins did have significantly different statistical associations among patients with diabetes compared to those without. Interleukin-18 receptor 1 predicted higher risk and angiopoietin-1 predicted lower risk among patients with diabetes. Stratification by diabetes may be an important component for the future discovery of biomarkers of CKD progression, with the expectation that while few markers may differ by diabetes status, these differences could be important for developing therapeutics for different etiologies of kidney disease.

The KFRE equation was developed to predict kidney failure over 5 years and has shown excellent validation in meta-analyses of international studies10. It is accessible to clinicians, given that the four factors of age, gender, eGFR, and albuminuria can be readily determined. A key limitation of the KFRE is that it sheds little light on the biological mechanisms by which CKD progresses in individual patients and besides albuminuria, its components are not readily modifiable. Plasma levels of proteins readily change in response to lifestyle and pharmacological interventions. The 65-protein model derived in this study for a 10-year 50% decline in eGFR or kidney failure matched the KFRE for its excellent discrimination and had even better discrimination at 5 years. A separately derived protein model for kidney failure alone at 2 years had a C-statistic of 0.95 (similar to the KFRE applied to 2 years, 0.94). Short-term protein models could be used as surrogate outcomes in clinical trials of therapeutics. Clinicians might use the protein model not only to identify patients at higher risk of kidney failure, but also to monitor patients’ response to lifestyle and medication changes. Showing the patient that his or her risk score has improved could improve compliance with medications. Hybrid clinical-protein models showed modest statistically significant improvement over KFRE. Notably, the addition of clinical factors added little to the discrimination of protein models. One may conclude from this that proteins encode demographic and clinical information in addition to carrying important biological signals, a concept that we have demonstrated previously11.

Our study has numerous strengths, but we also acknowledge limitations. Additional clinical and experimental approaches informed by our proteomics findings will be needed to establish conclusively which of the protein biomarkers identified in our study are involved as causal mediators in CKD progression, given the limits of epidemiological association studies. The prognostic utility of the multiprotein risk score, and its capacity to reflect effects of medications, could be validated using samples from clinical trials involving kidney endpoints. The biological roles of specific proteins could be elucidated with animal models. While the CRIC population is well-phenotyped and affords extensive multivariable adjustment, any unmeasured confounders may bias the assessments of individual proteins as independent risk markers. We measured circulating and not tissue proteins, since plasma is more readily accessible as a diagnostic matrix than kidney biopsy tissue. Future studies are expected to correlate proteomic information from plasma and kidney biopsies. Lastly, the present Mendelian randomization analyses may be augmented by utilizing a more comprehensive GWAS for renal function that includes a meta-analysis of CKD Genetics Consortium and UK Biobank34.

In conclusion, we present the largest proteomic study of participants with CKD to date with a total of nearly 18 million individual protein measurements, in a well-phenotyped population of >3000 participants. Our analyses reveal multiple individual protein risk factors for CKD progression that have not been previously described, and we show that individual proteins and a 65-protein risk model for 10-year CKD progression replicate well in ARIC. Druggable targets within our protein risk models and significant MR findings may provide the impetus for developing therapeutics. Biological pathways and individual proteins that we have identified, including BMP antagonists, ephrin signaling, and prothrombin activation, warrant further study.

Methods

Participants

The CRIC study protocols adhered to ethics regulations of each institution where participants were enrolled, requiring approval from the following committees: University of Pennsylvania Institutional Review Board, Federalwide Assurance # 00004028; Johns Hopkins Institutional Review Board NA_00044034/CIR00004697; The University of Maryland, Baltimore Institutional Review Board; University Hospitals Cleveland Medical Center Institutional Review Board; MetroHealth Institutional Review Board; Cleveland Clinic Foundation Institutional Review Board IRB #5969; University of Michigan Medical School Institutional Review Board; Wayne State University Institutional Review Board; University of Illinois at Chicago Institutional Review Board; Tulane Human Research Protection Office, Institutional Review Boards, Biomedical Social Behavioral, IRB #140987; Kaiser Permanente Northern California Institutional Review Board. The Atherosclerosis Risk in Communities (ARIC) Study adhered to ethics regulations from and was approved by a single Institutional Review Board (sIRB) at Johns Hopkins School of Medicine (FWA00005752; IRB00311861) and Institutional Review Boards (IRB) at all participating institutions: University of North Carolina at Chapel Hill, Johns Hopkins University School of Public Health, University of Minnesota, Wake Forest University Health Sciences, University of Mississippi Medical Center, Baylor College of Medicine, University of Texas Houston Health Science Center, and Brigham and Women’s Hospital. Study participants provided written informed consent at all study visits.

The CRIC study was designed to investigate risk factors for progression of CKD, incident cardiovascular disease, and overall mortality in persons with CKD14. Between 2003 and 2008, the CRIC study enrolled a total of 3939 ethnically diverse men and women at 7 clinical centers, ages 21–74 years, with eGFR 20–70 ml/min/1.73 m2 by the simplified Modification of Diet in Renal Disease equation14. Eligibility criteria and baseline characteristics of the CRIC cohort have been published14,35. The CRIC study was approved by the Institutional Review Boards of the participating centers, and the research was conducted in accordance with the principles of the Declaration of Helsinki. All study participants provided written informed consent. At enrollment, information on participant sex/gender was collected by self-report; there were no sex/gender-based inclusion or exclusion criteria. For the present analysis, plasma samples from 3419 CRIC participants from the year 1 visit, considered our study’s baseline, were assayed with SomaScan V4.0. Each sample was assayed once with SomaScan. We excluded 53 participants with prevalent kidney failure. Due to the interference of lupus antibodies with aptamers (communication from SomaLogic), we also excluded 12 participants with systemic lupus erythematosus. After 105 samples were excluded that did not pass SomaLogic’s quality control standards, the final analytical cohort consisted of 3249 participants. Fourteen participants were excluded who did not have a baseline measure of eGFR, leaving 3235 individuals eligible for analyses of the primary outcome of a 50% decline in eGFR or kidney failure over 10 years and 3243 participants eligible for analyses for the secondary outcome of a 4-year eGFR decline.
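The participant flow described above can be checked with a few lines of arithmetic. The counts are those given in the text; the variable names are hypothetical.

```python
# Counts taken from the Participants paragraph.
assayed = 3419
exclusions = {
    "prevalent kidney failure": 53,
    "systemic lupus erythematosus": 12,
    "failed SomaLogic quality control": 105,
}

analytical_cohort = assayed - sum(exclusions.values())

# Participants without a baseline eGFR are excluded from the primary outcome.
missing_baseline_egfr = 14
primary_outcome_n = analytical_cohort - missing_baseline_egfr

print(analytical_cohort, primary_outcome_n)
```

The two totals agree with the 3249 participants in the final analytical cohort and the 3235 participants analyzed for the primary outcome.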

SomaScan version 4.0

SomaScan is an assay that uses modified aptamers, chemically modified single strands of deoxyribonucleic acid ~40 nucleotides long, as binding reagents for target proteins7,8,36,37,38,39. Modified aptamers bind to proteins with high affinity similar to antibodies (lower limit of detection 10⁻¹⁵ moles per liter)36,37,38. “Pull-down” studies, in which the aptamer-protein complexes were isolated and the identities of the bound proteins were verified by targeted mass spectrometry and gel electrophoresis, have been performed for 920 proteins among 1305 proteins in a previous version of the assay39. These studies showed that >95% of aptamers correctly targeted the intended proteins (for those proteins in concentrations sufficient to be detected by mass spectrometry). The samples on the SomaScan assay are run at three different dilutions to assay each analyte within its linear range of concentrations. The assay results are quantified on a hybridization microarray and reported in relative fluorescent units (RFU). SomaLogic has procedures for data calibration, standardization and internal controls, typical of microarray technologies.

The SomaScan V4.0 menu includes 5284 aptamers (Supplementary Data 15). We excluded 305 aptamers paired to non-human proteins, 130 incompletely characterized investigational aptamers, and 19 aptamers with >50% coefficients of variation (CVs) in 129 split duplicates from CRIC participants that were run concurrently with our large-scale proteomic study. This left 4830 aptamers and 4638 unique proteins (some proteins are measured by 2 or more aptamers) (Supplementary Data 16). The median intra-assay CVs from plasma of healthy individuals are reported as ≤5%12,40. We conducted our quality control study using split duplicate plasma samples from CRIC participants with CKD stages 3A, 3B, and 4. Median split duplicate CVs were ≤5% and did not vary by the stage of CKD or by diabetes status41.

Study outcomes

The primary outcome was time to the first of two clinical outcomes, i.e., ≥50% eGFR decline or incident kidney failure (defined as the need for renal replacement therapy), within a 10-year time horizon. To capture short-term CKD progression, we analyzed the 4-year eGFR slope as a secondary outcome, generated using a linear mixed effect model with a random intercept and a random slope. The eGFR slope was formulated as a continuous variable, and alternatively as a dichotomized endpoint of eGFR decline ≥ versus < 3 mL/min/1.73 m2 per year. The slope was censored at kidney failure. For the derivation of risk models, we wished to optimize the accuracy of eGFR measures specifically among CRIC participants. For this reason, GFR was estimated using the 5-variable CRIC equation, which includes serum creatinine, serum cystatin, age, gender, and race, because this equation has been extensively validated among CRIC participants as the closest estimate of GFR measured by iothalamate clearance42. In sensitivity analyses, we estimated GFR using the 2021 CKD-EPI creatinine equation, which is based on age, sex, and creatinine and omits race as a variable43,44.

Covariate definitions

Study covariates were chosen a priori based on the literature and used definitions published by CRIC45. Diabetes mellitus was defined by a fasting glucose of ≥126 mg/dL or the use of insulin or oral hypoglycemic medications. Hypertension was defined by systolic blood pressure ≥140 mm Hg, diastolic blood pressure ≥90 mm Hg, or the use of antihypertensive medications. Lifestyle, sociodemographic and medical history information was obtained at baseline using self-reported questionnaires, including gender, race, ethnicity, and smoking status. Prevalent cardiovascular disease at entry was assessed by a self-reported history of prior myocardial infarction, coronary revascularization, heart failure, stroke, or peripheral artery disease. Body mass index was calculated using measured height and weight and expressed in kilograms per meter squared. At the visit with proteomics, albuminuria was not directly measured. Albuminuria was calculated from urine protein to creatinine ratio using the crude (unadjusted) equation as published in ref. 46.

Statistical analysis

Summary statistics for the CRIC participants’ baseline characteristics were calculated as mean and standard deviation (SD) for symmetric variables and median and interquartile range (IQR) for skewed variables. SomaLogic normalizes the entire protein dataset using Adaptive Normalization by Maximum Likelihood (ANML) to remove unwanted biases in the assay. ANML is an iterative procedure that adjusts values for analytes that fall outside expected measurements from a reference distribution. Protein values are reported in relative fluorescent units (RFU) after ANML normalization. We chose to standardize RFU using median absolute deviation (MAD); this approach allows for the ranking of predictors and is more robust than conventional methods of standardization (mean subtraction, standard deviation division) for skewed data. We Winsorized (clipped) outliers at median ± 5 MAD.
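
The MAD standardization and Winsorization described above can be illustrated with a minimal sketch. It assumes the raw (unscaled) MAD and hypothetical RFU values; the function name `mad_standardize` is ours and is not part of the study code, which was written in R.

```python
from statistics import median

def mad_standardize(rfu, clip=5.0):
    """Center on the median, scale by the MAD, then clip at +/- `clip` MAD units."""
    values = [float(x) for x in rfu]
    med = median(values)
    # Raw MAD without the 1.4826 consistency factor (an assumption of this sketch).
    mad = median([abs(x - med) for x in values])
    if mad == 0:
        raise ValueError("MAD is zero; the variable cannot be standardized")
    # Clipping the standardized value at +/- clip equals Winsorizing at median +/- clip * MAD.
    return [max(-clip, min(clip, (x - med) / mad)) for x in values]

# Hypothetical RFU values for one aptamer, including one extreme outlier.
example_rfu = [100.0, 110.0, 120.0, 130.0, 140.0, 5000.0]
scaled = mad_standardize(example_rfu)
```

In this toy example the median is 125 and the MAD is 15, so the outlier at 5000 RFU is pulled back to 5 MAD units while the remaining values keep their relative spacing.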

The Cox proportional hazards regression model was used to assess the association between individual proteins and the primary outcome. Associations of individual proteins with continuous eGFR slope were assessed using multivariable linear regression. In each instance, we constructed models with three levels of adjustment: (i) no adjustment, (ii) adjustment for eGFR only, or (iii) adjustment for age, gender, race, eGFR, log[urine protein to creatinine ratio], systolic blood pressure, diabetes, smoking status, body mass index, and cardiovascular disease history. Evaluating each individual protein was a preliminary step before determining which proteins to replicate externally and then examine with Mendelian randomization. To rank individual proteins by strength of association with the outcome, we employed MAD standardization because it is more robust than log2 standardization for skewed predictors. We chose to select “top hits” from among the protein associations meeting a significance threshold of FDR < 0.05, rather than Bonferroni significance, to minimize type II error at the screening stage. The Benjamini–Hochberg (BH) method was used to control the false discovery rate (FDR) at 5%47,48. We then selected protein “top hits” by effect size per MAD unit. We present these top hits in tables using HR per log2 to illustrate effect sizes on a scale more commonly used in epidemiology than MAD. Presentation tables also include the P value to illustrate that most of these proteins meet the Bonferroni-corrected statistical significance level (P < 1.0 × 10⁻⁵ after adjusting for ~5000 tests).
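
As an illustration of the screening thresholds above, the following sketch applies the Benjamini–Hochberg step-up adjustment to a handful of hypothetical P values and computes the Bonferroni cutoff for ~5000 tests. The function name and the example P values are assumptions for illustration, not study results.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted P values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P values for five proteins.
p = [0.001, 0.008, 0.039, 0.041, 0.20]
q = benjamini_hochberg(p)

# Proteins passing the FDR < 0.05 screen (indices into p).
fdr_hits = [i for i, qi in enumerate(q) if qi < 0.05]

# Bonferroni cutoff for roughly 5000 aptamers, i.e. about 1.0e-5.
bonferroni_cutoff = 0.05 / 5000
```

Here only the first two proteins pass the FDR screen, and none would pass the much stricter Bonferroni cutoff, which is why FDR is the more permissive choice at the screening stage.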

To determine whether any associations of protein biomarkers with CKD progression may be unique to people with diabetes, we explored the impact of diabetes on associations of individual proteins with the primary outcome by visualizing a scatterplot of HRs in participants with vs. without diabetes. We also performed formal statistical interaction testing by diabetes status for all proteins that were associated with the primary outcome. This analysis included the 17 KRIS proteins reported to predict kidney failure in patients with diabetes12 and three additional proteins reported by the same investigators as potentially protective of kidney failure in patients with diabetes13.

We developed multiprotein risk models for the prediction of CKD progression and compared their predictive performance to clinical and hybrid clinical-protein models. We randomly split the CRIC data into two sets: 80% of individuals comprised the training set, and the remaining 20% the testing set. We used the training set to build prediction models and determine attendant tuning parameters. The testing set was used solely to evaluate the models’ performance. Our frontline technique for developing protein risk prediction models was elastic-net (EN) Cox regression, which combines ridge (L2) and LASSO (L1) penalties and handles each of the three (time-to-event, continuous, binary) outcome types. Model fitting was conducted using the R package glmnet11,12. The relative contributions of the two penalties are controlled by a mixing parameter α, which we set to 0.5 for balance. The shrinkage (regularization) parameter λ, which controls model complexity (the number of included proteins), was determined by tenfold cross-validation and the “1 standard error rule”. After the final selection of proteins, to reduce bias in estimated regression coefficients49, we refit the selected features for the EN model in a Cox regression model for the CKD progression survival outcomes and a logistic regression model for the binary eGFR decline outcome, as previously published50.
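
The roles of α and λ and the “1 standard error rule” can be made concrete with a short sketch. It assumes the penalty parametrization used by glmnet, λ[α‖β‖₁ + (1 − α)/2 ‖β‖₂²], and uses hypothetical cross-validation results; the function names are ours and this is not the study's R code.

```python
def elastic_net_penalty(beta, lam, alpha=0.5):
    """Elastic-net penalty: lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

def lambda_one_se(lambdas, cv_mean, cv_se):
    """Largest lambda whose mean CV error is within one SE of the minimum error."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    cutoff = cv_mean[best] + cv_se[best]
    eligible = [lambdas[i] for i in range(len(lambdas)) if cv_mean[i] <= cutoff]
    return max(eligible)

# Hypothetical coefficients and tenfold cross-validation summary.
penalty = elastic_net_penalty([1.0, -2.0, 0.0], lam=0.1, alpha=0.5)
lambdas = [0.20, 0.10, 0.05, 0.01]
cv_mean = [12.9, 12.3, 12.0, 12.1]   # mean CV error (e.g. partial-likelihood deviance)
cv_se = [0.3, 0.3, 0.4, 0.4]         # standard error of the CV error
chosen = lambda_one_se(lambdas, cv_mean, cv_se)
```

In this example the minimum error occurs at λ = 0.05, but the rule selects λ = 0.10 because its error (12.3) lies within one standard error of the minimum (12.0 + 0.4). The larger λ yields a sparser model, which is the purpose of the rule.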

We evaluated predictive performance by calculating Harrell’s C-index47 or Receiver Operating Characteristics (ROC) Area under the Curve (AUC) in the testing set47. For survival outcomes, we additionally calculated time-dependent AUC for years 1 to 15 using the testing set data. We evaluated model calibration in the training set with calibration bar plots to visualize the agreement between predicted and observed risk in each quintile of participants defined by predicted risk. A formal assessment of calibration made recourse to a model-based test that can accommodate survival endpoints in addition to continuous and binary outcomes51. We further conducted stability analyses of our EN models to ensure that results were not overly dependent on the specific training / test set partition deployed. This involved repeating the entire EN procedure on five alternate random partitions into training and test sets.
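
For readers unfamiliar with Harrell’s C-index, a minimal sketch follows. It counts a pair as comparable when the participant with the shorter follow-up time had an observed event, ignores tied times for simplicity (an assumption of this sketch), and uses hypothetical data. The study itself used R packages for this calculation.

```python
def harrell_c(times, events, risk):
    """Fraction of comparable pairs in which the earlier event has the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on the participant with an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable

# Hypothetical follow-up times (years), event indicators, and predicted risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]
c_index = harrell_c(times, events, risk)
```

In this example, four of the five comparable pairs are ordered correctly, giving a C-index of 0.8; a value of 0.5 would indicate discrimination no better than chance.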

We compared the protein models for CKD progression to two clinical risk models. The first model consisted of variables from the 4-variable Kidney Failure Risk Equation (KFRE) model (age, gender, eGFR, urine albumin-to-creatinine ratio)10. We also used a 10-variable clinical model (referred to herein as an expanded clinical model) that included the 4 KFRE variables, plus 6 other variables reported to associate with CKD progression in CRIC (race, systolic blood pressure, diabetes, smoking status, BMI, and cardiovascular disease (CVD) history). To optimize the performance of these clinical models in CRIC, the coefficients of the variables of both clinical models were refit to the primary and secondary outcomes. Comparisons between the various risk models were based on C-statistics calculated in the same participant set, using significance as a two-sided P value < 0.05, and visualized using forest plots.

In sensitivity analyses, we examined discrimination of the 65-protein risk model for the primary outcome in subgroups of gender, race, diabetes, or eGFR. Furthermore, in addition to the 10-year time horizon for risk modeling, we evaluated the performance of other protein models derived for shorter or longer time horizons of 2, 5, and 15 years. We also evaluated the performance of our primary 65-protein model using a race-free creatinine-based equation to calculate a 50% eGFR decline for the primary outcome.

External validation

We performed external validation for the primary outcome in 578 participants at visit 3 of the ARIC Study52 who had CKD (eGFR < 60 ml/min/1.73 m2) when plasma was obtained for SomaScan V4.0 proteomic analysis. We performed validation of the 20 individual proteins with the highest and the 20 proteins with the lowest HRs for the primary outcome in CRIC after adjustment for eGFR, by performing Cox regression for their associations with the same outcome in ARIC, with adjustment for eGFR. The statistical criterion for validation was a Bonferroni P value of <(0.05/40) or <0.00125, based on correcting for 40 proteins carried forward for validation. Discrimination and calibration of the multiprotein model for the primary outcome from CRIC were tested in ARIC, the calibration after adjustment for differences in baseline hazard, but retaining coefficients developed in CRIC. Statistical analyses were performed using R, version 4.0.3 (RStudio, Inc., Boston, MA. URL http://www.rstudio.com/), with the packages glmnet (version 4.0-2), survival (version 3.2-7), pec (version 2019.11), compareC (version 1.3.1), forestplot (version 1.10), and lme4 (version 1.1-26).

Pathway analysis

We performed pathway analyses to elucidate the biological processes and regulatory mechanisms associated with CKD progression. The set of CKD progression-associated proteins with HRs significant at a false discovery rate (FDR) threshold of 0.05 after adjustment for eGFR was organized into canonical pathways by the Ingenuity Pathway Analysis (IPA) tool as we have described previously6,9,53,54. For those modified aptamers that had multiple UniProt identifications associated with 1 result, only the first UniProt identification listed was used6,9,53. For proteins measured by two or more aptamers, the aptamer measurement with the largest effect size was utilized for the analysis. The Fisher right-tailed exact test was used to calculate a P value for the probability that the association between the differentially expressed proteins in the measured dataset and the pathway is explained by chance alone.
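
The right-tailed Fisher exact test used for pathway enrichment is equivalent to the upper tail of a hypergeometric distribution. The sketch below illustrates the calculation with hypothetical counts; the function name and the choice of background set are assumptions, and IPA's internal reference set is not reproduced here.

```python
from math import comb

def fisher_right_tail(overlap, n_hits, n_pathway, n_background):
    """P(X >= overlap) for X ~ Hypergeometric(n_background, n_pathway, n_hits)."""
    upper = min(n_hits, n_pathway)
    total = comb(n_background, n_hits)
    tail = sum(
        comb(n_pathway, x) * comb(n_background - n_pathway, n_hits - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 20 measured proteins, 5 of them in the pathway,
# 4 associated with the outcome, 3 of which fall in the pathway.
p_enrichment = fisher_right_tail(overlap=3, n_hits=4, n_pathway=5, n_background=20)
```

With these counts, 155 of the 4845 possible ways to draw 4 proteins from 20 contain at least 3 pathway members, so the P value is about 0.032.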

Mendelian randomization

To investigate whether a limited set of proteins from our study may be causally related to CKD progression, we conducted Mendelian Randomization (MR) analysis for 76 aptamers (75 proteins) that were either discovered as risk factors for CKD progression in CRIC and successfully validated in ARIC or were included in the 65-protein risk model for the primary outcome in CRIC. Genotyping has been performed in CRIC using Illumina HumanOmni1-QUAD V1.055 with 7,102,205 measured or imputed genetic variants available for protein quantitative trait loci (pQTL) analysis (861,291 variants prior to imputation). For each protein, we performed the pQTL analysis and considered cis-pQTL variants within 1 megabase (Mb) upstream or downstream of the transcription start site of the corresponding protein-coding gene that had a P value < 5 × 10⁻⁶. Furthermore, we conducted the conditional association analysis within the candidate set with the GCTA-COJO software56 and selected the conditionally significant variants at a P value threshold of 5 × 10⁻⁶ for the subsequent MR analysis. pQTL-protein associations were adjusted for age, gender, eGFR, BMI, and the first five genotype principal components. For proteins with more than one single nucleotide polymorphism (SNP) selected, we used multi-SNP MR using the inverse variance weighting method57. For proteins with just one SNP selected as the instrumental variable, we estimated the causal effect using the Wald ratio test58. The R packages MendelianRandomization59 and TwoSampleMR58 were used in our analyses. Since most genome-wide association studies (GWAS) of CKD focus on European Ancestry (EA), we restricted our pQTL analysis for this study to 1208 CRIC participants of European Ancestry. We augmented our MR analysis using published significant pQTLs for the SomaScan V4.0 in deCODE, a cohort of 35,559 Icelandic participants, for which the methods have been previously published60.
Utilizing cis pQTLs from CRIC or deCODE, we searched within three publicly available GWAS for kidney function to determine whether these variants were associated with kidney function decline, designating the significance threshold as a P value of 0.05 divided by the number of distinct proteins queried in the GWAS. We chose three publicly available GWAS datasets assembled from the CKD Genetics Consortium and the United Kingdom Biobank. The eGFR dataset includes 567,460 participants of European descent with eGFR measures within the CKD Genetics Consortium61. Rapid3 and CKDi25 include 42 cohorts from either CKD Genetics Consortium or UK Biobank with serial kidney function measures62. Rapid3 includes 34,874 cases in whom eGFR decline was ≥3 ml/min/1.73 m2 and 107,090 controls. CKDi25 includes 19,901 cases who start at eGFR >60 ml/min/1.73 m2 and decline to less than 60 ml/min/1.73 m2 and have ≥25% decline in eGFR, as well as 175,244 controls62. GWAS are available at http://ckdgen.imbi.uni-freiburg.de/.
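
The two MR estimators named above can be sketched as follows. This is a simplified illustration that assumes a fixed-effect inverse variance weighted estimate and a first-order standard error for the Wald ratio. The SNP effect sizes are hypothetical, and the functions are not the interfaces of the R packages used in the study.

```python
from math import sqrt

def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-SNP causal estimate: SNP-outcome effect divided by SNP-protein effect."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)  # first-order approximation
    return estimate, se

def ivw(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse variance weighted estimate across several SNPs."""
    trios = list(zip(beta_exposure, beta_outcome, se_outcome))
    numerator = sum(bx * by / se ** 2 for bx, by, se in trios)
    denominator = sum(bx ** 2 / se ** 2 for bx, by, se in trios)
    return numerator / denominator, sqrt(1.0 / denominator)

# Hypothetical protein with one cis-pQTL.
single_est, single_se = wald_ratio(0.5, 0.1, 0.02)

# Hypothetical protein with two conditionally independent cis-pQTLs.
multi_est, multi_se = ivw([0.5, 0.25], [0.1, 0.05], [0.02, 0.02])
```

Both examples imply an effect of 0.2 outcome units per unit of genetically predicted protein level. Combining two SNPs narrows the standard error from 0.04 to about 0.036.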

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.