Urinary peptidomics and bioinformatics for the detection of diabetic kidney disease

The aim of this study was to establish a peptidomic profile based on LC-MS/MS and random forest (RF) algorithm to distinguish the urinary peptidomic scenario of type 2 diabetes mellitus (T2DM) patients with different stages of diabetic kidney disease (DKD). Urine from 60 T2DM patients was collected: 22 normal (stage A1), 18 moderately increased (stage A2) and 20 severely increased (stage A3) albuminuria. A total of 1080 naturally occurring peptides were detected, which resulted in the identification of a total of 100 proteins, irrespective of the patients’ renal status. The classification accuracy showed that the most severe DKD (A3) presented a distinct urinary peptidomic pattern. Estimates for peptide importance assessed during RF model training included multiple fragments of collagen and alpha-1 antitrypsin, previously associated to DKD. Proteasix tool predicted 48 proteases potentially involved in the generation of the 60 most important peptides identified in the urine of DM patients, including metallopeptidases, cathepsins, and calpains. Collectively, our study lightened some biomarkers possibly involved in the pathogenic mechanisms of DKD, suggesting that peptidomics is a valuable tool for identifying the molecular mechanisms underpinning the disease and thus novel therapeutic targets.

improvement in early DKD diagnosis and prognosis 15 . While large-scale studies of proteins are termed proteomics, the studies of naturally occurring peptides produced by endogenous protease activity are termed peptidomics 18 . A classifier composed of 273 urinary peptides (CKD273) associated with CKD, irrespective of the underlying etiology, has been previously described [19][20][21] . The predictive value of CKD273 in T1DM and T2DM patients has been demonstrated in its ability to anticipate the progression from normo-to macroalbuminuria long before the modification of albumin excretion rate 22 . Recently, CKD273 was shown to be independent of albuminuria and eGFR by enabled the identification of diabetic progressors to eGFR <60 mL/min/1.73 m 2 in the absence of albuminuria 23 .
Using SELDI-MS analysis, a study showed a 12-peak signature predicting the development of DKD ten years prior to the increase in albumin to creatinine ratio 24 , while other study described a 4-peak pattern able to recognize T2DM patients with DKD 25 . A panel composed of 65 urinary peptide biomarkers, many of which fragments of type 1 collagen, was capable of distinguishing between diabetic patients without albuminuria from those with DKD, as well as predicting the progression toward overt DKD in patients with diabetes who had albuminuria over 3 years 15 . However, the lack of data reproducibility and the need of replication in populations with different genetic background and with different proteomic techniques pave the way for new studies.
Here, we reported the profile of naturally occurring urinary peptides from T2DM patients with different stages of DKD analyzed by LC-MS/MS. We used bioinformatics analyses to classify patients at different disease stages (A1, A2 or A3) according to their peptidomic profile and to identify peptides differentially represented between these disease stages. Bioinformatics tools were also used to investigate proteases possibly involved in the generation of these urinary peptides. Peptides from collagen, SERPING1 and SERPINA1, were identified as potential biomarkers to differentiate DKD stages. Possible roles for identified proteins and proteases in the pathogenic mechanisms of DKD were discussed. Table 1 depicts the characteristics of T2DM patients included in this study categorized by UAE according to the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines. It was included 22 with normal (stage A1), 18 with moderately increased (stage A2) and 20 with severely increased albuminuria (stage A3) patients. eGFR was significantly decreased only in the A3 group as compared to A1 group. The three groups presented similar mean ages, diabetes duration, and HbA1c levels, as well as percentages of women and smokers. The presence of hypertension and the mean of systolic and diastolic blood pressure, BMI and waist were also similar among the 3 groups.

Characteristics of type 2 DM patients according to UAE.
Urinary peptidomic analysis. Naturally occurring peptides in the urine from patients of UAE stages A1, A2 and A3 were isolated by ultrafiltration and analyzed by LC-MS/MS. MS/MS data were searched against the UniProt human protein database for peptide and protein identification and a total of 1080 peptides were identified, which corresponded to a total of 100 proteins (Supplementary Table S1). Peptide abundance was assessed by label-free quantification using normalized spectral counts of individual peptides (Supplementary Table S1, S2 and S3). The most abundant peptides detected in patients' urine were from collagen-derived proteins (COL1A1 and COL3A1) (Fig. 1). A3) according to their peptidomic profile. The detection rate for the filtered peptides varied between 100% (i.e., detected in all samples) and 5% (i.e., detected in 5% of samples). The confusion matrix for the tuned RF model, averaged over the 10 repetitions of the 5-fold cross-validation, is shown in Table 2. Among the 55 combinations of model's parameters tested, the best performance was achieved with ntree = 500 and mtry = 21 (see Supplementary  Fig. S1 for the complete results). An overall difficulty of the model in correctly classifying disease stage based on peptides quantification levels was observed. In particular, A1 patients were largely misclassified as A2, suggesting that peptidomic profiles of these two groups of patients may share similarities that turns the machine learning model training and classification more challenging. The average accuracy in the cross-validation process was 45.64% (±11.21%), with an AUC score of 0.619 (±0.106), and average sensitivity and specificity equal to 44.73% (±10.95%) and 72.58% (±5.60%), respectively. The same analysis was performed turning the data into a binary classification problem, with the two classes represented by A1 + A2 patients and A3 patients ( Table 2). For this scenario, model tuning returned ntree = 500 and mtry = 9 as the best parameter combination ( Supplementary  Fig. S1). The average accuracy improved to 74.83%, with an AUC score of 0.746 (±0.1518), and average sensibility and specificity of 95.0% (±7.14%) and 34.5% (±21.9%), respectively. This improvement demonstrates that the main difficulty of the previous model is to differentiate between patients classified as A1 and A2. Due to this difficulty, further analyses were done using the binary variable (A1 + A2 vs. A3) as target value.
Random Forest-based variable importance. The RF analysis ranks variables (peptides) based on their predictive importance during model training, which may be useful to better understand how the variance observed in peptides quantification levels may be associated to patients' DKD stages. Estimates for variable importance according to the average decrease in accuracy assessed during RF model training are shown in Fig. 2 (only the top 60 variables are shown. For the importance analysis of the 300 selected peptides used, see Supplementary File S1). The variables with the most predictive power included by ranking: multiple fragments of alpha-1 antitrypsin (SERPINA1; 10 fragments), type I collagen (COL1A1; 23 fragments), type III colagen (COL3A1; 4 fragments), C1 inhibitor (SERPING1; 3 fragments), but also other fragments as showed in Fig. 2.
We performed hierarchical clustering of peptidomic profiles for all samples considering the 60 most important variables as assessed by the RF model (Fig. 3). Pearson correlation-based distance and complete linkage were adopted as parameters in the clustering algorithm. Results corroborate the difficulty in finding clear patterns associated to disease stage that may aid in the classification of patients' diagnosis. Nonetheless, this analysis allows the visualization of singularities in peptides levels for subgroups of patients. For instance, four peptides associated to SERPINA1 show a particularly high concentration level for a subgroup of patients with advanced DKD (A3).  www.nature.com/scientificreports www.nature.com/scientificreports/ To further elucidate the expression pattern associated to the top predictors according to the RF model, we analyzed their fold change in the comparison between A1 + A2 patients and A3 patients. Of these peptides, 35 were found at lower and 25 at higher levels in A3 versus A1 and A2 groups (Table 3). Inspecting their statistical significance determined by the Mann-Whitney test, we observed that about half of peptides from the list (i.e., 29 out of 60) were significantly differentially expressed between groups (p < 0.05 and fold change <0.66 or fold change > 1.5), which are associated mainly to SERPINA1, COL1A1, and SERPING1. We note, however, that only 11 differentially expressed peptides showed statistically significant FDR value (FDR <0.05), probably due to the small sample size of our study (Supplementary File S2). The values distribution for the top 60 predictors identified by the RF model according to patients' disease stage (A1 and A2 or A3) are provided in Supplementary Fig. 2.
Protease prediction. In silico analysis was performed in order to predict the proteases possibly involved in the generation of the 60 most important peptides of the present study. The Proteasix open-source peptide-centric tool was used for the analysis 26 . By using this bioinformatics approach it is possible to track back to the enzymes responsible for the generation of these peptides. The majority of human proteases have several protein targets and likewise, one peptide sequence could be cleaved by different proteases. The analysis yielded a list of 48 proteases putatively responsible for the generation of the 60 most important peptides (Supplementary Table S4). From the 20 predicted proteases (Table 4)

Discussion
The limitations of UAE and GFR as a predictive biomarkers of DKD pave the way for the search of new biomarkers using omics technologies 12,28 . Furthermore, since the DKD prevalence is continuously increasing, there is an urgency to identify the mechanisms underlying its pathogenesis 4 . In the present study, we investigated the urinary peptidomic pattern of T2DM patients with different stages of DKD. We used a more flexible data-driven approach instead of hypothesis-driven methods for the investigation of multiple peptides related to DKD stages. The use of bioinformatics tools is a well-suited alternative for conventional regression models that fails to include large numbers of covariates. Therefore, the random forest (RF) algorithm was used to rank variables based on their predictive importance during model training to better understand how the peptides levels variation may be correlated to patients' diagnostic. In our study, considering the 3 stages of DKD, the RF approach achieved accuracy of below 70%, but accuracy raise when stage A1 and A2 were compared to stage A3, thus showing that the most severe DKD stage presented a distinct urinary peptidomic pattern when compared to two initials stages of disease. Next, www.nature.com/scientificreports www.nature.com/scientificreports/  www.nature.com/scientificreports www.nature.com/scientificreports/ the peptides were ranked in importance showing that the distinct pattern of A3 patients were due to an increased abundance of peptides from SERPINA1 and SERPING1 and a decreased abundance of collagen peptides.
Most of the differentialy expressed peptides found in the present study are in agreement with previous studies, despite the use of different proteomic methods, such as CE-MS 22,29,30 . Zurbig and collaborators 22 analyzed urine peptides using a biomarker classifier for chronic kidney disease (CKD273) in a longitudinal cohort of diabetic patients. In this retrospective cohort, collagen fragments decreased before the increase of albumin excretion, showing a major role in the initiation of DKD 22 . The collagen-derived peptides were also observed relatively decreased in studies on CKD273 classifier for prognosis of CKD development, as well as in other proteomic analyses [29][30][31] . In the present study, we detected decreased levels of peptides from COL1A1 in urine of patients with the most severe DKD stage. These findings are also supported by previous hypothesis that urinary collagen fragments decrease might be an indicator of reduced collagen breakdown, resulting in fibrosis 32,33 . These peptides are likely the result of normal physiological turnover of the extracellular matrix. It has been assumed that diminished activity of matrix metalloproteinases may be responsible for the accumulation of proteins in the extracellular matrix and collagens that characterize the fibrotic kidney 34 . The collagen breakdown might be inhibited by the higher levels of tissue inhibitor of matrix metalloprotease type 1 observed in patients with renal disease 35 . Moreover, the accumulation of extracellular matrix as predominantly observed in DKD was shown to be associated with decreased excretion of several specific collagen fragments 32 .
Diabetes complications are intimately linked to inflammation, evidenced by the presence of high levels of plasma inflammatory markers such as high-sensitivity C-reactive protein 36 . Elevated acute-phase proteins may reflect the inflammation and activation of innate immune system during the course of DKD 37 . Related to inflammation, peptides from SERPING1 and SERPINA1 were found among the top 60 most important variables identified in the DKD severe stage (A3 stage). SERPINA1 is a serine protease inhibitor, which targets elastase as well as other proteases 38 . Neutrophil elastase degrades a range of substrates including elastin and other extracellular matrix proteins such as collagen, fibronectin, complement receptors and several growth factors 39 . Therefore, in DKD, the increased degradation of SERPINA1 would lead to activation of elastase and thus might contribute to accumulation of matrix molecules. In addition, SERPINA1 has been associated to non-protease inhibitory effects, including anti-inflammatory, anti-oxidant, and anti-apoptotic properties 40 . Although the exact mechanism remains unknown, previous studies also reported that peptide levels from SERPINA1 were increased in the kidney of microalbuminuria state, indicating a role for this protein in the pathogenesis of DKD 30,41,42 .
Increasing evidence points toward a role for the complement system in the pathogenesis of DKD 36 . SERPING1 codes for a plasma protease C1 inhibitor that is involved in the regulation of the complement cascade 36 . In our study, SERPING1 peptides were increased in A3 patients' urine. Likewise, these peptides were negatively correlated to eGFR and related to advanced CKD 43 . SERPING1 main function is the inhibition of the complement system, through the irreversible inactivation of C1r and C1s proteases of the classical pathway C1 complex, thus avoiding its spontaneous activation. On the other hand, activation of this pathway leads to the production of complement C3 convertase, which activates complement component C3, leading to generation of the opsonic C3b and eventually generation of the membrane attack complex, which lyses, damages, or activates target cells 36 . A disturbance of complement regulation might lead to different inflammatory actions that may be linked to the pathogenesis of DKD. Since we observed a higher SERPING1 degradation in A3 patients, it could be suggested that there is more activation of the complement system in the most severe DKD stage. Transcript levels of C3 and  www.nature.com/scientificreports www.nature.com/scientificreports/ other complement components are increased in DKD glomeruli 44 . Additionally, mechanistic experiments performed on rodent diabetes models showed increased complement expression and activation correlated with DKD severity, whereas complement blockade improved outcomes in different nephropathy models 36,44 . Furthermore, patients and animals with DKD exhibited increased kidney C1q and C3 expression 45 . Based on these findings, and on the association of circulating immune complexes with diabetic renal injury, it is likely that complement system might play a functional role in DKD 45 .
The proteases regulation is necessary to maintain tissue homeostasis and its altered activity seems to be linked to the generation of peptides associated to DKD 27,46 . Bioinformatics peptide centric tools have been developed in order to track back to the proteases responsible for the generation of the urinary peptidome involved in DKD 26,47 . In the present study, we included DKD top 60 peptides in the Proteasix tool, yelding predicted proteases notably to the family of metalloproteinases, cathepsins and calpains. Supporting these findings, a recent study confirmed the altered activities and decreased expressions of some metalloproteinases in DKD at tissue levels 27 . The dysregulation of matrix metalloproteinases has been linked to renal fibrosis progression 48 and DKD 49 , although the evidence is not always consistent 49,50 . MMP-2 and MMP-9 preferentially degrades collagen type IV 51 -major component of tubular basement membrane (TBM) and linked to tubulointerstitial change and thickening of TBM 50,52 , hallmarks especifically related to DKD. Cathepsin D (CTSD), also verified in a previous study, is known to mediate inflammation 53 and appeared increased in the human kidney tissue of DKD patients, especially in the areas of tubular damage 54 . From the same protease family, Cathepsin K (CTSK) seems to be involved in the DKD peptides generation and has been shown to be correlated with DKD progression and vascular endothelial dysfunction and deterioration of renal function 27 . Thus, the identification of protease dysregulation confirms the role of imbalance of collagen degradation, inflammation and fibrotic processes in DKD 27,47 .
Some factors may obscure the association between naturally occurring peptides and presence of kidney damage. Patients were burdened with coexisting diseases and a complex combination of drugs. These confounders might affect the urinary proteome and peptidome. The drugs known to reduce proteinuria such as angiotensin-converting-enzyme inhibitors and angiotensin receptor blockers certainly underestimated UAE levels. The study is crossectional, so the described association should be interpreted with caution and need further experimental verification. However, our analysis confirms the previously described findings in other populations and provided information that can be used for putative therapeutic targets and better understanding of DKD pathogenesis.
In conclusion, our LC-MS/MS analysis revealed that urinary peptide profile varies according to DKD severity, indicating a potential use for DKD risk stratification. The results were obtained by a sequence of bioinformatics approaches, first to rank the high important peptides in DKD context and second to allow the interpretation of integrated urinary peptidomics to the prediction of proteolytic events linked to DKD, emphasizing the differential regulation of inflammation and the complement system in DKD. Metalloproteases and cathepsins appear to be involved in the pathogenesis of DKD, highlighting the importance of further mechanistic studies. Due to the bioinformatics tools improvements, the information on urinary peptides should help to better define DKD on a molecular level and to identify specific therapeutic targets.

Clinical and laboratorial analyses of patients.
A total of 60 patients with Type 2 DM (T2DM) from the Hospital de Clínicas de Porto Alegre (HCPA) had their urine samples consecutively collected using consistent standard operating procedures. All samples were collected as a spontaneously voided elimination and were stored immediately at −20 °C until analysis. The definition of T2DM was based on diagnosis of diabetes after the age of 40 years with no use of insulin during the first five years after diagnosis and no previous episodes of ketoacidosis. The exclusion criteria were malignancy and pregnancy. Informed written consent approved by responsible committee was obtained from all subjects, and ethical approval was obtained from the HCPA scientific committee.
A standard questionnaire was used to collect information on age, age at DM diagnosis, and drug treatment, and all patients underwent physical examination and laboratory evaluation. They were weighed unshod, wearing light outdoor clothes and their height was measured. Body mass index (BMI) was calculated as weight (kg)/ height squared (meters). Blood pressure (BP) was measured twice using a digital sphygmomanometer (Omron) with the subject seated and a 5-minute rest between measurements. The means of both measurements were used to calculate systolic and diastolic BP. Hypertension was defined as BP levels of 140/90 mm Hg or higher, or if the patient was taking antihypertensive drugs 55 .
A serum sample was taken after 12 hours of fasting for laboratory analyses. Creatinine levels were determined using a traceable Jaffe reaction kit; glycated hemoglobin (HbA1c) was quantified using an ion-exchange HPLC procedure (Merck-Hitachi L-9100 GhB Analyzer; Merck, Darmstadt, Germany; reference range: 4.7%-6.0%). www.nature.com/scientificreports www.nature.com/scientificreports/ mass cutoff; Sartorius, Goettingen, Germany) at 2,000 g for 10 min at 18 °C 19 . Subsequently, 1.2 ml of filtrate was applied onto a PD-10 desalting column (GE Healthcare) equilibrated with 0.01% NH 4 OH to remove urea, electrolytes, and salts. Additionally, 1.3 ml of equilibration buffer was applied to the filter bed as a first step and allowed to wash out by gravity flow, improving the yield of naturally occurring peptides. Next, 2 ml of equilibration buffer was applied to the PD-10 column. The flow-through was collected, lyophilized, and stored at −20 °C until further use 19 .

Liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis and database searching.
Urinary peptides were analyzed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) using a Waters nanoACQUITY UPLC system coupled to a Q-TOF Ultima API mass spectrometer (Waters, Milford, MA, USA), as described previously 58 . The peptides were eluted from the reverse-phase column toward the mass spectrometer at a flow rate of 200 nl/min with a linear gradient of acetonitrile in 0.1% formic acid (2-90%, 60 min). The MS survey scan was set to 1 s (0.1 s interscan time) and recorded from 100 to 2000 m/z. MS/MS scans were acquired from 50 to 2000 m/z, and scan and interscan rates were set as for MS. The samples were run in DDA mode where each full MS scan was followed by MS/MS acquisition using collision-induced dissociation of the three most intense ions from the MS scan. Each sample was run in duplicates (technical replicates). MS/MS raw data were processed using ProteinLynx Global Server software (Waters) and peak lists were exported in the mascot generic (.mgf) format. Peptides and protein identifications were performed with the MASCOT search engine (Matrix Science, London, UK; version 2.3.0) against UniProt/Swiss-Prot human protein database 59 . Database search was performed with the following parameters: oxidation of methionine (M), proline (P) and lysine (K) as variable modifications, without any enzyme specificity and maximal missed cleavage of 0. A mass tolerance of 0.1 Da for precursor and fragment ions was set. Ion type was set as monoisotopic, and peptide charges 2+, 3+ and 4+ were taken into account. Protein and peptide identifications were validated by Scaffold (Proteome Software Inc., version 4.4.1.1) analysis. MASCOT *.dat files were loaded on Scaffold and peptide identifications were accepted if they could be established at >95% of probability as assigned by the Peptide Prophet algorithm 60 and protein identifications were accepted if they could be established at >99% of probability as assigned by the Protein Prophet algorithm 61 and contained at least 2 identified peptides. The false discovery rate (FDR, Decoy) was <1% for proteins and peptides. The spectral counts were calculated for each peptide using Scaffold and normalized following default settings in which spectral counts are multiplied by the average across all samples and divided by the total number of spectral counts within each sample 62 .
Estimating variables importance with Random Forests. Random forest (RF) is an algorithm for classification that uses an ensemble of decision trees, each of which is built using a bootstrap sample of the data 63 . In addition, RF adopts random variable selection during trees growth that when combined to the bootstrap aggregation (i.e., bagging) procedure tends to generate a collection of weakly correlated individual trees. As a result, the ensemble-based decision has a great potential to achieve low bias and low variance in predictions, leading to improved predictive performance in relation to individual trees. RF has demonstrated excellent performance in genomic classification tasks, comparable to other state-of-the-art methods such as Support Vector Machines (SVMs) 64 . However, its ability to evaluate and rank variables based on their predictive importance during model training makes it particularly well-suited to better understand how the variance observed in peptides quantification levels may be associated to patients' diagnostic.
For this purpose, we trained a RF classifier using as training data the quantification of peptides for the 60 subjects. Common data pre-processing steps, such as removing highly correlated variables or variables with near zero variance across samples, were not applied due to the sparse nature of data and to the expected correlation among peptides related to the same protein. Since most peptides have over 50% missing values, only the top 300 detected peptides (i.e., in decreasing order of detection rate) were considered as model's variables. The number of trees in the ensemble (ntree) and the number of variables randomly sampled as candidates for node splitting during the tree growing process (mtry) were tunned, testing all possible combinations of ntree = [500, 1000, 1500, 2000, 2500] and mtry = [7,9,11,13,15,17,19,21,23,25,27]. Mtry values were chosen by defining an interval of −10 and +10 around the default suggested value (i.e., the square root of the total number of variables, which would be approximately 17 for our data), in steps of two. Model performance was assessed using the sensitivity, specificity, and accuracy computed based on the confusion matrix, as well as the area under the Receiver Operating Characteristic (ROC) curve (i.e., AUC score). Relative importance was estimated for each variable during model training as the mean decrease in accuracy observed after permuting the predictor's values. Analyses were performed with R package caret v.6.80.

Protease prediction.
The open-source tool for protease prediction -Proteasix (www.proteasix.org) 26 was used for the analysis to link naturally occurring peptides in urine to the proteases potentially involved in their generation. Proteasix uses information about naturally occurring peptides i.e. corresponding protein UniProt ID and start and stop amino acid position to predict potential cleaving proteases 27 . Proteasix retrieves information about cleavage sites (CS) from protease databases (MEROPS, BRENDA) considering also cleavage site restrictions (from ENZYME database) 27 . A list of predicted proteases is generated as a result of the analysis.
Statistical analysis. One-way ANOVA and χ2 tests were performed when appropriate. Statistical significance was assumed at p < 0.05. Peptide differential expression was assessed with the Mann-Whitney test and P-values were adjusted for multiple testing using the Benjamini-Hochberg method (FDR, false discovery rate). Peptides were considered differentially expressed among groups based on p < 0.05 and ratio higher than 1.5 (fold change higher than 1.5 or lower than 0.66) 27 . Statistical analyses were performed using SPSS version 18.0 (SPSS, Chicago, IL, USA) and R statistical language.