The utility of texture analysis of kidney MRI for evaluating renal dysfunction with multiclass classification model

We evaluated a multiclass classification model to predict estimated glomerular filtration rate (eGFR) groups in chronic kidney disease (CKD) patients using magnetic resonance imaging (MRI) texture analysis (TA). We identified 166 CKD patients who underwent MRI comprising Dixon-based T1-weighted in-phase (IP)/opposed-phase (OP)/water-only (WO) images, apparent diffusion coefficient (ADC) maps, and T2* maps. The patients were divided into severe, moderate, and control groups based on eGFR borderlines of 30 and 60 mL/min/1.73 m2. After extracting 93 texture features (TFs), dimension reduction was performed using inter-observer reproducibility analysis and sequential feature selection (SFS) algorithm. Models were created using linear discriminant analysis (LDA); support vector machine (SVM) with linear, rbf, and sigmoid kernels; decision tree (DT); and random forest (RF) classifiers, with synthetic minority oversampling technique (SMOTE). Models underwent 100-time repeat nested cross-validation. Overall performances of our classification models were modest, and TA based on T1-weighted IP/OP/WO images provided better performance than those based on ADC and T2* maps. The most favorable result was observed in the T1-weighted WO image using RF classifier and the combination model was derived from all T1-weighted images using SVM classifier with rbf kernel. Among the selected TFs, total energy and energy had weak correlations with eGFR.

Chronic kidney disease (CKD) affects 8-16% of the population worldwide and remains a major threat to global public health due to its increasing incidence and mortality. Common causes of CKD include diabetes and hypertension, especially in developed countries. However, less than 5% of patients with early CKD report being aware of their disease 1 . Therefore, appropriate screening, early diagnosis, and management are significant in preventing CKD-associated adverse clinical outcomes, such as end-stage kidney disease, cardiovascular disease, and increased mortality. The Kidney Disease Improving Global Outcomes guidelines 2 suggested a risk-based approach to the evaluation and management of CKD and proposed six disease categories related to the estimated glomerular filtration rate (eGFR): G1-G5, with G3 subdivided into 3a and 3b. The most essential cutoff points of eGFR are 60 and 30 mL/min/1.73 m 2 (as the borderlines of G2/G3 and G3/G4, respectively); therefore, the risk of death was reported to increase as the eGFR decreased below 60 mL/min/1.73 m 2 in recent CKD cohort studies 3 . In addition, an eGFR of less than 30 mL/min/1.73 m 2 is important from a radiological point of view as it relates to the availability of the contrast media 4 . The risk stratification of CKD based on the eGFR has undisputed advantages and has helped achieve greater awareness of CKD and its impact on global health. Ischemia and hypoxia are associated with the progression of CKD; however, clinical tools to quantify these factors in patients are lacking. Renal biopsy is the gold standard method to histologically evaluate renal pathology; nevertheless, it carries certain risks due to complications, such as bleeding. Conversely, magnetic resonance imaging (MRI) of the kidney has been used to non-invasively assess CKD progression. Several MRI methods have been successfully used to evaluate renal function, including diffusion-weighted imaging (DWI) and blood oxygen level-dependent imaging (BOLD). DWI and apparent diffusion coefficient (ADC) values are the most studied methods and have demonstrated a good correlation with renal function decline and renal fibrosis in CKD [5][6][7][8] . BOLD based on the T2* map reflects the regional renal oxygenation status and can assess hypoxia occurring during renal dysfunction 9,10 . Although Dixon-based gradient-echo MRI is another imaging method that is routinely performed in abdominal imaging and can measure renal lipid accumulation in type II diabetes mellitus 11 , its utility in the evaluation of CKD has not been thoroughly studied.
Texture analysis (TA) is an emerging technique that permits the quantification of image characteristics based on the distribution of pixels and their surface intensity or patterns 12,13 . TA has been applied to several medical image analyses, including oncologic imaging 14 , neuroimaging 15 , and abdominal imaging 16,17 . Recent reports have demonstrated the utility of TA based on DWI, BOLD, Susceptibility-weighted imaging (SWI), and T1 and T2 mapping 18,19 . However, TA of other essential MRI methods, including Dixon-based T1-weighted imaging (T1WI), has not been fully studied. Previous studies have described that as the renal function declines, a decreased difference between the values in the cortex and those in the medulla is observed in T1WI 20,21 . Therefore, we hypothesized that TA based on Dixon-based T1WI is especially important because of its capacity to capture the clearest images and reflect the morphological characteristics of the kidney.
Thus, this study aimed to assess and compare the performance of TA based on Dixon-based T1WI, ADC maps, and T2* maps (BOLD) for evaluating renal dysfunction.
According to the eGFR, 36 patients had severe renal dysfunction (se-RD) ( [42], mean age 43.7 ± 18.1 years, mean eGFR 78.1 ± 16.7 mL/ min/1.73 m 2 ). Table 1 details the distribution of the study population in each eGFR group. The age, percentage of men, and incidence rates of hypertension and diabetes increased significantly with progressive renal dysfunction. There was no significant difference in the incidence rate of IgA nephropathy and nephrotic syndrome among the three groups.
Dimension reduction of texture features. The T1-weighted in-phase (IP)/opposed-phase (OP)/wateronly (WO) images showed good reproducibility in the inter-observer reproducibility analysis, with mean interclass correlation coefficient (ICC) values of 0.767, 0.774, and 0.781, respectively. Conversely, the ADC and T2* maps showed slightly lower reproducibility, with mean ICC values of 0.732 and 0.718, respectively. Good interobserver reproducibility was observed for 59, 60, 61, 54, and 50 features (ICC ≥ 0.75 and lower 95% confidence interval [CI] ≥ 0.6) in T1-weighted IP/OP/WO images, ADC map, and T2* map, respectively. By excluding features with poor reproducibility (ICC < 0.75 or lower 95% CI < 0.6) from any one of the imaging methods, the www.nature.com/scientificreports/ number of features for each imaging method was reduced to 40. Table 2 lists the ICC values of these 40 features for each imaging method. Subsequently, the sequential feature selection (SFS) algorithm was used for feature selection. For each imaging method and machine learning (ML) classifier, a subset of five features that provided good classification accuracies was identified. The selected features for each classification attempt are listed in Tables 3, 4, 5, 6 and 7.
Concurrently, cross-correlation analyses were conducted between the eGFR and the 40 texture features. The highest correlation coefficients were observed for two texture features (total energy [0.55, p < 0.001] and energy [0.55, p < 0.001]) in T1-weighted WO images. Table 8 shows the relationship between the eGFR and the 40 selected texture features in each imaging modality.

Classification and validation.
Receiver-operating characteristic (ROC) curve analyses were performed to compare the capacity of TA quantified from each imaging method to differentiate the three groups of CKD. Overall, the TA based on T1-weighted IP/OP/WO images provided better classification performance than that based on the ADC and T2* maps. Among the five imaging methods, the T1-weighted WO images obtained the highest classification scores: an accuracy of 76.5-82.0% and a macro-average area under the curve (AUC) of 0.812-0.884. Among the six classifiers we studied, the most favorable performance was observed in the random forest (RF) classifier. As for the support vector machine (SVM) classifier, the favorable results were obtained in the SVM with rbf kernel, whereas the results were poor in the SVM with sigmoid kernel. The results of all classification attempts are summarized in Tables 3, 4 Combination models. The combination models derived from T1-weighted IP/OP/WO images (ALL T1WIs) and those derived from all imaging methods (ALL IMs) were evaluated. The selected texture features are listed in Tables 9 and 10. The best classification performance was observed in ALL T1WIs using the SVM with rbf kernel classifier: an accuracy of 82.8% and a macro-average AUC of 0.887. The results of all classification attempts are summarized in Tables 9 and 10. Figures 1b and 2b present the ROC curve and the confusion matrix of the representative model (ALL T1WIs using the SVM with rbf kernel classifier). The confusion matrices and ROC curves of all classification attempts are summarized online in Supplementary Figs. S6,S7 and S13,S14, respectively.

Discussion
In the present study, we sought to investigate whether multiclass classification models based on TA of kidney MRI could predict the three eGFR groups of renal dysfunction. The results of our study suggest that TA of kidney MRI would be a modest predictor of these eGFR groups, but might not be a valuable differentiator in a clinical setting. TA quantified from T1-weighted IP/OP/WO images provided better classification performance compared with that of those based on ADC maps and T2* maps. Furthermore, we examined the combination models and showed that texture models derived from ALL T1WIs using the SVM with rbf kernel classifier afforded the moderate diagnostic performance as well. To our knowledge, this is the first study to evaluate the possibility of using multiclass models in the classification of eGFR groups. Several attempts have been made to differentiate between eGFR groups using TA of kidney MRI. Previous reports have shown the possibility of using TA based on DWI, BOLD, SWI, and T1 and T2 mapping 18,19 . In all these attempts, only binary classifiers were examined to differentiate between each eGFR group separately. However, in clinical practice, it is not uncommon for the classification of diseases to extend to three or more groups; therefore, it is reasonable to build multiclass problems for clinical use. According to the recent guidelines 2 , renal impairment and prognosis have been stratified into six disease groups based on the eGFR: G1 to G5, with G3 split into 3a and 3b. The most important cutoff points are eGFR 60 and 30 mL/min/1.73 m 2 , and previous studies on TA of kidney MRI were designed to classify the Table 1. The demographic and clinical characteristics of the study population. Unless otherwise indicated, data are represented as the number (%) of patients. se-RD severe renal dysfunction (estimated glomerular filtration rate [eGFR] < 30 mL/min/1.73 m 2 , i.e., CKD stage G4-5), mo-RD moderate renal dysfunction (30 ≤ eGFR < 60 mL/min/1.73 m 2 , i.e., CKD stage G3a/3b), CG control group (eGFR ≥ 60 mL/min/1.73 m 2 , i.e., CKD stage G1-2), IgA immunoglobulin A, SD standard deviation. www.nature.com/scientificreports/ eGFR according to these cutoffs, except for one study that considered eGFR 90 mL/min/1.73 m 2 as an additional cutoff point 19,22 . Therefore, our research also aimed to classify three groups, with cutoffs at eGFR 60 and 30 mL/ min/1.73 m 2 . Ideally, the classification system of all six eGFR groups would be beneficial, although a large deviation in the distribution of each group prohibited these attempts.
Our study used Dixon-based T1-weighted images, which have not been fully discussed in the assessment of renal dysfunction. Dixon-based imaging, also called chemical shift imaging, uses the IP/OP cycling of fat and water molecules and allows the acquired images to be combined mathematically into four sequences: IP/OP/ WO/fat-only (FO) images 23 . This technique has been used as a homogeneous fat suppression or fat quantification method in various medical imaging fields. The Dixon method has the potential for measuring the renal lipid accumulation in type II diabetes mellitus 11 . Moreover, the Dixon technique is useful in detecting iron deposition Table 2. Representative texture features and their respective intraclass correlation coefficient. ADC apparent diffusion coefficient, GLCM gray-level co-occurrence matrix, GLDM gray-level dependence matrix, GLRLM gray-level run length matrix, GLSZM gray-level size zone matrix, IP in-phase, OP opposed-phase, TF texture feature, WO water-only. www.nature.com/scientificreports/ related to T2* effects and susceptibility artifacts owing to the double-echo sequence 24 . Additionally, Dixon-based images provide better signal-to-noise efficiency than other conventional fat-suppressed methods 25 . In our results, TA based on T1-weighted IP/OP/WO images demonstrated moderate classification scores, and the favorable classification scores were obtained by T1-weighted WO images. Although we could not analyze the FO image and fat fraction ratio map, as these were not available in all patients, a T1-weighted WO image can be regarded as a total fat-suppressed image and contains T1 information on components other than fat. In a recent study on T1 mapping, increased cortical T1 values and decreased T1 cortico-medullary differentiation correlated with the severity of renal impairment 20,21 . The changes in the T1 values could be attributed to renal physiological states, such as inflammation, hypoxia, and fibrosis 20,[26][27][28][29] . In our opinion, T1-weighted WO images may represent these changes and could be useful in the non-invasive assessment of CKD etiologies. In our study, TA based on the ADC and T2* maps showed relatively low diagnostic performance compared with that of those based on Dixon-based T1WI. In the ADC map, the texture features in the renal cortex had a good correlation with fibrosis and chronic lesions, and the texture features in the renal medulla were more related to renal function than those quantified from the renal cortex 19 . Additionally, the difference between the cortical and medullary ADC, the so-called delta-ADC, has been correlated with fibrosis in CKD 5,30 . However, since BOLD provides a marker of blood oxygenation levels, relative hypoxia associated with renal injury may be reflected by the T2* map. T2* measurement demonstrated a good correlation with the eGFR in patients with CKD, and TA Table 3. Performance of each classification attempt in discriminating between the three groups in T1-weighted in-phase imaging (T1WI IP). TF texture feature, DT decision tree, LDA linear discriminant analysis, SVM support vector machine, RF random forest classifier, AUC area under the curve, se-RD severe renal dysfunction (estimated glomerular filtration rate [eGFR] < 30 mL/min/1.73 m 2 , i.e., CKD stage G4-5), mo-RD moderate renal dysfunction (30 ≤ eGFR < 60 mL/min/1.73 m 2 , i.e., CKD stage G3a/3b), CG control group (eGFR ≥ 60 mL/min/1.73 m 2 , i.e., CKD stage G1-2). Feature name codes are as follows: TF1 = 10th percentile, TF3 = energy, TF4 = entropy, TF5 = interquartile range, TF8 = median, TF9 = robust mean absolute deviation, TF11 = total energy, TF13 = difference average, TF18 = joint energy, TF20 = maximum probability, TF22 = dependence non uniformity, TF24 = dependence variance, TF25 = gray level non uniformity (gray-level dependence matrix), TF28 = gray level non uniformity (gray-level run length matrix), TF31 = run entropy, TF32 = run length non uniformity. The data are expressed as means ± standard deviations. www.nature.com/scientificreports/ of BOLD was linearly correlated with the eGFR in several studies 18,[31][32][33] . In our results, TA of the ADC and T2* maps showed little correlation with the eGFR, and unsatisfactory results were obtained in multiclass problems. A possible explanation for the lower performances in the ADC and T2* maps is the difference in the quantification methods of texture features. We quantified the texture features from the renal parenchyma as a whole, whereas most other studies quantified the texture features from the renal cortex or medulla 5-10,19 . As described above, it would be favorable to consider the renal cortex and medulla separately for the assessment of the ADC and T2* maps. Moreover, our classification system was a multiclass model, and TA using non-linear classifiers based on clear images, such as Dixon-based T1WI, may be suitable for such systems. Regarding inter-reader reproducibility, the ICC values for the ADC and T2* maps were relatively low. These unsatisfactory results may be attributed to their relatively low resolution. Lower discriminative performance and reproducibility in TA of ADC maps have been reported due to the low resolution of the images 34,35 . In our research, the texture features were extracted from the whole area of both kidneys, although in most studies, these were measured in the renal cortex and medulla separately, and in the ipsilateral kidney, mostly on the right side due to artifacts caused by factors such as intestinal gas, poor breath holds, and susceptibility effects 18 . Considering the different functionalities of the renal cortex and medulla, it might be more appropriate to assess them separately. However, in clinical practice, we contemplated that evaluating them as a whole would be simpler and easier to understand. For the ADC and T2* maps, in particular, the delineation between the renal Table 4. Performance of each classification attempt in discriminating between the three groups in T1-weighted opposed-phase imaging (T1WI OP). TF texture feature, DT decision tree, LDA linear discriminant analysis, SVM support vector machine, RF random forest classifier, AUC area under the curve, se-RD severe renal dysfunction (estimated glomerular filtration rate [eGFR] < 30 mL/min/1.73 m 2 , i.e., CKD stage G4-5), mo-RD moderate renal dysfunction (30 ≤ eGFR < 60 mL/min/1.73 m 2 , i.e., CKD stage G3a/3b), CG control group (eGFR ≥ 60 mL/min/1.73 m 2 , i.e., CKD stage G1-2). Feature name codes are as follows: TF2 = 90th percentile, TF3 = energy, TF5 = interquartile range, TF7 = mean, TF10 = root mean squared, TF11 = total energy, TF13 = difference average, TF15 = id, TF16 = idm, TF17 = inverse variance, TF22 = dependence non uniformity, TF24 = dependence variance, TF25 = gray level non uniformity (graylevel dependence matrix), TF26 = large dependence emphasis, TF31 = run entropy, TF34 = run percentage, TF38 = size zone non uniformity normalized. The data are expressed as means ± standard deviations. www.nature.com/scientificreports/ cortex and medulla is difficult because of their relatively low resolution, and poor reproducibility is expected when considering the renal cortex and medulla separately. Moreover, in patients with advanced renal dysfunction, it is often difficult to distinguish between them because of cortical thinning 36,37 . Another point where our research differs from others is whether one or both sides of the kidney are considered. Our study evaluated both kidneys based on the idea that they might contain more integrated information concerning renal function. However, considering the time-consuming process of region of interest (ROI) delineation, it would be favorable to segment only one side of the kidney. If TA derived from one side of the kidney is sufficient, it would be beneficial to consider only one side of the kidney because of the severe artifacts on the other side. Moreover, in our study, we performed manual segmentation instead of using automatic methods. A method that automatically divides the renal parenchyma into 12 layers using a computer (twelve-layer concentric objects method) has been validated so far 32,38,39 . Its use may improve the discrimination capacity and reproducibility of our models, especially for the ADC and T2* maps, which need to be examined in the future. In recent years, TA has become a promising technique for quantitative imaging analysis, providing biomarkers for pathological changes or the response to treatment [12][13][14][15][16][17] . In our analysis, TA of kidney MRI was not a good discriminator of eGFR groups, especially in the clinical setting. Texture features such as energy and total energy were frequently included in the selected features in most classification attempts, although their correlation with the eGFR was not good. It might be interesting to know the existence of universal texture parameters, as such Table 5. Performance of each classification attempt in discriminating between the three groups in T1-weighted water-only imaging (T1WI WO). TF texture feature, DT decision tree, LDA linear discriminant analysis, SVM support vector machine, RF random forest classifier, AUC area under the curve, se-RD severe renal dysfunction (estimated glomerular filtration rate [eGFR] < 30 mL/min/1.73 m 2 , i.e., CKD stage G4-5), mo-RD moderate renal dysfunction (30 ≤ eGFR < 60 mL/min/1.73 m 2 , i.e., CKD stage G3a/3b), CG control group (eGFR ≥ 60 mL/min/1.73 m 2 , i.e., CKD stage G1-2). Feature name codes are as follows: TF1 = 10th percentile, TF3 = energy, TF5 = interquartile range, TF6 = mean absolute deviation, TF11 = total energy, TF12 = uniformity, TF13 = difference average, TF14 = difference entropy, TF16 = idm, TF18 = joint energy, TF19 = joint entropy, TF22 = dependence non uniformity, TF24 = dependence variance, TF29 = gray level non uniformity normalized, TF30 = long run emphasis, TF31 = run entropy, TF32 = run length non uniformity, TF39 = small area emphasis. The data are expressed as means ± standard deviations. www.nature.com/scientificreports/ features may represent the underlying pathophysiology of kidney disease. The energy and total energy show the magnitude of pixel values and accentuate the high signal intensities in the images 40 . In our opinion, cortico-medullary differentiation may have played a role in the renal dysfunction; as the renal function declined, decreased cortico-medullary differentiation was noted, as described above 20,21 . Both parameters showed decreased values in our results, implying decreased signal intensities in the whole area of the kidneys, and this may be affected by a decrease in the cortico-medullary difference. However, in most studies, entropy correlated well with the eGFR and showed the capacity to differentiate between the eGFR groups of patients with CKD 18,19,41 . Other reports state that skewness, kurtosis, and correlation may be useful in discriminating between these groups 18,19 .
None of the studies commented on the energy and total energy. One reason for this discrepancy could be the difference in the classification system: a multiclass classification model was used in our study. Although we did not examine the binary classification for each border, it is suspected that the energy and total energy could be weak indicators for the overall classification of the eGFR groups. Another reason for the discrepancy is that we considered Dixon-based T1WI, whereas other studies mainly discussed DWI, BOLD, and SWI: the classifiers used in this study were mainly non-linear ML methods. In addition, TA was conducted on the whole region of the kidneys, whereas in other studies the renal cortex and medulla were analyzed separately. In this study, we also demonstrated the possible candidates of classifiers, such as SVM and RF. SVM has high generalizability since it Table 6. Performance of each classification attempt in discriminating between the three groups in ADC map imaging. TF texture feature, DT decision tree, LDA linear discriminant analysis, SVM support vector machine, RF random forest classifier, AUC area under the curve, se-RD severe renal dysfunction (estimated glomerular filtration rate; eGFR < 30 mL/min/1.73 m 2 , i.e., CKD stage G4-5), mo-RD moderate renal dysfunction (30 ≤ eGFR < 60 mL/min/1.73 m 2 , i.e., CKD stage G3a/3b), CG control group (eGFR ≥ 60 mL/min/1.73 m 2 , i.e., CKD stage G1-2). Feature name codes are as follows: TF3 = energy, TF4 = entropy, TF6 = mean absolute deviation, TF11 = total energy, TF13 = difference average, TF14 = difference entropy, TF15 = id, TF16 = idm, TF25 = gray level non uniformity (gray-level dependence matrix), TF27 = small dependence emphasis, TF28 = gray level non uniformity (gray-level run length matrix), TF29 = gray level non uniformity normalized, TF30 = long run emphasis, TF33 = run length non uniformity normalized, TF36 = short run emphasis, TF37 = gray level non uniformity normalized, TF38 = size zone non uniformity normalized, TF39 = small area emphasis. The data are expressed as means ± standard deviations. www.nature.com/scientificreports/ can be used to select linear or non-linear kernels, and the 'rbf ' (non-linear) kernel could be the most suited to our models. Generally, non-linear classifiers would show good performance in multiclass classification 42 , and our results showed this tendency as well.
Our study had many limitations. First, we retrospectively enrolled 166 patients from a single institution; this was a small sample with some imbalance between each group. A greater number of patients with more balanced grouping is needed to validate the results in the future. Second, since we excluded patients with renal lesions, some important renal diseases, such as polycystic kidney disease, were ignored in this analysis, which would have caused a selection bias. Third, the data were not divided into training and validation sets because of the limited number of patients; hence, further investigation using an external validation cohort should be performed in the future. Fourth, since we focused on classifications predicting the eGFR group, other important laboratory data or underlying pathologies were missed in this study. As stated above, renal lipid accumulation in diabetes mellitus can be assessed using the Dixon method 11 , which is worth investigating in the future. Fifth, we extracted texture features from one layer of the image as two-dimensional data; the use of only one layer may result in important texture features being missed. This problem can be solved by extracting three-dimensional features, although this approach may be time-consuming. Sixth, in this study, the individual texture feature sets were selected for each classifier and imaging method. Future studies should strictly compare the performances of classifiers or imaging methods by selecting one common feature set. Lastly, we should have examined other T1-weighted Table 7. Performance of each classification attempt in discriminating between the three groups in T2* map imaging. TF texture feature, DT decision tree, LDA linear discriminant analysis, SVM support vector machine, RF random forest classifier, AUC area under the curve, se-RD severe renal dysfunction (estimated glomerular filtration rate [eGFR] < 30 mL/min/1.73 m 2 , i.e., CKD stage G4-5), mo-RD moderate renal dysfunction (30 ≤ eGFR < 60 mL/min/1.73 m 2 , i.e., CKD stage G3a/3b), CG control group (eGFR ≥ 60 mL/min/1.73 m 2 , i.e., CKD stage G1-2). Feature name codes are as follows: TF2 = 90th percentile, TF3 = energy, TF5 = interquartile range, TF6 = mean absolute deviation, TF7 = mean, TF10 = root mean squared, TF11 = total energy, TF13 = difference average, TF22 = dependence non uniformity, TF24 = dependence variance, TF25 = gray level non uniformity (gray-level dependence matrix), TF26 = large dependence emphasis, TF32 = run length non uniformity, TF35 = run variance, TF38 = size zone non uniformity normalized, TF40 = zone percentage. The data are expressed as means ± standard deviations. www.nature.com/scientificreports/ Dixon-based images, such as FO image and fat fraction ratio map, as well as other diffusion-based images such as intra-voxel incoherent motion.
In conclusion, multiclass classification models based on TA of kidney MRI showed modest classification performance for predicting the eGFR in patients with CKD. TA based on Dixon-based T1WI, particularly WO images, showed moderate performance. Energy and total energy were weakly correlated with the eGFR. Our results were limited in terms of the clinical value of TA of kidney MRI, and thus further studies should verify its reproducibility and feasibility. Table 8. The cross-correlation analyses between the eGFR and the 40 texture features derived from each imaging method. ADC apparent diffusion coefficient, eGFR estimated glomerular filtration rate, GLCM gray-level co-occurrence matrix, GLDM gray-level dependence matrix, GLRLM gray-level run length matrix, GLSZM gray-level size zone matrix, IP in-phase, OP opposed-phase, PCC Pearson's Correlation Coefficient, TF texture feature, WO water-only.

Code
Feature name code Hospital. The requirement for informed consent was waived by the committee (approval number BYOU2022-037). All experiments were performed in accordance with the relevant guidelines and regulations. Figure 3 presents the inclusion and exclusion criteria for this study. We identified and reviewed 209 patients referred from the Department of Nephrology in our hospital who underwent kidney MRI between January 2017 and September 2021. The inclusion criteria included: (1) age of 15 years or older; and (2) MRI scanning with Dixon-based T1WI, DWI, and ADC maps and T2* maps in our hospital. The exclusion criteria included: (1) lack of Dixon-based T1WI, DWI, and ADC maps or T2* maps (n = 5); (2) insufficient clinical or laboratory data (n = 1); (3) high-grade kidney atrophy (difficulty in segmentation) (n = 2); (4) severe artifacts on MRI (n = 17);   where age is in years and serum creatinine (sCr) is in mg/dL. The eGFR was defined as 120 mL/min/1.73 m 2 if it was greater than 120 mL/min/1.73 m 2 as calculated using Eq. (1). The patients were divided into three groups according to the eGFR: se-RD group (eGFR < 30 mL/min/1.73 m 2 , i.e., CKD stage G4-5), mo-RD group (30 ≤ eGFR < 60 mL/min/1.73 m 2 , i.e., CKD stage G3a/b), and CG (eGFR ≥ 60 mL/min/1.73 m 2 , i.e., CKD stage G1-2). MRI acquisition. MRI images were acquired using a 3.0 Tesla superconducting unit (Skyra, Siemens Healthcare, Erlangen, Germany) with a spine coil and an 18-channel phased-array body coil. The standard dedicated MRI protocol consisted of the following sequences: Dixon-based T1WI with the 3D gradient-echo method, (1) eGFR (mL/min/1.73 m 2 ) = 194 × sCr −1.094 × age −0.287 × 0.739 (for women), Table 9. Performance of each classification attempt in discriminating between the three groups in all T1-weighted imaging methods (ALL T1WIs). AUC area under the curve, IP in-phase, OP opposed-phase, T1WI T1-weighted imaging, TF texture feature, WO water-only, DT decision tree, LDA linear discriminant analysis, SVM support vector machine, RF random forest classifier, se-RD severe renal dysfunction (estimated glomerular filtration rate [eGFR] < 30 mL/min/1.73 m 2 , i.e., CKD stage G4-5), mo-RD moderate renal dysfunction (30 ≤ eGFR < 60 mL/min/1.73 m 2 , i.e., CKD stage G3a/3b), CG control group (eGFR ≥ 60 mL/ min/1.73 m 2 , i.e., CKD stage G1-2). Feature name codes are as follows: TF2 = 90th percentile, TF3 = energy, TF4 = entropy, TF8 = median, TF10 = root mean squared, TF12 = uniformity, TF13 = difference average, TF14 = difference entropy, TF18 = joint energy, TF19 = joint entropy, TF20 = maximum probability, TF24 = dependence variance, TF25 = gray level non uniformity (gray-level dependence matrix), TF28 = gray level non uniformity (gray-level run length matrix), TF31 = run entropy, TF37 = gray level non uniformity normalized. The data are expressed as means ± standard deviations. www.nature.com/scientificreports/ DWI with multiple b-factors, and T2*WI with multiple gradient-echoes obtained in the coronal plane. For the Dixon-based T1WI, only IP/OP/WO images were used in the analysis, as other images (such as fat-only images and fat fraction ratio maps) were not generated in all patients. The ADC map was automatically generated based on a monoexponential fitting model using DWI at the four b-factors. In BOLD, 12 T2* WIs corresponding to 12 different gradient echoes were acquired. T2* maps were generated on a pixel-by-pixel basis by fitting a linear regression method through the logarithms of the signal intensities versus their 12 echo times. Table 11 presents the representative MRI scanning sequences and parameters. Figure 4 presents the data analysis workflow. After segmentation, image processing, texture feature extraction, and reproducibility analysis were performed for each imaging method (T1-weighted IP/OP/WO images, ADC map, and T2* map), followed by texture feature selection and ML-based model construction in separate classification attempts. The combinations of the texture features were also exam- . Feature name codes are as follows: TF1 = 10th percentile, TF3 = energy, TF7 = mean, TF10 = root mean squared, TF11 = total energy, TF12 = uniformity, TF13 = difference average, TF18 = joint energy, TF19 = joint entropy, TF20 = maximum probability, TF21 = sum entropy, TF25 = gray level non uniformity (gray-level dependence matrix), TF28 = gray level non uniformity (gray-level run length matrix), TF31 = run entropy, TF35 = run variance, TF37 = gray level non uniformity normalized, TF40 = zone percentage. The data are expressed as means ± standard deviations. Texture feature extraction. Segmentation was performed using an open-source software (ITK-SNAP version 3.8.0). One slice of T1-weighted IP/OP/WO images, ADC maps, and T2* map images in the coronal plane were selected for each patient. An irregular two-dimensional ROI was drawn manually to contain the outline borders of the entire region of both kidneys on each selected image, and the cystic region was avoided to the maximum extent (Fig. 5). Two radiologists with 7 and 6 years of experience performed ROI delineation independently to assess the inter-observer reproducibility in the segmentation process. Both radiologists were blinded to the clinical information.

Data analysis procedures.
To avoid data heterogeneity bias, all MRI data were subjected to image normalization (the intensity of the image was scaled to 0-100) and resampled to the same resolution (3 × 3 × 3 mm) before feature extraction. The texture features were calculated using an open-source software package capable of extracting a large panel of engineered features from medical images (PyRadiomics version 2.1.0). Texture features were calculated based on six feature classes (first-order statistics, gray-level co-occurrence matrix, gray-level dependence matrix, graylevel run-length matrix, gray-level size zone matrix, and neighboring gray-tone difference matrix). Ninety-three texture features were extracted and analyzed to select the most valuable features for discerning the three CKD groups with each imaging method.
Dimension reduction of texture features. We performed dimension reduction of texture features to avoid overfitting and generalization errors in the classification models. After normalizing the numeric values as z-scores, the ICC was measured to evaluate the inter-observer reproducibility. Features with poor reproducibility (ICC < 0.75 or lower 95% CI < 0.6) in any of the imaging methods were excluded. Furthermore, the SFS algo-  www.nature.com/scientificreports/ rithm, a wrapper-based greedy search algorithm, was used for feature selection. This algorithm identifies feature subsets that maximize the performance of predictive models by adding or eliminating features stepwise based on a user-defined classifier algorithm. We considered four representative ML classifiers in this study: linear discriminant analysis (LDA), SVM, decision tree (DT), and RF classifiers. As for the SVM algorithm, various kernel functions provide different decision-making algorithms and generate versatility. We adopted three representa- After segmentation, image processing, texture feature extraction, and reproducibility analysis were conducted for each imaging method (T1-weighted in-phase/opposed-phase/water-only images, ADC maps, and T2* maps), followed by texture feature selection and ML-based model construction in separate classification attempts. The combinations of texture features were also examined: those derived from all T1-weighted images and those derived from all imaging methods.  www.nature.com/scientificreports/ tive kernels separately and compared their results in this study: linear, rbf, and sigmoid kernels. Thus, in total, we tested six different ML classifiers: LDA; SVM with linear, rbf, and sigmoid kernels; DT; and RF. By using the SFS algorithm, a subset of features that provided the best classification accuracies in each ML classifier was selected. The number of texture features was reduced to five in this step to prevent overfitting due to the small sample size. Concurrently, the relationship between the eGFR and the selected texture features for each imaging modality was examined using Pearson's correlation coefficient.
Classification and validation. Multiclass classification models were created using the six ML classifiers described above and validated using the cross-validation method. We adopted the following methods to obtain the generalizability of our classification models and to test their applicability: (1) synthetic minority oversampling technique (SMOTE), (2) nested cross-validation with grid-search parameter tuning, and (3) 100-time repeat cross-validation method.
Since our data had an imbalance between classes, we applied a SMOTE method before the final classification and validation step. This method creates synthetic instances that are not exact replications and increases the datasets of the minority group without damaging the structure of the actual data [43][44][45] . We applied this technique to augment the minority group datasets (i.e., 45 control cases and 36 severe RD cases), while preserving the majority group datasets (i.e., 85 moderate RD cases), resulting in 85 labeled cases for each class (255 cases in total). Several intrinsic hyperparameters are known for the SVM, DT, and RF classifiers, and the classification performance could be changed by attenuating these values. Thus, a nested cross-validation method with tenfold inner loops and tenfold outer loops was adopted to tune the parameters of these classifiers to avoid the double-dipping phenomenon, a potential bias 46,47 , which indicates that training and test data were used for feature selection and model development, along with validation. A grid-search system was used for parameter tuning, in which a set of parameters with a discrete number of values was tested repeatedly to obtain the best parameter combination. The following hyperparameters were tested: C-value = 1, 10, and 100 and gamma = 0.001, 0.01, and 0.1 for SVM; and max-depth = 2, 4, and 6 and min-samples-leaf = 0.1, 0.5, 1, 5, and 10 for DT and RF.
The cross-validation method was repeated 100 times to ensure the stability and reproducibility of our results. We repeated a SMOTE process along with a nested cross-validation as data augmented by the SMOTE may have some arbitrariness. The performance of the classifiers was evaluated using ROC curve analysis and the AUC. The accuracy, sensitivity, and specificity for each group and macro-average of all groups were calculated based on the confusion matrix of the classification results.