Introduction

Chronic kidney disease (CKD) affects 8–16% of the population worldwide and remains a major threat to global public health due to its increasing incidence and mortality. Common causes of CKD include diabetes and hypertension, especially in developed countries. However, less than 5% of patients with early CKD report being aware of their disease1. Therefore, appropriate screening, early diagnosis, and management are significant in preventing CKD-associated adverse clinical outcomes, such as end-stage kidney disease, cardiovascular disease, and increased mortality. The Kidney Disease Improving Global Outcomes guidelines2 suggested a risk-based approach to the evaluation and management of CKD and proposed six disease categories related to the estimated glomerular filtration rate (eGFR): G1–G5, with G3 subdivided into 3a and 3b. The most essential cutoff points of eGFR are 60 and 30 mL/min/1.73 m2 (as the borderlines of G2/G3 and G3/G4, respectively); therefore, the risk of death was reported to increase as the eGFR decreased below 60 mL/min/1.73 m2 in recent CKD cohort studies3. In addition, an eGFR of less than 30 mL/min/1.73 m2 is important from a radiological point of view as it relates to the availability of the contrast media4. The risk stratification of CKD based on the eGFR has undisputed advantages and has helped achieve greater awareness of CKD and its impact on global health.

Ischemia and hypoxia are associated with the progression of CKD; however, clinical tools to quantify these factors in patients are lacking. Renal biopsy is the gold standard method to histologically evaluate renal pathology; nevertheless, it carries certain risks due to complications, such as bleeding. Conversely, magnetic resonance imaging (MRI) of the kidney has been used to non-invasively assess CKD progression. Several MRI methods have been successfully used to evaluate renal function, including diffusion-weighted imaging (DWI) and blood oxygen level-dependent imaging (BOLD). DWI and apparent diffusion coefficient (ADC) values are the most studied methods and have demonstrated a good correlation with renal function decline and renal fibrosis in CKD5,6,7,8. BOLD based on the T2* map reflects the regional renal oxygenation status and can assess hypoxia occurring during renal dysfunction9,10. Although Dixon-based gradient-echo MRI is another imaging method that is routinely performed in abdominal imaging and can measure renal lipid accumulation in type II diabetes mellitus11, its utility in the evaluation of CKD has not been thoroughly studied.

Texture analysis (TA) is an emerging technique that permits the quantification of image characteristics based on the distribution of pixels and their surface intensity or patterns12,13. TA has been applied to several medical image analyses, including oncologic imaging14, neuroimaging15, and abdominal imaging16,17. Recent reports have demonstrated the utility of TA based on DWI, BOLD, Susceptibility-weighted imaging (SWI), and T1 and T2 mapping18,19. However, TA of other essential MRI methods, including Dixon-based T1-weighted imaging (T1WI), has not been fully studied. Previous studies have described that as the renal function declines, a decreased difference between the values in the cortex and those in the medulla is observed in T1WI20,21. Therefore, we hypothesized that TA based on Dixon-based T1WI is especially important because of its capacity to capture the clearest images and reflect the morphological characteristics of the kidney.

Thus, this study aimed to assess and compare the performance of TA based on Dixon-based T1WI, ADC maps, and T2* maps (BOLD) for evaluating renal dysfunction.

Results

Clinical characteristics

The study included 166 participants. The major etiologies of CKD were hypertensive nephrosclerosis (n = 80), diabetic mellitus nephropathy (n = 25), immunoglobulin A (IgA) nephropathy (n = 22), and nephrotic syndrome (n = 5). No abnormalities were observed in the remaining 34 patients.

According to the eGFR, 36 patients had severe renal dysfunction (se-RD) (men, n [%] = 26 [72], mean age 60.9 ± 16.4 years, mean eGFR 19.8 ± 7.7 mL/min/1.73 m2), 85 patients had moderate renal dysfunction (mo-RD) (men, n [%] = 57 [67], mean age 62.3 ± 13.3 years, mean eGFR 46.3 ± 8.1 mL/min/1.73 m2), and 45 patients were in the control group (CG) (men, n [%] = 19 [42], mean age 43.7 ± 18.1 years, mean eGFR 78.1 ± 16.7 mL/min/1.73 m2).

Table 1 details the distribution of the study population in each eGFR group. The age, percentage of men, and incidence rates of hypertension and diabetes increased significantly with progressive renal dysfunction. There was no significant difference in the incidence rate of IgA nephropathy and nephrotic syndrome among the three groups.

Table 1 The demographic and clinical characteristics of the study population.

Dimension reduction of texture features

The T1-weighted in-phase (IP)/opposed-phase (OP)/water-only (WO) images showed good reproducibility in the inter-observer reproducibility analysis, with mean interclass correlation coefficient (ICC) values of 0.767, 0.774, and 0.781, respectively. Conversely, the ADC and T2* maps showed slightly lower reproducibility, with mean ICC values of 0.732 and 0.718, respectively. Good inter-observer reproducibility was observed for 59, 60, 61, 54, and 50 features (ICC ≥ 0.75 and lower 95% confidence interval [CI] ≥ 0.6) in T1-weighted IP/OP/WO images, ADC map, and T2* map, respectively. By excluding features with poor reproducibility (ICC < 0.75 or lower 95% CI < 0.6) from any one of the imaging methods, the number of features for each imaging method was reduced to 40. Table 2 lists the ICC values of these 40 features for each imaging method.

Table 2 Representative texture features and their respective intraclass correlation coefficient.

Subsequently, the sequential feature selection (SFS) algorithm was used for feature selection. For each imaging method and machine learning (ML) classifier, a subset of five features that provided good classification accuracies was identified. The selected features for each classification attempt are listed in Tables 3, 4, 5, 6 and 7.

Table 3 Performance of each classification attempt in discriminating between the three groups in T1-weighted in-phase imaging (T1WI IP).
Table 4 Performance of each classification attempt in discriminating between the three groups in T1-weighted opposed-phase imaging (T1WI OP).
Table 5 Performance of each classification attempt in discriminating between the three groups in T1-weighted water-only imaging (T1WI WO).
Table 6 Performance of each classification attempt in discriminating between the three groups in ADC map imaging.
Table 7 Performance of each classification attempt in discriminating between the three groups in T2* map imaging.

Concurrently, cross-correlation analyses were conducted between the eGFR and the 40 texture features. The highest correlation coefficients were observed for two texture features (total energy [0.55, p < 0.001] and energy [0.55, p < 0.001]) in T1-weighted WO images. Table 8 shows the relationship between the eGFR and the 40 selected texture features in each imaging modality.

Table 8 The cross-correlation analyses between the eGFR and the 40 texture features derived from each imaging method.

Classification and validation

Receiver-operating characteristic (ROC) curve analyses were performed to compare the capacity of TA quantified from each imaging method to differentiate the three groups of CKD. Overall, the TA based on T1-weighted IP/OP/WO images provided better classification performance than that based on the ADC and T2* maps. Among the five imaging methods, the T1-weighted WO images obtained the highest classification scores: an accuracy of 76.5–82.0% and a macro-average area under the curve (AUC) of 0.812–0.884. Among the six classifiers we studied, the most favorable performance was observed in the random forest (RF) classifier. As for the support vector machine (SVM) classifier, the favorable results were obtained in the SVM with rbf kernel, whereas the results were poor in the SVM with sigmoid kernel. The results of all classification attempts are summarized in Tables 3, 4, 5, 6 and 7. Figures 1a and 2a present the ROC curve and the confusion matrix of the representative model (T1-weighted WO image with RF classifier). The confusion matrices and ROC curves of all classification attempts are summarized online in Supplementary Figs. S1S5 and S8S12, respectively.

Figure 1
figure 1

The receiver operating characteristic (ROC) curves and area under the curve (AUC) values of representative classification models using T1-weighted water-only images with a random forest classifier (A) and all T1-weighted images using a support vector machine with rbf kernel classifier (B) in classifying the three groups of chronic kidney disease. Severe renal dysfunction group (se-RD, estimated glomerular filtration rate [eGFR] < 30 mL/min/1.73 m2), moderate renal dysfunction group (mo-RD, 30 ≤ eGFR < 60 mL/min/1.73 m2), and control group (CG, eGFR ≥ 60 mL/min/1.73 m2). The AUC values are expressed as means.

Figure 2
figure 2

Confusion matrices show the status of representative classification models using T1-weighted water-only images with a random forest classifier (A) and all T1-weighted images using a support vector machine with rbf kernel classifier (B) in classifying the three groups of chronic kidney disease. Severe renal dysfunction group (se-RD, estimated glomerular filtration rate [eGFR] < 30 mL/min/1.73 m2), moderate renal dysfunction group (mo-RD, 30 ≤ eGFR < 60 mL/min/1.73 m2), and control group (CG, eGFR ≥ 60 mL/min/1.73 m2). The data are expressed as means ± standard deviations.

Combination models

The combination models derived from T1-weighted IP/OP/WO images (ALL T1WIs) and those derived from all imaging methods (ALL IMs) were evaluated. The selected texture features are listed in Tables 9 and 10. The best classification performance was observed in ALL T1WIs using the SVM with rbf kernel classifier: an accuracy of 82.8% and a macro-average AUC of 0.887. The results of all classification attempts are summarized in Tables 9 and 10. Figures 1b and 2b present the ROC curve and the confusion matrix of the representative model (ALL T1WIs using the SVM with rbf kernel classifier). The confusion matrices and ROC curves of all classification attempts are summarized online in Supplementary Figs. S6,S7 and S13,S14, respectively.

Table 9 Performance of each classification attempt in discriminating between the three groups in all T1-weighted imaging methods (ALL T1WIs).
Table 10 Performance of each classification attempt in discriminating between the three groups in all imaging methods (ALL IMs).

Discussion

In the present study, we sought to investigate whether multiclass classification models based on TA of kidney MRI could predict the three eGFR groups of renal dysfunction. The results of our study suggest that TA of kidney MRI would be a modest predictor of these eGFR groups, but might not be a valuable differentiator in a clinical setting. TA quantified from T1-weighted IP/OP/WO images provided better classification performance compared with that of those based on ADC maps and T2* maps. Furthermore, we examined the combination models and showed that texture models derived from ALL T1WIs using the SVM with rbf kernel classifier afforded the moderate diagnostic performance as well. To our knowledge, this is the first study to evaluate the possibility of using multiclass models in the classification of eGFR groups. Several attempts have been made to differentiate between eGFR groups using TA of kidney MRI. Previous reports have shown the possibility of using TA based on DWI, BOLD, SWI, and T1 and T2 mapping18,19. In all these attempts, only binary classifiers were examined to differentiate between each eGFR group separately. However, in clinical practice, it is not uncommon for the classification of diseases to extend to three or more groups; therefore, it is reasonable to build multiclass problems for clinical use. According to the recent guidelines2, renal impairment and prognosis have been stratified into six disease groups based on the eGFR: G1 to G5, with G3 split into 3a and 3b. The most important cutoff points are eGFR 60 and 30 mL/min/1.73 m2, and previous studies on TA of kidney MRI were designed to classify the eGFR according to these cutoffs, except for one study that considered eGFR 90 mL/min/1.73 m2 as an additional cutoff point19,22. Therefore, our research also aimed to classify three groups, with cutoffs at eGFR 60 and 30 mL/min/1.73 m2. Ideally, the classification system of all six eGFR groups would be beneficial, although a large deviation in the distribution of each group prohibited these attempts.

Our study used Dixon-based T1-weighted images, which have not been fully discussed in the assessment of renal dysfunction. Dixon-based imaging, also called chemical shift imaging, uses the IP/OP cycling of fat and water molecules and allows the acquired images to be combined mathematically into four sequences: IP/OP/WO/fat-only (FO) images23. This technique has been used as a homogeneous fat suppression or fat quantification method in various medical imaging fields. The Dixon method has the potential for measuring the renal lipid accumulation in type II diabetes mellitus11. Moreover, the Dixon technique is useful in detecting iron deposition related to T2* effects and susceptibility artifacts owing to the double-echo sequence24. Additionally, Dixon-based images provide better signal-to-noise efficiency than other conventional fat-suppressed methods25. In our results, TA based on T1-weighted IP/OP/WO images demonstrated moderate classification scores, and the favorable classification scores were obtained by T1-weighted WO images. Although we could not analyze the FO image and fat fraction ratio map, as these were not available in all patients, a T1-weighted WO image can be regarded as a total fat-suppressed image and contains T1 information on components other than fat. In a recent study on T1 mapping, increased cortical T1 values and decreased T1 cortico-medullary differentiation correlated with the severity of renal impairment20,21. The changes in the T1 values could be attributed to renal physiological states, such as inflammation, hypoxia, and fibrosis20,26,27,28,29. In our opinion, T1-weighted WO images may represent these changes and could be useful in the non-invasive assessment of CKD etiologies.

In our study, TA based on the ADC and T2* maps showed relatively low diagnostic performance compared with that of those based on Dixon-based T1WI. In the ADC map, the texture features in the renal cortex had a good correlation with fibrosis and chronic lesions, and the texture features in the renal medulla were more related to renal function than those quantified from the renal cortex19. Additionally, the difference between the cortical and medullary ADC, the so-called delta-ADC, has been correlated with fibrosis in CKD5,30. However, since BOLD provides a marker of blood oxygenation levels, relative hypoxia associated with renal injury may be reflected by the T2* map. T2* measurement demonstrated a good correlation with the eGFR in patients with CKD, and TA of BOLD was linearly correlated with the eGFR in several studies18,31,32,33. In our results, TA of the ADC and T2* maps showed little correlation with the eGFR, and unsatisfactory results were obtained in multiclass problems. A possible explanation for the lower performances in the ADC and T2* maps is the difference in the quantification methods of texture features. We quantified the texture features from the renal parenchyma as a whole, whereas most other studies quantified the texture features from the renal cortex or medulla5,6,7,8,9,10,19. As described above, it would be favorable to consider the renal cortex and medulla separately for the assessment of the ADC and T2* maps. Moreover, our classification system was a multiclass model, and TA using non-linear classifiers based on clear images, such as Dixon-based T1WI, may be suitable for such systems. Regarding inter-reader reproducibility, the ICC values for the ADC and T2* maps were relatively low. These unsatisfactory results may be attributed to their relatively low resolution. Lower discriminative performance and reproducibility in TA of ADC maps have been reported due to the low resolution of the images34,35.

In our research, the texture features were extracted from the whole area of both kidneys, although in most studies, these were measured in the renal cortex and medulla separately, and in the ipsilateral kidney, mostly on the right side due to artifacts caused by factors such as intestinal gas, poor breath holds, and susceptibility effects18. Considering the different functionalities of the renal cortex and medulla, it might be more appropriate to assess them separately. However, in clinical practice, we contemplated that evaluating them as a whole would be simpler and easier to understand. For the ADC and T2* maps, in particular, the delineation between the renal cortex and medulla is difficult because of their relatively low resolution, and poor reproducibility is expected when considering the renal cortex and medulla separately. Moreover, in patients with advanced renal dysfunction, it is often difficult to distinguish between them because of cortical thinning36,37. Another point where our research differs from others is whether one or both sides of the kidney are considered. Our study evaluated both kidneys based on the idea that they might contain more integrated information concerning renal function. However, considering the time-consuming process of region of interest (ROI) delineation, it would be favorable to segment only one side of the kidney. If TA derived from one side of the kidney is sufficient, it would be beneficial to consider only one side of the kidney because of the severe artifacts on the other side. Moreover, in our study, we performed manual segmentation instead of using automatic methods. A method that automatically divides the renal parenchyma into 12 layers using a computer (twelve-layer concentric objects method) has been validated so far32,38,39. Its use may improve the discrimination capacity and reproducibility of our models, especially for the ADC and T2* maps, which need to be examined in the future.

In recent years, TA has become a promising technique for quantitative imaging analysis, providing biomarkers for pathological changes or the response to treatment12,13,14,15,16,17. In our analysis, TA of kidney MRI was not a good discriminator of eGFR groups, especially in the clinical setting. Texture features such as energy and total energy were frequently included in the selected features in most classification attempts, although their correlation with the eGFR was not good. It might be interesting to know the existence of universal texture parameters, as such features may represent the underlying pathophysiology of kidney disease. The energy and total energy show the magnitude of pixel values and accentuate the high signal intensities in the images40. In our opinion, cortico-medullary differentiation may have played a role in the renal dysfunction; as the renal function declined, decreased cortico-medullary differentiation was noted, as described above20,21. Both parameters showed decreased values in our results, implying decreased signal intensities in the whole area of the kidneys, and this may be affected by a decrease in the cortico-medullary difference. However, in most studies, entropy correlated well with the eGFR and showed the capacity to differentiate between the eGFR groups of patients with CKD18,19,41. Other reports state that skewness, kurtosis, and correlation may be useful in discriminating between these groups18,19. None of the studies commented on the energy and total energy. One reason for this discrepancy could be the difference in the classification system: a multiclass classification model was used in our study. Although we did not examine the binary classification for each border, it is suspected that the energy and total energy could be weak indicators for the overall classification of the eGFR groups. Another reason for the discrepancy is that we considered Dixon-based T1WI, whereas other studies mainly discussed DWI, BOLD, and SWI: the classifiers used in this study were mainly non-linear ML methods. In addition, TA was conducted on the whole region of the kidneys, whereas in other studies the renal cortex and medulla were analyzed separately. In this study, we also demonstrated the possible candidates of classifiers, such as SVM and RF. SVM has high generalizability since it can be used to select linear or non-linear kernels, and the 'rbf' (non-linear) kernel could be the most suited to our models. Generally, non-linear classifiers would show good performance in multiclass classification42, and our results showed this tendency as well.

Our study had many limitations. First, we retrospectively enrolled 166 patients from a single institution; this was a small sample with some imbalance between each group. A greater number of patients with more balanced grouping is needed to validate the results in the future. Second, since we excluded patients with renal lesions, some important renal diseases, such as polycystic kidney disease, were ignored in this analysis, which would have caused a selection bias. Third, the data were not divided into training and validation sets because of the limited number of patients; hence, further investigation using an external validation cohort should be performed in the future. Fourth, since we focused on classifications predicting the eGFR group, other important laboratory data or underlying pathologies were missed in this study. As stated above, renal lipid accumulation in diabetes mellitus can be assessed using the Dixon method11, which is worth investigating in the future. Fifth, we extracted texture features from one layer of the image as two-dimensional data; the use of only one layer may result in important texture features being missed. This problem can be solved by extracting three-dimensional features, although this approach may be time-consuming. Sixth, in this study, the individual texture feature sets were selected for each classifier and imaging method. Future studies should strictly compare the performances of classifiers or imaging methods by selecting one common feature set. Lastly, we should have examined other T1-weighted Dixon-based images, such as FO image and fat fraction ratio map, as well as other diffusion-based images such as intra-voxel incoherent motion.

In conclusion, multiclass classification models based on TA of kidney MRI showed modest classification performance for predicting the eGFR in patients with CKD. TA based on Dixon-based T1WI, particularly WO images, showed moderate performance. Energy and total energy were weakly correlated with the eGFR. Our results were limited in terms of the clinical value of TA of kidney MRI, and thus further studies should verify its reproducibility and feasibility.

Methods

Subjects

This study was approved by the Research Ethics Committee of the Saitama Medical University Hospital. The requirement for informed consent was waived by the committee (approval number BYOU2022-037). All experiments were performed in accordance with the relevant guidelines and regulations.

Figure 3 presents the inclusion and exclusion criteria for this study. We identified and reviewed 209 patients referred from the Department of Nephrology in our hospital who underwent kidney MRI between January 2017 and September 2021. The inclusion criteria included: (1) age of 15 years or older; and (2) MRI scanning with Dixon-based T1WI, DWI, and ADC maps and T2* maps in our hospital. The exclusion criteria included: (1) lack of Dixon-based T1WI, DWI, and ADC maps or T2* maps (n = 5); (2) insufficient clinical or laboratory data (n = 1); (3) high-grade kidney atrophy (difficulty in segmentation) (n = 2); (4) severe artifacts on MRI (n = 17); and (5) presence of renal lesions with maximal diameter > 1 cm or number of renal masses > 5 in each kidney, including polycystic kidney disease (n = 18). In total, 166 patients were enrolled.

Figure 3
figure 3

Flow chart of the inclusion and exclusion criteria for the study. ADC apparent diffusion coefficient, DWI diffusion-weighted imaging, MRI magnetic resonance imaging, T1WI T1-weighted imaging.

The eGFR was calculated using Eq. (1):

$${\text{eGFR }} ( {{\text{mL}}/{\text{min}}/{1}.{\text{73 m}}^{{2}} } ) = { 194 } \times {\text{ sCr}}^{{ - 1.094}} \times {\text{ age}}^{{ - 0.{287}}} \times 0.739\, ( {\text{for women}} ),$$
(1)

where age is in years and serum creatinine (sCr) is in mg/dL. The eGFR was defined as 120 mL/min/1.73 m2 if it was greater than 120 mL/min/1.73 m2 as calculated using Eq. (1).

The patients were divided into three groups according to the eGFR: se-RD group (eGFR < 30 mL/min/1.73 m2, i.e., CKD stage G4–5), mo-RD group (30 ≤ eGFR < 60 mL/min/1.73 m2, i.e., CKD stage G3a/b), and CG (eGFR ≥ 60 mL/min/1.73 m2, i.e., CKD stage G1–2).

MRI acquisition

MRI images were acquired using a 3.0 Tesla superconducting unit (Skyra, Siemens Healthcare, Erlangen, Germany) with a spine coil and an 18-channel phased-array body coil. The standard dedicated MRI protocol consisted of the following sequences: Dixon-based T1WI with the 3D gradient-echo method, DWI with multiple b-factors, and T2*WI with multiple gradient-echoes obtained in the coronal plane. For the Dixon-based T1WI, only IP/OP/WO images were used in the analysis, as other images (such as fat-only images and fat fraction ratio maps) were not generated in all patients. The ADC map was automatically generated based on a monoexponential fitting model using DWI at the four b-factors. In BOLD, 12 T2* WIs corresponding to 12 different gradient echoes were acquired. T2* maps were generated on a pixel-by-pixel basis by fitting a linear regression method through the logarithms of the signal intensities versus their 12 echo times.

Table 11 presents the representative MRI scanning sequences and parameters.

Table 11 Representative MRI scanning sequences and parameters.

Data analysis procedures

Figure 4 presents the data analysis workflow. After segmentation, image processing, texture feature extraction, and reproducibility analysis were performed for each imaging method (T1-weighted IP/OP/WO images, ADC map, and T2* map), followed by texture feature selection and ML-based model construction in separate classification attempts. The combinations of the texture features were also examined: those derived from all T1-weighted images (ALL T1WIs) and those derived from all imaging methods (ALL IMs).

Figure 4
figure 4

Flow chart showing the technical study pipeline. After segmentation, image processing, texture feature extraction, and reproducibility analysis were conducted for each imaging method (T1-weighted in-phase/opposed-phase/water-only images, ADC maps, and T2* maps), followed by texture feature selection and ML-based model construction in separate classification attempts. The combinations of texture features were also examined: those derived from all T1-weighted images and those derived from all imaging methods.

Texture feature extraction

Segmentation was performed using an open-source software (ITK-SNAP version 3.8.0). One slice of T1-weighted IP/OP/WO images, ADC maps, and T2* map images in the coronal plane were selected for each patient. An irregular two-dimensional ROI was drawn manually to contain the outline borders of the entire region of both kidneys on each selected image, and the cystic region was avoided to the maximum extent (Fig. 5). Two radiologists with 7 and 6 years of experience performed ROI delineation independently to assess the inter-observer reproducibility in the segmentation process. Both radiologists were blinded to the clinical information.

Figure 5
figure 5

A method to set the region of interest (ROI) for each group and each image. ROIs were manually drawn on the contour lines of both kidneys, as shown by the red curves (avoiding the cystic area). ADC apparent diffusion coefficient, IP in-phase, OP opposed-phase, WO water-only. Severe renal dysfunction group (se-RD, estimated glomerular filtration rate [eGFR] < 30 mL/min/1.73 m2), moderate renal dysfunction group (mo-RD, 30 ≤ eGFR < 60 mL/min/1.73 m2), and control group (CG, eGFR ≥ 60 mL/min/1.73 m2).

To avoid data heterogeneity bias, all MRI data were subjected to image normalization (the intensity of the image was scaled to 0–100) and resampled to the same resolution (3 × 3 × 3 mm) before feature extraction. The texture features were calculated using an open-source software package capable of extracting a large panel of engineered features from medical images (PyRadiomics version 2.1.0). Texture features were calculated based on six feature classes (first-order statistics, gray-level co-occurrence matrix, gray-level dependence matrix, gray-level run-length matrix, gray-level size zone matrix, and neighboring gray-tone difference matrix). Ninety-three texture features were extracted and analyzed to select the most valuable features for discerning the three CKD groups with each imaging method.

Dimension reduction of texture features

We performed dimension reduction of texture features to avoid overfitting and generalization errors in the classification models. After normalizing the numeric values as z-scores, the ICC was measured to evaluate the inter-observer reproducibility. Features with poor reproducibility (ICC < 0.75 or lower 95% CI < 0.6) in any of the imaging methods were excluded. Furthermore, the SFS algorithm, a wrapper-based greedy search algorithm, was used for feature selection. This algorithm identifies feature subsets that maximize the performance of predictive models by adding or eliminating features stepwise based on a user-defined classifier algorithm. We considered four representative ML classifiers in this study: linear discriminant analysis (LDA), SVM, decision tree (DT), and RF classifiers. As for the SVM algorithm, various kernel functions provide different decision-making algorithms and generate versatility. We adopted three representative kernels separately and compared their results in this study: linear, rbf, and sigmoid kernels. Thus, in total, we tested six different ML classifiers: LDA; SVM with linear, rbf, and sigmoid kernels; DT; and RF. By using the SFS algorithm, a subset of features that provided the best classification accuracies in each ML classifier was selected. The number of texture features was reduced to five in this step to prevent overfitting due to the small sample size.

Concurrently, the relationship between the eGFR and the selected texture features for each imaging modality was examined using Pearson’s correlation coefficient.

Classification and validation

Multiclass classification models were created using the six ML classifiers described above and validated using the cross-validation method. We adopted the following methods to obtain the generalizability of our classification models and to test their applicability: (1) synthetic minority oversampling technique (SMOTE), (2) nested cross-validation with grid-search parameter tuning, and (3) 100-time repeat cross-validation method.

Since our data had an imbalance between classes, we applied a SMOTE method before the final classification and validation step. This method creates synthetic instances that are not exact replications and increases the datasets of the minority group without damaging the structure of the actual data43,44,45. We applied this technique to augment the minority group datasets (i.e., 45 control cases and 36 severe RD cases), while preserving the majority group datasets (i.e., 85 moderate RD cases), resulting in 85 labeled cases for each class (255 cases in total).

Several intrinsic hyperparameters are known for the SVM, DT, and RF classifiers, and the classification performance could be changed by attenuating these values. Thus, a nested cross-validation method with tenfold inner loops and tenfold outer loops was adopted to tune the parameters of these classifiers to avoid the double-dipping phenomenon, a potential bias46,47, which indicates that training and test data were used for feature selection and model development, along with validation. A grid-search system was used for parameter tuning, in which a set of parameters with a discrete number of values was tested repeatedly to obtain the best parameter combination. The following hyperparameters were tested: C-value = 1, 10, and 100 and gamma = 0.001, 0.01, and 0.1 for SVM; and max-depth = 2, 4, and 6 and min-samples-leaf = 0.1, 0.5, 1, 5, and 10 for DT and RF.

The cross-validation method was repeated 100 times to ensure the stability and reproducibility of our results. We repeated a SMOTE process along with a nested cross-validation as data augmented by the SMOTE may have some arbitrariness. The performance of the classifiers was evaluated using ROC curve analysis and the AUC. The accuracy, sensitivity, and specificity for each group and macro-average of all groups were calculated based on the confusion matrix of the classification results.

Combination models

We also evaluated the classification performance of the combination models derived from ALL T1WIs and those derived from ALL IMs. The SFS algorithm was used again for feature selection, and the number of texture features was reduced to 5. The multiclass classification models were created using the six ML classifiers mentioned above, and the performance of the classifiers was evaluated in the same manner as described above.

Statistical analyses were performed using an open-source software package (Python scikit-learn 0.22.1). Statistical significance was set at P < 0.05.