Introduction

In recent years, medical image analysis has found increasingly widespread application, with radiomics emerging as a prominent technique1,2,3. Radiomics analyzes images by applying machine-learning techniques to quantitative characteristics extracted from the imaging data, such as morphological and textural features. It aids in diagnosis and prediction, for example in identifying tumor molecular types4,5 and predicting disease outcomes6,7.

Radiomics models are often built to predict rare diseases from data stemming from clinical routine8,9; consequently, the resulting datasets are often class-imbalanced, meaning that the sample size of one class is much smaller than that of the other10. This imbalance can give rise to several modeling challenges, since classifiers might overfit the majority class; that is, they primarily model the majority class and treat the minority class as noise. In this case, the utility of the classifier is largely diminished, since the minority class is often predicted incorrectly11.

This problem is often tackled by balancing the data using resampling methods, which add or remove samples to obtain evenly sized classes. Resampling methods mainly fall into three categories: oversampling, undersampling, and combinations of both12,13. Oversampling creates synthetic minority samples, whereas undersampling removes samples from the majority class. Both strategies have drawbacks. Since sample sizes in radiomics are often relatively small, undersampling could remove valuable information and thus severely affect overall performance. Oversampling, on the other hand, might generate incorrect samples and thereby distort the data, which can also decrease performance. Furthermore, most radiomics studies that employ resampling evaluate it on only a single dataset14,15; the measured effect could therefore be specific to that dataset. In addition, resampling could also influence feature selection methods and the set of selected features, which would, in turn, affect the interpretation of the resulting models.

To measure these effects, in this study, we applied nine different resampling methods to fifteen radiomics datasets. We estimated their impact on the predictive performance and the set of selected features to gain insight into the overall effect of resampling methods on radiomic datasets.

Results

Predictive performance

Overall, no large difference in predictive performance was seen between the resampling methods (Fig. 1), and, on average, resampling resulted in a slight loss in AUC (up to − 0.027 for the worst resampling method). Compared to not applying a resampling method, most oversampling methods (SMOTE, SVM-SMOTE) showed virtually no difference (at most + 0.015). Undersampling methods performed worse; in particular, Edited NN and All k-NN showed losses in AUC of at least 0.025. The same was also true for the two combined methods.

Figure 1

Relative predictive performance of the resampling methods. Mean rank and mean gain in AUC, sensitivity and specificity of all resampling methods across all datasets, compared to not resampling (None). Maximum gain in AUC, sensitivity, and specificity denotes the largest difference seen in any of the datasets. Oversampling methods are denoted in blue, undersampling methods in red, and combined methods in green.

Yet, no single method outperformed all others across all datasets (Fig. 2): for example, the worst-performing method, Edited NN (k = 3), still showed a small performance increase (+ 0.013 in AUC) over the best oversampling method on one dataset (Fig. 3). The Friedman test indicated statistically significant differences between the resampling methods (p < 0.001); a post hoc Nemenyi test showed that the All k-NN methods (with k = 5 and k = 7) were inferior to not resampling and to SMOTE (k = 7) (all p < 0.05). In addition, Edited NN (k = 3) was significantly worse than SMOTE (k = 7) (p = 0.04).

Figure 2

Pairwise wins and losses for all resampling methods. Each row denotes how often the resampling method won against the other methods (columns). Draws between resampling methods counted as 0.5. Oversampling methods are denoted in blue, undersampling methods in red, and combined methods in green.

Figure 3

Rankings on each dataset. Rankings were obtained by sorting the AUCs of the best-performing models. Draws were counted as 0.5. Oversampling methods are denoted in blue, undersampling methods in red, and combined methods in green.

Regarding the sensitivity and specificity of the resulting models, again, no clear difference could be seen on average between the resampling methods (Fig. 1). However, in contrast to AUC, the sensitivity showed a more considerable gain on specific datasets (up to 0.055). Similarly, the specificity did not improve on average compared to not resampling; nevertheless, more consistent gains of up to 0.032 could be seen for nearly all methods on specific datasets.

Feature agreement and similarity

Resampling changed the set of features selected by the models performing best in terms of AUC (Fig. 4). Measured with the Jaccard index, the sets of selected features agreed on average by only 28.7%. The highest agreement between any oversampling method and no resampling at all was observed for random oversampling, with an agreement of 40%. Using the Ochiai index did not alter these results substantially: on average, a higher agreement of around 38.5% was seen (Fig. S1 in Additional file 1), and the largest feature agreement with no resampling was again observed for random oversampling.

Figure 4

Feature agreement using the Jaccard index. Agreement of the sets of features selected by the resampling methods. For this, the Jaccard index of the selected features was computed on each fold of the cross-validation and averaged. Oversampling methods are denoted in blue, undersampling methods in red, and combined methods in green.

Feature similarity was higher and amounted to 86.9% (Fig. 5). Overall, the worse a resampling method performed relative to the others, the less similar its selected features were. Using the Zucknick measure led to even smaller feature similarities (Fig. S2 in Additional file 1).

Figure 5

Feature similarity based on correlation. Similarity among the sets of features selected by the resampling methods. The similarity was computed by identifying, for each selected feature, the maximally correlated feature in the other set and averaging these correlations. Oversampling methods are denoted in blue, undersampling methods in red, and combined methods in green.

Discussion

Resampling methods have often been applied in radiomics with the promise of improving the predictive performance when the data is imbalanced. In this study, we estimated the impact of different resampling methods on the predictive performance and the selected features across multiple datasets.

Regarding the predictive performance, virtually no improvement was seen compared to not resampling. Even worse, applying undersampling decreased the performance on average. On specific datasets, however, a slight increase of up to + 0.015 in AUC could be seen for SMOTE, showing that when only a single dataset is considered, one method can appear to outperform every other. Yet, these observations do not generalize, since SMOTE performed worse on other datasets. The same holds for the models' sensitivity and specificity: while, on average, no improvement compared to not resampling was seen, a higher sensitivity (up to + 0.055) and specificity (up to + 0.032) could be observed on specific datasets. Again, this was dependent on the dataset; it also means that the same models performed worse in terms of sensitivity and specificity on other datasets.

The situation is more complicated when comparing the agreement of the sets of features selected by the best-performing models. Even if two different resampling methods resulted in similarly performing models, this does not entail that the same number or the same set of features was chosen. On average, less than one-third of the selected features agreed, which shows that if one were to link the selected features to biomarkers, no agreement could be reached when models were trained using different resampling methods. Using the Ochiai index instead of the more commonly used Jaccard index to measure the agreement did not change this picture, although a higher agreement (by around 10 percentage points) was observed.

However, the picture changed when similarity was considered: on average, each feature selected by the best-performing model for one resampling method correlated highly with a feature selected under another resampling method. This is partially an effect of the high correlation present in radiomic datasets16: resampling can change the statistical distribution of the features so that the feature selection identifies other features as relevant, but it tends to select highly correlated features, i.e., those that contain similar information. We employed our own measure for feature similarity since no universally accepted metric exists17. This measure intuitively captures the average of the highest correlations between the two feature sets. As an alternative, we also employed the Zucknick measure, which can be understood as a variant of the Jaccard index that takes feature similarity into account. Using this measure, however, led to lower feature similarities. One reason for this difference is that the Zucknick measure takes the number of selected features into account, which our measure does not; it can consequently lead to unexpected results when features are duplicated in the feature sets. Therefore, we believe our measure is more appropriate for measuring feature similarity.

Together with the fact that the predictive performance did not improve, this indicates that resampling does not help in radiomics as much as one would hope.

Given the relatively large number of radiomic studies utilizing SMOTE or other resampling methods, our result is surprising. However, Blagus and Lusa analyzed SMOTE on three high-dimensional genetic datasets and concluded that it had no measurable effect on high-dimensional data and that undersampling is preferable to SMOTE18. Our results partly confirmed these observations: none of the resampling methods improved the overall predictive performance, yet SMOTE and its variants did not lead to a drop in predictive performance as the undersampling methods did. The difference to the study of Blagus and Lusa might lie in the datasets; radiomics datasets are also high-dimensional but might have different characteristics than genetic datasets, which could lead to different behavior.

Our results could also point towards a publication bias: if resampling did not improve predictive performance, it might have been dropped from reporting, leaving only those studies where resampling did help. Arguably, this would hurt radiomic research from a scientific viewpoint19. Another bleak explanation could be that some studies did not apply the resampling correctly. If only cross-validation is used without an independent test set, it is of utmost importance that resampling is applied only to the training set and does not utilize the validation set in any way20,21. If this is not followed, a large bias can be expected22,23; yet this kind of error is common24 and often cannot be detected without access to the code, which is most often not provided in radiomic studies. There is also the possibility that the setup of these studies differed in some ways from ours; for example, we only used filter feature selection methods and rather simple classifiers. More intricate wrapper methods like SVM-RFE combined with more complex classifiers like XGBoost might perform better on specific datasets. However, we followed the usual radiomic pipeline rather strictly and employed the most often-used methods.
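To illustrate the pitfall described above, the following minimal sketch contrasts resampling applied to the whole dataset before cross-validation (which leaks information from the validation folds) with resampling applied only inside each training fold via an imbalanced-learn pipeline. The data, classifier, and variable names are illustrative and not taken from the studies discussed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # pipeline that supports samplers

# Synthetic imbalanced data for illustration only
X, y = make_classification(n_samples=200, n_features=50, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Incorrect: resampling the whole dataset first lets synthetic samples derived from
# validation cases leak into the training folds, inflating the estimated AUC.
X_leak, y_leak = SMOTE(random_state=0).fit_resample(X, y)
auc_leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leak, y_leak,
                            cv=cv, scoring="roc_auc").mean()

# Correct: the imblearn Pipeline applies SMOTE only to the training portion of each fold.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
auc_clean = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()

print(f"leaky AUC: {auc_leaky:.3f}, leakage-free AUC: {auc_clean:.3f}")
```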

Other studies partly confirm our results. Sarac and Guvenis considered six different resampling methods in a cohort of patients with oropharyngeal cancer to determine their HPV status25. They demonstrated that oversampling performed better overall than undersampling, with SMOTE obtaining the highest performance. In our experiments, oversampling also performed better on average than undersampling, and SMOTE performed relatively well; however, not resampling at all performed best, a baseline that Sarac and Guvenis unfortunately did not consider. In a similar study, Zhang et al. tested four subsampling methods in a cohort of patients with non-small-cell lung cancer (NSCLC) and reported that applying SMOTE improved the performance significantly, with an improvement of + 0.03 in AUC26, while other resampling methods seemed to fare less well. However, it must be noted that they extracted only 30 features, which makes their data low-dimensional (more samples than features); therefore, the improvement might be larger than what we observed on any single dataset. In a recent large benchmarking study on non-radiomics datasets, Tarawneh et al. concluded that many resampling methods are not helpful27. This result is in line with our study, even though they employed only low-dimensional datasets with very high imbalance, which is uncommon in radiomics.

In our study, we applied resampling before feature selection, following the observations of Blagus and Lusa18. Yet, resampling can be used before or after feature selection, and there are arguments for both choices. Since feature selection methods might be strongly affected by imbalance, applying resampling beforehand might be more beneficial18. However, using it after feature selection also has some advantages: as the dataset is slimmed down by feature selection, the resampling is computationally more efficient and does not resample otherwise irrelevant features. Yet, the situation is not clear-cut: in a recent study on high-dimensional genetic data, Ramos-Pérez et al.28 demonstrated that the optimal order of resampling and feature selection can depend on the resampling method. They state that random undersampling (RUS) should ideally be performed before feature selection, but random oversampling (ROS) and SMOTE afterward. We did not observe this in our study, where SMOTE applied upfront outperformed RUS. Accordingly, when confronted with a new dataset, both variants should be tested if predictive performance is the goal.

Some limitations apply to our study. First, although we employed a rather large collection of datasets, these were collected opportunistically, and we cannot exclude that a potential bias is present. Furthermore, we could not use external data since there are only very few publicly available datasets where such data is provided; we therefore cannot rule out that resampling methods could help models generalize better to external data29. Instead, we utilized cross-validation, which can measure robustness with respect to different distributions only in a limited way. In addition, cross-validation could lead to some overfitting; however, this would presumably affect all methods by a similar amount. Due to restricted computational resources, we opted for a fivefold cross-validation with 30 repeats, although we acknowledge that other validation schemes, like leave-one-out CV or a higher number of repeats, could allow for more precise results. In addition, although we tested the most commonly used resampling methods, many more have been developed, especially methods based on generative adversarial networks, which are promising30.

The same applies to the feature selection methods and classifiers we employed in this study. We also only considered generic features, that is, morphological, intensity, and textural features, and did not employ datasets with features extracted from deep neural networks31,32,33. Since these features might be quantitatively different, our conclusions might not hold for such datasets. Furthermore, we did not apply feature reduction, like principal component analysis, because these methods generate new features that usually do not have a direct interpretation; however, since features are thought to correspond to biomarkers, their interpretation is critical in radiomics. Also, our study considered only AUC as the primary metric for predictive performance and sensitivity and specificity as secondary metrics. Depending on the problem, other metrics can be more important, and our study cannot estimate the effect of resampling on them. However, AUC is arguably the most essential choice since it can be considered the de-facto standard metric for radiomic studies34,35.

Our study demonstrated that, on average, resampling methods did not improve the overall predictive performance of models in radiomics, although this might be the case for a specific dataset. Applying resampling largely changed the set of selected features, which obstructs feature interpretation. However, the set of features was highly correlated, indicating that resampling does not change the information in the data by much.

Methods

In this study, we utilized previously published and publicly accessible datasets to ensure reproducibility. The corresponding ethical review boards granted ethical approval for these datasets. Since the study was retrospective, the local Ethics Committee (Ethik-Kommission, Medizinische Fakultät der Universität Duisburg-Essen, Germany) waived the need for additional ethical approval. The study was conducted following relevant guidelines and regulations.

Datasets

We collected a total of 15 publicly available radiomic datasets (Table 1). These datasets were not collected systematically but were gathered opportunistically, reflecting the scarcity of relevant data in the field36. All datasets comprised already extracted radiomics features; no feature generation was performed for this study. Each dataset was prepared by removing non-radiomic features (like clinical or genetic data) before merging all data splits. The datasets were all high-dimensional, meaning they had more features than samples, except for two datasets (Carvalho2018 and Saha2018).

Table 1 Overview of the datasets used.

Preprocessing

A few missing values were observed in the datasets, most notably in the three datasets by Hosny et al., where 0.79%, 0.65%, and 0.19% of the values were missing. These missing values almost exclusively affected two features, ‘exponential_ngtdm_contrast’ and ‘exponential_glcm_correlation’, and possibly occurred because of numerical overflows due to the exponential function; these two features were consequently removed from the analysis. All other datasets had less than 0.16% missing values. Due to how radiomics features are computed, these values were likely missing completely at random and did not introduce systematic bias. The remaining missing values were imputed with the respective feature means. All datasets were then normalized by z-score, i.e., by subtracting the mean of each feature and dividing by its standard deviation.
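As a sketch of this preprocessing, assuming the extracted features are held in a pandas DataFrame, the overflow-affected features could be dropped and the remaining values mean-imputed and z-score normalized with scikit-learn as follows (function and variable names are illustrative, not the study's code):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Features reported above as affected by numerical overflows
OVERFLOW_FEATURES = ["exponential_ngtdm_contrast", "exponential_glcm_correlation"]

def preprocess(features: pd.DataFrame) -> pd.DataFrame:
    """Drop the overflow-affected features, mean-impute, and z-score normalize."""
    cleaned = features.drop(columns=OVERFLOW_FEATURES, errors="ignore")
    imputed = SimpleImputer(strategy="mean").fit_transform(cleaned)
    scaled = StandardScaler().fit_transform(imputed)
    return pd.DataFrame(scaled, columns=cleaned.columns, index=cleaned.index)
```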

Resampling methods

Nine different resampling methods were used in this study, encompassing over- and undersampling as well as combination techniques (Table 2). The undersampling methods, which were random undersampling, edited nearest neighbours (ENN), All k-NN, and Tomek links, aim to reduce the size of the majority class to match that of the minority class. In contrast, the oversampling methods, random oversampling, the synthetic minority oversampling technique (SMOTE), and SVM-SMOTE, aim to increase the size of the minority class to match that of the majority class. Combination methods, like SMOTE + ENN and SMOTE + Tomek, involve resampling both classes, usually resulting in datasets where the majority class is smaller than before and the minority class larger.

Table 2 List of resampling methods and parameters.

Some resampling methods require a choice of the neighborhood size to consider during resampling. In the original SMOTE study37, the neighborhood size was set to 5. Since this choice might not be optimal for all 15 datasets, a smaller and a larger size were also considered, i.e., the neighborhood size was chosen from 3, 5, and 7. However, in a few datasets, a size of 5 or 7 for the undersampling methods effectively removed the minority class; therefore, only a neighborhood of size 3 was used for ENN and SMOTE + ENN.
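As an illustration, the resampling methods and neighborhood sizes described above could be instantiated with imbalanced-learn roughly as follows; the dictionary keys and exact parameterization are assumptions for this sketch, not the study's code:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE, SVMSMOTE
from imblearn.under_sampling import (RandomUnderSampler, EditedNearestNeighbours,
                                     AllKNN, TomekLinks)
from imblearn.combine import SMOTEENN, SMOTETomek

def build_resamplers(k=5, seed=0):
    """Return the resampling methods; k is the neighborhood size (3, 5, or 7)."""
    return {
        "RandomOver": RandomOverSampler(random_state=seed),
        "SMOTE": SMOTE(k_neighbors=k, random_state=seed),
        "SVM-SMOTE": SVMSMOTE(k_neighbors=k, random_state=seed),
        "RandomUnder": RandomUnderSampler(random_state=seed),
        # ENN is restricted to a neighborhood of 3 to avoid removing the minority class
        "EditedNN": EditedNearestNeighbours(n_neighbors=3),
        "AllKNN": AllKNN(n_neighbors=k),
        "TomekLinks": TomekLinks(),
        "SMOTE+ENN": SMOTEENN(random_state=seed),   # ENN component defaults to 3 neighbors
        "SMOTE+Tomek": SMOTETomek(smote=SMOTE(k_neighbors=k, random_state=seed),
                                  random_state=seed),
    }

# Each sampler exposes fit_resample(X, y), e.g.:
# X_res, y_res = build_resamplers(k=5)["SMOTE"].fit_resample(X_train, y_train)
```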

Feature selection

For the selection of relevant features, four often-used feature selection methods were employed38: analysis of variance (ANOVA), Bhattacharyya scores, extra trees (ET), and the least absolute shrinkage and selection operator (LASSO). Being filter methods, each of them scored the features according to their estimated relevance; the highest-scoring features were then retained, based on a choice of how many features to include. Here, the number of selected features was chosen on a logarithmic scale, N = 1, 2, 4, …, 32, 64. This approach allowed for an efficient exploration while keeping the computational complexity low.
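For instance, the ANOVA variant of this scheme could be sketched with scikit-learn's SelectKBest; the other scoring functions would replace f_classif. This is an assumption about a possible implementation, not the study's exact code:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Candidate numbers of features on a logarithmic scale: 1, 2, 4, ..., 64
N_FEATURES_GRID = [2 ** i for i in range(7)]  # [1, 2, 4, 8, 16, 32, 64]

def select_features(X_train, y_train, n_features):
    """Score features with the ANOVA F-test and keep the n_features highest-scoring ones."""
    selector = SelectKBest(score_func=f_classif, k=n_features).fit(X_train, y_train)
    return selector.get_support(indices=True)  # column indices of the selected features

# Example: indices of the 16 top-ranked features on a (resampled) training fold
# selected = select_features(X_res, y_res, n_features=16)
```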

Classifiers

Models were trained using often-used classifiers39: k-nearest neighbors (k-NN), logistic regression (LR), naive Bayes, random forest (RF), and a kernelized SVM (RBF-SVM). Some of these methods have hyperparameters; for example, in the case of the RBF-SVM, it is known that its performance depends strongly on the choice of the regularization parameter C40. This parameter was therefore optimized using a simple grid search on the training data41,42, during which it was selected from 2^-10, 2^-8, …, 2^-1, 1, 2^2, …, 2^8, 2^10. The kernel width γ of the RBF-SVM was set to the inverse of the mean distance between any two samples. For the RF, the number of trees was set to 250. The neighborhood size of the k-NN was chosen among 1, 3, 5, 7, and 9. Finally, the regularization parameter of the logistic regression was also chosen from 2^-10, 2^-8, …, 2^-1, 1, 2^2, …, 2^8, 2^10. Other parameters were left at their default values.
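A sketch of how such a grid search could be set up for the RBF-SVM with scikit-learn, following the description above; the inner cross-validation (cv=5), the use of probability estimates, and the variable names are assumptions for illustration:

```python
from scipy.spatial.distance import pdist
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Regularization grid: the powers of two listed above
C_GRID = [2.0 ** e for e in range(-10, 0)] + [1.0] + [2.0 ** e for e in range(2, 11)]

def tune_rbf_svm(X_train, y_train):
    """Grid-search C for an RBF-SVM; gamma is set to the inverse mean pairwise distance."""
    gamma = 1.0 / pdist(X_train).mean()
    search = GridSearchCV(SVC(kernel="rbf", gamma=gamma, probability=True),
                          param_grid={"C": C_GRID},
                          scoring="roc_auc", cv=5)
    return search.fit(X_train, y_train).best_estimator_
```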

Training

The evaluation followed the standard radiomics pipeline43,44 and was performed using a fivefold stratified cross-validation (CV) with 30 repeats (Fig. 6). Stratification was employed to ensure that the original class balance of the data was kept in the test folds as well. In each repeat, the data was first split into five folds. In turn, each fold was left out once for validation, while the other four folds were used as the training set. The training set was then resampled using one of the resampling methods, and a feature selection method and a classifier were subsequently applied to the resulting data. The final model was then evaluated on the validation fold, i.e., the previously selected features were first extracted from the validation fold, and the classifier then made its predictions.
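The overall scheme could be sketched as follows with an imbalanced-learn pipeline, which guarantees that resampling and feature selection are fitted on the training folds only. This is a simplified sketch of one pipeline combination under the stated CV scheme, not the study's exact code, and X and y are assumed to hold the preprocessed features and labels of one dataset:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# One concrete combination: SMOTE -> ANOVA feature selection -> logistic regression
pipeline = Pipeline([
    ("resample", SMOTE(k_neighbors=5, random_state=0)),   # applied to training folds only
    ("select", SelectKBest(score_func=f_classif, k=16)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=30, random_state=0)
scores = cross_validate(pipeline, X, y, cv=cv, scoring="roc_auc")
mean_auc = scores["test_score"].mean()
```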

Figure 6

Flow chart of the experiments.

Predictive performance

Since the primary focus in radiomics is obtaining accurate predictions, the macro-averaged area under the receiver operating characteristic curve (AUC) over the five CV validation folds was used to identify the best-performing model. Only the best-performing models were analyzed further; models performing worse were discarded. In addition, the sensitivity and the specificity of the models were computed as secondary metrics.
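For reference, the per-fold metrics can be derived as in this sketch, with sensitivity being the true-positive rate and specificity the true-negative rate; the function name is illustrative:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def fold_metrics(y_true, y_pred, y_score):
    """AUC, sensitivity, and specificity for one validation fold."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    auc = roc_auc_score(y_true, y_score)
    return auc, sensitivity, specificity
```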

Feature agreement and similarity

The agreement of the selected features was compared pairwise between all resampling methods over each training fold of the CV. We used the Jaccard index, also called intersection-over-union, to measure agreement. Since no universal metric exists, we also employed the Ochiai index45.
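Both indices are simple set measures; for two sets of selected feature names they could be computed as follows (a minimal sketch):

```python
import math

def jaccard_index(a, b):
    """Intersection-over-union of two sets of selected features."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def ochiai_index(a, b):
    """Intersection normalized by the geometric mean of the set sizes."""
    a, b = set(a), set(b)
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0
```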

Since radiomic datasets are known to be highly correlated16, two sets of features might look vastly different although they describe similar information. Therefore, we also computed the similarity between the sets of selected features. It is calculated roughly as follows: first, for each feature in one set, the feature with the highest correlation in the other set is identified; the (symmetrized) mean over all these correlations is then defined as the similarity. More information can be found in Additional file 1. Since this is an ad-hoc metric, we also computed the Zucknick measure46, which can be understood as a correlation-corrected version of the Jaccard index45.
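A possible implementation of this similarity, following the rough description above (it may differ in details from the exact definition in Additional file 1; names and the use of absolute Pearson correlations are assumptions):

```python
import pandas as pd

def feature_similarity(X: pd.DataFrame, features_a, features_b):
    """Symmetrized mean of the maximum absolute correlations between two feature sets.

    X          : DataFrame with all features of the dataset
    features_a : feature names selected under one resampling method
    features_b : feature names selected under another resampling method
    """
    corr = X[list(set(features_a) | set(features_b))].corr().abs()
    # For each feature in A, the most correlated feature in B, and vice versa
    a_to_b = corr.loc[features_a, features_b].max(axis=1).mean()
    b_to_a = corr.loc[features_b, features_a].max(axis=1).mean()
    return (a_to_b + b_to_a) / 2.0
```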

Software

Experiments were implemented using Python 3.10. The resampling methods were taken from the imbalanced-learn package (version 0.10.0)47. Our code repository can be found on GitHub, where, apart from the results, a complete list of the dependencies and software versions used in this study is also available.

Statistics

Descriptive statistics were reported using mean and standard deviation. P-values below 0.05 were considered statistically significant. All statistics were computed using Python 3.10. Resampling methods were compared using a Friedman test and a post hoc Nemenyi test48. The Friedman test was preferred over ANOVA since it is non-parametric and thus makes fewer assumptions about the data. Since the Friedman test only tests the hypothesis of whether there are any differences between the methods, a pairwise post hoc Nemenyi test was employed to determine where the differences lie. The Nemenyi test can be understood as the non-parametric equivalent of the Tukey test usually employed after ANOVA48.
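Both tests are available in standard Python packages; a minimal sketch, assuming the per-dataset AUCs are arranged in a DataFrame with datasets as rows and resampling methods as columns, and that the scikit-posthocs package is installed:

```python
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# auc_table: DataFrame of shape (n_datasets, n_resampling_methods) holding the best AUCs
stat, p_value = friedmanchisquare(*[auc_table[col] for col in auc_table.columns])
print(f"Friedman test: chi2 = {stat:.2f}, p = {p_value:.4f}")

# Pairwise post hoc Nemenyi test, returning a matrix of pairwise p-values
nemenyi_p = sp.posthoc_nemenyi_friedman(auc_table)
```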

Ethics approval and consent to participate

This is a retrospective study using only previously published and publicly accessible datasets. The ethical approval for this study was waived by the local Ethics Committee (Ethik-Kommission, Medizinische Fakultät der Universität Duisburg-Essen, Germany) due to its retrospective nature.