Machine learning for medical imaging: methodological failures and recommendations for the future

Research in computer analysis of medical images holds great promise for improving patients’ health. However, a number of systematic challenges are slowing down the progress of the field, from limitations of the data, such as biases, to research incentives, such as optimizing for publication. In this paper we review roadblocks to developing and assessing methods. Building our analysis on evidence from the literature and from data challenges, we show that at every step, potential biases can creep in. On a positive note, we also discuss ongoing efforts to counteract these problems. Finally, we provide recommendations on how to further address these problems in the future.

What (not) to do, via an illustrative example
Suppose we are interested in detecting cancer from lung images. Given a dataset of healthy and cancerous images, and a performance metric of interest, such as accuracy, how can we design a statistically sound evaluation of a classifier? There are several different underlying questions that call for different methods.

Evaluating a prediction rule
The first question that we might be interested in is: given a prediction rule, how well does it perform? The prediction rule can be independent of the images, or it can come from the output of a classifier trained on the data.
In both settings, evidence for clinical application of the prediction rule, for instance as required by regulatory agencies, calls for statistical evaluation.
To evaluate the prediction rule we can use confidence intervals or null-hypothesis tests. For this we need test data, which in machine learning is often a held-out part of the existing dataset, or new data (external validation). The size of the test set then determines the statistical power: the confidence interval on the measured prediction performance and the effect size that can be detected. The test set should be large enough: too small a test set leads to large error bars on the estimated prediction performance. 1 Riley et al. 2 give recommendations on minimum sample sizes for various performance metrics.
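The effect of test-set size on error bars can be made concrete with a back-of-the-envelope sketch, using the normal (Wald) approximation to the binomial confidence interval for accuracy. The helper name and the numbers below are illustrative; for formal sample-size planning, refer to the cited recommendations.

```python
import math

def accuracy_ci(accuracy, n_test, z=1.96):
    """95% normal-approximation (Wald) confidence interval for an
    accuracy measured on n_test independent test images."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n_test)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

# Error bars shrink only with the square root of the test-set size:
for n in (30, 100, 1000):
    low, high = accuracy_ci(0.80, n)
    print(f"n = {n:4d}: accuracy 0.80, 95% CI [{low:.3f}, {high:.3f}]")
```

With 30 test images, an observed accuracy of 0.80 is compatible with a true accuracy anywhere from roughly 0.66 to 0.94, which illustrates why conclusions drawn from very small test sets are fragile.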

Evaluation of a machine-learning procedure
Another question we might be interested in is evaluating a machine-learning procedure. Unlike a prediction rule, by machine-learning procedure we refer to the full process of starting from training data, extracting a prediction rule, and using it to classify test images as healthy or cancerous. This question is often of interest in machine-learning research, or if we want to retrain an existing prediction rule on new data. Here we need different evaluation techniques, because the machine-learning procedure has several uncontrolled sources of variance, such as the training set or the random initialization. 3 In machine-learning research, conclusions about a given procedure should not be driven by the choice of a particularly favorable training set if we cannot expect similar performance when using new training data. Conversely, for clinical applications, it is safer to evaluate the already-trained algorithm that will be used as the prediction rule in practice, to rule out the possibility of poor performance when the algorithm is retrained on new training data.
Given our dataset of lung images, a good evaluation of a learning procedure requires repeatedly sampling different training and testing data, as in a cross-validation loop, as well as varying the other sources of variance. Due to the flexibility of machine-learning classifiers, it is hard to derive closed-form expressions for confidence intervals or p-values that account for all the sources of variability. Instead, we can estimate the distribution of performance scores by repeating the experiment under such variations and deduce confidence intervals from that distribution. 3 Note that standard statistical tests (such as the t-test) cannot be used across cross-validation folds, as the folds are not independent samples. 1
Sample size is an important factor in the success of prediction studies, both for the training data and the testing data. To evaluate how much the amount of data impacts the prediction performance, we can use learning curves, 1 where we vary the training set size, evaluate the trained classifier on the test set, and plot the performance metric as a function of the training set size. If the curve flattens, we might conclude that adding more training data will not improve performance. We might also observe that less flexible classifiers (such as linear models) outperform more flexible classifiers (such as neural networks) when the training set is small, but that the situation reverses as more training data is added.
To give a concrete example, we refer to the results of a machine-learning paper by one of the authors, 4 where classifiers are evaluated on non-medical datasets, but with dataset sizes and evaluation metrics similar to those often used in medical imaging. We refer to Fig. 7 of ref. 4 , which is not reproduced here for copyright reasons. This figure shows several panels, each corresponding to one benchmark dataset.
Each panel shows a learning curve with the training set size on the x-axis and the area under the curve (AUC, higher is better) on the y-axis, for seven different classifiers. We see that the AUC increases with the training set size, but that the slopes differ between classifiers. For example, on the "Musk1" dataset, the classifier "minimax libsvc" starts out as the worst classifier but is among the best at larger training sizes. Ideally, this plot would also have included error bars on the performances.
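Such a learning-curve experiment, including the repeated train/test resampling discussed above, can be sketched with scikit-learn. The dataset here is a synthetic stand-in (all sizes and parameters are illustrative, not those of the cited study):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in for a medical-imaging feature dataset (illustrative).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Repeatedly resample train/test splits to capture variance across splits.
cv = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=cv, scoring="roc_auc")

# Mean test AUC with error bars (std across the 20 splits) per train size.
for n, scores in zip(sizes, test_scores):
    print(f"train size {n:3d}: AUC = {scores.mean():.3f} ± {scores.std():.3f}")
```

Reporting the standard deviation across splits, rather than a single score, is what makes the flattening (or not) of the curve interpretable against the evaluation noise.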

Comparing machine-learning procedures
In machine-learning research we might want to show that a classifier is better than one or more competing classifiers. The question is then whether the difference in the observed performance metrics is due to chance.
Given a particular dataset and a classifier (a learning procedure), cross-validation can give an estimate of the expected performance and its distribution. But we cannot yet conclude that our classifier is better than another classifier for detecting lung cancer in images in general: for that, we would need to evaluate the classifier on other, independent datasets.
In this scenario, we can compare the ranks of classifiers on multiple independent datasets to conclude that one classifier is generally better than another, as recommended in ref. 5 (though with caveats pointed out by the same author 6 ). Based on the number of datasets (the samples) and the number of classifiers, we can test whether the differences in average classifier ranks are due to chance. If not, we can use a post-hoc test to find the critical difference: the minimum difference in ranks that two classifiers need to have in order to be considered significantly different. The critical difference decreases with the number of datasets, but increases with the number of classifiers.
Table 1 illustrates this evaluation procedure with the Friedman test recommended in ref. 5 . Since the null hypothesis (that the differences in ranks are overall due to chance) is rejected, the critical difference is calculated; for 14 datasets and six classifiers it equals 2.0153. From these results we could conclude that although MInD is the classifier with the lowest (best) average rank, 1.7857, MILES and Minimax are not significantly different from it, because their ranks lie within the critical difference of 1.7857.
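The critical difference quoted above follows from a closed-form expression for the Nemenyi post-hoc test; a minimal sketch (2.850 is the standard critical value q_alpha for six classifiers at alpha = 0.05):

```python
import math

# Critical difference of the Nemenyi post-hoc test, used after a Friedman
# test over N datasets and k classifiers: CD = q_alpha * sqrt(k(k+1) / (6N)).
def critical_difference(n_datasets, n_classifiers, q_alpha):
    return q_alpha * math.sqrt(
        n_classifiers * (n_classifiers + 1) / (6 * n_datasets))

# For 14 datasets and 6 classifiers at alpha = 0.05 (q_alpha = 2.850):
cd = critical_difference(n_datasets=14, n_classifiers=6, q_alpha=2.850)
print(f"critical difference = {cd:.4f}")  # reproduces the 2.0153 above
```

The formula also makes the two trends explicit: the critical difference shrinks as sqrt(1/N) with more datasets, and grows roughly linearly with the number of classifiers compared.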

Brain imaging biomarkers meta-analysis
While the sample size of studies increases with time, there is wide variability. We run a multivariable regression analysis to separate out the effects of study sample size and publication date on reported prediction accuracy. Table 2 gives the corresponding estimated normalized coefficients, confidence intervals, and p-values; it confirms this pattern.
Table 1: Area under the curve (AUC) and standard error (×100), 5 × 10-fold cross-validation for 14 datasets and 6 classifiers. The last row shows the classifier ranks from the Friedman test, for which the critical difference is 2.0153. Classifiers in bold are best, or not significantly worse than the best.

Literature popularity review methods
We give here the methodological details behind Fig. 2. To assess the relative popularity of studies on breast versus lung cancer in medical and AI research, we quantify the prevalence of these topics in the corresponding literature. For this, we use the Dimensions.AI app, 7 querying the titles and abstracts of papers with the following two queries:
• lung AND (tumor OR nodule) AND (scan OR image)
• breast AND (tumor OR nodule) AND (scan OR image)
We do this for two categories, which are the largest subcategories within the top-level categories "medical sciences" and "information computing":
• 1112 Oncology and Carcinogenesis
• 0801 Artificial Intelligence and Image Processing
We then normalize the number of papers per year by the total number of papers for the "cancer AND (scan OR image)" query in the respective category (1112 Oncology or 0801 AI).

Included Kaggle challenges
We selected 8 medical-imaging challenges from Kaggle, which allows efficient retrieval of public and private leaderboard scores. In July 2021, there were around 15 medical-imaging challenges available, from which we selected ours based on their varying focus (classification or segmentation) and incentives. Table 3 gives details on the challenges we use to compare performance gains to evaluation noise.
For each competition, we looked at the public and private leaderboards, extracting the following information:
• the differences d_i, defined as the difference in score of the i-th algorithm between the public and private leaderboards
• the distribution of the d_i per competition, with its mean and standard deviation
• the interval t_10, defined as the score difference between the best algorithm and the "top 10%" algorithm
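A minimal sketch of these leaderboard statistics, on hypothetical scores (all numbers below are invented for illustration; the real values come from the Kaggle leaderboards):

```python
import statistics

# Hypothetical public/private leaderboard scores for one competition,
# one entry per algorithm (illustrative numbers only).
public  = [0.912, 0.908, 0.905, 0.899, 0.897, 0.890, 0.884, 0.880, 0.861, 0.850]
private = [0.881, 0.889, 0.872, 0.901, 0.864, 0.878, 0.860, 0.871, 0.855, 0.842]

# d_i: difference of the i-th algorithm between the two leaderboards.
d = [pub - priv for pub, priv in zip(public, private)]
print(f"mean(d) = {statistics.mean(d):.4f}, std(d) = {statistics.stdev(d):.4f}")

# t_10: gap between the best private score and the score at the top 10%
# of the private leaderboard (with 10 entries, the 2nd-best score).
ranked = sorted(private, reverse=True)
t10 = ranked[0] - ranked[int(0.10 * len(ranked))]
print(f"t_10 = {t10:.4f}")
```

Comparing t_10 to the spread of the d_i is the point of the analysis: if the gap between top entries is smaller than the public-to-private score shift, the ranking among those entries is within evaluation noise.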