Robust performance of deep learning for distinguishing glioblastoma from single brain metastasis using radiomic features: model development and validation

We evaluated the diagnostic performance and generalizability of traditional machine learning and deep learning models for distinguishing glioblastoma from single brain metastasis using radiomics. The training and external validation cohorts comprised 166 patients (109 glioblastomas and 57 metastases) and 82 patients (50 glioblastomas and 32 metastases), respectively. A total of 265 radiomic features were extracted from semiautomatically segmented regions on contrast-enhancing and peritumoral T2 hyperintense masks and used as input data. Hyperparameters were optimized through tenfold cross-validation in the training cohort for a deep neural network (DNN) and for each of seven traditional machine learning classifiers combined with one of five feature selection methods. The diagnostic performance of the optimized models and of two neuroradiologists in distinguishing glioblastoma from metastasis was then tested in the external validation cohort. In the external validation, the DNN showed the highest diagnostic performance, with an area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy of 0.956 (95% confidence interval [CI], 0.918–0.990), 90.6% (95% CI, 80.5–100), 88.0% (95% CI, 79.0–97.0), and 89.0% (95% CI, 82.3–95.8), respectively, compared with the best-performing traditional machine learning model (adaptive boosting combined with tree-based feature selection; AUC, 0.890 [95% CI, 0.823–0.947]) and the human readers (AUC, 0.774 [95% CI, 0.685–0.852] and 0.904 [95% CI, 0.852–0.951]). These results demonstrate that deep learning using radiomic features can distinguish glioblastoma from single brain metastasis with good generalizability.

Features with zero total score were not included. kNN = k-nearest neighbor, NB = naïve Bayes, RF = random forest, ADA = adaptive boosting, L-SVM = linear support vector machine, R-SVM = radial basis function support vector machine, LDA = linear discriminant analysis, CE = contrast-enhancing, NS = not selected for this classifier.

F score 1
The F score is a univariate feature selection method. It is based on the F-test, which estimates the degree of linear dependence between two random variables. The aim is to find a subset of features such that, in the data space spanned by the selected features, the distances between data points in different classes are as large as possible, while the distances between data points in the same class are as small as possible.
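
As an illustration only, a minimal scikit-learn sketch of F-score selection; the synthetic data and the choice of 20 retained features are placeholders, not the study's settings:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Synthetic stand-in for the 265 radiomic features (training cohort size)
    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Score each feature with the ANOVA F-test and keep the top 20
    selector = SelectKBest(score_func=f_classif, k=20)
    X_selected = selector.fit_transform(X, y)
    print(X_selected.shape)  # (166, 20)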

MI (Mutual information) 2
The MI method measures arbitrary dependencies between random variables. It is suitable for assessing the information content of features in complex classification tasks, where methods based on linear relations are prone to mistakes.
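
A minimal sketch of MI-based selection with scikit-learn; the data and the number of retained features are illustrative placeholders:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Mutual information captures nonlinear dependencies that the F-test misses
    selector = SelectKBest(score_func=mutual_info_classif, k=20)
    X_selected = selector.fit_transform(X, y)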

RFE (Recursive feature elimination) 3
The RFE method selects features by recursively considering smaller and smaller sets of features. An estimator is trained on the initial set of features, and a weight is assigned to each; the features with the smallest absolute weights are then eliminated in a backward-elimination manner. This procedure is repeated recursively until the desired number of features is reached.
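
A minimal sketch of RFE wrapped around a linear-kernel SVM (a common choice of base estimator; the estimator, elimination step, and target feature count here are assumptions, not the study's configuration):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Repeatedly refit the estimator, dropping the 10 lowest-weight
    # features per iteration until 20 remain
    rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=20, step=10)
    rfe.fit(X, y)
    print(rfe.support_.sum())  # 20 features retained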

LASSO (Least absolute shrinkage and selection operator) 4
The LASSO performs two tasks: regularization and feature selection. It minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. Because of the nature of this constraint, it tends to produce some coefficients that are exactly zero and hence yields interpretable models.
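
A minimal sketch applying the LASSO to binary labels coded 0/1, one common way to use it for feature selection in classification; the penalty strength alpha is an arbitrary placeholder:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import Lasso

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # The L1 constraint shrinks many coefficients to exactly zero;
    # the nonzero coefficients mark the selected features
    lasso = Lasso(alpha=0.05).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)
    print(selected.size, "features retained")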

Tree-based 5
The tree-based algorithm uses a tree structure to model the relationships between the features and the potential outcomes. A standard CART algorithm selects, at each node, the split predictor that maximizes the split-criterion gain over all possible splits of all predictors. Finding the optimal tree size improves predictive accuracy by reducing overfitting.
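
A minimal sketch of tree-based selection via impurity importances from a random forest (the estimator, tree count, and threshold are assumptions for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Keep features whose impurity-based importance exceeds the mean importance
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    selector = SelectFromModel(forest, threshold="mean").fit(X, y)
    X_selected = selector.transform(X)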

kNN (k-nearest neighbor) 6
The kNN algorithm assigns an unclassified observation (incoming test sample) the class label of its nearest samples in the training set, as measured by a distance metric. The letter k denotes the number of nearest neighbors considered.
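
A minimal sketch mirroring the abstract's tenfold cross-validation, with synthetic placeholder data and an arbitrary k = 5:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Each test sample is labeled by majority vote of its 5 nearest neighbors
    clf = KNeighborsClassifier(n_neighbors=5)
    print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())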

NB (Naïve Bayes) 7
The NB model assumes that, given class membership, observations follow a multivariate distribution in which the predictors (features) composing each observation are conditionally independent.
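
A minimal Gaussian naïve Bayes sketch under the same tenfold scheme (the Gaussian variant is an assumption; the study's exact NB implementation is not specified here):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Each feature is modeled as class-conditionally Gaussian and independent
    clf = GaussianNB()
    print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())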

RF (Random forest) 8
The RF model uses an ensemble of trees, where each tree in the ensemble is grown in accordance with a random parameter, and final predictions are obtained by aggregating over the ensemble. Because the base constituents of the ensemble are tree-structured predictors, each constructed using an injection of randomness, such procedures are called random forests.
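
A minimal random forest sketch; the number of trees is a placeholder hyperparameter:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Each tree is grown on a bootstrap sample with random feature subsets;
    # predictions are aggregated across the ensemble
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())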

ADA (Adaptive boosting) 9
ADA is a general method for generating a strong classifier out of a set of weak classifiers. It works even when the weak classifiers come from a continuum of potential classifiers (such as neural networks, linear discriminants, etc.).
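
A minimal AdaBoost sketch using scikit-learn's default decision-stump weak learners (an assumption; the study's base classifier is not specified here):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Each round reweights misclassified samples so the next weak learner
    # focuses on them; the weighted vote forms the strong classifier
    clf = AdaBoostClassifier(n_estimators=100, random_state=0)
    print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())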

L-SVM (Support vector machine using linear kernel) 10
The L-SVM model constructs a hyperplane separating the data into two classes. The optimal hyperplane maximizes the margin around itself, which creates the boundaries for the positive and negative classes.
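
A minimal linear-kernel SVM sketch; the regularization constant C is an arbitrary placeholder, not a tuned value:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Fit a maximum-margin hyperplane separating the two classes
    clf = SVC(kernel="linear", C=1.0)
    print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())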

R-SVM (Support vector machine using radial basis function kernel) 10
The R-SVM model is a nonlinear version of the L-SVM that uses a Gaussian kernel function to implicitly project the original features onto a higher-dimensional space, via a nonlinear mapping, where the data may become linearly separable.
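
The same sketch with the radial basis function kernel; gamma="scale" is scikit-learn's default rather than a tuned hyperparameter:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # The Gaussian kernel implicitly maps the features to a
    # higher-dimensional space where a linear separator may exist
    clf = SVC(kernel="rbf", gamma="scale", C=1.0)
    print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())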

LDA (Linear discriminant analysis) 11
The LDA model searches for the vectors in the underlying space that best discriminate among classes (rather than those that best describe the data). Given a set of independent features describing the data, LDA creates the linear combination of these features that yields the largest mean differences between the desired classes. The goal is to maximize the between-class measure while minimizing the within-class measure.
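
A minimal LDA sketch under the same tenfold scheme:

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=166, n_features=265, random_state=0)

    # Projects the data onto the direction maximizing between-class
    # separation relative to within-class scatter
    clf = LinearDiscriminantAnalysis()
    print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())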
Supplementary Figure S8. Multi-input DNN implemented for deep learning
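
The figure itself is not reproduced here. For orientation only, a minimal Keras sketch of a multi-input DNN of the kind the figure depicts, assuming (hypothetically) that the contrast-enhancing and peritumoral T2 feature groups feed separate input branches; the feature split, layer widths, and dropout rate are placeholders, not the study's architecture:

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    # Hypothetical split of the 265 radiomic features between the two masks
    n_ce, n_t2 = 133, 132

    # One input branch per feature group
    ce_in = layers.Input(shape=(n_ce,), name="ce_features")
    t2_in = layers.Input(shape=(n_t2,), name="t2_features")
    ce = layers.Dense(64, activation="relu")(ce_in)
    t2 = layers.Dense(64, activation="relu")(t2_in)

    # Merge the two branches and classify glioblastoma vs. metastasis
    merged = layers.concatenate([ce, t2])
    hidden = layers.Dropout(0.5)(layers.Dense(32, activation="relu")(merged))
    output = layers.Dense(1, activation="sigmoid")(hidden)

    model = Model(inputs=[ce_in, t2_in], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])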