Predicting stereotactic radiosurgery outcomes with multi-observer qualitative appearance labelling versus MRI radiomics

Qualitative observer-based and quantitative radiomics-based analyses of T1w contrast-enhanced magnetic resonance imaging (T1w-CE MRI) have both been shown to predict the outcomes of brain metastasis (BM) stereotactic radiosurgery (SRS). Comparison of these methods and interpretation of radiomics-based machine learning (ML) models remain limited. To address this need, we collected a dataset of n = 123 BMs from 99 patients including 12 clinical features, 107 pre-treatment T1w-CE MRI radiomic features, and BM post-SRS progression scores. A previously published outcome model using SRS dose prescription and five-way BM qualitative appearance scoring was evaluated. We found high qualitative scoring interobserver variability across five observers that negatively impacted the model's risk stratification. Radiomics-based ML models trained to replicate the qualitative scoring did so with high accuracy (bootstrap-corrected AUC = 0.84–0.94), but risk stratification using these replicated qualitative scores remained poor. Radiomics-based ML models trained to directly predict post-SRS progression offered enhanced risk stratification (Kaplan–Meier rank-sum p = 0.0003) compared to using qualitative appearance. The qualitative appearance scoring enabled interpretation of the progression radiomics-based ML model, with necrotic BMs and a subset of heterogeneous BMs predicted as being at high risk of post-SRS progression, in agreement with current radiobiological understanding. Our study's results show that while radiomics-based SRS outcome models outperform qualitative appearance analysis, qualitative appearance still provides critical insight into ML model operation.

Table S4.
Hyperparameters for the random decision forest model used. For hyperparameters that underwent optimization, the optimization domain and search transform are provided. Numerical domains are indicated with minimum and maximum values in square brackets. Hyperparameter optimization was performed using 50 iterations of Bayesian optimization with the expected-improvement-plus acquisition function; the AUC on the out-of-bag samples was used as the optimization objective function. For hyperparameters that were not optimized, their value and justification are provided. For further descriptions of the hyperparameters, see the documentation for the TreeBagger function provided in MATLAB R2019b.
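To illustrate what a search transform does during hyperparameter sampling, the following Python sketch draws random candidates from bounded domains, with an optional log transform. This is not the MATLAB Bayesian optimization used in the study; the hyperparameter names and domain values below are hypothetical stand-ins.

```python
import math
import random

def sample_hyperparameter(lo, hi, transform="none", rng=random):
    """Draw one candidate value from [lo, hi], optionally on a log scale.

    A log search transform samples uniformly in log-space, so that e.g.
    the sub-ranges 1-2 and 10-20 receive equal probability mass.
    """
    if transform == "log":
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return rng.uniform(lo, hi)

# Hypothetical domains (square brackets = [min, max], as in Table S4).
domains = {
    "MinLeafSize": (1, 20, "log"),
    "NumPredictorsToSample": (1, 107, "none"),
}

random.seed(0)
candidate = {name: sample_hyperparameter(lo, hi, tr)
             for name, (lo, hi, tr) in domains.items()}
```

A Bayesian optimizer would score each such candidate with the out-of-bag AUC and use the expected-improvement-plus criterion to choose where to sample next; the sketch only shows the domain/transform handling.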

Figure S2
. Schematic of the ML experiments and analysis associated with (a) the radiomic appearance experiments (as needed for Fig. 1b), (c) the radiomic progression experiment (as needed for Fig. 1c), (d) the radiomic and clinical progression experiment (as needed for Fig. 1d), and (b) the analysis performed for ML model interpretation using the results from (a) and (c). The "Bootstrapped Resampling Machine Learning Experiment" process blocks in (a), (b), and (d) each represent an instance of the ML experiment common template described in the article text under the heading "Machine learning experimental design" and shown schematically above in Fig. S1 of this supplementary document. The specific model inputs/outputs and experimental results associated with each experiment instance are shown in (a), (b), and (d), on which further analysis is performed. From each ML experiment instance, there are training and testing dataset prediction probabilities per bootstrapped resampling iteration. To get the "Average Prediction Probability per BM", the testing dataset probabilities are iterated through to find all instances in which a given BM was randomly chosen to be in the testing dataset for that bootstrapped iteration. The prediction probabilities for the BM from all these instances are then aggregated and their average taken. This process is then repeated for each BM in the entire dataset. A similar process is performed on the "Training Dataset Prediction Probabilities" when the "Choose…" process blocks are used, except in these cases the training dataset probabilities are just aggregated according to the "Choose…" process rule, with no average taken, allowing for the creation of an average ROC from the chosen training dataset probabilities instead. The specific details of the model interpretation analysis processes outlined in (b) are provided in the article text under the heading "Post-SRS progression machine learning model interpretation".
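The per-BM averaging of testing-dataset probabilities described above can be sketched in a few lines of Python. The data structure here (a list of per-iteration dicts mapping BM id to predicted probability) is a hypothetical stand-in for the experiment's stored results, not the study's actual code.

```python
from collections import defaultdict

def average_test_probability_per_bm(iterations):
    """Aggregate testing-dataset prediction probabilities across
    bootstrapped resampling iterations.

    `iterations` holds one dict per iteration, containing only the BMs
    that were randomly chosen for that iteration's testing dataset.
    """
    collected = defaultdict(list)
    for test_probs in iterations:
        for bm_id, prob in test_probs.items():
            collected[bm_id].append(prob)
    # Average over every iteration in which the BM landed in the test set.
    return {bm_id: sum(ps) / len(ps) for bm_id, ps in collected.items()}

# Toy example: BM 7 appears in the testing dataset of two iterations.
avg = average_test_probability_per_bm([{7: 0.8, 3: 0.2}, {7: 0.6}])
```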
Table S5.
Confusion matrices across appearance labels (A-E), for each pairwise comparison between observers. The highlighted cells indicate instances of agreement between observers, allowing for the calculation of the agreement rate when summed and divided by the total number of BMs (n = 123). The diagonal of the larger "observer-level" matrix contains only these highlighted cells, as each observer is in perfect agreement with themselves, so these values provide the number of times each appearance label was called by that observer. (a) shows disagreement percentages from all observers, while (b) is only from expert observers. A given value in a table, e.g. column "B", row "A", represents how many disagreements across all observer pairs occurred when one observer selected a certain appearance, e.g. "B" or "heterogeneous", and the other observer selected a certain alternate appearance, e.g. "A" or "homogeneous".
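The agreement-rate calculation described in the caption amounts to summing the diagonal of a pairwise confusion matrix and dividing by the total number of BMs. A minimal Python sketch, using illustrative matrix values rather than the study's:

```python
def agreement_rate(confusion, n_total):
    """Fraction of BMs on which two observers agreed.

    `confusion` is a square matrix (list of rows) over the appearance
    labels; the diagonal holds the counts where both observers chose
    the same label.
    """
    agreed = sum(confusion[i][i] for i in range(len(confusion)))
    return agreed / n_total

# Illustrative 3-label matrix: 20 + 25 + 15 = 60 agreements out of 100 BMs.
toy = [[20, 5, 0],
       [10, 25, 5],
       [5, 15, 15]]
rate = agreement_rate(toy, 100)  # 0.6
```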

Figure S3
. KM analysis plots for progressive disease for each individual observer for comparison against Expert 1 (see Fig. 2a). The risk group number for each risk curve is labelled on the right y-axis, and the number of BMs at risk per 3-month follow-up interval is given below each x-axis.
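A Kaplan-Meier curve like those plotted for each risk group can be computed with a short product-limit estimator. The sketch below is didactic Python (it processes tied times one at a time rather than grouping them), not the study's analysis code.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    `times` are follow-up times; `events` flags whether progression was
    observed (True) or the BM was censored (False) at that time.
    Returns (time, S(time)) pairs at each observed event time.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    for i in order:
        if events[i]:
            # Survival drops by the factor (1 - events / number at risk).
            surv *= 1.0 - 1.0 / at_risk
            curve.append((times[i], surv))
        at_risk -= 1  # Both events and censorings leave the risk set.
    return curve

# Toy data: progressions at 3, 6, and 9 months; one censoring at 6 months.
curve = kaplan_meier([3, 6, 6, 9], [True, True, False, True])
```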

Figure S1
. Schematic diagram of the machine learning experimental method used in the study. (a) shows the overall experimental method, while (b) provides enhanced detail of the model training to show how the out-of-bag samples from each tree in the random decision forest are used to produce the aggregated training dataset prediction probabilities. The colouring of the objects in the diagram illustrates the separation of the entire dataset into training and testing datasets, the isolation of objects derived from each of the datasets during feature filtering and hyperparameter optimization (to prevent overfitting), and the recombination of objects from the datasets only during model testing and error metrics calculation. The interfeature correlation filter used hierarchical clustering on the training dataset alone to determine groups of correlated features with a correlation coefficient > 0.8. For each group of correlated features, the feature most strongly correlated to the model output was retained. This filtering of features was then transferred to the testing dataset, ensuring the testing dataset did not inform the selected features. Depending on whether the intent of the machine learning experiment was to predict the probability of progression or to replicate an observer's qualitative appearance labelling, one of the six different model outputs would be provided, as shown. All experiments used the set of 107 radiomic features (Table S3), but only the radiomic and clinical progression experiment (Fig. 1d) included the set of 12 clinical features (Table S1). Table S4 provides further detail on the hyperparameter optimization, while Figure S2 shows how this machine learning experiment template was applied to produce the reported results and analysis.
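The correlation-filtering step can be illustrated with a simplified greedy variant: rank features by their correlation with the model output, then discard any feature correlated above the threshold with an already-kept feature. This is a Python sketch standing in for the hierarchical-clustering filter described above, with toy data, not the study's implementation.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

def correlation_filter(features, target, threshold=0.8):
    """Greedy stand-in for the clustering filter: among features that are
    correlated above `threshold`, keep the one most strongly correlated
    with the model output (`target`)."""
    ranked = sorted(features, key=lambda f: -abs(pearson(features[f], target)))
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

# "f2" duplicates "f1" (r = 1), so only "f1" and the uncorrelated "f3" survive.
features = {"f1": [1, 2, 3, 4], "f2": [2, 4, 6, 8], "f3": [4, 1, 3, 2]}
kept = correlation_filter(features, [1, 2, 3, 4])
```

As in the study's design, fitting such a filter on the training dataset alone and then applying the resulting feature selection to the testing dataset prevents the testing data from informing the selected features.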

Table S1 .
Clinical feature distributions for number of BMs, BMs progressing post-SRS, and patients (where applicable) for the study sample. The "Neurological Symptoms Corticosteroid Response"

Table S2 .
Number of patients and BMs scanned by each of the MR scanner models and acquisition parameter configurations. Siemens (Erlangen, Germany); General Electric (Chicago, USA)

Table S3 .
Complete catalogue of the 107 radiomic features included within the study. All features were computed on the pre-treatment T1w-CE MRI, with complete documentation of the features provided by the PyRadiomics project (https://pyradiomics.readthedocs.io/en/latest/features.html).

Table S6 .
Percentage of disagreements arising across observer pairs based upon qualitative appearances from each observer.
The stated p-values are from the log-rank test performed over all risk groups. Error metrics from the radiomic Expert 1 appearance experiments using the observer labels from the original study. As each appearance label, e.g. "homogeneous", had specific models trained to make a binary labelling decision, e.g. "homogeneous" or "not homogeneous", error metrics for each appearance label are presented. The error bars for the non-AUC 0.632+ error metrics represent the 95% confidence interval of each value determined from the 250 bootstrapped resampling iterations. ALE plots for the highly important features from Table 3 that were not included in Fig. 5, as these features were not also highly important for any of the radiomic appearance label experiments. The Pearson correlation coefficient values for each plot's pair of ALE curves are given in Table 3.
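The 95% confidence intervals taken from the 250 resampling iterations correspond to a percentile bootstrap: sort the replicate values of the metric and read off the 2.5th and 97.5th percentiles. A minimal sketch, with hypothetical replicate values rather than the study's:

```python
def percentile_ci(replicates, alpha=0.05):
    """Percentile bootstrap confidence interval for an error metric.

    `replicates` are the metric values from the bootstrapped resampling
    iterations; returns the (lower, upper) bounds of the (1 - alpha) CI.
    """
    ordered = sorted(replicates)
    lo = ordered[int((alpha / 2) * (len(ordered) - 1))]
    hi = ordered[int((1 - alpha / 2) * (len(ordered) - 1))]
    return lo, hi

# Toy replicates 0.00, 0.01, ..., 0.99 standing in for 250 iterations.
lo, hi = percentile_ci([i / 100 for i in range(100)])
```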