Machine Learning methods for Quantitative Radiomic Biomarkers

Radiomics extracts and mines large number of medical imaging features quantifying tumor phenotypic characteristics. Highly accurate and reliable machine-learning approaches can drive the success of radiomic applications in clinical care. In this radiomic study, fourteen feature selection methods and twelve classification methods were examined in terms of their performance and stability for predicting overall survival. A total of 440 radiomic features were extracted from pre-treatment computed tomography (CT) images of 464 lung cancer patients. To ensure the unbiased evaluation of different machine-learning methods, publicly available implementations along with reported parameter configurations were used. Furthermore, we used two independent radiomic cohorts for training (n = 310 patients) and validation (n = 154 patients). We identified that Wilcoxon test based feature selection method WLCX (stability = 0.84 ± 0.05, AUC = 0.65 ± 0.02) and a classification method random forest RF (RSD = 3.52%, AUC = 0.66 ± 0.03) had highest prognostic performance with high stability against data perturbation. Our variability analysis indicated that the choice of classification method is the most dominant source of performance variation (34.21% of total variance). Identification of optimal machine-learning methods for radiomic applications is a crucial step towards stable and clinically relevant radiomic biomarkers, providing a non-invasive way of quantifying and monitoring tumor-phenotypic characteristics in clinical practice.


Supplementary A: Datasets
This supplementary information contains the detailed description of the used datasets. It should be noted that the datasets has been previously used and described by Aerts et. al 1 . In order to enhance the redability of this manuscript, here we have reproduced the datasets description from the previous study of Aerts et. al 1

Patient population
Four hundred and twenty-two consecutive patients were included (132 women and 290 men), with inoperable, histologic or cytologic confirmed NSCLC, UICC stages I-IIIb, treated with radical radiotherapy alone (n = 196) or with chemo-radiation (n = 226). Mean age was 67,5 years (range: 33-91 years). The institutional review board approved the study. All research was carried out in accordance with Dutch law.

Treatment
During the study period, induction chemotherapy was standard of care for patients with N2/N3 and T4 tumors and consisted of three courses of gemcitabine (1,250 mg/m2 on days 1 and 8) in combination with cisplatin (75 mg/m2) or carboplatin (area under the concentration-time curve [AUC] 5) on day 1. Cycles were repeated every 21 days, and standard dose-reduction rules were applied. An interval between chemotherapy and start of radiotherapy of at minimum 14 days was mandatory.
All patients received an FDG PET-CT scan for radiotherapy treatment planning, in radiotherapy position on a dedicated PET-CT simulator with both arms above the head. For the FDG PET-CT scans a Siemens Biograph (SOMATOM Sensation-16 with an ECAT ACCEL PET scanner) was used. An intravenous injection of (weight * 4 + 20) MBq FDG (Tyco Health Care, Amsterdam, The Netherlands) was followed by 10 ml physiologic saline. After a 45-min uptake period, during which the patient was encouraged to rest, PET and CT images were acquired. A spiral CT (3 mm slice thickness) with or without intravenous contrast was performed covering the complete thoracic region.

Radiotherapy planning was performed on a XiO (Computerized Medical Systems, St
Louis, Missouri) treatment planning system, based on a convolution algorithm using inhomogeneity corrections.
Delineation based on fused PET-CT images was performed by the radiation oncologist by using a standard clinical delineation protocol. The protocol included fixed window level settings of both CT (lung W1700; L-300, mediastinum W600; L40) and PET scan (W30000; L15000) to be used for delineation. For all patients, a gross tumor volume (GTV) was defined based on FDG PET-CT data.
For patients treated with radical radiotherapy, the radiation dose was escalated to an individualized maximal total tumor dose, applying a mean lung dose of 19 Gy while respecting a maximum spinal cord dose of 54 Gy5. The maximal total tumor dose allowed was 79.2 Gy. There were no esophageal dose constraints. Radiotherapy was delivered twice a day in fractions of 1.8 Gy, 5 days per week, with a minimum of 8 h 27 between the two fractions. This protocol was applied as well in patients that received sequential chemo-radiation (n = 104).
Patients that received concurrent chemo-radiation (n = 100), were treated following 2 cycles of carboplatin-gemcitabine, a radiation dose of 45 Gy, in fractions of 1.5 Gy delivered twice a day for the first course, directly followed by an individualized dose ranging from 6 -24 Gy and delivered in 2.0 Gy fractions once a day. In all patients, individualized patient dosimetry using electronic portal imaging devices was performed.

Patient population
This dataset included 225 consecutive patients with confirmed NSCLC (mean age, 65.5 years; range, 36-86 years), stages (I-IVa), treated at the Radboud University Nijmegen Medical Centre, The Netherlands, between February 2004 and October 2011.

Treatment
All primary tumors and the mediastinal N2 disease were cytologically or histologically proven. All patients underwent diagnostic work-up, including contrast enhanced CT of the thorax and upper abdomen, whole body 18F-FDG-PET/CT, MRI of the brain, bronchoscopy with transbronchial needle aspiration (TBNA), and/or oesophageal ultrasound fine needle aspiration (EUS-FNA) and/or endobronchial ultrasound with TBNA (EBUS-TBNA) and mediastinoscopy in case of PET-positive, cytologically negative mediastinal lymph nodes. After work up, all patients were discussed in a thoracic oncology multidisciplinary board. Prior to radiotherapy a CT of the thorax was performed in radiotherapy position for radiotherapy planning.
Patients in good general condition were treated with concurrent chemo radiotherapy, those with a contraindication for chemotherapy were treated by radiation alone, and all remaining patients were treated with a sequential chemotherapy and radiotherapy.

Supplementary B: Feature Selection Methods
In literature, feature selection methods are mainly divided into three categories: filter methods, wrapper methods and embedded methods. Wrapper and embedded methods are classifier dependent approaches, whereas filter methods are classifier independent. Wrapper methods are basically the search methods, which search through the whole feature space and identify a relevant and non-redundant feature subset. Training/validation accuracy of a particular classifier (or model) is used as a measure of utility for the candidate feature subset. These computationally expensive methods may produce feature subsets that are overly specific to the classifiers and hence has low generalizability. Embedded methods incorporate feature selection as a part of training process and are computationally efficient as compared to the wrappers. However, they still use a quite strict model (classifier) structure assumption and hence lacks in the generalizability. On contrary, classifierindependent filter methods are the simple feature ranking methods based on some heuristic scoring criterion. Filters are computationally efficient and they have high generalizability and scalability. Therefore, here in this study we only used some popular filter based approaches for feature selection.
The defining component of filter based feature selection methods is the scoring/selection criterion, which is often known as 'relevance index'. All filter based feature selection methods can be divided into two categories: univariate methods and multivariate methods. In case of univariate methods, the scoring criterion only considers the relevancy of features ignoring the feature redundancy, whereas multivariate methods investigate the multivariate interaction within features and the scoring criterion is a weighted sum of feature relevancy and redundancy. We formulated the feature selection problem as defined by brown et al 2 .
Let J be the scoring criterion (relevance index), Y be the class labels, X be the set of all features, X k be the feature to be evaluated and S be the set of already selected features.

Fisher score (FSCR)
Fisher score 3 based feature selection method selects features such that the between class distance is maximized and the within class distance is minimized. The scoring criterion is defined as Where ! is the overall mean of feature ! , ! is the number of samples in m'th class, and !,! and !,! ! is the mean and variance of feature ! on m'th class.

Relief (RELF)
Relief 4 assumes p randomly sampled data instances and defines the scoring criterion Where !,! is the value of instance ! on feature ! , !" ! ! ,! and !" ! ! ,! are the values on the k'th feature of the nearest point to ! with the same and different class label respectively, and d(.) denotes the distance. Here, we used p=50.

T-test (t-score) (TSCR)
T-test based feature selection evaluates a feature using a t-score, which is defined where ! , ! and ! ! , ! ! are the means and variances of the two classes on feature ! , whereas ! and ! correspond to the cardinality of the two classes.

Chi-square (CHSQ)
Chi-square score for a feature with r different values is defined as Where !" is the number of samples with I'th feature value in m'th class and here, ! * is the number of samples with i'th feature value, * ! is the number of samples in class m and is the number of samples.

Wilcoxon (WLCX)
Willcoxon is a non-parametric method based on ranks for the comparison of the population medians of the two classes. The scoring function is defined as Where N is the total number of samples, ! is the number of samples in class m, !" is the rank of sample i in class m, ! is the average rank of samples belonging to class m, and is the average rank of all samples.

Gini index (GINI)
For gini index, the scoring criterion is defined as Where ! is the conditional probability of a class m given the feature ! .
Smaller the values of gini index correspond to higher feature relevance.

Mutual information maximization (MIM)
Mutual information maximization 5 uses an information theory to measure the relevance of a feature. The scoring criterion is defined as a mutual information between a feature and class labels. It is given as Multivariate Feature Selection methods

Mutual information feature selection (MIFS)
Battiti et. al 6 proposed this multivariate feature selection method, which tries evaluate features based on their relevance with the class labels and penalizes the feature redundancy. The scoring criterion is a weighted sum of feature relevancy and redundancy and is given by Here the first term ! ; is the mutual information between the feature ! and class labels, which indicates the feature relevancy.

Second term
corresponds to the feature redundancy. So a feature is only going to get the high score if it is highly relevant to the class labels and also non-redundant to the set of already selected features . is the configurable parameter, which must be set experimentally. Battiti et. al. 6 experimentally found that = 1 is often optimal.

Minimum redundancy maximum relevance (MRMR)
As similar to MIFS, minimum redundancy maximum relevance (MRMR) 7 also tries to evaluate feature using relevancy-redundancy tradeoff. Here the configurable parameter is set as the cardinality of the set of selected features. Hence, the scoring criterion is defined as

Conditional infomax feature extraction (CIFE)
Conditional infomax feature extraction (CIFE) 8 also tries to optimize relevancyredundancy trade off. In the case of cife, the penalty term is added by one more term that is called as conditional redundancy. This term has an opposite sing to the penalty (redundancy) term, which indicates that correlated features will be given high score if they have strong class conditional dependence in a combined manner.

Joint mutual information (JMI)
In the case of joint mutual information (JMI) 9 , the scoring criterion is the mutual information between the class labels and the joint random variable ! ! and it given by
Here, basically the scoring criterion is the mutual information between the candidate feature ! and class labels conditioned on the set of already selected features .

Interaction capping (ICAP)
As similar to CMIM, interaction capping (ICAP) 11 also has a non-linear scoring criterion that is defined as It can be observed from the equation that the penalty will be lower if the candidate feature ! has strong pairwise class conditional dependence with the set of already selected features.

Double input symmetric relevance (DISR)
Double input symmetric relevance (DISR) 12 is the modification of the joint mutual information criterion. Here the joint mutual information is normalized with a joint entropy term. The criterion is defined as Publicly available Matlab implementations 2,13 were used for the implementations of feature selection methods. For further understanding of the theoretical assumptions and relations between these feature selection methods, we encourage reader to refer 2,13 .

Supplementary C: Classification Methods
We different data sets. This study has reported that most of the best performing classifiers of their study were implemented using R and tuned using the "caret" package 15 . We therefore chose R and caret as an implementation framework for our classifiers. A brief overview about implementation details of the classifiers and the corresponding parameters is given below. For the detailed theoretical description, we encourage reader to refer the individual method.

Decision tree (DT)
A C5.0 decision tree based classification method was used in the analysis. C5.0 function of the "C50" package was used for creating classification trees with default parameter tuning under caret interface.

Boosting (BST)
A Boosting ensemble of C5.0 decision tree was created using the R package "C50".
Parameter tuning was carried out using caret interface. Number of boosting trials was varied in {1, 10, 20} with and without winnow.

Bayesian (BY)
R package "klaR" with default caret parameter tuning was used for the implementation of Naïve Bayes classifier.

Discriminant analysis (DA)
Flexible discriminant analysis is a non-linear extension of linear discriminant analysis. R package "mda" was used for the implementation. Parameter tuning was done using caret with the parameter nprune varing from 2:3:15 (2 to 15 with an increment of 3).

Bagging (BAG)
Bagging falls into the category of ensemble algorithms in machine learning. R package "ipred" was used for the implementation and default parameter tuning was done using caret interface.

Random forest (RF)
Random forest provides an improvement to bagging with a modification step of random sampling of predictors. R package "randomForest" with caret interface was used for the implementation. Parameter ntree was set to 500 and mtry was varied with values 2:3:29 (2 to 29 with an increment step of 3)

Neural network (Nnet)
Neural network, also known as multi layer perceptron is a non-linear classification model. It was implemented by a caret interface and R package "nnet" by tuning the size and weight decay parameter with values 1:2:9 and {0, 0.1, 0.01, 0.001, 0.0001} respectively.

Nearest neighbor (NN)
K-nearest neighbor was implemented using the "knn" R package and caret interface.
10 different values of number of neighbors 5:2:23 (5 to 23 with an increment step of 2) were used.

Partial least squares and principal component regression (PLSR)
We used mvr function in "pls" package to fit PLSR model. Parameter tuning was done using caret. 10 different values of the number of components 1:1:10 were used.

Generalized linear models (GLM)
A generalized linear model via penalized maximum likelihood was fitted using the glmnet function of the R package "glment" and the default parameter tuning with caret interface.

Multivariate adaptive regression splines (MARS)
An additive MARS model was built using the gcvEarth function of the R package "earth" with default parameter tuning using caret interface.

Color Key and Histogram
Count