Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features

Gliomas belong to a group of central nervous system tumors, and consist of various sub-regions. Gold standard labeling of these sub-regions in radiographic imaging is essential for both clinical and computational studies, including radiomic and radiogenomic analyses. Towards this end, we release segmentation labels and radiomic features for all pre-operative multimodal magnetic resonance imaging (MRI) (n=243) of the multi-institutional glioma collections of The Cancer Genome Atlas (TCGA), publicly available in The Cancer Imaging Archive (TCIA). Pre-operative scans were identified in both glioblastoma (TCGA-GBM, n=135) and low-grade-glioma (TCGA-LGG, n=108) collections via radiological assessment. The glioma sub-region labels were produced by an automated state-of-the-art method and manually revised by an expert board-certified neuroradiologist. An extensive panel of radiomic features was extracted based on the manually-revised labels. This set of labels and features should enable i) direct utilization of the TCGA/TCIA glioma collections towards repeatable, reproducible and comparative quantitative studies leading to new predictive, prognostic, and diagnostic assessments, as well as ii) performance evaluation of computer-aided segmentation methods, and comparison to our state-of-the-art method.

contrast (T1-Gd), T2, and T2-FLAIR (Fig. 1a). Specifically, we considered 135 and 108 pre-operative baseline scans from the TCIA-GBM and TCIA-LGG collections, respectively. Further detailed information on the diversity of the imaging sequences used for this study is included in Table 2 (available online only). This table covers the TCIA institutional identifier, patient information (i.e., age, sex, weight), scanner information (i.e., manufacturer, model, magnetic field strength, station name), as well as specific imaging volume information extracted from the dicom headers (i.e., modality name, series number, accession number, acquisition/study/series date, scan sequence, type, slice thickness, slice spacing, repetition time, echo time, inversion time, imaging frequency, flip angle, specific absorption rate, numbers of slices, pixel dimensions, acquisition matrix rows/columns).
It should be noted that the diversity of the available scans in NCI/NIH/TCIA is driven by the fact that TCIA collected all available scans for subjects whose tissue specimens had passed the quality evaluation of the NCI/NIH/TCGA program. Due to this collection being retrospective all the MRI scans are considered 'standard-of-care', without following any uniform imaging protocol.

Pre-processing
All pre-operative mMRI volumes were re-oriented to the LPS (left-posterior-superior) coordinate system (which is a requirement for GLISTRboost), co-registered to the same T1 anatomic template 61 using affine registration through the Oxford center for Functional MRI of the Brain (FMRIB) Linear Image Registration Tool (FLIRT) 62,63 of FMRIB Software Library (FSL) [64][65][66] , and resampled to 1 mm 3 voxel resolution (Fig. 1b). The volumes of all the modalities for each patient were then skull-stripped using the Brain Extraction Tool (BET) 67,68 from the FSL [64][65][66] (Fig. 1c). Subsequent skull-stripping, on cases that BET produced insufficient results, was performed using a novel automated method based on a multi atlas registration and label fusion framework 69 . The template library for this task consisted of 216 MRI scans and their brain masks. This library was then used for target specific template selection and subsequent registrations using an existing strategy of MUlti-atlas Segmentation utilizing Ensembles (MUSE) 70 . A final region-growing based processing step, guided by T2, was applied to obtain a brain mask that includes the intra-cranial CSF. The resulted volumes are the ones provided in [Data Citation 3 and Data Citation 4].
For producing the computer-aided segmentation labels, further preprocessing steps included the smoothing of all volumes using a low-level image processing method, namely Smallest Univalue Segment Assimilating Nucleus (SUSAN) 71 , in order to reduce high frequency intensity variations (i.e., noise) in regions of uniform intensity profile while preserving the underlying structure (Fig. 1d). The intensity histograms of all modalities of all patients were then matched 72 to the corresponding modality of a single reference patient, using the implemented version in ITK (HistogramMatchingImageFilter).
It should be noted that we did not use any non-parametric, non-uniform intensity normalization algorithm [73][74][75] to correct for intensity non-uniformities caused by the inhomogeneity of the scanner's magnetic field during image acquisition, as we observed that application of such algorithm obliterated the T2-FLAIR signal (Fig. 1e).

Segmentation labels of glioma sub-regions
Consistent with the BraTS challenge 56 the segmentation labels that we consider in the present study, and make available through TCIA [Data Citation 3 and Data Citation 4], delineate the enhancing part of the tumor core (ET), the non-enhancing part of the tumor core (NET), and the peritumoral edema (ED) (Fig. 2). The ET is described by areas that show hyper-intensity in T1-Gd when compared to T1, but also when compared to normal/healthy white matter (WM) in T1-Gd. Biologically, ET is felt to represent regions where there is leakage of contrast through a disrupted blood-brain barrier that is commonly seen in high grade gliomas. The NET represents non-enhancing tumor regions, as well as transitional/prenecrotic and necrotic regions that belong to the non-enhancing part of the tumor core (TC), and are typically resected in addition to the ET. The appearance of the NET is typically hypo-intense in T1-Gd when compared to T1, but also when compared to normal/healthy WM in T1-Gd. Finally, the ED is described by hyper-intense signal on the T2-FLAIR volumes.

Computer-aided segmentation approach
The method used in this study to produce the computer-aided segmentation labels for all pre-operative scans of both TCGA-GBM and TCGA-LGG collections is named GLISTRboost 36,38 and it is based on a hybrid generative-discriminative model. The generative part incorporates a glioma growth model [58][59][60] , and is based on an Expectation-Maximization (EM) framework to segment the brain scans into tumor (i.e., ET, NET and ED), as well as healthy tissue labels (i.e., WM, gray matter, cerebrospinal fluid, vessels and cerebellum). The discriminative part is based on a gradient boosting 76,77 multi-class classification scheme, which was trained on BraTS'15 data (www.virtualskeleton.ch/BRATS/Start2015), to refine tumor labels based on information from multiple patients. Lastly, a Bayesian strategy 78 is employed to further refine and finalize the tumor segmentation based on patient-specific intensity statistics from the multiple modalities available. Example segmentation labels are illustrated in Fig. 2. GLISTRboost 36,38 is based on a modified version of the GLioma Image SegmenTation and Registration (GLISTR) 79 software. GLISTR jointly performs a) the registration of a healthy population probabilistic atlas to brain scans of patients with gliomas using a tumor growth model to account for mass effects, and b) the segmentation of such scans into healthy and tumor tissues. The whole framework of GLISTR is based on a probabilistic generative model that relies on EM, to recursively refine the estimates of the posteriors for all tissue labels, the deformable mapping to the atlas, and the parameters of the incorporated brain tumor growth model [58][59][60] . GLISTR was originally designed to tackle cases with solitary GBMs [79][80][81] , and subsequently extended to handle multifocal masses and tumors of complex shapes with heterogeneous texture 82 . Furthermore, the original version of GLISTR [79][80][81][82] was based on a single seed-point for each brain tissue label to represent its mean intensity value, while the variance was described by a fixed value for all labels. On the contrary, GLISTRboost incorporates multiple tissue seedpoints for each label, to model more accurately the intensity distribution, i.e., mean and variance, for each tissue class. Note that both GLISTR and GLISTRboost take into account only the intensity value of the initialization tissue seed-points on each modality, while they discard spatial information regarding the coordinate position of the respective points. As a consequence, even if the initialized tissue seed-points during two independent segmentation attempts have different coordinates, the output sets of segmentation labels should be identical, given that the modeled intensity distributions during these attempts are the same. In addition to the tissue seed-points, GLISTR and on that account GLISTRboost, requires the definition of a single seed-point and a radius for approximating the center and the bulk volume of each apparent tumor by a sphere. All these seed-points are initialized using the 'Cancer Imaging Phenomics Toolkit' (CaPTk) 83 (www.med.upenn.edu/sbia/captk.html), which has been primarily developed for this purpose, by the Center for Biomedical Image Computing and Analytics (CBICA) of the University of Pennsylvania. Given the tumor seed-point and radius for a tumor, a growth model is initiated by the parametric model of a sphere. This growth model is used to deform a healthy atlas into one with tumor and edema tissues matching the input scans, while approximating the deformations occurred to all brain tissues due to the mass effect of the tumors. A tumor shape prior is estimated by a random-walk-based generative model, which uses the tumor seed-points as initialization cues. This shape prior is systemically incorporated into the EM framework via an empirical Bayes model 82 . Furthermore, a minimum of three initialization seed-points is needed for each brain tissue label, in order to capture the intensity variation and model the intensity distribution across all modalities. Use of multiple seed-points improves the initialization of the EM framework, leading to more accurate segmentation labels, when compared to the single seed-point approach 82 . The output of GLISTR is a posterior probability map for each tissue label, as well as an integrative label map, which describes a very good 'initial' segmentation of all different tissues within a patient's brain.
This 'initial' segmentation is then refined by taking into account information from multiple patients via a discriminative machine-learning algorithm. Specifically, we used the gradient boosting algorithm 76 to perform voxel-level multi-label classification. Gradient boosting produces a prediction model by combining weak learners in an ensemble. We used decision trees of maximum depth 3 as 'weak learners', which were trained in a sub-sample of the training set, in order to introduce randomness 77 . The sampling rate was set equal to 0.6, while additional randomness was introduced by sampling stochastically a subset of imaging (i.e., radiomic) features at each node. The number of sampled features was set equal to the square root of the total number of features. The algorithm was terminated after 100 iterations.
The set of features used for training our model was extracted volumetrically and consists of i) intensity information, ii) image derivative, iii) geodesic information, iv) texture features, and v) the GLISTR posterior probability maps. The intensity information is summarized by the raw intensity value, I, of each image voxel, v i , at each modality, m, (i.e., I(v i m )), as well as by the respective differences among all four modalities, i.e., . The image derivative component consists of the Laplacian of Gaussians and the image gradient magnitude. Note that in order to ensure that the intensity-based features are comparable, intensity normalization was performed across subjects based on the median intensity value of the cerebrospinal fluid label, as provided by GLISTR. Geodesic information was used to introduce spatial context information. At any voxel v i we calculated the geodesic distance from the seed-point at voxel v s , which was used in GLISTR as the tumor center. The geodesic distance between v i and v s was estimated using the fast marching method 84,85 and by taking into account local image gradient magnitude 86 . Furthermore, we used texture features computed from a gray-level co-occurrence matrix (GLCM) 87 . Specifically, these texture features describe first-order statistics (i.e., mean and variance of each modality's intensities within a radius of 2 voxels for each voxel), as well as second-order statistics. To obtain the latter, the image volumes were firstly normalized to 64 different gray levels, and then a bounding box of 5-by-5-by-5 voxels was used for all the voxels of each image as a sliding window. Then, a GLCM was populated by taking into account the intensity values within a radius of 2 pixels and for the 26 main 3D directions to extract the energy, entropy, dissimilarity, homogeneity (i.e., inverse difference moment of order 2), and inverse difference moment of order 1. These features were computed for each direction and their average was used. To avoid overfitting, the gradient boosting machine was trained using simultaneously both LGG and GBM training data of BraTS'15, in a 54-fold cross-validation setting (allowing for using a one out of the 54 available LGGs of the BraTS'15 training data, within each fold).
Finally, the segmentation results were further refined for each patient separately, by assessing the local intensity distribution of the segmentation labels and updating their spatial configuration based on a probabilistic model 78 . The intensity distributions of the WM, ED, NET and ET, were populated separately using the corresponding voxels of posterior probability equal to 1, as given by GLISTR. Histogram normalization was then performed for the 3 pair-wise distributions considered; ED versus WM in T2-FLAIR, ET versus ED in T1-Gd, and ET versus NET in T1-Gd. Maximum likelihood estimation was used to model the class-conditional probability densities (Pr(I(v i )|Class) by a distinct Gaussian model for each class. In all pair-wise comparisons described before, the former tissue is expected to be brighter than the latter. Voxels of each class with spatial proximity smaller than 4 voxels to the voxels of the paired class, were evaluated by assessing their intensity I(v i ) and comparing the ('Pr(I(v i )|Class 1 ) with Pr(I(v i )|Class 2 ). The voxel v i was then classified into the tissue class with the larger conditional probability. This is equivalent to a classification based on Bayes' Theorem with equal priors for the two classes, i.e., Pr(Class 1 ) = Pr(Class 2 ) = 0.5.

Manual revision
The output of GLISTRboost segmentation is expected to yield labels for ET, NET, and ED. However, some gliomas, especially LGG, do not exhibit much contrast enhancement, or ED. Biologically, LGGs may have less blood-brain barrier disruption (leading to less leak of contrast during the scan), and may grow at a rate slow enough to avoid significant edema formation, which results from rapid disruption, irritation, and infiltration of normal brain parenchyma by tumor cells. As such, manual revision of the segmentation labels was performed, particularly for LGG cases lacking ET or ED regions. Specifically, after taking all the above into consideration, in scans of LGGs without an apparent ET area we consider only the NET and ED labels (Fig. 3a,d), whereas in LGG scans without ET and without obvious texture differences across modalities we consider only the NET label, allowing for distinguishing between normal and abnormal brain tissue (Fig. 3e). The difficulty in calculating the accurate boundaries between tumor and healthy tissue in the operating room is reflected in the segmentation labels as well; there is high uncertainty among neurosurgeons, neuroradiologists, and imaging scientists in delineating these boundaries. Therefore, small regions within the segmented labels that were ambiguous of their exact classification, were left as segmented by GLISTRboost.
Manual revisions/corrections applied in the computer-aided segmentation labels include: i) obvious under-or over-segmented ED/ET/NET regions (Fig. 3d-g), ii) voxels classified as ED within the tumor core (Fig. 3b,c,g), iii) unclassified voxels within the tumor core ( Fig. 3c-g), iv) voxels classified as NET outside the tumor core. Contralateral and periventricular regions of T2-FLAIR hyper-intensity were excluded from the ED region (Fig. 3c,f), unless they were contiguous with peritumoral ED (Fig. 3g-addition of apparent contralateral ED), as these areas are generally considered to represent chronic microvascular changes, or age-associated demyelination, rather than tumor infiltration 88 .
These radiomic features are provided on an 'as-is' basis, and are distinct from the panel of features used in GLISTRboost. The biological significance of these individual radiomic features remains unknown, but we include them here to facilitate research on their association with molecular markers, clinical outcomes, treatment responses, and other endpoints, by researchers without sufficient computational background to extract such features. Although researchers can derive their own radiomic features from our segmentation labels, and the corresponding images we included a collection of features that have been shown in various studies to relate to clinical outcome 31 and underlying tumor molecular characteristics 30,32 . Note that the radiomic features we provide are extracted from the denoised images, and the users might also want to consider extracting features from the unsmoothed images provided in [Data Citation 3 and Data Citation 4].

Code availability
All software tools used for pre-processing, initialization, and generation of the hereby described segmentation labels are based on publicly available tools. Specifically, the tools used for the preprocessing steps of skull-stripping (BET) 67,68 and co-registration (FLIRT) 62,63 are publicly available from the FMRIB Software Library (FSL) [64][65][66] , in: fsl.fmrib.ox.ac.uk. The software used for the further skullstripping approaches, i.e., Multi-Atlas Skull-Stripping (MASS) 69 and MUSE 70 , are publicly available in www.med.upenn.edu/sbia/mass.html and www.med.upenn.edu/sbia/muse.html, respectively.
We developed CaPTk 83 as a toolkit to facilitate translation of complex research algorithms into clinical practice, by enabling operators to conduct quantitative analyses without requiring substantial computational background. Towards this end CaPTk is a dynamically growing software platform, with various integrated applications, allowing 1) interactive definition of coordinates and regions, 2) generic image analysis (e.g., registration, feature extraction), and 3) specialized analysis algorithms (e.g., identification of genetic mutation imaging markers 12 ). Specifically for this study, CaPTk was used to 1) manually initialize seed-points required for the initialization of GLISTRboost 36,38 , 2) apply the de-noising approach (SUSAN) 71 used for smoothing images before their input to GLISTRboost, as well as 3) to extract the radiomic features released in TCIA [Data Citation 3 and Data Citation 4]. The exact version used for initializing the required seed-points in this study was released on the 14th of October 2016 and the code source, as well as executable installers, are available in: www.med.upenn.edu/sbia/captk.html.
Finally, our segmentation approach, GLISTRboost 36,38 , has been made available for public use through the Online Image Processing Portal (IPP-ipp.cbica.upenn.edu) of the CBICA. CBICA's IPP allows users to perform their data analysis using integrated algorithms, without any software installation, whilst also using CBICA's High Performance Computing resources. It should be noted that we used the Python package scikit-learn 104 for the implementation of the gradient boosting algorithm.

Data Records
We selected only the pre-operative multimodal scans of the TCGA-GBM [Data Citation 1] and TCGA-LGG [Data Citation 2] glioma collections, from the publicly available TCIA repository. The generated data, which is made publicly available through TCIA's Analysis Results Directory (wiki.cancerimagingarchive.net/x/sgH1) [Data Citation 3 and Data Citation 4], comprise pre-operative baseline reoriented, co-registered and skull-stripped mMRI scans together with their corresponding computer-aided and manually-revised segmentation labels in NIfTI 57 format. We have further enriched the file containers to include an extensive panel of radiomic features, which we hope may facilitate radiogenomic research using the TCGA portal, as well as comparison of segmentation methods, even among those scientists without image analysis resources.
A subset of the pre-operative scans included in the generated data [Data Citation 3 and Data Citation 4] was also part of the BraTS'15 dataset (Table 4 (available online only)), which were skull-stripped, coregistered to the same anatomical template and resampled to 1 mm 3 voxel resolution by the challenge organizers. For this subset, we provide the identical MRI volumes as provided by the BraTS'15 challenge, allowing other researchers to compare their segmentation labels to the leaderboard of the BraTS'15 challenge. Furthermore, the manually-revised segmentation labels provided in [Data Citation 3 and Data Citation 4] are included in the datasets of the BraTS'17 challenge, for benchmarking computational segmentation algorithms against tumor delineation validated by expert neuroradiologists, allowing for repeatable research.

Technical Validation Data collection
Our expert board-certified neuroradiologist (M.B.) identified 135 and 108 pre-operative baseline scans of the TCGA-GBM and the TCGA-LGG glioma collections, via radiological assessment and while blinded to the glioma grade. Since it is not always easy to determine if a scan is pre-operative or post-operative only by visually assessing MRI volumes, and the radiological reports were not available through the TCGA/TCIA repositories, whenever we mention 'pre-operative scans' in this study, we refer to those that radiographically do not have clear evidence of prior instrumentation. Specifically, the main evaluation criterion for classifying scans as pre-operative, was absence of obvious skull defect and of operative cavity through either biopsy or resection.
We note that a mixed (pre-and post-operative) subset of 223 and 59 scans from the TCIA-GBM and TCIA-LGG datasets, respectively, were included in the BraTS'15 challenge, as part of their training (n GBM = 200, n LGG = 44) and testing (n GBM = 23, n LGG = 15) datasets, via the Virtual Skeleton Database (VSD) platform 56,105 (www.virtualskeleton.ch). Since an explicit distinction as pre-or post-operative was not provided for the BraTS'15 dataset, we conducted the radiological assessment of the complete TCIA collections, blind to whether a scan was part of the BraTS challenge, and only included the BraTS'15 volumes identified as pre-operative (Fig. 1f) (Table 4 (available online only)).

Segmentation labels
The segmentation method we developed to produce the segmentation labels, GLISTRboost 36 the computer-aided segmentation labels was assessed during the challenge for the test data, through the VSD platform, by comparing the voxel-level overlap between the segmentation labels produced by GLISTRboost and the ground truth labels provided by the BraTS organizers in three regions, i.e., the whole tumor (WT), the tumor core (TC) and the ET. The WT describes the union of the ET, NET and ED, whereas the TC describes the union of the ET and NET. The performance was quantitatively validated by the per-voxel overlap between respective regions, using the DICE coefficient and the robust Hausdorff distance (95% quantile), as suggested by the BraTS'15 organizers 56 . The former metric takes values between 0 and 1, with higher values corresponding to increased overlap, whereas lower values in the latter correspond to segmentation labels closer to the gold standard labels. Note that the quantitative results for the test data were not provided to the participants, until a manuscript summarizing the results of BraTS'14 and BraTS'15 is published. However, for reporting the performance of our method, we report here the cross-validated results of the same metrics used in BraTS'15 for the subset of GBM subjects included in the training set of BraTS'15 and identified as pre-operative in this study (Table 4 (available online only)). The median DICE values with their corresponding inter-quartile ranges (IQR) for the three evaluated regions, i.e., WT, TC, ET, were equal to 0. 92  Furthermore, we used the Jaccard coefficient, in order to quantify the difference between the computer-aided segmentation labels produced for all the scans identified as pre-operative and all the manually-corrected labels that we provide in [Data Citation 3 and Data Citation 4]. The median (mean ± std.dev) Jaccard values for the three regions of interest i.e., WT, TC, ET, were equal to 0.96 (0.93 ± 0.1), 0.87 (0.78 ± 0.23), and 0.86 (0.73 ± 0.29), respectively.

Manual correction
The classification scheme of segmentation labels considered for the manual corrections of the GBM and LGG cases describe all three segmentation labels (i.e., ET, NET, and ED) for both GBMs and LGGs with an apparent ET area. However, whenever we note LGG scans without an apparent ET area and not obvious texture differences, we considered only the NET label, allowing for distinguishing between normal and abnormal brain tissue, as slowly growing tumors are not expected to induce ED. Furthermore, due to high uncertainty (reported by neurosurgeons, neuroradiologists, and imaging scientists) on the exact boundaries between the various tumor labels, particularly between NET and ED, small regions that visual assessment was ambiguous of their exact classification, were left as segmented by GLISTRboost.
Manual revisions/corrections applied in the computer-aided segmentation labels comprise: i) obvious under-or over-segmented ED/ET/NET regions ( Fig. 3d-g), ii) voxels classified as ED within the tumor core (Fig. 3b,c,g), iii) unclassified voxels within the tumor core ( Fig. 3c-g), iv) voxels classified as NET outside the tumor core. Note that during the manual corrections only peritumoral ED was considered, and both contralateral, and periventricular ED was deleted (Fig. 3c,f), unless it was a clear continuation of the peritumoral ED, in which cases was added (Fig. 3g). The rationale for this is that contralateral and periventricular white matter hyper-intensities regions might be considered pre-existing conditions, related to small vessel ischemic disease, especially in older patients.
The scheme followed for the manual correction included two computational imaging scientists (S.B., A.S.) and a medical doctor (H.A.) working in medical image computing and analysis for 10, 12 and 8 years, respectively. These operators corrected mislabeled voxels following the rules set by our expert board-certified neuroradiologist (M.B.) with 14 years of experience. The corrected labels were then iteratively re-evaluated by the latter and re-iterated until they were satisfactory segmented.