Introduction

Emerging targeted therapies interfere with specific molecules that promote tumor growth and infiltration based on patient-specific predictive cellular and molecular biomarkers1. However, heterogeneous genomic and phenotypic tumor microenvironments contribute to incomplete treatment by targeted therapy and promote tumor recurrence via a non-linear branched evolution of the cancer genome2,3,4. Biopsy is currently the most effective method to assess patient-specific tumor biomarkers for targeted therapeutics, but clinical outcomes are limited by tumor heterogeneity which cannot be assessed by invasive biopsy alone1. Medical imaging techniques that are minimally-invasive and assess cellular and molecular tissue characteristics across the entire tumor bed and tumor microenvironment (TME) hold the potential to significantly improve the characterization and treatment of aggressive brain tumors5,6,7.

Significant efforts are underway to develop tumor heterogeneity mapping techniques using minimally-invasive imaging, including texture analysis8,9,10; proton11,12 and hyperpolarized 13C13 spectroscopy; and, most recently, MR fingerprinting14. Generally, these methods classify tumor properties at one of two levels: (1) volumetrically, by segmenting adjacent voxels together into classes, or (2) on a voxel-wise basis, treating each voxel independently. Volumetric segmentation techniques leverage spatial correlations in adjacent voxels that may be associated with tumor biology and/or the physical attributes of the acquisition process to improve SNR and classification accuracy. However, these improvements are balanced by a decrease in the theoretical spatial resolution of the parametric images, ultimately limiting the assessment of heterogeneity. Voxel-wise methods have a theoretical spatial resolution on the order of a single voxel, but suffer from significantly increased noise, which may be counteracted by the concomitant acquisition of multiple MR signatures. A recent voxel-wise algorithm demonstrated the ability to map tumor cellularity from three MR contrasts when biopsy findings were localized to the pre-surgical images15. Functional Diffusion Maps (fDMs) have also been estimated from ADC maps by identifying biopsy core locations on intra-operative computed tomography and post-surgical high resolution 3D anatomical images16. Alternatively, MR Fingerprinting (MRF) is a promising voxel-wise approach that has been successfully used to parameterize important tumor tissue properties including T1, T2, and M0, as well as physical system properties including B0 and B117. There is some emerging evidence that MRF can be used to map functional tissue parameters including perfusion, oxygenation, and microvascular structure18, but the extent to which the MRF technique can be applied to functional, cellular, and molecular imaging remains unknown.

Here we propose to map cellular and molecular tumor properties throughout the TME in a voxel-wise manner by leveraging the growing dimensionality of clinical MR data. Our approach does not inherently rely on spatial correlation information or simulations of various tissue properties for classification. Instead, we hypothesize that the dimensionality of MR data alone provides a readily available vehicle to traverse tissue scale. We evaluate our hypothesis in three separate sub-steps: Sub-hypothesis 1) significant relationships (\(\alpha \le 0.05\)) between macro- and micro-scale properties can be identified using elementary statistical testing when surgical pathology results are localized to the pre-surgical image space; Sub-hypothesis 2) non-parametric machine learning can classify microscopic properties from macroscopic images with high accuracy (\(\ge \,95 \% \)) when traditional corrections for family-wise error rates are employed; and Sub-hypothesis 3) clinically-useful multiscale classification across the entire image space can be accomplished when the parametric images are treated as Gaussian random fields.

Experimentally, we developed a data-driven model linking spatially registered core biopsy data to multiparametric MR. We used a diverse patient population consisting of more than 10 different disease classes, making the microscopic classifications more difficult but also more generalizable to a clinical population. We performed initial statistical evaluations on the model to determine the feasibility of predicting the biopsy findings from the MR values alone. We then evaluated the use of non-parametric machine learning to predict four clinically relevant properties: IDH1 mutation status, MGMT promoter methylation, cellular necrosis, and microvascular proliferation. Class membership probabilities output from the machine learning model were converted to chi-square statistical estimates using probabilistic distributions of the dependent variables identified a priori. The Benjamini-Hochberg algorithm controlled for family-wise error rates (FWER), and the classification accuracy and sensitivity of the results were optimized across a single classification tuning parameter. Finally, the machine learning model was extended to calculate chi-square (χ2) parametric maps across the entire brain of all 29 patients. To improve statistical classification sensitivity in the image domain, we implemented Gaussian random field theory (RFT) to estimate the interdependence of voxels and then group statistical findings into thresholded clusters. We evaluated the images qualitatively by clinical experts and quantitatively by classification accuracy in the biopsy sample volume.

Methods

Study population and model development

The Indiana University Institutional Review Board (IRB) reviewed and exempted this retrospective study and waived informed consent on the condition that all patient data would be de-identified upon the completion of patient enrollment. All procedures, methods, and experiments performed in this study were carried out in accordance with relevant guidelines and regulations, including the HIPAA Privacy Rule and the Declaration of Helsinki. De-identification consisted of removing all 18 HIPAA Privacy Rule identifiers from images, pathology reports, and clinical data. Accordingly, all dates were removed, but age and the difference in days between imaging and biopsy were retained for each patient. This study was not listed on ClinicalTrials.gov, and no part of the dataset presented here has previously been used or published. All source data used in this paper are openly shared with the radiology community for research replication and further analyses at http://www.iu.edu/~mipl. Inclusion criteria for this study required that the patients (1) had previously undergone targeted (stereotactic image-guided) core biopsy of the brain at our institution with at least three orthogonal plane images saved showing the location of the core; and (2) had completed an MR scan a maximum of 60 days prior to biopsy that included at least T1 weighting (T1w), T1 weighting post gadolinium injection (T1w-post), T2 weighting (T2w), T2 weighting with fluid attenuated inversion recovery (T2-FLAIR), and diffusion weighted imaging (DWI). We specifically did not limit the study to a single tumor type (e.g. glioma) to ensure that the non-parametric model could be tested in a clinically-relevant population. Approximately 100 patients were screened, and 29 met the criteria for enrollment (N = 29). The characteristics of the enrolled population are shown in Table 1. All pathology reads and diagnoses were performed by two experienced neuropathologists, each with more than 10 years’ experience practicing in an academic medical center.

Table 1 Subject population characteristics

The five MR sequences were the only independent variables used in this analysis – for an overview of the acquisition parameters please see Supporting Information – Supplemental Table 1. Approximately 90% of the acquisitions were performed at 1.5 T (26 of 29), and approximately 70% of the anatomical sequences used 3D readout (101 of 145). All DWI acquisitions used two b-values (0 and 1000 s/mm2) and 3 orthogonal directions. Fifteen (15) of the acquisitions were performed on a Siemens MAGNETOM Aera (Siemens Healthineers, Germany); the remainder were distributed relatively evenly across the GE HDxt (5 acquisitions; GE Healthcare, General Electric Company, USA) and various Siemens systems including the Espree (2), Avanto (4), Skyra (2), and Trio (1). For post-processing, all images were initially registered to the T1w-post frame-of-reference for each patient. T1w was registered using a 12 degree-of-freedom (DOF) transform and minimization of a correlation ratio objective function19. T2w, T2-FLAIR, and DWI (B0-only) were registered using a 12 DOF transform and minimization of a mutual information objective function20. Apparent diffusion coefficient (ADC) maps were then registered to the individual T1w-post reference frame by applying the affine transformation matrix estimated for the DWI B0 images. We normalized voxels of each contrast to the mean value of uninvolved white matter determined by a spherical region-of-interest on the T1w-post. The 3-dimensional centroid of the biopsy core was identified on the T1w-post for each patient by visually comparing the three-plane neuronavigation plans (Fig. 1) to the pre-intervention MR images. We used the size of the biopsy core as reported in the pathology report to define a sphere centered at the location of the biopsy needle tip from which the image contrasts were extracted.
This method ensured the feature matrix and subsequent machine learning model included only those voxels representative of biopsied tissues.
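The white-matter normalization and spherical voxel extraction described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the processing code used in the study: the function names, the `radius_mm` parameter (how the reported core size maps to a sphere radius is not specified here), and the voxel spacing are all hypothetical.

```python
import math
from statistics import mean


def sphere_voxels(center_ijk, radius_mm, voxel_size_mm, shape):
    """Return (i, j, k) indices whose centers lie within radius_mm of center_ijk.

    center_ijk    : voxel index of the biopsy location (hypothetical input)
    radius_mm     : sphere radius in mm (assumed to be derived from the core size)
    voxel_size_mm : voxel spacing along each axis, in mm
    shape         : image dimensions, used to clip the search to the grid
    """
    ci, cj, ck = center_ijk
    si, sj, sk = voxel_size_mm
    # Half-width of the bounding box in voxels along each axis.
    ri, rj, rk = (int(math.ceil(radius_mm / s)) for s in voxel_size_mm)
    voxels = []
    for i in range(max(0, ci - ri), min(shape[0], ci + ri + 1)):
        for j in range(max(0, cj - rj), min(shape[1], cj + rj + 1)):
            for k in range(max(0, ck - rk), min(shape[2], ck + rk + 1)):
                # Squared physical distance from the sphere center.
                d2 = (((i - ci) * si) ** 2
                      + ((j - cj) * sj) ** 2
                      + ((k - ck) * sk) ** 2)
                if d2 <= radius_mm ** 2:
                    voxels.append((i, j, k))
    return voxels


def normalize_to_reference(values, reference_values):
    """Divide each value by the mean of a reference region (e.g. uninvolved white matter)."""
    ref = mean(reference_values)
    return [v / ref for v in values]
```

Given a registered volume stored as a nested list `volume[i][j][k]`, the contrast values for the biopsy sphere would be `[volume[i][j][k] for (i, j, k) in sphere_voxels(...)]`, repeated for each of the five contrasts.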

Figure 1

Example neuronavigation targeting images for Subject 2.

We extracted the four dependent categorical variables from clinical pathology reports for each patient, classifying voxels as IDH1 mutation status positive (IDH1MS+) if the corresponding specimen contained any IDH1-R132H-positive cells based on immunohistochemistry, and as MGMT promoter methylation status positive (MGMTPMS+) if methylation was detected by a methylation-specific PCR-based assay. When applicable, a clinical pathologist evaluated several representative microscopic sections for the presence of cellular necrosis (CNEC+) and/or microvascular proliferation (MVP+). Because this study was not limited to primary brain tumors, the pathologist used their discretion to determine which tests should be applied on an individual patient basis, as a standard of care. Importantly, if the physician determined a variable need not be measured for a given patient, we classified it as negative for the analysis. These instances were referred to as triage classifications. An overview of the four dependent variables across all patients is given in Table 2. The voxel-wise expected distribution for each dependent variable (Table 2, row 5) was calculated as the random probability of occurrence based on the number of voxels testing positive and the total number of voxels labeled by the tissue specimens.

Table 2 Overview of the values (positive or negative) of the four dependent variables across the tissue specimen population, and their associated voxel-wise probability distributions.

A biostatistician and co-author on this paper (S.C.) guided, oversaw, and reviewed the statistical analyses; an overview is given in Fig. 2.

Figure 2

Flowchart of the processing and statistical analysis steps. Endpoints resulting in statistical conclusions are outlined in green.

Sub-hypothesis 1: elementary statistical evaluations

First, we calculated the normalized contrast values for the independent variables across the entire biopsy sphere for each patient, and combined them into a single feature vector mapping the five independent variables to the four outcome variables for each voxel. This resulted in a feature matrix of size 147,031 rows × 9 columns (five contrast columns and four binary outcome columns). A binary logistic regression was performed for each dependent variable by fitting a maximum-likelihood logit model. The regressions were sample weighted by the inverse of the probability of inclusion due to the sampling design to account for class imbalances21. The results characterized the overall (combined) predictive power of the five image contrasts for each microscopic variable using the Wald χ2 test and McFadden’s pseudo \(R^2\)22.
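For reference, the quantities named above can be written out in standard form. The notation is ours and is an assumption about how the analysis was parameterized: \(y_i\) is the binary outcome for voxel \(i\), \(\mathbf{x}_i\) the vector of five normalized contrasts, \(w_i\) the inverse-probability sample weight, \(\widehat{\mathbf{V}}\) the estimated covariance of the coefficients, and \(\tilde{\beta}_0\) the intercept of the fitted intercept-only model.

```latex
\begin{align}
p_i &= \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_i\right)\right)} \\
\ell_w(\beta_0, \boldsymbol{\beta}) &= \sum_{i} w_i \left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right] \\
W &= \hat{\boldsymbol{\beta}}^{\top}\, \widehat{\mathbf{V}}^{-1}\, \hat{\boldsymbol{\beta}}
  \;\sim\; \chi^2_{5} \quad \text{under } H_0\colon \boldsymbol{\beta} = \mathbf{0} \\
R^2_{\text{McFadden}} &= 1 - \frac{\ell_w(\hat{\beta}_0, \hat{\boldsymbol{\beta}})}{\ell_w(\tilde{\beta}_0, \mathbf{0})}
\end{align}
```

Under this reading, the Wald statistic tests the five contrasts jointly, and the pseudo \(R^2\) approaches 0 when the contrasts add nothing over the intercept-only model.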

Sub-hypothesis 2: multiscale classification without spatial information

Next, we developed individual training feature matrices (FVi, train) and testing feature matrices (FVi, test) for each patient, i. The FVi, train and FVi, test matrices included the following data: each of the 5 normalized MR contrast values in columns 1–5; the subject number (i) in column 6; and the binary class flag for IDH1MS+, MGMTPMS+, CNEC+, and MVP+ in columns 7–10, respectively. The rows of FVi, train corresponded to the biopsied voxels across all patients except patient i; the rows of FVi, test corresponded to the biopsied voxels for patient i. A set of 116 machine learning experiments (29 patients × 4 dependent variables) were then carried out using a leave-one-out design to ensure that in no case could data from the same patient be used for both training and testing.
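The leave-one-out design above can be sketched as a generator that holds out all voxels from one patient at a time. This is an illustrative sketch only; the function name and the toy rows are hypothetical, and the column layout follows the description in the text (contrasts in columns 1–5, subject number in column 6, class flags in columns 7–10).

```python
def leave_one_patient_out(rows, subject_col=5):
    """Yield (held_out_subject, train_rows, test_rows) for each subject.

    rows        : list of feature rows; subject number at index subject_col
                  (column 6 in the text, index 5 with zero-based indexing)
    The split is by patient, so no patient contributes voxels to both sets.
    """
    subjects = sorted({row[subject_col] for row in rows})
    for held_out in subjects:
        train = [r for r in rows if r[subject_col] != held_out]
        test = [r for r in rows if r[subject_col] == held_out]
        yield held_out, train, test


# Toy feature matrix (made-up values): 5 contrasts, subject number, 4 class flags.
toy_rows = [
    [1.1, 0.9, 1.3, 1.2, 0.8, 1, 1, 0, 0, 0],
    [1.0, 1.0, 1.1, 1.0, 1.0, 1, 1, 0, 0, 0],
    [0.7, 1.4, 1.6, 1.5, 1.2, 2, 0, 1, 0, 1],
    [0.9, 1.2, 1.0, 1.1, 0.9, 3, 0, 0, 0, 0],
]
folds = list(leave_one_patient_out(toy_rows))
```

With 29 patients and four dependent variables, iterating this split once per variable gives the 116 experiments described above.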

The machine learning classifier was a non-parametric weighted k-nearest neighbor design (wKNN)23 with class weights calculated by the inverse Euclidean distance. The only tuning parameter used for classification was the number of neighbors, k, included in the class calculations. The classifier output was a 2-element vector for each voxel representing the probability of membership in each binary class, calculated as the normalized sum of the inverse Euclidean distance. We transformed the probability vectors across the biopsy volume for each patient to a chi-square test statistic (χ2) using Pearson’s method24. The statistic compared our predicted class probability for each voxel with the background probability calculated for the entire voxel population across all patients. The chi-square transform was chosen (i.e. instead of z or t distributions) because the background probabilities could be explicitly calculated from the data. A clinical implementation of this algorithm would similarly have access to background population probabilities assuming the availability of a robust training dataset. The χ2 values were then thresholded to a given \(\alpha \)-value using standard statistical transforms. For FWER correction, a \(p\)-value threshold was calculated for each patient using the Benjamini-Hochberg procedure25 at an α of 0.05. We calculated a confusion matrix for each patient by choosing the class of greatest probability for all voxels that passed the FWER correction. The final measure of classification accuracy, ACC(k), was calculated as the mean accuracy across all 29 confusion matrices, with optimization across the tuning parameter k. The final measure of classification sensitivity, SENS(k), was calculated as the percent of voxels sampled by biopsy that met the α threshold.
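The classification and thresholding chain described above could look roughly like the sketch below. Several details are assumptions rather than the authors' implementation: the small `eps` guarding against zero distances, the exact form of the Pearson statistic (here computed directly on class probabilities with one degree of freedom), and all function names.

```python
import math


def wknn_probability(train_x, train_y, query, k, eps=1e-12):
    """Positive-class probability from inverse-distance-weighted k nearest neighbors.

    eps is an assumed safeguard against division by zero for exact matches.
    """
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    weights = [(1.0 / (d + eps), y) for d, y in nearest]
    total = sum(w for w, _ in weights)
    # Normalized sum of inverse distances belonging to the positive class.
    return sum(w for w, y in weights if y == 1) / total


def pearson_chi2(p_pos, background_pos):
    """Pearson statistic comparing predicted and background class probabilities.

    One plausible reading of the transform: sum over the two classes of
    (observed - expected)^2 / expected.
    """
    observed = (p_pos, 1.0 - p_pos)
    expected = (background_pos, 1.0 - background_pos)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_1dof(x):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def benjamini_hochberg_threshold(p_values, alpha=0.05):
    """Largest sorted p-value p_(i) with p_(i) <= (i / m) * alpha, else None."""
    m = len(p_values)
    threshold = None
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * alpha:
            threshold = p
    return threshold
```

In this sketch, voxels whose p-value is at or below the returned threshold would be retained for the confusion matrix, and a return value of `None` corresponds to the case in which no voxels pass the correction.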

Sub-hypothesis 3: multiscale classification across the image space

χ2 parametric maps of each microscopic variable were then calculated as before for every voxel and overlaid on the T1w-post images using the tuning parameter that yielded the greatest value of SENS(k) at an ACC(k) ≥ 0.95. Because the images resulted in several orders of magnitude more voxels to be classified than in any of the FVi, test matrices, we determined that a less conservative FWER correction approach was necessary. As the χ2 maps were smooth statistical fields, we used a mature FWER correction technique widely used in functional MRI which first estimates the spatial correlation of the statistical image and then identifies clusters of voxels which result in the expected Euler characteristic (EC) for a smooth statistical map26,27,28. We performed both the spatial correlation and EC optimizations using FSL29,30, resulting in χ2 parametric images for each dependent variable that controlled FWER at the 5% level.

Results

Study population and model development

The enrolled population had a median age of 59 years (max 89, min 23) and included 16 males (55%). The mean difference in time between imaging and biopsy was 9.7 ± 9.1 days. The biopsy-confirmed diagnoses included: sixteen gliomas (seven Grade IV, two Grade III, three Grade II, and four Grade I), four metastatic carcinomas (two breast, one lung, and one melanoma), two diffuse large B-cell lymphomas, one schwannoma, two reactive changes, one abscess, one germinoma, one demyelination, and one normal. A detailed overview of the enrolled subject population and demographics is given in Table 1.

Sub-hypothesis 1: elementary statistical evaluations

The combination of all five image contrasts was found to be significantly correlated with outcome (\(P < 10^{-4}\)) for all four microscopic variables (Table 3). IDH1MS+ had the greatest likelihood with a pseudo R2 of 0.26, followed by MGMTPMS+ (0.25), CNEC+ (0.22), and MVP+ (0.07). For a complete breakdown of the prediction results by individual image contrast, please see Supporting Information – Supplemental Table 2.

Table 3 Overall prediction results of the combined (5) image contrasts from binary logistic regression analyses.

Figure 3 shows the parameter estimates (regression coefficients), demonstrating that characteristic patterns of the logits across the five predictors exist for each microscopic variable, even when accounting for the robust standard errors. Of note, IDH1MS+ and MGMTPMS+ exhibited strong negative correlations with T1w, and MGMTPMS+ also displayed a large negative correlation with ADC. CNEC+ demonstrated a strong positive correlation with T1w, while IDH1MS+ had a strong positive relationship with T2-FLAIR. These initial findings provided a statistical foundation upon which our hypothesis could then be tested using the previously described machine learning techniques.

Figure 3

Regression coefficients for each microscopic variable across the 5 image contrasts. Error bars represent robust standard errors.

Sub-hypothesis 2: multiscale classification without spatial information

The results of the machine learning optimization procedure are shown in Fig. 4. Accuracy and the number of statistically significant voxels are shown in black and blue, respectively. The plots demonstrate that the tuning parameter k has a large effect on the number of voxels which pass the FWER correction, and thus, indirectly, the overall accuracy calculation. There was similar classification behavior between IDH1MS+ and MGMTPMS+, in which classification accuracy generally increased with k, and then plateaued as the number of significant voxels began to decrease. No voxels passed the FWER correction for cellular necrosis at any value of k that was tested. The classification accuracy of MVP had an approximately linear relationship with k, while the number of voxels passing the FWER threshold had an approximately inverse linear dependence on k.

Figure 4

Accuracy (black; left vertical axis) and number of significant voxels (blue; right vertical axis) vs. the wKNN tuning parameter k. The optimal tuning parameter value (kopt) maximized the number of significant voxels when \(\alpha \le 0.05\). kopt is indicated by a vertical green line for each outcome variable.

From the optimization plots, we chose a tuning parameter that maximized the accuracy and the number of voxels that passed the significance threshold. In keeping with our α threshold of 0.05, we limited the minimum acceptable classification accuracy to be 0.95; thus, the optimal tuning parameter, kopt, was that which maximized SENS(k) in the condition that ACC(k) ≥ 0.95. Table 4 shows the optimized tuning parameter and classification results for the four microscopic variables. The values of kopt for each outcome are also shown as vertical green bars in Fig. 4.
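The selection rule above (maximize SENS(k) subject to ACC(k) ≥ 0.95) can be expressed as a small constrained search. This is a sketch only; the function name and the example values are hypothetical and are not taken from Table 4.

```python
def select_k_opt(results, min_accuracy=0.95):
    """Return the k that maximizes SENS(k) among those with ACC(k) >= min_accuracy.

    results : dict mapping k -> (ACC(k), SENS(k))
    Returns None when no value of k meets the accuracy constraint.
    """
    eligible = {
        k: sens for k, (acc, sens) in results.items() if acc >= min_accuracy
    }
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# Made-up optimization curve for illustration: k -> (accuracy, sensitivity in %).
example_results = {5: (0.90, 4.0), 15: (0.96, 2.1), 25: (0.99, 1.2)}
k_opt = select_k_opt(example_results)
```

In the made-up curve, k = 5 has the highest sensitivity but fails the accuracy constraint, so the rule falls back to the most sensitive value of k that still satisfies ACC(k) ≥ 0.95.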

Table 4 Optimized results from the leave-one-out machine learning classification using Benjamini-Hochberg correction without spatial correlation information.

Of the 3 variables that had significant findings (IDH1, MGMT, MVP) the average classification accuracy was 0.984 ± 0.02 and the average classification sensitivity was 1.567% ± 0.967. Optimal classification results for the molecular markers IDH1MS+ and MGMTPMS+ were similar, both yielding an ACC(k) of 1.0 and a SENS(k) slightly greater than 2%. The optimal ACC(k) of MVP+ was 0.951 with a SENS(k) of 0.2%. These results confirmed our hypothesis that multiscale classification could be performed without spatial information. However, the low number of voxels passing the correction threshold supported further evaluation of a FWER-correction technique that was more sensitive to classification.

Sub-hypothesis 3: multiscale classification across the image space

Example images corrected by RFT using kopt for IDH1MS+, MGMTPMS+, and MVP+ are shown in Fig. 5 for 3 exemplary patients. Images for 4 additional patients are given in Supplemental Materials – Supp. Figure I. In Fig. 5, the biopsy site for each patient is indicated with a yellow crosshair on the zoomed-in χ2RFT maps (2nd row), and the original uncorrected probability maps generated by the machine learning model are shown in row 3.

Figure 5

Results of the statistical mapping procedure for 4 select patients, with the location of the biopsy marked with a yellow plus sign. In all cases the χ2 image with random field theory correction dramatically reduces the number of false positive findings and demonstrates smooth noise properties across space.

Qualitatively, the images demonstrate smooth statistical fields that are well localized to the tumor bed and TME. The quantitative classification results based on the χ2RFT-corrected images are shown in Table 5. RFT demonstrated improved average classification accuracy (0.989 ± 0.008) and sensitivity (5.967% ± 2.857) compared with Benjamini-Hochberg. Notably, SENS(k) for MVP+ increased to 9.9% using RFT compared with 0.2% with Benjamini-Hochberg.

Table 5 Optimized results from the leave-one-out machine learning classification using random field theory correction including spatial correlation information.

Figure 6 shows an example GBM subject (Subject 8) with significant results for all 3 outcome variables that covered nearly the entire TME. The 5 predictor contrasts are shown along the left side of the image zoomed-in on the tumor bed and TME. The colormaps are windowed from 0 to the maximum χ2RFT statistic (right side colorbars). Potential significant findings for MVP+ outside the T2-FLAIR abnormality (3 red speckles frontal and medial to the tumor) may hold important information related to microscopic disease spread.

Figure 6

Extensive visualization of statistical confidence ROIs mapping genomic and cellular heterogeneity in a GBM patient.

Discussion and Conclusions

Cellular and molecular heterogeneity is a significant driver of brain tumor morbidity that cannot be assessed by biopsy alone. This paper demonstrated three different methods to predict microscopic cellular and molecular properties of brain tumors from macroscopic, minimally-invasive clinical images. Elementary statistical evaluations demonstrated that significant relationships between the macroscopic and microscopic variables of interest did exist. Machine learning combined with a conservative correction for family-wise error rates was able to predict cellular and molecular properties with high accuracy but limited classification sensitivity (0.2–2.3%) for three of the four outcome variables. When spatial correlations across voxels were taken into account using Gaussian random field theory, high accuracy was retained with a significant increase in classification sensitivity (3.2–9.9%). The images generated by random field theory demonstrated acceptable noise and spatial resolution properties for clinical interpretation. Taken together, our results show that in vivo microscopic and even genomic mapping of human brain tumors may be clinically possible in the near future.

The near-term implication of our findings is that researchers and clinicians utilizing machine learning to predict tumor heterogeneity should consider dimensionality to be one potential vehicle by which in vivo imaging signatures may be used to traverse scale. The rapid expansion of anatomical and functional MR sequences and the growing availability of hybrid imaging systems only serve to enhance this opportunity. The long-term implication of our findings is that it may be possible to map cellular and molecular tumor properties across both space and time during treatment, allowing for highly personalized treatment strategies that are not currently possible. For example, MGMT promoter methylation status can vary across the tumor bed and microenvironment31, making treatment planning challenging. Patients who are determined by surgical biopsy to have MGMTPMS+ are expected to demonstrate good response to standard of care treatment with concomitant and adjuvant radiation therapy and chemotherapy with temozolomide32, although this is almost always followed by relapse and eventual death. A subset of these patients are expected to also have undiagnosed MGMTPMS- properties, and thus may benefit from experimental personalized therapies33. The ability to map MGMT promoter methylation status with a minimally-invasive in vivo probe would allow for better treatment selection and drug combinations than is currently possible.

This study provides initial evidence in support of our hypothesis; however, there are significant limitations on the generalizability of our findings due to our study design. First, the retrospective design used in this paper did not allow for standardization of the immunohistochemical and molecular tests used across patients. This drawback required our analysis to rely on the clinical expertise of the pathology physicians in determining which tests were required at the individual patient level. Furthermore, more comprehensive genomic evaluations (e.g. genome-wide association) could have been conducted to identify other predictor-outcome relationships than the four we investigated. The MR sequence parameters used for the five predictor variables varied across patients and locations, which may have diminished their individual and collective effect sizes. The number of MR sequences was limited to 5, although many other sequences could have been used including perfusion imaging, chemical exchange saturation transfer, and MR spectroscopy. The localization procedure used to identify voxels that underwent biopsy did not always use the same image, the same image contrast, or even the same imaging modality as the image used during biopsy. For example, several biopsies used a recent head CT for stereotactic guidance. Thus, the accuracy of the localization procedure was highly dependent on the ability of the human operator to map the biopsy location based on anatomical structures and spatial features. Finally, although the diversity of our patient cohort was clinically relevant, it very likely weakened our control over the experimental variables and ultimately reduced our statistical effect sizes compared with a highly controlled study focused on a single tumor or tissue type. However, the ubiquitous drawback of highly-controlled, single-disease radiomics studies is a failure to generalize to a clinically-relevant patient population34,35.

The poor voxel sensitivity found in this study represents a challenge for future clinical implementations of this method. Voxel sensitivity is directly related to the statistical threshold generated by RFT and thus is also linearly related to χ2RFT. Three of the four expected distributions calculated from our data had a positive background probability of <3%, effectively limiting sensitivity to those voxels with very high classification confidence. We believe the background probabilities were primarily limited by the small and heterogeneous biopsy population, and we expect a prospectively designed study that could control for the somatic and genomic tests performed across all patients, as well as the distribution of tumor and non-tumor types included, would yield far more accurate background probabilities and significantly increase voxel sensitivity.

In summary, we have demonstrated statistical relationships between routine multiparametric imaging signatures and underlying cellular and molecular properties of brain tumors. We have applied advanced statistical methods to correct for the family-wise error rate problem associated with whole-brain statistical parametric mapping, and have shown that the results have strong agreement with surgical biopsy. These results imply that cellular and molecular mapping of tumor heterogeneity from minimally-invasive images may be possible in the near future.