Deciphering Genomic Underpinnings of Quantitative MRI-based Radiomic Phenotypes of Invasive Breast Carcinoma

Magnetic Resonance Imaging (MRI) has been routinely used for the diagnosis and treatment of breast cancer. However, the relationship between the MRI tumor phenotypes and the underlying genetic mechanisms remains under-explored. We integrated multi-omics molecular data from The Cancer Genome Atlas (TCGA) with MRI data from The Cancer Imaging Archive (TCIA) for 91 breast invasive carcinomas. Quantitative MRI phenotypes of tumors (such as tumor size, shape, margin, and blood flow kinetics) were associated with their corresponding molecular profiles (including DNA mutation, miRNA expression, protein expression, pathway gene expression and copy number variation). We found that transcriptional activities of various genetic pathways were positively associated with tumor size, blurred tumor margin, and irregular tumor shape and that miRNA expressions were associated with the tumor size and enhancement texture, but not with other types of radiomic phenotypes. We provide all the association findings as a resource for the research community (available at http://compgenome.org/Radiogenomics/). These findings pave potential paths for the discovery of genetic mechanisms regulating specific tumor phenotypes and for improving MRI techniques as potential non-invasive approaches to probe the cancer molecular status.


Section 1. Preparation of Radiomic Data
The MR images used in this study were downloaded from The Cancer Imaging Archive (TCIA, http://www.cancerimagingarchive.net). There were 108 MRI cases available at the time of this study. In order to minimize the variations from different imaging acquisition protocols, only 93 MRI cases that were acquired on a 1.5 Tesla magnetic strength with GE scanners were included. By excluding one case with missing images in the dynamic sequence and one case without genomic data, a total of 91 cases were used in the final dataset. All cases were primary tumors from female patients. The average age of patients was 53.6 years with a standard deviation of 11.5 years and ranging from 29 to 82 years with a median of 53 years. Out of the 91 invasive breast carcinoma cases, 79 were ductal carcinoma, 10 were lobular carcinoma, and 2 were mixed. Table 1 in the main text summarizes the 91 cases and their tumor pathological stages and molecular receptor status. These cases were contributed by four institutions: Memorial Sloan Kettering Cancer Center, Mayo Clinic, University of Pittsburg Medical Center, and Roswell Park Cancer Institute. MR images acquired from a dynamic-contrast enhanced T1weighted protocol were used in this study. In general, there are one pre-contrast, and three to five post-contrast images obtained using a standard double breast coil on a 1.5T GE whole-body MRI system with T1-weighted 3D spoiled gradient echo sequence and a Gadolinium-based contrast agent. In-plane resolution ranged from 0.53 to 0.86 mm, and spacing between slices ranged from 2 to 3 mm.
Each MRI case was independently reviewed by 3 of 11 expert breast radiologists who were blinded to the clinical outcomes. The primary tumor location was established by their consensus and was the only input to the computer for the subsequent quantitative image analysis. Using this lesion location, each primary breast tumor was automatically segmented from background parenchyma using a fuzzy c-means clustering method 1 . This segmentation yields the tumor margin. Next, using our quantitative image analysis (QIA) MRI workstation (from computer-aided diagnosis), a total of 38 mathematical descriptors of the breast tumors were automatically extracted. These descriptors can be divided into six MRI phenotypic categories describing the tumor (1) size, (2) shape, (3) morphology, (4) enhancement texture, (5) enhancement kinetic curve, and (6) variance of enhancement kinetics [2][3][4][5][6][7] . Table S1 summarizes the formal name, brief description, and category of all radiomic phenotypes. where N is the number of voxels on the surface of the lesion, S(x i , y i , z i ) is the signal intensity of the ith voxel located at (x i , y i , z i ) in the shell, and is the voxel-value gradient: with , , denoting the voxel size in three dimensions in world coordinates.

Variance of margin sharpness:
Variance of the image gradient at the lesion margin 3. Variance of radial gradient histogram: Degree to which the enhancement structure extends to have a radial pattern originating from the center of the lesion where is the radial gradient value at location r and defined as the normalized dot product of the gradient direction and the radial direction vector, denotes the voxel value at location r (r = (x,y,z)) in the 3D lesion, and is the center of the lesion, is the radial gradient histogram in a given of volume of interest.

Group 4: Enhancement texture phenotypes
Enhancement texture phenotypes characterize the textural properties of the contrast-enhanced tumors on the first post-contrast images, i.e., the heterogeneity of the uptake. These texture phenotypes were calculated from the gray-level co-occurrence matrix (GLCM) [3][4] .
Some necessary notations are the following. : the (i,j) entry of the normalized GLCM, which is a probability of two neighboring voxels, one with gray level i and the other with gray level j. G: the number of distinct gray-levels in the image. GLCM is of the size G × G.
: Kinetic curve assessments characterize the physiological process of the uptake and washout of the contrast agent in a breast lesion during the dynamic imaging series and were extracted from the kinetic curve obtained from the most enhancing voxels within a lesion [5][6] .
Two necessary notations are the following: : the signal intensity at time frame t, t=0 (i.e., pre-contrast frame), 1, 2, …, T 1 (i.e., the last post-contrast frame) : the maximum signal intensity, which is the maximum value of over all time points. The time at which maximum variance of enhancement, i.e. , appears, denoted by Q. x v

Enhancement-variance increasing rate
x v

Section 2. Preparation of Genomic Data
All tumor cases were staged according to the American Joint Committee on Cancer (AJCC) staging system. The multi-layer genomic data of each case were generated by TCGA using a biospecimen from the primary tumor. The biospecimen was selected to be adjacent to the tissue block from which a diagnostic slide was obtained for the hospital to diagnose the patient. A pathology review was conducted on the tumor sample to determine whether it is qualified for making genomic data. A tumor sample was required to contain at least 60% tumor cell nuclei with less than 20% necrosis for inclusion in the study per TCGA protocol requirements.
We used TCGA-Assembler 8 , a software package that automatically downloads, assembles, and processes TCGA data, to retrieve genomic data from the TCGA data server. Only samples with both genomic data and radiomic data were included in the analysis. We used the normalized read counts for RNA-seq data, which were generated by the Illumina HiSeq 2000 system and processed by the MapSplice genome alignment algorithm 9 and the RSEM gene expression estimation algorithm 10 . We used the RPM (Reads Per Million miRNAs mapped) values for miRNA-seq data. TCGA used the Affymetrix® Genome-Wide Human SNP Array 6.0 and the circular binary segmentation algorithm 11 to generate the copy numbers of DNA fragments. We used TCGA-Assembler to calculate an average copy number for each gene in each sample, which was the actual copy number value used in the analysis. TCGA somatic mutation data were generated using exome sequencing on the Illumina Genome Analyzer and HiSeq 2000 DNA sequencing platforms. Candidate mutations were identified using VarScan 2 for SNVs/Indels 12 , SomaticSniper for SNVs 13 , and GATK IndelGenotyper v2.0 for Indels 14 . The union of the mutation call sets from these callers were used and additional filtering and processing were taken by TCGA to ensure the quality of somatic mutation calls 15 . Note that TCGA used the Reverse Phase Protein Array (RPPA) to generate protein expression data, which measures 142 important proteins and phospho-proteins such as some transcription factors involved in cancer development.

Section 3. Overview of Radiogenomics Data
We conducted clustering analysis on tumor samples to provide an overview of the dataset and also associated the clustering results with the breast cancer clinical subtypes, such as different tumor pathological stages and molecular receptor status. The radiomic data were preprocessed before clustering analysis. Values of three radiomic phenotypes, including maximum variance of enhancement, enhancement-variance increasing rate, and enhancement-variance decreasing rate, were log2 transformed to reduce the effects of extreme values. Zero values of enhancement-variance decreasing rate were replaced by the smallest positive value of the phenotype (over all patient samples) to allow the log2 transformation. All radiomic phenotypes were standardized to have a zero mean and a unit standard deviation for clustering analysis and producing heatmap. The Affinity Propagation Clustering (APC) 16 was applied to cluster the tumor samples and three tumor clusters with distinct radiomic profiles were identified, as shown in Fig. S1a. The genomic data were also preprocessed for clustering analysis and heatmap drawing. For RNA-seq data, gene filtering kept y w P ( P ) ≥ 2 , which resulted in 13177 genes for the analysis. 0 read counts in a sample were replaced by the minimum positive read count among all the genes in the sample. A Log2 transformation was then performed on the data. The expression values of each gene were standardized to have a zero mean and a unit standard deviation over all samples. The same procedure was used to preprocess the miRNA-seq data and 292 miRNAs were kept for the clustering analysis and heatmap drawing. Fig. S1b, c, and d present the tumor clustering results based on the gene expression data, miRNA expression data, and protein expression data, respectively.
Associations between the tumor clustering partitions and the classifications of tumors based on their pathological stage or receptor status were tested using the Fisher's exact test [17][18] . The obtained p-values were adjusted for multiple tests using the Benjamini-Hochberg (BH) procedure 19 and presented in Table S2. M-stage of the tumor samples was not considered for association analysis, because 89 out of the 91 samples are at M0 stage, indicating no clinical or radiographic evidence of distant metastases, and the rest two samples are at cM0(i+), indicating molecular or microscopical evidence of potential metastasis but no clinical or radiographic evidence.
The clustering partitions of tumors based on gene expressions, miRNA expressions, and protein expressions were found to be statistically significant in their associations with the status of Estrogen Receptor (ER) and Progesterone Receptor (PR), which means P w x P , respectively, at multiple molecular levels. For the tumor clusters defined by mRNA expressions (Fig. S1b), most of the tumors in cluster II, III, and IV were ER+ and PR+, while almost all the tumors in cluster I were P . miRNA expressions defined five tumor clusters (Fig.  S1c). Compared to clusters IV and V, clusters I, II, and III were dominated by ER+ and PR+ tumors. The four tumor clusters identified on protein expressions were statistically significantly associated with the tumor T stage with an adjusted p-value of 0.0171. Fig. S1d shows that clusters II and IV mainly consisted of T-II and T-III tumors and that clusters I, II, and III mainly included ER+ and PR+ tumors, while cluster IV wa y P .

Section 4. Overview of Identified Statistically Significant Associations
Fig. 2b in the main paper summarizes the numbers of identified statistically significant associations between genomic features of different platforms and radiomic phenotypes of different categories. Associations were deemed as statistically significant if the adjusted p-value ≤ 0.05. Because a gene usually mutated in only very few patients (see Table S6 right), which resulted in a small number of samples with a mutation event and thus a relatively weak statistical power, we used less stringent criteria to call statistically significant associations of somatically mutated genes. The criteria are (1) p-v ≤ 0.05 and (2) the gene mutated in at least five patients. Based on the table in Fig. 2b, the Fisher's exact test [17][18] shows that the frequencies of statistically significant associations are dependent on the categories of genomic features and radiomic phenotypes (p-value ≤ 1.0×10 8 ). Table S3 shows that the identified associations are enriched for certain categories of genomic features and radiomic phenotypes, evaluated by the adjusted p-values from the Fisher's exact tests [17][18] . To explain how the enrichment significance was statistically evaluated, we take the associations between gene expressions of pathways and tumor size phenotypes as an example. The total number of genomic features is 186 (gene expressions of pathways) + 186 (copy number variations of pathways) + 292 (miRNA expressions) + 3734 (mutated genes) + 142 (protein expressions) = 4540, so there are 4540×38 = 172520 potential associations, among which 186×4 = 744 potential associations are between gene expressions of pathways and four tumor size phenotypes. From the table in Fig. 2b, 1404 statistically significant associations have been identified, among which 173 statistically significant associations are between gene expressions of pathways and size phenotypes. Based on these numbers, Fisher's exact test gives a p-value smaller than 1.0×10 30 , after correction over all 30 tests included in Table S3 using the BH procedure.

Table S3
The adjusted p-values resulted from the enrichment tests of statistically significant associations between each genomic platform and each radiomic phenotype category. A Chi-squared proportion test based on equal proportions of positive and negative associations was carried out for every combination of a genomic feature platform and a radiomic phenotype, when there are at least 10 statistically significant associations between them. The purpose was to examine whether a genomic feature platform is dominantly positively or negatively associated with a radiomic phenotype. The resulted p-values were then adjusted using the BH procedure over all the Chi-squared proportion tests.

Section 5. Associations between Genetic Pathways and Radiomic Phenotypes
We studied the associations between the transcriptional activities of genetic pathways documented by the Kyoto Encyclopedia of Genes and Genomes (KEGG) database 20 and the tumor radiomic phenotypes using the Gene Set Enrichment Analysis (GSEA). GSEA is based on known genetic pathways or functional gene modules. It studies the behavior of genes involved a pathway as a whole group in response to a biological condition. Various GSEA methods have been developed since it was first proposed in 2005 21 . However, there is no consolidated opinion regarding what methods should be preferred, because most methods have their own advantages and disadvantages 22 . Thus, Väremo et al. integrated multiple GSEA methods into a uniform workflow implemented as an R package called PIANO 22 . We used the PIANO package including nine different GSEA methods for our analysis. The nine methods were Stouffer's method, reporter features, tail strength, Wilcoxon rank-sum test, mean, median, sum, statistic, original GSEA, and Parametric Analysis of Gene set Enrichment (PAGE). They differ in the choice of gene-level statistic and the way of calculating gene-set-level statistic and evaluating its statistical significance. Refer to Väremo et al. (2013) 22 for a detailed description of these methods and the software package. The gene sets used in the analysis were KEGG pathways collected in the Molecular Signature Database 21 that includes 186 genetic pathways or modules covering a wide range of genetic and molecular functionalities.
The RNA-seq data were filtered to remove the genes with unreliable expressions that could introduce significant noise or bias to the analysis results. A gene was excluded from the analysis, if its normalized read count was 0 in ten or more samples or its average read count over all samples was smaller than 8. After filtering, 15544 genes were kept.
For each of the 38 radiomic phenotypes, we performed the GSEA for identifying the KEGG pathways whose transcriptional changes associated with the change of radiomic phenotype. The gene-level statistics used to characterize the relationship between a gene's expression and a radiomic phenotype were the correlation coefficient and/or the p-value resulted from the Spearman rank correlation test 23 , depending on the GSEA method. Stouffer's method, reporter features, and tail strength used p-values as the gene-level statistic and the signs of correlation coefficients were used to indicate the association direction. All other methods used correlation coefficient as the gene-level statistic. Nominal p-values for evaluating the statistical significance of association were calculated based on 30000 random gene sets. For each method, the False Discovery Rate (FDR) was controlled for each radiomic phenotype over its association tests with all KEGG pathways using the BH procedure 19 . Thus, we got nine adjusted p-values evaluating the association between each gene set and each radiomic phenotype, produced by nine different GSEA methods. An association was deemed as statistically significant, if the median of the nine adjusted p-values is no larger than 0.05, which actually required that more than a half of the methods simultaneously identified the association using a cutoff of 0.05 on the adjusted pvalue. The associations between gene sets and radiomic phenotypes were tested separately for two different directions, i.e. positive association and negative association.

Table S4
The median adjusted p-values resulted from the association tests between all radiomic phenotypes and the transcriptional activities of all KEGG pathways. The results are summarized in two different directions, i.e. positive association and negative association.
[ Table S4 is in the excel file] Table S4 provides the median adjusted p-values for all association tests between pathway transcriptional activities and radiomic phenotypes. Totally, 1103 statistically significant associations were identified. Fig. 3 in the main text shows some of the associations that involve cancer-related pathways. Besides the findings elaborated in the main text, some other findings are the following.

P53 Signaling Pathway
The P53 signaling pathway takes a tumor suppressive role by causing cell cycle arrest, cellular senescence, and apoptosis in normal condition. Its transcriptional activity is positively associated with the tumor size and tumor shape irregularity (Fig. 3 in the main text), which indicates the activation of this pathway during tumor development. It also has a positive association with the radiomic phenotype maximum enhancement, indicating that tumors with an active P53 signaling pathway tend to have increased permeability of blood vessels.

MAPK Signaling Pathway
The MAPK signaling pathway is critical for cell proliferation and migration involved in cancer development. We see a statistically significant negative association between the MAPK activity and time to peak at maximum variance (Fig. 3 in the main text), which suggests that tumors with a strong MAPK pathway activity tend to quickly reach their maximum value of enhancement variation, which measures the heterogeneity of blood uptake within a tumor.

DNA Repair Pathways
Two related DNA repair pathways, the mismatch repair pathway and the base excision repair pathway, play key roles in maintaining genomic stability and preventing DNA mutations from becoming permanent in dividing cells. We found that the w y ' v w positively correlated with all tumor size phenotypes (Fig. 3 in main text), indicating their activity during tumor progression trying to prevent DNA mutations. The mismatch repair pathway is also positively correlated with leaky microvessels characterized by maximum enhancement and enhancement at the first post-contrast time point.
Besides gene expressions, we also used the same GSEA scheme to identify the KEGG pathways whose gene copy number variations statistically significantly associated with the radiomic phenotypes. 15000 genes whose DNA copy number varies most (measured by the standard deviation over all patients) were used for analysis. Gene sets with a median adjusted pvalue ≤ 0.05 over all nine methods were deemed as statistically significant. Fig. S2 shows all the statistically significant associations between radiomic phenotypes and copy number variations of KEGG pathways.
We take the median of nine adjusted p-values obtained by nine different GSEA methods and then apply the cutoff of 0.05 on the median adjusted p-values to call significant associations. We do so because of the following two reasons.
(1) The nine methods differ in the choice of gene-level statistic and the ways of calculating geneset-level statistic and evaluating its statistical significance 22 . Unfortunately, there is no consolidated opinion regarding what methods should be preferred, because most methods have their own advantages, disadvantages, and statistical assumptions 22 . So whether a method is suitable/best for statistical inference on a particular dataset is data-dependent. It might not be the best to assume all methods are suitable for a given data set. Making a significance call base on the median of adjusted p-values from the nine methods is basically a majority vote. If more than a half of the methods (i.e. five methods) vote for a significant call, the association is reported.
(2) Requiring all nine methods to agree on the significance call is more stringent and can reduce false discoveries, but may also increase false non-discoveries. To show this, we tried requiring that all nine methods must reach the statistical significance cutoff (adjusted p-value = 0.05) in order to call a significant association. As a result, the number of significant associations involving pathway transcriptional activities decreased from 1103 to 503, and the number of significant associations involving pathway copy number variations decreased from 88 to 26.

Figure S2
A heatmap presentation of all statistically significant associations between radiomic phenotypes and copy number variations of KEGG pathways. In the heatmap, genetic pathways are rows and radiomic phenotypes are columns.

Section 6. Associations between miRNA Expressions and Radiomic Phenotypes
The radiomic data were preprocessed before carrying out the analysis. Values of three radiomic phenotypes, including maximum variance of enhancement, enhancement-variance increasing rate, and enhancement-variance decreasing rate, were log2 transformed to reduce the effect of extreme values. 0 values of enhancement-variance decreasing rate were replaced by the smallest positive value of the phenotype (over all patient samples) to allow the log2 transformation. Gene filtering was performed to keep the miRNAs whose RPM values are no less than 2 in at least a half of the samples, which resulted in 292 miRNAs for the analysis. 0 RPMs in a sample were replaced by the minimum positive RPM among all miRNAs in the sample. A log2 transformation of the data was then performed.
For each miRNA and each radiomic phenotype, a linear regression with adjustments of the patient age and the tumor grade was used to fit the values of the radiomic phenotype and to examine whether the miRNA expression has a statistically significant effect on the phenotype, which was formulated as where was the value of the radiomic phenotype for patient , was the expression level of the miRNA in patient , was the age of patient , 2 and were two 0/1 indicators coding the patient tumor stage (stage I: 2 and ; stage II: 2 and ; stage III: 2 and ). A p-value evaluating the statistical significance of was calculated and adjusted over the tests of all miRNAs with the particular radiomic phenotype. We collected a list of oncogenic and tumor-suppressive miRNAs by literature survey [24][25][26][27] . Fig. 4a in the main text shows the obtained statistically significant (adjusted p-v ≤ 0.05) associations between the cancer-related miRNAs and the radiomic phenotypes. Table S5 summarizes all the statistically significant associations between miRNA expressions and radiomic phenotypes.

Table S5
Analysis results of the statistically significant associations between radiomic phenotypes and miRNA expressions.

Section 7. Associations between Protein Expressions and Radiomic Phenotypes
The radiomic data were preprocessed as described in the previous section. The RPPA data included the expression levels of 142 proteins and phospho-proteins related to cancer. We used the level-3 data from TCGA, which were log2 transformed. A linear regression model was used to study the association between each radiomic phenotype and the expression of each protein, following equation (1), where here denoted the protein's expression level in patient . A pvalue evaluating the statistical significance of was calculated and adjusted among the tests of all proteins with the radiomic phenotype. Fig. 4b in the main text shows all the identified statistically significant (adjusted p-v ≤ 0.05) associations. Some interesting findings have been observed. For example, MEK1 is a protein kinase encoded by gene MAP2K1 that activates the extracellular signal-regulated kinases (ERKs) in the MAPK signaling pathway. The expression of MEK1 is statistically significantly negatively associated with lesion volume.

Section 8. Associations between Somatic Gene Mutations and Radiomic Phenotypes
The radiomic data were preprocessed as described in Section 6. For each mutation and each radiomic phenotype, a linear regression was used to fit the value of the radiomic phenotype and examine whether the mutation has a significant effect on the phenotype, following equation (1), where here was a 0/1 indicator indicating whether patient has the mutation.
Since the number of patients with exactly the same mutation was quite small (see Table  S6 left) and might not provide reliable statistics, we only studied the associations of radiomic phenotypes with the most frequent mutation, which was A/A > A/G in PIK3CA at chromosome 3 and position 178952085 that occurred in 10 patients. P-values were calculated to evaluate the statistical significance of and were adjusted by the BH procedure to control FDR over the 38 tests. No statistically significant association was found. The smallest p-value and adjusted pvalue obtained in the analysis were 0.148 and 0.820, respectively, for the association between uptake rate and the mutation. Alternatively, we examined whether a gene underwent mutation was associated with a significant change in radiomic tumor phenotypes, regardless what mutations occurred to the gene and where they occurred in the gene. In this way, the patient group with somatic mutations (Table S6 right) can be larger than considering the exact same mutation occurring among patients (Table S6 left). The linear regression analysis was applied for every pair of mutated gene and radiomic phenotype according to equation (1), with now being a 0/1 variable indicating whether patient has at least one somatic mutation in the gene. The BH procedure was taken to control the FDR over the association tests between a radiomic phenotype and all mutated genes. Due to the weak statistic power associated with the usually small number of patients with the mutation events, we used loose criteria to call statistically significant associations, which were pv ≤ 0.05 and that the gene mutated in at least five patients. Table S7 summarizes all the identified associations. Some of the associations do not have statistically significant adjusted pvalues (<= 0.05) after the correction for multiple tests, although non-significant adjusted pvalues do not guarantee that the specific associations are untrue.

Table S7
The analysis results of associations between somatically mutated genes and radiomic phenotypes. Only the associations in which a gene mutated in at least five patients and the pvalue ≤ 0.05 are included. We also studied the associations between radiomic phenotypes and somatic gene mutations at the pathway level, by comparing the measurements of radiomic phenotypes for patients with gene mutations in a KEGG pathway versus those without any mutation. A linear regression analysis was applied for each pair of KEGG pathway and radiomic phenotype according to equation (1), where here is a 0/1 variable indicating whether patient has at least one somatic mutation in the genes of the pathway. The BH procedure was taken to control the FDR for the association tests between a radiomic phenotype and all KEGG pathways. Table S8 shows all the associations in which the pathway has gene mutations in at least five patients and the obtained adjusted p-v ≤ 0.05.

Table S8
The analysis results of associations between radiomic phenotypes and somatic gene mutations at the pathway level. Only the associations in which the pathway has mutated genes in at least five patients and the adjusted p-value ≤ 0.05 are included in the  Table S9 summarizes the numbers of statistically significant (adjusted p-v ≤ 0.05) associations between the radiomic phenotypes of Invasive Breast Carcinoma (BRCA) and the transcriptional activities of KEGG pathways dedicated to cancer types other than BRCA. The basal cell carcinoma has the highest number of associations, followed by the small cell lung cancer and glioma.