Histopathological imaging features- versus molecular measurements-based cancer prognosis modeling

For lung and many other cancers, prognosis is essentially important, and extensive modeling has been carried out. Cancer is a genetic disease. In the past 2 decades, diverse molecular data (such as gene expressions and DNA mutations) have been analyzed in prognosis modeling. More recently, histopathological imaging data, which is a “byproduct” of biopsy, has been suggested as informative for prognosis. In this article, with the TCGA LUAD and LUSC data, we examine and directly compare modeling lung cancer overall survival using gene expressions versus histopathological imaging features. High-dimensional penalization methods are adopted for estimation and variable selection. Our findings include that gene expressions have slightly better prognostic performance, and that most of the gene expressions are weakly correlated imaging features. This study may provide additional insight into utilizing the two types of important data in cancer prognosis modeling and into lung cancer overall survival.


Scientific RepoRtS
| (2020) 10:15030 | https://doi.org/10.1038/s41598-020-72201-5 www.nature.com/scientificreports/ changes that have "prior information", for example, with evidence of being relevant from previous studies. In this line of work, multiple gene panels have been developed and utilized. For example, Jablons and others aimed at developing a prognostic risk score for patients with completely resected lung adenocarcinomas based on genes previously identified in microarray models of NSCLC prognosis. They suggested narrowing the 61-gene panel down to four genes 7 . A drawback of molecular data is that it is not as easy to collect: many patients are still concerned with providing tissues for molecular profiling, not all hospitals can conduct profiling and process such data routinely, and the cost of high-throughput profiling is still not "friendly". A more recent type of data for cancer modeling comes from histopathological imaging. In cancer clinical practice, biopsy is routinely conducted, which generates histopathological images. Such images have been long used for definitive diagnosis and staging 8 . They contain rich information on tumors' "micro" properties and surrounding microenvironment, which play important roles in cancer development. Traditionally, pathologists would examine specimens on slide glass for hours using microscopes and make judgement on a handful of features such as tumor-infiltrating lymphocytes (TIL) and tumor cell intensity. This process can be highly time-consuming, and have poor inter-laboratory, inter-observer, and intra-observer reproducibility 9 . More recently, the development of digital imaging processing algorithms and software has made it possible to automatedly extract features from histopathological images. Compared to the traditional approach which highly relies on human capability, the new approach is much less labor-intensive and can extract more features that are "hidden" from human eyes and have not been traditionally studied, hence containing possibly different information. With less dependence on human, these high dimensional features can also be more objective and reliable. In a handful of recent studies, histopathological imaging features, especially those extracted using automated imaging processing software, have been used for modeling cancer prognosis (as well as other outcomes and phenotypes) 10,11 . However, such studies are still relatively scarce. With the consideration that tumor properties as reflected in histopathological images can be affected by molecular changes, there have been studies modeling the relationships between imaging features and molecular changes 12,13 . Such studies are biologically well-grounded. In particular, morphological features of tumor cells and microenvironment can be caused and regulated by molecular changes. As a testament, the successful prediction of microsatellite instability from histopathological images of gastrointestinal cancer 14 and colorectal cancer 15 suggests that such a genotype-phenotype correlation is consistent enough to robustly infer genotypes by observing histopathological imaging features. A recent pan-cancer study confirmed this finding by analyzing the histopathological images of more than 5,000 patients across 14 solid tumor types using deep learning. This study demonstrated the feasibility of identifying genetic variants, gene expression signatures, and clinical biomarkers from images 16 . There are also a small number of recent studies showing that collectively analyzing molecular and imaging data can improve prediction. For example, to predict the prognosis of Glioblastoma Multiforme (GBM), Kang et al. integrated histopathological imaging and gene expression data with a deep learning approach. The integrated data achieved a C-index of 0.702 in comparison to 0.640 by using only histopathological imaging data 17 . Similar data integration has also been pursued for breast cancer 18,19 , glioma 20 , lung cancer 21 , and prostate cancer 22 . These studies have suggested the great potential of high dimensional histopathological imaging features for cancer research. Overall, with the cost-effectiveness and routineness of biopsy and histopathological images can potentially play an important role in cancer modeling. As a "side note", we distinguish between histopathological images and radiological images-the latter are generated by CT, PET, and other radiological techniques and inform "macro" properties of tumors such as size, shape, and density.
A common limitation of the existing studies is that information has been scattered. More specifically, studies that analyze both histopathological imaging features and molecular changes using the same data and on the same ground are very limited. With differences in patient characteristics and data generation, processing, and analysis procedures, findings from different studies may not be directly comparable. In the integration studies, there is often a lack of attention to the direct comparison of molecular and imaging data analysis results.
The objective of this study is multi-fold. Specifically, it intends to further demonstrates cancer prognosis modeling using histopathological imaging and molecular data, taking advantage of high-dimensional regularization techniques (which may have a more lucid interpretation than the deep learning and some other techniques). More importantly, it provides a direct and fair comparison of modeling using these two types of highly important and popular data-this differs from most of the published studies. To be comprehensive, we also examine integrating these two types of data for modeling prognosis as well as modeling their relationships, as in some of the aforementioned studies. With the analysis of TCGA LUAD and LUSC data, this study may also provide additional insight into lung cancer prognosis.

Materials
TCGA (The Cancer Genome Atlas) is one of the largest and most comprehensive cancer projects organized by the NCI (National Cancer Institute) and NHGRI (National Human Genome Research Institute). For over thirty different types of cancer, it has published comprehensive phenotypic, demographic, molecular, and imaging data 23 . We choose to analyze TCGA data because of its high quality, comprehensiveness, and public availability. In particular, we analyze data on LUAD (lung adenocarcinoma) and LUSC (lung squamous cell carcinoma), two subtypes of NSCLC. Lung cancer patients in general have poor prognosis, and as such, prognosis modeling can be especially important. For prognosis outcome, we choose overall survival, as in Radzikowska et al. 24 , Collins et al. 25 , and quite a few other studies.
Histopathological imaging data. Whole-slide histopathological images in the svs format are downloaded from the TCGA website (https ://porta l.gdc.cance r.gov). These tissue slides are formalin-fixed and paraffin-embedded, and the cell morphology is well-preserved and suitable for image feature recognition. They are captured at 20× or 40× magnification by the Aperio medical scanner. In recent studies, we 13  www.nature.com/scientificreports/ developed and implemented a pipeline for extracting high dimensional imaging features, which is sketched in Fig. 1. Briefly, it includes the following three main steps. First, whole-slide histopathological images are chopped into small subimages of 500 × 500 pixels, and 20 subimages are randomly selected. Then, imaging features are extracted using CellProfiler 27 , a publicly available software package that has been adopted in quite a few recent studies 11,18,28 . In the next step, for each patient, features are averaged. We refer to Zhong et al. 13 and Luo et al. 26 for more detailed discussions on this imaging processing pipeline as well as alternatives. With this processing pipeline, a total of 299 features can be obtained. We note that this is significantly higher than in studies such as Wang et al. 29 and Romo et al. 30 . As briefly mentioned above, some imaging studies, especially the early ones, utilize low dimensional imaging features. Comparatively, high dimensional features may have less lucid interpretations but can contain information not reflected in the low dimensional features. With their advantages such as cost-effectiveness and reliability, it can be of higher interest to examine their prognosis modeling performance.
With the extracted features, we further conduct quality control. In particular, irrelevant features, such as file size and execution information, are removed. We also remove features with severe missingness (> 25%) and no or little variation. A total of 221 features are included in downstream analysis.

Molecular data.
For molecular data, we analyze gene expressions, which have been considered in many lung cancer prognosis modeling studies 31,32 . Compared to DNA and epigenetic changes, gene expressions are "closer" to phenotypes. With a lack of high-quality protein data, TCGA gene expression data have been extensively analyzed for prognosis, other phenotypes, and biomarkers. In TCGA, gene expressions were measured using the Illumina Hiseq2000 RNA Sequencing Version 2 analysis platform and processed and normalized using the RSEM software. More detailed information is available in the literature 33,34 . It is possible to directly conduct whole transcriptome analysis. However, findings may be unreliable when sample sizes are limited. As such, we take a candidate gene approach. In particular, the 61 gene panel developed in Raz et al. 7 is adopted. Matching this panel with gene names in the TCGA data leads to 50 genes for analysis. We acknowledge that there is still a lack of definitive consensus on lung cancer prognosis genes and that there are other lung cancer prognosis gene panels. This particular panel is selected as it has been recently examined in authoritative studies. The proposed analysis can be directly applied to other prognosis panels.
Available data. Beyond imaging and gene expression data, clinical characteristics have also been established as associated with prognosis and included in our analysis. Following published studies and considering data availability, we include sex, age, cancer stage, and tumor size. More specifically, tumor size is defined as the longest dimension × shortest dimension, and we combine cancer stages into three levels to avoid small counts.

Analysis techniques
Denote T and C as the event and censoring times, respectively. With right censoring, we observe (U = min (T, C), δ = I(T ≤ C)) . Denote X as the p-dimensional vector of histopathological imaging features, Z as the q-dimensional vector of gene expressions, and L as the r-dimensional vector of clinical characteristics. Assume n iid samples.

Associate histopathological imaging features and gene expressions with survival.
Here our goal is to conduct various "standard" survival analysis and associate imaging features and/or gene expressions with overall survival, while properly accounting for the effects of clinical characteristics. We comprehensively consider multiple sets of analysis.
First consider the analysis with X L = X ′ , L ′ ′ as input. Consider the Cox model, under which the hazard function: where subscripts i and j correspond to subjects i and j , and Y j (U i ) is the subject j 's at risk indicator at time U i . To accommodate the high data dimensionality, and to remove noises and identify relevant effects, we consider the Lasso penalized estimate: where τ > 0 is the data-dependent tuning parameter and chosen using cross-validation, and β l is the l th component of β . Here it is noted that penalization is only imposed on the imaging features. As such, the clinical variables are automatically included, given their established importance in lung cancer prognosis. For a specific imaging feature, a nonzero estimate suggests its association with survival. Literature review suggests that penalization is one of the most popular techniques for accommodating high-dimensional input and feature selection, and Lasso is likely the most popular penalization technique. The adopted "Cox model + Lasso estimation" approach has been examined in multiple published studies 35,36 . In our analysis, it is realized using the R package glmnet. We note that analysis can also be conducted using other penalties and regularization techniques other than penalization, and that analysis results depend on the adopted technique.
Next we consider the analysis with Z L = Z ′ , L ′ ′ as input. Analysis can be conducted in the same manner as for imaging features. Denote γ as the vector of unknown regression coefficients in the Cox model and γ as its Lasso penalized estimate. Note that the baseline hazard functions in this and the above analysis may be different. In this analysis, although the genes have been pre-selected, it is still necessary to apply penalization. In particular, the number of variables, relative to the sample size, is still large. As such, certain regularization is needed in estimation. In addition, to be cautious, it may still be sensible to examine whether all genes in the panel are associated with survival for the particular TCGA patient cohort (which may differ from those examined in published studies).
In the next set of analysis, we integrate the imaging features and gene expressions using an additive approach.
In particular, we consider a Cox model with input variable β 1 , . . . ,β p X, γ 1 , . . . ,γ q Z, L ′ ′ . Prior to model fitting, we compute the correlation coefficient between β 1 , . . . ,β p X and γ 1 , . . . ,γ q Z , which can suggest whether the two types of data have overlapping information in modeling survival (after adjusting for the clinical variables). In model fitting, as the dimensionality is low, we do not impose any penalization. This analysis takes an additive modeling strategy, which has been developed in the literature 28 and shown as reasonably effective for data integration. It retains the "structure" of imaging effects and that of gene expressions. It can be more interpretable compared to some existing approaches, for example the "black-box" deep learning. www.nature.com/scientificreports/ For the above three sets of survival analysis, we adopt the following random splitting approach to evaluate prediction performance: (a) randomly split all samples into a training and a testing set with sizes roughly 3:1; (b) conduct survival analysis as described above using the training set; (c) for subjects in the testing set, compute the predicted risk scores. For example, for the analysis with imaging features, the risk scores are β ′ X L . Compute the C-index using the predicted risk scores and testing set (observed time, event indicator). The C-index ranges between 0 and 1, with a larger value indicating better prediction. It is also the time-integrated AUC (Area under the Receiver Operating Characteristic curve). To avoid an extreme split, Steps (a)-(c) are repeated 100 times, and the average C-index is computed to quantify prediction performance. The goal of this analysis is two-fold. The first is to directly compare prognostic performance of the imaging-based model versus that of the gene expression-based. In addition, this analysis also examines whether integrating the two distinct types of measurements using the additive approach can further improve prediction performance.
Associate gene expressions with histopathological imaging features. As briefly discussed in "Introduction", histopathological imaging features can be affected by molecular changes, and such a relationship has been studied in some recent publications 13,21 . We note that this analysis is unsupervised in the sense that it does not involve survival. As such, the most direct goal is not to improve prognosis modeling but rather to understand, in a broad sense, overlapping information contained in the two distinct types of data.
With normalization to zero means, consider the model: where η is the p × q matrix of regression coefficients, and ε is the p-dimensional vector of random errors. Here we model the "downstream" imaging features using the "upstream" gene expressions. Linear regression is adopted with the consideration that more complex modeling may not be reliable with the limited sample size and high dimensionality of both sides of modeling. For estimating η , consider: where subscript i corresponds to subject i , τ > 0 is a data-dependent tuning parameter and chosen using crossvalidation, η j. is the j th row of η , and || · || 2 is the l 2 norm. Here to accommodate the high data dimensionality and select gene expressions that are relevant for imaging features, we apply the group Lasso penalization. Similar to above, to more objectively evaluate the relationship, we consider the following approach: (a) randomly split data into a training and a testing set in the same way as above; (b) conduct the group Lasso estimation using the training set; (c) for the testing set subjects, predict imaging feature values using gene expressions and the training set estimate. For each imaging feature, compute the correlation coefficient between the predicted and estimated values; (d) to avoid an extreme split, repeat Steps (a)-(c) 100 times, and compute the average correlation values. We note that penalization may introduce shrinkage towards zero. As such, we adopt correlation coefficient as the criterion, which is less affected by shrinkage.  Tables 2 (LUAD) and 3 (LUSC), respectively. It is noted that Level C is chosen as the reference level for stage, thus having an "NA" estimate. Beyond the clinical characteristics, 7 and 9 imaging features are identified, representing AreaShape, Texture, Granularity, and other characteristics. It has been noted in the literature that, unlike omics and some other types of data, high-dimensional imaging features extracted using automated algorithms/software do not have lucid functional interpretations. As such, we do not further pursue bioinformatics interpretations.

Results
In the next set of analysis, we regress survival on gene expressions. The identified gene expressions and clinical characteristics as well as their estimated coefficients are shown in Tables 4 (LUAD) and 5 (LUSC), respectively. www.nature.com/scientificreports/ Among the identified genes, there are "familiar" discoveries such as PIK3CG 37 and RND3 38 . In addition, there are also genes that have not yet been well examined in the literature, such as DNMT2 and UQCRC2. When integrating the combined imaging effect with the combined gene expression effect in one Cox model, for the LUAD data, we obtain regression coefficients 0.9842 (imaging feature, p value = 2.12e−6) and 0.4726 (gene expression, p value = 5.36e−9). For the LUSC data, we obtain regression coefficient 0.9709 (imaging feature, p value = 5.55e−9) and 0.8769 (gene expression, p value = 2.04e−3).
In the random-splitting based prediction evaluation, for the LUAD data, the median prediction C-index values are 0.6202 (imaging features), 0.6864 (gene expressions), and 0.6823 (combined). For the LUSC data, the median prediction C-index values are 0.5466 (imaging features), 0.5606 (gene expressions), and 0.5511 (combined). More detailed information, for example on the prediction C-index of each split, is available from the authors.

Remarks
In the separate survival analysis with imaging features and gene expressions, relevant effects have been identified. For imaging features, extensive additional research will be needed to annotate and fully comprehend the identified variables. We note that this issue has been noted in the literature 8Remarks . In the analysis of gene expression data, the "familiarity" of findings may provide support to the validity of analysis to a certain extent. However, it is noted that more definitive validation will be needed to confirm the findings. The survival analysis with both imaging and gene expression signatures as covariates seems to suggest that the two types of measurements have independent effects. In the random splitting-based evaluation, it is observed that for LUAD, gene expression has moderate predictive performance, and imaging data has moderate/weak predictive performance. For LUSC, both types of measurements have weak predictive performance. For both datasets, gene expression has better performance, which is sensible considering the genetic nature of lung cancer (and other cancers too).  www.nature.com/scientificreports/ Although both LUAD and LUSC are lung cancer subtypes, we observe significantly different results, which can be attributable to the complexity of cancer and suggest that there may not be a definitive conclusion applicable to all cancers. The random splitting evaluation further suggests that integrating the two types of signatures in an additive manner may not further improve prediction, which seems to "contradict" the analysis above. There can be multiple interpretations for this finding. First, the distinction between estimation and prediction should be made -a "good" estimation result may not directly translate into a good prediction. Second, the estimation analysis is repeatedly based on the same data, and there is a risk of over fitting. Third, in the random splitting evaluation, both the training and evaluation are based on fewer observations. An improvement that can be potentially observed with a larger dataset may not be observable with a smaller dataset. It is also noted that penalization and some other sparse approaches have been designed for estimation and may not be ideal for prediction, which may explain the less satisfactory prediction performance observed here.

Association of gene expressions and histopathological imaging features.
We first regress imaging features on gene expressions. Detailed information on the identified gene expressions and their estimated coefficients are provided in the Supplementary Materials 1 and 2. In Fig. 2 Fig. 3, where we sort performance, from the worst to the best, across imaging features. More detailed numerical results are provided in the Supplementary Materials.

Remarks
The regression analysis suggests that certain gene expressions are connected to imaging features. This observation is sensible considering, as described in "Introduction", that properties reflected in imaging features are regulated by molecular changes to a certain extent. On the other hand, the prediction results, as shown in Fig. 3, suggest that such associations are mostly weak to moderate. The majority of information in imaging  www.nature.com/scientificreports/ features cannot be readily explained by gene expressions, and this finding differs from that in some published studies [39][40][41] . It is unclear whether such a difference is attributable to the complexity of cancer, difference in analysis approach, or other factors. More exploration, especially a direct comparison, will be needed.
Additional analysis. To complement the above analysis, we conduct additional exploration and present the findings in the Supplementary Materials. In particular, (1) in some cancer studies with high dimensional variables, marginal screening is conducted prior to modeling to reduce dimensionality to a more manageable level.
In the above analysis, as the dimensionalities are not as high, screening is not conducted. Results presented in the Supplementary Materials suggest that, for our particular data and analysis, screening can change estimation and identification results, but has no substantial impact on prediction performance. (2) The above penalized estimations involve a tuning parameter, which is selected using cross validation. In the literature, there are many tuning parameter selection methods, and cross validation has been among the most extensively used. In the Supplementary Materials, we show that varying the tuning parameter values near the cross-validation-selected optimal has some moderate impact on estimation. But the findings on prediction are not strongly impacted.

conclusions
Accurately modeling prognosis and other cancer outcomes has been and will remain an important problem for a long time to come. Molecular and histopathological imaging data have played important roles in cancer prognosis modeling. In particular, with unique advantages including broad availability and high cost-effectiveness, it will be of interest to develop more histopathological imaging-based prognosis modeling. In this study, we have analyzed and integrated molecular and imaging data on the same ground using regularization techniques. More analysis of this kind will be needed to better understand the relative roles that molecular and imaging data play for other cancer types. Some of our findings are "negative": for example, we have found that integrating data using the additive approach cannot improve prediction. More sophisticated methodological development will be needed to conclude whether this lack of improvement should be attributable to data/cancer type or analysis approach. The revealed interconnections between imaging and molecular features warrants additional investigation.