Fused feature signatures to probe tumour radiogenomics relationships

Radiogenomics relationships (RRs) aims to identify statistically significant correlations between medical image features and molecular characteristics from analysing tissue samples. Previous radiogenomics studies mainly relied on a single category of image feature extraction techniques (ETs); these are (i) handcrafted ETs that encompass visual imaging characteristics, curated from knowledge of human experts and, (ii) deep ETs that quantify abstract-level imaging characteristics from large data. Prior studies therefore failed to leverage the complementary information that are accessible from fusing the ETs. In this study, we propose a fused feature signature (FFSig): a selection of image features from handcrafted and deep ETs (e.g., transfer learning and fine-tuning of deep learning models). We evaluated the FFSig’s ability to better represent RRs compared to individual ET approaches with two public datasets: the first dataset was used to build the FFSig using 89 patients with non-small cell lung cancer (NSCLC) comprising of gene expression data and CT images of the thorax and the upper abdomen for each patient; the second NSCLC dataset comprising of 117 patients with CT images and RNA-Seq data and was used as the validation set. Our results show that our FFSig encoded complementary imaging characteristics of tumours and identified more RRs with a broader range of genes that are related to important biological functions such as tumourigenesis. We suggest that the FFSig has the potential to identify important RRs that may assist cancer diagnosis and treatment in the future.

Lung cancer is one of the leading causes of cancer death among men and women worldwide. Non-small cell lung cancer (NSCLC) accounts for approximately 85% of all cases of lung cancer 1 . NSCLC diagnosed at an early stage has a 5-year survival rate of up to 80% for small and localised tumours (stage IA) 2 . When compared to patients in advanced stage of NSCLC (stage IV), the 5-year survival rate is 2% 2 .
Advances in the understanding of molecular characteristics of NSCLC have provided insights into the biology of NSCLC and assisted in more precise treatment 3,4 . The usual approach for molecular characterisation is with large-scale gene expression profiling, a technique that determines the process by which information from a gene is converted into a functional gene product, such as proteins. Gene expression analysis at different levels of transcription can provide a global picture of different biological functions and can be identified using computational and statistical methods 5 . Gene expression analysis provided insights that facilitated the development of therapies that target specific biological pathways such as epidermal growth factor receptor (EGFR) in NSCLC that have improved clinical outcomes 6,7 . Cetuximab 8 is an example of target therapy medications that downregulates the EGFR. Specific types of EGFR mutations, such as the exon 19 deletions and the L858R point mutation are particularly responsive to gefitinib 9 and erlotinib 10 . These medications are small-molecule tyrosine kinase inhibitors (TKIs) that restrict EGFR from transmitting cellular signals that are related to tumour progression 11 .
Gene expression profiling, however, requires adequate tumour tissue samples that are obtained from core biopsies that sample only a part of the tumour and is invasive and expensive. In contrast, medical imaging is a non-invasive technique that plays a crucial role in routine clinical practice by capturing important imaging visual Scientific Reports | (2022) 12:2173 | https://doi.org/10.1038/s41598-022-06085-y www.nature.com/scientificreports/ Methods NSCLC-Radiomics-Genomics dataset. We used the public NSCLC Radiomics-Genomics dataset 36 from the Harvard University, and we refer to this dataset as the 'NRG-H' . The dataset was sourced from the Cancer Imaging Archive (TCIA) 37 . The NRG-H is a pre-processed and de-identified dataset. The creator of the dataset has indicated that the collection and processing of the dataset were conducted according to national laws and guidelines and approved by the appropriate local trial committee at Maastricht University Medical Center (MUMC1), Maastricht, The Netherlands. The dataset comprises 89 patients (29 W, 60 M; age range 37-85 years) with histologically confirmed NSCLC with T stage (T1-T4) 38 . A detailed dataset description is presented in Supplementary Table S2. All patients had a CT scan of the thorax/upper abdomen. CT scan slice thickness was between 1.5 and 5 mm. Gene expression information was acquired using the Rosetta/Merck human RSTA custom Affymetrix 2.0 microarray (Affymetrix HuRSTA-2a520709). Gene expression values were normalised using the RMA algorithm 5 in Bioconductor. Gene expression information was accessed via the Gene Expression Omnibus (GEO) 39 . The primary tumours were delineated by an experienced medical imaging specialist (M.F., more than 20 years of experience), slice-by-slice, on trans-axial image slices using open source software (Medical imaging Interaction Toolkit (MITK); version 2016.11 40 ). We excluded three patients (all men) because there were lung collapses distal to a proximal tumour and the extent of the tumour could not be reliably identified. Delineations were independently validated by a second clinician (E.K., 7 years of experience). Details of the delineation validation process are described in Supplementary Material Section 1. The annotation differences between the two clinicians are shown in Table S1 in the Supplementary Materials.

NSCLC-RADIOGENOMICS dataset.
The NSCLC-Radiogenomics dataset reported by Bakr et al. 41 from the Stanford University is a pre-processed and de-identified dataset, and we refer to this dataset as 'NRG-S' . The creator of the dataset has indicated that the collection and processing of the dataset were conducted under IRB approval from Stanford University and the Veterans Administration Palo Alto Health Care System. The NRG-S dataset comprises CT images and RNA-Seq data from 117 subjects (29 W, 88 M; age range 46-85 years) with histologically confirmed NSCLC with T stage (Tis, T1-T4). A detailed dataset description is presented in Supplementary Table S3.
All patients had a CT scan from the apex of the lung to the adrenal gland in supine position. CT scan thickness was between 0.625 and 3 mm. Detailed scanning parameters, such as the manufacturer attributes are specified in the DICOM headers. Total RNA was extracted from the tissue and analysed with RNA sequencing technology. Gene expression information was processed using the STAR algorithm 42 and Cufflinks version 2.0.2 43 . Gene expression information was accessed via the Gene Expression Omnibus (GEO) 39 . Primary tumours were segmented using an unpublished automatic segmentation algorithm on the axial image slices for all 117 subjects. Segmentations were viewed by a thoracic radiologist (M.K.) with more than 5 years of experience and edited as necessary using ePAD. An additional thoracic radiologist (A.N.L.) reviewed and approved the final segmentations.
Experimental overview. An overview of the experimental design is outlined in Fig. 1. HC and deep ETs are used to extract HC, TL and FT feature from delineated tumour ROIs from CT image volumes. HC features are extracted from the CT image volume directly. FT features are extracted from a 2.5D representation of the CT data around the tumour centroid 31 . The extracted HC, TL and FT features are fused into a feature matrix using concatenation. The FF Sig is generated by applying a multi-step feature selection procedure involving median absolute deviation (MAD), minimum redundancy maximum relevance (mRMR), and least absolute shrinkage and selection operator (LASSO) generalised linear model. RRs are determined by using Spearman rank correlation between FF Sig and the averaged gene expressions. RRs between image features signatures and GO terms are determined by using GSEA. For evaluation purposes, the same multi-step feature selection procedure is applied to HC, TL and FT features. The resulting collections of image features are denoted as HC Sig , TL Sig and FT Sig , respectively. The training of the deep ETs was performed on the NRG-H dataset; the ETs were then used to extract image features and generate FF Sig . We validated the robustness and generalisability of the FF Sig by applying NRG-H trained deep ETs to the validation NRG-S dataset.

Image features. HC and deep ETs.
We employed a set of standard HC ETs that are implemented in the pyradiomics framework to quantify HC features 15,44 . For each patient, we extracted a well-documented set of 431 HC features from CT volumes 45,46 . These 431 HC features comprised the following: (a) first-order statistics, describing the distribution of voxel intensities; (b) shape and size that are geometric descriptors of tumoural 3D characteristics such as compactness and surface area; (c) textural or co-occurrence matrix features to illustrate the spatial distribution of the voxel intensities and, (d) first order statistics and textural features of the wavelet decompositions of the raw imaging data. The detailed description to the 431 HC features is provided in Supplementary Material Section 4.
Deep ETs used a ResNet-101 backbone that was pre-trained on ImageNet ILSVRC challenge data. ResNet-101 is a well-established CNN architecture, which introduced the concept of 'residual blocks' , a combination of skip connections and identity mapping to learn deeper features, and is robust to accuracy degradation 47 . ResNet-101 is a robust and efficient CNN architecture that have been applied in a range of different medical image analysis tasks such as brain tumour classification 48 and segmentation for kidney and space-occupying lesion area 49 . In comparison with other widely used pre-trained deep models, such as GoogLeNet 50 , ResNet-101 has demonstrated superior performance in natural image detection across different datasets such as PASCAL VOC 2012 and ImageNet detection 28 www.nature.com/scientificreports/ well-annotated images that belongs to 1000 natural object classes) allowed it to extract deep features that represent generic image characteristics applicable to all images such as edge and texture 29 , which has been demonstrated to be useful descriptors for medical images 47 . We have compared the tumour classification performance between ResNet-101 and some of the most commonly used pre-trained deep models such as VGG-19 51 on the testing set of the NRG-H. The detailed protocol for evaluating the tumour classification performance between the deep models is presented in Supplementary Materials Section 2.
To adopt the ResNet-101 model with pre-trained weight and to recognise the features in the NSCLC CT data, we fine-tuned it for the task of identifying CT images that contained tumours. The 86 subjects from the NRG-H dataset were divided into two groups: a training set that comprises imaging data from 69 patients and, a testing set that comprises imaging data from 17 patients. Subjects in the training and testing groups were randomly selected. We implemented a fivefold cross-validation strategy on the training set of 69 patients to fine-tune the ResNet-101 model. The testing set of 17 subjects was 'held out'/'unseen' during the fine-tuning process and serves to assess the robustness and generalisability of those fine-tuned ResNet-101 models.
From the training set, a total of 2420 CT image slices were sampled for the fine-tuning task. The CT images from the training patients were augmented to avoid overfitting during the fine-tuning. Training data was augmented by randomly rotating images between 0 to 360 degrees and translating the rotated images between − 5 to 5 pixels on both the x and y-axis. The last layer of the pre-trained ResNet-101 was replaced by a new fully connected layer to accommodate the classification task. The weight learn rate factor and bias learn rate factor were set to 20 for the new fully connected layer.
The fine-tuning process of the ResNet-101 model involved 300 epochs of training using stochastic gradient descent with a momentum of 0.9 and a batch size of 5. The Learning rate was set at 1 × 10 -3 , with L2Regularization set at 0.001. For every 100 epochs, the learning rate decreased by the factor of 0.1. These hyperparameters were determined and tuned by using the widely adopted random search optimisation method 52 . This is achieved by finding the optimum model which consists of the combination of hyperparameters that give the best overall performance for the classification task. Fine-tuning was implemented using MATLAB 2019b on a machine running Ubuntu 18.04, with an 11 GB NVIDIA RTX 2080 Ti GPU and CUDA 10.1. The fine-tuned model with the best overall performance was selected to serve as deep ET for FT features (Supplementary Table S4). The performance of the selected deep ET was then assessed on the second NRG-S dataset (Supplementary Table S5).
Image feature extraction. HC features were extracted directly from the volumetric CT images using the pyradiomics framework. TL and FT features were extracted from the 'pool5' layer of the ResNet-101 model. We used the axial, sagittal and coronal views of the tumour ROI from the volumetric CT images as the input for deep ETs 31 . All views were centred on the physical centroid of the tumour ROI. Such an aggregated views display a 2.5-dimensional (2.5D) representation of the tumour ROI 53 . The 2.5D representation for deep ETs contains richer spatial information of neighbouring pixels compared with traditional 2D images while demanding less computational power when compared with running ETs directly on 3D image volumes 54 . For each view of the www.nature.com/scientificreports/ 2.5D representation, gray values were normalised from [0, 4096] to [0, 255] using a linear transformation. All three input slices were resized to 224 × 224 to fit the input size of ResNet-101 using nearest-neighbour interpolation and were padded with zeros to preserve the tumour aspect ratio. 6144 FT and 6144 TL were extracted from each CT image.
Image feature fusion and selection. We used a feature fusion strategy that concatenates the HC, TL and FT feature together to generate a feature matrix across the patients 55 . The 431 HC, 6144 TL, and 6144 FT features were then fused into a single 12,719-dimensional feature matrix using concatenation. The resulting high-dimensional feature matrix presented challenges in performing statistically significant analyses as the number of features is much larger than observations 56,57 . In such circumstances, small random fluctuations in individual features may be mistaken for important variance in the data and lead to the selection of features that are suboptimal for representing the observations. In addition, the concatenation may cause redundant features from individual extraction techniques to be contained within the matrix and add complexity during data interpretation. Feature selection is a technique to reduce the dimensionality and identify the subset of optimal and robust features that provide the best predictive power 58 . We hence applied a multi-step image feature selection scheme that aims to: (i) reduce the dimensionality of the concatenated feature matrix; (ii) remove image features that are redundant or irrelevant to the histology classification of tumours; and (iii) identify a set of image features that are most relevant to the histology characteristics of patients. The reduction of the dimensionality removed features that have poor variability and dispersion across patients. These features do not reflect the variances in tumour imaging characteristics and therefore unideal for identifying radiogenomics associations. We used the median absolute deviation (MAD) as an indication for these features as it measures the variability across features and is robust against outliers in the concatenated feature matrix.
The second stage reduced the dimensionality of the remaining features by removing those that are redundant or irrelevant to the histology characteristics of patients. The histology characterisation is a crucial parameter that indicates the subtypes of the disease and may also contain information that reflects distinct patterns of genetic alterations 59 . The removal of biologically irrelevant and redundant features, therefore, prevents the discovery of meaningless radiogenomics associations. The histology characteristics of each patient were categorised into one the following classes: (1) squamous cell carcinoma, (2) adenocarcinoma and (3) other types including Non-Small cell and Not otherwise specified (NOS).
We used mRMR, a widely-adopted approach for feature selection, to produce a subset of features with high biological relevance 60 . The mRMR method selects features that have: (i) the maximal mutual information between the total feature set and the histology characterisation and (ii) the minimal mutual information between the selected features subset and the total feature set. A total of 100 features were selected using the mRMR method, taking into consideration of the number of patients as well as the original dimensionality of the feature matrix 61 .
The last stage of feature selection employed LASSO regularisation for generalised linear models to identify the set of remaining image features that are most relevant to the histology characteristics of patients. LASSO shrinks regression coefficients towards zero-based on regularisation weight λ; features with non-coefficients are those that are related to predicting histological characteristics and hence are selected. We performed 10-fold crossvalidation to identify the value of λ with the minimum cross-validation error. The outcome of this stage was the final FF Sig that was used for identifying radiogenomics associations with gene expressions and GO terms. We also applied the multi-stage image feature selection process to the HC, TL and FT features individually for comparison. The resulting image feature signatures are hereafter denoted as 'HC Sig ' , 'TL Sig ' and 'FT Sig ' , correspondingly.
Associating FF Sig with primary tumour T stages. The tumour, node, metastasis (TNM) staging is the most important clinical parameter to predict survival and establish treatment plans 62 . The T stage describes the size of the primary tumours and their involvement in the adjacent structures. We investigate investigated if the FF Sig is relevant to primary tumour T stages (T1-T4) prior to the radiogenomics analysis. We used unsupervised k-means clustering to the FF Sig to stratify the patients into distinct groups; the patient clusters were defined using 10 repeated new initial cluster centroid positions with a maximum of 1000 iterations. We compared the three patient clusters with the distribution of the T stage. We used the χ 2 test of independence to assess the ability of the FF Sig to encode tumour staging characteristics 63 . For comparative evaluation, the HC Sig , TL Sig and FT Sig were also validated for their relevance to the T stage.

Functional gene analysis. Gene selection.
Probes that map to multiple unique gene symbols were discarded and the repeated total gene expression values of the same gene were averaged. Gene expression data may contain redundant genes that are irrelevant to the disease. We used the following process to remove genes that had low variance, entropy and absolute expression value because such genes showed poor variability and dispersion, and therefore may not reflect the differences in the underlying tumour biology. We firstly removed genes with a variance of less than one-quarter percentile, as such genes may not reflect changes in tumour biological behaviours. The averaged gene expression was filtered to remove the genes with a variance of less than one-quarter percentile across all patients. The remaining genes were then filtered to remove genes that have an absolute expression level in the lowest quarter percentile of the gene expression; genes with low absolute expression were removed because they are prone to errors due to large quantisation or spot hybridisation. Finally, gene expressions were filtered to remove the genes with an entropy value that is less than the quarter percentile; genes with low entropy are considered to be consistently expressed across patients and may not reflect the variance in tumour biological characteristics 64  www.nature.com/scientificreports/ Radiogenomics analysis. We determined RRs between the FF Sig with the averaged gene expressions using the Spearman rank correlation. We also employed functional enrichment analysis to enrich radiogenomics relationships with GO terms. We used 1046 gene sets from the C5 collection of MSigDB 65 , which categorise the following GO terms: molecular function, cellular component and biological process. The gene list was generated by ranking the radiogenomics associations for each of the features from FF Sig in descending order. Gene sets that include between 15 and 500 contributing genes were selected for the enrichment analysis as was the standard protocol in prior work 15 . The determined RRs were then assessed using a pre-ranked functional enrichment analysis. In this process, the radiogenomics relationships between FF Sig and gene expressions were sorted to provide a ranked gene list based on the strength of the Spearman rank correlation. We used the pre-ranked gene list to perform GSEA, which derives the association between the provided ranked gene list and GO terms by testing the enrichment of each annotated term iteratively in a linear model. The enriched radiogenomics relationships with GO terms can be quantified by calculating normalised enrichment scores (NES) based on the number of genes. NES indicates the degree to which a GO term is overrepresented by the radiogenomics relationships. To ensure that only significantly associated genes were used for functional enrichment analysis, RRs with p-value < 0.001 were selected and ranked and serve as input to the functional enrichment analysis with GO terms. The same procedure was applied to the HC Sig , TL Sig and FT Sig for comparative experiments.

Results
Image feature signatures. After performing the multi-stage image feature fusion and selection on the NRG-H dataset, all four feature signatures were generated. FF Sig is comprised of features that were all extracted from sagittal planes of the 2.5D presentation and has the highest number of features at 7. TL Sig is also comprised of features that were all extracted from sagittal planes and has 6 features. FT Sig is comprised of features that were extracted from 1 axial and 2 sagittal planes and has 3 features. HC Sig is comprised of features that were extracted directly from image volumes and have 2 features. In contrast, our validation experiments on the NRG-S dataset show that only FF Sig , FT Sig and HC Sig were generated after performing the multi-stage image feature fusion and selection. Our validation results from the NRG-S dataset show that the FF Sig is comprised of features that were all extracted from 1 axial and 12 sagittal planes of the 2.5D presentation and has the highest number of features at 13. TL Sig was not generated as none of the TL features were selected after the multi-step feature selection scheme. FT Sig is comprised of 2 image features that were extracted from axial planes. HC Sig is comprised of features that were extracted directly from image volumes and have 1 feature only.

Image signatures and T stage.
In our experiment on the NRG-H dataset, the HC, TL and TF features were significantly associated with the T stage parameters (T1-T4) across patient clusters. The χ 2 test statistics for HC, TL and TF features with T stage parameters are p < 2.9 × 10 -4 , p < 5.0 × 10 -3 and p < 4.8 × 10 -2 , respectively. For image signatures, FF Sig was significantly associated with primary tumour T stages (χ 2 test, p < 4.0 × 10 -2 ). None of the HC Sig , TL Sig or FT Sig is found to be significantly associated with primary tumour T stages, their χ 2 test statistics are p > 0.8, p > 6.0 × 10 -2 and p > 0.5, respectively. Figure 2 illustrate the relationships among FF Sig , T stages and patient clusters from the NRG-H dataset. Each row of the heatmap represents one image feature that comprises the FF Sig. Each column of the heatmap represents a single patient. Z-score is calculated for each radiomics feature across patients. Z-score shows the distinct distribution of T stage parameters across patient clusters. The association between FF Sig and the T stage parameters is indicated by the grouped image features among the patient cluster II and III. The distinct pattern is represented using a z-score of image features that were extracted from each patient.
In our validation experiment on the NRG-S dataset, none of the HC, TL and TF features were significantly associated with the T stage parameters (Tis, T1-T4) across patient clusters. The χ 2 test statistics for HC, TL and TF features with T stage parameters are p > 0.7, p > 0.7 and p > 0.8, respectively. For image signatures, none of the FF Sig , HC Sig or FT Sig was found to be significantly associated with primary tumour T stages, their χ 2 test statistics are p > 0.5, p > 0.5 and p > 0.2, respectively.
RRs between image feature signatures and genes. After gene expression filtering, a total of 11,318 gene expression remained from the NRG-H dataset to establish radiogenomics associations. Notably, two of the key biomarkers for NSCLC: KRAS and RRM1, were filtered due to low variance across the patients in the NRG-H dataset. Figure 3a represents the distribution of RRs that were determined between the averaged gene expression values of 11,318 individual genes and FF Sig , HC Sig , TL Sig and FT Sig . FF Sig identified the highest number of RRs at 5039 and correlated with the highest number of genes at 3881. HC Sig identified 1193 RRs with 886 genes. TL Sig identified 3816 RRs with 3297 genes. FT Sig identified 2089 RRs with 2008 genes. Figure 4a details the distribution of unique genes that were associated with FF Sig , HC Sig , TL Sig , and FT Sig . Among the 3881 unique genes that were associated with the FF Sig, 1896 unique genes cannot be associated with any of the HC Sig , TL Sig , and FT Sig . In contrast, a total number of 3269 unique genes were associated with one of the HC Sig , TL Sig , and FT Sig , but were not correlated with the FF Sig .   www.nature.com/scientificreports/ Table 1 compares the strengths of all RRs that were determined using the FF Sig against those determined using HC Sig , TL Sig and FT Sig . Our results show stronger RRs are identified between the FF Sig and genes, when compared with HC Sig and TL Sig , in the inverse direction. The FF Sig, however, did not show stronger inverse RRs when compared with FT Sig . On the other hand, the FF Sig did not show stronger positive RRs when compared with HC Sig , TL Sig nor FT Sig . Figure 5 illustrates the distribution of RRs that were determined between image feature signatures of the FF Sig , HC Sig , TL Sig , FT Sig with the gene expression value from the key genetic biomarkers of EGFR for NSCLC 20 . Our result shows that the FF Sig and FT Sig were inversely correlated with EGFR expression. In contrast, HC Sig is shown to be the only positive RRs with EGFR. Notably, FT Sig shows to derive more and stronger inverse RRs with EGFR when compared with the FF Sig. In addition, our result shows that ERCC1, a key genetic biomarker for NSCLC, is exclusively correlated with a single feature from the FF Sig, where the same feature showed inverse RRs with EGFR previously.
The gene selection process was repeated in our validation experiments on the NRG-S dataset. A total of 22,126 unique genes were identified for each patient from the NRG-S dataset. After gene selection, 2993 gene expression remained from the NRG-S dataset to establish radiogenomics associations. In comparison to NRG-H, three of the key biomarkers for NSCLC: EGFR, KRAS and ERCC1, were filtered due to low variance across the patients in the NRG-S dataset. Figure 3b represents the distribution of RRs that were determined between the averaged gene expression values of 2993 individual genes and FF Sig , HC Sig , and FT Sig . Radiogenomics analysis show that FF Sig identified the highest number of RRs at 2856 and correlated with the highest number of genes at 1756. HC Sig identified 474 RRs with 474 genes. FT Sig identified 642 RRs with 621 genes. In addition, our result shows that RRM1, a key genetic biomarker for NSCLC, is exclusively correlated with a single feature from the FF Sig . Figure 4b details the distribution of unique genes that were associated with FF Sig , HC Sig , TL Sig , and FT Sig . Among the 1756 unique genes that were associated with the FF Sig , 1141 unique genes cannot be associated with any of the HC Sig and FT Sig . In contrast, a total number of 265 unique genes were associated with one of the HC Sig and FT Sig but were not correlated with the FF Sig . Table 2 compares the strengths of all RRs that were determined using the FF Sig against those determined using HC Sig , and FT Sig . Our validation results show that the FF Sig did not identify stronger RRs with genes, when compared with HC Sig and TL Sig , in both statistical directions.      Table 3 shows the GO terms with the highest NES. Notably, FF Sig determined RRs with GO terms that exhibit distinct patterns relating to the biological functions and cellular behaviours: (i) 3 GO terms were related to lumen structures including organelle, nuclear and membrane; (ii) 2 GO terms were reflecting biosynthesis processes that involve glycoprotein or macromolecule; (iii) 3 GO terms were related to the response mechanism to viruses, other organism or biotic stimulus, and other types of stimulus processes. In comparison, our results also show that TL Sig determined RRs with 4 GO terms that are associated with fraction activities. In addition, FT Sig determined RRs with GO terms that are related to enzyme activities. In contrast, HC Sig determined RRs with GO terms are shown to be without overlaps in their biological functionalities. Table 4 shows the comparison between GO terms that have exclusive RRs with FF Sig and those GO terms that are restricted to have RRs with FF Sig . Among the GO terms with the highest NES, our result shows clusters of biological functions and cellular behaviours that have exclusive RRs with the FF Sig : (i) 3 GO terms were related to kinase activities for transmembrane receptor protein and tyrosine kinase; (ii) 2 GO terms were related to metabolism activities; The identical 3 GO terms were most enriched by FF Sig and related to the virus response  www.nature.com/scientificreports/ mechanism. In contrast, our result shows 2 groups of related biological functions among the GO terms that were restricted to FF Sig. Such GO terms are related to fraction processes and enzyme activities. From our validation experiment on the NRG-S dataset, functional gene enrichment analysis reveals that FF Sig determined RRs with the highest number of GO terms at 322. HC Sig determined RRs with 31 GO terms. TL Sig determined RRs with 0 GO terms. Figure 6b details the distribution of GO terms that were associated with image feature signatures of FF Sig, HC Sig , TL Sig , and FT Sig . FT Sig determined RRs with 142 GO terms. Among the 322 GO terms that have RRs with by FF Sig, 233 GO terms were exclusively enriched; these GO terms account for 72.4% of the total enriched GO terms or 22.3% of the total 1046 GO terms. Table 5 shows the GO terms with the highest NES. Notably, FF Sig determined RRs with GO terms that exhibit distinct patterns relating to the cellular structure: (i) 3 GO terms were related to lumen structures including organelle, nuclear and membrane; (ii) 2 GO terms that reflect the cell junction. In comparison, FT Sig determined RRs with GO terms that are related to cellular structures, protein transportation and localisation. HC Sig determined RRs with GO terms that are related to signalling pathways, such as cAMP mediated signalling and second messenger mediated signalling. Table 6 shows the comparison between GO terms that have exclusive RRs with FF Sig and those GO terms that are restricted to have RRs with FF Sig . Among the GO terms with the highest NES, our validation results show a cluster of biological functions and cellular behaviours that have exclusive RRs with the FF Sig : (i) 3 GO terms were related to peptidase activity; (ii) 2 GO terms that reflect the cell junction. In contrast, our result shows 2 groups of related biological functions among the GO terms that were restricted to FF Sig. Such GO terms are related to the intrinsic components of organelle membranes and metabolic processes.

Discussion
Our main findings are that our FF Sig : (i) encoded complementary medical image's visual characteristics when compared with other image feature signatures; (ii) determined a greater number of RRs with a greater number of genes; (iii) determined RRs with distinct GO terms; (iv) determined exclusive RRs with genetic biomarkers of NSCLC and GO terms that are related to NSCLC and (v) is robust and generalisable for determining RRs when validated on NRG-S.  www.nature.com/scientificreports/ From our experiments using the NRG-H dataset, the FF Sig comprises 7 image features that are complementary to image features that were selected in the HC sig , TL sig , and FT sig . Image features that are included in the FF Sig can be traced back to the 6144-dimensional TL features. This finding indicates that the multi-step feature selection scheme prioritised a set of complementary image features that are relevant to the histological characteristics while reducing the overall redundancy in the information captured. This finding suggests that the FF Sig encodes unique medical imaging visual characteristics when compared with other image signatures. The FF Sig was the only feature signature that produced a significant association (p < 0.05) with the T stage. The HC Sig , TL Sig , and FT Sig did not have any association with the T stage, despite the fact that the FF Sig was selected from the HC, FT, and TL features. Our results showed that the semantic information that is encoded in the HC features and the abstract-level information that are encoded in the TL and FT features contributed towards the selection of features in FF Sig . This finding implies that the association between FF Sig and T stage occurred because the FF Sig leveraged complementary information using both HC and deep ETs.
The FF Sig determined a greater number of RRs with a greater number of genes when compared with the other image feature signatures. The FF Sig was also correlated with EGFR. One potential explanation for our finding is that the FF Sig encodes the imaging characteristics of the tumour that can reflect the underlying molecular characteristics of NSCLC 66 . The FF Sig has also determined stronger inverse RRs with a range of genes when compared to HC Sig and TL Sig . There was no stronger positive RRs with genes when compared with the HC Sig , TL Sig and FT Sig . The reason for this is because the FF Sig did not incorporate any image feature that was learned from scratch from the raw data using deep ETs; the FT components were the closest and as stated previously were aligned with the non-medical TL features. We suggest that positive RRs may appear when deep ETs are directly trained from scratch on the NRG-H CT data.
In addition, from our experiments using the NRG-H dataset, the FF Sig determined RRs with a distinctive collection of GO terms with higher NES when compared to the other image feature signatures. A higher NES of GO terms is typically the result of a stronger correlation between the image feature signatures and the affiliated genes that contribute to the GO term and, RRs with a greater number of affiliated genes that contribute to the GO term. Notably, GO terms with the highest NES consist of a range of biological functions that relate to cellular structures. It has been reported that abnormalities in cellular structures are related to the development of NSCLC 67 . FF Sig has shown to determine RRs with more GO terms when compared with HC Sig and FT Sig . A potential explanation for this finding is that the FF Sig determined RRs with a greater number of unique genes. These genes may be affiliated with a greater range of biological functions and therefore provide opportunities for FF Sig to determine RRs with more and unique GO terms. We note that while the TL Sig determined RRs with a higher number of GO terms, these are generally related to normal human anatomical information rather than the subtle disease processes related to the primary tumour. This finding is evidenced by the most enriched GO terms, such as "Regulation of Anatomical Structure Morphogenesis", as shown in Table 3.
From our experiments using the NRG-H dataset, FF Sig determined exclusive RRs with a group of GO terms that consist of a range of biological functions that are related to protein kinase activities, such as "Transmembrane Receptor Protein Kinase Activity". Atypical kinase and its activities have been reported previously as an oncogene in NSCLC 68 , which play a crucial role in cell growth and tumourigenesis that may be observable in medical images 69 . In contrast, GO terms that are restricted to have RRs with FF Sig include, for example, "Soluble Fraction" and "Enzyme Regulator Activity". A potential explanation is that the specific enzyme activities and fractions cannot be depicted by CT images and hence cannot be quantified by the FF Sig .
Our validation experiments on the NRG-S dataset show that the FF Sig comprises 13 image features that are complementary to image features that were selected in the HC sig , and FT sig . Among the 13 image features, 12 can be traced back to the 6144-dimensional FT features and the other feature can be traced back to a HC feature. Our results using NRG-S demonstrated that the FF Sig encoded complementary medical imaging visual characteristics. The validation results are consistent with our previous findings from the NRG-H dataset.
However, none of the FF Sig , HC Sig , nor FT Sig from the NRG-S dataset produced a significant association with the T stage. We attribute our findings to the different scanning parameters used in the NRG-S dataset, for example, slice thickness that ranges from 0.625 to 3 mm. Such factors contribute to subtle imaging differences and have potential impacts on the feature extraction process.
In our validation study, FF Sig has determine a greater number of RRs with a greater number of genes when compared with the other image feature signatures. This result validates that the FF Sig is robust and generalisable in encoding the imaging characteristics of the tumour that can reflect the underlying molecular characteristics of NSCLC. However, in our validation study, using the NRG-S dataset, FF Sig did not identify stronger RRs with a range of genes when compared with HC Sig and FT Sig . One potential explanation is that the FF Sig did not incorporate any image feature that was fine-tuned on the NRG-S dataset. Despite NRG-S dataset has many similarities to the NRG-H dataset, such as the type of disease, the distribution of patients' clinical parameters and their histopathology status are vastly different to the NRG-H dataset. We suggest that stronger RRs may appear when deep ETs are fine-tuned on the NRG-S dataset.
In our validation experiments, the FF Sig has also shown to determine RRs with a distinctive collection of GO terms with higher NES when compared to the other image feature signatures. Notably, our validation results share a high degree of similarity with our previous findings from experiments using the NRG-H dataset. For example, from both experiments, the proposed FF Sig determined RRs with GO terms such as 'Membrane Enclosed Lumen' and 'Organelle Lumen' . Interestingly, such RRs with GO terms that relate to lumen structures are in opposite statistical direction. We attribute this finding to the differences between the NRG-H and NRG-S datasets where their distribution of T stage parameters and histology sub-types, as they played important roles in the multi-stage feature selection scheme. Such findings further demonstrate the robustness and generalisability of our proposed FF Sig to determine RRs with GOs across different datasets.

Scientific Reports
| (2022) 12:2173 | https://doi.org/10.1038/s41598-022-06085-y www.nature.com/scientificreports/ Furthermore, in our validation experiments using the NRG-S dataset, FF Sig determined exclusive RRs with a group of GO terms that consist of a range of biological functions that are related to peptidase activity such as 'Endopeptidase Activity' . Previous study has shown that bombesin-like peptides and other neuropeptides are autocrine growth factors for both small cell lung cancer (SCLC) and NSCLC 70 . Our validation results demonstrate the robustness and generalisability of our proposed FF Sig for determining GO terms that are related to NSCLC.
We recognise that a limitation of our study is the size of the dataset and that lack of knowledge about the patients' mutation status. This limits the ability to optimise deep ETs to quantify image features that are most relevant to the NSCLC. Another limitation of this study is the differences between the train dataset and the independent test dataset. The two datasets use different methods for gene expression profiling, and as such the NRG-H dataset has a greater amount of genetic information compared to the NRG-S dataset. The ideal situation would have been to utilise two datasets that use the same technology for gene expression profiling, but at the time of experimentation and to the best of our knowledge, no such public radiogenomics dataset existed. However, despite these differences we note that the NRG-S dataset shares similarity with the NRG-H dataset, such as the type of disease and histopathology subtypes, and these similarities mean that it is the closest dataset that can be used for independent validation.
The limited availability of the clinical parameters e.g., survival data in the datasets has restricted our study from designing a deep learning-based image feature selection scheme. We note that as more radiogenomics datasets becomes available in the future, a key area for radiogenomics studies is to investigate the feasibility for a data-driven method for image feature selection 71 . Another potential future direction for our study is to investigate deep learning-based gene expression level prediction. Such a deep model can encode imaging characteristics that are reflective towards changes in gene expression levels and therefore may provide more insights into RRs.

Conclusion
We used a selection of image features from handcrafted and deep ETs, which we named FF Sig , to determine RRs. Our results show that the FF Sig encoded complementary medical image visual characteristics when compared with other image feature signatures. The FF Sig determined more RRs with genes and with a group of distinct GO terms. Our results show that FF Sig is correlated with a key biomarker for NSCLC and GO terms that are related to tumour developments in NSCLC. Furthermore, our validation experiments demonstrate that the FF Sig is robust and generalisable in different dataset. The FF Sig has demonstrated its potentials to identify important RRs that may facilitate cancer diagnosis and treatment in the future.