Non-invasive decision support for NSCLC treatment using PET/CT radiomics

Two major treatment strategies employed in non-small cell lung cancer (NSCLC) are tyrosine kinase inhibitors (TKIs) and immune checkpoint inhibitors (ICIs). The choice of strategy is based on heterogeneous biomarkers that can change dynamically during therapy. Thus, there is a compelling need to identify comprehensive biomarkers that can be used longitudinally to help guide therapy choice. Herein, we report an 18F-FDG-PET/CT-based deep learning model, which demonstrates high accuracy in EGFR mutation status prediction across patient cohorts from different institutions. A deep learning score (EGFR-DLS) was significantly and positively associated with longer progression-free survival (PFS) in patients treated with EGFR-TKIs, while the EGFR-DLS was significantly and negatively associated with durable clinical benefit, reduced hyperprogression, and longer PFS among patients treated with ICIs. Thus, the EGFR-DLS provides a non-invasive method for precise quantification of EGFR mutation status in NSCLC patients, which is promising for identifying NSCLC patients sensitive to EGFR-TKI or ICI treatment.

It is not clear that the EGFR-DLS model is really guiding treatment; in essence, what the authors have done is develop a radiomic biomarker that is predictive of EGFR somatic mutation status and shown that it is accurate in two cohorts. The subsequent correlation with treatment response is not novel, as it is well known that EGFR-positive cohorts will benefit from TKI therapy whereas EGFR-negative/squamous cohorts will benefit from ICI treatment. It is not clear how the EGFR-DLS model is actually changing, adding to, or improving this paradigm. I strongly suggest adapting the title to include "using PET/CT imaging", to make it immediately clear what type of data is being used.

In the results section "Distribution and characteristics of EGFR-DLS" it would benefit the reader to concisely provide at least some background on what is meant by pattern I and II. Without reading the methods (which is what most readers will do), it is not possible to understand what these results mean.

What data/cohort was used for the univariate and multivariate modeling reported in Table S2? This is important, as the EGFR-DLS signature was likely developed on a subset of the data set; therefore, testing multivariate models that combine clinical variables with the EGFR-DLS should ideally be done on a different data set than the one used to develop the EGFR-DLS. The same holds for Table S5.

What is the authors' interpretation of the relationship between EGFR-DLS and PD-L1 status? In essence, the imaging predictor EGFR-DLS is a model for EGFR mutation status, so their model is in essence suggesting mutual exclusivity between EGFR mutation status and PD-L1 expression, rather than that the EGFR-DLS is predictive of treatment response to ICIs. Figure 4f needs clearer annotation of the PD-L1 status in the figure and in the caption. Table S8 shows that within the PD-L1-negative group, PFS is much lower in the EGFR-DLS-high group vs. the EGFR-DLS-low group.
However, based on the way that the EGFR-DLS model was developed, this only suggests that these patients are not receiving the right treatment, as in essence these are predicted to be EGFR mutants and thus should get TKI treatment. Or are the authors suggesting something different here? To be able to make the point that the EGFR-DLS and the PDL1-DLS (in essence the imaging versions of these molecular biomarkers) are useful in treatment guidance, the performance has to be compared in terms of treatment response with the actual molecular biomarkers. It is not clear if this is anywhere in the manuscript?

A case can be made for non-invasive biomarkers of molecular features; however, recent reports show that in the case of driver mutations in EGFR, MET, BRAF, and TP53, all these mutations are clonal. Therefore, the heterogeneity argument cannot be made here. See Jamal-Hanjani et al., NEJM 2017.

It is not clear what methods were evaluated in terms of the "accuracy of the segmentations", as virtually no results are reported on this topic. E.g., this results paragraph: "For the complete overread study (n=73 cases), high ICCs of 0.91 (95% CI: 0.87-0.94, p<.001) were obtained in three ROIs per case, and there were no significant differences in AUCs of the EGFR-DLSs obtained with different radiologists' delineations (Supplemental Figure S1)." appears out of nowhere; the methods section does not detail the segmentation process, nor does it suggest any novel way of doing segmentation that would make the model more robust to variances in segmentation.
1. The title of the manuscript suggests the development of a treatment decision support system based on retrospective data (an observational rather than interventional approach), which raises questions, since hidden colliders might kick in and hamper the clinical value of the proposed DSS. In particular, recent publications (1,2) indicate the necessity of incorporating causal inference in order to estimate the effect of a treatment selection on clinical outcomes when using observational data. Potential hidden colliders like patient fitness, age, ethnicity, and others need to be considered in order to reduce the selection bias inherent to the observational data used in the current study. I would strongly suggest that a relevant discussion be added to the paper, providing arguments for why such an approach was not followed.

2. The proposed deep neural network comprises almost 1.4 million trainable parameters, which is far more than the available samples for training (9911 training patches). The latter raises concerns about overfitting, and this again needs to be discussed in a way that provides arguments on how the authors prevented overfitting with such a heavyweight architecture. Also, it needs to be very clear that hyper-images of the same patient did not participate in the various data sets (training, validation, test) simultaneously.

3. The proposed architecture is 2D, which means that the probabilities for predicting EGFR status are on a slice basis rather than a patient or tumor basis. How did the authors aggregate these probabilities to transform them into a per-tumor prediction? Please clarify. The selected architecture also needs some justification: why did the authors select this specific one? Why did they not consider more suitable approaches, including pre-trained networks with transfer learning, given the typical n<<p situation they had?

Reply:
Thanks very much for the thoughtful critique.
At present, EGFR status detected in tissues collected through invasive biopsy or by ctDNA analysis of plasma is a clinically actionable biomarker for treatment planning according to the NCCN Guidelines for treatment of Non-Small Cell Lung Cancer. However, EGFR mutation determined by tissue-based assays has many limitations including, inter alia: a requirement for invasive biopsies with associated morbidities; the assays are not rapid; and they may fail to yield actionable results due to insufficient quantity or quality of tissue. EGFR mutation determined by plasma-based assays has limitations including the requirement for active shedding of tumor DNA into the bloodstream, which does not occur in all patients, thus creating a subset of false-negative results. Additionally, EGFR mutational status may change during the course of therapy and progression, which may necessitate repeated sampling. The results from our study using the EGFR-DLS provide a complementary, and possibly alternative, non-invasive guide for treatment planning, which could reduce the risk of repeated invasive biopsies.
Having said that, we agree with the reviewer that it is well known that EGFR-positive cohorts will benefit from TKI therapy and that EGFR-negative cohorts will benefit more from ICI treatment. However, while a decision could be made on EGFR status alone, we believe that knowledge of both EGFR status and PD-L1 status would make for a stronger decision support paradigm. While PD-L1 status and EGFR mutation status could be determined by IHC and advanced DNA sequencing of the same biopsy, there are the above-described limitations in tissue testing, and these could be complemented by EGFR and PD-L1 status determined from the same information-rich PET/CT scan. Non-invasive image biomarkers that distinguish between two treatments at the same time have not been well investigated.
Therefore, due to the availability of routine PET/CT images, its non-invasive, easy-to-use characteristics, and rapid turnaround time to results, the deep learning based model provides a complementary confirmatory data point when high-quality IHC is available, and an alternative when the EGFR or PD-L1 status cannot be determined due to unavailable or insufficient tissue samples.
Thank you for your comments and we agree.
(a) Initially, Table S2 showed only the results of the training cohort. In order to investigate the added value of EGFR-DLS over standard clinical variables, a clinical model comprised of standard clinical variables and a combined model of EGFR-DLS plus clinical variables were trained using multivariable logistic analysis and subsequently tested on the independent validation and test cohorts. Additionally, their predictive performance was compared with that of the EGFR-DLS alone in the different cohorts based on AUC, accuracy, specificity, and sensitivity. To make this Table clearer, and to further validate the value of the EGFR-DLS in different cohorts, we have now added the results of the multivariable logistic analysis of the validation and test cohorts to Table S2, as follows. All three multivariable analyses (training, validation, and test cohorts) identified the EGFR-DLS as being independently significant. Notably, in the external HMU test cohort, only the EGFR-DLS was identified as being significant. Therefore, these results further indicate the important predictive value of the EGFR-DLS compared to the clinical variables. This has now been added to the third paragraph of RESULTS:

"Performance of EGFR-DLS in predicting EGFR mutation status"
… When investigating the added value of EGFR-DLS in addition to standard clinical variables (age, sex, stage, histology, smoking status, and SUVmax), a clinical signature (CS model) created by combining sex, histology, and smoking status (all other variables were uninformative), and a combined signature incorporating EGFR-DLS, histology, and smoking status (CMS model) were built with multivariable logistic regression analyses in the training cohort. Their quantitative performance is shown in Figure 2 and Supplemental Table S2. Additionally, through multivariable logistic regression analysis, EGFR-DLS was the only significant independent variable identified for EGFR prediction in the validation and test cohorts (Supplemental Table S2)." Note: for sex, male was set as the referent group; for histology, adenocarcinoma was set as the referent group. (b) Table S5 now shows the results of the Cox analysis based on the independent external TKI-treated and ICI-treated test cohorts, and we have clarified this in the revised version as follows. Note: for sex, male was set as the referent group; for histology, adenocarcinoma was set as the referent group.
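For illustration, the multivariable logistic modeling step described above can be sketched in pure Python with plain gradient descent. This is a minimal, hypothetical sketch on toy data (the feature names and values are invented, not the study's covariates or its actual statistical software):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a logistic regression model by plain gradient descent.

    X: list of feature vectors, y: list of 0/1 labels.
    Returns (weights, bias).
    """
    n_feat, n = len(X[0]), len(X)
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of binary cross-entropy w.r.t. the logit
            for j in range(n_feat):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)

# Toy "combined model": one DLS-like score plus one binary clinical covariate.
random.seed(0)
X = [[random.random(), random.randint(0, 1)] for _ in range(200)]
y = [1 if x[0] > 0.5 else 0 for x in X]  # label driven by the score alone
w, b = fit_logistic(X, y)
acc = sum((predict(w, b, xi) > 0.5) == bool(yi) for xi, yi in zip(X, y)) / len(X)
```

In practice one would of course use an established statistics package and report odds ratios with confidence intervals rather than raw weights; the sketch only shows the mechanics of combining a score with clinical covariates in one model.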

Reply:
Thank you for your thoughtful comments. According to Gainor et al.11, EGFR-mutant NSCLC patients had significantly lower rates of PD-L1 expression and CD8+ TILs compared to EGFR wild-type patients. Notably, CD8+ TIL cells are generally thought to be the dominant effector population following ICI treatment, and a lack of effector cells may limit an antitumor immune response, regardless of PD-L1 status. Therefore, the low rates of PD-L1 expression and CD8+ TILs may contribute to the low response rates to ICI treatment among EGFR-mutant patients. Given that the EGFR-DLS has a high predictive value for EGFR mutational status, the EGFR-DLS alone could be predictive of treatment response to ICIs, as the reviewer states. However, this is a passive diagnosis, and we have preferred instead to generate a signature for PD-L1 status. Indeed, we did observe a weak but significant inverse correlation between the two signatures. Consistent with this, as we now show in Fig. 4h, using both signatures we were able to identify a sizeable cohort who had low EGFR-DLS and low PDL1_DLS, suggesting that they may not respond to either TKI or ICI. We have revised the discussion accordingly.

Reply:
Thanks for pointing this out. We apologize for the lack of clarity; the caption has been revised to include additional details:

Table S8 shows that within the PD-L1-negative group, PFS is much lower in the EGFR-DLS-high group vs. the EGFR-DLS-low group. However, based on the way that the EGFR-DLS model was developed, this only suggests that these patients are not receiving the right treatment, as in essence these are predicted to be EGFR mutants and thus should get TKI treatment. Or are the authors suggesting something different here?
Reply: Thank you for your comments. This is exactly what we would like to say, but we cannot because this was a retrospective study, and thus we could not test this hypothesis. In this study, the ICI-treated patients were curated from different clinical trials. The inclusion criteria for some of these trials did not account for EGFR mutation status.
The design of our study includes three parts: the predictive value of EGFR-DLS in TKI treatment; the predictive value of EGFR-DLS in ICI treatment; and the potential value of EGFR-DLS in guiding treatment together with the PDL1_DLS. Table S8 demonstrates that EGFR-DLS was a significant negative predictive factor for ICI treatment, which corresponds to the second part of the study design. The final alternative non-invasive guide (i.e., the third part) was derived from further analysis of the TKI-treated and ICI-treated patients. Therefore, we do not think this is conflicting or inconsistent.

To be able to make the point that the EGFR-DLS and the PDL1_DLS (in essence the imaging versions of these molecular biomarkers), are useful in treatment guidance, the performance has to be compared in terms of treatment response with the actual molecular biomarkers. It is not clear if this is anywhere in the manuscript?
Reply: At your suggestion, we have performed this analysis. We have added the comparison of progression-free survival between the molecular/tissue-based biomarkers (EGFR and PD-L1) and the image-based biomarkers (EGFR-DLS and PDL1_DLS) in the revised manuscript. The revised Supplemental Figure S3 (below) shows that the PFS probabilities were indistinguishable between the DLS and the molecular scores. The advance herein is that these non-invasive radiomic markers are readily obtainable and can be used to predict EGFR and PD-L1 status. This has also been clarified in the main text as follows:

"Performance of EGFR-DLS to predict EGFR-TKI-treatment response
Additionally, patients in the lower EGFR-DLS group and the higher EGFR-DLS group showed similar PFS compared to the biopsy-detected EGFR wild-type patients (p=0.31) and EGFR-mutant patients (p=0.91), respectively (Supplemental Figure S3a).

Potential value in guiding treatment
In addition to the current EGFR-DLS, we have also developed an 18F-FDG PET/CT-based deep learning score predictor of PD-L1 status (PDL1_DLS), which showed similar prognostic value compared to the IHC-detected PD-L1 status on which it was tested, as shown in Supplemental Figure S3b, and applied it herein."

Figure S3. Comparison of progression-free survival between the tissue-based molecular biomarkers (EGFR and PD-L1) and the image-based biomarkers (EGFR-DLS and PDL1_DLS).
(a) Progression-free survival of TKI-treated patients relative to EGFR mutation status and EGFR-DLS (cutoff: 0.5). (b) Progression-free survival of ICI-treated patients relative to EGFR-DLS (cutoff: 0.5) and PD-L1 status. Note: P values are from log-rank tests.
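The PFS comparisons in Figure S3 rest on Kaplan-Meier estimation with log-rank testing. As a minimal pure-Python illustration of the Kaplan-Meier step (with entirely hypothetical follow-up data, not the study's survival code):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: follow-up times; events: 1 = progression observed, 0 = censored.
    Returns a list of (time, survival_probability) at each event time.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = n_at_t = 0
        # group ties occurring at the same time point
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            n_at_t += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= n_at_t
    return curve

# Hypothetical PFS data in months; 0 marks a censored patient.
curve = kaplan_meier([2, 4, 4, 6, 9], [1, 1, 0, 1, 0])
```

Comparing two such curves (e.g., EGFR-DLS high vs. low) is then done with a log-rank test, as noted in the figure caption.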

Reply:
Indeed, we agree that the results coming out of the TRACERx studies are eye-popping and informative. They do seem to indicate that "driver" mutations are generally clonal (i.e., present in all cancer cells). However, in Figure 4 of Jamal-Hanjani et al., it can be seen that EGFR mutations were subclonal in 2/13 samples, which could lead to false negatives. Nonetheless, because of the reviewers' concern, which we are sure others share, we have removed the reference to heterogeneity and revised the discussion accordingly.

It is not clear what methods were evaluated in terms of the "accuracy of the segmentations", as virtually no results are reported on this topic. E.g., this results paragraph: "For the complete overread study (n=73 cases), high ICCs of 0.91 (95% CI: 0.87-0.94, p<.001) were obtained in three ROIs per case, and there were no significant differences in AUCs of the EGFR-DLSs obtained with different radiologists' delineations (Supplemental Figure S1)." appears out of nowhere; the methods section does not detail the segmentation process, nor does it suggest any novel way of doing segmentation that would make the model more robust to variances in segmentation.

Reply:
Thank you so much for your comments; we had overlooked this description in the Methods.
In this work, rough ROIs were selected by radiologists using ITK-SNAP software (the analytic pipeline is shown in Supplemental Figure S9), and accurate segmentations were not needed, which is one advantage of deep learning over conventional radiomics. Therefore, we did not evaluate the accuracy of the segmentations. However, because there could be minor differences among radiologists in selecting the rough ROIs, we had all three radiologists delineate the ROIs of 73 patients within the validation cohort to examine the effect on the reproducibility of the DLS. The inter-rater agreement of the DLS estimates was calculated with the intraclass correlation coefficient (ICC) among the three radiologists. We have now clarified this in both Results and Methods as follows.

Distribution and characteristics of EGFR-DLS
In this work, accurate segmentations were not needed, yet radiologists had to delineate a rough ROI containing the tumor and some surrounding tissue. In order to investigate the effect of minor differences between radiologists in selecting the rough ROIs, the ROIs of a subset of the validation patients (n=73 cases) were generated by all three radiologists, and three EGFR-DLSs were obtained accordingly. The intraclass correlation coefficient (ICC) of these three EGFR-DLSs was 0.91 (95% CI: 0.87-0.94, p<.001), and there were no significant differences in the AUCs of these three EGFR-DLSs (Supplemental Figure S2), both of which validate the reproducibility of the DLS with input images selected by different radiologists."

"Patients and Methods
ROIs of 73 patients within the validation cohort were delineated by all three radiologists to validate the reproducibility of the EGFR-DLS.

The inter-rater agreement of the EGFR-DLS estimates was calculated with the intraclass correlation coefficient (ICC) among the EGFR-DLSs obtained from the different delineations of the three radiologists."
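The inter-rater agreement computation described above can be sketched in pure Python. This is a simplified two-way, absolute-agreement ICC(2,1) on hypothetical ratings, offered only as an illustration of the statistic, not the authors' actual analysis code:

```python
def icc2_1(ratings):
    """Two-way random-effects, absolute-agreement ICC(2,1).

    ratings: one row per subject; each row holds one score per rater
    (here, one EGFR-DLS per radiologist).
    """
    n = len(ratings)      # subjects
    k = len(ratings[0])   # raters
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Perfect agreement among three hypothetical raters yields ICC = 1.
perfect = [[0.1, 0.1, 0.1], [0.5, 0.5, 0.5], [0.9, 0.9, 0.9]]
```

With slightly discrepant ratings the ICC drops below 1, which is the behavior the 0.91 value above summarizes across the 73 overread cases.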
The details of the segmentation process are provided in Supplemental S2 and Supplemental Figure S9 as follows.

"Preparation of the input images
Using ITK-SNAP software, the PET and CT images were first registered, and then a square or irregular box close to the boundary of the tumor was delineated by an experienced nuclear medicine radiologist. After resampling, dilation, and resizing using cubic spline interpolation, the PET region of interest (ROI) and CT ROI were obtained, keeping the entire tumor and its peripheral region at the same size (64×64) (the pipeline is shown in Supplemental Figure S9)."

The authors emphasize the "dynamic nature of EGFR" however it is not clear what is meant with that?
The authors need to distinguish between EGFR somatic mutation status vs. EGFR expression, as these are two very different things.

Reply:
Thank you so much for your suggestion.
EGFR mutational status might change during the course of therapy and subsequent clonal selection for resistant variants. While in the case of EGFR mutations resistance is commonly associated with point mutations, this is not always the case; for example, amplification of the MET pathway (e.g., Ricordel et al., Ann Oncol, 2018) can abrogate the need for EGFR. In order to make this clearer, we have revised the text to "dynamic change in proportion of cells expressing EGFR mutation" in the revised version.
Additionally, we have changed "EGFR expression" to "EGFR mutation status" in describing the EGFR-DLS and the 18 F-MPG.

Reply:
We apologize for this error; we have revised this to "small-residual-convolutional-network" in the current version.

Reply:
Thank you for your suggestion, and we have added the details in the supplementary S2 as follows.

Preparation of the input images
Using ITK-SNAP software, the PET and CT images were first registered, and then a square or irregular box close to the boundary of the tumor was delineated by an experienced nuclear medicine radiologist. After resampling, dilation, and resizing using cubic spline interpolation, the PET region of interest (ROI) and CT ROI were obtained, keeping the entire tumor and its peripheral region at the same size (64×64). Subsequently, the fusion images were calculated through the α-fusion equation, in which the z-score-normalized PET and CT pixel-wise image data are blended. The fusion ROI was further standardized by z-score normalization and combined with the normalized PET ROI and normalized CT ROI to construct a 3-channel hyper-image. This hyper-image was used as the input to the SResCNN model (the pipeline is shown in Supplemental Figure S9). Z-score normalization means the ROI image was subtracted by the mean intensity value and divided by the standard deviation of the image intensity before being input to the deep learning model, to reduce the effect of different equipment and different reconstruction parameters."

Reply:
Thank you for your suggestion. We have changed all "EGFR expression" to "EGFR mutation status" in describing the EGFR-DLS. Additionally, this has been further clarified in the development of the DLS as follows.
"The EGFR mutation status (positive vs. negative) was encoded to one-hot and used as the label. The output of the network, i.e., the deep learning score (EGFR-DLS), was used as the classification result to represent the EGFR mutation positivity probability."

Reply:
We apologize for this error. It should be the ROIs of the constructed hyper-images, and the size of the training data should be 13,583. We have revised this in the current version as follows: "To reduce overfitting, augmentation including width/height shift, horizontal/vertical flip, rotation, and zoom for the 13,583 training ROI-based hyper-images was used." We have also defined this as a 2D model in the Methods and Supplemental S2 as follows (Supplemental Figure S8).

Structure and training of the SResCNN network
The 2D SResCNN is based on several residual blocks with the 3-channel input images, and is similar to the well-known ResNet-18 network but with fewer filters."

Reply:
We appreciate your concerns, and addressing them will improve the portability and quality of this work. To make the manuscript more concise, we had omitted some pertinent details of the deep learning model. ResNet is a popular and powerful architecture for image classification. Given the limited dataset, we constructed the network with fewer layers, similar to ResNet-18 with fewer filters, and trained it on our own workstation. Additionally, given that a single resolution may not be optimal and depends on the scale of the objects within the image, multi-resolution CNN models have been proposed and shown to have significantly better performance [s1, s2]. Therefore, we incorporated the concept of multi-resolution into this small ResNet structure. By trying different numbers of filters for each layer, the current structure gave an acceptable result with the fewest filters on the validation cohort. In response, we have now added more details about the network and the training process in the Patients and Methods and Supplemental Methods S2 as follows.

Development of the deep learning model
The EGFR mutation status prediction 2D small-residual-convolutional-network (SResCNN) model is presented in Supplemental Figure S8. The regions of interest (ROIs) of the PET and CT images were first selected by experienced nuclear medicine radiologists (L.J, JY.Z, and Y.S) after registration using ITK30, on the condition that the entire tumor and at least 10 mm of its peripheral region were included, and were then resized to 64×64 pixels by spline interpolation and combined with their fusion image (alpha-blending fusion31, α=1) to construct a three-channel hyper-image (the pipeline is shown in Supplemental Figure S9). To reduce the effect of the difference between the central slice and peripheral slices, only ROIs that contained measurable tumor tissue were regarded as valid ROIs, and these were fed into the SResCNN model to update its parameters with backward propagation. The EGFR mutation status (positive/negative) was one-hot encoded and used as the label. The output of the network, i.e., the deep learning score (EGFR-DLS), was used as the classification result to represent the EGFR mutation positivity probability. The EGFR mutation positivity probability at the patient level was obtained by averaging the EGFR-DLSs of the slices that included tumor tissue. To reduce overfitting, augmentation including width/height shift, horizontal/vertical flip, rotation, and zoom for the 13,583 training hyper-images was used, and the model with the best performance on the validation dataset was selected. Details are shown in Supplemental S2.

Preparation of the input images
Using ITK-SNAP software, the PET and CT images were first registered, and then a square or irregular box close to the boundary of the tumor was delineated by an experienced nuclear medicine radiologist. After resampling, dilation, and resizing using cubic spline interpolation, the PET region of interest (ROI) and CT ROI were obtained, keeping the entire tumor and its peripheral region at the same size (64×64). Subsequently, the fusion images were calculated through the α-fusion equation, in which the z-score-normalized PET and CT pixel-wise image data are blended. The fusion ROI was further standardized by z-score normalization and combined with the normalized PET ROI and normalized CT ROI to construct a 3-channel hyper-image. This hyper-image was used as the input to the SResCNN model (the pipeline is shown in Supplemental Figure S9). Z-score normalization means the ROI image was subtracted by the mean intensity value and divided by the standard deviation of the image intensity before being input to the deep learning model, to reduce the effect of different equipment and different reconstruction parameters. Because of the large difference between central and peripheral slices, only slices with an area larger than 30% of the maximum area for that patient were regarded as valid input images and used as input to the deep learning model. The area here means the area of the smallest square including the selected region (Supplemental Fig. S9c). Finally, 13,583 training hyper-images were generated for training.
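The preprocessing above (per-ROI z-score normalization, channel stacking into a hyper-image, and the 30%-of-maximum-area slice filter) can be sketched in pure Python. This is a simplified illustration on toy arrays; in particular, the alpha-blend form used for the fusion channel is an assumption, since the manuscript's exact α-fusion equation appears in Supplemental S2:

```python
def zscore(img):
    """Normalize a 2D ROI: subtract the mean, divide by the std."""
    flat = [v for row in img for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    std = var ** 0.5 or 1.0  # guard against constant images
    return [[(v - mean) / std for v in row] for row in img]

def hyper_image(pet_roi, ct_roi, alpha=1.0):
    """Stack normalized PET, CT, and their fusion as a 3-channel input.

    The alpha-blend below is an assumed form of the alpha fusion,
    not the manuscript's exact equation.
    """
    pet_n, ct_n = zscore(pet_roi), zscore(ct_roi)
    fusion = [[alpha * p + (1 - alpha) * c for p, c in zip(pr, cr)]
              for pr, cr in zip(pet_n, ct_n)]
    return [pet_n, ct_n, zscore(fusion)]

def valid_slices(slice_areas, threshold=0.3):
    """Keep slice indices whose ROI area exceeds 30% of the patient maximum."""
    cutoff = threshold * max(slice_areas)
    return [i for i, a in enumerate(slice_areas) if a > cutoff]
```

For example, `valid_slices([10, 50, 100, 20])` keeps only the two central slices, mirroring the 30%-area rule described above.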

Structure and training of the small-residual-convolutional-network (SResCNN)
The 2D SResCNN is based on several residual blocks with 3-channel input images, and is similar to the well-known ResNet-18 network but with fewer filters. Given that a single resolution may not be optimal and depends on the scale of the objects within the image, multi-resolution CNN models have been proposed and shown to have significantly better performance [s1, s2]. Therefore, the concept of multi-resolution was further incorporated into the architecture, as shown in Supplemental Figure S8. Specifically, the architecture is comprised of three conv-blocks (each including a 3 × 3 convolutional layer followed by a batch normalization layer and a rectified linear unit (ReLU) activation layer) for three different resolutions of the input hyper-images, 8 residual blocks (Resblocks), and one fully connected layer. Finally, a softmax activation layer was connected to the last fully connected layer, which was used to yield the prediction probabilities of EGFR mutation status. Additionally, one dropout layer with a probability of 0.5 was added to the fully connected layer.
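The parameter-count concern raised by the reviewer (and the "fewer filters" design choice above) comes down to simple bookkeeping. A sketch of how conv/dense trainable-parameter counts are tallied, using hypothetical layer sizes rather than the exact SResCNN configuration:

```python
def conv2d_params(in_ch, out_ch, k=3):
    """A 2D conv layer: k*k*in_ch weights per filter, plus one bias per filter."""
    return k * k * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    """A fully connected layer: one weight per input-output pair, plus biases."""
    return in_units * out_units + out_units

# Hypothetical stacks: halving filter widths shrinks the count dramatically,
# which is the rationale for a "small ResNet" on a limited dataset.
small = conv2d_params(3, 16) + conv2d_params(16, 32) + dense_params(32, 2)
wide = conv2d_params(3, 64) + conv2d_params(64, 128) + dense_params(128, 2)
```

Because conv parameters grow with the product of input and output channel counts, trimming filters at every layer is the most direct lever for keeping a residual network small relative to the training-sample count.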
The training of the model focuses on optimizing the parameters of the SResCNN model to build a relationship between the PET/CT images and EGFR mutation status (positive: 1 or negative: 0). Binary cross-entropy was employed as the loss function, and the Adam optimizer was used with an initial learning rate of 0.0001, beta_1 = 0.9, and beta_2 = 0.999. The learning rate was reduced by a factor of 5 if no improvement in the validation loss was seen for a 'patience' number (n=10) of epochs.
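The learning-rate schedule just described (divide by 5 after 10 epochs without validation-loss improvement) can be sketched as a small scheduler class; a simplified illustration of the behavior in the text, not the authors' training code:

```python
class ReduceLROnPlateau:
    """Cut the learning rate by `factor` after `patience` epochs
    with no improvement in validation loss."""

    def __init__(self, lr=1e-4, factor=5.0, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best:   # improvement: reset the patience counter
            self.best = val_loss
            self.wait = 0
        else:                      # plateau: count epochs, then reduce
            self.wait += 1
            if self.wait >= self.patience:
                self.lr /= self.factor
                self.wait = 0
        return self.lr

sched = ReduceLROnPlateau()
for loss in [1.0, 0.9] + [0.95] * 10:  # brief improvement, then a plateau
    lr = sched.step(loss)
```

After the ten plateau epochs, the rate has dropped from 1e-4 to 2e-5, matching the factor-of-5 rule stated above.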
The number of filters, the learning rate, and the batch size were determined according to the predictive performance on the validation cohort using a grid search.
In order to reduce the risk of overfitting, several techniques were deployed. 1) Augmentation: during training, augmentation including width/height shift, horizontal/vertical flip, rotation, and zoom was used to expand the training dataset and improve the model's ability to generalize. 2) Regularization: L2 regularization was used, which adds a cost to the network's loss function for large weights; as a result, a simpler model forced to learn only the relevant patterns in the training data is obtained. 3) Dropout: a dropout layer, which randomly sets output features of a layer to zero during training, was added. 4) Early stopping: during training, the model was evaluated on the validation dataset after each epoch, and training was stopped after an additional 30 epochs without improvement in the validation loss.
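The flip/rotation augmentations listed in 1) can be sketched on a toy 2D ROI. This is a pure-Python illustration of the geometric transforms only; the authors presumably used a standard framework's augmentation utilities for shifts and zoom as well:

```python
import random

def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]

def vflip(img):
    """Reverse the row order (vertical flip)."""
    return img[::-1]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(r) for r in zip(*img[::-1])]

def augment(img, rng):
    """Randomly apply flips/rotation to produce a new training variant."""
    if rng.random() < 0.5:
        img = hflip(img)
    if rng.random() < 0.5:
        img = vflip(img)
    if rng.random() < 0.5:
        img = rot90(img)
    return img
```

Each call to `augment` with a fresh random state yields one of several orientation variants of the same ROI, which is how a fixed set of hyper-images is expanded during training.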

Application of the SResCNN network
The generated hyper-image was input into the SResCNN model after z-score normalization, and a deep learning score (DLS) representing the EGFR mutation positivity was yielded after sequential activation of the convolution and pooling layers. To develop a robust prediction, all valid slices of each patient were fed into the SResCNN model, and the equally weighted average of the slice-level DLSs was regarded as the final EGFR-positive probability of the tumor.
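The per-patient aggregation described above, an equal-weight average of slice-level scores followed by thresholding, is straightforward; a sketch with hypothetical scores:

```python
def patient_dls(slice_scores):
    """Equal-weight average of slice-level DLS values for one tumor."""
    return sum(slice_scores) / len(slice_scores)

def predict_egfr(slice_scores, cutoff=0.5):
    """Binary EGFR mutation call from the averaged patient-level score."""
    return patient_dls(slice_scores) > cutoff

scores = [0.7, 0.9, 0.8]  # hypothetical outputs for three valid slices
```

This answers the reviewer's per-tumor aggregation question directly: the 2D model's slice probabilities are averaged, and the 0.5 cutoff is applied to that average.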

Reply:
Thank you for your useful suggestion. Through the efforts of the WHO and IAEA, PET imaging is becoming increasingly available in the developing world. However, the WHO has estimated that less than 30% of the world's population has access to any diagnostic imaging tests. Hence, we acknowledge the limitation and have revised the title to "Non-invasive decision support for NSCLC treatment using PET/CT radiomics". We have also added this limitation in the revised version as follows: "Lastly, this work is based on PET/CT imaging, which is not widely available in many parts of the world. Therefore, this model may be limited to the developed world and to large urban centers in the developing world."

Reply:
Thank you for your thoughtful comment. The results of the multivariate linear regression show that the EGFR-DLS captures some variance from clinical variables, limited to sex, histology, and SUVmax. However, only 25.0% of the EGFR-DLS variability could be explained by these three parameters, which means the current standard clinical variables account for only a minority of the information in the EGFR-DLS; the other clinical variables did not contribute. Further, the clinical signature constructed with these standard clinical variables only achieved AUCs of 0.78, 0.78, and 0.70 in the training, validation, and external test cohorts, significantly poorer than the EGFR-DLS, with AUCs of 0.86, 0.83, and 0.81, respectively. There is a potential that other, less common factors may be predictive of EGFR mutation status and highly correlated with the EGFR-DLS, which needs further investigation. Comparatively, the EGFR-DLS can capture more information for predicting EGFR mutation status, and do so more conveniently, from routinely used PET/CT images. We have added this limitation in the revised version as follows.
"Fifth, though 25.0% of the EGFR-DLS variability could be explained by an amalgamation of standard clinical variables, the EGFR-DLS captures more information and achieves significantly higher performance in predicting EGFR mutation status in a simpler way, using the more commonly available PET/CT images."

The paper has a number of strong features: the use of a very large multi-institutional cohort, fairly clear writing, and some very nice results regarding the classification of EGFR status using the deep learning model and the association of the resulting EGFR classification with survival.
However, there are several aspects in which this paper needs improvement.

Reply:
Thank you for your suggestion, since addressing this will improve the portability and quality of this work. As noted in our response to Reviewer #1, we apologize for omitting some pertinent details of the deep learning model in the interest of keeping the manuscript concise.
The deep learning score was the output of the deep learning model, and we have stated this in the introduction and methods sections of the revised version as follows.

Development of the deep learning model
"The output of the network, i.e., the deep learning score (EGFR-DLS), was used as the classification result to represent the probability of EGFR mutation positivity." More details about the deep learning model and the training process are given in the Patients and Methods and Supplemental Methods S2, as described in our response to Comment #16 of Reviewer #1.

Reply:
The deep learning model was used to predict binary EGFR mutation status from PET/CT images. The inputs of the deep learning model were the ROIs selected by the radiologists, which included the whole tumor and surrounding tissue, and the EGFR mutation status (positive or negative) was one-hot encoded and used as the label. The output of the network, i.e., the deep learning score (EGFR-DLS), was the classification result representing the probability of EGFR mutation positivity. The details of the training, optimization, learning rates, hyper-parameter tuning, and network architecture have now been added to the Patients and Methods and Supplemental Methods S2 as follows.

Development of the deep learning model
The EGFR mutation status prediction 2D small-residual-convolutional-network (SResCNN) model is presented in Supplemental Figure S8. The regions of interest (ROIs) of the PET and CT images were first selected by experienced nuclear medicine radiologists (L.J, JY.Z, and Y.S) after registration using ITK [30], on the condition that the entire tumor and at least 10 mm of its peripheral region were included. They were then resized to 64x64 pixels by spline interpolation and combined with their fusion image (alpha-blending fusion [31], α=1) to construct a three-channel hyper-image (the pipeline is shown in Supplemental Figure S9). To reduce the effect of the difference between the central slice and peripheral slices, only ROIs that contained measurable tumor tissue were regarded as valid ROIs, and these were fed into the SResCNN model to update its parameters with backward propagation. The EGFR mutation status (positive/negative) was one-hot encoded and used as the label. The output of the network, i.e., the deep learning score (EGFR-DLS), was used as the classification result to represent the probability of EGFR mutation positivity.
EGFR mutation positivity probability at the patient level was obtained by averaging the EGFR-DLSs of the slices that included tumor tissue. To reduce overfitting, augmentation of the 13,583 training hyper-images (width/height shift, horizontal/vertical flip, rotation and zoom) was used, and the model with the best performance on the validation dataset was selected. Details are shown in Supplemental S2.
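The per-patient aggregation described above is a plain equal-weight average of the per-slice scores. A minimal sketch (the function name is ours, not the authors'):

```python
import numpy as np

def patient_level_dls(slice_scores):
    """Aggregate per-slice EGFR-DLS values into one patient-level
    probability by an equal-weight average, as described above."""
    scores = np.asarray(slice_scores, dtype=float)
    return float(scores.mean())

# e.g. three valid slices of one tumor
print(patient_level_dls([0.8, 0.9, 0.7]))  # -> approximately 0.8
```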

"Supplemental S2: Details of the deep learning model

Preparation of the input images
Using ITK-SNAP software, the PET and CT images were first registered, and then a square or an irregular box close to the boundary of the tumor was delineated by an experienced nuclear medicine radiologist. After resampling, dilation, and resizing using cubic spline interpolation, the PET region of interest (ROI) and CT ROI were obtained with the same size (64×64), keeping the entire tumor and its peripheral region. Subsequently, the fusion images were calculated through the α-fusion equation I_fusion = α·I_PET + I_CT, where I_PET and I_CT are the PET and CT pixel-wise image data normalized by z-score normalization. The fusion ROI was further standardized by z-score normalization and combined with the normalized PET ROI and normalized CT ROI to construct a 3-channel hyper-image. This hyper-image was used as the input of the SResCNN model (the pipeline is shown in Supplemental Figure S9). Z-score normalization means that the ROI image was subtracted by its mean intensity value and divided by the standard deviation of its intensities before being input to the deep learning model, to reduce the effect of different equipment and different reconstruction parameters. Because of the large difference between the central slice and peripheral slices, only slices with an area larger than 30% of the patient's maximum area were regarded as valid input images and used as input to the deep learning model. The area here means the area of the smallest square including the selected region (Supplemental Fig. S9c). Finally, 13,583 training hyper-images were generated for training.
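The hyper-image construction can be sketched as follows. The fusion term reflects our reading of the (partly garbled) α-fusion equation, i.e. fused = α·PET_n + CT_n with α = 1; all function names are ours:

```python
import numpy as np

def zscore(img):
    # subtract the mean intensity, divide by the std of the ROI intensities
    return (img - img.mean()) / img.std()

def make_hyper_image(pet_roi, ct_roi, alpha=1.0):
    """Build the 3-channel hyper-image from registered 64x64 PET/CT ROIs.
    The fusion channel is an assumption: alpha * PET_n + CT_n, alpha = 1."""
    pet_n = zscore(pet_roi)
    ct_n = zscore(ct_roi)
    fused = zscore(alpha * pet_n + ct_n)  # fusion channel, re-standardized
    return np.stack([pet_n, ct_n, fused], axis=-1)  # shape (64, 64, 3)

roi_pet = np.random.rand(64, 64)
roi_ct = np.random.rand(64, 64)
hyper = make_hyper_image(roi_pet, roi_ct)
print(hyper.shape)  # (64, 64, 3)
```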

Structure and training of the small-residual-convolutional-network (SResCNN)
The 2D SResCNN is based on several residual blocks with 3-channel input images, similar to the well-known ResNet18 network but with fewer filters. Given that a single resolution may not be optimal and depends on the scale of the objects within the image, multi-resolution CNN models have been proposed and shown to have significantly better performance [s1, s2]. Therefore, the concept of multi-resolution was incorporated into the architecture, as shown in Supplemental Figure S8. Specifically, the architecture comprised three conv-blocks (each including a 3 × 3 convolutional layer followed by a batch normalization layer and a rectified linear unit (ReLU) activation layer) for three different resolutions of the input hyper-images, 8 residual blocks (Resblocks), and one fully connected layer.
Finally, a softmax activation layer was connected to the last fully connected layer to yield the predicted class probabilities. Additionally, one dropout layer with a probability of 0.5 was added to the fully connected layers.

The training of the model focuses on the optimization of the parameters of the SResCNN model to build a relationship between PET/CT images and EGFR mutation status (positive: 1 or negative: 0).
Binary cross entropy was employed as the loss function, while the Adam optimizer was used with an initial learning rate of 0.0001, beta_1=0.9, and beta_2=0.999. The learning rate was reduced by a factor of 5 if no improvement in the validation loss was seen for a 'patience' number (n=10) of epochs.
The number of filters, the learning rate, and the batch size were determined according to the predictive performance on the validation cohort using a grid-search method.
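The learning-rate schedule described above (reduce by a factor of 5 after 10 epochs without validation-loss improvement) behaves like this sketch; it mirrors, e.g., a Keras ReduceLROnPlateau callback, and the code is ours, not the authors':

```python
def reduce_lr_on_plateau(val_losses, lr=1e-4, factor=1/5, patience=10):
    """Replay a validation-loss history and reduce the learning rate by
    `factor` whenever no improvement is seen for `patience` epochs."""
    best = float("inf")
    wait = 0
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
    return lr

# ten flat epochs after the first -> one reduction: 1e-4 / 5 = 2e-5
print(reduce_lr_on_plateau([1.0] + [1.0] * 10))
```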

Application of the SResCNN network
The generated hyper-image was input into the SResCNN model after z-score normalization, and a deep learning score (DLS) representing the probability of EGFR mutation positivity was yielded after a sequential activation of convolution and pooling layers. To develop a robust prediction, all valid slices of each patient were fed into the SResCNN model, and the average of the per-slice DLSs, with equal weight for each slice, was regarded as the final EGFR-positive probability of the tumor." We also now state the accuracy, specificity and sensitivity metrics in the results (Table S2).

The next interesting result is the association with survival. However, it is unclear how the cut-off used to compute the association was chosen. Please explain this in detail. Was the median value of the DLS score used? Or was ROC analysis done?
Reply: The median value of the EGFR-DLS from the training cohorts was used as the cut-off, and this has been clarified in the revised version in Statistical analysis section as follows.

"Statistical analysis
The area under the receiver operating characteristic curve (AUC), accuracy (ACC), specificity (SPEC) and sensitivity (SEN) were used to evaluate predictive performance. The median value of the EGFR-DLS from the training cohort was used as the cut-off for the survival analyses."
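The dichotomization described in the reply above (training-cohort median as cut-off) can be sketched as follows; the function name is ours:

```python
import numpy as np

def split_by_training_median(train_scores, scores):
    """Dichotomize EGFR-DLS values using the training-cohort median
    as the cut-off, as described in the Statistical analysis section."""
    cutoff = float(np.median(train_scores))
    groups = ["high" if s > cutoff else "low" for s in scores]
    return cutoff, groups

cutoff, groups = split_by_training_median([0.1, 0.2, 0.3, 0.4, 0.5], [0.25, 0.8])
print(cutoff, groups)  # -> 0.3 ['low', 'high']
```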

Reply:
Thank you for your comments and proposed experiment, which sounds very interesting.
(1) ROIs including a much larger portion of the image would need a more complex model to identify more organs and tissues. Additionally, to minimize the loss of precision, a larger bounding box means a larger ROI size for training. Both of the above yield a larger computational burden. Because the main aim of this study is to predict the EGFR mutation status of the tumor, and in order to design a system that can be deployed on a local workstation with a few GPUs and trained with a limited dataset, we used the current smaller bounding box, which includes a small portion of the image, for training and testing in this work.
Herein, we also went back and identified ROIs with much larger bounding boxes by dilating the smallest square including the manual delineation with squares of size 60, 80, and 100 mm for 60 cases from the validation cohorts, and the generated EGFR-DLSs achieved AUCs of 0. With the input of an ROI including more organs and tissues, the prediction ability decreased. This is reasonable, since the model was trained on ROIs that included 10-20 mm of the tumor's peripheral region, without considering inputs with more organs and tissues included. A model that works for larger ROIs with more organs and tissues should be trained on larger ROIs based on more data. Though smaller AUCs were obtained with more organs and tissues included, the difference was not statistically significant for the dilation with a square of size 60 mm. For this dilation size, we visualized the model using the larger ROIs of the same two patients shown in Figure 3, as shown in the following figure (Figure S1). From this figure, with the input of an ROI including more organs and tissues, the shape and location of the activation maps and the positive/negative filters were not identical, but they are very similar to the originals. We have also clarified this in the revised manuscript as follows.

"Distribution and characteristics of EGFR-DLS
"Similarly, the negative filter was strongly activated and the positive filter was nearly shut down when an EGFR-wild-type tumor was fed to the deep learning model, which reveals the strong classification ability of the deep learning model. When the input ROIs were enlarged to include more organs and tissues, similar activation maps, positive/negative filters and predicted EGFR-DLSs were also obtained, as shown in Supplemental Figure S1."

Unfortunately, given that this model is trained mainly on the tumor region, we are afraid that it should not be used with input ROIs that do not contain the tumor region. Developing a model with both high tumor-localization ability and high EGFR-prediction ability will be our future study. We have added this to the limitation section as follows: "Seventh, given that this model is trained mainly on the tumor with 10-20 mm of the tumor's peripheral region included, the model cannot be used for ROIs without the tumor included, and the prediction ability will decrease with input ROIs that include more organs and tissues. A more intelligent model to solve this problem is left for our future work."

Figure S1. Visualization of the model using different ROIs of the same patients as in Figure 3. The first line shows the original input ROIs, and the second line shows two of the activation maps of the fourth ResBlock, the positive filter and the negative filter generated with the original input ROI. The third and fourth lines show the input ROIs with more organs/tissues included, and the corresponding activation maps and positive/negative filters.
(2) The SSIM was reported as the median value with interquartile range (IQR) for all cases (Figure 3d).

Reply:
Thank you for your useful suggestion. In this work, the rough ROIs were selected by a radiologist using ITK software (the pipeline is shown in Supplemental Figure S9). However, given that there were minor differences among radiologists in selecting the rough ROIs, the ROIs of 73 patients within the validation cohort were selected by all three radiologists to validate the reproducibility of the DLS. The inter-rater agreement of the DLS estimates among the three radiologists was calculated using the intraclass correlation coefficient (ICC). We have added concise details to both the results and methods sections, as described in our response to Comment #10 of Reviewer #1.
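The reply does not state which ICC form was used; a common choice for inter-rater agreement is ICC(2,1) (two-way random effects, absolute agreement, single rater), sketched here as an assumption with NumPy:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n_subjects, k_raters) array. This is the standard
    textbook formula, not necessarily the exact form the authors used."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # mean squares for rows (subjects), columns (raters) and residual error
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    sse = ((x - grand) ** 2).sum() - msr * (n - 1) - msc * (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# three hypothetical DLS values per patient, one column per radiologist
ratings = [[0.10, 0.12, 0.11], [0.50, 0.52, 0.49], [0.90, 0.88, 0.91]]
print(icc2_1(ratings))  # close to 1: near-perfect agreement
```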

Reply:
Thank you for your useful suggestion, and we have revised the related description of the relationship as follows.

Response to REVIEWER #4
The proposed approach is novel and very interesting however there are some points that need to be clarified and discussed. More specifically,

Reply:
Thank you for your suggestion, which sounds very interesting. Sex, ethnicity and other factors may indeed introduce selection bias. The first referenced publication (Amsterdam et al.) provides a very novel and useful way to address the selection-bias problem, and we have cited it here. However, this is a retrospective study, and the training of the model was limited to the current data. Not all of the data have these colliders and clinical outcomes at the same time. For example, the HLM cohort does not have EGFR mutation status, while the SPH and HBMU cohorts do not have the clinical outcomes of TKI or ICI treatment. Therefore, we could not follow this approach in this study, but it is left for future work. Additionally, the results on test cohorts with different demographic characteristics (e.g., different ethnicities and histologies) compared to the training cohorts support the generalization of the model, indicating that it is less affected by hidden colliders. We have stated this in the limitation section as follows.
"Sixth, hidden colliders such as sex, ethnicity and histology may introduce selection bias in the current study."

Reply:
Thank you for your comments. We apologize for the error in the previous version; the correct number is 13,583 ROIs for training. Though the dataset was not large, several techniques were deployed to reduce overfitting. We have clarified this in Supplemental S2 as follows:

"Structure and training of the SResCNN network
In order to reduce the risk of overfitting, several techniques were deployed. 1) Augmentation: during training, augmentation including width/height shift, horizontal/vertical flip, rotation and zoom was used to expand the training dataset and improve the model's ability to generalize. 2) Regularization: L2 regularization was used, which adds a cost to the network's loss function for large weights; as a result, a simpler model that is forced to learn only the relevant patterns in the training data is obtained. 3) Dropout: a dropout layer, which randomly sets output features of a layer to zero during training, was added. 4) Early stopping: during training, the model was evaluated on the validation dataset after each epoch, and training was stopped after waiting an additional 30 epochs once the validation loss started to degrade." Additionally, the hyper-images for training, validation and testing were generated at the patient level. We have clarified this in the methods section as follows:
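Of the listed techniques, the L2 penalty can be illustrated as an added loss-function term; `lam` is an illustrative value, since the regularization strength is not reported, and the function is our sketch:

```python
import numpy as np

def bce_with_l2(y_true, y_pred, weights, lam=1e-4, eps=1e-7):
    """Binary cross-entropy plus an L2 weight penalty, the loss shape
    described above (lam is illustrative, not the authors' setting)."""
    p = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    l2 = lam * sum(float((w ** 2).sum()) for w in weights)
    return bce + l2
```

With zero weights the penalty vanishes and the value reduces to the plain binary cross-entropy.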

Study population
Patient cohorts from SPH and HBMU were randomly divided into a training cohort (n=429) and a validation cohort (n=187) with a 70/30 ratio to train and validate the deep learning model for predicting EGFR mutation, while patients from HMU were used as an external test cohort to test this model. Data from the cohorts were rigorously kept separate."

"…basis rather than a patient or a tumor basis. a) How did the authors aggregate these probabilities into a per-tumor prediction? Please clarify. b) The selected architecture also needs some justification: why did the authors select this specific one? c) Why did they not consider more suitable approaches, including pre-trained networks with transfer learning, given the typical n<<p situation they had?

Reply:
Thank you for your comments. a) To develop a robust prediction, the average of the EGFR-DLSs of all valid slices containing tumor tissue that were fed into the SResCNN model, with equal weight, was regarded as the final EGFR-positive probability of the tumor. This has been restated in the methods section as follows.

Development of the deep learning model
To reduce the effect of the difference between the central slice and peripheral slices, only the ROIs that contained measurable tumor tissue were regarded as valid ROIs.
EGFR mutation positivity probability at the patient level was obtained by averaging the EGFR-DLSs of the valid slices that included tumor tissue." b) ResNet is one of the best architectures for image classification. Given the limited size of the dataset, we constructed a network similar to ResNet18 but with fewer filters, which could be trained on our own workstation with less data. Additionally, given that a single resolution may not be optimal and depends on the scale of the objects within the image, multi-resolution CNN models have been proposed and shown to have significantly better performance [s1, s2]. Therefore, we incorporated the concept of multi-resolution into the smaller ResNet structure. By trying different numbers of filters, the current structure gave an acceptable result. We have clarified this in Supplemental S2 as follows:

"Structure and training of the SResCNN network
The 2D SResCNN is based on several residual blocks with 3-channel input images, similar to the well-known ResNet18 network but with fewer filters. Given that a single resolution may not be optimal and depends on the scale of the objects within the image, multi-resolution CNN models have been proposed and shown to have significantly better performance [s1, s2]. Therefore, the concept of multi-resolution was incorporated into the architecture, as shown in Supplemental Figure S8.
The number of filters, the learning rate and the batch size were determined according to the predictive performance on the validation cohort using a grid-search method."
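The grid-search selection described above can be sketched as follows; `train_eval` is a stand-in for training the SResCNN and returning the validation AUC, and the grid values are illustrative, since the actual search space is not fully specified:

```python
import itertools

def grid_search(train_eval, grid):
    """Exhaustive grid search: pick the hyper-parameter combination with
    the best validation performance, as described in the quoted text."""
    best_auc, best_cfg = -1.0, None
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        auc = train_eval(cfg)  # train the model, return validation AUC
        if auc > best_auc:
            best_auc, best_cfg = auc, cfg
    return best_cfg, best_auc

# hypothetical search space mirroring the text (filters, learning rate, batch size)
grid = {"filters": [8, 16, 32], "lr": [1e-3, 1e-4], "batch": [16, 32]}
```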

Reply:
Thank you for your correction. We have now changed "univariable" to "univariate" in Table S2 and Table S5.

Reply:
Thank you for your thoughtful comments, and we apologize for the lack of clarity.
For all cases in this study, contemporaneous planar CT images were acquired and used for image registration with the PET images. Per the SOPs of the institutions that conducted the PET/CT imaging studies, a non-contrast CT scan was acquired first, followed by the PET scan. The PET and CT images were co-registered on the same machine by the scanner software. Thus, most of the patients in this analysis were already well registered prior to our analysis. We acknowledge that a few patients may have minor misalignment due to respiratory motion: the CT acquisition is rapid, whereas the PET acquisition takes much longer, meaning that the CT is a snapshot of position while the PET is more of an integral of average position. For the small number of patients with mis-registration, a rigid transformation was performed manually using ITK-SNAP by an experienced nuclear medicine radiologist to fuse the non-contrast CT and PET images. We have now clarified this in Supplemental S2 as follows: "The pipeline is shown in Supplemental Figure S9. In clinical practice, a non-contrast CT scan was acquired first and then the PET scan was acquired. The PET and CT images were co-registered on the same machine by the scanner software; thus, almost all cases in this study were already registered. A few cases had minor misalignment due to respiratory motion. For these cases, an experienced nuclear medicine radiologist manually adjusted the alignment using ITK-SNAP. Then a square or an irregular box close to the boundary of the tumor was delineated using ITK-SNAP software by the experienced nuclear medicine radiologist."

Reply:
We agree that resampling may introduce some errors that could lead to less accurate results, but this should not be a differential bias/error in relation to the outcome/dependent variables. Nonetheless, to ensure that the PET and CT images have the same resolution and that the input ROIs have the same size, resampling is necessary. In our work, bicubic interpolation was used for image resampling, since it produces relatively sharp results and is the most commonly used technique in image processing compared to other resampling methods [s1].
Regarding the dilation: though there is no tumor contour, a square or an irregular box close to the boundary of the tumor was delineated by the experienced nuclear medicine radiologist. The smallest square mask including the selected region was then obtained, and the dilation was performed on this mask. Through the dilation, the entire tumor and at least 10 mm of the peritumoral region were included in the input ROI.
We have now clarified this, provided a pipeline figure of this process, and cited it at the beginning of Supplemental S2 as follows.
"The pipeline is shown in Supplemental Figure S9. In clinical practice, …. Then a square or an irregular box close to the boundary of the tumor was delineated using ITK-SNAP software by the experienced nuclear medicine radiologist. After resampling with bicubic spline interpolation to ensure the PET and CT images have the same resolution, dilation of the smallest square mask including the selected region to keep the peritumoral region included, and resizing using bicubic spline interpolation, the PET region of interest (ROI) and CT ROI were obtained with the same size (64×64), keeping the entire tumor and its peripheral region."

Figure S9. Illustration of the generation of the input hyper-image. A square or an irregular box close to the boundary of the tumor was first delineated manually in ITK software, and then the hyper-image was generated automatically after resampling (bicubic interpolation), dilation and fusion.
[s1] Han, Dianyuan. Comparison of commonly used image interpolation methods. In: Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering. Atlantis Press, 2013.
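The dilate-then-resize step described above can be sketched with SciPy's order-3 (bicubic) spline zoom; the margin, pixel spacing, padding mode and all names are our assumptions:

```python
import numpy as np
from scipy.ndimage import zoom

def prepare_roi(roi, margin_mm=10, pixel_mm=1.0, out_size=64):
    """Pad (dilate) a square tumor ROI by `margin_mm` of peritumoral
    region, then resize to out_size x out_size with bicubic (order-3)
    spline interpolation -- a sketch of the pipeline above."""
    pad = int(round(margin_mm / pixel_mm))
    dilated = np.pad(roi, pad, mode="edge")  # stand-in for the mask dilation
    factor = out_size / dilated.shape[0]
    return zoom(dilated, factor, order=3)

roi = np.random.rand(40, 40)
print(prepare_roi(roi).shape)  # (64, 64)
```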

Also how does this work depending on tumor location -say tumors located in the mediastinum vs. tumors surrounded by air vs. tumors attached to chestwall?
Reply: a) By z-score normalization, we mean that the ROI image was subtracted by its mean intensity value and divided by the standard deviation of the ROI image intensities. We have now clarified this in Supplemental S2 as follows: "I_PET and I_CT are the PET and CT pixel-wise image data normalized by z-score normalization, which means that the ROI image was subtracted by the mean intensity value and divided by the standard deviation of the ROI image intensity." b) Z-score normalization is a widely used normalization technique to eliminate the offset effect [S1]. Because it is reported to lead to smoother and faster convergence in the training of CNN networks [S2, S3], we decided to use z-score normalization in our work. c) In response, we performed a further experiment on ROIs with much larger bounding boxes, dilating the smallest square including the manual delineation with a square of size 60 mm for 60 cases from the validation cohorts. The generated EGFR-DLSs achieved an AUC of 0.77 (95% CI: 0.65, 0.89) (p=0.38, DeLong test), which is slightly, but not significantly, lower than the original AUC of 0.83 (95% CI: 0.73, 0.93). That is to say, this method could also work with the input of a larger ROI that includes more organs/tissues. d) For the ROI selection shown in Supplemental Figure S9, a square or an irregular box close to the boundary of the tumor was delineated. The radiologists therefore did not carefully select the ROI to keep a similar proportion of air/tumor/normal tissue, as this would have been impractical. To have a method that is more generally applicable, it was important that we put no constraints on our radiologists regarding the choice of input conditions. e)
According to tumor location, lung cancers can be divided into the following four types: I) tumors surrounded by air; II) tumors surrounded by air and mediastinum; III) tumors surrounded by air and chest wall; IV) tumors surrounded by air, mediastinum and chest wall. In this study, tumor location was not an inclusion or exclusion criterion; therefore, all types of tumors were included in this analysis. In detail, for the 65 patients from the external HMU test cohort: 15 patients with an EGFR-mutant prevalence of 60.0% belong to Type I, for whom the EGFR-DLS achieved an AUC of 0.98 (95% CI: 0.93-1.00, p=0.002); 23 patients with an EGFR prevalence of 34.78% belong to Type II, with an AUC of 0.80 (95% CI: 0.61-1.00, p=0.020); 13 patients with an EGFR prevalence of 61.5% belong to Type III, with an AUC of 0.93 (95% CI: 0.77-1.00, p=0.013); and the remaining 14 patients with an EGFR-mutant prevalence of 78.6% belong to Type IV, with an AUC of 0.97 (95% CI: 0.88-1.00, p=0.016). There were no significant differences in AUC between the types (Type I vs Type II, p=0.10; Type I vs Type III, p=0.53; Type I vs Type IV, p=0.82; Type II vs Type III, p=0.35; Type II vs Type IV, p=0.14; Type III vs Type IV, p=0.64). Therefore, this work is independent of tumor location. We have added this to the Results section as follows.

"Distribution and characteristics of EGFR-DLS
The intraclass correlation coefficient (ICC) of these three EGFR-DLSs was significant (p<.001), indicating that there were no significant differences in the AUCs of the three EGFR-DLSs (Supplemental Figure S2).
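The AUC comparisons in the replies above use the DeLong test; as an illustrative alternative (not the authors' method), a rank-based AUC with a percentile-bootstrap confidence interval can be sketched as:

```python
import numpy as np

def auc_rank(y, s):
    """AUC via the Mann-Whitney statistic (equivalent to the ROC AUC)."""
    s = np.asarray(s, float)
    y = np.asarray(y)
    pos, neg = s[y == 1], s[y == 0]
    # fraction of (positive, negative) pairs ranked correctly; ties count 1/2
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def auc_ci(y, s, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for the AUC (illustrative alternative
    to the DeLong method used in the paper)."""
    rng = np.random.default_rng(seed)
    y, s = np.asarray(y), np.asarray(s)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():
            continue  # need both classes in the resample
        aucs.append(auc_rank(y[idx], s[idx]))
    return np.percentile(aucs, [2.5, 97.5])

y = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
print(auc_rank(y, scores))  # -> 1.0 (perfectly separated toy data)
```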

Related to this, the authors say that ResNet18 didn't work (or at least that's how I understood it as the rationale for using a similar network with fewer layers). Presenting comparison results of the actual ResNet18 would be helpful.
Was the network fine-tuned or trained from scratch? Please provide details.

Reply:
Thank you for your thoughtful comments, and we apologize for the lack of clarity.
The current architecture has the same number of layers as ResNet18; however, the number of filters per layer is much smaller (1/8) than in ResNet18. The main reason we can use a smaller number of filters is that we have only 2 classes to predict, compared to the 1000 classes in ImageNet.
In this work, we used 3 different resolutions with image sizes of 64x64, 32x32 and 16x16. Even with 3 resolutions, no convolutional layers were added, and the number of filters remained much smaller than in ResNet18. According to [s1, s2], multi-resolution CNN models have proved to have significantly better performance; therefore, we utilized different-resolution inputs, and a comparison with multi-scale approaches could be considered for future research. We prepared different resolutions with image sizes of 64x64, 32x32, 16x16 and 8x8, and the actual number was determined according to the predictive performance on the validation cohort using the grid-search method. We have clarified this in Supplemental S2 as follows.
"The number of filters, the number of resolutions, the learning rate, and the batch size were determined according to the predictive performance on the validation cohort using the grid-search method." In response to the comment, we performed two extra experiments. 1) To validate the necessity of multi-resolution and determine the number of resolutions, architectures with different numbers of resolutions (Supplemental Figure S10) were trained, validated and tested while the structures of the other convolutional layers were kept the same. The predictive performance of the different architectures is provided in Supplemental Figure S11. When the number of resolutions was less than 4, the predictive performance in the training and validation cohorts improved as the number of resolutions increased. With 4 resolutions, the predictive performance improved in the training cohort but not in the validation cohort. Thus, to keep the architecture with fewer parameters, we used 3 resolutions in this analysis. Additionally, the advantage of the multi-resolution architecture was also demonstrated in the external test cohort, which has more advanced-stage cases with larger tumor volumes. That is to say, the multi-resolution architecture is less dependent on the scale of the objects within the image. 2) In this study, we believe it is imperative to reveal the biological underpinnings of the model through visualization. We were concerned that pre-trained models with transfer learning might not yield a better explanation of the features of the middle layers, since pre-trained models are trained with real-world images and might extract real-world-related features that are not applicable to tumor images. Additionally, we also tried transfer learning, and significantly worse performance was obtained, with AUCs of 0.76 and 0.67 in the training and validation cohorts (p<0.05), respectively.
Therefore, compared to transfer learning, we trained the network from scratch. Comparing the actual ResNet18 with our architecture (similar to ResNet18 but with fewer filters per layer), the performance of ResNet18 was better in the training cohort, but nearly the same in the validation cohort and a little worse in the test cohort (Supplemental Figure S11).
This result indicates that a larger number of filters would not increase the predictive performance but would increase the risk of overfitting for the task in our study. Additionally, training ResNet18 took 200 seconds per epoch, ten times longer than training the proposed model. Therefore, the ResNet18-like architecture with fewer filters per layer is more appropriate for our study.
The determination of this architecture has now been provided in Supplemental S3 as follows.

"Supplemental S3: Determination of architectures of CNN network
To select the optimal architecture, we first compared ResNet18 with a similar architecture with a smaller number of filters per layer (referred to as the 1-resolution model, shown in Supplemental Figure S10(a)). Then, architectures with different numbers of resolutions (Supplemental Figure S10) were trained and tested while the structures of the other convolutional layers were kept the same. From the performance shown in Supplemental Figure S11, when the number of resolutions was less than 4, the predictive performance improved in the training and validation cohorts as the number of resolutions increased. With 4 resolutions, the predictive performance improved in the training cohort but not in the validation cohort. In order to keep the architecture with fewer parameters, we used 3 resolutions in this work. Additionally, the advantage of the multi-resolution architecture was also demonstrated in the external test cohort, which has more advanced-stage cases with larger tumor volumes. That is to say, the multi-resolution architecture is less dependent on the scale of the objects within the image.
Based on the above comparison, the current SResCNN network was arrived at and used in this work."