A deep learning integrated radiomics model for identification of coronavirus disease 2019 using computed tomography

Since its first outbreak, Coronavirus Disease 2019 (COVID-19) has been rapidly spreading worldwide and caused a global pandemic. Rapid and early detection is essential to contain COVID-19. Here, we first developed a deep learning (DL) integrated radiomics model for end-to-end identification of COVID-19 using CT scans and then validated its clinical feasibility. We retrospectively collected CT images of 386 patients (129 with COVID-19 and 257 with other community-acquired pneumonia) from three medical centers to train and externally validate the developed models. A pre-trained DL algorithm was utilized to automatically segment infected lesions (ROIs) on CT images which were used for feature extraction. Five feature selection methods and four machine learning algorithms were utilized to develop radiomics models. Trained with features selected by L1 regularized logistic regression, classifier multi-layer perceptron (MLP) demonstrated the optimal performance with AUC of 0.922 (95% CI 0.856–0.988) and 0.959 (95% CI 0.910–1.000), the same sensitivity of 0.879, and specificity of 0.900 and 0.887 on internal and external testing datasets, which was equivalent to the senior radiologist in a reader study. Additionally, diagnostic time of DL-MLP was more efficient than radiologists (38 s vs 5.15 min). With an adequate performance for identifying COVID-19, DL-MLP may help in screening of suspected cases.

www.nature.com/scientificreports/ lesions, assessing disease severity, and predicting disease prognosis of COVID-19 have been developed [6][7][8][9][10][11][12][13]. Wang et al. developed a DL model that extracts radiographic features of COVID-19 to provide a clinical diagnosis before pathogenic examination 8. Yue et al. built an ML model using CT images to estimate the hospital stay of COVID-19 patients 14. Another study developed a radiomics nomogram using features extracted from the lung parenchyma window to predict COVID-19 13. When reviewing the published literature on prediction models for COVID-19 diagnosis 15, we noticed that region of interest (ROI) annotation, which is time-consuming but indispensable for model development, was a common challenge for both deep learning and radiomics modeling. Moreover, although radiomics is widely used in medical imaging 16, the lack of automatic ROI annotation is a key hurdle to its clinical application because each case must be annotated before a radiomics model can be applied.
In recent years, radiomics has developed rapidly and has attracted broad attention for its potential to identify subtle disease characteristics that cannot be discerned by the naked eye. However, the performance of a radiomics model can be greatly influenced by the choice of feature selection method and classification algorithm [17][18][19]. To achieve the best model, both therefore need to be chosen carefully. To our knowledge, no research so far has evaluated the effects of feature selection methods and classification algorithms on the performance of radiomics models for distinguishing COVID-19 from other community-acquired pneumonia (CAP). In this study, we addressed the time-consuming ROI annotation problem by integrating a DL segmentation algorithm with the radiomics approach, and developed an end-to-end model using CT images to screen COVID-19 patients. Additionally, cross-combinations of five feature selection methods and four machine learning algorithms were used to develop the optimal radiomics model. Furthermore, the clinical feasibility of the model was validated on an external dataset in terms of classification performance and time efficiency.

Materials and methods
Patients. This  In addition, patients' characteristics were summarized, including clinical stages and imaging manifestations. In particular, over 65% of the included COVID-19 patients were clinically classified as the moderate type, followed by 27.1% mild type, 2.3% severe type, and 0.8% critical type (Appendix Table S1). In terms of imaging manifestations on chest CT scans, multifocal small patchy shadows, ground glass opacity (GGO), and consolidation were the main lesions found in both COVID-19 and CAP cases. As can be seen in Appendix Table S2, GGO was more common and consolidation was less common in COVID-19 patients than in CAP cases, which could be attributed to the relatively larger proportion of patients with the mild or moderate clinical type. Other reported imaging manifestations, including infiltrate and pleural effusion, were rare among the patients included in this study.
DL segmentation algorithms. The DL segmentation algorithm was a built-in feature of the InferScholar platform by Infervision (https://www.infervision.com/, Beijing, China) and was applied to automatically delineate ROIs in this study. The segmentation algorithm was trained with 507 sets of CT scans from suspected COVID-19 patients in the Wuhan area. A coarse annotation strategy was used in which major lesions with multifocal small patchy shadowing, ground-glass opacities, and consolidations were selectively annotated on CT images by experienced radiologists (Fig. 1a). During algorithm training, CT images of different sizes were first resized to 512 × 512 using bilinear interpolation as previously described 20, and the CT values of the images were rescaled with a window center of -600 and a window width of 1500 so that the pneumonia lesions could be presented and easily distinguished (Fig. 1b). Annotated lesions on each slice were merged into a 3D ROI after segmentation (Fig. 1c). Training and testing of the DL segmentation algorithm were performed using MXNet (version 1.6.0) and CUDA (version 10.0).
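The intensity windowing step described above can be illustrated with a minimal sketch. It assumes that CT values are given in Hounsfield units and that "rescaled" means clipping to the window and mapping linearly to [0, 1]; neither detail is stated in the text, and the function name `apply_window` is hypothetical. The bilinear resizing to 512 × 512 is omitted.

```python
import numpy as np

def apply_window(ct, center=-600.0, width=1500.0):
    """Clip CT values to the window [center - width/2, center + width/2]
    and rescale them linearly to [0, 1] (the output range is an assumption)."""
    low = center - width / 2.0    # -1350 for the window used in this study
    high = center + width / 2.0   # 150 for the window used in this study
    clipped = np.clip(np.asarray(ct, dtype=np.float32), low, high)
    return (clipped - low) / (high - low)

# Toy 3 x 2 "slice" with values below, inside, and above the window.
slice_values = np.array([[-2000.0, -1350.0], [-600.0, 150.0], [400.0, 3000.0]])
windowed = apply_window(slice_values)
```

Values below the lower window bound map to 0, values above the upper bound map to 1, and the window center maps to 0.5, so lung parenchyma and pneumonia lesions occupy most of the displayed intensity range.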
To briefly summarize the structure of the DL segmentation algorithm, U-Net was the main architecture, with Xception 21,22 serving as the backbone (sFig. 1). The annotation performance was evaluated by the Dice index. The Dice loss used as the loss function was as follows:

$$\mathrm{Dice\ Loss} = 1 - \frac{2\,|\mathrm{Pred} \cap \mathrm{Anno}|}{|\mathrm{Pred}| + |\mathrm{Anno}|}$$

where Pred denotes lesion pixels predicted by the DL segmentation algorithm and Anno represents the reference lesion pixels annotated by senior radiologists.  Appendix Table S3.
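As an illustration of the Dice index, the following sketch computes the overlap between a predicted and an annotated binary mask, assuming the common definition (twice the number of overlapping pixels divided by the total number of pixels in both masks) and taking the loss as one minus that value. The smoothing constant `eps` and the function names are assumptions for this example and are not taken from the segmentation platform.

```python
import numpy as np

def dice_index(pred, anno, eps=1e-7):
    """Dice overlap between two binary masks; eps avoids division by zero
    when both masks are empty (an assumption of this sketch)."""
    pred = np.asarray(pred, dtype=bool)
    anno = np.asarray(anno, dtype=bool)
    intersection = np.logical_and(pred, anno).sum()
    return (2.0 * intersection + eps) / (pred.sum() + anno.sum() + eps)

def dice_loss(pred, anno):
    """Loss that decreases as the overlap between the masks increases."""
    return 1.0 - dice_index(pred, anno)

# Toy masks: 3 predicted pixels, 3 annotated pixels, 2 of them shared.
pred_mask = np.array([[1, 1, 0], [0, 1, 0]])
anno_mask = np.array([[1, 0, 0], [0, 1, 1]])
toy_dice = dice_index(pred_mask, anno_mask)
```

For the toy masks the Dice index is 2 × 2 / (3 + 3), or about 0.67, which is close to the value of 0.69 reported for the segmentation algorithm in the Results.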
Feature selection. To select discriminating features, five methods were applied and compared in this study: L1 regularized least absolute shrinkage and selection operator (L1-LASSO), L1 regularized logistic regression (L1-LR), L2 regularized ridge regression (L2-Ridge), eXtreme gradient boosting (XGBoost), and Z-test 24,25. Five-fold cross-validation was used. All methods were implemented by calling the scikit-learn (version 0.20.2) package, and the method with the highest accuracy was chosen as the final dimensionality reduction method.
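A minimal sketch of L1-LR feature selection with scikit-learn is given below. It is not the study's actual pipeline: the regularization strength `C`, the standardization step, the coefficient tolerance `tol`, and the synthetic data are all assumptions made for illustration. The idea is that the L1 penalty drives the coefficients of uninformative features to exactly zero, so the features with non-zero coefficients form the selected set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def select_features_l1_lr(X, y, C=0.05, tol=1e-5):
    """Return a boolean mask of features whose L1-regularized logistic
    regression coefficient is non-zero (C and tol are illustrative)."""
    X_scaled = StandardScaler().fit_transform(X)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=0)
    clf.fit(X_scaled, y)
    return np.abs(clf.coef_).ravel() > tol

# Synthetic stand-in for a radiomics matrix: 200 cases, 20 features,
# of which only the first two determine the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

mask = select_features_l1_lr(X_demo, y_demo)
X_selected = X_demo[:, mask]
```

Smaller values of `C` correspond to stronger regularization and therefore to fewer selected features; in practice this value would be tuned by cross-validation, as described above.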

ML model training and testing.
For unbiased estimation of diagnostic accuracy, data from two hospitals (Jinan Infectious Disease Hospital and Beijing Haidian Hospital) were divided into training and internal testing sets at a ratio of 2:1; data from the third hospital were used as an external testing set. With the selected features, four independent ML models were trained on the training set: support vector machine (SVM), multi-layer perceptron (MLP), logistic regression (LR), and XGBoost. These methods were all implemented by calling the scikit-learn (version 0.20.2) package. To select the best model and the optimal hyper-parameters for each model, five-fold cross-validation was performed on the training set, in which 80% of the data was randomly selected to train the models and the remaining 20% (tuning set) was used to validate them. The training and validation process was repeated five times so that each fold served as the tuning set once. In the model testing stage, ensemble models from the five-fold cross-validation were used to discriminate COVID-19 from CAP patients, and model performance was evaluated on the internal and external testing datasets.
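The five-fold training and ensembling procedure can be sketched as follows. The text does not state how the five fold models were combined, so averaging their predicted probabilities is an assumption here, as are the MLP hyper-parameters and the synthetic data; hyper-parameter tuning on the tuning folds is omitted for brevity.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

def train_cv_ensemble(X, y, n_splits=5, seed=0):
    """Train one MLP per fold and record how often each sample is used
    for tuning (it should be exactly once)."""
    models = []
    tuning_counts = np.zeros(len(y), dtype=int)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, tune_idx in kfold.split(X):
        model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        models.append(model)
        tuning_counts[tune_idx] += 1
    return models, tuning_counts

def ensemble_predict_proba(models, X):
    """Average the positive-class probability over all fold models
    (the averaging rule is an assumption of this sketch)."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Synthetic stand-in for selected radiomics features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(30, 5))

models, tuning_counts = train_cv_ensemble(X_train, y_train)
probs = ensemble_predict_proba(models, X_test)
```

Each of the five models sees 80% of the training data, and the held-out testing sets are touched only by `ensemble_predict_proba`, which keeps the performance estimate independent of model selection.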

Reader study.
To further evaluate the clinical feasibility of the proposed models, two radiologists (one senior radiologist with 15 years' experience and one junior radiologist with 5 years' experience) participated in the reader study on both the internal and external testing datasets. Both radiologists had worked on the front line of the fight against COVID-19. In the reader study, they diagnosed cases independently, based only on the CT imaging information. Their diagnostic performance was compared with that of the proposed end-to-end models. Of note, diagnostic efficiency was evaluated in terms of diagnostic time consumption.
Model evaluation and statistical analysis. Diagnostic performance was evaluated by classification sensitivity, specificity, precision, accuracy, F1 score, G-Mean, and the areas under the ROC curve (AUC) and the PR curve (AP). The PR curve, a measure complementary to the ROC curve 26, was also used to account for possible class imbalance in the data. Categorical variables were expressed as frequencies and analyzed by the Chi-square test. P < 0.05 was considered statistically significant. Continuous variables were expressed as means ± SD. A two-sided 95% confidence interval for AUC or AP was constructed following the approach of Hanley and McNeil (1982) 27. Cohen's Kappa coefficient was calculated to measure the agreement between ground-truth results and model predictions. All statistical analyses were performed with the R statistical package (The R Foundation for Statistical Computing, Vienna, Austria).
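The metrics listed above can be computed from a 2 × 2 confusion matrix, and the Hanley and McNeil standard error gives the confidence interval for the AUC. The sketch below is written in Python for consistency with the other examples, although the study used R. The counts in the example are hypothetical, chosen only to be consistent with the reported external sensitivity and specificity (0.879 and 0.887); they are not taken from the paper. Clipping the interval to [0, 1] is also an assumption.

```python
import math

def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / n
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "g_mean": g_mean, "kappa": kappa}

def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Two-sided confidence interval for an AUC (Hanley and McNeil, 1982),
    clipped to [0, 1] (clipping is an assumption of this sketch)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    se = math.sqrt(variance)
    return max(0.0, auc - z * se), min(1.0, auc + z * se)

# Hypothetical counts: 33 positive and 62 negative cases.
metrics = diagnostic_metrics(tp=29, fp=7, tn=55, fn=4)
ci_low, ci_high = hanley_mcneil_ci(auc=0.959, n_pos=33, n_neg=62)
```

The interval narrows as the number of positive and negative cases grows, which is why AUC estimates from small testing sets carry relatively wide confidence intervals.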

Results
Performance of feature selection methods and ML models. The pre-trained DL segmentation algorithm achieved a Dice index of 0.69 and displayed adequate performance on the CT scans in this study. Many more lesions were annotated by the DL algorithm than by the coarse annotation method. Examples of coarsely annotated and AI-labeled ROIs are shown in Fig. 2. Of the five selection methods, L1-LR, which selected 108 radiomics features, enabled three ML models to achieve the highest AUC on the validation set and was thus selected as the optimal method (sFig. 2, Fig. 1d). Pearson correlation coefficients (PCC) among the 108 selected features were calculated; features with PCC < 0.8 and PCC < 0.5 constituted two further feature sets, respectively (Appendix Tables S4 and S5). Feature redundancy was examined by training models with these three feature sets.
Performance evaluation of the end-to-end models. ML models integrated with the DL segmentation algorithm constituted the end-to-end models. We then evaluated the performance of these models on the testing datasets. DL-MLP outperformed other models with an AUC of 0.922 (95% CI 0.856-0.988), an F1 score of 0.841, and a kappa coefficient of 0.761 on the internal testing dataset; the AP reached 0.851 (95% CI 0.762-0.939) (Fig. 5a,b). In contrast, the AUCs of DL-SVM, DL-LR, and DL-XGBoost were 0.927 (95% CI 0.864-0.991), 0.918 (95% CI 0.851-0.986), and 0.882 (95% CI 0.802-0.961), respectively. Detailed diagnostic performance metrics of these models are listed in Table 2. In addition, subgroup analysis was performed between COVID-19 and etiologically confirmed influenza pneumonia or mycoplasma pneumonia, and DL-MLP again demonstrated adequate classification performance with AUCs of 0.891 (95% CI 0.805-0.977) and 0.933 (95% CI 0.865-1.000) (Fig. 5c).
Furthermore, DL-MLP achieved better performance on the external testing dataset with an AUC of 0.959 (95% CI 0.910-1.000), an F1 score of 0.841, and a kappa coefficient of 0.750; its AP reached 0.937 (95% CI 0.877-0.997). Detailed diagnostic performance metrics of the other models are summarized in Table 2 and Fig. 6. Notably, the end-to-end model took only 38 s to diagnose each input CT scan, indicating its high efficiency in practice.
Performance evaluation of the participating radiologists in a reader study. Compared with the junior radiologist, the senior radiologist achieved better overall performance, with diagnostic accuracy, precision, sensitivity, and specificity of 0.90, 0.83, 0.88, and 0.91 on the internal testing dataset and 0.926, 0.964, 0.818, and 0.984 on the external testing dataset (Table 2). The radiologists' diagnostic performance was plotted as points on the ROC and PR curves according to their sensitivity, specificity, and precision (Figs. 5a and 6a). The kappa coefficient of the senior radiologist reached 0.781 and 0.832 on the internal and external testing datasets (Figs. 5b and 6b). In addition, the junior and senior radiologists spent an average of 5.29 min and 5 min, respectively, to diagnose a set of CT images.

Discussion
Early and timely detection of COVID-19 patients is of great importance in containing the pandemic. Practice has shown that CT examination serves as a complementary approach to rRT-PCR for COVID-19 screening in some emergent scenarios [28][29][30]. By integrating a DL segmentation algorithm with radiomics, we developed an end-to-end model using CT images from multiple medical centers to screen COVID-19 patients. ROIs delineated automatically by the DL segmentation algorithm greatly enhanced the application potential of radiomics models in clinical practice. Trained with selected radiomics features, the DL-MLP model demonstrated diagnostic performance comparable to that of a senior radiologist with 15 years' experience on the internal and external testing datasets. Many DL and radiomics models have been developed since the outbreak of COVID-19, focusing on screening, diagnosis, and prognosis of COVID-19 15. However, due to limited medical labor resources and diffuse lesion distribution across multiple sections, ROI annotation remained challenging in many of the current studies 8,9,11. In our study, we used a DL segmentation algorithm that was trained with 507 sets of coarsely annotated CT scans from suspected COVID-19 patients. Lesions were selectively annotated on the CT sections where they predominantly presented. This strategy reduced the annotation workload when medical resources were scarce and eventually achieved adequate results. The DL segmentation algorithm enabled direct application of radiomics models in clinical practice by removing the need for manual annotation, an approach that could be extended to other disease scenarios in which radiomics is used.
Of note, five feature selection methods and four machine learning algorithms were used to find the optimal radiomics model for identifying COVID-19 patients. A total of 20 models were tested and compared on both internal and external testing datasets in terms of AUC. Optimal feature selection methods were first screened by comparing the corresponding model performance on validation sets. Three of the four machine learning models achieved the best AUC when trained with L1-LR selected features. Redundancy of the L1-LR selected features was further tested by modeling without features with strong correlations (PCC ≥ 0.8; PCC ≥ 0.5). All L1-LR selected features were finally used because of their robust performance on the internal and external testing datasets. Machine learning models were trained with the L1-LR selected features. Based on the performance on the internal and external testing datasets in terms of AUC, AP, and other diagnostic performance metrics, the optimal model, MLP, was further analyzed in subgroups and compared with radiologists.
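The correlation-based redundancy check mentioned above can be sketched as a greedy filter that keeps a feature only if its absolute PCC with every previously kept feature is below the threshold. The text does not specify which member of a correlated pair was removed, so the greedy, order-dependent rule used here is an assumption, as is the use of the absolute value of the PCC.

```python
import numpy as np

def drop_correlated_features(X, threshold=0.8):
    """Return indices of features kept after removing those whose absolute
    Pearson correlation with an already kept feature reaches the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

# Toy matrix: feature 1 is almost a rescaled copy of feature 0,
# while feature 2 is independent of both.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_toy = np.column_stack([
    base,
    2.0 * base + 0.01 * rng.normal(size=200),
    rng.normal(size=200),
])
kept_indices = drop_correlated_features(X_toy, threshold=0.8)
```

Running the filter at thresholds of 0.8 and 0.5 would yield two progressively smaller feature sets that can be compared against the full set, in the spirit of the redundancy test described above.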
Current diagnostic performance for COVID-19 varies from model to model due to different development datasets and techniques. Detection sensitivity ranged from 0.83 to 1 while the AUC ranged from 0.81 to 0.996 15,31,32. A recent study ensembled transfer learning with deep convolutional neural networks (15 architectures) to detect COVID-19 on CT images and achieved its best performance with sensitivity of 0.854, accuracy of 0.85, and precision of 0.857 33. Another DL-based multi-view fusion model was developed using CT images with the maximum lung regions in axial, coronal and sagittal views and achieved AUC, accuracy, sensitivity and specificity of 0.819, 0.760, 0.811 and 0.615 on the testing set, respectively 32. In comparison, our study had a similar data size and achieved better diagnostic performance, as evidenced by the AUC, accuracy, sensitivity and specificity of 0.959, 0.884, 0.879 and 0.887 on the external testing dataset. The multi-view fusion model also avoided the annotation problem, in its case by using certain whole CT images; however, that may result in insufficient features to properly detect COVID-19 32. Another deep learning model was trained with a large dataset to distinguish COVID-19 from other pneumonia 34. Like this model, our proposed DL-MLP could also distinguish COVID-19 from etiologically confirmed influenza and mycoplasma pneumonia and achieved better performance in terms of AUC. Notably, radiomics models have also been developed to distinguish COVID-19 and to predict hospital stay, disease severity, and prognosis of COVID-19 patients 10,12-14.

Table 2. Detailed diagnostic metrics of end-to-end models and radiologists on internal and external testing datasets. *On either internal or external testing dataset, different lowercase letters in the same column indicate significant differences among different models or readers (P < 0.05).
An earlier radiomics study that used both lesion and normal region patches cropped from COVID-19 CT scans achieved a higher classification accuracy of 99.68% with GLSZM features 35. However, this study ignored the within-patient correlation between the two classes of image patches. Meanwhile, a radiomics nomogram for predicting COVID-19 was also developed by combining radiomics scores and significantly associated CT characteristics 13 and obtained performance comparable to ours. Note, however, that in addition to evaluation on internal and external testing sets, the proposed DL-MLP model was further validated by comparison with experienced radiologists on the external testing dataset, which substantiated the model's greater application potential in clinical scenarios.
The diagnostic performance of two radiologists served as the benchmark to evaluate the diagnostic efficacy of the models in this study. Unlike studies with imbalanced classes, whose diagnostic threshold was determined by G-Mean 36, our model output the normalized predicted probabilities of each class and achieved adequate performance in identifying COVID-19 with a diagnostic threshold of 0.5 (sFig. 5). Notably, the diagnostic performance of the participating radiologists in identifying COVID-19 was generally comparable to that of radiologists in other studies, with similar sensitivity, specificity and accuracy 11,37. Consistent with previous DL studies 11,37,38, DL-MLP demonstrated diagnostic performance comparable to that of the experienced senior radiologist on both internal and external testing datasets in terms of detection sensitivity, specificity and accuracy. Adequate performance on the external testing dataset further increased the reliability of the end-to-end DL-MLP model. In addition, diagnostic efficiency is another important parameter for evaluating model feasibility. The radiologists' reading time in the current study was comparable to that in a previous study (5.15 min vs. 6.5 min) 11,38; in contrast, the model made a diagnosis in about 38 s, which was much more efficient.
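The two thresholding strategies contrasted above can be illustrated with a small sketch: a fixed diagnostic threshold of 0.5 applied to predicted probabilities, and a threshold chosen to maximize the G-Mean. The probabilities and labels below are invented for illustration, and the function names are hypothetical.

```python
import math

def g_mean_at(threshold, probs, labels):
    """G-Mean (geometric mean of sensitivity and specificity) when cases
    with probability >= threshold are called positive."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 0)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

def best_g_mean_threshold(probs, labels):
    """Candidate thresholds are the observed probabilities; return the
    lowest one that attains the maximal G-Mean."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: g_mean_at(t, probs, labels))

# Invented predicted probabilities of the positive class and true labels.
toy_probs = [0.10, 0.20, 0.30, 0.35, 0.60, 0.80]
toy_labels = [0, 0, 0, 1, 1, 1]

g_mean_fixed = g_mean_at(0.5, toy_probs, toy_labels)
tuned_threshold = best_g_mean_threshold(toy_probs, toy_labels)
```

In this toy example the fixed threshold misses one positive case, whereas the G-Mean criterion moves the threshold down to 0.35 and separates the classes perfectly. A data-driven threshold of this kind must be chosen on tuning data rather than on the testing sets to avoid optimistic bias.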
There are still limitations in this study that can be improved in future research. More radiologists for reader study, the utilization of AI-assisted reading mode, and detailed subgroup analyses could further validate the model's feasibility in clinical practice. In addition, integrating clinical information other than CT images could potentially improve diagnostic performance.
In conclusion, an end-to-end DL-MLP model was developed by integrating the DL segmentation algorithm with the radiomics approach to efficiently distinguish COVID-19 patients from other CAP patients. DL-MLP achieved adequate diagnostic performance that was comparable to that of a senior radiologist on both internal and external testing datasets, demonstrating the algorithm's great potential to assist radiologists in screening suspected COVID-19 cases in conjunction with rRT-PCR testing in emergent scenarios or high-prevalence areas.

Data availability
The data will be made available to others upon reasonable request to the corresponding author.