Effective deep learning approaches for predicting COVID-19 outcomes from chest computed tomography volumes

The rapid evolution of the novel coronavirus disease (COVID-19) pandemic has resulted in an urgent need for effective clinical tools to reduce transmission and manage severe illness. Numerous teams are quickly developing artificial intelligence approaches to these problems, including using deep learning to predict COVID-19 diagnosis and prognosis from chest computed tomography (CT) imaging data. In this work, we assess the value of aggregated chest CT data for COVID-19 prognosis compared to clinical metadata alone. We develop a novel patient-level algorithm to aggregate the chest CT volume into a 2D representation that can be easily integrated with clinical metadata to distinguish COVID-19 pneumonia from chest CT volumes from healthy participants and participants with other viral pneumonia. Furthermore, we present a multitask model for joint segmentation of different classes of pulmonary lesions present in COVID-19 infected lungs that can outperform individual segmentation models for each task. We directly compare this multitask segmentation approach to combining feature-agnostic volumetric CT classification feature maps with clinical metadata for predicting mortality. We show that the combination of features derived from the chest CT volumes improve the AUC performance to 0.80 from the 0.52 obtained by using patients’ clinical data alone. These approaches enable the automated extraction of clinically relevant features from chest CT volumes for risk stratification of COVID-19 patients.

www.nature.com/scientificreports/ slice as independent training examples. In this study, we explore deep learning approaches for extracting clinically relevant features from chest CT volumes using limited data. Figure 1 shows an overview of the CT-based machine learning pipeline we propose. We develop a novel method for aggregating the chest CT volume into a 2D representation of global features that can distinguish COVID-19 pneumonia from other viral pneumonia and normal chest CT volumes, with state-of-the-art performance. Furthermore, we present a multitask model for joint segmentation of different classes of pulmonary lesions present in COVID-19 infected lungs with the goal of extracting local, highly relevant features. We then create prognostic models using the extracted features together with patient demographic data, comparing the performance of models using different combinations of relevant data to predict mortality. The overall goal of this work is to enable automated extraction of relevant features from chest CT volumes that can be incorporated with clinical data for risk stratification of COVID-19 patients. We follow the guidelines for applying artificial intelligence to medical imaging proposed by Mongan et al. 8 in terms of providing modeling details to ensure reproducibility.

Related work
Classification. There are already some published studies on CT-based COVID-19 diagnosis systems 9,10 .
Many researchers 3,4,6,7,11 have proposed different feature extraction approaches to exploit the power of CT scans for COVID-19 diagnosis and prognosis. Zhang et al 7  Segmentation. In addition to diagnostic models, several prediction models have been proposed based on an assessment of lung lesions. There are three typical classes of lesions that can be detected in COVID-19 chest CT scans: ground-glass opacity (GGO), consolidation, and pleural effusion 12,13 . Imaging features of the lesions, including shape, location, extent, and distribution of involvement of each abnormality, have been found to have good predictive power for mortality 14 or hospital stay 15 . These features, however, are mostly derived from the delineated lesions, and so depend heavily on lesion segmentation. Harrison et al 16 showed that a deep learning-based segmentation beats a specialized methodology in cases with interstitial lung maladies. Following the same idea, automatic lung lesion segmentation for COVID-19 has been actively investigated in recent stud- Features obtained from the disease classifier, lesion anatomic extent features, and patient's demographics are then used for mortality prediction using a prognosis model (blue). The classification dataset we used in this paper, the CC-CCII 7 dataset, consists of a total of 514,103 CT slices from 2,471 patients. It includes 156,070 slices from 839 COVID-19+ patients, 159,700 slices from 874 viral pneumonia patients, and 95,459 slices from 758 normal patients. The splits were performed at the CT level with a Train/Test/Validation ratio of 90%/5%/5%. As patients might have undergone multiple scans, for patient classification, we used the highest severity prediction against the highest severity label.
CC-CCII pulmonary lesions segmentation dataset. For the pulmonary lesions segmentation experiments, a set of CT slices were manually segmented slices from the CC-CCII dataset. The segmentation labels were selected as relevant pathological features for distinguishing COVID-19 from other common pneumonia. The annotation included lung field and five commonly seen categories of lesions, including lung consolidation, groundglass opacity(GGO), pulmonary fibrosis, interstitial thickening, and pleural effusion. Segmentation results were annotated and reviewed by five senior radiologists with 15 to 25 years of experience. The full dataset includes a slice segmented for each one of the 1302 available CT scans along with the corresponding polygons outlining each pulmonary lesion present in the slice. 293 slices showed pulmonary fibrosis lesions, 294 slices were obtained from patients within the first week of being diagnosed with COVID-19, 45 slices came from patients after having mild COVID-19, 201 slices were obtained from patients after severe COVID-19, and 489 were from patients with intermediate COVID-19 severity. The severity level was determined based on the size and type of pulmonary lesions as defined by Zhang et al. 7 . Mild was defined as less than three GGO lesions of size less than 3 cm; intermediate was defined as a lesion area more than 25% of the entire lung field; and severe was defined as a lesion area more than 50% of the entire lung field. The segmentation slices were assigned with random probability of 80%, 10%, and 10% to one of three distinct sets: training (1035 slices), validation (134 slices), and test (133 slices). This was done at the severity level to keep similar severity proportions among the sets. The partitions were disjoint at the study (CT scan) level. Best models were selected using the performance on the validation set and were evaluated on the held-out test set.
CC-CCII prognosis dataset. For the prognosis model, a set of 701 scans containing 61,810 CT slices from 136 COVID-19 patients (130 surviving, 6 deceased) at hospital admission in the CC-CCII dataset were labeled as severe (defined as a lesion area more than 50%of the entire lung field) and were not used for any of the other experiments. The data from these COVID-19 patients were combined with available demographics information (age and sex only) to create a set of 105 patients (101 surviving, 4 deceased) with demographics and imaging feature data for the prognosis experiments.
Stony Brook University (SBU) prognosis dataset. To show how our imaging models generalize and improve prognosis, we further tested our models on a larger dataset, with a completely different demographic back- Intensity map projection. We first created a solution for global feature extraction. The proposed method allowed the creation of a fixed-size 2-D global representation of any 3-dimensional CT volume by aggregating individual CT slices/planes' intensity histogram as horizontal entries to create an intensity map.
Texture analysis. Texture analysis methods have proven helpful in describing medical images. The oldest and most widely used texture analysis technique is the intensity histogram. It counts the number of occurrences of intensity values for all pixels of an image and builds a histogram from the cumulative entries. Its output is also a set of features/frequencies.
Occlusion mask. Sometimes, partly visible objects on an image are blocked by other objects located in the foreground. This phenomenon is called occlusion. One can think of lesions (GGO, consolidations, fibrosis, …) as occlusion objects that are masking the "normal" background image.
Let f k , with k = 0, . . . , N − 1 , be N images, or planes, of the same size. Then, the occlusion mask set of the f k planes with respect to a given label function φ : k=0 . In other words, for any pixel x, the label φ(x) determines which one of the multiple pixel values {f k } N−1 k=0 actually appears in the final image Otherwise the slices are stacked up, resized vertically and finally re-sampled horizontally to have the needed number of horizontal planes. Let V a 3D-volume composed of M horizontal planes p j . Put V = p 0 ⊕ p 1 ⊕ · · · ⊕ p M−1 , with ⊕ being the image stacking operation (see Fig. 2b).
Finally, under the operator ⊕ , the histogram of a volume V is the sum of the histograms of its composing planes p j , with the convention that lesions are mapped to the occlusion mask sets Intensity map of CT volumes. Based on Eq. (1) and volume decomposition, we propose to aggregate individual CT slices/planes for creating the intensity map as shown in Algorithm 1. The algorithm sets the intensity map as the unit of prediction for a CT scan while preserving the natural ordering of the slices/planes in the final 2D intensity map. We used the ImageNet pre-trained InceptionResnetV2 as the main classification model. Training was performed over the global set of resampled (if the number of slices doesn't match) 256 slices volumes, each row representing a typical gray-scale 256 bins intensity histogram. For the optimizer we used Nadam with a learning rate equal to 0.0001, and categorical Cross-Entropy was used as the loss function. All classification models were trained using Tensorflow and Keras on 4 GPUs.
(1) www.nature.com/scientificreports/ Obtaining pulmonary lesion features for COVID-19 prognosis model. We let (x n ) N n=1 represent a set of training CT slices. Each slice x n is associated with a corresponding lung contour mask l n . Depending on whether the slice x n shows the presence of a pulmonary lesion it will be also associated to the corresponding lesion masks element of the set lesions ∈ {ggo n , cl n , fl n , it n } . For each pixel (i, j) in the CT slice we aim to assign a label l n when the pixel is located within the lung contours: ggo n for pixels where ground glass opacity is present, cl n for consolidation, fl n for every pixel showing signs if pulmonary fibrosis, and it n for pixels where interstitial thickening is present. The lesion coverage for a particular lesion can then be computed as the ratio between the total amount of pixels showing signs of the lesion and the total of lung contour ( l n ) pixels present on the slice. It is important to notice that a pixel (i, j) in a CT slice image can be associated with multiple lesions.
Multi-decoder segmentation network. We proposed a "multi-task multi-decoder" segmentation network inspired by the U-Net 21 architecture where the encoding part is shared among the different pulmonary lesion tasks with independent decoding heads. We referred to it as "multi-task segm. net", a U-Net-like multitask network architecture where both encoder and decoder parameters are shared among the different segmentation tasks. See Figure A.2 describing the proposed network architecture in the Appendix section B.
All training slices in the segmentation dataset were divided by 255 to get values in a range from zero to one. Resulting slices are pre-processed for zero mean and unit variance using mean (0.481456) and (0.398588) and standard deviation calculated over the entire dataset. Both "multi-task multidecoder segm. net" and "multi-task segm. net" networks can be trained end-to-end using gradient-based optimization. The full criterion is described in Eq. (2), where α GGO , α cl , α fl , and α it are hyper-parameters. Since the tasks are very imbalanced, we used the ratio of the number of slices not showing the lesion over the number of slices showing that particular lesion as the corresponding alpha value (1.29, 1.53, 2.9, 7 respectively in our experiments). L GGO , L cl , L fl , and L it can be any standard segmentation loss as Jaccard or binary cross-entropy (BCE). For our experiments we used weighted BCE with weights 0.3 for background and 0.7 for lesion prediction, since often the lesions cover a very small region of the slices.
All segmentation models were trained using PyTorch on 4 GPUs. All layers but Batch Normalization were initialized using PyTorch's default initialization method. Batch Norm weights were initialized using a normal distribution. Refer to the Appendix section B.1 for more implementation details.
We report performance results using mean intersection over Union (mIoU), a standard metric for semantic segmentation 22 .
(2) L multitask = α GGO * L GGO + α cl * L cl + α fl * L fl + α it * L it Lung contour segmentation. All segmentation slices have labels for lung contours and most standard semantic segmentation networks can segment the lung very accurately (the lung segmentation model had 94.09% mean IoU performance). For our pipeline, we used the U-Net architecture from Ronneberger et al 21 .
Lesion anatomic extent estimation. A lesion anatomic extent was computed at the patient level as the percentage of the lung coverage by the lesion. The lung area was obtained using the lung segmentation model and the pulmonary lesion area was obtained from the multitask segmentation model predictions applied to multiple slices from the middle of the CT (3 to 5 slices) and averaged out lesion sizes. For patients with multiple CT scans available, the final lesion score was obtained by averaging the scan's lesion scores. The anatomic extents of these pulmonary lesions were used as features for the prognosis model presented in the following step.
Prognosis model. We compared the performance of five different machine learning models for mortality prediction in COVID-19 patients. For these models, we used demographic information alone, intensity map classification alone, segmentation alone, and combinations of classification and segmentation approaches combined with demographic information, in order to determine which model most accurately predicted mortality. As there were very few deaths recorded, we applied sub-sampling on the majority class (survivors) to balance the dataset. We experimented with Multi-tree XGBoost, Random Forest, Extra randomized tree, and Logistic regression models, with multiple stratified splits. Among these models, the Extra randomized tree was most successful for predicting mortality using a combination of different feature sets. To explore the global feature set from the classification pipeline, we examined two separate cutoffs for features by PCA: the top 3 and the top 18 features. The extra randomized tree classification model was trained using Scikit-Learn python library.

Experiments and results
Diagnosis results on CC-CCII dataset. For each CT scan, we created an intensity map representation following the methodology described in the previous section. The intensity maps can then be used as input to a classification model for diagnosis. Table 1 shows the achieved performance in both validation and test sets from the CC-CCII classification dataset using a k-fold Cross-Validation procedure ( k = 10). On a per-patient basis, the overall accuracy of the InceptionResnetV2 classifier trained using our intensity map projection is 95.3% (95% CI 92.0-98.7) in the validation set and 96.0% (96% CI 92.864-99.136%) in the test set.
We compare our approach to Zhang et al. 7  Comparing Table 1 per-scan with results from Zhang et al. 7 , our method achieved higher AUC value than the baseline method for the three classes Pneumonia/COVID-19/Normal. COVID-19 sensitivity/specificity/ ROC scores are also higher, showing the usefulness of our proposed representation. Moreover, our method uses a single model and does not rely on lung or lesion segmentation.
Multi-decoder segmentation results. The next step is to perform semantic segmentation using our proposed multi-decoder segmentation network on the CT slices of COVID-19 patients. Lesion segmentation results will later be used to obtain each lesion anatomic extent.
Evaluation metrics. Not all types of lesions are present in a patient CT slice. In our training set 75% of the slices presented consolidation, 87% of the slices included 1 or more GGO, but only 29% and 10% of the slices included  www.nature.com/scientificreports/ pulmonary fibrosis and interstitial thickening respectively. In those cases, we assume that background is the only class and assign mIoU of 1 to models not predicting that class. This makes this metric not very informative for certain lesions. Hence, we also report Global mean IoU (GIoU) as the mean IoU calculated only over slices where the lesion of interest is present. For all segmentation experiments we report the average performance and corresponding standard deviation after three training runs using different random data splits. Table 2 shows the quantitative performance of our proposed approach compared to using individual models. Multitask models show better performance for all pulmonary lesions even when the number of trainable parameters was the same as the parameters of individual models. The performance improvement is even more noticeable in the low data regime as is shown in the supplementary document where models were trained using 50 percent of the training data (the number of available test slices for fibrosis and thickening is very small and might not accurately reflect model's performance). Figure 3 shows qualitative results for the "multi-task multi-decoder segm. net" on the test set. Model's predictions closely align with the masks generated by expert radiologists.
Prognosis results. CC-CCII prognosis. We performed prognosis experiments following the leave one participant out cross-validation (LOOCV), an extreme version of k -fold cross-validation where k is set to the number of examples in the prognosis dataset described in Section 2.5 using the prognosis model previously described. Results are shown in Table 3. We use standard deviation to calculate the errors, which gives us the variability of the sample means over 10 iterations. The demographics alone (age and gender) achieved an AUC of 0.52 ± 0.01. To explore the global feature set from the classification model, we obtained top 3 and the top 18 PCA components from the features activation of the second last layer of the classification model, with the top 3  www.nature.com/scientificreports/ explaining 50% of the variance. These PCA components are used as input features to the prognosis model. The model using the top 18 PCA features performed poorly compared to using the top 3 features alone. Interestingly, using local data from the segmentation pipeline alone achieved similar performance to using demographics data alone. However, by including the segmentation features with the demographics, the AUC improved the AUC to 0.71, suggesting that imaging features encoded clinically relevant features for mortality prediction. We apply the ordinary least square statistical model to calculate the p-value for each feature of the prognosis model. The p-value for demographics (age and gender) showed up as statistically more significant than that of the CT segmentation and top 3 PCA features. Among those of the CT segmentation and top 3 PCA features, the feature fibrosis was the most statistically significant. The highest performing model used a combination of patient demographics, the top 3 PCA features from the classification model, and the segmentation features, for an AUC of 0.80 and 0.04 standard error. Prognosis results on all 136 severe COVID-19 patients are explained in the Appendix section D.
SBU prognosis. To further generalize our models, we use a dataset from SBU with a bigger patient base of 241 patients (205 surviving, 36 deceased) and different demographic distribution. We used our prior trained imaging models to extract imaging features from the CT scans of this dataset (following the same steps as we did for CC-CCII dataset). Then we combined the patient demographics (age and sex only) with the extracted imaging features. We performed prognosis experiments following the leave one participant out cross-validation (LOOCV), as we did for the CC-CCII dataset above. Results are shown in Table 4. We use standard deviation to calculate the errors, which gives us the variability of the sample means over 10 iterations. The demographics alone (age and gender) achieved an AUC of 0.64 ± 0.03. To explore the global feature set from the classification model, we obtain the top 3 PCA components from the features activation of the second last layer of the classification model, with the top 3 explaining 90% of the variance. These PCA components are used as input features to the prognosis model. Interestingly, using local data from the segmentation pipeline alone achieved slightly better performance compared to using demographics data alone. Including the segmentation features with the demographics, the AUC improved further to 0.76, suggesting that imaging features encoded clinically relevant features for mortality prediction. The highest performing model used a combination of patient demographics, the top 3 PCA features from the classification model, and the segmentation features, for an AUC of 0.81 and 0.03 standard deviation. We apply the ordinary least square statistical model to calculate the feature importance. The demographics (age and gender) shows up with slightly higher predictive power than the CT segmentation and top 3 PCA features. The fibrosis anatomic extent shows up as the CT feature with the most predictive power.

Discussion
We developed two methods for automated extraction of clinically relevant features from chest CT images in the setting of limited data and combined these features with demographic information to develop a prognosis model for COVID-19 outcomes. First, the intensity map projection created 2D representations of 3D CT volumes, which can then be used in a COVID-19 diagnostic model without the need for lesion segmentation. Next, the multitask segmentation model identified four pulmonary lesions specific to COVID-19 infection while computing the location and extent of each type of lesion. Taken together, these two approaches provide a method for identifying 3D pulmonary pathology data from a CT scan that can be combined with other clinical data to improve predictions about COVID-19 outcomes. We thus created a prognosis model that analyzed the most relevant CT imaging findings in combination with demographic information to achieve more accurate outcome predictions than models based on demographics, intensity mapping, or segmented features alone. Improving the accuracy of prognosis models is critical in the setting of a pandemic. Early screening with a reliable prognosis model could reduce the burden on the hospital system, by identifying patients who could recover safely at home. Many studies have developed prognostic models for COVID-19, but a recent systematic review found that many were either poorly reported and/or at risk for bias 23 . One source of bias is the lack of available clinical data. CT imaging models are often limited by the small number of COVID-19 positive patients in the available datasets, and are further limited to analyzing only slices that contain pre-segmented lesions 9 . Our intensity map projecting method enhances global disease feature extraction from CT volumes by translating 3D information from the entire lung region into a 2D map. The resulting model performed better at COVID-19 diagnosis than a model trained on individual CT slices. COVID-19 pneumonia can be difficult to differentiate from other pneumonias in the early stages of disease, so analyzing the entire lung volume may enhance the model's ability to detect pathology specific to COVID-19 infection. The multi-task segmentation model, however, allowed for simultaneous assessment of key localized COVID-19 pulmonary lesions and was able to take advantage of related information between regions of interest, outperforming segmentation models designed to assess individual lesions. We provide a small set of guidelines to follow while testing our models on new data in section C of the supplementary document.
Relevant imaging data can be difficult to extract and integrate with other types of clinical data when building prognosis models 24 . Our methods for extracting the features from CT data that are most specific for COVID-19 facilitates integration of such information with clinical data. The results of our prognosis model experiments demonstrate how adding the most relevant imaging data improved the performance of the prognostic model. The highest performing model used a combination of patient demographics, the top three features from the classification model (based on intensity mapping), and the segmentation features. Interestingly, using local data from the segmentation pipeline alone achieved similar performance to using demographics data alone. However by including the segmentation features with the demographics, the AUC improved, suggesting that imaging features encoded information that was clinically relevant for mortality prediction. These results suggest that both the intensity mapping approach and the segmentation approach extracted complementary clinical information from the CT volumes which, when considered together, are useful for predicting mortality. The intensity mapping provided a global representation of features from the entire volume, while the segmentation model contributed information about localized but highly relevant features. While the accuracy of the best model in Table 4 is above 90%, it is important to note that the best performing model had relatively low precision and recall. The relative importance of these features are important for understanding the disease process but further work would need to be done before this model could be used in a clinical context. Clinicians rely on both imaging data and clinical information to assess patients and predict disease course. The most accurate prognostic models for COVID-19 outcomes will rely on information that is most specific to risk for severe COVID-19 disease. In the setting of limited patient data, it is challenging to prevent deep learning models from associating less clinically significant or even unrelated data with specific outcomes. The CC-CCII prognosis dataset used for testing includes a limited number COVID-19 deceased patients. Performing a similar analysis using a larger and more balanced dataset might be worthwhile once more data becomes available. In this study, we demonstrate an approach that combines two methods for identifying relevant pulmonary findings associated with COVID-19 disease that are useful for predicting clinical outcomes. Ideally, these approaches will help to enable rapid triage of newly infected patients, allowing for early intervention to prevent severe disease and better manage clinical resource allocation.