Introduction

Hospitalized COVID-19 patients are likely to develop severe outcomes requiring mechanical ventilation or high-flow oxygenation. Among hospitalized patients, 14–30% will require admission to an intensive care unit (ICU), 12–33% will require mechanical ventilation, and 20–33% will die1,2,3,4. Detection at admission of patients at risk of severe outcomes is important to deliver proper care and to optimize use of limited ICU resources5.

Identification of hospitalized COVID-19 patients at risk for severe deterioration can be done using risk scores that combine several factors including age, sex, and comorbidities (CALL, COVID-GRAM, 4C Mortality Score)6,7,8,9,10,11,12. Some risk scores also include additional markers of severity such as the dyspnea symptom, clinical examination variables such as low oxygen saturation and elevated respiratory rate, as well as biological factors reflecting multi-organ failures such as elevated lactate dehydrogenase (LDH) values8,10,13,14,15.

Beyond clinical and biological variables, computerized tomography (CT) scans also contain prognostic information, as the degree of pulmonary inflammation is associated with clinical symptoms, and the amount of lung abnormality is associated with severe evolution16,17,18,19,20. CT scans can be acquired at admission to diagnose COVID-19 when RT-PCR results are negative21. However, the extent to which CT scans at patient admission add prognostic information beyond what can be inferred from clinical and biological data is unresolved.

The objective of this study was to integrate clinical, biological, and radiological data to predict the outcome of hospitalized patients. By processing CT scan images with a deep learning model and by using a radiologist report that contains a semi-quantitative description of CT scans, we evaluated the additional amount of information brought by CT scans.

Here, we show that integrating clinical and biological data with a deep learning CT scan analysis more accurately predicts severity of COVID-19 among hospitalized patients than existing scores for severity.

Results

A total of 1003 patients from Kremlin-Bicêtre (KB, Paris, France) and Gustave Roussy (IGR, Villejuif, France) were enrolled in the study. Clinical, biological, and CT scan images and reports were collected at hospital admission. There were 931 patients for whom clinical, biological, and CT scan data were available (Supplementary Fig. 1). A total of 506,341 images were analyzed for the 980 patients with available CT scans (average of 517 slices per scan). Radiologists annotated 17,873 images from 329 CT scans. Summary statistics for the clinical, biological, and CT scan data are provided in Table 1.

Table 1 Population description for the KB and IGR hospitals and association between variables measured at admission and severity.

Variables associated with severity

We first evaluated how clinical and biological variables measured at admission were associated with future severe progression, which we defined as an oxygen flow rate of 15 L/min or higher and/or the need for mechanical ventilation and/or patient death22. This definition of severe progression corresponds to a score of 5 or more according to the World Health Organization evaluation of severity on a 1–10 scale. We computed the severity odds ratios for each individual variable at each hospital center (Table 1). When combining association results from the two centers, we found 12 variables significantly associated with severity (p < 0.05/58 to account for testing 58 variables, Table 1): age, sex, oxygen saturation, diastolic pressure, respiratory rate, chronic kidney disease, hypertension, LDH, urea, CRP, polynuclear neutrophils, and leukocytes.

We then assessed the predictive value of features from admission radiology reports. These reports contain semi-quantitative evaluations of the extent of disease, whose values range from 0 to 5, as well as a presence/absence coding of several types of lung lesions in COVID-19 patients. We found three significantly associated features (p < 0.05/58): extent of disease and presence of crazy-paving lesions, which are both associated with greater severity, and presence of a peripheral distribution of lesions, which is associated with lesser severity.

A neural network model to predict severity based on CT scans

To capture CT scan prognosis information from images, we considered a weakly supervised approach with no radiologist-provided annotations (Supplementary Fig. 2)23. A deep learning model was trained to predict severe progression based on a CT scan image. The neural network was trained on a development cohort consisting of 646 patients from Kremlin-Bicêtre Hospital (KB). It was evaluated on 150 KB patients, who were left over from the development cohort, and it was further evaluated using a validation cohort consisting of 135 patients from Institut Gustave Roussy hospital (IGR). The discriminative ability of the neural network was AUC = 0.76 (0.67, 0.85) for the 150 leftover KB patients, who were not used to train the network, and AUC = 0.75 (0.65, 0.84) for the validation IGR dataset. As a point of comparison, the AUC obtained with the radiologist evaluation of disease extent was 0.73 (0.64–0.82) for the 150 KB patients of the development cohort and 0.66 (0.56–0.76) for the validation IGR cohort, and the difference between the two AUC values was significant for the validation IGR cohort only (p ≤ 0.05).
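As an illustration of the discrimination measure reported above, the AUC can be read as the probability that a randomly chosen severe patient receives a higher score than a randomly chosen non-severe patient. The following minimal sketch computes it by pairwise comparison; the function name and the toy scores are hypothetical and are not data from the study.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    labels: 1 = severe outcome, 0 = no severe outcome.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not study data):
# 8 of the 9 positive/negative pairs are ordered correctly.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
```

The quadratic pairwise loop is chosen for readability; for cohorts of a few hundred patients it remains fast enough.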

Interpretability analysis of the neural network model

To understand the information present within the CT scans that is captured by the weakly supervised neural network model, we evaluated to what extent the features (internal representation) extracted by the neural network can predict clinical and radiological variables. To this end, we trained a new logistic regression with the extracted features as input, and some clinical and radiological variables as output. AUC on the 150 leftover patients of the KB development cohort was 0.93 (0.88, 0.97) for disease extent (threshold > 2), 0.78 (0.70, 0.85) for crazy paving, 0.64 (0.53, 0.74) for condensation, and 0.80 (0.65, 0.94) for ground glass opacity (GGO) (Supplementary Table 1). It was also possible to relate internal representations of the neural network to clinical variables. We obtained an AUC of 0.88 (0.82, 0.94) for predicting an age strictly greater than 60 years, an AUC of 0.93 (0.89, 0.97) for sex, and of 0.76 (0.68, 0.84) for predicting an oxygen saturation greater than 90%. As a comparison, a logistic regression trained on the variables from the radiology report obtained only AUC scores of 0.70 (0.61, 0.78) for age, 0.57 (0.48, 0.67) for sex, and 0.68 (0.58, 0.77) for oxygen saturation, and differences of AUC were significant (p < 0.05). Simply put, this analysis shows that the internal representation of the neural network captures clinical features from the lung CTs, such as sex or age, on top of the known COVID-19 radiology features.

A multimodal prognostic model for severity

To add information from lab tests and chest characteristics to the CT scan information, we constructed the AI-severity score. We used a greedy search approach to include optimal clinical and biological variables (Methods). In addition to the CT deep learning variable, the variables included in AI-severity are age, sex, oxygen saturation, urea, and platelet counts. Coefficients and transformations required to compute the 6-variable AI-severity score are available in Supplementary Table 2. Coefficients required to compute AI-severity were learned using the WHO-defined high severity outcome of "oxygen flow rate of 15 L/min or higher, or need for mechanical ventilation, or death." All the prognosis scores were also evaluated on two other outcomes that consist of "death or ICU admission" and "death."
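The greedy search mentioned above can be pictured as forward selection: starting from an empty set, the variable that most improves a performance criterion is added at each step, and the search stops when no candidate brings a gain. The sketch below is a generic illustration under stated assumptions. The criterion `score_fn` (for instance a cross-validated AUC), the stopping rule, and the toy gains are all hypothetical; the procedure actually used is the one described in the Methods.

```python
def greedy_forward_selection(candidates, score_fn, min_gain=1e-9):
    """Generic forward selection.

    candidates: list of variable names.
    score_fn:   callable taking a list of variable names and returning a
                performance score (higher is better). Hypothetical here.
    Returns the selected variables (in order of inclusion) and the final score.
    """
    selected = []
    best = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        # Score every candidate when added to the current selection.
        trial = {v: score_fn(selected + [v]) for v in remaining}
        top = max(remaining, key=lambda v: trial[v])
        if trial[top] - best <= min_gain:
            break  # no candidate improves the criterion
        selected.append(top)
        remaining.remove(top)
        best = trial[top]
    return selected, best


# Toy criterion (hypothetical numbers): each variable adds a fixed gain,
# and every included variable costs a small complexity penalty.
TOY_GAINS = {"age": 0.05, "sex": 0.01, "urea": 0.03, "cough": 0.0}

def toy_score(variables):
    return 0.70 + sum(TOY_GAINS[v] for v in variables) - 0.005 * len(variables)

toy_selected, toy_best = greedy_forward_selection(list(TOY_GAINS), toy_score)
```

With these toy gains the search keeps age, urea, and sex, and rejects the uninformative variable because its penalty outweighs its gain.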

We evaluated AI-severity with several statistical measures of performance. The discriminative ability of AI-severity was AUC = 0.78 (0.69, 0.86) for the 150 leftover KB patients, and AUC = 0.79 (0.70, 0.87) for the validation IGR dataset. We also evaluated calibration properties of AI-severity using a calibration plot (Supplementary Fig. 3)24. We found slopes of 0.949 (0.650, 1.371) (150 leftover individuals at KB) and 0.996 (0.755, 1.383) (IGR), and intercepts (calibration-in-the-large) of −0.206 (−0.564, 0.172) (KB) and 0.529 (0.088, 1.084) (IGR). Estimated slopes and intercepts indicated correct calibration of AI-severity for the leftover patients of the development KB cohort and an underestimation of severe outcomes for the validation IGR cohort; AI-severity predicted a mean severity of 22% (0.18, 0.25) for the 135 IGR patients, whereas severe outcomes occurred for 30% (0.22, 0.37) of these patients.
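To make the calibration slope and intercept concrete, the sketch below fits a logistic regression of the observed binary outcome on the log odds of the predicted probability, so that a perfectly calibrated score gives an intercept of 0 and a slope of 1. It is a minimal illustration using Newton-Raphson on toy data; it is not the routine used in the study, and some definitions of calibration-in-the-large instead estimate the intercept with the slope fixed at 1. Function names and data are hypothetical.

```python
import math


def logit(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def calibration_intercept_slope(pred_probs, outcomes, n_iter=50, tol=1e-10):
    """Fit outcome ~ intercept + slope * logit(predicted probability).

    Two-parameter logistic regression solved by Newton-Raphson.
    Returns (intercept, slope).
    """
    x = [logit(p) for p in pred_probs]
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # gradient w.r.t. intercept
            g1 += (yi - mu) * xi     # gradient w.r.t. slope
            h00 += w                 # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise ValueError("singular information matrix")
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Toy data (hypothetical): three risk groups whose observed event rates
# equal the predicted probabilities, i.e. a perfectly calibrated score.
toy_probs = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
toy_outcomes = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
toy_intercept, toy_slope = calibration_intercept_slope(toy_probs, toy_outcomes)
```

In this reading, a positive intercept such as the one reported for IGR indicates that observed severe outcomes were more frequent than predicted.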

To compute additional measures of performance, individuals in the top tercile were assigned to a high-risk group. We found that the survival function of the individuals at high risk was significantly different from the survival function of the other individuals (Fig. 1, p = 4.77e–07 at KB, p = 4.00e–12 at IGR for a log-rank test). When considering a binary classification consisting of a high-risk group and a medium- or low-risk group, we obtained, for the “O2 ≥ 15 L/min or Ventilation or Death” outcome, positive predictive values (or precision) of 54% (0.40–0.67) (KB) and 76% (0.56–0.92) (IGR), negative predictive values of 86% (0.78–0.93) (KB) and 81% (0.73–0.88) (IGR), specificities of 75% (0.66–0.84) (KB) and 94% (0.89–0.98) (IGR), and sensitivities of 70% (0.56–0.83) (KB) and 47% (0.30–0.63) (IGR) (Table 2).
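The binary classification described above can be sketched as follows: a threshold is set at the upper tercile of the scores in a development set, patients above it are labeled high risk, and the four standard measures are derived from the resulting confusion counts. This is an illustrative sketch only. The quantile convention (the default of Python's `statistics.quantiles`), the strict inequality, and the toy scores are assumptions, not details taken from the study.

```python
from statistics import quantiles


def high_risk_threshold(dev_scores):
    """Upper tercile cut point (2/3 quantile) of the development scores."""
    return quantiles(dev_scores, n=3)[1]


def _ratio(num, den):
    """Safe division: returns NaN when the denominator is zero."""
    return num / den if den else float("nan")


def binary_metrics(scores, labels, threshold):
    """Sensitivity, specificity, PPV, and NPV for 'score > threshold'."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        high_risk = s > threshold
        if high_risk and y == 1:
            tp += 1
        elif high_risk:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Toy data (hypothetical scores, not study data).
dev_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60]
eval_scores = [0.9, 0.7, 0.45, 0.2, 0.5, 0.3, 0.1, 0.05, 0.15, 0.25]
eval_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

threshold = high_risk_threshold(dev_scores)
metrics = binary_metrics(eval_scores, eval_labels, threshold)
```

Fixing the threshold on the development set and reusing it unchanged on the evaluation set mirrors the way the threshold is described in the caption of Fig. 1.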

Fig. 1: Kaplan–Meier curves for the high-risk individuals and the ones with low or medium risk according to AI-severity.

The threshold to assign individuals into a high-risk group was the 2/3 quantile of the AI-severity score computed for patients of the KB development cohort. a Kaplan–Meier curves were obtained for the 150 leftover KB patients from the development cohort. b Kaplan–Meier curves were obtained for the 135 patients of the IGR validation cohort. p-values for the log-rank test were equal to 4.77e–07 (KB) and 4.00e–12 (IGR). The two terciles used to determine threshold values for low-, medium-, and high-risk groups were equal to 0.187 and 0.375. Diamonds correspond to censoring of patients who were still hospitalized at the time when data ceased to be updated. The bands correspond to the sequence of the 95% confidence intervals of the survival probabilities for each day. KB Kremlin-Bicêtre hospital, IGR Institut Gustave Roussy hospital.
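For readers unfamiliar with the estimator behind these curves, the following minimal sketch computes a Kaplan–Meier survival curve in which censored patients (for example those still hospitalized when data collection stopped) leave the risk set without counting as events. The function name and the toy follow-up times are hypothetical; confidence bands and the log-rank test are not reproduced here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times:  follow-up time of each patient (e.g. days since admission).
    events: 1 if the outcome was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after t.
        n_at_risk -= n_leaving
    return curve


# Toy follow-up data (hypothetical): two patients are censored.
toy_curve = kaplan_meier([2, 3, 3, 5, 8, 8], [1, 1, 0, 1, 0, 1])
```

Patients censored at the same time as an event are still counted as at risk at that time, which is the usual convention.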

Table 2 Statistical measures of the performance of AI-severity.

AI-severity outperformed 11 previously published severity or mortality scores that were developed using 200–50,000 patients in the development and validation cohorts (Fig. 2 and Supplementary Table 3). The mean difference (averaged over outcomes) between the AUC of AI-severity and of other scores ranged between 0.05 (4C mortality, COVID-GRAM, CURB-65, MIT analytics) and 0.16 (NEWS2) for the 150 leftover patients of the KB development cohort and between 0.07 (NEWS2 for COVID-19) and 0.28 (the score of ref. 16) for the 135 patients of the IGR validation cohort. Among alternative scores, the COVID-GRAM, the NEWS2 for COVID-19 score, and the 4C mortality scores were the ones with the largest mean AUC values (averaged over outcomes and hospitals). The AUC of AI-severity was significantly larger than that of the NEWS2 for COVID-19 score for all outcomes when evaluated with the leftover patients of the KB development cohort and for the “Death or ICU” and the “Death” outcomes when evaluated with patients from the IGR validation cohort. Differences between AI-severity on the one hand and the COVID-GRAM score or the 4C mortality score on the other hand were significant only for the “Death or ICU” outcome when evaluated on the leftover patients of the KB development cohort, but they were significant for all outcomes when evaluated on the validation IGR cohort.

Fig. 2: AUC values when comparing AI-severity to other prognostic scores for COVID-19 severity/mortality.

The AI-severity model was trained using the severity outcome defined as an oxygen flow rate of 15 L/min or higher, the need for mechanical ventilation, or death. When evaluating AI-severity on the alternative outcomes, the model was not trained again. a AUC results are reported on the leftover KB patients from the development cohort (150 patients). b The mean AUC (averaged over outcomes and over hospitals) as a function of the sample size (sum of sample sizes for the development and validation cohorts) used to construct the score. c AUC results are reported on the external validation set from IGR (135 patients). Models are sorted from left to right (and from top to bottom in the legend) by decreasing order of AUC values (averaged over outcomes and over hospitals). Error bars represent the 95% confidence intervals obtained with the DeLong procedure. Stars indicate the order of magnitude of p-values for the DeLong one-sided test in which we test if AUC(AI-severity) > AUC(other score): • 0.05 < p ≤ 0.10, * 0.01 < p ≤ 0.05, ** 0.001 < p ≤ 0.01, *** p ≤ 0.001. KB Kremlin-Bicêtre hospital, IGR Institut Gustave Roussy hospital, ICU intensive care unit, NEWS2 National Early Warning Score 2, AUC area under the curve.

Development of alternative models that include CT scan information

In addition to AI-severity, we considered two alternative scores that also integrate CT scan information. The two scores include the same clinical and biological variables (age, sex, oxygen saturation, urea, platelets) as AI-severity. The first score (AI-segment) uses an automatic quantification of disease extent to include CT scan information and the second score (C & B & RR) considers a radiologist quantification—available in the radiological report—instead. AI-segment relies on segmentation of lesions that was performed by training another deep learning model using fully annotated and partially annotated CT scans (Supplementary Notes). The correlation between automatic quantification of lung lesions with AI-segment and radiologist quantification was of 0.56 (Supplementary Fig. 4 and Supplementary Notes).

AI-severity has a superior discriminative ability when compared to the alternative C & B & RR and AI-segment scores, although differences of AUC were generally not significant (Supplementary Fig. 5). The mean difference, averaged over outcomes, between AUC(AI-severity) and AUC(C & B & RR) (resp. AUC(AI-segment)) was zero (resp. 0.03) for the 150 leftover KB patients of the development cohort and 0.04 (resp. 0.01) for the IGR validation cohort. Differences between scores were not significant except when comparing AI-severity to AI-segment at KB for the “Death or ICU” outcome (Supplementary Fig. 5).

Additional value of CT scan information

Last, we evaluated to what extent CT scan adds prognosis information to the clinical characteristics and biological variables from lab tests. To this end, we trained a score named C & B based on clinical and biological variables only. The AUC of the scores that integrate CT scan information was larger than or equivalent to the AUC of the C & B score (Supplementary Fig. 6). The mean difference, averaged over outcomes, between AUC(AI-severity) and AUC(C & B) was equal to 0.03 for both cohorts. Differences between AI-severity and C & B were significant for some outcomes and cohorts but not for all combinations (Supplementary Fig. 6). We also computed the confusion matrix for the outcome “oxygen flow rate of 15 L/min or higher and/or the need for mechanical ventilation and/or patient death” (Fig. 3). AI-severity correctly classified 3 and 4 additional positive patients among the 44 and 40 positive patients of the development and validation cohorts when compared to C & B and 4 additional negative patients among the 106 and 95 negative patients of the cohorts. Overall, CT scan information increases AUC by a measurable but limited amount in both cohorts; there was a difference of AUC of 0.03 when comparing AI-severity to the C & B score.

Fig. 3: Confusion matrix obtained with AI-severity, which includes CT scan information in addition to clinical and biological variables and with C & B, which contains only clinical and biological variables.

Values in the matrices correspond to the number of patients in each category, which is defined by the true severity status and the predicted one. The confusion matrix was computed using the outcome “oxygen flow rate of 15 L/min or higher and/or the need for mechanical ventilation and/or patient death.” For both scores, we considered the 2/3 quantile, computed using the development cohort (KB), to distinguish severe patients from non-severe patients. In addition to the neural network variable computed from CT scan images, the variables included in AI-severity consist of oxygen saturation, age, sex, platelet, and urea. The variables included in C & B consist of oxygen saturation, age, sex, platelet, urea, LDH, hypertension, chronic kidney disease, dyspnea, and neutrophil values. Both scores were constructed using a feature selection algorithm that selected optimal variables. KB Kremlin-Bicêtre hospital, IGR Institut Gustave Roussy hospital.

To interpret the difference of AUC, we computed differences of AUC for several subgroups of patients. Because CT scan information is correlated with markers of inflammation25, we considered subgroups of patients with different levels of inflammation. The difference of AUC was significantly larger in patients with higher levels of inflammatory markers for 150 leftover patients of the development cohort (KB) (paired t-test, p = 0.003) but the difference was not significant for the validation cohort (IGR) (paired t-test, p = 0.24) (Supplementary Fig. 7). In both cohorts, the subgroup analysis suggested that prognosis of patients with larger values of CRP, LDH, and leukocytes benefited from the inclusion of CT scan information (Supplementary Fig. 7).

To further investigate the added prognosis value of CT scan, we studied the association between COVID-19 severity and the prognosis variable provided by the neural network. In the KB dataset, the three variables that were the most correlated with the prognosis variable of the neural network were oxygen saturation (r = −0.52 (−0.58, −0.48)), LDH (r = 0.46 (0.39, 0.52)), and CRP (r = 0.43 (0.37, 0.49)) (Supplementary Table 4). To account for the confounding effect of these variables, we regressed the severity outcome with the neural network prognosis variable and the three correlated variables. We found that the neural network variable was significantly correlated with the severity outcome (p = 0.01). The statistical evidence for association between the neural network prognosis variable and COVID-19 severity was also found (p = 3.24 × 10^−6) when accounting for the five additional variables of AI-severity. This confirms that CT scan information captured by the neural network brings unique prognostic information.

Discussion

Using a deep learning model to capture CT scan prognosis information, we have built the AI-severity score to prognose severe evolution for COVID-19 hospitalized patients. In addition to the deep learning variable, AI-severity is based on age, sex, oxygen saturation, urea, and platelet counts. On the IGR validation cohort containing a majority of cancer patients, AI-severity provided values of AUCs significantly larger than the ones obtained with the best prognosis scores of our comparative analysis, which consist of COVID-GRAM, the NEWS2 score modified for COVID-19 patients, and the 4C mortality score6,12,26. Taken together, these results show that future disease severity markers are present within routine CT scans performed at admission.

Looking back on the prognostic clinical and biological variables, we found 12 of these significantly associated with severe evolution, which is consistent with previous studies15,27,28. First, looking at clinical characteristics, we confirmed that men and older persons are more at risk29. Second, looking at clinical examination variables, we found that respiratory rate, diastolic pressure, and oxygen saturation are clinical variables associated with severity. These associations may reflect physician decisions taken for ICU triage. Inclusion criteria for critical care triage include (i) requirement for invasive ventilatory support characterized by an oxygen saturation lower than 90%, or by respiratory failure, or (ii) requirement for vasopressors characterized by hypotension and low blood pressure30. Third, looking at comorbidities, we confirmed the results of several meta-analyses28,31,32,33 that showed that chronic kidney disease and hypertension are linked to severity. However, we did not find significant associations for other comorbidities previously associated with severity, such as diabetes and cardiovascular diseases33,34. While we expected cancer patients to have more severe outcomes because they are generally older, with multiple comorbidities and often in a treatment-induced immunosuppressive state35,36,37, we did not find this association. Several factors can explain this. In particular, neither cohort was balanced enough to conclusively study the association between cancer and severity: IGR admitted mostly cancer patients (80% of the patients), while KB admitted very few cancer patients (7%). Fourth, looking at COVID-19 symptoms, we did not find any significantly associated with severity. Dyspnea is a prominent symptom that has been repeatedly associated with severity, and our results are compatible with a positive association with severity, but our sample size may be too small for the association to reach significance6,38,39. Last, looking at biological measures, we found that inflammatory biomarkers, LDH, and CRP are related to severity14,27,40. We also found an association of severity with leukocytes, neutrophils, and urea, the latter being explained by the fact that high urea is indicative of kidney dysfunction. Thrombocytopenia (low platelet count) was not significantly associated with severity, possibly because of lack of statistical power and stringent correction for multiple testing, but the association between thrombocytopenia and severity was in the expected direction and platelet counts are included in the 6-variable AI-severity score41.

Beyond these clinical and biological variables, chest CT scans provided additional markers of disease severity. Significant features include the total extent of lesions, and the presence of crazy-paving pattern lesions. Although disease extent and consolidation are known to be associated with severity16,19,42,43,44,45,46,47, our study also found an association between severity and crazy paving, a precursor of consolidation lesions. Initial damage to the alveoli, as well as protein and fibrous exudation, explains the early onset of GGO. As the disease progresses, more and more inflammatory cells infiltrate the alveoli and interstitial space, followed by diffuse alveolar lesions and the formation of a hyaline membrane, which results in a crazy-paving appearance, which is then followed by consolidation on the CT examination48,49.

Compared to a radiologist’s reporting and quantification of lesions, there are several advantages to capturing CT scan information through a deep learning model. Good reproducibility is a key element for imaging biomarkers, and visual inspection of images introduces variability that can hinder its clinical application50. Another advantage is that radiologists are faced with the challenge that large numbers of cases must be read, annotated, and prioritized in a COVID-19 pandemic. AI analysis of radiological images has the potential to reduce this burden and speed up their reading time. Finally, prognosis scores obtained with deep learning models trained on CT scans are more predictive of severity than a quantification of disease extent performed by a radiologist. We indeed showed that internal representation of the AI-severity neural network model captures clinical information from CT scans, and this can be particularly useful when some clinical or lab measurements are missing.

Our reported prognostic values for CT scan-based models (AUC range of 0.70–0.80) are lower than the 0.85 AUC reported in a previously published study that uses deep learning with CT scan images for prognosis17. We hypothesize that this is due to use of different outcome definitions, as well as different patient characteristics in the study cohorts (age, severity at admission, etc.). Hospital admission criteria vary between countries and hospitals; for instance, the proportion of deaths in our French KB and IGR cohorts was 16–17%, while it was 39% in the study that reported larger AUC values17. When applying other previously published scores to the KB and IGR datasets, we found smaller AUC scores than the values reported in the original papers. This difference can again be explained by differing patient characteristics, and different measures of severity between studies6,7,9,10,16.

Our evaluation of AI-severity and of alternative scores revealed that including CT scan information in addition to clinical and biological information significantly improves prognosis of future severity at least for the IGR validation cohort. A better prognosis performance was more pronounced for subgroups containing patients with higher levels of inflammatory markers. The neural network prognosis variable was correlated with biological and clinical severity biomarkers such as CRP levels, tissue damage (LDH), and oxygenation. Information redundancy between data modalities explains the relatively modest 0.03 increase of AUC values provided by CT scan when being added to biological and clinical variables25,38,51,52,53.

Beyond AI modeling, our study shows that the 6-variable AI-severity score integrating a radiological quantification of lesions with key clinical and biological variables provides accurate severity predictions. When comparing AI-severity with 11 existing scores for severity, we find significantly improved prognosis performance in the validation datasets of 150 and 135 patients. Our results suggest that AI-severity can become a useful severity scoring approach for COVID-19 patients.

Methods

Description of the retrospective study

Data including CT scans were collected at two French hospitals (Kremlin-Bicêtre Hospital, APHP, Paris denoted as KB and Gustave Roussy Hospital, Villejuif denoted as IGR). CT scans, clinical, and biological data were collected in the first 2 days after hospital admission. This study received approval from the ethics committees of the two hospitals, and the authors submitted a declaration to the National Commission of Data Processing and Liberties (N° INDS MR5413020420, CNIL) in order to be registered in the medical studies database and to respect the General Data Protection Regulation (RGPD) requirements. An information letter was sent to all patients included in the study. We stopped updating information about patient status on 5 May. Among the 1003 patients of the study, two patients asked to be excluded from the study.

Inclusion criteria were (1) date of admission at hospital (from 2 February to 20 March at Kremlin-Bicêtre and from 2 March to 24 April at Institut Gustave Roussy) and (2) a positive diagnosis of COVID-19. Patients were considered positive either because of a positive real-time fluorescence polymerase chain reaction (RT-PCR) based on nasal or lower respiratory tract specimens or a CT scan with a typical appearance of COVID-19 as defined by the ACR criteria for negative RT-PCR patients54. Children and pregnant women were excluded from the study.

The clinical and laboratory data were obtained from detailed medical records, cleaned and formatted retrospectively by ten radiologists with 3–20 years of experience (five radiologists at IGR and five at KB). Data include demographic variables (age and sex) and variables from the clinical examination: body weight and height, body mass index, heart rate, body temperature, oxygen saturation, blood pressure, respiratory rate, and a list of symptoms including cough, sputum, chest pain, muscle pain, abdominal pain or diarrhea, and dyspnea. Health and medical history data include presence or absence of comorbidities (systemic hypertension, diabetes mellitus, asthma, heart disease, emphysema, immunodeficiency), and smoker status. Laboratory data include conjugated alanine, bilirubin, total bilirubin, creatine kinase, CRP, ferritin, hemoglobin, LDH, leukocytes, lymphocyte, monocyte, platelet, polynuclear neutrophil, and urea.

CT scan acquisition

CT scan data were available for 980 patients representing a total of 506,341 2D images (517 slices per patient on average). Summary statistics for the clinical, biological, and CT scan data are provided in Table 1. Three different models of CT scanners were used: two General Electric CT scanners (Discovery CT750 HD and Optima 660 GE Medical Systems, Milwaukee, USA) and a Siemens CT scanner (Somatom Drive; Siemens Medical Solutions, Forchheim, Germany). All patients were scanned in a supine position during breath-holding at full inspiration. The acquisition and reconstruction parameters were a 120 kV tube voltage with automatic tube current modulation (100–350 mAs) and a 1 mm slice thickness without interslice gap, using filtered-back-projection (FBP) reconstruction (SOMATOM Drive) or blended FBP/iterative reconstruction (Discovery or Optima). Axial images with slice thickness of 1 mm were used for coronal and sagittal reconstructions.

Radiology reports

COVID-19-associated CT imaging features were obtained from radiologist reports that follow the guidelines of several scientific societies of radiology (French SFR, STR, ACR, RSNA) regarding the reporting of chest CT findings related to COVID-1954. The template of the radiologist report (https://ebulletin.radiologie.fr/actualites-covid-19/compte-rendu-tdm-thoracique-iv-0) was accessed on 17 March and the reports were completed retrospectively for the patients who were admitted to the hospital before that date. CT imaging characteristics were evaluated to provide the following five variables: (i) GGO (rounded/nonrounded/absent) that is defined as an increase in lung density not sufficient to obscure vessels, with preservation of bronchial and vascular margins, (ii) consolidation (rounded/nonrounded/absent) that occurs when parenchymal opacification is dense enough to obscure the vessels’ margins and airway walls and other parenchymal structures, (iii) the crazy-paving pattern (present/absent) that is defined as ground glass opacification with associated interlobular septal thickening55, (iv) peripheral topography (present/absent) that corresponds to the spatial distribution of lesions in the one-third external part of the lung, and (v) inferior predominance (present/absent) that is defined as a predominance of lesions located in the lower segments of the lung. A rounded pattern (for GGO and consolidation) is defined as a lesion presenting a well delineated shape. In addition to the five CT imaging features, radiologists assessed the extent of lung lesions according to the evaluation criteria established by the French Society of Radiology (SFR)56. Disease extent can be: absent/minimal (<10%)/moderate (10–25%)/extensive (25–50%)/severe (>50%)/critical (>75%). The coding absent/minimal/moderate/extensive/severe/critical was based on a quantitative variable with values of 0/1/2/3/4/5. Variables were automatically extracted from the report using optical character recognition.

Statistical analysis

When detecting association with the severity outcome, odds ratios and p-values (two-sided tests) were computed separately for each hospital using logistic regression (glm function of the R statistical software). p-values from the two different hospitals were pooled using the Stouffer meta-analysis formula accounting for the two different sample sizes. For association between severity and each variable, we applied a Bonferroni correction accounting for 58 variables. To compute confidence intervals for AUC values, we used the DeLong method57. Survival functions were obtained using Kaplan–Meier estimators. For computing calibration slope and intercept, we used the rms R package, which transforms predicted probabilities to log odds that are then used as the covariate in a logistic regression of the observed outcome.
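The pooling of per-hospital p-values can be illustrated with a weighted Stouffer combination. The sketch below converts each two-sided p-value into a signed z-score, weights it by the square root of the hospital sample size, and compares the pooled p-value with the Bonferroni threshold of 0.05/58. The square-root weights, the use of the sign of the log odds ratio, and the toy inputs are assumptions made for illustration; the text does not specify these details.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution

# Bonferroni-corrected significance threshold for 58 tested variables.
BONFERRONI_ALPHA = 0.05 / 58


def signed_z(p_two_sided, effect_sign):
    """Z-score with the sign of the effect (e.g. sign of the log odds ratio)."""
    return effect_sign * _STD_NORMAL.inv_cdf(1.0 - p_two_sided / 2.0)


def stouffer_weighted(p_values, effect_signs, sample_sizes):
    """Weighted Stouffer combination of two-sided p-values.

    Weights are sqrt(sample size), a common choice (assumption here).
    Returns the pooled z-score and the pooled two-sided p-value.
    """
    weights = [sqrt(n) for n in sample_sizes]
    z = sum(w * signed_z(p, s)
            for w, p, s in zip(weights, p_values, effect_signs))
    z /= sqrt(sum(w * w for w in weights))
    p = 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))
    return z, p


# Toy example (hypothetical p-values and sample sizes): the same direction
# of effect in both hospitals strengthens the pooled evidence.
pooled_z, pooled_p = stouffer_weighted(
    p_values=[0.001, 0.02], effect_signs=[+1, +1], sample_sizes=[800, 130]
)
is_significant = pooled_p < BONFERRONI_ALPHA
```

Keeping the sign of each effect prevents two hospitals with opposite associations from reinforcing each other, which an unsigned combination of two-sided p-values would allow.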

Deep learning models for severity classification based on CT scans

The deep learning model was defined as an ensemble of two submodels, as illustrated in Supplementary Fig. 2. Each submodel predicted disease severity from CT scans without using any expert annotations at the slice level. Preprocessing of the data consisted of resampling the CT scans to a fixed pixel spacing of (0.7 mm, 0.7 mm, 10 mm) and applying a specific windowing to the HU intensities. Each submodel is composed of two blocks: a deep neural network, called the feature extractor, and a penalized logistic regression. The feature extractors of the two submodels are EfficientNet-B058 pre-trained on the ImageNet public database and ResNet5059 pre-trained with MoCo v260 on one million CT scan slices from both Deep Lesion61 and LIDC62. Each of these networks provides an embedding of the slices of the input CT scans into a lower-dimensional feature space (of dimension 1280 for EfficientNet-B0 and 2048 for ResNet50). For the ResNet50-based submodel, we reduced the dimension of the feature space using a principal component analysis with 40 components before applying logistic regression. A different windowing was applied to the CT scans before each feature extractor: (–1000 HU, 600 HU) for EfficientNet-B0, and (–1000 HU, 0 HU), (0 HU, 1000 HU), and (–1000 HU, 4000 HU) for ResNet50. Predictions of AI-severity were obtained by averaging the predictions of the two submodels with equal weights. Optimization of the pipeline (preprocessing, feature extraction or model architecture and training, feature engineering, model aggregation) was performed using fivefold cross-validation on the training set of 646 patients from KB.
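The windowing and the equal-weight averaging can be illustrated with the following sketch. It makes assumptions that are not stated in the text: windowed intensities are rescaled to [0, 1], and the three ResNet50 windows are stacked as three input channels. The function names are hypothetical.

```python
import numpy as np

# HU windows quoted in the text.
EFFICIENTNET_WINDOW = (-1000, 600)
RESNET_WINDOWS = [(-1000, 0), (0, 1000), (-1000, 4000)]


def apply_window(hu, low, high):
    """Clip HU intensities to [low, high] and rescale to [0, 1] (assumed scaling)."""
    hu = np.asarray(hu, dtype=np.float32)
    clipped = np.clip(hu, low, high)
    return (clipped - low) / (high - low)


def resnet_input(hu_slice):
    """Stack the three ResNet50 windows as channels (assumed layout)."""
    channels = [apply_window(hu_slice, low, high) for low, high in RESNET_WINDOWS]
    return np.stack(channels, axis=-1)


def ensemble_probability(p_efficientnet, p_resnet):
    """Equal-weight average of the two submodel probabilities."""
    return 0.5 * (np.asarray(p_efficientnet) + np.asarray(p_resnet))
```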

CT scans may show medical devices (catheters, EKG monitoring, oxygenation tubing, etc.) that are easily detectable on a CT scan and can bias the prediction of severity. Indeed, there is a risk that the model detects the presence of a technical device associated with severity instead of the radiological features associated with severity63. To ensure that medical devices did not affect feature extraction, all voxels outside of the lungs were masked using a pre-trained U-Net lung segmentation algorithm64.
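Once a binary lung mask is available, the masking step itself reduces to an element-wise selection. The sketch below assumes that the mask has already been produced by a segmentation model and that masked voxels are set to -1000 HU (air); the fill value is an assumption, as it is not specified in the text, and `mask_outside_lungs` is a hypothetical name.

```python
import numpy as np


def mask_outside_lungs(volume_hu, lung_mask, fill_value=-1000.0):
    """Replace every voxel outside the lung mask with fill_value.

    volume_hu: CT volume in Hounsfield units.
    lung_mask: array of the same shape, nonzero inside the lungs.
    fill_value: assumed to be air (-1000 HU).
    """
    volume_hu = np.asarray(volume_hu, dtype=np.float32)
    lung_mask = np.asarray(lung_mask, dtype=bool)
    if volume_hu.shape != lung_mask.shape:
        raise ValueError("volume and mask must have the same shape")
    return np.where(lung_mask, volume_hu, np.float32(fill_value))
```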

Multivariate models to predict severity

The models that combine multiple features to predict severity (AI-severity, AI-segment, C & B, C & B & RR) were fitted using logistic regression. Models were trained using fivefold cross-validation on the training dataset of 646 patients from KB, with folds stratified by age and severity outcome. Variables that were available for fewer than 300 patients of the training set (conjugated bilirubin and alanine) were not used. For the remaining variables, missing values were imputed by the average over patients of the training set. L2 regularization was applied when fitting the logistic regression. The regularization coefficient was determined by maximizing the average AUC over the five cross-validation folds, over a range of values from 0.01 to 100. The XGBoost algorithm was also evaluated but did not show better performance than logistic regression. We used pandas and scikit-learn to manipulate data and to train and evaluate machine learning algorithms65.
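A minimal scikit-learn sketch of this procedure is given below. It is not the study's code and rests on several assumptions: the grid of nine log-spaced values is arbitrary, the range 0.01 to 100 is applied to scikit-learn's `C` parameter (which is the inverse of the regularization strength, so the correspondence with the coefficient described above is uncertain), and `strata` is a hypothetical label that combines an age group with the severity outcome.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline


def fit_l2_logistic(X, y, strata, n_folds=5, seed=0):
    """Mean imputation + L2 logistic regression, tuned by cross-validated AUC.

    strata: one label per patient used to stratify the folds
    (assumed here to combine age group and outcome).
    """
    pipeline = make_pipeline(
        SimpleImputer(strategy="mean"),      # impute with the training mean
        LogisticRegression(max_iter=1000),   # L2 penalty is the default
    )
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # Build the splits on the stratification label rather than on y alone.
    splits = list(folds.split(np.zeros(len(strata)), strata))
    search = GridSearchCV(
        pipeline,
        param_grid={"logisticregression__C": np.logspace(-2, 2, 9)},
        scoring="roc_auc",
        cv=splits,
    )
    search.fit(X, y)
    return search
```

Placing the imputer inside the pipeline means that, within each fold, the imputation mean is computed on the training part only, which avoids leaking information from the held-out patients.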

To select variables in the multivariate models, we used a forward feature selection technique (Supplementary Fig. 8). The first variable included in the model is the one that provides the largest AUC value. We then computed AUC values for all two-variable models that include this first variable, kept the best one, and continued this procedure until all variables were included. Model performance increased quickly when the first variables were included, and AUC values then reached a plateau (Supplementary Fig. 8). We used the elbow method to select the parsimonious set of variables obtained when the AUC plateau is reached. For the three models that include CT scan information, we used the C & B & RR model to perform variable selection. The three models (AI-severity, C & B & RR, AI-segment) were then trained using the six variables found with our variable selection procedure. The variable selection procedure for the C & B model, which contains clinical and biological variables only, indicated that ten variables should be kept in the model (Supplementary Table 5).
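The greedy procedure can be sketched as follows. The callback `score_fn` is hypothetical: it stands for any function that returns a performance value (for example a cross-validated AUC) for a given list of variables. The plateau rule in `plateau_size` is a simple numerical stand-in for the visual elbow method, and its threshold is an assumption.

```python
def forward_selection(candidates, score_fn):
    """Greedy forward selection.

    At each step, add the variable that gives the best score when joined
    to the variables already selected. Returns the inclusion order and
    the score reached after each addition.
    """
    remaining = list(candidates)
    selected, path = [], []
    while remaining:
        best_var = max(remaining, key=lambda v: score_fn(selected + [v]))
        selected.append(best_var)
        remaining.remove(best_var)
        path.append(score_fn(selected))
    return selected, path


def plateau_size(path, min_gain=0.005):
    """Number of variables kept before the first gain smaller than min_gain."""
    for k in range(1, len(path)):
        if path[k] - path[k - 1] < min_gain:
            return k
    return len(path)
```

The variables retained are then `selected[:plateau_size(path)]`.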

Other scores to predict severity and mortality

We compared the proposed AI-severity model with 11 other COVID-19 severity scores published in the literature: COVID-Gram6, two scores from Colombi et al.16, CALL9, CURB-6566, Yan et al.7, Liang et al.67, NEWS2 and NEWS2 for COVID-1968, 4C mortality score12, and MIT analytics (https://www.covidanalytics.io/mortality_calculator). Among these scores, only three included radiological information in their model: the presence of an X-ray abnormality for COVID-Gram and Liang et al., and the lung disease extent for Colombi et al. The scores varied in the number of clinical and biological variables considered (from 3 for Yan et al. to 14 for NEWS2 for COVID), in the model architecture (simple scoring system, logistic regression, XGBoost, or multilayer perceptron), and in the outcome they were trained on (such as death or admission to ICU). The definition of comorbidities also varied notably between scores; details about how comorbidities were computed for the different scores are provided in Supplementary Table 6. Variables that were missing were manually imputed with a constant value (Supplementary Table 6). Because of the poor performance of one of the scores7, we retrained it by repeating the authors' training procedure on the KB development cohort.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.