MultiCOVID: a multimodal deep learning approach for COVID-19 diagnosis

The rapid spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) led to a global overextension of healthcare. Both chest X-rays (CXR) and blood tests have been shown to have predictive value for Coronavirus Disease 2019 (COVID-19) diagnosis across different prevalence scenarios. With the objective of improving and accelerating the diagnosis of COVID-19, a multimodal prediction algorithm (MultiCOVID) based on CXR and blood tests was developed to discriminate between COVID-19, heart failure, non-COVID pneumonia and healthy (Control) patients. This retrospective single-center study includes CXR and blood tests obtained between January 2017 and May 2020. Multimodal prediction models were generated using open-source DL algorithms. The performance of the MultiCOVID algorithm was compared with interpretations from five experienced thoracic radiologists on 300 random test images using the McNemar–Bowker test. A total of 8578 samples from 6123 patients (mean age, 66 ± 18 [standard deviation] years; 3523 men) were evaluated across datasets. For the entire test set, the overall accuracy of MultiCOVID was 84%, with a mean AUC of 0.92 (0.89–0.94). For the 300 random test images, the overall accuracy of MultiCOVID (69.6%) was significantly higher than that of the individual radiologists (range, 43.7–58.7%) and of the consensus of all five radiologists (59.3%, P < .001). Overall, we have developed a multimodal deep learning algorithm, MultiCOVID, that discriminates among COVID-19, heart failure, non-COVID pneumonia and healthy patients using both CXR and blood tests, with significantly better performance than experienced thoracic radiologists.

As lung involvement is one of the main causes of morbidity and mortality in SARS-CoV-2 infection, quick identification of characteristic findings on chest imaging can support the diagnosis and speed up the identification of COVID-19-positive patients in emergency units.
Several studies have shown that deep learning (DL) tools trained to detect chest X-ray (CXR) findings typically associated with SARS-CoV-2 infection deliver results comparable to those of radiologists' interpretations. However, most trained models show a drop in prediction performance when tested on external datasets 1 . In addition, one of the main hurdles to overcome when training an algorithm to detect SARS-CoV-2 infection in CXR is the similarity of its findings to those of other entities such as bacterial pneumonia or heart failure 2 . On the other hand, models based on laboratory results of peripheral blood also give predictive results on diagnosis 3 and prognosis 4 .
A key fact to highlight is how the arrival of COVID-19 caused a dramatic drop in emergency room consultations for other pathologies. Later, after the initial peak, the decline in COVID-19 prevalence made non-COVID diseases emerge once again at the hospitals. This is relevant given the challenge of performing an efficient differential diagnosis against selected pathologies during a pandemic. It is well known that the predictive value of a diagnostic test is conditioned by the prevalence of the disease, and the prevalence of COVID-19 varied widely throughout the different waves of the pandemic 5 . A multicategory approach that takes into account differential diagnoses with more stable prevalence could reduce this variability.
With the objective of improving and accelerating the diagnosis of COVID-19, we developed a tool to assist physicians in reaching a diagnosis. This tool is a multimodal prediction algorithm (MultiCOVID) based on CXR and blood tests, with the ability to discriminate between COVID-19, heart failure (HF), non-COVID pneumonia (NCP) and healthy (Control) samples.

Dataset
We retrospectively collected CXR images and hemogram values from 8578 samples from 6123 patients and healthy subjects (mean age, 66 ± 18 years; 3523 men) from the Parc de Salut Mar (PSMAR) Consortium, Barcelona, Spain. Four cohorts were designed: (i) 1171 samples from patients diagnosed with COVID-19 by RT-PCR from March to May 2020; (ii) 1008 samples from patients who suffered an episode of heart failure between 2012 and 2019; (iii) 490 samples from patients diagnosed with non-COVID pneumonia (NCP) from 2018 to 2019; and (iv) 5909 samples from standard preoperative studies of healthy subjects from 2017 to 2019 (Fig. 1). HF and NCP diagnoses were selected as defined by the International Classification of Diseases, Tenth Revision (ICD-10) codes. All CXR images from groups i–iii were validated by two independent radiologists (MB and JM).
Figure 1. Flowchart for sample selection and patient inclusion in the study, and breakdown of the training, validation and hold-out test datasets. Around 25,000 entries were obtained using both CXR images and blood tests in a time-wise manner. The whole dataset totals 8822 entries of paired CXR and blood test data. Samples with low completeness (less than 80% of blood test data available) were discarded for model building.
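The completeness filter described in the caption can be sketched as below. This is an illustrative reconstruction, not the study's code: the field names, record layout and the treatment of missing values as `None` are assumptions; only the 80% threshold comes from the text.

```python
# Hypothetical sketch of the completeness filter: samples with less than
# 80% of blood test fields available are discarded before model building.
# Field names and values are illustrative, not taken from the study.

def completeness(record):
    """Fraction of blood test fields that are present (not None)."""
    values = list(record.values())
    return sum(v is not None for v in values) / len(values)

def filter_samples(records, threshold=0.8):
    """Keep only records whose blood panel is at least `threshold` complete."""
    return [r for r in records if completeness(r) >= threshold]

samples = [
    {"hemoglobin": 13.2, "leukocytes": 6.1, "platelets": 210, "rbc": 4.5, "rdw": 13.0},
    {"hemoglobin": 11.0, "leukocytes": None, "platelets": None, "rbc": None, "rdw": None},
]
kept = filter_samples(samples)
print(len(kept))  # 1
```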

Acquisition of blood sample and image data
We included CXR images performed in a period ranging from 1 day before the patient's diagnosis to 7 days after. The images were filtered to include only frontal projections, regardless of the quality and the radiography system used. Blood test results were collected within a range of 2 days before to 7 days after the CXR acquisition date using the PSMAR lab record system, except for control samples, for which the window extended to 2 weeks. If two or more blood test results were collected, measurements were averaged.
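The pairing-and-averaging rule above can be sketched as follows. This is a minimal illustration under assumed data structures (dated value pairs per analyte); the dates and values are made up, and only the 2-days-before/7-days-after window and the averaging rule come from the text.

```python
from datetime import date, timedelta

# Illustrative sketch of the matching rule: blood tests drawn between
# 2 days before and 7 days after the CXR date are paired with it, and
# multiple in-window measurements are averaged per analyte.

def match_and_average(cxr_date, blood_results, before=2, after=7):
    """blood_results: list of (date, value). Mean of in-window values, or None."""
    lo = cxr_date - timedelta(days=before)
    hi = cxr_date + timedelta(days=after)
    in_window = [v for d, v in blood_results if lo <= d <= hi]
    return sum(in_window) / len(in_window) if in_window else None

cxr = date(2020, 4, 10)
hb = [(date(2020, 4, 9), 12.0), (date(2020, 4, 15), 13.0), (date(2020, 3, 1), 9.0)]
print(match_and_average(cxr, hb))  # 12.5
```

For control samples, the `before`/`after` arguments would simply be widened to cover the 2-week window mentioned above.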
CXR images and blood test results were combined in the same dataset and split into a train/validation set (90%) and a hold-out test set (10%). For the training/validation split, we divided the dataset into training (80%) and validation (20%) sets with 5 different random seeds. We ensured that there were no cross-over patients between groups.
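The no-cross-over constraint means the split must be performed at the patient level, not the sample level. A minimal sketch, assuming each sample carries a `patient_id` (the record layout is hypothetical):

```python
import random

# Patient-level split: shuffling patients (not samples) guarantees that no
# patient contributes samples to more than one split, as the study requires.

def split_by_patient(samples, val_frac=0.2, seed=0):
    patients = sorted({s["patient_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_val = int(len(patients) * val_frac)
    val_ids = set(patients[:n_val])
    train = [s for s in samples if s["patient_id"] not in val_ids]
    val = [s for s in samples if s["patient_id"] in val_ids]
    return train, val

samples = [{"patient_id": i // 2, "x": i} for i in range(20)]  # 10 patients, 2 samples each
train, val = split_by_patient(samples, val_frac=0.2, seed=42)
train_ids = {s["patient_id"] for s in train}
val_ids = {s["patient_id"] for s in val}
print(train_ids & val_ids)  # set() — no patient appears in both splits
```

Running this with 5 different seeds yields the 5 train/validation partitions used to train the 5 Joint models.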

Deep learning models
A detailed description of the models, training policy and image preprocessing is provided in the Supplementary Material. In brief, the segmentation model is based on a U-Net architecture 6 . The CXR-only classification model consists of a validated convolutional neural network (CNN) with a ResNet-34 architecture 7 . The tabular-only model is an attention-based network (TabNet) 8 . The Joint model is a multimodal deep learning algorithm that merges the CXR-only and Blood-only models and uses both the CXR image and the blood tests as input values. It uses Gradient Blending in order to prevent overfitting and improve generalization 9 . The MultiCOVID model is an ensemble predictor of 5 different Joint models, each of which classifies independently between the different classes; a majority vote then assigns the final classification. The whole pipeline development and training was performed using the fastai deep learning API 10 .
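The final ensembling step can be sketched in a few lines. The tie-breaking rule shown here (the first label to reach the top count wins) is an assumption; the paper does not specify one.

```python
from collections import Counter

# Sketch of the MultiCOVID ensembling step: five Joint models each emit a
# class label, and the final prediction is the majority vote among them.

CLASSES = ["COVID-19", "Control", "HF", "NCP"]

def majority_vote(predictions):
    """Return the most common label among the model predictions."""
    return Counter(predictions).most_common(1)[0][0]

votes = ["COVID-19", "COVID-19", "NCP", "COVID-19", "HF"]
print(majority_vote(votes))  # COVID-19
```

With an odd number of models (5) and four classes, exact ties are possible but rare; a production implementation would need an explicit tie-breaking policy.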

Comparison with thoracic radiologist interpretations
A hold-out test dataset consisting of 300 samples (ensuring no patient overlap with the training or validation sets) was used for expert interpretation. Each sample consisted of a CXR with matched blood results. Expert interpretations were independently provided by five board-certified thoracic radiologists (FZ, SC, LdC, DR, AG) with 2–30 years of post-residency experience. Radiologists were able to review both the non-segmented images and the blood test results, without any other additional information, on a platform created ad hoc for prediction. They assigned each image to one of the four categories (COVID-19, Control, HF and NCP). A consensus interpretation for the radiologists was obtained by majority vote for each paired CXR–blood test analyzed.

Statistical analysis
Two-tailed t-test P values were reported when differences in clinical characteristics and blood test values between populations were assessed.
The McNemar–Bowker test was used to compare model performance against the radiologists' majority vote, with FDR correction. Plotting and statistical analyses were performed using the ggplot2, ggpubr and rcompanion packages in R, version 3.6 (R Core Team; R Foundation for Statistical Computing).
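For readers unfamiliar with the test, the Bowker symmetry statistic on a k×k table of paired classifications (model vs. radiologist consensus) can be sketched as follows. The table values are toy numbers; p-values (and the FDR correction) would come from a statistics package and are omitted here.

```python
# Sketch of the McNemar–Bowker statistic: sum (n_ij - n_ji)^2 / (n_ij + n_ji)
# over off-diagonal pairs of a k x k contingency table of paired labels.
# Under the symmetry hypothesis it is chi-square distributed with (at most)
# k(k-1)/2 degrees of freedom.

def bowker_statistic(table):
    k = len(table)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            nij, nji = table[i][j], table[j][i]
            if nij + nji > 0:
                stat += (nij - nji) ** 2 / (nij + nji)
                df += 1
    return stat, df

# Toy 3x3 table of paired labels (rows: rater A, columns: rater B)
table = [[20, 5, 2],
         [3, 30, 4],
         [1, 6, 25]]
stat, df = bowker_statistic(table)
print(round(stat, 3), df)  # 1.233 3
```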

Ethical approval
The study was designed to use radiology images and associated clinical/demographic/laboratory patient information already collected for the purpose of performing clinical COVID-19 research by Hospital del Mar. The study was conducted in accordance with the relevant institutional guidelines and regulations. The experimental protocols, data acquisition and analysis were approved by the Parc de Salut Mar Clinical Research Ethics Committee (2020/9199/I). Informed consent was obtained, when possible, from patients or their legal representatives, or was waived by the local Parc de Salut Mar Clinical Research Ethics Committee (2020/9199/I) if informed consent was not available due to the pandemic situation.

Patient characteristics
A total of 8578 samples were evaluated across datasets. Patient characteristics and blood test parameters are shown in Table 1. A highly significant difference in age was found between the cohort of patients with heart failure (82.8 ± 10 years) and the other three cohorts (66.0 ± 16 years for COVID-19 samples, 63.2 ± 18 years for control samples and 67.8 ± 17 years for NCP samples; P < 0.001 for each comparison), so age was not considered a valid variable for further classification.

Whole CXR models learn spurious characteristics for classification
Previous studies have demonstrated that deep learning (DL)-based algorithms should be rigorously evaluated due to their ability to learn non-relevant features in order to increase their prediction accuracy 1 . For this reason, we first developed a segmentation algorithm able to segment the lung parenchyma at 95% pixel accuracy. Then, after segmentation, we evaluated the accuracy of the algorithms on three complementary datasets: non-segmented images, segmented regions and excluded regions. After a few training epochs, the three different models achieved non-random accuracies between 67 and 74% (Fig. 2A). However, attention map exploration on the images showed that the different models based their predictions not only inside but also outside the lung parenchyma (Fig. 2B).
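The pixel-accuracy figure quoted above is simply the fraction of pixels where the predicted mask agrees with the ground truth. A minimal sketch, using nested lists in place of image arrays:

```python
# Pixel accuracy of a segmentation: fraction of pixels where the predicted
# binary mask matches the ground-truth mask. Toy 2x4 masks for illustration.

def pixel_accuracy(pred, truth):
    total = correct = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            total += 1
            correct += (p == t)
    return correct / total

pred  = [[1, 1, 0, 0], [1, 0, 0, 0]]
truth = [[1, 1, 0, 0], [1, 1, 0, 0]]
print(pixel_accuracy(pred, truth))  # 0.875
```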
These observations showed that, although there are important features outside the lung parenchyma that may help the model classify between the different entities (e.g., heart size), there are other elements (e.g., oxygen nasal cannulas or intravenous (IV) catheters) that might confound the model. Thus, we decided to first segment all the CXR images before training our models for prediction of diagnosis. To accomplish this task, we generated a 785-radiology-level lung segmentation dataset and trained a U-Net model to regenerate the whole CXR dataset keeping only the lung parenchyma.
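The masking step that "keeps only the lung parenchyma" amounts to zeroing every pixel outside the predicted mask before the image is fed to the classifier. An illustrative sketch with nested lists standing in for image arrays (the study's pipeline would operate on tensors produced by the U-Net):

```python
# Apply a binary lung mask to an image: pixel values are kept where the mask
# is 1 and zeroed elsewhere, so confounders outside the lungs (catheters,
# cannulas, annotations) cannot influence the classifier.

def apply_mask(image, mask):
    """Keep pixel values where mask == 1, zero elsewhere."""
    return [[px if m else 0 for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

image = [[10, 20, 30],
         [40, 50, 60]]
mask = [[0, 1, 1],
        [1, 1, 0]]
print(apply_mask(image, mask))  # [[0, 20, 30], [40, 50, 0]]
```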

Performance of single and multimodal models
In order to evaluate the prediction capacity of both segmented CXR and blood sample data, we built different DL models using each source alone or in combination. Metrics comparisons of all the single vision (CXR-only) and tabular (Blood-only) models are detailed in the Supplementary Material. As expected, CXR-only models had a more robust prediction of all 4 categories tested compared to Blood-only models (Fig. 3). This difference is stronger in the classes with fewer samples (HF and NCP), where CXR-only models could identify features in the CXR images that are characteristic of these two entities, whereas this was not possible with Blood-only models.

Discussion
Diagnosis of COVID-19 is an evolving challenge. During the beginning of the pandemic and the successive peaks with high prevalence rates, a prompt and effective diagnosis was critical for proper patient isolation and evaluation. However, since the prevalence of COVID-19 cases oscillated, with fewer COVID-19 cases and more non-COVID cases between waves, it became important to differentiate COVID-19 from other diseases presenting similar visual characteristics in the CXR.
During patient assessment in the emergency room, clinicians take into account different inputs for a proper diagnosis. First, the anamnesis, symptoms, vitals and physical findings guide the physician to an initial hypothesis. Based on this information, additional tests are requested (CXR, blood test, ECG and SARS-CoV-2 detection). The integration of these results allows the team to diagnose a patient accurately. However, this process is time-consuming, and findings are sometimes difficult to interpret, leading to misdiagnosis.
To improve this diagnostic process, we have developed and trained a multimodal deep learning algorithm based on a multiple-input approach, combining CXR images together with blood sample data to identify COVID-19 with high sensitivity. In this way, we were able to manage the increased complexity of the dataset. These data from multiple sources are correlated and complementary to each other and could reflect patterns that are not present in single models alone 13 .
Hence, MultiCOVID is fed by two of the most common and fast clinical tests requested in the emergency room (CXR and blood test) and can predict the presence of three different diseases (COVID-19, heart failure and non-COVID pneumonia) with similar CXR characteristics.
Analysis of the single models shows the importance of model interpretation. While CXR-only models could identify patterns outside the lung parenchyma that could diminish their generalization capacity 9 , Blood-only models could point to interesting populations of cells that are differently represented in COVID-19 patients, leveraging their prediction capacity. In this context, the immune compartment plays an important role in the COVID-19 response, and it has already been published that COVID-19 patients present lower overall leukocyte counts and, more specifically, lower eosinophil counts 14,15 . Furthermore, oxygen transport seems to be affected, modulating the red cell population. In this regard, in our work we found significant differences in the erythrocyte count and the hemoglobin concentration. Although most studies correlate the reduction of these values with severe COVID-19 16 , this is the first dataset to compare them across these four different categories at the time of diagnosis.
Moreover, although a huge amount of literature about COVID-19 diagnosis and prognosis has been published using only blood tests [17][18][19][20] or CXR [21][22][23][24][25][26][27][28] , this is the first study that combines both parameters and compares their prediction capacity at diagnosis. Of note, only one previously published study integrates both blood tests and CXR severity scores, in order to determine in-hospital death of COVID-19 patients 29 . Hence, it is clear that merging both sources of data leads to better prediction performance compared with the two single models alone, and that this difference is more pronounced where the number of cases is scarce. It is important to stress that this combination of data sources addresses the variable prevalence of COVID-19 cases during the pandemic, an issue that could not be solved in previous studies 23,24 .
Our study has several limitations. First, the algorithm was evaluated at a single center; thus, there was likely some degree of bias. Additionally, the sample collection was performed in different time periods for each group of patients, which could introduce differences in CXR image acquisition, although this was partially addressed by the lung segmentation model, which removes the noise signal present outside the lung parenchyma. Finally, model performance could be influenced by potential shifts in the disease landscape due to COVID-19 variants and vaccination efforts, which could affect the generalizability and interpretation of our findings.

Conclusions
We have developed a multimodal deep learning algorithm, MultiCOVID, that discriminates among COVID-19, heart failure, non-COVID pneumonia and healthy patients using both CXR and blood tests, with significantly better performance than experienced thoracic radiologists.
Our approach and results suggest an innovative scenario in which COVID-19 could be distinguished from other similar diseases, facilitating triage within the emergency room in a low COVID-19 prevalence situation. https://doi.org/10.1038/s41598-023-46126-8

Figure 3 .
Figure 3. Performance of the different models on the entries from the hold-out test datasets. Means for precision (green), sensitivity (blue), F1 score (yellow), AUC (red) and accuracy (black diamond) for each model type and category assessed, respectively. CXR-only models use only CXR images for 4-category classification. Blood-only models use blood test results as the source of information. The Joint model uses both CXR and blood tests as input for classification, and MultiCOVID is the majority vote of 5 different Joint models.

Figure 4 .
Figure 4. Blood-only model interpretability by SHAP analysis. (A) Summary plot showing the mean absolute SHAP value of the ten most important features for the four classes. (B) Blood test values of the different features identified by SHAP analysis. RDW-CV: red cell distribution width; MCHC: mean corpuscular hemoglobin concentration; RBC: red blood cells.

Figure 5 .
Figure 5. Comparison of the performance of the MultiCOVID model with consensus expert radiologist interpretations on a random sample of 300 images from the test set. Receiver operating characteristic (ROC) curves for each category (COVID-19, blue; Control, green; Heart Failure (HF), red; Non-COVID Pneumonia (NCP), magenta) are shown for MultiCOVID (DL) and for the consensus interpretation of the radiologists (majority vote). Sensitivity (Sens) and specificity (Spec) are also plotted for each category assessed. DL: deep learning.