Introduction

The COVID-19 pandemic began in late 2019 and caused significant disruption worldwide1. Most patients experienced mild to moderate symptoms such as cough, cold, myalgia, sore throat, nausea, loss of taste/smell and headaches. However, some patients developed severe complications such as acute respiratory distress syndrome (ARDS), severe hypoxia and multi-organ failure and succumbed to this deadly disease2. As of today, the virus is still spreading, and new variants continue to emerge. A cytokine storm manifests in some COVID-19 patients, characterized by an enormous release of cytokines such as IL-6 and IL-1. This condition leads the immune system to attack the body's own tissues and has caused deaths in many SARS-CoV-2 patients3.

The severity of COVID-19 symptoms has decreased after the introduction of vaccines4. However, some COVID-19 patients remain vulnerable to a severe prognosis3. Older patients and people with comorbidities such as hypertension, diabetes and cancer are still at risk. It is crucial to identify these patients early so that appropriate medications and treatments can be provided to avoid unnecessary casualties. A few drugs have been developed and shown to prevent the onset of severe COVID-19 symptoms3. These medicines must be administered during the initial stages of the illness to be effective.

Artificial intelligence (AI) applications have been extensively utilized in the healthcare sector5,6,7. Diagnostic and prognostic models, decision support systems and predictive modelling are being developed using machine learning (ML) to assist healthcare professionals. These technologies are also being used in the fight against COVID-198,9,10. Explainable artificial intelligence (XAI) makes models more transparent and understandable: the reasoning behind a patient-level prediction can be represented visually. XAI has also been utilized in various domains such as finance, engineering, pharmacy, medicine and commerce.

A few markers such as C-reactive protein (CRP), D-Dimer, lactate dehydrogenase (LDH), the neutrophil-to-lymphocyte ratio (NLR) and ferritin are known to change markedly before the actual onset of severe symptoms2. Machine learning models trained on these markers can therefore predict COVID-19 severity in advance. The early detection of patients with a poor prognosis, together with reliable forecasting techniques that are simple to use in routine clinical practice, is crucial for ensuring the highest level of treatment in clinics.

Several researchers have utilized machine learning to predict COVID-19 severity using haematological parameters. Huyut11 developed an automatic decision support system to classify mild and severe coronavirus patients. The dataset consisted of 194 severe and 4010 mild patients; twenty-nine markers were considered, and the local weighted learning algorithm obtained a maximum accuracy of 97.86%. Wendland et al.12 used classifiers to predict severe COVID-19 cases and achieved an AUC of 0.918; the most important markers in their study were CRP and blood sugar levels. A COVID-19 severity prediction model was developed by Nguyen et al.13 using data from 261 patients in Vietnam. The random forest obtained an accuracy of 97%, and the best prognostic markers were CRP, IL-6, dyspnoea, D-Dimer and ferritin. A nature-inspired approach was developed to predict COVID-19 severity14. The study used the details of 65,000 patients described by twenty-six features, and a variant of the artificial bee colony algorithm was used for feature selection. Among all the algorithms, the support vector machine obtained an accuracy of 96%, and the model categorized the patients into mild, moderate and severe. Laatifi et al.15 used two explainable AI techniques to predict COVID-19 severity in eighty-seven patients. Shapley additive explanations (SHAP) and local interpretable model-agnostic explanations (LIME) were used to make the models understandable, and the most critical cytokine markers were VEGF-A and IL-7.

COVID-19 vaccines have been successful in preventing severe COVID-19 in most patients. However, a small part of the population still experiences severe symptoms, and it is of utmost importance to prevent the onset of a severe prognosis in these patients. Machine learning models can be beneficial in predicting this in advance.

The above studies show that COVID-19 severity can be predicted effectively using clinical and laboratory markers. The main objective of this research is to forecast the severity of COVID-19 in individual patients. The other contributions are given below:

  • Descriptive statistical analysis has been conducted to understand various trends and patterns in the data.

  • Fourteen feature selection methods, including nature-inspired algorithms, have been used to choose the most important markers.

  • Machine learning models including bagging, boosting, voting and stacking have been used to predict COVID-19 severity. The classifiers have been further compared with state-of-the-art deep learning models such as the deep neural network (DNN), the one-dimensional convolutional neural network (1D-CNN) and long short-term memory (LSTM).

  • Five XAI techniques, namely SHAP, LIME, Eli5, QLattice and Anchor, have been used to interpret the predictions.

  • Crucial COVID-19 prognostic markers are further discussed from a medical perspective.

The remainder of the paper is structured as follows. Materials and methods are described in the “Methods” section. The results are explained extensively in the “Results” section and discussed in the “Discussion” section. The article concludes with the “Conclusion” section.

Methods

Description of the dataset

The COVID-19 datasets were obtained from two hospitals in India: Dr TMA Pai Hospital and Kasturba Medical College. The Manipal Academy of Higher Education provided ethical clearance to conduct this research (IEC: 613/2021). The patients have been completely anonymized in this study. COVID-19 patients tested between September 2021 and December 2021 were considered, and only patients above eighteen years of age were included. Records of 899 patients were utilized to train the machine learning models: 599 non-severe patients and 300 severe patients. Patients whose condition deteriorated and required admission to the intensive care unit (ICU), or who had a respiratory rate > 30/minute or SpO2 < 90% (World Health Organization criteria), were grouped as severe cases16. Thirty-two clinical parameters were considered in this study (31 continuous and one categorical). The clinical markers chosen are tabulated in Table 1.

Table 1 Attributes chosen in this study.

Data pre-processing

Pre-processing of the dataset is critical in machine learning: missing values are imputed, categorical attributes are encoded, continuous values are scaled, data balancing is performed, and unnecessary attributes are dropped. To keep missing values to a minimum, patients who had completed the most clinical tests were prioritized while gathering the data. The few remaining missing values were replaced with the median of the respective attribute. The “gender” attribute (categorical) had no missing values. Descriptive statistical analysis was conducted using the open-source statistical software Jamovi. Some of the statistical parameters utilized are described in Table 2.

Table 2 Descriptive statistical parameters for the COVID-19 dataset.
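As a minimal sketch of the imputation step described above (the file name and the label column "Severity" are hypothetical; the actual markers are listed in Table 1), median imputation of the continuous attributes can be performed with scikit-learn:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical file and column names, used only for illustration.
df = pd.read_csv("covid_markers.csv")
continuous_cols = [c for c in df.columns if c not in ("Gender", "Severity")]

# Replace the few missing values in each continuous marker with that marker's median.
imputer = SimpleImputer(strategy="median")
df[continuous_cols] = imputer.fit_transform(df[continuous_cols])
```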

Violin plots were used to find interesting patterns in the dataset, as shown in Fig. 1. The figure shows that the median age was higher in the severe COVID-19 cohort. Further, markers such as neutrophils, HbA1c and CRP were elevated in severe patients, while lymphocyte and monocyte counts were lower in the severe COVID-19 cohort.

Figure 1
figure 1

Violin plots for some of the markers. (a) Age (b) Neutrophil (c) Lymphocyte (d) HbA1c (e) CRP (f) Monocyte.

The frequency of the “gender” attribute for severe/non-severe COVID-19 patients is described using a bar plot in Fig. 2. There were 347 male and 252 female patients in the non-severe cohort. There were 204 male patients and 96 female patients in the severe cohort.

Figure 2
figure 2

Frequency distribution of the gender attribute.

In machine learning analysis, categorical values must be encoded since the classifiers do not handle text values. Several encoding techniques exist in machine learning17. In this study, we used one-hot encoding for the “Gender” attribute18; this mechanism avoids the artificial ordinality that other encodings can introduce into categorical variables. Data scaling was performed using standardization19. When the ranges of the attributes differ considerably, accuracy decreases, and the classifiers favour parameters with larger values regardless of their units. Normalization and standardization are the two approaches commonly used to scale datasets in machine learning; standardization was chosen in this study since it handles outliers better. The dataset was then split into training and testing sets in the ratio 80:20. There was a significant class imbalance: the number of severe COVID-19 cases was roughly half the number of non-severe cases, and models trained on such unbalanced data are biased towards the majority class. Hence, we used the Borderline Synthetic Minority Oversampling Technique (Borderline-SMOTE) to balance the training dataset20. This algorithm generates new synthetic minority samples using the K-nearest neighbours algorithm and handles borderline cases well. Under-sampling was not preferred in this study since we did not want to lose interesting trends and patterns. Further, the testing data was not balanced, to preserve data integrity.
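A minimal sketch of this pre-processing pipeline is shown below, continuing the imputation snippet and assuming the hypothetical label column "Severity" (coded 0 = non-severe, 1 = severe); the random seeds and the stratified split are illustrative choices rather than details reported in the paper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import BorderlineSMOTE

# One-hot encode the "Gender" attribute to avoid imposing an artificial order.
df = pd.get_dummies(df, columns=["Gender"])

X = df.drop(columns=["Severity"])
y = df["Severity"]          # assumed coded as 0 = non-severe, 1 = severe

# 80:20 train/test split; only the training portion is balanced afterwards.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize the features (the scaler is fitted on the training set only).
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Borderline-SMOTE oversamples the minority (severe) class in the training set.
X_train_bal, y_train_bal = BorderlineSMOTE(random_state=42).fit_resample(X_train, y_train)
```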

Fourteen feature selection methods were used to choose the most important markers, including several metaheuristic nature-inspired algorithms. Feature selection is essential in machine learning since classifiers perform better when redundant features are removed. Nature-inspired algorithms have several advantages over traditional feature selection techniques: they are known for global optimization, robustness, scalability, parallelism, adaptability, simplicity and stochasticity. Table 3 describes the features chosen by each algorithm. Among all the algorithms, the salp swarm algorithm chose the maximum number of features (18). The whale optimization algorithm, the flower pollination algorithm and mutual information chose 15 features each. The sine cosine algorithm chose the minimum number of features (3), while Harris hawks optimization and particle swarm optimization chose six features each. The markers chosen by the feature selection techniques are also described in Fig. 3. CRP was the most frequently chosen feature, selected by thirteen algorithms, followed by neutrophils, NLR and AST, which were chosen 10, 9 and 8 times, respectively. The platelet count was not chosen by any algorithm.

Table 3 Feature selection using several algorithms.
Figure 3
figure 3

Markers chosen by the feature selection methods.
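The nature-inspired selectors come from metaheuristic toolkits and are not shown here, but the mutual information filter, which selected 15 markers in this study, can be sketched with scikit-learn as follows (continuing the pre-processing snippet; variable names are illustrative):

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Rank the markers by mutual information with the severity label and keep the
# top 15, mirroring the number of features mutual information selected here.
selector = SelectKBest(score_func=mutual_info_classif, k=15)
X_train_mi = selector.fit_transform(X_train_bal, y_train_bal)
X_test_mi = selector.transform(X_test)

# Names of the selected markers (order matches the transformed matrices).
feature_names = [X.columns[i] for i in selector.get_support(indices=True)]
```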

Machine learning concepts

Machine learning is a form of artificial intelligence that enables software to predict outcomes using past information as input. Several ML classifiers have been used in this study, such as random forest, decision tree, logistic regression, K-nearest neighbours, CatBoost, AdaBoost, XGBoost, LightGBM, and stacking and voting algorithms. Stacking combines the predictions of multiple base models35: its architecture consists of a classifier that takes the base models' predictions as input, and these predictions are aggregated according to their weights, improving overall accuracy. The meta-learner is therefore a crucial factor in stacking; logistic regression was the meta-learner used in this research. The stacking architecture is described in Fig. 4.

Figure 4
figure 4

Stacking methodology used in this research.
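A minimal sketch of such a stacked ensemble with a logistic-regression meta-learner is given below (continuing the running example); the particular base learners and their settings are assumptions for illustration, not the exact configuration used in the paper.

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Base learners feed their predictions to the logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("xgb", XGBClassifier(eval_metric="logloss")),
        ("lgbm", LGBMClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train_mi, y_train_bal)
```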

A voting classifier combines the predictions of an ensemble of classifiers and assigns the class with the highest overall vote or probability; it uses the concept of majority voting36. Voting is of two types: hard voting and soft voting. In hard voting, the class with the maximum number of votes is chosen, irrespective of the weights37. Soft voting predicts the outcome from the average of the predicted class probabilities38. The voting architecture is described in Fig. 5.

Figure 5
figure 5

Voting methodology used in this research.
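The two voting schemes can be sketched as follows (continuing the running example; the base models listed here are illustrative assumptions):

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

base_models = [
    ("rf", RandomForestClassifier(random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
]

# Hard voting picks the majority class label; soft voting averages the
# predicted class probabilities before choosing the class.
hard_vote = VotingClassifier(estimators=base_models, voting="hard").fit(X_train_mi, y_train_bal)
soft_vote = VotingClassifier(estimators=base_models, voting="soft").fit(X_train_mi, y_train_bal)
```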

Further, fivefold cross-validation was applied to the data. Here, different subsets of the data are used to validate model efficiency: the input data is divided into five equal groups, and in each round four groups are used for training while the fifth is used for validation, rotating through all combinations. Hyperparameter tuning was performed with the grid search method to choose the best parameters. The performance of a classifier depends on the hyperparameters chosen, and grid search automates the tuning and outputs the best values.
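A sketch of the cross-validated grid search is shown below; the random forest and the parameter grid are placeholders, since the paper does not list the exact grids searched.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Five folds: four are used for training and one for validation in each round.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid search evaluates every hyperparameter combination over the five folds.
param_grid = {"n_estimators": [100, 300, 500], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=cv, scoring="f1")
search.fit(X_train_mi, y_train_bal)
print(search.best_params_, search.best_score_)
```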

We have chosen several classification and loss metrics to evaluate the models in this study: precision, recall, accuracy, F1-score, area under the ROC curve (AUC), average precision (AP), the Matthews correlation coefficient, log loss, Jaccard score and Hamming loss. Emphasis has been placed on precision and recall since they focus on false-positive and false-negative cases.
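All of these metrics are available in scikit-learn; a sketch of the evaluation on the untouched test split (continuing the running example, with a 0/1 severity label assumed) could look like this:

```python
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             hamming_loss, jaccard_score, log_loss,
                             matthews_corrcoef, precision_score, recall_score,
                             roc_auc_score)

y_pred = stack.predict(X_test_mi)
y_prob = stack.predict_proba(X_test_mi)[:, 1]   # probability of the severe class

metrics = {
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_prob),
    "average_precision": average_precision_score(y_test, y_prob),
    "mcc": matthews_corrcoef(y_test, y_pred),
    "log_loss": log_loss(y_test, y_prob),
    "jaccard": jaccard_score(y_test, y_pred),
    "hamming_loss": hamming_loss(y_test, y_pred),
}
print(metrics)
```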

In this research, three state-of-the-art deep learning models have been tested: DNN, 1D-CNN and LSTM. A DNN consists of an input layer, multiple hidden layers and an output layer39; its essential function is to take inputs, process them through successive computations, and predict results. CNNs are primarily used for image classification, but 1D-CNN models have also proven effective for classifying tabular data40. LSTMs are widely used in sequence prediction problems41 and employ three types of gates: the input gate, the output gate and the forget gate. LSTMs have proven to be highly efficient in handling time-series data.

After training and testing the ML and DL models, five XAI techniques have been used to demystify the predictions. The results obtained by the XAI techniques are in the form of graphs and tables, which can be easily understood by ML users. The entire process flow of this study is described in Fig. 6.

Figure 6
figure 6

Machine learning methodology used in this research.

Ethical approval

Ethical clearance has been obtained to collect patient data from Manipal Academy of Higher Education ethics committee with id IEC: 613/2021. The need for informed consent was waived by the ethics committee/Institutional Review Board of Manipal Academy of Higher Education, because of the retrospective nature of the study. All methods were carried out in accordance with relevant guidelines and regulations.

Results

Model testing

In this research, multiple machine learning and deep learning classifiers have been trained and tested to predict COVID-19 severity. The precision obtained by the models for the various feature selection techniques is tabulated in Table 4. We emphasized the stacking and voting classifiers since they combine multiple models. From the table, it can be seen that the stacked model obtained the maximum precision of 94% when mutual information was used; the soft-voting and hard-voting classifiers also obtained a precision of 94% each. The bat algorithm performed well too: the stacking, hard-voting and soft-voting classifiers obtained precisions of 91%, 91% and 90%, respectively. The flower pollination algorithm was also efficient, with the stacking, hard-voting and soft-voting classifiers obtaining precisions of 87%, 86% and 84%, respectively. The precisions obtained by the stacking, hard-voting and soft-voting classifiers with the Jaya algorithm were 87%, 90% and 89%, respectively.

Table 4 Precision obtained by the classifiers for various feature selection methods (In %).

The recall obtained by the models for all the feature selection techniques is described in Table 5. Mutual information was the best feature selection method: the recalls obtained by the stacking, hard-voting and soft-voting classifiers were 93%, 95% and 94%, respectively. The bat algorithm was the next best performer, with recalls of 90%, 93% and 91% for the stacking, hard-voting and soft-voting models, respectively. The flower pollination algorithm performed well too, with recalls of 86%, 90% and 90%, respectively. The recalls obtained by the stacking, hard-voting and soft-voting classifiers with the Jaya algorithm were 87%, 91% and 90%, respectively. For further analysis, the best four feature selection techniques were considered: mutual information, the bat algorithm, the flower pollination algorithm and the Jaya algorithm.

Table 5 Recall obtained by the classifiers for various feature selection methods (In %).

The classification and loss metrics are tabulated in Table 6. Mutual information performed the best among the four methods: the accuracies obtained by the stacking, hard-voting and soft-voting classifiers were 90%, 95% and 94%, respectively. The bat algorithm obtained excellent results too, with accuracies of 92%, 95% and 91%. The flower pollination algorithm performed relatively well, with accuracies of 87%, 85% and 86%. The accuracies obtained by the stacking, hard-voting and soft-voting classifiers with the Jaya algorithm were 89%, 89% and 89%, respectively.

Table 6 Classification and loss metrics for the best four selection methods (In %).

The ROC curves for the stacked model for the four feature selection methods are depicted in Fig. 7. The AUC was highest for mutual information, at 0.96. The precision-recall curves for the stacked classifiers for the four feature selection methods are described in Fig. 8. The stacked model obtained a maximum average precision of 0.98 after being trained on the features chosen by mutual information.

Figure 7
figure 7

ROC curves for the stacked models. (a) MI (b) BA (c) FPA (d) JA.

Figure 8
figure 8

Precision-recall curves for the stacked models. (a) MI (b) BA (c) FPA (d) JA.

Further, the results obtained by the machine learning models were compared with those of the deep learning models: DNN, 1D-CNN and LSTM. The architectures of the deep learning models are described in Fig. 9. For the DNN, five layers were considered, with 30, 11, 7, 4 and 1 neurons, respectively. “ReLU” was the activation function used for the input and hidden layers, and “sigmoid” for the output layer. “Adam” was the optimizer, and binary cross-entropy was the loss function. A learning rate of 0.0001 was utilized, the batch size was set to 10, and the network was trained for 750 epochs to establish reliable results.

Figure 9
figure 9

Architecture of the deep learning models. (a) DNN (b) 1D-CNN (c) LSTM.
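A minimal Keras sketch of the DNN described above follows; which feature set the deep models were trained on is not specified, so the balanced training matrix from the earlier snippets (with a 0/1 severity label) is assumed, and the validation split is illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Five dense layers with 30, 11, 7, 4 and 1 neurons; ReLU for the hidden layers
# and sigmoid for the output, as described above.
dnn = keras.Sequential([
    layers.Input(shape=(X_train_mi.shape[1],)),
    layers.Dense(30, activation="relu"),
    layers.Dense(11, activation="relu"),
    layers.Dense(7, activation="relu"),
    layers.Dense(4, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
dnn.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
            loss="binary_crossentropy", metrics=["accuracy"])
history = dnn.fit(X_train_mi, y_train_bal, epochs=750, batch_size=10,
                  validation_split=0.2, verbose=0)
```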

For the 1D-CNN, we included Conv1D, max pooling, dropout and flatten layers. The loss function was binary cross-entropy, and “Adam” was the optimizer. The number of epochs and the batch size were set to 10 each. A learning rate of 0.001 was utilized; “leaky ReLU” was the activation function for the input and hidden layers, and “sigmoid” for the output layer.

The LSTM used four layers consisting of 150, 75, 50 and 1 neurons, respectively. The loss function was binary cross-entropy, the optimizer was “Adam”, and the batch size was set to 32.

For all three models, the data was split into training and testing sets in the ratio 80:20. The results obtained by the deep learning models are described in Table 7. Among the three, the DNN performed the best, with an accuracy of 89%; the 1D-CNN and LSTM obtained accuracies of 85% and 83%, respectively. The accuracy and loss curves for the models are depicted in Fig. 10, which indicates that the results are reliable and the models are not overfitting.

Table 7 Classification and loss metrics obtained by the deep learning models.
Figure 10
figure 10

Accuracy and loss curves obtained by the deep learning classifiers. (a) Accuracy curve for DNN (b) Accuracy curve for 1D-CNN (c) Accuracy curve for LSTM (d) Loss curve for DNN (e) Loss curve for 1D-CNN (f) Loss curve for LSTM.

Explainable artificial intelligence

In this study, five XAI methods, SHAP, LIME, QLattice, Eli5 and Anchor, have been used to make the models more interpretable. We chose the stacked model for interpretation since it obtained good results and is generally reliable. Deep learning classifiers were not considered because many explainers currently do not support them; moreover, the machine learning algorithms outperformed the deep learning models in this study. This is common in artificial intelligence applications, since deep learning models typically excel only when trained on very large datasets.

SHAP is a widely used XAI technique that provides both global and local interpretations42. SHAP uses concepts from game theory and probability to quantify the impact of each attribute. The global interpretation of the models is explained using beeswarm plots, as shown in Fig. 11. Points to the left of the centre line push predictions towards the non-severe class and points to the right towards the severe class; red indicates a higher marker value and blue a lower one. The markers are arranged by importance, with the most important feature at the top. The figure shows that the most important markers are basophils, CRP, LDH, lymphocytes, albumin, protein and ferritin. CRP, LDH and ferritin levels increased in severe COVID-19 patients, while basophil, lymphocyte, albumin and protein levels decreased.

Figure 11
figure 11

Global SHAP interpretation using beeswarm plots. (a) MI (b) BA (c) FPO (d) JA.
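A sketch of how such beeswarm plots can be produced for the stacked model with the shap package is given below (continuing the running example; the shape of the explainer output can vary between shap versions):

```python
import shap

# KernelExplainer is model-agnostic, so it works with the stacked ensemble;
# a small background sample keeps the computation tractable.
background = shap.sample(X_train_mi, 100)
explainer = shap.KernelExplainer(stack.predict_proba, background)
shap_values = explainer.shap_values(X_test_mi)

# Beeswarm-style summary plot for the "severe" class (index 1); some shap
# versions return a single 3-D array instead of a list of per-class arrays.
shap.summary_plot(shap_values[1], X_test_mi, feature_names=feature_names)
```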

Local interpretations can be explained using SHAP force plots, as shown in Fig. 12. Figure 12a,c indicate a non-severe prognosis: markers such as lymphocytes, SpO2, basophils and CRP push the predictions towards a non-severe outcome. Figure 12b,d indicate a severe COVID-19 prognosis, where markers such as CRP, AST, basophils and lymphocytes push the predictions towards severe COVID-19.

Figure 12
figure 12

Local SHAP interpretation using force plots. (a) MI (b) BA (c) FPO (d) JA.

LIME is another explainer used to make local interpretations43. It follows a model-agnostic approach, i.e. it works with most ML models, and uses a ridge regression surrogate model together with kernels such as the Gaussian/RBF kernel to explain individual predictions. The LIME interpretations are depicted in Fig. 13. Figure 13a,b predict a severe prognosis, and Fig. 13c,d indicate a non-severe prognosis. The attributes are arranged in descending order of importance. The figure shows that the most important markers are albumin, D-Dimer, LDH, CRP, basophils, protein, AST, SpO2 and lymphocytes.

Figure 13
figure 13

Model explainability using LIME. (a) MI (b) BA (c) FPO (d) JA.
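A local explanation of this kind can be sketched with the lime package as follows (continuing the running example; the instance index and the number of features shown are arbitrary):

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train_mi, feature_names=feature_names,
    class_names=["non-severe", "severe"], mode="classification")

# Explain a single test patient: which markers push the prediction towards
# the severe or non-severe class, and by how much.
exp = explainer.explain_instance(X_test_mi[0], stack.predict_proba, num_features=10)
print(exp.as_list())   # (feature condition, weight) pairs, most important first
```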

Eli5 is yet another method to demystify predictions44. It is a Python package and is widely used with tree-based classifiers. Figure 14 depicts the Eli5 explanations; according to them, the most essential attributes are albumin, urea, lymphocytes, CRP, NLR and basophil count. This explainer also reports a “bias” term, the model's baseline contribution.

Figure 14
figure 14

Model explainability using Eli5. (a) MI (b) BA (c) FPO (d) JA.
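The paper does not detail how Eli5 was invoked; one common pattern, sketched below under that assumption, is to inspect a single prediction of a tree-based model (which includes the bias term) and to rank features for the stacked model via permutation importance:

```python
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestClassifier

# Per-prediction contributions for a tree-based model, including the <BIAS> term.
rf = RandomForestClassifier(random_state=42).fit(X_train_mi, y_train_bal)
eli5.show_prediction(rf, X_test_mi[0], feature_names=feature_names,
                     show_feature_values=True)

# Model-agnostic feature ranking for the stacked classifier.
perm = PermutationImportance(stack, random_state=42).fit(X_test_mi, y_test)
eli5.show_weights(perm, feature_names=feature_names)
```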

The QLattice explainer was developed by Abzu45. It combines ideas inspired by quantum physics with symbolic regression to explain the predictions. QLattice trains models to capture the variation in the data. The input attributes are called registers, and a collection of registers is termed a QGraph. Every QGraph has a set of nodes (registers) and activation functions, such as add, multiply, log, sine, tanh and Gaussian. The QGraphs are described in Fig. 15; they show that the most important markers are lymphocytes, CRP and D-Dimer.

Figure 15
figure 15

Model explainability using QLattice. (a) MI (b) BA (c) FPO (d) JA.
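A rough sketch of the QLattice workflow with Abzu's feyn package is shown below; the exact arguments differ between feyn versions, and the label column name is the hypothetical one used in the earlier snippets.

```python
import feyn
import pandas as pd

# feyn expects a pandas DataFrame with the label as a named column.
train_df = pd.DataFrame(X_train_mi, columns=feature_names)
train_df["Severity"] = list(y_train_bal)

ql = feyn.QLattice(random_seed=42)
models = ql.auto_run(data=train_df, output_name="Severity", kind="classification")

best = models[0]          # highest-ranked symbolic model (QGraph)
print(best.sympify())     # symbolic expression combining the selected markers
```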

Anchor is an XAI technique that expresses explanations as rules and conditions46. The strength of an anchor is measured by its precision and coverage: precision defines how accurately the rule captures the prediction, and coverage determines how many instances satisfy the same conditions. The anchors for non-severe and severe cases are described in Table 8. The most important markers are basophils, albumin, lymphocytes, CRP, D-Dimer, neutrophils, protein and NLR.

Table 8 Model explainability using Anchor.
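The paper does not name the anchor implementation used; one option, sketched below as an assumption, is the AnchorTabular explainer from the alibi package, which reports each rule together with its precision and coverage (continuing the running example).

```python
from alibi.explainers import AnchorTabular

# Anchors are IF-THEN rules that (almost) always reproduce the model's prediction.
explainer = AnchorTabular(predictor=stack.predict, feature_names=feature_names)
explainer.fit(X_train_mi)

explanation = explainer.explain(X_test_mi[0], threshold=0.95)
print("Anchor:   ", " AND ".join(explanation.anchor))
print("Precision:", explanation.precision)   # how often the rule yields this prediction
print("Coverage: ", explanation.coverage)    # fraction of instances the rule applies to
```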

The findings of the five XAI techniques are consistent. The most important markers for predicting a patient's severity are CRP, lymphocytes, basophils, albumin, D-Dimer, NLR and neutrophils.

Discussion

This research used multiple machine learning algorithms to predict severe COVID-19 cases in advance so that appropriate treatments can be provided to vulnerable patients. To demystify the predictions, five heterogeneous XAI techniques were used. Doctors and medical professionals can easily understand the variation in the markers highlighted by the explainers. This decision support system can be set up in various medical facilities to aid healthcare workers. In developing countries, the application can be used to make judicious use of essential medical assets such as ICU beds, ventilators and medicines. The models can also be utilized to offer a second opinion to doctors.

Fourteen feature selection methods were utilized, and the best four were chosen for further analysis: mutual information, the bat algorithm, the flower pollination algorithm and the Jaya algorithm. A maximum accuracy of 95% was obtained with mutual information, with an F1-score, AUC and AP of 94%, 0.98 and 0.99, respectively. With the bat algorithm, an accuracy of 93% was obtained (F1-score 92%, AUC 0.97, AP 0.94). The flower pollination algorithm yielded an accuracy of 89% (F1-score 88%, AUC 0.95, AP 0.97), and the Jaya algorithm an accuracy of 90% (F1-score 88%, AUC 0.95, AP 0.97). Most machine learning models performed relatively well.

Several markers showed variation between the two cohorts. Among them, CRP was identified by all the XAI techniques, and CRP levels increased in severe COVID-19 patients in this study47. Lymphocyte levels decreased in severe COVID-19 patients; lymphopenia is commonly recorded when patients' conditions deteriorate48. Basopenia was also observed in severe COVID-19 patients in this research49. Low serum albumin has been associated with severe COVID-1950, and this variation was also observed in this study. D-Dimer is already an established marker for predicting COVID-19 outcomes51, and elevated D-Dimer levels were found in the severe COVID-19 cohort. NLR is a vital marker that has been utilized to diagnose and predict severity in patients; NLR levels are elevated in severe COVID-19 patients52, and the same trend was observed in this research. SpO2 levels decreased rapidly in the severe COVID-19 cohort; low oxygen levels can seriously threaten COVID-19 patients since they cause hypoxia53. Age was also observed to be an important factor in predicting COVID-19 severity, with older patients more vulnerable to severe symptoms54. The above markers can be monitored carefully to prevent a fatal prognosis.

Various machine learning studies have been conducted to predict the severity of COVID-19. Raman et al.55 used machine learning to predict COVID-19 severity at hospital admission, with patient data collected from the University of Texas. Their random forest obtained a sensitivity of 72% and a specificity of 78% and could predict severity within six hours of hospital admission. Ershadi et al.56 used imaging and clinical data to predict COVID-19 severity with a fuzzy-based classifier; two datasets were used, and the accuracies obtained were 92% and 90%, respectively. Chest X-ray images and clinical data were used to predict COVID-19 severity in another study57: 930 COVID-19 patients from Italy were considered, the stacking classifier achieved an accuracy of 89.03%, and the most important markers were LDH, CRP, age, WBC and SpO2. Bello et al.58 used clinical markers and omics to predict COVID-19 severity; the model obtained an accuracy of 91.6%, and the most important markers were LDH, albumin, creatinine, lymphocytes, neutrophils and potassium.

However, to the best of our knowledge, no existing articles use five different XAI techniques to predict COVID-19 severity, and explainers such as Anchor, QLattice and Eli5 have rarely been used in medical machine learning. There are some limitations in our study. The data was collected from a single geographical territory (India); multiple datasets from different sources must be considered to make the classifiers more reliable. This research made exclusive use of supervised learning; unsupervised learning and reinforcement learning algorithms were not considered. Graphical processing units (GPUs) increase computational speed during training, but they were not used in this study. We divided the dataset into training and testing sets and performed cross-validation; however, we could not test the models on real-time patients as our study was retrospective. A prospective study can be conducted in the future to test prognoses in real patients. Our machine learning models could also be applied to other diseases and public health issues59,60,61,62,63,64,65.

Conclusions

XAI is a part of machine learning, generally used to demystify the predictions made by classifiers. In this study, we used several supervised learning algorithms and XAI techniques to predict COVID-19 severity in advance. Patients vulnerable to severe COVID-19 symptoms can thus be identified early, and appropriate treatments can be provided to save them. Various patterns and trends in the clinical markers were observed using descriptive statistics in the initial part of this research. Multiple feature selection techniques, including nature-inspired algorithms, were utilized to select the most crucial parameters. Several algorithms, such as bagging, boosting, stacking, voting and state-of-the-art deep learning models, were used to make accurate predictions. The mutual information algorithm proved to be the most efficient feature selection technique, obtaining a maximum accuracy of 95%. Five heterogeneous XAI algorithms, namely SHAP, LIME, QLattice, Eli5 and Anchor, were used to understand the classification predictions. According to them, the most essential marker was CRP; other markers such as D-Dimer, lymphocytes, neutrophils, albumin and basophils were also crucial. The classifiers can be utilized as a decision support system in hospitals to predict COVID-19 severity in advance, offer medical professionals a second opinion, and aid rapid diagnosis.

In the future, cloud-based models can be deployed, allowing both the data and the code to be stored and served more efficiently. High-end GPUs can be utilized to train the deep learning algorithms. Other diagnostic methods, such as rapid antigen tests, chest X-rays and genome sequencing, can be combined suitably. Prognosis can be predicted for various COVID-19 variants. Electronic health records from multiple hospitals across various countries can be combined before training the models. Other deep learning techniques, such as fuzzy ensembling, could also be utilized.