Deep forest model for diagnosing COVID-19 from routine blood tests

The Coronavirus Disease 2019 (COVID-19) global pandemic has threatened the lives of people worldwide and posed considerable challenges. Early and accurate screening of infected people is vital for combating the disease. To help with the limited quantity of swab tests, we propose a machine learning prediction model to accurately diagnose COVID-19 from clinical and/or routine laboratory data. The model exploits a new ensemble-based method called the deep forest (DF), where multiple classifiers in multiple layers are used to encourage diversity and improve performance. The cascade level employs the layer-by-layer processing and is constructed from three different classifiers: extra trees, XGBoost, and LightGBM. The prediction model was trained and evaluated on two publicly available datasets. Experimental results show that the proposed DF model has an accuracy of 99.5%, sensitivity of 95.28%, and specificity of 99.96%. These performance metrics are comparable to other well-established machine learning techniques, and hence DF model can serve as a fast screening tool for COVID-19 patients at places where testing is scarce.


Related work
A review of the machine learning models for diagnosing COVID-19 using routine laboratory and/or clinical data can be found in article 12 . The most popular prediction models proposed were based on random forest (RF) 16 , logistic regression (LR) 28 , support vector machine (SVM) 29 and neural networks (ANN) 30 . In this section, we describe the recent machine learning models reported for early detection of COVID-19 or identifying the severity level of confirmed COVID-19 patients based on laboratory and/or clinical data.
Cabitza et al. 9 extended their earlier work reported in 31 for COVID-19 detection from routine blood samples. The authors considered five machine learning models including RF, naive bayes (NB), LR, SVM, and k-nearest neighbors (KNN). A dataset of 1,624 routine blood samples was obtained from patients (52% COVID-19 positive) admitted to San Raphael Hospital (OSR), Milan, Italy. The models resulted in an accuracy in the range of 74-88%, a sensitivity in the range of 70-89%, a specificity in the range of 79-92% and an Area Under the Curve (AUC) in the range of 74-90%.
Abdulaal et al. 11 designed and compared an artificial neural network (ANN) and COX regressions models based on clinical data to predict the death for COVID-19 patients. The clinical dataset is collected from 398 patients in London Teaching Hospital. The Cox regression model achieved an average accuracy of 83.8%, a sensitivity of 50%, a specificity of 96.6% and an AUC of 86.9%, whereas ANN model accuracy was 90%, a sensitivity of 64.7%, a specificity of 96.8% and an AUC of 92.6%. An ensemble learning model for COVID-19 diagnosis from routine blood tests was proposed in 12 . The ensemble model used three classifiers extra trees, RF and LR at the first level, and an extreme gradient boosting (XGBoost) classifier at the second level to combine the predictions from the first level classifier. The model was trained and evaluated by using a dataset from Albert Einstein Hospital in Brazil 26 .
In a recent study 13 , the authors were the first ones to report the use of deep learning models to diagnose COVID-19 from routine blood tests. The authors selected six different deep learning models including ANN, convolutional neural networks (CNNs), long-short term memory (LSTM), recurrent neural networks (RNNs), CNNLSTM, and CNNRNN for evaluation. Models were trained and tested with 18 blood features from 600 patients seen at the Albert Einstein Hospital in Brazil 26 . The best performing algorithm was the CNNLSTM hybrid model with train-test split approach, which resulted in an accuracy of 92.3%, a precision of 92.35%, an AUC of 90%, a F1-score of 93% and a recall of 93.68%.
Aktar et al. 32 addressed the problem of identifying the severity level of confirmed COVID-19 patients for their appropriate ward selection (intensive care units (ICUs) or normal ward) based on blood samples. The authors considered eight machine learning models including decision tree (DT), RF, gradient boosting machine (GBM), XGBoost, SVM, light gradient boosting machine (LGBM), KNN, and ANN. The models resulted in an accuracy and a precision of above 90% for disease severity and mortality predictions.
Yao et al. 33 investigated the detection of severity level of COVID-19 patients by using the clinical information and the blood/urine test data. The authors utilized five machine learning models including LR, RF, Adaboost, SVM, and KNN. The dataset consisted of 137 clinically confirmed cases of COVID-19 patients from Tongji Hospital, China. Among the models, SVM was the winner which achieved an overall accuracy of 81.48%. Henzel et al. 34 used LR and XGBoost predictive models for the early screening of COVID-19 patients. Dataset consisted of 3114 patient's records collected at a hospital in Poland was used to train and test the models.
In contrast to other studies, Razavian et al. 35 developed a model with primary focus to identify COVID-19 patients with favourable outcomes within three days of a prediction. The aim of the study was to discharge low risk patients to free up limited beds for incoming patients. The authors trained four classifiers: LR, RF, LGBM and ensemble of these three models based on a simple averaging of the model probabilities. The models achieved a high average precision of 88.6%.
Hallman et al. 36 utilized XGBoost prediction model to determine which level of care a COVID-19 patient requires such as self-quarantine, admitted to the hospital or sent to ICU. The model was trained with 70% data and tested with 30% data from Albert Einstein Hospital in Brazil 26 .
Goodman-Meza et al. 37 39 proposed a model to predict three clinical outcomes: deceased, ventilated, or admitted to ICU for COVID-19 positive patients at NYU Langone Health (NYULH). The authors considered two prediction models: LR with feature selection using Least Absolute Shrinkage and Selection Operator (LASSO) and XGBoost. Models were trained with a dataset of 3740 patients and XGBoost model excelled in performance as compared with LR.
Vaid et al. 40 utilized XGBoost classifier model to predict in-hospital mortality and critical events at time windows of 3, 5, 7, and 10 days from admission. The model was trained and validated by using the electronic health records (EHRs) of COVID-19 positive patients admitted to Mount Sinai Health System in New York City. The XGBoost model achieved good performance for mortality as well as for critical event prediction.
Zhu et al. 41  Connor et al. 47 proposed models to predict hospitalization within 4 weeks of an outpatient COVID-19 test, ICU admission, mechanical ventilation and inpatient mortality by leveraging EHR data from the Duke University Health. For each type of models, three classifiers were considered including logistic regression, XGBoost and LGBM. Based on the validation, LGBM performed the best.
Casiraghi et al. 48 built an explainable COVID-19 risk prediction model based on clinical, laboratory and radiological data. First the features were selected based on their importance and then a RF classifier was trained with the chosen features. Kenneth et al. 49 developed a model to predict the risk of developing severe or fatal infections based on the UK Biobank. XGboost machine learning model was trained with 93 clinical variables and its predictive performance was assessed by cross-validation.
Xu et al. 50 developed a multi-class classification model to predict non-severe COVID-19, severe COVID-19, non-COVID viral infection, and healthy classes from clinical, lab testing, and CT scan features. A deep CNN was used to extract features from CT scan. Then, features from three different modalities (clinical, lab testing and CT scan) were fused together to train three machine learning models (KNN, RF, and SVM) to differentiate four classes at once. All three models achieved high accuracy (95.4-97.7%) to differentiate the overall four classes.
Souza et al. 51 proposed a model to study the disease progression in positive COVID-19 patients based on demographic and clinical data along with comorbidities in Brazil. The authors trained and evaluated seven machine learning models including LR, Linear Discriminant Analysis (LDA), NB, KNN, DT, XGBoost and SVM to predict the disease outcome. Chen et al. 52 investigated COVID-19 severity by using RF based on 26 comorbidity/symptom features and 26 blood features from patients in Wuhan, China. The authors identified the top five features from each modality to train and validate the model.
Bezzan et al. 53 selected XGBoost model using Bayesian Optimization among several machine learning models to predict whether COVID-19 patients are going to require special care (hospitalisation in regular or special-care units). The model was trained and evaluated based on lab exam data from patients in different hospitals in Brazil.
Subudhi et al. 54 compared the performance of 18 machine learning algorithms for predicting ICU admission and mortality among COVID-19 patients. The evaluated 18 machine learning algorithms belong to 9 broad categories, namely ensemble, gaussian process, linear, naïve bayes, nearest neighbour, support vector machine, tree based, discriminant analysis and neural network models. The dataset was obtained from 10,826 COVID-19 positive patients in the multihospital database (Massachusetts General Brigham Healthcare database). The ensemble-based models achieved the best performance among all the models.
Fakhartousi and Davies 27 reported a framework to select the best set of features from all the features obtained from routine blood tests with the aim to improve the COVID-19 diagnosis. For their study, the data samples of 279 patients from IRCCS Ospedale San Raffaele was collected, which included 177 positive samples and 102 negative samples. The authors employed 6 prediction models including KNN, LR, DT, RF, SVM, and NB. The authors considered different features selection methods to enhance the model prediction accuracy. Experimental results demonstrated that the combination of proper feature selection and prediction model played a key role for reliable and accurate detection of the coronavirus. www.nature.com/scientificreports/ A review of machine learning prediction models for COVID-19 diagnosis is given in Table 1. The six most popular classifiers employed in literature are the RF, LR, SVM, XGBoost, KNN and DNNs. As one can observe from Table 1, that DF has not been studied for COVID-19 diagnosis. The aim of this work is to study and compare the performance of DF with the existing techniques for COVID-19 detection. The structure of the deep forest enhances COVID-19 prediction as it is built based on accurate and diverse set of non-differentiable classifiers such as RF, and XGBoost. This work exploits the strength of DF to build a better COVID-19 prediction model on different datasets.

Design methodology
Dataset description. Two different datasets were used for this study, the first is publicly available by Kaggle 26 . This dataset has 5644 patient records which were collected from the 28th of March 2020 to 3rd of April 2020 at the Albert Einstein Israelita Hospital located in São Paulo, Brazil. The Albert Einstein dataset is informative as it contains several clinical tests data such as blood, urine, SARS-CoV-2, and rt-PCR test. The clinical data were standardized to have a mean of zero and a unit standard deviation. In this dataset 559 patients received a positive diagnosis while 5085 were negative cases. The SARS-Cov2 attribute indicates COVID-19 diagnosis, the proposed model converted the SARS-Cov2 attribute from string to integer where it is equal to zero in the absence of COVID-19 infection, and it is equal to one for positive cases. The second dataset is publicly available by the IRCCS Ospedale San Raffaele 10 . This dataset has 279 patients who were admitted to San Raffaele Hospital, Milan, Italy, from the end of February 2020 to mid of March 2020. The patient information in the San Raffaele dataset includes age, gender, routine blood tests values, rt-PCR test, to name a few. Among 279 patients, 177 patients were diagnosed as positive cases, the Swab variable gives COVID-19 diagnosis. Fig. 1. The green rectangles represent data preparation steps. In this study, there are three steps to prepare data. The first step is handling missing values where KNNImputer is utilized from sklearn.impute Python library. For imputation, KNNImputer used the k-nearest neighbors algorithm. The number of nearest neighbors is determined by n_neighbors parameter which was set to 11 in this work. The second step is removing outliers with isolation forest (iForest) 55 from sklearn.ensemble Python library 56 . Removing outliers improves the classification model performance because there are quantity and quality differences between outliers and normal records. The initial step iForest follows to remove outliers is creating an ensemble of isolation trees (iTrees). Then, detecting outliers is done by computing the average path lengths for each instance on the iTrees, outlier has a short average path length. The proposed model tunes two parameters of iForest: n_estimators and contamination. The number of estimators has been set to 150. The contamination parameter controls the outliers ratio in the dataset. The contamination parameter has been set to 2%. After that, from sklearn.utils Python library resample is employed to create small multiple subsets with replacement from the entire dataset. This resampling step enhances data variety and randomness. Thereafter, the whole dataset is randomly split into the training set (80%) and the test set (20%). The third step in data preparation is balancing the training data. This is an important step to avoid the bias of classifying toward the majority class. To achieve class balance, the dataset has been processed with SVMSMOTE 57 . The SVMSMOTE is an over-sampling technique using the SVM algorithm to generate new   58 . Figure 2 illustrates the structure of the cascade level in the proposed model. The structure is composed of different types of forests that encourage the diversity of classifiers. In fact, diversity helps in increasing the accuracy of an ensemble model. The DF-COVID-19 cascade level is constructed using six forests: two extra trees, two XGBoost, and two LightGBM. The number of trees in each forest is 250, and for each tree the maximum depth was set to 7. One of the main advantages of a deep forest model is that it has few hyper-parameters to tune. For instance, the parameter max_layers determines the maximum number of cascade layers. By setting this parameter, the structure of any deep forest based model is designed to check if adding a new cascade level will improve the performance or not, then this checking specifies the automatic termination of the cascade levels expansion progress. As shown in Fig. 2, the number of cascade levels in the proposed model is four. Figure 2 depicts that each forest in the cascade outputs a two dimensional class vector. In our case, the two dimensional class vector has two values: positive and negative COVID-19 diagnosis. The model processes the dimensional vector layerby-layer until the last layer to predict the final prediction. Thus, each layer in the cascade receives its input by its preceding layer, and outputs its results to the next layer. In detail, every single forest in the model will produce a 2-dimensional class vector. As a result, the generated class vectors of all forests in the DF-COVID-19 are equal to 12 where each forest generates one 2D vectors so 6 vectors in total, as illustrated in Fig. 2. After that, the generated class vectors are concatenated with the original input feature vector to be processed to the next level. To lower the risk of overfitting during the generation of every class vector, the model uses k-fold cross-validation. Subsequently when the last level in the cascade finishes its processing, the information produced by every forest in the previous layer will be averaged resulting in generating the final class vector. The final prediction is the maximum value in the final class vector.

DF-COVID-19 model general pipeline. The general pipeline of DF-COVID-19 model is illustrated in
The upper left corner in Fig. 2 illustrates an example of generating class vector for the extra trees forest. The leaf node has the concerned instance, for each forest an estimate vector of class distribution is created by calculating the percentage of different classes at the leaf node. Then, the final class vector in the last layer is computed by averaging all trees in the same forest, as shown in Fig. 2. www.nature.com/scientificreports/ is utilized to assess the importance of a feature in predicting COVID-19. SHAP is a technique based on a game theory that produces an interpretable model by using Shapley values. Since DF-COVID-19 is constructed of deep forest which is an ensemble of tree-based models, the TreeExplainer 59 is applied to compute the SHAP values of the proposed model. Figure 3 depicts the SHAP summary plot for one of the LightGBM classifiers in the cascade level of the DF-COVID-19 model. In this example, the selected features are set to be the same as the study 60 . The SHAP summary plot reveals the impacts of input features on COVID-19 positive cases prediction. The y-axis lists the input features in a descending order according to their importance. Each point on the plot represents a Shapley value, the color of the point ranges from blue (low) to red (high). The density of the points indicates the distribution in the dataset. Figure 3 shows that aspartate aminotransferase (AST) is the most important factor. The next critical features are white blood cells counts (WBC), lymphocyte count (Lymphocytes), neutrophil count (Neutrophils), gamma glutamyl transpeptidase (GGT), and age. The SHAP summary plot proves that low values of WBC, lymphocyte and neutrophil tend to increase the probability of positive diagnosis. While high values of GGT and age increase the possibility of COVID-19. Figure 3 demonstrates that a decrease in basophils count (Basophils), eosinophil count (Eosinophils) and alanine aminotransferase (ALT) leads to a decrease in the possibility of COVID-19 positive cases. As shown in Fig. 3 the least significant features are: c-reactive protein (CRP), alkaline phosphatase (ALP), lactate dehydrogenase (LDH), and monocytes count (Monocytes).

Performance evaluation and discussions
Performance metrics. Evaluating the model performance is one of the important steps. This study adopted five metrics to assess the performance of the proposed model: accuracy, AUC, sensitivity and specificity. Confusion matrix values are used to calculate the aforementioned performance metrics. The confusion matrix values include: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). TP and TN are when the actual class is correctly predicted, while FP and FN are when the actual class is incorrectly predicted. The area under the ROC Curve is known as AUC which illustrates the relation between TP (y-axis) and FP (x-axis). High score of AUC indicates that the model has high quality in differentiating its classes. The proportion of correctly predicted cases (TP and TN) over the whole dataset predictions is measured by accuracy. This measurement is mathematically described by the following Eq. (1): Specificity is defined as the proportion of true negative. The higher value of specificity the model has the better the model performs. The formula of specificity is given by Eq.  The first experiment is a comparison with the study in 13 . This study predicts COVID-19 diagnosis by training several deep learning approaches including: artificial neural network (ANN), convolutional neural networks (CNNs), long-short term memory (LSTM), and recurrent neural networks (RNNs). In addition, two hybrid deep learning models were implemented: CNNLSTM and CNNRNN. The training or evaluation were performed with two different approaches: 10 fold cross-validation and 80-20 train-test split. The Albert Einstein dataset 26 is used in this experiment. Table 2 reports the comparison results between DF-COVID-19 model and the deep learning models in 13 . It can be observed that DF-COVID-19 achieved the best results in the terms of accuracy and AUC. It is worth mentioning that the six different deep learning models in 13 have fluctuated values of AUC. For example, the highest value is 90% recorded by CNNLSTM with train-test split approach while the lowest value is 52.45% achieved by RNN with 10 fold cross-validation approach. Those lower values of AUC indicate that the deep learning models have limitations in distinguishing between positive and negative cases of COVID-19. Further, the best precision of 92.35% is obtained by CNNLSTM using train-test split approach. On the other hand, the DF-COVID-19 model recorded a lower precision of 85.66%. In terms of sensitivity and F1-score, the LSTM and CNNLSTM classifiers are mostly better than the DF-COVID-19. However, DF-COVID-19 proved its effectiveness in terms of accuracy, AUC, and precision.
For the second experiment, Table 3 shows the comparison of the proposed method with the different combinations of selected features and models in 27 . Different subsets of features are employed in 27 with the aim of selecting the best set of features that enhances the prediction of COVID-19 infection. Those subsets of features are: weighted by information gain ratio (Wei_IGR), weighted by correlation (Wei_Cor), forward, backward, and optimize. The forward features obtained the best accuracy and sensitivity with RF and SVM, respectively. The best specificity is achieved by NV with Wei_IGR features. The IRCCS Ospedale San Raffaele 10 dataset is used in this experiment. Among the different subsets of features, the best results of DF-COVID-19 are accomplished by applying Wei_IGR subset. Results in Table 3 demonstrate the improvement of DF-COVID-19 in accuracy and specificity. However, DF-COVID-19 recorded much lower sensitivity.   Table 4 shows the comparison between DF-COVID-19, FSGA and FCNB. The best accuracy is obtained by FCNB which is 99.0%, FSGA has almost the same accuracy (98.0%) while DF-COVID-19 recorded on average a lower accuracy of 87.80%. However, DF-COVID-19 outperforms FSGA and FCNB in terms of precision, sensitivity, and F1-score. DF-COVID-19 achieved a similar precision with no significant difference when comparing it with FSGA. For sensitivity, DF-COVID-19 obtained a high sensitivity of 91.3%, contrastingly FSGA and FCNB got much lower sensitivity of 84.0% and 79.0%, respectively. In regard to F1-score, DF-COVID-19 achieved a significantly higher F1-score (81.72%) than FSGA (78.0%) and FCNB (76.0%). Moreover, the Receiver Operating Characteristic (ROC) of DF-COVID-19 is illustrated in Fig. 5. As shown in Fig. 5, DF-COVID-19 achieved a considerably high value of ROC equals to 0.93 which proves the model robustness in differentiating between COVID-19 cases. All results in this experiment confirm the effectiveness of DF-COVID-19 in selecting discriminative features that enhance model performance. One possible reason is the deep forest model on which DF-COVID-19 is based.
The last experiment in this section demonstrates the effectiveness of DF-COVID-19 against various studies [61][62][63][64] . Table 5 lists the comparison between DF-COVID-19 and the other studies which used the Albert   Although DF-COVID-19 achieves a competitive performance in COVID-19 classification, it has several limitations. First, DF-COVID-19 model is only able to predict COVID-19 diagnosis: whether it is positive or negative. Actually, more COVID-19 classification tasks are needed to be classified such as COVID-19 severeness detection, predict the need for intensive care unit (ICU) admission, and COVID-19 fatalities prediction. The second limitation is the size of the dataset used which is relatively small, this may limit the performance of the model. To overcome this limitation, a bigger dataset with more number of patients is recommended to be used to train and test the proposed model. Another limitation is the lack of dataset diversity, even though two different datasets are used in the current study, however the model generalizability have to be increased by utilizing more datasets. In addition, the model needs to be explored further including checking its ability to work on extremely imbalance dataset.

Conclusions
The COVID-19 infection rate is still growing rapidly and continues to pose a dangerous threat to global health. Early catching of COVID-19 patients is crucial to combat the disease. However, PCR-based testing has bottlenecks in capacity and in scalability. In this article, a deep ensemble framework based on DF was exploited using routine laboratory and/or clinical data for fast screening of COVID-19 patients in hospital settings where PCR testing is scarce. In the proposed DF-based model, extra trees, XGBoost, and LightGBM were selected as the basic forests in each layer to improve the diversity to enhance performance. The training process is quick as it only has a few hyper-parameters to tune and does not require backpropagation and gradient adjusted, as compared with deep neural networks models. Experimental results showed that this DF-model has superior performance on COVID-19 diagnosis in comparison with existing state-of-the-art machine learning methods using information from two publicly available datasets. While computational methods come with their own limitations, application of this DF-based model would allow clinicians to have a quantifiable pre-test probability of COVID-19 cases, which can facilitate management, prognostication, and resource allocation. For the future work, we plan to extend the DF-based model to predict progression to severe COVID-19, respiratory insufficiency, need for ICU-level care, etc. in order to instruct more prompt management based on the availability of the dataset. In addition, the structure of the proposed model will be improved by including multi-grained scanning.