A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction

Though COVID-19 is no longer a pandemic but rather an endemic disease, the epidemiological situation related to the SARS-CoV-2 virus is developing at an alarming rate, impacting every corner of the world. The rapid escalation of the coronavirus has engaged the scientific community, which continually seeks solutions to ensure the comfort and safety of society. Understanding the joint impact of medical and non-medical interventions on COVID-19 spread is essential for making public health decisions that control the pandemic. This paper introduces two novel hybrid machine-learning ensembles that combine supervised and unsupervised learning for COVID-19 data classification and regression. The study utilizes the publicly available dataset of COVID-19 outbreak and potential predictive features in the USA, which provides information related to the outbreak of COVID-19 disease in the US, including data from each of 3142 US counties from the beginning of the epidemic (January 2020) until June 2021. The developed hybrid hierarchical classifiers outperform single classification algorithms. The best-achieved performance metrics for the classification task were Accuracy = 0.912, ROC-AUC = 0.916, and F1-score = 0.916. The proposed hybrid hierarchical ensemble combining both supervised and unsupervised learning improves the regression task by 11% in terms of MSE, 29% in terms of the area under the ROC curve, and 43% in terms of the MPP metric. Thus, using the proposed approach, it is possible to predict the number of COVID-19 cases and deaths based on demographic, geographic, climatic, traffic, public health, social-distancing-policy adherence, and political characteristics with sufficiently high accuracy. The study reveals that virus pressure is the most important feature in COVID-19 spread for classification and regression analysis. Five other significant features were identified as having the most influence on COVID-19 spread. The combined ensembling approach introduced in this study can help policymakers design prevention and control measures to avoid or minimize public health threats in the future.


Related works
The rapidly evolving disease and the straightforward transmission of virus pathogens have resulted in the development of numerous machine-learning models and applications. S. Solayman et al., in the study 9, began by carefully preparing data obtained from the Israeli Ministry of Health open-source website for classifiers. Experiments demonstrated that the hybrid convolutional neural network and long short-term memory algorithm with the SMOTE approach achieved the best results for classifying the introduced data. Satisfactory outcomes led to implementing an application that forecasts COVID-19 infections for users, providing feedback based on entered symptoms. Another application of machine learning in the fight against COVID-19 is highlighted in the paper 3, where the authors predict the condition of coronavirus-infected patients. Experiments utilized two datasets, demographic and clinical data of patients (n = 11,712) and demographic data, clinical information, and patient blood test results (n = 602), to develop predictive models and identify key features. Subsequently, the performance of eight different machine learning algorithms was compared. Experiments demonstrated that C-reactive protein, lymphocyte ratio, lactic acid, and serum calcium significantly influence the prognostic predictions of COVID-19. A study conducted in South Korea 10 involving 10,237 patients revealed that factors like age over 70, moderate or severe disability, comorbidities, and male gender are strongly associated with an increased risk of mortality from COVID-19. Through machine learning analysis, Lasso and Linear Support Vector Machine (SVM) models exhibited higher sensitivity and specificity in predicting mortality. The developed predictive model can classify patients rapidly under limited medical resources during a pandemic.
One consistent observation from the ongoing research on COVID-19 data is the variability in the application of classification methods across different countries. Experiments in the study 11 reveal that the Prophet model demonstrated sufficient accuracy in predicting cases in the USA, whereas for Brazil and India the Autoregressive Integrated Moving Average model performed better. The superiority of a deep learning model, including the Neural Prophet model, is confirmed by the study 12 conducted in 2022. Another study 13 illustrates the differences in applied statistical models depending on the location. The article examined multilayer perceptron, vector autoregression, and linear regression to predict the epidemic caused by the SARS-CoV-2 virus, utilizing data from Asian countries obtained from the Johns Hopkins University data repository. Drawing on data from Mexico, Muhammad et al. 14 developed supervised machine-learning models for COVID-19 infection using various classification models, examining correlations between input features beforehand. According to the research, the highest accuracy is associated with decision trees at 94.99%, the highest sensitivity (93.34%) with the Support Vector Machine model, and Naive Bayes exhibits the highest specificity at 94.30%.
A lack of accurate data on COVID-19 hinders the standard techniques for predicting the consequences of an epidemic. Considering this, Tiwari et al. 15 applied meta-analysis based on artificial intelligence, utilizing machine learning algorithms such as Naive Bayes, SVM, and Linear Regression to predict the trends of the global epidemic caused by the SARS-CoV-2 virus. Among the discussed techniques, Naive Bayes yielded the most satisfying results, demonstrating high effectiveness in predicting future values with lower mean absolute error and mean squared error. A comprehensive study employing diverse artificial intelligence strategies is described in reference 16, where long short-term memory, multilayer perceptron, adaptive neuro-fuzzy inference system, and recurrent neural network were employed. The analysis of the effectiveness of the considered methods focuses on the root mean squared error (RMSE), mean absolute percentage error (MAPE), mean absolute error (MAE), and the coefficient of determination (R²). The results indicate that for Bidirectional long short-term memory (LSTM) and artificial neural network models, R² values range from 0.64 to 1. Autoregressive Integrated Moving Average (ARIMA) and LSTM models demonstrated the highest MAPE errors. Another approach is characterized by the work of S. A.-F. Sayed et al. 17, who built a model predicting various levels of severity risk for COVID-19 using the analysis of chest X-ray images. The deep-learning CheXNet model and hybrid feature extraction techniques were applied in experiments. The study showed that the XGBoost classifier performed best with combined features (PCA + RFE), generating 97% accuracy, 98% precision, 95% recall, 95% F1-score, and 100% ROC-AUC. In the study, SVM demonstrated results that were equally satisfying as those of XGBoost. In the paper 18, Atta-ur-Rahman et al. directed their attention towards a mathematical model based on a cloud-based smart detection algorithm using a support vector machine. The results reached approximately 98.4% accuracy with 15-fold cross-validation. The comparison conducted in the study suggests that the proposed model exhibits greater accuracy and efficiency.
In one of the reviews encompassing 160 studies 19, a compilation of machine learning techniques from various sources such as Springer, IEEE Xplore, and MedRxiv was made. Two categories of machine learning were outlined: deep learning and supervised learning. Statistics indicate that deep learning is employed in 79% of cases, with 65% utilizing convolutional neural networks (CNN) and 17% using Specialized CNN. Supervised learning accounted for only 16% of the analyses, predominantly using Random Forest, Support Vector Machine, and regression algorithms. On the other hand, studies from 2021 by Kwekha-Rashid et al. 5 demonstrated that better learning results could be observed using supervised learning, characterized by high accuracy at 93%. A comparison of research results 4 on machine learning applications in the context of COVID-19 revealed that recurrent neural networks, deep diagnostic models, various contact tracing, medical diagnostics, and drug development-related algorithms were effective. Forecasting models achieved high correlations, and diagnostic models analyzing computer tomography and X-ray images demonstrated accuracy at 99%. The authors emphasized that limitations related to the lack of full access to patient data and algorithm imperfections highlight the need for the involvement of government agencies in facilitating the acquisition of COVID-19-related data. Alballa and Al-Turaiki 20 focused on COVID-19 diagnosis and predicting severity and mortality risk using machine learning algorithms. The authors note that most machine learning algorithms are supervised learning models, which are more straightforward and more understandable. The referenced article states the need for further research, especially in identifying optimal screening models for COVID-19 and creating a comparative dataset. The limitations are the use of unbalanced datasets, which requires effective techniques to deal with this issue, and the potential integration of different types of data, which necessitates further research for precise COVID-19 prediction.
Tkachenko et al. 6 aimed to increase the performance of prediction tasks using combined RBF-SGTM neural-like structures. They developed a committee of non-iterative artificial intelligence tools for regression analysis. Using the developed committee for insurance cost prediction allows the authors to decrease training and test errors and increase accuracy with a slight increase in the training procedure time. The authors conclude that the developed neural-like structures can be used to solve regression and classification tasks with large volumes of data for different application areas.

Dataset description
This study used the publicly available dataset of COVID-19 outbreak and potential predictive features in the USA 21. The dataset provides information related to the outbreak of COVID-19 disease in the United States, including data from each of 3142 US counties from the beginning of the outbreak (January 2020) until June 2021. The data were collected from many public online databases and include, for each county and for each day since the beginning of the outbreak, the daily number of COVID-19 confirmed cases and deaths, as well as 46 features that may be relevant to the pandemic dynamics: demographic, geographic, climatic, traffic, public-health, social-distancing-policy adherence, and political characteristics of each county 21 (Table 1). Haratian et al. 21 also prepared a processed version of the dataset, where the missing values are imputed and the abnormal values, e.g., negative counting values, are fixed. A detailed description of the dataset, its origin, and its processing is given in 21. The target variables are the numerical values of COVID-19 confirmed cases and deaths. The details of the 46 features are listed in Table 1.

Proposed supervised-unsupervised ensemble model
We used R version 4.1.2 to conduct this research, run on Spark 3.0.0 with 8 cores. Since the dataset contains 46 fixed and temporal features, it could be challenging to build a good predictive ML model or even to select the most important features. Because these data are of a different nature (demographic, geographic, economic, climate, social, etc., see Section "Dataset description"), we developed a naturally inspired multimodal-like ML model that combines both supervised and unsupervised learning. The following performance metrics were used to evaluate the model performance: accuracy, F1-score, and the area under the receiver operating characteristic curve (ROC-AUC) for the classification task, and mean squared error (MSE), ROC-AUC, and Model Performance Predictor (MPP) for the regression task. MPP tracks the predictive performance metric of the model.
In the first step of the research, data preprocessing was done. This included one-hot encoding for two features, missing data removal, and feature selection. Missing data imputation was not implemented; only missing data removal was used, given the low level of missing data (see Table 1). A version of the dataset imputed with the KNN imputer is also available in 21, and the potential usage of this dataset will be discussed later in this paper. As a result, 69 features were obtained. Principal Component Analysis (PCA) was used to reduce the dimensionality. PCA yields 60 principal components, which does not substantially reduce the dimensionality. This finding confirms the research hypothesis on the necessity of combining unsupervised and supervised learning to reduce the dimensionality of the input data and potentially increase the accuracy and robustness of the prediction model.
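The two preprocessing steps named above, missing data removal and one-hot encoding, can be illustrated with a minimal Python sketch. The study itself was carried out in R, so this is not the authors' code; the column names (`temperature`, `party`, `cases`) and the helper functions are hypothetical and serve only to show the idea on toy records.

```python
def drop_incomplete_rows(rows):
    """Keep only the rows in which no value is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


def one_hot_encode(rows, column):
    """Replace one categorical column with a 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Toy records; the column names are hypothetical, not taken from the dataset.
records = [
    {"temperature": 12.5, "party": "A", "cases": 10},
    {"temperature": None, "party": "B", "cases": 3},
    {"temperature": 20.1, "party": "B", "cases": 7},
]

clean = drop_incomplete_rows(records)      # the second record is removed
prepared = one_hot_encode(clean, "party")  # "party" becomes party_A / party_B
```

Removing rows rather than imputing them keeps only observed values, which matches the choice described in the text; the cost is a smaller sample.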
The next step was to use a complex ensemble for the classification task with three labels: min risk for confirmed cases, mid risk and huge risk. Two hybrid stacking ensembles are proposed.

Stacking supervised-unsupervised ensemble
The first ensemble is built on the classical stacking approach, in which only class probabilities and the corresponding target values are fed to a meta-classifier. A hierarchical hybrid classifier was developed (Fig. 1), which includes the following three levels:
1. Clustering of the input data using the k-means algorithm.
2. Selecting the most important features in each obtained cluster using Boruta, decision tree and Random Forest.
3. Building a stacking ensemble using the selected features for each cluster, with the Random Forest algorithm as a meta-model.
Logistic regression, KNN, SVM with linear kernel, naïve Bayes, decision tree and SVM with RBF kernel were used as weak classifiers.
First, one-hot encoding was implemented for categorical features such as age, education level, country, state, etc. The elbow method was used to select the appropriate number of clusters. Fourteen features were selected after voting for Boruta, random forest, and decision tree feature selectors.
The decision tree returns the feature weight as the criterion for evaluating features. It allows building a ranked list of selected features using different measures. In our case, CART was used for feature selection, with the Gini index as a measure.
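As an illustration of the Gini measure, the sketch below computes the Gini impurity of a set of class labels and the impurity decrease achieved by one threshold split, which is the quantity a CART-style tree typically accumulates into a feature weight. The function names, the toy feature values, and the threshold are assumptions made for this example; the risk labels reuse the class names from the text.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_gain(values, labels, threshold):
    """Impurity decrease obtained by splitting the samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Toy data (hypothetical): one numeric feature and two risk classes.
feature = [1.0, 2.0, 3.0, 4.0]
risk = ["min", "min", "huge", "huge"]
gain = gini_gain(feature, risk, threshold=2.5)  # a perfect split of a 50/50 parent
```

A feature whose splits produce large impurity decreases ranks high in the resulting list.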
Random Forest is an ensemble of numerous algorithms (decision trees) that are sensitive to the training set. These algorithms have a low bias. The bias of the training method is the deviation of the average response of the trained algorithm from the response of the ideal algorithm. Each of these classifiers is built on a random subset of objects and a random subset of features.
Boruta is a heuristic algorithm for selecting significant features based on the use of Random Forest. At each iteration, the features whose Z-measure is less than the maximum Z-measure among the added (shuffled) features are removed. To get the Z-measure of a feature, it is necessary to calculate its importance, obtained using the built-in algorithm in Random Forest, and divide it by the standard deviation of the feature importance. The added features are obtained as follows: the features available in the selection are copied, and then each new attribute is filled by shuffling its values. This procedure is repeated several times to get statistically significant results, and the added variables are generated independently at each iteration.
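The Boruta steps described above can be sketched in a few lines of Python. This is a simplified illustration and not the Boruta implementation used in the study: the importance values are supplied by hand instead of coming from a Random Forest, the statistical test performed by the full algorithm is omitted, and all names except `virus_pressure` are hypothetical.

```python
import random
import statistics


def make_shadow_columns(columns, rng):
    """Copy every column and shuffle the copy, breaking any link to the target."""
    shadows = {}
    for name, values in columns.items():
        shuffled = list(values)
        rng.shuffle(shuffled)
        shadows[f"shadow_{name}"] = shuffled
    return shadows


def z_measure(importances):
    """Mean importance over repeated runs divided by its standard deviation."""
    return statistics.mean(importances) / statistics.stdev(importances)


def boruta_keep(importance_runs):
    """Keep real features whose Z-measure exceeds the best shuffled feature."""
    z = {name: z_measure(runs) for name, runs in importance_runs.items()}
    shadow_max = max(v for k, v in z.items() if k.startswith("shadow_"))
    return sorted(k for k, v in z.items()
                  if not k.startswith("shadow_") and v > shadow_max)


# Shuffled copies of two toy columns (values are made up).
columns = {"virus_pressure": [5, 1, 9, 3], "noise": [2, 2, 7, 4]}
shadows = make_shadow_columns(columns, random.Random(0))

# Hand-written importances from three hypothetical Random Forest runs.
runs = {
    "virus_pressure": [0.40, 0.42, 0.38],
    "noise": [0.02, 0.05, 0.01],
    "shadow_virus_pressure": [0.03, 0.04, 0.02],
    "shadow_noise": [0.02, 0.03, 0.04],
}
kept = boruta_keep(runs)
```

In this toy setting only `virus_pressure` beats the best shuffled copy, so `noise` would be removed.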
Next, the Jaccard index is used for feature selector voting, and voting for the features is carried out. First, all importance values are summed. Next, the features with scores higher than the mean value are chosen. After that, 15 diverse classifiers were used, and 9 of the strongest were selected.
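The voting rule above (sum the importance values, keep the features scoring above the mean) can be sketched as follows. The text does not specify exactly how the Jaccard index enters the vote, so here it is only used as an assumed agreement measure between two selectors' feature sets. The importance values are invented; the feature names echo features mentioned in the paper.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def vote(importances_by_selector):
    """Sum importances over all selectors and keep features above the mean total."""
    totals = {}
    for scores in importances_by_selector.values():
        for feature, score in scores.items():
            totals[feature] = totals.get(feature, 0.0) + score
    mean_total = sum(totals.values()) / len(totals)
    return sorted(f for f, s in totals.items() if s > mean_total)


# Invented importance values for three selectors (illustration only).
importances = {
    "boruta": {"virus_pressure": 0.5, "total_population": 0.3,
               "temperature": 0.1, "area": 0.1},
    "random_forest": {"virus_pressure": 0.6, "total_population": 0.2,
                      "temperature": 0.1, "area": 0.1},
    "decision_tree": {"virus_pressure": 0.6, "total_population": 0.3,
                      "temperature": 0.05, "area": 0.05},
}
selected = vote(importances)

# Agreement between two hypothetical selector outputs.
agreement = jaccard(["virus_pressure", "area"], ["virus_pressure", "temperature"])
```

A Jaccard value close to 1 indicates that two selectors picked nearly the same features.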

Modified stacking supervised-unsupervised ensemble
The second ensemble utilizes a modified stacking approach in which all datasets and the transformed outputs of the weak classifiers are fed to a meta-classifier. Figure 2 depicts the structure of the proposed ensemble and the data transformation.
In contrast to Ensemble 1, Ensemble 2 trains the cut-off function of the classifier in addition to the weak models. In classical voting, the class cut-off is done with a constant coefficient of 0.5, which sharply reduces the efficiency of the algorithm to approx. 79%; the proposed cutting method increases the overall efficiency of the ensemble in comparison. The essence of the algorithm is the selection of the cut-off coefficient. In this case, the voting input contains a vector of independent classifier scores, which will vote differently depending on the context. The idea of the method is to determine the average value of the score at each vote and add it to the list of average scores. The list of average scores is a set of independent scores. Next, the cut-off coefficient is obtained at the output by applying the mathematical expectation function to this set. The obtained cut-off coefficient is close to the optimal class partition coefficient.
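A minimal sketch of the cut-off selection described above is given below, under simplifying assumptions that are not in the original: a two-class problem, weak-classifier scores in the range [0, 1], and the sample mean used as the estimate of the mathematical expectation. The function names and the numbers are hypothetical.

```python
from statistics import mean


def learn_cutoff(score_vectors):
    """Estimate the cut-off as the expectation of per-vote average scores.

    score_vectors holds one list of weak-classifier scores per training vote.
    """
    vote_averages = [mean(scores) for scores in score_vectors]  # average per vote
    return mean(vote_averages)  # sample mean as the expectation estimate


def classify(scores, cutoff):
    """Assign class 1 when the average score reaches the cut-off, else class 0."""
    return 1 if mean(scores) >= cutoff else 0


# Hypothetical scores of three weak classifiers on three training votes.
training_votes = [[0.9, 0.8, 0.7], [0.3, 0.4, 0.2], [0.6, 0.7, 0.5]]
cutoff = learn_cutoff(training_votes)

# A new vote whose average lies between 0.5 and the learned cut-off.
new_vote = [0.55, 0.50, 0.56]
with_learned_cutoff = classify(new_vote, cutoff)
with_constant_cutoff = classify(new_vote, 0.5)
```

The example shows how a learned cut-off can change the decision for borderline votes compared with the constant coefficient of 0.5.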
For each classifier and regressor, fivefold nested cross-validation was used. Each fold is constituted by two arrays: the first one is related to the training set, and the second one is related to the test set.
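The fold structure can be sketched with plain index arrays. The code below is an illustrative assumption about how nested fivefold splits may be generated (no shuffling or stratification) and is not the routine used in the study; the function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, test) index pairs without shuffling."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def nested_folds(n_samples, outer_k=5, inner_k=5):
    """Yield (outer_train, outer_test, inner_folds) for nested cross-validation.

    The inner folds split only the outer training part, so the outer test
    part is never seen during model selection.
    """
    for outer_train, outer_test in k_fold_indices(n_samples, outer_k):
        inner = []
        for tr_pos, val_pos in k_fold_indices(len(outer_train), inner_k):
            inner.append(([outer_train[i] for i in tr_pos],
                          [outer_train[i] for i in val_pos]))
        yield outer_train, outer_test, inner


splits = list(nested_folds(25, outer_k=5, inner_k=5))
```

Hyperparameters would be tuned on the inner folds and the final score reported on the outer test arrays.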
The general pipeline is given in Fig. 3.

Classification task
The target classes for this task were three classes with a risk of new COVID cases. Nine single classifiers, viz., Logistic Regression (GM), Decision Tree, SVM with linear kernel, k-nearest neighbors (KNN), eXtreme Gradient Boosting (XGBoost), SVM with Radial kernel (RBF), Random Forest, Naïve Bayes, and Multilayered perceptron with three hidden layers and four neurons inside of each layer (Ml (c(4, 3, 3))), were used to compare the performance of the proposed ensembles. Table 2 lists the most important features for the new COVID-19 case classification according to Boruta, Random Forest, and Decision Tree feature selectors (for each feature description, see Table 1). The listed features can help decision-makers select factors affecting COVID-19 spread and thus optimize medical care and/or restriction policy to minimize the epidemic impact, considering all aspects of human well-being.
The classification performance metrics for 9 weak classifiers and the proposed ensembles are summarized in Table 3. As one can see from the table, the best classification results among the single models were obtained in the case of the KNN model, with Accuracy = 0.816, ROC-AUC = 0.797, and F1-score = 0.814. Using the developed ensembles allows us to increase all the metrics substantially. Thus, in the case of Ensemble 1, Accuracy was raised to 0.895, ROC-AUC to 0.897, and F1-score to 0.897. The proposed cut-off voting improvement in Ensemble 2 further increased all the metrics compared to Ensemble 1 by approx. 2% (Accuracy, ROC-AUC, and F1-score values are 0.912, 0.916, and 0.916, respectively). Hence, the developed hybrid hierarchical classifiers outperform single classification algorithms by more than 10% and are well-suited for COVID-19 spread prediction in real life.
Dynamic voting based on mathematical expectation is used. In addition to the trained models themselves, the cut-off function of the classifier is trained in this algorithm. Traditional stacking is based on averaging indicators with a class cut-off at a constant coefficient of 0.5; in that case, the efficiency of the algorithm drops sharply to ~79%. The proposed cutting method increases the overall efficiency of the ensemble by several percent. The essence of the algorithm is to choose a cut-off coefficient. In this work, the voting input contains a vector of independent classifier scores, which will vote differently depending on the context. The idea of the method is to calculate the average score for each vote and add it to the list of average scores. The list of average scores is a set of independent scores to which the mathematical expectation function is applied. At the output, we obtain a cut-off coefficient close to the optimal class separation coefficient.
We used the nested fivefold cross-validation technique to perform additional tests, as described in 28. Nested cross-validation was used to validate the findings obtained using the proposed approach in addition to the usual fivefold cross-validation. Though this approach has its limitations, e.g., the assumption of the data split independence, it is widely used across the ML community. The difference between the Accuracy values across the five folds was 0.018. Next, we performed a more robust statistical test, viz., the Kolmogorov-Smirnov normality test. The obtained p-value was 0.793.
Table 4 shows the efficiency of the proposed ensembles for the whole dataset and for selected features. Feature selection allows for increasing all the analyzed metrics.

Regression task
For the regression task, the following regression models were used: linear model, polynomial regression, regression tree with CART algorithm, Gradient boosted tree, random forest, l1 regularization for the linear model, and l2 regularization for the linear model. These models aimed to predict the number of confirmed COVID-19 cases and deaths. Table 5 summarizes the most important features affecting the prediction of the COVID-19 spread.
As follows from the comparison of Tables 2 and 5, virus pressure, i.e., a measure for virus transmission from neighboring counties, defined as the weighted average of the number of confirmed cases in the adjacent counties, is the most important feature for classification and regression analysis. Besides, there is a subset of common features, which were recognized as the most important in these two studies, viz., (i) the total population of the county (the second most important common feature), (ii) distance to the nearest international airport with average daily passenger load more than ten, (iii) daily average temperature, (iv) the longitude of the county barycenter, (v) number of total COVID-19 tests performed each day in the state of the county, and (vi) population ratio in the state. As we can see, the COVID-19 spread is affected by various factors: epidemiological, like the virus pressure; demographic, like the total population and population density; social, like the distance to the nearest international airport; climate, like daily average temperature; geographical, like the longitude of the county barycenter; and medical, like the number of total COVID-19 tests performed each day. These findings can help epidemiologists to analyze the spread and lifecycle of the virus and decision makers to select the most important restriction factors and limitations to prevent the spread of the disease.
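The definition of virus pressure as a weighted average over adjacent counties can be written out in a short sketch. The weighting scheme is not given in this section, so the weights below are purely hypothetical, as are the county identifiers and the function name.

```python
def virus_pressure(neighbor_cases, weights):
    """Weighted average of confirmed cases in the adjacent counties.

    neighbor_cases maps a neighboring county to its confirmed cases;
    weights maps the same counties to non-negative weights (assumed here).
    """
    total_weight = sum(weights[county] for county in neighbor_cases)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[county] * cases
                       for county, cases in neighbor_cases.items())
    return weighted_sum / total_weight


# Two hypothetical neighbors; the weights are illustrative only.
cases = {"county_A": 100, "county_B": 50}
weights = {"county_A": 3.0, "county_B": 1.0}
pressure = virus_pressure(cases, weights)  # (3*100 + 1*50) / 4
```

With equal weights the measure reduces to the plain mean of the neighbors' case counts.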
Other factors affecting the number of COVID-19 cases and deaths, as seen in Table 5, are mainly social features, like social distancing, percentage of health-insured residents, median household income, and percent change in mobility trends in retail shops and recreation centers. The analysis of Table 2 reveals that, for the classification task, there are some additional factors affecting the chance of getting infected with coronavirus, viz., percentage of residents in the age group 25-29, immigrant student ratio, intensive care unit bed ratio, and the percent change in human encounters compared to the pre-COVID-19 period.
Table 6 lists the regression task performance evaluation for the six most common regression models and the proposed ensemble.
The proposed hybrid hierarchical ensemble combining both supervised and unsupervised learning allows us to increase the accuracy of the regression task by 11% in terms of MSE, 29% in terms of the area under the ROC, and 43% in terms of the MPP metric. Indeed, the ROC-AUC value increased from 0.609 for the best traditional regression model (Gradient Boosted Tree) up to 0.790 in the case of the proposed Ensemble; MSE decreased from 112.6 down to 101.3, and MPP from 18.8 to 13.1, respectively. Thus, using the proposed approach, it is possible to predict the number of COVID-19 cases and deaths based on demographic, geographic, climatic, traffic, public health, social-distancing-policy adherence, and political characteristics with sufficiently high accuracy.
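The quoted percentages can be checked against the quoted metric values with a few lines of arithmetic. The choice of denominator is not stated in the text; the sketch below uses the denominators that appear to reproduce the reported figures, which is an inference and not a statement of the authors' procedure.

```python
def relative_change(old, new, base):
    """Absolute change between old and new, as a percentage of `base`."""
    return abs(new - old) / base * 100.0


# Values quoted in the text; the denominators are assumptions that appear
# to match the reported 11%, 29%, and 43%.
mse_gain = relative_change(112.6, 101.3, base=101.3)  # relative to the ensemble MSE
auc_gain = relative_change(0.609, 0.790, base=0.609)  # relative to the baseline AUC
mpp_gain = relative_change(18.8, 13.1, base=13.1)     # relative to the ensemble MPP

print(f"MSE: {mse_gain:.1f}%  ROC-AUC: {auc_gain:.1f}%  MPP: {mpp_gain:.1f}%")
```

Under this reading, the MSE and MPP percentages are taken relative to the ensemble values and the ROC-AUC percentage relative to the baseline value.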
Besides, we used a nested fivefold cross-validation technique 28 to perform grid-search hyperparameter optimization. The tuning parameter α was set to a constant value of 1. RMSE was used to select the optimal model. The developed way of cutting off the classifier or regressor, which is part of the ensemble, increases the overall efficiency of the ensemble by several percent. A vector of models with different contextual characteristics can provide reasonable generalized estimates.
Table 7 shows the efficiency of the proposed ensembles for the whole dataset and for selected features. Feature selection allows for increasing all the analyzed metrics.

Conclusions
This paper introduces two hybrid hierarchical machine-learning ensembles, which combine supervised and unsupervised learning algorithms for classification and regression predictions of the COVID-19 spread. The developed ensembles are based on a combination of supervised learning algorithms and unsupervised algorithms with a new method of selecting the cut-off coefficient based on the mathematical expectation of the weak classifier predictors. The study utilizes the publicly available dataset of COVID-19 outbreak and potential predictive features in the USA, which provides daily information related to the outbreak of COVID-19 disease in the US, including data from each of 3142 US counties from the beginning of the epidemic (January 22, 2020) until June 10, 2021.
The developed hybrid hierarchical classifiers outperform single classification algorithms by more than 10% and are well-suited for COVID-19 spread prediction in real life. In the case of Ensemble 1, the achieved Accuracy was 0.895, ROC-AUC 0.897, and F1-score 0.897. The proposed cut-off voting improvement in Ensemble 2 further increased all the metrics compared to Ensemble 1 (Accuracy, ROC-AUC, and F1-score values are 0.912, 0.916, and 0.916, respectively).
Central to our innovation is using mathematical expectation to guide the selection of the cut-off coefficient in Ensemble 2. This dynamic voting mechanism considers the individual scores of weak classifiers within the ensemble, allowing context-aware decision-making. Rather than relying on a static threshold, our approach computes the average score for each vote, which is then subjected to mathematical expectation to derive an optimal cut-off coefficient. This adaptive strategy ensures that the classification ensembles are finely tuned to the specific characteristics of the input data, resulting in improved performance across a range of classification tasks.
The proposed hybrid hierarchical ensemble combining both supervised and unsupervised learning allows us to increase the accuracy of the regression task by 11% in terms of MSE, 29% in terms of the area under the ROC, and 43% in terms of the MPP metric. The ROC-AUC value increased from 0.609 to 0.790; MSE decreased from 112.6 to 101.3, and MPP from 18.8 to 13.1, respectively. Thus, using the proposed approach, it is possible to predict the number of COVID-19 cases and deaths based on demographic, geographic, climatic, traffic, public health, social-distancing-policy adherence, and political characteristics with sufficiently high accuracy.
The model described in 26 was able to predict the number of daily infected cases up to 35 days in the future, with an average mean absolute percentage error of 20.15% with further improvement to 14.88% if combined with human mobility data.In our study, we used the MSE metric instead, so the results cannot be compared directly.MAE value obtained during nested cross-validation is 9.51.The obtained AUC value for this research is 0.916 for the classification task and 0.795 for the case of regression analysis.A similar AUC value (0.80) was also reported by Zahra Gholamalian et al. to predict the statuses over time, viz. for the classification, in 25 .
Wang et al. 26 determined the policies of restrictions on gatherings, testing and school closing as the most influential predictor variables. In this paper, the most influential predictor variables are virus pressure, social distancing total grade, total population, area, and retail and recreation mobility percent change. Virus pressure was also reported as the key indicator for the number of COVID-19 cases in each county 24.
The study shows that the most important feature in COVID-19 spread is virus pressure for classification and regression analysis. Besides, there is a subset of common features which were recognized as the most important in these two studies:
• the total population of the county (the second most important common feature),
• distance to the nearest international airport,
• daily average temperature,
• the longitude of the county barycenter,
• number of total COVID-19 tests performed each day in the state of the county,
• population ratio in the state.
These findings can help practitioners analyze the spread and lifecycle of the virus, and decision-makers select the most critical restriction factors and limitations to prevent the spread of the disease. COVID-19 model predictions play a crucial role in shaping public health practices and informing policy decisions, offering insights into the potential trajectory of the pandemic and the effectiveness of various interventions. Models can help predict the demand for healthcare resources such as hospital beds, ventilators, and medical staff in different scenarios. This information allows policymakers to allocate resources efficiently, ensuring that healthcare systems are adequately prepared to handle surges in cases. The model and the findings of the paper allow for the integration of both medical and non-medical interventions into the decision-making policy to prevent the virus spread. Thus, for example, social distancing and retail and recreation mobility percent change (as can be seen from Table 5) are the most important factors affecting the total number of new cases and the mortality ratio, while additional non-medical factors like temperature, immigrant student ratio, airport distance or housing density are among the most important features derived from the classification model (see Table 2). Hence, while developing the virus prevention (restriction) policy, policymakers can consider such factors as the current and forecasted temperature, airport distance, and housing density in the specific region when deciding on social-distancing restrictions or on closing or limiting retail and recreation venues.
Our work represents a significant advancement in classification ensemble methodologies. It offers a novel approach to cut-off determination that improves classification accuracy and adaptability in real-world applications. Future research will be related to using the developed ensembles for multimodal data analysis. Another possible approach is to use the imputed dataset, available in 21. The authors used the KNN imputer to impute the missing values of a feature based on the other non-missing values of that feature for that county, with a few exceptions. However, in our opinion, this procedure makes the dataset artificial rather than real-world data. For this reason, we do not examine the imputed dataset in this research. The comparison of the findings of this paper with the results of the machine-learning models applied to the imputed dataset will be carried out in future studies. Besides, other weak predictors could be used for ensembles, as well as calibrated predictions of individual base models, to ensure that their confidence estimates are well-calibrated and consistent across the ensemble. We plan to explore techniques for fusing the predictions of different models or datasets at various stages of the prediction process, such as feature fusion, decision fusion, or late fusion.

Table 1 .
Description of the features 21.

Just as the human brain combines input signals of different origins, e.g., auditory and visual, in the temporal lobe, our ensemble combines inputs from different feature clusters in a hybrid classifier. The working hypothesis is that it is insufficient to select important features; they should also be combined into clusters of similar impact on the COVID-19 spread. Next, these clustered features provide an aggregated input to an ensemble classifier to increase the prediction accuracy and resilience.

Table 2 .
Important classification features after voting Boruta, Random Forest and Decision Tree feature selectors.

Table 3 .
Classification performance of the weak and ensemble classifiers. Significant values are given in bold.

Table 4 .
Classification performance of the whole dataset and selected features.

Table 5 .
Important features for the regression task after voting Boruta, Random Forest and Decision Tree feature selectors.

Table 6 .
Regression task performance metrics for weak and ensemble classifiers. Significant values are given in bold.

Table 7 .
Regression performance of the whole dataset and selected features.