Abstract
This study aims to develop an assumptionfree datadriven model to accurately forecast COVID19 spread. Towards this end, we firstly employed Bayesian optimization to tune the Gaussian process regression (GPR) hyperparameters to develop an efficient GPRbased model for forecasting the recovered and confirmed COVID19 cases in two highly impacted countries, India and Brazil. However, machine learning models do not consider the time dependency in the COVID19 data series. Here, dynamic information has been taken into account to alleviate this limitation by introducing lagged measurements in constructing the investigated machine learning models. Additionally, we assessed the contribution of the incorporated features to the COVID19 prediction using the Random Forest algorithm. Results reveal that significant improvement can be obtained using the proposed dynamic machine learning models. In addition, the results highlighted the superior performance of the dynamic GPR compared to the other models (i.e., Support vector regression, Boosted trees, Bagged trees, Decision tree, Random Forest, and XGBoost) by achieving an averaged mean absolute percentage error of around 0.1%. Finally, we provided the confidence level of the predicted results based on the dynamic GPR model and showed that the predictions are within the 95% confidence interval. This study presents a promising shallow and simple approach for predicting COVID19 spread.
Introduction
In December 2019, the world was waiting to welcome 2020; Wuhan hospital note unusual Severe Acute Respiratory by a new virus, and it was spread swiftly. They identify it later SARSCoV2 because of its similarity to the previous SARS CoV in 2002^{1}. Sooner World Health Organization (WHO) calls this virus a novel coronavirus (nCOV19) known as COVID19. This virus can stay in the person for around 14 days without showing any symptoms that lead to transforms from the local epidemic of Wuhan to the global pandemic of the whole world. Because the early forecasting of the number of COVID19 cases will help to control the incubation and nonspreading of the virus, the researchers and governments depend on machine learning (ML) which is part of artificial intelligence (AI) that can learn from the previous data to decide a solution in the realworld problem. In the COVID19 pandemic problem, ML can predict the outbreak of COVID19 for evaluating the riskiness of the virus and therefore raising the level of the procedures applied. The fact, the spread of the virus has receded in many countries when they use ML to detect COVID19^{2}.
In recent years, the effectiveness and benefit of the application of Artificial Intelligence (AI) have been proved in numerous sectors, such as healthcare, where it showed good performance as a decision support system to help identify diseases and make medical diagnoses^{3,4,5,6}. During this pandemic, AI showed to be useful in predicting outbreaks and aid assemble quickly evolving data to support general health specialists in complex decisionmaking^{7}. In addition, various AIbased tools were designed in the healthcare field^{3,6,8}. For instance, a team at Boston Children’s Hospital developed an automated electronic information system called Health Map^{9}. Notably, the Health Map employs realtime surveillance of emerging public health threats and unofficial online sources for observing disease outbreaks. Another example of an AIbased company specializing in infectious disease epidemiology is Blue Dot, which has flagged an alarm to its clients regarding the COVID19 outbreak on December 31^{3}. In addition, this company offered suitable predictions achievement for Zika virus in Brazil^{10}. Also, we can find Google Flu, which employed search engine queries for enhancing the flu epidemic track. In^{11}, authors introduced an intelligent framework for the COVID19 telemedicine diagnostic via extended reality technology and deep learning networks. Specifically, an innovative Auxiliary Classifier Generative Adversarial Networks (ACGAN) is designed for COVID19 prediction. This intelligentbased strategy can be viewed as a promising tool for supporting COVID19 therapy and remote surgical plan cues. More improvement can be obtained by enhancing hardware design and deep learning models used in this Internet of Medical Things (IoMT) system. The authors in^{12} introduced a deep learningdriven approach for semisupervised fewshot segmentation (FSS) of COVID19 infection via radiographic images. The challenge addressed in this study is designing an effective and accurate segmentation of 2019nCov infection based on smallsized annotated lung computed tomography (CT) scans. Essentially, the model was built semisupervised using unlabeled CT slices and labeling one during training. Results based on publicly available COVID19 CT scans revealed the superior performance of the FSS2019nCov compared to conventional models. However, the segmentation performance has not been tested on a large dataset. In^{13}, a combined CNNLSTM deep learning approach is introduced to detect COVID19 cases based on Xray images. More specifically, CNN has been employed as a feature extractor, and the LSTM is applied to CNN’s features to discriminate healthy people from the contaminated patients with COVID19. They concluded that this approach outperformed the competitive CNN architectures by reaching an accuracy of 99.4%. However, the performance of this approach has been tested only on a relatively smallsized dataset; the generalizability of this approach needs to be verified using a large dataset. In addition, this approach cannot efficiently discriminate COVID19 images containing different disease symptoms. Recently, the study in^{14} suggested an unsupervised detector combing a Variational Autoencoder(VAE) model with oneclass SVM (1SVM) to detect COVID19 infection using blood tests and reported accuracy of around 99%. Here, the VAE is used as a features extractor, and the 1SVM discriminates healthy patients from contaminated ones. Results showed the superior detection accuracy of this approach compared to Generative adversarial networks (GAN), Deep Belief Network (DBN), and restricted Boltzmann machine (RBM)based 1SVM methods. However, this detector is verified using routine blood tests samples from two hospitals in Brazil and Italy; a large dataset is needed to verify the generalization of this approach. In^{15}, an intelligent framework based on deep learning and cloud computing is presented to identify potential violations of COVID norms in workplaces. To this end, this approach employs Closed Circuit Television (CCTV) cameras and webcams installed in workplaces. This approach can detect two types of violations: maskwearing and physical distancing between employees. Results based on a video of almost eight hours demonstrated that this framework achieved 98% accuracy. However, this approach can be improved by including other COVID norms and tracking the location of employees’ movement after office hours. Essentially, AI presents relevant support to predict pandemics and take early measures to mitigate the negative consequences. Much research has been done recently on developing datadriven techniques to combat the COVID19 pandemic. For example, see some relevant review articles on detection and forecasting of COVID19^{16,17,18,19}.
Efforts devoted to mitigating the effects of COVID19 transmission have been conducted since its appearance in December 2019. Recently, there have been many studies conducted to understand and manage the COVID19 pandemic by developing several techniques for different applications, such as wearing mask detection^{20}, COVID19 spread forecasting^{21}, and chest Xrays diagnosis^{22}. Wearable technologies have been recently demonstrated promising solutions to aid in mitigating infectious diseases, such as COVID19. In^{23}, the authors presented an overview of different wearable monitoring devices and respiratory support systems that are used for assisting patients infected with COVID19. Even with the promising potential of wearable technologies for slow downing the spread of COVID19, however, their utilization is still limited due to several restrictions, such as data privacy and cyberattacks. AIenabled systems have also been designed to detect people that are not wearing any facial masks to mitigate the propagation of COVID19 spread. For instance, in^{24}, a visionbased deep learning approach has been proposed for facial mask detection in a smart city network. Results revealed that this approach achieved 98.7% accuracy in discriminating people with facial masks from people without masks. However, to guarantee sufficient monitoring, a large number of cameras is needed to cover the whole monitored city, which is not easy to get and also has an economic burden. Accurate forecasting of COVID19 cases is essential to help mitigate and slowdown COVID19 transmission^{25,26,27}. In^{4}, the authors present a comparative study between eight machine learning models to forecast COVID19, such as logistic regression, Restricted Boltzmann Machine, convolutional neural networks, and support vector regression(SVR). They used timeseries data for confirmed and recovered COVID19 cases from seven countries, including Brazil, India, and Saudi Arabia, recorded from January 22, 2020, to September 06, 2020. It has been shown that machine learning models can track the future COVID19 trend even in the presence of a relatively smallsized dataset. The convolutional neural networksLong shortterm memory LSTMCNN showed high performances with an averaged mean absolute percentage error (MAPE) of around 3.718%, because of its ability to learn higherlevel features. The study in^{28} used machine learning to predict weekly cumulative COVID19 cases recorded in the USA and Germany. The first 18 weeks are employed for models construction, and 17 weeks after 18/09/2020 are used for testing. Results showed that the SVR delivers the best prediction accuracy compared to Random Forest (RF), Linear Regression (LR), and Multilayer perceptron (MLP) in terms of Root Mean Square Error (RMSE) and MAPE metrics. Specifically, the SVR model achieved an averaged MAPE of 0.1162%. However, in this study, only weekly predictions are considered, and daily COVID19 cases predictions, which are important to shortterm decision making, are ignored. In^{29}, authors focused on forecasting the future number of COVID19 in the next 60 days for the confirmed, recovered, and death cases in the 16 high impacted countries. To this end, they considered a Seasonal AutoRegressive Integrated Moving Average (SARIMA) and AutoRegressive Integrated Moving Average (ARIMA). The result reveals that the SARIMA model is more realistic than the ARIMA model in this study. The study in^{30} employed an autoregression model utilizing Poisson distribution called Poisson Autoregression(PAR) to predict the confirmed and recovered cases of COVID19 in Jakarta. Results showed that this approach provides acceptable forecasting accuracy with an MPAE value lesser than 20%. This approach showed better performance compared to conventional methods, including ARIMA, Exponential Smoothing, BATS, and Prophet. However, the Poisson Autoregression approach’s prediction quality still requires more improvement to reach a satisfactory prediction performance. Similarly, in^{31} the ARIMA model has been employed for daily prediction of COVID19 spread in Italy, Spain, and France based on data collected between 21/02/202015/04/2020. This approach showed satisfactory prediction performance by achieving an average MAPE of 5.59%. Although ARIMA models can provide suited prediction accuracy of data with regular trends, they are limited in extracting only linear relationships within the time series data. The study in^{32} investigated linear regression and Polynomial regression for forecasting the spread of COVID19 cases in India using data from March 12 to October 31, 2020. Forecasting results showed that the Polynomial model with 2 degrees outperformed the linear regression model by achieving an averaged MAPE of 13.3%. However, the forecasting accuracy can be improved by using a large dataset and finetuning the model parameters. The work in^{33} considered SVM and Multilayer Perceptron (MLP) methods to predict confirmed COVID19 cases using data recorded in Brazil, Chile, Colombia, Mexico, Peru, and the US from 1/22/2020 to 5/25/2020. This study reported that the MLP model outperformed the SVM by providing an averaged MAPE of 17%. The hyperparameters were optimized via a tabu list algorithm. Another study^{34} presented a comparison of four methods, ARIMA, ANN, LSTM, and CNN, to predict the COVID19 spread based on data available from March 12 to October 31, 2020. The CNN model outperformed the other models by achieving an averaged MAPE of 3.13%. However, the models were trained using smallsized data, making it difficult to get accurate models for forecasting the future trends of COVID19 spread. In^{35}, the paper focuses on predicting future COVID19 confirmed and death cases in nine high affected countries from January 22, 2020, till December 13, 2020. Four factors are used as input variables, including vaccination, weather conditions, malarial treatments, and average age, to predict COVID19 spread. This study reported that the Multilayer perceptron (MLP) model provided satisfactory forecasting accuracy. Authors in^{36} proposed a cloudbased shortterm forecasting model to predict the number of COVID19 confirmed cases for the next seven days. Results indicate the importance of the cloudbased shortterm forecasting model in decisionmaking to prepare the needed medical resources. In^{37}, a modified version of LSTM (MLSTM) has been introduced to forecast the COVID19 outbreak in nine countries from three continents. Specifically, the authors used data from January 22 till July 30, 2020, for the train set and last month, August, for the test. It has been shown that the MLSTM is the winner model among other investigated models. In^{38}, LSTM and gated recurrent unit (GRU) deep learning models have been applied to forecast COVID19 confirmed cases and deaths in Saudi Arabia, Egypt, and Kuwait from 1/5/2020 to 6/12/2020. In this study, LSTM with a single layer exhibited the best forecasting of confirmed cases with an average MAPE of 0.6203%. The authors in^{39} implemented an approach for forecasting COVID19 by combining Graph Neural Networks (GNNs) within the gates of an LSTM to enable exploiting temporal and spatial information in data. Results based on data of 37 European nations show better performance compared to stateoftheart methods by reaching a mean absolute scaled error (MASE) value around 0.27. However, this approach can be improved by considering other pertinent factors like poverty rates, hospital capacity, and age demographics. Further, authors in^{40} considered four machine learning models (i.e., Linear Regression (LR), Least Absolute Shrinkage, and Selection Operator (LASSO), Random Forest (RF), and Ridge Regression (RR)) to forecast future COVID19 cases. The result shows that the RF outperformed the other models. Two machine learning models, namely Neural Network Time Series (NARNNTS) and Nonlinear Autoregressive (NAR), were evaluated by^{41} to forecast COVID19 cases. Results indicate the outperformance of the NARNNTS model compared to the NAR model. In^{42}, four regression models, ARIMA, MLP, LSTM, and feedforward neural network (FNN), are considered to predict COVID19 spread. It has been shown that the LSTM model reached the best forecast accuracy in this study. In^{43}, the aim is to predict confirmed and deaths cases recorded in Iran and Australia by considering one, three, and seven pastday ahead in the next 100 days. This study applied six models: LSTM, GRU, and Convolutional LSTM with their bidirectional extension. The results showed that the bidirectional models achieve better performance than nonbidirectional most of the time. This could be attributed to forward and backward data processing in bidirectional models, which allow better learning temporaldependencies in COVID19 data. In^{44}, six models, including SusceptibleInfectedRecovered, Linear Regression, Polynomial Regression, and SVR and LSTM, are compared in forecasting COVID19 cases in Saudi Arabia and Bahrain. Results reveal that SVR provides the best forecasting when using confirmed COVID19 cases data from Saudi Arabia, and LR outperforms the other models when using Bahrain confirmed cases data.
Accurate forecasting of COVID19 spread is a key factor in mitigating this pandemic’s transmission by providing relevant information to help hospital managers in decisionmaking and appropriately managing the available resources and staff. In the presence of smallsized COVID19 data, our objective is to present shallow and efficient machine learning methods to forecast future trends of COVID19 spread. The most common machine learning approaches for COVID19 time series forecasting rely only on the actual data point in the forecasting process and ignores the information from past data. Thus, The overarching goal of this study is to take into account information from the actual and past data in developing efficient machine learning models to accurately forecast COVID19 spread. Specifically, this study investigates the forecasting ability of the optimized GPR, a kernelbased machine learning method, in forecasting the COVID19 time series. This choice is motivated by the desirables features of the GPR model, including its simple and flexible construction using the mean and covariance functions, its ability and superior nonlinear approximation, and the possibility to explicitly provide a probabilistic representation of forecasting outputs^{8,45}. The contributions of this paper are summarized in the following key points.

Firstly, we employed Bayesian optimization (BO) to tune the Gaussian process regression (GPR) hyperparameters to develop an efficient GPRbased model for forecasting the recovered and confirmed COVID19 cases in two highly impacted countries, India and Brazil. We compared the performance of the Optimized GPR with 16 models, including Support vector regression with different kernels, GPR with different kernels, Boosted trees, Bagged trees, Random Forest, and eXtreme Gradient Boosting (XGBoost). The daily records of confirmed and recovered cases from Brazil and India are adopted in this study. The kfold crossvalidation technique has been considered in constructing these models based on the training data. Three statistical criteria are used for the comparison. The results showed that the optimized GPR model exhibited a superior prediction capability over the other models.

However, machine learning models do not consider the time dependency in the COVID19 data series. The time dependency in COVID19 data can be captured by incorporating lagged data in designing the considered ML models. Meanwhile, considering information from past data is expected to improve the ML models’ capabilities to effectively follow the trend of future COVID19 data. Here, we evaluated the potential of incorporating dynamic information to further enhance the forecasting performance of the investigated ML models. The results clearly reveal that the lagged data contribute significantly to improved prediction quality of the ML models and highlight the superior performance of the dynamic OGPR.

Additionally, after showing the necessity of including information from past data to enhance the investigated machine learning models, we assessed the importance or contribution of the included features to the COVID19 prediction quality. Importantly, we applied the RF algorithm to identify variable contribution or importance for predictive ML models. Generally speaking, this step is essential to design parsimonious models by ignoring unimportant features.

Finally, we provided the confidence level of the predicted results based on the dynamic OGPR model and showed that the predictions are within the 95% confidence interval.
Of course, we conclude that the dynamic OGPR model is an efficient forecasting approach and can predict confirmed and recovered COVID19 times series data with high accuracy.
The remaining of this study is structured as follows. The second Section presents the used COVID19 datasets, provides a brief description of the GPR model and the BO algorithm. The results and discussions were given in the third section to show model performances and comparisons. The conclusions are outlined in the fourth Section.
Methodology
The overarching goal of this study is to provide accurate forecasting of the recovered and confirmed COVID19 cases in two highly impacted countries, India and Brazil. In total, eighteen machine learning models have been investigated and compared against each other for COVID19 timeseries forecasting. The general framework adopted in this study is depicted in Fig. 1. At first, We feed the model with training data to find the parameters that minimize the loss functions in training. Specifically, we used the Bayesian Optimization algorithm, a powerful tool for the joint optimization of design choices, to hyperparameters tuning. After that, the constructed models are used to forecast the future trend of COVID19 spread. The model’s accuracy will be checked by comparing measured data to forecasted data via the score indicators.
Data description
Here, daily confirmed and recovered COVID19 data from two highly impacted countries, India and Brazil, are utilized to evaluate the forecasting capacity of the 14 investigated databased methods. The daily record of cumulative confirmed and recovered cases of COVID19 from the first case, in India and Brazil on the 30th of January and 26th of February 2020, are available in (https://github.com/CSSEGISandData/COVID19). The dataset automated update for delayed data in the website without any missing value. Figure 2a–b displays the confirmed and recovered COVID19 cases dataset used in this study. We observe that India has the highest number of confirmed cases. Considering the population in each country, India is receiving the most considerable impact from COVID19. On the other hand, India shows rapid growth in recovered cases, indicating their prompt and effective response to this public health event.
Table 1 list the descriptive statistics of the used COVID19 timeseries dataset. We can conclude from Table 1 that these datasets are nonGaussian distributed.
Figure 3 illustrates boxplots of the confirmed and recovered COVID19 cases recorded in India and Brazil. We observe that the distributions of the recorded confirmed and recovered cases are heavily rightskewed.
For the COVID19 time series data in Fig. 2a–b, the the autocorrelation function (ACF) are shown in Fig. 4. The ACF measures the similarity between \(y_{t}\) and \(y_{t+k}\), where \(k=0, \dots , l\) and \(y_{t}\) is the investigated COVID19 times series data^{46}. In other words, ACF quantifies the selfsimilarity of the univariate timeseries data over different delay times. Mathematically, the ACF of a signal \(y_t\) is defined as^{46},
The ACF of the data in Fig. 2a–b provides some relevant information about the timedependence and process structure in COVID19 data points. From Fig. 4, we first observe that there is shortterm autocorrelation in these COVID19 datasets. Also, we observe the similarity between the ACFs of confirmed and recovered timeseries in each country. The third observation is that the fluctuation of India’s data recorded is relatively different from Brazil’s data. This could be attributed to the high spread of COVID19 in INDIA compared to India. Regarding the population in every country, India is getting the most significant impact from COVID19.
GPR model
The GPR, a supervised nonparametric (Bayesian) machine learning method, can flexibly model complex nonlinear relationships between input and output variables^{47}. GPR is an effective kerneldriven approach to learn implicit correlations among various variables in the training set, making GPR especially suitable for challenging nonlinear prediction^{48}. Importantly, GPR, a probabilisticbased nonlinear regression approach, owns desirables characteristics, including the capability for handling large dimensionality, smallsized data, and complex regression problems^{47}.
For a prediction problem, the output \({\mathbf {y}}\) of a function \({\mathbf {f}}\) at the input \({\mathbf {x}}\) in GPR is expressed as,
where \(\mathbf {\varepsilon }\sim {\mathcal {N}}(0, \sigma ^{2}_{\varepsilon })\). In GPR, the term, f(x), is assumed to be a random variable that is distributed according to a particular distribution. Indeed, observing the output of the function at various input points could reduce the uncertainty regarding f. The observations are always tainted with a noise term \(\varepsilon\) that reflects their inherent randomness.
Assume \({\mathcal {D}} = \lbrace (x_{i}, y_{i})\rbrace _{i=1}^{n}\) is the inputoutput measurements and \(f(\cdot )\) to be approximated and assumed following a Gaussian process. For the sake of simplicity, let assume that \(x_{i}\)’s and \(y_{i}\) ’s are scalar observations while \(\varepsilon _{i}\) ’s are independent and identically distributed random noises following the normal distribution with mean value \({\bar{\varepsilon }}_{i}=0\) and variance \(\sigma ^{2}\).
Let’s consider the measured \(y_{i}\) values \([y_{1}, y_{2}, {\dots }\, y_{n}]^{\top }\) are finite values of the function \(f(\cdot )\) contaminated with noises. Thus, \(y_{i}\) ’s follow a joint Gaussian distribution:
where \({\mathbf {m}}({\mathbf {x}}) = [m(x_{1}), m(x_{2}), {\dots }\, m(x_{n})]^{\top }\) represents the mean vector \(m(\cdot )\), \({\mathbf {I}}\) refers to the identity matrix, and \({\mathbf {K}}\) denotes the \(n\times n\) covariance matrix with \((i,j){\rm{th}}\) element \({\mathbf {K}}_{ij} = k(x_{i}, x_{j})\). For a GPR model, \(k(x_{i}, x_{j})\) is usually termed a kernel function^{49}.
The optimized kernel parameters are achieved by maximization of the following likelihood.
where \(\mathbf {\theta } = [\theta _{1}, \theta _{2}, \, \dots ]\) refers to kernel parameters, the mean values m(.) are chosen to be zero, and
In this study, Bayesian optimization will be applied to determine the optimal GPR hyperparameters via the maximization of the marginal likelihood in (4) with respect to \(\theta\)^{50}.
Let \(x_{*}\) is a new input, then the predictive mean and variance associated with \({\hat{y}}_{*}=f(x_{*})=f_{*}\) are respectively expressed as follows:

the mean value
$$\begin{aligned} {\hat{y}}_{*} = {\mathbf {k}}_{*}^{\top }({\mathbf {K}} + \sigma ^{2} {\mathbf {I}})^{1}{\mathbf {y}} \end{aligned}$$(6) 
and variance
$$\begin{aligned} {\Sigma }_{*} = k_{**}  {\mathbf {k}}_{*}^{\top }({\mathbf {K}}+\sigma ^{2} {\mathbf {I}})^{1}{\mathbf {k}}_{*} \mathrm {.} \end{aligned}$$(7) 
and \({\mathbf {y}}_{*}\) follows a conditional distribution:
$$\begin{aligned} y_{*}{\mathbf {y}} \sim {\mathcal {N}}({\hat{y}}_{*}, {\Sigma }_{*}) \end{aligned}$$(8)
where \({\mathbf {K}}={\mathbf {k}}({\mathbf {x}}, {\mathbf {x}})\) refers to the covariance matrix of training data; \(\mathbf {K_{**}}={\mathbf {k}}({\mathbf {x}}_{*}, {\mathbf {x}}_{*})\) represents the covariance of testing data, and \({\mathbf {K}}_{*}={\mathbf {k}}({\mathbf {x}}, {\mathbf {x}}_{*})\) represents the covariance matrix obtained using the training and test dataset.
The GPR predicted output value for a given test input \({\mathbf {x}}\) is \(\mathbf {{\overline{f}}}^{*}\). In addition to the predicted output, GPR can provide a confidence interval (CI) to assess the reliability of the prediction, which can be computed using the variance \(cov(\mathbf {{\overline{f}}}^{*})\). For example, the 95% CI is computed as^{51},
For more details about GPR model, see^{52,53}.
Bayesian optimization of model parameters
Various machine learning methods, including GPR and ensemble models, include many hyperparameters to be chosen (e.g., kernel types in GPR and parameters). Essentially, the selected values of hyperparameters highly impact the performance of machine learning models^{54}. Accordingly, several optimization methods to search for the best hyperparameter, including grid search, random search, and Bayesian Optimization (BO), are reported in the litterature^{55}. The Grid search essentially made a grid of the search space and then evaluated each hyperparameter setting at the points we introduced for as many dimensions as necessary^{56}. On the other hand, Random search uses a random combination of a range of values and compares the result in each iteration, but this method will not guarantee to get the best hyperparameter combination^{56}. This study employed the BO procedure, which is frequently applied in machine learning to find the optimal values of hyperparameters. This study applied the Bayesian optimization algorithm to find the optimal hyperparameters of four investigated methods: SVR, GPR, Boosted trees, and Bagged trees. Notably, the BO algorithm is an efficient and effective global optimization approach that is designed based on Gaussian processes and Bayesian inference^{50}. Crucially, Bayesian Optimization can bring down the time spent to get to the optimal set of parameters by considering the past evaluations when choosing the hyperparameters set to evaluate next^{57}.It could be employed to optimize functions with unknown closedform^{58}. Although, unlike grid search, BO can find the optimal hyperparameters with fewer iterations.
The essence of the BO algorithm is to construct a probabilistic proxy model for the cost function based on outcomes of historical experiments as training data. Essentially, the proxy model, such as the Gaussian process, is more inexpensive to compute, and it gives sufficient information on where we should assess the true objective function to obtain relevant results. Let’s consider m hyperparameters \({\mathbf {P}}={{\mathbf {p}}_{1}, \dots , {\mathbf {p}}_{m}}\) to be tuned. The aim is to determine
where \({\mathbf {g}}\) is a cost function. The whole optimization procedure is controlled via a suitable acquisition function (AF) that defines the following set of hyperparameters to be assessed. Crucially, any acquisition function requires adjusting within exploration and exploitation. Generally speaking, exploration is an area search with high uncertainty, where we expect to discover a new set of parameters that enhance the model’s prediction accuracy. At the same time, exploitation refers to an area search nearby to already computed high estimated values^{59}.
In this study, the BO algorithm is employed to find the hyperparameters of the GPR model, the SVR, and ensemble learning models. The optimization procedure is performed during the training stage based on the training data, as shown in Fig. 5. At each iteration, the mean squared error (MSE) between the actual COVID19 data and the estimated GPR data using the values of the hyperparameters determined by BO. This procedure is repeated until the MSE converges to a small value, close to zero.
Alternative models for Comparison
In this study, we investigated the performance of the OGPR and compared its forecasting accuracy with the set of machine learningbased forecasting models listed in Table 1. In short, a total of seventeen forecasting methods are applied to predict COVID19 timeseries data: 6 SVR methods^{48,60}, 4 GPR methods, 2 ensemble learning techniques (i.e., BT, BS, RF and XGBoost)^{61,62,63} and six SVR models^{47}, and 3 optimized methods.
SVR models
Support Vector regression is another efficient assumptionfree approach that possesses good learning ability through kernel tricks. The essence of SVR is to map the train data to a higher dimension then linear regression is performed in this feature space. In short, SVR can efficiently deal with nonlinear regression via the socalled kernel trick by mapping the input features into highdimensional feature spaces^{64,65}. It is designed using the concept of structural risk minimization. Moreover, SVR models proved to be efficient in the presence of limited samples^{66}. Additionally, SVR has been broadly applied in various applications, including wind power forecasting^{67}, fault detection^{8}, and solar irradiance prediction^{48}. This study built six SVR models using different kernels and an optimized SVR using Bayesian optimization (Table 2).
Boosted tree model
Boosted is an ensemble machine learning model built based on the statistical learning theory. The essence of the boosted tree is to optimize the prediction quality of conventional regression methods by using an adaptive combination of weak prediction models^{68}. Moreover, it employs an aggregate model to obtain a smaller error than those obtained by individual models. Compared to other ensemble models, like bagging and averaging models, boosting is matchless because of the sequentiality^{63,68,69}.
Bagged tree model
The bagged tree (BT) is an ensemble machine learning model; also, it is called bootstrap aggregating. Essentially, BT merges the bagging procedure and decision trees to improve prediction efficiency^{61}. Specifically, The bagged model generates multiple samples via bootstrap sampling from the original dataset, builds multiple distinct decision trees, then aggregates their prediction outputs together^{70}. Accordingly, the prediction error of the decision trees will be reduced, and substantially the overfitting problem in a single tree is bypassed^{71,72}.
Random forest
RF is also within the ensemble learning family that uses several weak learners to build a more efficient joint model^{73}. In the RF model, decision trees are used as a base learner. The RF repeatedly builds regression trees based on the training data. In boosting, each new training set is sampled with replacement from the original training set by using the bootstrap technique. However, the strategy for node selection in RF is different by randomly selecting a subset from the current feature set and then selecting one optimized feature in the subfeature set. It has been widely exploited in different applications related to classification and regression problems.
XGBoost model
Extreme Gradient Boosting algorithm (XGBoost) is an efficient ensemble learning algorithm that can handle missing values and combine a set of weak predictors for building a more effective one^{74}. It can be used for classification and prediction problems. XGBoost can reduce the loss function by employ the gradient descent method to determine the objective function optimization. Especially, XGBoost will avoid the overfitting in the model by relying on a set of learners to build a robust model that also helps minimize the running time. XGBoost is flexible and efficient and is adopted in many winning data mining competitions^{75}.
Evaluation metrics
In this study, we assess the accuracy of the forecasting models using three metrics: root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
where \(y_{t}\) is the number of COVID cases, \({\hat{y}}_{t}\) is its corresponding forecasted COVID cases, and n is the number of records. Lower RMSE, MAE, and MAPE values would imply better precision and forecasting quality.
Forecasting framework
The general procedure performed in this study to forecast COVID19 cases is represented in Fig. 6. Firstly the daily recovered and confirmed timeseries data are split into training subsets. All models are trained using the training set and evaluated using the testing set. The best forecasting model is indicated by three statistical criteria, namely RMSE, MAE, and MAPE.
Results and discussion
Static prediction models
The COVID19 time series used in this study is free from missing values. We first split these data into training subdata and testing subdata. The training data used to construct each model includes confirmed and recovered cases from January 22, 2020, to June 5, 2021. We used seven days for the testing period from June 6, 2021, to June 12, 2021. Here, we mean by static prediction models, the models that predict COVID19 spread a given time point without considering information from past data.
Firstly, we need to transform the timeseries forecasting problem into a supervised learning problem to apply the investigated machine learning models. In other words, univariate COVID19 timeseries data will be preprocessed to get pairs of input and output data points. In supervised learning, the models first learn the mapping between the input and output variables based on training data, and then they can be used to predict the output from the input test data. We can structure the data to look like inputoutput data. This can be done by utilizing previous data points as input variables and use the next data point as the response variable (see Fig. 7). We can see from Fig. 7 that shifting the series forward one step allows us to use the previous observations to predict the value at the next time step.
The kfold crossvalidation technique has been considered in constructing these models based on the training data as recommended in^{76,77}. Specifically, we applied a 5fold crossvalidation technique in training the investigated models. This permits assessing the models’ robustness, exploiting the whole training dataset, and helps avoid overfitting. In the training stage, the considered models are constructed by finding the appropriate values of hyperparametrs that produce high prediction accuracy. In the BT model, we used 30 trees with a minimum leaf size of 8. Similarly, in BST, 30 trees are used as based learners with a minimum leaf size of 8 and a learning rate of 0.1. We used the SVR model with Kernel scale: 0.25, box constraint: 6.534, and Epsilon: 1.3156. Here, GPR models with four different kernels are considered. The values of Sigma and kernel scales of GPR\(_{SE}\), GPR\(_{RQ}\), GPR\(_{M25}\), and GPR\(_{Exp}\) are respectively (5102.98, 508349.66), (51029.88, 5083496.65), (51029.87, 5083496.65), and (51029.92, 5083496.65). For the RF model, 1000 trees are used in the forest, and ’max_features=1’ is chosen to consider only one feature to find the best split, and ’random_state=1’ is selected for controlling both the randomness of the bootstrapping of the samples used when building trees. For the XGBoost model, the values of the used hyperparameters are: ’num_feature=1’, ’max_depth= 10’, and ’booster=gblinear’.
Here, we applied the BO procedure for the OGPR, OSVR, and OEL models to get the optimal parameters maximizing the forecasting precision based on training data. The hyperparameter search ranges for each model, and the computed values of the hyperparameters of each model using the BO algorithm are summarized in Table 3. Specifically, the values of the hyperparameters are obtained by the MSE between the actual COVID19 data and the predicted data during the training stage.
In this study, seventeen machine learning models (Table 2) are used to predict COVID19 spread. We implemented these methods using Matlab R2021b. These models are first built based on training data and then used for forecasting confirmed and recovered COVID19 cases for a forteenday forecast horizon from May 30, 2021. We applied a 5fold crossvalidation technique in training the investigated models. Figures 8 and 9 display the recorded test set together with model forecasts of confirmed and recovered cases in India and Brazil, respectively. From Fig. 8, we observe that the forecasted values of the confirmed and recovered cases in India from the considered models are closer to the actual data, indicating good forecast performance. For the confirmed and recovered cases in Brazil, Fig. 9, shows broader bands around the actual cases, indicating wider variations among model predictions. In this scenario, models showed relatively better forecasts for India confirmed and recovered cases series.
Tables 4 and 5 quantifies the performances of each model in terms of RMSE, MAE, and MAPE, for COVID19 data recorded in India and Brazil, respectively. In terms of all metrics calculated, the GPR models showed the best performance in RMSE and MAE. It could be attributed to their capacity to capture dynamics in timeseries data.
Figure 10 displays the heatmap of the MAPE values achieved by the investigated model for the confirmed and recovered COVID19 data from Indian and Brazil. We observe that GPR models achieved the best forecasting performance with the lowest MAPE values. This could be attributed to the extended capacity of the GPR models to learn dynamics in COVID19 timeseries data. Furthermore, this study shows the capability of machine learning models to forecast the future trends of COVID19.
Figure 10 indicates that there is not a unique approach that is uniformly superior to others. For instance, the GPR\(_{M2.5}\) achieved the best results for confirmed cases in India and recovered cases in Brazil, and OGPR obtained the best accuracy for recovered cases in India and confirmed cases in Brazil. Thus, averaged MAPE values per model are provided for comparison to find the best model. Figure 11 depicts the averaged MAPE per model. The lowest average MAPE value characterizes the best model. The average MAPE of the OGPR mode is 0.185%. For SVR models, the best prediction is obtained by using the SVR\(_{Q}\) with an average MAPE of 2.376%. The average MAPE of XGBoost, RF, and OEL models are 3.347%, 3.524%, and 4.696%, respectively. Importantly, results in Fig. 11 highlight that a satisfactory forecasting COVID19 spread can be obtained using shallow and simple machine learning approaches. In addition, it is easy to see that the GPR with optimized parameters using Bayesian optimization exhibited superior performance.
Dynamic model
Note that the abovementioned results are based on static models that ignore information from the past data. In this section, we investigate the performance of the machine learning models when incorporating information from the past data in model construction. In other words, to capture the dynamic and evolving nature of the COVID19 time series, we introduce lagged data when building the prediction models. Here, we apply a dynamic fifteen models on India and Brazil dataset by considering the past days’ interval. As in the static model, we used the last fourteen days from May 30, 2021, to June 12, 2021, for the testing. Figure 12 shows how the past data can be incorporated into the input data; here is an example of adding the information from the past three days in the input data. In this case study, we evaluate the impact of introducing past information (i.e., one day, two days, three days, four days, five days, six days, and seven days) on the prediction performance of the investigated models.
Figures 13 and 14 show the MAPE performances values of each model when applied to forecast COVID19 data recorded in India and Brazil from 1 to 7 days, respectively. It can be seen that incorporating information from past data improves the forecasting performance compared to the static model, and the MAPE values decrease, which means that prediction performance has been improved. Prediction results in Figs. 13 and 14 confirm that incorporating information from past data improves forecasting quality compared to the static models. Figure 15 illustrates the averaged MAPE values per model and shows that GPR models exhibited the highest forecasting accuracy among all other models by reaching the lowest MAPE values. Also, we can see that GPR\(_{M52}\) and GPRO reached relatively similar performance and outperformed the other models. In short, this demonstrates the ability of GPR models to learn dynamics in COVID19 timeseries data.
As shown above, considering information from past data is constructing prediction models leads to improved prediction performance. Now, it is crucial to identify the most important features for prediction. Indeed, this will enables removing unnecessary features and constructing parsimonious models. Here, Random forest (RF) will be applied to evaluate the impact of each variable on the prediction of COVID19 spread. It used the Recursive feature elimination algorithm to identify the weights of the features and rank the features according to the importance weights^{78,79}. Figure 16 shows the variable importance score when applying RF to COVID19 India and Brazil dataset. Importantly, the seven past days (features) are relatively impacting the prediction at a similar level. Moreover, from Fig. 16, reduced dynamic models that incorporate information from the past six days will be able to sufficiently capture the future tend of COVID19 timeseries data.
Figure 17 displays the onestepahead prediction results of the dynamic OGPR model of the confirmed cases and recovered cases in India based on testing data. It can be seen that the OGPR predicted values are in agreement with the recorded COVID19 values. In addition, the predicted values are very close to the observed values, and both of them are inside the 95% confidence intervals. It should be noted that this information cannot be obtained using ensemble models and SVR models, which makes dynamic SVR models very helpful. Results are very promising and confirm that the predicted COVID19 cases closely follow the recorded COVID19 trends. Also, these results reveal the importance of optimizing the GPR model with BO and incorporating information from past data to achieve the best prediction performance.
In summary, in this study, both static and dynamic machine learning models have been investigated for predicting COVID19 spread in two highly impacted countries, India and Brazil. It was concluded that both dynamic and static model prediction models considered in this work could predict COVID19 spread with a satisfactory degree of accuracy. Importantly, results revealed that the dynamic models that incorporate past data information result in lower prediction errors than the static models. Specifically, the dynamic optimized GPR model outperformed all the other considered static and dynamic prediction models in predicting COVID19 spread in India and Brazil.
As discussed above, various methods have been developed to improve the forecasting of COVID19 spread using machine learning and timeseries models^{4,28,30,31,32,33,34}. Table 6 summarized different studies on COVID19 time series forecasting. Table 6 compares the achieved average MAPE of the proposed dynamic OGPR model with those of the stateoftheart methods. It should be noted that the average MAPE of the SOTA methods listed in Table 6 are computed from the provided MAPE values in the original papers. Note that as the used COVID19 training and testing data are not the same, it is not easy to compare the performance of the proposed approach with the SOTA methods. However, the summary in Table 6 is helpful to get a big picture about the forecasting performance of the existing methods when applied to smallsized datasets. Table 6 shows that timeseries models in^{30,32,34}, such as ARIMA, SES, HW, BATS, PAR, and polynomial, obtained lower accuracy in range 13.3%47.415%. On the other hand, an ARIMA model in^{31} showed moderately high prediction with a MAPE of 5.59%. These results could be due to different factors, including parameters setting and the size of data used for model training. The amount of data employed in these studies is relatively small. From Table 6, it is interesting to see in^{28} that a linear regression model reached high prediction performance (i.e., MAPE of 0.2228%) since the COVID19 outbreak is often considered as having exponential dynamics. The shallow machine learning methods employed in^{28} (i.e., RF, MLP, and SVM) showed good performance by obtaining an averaged MAPE in the range of 0.11621.0042. However, it can be seen that SVM and MLP obtained relatively low prediction performance with MAPE values 23.5 and 17, respectively. As datadriven approaches, the quality and amount of data are essential to construct a good predictive model. In addition, tuning the hyperparameters in training is crucial to obtain an efficient model that captures the most variability in training data and can predict future trends of COVID19 spread. Results in^{4,37,38} show that RNNbased models, including LSTM, GRU, and GANGRU, LSTMCNN, have sufficient ability in solving this limited size univariate time series forecasting problem with high efficiency and satisfying precision (i.e., MAPE values within 0.6203–5.254%). However, it is worth noting that deep learning methods, such as LSTM and GRU, are designed to capture longterm dependencies in time series data; they could provide enhanced prediction when implemented using a large amount of data. Overall, the proposed OGPR approach achieved high forecasting accuracy with an average MAPE of 0.1025%. Overall, the proposed OGPR approach achieved high forecasting accuracy with an average MAPE of 0.1025%. It could be attributed to different factors: i) the GPR as a distributionfree learning model can be applied to handle not necessarily normally distributed data, ii) it has good capacity to address difficult nonlinear regression problems via kernel trick, iii) it considers dynamic information by incorporating lagged data as input, and iv) provide better prediction when the hyperparameters are optimized using Bayesian optimization algorithm. Thus, it can be deduced that the proposed approach presents a promising system to forecast COVID19 spread.
Conclusion
Accurate forecasting of COVID19 spread is a key factor in slowing down this pandemic’s transmission by providing relevant information to help hospital managers make decisions and appropriately manage the available resources and staff. This work aimed to develop an effective datadriven approach to predict the number of COVID19 confirmed and recovered cases in India and Brazil, ranked as the second and third countries with the highest number of confirmed cases behind the United States. This paper introduces a dynamic GPR model with optimized hyperparameters via Bayesian optimization into COVID19 spread forecasting. Other promising prediction models, such as SVM, GPR, Boosted trees, Bagged trees, RF, and XGBoost, were also considered based on the same data. Here, the considered machine learning models are distributionfree learning methods that can be employed with no prior assumption on the data distribution. The SVR and GPR are within kernelbased prediction methods, while Boosted, Bagged trees, RF, and XGBoost are within ensemble learning methods. The SVR modeling is based on solving a nonlinear optimization problem; on the other hand, the GPR model uses Bayesian learning. This study investigates two types of prediction models, static and dynamic models, to improve COVID19 forecasting accuracy. The static model ignores the information from past data, whereas dynamic models consider information from lagged data in forecasting COVID19 spread. The results showed that the dynamic GPR models outperformed the other static and dynamic models in all cases. In short, the forecasting result shows that the optimizable GPR model is the winner model that achieved the best performance among the other models in terms of RMSE, MAE, and MAPE. In addition, the dynamic OGPRbased prediction models enable generating predictions with confidence intervals. This information is relevant and enables evaluating the reliability of the COVID19 spread predictions and for making better use of the forecasted data. The overall prediction accuracy of the suggested dynamic OGPR model has been satisfying.
Despite the satisfactory COVID19 spread forecasting results using the dynamic machine learning models, there is still plenty of room for improvement. At first, the suggested OGPR approach needs to be employed in more countries to confirm its superior performance. Moreover, accurate modeling of temporal and spatial dynamics of the COVID19 spread is necessary to understand its spread in spacetime for improved risk management. As the developed methods ignore the spatial spatiotemporal correlation in the COVID19 spread, we plan to develop a more flexible forecasting approach that considers spatiotemporal correlations and mobility information in constructing machine learning methods to improve the forecasting quality of COVID19 spread. Another direction of improvement is to incorporate external factors that affect the number of COVID19 cases, such as the number of administered vaccines, the country’s population, medical resource availability, and government policies.
References
Kırbaş, İ, Sözen, A., Tuncer, A. D. & Kazancıoğlu, F. Ş. Comparative analysis and forecasting of covid19 cases in various European countries with ARIMA, NARNN and LSTM approaches. Chaos Solitons Fractals 138, 110015 (2020).
Rustam, F. et al. Covid19 future forecasting using supervised machine learning models. IEEE access 8, 101489–101499 (2020).
Long, J. B. & Ehrenfeld, J. M. The role of augmented intelligence (AI) in detecting and preventing the spread of novel coronavirus (2020).
Dairi, A., Harrou, F., Zeroual, A., Hittawe, M. M. & Sun, Y. Comparative study of machine learning methods for covid19 transmission forecasting. J. Biomed. Inf. 118, 103791 (2021).
Harrou, F., Dairi, A., Kadri, F. & Sun, Y. Forecasting emergency department overcrowding: A deep learning framework. Chaos Solitons Fractals 139, 110247 (2020).
Wang, W., Lee, J., Harrou, F. & Sun, Y. Early detection of Parkinson’s disease using deep learning and machine learning. IEEE Access 8, 147635–147646 (2020).
Zeroual, A., Harrou, F., Dairi, A. & Sun, Y. Deep learning methods for forecasting covid19 timeseries data: A comparative study. Chaos Solitons Fractals 140, 110121 (2020).
Harrou, F., Saidi, A., Sun, Y. & Khadraoui, S. Monitoring of photovoltaic systems using improved kernelbased learning schemes. IEEE J. Photovolt. 11, 806–818 (2021).
HealthMap. Health Map.
Bogoch, I. I. et al. Anticipating the international spread of zika virus from brazil. Lancet 387, 335–336 (2016).
Tai, Y. et al. Trustworthy and intelligent covid19 diagnostic IOMT through XR and deep learningbased clinic data access. IEEE Internet Things J. (2021).
AbdelBasset, M., Chang, V., Hawash, H., Chakrabortty, R. K. & Ryan, M. FSS2019NCOV: A deep learning architecture for semisupervised fewshot segmentation of covid19 infection. KnowledgeBased Syst. 212, 106647 (2021).
Islam, M. Z., Islam, M. M. & Asraf, A. A combined deep CNNLSTM network for the detection of novel coronavirus (covid19) using xray images. Inf. Med. Unlock. 20, 100412 (2020).
Dairi, A., Harrou, F. & Sun, Y. Deep generative learningbased 1svm detectors for unsupervised covid19 infection detection using blood tests. IEEE Trans. Instrum. Meas. (2021).
Singh, A., Jindal, V., Sandhu, R. & Chang, V. A scalable framework for smart covid surveillance in the workplace using deep neural networks and cloud computing. Expert Syst. e12704 (2021).
Islam, M. M., Karray, F., Alhajj, R. & Zeng, J. A review on deep learning techniques for the diagnosis of novel coronavirus (covid19). IEEE Access 9, 30551–30572 (2021).
Asraf, A., Islam, M. Z., Haque, M. R. & Islam, M. M. Deep learning applications to combat novel coronavirus (covid19) pandemic. SN Comp. Sci. 1, 1–7 (2020).
Shoeibi, A. et al. Automated detection and forecasting of covid19 using deep learning techniques: A review. arXiv preprint arXiv:2007.10785 (2020).
Rahman, M. M. et al. Machine learning approaches for tackling novel coronavirus (covid19) pandemic. SN Comp. Sci. 2, 1–10 (2021).
Wang, B., Zhao, Y. & Chen, C. P. Hybrid transfer learning and broad learning system for wearing mask detection in the covid19 era. IEEE Trans. Instrum. Meas. 70, 1–12 (2021).
Sharma, R. R., Kumar, M., Maheshwari, S. & Ray, K. P. EVDHMARIMAbased time series forecasting model and its application for covid19 cases. IEEE Trans. Instrum. Meas. 70, 1–10 (2020).
Wu, W., Shi, J., Yu, H., Wu, W. & Vardhanabhuti, V. Tensor gradient lnorm minimizationbased lowdose CT and its application to covid19. IEEE Trans. Instrum. Meas. 70, 1–12 (2021).
Islam, M. M. et al. Wearable technology to assist the patients infected with novel coronavirus (covid19). SN Comp. Sci. 1, 1–9 (2020).
Rahman, M. M., Manik, M. M. H., Islam, M. M., Mahmud, S. & Kim, J.H. An automated system to limit covid19 using facial mask detection in smart city network. In 2020 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), pp. 1–5 ( IEEE 2020).
Moein, S. et al. Inefficiency of sir models in forecasting covid19 epidemic: a case study of Isfahan. Sci. Rep. 11, 1–9 (2021).
Ilin, C. et al. Public mobility data enables covid19 forecasting and management at local and global scales. Sci. Rep. 11, 1–11 (2021).
de Paula Oliveira, T. & de AndradeMoral, R. Global shortterm forecasting of covid19 cases. Sci. Rep. 11, 1–9 (2021).
Ballı, S. Data analysis of covid19 pandemic and shortterm cumulative case forecasting using machine learning time series methods. Chaos Solitons Fractals 142, 110512 (2021).
ArunKumar, K. et al. Forecasting the dynamics of cumulative covid19 cases (confirmed, recovered and deaths) for top16 countries using statistical machine learning models: Autoregressive integrated moving average (arima) and seasonal autoregressive integrated moving average (sarima). Appl. Soft Comput. 103, 107161 (2021).
Nasution, B. I., Nugraha, Y., Kanggrawan, J. I. & Suherman, A. L. Forecasting of covid19 cases in jakarta using poisson autoregression. In 2021 9th International Conference on Information and Communication Technology (ICoICT), 594–599 ( IEEE, 2021).
Ceylan, Z. Estimation of covid19 prevalence in Italy, Spain, and France. Sci. Total Environ. 729, 138817 (2020).
Shaikh, S. et al. Analysis and prediction of covid19 using regression models and time series forecasting. In 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), 989–995 ( IEEE, 2021).
Acosta, M. F. J. & GarciaZapirain, B. Machine learning algorithms for forecasting covid 19 confirmed cases in america. In 2020 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT) 1–6 ( IEEE 2020).
Istaiteh, O., Owais, T., AlMadi, N. & AbuSoud, S. Machine learning approaches for covid19 forecasting. In 2020 International Conference on Intelligent Data Science Technologies and Applications (IDSTA), 50–57 ( IEEE 2020).
Zawbaa, H. M. et al. A study of the possible factors affecting covid19 spread, severity and mortality and the effect of social distancing on these factors: Machine learning forecasting model. Int. J. Clin. Pract. 75, e14116 (2021).
Satu, M. et al. Shortterm prediction of covid19 cases using machine learning models. Appl. Sci. 11, 4266 (2021).
Kafieh, R. et al. Covid19 in Iran: Forecasting pandemic using deep learning. Comput. Math. Methods Med. 2021, 1–16, (2021).
Omran, N. F. et al. Applying deep learning methods on timeseries data for forecasting covid19 in Egypt, Kuwait, and Saudi Arabia. Complexity 2021 (2021).
Sesti, N., GarauLuis, J. J., Crawley, E. & Cameron, B. Integrating LSTMS and GNNS for covid19 forecasting. arXiv preprint arXiv:2108.10052 ( 2021).
Raja, P. V., Sangeetha, K., Nithya, M. T. et al. Future forecasting with machine learning models for covid19. Ann. Romanian Soc. Cell Biol. 25, 210–215 (2021).
Namasudra, S., Dhamodharavadhani, S. & Rathipriya, R. Nonlinear neural network based forecasting model for predicting covid19 cases. Neural Process. Lett. 1–21 ( 2021).
Yu, C.S. et al. A covid19 pandemic artificial intelligencebased system with deep learning forecasting and automatic statistical data acquisition: Development and implementation study. J. Med. Internet Res. 23, e27806 (2021).
Nooshin Ayoobi, D. S. et al. Time series forecasting of new cases and new deaths rate for covid19 using deep learning methods. J. Results Phys. 27, 104495 (2021).
Khaloofi, H., Hussain, J., Azhar, Z. & Ahmad, H. F. Performance evaluation of machine learning approaches for covid19 forecasting by infectious disease modeling. In 2021 International Conference of Women in Data Science at Taif University (WiDSTaif ), pp. 1–6, https://doi.org/10.1109/WiDSTaif52235.2021.9430192 ( 2021).
Xie, Y., Zhao, K., Sun, Y. & Chen, D. Gaussian processes for shortterm traffic volume forecasting. Transp. Res. Record 2165, 69–78 (2010).
Box, G. E., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. Time series analysis: Forecasting and control (Wiley, 2015).
Rasmussen, C. E. Gaussian processes in machine learning. In Summer school on machine learning, pp. 63–71 ( Springer, 2003).
Lee, J., Wang, W., Harrou, F. & Sun, Y. Reliable solar irradiance prediction using ensemble learningbased models: A comparative study. Energy Convers. Manag. 208, 112582 (2020).
Williams, C. K. & Rasmussen, C. E. Gaussian processes for regression. ( 1996).
Nguyen, V.H. et al. Applying bayesian optimization for machine learning models in predicting the surface roughness in singlepoint diamond turning polycarbonate. Math. Probl. Eng. 2021, 1–16 (2021).
GarcíaNieto, P. J. et al. Prediction of outlet dissolved oxygen in microirrigation sand media filters using a gaussian process regression. Biosyst. Eng. 195, 198–207 (2020).
Schulz, E., Speekenbrink, M. & Krause, A. A tutorial on gaussian process regression: Modelling, exploring, and exploiting functions. J. Math. Psychol. 85, 1–16 (2018).
Murphy, K. P. Machine learning: A probabilistic perspective (MIT press, Cambridge, 2012).
Protopapadakis, E., Voulodimos, A. & Doulamis, N. An investigation on multiobjective optimization of feedforward neural network topology. In 2017 8th International Conference on Information, Intelligence, Systems & Applications (IISA), 1–6 ( IEEE 2017).
Bull, A. D. Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. 12, 2879–2904 (2011).
Bergstra, J. & Bengio, Y. Random search for hyperparameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & De Freitas, N. Taking the human out of the loop: A review of bayesian optimization. Proc. IEEE 104, 148–175 (2015).
Snoek, J., Larochelle, H. & Adams, R. P. Practical bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process. Syst. 25, 1–9 (2012).
Springenberg, J. T., Klein, A., Falkner, S. & Hutter, F. Bayesian optimization with robust bayesian neural networks. Adv. Neural Inf. Process. Syst. 29, 4134–4142 (2016).
Vapnik, V., Golowich, S. E., Smola, A. et al. Support vector method for function approximation, regression estimation, and signal processing. Adv. Neural Inf. Process. Syst. 281–287 ( 1997).
Zhang, Y. & Haghani, A. A gradient boosting method to improve travel time prediction. Transp. Res. Part C Emerg. Technol. 58, 308–324 (2015).
Freund, Y. & Schapire, R. E. A decisiontheoretic generalization of online learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997).
Khaldi, B., Harrou, F., Benslimane, S. M. & Sun, Y. A datadriven soft sensor for swarm motion speed prediction using ensemble learning methods. IEEE Sens. J. (2021).
Yu, P.S., Chen, S.T. & Chang, I.F. Support vector regression for realtime flood stage forecasting. J. Hydrol. 328, 704–716 (2006).
Hong, W.C., Dong, Y., Chen, L.Y. & Wei, S.Y. SVR with hybrid chaotic genetic algorithms for tourism demand forecasting. Appl. Soft Comput. 11, 1881–1890 (2011).
Smola, A. J. & Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 14, 199–222 (2004).
Lee, J., Wang, W., Harrou, F. & Sun, Y. Wind power prediction using ensemble learningbased models. IEEE Access 8, 61517–61527 (2020).
Elith, J., Leathwick, J. R. & Hastie, T. A working guide to boosted regression trees. J. Anim. Ecol. 77, 802–813 (2008).
Wang, H. & Wu, J. Boosting for realtime multivariate time series classification. In ThirtyFirst AAAI Conference on Artificial Intelligence (2017).
Bauer, E. & Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Mach. Learn. 36, 105–139 (1999).
Harrou, F., Saidi, A. & Sun, Y. Wind power prediction using bootstrap aggregating trees approach to enabling sustainable wind power integration in a smart grid. Energy Convers. Manag. 201, 112077 (2019).
Ribeiro, M. H. D. M. & dos Santos Coelho, L. Ensemble approach based on bagging, boosting and stacking for shortterm prediction in agribusiness time series. Appl. Soft Comput. 86, 105837 (2020).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Schapire, R. E. et al. Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Stat. 26, 1651–1686 (1998).
Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794 ( 2016).
Kuhn, M. et al. Applied predictive modeling 26th edn. (Springer, 2013).
Harrou, F. et al. Statistical process monitoring using advanced datadriven and deep learning approaches: theory and practical applications (Elsevier, 2020).
Zhang, C., Li, Y., Yu, Z. & Tian, F. Feature selection of power system transient stability assessment based on random forest and recursive feature elimination. In 2016 IEEE PES AsiaPacific Power and Energy Engineering Conference (APPEEC), 1264–1268 ( IEEE 2016).
Darst, B. F., Malecki, K. C. & Engelman, C. D. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genetics 19, 1–6 (2018).
Funding
This work was supported by funding from King Abdullah University of Science and Technology (KAUST), Office of Sponsored Research (OSR) under Award No: OSR2019CRG73800.
Author information
Authors and Affiliations
Contributions
Y.A.: conceptualization, methodology, Formal analysis, data analysis and writing. F.H.: Methodology, Formal analysis, Supervision, Formal analysis, data analysis and writing. Y.S.: Methodology, Formal analysis, Supervision, Formal analysis, data analysis and Funding acquisition. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Alali, Y., Harrou, F. & Sun, Y. A proficient approach to forecast COVID19 spread via optimized dynamic machine learning models. Sci Rep 12, 2467 (2022). https://doi.org/10.1038/s41598022062183
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598022062183
This article is cited by

Application of optimal subset regression and stacking hybrid models to estimate COVID19 cases in Dhaka, Bangladesh
Theoretical and Applied Climatology (2023)

Forecasting SARSCoV2 transmission and clinical risk at small spatial scales by the application of machine learning architectures to syndromic surveillance data
Nature Machine Intelligence (2022)

A review about COVID19 in the MENA region: environmental concerns and machine learning applications
Environmental Science and Pollution Research (2022)

Tracking machine learning models for pandemic scenarios: a systematic review of machine learning models that predict local and global evolution of pandemics
Network Modeling Analysis in Health Informatics and Bioinformatics (2022)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.