Estimating the COVID-19 prevalence and mortality using a novel data-driven hybrid model based on ensemble empirical mode decomposition

In this study, we proposed a new data-driven hybrid technique by integrating an ensemble empirical mode decomposition (EEMD), an autoregressive integrated moving average (ARIMA), with a nonlinear autoregressive artificial neural network (NARANN), called the EEMD-ARIMA-NARANN model, to perform time series modeling and forecasting based on the COVID-19 prevalence and mortality data from 28 February 2020 to 27 June 2020 in South Africa and Nigeria. By comparing the accuracy level of forecasting measurements with the basic ARIMA and NARANN models, it was shown that this novel data-driven hybrid model did a better job of capturing the dynamic changing trends of the target data than the others used in this work. Our proposed mixture technique can be deemed as a helpful policy-supportive tool to plan and provide medical supplies effectively. The overall confirmed cases and deaths were estimated to reach around 176,570 [95% uncertainty level (UL) 173,607 to 178,476] and 3454 (95% UL 3384 to 3487), respectively, in South Africa, along with 32,136 (95% UL 31,568 to 32,641) and 788 (95% UL 775 to 804) in Nigeria on 12 July 2020 using this data-driven EEMD-ARIMA-NARANN hybrid technique. The contributions of this study include three aspects. First, the proposed hybrid model can better capture the dynamic dependency characteristics compared with the individual models. Second, this new data-driven hybrid model is constructed in a more reasonable way relative to the traditional mixture model. Third, this proposed model may be generalized to estimate the epidemic patterns of COVID-19 in other regions.

subseries representing the trend of the data. Second, the IMF terms were modeled using appropriate NARANN methods, whereas the residue term was modeled with a suitable ARIMA model. Finally, the prediction results from our proposed hybrid model were obtained by combining those from the basic NARANN and ARIMA models 44 . Given the lack of adequate health infrastructure and services in many regions of Africa, such estimates can elucidate the spreading dynamics of the outbreak and thus help government institutions and policymakers plan the additional materials and resources needed to keep the outbreak under control. Additionally, such estimates may help local people lessen the socioeconomic and psychosocial pressures and distress related to the COVID-19 pandemic.

Material and methods
Data source. This research focused on the daily time series analysis of the COVID-19 prevalence and mortality. The overall diagnosed COVID-19 cases and death tolls between 28 February 2020 and 27 June 2020 were taken from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19) and the COVID-2019 situation reports by the WHO (https://www.who.int/emergencies/diseases). Often, at least 50 observations, and preferably 100 or more, are required to construct an adequate and effective model 47 . Thus, the datasets used in this study were divided into two parts: the subset from 28 February 2020 through 15 June 2020 was treated as the training horizon (109 observations), and the remainder, from 16 June 2020 through 27 June 2020, was treated as the prediction horizon (12 observations).
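The split described above can be checked with a few lines of date arithmetic. This is a minimal sketch in Python (the study itself used MATLAB); the helper name `daily_range` is an illustrative assumption, while the three dates come from the text.

```python
from datetime import date, timedelta

# Study window and split date as stated in the text.
START = date(2020, 2, 28)
TRAIN_END = date(2020, 6, 15)
END = date(2020, 6, 27)


def daily_range(first, last):
    """Return every calendar day from first to last, inclusive."""
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


all_days = daily_range(START, END)
train_days = [d for d in all_days if d <= TRAIN_END]   # training horizon
test_days = [d for d in all_days if d > TRAIN_END]     # prediction horizon

print(len(train_days), len(test_days))
```

The counts agree with the 109 training and 12 prediction observations reported for the two horizons.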
The study protocol was approved by the research institutional review board of the Xinxiang Medical University (No: XYLL-2019072). All relevant guidelines were followed for the study. Ethical approval is not warranted for this research as these data without personal information are publicly available around the globe and the same is approved by the CSSE and WHO.
ARIMA model. The ARIMA model has been the most frequently used forecasting tool in health care because of its simple structure, flexible applicability, and potential to interpret a given time series 7 . Supposing that a linear pattern exists between past and future observations, the ARIMA model can exploit this pattern to predict epidemic trends in the near future 4,48 . A representative ARIMA (p, d, q) model is composed of three components, where p, d, and q represent the order of the autoregressive (AR) part, the degree of non-seasonal differencing, and the order of the moving average (MA) part, respectively. The ARIMA model is often established in four steps. First, an augmented Dickey-Fuller (ADF) test was applied to the original data to investigate its stationarity; if the test indicated a non-stationary series, differencing was used to achieve stationarity 48,49 . Second, crude values of the key parameters (p, d, and q) were determined by plotting the autocorrelation function (ACF) and partial ACF (PACF) graphs of the differenced series. Among all the candidate models, the one with the largest log-likelihood and the lowest Akaike information criterion (AIC), consistent AIC (CAIC), and Bayesian information criterion (BIC) was preferred 50 . Third, the adequacy of the identified model was checked with statistical diagnostics, including the Ljung-Box Q test, ACF plot, PACF plot, and t-test; the model was considered suitable once the residuals behaved like a white-noise series under the Ljung-Box Q test and the estimated parameters were statistically significant under the t-test 51 . Finally, the preferred ARIMA model can be employed to conduct out-of-sample forecasts.

NARANN model. ANNs can approximate arbitrarily complex non-stationary series to any desired accuracy thanks to their flexible nonlinear mapping ability 52 . The NARANN method, with its time-varying state of interconnected neurons, is an important dynamic recurrent ANN model. For this reason, it has the inherent attributes of ANNs (e.g., powerful nonlinear mapping capacity, self-learning and adaptation ability, along with generalization and fault-tolerance ability) 33,53 . Further, the NARANN model has a long- or short-term memory function because it retains prior inputs, outputs, and network structures with the help of the tapped delay line, which gives it the potential to model time-dependent series dynamically 33 . A NARANN method can be written in the form

$$X_t = f(X_{t-1}, X_{t-2}, \ldots, X_{t-d})$$

where $X_t$ signifies the forecast from the NARANN method based on the previous values up to the lagged period $d$, and $f$ is the nonlinear mapping learned by the network.
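To make the role of the tapped delay line concrete, the sketch below builds the lagged input vectors that a nonlinear autoregressive network with d feedback delays would be trained on. It is only an illustration under stated assumptions: the function name `make_lagged_pairs` and the toy series are hypothetical, and no network is fitted here.

```python
def make_lagged_pairs(series, d):
    """Build (inputs, target) pairs for an autoregressive model with d delays.

    Each input holds the d most recent values, ordered [x_{t-1}, ..., x_{t-d}],
    and the target is x_t.
    """
    if d < 1 or d >= len(series):
        raise ValueError("d must satisfy 1 <= d < len(series)")
    pairs = []
    for t in range(d, len(series)):
        lags = [series[t - k] for k in range(1, d + 1)]
        pairs.append((lags, series[t]))
    return pairs


toy = [1, 2, 4, 7, 11, 16]            # hypothetical cumulative counts
pairs = make_lagged_pairs(toy, d=2)
for lags, target in pairs:
    print(lags, "->", target)
```

With d delays, the first d observations serve only as inputs, so a series of length n yields n - d training pairs.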
In this study, the modeling procedure consisted of three steps. First, the whole data were divided into two blocks: training samples (from 28 February 2020 to 15 June 2020) and testing samples (from 16 June 2020 to 27 June 2020). To develop an effective and accurate NARANN model, the training samples were further partitioned into training (80% of the training samples), validation (10%), and testing (10%) subseries using the dividerand function in MATLAB. Second, the number of hidden neurons and the number of delays d were selected by trial and error, with the network trained by the Levenberg-Marquardt algorithm in open-loop (open feedback) form 33 . During this search, the response plot between the estimated outputs and targets, the ACF plot, the mean square error (MSE), and the correlation coefficient (R) were examined until the best possible specification was determined 53 . Finally, the trained open-loop network was closed to make multi-step-ahead forecasts.
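The last step, closing the loop, means that each forecast is fed back as an input for the next step. The sketch below shows that mechanism in Python with a stand-in predictor; `closed_loop_forecast` and `linear_extrapolation` are hypothetical names, and the stand-in is a simple linear rule, not a trained neural network.

```python
def closed_loop_forecast(one_step, history, d, steps):
    """Iterate a one-step predictor, feeding each forecast back as an input.

    one_step: callable taking [x_{t-1}, ..., x_{t-d}] and returning x_t.
    history:  observed values, oldest first (needs at least d entries).
    """
    if len(history) < d:
        raise ValueError("history must contain at least d values")
    window = list(history[-d:])          # oldest ... newest
    forecasts = []
    for _ in range(steps):
        lags = window[::-1]              # newest first: x_{t-1}, ..., x_{t-d}
        nxt = one_step(lags)
        forecasts.append(nxt)
        window = window[1:] + [nxt]      # slide the tapped delay line
    return forecasts


def linear_extrapolation(lags):
    """Stand-in for a trained network: continue the last observed increment."""
    return 2 * lags[0] - lags[1]


out = closed_loop_forecast(linear_extrapolation, [10, 12, 14], d=2, steps=3)
print(out)
```

Because forecasts replace observations in the delay line, errors can accumulate as the horizon grows, which is one reason multi-step forecasts are usually kept short.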
A hybrid model of EEMD-ARIMA-NARANN. EEMD. Although the EMD method has been widely employed to deal with noisy nonlinear and non-stationary processes in signal analysis, it has been shown that this method suffers from two major shortcomings in applications: edge effects and mode mixing 39,54,55 . Mode mixing in particular can not only mix vibration modes of different scales but also cause the decomposed IMF terms to lose their physical meaning 40 . To compensate for these weaknesses, the EEMD technique was introduced as an extension of the EMD method 39 . EEMD resolves the mode-mixing issue by defining each IMF term as the average over an ensemble of trials, where each trial consists of the signal plus white noise of finite amplitude 54 . The decomposition proceeds as follows. First, a white noise series w(t) is added to the original series x(t), and the new time series is defined as

$$X(t) = x(t) + w(t)$$

Second, this new time series is decomposed into IMF terms using the EMD method. Third, the first and second steps are repeated, with a different white noise series added to the original time series each time. Finally, the ensemble of IMF terms obtained from the EMD method is averaged.

At the decomposition stage, the number of ensemble members and the amplitude of the added white noise series are crucial to the result 43 . Fortunately, these two parameters can be related by a well-established statistical rule 39

$$\varepsilon_n = \frac{\varepsilon}{\sqrt{N}}$$

where $N$ is the number of ensemble members, $\varepsilon$ represents the amplitude of the added white noise series, and $\varepsilon_n$ refers to the standard error. It has been shown that the EEMD technique gives satisfactory results with an ensemble size of 100 and a white-noise amplitude of 0.2 times the standard deviation 39,56 .
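As a worked illustration of the statistical rule relating ensemble size and noise amplitude, the sketch below assumes the commonly cited form in which the standard error equals the noise amplitude divided by the square root of the ensemble size. The function name `ensemble_noise_error` is hypothetical; the values 100 and 0.2 are the settings quoted in the text.

```python
import math


def ensemble_noise_error(amplitude, n_ensembles):
    """Standard error left by the added noise, assuming eps_n = eps / sqrt(N)."""
    if n_ensembles < 1:
        raise ValueError("n_ensembles must be at least 1")
    return amplitude / math.sqrt(n_ensembles)


# Amplitude is expressed in units of the series' standard deviation.
eps_n = ensemble_noise_error(amplitude=0.2, n_ensembles=100)
print(round(eps_n, 4))
```

Under this rule, quadrupling the ensemble size only halves the residual noise, so very large ensembles bring diminishing returns.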
EEMD-ARIMA-NARANN mixture model. To make full use of the linear and nonlinear components in the target series, and inspired by the "decomposition and ensemble" idea of the EEMD method and the powerful, flexible nonlinear mapping capacity of the NARANN method 57 , we constructed the EEMD-ARIMA-NARANN mixture method. In developing this mixture model, the prevalence and mortality time series of COVID-19 were first decomposed into several IMF terms and a residue term. Then, each IMF term was modeled with an adequate NARANN method, whereas the residue term was modeled with an adequate ARIMA method. Finally, the results of the proposed mixture method were obtained by combining the forecasts from the ARIMA and NARANN models (Fig. 1). In this way, the new data-driven mixture technique can capture both linear and nonlinear patterns in the prevalence and mortality series of COVID-19 simultaneously. The proposed EEMD-ARIMA-NARANN mixture method can be expressed as

$$\hat{y}_t = \hat{a}_t + \hat{b}_t$$

where $\hat{y}_t$ refers to the estimate from the EEMD-ARIMA-NARANN mixture technique, $\hat{a}_t$ represents the estimate from the ARIMA model, and $\hat{b}_t$ is the estimate from the NARANN models.
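The combination step can be sketched as a step-by-step sum of the component forecasts. The snippet below assumes that the per-component forecasts are already available; the function name `combine_forecasts` and all numbers are hypothetical placeholders, not results from the study.

```python
def combine_forecasts(imf_forecasts, residue_forecast):
    """Sum per-component forecasts step by step.

    imf_forecasts:    list of forecast paths, one per IMF (NARANN side).
    residue_forecast: forecast path for the residue/trend term (ARIMA side).
    """
    horizon = len(residue_forecast)
    if any(len(path) != horizon for path in imf_forecasts):
        raise ValueError("all component forecasts must share one horizon")
    combined = []
    for t in range(horizon):
        b_t = sum(path[t] for path in imf_forecasts)   # NARANN contribution
        a_t = residue_forecast[t]                      # ARIMA contribution
        combined.append(a_t + b_t)
    return combined


# Hypothetical three-step forecasts for two IMFs and one residue term.
imfs = [[1.0, -2.0, 0.5], [3.0, 1.0, -1.5]]
trend = [100.0, 110.0, 120.0]
y_hat = combine_forecasts(imfs, trend)
print(y_hat)
```

Summation is the natural way to recombine here because the decomposition is additive: the IMFs and the residue add back up to the original series.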
Assessing model performance. In this study, four statistical measures of error, including root mean square percentage error (RMSPE), mean absolute deviation (MAD), mean error rate (MER), and mean absolute percentage error (MAPE), were calculated to evaluate the accuracy of forecasts. Smaller values of these measures indicate a better model.

$$\mathrm{MAD} = \frac{1}{N}\sum_{i=1}^{N}\left|X_i - \hat{X}_i\right|$$

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|X_i - \hat{X}_i\right|}{X_i}$$

$$\mathrm{RMSPE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i - \hat{X}_i}{X_i}\right)^{2}}$$

$$\mathrm{MER} = \frac{\frac{1}{N}\sum_{i=1}^{N}\left|X_i - \hat{X}_i\right|}{\bar{X}_i}$$

where $X_i$ signifies the prevalence and mortality data of COVID-19, $\hat{X}_i$ is the estimate from the chosen approach, $\bar{X}_i$ refers to the mean of the prevalence and mortality data of COVID-19, and $N$ stands for the number of simulations and forecasts.
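The four measures can be computed as in the sketch below, which uses the common textbook definitions; that these match the exact forms used in the study is an assumption. The function names and the toy values are illustrative only, and results are returned as fractions (multiply by 100 for percentages).

```python
import math


def _check(actual, predicted):
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")


def mad(actual, predicted):
    """Mean absolute deviation, in the units of the data."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (actual values must be nonzero)."""
    _check(actual, predicted)
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def rmspe(actual, predicted):
    """Root mean square percentage error as a fraction."""
    _check(actual, predicted)
    squared = sum(((a - p) / a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mer(actual, predicted):
    """Mean error rate: MAD divided by the mean of the actual values."""
    _check(actual, predicted)
    return mad(actual, predicted) / (sum(actual) / len(actual))


actual = [100, 200, 400]       # hypothetical observations
predicted = [110, 190, 400]    # hypothetical forecasts
print(mad(actual, predicted), mape(actual, predicted))
print(rmspe(actual, predicted), mer(actual, predicted))
```

MAD depends on the scale of the series, whereas MAPE, RMSPE, and MER are relative measures, which makes them easier to compare between the prevalence and mortality series.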

Results
Development of the ARIMA model. During the study span, the overall confirmed cases totaled 12,459 in South Africa and 23,298 in Nigeria, with a daily mean of 1030 and 193 cases, respectively. Among them, there were 2340 deaths in South Africa and 554 deaths in Nigeria, with a daily mean of 20 and 5 deaths, respectively. As shown in Fig. 2, the prevalence and mortality time series displayed an apparent increasing trend, so differencing was required to remove the trend from these target series. After differencing, an ADF test was applied to the differenced series; the resulting statistics, given in Table S1, indicated stationary series. Thus, the possible values of the ARIMA models' key parameters were crudely determined from these stationary series. As illustrated in Table 1, the sparse coefficient ARIMA (2, 2, (1, 3)) (AIC = 1482.590, CAIC = 1483.441, BIC = 1498.642, and log-likelihood = −736.290) and ARIMA (0, 2, (1, 3, 4)) (AIC = 733.390, CAIC = 733.980, BIC = 746.750, and log-likelihood = −362.690) specifications were considered the best models for simulating the prevalence and mortality data, respectively, in South Africa, because they gave the lowest AIC, CAIC, and BIC values and the greatest log-likelihood among all the candidate models. Furthermore, as illustrated in Tables 2 and 3 and Fig. 3, the estimated parameters of the best-fitting ARIMA models were statistically significant (p < 0.05), and the Ljung-Box Q tests on the error series from these models showed no statistical significance at different lags (p > 0.05). These results meant that the identified optimal ARIMA models were adequate for modeling the target data. Similarly, diagnostic checking of the residuals for the prevalence and mortality data in Nigeria (Tables 1, 2, 3 and Fig. 3) demonstrated that the ARIMA (1, 2, 2) and sparse coefficient ARIMA (0, 2, (1, 2, 4)) models were suitable for modeling the prevalence and mortality data, respectively, in Nigeria. Accordingly, these preferred ARIMA models can be used to forecast the epidemics over the coming days.
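All four selected models use d = 2, that is, the series is differenced twice before the AR and MA terms are fitted. The sketch below shows what that operation does on a toy series with a quadratic trend; the function name `difference` and the toy data are illustrative assumptions, not part of the study's code.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


quadratic = [t * t for t in range(8)]      # toy series with a curved trend
print(difference(quadratic, d=1))           # still trending upward
print(difference(quadratic, d=2))           # constant: the trend is removed
```

Each round of differencing shortens the series by one observation, so a model with d = 2 is effectively fitted on two fewer data points than the training horizon contains.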
Construction of the NARANN model. To obtain the preferred NARANN model, networks with 1 to 20 hidden units and 1 to 6 feedback delays were trained by trial and error. The NARANN with 15 hidden units and 6 delays and the NARANN with 14 hidden units and 5 delays were identified as the optimal specifications for the prevalence and mortality data, respectively, in South Africa: among all the candidate models, the NARANN (15,6) and NARANN (14,5) specifications showed the lowest MSE values in the training (2648.213 and 9.710, respectively), validation (1595.504 and 12.849, respectively), and testing (8647.196 and 24.024, respectively) subsets, along with the greatest R values in the training (1 and 1, respectively), validation (1 and 1, respectively), and testing (1 and 1, respectively) subsets (Tables 3 and 4, Figures S1 and S2). Moreover, almost all autocorrelation coefficients of the resulting errors fell within the estimated 95% uncertainty level (UL) at different lags, and the response plots between inputs and outputs showed that the residuals fluctuated within an acceptable range in their corresponding subsets (Figs. 4, 5). The above-mentioned results intimated that the identified two best NARANN specifications offered reliable estimates for the prevalence and mortality

Establishment of the EEMD-ARIMA-NARANN hybrid model. Following the decomposition procedure, each original target series was decomposed into different IMFs and a residue (Fig. 6). Subsequently, the residues representing the trends of the target series were used to establish the ARIMA models, and the best-fitting ARIMA models and their goodness-of-fit statistics for the different target series are listed in Table 5. The IMF components, which represent the detailed (nonlinear) information contained in the target series, were used to develop the NARANN models, and the best-fitting NARANN models and their diagnostic testing results for the various IMF series are summarized in Table 4. Each decomposed series was then fitted and predicted with its most appropriate model, and the resulting in-sample simulations and out-of-sample forecasts were summed to obtain the final results of the EEMD-ARIMA-NARANN hybrid model.

Comparisons of forecasting accuracy level between models.
We discovered that the EEMD-ARIMA-NARANN mixture model showed the lowest values of the measurement metrics, including MAD, MAPE, MER, and RMSPE, in addition to the RMSPE value in the prevalence data of Nigeria by comparing the forecasts for the testing samples from the selected best-fitting three models in the study regions (  (Table S5).

Discussion
Effective prevention and control plans are needed to curb the rapid transmission of the COVID-19 outbreak. Early nowcasting and forecasting are essential to forming such plans, which cover the allocation of limited health resources, the timely adjustment of current intervention strategies, the arrangement of production activities, and even local economic development 30,31,58 . For this reason, it is imperative to develop statistical techniques with high forecasting accuracy and reliability. Time series modeling is a useful aid for developing underlying hypotheses to analyze current epidemic patterns and to predict the spreading dynamics of different diseases in the near future 4,7 . As far as we are aware, this is the only study to analyze and forecast the epidemiological trends of the COVID-19 prevalence and mortality time series in South Africa and Nigeria using a novel data-driven EEMD-ARIMA-NARANN hybrid technique. A series of modeling experiments indicated that this new hybrid technique produced lower forecasting errors than the basic ARIMA and NARANN methods on measurement metrics such as MAD, MAPE, MER, and RMSPE (Table 6). These results meant that our proposed hybrid method has greater potential to track the dynamic dependence characteristics of the COVID-19 epidemic process than the other methods used in this study, and it may act as a useful supportive tool for policymakers to develop appropriate prevention and control strategies and measures for both mitigating the outbreak and reducing deaths due to the COVID-19 pandemic. This hybrid model is also of great value in assessing the effects of current public interventions. For example, if the model forecasts a markedly higher epidemic level than is actually observed in the coming periods, this suggests that the current measures are taking effect in the target population; otherwise, the current public interventions may need to be reinforced or supplemented with additional plans.

… , and the sample ACFs and PACFs lag 15 in (C) (which are also reasonable because some higher-order correlation coefficients readily exceed the estimated 95% uncertainty levels by chance). These results meant that the residuals from identified ARIMA models for different datasets were without pattern, suggesting that the selected ARIMA models appear to be suitable for capturing the dynamic dependency structure in the object series.

In addition, the basic ARIMA and NARANN models also provided high forecasting accuracy for our target data in light of the above four measurement metrics. The ARIMA model is the most versatile method for fitting time series data; it postulates a linear association between the future values of a series and its past and present states, and it can be used to model non-seasonal, seasonal, and nonstationary data 48,49 . A nonstationary series, however, must first be differenced and/or transformed with a logarithm or square root 50 . For instance, Yousaf et al. built the ARIMA (0, 2, 1), ARIMA (2, 2, 0), and ARIMA (1, 2, 1) models to study and predict the cumulative confirmed cases, recoveries, and deaths of COVID-19, respectively, for the upcoming month in Pakistan 19 . Ceylan established the ARIMA (0, 2, 1), ARIMA (1, 2, 0), and ARIMA (0, 2, 1) models to forecast the total reported cases of COVID-19 in Italy, Spain, and France, respectively 7 . Even though these ARIMA models have high forecasting accuracy and reliability, the major disadvantage of the ARIMA model is its linear assumption, which makes it difficult to handle the randomness in the target series 52 .
Hence, we proposed a novel data-driven EEMD-ARIMA-NARANN hybrid model to overcome the limitation of the basic model. This data-driven mixture technique shows a strong capacity to improve forecasting power for the prevalence and mortality data of COVID-19: its principal advantage is that it decomposes the target data into multiple scale levels, so that the underlying trend and the random parts can be modeled simultaneously with different types of models. Given the forecasting superiority of our proposed data-driven hybrid method, this hybrid model may also be useful for nowcasting and forecasting the epidemiological trends of COVID-19 prevalence and mortality in other regions, or those of other infectious diseases 44 . Of note, recent studies found that some other forecasting tools (e.g., the new innovations state space modeling framework 59 , long short-term memory neural network 60 , advanced error-trend-seasonal (ETS) framework 61 , α-Sutte Indicator 62 , and SBDiEM 30 ) produced highly accurate forecasts of the epidemiological trends of COVID-19. As a result, to further our research we are planning a comparative study between our proposed EEMD-ARIMA-NARANN hybrid model and the ones above. The contributions of the current work are several-fold. First, when forecasting accuracy is measured with the MAPE (the most frequently used index of predictive performance), the hybrid model improves accuracy by at least 14.321% and at most 40.488% compared with the ARIMA model, and by at least 22.545% and at most 59.766% compared with the NARANN model. Second, this work presents a new data-driven integrated system constructed in a more reasonable way than the conventional mixture pattern.
Third, this new data-driven hybrid model may be generalized to estimate the epidemic patterns in other regions seriously affected by the COVID-19 outbreak. Given the outbreak trends of COVID-19 and the state of health infrastructure and services in Africa, there is great concern about whether the health systems of African regions can meet the demand for medical supplies created by the increasing number of confirmed cases in a timely and effective manner. For this reason, we used our proposed mixture technique to predict the confirmed cases and deaths over the next 15 days in South Africa and Nigeria. In South Africa in particular, the number of infected individuals has shown an exponential trend since 18 May 2020 (Figs. 2, 7). Even worse, our predictions indicate that the outbreak may still be increasing rapidly, with an average of around 3465 confirmed cases and 75 deaths per day over the upcoming 15 days in South Africa (Fig. 7A,B, Table S5), and that more time is needed for morbidity to reach a plateau.
Therefore, stricter or additional precautionary measures are required to reduce the rapid spread of COVID-19 (e.g., increasing the number of doctors, pharmacists, medical students, and other health workers who can offer their expertise on the frontlines of the pandemic response, strengthening overnight curfew management to prevent social interaction, raising public awareness by strengthening advocacy, issuing more stringent lockdown rules, building more mobile cabin hospitals to treat mild cases, mandating face coverings in public, suspending trans-regional public transportation, suspending or prohibiting tourism across regions, strengthening inspection and quarantine, extending the closure period of public places such as schools, universities, and churches, supporting working from home, prohibiting possible social gatherings, accelerating research on vaccines and clinical treatment programmes, and seeking help from other countries in a position to provide it) 12,19,31,60,63 . Nigeria, which was hit the second hardest by the COVID-19 outbreak, is witnessing a downward trend in COVID-19 prevalence and mortality, with an estimated 590 confirmed cases and 16 deaths per day over the next 15 days (Fig. 7C,D, Table S5). However, strict prophylactic measures still need to be implemented in Nigeria to avoid a rebound of the outbreak. The findings in this report are subject to some shortcomings. Firstly, accurate statistics on the prevalence and mortality data in the two study regions are vital for understanding the epidemic patterns of COVID-19 with our proposed data-driven EEMD-ARIMA-NARANN hybrid technique. However, limited nucleic acid detection capacity may result in under-diagnosis or under-reporting in the prevalence and mortality data during the COVID-19 outbreak.
Secondly, in developing the NARANN method, there is currently a lack of general guidelines for selecting the number of hidden neurons and delays, so repeated training is required in applications. Thirdly, although this data-driven mixture technique does a good job of estimating the epidemic patterns of COVID-19 in this study, more work is needed to determine whether it can predict the epidemiological trends of COVID-19 in other regions, or of other infectious diseases, with high accuracy. Fourthly, the forecasting performance of the EEMD-ARIMA-NARANN hybrid technique may be further improved by integrating related factors (e.g., internet search queries, meteorological parameters, air pollution indicators, and policy interventions), and further studies that take these COVID-19-related factors into account will be very interesting. However, this was not investigated in the current work. Lastly, the forecasting reliability of this data-driven mixture technique may decrease as the forecasting horizon lengthens. Therefore, new real-time data should be integrated into the model to maintain its forecasting accuracy.

Figure 5. Time series displaying the response results between inputs and outputs. (A) Response plot between inputs and outputs for the prevalence data in South Africa; (B) Response plot between inputs and outputs for the mortality data in South Africa; (C) Response plot between inputs and outputs for the prevalence data in Nigeria; (D) Response plot between inputs and outputs for the mortality data in Nigeria. These plots display which samples were treated as the training, validation and testing datasets, and illustrate the corresponding errors between inputs and targets. It could be seen that the vast majority of data points had smaller errors between inputs and targets, indicating that the identified NARANN methods seem to be adequate for estimating the epidemiological trends of COVID-19 in the study regions.

Conclusions
Insights from time series modeling are invaluable for policymakers planning effective prevention and control strategies to bring the outbreak under control. In this work, we proposed a new data-driven EEMD-ARIMA-NARANN mixture technique and demonstrated that the predicted values from this mixture model are more consistent with the actual observations than those from the basic ARIMA and NARANN methods. The model can therefore function as a helpful policy-supportive tool to plan and prepare medical supplies effectively, and thus help to alleviate the outbreak in South Africa and Nigeria over the upcoming days or weeks. It is important to stress that the estimated values may differ from the observed values depending on the strategic preparedness of, and the measures taken by, the governments of these study regions. Also, our proposed hybrid model may be of great help in estimating and forecasting future epidemic trends in other regions severely affected by this crisis.