Prediction of global omicron pandemic using ARIMA, MLR, and Prophet models

Zhao, Daren; Zhang, Ruihua; Zhang, Huiwu; He, Sizhang

doi:10.1038/s41598-022-23154-4

Article
Open access
Published: 28 October 2022

Scientific Reports volume 12, Article number: 18138 (2022)

Abstract

Globally, since the outbreak of the Omicron variant in November 2021, the number of confirmed cases of COVID-19 has continued to increase, posing a tremendous challenge to the prevention and control of this infectious disease in many countries. The global daily confirmed cases of COVID-19 between November 1, 2021, and February 17, 2022, were used as a database for modeling, and the ARIMA, MLR, and Prophet models were developed and compared. The prediction performance was evaluated using mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). The study showed that ARIMA (7, 1, 0) was the optimum model, and the MAE, MAPE, and RMSE values were lower than those of the MLR and Prophet models in terms of fitting performance and forecasting performance. The ARIMA model had superior prediction performance compared to the MLR and Prophet models. In real-world research, an appropriate prediction model should be selected based on the characteristics of the data and the sample size, which is essential for obtaining more accurate predictions of infectious disease incidence.

Introduction

Since November 2021, the Omicron variant has spread rapidly worldwide. The B.1.1.529 variant was first reported to the World Health Organization (WHO) from South Africa on November 24, 2021¹. The WHO designated B.1.1.529 a SARS-CoV-2 variant of concern, named Omicron, on November 26, 2021^2,3. Consequently, many countries have enacted various restrictions to prevent the spread of the Omicron variant.

Globally, as of February 14, 2022, a total of 416,614,051 confirmed cases of COVID-19, including 5,844,097 deaths, had been reported by the WHO⁴. The R₀ of the Omicron variant has been estimated to be as high as 10 (ref. 5). Therefore, it is crucial that prediction models are used to forecast the COVID-19 epidemic trend, which can help governments and relevant authorities take effective measures in advance⁶. Time series forecasting models play an important role in disease surveillance⁷. Accurate predictions are required for the prevention and control of COVID-19 so that early warning information can be provided to government officials.

Numerous mathematical models, including traditional time series and machine learning models, have been applied to predict the incidence of COVID-19. Among the traditional time series models, the ARIMA model is the most widely used for COVID-19 incidence prediction. Ceylan et al.⁸ used the ARIMA model to estimate the overall prevalence of COVID-19 in three European countries, and the results can help policymakers and health authorities allocate medical resources reasonably. Sun et al.⁹ used a modified ARIMA model to forecast the COVID-19 pandemic in Alberta, Canada. Roy et al.¹⁰ analyzed the effectiveness of COVID-19 epidemiological surveillance using ARIMA models. Malki et al.¹¹ applied the ARIMA model to predict the spread of COVID-19 worldwide. James et al.¹² adopted the ARIMA model to forecast the short-term trajectory of the acceleration of fatalities caused by COVID-19. Dawoud et al.¹³ utilized the ARIMA model to estimate COVID-19 cumulative confirmed cases. Alzahrani et al.¹⁴ used the autoregressive model (AR), moving average (MA), a combination of both (ARMA), and integrated ARMA (ARIMA) to forecast the COVID-19 pandemic and found that the ARIMA model outperformed the other models.

In addition, the ARIMA model is used not only to estimate the number of COVID-19 cases, but also to estimate the number of fully vaccinated people and the consumption of electricity and natural gas. Cihan et al.¹⁵ developed the ARIMA model to predict electricity and natural gas consumption in an industrial zone in Turkey. Cihan et al.¹⁶ used the ARIMA model to determine the number of people fully vaccinated against COVID-19.

Other research has focused on the use of machine learning models to predict COVID-19 incidence, such as LSTM, GRU, SVR, XGBoost, and RNNs. Shahid et al.¹⁷ constructed the ARIMA, SVR, LSTM, and Bi-LSTM models to forecast COVID-19 confirmed cases, deaths, and recoveries in ten major countries, and stated that Bi-LSTM achieved much better prediction results than the other models. Luo et al.¹⁸ established and compared the prediction performance of the LSTM and XGBoost algorithms. ArunKumar et al.¹⁹ developed GRU, LSTM, and RNN models to forecast future trends of the cumulative COVID-19 confirmed cases for the top-10 countries.

However, to date, no studies have compared global COVID-19 incidence predictions using ARIMA, MLR, and Prophet models since the outbreak of Omicron variants. In this study, the global daily confirmed cases of COVID-19 between November 1, 2021, and February 17, 2022, were obtained from the WHO website. Based on the sample size and data characteristics, ARIMA, MLR, and Prophet models were constructed and compared, and the optimum model was selected to predict the global daily confirmed cases of COVID-19 from February 18 to March 18, 2022. To the best of our knowledge, this is the first study to explore in detail the construction and comparison of the ARIMA, MLR, and Prophet models for predicting daily confirmed cases of COVID-19 worldwide. We hope that the prediction results of this study will serve as a reference for COVID-19 prevention and control worldwide.

Materials and methods

Materials

Data source

We collected global daily confirmed cases of COVID-19 between November 1, 2021, and February 17, 2022, from the website of the World Health Organization (https://covid19.who.int/). Microsoft Excel was used to create the time series database. All data were updated daily. The 109 observations were divided into a training set (approximately 80%) and a validation set (the remaining 20%). Data from November 1, 2021, to January 27, 2022, formed the training set, and data from January 28, 2022, to February 17, 2022, formed the validation set.
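The split described above can be checked with a short calendar calculation. The following is a minimal sketch using only the dates stated in the text; the helper name `daily_range` is an assumption introduced for illustration, and no case counts are involved.

```python
from datetime import date, timedelta

# Study window and split boundary as stated in the text.
START = date(2021, 11, 1)
SPLIT = date(2022, 1, 27)   # last day of the training set
END = date(2022, 2, 17)


def daily_range(first, last):
    """Return every calendar day from first to last, inclusive."""
    n_days = (last - first).days + 1
    return [first + timedelta(days=i) for i in range(n_days)]


all_days = daily_range(START, END)
train_days = [d for d in all_days if d <= SPLIT]
valid_days = [d for d in all_days if d > SPLIT]

# Number of observations in the full window, training set, and validation set.
print(len(all_days), len(train_days), len(valid_days))
# Share of observations that fall in the training set.
print(round(len(train_days) / len(all_days), 3))
```

Because the series is ordered in time, the split is chronological rather than random, so the validation set always lies after the training set.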

Methods

ARIMA model

The autoregressive integrated moving average (ARIMA) model, a classic time series prediction technique, was proposed by Box and Jenkins in the early 1970s, and has been extensively applied to the prediction of infectious diseases²⁰. ARIMA is a mathematical model that uses historical values to forecast future values of a variable²¹. The basic equation for ARIMA is as follows²²:

$$ \Theta_{P} (B^{s} )\theta_{p} (B)(1 - B^{s} )^{D} (1 - B)^{d} y_{t} = \Phi_{Q} (B^{s} )\varphi_{q} (B)\varepsilon_{t} $$

(1)

In this equation, y_t is the value of the time series at time t (the quantity being predicted), B is the backward shift operator, and ε_t is the residual at time t²³. $\Theta_{P}$, $\theta_{p}$, $\Phi_{Q}$, and $\varphi_{q}$ are the polynomials associated with the four ARIMA parameters P, p, Q, and q, respectively. Here, d and D represent the degrees of the trend and seasonal differences, respectively. The ARIMA model parameters p, P, q, Q, and s represent the order of auto-regression, seasonal auto-regression lag, order of moving average, seasonal moving average, and seasonal periodicity, respectively²⁴.

In general, the ARIMA model is defined as ARIMA(p, d, q) (P, D, Q)s. In this study, however, the ARIMA model was expressed as ARIMA(p, d, q) because the daily confirmed COVID-19 cases in the time series were non-seasonal data, and its equation can be written as follows²³:

$$ \theta_{p} (B)(1 - B)^{d} y_{t} = \varphi_{q} (B)\varepsilon_{t} $$

(2)

The construction of the ARIMA model involves several steps^25,26,27,28. First, the daily confirmed COVID-19 case sequence was plotted to determine whether the time series was stationary. Non-stationary sequences were transformed into stationary sequences using difference and log transformations. Second, the parameters of the ARIMA model were estimated: the orders p, P, q, and Q were determined from the auto-correlation function (ACF) and partial auto-correlation function (PACF) graphs of the series after the difference and log transformations, which yielded the initial candidate ARIMA models. Third, the candidate models were diagnosed and evaluated using the Ljung-Box (Q) test and the t-test, respectively. The Ljung-Box (Q) test required that the residuals of the daily COVID-19 case time series be white noise (p > 0.05). A t-test was used to determine whether the parameters of each candidate ARIMA model were significant. The optimum model was the one with the maximum R-square value and the minimum normalized BIC and RMSE values, whose residuals were a white noise sequence. The Bayesian information criterion (BIC) is commonly used for model selection in time series forecasting²⁹. It was developed by Schwarz and is defined as^29,30:
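As an illustration of the identification and diagnosis steps, the sketch below implements differencing, the sample ACF, and the Ljung-Box Q statistic in plain Python. It is not the SPSS or EViews procedure used in the study: the function names are assumptions, the toy series is synthetic rather than WHO data, the log transformation is omitted, and the conversion of Q to a p-value (a chi-square tail probability) is left out.

```python
import math


def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def acf(series, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(v * v for v in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k r_k^2 / (n - k)."""
    n = len(residuals)
    r = acf(residuals, max_lag)
    return n * (n + 2) * sum(rk * rk / (n - k) for k, rk in enumerate(r, start=1))


# Synthetic series: linear trend plus a 7-day cycle (illustrative only).
toy = [100 + 5 * t + 20 * math.sin(2 * math.pi * t / 7) for t in range(56)]
diffed = difference(toy, d=1)          # removes the linear trend
r = acf(diffed, max_lag=14)            # r[6] is the lag-7 autocorrelation
q_stat = ljung_box_q(diffed, max_lag=14)
print(round(r[6], 2), round(q_stat, 1))
```

In this toy case the lag-7 autocorrelation of the differenced series stays large, which is the kind of pattern an analyst would read from an ACF plot. In practice the Q statistic is computed on the residuals of a fitted model, not on the differenced series itself.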

$$ {\text{BIC}} = { - }2\ln (L) + \ln (n)*k $$

(3)

where L is the maximized value of the likelihood function of the model, n is the sample size, and k is the number of parameters estimated by the model. The normalized BIC was used to confirm the adequacy of the model³⁰. The smaller the value of the normalized BIC, the better the model fits³⁰.
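Equation (3) can be evaluated directly once the log-likelihood of each candidate model is known. The sketch below is a minimal illustration. The log-likelihoods and parameter counts are hypothetical values chosen for the example, not results from the study, and the function computes the BIC of Eq. (3) rather than the normalized BIC reported by SPSS.

```python
import math


def bic(log_likelihood, n_obs, n_params):
    """BIC = -2 ln(L) + k ln(n), with ln(L) passed in directly."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical candidates: name -> (log-likelihood, number of parameters).
candidates = {"model_A": (-950.0, 3), "model_B": (-948.5, 8)}

# 88 is the training-set size given in the text.
scores = {name: bic(ll, 88, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, {name: round(s, 2) for name, s in scores.items()})
```

Here the larger model has a slightly higher likelihood, but the ln(n)·k penalty outweighs the gain, so the smaller model is preferred.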

MLR model

The multiple linear regression (MLR) model, an extension of simple linear regression, is used to describe the linear relationship between multiple independent variables and a single dependent variable³¹. The formula for the MLR model is given below³².

$$ {\text{Y}} = \beta_{0} + \beta_{1} {\text{X}}_{1} + \beta_{2} {\text{X}}_{2} + ... + \beta_{k} {\text{X}}_{k} + \varepsilon $$

(4)

where Y is the dependent variable; X₁, X₂, …, X_k are the independent variables; β₀ is the Y-intercept; β₁, β₂, …, β_k are the regression coefficients; and ε is the random error term.
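To make the estimation step concrete, the following sketch fits an intercept and regression coefficients by ordinary least squares using the normal equations. It is an illustration only: the study fitted the MLR model in SPSS, and the function name `fit_mlr`, the two predictors, and the data are all hypothetical.

```python
def fit_mlr(X, y):
    """Least-squares fit of y = intercept + c1*x1 + ... + ck*xk.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination
    and returns [intercept, c1, ..., ck].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical data generated exactly from y = 2 + 3*x1 - 1*x2.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (3, 5)]
y = [2 + 3 * x1 - x2 for x1, x2 in X]
coefs = fit_mlr(X, y)
print([round(c, 6) for c in coefs])
```

Because the toy data contain no noise, the fit recovers the generating intercept and coefficients up to floating-point error. With real data the error term ε absorbs the remaining variation.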

Prophet model

The Prophet model, an open-source time-series forecasting algorithm, was created by Facebook in 2017, and can be run using R or Python³³. The basic formula for the Prophet model is as follows^34,35:

$$ y(t) = g(t) + s(t) + h(t) + \varepsilon_{t} $$

(5)

Here, y(t) is the predictive value, g(t) is the trend function that models non-periodic changes in the time series of daily confirmed COVID-19 cases, s(t) represents periodic changes (here, the weekly pattern of the confirmed COVID-19 case time series), and h(t) represents the effects of holidays, such as Christmas Day, which occur on potentially irregular schedules. $\varepsilon_{t}$ represents idiosyncratic changes that are not accommodated by the model³⁶.

For the trend g(t), there are two types of models that cover numerous Facebook applications: a saturating growth model and a piecewise linear model. The formula for the nonlinear saturating growth model is as follows³⁷:

$$ g(t) = \frac{{\text{C}}}{1 + \exp ( - k(t - m))} $$

(6)

where C is the carrying capacity, k is the growth rate, and m is the offset parameter.
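The behaviour of Eq. (6) is easy to see numerically. The sketch below evaluates the saturating growth curve for hypothetical parameter values (C = 1000, k = 0.3, m = 10); these numbers are assumptions chosen for illustration and are not estimates from the study.

```python
import math


def saturating_growth(t, capacity, rate, offset):
    """g(t) = C / (1 + exp(-k (t - m)))."""
    return capacity / (1.0 + math.exp(-rate * (t - offset)))


# Hypothetical parameters: carrying capacity C, growth rate k, offset m.
C, k, m = 1000.0, 0.3, 10.0
curve = [saturating_growth(t, C, k, m) for t in range(0, 41)]

# At t = m the curve sits at half the carrying capacity.
print(saturating_growth(m, C, k, m))
# For t well beyond m the curve approaches C without exceeding it.
print(round(curve[-1], 2))
```

The offset m therefore marks the midpoint of the curve, and the rate k controls how quickly the trend moves from near zero to near the capacity C.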

The formula for the piecewise logistic growth model is as follows³⁶:

$$ g(t) = \frac{{\text{C(t)}}}{{1 + \exp ( - (k + \alpha (t)^{T} \delta )(t - (m + \alpha (t)^{T} \gamma )))}} $$

(7)

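where C(t) is the time-varying carrying capacity, α(t) is the indicator vector of the change points that have occurred by time t, $\delta$ is a vector of rate adjustments, and $\gamma$ is the vector of offset adjustments that keeps the trend continuous at the change points.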

The seasonality s(t) depends on the Fourier series to provide a viable model for periodic effects. This formula is expressed as follows³⁴.

$$ s(t) = \sum\limits_{n = 1}^{N} {\left( {a_{n} \cos \left( {\frac{2\pi nt}{P}} \right) + b_{n} \sin \left( {\frac{2\pi nt}{P}} \right)} \right)} $$

(8)

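where a_n and b_n are the Fourier coefficients, N is the number of terms in the series, and P is the period of the seasonal pattern (for example, P = 7 for weekly seasonality when t is measured in days).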
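A short numerical sketch may help show how Eq. (8) produces a repeating weekly pattern. The coefficient values below are hypothetical; in the Prophet model they are estimated from the data, and the function name is an assumption introduced for this example.

```python
import math


def fourier_seasonality(t, period, a, b):
    """s(t) = sum over n of a_n cos(2 pi n t / P) + b_n sin(2 pi n t / P)."""
    return sum(
        a_n * math.cos(2 * math.pi * n * t / period)
        + b_n * math.sin(2 * math.pi * n * t / period)
        for n, (a_n, b_n) in enumerate(zip(a, b), start=1)
    )


# Hypothetical coefficients for N = 3 terms and a weekly period P = 7 days.
a = [0.5, -0.2, 0.1]
b = [1.0, 0.3, 0.0]
week_one = [fourier_seasonality(t, 7, a, b) for t in range(7)]
week_two = [fourier_seasonality(t, 7, a, b) for t in range(7, 14)]
print([round(v, 3) for v in week_one])
```

Every term has a period that divides P, so the curve repeats exactly every 7 days. Increasing N lets the seasonal component follow sharper within-week changes at the cost of more parameters.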

Holidays and events h(t) can have a large influence on the time series and, because they do not follow a periodic pattern, are modelled separately from seasonality³⁷.

$$ {\text{Z}}(t) = \left[ {1(t \in D_{1} ),...,1(t \in D_{L} )} \right] $$

(9)

$$ {\text{h}}(t) = Z(t)k $$

(10)

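where D_i is the set of dates for holiday i, the indicator 1(t ∈ D_i) equals 1 when time t falls during holiday i and 0 otherwise, k_i is the parameter for holiday i, and k has the prior k ~ Normal(0, ν²).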
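Equations (9) and (10) amount to a lookup of which holidays are active on a given day, followed by a dot product with the holiday parameters. The sketch below illustrates this with two holidays. The date sets and effect sizes are hypothetical values chosen for the example, whereas in the Prophet model the effects are estimated under the normal prior.

```python
from datetime import date

# Hypothetical holiday date sets D_i and effects k_i (illustrative values only).
HOLIDAYS = {
    "christmas": {date(2021, 12, 25)},
    "new_year": {date(2021, 12, 31), date(2022, 1, 1)},
}
EFFECTS = {"christmas": -0.30, "new_year": -0.20}


def indicator_row(day):
    """Z(t): one 0/1 entry per holiday, 1 if day falls in that holiday's dates."""
    return [1 if day in HOLIDAYS[name] else 0 for name in HOLIDAYS]


def holiday_effect(day):
    """h(t) = Z(t) . k, the sum of the effects of the holidays active on day."""
    return sum(z * EFFECTS[name] for z, name in zip(indicator_row(day), HOLIDAYS))


print(indicator_row(date(2021, 12, 25)), holiday_effect(date(2021, 12, 25)))
print(indicator_row(date(2022, 1, 10)), holiday_effect(date(2022, 1, 10)))
```

On ordinary days every indicator is zero, so h(t) contributes nothing and the forecast is driven by the trend and seasonal components alone.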

Evaluation of the prediction performance

In this study, the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE) were used to evaluate the prediction performances of the ARIMA, MLR, and Prophet models. The smaller the values of MAE, MAPE, and RMSE, the better is the prediction performance of the model. These evaluation indices are expressed as³⁸:

$$ {\text{MAE}} = \frac{{\sum\limits_{t = 1}^{n} {\left| {X_{t} - {\hat{\text{X}}}_{t} } \right|} }}{n} $$

(11)

$$ {\text{MAPE}} = \frac{{\sum\limits_{t = 1}^{n} {\left| {\frac{{X_{t} - {\hat{\text{X}}}_{t} }}{{X_{t} }}} \right| \times 100{\text{\% }}} }}{n} $$

(12)

$$ {\text{RMSE}} = \sqrt {\frac{{\sum\limits_{t = 1}^{n} {(X_{t} - {\hat{\text{X}}}_{t} )^{2} } }}{n}} $$

(13)

where ${\hat{\text{X}}}_{t}$ is the predicted value, $X_{t}$ is the observed value, and n is the sequence sample size.
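The three indices in Eqs. (11) to (13) can be computed in a few lines. The sketch below is a direct transcription of the formulas; the observed and predicted values are hypothetical numbers chosen so that the results are easy to verify by hand.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(x - p) for x, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    ratios = [abs((x - p) / x) for x, p in zip(observed, predicted)]
    return sum(ratios) / len(observed) * 100.0


def rmse(observed, predicted):
    """Root mean square error."""
    squared = [(x - p) ** 2 for x, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(observed))


# Hypothetical observed and predicted values.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(round(mae(observed, predicted), 3))
print(round(mape(observed, predicted), 3))
print(round(rmse(observed, predicted), 3))
```

MAE and RMSE are expressed in the units of the series (cases per day), whereas MAPE is scale-free, which makes it convenient when comparing models across periods with very different case counts.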

Statistical software

SPSS (version 24.0; IBM Corp., Armonk, NY, USA, URL: https://www.ibm.com/support/pages/node/724325?mhsrc=ibmsearch_a&mhq=statistics%2024) and EViews (version 10.0; IHS Global Inc., Irvine, CA, USA, URL: https://eviews.com/download/ev10download.shtml) were used to create the ARIMA model. SPSS version 24.0 was also used to create the MLR model. R software (version 4.1.1, URL: https://stat.ethz.ch/pipermail/r-announce/2021/000672.html) was used to construct the Prophet model with the “prophet” package. The level of significance was set at p < 0.05.

Ethical approval

Data were obtained from publicly accessible sources. Formal ethical approval was not required for this study.

Results

General analysis

A total of 167,658,527 confirmed cases of COVID-19 were reported worldwide between November 1, 2021, and February 27, 2022. Descriptive statistics of the daily confirmed cases of COVID-19 are shown in Table 1. The histogram of the daily confirmed cases of COVID-19 is shown in Fig. 1. As shown in Fig. 2, the time series of daily confirmed COVID-19 cases showed a rising trend with periodicity. The growth rate of new confirmed COVID-19 cases was 1.92% per day during this period. In addition, confirmed cases followed a 7-day cycle, falling to a trough on the first day of the cycle and rising to a peak two days later (Fig. 2).

Table 1 Descriptive Statistics of the daily confirmed cases of COVID-19.
