TW-SIR: time-window based SIR for COVID-19 forecasts

Since the outbreak of COVID-19, many studies have proposed models for predicting the trend of the epidemic. Among them, prediction models based on mathematical epidemiology (SIR) are the most widely used, but most of these models are adapted to special situations and rest on various assumptions. In this study, a general adapted time-window based SIR prediction model is proposed. It introduces a time window mechanism for dynamic data analysis and uses a machine learning method to predict the basic reproduction number and the exponential growth rate of the epidemic. We analyzed COVID-19 data from February to July 2020 in seven countries (China, South Korea, Italy, Spain, Brazil, Germany and France). The numerical results show that the framework can effectively measure the real-time changes of the parameters during the epidemic, and that the error rate of predicting the number of COVID-19 infections in a single day is within 5%.


Scientific Reports | (2020) 10:22454 | https://doi.org/10.1038/s41598-020-80007-8 | www.nature.com/scientificreports/

The results of our numerical analysis are encouraging. They show that the model can effectively measure the real-time changes of parameters during the spread of epidemics, including the basic reproduction number R_0(t) and the exponential growth rate Ex(t). Our experiments demonstrate that TW-SIR performs better than the formula derivation method in parameter measurement, and the error rate of predicting the number of COVID-19 infections in a single day is within 5%. At the same time, the model can adapt to a second wave of infection, which the traditional SIR model cannot do. This study is of great significance for understanding the spread of COVID-19 and guiding the design of control strategies and measures.
The rest of this paper is organized as follows. In the second section, we propose the TW-SIR model. In the third section, we conduct numerical experiments and analyze the results to illustrate the effectiveness of our model. Then, in "Discussion", we discuss the results and make some suggestions. Finally, the last section summarizes the paper.

Methods
SIR epidemic model. The susceptible-infected-recovered (SIR) model 27 is one of the simplest and most commonly used epidemic models. The model consists of three compartments: S, the number of susceptible individuals; I, the number of infectious individuals; and R, the number of removed (and immune) or deceased individuals. The SIR epidemic model can be expressed by the following set of ordinary differential equations (ODE):

$$\frac{dS(t)}{dt} = -\frac{\beta S(t) I(t)}{N} \tag{1}$$

$$\frac{dI(t)}{dt} = \frac{\beta S(t) I(t)}{N} - \gamma I(t) \tag{2}$$

$$\frac{dR(t)}{dt} = \gamma I(t) \tag{3}$$

$$S(t) + I(t) + R(t) = N \tag{4}$$

Here S(t), I(t) and R(t) are the values of S, I and R at time t, and their sum satisfies Eq. (4); N is the total population; β is the infection rate, which means that each infectious individual infects on average β people per day in a fully susceptible population. The recovery rate γ indicates that an infected person recovers or dies with probability γ per day.
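As a minimal sketch, the right-hand side of the SIR system in its standard form (dS/dt = −βSI/N, dI/dt = βSI/N − γI, dR/dt = γI) can be written as a single function. The function name `sir_rhs` and the numerical values are illustrative assumptions, not taken from the paper.

```python
def sir_rhs(s, i, beta, gamma, n):
    """Return (dS/dt, dI/dt, dR/dt) for the standard SIR system."""
    new_infections = beta * s * i / n  # flow from S to I per day
    removals = gamma * i               # flow from I to R per day
    return -new_infections, new_infections - removals, removals


# Illustrative values only (hypothetical population of 1000 with 10 infected).
ds, di, dr = sir_rhs(s=990.0, i=10.0, beta=0.3, gamma=0.1, n=1000.0)
print(ds, di, dr)
```

Because the three derivatives sum to zero, S(t) + I(t) + R(t) stays equal to N, which is the conservation property the model relies on.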
Although the SIR model is simple, its analysis and use in many studies show that it can capture the trend and overall characteristics of an epidemic. In the traditional SIR model, β and γ are constants that reflect the characteristics of the epidemic. With constant parameters, however, it is often impossible to measure and predict the development of a real-world epidemic. Therefore, many studies have treated them as functions of time and used equations to derive them. During the development of an epidemic, the parameters of the SIR model change in real time and differ between countries and regions. To reflect these changes, in this article we propose the TW-SIR prediction model, which can capture, track and predict the dynamic changes of the epidemic parameters in real time. We introduce this model in detail in the next section.
Time-window SIR model. To represent the changes of the parameters in the SIR model, we propose a time window-based SIR model (TW-SIR), which splits the historical data into time window segments. The purpose of this method is to capture the real-time changes in R_0 and the exponential growth rate Ex. The TW-SIR model assesses the daily changes in the epidemiological parameters of the historical data through a time window, and thereby addresses the problem that the formula derivation method cannot measure these parameters in real time. Figure 1 shows the main workflow of the model.
As shown in Fig. 1, the TW-SIR model is mainly composed of three parts: model solution, parameter evaluation and parameter prediction. The detailed process is as follows.
Step 1: Solution of the SIR model. The historical data input to the TW-SIR model consist of the daily numbers of susceptible, infected and recovered individuals, and the data are divided according to the time window size. For the data in a given time window, the Runge-Kutta method is used to solve the SIR model numerically.
Step 2: Evaluation of the model parameters. Based on the historical data in the time window, the least squares method is used to set the initial values of the model parameters, and a traversal search over the model parameters then yields the changes of the basic reproduction number R_0 and the exponential growth rate Ex in the historical data.
Step 3: Prediction of the model parameters and the epidemic. Based on the parameter values obtained from the parameter evaluation, a machine learning method is used to track and predict the future parameter values through the combination of the basic reproduction number R_0 and the exponential growth rate Ex. Finally, the prediction results of the epidemic are returned.
The aim of TW-SIR is to evaluate the changes in the parameters of the epidemic in order to predict its development trend. In the rest of this section, we describe each part in detail.

Model solution. Compared with analytical solutions, numerical solutions are more commonly used in such research problems, and they are more effective 28. In this paper, a numerical method, namely the Runge-Kutta method, is used to solve the SIR model. The Runge-Kutta method is a high-precision single-step algorithm, and its classic form is the fourth-order Runge-Kutta method (RK4). RK4 evaluates the slope at four points within the interval between t and t + h (at the start, twice at the midpoint, and at the end) and advances the solution using the weighted average of these slopes. For the three states of the SIR model, we use RK4 to turn the differential Eqs. (1)-(3) into discrete difference equations with step size h.

Parameter evaluation. The parameter evaluation part characterizes how the infection rate β and the recovery rate γ change over time in the historical data, so as to facilitate the subsequent parameter prediction. First, the historical data are divided according to the size of the time window; then initial values of the model parameters are set within the time window; then a traversal search over the model parameters is performed; and finally the best model parameters for each day are obtained through evaluation. Time-dependent functions β(t) and γ(t) are used instead of β and γ in the SIR model, so that the parameters are functions of time t rather than constants. Because of government action on infection prevention and control and the population's awareness of COVID-19, β(t) and γ(t) change in real time.
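The RK4 solution of the SIR model described under "Model solution" can be sketched as follows; the same kind of solver is what a window-based parameter evaluation would call repeatedly. This is a minimal illustration of the classical fourth-order Runge-Kutta scheme, not the authors' implementation. The function names, the default step size h = 1 day and the example values are assumptions.

```python
def sir_rhs(state, beta, gamma, n):
    """Standard SIR right-hand side for state = (S, I, R)."""
    s, i, _ = state
    new_infections = beta * s * i / n
    return (-new_infections, new_infections - gamma * i, gamma * i)


def rk4_step(state, beta, gamma, n, h=1.0):
    """Advance (S, I, R) by one step of size h with classical RK4."""
    def shifted(base, slope, factor):
        return tuple(b + factor * k for b, k in zip(base, slope))

    k1 = sir_rhs(state, beta, gamma, n)                      # slope at the start
    k2 = sir_rhs(shifted(state, k1, h / 2), beta, gamma, n)  # first midpoint slope
    k3 = sir_rhs(shifted(state, k2, h / 2), beta, gamma, n)  # second midpoint slope
    k4 = sir_rhs(shifted(state, k3, h), beta, gamma, n)      # slope at the end
    return tuple(
        x + h / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state, beta, gamma, n, days, h=1.0):
    """Return the trajectory [(S, I, R), ...] over the given number of steps."""
    trajectory = [state]
    for _ in range(days):
        state = rk4_step(state, beta, gamma, n, h)
        trajectory.append(state)
    return trajectory


# Illustrative run with hypothetical values (not the paper's data).
path = simulate((990.0, 10.0, 0.0), beta=0.3, gamma=0.1, n=1000.0, days=30)
print(path[-1])
```

With constant β and γ this reproduces the traditional SIR curve; passing different β and γ values for different days is one way to use time-dependent parameters with the same step function.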
To measure this change, the time series data set is divided into time windows of size W, and the optimal parameter solution within each time window is used as the evaluation value. For historical data at time t, with time window {w_t, 0 ≤ t ≤ T − 1}, β_t^w and γ_t^w denote a parameter solution of the SIR model at time t in the historical data with a time window of size w. To obtain the optimal solution opt β_t^w under the time window w, a search is carried out in two steps: first, determine the initial values of the model parameters; second, perform a traversal search on the model parameters.

The first step is the determination of the initial values of the model parameters. In the early stages of the epidemic, the infected and cured populations are a negligible proportion of the total population, so the susceptible number S(t) and the total population N can be regarded as approximately equal. The differential equation for I(t) then simplifies to

$$\frac{dI(t)}{dt} \approx (\beta - \gamma) I(t) \tag{17}$$

whose solution is

$$I(t) = I(0)\, e^{(\beta - \gamma) t} \tag{18}$$

that is, the number of infected people is an exponential function of time. The least squares method is then used to fit the actual data of the epidemic retrospectively and obtain the initial values β_0 and γ_0 of the parameters. These initial values characterize the early stage of the epidemic, but a simple exponential growth model cannot reflect the full picture of the epidemic, and a more accurate estimate is needed. Therefore, starting from the initial values, the total number of confirmed COVID-19 cases and the numerical solution of the model are used to traverse the model parameters.
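The least squares fit of the early exponential phase can be illustrated with a short sketch. It assumes the growth model I(t) = I(0)·exp((β − γ)t) and fits it in log-linear form, which is one common choice; the paper may fit the curve differently. Note that this fit only determines the difference β − γ. How that difference is split into β_0 and γ_0 is not specified in the text above, so the sketch stops at the growth rate. The function name `fit_growth_rate` and the synthetic series are assumptions.

```python
import math


def fit_growth_rate(infected):
    """Least-squares fit of log I(t) = log I(0) + r * t.

    Returns (I0, r), where r estimates beta - gamma.
    Requires at least two strictly positive observations.
    """
    ts = list(range(len(infected)))
    ys = [math.log(v) for v in infected]
    t_mean = sum(ts) / len(ts)
    y_mean = sum(ys) / len(ys)
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    r = sxy / sxx                                # slope of the fitted line
    return math.exp(y_mean - r * t_mean), r      # intercept back-transformed


# Synthetic early-phase data with a known growth rate of 0.2 per day.
series = [20.0 * math.exp(0.2 * t) for t in range(10)]
i0, rate = fit_growth_rate(series)
print(i0, rate)
```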
Given the data within a specified time window (the cumulative number of confirmed COVID-19 cases, the cumulative number of cured cases R(t), and the cumulative number of deaths D(t), each reported per day), Eq. (19) is used to calculate the actual daily number of infections I(t). After obtaining the actual daily number of infected people, we use the RK4 method to find the numerical solution of the model, which gives the predicted number of infected people. To evaluate the parameters β and γ, the MSE (mean squared error) of the predicted result is calculated.

To find the optimal time window size, sizes from 3 to 30 are tested, and the accumulated forecast error is used to evaluate the accuracy and effectiveness of the forecast under each time window. Error_w denotes the accumulated prediction error under the time window w.

In the search for model parameters, a grid search takes too much time and can easily fall into a local optimum. To overcome this problem, an optimized search method is used in this article. First, we assume that β is greater than γ in the early stage of the epidemic, because this is necessary for the infection to keep spreading 29, namely R_0 is greater than 1, and we estimate the initial parameter values β_0 and γ_0 using Eqs. (17) and (18). Based on β_0 and γ_0, we set the search step size and the search interval. Then RK4 is used to solve the model using Eq. (6). Finally, the MSE is calculated for each candidate β_t^w and γ_t^w, and the pair that minimizes the MSE is taken as the optimal parameters. The detailed steps of our time-window based parameter evaluation are shown in Algorithm 1.
After obtaining {β(t), γ(t), w − 1 ≤ t ≤ T − 1}, machine learning methods can be applied to predict how the infection coefficient and the cure coefficient change over time and thus predict the future development trend of the epidemic.
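The window-based evaluation described above can be sketched as follows. This is a hedged illustration under several assumptions, not a reproduction of Algorithm 1: the observed I(t) is taken to be active cases (confirmed minus cured minus deaths), the removed series is cured plus deaths, the search is a plain scan of a small grid centred on the initial values, and each window is warm-started from the previous optimum. All function names, the step size and the search radius are hypothetical.

```python
def active_cases(confirmed, cured, deaths):
    """Assumed form of the daily infected count: confirmed - cured - deaths."""
    return [c - r - d for c, r, d in zip(confirmed, cured, deaths)]


def rk4_sir(s, i, r, beta, gamma, n, h=1.0):
    """One classical RK4 step of the standard SIR equations."""
    def f(s, i, r):
        inf = beta * s * i / n
        return -inf, inf - gamma * i, gamma * i
    k1 = f(s, i, r)
    k2 = f(s + h / 2 * k1[0], i + h / 2 * k1[1], r + h / 2 * k1[2])
    k3 = f(s + h / 2 * k2[0], i + h / 2 * k2[1], r + h / 2 * k2[2])
    k4 = f(s + h * k3[0], i + h * k3[1], r + h * k3[2])
    return tuple(x + h / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip((s, i, r), k1, k2, k3, k4))


def window_mse(infected, removed, n, beta, gamma):
    """MSE between observed and simulated I(t) inside one window."""
    i, r = infected[0], removed[0]
    s = n - i - r
    err = 0.0
    for observed in infected[1:]:
        s, i, r = rk4_sir(s, i, r, beta, gamma, n)
        err += (i - observed) ** 2
    return err / (len(infected) - 1)


def search_window(infected, removed, n, beta0, gamma0, step=0.01, radius=10):
    """Scan a grid centred on (beta0, gamma0); return (mse, beta, gamma)."""
    best = None
    for a in range(-radius, radius + 1):
        for b in range(-radius, radius + 1):
            beta, gamma = beta0 + a * step, gamma0 + b * step
            if beta < 0 or gamma < 0:
                continue  # rates must stay non-negative
            mse = window_mse(infected, removed, n, beta, gamma)
            if best is None or mse < best[0]:
                best = (mse, beta, gamma)
    return best


def evaluate_parameters(infected, removed, n, w, beta0, gamma0):
    """Slide a window of size w over the series; one (t, beta, gamma) per day."""
    estimates = []
    for t in range(w - 1, len(infected)):
        lo = t - w + 1
        _, beta, gamma = search_window(
            infected[lo:t + 1], removed[lo:t + 1], n, beta0, gamma0)
        estimates.append((t, beta, gamma))
        beta0, gamma0 = beta, gamma  # warm start for the next window (assumption)
    return estimates
```

The result covers t = w − 1, ..., T − 1, matching the index range of {β(t), γ(t)} given in the text. A local scan around a warm start is cheaper than a global grid, but it can miss the optimum if the parameters jump by more than `radius * step` between consecutive days.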
Parameter prediction. Parameter prediction forecasts the subsequent model parameters from the changes over time of the parameters obtained in the parameter evaluation part. In this section, the polynomial regression algorithm widely used in machine learning is applied to track and predict β(t) and γ(t). Because their values fluctuate, it is difficult to predict β(t) and γ(t) directly and accurately. Therefore, this paper proposes a new prediction method: we predict R_0 and the exponential growth rate Ex(t), whose curves are easier to predict during the development of the epidemic, and calculate β(t) and γ(t) from them. The basic reproduction number R_0 also reflects the development of the epidemic. It can be regarded as a function of time R_0(t), which is obtained using Eq. (22):

$$R_0(t) = \frac{\beta(t)}{\gamma(t)} \tag{22}$$

In order to get β(t) and γ(t), we define an exponential growth rate index Ex(t) according to the exponential growth model of Eq. (18):

$$Ex(t) = \beta(t) - \gamma(t) \tag{23}$$

Let the predicted basic reproduction number be R_0(t) and the predicted exponential growth rate be Ex(t). Through polynomial regression, they can be written in the following form:

$$R_0(t) = \sum_{i=0}^{n} a_i t^i \tag{24}$$

$$Ex(t) = \sum_{j=0}^{m} b_j t^j \tag{25}$$

where n and m are the orders of the R_0(t) and Ex(t) polynomials (n, m ≥ 2), and a_i (i = 0, 1, ..., n) and b_j (j = 0, 1, ..., m) are the coefficients of these two polynomial functions. To determine the coefficients and orders of the polynomial functions, the widely used ordinary least squares method (OLS) is applied to evaluate the prediction results. At the same time, to ensure that the model does not under-fit and that it reflects the real-time changes of the epidemic, the time window method described in the previous section is used: the least squares problems for R_0(t) and Ex(t) are solved over the time window, where W is the size of the time window. The coefficients a_i, i = 0, 1, ..., n, and b_j, j = 0, 1, ..., m, and the orders of the polynomials are obtained by solving this optimization problem. Once these coefficients have been obtained, R_0(t) and Ex(t) at time t = T follow from Eqs. (24, 25), and the predicted infection rate β(t) and the predicted recovery rate γ(t) can then be calculated using Eqs. (28) and (29), namely:

$$\beta(t) = \frac{R_0(t)\, Ex(t)}{R_0(t) - 1} \tag{28}$$

$$\gamma(t) = \frac{Ex(t)}{R_0(t) - 1} \tag{29}$$

With β(t) and γ(t) in hand, the model solution method in "Model solution" can be used to predict the number of infections I(t), t > T, in the subsequent epidemic.
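The prediction step can be sketched in a few lines: fit low-order polynomials to the most recent W values of R_0(t) and Ex(t), extrapolate one day ahead, and invert the two definitions R_0 = β/γ and Ex = β − γ, which gives γ = Ex/(R_0 − 1) and β = R_0·γ. This is a minimal sketch under stated assumptions rather than the authors' code: the least squares fit is done with hand-written normal equations to stay dependency-free, time is re-indexed from 0 inside the window, and all function names are hypothetical. Negative rates are clipped to zero, following the rule given later under "Parameter setup".

```python
def polyfit(ts, ys, order):
    """Least-squares polynomial fit; returns [c0, c1, ...] for c0 + c1*t + ..."""
    m = order + 1
    # Augmented normal equations [A^T A | A^T y] for the Vandermonde matrix A.
    rows = [[sum(t ** (j + k) for t in ts) for k in range(m)]
            + [sum(y * t ** j for t, y in zip(ts, ys))] for j in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[j][m] / rows[j][j] for j in range(m)]


def polyval(coeffs, t):
    """Evaluate c0 + c1*t + c2*t^2 + ... at t."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


def predict_beta_gamma(r0_hist, ex_hist, w, order=2, steps_ahead=1):
    """Extrapolate R0 and Ex from the last w values, then recover (beta, gamma)."""
    ts = list(range(w))                      # local time axis inside the window
    r0_fit = polyfit(ts, r0_hist[-w:], order)
    ex_fit = polyfit(ts, ex_hist[-w:], order)
    t_next = w - 1 + steps_ahead
    r0 = polyval(r0_fit, t_next)
    ex = polyval(ex_fit, t_next)
    gamma = ex / (r0 - 1.0)                  # undefined when r0 == 1
    beta = r0 * gamma
    return max(beta, 0.0), max(gamma, 0.0)   # clip negative rates to zero


# Synthetic histories with known polynomial form (illustrative only).
r0_hist = [3.0 - 0.1 * t + 0.002 * t * t for t in range(10)]
ex_hist = [0.2 - 0.01 * t for t in range(10)]
beta, gamma = predict_beta_gamma(r0_hist, ex_hist, w=7)
print(beta, gamma)
```

The inversion breaks down when the predicted R_0 is close to 1, because the denominator R_0 − 1 approaches zero; a practical implementation would need to guard that case.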

Numerical results
Data sources. In this paper, we gathered epidemiological data from Johns Hopkins University 30. The project data are available on the open source GitHub site, and the project is continuously updated 31. The data cover the various countries from January 23, 2020 to the present and include the daily cumulative number of confirmed cases, cumulative deaths, and cumulative cured cases in each region. Taking China as an example, Table 1 shows the details of the data we used. In this article, we use the data of seven countries (China, South Korea, France, Spain, Italy, Germany and Brazil) as our data set. In addition, to verify that our method is applicable to different epidemics, we also gathered the SARS epidemic data of Beijing, China from April 20, 2003 to June 23, 2003 from the website of the Ministry of Health of China; the format of these data is the same as in Table 1. Table 2 shows the time frames of the COVID-19 data for China, South Korea, France, Spain, Italy, Germany and Brazil and of the 2003 Beijing SARS data.

Parameter setup.
(1) Determination of window value W
Different time window sizes, ranging from 3 to 30, are used in the experiment. Figure 2 shows the cumulative forecast error for China under different time windows, calculated according to Eq. (21). There is a time window that minimizes the cumulative forecast error, namely W = 7.
For every country in the data set, the respective optimal time window size is shown in Table 3.
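Selecting the window size by cumulative forecast error can be sketched generically. Since Eq. (21) is not reproduced here, the sketch assumes the error is the sum of absolute one-step-ahead errors, and it takes the forecaster as a callable so that any model can be plugged in. The function names and the naive window-mean forecaster in the example are hypothetical stand-ins, not the TW-SIR forecaster.

```python
def cumulative_error(series, w, forecast, start):
    """Sum of absolute one-step-ahead errors for window size w from day `start`."""
    total = 0.0
    for t in range(start, len(series)):
        total += abs(forecast(series[t - w:t]) - series[t])
    return total


def select_window(series, forecast, sizes=range(3, 31)):
    """Return (best_w, errors) with every size scored on the same days."""
    start = max(sizes)  # common evaluation range keeps the comparison fair
    if len(series) <= start:
        raise ValueError("series is too short for the largest window size")
    errors = {w: cumulative_error(series, w, forecast, start) for w in sizes}
    return min(errors, key=errors.get), errors


def window_mean(window):
    """Naive stand-in forecaster: predict the mean of the window."""
    return sum(window) / len(window)


# Illustrative run on a linear series, where short windows lag the least.
best_w, errors = select_window([float(t) for t in range(40)], window_mean)
print(best_w, errors[best_w])
```

Scoring all window sizes on the same days avoids favouring large windows merely because they are evaluated on fewer points.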

(2) Parameter evaluation
After determining the appropriate time window size, Algorithm 1 is used to evaluate the model parameters. When using polynomial regression to predict the parameters β(t) and γ(t), we set the initial order of the polynomials to 2, that is, n = m = 2. Because β(t) and γ(t) are non-negative, any value below 0 produced by the regression is set to 0. The stopping condition in the model solving process is I(t) ≤ 0. Finally, we use the model solution method to predict the development trend of the epidemic.

Experiment and result analysis. To illustrate the soundness and effectiveness of the TW-SIR model, we present and analyze three research questions (RQ1, RQ2 and RQ3) in this section.
RQ1 experiment results. In epidemic modelling, a very important question is when the epidemic will end. A commonly used indicator for answering this question is the basic reproduction number R_0, defined as the average number of other people to whom an infected person transmits the disease before recovering. In the TW-SIR prediction model, R_0(t) is a time-dependent function. If R_0(t) > 1, the epidemic will spread quickly and infect a certain percentage of the total population N. Conversely, if R_0(t) < 1, the epidemic will eventually be brought under control and end. Therefore, by observing the changes in R_0(t) and predicting its future values, we can learn the development trend of the epidemic and whether the control measures are effective. In addition, this paper uses the exponential growth rate Ex(t), that is, the difference between β(t) and γ(t), to measure the exponential growth trend of the epidemic, which also reflects how the epidemic is changing. When Ex(t) > 0, infection is proceeding faster than cure. Otherwise, infected people are gradually being cured and the epidemic is coming to an end.

First, we applied the TW-SIR model to the historical COVID-19 data of China, South Korea, Italy, Spain, Brazil, Germany and France from January 27 to July 2, 2020 to measure R_0(t) and Ex(t). We compare the TW-SIR prediction model with the measurement method based on formula derivation proposed in ref. 22. Tables 4 and 5 summarize, respectively, the basic reproduction number R_0 and the exponential growth rate Ex measured using the TW-SIR model and the formula derivation method of ref. 22.
It can be seen from the tables that the parameter values measured with the TW-SIR model are closer to the actual situation, while the formula derivation method produces outliers that are inconsistent with reality, being either too large or too small. Figure 3a shows R_0(t) measured from the data with the method of ref. 22, and Fig. 3b shows R_0(t) measured with the TW-SIR model. In both figures, the dates start from February 21, 2020. R_0(t) in Fig. 3a reaches two hundred and also takes negative values, which is obviously not realistic. In Fig. 3b, the values of R_0(t) are much smaller and more in line with the actual situation. In addition, Fig. 3b shows a turning point at which R_0(t) falls below 1 on April 19, 2020, that is, the epidemic in Italy reaches its peak at this moment. After April 19, 2020, R_0(t) remains below 1, which means that the number of infected people I(t) will decrease, leading to the end of the Italian epidemic. The TW-SIR model can accurately measure the time at which R_0(t) < 1, and the measured value is close to the actual situation. Our results are also similar to those reported in most of the literature 32, which shows the effectiveness of the TW-SIR model for measuring R_0(t).
Similarly, Fig. 4 shows the results of the TW-SIR model and the formula derivation method in measuring the exponential growth rate Ex(t). The exponential growth rate calculated by either method reflects the development and changes of the epidemic; the overall trends are roughly the same, and both methods can measure the peak time of the epidemic. However, the Ex(t) values calculated with the TW-SIR model include the values calculated with the formula derivation method and reflect the change of the exponential growth rate more clearly.
RQ2 experiment results. Figure 5 shows the measured and predicted R_0(t) in Italy obtained with the TW-SIR model. The blue curve is the measured R_0(t) from February 26, 2020 to July 2, 2020. The gray curve is the predicted R_0(t) from June 1, 2020 to July 2, 2020. The red dotted line is the threshold R_0(t) = 1. We can see that R_0 in Italy was almost the same as R_0 in China in the early stages of the epidemic. The figure shows that R_0 has a turning point around April 19, which marks a peak of the epidemic. Compared with China, Italy took a relatively long time to reach the peak, which may be due to different prevention and control strategies. Figure 6 shows the measured and predicted exponential growth rate Ex(t) for Italy. The green curve is the measured exponential growth rate from February 26, 2020 to July 2, 2020. The yellow curve is the predicted exponential growth rate from June 1, 2020 to July 2, 2020. In Fig. 6, the exponential growth rate of the Italian epidemic has approached zero. If this situation persists, the number of infected persons will decrease and the epidemic will fade. However, owing to changes in temperature, in government epidemic control measures and in people's awareness, a second wave of infection began in August 2020, which we discuss later. Figures 5 and 6 show that the TW-SIR model accurately predicted the changes of R_0(t) and Ex(t) from June 1 onwards, which shows that our parameter prediction method is effective.
In order to show the accuracy of our model, we show the prediction results of our model for the next day (single-day forecast) in Fig. 7. The orange curve in the figure represents the actual number of infections I(t) in Italy, and the blue curve represents the predicted number of infections I(t) . The figure shows that the predicted curve is very close to the actual data curve.
We further tested the accuracy of our prediction and calculated the error of the single-day prediction of the number of infected people, as shown in Fig. 8. The error rates of the predicted numbers of infected people are all within 5%, which shows that our model can accurately predict the number of infected people on the next day.
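The single-day error check can be expressed in a few lines. The sketch assumes that the error rate is the relative error |predicted − actual| / actual, which is a common definition but is not spelled out in the text. The function name and the numbers are hypothetical and are not the paper's data.

```python
def daily_error_rates(actual, predicted):
    """Relative error of each single-day prediction (actual must be non-zero)."""
    return [abs(p - a) / a for a, p in zip(actual, predicted)]


# Hypothetical actual and predicted infection counts for three days.
actual = [1000.0, 1100.0, 1250.0]
predicted = [980.0, 1133.0, 1240.0]
rates = daily_error_rates(actual, predicted)
print(all(rate <= 0.05 for rate in rates))
```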
Judging from the results of applying the TW-SIR model to the epidemic data of China and Italy, the model can effectively measure the real-time changes of parameters during the development of the epidemic, including the basic reproduction number and the exponential growth rate, and can track and forecast the development trend of the epidemic.

RQ3 experiment results. From September 2020 into the autumn and winter season, many countries experienced a second wave of COVID-19 infections. We applied the TW-SIR model to the data from August to October 2020 for the seven countries in the data set, and the measurement results are shown in Table 6.
In South Korea and Brazil, the R_0 value was less than 1, indicating a downward trend in the number of infections from September to October. As for the exponential growth rate, only China and South Korea have values below 0, and the average number of infected persons I(t) in these two countries is very small, which means that the epidemic in these two countries has been in a relatively stable state for a long time. Italy, Spain, Germany and Brazil have all seen a second wave of infections, and the number of cases is rising.
Taking Italy as an example, Fig. 9 shows the trend in the number of existing infections after the TW-SIR model was applied. The orange line is the actual number of infections, and the blue line is the predicted change in the number of infections. As can be seen from the figure, the number of existing infections declined slowly from July to early August 2020. The average value of R_0 obtained by using the TW-SIR model during …
The trend of the number of infections in Italy from July to October shows an increase in the number of infections caused by the second wave. The blue line at the back end is the TW-SIR prediction curve (Fig. 10).

Discussion
Since the outbreak began in China, COVID-19 has spread to many countries and regions around the world. As of 11 October 2020, there had been 37,213,592 confirmed cases in 188 countries and regions. Different countries and regions have taken different measures to prevent and control the epidemic, such as locking down cities, closing schools and quarantining people at home. As a result, the epidemic has developed to different degrees. Previous studies usually used constant parameters in epidemic transmission models 1,33-35, which makes it difficult to measure the dynamic, real-time evolution of the epidemic. In contrast to the fixed parameters of the traditional SIR model, we use a time window to measure the model parameters dynamically and propose the TW-SIR model on that basis. The advantage of the TW-SIR model is that its dynamic measurement of the epidemic parameters is more in line with reality. We applied the proposed TW-SIR model to historical data from 27 January to 2 July 2020 for China, South Korea, Italy, Spain, Brazil, Germany and France to measure the basic reproduction number R_0(t) and the exponential growth rate Ex(t). Compared with the formula measurement method proposed in ref. 22, the measurement results of the TW-SIR model are closer to reality.
As for the R_0 values assessed, as shown in Table 6, Brazil has the highest mean R_0 (5.136) and China has the lowest mean R_0 (2.285) among the seven countries. Liu Ying et al. reviewed the R_0 of COVID-19 in 12 studies and found that the estimated average R_0 of COVID-19 was about 3.28, with a median of 2.79 and an IQR of 1.16 32. Averaged over the seven countries, our measured R_0 is 3.656, which is similar to the values found in those studies. It is worth pointing out that our measured R_0 is somewhat larger. This is because we selected only seven countries for parameter measurement, and the spread of COVID-19 differs between countries; further studies are needed to confirm this. The measured exponential growth rate also illustrates some problems: among the seven selected countries, only China's exponential growth rate is negative, while in the other countries the epidemic has not yet fallen to a fairly low level. This also shows that China has done a good job in prevention and control measures.
To show that our method is applicable to different epidemics, we also used epidemiological data for SARS in Beijing, China, from April 20, 2003 to June 23, 2003. Figure 11 shows the curves of R_0 and the exponential growth rate Ex for SARS in Beijing in 2003. The average basic reproduction number of SARS in Beijing was 2.099 and the average exponential growth rate was −0.02046. Compared with the spread of COVID-19 in China, the R_0 of SARS in the early stage of infection is about half of the R_0 of COVID-19, and the exponential growth rate is about a quarter of that of COVID-19. This is consistent with the actual situation 32 and indicates that COVID-19 spread more aggressively than SARS in 2003. At the same time, the average exponential growth rate was negative over the whole epidemic period, which ensured the end of the SARS epidemic.
Our model has several limitations with respect to parameter measurement and trend prediction of epidemic transmission. First, the model does not take asymptomatic infected persons into account, because data on them are difficult to obtain and may be inaccurate. Second, the methods we use in each part of the model may not be optimal, and there may be better methods for solving the model and predicting the parameters.

Conclusion
With outbreaks in more and more countries and regions, COVID-19 has swept the world. In this study, we proposed the TW-SIR prediction model, which is able to reflect the real-time trend of an epidemic for different areas, different policies and different epidemic diseases. Machine learning methods are applied to predict the basic reproduction number R_0 and the exponential growth rate Ex of the epidemic. We also conducted mathematical and numerical analyses for COVID-19. The numerical results show that the model can effectively measure the real-time changes of parameters during the spread of epidemics, including the basic reproduction number R_0(t) and the exponential growth rate Ex(t), and that the error rate of predicting the number of COVID-19 infections in a single day is within 5%. In general, the measurement of these parameters is of great significance for understanding the spread of COVID-19 and guiding the design of control strategies and measures.
In addition, many countries experienced a second wave of COVID-19 infections from September 2020 into the autumn and winter season. From our analysis of outbreak data in Italy from July to October 2020, we found that the TW-SIR model can be adapted to the second peak of COVID-19. In terms of the parameters we measured, China and South Korea have maintained low R_0 and exponential growth rates, while in Italy, Spain, Brazil, Germany and France they are mostly still on the rise. This means that epidemic prevention and control measures need to be more stringent to ensure that the epidemic does not get out of control.
Last but not least, the experimental results indicate that the TW-SIR model can also be applied to other epidemics such as SARS. Although we lack data on asymptomatic infection cases, our research results can provide some advice for subsequent epidemic prevention and control.