Real-Time Forecasting of Hand-Foot-and-Mouth Disease Outbreaks using the Integrating Compartment Model and Assimilation Filtering

Hand-foot-and-mouth disease (HFMD) is a highly contagious viral infection, and real-time predicting of HFMD outbreaks will facilitate the timely implementation of appropriate control measures. By integrating a susceptible-exposed-infectious-recovered (SEIR) model and an ensemble Kalman filter (EnKF) assimilation method, we developed an integrated compartment model and assimilation filtering forecast model for real-time forecasting of HFMD. When applied to HFMD outbreak data collected for 2008–11 in Beijing, China, our model successfully predicted the peak week of an outbreak three weeks before the actual arrival of the peak, with a predicted maximum infection rate of 85% or greater than the observed rate. Moreover, dominant virus types enterovirus 71 (EV-71) and coxsackievirus A16 (CV-A16) may account for the different patterns of HFMD transmission and recovery observed. The results of this study can be used to inform agencies responsible for public health management of tailored strategies for disease control efforts during HFMD outbreak seasons.

(2019) 9:2661 | https://doi.org/10.1038/s41598-019-38930-y www.nature.com/scientificreports www.nature.com/scientificreports/ traditional compartment model relies on a set of static conditions and model parameters that are difficult to estimate for forecasting the HFMD outbreak due to the interactions of many uncertain factors, such as weather conditions and measures to control for social interaction. Moreover, conditions contributing to an HFMD outbreak are usually dynamic, which rarely meets the presumption of a traditional compartment model. Adding to the complexity, although one virus may predominate, multiple viruses can co-contribute to HFMD outbreaks. Furthermore, previous HFMD outbreaks in China have shown two peaks during a single year [25][26][27] , and predicting the timing and magnitude of the second peak tends to be more complex.
Recent literature has described a new approach for forecasting infectious diseases by integrating dynamic models and the assimilation technology, which has been used for performing dynamic and real-time adjustments of a forecasting model 28,29 . In this case, preliminary predictions from a compartment model are dynamically adjusted by incorporating real-time observations through the assimilation filter to improve model prediction. This Integrated Compartment Model Assimilation Filtering approach has been used in recent years to forecast outbreaks of several infectious diseases, such as influenza [30][31][32] , Ebola 33 and West Nile virus 34 . However, the prediction of HFMD outbreaks has not been yet benefited from this innovative approach.
This study used the Integrated Compartment Model and Assimilation Filtering model for real-time forecasting of HFMD outbreaks in Beijing using reported HFMD cases during 2008-2011. In particular, we (1) simulated disease occurrences over time, (2) estimated model parameters by incorporating real-time observation data, and (3) evaluated the model by comparing the weekly forecasts of HFMD to the reported data. Moreover, we examined potential associations between the dominant virus type responsible for the outbreak and the transmission pattern of HFMD, which are used as additional information for forecasting HFMD outbreaks. This approach of real-time HFMD forecasting will potentially help researchers design effective interventions for the control and prevention of HFMD.

Results
Descriptive statistics. The weekly rate of HFMD infections in Beijing between January 1, 2008 and December 31, 2011 is shown in Fig. 1. HFMD cases occurred throughout the year, but the rate was very low in the first 10 weeks, corresponding to the period from January to March. The high-incidence period occurred from May to July. prior Forecast and posterior Analysis. Prior forecasting employs the SEIR model and the current week's variable measurements to forecast the infection rate of the next week, while the posterior analysis produces assimilation results by adjusting prior forecasts for observed data. The graphs in Fig. 2 show the observation data, prior forecast, and posterior analysis for the study period using our model. The posterior analysis successfully captured the trend of infection rates in all four years, including the second peak during the 2008 HFMD outbreak. Moreover, the results of the posterior analysis appeared to be less impacted by the outliers in the observation data. We conducted regression analyses between the observed data and the data predicted from both the prior forecast and the posterior analysis to further evaluate the difference between the results obtained from the posterior analysis and the prior forecast (see Supplementary Fig. 1); the results of this analysis also showed a better fit between the observed data and posterior analyses. Accuracy Assessment. We performed real-time forecasting of HFMD for 2008-2011 ( Supplementary   Fig. 2) and assessed the accuracy of the forecast of the peak magnitude by comparing the forecast peak magnitude and the observed peak magnitude (Fig. 3). HFMD data were available for this study 12 weeks before the peak week in 2010 and 11 weeks before the peaks in 2009 and 2011, but only 5 weeks before the peak in 2008. Correspondingly, the peak week magnitude forecast was the most accurate for 2010 and least accurate for 2008.   www.nature.com/scientificreports www.nature.com/scientificreports/ Without sufficient observation data, the predicted peak magnitude deviated far from its observed counterpart for the early weeks of each of the four years. As more observed data were fed into the model, the ensemble of variables and parameters were updated, and the peak magnitude accuracy continued to improve and approached 100% (i.e., zero offset). However, for 2008, as only five weeks of observation data were available for training the model, the forecast of the peak magnitude prediction was not comparable to the other three years. Notably, since the initial states for the ensembles and variables were set to establish the model, the forecast accuracy for the first week is not interpretable. Nevertheless, the time required for the peak magnitude prediction to converge is related to the initial states. Compared with 2009, the forecast accuracy of 2010 was high in the initial week, and the model only required five weeks to achieve a very accurate and stable prediction; conversely, nine weeks were required to achieve a similar level of accuracy for the forecasting of 2009.
When evaluating the prediction accuracy of the peak week time, we excluded the year 2008 due to its shorter pre-peak data period. The accuracy of peak week forecasting was evaluated using the number of weeks of offset between the predicted peak week and the observed peak week (Fig. 4). As more observation data are used for peak week forecasting, the offset becomes smaller, showing a monotonic decreasing trend. For 2009, the offset was narrowed to one week as early as five weeks prior to the actual peak week. For 2011, the offset was narrowed to two weeks as early as six weeks before the actual peak week; additionally, no offset between the predicted peak week and the observed peak week was observed up to three weeks before the peak week. For 2010, the prediction offset was three weeks when forecast six weeks before the actual peak week, and it was within two weeks of offset when forecast two weeks before the actual peak.
The root-mean-square error (RMSE) between forecast data and observed data for the early weeks of forecasting appeared to be large and differed substantially across the four years (Fig. 5). These errors do not actually reflect the forecast quality because the forecast of the initial week was calculated using a randomly set initial state to run the SEIR-ensemble Kalman filter (EnKF) system. As the forecast was extended further into an outbreak season, more observation data were entered into the model, and the RMSE decreased sharply. Notably, the RMSE tended to remain low and relatively stable for all four years after a few weeks, suggesting that the performance  www.nature.com/scientificreports www.nature.com/scientificreports/ of the model improves as more observation data are applied, thereby becoming more stable and reliable. This pattern also indicates that SEIR-EnKF forecasting is not very sensitive to the initial state, suggesting a minimum requirement of our forecasting model for additional or excessive data. Instead, as continuous observation data are entered into the forecasting system, the output converges to a reliable result. For 2009 and 2010, approximately 10 to 12 observation data points were needed before the RMSE decreased to a low and stable level. Importantly, 10-12 weeks is approximately the time frame for which data were available during these two years from the beginning of an outbreak to the peak week.

Discussion
HFMD is a high-risk childhood disease for East Asian countries, but its prediction is difficult due to complex influencing factors. In this study, we developed a real-time forecasting model for HFMD outbreaks by integrating SEIR and EnKF. The integrated SEIR-EnKF forecasting system assimilated real-time observation data into forecasts in a dynamic manner and exhibited good performance for the real-time forecasting of HFMD by predicting the peak time and magnitude with acceptable accuracy.
Based on empirical testing, the SEIR-EnKF system is reliable for real-time forecasting of HFMD. In 2008, the lack of pre-peak observation data for assimilation resulted in a relatively weak forecast. The forecasts of the magnitude of the infection rate and the time of the first peak were unclear. From 2009 to 2011, the system performed www.nature.com/scientificreports www.nature.com/scientificreports/ well in forecasting both the peak week timing and peak magnitude. The forecast accuracy was also assessed using RMSE, which showed a decreasing trend over time. Therefore, the integrated SEIR-EnKF system delivered real-time forecasts with short time intervals (e.g., weekly), and the forecast continuously improved as more observation data became available for assimilation. Our SEIR-EnKF forecasting system is a promising tool that may serve as an essential component in a warning system to assist public health agencies and the public in disease management and control measures in response to an HFMD epidemic.
Transmission and recovery rates are essential parameters for understanding an infectious disease. To a large extent, their values determine the status of an on-going epidemic. The size of the population infected with HFMD is largely decided by the transmission rate 21 . Compared with incubation and recovery rates, transmission dynamics are affected by many factors, such as geolocation, climate, the activities of susceptible populations, and government control measures. Therefore, it has the greatest uncertainty in the model, and the basic reproductive number 35 , which is defined as the transmission rate divided by the recovery rate, reflects whether an outbreak is spreading or controlled. However, a few previous studies have discussed transmission dynamics in detail. In 2008 and 2010, the transmission rate initially decreased and then increased throughout the year, whereas the transmission rate exhibited relatively small change during the seasons in 2009 and 2011. The recovery rates for 2008 and 2010 increased, whereas the recovery rate exhibited a continuous decrease throughout 2009 and 2011. The four-year estimated parameters provide a model-level explanation for the dynamic changes in the infection rate. It is showed that the combination of changes in the transmission rate and recovery rate contribute to the trend in the infection rate. According to the available four-year data analysed in our study, the different patterns in the years investigated might be associated with the virus type and corresponding features. Thus, the discovered patterns may be helpful for preparing a more accurate forecast. However, the changes in parameters at each step are difficult to explain due to the contributions of multiple, complex factors mentioned by other studies, such as control strategies 22 and cultural practices 33. In fact, these factors are uncertain and not easy to study quantitatively. Continuous monitoring of the parameters and the use of observed data will help establish a more accurate forecast in the future.
Our study has several limitations. First, the speed of forecast error convergence is not sufficient, as it decreased to a low level after the peak time in our study. This parameter might be improved by employing other assimilation technologies, such as the Particle filter. Second, we did not accurately predict the second peak in this study. Although the second peak in 2008 was forecast, its accuracy was not satisfying. In addition, we only used limited observation data to assimilate all parameters in the model, which may increase model uncertainty. Although we adjusted for this uncertainty by discussing the initial parameters and prior information of model parameters, more types of observed data would be better. These data directly related to HFMD, such as search engine data and weather data, would be helpful for the forecast and improve the speed of forecast error convergence.
In our study, we simplified the discussion of the possible association between virus types and transmission patterns. Further research is warranted to obtain a better understanding of this association. Long-term data for virus types and infection rates will be obtained and analysed using the SEIR-EnKF to discover and quantitatively describe the deterministic association. Once the connections are clearly identified, virus types will be incorporated into the SEIR-EnKF system for more accurate predictions. In addition, other factors, such as population structure, public health literacy, and weather conditions, may also impact the outbreak patterns. Future investigations should seek to determine the quantitative descriptive relationships with HFMD and incorporate these factors into the SEIR-EnKF system as well for more accurate forecasting of HFMD outbreaks.

Methods
Ethics statement. This study was based on HFMD data in Beijing. All records were anonymized and no individual information can be identified. The research study protocol was approved by the Institutional Review Board at the Beijing CDC. All methods were performed in accordance with the principles of the Declaration of Helsinki. seIR Model. The SEIR model is a dynamic model that considers the incubation period of a disease. It considers four groups of people: the susceptible (S), the exposed (E), the infected (I) and the recovered (R). The model is expressed using the following equations: where t denotes time, S(t) is the number of susceptible people in the overall population at time t, E(t) is the number of exposed people, I(t) is the number of infected people, and R(t) is the number of recovered people who are no longer included in the transmission cycle. β(t) is the transmission rate from infectious people at time t, σ(t) is www.nature.com/scientificreports www.nature.com/scientificreports/ the incubation rate at time t, which is the inverse of the incubation period, γ(t) is the case recovery rate at time t, and N is the population size. The model does not consider deceased people because the death rate in the data used in this study was extremely low compared with the infection rate. enKF Method. EnKF uses an ensemble to calculate error covariance based on stochastic theories. The median of ensemble states is assumed to be an optimal estimate of the true state. The forecast generates the prediction at time k + 1 based on data analysed at time k using a mechanical model, SEIR in this study. This step is often performed using a dynamic model. The following formula is used: where X k a is the status at time k and X k f 1 + is the forecast at time k + 1. M k is the model system (SEIR model in this study) at time k.
A set of forecasts are generated, and the error terms of the forecasts are analysed as follows:  (7) and (8) are used to calculate the Kalman gain matrix, as denoted by K k+1 (see Eq. 9 below), which is further used for balancing observation data and modelling results.
where R is the error covariance of the observation. Finally, K is used to forecast the variables at the next time point: where z k+1 is the observation data at time k + 1. We used the results of this analysis as the updated variable to forecast the model parameters and variables at the next time point.
Initiation of the seIR-enKF Framework. The beginning week of an outbreak season is defined as the first week when the HFMD weekly infection rate reaches 2e-06, and it increases at a rate of 50% or more per week. The incubation period is generally 3-7 days 36 . The incubation period of HFMD among schoolchildren increases as they age 37 , and the mean incubation period for kindergarten students who are approximately 2-5 years of age is 4.4 days. In our data, the majority of the infected population was children younger than 5 years, corresponding to an incubation rate of (0.2, 0.3). Approximately 7 days are generally required to recover from HFMD 38 . The recovery period of HFMD caused by EV-71 is 4-6 days 39 and 5-7 days for HFMD caused by a CV-A16 infection 40 . Therefore, we concluded that (0.1, 0.4) is a meaningful range for the recovery rate from HFMD compared with the incubation and recovery rates. In addition, we defined the initial susceptible population S(0) to range from (0.8 N, N), where N is the proportion of the susceptible population, and the initial exposed population E(0) was in the range (5e-4N ± 10%). The range of the initial infected population I(0) was calculated by adding a 10% disturbance based on the observed infected population.
We conducted a series of parameter sensitivity analyses to assist in the selection of the initial parameters for the transmission and recovery rates in the HFMD forecast (Fig. 7). We calculated the model error by running the SEIR-EnKF framework with different combinations of transmission and recovery rates. The transmission and recovery rates ranged between 0.1 and 2.0, and each parameter value was divided into 19 equal-length segments. As shown in Fig. 7(a), the cumulative error of the infection rate was high when the recovery rate was (0.1, 0.2); the error decreased as the recovery rate increased to 0.5, beyond which the error and recovery rate showed a positive relationship. However, model errors are generally not sensitive to changes in the transmission rate. Overall, when the recovery rate was (0.2, 0.6), and the transmission rate was (0.1, 1.5), the lowest error of the infection rate was observed. The error of the peak magnitude forecast was low when the recovery rate ranged from 0.2 to 0.8, and the error was not sensitive to the transmission rate ( Fig. 7(b)). In Fig. 7(c), when the recovery rate was (0.1, 0.3), the cumulative error of the peak time was relatively low; but as the recovery rate increased from 0.3 to 0.8 and the transmission rate was less than 1.5, the error generally remained high. Therefore, we set the initial transmission rate to (1.0, 1.2) and the initial recovery rate to (0.2, 0.3) to keep the three aforementioned types of error low in our forecasting model. www.nature.com/scientificreports www.nature.com/scientificreports/

Data Availability
We obtained data from 72,266 HFMD cases reported in Beijing during 2008-2011 from the Beijing Center for Disease Control and Prevention. For each case, the incidence date and diagnosis date were recorded. The number of individuals in the overall population and children in Beijing were obtained from the Beijing Bureau of Statistics. respectively. In these equations, n is the number of years and m is the number of steps in the forecast. e ij is the RMSE between forecast infection rate of j th step in i th year and the observed data; FPT ij is the forecast peak time of j th step in i th year; TPT i is the true peak time in i th year, FPM ij is the forecast peak magnitude of j th step in i th year, and TPM i is the true peak magnitude in i th year.