Using deep learning to predict the hand-foot-and-mouth disease of enterovirus A71 subtype in Beijing from 2011 to 2018

Hand-foot-and-month disease (HFMD), especially the enterovirus A71 (EV-A71) subtype, is a major health problem in Beijing, China. Previous studies mainly used regressive models to forecast the prevalence of HFMD, ignoring its intrinsic age groups. This study aims to predict HFMD of EV-A71 subtype in three age groups (0–3, 3–6 and > 6 years old) from 2011 to 2018 using residual-convolutional-recurrent neural network (CNNRNN-Res), convolutional-recurrent neural network (CNNRNN) and recurrent neural network (RNN). They were compared with auto-regressio, global auto-regression and vector auto-regression on both short-term and long-term prediction. Results showed that CNNRNN-Res and RNN had higher accuracies on point forecast tasks, as well as robust performances in long-term prediction. Three deep learning models also had better skills in peak intensity forecast, and CNNRNN-Res achieved the best results in the peak month forecast. We also found that three age groups had consistent outbreak trends and similar patterns of prediction errors. These results highlight the superior performance of deep learning models in HFMD prediction and can assist the decision-makers to refine the HFMD control measures according to age groups.


Results
Data Sources and Characteristics. The original data was the hospital visit data of EV-A71 virus infection, from the Beijing hand-foot-mouth disease surveillance system, and the time range was 2011-2018. To reasonably estimate the number of EV-A71 cases in the Beijing population, hospital data and population serum sampling data were combined for statistical inference using stratified inference 23 . Besides, Beijing has gradually inoculated the EV-A71 vaccines in the population since August 2016. As of December 2018, the total number of vaccinations was 298,341, which led to a significant decline in the number of EV-A71 cases in Beijing after 2016. Therefore, the prevalence of EV-A71 under non-vaccination after 2016 was reconstructed using a disease transmission kinetic model 24 . Figure 1 represents the monthly numbers of EV-A71 cases of three age groups in Beijing population from 2011 to 2018, after the statistical inferences and reconstruction. The number of EV-A71 cases showed a single peak or two peaks in a year and the largest outbreak was in 2014. The rapid increase of cases started in April and the peak month was June or July. After August, there were still small outbreaks and the decline was slow. Cases in the age group of > 6 were significantly less than the cases in the age groups of 0-3 and 3-6. As for the peak months, from 2011 to 2013, there were one-month lags in peak months between the age groups of 0-3 and 3-6. In 2015, both May and July were peak months in the age group of 0-3, while the peak month in the age group of 3-6 was June.
This dataset was split into a training set, validation set and test set in chronological order according to the ratio of 3:1:1. Each data subset was four-dimensional, including the number of cases in three age groups and the total number of cases. The training set (from January 2011 to September 2015) was responsible for model optimization.  Figure 1. The monthly number of EV-A71 cases of three age groups in Beijing from 2011 to 2018. Black cross dots represent age group of 0-3; red dots represent age group of 3-6; green cross dots represent age group of > 6; blue circles represent the total number of EV-A71 cases. Accuracy assessment on point forecasts. Tables 1 and 2 show the R-squares of three deep learning models (CNNRNN-Res, CNNRNN and RNN) and three regressive models (AR, VAR and GAR), respectively, with prediction horizons = 1, 2, 4, 6, 8, 10, 12 months. When horizon = 1 or 2, R-squares reflect the accuracies in short-term prediction; when horizon = 4, 6, 8, 10 and 12, R-squares can reflect the performance changes of the six models in long-term prediction.
For the performances of the three deep learning models in short-term or long-term predictions, the accuracies of CNNRNN-Res and RNN were relatively stable as the increase of horizon, except when the horizon was 4 months, while the accuracies of CNNRNN decreased when the horizons were 4, 6 and 8 months. In Table 1, the CNNRNN-Res model had the highest accuracies in almost all three age groups for forecasts of 1 and 2 months ahead, as well as the total number of EV-A71 cases (0.9062 and 0.8467, respectively). As for the long-term prediction, CNNRNN-Res model and RNN model achieved the highest accuracies in different age groups for the forecast of 6, 8 and 10 months ahead. When the horizon was 12 months, the CNNRNN model had the highest accuracies in age groups of 3-6 and > 6 (0.8310 and 0.7523, respectively), while CNNRNN-Res had the highest accuracy in age groups of 0-3 (0.9032). But the gaps between CNNRNN-Res and CNNRNN were small.
Among the three regressive models, GAR model has shown stability and superiority than AR or VAR models in both short-term and long-term prediction. In Table 2, GAR had the highest R-squares for the forecast of 1 and 2 months ahead in age groups of 0-3 (0.8051 and 0.5101) and > 6 (0.8141 and 0.5096). As for the long-term prediction, GAR had the highest R-squares for prediction of 4, 8 and 12 months ahead. When horizon was 6 or Table 1. R-squares of CNNRNN-Res, CNNRNN and RNN predictions on test set in different age groups and horizons. The number in bold indicates the maximum value in a certain horizon and age group. 'Total' means the total number of cases of EV-A71 subtype in Beijing.  www.nature.com/scientificreports/ 10 months, the GAR model didn't differ significantly from the optimal model in terms of R-squares for all age groups.
As for the comparison of accuracy and robustness between three deep learning models and three regressive models, GAR model had better performances than RNN model in the age groups of 0-3 (0.6009) and > 6 (0.6210), as well as in the total number of cases (0.6050) when the horizon was 4 months. But in the other prediction horizons, the deep learning model (numbers in bold in Table 1) all achieved higher R-squares than the regressive model (numbers in bold in Table 2) in the given age groups. Moreover, the performances of three deep learning models didn't have a sharp decline in the long-term prediction, e.g. the R-squares of deep learning models only changed from 0.9062, 0.8845 and 0.8911 (horizon = 1 month) to 0.8538, 0.8661 and 0.7722 (horizon = 12), respectively, for the prediction of total number of cases. However, the performances of three regressive models declined gradually as the increase of horizon, e.g. the R-squares of them declined from 0.8420, 0.8319 and 0.8181 (horizon = 1 month) to 0.4810, 0.3498 and 0.5160 (horizon = 12), respectively, for the prediction the total number of cases. In order to visualize their differences on robustness, predictions of six models on the test set (horizon = 1 and 12 months) can be found as Supplementary Figure S1 online.
The differences and connections of outbreak trends among three age groups were also reflected. At every given horizon in Table 1, each deep learning model had similar R-squares in three age groups. In Table 2, GAR had similar accuracies in three age groups at each horizon, but performances of AR and VAR were not stable enough for different age groups. For example, when the horizon was 6 months, the AR model had the highest accuracies in the age group of 0-3 (0.5693) and the total number of cases (0.5413), but the AR model had the lowest accuracies in the same horizon for age groups of 3-6 (0.3501) and > 6 (0.2717). Such gaps of R-squares in three age groups still appeared in the VAR model when the horizon was 10 months.
Accuracy assessment on peak intensity. The peak intensity of the EV-A71 subtype refers to the highest value of the curve each year. Peak intensity determines the level of early warning and the number of resources to prepare for the peak outbreak. Therefore, the prediction accuracy on peak intensity need to be analyzed separately from the overall point forecasts. Figure 2 shows the normalized mean absolute errors (NMAE) between the predicted peak intensities and the true peak intensities on the test set. They were normalized by maximum errors of 1,880.46, 2,791.61, 614.02 and 4,671.82, corresponding to the three age groups and the total number of cases, separately.
Three deep learning models had better accuracy and robustness on peak intensity forecast than three regressive models. The NMAE of three regressive models was higher than three deep learning models in Fig. 2a, b, c, d. The general NMAE of VAR was the largest, followed by the GAR. The NMAE of AR varied considerably with horizons, without a clear trend (Fig. 2b, d). As for the deep learning models, they performed better on peak www.nature.com/scientificreports/ intensity forecast, and with the increase of the horizon, the NMAE of deep learning models didn't have an upward tendency. On the contrary, the NMAE of the three autoregressive models increased rapidly after horizon = 2. NAME of peak intensity predictions varied in three age groups. VAR, RNN and CNNRNN models had increased NMAE in the age group of > 6. Deep learning models had no significant advantages in the age group of > 6. The models' prediction ability of peak intensity is not completely consistent with their point prediction ability.
Accuracy assessment on peak month. Peak month is the time when the number of cases reaches the maximum in a year. A more accurate prediction on peak month will help the department of public health give early warning at the right time and reasonably arrange the prevention and control schedule. Figure 3 gives the average delay between the predicted peak month and the true peak month on the test set. Most of the average delays are between − 2 and 2, except for two outliers in Fig. 3b. A negative number represents a lagged prediction, while a positive number represents the prediction is ahead of the real peak month. CNNRNN-Res model had the highest accuracies in peak month prediction in all age groups, while the AR model had the lowest accuracies, especially when the horizon was 4 or 8 months (Fig. 3a, b). Many models, except for the CNNRNN-Res model, gave lagged predictions (− 0.5 or − 1) when the horizon was 1 or 2 months. While in the long-term forecast, the predicted peak months were always ahead of the true values (Fig. 3a-d).
The three age groups and the group of total number represented similar pattern on distribution of errors of peak month prediction. Fig. 4 show the predicted results of the CNNRNN-Res model with the horizon = 1 month. CNNRNN-Res model had a robust skill for point forecast and peak month forecast. Its prediction successfully captured the trend in all three age groups and the total number of cases, even though their peak shapes and magnitudes are varied. However, the CNNRNN-Res model couldn't predict the troughs between two peaks, e.g. the trough in June 2015 (Fig. 4a) and the trough in July 2013 (Fig. 4c), because the model didn't learn the existence of twin peaks from historical information. Similarly, there was an abnormal outbreak of EV-A71 subtype in 2014, and the predicted peak values in all age groups (Fig. 4a-c) and the predicted total number of cases (Fig. 4d) were lower than the true values. As for the long-term forecasting, predictions (horizon = 12 months) of GAR and CNNRNN-Res can be found as Supplementary Figure S2 online. www.nature.com/scientificreports/

Discussion and conclusions
HFMD is a childhood disease and has periodic outbreaks in Beijing every year. Regressive models were the classical methods for HFMD prediction, and the prediction of the incidences of different age groups had always been ignored. In this work, we compared the long-term and short-term prediction effects of the deep learning models (CNNRNN-Res, CNNRNN and RNN) and regression models (AR, VAR and GAR) for EV-A71 subtype in Beijing from 2011 to 2018. We divided the affected population into three age groups (0-3, 3-6 and > 6 years old) for simultaneous prediction, and analyzed the differences and connections among different age groups. We evaluated the prediction performances using three metrics: R-squares of point forecast, NMAE of peak intensity forecast and average delays of peak month forecast. Deep learning models had better accuracy on point forecast in both short-term and long-term predictions, especially the CNNRNN-Res model. It's worth noticing that the RNN model and CNNRNN-Res model were robust in long-term prediction. With the increase of prediction horizons, their R-squares only decreased slightly (Table 1), while the R-squares of regressive models decreased sharply ( Table 2). The three deep learning models also maintained better accuracies in peak intensity prediction, but their advantages were not obvious in the age group of > 6 (Fig. 2c). In the prediction of peak month, all models had advanced or lagged errors (Fig. 3). There were mainly lagged errors in short-term prediction and advanced errors in long-term prediction. CNNRNN-Res model still had the highest skill for peak month prediction, indicating that it could respond to the changes in the curve faster. Overall, in our three tasks, the CNNRNN-Res model showed the best level of prediction.
The successes of the three deep learning models are attributed to their model capacities. AR, VAR and GAR models are essentially linear models, which make predictions by the weighted sum of history signals within the window length. The historical information will be forgotten after several iterations. On the contrary, a multi-layer deep learning model can fit complex nonlinear dependencies. More hidden layers or neurons will increase the representation space. Especially, each neuron of the RNN module has a mechanism of updating and resetting, and their cascade structure allows RNN to retain the long-term dependencies of history inputs.
It is necessary to divide the cases of EV-A71 subtype into different age groups for prediction because there are significant differences in the social contact networks and immunities of susceptible children. Children of 0-3 years old are children with a simple social network, while children of 3-6 years old are kindergarten children, and > 6 years old are elementary school students, with an expanded social network. Children over 6 years of age have higher immunity to EV-A71 than children under 6 years of age. It can be found from Fig. 1 that the outbreaks of three age groups are consistent in the time dimension and the scales of the outbreak in age groups of 0-3 and 3-6 are almost the same. The prediction errors of different age groups also share similar patterns (Figs. 2 and 3). These analyses indicate that the scale of the EV-A71 outbreak is more affected by children's immunity and seasonal factors. www.nature.com/scientificreports/ Error indicators for infectious disease prediction should differ from those in other fields. Deep learning was implied earlier in fields such as finance, energy, and traffic prediction, and then introduced into the field of infectious disease prediction. The most used error indicator is the overall points prediction accuracy, but the indicators for infectious disease prediction should serve the practical applications, and narrow the gaps between modeling and model users. For example, the 'FluSight' challenge 25 of the US evaluates the proposed models on future incidence prediction, peak intensity prediction, peak week prediction and onset week prediction, because these error indicators are directly related to the development of control measures by the public health department. Previous HFMD prediction [12][13][14][26][27][28][29] didn't use similar indicators. In our study, we evaluated our models on future point prediction, peak intensity prediction and peak month prediction, and these error indicators may facilitate deep learning models to be more widely used in the practice of epidemic prediction. Our study still has some limitations. First, we can't explain more specifically why deep learning models are dominant, because interpretability of deep learning is a long-standing problem that has not been solved yet. Second, our data set is limited in length and the conclusions reached may not be general. To measure how the characteristics of deep learning models scale with data accuracy and quantity, a lot of experiments and simulations need to be done 30 .
The main conclusions of this paper are threefold. Firstly, deep learning models have higher precision than regressive models in EV-A71 outbreak predictions. Secondly, three deep learning models, especially the CNN-RNN-Res model, are more robust in long-term predictions. Thirdly, the number of cases of three age groups are consistent in time, but there is heterogeneity in the peak intensity in age groups of > 6. We believe that deep learning models and practical error indicators should be more widely used in the prediction and early warning of infectious diseases, and the consistency of the incidence patterns of multiple age groups could be utilized to improve the prediction accuracy.
The continuity of research in the future includes the following. First, investigate the performance of deep learning models combining with seasonal factors and web search indexes in HFMD prediction. Because seasonal changes are the main factors affecting the incidences, and the search indexes are helpful for real-time forecasting. Second, ensemble predictions need more attention. Results in Tables 1 and 2 indicate that different models have varied forecasting advantages. The fusion of deep learning model and regressive model may improve the prediction accuracy and robustness.

Structures of CNNRNN-Res, CNNRNN and RNN models.
Researchers at Carnegie Mellon University proposed and shared the source codes of CNNRNN-Res model 18 . CNNRNN-Res model is mainly composed of three parts (Fig. 5b): CNN module, RNN module and residual links. The other two ablation models, CNN-RNN and RNN, are obtained after removing some functional modules of CNNRNN-Res, i.e., the CNNRNN model doesn't have residual links, and RNN only uses the RNN module to make a prediction.
CNN module captures the correlation among the three age groups using convolution operation. The convolution kernel is a two-dimensional weight matrix. This matrix and the input time-sequence segments within the window length are convolved to obtain the time-series features. The output of the CNN module is used as the input of the RNN module. There are many variations of the RNN module, and the gated recurrent unit (GRU) is used in this research. Figure 5a is a single GRU cell and it computes the following function: www.nature.com/scientificreports/ where h t is the hidden state at time t , x t is the input data at time t , h (t−1) is the hidden state at time t − 1 or the initial hidden state at time 0 , and r t , z t , n t are the reset, update, and new gates, respectively. σ is the sigmoid function, and * is the Hadamard product. The internal parameters of the three deep learning models are fine-tuned by means of back-propagation of prediction errors.
Constructing the AR, VAR and GAR models. AR, VAR and GAR are classical regression methods in time series prediction. AR treats the age groups independently, i.e. assume the numbers of cases in different age groups are not correlated. AR can be formalized as: where p is the order of AR, w is the input window length, h is the prediction horizon, x (i) t is the i-th input variable at time t, and α (i) p is the weight parameter. ε t+h is random noise at time t + h, and c (i) is the intercept term. GAR simplifies the AR model. Its premise assumes that the variation pattern of each age group is consistent, and the same set of α p and c can be used to predict the number of cases in different age groups. VAR fits the cross-signal dependencies and it is more complex and expressive. The VAR model assumes that the number of patients in each age group is correlated with those in the rest groups, which is different from the independent hypothesis of AR and the consistent variation hypothesis of GAR. VAR can be formalized as: where A p is the parameter matrix to capture correlations among age groups.

Hyper-parameters selection and model training.
We implemented the six models using Facebook's open-source platform-Pytorch. The original dataset was divided into a training set, validation set and test set, as described in the results section. Before training, we need to set up some hyper-parameters, as they determine the structure of models and can't be changed through training. We set up an optional set for each hyper-parameter, i.e. the number of hidden neurons (5, 10, 20 or 40), the length of the input window (2,4,8,16 or 32), residual window (4, 8 or 16), prediction horizons (1, 2, 4, 6, 8, 10, 12 months) and the residual ratio (0.01, 0.1, 0.5 or 1). Then, we conducted a grid search over all hyper-parameters and fine-tuned these six models using the training set, separately. After that, we put validation set into the trained models, and got six optimal models that had the highest R-squares of points forecasts on the validation set. The optimal combination of hyper-parameters for each model can be found as Supplementary Table S1 online.
Metrics of prediction errors. The six optimal models forecasted on test set with seven prediction horizons (horizon = 1, 2, 4, 6, 8, 10, 12 months). Three metrics were implied to measure prediction accuracies of three deep learning models and three regressive models: R-squares of point forecasts, NMAE on peak intensity forecasts, and average delays of peak month forecasts. R-squares performed an overall prediction accuracy evaluation on all discrete points in the test set. Because the predicted peak intensity and peak time have more practical guiding significance, NMAE only measures the peak intensity of each year, and the average delays only focus on when the peak month arrives. The formulas of three metrics are as follows: where j represents the age groups, h represents different prediction horizons, y is the number of cases in original dataset, and ŷ is the predicted value. The three metrics corresponding to different horizons are used to compare the performance of the short-term and long-term predictions of the models. The three metrics corresponding to different age groups are used to analyze the outbreak patterns of HFMD in different age groups.
Ethics statement. All methods were carried out in accordance with the principles of the Declaration of Helsinki. The experimental protocols were approved by the ethical committee of Beijing Center for Diseases (3) n t = tanh W in x t + b in + r t * W hn h (t−1) + b hn (4) h t = (1 − z t ) * n t + z t * h (t−1) i=1 y i − y i 2 (8) NMAE j,h = y peak −ŷ j,h peak max h y peak −ŷ j,h peak (9) delay j,h = t peak −t j,h peak