Multivariate time series analysis on the dynamic relationship between Class B notifiable diseases and gross domestic product (GDP) in China.

The surveillance of infectious diseases is of great importance for disease control and prevention, and more attention should be paid to the Class B notifiable diseases in China. Meanwhile, according to the International Monetary Fund (IMF), the annual growth of Chinese gross domestic product (GDP) would decelerate below 7% after many years of soaring. Under such circumstances, this study aimed to answer what will happen to the incidence rates of infectious diseases in China if Chinese GDP growth remained below 7% in the next five years. Firstly, time plots and cross-correlation matrices were presented to illustrate the characteristics of data. Then, the multivariate time series (MTS) models were proposed to explore the dynamic relationship between incidence rates and GDP. Three kinds of MTS models, i.e., vector auto-regressive (VAR) model for original series, VAR model for differenced series and error-correction model (ECM), were considered in this study. The rank of error-correction term was taken as an indicator for model selection. Finally, our results suggested that four kinds of infectious diseases (epidemic hemorrhagic fever, pertussis, scarlet fever and syphilis) might need attention in China because their incidence rates have increased since the year 2010.

data could dynamically predict the future incidence rates. On the contrary, our study employs both the vector auto-regressive (VAR) model and error-correction model (ECM) for multivariate time series analysis, which can effectively capture the dynamic interdependencies among multiple data sources. Besides, Tan, et al. 9 examined the county-level socio-demographic characteristics associated with syphilis and gonorrhoea in Guangdong Province by using linear and spatial lag regression, but considering China is a very large country with highly imbalanced development of regional economy, it is plausible to doubt whether the results are the same at the nationwide level. Figure 1 shows the flowchart of building MTS models between the infection and GDP time series data to forecast the future infection rates with the established model. To this end, the second part consists of the preparation, construction, implementation, verification and application of modelling. The third part reports the main results of this study. Finally, the last part ends the paper by concluding the new discoveries and future works to do in this research field.

Materials and Methods
The data. The 10 , ranging from 1978 to 2014. The GDP data of the corresponding period was obtained from the National Bureau of Statistics of China (http://data.stats.gov.cn/easyquery.htm?cn=C01). According to relevant laws and regulations, 26 types of infectious diseases were classified as Class B notifiable diseases in China, though three of them (severe acute respiratory syndromes, anthrax and human avian influenza) were actually treated as Class A notifiable diseases. In this study, 11 types of Class B notifiable diseases were included for analysis, and the rest diseases were excluded for the following reasons: ① the annual incidence rates remained constantly too low (usually < 0.05/10 5 ) in the last decade, so it did not make much practical sense to explore their relationship with GDP (e.g., poliomyelitis and diphtheria); ② for the sake of scientific rigor, diseases with hard-to-interpret outliers were also excluded, e.g., measles; ③ the data of incidence for some diseases were not available until recent years, so their sample sizes were too small to build reliable statistical models, e.g., tuberculosis and dengue; ④ some Scientific RepoRts | 6: 29 | DOI:10.1038/s41598-016-0020-5 endemic diseases prevailed only in certain areas, therefore it was not appropriate to analyse them at the nationwide level, e.g. brucellosis and schistosomiasis. Specifically, Table 1 listed the summary information of diseases to be analysed in this study.
The preparation of modelling. Before modelling, the time plot and cross-correlation matrices 11 were applied to illustrate the characteristics of data and help select the appropriate analysis models. The time plot showed the data against the time index (i.e. incidence v.s. year, or GDP v.s. year), and could present temporal characteristics such as short-term oscillation and long-term trend. Additionally, considering the multivariate cases, the cross-correlation matrices were also used to describe the dynamic relationships. For example, GDP in recent years may be correlated with the incidence rate of infectious disease in the coming years. In view of this, let {x 1,t } and {x 2,t } denote the value of incidence rate and GDP at year t, respectively. Then the whole data observed at year t could be noted as x t = {x 1,t , x 2,t }, where x t was a vector with two series components (in this paper, boldface notation was used to indicate vectors and matrices). For any time lag k (k is an integer), the lag-k cross-correlation matrix was defined as ρ ij (k), which was the correlation coefficient between x i,t and x j,t+k (i, j = 1, 2). For illustration, if both i = j = 1, then ρ 11 (k) measured the correlation of incidence rates between the current year and k year ahead (if k < 0) or later (if k > 0); likewise, if i = 1, j = 2, and k > 0, then ρ 12 (k) was the dependence of current incidence on the GDP at k year later. In this way, it not only considered temporal effect, but also accounted for the correlation between different series components.

Model construction.
To guarantee the fitted and forecasted incidence rates were non-negative, all the data were logarithmically transformed to lnx t = {lnx 1,t , lnx 2,t } before modelling. Then the MTS model was built based on lnx t , and this model was further employed to make forecasts. Finally, the inverse-logarithmic (or exponential) transformation was taken on the fitted and forecasted results to transform them back into original form. As mentioned above, both VAR and ECM are useful models for multivariate time series analysis, but each of them has its own applicable conditions. Tsay 11 proposed a two-step testing procedure to help select a most appropriate model. The first step is to build an ECM for the vector series lnx t : , a t is the residual series, and matrix II is called error-correction term. Then according to the testing result on rank II, three types of MTS models are utilised, i.e., the VAR model for original series {lnx t }, the VAR model for differenced series ∇ x { ln } t and the ECM, below are some more details.
(1) VAR model for original series.
If rank(II) = 2, it implies the ECM is not so informative that the VAR model could analyse lnx t directly.
The VAR model is an extension of traditional autoregressive (AR) model from univariate to multivariate time series analysis. It reflects the influence of the last p historical data on the current one, which can be written as (2) VAR model for differenced series. When the testing result is null, i.e., rank(II) = 0, it indicates that the dynamic relationship between incidence and GDP are nonstationary, and the differencing technique will be used to transform it into stationary one. Consequently, the VAR(p) model would be applied to the differenced series ∇ x ln t instead of lnx t , that is, The ECM is applied when rank(II) = 1, which is of the form Eq. (1). The ECM could be considered as a supplement to the VAR model by adding an error-correction term to the latter. Generally, the VAR model characterises the long-term trend, while the error-correction term adjusts the short-term oscillation.
The implementation of modelling in R. In this study, the VAR and ECM were estimated by ordinary least squares (OLS) method. To determine the unknown order p for the model, the Akaike information criterion (AIC) 12 would be used. All statistical analyses were performed in R 3.2.3 (the R Foundation for Statistical Computing) 13 , a free software environment for statistical computing and graphics. Computing packages {vars} 14 and {tsDyn} 15 can be downloaded from the Comprehensive R Archive Network (CRAN) at http://cran.r-project. org/ and installed in advance. The cross-correlation matrices could be calculated by the command ccf, and the VAR and ECM could be estimated by the command VAR and VECM, respectively.
The verification of modelling. Since the model was built for the aim of forecasting, verification was considered to make the results more convincing. In particular, models were verified in three ways: ① the goodness-of-fit; ② the comparison with other models; ③ and with previous studies.
In this study, the goodness-of-fit consisted of two measures to evaluate the fitting performance of the model. One was the mean squared percentage error (MSPE), which quantified the difference between the fitted incidence rates and the real ones. The other one was the Ljung-Box test for the residuals {a t }, which was to test whether the model was good enough to efficiently extract useful information from the data and thus leave the residuals to be white noise (with zero mean and constant standard deviation).
The second way for verification was to compare the results of our approaches with those of traditional method. Since the autoregressive integrated moving average (ARIMA) model has been one of the most widely used techniques 6 , it served as benchmark to evaluate the performance of MTS model in this study.
The third way for verification was to compare our results with some similar previous studies. As mentioned in the Introduction part, since the relationship between incidence rate and GDP has been previously studied to some degree, it could help us to verify whether our new results make practical sense or not.
Scientific RepoRts | 6: 29 | DOI:10.1038/s41598-016-0020-5 The application of modelling. After the models were built and verified, they could finally be utilised to make forecasts on future incidence rates if the growth of Chinese GDP remained below 7%. To make the forecasting step clear, we took the ECM as an illustration, which was almost the same for VAR model. The ECM model represented by Eq. (1) could be rewritten for incidence (x 1,t ) and GDP (x 2,t ) series, respectively, that was Then at the current time point t, the future incidence at time point t + l (l ≥ 1) could be forecasted as were the historical data which were already known; on the other hand, if l − i > 0, they represented future unknown values. In this case, the future incidence-related information ∇

Results
The results of modelling preparation. Since incidence and GDP series differed dramatically in scale, to better illustrate their mutual relationship, each series were standardised before time plotting. The relationships illustrated by the time plots could be summarised into two categories. The first category, as was shown by Fig. 2(a), indicated that the incidence rate fell dramatically as GDP increased. In contrast, Fig. 2(b) presented the second category, suggesting both incidence rate and GDP were increasing. Consequently, cross-correlation matrices were applied to further identify the direction of relationship, which further classified the diseases into the positively-correlated group (gonorrhoea and syphilis) and the negatively-correlated group (epidemic hemorrhagic fever, malaria, pertussis, rabies, bacterial and amoebic dysentery, epidemic encephalitis B, scarlet fever, typhoid fever and virus hepatitis).
The model. The  and for the differenced log-transformed GDP, (3) ECM for typhoid fever.
In the analysis of typhoid fever, a trend term "t" was included in the model to account for long-term trend. For the differenced log-transformed incidence, Overall, from the above three examples, it could be seen that both the historical incidence rates and GDP would affect the current incidence rate. However, on the other hand, incidence rate scarcely had any influence on GDP. Therefore, the results suggested there was unidirectional relationship from GDP to the incidence rates of the eleven Class B notifiable infectious diseases included in this study for the last three decades.

Model interpretation.
The influence of GDP on disease incidence could be in either positive or negative way. On the one hand, with the increase of GDP in China, the expenditure on health and medicine has been enhanced. To name but a few examples, the government expenditure on health has annually risen by 18.66% on average since the year of 1990 10 . Besides, the number of people benefitting from the water-improving project accelerated from 0.6 billion in 1990 to 0.9 billion in 2014. Those events were undoubtedly of great benefit to disease control and prevention. On the other hand, this study also found that the incidence rates of gonorrhoea and syphilis had risen along with the economy development. Although the reasons were various, it was at least plausible to say that the increasing power of purchase made some people more financially affordable to extramarital and premarital sex behaviours, which gave rise to the risks of sexually transmitted infections 6,16 .
The absence of influence from incidence rates to GDP might seem implausible at the first sight, but it would be easier to understand if the following three points were taken into consideration. At first, GDP means a monetary measure of the market value of all final goods and services produced in a period 17 , and the influence of incidence on GDP should be distinguished from the money of loss due to diseases. For example, suppose that 1,000 dollars were paid by a patient for medicine and health care services, it indeed caused a financial loss for the patient, but this amount of money was also counted as the market value produced by the medical and health industry. Therefore, from the economic point of view, the payment was a promotion instead of poison for GDP. Secondly, as told by the widely accepted Cobb-Douglas production function 18 , diseases may deteriorate GDP through damaging the health of labours, which was generally measured by the disability adjusted of life years (DALYs). However, according to the Global Burden of Disease Study 2013 (GBD 2013) 19 , not only were infectious diseases no longer the main causes of DALYs in China, but also the DALYs of infectious diseases had all declined globally ever since 1990. Finally, it should be reminded that this study merely included 11 types of Class B infectious diseases in China. Although some other researchers have declared infectious diseases had an impact on economics 20 , what they mainly referred were those diseases lack of timely diagnosis and prevention. Since the diseases in our study were under well prevention, treatment and control in China, it was appropriate to judge that there were no essential contradictions between this study and others.
The verification result. The fitting plots for epidemic hemorrhagic fever, syphilis and typhoid fever were shown in Fig. 3, which illustrated that the fitted incidence rates were generally in consistent with the actual ones. The fitting plots for other diseases were the same, and thus not presented here due to the limited space. Table 2 presented the overall evaluation results on goodness-of-fit. For each disease, the second column in Table 2 presented the Ljung-Box test result of MTS model, indicating this model was efficient enough to extract information from data. Besides, Table 2 also listed results for the comparison of MSPEs between the MTS and ARIMA model. It could be seen from column 3 and 4 that the MSPEs of MTS model were very small (<0.04 on average), and even smaller than those of the ARIMA model. Although the MSPEs of the fitted incidence rates for epidemic encephalitis B, malaria, rabies and scarlet fever were slightly bigger, however, after careful check of the original data, we found the inconsistencies were mainly occurred in the early 1980s, so it was plausible to infer that these inconsistencies would not jeopardise the validity of forecasts.
Another approach to verify the result of modelling was to compare our results with those of other studies. As has been mentioned above, the findings of correlation between GDP and infectious diseases have coincided with most previous studies 2, 3,9,21,22 , but with only a few exceptions 9, 23-25 . In those exceptional studies, GDP/GDP per capita was not significant predictor of the infectious diseases (e.g., syphilis, gonorrhoea, malaria and Hepatitis C), however, they still displayed the same signs of correlation as this study did. Meanwhile, from the perspective of epidemiology and biostatistics, certain variables should be remained even though they had non-significant effects, because of the logical importance in the particular problem 26 . Therefore, it was plausible to say that the results of this study did not essentially contradict with those of previous researches.
In addition, some concerns may arise over the matter of spatial stratified heterogeneity, that is, whether the relationship between incidence and GDP distributed unevenly across different parts of the whole country. To this end, this study utilised the q-statistics proposed by Wang et al. 27 to make hypothesis test. According to the National Bureau of Statistics of China, the 31 provinces in mainland China were classified into eastern, middle and western regions, respectively. For each disease, the testing results were shown in Table 3, which suggested that the null hypothesis H 0 (i.e., no spatial stratified heterogeneity) could not be rejected yet.   Table 3. The testing results of spatial stratified heterogeneity for each disease. *The significance level α was set to be 0.05 in advance. The F-statistics (column 2) was constructed based on the q-statistics, which followed a non-central F-distribution, with first and second degree of freedom f 1 (column 3) and f 2 (column 4), and noncentrality parameter lambda (column 5). Column 6 listed the 95% cutoff point, and by comparing it with the F statistics, it could be inferred whether P > α or not (column 7).
Scientific RepoRts | 6: 29 | DOI:10.1038/s41598-016-0020-5 The application of model. After verification, the corresponding MTS models were utilised to forecast the incidence rates of the next five years. Table 4 provided the changing incidence rates for each disease from 1978 to 2020. The incidence rates from 1978 to 2014 were observed ones, and those from 2015 to 2020 were predicted from the model. Except for scarlet fever, the incidence rates of all the other diseases were expected to decrease between 2015 and 2020. If the forecasting results were true, then it meant the incidence rate of scarlet fever would have been increasing ever since 2000. Besides, Table 4 also indicated that the incidence rates of epidemic hemorrhagic fever, pertussis and syphilis had risen to some degree during the last decade. These results raised warnings for future disease outbreaks, which were further discussed in the next part.

Discussion
In this study, a new approach based on MTS model was provided to investigate not only the direction of dynamic relationship between incidence rates of Class B notifiable diseases and GDP in China, but also the effect size of this relationship. Statistically significant evidence was found that the Chinese GDP growth affected its incidence rates of Class B notifiable diseases over the past thirty years. In addition, based on the IMF's forecasts about future Chinese GDP, our models forecasted the future trends of incidence rates in the next five years, and therefore indicated the key point of disease control and prevention from our own view. Finally, these results have been verified in multiple ways to increase their creditability. This study highlighted the importance and necessity of merging multiple sources of information into the surveillance of infectious diseases. At least two kinds of information were proved useful by previous studies 28, 29 : historical incidence data and exogenous variables including but not limited to GDP. Therefore, this study built MTS models to account for both of them. It could bring benefits in three ways: ① characterising both long-term and short-term relationships between incidence rate and GDP; ② making conditional predictions; ③ reducing uncertainty by introducing extra information. Our results directly supported the first two of them. As for the last one, since this study has already demonstrated cross-correlation between incidence rate and GDP series, it was plausible to confirm it from the view of information theory.
Another feature of this study was the provision of integrative approaches for multivariate time series analysis of infectious diseases. To account for any possible relationship between incidence rate and GDP series, totally three kinds of MTS models were considered: VAR model for original series, VAR model for differenced series and ECM. As for a certain infectious disease, based on the rank of error-correction term, clear indication was given on which of those models should be taken for analysis. Furthermore, this study has provided the R software codes to realise the whole modelling procedures. It was highly expected that all these attempts would encourage and help practioners to apply our methods to study the relationship between incidence rates of infectious disease with many other factors besides GDP.
It is quite necessary to emphasise that the ultimate goal of disease surveillance is to suggest what should be done in the future rather than to make mere forecasts. This study contributes to this goal by warning four kinds of infectious diseases (epidemic hemorrhagic fever, pertussis, scarlet fever and syphilis) might need special attention because their incidence rates have increased since the year 2010. Epidemic hemorrhagic fever is caused by hanta viruses, and its incidence rate is positively correlated with rodent density 30 , so rodent control and extinguishment needs to be strengthened. Since pertussis is vaccine-preventable disease, future work needs be done to maintain high level of DPT (diphtheria, tetanus and pertussis combined vaccine) immunisation coverage. In recent years, scarlet fever mostly occurred among school children, however, there has not been any efficient vaccine for prevention yet; therefore, it is imperative to protect susceptible population by reinforcing health education, especially in nursery, kindergarten and primary school. Similarly, the prevention of syphilis also relies on health education of the public about its hazard and transmission. Improving the surveillance system is the key to early warning of epidemics, and multivariate time series analysis could help by suggesting which variables should be included into the system and how to obtain comprehensive analysing results. On this basis, further studies with provincial level data and more variables are needed to explore the causation net of epidemics for faster and better control and prevention.  Table 4. The change rate for each disease from 1978 to 2020. *For each column, the change rate = (the incidence of the last year-the incidence of first year)/the incidence of first year. **The first period for gonorrhoea was from 1981 to 1989. ***The first period for syphilis was from 1987 to 1989.