Using the Baidu Search Index to Predict the Incidence of HIV/AIDS in China

Based on a panel of 30 provinces and a timeframe from January 2009 to December 2013, we estimate the association between monthly human immunodeficiency virus/acquired immune deficiency syndrome (HIV/AIDS) incidence and the relevant Internet search query volumes in Baidu, the most widely used search engine among the Chinese. The pooled mean group (PMG) model show that the Baidu search index (BSI) positively predicts the increase in HIV/AIDS incidence, with a 1% increase in BSI associated with a 2.1% increase in HIV/AIDS incidence on average. This study proposes a promising method to estimate and forecast the incidence of HIV/AIDS, a type of infectious disease that is culturally sensitive and highly unevenly distributed in China; the method can be taken as a complement to a traditional HIV/AIDS surveillance system.


Models.
Dynamic heterogenous panel estimation has attracted rising attention in recent years. The shortand the long-run growth effect are often of primary interest. Researchers often formulate traditional regression models to undertake the analysis, if the underlying time series is stationary and exhibits a long-run relationship. However, this stationary assumption, in most circumstances, is not as commonly applicable as has conventionally thought, resulting in misleading estimations. Cointegration has thus become an overriding requirement for any models with non-stationary time series. When cointegration is met, autoregressive distributed lag (ARDL) models are often applied to jointly estimate the short-and long-run effects of the considered variables, as ARDL models can be reparametrized into error correction models (ECM), in which the short-run dynamics are influenced to any deviation from long-run equilibrium 19,20 .
Recent literature has suggested several approaches in the estimation of dynamic heterogenous panel models. A dynamic fixed-effects (DFE) estimator can provide estimations at one extreme, under which all the provinces are pooled. Both slope coefficients and error variances proved to be identical in the long run; only the intercepts are allowed to vary across provinces. If the homogenous slope coefficients or error variances are violated, the DFE estimator would provide inconsistent results. At the other extreme, a mean group (MG) estimator is employed, which impose no cross-province parameter restrictions and allows the estimated coefficients to vary across provinces. The MG estimator can be estimated on a by-province basis, and it is shown to provide consistent estimates of province-specific mean of the long-run coefficients, when the cross-province dimension is large 21 .
The PMG estimator, on the other hand, reflects an intermediate approach that involves both pooling and averaging, offering a variety of appealing features. As a special case of ARDL models, it can provide a valid estimation of whether variables of interest follow the I(0) or I(1) process. As aforementioned, it can also separate short-and long-run relationships between BSI and HIV/AIDS incidence. More often than not, short-run dynamics are less likely to share common features across provinces; common features are often revealed in the long run. This insight can be exploited by the PMG estimator, which imposes homogeneity of slope coefficient in a long-run relationship, while allowing the short-run coefficients, such as the intercepts, the speed of adjustment to the long-run equilibrium values and error variances, to be heterogenous across provinces. This slope homogeneity holds when the long-run equilibrium is expected. If not, an alternative estimator (the MG estimator) should be adopted. When neither slope coefficients nor error variances vary across the provinces in the long run, the DFE estimator should be employed. As stated, these three estimators can be incorporated into an error-correction model using the ARDL (p, q). In this analysis, we take the maximum lag as being 1 (i.e., p = 1; q = 1). The times series of BSI and CDC records are processed by transforming them into logarithm form. Therefore, the model can be written as follows: In Equation (1), CDC log it refers to the logarithm of CDC-reported HIV/AIDS incidence in province i at month t, and BSI log it denotes the logarithm of the HIV/AIDS BSI. To rule out possible 0 incidence and 0 searching, we added a very small number, 0.001, to CDC it and BSI it before taking the logarithm. By taking the log, we transform the highly skewed distributions to distributions close to normal, which helps meet the assumption of inferential statistics. In the model, δ i 1 is a short-run coefficient of lagged BSI log i t , , and θ are the long-run coefficients. φ i is the coefficient of speed of adjustment to the long-run equilibrium; when φ = 0 i , there would be no long-run relationship. ε i t , stand for idiosyncratic errors at month t in province i, which are independent and identically distributed. All three models are estimated using the STATA14.0 command "xtpmg" 22 .

Results
Descriptive analysis. According to the monthly reported incidence rates by CDC, HIV/AIDS incidence has been growing steadily in China. In 2009, only four provinces had more than 1 per 1,000 population people newly infected with HIV/AIDS; by 2013, the number of provinces had increased to twelve (including Xinjiang). The prevalence of HIV/AIDS is particularly severe in Guangdong, Guangxi, Henan, Sichuan and Yunnan (Fig. 1), which together account for over 60% of HIV/AIDS infection in China.
To show the distribution of BSI and CDC record of HIV/AIDS, we present the summary statistics in Table 1. Both BSI and CDC had large variations at national level, where between-province variations accounted for 70.55% and 83.53% of the total variation of the variables of interest respectively (Table 1). To bridge the link between these two variables, we further plotted the provincial monthly trend of BSI and CDC records. To remove the scale effect of the variables, we standardized the BSI and CDC recodes for each province. As Fig. 2 shows, the standardized BSI went hand in hand with the CDC records, indicating the presence of dynamic interactive effects between BSI and the incidence of HIV/AIDS. We also fit naïve bi-variate model for each province, and found that most provinces had moderate correlation coefficients between 0.3 and 0.5 in 2013. Additionally, some seasonal patterns can be revealed in both BSI and CDC (Fig. 2).
In-Sample Prediction. The application of the PMG model requires that the data should follow the I(0) or the I(1) process. To test this, we first conducted a stationary test for all the variables in their original form and the first difference. The augmented Dickey-Fuller test shows that, although CDC log and BSI log do not conform to I(0) process in some provinces, both variables are shown to follow I(1) process across all the provinces in China.
To examine the association of monthly BSI and true HIV/AIDS incidence records from January 2009 to December 2013, we estimate Equation (1) with PMG, MG and DFE estimators. As shown in Table 2, the coefficient of error-correction term of the three estimators are all significantly negative, with the magnitude ranging from −0.366 to −0.609 (no smaller than −2), which indicates the existence of a long-run relationship between BSI and CDC (the premise behind the application of the long-run modeling). From DFE to PMG, further to MG, a gradient of decreasing model restrictions is displayed.
To choose among these three methods, we conducted the Hausman test. By comparing DFE and PMG, a significant Hausman test suggested that the equalization of the speed of adjustment coefficient and short-run coefficients across provinces was invalid; the PMG estimator was thus preferred. By further comparing the PMG and MG estimators, the insignificant test result revealed that two estimators can produce no different estimation of coefficients; in this regard, long-run slope homogeneity cannot be rejected. Clearly, imposing slope homogeneity somewhat reduced the standard residual of the long-run BSI log coefficient, while the change of the point estimate, which is at the third digit after the decimal point, was minor. Research suggests that only when there is no long-run slopes homogeneity can the MG estimator provide a consistent estimate of the mean of the long-run coefficients. When no long-run slope homogeneity is violated, the PMG estimator is preferred. As in case of slope homogeneity, PMG is consistently proved to offer an efficient and consistent estimate 23 . In this analysis, we therefore chose PMG as the best fitting model. The results show that BSI has a positive and significant long-run correlation with CDC-reported incidence (Table 1): a 1% increase of BSI was associated with a 2.089% of increase in HIV/AID incidence. In the short run, the increase of BSI index has, to some degree, suppressed the HIV/AID incidence.
To show visually the goodness-of-fit of PMG estimator, the residual-time plot was drawn (Fig. 3). From Fig. 3, the PMG model fits the HIV/AIDS incidence extremely well. The predicted residuals present no apparent time trend, showing a straight horizontal line at y = 0. For the provinces with more scattered dots, such as Inner-Mongolia, Tibet, Qinghai and Ningxia, the larger errors are primarily due to the zero incidence of HIV/ AIDS. When looking into the CDC-reported HIV/AIDS incidence data, these four provinces show the lowest HIV/AIDS incidence levels in China. In epidemiology studies, the aim of prediction is to accurately identify the most at-risk groups and to predict the probability of infections and the speed of their spread in high-risk regions. The government and relevant organizations can therefore react promptly to effectively control the transmission of the disease. To show how well the PMG model can predict HIV/AIDS incidence in the HIV high prevalence area, Fig. 4 was drawn to compare the true and predicted valued of HIV/AIDS incidence (Fig. 4). This figure implies that the HIV/AIDS incidence predicted from the PMG model shows a high congruence of HIV/AIDS incidence reported by CDC.
We also tried various other model specifications. As HIV/AIDS incidence is very low in most of the provinces, the static models (such as the two-way fixed-effects model and the AR(1) error model) have shown to provide satisfactory predictions of HIV/AIDS incidence. All these models indicate a positive and significant association between BSI and CDC. Unfortunately, static specification fails to take account of the dynamic features associated with high-incidence provinces; this leads to a large underestimation of HIV/AIDS incidence in these areas. The PMG estimator is, so far, the best estimation we come up with.

Out-of-Sample Forecasting.
To check the performance of the PMG estimator further, we also performed out-of-sample forecasting for HIV/AIDS incidence. For the sake of comparison, we divided the period into two when doing the forecasting. The first period covered monthly data from January 2009 to December 2012 (48 months); the second ranged from January 2013 to December 2013 (12 months). We attempted to use first-period data to estimate the model, and further forecast the HIV/AIDS incidence based on this estimated model for the rest of the months covered in the second period.   In STATA, after running "xtpmg", we create a new variable with fitted values ("pr") and an error correction term ("ec"). The sum of pr and ec is the predicted log CDC of the previous month. Rather than directly forecasting HIV/ AIDS incidence, we can thus forecast pr and ec instead. From Equation (1), log BSI as a key predictor of log CDC is also pivotal to successful forecasting. As Baidu can provide its users with instant search information, we here assume that BSI is given across the whole period. This assumption results in a large simplification of the forecasting process. As long as the BSI is known, xtpmg can always give the full range of predicted pr values. To obtain the accurate log CDC forecast, therefore, the central task becomes how to achieve an accurate forecast of ec.
As mentioned, the distribution of HIV/AIDS incidences is highly uneven in China. When employing a dynamic forecast, we adopt a static specification of ec in low-incidence areas, and a dynamic specification in high-incidence areas. In terms of the former, we used a province-level fixed-effects model with the control of 1-to-5 lags of ec and a linear time trend. We also adjusted for provincial fixed components when forecasting. In terms of the latter, we employed a seasonal autoregressive integrated moving-average (SARIMA) model in each of the high-incidence provinces, where ARIMA(1,0,0)(0,1,0) 12 was the final model adopted for the dynamic forecasting. This model included one AR term and one seasonal difference of ec. The inclusion of the AR term takes care of the first order autocorrelation of ec, while the adoption of the seasonal differencing can remove a seasonal random walk type of non-stationarity.
The procedure to forecast ec using fixed-effects models is as follows: 1. Estimate the PMG model using first 48-month observations of each province, and predict pr and ec; 2. Employ provincial fixed-effects model to create a forecasting model by regressing predicted ec from step 1 to its 1-to-5 lags and a linear time trend; 3. Predict the fixed component from fixed-effects model and create a provincial average fixed component for forecasting adjustment using the "forecast" command; 4. Store the predicted pr and forecasted ec at month 49; and 5. Create the forecasted log CDC at time 49 by summing the predicted pr and forecasted ec at month 49.
We carried out the above five steps recursively until we obtained all the logCDC forecast from month 49 to month 60. We took a different approach when forecasting ec in provinces with a high incidence of HIV/AIDS using the SARIMA model. The procedure was as follows: 1. Estimate the PMG model using the first 48-month observations of each province, and predict the full range of pr and ec for the first 48 months; 2. Estimate ARIMA (1,0,0)(0,1,0) 12 model of ec from step 1 using the first 48-month observations, then apply a dynamic forecast for ec of month 49 by using the "arima" command in STATA14.0; 3. Replace the missing value of ec predicted from step 1 with the forecasted ec at month 49; and 4. Create the forecasted log CDC at time 49 by summing the pr estimated from step 1 and the ec at month 49.
To obtain all the log CDC forecast in the second period, we carried out steps 2-4 recursively. Notice that when using a fixed-effects model to forecast ec, we also update ec by running the PMG model recursively. This is because, as a static specification, the fixed-effect model can only fix the province-level time-invariant pattern of ec. By doing this, we took account of minor, even nuisance, time-variant changes as much as we could. However, in the ARIMA model, as a dynamic specification, these problems are far less of a concern.
The forecasting results of HIV/AIDS incidence from January to December 2016 for each province are presented in Fig. 5. Clearly, our strategy, although indirectly, provided satisfactory log CDC forecasts, especially for some provinces with high HIV/AIDS incidence (such as Yunnan and Guangxi). However, our forecast contained a relatively large forecast error for Tibet: the province contains several zero incidences over time (Fig. 6). We also tried various other time breakpoints, such as the 30 th month (i.e., we forecast the last 30-month records with the first 30-month HIV/AIDS incidence) and the 35 th month (i.e., we forecast the last 25-month records with the first 35-month HIV/AIDS incidence); the forecasting results remained consistent.

Discussion
HIV/AIDS prevalence in China has long been low at the national level, but high in some key regions. This low national prevalence might mask the high prevalence of HIV among specific groups in certain areas. HIV/AIDS can never simply be taken as an infectious disease; it involves a dynamic and interactive correlation between government action and epidemic. More often, it involves ethnic politics in some areas in China. Despite the government's enormous efforts at HIV prevention, due to the cultural stigma associated with HIV, a deep fear of disclosure undermines the state's attempts at HIV/AIDS diagnosis, treatment and progression. Globally, it has been widely suggested that the HIV epidemic is underreported. It is therefore necessary to identify some effective and accurate prediction models to assist the detection of HIV/AIDS incidence. Existing studies, however, primarily focus on the identification of HIV transmission type 24 , and attempts to predict HIV/AIDS incidence remain void. Effective prediction is a prerequisite for prompt HIV prevention. In epidemiology studies, much recent scholarship has drawn attention to the use of Internet search data to assist in the surveillance of some diseases. Taking advantage of Baidu search data, this study attempts to examine whether the BSI of HIV/AIDS can be an effective predictor of the detection of province-specific HIV/AIDS incidence in China. The result of the PMG estimation has suggested that the increase of BSI in the short run can decrease the HIV/AIDS incidence temporarily, while in the long run, BSI has shown a significantly positive correlation with HIV/AIDS incidence.
China experienced a growing epidemic of HIV/AIDS over the time span covered by the analysis. Considering the pervasive variation of HIV transmission type across regions 25 , the rising trend of HIV/AIDS may imply the increasing dominance of one or more types of HIV transmission channels. However, the increase of HIV/AIDS incidence does not necessarily indicate the rapid spread of HIV/AIDS through actual virus transmission. This pattern may imply certain other meanings. One recent research in Hong Kong has suggested that media coverage can predict HIV related search 26 . As scale and intensity of media coverage varies across social contexts, whether media would arouse the similar level search in mainland China remains an open question. Moreover, this trend of HIV/AIDS incidence may reveal the advancement of medical technology, which has improved the accuracy and precision of diagnosis while reducing the cost associated with it. It also reveals the government's ceaseless efforts at HIV prevention, for example the implementation of relevant health policy and the increase in the health service coverage to areas with high concentration of HIV/AIDS. Regarding the positive long-run association between BSI and CDC, the growing awareness of individuals about HIV/AIDS infection is implied. Given the availability of relevant services, individuals actively participating in HIV/AIDS check-ups give impetus to the growing identification of HIV/AIDS. These factors, taken together, contribute to the rising trend of HIV/AIDS incidence as observed. Before taking prompt action, an attempt to predict HIV/AIDS accurately and reliably is necessary.
Our study nevertheless has several limitations that deserve further discussion. First, to achieve the consistent estimation of short-run effects, the application of the PMG model requires no serial correlation of predicted errors. However, not all provinces can satisfy the zero serial correlation assumption made in this analysis. The good thing is that the long-run relationship is of great interest in our study. As Pesaran suggests, given that independent variable x has a finite-order autoregression, it is reasonable to allow for the possible independence between x and error when estimating the long-run coefficient 19 . This being said, despite the presence of serial correlation in some provinces, the long-run coefficient remains consistent and efficient. Second, our prediction contains relatively large predicted errors in the areas with very low HIV/AIDS incidence. It is important to note that HIV/AIDS incidence is closely associated with the accessibility or availability of relevant health services; possible underestimation should be taken into account.
HIV prevention is never easy. As a culturally sensitive disease, HIV/AIDS control and prevention should be conducted in a more strategic way. Internet search data can provide an innovative and cost-effective way to help early detection of potential HIV/AIDS infection, and it can be used as a complement to a traditional HIV/AIDS surveillance system.