Disease surveillance based on Internet-based linear models: an Australian case study of previously unmodeled infection diseases.

Effective disease surveillance is critical to the functioning of health systems. Traditional approaches are, however, limited in their ability to deliver timely information. Internet-based surveillance systems are a promising approach that may circumvent many of the limitations of traditional health surveillance systems and provide more intelligence on cases of infection, including cases among people who do not use the healthcare system. Infectious disease surveillance systems built on Internet search metrics have been shown to produce accurate estimates of disease weeks before traditional systems and are an economically attractive approach to surveillance; they are, however, also prone to error under certain circumstances. This study sought to explore previously unmodeled diseases by investigating the link between Google Trends search metrics and Australian weekly notification data. We propose four alternative disease modelling strategies based on linear models; these examined the length of the training period used for model construction, determined the most appropriate lag for search metrics, used wavelet transformation to denoise data and enabled the identification of key search queries for each disease. Of the twenty-four diseases assessed with Australian data, our nowcasting results highlighted promise for two diseases of international concern, Ross River virus and pneumococcal disease.

dengue; symptoms of dengue fever; symptoms of pneumonia; symptoms of swine; symptoms of swine flu; tamiflu; tamiflu side effects; townsville flood; Varicella; vicks; water flood; white creamy discharge; white discharge; whooping; whooping cough; whooping cough in adults; whooping cough treatment
Barmah Forest virus infection: "Barmah forest"; barmah forest; barmah forest virus; flood australia; flood damage; flood damaged cars; flood recovery; myxomatosis; ross river; ross river fever; ross river virus; water flood

Model construction
In this section we describe the four models that used a 52-week training period; the other eight models were produced in the same way but with a training period of either 104 or 156 weeks. The first model for the 52-week period was denoted 52RC (52-week model; Raw data; Continuous keyword selection). This model was built on disease notification and search metrics data covering 52 weeks. An additional two weeks of search metrics data were provided to the model, from which one- and two-week predictions of disease notifications were made. Search metrics data used in the 52RC model were raw, that is, in the format provided by Google Trends. However, the time series for these search metrics were shifted in accordance with the 52-week cross-correlation results described above. Keyword selection was performed using a robust feature selection method based on multiple hypothesis testing and prediction from the mht R package (45). Once the predictions were recorded, an additional data point was made available for each time series and the 52-week window was shifted forward by one week. The process was then repeated using only the 52 weeks of data in the new window. This process generated one- and two-week predictions, with the linear model rebuilt for each prediction and the most appropriate search metrics included at each step.
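The rolling scheme described above can be sketched as follows. This is a minimal illustration in Python, not the R workflow used for the study: it fits an ordinary least-squares model on each 52-week window and leaves out both keyword selection (the mht step) and the lag shifting. The function names (`fit_ols`, `rolling_nowcast`) and the data layout are assumptions made for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coef, row):
    """Apply fitted coefficients to one week of search metrics."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def rolling_nowcast(search, notifications, window=52):
    """Refit on each window and predict the two weeks that follow it.

    search: one row of search metrics per week (hypothetical layout).
    notifications: weekly notification counts, same length as search.
    Returns (index of first predicted week, 1-week pred, 2-week pred) tuples.
    """
    results = []
    for start in range(len(search) - window - 1):
        end = start + window
        coef = fit_ols(search[start:end], notifications[start:end])
        # Only search metrics are available for the two weeks after the window.
        results.append((end, predict(coef, search[end]), predict(coef, search[end + 1])))
    return results
```

Each tuple pairs the index of the first predicted week with its one-week and two-week predictions, so the output can be compared directly against the notifications that later became available.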
The second model, denoted 52WC (52-week model; Wavelet-transformed data; Continuous keyword selection), used an identical process to the 52RC model, except that the data were denoised using wavelet transforms before being input into the linear models. The DaubLeAsymm family of wavelets from the wavethresh R package was used (46). Since the denoising is data-dependent, the wavelet-transformed data were recalculated each time the window was shifted by one week to take in an additional data point.
The other two models for the 52-week period (52RS: 52-week model, Raw data, Set keywords; and 52WS: 52-week model, Wavelet-transformed data, Set keywords) were fitted in the same way as 52RC/52WC, except that keyword selection in the linear models was not performed at each time point of the validation period (2012-2013). Rather, the terms selected over the training period (2009-2011) were maintained throughout the validation period. Keyword selection for these models was performed using the same process as described above for 52RC/52WC, applied to the 2009-2011 training data. Using the 104 iterations obtained from successive shifts of the 52-week period over 2009-2011, we calculated the frequency with which each search term was selected in a model. The search terms selected in at least 50% of the models fitted in the training period were used for the construction of the 52RS and 52WS models.
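The 50% rule for fixing the keyword set can be illustrated with a short sketch. It assumes the terms selected at each training iteration are already available as lists; the function name `stable_terms` and the toy input are hypothetical, and only the threshold comes from the text above.

```python
from collections import Counter


def stable_terms(selections, min_share=0.5):
    """Return selection frequencies and the terms kept for the set-keyword models.

    selections: one collection of selected search terms per training iteration
    (104 iterations in the text; any number works here).
    """
    # set() guards against a term being counted twice within one iteration.
    counts = Counter(term for chosen in selections for term in set(chosen))
    n = len(selections)
    freq = {term: c / n for term, c in counts.items()}
    keep = sorted(term for term, share in freq.items() if share >= min_share)
    return freq, keep


# Toy example with four iterations instead of 104.
runs = [
    ["ross river", "ross river virus"],
    ["ross river"],
    ["ross river", "flood australia"],
    ["ross river virus", "ross river"],
]
freq, keep = stable_terms(runs)
# keep is ["ross river", "ross river virus"]; "flood australia" (0.25) is dropped.
```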
Linear models were built on a 52-week period using the selected search terms. Predictions were obtained as described previously, and prediction accuracy was assessed using the Mean Square Error of Prediction (provided in the supplementary material) as well as the Spearman correlation between notifications and the resulting predictions. Models using state-level Google Trends and disease notification data were also fitted using the workflow described above. Owing to the loss of Google Trends data when analysing smaller geographical areas, state-level models were produced only for diseases whose models performed well at the national level.

Models were produced for pneumococcal disease and Ross River virus infection for Queensland, New South Wales and Victoria only.
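The two accuracy measures named in this section, the Mean Square Error of Prediction and Spearman's correlation, can be computed as in the sketch below. It is a plain-Python illustration and not the code used for the study; tied values receive average ranks, which is a common convention and is assumed here.

```python
def msep(observed, predicted):
    """Mean Square Error of Prediction."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Spearman's correlation depends only on the ordering of the values, so it rewards predictions that rise and fall with the notifications even when their scale is off, whereas the MSEP penalises errors in magnitude.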

Wavelet Transform
The search term data collected from Google Trends were pre-processed by wavelet transform. The process depends on the model: 52WC/52WS, 104WC/104WS or 156WC/156WS. Because the models were tasked with predicting the next two points of the disease notification series, the wavelet transforms were performed on 54, 106 or 158 data points (the training window plus the two prediction weeks) for each search term, each time the window was moved forward one week. Each 54-, 106- or 158-week period was written as the sum of weighted elementary functions that describe the signal hierarchically, from a rough tendency to the finest details, over a finite number of resolution levels. Here, each signal was decomposed onto the Daubechies basis made of smooth trimodal elementary functions. The corresponding wavelet coefficients were thresholded with a soft-thresholding method (see Mallat, 1999, for details) to reduce signal noise by applying a low level of smoothing. The original signal was then rebuilt from the denoised wavelet coefficients.
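Soft thresholding, the shrinkage step mentioned above, sets small coefficients to zero and pulls the remaining ones towards zero by the threshold. A minimal sketch follows; the threshold value is an arbitrary illustration, since the rule used to choose it is not stated in this section.

```python
def soft_threshold(coefficients, threshold):
    """Shrink each coefficient towards zero by `threshold`; small ones become 0."""
    shrunk = []
    for c in coefficients:
        magnitude = abs(c) - threshold
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if c > 0 else -magnitude)
    return shrunk


# Hypothetical wavelet coefficients and threshold, for illustration only.
denoised = soft_threshold([3.0, -0.5, 0.2, -2.0], 1.0)
# denoised is [2.0, 0.0, 0.0, -1.0]
```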
There are several levels of decomposition of an initial signal, from level $N-1$ (high resolution) to level 0 (rough tendency). The number of levels depends on the number of data points: a 54-week period has $N = 6$ levels because 54 lies between $2^5 = 32$ and $2^6 = 64$, a 106-week period has $N = 7$ levels and a 158-week period has $N = 8$ levels. The initial signal $f(t)$ is decomposed as the sum of a detail signal $D_{N-1}(t)$ and an approximation $A_{N-1}(t)$. The approximated signal $A_{N-1}(t)$ is then decomposed into a further detail signal $D_{N-2}(t)$ and a further approximation $A_{N-2}(t)$. Each approximated signal is decomposed sequentially as the sum of a detail signal and an approximation signal (residual), as illustrated in Supplementary Figure 1 for the Daubechies basis. The detail signal of level $j$ is obtained as:
$$D_j(t) = \sum_{k} b_{j,k} \, \Psi_{j,k}(t),$$
where each $\Psi_{j,k}$ is a translation and a dilation of a so-called mother wavelet $\Psi(t)$ (Daubechies trimodal function here). In practice, the index $k$ ranges over a finite support. The coefficients $b_{j,k}$ are called the detail coefficients and are equal to $b_{j,k} = \int f(t) \, \Psi_{j,k}(t) \, dt$. An empirical estimator of these coefficients is used, computed from the values of the discretised signal at points $t_i$. Many of the wavelet coefficients are close to 0, so thresholding is applied to reduce the number of non-null coefficients.
Similarly, the approximated signal of level $j$ is obtained as:
$$A_j(t) = \sum_{k} a_{j,k} \, \Phi_{j,k}(t),$$
where each $\Phi_{j,k}$ is a translation and a dilation of a so-called father wavelet $\Phi(t)$. The coefficients $a_{j,k}$ are called the approximation coefficients.
The initial signal $f(t)$ can be entirely reconstructed from all the detail signals and the approximation $A_0$ at the lowest resolution level:
$$f(t) = A_0(t) + \sum_{j=0}^{N-1} D_j(t).$$
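The level count and the reconstruction identity can be checked numerically with a small sketch. To stay self-contained it uses the Haar wavelet, whose approximation at each level is simply a block average, and not the Daubechies family used in the study. It also accepts only power-of-two lengths; how the 54-, 106- and 158-point windows were handled at the boundaries is not described here, so that step is left out. All function names are hypothetical.

```python
def resolution_levels(n_points):
    """Smallest N such that 2**N >= n_points (54 -> 6, 106 -> 7, 158 -> 8)."""
    return (n_points - 1).bit_length()


def haar_approximation(signal, level):
    """Haar approximation at `level`: the signal averaged over 2**level blocks."""
    block = len(signal) >> level
    out = []
    for start in range(0, len(signal), block):
        mean = sum(signal[start:start + block]) / block
        out.extend([mean] * block)
    return out


def multiresolution(signal):
    """Return A_0 and the detail signals D_0 .. D_{N-1} of a Haar decomposition."""
    n_levels = resolution_levels(len(signal))
    if 2 ** n_levels != len(signal):
        raise ValueError("this sketch only handles power-of-two lengths")
    # Level N reproduces the signal itself; level 0 is its overall mean.
    approximations = [haar_approximation(signal, j) for j in range(n_levels + 1)]
    details = [
        [fine - coarse for fine, coarse in zip(approximations[j + 1], approximations[j])]
        for j in range(n_levels)
    ]
    return approximations[0], details


def reconstruct(a0, details):
    """Rebuild the signal as A_0 plus the sum of all detail signals."""
    total = list(a0)
    for detail in details:
        total = [t + d for t, d in zip(total, detail)]
    return total
```

In a denoising pipeline the thresholding would be applied to the detail coefficients between the decomposition and the reconstruction; here the two steps are run back to back only to show that nothing is lost when no coefficient is altered.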

Frequency of search term selection in model training
The three tables below indicate the frequency at which each search term was selected for use in training models for pneumococcal disease, Ross River virus infection and pertussis. All values presented are proportions.

Frequency of search term selection in predictive models
The three tables below indicate the frequency at which each search term was selected for use in predictive models for pneumococcal disease, Ross River virus infection and pertussis. All values presented are proportions.

National model performance - Spearman's correlation
National model performance for 1 week (top) and 2 week estimates (bottom), as assessed by Spearman's correlation. The highest performing models for each disease are indicated in bold.