Use Internet search data to accurately track state level influenza epidemics

For epidemics control and prevention, timely insights of potential hot spots are invaluable. Alternative to traditional epidemic surveillance, which often lags behind real time by weeks, big data from the Internet provide important information of the current epidemic trends. Here we present a methodology, ARGOX (Augmented Regression with GOogle data CROSS space), for accurate real-time tracking of state-level influenza epidemics in the United States. ARGOX combines Internet search data at the national, regional and state levels with traditional influenza surveillance data from the Centers for Disease Control and Prevention, and accounts for both the spatial correlation structure of state-level influenza activities and the evolution of people’s Internet search pattern. ARGOX achieves on average 28% error reduction over the best alternative for real-time state-level influenza estimation for 2014 to 2020. ARGOX is robust and reliable and can be potentially applied to track county- and city-level influenza activity and other infectious diseases.


Introduction
Each year in the United States (US) alone, the seasonal influenza (flu) epidemics may claim up to 61,000 deaths [1]. Quick responses and preventive actions to changes in flu epidemics rely on timely and accurate information on the current flu severity. In particular, due to the geographically varying timing and intensity of disease epidemics, most public health decisions and executive orders for disease control and prevention are made at the state or local level. Accurate real-time flu tracking at the state/local level is thus indispensable. Traditional flu surveillance, such as those conducted by the US Centers for Disease Control and Prevention (CDC), however, often lags behind real time by up to two weeks. Here we propose a statistically principled, self-coherent framework ARGOX (Augmented Regression with GOogle data CROSS space) for real-time, accurate flu estimation at the state level. ARGOX efficiently combines publicly available Internet search data with traditional flu surveillance data and coherently utilizes the data from multiple geographical resolutions (national, regional, and state levels).
For the last two decades, tracking of flu activities in the US mainly relies on traditional surveillance systems, such as the US Outpatient Influenza-like Illness Surveillance Network (ILINet) by the CDC. Through the ILINet, thousands of healthcare providers across the US report their numbers of outpatients with Influenza-like Illness (ILI) to CDC on a weekly basis. CDC then aggregates the data and publishes the ILI percentages (%ILI, i.e., the percentages of outpatients with ILI) in its weekly reports at the national and regional levels (there are ten Health and Human Services (HHS) regions in the US, each consisting of multiple states). Starting from 2017, the state-level %ILI reports became available for selected states, and in late 2018 the state-level %ILI reports became available for all states except Florida. Owing to the time for administrative processing and aggregation, CDC's flu reports typically lag behind real time for up to 2 weeks and are also subject to subsequent revisions. Such delay and inaccuracy are far from optimal for public health decision making, especially in the face of epidemic outbreaks or pandemics.
Big data from the Internet offers the potential of real-time tracking of public health or social events. In fact, valuable insights have been gained from the Internet data about current social and economical status of a nation, including epidemic outbreaks [2] and macro economic indices [3,4]. Furthermore, real-time data from the Internet could also offer insights at the regional, state, or local level. Examples include foreshadowing state-wise housing price index in the US [5], estimating New York City flu activity [6], estimating real-time county-level unreported COVID-19 severity in the US [7] among others. For epidemic surveillance, such real-time digital data at local level can be potentially used to provide insights for early epidemic hot-spot detection and timely public health resource allocation (e.g. vaccine campaigns) as well as to gather information on the overall disease prevalence.
Various models have been proposed to utilize Internet data, especially Internet search volume data, to provide real-time estimation of the current flu activity at the national level. Google Flu Trends (GFT), as one of the early examples, uses the search frequency of selected query terms from Google to estimate the real-time %ILI [2]. Recent models on combining CDC's surveillance data with Internet-derived data appear to work well at the national level [8,9]. Other methods, primarily targeting national flu epidemics, were also developed based on traditional epidemiology data and mechanistic models, such as susceptible-infectious-recovered-susceptible model with ensemble adjustment Kalman filter (SIRS-EAKF) [6,10,11,12,13].
Compared to estimation at the national level, %ILI estimation at the regional or state level is much more challenging, as documented by FluSight, the CDC-sponsored Flu Prediction Initiative [14]. Due to factors like geographical proximity, transportation connectivity, and public health communication, the state-wise epidemic spread exhibits strong spatial structure. However, many digital flu estimation methods [11,15,16], including GFT, ignore such spatial structure and apply the same national-level method to regional, and/or state-level flu estimation. A few attempts have been made to incorporate the dependent geographical structure. For example, [17] studied the estimation of ILI activity in the boroughs and neighborhoods of New York City (using traditional epidemiological mechanistic model without Internet-derived data) and concluded that the spatial network is helpful at the borough scale but not at the neighborhood scale; [18] utilized an ordinary-least-squares-based network model to improve upon the output of GFT; [19] employs a multi-task nonlinear regression method for regional %ILI estimation; [20] uses a network approach for %ILI estimation in a few selected states; [21] shows that careful spatial structure modeling can lead to much improved accuracy in %ILI estimation at the regional level. Nevertheless, at the state level, no existing methods provide real-time flu tracking with satisfactory accuracy and reliability. (i) There are no unified approaches to combine multi-resolution and cross-state information effectively to provide national, regional, and state-level estimates within the same framework. (ii) Few existing models can outperform a naive estimation method, which, for each state, without any modeling effort, simply uses CDC's reported %ILI from the previous week as the %ILI estimate for the current week (see Figure 1 for an illustration). This would be particularly worrisome for public health officials who rely on accurate flu estimation at the local level to make informed decisions.
In this article we introduce ARGOX, a unified spatial-temporal statistical framework that combines multi-resolution, multi-source information to provide real-time state-level %ILI estimates while maintaining coherency with %ILI estimation at the regional and national levels (in a cascading fashion). To illustrate the underlying idea of ARGOX, let us take estimating the %ILI in California as an example. The real-time Google search volumes for flu-related terms like "flu symptoms" or "flu duration" from California reflect its current state-level flu intensity to some extent. In addition, California's flu epidemics could be highly correlated with flu epidemics of nearby states such as Oregon and Nevada, as well as with geographically distant but transportation-wise well-connected states such as Illinois. California's current flu situation may also depend heavily on the recent trends of flu epidemics, in particular, the overall national and Pacific-west regional flu trends. Taken these considerations together, ARGOX operates in two steps: at the first step, it extracts Google search information of most relevant query terms at three geographical resolutions -national, regional, and state levels; at the second step, the cross-time, cross-resolution, cross-state information mentioned above, together with Internet-extracted information, are integrated through careful modeling of their temporal-spatial dependence structure, which yields significant enhancement in the estimation accuracy.
Through the ARGOX framework, the state-level flu activity estimates are produced in a unified and coherent way with the national and regional estimates. ARGOX achieves on average 28% mean squared error (MSE) reduction compared to the best alternative and shows strong advantages over all benchmark methods, including GFT, time-series-based vector autoregression (VAR), and another recent Internet-search-based method developed in Lu et al. (2019) [20]. ARGOX achieves its high estimation accuracy through a few features: (i) it automatically selects the most relevant search queries to address the problem of lower-quality Google search information at state or regional level; (ii) it incorporates time-series momentum of flu activity; (iii) it pools the multi-resolution information by combining the national-, regional-, and state-level data; (iv) it explicitly models the spatial correlation structure of state-level flu activities; (v) it adapts to the evolution in people's search pattern, Google's search engine algorithms, epidemic trends, and other time-varying factors [22] with a dynamic two-year rolling window for training; and (vi) it achieves selective pooling of most immediately relevant information for a handful of stand-alone states (details in Materials and Methods).

Results
We conducted retrospective estimation of the weekly %ILI at the US state level -50 states excluding Florida whose ILI data is not available from CDC, plus Washington DC and New York City -for the period of Oct 11, 2014 to March 21, 2020. For each week during this period, we only used the data that would have been available -the historical CDC's ILI reports up to the previous week and Google search data up to the current week -to estimate state-level %ILI of the current week. To evaluate the accuracy of our estimation, we compared the estimates with actual %ILI released by CDC weeks later in multiple metrics, including the mean squared error (MSE), the mean absolute error (MAE), and the correlation with the actual %ILI (detailed in Materials and Methods). We also compared the performance of ARGOX with several benchmark methods, including (a) GFT (last estimate available: the week ending on August 15, 2015), (b) estimates by the lag-1 vector autoregressive model (VAR model), (c) the naive estimates, which for each state without any modeling effort simply use CDC's reported %ILI of the previous week as the estimate for the current week, and (d) a recent Internet-search-based state-level estimation model developed in Lu et al. (2019) [20]. As ARGOX uses a two-year training window, for fair comparison we keep the same two-year training window for VAR as well. Also for fair comparison, the numerical results of the method of Lu et al. (2019) were directly quoted from the article [20] (which reported results through May 14, 2017). Table 1 summarizes the overall results of ARGOX, VAR, GFT, and the naive method, averaging over the 51 states/district/city for the whole period of 2014 to 2020 (up to March 21, 2020).  [20] 0.912 0.912 0.808 0.858  Table  2 shows that ARGOX also uniformly outperforms Lu et al. (2019) in all three seasons when the benchmark is available. More detailed results comparing ARGOX with the benchmarks can be found in the supplementary material (Table S4).
Among all the methods that we numerically compared, ARGOX is the only one that uniformly outperforms the naive method in all 51 states/district/city in terms of MSE for the whole period of evaluation. Figure 1 [20] to the naive method. The relative MSE is the ratio of the MSE of a given method to that of the naive method. Blue color means smaller MSE (i.e., better performance) than the naive method; red color means larger MSE (i.e., worse performance ) than the naive method; grey color means result not available. ARGOX with all blue colors uniformly dominates the naive method, while mixed colors in the rest of the plots show that VAR, GFT, and Lu et al. the only method that gives uniformly better performance than the native method across all states. All other methods in comparison fail to do so for a large portion of the states investigated. Note that the naive method provides a model-free baseline benchmark that solely relies on information from CDC's flu reports. Therefore, ARGOX is the only method that effectively utilizes the Internet data to uniformly improve flu tracking from the traditional surveillance system, indicating ARGOX's reliability and adaptability. With its universally enhanced accuracy over the alternative methods for real-time state-level flu situation estimate, it appears that ARGOX could help timely, proper public health decision making for the local control of the disease.
Detailed numerical results for each state and for each flu season are reported in Tables S5-S55 and the figures in SI, where ARGOX holds lead over other methods in the vast majority of the cases, further revealing its robustness over geographical and seasonal variability in flu epidemics.
In addition to the point estimate, ARGOX also provides 95% confidence intervals for each week's estimates. For the entire period from 2014 to 2020, over all 51 states/district/city, the intervals provided by ARGOX successfully cover the actual %ILI in 92.5% of the cases (Table S1), which is close to the nominal 95%, demonstrating ARGOX's accurate uncertainty quantification.

Discussion
ARGOX effectively combines state-, regional-, and national-level publicly available data from Google searches and CDC's traditional flu surveillance system. It incorporates geographical and temporal correlation of flu activities to provide accurate, reliable real-time flu tracking at the state level. Across all the available states, ARGOX outperforms time-series-based benchmark models, GFT, and the method of Lu et al. (2019). ARGOX's weekly %ILI estimations are accompanied by reliable interval estimates as a measure for uncertainty. The state-level real-time tracking of flu epidemics by ARGOX could help public health officials be aware of potential epidemic hot spots and thereby optimize resource allocation across the nation.
ARGOX's adaptive pooling of the most-relevant information among the 51 US states/district/city plays an important role in its performance. To avoid the possibility of overfitting, a structured covariance matrix on the %ILI increments is utilized. Such structured dynamic modeling of the cross-state covariance serves to capture the ever-changing geographic spread pattern of the flu. It aggregates state-to-state, time-varying connectivity factors such as commuting traffic, airline frequency, geographic proximity, and climatic patterns. The utilization of cross-state correlation also helps pool information from different states, regions and the entire nation in addition to the information at a given state. The pooling from national and regional level estimates incorporates the shared seasonality component in flu trends across all the states, which further helps reduce the risk of overfitting.
The two-step design of ARGOX has broad applicability. The first step could be substituted by other models or include other data sources, while the second step remains adaptable for multi-resolution spatial-temporal boosting. A wide spectrum of flu estimation models, including susceptible-infectious-recovered-susceptible model [6], empirical Bayes method [15], Wisdom-of-crowds forecast [16], or ensemble of them [23] can be fitted into the cross-state boosting step (the second step) of ARGOX.
Like all big-data-based models, our result has certain limitations. ARGOX's accuracy depends on the reliability of its inputs -Google Trends data and historical %ILI data from CDC. Google Trends data have increasing amount of missing data and zero counts as the resolution goes from national to regional and state levels (Table S3). Such degeneracy in data quality is a challenge for high-resolution inference. Google search information could also be sensitive to media coverage [24,25]. Fortunately, the L 1 penalty and the dynamic training of ARGOX effectively addressed the sparsity and over-shooting problem of Google data. In addition, we should be aware that our estimation target, the CDC's %ILI, is only a proxy for the true flu incidence in the population, as it's calculated from a sample of outpatient visits with influenza-like symptoms. The reported %ILI at the state level could have (1) high noise due to its limited sample size, (2) subsequent revision when healthcare providers update their information, and (3) bias towards those with easy healthcare access. Nevertheless, accurate estimation of CDC's %ILI at the state level is valuable for optimizing resource allocations. More detailed discussion about the importance of alternative indicators for flu incidence in the population can be found in [26,27,28].
ARGOX is accurate, reliable, flexible and generalizable, making it adaptable to other spatial and temporal resolutions for tracking or forecasting other diseases and social/economic events that leave traces on people's Internet activity records. The ARGOX framework can be potentially adapted for COVID-19 tracking by incorporating additional coronavirus-related query terms at city, state, regional, and national level [29]. With the current development of COVID-19 pandemic, it is likely that the coronavirus would come back in the winter of 2020/2021. In light of this, accurate localized tracking of epidemic activity has become more important than ever before.

CDC's ILINet data
Every Friday, CDC releases a report of %ILI for the previous week, which gives the percent of outpatient visits with influenza-like illness for the whole nation, each HHS region, each state (except Florida), Washington DC, and New York City (separated from New York State) (http://www.cdc.gov/flu/weekly/overview.htm). CDC also revises the initial report numbers in the subsequent weeks when more information become available (gis.cdc.gov/grasp/ fluview/fluportaldashboard.html). Consequently, CDC's %ILI data lag behind real-time for up to 2 weeks and are less accurate for more recent weeks. CDC's %ILI data for this study were downloaded on Mar 27, 2020.

Google Data
The Internet search volume data from Google are publicly available through Google Trends (trends.google.com). A user can specify the desired query term, geographical location, and time frame on Google Trends; the website then will return a (weekly) time series in integer values from 0 to 100, which corresponds to the normalized search volume of the query term within the specified time frame, where 100 represents the historical maximum, and 0 represents missing data due to inadequate search intensity. This integer-valued time series from Google Trends is based on sampling Google's raw search logs.
The search query terms that we use are based on previous work for national and regional flu estimation [8,21]. We also included several additional queries and topics in this study, which were obtained from "Related queries" and "Related topics" on the Google Trends website when searching for flu related information. Table S2 in the Supplementary Material lists these search terms.

Regional-Enrichment of state-level Google search data
Google Trends provides (normalized) search volume data at both national and state levels. However, for the state-level data, there is a high level of sparsity (i.e., zero observations) among the returned integer-valued time series (see Table  S3). These zeros, which correspond to missing data due to inadequate search intensity, significantly lower the data quality at the state level (compared to the national level), which in turn severely reduces the prediction accuracy at the state level. To enhance the predictive power of state-level Google data, we use a simple approach to borrow information from the regional level. First, we reconstruct regional-level search frequency for each region in the US by weighting the state-level search frequencies within a given region, where the weights are proportional to the state's population. Second, instead of using the state-level Google Trends time-series, for each search term, we use a weighted average of the state-level search frequency (2/3 weight) and the regional-level search frequency (1/3 weight) as the input for state-level %ILI estimation. We carry out this regional-enrichment process for all states/district/city, except seven states -Hawaii (HI), Alaska (AK), Vermont (VT), Montana (MT), North Dakota (ND), Maine (ME), and South Dakota (SD)because these seven states are modeled with a separate stand-alone model (as detailed in the following sections). For these seven states, the raw Google Trends state-level times series, not the regional-enriched time series, are used as input.

Evaluation metrics
We use three metrics to evaluate the accuracy of an estimate against the actual %ILI released by CDC: the mean squared error (MSE), the mean absolute error (MAE), and the Pearson correlation (Correlation). MSE between an estimatê p t and the true value p t over period t = 1, . . . , T is 1 MAE between an estimatep t and the true value p t over period t = 1, . . . , T is 1 Correlation is the Pearson correlation coefficient between p = (p 1 , . . . ,p T ) and p = (p 1 , . . . , p T ).

Prediction model of ARGOX
ARGOX operates in two steps: the first step extracts Internet search information at the state level, and the second step enhances the estimates using cross-state and cross-resolution information.
At the second step, we take a dichotomous approach for the 51 US states/district/city (50 states except Florida, which does not have %ILI data, plus Washington DC and New York City). We set apart seven states: HI, AL, VT, MT, ND, ME, and SD. The first two (HI and AL) are geographically separated from the contiguous US. The last five (VT, MT, ND, ME, and SD) are the states that have the lowest multiple correlations (a.k.a. the R) in %ILI to the %ILI of the entire nation, the %ILI of the other states, and the %ILI of the other regions (detailed calculation method is given in Supplementary Material). A low multiple correlation of a state implies that the state's flu activity is not well correlated with other states' or other regions'. For these seven states, due to either the geological discontinuity or the low multiple correlation, it is not clear if using information cross the other states or other regions can help the state-level %ILI estimation. Therefore, we adopt the dichotomous approach: For the 44 states/district/city (the vast majority), we apply a joint estimation approach at the second step to enhance the state-level %ILI estimation by using all information, including information from other states and other regions; for the above-mentioned seven states, we use a stand-alone estimation approach at the second step to enhance the %ILI estimation (not using information from other states and regions). The two steps of ARGOX are detailed below.
First step: extracting Internet search information at the state level This step concerns extracting Google search information at each state. In particular, for a given state/district/city m, m = 1, . . . , 51, let X i,t,m be the logarithm of 1 plus the state-level Google Trends data of search term i at week t (note: 1 is added to each state-level Google Trends data point to avoid taking logarithm of zero); let y t,m be the logit-transformation of CDC's %ILI at time t for state m. To estimate y T,m , an L 1 regularized linear estimator is used in the first step based on the vector X T,m = (x i,T,m ): where the coefficients (β 0,m ,β m ) are obtained via arg min β0,m,βm We set N = 104, i.e., a two-year window, as recommended in previous studies [8,21,22]. We set λ through cross-validation.
In addition, we obtain an accurate estimatep nat T for the national %ILI by using the ARGO method [8], which uses national level Google search data. We also obtain an estimate (p reg T,1 , . . . ,p reg T,10 ) for the ten HHS regional %ILI by the ARGO2 method [21], which uses aggregated regional level Google search data.
Second step: joint model for the 44 states/district/city other than HI, AK, ND, VT, MT, ME, and SD For the 44 states, let p t = (p t,1 , . . . , p t,44 ) denote CDC's %ILI at the state level; they are related to y t,m through p t,m = exp(y t,m )/(1 + exp(y t,m )). Our raw estimate for p t from the first step isp GT t = (p t,1 , . . . ,p t,44 ) , wherê p t,m = exp(ŷ t,m )/(1 + exp(ŷ t,m )). Our estimate of the national %ILI from the first step isp nat t . Let the boldfacê p nat t denote the length-44 vectorp nat t = (p nat t , . . . ,p nat t ) . We also have the regional %ILI estimate (p reg t,1 , . . . ,p reg t,10 ) from the first step. Letp reg t denote the length-44 vectorp reg t = (p reg t,r1 , . . . ,p reg t,r44 ) , where r m is the region number for state m.
Estimating p t is equivalent to estimating the time series increment ∆p t = p t − p t−1 . We denote Z t = ∆p t for notational simplicity. For the estimation of Z t , we want to incorporate the cross-state, cross-source correlations. We have four predictors for Z t after the first step: ; they represent time series information, information from the state level Google search, information from the regional level estimation, and information from the national level estimation, respectively. Let W t denote the collection of these four vectors To combine the four predictors, we use the best linear predictor formed by them: where µ Z and µ W are the mean vectors of Z and W respectively, and Σ ZZ , Σ ZW , and Σ W W are the covariance matrices of and between Z and W . The best linear predictor gives the optimal way to linearly combine the four predictors to form a new one. The variance ofẐ t is Consistent with the first step, we adopt a sliding two-year training window to estimate µ Z , µ W , Σ ZZ , Σ ZW , and Σ W W in Eq. (2) and (3). For µ Z and µ W , we use the empirical mean of the corresponding variables as the estimates. However, for the covariance matrices, due to their large sizes and the small number of observations, we need to structure the covariance matrices for reliable estimation.
We assume the following structure: 1. The covariances between the time series increments satisfy Var This essentially assumes that the time series increments are stationary and have a stable autocorrelation across time and states. 2. Independence among the different sources of information: time series increment, the estimation error of the first-step state-level estimate, the estimation error of the regional estimate, and the estimation error of the national estimate, i.e., The covariance matrices are thereby simplified as: To further control the estimation stability, we incorporate a ridge-regression-inspired shrinkage [30] to the linear predictor (2), replacing the joint covariance matrix of (Z t , W t ) by the average of the structured covariance matrix and its empirical diagonal. Effectively, in Eq. (2), Σ ZW is replaced by 1 Σ ZZ , Σ nat , Σ reg , Σ GT and D W W are estimated by the corresponding sample covariance from the data in the most recent 2-year training window; ρ is estimated by minimizing the Frobenius norm (L 2 distance) between the empirical correlation and structured correlation. Based on Eq. (3), the variance estimate is similarly updated by Our final state-level %ILI estimate for week T after the second step is: with corresponding 95% interval estimate Second step: stand-alone model for HI, AK, ND, VT, MT, ME and SD For m ∈ {HI, AK, ND, VT, MT, ME, SD}, we take a stand-alone modeling approach. For each of these states, which is either non-contiguous or has the lowest multiple correlation with out-of-state %ILI (detailed in Supplementary Material), we focus on estimating the individual state's %ILI by integrating the within-state and national information in the second step. Thereby, our target is a scalar Z (m) t = p t,m − p t−1,m , the state's %ILI increment at the current week. The predictor vector in the second step for state m is W , where the regional terms are dropped. The best linear predictor with ridge-regression inspired shrinkage is then used to get the final estimateẐ The corresponding covariance matrices between the components Σ (m) W W ) are estimated by the corresponding sample covariance from the data in the most recent 2-year training window.
The final state-level %ILI estimate for week T after the second step for m ∈ {HI, AK, ND, VT, MT, ME, SD} is: with corresponding 95% interval estimate

Availability of data and material
All analyses were performed with the R statistical software [31]. The R package that implements the ARGOX method is available on CRAN at https://cran.r-project.org/web/packages/argo/, which uses the glmnet package [32]. All datasets analyzed in the current study are available in the Harvard Dataverse repository, doi:XXX/XXX/XXXX.

Supplementary Material
This Supplementary Material is organized as following: (1) the detailed calculation procedure for the multiple correlation behind the stand-alone modeling of HI, AK, ND, VT, MT, ME, and SD is presented; (2) a table for the confidence interval coverage is presented; (3) all the Google query terms used in this study are listed; (4) Google Trends data quality at different geographic area is studied; (5) full comparison to another Google-search-based benchmark method is presented; (6) detailed estimation results for each of 51 studied states/district/city are reported in tables and plotted in figures.

Multiple correlation
For each state, the multiple correlation of its flu activity level to the other states', other regions' and the national flu activity levels is calculated as follows. First, the states of HI and AK are excluded because they are not part of the contiguous US; the state of FL is excluded because FL data is not available from CDC. Then for the in-sample time period of 2010-10-09 to 2014-09-27, we regress each state's %ILI to (i) all other 48 states' %ILI (including DC and NYC but excluding FL, HI, and AK), (ii) all the other 9 regions' %ILI (i.e., regions other than the one that the specific state belongs to), and (iii) the national %ILI. After the regression, we obtain the R-squared, which is the square of multiple correlation. The five states with the lowest multiple correlations are ND, VT, MT, ME, and SD. We, therefore, would not use spatial pooling on HI, AK, ND, VT, MT, ME, and SD. Instead, we only use the state-specific data together with national level data for cross-resolution boosting on those seven aforementioned states in the second step of ARGOX.

Confidence interval coverage
We study the goodness of our confidence intervals by examining its actual coverage (coverage of the actual %ILI released by CDC weeks later). The result is shown in Table S1. In general, the coverage of 95% confidence interval is quite close to the nominal value, suggesting that our model quantifies the uncertainty reasonably well.   Table S2 lists all query terms/phrases used in this study. Most of them are taken from previous studies [8,21] with a few additional terms identified through "Related topics" and "Related queries" from Google Trends when search for flu-related information. Table S2: All search query terms used in this study. The last 21 terms separated by a horizontal line from the first 140 terms are new "Related topics" and "Related queries" identified from Google Trends.  Google Trends data quality As stated at trends.google.com, the numbers in Google Trends "represent search interest relative to the highest point on the chart for the given region and time; a value of 100 is the peak popularity for the term; a value of 50 means that the term is half as popular; a score of 0 means there was not enough data for this term." As such, the proportion of zeros in the Google Trends data reflects the data quality: higher proportion of zeros indicates lower quality of Google Trends data. Table S3 summarizes the average proportion of zeros for the query terms listed in Table S2 in each of the geographic areas. As we can see, Google Trends data at the US national level have far fewer zeros than any of the states, implying a significant drop in quality from national-level data to state-level data. Table S3: Average proportion of zeros in Google Trends data for the query terms in Table S2. Higher proportion of zeros indicates lower quality of Google Trends data since "a score of 0 means there was not enough data for this term" (trends.google.com). The proportion of zeros in Google Trends at the US national level is in the upper sub-  Overall '14-'17 '14-'15 ' Table S34: Comparison of different methods for state-level %ILI estimation in New Jersey (NJ). The MSE, MAE, and correlation are reported. The method with the best performance is highlighted in boldface for each metric in each period.