Abstract
For epidemic control and prevention, timely insights into potential hot spots are invaluable. As an alternative to traditional epidemic surveillance, which often lags behind real time by weeks, big data from the Internet provide important information on current epidemic trends. Here we present a methodology, ARGOX (Augmented Regression with GOogle data CROSS space), for accurate real-time tracking of state-level influenza epidemics in the United States. ARGOX combines Internet search data at the national, regional and state levels with traditional influenza surveillance data from the Centers for Disease Control and Prevention, and accounts for both the spatial correlation structure of state-level influenza activities and the evolution of people’s Internet search patterns. ARGOX achieves on average 28% error reduction over the best alternative for real-time state-level influenza estimation for 2014 to 2020. ARGOX is robust and reliable and can be potentially applied to track county- and city-level influenza activity and other infectious diseases.
Introduction
Each year in the United States (US) alone, the seasonal influenza (flu) epidemics may cause up to 61,000 deaths^{1}. Quick responses and preventive actions to changes in flu epidemics rely on timely and accurate information on the current flu severity. In particular, due to the geographically varying timing and intensity of disease epidemics, most public health decisions and executive orders for disease control and prevention are made at the state or local level. Accurate real-time flu tracking at the state/local level is thus indispensable. Traditional flu surveillance, such as that conducted by the US Centers for Disease Control and Prevention (CDC), however, often lags behind real time by up to two weeks. Here we propose a statistically principled, self-coherent framework ARGOX (Augmented Regression with GOogle data CROSS space) for real-time, accurate flu estimation at the state level. ARGOX efficiently combines publicly available Internet search data with traditional flu surveillance data and coherently utilizes the data from multiple geographical resolutions (national, regional, and state levels).
For the last two decades, tracking of flu activities in the US has mainly relied on traditional surveillance systems, such as the US Outpatient Influenza-like Illness Surveillance Network (ILINet) by the CDC. Through the ILINet, thousands of healthcare providers across the US report their numbers of outpatients with Influenza-like Illness (ILI) to CDC on a weekly basis. CDC then aggregates the data and publishes the ILI percentages (%ILI, i.e., the percentages of outpatients with ILI) in its weekly reports at the national and regional levels (there are ten Health and Human Services (HHS) regions in the US, each consisting of multiple states). Starting from 2017, the state-level %ILI reports became available for selected states, and in late 2018 the state-level %ILI reports became available for all states except Florida. Owing to the time for administrative processing and aggregation, CDC’s flu reports typically lag behind real time by up to 2 weeks and are also subject to subsequent revisions. Such delay and inaccuracy are far from optimal for public health decision making, especially in the face of epidemic outbreaks or pandemics.
Big data from the Internet offer the potential of real-time tracking of public health or social events. In fact, valuable insights have been gained from Internet data about the current social and economic status of a nation, including epidemic outbreaks^{2,3} and macroeconomic indices^{4,5}. Furthermore, real-time data from the Internet could also offer insights at the regional, state, or local level. Examples include foreshadowing the state-wise housing price index in the US^{6}, estimating New York City flu activity^{7}, and estimating real-time county-level unreported COVID-19 severity in the US^{8}, among others. For epidemic surveillance, such real-time digital data at the local level can potentially be used to provide insights for early epidemic hotspot detection and timely public health resource allocation (e.g. vaccine campaigns) as well as to gather information on the overall disease prevalence.
Various models have been proposed to utilize Internet data, especially Internet search volume data, to provide real-time estimation of the current flu activity at the national level. Google Flu Trends (GFT), as one of the early examples, uses the search frequency of selected query terms from Google to estimate the real-time %ILI^{2}. Recent models on combining CDC’s surveillance data with Internet-derived data appear to work well at the national level^{9,10}. Other methods, primarily targeting national flu epidemics, were also developed based on traditional epidemiology data and mechanistic models, such as the susceptible-infectious-recovered-susceptible model with ensemble adjustment Kalman filter (SIRS-EAKF)^{7,11,12,13,14}.
Compared to estimation at the national level, %ILI estimation at the regional or state level is much more challenging, as documented by FluSight, the CDC-sponsored Flu Prediction Initiative^{15}. Due to factors like geographical proximity, transportation connectivity, and public health communication, the state-wise epidemic spread exhibits strong spatial structure. However, many digital flu estimation methods^{12,16,17}, including GFT, ignore such spatial structure and apply the same national-level method to regional and/or state-level flu estimation. A few attempts have been made to incorporate the geographical dependence structure. For example, Ref.^{18} studied the estimation of ILI activity in the boroughs and neighborhoods of New York City using a traditional epidemiological mechanistic SIRS-network model without Internet search data, where the dynamic system is multivariate with explicit parameters to characterize traffic between locales, and concluded that the spatial network is helpful at the borough scale but not at the neighborhood scale; Ref.^{19} utilized an ordinary-least-squares-based network model to improve upon the output of GFT, where a weighted average of GFT from all regions is produced as a network-enhanced final estimate for each individual region; Ref.^{20} employs a multi-task nonlinear regression method for regional %ILI estimation, where a Multi-Task Gaussian Process is proposed to regress each region’s %ILI on the corresponding Google search data; Ref.^{21} uses a network approach for %ILI estimation in a few selected states, where they first built a stand-alone state %ILI prediction based on the ARGO method^{9}, and then obtained a multiple linear regression prediction for a given state’s %ILI from other states’ %ILI, and finally a winner-takes-all approach was adopted for each state separately to select one of the two approaches; Ref.^{22} shows that careful spatial structure modeling can lead to much improved accuracy in %ILI estimation at the regional level. An ensemble approach has also been proposed to utilize the output of a variety of available models to achieve better accuracy^{23}.
Nevertheless, at the state level, no existing methods provide real-time flu tracking with satisfactory accuracy and reliability. (i) There are no unified approaches to combine multi-resolution and cross-state information effectively to provide national, regional and state-level estimates within the same framework. (ii) Few existing models can outperform a naive estimation method, which, for each state, without any modeling effort, simply uses CDC’s reported %ILI from the previous week as the %ILI estimate for the current week (see Fig. 1 for an illustration). This would be particularly worrisome for public health officials who rely on accurate flu estimation at the local level to make informed decisions.
In this article we introduce ARGOX, a unified spatial-temporal statistical framework that combines multi-resolution, multi-source information to provide real-time state-level %ILI estimates while maintaining coherency with %ILI estimation at the regional and national levels (in a cascading fashion). To illustrate the underlying idea of ARGOX, let us take estimating the %ILI in California as an example. The real-time Google search volumes for flu-related terms like "flu symptoms" or "flu duration" from California reflect its current state-level flu intensity to some extent. In addition, California’s flu epidemics could be highly correlated with flu epidemics of nearby states such as Oregon and Nevada, as well as with geographically distant but transportation-wise well-connected states such as Illinois. California’s current flu situation may also depend heavily on the recent trends of flu epidemics, in particular, the overall national and Pacific-west regional flu trends. Taking these considerations together, ARGOX operates in two steps: at the first step, it extracts Google search information of the most relevant query terms at three geographical resolutions (national, regional, and state levels); at the second step, the cross-time, cross-resolution, cross-state information mentioned above, together with Internet-extracted information, is integrated through careful modeling of their temporal-spatial dependence structure, which yields significant enhancement in the estimation accuracy.
ARGOX was inspired in part by Refs.^{9} and^{22}, which studied the %ILI estimation at the national and regional levels respectively. Although the methods introduced in Refs.^{9} and^{22} worked well for flu tracking at the national or regional level, these methods cannot be directly applied to accurately track state-level %ILI for a number of reasons, which are specifically addressed by ARGOX. In particular, ARGOX addresses the following issues: (i) how to simultaneously provide accurate, real-time flu tracking at the higher-resolution level for all 51 US states (district/city), as opposed to only at the national or regional level; (ii) how to effectively combine multi-resolution information from the national, regional and state levels for state %ILI estimation, i.e., how to leverage the information from the national and regional levels, in addition to the information at a particular state, for the %ILI estimation at a given state; (iii) how to solve the challenge of declining quality of Internet search data at higher geographical resolution, since compared to the Internet search data at the national level, the state-level Internet search data are of much inferior quality; (iv) how to determine when to borrow information from other states for the %ILI estimation at a given state and when not to borrow, since the states have varying degrees of connection: for a state well connected with others, borrowing information probably would help its %ILI estimation, but for a state not (geographically or epidemically) well connected with others, using information from other states might hurt (as opposed to help) its %ILI estimation; and (v) how to model the correlation structure of %ILI across the “well-connected” states to effectively borrow such cross-state information to improve prediction accuracy.
ARGOX, therefore, significantly advances accurate flu tracking from the national and regional levels to the state level, which could help public health officials make much more informed decisions.
Through the ARGOX framework, the state-level flu activity estimates are produced in a unified and coherent way with the national and regional estimates. ARGOX achieves on average 28% mean squared error (MSE) reduction compared to the best alternative and shows strong advantages over all benchmark methods, including GFT, time-series-based vector autoregression (VAR), and another recent Internet-search-based method developed in Lu et al.^{21}. ARGOX achieves its high estimation accuracy through a few features: (i) it automatically selects the most relevant search queries to address the problem of lower-quality Google search information at state or regional level; (ii) it incorporates time-series momentum of flu activity; (iii) it pools the multi-resolution information by combining the national, regional, and state-level data; (iv) it explicitly models the spatial correlation structure of state-level flu activities; (v) it adapts to the evolution in people’s search pattern, Google’s search engine algorithms, epidemic trends, and other time-varying factors^{24} with a dynamic two-year rolling window for training; and (vi) it achieves selective pooling of most immediately relevant information for a handful of stand-alone states (details in Methods).
Results
We conducted retrospective estimation of the weekly %ILI at the US state level (50 states excluding Florida, whose ILI data are not available from CDC, plus Washington DC and New York City) for the period of Oct 11, 2014 to March 21, 2020. For each week during this period, we only used the data that would have been available (the historical CDC’s ILI reports up to the previous week and Google search data up to the current week) to estimate state-level %ILI of the current week. To evaluate the accuracy of our estimation, we compared the estimates with the actual %ILI released by CDC weeks later in multiple metrics, including the mean squared error (MSE), the mean absolute error (MAE), and the correlation with the actual %ILI (detailed in Methods). We also compared the performance of ARGOX with several benchmark methods, including (a) GFT (last estimate available: the week ending on August 15, 2015), (b) estimates by the lag-1 vector autoregressive model (VAR model), (c) the naive estimates, which for each state without any modeling effort simply use CDC’s reported %ILI of the previous week as the estimate for the current week, and (d) a recent Internet-search-based state-level estimation model developed in Lu et al.^{21}. As ARGOX uses a two-year training window, for fair comparison we keep the same two-year training window for VAR as well. Also for fair comparison, the numerical results of the method of Lu et al.^{21} were directly quoted from the article (which reported results through May 14, 2017).
Table 1 summarizes the overall results of ARGOX, VAR, GFT, and the naive method, averaging over the 51 states/district/city for the whole period of 2014 to 2020 (up to March 21, 2020). Table 2 summarizes the comparison between ARGOX and the method of Lu et al.^{21}, averaging over 37 states for the period of 2014 to 2017. We need to compare ARGOX with Lu et al.^{21} in a separate Table 2 because the results of Lu et al.^{21} are only available for 37 states and only for the period of 2014 to 2017.
Table 1 shows that ARGOX gives the leading performance uniformly through all flu seasons in all metrics. Particularly, ARGOX achieves up to 28% error reduction in MSE and about 15% error reduction in MAE compared to the best alternative in the whole period. ARGOX also keeps consistent season-by-season performance, with at least 15% error reduction in MSE compared to the best alternative method in every season from 2014 to 2019. For the 2019–2020 flu season with the (onset of) COVID-19 pandemic, ARGOX still maintains its accuracy. Compared with other benchmarks, ARGOX’s advantages in state-level flu tracking are substantial. VAR and GFT fail to outperform the naive method in any of the evaluated flu seasons; both methods have MSE two or three times larger than the naive method. Table 2 shows that ARGOX also uniformly outperforms Lu et al.^{21} in all three seasons when the benchmark is available. More detailed results comparing ARGOX with the benchmarks can be found in the Supplementary Information (Table S4). The advantage of ARGOX over the method of Lu et al.^{21} could be attributed to (i) incorporating multi-resolution information in the modeling that pools national, regional and state-level information together, (ii) capturing the spatio-temporal information using one joint statistically structured variance-covariance matrix as opposed to ad hoc regression of each individual state’s %ILI on other states’, and (iii) using a statistically principled and interpretable method to dichotomously select between either joint modeling for statistically “connected” states or stand-alone modeling for statistically/geographically “disconnected” states.
Among all the methods that we numerically compared, ARGOX is the only one that uniformly outperforms the naive method in all 51 states/district/city in terms of MSE for the whole period of evaluation. Figure 1 plots the state-by-state estimation results, showing the ratio of the MSE of a given method to the MSE of the naive method. The results of four methods are plotted: ARGOX, VAR, GFT, and Lu et al.^{21}. For each state, a blue color means that the MSE of a method is smaller (better) than the MSE of the naive method for that state, and a red color means the MSE of the method is larger (worse) than the MSE of the naive method. Darker blue means more advantage over the naive method, while darker red means more disadvantage relative to the naive method. It is noteworthy that ARGOX, with all blue colors, is the only method that gives uniformly better performance than the naive method across all states. All other methods in comparison fail to do so for a large portion of the states investigated. Note that the naive method provides a model-free baseline benchmark that solely relies on information from CDC’s flu reports. Therefore, ARGOX is the only method that effectively utilizes the Internet data to uniformly improve flu tracking from the traditional surveillance system, indicating ARGOX’s reliability and adaptability. With its universally enhanced accuracy over the alternative methods for real-time state-level flu situation estimates, it appears that ARGOX could aid timely, proper public health decision making for the local monitoring and control of the disease.
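The MSE ratio plotted in Figure 1 can be illustrated with a minimal Python sketch. The series `ili` and `model_est` below are made-up toy numbers, not data from the study, and the function names are hypothetical; the sketch only assumes that the naive estimate for a week is the previous week's reported %ILI, as described above.

```python
def mse(est, truth):
    """Mean squared error between two equally long sequences."""
    return sum((e - t) ** 2 for e, t in zip(est, truth)) / len(truth)

def naive_estimates(ili):
    """Naive method: the estimate for week t is the reported %ILI of week t-1."""
    return ili[:-1]

def mse_ratio_to_naive(model_est, ili):
    """Ratio of a model's MSE to the naive MSE over weeks 1..T-1.

    model_est[k] is the model's estimate for week k+1.
    A ratio below 1 means the model beats the naive method.
    """
    truth = ili[1:]
    return mse(model_est, truth) / mse(naive_estimates(ili), truth)

# Toy %ILI series for one hypothetical state (illustrative values only).
ili = [1.0, 1.5, 2.5, 2.0, 1.0]
model_est = [1.4, 2.3, 2.2, 1.1]
ratio = mse_ratio_to_naive(model_est, ili)
print(round(ratio, 4))
```

In this toy case the ratio is well below 1, which would correspond to a dark blue cell in a plot of the kind described for Figure 1.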
Detailed numerical results for each state and for each flu season are reported in Tables S5–S55 and the figures in Supporting Information (SI), where ARGOX holds lead over other methods in the vast majority of the cases, further revealing its robustness over geographical and seasonal variability in flu epidemics.
In addition to the point estimate, ARGOX also provides 95% confidence intervals for each week’s estimates. For the entire period from 2014 to 2020, over all 51 states/district/city, the intervals provided by ARGOX successfully cover the actual %ILI in 92.5% of the cases (Table S1), which is close to the nominal 95%, demonstrating ARGOX’s accurate uncertainty quantification.
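As a hedged illustration of how such an empirical coverage rate can be computed, the Python sketch below counts the fraction of weeks in which the reported %ILI falls inside the interval. The interval bounds and %ILI values are invented for the example, and `coverage` is a hypothetical helper name, not part of the authors' code.

```python
def coverage(lower, upper, truth):
    """Fraction of weeks whose true value lies within [lower, upper]."""
    hits = sum(1 for lo, hi, p in zip(lower, upper, truth) if lo <= p <= hi)
    return hits / len(truth)

# Toy interval estimates and reported %ILI for four weeks (illustrative only).
lower = [0.8, 1.2, 2.0, 1.5]
upper = [1.6, 2.0, 3.0, 2.1]
truth = [1.0, 1.9, 3.2, 2.0]
rate = coverage(lower, upper, truth)
print(rate)
```

A rate close to the nominal level (95% in the paper) indicates well-calibrated intervals; a rate far below it would indicate intervals that are too narrow.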
Discussion
ARGOX effectively combines state-, regional-, and national-level publicly available data from Google searches and CDC’s traditional flu surveillance system. It incorporates geographical and temporal correlation of flu activities to provide accurate, reliable real-time flu tracking at the state level. Across all the available states, ARGOX outperforms time-series-based benchmark models, GFT, and the method of Lu et al.^{21}. ARGOX’s weekly %ILI estimations are accompanied by reliable interval estimates as a measure of uncertainty. The state-level real-time tracking of flu epidemics by ARGOX could help public health officials and the general public to make more informed decisions to control and prevent the flu epidemics at the state or local levels. In particular, with the real-time estimates of flu activities by ARGOX in their home states and neighboring states, local public health officials could make more appropriate and timely decisions on the allocation of relevant resources, such as vaccines, hospitalization, medical equipment, personnel, etc. Also, informed with the current local flu situation provided by ARGOX, the general public could take necessary measures accordingly, such as taking the flu shot, social distancing, and mask wearing to reduce the risk of contracting flu; knowing the real-time flu severity in other states could help the general public make travel decisions and plan/arrange care for relatives and friends. More discussion on the usefulness of influenza forecasts to public health decision making can be found in Ref.^{25} and Ref.^{23}.
ARGOX’s adaptive pooling of the most relevant information among the 51 US states/district/city plays an important role in its performance. To avoid the possibility of overfitting, a structured covariance matrix on the %ILI increments is utilized. Such structured dynamic modeling of the cross-state covariance serves to capture the ever-changing geographic spread pattern of the flu. It aggregates state-to-state, time-varying connectivity factors such as commuting traffic, airline frequency, geographic proximity, and climatic patterns. The utilization of cross-state correlation also helps pool information from different states, regions and the entire nation in addition to the information at a given state. The pooling from national and regional level estimates incorporates the shared seasonality component in flu trends across all the states, which further helps reduce the risk of overfitting.
ARGOX operates in two steps: the first step extracts Internet search information at the state level, and the second step enhances the estimates using cross-state and cross-resolution information (detailed in Methods). Such a two-step design of ARGOX has broad applicability. With the general availability of ubiquitous Internet search data, ARGOX’s two-step framework could be flexibly adapted to track flu activities at even higher resolutions, such as county or city levels, when such weekly %ILI data become available. In addition, the first step could be substituted by other models or include other data sources, while the second step remains adaptable for multi-resolution spatial-temporal boosting. A wide spectrum of flu estimation models, including the susceptible-infectious-recovered-susceptible model^{7}, the empirical Bayes method^{16}, the wisdom-of-crowds forecast^{17}, or an ensemble of them^{26}, can be fitted into the cross-state boosting step (the second step) of ARGOX.
Like all big-data-based models, our result has certain limitations. ARGOX’s accuracy depends on the reliability of its inputs: Google Trends data and historical %ILI data from CDC. Google Trends data have an increasing amount of missing data and zero counts as the resolution goes from national to regional and state levels (Table S3). Such degradation in data quality is a challenge for high-resolution inference. Google search information could also be sensitive to media coverage^{27,28,29}. Furthermore, Google search data may only be representative of the search interests among Google users rather than the entire population. In states with less Internet penetration, such Google search data may be less predictive of the overall %ILI. The \(L_1\) penalty and the dynamic training of ARGOX aim to correct for the sparsity, overshooting, and representativeness issues of Google data, where only the most relevant search terms to %ILI estimation are selected at each state’s level. Models to further alleviate or eliminate the bias in Internet search data (e.g. by incorporating data on media coverage intensity) could be an interesting future direction. In addition, we should be aware that our estimation target, the CDC’s %ILI, is only a proxy for the true flu incidence in the population, as it is calculated from a sample of outpatient visits with influenza-like symptoms. The reported %ILI at the state level could have (i) high noise due to its limited sample size, (ii) subsequent revision when healthcare providers update their information, and (iii) bias towards those with easy healthcare access. Nevertheless, accurate estimation of CDC’s %ILI at the state level is valuable for optimizing resource allocations. More detailed discussion about the importance of alternative indicators for flu incidence in the population can be found in Refs.^{30,31,32}.
ARGOX is accurate, reliable, flexible and generalizable, making it adaptable to other spatial and temporal resolutions for tracking or forecasting other diseases and social/economic events that leave traces on people’s Internet activity records. The ARGOX framework can be potentially adapted for COVID-19 tracking by incorporating additional coronavirus-related query terms at city, state, regional, and national levels^{33}. Given the current development of the COVID-19 pandemic, it is likely that the coronavirus will return in future winters. In light of this, accurate localized tracking of epidemic activity has become more important than ever before.
Methods
CDC’s ILINet data
Every Friday, CDC releases a report of %ILI for the previous week, which gives the percent of outpatient visits with influenza-like illness for the whole nation, each HHS region, each state (except Florida), Washington DC, and New York City (separated from New York State) (http://www.cdc.gov/flu/weekly/overview.htm). CDC also revises the initial report numbers in the subsequent weeks when more information becomes available (gis.cdc.gov/grasp/fluview/fluportaldashboard.html). Consequently, CDC’s %ILI data lag behind real time by up to 2 weeks and are less accurate for more recent weeks. CDC’s %ILI data for this study were downloaded on Mar 27, 2020.
Google data
The Internet search volume data from Google are publicly available through Google Trends (trends.google.com). A user can specify the desired query term, geographical location, and time frame on Google Trends; the website then returns a (weekly) time series in integer values from 0 to 100, which corresponds to the normalized search volume of the query term within the specified time frame, where 100 represents the historical maximum, and 0 represents missing data due to inadequate search intensity. This integer-valued time series from Google Trends is based on sampling Google’s raw search logs.
The search query terms that we use are based on previous work for national and regional flu estimation^{9,22}. We also included several additional queries and topics in this study, which were obtained from “Related queries” and “Related topics” on the Google Trends website when searching for flu-related information. Table S2 in the Supplementary Information lists these search terms.
As one benchmark, we downloaded the discontinued Google Flu Trends (GFT) data (https://www.google.org/flutrends/about/data/flu/us/data.txt). GFT has national, regional, and state-level predictions for the weekly %ILI from Jan 1, 2004 to August 9, 2015.
Google search data may only be representative of the search interests among Google users rather than the entire population. ARGOX attempts to correct for such potential bias in the modeling.
Regional enrichment of state-level Google search data
Google Trends provides (normalized) search volume data at both national and state levels. However, for the state-level data, there is a high level of sparsity (i.e., zero observations) among the returned integer-valued time series (see Table S3). These zeros, which correspond to missing data due to inadequate search intensity, significantly lower the data quality at the state level (compared to the national level), which in turn severely reduces the prediction accuracy at the state level. To enhance the predictive power of state-level Google data, we use a simple approach to borrow information from the regional level. First, we reconstruct the regional-level search frequency for each region in the US by weighting the state-level search frequencies within a given region, where the weights are proportional to the state’s population. Second, instead of using the state-level Google Trends time series, for each search term, we use a weighted average of the state-level search frequency (2/3 weight) and the regional-level search frequency (1/3 weight) as the input for state-level %ILI estimation. We carry out this regional-enrichment process for all states/district/city except seven states, namely Hawaii (HI), Alaska (AK), Vermont (VT), Montana (MT), North Dakota (ND), Maine (ME), and South Dakota (SD), because these seven states are modeled with a separate stand-alone model (as detailed in the following sections). For these seven states, the raw Google Trends state-level time series, not the regional-enriched time series, are used as input.
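The two enrichment steps described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the state labels, populations, and search volumes are invented, the function names are hypothetical, and the population weights are assumed to be normalized so that they sum to one within a region.

```python
def regional_series(state_series, population):
    """Population-weighted average of the state-level series within one region.

    state_series: dict mapping state -> list of weekly search volumes.
    population:   dict mapping state -> population (weights are normalized here).
    """
    total = sum(population[s] for s in state_series)
    weeks = len(next(iter(state_series.values())))
    return [
        sum(population[s] * state_series[s][t] for s in state_series) / total
        for t in range(weeks)
    ]

def enrich(state, regional, w_state=2 / 3):
    """Blend state-level (2/3) and regional-level (1/3) search frequency."""
    return [w_state * s + (1 - w_state) * r for s, r in zip(state, regional)]

# Toy region with two hypothetical states and two weeks of one search term.
series = {"A": [0.0, 40.0], "B": [20.0, 80.0]}
pop = {"A": 3.0, "B": 1.0}
region = regional_series(series, pop)      # [5.0, 50.0]
enriched_A = enrich(series["A"], region)
print(region, enriched_A)
```

Note how the zero in state A's first week, which would represent missing data in Google Trends, is replaced by a small positive value borrowed from the region.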
Evaluation metrics
We use three metrics to evaluate the accuracy of an estimate against the actual %ILI released by CDC: the mean squared error (MSE), the mean absolute error (MAE), and the Pearson correlation (Correlation). MSE between an estimate \({\hat{p}}_t\) and the true value \(p_t\) over period \(t=1,\ldots , T\) is \(\frac{1}{T}\sum _{t=1}^T \left( {\hat{p}}_t - p_t\right) ^2\). MAE between an estimate \({\hat{p}}_t\) and the true value \(p_t\) over period \(t=1,\ldots , T\) is \(\frac{1}{T}\sum _{t=1}^T \left| {\hat{p}}_t - p_t\right|\). Correlation is the Pearson correlation coefficient between \(\hat{{\varvec{p}}}=({\hat{p}}_1, \dots , {\hat{p}}_T)\) and \({\varvec{p}}=(p_1,\dots , p_T)\).
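For concreteness, the three metrics can be written out in a few lines of plain Python. The sketch below follows the definitions in this section; the function names and the toy inputs are illustrative assumptions.

```python
import math

def mse(est, truth):
    """Mean squared error over t = 1..T."""
    return sum((e - p) ** 2 for e, p in zip(est, truth)) / len(truth)

def mae(est, truth):
    """Mean absolute error over t = 1..T."""
    return sum(abs(e - p) for e, p in zip(est, truth)) / len(truth)

def pearson(est, truth):
    """Pearson correlation coefficient between estimates and true values."""
    n = len(truth)
    mean_e = sum(est) / n
    mean_p = sum(truth) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(est, truth))
    var_e = sum((e - mean_e) ** 2 for e in est)
    var_p = sum((p - mean_p) ** 2 for p in truth)
    return cov / math.sqrt(var_e * var_p)

# Toy estimates and true values (illustrative only).
est = [1.0, 2.0, 3.0]
truth = [1.0, 2.0, 5.0]
print(mse(est, truth), mae(est, truth), pearson(est, truth))
```

MSE penalizes large misses more heavily than MAE, which is why the two can rank methods differently during sharp epidemic peaks.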
Prediction model of ARGOX
ARGOX operates in two steps: the first step extracts Internet search information at the state level, and the second step enhances the estimates using cross-state and cross-resolution information.
At the second step, we take a dichotomous approach for the 51 US states/district/city (50 states except Florida, which does not have %ILI data, plus Washington DC and New York City). We set apart seven states: HI, AK, VT, MT, ND, ME, and SD. The first two (HI and AK) are geographically separated from the contiguous US. The last five (VT, MT, ND, ME, and SD) are the states that have the lowest multiple correlations (a.k.a. the R) in %ILI to the %ILI of the entire nation, the %ILI of the other states, and the %ILI of the other regions (detailed calculation method is given in Supplementary Information). A low multiple correlation of a state implies that the state’s flu activity is not well correlated with other states’ or other regions’. For these seven states, due to either the geographical discontinuity or the low multiple correlation, it is not clear if using information across the other states or other regions can help the state-level %ILI estimation. Therefore, we adopt the dichotomous approach: For the 44 states/district/city (the vast majority), we apply a joint estimation approach at the second step to enhance the state-level %ILI estimation by using all information, including information from other states and other regions; for the above-mentioned seven states, we use a stand-alone estimation approach at the second step to enhance the %ILI estimation (not using information from other states and regions). The two steps of ARGOX are detailed below.
First step: extracting Internet search information at the state level
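This step concerns extracting Google search information at each state. In particular, for a given state/district/city m, \(m= 1, \dots , 51\), let \(X_{i,t,m}\) be the logarithm of 1 plus the state-level Google Trends data of search term i at week t (note: 1 is added to each state-level Google Trends data point to avoid taking the logarithm of zero); let \(y_{t,m}\) be the logit transformation of CDC’s %ILI at time t for state m. To estimate \(y_{T,m}\), an \(L_1\) regularized linear estimator is used in the first step based on the vector \({\varvec{X}}_{T,m} = (X_{i,T,m})\):
\[{\hat{y}}_{T,m} = {\hat{\beta }}_{0,m} + {\varvec{X}}_{T,m}^\intercal \hat{\varvec{\beta }}_m, \tag{1}\]
where the coefficients \(({\hat{\beta }}_{0,m}, \hat{\varvec{\beta }}_m)\) are obtained via
\[({\hat{\beta }}_{0,m}, \hat{\varvec{\beta }}_m) = \mathop {\mathrm{argmin}}\limits _{\beta _{0,m}, {\varvec{\beta }}_m} \sum _{t=T-N}^{T-1} \left( y_{t,m} - \beta _{0,m} - {\varvec{X}}_{t,m}^\intercal {\varvec{\beta }}_m\right) ^2 + \lambda \Vert {\varvec{\beta }}_m\Vert _1.\]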
We set \(N=104\), i.e., a two-year window, as recommended in previous studies^{9,22,24}. We set \(\lambda\) through cross-validation.
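The first step can be illustrated with a small NumPy sketch: log-transform the search volumes, logit-transform %ILI, fit an \(L_1\)-penalized linear model on the last \(N=104\) weeks, and map the prediction back with the inverse logit. Everything below is an assumption made for illustration rather than the authors' code: the solver is a basic coordinate-descent lasso written out by hand, \(\lambda\) is fixed instead of cross-validated, %ILI is treated as a proportion in (0, 1), and the data are simulated.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def inv_logit(y):
    return 1.0 / (1.0 + np.exp(-y))

def soft_threshold(z, gamma):
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_t (y_t - b0 - X_t b)^2 + lam * ||b||_1 by coordinate descent.

    The intercept b0 is unpenalized, so columns are centered first.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_ss = (Xc ** 2).sum(axis=0)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if col_ss[j] == 0.0:          # constant column, e.g. all-zero searches
                beta[j] = 0.0
                continue
            partial = yc - Xc @ beta + Xc[:, j] * beta[j]
            beta[j] = soft_threshold(Xc[:, j] @ partial, lam / 2.0) / col_ss[j]
    return y_mean - x_mean @ beta, beta

def first_step_estimate(trends, ili, trends_now, lam, N=104):
    """Estimate the current %ILI (as a proportion) from current search volumes."""
    X = np.log1p(trends[-N:])             # log(1 + search volume)
    y = logit(ili[-N:])                   # logit of %ILI
    b0, beta = lasso_cd(X, y, lam)
    return inv_logit(b0 + np.log1p(trends_now) @ beta)

# Simulated data: 105 weeks, 5 hypothetical search terms, noise-free signal.
rng = np.random.default_rng(0)
trends = rng.integers(0, 101, size=(105, 5)).astype(float)
ili = inv_logit(-4.0 + 0.3 * np.log1p(trends[:, 0]))
p_hat = first_step_estimate(trends[:-1], ili[:-1], trends[-1], lam=0.0)
print(p_hat, ili[-1])
```

With a positive `lam`, coefficients of uninformative terms are shrunk to exactly zero, which is the mechanism by which only the most relevant search terms are retained.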
In addition, we obtain an accurate estimate \({\hat{p}}^{nat}_T\) for the national %ILI by using the ARGO method^{9}, which uses national level Google search data. We also obtain an estimate \(({\hat{p}}_{T, 1}^{reg}, \dots ,{\hat{p}}_{T,10}^{reg})\) for the ten HHS regional %ILI by the first step of ARGO2 method^{22}, which uses aggregated regional level Google search data.
Second step: joint model for the 44 states/district/city other than HI, AK, ND, VT, MT, ME, and SD
For the 44 states, let \({\varvec{p}}_t=(p_{t,1},\dots , p_{t,44})^\intercal\) denote CDC’s %ILI at the state level; they are related to \(y_{t,m}\) through \(p_{t,m}=\exp (y_{t,m})/(1+\exp (y_{t,m}))\). Our raw estimate for \({\varvec{p}}_t\) from the first step is \(\hat{\varvec{p}}^{GT}_{t} = ({\hat{p}}_{t,1},\dots , {\hat{p}}_{t,44})^\intercal\), where \({\hat{p}}_{t,m} = \exp ({\hat{y}}_{t,m})/(1+\exp ({\hat{y}}_{t,m}))\). Our estimate of the national %ILI from the first step is \({\hat{p}}_t^{nat}\). Let the boldface \(\hat{\varvec{p}}_t^{nat}\) denote the length-44 vector \(\hat{\varvec{p}}_t^{nat}=({\hat{p}}_t^{nat}, \dots ,{\hat{p}}_t^{nat})^\intercal\). We also have the regional %ILI estimate \(({\hat{p}}_{t, 1}^{reg}, \dots ,{\hat{p}}_{t,10}^{reg})\) from the first step. Let \(\hat{\varvec{p}}_t^{reg}\) denote the length-44 vector \(\hat{\varvec{p}}_t^{reg}=({\hat{p}}_{t, r_1}^{reg}, \dots ,{\hat{p}}_{t, r_{44}}^{reg})^\intercal\), where \(r_m\) is the region number for state m.
Estimating \({\varvec{p}}_t\) is equivalent to estimating the time series increment \(\Delta {\varvec{p}}_{t} = {\varvec{p}}_{t} - {\varvec{p}}_{t-1}\). We denote \({\varvec{Z}}_{t} = \Delta {\varvec{p}}_{t}\) for notational simplicity. For the estimation of \({\varvec{Z}}_{t}\), we want to incorporate the cross-state, cross-source correlations. We have four predictors for \({\varvec{Z}}_{t}\) after the first step: (i) \({\varvec{Z}}_{t-1}=\Delta {\varvec{p}}_{t-1}\), (ii) \(\hat{\varvec{p}}_{t}^{GT} - {\varvec{p}}_{t-1}\), (iii) \(\hat{\varvec{p}}_{t}^{reg} - {\varvec{p}}_{t-1}\), and (iv) \(\hat{\varvec{p}}_{t}^{nat} - {\varvec{p}}_{t-1}\); they represent time series information, information from the state-level Google search, information from the regional-level estimation, and information from the national-level estimation, respectively. Let \({\varvec{W}}_{t}\) denote the collection of these four vectors: \({\varvec{W}}_{t}=({\varvec{Z}}_{t-1}^\intercal , (\hat{\varvec{p}}_{t}^{GT} - {\varvec{p}}_{t-1})^\intercal , (\hat{\varvec{p}}_{t}^{reg} - {\varvec{p}}_{t-1})^\intercal , (\hat{\varvec{p}}_{t}^{nat} - {\varvec{p}}_{t-1})^\intercal )^\intercal\).
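As a small illustration of how the four predictor blocks line up, the sketch below stacks them for one week. It assumes plain Python lists ordered by state; the function name, argument names, and toy numbers are hypothetical and chosen only for the example. The regional estimate is looked up through each state's region number \(r_m\), and the national estimate is repeated for every state.

```python
def build_predictors(p_prev, p_prev2, p_gt, p_reg, region_of_state, p_nat):
    """Stack the four predictor blocks of W_t for one week t.

    p_prev, p_prev2 : observed %ILI at weeks t-1 and t-2, one value per state
    p_gt            : first-step state-level estimates for week t
    p_reg           : dict mapping region number -> regional estimate for week t
    region_of_state : region number r_m of each state, same order as p_prev
    p_nat           : scalar national estimate for week t
    """
    z_lag = [a - b for a, b in zip(p_prev, p_prev2)]                   # (i)   Z_{t-1}
    gt_gap = [g - a for g, a in zip(p_gt, p_prev)]                     # (ii)  state-level search
    reg_gap = [p_reg[r] - a for r, a in zip(region_of_state, p_prev)]  # (iii) regional
    nat_gap = [p_nat - a for a in p_prev]                              # (iv)  national
    return z_lag + gt_gap + reg_gap + nat_gap


# Toy example with two states that sit in regions 1 and 4.
w_t = build_predictors(
    p_prev=[2.0, 3.0],
    p_prev2=[1.5, 3.5],
    p_gt=[2.5, 2.75],
    p_reg={1: 2.25, 4: 3.25},
    region_of_state=[1, 4],
    p_nat=2.5,
)
```

With 44 states the returned list has length \(4 \times 44 = 176\), which is the dimension of \({\varvec{W}}_{t}\) in the joint model.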
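To combine the four predictors, we use the best linear predictor formed by them:
\[ \hat{\varvec{Z}}_{t} = \mu _Z + \Sigma _{ZW}\Sigma _{WW}^{-1}({\varvec{W}}_{t} - \mu _W), \tag{2} \]
where \(\mu _Z\) and \(\mu _W\) are the mean vectors of \({\varvec{Z}}\) and \({\varvec{W}}\) respectively, and \(\Sigma _{ZZ}\), \(\Sigma _{ZW}\), and \(\Sigma _{WW}\) are the covariance matrices of and between \({\varvec{Z}}\) and \({\varvec{W}}\). The best linear predictor gives the optimal way to linearly combine the four predictors to form a new one. The variance associated with \(\hat{\varvec{Z}}_{t}\), i.e., that of the prediction error \({\varvec{Z}}_{t} - \hat{\varvec{Z}}_{t}\), is
\[ \Sigma _{ZZ} - \Sigma _{ZW}\Sigma _{WW}^{-1}\Sigma _{WZ}, \tag{3} \]
with \(\Sigma _{WZ} = \Sigma _{ZW}^\intercal\).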
Consistent with the first step, we adopt a sliding two-year training window to estimate \(\mu _Z\), \(\mu _W\), \(\Sigma _{ZZ}\), \(\Sigma _{ZW}\), and \(\Sigma _{WW}\) in Eqs. (2) and (3). For \(\mu _Z\) and \(\mu _W\), we use the empirical means of the corresponding variables as the estimates. The covariance matrices, however, are large relative to the number of observations (\(\Sigma _{WW}\) alone is \(176 \times 176\), while the two-year window supplies only 104 weekly observations), so we need to impose structure on them for reliable estimation.
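For intuition, it may help to write the best linear predictor out in the scalar case, with a single target Z and a single predictor W. The symbols \(\sigma _{ZZ}\), \(\sigma _{WW}\), \(\sigma _{ZW}\) and r below are introduced only for this illustration and the numbers are made up:

\[ \hat{Z} = \mu _Z + \frac{\sigma _{ZW}}{\sigma _{WW}}\,(W - \mu _W), \qquad {\mathrm{E}}\big [(Z - \hat{Z})^2\big ] = \sigma _{ZZ} - \frac{\sigma _{ZW}^2}{\sigma _{WW}} = \sigma _{ZZ}\,(1 - r^2), \]

where \(r = \sigma _{ZW}/\sqrt{\sigma _{ZZ}\sigma _{WW}}\) is the correlation between Z and W. For example, with \(\sigma _{ZZ}=1\), \(\sigma _{WW}=4\) and \(\sigma _{ZW}=1.2\), the slope is \(1.2/4 = 0.3\), the correlation is \(r = 0.6\), and the prediction error variance drops from 1 to \(1 - 0.36 = 0.64\). The multivariate formulas are the matrix analogue of this calculation, which is why poorly estimated covariances translate directly into unstable predictions.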
We assume the following structure:

1. The covariances between the time series increments satisfy \({\mathrm{Var}}({\varvec{Z}}_{t})={\mathrm{Var}}({\varvec{Z}}_{t-1})=\Sigma _{ZZ}\) and \({\mathrm{Cov}}({\varvec{Z}}_{t}, {\varvec{Z}}_{t-1})=\rho \Sigma _{ZZ}\), where \(0<\rho <1\). This essentially assumes that the time series increments are stationary and have a stable autocorrelation across time and states.
2. Independence among the different sources of information: time series increment, the estimation error of the first-step state-level estimate, the estimation error of the regional estimate, and the estimation error of the national estimate, i.e., \({\varvec{Z}}_{t}, \hat{\varvec{p}}_{t}^{GT} - {\varvec{p}}_{t}, \hat{\varvec{p}}_{t}^{reg} - {\varvec{p}}_{t}, \hat{\varvec{p}}_{t}^{nat} - {\varvec{p}}_{t}\) are all mutually independent.
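The covariance matrices are thereby simplified as:
\[
\Sigma _{ZW} = \begin{pmatrix} \rho \Sigma _{ZZ} & \Sigma _{ZZ} & \Sigma _{ZZ} & \Sigma _{ZZ} \end{pmatrix}, \qquad
\Sigma _{WW} = \begin{pmatrix}
\Sigma _{ZZ} & \rho \Sigma _{ZZ} & \rho \Sigma _{ZZ} & \rho \Sigma _{ZZ} \\
\rho \Sigma _{ZZ} & \Sigma _{ZZ} + \Sigma ^{GT} & \Sigma _{ZZ} & \Sigma _{ZZ} \\
\rho \Sigma _{ZZ} & \Sigma _{ZZ} & \Sigma _{ZZ} + \Sigma ^{reg} & \Sigma _{ZZ} \\
\rho \Sigma _{ZZ} & \Sigma _{ZZ} & \Sigma _{ZZ} & \Sigma _{ZZ} + \Sigma ^{nat}
\end{pmatrix},
\]
where \(\Sigma ^{reg} = {\mathrm{Var}}(\hat{\varvec{p}}_{t}^{reg} - {\varvec{p}}_{t})\), \(\Sigma ^{nat} = {\mathrm{Var}}(\hat{\varvec{p}}_{t}^{nat} - {\varvec{p}}_{t})\), and \(\Sigma ^{GT} = {\mathrm{Var}}(\hat{\varvec{p}}_{t}^{GT} - {\varvec{p}}_{t})\). To further control the estimation stability, we incorporate a ridge-regression-inspired shrinkage^{34} into the linear predictor (2), replacing the joint covariance matrix of \(({\varvec{Z}}_t^\intercal , {\varvec{W}}_t^\intercal )^\intercal\) by the average of the structured covariance matrix and its empirical diagonal. Effectively, in Eq. (2), \(\Sigma _{ZW}\) is replaced by \(\frac{1}{2} \Sigma _{ZW}\), and \(\Sigma _{WW}\) is replaced by \((\frac{1}{2}\Sigma _{WW}+\frac{1}{2}D_{WW})\), where \(D_{WW}\) is the diagonal of the empirical covariance of \({\varvec{W}}_t\):
\[ \hat{\varvec{Z}}_{t} = \mu _Z + \tfrac{1}{2}\Sigma _{ZW}\left( \tfrac{1}{2}\Sigma _{WW}+\tfrac{1}{2}D_{WW}\right) ^{-1}({\varvec{W}}_{t} - \mu _W). \]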
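The block structure implied by the two assumptions, together with the shrinkage step, can be assembled mechanically. The sketch below is one reading of the text and not the authors' code. It takes \(\Sigma _{ZZ}\), \(\Sigma ^{GT}\), \(\Sigma ^{reg}\), \(\Sigma ^{nat}\) and \(\rho\) as given, orders the blocks as in \({\varvec{W}}_{t}\) (lagged increment, state-level, regional, national), and additionally assumes that the week-t estimation errors are independent of \({\varvec{Z}}_{t-1}\), so that every cross term with the lagged increment equals \(\rho \Sigma _{ZZ}\). Matrices are plain nested lists and all function names are hypothetical.

```python
def scale(M, c):
    """Multiply every entry of matrix M by the scalar c."""
    return [[c * v for v in row] for row in M]

def add(A, B):
    """Entrywise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def hstack(blocks):
    """Concatenate equally tall matrices side by side."""
    return [sum((blk[i] for blk in blocks), []) for i in range(len(blocks[0]))]

def structured_covariances(S_zz, S_gt, S_reg, S_nat, rho):
    """Return (Sigma_ZW, Sigma_WW) built from the assumed block structure."""
    rS = scale(S_zz, rho)
    sigma_zw = hstack([rS, S_zz, S_zz, S_zz])
    block_rows = [
        [S_zz, rS, rS, rS],
        [rS, add(S_zz, S_gt), S_zz, S_zz],
        [rS, S_zz, add(S_zz, S_reg), S_zz],
        [rS, S_zz, S_zz, add(S_zz, S_nat)],
    ]
    sigma_ww = [row for blocks in block_rows for row in hstack(blocks)]
    return sigma_zw, sigma_ww

def shrink(sigma_zw, sigma_ww, d_ww):
    """Apply the shrinkage: halve Sigma_ZW, average Sigma_WW with diag(d_ww).

    d_ww is the list of empirical variances of the components of W_t.
    """
    half_zw = scale(sigma_zw, 0.5)
    mixed = [
        [0.5 * v + (0.5 * d_ww[i] if i == j else 0.0) for j, v in enumerate(row)]
        for i, row in enumerate(sigma_ww)
    ]
    return half_zw, mixed
```

For a single state (all blocks \(1 \times 1\)) with \(\Sigma _{ZZ}=4\), \(\Sigma ^{GT}=1\), \(\Sigma ^{reg}=2\), \(\Sigma ^{nat}=3\) and \(\rho =0.5\), `structured_covariances` returns the row (2, 4, 4, 4) and a \(4 \times 4\) matrix with diagonal (4, 5, 6, 7). The shrinkage halves every off-diagonal entry, which pulls the predictor weights toward those of four separately used predictors.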
\(\Sigma _{ZZ}\), \(\Sigma ^{nat}\), \(\Sigma ^{reg}\), \(\Sigma ^{GT}\) and \(D_{WW}\) are estimated by the corresponding sample covariances from the data in the most recent 2-year training window; \(\rho\) is estimated by minimizing the Frobenius norm (\(L_2\) distance) between the empirical correlation and structured correlation. Based on Eq. (3), the variance estimate is similarly updated by
Our final state-level %ILI estimate for week T after the second step is:
\[ \hat{\varvec{p}}_{T} = {\varvec{p}}_{T-1} + \hat{\varvec{Z}}_{T}, \]
with corresponding 95% interval estimate
Second step: standalone model for HI, AK, ND, VT, MT, ME and SD
For \(m \in \{\text {HI, AK, ND, VT, MT, ME, SD}\}\), we take a standalone modeling approach. For each of these states, which is either non-contiguous or has the lowest multiple correlation with out-of-state %ILI (detailed in Supplementary Information), we focus on estimating the individual state’s %ILI by integrating the within-state and national information in the second step. Our target is therefore a scalar \(Z_{t}^{(m)} = p_{t, m} - p_{t-1, m}\), the state’s %ILI increment at the current week. The predictor vector in the second step for state m is \({\varvec{W}}_{t}^{(m)} =({Z}_{t-1}^{(m)}, (\hat{{p}}_{t, m}^{GT} - {p}_{t-1, m}), (\hat{{p}}_{t}^{nat} - {p}_{t-1, m}))^\intercal\), where the regional terms are dropped. The best linear predictor with ridge-regression-inspired shrinkage is then used to get the final estimate
\[ {\hat{Z}}_{T}^{(m)} = \mu _Z^{(m)} + \tfrac{1}{2}\Sigma _{ZW}^{(m)}\left( \tfrac{1}{2}\Sigma _{WW}^{(m)}+\tfrac{1}{2}D_{WW}^{(m)}\right) ^{-1}({\varvec{W}}_{T}^{(m)} - \mu _W^{(m)}), \]
where \(\mu _Z^{(m)}\) and \(\mu _W^{(m)}\) are the means of \(Z^{(m)}\) and \({\varvec{W}}^{(m)}\), respectively.
The corresponding covariance matrices between the components \(\Sigma _{ZW}^{(m)} = {\mathrm{Cov}}(Z^{(m)}, {\varvec{W}}^{(m)})\), \(\Sigma _{WW}^{(m)} = {\mathrm{Var}}({\varvec{W}}^{(m)})\), and \(D_{WW}^{(m)} = {\mathrm{diagonal}}(\Sigma _{WW}^{(m)})\) are estimated by the corresponding sample covariances from the data in the most recent 2-year training window.
The final state-level %ILI estimate for week T after the second step for \(m \in \{\text {HI, AK, ND, VT, MT, ME, SD}\}\) is:
\[ {\hat{p}}_{T,m} = p_{T-1,m} + {\hat{Z}}_{T}^{(m)}, \]
with corresponding 95% interval estimate
where \(\Sigma _{ZZ}^{(m)} = {\mathrm{Var}}(Z^{(m)})\) is the scalar variance of the univariate time series \(Z_{t}^{(m)}\).
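Because the standalone model involves only a scalar target and three predictors, its point estimate is easy to sketch end to end. The following is an illustrative implementation under stated assumptions, not the authors' code: it uses sample means and sample covariances over whatever training window is passed in, applies the shrinkage with \(D_{WW}^{(m)}\) taken as the diagonal of the sample \(\Sigma _{WW}^{(m)}\), and solves the small linear system by Gaussian elimination. The function names are hypothetical, and the interval estimate is deliberately left out.

```python
def mean(xs):
    return sum(xs) / len(xs)

def cov(a, b):
    """Sample covariance (divisor n - 1) of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def standalone_estimate(z_hist, w_hist, w_now, p_prev):
    """Shrunk best linear predictor of the increment, added to last week's %ILI.

    z_hist : past increments Z_t over the training window
    w_hist : matching predictor rows W_t, one list per week
    w_now  : predictor row for the current week T
    p_prev : observed %ILI of week T-1 for this state
    """
    k = len(w_now)
    cols = [[row[j] for row in w_hist] for j in range(k)]
    mu_z, mu_w = mean(z_hist), [mean(c) for c in cols]
    s_zw = [cov(z_hist, c) for c in cols]
    s_ww = [[cov(ci, cj) for cj in cols] for ci in cols]
    # 0.5 * Sigma_WW + 0.5 * D_WW: diagonal unchanged, off-diagonals halved
    mixed = [[0.5 * s_ww[i][j] + (0.5 * s_ww[i][i] if i == j else 0.0)
              for j in range(k)] for i in range(k)]
    x = solve(mixed, [w - m for w, m in zip(w_now, mu_w)])
    z_hat = mu_z + 0.5 * sum(s * xi for s, xi in zip(s_zw, x))
    return p_prev + z_hat
```

In the setting of the text, `z_hist` and `w_hist` would cover the most recent 104 weeks and each row of `w_hist` would hold the lagged increment, the state-level search gap, and the national gap for that week.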
References
US Centers for Disease Control and Prevention (CDC). Past seasons estimated influenza disease burden. https://www.cdc.gov/flu/about/burden/pastseasons.html (2020). Accessed: 2020-05-07.
Ginsberg, J. et al. Detecting influenza epidemics using search engine query data. Nature 457, 1012–1014 (2009).
Yang, S. et al. Advances in using internet searches to track dengue. PLoS Comput. Biol. 13, e1005607 (2017).
Scott, S. L. & Varian, H. R. Predicting the present with Bayesian structural time series. Int. J. Math. Modell. Numer. Optim. 5, 4–23 (2014).
Scott, S. L. & Varian, H. R. Bayesian variable selection for nowcasting economic time series. In Economic Analysis of the Digital Economy (eds Goldfarb, A. et al.) 119–135 (University of Chicago Press, Chicago, 2015).
Wu, L. & Brynjolfsson, E. The future of prediction: how Google searches foreshadow housing prices and sales. In Economic Analysis of the Digital Economy (eds Avi Goldfarb, S. G. & Tucker, C.) 89–118 (University of Chicago Press, Chicago, 2015).
Shaman, J. & Karspeck, A. Forecasting seasonal outbreaks of influenza. Proceedings of the National Academy of Sciences 109, 20425–20430 (2012). http://www.pnas.org/content/109/50/20425.full.pdf+html.
McNeil, D. G. Can smart thermometers track the spread of the coronavirus? https://www.nytimes.com/2020/03/18/health/coronavirusfeverthermometers.html (2020). Accessed: 2020-04-12.
Yang, S., Santillana, M. & Kou, S. C. Accurate estimation of influenza epidemics using Google search data via ARGO. Proc. Natl. Acad. Sci. 112, 14473–14478 (2015).
Yang, S. et al. Using electronic health records and internet search information for accurate influenza forecasting. BMC Infect. Dis. 17, 332. https://doi.org/10.1186/s12879-017-2424-7 (2017).
Yang, W., Lipsitch, M. & Shaman, J. Inference of seasonal and pandemic influenza transmission dynamics. Proc. Natl. Acad. Sci. 112, 2723–2728 (2015).
Shaman, J., Karspeck, A., Yang, W., Tamerius, J. & Lipsitch, M. Real-time influenza forecasts during the 2012–2013 season. Nat. Commun. 4, 2837. https://doi.org/10.1038/ncomms3837 (2013).
Yang, W., Karspeck, A. & Shaman, J. Comparison of filtering methods for the modeling and retrospective forecasting of influenza epidemics. PLoS Comput. Biol. 10, e1003583 (2014).
Shaman, J. & Kandula, S. Improved discrimination of influenza forecast accuracy using consecutive predictions. PLoS Curr. Outbreaks https://doi.org/10.1371/currents.outbreaks.8a6a3df285af7ca973fab4b22e10911e (2015).
Flusight: Flu forecasting  CDC. https://www.cdc.gov/flu/weekly/flusight/index.html (2020). Accessed: 2020-04-12.
Brooks, L. C., Farrow, D. C., Hyun, S., Tibshirani, R. J. & Rosenfeld, R. Flexible modeling of epidemics with an empirical Bayes framework. PLoS Comput. Biol. 11, e1004382 (2015).
Farrow, D. C. et al. A human judgment approach to epidemiological forecasting. PLoS Comput. Biol. 13, e1005248 (2017).
Yang, W., Olson, D. R. & Shaman, J. Forecasting influenza outbreaks in boroughs and neighborhoods of New York City. PLoS Comput. Biol. 12, e1005201 (2016).
Davidson, M. W., Haim, D. A. & Radin, J. M. Using networks to combine “big data” and traditional surveillance to improve influenza predictions. Sci. Rep. 5, 8154 (2015).
Zou, B., Lampos, V. & Cox, I. Multi-task learning improves disease models from web search. In Proceedings of the 2018 World Wide Web Conference, 87–96 (2018).
Lu, F. S., Hattab, M. W., Clemente, C. L., Biggerstaff, M. & Santillana, M. Improved state-level influenza nowcasting in the United States leveraging internet-based data and network approaches. Nat. Commun. 10, 1–10 (2019).
Ning, S., Yang, S. & Kou, S. Accurate regional influenza epidemics tracking using internet search data. Sci. Rep. 9, 5238 (2019).
Reich, N. G. et al. Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the US. PLoS Comput. Biol. 15, e1007486 (2019).
Burkom, H. S., Murphy, S. P. & Shmueli, G. Automated time series forecasting for biosurveillance. Stat. Med. 26, 4202–4218 (2007).
Biggerstaff, M. et al. Results from the Centers for Disease Control and Prevention’s predict the 2013–2014 influenza season challenge. BMC Infect. Dis. 16, 1–10 (2016).
Santillana, M. et al. Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput. Biol. 11, e1004513 (2015).
Lazer, D., Kennedy, R., King, G. & Vespignani, A. The parable of Google flu: traps in big data analysis. Science 343, 1203–1205 (2014).
Butler, D. When Google got flu wrong. Nature 494, 155–156 (2013).
Lampos, V. et al. Tracking COVID-19 using online search. arXiv preprint arXiv:2003.08086 (2020).
Lipsitch, M. et al. Improving the evidence base for decision making during a pandemic: the example of 2009 influenza A/H1N1. Biosecur. Bioterrorism Biodefense Strategy Pract. Sci. 9, 89–115 (2011).
Nsoesie, E. O., Brownstein, J. S., Ramakrishnan, N. & Marathe, M. V. A systematic review of studies on forecasting the dynamics of influenza outbreaks. Influenza Other Resp. Viruses 8, 309–316 (2014).
Chretien, J.-P., George, D., Shaman, J., Chitale, R. A. & McKenzie, F. E. Influenza forecasting in human populations: a scoping review. PLoS ONE 9, e94130 (2014).
Stephens-Davidowitz, S. Google searches can help us find emerging COVID-19 outbreaks. https://www.nytimes.com/2020/04/05/opinion/coronavirusgooglesearches.html (2020). Accessed: 2020-05-07.
Hoerl, A. E. & Kennard, R. W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970).
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2016).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Acknowledgements
SCK’s research was supported in part by National Science Foundation grant DMS-1810914. The authors thank Professor Herman Chernoff for helpful comments. All analyses were performed with the R statistical software^{35}. The R package that implements the ARGOX method is available on CRAN at https://cran.r-project.org/web/packages/argo/, which uses the glmnet package^{36}. All datasets analyzed in the current study are available in the Harvard Dataverse repository, https://doi.org/10.7910/DVN/2IVDGK.
Author information
Contributions
S.Y. and S.N. contributed equally to this work. S.Y., S.N., and S.C.K. designed the research; S.Y., S.N., and S.C.K. performed the research; S.Y. and S.N. analyzed data; and S.Y., S.N., and S.C.K. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, S., Ning, S. & Kou, S.C. Use Internet search data to accurately track state level influenza epidemics. Sci Rep 11, 4023 (2021). https://doi.org/10.1038/s41598-021-83084-5