Introduction

Over the past several years, numerous scientific studies have shown that user interactions with web applications generate latent health-related signals, reflective of individual as well as community level trends1,2,3,4,5,6,7,8. Seasonal influenza and the H1N1 pandemic were used as case studies for the development and evaluation of machine learning models that produce disease rate estimates in a non-traditional way, using online search or social media as their input information1,2,5,9,10. After an extensive scientific debate about their usefulness and accuracy11,12, and with the manifestation of new findings that improved upon past methodological shortcomings13,14,15, these techniques are now becoming part of public health systems, serving as complementary endpoints for monitoring the prevalence of infectious diseases, such as seasonal influenza8,16. Compared to conventional health surveillance systems, online user trails exhibit certain advantages, including low latency, continuous operational capacity, denser spatial coverage, and broader demographic inclusion8,17. Furthermore, during emerging epidemics they may offer community level insights that current monitoring systems are not equipped to obtain given a limited testing capacity18, and the physical distancing measures that discourage or prohibit people from interacting with health services19. Notably, the ongoing coronavirus disease (COVID-19) pandemic, caused by a novel coronavirus (SARS-CoV-2), has generated an unprecedented relative volume of online searches (Supplementary Fig. 1). This search behaviour could signal the presence of actual infections, but may also be due to general concern that is intensified by news media coverage, reported figures of disease incidence and mortality across the world, and imposed physical distancing measures20,21.

Previous research, which has used online search to predominantly model influenza-like illness (ILI) rates, has focused on supervised learning solutions, where “ground truth” information, in the form of historical syndromic surveillance reports, can be used to train machine learning models2,5,13,14,22. These models learn a function that maps time series of online user-generated data, e.g. the frequency of web searches over time, to a noisy representation of a disease rate time series, indirectly minimising the impact of online activity that is not caused by infection. Typically, this training data spans multiple years and flu seasons13,14,15. However, for most locations, if not all, no sufficient data—in terms of validity, representativeness, and time span—currently exists to apply supervised learning approaches to COVID-19. Therefore, unsupervised solutions that attempt to minimise the effect of concern should be sought, and fully supervised solutions should be used and interpreted with caution.

In this paper, we present models for COVID-19 using online search data based in both unsupervised and supervised settings. We first develop an unsupervised model for COVID-19 by carefully choosing search queries that refer to related symptoms as identified by a survey from the National Health Service (NHS) and Public Health England (PHE) in the United Kingdom (UK)23. Symptom categories are weighted based on their reported ratio of occurrence in cases of COVID-19. The output from these new models provides useful insights, including early warnings for potential disease spread, and showcases the effect of physical distancing measures. Since online searches can also be driven by concern rather than by infection, we attempt to minimise this effect by incorporating a news media coverage time series. The influence of news media on the online search signal is quantified through an autoregressive task that resembles a Granger causality test between these two time series24.

Recent research has demonstrated that a model estimating ILI rates from web searches, trained originally for a location where syndromic surveillance data is available, can be adapted and deployed to another location that cannot access historical ground truth data25. The accuracy of the target location model depends on identifying the correct search queries and their corresponding weights via a transfer learning methodology. By adapting this method, we map supervised COVID-19 models from a source to a target country, in an effort to transfer noisy knowledge from areas that are ahead in their epidemic curves to areas that are at earlier stages. This supervised approach transfers the clinical reporting biases of the source location, but given the statistical supervision, it is not affected by concern to the same degree as the unsupervised models. Our analysis reaffirms the insights of the unsupervised approach, showcasing how early warnings could have been obtained from locations that had already experienced the impact of COVID-19.

Taking noisy supervision a step further, we conduct a correlation and regression analysis to uncover potentially useful online search queries that refer to underlying behavioural or symptomatic patterns in relation to confirmed COVID-19 cases. We show that generic COVID-19-related terms and less common symptoms are the most capable predictors. Finally, we use the task of forecasting confirmed COVID-19 deaths to illustrate that web search data, when added to an autoregressive forecasting model, significantly reduces prediction error.

We present results for a multilingual and multicultural selection of countries—namely, the United States of America (US), UK (including a comparative analysis for England), Australia, Canada, France, Italy, Greece, and South Africa.

Results

Unsupervised models

The first part of the analysis presents unsupervised models of COVID-19 based on weighted search query frequencies. Queries and their weightings are determined using the first few hundred (FF100) survey on COVID-19 conducted by the NHS/PHE in the UK23. FF100 identified 19 symptoms associated with confirmed COVID-19 cases and their probability of occurrence (Supplementary Table 1). There exist other studies, such as the findings by Carfi et al.26, that corroborate the outcomes of the FF100 survey. Our choice to use it was based on the substantial size of the cohort (381 patients) and the comprehensive presentation of outcomes. In addition to the symptoms identified by FF100, we also include queries that mention COVID-19-related keywords (e.g. “covid-19” or “coronavirus”) as a separate category. Unsupervised models for COVID-19 in 8 countries are depicted in Fig. 1. We observe exponentially increasing rates that exceed the estimated seasonal average (previous 8 years) during their peak period in all investigated countries, as well as a steep drop of the score after the application of physical distancing or lockdown measures in most countries which is in concordance with clinical surveillance reports27. In the time series where we have attempted to minimise the effect of news by maintaining a proportion of the signal (Eqs. (3), (4) and (5)), we observe more conservative estimates in all locations, including altered trends during the peak period. Compared to the original scores (no minimisation of news media effects), there is an average reduction of the signals by 16.4% (14.2%–18.7%) in a period of 14 days prior and after their peak moments; their corresponding average linear correlation during this period is equal to 0.822 (0.739–0.905). Outside of the peak periods, the reduction is a moderate 3.3% (2.7%–4.0%), indicating that there is an association between the extent of media impact and the estimated disease prevalence. For the US, Australia, Canada, and Italy, the signal is visibly flattened during the time period around their respective peaks. Countries that took measures earlier in the epidemic curve (e.g., Greece) demonstrate a more pronounced pattern of decrease, with scores that go below the expected seasonal average. We also note that for Australia and the UK, search scores were already in decline after the application of physical distancing measures but before lockdowns. Outcomes based solely on the FF100 symptoms (Supplementary Fig. 2) or without using any weighting scheme (Supplementary Fig. 3) are available in the Supplementary Information (SI).

Fig. 1: Online search scores for COVID-19-related symptoms as identified by the FF100 survey, in addition to queries with coronavirus-related terms, for 8 countries from September 30, 2019 to May 24, 2020 (all inclusive).
figure 1

Query frequencies are weighted by symptom occurrence probability (blue line) and have news media effects minimised (black line). These scores are compared to an average 8-year trend of the weighted model (dashed line) and its corresponding 95% confidence intervals (shaded area). Application dates for physical distancing or lockdown measures are indicated with dash-dotted vertical lines; for countries that deployed different regional approaches, the first application of such measures is depicted. All time series are smoothed using a 7-point moving average, centred around each day.

A comparison of the search scores with minimised media effects to the time series of confirmed cases is depicted in Fig. 2. If we exclude South Africa, as it displays an outlying behaviour, perhaps due to a limited testing capacity28,29 or demographically-skewed Internet access patterns30, the correlation between these times series is maximised, reaching an average value of 0.826 (0.735–0.917) when clinical data is brought forward by 16.7 (10.2–23.2) days. This provides an indication of how much sooner the proposed unsupervised models could have signalled an early warning about these epidemics at a national level. It also shows numerically as well as visually that both the increase and the decline of confirmed cases are captured by the search signal. Replacing confirmed cases with deaths caused by COVID-19 (Supplementary Fig. 4) increases this period to 22.1 (17.4–26.9) days with a slightly greater maximised correlation (r = 0.846; 0.702–0.990).

Fig. 2: Comparison between online search scores with minimised news media effects (black line) and confirmed cases (dashed red line), as well as confirmed cases shifted back (red line) such that their correlation with the online search scores is maximised.
figure 2

The confirmed cases time series are shifted back by a different number of days for each country: 20 days (US), 24 days (UK), 6 days (Australia), 31 days (Canada), 10 days (France), 14 days (Italy), 12 days (Greece), and 53 days (South Africa). All time series are smoothed using a 7-point moving average, centred around each day.

Transfer learning

Models of confirmed COVID-19 cases are transferred from Italy (source) to all other countries (targets) in our analysis using a transfer learning methodology. In contrast to the unsupervised models, here we attempt to leverage information from a country that is ahead in terms of epidemic progression31. As a result, the obtained estimates are reflective of the clinical reporting systems in the source country, but not as influenced by user concern at the target countries given that they are derived from a supervised learning function. Our approach maps search queries about specific symptom categories, as identified by the FF100 survey, from the language of the source country (Italian) to the languages of the other countries using a temporal correlation metric (see Methods). We train 100,000 elastic net models for the source location, exploring the entire 1-norm regularisation path, and transfer the subset of models (84,557 on average) that assign a nonzero weight to a minimum of 3 and up to a maximum of 49 queries (out of the 54 queries identified for Italy) on a daily basis from February 17 to May 24, 2020 (both inclusive). The transferred estimates for the last day in the considered time span and their 95% confidence intervals are depicted in Fig. 3. Intermediate outcomes showing estimates from models that are trained and transferred on a daily basis are available in Supplementary Fig. 5. With the exceptions of Australia and South Africa, the peak of the transferred signal appears approximately 2–3 weeks earlier than the confirmed cases. The decrease after the peak is also more steep in the transferred models and always occurs after the commencement of physical distancing or lockdown measures. The average correlation between the transferred and the unsupervised (with minimised news effects) estimates is equal to 0.654 (0.572–0.735), which implies that there exist some differences between the two approaches (Supplementary Fig. 6). In fact, their correlation is maximised to 0.789 (0.736–0.842) by bringing the transferred time series 5.14 (3.21–7.08) days forward, indicating that the unsupervised signal provides an earlier alert, and that both signals are much more similar when their temporal discrepancy is corrected. Interestingly, during the temporal alignment of source and target search query frequencies, which is an intermediate step of the transfer learning technique, correlation increases when the target data is shifted forward. This partially confirms that Italy was indeed ahead by a few days in terms of either epidemic progression, user search behaviour, or both, and justifies our choice to use it as the source country. In particular, when we focus on the period from March 16 to May 24, 2020 (both inclusive)—dates that signify the beginning and end of high levels of transmission for Italy—and analyse all transferred models, the average shift in days that maximises these correlations per target country is: 13.76 (12.97–14.55) for the US, 12.67 (11.99–13.35) for the UK, 5.24 (4.30–6.18) for Australia, 8.06 (7.14–8.98) for Canada, 9.84 (8.62–11.06) for France, 1.44 (0.45–2.43) for Greece, and 10.99 (10.10–11.88) for South Africa.

Fig. 3: Transfer learning models based on online search data for 7 countries using Italy as the source country.
figure 3

The figures show an estimated trend for confirmed COVID-19 cases compared to the reported one. The trend is derived by standardising the transferred estimates (raw values are reflective of the demographics and clinical reporting approach of the source country). The solid line represents the mean estimate from an ensemble of models. The shaded area shows 95% confidence intervals based on all model estimates. Application dates for physical distancing or lockdown measures are indicated with dash-dotted vertical lines; for countries that deployed different regional approaches, the first application of such measures is depicted. Time series are smoothed using a 3-point moving average, centred around each day. We use this minimum amount of smoothing to remove some of the noise for visualisation purposes and maintain our ability to compare the transferred models to the corresponding clinical data.

Correlation and regression analysis

Aiming to uncover symptom-related patterns or associated behaviours, we examine the statistical relationship between web search frequencies and confirmed COVID-19 cases or deaths by performing a correlation and regression analysis. Outcomes could also help inform the choice of search terms in follow-up models for COVID-19. To reduce the representation bias of clinical endpoints, we combine data from multiple countries to the extent possible. For a more comprehensive experiment, we aggregate data from 4 countries where English is the main spoken language, namely the US, UK, Australia, and Canada. We first estimate the linear correlation between query frequencies and clinical data at all locations (in an aggregate fashion), from December 31, 2019 up to and including May 24, 2020. Results illustrating the top-correlated and anti-correlated search queries are depicted in Fig. 4a; correlations with deaths, which are lower given the additional temporal difference, are depicted in Supplementary Fig. 7a. Queries about the disease (“covid”; r = 0.72) or the virus (“sars cov 2”; r = 0.67) are the top-correlated. Various related symptoms demonstrate strong correlations as well, although the amount of correlation is not necessarily reflective of the symptom’s occurrence probability; examples include rash (r = 0.63), pink eye (r = 0.58), blue face (r = 0.56), sneezing (r = 0.55), and loss of the sense of smell (r = 0.49). Associated behaviours, e.g. staying at home (r = 0.61), or measures, e.g. quarantine (r = 0.60), are also strongly correlated. On the other hand, symptoms such as vomiting (r = −0.60) and migraine (r = −0.44) show a significant anti-correlation. When we shift confirmed cases back by 19 days and deaths by 25 days, the average correlation across all considered web searches is maximised; results are depicted in Supplementary Fig. 7b, c. The membership and ranking is similar to the previously shown results, although there are marginal increases in the numbers of search queries that seek information about the disease (characteristics, testing, potential treatments). For the regression analysis, we focus on the last 84 days of the considered time span (March 2 to May 24, 2020), training models and assessing estimates for each day of that period. We consider elastic net models with up to a 50% feature density (nonzero weights for half of the considered search queries at most) to minimise the chance of overfitting. The outcomes of this analysis are depicted in Fig. 4b, c; for completeness, results where the entire regularisation path is explored (1%–100% feature density) are shown in Supplementary Fig. 7d, e. We observe that the most impactful feature, in terms of its normalised contribution to estimates of confirmed cases or deaths, are search queries that include the term “covid” (29.32% and 36.29%, respectively), which is intuitive given the magnitude of this pandemic. Blue face (27.05%), loss of the sense of smell (12.57%), appetite loss (7.39%), pink eye (5.64%), and shortness of breath (5.42%) are the top-5 most impactful symptoms with regards to estimating confirmed cases. For mortality estimates, two new symptoms are introduced in the top-5: rash (8.80%) and loss of the sense of taste (6.5%). In addition, recommended behaviours, such as isolation and staying at home, and queries about masks, related diagnostics, or holidays are also impactful. Similarly to the correlation analysis, the search queries related to vomiting have a strong negative impact (−12.78%) in estimating confirmed cases; this is replaced by diarrhoea when estimating deaths (−15.26%). Queries about the common symptoms of cough, fatigue, and fever are not among the most correlated or impactful, suggesting that web searches about rarer symptoms may be more informative.

Fig. 4: Correlation and regression analysis of search query frequencies against confirmed COVID-19 cases or deaths in four English speaking countries (US, UK, Australia, and Canada).
figure 4

a Top-30 positively and top-10 negatively correlated search queries with COVID-19 confirmed cases; b Top-30 positively and top-10 negatively impactful queries in estimating COVID-19 confirmed cases; c Top-30 positively and top-10 negatively impactful queries in estimating deaths caused by COVID-19.

Forecasting deaths caused by COVID-19

As a final step in our analysis, we use fully supervised forecasting models for COVID-19 deaths to assess whether the inclusion of web searches can improve the accuracy of autoregressive (AR) models. We provide the same analysis for confirmed cases in the SI (Supplementary Table 2, and Supplementary Fig. 8), but given the irregularities in the way laboratory diagnostics were conducted, the time series of deaths is more consistent, and hence more appropriate to use for this challenging supervised learning task. We first assess an AR model that uses only deaths data from the past L = 6 days (AR-F) to conduct forecasts 7 and 14 days ahead. We then expand on this by incorporating online search data as well (SAR-F). Both models are based on Gaussian Processes (GPs) as detailed in Methods. We also use a basic persistence model (PER-F) as a modest baseline. Our testing period starts from April 20, 2020, but we commence testing only when a cumulative number of 10 deaths is recorded in a country (this reduces the amount of test points for South Africa only). Models are retrained at every time step, and their accuracy is assessed using the mean absolute error (MAE) between forecasts and actual figures. Table 1 enumerates these results, including a normalised average MAE across locations, models, and forecasting tasks to allow for a fairer joint interpretation. We note that SAR-F performs considerably better than AR-F, decreasing MAE by 32.65% and 33.77% in the 7 and 14 days ahead forecasting tasks, respectively. It also improves upon the PER-F baseline in the more challenging task, which is a positive outcome given the small amount of available training data. The corresponding 14 days ahead forecasts are depicted in Supplementary Fig. 9. As expected from the empirical evaluation, SAR-F estimates are visibly a better fit to the ground truth than the ones produced by AR-F, capturing the quantity and the overall trend more convincingly. These outcomes provide further evidence for the utility of incorporating online search information in disease models for COVID-19.

Table 1 Average mean absolute error and standard deviation (in parentheses) of forecasting models (7 and 14 days ahead) for daily deaths caused by COVID-19 in 8 countries.

Discussion

We have presented unsupervised (with minimised news media effects) and transfer learning models for COVID-19 based on online search data. The latter reaffirm the estimates of the former, albeit with an additional temporal delay of approximately 5 days. However, transfer learning provides a more statistically principled approach as it is based on a supervised model of confirmed cases from a source country (Italy). This is a component that makes it less prone to media influence, similarly to supervised14,15 or transferred25 models for ILI. We have conducted a series of experiments, across different countries, to demonstrate the practical utility of online search in modelling the incidence of COVID-19. By comparing our outcomes to clinical endpoints, we argue that signals from web search data could have served as preliminary early indicators for COVID-19 prevalence at the national level. Our results also highlight the immediate impact that physical distancing or lockdown measures had in reducing disease rates. Furthermore, a qualitative analysis shows that rarer symptoms or generic queries about COVID-19 correlate better with and are more predictive of clinically reported metrics.

Assessing the correctness of our analysis is difficult because there exists no definitive ground truth representative of community level disease rates to compare against. However, intrinsic properties in our findings as well as quantitative comparisons with different clinical endpoints provide at least partial evidence of validity. It is important to note that the COVID-19 outbreak in Italy, which was the first major outbreak in Europe32, did not cause an increase in the frequency of the vast majority (>70%) of search terms at the other countries we considered based on a Granger causality analysis24 (see SI). Nonetheless, this motivates our effort to minimise news media effects on the derived COVID-19 scores, but at the same time, it can also justify the moderate decrease that this process registers during peak periods (16.4% on average). For Italy itself, the COVID-19 score as estimated through online searches (with minimised media effects) is preceding confirmed cases and deaths by 14 and 18 days, respectively. In addition, search terms that in most instances do not represent infection (e.g., “COVID unemployment”) clearly follow the corresponding trends of search terms about COVID-19 symptoms (Supplementary Fig. 10). These observations combined corroborate the hypothesis that the greatest portion of the online search signal based on the terms we have identified and the derivatives we develop is more representative of infection rather than concern.

As described in the results, the correlation between the unsupervised search-based signal with minimised news effects and confirmed COVID-19 cases is maximised when the latter is brought forward by 16.7 days. This temporal difference between online searches and clinical information could be partly explained by the amount of time between the onset of common symptoms and of more severe ones that warrant a hospital admission33, the overall delay of some health systems to respond to the pandemic34, and national policies that recommended testing only after the persistence of symptoms in milder cases35. Interestingly, if no minimisation of news effects is conducted, the temporal distance for a maximal correlation increases by 1.7 days. The fact that most countries in our analysis exhibited concern about COVID-19 prior to the rapid increase of infections could be a justification for this. By repeating this analysis for mortality figures, we obtain that correlation is maximised by bringing death time series 22.1 (17.4–26.9) days forward. The added amount of days corroborates with findings about the time interval between the commencement of a hospitalisation and a death outcome (5–11 days)33.

Drawing our focus to England, we were able to compare outcomes from the unsupervised (with and without minimised effects) and transfer learning approaches to more conventional metrics. The first is a swabbing scheme led by the Royal College of General Practitioners (RCGP), which included many patients who did not have COVID-19-related symptoms and thus allowing us to obtain a more representative metric for community-level spread36; results are depicted in Fig. 5a, b. We observe that the weekly COVID-19 swabbing positivity rates correlate strongly with the unsupervised (r = 0.816, p < 0.001), unsupervised with minimised news media effects (r = 0.855, p < 0.001), and transferred (r = 0.954, p < 0.001) scores. We find that the unsupervised models could have provided an early warning at least a week before the RCGP swabbing scheme. The second metric is a normalised version of the confirmed cases that takes into consideration relative population figures (coronavirus.data.gov.uk); comparisons are depicted in Fig. 5c, d. Here, the time discrepancy is more visible, and correlations without any form of shifting are within a similar range as in our previous analysis (Figs. 2 and 3), showing that search-based estimates precede the clinical estimates by 1 and 2 weeks for the transfer learning and unsupervised approaches respectively.

Fig. 5: Comparison of weekly online search-based signals for COVID-19 to different clinical endpoints for England.
figure 5

a Estimates from the unsupervised models with or without minimising news media effects are compared to COVID-19 positivity rates obtained through a swabbing test scheme operated by the RCGP. b Estimates for COVID-19 obtained via transfer learning (source model of confirmed cases is based on data from Italy) are compared to COVID-19 positivity rates obtained through a swabbing test scheme operated by the RCGP. c Estimates from the unsupervised models with or without minimising news media effects are compared to confirmed cases rates as reported by PHE. d Estimates for COVID-19 obtained via transfer learning (source model of confirmed cases is based on data from Italy) are compared to confirmed cases rates as reported by PHE.

PHE in the UK has incorporated the unsupervised models described in this paper into their weekly surveillance reports about COVID-19 (gov.uk/government/publications/national-covid-19-surveillance-reports). According to PHE, estimates from these models provided important insights into community transmission at a time where conventional surveillance had limited reach due to the fact that testing was restricted to hospitalised cases. They were analysed alongside other surveillance systems, with the broad expectation that any changes in trends would be first detected through the analysis of web searches and would then be seen at a later stage in more conventional surveillance sources. The decrease in the web search scores was interpreted as an early signal of the impact of physical distancing37. The ability to quantify the lag between search engine trends and incidence trends in confirmed cases and mortality further improves the utility of this approach.

Related work has also indicated the potential value in using online search as an early indicator of COVID-1938,39. However, the provided analysis is not entirely comprehensive, in that it focuses in a very narrow subset of search terms (e.g., just the term “coronavirus” for Effenberger et al.39), only considers a few COVID-19 symptoms and does not weight them according to their frequency in patient cohorts, and does not account for potential biases in the obtained signal from concerned users. In fact, the reported lags compared to confirmed cases are not in par with our findings that add 5-6 more days of early warning. In addition, a study reporting that searches have been mostly influenced by the news media again focuses on a very narrow set of search terms, and does not conduct a causal analysis to support that claim40. We show that although some of the search terms could have been driven by concern, this becomes a minority when they are curated carefully (see SI), and we provide a method to reduce some of the remaining biases in the signal.

Nonetheless, our study is not without limitations. Primarily, it is currently impossible to provide a solid evaluation for our findings. This will require a better representation of disease incidence from a clinical or epidemiological perspective, which is an ongoing challenge. In addition, wherever a form of statistical supervision has been introduced (transfer learning, correlation and regression analysis, forecasting), reliable outcomes depend on the existence of a consistent sampling technique of the target information, which in this case is confirmed cases or deaths. We made a concerted effort to reduce the impact of such inconsistencies by training models jointly across multiple countries, focusing on mortality figures rather than the testing-dependent confirmed cases for more demanding tasks, and assessing qualitatively our outcomes across different locations. More detailed comparisons based on public health data from England reaffirm our broader findings. From a practical perspective, this work might not directly be applicable to healthcare resource allocation. This requires more geographically granular estimates that are not supported by the currently available online search data. However, our method can provide an alert at the national level that subsequently could motivate a more thorough examination using different disease surveillance systems and lead to the identification of outbreak hotspots. Online search behaviour could be influenced by many factors not related to infection. Although we have attempted to mitigate those effects by carefully selecting search terms and developing a method to minimise news effects, some biases may still be present in the obtained signals. Our Granger-causal analysis indicates that only a minority of the searches might have been affected (see SI). Finally, similarly to other studies that are based on online search data13,14, our approach is likely to have limited applicability to locations with lower rates of Internet access.

Methods

We first present the methodological approaches deployed in our analysis, and then describe the data sets used.

Community-level surveillance with an unsupervised symptom-based online search model

We generate k symptom-based search query sets using the k = 19 identified symptoms from the FF100 survey for COVID-19 (Supplementary Table 1)23. We also consider one additional set that includes specific COVID-19 terminology, i.e. the “COVID-19” keyword itself among others. Query sets may include different wordings for the same symptom or queries with minor grammatical differences (especially for queries in Greek and French). If a symptom is represented by more than one search query, then we obtain the total frequency (sum) across these queries. Each query set time series is smoothed using a harmonic mean over the past 14 days (see Supplementary Eq. 4 for a definition of the harmonic mean), and any trends across the entire period of the analysis are removed using linear detrending. We then apply a min-max normalisation (Supplementary Eq. 1) to the frequency time series of each query set to obtain a balanced representation between more and less frequent searches. We divide our data into two periods of interest, the current one (from September 30, 2019 until May 24, 2020) and a historical one (from September 30, 2011 to September 29, 2019). The corresponding data sets are denoted by \({\bf{X}}\in {{\mathbb{R}}}_{\ge 0}^{{N}_{1}\times k}\) and \({\bf{H}}\in {{\mathbb{R}}}_{\ge 0}^{{N}_{2}\times k}\), where N1, N2 represent the different numbers of days in the current and historical data, respectively. We use the symptom conditional probability distribution from the FF100 to assign weights (\({\bf{w}}\in {{\mathbb{R}}}_{\ge 0}^{k}\)) to each query category, and compute weighted time series (x = Xw, h = Hw), which are subsequently divided by the sum of w (weighted average). The additional set that includes specific COVID-19 terms is assigned a weight of 1. Our rationale is that people who experience COVID-19-related symptoms in addition to searching about them, will also issue queries about the disease as it is a broadly discussed topic. For the historical data, we divide their time span into yearly periods, and compute an average time series trend, hμ, using two standard deviations as upper and lower confidence intervals. We deploy a considerable historical period (8 years) and not a shorter and more recent one based on previous work on ILI rate modelling from online search data14,15 which indicated that the inclusion of an increased amount of past instances generally improves model accuracy. By doing this, we also capture search trends during past years that were affected by the H1N1 (swine flu), Ebola, and Zika virus epidemics.

Minimising the effect of news media using autoregression

On any given day the proportion of news articles about the COVID-19 pandemic is m [0, 1], and the weighted score of symptom-related online searches (see previous paragraph) is equal to g; we can apply a min-max normalisation so that g [0, 1] as well. We hypothesise that g incorporates two signals based on infected (gp) and concerned (gc) users, respectively, i.e.

$$g={g}_{p}+{g}_{c}.$$
(1)

Then, there exists a constant γ [0, 1] such that

$${g}_{p}=\gamma {g}\quad {\text{and}} \quad {g_c}=(1-\gamma ){g}.$$
(2)

Our goal is to approximate γ using the observed variables m and g. The rationale of the deployed approach is similar to the logic behind a Granger causality test24. First, we train a linear AR model for forecasting the online search score at a time point (day) t, gt, using its previous values; this is denoted by AR\(\left(g\right)\). We also train a linear AR model with the same forecasting target, but an expanded space of observations that includes current (mt) and previous (e.g. mt−1) values of the news articles ratio; this is denoted by AR\(\left(g,m\right)\). We then use the error ratio between the two models in forecasting gt as an estimate of γ for time point t. In particular, we first solve

$$\arg \mathop{\min }\limits_{{\bf{w}},{b}_{1}}\frac{1}{N}\mathop{\sum }\limits_{t=1}^{N}{({g}_{t}-{w}_{1}{g}_{t-1}-{w}_{2}{g}_{t-2}-{b}_{1})}^{2}\ ,$$
(3)

to learn a pair of weights (w) and an intercept term (b1) for AR\(\left(g\right)\). We use 2 lags (past values) to keep the complexity of the task tractable given the small amount of samples at our disposal, N. In our experiments we use the previous N = 56 days, an intermediate value to account for the trade-off between the amount of samples and maintaining recency. We then solve

$$\arg \mathop{\min }\limits_{{\bf{w}},{\bf{v}},{b}_{2}}\frac{1}{N}\mathop{\sum }\limits_{t=1}^{N}{({g}_{t}-{w}_{1}{g}_{t-1}-{w}_{2}{g}_{t-2}-{v}_{1}{m}_{t}-{v}_{2}{m}_{t-1}-{v}_{3}{m}_{t-2}-{b}_{2})}^{2}\ ,$$
(4)

to learn the weights ([w; v]) and an intercept term (b2) for AR\(\left(g,m\right)\). Using both models, we forecast the next (unseen) value of g, which is denoted by \({\hat{g}}_{t+1}\), and compute the absolute error from its known true value, gt+1. This yields errors, ϵ1 and ϵ2 for AR\(\left(g\right)\) and AR\(\left(g,m\right)\), respectively. If ϵ1 < ϵ2, then the news media signal does not help to improve the accuracy of AR\(\left(g\right)\), and hence we assume that it does not affect the online searches. Otherwise, we estimate its effect to be represented by

$$\gamma =\frac{{\epsilon }_{2}}{{\epsilon }_{1}}\ .$$
(5)

After obtaining a time series of γ’s for the all days in our analysis, we smooth each one of them using a harmonic mean (Supplementary Eq. 4) over the values of the previous 6 days (7 days in total including the day of focus).

Transferring supervised models of confirmed COVID-19 cases to different countries

Previous work has shown that it is possible to transfer a model for seasonal flu, based on online search query frequency time series, from one country that has access to historical syndromic surveillance data to another that has not25. Here, we adapt this method to transfer a model for COVID-19 incidence from a source country where the disease spread has progressed significantly to a target country that is still in earlier stages of the epidemic. The main motivation for this is our assumption that a supervised model based on data from the source country might be able to capture disease dynamics sooner and therefore better than unsupervised or supervised models based solely on data from the target countries. The steps and data transformations that are required to apply this technique are detailed below.

Search query frequency time series are denoted by \({\bf{S}}\in {{\mathbb{R}}}_{\ge 0}^{M\times {n}_{\text{S}}}\) and \({\bf{T}}\in {{\mathbb{R}}}_{\ge 0}^{M\times {n}_{\text{T}}}\), for the source and target countries respectively; M denotes the number of days considered, and nS, nT the number of queries for the two locations. As these time series are quite volatile for some locations in our study, something that does not help in cross-location mapping of the data, we have smoothed them using a harmonic query frequency mean based on a window of D = 14 past days (Supplementary Eq. 4). We subsequently train an elastic net model on data from the source location41, similarly to previous work on ILI14,15,17 or other text regression tasks42,43. In particular, we solve the following optimisation task

$$\arg \mathop{\min }\limits_{{\bf{w}},b}\left({\left\Vert {\bf{y}}-{\bf{S}}{\bf{w}}-b\right\Vert }_{2}^{2}+{\lambda }_{1}{\left\Vert {\bf{w}}\right\Vert }_{1}+{\lambda }_{2}{\left\Vert {\bf{w}}\right\Vert }_{2}^{2}\right)\ ,$$
(6)

where \({\bf{y}}\in {{\mathbb{R}}}^{M}\) denotes the daily number of confirmed COVID-19 cases in the source location, λ1, \({\lambda }_{2}\in {{\mathbb{R}}}_{ \,{>}\,0}\) are the 1- and 2-norm regularisation parameters, and \({\bf{w}}\in {{\mathbb{R}}}^{{n}_{\text{S}}}\), \(b\in {\mathbb{R}}\) denote the query weights and regression intercept, respectively. Prior to deploying elastic net, we apply a min-max normalisation on both S and y. We fix the ratio of λ’s, and then train q models for different values of λ1 starting from the largest value that does not yield a null model and exploring the entire regularisation path. From all the different regression models represented by the columns of \({\bf{W}}\in {{\mathbb{R}}}^{{n}_{\text{S}}\times q}\), and the elements of \({\bf{b}}\in {{\mathbb{R}}}^{q}\), we use the ones that satisfy a sparsity level from 5.56% to 90.74% (3 to 49 search queries out of a total of 54) as an ensemble to accomplish a more inclusive transfer.

To generate an equivalent feature space for the target location (same dimensionality, similar feature attributes), we first establish query set pairs between the source and the target location using the symptom categories in the FF100 survey. We map a source query to the target query from the same symptom category that maximises their linear correlation based on their frequency time series. As the time series of source and target queries may not be temporally aligned, prior to computing correlations, we look at a window of z = 45 days (backwards and forwards) and identify at which temporal shift the average correlation between search query frequencies in \({\bf{S}}^{\prime}\) and T is maximised; \({\bf{S}}^{\prime}\) here denotes a subset of S that includes only the search queries that have been assigned a non zero weight by the elastic net (Eq. (6)). In the rare event that no active target search query exists for a certain symptom category, to maintain model consistency we use the best correlated one from all target queries available (irrespectively of the symptom category) as its mapping. After this process, we end up with a subset \({\bf{Z}}\in {{\bf{R}}}^{M\times {n}_{\text{S}}}\) of the target feature space T. Notably, Z does not necessarily hold data for nS distinct queries as different source queries may have been mapped to the same target query. Z is subsequently normalised using min-max. To reduce the distance in the query frequency range between the two feature spaces (S, Z), we scale the latter based on their mean, column-wise (per search query) ratio \({\bf{r}}\in {{\mathbb{R}}}_{\ge 0}^{{n}_{\text{S}}}\), i.e. ZS = Zr. We then deploy the ensemble source models to the target space, making multiple inferences (for different λ1 values) held in \({\bf{Y}}\in {{\mathbb{R}}}^{{n}_{\text{S}}\times q}\):

$${\bf{Y}}={{\bf{Z}}}_{S}{\bf{W}}+{\bf{b}}\ .$$
(7)

We reverse the min-max normalisation for each one of the inferred time series (columns of Y) using values from the source model’s ground truth y (prior to its normalisation). Finally, we compute the mean of the ensemble (across the rows of Y) as our target estimate, and also use the 0.025 and 0.975 quantiles to form 95% confidence intervals.

Correlation and regression analysis

The relationship of search frequency time series and confirmed cases or deaths can uncover symptoms or behaviours related to COVID-19. However, as it is hard to find resources that are representative of community level disease rates, looking at this relationship separately for each country might produce misleading outcomes. To mitigate this to the extent possible, we combine data from C countries and produce an aggregate set of query frequencies, \({{\bf{Z}}}_{\alpha }\in {{\mathbb{R}}}^{CM\times n}\), where M, n denote the considered days and search queries, respectively. We denote the aggregated daily confirmed COVID-19 cases or deaths for these countries with \({{\bf{y}}}_{\alpha }\in {{\mathbb{R}}}^{CM}\). Prior to the aggregation, we apply min-max normalisation on the query frequency, confirmed cases, and deaths time series separately for each country to balance out local properties. Initially, we compute the linear correlation between the columns of Zα and yα. Correlation is an informative metric, but considers each search query in isolation. Therefore, we also perform a multivariate regression analysis to more rigorously estimate the impact of each search query in estimating yα. To do this, we apply elastic net regularised regression (Eq. (6)), training and testing K models, one for each day of an identified test period of interest. During each of the K training phases, assuming it contains η days, we use data up to and including the past η − 1 days to train, and test only on the last day (ηth) which is unseen; this results in daily test sets of size C (one value for each country). We explore elastic net’s regularisation path to consider L models that maintain (by assigning a nonzero weight) up to a reasonable percentage of the features (e.g. 50%), so that a solution is not overfitting. We do this gradually, selecting first 1% of the features and moving towards the maximum considered percentage. In this experiment, we use the test set to identify the most accurate (in terms of mean squared error) model at each density level. For this model, we determine the impact of each one of the features (search queries) by considering both its frequency and allocated weight. The impact Θ(  ) of a query q is equal to

$${{\Theta }}(q)=\left(\mathop{\sum }\limits_{\ell =1}^{L}\mathop{\sum }\limits_{t=1}^{K}\mathop{\sum }\limits_{j=1}^{C}{f}_{t,j}\ \ {w}_{\ell ,t}\right)/\left(\mathop{\sum }\limits_{\ell =1}^{L}\mathop{\sum }\limits_{t=1}^{K}\mathop{\sum }\limits_{j=1}^{C}{\hat{y}}_{\ell ,t,j}\right),$$
(8)

where ft,j denotes the query frequency at time point (day) t and for country j, w,t the corresponding weight at sparsity level , and \({\hat{y}}_{\ell ,t,j}\) the respective estimated confirmed cases or deaths. Impacts are summed across all the considered days, and model densities, and normalised at the end by the sum of all the corresponding estimates.

Short-term forecasting of COVID-19 deaths

Moving further into supervised models, we develop forecasting solutions for COVID-19 to assess the potential value of incorporating web search data in AR models. In this case, we focus on modelling related deaths as opposed to confirmed cases because the reporting of the latter has been dependent on testing approaches, and thus has a less reliable time series structure from a statistical perspective. Short-term forecasts of COVID-19 deaths can be conducted using the time series of past records. Augmenting this AR signal with online user trails could help to improve accuracy, if we draw a parallel with ILI rate modelling44. Let \({\bf{z}}\in {{\mathbb{R}}}_{\ge 0}^{M}\) denote the average frequency of N min-max normalised search queries for M days, and let \({\bf{y}}\in {{\mathbb{N}}}^{M}\) be the corresponding COVID-19 deaths. We first solve a strictly AR task using L past values, meaning that at a time point t we use yAR(t, L) = [yt, yt−1, …, ytL] y to forecast yt+D, performing D days ahead forecasting. We denote this forecasting model as AR-F. We also augment our observations by incorporating search query frequency data for the same time window resulting to an input xAR(t, L) = [zAR(t, L); yAR(t, L)] that is held in the concatenated matrix \({\bf{X}}=[{{\bf{Z}}}_{\text{AR}}(L);{{\bf{Y}}}_{\text{AR}}(L)]\in {{\mathbb{R}}}_{\ge 0}^{M\times 2(L+1)}\). We denote this search AR forecasting model as SAR-F. For simplicity, we drop the subscript notation of the input variables (x, z, y) in subsequent references to them.

We develop forecasting models using Gaussian Processes (GP), training a different model for each D days ahead forecasting task. The choice of GPs is justified by previous work on modelling infectious diseases14,15,45. GPs are defined as random variables any finite number of which have a multivariate Gaussian distribution46. GP methods aim to learn a function \(f:{{\mathbb{R}}}^{m}\to {\mathbb{R}}\) drawn from a GP prior. They are specified through a mean and a covariance (or kernel) function, i.e. \(f({\bf{x}}) \sim \,{\mathrm{GP}}\,(\mu ({\bf{x}}),k({\bf{x}},{\bf{x}}^{\prime} ))\), where x and \({\bf{x}}^{\prime}\) (both \(\in {{\mathbb{R}}}_{\ge 0}^{2(L+1)}\)) denote rows of the input matrix X for the SAR-F model. By setting μ(x) = 0, a common practice in GP modelling, we focus only on the kernel function. The specific kernel function used in SAR-F is given by

$$k({\bf{x}},{\bf{x}}^{\prime} )={k}_{\text{SE}}({\bf{z}},{\bf{z}}^{\prime} ;{\sigma }_{1},{\ell }_{1})+{k}_{\text{SE}}({\bf{y}},{\bf{y}}^{\prime} ;{\sigma }_{2},{\ell }_{2})+{k}_{\text{SE}}({\bf{x}},{\bf{x}}^{\prime} ;{\sigma }_{3},{\ell }_{3})+{\sigma }_{4}^{2}\delta ({\bf{x}},{\bf{x}}^{\prime} )\ ,$$
(9)

where kSE(  ,  ) denotes a squared exponential (SE) covariance function (Supplementary Eq. 5), δ(  ,  ) denotes a Kronecker delta function used for an independent noise component, ’s and σ’s are lengthscale and scaling (variance) parameters, respectively. This composite covariance function uses separate kernels for online search and deaths data, as well as a kernel where they are modelled jointly. This provides the GP with more flexibility in adapting its decisions to one class of observations over the other. The covariance function of the AR-F model is a simplified version of Eq. (9), where search data is not used, i.e.

$$k({\bf{y}},{\bf{y}}^{\prime} )={k}_{\text{SE}}({\bf{y}},{\bf{y}}^{\prime} ;{\sigma }_{1},{\ell }_{1})+{k}_{\text{SE}}({\bf{y}},{\bf{y}}^{\prime} ;{\sigma }_{2},{\ell }_{2})+{\sigma }_{3}^{2}\delta ({\bf{y}},{\bf{y}}^{\prime}).$$
(10)

The hyperparameters of both covariance functions (σ’s, ’s) are optimised using Gaussian likelihood and variational inference46.

In our analysis, we compare the aforementioned forecasting models with a basic persistence model (PER-F). For a D days ahead forecasting task, PER-F uses the most recently observed value of the ground truth, yt, as the forecasting estimate for the time instance t + D (\({\hat{y}}_{t+D}={y}_{t}\)). Given the limited time span and the irregularities in reporting, this is a competitive baseline to improve upon.

Data sets

Our analysis focuses on a set of 8 countries, namely the US, UK, England, Australia, Canada, France, Italy, Greece, and South Africa. This choice of countries covers various geographical locations across the world (Europe, North America, Australia, Africa), and allowed the curation of COVID-19-related search and news media terms from native speakers with a good understanding of this research project. Given the various logistic constraints related to data collection and translation from experts, we did not expand the analysis to other countries. However, it is expected that outcomes should replicate to other countries with similar socioeconomic and cultural characteristics. China was not included in our study because data from the Google search engine is not considered as a representative sample due to its modest market share (wikipedia.org/wiki/Google_China).

Online search data is obtained from Google Health Trends, a non public application interface offered by Google for research on health-related topics. Data represent daily online search query frequencies for specific areas of interest. Query frequencies are defined as the sum of search sessions that include a target search term divided by the total number of search sessions (for a day and area of interest). Hence, in the paper when we refer to a certain search query, we also refer to all other search queries that include all of its words. Google defines a search session as a grouping of consecutive searches by the same user within a short time interval. We have obtained data from September 30, 2011 to May 24, 2020 for the aforementioned 8 countries and separately for the nation of England. The list of search terms is determined by COVID-related symptoms and keywords. For each country, we mainly used queries in its native language(s). Each daily search query frequency is smoothed using a harmonic mean over the past D = 14 days including itself (Supplementary Eq. 4).

We are also using a global news corpus to extract media coverage trends for COVID-19 in all the countries in our study. This is estimated by counting the proportion of articles mentioning a COVID-19-related term (see SI for the list of terms). In particular, daily counts of the total of news media articles published, and the subset that included at least one relevant keyword anywhere in the body of the text were collected from the Media Cloud database (mediacloud.org) for the UK (93), US (225), Australia (61), Canada (79), France (360), Italy (178), Greece (75), and South Africa (135), where in the parentheses we state the number of media sources considered per country. These counts were collected from September 30, 2019 through May 24, 2020. Within this time-span we identified 2,535,735 articles as COVID-19-related from a total of 10,093,349. Supplementary Fig. 11 depicts the average daily ratio across all countries, as soon as it started being above zero (Jan. 2020), with two standard deviations as confidence intervals. Initially, there exists a distinctive pattern of (exponential) increase, but from April onwards the average media coverage has a moderate decreasing trend. In addition, we observe a certain variance across locations or time periods, that adds to the potential value of this signal.

For all countries in our analysis, we obtained daily confirmed COVID-19 cases and deaths time series from the European Centre for Disease Prevention and Control (ECDC). Links are provided in the SI.

Ethics

Ethical approval was not required for the analysis presented in this paper. Data was obtained in an anonymised and aggregated (national level) format.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.