Tracking COVID-19 using online search

Previous research has demonstrated that various properties of infectious diseases can be inferred from online search behaviour. In this work we use time series of online search query frequencies to gain insights about the prevalence of COVID-19 in multiple countries. We first develop unsupervised modelling techniques based on associated symptom categories identified by the United Kingdom’s National Health Service and Public Health England. We then attempt to minimise an expected bias in these signals caused by public interest—as opposed to infections—using the proportion of news media coverage devoted to COVID-19 as a proxy indicator. Our analysis indicates that models based on online searches precede the reported confirmed cases and deaths by 16.7 (10.2–23.2) and 22.1 (17.4–26.9) days, respectively. We also investigate transfer learning techniques for mapping supervised models from countries where the spread of the disease has progressed extensively to countries that are in earlier phases of their respective epidemic curves. Furthermore, we compare time series of online search activity against confirmed COVID-19 cases or deaths jointly across multiple countries, uncovering interesting querying patterns, including the finding that rarer symptoms are better predictors than common ones. Finally, we show that web searches improve the short-term forecasting accuracy of autoregressive models for COVID-19 deaths. Our work provides evidence that online search data can be used to develop complementary public health surveillance methods to help inform the COVID-19 response in conjunction with more established approaches.


Introduction
Online search data is routinely used to monitor the prevalence of infectious diseases, such as influenza [1][2][3][4]. Previous work has focused on supervised learning solutions, where ground truth data, in the form of historical syndromic surveillance reports, can be used to train machine learning models. However, sufficient data, in terms of accuracy and time span, do not yet exist to apply such approaches for monitoring the emerging COVID-19 pandemic caused by a novel coronavirus (SARS-CoV-2). Therefore, unsupervised or semi-supervised solutions should be sought. Recent work has shown that it is possible to transfer an online search based model for influenza-like illness (ILI) from a source to a target country without using ground truth data for the target location [5]. The transferred model's accuracy depends on choosing search queries and their corresponding weights wisely, via a transfer learning methodology, for the target location. In this work, we draw a parallel to previous findings and attempt to develop an unsupervised model for COVID-19 by: (i) carefully choosing search queries that refer to related symptoms as identified by a survey from the National Health Service (NHS) in the United Kingdom (UK), and (ii) weighting them based on their reported ratio of occurrence in people infected by COVID-19. Furthermore, understanding that online searches may also be driven by concern rather than infections, we devise a preliminary approach that attempts to minimise this part of the signal by incorporating a basic news media coverage metric in association with confirmed COVID-19 cases. Finally, we propose a transfer learning method for mapping supervised COVID-19 models from one country to another, in an effort to transfer knowledge from areas where the disease has progressed further. Results are presented for the UK, England, United States of America (US), Canada, Australia, France, Italy, and Greece.

Data
Google search. Google search data is obtained from the Google Health Trends API, a non-public API created by Google for research on health-related topics. Data represent search query frequencies for a day and a location. Query frequencies are defined as the number of search sessions that include a target search term divided by the total number of search sessions for this day and location. We have obtained data from September 30, 2011 to March 24, 2020 for the UK, England, US, Canada, Australia, France, Italy, and Greece. The list of search terms is determined by COVID-related symptoms and keywords. For each country, we mainly used queries in its native language.¹
News media volume. We use an extensive global news corpus to extract news media coverage trends for COVID-19 in all the countries of our study. Coverage is estimated as the proportion of articles mentioning a COVID-19 related term. In particular, daily counts of total news media articles, and of the subset that included at least one relevant keyword anywhere in the body of the text, were collected from the MediaCloud database² via national corpora for the UK (93), US (225), Canada [...]
¹ Please note that we avoid mentioning the exact search queries that we are using, to discourage any kind of user search behaviour bias that may invalidate our current approach. These will be released at a later time. At this stage, please contact us if you want to reproduce our technique.

Methods
We generate $k$ symptom-based search query groups using the $k$ symptoms identified in the FF100 NHS questionnaire for COVID-19 ($k = 18$). In a separate version, we also consider two additional groups: one referring to a symptom under investigation (anosmia, or loss of smell), and another that includes COVID-19 terminology, i.e. the "covid-19" keyword itself among others. Query groups may include different wordings for the same symptom or queries with minor grammatical differences (especially for queries in Greek and French). If a symptom is represented by more than one search query, then we obtain the total frequency (sum) across these queries. We apply a min-max normalisation to the frequency time series of each query group to obtain a balanced representation between more and less frequent searches. We divide our data into two periods of interest, the current one (from September 30, 2019 until March 24, 2020) and a historical one (from September 30, 2011 to September 29, 2019). The corresponding data sets are denoted by $H \in \mathbb{R}_{\geq 0}^{d_1 \times k}$ and $X \in \mathbb{R}_{\geq 0}^{d_2 \times k}$, where $d_1$, $d_2$ represent the numbers of days in the historical and current data, respectively. We use the symptom conditional probability distribution from the FF100 to assign weights ($w \in \mathbb{R}_{\geq 0}^{k}$) to each query category, and compute weighted time series ($h = Hw$, $x = Xw$). For the historical data, we divide the time span into yearly periods and compute an average time series trend, $h_\mu$, using two standard deviations as upper and lower confidence intervals. Finally, we standardise $x$ using the mean and standard deviation of the weighted time series of the current season augmented with points from the previous season (2018-19) to compensate for the missing (and potentially important) seasonal components. This is compared to a standardised version of the historical time series and its confidence intervals.
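As an illustration, the normalisation, weighting and standardisation steps described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the toy frequencies and the two example weights are hypothetical, and each query group is assumed to have already been summed across its queries.

```python
from statistics import mean, stdev

def min_max(series):
    """Scale a series to [0, 1]; a constant series maps to zeros (assumption)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]

def weighted_score(groups, weights):
    """groups: one frequency series per symptom group (k lists of equal length).
    weights: k non-negative symptom weights. Returns x = Xw, one value per day."""
    normalised = [min_max(g) for g in groups]
    n_days = len(normalised[0])
    return [sum(w * g[t] for w, g in zip(weights, normalised)) for t in range(n_days)]

def standardise(series, reference):
    """Z-score `series` using the mean and standard deviation of `reference`."""
    mu, sigma = mean(reference), stdev(reference)
    return [(v - mu) / sigma for v in series]

def yearly_trend(years):
    """years: equally long weighted series, one per historical year.
    Returns the per-day mean with a band of two standard deviations."""
    trend, lower, upper = [], [], []
    for values in zip(*years):
        mu, sd = mean(values), stdev(values)
        trend.append(mu)
        lower.append(mu - 2 * sd)
        upper.append(mu + 2 * sd)
    return trend, lower, upper

# Hypothetical toy data: two symptom groups over four days.
fever = [0.1, 0.2, 0.4, 0.8]
cough = [5.0, 5.0, 6.0, 9.0]
x = weighted_score([fever, cough], weights=[0.6, 0.4])
z = standardise(x, reference=x)
```

In the setting described above, `reference` would be the current season's weighted series augmented with points from the previous season rather than `x` itself.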
Minimising the effect of news media. On any given day the proportion of news articles about the COVID-19 pandemic is $m \in [0, 1]$, and the weighted score of symptom-related online searches (see previous paragraph) is equal to $g$; we can apply a min-max normalisation so that $g \in [0, 1]$ as well. We hypothesise that $g$ incorporates two signals based on infected ($g_p$) and concerned ($g_c$) users, respectively, i.e.
$$g = g_p + g_c . \tag{1}$$
Then, there exists a constant $\gamma \in [0, 1]$ such that
$$g_p = \gamma g \tag{2}$$
and
$$g_c = (1 - \gamma) g . \tag{3}$$
We apply ordinary least squares (OLS) regression to learn a mapping from $g$ and $m$ to the actual number of confirmed infections, $d$, per day. For a meaningful interpretation of the regression's weights, $d$ is also min-max normalised, i.e. such that $d \in [0, 1]$.
In particular, at each day, we use the previous $N$ days (including the current one) to optimise
$$\arg\min_{a_1, a_2} \sum_{i=1}^{N} \left( d_i - a_1 g_i - a_2 m_i \right)^2 , \tag{4}$$
where $a_1$ and $a_2 \in \mathbb{R}$ denote the weights of the online search and news signals, respectively. If $a_1 > 0$ and $a_2 < 0$, we can then hypothesise that the negative component coming from the media ($a_2 m$) is approximately equal to the unwanted component of the online search signal that is related to concern, i.e.
$$a_1 g_c \approx -a_2 m . \tag{5}$$
Solving this for $\gamma$, we get
$$\gamma = 1 + \frac{a_2 m}{a_1 g} . \tag{6}$$
Now, if $a_1 > 0$ and $a_2 > 0$, we can adjust for the relative contribution from the media by directly solving the equation $d = a_1 g + a_2 m = \gamma g$, which results in
$$\gamma = a_1 + \frac{a_2 m}{g} . \tag{7}$$
In the rare case that $a_2$ is set to a positive number that is close to zero (i.e. $a_2 \leq 0.01$), we set $\gamma = 1$, as the impact coming from the news media signal is negligible. For any other combination of $a_1$ and $a_2$ weights, we also set $\gamma = 1$, meaning that we consider the signal from the online search data in its entirety. Valid values for $\gamma$ are thresholded so that $\gamma$ is always in $[0, 1]$.
Using the above approach, we can learn a different $\gamma$ per day, and use $g_p = \gamma g$ as our unsupervised (or semi-supervised in this case) online search signal, attempting to minimise the impact of news in a dynamic fashion.
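The per-day estimation of $\gamma$ can be illustrated with a short sketch. It assumes a regression without an intercept, as implied by $d = a_1 g + a_2 m$, and solves the two-variable least squares problem in closed form. The function names, the fall-back to $\gamma = 1$ when the fit is degenerate or when $g = 0$, and the toy series are assumptions made for illustration only.

```python
def ols_two_features(g, m, d):
    """Least-squares weights (a1, a2) for d ≈ a1*g + a2*m, no intercept.
    Returns None when the normal equations are singular."""
    sgg = sum(x * x for x in g)
    smm = sum(x * x for x in m)
    sgm = sum(x * y for x, y in zip(g, m))
    sgd = sum(x * y for x, y in zip(g, d))
    smd = sum(x * y for x, y in zip(m, d))
    det = sgg * smm - sgm * sgm
    if abs(det) < 1e-12:
        return None
    a1 = (sgd * smm - smd * sgm) / det
    a2 = (smd * sgg - sgd * sgm) / det
    return a1, a2

def gamma_for_day(a1, a2, g_today, m_today, eps=0.01):
    """Map the regression weights to gamma following the case analysis above."""
    if g_today <= 0:
        return 1.0  # assumption: nothing to rescale when the search score is zero
    if a1 > 0 and a2 < 0:
        gamma = 1 + a2 * m_today / (a1 * g_today)
    elif a1 > 0 and a2 > eps:
        gamma = a1 + a2 * m_today / g_today
    else:
        gamma = 1.0  # negligible media weight or any other combination
    return min(1.0, max(0.0, gamma))

def daily_gamma(g, m, d, N=10):
    """One gamma per day, fitted on the N most recent days (current day included)."""
    gammas = []
    for t in range(len(g)):
        start = max(0, t - N + 1)
        fit = ols_two_features(g[start:t + 1], m[start:t + 1], d[start:t + 1])
        if fit is None:
            gammas.append(1.0)  # assumption: keep the full search signal
        else:
            gammas.append(gamma_for_day(fit[0], fit[1], g[t], m[t]))
    return gammas

# Hypothetical min-max normalised toy signals.
g = [0.1, 0.4, 0.9, 0.3]
m = [0.5, 0.2, 0.3, 0.8]
d = [0.5 * gi + 0.2 * mi for gi, mi in zip(g, m)]
g_p = [gamma * gi for gamma, gi in zip(daily_gamma(g, m, d, N=4), g)]
```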
Transferring supervised COVID-19 models to different countries. Previous work has shown that it is possible to transfer a model for seasonal flu, based on search query frequency time series, from one country that has access to historical syndromic surveillance data to another that does not [5]. Here, we adapt this method to transfer a model for COVID-19 from a source country where the spread has progressed significantly to a target country that is still in earlier stages of the epidemic curve. The rationale for this is that a supervised model based on data from the source country might be able to capture the disease dynamics better. The steps and data transformations that are required to apply this technique are detailed below.
Search query frequency time series are denoted by $S \in \mathbb{R}_{\geq 0}^{m \times n_S}$ and $T \in \mathbb{R}_{\geq 0}^{m \times n_T}$, for the source and target countries respectively; $m$ denotes the number of days considered, and $n_S$, $n_T$ the number of queries for the two locations. As these time series are quite volatile for some locations in our study, which hinders cross-location mapping of the data, we have smoothed them using a harmonic query frequency mean based on a window of the $D$ past days. More specifically, the smoothed search query frequency $s_i$ for a day $i$ is computed from the raw (non-smoothed) search query frequencies $x_{(\cdot)}$ within that window.
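One possible reading of this smoothing step is sketched below. It assumes the textbook harmonic mean of the raw frequencies over a trailing window of $D$ days (current day included); the weighting actually used in this work may differ. The function name and the handling of zero frequencies and of the first days, where fewer than $D$ values are available, are also assumptions.

```python
def harmonic_smooth(x, D):
    """Trailing-window harmonic mean of a raw query frequency series x.
    Day i uses the values x[i-D+1..i], or fewer at the start of the series."""
    smoothed = []
    for i in range(len(x)):
        window = x[max(0, i - D + 1): i + 1]
        if any(v <= 0 for v in window):
            # The harmonic mean tends to zero when any value is zero.
            smoothed.append(0.0)
        else:
            smoothed.append(len(window) / sum(1.0 / v for v in window))
    return smoothed

# Hypothetical raw frequencies for one query.
raw = [1.0, 2.0, 4.0, 4.0]
smooth = harmonic_smooth(raw, D=2)
```

Because the harmonic mean is dominated by the smallest values in the window, isolated spikes in a volatile series are damped more strongly than with an arithmetic mean.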
We train an elastic net model on data from the source location [6], similarly to previous work on ILI [1,3,7]. In particular, we solve the following optimisation task
$$\arg\min_{w, b} \left( \lVert y - S w - b \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2 \right) ,$$
where $y \in \mathbb{R}^m$ denotes the daily number of confirmed COVID-19 cases in the source location, $\lambda_1, \lambda_2 \in \mathbb{R}_{>0}$ are the $\ell_1$- and $\ell_2$-norm regularisation parameters, and $w \in \mathbb{R}^{n_S}$, $b \in \mathbb{R}$ denote the query weights and regression intercept, respectively. Prior to deploying elastic net, we apply a min-max normalisation on both $S$ and $y$. We fix the ratio of the $\lambda$'s, and then train $q$ models for different values of $\lambda_1$, under the constraint that only solutions that are sparse relative to the number of training instances are considered (for us to consider models with $\leq \xi$ nonzero weights, approximately $2\xi \log(\xi)$ samples are required). All regression models, represented by the columns of $W \in \mathbb{R}^{n_S \times q}$ and the elements of $b \in \mathbb{R}^q$, are used as an ensemble for a more inclusive transfer that combines various source models with different sparsity levels.
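To make the ensemble construction concrete, the sketch below fits the elastic net objective above with plain coordinate descent for several values of $\lambda_1$ and keeps only sufficiently sparse solutions. Coordinate descent is chosen here only to keep the example dependency-free; it is not necessarily the solver used in this work. The names `elastic_net`, `train_ensemble`, `ratio` and `max_nonzero`, the convention $\lambda_2 = \text{ratio} \cdot \lambda_1$, and the exclusion of all-zero models are assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(S, y, lam1, lam2, n_iter=200):
    """Coordinate descent for ||y - Sw - b||^2 + lam1*||w||_1 + lam2*||w||_2^2.
    S is a list of rows. The intercept is unpenalised, so it is handled by
    centring S and y and recovering b at the end."""
    m, n = len(S), len(S[0])
    col_mean = [sum(row[j] for row in S) / m for j in range(n)]
    y_mean = sum(y) / m
    Xc = [[row[j] - col_mean[j] for j in range(n)] for row in S]
    col_sq = [sum(Xc[i][j] ** 2 for i in range(m)) for j in range(n)]
    w = [0.0] * n
    residual = [v - y_mean for v in y]  # centred y minus Xc w, with w = 0
    for _ in range(n_iter):
        for j in range(n):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(Xc[i][j] * residual[i] for i in range(m)) + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, lam1 / 2) / (col_sq[j] + lam2)
            for i in range(m):
                residual[i] += Xc[i][j] * (w[j] - new_wj)
            w[j] = new_wj
    b = y_mean - sum(col_mean[j] * w[j] for j in range(n))
    return w, b

def train_ensemble(S, y, lam1_values, ratio=1.0, max_nonzero=None):
    """One model per lam1 (lam2 = ratio * lam1); keep non-empty sparse models."""
    models = []
    for lam1 in lam1_values:
        w, b = elastic_net(S, y, lam1, ratio * lam1)
        nonzero = sum(1 for v in w if v != 0.0)
        if nonzero > 0 and (max_nonzero is None or nonzero <= max_nonzero):
            models.append((w, b))
    return models

# Hypothetical min-max normalised data: 5 days, 3 queries, y follows query 0.
S = [[0.0, 0.5, 0.2], [0.25, 0.1, 0.9], [0.5, 0.9, 0.4],
     [0.75, 0.3, 0.1], [1.0, 0.6, 0.7]]
y = [0.0, 0.25, 0.5, 0.75, 1.0]
models = train_ensemble(S, y, lam1_values=[100.0, 1e-6], max_nonzero=2)
```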
Figure 1. Standardised online search based scores for COVID-19 related symptoms as identified by the NHS FF100 survey for 8 nations up to and including March 24, 2020. Query frequencies are weighted by symptom frequency as described in Methods (blue line). We have also included estimates after minimising the news media effect using data from PHE, ECDC, and a global news media corpus (black line). These scores are compared with an average 8-year trend of the weighted model (dashed line) and its corresponding confidence intervals (shaded area). For a better visualisation all time series are smoothed using a 7-point moving average.
To generate an equivalent feature space for the target location (same dimensionality, similar feature attributes), we first identify query group pairs between the source and the target location using the symptom categories in the NHS FF100 questionnaire. We map a source query to the target query from the same symptom category that maximises their Pearson correlation based on their frequency time series. To do this more effectively, prior to computing correlations, we shift the data by $z$ days (looking at a maximum window of 30 days backwards or forwards) so that the average correlation between search query frequencies in $S$ and $T$ is maximised. If no target search query exists for a certain symptom category, we simply use the best correlated query from all target queries available (irrespective of the symptom category) as its mapping. After this process, we end up with a subset $Z \in \mathbb{R}^{m \times n_S}$ of the target feature space $T$.
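The query mapping step can be sketched as follows. The sketch assumes that a single shift $z$ is selected by maximising the average, over all source queries, of the best Pearson correlation found for each of them; how the average is formed in this work is not spelled out, so this is only one plausible reading. The nested dictionary layout (`category -> query -> series`), the function names and the toy series are hypothetical.

```python
from statistics import mean

def pearson(a, b):
    """Pearson correlation of two equally long series (0.0 if one is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5

def shifted_pair(source, target, z):
    """Align source[t] with target[t + z] and return the overlapping parts."""
    if z >= 0:
        return source[: len(source) - z], target[z:]
    return source[-z:], target[: len(target) + z]

def map_queries(source, target, max_shift=30):
    """source, target: dict category -> dict query -> series.
    Returns the best shift z and a dict source query -> target query."""
    all_targets = {name: s for qs in target.values() for name, s in qs.items()}
    best = None
    for z in range(-max_shift, max_shift + 1):
        mapping, scores = {}, []
        for category, queries in source.items():
            # Fall back to every target query if the category has no match.
            candidates = target.get(category) or all_targets
            for name, series in queries.items():
                score, match = max(
                    (pearson(*shifted_pair(series, t, z)), t_name)
                    for t_name, t in candidates.items()
                )
                mapping[name] = match
                scores.append(score)
        average = mean(scores)
        if best is None or average > best[0]:
            best = (average, z, mapping)
    return best[1], best[2]

# Hypothetical toy series: the target "fever" query lags the source by 2 days.
pattern = [1, 3, 2, 7, 4, 9, 5, 12, 6, 8, 15, 10]
source = {"fever": {"fever_src": pattern}}
target = {"fever": {"fever_tgt": [0, 0] + pattern[:10],
                    "temp_tgt": [9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2]}}
z, mapping = map_queries(source, target, max_shift=3)
```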
Notably, $Z$ does not necessarily hold data for $n_S$ distinct queries, as different source queries may have been mapped to the same target query. $Z$ is subsequently normalised using min-max. To make both feature spaces ($S$, $Z$) numerically compatible, we scale the latter based on their mean, column-wise (per search query) ratio $r \in \mathbb{R}_{\geq 0}^{n_S}$, i.e. $Z_S = Z \odot r$, where each column of $Z$ is multiplied by the corresponding element of $r$. Now, we can deploy the ensemble of source models to the target space, making multiple inferences (for different $\lambda_1$ values) held in $Y \in \mathbb{R}^{m \times q}$:
$$Y = Z_S W + \mathbf{1} b^\top ,$$
where $\mathbf{1} \in \mathbb{R}^m$ is a vector of ones. We then reverse the min-max normalisation for each one of the inferred time series (columns of $Y$) using values from the source model's ground truth $y$ (prior to its normalisation). Finally, we compute the mean of the ensemble (across the rows of $Y$) as our target estimate, and also use two standard deviations to form a 95% confidence interval.
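The scaling and ensemble inference steps can be illustrated with a short sketch. It assumes that $r$ holds, for each query, the ratio of the source column mean to the target column mean, and that each ensemble member is a pair of weights and intercept. The function name `transfer`, the argument layout and the toy values are hypothetical.

```python
from statistics import mean, stdev

def transfer(S, Z, models, y_raw):
    """S, Z: min-max normalised feature matrices (lists of rows) with the same
    number of columns. models: list of (w, b) pairs learned on the source.
    y_raw: source ground truth before its min-max normalisation.
    Returns the ensemble mean per day with a band of two standard deviations."""
    n = len(S[0])
    r = []
    for j in range(n):
        mean_s = mean(row[j] for row in S)
        mean_z = mean(row[j] for row in Z)
        # Assumption: a target column that is zero on average stays at zero.
        r.append(mean_s / mean_z if mean_z > 0 else 0.0)
    Z_S = [[row[j] * r[j] for j in range(n)] for row in Z]

    y_min, y_max = min(y_raw), max(y_raw)
    estimate, lower, upper = [], [], []
    for row in Z_S:
        # One inference per ensemble member, i.e. one row of Y.
        preds = [sum(x * wj for x, wj in zip(row, w)) + b for w, b in models]
        # Reverse the min-max normalisation using the source ground truth.
        preds = [p * (y_max - y_min) + y_min for p in preds]
        mu = mean(preds)
        sd = stdev(preds) if len(preds) > 1 else 0.0
        estimate.append(mu)
        lower.append(mu - 2 * sd)
        upper.append(mu + 2 * sd)
    return estimate, lower, upper

# Hypothetical toy example: one query, two days, two ensemble members.
S = [[0.0], [1.0]]
Z = [[0.0], [0.5]]
models = [([1.0], 0.0), ([0.5], 0.0)]
estimate, lower, upper = transfer(S, Z, models, y_raw=[10, 110])
```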

Results
The current online search based scores for COVID-19 in 8 nations are depicted in Figures 1 and 2 (data up to March 24, 2020). For a better visualisation, all time series are smoothed using a 7-point moving average, 3 days prior and after each point.
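For reference, a centred 7-point moving average of this kind can be written in a few lines. The sketch below is illustrative only; in particular, shrinking the window at the two ends of the series is an assumption, as the edge handling is not described, and the function name is hypothetical.

```python
def centred_moving_average(series, half_window=3):
    """Average each point with up to `half_window` neighbours on either side.
    Near the edges the window shrinks to the available points (assumption)."""
    smoothed = []
    for i in range(len(series)):
        window = series[max(0, i - half_window): i + half_window + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical daily scores.
scores = [1, 2, 3, 4, 5, 6, 7]
smoothed = centred_moving_average(scores)
```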
Figure 1 shows scores based on symptom-related query frequencies that are weighted by the actual symptom probability as reported in the NHS FF100 survey for COVID-19. Expanding on this, Figure 2 shows scores when search queries about the symptom of "anosmia", as well as queries strictly about COVID-19, are added as additional query groups. We set the weight of the anosmia symptom category to 0.4 (2 in 5 cases), as we wait for confirmation from an expert analysis. The weight of the strictly COVID-19 related queries is set equal to 1. The rationale behind including the latter category is that by now (and perhaps at an earlier time point) people who experience COVID-19 related symptoms might search about this disease directly, as its name(s) and associated symptoms are broadly known. Focusing on the weighted signal (blue lines), we observe exponentially increasing rates that exceed the estimated confidence intervals in most investigated countries, with Greece perhaps being the least affected. At the same time, we also observe a recent drop of the score in all countries. The added query categories (Fig. 2) slightly increase the maximum scores per country.
Looking at the scores where we have attempted to minimise the effect of news (black lines), and at the same time introduce some form of supervision in conjunction with the daily number of confirmed cases per country (note that we look back at the previous N = 10 days to determine this), we observe more conservative estimates in most locations, including a recent drop in some of them. Notably, the caveat of this approach is that it relies on the existence of a representative population sample based on the number and distribution of COVID-19 tests conducted in each country. If the reported daily number of COVID-19 cases is not a representative proportion, then variable d (confirmed number of infections) in Eq. 4 is not reliable, and the regression task becomes ill-posed, together with any interpretation of the derived weights.
Finally, Figure 3 showcases the outcome of an experiment where we trained a model for Italy and then transferred it to the rest of the countries in our analysis. Italy was chosen as the source country because it is considered to be ahead of the rest in terms of epidemic progression. During this experiment search query frequency time series were smoothed (as explained in Methods) using a harmonic mean of the past 14 days. An interesting observation while implementing the transfer learning model was that the search query frequency data for Italy were best correlated with other countries after shifting them by a number of days: 1 for Canada, 2 for the UK and the US, 7 for Australia, 8 for England and France, and 18 days for Greece. This indicates that in most cases Italy is, indeed, ahead by a few days, at least in terms of user search behaviour. In total, we learn and transfer 37,331 elastic net models that select, by assigning a nonzero weight, from 2 to 18 search queries. Interestingly, the mapped trends correspond sufficiently well to confirmed cases data in most countries. Minor discrepancies are observed for Greece, Australia, and France, with the latter being the only case where this approach seems to under-predict. The same caveat, explained in the previous paragraph, applies to this analysis as well.

Figure 2. Standardised online search based scores for COVID-19 related symptoms as identified by the NHS FF100 survey, in addition to queries about the symptom of "anosmia", and a group of coronavirus-related terms, for 8 nations up to and including March 24, 2020. Query frequencies are weighted by symptom frequency as described in Methods (blue line). We have also included estimates after minimising the news media effect using data from PHE, ECDC, and a global news media corpus (black line). These scores are compared with an average 8-year trend of the weighted model (dashed line) and its corresponding confidence intervals (shaded area). For a better visualisation all time series are smoothed using a 7-point moving average.

7. Lampos, V., Yom-Tov, E., Pebody, R. & Cox, I. J. Assessing the impact of a health intervention via user-generated internet content. Data Min. Knowl. Discov. 29, 1434-1457 (2015).

Figure 3. Transferring a supervised model for Italy to other countries in our analysis. The figures show an estimated confirmed cases trend (with confidence intervals) in all locations in our analysis (minus Italy) compared to the recorded confirmed cases as reported by the ECDC. Plot lines have been standardised, and then smoothed using a 3-point moving average.

Aggregated confirmed COVID-19 cases. The number of confirmed COVID-19 cases on a daily basis for England is obtained from the corresponding PHE web page.³ For all the remaining locations we obtain daily confirmed cases data from the European Centre for Disease Prevention and Control (ECDC).⁴