
# COVID-19 predictability in the United States using Google Trends time series

## Abstract

During the unprecedented situation that all countries around the globe are facing due to the Coronavirus disease 2019 (COVID-19) pandemic, which has also had severe socioeconomic consequences, it is imperative to explore novel approaches to monitoring and forecasting regional outbreaks as they happen or even before they do so. To that end, in this paper, the role of Google query data in the predictability of COVID-19 in the United States at both national and state level is presented. As a preliminary investigation, Pearson and Kendall rank correlations are examined to explore the relationship between Google Trends data and COVID-19 data on cases and deaths. Next, a COVID-19 predictability analysis is performed using a quantile regression that is bias corrected via bootstrap simulation, i.e., a robust regression approach that guards against outliers in the sample while also mitigating small-sample estimation bias. The results indicate that there are statistically significant correlations between Google Trends and COVID-19 data, while the estimated models exhibit strong COVID-19 predictability. In line with previous work that has suggested that online real-time data are valuable in the monitoring and forecasting of epidemics and outbreaks, it is evident that such infodemiology approaches can assist public health policy makers in addressing the most crucial issues: flattening the curve, allocating health resources, and increasing the effectiveness and preparedness of their respective health care systems.

## Introduction

In December 2019, a novel coronavirus of unknown source was identified in a cluster of patients in the city of Wuhan, Hubei, China1. The outbreak first came to international attention after the World Health Organization (WHO) reported a cluster of pneumonia cases on Twitter on January 4th2, followed by the release of an official report on January 5th3. China reported its first COVID-19-related death on January 11th, while on January 13th, the first case outside China was identified4. On January 14th, the WHO tweeted that Chinese preliminary investigations reported that no human-to-human transmission had been identified5. However, the virus quickly spread to other Chinese regions and neighboring countries, while Wuhan, identified as the epicenter of the outbreak, was cut off by authorities on January 23rd, 2020 (ref. 6). On January 30th, the WHO declared the epidemic to be a public health emergency1, and the disease caused by the virus received its official name, that is, COVID-19, on February 11th7.

The first serious COVID-19 outbreak in Europe was identified in northern Italy during February, with the country recording its first death on February 21st8. The novel coronavirus was transmitted to all parts of Europe within the next few weeks, and as a result, the WHO declared COVID-19 to be a pandemic on March 11th, 2020. As of 16:48 GMT on April 18th, 2020 (ref. 9), there were 2,287,369 confirmed cases worldwide, with 157,468 confirmed deaths and 585,838 recovered patients. The most affected countries with more than 100 k cases (in absolute numbers, not divided by population) were the US, with 715,105 confirmed cases and 37,889 deaths; Spain, with 191,726 confirmed cases and 20,043 deaths; Italy, with 175,925 confirmed cases and 23,227 deaths; France, with 147,969 confirmed cases and 18,681 deaths; Germany, with 142,614 confirmed cases and 4,405 deaths; and the UK, with 114,217 confirmed cases and 15,464 deaths. The worldwide geographical distribution of COVID-19 cases and deaths by country is depicted in Fig. 1.

As shown, Europe has been severely affected by COVID-19. However, the spread of the disease now indicates that the center of the epidemic has moved to the US, with the state of New York counting more than 240 k cases and 17 k deaths. Figure 2 shows the distribution of COVID-19 cases and deaths in the United States by state as of April 18th, 2020 (ref. 10).

To find new methods and approaches for disease surveillance, it is crucial to take advantage of real-time internet data. Infodemiology, i.e., information epidemiology, is a concept that was introduced by Gunther Eysenbach11,12. In the field of infodemiology, internet sources and data are employed to inform public health and policy13,14. These approaches have been suggested to be valuable for the monitoring and forecasting of outbreaks and epidemics15, such as Ebola16, Zika17, MERS18, influenza19, and measles20,21.

In this paper, Google Trends data on the topic of “Coronavirus (virus)” in the United States are employed at both the national and state levels to explore the relationship between COVID-19 cases and deaths and online interest in the virus. First, a correlation analysis between Google Trends and COVID-19 data is performed; then, the role of Google Trends data in the predictability of COVID-19 is explored. To the best of our knowledge, this paper is the first attempt of this kind performed for the United States.

The rest of the paper is structured as follows. The Methods section details the data collection procedure and the statistical analysis tools and methods. The Results section consists of the correlation analysis and of the forecasting models at both national and state levels. The Discussion section presents the main findings of this work, along with the limitations of this paper and future research suggestions.

## Methods

Data from the Google Trends platform are retrieved in .csv format39 and are normalized over the selected period. Google Trends reports the adjustment procedure as follows: “Search results are normalized to the time and location of a query by the following process: Each data point is divided by the total searches of the geography and time range it represents to compare relative popularity. Otherwise, places with the most search volume would always be ranked highest. The resulting numbers are then scaled on a range of 0 to 100 based on a topic’s proportion to all searches on all topics. Different regions that show the same search interest for a term don't always have the same total search volumes”40. The data collection methodology is designed based on the Google Trends Methodology Framework in Infodemiology and Infoveillance41. Note that the data may slightly vary based on the time of retrieval.
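The quoted normalization can be illustrated with a short sketch. Google does not publish the exact computation, so the following is only an approximation under stated assumptions: the function name `normalize_trends`, its arguments, and all counts are hypothetical, and the index is anchored so that the period with the largest share of searches equals 100.

```python
def normalize_trends(term_counts, total_counts):
    """Scale a topic's share of all searches to a 0-100 index (illustrative only)."""
    # Relative popularity: topic searches divided by all searches in each period.
    shares = [t / total for t, total in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    # The period with the largest share becomes 100; the rest are scaled to it.
    return [round(100 * s / peak) for s in shares]


# Hypothetical daily counts (made-up numbers, not Google data).
term = [50, 200, 400, 300]
total = [10_000, 10_000, 20_000, 10_000]
index = normalize_trends(term, total)
print(index)  # [17, 67, 67, 100]
```

In this toy example, the third day has twice as many raw searches as the second but the same index value, because the total search volume doubled as well; this is the sense in which the index measures relative rather than absolute interest.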

For keyword selection, the online interest in all commonly used variations is examined, and the variations are compared, i.e., “coronavirus (virus)”; “COVID-19 (search term)”; “SARS-COV-2 (search term)”; “2019-nCoV (search term)”; and “coronavirus (search term)”. Only “coronavirus (virus)” and “coronavirus (search term)” yield, as expected, considerably high online interest. Between the two, i.e., the topic (virus) and the search term, “coronavirus (virus)” is selected for further analysis.

Data on the worldwide distribution of COVID-19 cases and deaths are retrieved from Worldometer9. Data for the United States analysis of COVID-19 are retrieved from “The COVID Tracking Project”, which provides detailed structured data on COVID-19 cases and deaths nationally and at state level10. Maps of COVID-19 cases and deaths and online interest are created by the authors using the free online tools Pixelmap42 and Chartsbin43, with data from the respective sources9,10, while graphs, spider web charts, and maps of the correlation coefficients are created by the authors using Microsoft Excel (version 16.39).

As Google Trends data are normalized, the timeframe for which search traffic data are retrieved should exactly match the period for which COVID-19 data are available. Therefore, the timeframes for which analysis is performed are different among states, starting either on March 4th (for most cases) or on the date on which the first confirmed case was identified in each state, as shown in Table 2.

Each variable used in this study is divided by its full-sample standard deviation, calculated with the basic formula for the standard deviation of a variable. By doing so, the inherent variability of each variable is removed, and thus, all variables have a standard deviation equal to 1. This equivalence makes it possible to compare the strength of the impact of the explanatory variables on the dependent variable. The nonparametric Phillips–Perron unit root test44 is also applied to reveal whether or not the variables are stationary. The results suggest that both variables can be used directly in the present analysis without further transformation.
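As a minimal sketch of this scaling step, the snippet below divides a series by its standard deviation. The text does not state whether the sample (n − 1) or population (n) denominator is used; the sample version is assumed here, and the function name `scale_by_sd` and the data are hypothetical.

```python
import statistics


def scale_by_sd(values):
    """Divide every observation by the full-sample standard deviation."""
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = statistics.stdev(values)
    return [v / sd for v in values]


# Hypothetical Google Trends values (illustrative only).
trends = [12.0, 35.0, 78.0, 100.0, 64.0]
scaled = scale_by_sd(trends)
print(round(statistics.stdev(scaled), 6))  # 1.0
```

Note that the series is only rescaled, not centered, so its mean is generally not zero after the transformation.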

The first step in exploring the role of Google Trends in the predictability of COVID-19 is to examine the relationship between Google Trends and the incidence of COVID-19. As Pearson correlation analysis is the benchmark analysis in this kind of approach, the Pearson correlation coefficients (r) between the ratio (COVID-19 deaths)/(COVID-19 cases) and Google Trends data are calculated. In particular, a minimum variance bias-corrected Pearson correlation coefficient45,46 via a bootstrap simulation is applied to deal with the limited number of observations and, therefore, small sample estimation bias (see also refs. 45 and 47). The bias-corrected bootstrap coefficient $${\stackrel{\sim }{\rho }}^{b}$$ for the Pearson correlation is given as follows:

$${\stackrel{\sim }{\rho }}^{b}={B}^{-1}\sum_{j=1}^{B}{\stackrel{\sim }{\rho }}_{j}^{b}\left(\rho \right)$$

where $$B$$ is the number of bootstrap replications; in this case, it is set equal to 999 (ref. 48). Note that the terms “COVID-19 deaths” and “COVID-19 cases” refer to the cumulative (total) COVID-19 deaths and cases in the United States and that this terminology is used hereafter unless otherwise stated.
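One way to read this procedure is sketched below. It is an assumption-laden illustration rather than the authors' code: it uses the standard bootstrap bias correction (the estimate minus the estimated bias, where the bias is the mean of the bootstrap replicates minus the estimate), an ordinary paired bootstrap, and hypothetical names and data throughout.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bias_corrected_pearson(x, y, B=999, seed=0):
    """Bootstrap bias-corrected correlation: rho_hat - (mean(rho*) - rho_hat)."""
    rng = random.Random(seed)
    n = len(x)
    rho_hat = pearson(x, y)
    reps = []
    while len(reps) < B:
        # Resample (x, y) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # Skip degenerate resamples in which a variable is constant.
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        reps.append(pearson(xs, ys))
    bias = sum(reps) / B - rho_hat
    return rho_hat - bias


# Hypothetical series: Google Trends values and deaths/cases ratios.
trends = [100.0, 92.0, 85.0, 80.0, 71.0, 66.0, 60.0, 58.0]
ratio = [1.2, 1.5, 1.4, 1.9, 2.3, 2.2, 2.8, 3.1]
print(bias_corrected_pearson(trends, ratio))
```

The fixed `seed` only makes the sketch reproducible; the replicate count of 999 follows the value of $$B$$ given above.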

Next, a secondary correlation analysis is performed using the Kendall rank correlation, which is a nonparametric test that measures the strength of dependence between two variables. The Kendall rank correlation is distribution free and is considered robust for ratio data. For two samples, each of size $$n$$, the total number of pairings is $$\frac{1}{2}n(n-1)$$. The following formula is used to calculate the value of the bias-corrected Kendall rank correlation:

$${\stackrel{\sim }{\tau }}^{b}={B}^{-1}\sum_{j=1}^{B}{\stackrel{\sim }{\tau }}_{j}^{b}\left(\tau \right)$$

where $$\tau$$ is given by $$\tau =\frac{{n}_{c}- {n}_{d}}{\frac{1}{2}n(n-1)}$$, $${n}_{c}$$ is the number of concordant pairs, and $${n}_{d}$$ is the number of discordant pairs.
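The definition of $$\tau$$ can be made concrete with a direct pair count. The sketch below implements exactly the ratio given above (tied pairs count as neither concordant nor discordant); the function name `kendall_tau` and the data are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall rank correlation: (n_c - n_d) / (n * (n - 1) / 2)."""
    n = len(x)
    n_c = n_d = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is concordant when both variables move in the same direction.
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                n_c += 1
            elif s < 0:
                n_d += 1
    return (n_c - n_d) / (n * (n - 1) / 2)


# Five observations give 10 pairings: 7 concordant and 3 discordant.
x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 2, 5]
print(kendall_tau(x, y))  # 0.4
```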

Next, a COVID-19 predictability analysis based on Google Trends time series is performed for the United States and all US states (plus DC). The predictability model is a quantile regression, which is considered robust against the presence of outliers in the sample; it was introduced by Koenker and Bassett49. Building on the study by Karlsson46, a quantile regression that is bias corrected via balanced bootstrapping is employed. Such a model is an appropriate statistical approach for mitigating small-sample estimation bias and the influence of outliers in the dataset, as it combines the advantages of bootstrap standard errors and the merits of quantile regression. Additional background on quantile regression can be found in Koenker and Hallock50 and Yu et al.51, while recent applications of quantile regression can be found in refs. 52 and 53. More recently, ref. 54 introduced unconditional quantile regression, while ref. 55 provides further insights into robust regression estimates.

Let $${Y}_{t}$$, with $$t\in T$$, be a time series that represents the dependent variable, supposing a bivariate specification. Quantile regression estimates the impact of the explanatory variable $${X}_{t}$$, with $$t\in T$$, on the variable $${Y}_{t}$$ at different quantiles $$q\in \left(0,1\right)$$ of the conditional distribution of $${Y}_{t}$$. Values of $$q$$ close to zero and close to one represent the left (lower) and right (upper) tails of the conditional distribution, respectively. The conditional quantile function is defined as follows:

$$Q_{Y|X} \left( q \right) = {\text{X}}^{\prime } \beta_{q}$$

Given the distribution of $${Y}_{t}$$, the quantile regression coefficients $${\beta }_{q}$$ can be obtained by solving the following minimization problem:

$${\beta }_{q}=\mathrm{arg}\underset{\beta \in {\mathbb{R}}^{k}}{\mathrm{min}}E\left({\rho }_{q}\left(Y-{X}^{\prime}\beta \right)\right)$$

where $${\rho }_{q}\left(y\right)=y\left(q-{1}_{\left\{y<0\right\}}\right)$$ represents the loss function.
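The loss function $${\rho }_{q}$$ is simple enough to write out directly. The sketch below (hypothetical names and data) also checks a well-known property: the constant that minimizes the summed loss at $$q = 0.5$$ is the sample median, which an outlier barely moves.

```python
def check_loss(u, q):
    """Quantile (check) loss: rho_q(u) = u * (q - 1{u < 0})."""
    indicator = 1.0 if u < 0 else 0.0
    return u * (q - indicator)


# At q = 0.5 positive and negative residuals are penalized equally.
print(check_loss(2.0, 0.5), check_loss(-2.0, 0.5))  # 1.0 1.0

# Hypothetical sample with one outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]


def total_loss(c, q=0.5):
    return sum(check_loss(v - c, q) for v in values)


# Search over the observed values for the constant with the smallest loss.
best_constant = min(values, key=total_loss)
print(best_constant)  # 3.0
```

For comparison, the mean of the same sample is 22.0, which illustrates why a median-based criterion is less sensitive to extreme observations than a least squares one.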

Replacing the expectation with its sample analog over the observations $$\left\{{y}_{1},\dots ,{y}_{n}\right\}$$, the estimator of $${\beta }_{q}$$ for the $${q}^{th}$$ quantile takes the following form:

$${\beta }_{q}=\mathrm{arg}\underset{\beta \in {\mathbb{R}}^{k}}{\mathrm{min}}\sum_{t=1}^{n}{\rho }_{q}\left({Y}_{t}-{X}_{t}^{\prime}\beta \right)=\mathrm{arg}\underset{\beta \in {\mathbb{R}}^{k}}{\mathrm{min}}\left[q\sum_{{Y}_{t}\ge {X}_{t}^{\prime}\beta }\left|{Y}_{t}-{X}_{t}^{\prime}\beta \right|+\left(1-q\right)\sum_{{Y}_{t}<{X}_{t}^{\prime}\beta }\left|{Y}_{t}-{X}_{t}^{\prime}\beta \right|\right]$$

where $${X}_{t}^{\prime}\beta$$ is an approximation of the conditional $$q$$-quantile of the variable $${Y}_{t}$$.

In our analysis, $${Y}_{t}$$ stands for the ratio (COVID-19 deaths)/(COVID-19 cases), $${X}_{t-1}$$ is the respective Google Trends value lagged by one period, and $$t=1,\dots ,T$$, with $$T$$ being the respective number of observations. A linear trend is used as well.
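To make the estimator concrete, the following sketch fits a median regression of $${Y}_{t}$$ on the lagged regressor $${X}_{t-1}$$. It is a simplification under stated assumptions, not the estimation routine used in this paper: it includes only an intercept and a slope (the linear trend is omitted), and it exploits the fact that a two-parameter quantile regression always has an optimal line passing through at least two data points, so all such lines can simply be enumerated. All names and data are hypothetical.

```python
from itertools import combinations


def check_loss(u, q):
    """Quantile (check) loss rho_q(u)."""
    return u * (q - (1.0 if u < 0 else 0.0))


def quantile_line(x, y, q=0.5):
    """Exact q-quantile fit of y = b0 + b1 * x by enumerating pairs of points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # a vertical line is not a valid candidate
        b1 = (y2 - y1) / (x2 - x1)
        b0 = y1 - b1 * x1
        loss = sum(check_loss(yi - b0 - b1 * xi, q) for xi, yi in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]


def lagged_pairs(y, x, lag=1):
    """Align y_t with x_{t-lag}; the first `lag` values of y are dropped."""
    return x[:-lag], y[lag:]


# Hypothetical series: y_t = 5 - 0.5 * x_{t-1}, with one outlier (9.0).
trends = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ratio = [5.0, 4.5, 4.0, 9.0, 3.0, 2.5, 2.0]
x_lag, y_now = lagged_pairs(ratio, trends)
b0, b1 = quantile_line(x_lag, y_now, q=0.5)
print(b0, b1)  # 5.0 -0.5
```

The fitted line ignores the single outlier and recovers the underlying relationship, which is the robustness property that motivates median regression here. The enumeration costs on the order of $$n^3$$ operations, so it is suitable only for the short series considered in a sketch like this one.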

Finally, the bias-corrected parameter is estimated as follows:

$${\stackrel{\sim }{\beta }}^{b}\left(q\right)=\widehat{\beta }\left(q\right)-\widehat{bias}\left(\widehat{\beta }\left(q\right)\right)$$

where $$\widehat{bias}\left(\widehat{\beta }\left(q\right)\right)$$ is given by $${B}^{-1}{\sum }_{j=1}^{B}{\widehat{\beta }}_{j}^{*}\left(q\right)-\widehat{\beta }\left(q\right)$$ and $$q\in (0, 1)$$ denotes the quantile considered, which in this case is set equal to 0.5 (median). Median regression is considered more robust to outliers than, for example, least squares regression. It also avoids parametric assumptions about the error distribution56.
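The bias correction can be sketched as follows, using a balanced bootstrap in which every observation appears exactly $$B$$ times across all resamples. This is an illustrative reading with hypothetical names and data, not the authors' implementation: the slope estimator is a brute-force two-parameter quantile fit without the linear trend, and resamples in which the regressor is constant are skipped.

```python
import random
from itertools import combinations


def quantile_slope(x, y, q=0.5):
    """Slope of the q-quantile line y = b0 + b1 * x, or None if x is constant."""
    def loss(b0, b1):
        total = 0.0
        for xi, yi in zip(x, y):
            u = yi - b0 - b1 * xi
            total += u * (q - (1.0 if u < 0 else 0.0))
        return total

    best = None
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue
        b1 = (y2 - y1) / (x2 - x1)
        current = loss(y1 - b1 * x1, b1)
        if best is None or current < best[0]:
            best = (current, b1)
    return None if best is None else best[1]


def balanced_bootstrap_indices(n, B, rng):
    """Shuffle B copies of the indices and cut them into B resamples of size n."""
    pool = list(range(n)) * B
    rng.shuffle(pool)
    return [pool[j * n:(j + 1) * n] for j in range(B)]


def bias_corrected_slope(x, y, q=0.5, B=999, seed=0):
    """beta_tilde = beta_hat - (mean of bootstrap slopes - beta_hat)."""
    rng = random.Random(seed)
    beta_hat = quantile_slope(x, y, q)
    reps = []
    for idx in balanced_bootstrap_indices(len(x), B, rng):
        b = quantile_slope([x[i] for i in idx], [y[i] for i in idx], q)
        if b is not None:  # skip degenerate resamples
            reps.append(b)
    bias = sum(reps) / len(reps) - beta_hat
    return beta_hat - bias


# Hypothetical standardized data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [4.6, 3.9, 3.6, 2.9, 2.4, 2.1, 1.4, 1.1]
print(bias_corrected_slope(x, y, q=0.5, B=999, seed=0))
```

When the data lie exactly on a line, every valid resample returns the same slope, the estimated bias is zero, and the corrected estimate equals the original one.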

All estimation results reported in this paper were computed in the R programming environment57. In particular, we employed the R packages "quantreg" and "boot" to compute the quantile regression estimates and to perform the bootstrapping, respectively. The code is available in a “Supplementary Online Material file”.

## Results

Figure 3 depicts the worldwide and US online interest in terms of Google queries in the “coronavirus (virus)” topic from January 22nd to April 15th, 2020. It shows that this topic is very popular, especially in Europe and North America. Specifically, interest in the United States is considerably high (above 70) for all US states.

To perform a first assessment of the relationship between Google Trends and COVID-19 data, the Pearson and Kendall rank correlations between the two variables are calculated, and the results are further compared. Tables 3 and 4 present the results of the Pearson and Kendall correlation analysis by state, respectively.

As reported in Table 3, statistically significant correlations are observed for the United States and for the states of Alabama, Arkansas, California, Colorado, Florida, Georgia, Illinois, Kentucky, Massachusetts, Minnesota, Nebraska, Nevada, New Hampshire, New York, North Carolina, Oregon, Pennsylvania, South Dakota, Tennessee, Vermont, Virginia, Washington, Wisconsin, and Wyoming as well as DC. The states of Iowa, Louisiana, Maine, Mississippi, Missouri, North Dakota, South Carolina, and Utah marginally miss the p < 0.1 threshold of statistical significance, i.e., $$p\in (0.1, 0.2)$$.

Based on the Kendall correlation analysis, statistically significant correlations are observed for the United States and for the states of Alaska, Arizona, Arkansas, California, Connecticut, Florida, Georgia, Hawaii, Iowa, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Tennessee, Utah, Vermont, Virginia, Washington, and Wisconsin as well as DC. Figure 4 depicts the heat map of the (a) Pearson and (b) Kendall correlation coefficients in the United States by state over the period examined.

As depicted in the heat maps and in the spider web charts for the respective correlation analyses in Fig. 5, visual comparison of the two approaches indicates that the results are consistent in both analyses.

However, the main purpose of this study is to explore the predictability of COVID-19 using Google Trends data in the United States. Proceeding with the results of the predictability analysis, Fig. 6 depicts the heat map for $${\beta }_{1}$$ by state, while Table 5 presents the quantile regression estimated predictability models for the US and for each US state (plus DC). As shown, the estimated Google Trends models exhibit strong COVID-19 predictability.

Note that due to the low number of observations, the states of Maine, Montana, North Dakota, West Virginia, and Wyoming are not included in the predictability analysis results, but they are given the value “zero (0)” to be included in the heat map for purposes of uniformity.

## Discussion

As of July 29th, 2020, there were 16,920,857 COVID-19 recorded cases worldwide, with the reported death toll at 664,141 and the number of recovered patients at 10,485,316 (ref. 9). In light of the COVID-19 pandemic and to find new ways of forecasting the spread of the disease, infodemiology approaches have provided valuable input in monitoring and forecasting the development of the COVID-19 pandemic over time and in measuring and analyzing the public’s awareness and response. Google Trends and Twitter have been identified as the most popular infodemiology sources, while other social media, such as Facebook and Instagram, exhibit promising results in analyzing users’ online behavioral patterns13.

Social media platforms can provide us with more qualitative data that can shift the focus to other directions. Such approaches include sentiment analysis, educational purposes, and efforts to measure and raise public awareness. Recent approaches to analyzing aspects of the COVID-19 pandemic using social media data include monitoring the Twitter usage of G7 leaders58, monitoring self-reported symptoms on Twitter59, and analyzing the public perception of the disease through Facebook60. Moreover, infodemiology sources have provided valuable input in recruiting online survey participants through Facebook to measure individuals’ COVID-19 confidence levels61 and in assessing the behavioral variations in COVID-19-related online search traffic in more than one search engine62. Finally, commentaries that make recommendations on the integration of other social media platforms, such as Facebook, Reddit, and TikTok, for disseminating medical information to inform public health and policy have been published63.

Google Trends offers a solid foundation for quantitative analysis with respect to the monitoring and predictability of COVID-19, as in the analysis presented in this study, where Google Trends data on the “coronavirus (virus)” topic were used to explore the predictability of COVID-19 in the United States at both national and state level. First, for a preliminary assessment of the relationship between Google Trends and COVID-19 data, Pearson correlation and Kendall rank correlation analyses were performed. Statistically significant correlations were observed for the United States and for several US states, which is in line with previous studies that argue that there is a relationship between Google Trends and COVID-19 data.

The COVID-19 predictability analysis, which used a quantile regression approach, exhibits very promising results and indicates the most important contribution of this study to the international literature: detecting and predicting the early spread of COVID-19 at the regional level. This contribution can be a substantial supplement in further assisting local authorities in taking the appropriate measures to handle the spread of the disease.

Figure 7 illustrates a graph of the COVID-19 deaths/cases ratio, daily COVID-19 deaths, daily COVID-19 cases, and the respective Google Trends normalized data in the United States from March 4th to April 15th, 2020. For purposes of consistency in the graph, the COVID-19-related time series are normalized on a 0–100 scale. As depicted in the graph and confirmed by the predictability analysis, the two variables are not linearly dependent. Instead, they exhibit an inversely proportional relationship, meaning that as COVID-19 progresses, the online interest in the virus decreases.

From a behavioral point of view, this result can be explained as follows. First, online interest starts to increase and reaches a peak as the number of confirmed cases becomes high and as the death rates start to show that the pandemic does indeed have severe consequences. However, after a certain period, the interest reverses course, which could also indicate that the public is overwhelmed by information overload and decreases its information “intake”. The spike in Google queries and the decline in the ratio of COVID-19 deaths/cases could be attributed to the spread of the virus over these days and the “delay” in deaths; that is, cases increase while the total number of deaths has not yet started to increase considerably.

The latter point is in line with previous work on the topic27 suggesting that although significant correlations between COVID-19 and Google data are observed, the relationship tends to decrease in both strength and significance in regions that have been affected by COVID-19 as we move forward in time because the interest in the virus decreases. This decrease is counterintuitive and occurs before the case and death curves start to exhibit a downward trend, i.e., when a region is being heavily affected, independent of whether or not it has reached its peak. However, it would be interesting for future investigators to explore the relationship from this point onwards since, as shown in Fig. 7, the lines converge, with this convergence being indicative of a future change in the relationship dynamics when deaths peak at a later point and when they start their downward course as well.

The above can partly explain the differences in signs among states in both the Pearson and Kendall rank correlation coefficients, but a more in-depth explanation from a statistical perspective is that the Pearson correlation coefficient is estimated as the average of the deviations of observations from the sample mean. The weights of observations in the tails of the distribution are equal to the weight of other observations, and therefore, the outliers could affect the estimation of the results, especially in the case of the small sample. In consideration of ties, this study employs a bootstrap bias-corrected approach, but the main conclusions are based on quantile regressions. Unlike linear measures of dependency, quantile regression is considered superior in a sampling situation and more resistant to outliers than linear regressions, the Pearson correlation, or the Kendall rank correlation64. Taking into account that the current pandemic is a dynamic process that constantly evolves and has a serious social impact, it is very probable that there now exist—or, at a later stage, could develop—several data anomalies (e.g., due to non-pharmaceutical interventions); therefore, formal statistical tools such as the Pearson and Kendall rank correlations should be carefully interpreted.

This study has limitations. First, data from only one search engine are considered. Although Google is the most popular search engine, data on the coronavirus topic from other search engines were not included in this analysis. Second, the data at this point are very limited, and the results are based on few observations. Third, the 50 (+ 1) states exhibit diversity in terms of confirmed cases and deaths. Therefore, any conclusions drawn from this analysis refer to each case individually. Despite the known limitations of online search traffic data, the use of infodemiology metrics for informing public health and policy in general and for monitoring outbreaks and epidemics in particular has received wide attention.

To dynamically find the determinants of COVID-19, the predictability analysis in this study provides insights into how online search traffic data can play a considerable role in forming public health policies, especially in times of epidemics and outbreaks, when real-time data are essential. With the COVID-19 pandemic, the world is in uncharted territory socially and economically. This situation calls for immediate action and open research and data, and the term “multidisciplinary” has never before been more important. To that end, the role of big data in providing “opportunities for performing modeling studies of viral activity and for guiding individual country healthcare policymakers to enhance preparation for the outbreak” has been acknowledged65, and current research on the subject should focus on both exploring the role of other infodemiology variables in the predictability of COVID-19 and combining infodemiology sources with traditional sources to explore the full potential of what online real-time data have to offer for disease surveillance.

## Data availability

The COVID-19 and query datasets analyzed during the current study are available on The COVID Tracking Project website10 and on the “Google Trends” explore page39, respectively.

## References

1. 1.

WHO Timeline—COVID-19. World Health Organization. https://www.who.int/news-room/detail/08-04-2020-who-timeline---covid-19 (2020).

2. 2.

3. 3.

Pneumonia of unknown cause. World Health Organization. https://www.who.int/csr/don/05-january-2020-pneumonia-of-unkown-cause-china/en/ (2020).

4. 4.

Secon, H., Woodward, A & Mosher, D. A comprehensive timeline of the new coronavirus pandemic, from China's first COVID-19 case to the present. Business Insider. https://www.businessinsider.com/coronavirus-pandemic-timeline-history-major-events-2020-3 (2020).

5. 5.

6. 6.

Qin, A. & Wang, V. Wuhan, Center of Coronavirus Outbreak, Is Being Cut Off by Chinese Authorities. New York Times. https://www.nytimes.com/2020/01/22/world/asia/china-coronavirus-travel.html (2020).

7. 7.

Coronavirus disease named COVID-19. BBC News. https://www.bbc.com/news/world-asia-china-51466362 (2020).

8. 8.

COVID coronavirus Outbreak: Italy. Wolrdometer. https://www.worldometers.info/coronavirus/country/italy/ (2020).

9. 9.

COVID coronavirus Outbreak. Worldometer. https://www.worldometers.info/coronavirus/ (2020).

10. 10.

The COVID Tracking Project. The Atlantic. https://covidtracking.com (2020).

11. 11.

Eysenbach, G. Infodemiology and infoveillance: Framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. J. Med. Internet Res. 11(1), e11 (2009).

12. 12.

Eysenbach, G. Infodemiology and infoveillance tracking online health information and cyberbehavior for public health. Am. J. Prev. Med. 40(5 Suppl 2), S154–S158 (2011).

13. 13.

Mavragani, A. Infodemiology and infoveillance: A scoping review. J. Med. Internet Res. 22(4), e16206 (2020).

14. 14.

Bernardo, T. M. et al. Scoping review on search queries and social media for disease surveillance: A chronology of innovation. J. Med. Internet Res. 15(7), e147 (2013).

15. 15.

Eysenbach, G. SARS and population health technology. J. Med. Internet Res. 5(2), e14 (2003).

16. 16.

van Lent, L. G., Sungur, H., Kunneman, F. A., van de Velde, B. & Das, E. Too far to care? Measuring public attention and fear for Ebola using twitter. J. Med. Internet Res. 19(6), e193 (2017).

17. 17.

Farhadloo, M., Winneg, K., Chan, M. S., Hall, J. K. & Albarracin, D. Associations of topics of discussion on twitter with survey measures of attitudes, knowledge, and behaviors related to Zika: Probabilistic Study in the United States. JMIR Public Health Surveill. 4(1), e16 (2018).

18. 18.

Poletto, C., Boëlle, P. & Colizza, V. Risk of MERS importation and onward transmission: A systematic review and analysis of cases reported to WHO. BMC Infect. Dis. 16(1), 448 (2016).

19. 19.

Samaras, L., García-Barriocanal, E. & Sicilia, M. A. Comparing Social media and Google to detect and predict severe epidemics. Sci. Rep. 10, 4747 (2020).

20. 20.

Mavragani, A. & Ochoa, G. The internet and the anti-vaccine movement: Tracking the 2017 EU measles outbreak. Big Data Cog. Comp. 2(1), 1 (2018).

21. 21.

Du, J. et al. Public perception analysis of tweets during the 2015 measles outbreak: Comparative study using convolutional neural network models. J. Med. Internet Res. 20(7), e236 (2018).

22. 22.

Mavragani, A., Ochoa, G. & Tsagarakis, K. P. Assessing the methods, tools, and statistical approaches in google trends research: Systematic review. J. Med. Internet Res. 20(11), e270 (2018).

23. 23.

24. 24.

Husnayain, A., Fuad, A. & Su, E. C. Applications of google search trends for risk communication in infectious disease management: A case study of COVID-19 outbreak in Taiwan. Int. J. Infect Dis. 95, 221–223 (2020).

25. 25.

Li, C. et al. Retrospective analysis of the possibility of predicting the COVID-19 outbreak from Internet searches and social media data, China, 2020. Euro Surveill. 25(10), 2000199 (2020).

26. 26.

Effenberger, M. et al. Association of the COVID-19 pandemic with internet search volumes: A Google Trends(TM) analysis. Int. J. Infect Dis. 95, 192–197 (2020).

27. 27.

Mavragani, A. Tracking COVID-19 in Europe: Infodemiology approach. JMIR Public Health Surveill. 6(2), e18941 (2020).

28. 28.

Walker, A., Hopkins, C. & Surda, P. The use of google trends to investigate the loss of smell related searches during COVID-19 outbreak. Int. Forum Allergy Rhinol. 10(7), 839–847 (2020).

29. 29.

Hong, Y. R., Lawrence, J., Williams, D. Jr. & Mainous, A. Population-level interest and telehealth capacity of US hospitals in response to COVID-19: Cross-sectional analysis of google search and national hospital survey data. JMIR Public Health Surveill. 6(2), e18961 (2020).

30. 30.

Ayyoubzadeh, S. M., Zahedi, H., Ahmadi, M. R. & Kalhori, S. N. Predicting COVID-19 incidence through analysis of google trends data in Iran: Data mining and deep learning pilot study. JMIR Public Health Surveill. 6(2), e18828 (2020).

31. 31.

Rufai, S.R. & Bunce, C. World leaders' usage of Twitter in response to the COVID-19 pandemic: a content analysis. J Public Health (Oxf). fdaa049 (2020).

32. 32.

Kouzy, R. et al. Coronavirus goes viral: Quantifying the COVID-19 misinformation epidemic on twitter. Cureus. 12(3), e7255 (2020).

33. 33.

Abd-Alrazaq, A., Alhuwail, D., Househ, M., Hamdi, M. & Shah, Z. Top concerns of tweeters during the COVID-19 pandemic: A surveillance study. J. Med. Internet Res. 22(40), e19016 (2020).

34. 34.

Dost, B. et al. Attitudes of anesthesiology specialists and residents toward patients infected with the novel coronavirus (COVID-19): A national survey study. Surg. Infect. (Larchmt). 21(4), 350–356 (2020).

35.

Simcock, R. et al. COVID-19: Global radiation oncology’s targeted response for pandemic preparedness. Clin. Transl. Radiat. Oncol. 22, 55–68 (2020).

36.

Kim, B. Effects of social grooming on incivility in COVID-19. Cyberpsychol. Behav. Soc. Netw. 23(8), 519–525 (2020).

37.

Rosenberg, H., Syed, S. & Rezaie, S. The Twitter pandemic: The critical role of Twitter in the dissemination of medical information and misinformation during the COVID-19 pandemic. CJEM. 6, 1–4 (2020).

38.

Chan, A. K. M., Nickson, C. P., Rudolph, J. W., Lee, A. & Joynt, G. M. Social media for rapid knowledge dissemination: Early experience from the COVID-19 pandemic. Anaesthesia (2020).

39.

40.

41.

Mavragani, A. & Ochoa, G. Google trends in infodemiology and infoveillance: Methodology framework. JMIR Public Health Surveill. 5(2), e13439 (2019).

42.

PixelMap. AMCHARTS. https://pixelmap.amcharts.com (2020).

43.

ChartsBin. https://chartsbin.com (2020).

44.

Phillips, P. C. B. & Perron, P. Testing for a unit root in time series regression. Biometrika. 75(2), 335–346 (1988).

45.

Efron, B. & Tibshirani, R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1(1), 54–75 (1986).

46.

Karlsson, A. Bootstrap methods for bias correction and confidence interval estimation for nonlinear quantile regression of longitudinal data. J. Stat. Comput. Sim. 79(10), 1205–1218 (2009).

47.

Guan, W. From the help desk: Bootstrapped standard errors. Stata J. 3(1), 71–80 (2003).

48.

Davidson, R. & MacKinnon, J. G. Bootstrap tests: How many bootstraps? Econ. Rev. 19(1), 55–68 (2000).

49.

Koenker, R. & Bassett, G. Regression quantiles. Econometrica. 46(1), 33–50 (1978).

50.

Koenker, R. & Hallock, K. F. Quantile regression. J. Econ. Perspect. 15(4), 143–156 (2001).

51.

Yu, K., Lu, Z. & Stander, J. Quantile regression: Applications and current research areas. J. R. Stat. Soc. Series D Stat. 52(3), 331–350 (2003).

52.

Nikitina, L., Paidi, R. & Furuoka, F. Using bootstrapped quantile regression analysis for small sample research in applied linguistics: Some methodological considerations. PLoS ONE 14(1), e0210668 (2019).

53.

Chen, F. & Chalhoub-Deville, M. Principles of quantile regression and an application. Lang. Test. 31(1), 63–87 (2014).

54.

Firpo, S., Fortin, N. M. & Lemieux, T. Unconditional quantile regressions. Econometrica. 77(3), 953–973 (2009).

55.

Salibian-Barrera, M. & Zamar, R. H. Bootstrapping robust estimates of regression. Ann. Stat. 30(2), 556–582 (2002).

56.

Chernozhukov, V., Hansen, C. & Jansson, M. Finite sample inference for quantile regression models. J. Econom. 152, 93–103 (2009).

57.

R Core Team, 2017. R: A language and environment for statistical computing, Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/. R version 3.3.3.

58.

Rufai, S. R. & Bunce, C. World leaders’ usage of Twitter in response to the COVID-19 pandemic: A content analysis. J. Public Health. 42(3), 510–516 (2020).

59.

Sarker, A. et al. Self-reported COVID-19 symptoms on Twitter: An analysis and a research resource. J. Am. Med. Inform. Assoc. 27(8), 1310–1315 (2020).

60.

Shorey, S., Ang, E., Yamina, A. & Tam, C. Perceptions of public on the COVID-19 outbreak in Singapore: A qualitative content analysis. J. Public Health (Oxf). fdaa105 (2020).

61.

Wang, P. W. et al. COVID-19-related information sources and the relationship with confidence in people coping with COVID-19: Facebook survey study in Taiwan. J. Med. Internet Res. 22(6), e20021 (2020).

62.

Hou, Z. et al. Cross-country comparison of public awareness, rumours, and behavioural responses to the COVID-19 epidemic: An internet surveillance study. J. Med. Internet Res. 22(8), e21143 (2020).

63.

Eghtesadi, M. & Florea, A. Facebook, Instagram, Reddit and TikTok: A proposal for health authorities to integrate popular social media platforms in contingency planning amid a global pandemic outbreak. Can. J. Public Health. 111, 389–391 (2020).

64.

Gideon, R. A. & Hollister, R. A. A rank correlation coefficient resistant to outliers. J. Am. Stat. Assoc. 82(398), 656–666 (1987).

65.

Ting, D. S. W., Carin, L., Dzau, V. & Wong, T. Y. Digital technology and COVID-19. Nat. Med. 26, 459–461 (2020).

## Author information

### Contributions

A.M. conceived the idea, designed the methodology, performed the data collection, analysis, and interpretation, and wrote the paper; K.G. designed the statistical methodology and performed the statistical analysis and interpretation as well as the computational analysis. Both authors reviewed and approved the manuscript.

### Corresponding author

Correspondence to Amaryllis Mavragani.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

### Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Reprints and Permissions

Mavragani, A., Gkillas, K. COVID-19 predictability in the United States using Google Trends time series. Sci Rep 10, 20693 (2020). https://doi.org/10.1038/s41598-020-77275-9
