The new Coronavirus (SARS-CoV-2) causing Coronavirus disease 2019 (COVID-19) was first detected in Wuhan, China, in December 2019 and rapidly spread worldwide1, reaching pandemic status by March 11, 20202. Europe was the second epicenter of the pandemic after Wuhan and continues to struggle with controlling the spread of, and fatalities due to, COVID-19. As of December 2020, the World Health Organization (WHO) reported more than 70 million confirmed cases and close to 1.6 million deaths globally; Europe reported close to 22 million confirmed cases and more than 450 thousand fatalities3.

National responses and the ability to monitor and control the pandemic varied significantly, especially during the first months4. With the implementation of social distancing regulations, millions turned to the internet to find answers to their questions and worries about the pandemic; between March and May 2020, coronavirus-related searches became the most popular search terms on Google5.

Google Trends (GT) is a web-based, innovative tool made available by Google to analyze the content, frequency, and popularity of search queries in Google search across various regions and languages. Analysis can be carried out within a given timeframe, focused on a particular public event such as the onset of an epidemic6.

A growing body of research demonstrates that online queries monitored via GT correlate with specific behavioral outcomes. Previous studies, using various methods, analyzed the relationship between GT search interest and suicide rates, the transmissibility of infectious diseases, and the spread trajectory of emerging pathogens such as SARS, Ebola, and 2009 influenza7,8,9,10. Several studies of public search interest on the web found that monitoring GT search interest for symptoms is valuable in identifying new cases in the COVID-19 pandemic11,12,13.

There is also some disagreement in the field about the validity of using Google Trends as a tool for digital epidemiology14,15,16. GT data can be influenced by many factors: historical events, public interest, or media coverage. However, when we study the relationship of this fragile data with actual data such as daily confirmed COVID-19 cases, the resulting correlation can be more reliable. Thus, monitoring this relationship can be a viable tool for understanding the movement of the pandemic.

Lippi et al. investigated the capacity of Google search volume of symptoms such as fever, cough, and dyspnea to predict the trajectory of the early 2020 COVID-19 outbreak in Italy using Spearman's correlation method. They concluded that GT's continuous monitoring is a valuable instrument in the early detection of COVID-19 outbreaks12. Most studies used conventional correlation methods to determine the relationship between symptom search and cases12,17,18,19. Other studies employed moving average (MA) methods to smooth daily fluctuations of symptoms and later new case emergence, and they selected three to seven days as their moving average20,21.

Some authors also preferred shifting the symptom search results so that GT searches align with new cases21,22,23. One common denominator in all these studies was the use of non-dynamic statistical procedures. Another approach is to use wave analysis to detect the co-movement between symptoms and cases24. However, this approach cannot show how the correlation changes over time.

Asseo et al. relied on sliding windows correlations, a straightforward time-varying approach, to assess the relationship between taste and smell loss searches on GT and emerging case numbers. The sliding windows correlation method allows for monitoring correlations for each time period separately but still uses Pearson correlations25. Asseo et al.'s approach therefore carries the limitation of conventional correlation, which lacks the ability to work with time-varying co-movement. The DCC model, on the other hand, considers both time-varying correlations and time-varying variances, and it is more powerful than conventional correlation methods, including sliding windows with Pearson or Spearman correlation analysis26.

The DCC model, developed initially for financial time series, has been used by several researchers in finance and neuroscience. In finance, several studies used the method to investigate Google search interest and financial market behaviors27,28,29. In neuroscience, Lindquist et al. used the DCC model to study the time-varying correlation among several brain signals in functional magnetic resonance imaging (fMRI). The authors concluded that the DCC model better captured time-varying correlations as it minimizes random noise in the estimations26. We believe the DCC model can also be used in health sciences to capture the time-varying relationship between symptom search and new case emergence.

We aim to present DCC as a model that better fits the time-lagged nature of our data set and compare its viability against the sliding window correlation method to study the relationship between searches of fever, cough, and dyspnea on GT and new cases in Turkey, Italy, Spain, France, and the UK.

Methods

Data

Google search interest trends are calculated by dividing the number of queries of interest by the total number of queries for all search terms over the same time and region. Each query share is normalized on a scale of 0 to 100, with 100 representing the share's maximum value for the period and region selected. The scaled query share values are plotted daily, generating a time series. Search terms included pulmonary symptoms (fever, cough, and dyspnea) previously reported to be associated with COVID-19 infection (Lippi et al.15). Searches for these terms covered Turkey, Italy, Spain, France, and the United Kingdom (UK), the European countries most affected by the COVID-19 pandemic. We focused on these countries because they differ in geographic location, culture, and health system.
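The normalization described above can be sketched in a few lines. This is only an illustration of the arithmetic, not Google's implementation: the function name `scale_query_share` and the sample counts are hypothetical, and the sampling and rounding that Google applies to real data are not modeled.

```python
def scale_query_share(query_counts, total_counts):
    """Scale a daily query share to 0-100, with 100 at the period maximum."""
    if len(query_counts) != len(total_counts):
        raise ValueError("series must have the same length")
    # Share of all searches on each day; days with no searches get a share of 0.
    shares = [q / t if t > 0 else 0.0
              for q, t in zip(query_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0.0 for _ in shares]
    # Divide by the largest share so that the busiest day maps to 100.
    return [100.0 * (s / peak) for s in shares]


# Hypothetical counts for four days (not real Google data).
fever_queries = [10, 40, 30, 5]
all_queries = [1000, 2000, 1000, 1000]
print(scale_query_share(fever_queries, all_queries))
```

Because every value is relative to the maximum within the chosen period and region, two downloads covering different periods are not directly comparable without rescaling.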

Furthermore, the first wave of the pandemic presented at different times across these countries. At the same time, similar precautionary measures such as the shutdown of all schools and universities, closure of museums, cultural centers, cinemas, theatres, pubs, and the suspension of international flights were undertaken in all countries at approximately the same time30,31.

Google searches for pulmonary symptoms were obtained with R x64 4.0.2 (R: A Language and Environment for Statistical Computing) using the "gtrendsR" package for the dates between January 01 and August 31, 2020. Search terms were determined in Turkish and were later translated to the relevant languages (Italian, French, Spanish, and English) via Google Translate, and then checked for accuracy by native speakers. We used "fever," "cough" and "dyspnea" or "shortness of breath" as search terms for pulmonary symptoms ("ateş", "öksürük", "nefes darlığı" for Turkish, "febbre", "tosse" and "dyspnée" for Italian, "fièvre", "toux" and "essoufflemen" for French, "fiebre", "tosse" and "dyspnea" for Spanish). Each term was searched, selecting "all categories" for each particular country. The search was conducted on September 1, 2020. We obtained the data of new cases for each country from the WHO COVID-19 database3.

Statistical analysis

An initial check of the raw data revealed very high fluctuations and time lags between symptoms and new cases (see Fig. 1). Previous studies used 3- to 7-day moving averages20,21,32 to transform the data. To deal with the high fluctuations, we analyzed our data using moving averages ranging from 3 to 7 days and observed that five days was most appropriate to smooth the data. Next, we shifted symptom search results forward to capture the time lag between symptom searches and new case reports. Each symptom in each country needed its own shift, so we used the root mean squared error (RMSE) to determine the best-fitting shift: each symptom series was shifted forward until the minimum RMSE was observed. We then applied the sliding windows correlation proposed by Asseo et al., who selected a time frame of 31 days and rolled the window forward one day at a time25. We deployed the same method and calculated sliding window correlations for the raw data, the moving average data, and the shifted data.
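The smoothing and shifting steps described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the trailing (not centered) window, the min-max rescaling of both series to 0-100 before computing the RMSE, the `max_shift` limit, and all function names are assumptions made here, and the two series are assumed to have equal length.

```python
import math


def moving_average(series, window=5):
    """Trailing moving average; the first window-1 days are dropped."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def min_max_scale(series):
    """Rescale to 0-100 so searches and case counts are comparable (assumption)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [100.0 * (x - lo) / (hi - lo) for x in series]


def rmse(a, b):
    """Root mean squared error between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def best_shift(searches, cases, max_shift=30):
    """Return (shift, rmse) for the forward shift of searches with minimum RMSE."""
    best = None
    for shift in range(max_shift + 1):
        n = len(searches) - shift
        if n < 2:
            break
        # Searches on day t are compared with cases on day t + shift.
        err = rmse(searches[:n], cases[shift:shift + n])
        if best is None or err < best[1]:
            best = (shift, err)
    return best


# Synthetic example: the case curve repeats the search curve 12 days later.
searches = [math.exp(-((t - 20) / 5.0) ** 2) for t in range(80)]
cases = [math.exp(-((t - 32) / 5.0) ** 2) for t in range(80)]
shift, err = best_shift(min_max_scale(moving_average(searches)),
                        min_max_scale(moving_average(cases)))
print(shift, err)
```

Running the search separately for every symptom and country yields one shift per series, which is the kind of result the text describes.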

Figure 1

Symptoms and cases in raw data.

We then fitted the DCC model to estimate the dynamic correlation between Google search interest for the three identified pulmonary symptoms and new cases. The DCC method, originally proposed by Engle et al.33, has been adapted to different multivariate cases by Tse et al.34. The method was developed to model conditional volatility in financial portfolios and has since been used in other fields26. The DCC model has proved more powerful than the Pearson, Spearman's rank, or constant conditional correlation models because it captures dynamic correlations between two time series (for a detailed description of the method, refer to Tse et al.). We analyzed the data using OxMetrics 8 software and Microsoft Excel (Microsoft, Redmond, WA, United States), and a p-value < 0.05 was considered statistically significant.
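To give a sense of how a dynamic conditional correlation is built up day by day, the sketch below implements the bivariate DCC(1,1) correlation recursion in the form commonly attributed to Engle. It is not the OxMetrics estimation used in this study: the parameters `a` and `b` are fixed illustrative values rather than maximum-likelihood estimates, the residuals are standardized with the sample mean and standard deviation instead of a fitted GARCH model, and all names are hypothetical.

```python
import math


def standardize(series):
    """Zero-mean, unit-variance series (a stand-in for GARCH-standardized residuals)."""
    n = len(series)
    mean = sum(series) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if sd == 0:
        raise ValueError("series is constant")
    return [(x - mean) / sd for x in series]


def dcc_correlation(e1, e2, a=0.05, b=0.90):
    """Bivariate DCC(1,1) recursion on standardized residuals e1 and e2.

    Q_t   = (1 - a - b) * Qbar + a * e_{t-1} e_{t-1}' + b * Q_{t-1}
    rho_t = q12_t / sqrt(q11_t * q22_t)
    """
    if a < 0 or b < 0 or a + b >= 1:
        raise ValueError("need a >= 0, b >= 0 and a + b < 1")
    n = len(e1)
    # Qbar: unconditional second moments of the standardized residuals.
    qbar11 = sum(x * x for x in e1) / n
    qbar22 = sum(y * y for y in e2) / n
    qbar12 = sum(x * y for x, y in zip(e1, e2)) / n
    q11, q22, q12 = qbar11, qbar22, qbar12  # start the recursion at Qbar
    rho = []
    for t in range(n):
        rho.append(q12 / math.sqrt(q11 * q22))
        # Update Q for the next day using today's residuals.
        q11 = (1 - a - b) * qbar11 + a * e1[t] * e1[t] + b * q11
        q22 = (1 - a - b) * qbar22 + a * e2[t] * e2[t] + b * q22
        q12 = (1 - a - b) * qbar12 + a * e1[t] * e2[t] + b * q12
    return rho


# Synthetic example: two series that move together, then in opposite directions.
x = [math.sin(t / 5.0) for t in range(120)]
y = [v if t < 60 else -v for t, v in enumerate(x)]
rho = dcc_correlation(standardize(x), standardize(y))
print(round(rho[30], 2), round(rho[110], 2))
```

In a full DCC fit, `a` and `b` are estimated from the data together with the univariate volatility models; a larger `b` makes the correlation path smoother, while a larger `a` makes it react faster to recent co-movement.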

Results

We initially checked the normality of the data using the Shapiro–Wilk test and observed that not all the series were normally distributed. Therefore, we used Spearman correlations instead of Pearson correlations for the rest of our analyses. The Spearman correlations for the raw data and for the five-day moving-averaged and shifted data are presented in Table 1. In the raw data, correlation coefficients were weak (below 0.50) and/or non-significant for most symptom searches and cases. We first transformed the observations to a five-day moving average, then shifted each symptom series separately by the number of days that minimized the RMSE. After this transformation, Spearman correlation coefficients between symptoms and new cases increased to moderate levels and became significant at p < 0.01 (see Table 1 for symptom-specific p-values). Figure 2 shows the RMSE values for fever, cough, and dyspnea in the five countries. The arrows mark the optimum shift, where the RMSE is at its minimum. Table 2 lists the optimum shift for each symptom in each country. The optimum time lag ranged from 8 to 24 days. These findings show that search terms on GT may need to be shifted separately to better fit the nature of the phenomenon at hand.
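For readers who want to reproduce this kind of analysis, the sketch below shows a Spearman rank correlation and its 31-day sliding window variant from the Statistical analysis section. It is a minimal standard-library illustration with hypothetical function names; significance testing and the Shapiro–Wilk check are not included.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def sliding_spearman(x, y, window=31):
    """One correlation per window, rolling forward one day at a time."""
    return [spearman(x[i:i + window], y[i:i + window])
            for i in range(len(x) - window + 1)]


# Synthetic example: a monotone but nonlinear relationship gives rho = 1.
days = list(range(40))
print(spearman(days, [d ** 2 for d in days]))
print(len(sliding_spearman(days, [d ** 2 for d in days])))
```

Each window is treated independently, so the resulting series can jump when a single extreme day enters or leaves the window, which is one reason a model-based approach such as DCC may give a smoother picture.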

Table 1 Spearman correlations between symptoms and new cases.
Figure 2

Root mean squared errors of symptoms' shifts.

Table 2 Symptom search shifts with minimum root mean squared errors (RMSE) (days).

Before estimating the time-varying correlation, we tested dynamic conditional correlation against constant correlation using two diagnostics: the E-S and the LM tests. These tests use Chi-square fit values to check whether dynamic conditional correlation should be used rather than constant correlation. Table 3 shows that the constant correlation hypothesis should be rejected for all of the series at p < 0.01. We therefore suggest using a time-varying correlation to monitor the co-movement of symptom searches and new case emergence.

Table 3 Diagnostic checks of the DCC model.

Table 4 reports the DCC coefficients for the relationship between pulmonary symptom searches and new case emergence in Turkey, Italy, Spain, France, and the UK. We found significant, moderate to high DCC correlations. The degree of correlation differed across symptoms and countries. The findings demonstrate that the null hypothesis of constant correlation should be rejected (p < 0.01).

Table 4 DCC between symptoms and cases.

Comparing the DCC and sliding window correlation results with raw and MA-shifted data for fever, cough, and dyspnea, we found two things. First, the DCC model proved a better fit than the sliding windows correlation models during the first wave of the pandemic. Second, the high fit periods for DCC coefficients (r ≥ 0.90) were different in each country.

For fever, the high fit period is April 10–May 14 for Turkey, March 31–June 5 for Italy, April 2–June 4 for Spain, April 14–May 7 for France, and April 18–June 21 for the UK (see Fig. 3 for details).

Figure 3

Correlations between fever symptom search and new cases.

For cough, the high fit period (r ≥ 0.90) is April 10–May 14 for Turkey, March 31–June 5 for Italy, April 2–June 5 for Spain, April 14–May 7 for France, and April 18–June 18 for the UK. The fit for cough symptom searches is highest in the UK (see Fig. 4 for details).

Figure 4

Correlations between cough symptom search and new cases.

For dyspnea, the DCC coefficient fluctuates after the pandemic's first wave. The high fit period (r ≥ 0.90) is April 10–June 4 for Turkey, March 31–June 5 for Italy, April 2–June 5 for Spain, April 14–May 7 for France, and April 18–June 16 for the UK (see Fig. 5 for details).

Figure 5

Correlations between dyspnea symptom search and new cases.
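High fit periods such as those reported above can be read off a daily correlation series as contiguous runs of days at or above a threshold. The sketch below is one way to do this; the function name, the choice to treat the threshold as inclusive, and the sample values are assumptions made for illustration and are not taken from the study data.

```python
from datetime import date, timedelta


def high_fit_periods(start, correlations, threshold=0.90):
    """Return (first_day, last_day) pairs for runs where correlation >= threshold."""
    periods = []
    run_start = None
    for offset, r in enumerate(correlations):
        day = start + timedelta(days=offset)
        if r >= threshold:
            if run_start is None:
                run_start = day  # a new run begins today
        elif run_start is not None:
            # The run ended yesterday.
            periods.append((run_start, day - timedelta(days=1)))
            run_start = None
    if run_start is not None:
        # The series ended while still inside a run.
        last_day = start + timedelta(days=len(correlations) - 1)
        periods.append((run_start, last_day))
    return periods


# Hypothetical correlation values starting on April 1, 2020.
sample = [0.50, 0.91, 0.95, 0.80, 0.92]
for first, last in high_fit_periods(date(2020, 4, 1), sample):
    print(first.isoformat(), last.isoformat())
```

If several runs are found, the longest one would be a natural candidate for the main high fit period of a given symptom and country.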

Discussion

Using a DCC model, this study shows that Google search interest for three pulmonary symptoms (fever, cough, and dyspnea) is correlated with new COVID-19 cases. We also demonstrated that the DCC model performs better than the sliding windows correlation for data from the first wave of the COVID-19 pandemic. Our findings suggest that monitoring Google search interest using GT would provide valuable information for designing prevention and intervention programs.

Previously, Asseo et al. examined the relationship between Google searches for smell and taste loss, two symptoms of COVID-19, and new COVID-19 cases. They employed a sliding window correlation with one-month time frames between March 4 and August 25, 2020. The authors could not find stable correlations between taste and smell loss searches and new cases for Italy and the US during the pandemic's first wave. Instead, they observed that this link fluctuates over time and concluded that the correlation between searches for novel symptoms of an infectious disease and the number of new cases fluctuates and decreases over time25. We found similar fluctuations, although we observed much less fluctuation in our data set during the pandemic's first wave. Fluctuations did increase, and correlations decreased, as the western hemisphere moved into the summer period.

Lippi et al. investigated the relationship between the volume of Google searches for the most frequent symptoms of SARS-CoV-2 infection (fever, cough, and dyspnea) and new cases using the Spearman correlation test. They did not find a significant correlation between cough or fever searches and new cases, but did detect a significant correlation for dyspnea. The correlation between newly diagnosed COVID-19 cases and the "cough" and "fever" search terms became statistically significant only with a 3-week delay. That study used a standard three-week time period and failed to see correlations for several symptoms15. In contrast, we used symptom- and country-specific time periods ranging from 8 to 24 days. This dynamic approach helped reveal the correlation in a more fine-tuned manner.

Most studies used conventional correlation methods and could not observe the time-varying co-movement between GT symptom searches and new case emergence17,20. It is important to detect correlations in different periods, such as the wave periods of a pandemic. A time-varying correlation approach allows this co-movement to be monitored in different periods and provides multiple correlation indexes. The DCC model is one such approach, and it performs better than other time-varying correlation approaches such as the sliding windows correlation35. Thus, we used DCC to find time-varying correlations between pulmonary symptoms and new COVID-19 cases.

Previous studies selected a fixed number of shift days for each of the countries14,17,19, but the research presented here shows that a heuristic approach (RMSE) results in different shift days for each symptom in each country. Our results illustrate that Italy, France, and Spain have a shorter time delay, ranging from 8 to 21 days, while the UK and Turkey have a delay of 18 to 24 days from symptom search to new case report. These differences may be due to variations in people's web search reactions to pulmonary symptoms, or in how PCR tests are procured and processed in each country. Attention paid to higher-risk symptoms such as dyspnea may indicate the need for hospitalization, which is of distinct importance.

The DCC analysis shows that fever, cough, and dyspnea symptom searches correlate well with new cases during the first wave of the pandemic. However, by May 2020, many fluctuations in the correlations begin to appear. We suggest that these fluctuations can arise for various reasons independent of the modelling used: symptoms may have become too well known, testing of non-symptomatic cases may have become more common, or the definition used for new case reporting may have changed. New symptoms of interest related to COVID-19, such as loss of taste or backache, may also have emerged. Constant monitoring of public interest in GT is very important for formulating relevant search models. Yet, we suggest that the modeling approach will remain relevant.

One limitation of this study is the terms selected to carry out the searches. Local/colloquial use of terms was not included, which may have affected the results by limiting the scope. The second limitation is in the selection of the data sources. Though "Google" is the most used web-based search engine, the use of different search tools like "Yahoo," "Msn," or "Yandex" may have led to more accurate assessments of public interest. New case reports were based on the WHO database, and the case reporting protocols may have changed during the pandemic. The third limitation is about lack of causal inference; it would be unwise to interpret our findings to indicate a direct causal relationship between search interest and COVID-19 cases. Mass media coverage of the pandemic may have shifted the GT results towards an increase in COVID-related searches. This increased general interest in the media about COVID-19 during the studied time period may have created some level of spurious correlation10.

The 2020 COVID-19 pandemic is the largest global public health challenge of this century. Our findings reveal that pulmonary symptom queries are crucial early signs for emerging epidemics. For example, dyspnea search interest may signal potential hospitalization and the need for intensive care. Policymakers are advised to pay attention to and utilize these search interests to plan preventive and/or intervention strategies. Monitoring search terms may also help understand the populace's lay beliefs and worries, revealing the need for further guidance. Our results may be of particular importance as we approach the vaccination period with an already existing anti-vaccination movement in place.