Early warnings of COVID-19 outbreaks across Europe from social media

We analyze data from Twitter to uncover early-warning signals of COVID-19 outbreaks in Europe in the winter season 2019–2020, before the first public announcements of local sources of infection were made. We show evidence that unexpected levels of concerns about cases of pneumonia were raised across a number of European countries. Whistleblowing came primarily from the geographical regions that eventually turned out to be the key breeding grounds for infections. These findings point to the urgency of setting up an integrated digital surveillance system in which social media can help geo-localize chains of contagion that would otherwise proliferate almost completely undetected.

In our study we have explored the potential of mining social network data from Twitter for uncovering early-warning signals of the COVID-19 outbreak. We show evidence that unexpected levels of concerns about pneumonia had been raised for several weeks before the first cases of infection were officially announced across a number of European countries. We also show that whistleblowing came primarily from the geographical regions that turned out to be precisely the key breeding grounds for infections.

Extracting data from Twitter
On 31 December 2019 the WHO was informed about the first "cases of pneumonia of unknown etiology". 9 We created a unique database including all messages that contained the keyword "pneumonia" in seven languages (Italian, French, Spanish, Polish, English, German and Dutch) and were posted on Twitter over the period from 1 January 2016 to 1 March 2020. We decided to focus on pneumonia since it is the most severe condition induced by COVID-19. 10 Moreover, the flu season in 2020 was milder than in previous years (2016-2019). 11 Collected tweets were associated with the users' details by leveraging the Twitter API. We also collected information on the number of followers, friends, statuses and location of each user. The initial data set included 573,298 unique users and a total of 891,195 unique tweets. From this data set we extracted a sample including tweets concerned with pneumonia and posted in two periods: between 15 December 2018 and 21 January 2019, and between 15 December 2019 and 21 January 2020. Moreover, we selected users whose location was in a European region of Spain, the Netherlands, France, Italy, Poland, the United Kingdom, and Germany. We used several geo-coders to cross-check the geographic coordinates of the users. In this way we could filter out the tweets from non-European English-speaking, Spanish-speaking and French-speaking countries. We used GIS methods to assign Twitter users to NUTS1 European regions (e.g., Lombardy, Ile de France, Comunidad Valenciana).
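The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(tweet_id, posting date, country code)`, the two-letter country codes, and the inclusive treatment of the window boundaries are all assumptions made for the example.

```python
from datetime import date

# The two winter windows named in the text. Treating both end dates as
# inclusive is an assumption of this sketch.
WINDOWS = {
    "2018/2019": (date(2018, 12, 15), date(2019, 1, 21)),
    "2019/2020": (date(2019, 12, 15), date(2020, 1, 21)),
}

# Hypothetical country codes for the seven countries listed in the text.
COUNTRIES = {"ES", "NL", "FR", "IT", "PL", "GB", "DE"}


def season_of(day):
    """Return the label of the window containing `day`, or None."""
    for label, (start, end) in WINDOWS.items():
        if start <= day <= end:
            return label
    return None


def select_sample(tweets):
    """Group tweet ids by season, keeping only the selected countries.

    `tweets` is an iterable of (tweet_id, posting_date, country_code)
    tuples; this layout is illustrative only.
    """
    selected = {label: [] for label in WINDOWS}
    for tweet_id, day, country in tweets:
        label = season_of(day)
        if label is not None and country in COUNTRIES:
            selected[label].append(tweet_id)
    return selected
```

In practice the country would come from the geo-coding step rather than from a ready-made code, so the membership check stands in for that more involved assignment.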
To avoid overestimation of the number of tweets mentioning cases of pneumonia between December 2019 and January 2020, we made the following adjustments: a) we removed all tweets (and the corresponding users) that cited news with a direct URL; b) we considered only users with fewer than 2,000 followers to filter out the effects of press agencies and celebrities, which usually have a large number of followers; c) we removed all remaining tweets that still contained the word "Coronavirus", "China", or "COVID". These steps reduced the number of tweets to a final sample of 4,765 and the number of users to 2,716, thus removing or mitigating the effects of COVID-19-related news that appeared up to 21 January 2020, when COVID-19 became a Class B notifiable disease. 12 After this date, tweets mentioning pneumonia are indeed most likely to be related to the COVID-19 outbreak, even when they do not directly use the word "COVID", and there is no obvious way to disambiguate comments concerned with genuine cases of pneumonia from comments elicited by mass media coverage of the COVID-19 outbreak. Fig. 1 shows the series of cumulative rescaled mentions of the word "pneumonia" in selected European countries. To properly test the change in trend of pneumonia-related tweets, we used a methodology similar to the one applied to measure excess mortality in Europe. 13 Table S1 reports estimates from Kolmogorov-Smirnov tests of differences between cumulative functions in the 2018/2019 and 2019/2020 winter seasons. It is noticeable that the rate of increase in the cumulative mentions of the word "pneumonia" in Italy during winter 2020 (shaded bar B in inset) substantially differs from the rate observed in the previous year (shaded bar A in inset): a sudden burst in concerns raised about anomalous cases of pneumonia took place at the beginning of January 2020, several weeks before the announcement of the first local source of a COVID-19 infection (20 February, Codogno, Italy).
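Adjustments a) to c) can be illustrated with a short sketch. The dictionary keys (`user`, `text`, `followers`), the URL pattern, and the case-insensitive substring matching are assumptions of this example; in particular, step a) is read here as dropping every tweet by any user who posted at least one tweet containing a link, which is one possible interpretation of "and the corresponding users".

```python
import re

# Assumed pattern for a "direct URL"; the text does not specify one.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)

# Words named in step c), matched case-insensitively as substrings.
EXCLUDED_WORDS = ("coronavirus", "china", "covid")

# Threshold from step b): keep users with fewer than 2,000 followers.
MAX_FOLLOWERS = 2000


def filter_tweets(tweets):
    """Apply adjustments a), b) and c) to a list of tweet dictionaries."""
    # a) flag every user who posted at least one tweet with a link
    flagged_users = {t["user"] for t in tweets if URL_PATTERN.search(t["text"])}

    kept = []
    for t in tweets:
        if t["user"] in flagged_users:
            continue  # a) tweet and corresponding user removed
        if t["followers"] >= MAX_FOLLOWERS:
            continue  # b) likely press agency or celebrity
        lowered = t["text"].lower()
        if any(word in lowered for word in EXCLUDED_WORDS):
            continue  # c) explicit reference to the outbreak
        kept.append(t)
    return kept
```

Substring matching is deliberately broad (it also drops, say, "COVID19" without a hyphen); a stricter word-boundary match would retain more tweets at the cost of letting some outbreak-related ones through.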

Warnings of unexpected cases of pneumonia
France and Italy had a similar number of Twitter-reported cases of pneumonia in January 2020. Spain, Poland and the UK witnessed a similar communication pattern, but delayed by two weeks (circle C). The trend is different in Germany and the Netherlands: after a slow increase following the COVID-19 outbreak in January 2020, it was only at the end of February 2020 that the curves became steeper. This is likely to reflect local differences in the perception of the disease, as well as differences in the COVID-19 diffusion patterns and infection rates across European countries. From 20 February 2020 the slopes of the curves for all countries are likely attributable to a widespread increase in public interest in the pandemic threat. The series of cumulative mentions of the word "pneumonia" per country reveals unexpected variations in public interest in pneumonia-related issues already in January 2020. In particular, we find a significant increase in the tweets mentioning the word "pneumonia" in most of the European countries and regions well before the outbreak of COVID-19 was officially reported in the news.
We also uncovered variations in the number of users citing pneumonia across various European regions between the winter season of 2020 and the corresponding seasons in previous years.
We obtained the locations of 13,088 users, and identified the European regions that were characterized by anomalous and unexpected surges in pneumonia-related Twitter mentions during the early undetected phases of the COVID-19 outbreak. Fig. 2 (left) shows the geographic distribution across European regions of unique users discussing pneumonia between 15 December 2019 and 21 January 2020, after filtering out press releases and news accounts. Fig. 2 (right) shows the relative increase in the number of such users between 2020 and the corresponding winter period in 2019. Both maps show interesting patterns with respect to the known outbreak evolution: the majority of users discussing cases of pneumonia came from the same regions, such as Lombardy, Madrid, Ile de France and England, that eventually reported early cases of the COVID-19 contagion (see also Table S2).
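The regional comparison behind the right-hand map can be sketched as below. The text does not give the exact formula, so this example assumes the relative increase is the change in the number of unique users divided by the previous winter's count; the input layout (region name mapped to a set of user ids) is likewise hypothetical.

```python
def relative_increase(users_previous, users_current):
    """Relative change in unique users per region between two winters.

    Both arguments map a region name to a set of user ids. The formula
    (current - previous) / previous is an assumption of this sketch.
    Regions with no users in the previous winter get None, because the
    ratio is undefined there.
    """
    result = {}
    for region, current in users_current.items():
        baseline = len(users_previous.get(region, ()))
        if baseline == 0:
            result[region] = None
        else:
            result[region] = (len(current) - baseline) / baseline
    return result
```

Counting sets of user ids rather than tweets keeps a single prolific account from inflating a region's figure, which matches the text's focus on unique users.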

Towards an integrated digital surveillance system
These findings offer a first accounting of how far behind many European countries were in detecting the virus. While the approach outlined here is not without limitations (e.g., potential biases from confounders), it provides governments and local authorities with contextual geo-localized information for devising effective intervention policies throughout the whole epidemiological cycle, from the investigation and recognition phases of a pandemic up to the deceleration and preparation phases. 14 Monitoring social media can help public authorities to geo-localize chains of contagion that would otherwise proliferate almost completely undetected for several weeks before the first death caused by a virus is announced. Equally, social media can be used to mitigate the risk of a contagion resurgence in phase 2 of a pandemic, when the restriction measures to counter the spread (e.g., social distancing) are progressively lifted. For example, in the current phase when many countries are evaluating digital surveillance and contact-tracing solutions for large-scale adoption, 15 using social media could help public health authorities to produce spatio-temporal density maps of infectious threats and ascertain which constraints can be relaxed and in which areas.
Social media hold promise for enhancing the effectiveness of public health surveillance, especially when combined with other novel data streams, such as Web search queries, 16 participatory surveillance data, 17 aggregated mobile phone data, 18 and geospatial data from social contact tracing solutions. 19 , 20 Over the longer term, any integrated digital surveillance system set to monitor COVID-19 and beyond should be controlled by independent data protection and regulation authorities, and adhere to a clear set of privacy-preserving and data-sharing principles that do not jeopardize civil rights and other fundamental liberties.

Supplementary Material
To check whether the cumulative distribution functions of the number of tweets citing pneumonia across several European countries differ between winter seasons, we performed a two-sample Kolmogorov-Smirnov (K-S) test. We focused on the following seasons: 15 December 2018 to 21 January 2019 (sample 1) and 15 December 2019 to 21 January 2020 (sample 2). When statistically significant, the difference between the cumulative distribution functions of the two samples is likely to indicate an increase in pneumonia-related tweets, signaling a pneumonia diffusion pattern that differs from what would be expected from seasonal trends. In particular, we compared the users that cited pneumonia in selected European countries between 15 December 2019 and 21 January 2020 with the total number of users that cited pneumonia in the same weeks of the previous year.
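The two-sample K-S test described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: each observation is taken to be a tweet's posting day counted from the start of its season, and the p-value uses the standard asymptotic Kolmogorov series, which is only approximate for small samples and for data with many ties such as daily counts.

```python
import math
from bisect import bisect_right


def ks_two_sample(sample1, sample2):
    """Return (D, approximate p-value) for the two-sample K-S test.

    D is the largest absolute gap between the two empirical cumulative
    distribution functions.
    """
    a, b = sorted(sample1), sorted(sample2)
    n, m = len(a), len(b)

    # The largest gap is attained at one of the observed values, so it is
    # enough to evaluate both empirical CDFs there.
    d = max(
        abs(bisect_right(a, x) / n - bisect_right(b, x) / m)
        for x in set(a) | set(b)
    )

    # Asymptotic p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2).
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Synthetic day offsets, for illustration only (not data from the study).
season_2019 = [0, 3, 7, 12, 18, 24, 30, 36]
season_2020 = [15, 18, 19, 20, 21, 22, 23, 25]
statistic, p_value = ks_two_sample(season_2019, season_2020)
```

A library routine such as SciPy's `ks_2samp` would normally be preferred, since it also offers an exact p-value for small samples; the hand-written version is shown only to make the computation explicit.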
Table S1 shows the values of the K-S statistics and confirms the results in Fig. 1. The difference between the cumulative distribution functions of the number of tweets mentioning "pneumonia" in the 2020 and 2019 seasons is statistically significant at the 1% level for Italy, France, Spain and the UK, and at the 5% level for Poland. By contrast, Germany and the Netherlands show no statistically significant difference between the two seasons.