Introduction

Since its outbreak in early 2020, the novel coronavirus pneumonia COVID-191 has infected an estimated total of 172 million individuals across virtually all the world’s countries, with over 3.5 million related deaths recorded globally2.

Predicting the spread and forecasting the severity of the COVID-19 pandemic has become the focus of many research teams across the globe3,4,5,6. There is a shared agreement that forecasting the spread, growth and severity of the COVID-19 pandemic is a challenging task, especially in our globally interconnected world. In this context, reliable prediction of contagion, growth and fatalities within countries and the regions in each country, before data is available and openly distributed, is essentially impossible. Despite this challenge, it is recognized that this knowledge is extremely useful for establishing targeted confinement areas that contain virus spread more effectively while reducing the economic and social disruptions due to lockdown and social distancing strategies, which in turn would also allow resources to be allocated efficiently across regions.

Although forecasting a country’s regional spread of COVID-19 severity and its associated mortality is critical to implement operational healthcare changes and epidemiological control measures, this was nearly impossible to achieve given the lack of readily available data at the beginning of the emerging SARS-CoV-2 pandemic. To address the challenge of gaining these insights at the onset of a pandemic, when there is a lack of widely accessible regional-level official data, in the present work we analysed openly available data from Twitter activity across different Italian, Spanish and United States regions to estimate crowd perception of the severity of the event. The collective knowledge of a crowd has been used successfully in similar challenging forecasting scenarios across social and data sciences, where it has been established that collective opinions formed by a group of individuals can sometimes be more accurate than individual expert opinions. This phenomenon has been named “the wisdom of crowds”. As opposed to the common practice of using web-search engine queries, which only indicate knowledge-seeking patterns, we focused on assessing the wisdom of crowds as represented on the Twitter social media platform. We therefore aggregated social media reaction, expressed by geolocated tweet intensity of COVID-19 related activity, from Italian, Spanish and United States geopolitical regions at the beginning of the pandemic and we investigated relations with their regional mortality data after one month.

In social sciences it has been established7,8 that the collective aggregate opinion of a large number of non-experts can, in some contexts, outperform individual experts when the variable to be determined is not random and the individuals have some partial information and the ability to process it7. Social psychology studies have described crowd wisdom, opinion dynamics and collective knowledge by studying in which scenarios and in what ways crowds are wise8. Several studies have shown that the wisdom of crowds on social media, where user interactions are more frequent and opinion dynamics are heightened, is able to solve challenging forecasting tasks. Successful examples include improving Wikipedia articles9; predicting the stock prices of publicly traded securities10; the study of collective innovation in modern technological social networks11; and election results forecasting12, among many others, showing that crowd wisdom, opinion pooling and social media opinion dynamics constitute a powerful tool in forecasting tasks, even beyond experts’ abilities. These methods work especially well when groups are large and connected opinion dynamics and communication allow crowds to process information13. Social media debate is the result of a complex process of information filtering where individuals gauge official information against local knowledge and confront their opinions openly. This process can degenerate into conspiracy theories and foster fake news, but it has been observed that on average crowds can process information and weigh reality in a rather accurate way14. While there are some examples of the use of the wisdom of crowds to estimate relevant variables that are otherwise hard to measure, there is no literature reporting the use of this collective knowledge gathered from social media tweets for assessing the spreading and resulting mortality during an emerging pandemic.
No previous study has shown that the collective wisdom of crowds during the initial COVID-19 attention peak on social media, when no mortality data sources were readily available, can predict the regional cumulative mortality a month ahead.

Here we show that such “wisdom of crowds” has been able to predict, ahead of officially available data, COVID-19 infection severity across the countries most affected by the pandemic. Our findings could underpin the creation of real-time novelty detection systems aimed at early reporting of SARS-like mortality and thus early activation of control measures in future pandemics. The wisdom of crowds could also be used to feed explicit infectious disease models with reliable data sourced locally from the exposed population when, at the early stages of a pandemic, there are no other sources of data available. The strength of the predictive association could be used to inform fast-response policy making. As such, it provides a scalable tool to increase preparedness and resilience in similar pandemic scenarios. Furthermore, the capability of the wisdom of crowds to infer the value of variables that are otherwise unmeasurable can be used to refine explicit models.

Materials and methods

COVID-19 and population data sources

We obtained COVID-19 spreading and casualties time series data aggregated by region from the official department of health website or repository for Italy15 and Spain16. For the United States we used the readily available New York Times dataset17 which contains regional level data as well as an interactive package for live monitoring.

For Italy we obtained both population data per region18 and social media usage statistics per region19, both updated as of 2019, from Istituto Nazionale di Statistica (ISTAT). We obtained regional population data for Spain from Instituto Nacional de Estadística (INE) as of 201920 and for the United States from the United States Census Bureau21.

Social media data crawling and processing

We obtained all Twitter data from an early COVID-19 Twitter data repository22 available before Twitter started providing access to its own COVID-19 dataset. Tweets were collected from the 21st of January 2020 using Twitter’s Application Programming Interface (API)23. The Twitter repository is updated as new meaningful COVID-19 related words emerge23. Using tweet unique identifiers (IDs), we retrieved all their corresponding information (text, date, user and user data) from the repository via the “twarc” package (https://github.com/DocNow/twarc). This process is commonly referred to as tweet hydration. To measure the number of unique active users per day in each of the regions of Italy, Spain and the United States, we geolocated tweets with the HERE Geocoder service24 and obtained country and region categorization for over 50% of the tweets. In order to facilitate dataset reading and tabulation we used Google’s OpenRefine API25, which efficiently groups hourly tweet data by day and converts nested dictionary type files in JavaScript Object Notation (JSON) to simple tabular comma separated values.
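The JSON-to-tabular conversion described above can be illustrated with a minimal sketch. This is not the OpenRefine workflow used in the study; it is a standard-library stand-in, and the helper names `flatten` and `to_csv`, the dotted column naming, and the sample record are all hypothetical.

```python
import csv
import io
import json


def flatten(record, prefix=""):
    """Turn a nested dictionary into a flat one with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def to_csv(json_lines):
    """Convert an iterable of JSON strings (one tweet each) to CSV text."""
    rows = [flatten(json.loads(line)) for line in json_lines if line.strip()]
    # Use the union of all keys so records with missing fields still fit.
    columns = sorted({name for row in rows for name in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical hydrated record, reduced to two fields for illustration.
sample = ['{"id_str": "1", "user": {"id_str": "u1"}}']
table = to_csv(sample)
```

Each nested object becomes a group of columns sharing a prefix, which is one simple way to obtain a table that can then be grouped by day and region.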

We defined tweet volume as the number of unique users active in a region per day, discussing COVID-19 related topics. This criterion is applied, rather than merely counting tweets, in order to measure the population’s attention and news spreading in the crowd, while correcting for overrepresented Twitter activity from certain users. To account for different sizes of regions’ populations we calculate tweet intensity by normalizing the tweet volume by region population in the case of Spain and the United States or by social media active population in the case of Italy. Indeed, Italy was the only country for which we were able to obtain specific social media usage data per region.
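As a rough illustration of these two definitions, the sketch below counts unique users per region and day and divides by a regional denominator. The function name `tweet_intensity`, the record layout and all numbers are hypothetical; the denominator would be the region population or, for Italy, the social media active population.

```python
from collections import defaultdict


def tweet_intensity(tweets, population):
    """Return {(region, day): unique active users divided by population}."""
    users = defaultdict(set)
    for region, day, user_id in tweets:
        # A set keeps each user once per region and day, so heavy posters
        # do not inflate the volume.
        users[(region, day)].add(user_id)
    return {key: len(ids) / population[key[0]] for key, ids in users.items()}


# Hypothetical records: (region, day, user id)
tweets = [
    ("Lombardy", "2020-02-21", "u1"),
    ("Lombardy", "2020-02-21", "u1"),  # repeat from the same user
    ("Lombardy", "2020-02-21", "u2"),
    ("Lazio", "2020-02-21", "u3"),
]
population = {"Lombardy": 1000, "Lazio": 500}  # made-up denominators
intensity = tweet_intensity(tweets, population)
```

Here the two tweets from user `u1` contribute a single active user, which is the correction for overrepresented accounts described in the text.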

Identification of tweet intensity peaks

In order to identify the beginning of the social media reaction to the epidemic in each country we computed the z-score on the tweet activity time series on a two-week rolling window. We used Z > 3 as the identifier of the social media reaction peak.
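One possible reading of this rule is sketched below. The text does not specify whether the window includes the current day; the sketch assumes that each day is compared against the mean and standard deviation of the preceding 14 days. The function name `peak_days` and the sample series are hypothetical.

```python
from statistics import mean, stdev


def peak_days(series, window=14, threshold=3.0):
    """Indices whose z-score against the preceding `window` values exceeds
    `threshold` (assumption: the current day is excluded from the window)."""
    peaks = []
    for t in range(window, len(series)):
        past = series[t - window:t]
        sd = stdev(past)
        # Skip flat windows, where the z-score is undefined.
        if sd > 0 and (series[t] - mean(past)) / sd > threshold:
            peaks.append(t)
    return peaks


# Made-up daily activity with one clear spike at index 14.
activity = [10, 12, 11, 9, 10, 13, 11, 10, 12, 9, 11, 10, 12, 11, 40, 12]
detected = peak_days(activity)
```

Excluding the current day from the window keeps a spike from inflating its own baseline; including it would be an equally plausible variant and would give somewhat lower z-scores.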

Regression analyses

Weighted and non-weighted linear regression models were fitted using the Python package “statsmodels.api”26 (Python version 3.7). The Weighted Least Squares (WLS) function was used for weighted linear regression (with intercept) and the Ordinary Least Squares (OLS) function was used for unweighted linear regression (with intercept). The weighted regressions account for variability in the data by weighting the error proportionally to the square root of region population. This adjustment accounts for measurement errors in both the number of deaths and Twitter volume. Regression p values are reported by the “.summary()” method of the results object returned by “.fit()” for the OLS and WLS functions. This derives from the Chi Squared survival function of “scipy.stats”27 (scipy.stats.chi2.sf()) applied to the test statistic, which is assumed to be Chi Square distributed.
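To make the weighting concrete, the following dependency-free sketch fits a weighted line with a single predictor in closed form. It is a stand-in for illustration, not the statsmodels code used in the study, and it does not compute p values. It assumes the weights are the square root of region population, which is one reading of the description above; the function name and all data values are hypothetical.

```python
import math


def weighted_linear_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b, r_squared)."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - ym) ** 2 for wi, yi in zip(w, y))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ym - slope * xm
    return intercept, slope, sxy ** 2 / (sxx * syy)


# Made-up regional values: tweet intensity, log deaths, population.
intensity = [1.0, 2.0, 3.0, 4.0]
log_deaths = [1.1, 1.9, 3.2, 3.8]
population = [100, 400, 900, 1600]

weights = [math.sqrt(p) for p in population]  # assumed weighting scheme
a_w, b_w, r2_w = weighted_linear_fit(intensity, log_deaths, weights)
# Equal weights reduce the same formula to ordinary least squares.
a_o, b_o, r2_o = weighted_linear_fit(intensity, log_deaths, [1.0] * 4)
```

Larger regions pull the fitted line more strongly than smaller ones under this scheme, which is the intended effect of weighting by population.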

Correlation analyses

Spearman’s rank correlation coefficient was calculated using the “spearmanr” function of “scipy.stats”27. In order to assess the strength of this correlation we compare it with null-hypothesis correlation values for random shuffling of the data and obtain quantile confidence intervals. We hence obtain confidence levels for comparison with the random null model.
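The shuffling comparison can be sketched as follows. The study used `scipy.stats.spearmanr`; this standard-library version is only an illustration, and the function names, the number of shuffles, the seed and the quantile are assumptions rather than the settings used in the paper.

```python
import random


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def null_quantile(x, y, quantile=0.95, shuffles=2000, seed=0):
    """Quantile of rho under the null model obtained by shuffling y."""
    rng = random.Random(seed)
    shuffled = list(y)
    rhos = []
    for _ in range(shuffles):
        rng.shuffle(shuffled)
        rhos.append(spearman(x, shuffled))
    rhos.sort()
    return rhos[int(quantile * (shuffles - 1))]
```

An observed rho is then called significant at a given level when it exceeds the corresponding quantile of the shuffled values, for example `spearman(x, y) > null_quantile(x, y, quantile=0.95)`.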

Results

Twitter activity as a proxy for crowd perception of the severity of COVID-19

We have used readily accessible data from Twitter activity across different Italian, Spanish and United States regions to estimate crowd perception of the severity of the event while it is unfolding in its early stages. We then related the intensity of social media reaction with the severity of the infection in the same region in terms of the cumulative number of deaths reported the following month. Our study focuses on Italy and Spain as these countries have been the most affected at the start of the pandemic followed more recently by the United States.

The number of active tweet users posting on COVID-19 per day22,25, geolocated24 and aggregated by the regions of Italy, Spain and the United States is shown in Fig. 1a–c respectively. Country regions are coloured according to each country’s geo-political areas. Figure 1 also shows the growth of positive SARS-CoV-2 cases nationwide as well as the cumulative number of deaths nationwide caused by COVID-1915,16,17. The cumulative mortality per geopolitical region of Italy, Spain and the United States is shown in Fig. 2a–c respectively.

Figure 1

Plots of COVID-19 related Twitter activity superimposed on the number of confirmed COVID-19 cases and the cumulative number of deaths nationwide. Tweets are geolocated and aggregated by geopolitical regions for (a) Italy; (b) Spain and (c) United States. For each country we group and color code regions by geolocation in the following way: (a) Italy: Northern regions (blue), Central regions (red) and Southern regions (orange), (b) Spain: Northeastern regions (blue), Northwestern regions (green), Central regions (red), Southern regions (orange), Autonomous cities (yellow), (c) United States: Northeastern states (blue), Southeastern states (green), Midwestern states (red), Southwestern states (orange), Western states (yellow). The growth of confirmed COVID-19 cases nationwide (cumulative) is represented by grey bars while the cumulative number of COVID-19 deaths nationwide is represented by yellow bars.

Figure 2

Plots of the cumulative number of deaths per geopolitical region for: (a) Italy; (b) Spain; (c) United States.

Using the Z-score method we identified as tweet intensity peaks the period 21st to 24th February 2020 for Italy, 24th to 26th February 2020 for Spain and 3rd to 4th March for the United States (Fig. 1a–c). For the United States we observed a first peak in tweet intensity with Z > 3 around the 25th of February with no apparent endogenous cause such as a change in confirmed cases or deaths (Fig. 1c). It is then followed by a second peak that corresponds to an endogenous spike when the United States death toll starts to rise. This second United States tweet intensity peak (Fig. 1c) was used for our analysis.

Note that Italian regional official data for the pandemic15 was first available on the 24th February 2020, which is after the social media reaction (Figs. 1a, 2a). Moreover, at that time most Italian regions still reported no cases, hindering the possibility of forecasting from official data (Fig. 2a). The crowds therefore reacted to national (or even global) news at a time when no official regional public data was available for the number of infections, by gathering local information and elaborating it through opinion dynamics in social media (Fig. 1a). The accurate regional forecasts were hence a result of the “wisdom of crowds” phenomenon. We observe an initial peak in late January (Fig. 1a), perhaps due to the start of the epidemic in China, but with little differentiation between Italian regions. We then observe a second peak of interest from social media in late February (Fig. 1a); this peak was heterogeneous across regions and it appears to be sparked by the endogenous growth of the infection in Italy being measured and reported. At the time (21st to 24th February 2020) only Italian nationwide epidemic data were available and regional or province breakdowns were only scattered across the news; there were no deaths, many regions still had no tested infection cases, and there was no official regional data release (Figs. 1a, 2a).

In order to show whether tweet intensity is related to the severity of COVID-19 we plotted in log scale the cumulative number of deaths for each Italian (Fig. 3a) and Spanish (Fig. 3b) region on the 7th of April 2020 and for the United States regions (Fig. 3c) on the 14th of April 2020 against the mean tweet intensities at the beginning of the epidemic perception. We used the number of deaths, instead of population confirmed cases, as these are less dependent on the number of samples taken for SARS-CoV-2 testing in the wider population. Using population confirmed cases would have been highly dependent on the country testing strategies, which would require a non-trivial rescaling. Nonetheless, we would like to emphasize that our results also hold when regressing over the number of cases one month forward. Therefore, the tweet intensity is consistently forecasting the severity of spreading, despite no regional infection data being available at the time of Twitter reaction measurement. Figure 3 shows the proportionality between the mean tweet intensity at the beginning of the epidemic’s perception, per region, and the number of deaths approximately one month forward. This figure demonstrates how the reaction on social media can correctly detect and rank the epidemic’s impact on different regions one month ahead, when no official data was available in Italy and the data was insufficient for forecasting in other countries. This association is least noisy for Italy (Fig. 3a) and Spain (Fig. 3b) as these countries were severely affected very early on, before the WHO declared the global pandemic. We note that the regions of Lombardy in Italy, Madrid in Spain and New York in the United States have the largest initial tweet intensity reactions and correspond to the most severely affected regions one month later.
Note that the Italian region of Lazio (Fig. 3a) is an outlier due to politicians and central bodies tweeting from it as well as national geolocation defaulting to the capital. For Spain (Fig. 3b), the region of “Castilla-La Mancha” was merged with Madrid as a large section of their population commutes between the two and they are geographically nested. For the United States, the “District of Columbia” is over-represented with tweets from Washington D.C. (Fig. 3c).

Figure 3

Demonstration that the reaction on social media can correctly detect and rank the epidemic’s impact on different regions one month ahead. The y-axis reports the cumulative number of deaths one month forward and the x-axis reports mean tweet intensity at the initial attention peak per geopolitical region of: (a) Italy; (b) Spain and (c) United States. The diameters of the circles are proportional to the population of the region.

To assess the strength of the observed association shown in Fig. 3 and to verify that the values of the epidemic are not trivially related to the size of the population in each region we compared three regression models: model-1, adjusted tweet intensity versus log death cases; model-2, log population versus log death cases; and model-3, adjusted tweet intensity and log population versus log death cases.

We log-scaled the population to allow for a fair comparison as we notice a sub-linear relation to the number of deaths. Table 1 shows the significance of the coefficients as well as R² for the three weighted and non-weighted regression models for the three countries analyzed in this study. From the data reported in Table 1, we observe that model-1 weighted regression is the most significant, with the lowest p values and the largest overall R² values for the Italy and Spain data. It also has a significant p value and a sizable R² for the United States. This indicates that tweet intensity is the most significant variable for the prediction of the number of deaths for Italy and Spain. For the United States we still observe that tweet intensity is a significant predictor for the number of deaths, but results from unweighted regression with model-2 and model-3 reveal that the population of the regions is a better predictor.

Table 1 Per country coefficient significance and R² values for weighted and unweighted regression models.

We further assessed the strength and significance of the relation between the tweet intensity and the number of deaths by quantifying non-linear monotonic dependency with Spearman’s Rho correlation (Table 2). We observe that Spearman’s Rho correlation for Italy is significant at the 99% level and those for both the United States and Spain are significant at the 95% level (Table 2). This confirms that there is a significant relation between regional early tweet intensities and the number of deaths in the respective regions for all three countries.

Table 2 Spearman’s rank correlation coefficient values for empirical values of each country and corresponding null-model significance levels.

Discussion

The wisdom of crowds has been used successfully to estimate relevant variables that are otherwise hard to measure in several domains including the medical one28. Despite this, there is no literature reporting the use of the wisdom of crowds gathered from social media tweets for assessing the spreading of severity within an emerging pandemic where there is no readily accessible mortality data.

Our study shows statistically significant evidence that COVID-19 related mean tweet intensity per region, at the first endogenous attention spike, is able to forecast the spreading of COVID-19 severity, as measured by number of deaths, one month forward. In the case of Italy, the crowd’s reaction with predictive power was recorded before any official regional contagion data was available. For Spain and the United States, the crowd still reacted when little data was available to make any forecast.

As the pandemic progressed, Italy, rapidly followed by Spain, was the first country affected with extremely high COVID-19 associated mortality. At such an early stage, Italy and Spain were therefore less influenced by discussions about the general global status of the pandemic in their measured social media activity, as less attention was present at the time. This allowed us to analyze the reaction from social media crowds with fewer biases. Moreover, Italy is made up of a good number of regions of comparable size with good social media usage as well as good official data for social media usage.

We show that the intensity of COVID-19 related Twitter activity is able to correctly identify the localities most affected by the pandemic in each of the considered countries. These localities are Lombardy, Madrid and New York, for Italy, Spain and the United States respectively (Fig. 3). These geo-political and administrative localities have a striking social media reaction (Figs. 1, 3). This suggests that the initial reaction of users on social media had efficiently processed data scattered throughout news channels, merged it with local information and performed an accurate risk assessment which is observable in the social media intensity reaction. We highlight in particular for Italy that Emilia Romagna and Veneto did not seem to be less affected than Lombardy at the beginning of the epidemic spread; nonetheless, crowd wisdom seems to have combined different information sources to highlight the perception of a greater danger in Lombardy (Figs. 2a, 3a). In Italian and Spanish regions, the crowds demonstrated a remarkable ability to predict the severity of COVID-19 impact at regional levels before the availability of official data. The regression results demonstrate that the intensity of COVID-19 related tweets is a better predictor than the population size, obtaining goodness of fit R² values that are almost twice as large. For the United States we also demonstrated a significant predictive power of the tweet intensity; however, in this case the regional population size is a better regressor. We might note that the spreading of the pandemic in the United States started later, when an amount of exogenous information was already circulating in social media. Furthermore, the United States has a much wider variety of climate, culture, political guidance, population density, and total population across states, which makes the phenomenon more difficult to detect (Fig. 3c).

The crowd’s reaction to COVID-19 spreading, measured through tweet intensity on a regional basis, is a complex quantity rich in information. People react both to official information and to local knowledge gathered at a personal level. Tweeting is a process involving both sharing and comparing such information, which includes a level of collective processing and assessing of the reliability of the source. It has already been recognized in the literature that such a process can be on average very accurate. It should therefore not be surprising that the crowd interest measured through COVID-19 related tweet intensity can result in an appropriate estimate of the local severity of the epidemic.

The confounding factor of different social media usage throughout regions is a potential limitation. We have adjusted for this in the case of Italy, as we were able to obtain the data, and our analysis was actually made more significant when correcting for this. Further work should seek this additional regional data for other countries as well, in order to more accurately adjust the Twitter intensity. Another potential limitation could be found in the use of linear regressions. These are a simplification of more complex dependency relations. Here we used simple methods, as little data was available, to show the validity of the phenomenon as per Fig. 1. More data and more sophisticated models can be investigated for practical monitoring and forecasting tools in the future. Despite the limitations of our methodology, we expect that wisdom-of-crowds information will be most effective in the early stages of an emerging pandemic, when official testing data is not yet available, and especially in regions where official information is hard to obtain for a variety of operational reasons.

The practical relevance of our results consists in the demonstration that tweet intensity can be used for forecasting. At the beginning of an epidemic it is extremely difficult to have precise information, and therefore modelers and public officials are obliged to use general statistical quantities to produce forecasts and consequently implement decisions. Our work shows that the information locally available to the population permeates through Twitter and social media and can be made available to modelers and policymakers at the early stages of a crisis, when it is most needed.