Wisdom of crowds detects COVID-19 severity ahead of officially available data

During the unfolding of a crisis, it is crucial to forecast its severity at an early stage , yet access to reliable data is often challenging early on. The wisdom of crowds has been effective at forecasting in similar scenarios. We investigated whether the initial regional social media reaction to the emerging COVID-19 pandemic in three critically affected countries has significant relations with their observed mortality a month later. We obtained COVID-19 related regionally geolocated tweets from Italian, Spanish, and United States regions. We quantified the predictive power of the wisdom of the crowds using correlations and regressions of geolocated Tweet Intensity (TI) during the initial social media attention peak versus the cumulative number of deaths a month ahead. We found that the intensity of initial COVID-19 related tweet attention at the beginning of the pandemic across Italian, Spanish, and United States regions is significantly related (p < 0.001) to the extent to which these regions had been affected by the pandemic a month later. This association is most striking in Italy as when at its peak of TI in late February 2020 only two of its regions had reported mortality. The collective wisdom of the crowds at early stages of the pandemic, when information on the number of infections was not broadly available, strikingly predicted the extent of mortality reflecting the regional severity of the pandemic almost a month later. Our findings could underpin the creation of real-time novelty detection systems aimed at early reporting of the severity of crises impacting a territory leading to early activation of control measures at a stage when available data is extremely limited.

In social sciences it has been established 7,8 that the collective aggregate opinion of a large number of nonexperts can, in some contexts, outperform individual experts when the variable to be determined is not random and their individuals have some partial information and the ability to process it 7 . Social psychology studies have described crowd wisdom, opinion dynamics and collective knowledge by studying in which scenarios and in what ways crowds are wise 8 . Several studies have shown that social media wisdom of crowds, where user interactions are more frequent and opinion dynamics are heightened, are able to solve challenging forecasting tasks. Successful examples include improving Wikipedia articles 9 ; predicting publicly traded securities stock prices 10 ; the study of collective innovation in modern technological social networks 11 and; election results forecasting 12 among many others showing that crowd wisdom, opinion pooling and social media opinion dynamics constitute a powerful tool in forecasting tasks, even beyond experts' abilities. These methods work well especially when groups are large and connected opinion dynamics and communication allow crowds to process information 13 . Social media debate is the result of a complex process of information filtering where individuals gauge official information with local knowledge and confront their opinions openly. This process can degenerate in conspiracy theories and foster fake news, but it has been observed that on average the crowds can process information and weight reality in a rather accurate way 14 . While there are some examples of the use of the wisdom of crowds to estimate relevant variables that are otherwise hard to measure, there is no literature reporting the use of this collective knowledge gathered from social media tweets for assessing the spreading and resulting mortality during an emerging pandemic. No previous study has shown that the collective wisdom of crowds during the initial attention COVID-19 social media peak, when there were not readily available mortality data sources, can predict the regional cumulative mortality a month ahead.
Here we show that such "wisdom of crowds" has been able to predict, ahead of officially available data, COVID-19 infection severity across countries most affected by the pandemic. Our findings could underpin the creation of real-time novelty detection systems aimed at early reporting of SARS-like mortality and thus early activation of control measures in future pandemics. The wisdom of crowds could also be used to feed infection diseases explicit models with reliable data sourced locally from the exposed population when, at the early stages of a pandemic, there are no other sources of data available. The strength of the predictive association could be used to inform fast-response policy making. As such, it provides a scalable tool to increase preparedness and resilience in similar pandemic scenarios. Furthermore, the wisdom of crowds' capability to infer the value of some variables otherwise unmeasurable can be used to refine explicit models.

Materials and methods
COVID-19 and population data sources. We obtained COVID-19 spreading and casualties time series data aggregated by region from the official department of health website or repository for Italy 15 and Spain 16 . For the United States we used the readily available New York Times dataset 17 which contains regional level data as well as an interactive package for live monitoring.
For Italy we obtained both population data per region 18 and social media usage statistics per region 19 , both updated as of 2019, from Istituto Nazionale di Statistica (ISTAT). We obtained regional population data for Spain from Istituto Nacional de Estadística (INE) as of 2019 20 and for the United States from the United States Census Bureau 21 .
Social media data crawling and processing. We obtained all Twitter data from an early COVID-19 Twitter data repository 22 available before Twitter started providing access to its own COVID-19 dataset. Tweets were collected from the 21 st of January 2020 using Twitter's Application Programming Interface (API) 23 . The Twitter repository is updated as new meaningful COVID-19 related words emerge 23 . Using tweet unique identifiers (IDs), we retrieved all their corresponding information (text, date, user and user data) from the repository via the "twarc" package (https:// github. com/ DocNow/ twarc). This process is commonly referred to as tweet hydration. To measure the number of unique active users per day in each of the regions of Italy, Spain and the United States, we geolocated tweets with the HERE Geocoder service 24 and obtained country and region categorization for over 50% of the tweets. In order to facilitate dataset reading and tabulation we used Google's OpenRefine API 25 that efficiently groups hourly tweet data by day and converts nested dictionary type files in JavaScript Object Notation (JSON) to simple tabular comma separated values.
We defined tweet volume as the number of unique users active in a region per day, discussing COVID-19 related topics. This criterion is applied, rather than merely counting tweets, in order to measure the population's attention and news spreading in the crowd, while correcting for overrepresented Twitter activity from certain users. To account for different sizes of regions' populations we calculate tweet intensity by normalizing the tweet volume by region population in the case of Spain and the United States or by social media active population in the case of Italy. Indeed, Italy was the only country for which we were able to obtain specific social media usage data per region.

Identification of tweet intensity peaks.
In order to identify the beginning of the social media reaction to the epidemic in each country we computed the z-score on the tweet activity time series on a two weeks rolling window. We used Z > 3 as the identifier of the social media reaction peak.

Regression analyses.
Weighted and non-weighted linear regressions models were carried out using the python package "statsmodels.api" 26 (python version 3.7). The Weighted Least Squares (WLS) function was used for weighted linear regression (with intercept) and the Ordinary Least Squares (OLS) function was used for unweighted linear regression (with intercept). The weighted regressions account for variability in the data by weighting the error proportionally to the square root of region population. This adjustment accounts for error www.nature.com/scientificreports/ measurements both in number of deaths and Twitter volume. Regression p values are calculated by the ".summary()" function of the ".fit()" method for the OLS and WLS functions. This derives from Chi Squared survival function of "scipy.stats" 27 (scipy.stats.chi2.sf()) applied to the test statistics, which is assumed to be Chi Square distributed.
Correlation analyses. Spearman's rank correlation coefficient was calculated using the "spearmanr" function of "scipy.stats" 27 . In order to assess the strength of this correlation we compare it with null-hypothesis correlation values for random shuffling of the data and obtain quantile confidence intervals. We hence obtain confidence levels for comparison with the random null model.

Results
Twitter activity as a proxy for crowd perception of the severity of COVID19. We have used readily accessible data from Twitter activity across different Italian, Spanish and United States regions to estimate crowd perception of the severity of the event while it is unfolding in its early stages. We then related the intensity of social media reaction with the severity of the infection in the same region in terms of the cumulative number of deaths reported the following month. Our study focuses on Italy and Spain as these countries have been the most affected at the start of the pandemic followed more recently by the United States. The number of active tweet users posting on COVID-19 per day 22,25 , geolocated 24 and aggregated by the regions of Italy, Spain and the United States is shown in Fig. 1a-c respectively. Country regions are coloured according to each country's geo-political areas. Figure 1 also shows the growth of positive SARS-CoV-2 cases nationwide as well as the cumulative number of deaths nationwide caused by COVID-19 15-17 . The cumulative mortality per geopolitical region of Italy, Spain and the United States is shown in Fig. 2a-c respectively.
Using the Z-score method we identified for tweet intensity peaks the period 21 st to 24 th February 2020 for Italy, 24 th to 26 th February 2020 for Spain and 3 rd to 4 th March for the United States ( Fig. 1a-c). For the United States we observed a first peak in tweet intensity with Z > 3 around the 25 th of February with no apparent endogenous cause such as change in confirmed cases or deaths (Fig. 1c), it is then followed by a second peak that corresponds to an endogenous spike when the United States death toll starts to rise. This second United States tweet intensity peak (Fig. 1c) was used for our analysis.
Note that Italian regional official data for the pandemic 15 was first available on the 24 th February 2020 which is after the social media reaction (Figs. 1a,2a). Moreover, at that time most Italian regions still reported no cases hindering the possibility of forecasting from official data (Fig. 2a). The crowds therefore reacted on national (or even global) news at a time when no official regional public data was available for the number of infections by gathering local information and elaborating it through opinion dynamics in social media (Fig. 1a). The accurate regional forecasts were hence a result of the "wisdom of crowds" phenomenon. We observe an initial peak in late January (Fig. 1a), perhaps due to the start of the epidemic in China, but with little differentiation between Italian regions. We then observe a second peak of interest from social media in late February (Fig. 1a), this peak was heterogeneous across regions and it appears to be sparked by the endogenous growth of the infection in Italy being measured and reported. At the time (21 st to 24 th February 2020) only Italian nationwide epidemic data were available and regional or province breakdowns were only scattered across the news, but there were no deaths and many regions still had no tested infection cases and there was no official regional data release (Figs. 1a,2a).
In order to show whether tweet intensity is related to the severity of COVID19 we plotted in log scale the cumulative number of deaths for each Italian (Fig. 3a) and Spanish (Fig. 3b) regions on the 7 th of April 2020 and for the United States regions (Fig. 3c) on the 14 th of April 2020 against the mean tweet intensities at the beginning of the epidemic perception. We used the number of deaths, instead of population confirmed cases, as these are less dependent on the number of samples taken for SARS-CoV-2 testing in the wider population. Using population confirmed cases would have been highly dependent on the country testing strategies which would require a non-trivial rescaling. Nonetheless, we would like to emphasize that our results also hold when regressing over the number of cases one month forward. Therefore, the tweet intensity is consistently forecasting the severity of spreading, despite no regional infection data being available at the time of Twitter reaction measurement. Figure 3 shows the proportionality between the mean tweet intensity at the beginning of the epidemic's perception, per region, and the number of deaths approximately one month forward. This figure demonstrates how the reaction on social media can correctly detect and rank the epidemic's impact on different regions one month ahead, when no official data was available in Italy and the data was insufficient for forecasting in other countries. This association is least noisy for Italy (Fig. 3a) and Spain (Fig. 3b) as these countries were severely affected very early on before the WHO declared the global pandemic. We note that the regions of Lombardy in Italy, Madrid in Spain and New York in the United States have the largest initial tweets intensity reactions and correspond to the most severely affected regions one month later. Note that the Lazio Italian region (Fig. 3a) is an outlier due to politicians and central bodies tweeting from it as well as national geolocation defaulting to the capital. For Spain (Fig. 3b), the region of "Castilla-La Mancha" was merged with Madrid as a large section of their population commute between the two and they are geographically nested. For the US the "District of Columbia", is over-represented with tweets from Washington D.C. (Fig. 3c).
To assess the strength of the observed association shown in Fig. 3 and to verify that the values of the epidemic are not trivially related to the size of the population in each region we compared three regression models: model-1, adjusted tweet intensity versus log death cases; model-2, log population versus log death cases and; model-3, adjusted tweet intensity and log population versus log death cases.
We log-scaled the population to allow for a fair comparison as we notice a sub-linear relation to the number of deaths. Table 1 shows coefficients' significance as well as R 2 for the three weighted and non-weighted regression models for the three countries analyzed in this study. From the data reported in Table 1  www.nature.com/scientificreports/ is the most significant with the lowest p values and the largest overall R 2 values for the Italy and Spain data. It has also a significant p value and sizable R 2 for the United States. This indicates that tweet intensity is the most significant variable for the prediction of the number of deaths for Italy and Spain. For the United States we still observe that tweet intensity is a significant predictor for the number of deaths but results from unweighted regression with model-2 and model-3 reveal that the population of the regions is a better predictor.   (Table 2). We observe that the significance level for Spearman's Rho correlation for Italy is at 99% and both the United States and Spain are significant to 95% significance level ( Table 2). This confirms that there is a significant relation between regional early tweet intensities and the number of deaths in the respective regions for all three countries.

Discussion
The wisdom of crowds has been used successfully to estimate relevant variables that are otherwise hard to measure in several domains including the medical one 28 . Despite this, there is no literature reporting the use of the wisdom of crowds gathered from social media tweets for assessing the spreading of severity within an emerging pandemic where there is no readily accessible mortality data.
Our study shows statistically significant evidence that COVID-19 related mean tweet intensity per region, at the first endogenous attention spike, is able to significantly forecast the spreading of COVID-19 severity, as measured by number of deaths, one month forward. In the case of Italy, the crowd's reaction with predictive power was recorded before any official regional contagion data was available. For Spain and the United States, the crowd still reacted when little data was available to make any forecast.
As the pandemic progressed, Italy, then rapidly followed by Spain, were the first countries affected with extremely high COVID-19 associated mortality. At such an early stage, Italy and Spain were therefore less influenced by discussions about the general global status of the pandemic in their measured social media activity, as less attention was present at the time. This allowed us to analyze the reaction from social media crowds with less biases. Moreover, Italy is made up of a good number of regions of comparable size with good social media usage as well as good official data for social media usage.
We show that the intensity of COVID-19 related Twitter activity is able to correctly identify the localities most affected by the pandemic in each of the considered countries. These localities are Lombardy, Madrid and New York, for Italy, Spain and the United States respectively (Fig. 3). These geo-political and administrative localities have a striking social media reaction (Figs. 1,3). This suggests that the initial reaction of users on social media had efficiently processed data scattered throughout news channels, merged it with local information and performed an accurate risk assessment which is observable in the social media intensity reaction. We highlight in particular for Italy that Emilia Romagna and Veneto did not seem to be less affected than Lombardy at the beginning of the epidemic spread, however, nonetheless, crowd wisdom seems to have combined different information sources to highlight the perception of a greater danger in Lombardy (Figs. 2a,3a). In Italian and Spanish regions, the crowds demonstrated a remarkable ability to predict the severity of COVID-19 impact at regional levels before www.nature.com/scientificreports/ the availability of official data. The regression results demonstrate that the intensity of COVID-19 related tweeds is a better predictor than the population size obtaining goodness of fit R 2 values that are almost twice. For the United States we also demonstrated a significant predictability power of the tweet intensity, however in this case the regional population size is a better regressor. We might note that spreading of the pandemic in the United States started later when an amount of exogenous information was already circulating in the social media, furthermore the United States have a much wider variety of climate, culture, political guidance, population density, and total population throughout states that leads to a more difficult detection of the phenomenon (Fig. 3c). The crowd's reaction to COVID-19 spreading measured through tweet intensity on a regional basis is a complex quantity rich of information. People react to both official information and to local knowledge gathered at personal level. Tweets are a process involving both sharing and comparing such information which includes a level of collective processing and assessing of the reliability of the source. It has been already recognized in the literature that such a process can be on average very accurate. It should not be therefore surprising that the crowd interest measured through COVID-19 related tweet intensity can result in an appropriate estimate of the local severity of the epidemics.
The confounding factor of different social media usage throughout regions is a potential limitation. We have adjusted for this in the case of Italy, as we were able to obtain the data, and our analysis was actually made more significant when correcting for this. Further work should seek this additional regional data for other countries as well, in order to more accurately adjust the Twitter intensity. Another potential limitation could be found in the use of linear regressions. These are a simplification of more complex dependency relations. Here we used simple methods, as little data was available, to show the validity of the phenomenon as per Fig. 1. More data and sophisticated models can be investigated for practical monitoring and forecasting tools in the future. Despite the limitations of our methodology, we expect that wisdom of crowd information will be most effective in early stages of an emerging pandemic, when official testing data is not yet available and especially in regions where official information is hard to obtain for a variety of operational reasons.
The practical relevance of our results consists in the demonstration that tweet intensity can be used for forecasting. At the beginning of an epidemic when it is extremely difficult to have precise information and therefore modelers and public officials are obliged to use general statistical quantities to produce forecasts and consequently implement decisions. Our work shows that the information locally available to the population permeates through Twitter and social media and can be made available to modelers' and policymakers at the early stages of a crisis when it is most needed.

Data availability
During manuscript assessment and review the data can be accessed at: https:// figsh are. com/s/ b344e 72688 e047d 2d65d. Data openly available upon publication in Turiel, Jeremy; Fernandez-Reyes, Delmiro; Aste, Tomaso