The COVID-19 social media infodemic

We address the diffusion of information about COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and provide a differential assessment of the evolution of the discourse on a global scale for each platform and its users. We fit information spreading with epidemic models, characterizing the basic reproduction number $R_0$ for each social media platform. Moreover, we identify information spreading from questionable sources, finding different volumes of misinformation on each platform. However, information from reliable and questionable sources does not present different spreading patterns. Finally, we provide platform-dependent numerical estimates of rumors’ amplification.


Scientific Reports | (2020) 10:16598 | https://doi.org/10.1038/s41598-020-73510-5 | www.nature.com/scientificreports/
dynamics of hate speech and conspiracy theories 28,29, the effect of bots and automated accounts 30, and the threats of misinformation in terms of diffusion and opinion formation 31,32. In this work we provide an in-depth analysis of the social dynamics in a time window where narratives and moods in social media related to COVID-19 have emerged and spread. While most studies on misinformation diffusion focus on a single platform 17,26,33, the dynamics behind information consumption might be particular to the environment in which they unfold. Consequently, in this paper we perform a comparative analysis on five social media platforms (Twitter, Instagram, YouTube, Reddit and Gab) during the COVID-19 outbreak. The dataset includes more than 8 million comments and posts over a time span of 45 days. We analyze user engagement and interest about the COVID-19 topic, providing an assessment of the discourse evolution over time on a global scale for each platform. Furthermore, we model the spread of information with epidemic models, characterizing for each platform its basic reproduction number ($R_0$), i.e. the average number of secondary cases (users that start posting about COVID-19) that an "infectious" individual (an individual already posting on COVID-19) will create. In epidemiology, $R_0 = 1$ is a threshold value. When $R_0 < 1$ the disease will die out in a finite period of time, while the disease will spread for $R_0 > 1$. In social media, $R_0 > 1$ will indicate the possibility of an infodemic.
Finally, consistent with the classification provided by the fact-checking organization Media Bias/Fact Check 34, which classifies news sources based on the truthfulness and bias of the information published, we split news outlets into two groups. These groups are associated with the diffusion of either (mostly) reliable or (mostly) questionable contents, and we characterize the spreading of information regarding COVID-19 relying on this classification. We find that users in mainstream platforms are less susceptible to the diffusion of information from questionable sources and that information deriving from news outlets marked either as reliable or questionable does not present significant differences in the way it spreads.
Our findings suggest that the interaction patterns of each social media platform, combined with the peculiarity of its audience, play a pivotal role in information and misinformation spreading. We conclude the paper by measuring rumor amplification parameters for COVID-19 on each social media platform.

Results
We analyze mainstream platforms such as Twitter, Instagram and YouTube as well as less regulated social media platforms such as Gab and Reddit. Gab is a crowdfunded social media platform whose structure and features are Twitter-inspired. It exerts very little control over the content posted; in the political spectrum, its user base is considered to be far-right. Reddit is an American social news aggregation, web content rating, and discussion website based on collective filtering of information.
We perform a comparative analysis of information spreading dynamics around the same topic in different environments with different interaction settings and audiences. We collect all pieces of content related to COVID-19 from the 1st of January to the 14th of February. Data were collected by filtering contents according to a selected sample of Google Trends' COVID-19 related queries such as: coronavirus, coronavirusoutbreak, imnotavirus, ncov, ncov-19, pandemic, wuhan. The resulting dataset is composed of 1,342,103 posts and 7,465,721 comments produced by 3,734,815 users. For more details regarding the data collection refer to Methods.
Interaction patterns. First, we analyze the interactions (i.e., the engagement) that users have with COVID-19 topics on each platform. The upper panel of Fig. 1 shows users' engagement around the COVID-19 topic. Despite the differences among platforms, we observe that they all display a rather similar distribution of users' activity, characterized by a long tail. This entails that users behave similarly as far as the dynamics of reactions and content consumption are concerned. Indeed, users' interactions with COVID-19 content present attention patterns similar to those of any other topic 35. The highest volume of interactions in terms of posting and commenting can be observed on mainstream platforms such as YouTube and Twitter.
Then, to provide an overview of the debate concerning the disease outbreak, we extract and analyze the topics related to the COVID-19 content by means of Natural Language Processing techniques. We build word embeddings for the text corpus of each platform, i.e. a word vector representation in which words sharing common contexts are in close proximity. Moreover, by running clustering procedures on these vector representations, we separate groups of words and topics that are perceived as more relevant for the COVID-19 debate. For further details refer to Methods. The results (Fig. 1, middle panel) show that topics are quite similar across social media platforms. Debates range from comparisons to other viruses and requests for God's blessing up to racism, while the largest volume of interaction is related to the lock-down of flights.
Finally, to characterize user engagement with COVID-19 on the five platforms, we compute the cumulative number of new posts each day (Fig. 1, lower panel). For all platforms, we find a change of behavior around the 20th of January, that is, the day on which the World Health Organization (WHO) issued its first situation report on COVID-19 36. The largest increase in the number of posts is on the 21st of January for Gab, the 24th of January for Reddit, the 30th of January for Twitter, the 31st of January for YouTube and the 5th of February for Instagram. Thus, social media platforms seem to have specific timings for content consumption; such patterns may depend upon the differences in audience and interaction mechanisms (both social and algorithmic) among platforms.
Information spreading. Efforts to simulate the spreading of information on social media by reproducing real data have mostly applied variants of standard epidemic models 37-40. Accordingly, we use epidemic models to analyze the observed monotonically increasing trend in the way new users interact with information related to COVID-19. Unlike previous works, we focus not only on models that imply specific growth mechanisms, but also on phenomenological models that emphasize the reproducibility of empirical data 41. Most epidemiological models focus on the basic reproduction number $R_0$, representing the expected number of new infectors directly generated by an infected individual for a given time period 42. An epidemic occurs if $R_0 > 1$, i.e., if an exponential growth in the number of infections is expected at least in the initial phase. In our case, we model the growth in the number of people publishing a post on a subject as an infective process, where people can start publishing after being exposed to the topic. While in real epidemics $R_0 > 1$ highlights the possibility of a pandemic, in our approach $R_0 > 1$ indicates the emergence of an infodemic. We model the dynamics both with the phenomenological model of ref. 43 (from now on referred to as the EXP model) and with the standard SIR (Susceptible, Infected, Recovered) compartmental model 44. Further details on the modeling approach can be found in Methods.
As shown in Fig. 2, each platform has its own basic reproduction number $R_0$. As expected, all the values of $R_0$ are supercritical, even considering confidence intervals (Table 1), signaling the possibility of an infodemic. This observation may facilitate the task of predicting information spreading during critical events. Indeed, according to this result, the information spreading patterns of each social media platform can be used to anticipate the social response when implementing crisis management plans.
While $R_0$ is a good proxy for the engagement rate and a good predictor for epidemic-like information spreading, social contagion phenomena might in general be more complex 45-47. For instance, in the case of Instagram, we observe an abrupt jump in the number of new users that cannot be explained with continuous models like the standard epidemic ones; accordingly, the SIR model estimates a value of $R_0 \sim 10^2$ that is far beyond what has been observed in any real-world epidemic.
Questionable vs. reliable information sources. We conclude our analysis by comparing the diffusion of information from questionable and reliable sources on each platform. We tag links as reliable or questionable according to the data reported by the independent fact-checking organization Media Bias/Fact Check (MBFC) 34. In order to clarify the limits of an approach that is based on labelling news outlets rather than single articles, as for instance performed in 33,48, we report the definitions used in this paper for questionable and reliable information sources. In accordance with the criteria established by MBFC, by questionable information source we mean a news outlet systematically showing one or more of the following characteristics: extreme bias, consistent promotion of propaganda/conspiracies, poor or no sourcing to credible information, information not supported by evidence or unverifiable, a complete lack of transparency and/or fake news. By reliable information sources we mean news outlets that do not show any of the aforementioned characteristics. Such outlets can still produce contents potentially displaying a bias towards liberal/conservative opinion, but this does not compromise the overall reliability of the source. Figure 3 shows, for each platform, the plots of the cumulative number of posts and reactions related to reliable sources versus the cumulative number of posts and interactions referring to questionable sources.
By interactions we mean the overall reactions, e.g. likes or other forms of endorsement and comments, that can be performed with respect to a post on a social platform. Surprisingly, all the plots show a strong linear correlation, i.e., the number of posts/reactions relying on questionable and reliable sources grows at the same pace inside the same social media platform. We observe the same phenomenon also for the engagement with reliable and questionable sources. Hence, the growth dynamics of posts/interactions related to questionable news outlets is just a re-scaled version of the growth dynamics of posts/reactions related to reliable news outlets; however, the re-scaling factor $\rho$ (i.e., the fraction of questionable over reliable) is strongly dependent on the platform.
In particular, we observe that in mainstream social media the number of posts produced by questionable sources represents a small fraction of the posts produced by reliable ones; the same happens in Reddit. Among less regulated social media, a peculiar effect is observed in Gab: while the volume of posts from questionable sources is just ∼70% of the volume of posts from reliable ones, the volume of reactions for the former is ∼3 times bigger than the volume for the latter. Such results hint at the possibility that different platforms react differently to information produced by reliable and questionable news outlets.
To further investigate this issue, we define the amplification factor $E$ as the average number of reactions to a post; hence, $E$ quantifies the extent to which a post is amplified on a social media platform. We observe that the amplification factors $E_U$ (for unreliable posts, i.e. posts produced by questionable outlets) and $E_R$ (for reliable posts, i.e. posts produced by reliable outlets) vary from platform to platform, assuming the largest values in YouTube and the lowest in Gab. To measure the permeability of a platform to posts from questionable/reliable news outlets, we then define the coefficient of relative amplification $\alpha = E_U/E_R$. It measures whether a social media platform amplifies questionable ($\alpha > 1$) or reliable ($\alpha < 1$) posts. Results are shown in Table 2. Among mainstream social media, we notice that Twitter is the most neutral ($\alpha \sim 1$, i.e. $E_U \sim E_R$), while YouTube amplifies questionable sources less ($\alpha \sim 4/10$). Among less popular social media, Reddit reduces the impact of questionable sources ($\alpha \sim 1/2$), while Gab strongly amplifies them ($\alpha \sim 4$). Therefore, we conclude that the main drivers of information spreading are related to the specific peculiarities of each platform and depend upon the group dynamics of individuals engaged with the topic.
Figure 3 (caption fragment). [...] and engagements ($\rho_{eng}$). In more popular social media, the number of questionable posts represents a small fraction of the reliable ones; the same happens in Reddit. Among less popular social media, a peculiar effect is observed in Gab: while the volume of questionable posts is just ∼70% of the volume of reliable ones, the volume of engagements for questionable posts is ∼3 times bigger than the volume for reliable ones. Further details concerning the regression coefficients are reported in Methods.
Table 2. The average engagement of a post is the number of reactions expected for a post and is a measure of how much a post is amplified in each social media platform. The average engagements $E_U$ (for unreliable posts) and $E_R$ (for reliable posts) vary from platform to platform, and are the largest in Twitter and the lowest in Gab. The coefficient of relative amplification $\alpha = E_U/E_R$ measures whether a social media platform amplifies unreliable ($\alpha > 1$) or reliable ($\alpha < 1$) posts more. Among more popular social media platforms, we notice that Twitter is the most neutral ($\alpha \sim 1$, i.e. $E_U \sim E_R$), while YouTube amplifies unreliable sources less ($\alpha \sim 4/10$). Among less popular social media platforms, Reddit reduces the impact of unreliable sources ($\alpha \sim 1/2$) while Gab strongly amplifies them ($\alpha \sim 4$).
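As a worked illustration of the coefficient of relative amplification, the following minimal Python sketch computes $E_U$, $E_R$ and $\alpha = E_U/E_R$ from post and reaction counts. The counts are hypothetical (they are not taken from the dataset) and are chosen only to mimic the Gab ratios quoted in the text, i.e. questionable posts at about 70% of reliable ones and questionable reactions about 3 times larger. The function names are illustrative assumptions. Note that, by construction, $\alpha$ equals the ratio of the engagement and post re-scaling factors, so these ratios give $\alpha \approx 3/0.7 \approx 4.3$, in line with $\alpha \sim 4$ reported for Gab.

```python
def amplification(reactions: int, posts: int) -> float:
    """Amplification factor E: average number of reactions per post."""
    if posts <= 0:
        raise ValueError("posts must be positive")
    return reactions / posts


def relative_amplification(react_q: int, posts_q: int,
                           react_r: int, posts_r: int) -> float:
    """alpha = E_U / E_R; alpha > 1 means questionable posts are amplified more."""
    return amplification(react_q, posts_q) / amplification(react_r, posts_r)


# Hypothetical counts (assumption): only the ratios mimic the Gab figures in the text.
posts_r, react_r = 1000, 2000   # reliable outlets
posts_q, react_q = 700, 6000    # questionable outlets: 70% of posts, 3x the reactions

alpha = relative_amplification(react_q, posts_q, react_r, posts_r)
print(f"E_U = {amplification(react_q, posts_q):.2f}")
print(f"E_R = {amplification(react_r, posts_r):.2f}")
print(f"alpha = {alpha:.2f}")
```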

Conclusions
In this work we perform a comparative analysis of users' activity on five different social media platforms during the COVID-19 health emergency. Such a timeframe is a good benchmark for studying content consumption dynamics around critical events at a time when the accuracy of information is threatened. We assess user engagement and interest about the COVID-19 topic and characterize the evolution of the discourse over time. Furthermore, we model the spread of information using epidemic models and provide basic growth parameters for each social media platform. We then analyze the diffusion of questionable information for all channels, finding that Gab is the environment most susceptible to misinformation dissemination. However, information deriving from sources marked either as reliable or questionable does not present significant differences in its spreading patterns. Our analysis suggests that information spreading is driven by the interaction paradigm imposed by the specific social media platform and/or by the specific interaction patterns of groups of users engaged with the topic. We conclude the paper by computing rumor amplification parameters for social media platforms.
We believe that the understanding of social dynamics between content consumption and social media platforms is an important research subject, since it may help to design more efficient epidemic models accounting for social behavior and to design more effective and tailored communication strategies in time of crisis.

Methods
Data collection.
Table 3 reports the data breakdown of the five social media platforms. Different data collection processes have been performed depending on the platform. In all cases we guided the data collection by a set of selected keywords based on Google Trends' COVID-19 related queries such as: coronavirus, pandemic, coronaoutbreak, china, wuhan, nCoV, IamNotAVirus, coronavirus_update, coronavirus_transmission, coronavirusnews, coronavirusoutbreak.
The Reddit dataset was downloaded from the Pushshift.io archive, exploiting the related API. In order to filter contents linked to COVID-19, we used our set of keywords.
In Gab, although no official guides are available, there is an API service that, given a certain keyword, returns a list of users, hashtags and groups related to it. We queried all the keywords we selected based on Google Trends and we downloaded all hashtags linked to them. We then manually browsed the results and selected a set of hashtags based on their meaning. For each hashtag in our list, we downloaded all the posts and comments linked to it.
For YouTube, we collected videos by using the YouTube Data API to search for videos that matched our keywords. Then an in-depth search was done by crawling the network of videos, searching for more related videos as established by the YouTube algorithm. From the gathered set, we kept the videos that matched coronavirus, nCov, corona virus, corona-virus, corvid, covid or SARS-CoV in the title or description. We then collected all the comments received by those videos.
For Twitter, we collected tweets related to the topic coronavirus by using both the search and stream endpoints of the Twitter API. The data derived from the stream API represent only 1% of the total volume of tweets, further filtered by the selected keywords. The data derived from the search API represent a random sample of the tweets containing the selected keywords, up to a maximum rate limit of 18,000 tweets every 10 minutes.
Since no official API is available for Instagram data, we built our own process to collect public contents related to our keywords. We manually took note of posts and comments and populated the Instagram dataset.

Matching ability.
We consider all the posts in our dataset that contain at least one URL linking to a website outside the related social media platform (e.g., tweets pointing outside Twitter). We separate URLs into two main categories obtained using the classification provided by MediaBias/FactCheck (MBFC). MBFC provides a classification determined by ranking bias in four different categories, one of them being Factual/Sourcing. In that category, each news outlet is associated with a label that refers to its reliability as expressed in three labels, namely Conspiracy-Pseudoscience, Pro-Science or Questionable. Notably, the Questionable set also includes a wide range of political bias, from Extreme Left to Extreme Right.
Table 3. Data breakdown of the number of posts, comments and users for all platforms.
Using such a classification, we assign to each of these outlets a binary label that partially stems from the labelling provided by MBFC. We divide the news outlets into Questionable and Reliable. All the outlets already classified as Questionable or belonging to the category Conspiracy-Pseudoscience are labelled as Questionable; the rest are labelled as Reliable. Thus, by questionable information source we mean a news outlet systematically showing one or more of the following characteristics: extreme bias, consistent promotion of propaganda/conspiracies, poor or no sourcing to credible information, information not supported by evidence or unverifiable, a complete lack of transparency and/or fake news. By reliable information sources we mean news outlets that do not show any of the aforementioned characteristics. Such outlets can still produce contents potentially displaying a bias towards liberal/conservative opinion, but this does not compromise the overall reliability of the source. Considering all the 2637 news outlets that we retrieve from the list provided by MBFC, we end up with 800 outlets classified as Questionable and 1837 outlets classified as Reliable. Using such a classification we quantify our overall ability to match and label domains of posts containing URLs, as reported in Table 4. The low matching ability does not refer to the ability to identify known domains but to the ability to find news outlets that belong to the list provided by MBFC. Indeed, on all the social networks we find a tendency towards linking to other social media platforms, as shown in Table 5.
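The labelling step described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the lookup table `MBFC_LABELS` contains made-up placeholder domains standing in for the MBFC list, the function names are hypothetical, and the definition of matching ability as the share of external links whose domain appears in the list is an assumption made for the example.

```python
from urllib.parse import urlparse

# Placeholder excerpt (assumption): the real MBFC-derived list has 2637 outlets.
MBFC_LABELS = {
    "example-reliable.org": "Reliable",
    "example-questionable.com": "Questionable",
}


def domain_of(url: str) -> str:
    """Lower-cased host name of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def label_url(url: str, platform_domain: str):
    """Return 'Reliable', 'Questionable', 'Unmatched', or None for internal links."""
    domain = domain_of(url)
    if not domain or domain == platform_domain or domain.endswith("." + platform_domain):
        return None  # link stays inside the platform, so it is not considered
    return MBFC_LABELS.get(domain, "Unmatched")


def matching_ability(urls, platform_domain: str) -> float:
    """Share of external links whose domain is found in the outlet list."""
    external = [lab for lab in (label_url(u, platform_domain) for u in urls)
                if lab is not None]
    if not external:
        return 0.0
    return sum(lab != "Unmatched" for lab in external) / len(external)


urls = [
    "https://www.example-reliable.org/news/1",
    "http://example-questionable.com/x",
    "https://youtube.com/watch?v=abc",      # another social platform: unmatched
    "https://twitter.com/some/status/1",    # internal link for Twitter: skipped
]
print(matching_ability(urls, "twitter.com"))
```

In this toy example the link to another social media platform lowers the matching ability even though its domain is perfectly identifiable, which is the effect the paragraph above describes.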

Text analysis.
To provide an overview of the debate concerning the virus outbreak on the various platforms, we extract and analyze all topics related to COVID-19 by applying Natural Language Processing techniques to the written content of each social media platform. We first build word embeddings for the text corpus of each platform. Then, to assess the topics around which the perception of the COVID-19 debate is concentrated, we cluster words by running the Partitioning Around Medoids (PAM) algorithm on their vector representations.
Word embeddings, i.e., distributed representations of words learned by neural networks, represent words as vectors in $\mathbb{R}^n$, bringing similar words closer to each other. They perform significantly better than the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) at preserving linear regularities among words and in computational efficiency on large data sets 49. In this paper we use the Skip-gram model 50 to construct the word embeddings of each social media corpus. More formally, given a content represented by the sequence of words $w_1, w_2, \ldots, w_T$, we use stochastic gradient descent, with the gradient computed through the backpropagation rule 51, to maximize the average log probability
$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-k \le j \le k,\ j \ne 0} \log p(w_{t+j} \mid w_t),$$
where $k$ is the size of the training window. Therefore, during training the vector representations of closely related words are pushed to be close to each other.
In the Skip-gram model, every word $w$ is associated with its input and output vectors, $u_w$ and $v_w$, respectively. The probability of correctly predicting the word $w_i$ given the word $w_j$ is defined as
$$p(w_i \mid w_j) = \frac{\exp\left(v_{w_i}^{\top} u_{w_j}\right)}{\sum_{l=1}^{V} \exp\left(v_{l}^{\top} u_{w_j}\right)},$$
where $V$ is the number of words in the corpus vocabulary. Two major parameters affect the training quality: the dimensionality of the word vectors and the size of the surrounding words window. We choose 200 as the vector dimension, which is a typical value for training on large datasets, and 6 words for the window.
Before applying the tool, we reduced the contents to those written in English as detected with cld3. Then we cleaned the corpora by removing HTML code, URLs and email addresses, user mentions, hashtags, stop-words, and all the special characters including digits. Finally, we dropped words composed of fewer than three characters, words occurring fewer than five times in the whole corpus, and contents with fewer than three words.
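The cleaning steps listed above can be sketched with the Python standard library as follows. This is an illustrative approximation under stated assumptions: the stop-word set is a tiny placeholder, the regular expressions are simple heuristics, hashtags and mentions are removed as whole tokens, and the language detection step with cld3 is left out. The function names `clean` and `clean_corpus` are hypothetical.

```python
import re
from collections import Counter

# Tiny placeholder stop-word list (assumption); a real pipeline would use a full one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

TAG_RE = re.compile(r"<[^>]+>")                  # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")    # URLs
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")           # email addresses
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")      # user mentions and hashtags
NON_LETTER_RE = re.compile(r"[^a-z\s]")          # special characters and digits


def clean(text: str) -> list:
    """Tokenize one content after removing markup, links, mentions and symbols."""
    text = text.lower()
    for pattern in (TAG_RE, URL_RE, EMAIL_RE, MENTION_HASHTAG_RE, NON_LETTER_RE):
        text = pattern.sub(" ", text)
    # Drop stop-words and words shorter than three characters.
    return [w for w in text.split() if len(w) >= 3 and w not in STOP_WORDS]


def clean_corpus(texts, min_count: int = 5, min_words: int = 3) -> list:
    """Apply the corpus-level filters: rare words first, then short contents."""
    tokenized = [clean(t) for t in texts]
    counts = Counter(w for doc in tokenized for w in doc)
    filtered = [[w for w in doc if counts[w] >= min_count] for doc in tokenized]
    return [doc for doc in filtered if len(doc) >= min_words]


sample = "<b>Wuhan</b> virus update: see https://t.co/xyz @who #ncov 2020!"
print(clean(sample))
```

The resulting token lists could then be passed to any Skip-gram implementation configured with the vector dimension (200) and window size (6) stated in the text.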
To analyze the topics related to COVID-19, we cluster words with PAM, using as proximity measure the cosine distance matrix of the words in their vector representations. In order to select the number of clusters, $k$, we calculate the average silhouette width for each value of $k$. Moreover, to evaluate cluster stability, we calculate the average pairwise Jaccard similarity between clusters based on 90% sub-samples of the data. Lastly, we produce word clouds to identify the topic of each cluster. To provide a view of the debate around the virus outbreak, we define the distribution over topics $\Theta_c$ for a given content $c$ as the distribution of its words among the word clusters. Thus, to quantify the relevance of each topic within a corpus, we restrict to contents $c$ with $\max \Theta_c > 0.5$ and consider each of them uniquely identified with a single topic. Table 6 shows the results of the text cleaning and topic analysis for all the data.
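The assignment of a content to a single topic can be illustrated with a short sketch. It assumes that the word clustering has already been computed and is available as a word-to-cluster mapping; the mapping, the cluster names and the function names below are hypothetical and serve only to show how the distribution over topics and the 0.5 threshold interact.

```python
from collections import Counter


def topic_distribution(words, word_to_cluster) -> dict:
    """Share of a content's clustered words falling in each word cluster."""
    clusters = [word_to_cluster[w] for w in words if w in word_to_cluster]
    if not clusters:
        return {}
    counts = Counter(clusters)
    return {cluster: n / len(clusters) for cluster, n in counts.items()}


def dominant_topic(words, word_to_cluster, threshold: float = 0.5):
    """Topic of a content if one cluster holds more than `threshold` of its words."""
    theta = topic_distribution(words, word_to_cluster)
    if not theta:
        return None
    topic, share = max(theta.items(), key=lambda item: item[1])
    return topic if share > threshold else None


# Hypothetical clustering output (assumption), e.g. from PAM on cosine distances.
word_to_cluster = {
    "flight": "travel", "airport": "travel", "ban": "travel",
    "flu": "viruses", "sars": "viruses",
}

print(dominant_topic(["flight", "ban", "flu"], word_to_cluster))
print(dominant_topic(["flight", "flu"], word_to_cluster))
```

With a strict threshold of 0.5, a content split evenly between two clusters is not assigned to any topic, which is why only part of the cleaned contents enters the topic counts.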
Epidemiological models. Several mathematical models can be used to analyse the potential mechanisms that underlie epidemiological data. Generally, we can distinguish between phenomenological models, which emphasize the reproducibility of empirical data without insight into the mechanisms of growth, and more insightful mechanistic models that try to incorporate such mechanisms 41.
To fit our cumulative curves, we first use the adjusted exponential model of ref. 43 since it naturally provides an estimate of the basic reproduction number $R_0$. This phenomenological model (from now on indicated as EXP) has been successfully employed in data-scarce settings and shown to be on par with more traditional compartmental models for multiple emerging diseases like Zika, Ebola, and Middle East Respiratory Syndrome 43.
The model is defined by the following single equation:
$$I(t) = \left[\frac{R_0}{(1+d)^{t}}\right]^{t}.$$
Here, $I$ is incidence, $t$ is the number of days, $R_0$ is the basic reproduction number and $d$ is a damping factor accounting for the reduction in transmissibility over time. In our case, we interpret $I$ as the number $C_{auth}$ of authors that have published a post on the subject.
As a mechanistic model, we employ the classical SIR model 44. In such a model, a susceptible population can be infected with a rate $\beta$ by coming into contact with infected individuals; however, infected individuals can recover with a rate $\gamma$. The model is described by a set of differential equations:
$$\partial_t S = -\beta S \cdot I/N, \qquad \partial_t I = \beta S \cdot I/N - \gamma I, \qquad \partial_t R = \gamma I,$$
where $S$ is the number of susceptible, $I$ is the number of infected, $R$ is the number of recovered and $N = S + I + R$ is the total population. In our case, we interpret the number $I + R$ as the number $C_{auth}$ of authors that have published a post on the subject. In the SIR model, the basic reproduction number $R_0 = \beta/\gamma$ corresponds to the ratio between the rate of infection by contact $\beta$ and the rate of recovery $\gamma$. Notice that for the SIR model, vaccination strategies correspond to bringing the system into a situation where $S < N/R_0$; in such a way, the number of infected will decrease.
To estimate the basic reproduction numbers $R_0^{EXP}$ and $R_0^{SIR}$ for the EXP and the SIR model, we use least square estimates of the models' parameters 42. The range of parameters is estimated via bootstrapping 41,52.
Linear regression coefficients. Table 7 reports the regression coefficient $\rho$, the intercept and the $R^2$ values for the linear fit of Fig. 3. High values of $R^2$ confirm the linear relationship between reliable and questionable sources in information diffusion.
Table 6. Results of text cleaning and analysis for all the corpora. Columns: Cleaned contents; Vocabulary size; Topics; Contents with $\max \Theta_c > 0.5$.
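To make the two epidemic models concrete, the sketch below implements them with the Python standard library only. It assumes that the EXP curve takes the form $I(t) = [R_0/(1+d)^t]^t$ and that the SIR dynamics are integrated with a simple forward-Euler scheme. The brute-force grid search is a stand-in for a least-squares estimator, not the fitting procedure used in the paper, and bootstrapped parameter ranges are omitted. All parameter values and the synthetic curve are illustrative assumptions.

```python
def exp_model(t: float, r0: float, d: float) -> float:
    """EXP curve (assumed form): I(t) = [R0 / (1 + d)**t]**t."""
    return (r0 / (1.0 + d) ** t) ** t


def fit_exp(ts, ys, r0_grid, d_grid):
    """Least-squares fit of (R0, d) by exhaustive search over a parameter grid."""
    best = None
    for r0 in r0_grid:
        for d in d_grid:
            sse = sum((exp_model(t, r0, d) - y) ** 2 for t, y in zip(ts, ys))
            if best is None or sse < best[0]:
                best = (sse, r0, d)
    return best[1], best[2]


def simulate_sir(beta, gamma, n, i0, days, dt=0.01):
    """Forward-Euler SIR; returns the daily values of I + R (cumulative authors)."""
    s, i, r = float(n - i0), float(i0), 0.0
    cumulative = [i + r]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infected = beta * s * i / n * dt
            new_recovered = gamma * i * dt
            s -= new_infected
            i += new_infected - new_recovered
            r += new_recovered
        cumulative.append(i + r)
    return cumulative


# Synthetic EXP data (assumption): generated with R0 = 2.0 and d = 0.02.
days = list(range(1, 11))
observed = [exp_model(t, 2.0, 0.02) for t in days]
r0_hat, d_hat = fit_exp(days, observed,
                        r0_grid=[k / 10 for k in range(10, 31)],
                        d_grid=[k / 100 for k in range(0, 11)])
print("EXP fit:", r0_hat, d_hat)

# SIR with R0 = beta / gamma above and below the threshold R0 = 1.
supercritical = simulate_sir(beta=0.5, gamma=0.25, n=10_000, i0=10, days=60)
subcritical = simulate_sir(beta=0.1, gamma=0.2, n=10_000, i0=10, days=60)
print("SIR, R0 = 2.0:", round(supercritical[-1]))
print("SIR, R0 = 0.5:", round(subcritical[-1]))
```

In the supercritical run the cumulative number of authors grows to a large share of the population, whereas in the subcritical run it stays close to the initial seed, which is the threshold behavior at $R_0 = 1$ discussed in the text.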

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.