Social status and novelty drove the spread of online information during the early stages of COVID-19

Access to online information has been crucial throughout the COVID-19 pandemic. We analyzed more than eight million randomly selected Twitter posts from the first wave of the pandemic to study the role of the author’s social status (Health Expert or Influencer) and the informational novelty of the tweet in the diffusion of several key types of information. Our results show that health-related information and political discourse propagated faster than personal narratives, economy-related, or travel-related news. Content novelty further accelerated the spread of these discussion themes. People trusted health experts on health-related knowledge, especially when it was novel, while influencers were more effective at propagating political discourse. Finally, we observed a U-shaped relationship between the informational novelty and the number of retweets. Tweets with average novelty spread the least. Tweets with high novelty propagated the most, primarily when they discussed political, health, or personal information, perhaps owing to the urgency of mobilizing this information. On the other hand, economic and travel-related information spread most when it was less novel, and people resisted sharing such information before it was duly verified.

S1 Data

S1.1 Tweets
As our task was to explore the spread of online information during the COVID-19 pandemic, we collected Twitter data containing virus-related information. Twitter was selected as the main data source for this study because, in contrast to other popular social media platforms, it provides its data in a timely manner through APIs. The brand monitoring site Brand24 (https://brand24.com) was used, which collects data from social media, news, blogs, videos, forums, reviews and more. Brand24 continuously searches the web for new and existing mentions of the specified keywords. The first collection started on March 11 2020, although the collected data date back to January 23 2020. The keywords used for the collection of data are: 'n cov2019', [...]. Brand24 provides the tweet ids where the mentions match the specified keywords. The tweet ids were separated from the rest of the mentions, and the Twitter API was used to collect the tweet information. Approximately 55% of the tweet ids can be retrieved using the Twitter API (12.17 million out of 19.22 million). The collection was run every 2 days over 8-day windows, so consecutive collections overlapped in order to (1) update the impressions (likes, retweets, etc.) and (2) collect as many tweets as possible before they might be removed. Out of the total of 9.5 million virus-related tweets, retweets and quotes collected, only the 8.25 million English language tweets were used for the analysis.

S1.2 Twitter users
Twitter users were then separated from the tweets in order to (1) avoid having redundant information in the tweets dataset and (2) have the most recent information available on user characteristics such as the number of followers and the total number of retweets. The location of each user is inferred using the method proposed by Compton et al. (2014) [4], which infers a user's location by examining their connections' locations. The algorithm was capable of inferring location for more than 100 million users at a median error of 6.38 km and geotagging around 80% of public tweets. Using this method we were able to geotag 3.24 million tweets or 51.42% of our corpus. Furthermore, the gender, the age group, and whether the account belongs to an organization are inferred using the m3inference model. M3 is a deep learning system for demographic inference that was trained on a massive Twitter dataset [10]. M3 provides a full model, which uses vision (profile picture) and text (name and bio) inputs, and a text-only model to make the predictions. The full model was used for the majority of the Twitter users. The text-only model was used for just 3% of the users, whose profile picture was not compatible with the model. The tweets were posted by 1.08 million users, of which 66.9% were male, 72.3% were personal accounts, and 5.5% were verified. The m3inference models helped us infer the age group of the users; 36.2% were under 19, 13.3% were between 19 and 29, 33% were between 29 and 39, and 17.4% were over 39 years old.

Table S1: User level summary statistics

Figure S1: World map with size of conversation by city. The size of the bubble is proportional to the number of tweets in the area. Maps were generated in the software Tableau.

S2 Social Dimensions
One of our main goals is to analyze the social dimensions in Twitter conversations; we therefore need to identify the main topics of the discussions in the tweets. To identify these dimensions, three researchers first manually browsed through a subset of the tweet corpus to form the broader themes. Five high-level social dimensions were identified: Politics, Health information, Travel information, Economy, and Personal narratives.

Figure S2: Screenshot of the survey website that was used to annotate the tweets. Annotators were given clear instructions to select one or more categories or, if the tweet did not belong to any of the five categories, to select none and continue to the next task.

Table S2: Proportion of tweets in each social dimension

We study health and travel information since access to and dissemination of both these types of information were crucial, especially during the pandemic's early stages. There were several competing theories, and hence uncertainty regarding the virality of COVID-19 [5,8]. The travel regulations imposed by several countries further caused anxiety related to travel over and above the underlying concerns regarding the spread of the virus [3]. The variety of information available on Twitter, coupled with the presence of many health and travel experts (including government officials) on the platform, made it a preferred destination to seek and disseminate health and travel-related information. Third, we study online political discourse during COVID-19 since there were several contentious and polarizing aspects of the pandemic early on, such as lockdowns, wearing masks, and social distancing. For example, most Republicans were against such regulations in the USA, whereas the Democrats favored them. On a similar theme, we next studied another macro-level impact of the pandemic: its economic impact. The economic upshot of COVID-19 cannot be overstated, as many businesses closed while others downsized. There was a global spike in unemployment, and financial hardship was pervasive. Finally, we also measure the micro-level impact of the pandemic by studying the personal narratives shared by people online. The quarantine and stay-at-home orders confined millions of people to their homes. Without a doubt, that took a psychological toll on people worldwide, and social media was their ultimate refuge.
Due to the large volume of tweets, manual classification into the five categories was not feasible. Consequently, a subset of tweets was manually labeled with the five social dimensions and used to train machine learning models that would classify the remaining data. As part of this step, we created a survey website for annotators to classify the tweets (see Figure S2). The website fetches one random tweet at a time from the corpus, and the participant has to classify the tweet into none, one, or more than one of the five pre-specified categories. To ensure the validity of the results and to filter out potential bots answering the survey, a small fraction of the tweets were used to test the participant. The labels of these test tweets were pre-specified by the researchers, and if the participant failed to classify the test tweets correctly, the responses were not recorded. In the first stage, the survey was given to undergraduate and postgraduate students at the University of Cyprus and the University of Edinburgh. The annotators were given brief descriptions of the five categories and were allowed five warm-up tweets. In total, nine annotators were employed, and the inter-annotator agreement was measured using Fleiss' kappa. The nine annotators were split into three teams, and each team classified a subset of 100 tweets. Fleiss' kappa (κ) was 0.85, 0.81 and 0.83 for teams one, two and three respectively, whereas the pairwise agreement between two annotators ranged from 84% to 91%.
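To make the agreement measure concrete, the following is a minimal sketch of Fleiss' kappa. It assumes the ratings for one category have already been arranged as a count matrix (one row per tweet, one column per label, each cell holding the number of annotators who chose that label); the function name and the input layout are illustrative and not taken from the study.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a matrix where counts[i][j] is the number of
    raters who assigned item i to category j (illustrative layout)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

For a team of three annotators and a binary "belongs to the category or not" decision, each row would be a pair such as `[3, 0]` or `[2, 1]`; how the multi-label annotations were actually arranged for the reported κ values is not stated in the text.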

S2.1 Text Classification
The next step involved building machine learning classifiers based on the annotated data that would help us assign social dimensions to the entire tweet corpus. However, annotated data alone are not sufficient for building a machine learning classifier, since raw text cannot be used directly as an input. Textual data is a form of unstructured data that does not come in a predefined form or structure. Therefore, several steps are needed to make the data suitable for the analysis. First, text data should be cleaned to remove the noise. Because users come from different backgrounds and have different habits, Twitter data contain a lot of noise in the form of emojis, emoticons, slang, URLs, and punctuation. Uninformative tokens such as stop-words should be removed, and other information such as emojis and symbols should be annotated for later use. The Python libraries ekphrasis and nltk were used to perform the data cleaning. Examples of the textual data before and after the cleaning are presented in Table S4.
The final step of data pre-processing is generating the root form of inflected words. Stemming and lemmatization are two methods of dealing with this task. Porter's stemming algorithm consists of five rule-based phases that reduce a word to its stem. Lemmatization, on the other hand, performs a full morphological analysis to accurately identify the lemma of each word. Lemmatization was selected for this analysis, as the full morphological breakdown of a word provided more accurate results.
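The cleaning steps described above can be sketched with the standard library alone. This is only an illustration of the idea (annotate URLs, mentions and hashtags, drop punctuation and stop-words); it does not reproduce the ekphrasis and nltk pipeline used in the study, it omits lemmatization, and the placeholder tokens and the tiny stop-word list are assumptions made for the example.

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_WORD_RE = re.compile(r"[^a-z0-9<>\s]")

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "that", "has"}


def clean_tweet(text: str) -> List[str]:
    """Return the cleaned tokens of a tweet."""
    text = URL_RE.sub(" <url> ", text)              # annotate URLs
    text = MENTION_RE.sub(" <user> ", text)         # annotate user mentions
    text = HASHTAG_RE.sub(r" <hashtag> \1 ", text)  # keep the hashtag word
    text = text.lower()
    text = NON_WORD_RE.sub(" ", text)               # drop punctuation and symbols
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

In a fuller pipeline, the returned tokens would then be passed to a lemmatizer before feature extraction.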

S2.1.1 Feature Engineering
              Annotator 7   Annotator 8
Annotator 8   91%
Annotator 9   87%           86%

Table S4: Text pre-processing

Next, the textual data was transformed into numerical feature vectors that would serve as model inputs. There are different approaches to transforming raw text into feature vectors. Five different text transformation methods were used, resulting in six different datasets. Due to the complexity of the task, the different feature engineering techniques were used together to obtain better results. The methods are summarized below:
• Count vectors: The simplest approach to representing text with numerical vectors. A count vector representation is a matrix in which every row represents a tweet from the set, every column represents a term from the corpus, and every cell holds the frequency count of a particular term in a particular tweet.
• TF-IDF Vectors: The TF-IDF score expresses the relative importance of a term in the tweet and in the entire corpus. The TF-IDF score is composed of two terms: the first is the normalized Term Frequency (TF), and the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears:
\[ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D), \qquad \mathrm{idf}(t, D) = \log \frac{|D|}{|\{ d' \in D : t \in d' \}|} \]
where term t occurs in document d which belongs to corpus D. Bigram-level TF-IDF vectors were used, as they were found to have better predictive ability than the word- and character-level vectors.
• Word Embeddings: Word embeddings are learned representations of words in a predefined vector space. Similar words tend to have similar embeddings, which makes the approach suitable for vectorizing text for machine learning use. Two embedding algorithms were used: word2vec and doc2vec. Both models were trained on the entire corpus of tweets collected.
  - Doc2vec: Doc2vec is an extension of word2vec which vectorizes documents instead of words. It was developed by Le and Mikolov (2014) [6].
• Topic modeling: An unsupervised technique that is used to group words into topics from a large corpus of text. The Latent Dirichlet Allocation (LDA) algorithm was used for topic modeling. LDA is a generative probabilistic model that can be viewed as factorizing the document-term matrix of a corpus into document-topic and topic-term distributions [2].
• Natural Language Processing (NLP) features: Text-based features such as word count, character count, word density, punctuation count, upper case count, and title count. Although NLP features do not perform well on their own in text classification tasks, they help improve other classification models.
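As an illustration of the first two representations in the list above, the sketch below builds sparse bigram count vectors and turns them into TF-IDF weights. It assumes that the term frequency is normalized by the number of bigrams in the tweet and that the IDF is the plain logarithm described in the text; the exact weighting scheme and library used in the study are not stated, so the function names and these choices are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Bigram = Tuple[str, str]


def bigrams(tokens: List[str]) -> List[Bigram]:
    """Consecutive token pairs of a cleaned tweet."""
    return list(zip(tokens, tokens[1:]))


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[Bigram, float]]:
    """Sparse bigram TF-IDF vectors, one dictionary per document."""
    # Count vectors: frequency of each bigram in each document.
    term_counts = [Counter(bigrams(doc)) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each bigram appear?
    doc_freq: Counter = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    vectors = []
    for counts in term_counts:
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

A bigram that appears in every document receives a weight of zero under this definition, which is why library implementations often add smoothing terms to the IDF.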

S2.1.2 Model Building
One vs. rest machine learning models were preferred to multi-label classification due to more accurate results. Therefore, predicting each category was a separate classification task. Six data-sets were prepared, one for each of the feature engineering methods described above. For the prediction of each category, six classifiers were trained, one for each of the feature sets. Training different algorithms on each feature set, and aggregating the results, serves to accomplish more accurate predictions using a meta classifier. Four machine learning algorithms were used, a Support Vector Classifier and three variations of Logistic Regression; with L1 regularization, with L2 regularization and, with Stochastic Gradient Descent (SGD) training.
A binary support vector classifier aims to find a hyperplane in an N-dimensional space, where N is the number of features, that separates the two classes with the maximum margin between them. Two parameters of the support vector classifier were tuned, C and class weight. The C parameter adds a penalty for each misclassified observation. When C is small, the penalty is low, so the decision boundary learned in training has a large margin. Class weight penalizes the two classes differently in order to put more emphasis on one class or to address an unbalanced dataset, which was the case in this task.
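The roles of C and the class weights can be read off the standard soft-margin formulation, given here as a reference sketch. The symbols are assumptions introduced for illustration: labels y_i ∈ {−1, +1}, slack variables ξ_i that measure margin violations, and a per-class weight c_{y_i} that rescales the penalty of each observation.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} c_{y_i}\,\xi_i
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
```

The margin width is 2/‖w‖, so a small C lets the optimizer favor a wide margin at the cost of more violations, while a larger weight on the minority class makes its violations more expensive.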
Also known as Logit, logistic regression estimates the probabilities of the possible responses, which are modeled using a logistic function. Due to the high dimensionality of the engineered datasets, regularization was used to penalize the coefficients. Lasso regression employs L1 regularization, which uses the parameter λ to penalize the coefficients and forces the less important coefficients to zero. Ridge regression uses a similar regularization, L2, which penalizes the coefficients based on λ without shrinking them exactly to zero. Stochastic Gradient Descent is an optimization technique that is used to train the models. The λ parameter and class weight were tuned for the logistic regression models.
For each feature set, the four algorithms were trained and the best one was selected. Model selection was based on the F1 score under 10-fold cross-validation. The best model was then hyper-parameter tuned and stored. By the end of this procedure, six classifiers had been trained, one for each of the engineered datasets.

Tweet: COVID-19: Lloyd's to stress test the market and corporation resilience. @LloydsBank has announced measures that aim to provide reassurance to the market. https://t.co/ro8dXrrcbi #Lloyd's #Coronavirus #Insurance #Reinsurance
Social Dimension: Economy

Tweet: The Trump Organization is seeking U.K. and Irish bailout money to help cover wages for bartenders, bagpipers and other employees furloughed from its European golf properties because of the coronavirus lockdown. https://t.co/5eAABuQKMl

Because the amount of annotated data was very small in comparison to the entire tweet corpus, active learning was used to accelerate the optimization of the classifiers. Active learning is a process that prioritizes the data to be labeled so that the new labels have a higher impact on the training of the model. First, the machine learning model is trained on a small subset of the data. Then, the unseen observations are assigned a priority score, with higher priority given to observations whose category the classifier is less certain about. Applications of active learning use least confidence, margin sampling, or entropy, among others, to measure priority. For this study, however, the presence of six classifiers for each task was considered suitable for prioritizing observations.
The built classifiers are applied to the unseen data and the observations where three of the classifiers predicted that a tweet belongs to a specific category and the remaining three predicted that it doesn't, are separated for human classification. A subset of 100 tweets from each social dimension was then given to human annotators to manually classify them. Then, the models were trained again and the process was repeated until no significant improvement was found in the model performances. The process was repeated 4 times and the final human-annotated dataset consisted of 4036 tweets. The classifiers assigned at least one social dimension to 75% of the tweets in our corpus.
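The selection rule described above, in which a tweet is sent to human annotators when the six classifiers split three against three, can be sketched as follows. The function name, the input layout, and the choice to take the first `batch_size` tied tweets are assumptions made for illustration; the text does not say how the 100 tweets per dimension were drawn from the tied set.

```python
from typing import List, Sequence


def select_for_annotation(
    votes: Sequence[Sequence[int]], batch_size: int = 100
) -> List[int]:
    """Indices of tweets on which the classifier committee is evenly split.

    votes[i] holds the 0/1 predictions of the classifiers for tweet i
    for one social dimension.
    """
    tied = [i for i, v in enumerate(votes) if 2 * sum(v) == len(v)]
    return tied[:batch_size]
```

After the selected tweets are labeled, they would be added to the training set and the six classifiers retrained, repeating until the cross-validated scores stop improving.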
The performance metrics for each of the five categories are presented in the following

S3 Quantifying Novelty
The next step in our analysis was to measure the novelty of information contained in a tweet.
In order to measure the tweet-level novelty of information, we compared each tweet with the information included in preceding tweets. The topic distribution of a tweet i is compared to the topic distribution of all the tweets posted in different time windows preceding the tweet. Cosine similarity was used to measure the closeness of the two vectors. To obtain the topic vectors we used the Doc2Vec model, which was trained on 8.25 million COVID-19 related tweets. Doc2Vec was preferred to other numeric representations of text documents, as documents with similar content but not necessarily the same words tend to have similar embeddings, which helps to capture information uniqueness. The cosine similarity between each tweet's numerical representation and those of all the tweets from the previous 1, 3, and 7 days was computed. The average cosine similarity for each tweet was then normalized between 0 and 1, and novelty was calculated as one minus the normalized average, so that higher values of novelty correspond to higher uniqueness of information. This approach is also described in the literature as information uniqueness [1,9]. Additionally, we discretize novelty into fixed-size intervals in order to study the effect of each level. To construct the categorical variable, we break novelty into 5 equal-sized bins based on the value of novelty.
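\[ \mathrm{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \]
where A, B are the Doc2Vec topic vectors of two tweets.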
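The novelty computation can be sketched end to end as below. The sketch assumes that the document vectors are already available, that timestamps are expressed in days, and that the five bins are equal-width intervals on [0, 1]. The quadratic loop is for clarity only and would not scale to millions of tweets, and all function names are illustrative rather than the authors' implementation.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def novelty_scores(
    vectors: Sequence[Sequence[float]],
    timestamps: Sequence[float],
    window_days: float,
) -> List[Optional[float]]:
    """Novelty = 1 - min-max normalized mean similarity to earlier tweets.

    Tweets with no predecessor inside the window receive None.
    """
    avg_sims: List[Optional[float]] = []
    for v, t in zip(vectors, timestamps):
        prior = [u for u, s in zip(vectors, timestamps) if t - window_days <= s < t]
        if prior:
            avg_sims.append(sum(cosine_similarity(v, u) for u in prior) / len(prior))
        else:
            avg_sims.append(None)

    known = [s for s in avg_sims if s is not None]
    if not known:
        return avg_sims
    lo, hi = min(known), max(known)
    span = (hi - lo) or 1.0  # avoid division by zero when all values coincide
    return [None if s is None else 1.0 - (s - lo) / span for s in avg_sims]


def novelty_bin(novelty: float, n_bins: int = 5) -> int:
    """Index (0 to n_bins - 1) of the equal-width bin containing the value."""
    return min(int(novelty * n_bins), n_bins - 1)
```

Calling `novelty_scores` with `window_days` set to 1, 3 and 7 would give the three window variants mentioned in the text.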

S4 Categorizing Twitter users
We categorize Twitter users into Influencers, Health experts or other users in an attempt to capture the effect of such users on the outcomes of the analysis. Influencers are defined as users with at least 5000 followers and whose account is verified. By selecting only verified accounts we aim to avoid bots and malicious accounts. Zafar et al. (2015) described an approach for identifying topic experts using Twitter Lists [11]. For the scope of this study, however, only health experts were required. Health experts include doctors, nurses, health-specific journalists, medical researchers and health ministers. In order to identify health experts, human annotation was carried out before employing machine learning. Three annotators annotated a random subset of 500 Twitter users based on their Twitter account bio. Then, the users' data were cleaned in the same way as in S2.1, and their count vectors were obtained to represent the text. Different machine learning classifiers were built to predict whether a user is a health expert based on their bio. Table S8 shows the performance of the models used. The SGD Classifier was the best performing model, with 85% accuracy and 96.45% recall, implying that about 96 out of 100 actual health experts were identified as such. On the final dataset used, we found 43,206 (4%) Influencers, who produced 14% of the tweets, and 83,475 (7.7%) health experts, who produced 8% of the tweets.

(b) Histogram of 3 day Novelty.
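Because recall and precision answer different questions about a health expert classifier, a small sketch of both may help when reading the reported metrics. The function below is illustrative only and uses made-up labels in the usage note; it is not the evaluation code of the study.

```python
from typing import Sequence, Tuple


def precision_recall(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Precision and recall for the positive class (1 = health expert)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: share of predicted experts who really are experts.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of real experts that the classifier finds.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With `y_true = [1, 1, 1, 1, 0, 0]` and `y_pred = [1, 1, 1, 0, 1, 1]`, three of the four true experts are found (recall 0.75), while three of the five predicted experts are correct (precision 0.6).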

S5 Regression Analysis
As our task was to study the spread of information on the social media platform, we selected the number of retweets and whether the tweet was retweeted or not as the dependent variables. We modelled these variables in a Poisson regression framework in order to estimate the effects of individual-level and society-level dimensions on the dissemination of information related to the pandemic. Poisson regression is a generalized linear model which assumes the Poisson distribution as the probability distribution of the response variable, which makes it suitable for modelling count responses. A non-robust covariance type and a log link function were used for all the regressions.
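For reference, a Poisson regression with a log link can be written as follows. The notation is an assumption introduced here for illustration: y_i is the response for tweet i, x_{ij} are its p predictors, and β_j the corresponding coefficients.

```latex
y_i \mid x_i \sim \mathrm{Poisson}(\mu_i), \qquad
\Pr(Y_i = y) = \frac{\mu_i^{\,y}\, e^{-\mu_i}}{y!}, \qquad
\log \mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}

\frac{\mathbb{E}[\,y_i \mid x_{ij} + 1\,]}{\mathbb{E}[\,y_i \mid x_{ij}\,]} = e^{\beta_j}
```

Under this form, an exponentiated coefficient is the multiplicative change in the expected number of retweets for a one-unit increase in the predictor, holding the other predictors fixed.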

S5.1 Model specifications
We estimated four different model specifications for each response variable, utilizing various collected and calculated features. Influencer and Health Expert constitute the author status features. In our final dataset, 14% of tweets were posted by Influencers, while 8% of tweets were posted by health experts. We refer to users that fall in neither the Influencer nor the Health Expert category as "other users". The social dimension features consist of Politics, Health, Personal, Economic, and Travel, and were labelled through the text classification framework presented in S2.1. As described above, a tweet may belong to more than one of these social dimensions; for example, 12% of the tweets fall into two of the five categories. We control for the tweet characteristics Character Count, Punctuation Count, and Upper Case Word Count, which reduces the likelihood that the spread of information is driven by those characteristics. Novelty was used both as a continuous and as a categorical variable. The continuous variable was used in order to examine the linear and quadratic relationship with the response variable, while the categorical variable was used to investigate the effect of the tails. The three control variables were normalized between 0 and 1 using min-max scaling, so that they are on the same scale. All the used features are described in the table of variable descriptions:

Tweet-level measure of novelty of information, calculated using tweets posted 24 hours before a tweet
Novelty-3day: Tweet-level measure of novelty of information, calculated using tweets posted 3 days before a tweet
Novelty-7day: Tweet-level measure of novelty of information, calculated using tweets posted 7 days before a tweet
Influencer: Whether the user is verified and has more than 5000 followers (1)

Table S10: Summary statistics of the variables used in the regression specifications.
Next, we present the model specifications for the 4 regressions:
Regression 1 - Estimating the response using the social dimensions as predictors.
Regression 2 - Estimating the response using the author's status and its interactions with social dimensions.
Regression 4 - Estimating the response using the author's status and its interactions with novelty.

Table S12: Results of Regression 2 with the number of retweets as response. Exponentiated coefficients are plotted in Figure 2B. Note: *** p<0.01, ** p<0.05, * p<0.1

S7 Sample of highly reshared tweets
In the tables below we present a sample of tweets with a very high number of retweets. In Figure 2B, we find that personal and health-related tweets posted by health experts, and political tweets posted by influencers, are retweeted more than other tweets. Additionally, in Figure 3A we observe that novel information spreads more than non-novel information for all the social dimensions.
Moving beyond the numbers, we provide the following sample of tweets that fall into the aforementioned groups.

• The French government has issued a warning against taking anti-inflammatory drugs (ibuprofen, Advil, cortisone) with covid-19 symptoms, as it may cause severe cases of the disease, even in young and middle aged adults with no underlying conditions
• Korea finished developing the 10 minute Covid-19 diagnostic kit and is now ramping up production. They plan to export 300.000 test-kits per week https://t.co/DpJCph9RT7
• Only 10% of Chinese cases and 25% of international cases currently being detected. #2019nCoV #coronavirus https://t.co/zlPUSpoTt8) The world has ongoing outbreaks of #Measles #Ebola #Cholera #influenza not only #coronavirus #nCoV

Table S33: Sample of highly retweeted tweets with high novelty of information