Online hate network spreads malicious COVID-19 content outside the control of individual social media platforms

We show that malicious COVID-19 content, including racism, disinformation, and misinformation, exploits the multiverse of online hate to spread quickly beyond the control of any individual social media platform. We provide a first mapping of the online hate network across six major social media platforms. We demonstrate how malicious content can travel across this network in ways that subvert platform moderation efforts. Machine learning topic analysis shows quantitatively how online hate communities are sharpening COVID-19 as a weapon, with topics evolving rapidly and content becoming increasingly coherent. Based on mathematical modeling, we provide predictions of how changes to content moderation policies can slow the spread of malicious content.

A. Methodology. Our online cluster search methodology follows our earlier works referenced in the main paper (N.F. Johnson, R. Leahy, N. Johnson Restrepo, N. Velasquez, M. Zheng, P. Manrique, P. Devkota and S. Wuchty, Hidden resilience and adaptive dynamics of the global online hate ecology, Nature 573, 261 (2019); N.F. Johnson, M. Zheng, Y. Vorobyeva, A. Gabriel, H. Qi, N. Velasquez, P. Manrique, D. Johnson, E. Restrepo, C. Song, S. Wuchty, New online ecology of adversarial aggregates: ISIS and beyond, Science 352, 1459 (2016)), but now looking within and across multiple social media platforms. While the method can in principle be repeated for any topic, we focus in this paper on forms of hate and hate-speech defined as either (a) content that would fall under the provisions of the United States Code regarding hate crimes or hate speech according to the Department of Justice's guidelines, or (b) content that supports or promotes Fascist ideologies or regime types (i.e., extreme nationalism and/or racial identitarianism). Online communities promoting hate have become prevalent globally and have been linked to many recent violent real-world attacks, including the 2019 Christchurch shootings. We observe many different forms of hate adopting similar cross-platform tricks. Our research avoids needing any information about individuals, just as information about a specific molecule of water is not needed to describe the bubbles (i.e. clusters of correlated molecules) that form in boiling water. For practical purposes in this paper, we define a hate cluster as a cluster (e.g., Facebook fan page, VKontakte club) in which 2 out of 20 of its most recent posts at the time of classification align with the above definition of hate. Whether a particular cluster strictly espouses a hate philosophy, or simply shows material with tendencies toward hate, does not alter our main findings. Links between clusters are hyperlinks.
Our hate cluster network analysis starts from a given hate cluster A and captures any hate cluster B to which hate cluster A has shared an explicit cluster-level link. We developed software to perform this process automatically and, upon cross-checking the findings with our manual list, obtained approximately 90 percent consistency between the manual and automated versions. Figure S1 shows an example of clusters, and a wormhole between them, from our analysis. Figure S1 shows the behavior within a universe (i.e., within a given platform, VKontakte), including the individual users. Gephi's ForceAtlas2 layout in Fig. S1 simulates a physical system in which nodes (clusters) repel each other while links act as springs. It is color agnostic, i.e. the color segregation emerges spontaneously and is not built in. Nodes that appear closer to each other have local environments that are more highly interconnected, while nodes that are far apart do not. Members of each cluster (VKontakte Group or Page) directly receive the feed of narratives and other material from that Group or Page, and all members (fans) can engage in the discussions and posting activity. The visually explosive nature is similar across scales (i.e. akin to Fig. 2B of the main paper), which suggests that parts of the multiverse that are as yet unknown, or yet to be created, will exhibit similar behaviors.
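The automated link-capture step described above amounts to a breadth-first crawl over cluster-level hyperlinks. The following sketch illustrates the logic only; the `outbound_links` snapshot and the cluster names are hypothetical stand-ins for the platform-specific scrapers, which are not reproduced here.

```python
from collections import deque

# Hypothetical snapshot of cluster-level hyperlinks (a stand-in for the
# platform-specific scrapers used in the actual study). Keys and values
# are (platform, cluster) pairs.
outbound_links = {
    ("vk", "clusterA"): [("vk", "clusterB"), ("telegram", "clusterC")],
    ("vk", "clusterB"): [("gab", "clusterD")],
    ("telegram", "clusterC"): [("vk", "clusterA"), ("gab", "clusterD")],
    ("gab", "clusterD"): [],
}

def map_hate_network(seeds, is_hate_cluster):
    """Breadth-first crawl: from each seed hate cluster A, follow every
    explicit cluster-level link to a cluster B, keeping B only if it also
    meets the hate-cluster definition (2 of 20 recent posts)."""
    network, queue, seen = [], deque(seeds), set(seeds)
    while queue:
        a = queue.popleft()
        for b in outbound_links.get(a, []):
            if not is_hate_cluster(b):
                continue
            network.append((a, b))          # directed edge A -> B
            if b not in seen:
                seen.add(b)
                queue.append(b)
    return network

# Toy classifier: in this illustrative snapshot every cluster qualifies.
edges = map_hate_network([("vk", "clusterA")], lambda c: True)
print(edges)
```

The same crawl naturally records cross-platform edges (wormholes) whenever the source and target clusters sit on different platforms.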
The LDA (Latent Dirichlet Allocation) method is 'a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox.' We recommend the Wikipedia entry https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation, from which the quote comes, for useful links. We also refer to a recent paper from researchers at Ghent University and Twitter, available at arXiv:1909.01436v2, titled 'Discriminative Topic Modeling with Logistic LDA', by I. Korshunova, M. Fedoryszak, H. Xiong, L. Theis, for a nice summary of new advances and analysis, as well as references therein. The coherence score is a way of measuring the alignment of the words within an identified topic. The overall coherence score is simply the arithmetic mean of all the per-topic coherences. The coherence score $C_v$ is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized point-wise mutual information (NPMI) and the cosine similarity. Essentially, it comprises collections of probability measures of how often the top words in topics co-occur with each other in examples of the topics. The following is an example which specifically uses this $C_v$ measure: https://ieeexplore.ieee.org/document/8259775. We refer to it for a longer-form explanation and discussion of $C_v$.
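The intuition behind such coherence scores can be illustrated in a few lines of code. The sketch below computes a simplified pairwise-NPMI coherence over sliding windows; it omits the one-set segmentation and cosine-similarity steps of the full $C_v$ measure, and the toy corpus and topic words are invented purely for illustration.

```python
# Simplified NPMI-based topic coherence, illustrating the idea behind C_v.
# The full C_v measure adds one-set segmentation and cosine similarity over
# NPMI vectors; this sketch stops at the average pairwise NPMI.
import math
from itertools import combinations

def npmi_coherence(top_words, documents, window=10):
    """Average NPMI over all pairs of a topic's top words, with
    co-occurrence counted over sliding windows of `window` tokens."""
    windows = []
    for doc in documents:
        tokens = doc.split()
        if len(tokens) <= window:
            windows.append(set(tokens))
        else:
            for i in range(len(tokens) - window + 1):
                windows.append(set(tokens[i:i + window]))
    n = len(windows)
    def p(*words):
        return sum(all(w in win for w in words) for win in windows) / n
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)   # words never co-occur: minimum NPMI
        elif p12 == 1:
            scores.append(1.0)    # words always co-occur: maximum NPMI
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

# Invented toy corpus: a coherent topic scores near the maximum of 1,
# while an incoherent word pair scores at the minimum of -1.
docs = ["virus spread lockdown virus mask",
        "virus mask lockdown spread policy",
        "garden flower tree garden soil"]
print(round(npmi_coherence(["virus", "mask", "lockdown"], docs), 3))
print(round(npmi_coherence(["virus", "garden"], docs), 3))
```

A topic whose top words routinely share windows scores high; a topic mixing unrelated vocabularies scores low, which is the sense in which our topics "becoming increasingly coherent" is quantified.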
Figure S1: Detailed snapshot showing individuals' membership in hate clusters on a single day within a single universe of anti-U.S. hate during the rise of ISIS (January 24, 2015). Red clusters are those that were subsequently shut down, while green clusters were not. White dots are individual users. The intra-universe map has a similarly explosive appearance to the large-scale multiverse map in Fig. 2B of the main paper. User information is not publicly available for all platforms, hence it is not included in the figures in the main paper.
B. Theoretical description of bilateral wormhole engineering (Figs. 4A,B,C main paper). This material draws heavily from the reference in the main paper: D.J. Ashton, T.C. Jarrett and N.F. Johnson, Effect of Congestion Costs on Shortest Paths Through Complex Networks, Phys. Rev. Lett. 94, 058701 (2005). We refer to that paper for demonstrations of good agreement between the core mathematics and approximations that we use, and numerical simulations, and hence for numerical justifications of the approximations that we use. Each of the $n$ nodes in a long chain (technically a ring, though a long chain also suffices) is connected to its nearest neighbors by a link of unit length. These links are directed in the 'directed' model, and undirected in the 'undirected' model. With probability $p$, any node can be attached to the central hub by a link of length $1/2$. The links to the hub are always undirected. For both the directed and undirected models, explicit expressions can be derived for the probability $P(\ell, m)$ that the shortest path between any two nodes on the ring is $\ell$, given that they are separated around the ring by length $m$. Summing over all $m$ for a given $\ell$ and dividing by $(n-1)$ yields the probability $P(\ell)$ that the shortest path between two randomly selected nodes is of length $\ell$. The average shortest path across the network is then $\bar\ell = \sum_{\ell=1}^{n-1} \ell P(\ell)$. For the undirected model, the expressions are more cumbersome because there are more paths with the same length. However, defining $nP(\ell) \equiv Q(z, \rho)$ where $\rho \equiv pn$ and $z \equiv \ell/n$, there is a simple relationship between the undirected and directed models in the limit $n \to \infty$ with $p \to 0$, i.e. $Q_{\mathrm{undir}}(z, \rho) = 2Q_{\mathrm{dir}}(2z, \rho)$. The models only differ in this limit by a factor of two: $z \to 2z$, with $z$ now running from $0$ to $1/2$. The results which follow were obtained by generalizing this procedure. We add a cost $c$ every time a path passes through the central hub.
This cost $c$ is expressed as an additional path-length; however, it could also be expressed as a time delay or a reduction in flow-rate for transport and supply-chain problems. We have in general considered three cases: (1) constant cost, where $c$ is independent of how many connections the hub already has, i.e. $c$ is independent of how well-used the hub is; (2) linear cost, where $c$ grows linearly with the number of connections to the hub, and hence varies as $\rho \equiv np$; (3) nonlinear cost, where $c$ grows with the number of pairs connected directly across the network via the hub, and hence varies as $\rho^2$. For a general, non-zero cost $c$ that is independent of $\ell$ and $m$, we can write the following for a network with directed links: Performing the summation gives: The shortest path distribution is hence: Using the same analysis for undirected links yields a simple relationship between the directed and undirected models. Introducing the variable $\gamma \equiv c/n$, with $z$ and $\rho$ as before, we may define $nP(\ell) \equiv Q(z, \gamma, \rho)$ and hence find in the limit $p \to 0$, $n \to \infty$ that $Q_{\mathrm{undir}}(z, \gamma, \rho) = 2Q_{\mathrm{dir}}(2z, 2\gamma, \rho)$. For a fixed cost that does not depend on network size or connectivity, this analysis is straightforward. For linear costs, which depend on network size and connectivity, and for $N = 1$ central hub, we can show that there exists a minimum value of the average shortest path $\bar\ell$ as a function of the connectivity to the central hub. Hence there is an optimal number of connections to the central hub in order to create the minimum possible average shortest path. We denote this minimal path length as $\bar\ell_{\min} \equiv \bar\ell\,|_{\min}$. Such a minimum is in stark contrast to the case of zero cost per connection, where the value of $\bar\ell$ would simply decrease monotonically towards one with an increasing number of connections to the hub. We now calculate the average shortest path $\bar\ell = \sum_{\ell=1}^{n-1} \ell P(\ell)$, which yields: An analytic expression for $\bar\ell_{\min}$ can be obtained by setting the differential of Eq. (5) equal to zero.
If $n$ is very large, one can introduce a higher cost without compromising the minimal shortest path $\bar\ell_{\min}$, since in general the nodes are already much further from one another. We can also investigate how many connections we should make for a given cost and network size, in order to achieve the minimum possible shortest path $\bar\ell_{\min}$. This is obtained by setting the differential of Eq. (5) equal to zero and solving for $p$. To gain insight into the underlying physics, we now make some approximations to the exact analytic expressions. For large $n$, or more importantly large $n - c$, the term $(1-p)^{n-c} \to e^{-\rho}$ in Eq. (5). Provided that the cost per connection to the hub is not too high, the region containing the minimal shortest path $\bar\ell_{\min}$ will be at reasonably high $\rho$. Hence we can neglect the exponential term and differentiate to find the minimum value of $\bar\ell$ with $c = knp = k\rho$. It is reasonable to assume that at fixed $k$, the optimal $\rho$ will increase with $n$ like $n^x$ where $0 < x \le 1$. In particular, one obtains diffusive behavior whereby $x \sim 1/2$. Specifically, $\rho \approx \sqrt{2n/k}$. For a large network (i.e. large $n$), we have therefore obtained a simple relationship between the number of connections one should introduce in order to create the minimal average shortest path between any two nodes in the network, and the cost per connection to the hub. We now briefly consider a specific yet physically reasonable example of non-linear costs, in which the cost is taken to depend on the number of pairs which are connected via the hub. In particular, we use $c = k(np)^2$. We obtain the analytic relationship $\rho \approx \sqrt[3]{n/k}$, which is the non-linear equivalent of the above result. More accurate expressions can of course be obtained, since we know the complete form of the analytic solution; however, these are too cumbersome algebraically to be presented here. For linear costs, the lowest value of $\bar\ell$ one can achieve is $\bar\ell_{\min} \approx \sqrt{8kn}$. For non-linear costs, the minimal shortest path is $\bar\ell_{\min} \approx \sqrt[3]{27kn^2}$.
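The interior minimum described above can also be seen in a direct numerical simulation. The sketch below uses a simplified, deterministic variant of the model (undirected links, with the $h$ hub connections evenly spaced rather than attached with probability $p$, and linear cost $c = kh$), so it reproduces only the qualitative behavior: the average shortest path attains its minimum at an intermediate number of hub connections instead of decreasing monotonically.

```python
def avg_shortest_path(n, h, k):
    """Average shortest path on an n-node undirected ring whose h evenly
    spaced nodes also connect to a central hub by links of length 1/2,
    with a linear congestion cost c = k*h added to every through-hub path."""
    c = k * h
    hubs = [round(i * n / h) % n for i in range(h)]
    ring = lambda i, j: min((i - j) % n, (j - i) % n)
    # distance from each node to its nearest hub-connected node, plus the
    # half-length link up to the hub itself
    near = [min(ring(i, a) for a in hubs) + 0.5 for i in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # shortest of: around the ring, or via the hub (two half-links
            # contribute length 1, already folded into near[i] + near[j])
            total += min(ring(i, j), near[i] + near[j] + c)
    return total / (n * (n - 1) // 2)

n, k = 200, 1
sweep = {h: avg_shortest_path(n, h, k) for h in (1, 2, 5, 10, 15, 20, 30, 50, 100)}
best_h = min(sweep, key=sweep.get)
print(best_h, round(sweep[best_h], 2))   # minimum lies at an interior h
```

At very few connections the hub is rarely useful; at very many, the congestion cost $c = kh$ makes through-hub paths uncompetitive, so $\bar\ell$ returns to the bare-ring value, in line with the analysis above.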
These last results show that the minimal shortest path $\bar\ell_{\min}$ across the network grows like $n^{1/2}$ when we impose linear costs, while it grows like $n^{2/3}$ when we put a cost on the number of direct connections between nodes made via the hub (i.e. non-linear costs). Corresponding results for the undirected model can easily be obtained from the equations for the directed model. For example, for linear costs $c = knp$ and undirected links, we obtain $\bar\ell_{\min} \approx \sqrt{4kn}$ and $\rho \approx \sqrt{n/k}$ for the minimal shortest path and the optimal connectivity. The present analysis can be extended to multiple hubs, $N \ge 2$. For simplicity, we focus here on the specific example of constant costs and $N = 2$ (i.e. hub P, with nodes connected to it with probability $p$, and hub Q, with nodes connected to it with probability $q$), where the costs associated with the hubs are $c_p$ and $c_q$, with $c_p \ge c_q$. The cost for using both hubs is assumed to be infinite. We first consider what happens when $\ell > c_p \ge c_q$. In this case, both hubs may be used and we may therefore write: where $P_P(\ell < m, \ell > c_p)$ and $P_Q(\ell < m, \ell > c_q)$ are understood to be $P(\ell < m, \ell > c)$ from the single-hub-with-costs case for probabilities $p$ and $q$ respectively. Substituting Eq.
(2) into the first term of Eq. (6) and performing the summation yields: An equivalent substitution and summation performed on the second term in Eq. (6) yields the same answer but with the labels $p$ and $q$ interchanged. The third term, after substitution and summation, yields: Substitution of these individual terms into Eq. (6) yields: where $g_i = g_{ipq} + g_{iqp} - h_i$. To calculate the full probability distribution for the case $\ell > c_p \ge c_q$, we now only require $P(\ell = m)$: where $P_Q(i < m)$ is the single-hub-plus-costs distribution for a hub with probability $q$, and $P(i < m)$ is given by Eq. (9). We define the following functions: We then substitute $P_Q(i < m)$ and $P(i < m)$ into Eq. (10), yielding: $-c_q \tilde f_0(a_q, c_p + 1, c_q + 1) - g_0 \tilde f_0(a_p a_q, \ell, c_p + 1) + g_1 \tilde f_1(a_q a_p, \ell, c_p + 1) + g_2 \tilde f_2(a_p a_q, \ell, c_p + 1)$.
We now obtain the final distribution by performing the sum over $m$: $-c_q \tilde f_0(a_q, c_p + 1, c_q + 1) - g_0 \tilde f_0(a_p a_q, \ell, c_p + 1) + g_1 \tilde f_1(a_q a_p, \ell, c_p + 1) + g_2 \tilde f_2(a_p a_q, \ell, c_p + 1)$. These results were used directly to produce the figure and curve shown in Fig. 4C (top) for one platform (universe 1) with one long chain (i.e. one ring) having separate sets of wormholes to a second platform (universe 2) and a third platform (universe 3).
In what follows, we consider the other example in Fig. 4B,C (bottom) of the main paper, of two distinct platforms (universe 1 and universe 2) having separate sets of wormholes to the same third platform (universe 3). Interconnections between the nodes of different platforms are not considered. The only coupling between the two platforms is due to the cost of using the third platform (referred to here as the central hub). Since both platforms' rings will tend to optimize their average path lengths, it is interesting to see how the optimal number of connections and the minimum path length change as a result of the central hub being shared. Let the two platform rings A and B have sizes $n$ and $n_B$ and node connection probabilities $p$ and $p_B$ respectively. Considering platform ring A, we start from the expression for a simplified form of the average path length: In this case, the cost depends on the connectivity of both platforms, since the congestion is caused by the connections from both platforms: $c \equiv c(p, p_B)$. For fixed network sizes $n$ and $n_B$, $\bar\ell$ depends on the two variables $p$ and $p_B$. To find the local maxima or minima, we need to locate the critical points by solving the following simultaneous equations, since we require the tangent plane to be horizontal: Rewriting in terms of the scaled connectivity $\rho = np$: In the limit $n \gg c$, $\rho \gg 1$: Similarly, we have: The above analysis only strictly holds for $\rho \gg 1$ and $n \gg c$, but it should be a reasonable first approximation otherwise. The two networks should be of similar sizes for this optimal condition. Since A and B are just dummy labels for the networks, by symmetry we get a similar equation for network B.
When the hub is used by a single network, the location of the minimum of the average path length is obtained from $\frac{dc}{d\rho} = \frac{2n}{\rho^2}$.
For a general cost $c = k\rho^\alpha + q\rho_B^\theta$, the optimal number of connections for the individual networks remains unchanged. However, the optimal average path length $\bar\ell$ does change. We expect it to increase, as the additional cost due to the other network causes paths of length $\ell \le c$ to become much longer; such paths do not use the hub, thus causing the overall $\bar\ell$ to increase.
Consider the linear case: for optimized networks, we have: Consider the case when $k = q$ and the networks are of the same size $n$. Substituting the optimal number of connections and the cost when optimized, and keeping the dominant terms, we get: Thus the optimal path length has increased to $\bar\ell \approx \sqrt{18nk}$, compared to $\bar\ell \approx \sqrt{8nk}$ when the hub is used by a single ring-and-star network. The above result generalizes to multiple hubs, for linear cost functions and $N$ networks of size $n$: Strictly speaking, we require $n \gg c$ for our approximations to be accurate, hence the addition of many hubs might result in the cost being too high for the result to remain accurate. One important aspect of the result is that while the optimal number of connections does not change, the optimal path length does change as a result of the hub being used by other platforms. This feature can be used to detect platforms that may be using a particular hub without wanting to be detected. Here, since inter-network connections do not exist and the nodes of this 'dark' platform only connect among themselves, it can only be detected through its influence on the cost associated with the hub. Conversely, if it is known that a platform is using a given hub, the nature of its cost function can be estimated from its impact on the path lengths of the other platforms using that hub. While the optimal connection probability remains the same, the path length does increase. This is expected, since paths with a higher cost than the separation of the nodes are avoided, and this leads to a higher average path length when costs are higher.
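The shift of the optimal path length from $\sqrt{8nk}$ to $\sqrt{18nk}$ at an unchanged optimal connectivity can be checked numerically. The closed form $\bar\ell(\rho) = 2n/\rho + c$ used below is an illustrative assumption consistent with the minimum condition $dc/d\rho = 2n/\rho^2$ quoted above, not the full analytic expression.

```python
# Numerical check of the shared-hub result, using the approximate form
# lbar(rho) = 2n/rho + c (an illustrative assumption consistent with the
# minimum condition dc/drho = 2n/rho^2, not the full analytic expression).
import math

n, k = 10_000, 1.0

def lbar(rho, rho_other=0.0):
    # linear congestion cost k*(rho + rho_other); rho_other is the other
    # ring's connectivity to the shared hub (0 when the hub is unshared)
    return 2 * n / rho + k * (rho + rho_other)

grid = [r / 100 for r in range(100, 40_000)]   # rho from 1.00 to 399.99

rho_alone  = min(grid, key=lambda r: lbar(r))              # hub used alone
rho_shared = min(grid, key=lambda r: lbar(r, rho_alone))   # hub shared

print(round(rho_alone, 2), round(rho_shared, 2))           # optima coincide
print(round(lbar(rho_alone), 1), round(math.sqrt(8 * k * n), 1))
print(round(lbar(rho_shared, rho_alone), 1), round(math.sqrt(18 * k * n), 1))
```

The grid search confirms that the optimal $\rho$ is unaffected by the other ring's traffic, while the achievable minimum path length rises from $\approx \sqrt{8nk}$ to $\approx \sqrt{18nk}$, which is the detection signature discussed above.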
We can similarly consider a series of platforms (rings), with the outer rings using the inner rings plus hub as a renormalized hub. Since our result holds only when $n \gg c$, the assumption breaks down for the higher rings if the number of nodes is taken to be equal for all rings. We therefore consider the case $n_i = n^i$ and a linear cost function. To check that the assumption holds, consider a general $x$-th ring: The dominant term is the one in $n$ for any higher $x$, $c_x \approx n_x/n^2$, and thus $n_x \gg c_x$ holds for arbitrarily high $x$. The summation used is evaluated below.
To evaluate the sum, let: Substituting yields: Thus the optimized path length of the $m$-th ring, which uses the inner rings and the central hub as a renormalized hub, for a cost function that depends linearly on the number of connections and on the path length of the inner ring, is: The main conclusions are therefore as follows: (1) Even though the hub is being used by other networks, the number of connections required by a platform network to reach its optimal average path length does not change. However, the value of the optimal path length does change. Thus any deviation of the optimal path length from its expected value can help in inferring the presence of another network that might otherwise go undetected.
(2) The optimized path length of the $m$-th ring in a series of rings with $n_m = n^m$, which uses the inner rings and the central hub as a renormalized hub, for a cost function that depends linearly on the number of connections and on the path length of the inner ring, is given by:

C. Theoretical description for manipulating outbreak of support surrounding malicious matter (Figs. 4D,E,F main paper). This material draws heavily from the reference given in the main paper: P.D. Manrique, M. Zheng, Z. Cao, E.M. Restrepo, N.F. Johnson, Generalized gelation theory describes onset of online extremist support, Phys. Rev. Lett. 121, 048301 (2018). We refer to that paper for demonstrations of good agreement between the core mathematics and approximations that we use, and numerical simulations, and hence for numerical justifications of the approximations that we use. We start with a set of rate equations for the concentration $c_k(t)$ of small clumps (i.e. microscopically small clusters) of individuals of size $k$ ($k = 1, 2, \ldots$). These small clumps of individuals may be anywhere online, in some informal community setting (e.g. some people discussing another topic and/or on another platform). These clumps of individuals have not yet aggregated (i.e. 'gelled') into a large observable malicious-matter cluster (e.g. a Facebook Page focused on hate). When they do, that large cluster can be called a gel that emerges out of this soup of microscopic clumps, and it represents one of the observable Facebook Pages (clusters). We include heterogeneity among the interacting elements (small clumps of size $k = 1$) and consider that this heterogeneity ultimately dictates the evolution of the aggregation process.
A hidden variable $x$, which for simplicity we call 'character', is randomly assigned to each element, drawn from a given distribution $q(x)$. The interaction is described in terms of the similarity or dissimilarity of the interacting elements. We define the similarity $S_{ij}$ between element $i$ and element $j$ as $S_{ij} = 1 - |x_i - x_j|$, so that elements with alike character have a high similarity, and conversely for a pair of elements with unlike character. We consider that the probability of aggregation for any two individuals $i$ and $j$ is given by $C = S_{ij}$. Our definition also accommodates the opposite mechanism, which tends to form clumps of dissimilar individuals, where the aggregation probability between $i$ and $j$ is $C = 1 - S_{ij}$. The random case is recovered in the limit where the aggregation probability is independent of $x$, i.e. $C = 1$. The heterogeneous aspect of the aggregation process is captured by means of a mean-field probability for aggregation. For example, for a uniform character distribution $q(x)$, the probability density function (PDF) of the similarity $y = S_{ij}$ for homophily is $f(y) = 2y$, and hence the mean-field aggregation probability $F$ becomes: By contrast, for dissimilarity, defining $z = 1 - S_{ij}$, the PDF is $f(z) = 2(1 - z)$, resulting in a mean-field aggregation probability $F$ of: The homogeneous limit (i.e. random limit) occurs when $y = 1$ (or $z = 0$) and the diversity distribution is a Dirac delta, which yields $F = 1$. In general, $F$ determines the likelihood for any pair of elements $i$ and $j$ to merge into a new clump at a given timestep $t$. With this in mind we can write the set of equations for the number of clumps of size $k$ ($n_k$) for the heterogeneous system as: where $N$ is the subpopulation from which a particular flavor of future Facebook Page (gel) might emerge if gelation occurs.
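The mean-field aggregation probabilities for the three mechanisms, and the gelation times they imply, can be recovered by direct Monte Carlo sampling of a uniform character distribution. This is a sanity-check sketch rather than part of the derivation; the critical-time formula $t_c = N/2F$ printed below is derived later in this section.

```python
import random

def mean_field_F(mode, samples=200_000, seed=1):
    """Estimate F = E[C] for a uniform character distribution q(x), where
    C = S_ij (homophily), 1 - S_ij (dissimilarity), or 1 (random case)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        s = 1 - abs(rng.random() - rng.random())   # similarity S_ij
        total += {"homophily": s, "dissimilarity": 1 - s, "random": 1}[mode]
    return total / samples

# Expected analytic values: F = 1 (random), 2/3 (homophily), 1/3 (dissimilarity)
N = 1000
for mode in ("random", "homophily", "dissimilarity"):
    F = mean_field_F(mode)
    print(mode, round(F, 3), "t_c ~", round(N / (2 * F), 1))
```

The sampled values confirm the ordering discussed below: random aggregation gels quickest, homophily next, and dissimilarity-driven aggregation last.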
The first term represents the population of clumps of size $k$ that merge with other clumps, while the second term is the population of smaller clumps that merge to form clumps of size $k$; together they constitute the well-known product kernel. By considering $N = \sum_{k=1}^{\infty} k n_k$, the equation can be solved immediately, and the expression for the evolution of the number of individual elements is: where we have assumed that the system initially comprises individual elements only ($n_1(0) = N$). This yields: and the general expression for any $k \ge 2$ is found to be: For finite systems, at some point in the dynamics a finite, non-negligible fraction of the total population condenses into a single large cluster. This phenomenon is known as gelation and divides the dynamics of the system. Each hate Facebook Page is a large cluster which we can call a gel. After the gel is formed, the moments of the size distribution are decomposed into the small clumps (the sol, or solution) and the largest gel cluster (Facebook Page) in the following way: The importance of this decomposition becomes evident when analyzing the zeroth moment, $M_0 = \sum_{k \ge 1} n_k$, which gives the number of clumps of any size. By looking at its first derivative we find: The solution for the zeroth moment becomes negative when $t > N/F$, which is problematic since $M_0$ gives the total number of clumps present. This problem is solved by using, above the gel point, $\sum_{k \ge 1} k n_k = N - C$, where $C$ is the size of the gel cluster (hate Facebook Page). With this correction, the derivative of the zeroth moment becomes: This indicates that the number of clumps stops decreasing when the gel reaches the system size $N$. The appearance of the gel cluster is mathematically manifested as a singularity in the second moment of the size distribution. The evolution of the second moment is given by: and hence: which gives a closed differential equation for the second moment of the size distribution.
The solution for the initial condition where all clumps are of size one ($M_2(0) = N$) is: which has a singularity at the time $t_c = N/2F$, marking the point where the gel transition takes place. This critical time depends on the mean-field aggregation probability $F$ and hence on the formation process. For a uniform character distribution, clusters of unlike elements (dissimilarity) are slower to form, and hence the transition occurs at a later time than for alike-cluster formation (homophily). Random clusters are the quickest to form, since they have the maximum mean-field aggregation probability per timestep ($F = 1$). The expression for the evolution of the gel size (e.g. hate Facebook Page) is obtained by means of the exponential generating function $E(y, t) \equiv \sum_{k \ge 1} k n_k e^{yk}$. Hence: This is the inviscid Burgers equation, the simplest nonlinear hyperbolic equation, which can be solved by the method of characteristics. For this type of partial differential equation, the characteristics are straight lines in the $y$-$t$ plane along which $E$ is constant, with slope $\alpha(1 - E')$, where for simplicity we have defined $\alpha = 2F/N$ and $E' = E/N$. The equation of motion for $y$ along the characteristic is therefore: Since $E$ (and hence $E'$) is constant, the solution for $y(t)$ along the characteristic is: where $f(E)$ depends on the initial conditions, which for the generating function we find to be $E(y, t = 0) = N e^y$, yielding $y(t = 0) = \ln E'$. The derivation proceeds as follows: Now note that the generating function at $y = 0$ yields $E(0, t) = N - C$ above the gel point, and the following expression for the largest cluster is found: The solution can be written by means of the Lambert W-function as: which is the equation given in the main paper.
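The post-gel growth can be illustrated numerically. The sketch below solves the standard product-kernel self-consistency relation $g = 1 - e^{-\tau g}$ for the gel fraction $g = C/N$ as a function of rescaled time $\tau = t/t_c$; this closed form is the textbook Flory result for the product kernel, quoted here as an illustration (it is equivalent to a Lambert-W expression via $g = 1 + W(-\tau e^{-\tau})/\tau$) rather than lifted verbatim from the paper above.

```python
import math

def gel_fraction(tau, iters=10_000):
    """Gel fraction g = C/N from g = 1 - exp(-tau*g), solved by fixed-point
    iteration; g = 0 below the gel point tau = t/t_c = 1."""
    if tau <= 1:
        return 0.0
    g = 0.5
    for _ in range(iters):
        g = 1 - math.exp(-tau * g)   # contraction for tau*(1-g) < 1
    return g

# Gel (e.g. hate Facebook Page) grows from 0 at the transition towards N
for tau in (0.5, 1.5, 2.0, 5.0):
    print(tau, round(gel_fraction(tau), 4))
```

Below $t_c$ only microscopic clumps exist; above it the gel fraction grows steeply, which is the observable "sudden appearance" of a large malicious-matter cluster discussed in the main paper.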
D. Example of LDA topic analysis of hate multiverse content before COVID-19 outbreak. WARNING: THIS SECTION CONTAINS OFFENSIVE MATERIAL. Figure S2 shows the typical results that we obtain from the analysis of the text, as discussed in the main paper.
Figure S2: Example output from our machine learning (LDA) analysis discussed in the main paper text.
Table 1: Direct links between hate clusters on different social media platforms. The table lists one-step (i.e. direct) hyperlinks between social media platforms connecting hate clusters on the Source (which can be regarded in this sense as the broadcasting) platform and the Target (in this sense, receiving) platform, as shown visually in Fig. 1 of the main article. The key connective role of Telegram, and to a lesser extent VKontakte, is evident in that they are the two platforms that receive from and broadcast to all five other platforms. Links between hate clusters on the same platform are not shown.