Impact of individual actions on the collective response of social systems

Martin-Gutierrez, Samuel; Losada, Juan C.; Benito, Rosa M.

doi:10.1038/s41598-020-69005-y

Published: 22 July 2020

Scientific Reports volume 10, Article number: 12126 (2020)

Abstract

In a social system individual actions have the potential to trigger spontaneous collective reactions. The way and extent to which the activity (number of actions—A) of an individual causes or is connected to the response (number of reactions—R) of the system is still an open question. We measure the relationship between activity and response with the distribution of efficiency, a metric defined as $\eta =R/A$. Generalizing previous results, we show that the efficiency distribution presents a universal structure in three systems of different nature: Twitter, Wikipedia and the scientific citations network. To understand this phenomenon, we develop a theoretical framework composed of three minimal statistical models that contemplate different levels of dependence between A and R. The models not only are able to reproduce the empirical activity-response data but also can serve as baselines or null models for more elaborated and domain-specific approaches.

Introduction

Due to humans’ social nature, the actions of individuals hold the potential to trigger spontaneous collective reactions, leading to complex dynamics. In order to understand human collective behavior, it is necessary to find the laws that relate the individual actions to the collective response of social systems.

This topic has received considerable attention and has been approached from several perspectives^1,2,3,4. These range from diffusion on networked systems, a field which studies the spread of diseases or information and the emergence of cascading phenomena^5,6,7, to virality, a property of certain pieces of information that generate a wide response in social systems^8,9,10. Other works focus on the Influence Maximization problem, taking advantage of the diffusion mechanisms to find a set of individuals that maximize the response^11,12,13. Alternatively, the field of control theory aims to steer the collective behavior of a system by controlling the activity of a few individuals^{14, 15}.

Our goal in this work is to develop a theoretical framework that relates the number of actions performed by an actor (an agent or individual) embedded in a social system; that is, her activity (A), and the number of reactions that these actions trigger in her peers, or response (R). To relate these two magnitudes we generalize the efficiency metric ($\eta = \frac{R}{A}$), introduced by Morales et al. in the context of Twitter¹⁶, to other social systems.

We follow a well-established modeling approach in social physics: explain the macroscopic properties of the system assuming the simplest microscopic interactions between the actors to extract the most fundamental laws^{17,18,19,20,21,22,23}. The macroscopic property on which we focus is the distribution of efficiency. We have used this metric to analyze three kinds of social systems of different nature: social networks, collaborative networks and citations networks. In particular, we have worked with 14 Twitter conversations around different issues in Spain, Turkey, Palestine, Argentina and Colombia, the editions of the English Wikipedia and the scientific citations data of authors from 14 different countries extracted from the Web of Science.

In Twitter, the activity is the number of original messages posted by a user and the response of the system is the number of retweets received by that user. Another magnitude used in our analysis is the response to single actions (r). In Twitter r would be the number of retweets obtained by a single tweet. In the scientific citations network, A is the number of publications of an author and R the number of citations obtained. The variable r in this case is the number of citations obtained by one paper. In the context of the Wikipedia collaboration network, we consider A as the aggregated number of editions performed by a particular user in any Wikipedia page. The corresponding R is the number of editions made by other users in her personal user page. These editions can be considered as messages directed to that particular user. In this case there is no data for the response to a single edition. Therefore, we have defined r as the number of editions made on the pages of users whose activity is $A = 1$.

We have found that the efficiency distribution in these three systems has a universal structure with small differences between the datasets, which may indicate the existence of a general mechanism governing the $A$-$R$ relationship. To reveal that mechanism we have developed three domain-independent minimal statistical models. Taking a parsimonious approach, we start from the most naive model and progressively consider more sophisticated theories with increasingly complex levels of dependence between R and A. The models are the Independent Variables model (InV), the Identical Actors model (IdA) and the Distinguishable Actors model (DiA). In the InV model the response of the system is independent of the activity of the individual. In the IdA model, the response of the system depends on the activity of the individual, but the system is agnostic with respect to the individual that stimulates it. Finally, in the DiA model the response is determined not only by the activity of the individual, but also by her features. The models are general because no assumption is made about the particular characteristics of the system or its components.

Results

Distribution of efficiency

The efficiency metric is defined as the quotient between collective response R and individual activity A:

$$\begin{aligned} \eta = \frac{R}{A} \end{aligned}$$

(1)

It can be considered as a proxy for how efficient an individual is at triggering reactions in her peers or as a measure of the system’s inertia to react to the stimuli of the individual. The higher the individual’s efficiency, the lower the system’s inertia.

Our work is focused on the efficiency distribution, an example of which is presented in Fig. 1. It is characterized by a concave shape with two distinct tendencies for $\eta <1$ (an individual gets less than one reaction per action) and $\eta >1$ (an individual triggers more than one reaction per action). Morales et al.¹⁶ used the Independent Cascade (IC) model on the Twitter follower network to reproduce the empirical distribution of user efficiency and showed that the shape of the distribution was universal for Twitter conversations. However, several questions were left open and some of the empirical results lacked a comprehensive explanation. In particular, they reported evidence for the independence of the efficiency distribution with respect to the functional form of the activity distribution and, from that, conjectured that communication patterns are not dependent on the way users post original messages; that is, that collective response is independent of individual activity.

In this work we go one step further and present evidence for the universality of the structure of the efficiency distribution in two other social systems. We also present the three aforementioned statistical models to provide a comprehensive description of the nature of the efficiency distribution and show the extent to which the activity of the individuals and their particular features influence the response of the system.

Description of the models

We have calculated the theoretical distributions of efficiency with three different methodologies: Monte-Carlo (MC) simulation, direct computation with discrete probability distributions and derivation of an analytical expression.

Once the basic mechanism of the model is laid out, MC simulation allows a direct implementation of the model’s assumptions. Thus, we use it to compare model and empirical data as well as to verify the results of the other methodologies.

To compute the efficiency distribution directly from the discrete joint probability distribution p(R, A) we follow the method described in the “Methods” section [Eqs. (18) and (19)]. The resulting efficiency distribution is asymptotically exact: since the support of the distributions of A and R is $\mathbb {N}$, an infinite number of terms would be required to obtain exact results. However, larger values of A and R have increasingly smaller probabilities and carry progressively lower weight in the computation, so the results converge with a finite number of terms.
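The basic idea of passing from a discrete p(R, A) to a distribution over efficiency can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Eqs. (18) and (19) of “Methods” are not reproduced in this section, so the function below (the hypothetical `efficiency_pmf`) simply accumulates the probability mass of all (R, A) pairs that share the same ratio; it is not claimed to be the authors' procedure, and the toy numbers are invented.

```python
from collections import defaultdict
from fractions import Fraction


def efficiency_pmf(joint):
    """Collapse a discrete joint distribution p(R, A) into one over eta = R/A.

    `joint` maps (R, A) pairs of positive integers to probabilities.
    Exact fractions are used as keys so that, e.g., (1, 2) and (2, 4)
    land on the same efficiency value 1/2.
    """
    pmf = defaultdict(float)
    for (R, A), prob in joint.items():
        pmf[Fraction(R, A)] += prob
    return dict(pmf)


# Toy joint distribution (illustrative numbers, not taken from the datasets).
toy_joint = {(1, 1): 0.4, (2, 1): 0.1, (1, 2): 0.2, (2, 2): 0.3}
toy_eta = efficiency_pmf(toy_joint)
```

In this toy case the pairs (1, 1) and (2, 2) both contribute to $\eta = 1$. Turning such point masses into a density comparable with the continuous PDF would additionally require binning, which is left out here.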

The analytical calculation of the efficiency distribution has been carried out for the InV and IdA models by considering A and R as continuous random variables. Taking into account the definition of efficiency given by (1) we derive an expression for the probability density function (PDF) of efficiency using the joint probability distribution $\varphi (R,A) = \varphi (\eta A,A)$ (see Section 2 of the Supplementary Information):

$$\begin{aligned} f(\eta )={\left\{ \begin{array}{ll} \int _{R_m/\eta }^{\infty } \varphi (\eta A,A) A dA \quad \text {if} \quad \eta \le \frac{R_m}{A_m}\\ \int _{A_m}^{\infty } \varphi (\eta A, A) A dA \quad \text {if} \quad \eta > \frac{R_m}{A_m}\\ \end{array}\right. } \end{aligned}$$

(2)

where $A_m, R_m>0$ are the minimum values of A and R. In our case, $A_m = R_m = 1$ for every dataset. It is worth noting that the two branches of $f(\eta )$ in Eq. (2) correspond to the two characteristic tails of the efficiency distribution.

Independent variables model

In the InV model A and R are considered independent variables with probability distributions p(A) and p(R).

A Monte-Carlo simulation can be computed as follows: In a system with N individuals indexed by $i=1,2,\ldots , N$, store the empirical data of activity and response in two vectors $\vec {A}$ and $\vec {R}$ such that component i of vector $\vec {A}$ corresponds to the same individual as component i of vector $\vec {R}$. Next, shuffle each of them independently, such that the correlations that may have been present when each couple $(A_i,R_i)$ corresponded to the same individual vanish. The randomized versions of the vectors, $\vec {A}_{rnd}$ and $\vec {R}_{rnd}$, hold the same values as the originals but with the order of the elements randomly altered. Finally, the efficiency vector $\vec {\eta }_{rnd} = \vec {R}_{rnd} / \vec {A}_{rnd}$ is used to compute the efficiency distribution according to the InV model.
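The shuffling procedure just described can be sketched as follows. This is a minimal illustration with invented toy data; the function name `inv_efficiencies` and the `seed` parameter are assumptions introduced here, not part of the original work.

```python
import random


def inv_efficiencies(activity, response, seed=None):
    """One Monte-Carlo realization of the InV model.

    Both vectors are shuffled independently, which destroys any
    correlation between the activity and the response of an actor
    while preserving the two marginal distributions.
    """
    rng = random.Random(seed)
    a_rnd = list(activity)
    r_rnd = list(response)
    rng.shuffle(a_rnd)
    rng.shuffle(r_rnd)
    # Element-wise quotient of the randomized vectors.
    return [r / a for r, a in zip(r_rnd, a_rnd)]


# Toy data (illustrative): actor i performed activity[i] actions
# and received response[i] reactions.
activity = [1, 1, 2, 3, 5, 8, 13, 40]
response = [1, 2, 1, 9, 4, 30, 2, 120]
eta_rnd = inv_efficiencies(activity, response, seed=0)
```

Averaging the histogram of `eta_rnd` over many realizations (different seeds) would reduce the sampling noise of a single shuffle.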

Since A and R are considered independent, their discrete joint probability distribution is $p(R,A)=p(R)p(A)$. The PDF of efficiency can be obtained by plugging this expression in (18) and (19) of “Methods”. However, for this model we have left out the results of the discrete methodology because we have derived an exact analytical expression.

For the analytical computation of the InV model we consider A and R as continuous variables with PDFs $f_A(A)$ and $f_R(R)$. Their joint probability distribution can be written as $\varphi (R,A)=f_A(A)f_R(R)$. Plugging this in (2) we obtain:

$$\begin{aligned} f^{InV}(\eta )={\left\{ \begin{array}{ll} \int _{R_m/\eta }^{\infty } f_R(\eta A)f_A(A) A dA \quad \text {if} \quad \eta \le \frac{R_m}{A_m}\\ \int _{A_m}^{\infty } f_R(\eta A)f_A(A) A dA \quad \text {if} \quad \eta > \frac{R_m}{A_m}\\ \end{array}\right. } \end{aligned}$$

(3)

This expression explains a key result of Morales et al.¹⁶, who show that the right tail of the efficiency distribution remains unaltered when the activity distribution is modified. To reach that result, let us assume that $f_R(R)\propto R^{-\gamma _R}$. This power law distribution was used by Morales et al.¹⁶, as well as in other works, to model the distribution of retweets²⁴, scientific citations²⁵ and incoming editions in Wikipedia³. Then, the right tail ($\eta > \frac{R_m}{A_m}$) of the PDF shown in (3) can be written as:

$$\begin{aligned} f^{InV}(\eta ) \propto \eta ^{-\gamma _R} \int _{A_m}^{\infty } A^{1-\gamma _R}f_A(A) dA = E_A[A^{1-\gamma _R}] \eta ^{-\gamma _R} \Rightarrow f^{InV}(\eta ) \propto f_R(\eta ) \end{aligned}$$

(4)

where $E_A[\cdot ]$ is the expected value with respect to the activity distribution. Therefore, when $f_R(R)\propto R^{-\gamma _R}$, the right tail of the efficiency distribution is proportional to $\eta ^{-\gamma _R}$. That is, in addition to being independent of the activity distribution, its shape is completely determined by the exponent of the response distribution.
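The step behind Eq. (4) can be spelled out as follows. This is a sketch that writes the power law as $f_R(R)=K R^{-\gamma _R}$ for $R \ge R_m$, where the normalization constant K is a symbol introduced here for illustration only. For $\eta > \frac{R_m}{A_m}$, every $A \ge A_m$ in the integration range satisfies $\eta A > R_m$, so the power law form applies to the whole integrand:

$$\begin{aligned} f_R(\eta A) f_A(A) A = K \eta ^{-\gamma _R} A^{1-\gamma _R} f_A(A) \end{aligned}$$

The factor $\eta ^{-\gamma _R}$ does not depend on A and can be taken out of the integral, whose limits are themselves independent of $\eta$. In the left branch of (3) the lower limit $R_m/\eta$ depends on $\eta$, so the same factorization does not by itself yield a pure power law there.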

To apply the analytical computation of the efficiency distribution for the InV model to empirical data we have fit the empirical distributions of A and R to a power law with exponential cutoff (or truncated power law) using the powerlaw Python module²⁶. The functional form of this distribution is the following:

$$\begin{aligned} f(x) = \frac{\lambda ^{1-\alpha }}{\Gamma (1-\alpha ,\lambda x_{min})} x^{-\alpha }e^{-\lambda x} \end{aligned}$$

(5)

where $\Gamma (s, x)$ is the upper incomplete gamma function. The resulting fits for $f_A(A)$ and $f_R(R)$ for every dataset are presented in the Supplementary Information (SI). When the PDFs of activity and response are power laws with exponential cutoff, the PDF of efficiency adopts the following form:

$$\begin{aligned} f^{InV}(\eta )={\left\{ \begin{array}{ll} g(\eta ) \Gamma (2-\alpha _{R}-\alpha _{A},(\lambda _{R} \eta +\lambda _{A})\frac{R_m}{\eta }) \quad \text {if} \quad \eta \le \frac{R_m}{A_m}\\ g(\eta ) \Gamma (2-\alpha _{R}-\alpha _{A},(\lambda _{R} \eta +\lambda _{A})A_m) \quad \text {if} \quad \eta > \frac{R_m}{A_m}\\ \end{array}\right. } \end{aligned}$$

(6)

With

$$\begin{aligned} g(\eta ) = C (\lambda _{R}\eta +\lambda _{A})^{(\alpha _R+\alpha _A-2)}\eta ^{-\alpha _R} \end{aligned}$$

(7)

and

$$\begin{aligned} C = \frac{\lambda _R^{1-\alpha _R}}{\Gamma (1-\alpha _R,\lambda _RR_m)} \frac{\lambda _A^{1-\alpha _A}}{\Gamma (1-\alpha _A,\lambda _AA_m)} \end{aligned}$$

(8)
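As a sanity check, the closed form of Eqs. (6)-(8) can be compared with a direct numerical integration of Eq. (3) when both PDFs follow Eq. (5). The sketch below uses only the standard library; the helper names (`simpson_log`, `upper_gamma`, `norm_const`, `f_inv_closed`, `f_inv_numeric`), the Simpson quadrature in logarithmic variable, the truncation of the integrals and the parameter values are all assumptions made for illustration, not the authors' implementation.

```python
import math


def simpson_log(func, lo, hi, n=4000):
    """Simpson's rule for the integral of func(x) dx on [lo, hi], in y = ln(x)."""
    y0, y1 = math.log(lo), math.log(hi)
    h = (y1 - y0) / n  # n must be even
    total = 0.0
    for k in range(n + 1):
        x = math.exp(y0 + k * h)
        weight = 1 if k in (0, n) else (4 if k % 2 else 2)
        total += weight * func(x) * x  # dx = x dy
    return total * h / 3


def upper_gamma(s, x):
    """Upper incomplete gamma Gamma(s, x) for x > 0; s may be negative.

    The integral is truncated at x + 60, where exp(-t) is negligible.
    """
    return simpson_log(lambda t: t ** (s - 1) * math.exp(-t), x, x + 60.0)


def norm_const(alpha, lam, x_min):
    """Normalization prefactor of the truncated power law, Eq. (5)."""
    return lam ** (1 - alpha) / upper_gamma(1 - alpha, lam * x_min)


def f_inv_closed(eta, aR, lR, aA, lA, R_m=1.0, A_m=1.0):
    """Closed form of Eqs. (6)-(8)."""
    C = norm_const(aR, lR, R_m) * norm_const(aA, lA, A_m)
    k = lR * eta + lA
    g = C * k ** (aR + aA - 2) * eta ** (-aR)
    lower = R_m / eta if eta <= R_m / A_m else A_m
    return g * upper_gamma(2 - aR - aA, k * lower)


def f_inv_numeric(eta, aR, lR, aA, lA, R_m=1.0, A_m=1.0):
    """Direct numerical integration of Eq. (3) with truncated power laws."""
    C = norm_const(aR, lR, R_m) * norm_const(aA, lA, A_m)
    k = lR * eta + lA
    lower = R_m / eta if eta <= R_m / A_m else A_m

    def integrand(A):
        f_R = (eta * A) ** (-aR) * math.exp(-lR * eta * A)
        f_A = A ** (-aA) * math.exp(-lA * A)
        return C * f_R * f_A * A

    return simpson_log(integrand, lower, lower + 60.0 / k)


# Illustrative parameter values (not fitted to any dataset).
params = dict(aR=2.0, lR=0.01, aA=1.5, lA=0.02)
closed_left, numeric_left = f_inv_closed(0.5, **params), f_inv_numeric(0.5, **params)
closed_right, numeric_right = f_inv_closed(3.0, **params), f_inv_numeric(3.0, **params)
```

The two evaluation points are chosen on either side of $\eta = R_m/A_m = 1$ so that both branches of Eq. (6) are exercised.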

Identical actors model

A natural extension to the InV model is to consider that the response of the system depends on the activity of the individual. To carry out this extension in a parsimonious way, we realize that the stimuli to which the system reacts occur in a discrete fashion, so we can assume that it reacts to each action (a tweet, a scientific publication, an edition on Wikipedia, etc.) individually, as if they were isolated events. Then, while in the InV model the aggregate response of the system was independent of the aggregate activity of the actor, in the IdA model the partial response of the system to each single action is independent of the actor. But, as the aggregate response of the system to the activity of an individual is the sum of the partial responses to each of her A actions, a dependence between R and A is induced.

To formalize this idea we introduce the new variable r as the response of the system to a single action by any individual. This random variable follows the same distribution p(r) for all the actors. The aggregate response R associated with an actor that performed A actions and triggered partial responses $\{r_1,r_2,\dots ,r_A\}$ is $R=\sum _{j=1}^A r_j$. The dependence of R on A resides in the number of terms of this sum.

To perform a Monte-Carlo simulation of the IdA model, we first fit p(r) with the hybrid methodology detailed in the SI and p(A) to a discrete truncated power law (see the SI for the results). Then, we generate a set of individuals whose activity is assigned according to p(A). The responses to each of the A actions of an individual are randomly generated with p(r) and then aggregated to obtain her R. The efficiency according to this model is directly computed from the (R, A) tuple associated with each actor.
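The simulation loop of the IdA model can be sketched as below. The distributions `p_A` and `p_r` are invented toy inputs standing in for the fitted ones (the actual fits are in the SI), the function name `sample_ida` is hypothetical, and discarding actors with R = 0 is an assumption made here to mirror $R_m = 1$.

```python
import random


def sample_ida(n_actors, p_A, p_r, seed=None):
    """Monte-Carlo sketch of the IdA model.

    p_A and p_r map integer values to (possibly unnormalized) weights.
    Returns a list of (R, A) tuples, one per simulated actor.
    """
    rng = random.Random(seed)
    a_vals, a_w = zip(*p_A.items())
    r_vals, r_w = zip(*p_r.items())
    activities = rng.choices(a_vals, weights=a_w, k=n_actors)
    actors = []
    for A in activities:
        # Every action draws its partial response from the same p(r),
        # regardless of which actor performs it.
        R = sum(rng.choices(r_vals, weights=r_w, k=A))
        actors.append((R, A))
    return actors


# Toy distributions (illustrative only).
p_A = {a: a ** -1.5 for a in range(1, 51)}
p_r = {0: 0.6, 1: 0.2, 2: 0.1, 5: 0.07, 20: 0.03}

actors = sample_ida(2000, p_A, p_r, seed=1)
# Keep actors with at least one reaction (assumption mirroring R_m = 1).
eta = [R / A for R, A in actors if R >= 1]
```

Note that the left tail ($\eta < 1$) can only appear if p(r) assigns some probability to r = 0; otherwise every actor would satisfy $R \ge A$.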

To get the efficiency distribution of the IdA model from the discrete p(R, A) distribution, we start with the conditional discrete probability distribution of R given an activity A, which is computed as the A-fold discrete convolution of p(r) with itself:

$$\begin{aligned} p(R|A) = p(r_1)*p(r_2)*\cdots *p(r_A) = p(r)*p(r)*\cdots *p(r) = p^{*A}(r) \end{aligned}$$

(9)

Then, the joint probability distribution can be obtained as:

$$\begin{aligned} p(R,A) = p(R|A)p(A) = p^{*A}(r) p(A) \end{aligned}$$

(10)

The efficiency PDF is obtained by plugging (10) in (18) and (19). The p(r) and p(A) distributions used in this methodology are the same as those used in the Monte-Carlo simulations.
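Eqs. (9) and (10) translate into a short computation when p(r) is stored as a list indexed by r. The following is a minimal sketch with toy inputs; the function names are introduced here for illustration, and the quadratic-time convolution is chosen for readability rather than speed.

```python
def convolve(p, q):
    """Discrete convolution of two PMFs given as lists indexed by value 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def response_given_activity(p_r, A):
    """p(R | A): the A-fold convolution of p(r) with itself, as in Eq. (9)."""
    dist = [1.0]  # distribution of an empty sum: R = 0 with probability 1
    for _ in range(A):
        dist = convolve(dist, p_r)
    return dist


def joint_RA(p_r, p_A):
    """p(R, A) = p(R | A) p(A), as in Eq. (10); p_A maps A to its probability."""
    return {
        (R, A): pR * pA
        for A, pA in p_A.items()
        for R, pR in enumerate(response_given_activity(p_r, A))
    }


# Toy inputs (illustrative): p(r) on r = 0, 1, 2 and p(A) on A = 1, 2.
p_r = [0.5, 0.3, 0.2]
p_A = {1: 0.7, 2: 0.3}
p_R_given_2 = response_given_activity(p_r, 2)
joint = joint_RA(p_r, p_A)
```

Because p(r) is bounded here, the convolution is exact; with truncated versions of heavy-tailed distributions the result is only as accurate as the chosen cut-off allows.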

To carry out the previous computations with infinite precision we would need an infinite number of values for the p(r), p(A) and p(R, A) distributions. To be able to perform the numerical computations, we have used distributions that are bounded at a certain value and we have verified that further increasing the number of values employed does not affect the results. The cut-off values used for the three systems considered are shown in Table 1.

Table 1 Cut-off values used to perform the numerical computations for the IdA model.

An analytical expression for the efficiency distribution of the IdA model can be derived when p(r) is modeled as a power law ($p(r)\propto r^{-\gamma _r}$). For this approximation, the activity distribution p(A) has been modeled as a power law with exponent $\gamma _{A}$, a usual approach in the literature^{3, 24}. The corresponding fits are shown in the SI and the resulting expression for the PDF of efficiency is:

(11)

where $E_n(\cdot )$ is the generalized exponential integral, the lower incomplete gamma function and C the following normalization constant:

$$\begin{aligned} C = \frac{(\gamma _r-1)(\gamma _A-1)}{1+(1-\gamma _A)\Gamma (1-\gamma _A,1)} \end{aligned}$$

(12)

Distinguishable actors model

In the DiA model the actors are distinguishable, meaning that the system is sensitive to the individual who makes the action and reacts in a different manner depending on her particular features.

This idea can be formalized by considering that the probability distribution of response to single actions depends on the features of the individual that performs the action, summarized in a vector $\vec {s}$. The distribution of aggregate response R of the system is computed as the A-fold convolution of the $p(r|\vec {s})$ distribution with itself:

$$\begin{aligned} p(R|A,\vec {s}) = p^{*A}(r|\vec {s}) \end{aligned}$$

(13)

If $\{s_1,s_2,\dots ,s_N\}$ are the components of the feature vector (assume the features are independent discrete variables), the discrete joint probability distribution p(R, A) is obtained as follows:

$$\begin{aligned} p(R,A) = \sum _{s_1} \cdots \sum _{s_N} p^{*A}(r|\vec {s}) p(A) p(\vec {s}) \end{aligned}$$

(14)

Finally, p(R, A) can be used to compute the efficiency distribution with (18) and (19).

A key point is to find the conditional probability distribution $p(r|\vec {s})$ that characterizes the relationship between the features $\vec {s}$ of the individual and the response r of the system to her actions. Unfortunately, this task is not trivial in most cases. In the case of the citation network the literature shows that there are many and varied factors that determine the citation counts of publications²⁷, from the quality of the manuscript, to the field of research, the cited references or the reputation of the authors and their institutions. With respect to Wikipedia, some factors that could determine the response to a user could be the topics she is more active on, the age of her user account or her main role (some users may be focused on editing articles, others on moderating discussion pages, etc.).

Among the systems under study we have focused on Twitter, where we have chosen the number of followers F of a user as a proxy of her ability to trigger a response, since the follower layer is the substrate through which the retweets are spread^{28, 29}.

In order to establish the relationship between an individual’s features and the response of the system, we have relied on the Independent Cascade (IC) diffusion model. We have formalized the IC model by means of the binomial distribution and a set of assumptions based on empirical evidence (see “Methods”), obtaining the following expression for the response distribution to single actions conditioned on the number of followers (F) of the individual:

$$\begin{aligned} p(r|\vec {s}) = p(r|F) = B(r;F,p_{inf}) \end{aligned}$$

(15)

where B(x; n, p) is a binomial distribution. The discrete joint probability distribution for A and R is given by:

$$\begin{aligned} p(R,A) = p(A) \sum _{F=0}^\infty B(R;AF,p_{inf}) p(F) \end{aligned}$$

(16)

The PDF of efficiency is obtained by plugging (16) in (18) and (19).
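Eq. (16) can be evaluated directly once p(A) and p(F) are available on a bounded support. The sketch below uses invented toy distributions and an arbitrary value of `p_inf`; the function names are hypothetical. It relies on the fact that the sum of A independent $B(F, p_{inf})$ variables is distributed as $B(AF, p_{inf})$.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability B(k; n, p); zero outside 0 <= k <= n."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def joint_dia(R, A, p_A, p_F, p_inf):
    """Eq. (16): p(R, A) = p(A) * sum_F B(R; A*F, p_inf) p(F).

    p_A and p_F map integers to probabilities on a bounded support.
    """
    return p_A.get(A, 0.0) * sum(
        binom_pmf(R, A * F, p_inf) * pF for F, pF in p_F.items()
    )


# Toy inputs (illustrative values, not fitted to any dataset).
p_A = {1: 0.6, 2: 0.3, 3: 0.1}
p_F = {0: 0.2, 5: 0.5, 50: 0.3}
p_inf = 0.02

# Summing over the full support should recover a total probability of 1.
total = sum(
    joint_dia(R, A, p_A, p_F, p_inf)
    for A in p_A
    for R in range(A * max(p_F) + 1)
)
```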

Notice that F is the only component of the feature vector $\vec {s}$ of the individual. The infection probability parameter $p_{inf}$ has been considered constant and equal for every individual and has been determined by Maximum Likelihood Estimation (MLE) of the p(r) distribution. The discrete computation of the DiA model also requires a fit for the p(F) distribution, which was performed with the hybrid methodology detailed in the SI. The p(A) was fit to a discrete truncated power law.

A Monte-Carlo simulation of the DiA model can be performed as follows: Generate a set of individuals with a random number of followers $F \sim p(F)$ and a random activity $A \sim p(A)$. Then, for each action j ($j=1,2,\dots ,A$) performed by an individual, the partial response of the system $r_j$ is computed with (15) and the aggregate response with $R=\sum _{j=1}^A r_j$.
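The steps above can be sketched as follows. The distributions, the value of `p_inf` and the function name `sample_dia` are illustrative assumptions; in the paper p(F), p(A) and $p_{inf}$ are fitted to the Twitter data.

```python
import random


def sample_dia(n_actors, p_A, p_F, p_inf, seed=None):
    """Monte-Carlo sketch of the DiA model for Twitter.

    p_A and p_F map integer values to (possibly unnormalized) weights.
    Returns a list of (R, A, F) tuples, one per simulated user.
    """
    rng = random.Random(seed)
    a_vals, a_w = zip(*p_A.items())
    f_vals, f_w = zip(*p_F.items())
    users = []
    for _ in range(n_actors):
        A = rng.choices(a_vals, weights=a_w)[0]
        F = rng.choices(f_vals, weights=f_w)[0]
        R = 0
        for _action in range(A):
            # Partial response r_j ~ B(F, p_inf): each follower reacts
            # independently with probability p_inf, as in Eq. (15).
            R += sum(rng.random() < p_inf for _follower in range(F))
        users.append((R, A, F))
    return users


# Toy inputs (illustrative only).
p_A = {a: a ** -1.5 for a in range(1, 21)}
p_F = {10: 0.5, 100: 0.4, 1000: 0.1}

users = sample_dia(500, p_A, p_F, p_inf=0.01, seed=2)
mean_eta = {
    F: sum(R / A for R, A, f in users if f == F) / sum(1 for u in users if u[2] == F)
    for F in p_F
}
```

In this sketch the mean efficiency of a group of users grows with their number of followers, which is the sense in which the actors are distinguishable.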

For this model, we have found that an analytical derivation of the PDF of efficiency is too cumbersome to be tackled.

To conclude this section, we summarize the main features of the three developed models in Fig. 2. The models can be classified taking into account two properties: the distinguishability of the actors and the dependence of R with respect to A. Concerning the distinguishability of the actors, we have on one side the InV and IdA models, where the actors are considered identical, and on the other side the DiA model, where the particular features of the actors are taken into account. Regarding the $A$-$R$ dependence, we have on one side the InV model, in which R and A are independent variables, and on the other side, the IdA and DiA models, where R depends on A because the aggregate response R is the sum of the partial responses r to each individual action.

Application of the models to empirical data

The models presented in the previous section have been tested in three different systems: the scientific citations network, Twitter and Wikipedia. See the Supplementary Information (SI) for a detailed description of the datasets. In this section we analyze the models’ performance in each of them.