Introduction

Since their first introduction, Online Social Networks (OSN) have been deeply investigated for possible implications of the online public debate on political processes1. In the last decade, the centrality of OSN for political communications and debates has steady increased: OSN represent one of the most used tool for citizens to get an opinion2. It is not surprising, then, that political parties use them extensively to carry out a sort of never-ending propaganda.

Although in the literature there are different opinions on the impact that a particular grouping of users in OSN can have on their offline behavior3,4,5, it is undeniable that the online social environment is strongly polarized. The origin of such a polarization has been deeply discussed in the sociological literature6,7 and seems to be extremely dependent on country’s party systems8. In particular, the concepts of selective exposure, confirmation bias, echo chambers and filter bubbles have had a great relevance in the literature.

Selective exposure leads people to prefer information that confirms their preexisting beliefs9,10, while confirmation bias makes information consistent with one’s preexisting beliefs more persuasive11.

Such phenomena imply the formation of groups of users, characterised by following the same information in terms, e.g., of news outlets and personal opinions. These groups are thus closed in so called echo chambers: “a bounded, enclosed media space that has the potential to both magnify the messages delivered within it and insulate them from rebuttal”11,12,13. Echo chambers, by being impervious to information coming from outside that may contradict the pre-existing views of the chamber members, are believed to strongly contribute to the polarization of the online debate14.

Polarization may be also fomented by filter bubbles. This paradigm was first introduced by the activist Eli Pariser15: personalised results provided by search engines and shown in social media feeds can make users be trapped in a bubble of information they like and away from data and viewpoints considered less valuable, but that could challenge their beliefs. Although the user may not be affected in real life by the virtual bubble, due to the various communication channels he or she can take advantage of (see Ref.16), the customization of algorithms may contribute to the formation of a virtual world apart.

Discourse and discursive communities

Whether circulating within an echo chamber or suggested by recommendation algorithms, the type of information users come across online is fundamental to reinforcing or not the division into ‘closed’ groups. Nevertheless, also the study of the interactions between users is of absolute interest to detect polarization phenomena. The term discourse community was coined in 1982 and it indicates ‘groups that have goals or purposes, and use communication to achieve these goals’17. A discourse community is itself immaterial, and this tends to project it onto the forum on which it operates18. Thus, with the advent of OSN, discourse communities were projected onto the platforms themselves19: ‘A discourse community can be viewed as a social network, built from participants who share some set of communicative purposes’. According to Berkenkotter20, ‘just as the digital world is constantly evolving, discourse communities continually define and redefine themselves through communications among members’.

In the discourse community definition, we implicitly know the identities of the individuals forming the community. Actually, in the case of Twitter, it is just partially true, since we have trustworthy information only about a small minority of accounts. For this reason, we prefer to use the term discursive communities, as it was introduced in Ref.21 to identify group of users that are connected by non-trivial pattern of discourse, but for which we have limited information about the identity of the group itself. Nevertheless, since we can infer the discourse community of the discursive community by looking at a set of non-trivial data characterising the group, as the most frequent keywords used therein, the difference is more formal than substantial. Therefore, in the following we will use the two terms interchangeably.

To detect discursive communities in OSN, the first contributions applied mixed approaches to the political debate on Twitter22,23,24. The work considered political debate on Twitter about the US presidential election campaign, i.e. a ‘perfectly polarized’ one in which two opposite fronts face each other. The authors manually annotated the most frequent keywords characterizing Republicans and Democratics’ narratives and use them to infer the political orientation of accounts using them. The orientation of accounts not using hashtags was later inferred using a label propagation algorithm25. Remarkable, a clear partition in two distinct groups of users, supporters of the two political parties, was observed in the retweet network only (the network of users sharing content created by others). Finally, using a label propagation algorithm on the retweet network, the authors were able to successfully assign to all accounts the proper political orientation, that can be translated in the present context to the correct discursive community.

Every country has a different way in which opinions are polarised. This is due to the various party systems and electoral laws and, in principle, there could be more than just two fronts8. A methodology for detecting discourse communities less susceptible to human error should therefore rise from the data directly, rather than being based on a priori manual annotation. The approach firstly proposed in Ref.26 meets the desired property: the idea is to infer the various discursive communities starting from accounts whose identity is certified by the social network itself. In Twitter, these are the so called verified accounts. This class of accounts tend to produce new content rather than retweet the one created by others26. Since we trust information regarding their identities, the issue is the identification of the discursive communities anchored to them, something that can be done using their interactions with ‘standard’ users (based on the results by Conover et al.22,23,24, in terms of retweets).

Let the reader consider a pair of verified Twitter users. If they share a large number of retweeters, it is reasonable to think that they attract ‘similar’ users. In this sense, that pair of accounts are perceived to belong to the same discursive community, sharing similar views and ideas. Nevertheless, it is hard to state a priori how many common retweeters two verified users should have in order to be considered as ‘similar’; in this sense, a maximum entropy null-model is used as benchmark27. (More technical details on this construction can be found in the “Discursive communities” section.) The labels for verified users are then propagated, following the same approach as in Refs.22,23,24. In the present paper, we are going to follow this approach, that has great performances on manually annotated datasets28.

The recent literature has extensively analysed online debate and discourse communities, focusing, from time to time, on coordinated activities in discursive communities26,29,30, on the semantic network associated to the various discursive communities21,31,32, on their exposure to disinformation campaigns33,34, and on their dynamical evolution35,36,37,38,39.

In the present paper, we tackle the analysis of the network structure of discursive communities: we collect and study 8 thematic Twitter datasets, on topics ranging from sports, to COVID-19, to political elections and immigration policies. Our main result is that almost all the discourse communities therein features a bow-tie structure.

Bow-ties

Bow-tie structures were initially introduced by Broder et al. in order to study the structure of World Wide Web (WWW)40. The authors represented WWW as a directed network in which webpages are the nodes and the hyperlinks connecting them are the edges. Broder et al. noticed that the network displays a huge Weakly Connected Component (WCC), i.e., the maximal subgraph in which all nodes can be reached by any other one in the same subgraph, disregarding the direction of the link. This WCC includes more than 75% of all nodes.

WCC breaks into three main pieces: a Strongly Connected Component (SCC), in which each node can be reached by any other one in the same block, following the direction of the links; a group of nodes that can reach SCC, without being reached by it (called IN); a group of nodes that can be reached by SCC, but that cannot reach it (the OUT block). The observation is that SCC is the most populated sector, followed by the IN and the OUT sectors. Most of the websites can be found in the SCC, linking between each other; the IN sector was instead mostly composed by search engines, while the OUT one includes authorities, as Wikipedia.

Yang et al.41 refined the partition of the structure in40, introducing INTENDRILS, OUTTENDRILS, TUBES and OTHERS. The entire situation is pictorially represented in Fig. 1.

Figure 1
figure 1

The seven sectors of Yang’s bow-tie structure.

Remarkably, the bow-tie structure was detected also in control network of transnational corporations, having deep implications on financial stability42.

Results in a nutshell

In the case of our 8 thematic datasets, we find that a bow-tie structure is present in those discourse communities debating (1) about politics, like in the case, e.g., of election campaigns, and (2) about society, e.g., on the proper response to the pandemic or the appropriate management of migration fluxes. Instead, bow-ties are hardly present when the debate is about less socially relevant topics as sports (this confirms what observed in Ref.8).

There are two relevant points in observing the presence of bow-tie structures in discursive communities: how big the bow-tie is respect to the entire discursive community (the greatest the accounts in the non-OTHERS blocks, the more informative the bow-tie structure is) and how random the presence of this structure is (i.e., its statistical significance). Regarding the first point, when the bow-tie is informative, even in the worst case, it represents more than 80% of all nodes in the discursive community. Regarding the second point, in order to be sure that the observed bow-ties are not due to a random organization of links only, we compare the observed quantities with a maximum entropy null-model for directed network, conserving the in- and out-degree sequences43. The results show that the dimension of most of the bow-tie sectors are statistically significant, i.e., they carry a signal that cannot be due to the degree sequence only. In this sense, the presence of a bow-tie structure is an extremely non-trivial feature of the system.

We can add more detail to the analysis of this structure. When the bow-tie is informative, we observe two cases: the OUT-dominant and the INTEND-dominant ones, depending on which sector is the largest (respectively, OUT and INTENDRILS). The OUT sector has access to all information produced in the discursive community and, in particular, to the one produced by the most active block, SCC. Instead, in the INTEND-dominant bow-ties, the most crowded sector is the one of INTENDRILS, i.e., the retweeters of IN that are not retweeted by anyone else and that cannot access to all content created by SCC.

In principle, it should be desirable to have an OUT-dominant bow-tie: when the OUT sector is the most populated, there are many accounts that are exposed to information from all other sectors. This should give the accounts a multi-faceted, pluralistic knowledge. However, for the investigated datasets, we carry out an analysis on the production and quality of content in the various sectors of the bow-ties, and the outcome returns a different picture. In fact, regarding content production, SCC is the source of the greatest flux of content. When the discursive community is affected by m/disinformation, the incidence of links from non-reliable sources shared by SCC is much greater than by any other sector and is particularly considerable in the flux between SCC and OUT. In those cases when OUT-sectors dominate the bow-tie, we observe an infodemic (According to WHO, “infodemics are an excessive amount of information about a problem, which makes it difficult to identify a solution”. Coronavirus disease 2019 (COVID-19) Situation Report—45: https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200305-sitrep-45-covid-19.pdf?sfvrsn=ed2ba78b_4), since OUT, i.e. the widest block, is directly hit by the huge amount of messages of questionable quality produced by SCC.

Summarising, our contribution is twofold:

  • almost all the discursive communities in the 8 investigated datasets of Twitter debates, on different topics in different countries, display a bow-tie structure which is statistically significant;

  • when the bow-tie is affected by m/disinformation and it is OUT-dominant, the majority of users (i.e. those in the OUT block) is exposed to the flux of m/disinformation. In this sense, the bow-tie structure fuels the phenomenon of infodemic.

We would like to remark that the results in this manuscript do not represent the only contribution that connects the diffusion of m/disinformation to the network structure (see, for instance, work in44,45,46, just to consider some of the most recent contributions). However, to the best of our knowledge, for the first time the bow-tie structure emerges in online discursive communities. Moreover, its presence and its peculiarity permit do have a proper description of the phenomenon of infodemic.

Results

Datasets

In order to make our analysis as general as possible, we consider several Twitter datasets across different countries and about different topics. The data collected using the Twitter Streaming API are publicly available for further research and reproducibility and can be found at the following URL: https://toffee.imtlucca.it/datasets. In detail:

  • COVID-19 datasets: we explore Twitter posts containing keywords related to the COVID-19 pandemic in different languages and therefore diffused in different countries (in particular, the keywords for tweets collection were “coronavirus”, “ncov”, “covid”, “SARS-CoV2”, “#coronavirus”, “#coronaviruses”, “#WuhanCoronavirus”, “#CoronavirusOutbreak”, “#coronaviruschina”, “#coronaviruswuhan”, “#ChinaCoronaVirus”, “#nCoV”, “#ChinaWuHan”, “#nCoV2020”, “#nCov2019”, “#covid2019”, “#covid-19”, “#SARS_CoV_2”, “#SARSCoV2”, “#COVID19”. The subset of Italian messages has been matter of investigation in Ref.33 too). In particular, we consider the Italian, German and French debates about the pandemic, in the period between February and April 2020. The Italian dataset consists of 4,470,648 tweets published between February 17 and April 23. The German dataset contains 1,552,106 tweets posted between February 10 and April 23, the French one has 3,052,708 posts published between March 23 and April 7. The different time frames for data collection have been chosen according to the intensity of the Twitter traffic.

  • Dutch elections dataset: we collect Twitter posts about the national elections in the Netherlands in 2021. The keywords used for downloading data were “tweedekamer”, “verkiezingen”, “kabinet”, “coalitie”, “stem”, “stembus”, “verkiezingen2021” (respectively, “House of representatives”, “reconnaissance”, “cabinet”, “coalition”, “vote”, “ballot box”, “explorations”) and only messages in Dutch were selected. The dataset contains 1,002,499 tweets posted between February 2 and March 31, 2021.

  • Italian debate on migrants: we select Twitter posts shared in Italy with keywords regarding the discussion about the migration flows from Northern Africa to the Italian coasts. The dataset consists in 1,081,780 posts, published between January 23, 2019 and February 22, 2019. The dataset is described in more details in Ref.29.

  • Italian debate on the Astrazeneca vaccine: we examine 583,236 Twitter posts published in Italian, regarding the discussion about the safety of the Astrazeneca vaccine against COVID-19: the keywords used for the download were “astrazeneca”, “aifa”, “ema”, “trombosi” (respectively, “astrazeneca”, ”Italian Medicines Agency”, “European Medicines Agency” and “thrombosis”). The dataset contains posts shared between March 15, 2021 and May 15, 2021.

  • Italian and Turkish EURO2020: we analyze 144,725 Italian tweets and 430,374 Turkish ones about the European Football Championship EURO2020; the keyword used for the download was simply “#euro2020”. The tweets were published between, respectively, June 11–13 and June 11–23, 2021.

So as not to burden the presentation, in the following we will present the results about the Italian COVID-19, Italian EURO2020 and Turkish EURO2020 datasets. We will show the results related to the other datasets wherever there will be something substantially different, compared with the Italian dataset. However, all graphics and results about the other datasets can be found in the “Supplementary Material”.

Discursive communities

Our analysis focuses on the structure of networks of retweets, for each dataset. Retweeting a post is one of the possible ways in which people can interact on Twitter and it consists in sharing the content of a tweet written by another user. It usually means endorsing the post content as it has also the effect of raising the visibility of the original post. It was also shown that, among all possible interactions, retweets are the best performing to infer the political orientation of the various accounts22,23,24.

We start by distinguishing between verified and non-verified accounts. The former ones denote Twitter users whose identity has been verified by the social platform. This procedure is usually adopted to certify the accounts of renowned people and organizations and figures of public interest in general, as politicians, journalists, political parties, newspapers and TV-channels. We represent the interactions between verified and non-verified users as a bipartite network. In a bipartite network, nodes belong to two different sets, called layers, and an edge can exist only between vertices placed on different layers; we place the verified accounts on one layer of a bipartite network and the non-verified ones on the other one, again considering links as retweets between them. (In the present construction, we disregard the information about the direction of the retweet, since we are interested in the interaction between the two class of users. Nevertheless, as mentioned above, verified users tend more to create new contents (i.e tweet) than to share it with her followers (retweet)). The main idea is to anchor the definition of discursive communities on verified users since they usually introduce new content and posts: as observed in many other studies21,26,29,32,33,34,47, verified users are, on average, much more retweeted than common users. Such a procedure obtains great performances, since it can be observed that the various discursive communities are coherent in terms of verified users belonging to the same political front; in a further analysis we are comparing this procedure with annotated datasets, better quantifying our performances28.

Following the methodology introduced in Becatti et al.26, we count the common neighbors of each pair of verified users or, in simpler words, the number of non-verified users that have interacted (by retweeting or being retweeted) with the same pair of verified ones. The aim is projecting the bipartite network into the layer of the verified accounts, establishing an edge between two of them if the number of their common neighbors is significantly higher than what expected by a proper null-model. When this happens, we can assert that the two verified users refer to the same audience and, therefore, they probably share similar content and opinions. The statistical significance of the number of common neighbors can be established only comparing it with the predictions of an accurate benchmark, which, in this case, is represented by the Bipartite Configuration Model (BiCM48), an entropy-based model suited for bipartite networks. A complete description of the model and the projecting procedure can be found in “Entropy-based null-models for network analysis and their applications”.

The result of the above procedure is a monopartite network of verified users. We further obtain a partition in communities implementing the Louvain algorithm49 for the optimization of the modularity, with a slight modification. In fact, the standard definition of the modularity50 implements the Chung-Lu null-model51, which can be considered as a sparse matrix approximation of the entropy-based null-model defined in52 and it is known to return wrong results in the presence of strong hubs27. We thus replaced the Chung-Lu null-model in the modularity with the unipartite configuration model (UCM) defined in Ref.52. Furthermore, we correct for the node ordering bias that affects Louvain algorithm, independently on the objective function chosen. In fact, we perform multiple runs, each time reshuffling the order of the nodes: we finally select the partition displaying the greatest value of the (UCM-modified) modularity. More details can be found in “Entropy-based null-models for network analysis and their applications”.

For all the datasets, looking at the members of each discursive community, we can a posteriori associate the latter to a political wing, using the available information for verified users. We thus obtain clusters of users which represent the main wings of the political scenario of each of the examined countries. In addition, in almost all the datasets, we identify also a Media cluster, with official accounts of newspapers, TV-channels, radio and other media.

In the “Discursive communities for the Italian COVID-19 dataset”, the interested reader can find a complete description of all the discursive communities for the Italian COVID-19 dataset. For the other datasets, a brief description of their discursive communities is in the “Supplementary Material”.

Political orientation of non-verified users

The next step in our procedure consists in extending the discursive communities to non-verified accounts. More in details, following the approach in Ref.29, we use the membership of verified users as (fixed) seeds for the label propagation algorithm proposed by Raghavan et al.25 on the retweet network. This network is a monopartite and directed one in which nodes represent users and links start from the retweeted users and are directed towards the one who retweets. Let us remind that, in case the algorithm cannot find a dominant label for a specific vertex (i.e., in case of a tie), it randomly removes some of the edges attached to that vertex and repeats the procedure: for this reason, we run the label propagation 500 times and assign to each node the most frequent label (actually, the noise in the assignment of the labels is extremely limited).

Figure 2 shows the percentages of nodes placed in the various discursive communities for the Italian COVID-19 dataset (a detailed description of the various communities can be found in the caption of the figure). Considering also the other datasets, in almost all the cases, the label propagation procedure could assign a label to approximately 90% of the nodes. As we could expect, in the COVID-19 datasets, the Media community is always the most numerous one: updates on the spread of the pandemic, written by the official accounts of various media, received a great amount of retweets.

Figure 2
figure 2

Percentages of nodes in each discursive community, Italian COVID-19 dataset. Due to the presence of politicians and political parties from a specific political area, the various discursive communities are called following their political alignment. “PD” stays for the Italian Democratic Party (Partito Democratico); Italia Viva (“IV”) is the political party of the former prime Minister and former PD secretary Matteo Renzi, while M5S is the “Movimento 5 Stelle”, a political movement born on the web and being the most represented party in the Italian parliament at the time of the data collection. “FI” stays for Forza Italia, the political party of the former Prime Minister Silvio Berlusconi, while the “DX” (Destra) community includes right wing parties as Lega and Fratelli d’Italia. The most crowded discursive community is the one of Media in which there are most of the online news outcasts and newspapers. The accounts for which it was not possible to assign a discursive community are in grey.

As highlighted in other works21,26,29,30,32,33,34, the presence of well-defined discursive communities is the signal that users on Online Social Networks (OSNs) are strongly polarized, i.e., they tend to tend to split into groups, which one with same opinions and political orientation.

The bow-tie structure

The original concept of bow-tie by Broder et al.40 sees WWW divided into 3 main sectors: a Strongly Connected Component (SCC), in which each node can be reached by any other one in the same block, following the direction of the links; a group of nodes that can reach SCC, without being reached by it (called IN); a group of nodes that can be reached by SCC, but that cannot reach it (the OUT block).

The description by Broder et al. was subsequently refined by Yang et al.41, who split the network in seven distinct parts:

  • the greatest Strongly Connected Component (SCC);

  • the IN block;

  • the OUT block;

  • the TUBES sector, including nodes reachable from IN and accessing OUT, but not being part of SCC;

  • the INTENDRILS group, collecting all those nodes pointed by IN that cannot reach the OUT block;

  • the OUTTENDRILS sector, containing all those nodes pointing to OUT that cannot reach nodes in IN;

  • the OTHERS group, including all those nodes that cannot be placed in one of the previous six sectors.

In Fig. 1 there is a schematic representation of the bow-tie structure defined in Ref.41. The seven groups of nodes are mutually disjointed.

We remark that every directed network can be divided in blocks using the bow-tie decomposition. Nevertheless, as a rule of thumb, the bow-tie representation is informative about the network structure if the number of nodes in blocks other than OTHERS is greater or of the same order of those in OTHERS: the greatest the impact of the non-OTHERS blocks, the more informative the bow-tie structure is.

The bow-tie structure of the discursive communities

In the present manuscript, we investigate the presence of a bow-tie structure in the discursive communities of the retweet network, i.e., in the network composed by Twitter accounts (the nodes) and retweets (the links connecting the original author to the retweeter).

Results show that, when considering political online debates, a bow-tie structure is informative in almost every discursive community of our datasets, while for non-political debates (as the case of Euro2020), the bow-tie structure is less informative. Euro2020 itself records the extreme case in which more than one half of the nodes are in the OTHERS sector. We state that this bow-tie structure is uninformative—see, for example, the case of the Turkish debate during Euro2020 in Fig. 8. We remark that the presence of informative bow-ties in many of the discorsive communities here investigated is not a trivial result. Indeed, there are no evident reasons for expecting such distribution of the nodes a priori.

When a bow-tie structure is informative, we observe two recurrent situations in the investigated datasets and, according to them, we classify the bow-tie into two different categories:

  • When the OTHERS block is smaller than SCC, we will refer to strong bow-tie structures;

  • When the OTHERS block is greater than SCC, we will refer to weak bow-tie structures.

Furthermore, when the bow-tie is informative, may it be weak or strong, we can categorize it in two different ways, that we called respectively OUT-dominant and INTEND-dominant. In OUT-dominant bow-ties, most of the nodes of the bow-tie are placed in the OUT sector. As a rule of thumb, OUT-dominant bow-ties are more frequent when the bow-tie is strong, but we can find some counter-examples. The INTEND-dominant bow-tie is a bow-tie structure in which instead the most part of the nodes is located in the INTENDRILS sector, i.e., when most part of the users retweets accounts from the IN zone and has little to no interaction with the users in the other sectors. INTEND-dominant bow-ties are in general more frequent in weak bow-ties.

We highlight that it is not so strange that the most crowded blocks in the bow-ties are OUT and INTENDRILS: it was already observed in Ref.53 that the greatest number of users tend to mostly retweet content created by others and limit their production of new messages. The difference between OUT-dominant and INTEND-dominant bow-ties is the access to information: OUT-dominant bow-ties are those in which the majority of users can access almost all messages exchanged over the discursive community, while in the INTEND-dominant ones the majority of users limits their retweets to the content produced by accounts in the IN block. Otherwise stated, the main difference between INTEND- and OUT-dominant bow-tie structures is that the former displays a more ‘hierarchical’ structure, i.e., few accounts (those in the IN sector) introduce new content and many others just share it (the INTENDRILS sector). Instead, in OUT-dominant bow-ties, the greatest part of the users (i.e., the OUT block) not only shares posts by accounts in the IN block, but also it retweets content by users in SCC, OUTTENDRILS and TUBES. We argue that this behaviour, while more ‘democratic’, is, at the same time, more risky.

In fact, we will see in “Verified users’ distribution” that users with high visibility and which introduce new content on Twitter can be found mostly in the IN sector: typically, they are verified accounts. As observed in other studies, see, e.g. Ref.33, verified users tend to limit the spreading of low-quality content. We may argue, then, that users interacting mostly with verified users are safer from m/disinformation campaigns. In the following, we will see that the reputability of information shared confirms our hypothesis and we will come back on the matter.

Figure 3 displays the bow-tie structure of each discursive community for the Italian COVID-19 dataset (analogous plots for the other datasets can be found in the “Supplementary Material”). A single node represents one bow-tie sector and its dimension is proportional to the number of accounts in it. First, according to the definitions given above, the bow-tie structure is informative in all the discursive communities. In the cases of DX and IV, the bow-tie is particularly informative: its blocks include respectively 96.5% and 98.3% of the entire discursive community. Second, different discursive communities display bow-ties with different strengths. For instance, DX and IV discursive communities display strong bow-ties, while, M5S, Media, PD and FI have weak ones, since their SCCs are relatively small (and smaller than OTHERS).

Figure 3
figure 3

The bow-tie structure of the discursive communities in the Italian COVID-19 dataset. The dimension of the sectors is proportional to the number of nodes: DX and IV discursive communities have strong bow-ties (the OTHERS block is smaller than SCC), while the others are weak (the OTHERS block is greater than SCC, still being smaller then bow-tie WCC). The DX, IV, FI and MEDIA groups display a OUT-dominant bow-tie structure, with the most part of the nodes located in the OUT sector. M5S and PD communities have a INTEND-dominant bow-tie structure, the INTENDRILS sector being the dominant one. The colour of the blocks quantifies the distance between the observed dimensions and those predicted by the Direct Configuration Model (DCM). The observed dimension for the OTHERS sector is significantly less numerous (considering a significance level at \(\alpha =0.01\)) for all the communities, but PD. Remarkably, for INTEND-dominant bow-ties, also other sectors, as SCC and INTENDRILS, are usually bigger than what we expect from the model.

Third, the graph shows that the DX, IV, MEDIA and FI communities display OUT-dominant bow-ties, in which the OUT sector is the biggest one; considering all the investigated datasets, OUT-dominant bow-ties represent the most frequent configuration, being 21 out of 31 communities. Instead, 6 out of 31 discursive communities are INTEND-dominant bow-ties (as PD and M5S in Fig. 3).

We remark that, in all our datasets, all the right wing discursive communities display bow-ties with an OUT-dominant structure; in most of the cases, these bow-ties are also strong. The colours of the nodes in Fig. 3 are going to be explained in the following section.

Statistical significance of the bow-tie structure

It may be argued that the bow-tie structures featured by the discursive communities in our datasets are just an accident, due to the different role of the various users in the debate. In fact, those accounts that have high out-degrees and low in-degrees are naturally in the IN sector; those that, viceversa, have high in-degrees and low out-degrees are in the OUT sector, and so on. To test whether the presence of bow-ties is merely attributable to the behavioral characteristics of the accounts, we compare the dimensions of the different sectors, as observed in the real network, with those in a randomised system in which the in- and out-degree sequences are fixed. If the partition in the various bow-tie sectors were just a matter of the degree sequence, none of the dimensions of the various blocks should be statistically significant. Otherwise, we should observe a significant mismatch with respect to the expectation of the null-model.

In order to have an unbiased benchmark, we build an entropy-based null-model that preserves the in- and out-degree sequences, being maximally random for all the rest (see Ref.27 for a review on the subject). Summarising, starting from a real network, we consider the set of all possible graph realizations (the graph ensemble) having the same number of nodes as in the real system. Then, we assign to each representative of the ensemble a different probability of realization by maximising the entropy of the ensemble, but constraining the average value of some topological property of the real network (in our case, the in- and out-degree sequences). In this way, even if the single realization of the ensemble does not display the network properties that we would like to preserve, the entire ensemble, on average, does.

In the last years, such procedure has been adopted to analyse financial and economic systems43,48,52,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70, biological networks71,72,73,74 and online social networks21,26,29,30,31,32,33,34,75 and it was shown to be effective to extract the relevant structure from a real network76,77.

Here, we implement the Direct Configuration Model (DCM), firstly introduced in Ref.43 and implemented in the Python module NEMtropy70. More details on the exact derivation of DCM can be found in “Entropy-based null-models for network analysis and their applications”.

Going back to Fig. 3, the colour of the circles indicates the agreement between the actual size of the bow-tie sectors and the size predicted by the DCM: we are interested in detecting both too “big” and too “small” blocks. In particular, the darker the colour of the sectors in Fig. 3, the larger the \(-\log _{10}(\text {p-value})\) (so the lower the p-value) and the greater the disagreement of the real system from the randomization. For each sector, the two-tailed p-value has been calculated looking to a sample of 1000 graphs generated by the DCM.

The p-value tells us about the existence of a disagreement, but not about the direction of the disagreement. For instance, looking at the DX bow-tie in Fig. 3, both the dimensions of OTHERS and SCC have a really small p-value, thus they do not agree with the randomization, but the OTHERS block is smaller than predicted by the DCM, while SCC is larger.

Table 1 reports the exact p-values of the different blocks for the various bow-ties of Fig. 3. The significance of the blocks for each bow-tie can be assessed by using the False Discovery Rate (FDR) correction78, setting the statistical significance level to \(\alpha =0.01\). In the present case the correction is limited, due to the small number of blocks in the bow-tie.

Table 1 p-values related to the various bow-tie sectors in the COVID-19 Italian dataset. In orange strong bow-ties, while in teal weak ones; red dots indicate OUT-dominant bow-ties, blue dots indicate INTEND-dominant ones. If we set the statistical significance to \(\alpha =0.01\), then we then have to correct for multiple hypothesis testing per block. In the present case, we used the False Discovery Rate method (FDR78). In the table, validated p-values are marked by an asterisk ‘\(*\)’. The OTHERS block is statistically significant (in particular it is smaller than in the randomization) for all discursive communities but the PD one. It is remarkable that the dimension of SCC is significant in all strong bow-ties, while the one of OUT is significant only for IV bow-tie.

It is interesting to observe that, in both strong and weak bow-ties, the OTHERS block is statistically significant in all the discursive communities but PD. In particular, the dimension of the OTHERS block is much smaller then predicted by the null-model and the presence of the bow-tie is not due to the degree sequence only.

SCC is statistically significant (and bigger than expected) for all bow-ties but FI and PD. The IN block is often statistically significant and smaller than expected. We may notice that in the strong bow-tie of IV discursive community the dimensions of all sectors are statistically significant, while none are in the PD bow-tie, which is the smallest discursive community. It is worth noting that also the dimension of the discursive community has a role: due to the limited possible variability, smaller bow-ties feature more agreement with the model.

Verified users’ distribution

Usually, verified accounts on Twitter belong to public characters and organizations, such as journalists, politicians, actors, political parties, media, and VIPS in general. Previous studies testify that verified users tend to introduce new content and have high visibility on the platform21,26,29,33,53. Thus, we expect to find them in the IN block. The results in Fig. 4 confirm this intuition: in the case of OUT-dominant bow-ties (leftmost panel), the 33.2% of verified users, on average, are in the IN sector. High percentages of verified users are also in the SCC block (23.5%). In the case of INTEND-dominant bow-ties (central panel), the percentage of verified users in the IN group increases to 42.8%; the second block per percentage of verified users is INTENDRILS (20.1%). In those communities where the bow-tie structure is not informative (right panel, Fig. 4), a high percentage (42.9%) of verified users, on average, is in the OTHERS sector. In a few cases of not informative bow-ties, it happens that verified users are mostly in the OUTTENDRILS sector. In this last case, their messages hardly reach a big audience and are simply retweeted by a group of strong retweeters (OUT sector), not catching the interest of the accounts in the SCC. Let us remark that in the case of non-informative bow-ties the dimension of OUT and SCC blocks is nevertheless limited.

Figure 4
figure 4

Distribution of the percentage of verified users in each sector of the discursive communities with, respectively, OUT-dominant, INTEND-dominant and not informative bow-ties. Each bar-chart displays the average percentage of verified users in a specific sector, calculated respectively for all the OUT-dominant, INTEND-dominant and not informative bow-ties. In the cases of OUT-dominant and INTEND-dominant bow-ties, the highest percentages of verified accounts can be found in the IN group. Moreover, in OUT-dominant bow-ties, we can found a relevant percentage of verified accounts also in the SCC. Naturally, for those communities with no bow-tie structure the verified accounts are mostly placed in the OTHERS sector and, to less extent, in the OUTTENDRILS one.

Figure 5 reports the same bar-chart, about the presence of verified users, for the bow-ties of the COVID-19 Italian dataset. It is possible to observe that in OUT-dominant bow-ties - i.e., DX, IV, FI and MEDIA - verified users are mainly in IN and SCC sectors. Also, in INTEND-dominant bow-ties, the INTENDRILS sector contains quite a number of verified users. Other user characterizations of the bow-tie blocks can be found in the “Social bots”.

Figure 5
figure 5

Percentage of verified accounts in the bow-tie sectors for each discursive community of the COVID-19 dataset. The bar-charts confirm that verified accounts are mainly located in the IN sector and, to a less extent, in the SCC one. Only for the PD group, which has a INTEND-dominant bow-tie structure, verified accounts are mostly placed in the INTENDRILS block.

Conservatives groups

The bar-charts in Fig. 6 show the percentage of nodes, the percentage of edges and the number of edges per node in the Strong Connected Component, for each discursive community of the Italian COVID-19 dataset. Not only DX is the one with the greatest number of nodes and the greatest number of links in SCC, but also the link density of SCC in DX is much greater than that of any other discursive community. Thus, the number of links in SCC of DX is not proportionate to the number of nodes, and it results in a greater average degree per node. We found very similar behaviours also for the right-oriented communities of the other datasets.

Figure 6
figure 6

Percentage of nodes and edges in SCC for the communities in the Italian COVID-19 dataset. In the Italian COVID-19 dataset, the conservative and right-oriented discursive community (DX) has more numerous and denser SCCs, as it is displayed in the highest two graphics. In the lowest graphic, it can be seen that, also considering the number of links per node in SCC, DX results again the first discursive community. These results hold for all the conservative groups in all the datasets under investigation.

In fact, in all our datasets, the discursive communities of conservative groups (i.e., DX in the Italian dataset, AfD in the German one, Conservatories in the Dutch one) are those with the highest percentage of nodes and, especially, of edges within SCC. This peculiar feature signals the presence of a common (self-)organization of accounts in line with conservative ideas on Twitter.

NewsGuard (https://www.newsguardtech.com/it/) is an independent software toolkit that monitors the quality and transparency of several news websites worldwide. Through the tags that NewsGuard has assigned to news sites whose links appear in the retweets of our communities, we are able to quantify the amount of retweets containing untrustworthy URLs.

The recurrent situation is that almost only the conservative discursive communities display retweets with such URLs. For the Italian COVID-19 dataset, the DX group has 26,318 retweets with links to untrustworthy webpages of news sites, many more than in other communities: 1356 retweets for M5S, 78 retweets for IV, 20 retweets for MEDIA, 9 retweets for FI and 0 for the PD group. A very similar situation has been found for the other datasets, see “Supplementary Material”.

Another interesting aspect is that the most part of retweets containing not reliable URLs has origin in the strongly connected component. Figure 7 shows in red the percentage of retweets containing URLs of untrustworthy news pages within and between the sectors of the bow-tie structure for the DX group. The highest percentage can be found in SCC and between SCC and OUT. Again, this is a recurrent situation also for the conservative communities of the other datasets under investigation.

Figure 7
figure 7

Bow-tie structure of the DX group and percentages of retweets containing URLs of untrustworthy webpages. The DX community in the Italian COVID-19 dataset presents the highest number of retweets containing a link to untrustworthy webpages. Most of them origin from SCC: 8.4% of the retweets in SCC and 7.3% of the retweets between SCC and OUT contain not reliable URLs. In the diagram, we also insert the link between IN and OUT (the dashed line), which, considering the definition of each sector, is not forbidden a priori.

The case of EURO2020

Here, we devote a specific section to comment about the case of the European football championship (EURO2020) dataset (we do this for academic reasons, and not because Italy won the Euro2020 championship). This dataset features a less divisive, less debated, and less discussed tweets topics. The topics of all the other datasets either have a strong political nature or are debating with sharp different positions. We then analyze whether the fact that topics are less discussed/devated has anything to do with the presence -or absence- of a bow-tie structure in the EURO2020 dataset.

We identified 5 discursive communities for the Italian dataset and 2 discursive communities for the Turkish one. Of these 7, 4 do not have an informative bow-tie structure (in fact, most part of the nodes are in OTHERS), and the other three have a weak one (OTHERS is smaller than the weakly connected component of the bow-tie, but still greater than the strongly connected one).

Figure 8 reports the bow-tie structures of the two discursive communities in the Turkish dataset. The SPORTS group contains the official accounts of football players and clubs, and of sports newspapers. AK refers to the Justice and Development Party (Turkish: Adalet ve Kalkınma Partisi, AKP), which is a conservative political party in Turkey including President Erdogan and his ministries. While SPORTS does not display any informative bow-tie, AK has a weak one. Following our interpretation, the latter displays a more hierarchical conversation on Twitter, in which the SCC is not numerous. Moreover, the dimensions of the sectors are mostly not statistically significant.

Figure 8
figure 8

The bow-tie structure of the discursive communities for the Turkish EURO2020 dataset. The SPORTS group contains the official accounts of football players and clubs, and sports newspapers, while AK refers to the Justice and Development Party (Turkish: Adalet ve Kalkınma Partisi, AKP), which is a conservative political party in Turkey, including President Erdogan and his ministries. The SPORTS discursive community does not display an informative bow-tie structure, while the AK one has an extremely weak (INTEND-dominant) bow-tie. The dimension of the sectors is proportional to the number of nodes therein and the color quantifies the distance between the observed and the predicted dimension. Looking to the color of the vertices, it is possible to see that the observed dimensions are not statistically significant.

For the Italian case (Fig. 9) the main discursive community is formed by football players, sports newspapers and journalists. There is also a MEDIA community, containing accounts of Italian media, and other three small political communities (DX, IV, M5S). MEDIA, DX and IV does not display an informative bow-tie structure (respectively, 74%, 81.2% and 63.6% of the nodes are in OTHERS), while FOOTBALLERS and M5S show a weak bow-tie (respectively 15.9% and 23.9% of nodes in OTHERS).

Figure 9
figure 9

The bow-tie structure of the discursive communities for the Italian EURO2020 dataset. The dimension of the sectors is proportional to the number of nodes therein and the color quantifies the distance between the observed and the predicted dimension. The main discursive community is formed by football players, sports newspapers and journalists. Then, we identified a MEDIA community, containing accounts of Italian media, and three small political communities (DX, IV, M5S). MEDIA, DX and IV do not display an informative bow-tie structure (respectively 74%, 81.2% and 63.6% of the nodes in OTHERS), while FOOTBALLERS and M5S show a weak bow-tie (respectively 81.1% and 75.7% of nodes in INTENDRILS).

Euro2020 dataset is the only, among ours, in which no discursive communities have a strong bow-tie structure.

Discussion

In the present manuscript, we analysed eight thematic Twitter datasets in different languages, related to various debates in Europe. We identified the discursive communities in the retweet networks and we investigated the presence of bow-tie structures in such communities. In previous works, discursive communities were shown to mirror the political orientation of users21,22,23,24,26,29,30,32,33,34, thus the analysis of their structure is of utmost importance to infer the way opinions create and circulate.

Discursive communities and bow-ties

We found that a bow-tie structure is present in those discursive communities debating about politics, like in the case, e.g., of election campaigns (it is the case of the Dutch elections dataset) or debating about Society, e.g., ‘how to handle a pandemic?’ (it is the case of the Italian, German and French datasets about COVID-19) or ’how to manage migration fluxes?’ (it is the case of the Italian online debate on migrants). Instead, a bow-tie structure is absent when the topics of the discussion are sportive ones, as in the case of Euro2020 Turkish and Italian datasets.

More in details, we state that the bow-tie is informative if the corresponding WCC includes more than one half of the nodes of the entire discursive community. In the present datasets, we found that bow-ties are informative in all the discursive communities debating about politics. In the case of the Euro2020 dataset, bow-ties are not informative, or, if present, they are extremely weak. When the bow-tie is informative, we found essentially 2 cases: (1) the most crowded block is the OUT one; (2) the most crowded block is the INTENDRILS one. The former is typical of the discursive communities of right wing parties in all European political/societal debates of our datasets, while the latter is more common in less active political discursive communities in many political/societal datasets.

Which users in which bow-tie sectors and the exposure to m/disinformation

A closer inspection of the nodes in the various blocks and the quality of the shared content permit to better characterise the users in the bow-tie. The first observation is that the greatest part of the verified users, i.e., those accounts for which the identity of the owner has been certified by Twitter, in the IN sector, in each bow-tie. This finding is not surprising: as already observed in previous studies, verified users create content and are less active in sharing messages written by others21,26,29,33,47. Verified users are mostly politicians and official accounts of political parties, as well as journalists and official accounts of their newscasts and newspapers. In this sense, a discursive community displaying a INTEND-dominant bow-tie structure (where INTRENDILS is the most crowded block) may appear, at first sight, as a less democratic group: the content is created by a few accounts and shared by a group of followers that limit their interactions to sharing the messages coming from the IN block. Instead, in a OUT-dominant bow-tie, the greatest block is OUT and it can access the content created by all the other blocks in the bow-tie (with the only exception of INTENDRILS), so having the possibility to intercept every voice in the discursive community.

Actually, the issue is on the quality of the content created in the various blocks, see Fig. 7. Leveraging our ongoing collaboration with the NewsGuard organization (https://www.newsguardtech.com/it/), we annotated the URLs that appear in tweets in our datasets, based on the reliability and transparency ratings of the news sites to which those URLs belong (such ratings have been assigned by NewsGuard). It turns out that the lowest reliable URLs, in a strong bow-tie, are the ones shared in SCC. The fact that verified accounts are not responsible for the vast majority of m/disinformation sharing was already observed in Ref.33 and, in the present context, it reflects the fact that accounts in IN are minimally responsible for the spreading of low quality/untrustworthy content. Otherwise stated, when the source of information is not identifiable, the average quality of the content is lowered down.

An OUT-dominant bow-tie is, in this sense, more exposed to m/disinformation campaigns, as the majority of the accounts, i.e. those in the OUT block, is exposed to a great flow of content, in which the percentage of m/disinformation is quite high. On the other hand, the INTEND-dominant bow-tie is “safer”, since the greatest part of the accounts therein (i.e. the INTENDRILS nodes) accesses the messages from the IN sector that is less prone to m/disinformation campaigns.

It worth to be remarked that, due to the considerations above, the OUT-dominant bow-ties are at risk of infodemic. Infodemic is a recently introduced neologism, that became particularly popular during the COVID-19 pandemic. According to the WHO, “infodemics are an excessive amount of information about a problem, which makes it difficult to identify a solution. Infodemics can spread misinformation, disinformation and rumors during a health emergency. Infodemics can hamper an effective public health response and create confusion and distrust among people" (Coronavirus disease 2019 (COVID-19) Situation Report—45. https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200305-sitrep-45-covid-19.pdf?sfvrsn=ed2ba78b_4). The effects of the present COVID-19 infodemic, even if debated4,5, may put at risk the countermeasures to the spread of an epidemic and it is worrisome for policy makers (see, for instance, the Joint Communication titled “Tackling COVID-19 disinformation—Getting the facts right” (June 10th, 2020), available at the following link. https://ec.europa.eu/info/sites/default/files/communication-tackling-covid-19-disinformation-getting-facts-right_en.pdf).

Finally, let us consider also the peculiarity of right-wing discursive communities: for all those, the bow-tie is strong (i.e., the dimension of the OTHERS block is smaller than the SCC one) and it is neatly OUT-dominant. The structural exposure of the OUT-dominant bow-tie to infodemic is even more emphasized by the extreme activity of the SCC: for instance, in the COVID-19 Italian dataset the link density in the right-wing bow-tie is at least 3 times greater than any other OUT-dominant strong bow-ties.

Statistical significance of the analysis

Here, we remark an important aspect of our analysis, of uttermost importance. In the analysis of a complex network, it is necessary to consider what is being measured, and what is its baseline. A typical example is the modularity, i.e. one of the most used target function for community detection. The problem resides in stating what is the number of links inside a group of nodes that is enough to form a community. In this case, we build a null-model, i.e., a model that shows part of the properties of the original system, being random for all the rest, to have a proper benchmark for our observations. We then compare the number of edges inside a group of nodes with the one expected by the null-model. Without the null-model, we could not know whether the number of links that bind a group of nodes are due to the degree sequence, or whether they are instead the genuine signal of the presence of a community.

In the present study, we used an entropy-based null-model as a benchmark for our analysis27,79. An entropy-based null-model allows to have a benchmark that is tailored to the system under analysis. It fixes (on average) some topological quantities to the values observed in the real network and leaves all the rest completely random. Being based on the (Shannon) entropy maximisation, it guarantees that it uniformly considers all the possible configurations (it is ‘ergodic’, using Statistical Physics jargon), thus it does not introduce any bias in the analysis.

To strengthen the analysis, we study if the bow-tie structures are due to the degree sequence of the nodes in the various discursive communities. In fact, the size of IN and OUTTENDRILS could simply be due to the presence of many nodes with zero in-degree (an analogous consideration could be done for the OUT and the INTENDRILS blocks, considering, instead the out-degree). Thus, strong, weak and not informative bow-ties could be due to degree sequence only, and do not carry any kind of information on their own.

We thus used the Directed Configuration Model defined in Ref.54 and implemented by the Python module NEMtropy70. Our results show that the dimensions of the blocks in the bow-tie are very often statistically significant: the p-value of the observed dimensions of the various blocks against the null-model expected distribution are extremely small, such that they are not compatible with the degree sequence, or, otherwise stated, the dimension of the various blocks cannot be explained using the degree sequences only.

Limitations

Even if we have obtained strong results (see the null-model validation check on the dimension of the bow-tie sectors), we have nevertheless to remark few aspects of our analysis that can limit its generalization. First, the analysis is related to eight different thematic datasets in different languages, all referring to European debates, some of them of political nature. Indeed, while the total amount of messages analysed is quite impressive, we are aware that, even if the spectrum of the arguments covered is various, our findings may be valid on our datasets only. In the near future, we are going to expand the countries covered by our analyses and expand the list of debated arguments.

Following our jargon, OUT-dominant bow-ties expose the majority of their accounts to the risk of infodemic. Nevertheless, it is not a causal relation: the presence of OUT-dominant bow-ties does not imply the presence of an infodemic or of a disinformation campaign. In fact, if the sources shared by SCC are reputable, we will not observe any infodemic or m/disinformation signal. At the same time, it is true that OUT-dominant bow-ties help the diffusion of m/disinformation, when present, since accounts in OUT are exposed to all contents—reliable and not reliable—created by nearly every block in the discursive community.

Final remarks

Let us conclude with some final remarks. First, the bow-tie structure is present in the discursive communities of retweet networks. Let us recall that we build the retweet network by creating a direct link for every retweet, from the author of the original post to the retweeter. Then, from the retweet network we extract the subgraphs relative to the discursive communities obtained through the procedure described in “Discursive communities”. With this procedure, we are recovering the flow of information inside each discursive community; in this sense, we are disregarding the possible interactions among different discursive communities.

There is another limitation which is unavoidable, due to the nature of Twitter’s data: we have information regarding who retweeted who, but not on the “chain” of retweets, i.e. we cannot distinguish if the retweeter retweeted directly the message of the original author of the post or through one of the retweets given by one of followers of the original author.

Given the structure of the retweet network, it is therefore natural to ask what is the meaning of the bow-tie structure. In particular, what is the sense of the reachability of nodes in the retweet network? (We are thankful to reviewer 2 for suggesting this reflection that permits to give a clearer frame to our results.)

In fact, the retweet network describes an influence flow: users, by retweeting messages, testify that they are influenced by their opinions. In this sense, the opinions expressed in messages written by a user A influence the ones of her/his retweeter B, that, on turn, influence the one of her/his retweeter C and so on, even if C has never directly retweeted any of the messages of A. Otherwise stated, what the bow-tie is capturing is a division of users in different sets: in IN we find the creators of contents, only partially influenced by the content created by others, in OUT the big audience of standard users, influenced by the contents created in IN, SCC, TUBES and OUTTENDRILS, in SCC the active users influencing each other, and so on.

Finally, even if we have not access to the chain of retweets, we still expect the retweet network to be a good proxy of the followership network. Indeed, users either retweet messages from the accounts they follow or they retweet some message when searching for a specific topic. However, we expect the first method to be way more frequent, since in the user’s home the activity (i.e. new messages, likes and retweets) of the followed accounts are present. In this sense, we expect that the bow-tie will be present (and statistically significant) even in the followers network. Such an investigation is going to be part of future research.

Methods

Bow-tie detection

In the following we briefly described the main steps of the detection of the bow-tie structure, following the procedure outlined in Ref.41. The first step is the identification of the greatest Strongly Connected Component (SCC) and then the identification of the nodes in the various sectors, using the bow-tie definition.

Let be \(G_D(V, E)\) a directed graph where V is the set of nodes and E the set of links, and \(G_D^T(V, E)\) its counterpart obtained reversing the direction of the edges. The functions for the identification of the greatest SCC and the depth-first searches (DFS in the following), used to identify nodes reachable from a given node, are implemented in many python modules, such as igraph or networkx; for the present analysis, we used the former python module. The algorithm is pretty straightforward and follows the definitions in “The bow-tie structure”: the pseudo code is presented in Algorithm 1.

Philosophically, the algorithm works as follows. First, consider the greatest strongly connected component and call it SCC; then choose randomly a node \(v\in SCC\). All nodes that can reach v (identified via DFS), but that are not part of SCC represent the IN sector; an analogous line of reasoning takes to the identification of the OUT sector. Regarding the remaining node, the crucial information to be calculated is if they can be reached by nodes in IN and if they can reach nodes in OUT (again, using DFS): based on this final pieces of information, we can identified all remaining sectors.

figure a

Entropy-based null-models for network analysis and their applications

The bipartite configuration model

In order to create the various discursive communities we needed an appropriate null-model as benchmark for identifying those verified users that share the same audience. In this sense, it is necessary to compare the observed quantities with accurate predictions in order to state their significance: actually, the common audience may appear similar just due to the extreme activity of the considered verified users.

We represent the interaction between verified accounts—the ones whose identity is certified by Twitter platform- and unverified ones (i.e. all the others) via a bipartite undirected binary network in which a link connects a verified users to an unverified ones if there is at least a retweet between one and the other, or viceversa. Since the information about the number of different accounts interacting -via tweet or retweet- with a user is encoded, in this representation, in the degree sequence for nodes of both layers, we need a benchmark discounting it. The natural choice is to choose an entropy-based null-model, since it provides, by definition an unbiased framework27: the null-model is maximally random, but for the constraints imposed on the system. The bipartite null-model discounting the degree sequence is the Bipartite Configuration Model (BiCM48). In the present section we will briefly revise the steps of its definition.

Let us consider a bipartite network in which the two layers \(\top\) and \(\bot\) have dimension, respectively, \(N_\top\) and \(N_\bot\); in the following, Latin indices will be used to identify nodes on the \(\top\) layer while Greek ones will be used for the \(\bot\) layer. Then, the bipartite network can be represented by its biadjacency matrix, i.e. a \(N_\top \times N_\bot\) matrix \(\mathbf {M}\) whose generic entry \(m_{i\alpha }\) is 1 if the node \(i\in \top\) is connected to the node \(\alpha \in \bot\) and 0 otherwise.

Let us start from a real bipartite network \(G_{\text{Bi}}^*\) (in the following, all quantities denoted by a \(*\) will indicate those measured on the real network). First, let us define an ensemble of graphs, i.e. the set of all the possible bipartite graphs having the same number of nodes of \(G_{\text{Bi}}^*\), but with all different topologies, from the fully connected to the empty ones. Then, we can define the Shannon entropy over the ensemble, by assigning a different probability to each of its elements:

$$\begin{aligned} S=-\sum _{G_{\text{Bi}}\in \mathcal {G}_{\text{Bi}}}P(G_{\text{Bi}})\ln {P(G_{\text{Bi}})}; \end{aligned}$$

where, \(P(G_{\text{Bi}})\) is the probability of the generic element of the graph ensemble \(G_{\text{Bi}}\). Let us now maximise the entropy, while constraining the network degrees: in particular, we want that the ensemble average of degrees to match the value observed on the real network, in order to have a null-model tailored to the real system. In term of the biadjacency matrix, the degree sequences of the \(\top\) and \(\bot\) layers respectively read \(k_i=\sum _\alpha m_{i\alpha }\) and \(h_\alpha =\sum _i m_{i\alpha }\). Using the method of the Lagrangian multipliers, the constrained maximisation can be expressed as the maximisation of \(S'\), defined as

$$\begin{aligned} S'= & {} S\nonumber \\&+\sum _i\eta _i\left[ k_i^*-\sum _{G_{\text{Bi}}\in \mathcal {G}_{\text{Bi}}} P(G_{\text{Bi}})k_i(G_{\text{Bi}})\right] +\sum _\alpha \theta _\alpha \left[ h_\alpha ^*-\sum _{G_{\text{Bi}}\in \mathcal {G}_{\text{Bi}}} P(G_{\text{Bi}})h_\alpha (G_{\text{Bi}})\right] \nonumber \\&+\zeta \left[ \sum _{G_{\text{Bi}}\in \mathcal {G}_{\text{Bi}}}P(G_{\text{Bi}})-1\right] \end{aligned}$$

where S is the Shannon entropy defined above, \(\eta _i\), \(\theta _\alpha\) are the Lagrangian multipliers relative to the degree sequences, respectively, on \(\top\) and \(\bot\), and \(\zeta\) is the one relative to the probability normalization.

Maximising \(S'\) leads to a probability per graph \(G_{\text{Bi}}\in \mathcal {G}_{\text{Bi}}\) that can be factorised in terms of the probabilities per link \(p_{i\alpha }\)80, i.e.

$$\begin{aligned} P(G_{\text{Bi}})=\prod _{i,\alpha }p_{i\alpha }^{m_{i\alpha } (G_{\text{Bi}})}\,(1-p_{i\alpha })^{1-m_{i\alpha }(G_{\text{Bi}})}, \end{aligned}$$
(1)

where \(p_{i\alpha }=\dfrac{e^{-\eta _i-\theta _\alpha }}{1+e^{-\eta _i -\theta _\alpha }}\). Nevertheless, at this level the above equation is just formal, since we do not know the numerical value of \(\eta _i\) and \(\theta _\alpha\). To this aim, we can then maximise the likelihood of the real network52,81; it can be shown that the likelihood maximisation is equivalent to imposing

$$\begin{aligned} \langle k_i\rangle _{\text{BiCM}}=k_i^*,\,\forall \,i\in \top ; \qquad \langle h_\alpha \rangle _{\text{BiCM}}=h_\alpha ^*, \,\forall \alpha \,\in \bot . \end{aligned}$$

Validated projection of bipartite networks

We want to infer similarities among nodes on the same layer. We can use as a measure of similarity the number of common neighbours—for each couple of verified users, the number of unverified users that have interacted, via tweet or retweet, with both. Let us assume, without loss of generality, that we want to project the information contained in the bipartite network onto the \(\top\) layer and call \(V_{ij}\) the number of common neighbors between nodes \(i,j\in \top\) (following Ref.58, we use the letter V to indicate common neighbours, since this pattern appear in the bipartite network as a “V” between the layer).

In terms of the biadjacency matrix, \(V_{ij}\) can be expressed as

$$\begin{aligned} V_{ij}=\sum _\alpha V_{ij}^\alpha =\sum _\alpha m_{i\alpha }m_{j\alpha }, \end{aligned}$$

where we have defined \(V_{ij}^\alpha = m_{i\alpha }m_{j\alpha }\); \(V_{ij}^\alpha =1\) if both i and j are connected to node \(\alpha \in \bot\) and 0 otherwise. Let us now compare the observed \(V_{ij}\) for each possible pair of nodes in \(\top\) with the prediction of the BiCM. Since link probabilities are independent, the presence of each V-motif \(V_{ij}^\alpha\) can be regarded as the outcome of a Bernoulli trial:

$$\begin{aligned} f_{\text{Ber}}(V_{ij}^\alpha =1)&=p_{i\alpha }p_{j\alpha },\nonumber \\ f_{\text{Ber}}(V_{ij}^\alpha =0)&=1-p_{i\alpha }p_{j\alpha }. \end{aligned}$$

In general, the probability of observing \(V_{ij}=n\) can be expressed as a sum of contributions, running on the n-tuples of considered nodes (in this case, the ones belonging to the layer of users). Indicating with \(A_n\) all possible nodes n-tuples among the layer of \(\bot\), this probability amounts at

$$\begin{aligned} f_{PB}(V_{ij}=n)=\sum _{A_n}\left[ \prod _{\alpha \in A_n}p_{i\alpha }p_{j\alpha } \prod _{\alpha '\notin A_n}(1-p_{i\alpha '}p_{j\alpha '})\right] , \end{aligned}$$
(2)

where the second product runs over the complement set of \(A_n\). Eq. (2) represent the generalization of the usual Binomial distribution when the single Bernoulli trials have different probabilities, also known as Poisson Binomial distribution82.

We can, then, verify the statistical significance of the observed co-occurrences by calculating their p-value according to the distribution in Eq. (2), i.e. the probability of observing a number of co-occurrences greater than, or equal to, the observed one:

$$\begin{aligned} \text {p-value}(V^*_{ij})=\sum _{V_{ij}\ge V^*_{ij}}f_{PB}(V^*_{ij}). \end{aligned}$$
(3)

Repeating this calculation for every pair of nodes, we obtain \(\left( {\begin{array}{c}N_\top \\ 2\end{array}}\right)\) p-values. In order to state the statistical significance of the hypotheses belonging to this group, it is necessary to adopt a multiple hypothesis testing correction; in the present paper, we use the False Discovery Rate (FDR83), since it controls the false positives rate.

Direct configuration model

From the entire retweet network, in which the various accounts are represented as nodes in a direct network in which an arrow points the retweeter of a post, starting from its author, we extracted the various subgraphs of discursive community. Then, in order to compare the observed dimensions of the bow-tie sectors of these subgraphs and state their statistical significance, we adopted the Direct Configuration Model (DCM), which is the entropy-based model suited for direct monopartite networks43. For directed networks, the adjacency matrix is (in general) not symmetric, and each node i is characterized by two degrees: the out-degree \(k^{\text{out}}_i=\sum _ja_{ij}\) and the in-degree \(k^{\text{in}}=\sum _ja_{ji}\), where \(a_{ij}\) is the generic entry of the (directed) adjacency matrix \(\mathbf {A}\). The Directed Configuration Model (DCM) is therefore defined as the ensemble of direct networks with given out-degree and in-degree sequences. Using the same machinery as in the previous section “Entropy-based null-models for network analysis and their applications”, it is possible to derive a probability per graph: if \(G_D\) is the generic representative of the ensemble of directed graphs \(\mathcal {G}_D\), then the probability per graph \(P(G_D)\) reads:

$$\begin{aligned} P(G_D)=\prod _{i,j\ne i}q_{ij}^{a_{ij}(G_D)}(1-q_{ij})^{1-a_{ij}(G_D)}. \end{aligned}$$

Thus, again the probability per graph factorises in terms of probabilities per link \(q_{ij}\), which can be expressed in terms of Lagrangian multipliers

$$\begin{aligned} q_{ij}=\dfrac{e^{-\gamma _i-\delta _j}}{1+e^{-\gamma _i-\delta _j}}, \end{aligned}$$

where \(\gamma _i\) and \(\delta _j\) are the Lagrangian multipliers associated, respectively to the out-degree of node i and to the in-degree of node j. In order to get the numerical value of \(\gamma _i\) and \(\delta _j\), we can use the maximum likelihood as in the above section “Entropy-based null-models for network analysis and their applications”, which is equivalent to impose

$$\begin{aligned} \langle k_i^{\text{out}}\rangle _{\text{DCM}}=k_i^{*^{\text{out}}}, \qquad \langle k_i^{\text{in}}\rangle _{\text{DCM}}=k_i^{*^{\text{in}}},\,\forall i. \end{aligned}$$

Since the bow-tie decomposition is highly non linear, in order to calculate the statistical significance of the dimension of the various blocks, we generated a sample of 1000 different graphs for each discursive community, using the probabilities provided by the DCM. Then, we obtained a distribution for the dimensions of the bow-tie sectors just looking to the decomposition of each graph in our ensemble. At this point, we could calculate a two-tailed p-value with a significance at \(\alpha =0.01\) for estimating the distance between the dimensions observed with those reproduced by the ensemble.

Modularity and community detection

In the present analysis, we inferred the discursive communities from the communities in the validated network of verified users. In particular, we used the modularity based Louvain algorithm49.

The modularity84 compares the number of edges within the actual communities with its expectation under a certain null-model. Modularity can be written as

$$\begin{aligned} Q=\frac{1}{2m}\sum _{ij}(a_{ij}-p_{ij})\,\delta (C_i,C_j) \end{aligned}$$
(4)

where m is the total number of links of the network, \(a_{ij}\) are the entries of the adjacency matrix, \(p_{ij}\) is the probability to have a link between nodes i and j according to the chosen null-model, \(C_i\) and \(C_j\) are, respectively, the communities of nodes i and j and the Kronecker delta \(\delta (C_i,C_j)\) selects all the pairs of nodes contained in the same community (equal to 1 if \(C_i=C_j\) or 0 otherwise). In the original definition in Ref.85, the null-model chosen is the Chung-Lu one51, which conserve the degree sequence, but it is known to be inconsistent for dense networks that present strong hubs27. In the present paper we use instead the entropy-based Undirected Configuration Model (UCM) defined in52,81: it can be shown that in the case of sparse network, the UCM can be approximated by the Chung-Lu null-model.

Furthermore, Louvain algorithm is known to be order dependent, i.e. the resulting configuration depends on the order of the nodes given to the algorithm. In order to avoid this bias, we run 100 times the algorithm reshuffling each time the node order. At the end of 100 runs, we select the final partition displaying the maximum of the objective function (in our case, the modified modularity with the UCM).

The multiple runs approach is quite common26,38,39,58,75, but different approaches are present in the literature for the final choice of the resulting partition in communities: for instance, in Refs.38,39 the authors, instead of choosing the partition with the greatest value of the modularity, prefer to choose the most common clusters, the choice being motivated by the specific profile of modularity86. While the procedure proposed in Refs.38,39 is perfectly acceptable, we prefer ours, since it targets directly the objective function we are considering.

NEMtropy

In the present paper, we implemented the BiCM, the DCM and the Louvain algorithm using UCM null-models via the Python module NEMtropy, described in Ref.70.

Discursive communities for the Italian COVID-19 dataset

Here, we give a brief description of the discursive communities identified in the Italian COVID-19 dataset (Their dimensions are in Fig. 2):

  • DX: this community collects the official accounts and the main leaders of two Italian right-oriented political parties, ‘Lega’ and ‘Fratelli d’Italia’;

  • M5S: this community contains the main politicians and the official accounts of the Italian party ‘Movimento 5 Stelle’ (English: 5 Stars Movement), an anti-establishment political movement;

  • IV: this community is associated to the liberal party of ‘Italia Viva’ (English: Italy Alive) with centre/centre-left political positions;

  • PD: this cluster contains the politicians of the Italian ‘Partito Democratico’ (English: Democratic Party), the traditional centre-left party;

  • FI: this group collects the politicians and the official accounts of the Italian centre-right party of ‘Forza Italia’ (English: Go, Italy!);

  • MEDIA: this type of community is present in almost all the datasets we analyzed. It contains the official accounts of newspapers, journalists, TV-channels, radio channels and in general other Italian media.

Social bots

Social bots are computer algorithms whose behaviour on social platforms is often far from being benign: malicious bots are purposely created to distribute spam, sponsor public characters and, ultimately, induce a bias within the public opinion87,88,89. Often these agents have the task of increasing the visibility of certain users29,33.

Here, we report the outcome of a study about detection of social bots in the datasets under investigation. For bot detection, we exploit the general-purpose bot detection system based on supervised-learning presented in90. Such a system has been shown to be highly accurate, both for unveiling automated accounts that work alone and those that participate in coordinated activities. The bot detector is ‘traditional’, i.e., only one user per time is analyzed during the classification process91.

The classifier exploits so-called Class A features, i.e., features that can be directly extracted from the user profile. These features were originally introduced in87 and, despite their simplicity, proved to be still effective for the detection of novel bots too. Features that are known to be the most expensive to compute (mainly in terms of time needed for data gathering), namely those concerning the account’s relationships (friends and followers) have been disregarded.

Hence, in order to decide about the type of the account (either a bot or not), we (1) train and evaluate different machine-learning algorithms on a dataset where bots and genuine accounts are a priori known, (2) we select the model with the best classification performances and (3) we apply the resulting model to the accounts of the datasets investigated in the main text of the manuscript.

To train and validate the classifier we leverage the publicly available cresci-stock-2018 (https://botometer.osome.iu.edu/bot-repository/datasets.html) dataset. In particular, we use the accounts metadata of the 6842 bots and 5880 human that were still active at the time of data collection; data were crawled on July 2020 through the Tweepy library (https://www.tweepy.org/).

To select the best model, we consider five algorithms, each of them belonging to a different category: MlP (Multilayer Perception)92, JRip, i.e., a Java-based implementation of the RIPPER algorithm93, Naive Bayes94, Random Forest95, and the Weka96 implementation of the Instance-based Learning Algorithms, i.e., IBk97.

The performances of the five different algorithms are evaluated in terms of standard metrics, such as balanced accuracy, precision, and f-measure. The metrics are computed using a 10-fold cross-validation.

For all the experiments, we rely on the open source (Java-based) Weka framework that provide us the implementations of (1) the five machine-learning algorithms (for which we use the default parameter settings, https://waikato.github.io/weka-wiki/documentation/), (2) the evaluation metrics and (3) the process of 10-fold cross validation.

In light of our experiments (see Table 2), we select the Random Forest-based model as the classification process since it outperforms the other models.

Table 2 Performance results after 10-folds cross validation on cresci-stock-2018 data-set.

The resulting model for bot classification is then applied to tag all the accounts involved in our study, giving an average concentration of bots that is around 23.9% in total. In particular, if we focus on specific datasets, we observe percentage of bots around 23.9% for Italian COVID-19, 29% for German COVID-19, 23.4% for French COVID-19, 22.8% for Dutch elections, 25.7% for Italian debate on migrants, 24% for Italian debate on the Astrazeneca vaccine and 18.2% for Italian and Turkish EURO2020 dataset.

These are quite high values, especially if we take as a baseline measure the one provided by Varol et al.98 in a 2017 study which estimated the percentage of active bots on the Twittersphere at between 9 and 15%.

However, in our research, several aspects could motivate both the high values and their variability amongst the datasets. Specifically, (1) we are looking at specific (hot) topics that might involve more significant numbers of bots than the average, (2) we are considering datasets on significantly different topics (thus, the percentage of automated accounts might vary), and (3) we are analyzing data collected in different time intervals, but evaluated with a single classifier (this might further affect the classification performance, due to the possible evolution of bots).

With these premises to keep in mind, we now describe how the potential bots are distributed in the discursive communities. They are equally distributed among the discursive communities, with a slightly higher percentage of bots in the conservative groups: for instance, in the Italian and French COVID-19 datasets, the communities with the highest percentage of bots are DX and RIGHT-WING with, respectively, the 25.5% and the 29.7% of suspicious accounts. In our bow-tie structures, they are basically placed in the OUT sector or in the INTENDRILS one.

In Fig. 10 are shown the percentages of bots in a specific bow-tie sector averaged on all the discursive communities in the usual three categories. Globally, the highest percentages can be found in the OUT sector and in the INTENDRILS one: literally, social bots tend to retweet more than to be retweeted. In particular, in the case of OUT-dominant bow-tie, on average the 60% of bots are placed in the OUT sector and to a lesser extent in INTENDRILS (around 25%). In the case of INTEND-dominant bow-ties, we found even above the 60% of bots in INTENDRILS sector. Instead, when the bow-tie structure is absent, OTHERS is the block the contains the greatest number of bots.

Figure 10
figure 10

Percentages of social-bots in each sector of the bow-tie structure for OUT-dominant, INTEND-dominant and not informative bow-ties. This figure collects the percentage of bots in every bow-tie’s sector for discursive communities with OUT-dominant, INTEND-dominant and not informative bow-tie structure. It is easy to note how the highest percentages can be found in the OUT sector, in the INTENDRILS and in the OTHERS one, respectively for the case of OUT-dominant, INTEND-dominant and not informative bow-tie.

It is worth to be mentioned is that the higher percentages of bots in the strongly connected component can be found in the right-oriented discursive communities. For instance, in the Italian COVID-19 dataset the percentage of bots in the SCC for the DX is the 7%, while for all the others it does not overcome the 2%. Such a situation is particularly dangerous, since the fact that social bots are able of being retweeted by human users (as it is the case for accounts in SCC) means that are able to pass off themselves as genuine accounts.