Abstract
Projections of bipartite or two-mode networks capture co-occurrences, and are used in diverse fields (e.g., ecology, economics, bibliometrics, politics) to represent unipartite networks. A key challenge in analyzing such networks is determining whether an observed number of co-occurrences between two nodes is significant, and therefore whether an edge exists between them. One approach, the fixed degree sequence model (FDSM), evaluates the significance of an edge’s weight by comparison to a null model in which the degree sequences of the original bipartite network are fixed. Although the FDSM is an intuitive null model, it is computationally expensive because it requires Monte Carlo simulation to estimate each edge’s p value, and therefore is impractical for large projections. In this paper, we explore four potential alternatives to FDSM: fixed fill model, fixed row model, fixed column model, and stochastic degree sequence model (SDSM). We compare these models to FDSM in terms of accuracy, speed, statistical power, similarity, and ability to recover known communities. We find that the computationally fast SDSM offers a statistically conservative but close approximation of the computationally impractical FDSM under a wide range of conditions, and that it correctly recovers a known community structure even when the signal is weak. Therefore, although each backbone model may have particular applications, we recommend SDSM for extracting the backbone of bipartite projections when FDSM is impractical.
Introduction
Bipartite or two-mode networks are composed of two types of nodes, which we call agents and artifacts, and edges between nodes of one type and nodes of the other type. These networks can be used to represent a wide range of phenomena and therefore are studied in a diverse range of disciplines. For example, natural selection unfolds as species (the agents) compete over sites (the artifacts), commerce is possible as traders exchange resources, scientific advances are reported as scholars write papers, and laws are adopted as legislators sponsor bills. Although bipartite networks are useful in their own right, they can also be useful for inferring unipartite (i.e., one-mode) networks that are difficult to measure directly. For example, while it may be difficult to directly survey politicians about their political alliances because they are busy and may have reasons to misrepresent their true alliances, it may be possible to infer political alliances from politicians’ co-sponsorship of legislation, which is readily observable^{1,2}. A bipartite projection transforms a bipartite network into a unipartite co-occurrence network in which pairs of agents are connected by edges whose weights capture their number of shared artifacts^{3,4,5}. For example, competitive interaction networks can be inferred from species’ co-occurrence in sites^{6}, trade networks can be inferred from firm co-location^{7,8,9} or product co-exchange^{3}, scholarly collaboration networks can be inferred from paper co-authorship^{10}, and political alliance networks can be inferred from bill co-sponsorship^{1}. Throughout the paper we use these applications to offer concrete examples; however, the models we discuss are general and can be applied to extract unipartite backbones in such diverse contexts as flavor^{11}, misinformation^{12}, text^{13}, and genetic^{14} networks. Indeed, in principle any unipartite network can be represented as the projection of some bipartite network^{15,16,17}.
Despite their promise, bipartite projections (i.e., co-occurrence networks) are challenging to analyze because they are typically dense and weighted, and because the edge weights do not necessarily capture the strength of the relationship between nodes^{18}. As a result, it is often useful to analyze the backbone of a bipartite projection, which is an unweighted and typically sparser network that retains only the most ‘important’ edges. Although well-known methods exist for extracting the backbone of weighted networks that are not bipartite projections^{19,20}, methods designed specifically for bipartite projections have recently been developed^{9,18,21,22}. Among these methods, the fixed degree sequence model (FDSM) relies on an intuitive null model, but requires computationally expensive Monte Carlo simulations, making it impractical for extracting the backbone of large bipartite projections. Faster methods are available; however, relatively little is known about whether they yield backbones that are similar to those that would be obtained from using FDSM^{23}, and therefore whether they offer computationally efficient alternatives. To offer guidance to researchers wishing to extract an FDSM-like backbone from a large bipartite projection, in this paper we consider four potential alternatives to FDSM: fixed fill model (FFM), fixed row model (FRM), fixed column model (FCM), and stochastic degree sequence model (SDSM).
The paper is organized in six sections. We begin by formally defining bipartite projections, backbones, and the five backbone models, presenting proofs of the probability mass functions for their respective edge weight distributions in the Supplementary Text S1. In study 1, we evaluate the accuracy and speed of different approaches for estimating cell-filling probabilities used by the SDSM. In study 2, we evaluate the statistical power of the SDSM relative to the FDSM. In study 3, we examine how degree distributions impact the similarity of backbones extracted using FDSM and each of the alternative models. In study 4, we examine the extent to which backbones extracted using different models accurately recover a known community structure. Finally, we conclude with recommendations for backbone model selection and opportunities for future model development.
Backbone extraction for bipartite projections
Preliminaries
A bipartite network captures connections between nodes of one type (agents) and nodes of a second type (artifacts). Throughout this section, we use the ecological case of Darwin’s Finches to provide a concrete example^{24,25}. On his voyage to the Galapagos Islands on the H.M.S. Beagle, Darwin observed that only some species of finches lived on each island. These patterns can be represented as a bipartite network in which finch species (the agent nodes) are connected to the islands (the artifact nodes) where they are found^{26}. A bipartite network can be represented as a binary matrix in which the agents are arrayed as rows, and the artifacts are arrayed as columns. We use \({\mathbf {B}}\) to denote a bipartite network’s representation as a matrix, where \(B_{ik}=1\) if agent i is connected to artifact k, and otherwise is 0. The sequence of row sums and the sequence of column sums of \({\mathbf {B}}\) are called the agent and artifact degree sequences, respectively. These sequences are among the bipartite network’s most significant features and are known to have implications for bipartite projections and backbones^{15,27,28}. In the ecological case, the agent degree sequence captures the number of islands where each species is found, while the artifact degree sequence captures the number of species found on each island.
The projection of a bipartite network is a weighted unipartite co-occurrence network in which a pair of agents is connected by an edge with a weight equal to their number of shared artifacts. For example, the bipartite projection of Darwin’s finch network is a species co-occurrence network in which a pair of finch species is connected by an edge with a weight equal to the number of islands where they are both found. We use \({\mathbf {P}}\) to denote the matrix representation of a bipartite projection, which is computed as \({{\mathbf {B}}}{{\mathbf {B}}}^T\), where \({\mathbf {B}}^T\) indicates the transpose of \({\mathbf {B}}\). In a projection \({\mathbf {P}}\), \(P_{ij}\) indicates the number of times agents i and j were connected to the same artifact k in \({\mathbf {B}}\). The diagonal entries of \({\mathbf {P}}\), \(P_{ii}\), are equal to the agent degrees, but in practice are ignored.
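As a minimal illustration of the projection step, the sketch below computes \({\mathbf {P}} = {{\mathbf {B}}}{{\mathbf {B}}}^T\) in plain Python. The toy matrix (three species on four islands) and the function name `project` are hypothetical and are not taken from the finch data.

```python
def project(B):
    """Return P = B B^T for a binary agent-by-artifact matrix B (a list of rows)."""
    n_agents = len(B)
    P = [[0] * n_agents for _ in range(n_agents)]
    for i in range(n_agents):
        for j in range(n_agents):
            # Count the artifacts k to which both agent i and agent j are connected.
            P[i][j] = sum(b_ik * b_jk for b_ik, b_jk in zip(B[i], B[j]))
    return P


# Hypothetical toy data: 3 species (rows) observed on 4 islands (columns).
B = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
P = project(B)
```

In this toy case \(P_{01} = 2\) because the first two species share two islands, and the diagonal of \({\mathbf {P}}\) reproduces the agent degrees [2, 3, 2].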
The backbone of a bipartite projection is a binary representation of \({\mathbf {P}}\) that contains only the most ‘important’ or ‘significant’ edges. For example, the backbone of a species co-occurrence network connects pairs of species if they are found on a significant number of the same islands, which might be interpreted as evidence that the two species do not compete for resources and perhaps are symbiotic. We use \({\mathbf {P}}'\) to denote the matrix representation of the backbone of \({\mathbf {P}}\). Because multiple methods exist for deciding when an edge is significant and thus should be preserved in the backbone, we use \(\mathbf{P }^{'{\text {M}}}\) to denote a backbone extracted using method M. It is important to note that for a given bipartite projection, there is no ‘true’ backbone, but only backbones corresponding to specific backbone methods M. The backbone extracted using FDSM (i.e. \(\mathbf{P }^{'{\text{FDSM}}}\)) may be similar to or different from a backbone extracted using another method such as SDSM (i.e. \(\mathbf{P }^{'{\text {SDSM}}}\)), and these similarities and differences depend on the information that is considered by the respective methods when determining whether edges’ weights are significant. It is these similarities and differences that we explore in the four studies below.
Backbone extraction methods that were originally developed for non-projection weighted networks are often applied to weighted bipartite projections. One simple method preserves an edge in the backbone if its weight in the projection exceeds some global threshold T. However, when \(T = 0\), which is common, the backbone will be dense and have a high clustering coefficient because each artifact of degree d induces \(d(d-1)/2\) edges in the backbone^{29}. Using \(T > 0\) can yield a sparser and less clustered backbone^{30,31,32}, but still yields highly clustered networks in which low-degree nodes are excluded while high-degree nodes are preserved^{19}. More sophisticated methods, including the disparity filter^{19} and likelihood filter^{20}, aim to overcome these limitations of the global threshold method by using a different threshold for each edge based on a null model. However, all methods that were developed for non-projection weighted networks have the same shortcoming when applied to weighted bipartite projections: they ignore information about the artifacts, which is lost when generating the projection^{18}. In the ecological case, the global threshold, disparity filter, and likelihood filter methods all decide whether two species should be connected in the backbone only by examining how many islands these two species are both found on, but do not consider the characteristics of those islands, including how many other species are found there, or even how many islands there are. Therefore, although these methods are promising for extracting the backbone from non-projection weighted networks, different methods are required for extracting the backbone from a bipartite projection.
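The global threshold method is simple enough to sketch directly. In the example below, which uses a hypothetical projection and hypothetical function names, a single artifact of degree \(d = 4\) is shared by the first four agents, so thresholding at \(T = 0\) retains all \(4 \times 3 / 2 = 6\) edges among them, while \(T = 1\) retains only the one pair that shares additional artifacts.

```python
def global_threshold_backbone(P, T=0):
    """Keep edge (i, j) when its projection weight exceeds the global threshold T."""
    n = len(P)
    return [[1 if i != j and P[i][j] > T else 0 for j in range(n)] for i in range(n)]


def count_edges(A):
    """Count undirected edges in a symmetric binary adjacency matrix."""
    n = len(A)
    return sum(A[i][j] for i in range(n) for j in range(i + 1, n))


# Hypothetical projection: agents 0-3 all share one artifact, agents 0 and 1
# share two further artifacts, and agent 4 shares no artifact with anyone.
P = [
    [3, 3, 1, 1, 0],
    [3, 3, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
edges_T0 = count_edges(global_threshold_backbone(P, T=0))
edges_T1 = count_edges(global_threshold_backbone(P, T=1))
```

Note that neither threshold uses any information about the artifacts themselves, which is the shortcoming the ensemble models below are designed to address.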
Bipartite ensemble backbone models
Bipartite ensemble backbone models decide whether an edge’s observed weight \(P_{ij}\) is significantly large, and thus whether a corresponding edge should be included in the backbone, by comparing it to an ensemble of random bipartite networks. Let \({\mathscr {B}}\) be the set of all bipartite networks \(\mathbf {B^*}\) having the same number of agents and artifacts as \({\mathbf {B}}\). In the ecological case, \(\mathbf {B^*}\) might be viewed as representing a possible world containing the same species and islands, but in which the locations of species on islands are different, and likewise \({\mathscr {B}}\) is the set of all such possible worlds. The bipartite ensembles used in backbone models take a subset \({\mathscr{B}}^{\text{M}}\) of \({\mathscr {B}}\), subject to certain constraints M, and impose a probability distribution on it. In all models except the SDSM, the uniform probability distribution is imposed on \({\mathscr{B}}^{\text{M}}\), that is, each element of the ensemble is equally likely. The backbone is then extracted from the projection of \({\mathbf {B}}\) by using the distribution of edge weights arising from projections of members of the ensemble to evaluate their statistical significance.
We use \(P^*_{ij}\) to denote a random variable equal to \((\mathbf {B^*}\mathbf {B^*}^T)_{ij}\) for \(\mathbf {B^*}~\in ~{\mathscr {B}}^{\text {M}}\). That is, \(P^*_{ij}\) is the number of artifacts shared by i and j in a bipartite network randomly drawn from \({\mathscr {B}}^{\text {M}}\). In the ecological case, \(P^*_{ij}\) represents the number of islands that are home to both species i and j in a possible world, while the distribution of \(P^*_{ij}\) is the distribution of the number of islands shared by species i and j in all possible worlds.
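Decisions about which edges should appear in a backbone extracted at the statistical significance level \(\alpha\) are made by comparing \(P_{ij}\) to \(P^*_{ij}\):
\[P'_{ij} = \begin{cases} 1 & \text{if } \Pr (P^*_{ij} \ge P_{ij}) < \frac{\alpha }{2}, \\ 0 & \text{otherwise.} \end{cases}\]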
This test includes edge \(P'_{ij}\) in the backbone if its weight in the observed projection \(P_{ij}\) is uncommonly large compared to its weight in projections of members of the ensemble \(P^*_{ij}\). We use a two-tailed significance test in the studies below because, in principle, an edge’s weight in the observed projection could be uncommonly larger or uncommonly smaller than its weight in projections of members of the ensemble; however, a one-tailed test may also be used. In the ecological case, two species are connected in the backbone if their number of shared islands in the observed world is uncommonly large compared to their number of shared islands in all possible worlds.
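When the distribution of \(P^*_{ij}\) is only available as a set of sampled edge weights, the test can be sketched as below. The rule of comparing the upper-tail probability \(\Pr (P^*_{ij} \ge P_{ij})\) to \(\alpha /2\) follows the way the test is applied in the studies that follow; the sampled weights and the function names are hypothetical.

```python
def upper_tail_p(observed, sampled_weights):
    """Estimate Pr(P*_ij >= P_ij) as the share of sampled weights >= the observed weight."""
    hits = sum(1 for w in sampled_weights if w >= observed)
    return hits / len(sampled_weights)


def keep_edge(observed, sampled_weights, alpha=0.05):
    """Two-tailed rule: keep the edge when the upper-tail p value is below alpha / 2."""
    return upper_tail_p(observed, sampled_weights) < alpha / 2


# Hypothetical weights of a single edge in 1000 projections drawn from an ensemble.
sampled = [3] * 400 + [4] * 350 + [5] * 200 + [6] * 40 + [7] * 10

keep_if_7 = keep_edge(7, sampled)  # p = 10/1000 = 0.01, below 0.025
keep_if_6 = keep_edge(6, sampled)  # p = 50/1000 = 0.05, not below 0.025
```

An observed weight of 7 would therefore be retained at \(\alpha = 0.05\), while an observed weight of 6 would not.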
There are many ways that \({\mathscr {B}}\) can be constrained^{33}, with each set of constraints describing a particular ensemble \({\mathscr {B}}^{\text {M}}\), which is used in a particular ensemble backbone model M to yield a particular backbone \({\mathbf {P}}^{'M}\). In the case of ensembles used to extract the backbone of bipartite projections, our focus in this paper, two broad types of constraints are common^{23}. First, ensembles can be distinguished by what they constrain: only the number of edges, the degrees of the agent nodes, the degrees of the artifact nodes, or the degrees of both the agent and artifact nodes. Second, ensembles can be distinguished by how they impose these constraints: the constraints can be satisfied exactly, or only on average. In statistical physics, ensembles that impose exact or ‘hard’ constraints are known as microcanonical, while ensembles that satisfy constraints on average or impose ‘soft’ constraints are known as canonical^{9}.
Prior work on these ensembles generally adopts either a theoretical focus on the ensembles themselves, or an applied focus on the consequences of ensemble choice. In the theoretical literature, some (primarily mathematicians) have aimed to characterize the properties of ensembles, such as estimating the cardinality of the ensemble of matrices with fixed rows and columns (below, we call this ensemble \({\mathscr{B}}^{{\text{FDSM}}}\))^{34}. Others (primarily physicists) have aimed to identify conditions under which ensembles are equivalent or nonequivalent, typically interpreting ensembles as representing thermodynamic systems^{35,36,37}. In the applied literature, the focus is not on identifying fundamental properties of ensembles, but instead on understanding the implications of choosing a particular ensemble when detecting a particular pattern, such as nestedness^{38} or community structure^{23,27}. The present work falls into this latter group: we are not directly concerned with identifying fundamental properties of ensembles, but instead with identifying the consequences of ensemble choice, with the ultimate goal of offering practical guidance to applied researchers wishing to extract the backbone of a bipartite projection.
In the remaining subsections below, we first describe the FDSM in terms of its ensemble. We then present four potential alternative backbone models whose ensembles differ only slightly from FDSM, in terms of either what they constrain or how they impose constraints. We then turn to exploring the consequences of choosing one of these alternatives over FDSM when extracting a backbone.
Fixed degree sequence model (FDSM)
In the fixed degree sequence model (FDSM), \(\mathbf {B^*}~\in ~{\mathscr {B}}^{{\text{FDSM}}}\) are constrained to have the same agent and artifact degree sequences as \({\mathbf {B}}\). That is, FDSM constrains the degrees of both the agent and artifact nodes, and requires that these constraints are satisfied exactly, making it a tightly constrained microcanonical ensemble. Adopting the FDSM implies, for example, that in all possible worlds a given species is found on exactly the same number of islands, and a given island is home to exactly the same number of species. The distribution of \(P^*_{ij}\) arising from \({\mathscr {B}}^{{\text{FDSM}}}\) is unknown, but can be approximated by uniformly sampling \(\mathbf {B^*}\) from \({\mathscr {B}}^{\text{FDSM}}\), constructing \(\mathbf {P^*}\), and saving the values \(P^*_{ij}\). In the studies below, we use 1000 samples of \(\mathbf {B^*}\) generated using the ‘curveball’ algorithm, which is among the fastest methods to sample \({\mathscr {B}}^{\text{FDSM}}\) uniformly at random^{39,40}. The FDSM has been used to extract the backbone of bipartite projections of, for example, movies co-liked by viewers^{21} and conference panel co-participation by scholars^{41,42}.
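The sampling procedure can be sketched as follows. This is a simplified, unoptimized rendering of curveball-style trades, in which two agents pool the artifacts held by only one of them and redistribute them at random; it is not the implementation used in the studies. The function names, the seed, and the choice of five trades per agent between successive samples are assumptions made for illustration.

```python
import random


def curveball(B, n_trades, rng):
    """Return a randomized copy of B with identical row and column sums."""
    n_cols = len(B[0])
    # Represent each agent (row) by the set of artifacts (columns) it is connected to.
    rows = [set(k for k, b in enumerate(row) if b) for row in B]
    for _ in range(n_trades):
        i, j = rng.sample(range(len(rows)), 2)
        shared = rows[i] & rows[j]
        only_i = rows[i] - shared
        only_j = rows[j] - shared
        # Pool the artifacts held by exactly one of the two agents and redeal them,
        # giving each agent back as many as it contributed.
        pool = sorted(only_i | only_j)
        rng.shuffle(pool)
        rows[i] = shared | set(pool[:len(only_i)])
        rows[j] = shared | set(pool[len(only_i):])
    return [[1 if k in r else 0 for k in range(n_cols)] for r in rows]


def fdsm_p_value(B, i, j, n_samples=1000, seed=0):
    """Monte Carlo estimate of Pr(P*_ij >= P_ij) under fixed degree sequences."""
    rng = random.Random(seed)
    observed = sum(a * b for a, b in zip(B[i], B[j]))
    current, hits = B, 0
    for _ in range(n_samples):
        # Assumption: 5 trades per agent between successive samples.
        current = curveball(current, n_trades=5 * len(B), rng=rng)
        weight = sum(a * b for a, b in zip(current[i], current[j]))
        if weight >= observed:
            hits += 1
    return hits / n_samples


# Hypothetical 4 x 4 bipartite network.
B = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
]
p_01 = fdsm_p_value(B, 0, 1)
```

Each trade leaves both agents' degrees unchanged and moves an artifact only between the two agents involved, so every artifact's degree is also preserved.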
The FDSM offers an intuitively appealing approach to extracting the backbone of bipartite projections because it fully controls for both bipartite degree sequences, which are known to be responsible for many of the projection’s structural characteristics^{15,16}. However, because the distribution of \(P^*_{ij}\) must be computed via Monte Carlo sampling, it is computationally costly, making it impractical for all but relatively small bipartite projections. There are at least three distinct computational challenges. First, although the curveball algorithm is the fastest among existing methods for randomly sampling a bipartite graph with fixed degree sequences (i.e. for sampling \(\mathbf {B^*}\) from \({\mathscr {B}}^{\text{FDSM}}\)), it still can require several seconds per sample for large graphs. Second, once a \(\mathbf {B^*}\) has been sampled, constructing each \(\mathbf {P^*}\) requires matrix multiplication, which must be performed repeatedly and has complexity of at least \({\mathscr {O}}(n^{2.37})\)^{43}. Finally, computing an edge’s p value (i.e. \(\Pr (P^*_{ij} \ge P_{ij})\)) with sufficient precision to achieve a specified familywise error rate that controls for Type-I error inflation due to multiple testing^{22} can require these sampling and multiplication steps to be performed a very large number of times (see Supplementary Text S2).
These computational challenges have led researchers to develop other backbone models^{3,9,18}. Many such models exist; however, here we are focused on identifying methods that yield backbones similar to what would be obtained using FDSM, and thus which may serve as computationally feasible alternatives to FDSM. Therefore, we consider only those models whose ensembles involve at least one of the two types of constraints imposed by FDSM. That is, we consider models that either (1) impose exact constraints, or (2) impose constraints on both the agent and artifact degrees.
Fixed fill model (FFM)
In the fixed fill model (FFM), \(\mathbf {B^*}~\in ~{\mathscr {B}}^{{\text {FFM}}}\) are simply constrained to contain the same number of 1s as \({\mathbf {B}}\). That is, the FFM constrains only the number of edges, but requires that this constraint is satisfied exactly. Adopting the FFM implies, for example, that in all possible worlds only the total number of species-island pairs is fixed, but any given species may be found on a different number of islands and any given island may be home to a different number of species. The distribution of \(P^*_{ij}\) arising from \({\mathscr {B}}^{{\text {FFM}}}\) has not been described before, but is derived in Supplementary Text S1.1. We call it a Jacobi distribution because it is related to Jacobi polynomials.
Fixed row model (FRM)
In the fixed row model (FRM), \(\mathbf {B^*}~\in ~{\mathscr {B}}^{{\text {FRM}}}\) are constrained to have the same agent degree sequence as \({\mathbf {B}}\), but have unconstrained artifact degree sequences. That is, the FRM constrains the degrees of the agent nodes, and requires that this constraint is satisfied exactly. A canonical variant of the FRM, the \(\hbox {BiPCM}_r\), also constrains the degrees of the agent nodes, but only requires this constraint to be satisfied on average; we do not consider it here because it involves neither of FDSM’s constraints^{9}. Adopting the FRM for backbone extraction implies, for example, that in all possible worlds a given species is found on the same number of islands, but a given island may be home to a different number of species. The distribution of \(P^*_{ij}\) arising from \({\mathscr {B}}^{{\text {FRM}}}\) is hypergeometric (see Supplementary Text S1.2), and for this reason it is sometimes referred to as the hypergeometric model^{22,23,44}. The FRM has been used to extract the backbone of bipartite projections of, for example, movies co-starring actors^{22}, papers co-written by authors^{22}, parties co-attended by women^{44}, majority opinions joined by Supreme Court justices^{44}, and microRNAs co-associated with diseases^{45}.
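Because the FRM edge weight distribution is hypergeometric, its p values can be computed exactly rather than by sampling. The sketch below assumes the standard parameterization implied by placing each agent's artifacts uniformly at random: with n artifacts in total, agent i connected to \(r_i\) of them, and agent j connected to \(r_j\), the number of shared artifacts behaves like the number of marked items in a draw of \(r_j\) from n items of which \(r_i\) are marked. The function names and the numerical example are hypothetical.

```python
from math import comb


def frm_pmf(k, n_artifacts, r_i, r_j):
    """Pr(P*_ij = k) when agents i and j hold r_i and r_j of n_artifacts at random."""
    if k < 0 or k > min(r_i, r_j):
        return 0.0
    return comb(r_i, k) * comb(n_artifacts - r_i, r_j - k) / comb(n_artifacts, r_j)


def frm_upper_tail(observed, n_artifacts, r_i, r_j):
    """Pr(P*_ij >= observed) under the fixed row model."""
    return sum(
        frm_pmf(k, n_artifacts, r_i, r_j)
        for k in range(observed, min(r_i, r_j) + 1)
    )


# Hypothetical case: 10 islands, species i on 4 of them, species j on 3 of them.
p_share_2_or_more = frm_upper_tail(2, 10, 4, 3)  # (36 + 4) / 120 = 1/3
p_share_all_3 = frm_upper_tail(3, 10, 4, 3)      # 4 / 120
```

In this example, sharing all three islands has an upper-tail probability of about 0.033, which would not fall below \(\alpha /2 = 0.025\) at \(\alpha = 0.05\).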
Fixed column model (FCM)
In the fixed column model (FCM), \(\mathbf {B^*}~\in ~{\mathscr {B}}^{{\text {FCM}}}\) are constrained to have the same artifact degree sequence as \({\mathbf {B}}\), but have unconstrained agent degree sequences. That is, the FCM constrains the degrees of the artifact nodes, and requires that this constraint is satisfied exactly. A canonical variant of the FCM, the \(\hbox {BiPCM}_c\), also constrains the degrees of the artifact nodes, but only requires this constraint to be satisfied on average; we do not consider it here because it involves neither of FDSM’s constraints^{9}. Adopting the FCM for backbone extraction implies, for example, that in all possible worlds a given species may be found on a different number of islands, but a given island is home to the same number of species. The distribution of \(P^*_{ij}\) arising from \({\mathscr {B}}^{{\text {FCM}}}\) has not been described before, but is derived in Supplementary Text S1.3, where we show it is Poisson-binomial.
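Although the edge weight distributions for the FFM, FRM, and FCM are available in closed form, it can help to see how a member of each ensemble would be drawn uniformly at random. The samplers below are an illustrative sketch with hypothetical function names; they are not needed to extract a backbone with these models.

```python
import random


def sample_ffm(n_agents, n_artifacts, fill, rng):
    """Fixed fill: place exactly `fill` ones in uniformly chosen cells."""
    cells = rng.sample(range(n_agents * n_artifacts), fill)
    B = [[0] * n_artifacts for _ in range(n_agents)]
    for c in cells:
        B[c // n_artifacts][c % n_artifacts] = 1
    return B


def sample_frm(row_sums, n_artifacts, rng):
    """Fixed row: each agent keeps its degree, artifacts are chosen uniformly."""
    B = []
    for r in row_sums:
        cols = set(rng.sample(range(n_artifacts), r))
        B.append([1 if k in cols else 0 for k in range(n_artifacts)])
    return B


def sample_fcm(col_sums, n_agents, rng):
    """Fixed column: sample the transpose with fixed row sums, then transpose back."""
    Bt = sample_frm(col_sums, n_agents, rng)
    return [list(col) for col in zip(*Bt)]


rng = random.Random(0)
B_ffm = sample_ffm(3, 4, 5, rng)          # 5 ones anywhere in a 3 x 4 matrix
B_frm = sample_frm([1, 2, 3], 4, rng)     # agent degrees fixed at 1, 2, 3
B_fcm = sample_fcm([2, 1, 0, 3], 3, rng)  # artifact degrees fixed at 2, 1, 0, 3
```

Each sampler fixes exactly one of the quantities named in its model and leaves the others free to vary from draw to draw.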
Stochastic degree sequence model (SDSM)
Finally, the stochastic degree sequence model (SDSM) takes \({\mathscr {B}}^{{\text {SDSM}}}\) to be all binary \(m \times n\) matrices, but also gives a process for generating these matrices with different probabilities. Each \(\mathbf {B^*}\) is generated by filling the cells \(B^*_{ik}\) with a 0 or 1 depending on the outcome of an independent Bernoulli trial with probability \(p^*_{ik}\). The distribution of the random variable \(P^*_{ij}\) arising from \({\mathscr {B}}^{{\text {SDSM}}}\) is Poisson-binomial with parameters which can be computed using the \(p^*_{ik}\) (see Supplementary Text S1.4)^{27,46}. There are many ways to choose \(p^*_{ik}\), but in the studies below we choose \(p^*_{ik}\) so that it approximates \(\Pr (B^*_{ik} = 1)\) for \(\mathbf {B^*}~\in ~{\mathscr {B}}^{{\text{FDSM}}}\). This choice of \(p^*_{ik}\) ensures that the SDSM constrains the degrees of both the agent and artifact nodes, but only requires these constraints to be satisfied on average. Adopting such a version of SDSM implies, for example, that in each possible world a given species may be found on many or few islands and a given island may be home to many or few species, but the average number of islands on which a given species lives in all possible worlds and the average number of species that live on a given island in all possible worlds match these values in the observed world. The SDSM has been used to extract the backbone of bipartite projections of, for example, legislators co-sponsoring bills^{1,18,47,48,49}, zebrafish (Danio rerio) sharing operational taxonomic units^{50}, countries sharing exports^{3}, and genes expressed in genesets^{51}.
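To make the Poisson-binomial distribution concrete: because the cells are filled independently, agents i and j share artifact k with probability \(p^*_{ik} p^*_{jk}\), so \(P^*_{ij}\) is a sum of independent Bernoulli trials with these success probabilities. The sketch below computes its exact probability mass function by iterative convolution. This exact approach is an illustrative choice with hypothetical function names and toy probabilities; faster approximations may be preferable for large networks.

```python
def poisson_binomial_pmf(probs):
    """PMF of a sum of independent Bernoulli(q) trials, built one trial at a time."""
    pmf = [1.0]  # with zero trials, the sum is 0 with certainty
    for q in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            nxt[s] += mass * (1 - q)  # this trial fails, the sum stays at s
            nxt[s + 1] += mass * q    # this trial succeeds, the sum moves to s + 1
        pmf = nxt
    return pmf


def sdsm_upper_tail(observed, p_i, p_j):
    """Pr(P*_ij >= observed) given the cell-filling probabilities of rows i and j."""
    shared = [a * b for a, b in zip(p_i, p_j)]
    return sum(poisson_binomial_pmf(shared)[observed:])


# Hypothetical cell-filling probabilities for two agents over two artifacts.
p_i = [0.5, 0.5]
p_j = [1.0, 0.5]
pmf = poisson_binomial_pmf([a * b for a, b in zip(p_i, p_j)])  # [0.375, 0.5, 0.125]
p_at_least_1 = sdsm_upper_tail(1, p_i, p_j)                    # 0.625
```

The convolution requires on the order of \(n^2\) operations per edge for n artifacts, which is one reason an approximation may be used in practice.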
Study 1: Choosing cell-filling probabilities for the SDSM
The SDSM requires choosing \(p^*_{ik}\), which we want to approximate \(\Pr (B^*_{ik} = 1)\) for \(\mathbf {B^*}~\in ~{\mathscr {B}}^{{\text{FDSM}}}\). There are three types of methods that might be used for doing so: arithmetic, general linear models, and entropy maximization. First, we can choose \(p^*_{ik} = (r_i~\times ~c_k)/f\), where \(r_i\) is the sum of entries in row i of \({\mathbf {B}}\), \(c_k\) is the sum of entries in column k of \({\mathbf {B}}\), and f is the sum of all entries in \({\mathbf {B}}\). When \(p^*_{ik}\) falls outside the [0, 1] range, it is simply truncated to 0 or 1, respectively. This method has a long history in ecology^{25}; we call it RCF because the value is chosen based on a row sum, a column sum, and the number of entries of \({\mathbf {B}}\) that are filled with a one, but elsewhere it has been called the ‘Chung-Lu method’^{52,53}. Second, an estimate can be obtained by fitting a general linear model of the form:
\[B_{ik} = \beta _0 + \beta _1 r_i + \beta _2 c_k + \beta _3 (r_i \times c_k) + \epsilon\]
where the \(\beta\)’s are estimated coefficients and \(\epsilon\) is an error term. If the model is treated as a linear regression and the coefficients are estimated using ordinary least squares, then the predicted value of \(B_{ik}\) is chosen for \(p^*_{ik}\), either truncating values outside the required [0, 1] range (linear probability model; LPM) or transforming them into the required range using a linear discriminant model (LDM)^{54}. If the model is treated as a logistic regression and the coefficients are estimated using maximum likelihood, then the predicted probability that \(B_{ik} = 1\) is chosen for \(p^*_{ik}\). In prior work, the logistic regression approach has used a scobit or logit link function, with or without an interaction term (\(\beta _3\))^{1,18,47}. Finally, an estimate can be obtained by entropy maximization methods, including the polytope method (Poly)^{27,55} or bipartite configuration model (BiCM)^{3,9,56}. In this study, we evaluate the accuracy and speed of these methods for choosing \(p^*_{ik}\) that approximate \(\Pr (B^*_{ik} = 1)\) for \(\mathbf {B^*}~\in ~{\mathscr {B}}^{{\text{FDSM}}}\).
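Of the methods listed above, the arithmetic RCF estimate is the simplest to state in code. The sketch below computes \(p^*_{ik} = (r_i \times c_k)/f\) with truncation to the [0, 1] range; the function name and the two toy matrices are hypothetical, and the other methods (regression and entropy maximization) are not shown.

```python
def rcf_probabilities(B):
    """RCF estimate: (row sum x column sum) / fill, truncated to the [0, 1] range."""
    row_sums = [sum(row) for row in B]
    col_sums = [sum(col) for col in zip(*B)]
    fill = sum(row_sums)  # assumes B contains at least one 1
    return [
        [min(1.0, max(0.0, r * c / fill)) for c in col_sums]
        for r in row_sums
    ]


# Hypothetical network with agent degrees [1, 1, 2] and artifact degrees [1, 1, 2].
B = [
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]
p_star = rcf_probabilities(B)

# Hypothetical network where truncation applies: 3 x 2 / 4 = 1.5 becomes 1.0.
p_truncated = rcf_probabilities([[1, 1, 1], [1, 0, 0]])
```

For the first matrix the estimate assigns probability 1.0 to the cell joining the highest-degree agent and the highest-degree artifact, even though that cell is empty in the observed network.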
Methods
To evaluate accuracy, we begin by enumerating all the members of a small \({\mathscr {B}}^{\text{FDSM}}\). For example, given an agent degree sequence of [1, 1, 2] and an artifact degree sequence of [1, 1, 2], \({\mathscr {B}}^{\text{FDSM}}\) contains 5 members (see Table 1A). Second, from this complete enumeration, we compute the probabilities we wish \(p^*_{ik}\) to approximate (i.e., \(\Pr (B^*_{ik} = 1)\) for \(\mathbf {B^*}~\in ~{\mathscr {B}}^{{\text{FDSM}}}\), see Table 1B). Third, we compute \(p^*_{ik}\) using each of nine methods (see Table 1C for values obtained using the BiCM method). Finally, we quantify the accuracy with which \(p^*_{ik}\) approximates the desired probabilities using the mean absolute difference for all i, k. In the example shown in Table 1, BiCM’s accuracy for these degree sequences is 0.028. That is, \(p^*_{ik}\) chosen using BiCM deviates from the desired probabilities by ± 0.028 on average. Because evaluating accuracy in this way requires enumerating all members of \({\mathscr {B}}^{\text{FDSM}}\), it is possible only for short degree sequences that define \({\mathscr {B}}^{\text{FDSM}}\) with small cardinality. We focus on degree sequences ranging in length from 2 to 5, which define 384 unique \({\mathscr {B}}^{\text{FDSM}}\) ranging in cardinality from 4 to 2040.
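The enumeration and accuracy steps can be sketched by brute force for the example degree sequences. The code below checks every binary \(3 \times 3\) matrix, keeps those with the required row and column sums, and computes the target probabilities and the mean absolute difference. For illustration it scores the simple RCF estimate \((r_i \times c_k)/f\) rather than BiCM, so the resulting error is not the BiCM value reported in Table 1; all function names are hypothetical.

```python
from itertools import product


def enumerate_fdsm(row_sums, col_sums):
    """List every binary matrix with the given row and column sums (brute force)."""
    n_rows, n_cols = len(row_sums), len(col_sums)
    members = []
    for cells in product((0, 1), repeat=n_rows * n_cols):
        M = [cells[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]
        if ([sum(r) for r in M] == list(row_sums)
                and [sum(c) for c in zip(*M)] == list(col_sums)):
            members.append(M)
    return members


def cell_probabilities(members):
    """Pr(B*_ik = 1) when every member of the ensemble is equally likely."""
    n = len(members)
    n_rows, n_cols = len(members[0]), len(members[0][0])
    return [[sum(M[i][k] for M in members) / n for k in range(n_cols)]
            for i in range(n_rows)]


def mean_absolute_difference(estimate, target):
    """Average of |estimate - target| over all cells i, k."""
    diffs = [abs(e - t) for er, tr in zip(estimate, target) for e, t in zip(er, tr)]
    return sum(diffs) / len(diffs)


members = enumerate_fdsm([1, 1, 2], [1, 1, 2])
target = cell_probabilities(members)
# RCF estimate for these sequences (fill f = 4); no truncation is needed here.
rcf = [[r * c / 4 for c in (1, 1, 2)] for r in (1, 1, 2)]
rcf_error = mean_absolute_difference(rcf, target)  # 0.8 / 9, about 0.089
```

The enumeration recovers the 5 members noted above, but its cost grows as \(2^{mn}\), which is why this kind of evaluation is feasible only for very short degree sequences.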
After identifying each method’s accuracy, we evaluate the computational running time of the four most accurate methods by using them to choose \(p^*_{ik}\) for bipartite graphs defined by up to 1000 agents and up to 1000 artifacts, and thus requiring choosing up to 1,000,000 probabilities.
Results
Figure 1A shows the accuracy of each method’s computation of \(p^*_{ik}\). Each gray line plots the accuracy of each method for a single \({\mathscr {B}}^{\text{FDSM}}\), while the red line and shaded region plot the mean and 95% confidence interval of the accuracy of each method over all 384 \({\mathscr {B}}^{\text{FDSM}}\). We find that choosing \(p^*_{ik}\) using a logistic regression with an interaction term (i.e., ScobitI and LogitI) is on average least accurate^{1,18}, while the two entropy maximization methods (BiCM and Poly) yield numerically equivalent results, which are on average most accurate^{3,27}.
Figure 1B shows the number of seconds required to compute \(p^*_{ik}\) using a 2.3 GHz Intel i7 processor; lines illustrate the mean running time, while the shaded regions show the 95% confidence interval. Among the two most accurate methods, BiCM is several orders of magnitude faster than Polytope. When computing more than \(10^4\) probabilities, BiCM is also faster than the two slightly less accurate Logit and LDM methods. In the largest case we evaluated, computing \(10^6\) probabilities, BiCM took only about 0.026 seconds. Therefore, we use BiCM for choosing \(p^*_{ik}\) when extracting SDSM backbones in the remaining studies because it is both the most accurate and fastest.
Study 2: Statistical power of SDSM
Ensemble backbone models require the specification of a statistical significance level \(\alpha\), which determines how uncommonly large an observed edge weight \(P_{ij}\) must be when compared to edge weights \(P^*_{ij}\) arising from an ensemble in order for a corresponding edge to be included in the backbone. For a given model, smaller values of \(\alpha\) represent more stringent criteria for retaining edges, and therefore yield sparser backbones. Although FDSM and SDSM define their respective ensembles by constraining both agent and artifact degree sequences, and thus aim to yield similar backbones, a given \(\alpha\) does not necessarily represent the same level of stringency in these two models. Because the SDSM allows variation in the degree sequences of \(\mathbf {B^*}~\in ~{\mathscr {B}}^{\text {SDSM}}\), the distribution of \(P^*_{ij}\) is wider^{23,28}. These wider distributions mean that the SDSM provides a more conservative test of edge weight significance than FDSM, or alternatively the SDSM has less statistical power to detect significant edges than FDSM.
A concrete example serves to illustrate this difference. In economic geography, it is common to study the world city network using a bipartite projection where two cities are linked to the extent that firms maintain locations in both cities. The Globalization and World Cities (GaWC) dataset has been widely used in this context, and takes the form of a bipartite network recording the presence or absence of 100 firms (artifacts) in 196 cities (agents) in the year 2000^{7,28}. In this bipartite network, the agent degrees are right-tailed because most cities contain only a few firms, while a few cities such as New York contain many. Likewise, the artifact degrees are also right-tailed because most firms maintain locations in only a few cities, while a few firms such as the accounting firm KPMG maintain locations in many.
Figure 2A illustrates the distribution of the Milan-Paris edge weight in projections arising from \({\mathscr {B}}^{\text{FDSM}}\) and \({\mathscr {B}}^{\text {SDSM}}\) of which the observed bipartite network is a member (i.e., the random variable \(P^*_{ij}\)). These distributions allow a researcher to decide whether Milan and Paris’s observed number of co-located firms is significantly large, and therefore whether Milan and Paris should be connected in a world city network backbone. The SDSM distribution is wider than the FDSM distribution^{23,28}, which has implications for whether the Milan-Paris edge will be included in a backbone extracted at a given significance level using each model. In the observed data, there are 26 firms co-located in Milan and Paris (i.e., \(P_{ij} = 26\)). The probability of observing the same or larger edge weight in projections from the FDSM ensemble is 0.0033, which is less than \(\frac{0.05}{2}\), and therefore a Milan-Paris edge is deemed significant by the FDSM and is included in the FDSM backbone extracted at \(\alpha = 0.05\). In contrast, the probability of observing the same or larger edge weight in projections from the SDSM ensemble is 0.0275, which is not less than \(\frac{0.05}{2}\), and therefore a Milan-Paris edge is not deemed significant by the SDSM and is not included in the SDSM backbone extracted at \(\alpha = 0.05\). For a given level of significance \(\alpha\), this difference in statistical power leads the SDSM backbone to be sparser than the FDSM backbone (density \(= 0.004\) vs. 0.012), and means that these two backbones are dissimilar (Jaccard \(= 0.36\)).
In this study, we investigate SDSM’s statistical power relative to FDSM, and specifically whether extracting an SDSM backbone using a more liberal (i.e., larger) \(\alpha\) makes it more similar to an FDSM backbone extracted at \(\alpha = 0.05\).
Methods
To evaluate SDSM’s statistical power and the effect of significance levels on the similarity of SDSM and FDSM backbones, we first extract the FDSM backbone from the GaWC bipartite network at \(\alpha = 0.05\). We then extract SDSM backbones from the GaWC bipartite network at \(0.01 \le \alpha \le 0.3\) in 0.001 increments, each time computing the Jaccard index (J) to measure the similarity between the SDSM and FDSM backbones. After comparing SDSM and FDSM backbones extracted from the empirical GaWC bipartite network, we repeat this process using 100 synthetic bipartite networks with the same dimensions (\(196 \times 100\)), density (0.08), and right-tailed agent and artifact degree distributions.
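The sweep over significance levels described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names (`backbone_edges`, `jaccard`, `sweep_alpha`) are hypothetical, the p-value matrices are made-up toy values rather than GaWC results, and an edge is assumed to be retained when its upper-tail p value is below \(\alpha / 2\).

```python
import numpy as np


def backbone_edges(p_values: np.ndarray, alpha: float) -> set:
    """Edges (i, j), i < j, whose upper-tail p value is below alpha / 2."""
    n = p_values.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    keep = p_values[rows, cols] < alpha / 2
    return set(zip(rows[keep].tolist(), cols[keep].tolist()))


def jaccard(edges_a: set, edges_b: set) -> float:
    """Jaccard index of two edge sets; defined here as 1.0 if both are empty."""
    union = edges_a | edges_b
    if not union:
        return 1.0
    return len(edges_a & edges_b) / len(union)


def sweep_alpha(p_ref, p_alt, alpha_ref=0.05, alphas=None):
    """Return the alpha (and its Jaccard score) at which the alternative
    backbone is most similar to the reference backbone."""
    if alphas is None:
        # 0.01 <= alpha <= 0.3 in 0.001 increments, as in the text.
        alphas = np.round(np.arange(0.01, 0.3 + 1e-9, 0.001), 3)
    reference = backbone_edges(p_ref, alpha_ref)
    scores = [jaccard(reference, backbone_edges(p_alt, a)) for a in alphas]
    best = int(np.argmax(scores))  # first alpha reaching the maximum
    return float(alphas[best]), scores[best]


# Hypothetical upper-tail p values for a 4-node projection (toy data only).
p_fdsm = np.array([
    [1.0,   0.003, 0.020, 0.30],
    [0.003, 1.0,   0.010, 0.40],
    [0.020, 0.010, 1.0,   0.50],
    [0.30,  0.40,  0.50,  1.0],
])
p_sdsm = np.array([
    [1.0,    0.0275, 0.0497, 0.35],
    [0.0275, 1.0,    0.0300, 0.45],
    [0.0497, 0.0300, 1.0,    0.55],
    [0.35,   0.45,   0.55,   1.0],
])

best_alpha, best_j = sweep_alpha(p_fdsm, p_sdsm)
print(best_alpha, best_j)
```

In this toy case the more conservative p values need a larger \(\alpha\) before the same three edges are retained, which mirrors the qualitative pattern the study investigates.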
Results
The green line in Fig. 2B shows the Jaccard similarity between an FDSM backbone extracted from the empirical GaWC network at \(\alpha = 0.05\) and SDSM backbones extracted at the significance levels shown on the x-axis. We find that an SDSM backbone achieves its maximum similarity to the FDSM backbone (\(J = 0.81\)) when it is extracted using the more liberal significance level of \(\alpha = 0.12\). Returning to the example in Fig. 2A, using this more liberal significance level would result in the Milan-Paris edge being deemed significant and included in the SDSM backbone because its SDSM p value \(0.0275 < \frac{0.12}{2}\). Because this more liberal significance level results in the inclusion of additional edges, the new SDSM backbone extracted at \(\alpha = 0.12\) has a density (0.01) that is closer to that of the FDSM backbone extracted at \(\alpha = 0.05\) (0.012).
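The decision rule applied to the Milan-Paris edge can be written out in a few lines. This is only a sketch of the comparison described in the text; the function name `edge_is_significant` is hypothetical, and the p values are those reported above.

```python
def edge_is_significant(upper_tail_p: float, alpha: float) -> bool:
    """Return True if an edge is retained in the backbone.

    As in the text, the upper-tail p value is compared to alpha / 2.
    """
    return upper_tail_p < alpha / 2


# Upper-tail p values for the Milan-Paris edge reported in the text.
p_fdsm = 0.0033
p_sdsm = 0.0275

print(edge_is_significant(p_fdsm, alpha=0.05))  # FDSM at alpha = 0.05
print(edge_is_significant(p_sdsm, alpha=0.05))  # SDSM at alpha = 0.05
print(edge_is_significant(p_sdsm, alpha=0.12))  # SDSM at alpha = 0.12
```

The three calls reproduce the three outcomes discussed: retained by FDSM, dropped by SDSM at the conventional level, and retained by SDSM at the more liberal level.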
The purple line in Fig. 2B shows the mean Jaccard similarity between an FDSM backbone extracted using \(\alpha = 0.05\) and SDSM backbones extracted using \(0.01 \le \alpha \le 0.3\) from 100 bipartite networks generated to resemble the empirical GaWC network. The shaded purple region spans the 10th to 90th percentiles of the Jaccard similarities of these backbones. We find that these synthetic networks behave similarly to the empirical network. Specifically, SDSM and FDSM backbones extracted from a low-density \(196 \times 100\) bipartite network with right-tailed degree distributions achieve a maximum similarity of \(0.49 < J < 0.76\) when the FDSM backbone is extracted using \(\alpha = 0.05\) and the SDSM backbone is extracted using \(\alpha = 0.14\). This is promising because it suggests that, given the characteristics of an empirical bipartite network, it may be possible to select a significance level for extracting a computationally efficient SDSM backbone that closely resembles a computationally infeasible FDSM backbone.
Study 3: Backbone similarity under varying degree distributions
Agent and artifact degree distributions are a key feature of a bipartite network, and are known to have implications for bipartite projections^{15,27,28}. The FDSM is particularly appealing because it allows decisions about the significance of edges in a projection to be conditioned on both bipartite degree sequences, thereby taking into account these important features. However, because the computational requirements of the FDSM make it impractical for extracting the backbone from most bipartite projections, it is often necessary to use a different backbone model. In this study, we evaluate the similarity of an FDSM backbone and backbones extracted using more computationally efficient models. We perform this comparison for backbones extracted from bipartite networks characterized by five types of degree distributions: right-tailed, left-tailed, normal, constant, and uniform.
For the sake of concreteness, in this section we use the example of a bipartite network in which authors (agents) are linked to the papers they have written (artifacts). The projection of such a network yields a coauthorship network in which the edge weight between a pair of authors indicates their number of coauthored papers^{10}. These edge weight values will depend heavily on the distribution of papers written by authors (i.e., the agent degree sequence), and on the distribution of authors on each paper (i.e., the artifact degree sequence). Different degree distributions describe different kinds of scholarly environments as shown in Table 2. The choice of a backbone model affects whether these distributions are considered, and in this example affects whether decisions about the significance of two authors’ number of coauthored papers consider the scholarly environment. The FDSM compares their observed number of coauthored papers to the number that might be observed in alternative realizations of the same environment, while other backbone models relax the extent to which the environment is held constant.
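As a small illustration of how the projection and the two degree sequences relate, consider the following sketch. The matrix `B` is a hypothetical toy example (3 authors, 4 papers), not data from the study.

```python
import numpy as np

# Hypothetical toy bipartite matrix: 3 authors (rows) x 4 papers (columns).
# B[i, k] = 1 if author i is an author of paper k.
B = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

# Projection: P[i, j] is the number of papers coauthored by authors i and j.
P = B @ B.T

agent_degrees = B.sum(axis=1)     # papers written by each author
artifact_degrees = B.sum(axis=0)  # authors on each paper

print(P)
print(agent_degrees, artifact_degrees)
```

The diagonal of `P` equals the agent degree sequence, and each off-diagonal weight is bounded above by the smaller of the two authors' degrees, which is one way to see why edge weights depend heavily on the degree sequences.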
Methods
We evaluate similarities among the backbones extracted using different models by comparing backbones extracted from synthetic \(100 \times 100\) bipartite networks with a density of 0.1, and with a combination of agent and artifact degree distributions shown in Table 2. Following our example, these synthetic bipartite networks might represent a college of 100 faculty who collectively wrote 100 papers, in a particular type of scholarly environment where each individual had a 10% chance of being an author on each paper. After generating a bipartite network with a given size, density, and degree distributions, we extract five different backbones from the generated bipartite network, using the fixed fill model, fixed row model, fixed column model, stochastic degree sequence model, and fixed degree sequence model; in all cases we use \(\alpha = 0.05\). We compute the similarity of the first four backbones to the FDSM backbone using a Jaccard index, repeating this process 100 times for each of the 25 possible combinations of agent and artifact degree distributions.
Results
The heatmaps in Fig. 3 illustrate the similarity between an FDSM backbone and a backbone extracted using an alternative model. The rows of each heatmap correspond to different agent degree distributions, and the columns correspond to different artifact degree distributions, in the synthetic bipartite networks from which the backbones were extracted. The lightest patches identify conditions under which a given backbone model yields a backbone that is similar to what would be obtained using the computationally costly FDSM, while darker patches identify conditions under which these two backbones differ. We find that when agent degrees are constant (i.e., every agent has the same degree) and artifact degrees are constant or left-tailed, all backbone models yield the same backbone as FDSM (Mean \(J = 1\)). However, beyond this special case, which is likely to be rare in empirical data, similarity to FDSM-extracted backbones varies.
As expected, the similarity of backbones extracted using FRM and FDSM depends primarily on the distribution of artifact degrees, not agent degrees (see Fig. 3B). For example, for any agent degree distribution, these two models yield very different backbones when artifact degrees follow a right-tailed distribution (Mean \(J = 0.186\)), but very similar backbones when artifact degrees follow a normal distribution (Mean \(J = 0.863\)). This occurs because both models exactly control for agent degrees; however, FDSM also controls for artifact degrees, while FRM does not.
A similar but rotated pattern emerges when considering the FCM: the similarity of backbones extracted using FCM and FDSM depends primarily on the distribution of agent degrees, not artifact degrees (see Fig. 3C). For any artifact degree distribution, these two models yield very different backbones when agent degrees follow a right-tailed or uniform distribution (Mean \(J = 0.084\)), but more similar backbones when agent degrees follow a left-tailed distribution or are constant (Mean \(J = 0.617\)). This occurs because both models exactly control for artifact degrees; however, FDSM also controls for agent degrees, while FCM does not. There is a notable exception to this general pattern: when artifact degrees follow a uniform distribution, FCM and FDSM always yield different backbones (Mean \(J = 0.151\)).
The conditions under which the FFM yields FDSM-similar backbones occur at the intersection of the conditions under which the FRM and FCM both yield FDSM-like backbones (see Fig. 3A). When artifact degrees follow a right-tailed distribution or the agent degrees follow a right-tailed or uniform distribution, then FFM and FDSM backbones differ (Mean \(J = 0.1\)). In contrast, for other combinations of degree distributions, FFM and FDSM backbones are more similar (Mean \(J = 0.724\)).
Finally, as expected based on the findings from study 2, we observe that the SDSM generally yields different backbones than FDSM when both are extracted at \(\alpha = 0.05\) (see Fig. 3D). Specifically, except in the narrow case where agent degrees are constant and artifact degrees are constant or left-tailed (Mean \(J = 1\)), SDSM and FDSM backbones exhibit only modest similarity (Mean \(J = 0.314\)). This lack of similarity occurs because SDSM offers a less statistically powerful (or more conservative) test of edges’ statistical significance than FDSM, and therefore retains fewer edges in the backbone. However, findings from study 2 also suggested that careful selection of the significance level used for extracting an SDSM backbone can yield results more similar to FDSM.
To explore this possibility, we expanded the analysis reported in Fig. 3D by extracting SDSM backbones at different significance levels \(\alpha\). We find that when a suitably more liberal (i.e., larger) significance level \(\alpha\) is used to extract an SDSM backbone, the resulting SDSM backbone is very similar to an FDSM backbone extracted at \(\alpha = 0.05\) (see Fig. 4A). Specifically, for backbones extracted from bipartite networks with any agent or artifact degree distributions, these two backbones tend to be very similar (Mean \(J = 0.865\)). This suggests that in principle the fast SDSM can be used to obtain a close approximation of a computationally infeasible FDSM backbone from any bipartite network.
In practice, using SDSM to obtain an FDSM-like backbone requires selecting an \(\alpha\) value for the SDSM that corresponds to \(\alpha = 0.05\) in the FDSM. We observe that there are three distinct values of such an ‘optimal’ \(\alpha\) that depend on agent and artifact degree distributions (see Fig. 4B). First, when agent degrees are constant, a value only slightly higher than 0.05 (Mean \(= 0.062\), SD \(= 0.021\)) achieves the best approximation of an FDSM backbone. Second, when artifact degrees are constant, a value roughly double 0.05 (Mean \(= 0.09\), SD \(= 0.022\)) achieves the best approximation of an FDSM backbone. Finally, when neither agent nor artifact degrees are constant, which is likely in most empirical bipartite networks, a value roughly 2.5 times as large as 0.05 (Mean \(= 0.13\), SD \(= 0.014\)) achieves the best approximation of an FDSM backbone. Although further work is needed to facilitate the a priori selection of an \(\alpha\) that allows an SDSM backbone to closely approximate an \(\hbox {FDSM}_{\alpha = 0.05}\) backbone, these results suggest that under the most common circumstances (i.e., when there is variation in degrees) \(\alpha \approx 0.13\) may be appropriate.
Study 4: Recovery of community structure
Studies 1–3 examine the backbones extracted from random bipartite networks; however, empirical bipartite networks are not random. Frequently they contain a block structure that implies a particular community structure in the bipartite projection. In this study, we evaluate the extent to which backbones extracted using different models reflect a known community structure that is encoded in the bipartite data from which they are extracted^{57}. Recent work has shown that FDSM, FRM, SDSM, and BiPCM (a canonical variant of FRM) yield backbones with similar community structures^{23}. Other work has shown that SDSM and FDSM backbones extracted from a bipartite network representing bill cosponsorship in the 114th session of the US Senate more clearly captured the hypothesized partisan community structure than an FRM backbone^{27}. We build on this prior work using synthetic data that is constructed to contain ground truth communities, which allows us to evaluate backbone models’ ability to recover true communities, and not simply similar or hypothesized ones.
Methods
We investigate the ability of backbones to recover a known community structure in three steps. First, we simulate a \(200 \times 1000\) bipartite network with a density of 0.1 and right-tailed agent and artifact degree distributions. We focus on a bipartite network with more artifacts than agents to ensure that these data contain sufficient information to encode potential community memberships. We focus on a bipartite network with right-tailed degree distributions because they are common in many empirical unipartite^{58} and bipartite networks^{1,11,28}. This synthetic bipartite network could represent a legislative body composed of 200 legislators casting votes on 1000 bills, where any given legislator had a 10% chance of voting in favor of any given bill. The right-tailed degree distributions capture the fact that most legislators vote in favor of only a few bills, and that most bills receive the support of only a few legislators, which is typical of legislative bodies. The backbone of a projection of such a bipartite network would represent a network of collaboration or ideological alignment among legislators^{1}.
Second, we incorporate evidence of communities in this bipartite network by randomly assigning each agent and each artifact to one of two groups. We then perform checkerboard swaps, which preserve the degree distributions, until a given fraction W of edges is within-group, that is, connects an agent and an artifact from the same group^{59}. Figure 5A provides graphical depictions of the matrices describing synthetic bipartite networks at two values of W. In each plot, the rows represent agents assigned to group A or B, the columns represent artifacts assigned to group A or B, and a cell is shaded black if the row agent is connected to the column artifact. When \(W = 0.5\), agents in a given group are equally likely to associate with artifacts in either group, placing \(\approx 0.5\) of the edges (i.e., shaded cells) in the diagonal blocks and \(\approx 0.5\) of the edges in the off-diagonal blocks. In contrast, when \(W = 0.8\), agents in a given group are much more likely to associate with artifacts from their own group than artifacts in the other group, placing \(\approx 0.8\) of the edges in the diagonal blocks and \(\approx 0.2\) of the edges in the off-diagonal blocks. Returning to our example, the groups could represent political parties: each legislator belongs to one of two parties (i.e., there are conservative and liberal legislators), and each bill advances the agenda of one of these parties (i.e., there are conservative and liberal bills). When \(W = 0.5\), a conservative legislator is equally likely to vote for conservative and liberal bills, while when \(W = 0.8\), a conservative legislator is four times more likely to vote for a conservative bill than a liberal bill.
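The following sketch illustrates how checkerboard swaps can raise the within-group fraction W while leaving every row and column sum unchanged. It is an illustration under stated assumptions, not the procedure used in the study: all function names are hypothetical, the toy network is much smaller than \(200 \times 1000\), and the acceptance rule (keep a swap only if it increases W) is an assumption, since the text does not spell out how swaps are selected.

```python
import numpy as np


def within_group_fraction(B, agent_group, artifact_group):
    """Fraction W of edges joining an agent and an artifact in the same group."""
    same = agent_group[:, None] == artifact_group[None, :]
    return B[same].sum() / B.sum()


def checkerboard_swap(B, r1, r2, c1, c2):
    """Flip a 2x2 checkerboard in place; return True if a swap was made.

    A checkerboard is a 2x2 submatrix equal to [[1, 0], [0, 1]] or
    [[0, 1], [1, 0]]. Flipping it leaves all row and column sums unchanged.
    """
    sub = B[np.ix_([r1, r2], [c1, c2])]
    if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
        B[np.ix_([r1, r2], [c1, c2])] = 1 - sub
        return True
    return False


def rewire_to_target(B, agent_group, artifact_group, target_w, rng, max_tries=20_000):
    """Apply W-increasing checkerboard swaps until target_w is reached.

    Assumption: a proposed swap is kept only if it increases W; the loop
    also stops after max_tries proposals even if the target is not reached.
    """
    B = B.copy()
    n_rows, n_cols = B.shape
    for _ in range(max_tries):
        w = within_group_fraction(B, agent_group, artifact_group)
        if w >= target_w:
            break
        r1, r2 = rng.choice(n_rows, size=2, replace=False)
        c1, c2 = rng.choice(n_cols, size=2, replace=False)
        swapped = checkerboard_swap(B, r1, r2, c1, c2)
        if swapped and within_group_fraction(B, agent_group, artifact_group) <= w:
            checkerboard_swap(B, r1, r2, c1, c2)  # undo a non-improving swap
    return B


rng = np.random.default_rng(0)
B0 = (rng.random((20, 40)) < 0.3).astype(int)  # hypothetical toy network
agent_group = np.arange(20) % 2                # two groups of agents
artifact_group = np.arange(40) % 2             # two groups of artifacts

B1 = rewire_to_target(B0, agent_group, artifact_group, target_w=0.8, rng=rng)
print(within_group_fraction(B0, agent_group, artifact_group))
print(within_group_fraction(B1, agent_group, artifact_group))
```

Because each accepted swap only relocates two edges within the same two rows and two columns, the agent and artifact degree sequences of `B1` are identical to those of `B0`, while W can only stay the same or increase.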
Finally, we extract a backbone from the bipartite network using a given model and compute the backbone’s modularity Q with respect to the agents’ group assignments^{60}. If a backbone model is able to recover the community structure from evidence in the bipartite network, then we expect a positive association between W and Q. In the legislative example, if legislators are bipartisan in their voting patterns (i.e., \(W = 0.5\)), then legislators should not be clustered by party in the backbone (i.e., \(Q \approx 0\)). In contrast, if legislators are strongly partisan in their voting patterns (i.e., \(W = 0.8\)), then legislators should be clustered by party in the backbone (i.e., \(Q \gg 0\)).
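For reference, modularity with respect to a fixed partition can be computed as in the sketch below. It assumes the standard Newman-Girvan definition applied to a binary, undirected, unsigned backbone; the function name `modularity` and the toy network are illustrative and are not taken from the study's code.

```python
import numpy as np


def modularity(A, groups):
    """Newman-Girvan modularity Q of an undirected network for a given partition.

    Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * [g_i == g_j]
    """
    A = np.asarray(A, dtype=float)
    groups = np.asarray(groups)
    two_m = A.sum()  # twice the number of edges in an undirected network
    if two_m == 0:
        return 0.0   # no edges: modularity is taken to be zero
    k = A.sum(axis=1)
    same = groups[:, None] == groups[None, :]
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)


# Toy backbone: two disconnected triangles (nodes 0-2 and nodes 3-5).
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1

print(modularity(A, [0, 0, 0, 1, 1, 1]))  # partition matches the triangles
print(modularity(A, [0, 0, 0, 0, 0, 0]))  # single group
```

When the partition matches the two triangles, Q is close to 0.5; when all nodes are placed in a single group, Q is close to 0, which corresponds to the \(Q \approx 0\) case described above.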
We repeat these three steps 10 times for \(0.5 \le W \le 0.8\) in 0.05 increments. When evaluating the SDSM backbone, we consider both a backbone extracted using the conventional significance level of \(\alpha = 0.05\) and one extracted at the more liberal \(\alpha = 0.13\), which study 3 suggests yields a backbone similar to FDSM.
Results
Figure 5B shows the modularity (y-axis; with respect to known community memberships) of backbones extracted using different models from bipartite networks containing different fractions of within-community edges (x-axis). Solid lines illustrate the mean modularity across 10 replications, while the shaded regions illustrate 95% confidence intervals. All six lines increase monotonically, confirming that all backbone models yield backbones that can recover a known community structure; however, there is notable variation among the models. As evidence of community structure grows stronger in the bipartite network, the modularity of backbones extracted using the FFM and FCM increases slowly, but even when the evidence of such a structure is quite strong (i.e., when \(W = 0.8\)) they only achieve average values of \(Q = 0.15\) and 0.18, respectively. Backbones extracted using the FRM display a similar pattern, but achieve a statistically significantly higher average modularity (\(Q = 0.39\)) when W is large.
Backbones extracted using FDSM and SDSM yield modularity values that are statistically significantly larger than those obtained from FFM, FRM, or FCM backbones, but that are not statistically significantly different from each other. That is, these backbone models are indistinguishable in their ability to recover the known community structure, and do so very well. As evidence of a community structure grows stronger in the bipartite network, the modularity of backbones extracted using these models rapidly increases. When the evidence of community structure is strong (i.e., when \(W = 0.8\)), these backbones have very high modularity (mean \(Q = 0.49\)). However, even when there is only modest evidence of community structure in the bipartite network (e.g., when \(W = 0.65\)), these backbones are still able to identify the community structure and have a distinctively high modularity (mean \(Q = 0.37\)).
These findings suggest that although all backbone models can yield backbones that recover a known community structure, SDSM and FDSM backbones are able to detect this structure more clearly and from a weaker signal.
Discussion
Bipartite networks can be used to represent a wide range of phenomena in the social and natural worlds including interspecies competition, global trade, scientific advances, and legislative deliberation. Likewise, projections of bipartite networks, which take the form of co-occurrence networks, can be useful for inferring unipartite networks whose edges would otherwise be difficult to measure directly. The fixed degree sequence model (FDSM) offers an appealing null model for making such inferences, but its computational complexity often makes it impractical. Several computationally simpler alternatives to FDSM have been proposed, including the fixed fill model (FFM), fixed row model (FRM), fixed column model (FCM), and stochastic degree sequence model (SDSM). In this paper we have systematically compared FDSM to each of these alternatives to evaluate their accuracy, speed, statistical power, backbone similarity, and ability to recover a known community structure.
In study 1, we examined several methods for choosing the probabilities used by the stochastic degree sequence model (SDSM), finding that the bipartite configuration model (BiCM) is both the fastest and most accurate. In study 2, we examined the statistical power of the SDSM relative to the fixed degree sequence model (FDSM), finding that the SDSM can be viewed as a statistically less powerful (or more conservative) variant of the FDSM. In study 3, we examined the similarity of an FDSM-extracted backbone to backbones extracted using other models, finding that the SDSM and FDSM extract very similar backbones from bipartite networks with a wide range of possible degree distributions when an appropriate significance level \(\alpha\) is chosen. Finally, in study 4, we examined the ability of backbones extracted using different models to recover a known community structure, finding that although all models yield a backbone that recovers the structure, SDSM and FDSM can detect a community structure more clearly and from a weaker signal.
Based on these findings, and with the goal of offering researchers some guidance in extracting the backbones of bipartite projections, we offer three recommendations. First, we recommend the stochastic degree sequence model (SDSM) for extracting the backbones of bipartite projections because it is fast, controls for both agent and artifact degree sequences, and yields modular backbones when the bipartite data contain even modest evidence of within-community clustering. Second, when the SDSM is used, we recommend that the cell-filling probabilities \(p^*_{ik}\) be chosen using the Bipartite Configuration Model (BiCM) because it is faster and more accurate than any other currently available method. Third, when an FDSM backbone extracted at the \(\alpha = 0.05\) significance level is desired but computationally infeasible, we recommend extracting an SDSM backbone at the \(\alpha = 0.13\) significance level, which we observe yields a very similar backbone when there is variation in the agent and artifact degree sequences. The models and options necessary to adopt these recommendations are implemented in the backbone package for R^{27}.
These findings and recommendations must be viewed in light of the fact that, due to the computational requirements of the FDSM and of extracting a large number of backbones across the four studies, these studies have relied on small synthetic bipartite networks ranging in size from \(3 \times 3\) (study 1) to \(200 \times 1000\) (study 4). However, in practice bipartite networks may be several orders of magnitude larger. For example, a bipartite network used to infer collaborations in the US House of Representatives includes 435 agents (representatives) and over 6000 artifacts (bills)^{1,55}, while a bipartite network used to infer movie recommendations includes 17,770 agents (films) and nearly 500,000 artifacts (viewers)^{21}. Future research should explore whether these findings extend to backbones extracted from such large bipartite networks. Limitations of existing backbone models also point to directions for future research. First, using the FDSM will generally be computationally infeasible in practice because the distribution of \(P^*_{ij}\) arising from \({\mathscr {B}}^{{\text{FDSM}}}\) must be estimated via numerical simulation. Identifying this distribution’s probability mass function, which is known for the other ensembles (see Supplementary Text S1), would facilitate the use of this otherwise attractive model. Second, all the ensemble models we have considered impose constraints on the degree sequences, but other types of constraints may also be useful. For example, in some contexts it may be necessary to constrain all members of an ensemble to contain a 0 in a particular cell (e.g., to represent that an author was not alive to coauthor a paper, or a legislator was not present to cosponsor a bill)^{61}. These limitations and future directions notwithstanding, the results presented above provide a starting point for further development of backbone models, and provide applied researchers with some practical guidance on model selection.
Code availability
All code necessary to replicate these analyses is available at https://osf.io/m4yfd/. The backbone package used to perform the analyses is available for R from CRAN, and can be installed by typing install.packages("backbone") in the R console.
References
Neal, Z. P. A sign of the times? Weak and strong polarization in the US Congress, 1973–2016. Soc. Netw. 60, 103–112 (2020).
Fowler, J. H. Legislative cosponsorship networks in the US House and Senate. Soc. Netw. 28, 454–465 (2006).
Saracco, F., Di Clemente, R., Gabrielli, A. & Squartini, T. Randomizing bipartite networks: The case of the world trade web. Sci. Rep. 5, 1–18 (2015).
Di Clemente, R., Strano, E. & Batty, M. Urbanization and economic complexity. Sci. Rep. 11, 1–10 (2021).
Simmons, B. I. et al. bmotif: A package for motif analyses of bipartite networks. Methods Ecol. Evol. 10, 695–701 (2019).
Diamond, J. M. Assembly of species communities. In Ecology and Evolution of Communities (eds Cody, M. L. & Diamond, J. M.) 342–444 (Harvard University Press, Harvard, 1975).
Taylor, P. J., Catalano, G. & Walker, D. R. Measurement of the world city network. Urban Stud. 39, 2367–2376 (2002).
Straka, M. J., Caldarelli, G. & Saracco, F. Grand canonical validation of the bipartite international trade network. Phys. Rev. E 96, 022306 (2017).
Saracco, F. et al. Inferring monopartite projections of bipartite networks: An entropy-based approach. New J. Phys. 19, 053022 (2017).
Newman, M. E. Scientific collaboration networks. I. Network construction and fundamental results. Phys. Rev. E 64, 016131 (2001).
Ahn, Y.-Y., Ahnert, S. E., Bagrow, J. P. & Barabási, A.-L. Flavor network and the principles of food pairing. Sci. Rep. 1, 1–7 (2011).
Tollefson, J. Tracking QAnon: How Trump turned conspiracy-theory research upside down. Nature 590, 192–193 (2021).
Radhakrishnan, S., Erbis, S., Isaacs, J. A. & Kamarthi, S. Novel keyword co-occurrence network-based methods to foster systematic reviews of scientific literature. PLoS ONE 12, e0172778 (2017).
Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, 1–43 (2005).
Vasques Filho, D. & O’Neale, D. R. J. Transitivity and degree assortativity explained: The bipartite structure of social networks. Phys. Rev. E 101, 052305. https://doi.org/10.1103/PhysRevE.101.052305 (2020).
Guillaume, J.-L. & Latapy, M. Bipartite structure of all complex networks. Inf. Process. Lett. 90, 215–221 (2004).
Newman, M. E. & Park, J. Why social networks are different from other types of networks. Phys. Rev. E 68, 036122 (2003).
Neal, Z. P. The backbone of bipartite projections: Inferring relationships from co-authorship, co-sponsorship, co-attendance and other co-behaviors. Soc. Netw. 39, 84–97 (2014).
Serrano, M. Á., Boguná, M. & Vespignani, A. Extracting the multiscale backbone of complex weighted networks. Proc. Natl. Acad. Sci. 106, 6483–6488 (2009).
Dianati, N. Unwinding the hairball graph: Pruning algorithms for weighted complex networks. Phys. Rev. E 93, 012304 (2016).
Zweig, K. A. & Kaufmann, M. A systematic approach to the one-mode projection of bipartite graphs. Soc. Netw. Anal. Min. 1, 187–218 (2011).
Tumminello, M., Miccichè, S., Lillo, F., Piilo, J. & Mantegna, R. N. Statistically validated networks in bipartite complex systems. PLoS ONE 6, e17994 (2011).
Cimini, G., Carra, A., Didomenicantonio, L. & Zaccaria, A. Meta-validation of bipartite network projections. arXiv preprint arXiv:2105.03391 (2021).
Sanderson, J. G. Testing ecological patterns. Am. Sci. 88, 332 (2000).
Gotelli, N. J. Null model analysis of species co-occurrence patterns. Ecology 81, 2606–2621 (2000).
Neal, Z. P. & Neal, J. W. Out of bounds? The boundary specification problem for centrality in psychological networks. Psychol. Methods. https://doi.org/10.1037/met0000426 (2021).
Domagalski, R., Neal, Z. P. & Sagan, B. backbone: An R package for extracting the backbone of bipartite projections. PLoS ONE 16, e0244363 (2021).
Neal, Z. P., Domagalski, R. & Sagan, B. Analysis of spatial networks from bipartite projections using the R backbone package. Geogr. Anal. https://doi.org/10.1111/gean.12275 (2021).
Latapy, M., Magnien, C. & Del Vecchio, N. Basic notions for the analysis of large two-mode networks. Soc. Netw. 30, 31–48 (2008).
Derudder, B. & Taylor, P. The cliquishness of world cities. Glob. Netw. 5, 71–91 (2005).
Fong, C. Expertise, networks, and interpersonal influence in congress. J. Polit. 82, 269–284 (2020).
Bratton, K. A. & Rouse, S. M. Networks in the legislative arena: How group dynamics affect cosponsorship. Legis. Stud. Q. 36, 423–460 (2011).
Strona, G., Ulrich, W. & Gotelli, N. J. Bidimensional null model analysis of presence-absence binary matrices. Ecology 99, 103–115 (2018).
Barvinok, A. On the number of matrices and a random matrix with prescribed row and column sums and 0–1 entries. Adv. Math. 224, 316–339 (2010).
Barré, J. & Gonçalves, B. Ensemble inequivalence in random graphs. Physica A 386, 212–218 (2007).
Touchette, H. Equivalence and nonequivalence of ensembles: Thermodynamic, macrostate, and measure levels. J. Stat. Phys. 159, 987–1016 (2015).
Squartini, T., de Mol, J., den Hollander, F. & Garlaschelli, D. Breaking of ensemble equivalence in networks. Phys. Rev. Lett. 115, 268701 (2015).
Bruno, M., Saracco, F., Garlaschelli, D., Tessone, C. J. & Caldarelli, G. The ambiguity of nestedness under soft and hard constraints. Sci. Rep. 10, 1–13 (2020).
Strona, G., Nappo, D., Boccacci, F., Fattorini, S. & San-Miguel-Ayanz, J. A fast and unbiased procedure to randomize ecological binary matrices with fixed row and column totals. Nat. Commun. 5, 4114 (2014).
Carstens, C. J. Proof of uniform sampling of binary matrices with fixed row sums and column sums for the fast curveball algorithm. Phys. Rev. E 91, 042812 (2015).
Stegbauer, C. & Rausch, A. How international are international congresses?. Connections 32, 1–11 (2012).
Derudder, B. & Liu, X. How international is the annual meeting of the Association of American Geographers? A social network analysis perspective. Environ. Plan A 48, 309–329 (2016).
Coppersmith, D. & Winograd, S. Matrix multiplication via arithmetic progressions. J. Symb. Comput. 9, 251–280 (1990).
Neal, Z. P. Identifying statistically significant edges in onemode projections. Soc. Netw. Anal. Min. 3, 915–924 (2013).
Chen, X. et al. BNPMDA: Bipartite network projection for miRNA-disease association prediction. Bioinformatics 34, 3178–3186 (2018).
Liebig, J. & Rao, A. Fast extraction of the backbone of projected bipartite networks to aid community detection. Europhys. Lett. 113, 28003 (2016).
Schoch, D. & Brandes, U. Legislators’ roll-call voting behavior increasingly corresponds to intervals in the political spectrum. Sci. Rep. 10, 1–9 (2020).
Aref, S. & Neal, Z. P. Detecting coalitions by optimally partitioning signed networks of political collaboration. Sci. Rep. 10, 1–10 (2020).
Aref, S. & Neal, Z. P. Identifying hidden coalitions in the U. S. House of Representatives by optimally partitioning signed networks based on generalized balance. Sci. Rep. 11, 19939 (2021).
Buerger, A. N. et al. Gastrointestinal dysbiosis following diethylhexyl phthalate exposure in zebrafish (danio rerio): Altered microbial diversity, functionality, and network connectivity. Environ. Pollut. 265, 114496 (2020).
Marini, F., Ludt, A., Linke, J. & Strauch, K. GeneTonic: An R/Bioconductor package for streamlining the interpretation of RNA-seq data. bioRxiv (2021).
Becatti, C., Caldarelli, G. & Saracco, F. Entropy-based randomization of rating networks. Phys. Rev. E 99, 022306 (2019).
Chung, F. & Lu, L. Connected components in random graphs with given expected degree sequences. Ann. Comb. 6, 125–145 (2002).
Allison, P., Williams, R. A. & von Hippel, P. Better predicted probabilities from linear probability models with applications to multiple imputation. In 2020 Stata Conference, 1 (Stata Users Group, 2020).
Neal, Z. P., Domagalski, R. & Yan, X. Homophily in collaborations among US House of Representatives, 1981–2018. Soc. Netw. 68, 97–106 (2022).
Bruno, M. Bicm package. https://github.com/mat701/BiCM (2021).
Cann, T. J., Weaver, I. S. & Williams, H. T. Is it correct to project and detect? Assessing performance of community detection on unipartite projections of bipartite networks. In International Conference on Complex Networks and their Applications, 267–279 (Springer, 2018).
Broido, A. D. & Clauset, A. Scale-free networks are rare. Nat. Commun. 10, 1–10 (2019).
Guimerà, R., Sales-Pardo, M. & Amaral, L. A. N. Module identification in bipartite and directed networks. Phys. Rev. E 76, 036102 (2007).
Newman, M. E. & Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
Snijders, T. A. Enumeration and simulation methods for 0–1 matrices with given marginals. Psychometrika 56, 397–417 (1991).
Acknowledgements
This work was supported by the National Science Foundation (#1851625 & #2016320) and the Michigan State University Center for Business and Social Analytics.
Author information
Contributions
Z.P.N. conceived the research questions, designed and conducted the analysis, wrote the first draft, and prepared the revisions. R.D. and Z.P.N. wrote the backbone package. B.S. wrote the proofs. All authors analysed the results and reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Neal, Z. P., Domagalski, R. & Sagan, B. Comparing alternatives to the fixed degree sequence model for extracting the backbone of bipartite projections. Sci. Rep. 11, 23929 (2021). https://doi.org/10.1038/s41598-021-03238-3
DOI: https://doi.org/10.1038/s41598-021-03238-3