Comparing alternatives to the fixed degree sequence model for extracting the backbone of bipartite projections

Projections of bipartite or two-mode networks capture co-occurrences, and are used in diverse fields (e.g., ecology, economics, bibliometrics, politics) to represent unipartite networks. A key challenge in analyzing such networks is determining whether an observed number of co-occurrences between two nodes is significant, and therefore whether an edge exists between them. One approach, the fixed degree sequence model (FDSM), evaluates the significance of an edge’s weight by comparison to a null model in which the degree sequences of the original bipartite network are fixed. Although the FDSM is an intuitive null model, it is computationally expensive because it requires Monte Carlo simulation to estimate each edge’s p value, and therefore is impractical for large projections. In this paper, we explore four potential alternatives to FDSM: fixed fill model, fixed row model, fixed column model, and stochastic degree sequence model (SDSM). We compare these models to FDSM in terms of accuracy, speed, statistical power, similarity, and ability to recover known communities. We find that the computationally-fast SDSM offers a statistically conservative but close approximation of the computationally-impractical FDSM under a wide range of conditions, and that it correctly recovers a known community structure even when the signal is weak. Therefore, although each backbone model may have particular applications, we recommend SDSM for extracting the backbone of bipartite projections when FDSM is impractical.


S1 Probability Mass Functions of projection edge weights under ensemble backbone models
In the subsections below, we derive the probability mass functions of P * ij used by ensemble backbone models to evaluate the statistical significance of the weight of edge P ij in a bipartite projection. We use the following notation: • Let B be an m × n bipartite matrix, with a vector of row sums R = (r 1 , . . . , r m ), a vector of column sums C = (c 1 , . . . , c n ), and f cells containing a 1. So • Let B M be the ensemble of all m × n matrices B * = (B * ij ) that obey the constraints of the respective model. In all models, the probability distribution on B M is uniform except in the stochastic case.
• Let P * ij be a random variable equal to (B * B * T ) ij for all B * ∈ B M . Note that we have (1)

S1.1 Fixed Fill Model (FFM)
Let the fixed fill model constrain all B * ∈ B FFM to contain the same number of 1s (i.e. fill) as B.
Theorem S1.1. Under the fixed fill model, the distribution of P * ij for i = j satisfies Pr(P * ij = k) = n k r 2 n−k−r n − k r (m − 2)n f − n − k + r mn f . (2) Proof. For the denominator we need to compute the cardinality #B FFM . If B * ∈ B FFM then B * has mn entries of which f must be chosen to be ones. So For the numerator, suppose P * ij = k. We see from equation (1) that there are exactly k columns c where B * ic = B * jc = 1. There are n k ways to choose these columns. Now define the following parameters: p = number of columns c where B * ic = 1 and B * jc = 0, q = number of columns c where B * ic = 0 and B * jc = 1, r = number of columns c where B * ic = 0 and B * jc = 0.
The number of ways to pick the columns counted by these parameters from the n − k columns which do not contains ones in both rows is the trinomial coefficients n−k p,q,r . Now we have used 2k + p + q ones in rows i and j. So there are f − 2k − p − q left to distribute to the remaining m − 2 rows. And these rows have (m − 2)n entries. So the number of possibilities for these remaining ones is (m−2)n f −2k−p−q . Thus the total number of choices from this and the previous paragraph is For even modestly large B, computing equation (2) involves values larger than can be handled by some programs. In practice, we use logs to make these computations practical.
We now show that the sum in the numerator of this probability is related to the famous Jacobi orthogonal polynomials. This sum is a terminating hypergeometric series. Given a real number a and a nonnegative integer r the corresponding Pochhammer symbol or rising factorial is (a) r = a(a + 1)(a + 2) · · · (a + r − 1).
Note that if a is an integer with −r < a ≤ 0 then (a) r = 0 because the product contains 0 as a factor. Given real numbers a 1 , a 2 , . . . , a p and b 1 , b 2 , . . . , b q as well as a variable z, the corresponding hypergeometric series is Note that if any of the a i are negative integers then, because of the remark above, this series will terminate and become a polynomial in z.
To convert a binomial coefficient into Pochhammer symbols, we write The following identity will also be useful (a) b+r = (a)(a + 1) · · · (a + b − 1) × (a + b)(a + b + 1) · · · (a + b + r − 1) We now return to the sum in the numerator of equation (2). We will ignore the factor of 2 n−k since it is constant with respect to the sum and so can be pulled outside. For simplicity of calculation we will also use the substitutions s = (m − 2)n, t = f − n − k.
Thus we have We are indebted to Marko Petkovšek [personal communication] for pointing out that this 2 F 1 is, up to a factor, a specialization of a Jacobi polynomial. Given a nonnegative integer and real numbers α, β the associated Jacobi polynomial is To make these 2 F 1 polynomials agree we can let = n − k, and z = 0. With these substitutions we get

S1.2 Fixed Row Model (FRM)
Let the fixed row model constrain all B * ∈ B FRM to have the same row sums as B.
Theorem S1.2. Under the fixed row model, the distribution of P * ij for i = j is hypergeometric and satisfies Proof. The total number of ways to pick r i of the n columns for ones in the ith row and r j of the n columns for ones in the jth row is So that will go in the denominator of the desired probability.
For the numerator we follow the same line of reasoning as in the previous proof, where the parameters therein can be expressed as choices.

S1.3 Distribution of projection edge weights under the Fixed Column Model (FCM)
Let the fixed column model constrain all B * ∈ B FCM to have the same column sums as B.
Let X 1 , . . . , X n be independent Bernoulli random variables. Let the probability of success for X i be The random variable is said to have the Poisson binomial distribution with parameters p 1 , . . . , p n .
Theorem S1.3. Under the fixed column model, the distribution of P * ij for i = j is Poisson binomial with parameters Proof. The B * ik are all either zero or one and are independent in different columns when only the column sums are fixed. So as k varies, the products B * ik B * jk are independent Bernoulli random variables. Comparing equations (1) and (5), we see that the distribution of P * ij is Poisson binomial.
If column k has column sum c = c k then all zero-one vectors with sum c are equally likely for that column of B * . So there are m c possible kth columns. The number of ways to have a success is the number of possible columns which have ones in both positions i and j where i = j. So the number of choices is the number of ways to choose the remaining c − 2 ones in that column from the other m − 2 positions, that is, m−2 c−2 . Thus which finishes the demonstration.

S1.4 Stochastic Degree Sequence Model (SDSM)
In the stochastic degree sequence model, B SDSM consists of all binary m × n matrices. A method is then chosen to generate probabilities p * ik . Finally, matrices B * ∈ B SDSM are generated using these probabilities for independent Bernoulli trials, where B * ik is filled with a one with probability p * ik and zero otherwise.
Theorem S1.4. Under the stochastic degree sequence model, the distribution of P * ij for i = j is Poisson binomial with parameters p 1 = p * i1 p * j1 , . . . , p n = p * in p * jn .
Proof. The fact that the distribution is Poisson binomial follows immediately from the independence assumption on the Pr(B * ik ) and equation (1). Furthermore, the probability that the kth variable is one is p k = Pr(B * ik B * jk = 1) = Pr(B * ik = 1) Pr(B * jk = 1) = p * ik p * jk . So we are done.

S2 Familywise error rates in backbone extraction
When testing the hypothesis that an observed statistic s is different from what would be expected at random (i.e. under a given null model), the researcher must specify a significance level α. The researcher then computes the probability p of observing a value greater than or equal to s under the null model. The null hypothesis is rejected and the alternative hypothesis is supported if p < α. When only one hypothesis is being tested, this procedure ensures a Type-I error rate -a false positive, or the risk of rejecting the null hypothesis when it is true -of α.
In the context of backbone extraction, the 'statistic' s is the number of co-occurrences between two agent nodes or the edge weight in the bipartite projection, and the 'null model' is defined by the chosen bipartite ensemble backbone models. When deciding whether a given edge should be included in the backbone, the researcher is testing a single hypothesis where the null hypothesis is that the edge's weight is no stronger than would be expected under the null model. If the null hypothesis is rejected, then the edge is included in the backbone. Committing a Type-I error in this context results in including the edge in the backbone when it should be excluded (i.e. a false positive).
When multiple independent hypotheses are tested simultaneously, the Type-I error rate is inflated. Specifically, the familywise error rateᾱ -the risk of making one or more Type-I errors -is where t is the number of independent tests. For example, if the Type-I error rate for each hypothesis test is α = 0.05, and t = 100 independent tests are performed, thenᾱ = 1 − (1 − 0.05) 100 = 0.995. That is, it is virtually guaranteed that at least one Type-I error will be committed in these 100 hypothesis tests. Because extracting a backbone requires the researcher to conduct a hypothesis test for every edge (with non-zero weight) in the network, backbone extraction nearly always involves testing multiple independent hypotheses.
Many different procedures exist for controllingᾱ when multiple independent hypothesis tests are conducted. All of these procedures involve using a corrected significance level α * for each hypothesis test so thatᾱ is maintained at the desired tolerance for Type-I error. The simplest but also most conservative approach is the Bonferroni correction, which defines α * = α t . Other less conservative and more powerful corrections include the Holm-Bonferroni correction [Holm, 1979] which has been used to extract the backbone of a political network [Aref and Neal, 2021], and the False Discovery Rate [FDR; Benjamini and Hochberg, 1995] which has been used to extract the backbones of movie rating and international trade networks [Saracco et al., 2017]. These correction procedures, as well as several others, are available in the backbone package we use to extract backbones in our studies [Domagalski et al., 2021].
Using one of these procedures to controlᾱ is usually appropriate when extracting the backbone of a bipartite projection. Doing so is often straightforward because (1) many backbone models we consider (FRM, FCM, FFM, SDSM) yield exact p-values, and (2) the backbone package we use to extract backbones in our studies implements several different methods for correcting α * and thus controllingᾱ. However, for reasons we describe below, it is computationally infeasible to controlᾱ when extracting backbones using FDSM. While this represents a significant limitation to using FDSM backbones in practice, and is a key reason we are seeking alternatives, this is not a problem for our studies. Within each of our studies, the rate of Type-I error inflation is identical for all backbones, which means that uncorrected FDSM backbones can be compared to uncorrected non-FDSM backbones.

S2.1 Controlling FWER in FDSM backbones
It is computationally infeasible to controlᾱ when the backbone is extracted from a bipartite projection using FDSM. The challenge arises because each edge's p-value is estimated via a Monte Carlo procedure, and estimating these p-values with sufficient precision and confidence requires an impractically large number of Monte Carlo trials. In this section, we describe one way to estimate the required number of trials, and illustrate why controllingᾱ in FDSM backbones is impractical.
Because the associated probability mass function is unknown, when using FDSM to extract a backbone, the p-value of a given edge's weight (i.e. the probability that the same or larger edge weight would be observed under the null model) is estimated via Monte Carlo methods. Following N Monte Carlo trials in which a projection P * is constructed from a random B * ∈ B FDSM , the p-value of the edge between i and j is Therefore, the estimation of p ij is equivalent to estimating a proportion from a sample.
Determining the sample size required to estimate a proportion from a sample with a given error tolerance is a well-studied problem in statistical inference, under the heading of power analysis. Fleiss et al. [2013] show that the required minimum sample N to determine whether an estimated proportion P 1 differs from a hypothesized proportion P 0 , with a Type-I error rate of 1 and Type-II error rate of 2 , is where z represents the critical value corresponding to 1 or 2 in the standard normal distribution. Note that the Type-II error rate is the opposite of the Type-I error rate, the risk of failing to rejecting the null hypothesis when it is, in fact, false (i.e. a false negative). In the backbone context, committing a Type-II error results in excluding an edge from the backbone when it should be included. They further recommend performing a minor correction to arrive at a final estimate N With a small adaptation to their first expression, we can use these to estimate the required number of FDSM Monte Carlo trials. We wish to use it to determine the required minimum number of Monte Carlo samples N to determine whether an edge's estimated p-value p ij differs from our corrected significance level α * . Accordingly, we can re-write the expression given by Fleiss et al. [2013] as: Two examples serves to illustrate how this expression implies that an impractically large number of Monte Carlo trials will be required under even modest assumptions. Suppose we are using FDSM to extract the backbone from a projection of a 100 agent × 1000 artifact bipartite network, and we wish to maintain a familywise error rate ofᾱ = 0.05. If we assume that our bipartite projection will be dense (hence the need for extracting its backbone) and will not contain any zero-weight edges, then we must conduct 100(100−1) 2 = 4950 independent hypothesis tests. Using the Bonferroni correction for simplicity of illustration, this implies a corrected two-tailed significance level of α * = 0.05/4950 2 ≈ 0.000005 for each test. Further, assume that we are willing to tolerate a 5% risk of incorrectly including an edge (i.e. Type-I error, 1 = 0.05), and a 5% risk of incorrectly excluding an edge (i.e. Type-II error, 2 = 0.05), because both types of errors are equally problematic for graphs.
Under these assumptions, we can consider two scenarios. First, we can determine how many (additional) trials are necessary to make a decision about an edge whose statistical significance appears unambiguous after some number of initial trials. When it appears that p ij = 0, this represents a 'best case scenario' in which it should be relatively easy to reach a decision about whether the edge should be included in the backbone. We can compute the required number of trials as: Under this best case scenario, at least 733,695 Monte Carlo trials are required to reach a decision given our familywise error rate and tolerances for Type-I and Type-II errors. Recall that each Monte Carlo trial requires sampling one B * ∈ B FDSM using the curveball algorithm, then multiplying B * by its transpose. Although the running time of these two operations is relatively fast (approximately 0.07 seconds on the system we use to evaluate running times in Study 1), performing the required number of trials under this best case scenario would take around 14 hours.
Second, consider a more realistic scenario in which, after some initial number of trials, an edge's statistical significance is more ambiguous because p ij is near α * . For the sake of illustration, consider an edge whose p-value we initially estimate as p ij = 0.0000038, which appears smaller than α * = 0.000005, but is close and therefore riskier. In this case, N ≥ 29 845 088 (initial estimate) N ≥ 30 637 088 (adjusted estimate) Under this more realistic scenario, where the edge's statistical significance is not unambiguous, over 30 million Monte Carlo trials are required to reach a decision given our error tolerance. This would require a running time of approximately 25 days.