Generative Models for Global Collaboration Relationships

When individuals interact with each other and meaningfully contribute toward a common goal, it results in a collaboration. The artifacts resulting from collaborations are best captured using a hypergraph model, whereas the relation of who has collaborated with whom is best captured via an abstract simplicial complex (SC). We propose a generative algorithm GENESCs for SCs modeling fundamental collaboration relations. The proposed network growth process favors attachment that is preferential not to an individual’s degree, i.e., how many people has he/she collaborated with, but to his/her facet degree, i.e., how many maximal groups or facets has he/she collaborated within. Based on our observation that several real-world facet size distributions have significant deviation from power law–mainly since larger facets tend to subsume smaller ones–we adopt a data-driven approach. We prove that the facet degree distribution yielded by GENESCs is power law distributed for large SCs and show that it is in agreement with real world co-authorship data. Finally, based on our intuition of collaboration formation in domains such as collaborative scientific experiments and movie production, we propose two variants of GENESCs based on clamped and hybrid preferential attachment schemes, and show that they perform well in these domains.

Many large endeavors in society such as scientific discoveries and production of motion pictures are a result of collaboration. Typically, individuals collaborate to form teams, for example, a scientific paper is written jointly by a team of researchers. Also, smaller teams can collaborate to form larger groups. Examples of the latter include a movie production house containing teams of artists, directors, and crew; a disaster relief mission requiring interactions between teams of medical rescue workers, fire-fighters, and law enforcement officials with some common agents serving as gateways; and a major scientific discovery happening with the coming together of research over a series of papers, which typically have some common authors. The main goal of this paper is to understand the fundamental characteristics of the underlying global collaboration structures that exist in collaborative fields such as scientific research and movie production.
In modeling collaboration structures, the basic collaborative unit could either be the relation underlying the collaboration or the output or artifact from the collaboration (e.g. paper or movie). The difference in the resulting structure is best illustrated with a simple example. Suppose authors a, b, c, and d write three papers with authorships (a,b,c), (a,b), (c,d). Then a structure based on the collaboration artifact is identical to the set of papers, whereas one based on the collaboration relation is (a,b,c), (c,d). In other words, the collaboration relation structure ignores (a,b) since (a,b,c) already captures the fact that any subset of it, in particular (a,b), has collaborated. Previous studies of collaboration networks have overwhelmingly focused on the artifact-based structure [1][2][3][4][5][6][7][8][9] . The relation-based structure, is instead able to capture the social aspects of collaboration, which is interesting in its own right.
Hypergraphs (HG) are suitable for expressing "richer than pairwise" relationships between collaborators 2, 6 ; however, we believe they best model the artifacts of the collaboration-each by a hyperedge. The collaboration relation on the other hand is closed under the subset operation. A perfect match for succinctly capturing such a property is the abstract simplicial complexes (SC), which in simplest terms is a collection of sets closed under the subset operation.
The primary distinction between HGs and SCs is that in the case of the latter, a "simplex" of dimension k (modeling a (k + 1)-ary collaboration relation) subsumes all subset simplices of dimension k − 1, and so on recursively. Consequently, if an HG is used to model a collaboration relation, the distributions of sizes and degrees turn out to be non-trivially skewed compared to the relation structure itself, or equivalently, the SC representation thereof.
In this paper, we propose a new generative algorithm GENESCs for SCs that models the fundamental relations underlying large-scale collaboration. GENESCs is primarily based on preferential attachment-not with an individual's degree (how many people has he/she collaborated with) but rather with his/her facet degree (how many maximal groups or facets has he/she collaborated within). Unlike graphs, where a node's degree can capture its first order local connectivity properties, in SCs, there are two key metrics to consider-a node's facet degree and a facet's size. Based upon our observation that several real-world facet size distributions have significant deviation from power law-predominantly due to the fact that larger facets tend to subsume smaller ones-we adopt a data-driven approach. We "seed" GENESCs with a facet size distribution (from input data) and grow the SC randomly facet-by-facet to generate a final SC with a facet degree distribution that matches real data.
Note that we sample the facet size distribution as input instead of the artifact (hyperedge) size distribution since we want to generate the underlying SC, and not the Hypergraph, and there may be several discordant facet size distributions resulting from a single hyperedge size distribution based on how the collaboration relation is structured.
Our key contributions are summarized below: 1. Systematic characterization of the nature of subsumption in real world global collaboration networks.
(Introduction Section and Supplementary Information) 2. An efficient generative algorithm GENESCs to generate realistic SCs with matching facet degree distributions, given only their facet size distributions. (Methods Section.) 3. An analytic proof that the facet degree distribution of generated SCs is power law distributed, matching empirical studies, with an exponent α = + − 2 c s 1 1 , where c is the average facet density (ratio of the number of facets to the number of nodes) and s is the average facet size. cs is the average facet degree, and thus the average number of collaborations in which a node participates. (Results Section, Theorem 0.1) 4. Validation using empirical statistical analysis that the generated SCs have low Kolmogorov-Smirnov and Total-Variation distances from the real data. (Results Section) 5. Adaptations of the core facet-based preferential attachment kernel in GENESCs to model some observed variations in real world collaboration structures. (Methods Section) Interestingly, we demonstrate (and give analytical justification for the fact) that when GENESCs generates facets one after another with their sizes randomly drawn from the facet size distribution of the target real data set given as input, the probability of occurrence of subsumptions during this random growth process is negligible. This does not contradict our observation that subsumption phenomena is common in real collaboration artifact data. In reality, subsumptions occur over sequentially added hyperedges, whereas in GENESCs, we already start with the pre-subsumed facet-based representation, hence further distortion is not required. This feature of GENESCs is a valuable benefit by virtue of using facets as opposed to hyperedges in the sampling process.

Modeling Global Collaboration Relationships
Standard graphs are insufficient to capture group phenomena since they only model binary relations between individuals. A generalization of graphs, namely, the hypergraph has been proposed to address this shortcoming 2, 6 . A hypergraph H = (G, E) comprises a set of nodes V and hyper-edges ⊆ E 2 V to model higher order (or super-binary) relations.
Insights about the structure of a large collaboration network can be drawn by examining its "artifacts", i.e., papers, movies, etc., and the underlying distributions of hyperedge size (the number of nodes belonging to a hyperedge, e.g., number of co-authors in a paper) and hyper-degree (the hyper-degree of a node is the number of hyperedges it belongs to, e.g., number of movies an actor has acted in).
We believe that equally interesting insights can emerge from an understanding of the collaboration relation which focuses on the social aspects of the collaboration, namely the set of collaborating individuals. Such a question can be answered by examining the underlying higher-order global collaboration structure, which is not concerned about the specific products of the collaboration. For example, if a, b, and c have collaborated as a group (a, b, c), then sparser collaboration relationships (a, b) or (b, c), even if they occurred, do not add much value if our goal is to understand the number of maximal groups that a person has collaborated in.
Such information is indeed buried in the "collaboration artifact network", i.e., the hypergraph, but typically, statistical properties of the higher-order global collaboration structure cannot be trivially determined from those of the hypergraph. In the worst case, the representation complexity of hypergraphs grows exponentially, since k collaborating individuals can build as many as 2 k − 1 different artifacts. Since we are only interested in the fact that these k individuals collaborated on at least one project, the artifact network may be too unwieldy for analysis.
Abstract Simplicial Complexes. The basic structure of the underlying collaboration can be modeled by an abstract simplicial complex (SC). A set-system SC of non-empty finite subsets of a universal set S is an abstract simplicial complex if for every set ∈ X SC, and every non-empty subset ⊂ Y X, ∈ Y SC-thus the set-system is "closed" under subset operation 10 . Therefore, a simplicial complex captures the basic nature of the collaboration, i.e., who all have worked together on common tasks, instead of the artifacts that have been produced by such a collaboration, i.e., papers or movies.
Consider the example in Supplemental Information Fig. 1-suppose a, b, c, and d are four authors who have co-authored five papers among them including single-author papers-this co-authorship is denoted by a hypergraph consisting of five hyperedges: In previous work, we have shown how simplicial complexes can be effectively used to model collaboration networks [11][12][13] . For a collaborative group denoted by a facet, two basic metrics are facet size (how many people belong to that collaboration) and a node's facet degree (how many maximal collaborations or facets does that node belong to).

Modeling Subsumptions
The basic difference between hypergraphs and simplicial complexes can be explained by the phenomenon of subsumptions. In the previous example, hyperedges (a, b) and (b, c) get subsumed by the largest hyperedge (a, b, c), which is a facet in SC. Similarly, facet (a, d) subsumes (a).
In theory, subsumptions can be very pronounced. Consider a large research project with N participating faculty members. Consider the situation where each faculty member has one single-author paper, one paper written with every other faculty member, a paper with each set of two other distinct faculty members, and so on. Finally, assume that all of these N authors collaborate to write a joint paper together. Clearly, there are 2 N − 1 distinct hyperedges in total-N single author papers, ( ) k-author papers, in general. Since each node has exactly 2 N−1 hyperedges, the hyperedge degree distribution is given by the impulse function δ where δ i,j denotes the Kronecker delta function, and d i is the degree of researcher i. On the other hand, hyperedge size distribution is a non-monotonic function which is proportional to In contrast, in the simplicial complex representation, the largest collaboration is the only facet, so all nodes have facet degree 1, hence the facet degree distribution is δ d ,1 i . Moreover, the only facet has size N, hence the facet size distribution is δ s N , j , where s j is number of authors in paper j (See Fig. 1). There is significant discrepancy between the two distributions due to the intense degree of subsumption-since all faculty members collaborate with each other on one paper, the other smaller collaborations are directly implied by the former. Note that the deviation would be larger if there were multiple papers with exactly the same authors, since the hyperedge count would increase without affecting facet statistics.
The above example demonstrated a case where many different hypergraph instances associated with N nodes may map to only one simplicial complex, i.e. a many-to-one mapping between the set of different hypergraphs to one simplicial complex representation. Yet, one may think that once a specific hypergraph is given, the corresponding simplicial complex can be simply obtained from it by following the subset closure operation. However, in reality, while working with datasets, one typically expects only the distributional statistics to be given. We next demonstrate that starting from a given hyperedge size distribution one might end up with multiple simplicial complex representations with drastically different facet size distributions even for a fixed number of nodes. In fact, the total variation distance D TV (a commonly used distance metric to compare two different distributions, i.e., normalized statistical distance between two distributions which measures sum of differences over the support set, formally defined in the Results Section) among the various feasible non-isomorphic simplicial complexes might approach 1, which is the maximum value that can be defined between two distributions, as N → ∞.
Consider a given hyperedge size distribution f h (.). Let us denote the maximum hyperedge size in f h (.) by H, and assume that N = 2H − 1, where again N = |V| denotes the number of nodes. Next, consider a hypergraph consisting of a total of 2H + 1 hyperedges, with 2H − 1 hyperedges of size one, and one hyperedge each of size H − 1 and H. That is, . A small scale illustration of this scenario with N = 5 and H = 3 is given in Fig. 2. Given this distribution, it may be possible that the two large hyperedges are disjoint and do not possess any nodes in common, hence in the SC representation there is one   We note that we consider an evolving model of collaborations where a "facet" emerges (i.e., at time t it is indeed a facet in the Simplicial Complex) and remains a facet until it is subsumed at a later time t′ > t by a bigger facet modeling the occurrence of a collaboration involving a superset of the participants. Thus at time t′, in the resulting ultimate simplicial complex structure, the (older) subsumed facet does not exist anymore, as it merely becomes a "face" of the larger facet. Similarly, if at time t′ a collaboration occurs that involves a subset of participants of an older collaboration that had occurred at time t < t′, then the new collaboration is just a face of the older facet and is instantly subsumed by the latter. To that end, until the end of the growth process, we term every facet in the complex a transient facet, in the sense that it has a probability of getting subsumed. On the other hand, at the end of the growth process, any surviving facet will be a (permanent) facet.
While the above examples can be regarded as rather unlikely events, subsumption does occur quite frequently in real world collaboration. Figures 3a,b illustrate statistics for both hypergraph and SC models of DBLP co-authorship data. While the tail of the facet size distribution obeys a power law, the head, where significant probability mass is concentrated, does not. In particular, a significant number of singleton authors get subsumed by the larger collaborations they participate in. Additionally, the slopes of the facet degree and hyperedge degree distributions are different. This can also be attributed to subsumptions of smaller hyperedges by larger facets. Figures 3c,d illustrate the difference between hyperedge and facet statistics for another co-authorship data set (Physical Review D journal). Like in Fig. 3a, there is significant difference at the head of the size distributions, implying a significant frequency of subsumptions. Also, in both Fig. 3b and d, the tails of the facet degree distribution are shorter than those of the respective hyper-degree distributions. This can be attributed to the subsumption of many small hyperedges at the high-degree nodes, thus reducing the facet degree compared to their hyperedge degrees.
We now examine the nature of subsumptions in real collaboration datasets more closely. In Fig. 4, we plot nine real collaboration data sets, the number/percentage of subsumptions of facets of size i by facets of size j (where j ≥ i). Obviously, this is a lower triangular matrix where the rows indicate sizes of subsuming facets and columns indicate sizes of subsumed facets.
We observe that the nature of subsumptions varies across the nine data sets. In DBLP, IMDB (with regular movies and cast only), Phys Rev A, Phys Rev B, and Phys Rev E, small facets are subsumed (first few columns). This is intuitively expected, since it is likely that smaller subsets of a collaboration are also valid collaborations.
In Phys Rev D, Phys Rev L and Phys Rev C, there is a strong subsumption presence both on and off the diagonals. These subsumption events model scenarios where a significant number of individuals are collaborating on a task and the exact set of individuals collaborate again, with perhaps a few additional collaborators such as new graduate students joining a lab. These situations likely arise from the large endeavors typical of experimental physics where very large collaborations of laboratories result in a paper, as evidenced by Phys Rev D and L. In fact, for Phys Rev D, the diagonal has non-trivial mass even for collaboration sizes of 500, which have not been shown here.
Finally, the IMDB data set that includes both cast and crew of movies (c) is a class apart in the above trend taken to a much larger magnitude. This is because a core crew tends to get utilized by a director in multiple movies. As seen from these figures, such events occur often in reality, hence subsumption is an important issue to address when modeling the structure of the underlying global collaboration network.

Generative Growth Models for Networks Induced by Global Collaboration Relationships
Generative network growth models have received great interest over the past 15 years for classical (binary) graphs. The most prominent growth model has been preferential attachment, which has been demonstrated to result in graphs with node degrees following power law distributions 14 , i.e. k −γ , where γ ∈ [2,3]. Examples of other network growth models that have received attention are small-world models 15 , densification models 16 , and duplication models 17 , to name some.
In contrast, there has been a limited amount of work on generative models for group collaboration structures. In addition to the node degree distribution, which has been a focal metric for classical network growth models, for collaboration structures more complex than graphs, the hyperedge size distribution is key. Hebert-Dufresne et al. 6 proposed a generative algorithm called Structural Preferential Attachment (SPA) by progressively growing a hypergraph based on parameters that depend on the power law exponents of hyperedge size and degree distributions (their assumption was that both obey power law distribution). In SPA, two free probability parameters are simultaneously controlled to generate new structures and attachment points in the current network in order to simultaneously match the tails of both the hyperedge degree and size distributions. However, as Fig. 3 suggests, this is not accurate for many collaboration relation structures, particularly for the size distributions. More importantly, structure-based growth models 2, 6 do not focus on modeling the structure of the underlying global collaboration network-instead they model the "artifacts" of collaboration, i.e., hyperedges.
To generate real world collaboration relation structures, we propose a generative facet-by-facet growth model based on preferential attachment, not based on an individual X's degree, that is, how many people has X collaborated with over a period of time, but on X's facet degree which measures how many maximal collaborations or facets has X been involved in. The basic intuition behind this is the following: individuals who are comfortable being part of several distinct collaboration endeavors are likely to attract more collaborators than the individuals who are happy participating in a fewer number of collaboration endeavors (albeit with several collaborators).
We assume that the facet size distribution is given to us and our primary aim is to generate a random simplicial complex that closely matches the ground truth distribution for facet degrees. We do this because the facet size distributions are often non-power law and even non-monotonic, particularly near the head of the distribution as in Fig. 3a and c, thus not obeying trends of typical distributions generated by random growth models. On the other hand, facet degree distribution in Fig. 3b has monotonic behavior and is power law with exponential cutoff, implying that it might be possible to obtain it more accurately via random growth models. Accordingly, our facet-based generative model takes as input the facet size distribution and grows the collaboration simplicial complex one (transient) facet at a time. During this process, a large facet may subsume a smaller (transient) existing facet since we are interested in capturing the fundamental structure of collaboration.
In related work, we first acknowledge the early work 1 which empirically points out the basic properties of scientific collaboration network structures. Over the past decade, a significant number of researchers have studied structures representing collaboration artifacts. Liu et al. 8 have proposed a preferential-attachment based growth model for "affiliation networks". However, this is only a qualitative study without any theoretical analysis. Some authors have considered the artifacts of collaboration 3 , and have provided analytic results for only very special cases where the collaborative outputs are fixed size. This was also the assumption elsewhere 18 with the authors studying the evolution processes for the network of scientific collaboration artifacts. On the other hand, an empirical study focused on a very specific scientific collaboration network has been provided 5 ; and a modified preferential attachment algorithm has been considered 4 , again for collaboration artifacts along with numerical results. Our work goes beyond these models to provide both a theoretically sound treatment, and accounts for the key phenomenon of subsumption that occurs in various collaborative endeavors. In other work focused on going beyond pairwise relationships, properties of random tripartite hypergraphs have been studied 19 . Also, Wu et al. 20 have recently proposed a simplicial complex generation model but with a significantly different goal of characterizing the growing geometry of networks. Similarly, Bianconi et al. 21 consider the Network Geometry with Flavor (NGF) and provide a quantum mechanical description. Courtney et al. 22 consider a configuration model treatment for simplicial complexes of a given dimension. On the other hand, Lambiotte et al. 23 propose a growth model based on the copying mechanism and study the resulting phase transition behavior. Very recently, growth models for collaboration artifacts have been considered 9 , but they address neither the success of matching the size distribution nor the subsumption phenomena, which is a focal point of this paper.

Results
GENESCs has several distinctions from other works that have proposed PA based generative growth models for hypergraphs 6,8 . Since we are interested in modeling the global collaboration relation and not its artifacts, in our model the hyperedges are subsumed to yield facets, thus preserving the core structure underlying the participation of various individuals in a collaboration. We do not assume that the size of such collaborative structures is power law distributed. In fact, this is not the case for several collaboration networks, especially near the head, i.e., small collaborations. Accordingly, GENESCs takes as input the distribution of facet sizes and average facet density (i.e., average number of collaborations per node) to generate a collaboration relation using a variant of PA.
The dynamics of GENESCs is distinct from classical PA in two ways. First, the structural unit of growth in each time step in our setting is a (transient) facet which could contribute one or more nodes to the simplicial complex. In contrast, in classic PA, one node is added in each time step. Secondly, in classical PA new nodes are always added to the network, whereas in our case, some nodes in the newly generated facet may be merged with existing nodes; moreover, new facets may subsume older facets or be subsumed by older facets. This is consistent with how large collaboration networks grow-new endeavors consist of both existing individuals and new individuals.
It is well known that preferential-attachment (PA) based network growth methods result in power law node degree distributions for classical graphs 14 . Other more general variants of preferential attachment have been analyzed extensively as well 24 . We show below that the growth model behind GENESCs results in SCs with power law distributed facet degree. We also compute the power law exponent as a function of input parameters such as average facet size and average facet density, and show that it matches real world collaboration network data sets well. Performance evaluation of GENESCs. We measured the quality of the distribution generated by GENESCs using the Kolmogorov-Smirnov distance D KS and the Total Variation distance D TV between real data distributions p(⋅) and the generated distributions q(⋅) for both facet sizes and facet degrees: Note that the facet sizes in GENESCs are generated using the facet size distribution of real collaboration data sets, where the effect of subsumptions has already been incorporated in the first place. We observed that GENESCs yields D KS ≈ 0 and D TV ≈ 0 for facet sizes, confirming the fact that subsumption events are rare if one is drawing random collaboration structures from the facet size distribution instead of the hyperedge size distribution. Analytic arguments for this phenomenon are presented in Supplemental Information Section 3. Figure 5 illustrates the performance comparison of GENESCs with respect to real DBLP publication data and the theoretical prediction of Theorem 0.1 as far as the facet degree distribution is concerned. It can be observed that GENESCs matches well the characteristics of both the real facet degree distribution and the theoretical prediction. Figure 6 compares the relative performance of the Structured Preferential Attachment (SPA) algorithm 6, 7 with GENESCs. When applying SPA, first the parameters that best fit DBLP's hyperedge degree and size distributions are obtained and used for hyperedge generation. Then, we have inflicted subsumptions on that data and plotted it. It can be observed from Fig. 6(a) that SPA, which is designed to generate pure power law distribution for both hyperedge sizes and hyper-degrees, is unable to match the real facet size distribution of the DBLP data set (D TV = 0.72). It is also unable to closely match the facet degree distribution of DBLP (D TV = 0.094), especially near the heavy tail. Moreover, it has a markedly different slope. In contrast, since GENESCs samples the facet size distribution, it is able to yield a close match to (obviously) that distribution (D TV = 0.023) (the minor difference is due to the occurrence of some subsumptions at low sizes), and the facet degree distribution (D TV = 0.054).
While we have primarily focused thus far on demonstrating the success in matching facet degree and facet size distributions, here we also provide more advanced network metrics that confirm the success of GENESCs. First among these is neighborhood facet degree correlation. We adapt the node degree correlation metric k nn , defined in 25 to simplicial complexes as , where P(k ′ |k) is the probability that a node with facet degree k will have a neighbor (in the simplicial complex) with facet degree k′. As demonstrated in Fig. 7 for the DBLP dataset, we find that both the dataset and the random simplicial complexes produced by GENESCs have extremely good match at the tail, and both real and generated networks are assortative, i.e., high degree nodes tend to be surrounded by high degree nodes as is typically observed in scientific collaboration networks 26 . This can be concluded from the fact that the k nn (k) function has a positive slope with increasing facet degree k.  In addition, Fig. 8 demonstrates that facet size distributions conditioned on a fixed facet degree resulting from GENESCs closely follow the real dataset. The figure also points out that a facet size of one can only occur for facet degree one in a simplicial complex-for an isolated facet (which corresponds to a single node).

Discussion
In this paper, we proposed GENESCs, a generative model for collaboration structures modeled as simplicial complexes (SC). SCs are different from graphs since they have two different dimensions for growth (facet size and facet degree), whereas graphs only have one (node degree), since all edges have identical sizes. SCs are also different from hypergraphs, which do not have the subsumption property. While hypergraphs are good for modeling artifacts of collaborations, e.g., papers and movies, SCs are more appropriate for succinctly modeling the inherent structure of the collaboration relationship, i.e., who all have collaborated with each other. This distinction is key because subsumptions are very common in real-world collaboration networks.
The facet size distribution constrains the facet degree distribution to an extent. Leveraging this observation, GENESCs takes facet size distributions of real world collaboration networks as input and efficiently generates random SCs with facet degree distributions matching those corresponding to the real data. Our theoretical analysis is shown to accurately predict the statistical properties of SCs generated by GENESCs in its pure form. For collaborations that have characteristics such as very heavy tails of facet sizes or non-power law popularity distributions of participants as collaborators, appropriate modifications to the preferential attachment step of GENESCs (namely, clamping and uniform-hybridization) yields good intuitively sound results.

Methods
GENESCs: A Facet-based Preferential Attachment Model. GENESCs takes as input the facet size distribution z(s) of a real data set (where s is the facet size with 1 ≤ s ≤ |V|) and generates a random simplicial complex (V, F) modeling that data set with facet degree distribution p k (where k denotes facet degree).