The emergence of heterogeneous scaling in research institutions

Research institutions provide the infrastructure for scientific discovery, yet their role in the production of knowledge is not well characterized. To address this gap, we analyze interactions of researchers within and between institutions from millions of scientific papers. Our analysis reveals that collaborations densify as each institution grows, but at different rates (heterogeneous densification). We also find that the number of institutions scales with the number of researchers as a power law (Heaps’ law) and institution sizes approximate Zipf’s law. These patterns can be reproduced by a simple model in which researchers are preferentially hired by large institutions, while new institutions complementarily generate more new institutions. Finally, new researchers form triadic closures with collaborators. This model reveals an economy of scale in research: larger institutions grow faster and amplify collaborations. Our work deepens the understanding of emergent behavior in research institutions and their role in facilitating collaborations.

The scientific ecosystem is characterized by complex multifaceted relationships between institutions, researchers, and their collaborators. In this work, the authors find common patterns in these relationships expressed through superlinear scaling, Heaps’ law, and Zipf’s law within the global collaboration network, and propose a minimal network model to explain these patterns based on preferential hiring by larger institutions, the “adjacent possible” capturing the birth of new institutions, and triadic closure of collaborations.

Scientific innovation and training require efficient and robust infrastructure. This infrastructure is provided by research institutions, a category that includes universities, government labs, industrial labs, and national academies [1-5]. Despite the long tradition of bibliometric and science of science research [6], the focus has only recently shifted from individual scientists [7,8] and teams [9-11] to how institutions affect researcher productivity and impact [12,13]. Many gaps remain in our understanding of the role of institutions in the production of scientific knowledge, and specifically, how they form, grow, and facilitate scientific collaborations. These questions are important because collaborations are increasingly prevalent in scientific research [1,9,10] and produce more impactful and transformative work [10,14]. Collaboration allows scientists to cope with the increasing complexity of knowledge [15] by leveraging the diversity of expertise [16] and perspectives offered by collaborators from different institutions [17] and disciplines [18].
To understand the evolution of research institutions and collaborations, we analyze a large bibliographic database spanning many decades and multiple scientific disciplines. The database contains millions of publications from which the names of authors (collaborators) and their affiliations (research institutions) have been extracted for each paper. Analysis of these data reveals strong statistical regularities. We find that collaborations scale superlinearly with institution size, i.e., faster than institutions grow, consistent with densification of growing networks [19-21]. However, the scaling law is different for each institution, and as a result, different parts of the collaboration network densify at different rates. We also find that institutions vary in size by many orders of magnitude with an approximately power-law distribution, also known as Zipf's law [22]. The number of institutions, in contrast, scales sublinearly with the number of researchers, thus following Heaps' law [23,24]. The sublinear scaling implies that, even as more institutions appear, each institution gets larger on average, but this average belies an enormous variance.
Finally, we create a stochastic model that helps explain how institutions and research collaborations form and grow. In this model, a researcher appears at each time step and is preferentially hired by larger institutions (e.g., due to their prestige or funding), which leads to a rich-get-richer effect that creates Zipf's law. With a small probability, however, a researcher joins a newly appearing institution. The arrival of this new institution then triggers yet more new institutions to form in the future, which explains Heaps' law [25]. Once hired, researchers connect to other researchers and, independently with some probability, to those researchers' collaborators, which explains why collaborations scale superlinearly with institution size. Despite its simplicity, the model reproduces a range of empirical observations, including the number and size of research institutions, and how pockets of increasingly dense structures form in collaboration networks.
These empirical results demonstrate universal emergent patterns in the formation and growth of research institutions and collaborations. Our model demonstrates that new institutions are critical to absorbing extra capacity by collecting researchers who do not join large institutions. At the same time, large institutions offer an economy of scale: they grow faster and provide more collaboration opportunities compared to smaller institutions.

Results and discussion
As the first step towards characterizing the complexity of institution scaling, we collect data from Microsoft Academic Graph [26] to capture how millions of collaborations evolve over time. Figure 1 shows the collaboration network at the institution level in the field of sociology. Figure 1a demonstrates a remarkable diversity of institution size and growth, both in terms of the number of researchers (node growth) and collaborations between institutions (edge growth). Collaborations are clustered, with clear groups of interacting institutions. Research collaborations within an institution are equally complex. Figure 1b highlights the largest connected component of the collaboration network within Harvard. Individual researchers vary widely in the number of collaborators, with new collaborations appearing in clusters.
This dataset helps us capture how the number of collaborations scales with an institution's size, n. Figure 2a, b shows the number of internal and external collaborations versus n across four different disciplines: computer science, physics, math, and sociology. While each institution follows a scaling law $c \sim n^{\alpha}$ ($R^2$ is close to 1.0, see Supplementary Note 6), the exponents α differ substantially between institutions. This is shown in the insets of Fig. 2a, b, where we collect scaling exponents across thousands of institutions and find that their distribution stretches from zero (institutions that do not gain any collaborations as they grow) to two (collaborations are extremely dense). In the thermodynamic limit, exponents cannot be larger than two; values above two are therefore due to finite-size effects.
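As a minimal sketch of how a per-institution exponent α might be estimated, the snippet below assumes an ordinary least-squares fit on log-transformed counts. The function name `fit_scaling_exponent` and the toy trajectory are illustrative and are not taken from the paper or its supplementary material.

```python
import numpy as np

def fit_scaling_exponent(sizes, collaborations):
    """Fit c ~ n**alpha by least squares in log-log space.

    Returns (alpha, r_squared). Points with zero size or zero
    collaborations are dropped because their logarithm is undefined.
    """
    n = np.asarray(sizes, dtype=float)
    c = np.asarray(collaborations, dtype=float)
    keep = (n > 0) & (c > 0)
    log_n, log_c = np.log(n[keep]), np.log(c[keep])
    alpha, intercept = np.polyfit(log_n, log_c, 1)  # slope, then intercept
    predicted = alpha * log_n + intercept
    ss_res = np.sum((log_c - predicted) ** 2)
    ss_tot = np.sum((log_c - log_c.mean()) ** 2)
    return alpha, 1.0 - ss_res / ss_tot

# Toy trajectory of one hypothetical institution growing over time.
sizes = np.array([10, 20, 40, 80, 160])
collaborations = 0.5 * sizes ** 1.2
alpha, r2 = fit_scaling_exponent(sizes, collaborations)
```

On this noiseless toy trajectory the fit recovers the exponent used to generate it; real trajectories would scatter around the fitted line.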
To test whether the scaling exponents genuinely differ between institutions, we create a null model (see Supplementary Note 3) in which all institutions follow the same scaling law. In this null model, residuals of each institution's fitted scaling relation are reshuffled and added as noise onto a single scaling relation. Differences between fitted exponents in this model are due to statistical noise rather than different scaling laws. We find that the variance of the scaling exponents across all institutions is much higher than in this null model. We therefore reject the hypothesis that all the exponents within a field are the same within statistical error. We explore the dependence of scaling on final institution size in Supplementary Note 6, and find that the scaling exponents are superlinear (approximately 1.2 on average) and do not depend strongly on the final size of the institution. Different parts of the collaboration network therefore densify at different rates, which extends previous work that uncovered densification for many networks at the aggregate level [19].
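The null-model idea can be sketched as follows. This is only an illustration under stated assumptions: the exact procedure is in Supplementary Note 3, which is not reproduced here. The sketch assumes the shared exponent is the mean of the fitted exponents, that each institution keeps its own fitted intercept, and that residuals are pooled across institutions before shuffling. All function names and the toy data are hypothetical.

```python
import numpy as np

def fit_loglog(log_n, log_c):
    """Return (slope, intercept) of a least-squares line in log-log space."""
    slope, intercept = np.polyfit(log_n, log_c, 1)
    return slope, intercept

def null_exponent_variance(trajectories, rng):
    """Variance of refitted exponents when all institutions share one exponent.

    trajectories: list of (log_n, log_c) array pairs, one per institution.
    """
    fits = [fit_loglog(x, y) for x, y in trajectories]
    # Assumption: the single shared exponent is the mean fitted exponent.
    common_alpha = np.mean([slope for slope, _ in fits])
    # Pool residuals from every institution's own fit, then reshuffle them.
    residuals = np.concatenate(
        [y - (s * x + b) for (x, y), (s, b) in zip(trajectories, fits)]
    )
    rng.shuffle(residuals)
    null_slopes, start = [], 0
    for (x, _), (_, b) in zip(trajectories, fits):
        noise = residuals[start:start + len(x)]
        start += len(x)
        synthetic = common_alpha * x + b + noise  # one law plus shuffled noise
        null_slopes.append(fit_loglog(x, synthetic)[0])
    return np.var(null_slopes)

rng = np.random.default_rng(0)
log_n = np.log(np.arange(5, 200, 5))
# Toy institutions with genuinely different exponents plus small noise.
true_alphas = rng.uniform(0.8, 1.6, size=50)
trajectories = [
    (log_n, a * log_n + rng.normal(0, 0.05, size=log_n.size))
    for a in true_alphas
]
observed_var = np.var([fit_loglog(x, y)[0] for x, y in trajectories])
null_var = null_exponent_variance(trajectories, rng)
```

In this toy setting the observed variance of exponents is far larger than the null variance, which is the kind of gap that would lead one to reject a single shared scaling law.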
We find weak evidence that higher scaling exponents correspond to institutions with greater impact. In physics, the Spearman rank correlation, s, between mean paper impact after five years and internal collaboration scaling exponents is 0.09 (borderline significant, p-value = 0.06) and for external collaboration is 0.27 (p-value < $10^{-5}$). Similarly, in sociology, the correlation is 0.19 (p-value = 0.03) between impact and internal collaboration exponents, and the same correlation value is found for external collaboration exponents. For all other fields, however, the correlations are not statistically significant (p-value ≥ 0.20). Impact, a proxy of institution research quality, cannot fully explain why collaborations grow faster in some institutions and not others, but can give some insight into reasons for this diversity. These results suggest that highly impactful institutions seem to form collaborations more easily as they grow. Nonetheless, almost all institutions benefit from being larger, as the number of collaborations per person typically grows with size (Fig. 2a, b insets).
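For readers unfamiliar with the statistic, a Spearman rank correlation is the Pearson correlation of the ranks of two variables. The sketch below is a NumPy-only illustration with average ranks for ties. It does not compute p-values, and the toy values are invented for the example and are not the paper's data.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

# Hypothetical per-institution values (illustrative only).
exponents = [0.9, 1.1, 1.2, 1.4, 1.0]
impact = [3.0, 4.5, 4.0, 6.0, 2.5]
s = spearman(exponents, impact)
```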
The superlinear scaling of collaborations cannot be explained by researcher productivity. The scaling exponents of output, i.e., the cumulative number of papers published by researchers affiliated with that institution at a given year, are centered around 1.0 (see Supplementary Note 4). Paper output per researcher is therefore approximately independent of institution size. The average team size per institution, however, increases with institution size (see Supplementary Note 5), which may help explain the scaling of collaborations. Namely, as institutions grow, they form larger teams for each paper. This, in turn, creates more collaborations (which are proportional to the team size squared).
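The link between team size and collaboration counts can be made explicit with a short worked equation. Assuming that every pair of co-authors on a paper counts as one collaboration, a paper with $k$ authors (the symbol $k$ is introduced here for illustration) contributes

```latex
\binom{k}{2} = \frac{k(k-1)}{2} \approx \frac{k^{2}}{2} \quad (k \gg 1),
\qquad \text{e.g. } k = 2 \Rightarrow 1 \text{ pair}, \quad k = 4 \Rightarrow 6 \text{ pairs}.
```

Doubling the team size from two to four therefore multiplies the number of co-author pairs on that paper by six, and by roughly four for large teams.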
We also find that the distribution of institution sizes (as of 2017) follows Zipf's law (Fig. 2c), similar to the observed heavy-tailed distribution of city sizes [22,27]. In Supplementary Note 1 and Supplementary Data 1, we show that while the largest institutions are intuitive, such as Harvard, the smaller institutions tend to be for-profit colleges, community colleges, and institutions without a formal department in the field of interest (e.g., an engineering school with papers in sociology). In addition, the number of institutions grows sublinearly with the number of researchers in each field (Fig. 2d). This feature, known as Heaps' law, implies that quadrupling the number of researchers in a field roughly doubles the total number of institutions associated with that field. Exact scaling law values for each field can be found in Table 1, where Heaps' laws are calculated for the total number of researchers in each field, N, greater than twenty and Zipf's law is calculated for institution size, n, greater than ten.
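To see where the quadrupling-to-doubling statement comes from, write Heaps' law as $I \propto N^{\beta}$, where $I$ is the number of institutions and $\beta$ is the Heaps exponent (the symbols $I$ and $\beta$ are introduced here for illustration; the fitted values are those in Table 1). Assuming $\beta \approx 1/2$:

```latex
\frac{I(4N)}{I(N)} = \frac{(4N)^{\beta}}{N^{\beta}} = 4^{\beta} = 4^{1/2} = 2 .
```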
A Model of Institution Growth. We now describe a stochastic growth model of institution formation that elucidates how institutions and collaborations jointly grow. We model institution formation and growth with a Pólya's urn-like set of mechanisms described in ref. [25], and we model the growth of collaborations with a network densification mechanism [20,21]. Unlike existing models of network densification [19-21], however, our model reproduces the heterogeneous densification of internal and external collaborations, and the non-trivial growth structure of institutions. This is complementary to a very recent model of heterogeneous exploration [28], in which Pólya's urn models vary as a function of a node's position on a (static) network.
We imagine an urn containing balls of different colors. The balls can be thought of as the resources given to each institution, where each color represents a different assigned institution, as shown in Fig. 3a. Balls are picked uniformly at random with replacement, with each pick representing a newly hired researcher, and the ball color is recorded in a sequence to represent which institution hires the researcher. Afterwards, ρ balls of the same color are added to the urn to represent the additional resources and prestige given to a larger institution, a step known as reinforcement (left panel of Fig. 3a) [25]. If a previously unseen color is chosen, then ν + 1 uniquely colored balls are placed into the urn, a step known as triggering (right panel of Fig. 3a) [25]. The new colors represent institutions that are able to form because of the existence of a new institution. This triggering, also known as the adjacent possible [25], does not imply causality per se, e.g., the cause of the University of California Merced's creation was not strictly because of previously established institutions. Instead, these institution-specific causes are represented as stochastic noise, a remarkable simplification that does not remove the observed statistical regularities. Triggering, however, agrees with anecdotal evidence, making it an intuitive factor behind the creation of institutions. For example, UC Davis was spun out of UC Berkeley, and USC Institute for Creative Technology was spun out of USC Information Sciences Institute, which itself was founded by researchers from the Rand Corporation. The model we describe is known as Pólya's urn with triggering [25], and predicts Heaps' law with a scaling relation $\sim N^{\nu/\rho}$ and Zipf's law with scaling relation $\sim n^{-(1+\nu/\rho)}$ [25]. In our simulations, we arbitrarily chose ρ to be 4 and ν to be 2, which agrees well with the data shown in Fig. 2.
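The urn dynamics described above can be sketched in a few lines. This is an illustrative simulation, not the authors' code. It assumes the urn starts with a single ball of one unseen color (the initial condition is not specified in the text), and the function and variable names are hypothetical.

```python
import random
from collections import Counter

def polya_urn_with_triggering(steps, rho=4, nu=2, seed=0):
    """Simulate hires; return the institution (color) of each hire.

    Assumption: the urn starts with a single ball of color 0.
    """
    rng = random.Random(seed)
    urn = [0]           # one entry per ball; the value is the ball's color
    next_color = 1      # next unused color id
    seen = set()        # colors (institutions) that have already hired
    hires = []
    for _ in range(steps):
        color = urn[rng.randrange(len(urn))]  # uniform pick; ball stays in urn
        hires.append(color)
        urn.extend([color] * rho)             # reinforcement: rho more balls
        if color not in seen:                 # first hire triggers new colors
            seen.add(color)
            urn.extend(range(next_color, next_color + nu + 1))
            next_color += nu + 1
    return hires

hires = polya_urn_with_triggering(steps=20000)
sizes = Counter(hires)          # institution -> number of researchers
n_institutions = len(sizes)     # distinct institutions that have hired
```

Recording `len(set(hires[:t]))` for increasing `t` gives a Heaps-type curve, and the sorted values of `sizes` give a rank-size curve, both of which can be compared with the exponents quoted in the text.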
Next, we explain the heterogeneous and superlinear scaling of collaborations through a model of network densification. Building on the work of refs. [20,21], we have each new researcher, represented as a node, connect to a random researcher within the same institution, as well as an external researcher picked uniformly at random (left panel of Fig. 3b). New collaborators are then chosen independently from neighbors of neighbors with probability $p_i$, where $p_i$ is unique to each researcher's institution (right panel of Fig. 3b). We let $p_i$ be a Gaussian distributed random variable with mean $\mu_p = 0.6$ and standard deviation $\sigma_p = 0.25$, truncated between 0 and 1. Lambiotte et al. [21] show that their equivalent to $\mu_p$, when greater than 0.5, produces densification. We therefore choose $\mu_p = 0.6$ to ensure the network densifies. We show separately that $p_i$ directly controls the heterogeneity we observe in internal collaboration scaling, but the heterogeneity in external collaboration scaling is an emergent outcome of this model [29].
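A rough sketch of this densification step is given below. It is one possible reading of the mechanism rather than the authors' implementation. It assumes that "neighbors of neighbors" means the current collaborators of the two initial contacts, that the truncated Gaussian is sampled by redrawing out-of-range values, and that the external contact is drawn uniformly from all researchers at other institutions. The hiring sequence here is a toy input and all names are hypothetical.

```python
import random
from collections import defaultdict

def truncated_gauss(rng, mu, sigma, low=0.0, high=1.0):
    """Draw from a Gaussian, redrawing until the value lies in [low, high]."""
    while True:
        value = rng.gauss(mu, sigma)
        if low <= value <= high:
            return value

def grow_collaborations(hires, mu_p=0.6, sigma_p=0.25, seed=0):
    """Grow a collaboration network from a sequence of hiring institutions.

    hires[k] is the institution that hires researcher k.
    Returns (neighbors, institution_of, p).
    """
    rng = random.Random(seed)
    neighbors = defaultdict(set)  # researcher -> set of collaborators
    members = defaultdict(list)   # institution -> researchers hired so far
    institution_of = []
    p = {}                        # institution -> closure probability p_i
    for new, inst in enumerate(hires):
        if inst not in p:
            p[inst] = truncated_gauss(rng, mu_p, sigma_p)
        internal = members[inst]
        # Assumption: external pool is everyone hired so far elsewhere.
        external = [r for r in range(new) if institution_of[r] != inst]
        targets = []
        if internal:
            targets.append(rng.choice(internal))
        if external:
            targets.append(rng.choice(external))
        new_links = set(targets)
        for target in targets:
            for friend in neighbors[target]:
                if rng.random() < p[inst]:  # independent triadic closure
                    new_links.add(friend)
        for other in new_links:
            neighbors[new].add(other)
            neighbors[other].add(new)
        members[inst].append(new)
        institution_of.append(inst)
    return neighbors, institution_of, p

toy_rng = random.Random(1)
toy_hires = [toy_rng.choice("ABC") for _ in range(300)]
neighbors, institution_of, p = grow_collaborations(toy_hires)
n_edges = sum(len(v) for v in neighbors.values()) // 2
```

Counting, for each institution, the edges whose endpoints share that institution versus those that do not gives the internal and external collaboration counts whose growth can then be fitted against institution size.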
To summarize, our model has four parameters: ρ (reinforcement), ν (triggering), and two parameters to explain collaboration densification heterogeneity, $\mu_p$ and $\sigma_p$. In the main text, we let ρ equal 4, ν equal 2, $\mu_p$ equal 0.6, and $\sigma_p$ equal 0.25. These are arbitrarily chosen parameters meant to create statistical patterns that are qualitatively similar to empirical data. Namely, $\mu_p > 0.5$ ensures collaboration densification [21], and $\sigma_p > 0$ ensures that densification scaling exponents vary between institutions. Interestingly, this model's Zipf's and Heaps' laws can be exactly calculated, as discussed by Tria et al. [25], with the Zipf's law exponent equal to −1 − ν/ρ and the Heaps' law exponent equal to ν/ρ. This model qualitatively reproduces Zipf's and Heaps' laws (Fig. 2c, d and Table 1) and the heterogeneous scaling of internal and external collaborations shown in Fig. 2a, b. While other plausible mechanisms for Zipf's law [30-32], Heaps' law [24], or densification [19] exist, the current model describes these patterns in a cohesive framework and explains the heterogeneous scaling we discover in the data. While this heterogeneity is built into our internal scaling laws, the external scaling heterogeneity is an emergent property within the model [29].
The model also reproduces qualitative trends of cross-sectional analysis. Specifically, the scaling exponents of internal collaborations produced by the model, when measured at a specific point in time, i.e., in a cross-sectional setting, are larger than the scaling exponents of external collaborations and decrease over time (Supplementary Note 6), much like what we see in data (Supplementary Fig. 3). These results are robust to stochastic variations of the densification mechanism (Supplementary Note 7). As a final comparison with data, we compared the growth of institutions and the ways links form to the model mechanisms and found broad agreement [29].
Table 1 notes. Each fit is a linear regression on log-scaled x and y axes for the number of researchers in each field above 100. Errors are standard errors of linear regression coefficients. Simulation scaling laws are theoretical exponents calculated for Pólya's urn model with triggering with coefficients ρ = 4 and ν = 2 [25]. See Results and Discussion for details of the mechanism coefficients.

Conclusion
We identify strong statistical regularities in the growth of research institutions. The number of collaborations increases superlinearly with institution size, i.e., faster than institutions grow in size, though the scaling is heterogeneous, with a different exponent for each institution. Each institution therefore follows its own scaling law: regardless of its size, it gains the same percentage of new collaborations for each percentage increase in size. The superlinear scaling is not explained by increased productivity of researchers at larger institutions, because the number of papers per researcher is roughly independent of institution size. Instead, the growing collaborations are associated with bigger teams at larger institutions. The diversity in collaboration scaling exponents is partly explained by variations in institution impact. Institutions with higher-impact papers also tend to have a larger scaling exponent. This suggests that a higher collaboration scaling exponent may allow collaborations to form more easily, which in turn may create higher-impact papers. Further analysis is needed to test this hypothesis in the future. When these observations are incorporated into a minimal stochastic model of institution growth, we are able to reproduce the surprising regularity of research institution formation and growth, and the heterogeneous densification of collaboration networks. That said, there is still room for improvement in this model, given quantitative differences between the model and data, such as the constant shift between the Heaps' laws (Fig. 2d), or the difference in the collaboration scaling law exponents (insets of Fig. 2a, b).
These findings support the idea that academic environments differ in their ability to bolster researcher productivity and prominence [12], and also point to institution size and the ability to facilitate collaborations as potential factors explaining differences in academic environments. Additional research is needed to identify other factors that contribute to an institution's success.

Methods
Data. We use bibliographic data from Microsoft Academic Graph (MAG), from which researcher names (authors), their institutional affiliation, and references made to other papers have been extracted [26,33]. MAG data has disambiguated institutions and authors for each paper, allowing us to consider all authors with the same unique identifier to be the same researcher, and similarly for each institution. In these data, authors typically have only one affiliation at any time (see Supplementary Note 1). We focus on four fields of study: computer science, physics, math, and sociology. After data cleaning, we have almost ten million papers published between 1800 and 2018 (see Supplementary Note 1). Our computer science data includes early research in topics relating to computers, including electrical engineering, and therefore stretches back to before 1900.
We define institution size in a given year as the number of authors who have ever been affiliated with that institution up until that year. A collaboration is defined as a pair of researchers who have co-authored a paper up until that year. We distinguish between internal collaborations (co-authors at the same institution) and external collaborations (co-authors affiliated with different institutions). Finally, to understand the relation between collaborations and institution size, we define output as the cumulative number of papers from researchers affiliated with an institution in a particular year.
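These cumulative definitions can be illustrated with a small sketch. The record layout (a publication year plus a list of author and institution identifiers per paper) is an assumed toy format, not the MAG schema. The sketch counts each co-author pair once, however many papers they share, which is also an assumption.

```python
from itertools import combinations

def institution_stats(papers, institution, year):
    """Cumulative size and collaboration counts for one institution.

    papers: iterable of (pub_year, [(author_id, institution_id), ...]).
    Returns (size, internal, external) using all papers up to `year`.
    """
    members = set()
    internal, external = set(), set()
    for pub_year, authorships in papers:
        if pub_year > year:
            continue
        for author, inst in authorships:
            if inst == institution:
                members.add(author)
        for (a, inst_a), (b, inst_b) in combinations(authorships, 2):
            # Skip self-pairs and pairs that do not involve this institution.
            if a == b or institution not in (inst_a, inst_b):
                continue
            pair = frozenset((a, b))
            if inst_a == inst_b:
                internal.add(pair)   # both co-authors at this institution
            else:
                external.add(pair)   # one co-author is elsewhere
    return len(members), len(internal), len(external)

# Hypothetical toy records.
papers = [
    (2000, [("a", "X"), ("b", "X")]),
    (2001, [("a", "X"), ("c", "Y")]),
    (2003, [("b", "X"), ("d", "X"), ("c", "Y")]),
]
stats_2001 = institution_stats(papers, "X", 2001)
stats_2003 = institution_stats(papers, "X", 2003)
```

Evaluating the function for successive years yields the longitudinal trajectory of one institution, i.e., collaboration counts as a function of its own cumulative size.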
Analysis. We use cumulative statistics to reduce statistical variations and to better compare to a stochastic growth model of institution formation. To check the robustness of results, we compare to an alternate yearly definition of institution size and collaborations (see Supplementary Note 2). We find all qualitative results are the same, in part because both definitions are highly correlated.
We present scaling results for longitudinal analysis, which tracks how collaborations evolve as individual institutions grow [34-36]. This contrasts with cross-sectional analysis applied in previous work on city scaling [37,38] and institution scaling [2-4,39], which measures collaborations as a function of the size of all institutions at a given point in time. We find that cross-sectional analysis identifies scaling laws that are not representative of the growth of most institutions (see Supplementary Note 7), and while simulations and empirical data give scaling exponents that are fairly constant in time for each institution, cross-sectional scaling exponents vary in time for both data and simulation. For these reasons, we focus on longitudinal scaling analysis in this paper, although scaling laws derived by either analysis method strongly relate to each other [36,40].
Fig. 3 Schematic representation of the institution growth model. (a) At time t a new researcher is hired, modeled as extracting a ball with uniform probability with replacement from an urn, U (black arrow). The ball color represents an institution. Hiring a researcher will always add ρ new balls of the same color to the urn in the next timestep (reinforcement). Hiring the first researcher at an institution (picking a ball color that has never been picked before) triggers ν + 1 new colors to enter the urn, increasing the likelihood of more institutions hiring their first researcher (triggering). (b) Researchers within each institution (dash-dotted boxes) have both internal collaborators (darker solid lines) and external collaborators (gray lines). Once a researcher is hired, they choose one random internal and one random external collaborator (solid arrows). New collaborations (dashed arrows) are formed independently with probability $p_A$ if hired by institution A, and $p_B$ if hired by institution B. These new connections form triangles.