Introduction

Scientific innovation and training require efficient and robust infrastructure. This infrastructure is provided by research institutions, a category that includes universities, government labs, industrial labs, and national academies1,2,3,4,5. Despite the long tradition of bibliometric and science of science research6, the focus has only recently shifted from individual scientists7,8 and teams9,10,11 to how institutions affect researcher productivity and impact12,13. Many gaps remain in our understanding of the role of institutions in the production of scientific knowledge, and specifically, how they form, grow, and facilitate scientific collaborations. These questions are important, because collaborations are increasingly prevalent in scientific research1,9,10 and produce more impactful and transformative work10,14. Collaboration allows scientists to cope with the increasing complexity of knowledge15 by leveraging the diversity of expertise16 and perspectives offered by collaborators from different institutions17 and disciplines18.

To understand the evolution of research institutions and collaborations, we analyze a large bibliographic database spanning many decades and multiple scientific disciplines. The database contains millions of publications from which the names of authors (collaborators) and their affiliations (research institutions) have been extracted for each paper. Analysis of these data reveals strong statistical regularities. We find that collaborations scale superlinearly with institution size, i.e., faster than institutions grow, consistent with densification of growing networks19,20,21. However, the scaling law is different for each institution, and as a result, different parts of the collaboration network densify at different rates. We also find that institutions vary in size by many orders of magnitude with an approximately power-law distribution, also known as Zipf’s law22. The number of institutions, in contrast, scales sublinearly with the number of researchers, thus following Heaps’ law23,24. The sublinear scaling implies that, even as more institutions appear, each institution gets larger on average, but this average belies an enormous variance.

Finally, we create a stochastic model that helps explain how institutions and research collaborations form and grow. In this model, a researcher appears at each time step and is preferentially hired by larger institutions (e.g., due to their prestige or funding), which leads to the rich-get-richer effect creating Zipf’s law. With a small probability, however, a researcher joins a newly appearing institution. The arrival of this new institution then triggers yet more new institutions to form in the future, which explains Heaps’ law25. Finally, once hired, researchers make connections to other researchers and their collaborators with an independent probability to explain collaborations scaling superlinearly with institution size. Despite its simplicity, the model reproduces a range of empirical observations, including the number and size of research institutions, and how pockets of increasingly dense structures form in collaboration networks.

These empirical results demonstrate universal emergent patterns in the formation and growth of research institutions and collaborations. Our model demonstrates that new institutions are critical to absorbing extra capacity by collecting researchers who do not join large institutions. At the same time, large institutions offer an economy of scale: they grow faster and provide more collaboration opportunities compared to smaller institutions.

Results and discussion

As the first step towards characterizing the complexity of institution scaling, we collect data from Microsoft Academic Graph26 to capture how millions of collaborations evolve over time. Figure 1 shows the collaboration network at the institution level in the field of sociology. Figure 1a demonstrates a remarkable diversity of institution size and growth, both in terms of the number of researchers (node growth) and collaborations between institutions (edge growth). Collaborations are clustered, with clear groups of interacting institutions. Research collaborations within an institution are equally complex. Figure 1b highlights the largest connected component of the collaboration network within Harvard. Individual researchers vary widely in the number of collaborators, with new collaborations appearing in clusters.

Fig. 1: Network visualization of the collaborations in the field of sociology in 2017.

a Collaborations between institutions. Each node represents a research institution, and institutions with more researchers are represented by larger nodes. Each link represents collaborations between researchers at different institutions and more collaborations are represented by thicker lines. Darker nodes represent faster-growing institutions (defined as the number of new researchers added between 2012 and 2017), and darker links represent faster-growing collaborations (defined as the number of new inter-institution collaborations between 2012 and 2017). Links with fewer than 10 collaborations are removed, as are isolated nodes. A few major universities are labeled: Peking University, Oxford University, Harvard University, Massachusetts Institute of Technology (MIT), and the University of Southern California (USC). b The largest connected component of collaborations within Harvard University. Each node represents a researcher. Dashed lines represent new collaborations added between 2012 and 2017, while open circles represent new researchers added between 2012 and 2017. The highest degree node is Ichiro Kawachi, a highly cited sociologist.

This dataset helps us capture how the number of collaborations scales with an institution’s size, n. Figure 2a, b shows the number of internal and external collaborations versus n across four different disciplines: computer science, physics, math, and sociology. While each institution follows a scaling law c ~ n^α (R^2 is close to 1.0, see Supplementary Note 6), the exponents α differ substantially between institutions. This is shown in the insets of Fig. 2a, b, where we collect scaling exponents across thousands of institutions and find that their distribution stretches from zero (institutions that do not gain any collaborations as they grow) to two (collaborations are extremely dense). In the thermodynamic limit, exponents cannot be larger than two; values above two are therefore due to finite-size effects.
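As an illustration of how a per-institution exponent α could be estimated, the following Python sketch fits c ~ n^α by ordinary least squares on log-transformed values. The helper name `fit_scaling_exponent` and the synthetic data are assumptions made for this example; the fitting procedure actually used for Fig. 2 may differ.

```python
import math

def fit_scaling_exponent(sizes, collaborations):
    """Estimate alpha in c ~ n**alpha by least squares on log-log values.

    Hypothetical helper: pairs with a non-positive size or count are skipped
    because their logarithm is undefined.
    """
    pts = [(math.log(n), math.log(c))
           for n, c in zip(sizes, collaborations) if n > 0 and c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive (n, c) pairs")
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all sizes are identical; the slope is undefined")
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    alpha = sxy / sxx
    return alpha, my - alpha * mx  # slope and log of the prefactor

# Synthetic institution whose collaborations grow exactly as 0.5 * n**1.2
sizes = [10, 30, 100, 300, 1000]
collabs = [0.5 * n ** 1.2 for n in sizes]
alpha, log_prefactor = fit_scaling_exponent(sizes, collabs)
```

On noiseless synthetic data the fit recovers the exponent and prefactor that generated it; on real cumulative counts the points are not independent, so the fitted α is best read as a descriptive summary.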

Fig. 2: Institutions densify at different rates but their size and frequency follow universal patterns.

a Internal and (b) external collaborations versus institution size for three arbitrarily chosen institutions with more than 10^3 cumulative researchers in each field or simulation. Circle markers correspond to simulation data; solid lines, medium dashed lines, long dashed lines, and short dashed lines correspond to data from the fields of computer science (CS), physics, math, and sociology, respectively. Dash-dotted lines report linear scaling, showing that institutions' scaling laws are super-linear. Insets: distribution of exponents across thousands of institutions (cf. Supplementary Note 1). c The distribution of researchers in each institution as of 2017 (Zipf’s law), and (d) the number of unique institutions versus the total number of researchers in each field (Heaps' law). Closed circle markers correspond to simulation data; open circles, squares, diamonds, and triangles correspond to computer science (CS), physics, math, and sociology, respectively. In addition, light dash-dotted lines indicate empirical trends while darker dashed lines indicate theoretical scaling law exponents −1 − ν/ρ and ν/ρ for (c) and (d) respectively25. Simulation data in (a) and (b) are collected from four realizations and in (c) and (d), from fifteen realizations (individual realizations show similar trends). Simulation parameters are ρ = 4, ν = 2, μ_p = 0.6, and σ_p = 0.25.

To show that the scaling exponents of all institutions are different, we create a null model (see Supplementary Note 3) in which all institutions follow the same scaling law. In this null model, residuals of each institution’s fitted scaling relation are reshuffled and added as noise onto a single scaling relation. Differences between fitted exponents in this model are due to statistical noise rather than different scaling laws. We find that the variance of the scaling exponents across all institutions is much higher than in this null model. We therefore reject the hypothesis that all the exponents within a field are the same within statistical error. We explore the dependence of scaling on final institution size in Supplementary Note 6, and find the scaling exponents are superlinear (approximately 1.2 on average) and do not depend strongly on the final size of the institution. Different parts of the collaboration network therefore densify at different rates, which extends previous work that uncovered densification for many networks at the aggregate level19.
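The logic of such a null model can be sketched in Python as follows. This is only an illustration under stated assumptions: residuals are pooled across institutions before shuffling, the single scaling relation is a least-squares fit to all points, and only one shuffled realization is drawn. The function names and synthetic data are hypothetical, and the procedure in Supplementary Note 3 may differ in these details.

```python
import math
import random
import statistics

def ols(xs, ys):
    """Return (slope, intercept) of an ordinary least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def exponent_variances(series, rng):
    """series: one (log_n list, log_c list) pair per institution.

    Returns (variance of fitted exponents, variance under the shuffled null).
    """
    fits = [ols(xs, ys) for xs, ys in series]
    observed = statistics.pvariance([slope for slope, _ in fits])

    # Pool every institution's residuals around its own fitted line.
    residuals = [y - (slope * x + icpt)
                 for (xs, ys), (slope, icpt) in zip(series, fits)
                 for x, y in zip(xs, ys)]
    rng.shuffle(residuals)

    # Single shared scaling relation, fitted to all points at once.
    all_x = [x for xs, _ in series for x in xs]
    all_y = [y for _, ys in series for y in ys]
    slope0, icpt0 = ols(all_x, all_y)

    # Surrogate institutions: shared relation plus reshuffled residuals.
    null_slopes, start = [], 0
    for xs, _ in series:
        noise = residuals[start:start + len(xs)]
        start += len(xs)
        surrogate = [slope0 * x + icpt0 + e for x, e in zip(xs, noise)]
        null_slopes.append(ols(xs, surrogate)[0])
    return observed, statistics.pvariance(null_slopes)

# Synthetic example: five institutions with genuinely different exponents.
rng = random.Random(0)
log_n = [math.log(10) + 0.25 * i for i in range(20)]
series = [(log_n, [a * x + rng.gauss(0, 0.05) for x in log_n])
          for a in (0.8, 1.0, 1.2, 1.4, 1.6)]
observed_var, null_var = exponent_variances(series, rng)
```

When the true exponents differ, `observed_var` is much larger than `null_var`; a fuller test would repeat the shuffle many times to build a distribution of null variances.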

We find weak evidence that higher scaling exponents correspond to institutions with greater impact. In physics, the Spearman rank correlation, s, between mean paper impact after five years and internal collaboration scaling exponents is 0.09 (borderline significant, p-value = 0.06) and for external collaboration is 0.27 (p-value < 10^−5). Similarly, in sociology, the correlation is 0.19 (p-value = 0.03) between impact and internal collaboration exponents, and the same correlation value is found for external collaboration exponents. For all other fields, however, the correlations are not statistically significant (p-value ≥ 0.20). Impact, a proxy of institution research quality, cannot fully explain why collaborations grow faster in some institutions and not others, but can give some insight into reasons for this diversity. These results suggest that highly impactful institutions seem to form collaborations more easily as they grow. Nonetheless, almost all institutions benefit from being larger, as the number of collaborations per person typically grows with size (Fig. 2a, b inset).

The superlinear scaling of collaborations cannot be explained by researcher productivity. The scaling exponents of output, i.e., the cumulative number of papers published by researchers affiliated with that institution at a given year, are centered around 1.0 (see Supplementary Note 4). Paper output per researcher is therefore approximately independent of institution size. The average team size per institution, however, increases with institution size (see Supplementary Note 5), which may help explain the scaling of collaborations. Namely, as institutions grow, they form larger teams for each paper. This, in turn, creates more collaborations (which are proportional to the team size squared).
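To make the proportionality explicit: a paper with k co-authors contributes at most k(k − 1)/2 distinct co-author pairs, which grows as k^2/2 for large k. This is an upper bound on new collaborations, since pairs who have already co-authored an earlier paper are not counted again.

```latex
\[
  \binom{k}{2} \;=\; \frac{k(k-1)}{2} \;\approx\; \frac{k^{2}}{2}
  \qquad (k \gg 1)
\]
```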

We also find that the distribution of institution sizes (as of 2017) follows Zipf’s law (Fig. 2c), similar to the observed heavy-tailed distribution of city sizes22,27. In Supplementary Note 1 and Supplementary Data 1, we show that while the largest institutions are intuitive, such as Harvard, the smaller institutions tend to be for-profit colleges, community colleges, and institutions without a formal department in the field of interest (e.g., an engineering school with papers in sociology). In addition, the number of institutions grows sublinearly with the number of researchers in each field (Fig. 2d). This feature, known as Heaps’ law, implies that quadrupling the number of researchers in a field roughly doubles the total number of institutions associated with that field. Exact scaling law values for each field can be found in Table 1, where Heaps’ laws are calculated for the total number of researchers in each field, N, greater than twenty and Zipf’s law is calculated for institution size, n, greater than ten.
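As a worked check of the quadrupling statement: if the number of institutions grows as D(N) ∝ N^β, then quadrupling N multiplies D by 4^β, which equals 2 when β = 1/2. The symbols D and β are introduced here only for illustration; the fitted exponents for each field are those reported in Table 1.

```latex
\[
  \frac{D(4N)}{D(N)} \;=\; \frac{(4N)^{\beta}}{N^{\beta}} \;=\; 4^{\beta},
  \qquad 4^{1/2} = 2 .
\]
```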

Table 1 Zipf’s law and Heaps’ Law exponents for research fields and simulation.

A Model of Institution Growth

We now describe a stochastic growth model of institution formation that elucidates how institutions and collaborations jointly grow. We model institution formation and growth with a Pólya’s urn-like set of mechanisms described in ref. 25, and we model the growth of collaborations with a network densification mechanism20,21. Unlike existing models of network densification19,20,21, however, our model reproduces the heterogeneous densification of internal and external collaborations, and the non-trivial growth structure of institutions. This is complementary to a very recent model of heterogeneous exploration28, in which Pólya’s urn models vary as a function of a node’s position on a (static) network.

We imagine an urn containing balls of different colors. The balls can be thought of as the resources given to each institution, where each color represents a different assigned institution, as shown in Fig. 3a. Balls are picked uniformly at random with replacement, with each pick representing a newly-hired researcher, and the ball color is recorded in a sequence to represent which institution hires the researcher. Afterwards, ρ balls of the same color are added to the urn to represent the additional resources and prestige given to a larger institution, known as reinforcement (left panel of Fig. 3a)25. If a previously unseen color is chosen, then ν + 1 uniquely colored balls are placed into the urn, a step known as triggering (right panel of Fig. 3a)25. The new colors represent institutions that are able to form because of the existence of a new institution. This triggering, also known as the adjacent possible25, does not imply causality per se, e.g., the cause of the University of California Merced’s creation was not strictly because of previously established institutions. Instead, these institution-specific causes are represented as stochastic noise, a remarkable simplification that does not remove the observed statistical regularities. Triggering, however, agrees with anecdotal evidence, making it an intuitive factor behind the creation of institutions. For example, UC Davis was spun out of UC Berkeley, and USC Institute for Creative Technology was spun out of USC Information Sciences Institute, which itself was founded by researchers from the Rand Corporation. The model we describe is known as Pólya’s urn with triggering25, and predicts Heaps’ law with a scaling relation ~ N^(ν/ρ) and Zipf’s law with scaling relation ~ n^(−(1+ν/ρ)) (ref. 25). In our simulations, we arbitrarily chose ρ to be 4 and ν to be 2, which agrees well with the data shown in Fig. 2.
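The urn dynamics described above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulation code used for Fig. 2: the function name, the choice to start the urn with ν + 1 unseen colors, and the number of steps are all assumptions.

```python
import random
from collections import Counter

def polya_urn_with_triggering(steps, rho=4, nu=2, seed=0):
    """Simulate hires with reinforcement (rho) and triggering (nu).

    Returns the hiring sequence (one institution label per researcher) and
    the number of distinct institutions seen after each hire.
    """
    rng = random.Random(seed)
    urn = list(range(nu + 1))       # assumed initial state: nu + 1 unseen colors
    next_color = nu + 1             # label for the next brand-new color
    seen = set()
    hires, distinct = [], []
    for _ in range(steps):
        color = rng.choice(urn)     # uniform draw; the ball stays in the urn
        hires.append(color)
        urn.extend([color] * rho)   # reinforcement: rho extra balls of this color
        if color not in seen:       # first hire at this institution
            seen.add(color)
            # triggering: nu + 1 balls, each of a previously nonexistent color
            urn.extend(range(next_color, next_color + nu + 1))
            next_color += nu + 1
        distinct.append(len(seen))
    return hires, distinct

hires, distinct = polya_urn_with_triggering(20000)
sizes = Counter(hires)              # institution label -> number of researchers
```

Plotting `distinct` against the number of hires on log-log axes gives the Heaps' law curve, and the rank-ordered values of `sizes` give the Zipf's law curve, which can be compared with the predicted exponents ν/ρ and −(1 + ν/ρ).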

Fig. 3: Schematic representation of the institution growth model.

a At time t a new researcher is hired, modeled as extracting a ball with uniform probability with replacement from an urn, U (black arrow). The ball color represents an institution. Hiring a researcher will always add ρ new balls of the same color to the urn in the next timestep (reinforcement). Hiring the first researcher at an institution (picking a ball color that has never been picked before) triggers ν + 1 new colors to enter the urn, increasing the likelihood of more institutions to hire their first researcher (triggering). b Researchers within each institution (dash-dotted boxes) have both internal collaborators (darker solid lines) and external collaborators (gray lines). Once a researcher is hired, they choose one random internal and one random external collaborator (solid arrows). New collaborations (dashed arrows) are formed independently with probability p_A, if hired by institution A, and p_B if hired by institution B. These new connections form triangles.

Next, we explain the heterogeneous and superlinear scaling of collaborations through a model of network densification. Building on the work of refs. 20,21, we have each new researcher, represented as a node, connect to a random researcher within the same institution, as well as an external researcher picked uniformly at random (left panel of Fig. 3b). New collaborators are then chosen independently from neighbors of neighbors with probability p_i, where p_i is unique to each researcher’s institution (right panel of Fig. 3b). We let p_i be a Gaussian distributed random variable with mean μ_p = 0.6 and standard deviation σ_p = 0.25, truncated between 0 and 1. Lambiotte et al.21 show that their equivalent to μ_p, when greater than 0.5, produces densification. We therefore choose μ_p = 0.6 to ensure the network densifies. We show separately that p_i directly controls the heterogeneity we observe in internal collaboration scaling, but the heterogeneity in external collaboration scaling is an emergent outcome of this model29.
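A minimal Python sketch of this densification step follows. It takes the institution of each arriving researcher as input, so it can be driven by any hiring sequence. The function names, the rejection sampling used for truncation, and the reading of "neighbors of neighbors" as the current collaborators of the two initial contacts are assumptions made for illustration.

```python
import random

def truncated_gauss(rng, mu=0.6, sigma=0.25):
    """Draw from a Gaussian, rejecting values outside [0, 1]."""
    while True:
        x = rng.gauss(mu, sigma)
        if 0.0 <= x <= 1.0:
            return x

def grow_network(hires, seed=0):
    """hires: sequence of institution labels, one per arriving researcher.

    Returns (adjacency dict, institution of each researcher, p per institution).
    """
    rng = random.Random(seed)
    adj, inst_of, members, p = {}, [], {}, {}
    for node, inst in enumerate(hires):
        if inst not in p:
            p[inst] = truncated_gauss(rng)       # one p_i per institution
        insiders = members.setdefault(inst, [])
        outsiders = [v for v in range(node) if inst_of[v] != inst]
        targets = []
        if insiders:
            targets.append(rng.choice(insiders))   # random internal contact
        if outsiders:
            targets.append(rng.choice(outsiders))  # random external contact
        # Collaborators of the two contacts, taken before any new link is added
        candidates = {w for t in targets for w in adj[t]} - set(targets)
        new_links = set(targets)
        new_links.update(w for w in sorted(candidates)
                         if rng.random() < p[inst])  # independent with prob. p_i
        adj[node] = set()
        for w in new_links:
            adj[node].add(w)
            adj[w].add(node)
        insiders.append(node)
        inst_of.append(inst)
    return adj, inst_of, p

# Toy run: 600 researchers assigned at random to three institutions
demo_rng = random.Random(1)
hires = [demo_rng.choice("ABC") for _ in range(600)]
adj, inst_of, p = grow_network(hires)
edges = {(u, v) for u in adj for v in adj[u] if u < v}
internal = sum(inst_of[u] == inst_of[v] for u, v in edges)
external = len(edges) - internal
```

Tracking `internal` and `external` per institution as it gains members would give the model analogue of the curves in Fig. 2a, b.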

To summarize, our model has four parameters: ρ (reinforcement), ν (triggering), and two parameters to explain collaboration densification heterogeneity, μ_p and σ_p. In the main text, we let ρ equal 4, ν equal 2, μ_p equal 0.6, and σ_p equal 0.25. These are arbitrarily chosen parameters meant to create statistical patterns that are qualitatively similar to empirical data. Namely, μ_p > 0.5 ensures collaboration densification21, and σ_p > 0 ensures that densification scaling exponents vary between institutions. Interestingly, this model’s Zipf’s and Heaps’ laws can be exactly calculated, as discussed by Tria et al.25, with Zipf’s law exponent equal to −1 − ν/ρ and Heaps’ law exponent equal to ν/ρ. This model qualitatively reproduces Zipf’s and Heaps’ laws (Fig. 2c, d and Table 1) and the heterogeneous scaling of internal and external collaborations shown in Fig. 2a, b. While other plausible mechanisms for Zipf’s law30,31,32, Heaps’ law24, or densification19 exist, the current model describes these patterns in a cohesive framework and explains the heterogeneous scaling we discover in the data. While this heterogeneity is built into our internal scaling laws, the external scaling heterogeneity is an emergent property within the model29.

The model also reproduces qualitative trends of cross-sectional analysis. Specifically, the scaling exponents of internal collaborations produced by the model, when measured at a specific point in time (i.e., in a cross-sectional setting), are larger than the scaling exponents of external collaborations and decrease over time (Supplementary Note 6), similar to what we see in data (Supplementary Fig. 3). These results are robust to stochastic variations of the densification mechanism (Supplementary Note 7). As a final comparison with data, we compared the growth of institutions and the ways links form to the model mechanisms and found broad agreement29.

Conclusion

We identify strong statistical regularities in the growth of research institutions. The number of collaborations increases superlinearly with institution size, i.e., faster than institutions grow in size, though the scaling is heterogeneous, with a different exponent for each institution. Therefore, each institution has its own scaling law, i.e., regardless of its size, it will always have the same percentage of new collaborations for each percentage increase in size. The superlinear scaling is not explained by increased productivity of researchers at larger institutions: the number of papers per researcher is roughly independent of institution size. Instead, the growing collaborations are associated with bigger teams at larger institutions. The diversity in collaboration scaling exponents is partly explained by variations in institution impact. Institutions with higher-impact papers also tend to have a larger scaling exponent. This is consistent with the hypothesis that a higher collaboration scaling exponent allows collaborations to form more easily, which in turn creates higher-impact papers. Further analysis is needed to test this hypothesis.

When these observations are incorporated into a minimal stochastic model of institution growth, we are able to reproduce the surprising regularity of research institution formation and growth, as well as the heterogeneous densification of collaboration networks. That said, there is still room for improvement in this model, given quantitative differences between the model and data, such as the constant shift between the Heaps’ laws (Fig. 2d), or the difference in the collaboration scaling law exponents (insets of Fig. 2a, b).

These findings support the idea that academic environments differ in their ability to bolster researcher productivity and prominence12, and also point to institution size and the ability to facilitate collaborations as potential factors explaining differences in academic environments. Additional research is needed to identify other factors that contribute to an institution’s success.

Methods

Data

We use bibliographic data from Microsoft Academic Graph (MAG), from which researcher names (authors), their institutional affiliation, and references made to other papers have been extracted26,33. MAG data has disambiguated institutions and authors for each paper, allowing us to consider all authors with the same unique identifier to be the same researcher, and similarly for each institution. In these data, authors typically have only one affiliation at any time (see Supplementary Note 1). We focus on four fields of study: computer science, physics, math and sociology. After data cleaning, we have almost ten million papers published between 1800 and 2018 (see Supplementary Note 1). Our computer science data includes early research in topics relating to computers, including electrical engineering, and therefore stretches back to before 1900.

We define institution size in a given year as the number of authors who have ever been affiliated with that institution up until that year. Collaborations are defined as pairs of researchers who have co-authored a paper up until that year. We distinguish between internal collaborations (co-authors at the same institution) and external collaborations (co-authors affiliated with different institutions). Finally, to understand the relation between collaborations and institution size, we define output as the cumulative number of papers from researchers affiliated with an institution in a particular year.
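These cumulative definitions can be made concrete with a short Python sketch. The record layout (a publication year plus a list of author and institution pairs) and the function name are hypothetical stand-ins, not the MAG schema. An external pair is counted once for each of the two institutions involved, which is an assumption of this example.

```python
from collections import defaultdict
from itertools import combinations

def cumulative_stats(papers, year):
    """papers: iterable of (pub_year, [(author_id, institution_id), ...]).

    Returns {institution: (size, internal_collabs, external_collabs)} using
    every paper published up to and including `year`.
    """
    authors = defaultdict(set)    # institution -> authors ever affiliated
    internal = defaultdict(set)   # institution -> co-author pairs inside it
    external = defaultdict(set)   # institution -> pairs spanning institutions
    for pub_year, bylines in papers:
        if pub_year > year:
            continue
        for author, inst in bylines:
            authors[inst].add(author)
        for (a, ia), (b, ib) in combinations(bylines, 2):
            if a == b:            # same person listed under two affiliations
                continue
            pair = frozenset((a, b))
            if ia == ib:
                internal[ia].add(pair)
            else:
                external[ia].add(pair)
                external[ib].add(pair)
    return {inst: (len(authors[inst]), len(internal[inst]), len(external[inst]))
            for inst in authors}

# Toy records: three papers, two institutions
papers = [
    (2010, [("a1", "X"), ("a2", "X"), ("b1", "Y")]),
    (2012, [("a1", "X"), ("a3", "X")]),
    (2015, [("a2", "X"), ("b1", "Y"), ("b2", "Y")]),
]
stats_2012 = cumulative_stats(papers, 2012)
stats_2015 = cumulative_stats(papers, 2015)
```

Because pairs are stored in sets, a repeated co-authorship between the same two researchers is counted as a single collaboration, matching the cumulative definition above.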

Analysis

We use cumulative statistics to reduce statistical variations and to better compare to a stochastic growth model of institution formation. To check the robustness of results, we compare to an alternate yearly definition of institution size and collaborations (see Supplementary Note 2). We find all qualitative results are the same, in part because both definitions are highly correlated.

We present scaling results for longitudinal analysis, which tracks how collaborations evolve as individual institutions grow34,35,36. This contrasts with cross-sectional analysis applied in previous work on city scaling37,38 and institution scaling2,3,4,39, which measures collaborations as a function of the size of all institutions at a given point in time. We find that cross-sectional analysis identifies scaling laws that are not representative of the growth of most institutions (see Supplementary Note 7). Moreover, while simulations and empirical data give scaling exponents that are fairly constant in time for each institution, cross-sectional scaling exponents vary in time for both data and simulation. For these reasons, we focus on longitudinal scaling analysis in this paper, although scaling laws derived by either analysis method strongly relate to each other36,40.