Introduction

Diversity is highly valued in modern societies1,2,3,4,5,6. Social cohesion, tolerance, and integration are linked to tangible benefits including economic vibrancy7,8 and innovativeness5,9,10,11. Far from being an abstract ideal, this conviction has guided many governmental and hiring policies and can have broad and long-lasting effects on society12,13. However, diversity is a complex issue, as groups can be diverse in terms of various attributes, such as ethnicity, gender, age, and socioeconomic background. It is also unclear if all forms of diversity are beneficial. For instance, ethnic density has been associated with positive outcomes in terms of health14,15, while ethnic polarization has a negative effect on economic development16. Furthermore, diversity can be a divisive topic that is clouded by emotion, partisan loyalties, and political correctness, all of which can hinder impartial discussions17. The factors above strongly motivate an objective study on the value of diversity, and on whether more diverse groups achieve greater success.

One domain in which this question can be effectively addressed is academia18,19. The structure of academic collaboration is observable via co-authorships, which frequently involve scientists from different locations, disciplines and backgrounds20,21. Furthermore, academic output has an objective, widely accepted measure—citation count22,23. This amenability to analysis has already attracted attempts at identifying the factors which underlie success in academia, an enterprise known as the Science of Science24. Although many such factors have been studied, including gender25, academic age26, team size27, interdisciplinarity28, ethnicity29, and affiliation30,31, the study of these factors is extremely complex and many questions remain unanswered.

Our study seeks to address this shortcoming from a number of hitherto unexplored perspectives. Firstly, we compare homophily in scientific collaborations from the perspectives of age, gender, affiliation, and ethnicity. We find clear signs of homophily in the cases of ethnicity, gender, and affiliation. However, in only one case, ethnicity, was homophily was found to be increasing steadily over time. Secondly, we examine the relationship between various classes of diversity and research impact at the level of scientific fields. Remarkably, we found that ethnic diversity is most strongly associated with scientific impact. Thirdly, we compare the benefits of diversity on groups vs. individuals, and find that the former outweighs the latter. Finally, we study the evolution and effect of diversity over time, team size, and number of collaborators, and verify that the above findings persist across all of these dimensions. The results of these multiple angles of analysis are combined to form a far richer picture of diversity than has been possible in the past.

Results

Exploring homophily

A natural starting point for our study of diversity is to establish the extent to which homophily32 exists in academia—i.e., whether scientists tend to collaborate more frequently with similar others—which would lead to an overall lack of diversity in scientific collaborations. We use the Microsoft Academic Graph dataset (available at: https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/), and analyze 1,045,401 multi-authored papers (see Supplementary Figure 1 for the distribution of papers by year), written by 1,529,279 scientists, spanning eight main fields and 24 subfields of science. We analyzed diversity in terms of these five attributes: ethnicity (eth), discipline (dsp), gender (gen), affiliation (aff), and academic age (age); see Supplementary Note 1. Here, the abbreviations in parentheses are used in subsequent mathematical expressions to indicate the associated attribute. These attributes reflect many technical and social factors that influence teamwork and collaboration. Affiliation indicates the geographic location, and may even reflect the way collaborative work is carried out—from the style and culture of collaboration to its mundane details, such as the medium used to collaborate, e.g., face-to-face interactions vs. telecommunication or email. Academic age is not only indicative of the amount of experience that a scientist has, but is also typically associated with actual age. Discipline may reflect a scientist’s substantive knowledge and his/her acquired skills through training, as well as the culture in which collaborative work is carried out. Finally, ethnicity and gender may play a role in shaping scientists’ social identities, knowledge, and biases. To quantify diversity in terms of any of the aforementioned attributes, we use the Gini Impurity33, resulting in the following group diversity indices, \(d_{{\mathrm {eth}}}^{\mathrm {G}}\), \(d_{\mathrm {{age}}}^{\mathrm {G}}\), \(d_{{\mathrm {gen}}}^{\mathrm {G}}\), \(d_{{\mathrm {dsp}}}^{\mathrm {G}}\) and \(d_{{\mathrm {aff}}}^{\mathrm {G}}\) (an alternative diversity measure was also considered; see Supplementary Note 2 and Supplementary Figure 2).

To explore homophily, we generate different randomized baseline models whereby a particular attribute—be it ethnicity, gender, affiliation, or academic age—is shuffled. For example, in the case of ethnicity, this process is akin to creating a universe in which ethnicity is disregarded in the selection of co-authors, while retaining other criteria. To preserve the conditional distributions of the ethnicities, the shuffling process is constrained to only occur between authors of papers that have the same subfield, publication year, and number of authors; for full details, see Supplementary Note 3. This way, for every paper p in the real dataset, there exists a matching paper p′ in the randomized dataset that may differ from p in terms of ethnic diversity, but is identical to p in terms of gender, affiliation, academic age, citations, publication year, and number of authors per paper. Importantly, while such a baseline model may produce homogeneous groups, the emergence of such groups is purely the result of random chance rather than homophily. As such, by comparing the real dataset with this baseline model, we can determine whether homophily exists, and if so, quantify the degree to which it is spread across academia. Figure 1a compares our real dataset with the randomized baseline model in terms of the cumulative distributions of \(d_x^{\mathrm {G}}:x \in \{ {\mathrm {eth,age,gen,aff}}\}\). As can be seen, for x {eth, gen, aff}, groups with low \(d_x^{\mathrm {G}}\) are more common in reality than would be expected by random chance, highlighting the fact that homophily does indeed exist in academia in terms of ethnicity, gender, and affiliation. However, for x = age, the opposite was observed (see Supplementary Figures 36 for subfield-specific distributions). These observations persist, regardless of the publication year (Fig. 1b), and the number of authors per paper (Fig. 1c). The temporal trends observed in Fig. 1b are particularly intriguing. For \(d_{{\mathrm {eth}}}^{\mathrm {G}}\), while the population of scientists is becoming more ethnically diverse (see the steady increase in the red line), this trend is not reflected in the actual coauthor groupings, implying that ethnic homophily is steadily increasing. For \(d_{{\mathrm {age}}}^{\mathrm {G}}\), the actual level of diversity is greater than would be expected by random chance; this pattern is regularly observed in academia, e.g., consider the many publications resulting from advisor–advisee collaborations. For \(d_{{\mathrm {gen}}}^{\mathrm {G}}\), although gender homophily continues to exist, it steadily decreases over time, suggesting that women are playing an ever greater role in scientific endeavors. Finally, for \(d_{{\mathrm {aff}}}^{\mathrm {G}}\), there is a marked decrease in affiliation homophily around the 1990s; this is consistent with the jump in multi-university collaborations in the 1990s due to the widespread of the Internet and other technologies that facilitate collaboration across geographically distant scientists30.

Fig. 1
figure 1

Exploring homophily in real vs. randomized data. Each column corresponds to a different class of diversity, and each row presents the results of a specific set of experiments whereby \(d_x^{\mathrm {G}}:x \in \{{\mathrm { {eth,age,gen,aff}}}\}\) in real data is compared against randomized data. a Cumulative distributions of \(d_x^{\mathrm {G}}\). b Change in mean diversity \(\langle d_x^{\mathrm {G}}\rangle\) over time. c Mean diversity \(\langle d_x^{\mathrm {G}}\rangle\) for papers with different number of authors

The link between diversity and scientific impact

Having explored homophily in academia, we now study the effects of homophily (and diversity) on research impact, measured by the number of citations received within 5 years of publication, denoted by \(c_5^{\mathrm {G}}\) (see Supplementary Note 4 and Supplementary Figure 7). Using the same dataset and notation described earlier, we study the relationship between a subfield’s diversity and its academic impact. Here, we distinguish between two notions of diversity. The first is where the unit of analysis is a paper’s set of authors, while the second is where the unit of analysis is an individual scientist’s entire set of collaborators. We refer to the former as group diversity, and to the latter as individual diversity; see Fig. 2 for an illustration comparing the two notions.

Fig. 2
figure 2

Group vs. individual diversity. For any given class of diversity, x {eth, age, gen, dsp, aff}, differences in color represent differences in terms of x. The group diversity index \(d_x^{\mathrm {G}}\) of Paper A is higher than that of Paper B. The individual diversity index of Scientist C is higher than that of Scientist D

For each subfield, Fig. 3a depicts the mean group diversity indices, \(\langle d_x^{\mathrm {G}}\rangle :x \in \{ {\mathrm {eth,age,gen,dsp,aff}}\}\), against the mean 5-year citation count, \(\langle c_5^{\mathrm {G}}\rangle\), taken over papers in that subfield (notation summary and formal definitions are in Supplementary Table 1 and Supplementary Note 2, respectively). Remarkably, we find that a subfield’s ethnic diversity is the most strongly correlated with impact (r = 0.77); the positive correlation persists even when the subfields are studied in isolation (Supplementary Figures 8 and Supplementary Table 2), regardless of the number of authors per paper (Supplementary Figure 9). These findings are further supported by the regression analysis in Table 1. While these findings do not imply causation, it is still suggestive that one can largely predict scientific impact based solely on average ethnic diversity, especially given that ethnicity is arguably unrelated to technical competence.

Fig. 3
figure 3

Group and individual diversity vs. impact in each subfield. In each subplot, the points correspond to subfields, the color indicates the main field, while the solid line and the shaded area represent the regression line and the 95% confidence interval, respectively. Each regression has also been annotated with the corresponding Pearson’s r and p values. a For each subfield, the subplots depict the mean group diversity indices, \(\langle d_{{\mathrm {eth}}}^{\mathrm {G}}\rangle\), \(\langle d_{{\mathrm {age}}}^{\mathrm {G}}\rangle\), \(\langle d_{{\mathrm {gen}}}^{\mathrm {G}}\rangle\), \(\langle d_{\mathrm {{dsp}}}^{\mathrm {G}}\rangle\) and \(\langle d_{\mathrm {{aff}}}^{\mathrm {G}}\rangle\), against the mean 5-year citation count, \(\langle c_5^{\mathrm {G}}\rangle\), taken over papers in that subfield. b For each subfield, the subplots depict the mean individual diversity indices, \(\langle d_{\mathrm {{eth}}}^{\mathrm {I}}\rangle\), \(\langle d_{\mathrm {{age}}}^{\mathrm {I}}\rangle\), \(\langle d_{\mathrm {{gen}}}^{\mathrm {I}}\rangle\), \(\langle d_{\mathrm {{dsp}}}^{\mathrm {I}}\rangle\) and \(\langle d_{\mathrm {{aff}}}^{\mathrm {I}}\rangle\), against the mean 5-year citation count, \(\langle c_5^{\mathrm {I}}\rangle\), taken over scientists in that subfield

Table 1 Regression analyses of diversity classes on academic impact

Having studied group diversity, we now move our attention to individual diversity. Here, we analyze scientists with at least 10 collaborators each, amounting to a total of 5,103,877 collaborators over 9,472,439 papers (see Supplementary Table 3 for a summary of all filters applied on the dataset). For each subfield, Fig. 3b depicts the mean individual diversity indices, \(\langle d_x^{\mathrm {I}}\rangle :x \in \{ {\mathrm {eth,age,gen,dsp,aff}}\}\), against the mean 5-year citation count, \(\langle c_5^{\mathrm {I}}\rangle\), taken over scientists in that subfield. As can be seen, a subfield’s ethnic diversity is again the most strongly correlated with impact (r = 0.55), even when the subfields are studied in isolation (Supplementary Figure 10 and Supplementary Table 4).

The above results highlight a potential dysfunction. While homophily was observed for ethnicity, affiliation and gender, the only attribute for which it was found to be increasing over time was ethnicity, which seems strange given the apparent preeminence of ethnic diversity. Motivated by this observation, we further explore the relationship between ethnic diversity and scientific impact in the randomized universe used earlier in Fig. 1. Recall that, in such a universe, ethnicity is excluded as a criterion for selecting co-authors while the other factors are preserved. Hence, it stands to reason that any differences in impact between the randomized and real datasets can be attributed to ethnic diversity. To examine these differences, we partitioned the papers into two categories, labeled as diverse \(\left( {d_{\mathrm {{eth}}}^{\mathrm {G}} > \tilde d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\) and non-diverse \(\left( {d_{\mathrm {{eth}}}^{\mathrm {G}} \le \tilde d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\), where the tilde denotes the median. The scientists were similarly partitioned into diverse \(\left( {d_{\mathrm {{eth}}}^{\mathrm {I}} > \tilde d_{\mathrm {{eth}}}^{\mathrm {I}}} \right)\) and non-diverse \(\left( {d_{\mathrm {{eth}}}^{\mathrm {I}} \le \tilde d_{\mathrm {{eth}}}^{\mathrm {I}}} \right)\). We find that the diverse consistently outperforms the non-diverse, regardless of the year of publication (Fig. 4e), the number of authors per paper (Fig. 4g), and the number of collaborators per scientist (Fig. 4i). We replicated these plots using the randomized, instead of the real, dataset (Fig. 4f, h and j). As can be seen, the performance gap between the diverse and non-diverse almost entirely disappears in the randomized dataset, suggesting that the observed impact gains in the real dataset could indeed be attributed to ethnic diversity. Note that, in the real dataset, a large proportion of papers have \(d_{\mathrm {{eth}}}^{\mathrm {G}} = 0\) (see Fig. 4a), and a large proportion of scientists have \(d_{\mathrm {{eth}}}^{\mathrm {I}} = 0\) (see Fig. 4c). As such, the observed performance gap between the diverse and the non-diverse could be predominantly due to these papers and scientists being less impactful than their counterparts whose \(d_{\mathrm {{eth}}}^{\mathrm {G}} > 0\) and \(d_{\mathrm {{eth}}}^{\mathrm {I}} > 0\), respectively. To determine whether this is the case, we replicated the analysis of papers but after excluding those with \(d_{\mathrm {{eth}}}^{\mathrm {G}} = 0\), and likewise replicated the analysis of scientists but after excluding those with \(d_{\mathrm {{eth}}}^{\mathrm {I}} = 0\); see Supplementary Figure 11. As can be seen, even after this exclusion, the diverse mostly outperform the non-diverse, regardless of publication year, number of authors per paper, and number of collaborators per scientist.

Fig. 4
figure 4

The relationship between ethnic diversity and impact. a Distribution of \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) in real data. Papers were partitioned into two categories: diverse (highlighted in the darker tones, with \(d_{\mathrm {{eth}}}^{\mathrm {G}} > \tilde d_{\mathrm {{eth}}}^{\mathrm {G}}\)) and non-diverse (highlighted in the lighter tones, with \(d_{\mathrm {{eth}}}^{\mathrm {G}} \le \tilde d_{\mathrm {{eth}}}^{\mathrm {G}}\)), where the tilde denotes the median. b The same as (a), but for randomized data. c and d The same as (a, b), respectively, but with \(d_{\mathrm {{eth}}}^{\mathrm {I}}\) instead of \(d_{\mathrm {{eth}}}^{\mathrm {G}}\). e \(\langle c_5^{\mathrm {G}}\rangle\) against publication year in real data. f The same as (e), but for randomized data. g \(\langle c_5^{\mathrm {G}}\rangle\) against number of authors per paper in real data. h The same as (g), but for randomized data. i \(\langle c_5^{\mathrm {I}}\rangle\) against number of collaborators per scientist in real data. j The same as (i), but for randomized data

Inferring causality

To provide further evidence of the link between ethnic diversity and scientific impact, we use coarsened exact matching34, a technique typically used to infer causality in observational studies35. Specifically, it matches the control and treatment populations with respect to the confounding factors identified, thereby eliminating the effect of these factors on the phenomena under investigation. In our case, when studying group ethnic diversity, the treatment set consists of papers for which \(d_{\mathrm {{eth}}}^{\mathrm {G}} > P_{100 - i}\left( {d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\), and the control set of papers for which \(d_{\mathrm {{eth}}}^{\mathrm {G}} \le P_i\left( {d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\), where \(P_i\left( {d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\) denotes the ith percentile of \(d_{\mathrm {{eth}}}^{\mathrm {G}}\). This process is repeated using i = 10, 20, 30, 40, 50, corresponding to progressively larger gaps in ethnic diversity between the two populations. Thus, if ethnic diversity is indeed associated with increased scientific impact, we would expect to find a significant difference in impact between the two populations, and expect this difference to increase in tandem with the aforementioned gap in diversity. The confounding factors identified were the year of publication, number of authors, field of study, authors’ impact prior to publication, and university ranking. The same process was carried out for individual ethnic diversity, for which the confounding factors were academic age, number of collaborators, discipline, and university ranking; see Supplementary Note 5 and Supplementary Figures 12 and 13 for more details, and Supplementary Figure 14 for an illustration of how this process works on a given collection of papers. The results for group and individual ethnic diversities are summarized in Tables 2 and 3, respectively. As can be seen, increasing the diversity gap between the control and treatment populations is often accompanied by a greater difference in scientific impacts between the two populations. Remarkably, in the case of papers and scientists above the 90th percentile, the difference in scientific impact reaches 10.63% and 47.67%, respectively, compared to their counterparts below the 10th percentile. Clearly, these results do not suggest that diversity is the only causal factor. For example, one may argue that highly ranked universities tend to attract students from around the world and are more ethnically diverse as a result; indeed we verified that this was the case (see Supplementary Note 6 and Supplementary Figures 15 and 16). In such situations, coarsened exact matching is particularly useful precisely because it allows us to establish causality despite such effects.

Table 2 Coarsened exact matching of group ethnic diversity
Table 3 Coarsened exact matching of individual ethnic diversity

Interplay between group and individual ethnic diversity

Finally, we investigate the interplay between group ethnic diversity, \(d_{\mathrm {{eth}}}^{\mathrm {G}}\), and individual ethnic diversity, \(d_{\mathrm {{eth}}}^{\mathrm {I}}\). To this end, for each of the 1,045,401 papers in our dataset, we calculate \(d_{\mathrm {{eth}}}^{\mathrm {I}}\) averaged over the authors in that paper; we denote this as \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\). This allows us to study the ways in which the two notions of diversity vary in the same paper. Indeed, as illustrated in Fig. 5, a paper can have high \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) and at the same time have low \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\), and vice versa. With this in mind, we studied the impact, \(\left\langle {c_5^{\mathrm {G}}} \right\rangle\), of papers falling in different ranges of \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) and \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\); see the matrix at the bottom-right corner of Fig. 5. Here, if we denote this matrix by A, and label the bottom row and leftmost column as 1, we find that \(\mathop {\sum}\nolimits_{i = 1}^4 A_{i,1} < \mathop {\sum}\nolimits_{i = 1}^4 A_{1,i}\) and \(\mathop {\sum}\nolimits_{i = 1}^4 A_{i,4} > \mathop {\sum}\nolimits_{i = 1}^4 A_{4,i}\). Hence, while it appears that both group and individual diversities can be valuable, the former seems to have a greater effect on scientific impact. In other words, having co-authors who are inclined to collaborate across ethnic lines (i.e., co-authors whose individual ethnic diversity is high) appears to be not as important as the mere presence of co-authors of different ethnicities (i.e., co-authors whose group ethnic diversity is high).

Fig. 5
figure 5

The interplay between group and individual ethnic diversity. The top part of the figure illustrates an example of 4 papers. The authors of paper A have different ethnicities, but each has ethnically homogeneous collaborators. Then, one could argue that paper A has high \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) but low \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\). Similarly, paper B has low \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) and low \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\), paper C has low \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) and high \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\), and paper D has high \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) and high \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\). The matrix at the bottom-right corner represents the mean citation counts, \(\left\langle {c_5^{\mathrm {G}}} \right\rangle\), of papers falling in different ranges of \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) and \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\)

Discussion

To summarize, this study is the first to cover five different classes of diversity, which allowed us to illuminate many interesting connections between diversity and scientific collaboration. It was also important to establish the occurrence of homophily, and this was achieved via a set of randomized baseline models. These were used to compare observed collaborations with simulated data where the attribute of interest was randomized while controlling for the relevant confounding variables. These comparisons revealed clear and consistent patterns of homophily in the cases of ethnicity, gender, and affiliation, and also revealed that ethnicity was the only attribute for which homophily is increasing over time. In the case of academic age, inverse homophily was found, i.e., scientists seem to prefer collaborating with individuals from different age groups, a possible reflection of the widely held practice of research students being mentored by, and collaborating with, more senior academics.

Armed with these results, we shifted our focus to the effect of homophily (and diversity) on scientific impact. This analysis was conducted using a number of different analytical tools, including regression analysis, randomized baseline models, and coarsened exact matching. Broadly, we found that diversity was positively correlated with impact, though the statistical significance of the observed effect varied significantly depending on the class of diversity and field of study. Overall, discipline and affiliation diversity were the least correlated with impact, a surprising finding given the apparent importance of these attributes. Conversely, ethnic diversity had the strongest correlation, which is especially surprising since ethnicity is not as related to technical competence as the other classes mentioned.

These findings have significant implications. For one, recruiters should always strive to encourage and promote ethnic diversity, be it by recruiting candidates who complement the ethnic composition of existing members, or by recruiting candidates with proven track records in collaborating with people of diverse ethnic backgrounds. Another implication is that, while collaborators with different skill sets are often required to perform complex tasks, multidisciplinarity should not be an end in of itself; bringing together individuals of different ethnicities—with the attendant differences in culture and social perspectives—could ultimately produce a large payoff in terms of performance and impact. To put it differently, intangible factors, such as team cohesion and a sense of esprit de corps should be considered in addition to technical alignment.

The underlying message is an inclusive and uplifting one. In an era of increasing polarization and identity politics, our findings may positively contribute to the societal conversation and reinforce the conviction that good things happen when people of different backgrounds, cultures, and ethnicities come together to work towards shared goals and the common good.