Collaboration and topic switches in science

Collaboration is a key driver of science and innovation. Mainly motivated by the need to leverage different capacities and expertise to solve a scientific problem, collaboration is also an excellent source of information about the future behavior of scholars. In particular, it allows us to infer the likelihood that scientists choose future research directions via the intertwined mechanisms of selection and social influence. Here we thoroughly investigate the interplay between collaboration and topic switches. We find that the probability for a scholar to start working on a new topic increases with the number of previous collaborators, with a pattern showing that the effects of individual collaborators are not independent. The higher the productivity and the impact of authors, the more likely their coworkers will start working on new topics. The average number of coauthors per paper is also inversely related to the topic switch probability, suggesting a dilution of this effect as the number of collaborators increases.

Modern science has become increasingly collaborative over the past decades [1]. Large teams have become almost necessary to tackle complex problems in various disciplines, requiring a large pool of knowledge and skills. On the other hand, small teams may introduce novel paradigms [2].
A powerful representation of the collaborative nature of science is given by a collaboration network, in which nodes are authors, and two nodes are connected if they have coauthored at least one paper. With the growing availability of bibliometric data, collaboration networks have been extensively studied, and their structural properties are now well known [3][4][5][6].
Collaboration networks are concrete manifestations of homophily between scholars. People working on the same topic or problem may decide to team up and leverage their respective skills to increase their chances of discovering new results. This is an example of selection, in that similar individuals end up interacting with each other.
On the other hand, collaboration could also induce social influence, in that scholars might affect the future behavior of their coauthors. Coauthors often expose us to new tools, methods, and theories, even when these are not used for the specific project carried out by the team. The link between diffusion of knowledge and collaboration has been highlighted and explored for some time. For instance, it is known that knowledge flow occurs with a greater probability between scholars who have collaborated in the past [7] and those who are in close proximity in the network [8].
In particular, once scholars discover new research topics, they may decide to work on them in the future.
Switches between research interests have become increasingly frequent over time [9] and have been quantitatively investigated [10,11]. The decision to switch may actually be induced by the coauthors in a social contagion process [12][13][14][15][16], where scholar a, who spreads the new topic, influences scholar b to adopt it. For this reason, epidemic models have been applied to describe the diffusion of ideas [17][18][19]. In these models, an infected individual a exposes a susceptible individual b to a disease; with a certain probability, b becomes infected and continues the spread. In the case of a topic, the infection spreads if b works on the new topic.
Here we present an extensive empirical analysis of the relationship between topic switches of scientists and their collaboration patterns. We distinguish active authors, i.e., those who have papers on the new topic, from inactive authors who have never published in that area. For simplicity, we focus only on the first-order neighborhoods in the collaboration network. We find that the probability for an inactive scholar to switch topic grows with the productivity and impact of their active coauthors. The larger the average number of inactive coauthors of active scientists, the smaller the effect. Also, the topic-switch probability for an inactive scholar grows with the number of their active coauthors, with a profile suggesting that the contributions of each coauthor are not independent.

RESULTS
We use the scientific publication dataset OpenAlex [20]. We present the results for twenty topics belonging to three disciplines: Physics, Computer Science, and Biology & Medicine. See Methods A for details.
Our approach is inspired by the pioneering work by Kossinets and Watts on social network evolution [21]. In it, the authors estimated triadic closure of two individuals a and b, i.e., the probability that a and b become acquainted as a function of the number of common friends. They took two snapshots of the network at consecutive time ranges: in the earlier snapshot, one keeps track of all pairs of disconnected people, and in the later one, one counts how many of those pairs become connected. A similar approach has been adopted to compute membership closure, i.e., the probability that an individual starts participating in an activity having been connected to k others who participate in it [22]. We now describe how we adapt this framework to measure how collaborations induce topic switches.
Given a scientific topic $t$, reference year $T_0$, and window size $T$, we construct two consecutive non-overlapping time ranges spanning years $[T_0 - T, T_0)$ and $[T_0, T_0 + T)$, respectively. We call the first range the interaction window (IW), where we track author interactions in the collaboration network, and the second range the activation window (AW), where we count topic switches. We then identify the set of active authors $A$ who published papers $P$ on topic $t$ during the IW. For example, in Fig. 1A, $A = \{a_0, a_1, a_4, a_5\}$. We construct the collaboration network $G$ by considering all papers $P'$ written by authors $a \in A$ after $a$ becomes active. Note that $P'$ includes papers outside of $P$, like the ones drawn in gray in Fig. 1A. We classify the non-active authors in $G$ as inactive authors, who are the candidates for topic switches in the AW. They turn active when they publish their first paper on topic $t$. In Fig. 1B, authors $a_2$, $a_3$, and $a_6$ are inactive, with $a_2$ and $a_6$ becoming active in the AW. Furthermore, we rank each active author $a \in A$ based on two metrics of scientific prominence, productivity and impact, described in Methods C and calculated at the end of the IW to capture the current perception of $a$'s scholarly output. Finally, for each metric, we identify and mark the authors who rank in the top and the bottom 10%.
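The window construction and the active/inactive classification described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the `Paper` record, the function names, and the toy paper list are hypothetical, years are treated as integers, a paper published in the same year as an author's activation is assumed to count as written "after" activation, and an author's publication history before the IW is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    year: int
    authors: frozenset
    topics: frozenset


def split_windows(t0, width):
    """Half-open year ranges: IW = [t0 - width, t0), AW = [t0, t0 + width)."""
    return range(t0 - width, t0), range(t0, t0 + width)


def classify_authors(papers, topic, t0, width):
    iw, aw = split_windows(t0, width)

    # Year of each author's first on-topic paper inside the IW.
    first_active = {}
    for p in sorted(papers, key=lambda q: q.year):
        if p.year in iw and topic in p.topics:
            for a in p.authors:
                first_active.setdefault(a, p.year)
    active = set(first_active)

    # Inactive authors: coauthors of an active author on any IW paper
    # published in or after that author's activation year (assumption).
    inactive = set()
    for p in papers:
        if p.year in iw and any(
            a in active and p.year >= first_active[a] for a in p.authors
        ):
            inactive |= p.authors - active

    # Topic switches: inactive authors with an on-topic paper in the AW.
    switched = {
        a
        for p in papers
        if p.year in aw and topic in p.topics
        for a in p.authors
        if a in inactive
    }
    return active, inactive, switched


def paper(year, authors, topics):
    return Paper(year, frozenset(authors), frozenset(topics))


# Hypothetical toy data: topic "t", reference year 2005, window size 5.
papers = [
    paper(2000, {"a0", "a1"}, {"t"}),   # a0 and a1 become active
    paper(2001, {"a0", "a2"}, {"x"}),   # off-topic paper with active a0
    paper(2001, {"a0", "a3"}, {"x"}),
    paper(2000, {"a4", "a8"}, {"x"}),   # written before a4 becomes active
    paper(2002, {"a4"}, {"t"}),         # a4 becomes active
    paper(2006, {"a0", "a2"}, {"t"}),   # a2 switches topic in the AW
]
active, inactive, switched = classify_authors(papers, "t", t0=2005, width=5)
```

In this toy example `a8` is not counted as inactive, because their only joint paper with `a4` predates `a4`'s first paper on the topic.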
Given this general setup, we conduct two complementary experiments that we describe in detail in Sections A and B. In Experiment I, we measure membership closure among inactive authors to quantitatively assess how past collaborations with active authors manifest in topic switches. In Experiment II, we instead focus on the active authors, quantifying the propensity of their inactive coauthors to start working on their topic of expertise.

A. Experiment I
Here we investigate membership closure among inactive authors. Specifically, we answer the following questions:
• How is the probability of topic switches related to k, the number of contacts with active authors?
• Does this probability depend on the relative prominence of the active authors?
FIG. 2. Experiment I. Cumulative target activation probability (in purple) for inactive authors in the AW as a function of the number of active contacts k, with shaded 95% confidence intervals. For each k, the y-value indicates the fraction of inactive authors with at least k active contacts in the IW who became active in the AW. The green solid line with shaded errors represents the baseline described in the text, corresponding to independent effects from the coauthors. The heatmap below the x-axis shows the mean difference between the observed and baseline curves for each k value. It is gray if the 95% confidence interval contains 0, denoting the k-values where the points are statistically indistinguishable at p-value 0.05. Positive and negative deviations from the baseline are in red and blue, respectively.

To compute the measure, we must first define what counts as a contact with an active author in the IW. We consider two definitions:
1. The number of active coauthors, with the same coauthor counted as many times as the number of collaborations. In the collaboration network, this corresponds to the weighted degree when considering only active coauthors.
2. The number of papers written with active coauthors.
For example, in Fig. 1C author $a_6$ has four contacts based on the first definition (two from $a_5$ and one each from $a_0$ and $a_1$), and two if we use the second (from the second and the fourth papers in the IW). We report the findings based on the first definition in the main text. Results from the second definition do not alter the main conclusions and can be found in SI Figs. S1 and S2.
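The two contact definitions can be made concrete with a short Python sketch. The function name and the paper list below are hypothetical; the author lists were chosen only so that the counts quoted for a6 (four contacts under the first definition, two under the second) come out, and they are not meant to reproduce the figure.

```python
from collections import Counter


def active_contacts(author, iw_papers, active):
    """Count contacts of `author` with active coauthors over IW papers.

    Returns (weighted_degree, n_papers, per_source):
      weighted_degree: definition 1, each active coauthor counted once
                       per joint paper;
      n_papers:        definition 2, papers with at least one active coauthor;
      per_source:      contacts contributed by each active coauthor.
    """
    per_source = Counter()
    n_papers = 0
    for authors in iw_papers:
        if author not in authors:
            continue
        sources = [a for a in authors if a in active and a != author]
        per_source.update(sources)
        if sources:
            n_papers += 1
    return sum(per_source.values()), n_papers, dict(per_source)


# Hypothetical author lists of four IW papers, in publication order.
iw_papers = [
    {"a0", "a1"},
    {"a0", "a5", "a6"},
    {"a0", "a2"},
    {"a1", "a5", "a6"},
]
active = {"a0", "a1", "a4", "a5"}
weighted_degree, n_papers, per_source = active_contacts("a6", iw_papers, active)
```

Here a6 collects two contacts from a5 and one each from a0 and a1 under the first definition, all of them coming from the second and fourth papers.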
To address the first question, we compute the cumulative target activation probability $C(k)$, i.e., the fraction of inactive authors who become active in the AW as a function of the number of contacts $k$ (see Methods E). In Fig. 2, we plot $C(k)$ (in purple) for each of the twenty topics under investigation. Error bars derive from averaging over different time windows for each field (see Methods D). As expected, we see an increasing trend. In particular, the jump from $k = 0$ to $k = 1$ is remarkable, showing that the probability of spontaneous activation in the absence of previous contacts ($k = 0$) is much lower than that of activation through collaboration ($k \geq 1$). We observe that the higher the number of contacts, the larger the probability. Most of the growth occurs for low values of $k$.
To put these numbers in context, we consider a simple baseline $C_{\text{base}}(k)$ (see Methods F), where we assume each contact has a constant, independent probability of producing a topic switch. Within each topic, we compute the difference (see Methods D) between the curves for each value of $k$ over all reference years and plot them below the x-axis. Except for the topics of Cluster Analysis, Parallel Computing, and Peptide Sequence, the observed curves deviate from the baseline. This provides empirical evidence that the baseline cannot capture the nuances in the observed data. A positive deviation for the majority of the topics indicates a compounding effect. Fluid Dynamics and Statistical Physics are exceptions, as they undershoot the baseline. This may be because, unlike the others, they are broad interdisciplinary fields, and having collaborators in different fields may lessen their effect.
Next, we explore the second research question, checking if the contact source's prominence affects activation chances. Recall that in every IW for a topic, we select active authors in the top 10% and the bottom 10% based on productivity and impact. This separates the most prominent active authors from the least prominent. To mitigate confounding effects, we only consider the subset of inactive authors who are neighbors of exactly one of the two sets of active authors.
In Fig. 3, we assess the significance of the difference between the cumulative target activation probabilities for inactive authors in contact with active authors in the two bins. Each row corresponds to a topic, and the color of each square indicates whether the difference is positive (red), negative (blue), or non-significant (gray). The two columns correspond to prominent authors selected based on productivity (left) and impact (right). For productivity, all differences are significant and positive, meaning that contacts with highly productive active authors lead to higher target activation probabilities. For impact, there are a handful of exceptions. Overall, having prominent contacts increases the target activation probability.

B. Experiment II
Here we focus on the active authors and their collaborators. For every active author $a$, we consider the subset of their inactive coauthors who have exclusively collaborated with $a$ in the IW. We call this set the exclusive inactive coauthors of $a$. For example, in Fig. 1D, active author $a_0$ has four coauthors $\{a_1, a_2, a_3, a_6\}$, of whom only $a_2$ and $a_3$ exclusively collaborate with $a_0$ in the IW. We do this because effects due to active authors other than $a$ would be difficult to disentangle and could confound the analysis and the conclusions. The relevant measure here is the source activation probability $P_s^a$, i.e., the fraction of exclusive inactive coauthors who become active in the AW (see Methods G). The fraction controls for the collaboration neighborhood sizes, which could vary widely for different scholars. In Fig. 1D, $P_s^a$ for $a_0$ is 1/2 = 0.5, as only $a_2$ becomes active in the AW. For a given set of active authors, we obtain $C_s$, the complementary cumulative probability distribution of their source activation probabilities (see Methods G). We select the pools of the most and least prominent authors as described in Experiment I. The relative effects of the two groups are estimated by comparing the cumulative source activations, i.e., points on the respective cumulative distributions at a specific threshold $f^*$. Results are reported in Fig. 4A for a threshold $f^* = 0.10$. Our conclusions also hold for a threshold $f^* = 0.20$; these results can be found in SI Fig. S3.

FIG. 3. Heatmaps showing the mean difference between the cumulative target activation probabilities of the inactive authors in the AW who had exclusive contacts with the top 10% and bottom 10% of active authors, respectively, selected according to productivity (left) and impact (right) in the IW (x-axis: active contacts k; color scale from -0.12 to 0.12). The cells are gray if the 95% confidence interval contains 0. The majority of red cells indicate that the cumulative target activation probabilities for contacts with the top 10% are higher than those with the bottom 10%.
In Fig. 4, each row corresponds to a topic. The green and purple ranges represent the 95% confidence intervals of the mean difference between the cumulative source activations for the two pools of authors for productivity and impact, respectively. For productivity, the difference is significant for all topics but one (Superconductivity). The differences are somewhat less pronounced for impact, but are still significant in most cases.
To further corroborate this finding, we specialize the analysis by checking how many exclusive coauthors of $a$ also published their first paper on topic $t$ in the AW with $a$. This is a way to assess the chaperoning propensity of active authors [23], and we define the measure in Methods H. In Fig. 4B, we report the 95% confidence intervals of the average difference between the chaperoning propensities for the most prominent and the least prominent active authors for threshold $f^* = 0.10$. Similar to Fig. 4A, we find that the more productive/impactful an active author is, the more likely their coauthors will start working with them on a new topic. Results for $f^* = 0.20$, which confirm this trend, can be found in SI Fig. S3.
While our analysis clearly shows that prominence is a factor, one may wonder if the number of coauthors also plays a role. We posit that, on average, the more collaborators one has, the more tenuous the contact with any of them will be, resulting in lower source activation probabilities. From each group of most prominent authors, we therefore pick the top and the bottom 20% based on the average number of coauthors on papers published with exclusive inactive coauthors. By construction, this excludes any paper written on the focal topic.
In Fig. 5, we perform the same analysis as in Fig. 4A for the two pools of authors described above. We observe that the confidence intervals of the differences lie to the left of zero, i.e., are negative. For productivity, all values are significant. For impact, there are only two topics (Chemotherapy and Radiation Therapy) that are not significant. Overall, inactive coauthors of prominent authors with more collaborators have a lower probability of switching topics. This is consistent with the intuition that the interactions with each coauthor are less frequent/strong in that case and, consequently, less effective at inducing topic switches.

DISCUSSION
Collaboration allows scholars to deepen existing knowledge and be exposed to new ideas. In this paper, we assessed if and how collaboration patterns affect the probability of switching research topics. We determined that the probability for a scholar to start working on a new topic depends on earlier contacts with people already active in that topic. This effect grows with the number of contacts, with more contacts resulting in higher probabilities. In most topics, this behavior is distinct from a simple baseline assuming independent effects from the contacts, which likely indicates effects of non-dyadic interactions that prompt further investigation.
Similarly, we measured the probability that inactive coauthors of an active author end up publishing on the new topic, which singles out the effect of the association with that author in the activation process. We stress that, by design, previous interactions between inactive and active authors are limited to works dealing with topics different from the focal topic. Therefore, our analysis suggests that an active author may expose an inactive one to a new topic, even when their interactions do not directly concern that topic. This underlines the social character of scientific interactions, where discussions may deviate from the context that mainly motivates them.
We also checked whether the activation probability depends on some specific features of the active authors. We found that the more prolific and impactful authors have higher chances of inducing coauthors to switch topics and become coauthors in their first paper on the new topic. Furthermore, we showed that the larger the number of coauthors of an active author, the lower the chance of a topic switch. This is consistent with a dilution of the influence, resulting from the inability to interact strongly with collaborators when their number is large. To the best of our knowledge, this is the first time this effect has been reported.
A natural explanation of our findings is that topic switches result from a social contagion process, much like the adoption of new products [14,24], or the spreading of political propaganda [16]. However, we cannot discount selection effects in observational studies like ours [25]. Having large numbers of active coauthors on a topic may be associated with strong latent homophily between the authors, which may facilitate the future adoption of the topic even without interventions from the active authors.
Our work uses OpenAlex, a valuable open-access bibliometric database. We rely on their author disambiguation and topic classification algorithms to conduct the analyses. These processes are inherently noisy and can introduce implicit biases. In addition, there appears to be incomplete citation coverage, which might partly explain why the results for impact are less robust than those for productivity. Future releases of OpenAlex might mitigate these problems. To counter these issues, we repeated our analysis on multiple topics from three distinct scientific disciplines. While the size of the effects varies with the topic, our main conclusions hold across all topics, with very few exceptions.
In conclusion, our work offers a platform for further investigations on the mechanisms driving homophily in science. A thorough understanding of these mechanisms requires effective integration of all factors that may play a role. Besides productivity and impact, topic switches may be affected by the institutional affiliations of those involved. On the one hand, it is plausible that people in the same institution have more chances to interact and affect each other's behavior. On the other hand, collaborations with people from renowned institutions are expected to weigh more in the process. Another discriminating factor could be the number of citations to the collaborator's papers. The higher the number of citations, the closer the association between collaborators. We could also include the scientific affinity between coauthors through the similarity of their papers. Modern neural language models [26,27] make it possible to embed papers and, consequently, authors in high-dimensional vector spaces, where the distance between two authors is a good proxy of the similarity of their outputs.

A. Data
We analyze papers from the February 2023 snapshot of the bibliometric dataset OpenAlex, the successor to Microsoft Academic Graph (MAG). We restrict our analysis to papers published between 1990 and 2022 and having at most thirty authors. Papers are tagged with concepts (topics) by a classifier trained on the MAG. We use concept tags to construct snapshots for three fields: Physics, Computer Science (CS), and Biology and Medicine (BioMed). Physics contains 19.7M papers, while CS and BioMed have 27.6M and 43.52M papers, respectively. Within these three domains, we select seven, six, and seven topics, respectively. We publish the code and associated data on GitHub.
Within each topic, we consider reference years between 1995 and 2018, where the respective interaction and activation windows contain at least 3000 papers. This threshold ensures a critical mass of papers and authors to conduct the analyses. Each topic we selected has at least 10 reference years satisfying the constraint. The statistical tests in the manuscript are aggregated over the different reference years. More information is available in SI Tables S1 to S3.

B. Overlap coefficient
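We use the overlap coefficient to measure the degree of overlap between the different sets of authors picked based on productivity and impact. For two sets $X$ and $Y$, the overlap coefficient is the size of their intersection divided by the size of the smaller set:

$$\mathrm{overlap}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}.$$

In our case, the two sets are the same size, so a score of 10% implies that both sets share 10% of the elements.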
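As a minimal sketch, the standard overlap coefficient (size of the intersection divided by the size of the smaller set) can be computed as follows. The function name and the author identifiers are hypothetical and serve only to illustrate the equal-size case mentioned above.

```python
def overlap_coefficient(x, y):
    """Overlap coefficient |X ∩ Y| / min(|X|, |Y|) of two non-empty sets."""
    x, y = set(x), set(y)
    if not x or not y:
        raise ValueError("overlap coefficient is undefined for empty sets")
    return len(x & y) / min(len(x), len(y))


# Two hypothetical author pools of equal size (10) sharing one author.
top_by_productivity = {f"a{i}" for i in range(10)}     # a0 .. a9
top_by_impact = {f"a{i}" for i in range(9, 19)}        # a9 .. a18
score = overlap_coefficient(top_by_productivity, top_by_impact)
```

With equal-size sets of ten authors and a single shared author, the score is 0.1, i.e., the two sets share 10% of their elements.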

C. Author ranking metrics
Let $P$ be the set of papers published on topic $t$ authored by the set of active authors $A$ during the interaction window (IW). Let $a$ be an active author who wrote the set of papers $P_a$ during the IW. We define the following metrics to rank active authors and select the top and bottom 10%.
Productivity: the count of papers $a$ has authored on topic $t$ during the IW. More formally, it is the cardinality of the set $P \cap P_a$.
Impact: the average number of citations that the papers in $P_a$ receive from the papers in $P$. We argue that restricting incoming citations to those from $P$ is a good proxy for the impact that $a$ has made on that topic. The average number of citations is a better indicator of excellence than the total citation count [28]. Also, considering the average instead of the sum lowers its correlation with productivity, here measured by the overlap coefficient of Methods B, as often the most productive authors are also the most cited ones [11]. A low correlation lets us safely disregard the confounding effects of the two metrics and allows us to treat them as fairly independent variables. Correlation statistics are reported in SI Tables S4 to S6.

D. Statistical test for difference of samples
To test whether two independent samples $X_1$ and $X_2$ differ in their means $\mu_1$ and $\mu_2$, we assume the null hypothesis $H_0: \mu_1 = \mu_2$. We compute the mean and 95% confidence interval of $\mu_1 - \mu_2$ using bootstrapping and reject the null hypothesis $H_0$ at p < 0.05 if the confidence interval does not contain 0 [29]. In other words, $X_1$ and $X_2$ are considered statistically different at p < 0.05 if the 95% confidence interval of the difference of their respective means does not contain 0. Furthermore, a positive mean of the difference indicates that $\mu_1 > \mu_2$, while a negative mean indicates $\mu_1 < \mu_2$.
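One way to carry out such a test is sketched below using only the Python standard library. The text does not specify the bootstrap variant, so the percentile bootstrap used here, the number of resamples, the fixed seed, and all function names are assumptions made for illustration.

```python
import random
from statistics import fmean


def bootstrap_mean_diff(x1, x2, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap estimate of mean(x1) - mean(x2).

    Returns (mean of the bootstrap differences, (lower, upper)), where the
    interval is the central (1 - alpha) range of the resampled differences.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        r1 = rng.choices(x1, k=len(x1))  # resample with replacement
        r2 = rng.choices(x2, k=len(x2))
        diffs.append(fmean(r1) - fmean(r2))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return fmean(diffs), (lower, upper)


def significantly_different(ci):
    """Reject H0 (equal means) when the interval does not contain 0."""
    lower, upper = ci
    return not (lower <= 0.0 <= upper)


# Hypothetical per-reference-year values for two groups of authors.
top = [0.30, 0.32, 0.35, 0.31, 0.33, 0.36]
bottom = [0.10, 0.12, 0.11, 0.14, 0.13, 0.09]
mean_diff, ci = bootstrap_mean_diff(top, bottom)
```

A positive `mean_diff` with an interval entirely above zero would be reported as a significant positive difference (a red cell in the heatmaps); an interval containing zero would be reported as non-significant.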

E. Target activation probability
Let $n(k)$ be the number of inactive authors with exactly $k$ contacts during the interaction window (IW), of whom $m(k)$ become active in the activation window (AW). The target activation probability $P(k)$ is the probability of becoming active after having exactly $k$ contacts, defined as

$$P(k) = \frac{m(k)}{n(k)}. \tag{1}$$

The cumulative target activation probability $C(k)$ with $k$ or more contacts is given by

$$C(k) = \frac{\sum_{k' \geq k} m(k')}{\sum_{k' \geq k} n(k')}. \tag{2}$$

F. Simple baseline for membership closure
Let $p$ represent the probability of activation from a single contact. The probability of activation having $k$ contacts, acting independently of each other, is $P_{\text{base}}(k) = 1 - (1 - p)^k$. We compute $p$ from the observed data using Eq. (1) as $p = P(1) = m(1)/n(1)$. This is the fraction of inactive authors with exactly one contact who became active, since $P_{\text{base}}(1) = 1 - (1 - p)^1 = p$. Like before, we calculate the cumulative target activation probability for the baseline $C_{\text{base}}(k)$ with $k$ or more contacts as

$$C_{\text{base}}(k) = \frac{\sum_{k' \geq k} n(k')\, P_{\text{base}}(k')}{\sum_{k' \geq k} n(k')}. \tag{3}$$

The denominator is the same as in Eq. (2) and comes from the observed data. The numerator represents the expected number of active authors if the contacts affect the activation independently.
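The observed curves and the independent-contact baseline can be computed together, as in the following Python sketch. It assumes the cumulative quantities are ratios of tail sums over all k' ≥ k, as the verbal definitions suggest; the function name and the counts n(k) and m(k) are made up for illustration and are not taken from the data.

```python
def target_activation(n, m):
    """Observed and baseline activation curves from per-k counts.

    n[k]: number of inactive authors with exactly k contacts (all n[k] > 0);
    m[k]: how many of them became active (a missing k is treated as 0).
    Returns dictionaries P(k), C(k), and C_base(k).
    """
    ks = sorted(n)
    P = {k: m.get(k, 0) / n[k] for k in ks}
    p = P[1]  # single-contact activation probability, p = m(1) / n(1)

    C, C_base = {}, {}
    for k in ks:
        tail = [j for j in ks if j >= k]
        denom = sum(n[j] for j in tail)
        C[k] = sum(m.get(j, 0) for j in tail) / denom
        # Expected activations if each contact acts independently.
        C_base[k] = sum(n[j] * (1 - (1 - p) ** j) for j in tail) / denom
    return P, C, C_base


# Hypothetical counts for k = 1, 2, 3 contacts.
n = {1: 100, 2: 50, 3: 10}
m = {1: 10, 2: 12, 3: 4}
P, C, C_base = target_activation(n, m)
```

For these made-up counts the observed cumulative probability at k = 1 is 26/160 = 0.1625, above the baseline value of about 0.139, which is the kind of positive deviation described in the Results as a compounding effect.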

G. Source activation probability
Let $n_a$ be the number of exclusive inactive coauthors of an active author $a$ in the IW. Let $m_a$ be the number of those exclusive inactive coauthors who become active in the AW. The source activation probability of scholar $a$ is thus

$$P_s^a = \frac{m_a}{n_a}. \tag{4}$$

We stress that, for the probability to be well-defined, $n_a$ must be greater than zero. Therefore, in our calculations, we focused on active authors with at least one exclusive inactive coauthor.
For any $0 \leq f \leq 1$, we compute the fraction $C_s(f)$ of all active authors whose source activation probability is greater than or equal to $f$. $C_s(f)$ is the complementary cumulative probability distribution of the source activation probability $P_s^a$. As expected, $C_s(f)$ quickly decreases to 0 with increasing $f$. Because the curves corresponding to two sets of active authors are effectively indistinguishable at the tail, we compare a pair of points at some threshold $f^*$. We call $C_s(f^*)$ the cumulative source activation.
The choice of the threshold $f^*$ is important. Setting it to 0 or 1 would return the same probability for both sets of authors. It should also not be too small for numerical reasons. For example, if there are only five inactive coauthors, the smallest non-zero fraction cannot be smaller than 1/5 = 0.20. Choosing too high a value instead would lead to weaker statistics. So, we fix the value at 0.10 for the results in the main text (Figs. 4 and 5) and report the results for 0.20 in SI Figs. S3 and S4.
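The source activation probability and the cumulative source activation can be sketched as follows. The data structures and function names are hypothetical; the entry for a0 mirrors the example of Fig. 1D (exclusive inactive coauthors a2 and a3, of whom only a2 becomes active), while the other two active authors are invented for illustration.

```python
def source_activation(exclusive_inactive, activated):
    """Source activation probability m_a / n_a for each active author.

    exclusive_inactive: dict mapping an active author to the set of their
                        exclusive inactive coauthors in the IW;
    activated:          set of authors who become active in the AW.
    Authors with no exclusive inactive coauthor (n_a = 0) are skipped,
    because the probability is undefined for them.
    """
    probs = {}
    for a, coauthors in exclusive_inactive.items():
        if coauthors:
            probs[a] = len(coauthors & activated) / len(coauthors)
    return probs


def cumulative_source_activation(probs, f):
    """C_s(f): fraction of active authors whose probability is >= f."""
    return sum(p >= f for p in probs.values()) / len(probs)


exclusive_inactive = {
    "a0": {"a2", "a3"},  # as in Fig. 1D
    "a4": {"a7"},        # hypothetical
    "a5": set(),         # hypothetical, no exclusive inactive coauthors
}
activated = {"a2", "a6"}
probs = source_activation(exclusive_inactive, activated)
c_s = cumulative_source_activation(probs, f=0.10)
```

In this toy setting a0 has a source activation probability of 0.5 and the hypothetical a4 has 0, so the cumulative source activation at the threshold 0.10 is 0.5.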

H. Chaperoning propensity
Let $m_a$ be the number of exclusive inactive coauthors of an active author $a$ who become active in the AW, which is the same as the numerator of Eq. (4). Let $i_a$ be the number of those authors who write their first paper on topic $t$ with $a$ in the AW. The chaperoning probability of $a$ is defined as

$$P_c^a = \frac{i_a}{m_a}. \tag{5}$$

We define the chaperoning propensity $P_c(f)$ corresponding to a specific threshold $f \in [0, 1]$ as the fraction of all active authors with $P_c^a \geq f$. We use the aforementioned values of 0.10 (Figs. 4 and 5) and 0.20 (SI Figs. S3 and S4) for the threshold $f$.

FIG. S2. Experiment I. Same as Fig. 3, but here contacts is the number of papers written with active coauthors in the IW. Heatmaps showing the mean difference between the cumulative target activation probabilities of the inactive authors in the AW who had exclusive contacts with the top 10% and bottom 10% of active authors, respectively, selected according to productivity (left) and impact (right) in the IW. The cells are gray if the 95% confidence interval contains 0. The majority of red cells indicate that the cumulative target activation probabilities for contacts with the top 10% are higher than those with the bottom 10%.

FIG. 1. Schematic setup for our analysis. (A) Stream of papers across interaction (IW) and activation (AW) windows. Papers tagged with the focal topic $t$ are marked in red. (B) Author collaboration graph at the end of IW. Authors $a_i$ and $a_j$ are linked by an edge of weight $k$ if $a_i$ coauthored $k$ papers with $a_j$ within the IW. The authors active in the focal topic by the end of IW are marked in red. (C) Focus: inactive authors. Inactive author $a_6$ has four active contacts from three sources $\{a_0, a_1, a_5\}$ derived from the collaboration graph in (B). (D) Focus: active authors. Active author $a_0$ has four coauthors $\{a_1, a_2, a_3, a_6\}$, of whom $a_1$ is already active, and $a_6$ also collaborated with $a_1$ in the IW. This leaves the subset of exclusive inactive coauthors $\{a_2, a_3\}$. Within this subset, only $a_2$ becomes active in the AW, resulting in $a_0$'s source activation probability of 1/2 = 0.50. Additionally, $a_2$ writes their first paper with $a_0$ in the AW.

FIG. 4. Experiment II results for $f^* = 0.10$. (A) The mean and 95% confidence interval of the means of the difference between the cumulative source activations of active authors in the top 10% and bottom 10% based on productivity (green) and impact (pink). (B) The mean and 95% confidence interval of the means of the difference between the chaperoning propensities of active authors in the top 10% and bottom 10% based on productivity (green) and impact (pink). A positive difference indicates that the effect is stronger for the top 10% active authors.

FIG. 5. Dilution effect results for $f^* = 0.10$. The mean and 95% confidence interval of the mean of the difference between the cumulative source activations of active authors in the top 20% and bottom 20% bins, based on the average number of coauthors, among the top 10% active authors in productivity (green) and impact (pink). A negative difference across the topics indicates a dilution effect, wherein coauthors of prominent active scholars with fewer collaborators are more likely to switch topics.

TABLE S1. Summary information for Physics topics. #Papers: average number of papers. #Authors: average number of active authors. Averages are computed over all time windows selected for a topic.

TABLE S2. Summary information for Computer Science topics. #Papers: average number of papers. #Authors: average number of active authors. Averages are computed over all time windows selected for a topic.

TABLE S3. Summary information for Biology & Medicine topics. #Papers: average number of papers. #Authors: average number of active authors. Averages are computed over all time windows selected for a topic.

TABLE S4. Physics average Overlap Coefficient between the top and the bottom 10% of active authors selected based on Productivity and two different definitions of Impact. The first definition uses Cavg and is used in the main text. The second definition uses Ctot. The degree of overlap is significantly greater for Ctot.

FIG. S2. Experiment I. Same as Fig.