Understanding quantitatively how scientists choose and shift their research focus over time is of high importance, because it affects the ways in which scientists are trained, science is funded, knowledge is organized and discovered, and excellence is recognized and rewarded1,
‘The essential tension’ hypothesis as described in ref. 5 has highlighted the conflicting demands of scientific careers that require both exploration and exploitation4,8,22. Indeed, career advancement, from promotion to obtaining grants, demands a steady stream of publications, which is often achieved through uninterrupted, yet incremental contributions to the existing, established research agenda. By contrast, frequent changes in research topics invite risk of failure and decreased productivity. The disciplinary boundaries, arising from factors such as implicit culture, tacit and accumulated knowledge23,24 and peer recognition3,25, together with intensifying specialization in science and engineering disciplines26, make radical shifts, such as moving from chemical biology to high energy physics, unlikely, if at all possible. On the other hand, although a steady and focused research portfolio helps scientists stay productive, it potentially undermines chances for originality8. Indeed, innovative and novel insights often emerge from encountering new challenges and opportunities associated with venturing into new topics and/or incorporating them into the existing research agenda4,15,20,27,28.
Given the broad effect on individual careers and strong implications for science and innovation policy, there is an urgent need for quantitative approaches to understand the nature of changes in research interests of individual scientists throughout their careers. This need becomes more urgent with the accelerating scale and complexity of scientific enterprise2,26,29,30. A variety of microscopic factors have been identified that drive a scientist’s choice of research problems, which range from age10,11 to gender12,13, to training and mentorship9,14, from funding or collaboration opportunities15,
Here we aim to systematically address this question by first identifying patterns in the working scientists’ research agendas as their careers progress. Using articles published by American Physical Society (APS) journals covering over 30 years (1976–2009)41,42 and through a careful and extensive author-name disambiguation process6,43, we collect publication records of individual authors over time (Supplementary Note 1). We further take advantage of the Physics and Astronomy Classification Scheme (PACS) codes used by the APS to classify topics in physics. Indeed, among all identifiers for research topics, PACS codes stand out in the frequency of their use30,44,
We compose two topic vectors based on the first and last m papers of the scientist (gi and gf, respectively), thereby capturing the research interest at the earliest and the latest stages of the career (Fig. 1). Using the complementary cosine similarity between gi and gf, we quantify the interest change J of a scientist throughout the career as:

$$J = 1 - \frac{g_i \cdot g_f}{\lVert g_i \rVert \, \lVert g_f \rVert} \qquad (1)$$
Equation (1) captures the research-interest change that resulted from a change in topics or from a change in engagement with the topics, thereby providing an effective quantification of the extent of change. J = 0 indicates that the two topic vectors gi and gf are identical, capturing the fact that the author not only studied the same set of topics at the two stages of the career, but also was involved in each of these topics with the same weight. J = 1 corresponds to a complete change in interests, in which a researcher does not engage in any initial topic of interest. We choose m = 8. As a result, our analyses are based on 14,715 scientists, each of whom was an author on at least 2m = 16 papers included in our dataset. We also report analyses based on other m values (m = 6 and m = 10, Supplementary Note 2) and find that our results are insensitive to the choice of m. To take into account other factors that can play a role in quantifying research-interest change, we further perform three additional measurements. First, to avoid the gaps between the two sets of papers, we take 2m consecutive papers starting at a randomly chosen paper and measure the interest change based on the two adjacent m paper sequences (Supplementary Note 3). Second, to eliminate the effects of different publication rates, we measure the interest change J within scientists who publish at similar rates (Supplementary Note 4). Finally, as research interest is associated with time, we measure interest change based on two sets of papers published over the same time period at the early and late stage of a career (Supplementary Note 5). We obtain similar observations in all three measurements, demonstrating that the discovered patterns in research-interest evolution do not rely on a specific metric.
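As an illustration of the measure described above, the following minimal sketch computes J as one minus the cosine similarity of two topic vectors. The function name `interest_change` and the use of plain Python lists are assumptions made for this example; this is not the authors' code.

```python
import math


def interest_change(g_i, g_f):
    """Complementary cosine similarity between two topic vectors.

    g_i and g_f are equal-length sequences of non-negative weights
    (hypothetical representation chosen for this sketch).
    """
    dot = sum(a * b for a, b in zip(g_i, g_f))
    norm_i = math.sqrt(sum(a * a for a in g_i))
    norm_f = math.sqrt(sum(b * b for b in g_f))
    if norm_i == 0 or norm_f == 0:
        raise ValueError("topic vectors must be non-zero")
    return 1.0 - dot / (norm_i * norm_f)


# Identical vectors give J = 0; vectors with no shared topic give J = 1.
same = interest_change([1, 0], [1, 0])
disjoint = interest_change([1, 0], [0, 1])
```

Because topic weights are non-negative, the cosine similarity lies in [0, 1], so J also stays in [0, 1].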
The quantification of research-interest change allows us to measure its distribution within the population. We find that the fraction of scientists P decays with the extent of interest change J (Fig. 2a), which can be well fitted with an exponential function (see Supplementary Note 6 for fitting statistics). This exponential decay indicates that most scientists are characterized by little change in their research interests and the probability of making a leap decreases exponentially with its range. At the same time, the fact that the histogram P(J) is positive in the full range of domain [0, 1] indicates that large changes in a career, such as switching to completely different areas, do occur, albeit very rarely. These observations raise an interesting question: What forms of interest-change distribution should we expect? To answer this question, we identify three features, that is, heterogeneity, recency and subject proximity, that characterize research-interest evolution. To illustrate how these features shape the distribution of interest change, we perform three ‘experiments’ in which the interest-change distribution is re-measured using modified publication sequences of individual scientists. These results further demonstrate the non-trivial nature of the exponential distribution observed.
The frequencies of topic tuple occurrences in the publication sequence of a scientist follow a power-law distribution (Fig. 2b), demonstrating the heterogeneity in individual’s engagement in different subjects. Indeed, a scientist’s research agenda contains core research subjects that are repeatedly investigated coupled with other more peripheral ones that may only be touched upon occasionally. To examine the effect of heterogeneity on the change in research interests, we modify each publication sequence by retaining only the first occurrence of each topic tuple and removing the subsequent ones (see Supplementary Fig. 1a for the illustration of this procedure and Supplementary Note 7 for more details). The interest-change distribution measured over the modified sequences reaches a peak at an intermediate value followed by a gradual decay (Fig. 2c), which is in contrast to the monotonic, exponential decrease that was observed in the real data.
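The sequence modification used in this experiment can be sketched as follows. The sketch assumes that a topic tuple is identified by its multiset of codes regardless of order, which the text does not state explicitly; the function name `keep_first_occurrences` is hypothetical.

```python
def keep_first_occurrences(sequence):
    """Retain only the first paper that uses each distinct topic tuple.

    `sequence` is a list of topic tuples in publication order. Two tuples
    are treated as the same if they contain the same codes with the same
    multiplicities (an assumption of this sketch).
    """
    seen = set()
    result = []
    for topic_tuple in sequence:
        key = tuple(sorted(topic_tuple))  # order-insensitive identity
        if key not in seen:
            seen.add(key)
            result.append(topic_tuple)
    return result


papers = [("68", "89"), ("02",), ("89", "68"), ("02",), ("05", "68")]
modified = keep_first_occurrences(papers)
```

Measuring J on `modified` instead of `papers` removes the repeated engagement with core subjects, which is the heterogeneity the experiment is designed to switch off.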
By denoting the number of papers between two consecutive uses of a topic tuple as Δn (see Methods), we compare the distribution P(Δn) measured in the real publication sequences with the corresponding distribution Po(Δn) measured in reshuffled publication sequences. We find that the ratio P(Δn)/Po(Δn) decreases with Δn, going from above to below 1 (Fig. 2d). This implies that scientists are more likely to publish on subjects recently studied in real sequences than in reshuffled sequences. To further explore this feature, we measure the relationship between the probability that a distinct topic tuple is re-studied (Π) and the order of the first occurrence of this topic tuple in a scientist’s publication sequence (see Methods). As shown in Fig. 2e, the relationship obtained demonstrates that scientists are more likely to publish on subjects studied recently than on those investigated a long time ago. This recency feature suggests that scientists tend to avoid going back to the original research subjects once they have moved on to other ones, which consequently drives them to explore new research subjects. This is affirmed in our analysis of the interest-change distribution measured over the reshuffled publication sequences (see Supplementary Fig. 1b for the illustration of this procedure). The obtained distribution remains exponential, but with a much steeper decrease than the original distribution, which makes large interest changes significantly less frequent than what is actually observed (Fig. 2f).
Knowledge is characterized by underlying topical geometry that imposes varying inherent distances between pairs of research subjects26,49,50. When a scientist changes the research subject, he or she is more likely to choose a subject that is related to the current one than to move to a totally new field, implying the subject proximity of research-interest change. To verify this insight, we modify each publication sequence by replacing distinct topic tuples in publications of a scientist with topic tuples that were randomly drawn (see Supplementary Fig. 1c for the illustration of this procedure). The resulting interest-change distribution shows that most scientists have a large extent of change (Fig. 2g), demonstrating the effect of subject proximity on research-interest evolution.
These three features reveal insights that are offered by the observed distribution of research-interest change. Both heterogeneity and subject proximity arise from exploitation of the current field, which helps stabilize the research interest. Lacking either of the two, P would have been characterized by a distribution with a larger mean. Yet, the recency feature resulting from the exploration of new areas destabilizes research interest. If it were not for the exploration, the interest change would have been much more limited given the heterogeneity and subject-proximity features. Together they provide us with an empirical basis to build a statistical model for individual careers with varying research subjects over time.
Here we consider scientific research as a random walk, following Isaac Newton’s retrospection that during his scientific career he was like “... a boy playing on the seashore... finding... a prettier shell than ordinary”51. Such a ‘seashore’ is represented by a one-dimensional lattice in our model and a ‘shell’ corresponds to a scientific finding that yields a paper (Fig. 3a). Each site of the lattice holds shells with a certain probability p, and each such location contains q shells of one type. q follows a distribution $P(q) \sim q^{-\alpha}$, which is motivated by the heterogeneity in the potential of research topics to yield papers: although some are fruitful, many research topics are not. Therefore, the seashore lattice contains sites with piles of shells, separated by a random number of empty sites in between. A scientist starts at a random initial position and performs an unbiased random walk on its own lattice (moving one step left or right with equal probability). Upon reaching a site that contains shells, the walker picks one shell, corresponding to publishing one paper. The walker stops when it reaches the end of the career, defined by the total number of steps S, following a truncated log-normal distribution P(S), the choice of which is motivated by the distribution of the observed career lifespans of real scientists52,53 (Supplementary Fig. 2).
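The walk described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation of Supplementary Note 8: the parameter values (`p`, `alpha`, `q_max`), the upper cutoff on q, the lazy generation of lattice sites, and the choice to take the number of steps as an input rather than sampling it from a truncated log-normal distribution are all assumptions of this example.

```python
import random


def sample_power_law(alpha, q_max, rng):
    """Draw an integer q in [1, q_max] with P(q) proportional to q**(-alpha)."""
    values = list(range(1, q_max + 1))
    weights = [q ** (-alpha) for q in values]
    return rng.choices(values, weights=weights, k=1)[0]


def seashore_walk(steps, p=0.3, alpha=2.5, q_max=50, seed=None):
    """Unbiased 1D random walk that picks one shell per visit to a stocked site.

    Returns the list of site indices at which a shell (paper) was picked,
    in the order they were picked.
    """
    rng = random.Random(seed)
    shells = {}      # site index -> shells remaining (filled in lazily)
    position = 0     # the starting point is arbitrary on an infinite lattice
    picked = []
    for _ in range(steps):
        position += rng.choice((-1, 1))
        if position not in shells:
            # A newly reached site holds shells with probability p.
            stocked = rng.random() < p
            shells[position] = sample_power_law(alpha, q_max, rng) if stocked else 0
        if shells[position] > 0:
            shells[position] -= 1
            picked.append(position)
    return picked


walk = seashore_walk(steps=500, seed=42)
```

Each entry of `walk` identifies the shell type (site) behind one paper, so repeated entries correspond to repeated publications on the same research subject.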
Despite its simplicity, the seashore walk predicts the presence of both heterogeneity and recency features. If we assume that one type of shells corresponds to one research subject, the power-law distribution P(q) provides varying limits to the number of papers that can be published on different research subjects. The unbiased random walker is likely to return to a site repeatedly, which consequently enables it to collect all the shells from the site. Such repeated visits and P(q) give rise to the heterogeneity and recency features observed empirically (Supplementary Fig. 3a, b).
To further capture the evolution of research interest, we need to assign topics to each type of shell. Recent advances in knowledge expansion4,8,29 have provided two specific features governing the evolution of a scientist’s research agenda: first, existing topics are connected to form a research subject; and second, new topics are occasionally added29. By absorbing these two features into the seashore walk, we define how the research subject of a type of shell is associated with its location on the lattice and how topics evolve on the underlying lattice, leading to the introduction of Model I (Fig. 3a).
For Model I we assume that there is a topic pool containing 3 topics for each site on the lattice. The shells on the site, if there are any, are characterized by a random combination of the 3 topics (with repetitions), representing an artificial topic tuple for the research subject of the site. The value 3 is based on the observation that each paper is characterized on average by 2.89 PACS codes42 (Supplementary Fig. 4a, b). Furthermore, we assume that one topic pool covers L sites on the lattice and two neighboring topic pools differ by one topic. Therefore, new topics are encountered as the walker moves away from the starting point, which is in line with the empirical observation that new topics emerge when the number of distinct topic tuples used by a scientist increases (Supplementary Fig. 4c).
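One way to realize the topic assignment of Model I is sketched below. The text fixes only that each pool has 3 topics, spans L sites, and differs from its neighbour by one topic; the sliding-window construction (pool k holds topics k, k+1, k+2), the integer topic labels, and the function names are assumptions of this example.

```python
import random


def topic_pool(site, pool_span):
    """Topic pool for a lattice site; neighbouring pools share two of three topics."""
    k = site // pool_span          # index of the pool covering this site
    return (k, k + 1, k + 2)       # sliding window: an assumption of this sketch


def assign_topic_tuples(sites, pool_span, seed=None):
    """Give every site one fixed topic tuple drawn, with repetition, from its pool."""
    rng = random.Random(seed)
    tuples = {}
    for site in sites:
        if site not in tuples:
            pool = topic_pool(site, pool_span)
            tuples[site] = tuple(sorted(rng.choices(pool, k=3)))
    return tuples


site_tuples = assign_topic_tuples(range(-10, 11), pool_span=5, seed=0)
```

With this construction a walker that drifts further from its starting point meets pools containing topics it has not seen before, which mirrors the gradual appearance of new topics described above.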
Model I generates a sequence of shells for each walker traversing its own seashore, and each shell is characterized by an artificial topic tuple. By measuring interest change for an ensemble of walkers (see Supplementary Note 8 for the implementation of the simulation), we obtain a P(J) similar to that of real data (Fig. 4a and see Supplementary Note 9 for statistical analyses). Despite its simplicity, Model I captures the process of research-interest evolution. It also leads to an interesting question: to what extent could the framework of seashore walk be improved? Note that the generation of topic tuples in Model I is based on simple processes, which can limit its capability to accurately reproduce the interest-change distribution. If this is the case, we would expect a better result if more complexities are added to ensure that a walker’s topic tuples as well as the correlations among these topic tuples are statistically similar to the data. To test this hypothesis, we constructed Model II (Fig. 3b).
For Model II we generate a sequence of shells picked by a walker and identify the number of distinct shells in the sequence. For a walker who has picked x distinct shells, we randomly select a scientist from the data who used x different topic tuples in the publication sequence and randomly map them to the x distinct shells (see Supplementary Note 10 for the implementation of simulation).
We apply Model II to the same shell sequences generated in Model I and find that the resulting research-interest change distribution matches the empirical observations more closely (Fig. 4b and Supplementary Note 9 for statistical analyses). The improved result confirms the validity and potential of the seashore walk to capture the research-interest evolution. It is noteworthy that the random-mapping procedure in Model II is used only to avoid introducing any sophisticated means to generate topic tuples. It is inherently different from duplicating the actual sequence (Supplementary Discussion 1). Moreover, although the seashore walk rests on several assumptions, such as the assumption that the number of shells at a site follows a power-law distribution, the systematic reproduction of empirical observations is obtained from the interplay of multiple mechanisms. Removing any of these assumptions would invalidate the model (Supplementary Discussion 2).
Finally, the seashore walk makes two additional predictions regarding an individual’s career. First, the individual’s publication process is bursty, as the inter-publication time follows a power-law distribution35,54 (Supplementary Fig. 5a). In the model, the number of steps to a random walker’s first passage asymptotically follows a power-law distribution, $\sim S^{-3/2}$, with an exponential cutoff (ref. 55), giving rise to the ‘burstiness’ of publication time (Fig. 4c). Second, the number of papers authored by a scientist follows a power-law distribution with an exponential cutoff56,57 (Supplementary Fig. 5b). We obtain the same form of distribution from the model (Fig. 4c). This is owing to a combination of factors, including the uniform probability of encountering sites with shells, the property that the mean number of sites visited by a random walker scales as $\sim S^{1/2}$, and the existence of a fat tail in the log-normal distribution P(S) (Supplementary Discussion 3).
The success of our simple model in capturing patterns observed in individual careers raises another question: can other related approaches be adopted and applied to the modelling process investigated here? To this end, we identified two classes of models that might be suitable on the basis of existing works in science of science and in network science. The first class pertains to models for the mobility patterns of an individual32,37 by treating topic tuples as locations. In these models, a scientist’s sequence of research subjects becomes the sequence of locations visited in an individual mobility trajectory (see Methods). Although this approach could reproduce the heterogeneity feature based on preferential attachment38, it could not capture the recency feature. Indeed, under the preferential attachment mechanism, a positive feedback would arise by which the more frequently a research subject was studied, the more likely it would be studied again. As a result, an old subject would receive more attention than a recent subject. Therefore, the probability of reusing a topic tuple (Π) would decrease with the rank of first usage of this topic tuple32. This, however, directly contradicts the recency feature that was observed in the data (Fig. 2e), demonstrating the inherent inability of models that are based on preferential attachment to capture research-interest change. The second class of models treats the individual interest change as a Markov process on the knowledge network4,8,49. This approach provides a comprehensive picture of the geometry of the knowledge network that gives rise to subject proximity. However, the heterogeneity feature leading to the power-law distributed topic tuple usage cannot be generated by a Markov process with fixed transition probabilities.
Moreover, the knowledge network characterized by the movements of an individual between research subjects is not static but dynamic39, which cannot be accounted for without introducing a much more complicated model. Taken together, both approaches exhibit clear limitations in reproducing important characteristics of interest evolution that were studied in this paper. Our model, on the other hand, overcomes these limitations and preserves the patterns observed in the research-interest evolution.
In summary, by taking advantage of the PACS codes that classify general areas of physics into multiple clearly defined sub-areas, we quantitatively measure the extent of interest change for over 14,000 scientists, and show an exponential distribution of interest change within the population. We identify three key features in interest evolution that are essential for the presence of the observed distribution. We further develop a simple statistical model that describes scientific research as a random walk and that successfully captures empirical observations. Together, our results fill a critical gap in our quantitative understanding of science at a large scale by identifying a set of macroscopic patterns, which govern research-interest change throughout individual careers. Despite the well-known fact that scientists’ choices of research subjects are driven by a myriad of factors, our results indicate that research-interest evolution can be captured well by a simple statistical model, uncovering a new degree of regularity underlying individual careers.
The methodology introduced here has some limitations, which also point to potential for future work. When composing a topic vector, we assume papers on which an author’s name appears are equally representative of his or her research interests. This assumption is justified by the difference between the interest in the problem addressed by a paper and the contribution to the paper or recognition of each author3,58: every author has to be interested in the problem to engage in co-authoring the paper. In future work, it would be of interest to systematically quantify differences in each co-author’s interest in a single paper on which they collaborate. The macroscopic patterns emphasized in this study are not significantly affected by potential errors in name disambiguation given the large number of scientists analysed (Supplementary Note 1). Yet, the accuracy of author name disambiguation needs to be constantly challenged and scrutinized whenever publication data is used. The systematic nature of classification codes and their rich, hierarchical structures make them good approximations of topics in research ranging from scientific discoveries30,44,
Promising future directions include extending the simple model proposed here to a multidimensional random walk in the knowledge space, which may lead to a model capturing a richer set of phenomena that characterize individual careers. Other directions include extending this work to other scientific domains to address the universality and robustness of our results, investigating how the observed patterns depend on contextual information such as nations41,60, institutions9,43, scientific disciplines17, the size of the research community16,23,61, the status of a scientist3,52,53 and publication habits. It would also be important to understand the short-term benefit and long-term scientific impact6,7,62,
Calculating topic vectors
The value of each element in the topic vector represents a topic’s normalized frequency of occurrence in the set of papers analysed. Given a topic tuple, we can use a vector $X = (a_1, a_2, \ldots, a_{67})$ to express the occurrence of each of the 67 topics in the topic tuple, whereby $a_i = 0$ indicates that the $i$th topic is not included in the topic tuple, $a_i = 1$ that the $i$th topic appears once, and so on. $N_X = \sum_{i=1}^{67} a_i$ is the size of the topic tuple. By normalizing $X$, we obtain a vector $Y = (b_1, b_2, \ldots, b_{67})$ in which $b_i = a_i/N_X$ is the normalized frequency of occurrence of the $i$th topic. The topic vector $g$ is calculated by averaging $m$ different $Y$ vectors drawn from $m$ papers, with $g = \frac{1}{m}\sum_{j=1}^{m} Y_j$. Take the calculation of $g_i$ in Fig. 1 as an example. The two topic tuples for $g_i$ are (68, 89, 89) and (02, 05, 68). The element value of topic 68 is calculated as $\frac{1}{2}\left(\frac{1}{3} + \frac{1}{3}\right) = \frac{1}{3}$, as it appears once in each of the topic tuples. The element value of topic 89 is calculated as $\frac{1}{2}\left(\frac{2}{3} + 0\right) = \frac{1}{3}$, as it appears twice in one topic tuple and is not included in the other. The element values of topics 02 and 05 are each calculated as $\frac{1}{2}\left(0 + \frac{1}{3}\right) = \frac{1}{6}$.
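The worked example from Fig. 1 can be reproduced with a short sketch. It stores the topic vector sparsely as a dictionary keyed by topic code instead of a dense 67-element vector, and uses exact fractions to avoid rounding; both choices, and the name `topic_vector`, are assumptions of this illustration.

```python
from collections import Counter
from fractions import Fraction


def topic_vector(topic_tuples):
    """Average the normalized topic frequencies of the given papers.

    Each paper contributes count/size for every topic in its tuple,
    and the contributions are averaged over the m papers.
    """
    m = len(topic_tuples)
    g = Counter()
    for topic_tuple in topic_tuples:
        size = len(topic_tuple)                     # N_X for this paper
        for topic, count in Counter(topic_tuple).items():
            g[topic] += Fraction(count, size) / m   # b_i averaged over m papers
    return dict(g)


g_i = topic_vector([("68", "89", "89"), ("02", "05", "68")])
# g_i == {"68": 1/3, "89": 1/3, "02": 1/6, "05": 1/6}
```

The element values sum to 1 because every paper's normalized vector sums to 1 before averaging.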
Δn is measured as the separation between the two subsequent appearances of the same topic tuple in a scientist’s publication sequence. For example, representing distinct topic tuples as different capital letters and assuming a publication sequence is ‘A A A B C C B D E’, we obtain the following series of measurements: ΔnA = 1 (distance between first and second appearance of A), ΔnA = 1 (distance between the second and third appearance of A), ΔnB = 3 and ΔnC = 1. These values are then used to calculate the distribution P(Δn).
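The measurement of Δn on the example sequence can be sketched as follows; the function name `delta_n` and the representation of topic tuples as single letters are assumptions of this illustration.

```python
def delta_n(sequence):
    """Gaps, in number of papers, between consecutive uses of the same topic tuple."""
    last_seen = {}   # topic tuple -> index of its most recent appearance
    gaps = []
    for index, topic_tuple in enumerate(sequence):
        if topic_tuple in last_seen:
            gaps.append(index - last_seen[topic_tuple])
        last_seen[topic_tuple] = index
    return gaps


gaps = delta_n("A A A B C C B D E".split())
# gaps == [1, 1, 1, 3]: two gaps for A, one for C, and one of length 3 for B
```

Pooling these gaps over all scientists gives P(Δn); applying the same function to shuffled copies of each sequence gives the reference distribution used for comparison.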
Each publication sequence is characterized by two parameters. One is the number of papers n (that is, the length of the sequence) and the other is the number of distinct topic tuples in the sequence (x). As both parameters vary among individuals, we first fix a set of distinct topic tuples to measure their re-usage frequency Π. Here we focus on the first 5 distinct topic tuples in the sequence (that is, the maximum rank is 5). Therefore, only those sequences with x ≥ 5 are considered. We also analyse other cases by filtering x ≥ 4 and x ≥ 6 and similar patterns are observed. For each qualified sequence (x ≥ 5), we go through it from the beginning until the fifth distinct topic tuple is first used. We then start to count the instances in which one of the five topic tuples is reused. For each individual, we obtain the fraction of times each of the five topic tuples is reused. This fraction is then averaged over all qualified sequences to generate Π.
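A possible reading of this procedure for a single sequence is sketched below. The normalization is an assumption: the fractions here are taken relative to the total number of reuses of the five tuples after the fifth one first appears, and sequences with no such reuse are skipped. The function name `reuse_fractions` is hypothetical.

```python
def reuse_fractions(sequence, max_rank=5):
    """Fraction of later reuses attributed to each of the first `max_rank` distinct tuples.

    Returns None if the sequence has fewer than `max_rank` distinct topic
    tuples, or if none of them is reused after the last one first appears.
    """
    ranked = []      # distinct topic tuples in order of first use
    start = None
    for index, topic_tuple in enumerate(sequence):
        if topic_tuple not in ranked:
            ranked.append(topic_tuple)
            if len(ranked) == max_rank:
                start = index + 1   # counting begins after this paper
                break
    if start is None:
        return None
    counts = [0] * max_rank
    for topic_tuple in sequence[start:]:
        if topic_tuple in ranked:
            counts[ranked.index(topic_tuple)] += 1
    total = sum(counts)
    if total == 0:
        return None
    return [count / total for count in counts]


fractions = reuse_fractions(list("ABCDEEEAX"))
```

Averaging the returned lists element by element over all qualified sequences would then give Π as a function of rank under this reading.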
Generating the sequence using a preferential-attachment-based model
We apply a preferential-attachment-based model to generate topic tuple sequences with power-law distributed usage of each topic tuple32,37,38. In the model, an individual’s activity is randomly chosen from two actions. One is to explore a new subject and publish a paper with a new topic tuple. The other is to return to a previously studied subject and publish a paper with a topic tuple that has already been used. The probability to explore is defined as $\rho n^{-\gamma}$, in which the term $n^{-\gamma}$ captures the decreasing trend to explore a new subject as the number of papers increases. Consequently the probability of return, that is, to reuse an old topic tuple, is $1 - \rho n^{-\gamma}$. If one returns, the choice of existing topic tuples is governed by preferential attachment: the probability $p_i$ to use a specific topic tuple $i$ is proportional to the tuple $i$’s current usage, therefore $p_i = n_i / \sum_j n_j$, where $n_i$ is the number of times that the tuple $i$ is used. The parameters applied are ρ = 0.4 and γ = 0.1. Each individual’s time step is controlled by the number of papers published, following the distribution $P(n) \sim n^{-1.5}$ with a cutoff $n_{max} = 150$. These variables make the sequence generated similar to those in real data. We generate a total of 20,000 independent sequences, a comparable number to the size of the real data. See Supplementary Discussion 4 for more information about the preferential-attachment-based model.
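The explore-or-return rule can be sketched as below for a single sequence of fixed length. Two points are assumptions of this example: n is taken to be the index of the paper being generated (starting at 1), and the first paper always introduces a new topic tuple. Sampling the sequence length from $P(n) \sim n^{-1.5}$ is left out, and distinct topic tuples are represented by integer labels.

```python
import random


def generate_sequence(n_papers, rho=0.4, gamma=0.1, seed=None):
    """Generate topic-tuple labels with an explore/return rule and preferential return."""
    rng = random.Random(seed)
    usage = []       # usage[i] = number of papers that used topic tuple i so far
    sequence = []
    for n in range(1, n_papers + 1):
        explore = not usage or rng.random() < rho * n ** (-gamma)
        if explore:
            usage.append(1)                  # a new topic tuple gets the next label
            sequence.append(len(usage) - 1)
        else:
            # Return: pick an existing tuple with probability n_i / sum_j n_j.
            labels = list(range(len(usage)))
            chosen = rng.choices(labels, weights=usage, k=1)[0]
            usage[chosen] += 1
            sequence.append(chosen)
    return sequence


example = generate_sequence(50, seed=1)
```

Because the return step favours tuples that are already heavily used, early tuples tend to accumulate the most papers, which is the mechanism behind the mismatch with the recency feature discussed in the main text.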
Computational codes for data processing, analysis, and model simulation are available upon request.
How to cite this article: Jia, T., Wang, D. & Szymanski, B. K. Quantifying patterns of research-interest evolution. Nat. Hum. Behav. 1, 0078 (2017)
We thank A.-L. Barabasi for providing the initial dataset, A.-L. Barabási and G. Korniss for discussions. This work was supported by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053. T.J. is supported by the Natural Science Foundation of China (61603309) and CCF-Tencent RAGR (20160107). D.W. is supported by the Air Force Office of Scientific Research under award number FA9550-15-1-0162 and FA9550-17-1-0089. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.