Categorical and Geographical Separation in Science

We study scientific collaboration at the level of universities. The scope of this study is to answer two fundamental questions: (i) can one indicate a category (i.e., a scientific discipline) that has the greatest impact on the rank of the university and (ii) do the best universities collaborate with the best ones only? Restricting ourselves to the 100 best universities from the year 2009, we show how the number of publications in certain categories correlates with the university rank. Strikingly, the expected negative trend is not observed in all cases; for some categories even positive values are obtained. After applying Principal Component Analysis we observe a clear categorical separation of scientific disciplines, dividing the papers into almost separate clusters connected to natural sciences, medicine, and arts and humanities. Moreover, using complex networks analysis, we give hints that scientific collaboration is still embedded in physical space and that the number of common papers of two universities decays with the geographical distance between them.


Introduction
The 20th century is well known for the critical works of Kuhn, Popper, Lakatos and Feyerabend, which tried to build models of how science should work or to show how it does in fact work. At the same time, owing to the arrival of an era of overwhelming information, it became possible to tackle this problem quantitatively [5,6], pointing out specific phenomena observed in science. Several studies aim to answer questions such as "How to measure who the best scientist is?" [7-12,15] or try to simulate the process of paradigm shifts [13,14]. In this study, we make use of complex networks tools [16,17] to show how this issue is resolved at the level of scientific institutions (i.e., universities). More specifically, we ask (i) what is the correlation between university rank and the number of papers in a specific discipline and (ii) what are the components of scientific collaboration.

Results
In order to estimate the correlations between university rankings and scientific productivity we had to identify two different sources of data: (i) a university ranking with at least 10 years of activity and (ii) a source of actual bibliographic information that (1) allows one to view the categories of publications, (2) allows one to view the address of the publication and (3) allows one to view the year of publication. The lists of the top hundred universities were downloaded from two services: the Academic Ranking of World Universities (later referred to as ARWU) and the QS World University Ranking (later referred to as QS). The rationale behind choosing two rankings that follow different rules was to check the robustness of the performed analysis. After a preliminary analysis, we chose the Web of Science service as the data source for obtaining the information on citations. For one institution the number of publications ranges from a few to dozens of thousands. As a result each university has two tables containing the following fields: (i) published (date of publication), (ii) ID (reference to the second table), (iii) subject category (category of publications), (iv) language. The key information used in this report is the subject category of the published paper (we will refer to it later simply as the category).
We start the analysis by estimating the correlation that reflects how the number of papers a university has published depends on its rank in the list. To be more precise, for each of the 180 categories we build a 100 by 2 matrix, where the first column gives the ranks r_i of all universities and the second one gathers the number of papers n_i published in this category by the given university. As one of the variables is already given in the form of a rank, we decided to use Spearman's rank correlation coefficient ρ as the measure of dependence between r and n. The results are gathered in Table 1 together with the total number of papers in the given category and the statistical significance of the test.
It is of use to examine the relation between the size of the category, measured by the number of papers N belonging to it, and the above-mentioned correlation coefficient ρ. Those results are shown in Fig. 1, giving evidence that a lower (more negative) correlation, i.e., a larger number of papers accompanying a higher rank, is in general characteristic of categories with a large total number of papers. Moreover, the correlations for such categories are statistically significant.
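The rank-correlation step described above can be illustrated with a short sketch. It is a minimal pure-Python version of Spearman's ρ (the Pearson correlation of the ranks, with ties given average ranks), not the code used in the study; the function names and the toy numbers are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long sequences (non-constant input)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman_rho(rank_positions, paper_counts):
    """Spearman's rho between ranking positions r_i and paper counts n_i."""
    return pearson(average_ranks(rank_positions), average_ranks(paper_counts))

# Hypothetical toy category: five universities, position in the ranking
# versus number of papers published in that category.
r = [1, 2, 3, 4, 5]
n = [900, 700, 650, 300, 120]
print(spearman_rho(r, n))
```

A negative value means that better-placed universities (small r_i) publish more papers in the category, which is the trend the text calls "expected". The sketch does not compute the statistical significance reported in Table 1.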

Categorical separation
Here, we would like to check the hypothesis of categorical separation of science. It is our belief that certain categories tend to "glue together" the scientists working in them. In other words, the possibility of interdisciplinary research is not as high as one would expect it to be. In order to test this assumption we performed Principal Component Analysis (PCA) for the 10 most prominent categories (in the sense of the total number of papers). As can be seen in Fig. 2a, the first three principal components explain 90% of the variability in the data, so the analysis can be restricted to them. Further, by plotting the 2nd component vs. the 1st (Fig. 2b) and the 3rd component vs. the 2nd (Fig. 2c) we can identify the main directions of the dataset. In Fig. 2b we have biochemistry, biology, neuroscience, medicine and psychology in the positive part of the x-axis, while chemistry, physics, materials science, engineering and computer science are in the negative part of this axis. It can thus mean that the 1st component divides the categories into technical sciences (negative values) and medicine-related ones (positive values). The 2nd component is much harder to identify; a rough estimate could link the positive axis with fundamental sciences, as we have physics, chemistry and biology there. Finally, there is a clear interpretation of the 3rd component: the only significant positive value is connected to physics.
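To make the "fraction of variability explained" concrete, the following is a minimal sketch of PCA using power iteration with deflation on the covariance matrix. It assumes (this is not stated in the text) that rows are universities and columns are paper counts per category; the function names and numbers are hypothetical, and a production analysis would normally rely on a numerical library instead.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample covariance of the columns of `rows` (one row per observation)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and eigenvector of a symmetric PSD matrix."""
    d = len(matrix)
    # Deliberately uneven start vector, to avoid being orthogonal to the answer.
    vec = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    value = sum(vec[a] * matrix[a][b] * vec[b] for a in range(d) for b in range(d))
    return value, vec

def principal_components(rows, n_components):
    """Return a list of (explained variance fraction, direction) pairs."""
    cov = covariance_matrix(rows)
    d = len(cov)
    total = sum(cov[a][a] for a in range(d))  # total variance = trace
    result = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        result.append((value / total, vec))
        # Deflation: remove the component just found before the next search.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(d)]
               for a in range(d)]
    return result

# Hypothetical paper counts: 5 universities (rows) x 3 categories (columns).
counts = [[120, 30, 15], [200, 55, 10], [90, 20, 40], [310, 80, 5], [150, 45, 25]]
for fraction, direction in principal_components(counts, 3):
    print(round(fraction, 3), [round(x, 3) for x in direction])
```

The entries of each direction vector play the role of the category coordinates plotted in Fig. 2b and 2c: categories with the same sign along a component fall on the same side of the corresponding axis.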

Network analysis
Apart from the categorical point of view, we can also consider university quality by analyzing the direct connections between universities i and j on the basis of the collaboration matrix C, whose element C_ij gives the number of common publications of institutions i and j. The principal concept of the network analysis is depicted in Fig. 3.
Using the 100 highest ranked universities u_1, u_2, ..., u_100, for each of them we search for its publications; for u_1 these are p_1, p_2, ..., p_M(u_1). Then, if among the co-authors of p_1 there is anyone from any of the universities u_2, ..., u_100, a link of weight w = 1 between the two universities (e.g., u_1 and u_2) is established. The weight is increased by one each time u_2 is found among the subsequent publications of u_1. Finally, the weight of the link between nodes u_1 and u_2 is just the number of their common publications (as seen in the database).
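The link-counting procedure above can be sketched as follows. This is an illustrative version under one assumption that differs from the per-university search described in the text: each publication is processed once, as the set of ranked universities appearing among its affiliations. The function name and the toy records are hypothetical.

```python
from collections import Counter
from itertools import combinations

def collaboration_weights(publications, ranked):
    """Count common publications for every pair of ranked universities.

    `publications` is an iterable of affiliation collections (one per paper);
    `ranked` lists the universities that belong to the ranking.
    """
    ranked = set(ranked)
    weights = Counter()
    for affiliations in publications:
        # Keep only ranked universities and ignore repeated affiliations.
        present = sorted(set(affiliations) & ranked)
        for u, v in combinations(present, 2):
            weights[(u, v)] += 1  # one more common publication of u and v
    return weights

# Hypothetical toy data: four papers, three ranked universities.
papers = [{"A", "B"}, {"A", "B", "C"}, {"C", "X"}, {"A"}]
w = collaboration_weights(papers, ["A", "B", "C"])
print(dict(w))
```

Pairs are stored with the two names in sorted order, so the result corresponds to the upper triangle of the symmetric matrix C_ij.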

Weights probability distribution
The first, fundamental quantity to be computed is the probability distribution of weights p(w), which gives an idea of the diversity of the number of common publications between universities. Figure 4 presents p(w) for the raw data (black circles) as well as for the logarithmically binned data (base b = 2, red-filled circles). The plot suggests that the majority of weights fall between w = 1 and w = 10, where a plateau can be clearly seen. There is still a clear pattern in the remaining part, even for weights as large as w = 10000, which could presumably be fitted by a power-law function. It is, however, possible to fit the full range with a log-normal function (red curve)

$$p(w) = \frac{1}{w \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln w - \mu)^2}{2\sigma^2}\right) \qquad (1)$$

with the parameters µ = 3.44 ± 0.02 and σ = 1.63 ± 0.01 (values obtained by maximum-likelihood fitting). However, the Kolmogorov-Smirnov goodness-of-fit test accepts the hypothesis that the data points come from the distribution described by Eq. (1) at a relatively low level of significance (α = 0.01). The result is similar to that obtained in [...].

Performing this search for consecutive universities from the ranking, we obtain a fully connected network of all 100 universities with links denoting the number of common publications.
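As an illustration of the two operations used in this subsection, the sketch below shows maximum-likelihood estimation of log-normal parameters (µ is the mean and σ the standard deviation of ln w) and logarithmic binning with base 2. It assumes all weights satisfy w ≥ 1; the function names and the sample weights are hypothetical, and the goodness-of-fit test is not included.

```python
from math import exp, log, pi, sqrt

def fit_lognormal(weights):
    """Maximum-likelihood estimates (mu, sigma) of a log-normal distribution."""
    logs = [log(w) for w in weights]
    mu = sum(logs) / len(logs)
    # The ML estimator of the variance divides by n, not n - 1.
    sigma = sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma

def lognormal_pdf(w, mu, sigma):
    """Log-normal probability density at w > 0."""
    return exp(-(log(w) - mu) ** 2 / (2 * sigma ** 2)) / (w * sigma * sqrt(2 * pi))

def log_binned_density(weights, base=2):
    """Density estimate on bins [base**k, base**(k+1)), assuming every w >= 1."""
    counts = {}
    for w in weights:
        k = 0
        while base ** (k + 1) <= w:
            k += 1
        counts[k] = counts.get(k, 0) + 1
    n = len(weights)
    # Divide each count by the sample size and by the bin width.
    return {(base ** k, base ** (k + 1)): c / (n * (base ** (k + 1) - base ** k))
            for k, c in sorted(counts.items())}

# Hypothetical link weights (numbers of common publications).
sample = [1, 2, 2, 3, 5, 8, 13, 40, 120, 900]
mu, sigma = fit_lognormal(sample)
print(mu, sigma, lognormal_pdf(10, mu, sigma))
print(log_binned_density(sample))
```

Dividing by the bin width is what makes the binned points comparable with a continuous curve such as the log-normal density.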

Dependence of edge width on node strength
An interesting point of the further analysis is to test whether the strength of a university, measured as the total number of its publications with other universities from the ranking, influences the affinity of one university to link to another. More precisely, we test how the weight w_AB between universities u_A and u_B depends on the product of their strengths s_A s_B. A log-log scatter plot of this relation for all pairs of universities is shown in Fig. 5a with black circles. It brings clear evidence that the larger the product of the universities' strengths, the higher the number of common publications between them. By performing logarithmic binning (red-filled circles) it is possible to analyze the specific form of the relation. The outcome is presented in Fig. 5b, where two fits are shown: a linear one (blue line, slope a = 6.54 × 10^−7 ± 0.15 × 10^−7 and negligible intercept) and a power-law one (red dotted line, exponent 0.97 ± 0.01). The linear fit has an R^2 value of 0.99, while the power-law one has 0.94. Taking into account those values, as well as the exponent of the power-law fit being close to 1, it is reasonable to assume that the average weight between universities characterized by strengths (numbers of publications) s_A and s_B is given by the relation

$$\langle w_{AB} \rangle = a \, s_A s_B \qquad (2)$$

where a is the slope of the linear fit. Equation (2) can serve as a kind of predictor for estimating a possible level of cooperation between two universities. Also, observed deviations from this law could indicate either the presence of outliers in a given dataset or invalid data; thus Eq. (2) might be useful as a first-step verification procedure for the examined data.
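The proportionality between link weight and the product of strengths can be illustrated with a small sketch. It computes node strengths as sums of link weights and fits a line through the origin by least squares on the raw pairs; note that the fit reported in the text was made on binned data, so this is a simplified stand-in. The function names and the toy weights are hypothetical.

```python
def node_strengths(weights):
    """Strength of each university: sum of the weights of its links."""
    strengths = {}
    for (u, v), w in weights.items():
        strengths[u] = strengths.get(u, 0) + w
        strengths[v] = strengths.get(v, 0) + w
    return strengths

def fit_proportionality(weights, strengths):
    """Least-squares slope a of w_AB = a * s_A * s_B (no intercept)."""
    num = den = 0.0
    for (u, v), w in weights.items():
        x = strengths[u] * strengths[v]
        num += x * w
        den += x * x
    return num / den

def predicted_weight(a, s_a, s_b):
    """Expected number of common publications for strengths s_a and s_b."""
    return a * s_a * s_b

# Hypothetical toy network of three universities.
w = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
s = node_strengths(w)
a = fit_proportionality(w, s)
for (u, v), observed in w.items():
    # Large gaps between observed and predicted values flag pairs to inspect.
    print(u, v, observed, predicted_weight(a, s[u], s[v]))
```

Comparing the observed and predicted columns is one way to use the relation as a first check of data quality, in the spirit of the verification procedure mentioned above.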

Weight threshold
The following analysis uses the concept of a weight threshold [17], depicted in Fig. 6.
Let us take the original network of 5 fully connected universities from Fig. 6a. Let us assume now that we are interested in constructing an unweighted network that takes into account only the connections with weight higher than a certain threshold weight w_T (w > w_T). A possible outcome of this procedure is presented in Fig. 6b: all the links with w ≤ w_T are omitted and, as a result, we obtain a network where links indicate only connections between nodes (i.e., they do not bear any value).
Using the weight threshold as a parameter it is possible to obtain several unweighted networks: for each value of w_T in the range [w_min; w_max] we get a different network N_T(w_T) whose structure is determined only by w_T. Then, for each of these networks it is possible to compute standard network quantities: (i) the number of nodes N that have at least one link (i.e., nodes with degree k_i = 0 are not taken into account), (ii) the number of edges (links) E between the nodes, (iii) the clustering coefficient C, (iv) the assortativity coefficient r, (v) the entropy S of the node degree probability distribution and (vi) the average shortest path l (see Materials and Methods for details).
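The thresholding step can be sketched in a few lines. The example keeps only links with w > w_T and then counts the nodes with at least one link and the surviving edges; the function names and the toy weights are hypothetical.

```python
def threshold_network(weights, w_t):
    """Unweighted edge set containing only links with weight strictly above w_t."""
    return {frozenset(pair) for pair, w in weights.items() if w > w_t}

def count_nodes_and_edges(edges):
    """Return (N, E): nodes with at least one link, and number of links."""
    nodes = set()
    for edge in edges:
        nodes |= edge
    return len(nodes), len(edges)

# Hypothetical weighted network of four universities.
w = {("A", "B"): 5, ("A", "C"): 2, ("B", "C"): 1, ("C", "D"): 7}
for w_t in (0, 2, 5, 7):
    print(w_t, count_nodes_and_edges(threshold_network(w, w_t)))
```

Scanning w_T over the whole range of weights in this way yields the sequence of unweighted networks on which the remaining observables are evaluated.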

Network observables as function of weight threshold
Figure 7 gathers the plots of the above-described network parameters as functions of w_T. First, as can be seen in Fig. 7a, the number of nodes N is a linearly decreasing function of the weight threshold w_T. The number of edges E decreases even faster: for w_T > 200 it follows an exponential function (Fig. 7b). Similarly to N(w_T), the clustering coefficient also drops linearly with the weight threshold (Fig. 7c), although several small jumps above the trend can be seen. The most interesting is the behaviour of r(w_T) shown in Fig. 7d: the coefficient starts with r < 0, then for larger thresholds it crosses r = 0, and for w_T in the range [200; 400] it takes its maximal value. Then it once again drops below zero, reaching r = −0.4 for w_T around 1000. Finally, it increases toward zero for large w_T. A non-monotonic behaviour is also observed in the case of the normalized entropy S/S_max (Fig. 7e): a rapid growth occurs at the very beginning (for small w_T), followed by a linear decrease. For large w_T the normalized entropy S/S_max increases again. The average shortest path (Fig. 7f) resembles the shape of r except for the lack of growth at the end.

Network visualisation
The above-described non-trivial behaviour of the quantities r, S/S_max and l cannot be explained solely by the relations presented in Eqs. (1) and (2). It seems that there has to be another phenomenon leading to such an effect.
Using the Pajek program as well as a tool for community detection it is possible to visualize the connections between universities and the community structure (denoted by color) for different values of w_T. The results for w_T = 250, 600, 800 and 1200 are shown in Figs. 8-11, providing an input for further analysis. Up to w_T = 250 the network is percolated (it is possible to reach any node from any other one); above that value a separation occurs: Chinese, Australian and Singaporean, Danish and Swedish, as well as Swiss universities all form separate clusters. The giant cluster is built out of American, Canadian, English, Dutch, Swiss, German and Japanese universities (Fig. 8). For w_T = 600 the Dutch, Swiss, German and Japanese universities are separated from the giant cluster (Fig. 9) and for w_T = 800 also the English ones (Fig. 10). The final separation also affects American universities (w_T = 1200, Fig. 11).
It seems that the key aspect governing this kind of phenomenon is the geographical distance between the universities. In fact, Figure 12 confirms this supposition: the number of common publications of universities A and B is a decreasing power-law function of the geographical distance between them. The gap around d_AB = 5000 is most probably caused by the presence of continents. Similar results regarding the role of geographical distance in science were obtained in [18,19].
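The text does not state how the distance d_AB was computed or in which units, nor how the power law was fitted. As one plausible illustration only, the sketch below uses the great-circle (haversine) distance in kilometres between two coordinate pairs and estimates a power-law exponent as the least-squares slope in log-log coordinates. The function names, the Earth radius constant and the coordinates are assumptions.

```python
from math import asin, cos, log, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed unit: kilometres)

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding errors for antipodal points.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(h)))

def power_law_exponent(distances, weights):
    """Least-squares slope of ln(w) versus ln(d), i.e. gamma in w ~ d**gamma."""
    xs = [log(d) for d in distances]
    ys = [log(w) for w in weights]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Approximate, illustrative coordinates of two cities (not taken from the paper).
print(great_circle_km(52.23, 21.01, 48.14, 11.58))

# Hypothetical (distance, weight) pairs.
print(power_law_exponent([100, 500, 2000, 8000], [900, 250, 40, 12]))
```

A negative exponent corresponds to the decaying trend described for Figure 12; a simple log-log regression of this kind is sensitive to outliers, which the caption of Figure 12 says were omitted in the original fit.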

Discussion
Our preliminary results show that even such a fundamental and straightforward analysis as calculating the correlation coefficient between the position of a university in the ranking and the number of papers published by its employees may reveal some non-trivial relationships. In particular, one may use it as an indicator of the interest a certain scientific area gains over the years. It may thus be possible to spot the emergence of certain trends in science and, in effect, to react, for example by establishing a new direction of research in the university.
Our final results show that scientific collaboration is highly embedded in physical space: it seems that the key aspect that governs the number of common publications is the geographical vicinity of the universities. This holds even in the case of links between continents (e.g., the link between Australia and Singapore). On the other hand, the strength of the ties between universities is proportional to the product of their total numbers of publications. These two relations could be used as a starting point for modeling university collaboration.

Data verification
Abbreviations. The seemingly straightforward procedure of querying for a specific university name encounters some problems that could have a strong impact on further results. Web of Science has a set of abbreviations commonly used for searching, such as Univ for "University" or Coll for "College". Moreover, it is essential to notice that one has to form a very specific query in order to avoid severe mistakes. Table 3 shows an exemplary list of the searched universities together with the exact search phrase that had to be used.
Ambiguity of queries. The 'Search' field is a search key that we use to associate publications with their authors' institutions, and it can contain one of the operators: +, which stands for the AND operator in Boolean logic, and |, which stands for the NOT operator in Boolean logic. These operators are used to clearly assess the origin of the publication. Table 2 shows that using just the names of universities from the list (first column) would lead, in the case of number 98, to obtaining publications of both the Technical University in Munich and the University of Munich, instead of just the latter. To avoid this problem one has to insert the query Univ Munich | Tech Univ Munich, which ensures proper results. On the other hand, for the case shown as number 78, it was not sufficient to enter Washington Univ, as there are many universities with such an abbreviation; it was necessary to add St. Louis to the query text.

Network analysis
Clustering coefficient. The clustering coefficient C_i for node i is defined as the number of existing links e_i among its nearest neighbors (i.e., nodes to which it has links) divided by the total number of possible links among them, k_i (k_i − 1)/2:

$$C_i = \frac{2 e_i}{k_i (k_i - 1)}$$

The total clustering coefficient for the whole network is calculated as the average over all C_i.
Assortativity coefficient r. It is the Pearson correlation coefficient between the degrees of the nodes at the two ends of an edge, evaluated over all edges i in the network. The coefficient is in the range [−1; 1]; r = 1 means that highly connected nodes have the affinity to connect to other nodes with high k_i, while r = −1 happens when highly connected nodes tend to link to nodes with very low k_i.
Entropy of node degree probability distribution. It is calculated by first obtaining the degree probability distribution p(k) (i.e., the probability that a randomly chosen node has exactly k edges) and then evaluating the expression

$$S = -\sum_{k = k_{min}}^{k_{max}} p(k) \ln p(k)$$

where k_min and k_max are, respectively, the smallest and the largest degree in the network. For the sake of comparison we divide the obtained value of entropy by its maximal value, i.e., S_max = ln(k_max − k_min).
Average shortest path l. It is calculated as the average value of the shortest distance (measured in the number of steps) between all pairs of nodes i, j in the network.
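The four quantities defined above can be sketched in pure Python for an unweighted, undirected edge list. This is an illustrative implementation under stated assumptions, not the code used in the study: nodes with fewer than two neighbors contribute C_i = 0, assortativity is computed as the Pearson correlation of end-point degrees with every edge counted in both directions, and the average shortest path is taken over mutually reachable pairs only. All function names are hypothetical.

```python
from collections import deque
from math import log, sqrt

def adjacency(edges):
    """Build a neighbor-set dictionary from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj):
    """Average of C_i = 2 e_i / (k_i (k_i - 1)); nodes with k_i < 2 count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # e_i: number of links among the neighbors (each pair counted once).
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def assortativity(adj):
    """Pearson correlation of the degrees at both ends of every edge."""
    xs, ys = [], []
    for u, nbrs in adj.items():
        for v in nbrs:  # each edge appears twice: (u, v) and (v, u)
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # undefined if all degrees are equal

def degree_entropy(adj):
    """S = -sum_k p(k) ln p(k) over the degrees present in the network."""
    n = len(adj)
    counts = {}
    for nbrs in adj.values():
        counts[len(nbrs)] = counts.get(len(nbrs), 0) + 1
    return -sum(c / n * log(c / n) for c in counts.values())

def average_shortest_path(adj):
    """Mean breadth-first-search distance over all reachable ordered pairs."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical example: a triangle A-B-C with a pendant node D attached to C.
g = adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(clustering_coefficient(g), assortativity(g),
      degree_entropy(g), average_shortest_path(g))
```

The normalization of the entropy by S_max is left out of the sketch; it can be applied to the returned value using the smallest and largest degree found in `adj`.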

Figure 3. Representation of the university collaboration network. Each node is a university and links show the connections between them. The width of each link corresponds to the number of common publications between the nodes in question.

Figure 4. Probability distribution of weights p(w) in the data.Black circles are original data, red-filled circles are binned data (logarithmic binning) and the blue line is a fit to the binned data given by Eq. (1).

Figure 5. Log-log plot of weights w_AB versus the product of strengths s_A s_B. (a) Original data (black circles) and binned data (red-filled circles). (b) Log-log plot of fits to the binned data: linear (blue solid line) and power-law (red dotted line).

Figure 6. Illustration of the weight threshold concept: (a) a weighted university network with weights proportional to the number of common publications, (b) an unweighted network constructed from the weighted one seen in panel (a) by imposing a weight threshold; only links with weights w > w_T are kept.

Figure 12. Weight w_AB vs. geographical distance d_AB between universities. Gray circles are raw data while red-filled circles are binned data. The blue solid line is a fit (outliers are omitted during the fitting procedure).