Quantifying national information interests using the activity of Wikipedia editors

We live in a "global village" where electronic communication has eliminated the geographical barriers of information exchange. With global information exchange, the road is open to worldwide convergence of opinions and interests. However, it remains unknown to what extent interests actually have become global. To address how interests differ between countries, we analyze the information exchange in Wikipedia, the largest online collaborative encyclopedia. From the editing activity in Wikipedia, we extract the interest profiles of editors from different countries. Based on a statistical null model for interest profiles, we create a network of significant links between countries with similar interests. We show that countries are divided into 18 clusters with similar interest profiles in which language, geography, and historical background polarize the interests. Despite the opportunities of global communication, the results suggest that people nevertheless care about local information.

"We live in a global world" has become a cliché 1 . Historically, the exchange of goods, money, and information was naturally limited to nearby locations, since globalization was effectively blocked by spatial, territorial, and cultural barriers 2 . Today, new technology is overcoming these barriers and exchange can take place in an increasingly international arena 3 . Nevertheless, geographical proximity still seems to be important for the trade of goods 4-7 as well as for mobile phone communication 8 and scientific collaboration 9 . However, since the Internet makes information travel more easily and rapidly than goods, it remains unclear what are the effective barriers of global information exchange. As information exchange requires shared interests, we therefore need to better understand to what extent interests are global.
To study interests on a global scale, we use the free online encyclopedia Wikipedia, which has evolved into one of the largest collaborative repositories of information in the history of mankind 10 . Wikipedia is a multi-lingual encyclopedia that captures a wide spectrum of information in millions of articles. These articles undergo a peer-reviewed editing process without a central editing authority. Instead, articles are written, reviewed, and edited by the public. Each article edit is recorded, along with a time-stamp, and, if the editor is unregistered, the computer's IP address. The IP address makes it possible to connect each edit to a specific location. Therefore we can use Wikipedia editors as sensors of information for mapping interest profiles to specific geolocations. In this way, we can examine what role geography plays in how information is shared, and explore to what degree local interests form global connections.
In this paper, we use co-editing of the same Wikipedia article as a proxy for shared information interests. To quantify the locality of such interests, we look at how often editors from different countries co-edit the same articles. To infer connections of shared interest between countries, we develop a null model and represent significant correlations between countries as links in a global network. Structural analysis of the network indicates that interests are polarized by factors related to geography, language and historical background. Despite the possibility of unrestricted global exchange, information interests seem to remain local.

Relating information interests to geographical location
As one of the largest and most linguistically diverse repositories of human knowledge, Wikipedia has become the world's main platform for archiving factual information 10 . The free online encyclopedia consists of almost 300 language editions, with English being the largest one 11,12 . Moreover, an important feature of Wikipedia is that every edit made to an article is recorded. Whenever an unregistered editor makes a change to an article, the editor's location and the time of the change are registered. Thanks to this detailed data, Wikipedia provides a unique platform for studying different aspects of information processes, for example, semantic relatedness of topics 13,14 , collaboration [15][16][17] , social roles of editors 18 , and the geographical locations of Wikipedia editors 19 .
In this work, we used data from Wikipedia dumps 20 to select a random sample from the English Wikipedia edition. In total, the English edition has around 10 million articles, including redirects and duplicates. Since retrieving the editing histories of all articles is computationally demanding, we randomly sampled more than one million articles from this set. Since our purpose is to examine global information interests, we chose to sample articles from the largest and most widespread language edition. Moreover, for each English article that we initially extracted, we retrieved the complete editing history in all language editions. Therefore, our analysis is not restricted to only one language edition of the Wikipedia articles. Finally we merged all language editions together to create a global editing history for each article. For each edit, the editing history includes the text of the edit, the timestamp of the edit, and, for unregistered editors, the IP address of the editor's computer. From the IP address associated with the edit, we retrieved the geolocation of the corresponding editor using the IP database 21 . For the purpose of spatial analysis, we limited the analysis to edits from unregistered editors, because the data on the location for most of the registered Wikipedia editors is unavailable. The resulting dataset contains more than one million (1,069,746) Wikipedia articles and more than 23 million (23,555,117) edits in total. We use these edits as proxies for interest profiles at a specific geolocation.

Inferring shared interests from edit cooccurrence
We identify the interest profile of a country by aggregating the edits of all unregistered Wikipedia editors whose IPs are recorded in the country. If an article is co-edited by editors located in different countries, we argue that the countries share a common interest in the information of the article. In other words, we connect countries if their editors co-edit the same articles. Indirectly, we let individuals who edit Wikipedia represent the population in their country. While Wikipedia editors in a country certainly do not represent a statistically unbiased sample, there is a natural tendency that they edit contents that are related to the country in which they live. Therefore, we consider the edit behavior as an important indicator of the general interests of a country.
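The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `edits` list, the country codes, and the helper names `edit_counts` and `coedited_pairs` are hypothetical, and each edit is assumed to be already geolocated to a country.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: one (article, country) pair per geolocated edit.
edits = [
    ("Vienna", "AT"), ("Vienna", "AT"), ("Vienna", "DE"),
    ("Oslo", "NO"), ("Oslo", "SE"), ("Oslo", "NO"), ("Oslo", "DE"),
]

def edit_counts(edits):
    """Return counts[article][country] = number of edits from that country."""
    counts = defaultdict(Counter)
    for article, country in edits:
        counts[article][country] += 1
    return counts

def coedited_pairs(counts):
    """Return the set of country pairs that edited at least one common article."""
    pairs = set()
    for per_country in counts.values():
        countries = sorted(per_country)
        for idx, first in enumerate(countries):
            for second in countries[idx + 1:]:
                pairs.add((first, second))
    return pairs

counts = edit_counts(edits)
pairs = coedited_pairs(counts)
```

In this toy input, Austria and Germany are connected through one article, while Austria and Norway never co-edit and therefore stay unconnected.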
arXiv:1503.05522v1 [cs.CY] 18 Mar 2015
Our analysis is limited to unregistered Wikipedia editors, since the edits from these users include an IP address that makes it possible to find the geographical location of the edit. We therefore ignore a significant number of edits that come from registered editors. However, registration in Wikipedia demands an effort from the editor, which means that editors who make the effort of registering potentially have a greater interest in the process of maintaining and editing information than in the specific information itself. Registered editors also introduce other kinds of biases, since they can develop career paths and specialization or biases towards selected topics, or focus on administrative functions 22 . This would suggest that unregistered editors form the most adequate representation of general information interests.
From the co-editing data, we create a network that represents countries as nodes and shared interest as links. The naive approach is to use the raw counts of co-edits between countries as links. The problem with this approach is that it is biased toward the number of editors in each country. Some countries may be strongly connected, not because of evident shared interests but merely as a result of a large community of active Wikipedia editors. To address this problem, we propose a statistical filtering method that filters out connections that could exist only due to size effects or noise. The filtering method assumes a multinomial distribution and determines the expected number of co-occurring edits from the empirical data. Similar methods to filter Wikipedia data have been used in other studies 23 . While those studies used a similar statistical approach to determine significant correlations, we additionally correct for multiple comparisons to avoid false positives that affect the link values between countries.

Interest model
We link countries based on their co-occurring edits over all Wikipedia articles. For a specific article $a$, we calculate the link weight between all pairs of countries that edited that article as follows: if editors in country $i$ have edited the article $k_i^a$ times, and editors in country $j$ have edited the same article $k_j^a$ times, then the countries' empirical link weight, $w_{ij}^a$, is calculated as:
$$w_{ij}^a = k_i^a k_j^a. \tag{1}$$
Since the total number of articles is over one million, most country pairs have co-edited at least one article. Therefore, the aggregation of all articles results in numerous links between countries, and the countries with relatively large editing activities become highly central. Accordingly, we cannot know if a link exists by chance, or because countries actually tend to edit the same articles more frequently than expected. To determine which links are statistically significant, similarly to other studies 24,25 , we compare the empirically observed link weights with the weight given by a null model. In the null model, we assume that each edit comes from a country randomly picked proportionally to its total number of edits. More specifically, the random assignments are performed by drawing the countries from a multinomial distribution. That is, for each edit, country $i$ is selected proportional to its cumulative editing activity,
$$p_i = \frac{\sum_a k_i^a}{M}, \tag{2}$$
where $M$ is the total number of edits for all articles. Note that each edit is sampled independently from all other edits, and that the cumulative edit activity of a country in the null model on average will be the same as the observed one. This null model preserves the average level of activity of the countries, but randomizes the temporal order and the articles that countries edit. Figure 1 shows an example of this reshuffling scheme with four articles. From the null model, we can analytically compute the expected link weight, $\mu_{ij}^a$, of two countries $i$ and $j$ that edit the same article $a$ (see the Methods section for the derivation):
$$\mu_{ij}^a = n_a (n_a - 1) p_i p_j, \tag{3}$$
where $n_a$ is the total number of edits in article $a$.
Figure 1 […] Note that pins represent country edits, and therefore they can be repeated. The resulting empirical network, calculated by multiplying raw co-edit counts, can be seen at the bottom. In the right panel, we illustrate the null model with the same four articles, and the resulting network at the bottom. In the null model, the editing activity of the countries is on average preserved, but the order of the edits is reshuffled within and across articles. Because of the filtering, some links are removed in the interest model.
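As a rough illustration of the quantities in this section, the following Python sketch computes the empirical link weight (the product of the two countries' edit counts in an article), the activity shares p_i, and the null-model expectation n_a (n_a − 1) p_i p_j given in the Methods section. The toy `counts` dictionary and all function names are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical toy counts: counts[article][country] = k_i^a.
counts = {
    "A1": Counter({"DE": 3, "AT": 2}),
    "A2": Counter({"DE": 1, "SE": 4}),
}

def activity_shares(counts):
    """p_i: share of all edits (over all articles) that come from country i."""
    totals = Counter()
    for per_country in counts.values():
        totals.update(per_country)
    total_edits = sum(totals.values())  # M, the total number of edits
    return {country: k / total_edits for country, k in totals.items()}

def empirical_weight(counts, article, i, j):
    """Empirical link weight: product of the two countries' edit counts."""
    return counts[article][i] * counts[article][j]

def expected_weight(counts, shares, article, i, j):
    """Null-model expectation n_a (n_a - 1) p_i p_j for one article."""
    n_a = sum(counts[article].values())
    return n_a * (n_a - 1) * shares[i] * shares[j]

shares = activity_shares(counts)
w = empirical_weight(counts, "A1", "DE", "AT")
mu = expected_weight(counts, shares, "A1", "DE", "AT")
```

Here the observed weight for Germany and Austria in article "A1" is larger than the null expectation, which is the kind of excess the filtering is designed to detect.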
To compare the empirical and expected link values, we compute standardized values, so-called z-scores. For countries $i$ and $j$ and article $a$, the z-score $z_{ij}^a$ is defined as
$$z_{ij}^a = \frac{w_{ij}^a - \mu_{ij}^a}{\sigma_{ij}^a}, \tag{4}$$
where the standard deviation, $\sigma_{ij}^a$, is computed in the Methods section.
The z-scores are useful for comparisons of weights, since they account for the large variations that exist in the articles' edit histories. We then sum over all articles to find the cumulative z-score for countries $i$ and $j$:
$$z_{ij} = \sum_a z_{ij}^a. \tag{5}$$
Using the Bonferroni correction, we consider a link to be significant if the probability of observing the total z-score is less than $0.05/N$, where $N$ is the number of countries. Since the total z-score is a sum over many independent variables, we can approximate the expected total z-score distribution with a normal distribution. The normal distribution has average value 0 and standard deviation $\sqrt{L}$, where $L$ is the number of Wikipedia articles. Thus, the threshold for the significant link weight is $t = 3.52\sqrt{L}$, where 3.52 is derived from the condition that $P(z > 3.52) = 0.05/N$, where $N = 234$ and $P$ is the standard Gaussian distribution (with zero average and unit variance). If the total z-score is larger than the threshold, we create a link between countries $i$ and $j$ with weight $w_{ij}$ according to […]
In summary, the interest model maintains the average level of activity of the countries and randomizes the articles that they edit. By comparing results from the interest model and empirical values, we can identify significant links between countries.
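The threshold arithmetic can be checked with a short Python sketch that uses only the standard library. The function names are hypothetical; the only inputs taken from the text are the significance level 0.05, the number of countries N = 234, and the number of articles L.

```python
from math import sqrt
from statistics import NormalDist

def bonferroni_quantile(n_countries, alpha=0.05):
    """Standard-normal quantile q with P(z > q) = alpha / n_countries."""
    return NormalDist().inv_cdf(1.0 - alpha / n_countries)

def significance_threshold(n_countries, n_articles, alpha=0.05):
    """Threshold t = q * sqrt(L) for a sum of L unit-variance z-scores."""
    return bonferroni_quantile(n_countries, alpha) * sqrt(n_articles)

def is_significant(article_z_scores, n_countries, n_articles):
    """True if the summed z-score of a country pair exceeds the threshold."""
    return sum(article_z_scores) > significance_threshold(n_countries, n_articles)

q = bonferroni_quantile(234)                 # close to the 3.52 quoted in the text
t = significance_threshold(234, 1_069_746)   # threshold for the full article sample
```

With N = 234, the quantile agrees with the value 3.52 to two decimals, so the threshold scales as roughly 3.52 times the square root of the number of articles.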

Clustering countries with similar interests
The interest model identifies significant links showing the global relationships that appear from similar editing activity. These links provide information about the pairwise relationships between countries, but to identify large-scale structures among the thousands of links, we must identify and highlight the groups of countries that share interest in the same information. To reveal such groups among the pairwise connections between countries, we use a network community detection method. In this approach, we first build a network of countries connected with the significant links. To identify groups of countries, we envision an editor game in which editors from different countries are active in sequence. In this relay race, a country passes the edit over to a linked country with a probability proportional to the weight of the link. Accordingly, the sequence of countries forms a random walk, and certain sets of countries with strong internal connections will be visited for a relatively long time. They are the groups of countries we are looking for. Moreover, identifying these groups with the described dynamics is equivalent to using the community-detection method known as the map equation 26,27 . Therefore, to identify groups of countries with strong internal connections, we use the map equation's associated search algorithm Infomap, which is available online 28 .
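The relay-race dynamics can be made concrete with a small Python sketch. It is not Infomap and does not perform any clustering; it only illustrates the random walk that the map equation describes, on a hypothetical four-country network with made-up link weights. It relies on the standard result that, for a random walk on a connected undirected weighted network, the long-run visit frequency of a node is proportional to its total link weight.

```python
# Hypothetical symmetric link weights between four countries.
links = {
    ("SE", "NO"): 5.0,
    ("DE", "AT"): 4.0,
    ("NO", "DE"): 1.0,
}

def neighbours(links):
    """Build an undirected adjacency map: country -> {neighbour: weight}."""
    adjacency = {}
    for (a, b), weight in links.items():
        adjacency.setdefault(a, {})[b] = weight
        adjacency.setdefault(b, {})[a] = weight
    return adjacency

def step_probabilities(adjacency, country):
    """Probability of passing the edit from `country` to each neighbour."""
    strength = sum(adjacency[country].values())
    return {nb: w / strength for nb, w in adjacency[country].items()}

def visit_frequencies(adjacency):
    """Long-run visit frequencies of the walk: node strength / total strength."""
    strengths = {c: sum(nbs.values()) for c, nbs in adjacency.items()}
    total = sum(strengths.values())
    return {c: s / total for c, s in strengths.items()}

adjacency = neighbours(links)
from_norway = step_probabilities(adjacency, "NO")
frequencies = visit_frequencies(adjacency)
```

In this toy network the walker leaving Norway moves to Sweden far more often than to Germany, so it tends to stay inside the Sweden-Norway pair for many steps, which is the intuition behind grouping the two countries together.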

Results and discussion
We discuss the results at three levels. First, we discuss the global picture of clusters with shared information interests. Then, we show the interconnections between the clusters. Finally, we consider each cluster separately and examine the interconnections between countries within the clusters.

A. World map of information interests
Between the 234 countries, we identify 2847 significant links that together form a network of article co-edits. By clustering the network, we identify 18 clusters of strongly connected countries (see Supplementary Table 1 for a detailed list of the countries in each cluster). The clustering is illustrated in Fig. 2, where countries of the same cluster have the same color. The results show that the division of countries, to some extent, follows geographical proximity. For example, most of the Eastern European countries are clustered together, as well as countries in Scandinavia and the Middle East. These results suggest that geographical proximity affects how interests are formed 19,29 .
Another important factor in the formation of information interests is language. For example, countries in Central and South America are divided into two clusters, with Portuguese and Spanish as common languages in each cluster, respectively. The influence of language in sharing interests is not surprising. It is well known that interests are formed by cultural expression and public opinion, and an important platform for these expressions is language 30 . The importance of the relation between language and interests has also been demonstrated by the surprisingly small overlap between languages in Wikipedia and the variation in the editing of controversial topics [31][32][33] . In general, the world of Wikipedia co-editing reveals clusters of countries that resemble existing (un)written geopolitical borders, and it reflects shared interests between the countries based on such intertwined aspects as, for example, geography, migration, language, transportation and historic backgrounds [34][35][36][37][38][39] .

B. World network of information interests
To examine the connections between clusters, we look at the network structure at the cluster level. Figure 3 shows the connections between the clusters of countries illustrated in Fig. 2. In general, connections tend to be stronger between clusters of geographically proximate countries. Interestingly, the Middle East cluster has the strongest connections to other clusters, forming a hub that connects East and West, North and South. Interpreting the strong connections as potential highways for information spreading, the Middle East is not only a melting pot of ideas, but also plays an important role for the spread of information.

C. Country connections within clusters
To find connections between countries, we look at the network structure of countries within a cluster. In the upper left corner of Fig. 3, we show the strongest connections within the Central European cluster. The network shows that countries are linked based on the overlap of the officially spoken languages 40 . For example, Belgium has three official languages, Dutch, French and German. Indeed, Belgium is connected closely to the Netherlands, France and Luxembourg. The same pattern can be observed for the triad of Switzerland, Germany and Austria.
To further examine the interests that form country connections, we look at which articles are most important for creating the significant links. As an example, we looked at the top articles for two European country pairs, Germany-Austria in the European cluster and Sweden-Norway in the Scandinavian cluster. The most significant co-edits relate to local and regional interests, for instance, sports, media, music, and places (see Supplementary Fig. 2). For example, the top Germany-Austria list includes an Austrian singer who is also popular in Germany and an Austrian football player who is playing in the German league. The top Sweden-Norway list shows a similar pattern of locally related topics, for example, a host of a popular TV show simultaneously aired in Sweden and Norway, a Swedish football manager who has been successful both in Sweden and Norway, and a music genre that is nearly exclusive to Scandinavian countries. Overall, the top articles suggest that an important factor for co-editing is locally related interests.
Figure 3 World network of information interests. The size of the nodes represents the total z-score of the clusters. The links represent connections between clusters obtained from the cluster analysis with Infomap, and the thicker the line, the stronger the connection. Clusters are colored in the same way as in Fig. 2. The upper left corner shows the most significant connections between countries in the Central European cluster.

Conclusion
We introduced a general statistical filtering method to extract significant connections in weighted correlation networks. We applied the method to editor activity in Wikipedia to connect countries that share similar information interests. Despite the advances in information and communication technologies, the interest network reveals that interests remain local. In line with earlier studies, we find that language, geography, and history have a great impact on the diversity of interests. The network obtained from the similarity in information interests can be used to model the propagation of information on a global scale. Moreover, since local interest limits the global information exchange, these results can help us to better understand the concept of globalization.

Method
To find connections in interest, we measure the co-occurring edits of two countries in the same articles. We quantify the connection with an empirical weight that is computed as the product of the countries' edit activities in the article. For a Wikipedia article $a$, if the total edit activity of country $i$ is denoted by $k_i^a$, and that of country $j$ by $k_j^a$, then we calculate the empirical weight, $w_{ij}^a$, according to
$$w_{ij}^a = k_i^a k_j^a. \tag{7}$$
As the total edit activity of countries differs, the probability that countries appear together in a certain article varies. If the total number of edits for all articles is $M$, then the expected proportion of edits for country $i$ is
$$p_i = \frac{\sum_a k_i^a}{M}. \tag{8}$$
This is the probability of country $i$ making a random edit overall, and this is the null model we use to filter noisy connections in the interest model. Assume that there is a total of $c$ countries, and a total of $n = n_a$ edits for article $a$. Let $x_k$ denote the number of edits from country $k$. Then the probability of any particular combination of edits for the various countries follows a multinomial distribution
$$P(x_1, \ldots, x_c) = \frac{n!}{x_1! \cdots x_c!} \, p_1^{x_1} \cdots p_c^{x_c}.$$
With the above distribution, we can compute the expected co-occurrence of two countries $i$ and $j$ in an article as a sum over all combinations $x_1, \ldots, x_c$ with $x_1 + \cdots + x_c = n$:
$$\mu_{ij}^a = \sum_{x_1, \ldots, x_c} x_i x_j \, \frac{n!}{x_1! \cdots x_c!} \, p_1^{x_1} \cdots p_c^{x_c}. \tag{9}$$
Following the multinomial theorem, we can now calculate the mean, variance and covariance matrix for the occurrence of a country pair. The mean value of the co-occurrence of two countries becomes
$$\mu_{ij}^a = n_a (n_a - 1) p_i p_j. \tag{10}$$
Using the multinomial theorem multiple times, one can also compute the variance:
$$(\sigma_{ij}^a)^2 = n_a (n_a - 1) p_i p_j \left( (6 - 4 n_a) p_i p_j + (n_a - 2)(p_i + p_j) + 1 \right). \tag{11}$$
Thus the standard deviation of the pair, $\sigma_{ij}^a$, is the square root of equation (11). This equation enters into the definition of the z-score in equation (4). The sum of the z-scores is then approximated with a Gaussian distribution. This approximation is well justified by the very large number of articles we have. In practice, already 1,000 articles give a good approximation, as shown by the numerical simulations in Supplementary Fig. 1.
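The closed-form mean and variance of the co-occurrence can be cross-checked numerically. The Python sketch below enumerates every possible split of a small number of edits among three countries, weights each split by its multinomial probability, and compares the resulting moments of x_i x_j with the closed-form expressions. The activity shares and the function names are hypothetical and serve only this check; the brute-force enumeration is feasible only for small numbers of edits and countries.

```python
from itertools import product
from math import factorial, prod

def multinomial_pmf(xs, ps):
    """Probability of the count vector xs under a multinomial with shares ps."""
    coefficient = factorial(sum(xs))
    for x in xs:
        coefficient //= factorial(x)
    return coefficient * prod(p ** x for p, x in zip(ps, xs))

def exact_moments(n, ps, i, j):
    """Mean and variance of x_i * x_j by enumerating all outcomes of n edits."""
    mean = second = 0.0
    for xs in product(range(n + 1), repeat=len(ps)):
        if sum(xs) != n:
            continue
        prob = multinomial_pmf(xs, ps)
        mean += prob * xs[i] * xs[j]
        second += prob * (xs[i] * xs[j]) ** 2
    return mean, second - mean ** 2

def closed_form_moments(n, p_i, p_j):
    """Closed-form mean and variance of x_i * x_j from the text."""
    mean = n * (n - 1) * p_i * p_j
    variance = mean * ((6 - 4 * n) * p_i * p_j + (n - 2) * (p_i + p_j) + 1)
    return mean, variance

ps = (0.5, 0.3, 0.2)  # hypothetical activity shares for three countries
exact = exact_moments(6, ps, 0, 1)
closed = closed_form_moments(6, ps[0], ps[1])
```

For six edits and these shares, the enumerated and closed-form values agree up to floating-point rounding, which supports the term (6 − 4 n_a) p_i p_j in the variance.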