Ecology of the digital world of Wikipedia

Wikipedia, a paradigmatic example of an online knowledge space, is organized in a collaborative, bottom-up way with voluntary contributions, yet it maintains a level of reliability comparable to that of traditional encyclopedias. The lack of selected professional writers and editors makes judging the quality and trustworthiness of the articles a real challenge. Here we show that a self-consistent metric for the network defined by the edit records captures well the character of the editors' activity and the articles' level of complexity. Using our metric, one can better identify the human-labeled high-quality articles, e.g., "featured" ones, and differentiate them from the popular and controversial articles. Furthermore, the dynamics of the editor-article system is also well captured by the metric, revealing the evolutionary pathways of articles and the diverse roles of editors. We demonstrate that the collective effort of the editors indeed drives articles in the direction of improvement.


The effect of trimming the low-degree articles
The number of editors working on an article (i.e., the article degree) influences the goodness of the article. Although the contributions of individual editors to an article can differ greatly, low-degree articles tend to contain poor information. We therefore set a trimming threshold on the article degree: the network with threshold $x$ is the network obtained after trimming the articles with degree $k_a \le x$. The network with threshold 0 is the original network.
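The trimming rule above can be illustrated with a minimal sketch. The input format (a list of `(editor, article)` pairs) and the function name `trim_low_degree_articles` are assumptions made for illustration, not part of the original analysis.

```python
from collections import defaultdict

def trim_low_degree_articles(edits, threshold):
    """Keep only edit records of articles whose degree exceeds `threshold`.

    edits: iterable of (editor, article) pairs (hypothetical input format).
    The degree k_a of an article is the number of distinct editors who
    edited it. Articles with k_a <= threshold are removed, so threshold 0
    returns the original network.
    """
    edits = list(edits)  # the records are scanned twice
    editors_of = defaultdict(set)
    for editor, article in edits:
        editors_of[article].add(editor)
    kept = {a for a, eds in editors_of.items() if len(eds) > threshold}
    return [(e, a) for e, a in edits if a in kept]
```

For example, with records `[("e1", "A"), ("e2", "A"), ("e1", "B")]`, threshold 1 removes article `B` (degree 1) and keeps both records of article `A` (degree 2).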
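To assess the effect of the trimming threshold on our self-consistent analysis, we calculate the correlation of the complexity measure between the networks with different threshold values. The Pearson correlation coefficient $p_{xy}$ of the complexity measure between two networks with threshold values $x$ and $y$ is defined as

$$p_{xy} = \frac{\sum_\alpha \left(C^x_\alpha - \overline{C^x}\right)\left(C^y_\alpha - \overline{C^y}\right)}{\sqrt{\sum_\alpha \left(C^x_\alpha - \overline{C^x}\right)^2}\,\sqrt{\sum_\alpha \left(C^y_\alpha - \overline{C^y}\right)^2}},$$

where $C^x_\alpha$ and $C^y_\alpha$ denote the complexity of article $\alpha$ in the networks with threshold $x$ and $y$, respectively. The average complexity of the network is, by definition, $\overline{C^x} = \overline{C^y} = 1$. The Spearman rank correlation coefficient $\rho_{xy}$ is given by the Pearson correlation coefficient calculated on the ranks of the data instead of the original values,

$$\rho_{xy} = \frac{\sum_\alpha \left(R_{C^x_\alpha} - \overline{R_{C^x}}\right)\left(R_{C^y_\alpha} - \overline{R_{C^y}}\right)}{\sqrt{\sum_\alpha \left(R_{C^x_\alpha} - \overline{R_{C^x}}\right)^2}\,\sqrt{\sum_\alpha \left(R_{C^y_\alpha} - \overline{R_{C^y}}\right)^2}},$$

where $R_{C^x_\alpha}$ and $R_{C^y_\alpha}$ denote the complexity rank of article $\alpha$ in the networks with threshold $x$ and $y$, respectively, and $\overline{R_{C^x}}$ and $\overline{R_{C^y}}$ denote the average ranks. As shown in Fig. 6, there are three different fixed points, corresponding to the networks with thresholds 0 to 2, the networks with thresholds 3 to 6, and the networks with thresholds ≥ 7.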
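The two correlation coefficients described above can be computed as in the following sketch. It assumes that the complexities of each network are stored in a dictionary keyed by article, and that the comparison is restricted to the articles present in both networks; neither detail is stated in the text. Tied values receive the average of their ranks, which is a common convention rather than a confirmed choice of the authors.

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    du, dv = u - u.mean(), v - v.mean()
    return float((du * dv).sum() / np.sqrt((du ** 2).sum() * (dv ** 2).sum()))

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for val in np.unique(values):
        mask = values == val
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(u, v):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(u), average_ranks(v))

def complexity_correlations(cx, cy):
    """Return (Pearson, Spearman) for two article -> complexity dicts.

    Assumption: only articles present in both networks are compared.
    """
    common = sorted(set(cx) & set(cy))
    a = [cx[k] for k in common]
    b = [cy[k] for k in common]
    return pearson(a, b), spearman(a, b)
```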
The top-N hit rate for finding the "featured" articles depends on the threshold value. The finding accuracy does not increase monotonically with the threshold value, as shown in the right panel of Fig. 6. The original network picks up "featured" articles faster than the network with threshold 1. As the threshold value increases from 1 to 19, the accuracy increases for the networks with small thresholds but decreases for the networks with large thresholds. The networks with thresholds 9 and 10 work best. We use the optimal threshold 9 for our self-consistent analysis.

Fig. 6. Spearman's rank correlation of the complexity rank. (Right) Cumulative number of "featured" articles contained in the top-ranked articles in complexity measure for different threshold values. To find the "featured" articles, the networks after trimming the articles with degree $k_a \le 9$ (orange) and $k_a \le 10$ (magenta) work best.
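The cumulative count of "featured" articles among the top-ranked articles can be sketched as follows. The hit rate is taken here to be the fraction of the top-N articles that are "featured"; this reading, the dictionary input and the function names are assumptions for illustration.

```python
def cumulative_featured(complexity, featured):
    """Cumulative number of featured articles along the complexity ranking.

    complexity: dict article -> complexity (hypothetical input format).
    featured: set of articles labeled "featured".
    Entry N-1 of the result is the number of featured articles among the
    N articles with the highest complexity.
    """
    ranked = sorted(complexity, key=complexity.get, reverse=True)
    counts, hits = [], 0
    for article in ranked:
        hits += article in featured
        counts.append(hits)
    return counts

def top_n_hit_rate(complexity, featured, n):
    """Fraction of the top-n articles that are featured (assumed definition)."""
    return cumulative_featured(complexity, featured)[n - 1] / n
```

Comparing these curves across networks trimmed at different thresholds reproduces the kind of comparison described for the right panel of Fig. 6.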

Relation to the metric of country fitness and product complexity
Our formulation is inspired by the self-consistent measure proposed for countries' economic fitness F and products' complexity Q in the world trade network [2, 3], calculated as in Eqs. (12). The reason why we cannot simply adopt the original definition of quality for the article complexity is the following. At first glance, one might be inclined to regard the editors as the countries and the articles as the products, following the natural causal relation: editors write the articles and not vice versa. However, in the Wikipedia network there is no capability that limits which articles an editor can touch (e.g., anyone can edit both an article on quantum physics and one on the mid-century history of a certain local village in Japan). Therefore the selectiveness of the editors editing an article is not a good measure of its complexity or goodness, which means that this straightforward application is not appropriate for the Wikipedia network. On the contrary, the selectiveness of the articles on the opposite side of the Wikipedia bipartite network gives good information about the editors. Because we consider only the top editors here, all of them edit thousands of articles. Some of the edits are content edits, while others are maintenance-type edits such as small or systematic corrections. Although it is hard to distinguish such maintenance-like activity from the edit size or other information, we can expect that an editor who edits very many articles (some indeed edit millions), especially including articles with low "goodness" (complexity), has a lower probability of making a content edit that contributes to the "goodness" of articles. An edit record concentrated mainly on "good" articles, on the other hand, can be regarded as a good indication that each of the editor's edits contributes more to the content of the article. This character of an editor, rather than its "fitness", is measured by the scatteredness in our framework.
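For reference, the fitness-complexity iteration of the economic-complexity literature is commonly written in the following form. The notation is an assumption: $M_{ij}$ is the binary country-product matrix, $i$ indexes countries, $j$ indexes products, $n$ is the iteration step, and $\langle\cdot\rangle$ is the average over countries or products. The original Eqs. (12) may use different symbols.

```latex
\begin{aligned}
\tilde{F}_i^{(n)} &= \sum_j M_{ij}\, Q_j^{(n-1)}, &
\tilde{Q}_j^{(n)} &= \frac{1}{\sum_i M_{ij}\, \dfrac{1}{F_i^{(n-1)}}}, \\
F_i^{(n)} &= \frac{\tilde{F}_i^{(n)}}{\langle \tilde{F}^{(n)} \rangle}, &
Q_j^{(n)} &= \frac{\tilde{Q}_j^{(n)}}{\langle \tilde{Q}^{(n)} \rangle}.
\end{aligned}
```

In this form a product is penalized when it is exported by low-fitness countries, which is the "capability" argument that does not carry over directly to Wikipedia editors.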
Note that for the binary network $b_{\varepsilon\alpha}$, our definition is equivalent to taking the inverse of the products' complexity index as the new complexity index for articles in the original definition for the economic relation (Eqs. (12)), i.e., $C_j = Q_j^{-1}$, and is therefore essentially the same as taking the inverse rank for the complexity in the original definition, except for the normalization condition. In the present work, we use the weighted network $w_{\varepsilon\alpha}$, mainly because of its better performance as shown in Fig. 7, and hence this direct relation is lost. However, the mapping for the binary network still provides a good guide for considering the convergence of our non-linear recursion process to a non-trivial fixed point with a smooth distribution of the resulting values, thanks to the intensive work on the convergence condition for the fitness-complexity measure [4].

The relation between the cumulative numbers of "featured" and "controversial" articles, which shows the performance of the complexity measure in finding "featured" articles without picking "controversial" articles.

Top-10 upward-moving "controversial" articles in the top 1% strength area in the $r_S$-$r_J$ plane. Five of these top-10 articles (red) are from the "Science, Biology, and Health" category, while the share of this category among all "controversial" articles is about 10%. The top-10 articles are overlaid on the other "controversial" articles (gray).

"Newly featured" articles with eye-catching downward motion (red) overlaid on the "Newly featured" articles (gray).
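The binary-network mapping $C_j = Q_j^{-1}$ noted above can be sketched numerically. Substituting it into the usual fitness-complexity iteration gives an editor-side value that sums the inverse complexities of the edited articles, and an article-side value that sums the inverse editor values. The code below is a minimal sketch under that substitution only: the variable names, the mean-one normalization of both vectors, and the fixed number of iterations are assumptions, and the weighted network actually used in this work is not reproduced.

```python
import numpy as np

def iterate_binary(b, n_iter=20):
    """Iterate the inverse-complexity form of fitness-complexity.

    b: binary editor-by-article matrix (rows: editors, columns: articles).
       Every row and column is assumed to contain at least one 1, so that
       all values stay positive.
    Returns (s, c): editor values and article complexities, each
    normalized to mean 1 (assumed normalization).
    """
    b = np.asarray(b, dtype=float)
    n_editors, n_articles = b.shape
    s = np.ones(n_editors)
    c = np.ones(n_articles)
    for _ in range(n_iter):
        # Both updates use the values of the previous step.
        s_new = b @ (1.0 / c)    # editors of many low-complexity articles score high
        c_new = b.T @ (1.0 / s)  # articles edited by selective editors score high
        s = s_new / s_new.mean()
        c = c_new / c_new.mean()
    return s, c
```

For a toy matrix in which one editor touches every article and a second editor touches only the first article, the first article ends up with the highest complexity and the first editor with the larger editor value. No convergence check is included; whether the iteration settles to a smooth fixed point depends on the structure of the matrix.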