Article | Published:

Early onset of structural inequality in the formation of collaborative knowledge in all Wikimedia projects

Nature Human Behaviour volume 3, pages 155–163 (2019)


The Wikimedia project, including Wikipedia, is one of the largest communal data sets and has served as a representative medium to convey collective knowledge in the twenty-first century. Researchers have believed that the analysis of these collaborative digital data sets provides a unique window into the processes of collaborative knowledge formation; yet, in reality, most previous studies have focused on narrow subsets of these data. Here, by analysing all 863 Wikimedia projects (various types and in different languages), we find evidence for a universal growth pattern in communal data formation. We observe that inequality arises early in the development of Wikimedia projects and stabilizes at high levels. To understand the mechanism behind the observed structural inequality, we develop an agent-based model that considers the characteristics of the editors and successfully reproduces the empirical results. Our findings from the Wikimedia projects data, along with other types of collaboration data, such as patents and academic papers, show that a small number of editors have a disproportionately large influence on the formation of collective knowledge. This analysis offers insights into how various collaboration environments can be sustained in the future.
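To make the notion of inequality among editors concrete, the following is a minimal sketch of one common concentration measure, the Gini coefficient, applied to per-editor edit counts. It is an illustration under stated assumptions, not the authors' pipeline: the function name `gini` and the sample edit counts are hypothetical, and this page does not state which measure the study itself uses.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means every editor contributes equally; values near 1 mean a few
    editors account for almost all contributions.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on the ascending-sorted values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical edit counts per editor (illustrative only, not real data).
equal_project = [25, 25, 25, 25]
skewed_project = [0, 0, 0, 100]

print(gini(equal_project))
print(gini(skewed_project))
```

Tracking such a value over a project's history would be one way to check whether inequality rises early and then levels off, as the abstract describes.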

Data availability

Wikimedia dumps for the main analysis are available from Wikimedia Downloads. Additional public data sets are also available from the following sources: OECD; UNESCO; and the CIA. The data set of the total number of speakers for each language is owned by SIL International and can be accessed from their website by means of a subscription. Bibliographic metadata of academic papers and patents were retrieved from the in-house system of the Korea Institute of Science and Technology Information and were licensed from Scopus and the European Patent Office; distribution is prohibited. The pre-processed data used to create the figures are available from GitHub, along with the code.

Additional information

This work received institutional support from the Korea Institute of Science and Technology Information. It was also supported by National Research Foundation (NRF) of Korea grants funded by the Korean Government, grant nos. NRF-2017R1E1A1A03070975 (J.Y.), NRF-2018R1C1B5083863 (S.H.L.) and NRF-2017R1A2B3006930 (H.J.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information


  1. Future Technology Analysis Center, Korea Institute of Science and Technology Information, Seoul, Korea

    • Jinhyuk Yun
  2. Department of Liberal Arts, Gyeongnam National University of Science and Technology, Jinju, Korea

    • Sang Hoon Lee
  3. Department of Physics, Korea Advanced Institute of Science and Technology, Daejeon, Korea

    • Hawoong Jeong
  4. Institute for the BioCentury, Korea Advanced Institute of Science and Technology, Daejeon, Korea

    • Hawoong Jeong


All three authors designed the experiment and wrote the manuscript. J.Y. collected and analysed the data.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Sang Hoon Lee or Hawoong Jeong.

Supplementary information

  1. Supplementary Information

    Supplementary Figs. 1–27, Supplementary Tables 1–7, Supplementary Methods, Supplementary References

  2. Reporting Summary

