
Wikipedia is the world’s fifth-most-viewed website. Credit: Chris Batson/Alamy

The most-cited journal articles on Wikipedia include papers on the names of lunar craters and the DNA sequences of human and mouse genes — and many of the most popular works are referenced more times in the online encyclopaedia than they are in the scientific literature.

“It is pretty incredible that almost all the highly cited articles are science articles,” says Matt Miller, a data scientist and librarian based in New York City. Miller analysed citation data released in March by the Wikimedia Foundation, the non-profit organization in San Francisco, California, that runs Wikipedia. The data set — which contains some 15.7 million records — shows how many times sources with formal identifiers such as ISBNs (international standard book numbers) and DOIs (digital object identifiers) are referenced across all of Wikipedia’s nearly 300 language editions. Wikimedia notes that most publications cited by identifiers on Wikipedia are books, but Miller looked specifically at the numbers for publications with DOIs — the most widely used identifier for journal articles — on the English-language version of Wikipedia. His data set contains 1.2 million citations that used DOIs, referencing more than 835,000 unique articles.
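The kind of tally described above can be sketched in a few lines of Python. This is a minimal illustration, not Miller's actual method: the record layout (page title, identifier type, identifier) and every value in `records` are hypothetical placeholders, and the real data set may be structured differently.

```python
from collections import Counter

# Hypothetical records: (page_title, identifier_type, identifier).
# The layout and the values are illustrative assumptions, not real data.
records = [
    ("Gene A", "doi", "10.1000/example.1"),
    ("Gene B", "doi", "10.1000/example.1"),
    ("Star C", "doi", "10.1000/example.2"),
    ("Book D", "isbn", "0000000000"),
    ("Gene E", "doi", "10.1000/EXAMPLE.1"),
]


def count_doi_citations(rows):
    """Return a Counter mapping each DOI to the number of times it is cited."""
    return Counter(
        identifier.lower()  # DOIs are case-insensitive, so normalise first
        for _title, id_type, identifier in rows
        if id_type == "doi"  # skip ISBNs and any other identifier types
    )


counts = count_doi_citations(records)
total_citations = sum(counts.values())  # every DOI citation, repeats included
unique_articles = len(counts)           # distinct DOIs

print(counts.most_common(10))
print(total_citations, unique_articles)
```

Keeping the total number of citations separate from the number of distinct DOIs mirrors the two figures quoted above: 1.2 million citations that point to more than 835,000 unique articles.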

The most-referenced paper, with 4,702 citations across English Wikipedia, is a 2002 collection of more than 15,000 sequences of human and mouse genes (see ‘English Wikipedia’). The Wikipedia pages that reference the study are almost exclusively entries about single genes or proteins. “It’s a pleasant surprise,” says Robert Strausberg, a cancer researcher who led the project and is now deputy scientific director at the Ludwig Institute for Cancer Research in New York City.

An expanded version of the gene collection, published in 2004, is the second-most-cited article, with about 3,400 references (by contrast, it has 487 citations in the scientific literature, according to Google Scholar). Daniela Gerhard, a cancer geneticist at the US National Cancer Institute in Bethesda, Maryland, and a co-author of the paper, says that the publications are probably cited so frequently because they provide accessible information about the sequences of expressed genes.

In all, five articles in the top ten are about DNA catalogues, including one study that details a method of generating such collections. A 2005 map of nearly 3,000 human protein interactions also features on the list, at number seven. (Wikimedia’s original post notes: “Unsurprisingly, Wikipedians love reference works.”)

Astronomy articles make up the rest of the list, with four entries. The third-most-referenced paper, cited by nearly 3,000 English Wikipedia pages, is a 2007 study that helped researchers to interpret the results of Hipparcos, the first space mission to measure the positions, distances and brightness of stars.

Other space-science papers on the list cover the size and brightness of asteroids, and the names of lunar craters (in a 1971 publication that has just 16 citations in the scientific literature, according to Google Scholar). These papers are probably highly cited because they are reliable references for the many celestial bodies that have their own Wikipedia pages, says astronomer Floor van Leeuwen at the University of Cambridge, UK, who wrote the Hipparcos study.

Bots’ work

Wikipedia, which launched in 2001, receives about 16 billion page views per month and is currently the world’s fifth-most-visited website. Anybody can create an article or edit an existing one, but the site’s guidelines require writers and editors to attribute quotes and information to published sources such as books or scholarly papers.

A separate analysis of the Wikimedia data dump by Ross Mounce, who directs open-access programmes at the London-based philanthropic foundation Arcadia Fund, reveals the ten most-cited DOI articles across all of the encyclopaedia’s language editions (see ‘All Wikipedia language editions’). Six of the articles are the same, but the first entry is notably different. The top-referenced DOI article is a 2007 paper updating a century-old classification of the global climate, which has a whopping 2.8 million citations, but only 169 on English Wikipedia (the second-most-cited source across all editions has just over 21,000 references).

The climate study is so heavily cited because millions of its citations come from pages created by an automated computer program. The bot, developed by physicist Sverker Johansson at Dalarna University in Falun, Sweden, had produced nearly 3 million articles as of July 2014, according to Wikipedia. One-third of the articles are in Swedish and the rest are in Cebuano and Waray, two languages spoken in the Philippines. The bot has produced millions of articles about geographic locations such as towns and islands, and most of those articles include information about the local climate type, with a reference to the climate study, says Johansson. He adds that he has no precise figures for the bot-generated citations of the climate paper, “but 2.8 million is in the right ballpark”.

Mounce notes that other articles might be heavily cited on Wikipedia but not formally referenced by their DOIs — and instead referenced by other means, such as their PubMed ID numbers.

Citations are important if people are to trust information, says John Chodacki, director of the University of California Curation Center, who is based in Berkeley. “That’s true for journal articles and also for Wikipedia pages,” he says. But analysing and comparing citation data across scholarly papers has historically been possible using only paywalled services. “One of the most interesting things is that this information is available at all.”