Programming tools: Adventures with R

Tippmann, Sylvia

doi:10.1038/517109a

Toolbox
Published: 29 December 2014

Nature volume 517, pages 109–110 (2015)

13 February 2015 Owing to an editing error, the original version of this article did not make it clear what prevented Rabih Murr from practising R — he was preparing a paper for publication. The text has been updated to reflect this.

A Correction to this article was published on 04 March 2015

A guide to the popular, free statistics and visualization software that gives scientists control of their own data analysis.

Credit: Illustration by The Project Twins

For years, geneticist Helene Royo used commercial software to analyse her work. She would extract DNA from the developing sperm cells of mice, send it for analysis and then fire up a package called GeneSpring to study the results. “As a scientist, I wanted to understand everything I was doing,” she says. “But this kind of analysis didn’t allow that: I just pressed buttons and got answers.” And as Royo’s studies comparing genetic activity on different chromosomes became more involved, she realized that the commercial tool could not keep up with her data-processing demands.

With the results of her first genomic sequencing experiments in hand at the start of a new postdoc, Royo had a choice: pass the sequences over to the experts or learn to analyse the data herself. She took the plunge, and began learning how to parse data in the free, open-source software package R. It helped that the centre she had joined — the Friedrich Miescher Institute for Biomedical Research in Basel, Switzerland — ran regular courses on the software. But she was also following a wider trend: for many academics seeking to wean themselves off commercial software, R is the data-analysis tool of choice.

Besides being free, R is popular partly because it presents different faces to different users. It is, first and foremost, a programming language that requires input through a command line, which may seem forbidding to non-coders. But beginners can surf over the complexities and call up preset software packages, which come ready-made with commands for statistical analysis and data visualization. These packages create a welcoming middle ground between the comfort of commercial ‘black-box’ solutions and the expert world of code. “R made it very easy,” says Royo. “It did everything for me.”

That, indeed, is what R’s developers intended when they designed it in the 1990s. Ross Ihaka and Robert Gentleman, statisticians at the University of Auckland in New Zealand, had an interest in computing but lacked practical software for their needs. So they developed a programming language with which they could perform data analysis themselves. R got its name in part from its developers’ initials, although it was also a reference to S, a widely used statistical programming language at the time.

Credit: Sylvia Tippmann/Source: Elsevier Scopus database

In the early days of the World Wide Web, R quickly attracted interest from scientists around the globe who needed statistical software and were willing to contribute ideas. Gentleman and Ihaka decided to make their source code accessible to everybody, and coding-literate scientists quickly developed packages of pre-programmed routines and commands for particular fields. “I can write software that would be good for somebody doing astronomy,” says Gentleman, “but it’s a lot better if someone doing astronomy writes software for other people doing astronomy.”

Mathematical solutions

Karline Soetaert, an oceanographer at the Royal Netherlands Institute for Sea Research in Yerseke, took up that idea when, in 2008, she wanted to check the health of zooplankton in the estuary of the river Scheldt. Soetaert wanted to calculate how fast zooplankton were dying, using measurements along the river, but R was not equipped for that. To tackle the problem, she worked with two ecologists to develop deSolve — the first package written in R to solve differential equations. “Other software can do that, but it is expensive and closed source,” she notes. Now deSolve is used by epidemiologists modelling infectious diseases, geneticists working on gene-regulatory networks and drug developers working on pharmacokinetics (how compounds behave in living organisms).

By 2003, 10 years after R’s first release, scientists had developed more than 200 packages, and the first citations of the ‘R Project’ appeared. Today, nearly 6,000 packages exist for all kinds of specialized purposes (see ‘R in science’). In a few lines of code, they allow scientists to compare a human and a Neanderthal genome (using Bioconductor), model population growth (IPMpack), predict equity prices (quantmod) and visualize the results in polished graphics (ggplot2). Experts can use R to write up manuscripts, embedding raw code in them to be run by the reader (knitr). Nearly 1 in 100 scholarly articles indexed in Elsevier’s Scopus database last year cites R or one of its packages, and in agricultural and environmental sciences the share is even higher (see ‘A rising tide of R’).

Statistical success

For many users, R’s quality as statistics software stands out. The tool is on a par with commercial packages such as SPSS and SAS, says Robert Muenchen, a statistician at the University of Tennessee in Knoxville who analyses the popularity of software used in statistical computing. In the past decade, R has caught up with and overtaken the market leaders. “Most likely, R became the top statistics package used during the summer of this year,” he says.

In genomics and molecular biology, a software project called Bioconductor was developed on the back of R. It helps scientists to process and compare huge numbers of genetic sequences, to query results against databases such as Gene Expression Omnibus and to upload data to those databases. It includes almost 1,000 packages, some of which help to link the millions of DNA snippets from next-generation sequencing experiments to annotated genes.

For her dive into R, Royo had intensive training: under the supervision of Michael Stadler, head of the Friedrich Miescher Institute’s bioinformatics group, she took about half a year to work on R and Bioconductor. But there are plentiful chances to learn, says Karthik Ram, an ecologist at the Berkeley Institute for Data Science in California who founded rOpenSci, an initiative that helps scientists to adopt and develop R (see ‘An R starter kit’). He and his colleagues teach free courses that do not require existing programming skills and are targeted towards scientists’ specific problems.

One researcher who took that training is Megan Jennings, an ecologist at San Diego State University in California. She tracks bobcats, mountain lions and other wild animals, to understand their movements. Armed with more than 400,000 time-stamped photos to which she had appended species names — taken from 36 cameras running for almost a year — Jennings wanted to follow particular species at particular times of year. At first, she manually selected the photos she wanted and fed them into a black-box program called PRESENCE. But with Ram’s help, she is creating an R package that reads in the tagged photos, cleans them up and then sends customized subsets of the data to a pre-existing modelling package in R. “What took me one hour to do manually, I will now be able to do in five minutes,” Jennings says.

One of the greatest perks of R is its online support. Questions about R on online discussion forums outnumber those about any commercial statistics software, says Muenchen.

“It’s common to see someone post a question and the person who developed the package answer within half an hour,” he says. This rapid response is key for scientists in basic research. “I can find an answer to almost any question online,” says Royo. She can confidently do most of her day-to-day data analysis herself, and she helps out less proficient colleagues. Still, “I google things every day”, she adds. Learning R, says Royo, has not only taught her coding skills, but has also made her more critical about other scientists’ analyses.

Not every scientist is enthusiastic about learning the necessary programming — even though, says Ram, R is less intimidating than languages such as Python (let alone Perl or C). “There are going to be far more scientists that will be comfortable with click-and-drop interfaces than will ever learn to program at any time,” Muenchen says. Geneticist Rabih Murr, for example, took the same R course as Royo when he was a postdoc, but preparing a paper for publication gave him little time to practise. To get started and develop research-specific skills in R definitely requires a commitment: “It’s a matter of priorities,” he says. But after becoming a lab head at the University of Geneva in Switzerland this year, he is planning to hire someone with R experience.

Like any other skill, R cannot be learned overnight. But Jennings says that it is worth it. “Make that time. Set it aside as an investment: for saving time later, and for building skills that can be used across multiple problems we face as scientists.”

R in science

Researchers have used R to devise software packages in all kinds of disciplines. A few are listed below; there are thousands more at the Comprehensive R Archive Network (CRAN).

Astrophysics: The solaR package provides functions to determine the solar radiation that falls on Earth.

Carbon dating: Bchron creates chronologies based on radiocarbon- and non-radiocarbon-dated depths of sediments.

Climate science: raincpc allows researchers to obtain and analyse daily global rainfall data from the US National Oceanic and Atmospheric Administration’s Climate Prediction Center.

Epidemiology: DCluster is a package for the detection of spatial clusters of diseases.

Chemistry: ChemmineR is a cheminformatics toolkit for analysing small molecules in R.

Genetics: Bioconductor provides tools for the analysis of high-throughput genomic data.

Pharmacokinetics: The PKfit package can model the half-life and dose absorption of drugs.

Palaeoecology: Neotoma provides access to data on pollen, fossil mammals and everything else on the Neotoma palaeoecology database.

Oceanography: deSolve is a package for solving differential equations.

Graphics: ggplot2 is one of the most popular visualization packages in R.

Phylogeny: dendextend compares trees of evolutionary relationships.

Genomics: The QuasR package lets researchers quantify and annotate short reads from sequencing experiments.

Change history

04 March 2015
A Correction to this paper has been published: https://doi.org/10.1038/519120a
