Three ways researchers can avoid common programming bugs

Even the best programmers make mistakes – here’s how to detect them before it’s too late.

Andy Tay

10 November 2020

Research that draws on large volumes of data demands greater programming literacy from scientists to ensure accurate findings. Researchers need to be aware of the ‘bugs’ – errors, flaws or faults – that can creep in as the applications required to process big data sets become more complex.

Different programming approaches applied to the same data can lead to inconsistent results. A study published in Nature in May 2020, for example, found great variability in how 70 different research teams, each using its own analysis software, analyzed the same set of functional magnetic resonance imaging (fMRI) data.

The findings, say the authors, led by Rotem Botvinik-Nezer from Tel Aviv University, Israel, “show that analytical flexibility can have substantial effects on scientific conclusions”.

This issue extends to other research areas, such as ecology, where many scientists who create their own software are self-taught and often unaware of best programming practices and tools.

Like typos in a manuscript, errors in code are almost impossible to avoid completely, especially for inexperienced researchers.

Nature Index spoke to three researchers about the kinds of systems that can be put in place to catch these bugs before they lead to mistakes in research findings.

1. Include preventive checks that ensure code runs correctly

“One of the most common problems relates to indexing of a data array or data set, in which a portion of the data – such as a particular subset of observations – is selected for further analysis,” says Russell Poldrack, a professor of psychology and director of the Center for Reproducible Neuroscience at Stanford University in California.

For instance, in a data array with, say, 500 observations (represented in rows) and 100 variables (represented in columns), each value has a unique row and column position (“address”). If the array is incorrectly indexed, the code retrieves the value stored at the wrong address, so the analysis runs on data other than the intended target and produces erroneous findings.

According to Poldrack, researchers can avoid making this type of error by adding a function (a block of reusable code) to their program to check that arrays are the right size.
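A check of this kind might look like the following minimal sketch in Python. The function name `check_shape` and the list-of-rows layout are assumptions made for illustration; with an array library such as NumPy, the same idea reduces to comparing the array's `shape` attribute with the expected dimensions.

```python
def check_shape(rows, expected_rows, expected_cols):
    """Stop with an error unless `rows` has the expected dimensions.

    `rows` is assumed to be a list of rows, one per observation
    (a hypothetical layout chosen for this sketch).
    """
    if len(rows) != expected_rows:
        raise ValueError(
            f"expected {expected_rows} rows, found {len(rows)}"
        )
    for index, row in enumerate(rows):
        if len(row) != expected_cols:
            raise ValueError(
                f"row {index}: expected {expected_cols} columns, found {len(row)}"
            )


# Placeholder data standing in for 500 observations of 100 variables.
data = [[0.0] * 100 for _ in range(500)]
check_shape(data, 500, 100)  # correct size, so the analysis may continue
```

Raising an exception, rather than printing a warning, stops the analysis at the point where the problem is detected.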

For example, if a data set is expected to contain 500 observations of 100 variables, the function should check that the array has the correct number of elements along each dimension, that is, 500 rows by 100 columns. If the function finds that the array size is incorrect, it can alert users to the error and terminate further analysis.

Poldrack adds that another very common issue is invalid or incorrect data values.

A well-known example of this led to a correction being published in PLOS One in 2015, after the age of a study participant was recorded as 32757, which affected the outcome of the analyses.

This error can also occur when missing values are coded, for the sake of convenience, as a numeric value such as −1 or infinity, says Poldrack.

Again, having a function check that data values are appropriate can help. For instance, a function can check that elapsed time values are strictly positive and that counts are non-negative integers.
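The value checks described here could be sketched in Python as follows. The function names and the upper age limit of 120 are assumptions chosen for illustration, not values taken from the article; appropriate limits depend on the study.

```python
import math


def check_elapsed_times(values):
    """Elapsed times must be finite numbers strictly greater than zero."""
    for index, value in enumerate(values):
        if not (math.isfinite(value) and value > 0):
            raise ValueError(f"elapsed time at position {index} is invalid: {value!r}")


def check_counts(values):
    """Counts must be non-negative integers."""
    for index, value in enumerate(values):
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"count at position {index} is invalid: {value!r}")


def check_ages(values, oldest_plausible=120):
    """Ages must lie between 0 and an assumed upper bound (hypothetical)."""
    for index, value in enumerate(values):
        if not 0 <= value <= oldest_plausible:
            raise ValueError(f"age at position {index} is implausible: {value!r}")


# Valid data passes silently and the analysis continues.
check_elapsed_times([0.8, 1.2, 3.5])
check_counts([0, 4, 17])
check_ages([25, 32, 41])
```

With these checks in place, an age of 32757, or a missing value coded as −1 or infinity, raises a `ValueError` instead of flowing silently into the analysis.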

Bug-check: Include functions, especially in complex software, as preventive checks that can terminate computation and notify the user when errors are detected.

2. Use machine-learning algorithms to automate bug detection

One common class of programming error, says Michael Pradel, a professor of computer science at the University of Stuttgart in Germany, occurs when a programmer accidentally uses the wrong name to refer to a variable (a named container for data) or a function.

“For instance, instead of calling a function computePoint(), a developer may accidentally type the name of a different but existing function such as computePoints().”

The “s” makes a difference: these functions would expect different quantities of input data but might not be coded to reject insufficient or surplus data passed to them. As a result, the function could return incorrect calculations.
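To illustrate how such a mix-up can pass unnoticed, the sketch below gives the two function names hypothetical bodies in Python. What the functions really compute is not stated; here they are assumed to scale coordinates, and `computePointUnchecked` is an extra name invented to show the failure.

```python
def computePointUnchecked(point, scale=2):
    """Scale one (x, y) point, with no validation of the input."""
    x, y = point
    return (x * scale, y * scale)


def computePoint(point, scale=2):
    """Scale one (x, y) point, rejecting anything that is not a pair of numbers."""
    if len(point) != 2 or not all(isinstance(c, (int, float)) for c in point):
        raise TypeError("computePoint expects a single (x, y) pair of numbers")
    x, y = point
    return (x * scale, y * scale)


def computePoints(points, scale=2):
    """Scale every (x, y) point in a sequence of points."""
    return [computePoint(point, scale) for point in points]


two_points = [(1, 2), (3, 4)]

# Wrong function, no check: Python repeats each tuple instead of scaling it.
computePointUnchecked(two_points)  # returns ((1, 2, 1, 2), (3, 4, 3, 4))

# Intended call.
computePoints(two_points)  # returns [(2, 4), (6, 8)]
```

Calling `computePoint(two_points)` raises a `TypeError`, so the slip is caught at once rather than producing a plausible-looking but incorrect result.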

According to Pradel, machine-learning algorithms, which can learn from and respond to new data on the go, can do a better job of finding incorrectly typed or named variables and functions than conventional bug detectors, which are more limited in scope because they are manually designed and tuned.

For example, in 2018, he published a paper in conference proceedings about such a learning-based bug-detector.

Pradel advises programmers to use longer, more descriptive names in their source code. For instance, the name ‘time_elapsed_in_seconds’ may be preferred to the name ‘time’ because the former contains more information (that the variable of interest is the ‘elapsed time’ and the unit of measure is ‘seconds’).
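The difference is easy to see in a short Python sketch. Apart from ‘time’ and ‘time_elapsed_in_seconds’, every name below is hypothetical and chosen only for illustration.

```python
# Terse names: nothing tells a reader, or a tool, which unit `time` is in.
def speed(d, time):
    return d / time


# Descriptive names: the expected quantity and unit are part of the code.
def average_speed_in_metres_per_second(distance_in_metres, time_elapsed_in_seconds):
    return distance_in_metres / time_elapsed_in_seconds


time_elapsed_in_minutes = 5

# The mismatch between "minutes" and "seconds" is visible at the call site,
# which prompts the conversion.
result = average_speed_in_metres_per_second(
    distance_in_metres=1500,
    time_elapsed_in_seconds=time_elapsed_in_minutes * 60,
)
```

With the terse version, `speed(1500, 5)` would run without complaint and return a value that is wrong by a factor of 60 if seconds were intended.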

Learning from names is an effective way for a machine-learning program to "understand" what the programmer intended to write, and hence, identify code that is using variables to hold inappropriate values.

Bug-check: Collaborate with experts or make use of publicly available machine-learning algorithms to automate bug detection.

3. Embrace open science to access community review

“It is common for programmers to presume that their code is error-free, but this is often not the case,” says Cassandra Gould van Praag, an open science community engagement coordinator at the University of Oxford in the United Kingdom.

This leads to code that is not being reviewed for bugs, says Gould van Praag, who recommends that researchers embrace ‘open science’ – which encourages researchers to make their code, methods, documentation, and data (where it is safe to do so) available to the wider community, often before it’s published in a journal.

Gould van Praag acknowledges that researchers may be worried about being ‘scooped’ before publication if they share their data too early, and that legal stipulations related to sensitive personal data can make sharing data difficult.

Where it is not possible to fully anonymize data to share alongside code, she advises that in some cases researchers may be able to simulate data that follow the same distributions as their own.
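As a rough illustration of that idea, the Python sketch below fits a normal distribution to one variable and draws synthetic values from it. The function name, the made-up reaction times and the assumption of normality are all hypothetical; real data sets usually need a model that also preserves relationships between variables, and whether summary statistics may be shared depends on the rules that apply to the data.

```python
import random
import statistics


def simulate_like(real_values, n_samples, seed=0):
    """Draw synthetic values from a normal distribution fitted to `real_values`.

    Assumes a single, roughly normal variable. Only its mean and standard
    deviation are carried over, never the original observations.
    """
    mean = statistics.mean(real_values)
    spread = statistics.stdev(real_values)
    rng = random.Random(seed)  # fixed seed so reviewers get the same data
    return [rng.gauss(mean, spread) for _ in range(n_samples)]


# Made-up reaction times in seconds, standing in for confidential data.
real_reaction_times = [0.42, 0.51, 0.38, 0.47, 0.55, 0.44]
synthetic_reaction_times = simulate_like(real_reaction_times, n_samples=1000)
```

Reviewers can then run the analysis code on `synthetic_reaction_times` without ever seeing the original measurements.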

Alternatively, a code review can be conducted remotely via a screen share, where the person conducting the review does not have direct access to the raw data at any point. Researchers will need to understand what is necessary in their discipline and country to protect the anonymity of human participants before data can be shared alongside code.

This can be a complex issue, but “the best way to get started is to join a community that is already practising open science to know how to do it right,” she says.

“Ultimately, open science is about getting additional sets of eyes on our work and developing the awareness that no code is perfect,” says Gould van Praag.

Bug-check: Join an open science community, such as the Open Research Calendar, a Twitter community that shares international open science events, or ReproducibiliTEA, a journal club that discusses ways to improve scientific reproducibility, to connect with researchers who can help review data and code at an early stage.
