Devil in the details

Nature 470, 305–306 (2011). doi:10.1038/470305b

To ensure their results are reproducible, analysts should show their workings.

As analysis of huge data sets with computers becomes an integral tool of research, how should researchers document and report their use of software? This question was brought to the fore when the release of e-mails stolen from climate scientists at the University of East Anglia in Norwich, UK, generated a media fuss in 2009, and has been widely discussed, including in this journal. The issue lies at the heart of scientific endeavour: how detailed an information trail should researchers leave so that others can reproduce their findings?

The question is perhaps most pressing in the field of genomics and sequence analysis. As biologists process larger and more complex data sets and publish only the results, some argue that the reporting of how those data were analysed is often insufficient.

Take a recent survey by comparative genomicist Anton Nekrutenko at Pennsylvania State University in University Park and computer scientist James Taylor of Emory University in Atlanta, Georgia. The pair examined 14 sequencing papers published last year in Science, Nature and Nature Genetics, and found that the publications often lacked essential details needed to reproduce the analysis — the papers merely referenced bioinformatics software, for example, without noting the version used or the values of key parameters.
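
Recording those details need not be onerous. As a minimal sketch of the idea (not the survey authors' method; the tool names, flags and file paths below are purely hypothetical), a few lines of wrapper code can capture the exact version and command line of each program alongside the results it produces:

    import json
    import subprocess
    from datetime import datetime, timezone

    def run_and_log(step_name, command, logfile="provenance.json"):
        """Run one analysis step and append its exact command, tool version
        and timestamp to a provenance log kept next to the results."""
        tool = command[0]
        # Ask the tool to report its version; the '--version' flag is an
        # assumption and varies between bioinformatics programs.
        probe = subprocess.run([tool, "--version"], capture_output=True, text=True)
        version = (probe.stdout + probe.stderr).strip()
        subprocess.run(command, check=True)

        record = {
            "step": step_name,
            "tool": tool,
            "version": version,
            "command": " ".join(command),
            "run_at": datetime.now(timezone.utc).isoformat(),
        }
        try:
            with open(logfile) as f:
                log = json.load(f)
        except FileNotFoundError:
            log = []
        log.append(record)
        with open(logfile, "w") as f:
            json.dump(log, f, indent=2)

    # Hypothetical alignment step: the aligner name, reference and parameter
    # values are illustrative only, but every flag ends up in the log.
    run_and_log("align_reads",
                ["aligner", "--seed-length", "19", "--threads", "8",
                 "ref.fa", "reads.fq"])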

The two researchers presented their findings at the Advances in Genome Biology and Technology meeting in Marco Island, Florida, on 2 February. Although their account has not been published, it does not seem to have surprised anyone in the field. Indeed, it builds on a 2009 paper in Nature Genetics that found similar omissions in published accounts of microarray experiments (J. P. A. Ioannidis et al. Nature Genet. 41, 149–155; 2009). In that case, findings from 10 of the 18 studies analysed could not be reproduced, probably because of missing information.

If genomics were as politicized as climate science, the authors of studies in which the information trail is missing would probably face catcalls, conspiracy charges and demands for greater transparency and openness. Instead, many in the field merely shrug their shoulders and insist that this is simply how things are done. Bioinformatics is a fast-paced science in which software and standards for data analysis change rapidly, and the protocols and workflows of users change with them.

Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do a similar analysis.

Some in the field say that it should be enough to publish only the original data and final results, without providing detailed accounts of the steps in between. Others argue that it is pointless to document the version of the software used, as new incarnations of programs differ little. But that is not always the case. Edward McCabe, then at the California NanoSystems Institute at the University of California, Los Angeles, was so perturbed when different versions of the same bioinformatics software gave wildly different results that he published a paper on it (N. K. Henderson-Maclennan et al. Mol. Genet. Metab. 101, 134–140; 2010). Reviewers resisted its publication, asking what was new about the findings, as it was already common knowledge that different software versions could dramatically affect analyses. There is a troubling undercurrent here: that the problem lies not with the lack of information, but rather with those who find the incomplete information a problem, such as researchers who are new to the field.

Transparency is a laudable goal, but given the complexity of the analyses, is it realistic? There are certainly examples of stellar documentation. The 1000 Genomes Project, for example, an effort to sequence and analyse more than a thousand genomes, has carefully detailed its workflows, and makes both its data and its procedures available for the world to see. It is perhaps easier for members of that project — which is essentially repeating the same procedure more than a thousand times — to practise good experimental hygiene than it is for individual scientists, who have more flexible and varied research goals. Nevertheless, tools are coming online to simplify documentation of the complex analyses required for genome analysis. These include freely available programs such as Taverna (http://www.taverna.org.uk) and Nekrutenko's more user-friendly Galaxy (http://main.g2.bx.psu.edu). Neither of these is perfect, but they illustrate the level of detail that could enrich published reports.
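
The same discipline helps on the reader's side. As a sketch of the sort of check that a well-documented workflow makes possible (the step names, tools and version strings here are invented for illustration, not drawn from Galaxy or Taverna), one can compare the versions declared in a paper's methods with those installed locally before attempting to repeat the analysis:

    import shutil
    import subprocess

    # Steps as they might be declared in a methods section; the tools,
    # versions and parameters below are hypothetical.
    DECLARED_STEPS = [
        {"tool": "qc_tool", "version": "0.10.1", "params": "--noextract"},
        {"tool": "aligner", "version": "2.0.5",  "params": "--very-sensitive"},
    ]

    def check_environment(steps):
        """Warn when an installed tool is missing or reports a version
        different from the one declared in the published workflow."""
        for step in steps:
            path = shutil.which(step["tool"])
            if path is None:
                print(f"MISSING   {step['tool']} (declared {step['version']})")
                continue
            probe = subprocess.run([path, "--version"],
                                   capture_output=True, text=True)
            installed = (probe.stdout + probe.stderr).strip()
            if step["version"] in installed:
                print(f"OK        {step['tool']} {step['version']}")
            else:
                print(f"MISMATCH  {step['tool']}: declared {step['version']}, "
                      f"installed reports '{installed}'")

    check_environment(DECLARED_STEPS)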

As genome sequencing spreads from the large, centralized sequencing centres that largely pioneered the technique into smaller labs and clinics, it is important that the community consider such solutions.

Comments

  1. Anurag Chaurasia said:

    We scientists should keep in mind that the simplest, most easily understandable version of our publications will be the most useful to society in general, and to new, up-and-coming researchers in particular.
    Anurag Chaurasia, ICAR, India, anurag_vns1@yahoo.co.in, anurag@nbaim.org.in, +91 9452196686

  2. Kostas Karasavvas said:

    There has been work to make Taverna and Galaxy more interoperable. That is, one can use Galaxy's intuitive interface to call the more expressive Taverna workflows, thus leveraging the advantages of each system. Analysis data from Galaxy can be made available in Taverna and vice versa.

    More details at: https://trac.nbic.nl/elabfactory/wiki/eGalaxy

  3. Dan Clutter said:

    I always thought papers (sequencing or otherwise) should provide enough information to allow other researchers to reproduce the work.

  4. Andrey Shevel said:

    "... to write their own code to do a similar analysis..." - the codes might be huge in volume and understanding. It seems impractical in real life.
    More feasible method is to use for the analysis only free and open source software (FOSS) packages (which are standards de facto in the class of study) with relatively small corrections/additions. In such the way anyone else could be able to test the analysis algorithms against another set of the experimantal data.

  5. Richard Dale said:

    "Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do a similar analysis."

    Does no-one else see anything wrong with this? As I think about it I see more and more problems with this policy.

    First, and most obviously, it does not allow interested readers to look through the code for errors. How does a reader know that the code does what the researchers claim without writing an entirely new program? That is far harder than searching for problems in extant code (especially well-documented code), and takes a far more dedicated critic with spare time (how many competent researchers are at such a loose end that they can afford to spend time rewriting code?).

    Second, it prevents anyone who is not a competent programmer from examining the code at all. I would be far more confident of finding an error in someone else's code than of writing my own code that would expose such an error.

    Third, if I did write code that gave different results, I would not know whether my code or the author's was incorrect. Before writing a criticism I would have to be very certain of my own code; it is difficult to proofread one's own work and impossible to guarantee that it contains no errors. Of course, finding a positive (an error) is far easier than being confident of a negative (no errors in my own code).

    Fourth, if any errors are found, the original author can simply say "well, my code supports my conclusions". The argument then comes down to the relative authority of the author and the critic, which is of course a fallacy and entirely unhelpful. This simply reinforces the "consensus", which we all know is anti-scientific: the whole point of the great advances in science has been that they went against the consensus.

    Finally, there is a very human tendency to leave some detail out of a description, especially detail considered either irrelevant or so obvious that it need not be stated, and even more so in highly specialised science where "everyone knows". This is a problem, usually trivial or slight, in any paper, but here it introduces another layer of difficulty: once the blindly literal computer is involved, such omissions can become far more important. It means that anyone considering the paper must know the authors' whole field; anyone with merely overlapping knowledge cannot study any part of the analysis performed by the code, because without full knowledge the code cannot be repeated. For example, a statistician could not find an error in the statistical methods employed within the code. Considering how often ignorance of statistical handling has led researchers in many scientific fields to erroneous conclusions, this is a very important example.

    Can anyone else think of more problems with this policy?

  6. Steve N said:

    "Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do a similar analysis."

    This part of the article leapt out at me, and I was going to state my concerns with it when I noticed that Richard Dale has already listed the problems with this policy far more eloquently than I could. I would just add that this policy is clearly "mistaking the map for the territory". With the increasing reliance on computer processing, and memory capacity and bandwidth becoming ever cheaper, I can't imagine why making the original code available, "warts and all", is not mandatory.

  7. Cheng Xiaofeng said:

    The same problem exists in the field of ecology.
    There are many complex data sets in ecological research, and analysing them is long, time-consuming and strenuous work. The original data might be filtered before the statistical analysis, yet most papers never mention the criteria used to select them. Beyond that, many of the analysis methods offered by statistical software rest on important assumptions. In most papers, the authors simply tell us that the data were analysed with a certain method in a certain package, without saying whether the data satisfy its assumptions. If they do not, even statistically significant results are not robust. No one seems to care, and so questionable results have appeared in journals of all ranks, misleading other researchers for a long time.

    From Kunming Institute of Botany, Chinese Academy of Sciences, cxf20041382@163.com

  8. Ari Loytynoja said:

    One should aim at robust results, not just at reproducible ones. In sequence analysis, many methods are based on heuristic algorithms and do not guarantee finding globally optimal solutions. Furthermore, there are typically numerous solutions that, given our imperfect models and scoring systems, seem equally or nearly equally good and should all be considered possibly correct. If the aim is to provide reproducible results, the method has to hide these alternative solutions and always return one of them, the same for every person analysing the data. This may sound desirable, but it has a downside: the solution picked by the method may not be the correct one, and everybody ends up reproducing the same wrong answer. In my opinion, individual analyses (such as sequence alignments produced by heuristic algorithms) do not need to be reproducible for the analysis to be sound: conclusions that remain valid despite slight variations in the data are more likely to be correct.
