Toward effective software solutions for big biology

Prins, Pjotr; de Ligt, Joep; Tarasov, Artem; Jansen, Ritsert C; Cuppen, Edwin; Bourne, Philip E

doi:10.1038/nbt.3240

Correspondence
Published: 08 July 2015

Pjotr Prins^1,2,3,na1,
Joep de Ligt^2,na1,
Artem Tarasov^4,
Ritsert C Jansen^5 (ORCID: orcid.org/0000-0003-2977-9110),
Edwin Cuppen^1,2 (ORCID: orcid.org/0000-0002-0400-9542) &
Philip E Bourne^6

Nature Biotechnology volume 33, pages 686–687 (2015)

To the Editor:

Leading scientists tell us that the problem of large data and data integration, referred to as 'big data', is acute and hurting research. Recently, Snijder et al.¹ suggested a culture change in which scientists would aim to share high-dimensional data among laboratories. It is important to realize that sharing data is only part of the solution. The elephant in the room is bioinformatics and bioinformatics software development in particular—which, despite being crucially important, mostly fails to address the requirements of 'big data'.

Whereas Internet companies such as Google, Facebook and Skype have built infrastructure and developed innovative software solutions to cope with vast amounts of data, the bioscience community seems to be struggling to realize big data software projects. This has led to problems in sharing, annotation, computation and reproducibility of data²,³,⁴.

Before we can devise software solutions for big data, there are more basic pressing concerns with bioinformatics software development that need to be resolved. Biologists are not formally trained for software engineering, so much of the bioinformatics software available today has been developed by PhD biologists in relative isolation on the back of funded experimental research programs. This model of software development tied to wet-lab research can work well but has resulted in a culture of 'one-offs'. The aim of most research projects is to obtain results in the shortest possible time, and this is often achieved by writing prototype software rather than developing well-engineered and scalable solutions. Even when funding is obtained to develop software, there are usually no long-term resources allocated to software maintenance, which results in problems with bug fixing, continuity and reproducibility.

Instead of working alone to develop software, researchers can join or start collaborative free and open-source software (FOSS) projects, thereby improving their coding skills through the scrutiny of their peers. True FOSS projects have licenses that allow continuation of projects that were abandoned by the original developers, thereby enabling modular development. We published a bioinformatics manifesto as a practical guide for FOSS-style development (https://github.com/pjotrp/bioinformatics/blob/master/README.md) that aims to provide process and architecture guidelines for early-career bioinformaticians and their supervisors. Bioinformatics already has vibrant collaborative FOSS projects, such as Galaxy, Cytoscape, BioPerl and Biopython, but these projects are often worked on part-time owing to absent or inadequate funding and will not service the requirements of big biology without major additional investment. For example, after initial funding from the US National Institutes of Health (NIH) and the National Science Foundation (NSF), the Galaxy project is now seeking new funding to continue its work, and no funds at all have been granted by scientific agencies to work on Biopython.

The amount of dedicated funding for bioinformatics software development remains small. For example, the NIH has a budget of $30 billion, of which an estimated 2–4% (roughly $0.6–1.2 billion) is allocated to computation and bioinformatics grants. We estimate that only a small fraction of this funding is used for big data software development. By comparison, the nonprofit Mozilla Foundation turns over $300 million annually for software development and FOSS promotion, and Google invests an estimated $6.7 billion annually in R&D. Private donors could, in principle, establish a foundation to support software development for integrative web-based services on large computer clusters. If investments in sharing data resources for biomedical research, such as the NIH Big Data to Knowledge (BD2K) initiative, with an annual budget of $24 million, and the European Bioinformatics Institute's smaller BioSamples project, were matched by serious investments in software development, maintenance and reproducibility, these projects would render better returns.

One way to solve the challenge is to wait for companies, such as 23andMe, that have made multimillion-dollar deals with pharma to realize large-scale investments and create big data solutions. However, such solutions would need to be purchased and, owing to their proprietary nature, would be difficult to adapt or benchmark. Another solution would be for biology funding agencies to establish initiatives for centralized software development. A different solution, and the one that we favor, is to use FOSS as a distributed development effort and develop collaborative software projects, such as those developed by the Linux, Mozilla and Apache foundations, which include private sector participation. For example, the goal of the Linux Foundation (which includes members such as IBM and Intel) is to fund Linux development.

Most of the bioinformatics software in use today does not scale to terabytes of data. R programs typically load all data into RAM, suffer from the memory and runtime inefficiencies of R itself, and are not designed to use multiple CPUs simultaneously to speed up computations³. Whereas programming languages such as R, Python, Perl and Ruby are great for prototyping and quick analysis, they fail to deliver when it comes to big data processing. Solving the scalability problem will require embracing programming languages that are more efficient and have abstractions for multi-CPU computations³, even if switching languages proves hard for most bioinformatician programmers.
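The two design points raised in this paragraph, streaming data instead of loading it all into RAM and structuring work so that it can be spread over several CPUs, can be illustrated with a minimal sketch. It is written in Python only for brevity and is not taken from the article; the paragraph itself argues that production tools would be better served by more efficient languages. The function names (`read_batches`, `count_gc`, `gc_fraction`), the batch size and the choice of GC content as the statistic are all illustrative assumptions.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def read_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size sequence lines, skipping FASTA headers.

    Only one batch is held in memory at a time, so the input can be an open
    file handle of any size rather than a fully loaded data set.
    """
    seq_lines = (ln.strip() for ln in lines if not ln.startswith(">"))
    while True:
        batch = list(islice(seq_lines, batch_size))
        if not batch:
            return
        yield batch


def count_gc(batch: List[str]) -> Tuple[int, int]:
    """Return (GC bases, total bases) for one batch.

    The function has no shared state, so batches can be handled by
    independent workers and the partial counts summed afterwards.
    """
    gc = sum(s.count("G") + s.count("C") + s.count("g") + s.count("c") for s in batch)
    total = sum(len(s) for s in batch)
    return gc, total


def gc_fraction(
    lines: Iterable[str],
    batch_size: int = 10_000,
    map_fn: Callable = map,  # hypothetical hook: swap in a parallel map
) -> float:
    """Stream the input in batches and combine the per-batch counts."""
    gc = total = 0
    for batch_gc, batch_total in map_fn(count_gc, read_batches(lines, batch_size)):
        gc += batch_gc
        total += batch_total
    return gc / total if total else 0.0
```

With the default `map_fn` the batches are processed one after another in constant memory. Passing a parallel map with the same calling convention, for example the `imap` method of a `multiprocessing.Pool`, would distribute `count_gc` over several processes; note that `Pool.imap` reads ahead of its workers, so a real tool would also need some form of back-pressure to keep memory bounded.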

Attribution for bioinformatics software development is also problematic. In a post titled 'You're not allowed bioinformatics anymore' on his blog Opiniomics (https://biomickwatson.wordpress.com/2014/07/21/youre-not-allowed-bioinformatics-anymore/), Mick Watson eloquently explains that bioinformatics is a scientific discipline in its own right and that bioinformaticians need career development. Ironically, in many of the most-cited biology research publications, there is a substantial bioinformatics contribution (usually the analytic method), often delivered as novel software solutions and data. However, it is rare for bioinformaticians to feature either as first or last authors on publications in high-impact journals. Authorship of community software projects can be troublesome as well, because the original authors tend to receive credit for the lifetime of the project, even when later code amendments and added functionality are equally or more important than the initial software. Lack of scientific attribution for software development hurts career development and can force bioinformaticians to opt for careers in traditional biology.

To solve the issue of attribution and related career development, we propose that the software contribution itself counts toward scientific track record. Every versioned software release and accompanying source code can be assigned a digital object identifier (DOI) with clear attribution for all contributors. The relative contribution of authors could be checked by inspecting the project's version-control history, for example on a web service such as GitHub. This would make published software accountable, reproducible and citable. DOI citations could count as conventional citations, because they express the impact of a piece of software by its use.
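As a rough illustration of how relative contributions might be read from version-control history, the sketch below tallies per-author shares from a list of (author, lines changed) records. The record format, the author names and the function name `contribution_shares` are hypothetical, and exporting such records from a repository is not shown. Lines changed is also only a crude proxy: it says nothing about the difficulty or importance of a change.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def contribution_shares(changes: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return each author's fraction of all changed lines, largest first.

    `changes` is an assumed export format: one (author, lines_changed)
    pair per commit.
    """
    totals: Counter = Counter()
    for author, lines_changed in changes:
        totals[author] += lines_changed
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {author: n / grand_total for author, n in totals.most_common()}


# Hypothetical commit records for one software release.
release_changes = [("alice", 30), ("bob", 10), ("alice", 10), ("carol", 50)]
shares = contribution_shares(release_changes)
```

A table of this kind could accompany the contributor list attached to a release DOI, so that later contributors are visible alongside the original authors.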

In conclusion, our view is that to tackle the challenge of big biology software development, leading scientists need to acknowledge that software development is an integral part of research and not just an underpinning method. Projects need to promote bioinformatics collaborations and create scientific rewards. Universities need to increase their efforts to promote interdisciplinary research, to ensure that informatics is embedded in the life-sciences curriculum and to encourage talented software developers and biologists to get involved in big data by tailoring individual career-development plans. Funding agencies can add institutional focus; emphasize collaborative FOSS approaches; build on existing grassroots initiatives⁵; create split funding streams for software and hardware; support maintenance of projects; encourage collaboration with experts in high-performance computing and software engineering; and fund larger projects dedicated to big biology software solutions.

References

1. Snijder, B., Kandasamy, R.K. & Superti-Furga, G. Nat. Biotechnol. 32, 755–759 (2014).
2. Collins, F.S. & Tabak, L.A. Nature 505, 612–613 (2014).
3. Trelles, O., Prins, P., Snir, M. & Jansen, R.C. Nat. Rev. Genet. 12, 224 (2011).
4. Marx, V. Nature 498, 255–260 (2013).
5. Möller, S. et al. BMC Bioinformatics 15, S7 (2014).

Acknowledgements

This document benefited from many reviewers. We especially thank K.W. Broman, T. Casci, V. Guryev, T. Seemann and J. Vilo for constructive comments.

Author information

Pjotr Prins and Joep de Ligt: These authors contributed equally to this work.

Authors and Affiliations

1. Department of Medical Genetics, University Medical Centre, Utrecht, the Netherlands
Pjotr Prins & Edwin Cuppen
2. Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), CancerGenomics.nl, Utrecht, the Netherlands
Pjotr Prins, Joep de Ligt & Edwin Cuppen
3. Department of Nematology, Wageningen University, the Netherlands
Pjotr Prins
4. St. Petersburg State University, St. Petersburg, Russia
Artem Tarasov
5. University of Groningen, Groningen Bioinformatics Centre, Groningen, the Netherlands
Ritsert C Jansen
6. Office of the Director, The National Institutes of Health, Bethesda, Maryland, USA
Philip E Bourne

Corresponding author

Correspondence to Pjotr Prins.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Prins, P., de Ligt, J., Tarasov, A. et al. Toward effective software solutions for big biology. Nat Biotechnol 33, 686–687 (2015). https://doi.org/10.1038/nbt.3240

Issue Date: July 2015
DOI: https://doi.org/10.1038/nbt.3240
