Abstract
Methods for analyzing the full complement of a biomolecule type, e.g., proteomics or metabolomics, generate large amounts of complex data. The software tools used to analyze omics data have reshaped the landscape of modern biology and become an essential component of biomedical research. These tools are themselves quite complex and often require the installation of other supporting software, libraries and/or databases. A researcher may also be using multiple different tools that require different versions of the same supporting materials. The increasing dependence of biomedical scientists on these powerful tools creates a need for easier installation and greater usability. Packaging and containerization are different approaches to satisfy this need by delivering omics tools already wrapped in additional software that makes the tools easier to install and use. In this systematic review, we describe and compare the features of prominent packaging and containerization platforms. We outline the challenges, advantages and limitations of each approach and some of the most widely used platforms from the perspectives of users, software developers and system administrators. We also propose principles to make the distribution of omics software more sustainable and robust to increase the reproducibility of biomedical and life science research.
Acknowledgements
O.M. and SAFARI Research Group members (M.A., C.F. and N.A.) are supported by funding from Intel, VMware, Semiconductor Research Corporation, the National Institutes of Health and the Eidgenössische Technische Hochschule (ETH) Future Computing Laboratory. S.M. and R.A. are supported by the National Science Foundation grants 2041984 and 2316223 and National Institutes of Health grant R01AI173172. We thank M. Sarahan (Principal Software Engineer, Manager at Anaconda, Inc.) for our useful discussion at the AnacondaCON 2019 conference. We thank Dr. Mosqueiro for the fruitful discussion and feedback.
Author information
Contributions
M.A. and S.M. led the project. S.M. conceived the presented idea. M.A., R.A., N.R., S.W., N.A. and V.S. collected data. M.A., S.W. and N.A. produced the figures. M.A., B.L., R.J.A., S.W., R.A., D.S., T.O., BD.K., M.S.A., O.M. and S.M. wrote, reviewed and edited the manuscript. All authors discussed the text and commented on the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Protocols thanks Bernard Pope and Devon Ryan for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 General overview of the installation process for omics software tools.
When biomedical researchers need to use omics software tools and reproduce reported results, they first need to locate the version number and web address of each omics tool using the published research paper and supplementary information. They can then download the omics tools, determine each tool’s dependencies using information provided by the tool’s developers, and try to install the tools on their personal computer or HPC cluster. If the tool is successfully installed, the researchers download the relevant omics data and apply the tools as needed. However, even when the hardware resource requirements (CPU type, memory capacity, and storage) of each tool are met, some omics tools are likely to fail when exact reproduction is attempted because of installation challenges.
Extended Data Fig. 2 Development timeline and brief description of popular (a) package managers and (b) containers.
Each tool is described with key information regarding its functionality, purpose, and supported operating system. In addition to the surveyed package managers and containers, the first package manager, PMS, and the first container, FreeBSD Jail, are shown.
Extended Data Fig. 3 Standard workflow for installing software with a package manager.
The user, usually an administrator, asks the package manager to install a specific piece of software. If the software is not already installed, the package manager fetches the appropriate package from a repository. If any of the dependencies are not already installed, the package manager retrieves the dependency’s package from the repository and starts the installation procedure for that package. Once all the dependencies are installed, the initially requested software is installed. The package manager often goes through several iterations of this process, because every dependency can have its own list of dependencies, in which case each of the dependency’s dependencies must be verified through the same process.
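The recursive procedure described in this caption can be illustrated with a minimal sketch. It is not the implementation of any surveyed package manager: the repository contents, the package names and the `install` function are hypothetical, and real package managers additionally resolve version constraints, which this sketch ignores.

```python
from typing import Dict, List, Optional, Set

# Hypothetical repository: package name -> direct dependencies (assumption).
REPOSITORY: Dict[str, List[str]] = {
    "aligner": ["libz", "htslib"],
    "htslib": ["libz", "libcurl"],
    "libcurl": ["openssl"],
    "libz": [],
    "openssl": [],
}


def install(
    package: str,
    repository: Dict[str, List[str]],
    installed: Set[str],
    order: Optional[List[str]] = None,
    visiting: Optional[Set[str]] = None,
) -> List[str]:
    """Install `package` after its dependencies; return the install order."""
    order = [] if order is None else order
    visiting = set() if visiting is None else visiting

    if package in installed:
        return order  # Already present: nothing to fetch.
    if package not in repository:
        raise KeyError(f"package not found in repository: {package}")
    if package in visiting:
        raise ValueError(f"circular dependency involving: {package}")

    visiting.add(package)
    # Each dependency may have its own dependencies, so recurse first.
    for dependency in repository[package]:
        install(dependency, repository, installed, order, visiting)
    visiting.discard(package)

    # All dependencies are now installed; install the requested package.
    installed.add(package)
    order.append(package)
    return order


if __name__ == "__main__":
    already_installed = {"openssl"}
    print(install("aligner", REPOSITORY, already_installed))
```

Dependencies are installed depth-first, so a package is only marked as installed once everything it relies on is in place; packages that are already installed are skipped rather than fetched again.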
Extended Data Fig. 4 Standard workflow for running software with containerization.
The user asks to install a specific container image. If the container image is available locally, then the user can run it directly through the container engine. Potential dependencies are already handled without any intervention from the user. If the software image is not available locally, the appropriate image must be fetched from a repository.
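The decision flow in this caption can be sketched in a few lines. This is a simplified model under stated assumptions, not the behavior of any particular container engine: the `run_container` function, the image name and the in-memory "registry" are hypothetical stand-ins for a real engine and a remote repository.

```python
from typing import List, Set


def run_container(image: str, local_images: Set[str], registry: Set[str]) -> List[str]:
    """Return the steps taken to run `image`, fetching it first if needed."""
    steps: List[str] = []

    if image not in local_images:
        # The image is not available locally, so it must come from a repository.
        if image not in registry:
            raise LookupError(f"image not found in registry: {image}")
        local_images.add(image)
        steps.append(f"pull {image}")

    # Dependencies are bundled inside the image, so no further
    # resolution step is needed before running it.
    steps.append(f"run {image}")
    return steps


if __name__ == "__main__":
    registry = {"example/aligner:1.0"}  # hypothetical image name
    local: Set[str] = set()
    print(run_container("example/aligner:1.0", local, registry))  # first run: pull, then run
    print(run_container("example/aligner:1.0", local, registry))  # later runs: run only
```

In contrast to the package manager workflow, there is no recursive dependency step here: the only check is whether the image is already present locally.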
Supplementary information
Supplementary Information
Supplementary Methods for Tables 1–3.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Alser, M., Lawlor, B., Abdill, R.J. et al. Packaging and containerization of computational methods. Nat Protoc 19, 2529–2539 (2024). https://doi.org/10.1038/s41596-024-00986-0
DOI: https://doi.org/10.1038/s41596-024-00986-0