  • Review Article
Packaging and containerization of computational methods

Abstract

Methods for analyzing the full complement of a biomolecule type, e.g., proteomics or metabolomics, generate large amounts of complex data. The software tools used to analyze omics data have reshaped the landscape of modern biology and become an essential component of biomedical research. These tools are themselves quite complex and often require the installation of other supporting software, libraries and/or databases. A researcher may also use several tools that require different versions of the same supporting materials. The increasing dependence of biomedical scientists on these powerful tools creates a need for easier installation and greater usability. Packaging and containerization are two different approaches to satisfying this need: both deliver omics tools already wrapped in additional software that makes the tools easier to install and use. In this systematic review, we describe and compare the features of prominent packaging and containerization platforms. We outline the challenges, advantages and limitations of each approach and of some of the most widely used platforms from the perspectives of users, software developers and system administrators. We also propose principles to make the distribution of omics software more sustainable and robust, thereby increasing the reproducibility of biomedical and life science research.

Fig. 1: An overview of packaging, virtualization and containerization platforms for addressing challenges of omics software installation.

Acknowledgements

O.M. and SAFARI Research Group members (M.A., C.F. and N.A.) are supported by funding from Intel, VMware, Semiconductor Research Corporation, the National Institutes of Health and the Eidgenössische Technische Hochschule (ETH) Future Computing Laboratory. S.M. and R.A. are supported by the National Science Foundation grants 2041984 and 2316223 and National Institutes of Health grant R01AI173172. We thank M. Sarahan (Principal Software Engineer, Manager at Anaconda, Inc.) for a useful discussion at the AnacondaCON 2019 conference. We thank Dr. Mosqueiro for the fruitful discussion and feedback.

Author information

Contributions

M.A. and S.M. led the project. S.M. conceived the presented idea. M.A., R.A., N.R., S.W., N.A. and V.S. collected data. M.A., S.W. and N.A. produced the figures. M.A., B.L., R.J.A., S.W., R.A., D.S., T.O., BD.K., M.S.A., O.M. and S.M. wrote, reviewed and edited the manuscript. All authors discussed the text and commented on the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Serghei Mangul.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Protocols thanks Bernard Pope and Devon Ryan for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 General overview of the installation process for omics software tools.

When biomedical researchers need to use omics software tools and reproduce reported results, they first need to locate the version number and web address of each omics tool using the published research paper and supplementary information. They can then download the omics tools, determine each tool’s dependencies using information provided by the tool’s developers, and try to install the tools on their personal computer or HPC cluster. If the tool is successfully installed, the researchers download the relevant omics data and apply the tools as needed. However, even when the hardware resource requirements (CPU type, memory capacity, and storage) of each tool are met, some omics tools are likely to fail when exact reproduction is attempted because of installation challenges.

Extended Data Fig. 2 Development timeline and brief description of popular (a) package managers and (b) containers.

Each tool is described with key information regarding its functionality, purpose, and supported operating system. In addition to the surveyed package managers and containers, the first package manager, PMS, and the first container, FreeBSD Jail, are shown.

Extended Data Fig. 3 Standard workflow for installing software with a package manager.

The user, usually an administrator, asks the package manager to install a specific piece of software. If the software is not already installed, the package manager fetches the appropriate package from a repository. If any of the dependencies are not already installed, the package manager retrieves the dependency’s package from the repository and starts the installation procedure for that package. Once all the dependencies are installed, the initially requested software is installed. The package manager often goes through several iterations of this process, because every dependency can have its own list of dependencies, in which case each of the dependency’s dependencies must be verified through the same process.
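
The recursive procedure described in this caption can be illustrated with a minimal Python sketch. It is not the implementation of any surveyed package manager: the `REPOSITORY` dictionary, the package names and the `install` function are all hypothetical, and real package managers additionally handle version constraints and conflicts, which are omitted here.

```python
from typing import Dict, List, Optional, Set

# Hypothetical repository: package name -> list of direct dependencies.
# The names are placeholders and do not refer to real packages.
REPOSITORY: Dict[str, List[str]] = {
    "aligner": ["libcompress", "python-runtime"],
    "libcompress": [],
    "python-runtime": ["libssl"],
    "libssl": [],
}


def install(
    package: str,
    installed: Set[str],
    repository: Dict[str, List[str]],
    in_progress: Optional[Set[str]] = None,
) -> List[str]:
    """Install `package` and any missing dependencies, depth first.

    Returns the names of newly installed packages in installation order
    and adds them to `installed`.
    """
    if in_progress is None:
        in_progress = set()
    if package in installed:
        return []  # already installed, nothing to do
    if package not in repository:
        raise KeyError(f"package not found in repository: {package}")
    if package in in_progress:
        raise ValueError(f"circular dependency involving: {package}")

    in_progress.add(package)
    order: List[str] = []
    # Every dependency may have its own dependencies, so recurse first.
    for dependency in repository[package]:
        order.extend(install(dependency, installed, repository, in_progress))
    in_progress.discard(package)

    # All dependencies are in place; now install the requested package.
    installed.add(package)
    order.append(package)
    return order


if __name__ == "__main__":
    already_installed = {"libssl"}
    print(install("aligner", already_installed, REPOSITORY))
```

In this sketch, a dependency that is already present (here `libssl`) is skipped, and the requested software is installed only after every dependency in its tree has been handled.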

Extended Data Fig. 4 Standard workflow for running software with containerization.

The user asks to install a specific container image. If the container image is available locally, the user can run it directly through the container engine. Potential dependencies are already bundled in the image and are handled without any intervention from the user. If the container image is not available locally, the appropriate image must first be fetched from a repository.
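
The workflow in this caption can be sketched as a small Python model. This is an illustration under stated assumptions, not the API of Docker, Singularity or any other container engine: the `ContainerEngine` class, the `REGISTRY` dictionary and the image name `example/aligner:1.0` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical remote repository: image name -> image contents.
# The image is assumed to bundle the tool together with its dependencies.
REGISTRY: Dict[str, str] = {
    "example/aligner:1.0": "aligner 1.0 with bundled dependencies",
}


class ContainerEngine:
    """Toy model of a container engine with a local image cache."""

    def __init__(self, registry: Dict[str, str]) -> None:
        self.registry = registry
        self.local_images: Dict[str, str] = {}

    def run(self, image: str) -> Tuple[bool, str]:
        """Run `image`, fetching it from the repository only if needed.

        Returns (fetched, message), where `fetched` states whether the
        image had to be downloaded before it could run.
        """
        fetched = False
        if image not in self.local_images:
            if image not in self.registry:
                raise KeyError(f"image not found in repository: {image}")
            # Fetch the image once and keep it in the local cache.
            self.local_images[image] = self.registry[image]
            fetched = True
        # No dependency resolution happens here: the image is self-contained.
        return fetched, f"running {image}"


if __name__ == "__main__":
    engine = ContainerEngine(REGISTRY)
    print(engine.run("example/aligner:1.0"))  # first run fetches the image
    print(engine.run("example/aligner:1.0"))  # second run uses the local copy
```

Unlike the package manager workflow, there is no recursive dependency step in this model, because the dependencies are assumed to ship inside the image.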

Supplementary information

Supplementary Information

Supplementary Methods for Tables 1–3.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alser, M., Lawlor, B., Abdill, R.J. et al. Packaging and containerization of computational methods. Nat Protoc (2024). https://doi.org/10.1038/s41596-024-00986-0

  • DOI: https://doi.org/10.1038/s41596-024-00986-0
