Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy

Nature Biotechnology (2024)

The Earth BioGenome Project aims to produce reference genomes for all ~1.8 million known eukaryotic species over the next decade (refs. 1–4). Achieving this goal will require the current pace of reference genome production to increase by at least two orders of magnitude (ref. 1). Automation of the assembly process with a pipeline that is widely accessible to any research group will be required to achieve this speed-up. Enabling this goal requires sustained effort in three major areas: genome assembly optimization and best-practice development, computational infrastructure provisioning, and dissemination and training.

Fig. 1: VGP–Galaxy assembly pipeline (version 2.1) consists of 10 workflows that can be combined into 8 analysis trajectories depending on the combination of input data.
Fig. 2: Phylogenetic tree and assembly statistics of genomes assembled using the VGP–Galaxy assembly pipeline.

Data availability

The workflows, their description and instructions on how to use them can be found at https://galaxyproject.org/projects/vgp/workflows/. The requisite tools are installed on usegalaxy.org and usegalaxy.eu, and are in the process of being installed on usegalaxy.org.au. These genomes were supported by collaborators of the VGP and ERGA, and the QC analyses reported here to test the VGP Galaxy pipeline do not release those that are under specific embargo policies for genome-wide analyses (e.g., https://genome10k.ucsc.edu/data-use-policies/). New genome assemblies are available in the GenomeArk repository: https://www.genomeark.org/. After manual curation, the assemblies are submitted to the US National Center for Biotechnology Information (NCBI) under the BioProject Vertebrate Genome Project: https://www.ncbi.nlm.nih.gov/bioproject/489243 (ref. 17).

Acknowledgements

We thank Yagoub Adam, Tyler Alioto, Jun Aruga, Diego De Panis, Sagane Dind, Diego Fuentes, Shilpa Garg and Jèssica Gómez for contributing to the initial implementation during ELIXIR Biohackathon 2021. We also thank Nate Jue for help testing and developing the pipeline tutorials and Andrea Guarracino for their useful comments on the manuscript. This work was supported in part by the Intramural Research Program of the US National Human Genome Research Institute (NHGRI), the US National Institutes of Health (NIH) and the Howard Hughes Medical Institute (HHMI). The authors are grateful to the broader Galaxy community for their support and software development efforts. This work is funded by NIH grants U41 HG006620, U24 HG010263, U24 CA231877 and U01CA253481, along with US National Science Foundation grants 1661497, 1758800 and 2216612. The work was also supported in part by The Human Frontier Science Program (HFSP) RGP0025/2021, the Swiss National Science Foundation (SNSF) grants 202669 and 198691, the Swiss State Secretariat for Education, Research and Innovation (SERI) grant 22.00173 and Horizon Europe under the Biodiversity, Circular Economy and Environment program (REA.B.3, BGE 101059492). Usegalaxy.eu is supported by German Federal Ministry of Education and Research grants 031L0101C and de.NBI-epi to B.G. Computational resources are provided by the Advanced Cyberinfrastructure Coordination Ecosystem (ACCESS-CI), Texas Advanced Computing Center, and the JetStream2 scientific cloud.

Author information

Author notes

  1. These authors contributed equally: Delphine Larivière, Linelle Abueg, Nadolina Brajuka

Authors and Affiliations

  1. Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA

    Delphine Larivière, Marius van den Beek & Anton Nekrutenko

  2. Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA

    Linelle Abueg, Nadolina Brajuka, Jennifer R. Balacco, Melanie Couture, Olivier Fedrigo, Grenville MacDonald Gooder, Kathleen Horan, Nivesh Jain, Cassidy Johnson, Brian O’Toole, Tatiana Tilley, Erich D. Jarvis & Giulio Formenti

  3. Bioinformatics Group, Department of Computer Science, Albert-Ludwigs University Freiburg, Freiburg, Germany

    Cristóbal Gallardo-Alba & Bjorn Grüning

  4. Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea

    Byung June Ko & Heebal Kim

  5. Departments of Biology and Computer Science, Johns Hopkins University, Baltimore, MD, USA

    Alex Ostrovsky & Michael C. Schatz

  6. Department of Medicine and Life Sciences (MELIS), Institut de Biologia Evolutiva, Universitat Pompeu Fabra-CSIC, Barcelona, Spain

    Marc Palmada-Flores & Tomas Marques-Bonet

  7. Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA

    Brandon D. Pickett, Arang Rhie & Adam M. Phillippy

  8. Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA

    Keon Rabbani & Mark J. P. Chaisson

  9. CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Porto, Portugal

    Agostinho Antunes

  10. Department of Biology, Faculty of Sciences, University of Porto, Porto, Portugal

    Agostinho Antunes

  11. Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA

    Haoyu Cheng

  12. Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA

    Haoyu Cheng

  13. Wellcome Sanger Institute, Cambridge, UK

    Joanna Collins

  14. Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia

    Alexandra Denisova

  15. Department of Biosciences, University of Milan, Milan, Italy

    Guido Roberto Gallo

  16. BMRI, Weill Cornell Medical College, New York, NY, USA

    Alice Maria Giani

  17. eGnome, Inc., Seoul, Republic of Korea

    Heebal Kim

  18. Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea

    Heebal Kim & Chul Lee

  19. Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA

    Chul Lee

  20. Catalan Institution of Research and Advanced Studies (ICREA), Barcelona, Spain

    Tomas Marques-Bonet

  21. CNAG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona, Spain

    Tomas Marques-Bonet

  22. Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Spain

    Tomas Marques-Bonet

  23. Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus

    Simona Secomandi

  24. University of Florence, Department of Biology, Florence, Italy

    Marcella Sozzoni

  25. Tree of Life, Wellcome Sanger Institute, Cambridge, UK

    Marcela Uliano-Silva

  26. Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA

    Robert W. Williams

  27. Department of Ecology & Evolution and Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland

    Robert M. Waterhouse

Authors

  1. Delphine Larivière
  2. Linelle Abueg
  3. Nadolina Brajuka
  4. Cristóbal Gallardo-Alba
  5. Bjorn Grüning
  6. Byung June Ko
  7. Alex Ostrovsky
  8. Marc Palmada-Flores
  9. Brandon D. Pickett
  10. Keon Rabbani
  11. Agostinho Antunes
  12. Jennifer R. Balacco
  13. Mark J. P. Chaisson
  14. Haoyu Cheng
  15. Joanna Collins
  16. Melanie Couture
  17. Alexandra Denisova
  18. Olivier Fedrigo
  19. Guido Roberto Gallo
  20. Alice Maria Giani
  21. Grenville MacDonald Gooder
  22. Kathleen Horan
  23. Nivesh Jain
  24. Cassidy Johnson
  25. Heebal Kim
  26. Chul Lee
  27. Tomas Marques-Bonet
  28. Brian O’Toole
  29. Arang Rhie
  30. Simona Secomandi
  31. Marcella Sozzoni
  32. Tatiana Tilley
  33. Marcela Uliano-Silva
  34. Marius van den Beek
  35. Robert W. Williams
  36. Robert M. Waterhouse
  37. Adam M. Phillippy
  38. Erich D. Jarvis
  39. Michael C. Schatz
  40. Anton Nekrutenko
  41. Giulio Formenti

Contributions

D.L. built the assembly pipeline with support from G.F., L.A., C.G.-A., B.G., A.O., H.C., M.U.-S., B.D.P., A.R., M.v.d.B. and the VGP assembly working group. L.A., A.D., G.R.G., A.M.G., G.M.G., N.J., C.J., B.O’T., S.S., M.S. and T.T. generated one or several assemblies used in the analyses. B.J.K., K.R. and M.J.P.C. validated the zebra finch assemblies. J.C. performed the manual curation on the zebra finch assembly. L.A. assembled and evaluated the mitochondrial genomes. N.B. established the decontamination pipeline and performed the contamination analyses. N.B. and M.P.-F. compared the scaffolding strategies. A.N. performed the analyses on XBP1. C.G.-A. and B.D.P. developed the training material with support from the user community. K.H. and M.C. sourced and arranged for sample procurement for species in this study. J.R.B., N.J., T.T., B.O’T., O.F., C.L., H.K., T.M.-B. and R.M.W. generated the PacBio and Hi-C data. G.F., M.C.S., A.N., A.M.P. and E.D.J. conceived the study and drafted the manuscript. All authors, including A.A. and R.W.W., contributed to writing and editing the manuscript and approved it.

Corresponding authors

Correspondence to Erich D. Jarvis, Michael C. Schatz, Anton Nekrutenko or Giulio Formenti.

Ethics declarations

Competing interests

The authors declare no competing interests.

Supplementary information

Supplementary Information

Supplementary Notes and Supplementary Figs. 1–14

Supplementary Table

Supplementary Tables 1–10

About this article

Cite this article

Larivière, D., Abueg, L., Brajuka, N. et al. Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-023-02100-3

