  • Review Article

Best practices for genetic and genomic data archiving

Abstract

Genetic and genomic data are collected for a vast array of scientific and applied purposes. Despite mandates for public archiving, data are typically used only by the authors who generated them. Reuse of genetic and genomic datasets remains uncommon because non-standard archiving practices and a lack of contextual metadata make it difficult, if not impossible. But as the new field of macrogenetics is demonstrating, if genetic data and their metadata were more accessible and compliant with the FAIR (findable, accessible, interoperable and reusable) principles, they could be reused for many additional purposes. We discuss the main challenges with existing genetic and genomic data archives and suggest best practices for archiving genetic and genomic data. Recognizing that this is a longstanding issue, owing to the scarcity of formal data management training in ecology and evolution, we highlight steps that research institutions and publishers could take to improve data archiving.

Fig. 1: Estimating the unknown number of ‘missing’ datasets in open repositories.
Fig. 2: Recommendations for genetic and genomic data archiving.
Fig. 3: Wider actions needed to improve data archiving.

Data availability

All data are accessible in the Supplementary information.

Acknowledgements

D.M.L. was funded by the BiodivERsA project ‘ACORN’ granted by the Swiss National Science Foundation (SNSF Project 31BD30_193900). I.P.-V. was supported by the US Geological Survey John Wesley Powell Center for Analysis and Synthesis. Thanks to T. Günther and B. Star for their comments on ancient DNA archiving practices and considerations. Thanks also to J. Gibson for her helpful discussions about FAIR databases. Thanks to F. Gugerli and C. Buser-Schoebel for their helpful feedback on the manuscript. This work was conducted as a part of the Standardizing, Aggregating, Analyzing and Disseminating Global Wildlife Genetic and Genomic Data for Improved Management and Advancement of Community Best Practices Working Group supported by the John Wesley Powell Center for Analysis and Synthesis, funded by the US Geological Survey. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the US Government.

Author information

Contributions

All authors contributed to the inception and writing of this work. D.M.L. supervised this work and conducted the editing, with support from I.P.-V., A.G.V. and M.E.H.; I.P.-V. conducted the database analysis.

Corresponding author

Correspondence to Deborah M. Leigh.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Ecology & Evolution thanks Natalie Forsdick, Dominique Roche and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Data 1

Metadata used in Fig. 1.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Leigh, D.M., Vandergast, A.G., Hunter, M.E. et al. Best practices for genetic and genomic data archiving. Nat Ecol Evol 8, 1224–1232 (2024). https://doi.org/10.1038/s41559-024-02423-7
