  • Review Article

Navigating bottlenecks and trade-offs in genomic data analysis

Abstract

Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences and to study cell-to-cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that analytical pipelines often struggle to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become ever more important. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs. Some are expected, such as accuracy versus memory and expense versus time. Others are more subtle, including the human expertise needed to use non-standard programming interfaces and to set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
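
As a rough illustration of the accuracy-versus-memory trade-off mentioned above for data sketching, the following minimal Python sketch estimates the Jaccard similarity of two k-mer sets from bottom-s MinHash sketches. It is not taken from the Review or from any specific tool. The function names, the choice of BLAKE2b as the hash function, the simulated sequences and the parameter values are all assumptions made for illustration.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(kmer_set, s):
    """Keep only the s smallest hash values, so memory is O(s) rather than O(n)."""
    return sorted(hash64(x) for x in kmer_set)[:s]


def estimate_jaccard(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    # The s smallest hashes of the union can be recovered from the two sketches.
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, which needs both full k-mer sets in memory."""
    return len(a & b) / len(a | b)


def mutate(seq, rate, rng):
    """Replace each base with a random base with probability `rate`."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else c for c in seq)


def demo():
    """Compare exact and sketched similarity at several sketch sizes."""
    rng = random.Random(0)
    genome_a = "".join(rng.choice("ACGT") for _ in range(20000))
    genome_b = mutate(genome_a, 0.02, rng)
    set_a, set_b = kmers(genome_a, 16), kmers(genome_b, 16)
    print("exact:", round(exact_jaccard(set_a, set_b), 4))
    for s in (64, 512, 4096):
        est = estimate_jaccard(bottom_sketch(set_a, s), bottom_sketch(set_b, s), s)
        print("sketch size", s, "estimate:", round(est, 4))


if __name__ == "__main__":
    demo()
```

Under these assumptions, smaller sketches need less memory but give noisier estimates, whereas larger sketches approach the exact value at a higher memory cost.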

Fig. 1: Overview of genomic analysis pipelines.
Fig. 2: Genomic compression and sketching.

Acknowledgements

The authors thank E. Banks, L. Cowen and I. Numanagić for helpful discussions. Y.W.Y. is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) grant RGPIN-2022-03074 and B.B. is supported by US National Institutes of Health (NIH) grant 1R35GM141861.

Author information

Contributions

The authors contributed equally to all aspects of the article.

Corresponding author

Correspondence to Bonnie Berger.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Genetics thanks C. Titus Brown and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

BEETL: https://github.com/BEETL/BEETL

BEETL-fastq: https://github.com/BEETL/BEETL

BIGSI: https://github.com/phelimb/BIGSI

BinDash: https://github.com/zhaoxiaofei/BinDash

BPipe: https://github.com/ssadedin/bpipe

CORA: http://cb.csail.mit.edu/cb/cora/

Dashing: https://github.com/dnbaker/dashing

DNAnexus: https://www.dnanexus.com/

DSRC2: http://sun.aei.polsl.pl/dsrc

ESS-Compress: http://github.com/medvedevgroup/ESSCompress

FACS: https://github.com/SciLifeLab/facs

Geosketch: https://geosketch.csail.mit.edu/

Hopper: http://hopper.csail.mit.edu/

Illumina Dragen: https://www.illumina.com/products/by-type/informatics-products/dragen-bio-it-platform.html

Illumina Dragen AMI: https://aws.amazon.com/quickstart/architecture/illumina-dragen/

IsONclust: https://pypi.org/project/isONclust/

Kraken 2: https://github.com/DerrickWood/kraken2

Mash: https://github.com/marbl/mash

MashMap2: https://github.com/marbl/MashMap

MBG: https://github.com/maickrau/MBG

mdbg: https://github.com/ekimb/rust-mdbg/

Minimap2: https://github.com/lh3/minimap2

Nvidia Clara Parabricks: https://www.nvidia.com/en-us/clara/genomics/

Qualcomp: https://sourceforge.net/projects/qualcomp

Quartz: https://quartz.csail.mit.edu/

QVZ: https://github.com/mikelhernaez/qvz

SCALCE: http://scalce.sourceforge.net/

SEDEF: https://github.com/vpc-ccg/sedef

Seq: https://github.com/seq-lang/seq

Sequence Bloom Tree: http://www.cs.cmu.edu/~ckingsf/software/bloomtree/

SneakySnake: https://github.com/CMU-SAFARI/SneakySnake

Winnowmap: https://github.com/marbl/Winnowmap

Glossary

Accelerators

A hardware device or software program that enhances the overall performance of the computer. A software accelerator implements as many system functions as possible in software and moves performance-critical functions into special-purpose external hardware to reduce compute time.

Bloom filters

A compact, probabilistic indexing approach for storing the presence or absence of k-mers in a dataset; Bloom filters considerably reduce the amount of space needed while still answering queries in constant time. However, they can have high false positive rates (that is, query hits for k-mers that are not actually present), although they never produce false negatives.
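As a minimal sketch of the idea, the toy Bloom filter below stores k-mers in a fixed-size bit array. The class name, the array size, the number of hash functions and the use of salted SHA-256 are all illustrative assumptions and are not taken from any of the tools discussed in this Review.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for k-mers (illustrative sizes, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, kmer):
        # Derive several positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos] = 1

    def __contains__(self, kmer):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(kmer))


bf = BloomFilter()
for kmer in ("ACGT", "CGTA", "GTAC"):
    bf.add(kmer)

print("ACGT" in bf)  # an inserted k-mer is always reported as present
```

Shrinking the bit array makes false positives more likely: in the extreme case of a one-bit filter, every query returns a hit once a single k-mer has been added.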

CIGAR strings

(Concise idiosyncratic gapped alignment report strings). The sequence alignment map (SAM) file format’s compressed representation of a read alignment to a reference.
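To illustrate how compact this representation is, the following sketch parses a CIGAR string into (length, operation) pairs and computes how many reference bases the alignment covers. The function names are hypothetical; the set of reference-consuming operations (M, D, N, = and X) follows the SAM specification.

```python
import re

# One CIGAR element is a run length followed by a single operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    # M, D, N, = and X consume reference bases; I, S, H and P do not.
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


print(parse_cigar("3M1I2M2D4M"))
print(reference_span("3M1I2M2D4M"))
```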

Cloud computing

The use of computing resources distributed in the ‘cloud-shaped’ Internet to store, manage and analyse data, rather than doing so on a local server or personal computer.

Complexity

Algorithm complexity is generally measured as an upper bound on its long-term growth rate: how its runtime or space requirements grow as the input size grows, rather than their absolute magnitude, and thus constants are omitted. In practice, a set of algorithms can share the same asymptotic complexity despite some of them being a constant 2, 3 or even 1,000 times slower than their counterparts in the set.

Compute resources

The amount of compute power (for example, central processing units (CPUs) and memory) that can be requested, allocated and used for computing.

Domain-specific languages

Computer languages tailored to a specific domain such as genomics.

Field-programmable gate arrays

(FPGAs). Hardware accelerators that can be configured/reprogrammed by a customer after manufacturing. They enable custom hardware acceleration without needing entirely new chips to be manufactured.

Graphics processing units

(GPUs). Hardware accelerators that can process many pieces of data simultaneously. They were historically used primarily for rendering computer graphics, but the massive parallelism makes them useful for applications such as machine learning.

Jaccard index

A measure of the similarity between two sets, defined as the size of the intersection divided by the size of the union.
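The definition translates directly into code. The sketch below is illustrative only; the function name and the convention of returning 1.0 for two empty sets are assumptions.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(union)


# Two k-mer sets sharing 2 of 4 distinct k-mers.
print(jaccard({"ACG", "CGT", "GTA"}, {"CGT", "GTA", "TAC"}))
```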

k-mer

Genomic data normally come in long strings of nucleotides (A, C, G and T). Many genomic algorithms process these strings by looking at exact matches of length-k substrings, which are known as k-mers.
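As a small illustration, all overlapping k-mers of a sequence can be enumerated with a sliding window. The function name is hypothetical.

```python
def kmers(seq, k):
    """Return every overlapping length-k substring of seq, in order."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 k-mers (none if k > n).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(kmers("ACGTAC", 3))
```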

Kryder’s law

Disk drive density doubles every 13 months, determined by the capability of hard drive storage media over time.

Lossless compression

A procedure that takes advantage of redundancy/repetition to reversibly transform a large file into a smaller one — for example, storing the string ‘ACGTACGTACGTACGTACGT’ as ‘5*(ACGT)’. Note that although shorter, the transformed string contains all the same information as the original.
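The example above can be reproduced with a toy encoder that looks for the shortest repeating unit. This is a sketch under simplifying assumptions (the whole string must be an exact repeat, and the input must not itself contain the characters '*(' ); practical compressors use far more general schemes.

```python
def encode_repeat(s):
    """Encode s as 'count*(unit)' using its shortest repeating unit."""
    n = len(s)
    for unit_len in range(1, n // 2 + 1):
        unit = s[:unit_len]
        if n % unit_len == 0 and unit * (n // unit_len) == s:
            return f"{n // unit_len}*({unit})"
    return f"1*({s})"  # no repetition found: store the string once


def decode_repeat(encoded):
    """Invert encode_repeat, recovering the original string exactly."""
    count, unit = encoded.split("*(", 1)
    return unit[:-1] * int(count)  # drop the trailing ')'


original = "ACGTACGTACGTACGTACGT"
encoded = encode_repeat(original)
print(encoded)
print(decode_repeat(encoded) == original)
```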

Lossy compression

Sometimes, we are willing to discard some information when compressing a file. For example, if we start with data points ‘12.362, 15.212, 92.786’ we could round the points and discard some precision to get ‘12, 15, 93’, which can be stored in less space. However, after lossy compression, although we can still reproduce data that look similar to the original and are in the same kind of format, they are no longer an exact replica.
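The rounding example can be written out as follows. The function name is hypothetical, and rounding to the nearest integer is only one of many possible lossy schemes.

```python
def lossy_round(values):
    """Discard the fractional part of each value by rounding."""
    return [round(v) for v in values]


values = [12.362, 15.212, 92.786]
rounded = lossy_round(values)
print(rounded)

# The discarded precision cannot be recovered, but the error is bounded.
errors = [abs(v - r) for v, r in zip(values, rounded)]
print(max(errors) <= 0.5)
```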

Metagenomics

Ordinary genomics studies the genome of a single organism. Metagenomics is the simultaneous study of a collection of many different species’ genomes in a single sample, typically that of microbial communities.

Moore’s law

Computing power (in TeraFLOPS) doubles every 18 months, determined by the number of transistors you can pack per unit area on a chip.

Multicore

A single computing processor with two or more independent computing units (called cores). Running multiple instructions on multiple cores at the same time can increase the overall speed of programs.

Parallelization

Parallel computing allows numerous calculations to be performed simultaneously, thereby accelerating computation. Based on this principle, many large-scale computational tasks can then be divided into smaller ones and solved on multiple machines concurrently.

Parsing

The input data to a computer program can come in various formats. Before performing any type of complicated analysis, programs must first translate those data into an internal representation, in a process known as parsing.

Random access memory

(RAM). Short-term storage that holds the data a computer is actively using so that they can be accessed quickly.

Random access

Access to any element of stored data as easily and efficiently as any other.

RNA sequencing

A genomic approach for the detection and quantitative analysis of mRNA molecules in a biological sample.

Scale

Scalability typically refers to how an algorithm handles larger amounts of data; for example, an algorithm scales with the amount of data if its runtime and space requirements grow slowly enough that the problem can still be solved as the input grows.

Single-threaded

Computation that operates as a single sequential series of operations without any parallelization. It is often used as a benchmark for the speed of a method without using any types of hardware tricks or multi-threaded acceleration.

Sketching

These methods reduce the number of data points considered, while still capturing salient features of the underlying data, to minimize the computational resources required for large-scale analyses. Unlike lossy data compression, it is generally not possible to reproduce even an approximate copy of the original data, because the sketch only summarizes a few important features.
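One common flavour of sketching keeps only the smallest hash values of a set, in the spirit of MinHash. The code below is a minimal sketch under stated assumptions: the function names, the use of SHA-256 and the sketch size are illustrative and do not reproduce the implementation of any particular tool.

```python
import hashlib


def _hash(item):
    # 64-bit integer hash derived from SHA-256 (an arbitrary choice).
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")


def bottom_sketch(items, size):
    """Keep only the `size` smallest hash values of a set of items."""
    return sorted({_hash(x) for x in items})[:size]


def estimate_jaccard(sketch_a, sketch_b, size):
    """Estimate the Jaccard index of two sets from their sketches alone."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 1.0  # assumed convention for two empty sketches
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


set_a = {f"kmer{i}" for i in range(1000)}
set_b = {f"kmer{i}" for i in range(500, 1500)}  # true Jaccard index is 1/3
sk_a = bottom_sketch(set_a, 200)
sk_b = bottom_sketch(set_b, 200)
print(estimate_jaccard(sk_a, sk_b, 200))  # an estimate, not the exact value
```

Each sketch holds at most 200 numbers regardless of how large the underlying set is, and the original items cannot be reconstructed from it.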

Space-complexity

Computer scientists traditionally measure the amount of computer memory (random access memory (RAM)) an algorithm needs to run by asking how the amount of memory needed scales with the size of the data. Often, the same types of terms are used as for time-complexity, and we speak of linear, log-linear or quadratic space algorithms.

Technology refresh lifecycle

The cycle of regularly updating compute infrastructure to maximize a system’s performance.

Tensor processing units

(TPUs). Application-specific integrated circuits developed by Google to accelerate machine learning workflows.

Time-complexity

Computer scientists traditionally measure how fast an algorithm is by asking how the number of central processing unit (CPU) operations scales with the size of the data. An algorithm is linear time if doubling the amount of data to be processed doubles the number of CPU operations needed. An algorithm is quadratic time if doubling the amount of data quadruples (×4) the number of CPU operations. A log-linear time algorithm is only marginally slower than a linear time algorithm, although the exact scaling requires a bit more mathematical formalism to describe. Most practical algorithms are either linear or log-linear.
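The doubling behaviour described above can be checked by counting operations directly. The two functions below are hypothetical stand-ins for a linear time and a quadratic time algorithm; they count loop iterations rather than measuring wall-clock time.

```python
def linear_ops(n):
    """Count the operations of a single pass over n items."""
    ops = 0
    for _ in range(n):
        ops += 1
    return ops


def quadratic_ops(n):
    """Count the operations of comparing every item with every item."""
    ops = 0
    for _ in range(n):
        for _ in range(n):
            ops += 1
    return ops


# Doubling the input doubles the linear count and quadruples the quadratic one.
print(linear_ops(200) / linear_ops(100))
print(quadratic_ops(200) / quadratic_ops(100))
```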

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Berger, B., Yu, Y.W. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 24, 235–250 (2023). https://doi.org/10.1038/s41576-022-00551-z

  • DOI: https://doi.org/10.1038/s41576-022-00551-z
