Cloud computing for genomic data analysis and collaboration

A Corrigendum to this article was published on 12 February 2018

This article has been updated

Key Points

  • Cloud computing is a paradigm whereby computational resources such as computers, storage and bandwidth can be rented on a pay-for-what-you-use basis.

  • The cloud's chief advantages are elasticity and convenience. Elasticity refers to the ability to rent and pay for the exact resources needed, and convenience refers to the fact that the user need not deal with the disadvantages of owning or maintaining the resources.

  • Archives of sequencing data are vast and rapidly growing. Cloud computing is an important enabler for recent efforts to reanalyse large cross-sections of archived sequencing data.

  • The cloud is becoming a popular venue for hosting large international collaborations, which benefit from the ability to hold data securely in a single location and proximate to the computational infrastructure that will be used to analyse it.

  • Funders of genomics research are increasingly aware of the cloud and its advantages and are beginning to allocate funds and create cloud-based resources accordingly.

  • Cloud clusters can be configured with security measures needed to adhere to privacy standards, such as those from the Database of Genotypes and Phenotypes (dbGaP).

Abstract

Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
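As a rough illustration of the growth rate quoted above, the following minimal sketch computes how much an archive grows if it doubles every 18 months. The function name growth_factor and the chosen time horizons are hypothetical, and the calculation assumes that the doubling time stays constant, which the article does not claim.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `months`, assuming the archive size
    doubles every `doubling_months` (18 months is the figure in the abstract)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Illustrative horizons only; these are not taken from the article.
    for years in (1.5, 3, 5, 10):
        factor = growth_factor(years * 12)
        print(f"after {years} years: about {factor:.1f} times the starting size")
```

Under this assumption, a decade of growth corresponds to roughly a hundredfold increase in archive size.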

Figure 1: Increase in storage of next-generation sequencing data.
Figure 2: Cloud elasticity.
Figure 3: Cloud reproducibility.
Figure 4: Models for distributed collaboration.

Change history

  • 12 February 2018

    The above article originally stated “FireCloud and CGC rely on AWS and the Google Cloud Platform for computing and data storage. In addition to charges for these commercial services, users pay convenience surcharges.” The second sentence was incorrect, as pointed out to and independently verified by the authors, and has been removed. Also, an incorrect citation was given for reference 66. The citation should have been: Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotech. 34, 525–527 (2016). Finally, reference 67 referred to an older version of the CWL specification and has been updated. The article has been corrected online. The authors apologize for these errors.

Acknowledgements

The authors thank J. Taylor, E. Afgan, M. Schatz, J. Goecks and A. Margolin for reading through a draft of this work and providing helpful comments. B.L. was supported by the US National Institutes of Health/National Institute of General Medical Sciences grant 1R01GM118568.

Author information

Contributions

The authors contributed equally to all aspects of this manuscript.

Corresponding authors

Correspondence to Ben Langmead or Abhinav Nellore.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary information S1 (methods)

Supplementary Information for: Cloud computing for genomic data analysis and collaboration (PDF 228 kb)

Glossary

Sequencing reads

Snippets of DNA sequence as reported by a DNA sequencer.

Storage

A component of a computer that stores data.

Processors

Central components of a computer in which computation takes place.

Computer cluster

A collection of connected computers that are able to work in a coordinated fashion to analyse data.

Metadata

Information about a data set, often pertaining to how and from where it was collected. For example, for a human data set, metadata might include sex, age, cause of death and sequencing protocol used.

Containers

Similar to 'virtual machines', containers are 'virtual computers' that enable the use of multiple, isolated services on a single platform. They can run in the context of another computer, using a portion of the host computer's resources. Docker and Singularity are two container management systems.

Firewalls

Barriers that prevent unwanted, perhaps insecure network traffic from reaching a protected network.

Application programming interfaces (APIs)

Formal specifications of the ways in which a user or program can interface with a system, for example, a cloud.

Cite this article

Langmead, B., Nellore, A. Cloud computing for genomic data analysis and collaboration. Nat Rev Genet 19, 208–219 (2018). https://doi.org/10.1038/nrg.2017.113
