
  • Review Article

Computational solutions to large-scale data management and analysis

Key Points

  • Biological research is becoming ever more information-driven, with individual laboratories now capable of generating terabytes of data in a matter of days. Supercomputing resources will increasingly be needed to get the most from the big data sets that researchers generate or analyse.

  • The big data revolution in biology is matched by a revolution in high-performance computing that is making supercomputing resources available to anyone with an internet connection.

  • A number of challenges are posed by large-scale data analysis, including data transfer (bringing the data and computational resources together), controlling access to the data, managing the data, standardizing data formats and integrating data of multiple different types to accurately model biological systems.

  • New computational solutions that are readily available to all can aid in addressing these challenges. These solutions include cloud-based computing and high-speed, low-cost heterogeneous computational environments. Taking advantage of these resources requires a thorough understanding of the data and the computational problem.

  • Knowing how to parallelize analysis algorithms enables a computational problem to be solved more efficiently by distributing tasks over many computer processors. Parallelism falls into two broad categories: loosely coupled (or coarse-grained) parallelism and tightly coupled (or fine-grained) parallelism, each benefiting from different types of computational platform, depending on the problem of interest. A sketch contrasting the two follows these key points.

  • Clusters of computers can be optimized for many different classes of computationally intensive applications, such as sequence alignment, genome-wide association tests and reconstruction of Bayesian networks. Cloud computing makes cluster-based computing more accessible and affordable for all. The distributed computing paradigm MapReduce has been designed for cloud-based computing to solve problems such as mapping raw DNA sequence reads to a reference genome (that is, problems that have loosely coupled parallelism); a toy MapReduce sketch also follows these key points.

  • Cloud computing provides a highly flexible, low-cost computational environment. However, the costs of cloud computing include sacrificing control of the underlying hardware and requiring that big data sets be transferred into the cloud for processing.

  • Heterogeneous multi-core computational systems, such as graphics processing units (GPUs), are complementary to cloud-based computing and operate as low-cost, specialized accelerators that can increase peak arithmetic throughput by 10-fold to 100-fold. These systems are specifically tuned to efficiently solve problems involving massive tightly coupled parallelism.

  • Heterogeneous computing provides a low-cost, flexible computational environment that improves performance and efficiency by exposing architectural features to programmers. However, programming applications to run in these environments requires significant informatics expertise.

  • Cloud providers such as Microsoft make advanced cloud computing resources freely available to individual researchers through a competitive, peer-reviewed granting process. Other providers, such as Amazon, provide advanced cloud storage and computational resources via an intuitive, simple web interface. Users of Amazon Web Services can now not only upload big data sets and analysis tools to Amazon S3 but also solve problems using MapReduce through a point-and-click interface.
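
As a concrete illustration of the loosely coupled versus tightly coupled distinction above, the sketch below (not from the article; the per-sample workload is a hypothetical placeholder) farms independent tasks out to separate processes, while each task internally performs the kind of element-wise, data-parallel arithmetic that accelerators such as GPUs speed up further.

```python
# Minimal sketch of the two kinds of parallelism described in the key points.
# Requires only the Python standard library and NumPy; the per-sample
# "analysis" is a hypothetical placeholder.
from multiprocessing import Pool

import numpy as np


def score_sample(seed: int) -> float:
    """Tightly coupled (fine-grained) parallelism: the same arithmetic is
    applied to a million elements at once, the data-parallel pattern that
    GPUs accelerate."""
    rng = np.random.default_rng(seed)
    values = rng.random(1_000_000)          # stand-in for one sample's data
    return float(np.sum(np.log1p(values)))  # element-wise op plus a reduction


if __name__ == "__main__":
    samples = range(16)  # independent samples: no communication between tasks
    # Loosely coupled (coarse-grained) parallelism: each sample is an
    # independent task handed to a separate worker process, the same pattern
    # used to spread alignments or association tests over cluster or cloud nodes.
    with Pool(processes=4) as pool:
        scores = pool.map(score_sample, samples)
    print(dict(zip(samples, scores)))
```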
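The MapReduce pattern mentioned in the key points can be written down in a few lines. The toy, single-machine sketch below is not the article's implementation and is tied to no particular framework: the map step emits a (reference position, 1) pair for each place a simulated read matches a reference string, and the reduce step sums the counts per position, which is how loosely coupled read-mapping problems are typically expressed for cloud execution.

```python
# Toy MapReduce sketch: count how many simulated reads map to each position
# of a reference string. In a real cloud deployment the map and reduce steps
# would run on many machines; the reads and reference here are hypothetical.
from collections import defaultdict
from typing import Dict, Iterator, Tuple

REFERENCE = "ACGTACGTGGTACCAGT"


def map_read(read: str) -> Iterator[Tuple[int, int]]:
    """Map step: emit (reference_position, 1) for every exact match."""
    start = REFERENCE.find(read)
    while start != -1:
        yield start, 1
        start = REFERENCE.find(read, start + 1)


def reduce_counts(pairs: Iterator[Tuple[int, int]]) -> Dict[int, int]:
    """Reduce step: sum the counts emitted for each key."""
    counts: Dict[int, int] = defaultdict(int)
    for position, count in pairs:
        counts[position] += count
    return dict(counts)


reads = ["ACGT", "GGTA", "CCAG", "ACGT"]
mapped = (pair for read in reads for pair in map_read(read))
print(reduce_counts(mapped))  # {0: 2, 4: 2, 8: 1, 12: 1}
```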

Abstract

Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist — such as cloud and heterogeneous computing — to successfully tackle our big data problems.


Figure 1: Generating and integrating large-scale, diverse types of data.
Figure 2: Cluster, cloud, grid and heterogeneous computing hardware and software stacks.
Figure 3: Amazon Web Services.



Author information


Corresponding author

Correspondence to Eric E. Schadt.

Ethics declarations

Competing interests

Eric E. Schadt, Jon Sorenson and Lawrence Lee are all employed by Pacific Biosciences and own stock in the company.

Related links

FURTHER INFORMATION

1000 Genomes Project

3Tera Application Store

Amazon Machine Images

Amazon Web Services Management Console (MC)

CLC Bioinformatics Cube

Collaboration between the National Science Foundation and Microsoft Research for access to cloud computing

Condor Project

Database of Genotypes and Phenotypes (dbGAP)

Ensembl

GenBank

Gene Expression Omnibus (GEO)

Nature Reviews Genetics audio slide show on 'Computational solutions to large-scale data management'

NIST Technical Report

NVIDIA Bio WorkBench

Pacific Biosciences Developers Network

Protein Data Bank (PDB)

Public data sets available through Amazon Web Services

UniGene

VMware Virtual Appliances

Glossary

Petabyte

Refers to 10^15 bytes. Large computer systems now commonly have many petabytes of storage.

Cloud-based computing

The abstraction of the underlying hardware architectures (for example, servers, storage and networking) that enables convenient, on-demand network access to a shared pool of computing resources that can be readily provisioned and released.

Heterogeneous computational environments

Computers that integrate specialized accelerators, for example, graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), alongside general purpose processors (GPPs).

High-performance computing

A catch-all term for hardware and software systems that are used to solve 'advanced' computational problems.

Bayesian network

A network that captures causal relationships between variables or nodes of interest (for example, transcription levels of a gene, protein states, and so on). Bayesian networks enable the incorporation of prior information in establishing relationships between nodes.
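
As a toy illustration of the factorization such a network encodes (not drawn from the article; the variables and probabilities are invented), consider a three-node chain in which a DNA variant influences a gene's transcript level, which in turn influences a trait. The joint probability is simply the product of each node's conditional probability given its parent.

```python
# Toy three-node Bayesian network: variant -> expression -> trait.
# All probabilities below are invented for illustration.
p_variant = {True: 0.3, False: 0.7}                    # P(variant)
p_expr_high_given_variant = {True: 0.8, False: 0.2}    # P(expr high | variant)
p_trait_given_expr_high = {True: 0.6, False: 0.1}      # P(trait | expr high)


def joint(variant: bool, expr_high: bool, trait: bool) -> float:
    """P(variant, expression, trait) as a product of local conditionals."""
    p_e = p_expr_high_given_variant[variant]
    p_t = p_trait_given_expr_high[expr_high]
    return (p_variant[variant]
            * (p_e if expr_high else 1.0 - p_e)
            * (p_t if trait else 1.0 - p_t))


# Probability of carrying the variant, expressing the gene highly and showing the trait.
print(joint(True, True, True))  # 0.3 * 0.8 * 0.6 = 0.144
```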

NP hard

For the purposes of this paper, NP hard problems are some of the most difficult computational problems; as such, they are typically not solved exactly, but with heuristics and high-performance computing.

Algorithm

A well-defined method or list of instructions for solving a problem.

Parallelization

Parallelizing an algorithm enables different tasks that are carried out by its implementation to be distributed across multiple processors, so that multiple tasks can be carried out simultaneously.

Markov chain Monte Carlo

A general method for integrating over probability distributions so that inferences can be made about model parameters or predictions can be made from a model of interest. The sampling from the probability distributions required for this process draws samples from a specially constructed Markov chain: a discrete-time random process in which the distribution of the random variable at a given time point, conditional on all previous time points, depends only on the state at the immediately preceding time point.
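
As a minimal, illustrative example of the procedure defined above (not taken from the article), the sketch below uses a random-walk Metropolis sampler to draw from a standard normal distribution; each proposal depends only on the current state, which is exactly the Markov property in the definition.

```python
# Minimal random-walk Metropolis sampler targeting a standard normal
# distribution; purely illustrative, not code from the article.
import math
import random


def log_target(x: float) -> float:
    """Unnormalized log-density of the distribution being sampled."""
    return -0.5 * x * x


def metropolis(n_samples: int, step: float = 1.0) -> list:
    samples = []
    x = 0.0  # current state of the Markov chain
    for _ in range(n_samples):
        proposal = x + random.gauss(0.0, step)  # depends only on x
        # Accept with probability min(1, target(proposal) / target(x)).
        accept_prob = math.exp(min(0.0, log_target(proposal) - log_target(x)))
        if random.random() < accept_prob:
            x = proposal
        samples.append(x)
    return samples


draws = metropolis(50_000)
print(sum(draws) / len(draws))  # close to the true mean of 0
```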

General purpose processor

A microprocessor designed for many purposes. It is typified by the x86 processors made by Intel and AMD and used in most desktop, laptop and server computers.

OPs/byte

A technical metric that describes how many computational operations (OPs) are performed per byte of data accessed, and where those bytes originate.
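
For example (a back-of-the-envelope illustration, not a figure from the article), a dot product of two double-precision vectors performs roughly two arithmetic operations for every sixteen bytes fetched from memory, an OPs/byte ratio of about 0.125; such memory-bound computations gain little from extra arithmetic units, whereas compute-dense kernels can exploit them fully.

```python
# Back-of-the-envelope OPs/byte estimate for a dot product of two
# double-precision vectors of length n (illustrative only).
n = 1_000_000
ops = 2 * n              # one multiply and one add per element pair
bytes_read = 2 * n * 8   # two 8-byte doubles fetched per element pair
print(ops / bytes_read)  # 0.125 operations per byte: strongly memory-bound
```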

Random access memory

Computer memory that can be accessed in any order. It typically refers to the computer system's main memory and is implemented with large-capacity, volatile DRAM modules.

Cluster

Multiple computers linked together, typically through a fast local area network, that effectively function as a single computer.

Cluster-based computing

An inexpensive and scalable approach to large-scale computing that lowers costs by networking hundreds to thousands of conventional desktop central processing units together to form a supercomputer.

Computational node

The unit of replication in a computer cluster. Typically it consists of a complete computer comprising one or more processors, dynamic random access memory (DRAM) and one or more hard disks.

Central processing unit

(CPU). A term often used interchangeably with the term 'processor', the CPU is the component in the computer system that executes the instructions in the program.

Virtualization

Refers to software that abstracts the details of the underlying physical computational architecture and allows a virtual machine to be instantiated.

Operating system

Software that manages the different applications that can access a computer's hardware, as well as the ways in which a user can manipulate the hardware.

Health Insurance Portability and Accountability Act

(HIPAA). United States legislation that regulates, among other things, the secure handling of health information.

Distributed file system, distributed query language and distributed database

A file system, query language or database that allows files, queries and databases, respectively, to be accessed from many different hosts that are networked together, enabling sharing via the network. In this way, many different processes (or users) running on many different computers can share data, storage and database resources and execute queries across a large grid of computers.

Core

An individual processing unit that executes instructions. The term is typically used in the context of multi-core processors, which integrate multiple cores into a single processor.

Graphics processing unit

(GPU). A specialized processor that is designed to accelerate real-time graphics. Previously narrowly tailored for that application, these chips have evolved so that they can now be used for many forms of general purpose computing. GPUs can offer tenfold higher throughput than traditional general purpose processors (GPPs).

Field-programmable gate array

(FPGA). Digital logic that can be reconfigured for different tasks. It is typically used for prototyping custom digital integrated circuits during the design process. Modern FPGAs include many embedded memory blocks and digital signal-processing units, making them suitable for some general purpose computing tasks.

Floating point operations

(FLOPS). The count of floating point arithmetic operations (an approximation of operations on real numbers) in an application.

Single-molecule, real-time sequencing

(SMRT sequencing). Pacific Biosciences' proprietary sequencing platform in which DNA polymerization is monitored in real time using zero-mode waveguide technology. SMRT sequencing produces much longer reads than do current second-generation technologies (averaging 1,000 bp or more versus 150–400 bp). It also produces kinetic information that can be used to detect base modifications such as methyl-cytosine.

Bucket

The fundamental storage unit provided to Amazon S3 users for storing files. Buckets are containers for your files, conceptually similar to a root folder on your personal hard drive, except that the file storage is hosted on Amazon S3.
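
As an illustration of how buckets are used in practice, the sketch below creates a bucket and uploads a file with the boto3 Python client, a present-day AWS library that the article does not mention; the bucket and file names are hypothetical, and valid AWS credentials are assumed to be configured in the environment.

```python
# Minimal sketch: create an S3 bucket and upload a data file into it using
# the boto3 library. Names are hypothetical; AWS credentials and (depending
# on region) a bucket location configuration may be required.
import boto3

s3 = boto3.client("s3")

# Bucket names are globally unique across all of Amazon S3.
s3.create_bucket(Bucket="my-sequencing-project-data")

# Store a local file under a key, the S3 analogue of a file path.
s3.upload_file(
    Filename="sample_reads.fastq",
    Bucket="my-sequencing-project-data",
    Key="run-001/sample_reads.fastq",
)
```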

Exabyte

Refers to 10^18 bytes. For context, Cisco estimates that the monthly global internet traffic in the spring of 2010 was 21 exabytes.


About this article

Cite this article

Schadt, E., Linderman, M., Sorenson, J. et al. Computational solutions to large-scale data management and analysis. Nat Rev Genet 11, 647–657 (2010). https://doi.org/10.1038/nrg2857

