Key Points
-
The explosive growth of omics data generated in individual laboratories around the world presents new challenges that require innovative computational tools.
-
Modern indexing techniques for assembly, read mapping and search can solve problems in storing and accessing massive sequencing data at the terabyte scale.
-
Compressive genomics is one such technique that exploits the redundancy in genomic data to store sequence data in compressed form while allowing efficient and effective searches on these data without decompressing first.
-
The toolbox for transcriptomic data now includes sophisticated computational methods that are able to discover patterns from unstructured or semi-structured data, borrowed from the fields of data mining and machine learning.
-
Graph-theoretical techniques, such as network flow and random walks, can assist in the interpretation of diverse types of functional genomics data.
-
A range of software tools and websites implementing key algorithmic ideas is increasingly available to meet these omics data challenges and such tools are presented in this article.
Abstract
High-throughput experimental technologies are generating increasingly massive and complex genomic data sets. The sheer enormity and heterogeneity of these data threaten to make the arising problems computationally infeasible. Fortunately, powerful algorithmic techniques lead to software that can answer important biomedical questions in practice. In this Review, we sample the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data. We spotlight specific examples that have facilitated and enriched analyses of sequence, transcriptomic and network data sets.
This is a preview of subscription content, access via your institution
Access options
Subscribe to this journal
Receive 12 print issues and online access
$189.00 per year
only $15.75 per issue
Buy this article
- Purchase on Springer Link
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
References
Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
Goecks, J., Nekrutenko, A., Taylor, J. & Team, G. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86 (2010).
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
Kircher, M. & Kelso, J. High-throughput DNA sequencing — concepts and limitations. BioEssays 32, 524–536 (2010).
Kahn, S. D. On the future of genomic data. Science 331, 728–729 (2011).
Gross, M. Riding the wave of biological data. Curr. Biol. 21, R204–R206 (2011).
Huttenhower, C. & Hofmann, O. A quick guide to large-scale genomic data mining. PLoS Comput. Biol. 6, e1000779 (2010).
Schatz, M., Langmead, B. & Salzberg, S. Cloud computing and the DNA data race. Nature Biotech. 28, 691–693 (2010).
Stein, L. D. The case for cloud computing in genome informatics. Genome Biol. 11, 207 (2010).
Tringe, S. G. & Rubin, E. M. Metagenomics: DNA sequencing of environmental samples. Nature Rev. Genet. 6, 805–814 (2005).
Gstaiger, M. & Aebersold, R. Applying mass spectrometry-based proteomics to genetics, genomics and network biology. Nature Rev. Genet. 10, 617–627 (2009).
Mardis, E. R. Next-generation DNA sequencing methods. Annu. Rev. Genom. Hum. Genet. 9, 387–402 (2008).
Metzker, M. L. Sequencing technologies — the next generation. Nature Rev. Genet. 11, 31–46 (2010).
Schatz, M. C., Delcher, A. L. & Salzberg, S. L. Assembly of large genomes using second-generation sequencing. Genome Res. 20, 1165–1173 (2010).
Flicek, P. & Birney, E. Sense from sequence reads: methods for alignment and assembly. Nature Methods 6, S6–S12 (2009).
Pevzner, P. A., Tang, H. & Waterman, M. S. An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98, 9748–9753 (2001). The EULER assembler introduces the de Bruijn graph and Eulerian path formulation for assembly, a paradigm used in the most popular assemblers.
Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, 177–189 (2002).
Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010).
Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18, 810–820 (2008).
Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).
Compeau, P. E., Pevzner, P. A. & Tesler, G. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 29, 987–991 (2011).
Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012).
Earl, D. et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21, 2224–2241 (2011).
Vezzi, F., Narzisi, G. & Mishra, B. Reevaluating assembly rvaluations with feature response curves: GAGE and Assemblathons. PLoS ONE 7, e52210 (2012).
Salzberg, S. L. et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012).
Kingsford, C., Schatz, M. C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010). This paper analyses complexity issues in genome assembly; the primary algorithmic challenge is that assembly can be complicated by short reads and genomic repeats.
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357–359 (2012).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009). Bowtie is probably the most widely used FM-index- or BWT-based short-read mapper. It demonstrates that the read-mapping problem can be done accurately even on a personal computer.
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009).
Ferragina, P. & Manzini, G. Indexing compressed text. JACM 52, 552–581 (2005).
Burrows, M. & Wheeler, D. J. A block-sorting lossless data compression algorithm (Digital Equipment Corporation, 1994).
Hach, F. et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature Methods 7, 576–577 (2010).
Hsi-Yang Fritz, M., Leinonen, R., Cochrane, G. & Birney, E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21, 734–740 (2011).
Christley, S., Lu, Y., Li, C. & Xie, X. Human genomes as e-mail attachments. Bioinformatics 25, 274–275 (2009).
Pinho, A. J., Pratas, D. & Garcia, S. P. GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res. 40, e27 (2012).
Tembe, W., Lowey, J. & Suh, E. G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 26, 2192–2194 (2010).
Brandon, M. C., Wallace, D. C. & Baldi, P. Data structures and compression algorithms for genomic sequence data. Bioinformatics 25, 1731–1738 (2009).
Wang, C. & Zhang, D. A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res. 39, e45 (2011).
Loh, P. R., Baym, M. & Berger, B. Compressive genomics. Nature Biotech. 30, 627–630 (2012). This paper introduces 'compressive genomics', a general algorithmic paradigm that harnesses redundancy within data sets to speed up analyses by compressing data in such a way as to allow direct computation on the compressed data. Compressed versions of BLAST and BLAT demonstrate search times that scale linearly in the amount of non-redundant data without loss of accuracy.
Deorowicz, S. & Grabowski, S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 27, 860–862 (2011).
Hach, F., Numanagic, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057 (2012).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Kent, W. J. BLAT—the BLAST-Like Alignment Tool. Genome Res. 12, 656–664 (2002).
Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods 5, 621–628 (2008).
Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517 (2008).
Ozsolak, F. et al. Direct RNA sequencing. Nature 461, 814–818 (2009).
Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotech. 31, 46–53 (2012).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotech. 29, 644–652 (2011).
Garber, M., Grabherr, M. G., Guttman, M. & Trapnell, C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods 8, 469–477 (2011).
Brown, P. O. & Botstein, D. Exploring the new world of the genome with DNA microarrays. Nature Genet. 21, 33–37 (1999).
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets-update. Nucleic Acids Res. 41, D991–D995 (2013).
Butte, A. The use and analysis of microarray data. Nature Rev. Drug Discov. 1, 951–960 (2002).
Allison, D. B., Cui, X., Page, G. P. & Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nature Rev. Genet. 7, 55–65 (2006).
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Rev. Genet. 11, 733–739 (2010).
Shen-Orr, S. S. et al. Cell type-specific gene expression differences in complex tissues. Nature Methods 7, 287–289 (2010). This work describes a linear algebraic approach to model the mixture of gene expression signals of multiple cell types from microarray experiments and to deconvolute the signals separately for each cell type.
Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protoc. 7, 562–578 (2012).
Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25, 1105–1111 (2009).
Kim, D. & Salzberg, S. L. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 12, R72 (2011).
Whitney, A. R. et al. Individuality and variation in gene expression patterns in human blood. Proc. Natl Acad. Sci. USA 100, 1896–1901 (2003).
Lu, P., Nakorchevskiy, A. & Marcotte, E. M. Expression deconvolution: a reinterpretation of DNA microarray data reveals dynamic changes in cell populations. Proc. Natl Acad. Sci. USA 100, 10370–10375 (2003).
Wang, Y. et al. In silico estimates of tissue components in surgical samples based on expression profiling data. Cancer Res. 70, 6448–6455 (2010).
Gaujoux, R. & Seoighe, C. Semi-supervised nonnegative matrix factorization for gene expression deconvolution: a case study. Infect. Genet. Evol. 12, 913–921 (2012).
Clarke, J., Seo, P. & Clarke, B. Statistical expression deconvolution from mixed tissue samples. Bioinformatics 26, 1043–1049 (2010).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Reich, M. et al. GenePattern 2.0. Nature Genet. 38, 500–501 (2006).
Tanay, A., Sharan, R., Kupiec, M. & Shamir, R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc. Natl Acad. Sci. USA 101, 2981–2986 (2004).
Narayanan, M., Vetta, A., Schadt, E. E. & Zhu, J. Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS Comput. Biol. 6, e1000742 (2010).
Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
Segal, E. et al. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genet. 34, 166–176 (2003). A probabilistic graphical model is constructed to identify regulatory modules, consisting of co-regulated or co-expressed genes, from gene expression data.
Kim, D., Kim, M. S. & Cho, K. H. The core regulation module of stress-responsive regulatory networks in yeast. Nucleic Acids Res. 40, 8793–8802 (2012).
Zinman, G. E., Zhong, S. & Bar-Joseph, Z. Biological interaction networks are conserved at the module level. BMC Syst. Biol. 5, 134 (2011).
Rhrissorrakrai, K. & Gunsalus, K. C. MINE: Module Identification in Networks. BMC Bioinformatics 12, 192 (2011).
Colak, R. et al. Module discovery by exhaustive search for densely connected, co-expressed regions in biomolecular interaction networks. PLoS ONE 5, e13348 (2010).
Ali, W. & Deane, C. M. Functionally guided alignment of protein interaction networks for module detection. Bioinformatics 25, 3166–3173 (2009).
Zhang, Y., Xuan, J., de los Reyes, B. G., Clarke, R. & Ressom, H. W. Reverse engineering module networks by PSO-RNN hybrid modeling. BMC Genomics 10 (Suppl. 1), S15 (2009).
Michoel, T., De Smet, R., Joshi, A., Van de Peer, Y. & Marchal, K. Comparative analysis of module-based versus direct methods for reverse-engineering transcriptional regulatory networks. BMC Syst. Biol. 3, 49 (2009).
Joshi, A., De Smet, R., Marchal, K., Van de Peer, Y. & Michoel, T. Module networks revisited: computational assessment and prioritization of model predictions. Bioinformatics 25, 490–496 (2009).
Wang, X., Dalkic, E., Wu, M. & Chan, C. Gene module level analysis: identification to networks and dynamics. Curr. Opin. Biotechnol. 19, 482–491 (2008).
Hirose, O. et al. Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models. Bioinformatics 24, 932–942 (2008).
Litvin, O., Causton, H. C., Chen, B. J. & Pe'er, D. Modularity and interactions in the genetics of gene expression. Proc. Natl Acad. Sci. USA 106, 6441–6446 (2009).
Akavia, U. D. et al. An integrated approach to uncover drivers of cancer. Cell 143, 1005–1017 (2010). The computational approach CONEXIC implements a module network to integrate different data sets, including CNVs and gene expression, from cancer studies and discover dysregulated genes.
Maathuis, M. H., Colombo, D., Kalisch, M. & Buhlmann, P. Predicting causal effects in large-scale systems from observational data. Nature Methods 7, 247–248 (2010). This paper describes an algorithm to estimate the effects of perturbations from observational data in gene expression experiments in which the causal relationship is not known between genes.
Markowetz, F., Kostka, D., Troyanskaya, O. G. & Spang, R. Nested effects models for high-dimensional phenotyping screens. Bioinformatics 23, I305–I312 (2007).
Prat, Y., Fromer, M., Linial, N. & Linial, M. Recovering key biological constituents through sparse representation of gene expression. Bioinformatics 27, 655–661 (2011).
Yeung, K. Y. & Ruzzo, W. L. Principal component analysis for clustering gene expression data. Bioinformatics 17, 763–774 (2001).
Schmid, M. et al. A gene expression map of Arabidopsis thaliana development. Nature Genet. 37, 501–506 (2005). Scalable methods are introduced here that associate expression patterns to phenotypes both to label new expression samples with and to identify marker genes for phenotypes.
Zhou, X., Kao, M. C. & Wong, W. H. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA 99, 12783–12788 (2002).
Parts, L., Stegle, O., Winn, J. & Durbin, R. Joint genetic analysis of gene expression data with inferred cellular phenotypes. PLoS Genet. 7, e1001276 (2011).
Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nature Protoc. 7, 500–507 (2012).
Ng, S. et al. PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics 28, i640–i646 (2012).
Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–i245 (2010).
The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
Heiser, L. M. et al. Subtype and pathway specific responses to anticancer compounds in breast cancer. Proc. Natl Acad. Sci. USA 109, 2724–2729 (2012).
Liu, X., Yu, X., Zack, D. J., Zhu, H. & Qian, J. TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics 9, 271 (2008).
Ogasawara, O. et al. BodyMap-Xs: anatomical breakdown of 17 million animal ESTs for cross-species comparison of gene expression. Nucleic Acids Res. 34, D628–D631 (2006).
Sirota, M. et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci. Transl. Med. 3, 96ra77 (2011).
Lamb, J. The Connectivity Map: a new tool for biomedical research. Nature Rev. Cancer 7, 54–60 (2007).
Schmid, P. R., Palmer, N. P., Kohane, I. S. & Berger, B. Making sense out of massive data by going beyond differential expression. Proc. Natl Acad. Sci. USA 109, 5594–5599 (2012).
Palmer, N. P., Schmid, P. R., Berger, B. & Kohane, I. S. A gene expression profile of stem cell pluripotentiality and differentiation is conserved across diverse solid and hematopoietic cancers. Genome Biol. 13, R71 (2012).
Dudley, J. T., Tibshirani, R., Deshpande, T. & Butte, A. J. Disease signatures are robust across tissues and experiments. Mol. Syst. Biol. 5, 307 (2009).
Li, W. et al. Integrative analysis of many weighted co-expression networks using tensor computation. PLoS Comput. Biol. 7, e1001106 (2011).
Franceschini, A. et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815 (2013).
Croft, D. et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 39, D691–D697 (2011).
Chatr-aryamontri, A. et al. The BioGRID interaction database: 2013 update. Nucleic Acids Res. 41, D816–D823 (2013).
Gerstein, M. B. et al. Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012).
Wong, A. K. et al. IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res. 40, W484–W490 (2012).
Hartwell, L. H., Hopfield, J. J., Leibler, S. & Murray, A. W. From molecular to modular cell biology. Nature 402, C47–C52 (1999).
Ideker, T., Ozier, O., Schwikowski, B. & Siegel, A. F. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18 (Suppl. 1), S233–240 (2002).
Ulitsky, I. & Shamir, R. Identification of functional modules using network topology and high-throughput data. BMC Syst. Biol. 1, 8 (2007). This study uncovers modules in interaction networks such that the components within a module are also similar to each other with respect to expression or another attribute of interest.
Jiang, P. & Singh, M. SPICi: a fast clustering algorithm for large biological networks. Bioinformatics 26, 1105–1111 (2010).
Nabieva, E., Jim, K., Agarwal, A., Chazelle, B. & Singh, M. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21 (Suppl. 1), i302–i310 (2005). Network flow-based methods are introduced as a paradigm for propagating information within cellular networks.
Singh, R. & Berger, B. Influence flow: integrating pathway-specific RNAi data and protein interaction data. International Society for Computational Biology [online], (2007).
Yeger-Lotem, E. et al. Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity. Nature Genet. 41, 316–323 (2009).
Lan, A. et al. ResponseNet: revealing signaling and regulatory networks linking genetic and transcriptomic screening data. Nucleic Acids Res. 39, W424–W429 (2011).
Huang, S. S. & Fraenkel, E. Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks. Sci. Signal. 2, ra40 (2009). This paper introduces a Steiner tree formulation to uncover subnetworks connecting a set of seed proteins.
Tuncbag, N., McCallum, S., Huang, S. S. & Fraenkel, E. SteinerNet: a web server for integrating 'omic' data to discover hidden components of response pathways. Nucleic Acids Res. 40, W505–W509 (2012).
Yeang, C. H., Ideker, T. & Jaakkola, T. Physical network models. J. Comput. Biol. 11, 243–262 (2004).
Tu, Z., Wang, L., Arbeitman, M. N., Chen, T. & Sun, F. An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics 22, e489–e496 (2006).
Suthram, S. Beyer, A., Karp, R. M., Eldar, Y. & Ideker, T. eQED: an efficient method for interpreting eQTL associations using protein networks. Mol. Syst. Biol. 4, 162 (2008).
Kim, Y. A., Wuchty, S. & Przytycka, T. M. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Comput. Biol. 7, e1001095 (2011).
Doyle, P. G. & Snell, J. L. Random Walks and Electric Networks (Mathematical Association of America, 1984).
Steffen, M., Petti, A., Aach, J., D'Haeseleer, P. & Church, G. Automated modelling of signal transduction networks. BMC Bioinformatics 3, 34 (2002).
Pandey, J. et al. Functional annotation of regulatory pathways. Bioinformatics 23, i377–i386 (2007).
Banks, E., Nabieva, E., Chazelle, B. & Singh, M. Organization of physical interactomes as uncovered by network schemas. PLoS Comput. Biol. 4, e1000203 (2008).
Banks, E., Nabieva, E., Peterson, R. & Singh, M. NetGrep: fast network schema searches in interactomes. Genome Biol. 9, R138 (2008).
Singh, R., Xu, J. & Berger, B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl Acad. Sci. USA 105, 12763–12768 (2008). This paper introduces global network alignment and pioneers the use of spectral methods to solve it. Led to IsoBase, a database of functionally related proteins across protein-protein, genetic interaction and metabolic networks, simultaneously incorporating both sequence and network data.
Liao, C. S., Lu, K., Baym, M., Singh, R. & Berger, B. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 25, i253–i258 (2009).
Flannick, J., Novak, A. Srinivasan, B. S., McAdams, H. H. & Batzoglou, S. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res. 16, 1169–1181 (2006).
Koyuturk, M. et al. Pairwise alignment of protein interaction networks. J. Comput. Biol. 13, 182–199 (2006).
Kelley, B. P. et al. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl Acad. Sci. USA 100, 11394–11399 (2003).
Atias, N. & Sharan, R. Comparative analysis of protein networks: hard problems, practical solutions. Commun. Acm 55, 88–97 (2012).
Park, D., Singh, R., Baym, M., Liao, C. S. & Berger, B. IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Res. 39, D295–D300 (2011).
Ma, C.-Y. et al. Reconstruction of phyletic trees by global alignment of multiple metabolic networks. BMC Bioinformatics (in the press).
Goh, K. I. et al. The human disease network. Proc. Natl Acad. Sci. USA 104, 8685–8690 (2007).
Rossin, E. J. et al. Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS Genet. 7, e1001273 (2011).
Navlakha, S. & Kingsford, C. The power of protein interaction networks for associating genes with diseases. Bioinformatics 26, 1057–1063 (2010).
Vanunu, O., Magger, O., Ruppin, E., Shlomi, T. & Sharan, R. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol. 6, e1000641 (2010).
Kohler, S., Bauer, S., Horn, D. & Robinson, P. N. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008). This paper introduces random-walk based approaches for prioritizing disease genes using interaction networks.
Erten, S., Bebek, G., Ewing, R. M. & Koyuturk, M. DADA: degree-aware algorithms for network-based disease gene prioritization. BioData Min. 4, 19 (2011).
Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol. 18, 507–522 (2011). The authors develop a flow-based and statistical approach for analysing genes mutated in cancers within their network context in order to identify significantly mutated subnetworks.
Cerami, E., Demir, E., Schultz, N., Taylor, B. S. & Sander, C. Automated network analysis identifies core pathways in glioblastoma. PLoS ONE 5, e8918 (2010).
Kumar, P., Henikoff, S. & Ng, P. C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protoc. 4, 1073–1081 (2009).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nature Methods 7, 248–249 (2010).
Yandell, M. et al. A probabilistic disease-gene finder for personal genomes. Genome Res. 21, 1529–1542 (2011).
Vandin, F., Upfal, E. & Raphael, B. J. De novo discovery of mutated driver pathways in cancer. Genome Res. 22, 375–385 (2012).
Chowdhury, S. A. & Koyuturk, M. Identification of coordinately dysregulated subnetworks in complex phenotypes. Pac. Symp. Biocomput. 2010, 133–144 (2010).
Ulitsky, I., Krishnamurthy, A., Karp, R. M. & Shamir, R. DEGAS: de novo discovery of dysregulated pathways in human diseases. PLoS ONE 5, e13367 (2010).
Cho, D.-Y., Kim, Y.-A. & Przytycka, T. M. Network biology approach to complex diseases. PLoS Comput. Biol. (in the press).
Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nature Rev. Genet. 10, 57–63 (2009).
Furey, T. S. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nature Rev. Genet. 13, 840–852 (2012).
Hafner, M., Lianoglou, S., Tuschl, T. & Betel, D. Genome-wide identification of miRNA targets by PAR-CLIP. Methods 58, 94–105 (2012).
Wang, E. T. et al. Transcriptome-wide regulation of pre-mRNA splicing and mRNA localization by muscleblind proteins. Cell 150, 710–724 (2012).
Ascano, M., Hafner, M., Cekan, P., Gerstberger, S. & Tuschl, T. Identification of RNA-protein interaction networks using PAR-CLIP. Wiley Interdiscip. Rev. RNA 3, 159–177 (2012).
Jungkamp, A. C. et al. In vivo and transcriptome-wide identification of RNA binding protein target sites. Mol. Cell 44, 828–840 (2011).
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).
Meyer, L. R. et al. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res. 41, D64–D69 (2013).
de Souza, N. The ENCODE project. Nature Methods 9, 1046–1046 (2012).
Gerstein, M. B. et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE Project. Science 330, 1775–1787 (2010).
Manber, U. & Myers, G. Suffix Arrays — a new method for online string searches. Siam J. Comput. 22, 935–948 (1993).
Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–714 (2008).
Paten, B. et al. Cactus: algorithms for genome multiple sequence alignment. Genome Res. 21, 1512–1528 (2011).
Acknowledgements
The authors thank and L. Cowen for valuable feedback. B.B. thanks the US National Institutes of Health (NIH) for grant GM081871. M.S. thanks the NIH for grant GM076275 and US National Science Foundation (NSF) for grant ABI0850063.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Related links
Glossary
- Cloud computing
-
The use of computing resources distributed in the Internet to store, manage and analyse data, rather than doing so on a local server or personal computer.
- Parallel computing
-
A form of computation that allows numerous calculations to be carried out simultaneously, thereby accelerating computation. On the basis of this principle, many large-scale computational tasks can then be divided into smaller ones and solved on multiple machines concurrently.
- Machine learning techniques
-
Empirical data are taken as input, the relationship among the data is mathematically or statistically modelled, and patterns or predictions are generated. Supervised learning algorithms infer a function from labelled data features and predict labels on future input; unsupervised learning algorithms model the patterns or the distribution of a given unlabelled data set.
- Parallel dynamic programming
-
A technique that splits a large dynamic programming problem, usually by filling a table that can avoid redundant calculation, into a number of subproblems and computes all subproblems in parallel using multiple central processing units (CPUs). The computing speed-up scales almost linearly with the number of CPUs.
- Multicore computer processing units
-
(Multicore CPUs). Single computing processors with two or more independent computing units (called cores). Running multiple instructions on multiple cores at the same time can increase the overall speed of programs.
- Cache-oblivious algorithm
-
Takes advantage of the cache system of the central processing unit (that is, the local memory of frequently accessed data) to avoid expensive memory access operations and thus to improve efficiency; the intrinsic design of these algorithms does not require computer programs to be tuned for machines with different cache systems.
- Linear mixed model
-
A statistical model that models the observed effects from multiple different hidden factors; the effects are additively mixed according to the proportions of their corresponding factors.
- Matrix factorization
-
A method for decomposing a matrix into the product of two matrices. It can be applied to identify individual factors involved in a mixed observation.
- Differential geometry
-
A mathematical discipline for studying geometric objects, such as curves and surfaces, using the techniques of differential and integral calculus.
- Linear programming
-
A mathematical program for the optimization of a linear objective function, subject to linear constraints. Such functions capture the linear relationship between variables for the problem being optimized.
- Principle component analysis
-
A tool for transforming a set of observations with correlated variables into a set of linearly independent variables called principle components, making sure that the first principle component accounts for the largest variability of the data.
- Copy number variant
-
(CNV). Corresponds to abnormal number of copies of one or more segments in the genome. CNVs can be caused by structural rearrangements of the genome such as deletions, duplications, inversions and translocations.
- Bayesian network
-
A statistical model that describes the distribution of a set of random variables by a directed acyclic graph that represents the relationship among the random variables. For example, in a Bayesian network for a regulatory relationship for a set of genes, each variable represents a gene and each directed edge denotes either activating or repressing regulation between two genes.
- Steiner tree problem
-
Formulated on a network to find a minimum-length subnetwork that interconnects a set of seed nodes. Any two seed nodes may be connected by an edge or a path through other nodes.
- Random walk
-
A mathematical formulation of a number of successive random steps on a graph. It has been widely used to explain stochastic observations, such as diffusion in biological networks.
- Eigenvalue problem
-
The aim of this is to find a non-zero vector (that is, eigenvector), given a square matrix, such that the multiplication of the two is only different by a scalar factor.
- Set cover
-
Given a set of elements and subsets, the goal is to find the minimum number of subsets that cover all the elements.
Rights and permissions
About this article
Cite this article
Berger, B., Peng, J. & Singh, M. Computational solutions for omics data. Nat Rev Genet 14, 333–346 (2013). https://doi.org/10.1038/nrg3433
Published:
Issue Date:
DOI: https://doi.org/10.1038/nrg3433
This article is cited by
-
DNA-framework-based multidimensional molecular classifiers for cancer diagnosis
Nature Nanotechnology (2023)
-
Detecting protein complexes with multiple properties by an adaptive harmony search algorithm
BMC Bioinformatics (2022)
-
A novel liver cancer diagnosis method based on patient similarity network and DenseGCN
Scientific Reports (2022)
-
Precision medicine: the precision gap in rheumatic disease
Nature Reviews Rheumatology (2022)
-
Transcription Factor Activation Profiles (TFAP) identify compounds promoting differentiation of Acute Myeloid Leukemia cell lines
Cell Death Discovery (2022)