Recent technological advances have expanded the breadth of available omic data, from whole-genome sequencing data to extensive transcriptomic, methylomic and metabolomic data. A key goal of analysing these data is to identify effective models that predict phenotypic traits and outcomes, thereby elucidating important biomarkers and generating insights into the genetic underpinnings of the heritability of complex traits. Powerful analysis strategies are still needed to fully harness these comprehensive high-throughput data, identifying true associations while reducing the number of false associations. In this Review, we explore emerging approaches for data integration, including meta-dimensional and multi-staged analyses, which aim to deepen our understanding of the role of genetics and genomics in complex outcomes. The use and further development of these approaches may improve our understanding of the relationship between genomic variation and human phenotypes.