Methods of integrating data to uncover genotype–phenotype interactions

Journal name: Nature Reviews Genetics
Volume: 16
Pages: 85–97
DOI: 10.1038/nrg3868

Abstract

Recent technological advances have expanded the breadth of available omic data, from whole-genome sequencing data to extensive transcriptomic, methylomic and metabolomic data. A key goal of analyses of these data is the identification of effective models that predict phenotypic traits and outcomes, elucidating important biomarkers and generating insights into the genetic underpinnings of the heritability of complex traits. There is still a need for powerful and advanced analysis strategies to fully harness the utility of these comprehensive high-throughput data, identifying true associations and reducing the number of false associations. In this Review, we explore the emerging approaches for data integration, including meta-dimensional and multi-staged analyses, which aim to deepen our understanding of the role of genetics and genomics in complex outcomes. With the use and further development of these approaches, an improved understanding of the relationship between genomic variation and human phenotypes may be gained.

Figures

    Figure 1: Biological systems multi-omics from the genome, epigenome, transcriptome, proteome and metabolome to the phenome.

    Heterogeneous genomic data exist within and between levels, for example, single-nucleotide polymorphism (SNP), copy number variation (CNV), loss of heterozygosity (LOH) and genomic rearrangement, such as translocation, at the genome level; DNA methylation, histone modification, chromatin accessibility, transcription factor (TF) binding and microRNA (miRNA) at the epigenome level; gene expression and alternative splicing at the transcriptome level; protein expression and post-translational modification at the proteome level; and metabolite profiling at the metabolome level. Arrows indicate the flow of genetic information from the genome level to the metabolome level and, ultimately, to the phenome level. The red crosses indicate inactivation of transcription or translation. CSF, cerebrospinal fluid; Me, methylation; TFBS, transcription factor-binding site.

    Figure 2: Alternative hypothesis of complex-trait aetiology.

    Hypothesis A (grey arrow) is the theory that variation is hierarchical, such that variation in DNA leads to variation in RNA and so on in a linear manner. Hypothesis B (black arrow) is the idea that it is the combination of variation across all possible omic levels in concert that leads to phenotype.

    Figure 3: Categorization of multi-staged analysis.

    Multi-staged analysis can be divided into three categories. a | Expression quantitative trait locus (eQTL) analysis involves the identification of genetic variation associated with quantitative measures of gene expression. b | Allele-specific expression analysis tests whether the maternal or paternal allele is preferentially expressed, followed by the association of this allele with cis-element variations and epigenetic modifications. c | Domain knowledge overlap involves a two-step analysis in which an initial association analysis is performed at the level of single-nucleotide polymorphisms (SNPs) or gene expression variables, followed by the annotation of the significant associations with knowledge generated by other biological experiments. This approach enables the selection of association results that are corroborated by functional data. CTCF, CCCTC-binding factor; Pol II, RNA polymerase II.
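
    As an illustration of the eQTL step in panel a, the following minimal sketch regresses the expression of one gene on the allele dosage (0, 1 or 2 copies of the alternate allele) of one SNP. It is a simplified example under stated assumptions and not the procedure of any particular eQTL tool: the function name `eqtl_association`, the additive dosage coding and the toy data are all hypothetical, and a real analysis would also adjust for covariates and correct for multiple testing.

```python
from statistics import mean


def eqtl_association(dosages, expression):
    """Return (slope, pearson_r) for expression regressed on allele dosage.

    Hypothetical helper for illustration: ordinary least squares on a single
    SNP-gene pair, assuming an additive genetic model.
    """
    if len(dosages) != len(expression):
        raise ValueError("dosages and expression must have one value per sample")
    mean_x, mean_y = mean(dosages), mean(expression)
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    slope = sxy / sxx                  # change in expression per allele copy
    pearson_r = sxy / (sxx * syy) ** 0.5
    return slope, pearson_r


# Toy data (invented): six samples, dosage of one SNP, expression of one gene.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, pearson_r = eqtl_association(dosages, expression)
print(f"slope per allele: {slope:.2f}, r: {pearson_r:.2f}")
```

    In a genome-wide setting the same test would be repeated over many SNP-gene pairs, which is why the significance thresholds need to account for the number of tests performed.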

    Figure 4: Categorization of meta-dimensional analysis.

    Meta-dimensional analysis can be divided into three categories. a | Concatenation-based integration involves combining data sets from different data types at the raw or processed data level before modelling and analysis. b | Transformation-based integration involves performing mapping or data transformation of the underlying data sets before analysis, and the modelling approach is applied at the level of transformed matrices. c | Model-based integration is the process of performing analysis on each data type independently, followed by integration of the resultant models to generate knowledge about the trait of interest. miRNA, microRNA; SNP, single-nucleotide polymorphism.
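
    The three categories can be contrasted with a small sketch on toy data. This is a hedged illustration only: the function names, the choice of a linear kernel as the transformation, the averaging of kernels and the majority vote over per-data-type predictions are assumptions made for this example, and they stand in for whichever transformation and modelling methods a given study actually uses.

```python
def concatenate(*blocks):
    """Concatenation-based: join each sample's features from all data types
    into one row before any modelling is done."""
    return [sum((list(row) for row in rows), []) for rows in zip(*blocks)]


def linear_kernel(block):
    """Map a samples-by-features block to a samples-by-samples similarity matrix."""
    return [[sum(a * b for a, b in zip(u, v)) for v in block] for u in block]


def combine_kernels(*blocks):
    """Transformation-based: transform each data type separately, then merge
    the transformed matrices (here by simple averaging)."""
    kernels = [linear_kernel(block) for block in blocks]
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) / len(kernels) for j in range(n)]
            for i in range(n)]


def majority_vote(*predictions):
    """Model-based: each data type is modelled independently; only the
    resulting binary predictions are integrated."""
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*predictions)]


# Toy data (invented): three samples, two SNP dosages and one expression value each.
snp = [[0, 1], [2, 0], [1, 1]]
expr = [[0.5], [1.5], [1.0]]

joined = concatenate(snp, expr)        # one 3 x 3 feature matrix
combined = combine_kernels(snp, expr)  # one 3 x 3 sample-similarity matrix
# Hypothetical outputs of three separately trained per-data-type classifiers.
ensemble = majority_vote([1, 0, 1], [1, 1, 0], [0, 0, 1])
print(joined, combined, ensemble, sep="\n")
```

    The sketch makes the practical difference visible: concatenation hands a single wide matrix to one model, transformation hands one merged sample-by-sample matrix to one model, and model-based integration never merges the data at all.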

Author information

Affiliations

  1. Department of Biochemistry and Molecular Biology, Center for Systems Genomics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.

    • Marylyn D. Ritchie,
    • Ruowang Li,
    • Sarah A. Pendergrass &
    • Dokyoon Kim
  2. National Human Genome Research Institute, Inherited Disease Research Branch, Baltimore, Maryland 21224, USA.

    • Emily R. Holzinger

Competing interests statement

The authors declare no competing interests.

Author details

  • Marylyn D. Ritchie

    Marylyn D. Ritchie is a professor in the Department of Biochemistry and Molecular Biology, and Director of the Center for Systems Genomics at The Pennsylvania State University, State College, USA. She is a statistical and computational geneticist who focuses on understanding the genetic architecture of complex human diseases. She has expertise in developing new bioinformatic tools for complex analysis of large data sets in genetics, genomics and clinical databases, in particular in the area of pharmacogenomics. She has received several awards and honours, including selection as a Genome Technology Rising Young Investigator in 2006, an Alfred P. Sloan Research Fellow in 2010 and a Kavli Frontiers of Science fellow by the US National Academy of Sciences for each of the past four consecutive years. She has extensive experience in all aspects of genetic epidemiology and bioinformatics related to human genomics, including study design, genotyping platform selection, statistical analysis and interpretation of results. She also has wide knowledge of dealing with large data sets and complex analysis, including genome-wide association studies, next-generation sequencing, copy number variations and data integration of meta-dimensional omic data.

  • Emily R. Holzinger

    Emily R. Holzinger is a postdoctoral fellow with Joan Bailey-Wilson at the Computational and Statistical Genomics Research Branch of the National Human Genome Research Institute, Baltimore, Maryland, USA. She completed her Ph.D. work in Human Genetics at Vanderbilt University, Nashville, Tennessee, USA. Her research interests focus on developing novel computational methods to improve the analysis of complex human traits using high-throughput data.

  • Ruowang Li

    Ruowang Li is pursuing a Ph.D. in bioinformatics and genomics at The Pennsylvania State University, State College, USA. He was fascinated by the complexity of molecular biology, so he studied biology and computer science at Worcester Polytechnic Institute, Massachusetts, USA, from 2007 to 2011. He has been developing and applying computational methods to identify the molecular factors affecting the varied responses of different individuals to chemotherapeutic drugs, as well as the survival status of patients with cancer. He is currently a National Science Foundation graduate fellow in the laboratory of Marylyn D. Ritchie.

  • Sarah A. Pendergrass

    Sarah A. Pendergrass is a research faculty member in the Department of Biochemistry and Molecular Biology at the Center for Systems Genomics and the laboratory of Marylyn D. Ritchie at The Pennsylvania State University, State College, USA. She is a genetic bioinformaticist and focuses on high-throughput data analysis and data-mining projects for uncovering the genetic architecture of complex traits. She has extensive experience in developing novel methodologies, such as those for phenome-wide association studies. She has developed unique software tools to enable researchers to access and analyse data in new ways, including several tools for data visualization. She obtained her Ph.D. in genetics from Dartmouth College, Hanover, New Hampshire, USA, focusing on gene expression analyses and bioinformatics for biomarker and biological discovery for systemic sclerosis, and obtained an M.S. in biomedical engineering from Thayer School of Engineering at Dartmouth College.

  • Dokyoon Kim

    Dokyoon Kim obtained his Ph.D. in biomedical informatics from Seoul National University College of Medicine, Korea. His research entails the development and application of data integration approaches, mostly using data from The Cancer Genome Atlas to improve the ability to diagnose, treat and prevent cancer. His primary focus lies in integrating multi-omic data and biological knowledge to better translate genomic and biomedical data into clinical products. He is currently a postdoctoral fellow in the laboratory of Marylyn D. Ritchie at The Pennsylvania State University, State College, USA.
