Machine learning applications in genetics and genomics

Nature Reviews Genetics 16, 321–332 (2015)
doi:10.1038/nrg3920

Abstract

The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.

Figures

  Figure 1: A canonical example of a machine learning application.

    A training set of DNA sequences is provided as input to a learning procedure, along with binary labels indicating whether each sequence is centred on a transcription start site (TSS) or not. The learning algorithm produces a model that can subsequently be used, in conjunction with a prediction algorithm, to assign predicted labels (such as 'TSS' or 'not TSS') to unlabelled test sequences. In the figure, the red–blue gradient might represent, for example, the scores of various motif models (one per column) against the DNA sequence.
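
    The workflow in Figure 1 can be made concrete with a minimal sketch, here using scikit-learn and made-up toy sequences; the k-mer featurization, training data and choice of classifier are illustrative assumptions, not taken from the paper.

      # A minimal sketch of the Figure 1 workflow: featurize labelled DNA
      # sequences, run a learning procedure, then use the resulting model
      # to predict labels for unlabelled test sequences. Toy data only.
      from itertools import product
      from sklearn.svm import SVC

      def kmer_counts(seq, k=2):
          """Represent a DNA sequence as a fixed-length vector of k-mer counts."""
          kmers = [''.join(p) for p in product('ACGT', repeat=k)]
          counts = {km: 0 for km in kmers}
          for i in range(len(seq) - k + 1):
              counts[seq[i:i + k]] += 1
          return [counts[km] for km in kmers]

      # Toy training set: each sequence is paired with a binary label.
      train_seqs = ['TATAAAGGCC', 'TATAAACCGG', 'GCGCGCGCGC', 'ATATATATAT']
      train_labels = ['TSS', 'TSS', 'not TSS', 'not TSS']

      model = SVC(kernel='linear')  # the learning procedure
      model.fit([kmer_counts(s) for s in train_seqs], train_labels)

      # The prediction algorithm assigns labels to unlabelled test sequences.
      test_seqs = ['TATAAAGCGC', 'GCGCATATAT']
      print(model.predict([kmer_counts(s) for s in test_seqs]))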

  Figure 2: A gene-finding model.

    A simplified gene-finding model that captures the basic properties of a protein-coding gene is shown. The model takes the DNA sequence of a chromosome, or a portion thereof, as input and produces detailed gene annotations as output. Note that this simplified model is incapable of identifying overlapping genes or multiple isoforms of the same gene. UTR, untranslated region.
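
    Gene finders of this kind are commonly built on hidden Markov models, whose core computation is decoding: labelling each base with its most probable state. Below is a minimal sketch, assuming a toy two-state model (intergenic versus exon, with invented transition and emission probabilities) rather than the richer state structure of Figure 2.

      # Viterbi decoding of a toy two-state HMM: 'exon' states are GC-rich
      # in this made-up model, 'intergenic' emits bases uniformly.
      import math

      states = ['intergenic', 'exon']
      start = {'intergenic': 0.9, 'exon': 0.1}
      trans = {'intergenic': {'intergenic': 0.9, 'exon': 0.1},
               'exon':       {'intergenic': 0.1, 'exon': 0.9}}
      emit = {'intergenic': {b: 0.25 for b in 'ACGT'},
              'exon': {'A': 0.15, 'C': 0.35, 'G': 0.35, 'T': 0.15}}

      def viterbi(seq):
          """Return the most probable state path for a DNA sequence."""
          V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
          back = []
          for base in seq[1:]:
              col, ptr = {}, {}
              for s in states:
                  prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
                  col[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][base])
                  ptr[s] = prev
              V.append(col)
              back.append(ptr)
          path = [max(states, key=lambda s: V[-1][s])]
          for ptr in reversed(back):       # trace back the best path
              path.append(ptr[path[-1]])
          return list(reversed(path))

      print(viterbi('ATATATGCGCGCGCGCATAT'))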

  Figure 3: Two models of transcription factor binding.

    a | Generative and discriminative models differ in their interpretability and prediction accuracy. To separate two groups of points, a generative model characterizes both classes completely, whereas a discriminative model focuses only on the boundary between them. b | In a position-specific frequency matrix (PSFM) model, the entry in row i and column j represents the frequency of the ith base occurring at position j in the training set. Assuming independence between positions, the probability of the entire sequence is the product of the probabilities associated with its individual bases. c | In a linear support vector machine (SVM) model of transcription factor binding, labelled positive and negative training examples (red and blue, respectively) are provided as input, and a learning procedure adjusts the weights on the edges to predict the given labels. d | The graph plots the mean accuracy (±95% confidence intervals) of the PSFM and SVM models at predicting transcription factor binding on a set of 500 simulated test sets, as a function of the number of training examples.
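
    The PSFM scoring rule of panel b is simple to implement. A minimal sketch follows, with an illustrative three-position matrix whose frequencies are invented rather than estimated from any training set.

      # PSFM scoring under the independence assumption: the probability of
      # a candidate site is the product of per-position base frequencies.
      psfm = [  # one dictionary per motif position
          {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
          {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7},
          {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
      ]

      def site_probability(site, psfm):
          """P(site | motif) = product over positions j of freq(base_j, j)."""
          p = 1.0
          for base, column in zip(site, psfm):
              p *= column[base]
          return p

      print(site_probability('ATC', psfm))  # 0.7 * 0.7 * 0.7 ≈ 0.343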

  Figure 4: Incorporating a probabilistic prior into a position-specific frequency matrix.

    A simple, principled method for putting a probabilistic prior on a position-specific frequency matrix involves augmenting the observed nucleotide counts with pseudocounts and then computing frequencies with respect to the sum. The magnitude of the pseudocount corresponds to the weight assigned to the prior.
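
    A minimal sketch of this calculation, assuming made-up nucleotide counts for a single motif position: the pseudocount keeps unseen bases from receiving zero probability, and its magnitude sets the weight of the prior relative to the data.

      # Augment observed base counts with a pseudocount, then normalize.
      def column_frequencies(counts, pseudocount=1.0):
          """Convert raw base counts to frequencies under a pseudocount prior."""
          total = sum(counts.values()) + pseudocount * len(counts)
          return {base: (n + pseudocount) / total for base, n in counts.items()}

      observed = {'A': 8, 'C': 0, 'G': 1, 'T': 1}   # one position, 10 sites
      print(column_frequencies(observed))            # 'C' gets 1/14, not 0
      # A larger pseudocount gives the prior more weight relative to the data:
      print(column_frequencies(observed, pseudocount=5.0))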

  Figure 5: Three ways to accommodate heterogeneous data in machine learning.

    The task of predicting gene function labels requires methods that take as input data such as gene expression profiles, genetic interaction networks and amino acid sequences. These diverse data types can be encoded into fixed-length features, represented using pairwise similarities (that is, kernels) or directly accommodated by a probability model.
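
    The kernel route can be sketched with NumPy and simulated data (the gene count, kernel definitions and mixing weights below are illustrative assumptions): each data type yields a gene-by-gene similarity matrix, and a non-negative weighted sum of such kernels is again a valid kernel that any kernel method, such as an SVM, can consume.

      # One kernel per data type, combined by a weighted sum. Toy data only.
      import numpy as np

      rng = np.random.default_rng(0)
      n_genes = 5
      expression = rng.normal(size=(n_genes, 20))       # expression profiles
      adjacency = (rng.random((n_genes, n_genes)) < 0.4).astype(float)
      adjacency = np.triu(adjacency, 1)
      adjacency = adjacency + adjacency.T               # symmetric network

      k_expr = expression @ expression.T                # linear kernel
      k_net = adjacency @ adjacency.T                   # shared-neighbour counts

      # A non-negative combination of kernels is itself a valid kernel.
      k_combined = 0.7 * k_expr + 0.3 * k_net
      print(k_combined.shape)   # (5, 5): one similarity per pair of genes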

  Figure 6: Inferring network structure.

    Methods that infer each relationship in a network separately, for example by computing the correlation between each pair of variables, can be confounded by indirect relationships. Methods that infer the network as a whole can distinguish direct from indirect relationships and report only the former. Inferring the direction of causality in networks is generally more challenging than inferring the network structure (REF. 68); as a result, many network inference methods, such as Gaussian graphical model learning, infer only an undirected network.
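
    The contrast can be seen in a small simulation. In a chain A → B → C, pairwise correlation links A and C even though they are only indirectly related, whereas the inverse covariance (precision) matrix that underlies Gaussian graphical models is near zero for the A–C pair. The sketch below uses NumPy and simulated data.

      # Pairwise correlation versus a whole-network (precision matrix) view.
      import numpy as np

      rng = np.random.default_rng(0)
      n = 100_000
      a = rng.normal(size=n)
      b = a + 0.5 * rng.normal(size=n)      # B depends on A
      c = b + 0.5 * rng.normal(size=n)      # C depends on B, not directly on A

      data = np.vstack([a, b, c])           # rows are variables
      corr = np.corrcoef(data)              # pairwise view
      precision = np.linalg.inv(np.cov(data))  # whole-network view

      print(np.round(corr, 2))       # corr[0, 2] is large (indirect link)
      print(np.round(precision, 2))  # precision[0, 2] is approximately 0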

References

  1. Mitchell, T. Machine Learning (McGraw-Hill, 1997).
    This book provides a general introduction to machine learning that is suitable for undergraduate or graduate students.
  2. Ohler, U., Liao, G.-C., Niemann, H. & Rubin, G. M. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 3, RESEARCH0087 (2002).
  3. Degroeve, S., De Baets, B., Van de Peer, Y. & Rouzé, P. Feature subset selection for splice site prediction. Bioinformatics 18, S75–S83 (2002).
  4. Bucher, P. Weight matrix description of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563–578 (1990).
  5. Heintzman, N. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genet. 39, 311–318 (2007).
  6. Segal, E. et al. A genomic code for nucleosome positioning. Nature 442, 772–778 (2006).
  7. Picardi, E. & Pesole, G. Computational methods for ab initio and comparative gene finding. Methods Mol. Biol. 609, 269–284 (2010).
  8. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nature Genet. 25, 25–29 (2000).
  9. Fraser, A. G. & Marcotte, E. M. A probabilistic view of gene function. Nature Genet. 36, 559–564 (2004).
  10. Beer, M. A. & Tavazoie, S. Predicting gene expression from sequence. Cell 117, 185–198 (2004).
  11. Karlić, R., Chung, H.-R., Lasserre, J., Vlahovicek, K. & Vingron, M. Histone modification levels are predictive for gene expression. Proc. Natl Acad. Sci. USA 107, 2926–2931 (2010).
  12. Ouyang, Z., Zhou, Q. & Wong, W. H. ChIP–seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl Acad. Sci. USA 106, 21521–21526 (2009).
  13. Friedman, N. Inferring cellular networks using probabilistic graphical models. Science 303, 799–805 (2004).
  14. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction (Springer, 2001).
    This book provides an overview of machine learning that is suitable for students with a strong background in statistics.
  15. Hamelryck, T. Probabilistic models and machine learning in structural bioinformatics. Stat. Methods Med. Res. 18, 505–526 (2009).
  16. Swan, A. L., Mobasheri, A., Allaway, D., Liddell, S. & Bacardit, J. Application of machine learning to proteomics data: classification and biomarker identification in postgenomics biology. OMICS 17, 595–610 (2013).
  17. Upstill-Goddard, R., Eccles, D., Fliege, J. & Collins, A. Machine learning approaches for the discovery of gene–gene interactions in disease data. Brief. Bioinform. 14, 251–260 (2013).
  18. Yip, K. Y., Cheng, C. & Gerstein, M. Machine learning and genome annotation: a match meant to be? Genome Biol. 14, 205 (2013).
  19. Day, N., Hemmaplardh, A., Thurman, R. E., Stamatoyannopoulos, J. A. & Noble, W. S. Unsupervised segmentation of continuous genomic data. Bioinformatics 23, 1424–1426 (2007).
  20. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nature Methods 9, 215–216 (2012).
    This study applies an unsupervised hidden Markov model algorithm to analyse genomic assays such as ChIP–seq and DNase-seq in order to identify new classes of functional elements and new instances of existing functional element types.
  21. Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods 9, 473–476 (2012).
  22. Chapelle, O., Schölkopf, B. & Zien, A. (eds) Semi-supervised Learning (MIT Press, 2006).
  23. Stamatoyannopoulos, J. A. Illuminating eukaryotic transcription start sites. Nature Methods 7, 501–503 (2010).
  24. Boser, B. E., Guyon, I. M. & Vapnik, V. N. A training algorithm for optimal margin classifiers. in Proceedings of the Fifth Annual Workshop on Computational Learning Theory (ed. Haussler, D.) 144–152 (ACM Press, 1992).
    This paper was the first to describe the SVM, a type of discriminative classification algorithm.
  25. Noble, W. S. What is a support vector machine? Nature Biotech. 24, 1565–1567 (2006).
    This paper provides a non-mathematical introduction to SVMs and their applications to life science research.
  26. Ng, A. Y. & Jordan, M. I. Advances in Neural Information Processing Systems (eds Dietterich, T. et al.) (MIT Press, 2002).
  27. Jordan, M. I. Why the logistic function? A tutorial discussion on probabilities and neural networks. Computational Cognitive Science Technical Report 9503 [online] (1995).
  28. Wolpert, D. H. & Macready, W. G. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1, 67–82 (1997).
    This paper provides a mathematical proof that no single machine learning method can perform best on all possible learning problems.
  29. Yip, K. Y. et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 13, R48 (2012).
  30. Urbanowicz, R. J., Granizo-Mackenzie, D. & Moore, J. H. in Proceedings of the Parallel Problem Solving From Nature 266–275 (Springer, 2012).
  31. Brown, M. et al. in Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology (ed. Rawlings, C.) 47–55 (AAAI Press, 1993).
  32. Bailey, T. L. & Elkan, C. P. in Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology (eds Rawlings, C. et al.) 21–29 (AAAI Press, 1995).
  33. Schölkopf, B. & Smola, A. Learning with Kernels (MIT Press, 2002).
  34. Leslie, C. et al. (eds) Proceedings of the Pacific Symposium on Biocomputing (World Scientific, 2002).
  35. Rätsch, G. & Sonnenburg, S. in Kernel Methods in Computational Biology (eds Schölkopf, B. et al.) 277–298 (MIT Press, 2004).
  36. Zien, A. et al. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16, 799–807 (2000).
  37. Saigo, H., Vert, J.-P. & Akutsu, T. Optimizing amino acid substitution matrices with a local alignment kernel. BMC Bioinformatics 7, 246 (2006).
  38. Jaakkola, T. & Haussler, D. Advances in Neural Information Processing Systems 11 (Morgan Kaufmann, 1998).
  39. Shawe-Taylor, J. & Cristianini, N. Kernel Methods for Pattern Analysis (Cambridge Univ. Press, 2004).
    This textbook describes kernel methods, including a detailed mathematical treatment that is suitable for quantitatively inclined graduate students.
  40. Peña-Castillo, L. et al. A critical assessment of M. musculus gene function prediction using integrated genomic evidence. Genome Biol. 9, S2 (2008).
  41. Sonnhammer, E., Eddy, S. & Durbin, R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28, 405–420 (1997).
  42. Apweiler, R. et al. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37–40 (2001).
  43. Pavlidis, P., Weston, J., Cai, J. & Noble, W. S. Learning gene functional classifications from multiple data types. J. Comput. Biol. 9, 401–411 (2002).
  44. Lanckriet, G. R. G., Bie, T. D., Cristianini, N., Jordan, M. I. & Noble, W. S. A statistical framework for genomic data fusion. Bioinformatics 20, 2626–2635 (2004).
  45. Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B. & Botstein, D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl Acad. Sci. USA 100, 8348–8353 (2003).
  46. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, 1998).
    This textbook on probability models for machine learning is suitable for undergraduates or graduate students.
  47. Song, L. & Crawford, G. E. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harbor Protoc. 2, pdb.prot5384 (2010).
  48. Wasson, T. & Hartemink, A. J. An ensemble model of competitive multi-factor binding of the genome. Genome Res. 19, 2102–2112 (2009).
  49. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 21, 447–455 (2011).
  50. Cuellar-Partida, G. et al. Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics 28, 56–62 (2011).
  51. Ramaswamy, S. et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. USA 98, 15149–15154 (2001).
  52. Glaab, E., Bacardit, J., Garibaldi, J. M. & Krasnogor, N. Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS ONE 7, e39932 (2012).
  53. Tibshirani, R. J. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288 (1996).
    This paper was the first to describe the technique known as lasso (or L1 regularization), which performs feature selection in conjunction with learning.
  54. Urbanowicz, R. J., Granizo-Mackenzie, A. & Moore, J. H. An analysis pipeline with statistical and visualization-guided knowledge discovery for Michigan-style learning classifier systems. IEEE Comput. Intell. Mag. 7, 35–45 (2012).
  55. Tikhonov, A. N. On the stability of inverse problems. Dokl. Akad. Nauk SSSR 39, 195–198 (1943).
    This paper was the first to describe the now-ubiquitous method known as L2 regularization or ridge regression.
  56. Keogh, E. & Mueen, A. Encyclopedia of Machine Learning (Springer, 2011).
  57. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  58. Manning, C. D. & Schütze, H. Foundations of Statistical Natural Language Processing (MIT Press, 1999).
  59. Davis, J. & Goadrich, M. Proceedings of the International Conference on Machine Learning (ACM, 2006).
    This paper provides a succinct introduction to precision-recall and receiver operating characteristic curves, and explains the scenarios in which each should be used.
  60. Cohen, J. Weighted κ: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213 (1968).
  61. Luengo, J., García, S. & Herrera, F. On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowl. Inf. Syst. 32, 77–108 (2012).
  62. Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525 (2001).
    This study uses an imputation-based approach to handle missing values in microarray data. The method was widely used in subsequent studies to address this common problem.
  63. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genet. 46, 310–315 (2014).
    This study uses a machine learning approach to estimate the pathogenicity of genetic variants using a framework that takes advantage of the fact that natural selection removes deleterious variation.
  64. Qiu, J. & Noble, W. S. Predicting co-complexed protein pairs from heterogeneous data. PLoS Comput. Biol. 4, e1000054 (2008).
  65. Friedman, N., Linial, M., Nachman, I. & Pe'er, D. Using Bayesian networks to analyze expression data. J. Comput. Biol. 7, 601–620 (2000).
  66. Bacardit, J. & Llorà, X. Large-scale data mining using genetics-based machine learning. Wiley Interdiscip. Rev. 3, 37–61 (2013).
  67. Koski, T. J. & Noble, J. A review of Bayesian networks and structure learning. Math. Applicanda 40, 51–103 (2012).
  68. Pearl, J. Causality: Models, Reasoning and Inference (Cambridge Univ. Press, 2000).

Author information

Affiliations

  1. Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA.

    • Maxwell W. Libbrecht
    • William Stafford Noble
  2. Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, Washington 98195-5065, USA.

    • William Stafford Noble

Competing interests statement

The authors declare no competing interests.

Author details

  • Maxwell W. Libbrecht

    Maxwell W. Libbrecht graduated in 2011 with a degree in computer science from Stanford University, California, USA, where he was advised by Serafim Batzoglou. He is currently a Ph.D. student in the Department of Computer Science and Engineering at the University of Washington, Seattle, USA, advised by William Noble. He works on developing methods that integrate diverse types of data in order to understand gene regulation.

  • William Stafford Noble

    William Stafford Noble received his Ph.D. in computer science and cognitive science from the University of California, San Diego, USA. His research group develops and applies statistical and machine learning techniques for modelling and understanding biological processes at the molecular level. He is the recipient of a US National Science Foundation (NSF) CAREER award and is a Sloan Research Fellow.
