  • Review Article

Machine learning applications in genetics and genomics

Key Points

  • The field of machine learning includes the development and application of computer algorithms that improve with experience.

  • Machine learning methods can be divided into supervised, semi-supervised and unsupervised methods. Supervised methods are trained on examples with labels (for example, 'gene' or 'not gene') and are then used to predict these labels on other examples, whereas unsupervised methods find patterns in data sets without the use of labels. Semi-supervised methods combine these two approaches, leveraging patterns in unlabelled data to improve power in the prediction of labels.

  • Different machine learning methods may be required for an application, depending on whether one is interested in interpreting the output model or is simply concerned with predictive power. Generative models, which posit a probability distribution over the input data, are generally best for interpretability, whereas discriminative models, which model only the labels given the input, are generally best for predictive power.

  • Prior information can be added to a model in order to train the model more effectively when it is provided with limited data, to limit the complexity of the model or to incorporate data that are not used by the model directly. Prior information can be incorporated explicitly in a probabilistic model or implicitly through the choice of features or similarity measures.

  • The choice of an appropriate performance measure depends strongly on the application task. Machine learning methods are most effective when they optimize an appropriate performance measure.

  • Network estimation methods are appropriate when the data contain complex dependencies among examples. These methods work best when they take into account the confounding effects of indirect relationships.

Abstract

The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.

Figure 1: A canonical example of a machine learning application.
Figure 2: A gene-finding model.
Figure 3: Two models of transcription factor binding.
Figure 4: Incorporating a probabilistic prior into a position-specific frequency matrix.
Figure 5: Three ways to accommodate heterogeneous data in machine learning.
Figure 6: Inferring network structure.

Author information

Corresponding author

Correspondence to William Stafford Noble.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Machine learning

A field concerned with the development and application of computer algorithms that improve with experience.

Artificial intelligence

A field concerned with the development of computer algorithms that replicate human skills, including learning, visual perception and natural language understanding.

Heterogeneous data sets

A collection of data sets from multiple sources or experimental methodologies. Artefactual differences between data sets can confound analysis.

Likelihood

The probability of a data set given a particular model.
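
As a small illustration of this definition, the sketch below computes the log-likelihood of a few short DNA sequences under a position-specific frequency matrix, treating sequences and positions as independent. The matrix values, the sequences and the function name `log_likelihood` are hypothetical and chosen only for illustration; they are not taken from the article.

```python
import math

# Hypothetical position-specific frequency matrix for a motif of length 3.
# Each entry gives P(base) at one position; the values are illustrative only.
PSFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

def log_likelihood(sequences, psfm):
    """Return log P(data | model), assuming independent sequences and positions."""
    total = 0.0
    for seq in sequences:
        if len(seq) != len(psfm):
            raise ValueError("sequence length must match the model length")
        for base, column in zip(seq, psfm):
            total += math.log(column[base])
    return total

data = ["AGC", "AGT", "TGC"]
ll = log_likelihood(data, PSFM)
print(f"log-likelihood = {ll:.3f}")
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.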

Label

The target of a prediction task. In classification, the label is discrete (for example, 'expressed' or 'not expressed'); in regression, the label is real-valued (for example, a gene expression value).

Examples

Data instances used in a machine learning task.

Supervised learning

Machine learning based on an algorithm that is trained on labelled examples and used to predict the label of unlabelled examples.

Unsupervised learning

Machine learning based on an algorithm that does not require labels, such as a clustering algorithm.

Semi-supervised learning

A machine learning method that requires labels but that also makes use of unlabelled examples.

Prediction accuracy

The fraction of predictions that are correct. It is calculated by dividing the number of correct predictions by the total number of predictions.

Generative models

Machine learning models that build a full model of the distribution of features.

Discriminative models

Machine learning approaches that model only the conditional distribution of the label given the features.

Features

Single measurements or descriptors of examples used in a machine learning task.

Probabilistic framework

A machine learning approach based on a probability distribution over the labels and features.

Missing data

An experimental condition in which some features are available for some, but not all, examples.

Feature selection

The process of choosing a smaller set of features from a larger set, either before applying a machine learning method or as part of training.

Input space

A set of features chosen to be used as input for a machine learning method.

Uniform prior

A prior distribution for a Bayesian model that assigns equal probabilities to all models.

Dirichlet mixture priors

Prior distributions for a Bayesian model over the relative frequencies of, for example, amino acids.
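
To make the role of such a prior concrete, the following sketch estimates base frequencies for one column of a position-specific frequency matrix. It is a simplification: it uses a single Dirichlet component rather than a mixture, and DNA bases rather than amino acids. With a Dirichlet prior with pseudocounts alpha, the posterior mean frequency of a base is (count + alpha) / (n + sum of alphas). The function name `column_frequencies` and the observed column are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"

def column_frequencies(column, alpha):
    """Posterior mean base frequencies for one alignment column.

    `column` is a string of observed bases; `alpha` maps each base to its
    Dirichlet pseudocount. With all pseudocounts equal to 1, the prior is uniform.
    """
    counts = Counter(column)
    total = len(column) + sum(alpha[b] for b in ALPHABET)
    return {b: (counts[b] + alpha[b]) / total for b in ALPHABET}

# Hypothetical column observed in only four aligned binding sites.
observed = "AAAG"

no_prior = column_frequencies(observed, {b: 0.0 for b in ALPHABET})
uniform = column_frequencies(observed, {b: 1.0 for b in ALPHABET})

print(no_prior)  # C and T receive probability zero
print(uniform)   # every base keeps a non-zero probability
```

With only four observations, the estimate without a prior rules out C and T entirely, whereas the prior keeps them possible; the influence of the pseudocounts shrinks as more sites are observed.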

Kernel methods

A class of machine learning methods (for example, support vector machine) that use a type of similarity measure (called a kernel) between feature vectors.
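
One simple way to picture a kernel on biological sequences is a similarity score based on shared k-mers. The sketch below is an assumption chosen for illustration, not a method described on this page: it represents each sequence by its k-mer counts and takes the dot product of the two count vectors. The function names and example sequences are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=3):
    """Similarity between two sequences: dot product of their k-mer count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())

s1 = "ACGTACGT"
s2 = "ACGTTTTT"
s3 = "GGGGGGGG"

print(spectrum_kernel(s1, s2))  # shares ACG and CGT with s1
print(spectrum_kernel(s1, s3))  # shares no 3-mers with s1
```

A kernel method such as a support vector machine needs only these pairwise similarity values, so the sequences never have to be converted into fixed-length feature vectors by hand.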

Bayesian network

A representation of a probability distribution that specifies the structure of dependencies between variables as a network.

Curse of dimensionality

The observation that analysis can sometimes become more difficult as the number of features increases, particularly because overfitting becomes more likely.

Overfitting

A common pitfall in machine learning analysis that occurs when a complex model is trained on too few data points and becomes specific to the training data, resulting in poor performance on other data.

Label skew

A phenomenon in which two labels in a supervised learning problem are present at different frequencies.

Sensitivity

Also known as recall. The fraction of positive examples that are identified; it is given by the number of positive predictions that are correct divided by the total number of positive examples.

Precision

The fraction of positive predictions that are correct; it is given by the number of positive predictions that are correct divided by the total number of positive predictions.

Precision-recall curve

For a binary classifier applied to a given data set, a curve that plots precision (y axis) versus recall (x axis) for a variety of classification thresholds.
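
The definitions of prediction accuracy, sensitivity (recall), precision and the precision-recall curve can be sketched in a few lines. The example below is a minimal illustration on made-up data with label skew (2 positives among 10 examples). The function names, labels and scores are hypothetical, and reporting a precision of 1.0 when there are no positive predictions is a convention assumed here.

```python
def confusion_counts(labels, scores, threshold):
    """Count true/false positives and negatives when score >= threshold means positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(labels, scores, threshold):
    """Return (accuracy, precision, recall) at one classification threshold."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    accuracy = (tp + tn) / len(labels)
    # Precision is undefined with no positive predictions; 1.0 is an assumed convention.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

def precision_recall_curve(labels, scores):
    """One (recall, precision) point per distinct score used as a threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        _, precision, recall = metrics(labels, scores, threshold)
        points.append((recall, precision))
    return points

# Hypothetical skewed data set: 2 positives among 10 examples.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]

# A threshold above every score predicts everything as negative.
print(metrics(labels, scores, threshold=0.95))
for recall, precision in precision_recall_curve(labels, scores):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

At the threshold of 0.95 the classifier finds no positive examples, yet its accuracy is still 0.8 because most examples are negative. This is why accuracy can be misleading under label skew and why precision and recall are often more informative in that setting.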

Marginalization

A method for handling missing data points by summing over all possibilities for that random variable in the model.
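
The idea can be sketched with a tiny probabilistic classifier. The model below is hypothetical: a binary label and two binary features that are assumed to be conditionally independent given the label, with made-up probabilities. When a feature value is missing (marked `None`), the code sums the joint probability over every value that feature could take.

```python
from itertools import product

# Hypothetical model: label y in {0, 1} and two binary features that are
# conditionally independent given y. All probabilities are illustrative.
P_Y = {0: 0.8, 1: 0.2}
P_X_GIVEN_Y = [
    {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}},  # feature 0: P(x | y)
    {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}},  # feature 1: P(x | y)
]

def joint(y, features):
    """P(y, x_0, x_1) for fully observed features."""
    p = P_Y[y]
    for table, x in zip(P_X_GIVEN_Y, features):
        p *= table[y][x]
    return p

def posterior(features):
    """P(y = 1 | observed features); None marks a missing feature value."""
    missing = [i for i, x in enumerate(features) if x is None]
    totals = {0: 0.0, 1: 0.0}
    # Sum the joint probability over every possible filling of the missing values.
    for fill in product([0, 1], repeat=len(missing)):
        completed = list(features)
        for index, value in zip(missing, fill):
            completed[index] = value
        for y in totals:
            totals[y] += joint(y, completed)
    return totals[1] / (totals[0] + totals[1])

print(posterior([1, 1]))     # both features observed
print(posterior([1, None]))  # second feature missing, marginalized out
```

Unlike imputation, this approach never commits to a single guessed value for the missing feature; the prediction reflects all values the feature could have taken, weighted by their probability under the model.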

Transitive relationships

An observed correlation between two features that is caused by direct relationships between these two features and a third feature.
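
A transitive relationship can be illustrated with partial correlation, which measures the correlation between two variables after removing the linear effect of a third. This is one standard way to account for indirect relationships, used here as an assumption for illustration rather than as the article's method. The data are made up: `x` and `y` each follow a shared variable `z` plus their own noise, with no direct link between them.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical expression levels: x and y each follow a shared regulator z
# plus their own noise, with no direct link between x and y.
z = [0, 1, 2, 3, 4, 5, 6, 7]
x = [zi + e for zi, e in zip(z, [1, -1, 1, -1, 1, -1, 1, -1])]
y = [zi + e for zi, e in zip(z, [1, 1, -1, -1, 1, 1, -1, -1])]

print(f"correlation(x, y)       = {pearson(x, y):.2f}")
print(f"partial correlation | z = {partial_correlation(x, y, z):.2f}")
```

The raw correlation between `x` and `y` is strong, but it largely disappears once `z` is taken into account, which suggests that the apparent relationship is indirect.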

About this article

Cite this article

Libbrecht, M., Noble, W. Machine learning applications in genetics and genomics. Nat Rev Genet 16, 321–332 (2015). https://doi.org/10.1038/nrg3920

  • DOI: https://doi.org/10.1038/nrg3920
