Review Article

Machine learning applications in genetics and genomics

Nature Reviews Genetics volume 16, pages 321–332 (2015)

Abstract

The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.

Key points

  • The field of machine learning includes the development and application of computer algorithms that improve with experience.

  • Machine learning methods can be divided into supervised, semi-supervised and unsupervised methods. Supervised methods are trained on examples with labels (for example, 'gene' or 'not gene') and are then used to predict these labels on other examples, whereas unsupervised methods find patterns in data sets without the use of labels. Semi-supervised methods combine these two approaches, leveraging patterns in unlabelled data to improve power in the prediction of labels.

  • Different machine learning methods may be required for an application, depending on whether one is interested in interpreting the output model or is simply concerned with predictive power. Generative models, which posit a probability distribution over the input data, are generally best for interpretability, whereas discriminative models, which seek only to model labels, are generally best for predictive power.

  • Prior information can be added to a model in order to train the model more effectively when it is provided with limited data, to limit the complexity of the model or to incorporate data that are not used by the model directly. Prior information can be incorporated explicitly in a probabilistic model or implicitly through the choice of features or similarity measures.

  • The choice of an appropriate performance measure depends strongly on the application task. Machine learning methods are most effective when they optimize an appropriate performance measure.

  • Network estimation methods are appropriate when the data contain complex dependencies among examples. These methods work best when they take into account the confounding effects of indirect relationships.

Author information

Affiliations

  1. Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA.

    • Maxwell W. Libbrecht & William Stafford Noble

  2. Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, Washington 98195-5065, USA.

    • William Stafford Noble

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to William Stafford Noble.

Glossary

Machine learning

A field concerned with the development and application of computer algorithms that improve with experience.

Artificial intelligence

A field concerned with the development of computer algorithms that replicate human skills, including learning, visual perception and natural language understanding.

Heterogeneous data sets

A collection of data sets from multiple sources or experimental methodologies. Artefactual differences between data sets can confound analysis.

Likelihood

The probability of a data set given a particular model.
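
As a minimal sketch of this definition, assume a hypothetical model in which each nucleotide of a sequence is drawn independently with fixed probabilities. The function name, the two models and their probabilities are illustrative assumptions and do not come from the article. Likelihoods are computed on a log scale, which is common practice because products of many small probabilities underflow.

```python
import math

def log_likelihood(sequence, base_probs):
    """Log-probability of `sequence` under an independent-sites model."""
    return sum(math.log(base_probs[base]) for base in sequence)

# Two hypothetical models of base composition (illustrative values only).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich = {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}

seq = "GCGCGGCC"
ll_uniform = log_likelihood(seq, uniform)
ll_gc_rich = log_likelihood(seq, gc_rich)

# The model with the higher likelihood explains this sequence better.
best_model = "gc_rich" if ll_gc_rich > ll_uniform else "uniform"
```

Here the GC-rich model assigns the higher likelihood because every base in the toy sequence is G or C.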

Label

The target of a prediction task. In classification, the label is discrete (for example, 'expressed' or 'not expressed'); in regression, the label is real-valued (for example, a gene expression value).

Examples

Data instances used in a machine learning task.

Supervised learning

Machine learning based on an algorithm that is trained on labelled examples and used to predict the label of unlabelled examples.

Unsupervised learning

Machine learning based on an algorithm that does not require labels, such as a clustering algorithm.

Semi-supervised learning

A machine learning method that requires labels but also makes use of unlabelled examples.

Prediction accuracy

The fraction of predictions that are correct. It is calculated by dividing the number of correct predictions by the total number of predictions.

Generative models

Machine learning models that build a full model of the distribution of features.

Discriminative models

Machine learning models that describe only the conditional distribution of a label given the features, without modelling the distribution of the features themselves.

Features

Single measurements or descriptors of examples used in a machine learning task.

Probabilistic framework

A machine learning approach based on a probability distribution over the labels and features.

Missing data

An experimental condition in which some features are available for some, but not all, examples.

Feature selection

The process of choosing a smaller set of features from a larger set, either before applying a machine learning method or as part of training.

Input space

A set of features chosen to be used as input for a machine learning method.

Uniform prior

A prior distribution for a Bayesian model that assigns equal probabilities to all models.

Dirichlet mixture priors

Prior distributions for a Bayesian model over the relative frequencies of, for example, amino acids.

Kernel methods

A class of machine learning methods (for example, support vector machine) that use a type of similarity measure (called a kernel) between feature vectors.
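
To make the idea of a kernel concrete, here is a small sketch of one possible similarity measure for DNA sequences: the inner product of their k-mer count vectors, often called a spectrum kernel. It is shown only as an illustration and is not claimed to be the kernel used in any particular study; the function names and the choice of k = 2 are assumptions.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=2):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared
    # k-mers contribute to the sum.
    return sum(counts_a[kmer] * counts_b[kmer] for kmer in counts_a)

same = spectrum_kernel("ACGT", "ACGT")      # shares AC, CG and GT
partial = spectrum_kernel("ACAC", "ACGT")   # shares only AC
disjoint = spectrum_kernel("ACGT", "TTTT")  # shares nothing
```

A kernel method such as a support vector machine would use these pairwise values in place of explicit feature vectors.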

Bayesian network

A representation of a probability distribution that specifies the structure of dependencies between variables as a network.

Curse of dimensionality

The observation that analysis can sometimes become more difficult as the number of features increases, particularly because overfitting becomes more likely.

Overfitting

A common pitfall in machine learning analysis that occurs when a complex model is trained on too few data points and becomes specific to the training data, resulting in poor performance on other data.

Label skew

A phenomenon in which two labels in a supervised learning problem are present at different frequencies.

Sensitivity

(Also known as recall.) The fraction of positive examples identified; it is given by the number of positive predictions that are correct divided by the total number of positive examples.

Precision

The fraction of positive predictions that are correct; it is given by the number of positive predictions that are correct divided by the total number of positive predictions.

Precision-recall curve

For a binary classifier applied to a given data set, a curve that plots precision (y axis) versus recall (x axis) for a variety of classification thresholds.
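
The definitions of prediction accuracy, sensitivity (recall), precision and the precision-recall curve can be sketched together as follows. The labels and scores are made-up toy values with label skew (two positives among eight examples), and the function names are illustrative assumptions.

```python
def confusion_counts(labels, scores, threshold):
    """Count true/false positives and negatives at a score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy(labels, scores, threshold):
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    return (tp + tn) / (tp + fp + tn + fn)

def precision_recall(labels, scores, threshold):
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pr_curve(labels, scores):
    """(recall, precision) points, using each distinct score as a threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        precision, recall = precision_recall(labels, scores, threshold)
        points.append((recall, precision))
    return points

# Toy data: 1 = positive example, 0 = negative example.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]

curve = pr_curve(labels, scores)
# A threshold above every score predicts all examples as negative.
acc_all_negative = accuracy(labels, scores, threshold=1.0)
recall_all_negative = precision_recall(labels, scores, threshold=1.0)[1]
```

In this toy set, predicting every example as negative still reaches an accuracy of 0.75 while identifying none of the positives, which is why precision and recall are often more informative than accuracy when labels are skewed.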

Marginalization

A method for handling missing data points by summing over all possibilities for that random variable in the model.
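
A minimal sketch of marginalization, assuming a hypothetical joint distribution over a label and two binary features. The probabilities, the label names and the feature names are invented for illustration. When a feature is missing, the probabilities for all of its possible values are summed before the label is predicted.

```python
# Hypothetical joint distribution P(label, feature_a, feature_b).
# Keys are (label, feature_a, feature_b); the values sum to 1.
joint = {
    ("enhancer", 0, 0): 0.02, ("enhancer", 0, 1): 0.18,
    ("enhancer", 1, 0): 0.03, ("enhancer", 1, 1): 0.07,
    ("other", 0, 0): 0.20, ("other", 0, 1): 0.10,
    ("other", 1, 0): 0.35, ("other", 1, 1): 0.05,
}

def label_posterior(joint, feature_a=None, feature_b=None):
    """P(label | observed features); a feature left as None is summed out."""
    totals = {}
    for (label, a, b), prob in joint.items():
        if feature_a is not None and a != feature_a:
            continue
        if feature_b is not None and b != feature_b:
            continue
        totals[label] = totals.get(label, 0.0) + prob
    norm = sum(totals.values())
    return {label: prob / norm for label, prob in totals.items()}

# Both features observed.
complete = label_posterior(joint, feature_a=0, feature_b=1)
# feature_b is missing, so both of its values contribute to the sum.
marginal = label_posterior(joint, feature_a=0)
```

With both features observed the posterior for 'enhancer' is 0.18 / 0.28; with feature_b missing it becomes (0.02 + 0.18) / 0.50 = 0.4.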

Transitive relationships

An observed correlation between two features that is caused by the direct relationship of each of these features with a third feature, rather than by a direct relationship between the two.
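
One standard way to check for such an indirect relationship is the partial correlation, which measures the correlation between two features after the linear effect of a third has been removed. The sketch below uses constructed toy data in which two hypothetical targets both depend on one regulator; the variable names and values are assumptions chosen so that the effect is easy to see.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data: both targets follow the regulator plus independent noise.
regulator = [-1, -1, -1, -1, 1, 1, 1, 1]
noise_a = [-0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5, 0.5]
noise_b = [-0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5]
target_a = [r + e for r, e in zip(regulator, noise_a)]
target_b = [r + e for r, e in zip(regulator, noise_b)]

raw = pearson(target_a, target_b)
adjusted = partial_correlation(target_a, target_b, regulator)
```

The two targets are strongly correlated (0.8 in this construction), but the partial correlation given the regulator is essentially zero, which indicates that the observed correlation is transitive rather than direct.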

About this article

DOI

https://doi.org/10.1038/nrg3920