The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
The field of machine learning includes the development and application of computer algorithms that improve with experience.
Machine learning methods can be divided into supervised, semi-supervised and unsupervised methods. Supervised methods are trained on examples with labels (for example, 'gene' or 'not gene') and are then used to predict these labels on other examples, whereas unsupervised methods find patterns in data sets without the use of labels. Semi-supervised methods combine these two approaches, leveraging patterns in unlabelled data to improve power in the prediction of labels.
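The contrast between supervised and unsupervised methods can be sketched in a few lines. The example below is only an illustration under stated assumptions: the one-dimensional feature (a hypothetical GC-content value per region), the toy data and the function names `fit_centroids`, `predict` and `two_means` are all invented for this sketch, and a nearest-centroid classifier and two-cluster k-means stand in for the much richer methods used in practice.

```python
from statistics import mean

def fit_centroids(values, labels):
    """Supervised step: compute one centroid per label from labelled examples."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(members) for label, members in groups.items()}

def predict(centroids, value):
    """Predict the label whose centroid lies closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

def two_means(values, iterations=20):
    """Unsupervised step: split values into two clusters without any labels."""
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        assignment = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        left = [v for v, a in zip(values, assignment) if a == 0]
        right = [v for v, a in zip(values, assignment) if a == 1]
        if not left or not right:
            break  # one cluster is empty; keep the current assignment
        low, high = mean(left), mean(right)
    return assignment

# Hypothetical toy data: one feature value per genomic region.
gc_content = [0.62, 0.58, 0.65, 0.38, 0.41, 0.35]
labels = ["gene", "gene", "gene", "not gene", "not gene", "not gene"]

centroids = fit_centroids(gc_content, labels)  # uses the labels
prediction = predict(centroids, 0.60)          # label for an unseen example
clusters = two_means(gc_content)               # ignores the labels entirely
```

The supervised path returns a named label for a new example, whereas the clustering path returns only anonymous group indices that must be interpreted afterwards.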
Different machine learning methods may be required for an application, depending on whether one is interested in interpreting the output model or is simply concerned with predictive power. Generative models, which posit a probability distribution over input data, are generally best for interpretability, whereas discriminative models, which seek only to model labels, are generally best for predictive power.
Prior information can be added to a model in order to train the model more effectively when it is provided with limited data, to limit the complexity of the model or to incorporate data that are not used by the model directly. Prior information can be incorporated explicitly in a probabilistic model or implicitly through the choice of features or similarity measures.
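One simple, explicit way to add prior information when data are limited is a pseudocount, which corresponds to the posterior mean under a symmetric Dirichlet prior on the frequencies. The sketch below is an assumption-laden illustration rather than a method taken from the text: the function `base_frequencies`, the four-base example sequence and the pseudocount value are all hypothetical.

```python
def base_frequencies(sequence, pseudocount=0.0):
    """Estimate nucleotide frequencies, optionally smoothed by a pseudocount.

    Adding the same pseudocount to every base is equivalent to taking the
    posterior mean under a symmetric Dirichlet prior.
    """
    alphabet = "ACGT"
    counts = {base: sequence.count(base) + pseudocount for base in alphabet}
    total = sum(counts.values())
    return {base: counts[base] / total for base in alphabet}

# Hypothetical example: only four observations at a motif position.
short_site = "AAGA"

unsmoothed = base_frequencies(short_site)                  # C and T get probability 0
smoothed = base_frequencies(short_site, pseudocount=1.0)   # every base keeps some mass
```

Without the prior, bases that happen to be unobserved in a small sample are assigned zero probability; with it, the estimate is pulled towards the uniform distribution, and the pull weakens as more data are observed.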
The choice of an appropriate performance measure depends strongly on the application task. Machine learning methods are most effective when they optimize an appropriate performance measure.
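To see why the performance measure matters, consider a data set in which positive examples are rare. The sketch below uses invented data and a hypothetical helper, `evaluate`, to compare a classifier that never predicts the positive label with one that recovers most positives; returning 0.0 when a ratio is undefined is a convention chosen here for simplicity.

```python
def evaluate(truth, predicted):
    """Return accuracy, precision and recall for boolean labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Convention for this sketch: undefined ratios are reported as 0.0.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical skewed data set: 5 positives among 100 examples.
truth = [True] * 5 + [False] * 95

# Classifier A never predicts the positive label.
always_negative = [False] * 100

# Classifier B finds 4 of the 5 positives and makes 4 false-positive calls.
detector = [True] * 4 + [False] + [True] * 4 + [False] * 91

scores_a = evaluate(truth, always_negative)
scores_b = evaluate(truth, detector)
```

Both classifiers reach the same accuracy on this toy data, yet only the second identifies any positive examples, which is why precision and recall are often more informative than accuracy when labels are skewed.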
Network estimation methods are appropriate when the data contain complex dependencies among examples. These methods work best when they take into account the confounding effects of indirect relationships.
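One common way to account for indirect relationships is to examine partial correlations, which measure the association between two variables after removing the linear effect of a third. The sketch below is illustrative only: the simulated variables (`regulator`, `target_1`, `target_2`), the noise level and the random seed are assumptions, and partial correlation is one of several possible techniques rather than the one prescribed by the text.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def partial_correlation(a, b, c):
    """Correlation between a and b after controlling for c."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Simulated, hypothetical data: one regulator drives two targets that do
# not influence each other directly.
rng = random.Random(0)
regulator = [rng.gauss(0, 1) for _ in range(2000)]
target_1 = [z + rng.gauss(0, 0.5) for z in regulator]
target_2 = [z + rng.gauss(0, 0.5) for z in regulator]

marginal = pearson(target_1, target_2)                           # strongly positive
conditional = partial_correlation(target_1, target_2, regulator)  # close to zero
```

The two targets appear strongly correlated when examined in isolation, but the association largely disappears once the shared regulator is taken into account, so a network built from raw correlations alone would contain a spurious edge between them.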
- Machine learning
A field concerned with the development and application of computer algorithms that improve with experience.
- Artificial intelligence
A field concerned with the development of computer algorithms that replicate human skills, including learning, visual perception and natural language understanding.
- Heterogeneous data sets
A collection of data sets from multiple sources or experimental methodologies. Artefactual differences between data sets can confound analysis.
- Likelihood
The probability of a data set given a particular model.
- Label
The target of a prediction task. In classification, the label is discrete (for example, 'expressed' or 'not expressed'); in regression, the label is real-valued (for example, a gene expression value).
- Examples
Data instances used in a machine learning task.
- Supervised learning
Machine learning based on an algorithm that is trained on labelled examples and used to predict the label of unlabelled examples.
- Unsupervised learning
Machine learning based on an algorithm that does not require labels, such as a clustering algorithm.
- Semi-supervised learning
A machine learning method that requires labels but also makes use of unlabelled examples.
- Prediction accuracy
The fraction of predictions that are correct. It is calculated by dividing the number of correct predictions by the total number of predictions.
- Generative models
Machine learning models that build a full model of the distribution of features.
- Discriminative models
Machine learning approaches that model only the distribution of a label when given the features.
- Features
Single measurements or descriptors of examples used in a machine learning task.
- Probabilistic framework
A machine learning approach based on a probability distribution over the labels and features.
- Missing data
An experimental condition in which some features are available for some, but not all, examples.
- Feature selection
The process of choosing a smaller set of features from a larger set, either before applying a machine learning method or as part of training.
- Input space
A set of features chosen to be used as input for a machine learning method.
- Uniform prior
A prior distribution for a Bayesian model that assigns equal probabilities to all models.
- Dirichlet mixture priors
Prior distributions for a Bayesian model over the relative frequencies of, for example, amino acids.
- Kernel methods
A class of machine learning methods (for example, support vector machines) that use a type of similarity measure (called a kernel) between feature vectors.
- Bayesian network
A representation of a probability distribution that specifies the structure of dependencies between variables as a network.
- Curse of dimensionality
The observation that analysis can sometimes become more difficult as the number of features increases, particularly because overfitting becomes more likely.
- Overfitting
A common pitfall in machine learning analysis that occurs when a complex model is trained on too few data points and becomes specific to the training data, resulting in poor performance on other data.
- Label skew
A phenomenon in which two labels in a supervised learning problem are present at different frequencies.
- Sensitivity
(Also known as recall.) The fraction of positive examples identified; it is given by the number of positive predictions that are correct divided by the total number of positive examples.
- Precision
The fraction of positive predictions that are correct; it is given by the number of positive predictions that are correct divided by the total number of positive predictions.
- Precision-recall curve
For a binary classifier applied to a given data set, a curve that plots precision (y axis) versus recall (x axis) for a variety of classification thresholds.
- Marginalization
A method for handling missing data points by summing over all possibilities for that random variable in the model.
- Transitive relationships
An observed correlation between two features that is caused by direct relationships between these two features and a third feature.
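
The glossary describes handling missing data by summing over all possibilities for the missing variable. A minimal sketch of that idea follows, assuming a small hypothetical joint probability table over a label and two binary features; the table values, the label names and the function `label_posterior` are invented for illustration.

```python
# Hypothetical joint distribution P(label, feature_1, feature_2).
# The probabilities are made up and sum to 1.
joint = {
    ("bound", 1, 1): 0.20, ("bound", 1, 0): 0.10,
    ("bound", 0, 1): 0.05, ("bound", 0, 0): 0.05,
    ("unbound", 1, 1): 0.05, ("unbound", 1, 0): 0.15,
    ("unbound", 0, 1): 0.10, ("unbound", 0, 0): 0.30,
}

def label_posterior(joint, feature_1, feature_2=None):
    """Return P(label | observed features).

    If feature_2 is None it is treated as missing and summed out,
    that is, marginalized over both of its possible values.
    """
    scores = {}
    for (label, f1, f2), probability in joint.items():
        if f1 != feature_1:
            continue  # inconsistent with the observed first feature
        if feature_2 is not None and f2 != feature_2:
            continue  # inconsistent with the observed second feature
        scores[label] = scores.get(label, 0.0) + probability
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

with_missing = label_posterior(joint, feature_1=1)               # feature_2 missing
fully_observed = label_posterior(joint, feature_1=1, feature_2=1)
```

When the second feature is missing, the prediction reflects an average over its possible values weighted by the model's probabilities, so no example has to be discarded or filled in with an arbitrary value.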