This Month
Published: 03 April 2018

Points of Significance

Statistics versus machine learning

Danilo Bzdok^1,2,
Naomi Altman³ &
Martin Krzywinski⁴

Nature Methods volume 15, pages 233–234 (2018)Cite this article

128k Accesses
848 Citations
351 Altmetric
Metrics details

Statistics draws population inferences from a sample, and machine learning finds generalizable predictive patterns.

You have full access to this article via your institution.

Two major goals in the study of biological systems are inference and prediction. Inference creates a mathematical model of the data-generation process to formalize understanding or test a hypothesis about how the system behaves. Prediction aims at forecasting unobserved outcomes or future behavior, such as whether a mouse with a given gene expression pattern has a disease. Prediction makes it possible to identify best courses of action (e.g., treatment choice) without requiring understanding of the underlying mechanisms. In a typical research project, both inference and prediction can be of value—we want to know how biological processes work and what will happen next. For example, we might want to infer which biological processes are associated with the dysregulation of a gene in a disease, as well as detect whether a subject has the disease and predict the best therapy.

Many methods from statistics and machine learning (ML) may, in principle, be used for both prediction and inference. However, statistical methods have a long-standing focus on inference, which is achieved through the creation and fitting of a project-specific probability model. The model allows us to compute a quantitative measure of confidence that a discovered relationship describes a 'true' effect that is unlikely to result from noise. Furthermore, if enough data are available, we can explicitly verify assumptions (e.g., equal variance) and refine the specified model, if needed.

By contrast, ML concentrates on prediction by using general-purpose learning algorithms to find patterns in often rich and unwieldy data^1,2. ML methods are particularly helpful when one is dealing with 'wide data', where the number of input variables exceeds the number of subjects, in contrast to 'long data', where the number of subjects is greater than that of input variables. ML makes minimal assumptions about the data-generating systems; they can be effective even when the data are gathered without a carefully controlled experimental design and in the presence of complicated nonlinear interactions. However, despite convincing prediction results, the lack of an explicit model can make ML solutions difficult to directly relate to existing biological knowledge.

Classical statistics and ML vary in computational tractability as the number of variables per subject increases. Classical statistical modeling was designed for data with a few dozen input variables and sample sizes that would be considered small to moderate today. In this scenario, the model fills in the unobserved aspects of the system. However, as the numbers of input variables and possible associations among them increase, the model that captures these relationships becomes more complex. Consequently, statistical inferences become less precise and the boundary between statistical and ML approaches becomes hazier.

To compare traditional statistics to ML approaches, we'll use a simulation of the expression of 40 genes in two phenotypes (−/+). Mean gene expression will differ between phenotypes, but we'll set up the simulation so that the mean difference for the first 30 genes is not related to phenotype. The last ten genes will be dysregulated, with systematic differences in mean expression between phenotypes. To achieve this, we assign each gene an average log expression that is the same for both phenotypes. The dysregulated genes (31–40, labeled A–J) have their mean expression perturbed in the + phenotype (Fig. 1a). Using these average expression values, we simulate an RNA-seq experiment in which the observed counts for each gene are sampled from a Poisson distribution with mean exp(x + ε), where x is the mean log expression, unique to the gene and phenotype, and ε ∼ N(0, 0.15) acts as biological variability that varies from subject to subject (Fig. 1b). For genes 1–30, which do not have differential expression, the z-scores are approximately N(0, 1). For the dysregulated genes, which do have differential expression, the z-scores in one phenotype tend to be positive, and the z-scores in the other tend to be negative.

**Figure 1: Simulated expression and RNA-seq read counts for 40 genes in which the last 10 genes (A–J) are differentially expressed across two phenotypes (−/+).**

Our goal in the simulation is to identify which genes are associated with the abnormal phenotype. We'll formally test the null hypothesis that the mean expression differs by phenotype with a widely used generalized linear negative binomial model that allows for biological variability among subjects with the same phenotype. We'll perform a test for each gene and identify those that show statistically significant differences in mean expression, based on P values adjusted for multiple testing via the Benjamini–Hochberg method³. In an alternative Bayesian approach, we would compute the posterior probability of having differential expression specific to the phenotype.

Figure 2a shows the P values of the tests between phenotypes as a function of the log fold change in gene expression. The ten dysregulated genes are highlighted in red; our inference flagged nine out of the ten (except F, with the smallest log fold change) as significant with adjusted P < 0.05. We could use the fold change as a measure of effect size, with a confidence interval or highest posterior region used to indicate the uncertainty in the estimate. In a realistic setting, genes identified by the analysis would then be validated experimentally or compared with data from other sources such as proposed gene networks or annotations.

**Figure 2: Analysis of gene ranking by classical inference and ML.**

To ask a similar biological question using ML, we would typically try several algorithms evaluated by cross-validation on independent test subjects, or bootstrap methods with 'out-of-sample' evaluation⁴ to select one with good prediction accuracy. Let's use a random forest (RF) classifier⁵ that will simultaneously consider all genes and grow multiple decision trees to predict the phenotype without assuming a probabilistic model for the read counts. The result of this RF classification with 100 trees is shown in Figure 2b, where the P values from the classical inference are plotted as a function of feature (gene) importance. This score quantifies a given gene's contribution to the average classification improvement⁵ within a partition when the tree is split selecting that gene. Many ML algorithms have analogous measures that allow some quantification of the contribution of each input variable to the classification. In our simulation, eight of ten genes with the largest importance measures were from the dysregulated set. Not in the top ten were genes D and F, which had the smallest fold changes (Fig. 2a).

If we perform the simulation 1,000 times and count the number of dysregulated genes correctly identified by both approaches—on the basis of either classical null-hypothesis rejection with an adjusted P value cutoff or predictive pattern generalization with RF and top-ten feature importance ranking—then we find that the two methods yield similar results. The average number of dysregulated genes identified is 7.4/10 for inference and 7.7/10 for RF (Fig. 2c). Both methods have a median of 8/10, but we find more instances of simulations for which only 2–5 dysregulated genes were identified with inference. This is because the way we've designed the selection process is different for the two approaches: inference selects by an adjusted P value cutoff so that the number of selected genes varies, whereas in the RF we select the top ten genes. We could have applied a cutoff to the importance score, but the scores do not have an objective scale on which to base the threshold.

We've used pre-existing knowledge about RNA-seq data to design a statistical model of the process and draw inference to estimate unknown parameters in the model from the data. In our simulation, the model encapsulates the relationship between the mean number of reads (the parameter) for each gene for each phenotype and the observed read counts for each subject. The output of the statistical analysis is a test statistic for a specific hypothesis and confidence bounds of the parameter (mean fold change, in this example). In our example, the model's parameters relate explicitly to aspects of gene expression—the counts of molecules produced at a certain rate in a cell can be directly interpreted.

To apply ML, we don't need to know any of the details about RNA-seq measurements; all that matters is which genes are more useful for phenotype discrimination based on gene expression. Such generalization greatly helps when we have a large number of variables, such as in a typical RNA-seq experiment that may have hundreds to hundreds of thousands of features (e.g., transcripts) but a much smaller sample size.

Now consider a more complex experiment in which each subject contributes multiple observations from different tissues. Even if we only conduct a formal statistical test that compares the two phenotypes for each tissue, the multiple testing problem is greatly complicated. The increase in data complexity may make classical statistical inference less tractable. Instead we could use an ML approach such as clustering of genes or tissues or both to extract the main patterns in the data, classify subjects, and make inferences about the biological processes that give rise to the phenotype. To simplify the analysis, we could perform a dimension reduction such as averaging the measurements over the ten subjects with each phenotype for each gene and each tissue.

The boundary between statistical inference and ML is subject to debate¹—some methods fall squarely into one or the other domain, but many are used in both. For example, the bootstrap⁶ method can be used for statistical inference but also serves as the basis for ensemble methods, such as the RF algorithm. Statistics requires us to choose a model that incorporates our knowledge of the system, and ML requires us to choose a predictive algorithm by relying on its empirical capabilities. Justification for an inference model typically rests on whether we feel it adequately captures the essence of the system. The choice of pattern-learning algorithms often depends on measures of past performance in similar scenarios. Inference and ML are complementary in pointing us to biologically meaningful conclusions.

References

Bzdok, D. Front. Neurosci. 11, 543 (2017).
Article Google Scholar
Bzdok, D., Krzywinski, M. & Altman, N. Nat. Methods 14, 1119–1120 (2017).
Article CAS Google Scholar
Krzywinski, M. & Altman, N. Nat. Methods 11, 355–356 (2014).
Article CAS Google Scholar
Lever, J., Krzywinski, M. & Altman, N. Nat. Methods 13, 603–604 (2016).
Article CAS Google Scholar
Altman, N. & Krzywinski, M. Nat. Methods 14, 933–934 (2017).
Article CAS Google Scholar
Kulesa, A., Krzywinski, M., Blainey, P. & Altman, N. Nat. Methods 12, 477–478 (2015).
Article CAS Google Scholar

Download references

Author information

Authors and Affiliations

Department of Psychiatry, RWTH Aachen University, Germany
Danilo Bzdok
INRIA/Neurospin Saclay, France
Danilo Bzdok
Professor of Statistics at The Pennsylvania State University,
Naomi Altman
staff scientist at Canada's Michael Smith Genome Sciences Centre,
Martin Krzywinski

Authors

Danilo Bzdok
View author publications
You can also search for this author in PubMed Google Scholar
Naomi Altman
View author publications
You can also search for this author in PubMed Google Scholar
Martin Krzywinski
View author publications
You can also search for this author in PubMed Google Scholar

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bzdok, D., Altman, N. & Krzywinski, M. Statistics versus machine learning. Nat Methods 15, 233–234 (2018). https://doi.org/10.1038/nmeth.4642

Download citation

Published: 03 April 2018
Issue Date: 01 April 2018
DOI: https://doi.org/10.1038/nmeth.4642

This article is cited by

Twenty-four-hour physical activity patterns associated with depressive symptoms: a cross-sectional study using big data-machine learning approach
- Saida Salima Nawrin
- Hitoshi Inada
- Ryoichi Nagatomi
BMC Public Health (2024)
Machine learning to predict unintended pregnancy among reproductive-age women in Ethiopia: evidence from EDHS 2016
- Daniel Niguse Mamo
- Yosef Haile Gebremariam
- Agmasie Damtew Walle
BMC Women's Health (2024)
What factors preventing the older adults in China from living longer: a machine learning study
- Shiyin Xiao
- Yajie Bi
- Wei Chen
BMC Geriatrics (2024)
Causes of death in individuals with lifetime major depression: a comprehensive machine learning analysis from a community-based autopsy center
- Paula Villela Nunes
- Livia Mancine
- Claudia Kimie Suemoto
BMC Psychiatry (2024)
Continued versus discontinued oxytocin stimulation in the active phase of labour (CONDISOX): individual management based on artificial intelligence - a secondary analysis
- Sidsel Boie
- Julie Glavind
- Niels Uldbjerg
BMC Pregnancy and Childbirth (2024)

Statistics versus machine learning

References

Author information

Authors and Affiliations

Ethics declarations

Competing interests

Rights and permissions

About this article

Cite this article

This article is cited by

Twenty-four-hour physical activity patterns associated with depressive symptoms: a cross-sectional study using big data-machine learning approach

Machine learning to predict unintended pregnancy among reproductive-age women in Ethiopia: evidence from EDHS 2016

What factors preventing the older adults in China from living longer: a machine learning study

Causes of death in individuals with lifetime major depression: a comprehensive machine learning analysis from a community-based autopsy center

Continued versus discontinued oxytocin stimulation in the active phase of labour (CONDISOX): individual management based on artificial intelligence - a secondary analysis

Search

Quick links

References

Author information

Authors and Affiliations

Ethics declarations

Competing interests

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Twenty-four-hour physical activity patterns associated with depressive symptoms: a cross-sectional study using big data-machine learning approach

Machine learning to predict unintended pregnancy among reproductive-age women in Ethiopia: evidence from EDHS 2016

What factors preventing the older adults in China from living longer: a machine learning study

Causes of death in individuals with lifetime major depression: a comprehensive machine learning analysis from a community-based autopsy center

Continued versus discontinued oxytocin stimulation in the active phase of labour (CONDISOX): individual management based on artificial intelligence - a secondary analysis

Search

Quick links