
# Statistics for Biologists

Since September 2013, Nature Methods has been publishing a monthly column on statistics called "Points of Significance." This column is intended to provide researchers in biology with a basic introduction to core statistical concepts and methods, including experimental design. Although targeted at biologists, the articles are useful guides for researchers in other disciplines as well. A continuously updated list of these articles is provided below.

Importance of being uncertain - How samples are used to estimate population statistics and what this means in terms of uncertainty.

Error bars - The use of error bars to represent uncertainty and advice on how to interpret them.

Significance, P values and t-tests - Introduction to the concept of statistical significance and the one-sample t-test.

Power and sample size - Use of statistical power to optimize study design and sample numbers.

Visualizing samples with box plots - Introduction to box plots and their use to illustrate the spread and differences of samples. See also: "Kick the bar chart habit" and "BoxPlotR: a web tool for generation of box plots."

Comparing samples—part I - How to use the two-sample t-test to compare either uncorrelated or correlated samples.

Comparing samples—part II - Adjustment and reinterpretation of P values when large numbers of tests are performed.

Nonparametric tests - Use of nonparametric tests to robustly compare skewed or ranked data.

Designing comparative experiments - The first of a series of columns that tackle experimental design shows how a paired design achieves sensitivity and specificity requirements despite biological and technical variability.

Analysis of variance and blocking - Introduction to ANOVA and the importance of blocking in good experimental design to mitigate experimental error and the impact of factors not under study.

Replication - Technical replication reveals technical variation while biological replication is required for biological inference.

Nested designs - Use the relative noise contribution of each layer in nested experimental designs to optimally allocate experimental resources using ANOVA.

Two-factor designs - It is common in biological systems for multiple experimental factors to produce interacting effects on a system. A study design that allows these interactions can increase sensitivity.

Sources of variation - To generalize experimental conclusions to a population, it is critical to sample its variation while using experimental control, randomization, blocking and replication for replicable and meaningful results.

Split plot design - When some experimental factors are harder to vary than others, a split plot design can be efficient for exploring the main (average) effects and interactions of the factors.

Bayes’ theorem - Use Bayes’ theorem to combine prior knowledge with observations of a system and make predictions about it.

Bayesian statistics - Unlike classical frequentist statistics, Bayesian statistics allows direct inference of the probability that a model is correct and it provides the ability to update this probability as new data is collected.

Sampling distributions and the bootstrap - Use the bootstrap method to simulate new samples and assess the precision and bias of sample estimates.

Bayesian networks - Model interactions between causes and effects in large networks of causal influences using Bayesian networks, which combine network analysis with Bayesian statistics.

Association, correlation and causation - Pairwise dependencies can be characterized using correlation but be aware that correlation only implies association, not causation. Conversely, causation implies association, not correlation.

Simple linear regression - Given data on the relationship between two variables, linear regression is a simple and surprisingly robust method to predict unknown values.

Multiple linear regression - When multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple.

Analyzing outliers: influential or nuisance - Some outliers influence the regression fit more than others.

Regression diagnostics - Residual plots can be used to validate assumptions about the regression model.

Logistic regression - Regression can be used on categorical responses to estimate probabilities and to classify.

Classification evaluation - It is important to understand both what a classification metric expresses and what it hides.

Model selection and overfitting - "With four parameters I can fit an elephant and with five I can make him wiggle his trunk." - John von Neumann

Regularization - Constraining the magnitude of parameters of a model can control its complexity.

P values and the search for significance - "Little P value / What are you trying to say / Of significance?" - Steve Ziliak

Interpreting P values - A P value measures a sample's compatibility with a hypothesis, not the truth of the hypothesis.

Tabular data - Tabulating the number of objects in categories of interest dates back to the earliest records of commerce and population censuses.

Clustering - Clustering finds patterns in data, whether they are there or not.

Principal component analysis - PCA helps you interpret your data, but it will not always find the important patterns.

Classification and regression trees - Decision trees are a simple but powerful prediction method.

Ensemble methods: bagging and random forests - Many heads are better than one.

Machine learning: a primer - Machine learning extracts patterns from data without explicit instructions.

Machine learning: supervised methods - Supervised learning algorithms extract general principles from observed examples guided by a specific prediction objective.