
Image: Experiments that use only a small number of animals are common, but might not give meaningful results. Credit: Kevin Glover/MRC
Replace, refine, reduce: the 3 Rs of ethical animal research are widely accepted around the world. But now the message from UK funding agencies is that some experiments use too few animals, a problem that leads to wastage and low-quality results.
On 15 April, the research councils responsible for channelling government funding to scientists, and their umbrella group Research Councils UK, announced changes to their guidelines for animal experiments. Funding applicants must now show that their work will provide statistically robust results — not just explain how it is justified and set out the ethical implications — or risk having their grant application rejected.
The move aims to improve the quality of medical research, and will help to address widespread concerns that animals — mostly mice and rats — are being squandered in tiny studies that lack statistical power.
“If the study is underpowered your results are not going to be reliable,” says Nathalie Percie du Sert, who works on experimental design at the National Centre for the Replacement, Refinement and Reduction (NC3Rs) of Animals in Research in London. “These animals are going to be wasted.”
Researchers say that sample size is sometimes decided through historical precedent rather than solid statistics. There is also a lack of clarity: last year, an analysis of selected papers published in Nature or Public Library of Science journals describing animal experiments revealed that few reported the use of statistical tests to determine sample size, even though both publishing groups had endorsed guidelines to improve reporting standards (D. Baker et al. PLoS Biol. 12, e1001756; 2014).
Animals feature in a wide range of experiments (see ‘Animal use’), many of which are designed to test drugs before trials are done in people. The effects that researchers are looking for in these preclinical studies are often subtle, and ‘power calculations’ are needed to reveal the number of animals required to show an effect. But an international academic partnership called the CAMARADES project (Collaborative Approach to Meta-Analysis and Review of Animal Data from Experimental Studies) has shown that many animal studies are underpowered: studies in stroke, for example, are typically powered at between 30% and 50%, meaning that there is just a 30–50% chance of detecting a biological effect if it exists.

Chart: ‘Animal use’. Source: UK Organisation Data Service
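By way of illustration, here is a minimal sketch of the kind of power calculation described above, using Python's statsmodels package; the effect size, significance level, power target and group size are hypothetical rather than drawn from any particular study.

```python
# Hypothetical power calculation for a two-group comparison (two-sample t-test).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Animals per group needed to detect a large effect (Cohen's d = 0.8)
# with 80% power at a two-sided alpha of 0.05.
n_per_group = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(round(n_per_group))          # about 26 animals per group

# Conversely, the power actually achieved by a study with 10 animals per group:
achieved = analysis.solve_power(effect_size=0.8, alpha=0.05, nobs1=10)
print(round(achieved, 2))          # roughly 0.4, in the 30-50% range cited above
```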
Malcolm Macleod, a neuroscientist at the University of Edinburgh, UK, blames, among other things, a lack of training and support in experimental design, as well as limited funds: animals are expensive to work with.
Some say that the pressure to ‘reduce’ may be one of the reasons for small experiments, but others counter that this is a misinterpretation of the 3 Rs because small experiments are ethically problematic if they have low statistical power.
The problem is not limited to Britain: last year, Francis Collins, director of the US National Institutes of Health (NIH), and Lawrence Tabak, NIH deputy director, warned about a lack of reproducibility in preclinical research and mentioned a dearth of sample-size calculations as one of the problems (see Nature 505, 612–613; 2014).
The situation infuriates animal-welfare proponents. “It’s completely unethical to use animals in studies that aren’t properly designed,” says Penny Hawkins, head of the research-animals department at the Royal Society for the Prevention of Cruelty to Animals in Southwater, UK.
Boosting the number of animals in specific experiments need not mean more animals are used overall, because multiple small experiments can often be replaced by fewer, larger ones. “One potential implication is we need to ask for money to do larger studies,” says Marcus Munafò, a psychologist at the University of Bristol, UK.
Another way to increase sample sizes would be to link up researchers working on similar topics. Munafò notes that this is what geneticists now do for studies that require scanning a large number of genomes. “That template already exists,” he says. “The question is, how do you initiate that cultural change?”
More immediately, Percie du Sert is developing an online tool for the NC3Rs that will help researchers to design robust studies. “We’re not blaming anyone for the way they were doing things before,” she adds. “That was the practice at the time.”
Nature 520, 271–272 (2015). doi:10.1038/520271a
Cressey's news article (1) describes a further step by Research Councils UK towards statistical robustness and reliability à la Cohen (2). It is an admirable step, and one with ethical implications too, as Hawkins comments (in 1). However, it conflates science with statistics, and it seems to assume that power will provide robust and reliable results where a lack of power will not; the implication is that jumping onto the power bandwagon is the way to go.
As Lazic comments (3), power is mostly relevant when you know your population, because power is part of the Neyman-Pearson approach to data testing. It is irrelevant under Fisher's approach, however, which is arguably the approach most used in psychology and related disciplines (4). Under the latter, power is largely superfluous: it can still be used to approximate the needed sample size, but the population parameters it relies on are second-guesses, so the power analysis may be little more than a nice exercise.
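To make that point concrete, here is a minimal sketch (Python with statsmodels; every effect size below is a hypothetical second-guess) showing that the sample size a power analysis returns is driven entirely by the assumed effect size:

```python
# How the "required" sample size changes with the guessed effect size.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for d in (0.8, 0.5, 0.3):   # three plausible second-guesses of Cohen's d
    n = solver.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"assumed d = {d}: about {round(n)} subjects per group")
# d = 0.8 -> ~26, d = 0.5 -> ~64, d = 0.3 -> ~176 per group
```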
Alternatives to setting sample sizes by power analysis require thinking about what science is and how statistics can contribute to it. Exploratory data analysis (5) and a focus on effect sizes and confidence intervals (6) can further scientific knowledge even when sample sizes are small because of costs, or when they turn out smaller than expected relative to (post hoc) power. Meta-analysis (7, 8) can bring smaller samples together and thereby increase power. Even Bayesian analysis may result in smaller sample requirements, depending on the prior probabilities used (9).
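As a rough illustration of the meta-analytic route, here is a minimal fixed-effect (inverse-variance) pooling of a handful of small studies; the effect estimates and standard errors are invented for the example:

```python
# Fixed-effect (inverse-variance) meta-analysis of several small studies.
import numpy as np

effects = np.array([0.45, 0.30, 0.62, 0.25])   # per-study effect estimates (made up)
ses     = np.array([0.40, 0.35, 0.50, 0.30])   # per-study standard errors (made up)

weights   = 1.0 / ses**2                        # inverse-variance weights
pooled    = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))      # smaller than any single study's SE

# 95% confidence interval for the pooled effect
lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect = {pooled:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```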
In a nutshell, it is good to ask for power when power is relevant, but not all statistical techniques, nor all of science, require it, especially when larger sample sizes are difficult or unwise to come by.
---Jose Perezgonzalez (Massey University)---
(1) Cressey, D. (2015). UK funders demand strong statistics for animal studies. Nature 520, 271–272. doi:10.1038/520271a
(2) Cohen, J. (1988). Statistical power analysis for the behavioral sciences, 2nd ed. New York: Psychology Press
(3) Lazic, S. (2015). Comment, http://www.nature.com/news/uk-funders-demand-strong-statistics-for-animal-studies-1.17318#comment-1972479780
(4) Perezgonzalez, J. D. (2015). Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Front. Psychol. 6, doi:10.3389/fpsyg.2015.00223
(5) Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
(6) Cumming, G. (2012). Understanding the new statistics. Effect sizes, confidence intervals, and meta-analysis. New York, NY: Routledge.
(7) Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.
(8) Braver, S. L. et al. (2014). Continuously cumulating meta-analysis and replicability. Perspect. Psychol. Sci. 9, doi:10.1177/1745691614529796
(9) Kruschke, J. K. (2011). Doing Bayesian data analysis. Amsterdam: Elsevier.
A grant usually covers multiple experiments, and specifying the sample size up front does not allow for sequential learning from earlier experiments. For example, differences between litters may be larger than expected and need to be accounted for.
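As a hedged illustration of the litter issue: clustering of animals within litters inflates the sample size that a calculation assuming independence would give, and one common adjustment is the design effect for clustered data. The litter size and intraclass correlation below are assumptions, not measured values.

```python
import math

def design_effect(litter_size, icc):
    """Variance inflation from clustering of animals within litters."""
    return 1 + (litter_size - 1) * icc

def adjusted_n(n_independent, litter_size, icc):
    """Animals needed per group once litter clustering is accounted for."""
    return math.ceil(n_independent * design_effect(litter_size, icc))

# Hypothetical numbers: 64 animals per group assuming independence,
# litters of 8 and an intraclass correlation of 0.2 between littermates.
print(adjusted_n(64, 8, 0.2))   # about 154 animals per group
```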
An unintended consequence of this policy is that statisticians will be inundated with requests to "do the math" so that the desired sample size sounds justified.