Nature | News

UK funders demand strong statistics for animal studies

Move addresses concerns that some experiments are not using enough animals.


Kevin Glover/MRC

Experiments that use only a small number of animals are common, but might not give meaningful results.

Replace, refine, reduce: the 3 Rs of ethical animal research are widely accepted around the world. But now the message from UK funding agencies is that some experiments use too few animals, a problem that leads to wastage and low-quality results.

On 15 April, the research councils responsible for channelling government funding to scientists, and their umbrella group Research Councils UK, announced changes to their guidelines for animal experiments. Funding applicants must now show that their work will provide statistically robust results — not just explain how it is justified and set out the ethical implications — or risk having their grant application rejected.

The move aims to improve the quality of medical research, and will help to address widespread concerns that animals — mostly mice and rats — are being squandered in tiny studies that lack statistical power.

“If the study is underpowered your results are not going to be reliable,” says Nathalie Percie du Sert, who works on experimental design at the National Centre for the Replacement, Refinement and Reduction (NC3Rs) of Animals in Research in London. “These animals are going to be wasted.”

Researchers say that sample size is sometimes decided through historical precedent rather than solid statistics. There is also a lack of clarity: last year, an analysis of selected papers published in Nature or Public Library of Science journals describing animal experiments revealed that few reported the use of statistical tests to determine sample size, even though both publishing groups had endorsed guidelines to improve reporting standards (D. Baker et al. PLoS Biol. 12, e1001756; 2014).

Animals feature in a wide range of experiments (see ‘Animal use’), many of which are designed to test drugs before trials are done in people. The effects that researchers are looking for in these preclinical studies are often subtle, and ‘power calculations’ are needed to reveal the number of animals required to show an effect. But an international academic partnership called the CAMARADES project (Collaborative Approach to Meta Analysis and Review of Animal Data from Experimental Studies) has shown that many animal studies are underpowered: studies in stroke, for example, are typically powered at between 30% and 50%, meaning that there is just a 30–50% chance of detecting a biological effect if it exists.
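As a rough illustration of the power calculations mentioned above, the standard normal-approximation formulas for a two-group comparison can be sketched in a few lines of Python. The effect size, significance level and power values below are illustrative choices, not figures from the article or from CAMARADES:

```python
# Illustrative sketch of a two-sample power calculation using the
# normal approximation. Effect size (Cohen's d), alpha and power
# are conventional example values, not taken from the article.
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate animals needed per group to detect a standardised
    effect size with the given power (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = z.inv_cdf(power)            # quantile for the target power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

def achieved_power(n, effect_size, alpha=0.05):
    """Approximate power of a study with n animals per group."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(effect_size * sqrt(n / 2) - z_alpha)

# A medium effect (d = 0.5) at 80% power needs roughly 63 animals per
# group, whereas a 10-per-group study has only about 20% power.
print(n_per_group(0.5))                   # 63
print(round(achieved_power(10, 0.5), 2))  # 0.2
```

The second function shows why small studies are the concern here: with ten animals per group and a medium effect, the chance of detecting a real effect is about one in five, in line with the 30–50% figure the article quotes for stroke studies.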

[Chart: ‘Animal use’. Source: UK Organisation Data Service]

Malcolm Macleod, a neuroscientist at the University of Edinburgh, UK, blames, among other things, a lack of training and support in experimental design, as well as limited funds: animals are expensive to work with.

Some say that the pressure to ‘reduce’ may be one reason for small experiments; others counter that this misinterprets the 3 Rs, because underpowered experiments are themselves ethically problematic.

The problem is not limited to Britain: last year, Francis Collins, director of the US National Institutes of Health (NIH), and Lawrence Tabak, NIH deputy director, warned about a lack of reproducibility in preclinical research and mentioned a dearth of sample-size calculations as one of the problems (see Nature 505, 612–613; 2014).

The situation infuriates animal-welfare proponents. “It’s completely unethical to use animals in studies that aren’t properly designed,” says Penny Hawkins, head of the research-animals department at the Royal Society for the Prevention of Cruelty to Animals in Southwater, UK.

Boosting the number of animals in specific experiments need not mean that more animals are used overall, because multiple small experiments can often be replaced by fewer, larger ones. “One potential implication is we need to ask for money to do larger studies,” says Marcus Munafò, a psychologist at the University of Bristol, UK.

Another way to increase sample sizes would be to link up researchers working on similar topics. Munafò notes that this is what geneticists now do for studies that require scanning a large number of genomes. “That template already exists,” he says. “The question is, how do you initiate that cultural change?”

More immediately, du Sert is developing an online tool for the NC3Rs that will help researchers to design robust studies. “We’re not blaming anyone for the way they were doing things before,” she adds. “That was the practice at the time.”




Comments

  1. Leonora Mahle
    To be fair, I'm all for necessary animal tests under carefully monitored and regulated conditions in which the animal is given a happy and kind life. However, I would disagree that the 'vast majority' of animal experiments are respectful towards animals, since I would not agree that an animal spending its whole life from birth to death in a cage in a lab is respectful, even if it is not harmed...also, with today's technology, studying the brain and body through simulations is actually sometimes a lot more effective than testing on animals, which have very different anatomies to humans, and respond differently to humans. For science to progress I would argue that more modern and humane methods need to be adopted.
  2. Jose Perezgonzalez
    Cressey's news article (1) describes a step further by Research Councils UK in trying to improve statistical robustness and reliability a la Cohen (2). It is an admirable step, one with also ethical implications, as commented by Hawkins (in 1). However, it confuses science with statistics, and seems to assume that power will provide robust and reliable results where lack of power will not. Thus, jumping onto the power bandwagon is the way to go.
    As commented by Lazic (3), power is mostly relevant when you know your population, because power is part of the Neyman-Pearson approach to data testing. It is irrelevant when using Fisher's approach, however, which is, arguably, the one mostly used in psychology and related disciplines (4). Power can be made superfluous under the latter approach, and with this I mean that it can be used to approximate the needed sample size, but the unknown parameters of the population are second-guesses and, thus, power analysis may not be but a nice exercise.
    Alternatives to setting sample sizes by power analysis require thinking about what science is, and how statistics can contribute to it. Exploratory data analysis (5) and a focus on effect sizes and confidence intervals (6) can further scientific knowledge even when sample sizes are small because of costs, or when they turn out smaller than expected relative to (post hoc) power. Meta-analysis (7, 8) can bring together smaller samples and thereby increase power. Even Bayesian analysis may result in smaller sample requirements, depending on the prior probabilities used (9).
    In a nutshell, it is good to ask for power when power is relevant, but not all statistical techniques nor all of science requires it, especially when larger samples sizes are difficult or unwise to come by.
    ---Jose Perezgonzalez (Massey University)---
    (1) Cressey, D. (2015). UK funders demand strong statistics for animal studies. Nature 520, doi:10.1038/520271a
    (2) Cohen, J. (1988). Statistical power analysis for the behavioral sciences, 2nd. Ed. New York: Psychology Press
    (3) Lazic, S. (2015). Comment,
    (4) Perezgonzalez, J. D. (2015). Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Front. Psychol. 6, doi:10.3389/fpsyg.2015.00223
    (5) Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
    (6) Cumming, G. (2012). Understanding the new statistics. Effect sizes, confidence intervals, and meta-analysis. New York, NY: Routledge.
    (7) Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.
    (8) Braver, S. L. et al. (2014). Continuously cumulating meta-analysis and replicability. Perspect. Psychol. Sci. 9, doi:10.1177/1745691614529796
    (9) Kruschke, J. K. (2011). Doing Bayesian data analysis. Amsterdam: Elsevier.
  3. Stanley Lazic
    A sample size calculation is useful for confirmatory experiments when there is (1) a specific hypothesis to test about a primary endpoint, (2) an idea of what is a meaningful effect size, and (3) a good estimate of the expected variability in the data. This does not characterise the exploratory nature of most in vivo research. For example, a sample size calculation should be required for the following confirmatory experiment: Does 10 mg/kg of a compound increase the time until disease onset (defined as the first instance of paw clasping when suspended from the tail) by at least 20% in the R6/1 transgenic mouse model of Huntington's disease? A sample size calculation is less relevant for the following exploratory experiment: Does this compound---tested at three doses---have an effect on any of the behavioural, neuroanatomical, or histological end points? High power is useful, but power to detect what exactly, and at what dose? Any effect is interesting, and researchers should not be required to pick a primary end point prematurely and make up numbers to satisfy grant reviewers. A greater source of irreproducibility is presenting the results of the exploratory experiment (or a subset of them) as if they were hypothesised in advance, and not validating the findings with a subsequent confirmatory experiment.
    A grant usually covers multiple experiments, and specifying the sample size up front does not allow for sequential learning from earlier experiments. For example, differences between litters may be larger than expected and need to be accounted for.
    An unintended consequence of this policy is inundating statisticians with requests to "do the math" so that the desired sample size sounds justified.
  4. Guest
    "Use fewer animals! ...No wait, use more animals! What does the 3rd 'R' stand for again? Well use as many animals as it takes but just make sure you deliver p-values less than the magic number!" This is why science policy should be informed by actual scientists who understand the complexities of how science works.