# Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence

## Abstract

Most neuroscientists would agree that for brain research to progress, we have to know which experimental manipulations have no effect as much as we must identify those that do have an effect. The dominant statistical approaches used in neuroscience rely on P values and can establish the latter but not the former. This makes non-significant findings difficult to interpret: do they support the null hypothesis or are they simply not informative? Here we show how Bayesian hypothesis testing can be used in neuroscience studies to establish both whether there is evidence of absence and whether there is absence of evidence. Through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP, this article aims to empower neuroscientists to use this approach to provide compelling and rigorous evidence for the absence of an effect.

## References

1. Benjamin, D. J. et al. Redefine statistical significance. Nat. Hum. Behav. 2, 6–10 (2018).
2. Dienes, Z. Using Bayes to get the most out of non-significant results. Front. Psychol. 5, 781 (2014).
3. Gallistel, C. R. The importance of proving the null. Psychol. Rev. 116, 439–453 (2009).
4. Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D. & Iverson, G. Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 16, 225–237 (2009).
5. Love, J. et al. JASP: Graphical statistical software for common statistical designs. J. Stat. Softw. 88, 1–17 (2019).
6. Wagenmakers, E.-J. et al. The need for Bayesian hypothesis testing in psychological science. in Psychological Science Under Scrutiny: Recent Challenges and Proposed Solutions (eds. Lilienfeld, S. O. & Waldman, I.) 123–138 (Wiley, 2017).
7. Altman, D. G. & Bland, J. M. Absence of evidence is not evidence of absence. Br. Med. J. 311, 485 (1995).
8. Edwards, W., Lindman, H. & Savage, L. J. Bayesian statistical inference for psychological research. Psychol. Rev. 70, 193–242 (1963).
9. Jeffreys, H. Theory of Probability (Oxford University Press, 1961).
10. Szucs, D. & Ioannidis, J. P. A. Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biol. 15, e2000797 (2017).
11. Etz, A. & Wagenmakers, E.-J. J. B. S. Haldane’s contribution to the Bayes factor hypothesis test. Stat. Sci. 32, 313–329 (2017).
12. Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
13. Lee, M. D. & Wagenmakers, E.-J. Bayesian Cognitive Modeling: A Practical Course (Cambridge University Press, 2013).
14. Morey, R. D. & Rouder, J. N. BayesFactor: computation of Bayes factors for common designs. v. 0.9.12–4.2 https://cran.r-project.org/package=BayesFactor (2018).
15. Carrillo, M. et al. Emotional mirror neurons in the rat’s anterior cingulate cortex. Curr. Biol. 29, 1301–1312.e6 (2019).
16. Jeffreys, H. Theory of Probability (Oxford University Press, 1939).
17. Nieuwenhuis, S., Forstmann, B. U. & Wagenmakers, E.-J. Erroneous analyses of interactions in neuroscience: a problem of significance. Nat. Neurosci. 14, 1105–1107 (2011).
18. Gelman, A. & Stern, H. The difference between “significant” and “not significant” is not itself statistically significant. Am. Stat. 60, 328–331 (2006).
19. Morey, R. D. & Rouder, J. N. Bayes factor approaches for testing interval null hypotheses. Psychol. Methods 16, 406–419 (2011).
20. Rouder, J. N., Morey, R. D., Speckman, P. L. & Province, J. M. Default Bayes factors for ANOVA designs. J. Math. Psychol. 56, 356–374 (2012).
21. Rouder, J. N., Engelhardt, C. R., McCabe, S. & Morey, R. D. Model comparison in ANOVA. Psychon. Bull. Rev. 23, 1779–1786 (2016).
22. Myung, I. J. & Pitt, M. A. Applying Occam’s razor in modeling cognition: a Bayesian approach. Psychon. Bull. Rev. 4, 79–95 (1997).
23. Efron, B. Why isn’t everyone a Bayesian? Am. Stat. 40, 1–5 (1986).
24. Lee, M. D. & Vanpaemel, W. Determining informative priors for cognitive models. Psychon. Bull. Rev. 25, 114–127 (2018).
25. Bayarri, M. J., Berger, J. O., Forte, A. & Garcia-Donato, G. Criteria for Bayesian model choice with application to variable selection. Ann. Stat. 40, 1550–1577 (2012).
26. Cremers, H. R., Wager, T. D. & Yarkoni, T. The relation between statistical power and inference in fMRI. PLoS ONE 12, e0184923 (2017).
27. Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D. & Wagenmakers, E.-J. The fallacy of placing confidence in confidence intervals. Psychon. Bull. Rev. 23, 103–123 (2016).
28. Marsman, M., Waldorp, L., Dablander, F. & Wagenmakers, E.-J. Bayesian estimation of explained variance in ANOVA designs. Stat. Neerl. 73, 351–372 (2019).
29. van Doorn, J., Marsman, M., Ly, A. & Wagenmakers, E.-J. Bayesian rank-based hypothesis testing for the rank sum test, the signed rank test, and Spearman’s ρ. J. Appl. Stat. https://doi.org/10.1080/02664763.2019.1709053 (2020).
30. Wagenmakers, E.-J., Morey, R. D. & Lee, M. D. Bayesian benefits for the pragmatic researcher. Curr. Dir. Psychol. Sci. 25, 169–176 (2016).
31. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (The MIT Press, 1998).
32. Wrinch, D. & Jeffreys, H. On certain fundamental principles of scientific inquiry. Philos. Mag. 42, 368–374 (1923).
33. Rozeboom, W. W. The fallacy of the null-hypothesis significance test. Psychol. Bull. 57, 416–428 (1960).
34. Stefan, A. M., Gronau, Q. F., Schönbrodt, F. D. & Wagenmakers, E.-J. A tutorial on Bayes factor design analysis using an informed prior. Behav. Res. Methods 51, 1042–1058 (2019).
35. Wagenmakers, E.-J. et al. Bayesian inference for psychology. Part II: example applications with JASP. Psychon. Bull. Rev. 25, 58–76 (2018).
36. Rouder, J. N. Optional stopping: no problem for Bayesians. Psychon. Bull. Rev. 21, 301–308 (2014).
37. Schönbrodt, F. D. & Wagenmakers, E.-J. Bayes factor design analysis: planning for compelling evidence. Psychon. Bull. Rev. 25, 128–142 (2018).
38. Consonni, G., Fouskakis, D., Liseo, B. & Ntzoufras, I. Prior distributions for objective Bayesian analysis. Bayesian Anal. 13, 627–679 (2018).
39. Gronau, Q. F., Ly, A. & Wagenmakers, E.-J. Informed Bayesian t-tests. Am. Stat. 74, 137–143 (2019).

## Acknowledgements

C.K. is funded by NWO VICI grant 453-15-009; V.G. is funded by ERC grant 758703 and NWO VIDI grant 452-14-015; E.J.W. is funded by NWO VICI grant 453-16-003. We thank F. Bartos for help with Fig. 2.

## Author information

### Contributions

All authors conceived the project together and contributed to the writing of the manuscript. E.J.W. coordinates the development of JASP.

### Corresponding author

Correspondence to Christian Keysers.

## Ethics declarations

### Competing interests

E.J.W. declares that he coordinates the development of the open-source software package JASP (https://jasp-stats.org), a non-commercial, publicly-funded effort to make Bayesian statistics accessible to a broader group of researchers and students. C.K. and V.G. declare no competing interests.

Peer review information: Nature Neuroscience thanks Denise Cai, Zhe Dong, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data

### Extended Data Fig. 1 The relationship between BF, p, and effect size values.

a, This log-log plot shows the BF+0 values corresponding to familiar critical p values for a one-tailed one-sample t-test at different sample sizes (n). The curves show the BF+0 values obtained in a Bayesian t-test based on the critical t-value that provides P=0.05 (yellow), P=0.01 (green), P=0.005 (black) and P=0.001 (black). The yellow dashed horizontal line indicates the BF+0=3 bound for moderate evidence, considered by Jeffreys (ref. 9) to be similar to P=0.05; the green one indicates the BF+0=10 bound for strong evidence, considered similar to P=0.01. The two black dashed lines mark BF+0=1, i.e. the line of no evidence, and BF+0=1/3, the bound for moderate evidence of absence. The background gradient reminds the reader that the BF reference values of 3 and 10 should not be considered hard bounds. Instead, the BF should be interpreted as a continuous value, with values diverging more from 1 supporting stronger conclusions. This panel makes two points. First, there is no simple equivalence between p and BF that holds over all sample sizes. This is because in a frequentist t-test, the observed effect size (d) sufficient to generate a specific p value decreases with $\sqrt{n}$ more rapidly than for the BF. As a result, at large n, very small effect sizes generate a ‘significant’ t-test: at n=1000, the critical t-value for a one-tailed P=0.05 is 1.65, corresponding to $d = 1.65/\sqrt{1000} \approx 0.05$. For the BF, such a minuscule observed effect is about 4 times more likely under H0 than under H+ (BF+0=0.26). Hence, for small sample sizes p and BF support similar conclusions (e.g., P=0.05 at n=4 corresponds to BF+0>3, supporting the same conclusion of evidence for an effect), but for large sample sizes the frequentist and Bayesian conclusions can diverge in the presence of very small effect sizes (e.g., P=0.05 at n=1000 corresponds to BF+0<1/3; see Jeffreys, H. Some tests of significance, treated by the theory of probability. Proc. Cambridge Philos. Soc. 31, 203–222 (1935)). Considering confidence or credible intervals of the effect size in addition to p or BF values helps interpret such cases. Second, the fact that the dashed lines are above the curve of the same color for all n>4 shows that BF+0=3 and BF+0=10 indeed protect against Type I errors in a frequentist sense at least at P=0.05 or P=0.01, respectively. In other words, if BF10>3, p<0.05, and if BF10>10, p<0.01, but how far below 0.05 or 0.01 the exact P value falls depends on n.

b, BF+0 (left) and p (right) values as a function of measured effect size and sample size. These panels illustrate the measured effect sizes necessary to provide evidence for an effect at different sample sizes in a one-sample one-tailed t-test using the BF vs. traditional p values. Each curve connects the results at different sample sizes for the specified value of d. The logarithmic BF and p scales are aligned so as to place BF=3 next to P=0.05, and BF=10 next to P=0.01.
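The divergence between p and BF at large n described for panel a can be checked numerically. The following is a minimal Python sketch, not the JASP implementation: it evaluates the two-sided default (JZS) Bayes factor BF10 for a one-sample t-test, in the scale-mixture form given by Rouder et al. (ref. 4), using simple midpoint integration. The function name `jzs_bf10`, the Cauchy prior scale r = √2/2 and the number of integration steps are assumptions made for illustration. Because the sketch is two-sided, its values will differ from the one-sided BF+0 values quoted in the caption.

```python
import math


def jzs_bf10(t, n, r=math.sqrt(2) / 2, steps=20000):
    """Two-sided JZS Bayes factor BF10 for a one-sample t-test.

    Assumes delta | g ~ Normal(0, g * r^2) and g ~ InverseGamma(1/2, 1/2),
    which together give a Cauchy(0, r) prior on the effect size delta.
    """
    nu = n - 1  # degrees of freedom
    # Log of the H0 term (constants shared with H1 cancel in the ratio).
    log_null = -(nu + 1) / 2 * math.log1p(t * t / nu)
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) / steps        # midpoint on (0, 1)
        g = u / (1.0 - u)            # maps (0, 1) onto (0, infinity)
        jacobian = 1.0 / (1.0 - u) ** 2
        a = n * g * r * r
        # Log of the H1 term for this value of g.
        log_alt = (-0.5 * math.log1p(a)
                   - (nu + 1) / 2 * math.log1p(t * t / ((1 + a) * nu)))
        # Log density of the InverseGamma(1/2, 1/2) prior on g.
        log_prior = (-0.5 * math.log(2 * math.pi)
                     - 1.5 * math.log(g) - 1.0 / (2.0 * g))
        total += math.exp(log_alt - log_null + log_prior) * jacobian
    return total / steps


# The caption's example: critical t for one-tailed P=0.05 at n=1000.
t_crit, n = 1.65, 1000
d = t_crit / math.sqrt(n)  # observed effect size implied by t and n
print(f"d = {d:.3f}, BF10 = {jzs_bf10(t_crit, n):.3f}")
```

With these inputs the Bayes factor should fall below 1/3, so a result that is 'significant' at P=0.05 still favours H0 under these assumptions, which is the point the caption makes for large samples.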

### Extended Data Fig. 2 Evidence for or against a factor in a Bayesian ANOVA.

A Bayesian ANOVA is a form of model comparison. This figure illustrates how the Bayes factor can provide evidence for a simpler model by concentrating its predictions on a single parameter value. This example ANOVA determines whether or not the data D depend on the value of the factor Group by comparing the Null Model D=0*Group (left) against the Group Model D=β*Group, with a Cauchy prior on β (right). The top row illustrates the prior probability attributed to the different values of β under the two competing models. Note how both models include β = 0 as a possibility, but given that the probability values must integrate to 1 over the entire β space, for the Null Model p(β = 0) = 1 while for the Group Model, the probability is distributed across all plausible alternative values. The middle row shows the predicted t-values based on these priors, where t represents the difference between the data from the two groups as in Fig. 2. Note how these predictions are more peaked for the Null compared to the Group Model. The bottom row compares the predicted probability of finding particular t-values under the two models, and shows how values close to zero (i.e., small or no difference between the groups) are predicted more often by the Null compared to the Group Model, while the opposite is true for large t-values. If conducting the experiment reveals a measured t-value close to zero, the Bayes factor for including the factor Group would be substantially below 1, providing evidence for the absence of an effect of Group, while the inverse would be true for high t-values.
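The comparison of predictions in the bottom row can be sketched in a few lines of Python. This is a deliberately simplified stand-in for the ANOVA, not the model used by JASP: it assumes the observed group difference x is normally distributed around β with a known standard error `se`, and it uses an illustrative Cauchy scale r = 0.5. The function names and all numeric values are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def marginal_null(x, se):
    """Predicted density of x under the Null Model (beta fixed at 0)."""
    return normal_pdf(x, 0.0, se)


def marginal_group(x, se, r=0.5, steps=20000):
    """Predicted density of x under the Group Model, beta ~ Cauchy(0, r).

    Substituting beta = r * tan(theta) turns the Cauchy prior into a
    uniform weight on theta, so the integral is a plain average.
    """
    total = 0.0
    for i in range(steps):
        theta = -math.pi / 2 + (i + 0.5) * math.pi / steps
        total += normal_pdf(x, r * math.tan(theta), se)
    return total / steps


def bf_group_vs_null(x, se, r=0.5):
    """Bayes factor for including Group: ratio of the two predictions."""
    return marginal_group(x, se, r) / marginal_null(x, se)


# A difference near zero versus a difference of five standard errors.
for x in (0.0, 1.0):
    print(f"x = {x:.1f}: BF = {bf_group_vs_null(x, se=0.2):.3g}")
```

Under these assumptions an observed difference of zero gives a Bayes factor below 1 (evidence for the Null Model), while a difference of five standard errors gives a Bayes factor far above 1, mirroring the pattern described in the caption.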

### Extended Data Fig. 3

Examples of how to report results.

## Supplementary information

### Supplementary Information

Supplementary Note on continuous testing and Supplementary Fig. 1.
