To the Editor:

Reproducibility of results is a fundamental tenet of science. In this journal, Richter et al.1 tested whether systematic variation in experimental conditions (heterogenization) affects the reproducibility of results. Comparing this approach with the current standard of ensuring reproducibility by minimizing variation in experimental conditions (standardization), they concluded that heterogenization improved reproducibility1. However, in our view, they did not account for substantial sources of dependency in their data, which inflated the type I error rate through pseudoreplication (defined as “the use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent”2). We show that this dependency leads to strong overconfidence in their analyses and that their hypothesis is unsupported.

Richter et al.1 compared F ratios of strain-by-experiment interactions for 36 behavioral measures to test for differences in reproducibility between series of standardized and heterogenized experiments. Although these measures were treated as independent, most are strongly intercorrelated, which makes the F ratios interdependent. Two sources contribute to this interdependence and, hence, to pseudoreplication. First, the measures within each of their three behavioral tests are strongly intercorrelated; for example, an animal exploring the edge of an arena cannot simultaneously explore the center. Second, the measures are correlated across tests because many animals show temporal and cross-contextual consistency in behavioral traits (known as 'animal personalities'3,4). A meaningful comparison of reproducibility between standardized and heterogenized experiments, however, requires independently obtained F ratios5.
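The inflation we describe can be illustrated with a minimal simulation (an illustrative sketch only, not a reproduction of Richter et al.'s analysis; the correlation strength, group sizes, and the naive one-sample t test are hypothetical choices): when 36 intercorrelated measures are treated as 36 independent observations, the nominal 5% type I error rate is grossly exceeded even though no true effect exists.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_measures = 36   # as in the 36 behavioral measures
rho = 0.6         # hypothetical intercorrelation among measures
n_animals = 40    # hypothetical group size
n_sims = 2000

# Equicorrelated covariance matrix among the measures
cov = np.full((n_measures, n_measures), rho)
np.fill_diagonal(cov, 1.0)

false_pos = 0
for _ in range(n_sims):
    # Two groups drawn from the SAME distribution: no true effect
    a = rng.multivariate_normal(np.zeros(n_measures), cov, size=n_animals)
    b = rng.multivariate_normal(np.zeros(n_measures), cov, size=n_animals)
    # Naive test: treat the 36 per-measure mean differences as
    # independent observations and t-test them against zero
    diffs = a.mean(axis=0) - b.mean(axis=0)
    _, p = stats.ttest_1samp(diffs, 0.0)
    false_pos += p < 0.05

rate = false_pos / n_sims
print(f"empirical type I error rate: {rate:.2f}")  # far above the nominal 0.05
```

Because the 36 differences share common variation, the naive test rejects the true null far more often than 5% of the time, which is the statistical essence of pseudoreplication.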

We reanalyzed their data (Supplementary Note and Supplementary Figs. 1–6), using hierarchical clustering to identify sets of approximately independent variables. This reanalysis revealed no detectable difference in reproducibility between standardization and heterogenization. Hence, in our view, their data do not support their hypothesis.
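The clustering step can be sketched as follows (a toy illustration with simulated data, not our supplementary reanalysis; the number of latent traits, the loadings, the noise level, and the 0.5 distance cut-off are all hypothetical choices): measures are clustered by correlation distance and one representative per cluster is retained, yielding a set of approximately independent variables for downstream tests.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Toy data standing in for behavioral measures: three latent traits,
# each observed through several noisy, redundant measures
n_animals = 60
latent = rng.standard_normal((n_animals, 3))
loadings = np.repeat(np.eye(3), [4, 3, 3], axis=0)  # 10 measures in total
X = latent @ loadings.T + 0.3 * rng.standard_normal((n_animals, 10))

# Distance = 1 - |correlation|, so tightly correlated measures cluster
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")

# Cut the tree where measures are only weakly correlated
clusters = fcluster(Z, t=0.5, criterion="distance")
print("cluster per measure:", clusters)

# Keep one representative measure per cluster
reps = [np.flatnonzero(clusters == c)[0] for c in np.unique(clusters)]
print("approximately independent representatives:", reps)
```

With the redundant measures collapsed to one representative per cluster, the retained variables carry little shared variation, so treating them as independent in subsequent inference is far more defensible.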

We caution that overconfidence resulting from pseudoreplication may lead to premature conclusions in studies designed to prove this principle1,6,7. Unjustifiably assuming that heterogenization yields better reproducibility may prompt a reduction in the number of replicate experiments, thereby lowering the chance of detecting both desired and unwanted effects8. Hence, further studies validating the benefits of heterogenization for reproducibility are required before it can be adopted as the new standard.