To the Editor — Pyc and Rawson1 proposed that test practice (versus restudy) enhances long-term memory because testing enhances the effectiveness of mediators generated during encoding. In that study, we evaluated two components of mediator effectiveness: mediator retrieval and decoding. Camerer et al.2 sought to replicate the mediator retrieval component of the mediator effectiveness hypothesis. We provided our program to the authors and spoke with them numerous times regarding the method; based on the responses we received from the project team, the replication was a close methodological match to the original study1. They concluded2 that they replicated our findings in the second stage of data collection but not in the first. Below, we discuss potential reasons why our results were not replicated in the first stage of data collection.

The mediator effectiveness hypothesis provides a directional prediction about the relationship between learning schedule (study versus test) and the effectiveness of mediators generated during encoding. The first stage yielded a trend in the predicted direction that did not meet standard significance levels with a two-tailed test. Stage 2 provided results that met standard significance levels. An important measure that was collected but not reported by Camerer et al.2 is target recall on the final test. The mediator effectiveness hypothesis states that differences in mediator recall contribute to differences in target recall. If differences in target recall were not obtained, there would be no testing effect to be explained and therefore no expectation for differences in mediator recall. Likewise, if a weaker testing effect were observed than in the original study, one would expect weaker effects in mediator recall than in the original study. Without knowing target recall, the size of the expected mediator recall effect is not clear.

Another factor that may have influenced the results is prior exposure to the Swahili–English word pairs. These normed items from Nelson and Dunlosky3 have been widely used in the field. We ensured1 that participants had no prior exposure to them, but Camerer et al.2 did not report whether their participants were similarly screened. Prior exposure to the items could influence the effectiveness of test versus study practice.

A final difference concerns the population sampled. We used undergraduate students only, whereas Camerer et al.2 used both undergraduate and graduate students; their participants also received greater compensation. The possibility that the testing effects of interest are moderated by differences in participant characteristics is consistent with recently reported outcomes4.

More generally, variability in effect size estimates across studies is expected. Instead of focusing solely on P values to determine replicability, Braver, Thoemmes and Rosenthal5 argue that a meta-analytic approach provides more appropriate criteria. A continuously cumulating meta-analysis (CCMA) of outcomes from the original study and from stage 1 data reported by Camerer et al. yielded a pooled d of 0.43 (95% CI = 0.12–0.74), P = 0.006, consistent with Braver and co-workers’ argument that “a CCMA analysis on a replication attempt that does not reach significance might nonetheless provide more, not less, evidence that the effect is real” (p. 333)5. Taking this approach into account, results of both stages 1 and 2 of Camerer et al. replicate the mediator retrieval assumption of the mediator effectiveness hypothesis.
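For readers unfamiliar with the approach, the CCMA computation above is an inverse-variance (fixed-effect) pooling of standardized mean differences. The sketch below illustrates the arithmetic; the effect sizes and sample sizes in the example are hypothetical placeholders, not the actual values from the original study or the replication, and the pooling formulas follow the standard fixed-effect approach rather than any code used by the authors.

```python
import math

def d_variance(d, n1, n2):
    # Approximate sampling variance of Cohen's d for two independent
    # groups of sizes n1 and n2 (standard large-sample formula).
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def ccma_pool(studies):
    """Fixed-effect (inverse-variance) pooling of Cohen's d values.

    studies: list of (d, n1, n2) tuples, one per study.
    Returns (pooled_d, ci_low, ci_high, p_two_tailed).
    """
    weights = [1.0 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    pooled = sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = pooled / se
    # Two-tailed P value from the standard normal CDF (via erf).
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, p

# Hypothetical inputs: a smaller original study with a larger effect,
# and a larger replication with a smaller effect.
pooled, lo, hi, p = ccma_pool([(0.63, 18, 18), (0.26, 60, 60)])
```

Because each study is weighted by the inverse of its sampling variance, the larger replication pulls the pooled estimate toward its smaller effect, yet the combined evidence can still reach significance even when the replication alone does not, which is the point Braver and co-workers5 make.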