Main

Ever since the publication in 2011–2012 of two front-of-the-magazine pieces by Prinz et al.1 and Begley and Ellis2, the research world has been rocked by accusations of the “irreproducibility” of preclinical research. Both of these articles described, anecdotally, the inability of pharmaceutical preclinical testing units to reproduce experiments by academic laboratories published in the scientific literature. These surveys were widely reported on in the media (e.g., ref. 3), and their pessimistic message was embraced by funding agencies4 and journal editors5. The desire for a more systematic approach to the issue yielded high-profile “reproducibility projects” in both psychology6 and cancer7; the former has been published and confirmed that large proportions of published research findings cannot be clearly reproduced by independent laboratories. A recent survey of preclinical scientists revealed that 52% think there is a “significant crisis”8.

This being said, it is hard to decide whether something in science has actually changed, or is “broken”, or whether the apparent epidemic of irreproducibility is simply due to more attention being paid to the topic against a backdrop of the (undeniably) poor translational success of recent decades in drug development9. Those commenting on irreproducibility have most often attributed it to poor experimental design, poor reporting of methods and results (for example, missing statements about blinding and randomization), various types of bias, investigator misconduct, and misuse of statistics10. I believe that all of these things occur. I would argue, however, that the current state of affairs is certainly no worse than when I started in science 25 years ago, and almost certainly better. At the beginning of my career, for example, methods sections would simply refer to “mice”; now at least we're usually told that they are male—alas, always male—C57BL/6 mice. Thus, it seems to me that the true reasons for poor translation lie elsewhere.

What is responsible for poor translation?

What then is responsible for our current translational challenges? Some believe that the drugs that work in animals really do also work in people, but that it's becoming harder and harder to prove that this is so. That is, the real problem may actually be the changing nature of clinical trials, in which fewer and fewer drug-naive participants are available, and in which the placebo response has steadily increased over time11. Others have pointed to shortcomings in the validity of animal models (and their implementation) used in preclinical studies12. I have previously argued that a major problem with the status quo in preclinical pain research is the animal models in common use; the inadequacies are related to all three facets of an animal model: subjects, assays, and measures13. The human sufferers of chronic pain are overwhelmingly female14, middle-aged or elderly15, and of heterogeneous genetic background, whereas the animal subjects in pain experiments are overwhelmingly young-adult, male Sprague Dawley rats or C57BL/6 mice13,16,17. Both quantitative and robust qualitative differences in pain processing have been documented between strains18 and the sexes19,20, confounding simple conclusions. The most common chronic pain syndromes in humans are low back pain, arthritis of the joints, and headache21, whereas the most common chronic pain assays in current use for animal subjects involve experimental ligations of afferent fibers serving, and injection of inflammatory substances into, the hind paw22. Finally, whereas the most prevalent (and bothersome) clinical symptoms of chronic pain are deep, spontaneous (ongoing) pain and numbness23,24, and comorbidities like sleep disruption21, preclinical pain researchers continue to focus almost exclusively on measuring mechanical and thermal pain hypersensitivity (allodynia and hyperalgesia)13,25.
Whether currently popular animal models are close enough to clinical reality or not is hard to say (is the glass half-empty, or is it half-full?), and such discussions tend to quickly devolve into simply counting up recent efficacy successes (e.g., ziconotide, tanezumab) and failures (e.g., neurokinin-1 antagonists, fatty acid amide hydrolase inhibitors). Simply put, preclinical pain research with currently popular animal models performs more than adequately for backwards translation26,27; the forward translation jury is still out.

The role of latent environmental factors in pain research

I would suggest that another reason for apparent irreproducibility may be more important than is commonly appreciated: stealth or “latent” laboratory environmental factors28, the “unknown unknowns” of biology. Simply stated, any factor affecting stress levels in the testing environment may have huge effects on any biological phenomenon affected by stress, which includes essentially all of them, behavioral and non-behavioral alike. In turn, we are aware of only a few such factors, and even when we are aware of them they hardly ever receive our attention. Figure 1 lists a variety of factors that might vary in the laboratory environment, both in the vivarium and the testing room, that have been documented to affect pain sensitivity in rodents. Pain is, in fact, a perfect example of a biological domain in which subject stress can produce robust effects. The bigger problem is that these effects are unpredictable. Stress can produce either stress-induced analgesia (SIA) (see ref. 29 for review) or stress-induced hyperalgesia (SIH) (see ref. 30 for review), and although some evidence suggests that genetics, stress severity, context and chronicity may be relevant to the direction in which pain is modulated31,32,33,34,35, this topic remains greatly understudied. Also unappreciated is the fact that stress associated with pain testing can be surprisingly high (and hugely variable) in human participants as well. In a study of capsaicin pain, self-reported stress levels ranged from 0 to 8 on a 0-to-10-point scale, and this stress interacted with both sex and a genetic variant within the AVPR1A gene to significantly affect pain ratings36.

Figure 1: Factors significantly affecting pain sensitivity in rodent models and sample references.

Kim Caesar/Springer Nature

Only factors that might credibly vary between laboratories are considered. For the factor “Diet”, we excluded diabetes, hypertension and obesity models. We also excluded experimental stressors or procedures such as shock, restraint, prolonged maternal separation, or sucrose feeding. Only papers reporting statistically significant effects in either direction are listed. In the case of multiple papers by the same laboratory (indicated by *), only the first to be published is listed.

Some laboratory stressors (or non-stressful modulatory factors) are related to husbandry, and have long been known. For example, housing isolation and crowding can both affect animal welfare (e.g., ref. 37), and directly affect pain sensitivity38,39,40,41,42,43, and the effects of cage density might be sex-specific44. Housing enrichment using physical objects has been shown to ameliorate pain45,46,47. Also well known are the effects of prenatal (i.e., gestational) stress (e.g., ref. 48) (which might be sex-specific49), which may occur due to noise or other disruptions in the breeding room, social isolation of breeding dams, or even high lead or arsenic levels in the drinking water50,51,52. Stress during pregnancy can affect maternal behavior53, which in turn can affect stress responses of rats in adulthood, and across generations54. Postnatal stressors during development can have long-lasting effects as well. Failure to provide bedding material for a rat to make a nest for her pups can lead to hyperalgesia in those pups when they become adults55,56, and the handling of young animals (for example, related to injections, and involving the stress of maternal separation) can cause long-term alterations in pain sensitivity41,57,58,59,60,61,62,63. A growing literature also documents the effect of diet on pain sensitivity64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80. In one case, differences in the soy content of rat chow turned out to be wholly responsible for a failure to replicate in Baltimore the findings of studies originally performed in Jerusalem65. Perhaps the largest husbandry-related stressor is within-cage aggression in males. Social defeat in the “resident-intruder” assay is known to produce profound stress-induced analgesia81, and one might imagine great variability in the apparent “baseline” sensitivity of male mice depending on their social status (dominant/submissive) and how long before testing aggressive encounters had occurred. 
Kevin Keay and colleagues have shown that subpopulations of rats that exhibit dominant behavior as residents in a resident-intruder assay have different behavioral changes (including apparent changes in pain sensitivity) after nerve damage than rats that exhibit submissive behavior82, with corresponding alterations in neuroimmune signaling83. Even very subtle differences in husbandry may matter. One study demonstrated increased pain sensitivity of male mice reared until weaning with only their brothers compared to both brothers and sisters84. Another recent study observed that mice housed on aspen bedding were more sensitive to thermal and mechanical stimuli than mice housed on TEK-Fresh bedding made of 100% virgin wood pulp85. Other variable husbandry factors that have been discussed but not yet shown directly to affect pain sensitivity include: litter size (i.e., culling), weaning timing, light/dark phase duration, housing mice and rats in the same room, water source, in-house breeding versus commercial breeding (and, in the latter, delay between arrival and testing), direct transfer from vivarium to testing room versus the use of a holding room, and endemic presence of viruses. The effect of husbandry on adult biomedical testing is hardly limited to stress effects on behavioral measures. A celebrated recent study demonstrated that the hygienic barrier facilities in which most laboratory mice are currently housed render their immune systems abnormally immature, and concluded that translation would be improved if mice were “dirtier”86. It should be pointed out that all these husbandry-related factors apply both to commercial breeding facilities and to vivaria within academia and industry.

Other laboratory stressors occur in the testing room rather than the vivarium. Many routine procedures—including capture/handling, blood collection, and gavage—produce stress in laboratory animals87. Simply being placed in a novel testing room under bright lights can produce non-negligible stress in laboratory mice that have never before left the vivarium88, and a large literature documents the SIA resulting from this novelty and the ameliorating effects of habituation to the testing room (e.g., ref. 89). Different laboratories have very different practices regarding the frequency and duration of handling of adult animals, potentially leading to very different (and strain-dependent90,91) stress levels on testing days if human experimenters are present during the data collection, and depending on whether the same or different individuals are involved in handling versus testing92. The precise method of mouse capture has been shown to have large effects on stress93. Stress related to injections, producing SIA, is well known28,94, and likely due to the associated capture and restraint95 required in addition to the needle itself. Whether animals are naive to other behavioral tests can affect the results of pain tests performed later96. Circadian rhythms affect pain in rodents (e.g., ref. 97), in a strain-98, sex-99 and estrous phase-dependent100 manner, and testing times in laboratories vary widely, as does whether animals are kept under regular or reverse light cycles. The ambient temperature101,102,103,104, humidity105, barometric pressure106, and noise level107,108,109 in the testing room have all been shown to affect pain sensitivity; these are not as tightly controlled in many laboratories as is often pretended. In most cases, animals are tested in groups, and social factors can alter results as well. 
We and others have found that pain sensitivity can be increased (by emotional contagion110,111,112,113 or social SIH88,111), or decreased (by social SIA88,110 or social buffering114,115,116,117,118) when animals are in close proximity (through either visual110 or olfactory113,119 mechanisms), depending on their social status (siblings, familiars, or strangers) and the threat level determined by their degree of physical access to each other. Our laboratory surprisingly observed that the mere presence of a male (but not female) experimenter in the testing room could produce stress in mice and rats, producing olfactory-mediated SIA that robustly affected apparent “baseline” pain sensitivity119.

The impact of laboratory environmental factors

That environmental factors differing from one laboratory to another can affect behavioral experiments has been appreciated for quite some time. In a famous study by Crabbe and colleagues120, the performance of multiple inbred mouse strains tested in standard behavioral assays was found to be highly lab-dependent, even though the investigators at three institutions took great pains to synchronize and standardize animal shipment, husbandry, and test-related parameters. It should be noted that the differences between labs were related to overall levels of behavior in these assays; relative strain sensitivities were generally preserved across sites121. These investigators hypothesized that the lab variation was mostly due to “experimenter effects”122. In an analysis of acute pain sensitivity in over 8,000 mice we concluded that differences between experimenters accounted for more variance than any other factor, including genotype123. Environmental factors can interact with genotype, of course, producing yet more variability. In one experiment, investigators observed that a brief period of food shortage could abolish and even reverse a well-known mouse strain difference in response to amphetamine124. In an intriguing study of rats selectively bred for high and low pain (autotomy) behavior after hind limb denervation, the selected phenotype could be largely abolished simply by co-housing “high” rats with “low” counterparts, and vice-versa125. This same phenomenon was thereafter demonstrated in high- and low-autotomy mouse strains126. In one of our studies, a particularly robust strain difference in pain sensitivity127 was found eventually to exist only because of SIA (resulting simply from being placed in the testing room itself) in one strain but not the other; if mice were habituated to the room on several days before testing, no strain difference was observed36.
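
To make concrete what it means for one factor to "account for more variance" than another, the following minimal sketch computes eta-squared (the between-group sum of squares divided by the total sum of squares) for each factor in a toy data set. The function name, the latency values, and the experimenter and genotype labels are all hypothetical, invented purely for illustration; this is not claimed to be the method or the data of the analysis cited above.

```python
from collections import defaultdict


def eta_squared(values, labels):
    """Share of the total variance in `values` explained by grouping on `labels`.

    Computed as SS_between / SS_total for a single factor (one-way layout).
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)

    # Collect the observations belonging to each level of the factor.
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    # Between-group sum of squares, weighted by group size.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical withdrawal latencies (seconds); NOT real data.
latency = [10.0, 11.0, 14.0, 15.0, 10.5, 11.5, 14.5, 15.5]
experimenter = ["A", "A", "B", "B", "A", "A", "B", "B"]
genotype = ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"]

eta_experimenter = eta_squared(latency, experimenter)
eta_genotype = eta_squared(latency, genotype)
print(f"experimenter: {eta_experimenter:.3f}, genotype: {eta_genotype:.3f}")
```

In this invented example the experimenter factor explains far more of the variance than genotype does, which is the pattern described in the text; with real data the factors would rarely be balanced this neatly, and a model that handles several factors at once would be the more appropriate tool.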

These sorts of findings have led to a discussion as to whether reproducibility in animal experiments would be enhanced by standardization of husbandry and experimental parameters across laboratories128. Some have argued for such standardization in the name of sample size reduction, and improving comparability of results within and between laboratories129. Others have pointed out that many environmental factors simply cannot be practically standardized (for example, staff, room architecture, noise, tap water composition, locally sourced rodent chow), and that any such attempts would prioritize comparability over the potentially even more important aim of generalizability (external validity)130,131. In fact, some have argued that attempts over the last few decades to standardize between laboratories in the name of reproducibility have actually led to a “standardization fallacy”, producing results that are increasingly distinct between laboratories (and thus, perversely, less reproducible)132. A proposed solution to the problem is systematic variation of experimental conditions128,133, although I see a real danger here in presupposing which environmental factors actually matter.
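
One way to picture systematic variation of experimental conditions is as a balanced allocation: animals are spread evenly over every combination of a few deliberately varied environmental factors, and treatments are balanced within each combination. The sketch below is only an illustration under stated assumptions. The function name `heterogenized_design`, the two factors, and their levels are hypothetical, and choosing them already presupposes which factors matter, which is exactly the danger noted above.

```python
import itertools
import random


def heterogenized_design(animal_ids, factors, treatments, seed=0):
    """Assign animals to treatments, balanced within every factor combination.

    `factors` maps a factor name to its list of levels (hypothetical choices).
    Each combination of levels forms one block; every block receives the same
    number of animals per treatment, in random order.
    """
    rng = random.Random(seed)  # seeded so the allocation can be reproduced
    blocks = list(itertools.product(*factors.values()))
    cells = len(blocks) * len(treatments)
    if len(animal_ids) % cells != 0:
        raise ValueError(
            f"need a multiple of {cells} animals, got {len(animal_ids)}"
        )
    replicates = len(animal_ids) // cells

    animals = list(animal_ids)
    rng.shuffle(animals)  # random allocation of animals to blocks
    next_animal = iter(animals)

    design = []
    for block in blocks:
        arms = list(treatments) * replicates
        rng.shuffle(arms)  # random treatment order within the block
        for arm in arms:
            row = {"animal": next(next_animal), "treatment": arm}
            row.update(zip(factors.keys(), block))
            design.append(row)
    return design


# Hypothetical factors and levels, chosen only for illustration.
factors = {
    "test_time": ["morning", "afternoon"],
    "experimenter_sex": ["female", "male"],
}
design = heterogenized_design(range(16), factors, ["vehicle", "drug"])
for row in design[:4]:
    print(row)
```

Because every treatment appears equally often under every combination of conditions, the varied factors can later be entered into the analysis as covariates or interaction terms rather than being left as hidden sources of between-laboratory differences.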

Conclusions

The plethora of “latent” laboratory environmental factors that affect pain studies, and biomedical studies more generally, implies that if two laboratories fail to replicate each other's findings, it is not necessarily the case that one is “right” and one is “wrong”, or that such research is “irreproducible.” They might both have obtained accurate data given their unique environmental context. The relevant contextual factors may or may not be appropriately reported, and may or may not even be understood. Increased prioritization, especially by funding agencies, of the identification of such factors (like experimenter sex, which was unknown until 2015) is warranted. Only by turning unknown unknowns into known knowns, and then controlling for them, covarying them out of statistical analyses, and/or explicitly varying them to investigate interactions, will we improve both the reproducibility and generalizability of preclinical research.