A systematic review of studies utilizing hair glucocorticoids as a measure of stress suggests the marker is more appropriate for quantifying short-term stressors

Quantitating glucocorticoids (GCs) in hairs is a popular method for assessing chronic stress in studies of humans and animals alike. The cause-and-effect relationship between stress and elevated GC levels in hairs, sampled weeks later, is however hard to prove. This systematic review evaluated the evidence supporting hair glucocorticoids (hGCs) as a biomarker of stress. Only a relatively small number of controlled studies employing hGC analyses have been published, and the quality of the evidence is compromised by unchecked sources of bias. Subjects exposed to stress mostly demonstrate elevated levels of hGCs, and these concentrations correlate significantly with GC concentrations in serum, saliva and feces. This supports hGCs as a biomarker of stress, but the dataset provided no evidence that hGCs are a marker of stress outside of the immediate past. Only in cases where the stressor persisted at the time of hair sampling could a clear link between stress and hGCs be established.


Material and Methods
The methods listed below were pre-specified in a study protocol accessible online since Jan 13, 2016 (Supplemental materials C).
A broad, inclusive search strategy was employed in an attempt to find all relevant publications that could provide unambiguous evidence of hGCs being related to the HPA axis-activating stress response of an individual. Studies where hGC levels were used as a measure of stress or where hGC levels were correlated to GC levels in other biological matrices (blood, saliva, urine and feces) were retrieved through multiple databases (MEDLINE, Web of Science, EMBASE, Zoological Record, and PsycINFO). The publication searches were conducted in January 2016. Whereas we have contextualized our findings with more recent examples, no studies published past this date were included in the analyses. Quality assessments were made according to an adapted (nine-item) checklist and basic study information was extracted along with hGC results. Synthesizing data from multiple sources, summary estimates were created separately for correlations with different biological matrices. Similarly, experimental designs deemed fundamentally incompatible were separated out and individual summary estimates were created. Specifically, studies were classified as studying "induced (acute) stress", "chronic stress", "observed stress" (where the stressor was inferred by an observer), "self-assessed stress", "past stress" (where a subject was exposed to a stressful period, which had subsequently ended prior to hair collection) and post-traumatic stress disorder ("PTSD"; which we chose to include because its link with hGCs was receiving great attention at the time this review was scoped). Random effects models were used throughout and differences in experimental subjects and controls were expressed as standardized mean differences. For detailed descriptions of the methods, refer to Supplemental materials A.
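To illustrate how standardized mean differences can be pooled under a random effects model, the following is a minimal sketch. It assumes Hedges' g as the standardized mean difference and the DerSimonian-Laird estimator for the between-study variance; the text above does not specify which estimators were used (see Supplemental materials A), so both choices, as well as all function names, are assumptions made for illustration only.

```python
import math


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference (Hedges' g) and its sampling variance.

    Group 1 is the stress group, group 2 the control group.
    """
    df = n1 + n2 - 2
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / sd_pooled
    j = 1 - 3 / (4 * df - 1)  # small-sample bias correction
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    return j * d, j**2 * var_d


def random_effects(effects, variances):
    """Pooled estimate, standard error, tau^2 and I^2 (%) using the
    DerSimonian-Laird estimator (an assumed choice, not the source's)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance to each study
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, se, tau2, i2
```

In this sketch, I² expresses the share of the total variation that is attributable to between-study differences rather than sampling error, which is how the heterogeneity figures reported in the Results can be read.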

Results
A total of 3,518 unique entries were found using the search strategy, of which 468 entries were retained for full text analysis (Fig. 1). A majority of these studies were subsequently excluded due to not meeting the pre-stated inclusion criteria: 28% were excluded due to their exploratory study design, often characterized by the lack of a control group and a clear a priori hypothesis; 16% presented no data from a controlled study (these were mostly method papers, reviews, opinion papers, and other narrative journal entries); 26% were not peer-reviewed publications (these were mostly meeting abstracts and theses). Other incompatible study designs, and entries where the full text could not be obtained, made up 17% of the entries retained for full text screenings.
For the entries retained for full text screening (where all texts were verified to concern the use of hGCs), an exponential growth in method adoption is obvious: 2015 saw more publications on hGCs than had been published between 2003 and 2011 in total. Presently, a new publication (counting also non-peer reviewed entries) on hGCs is available online every three days (or less).

www.nature.com/scientificreports

Study quality of experimental studies.
Of the 59 peer-reviewed publications included in the present systematic review, 38 papers reported on 42 studies with a stress group/control group design that could be assessed for study quality.
A salient trend was found when assessing the risk of bias: A majority of the 38 papers did not account for the possibility that a stressor other than the one that was purportedly studied could have influenced the results. This is evident in Fig. 2 focusing on checklist items 2, 3 and 8: The influence of concurrent interventions or unintended exposures could only be ruled out in 9 (24%) of the studies (item 3), the influence of confounding factors could only be ruled out in 16 (42%) of the studies (item 2), and only 12 (32%) of the studies featured a study design that ensured that the subjects were equally exposed to any confounding factors (item 8). In only three studies (8%) could all three sources of bias be ruled out entirely. Similar ambient conditions for stress and control groups could also only be guaranteed in 15 (39%) of the studies (item 5). Remarkably, only 3 (8%) of the studies reported on blinding of the outcome assessors (item 6), even though this is an explicit recommendation of most present-day best-practice frameworks (e.g. the ARRIVE guidelines 39 ). In no one study were all of the sources of bias addressed, and in a few none were (for a by-entry summary of the risk-of-bias analyses, refer to Supplemental materials B, appendix 1).
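As a quick arithmetic check of the risk-of-bias tallies above, the sketch below recomputes the percentages from the reported counts. It assumes the denominator is the 38 papers (rather than the 42 studies), since that is the value that reproduces the reported figures; the dictionary labels are shorthand introduced here and are not the checklist's own wording.

```python
# Counts of entries in which each source of bias could be ruled out,
# as reported in the text. Denominator assumed to be the 38 papers.
n_papers = 38
low_risk_counts = {
    "item 2 (confounding factors)": 16,
    "item 3 (concurrent interventions/unintended exposures)": 9,
    "item 5 (similar ambient conditions)": 15,
    "item 6 (blinded outcome assessors)": 3,
    "item 8 (equal exposure to confounders)": 12,
}

percentages = {
    item: round(100 * count / n_papers)
    for item, count in low_risk_counts.items()
}

for item, pct in percentages.items():
    print(f"{item}: {low_risk_counts[item]}/{n_papers} = {pct}%")
```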

Study characteristics and data extraction.
The studies retained for analysis presented a diverse set, with no two study designs quite alike (Tables 1 and 2). Of the studies retained for analysis, roughly half (48%) were human studies. Both sexes have been studied in roughly equal numbers (52% female subjects across all studies), but only rarely were equal sex ratios employed in any one study; study objectives and opportunistic sampling of e.g. wildlife populations tended to bias the sex ratio in favor of one or the other. We made initial attempts at exploring sex differences, similar to a previous meta-analysis 38; however, the data were insufficient to draw any conclusions. Similarly, when extracting data we had harbored hopes of being able to compare the effects of differing sampling and analysis protocols that have been discussed previously 40. However, the laboratory methods employed were fairly similar and study designs fairly dissimilar, the combination lending itself poorly to stringent analyses. Human studies were consistent in sampling the posterior vertex of the head, whereas the non-human studies appeared to sample regions by convenience or simply at random (e.g. studies in dogs have sampled backs, shoulders, chests, and legs, depending on research group and study). Although often discussed as a potential issue 41,42, no study reported including hair follicles in their hair samples and all but six papers 5,34,43-46 explicitly described methods designed to ensure that samples were free of follicles. It has been theorized that the color of a hair can influence the GC content 47; however, the results are inconsistent 10 and hair color is rarely reported. We consequently did not attempt to extract information on this. Only human and other primate studies employed the "stress calendar" idea, sub-sectioning hairs to infer circulating GC levels at multiple time points in the past from the same sample.
Of all the studies retained for analysis, a clear majority (48 studies, 83%) employed a washing step, intended to remove contaminants from the outside of the hairs, and all but two studies minced/pulverized the hairs prior to analysis. For quantification of GCs, antibody-based methods were most frequently employed (48 studies, 83%); however, numerous different protocols/antibodies have been utilized.
When extracting data, two studies (Manenschijn et al. 48 and Luo et al. 4) were singled out as having a reported precision more than tenfold higher than the other 38 studies (including studies utilizing the very same methodology in comparable subjects). We believe that this is simply due to incorrect reporting of the measure of dispersion. Unable to reach the authors for a comment, despite multiple attempts, we have tentatively included the data from these studies, assuming that the graphically presented measures of dispersion were in fact SEMs rather than, as listed, 95% CIs. Three studies (two experimental studies of chronic stressors 33,49, and one study reporting a non-significant correlation between hGC and GC in saliva 50) were excluded from further analysis as critical information could not be obtained from the corresponding authors (none of the studies listed the number of samples/subjects used in their analyses). We do not expect these exclusions to have significantly altered our summary estimates, however, as these studies were of moderate size and all fell into well-populated subgroups.

Correlations with GC in other matrices.
Meta-analyses of correlation coefficients revealed a great deal of heterogeneity between studies, as could be expected from the diverse set of studies analyzed (Fig. 3). Significant synthesized meta-correlations could be found between hGCs and GCs in blood, saliva and feces. A significant correlation could not be found between hGCs and GCs in urine. However, this analysis featured only five studies (collecting 169 subjects), all with fairly high intra-study variance of data. Leave-one-out analysis furthermore revealed that the statistically significant correlation found between GCs in blood and hGCs could not be substantiated if data from the study by Yu et al. 51 were removed. Moreover, removing the data from the study by Accorsi et al. 52 would more than halve the synthesized correlation coefficient between GCs in feces and hGC (putting it in range with the other correlations at r = 0.22), suggesting that the strength of the correlation may be somewhat overestimated. Additionally, removing a single study for each correlation summary would reduce heterogeneity markedly, bringing down the largest I² value to 38%, suggesting a small number of studies were responsible for a majority of the heterogeneity (the leave-one-out analyses are presented in their entirety in Supplemental Materials B, Appendix 2). A factor adding to the heterogeneity may also be that some studies averaged multiple samples over time, e.g. the study by D'Anna-Hernandez et al. 53 where hair and saliva samples were averaged across four sampling points throughout human pregnancy, or Sauvé et al. 54 comparing single hair samples to urine samples averaged over a 24-hour period. With a relatively small dataset, we did not see fit to analyze these as separate subgroups, as this would have increased our "researcher degrees of freedom" 55,56 and potentially flagged spurious correlations (this would, further, only have been possible for the correlation with GC in saliva). Moreover, the studies that correlated both point estimates and averages 31,57-59 did not demonstrate a consistent difference between the two approaches.
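The leave-one-out procedure referred to above can be sketched as follows. This is an illustration under stated assumptions: correlations are pooled on Fisher's z scale with a DerSimonian-Laird random effects model, which is a common approach but is not confirmed by the text as the one used. The function names and the example data are hypothetical and do not correspond to any of the included studies.

```python
import math


def pooled_r(rs, ns):
    """Random-effects pooled correlation via Fisher's z transform
    (DerSimonian-Laird between-study variance; an assumed choice)."""
    zs = [math.atanh(r) for r in rs]
    vs = [1 / (n - 3) for n in ns]  # sampling variance of Fisher's z
    w = [1 / v for v in vs]
    fixed = sum(wi * z for wi, z in zip(w, zs)) / sum(w)
    q = sum(wi * (z - fixed) ** 2 for wi, z in zip(w, zs))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(rs) - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in vs]
    z_pooled = sum(wi * z for wi, z in zip(w_star, zs)) / sum(w_star)
    return math.tanh(z_pooled)  # back-transform to the r scale


def leave_one_out(rs, ns):
    """Pooled correlation recomputed with each study omitted in turn."""
    return [
        pooled_r(rs[:i] + rs[i + 1:], ns[:i] + ns[i + 1:])
        for i in range(len(rs))
    ]


# Hypothetical data: three modest correlations and one strong outlier.
rs = [0.20, 0.25, 0.15, 0.80]
ns = [40, 55, 30, 25]
full = pooled_r(rs, ns)
loo = leave_one_out(rs, ns)
for i, r in enumerate(loo):
    print(f"without study {i + 1}: r = {r:.2f} (all studies: r = {full:.2f})")
```

A summary estimate that drops sharply when one particular study is omitted, as in this made-up example, signals that the pooled correlation depends heavily on that single study.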
hGCs as a measure of stress.
Summarizing the evidence from the experimental studies using random effects models produced varied results (Fig. 4), reaffirming our decision to carry out subgroup analyses. Induced (acute) stress models produced a clear elevation in GC concentrations measured in hairs (Fig. 4A) with low inter-study heterogeneity (I² could not be estimated). Chronic stressors also produced a significant elevation in deposited GC compared to control groups (Fig. 4B). The results from the chronic stress studies were however highly heterogeneous with a majority of the variance of the summary estimate stemming from between-study variation, as opposed to within-study variation (I² = 80%), suggesting that not all of the studies were comparable with respect to the stress response and its effect on hGC levels. Simply put, it is unlikely that these studies all describe a similar HPA axis activation in response to the studied stressor; it is, for instance, likely that some scenarios simply did not induce a stress response. Observed stress (Fig. 4C) and self-assessed stress (Fig. 4D) produced unclear results. Finally, stressors that had subsided at the time of sampling ("past stress") did not produce a measurable elevation in hGCs (Fig. 4E). Studies concerning hGCs measured in PTSD sufferers similarly generated unclear results (Fig. 4F), with a combination of studies showing both elevations and decreases in hGC output relative to a control group.

[...] week that concern or utilize hGC analysis, it is fair to say that it has become a widespread method for assessing stress. But hair growth is a slow process, and popular speculation 60,61 suggests that GCs are sequestered by hairs over several weeks, if not months. Consequently, controlled studies are for logistic reasons hard to design and execute.
Perhaps this is why our search strategy turned up more narrative reviews, opinion papers, and book chapters lauding the method than it did actual controlled studies providing empirical evidence that the method is a sound one. Moreover, the typical (i.e. the most numerous) study employing hGC analyses, published prior to February 2016, was an exploratory one. Characteristically, a single cohort of subjects had hair samples collected along with a number of other environmental, physiological, psychological, and/or demographic data. Correlations were then constructed to scrutinize which parameters were linked to elevated hGC concentrations. The topics of these studies are varied, from investigations of environmental effects on squirrel gliders 62 or social effects on German shepherds 63 to probing cultural 64 , environmental 65 , nutritional 32 or genetic 66 influences on psychological stress in people of differing ages. The implicit prior assumption for studies of this kind is that hGCs are linked to central HPA axis functioning and are thus a measure of (chronic) stress. This puts even more of an onus on the (relatively small number of) controlled studies to validate and affirm the use of hGC concentrations as a measure of stress.
Is there support for hGCs as a biomarker of stress?
The present investigation supports the use of hGCs as a measure of central HPA axis functioning and, consequently, as a stress-sensitive biomarker. The compounded data, however, call into question the temporality of the marker, suggesting it is a better marker of ongoing than of past stress.
In studies where subjects were exposed to a controlled stressor, a predictable elevation was found in most cases. Whether repeated ACTH challenges 67 or a more elaborate protocol combining multiple stressors were employed 68 , a consistent increase was found across species when comparing challenged subjects to unstressed controls. The effect of acute stress on hGC levels seems, furthermore, to be rapid. Whereas most studies sampled hairs at least two weeks after having applied the stressor, the study by Cattet et al. 14 is remarkable in that they report elevated hGCs within hours of stressor onset. In a similar vein, most stress protocols were applied continuously for weeks before hGC concentrations were evaluated, but González-de-la-Vara et al. 67 found that they could detect an elevation in hGCs two weeks after a pair of sustained-release ACTH injections. Both studies point to hGC concentrations being reflective, primarily, of events in the recent past, as opposed to historical stressors. This is also consistent with hGCs correlating with GCs in other matrices.
Although both inter- and intra-study variances were high for the collated data, it is clear that hGC concentrations correlate significantly with GC concentrations in other matrices. The synthesized correlation coefficients are weak to moderate, ranging from 0.13 to 0.56, but this is in range with the correlations between established matrices obtained in these very same studies 54,59,69,70. Due to large fluctuations stemming from the pulsatile nature of GC release 71, coupled with the different temporality of the matrices (serum and saliva concentrations of GCs change in a matter of minutes in response to a stressor, urinary and fecal GCs change over a period of hours 72), these correlations will inevitably be moderate at the most. The correlation between hGCs and GCs in feces is the strongest of the four, which is to be expected as fecal samples integrate circulating GC concentrations over a period of several hours. Hairs are similarly suggested to sequester GCs from circulation over a longer time window. Contrary to popular claims, it is unlikely that this time window is several weeks long, however, as hGC concentrations also correlate significantly with serum and salivary concentrations of GCs.

The effect of confounders on measured hGC levels in chronic stress studies.
When compiling studies of chronic stress, a link between individuals experiencing stress and elevated levels of hGCs was found, albeit a weaker link than for acute stressors. The greater level of heterogeneity of this dataset is probably in part because some of the studies were carried out under highly uncontrolled circumstances. With long-term studies featuring subjects, whether human or non-human, in an uncontrolled environment, it is hard to ensure that the studied stressor is the sole and most influential source of stress. It may be that a lack of dietary salmon elicits a physiological stress reaction in grizzly bears, as suggested by Bryan et al. 43, but it is quite impossible to tell what other factors might influence the life and allostasis of these bears. The confounding factors of this study may well have overshadowed the effect the authors were looking for. Similarly, military training is not all long marches and adrenaline-fueled combat training. With no outside verification, the soldiers undergoing basic training studied by Boesch and collaborators 73 may not have had a more active HPA axis than e.g. an office worker with an active lifestyle in the period of sampling. This is not to criticize these experiments; rather, this is to highlight the fact that a number of studies into chronic stressors have an exploratory element to them, as the magnitude of the chronic stressor is hard to judge in relation to a host of ambient stressors. Our risk-of-bias assessment singled out unrelated confounding factors as the most common unchecked source of bias. Only 24% of studies could account for external confounding factors in the studied period, and only in 32% of studies could they be assumed to have been distributed equally between the studied subject groups. The evidence supplied by the chronic stress studies should thus be interpreted carefully.
Differences between the human concept of stress and HPA axis activity.
In our investigation, studies where periods of stress were inferred presented highly heterogeneous data. It has been shown before that when human subjects are asked to introspectively assess their own level of stress, assessments correlate poorly with their actual HPA axis functioning 38,74,75. In the present investigation we see a similar trend for studies relying on self-reported stress. Whereas we will note that the present investigation contains only a handful of studies, no consistent trend or even weak effect can be inferred. This is not to say that the subjects were not experiencing psychological stress (the studies collect data from distressed subjects ranging from survivors of natural disasters 76 to patients sourced from mental health services 31,77), but it serves as a reminder that the human concept of stress is not synonymous with the prototypical fight-or-flight response. Different states of stress will involve the HPA axis differently. This is further exemplified by the studies of PTSD, where in two studies 31,78 a notable reduction in hGCs is found for PTSD subjects when compared to healthy controls. We specifically analyzed PTSD studies separately as it has been suggested that PTSD is accompanied by a lowering in circulating GC levels, as opposed to an elevation. Notably, PTSD subjects are identified through clinical scores, suggesting the subjects were assigned to groups according to arbitrary cutoffs in a continuum of chronic stress diagnoses. This muddying of the waters, where the line between chronic stress conditions and PTSD is blurred, may, in part, explain why no clear trend is found concerning hGC profiles for either. In the future, a larger dataset that would allow for a more stringent subgrouping of chronic stress studies based on e.g. clinical scores may assist in identifying more uniform profiles.
Studies where the stressed subjects were identified through observation similarly did not paint a consistent picture. Whether the result of studying animal behavior 5,79 or of putting human patients through structured interviews 4,80, there seems to be a mismatch between subjectively assessed stress and hGC levels. For this category of studies we will note that it is particularly concerning that no blinding was employed, even though the findings hinge completely on the subjective assessment of an external observer. Would the marked difference between groups have been as profound in the study by Carlitz et al. 5 (the average effect size was greater than that found in any induced stress study) if periods of stress had been recorded by a blinded observer? The study was made even more subjective and bias-prone by not ever defining the studied stressors, leaving it in the hands of unblinded observers to determine what to consider a stressor. With few studies and heterogeneous results, it is currently hard to determine whether studies that rely on externally assessing stress can provide empirical evidence with respect to the utility of hGC analyses.

Temporal aspects of hGCs as a biomarker of stress.
An important factor shared between the chronic stress studies that demonstrate a clear difference between stressed and control subjects is that the stressor persisted at the time of sampling. When singling out the studies where the stressor could be positively ensured to have subsided at the time of hair sampling, the pattern was found to be different. In the study by Kapoor et al. 81, pregnant rhesus monkeys were exposed to a daily acoustic startle stress protocol for five weeks. Serum samples analyzed for circulating GC levels were used to verify that the protocol elicited a significant stress response throughout the period. When analyzing hGC concentrations 3-13 weeks later (depending on subject), no elevation could be found relative to a control group; not a trace to be found of a considerable elevation of circulating GC levels persisting for five weeks. Similarly, when Ashley et al. 34 analyzed hairs from both reindeer and caribou two weeks after a single (non-sustained-release) ACTH challenge, no elevation could be found. Fecal GC analyses confirmed that the stressor had subsided after 24-48 hours. With only two studies in this category, we should be careful not to over-interpret; however, this is all part of a recurring pattern. In a recent meta-analysis, Stalder et al. 38 reanalyzed historical data from human studies in aggregate (collecting data from 66 studies and more than 10,000 hGC samples) and found that in cases of past/absent stress, no relation with elevated hGC concentrations could be found. The idea of hairs containing a historical record of past stress is, and remains, completely unproven, empirical evidence instead pointing to hGCs being a measure of concurrent stress.

(2019) 9:11997 | https://doi.org/10.1038/s41598-019-48517-2

Evidence that hGC concentrations are a historical record of stress could also come in the form of studies sub-sectioning hairs, inferring circulating GC levels at multiple time-points in the past. However, the GC levels of hairs were found to be similar across all the sampled segments: individuals with elevated levels of hGCs had higher levels of hGCs in all segments when compared to controls 77,82,83. In only two studies did the authors attempt to construct a narrative based on point-to-point fluctuations in GC concentrations along the hair shafts. The findings by Luo and collaborators 4 are however marred by a strong wash-out effect, with hGC levels successively becoming lower the further away from the scalp a segment is sourced. The most distant segment is purported to contain the lowest levels of hGCs as this segment is hypothesized to correspond to a period before a major trauma. However, this also holds true for the non-traumatized controls, undermining the hypothesis. The study by Carlitz et al. 5 is similarly problematic in that the narrative seems to have been constructed post hoc, and only three individual profiles are shown in the paper. To our knowledge, there is little evidence to suggest that interactions between a steroid hormone and a strand of hair are strong enough to lock the molecules permanently in place. This effect is the basis by which a specific section of hair can be related to a time-point in the past. Convincing evidence that baleen from whales can trap hormones, leaving a historical record of hormonal fluctuations, has been presented 84,85. A similar case for hair, a distantly related keratinous matrix, remains elusive, however. A recent investigation using radiolabeled GCs in monkeys has instead presented fairly conclusive evidence that GCs do not form discrete bands in strands of hair, but that GCs move along the shaft of hair post-deposition 86.
The difference may lie in the gauge and density of the matrices, with baleen samples being extracted from a depth of several centimeters, using power tools, as opposed to processing the entirety of a micrometer-thick hair.
Regardless, the evidence provided by sub-sectioning of hairs, taken altogether, rather seems to suggest that hGCs are distributed along a strand of hair by longitudinal transport of the hydrophobic hormones, whether through diffusion or capillary action (possibly helped along by the waxy sebum). Reading too much into point-to-point fluctuations thus currently appears to be a case of seeing patterns where there are none to be found.

Conclusions
Combining results of controlled studies with the correlational evidence, it seems fair to state that hGC levels relate to central HPA axis functioning. GC levels in hairs appear to be an appropriate marker of ongoing physiological stress. If the stressor persists, hGC analyses will remain useful; however, it is currently inadvisable to interpret events in the past based on hGC levels. The idea of GCs being locked into place, providing a historical record of HPA axis functioning, has been called into question every time it has been tested in a controlled experiment. Based on the collected evidence we would strongly advise against sub-segmenting hairs and speculating about specific periods in the past. We would be delighted to be proven wrong by a future study, but there is something to be said about, not only the studies our search strategy uncovered, but also the ones that could not be found. Whereas it is hard to design a study where subjects' stress levels are controlled for weeks on end, it is far from impossible to design a study to test the hypothesis that a stressor in the past can be uncovered in a specific segment of hair. Yet, these studies are nowhere to be found. Whereas the data did not allow for a stringent exploration of publication bias, it seems highly probable that a number of studies providing negative results have remained unpublished. With this review, and others like it, it is our hope that these negative findings may find their way into publications, providing a better picture of when hGC analyses are appropriate, and when they are not.
We strongly recommend that current and future research into hGC analyses focus on some of the fundamental questions. How are GCs incorporated into hairs? How long do they remain post-deposition? There are a number of basic questions that can be answered by small means, utilizing clever study design. With two notable exceptions 9,86, studies utilizing radioisotope-labelled GCs are virtually completely missing; yet, the information that could be gained from studies of this type is invaluable. Applied studies utilizing hGC analyses, in the meantime, would do well to approach theoretical concepts surrounding hGCs in an agnostic fashion. We do not currently know how far into the past we can measure stress through sampling hairs. Stating that a certain protocol measures e.g. three months of preceding stress is misleading and perpetuates misinformation. Moreover, it is important that the duration of stressors be recorded and reported as accurately as possible. If we are able to pin down the temporality of hGCs as a stress marker, findings of studies in the past may have to be re-interpreted. This will however only be possible if there is enough information on timing of stressors relative to hair samplings. By detailing, as best as they can, the timing of stressors, researchers will, in a sense, future-proof their study results. Moreover, we hope that researchers will be more wary of unaccounted-for sources of stress in their studies. In many cases, these cannot be avoided. Consequently, we would encourage the reporting of possible "contaminating factors": sources of stress that could not be accounted for in the study design. We would also urge authors to publish their raw data, or, at the very least, keep a record of them. In our investigation we were disappointed to find peer-reviewed publications missing crucial basic information (e.g. the number of subjects analyzed in a study) and to learn, when contacting the authors, that the information could not be produced.
With most journals being able to host supporting data files online, and a host of repositories available for when journals cannot, there is no reason for not making data available and risking losing important records. Finally, we will note that many of the analyzed papers utilizing hGC analyses have made useful contributions to science; whether giving a voice to overlooked wildlife, trying to improve animal welfare, or assessing the mental wellbeing of people. We may seem critical of some of these studies, however, this comes from an adamant belief that we can and should do even better. We must constantly hold ourselves to a higher standard, in order to improve our field of research.