The National Institutes of Health (NIH) issued a policy in 2015 to promote inclusion of both male and female subjects in preclinical biomedical research [1]. As guidance to address this policy, NIH has articulated “The Four Cs of Studying Sex to Strengthen Science” (https://orwh.od.nih.gov/sex-gender/nih-policy-sex-biological-variable): (1) Consider (design studies that take sex into account, or explain why it is not incorporated), (2) Collect (tabulate sex-based data), (3) Characterize (analyze sex-based data), and (4) Communicate (report and publish sex-based data). This article describes a general, multi-step approach to address The Four Cs when investigation of sex differences is not the primary aim, and sample sizes required to detect sex differences are not known. We acknowledge that there are many instances where sex may be a top priority; however, as written, the NIH policy and associated “Four Cs” guidance recognize that this is not invariably the case. The factors that may contribute to an individual researcher’s prioritization of sex differences among all possible independent variables are beyond the scope of this brief article.

The data to illustrate this approach are shown in Fig. 1 and Table 1 for a study involving pain-related behavior in mice. The expression and treatment of pain constitute one domain in which sex differences are apparent in both humans and laboratory animals [2], and preclinical research is playing a major role in analgesic drug development in response to the NIH “Helping to End Addiction Long-term” (HEAL) initiative. Accordingly, this is one area where preclinical research is active, the new NIH policy is pertinent, and inclusion of both males and females is warranted. For this illustrative study, which used methods similar to those described previously [3], adult male and female ICR mice (Envigo, Frederick, MD) received an intraperitoneal injection of dilute lactic acid (IP acid) as an acute visceral pain stimulus immediately before being placed into 4-inch diameter plexiglass cylinders for 20-min periods of videotaped observation. Two observers blinded to experimental treatment scored each video by counting the number of pain-related stretches (defined as contraction of the abdomen followed by extension of the hind limbs). Observer scores for each session were averaged for subsequent analysis. The cyclooxygenase inhibitor ketoprofen was also evaluated for its effectiveness in blocking IP acid effects in a separate group [3]. The following steps in experimental design and analysis were incorporated to address The Four Cs.

Fig. 1

Effects of IP lactic acid administered alone or after pretreatment with ketoprofen on pain-related stretching behavior in male and female mice. Abscissae: concentration of lactic acid (left panel, diluted in sterile water and administered IP in a volume of 10 ml/kg) or dose of ketoprofen (right panel, administered subcutaneously as a 30-min pretreatment to 0.32% IP acid). Veh = vehicle for IP acid (left panel) or ketoprofen before 0.32% IP acid (right panel). Ordinates: number of stretches observed during a 20-min observation period. Points show mean ± SEM for all mice (black circles, N = 12), just males (blue squares, N = 6), and just females (pink diamonds, N = 6). Filled symbols indicate significantly different from Veh as determined by one-way ANOVA and Dunnett’s post hoc test for pooled data, males, or females. $ Indicates a significant sex × dose interaction as determined by two-way ANOVA, but follow-up analysis with the Holm–Sidak post hoc test did not reveal a significant effect of sex at any acid dose. The results of ANOVA and power analysis for data in each panel are shown in Table 1.

Table 1 Summary of power analysis results from one-way and two-way ANOVA of data shown in Fig. 1

(1) IP acid alone was tested in one group of 12 mice, ketoprofen + IP acid was tested in a separate group of 12 mice, and each group contained equal numbers of male and female mice (N = 6/sex). The total group size, which was modestly larger than the N = 6–10 mice used in our previous study with only males [3], was selected based on the observation that confidence-interval widths for any normally distributed population decline with increasing sample size and approach an asymptote at approximately N = 12 [4]. We further chose to constitute this N = 12 group size with equal numbers of N = 6 mice per sex in part to comply with published guidance by the British Journal of Pharmacology for use of equal sample sizes of N ≥ 5 across groups [5]. However, it is important to note that NIH guidance does not specify minimum group sizes or allocations by sex, and our approach could be used regardless of the numbers of males and females included in a study.
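The diminishing return of added subjects on confidence-interval width can be sketched numerically. The snippet below is an illustration only and does not reproduce the analysis in ref. [4]: it assumes a normally distributed measure with a known standard deviation and uses the large-sample normal approximation, which understates interval widths at small N. The function name `ci_half_width` and the sample sizes shown are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(n, sd=1.0, confidence=0.95):
    """Approximate half-width of a confidence interval for a mean.

    Assumption: normal approximation (z-based), so small-N widths are
    understated relative to a t-based interval.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * sd / sqrt(n)


# Half-width in SD units for a range of illustrative group sizes.
for n in (3, 6, 12, 24, 48):
    print(f"N = {n:2d}: half-width = {ci_half_width(n):.3f} SD")
```

Under this approximation, each doubling of N shrinks the interval by the same proportion (a factor of 1/√2), so the absolute gain in precision per added animal becomes progressively smaller as N grows.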

(2) Sex was not the primary independent variable of interest in this study, so the primary analysis pooled data from both sexes and used repeated-measures one-way ANOVA (Prism 8.0, GraphPad, La Jolla, CA) to evaluate effects of IP acid or ketoprofen dose. A significant ANOVA was followed by Dunnett’s post hoc test to compare acid or ketoprofen effects to vehicle effects. The criterion for significance for this and all other statistical tests was p < 0.05. IP acid produced a dose-dependent stimulation of stretching, and ketoprofen dose-dependently blocked IP acid effects.

(3) Secondary analyses to address sex as a biological variable proceeded in two steps. First, data in each panel of Fig. 1 were segregated by sex and analyzed by repeated-measures one-way ANOVA. IP acid stimulated stretching in both sexes; however, ketoprofen failed to significantly decrease IP acid effects in either sex. Second, data from males and females in each panel were also analyzed by two-way ANOVA (Prism 8.0), with sex as a between-subjects factor and IP acid or ketoprofen dose as a within-subjects factor. A significant interaction was followed by a Holm–Sidak post hoc test. For IP acid alone, there was a significant main effect of acid dose but not of sex. The acid dose × sex interaction was significant, but post hoc analysis did not reveal a sex difference at any acid dose. For ketoprofen + IP acid, there was a significant main effect of ketoprofen dose but not of sex, and the ketoprofen dose × sex interaction was not significant.

(4) Lastly, all one-way and two-way ANOVA results were submitted to power analysis to calculate three values: (a) Cohen’s f effect size, (b) achieved power (1−β), and (c) the total number of animals predicted as necessary to detect a significant effect given the empirically determined effect size and criterion levels of α = 0.05 and power (1−β) = 0.8 (G*Power [6], free and publicly available: http://www.gpower.hhu.de).
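As a rough illustration of how an effect size can be derived from ANOVA output, the sketch below converts an F statistic and its degrees of freedom into partial eta squared and then into Cohen’s f, using the standard relationship f = √(η²/(1−η²)). This is an assumption about one common route to Cohen’s f and is not necessarily the exact procedure or option set used in G*Power for repeated-measures designs. The function names and the F value and degrees of freedom in the example are hypothetical and are not taken from Table 1.

```python
from math import sqrt


def partial_eta_squared(f_stat, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    if f_stat < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be >= 0 and both dfs must be > 0")
    ss_ratio = f_stat * df_effect  # proportional to SS_effect / MS_error
    return ss_ratio / (ss_ratio + df_error)


def cohens_f(f_stat, df_effect, df_error):
    """Cohen's f computed as sqrt(eta^2 / (1 - eta^2))."""
    eta_sq = partial_eta_squared(f_stat, df_effect, df_error)
    return sqrt(eta_sq / (1.0 - eta_sq))


# Hypothetical example values (NOT from the study): F(3, 30) = 4.0
print(round(partial_eta_squared(4.0, 3, 30), 4))
print(round(cohens_f(4.0, 3, 30), 4))
```

Expressing each result as Cohen’s f places one-way and two-way ANOVA outcomes on a common scale, which is what permits the comparisons of effect magnitude described in the following paragraph.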

Power analysis complements the ANOVA results in three ways. First, “effect size” provides a basis for comparing the magnitude of sex differences or other effects across studies [7]. In this data set, for example, all of the effect sizes for IP acid alone were greater than the effect sizes for ketoprofen antinociception.

Second, “power” provides a basis for confidence in drawing conclusions based on the ANOVA results. In particular, just as “α” values specify the probability of a Type 1 (false-positive) error, so the “β” values specify the probability of a Type 2 (false-negative) error. Moreover, just as convention accepts α ≤ 0.05 as an acceptable criterion for Type 1 errors in concluding that an effect is PRESENT, so convention also generally accepts β ≤ 0.2 as an acceptable criterion for Type 2 errors in concluding that an effect is ABSENT [8]. Because “power” is defined as 1−β, this criterion is equivalent to power ≥ 0.8. Given this criterion, it is appropriate to conclude that an effect of sex (or any other variable) is absent only if the experiment is sufficiently powered to reach that conclusion at power ≥ 0.8. In this data set, most sex-based analyses failed to reach the criterion for statistical significance, and even with the significant dose × sex interaction for effects of IP acid alone, the post hoc analysis failed to reveal a significant sex difference at any acid dose. However, power analysis indicated that it would be inappropriate to conclude from these ANOVA results that a sex difference was absent for either IP acid or ketoprofen, because none of the analyses involving sex achieved power ≥ 0.8. The same principle applies to interpretation of ketoprofen effects in males or females alone. Although ketoprofen failed to produce a significant decrease in IP acid-stimulated stretching in either males or females, power was well under 0.8 in both sexes. Consequently, it would be inappropriate to conclude from these ANOVA results that ketoprofen had no effect. Notably, power < 0.8 is not problematic in the event that a significant effect is deemed to be present, because in these cases, the concern is with potential false-positive conclusions (addressed by α) and not with potential false-negative conclusions (addressed by β). For example, the pooled analysis of ketoprofen effects had power < 0.8 (0.752); however, the ketoprofen data met the criterion for a significant effect, so there was no need to address the potential of a false-negative conclusion.
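The interpretive rule described above can be summarized as a small decision function. This is a minimal sketch of the logic only, not part of the published analysis: the function name `interpret` and the p values in the example are hypothetical, whereas the power of 0.752 for the pooled ketoprofen analysis and the criteria of α = 0.05 and power ≥ 0.8 come from the text.

```python
def interpret(p_value, power, alpha=0.05, min_power=0.8):
    """Classify an ANOVA result using the alpha and power criteria.

    - Significant result: conclude the effect is present; power is not
      needed to guard against a false negative.
    - Non-significant result with adequate power: conclude absent.
    - Non-significant result with low power: no conclusion is warranted.
    """
    if p_value < alpha:
        return "effect present"
    if power >= min_power:
        return "effect absent"
    return "inconclusive"


# Pooled ketoprofen analysis: significant, power = 0.752 (from the text);
# the p value used here is a hypothetical placeholder below 0.05.
print(interpret(p_value=0.04, power=0.752))

# A non-significant sex effect with low power (hypothetical values).
print(interpret(p_value=0.30, power=0.20))
```

The asymmetry is deliberate: power enters the decision only when the result is not significant, which mirrors the distinction between false-positive and false-negative conclusions drawn in the preceding paragraph.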

Lastly, the “predicted N” can inform experimental design for future studies that might pursue evaluation of sex differences. Thus, the effect size observed for a treatment in an initial sample of subjects from some population can be used to predict the sample size required to achieve target statistical criteria (e.g., α ≤ 0.05, β ≤ 0.2) for confidence in reaching positive or negative conclusions regarding that treatment effect in other subjects from that population. Importantly, this prediction is founded on the assumption that the effect size observed in the initial sample is representative of the whole population, but this of course is an empirical question (see ref. [8] for commentary). In this study, for example, power analysis predicts that total sample sizes of 56 (28/sex) and 16 (8/sex) would be required to adequately characterize the presence or absence of a main effect of sex and a dose × sex interaction, respectively, for two-way ANOVA of effects produced by IP acid alone. Similarly, sample sizes would need to be increased to 15 (males) or 10 (females) to adequately characterize effects of ketoprofen by one-way ANOVA in males or females alone.

In summary, this approach provides a strategy to address The Four Cs in NIH-funded preclinical research. Inclusion of both males and females is responsive to the “Consider” mandate for the experimental design that takes sex into account. The pooling of data across sex permits focus on the primary variable(s) of interest, while segregation of data by sex addresses the “Collect” mandate for tabulating sex-based data. The secondary analyses address the “Characterize” mandate by using ANOVAs to analyze sex-based differences with existing sample sizes and power analyses to guide both interpretation of the ANOVA results and design of any future studies that might focus on sex differences. Finally, the results of ANOVAs (F statistics) and power analyses (effect size, power, and predicted Ns) provide a useful array of statistical outcome measures that fulfill the “Communicate” mandate for reporting and publishing sex-based data.

Ultimately, three categories of sex differences have been described: Type 1 (“sexual dimorphism”; qualitatively different phenotypes between sexes), Type 2 (“sex differences”; quantitatively different phenotypes), and Type 3 (“sex convergence and divergence”; similar phenotypes with different biological mechanisms) [9]. The analysis proposed here will be more sensitive to Type 1 than Type 2 differences, but in either case, it can provide preliminary information on the existence of phenotypic sex differences. By contrast, this approach will not detect Type 3 differences. As a result, even if adequately powered statistical analysis indicates that a phenotypic sex difference is absent, it remains possible that the convergent phenotypes could have sexually divergent underlying mechanisms. With these caveats in mind, the approach proposed here can serve as a strategy for preliminary analysis of sex effects in studies that include both sexes, but do not aim to focus on sex as a primary variable of interest. In those instances where sex is not the primary variable of interest, we suggest that priority of analyses should be assigned to those primary variables that have been adequately powered based on initial study design. Additional analyses of secondary variables such as sex are important and can be addressed in appropriately designed follow-up studies.

Funding and disclosure

This work was supported by grants T32DA007027 (CMD), R01DA037287 (MLB), R01NR014886 (GNN), and R01DA030404 (SSN). The authors declare no competing interests.