Introduction

Behavioral pharmacology has a long and rich history in addiction science, from examining drug effects under controlled laboratory conditions to testing risk mechanisms for alcohol and other substance use disorders [1, 2]. More recently, behavioral pharmacology approaches have been proposed as tools for medication development for addiction [3,4,5,6], with the most commonly used paradigm consisting of controlled alcohol/drug administration (i.e., alcohol or drug “challenge”). Controlled drug/alcohol challenges allow for tests of medication × drug/alcohol interactions, which are critical in medication development for establishing safety and tolerability. Beyond safety, behavioral pharmacology paradigms permit testing of theoretically meaningful endpoints, described as early efficacy markers. These endpoints often include medication-induced changes in the subjective response (SR) to alcohol or drugs, as well as measures of craving, via cue-reactivity or alcohol/drug administration [7,8,9]. Putatively, these early efficacy endpoints (i.e., Phase Ib) can inform clinical trials and whether or not a novel compound should be advanced to the next stage of clinical testing (i.e., Phase II, randomized clinical trial) [6, 10]. However, the human laboratory prediction made herein is limited to behavioral pharmacology studies with alcohol administration. Furthermore, certain medications (e.g., antagonists such as naltrexone) may be better suited for screening through these models than other medications (e.g., agonists such as gabapentin).

While the utility of behavioral pharmacology for establishing the safety and tolerability of addiction pharmacotherapies in humans is well established, the degree to which the early efficacy of novel compounds in the human laboratory can predict clinical efficacy remains unclear. In a recent critique, we argued that the degree to which a behavioral pharmacology paradigm is useful as an early efficacy marker depends on the degree to which that paradigm is related to the desired clinical outcome (e.g., abstinence or reduced heavy drinking) [11]. At a theoretical level, medications that reduce alcohol-induced stimulation and alcohol craving during alcohol administration are thought to reduce alcohol use [12]. Medications that potentiate the sedative effects of alcohol during alcohol administration are also thought to reduce alcohol intake [12]. Nevertheless, these hypothetical predictions have not been empirically tested, and doing so is the focus of the present study. In our previous work, we conducted a series of simulations to determine the required sample size for a behavioral pharmacology study to detect early efficacy based on varying levels of association between human laboratory paradigms and clinical outcomes [11]. These simulations used hypothetical associations, given that estimates of the “real” association between medication effects in behavioral pharmacology studies and their efficacy in randomized clinical trials (RCTs) remain unknown.

Thus far, systematic reviews have focused on qualitative assessments of the consistency between behavioral pharmacology and RCT results [3, 13, 14]. For example, Naltrexone has known clinical efficacy for AUD [15, 16] and appears to reliably blunt the reinforcing effects of alcohol [17, 18]. This is seen as evidence that reducing the rewarding effects of alcohol is a mechanism of action of naltrexone. Furthermore, blunting the rewarding SR to alcohol is an early efficacy marker of naltrexone observed in the lab [19, 20] and confirmed in clinical trials [21, 22]. Though these reviews provide insights into the consilience between behavioral pharmacology and clinical outcomes, they cannot provide quantitative estimates that could inform medication development. To date, no quantitative test of the concordance of behavioral pharmacology and clinical efficacy has been published in the field of addiction or psychiatry.

To address this gap in the literature we employed a novel translational meta-analytic approach to test whether behavioral pharmacology effect sizes are correlated with RCT effect sizes. To accomplish this goal, we searched the literature for medications tested for AUD using both behavioral pharmacology paradigms and RCTs. We then computed effect sizes for each medication in both behavioral pharmacology (i.e., “lab”) and clinical trial (i.e., “clinic”) designs. To integrate these two independent effect sizes, we used medication as the unit of analysis and tested the degree to which behavioral pharmacology effects are correlated with treatment effects across the various AUD pharmacotherapies.

For this proof-of-concept study, we focus on the primary outcome from alcohol challenge studies, which is SR to alcohol, including alcohol craving. This study does not address other important early efficacy endpoints in behavioral pharmacology for AUD, including cue-induced craving [23] and stress-induced [9, 24] craving. The focus on SR is consistent with its centrality in multiple theories of AUD etiology [25,26,27,28], and its relevance and wide prevalence in the AUD behavioral pharmacology literature [4, 12, 26]. Furthermore, alcohol craving has been introduced as an AUD symptom in DSM-5 and is widely considered a translational phenotype [29]. Through independent meta-analysis of AUD pharmacotherapies in human laboratory studies and RCTs, and by systematically testing their association, this study will quantitatively estimate the relationship between behavioral pharmacology and clinical trial outcomes in medication development for AUD.

Methods

Literature review

Inclusion criteria for the behavioral pharmacology studies were (1) the administration of a pharmacological agent approved or being developed for the treatment of AUD, (2) alcohol administered in the laboratory to a target BrAC via alcohol challenge or priming for self-administration (Footnote 1), (3) SR outcomes measured via self-report questionnaires, (4) reported in the English language, or translated to English, and (5) publication in a PubMed-indexed journal. Databases were searched through July 2018 and collected data were analyzed through September 2020.

Given the scope of literature covered in this meta-analysis an algorithmic approach was utilized to identify all the relevant research reports. First, published reviews of AUD psychopharmacology were reviewed to identify medications that have been tested in the human laboratory with alcohol administration paradigms [3, 4, 30,31,32]. Examination of published reviews identified 40 pharmacological compounds that may have been evaluated using behavioral pharmacology paradigms from 45 laboratory studies. Second, PubMed searches were conducted with each of the 40 medications in combination with any of the following phrases: “alcohol challenge,” “alcohol response,” “response* to alcohol,” “alcohol response,” “alcohol priming,” “alcohol intoxication,” “ethanol intoxication,” “response* to ethanol,” “ethanol response.” Medical subject headings (MeSH) were used in combination with terms listed above. These PubMed searches yielded a total of 1206 studies which were assessed for relevance in the present paper via abstract review.

From these 1206 initial studies, 67 were deemed relevant for full text review. 16 studies were excluded based on full text review (7 for lack of controlled alcohol administration, 4 for lack of SR outcomes, 2 for lack of new outcomes, and 3 for self-administration only). This resulted in a final sample of 51 studies that were included in this analysis comprising 55 independent samples with 1850 total subjects (all study statistics are made publicly available in https://github.com/sbujarski). All studies were coded by at least two raters (SB, RG, and/or DJOR). Where coding discrepancies existed, all raters met in person to reach a consensus. Furthermore, when sufficient data to generate effect size estimates were not reported in the published paper, corresponding authors were contacted via email to obtain the necessary information. The DigitizeIt software [33] was also utilized to extract data from published figures [34].

Inclusion criteria for the RCT studies were (1) a randomized controlled trial, (2) double or single blinded, (3) placebo or active control condition, (4) alcohol use as the primary endpoint, (5) 4 or more weeks of treatment, and (6) 12 or more weeks of follow-up. These inclusion criteria were selected based on established guidelines by the Cochrane Collaboration. Similar to the behavioral pharmacology review, RCT literature searching was algorithmic. First, Cochrane reviews were searched on each of the 24 medications with behavioral pharmacology data. Six medications (Naltrexone, Nalmefene, Acamprosate, Topiramate, Gabapentin, and Zonisamide) had published Cochrane reviews for AUD, which included a total of 67 studies. Second, PubMed searches were conducted on each of the 24 medications with the following search phrases: “randomized clinical trial,” “randomized controlled trial,” “randomised clinical trial,” “treatment,” and “Alcohol.” For medications that had Cochrane reviews, PubMed searches were restricted to the time frame from two years prior to the publication of the Cochrane review to the present. These searches identified a total of 2028 records, 132 of which were new studies subjected to full-text review and 118 of which were included in the analyses. For RCTs, there were 17 medications and the number of studies for each medication varied from 1 to 34. The systematic review process is shown in Fig. 1.

Fig. 1: Consists of a flow chart of the systematic review process.

Unlike a traditional meta-analysis, the goal was to identify trials for AUD medications tested in behavioral pharmacology and to restrict the analyses of randomized clinical trials to only those medications studied under behavioral pharmacology involving alcohol administration.

Selection of outcomes

Behavioral pharmacology

Prior factor analytic work by our group suggested that SR to alcohol represents a multifaceted construct with four distinct domains: (a) Stimulation/Hedonia, (b) Craving/Motivation, (c) Sedation/Motor Intoxication, and (d) Negative Affect [35, 36]. Assigning of outcome variables to SR domains was determined through consensus discussion among all study coders referencing the prior factor analytic work [35, 36], other published articles, and/or through referencing the specific items. The specific domain assignments are presented in Supplementary Fig. 1. Separate meta-analyses of medication effects were conducted on each outcome domain.

Randomized clinical trials

Informed by FDA guidelines for AUD medication development [37], two types of RCT outcomes were analyzed: any drinking and heavy drinking. For heavy drinking, the continuous outcome of percent drinking days (or percent heavy drinking days) was analyzed. The two clinical outcomes for RCTs were combined into a single outcome. This approach is consistent with standard practice in meta-analysis for AUD/SUD [38, 39], and resulted in a more stable estimate of medication effects on alcohol use while reducing the number of independent tests/comparisons. Meta-analytic methods for the RCT studies were identical to those employed for the behavioral pharmacology.

Data analytic plan

Data analysis for this study consisted of several steps. First, we calculated the unbiased Cohen’s d as the target effect size for each study. Cohen’s d was defined as the mean of the treatment group minus the mean of the control group, divided by the pooled standard deviation. Cohen’s d was then multiplied by a correction factor to obtain an unbiased Cohen’s d. Second, we grouped the effect size results from Abstinent and Heavy Drinking together. The effect sizes of Abstinent and Heavy Drinking were in opposite directions; therefore, we reverse-coded the effect sizes of Abstinent. After reverse-coding, in both Abstinent and Heavy Drinking, a negative effect size indicates that the treatment group has a lower group mean than the control group. Hence, there are 4 outcomes in the behavioral pharmacology laboratory (Stimulation/Hedonia, Craving/Motivation, Sedation/Motor Intoxication, and Negative Affect) and 1 outcome (Abstinent and Heavy Drinking combined) in clinical trials. Third, within each outcome, we conducted a fixed-effects meta-analysis for each medication using the metafor R package [40]. In other words, all medication studies identified in our literature search were coded for their effects on the four behavioral pharmacology outcome domains and the single clinical trial outcome domain, and the effects of each study were pooled into a single estimate for a given medication. Fixed-effects meta-analysis was used instead of random-effects meta-analysis because, for some medications, there were only 1 or 2 studies. In such cases, there are not enough studies to accurately estimate both the overall effect size and the between-study heterogeneity. Hence, we adopted the fixed-effects meta-analysis and estimated the overall effect size only. For Stimulation, there were 17 medications. Within each medication, the number of studies varied from 1 to 17. For Sedation, there were 20 medications. Within each medication, the number of studies varied from 1 to 18. 
For Craving, there were 17 medications. Within each medication, the number of studies varied from 1 to 16. For Negative Affect, there were only 8 medications. Within each medication, the number of studies varied from 1 to 7. Because there were only a few studies for Negative Affect and the data were sparse, we excluded Negative Affect from the next step. Fourth, we aimed to use the effect size of each medication in the behavioral pharmacology laboratory to predict the effect size of each medication in clinical trials. Because both the independent and dependent variables are measured with error, we used the Williamson-York bivariate weighted least squares estimation, which preserves the errors in both the independent and dependent variables [41,42,43,44]. The widely used ordinary least squares estimation could not be applied here because it considers errors in the dependent variable only, and thus important information about the independent variable would be omitted. There were three regressions based on different laboratory outcomes (excluding Negative Affect due to its low data availability): Stimulation effect sizes predicting clinical effect sizes, Sedation effect sizes predicting clinical effect sizes, and Craving effect sizes predicting clinical effect sizes. Fifth, we conducted a sensitivity analysis correcting for publication bias. We used the p-uniform method [45] to obtain corrected estimates of the overall effect sizes and then conducted the regression analysis on these corrected estimates. Compared to other publication bias correction methods, the p-uniform method performs relatively well when the effect sizes are homogeneous and the sample size is small [46]. We used the puniform R package [47].
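The first and third steps described above (bias-corrected standardized mean difference, then fixed-effects pooling per medication) can be illustrated with a minimal Python sketch. This is not the study's code, which used the metafor R package. The correction factor J = 1 − 3/(4df − 1) and the variance formula are the commonly used approximations and are assumed here, since the text does not state which were applied. The function names and the example numbers are hypothetical.

```python
import math

def hedges_g(m_trt, m_ctl, sd_trt, sd_ctl, n_trt, n_ctl):
    """Bias-corrected standardized mean difference and its sampling variance."""
    df = n_trt + n_ctl - 2
    # Pooled standard deviation across the two groups
    sd_pooled = math.sqrt(((n_trt - 1) * sd_trt**2 + (n_ctl - 1) * sd_ctl**2) / df)
    d = (m_trt - m_ctl) / sd_pooled
    # Assumed small-sample correction factor (common approximation)
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    var_d = (n_trt + n_ctl) / (n_trt * n_ctl) + d**2 / (2.0 * (n_trt + n_ctl))
    return j * d, j**2 * var_d

def fixed_effect_pool(effects, variances):
    """Inverse-variance (fixed-effects) pooled estimate and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))

# Two hypothetical laboratory studies of the same medication (made-up numbers)
g1, v1 = hedges_g(4.0, 5.0, 2.0, 2.0, 20, 20)
g2, v2 = hedges_g(3.5, 5.0, 2.5, 2.5, 15, 15)
pooled, se = fixed_effect_pool([g1, g2], [v1, v2])
print(f"pooled effect = {pooled:.3f}, SE = {se:.3f}")
```

Reverse-coding an outcome, as done for Abstinent, amounts to multiplying its effect size by −1 before pooling; the sampling variance is unchanged.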

After analyzing the bivariate associations between laboratory and clinical outcomes, we conducted a predictive analysis to determine the degree to which these methods can inform go/no-go decisions for clinical trials of novel medications. To assess the predictive utility of these laboratory outcomes, we employed a novel leave-one-out Monte Carlo simulation method. The Williamson-York regression models were trained on a dataset with a single medication removed (the target medication). The regression models were then used to predict the clinical effect size of the target medication based on its observed laboratory effect size. A Monte Carlo method was used to account for predictor value uncertainties. Specifically, 100,000 predicted values were generated for each laboratory outcome. These simulated predicted values were then summarized with respect to their mean and standard deviation. To arrive at a single predicted clinical effect size distribution for the target medication, we computed an aggregated mean and SD across the different outcomes. To provide a metric for how accurate these predicted effect sizes were, we computed a z-score for the observed clinical effect size with respect to the predicted mean effect size and standard deviation. This metric therefore represents the degree to which the observed effect size is expected under the predicted range. This procedure was then repeated across all medications included in this study.
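The leave-one-out procedure can be sketched as follows for a single laboratory outcome. This is an illustration under stated assumptions, not the authors' implementation: the regression is passed in as a function, and the demonstration uses an ordinary weighted line fit (`weighted_line_fit`, a hypothetical stand-in) rather than the Williamson-York estimator. Only the uncertainty in the laboratory effect size is simulated, following the description above; whether the original analysis also propagated uncertainty in the regression coefficients, and how outcomes were aggregated, is not specified in the text. All data values are made up.

```python
import numpy as np

def loo_monte_carlo(lab, lab_se, clinic, clinic_se, fit, n_draws=100_000, seed=0):
    """For each medication: refit without it, simulate its predicted clinical
    effect size, and score the observed value against that distribution."""
    lab, lab_se, clinic, clinic_se = (
        np.asarray(v, dtype=float) for v in (lab, lab_se, clinic, clinic_se)
    )
    rng = np.random.default_rng(seed)
    results = []
    for i in range(len(lab)):
        keep = np.arange(len(lab)) != i  # drop the target medication
        intercept, slope = fit(lab[keep], lab_se[keep], clinic[keep], clinic_se[keep])
        # Draw plausible laboratory effect sizes for the target medication
        lab_draws = rng.normal(lab[i], lab_se[i], size=n_draws)
        predicted = intercept + slope * lab_draws
        mean, sd = predicted.mean(), predicted.std(ddof=1)
        # z-score of the observed clinical effect within the predicted distribution
        results.append((mean, sd, (clinic[i] - mean) / sd))
    return results

def weighted_line_fit(x, sx, y, sy):
    """Stand-in fit (assumption): least squares weighted by 1/sy, ignoring sx."""
    slope, intercept = np.polyfit(x, y, 1, w=1.0 / sy)
    return intercept, slope

# Hypothetical per-medication effect sizes and standard errors
lab = [-0.40, -0.25, -0.10, 0.05, 0.15, 0.30]
lab_se = [0.10, 0.15, 0.20, 0.12, 0.18, 0.25]
clinic = [-0.30, -0.12, -0.15, 0.02, -0.05, 0.10]
clinic_se = [0.05, 0.08, 0.10, 0.07, 0.09, 0.12]

for mean, sd, z in loo_monte_carlo(lab, lab_se, clinic, clinic_se, weighted_line_fit):
    print(f"predicted = {mean:+.3f} (SD {sd:.3f}), z of observed = {z:+.2f}")
```

Passing the fit as an argument keeps the resampling logic separate from the choice of estimator, so an errors-in-variables fit with the same return order (intercept, slope) could be substituted.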

Together, this novel application of Williamson-York bivariate weighted least squares estimation, derived from the fields of physics and astronomy, allowed us to integrate decades of research into a meaningful and quantitatively sound test of the relationship between independent effect sizes obtained in behavioral pharmacology and RCT contexts. In this effort, medication was the unit of analysis. The novel leave-one-out Monte Carlo analysis also provides new insights into the predictive utility of these laboratory methodologies that can inform go/no-go decisions for novel medication clinical trials.
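For readers unfamiliar with errors-in-variables regression, the following Python sketch shows one way a straight line can be fitted when both variables carry known standard errors, using the iterative equations commonly attributed to York and colleagues. It is offered as an assumption-laden illustration, not as the code used in this study: the function name `york_fit`, the default of uncorrelated errors (r = 0), the convergence settings, and the example data are all hypothetical.

```python
import numpy as np

def york_fit(x, y, sx, sy, r=0.0, tol=1e-12, max_iter=200):
    """Line fit with errors in both x and y.
    Returns intercept, slope, and their standard errors."""
    x, y, sx, sy = (np.asarray(v, dtype=float) for v in (x, y, sx, sy))
    r = np.broadcast_to(np.asarray(r, dtype=float), x.shape)
    wx, wy = 1.0 / sx**2, 1.0 / sy**2  # precision weights for x and y
    alpha = np.sqrt(wx * wy)

    def pieces(b):
        # Combined weight of each point given the current slope b
        W = wx * wy / (wx + b**2 * wy - 2.0 * b * r * alpha)
        x_bar, y_bar = np.sum(W * x) / np.sum(W), np.sum(W * y) / np.sum(W)
        U, V = x - x_bar, y - y_bar
        beta = W * (U / wy + b * V / wx - (b * U + V) * r / alpha)
        return W, x_bar, y_bar, U, V, beta

    b = np.polyfit(x, y, 1)[0]  # ordinary least squares slope as starting value
    for _ in range(max_iter):
        W, _xb, _yb, U, V, beta = pieces(b)
        b_new = np.sum(W * beta * V) / np.sum(W * beta * U)
        done = abs(b_new - b) < tol
        b = b_new
        if done:
            break

    W, x_bar, y_bar, U, V, beta = pieces(b)
    a = y_bar - b * x_bar
    x_adj = x_bar + beta  # least-squares-adjusted x values
    x_adj_bar = np.sum(W * x_adj) / np.sum(W)
    se_b = np.sqrt(1.0 / np.sum(W * (x_adj - x_adj_bar) ** 2))
    se_a = np.sqrt(1.0 / np.sum(W) + x_adj_bar**2 * se_b**2)
    return a, b, se_a, se_b

# Hypothetical laboratory (x) and clinical (y) effect sizes with standard errors
lab = [-0.40, -0.25, -0.10, 0.05, 0.30]
clinic = [-0.30, -0.12, -0.15, 0.02, 0.10]
a, b, se_a, se_b = york_fit(lab, clinic, [0.10, 0.15, 0.20, 0.12, 0.25],
                            [0.05, 0.08, 0.10, 0.07, 0.12])
print(f"intercept = {a:.3f} (SE {se_a:.3f}), slope = {b:.3f} (SE {se_b:.3f})")
```

When the standard errors of x shrink toward zero, the weights reduce to 1/σy², and the fit coincides with ordinary weighted least squares, which is the case that ignores predictor error.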

Results

Effect size estimation

Effect size estimates across the 51 human laboratory studies included in the study and across the three outcomes of stimulation, sedation, and craving are presented in Supplementary Fig. 2. All studies are listed by author/year, medication name, medication dosage, estimated effect size as Hedges’ g (converted to Cohen’s d for the analyses), average drinks per month in the sample (DpM), and Breath Alcohol Concentration (BrAC) during the alcohol challenge. Effect size estimates across the 118 RCTs included and across the two outcomes of abstinence and heavy drinking are presented in Supplementary Fig. 3. All studies are listed by author/year, medication name, medication dosage, estimated effect size as Hedges’ g (converted to Cohen’s d for the analyses), and treatment duration (in weeks).

Alcohol-induced stimulation and clinical outcomes

As described above, we tested a model in which the stimulation effect sizes predict clinical effect sizes, across all medications studied under both behavioral pharmacology and RCTs. Effect sizes for stimulation and clinical outcomes were available for 12 medications. The slope of the regression was positive and estimated at β = 1.64 (SE = 0.46, p < 0.01) when the laboratory outcome was Stimulation, which indicated a significant positive relationship between the effect sizes for medication effects on alcohol-induced stimulation in the behavioral pharmacology studies and the medication effect sizes in clinical trials for AUD; see Fig. 2. The positive relationship suggests that medications that decreased alcohol-induced stimulation in the human laboratory were found to decrease drinking in RCTs. The bivariate-weighted correlation between the two sets of effect sizes is r = 0.370. With publication bias correction and corrected effect sizes, the slope of the regression was estimated at β = 1.18 (p < 0.05), such that the conclusion remained the same.

Fig. 2: Displays the Williamson-York bivariate weighted regression in which stimulation effect sizes predict clinical effect sizes.

Each medication is represented as a dot on the regression line and smaller dots indicate more error variance while larger dots indicate less error variance around each estimate. Bivariate effect size standard errors for each medication are represented with the ellipses surrounding each point. The regression standard error is represented by the ribbon around the regression line.

Alcohol-induced sedation and clinical outcomes

For the Sedation effect sizes, a positive effect size indicates that the treatment group has a larger effect than the control group, while for the clinical outcomes, a negative effect size indicates that the treatment group has a larger effect than the control group. Data were available for 13 medications.

For the model in which the sedation effect sizes predict clinical effect sizes, the slope of the regression was β = 4.04 (SE = 2.48, p = 0.130), which was nonsignificant. The correlation between the two sets of effect sizes was r = 0.227. With publication bias correction and corrected effect sizes, the slope of the regression was significant and positive, at β = 2.38 (p < 0.05). The significant positive slope indicated that medications which led to larger increases in sedative subjective effects had poorer clinical benefit (see Fig. 3).

Fig. 3: Displays the Williamson-York bivariate weighted regression in which sedation effect sizes predict clinical effect sizes.

Each medication is represented as a dot on the regression line and smaller dots indicate more error variance while larger dots indicate less error variance around each estimate. Effect size standard errors are represented with the ellipses surrounding each point. The regression standard error is represented by the ribbon around the regression line.

Alcohol-induced craving and clinical outcomes

The final model tested whether craving effect sizes predict clinical effect sizes, across all medications studied under both behavioral pharmacology and RCTs. Data were available for 13 medications. The observed slope of the regression was positive and significant, at β = 1.14 (SE = 0.32, p < 0.01). This finding suggests that medications that decreased alcohol-induced craving during an alcohol challenge were found to decrease drinking in RCTs. The correlation between the two sets of effect sizes was r = 0.074. With publication bias correction and corrected effect sizes, the slope of the regression was β = 3.28 (p < 0.001), such that the conclusion of a significant positive relationship remained the same (see Fig. 4).

Fig. 4: Displays the Williamson-York bivariate weighted regression in which craving effect sizes predict clinical effect sizes.

Each medication is represented as a dot on the regression line and smaller dots indicate more error variance while larger dots indicate less error variance around each estimate. Effect size standard errors are represented with the ellipses surrounding each point. The regression standard error is represented by the ribbon around the regression line.

Predictive utility of laboratory effects

The leave-one-out Monte Carlo analysis suggested that the combination of these laboratory and quantitative methods can provide useful information for predicting clinical efficacy for a novel medication that has yet to be tested in a clinical trial. That said, the effect size uncertainties are generally wide, driven largely by the laboratory effect size precision and modest correlations between laboratory and clinical effects (see Table 1, Fig. 5). The predicted effect sizes were well calibrated and not systematically biased. The average z-score of the observed clinical effect size with respect to the predicted distributions was very small (−0.004). Despite generally high concordance between predicted and observed effects, there were a few medications where substantial discrepancies occurred. Namely, Gabapentin was shown to have a significantly larger clinical impact than predicted and Memantine was found to have a significantly more deleterious clinical effect than predicted. Olanzapine was also found to have a smaller clinical impact than predicted, though this discrepancy was substantially less severe than those for Gabapentin and Memantine. For all other medications, the observed clinical effect size was within one standard deviation of the mean predicted effect size.

Table 1: Represents the predicted clinical effect size based on each medication’s laboratory effect sizes using a leave-one-out Monte Carlo simulation method on the bivariate-weighted regression models.
Fig. 5: Displays the predicted and observed clinical effect size distributions.

Predicted effect size distributions were generated using a leave-one-out Monte Carlo simulation method for each medication across laboratory effect sizes. Where multiple laboratory effects existed for a given medication, these predicted distributions were aggregated.

Discussion

This study tested the relationship between early efficacy assays of SR to alcohol collected in placebo-controlled behavioral pharmacology studies of medications for AUD and the clinical effects of these AUD medications in RCTs. Leveraging advanced meta-analytic tools and the Williamson-York bivariate weighted least squares estimation, the latter appropriate for integrating dependent and independent variables with errors, this proof-of-concept study provided quantitative estimates to a critical substantive question in medication development. Namely, does early efficacy in the human laboratory captured by medication effects on SR to alcohol administration (i.e., stimulation, sedation, and craving) predict clinical outcomes in RCTs for those medications?

Simply put, we predicted that the more a medication reduced alcohol-induced stimulation, relative to placebo, the more that medication reduced alcohol intake in RCTs. This hypothesis was supported by our analyses such that reduced stimulation in the laboratory was positively associated with less drinking in RCTs, across the available medications studied in both human laboratory and clinical settings. Furthermore, we found the same pattern to be true for alcohol-induced craving and sedation, such that reduced craving and sedation in the laboratory were positively associated with less drinking in RCTs. These extensive and innovative analyses, across a wide range of medications and outcomes, effectively integrate two critical phases of medication development, namely phase Ib (early efficacy) and phase II (clinical efficacy). They provide critical insights into the degree to which these early efficacy markers (i.e., SR during alcohol challenge), measured in the human laboratory, predict real-world clinical outcomes for AUD in RCTs.

While the fact that there is some consilience across the effects obtained in behavioral pharmacology trials and in RCTs for AUD is encouraging, the magnitude of these associations (i.e., their correlation) was relatively small. As detailed in our simulation study [11], the magnitude of the association between laboratory and clinical outcomes should inform power analyses for human laboratory trials. In that Monte Carlo simulation study, a correlation between laboratory and clinical outcomes of 0.3 was the smallest value examined and indicated that laboratory studies should have twice the sample size of a clinical trial in order to detect a medium effect size treatment. To further inform go/no-go decisions for novel medications, we conducted a leave-one-out Monte Carlo analysis on the combined human laboratory data. Findings suggested that the combination of these laboratory endpoints can provide useful information for predicting clinical efficacy for a novel medication that has yet to be subjected to a clinical trial. A caveat to this conclusion is that the effect size uncertainties are generally wide, driven largely by the laboratory effect size precision and modest correlations between laboratory and clinical effects. Furthermore, while there was generally high concordance between predicted and observed effects, there were a few notable exceptions. Specifically, Gabapentin was shown to have a significantly larger clinical impact than predicted and Memantine was found to have a significantly more deleterious clinical effect than predicted. In brief, the Monte Carlo analyses add medication-specific results and directly examine the predictive utility of human laboratory models focused on SR domains.

This study represents an important step toward optimizing the medication development pipeline by leveraging behavioral pharmacology designs to elucidate medication effects on early efficacy endpoints. Insofar as SR to alcohol during an alcohol challenge is the paradigm used, and early efficacy endpoints include stimulation, sedation, and craving, this study confirms that these early efficacy markers are indeed quantitatively related to clinical outcomes in RCTs across a range of medications studied under both experimental conditions. In other words, medications that can reduce stimulation, reduce craving, and potentiate sedation during alcohol administration, compared to placebo, fare better in clinical trials as demonstrated by reduced alcohol consumption. This finding is consistent with the role of behavioral pharmacology in early signal detection and screening of promising compounds, as articulated in the medication development literature for AUD [4, 6, 48]. Nonetheless, care should be taken to adequately power studies so that they can reliably detect the behavioral pharmacology endpoints reported herein. In addition, it is important to consider medication development for AUD, and its success, in the broader context of other factors, including the lack of substantial investment compared to other fields [49].

During the peer-review process of this study, a number of important caveats were raised and should be considered by the readers in interpreting these findings. These analyses do not distinguish between drugs with a mechanism of action aimed at antagonizing the rewarding effects of alcohol (e.g., naltrexone, nalmefene, topiramate) and medications that seek to maintain abstinence by restoring homeostasis in brain systems dysregulated by the onset of abstinence (e.g., acamprosate and gabapentin). We are clearly underpowered to do so. Nevertheless, it is plausible that the behavioral pharmacology paradigms associated with alcohol administration in the laboratory, and studied herein, may be best suited for testing antagonist medications and less suited for screening the therapeutic potential of medications in the agonist category. Another issue brought up in peer-review is the notion that reduction in heavy drinking may be the ideal primary outcome for an antagonist medication, such as naltrexone [50], whereas abstinence may be a better outcome for an agonist medication, such as acamprosate [51]. In this meta-analysis, abstinence and heavy drinking outcomes are combined in order to boost statistical power. It is plausible that in addition to refining the behavioral pharmacology testing by selecting laboratory outcomes that are best suited based on the mechanism of action of a given medication (i.e., agonist versus antagonist), such refinement should be considered at the level of the clinical outcomes selected.

Several caveats and limitations should be applied to the interpretation of these findings. First, this proof-of-concept study is restricted to three dimensions of SR measured during an alcohol administration paradigm (i.e., stimulation, sedation, and craving). This study does not speak to other important early efficacy endpoints in behavioral pharmacology for AUD, including cue-induced craving [23] and stress-induced [9, 24] craving. Second, this study only examined medications that were studied under both human laboratory and RCT conditions, and a host of medications certainly did not meet this criterion. Nevertheless, the novel implementation of the Williamson-York bivariate weighted least squares estimation allowed us to integrate independent samples (i.e., participants tested in the laboratory were not the same as those tested in clinical studies). By doing so, we integrated decades of research. The alternative approach would be to test the same participants in the lab before they proceed to a clinical trial [23], which is both costly and cumbersome. Third, utilizing these three early efficacy endpoints to screen novel medications assumes that all promising AUD medications will work through these mechanisms of attenuating craving, stimulation, and/or potentiating sedation during alcohol administration. Conversely, as we come to understand novel drugs and novel mechanisms of action, a wider range of early efficacy endpoints may be necessary, including assessments of mood, alcohol metabolism, cue-reactivity, and alcohol self-administration [30, 52]. It is plausible that Gabapentin, for example, operates through different mechanisms; hence the prediction via SR measures was not consistent with clinical trial outcomes, which proved more favorable clinically than predicted by the model. 
This is consistent with the argument that agonist medications seeking to restore homeostasis in brain systems dysregulated during abstinence may be better screened through alternative behavioral pharmacology models, including alcohol cue-reactivity, for example. Furthermore, the biobehavioral assays studied herein can inform the development of treatment-responsive biomarkers, which remains a critical gap in AUD treatment development [53]. Fourth, publication bias continues to be a problem, and in this study alone, we estimated that 35% of the outcomes mentioned in publications did not have accompanying results. Selective publication of outcomes is endemic in human laboratory studies and clinical research more broadly [54]. While this issue has been recognized for almost three decades [55], it continues to be a threat to the interpretation of scientific findings and to meta-analytic efforts such as ours. Fifth, there is a clear imbalance with regard to the number of studies available across the range of medications studied. Naltrexone and acamprosate are clearly the most widely studied medications, with multiple studies available, allowing for a more precise estimation of both human laboratory and RCT outcomes. For the other study medications, only a few studies were available for analyses. This imbalance led to more variability in the estimates for medications with few trials and caused medications like naltrexone and acamprosate to exert an undue influence on the outcomes. Nevertheless, since the analyses were conducted with medication as the unit of analysis, medications with multiple studies were summarized into a single data point such that they did not “count more heavily” in the final analyses than any other medication. Sixth, while these extensive efforts include coding of study covariates, we were not able to reliably implement meta-regression analyses controlling for study differences given that many medications only had a few studies.
Additional analyses including covariates may be possible for medications with multiple trials [17]. Seventh, the categorization scheme using items for the dimensions of SR on the basis of their face validity can be improved upon in future studies in which person-level data are available. Specifically, network analysis may be well-suited for testing the relationships among the predictor variables (i.e., specific items/scales capturing dimensions of SR to alcohol) and, in turn, improving the overall model prediction. Eighth, visual inspection of Fig. 5, in which predicted and observed effects are displayed for each medication, suggests that specificity and negative predictive value are low. This means that the lab models studied herein did not correctly identify any medications that were clinically ineffective. However, it should be noted that this is a sample of AUD pharmacotherapies that was intentionally selected to have both human behavioral pharmacology and RCT studies. As such, many medications tested in the human lab may not have moved to RCT testing on the basis of poor human-lab outcomes. Ninth, this proof-of-concept study is focused exclusively on medications for AUD and the translation from early efficacy testing (behavioral pharmacology, phase Ib trial) to clinical efficacy testing (RCT, phase II trial). Nevertheless, the novel methods employed in this study are flexible and can be applied to examining the consilience between preclinical efficacy and early efficacy or clinical efficacy, another longstanding gap in the literature [56, 57]. This approach could also be used to estimate the utility of a host of paradigms for screening medications for alcohol and drug use disorders (e.g., cue-induced craving and self-administration) [5, 58].
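For readers unfamiliar with errors-in-both-variables regression, the following is a minimal sketch of an iterative bivariate weighted least squares fit in the style of York's method. It is not the code used in this study. It assumes one data point per medication, with a laboratory effect size (x), a clinical effect size (y), and their standard errors, and it assumes the errors in x and y are uncorrelated, which seems reasonable when the laboratory and clinical samples are independent. The function name `york_fit` and all numbers are hypothetical and purely illustrative.

```python
def york_fit(x, y, sx, sy, tol=1e-12, max_iter=200):
    """Fit y = a + b*x when both x and y carry uncorrelated errors.

    x, y   : point estimates (e.g., one pair per medication; hypothetical)
    sx, sy : standard errors of x and y
    Returns (intercept a, slope b).
    """
    n = len(x)
    wx = [1.0 / s ** 2 for s in sx]  # precision weights for x
    wy = [1.0 / s ** 2 for s in sy]  # precision weights for y

    # Start from the ordinary least squares slope.
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))

    for _ in range(max_iter):
        # Combined weight of each point given the current slope.
        w = [wxi * wyi / (wxi + b * b * wyi) for wxi, wyi in zip(wx, wy)]
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        u = [xi - xbar for xi in x]
        v = [yi - ybar for yi in y]
        beta = [wi * (ui / wyi + b * vi / wxi)
                for wi, ui, vi, wxi, wyi in zip(w, u, v, wx, wy)]
        b_new = (sum(wi * bi * vi for wi, bi, vi in zip(w, beta, v))
                 / sum(wi * bi * ui for wi, bi, ui in zip(w, beta, u)))
        converged = abs(b_new - b) < tol
        b = b_new
        if converged:
            break

    a = ybar - b * xbar
    return a, b


# Synthetic, made-up points lying exactly on y = 1 + 2x (not study data).
a, b = york_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0],
                [0.1, 0.2, 0.1, 0.3], [0.2, 0.1, 0.3, 0.1])
print(a, b)
```

On these synthetic points the fit recovers an intercept of 1 and a slope of 2 up to floating-point rounding. In this sketch, a medication estimated from many trials enters as a single point with smaller standard errors, so it is weighted by precision and not by its number of studies.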

In sum, behavioral pharmacology endpoints of alcohol-induced stimulation, sedation, and craving track medication effects from the human laboratory to clinical trial outcomes. This proof-of-concept study uses a novel methodological approach to integrate decades of medication development research and to demonstrate the relationship, albeit of small-to-moderate magnitude, between behavioral pharmacology with alcohol administration and clinical trial endpoints for AUD. These methods and results can be applied to a host of clinical questions and can streamline the process of screening novel compounds for AUD. This methodological approach can be used to quantify the predictive utility of cue-reactivity screening models and even preclinical models of medication screening.

Funding and disclosure

Support for data analysis and manuscript preparation was provided by K24AA025704. The funder had no role in the design, analysis, interpretation, or writing of the report. None of the authors have any competing financial interests in relation to this work. None of the authors have any conflicts of interest.