Title: Rapid, precise, and reliable phenotyping of delay discounting using Bayesian adaptive design optimization

Machine learning has the potential to facilitate the development of computational methods that improve the measurement of cognitive and mental functioning. In three populations (college students, patients with a substance use disorder, and Amazon Mechanical Turk workers), we evaluated one such method, Bayesian adaptive design optimization (ADO), in the area of delay discounting by comparing its test-retest reliability, precision, and efficiency with that of a conventional staircase method. The results showed that ADO led to 0.95 or higher test-retest reliability of the discounting rate within 10-20 trials (under 1-2 minutes of testing) in all three populations tested, captured approximately 10% more variance in test-retest reliability, was 3-5 times more precise, and was 3-8 times more efficient than the staircase method. The ADO methodology provides efficient and precise protocols for phenotyping individual differences.


Introduction
Precision medicine (Insel, 2014) proposes that we use individually tailored treatment and prevention programs for each patient (Collins & Varmus, 2015) to maximize their efficacy. Its goal is to identify (bio)markers of individual differences and treatment outcomes on the basis of neurobiological or cognitive tests. While precision medicine is being widely used in the treatment of cancer (Friedman, Letai, Fisher, & Flaherty, 2015), there is growing interest in its application in psychiatry and general mental functioning (Insel, 2014), as reflected in the Research Domain Criteria initiative advanced by the National Institute of Mental Health.
A formidable challenge in applying precision medicine to mental functioning is improving measurement. We focus on three aspects of measurement: reliability, precision, and efficiency. It is difficult to reliably measure latent neurocognitive constructs or biological processes, such as impulsivity, reward sensitivity, or learning rate. While recent advancements in neuroscience and computational psychiatry (Montague, Dolan, Friston, & Dayan, 2012; Stephan & Mathys, 2014) provide novel frameworks, cognitive tasks, and latent constructs that allow us to investigate the neurocognitive mechanisms underlying psychiatric conditions, their reliabilities have not been rigorously tested or are not yet acceptable (Hedge, Powell, & Sumner, 2017). A recent large-scale study suggests that the test-retest reliability of cognitive tasks is only modest (Enkavi, Eisenberg, Bissett, Mazza, & Poldrack, 2018). Even if a test is reliable across time, confidence in the measurement will depend on its precision at each measurement. To our knowledge, few studies have rigorously tested the precision of measures from a neurocognitive test. Lastly, cognitive tasks developed in research laboratories are not always efficient, often taking 10-20 minutes to administer. With lengthy and relatively demanding tasks, participants (especially clinical populations) can easily fatigue or be distracted (Sandry, Genova, Dobryakova, DeLuca, & Wylie, 2014), which can increase measurement error due to inconsistent responding. A byproduct of low task efficiency is that the amount of data (e.g., number of participants) typically available for big data approaches to studying psychiatry is smaller than in other fields.
Bayesian adaptive testing is a promising machine-learning method that can address the aforementioned challenges and thereby improve behavioral precision medicine (Cavagnaro, Myung, Pitt, & Kujala, 2010; Myung, Cavagnaro, & Pitt, 2013). It originates from optimal experimental design in statistics (Atkinson & Donev, 1992) and from active learning in machine learning (Cohn, Atlas, & Ladner, 1994). It is an algorithm-based Bayesian methodology for designing optimal experiments that lead to rapid and accurate parameter inference about the phenomenon under study with the fewest possible measurement episodes. It is a form of adaptive testing in which the values of design variables (e.g., stimulus properties and task parameters) to use in the next trial are determined online, in real time, based on the data collected from the preceding trials, so as to be maximally informative about the question of interest (e.g., what is the attention span of a 7-year-old? how impulsive is this individual?). It differs from traditional adaptive techniques such as the staircase method (Leek, 2001) in that a parametric model of the underlying psychological process guides stimulus choice on each trial. The methodology and its variants are being applied across disciplines to improve the efficiency and informativeness of data collection, including cognitive psychology (Cavagnaro, Aranovich, McClure, Pitt, & Myung, 2016; Myung & Pitt, 2009), vision (Gu et al., 2016; Lesmes, Jeon, Lu, & Dosher, 2006), psychiatry (Aranovich, Cavagnaro, Pitt, Myung, & Mathews, 2017), neuroscience (DiMattina & Zhang, 2011; Lewi, Butera, & Paninski, 2009), clinical drug trials (Wathen & Thall, 2008), and systems biology (Kreutz & Timmer, 2009).
Here, we demonstrate the successful application of adaptive design optimization (ADO), an implementation of Bayesian adaptive testing, to improving measurement in the delay discounting task. Delay discounting is a strong candidate endophenotype for addictive disorders (Anokhin, Grant, Mulligan, & Heath, 2014; Bickel, 2015) and risky behaviors (for a review, see Green & Myerson, 2004). The construct validity of delay discounting has been demonstrated in numerous studies. For example, the delay discounting task is widely used to assess (altered) temporal impulsivity in various psychiatric populations, including patients with substance use disorders (e.g., Green & Myerson, 2004), schizophrenia (Ahn et al., 2011; Heerey, Robinson, McMahon, & Gold, 2007), and bipolar disorder (Ahn et al., 2011). We show that in three different populations (college students, patients with substance use disorders, and the online testing community), ADO leads to rapid, precise, and reliable estimates of the delay discounting rate (k) with the hyperbolic function. Test-retest reliability of k reached 0.95 or higher within 10-20 trials (under 1-2 minutes of testing) with at least three times greater precision and efficiency than the staircase method (Mazur, 1987).

Methods

Experiment 1 (college students)
In Experiment 1, we recruited college students (N=58) to evaluate test-retest reliability (TRR) of the ADO and staircase (SC) methods over a period of approximately one month, a span of time over which one might want to measure changes in impulsivity. Previous studies have typically used 1 week (e.g., Matusiewicz, Carter, Landes, & Yi, 2013), 2 weeks (Harrison & McKay, 2012), or 3-6 months (Weatherly & Derenne, 2013) between test sessions. Students visited the lab twice. In each visit they completed two ADO and two SC sessions, allowing us to measure TRR within and between sessions. In each session, students made 42 choices about hypothetical scenarios involving a larger but later reward versus a smaller but sooner reward. We examined TRR (using Pearson correlation coefficients) within each visit and between the two visits, using the discounting rate k of the hyperbolic function as the outcome measure.
Participants. Fifty-eight students at The Ohio State University (25 males and 33 females; age range 18-37 years; mean 19.0, SD 2.9 years) were recruited. They were required to be at least 18 years of age and received course credits for their participation. For all studies reported in this work, we used the following exclusion criterion: a participant was excluded from further analysis if the standard deviation (SD) of that participant's parameter value was more than two SDs above or below the group mean. In other words, we excluded participants who seemingly made highly inconsistent choices during the task.
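The exclusion rule can be illustrated with a minimal sketch. It assumes that the per-participant quantity is the posterior SD of a parameter and that the group mean and group SD are computed over those per-participant values; the function name `exclusion_mask` and the `n_sd` argument are hypothetical and are not taken from the original analysis code.

```python
import numpy as np

def exclusion_mask(participant_sds, n_sd=2.0):
    """Flag participants whose SD lies more than n_sd group SDs from the group mean.

    participant_sds: one value per participant (assumed here to be the
    posterior SD of a model parameter). Returns a boolean array where
    True means "exclude".
    """
    sds = np.asarray(participant_sds, dtype=float)
    group_mean = sds.mean()
    group_sd = sds.std(ddof=1)  # sample SD across participants (an assumption)
    return np.abs(sds - group_mean) > n_sd * group_sd
```

Applied to nine participants with a posterior SD of 0.1 and one with 2.0, only the last participant would be flagged under these assumptions.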
Delay discounting task. Each participant completed two sessions, which were separated by approximately one month (mean=28.3 days, SD=5.3 days). In each session, a participant completed four delay discounting tasks: two ADO-based tasks and two staircase-based tasks.
Each ADO-based or staircase-based task included 42 trials. The order of task completion (ADO then staircase versus the reverse) was counterbalanced across participants.
In the traditional staircase method, a participant initially made a choice between $400 now and $800 at seven different delays: one week, two weeks, one month, six months, one year, 3 years, and 10 years. The order of the delays was randomized for each participant. By adjusting the immediate amount, the choices were designed to estimate the participant's indifference point for each delay (1). See Ahn et al. (2011) and Green & Myerson (2004) for the details of the procedure.
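A minimal sketch of a staircase of this kind is given here for one delay. The adjustment rule is an assumption: the immediate amount starts at $400, moves up or down by a step that is halved after every choice, and six choices are made per delay (which would give 42 trials over seven delays). The exact rule used in the study is described in the cited papers and may differ; `staircase_indifference` and `choose_delayed` are hypothetical names.

```python
def staircase_indifference(choose_delayed, delayed_amount=800.0,
                           start_immediate=400.0, n_trials=6):
    """Estimate an indifference point for one delay with a halving staircase.

    choose_delayed(immediate, delayed) -> bool is the participant's choice
    (True when the delayed reward is chosen).
    """
    immediate = start_immediate
    step = start_immediate / 2.0
    for _ in range(n_trials):
        if choose_delayed(immediate, delayed_amount):
            immediate += step  # delayed chosen: make the immediate option more attractive
        else:
            immediate -= step  # immediate chosen: make it less attractive
        step /= 2.0            # assumed halving rule
    return immediate


# Usage: a simulated, noise-free hyperbolic discounter with k = 0.01 at a 30-day delay.
k, delay = 0.01, 30.0
estimate = staircase_indifference(lambda now, later: later / (1.0 + k * delay) > now)
```

With these settings the estimate ends within one final step ($6.25) of the simulated participant's indifference point of about $615.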
In the ADO method, the sooner delay and the later-larger reward were fixed at 0 days and $800, respectively. The later delay and the sooner reward were experimental parameters that were optimized on each trial. Based on the ADO framework and the participant's choices so far, the most informative design (a later delay and a sooner reward) was selected on each trial.
Computational modeling. We applied ADO to the hyperbolic function, which has two parameters (k: discounting rate, and β: inverse temperature rate). The hyperbolic function has the form $V = A / (1 + kD)$, where an objective reward amount A after delay D is discounted to a subjective reward value V for an individual whose discounting rate is k (> 0). Typically in a delay discounting task, two options are presented on each trial: a sooner-smaller (SS) reward and a later-larger (LL) reward. The subjective values of the two options are modeled by the hyperbolic function. We used softmax (Luce's choice rule) to translate subjective values into the choice probability on trial t:
$P_t(\mathrm{LL}) = 1 / (1 + \exp[\beta (V_{SS} - V_{LL})])$,
where $V_{SS}$ and $V_{LL}$ are the subjective values of the SS and LL options. To estimate the two parameters of the hyperbolic model in the staircase method, we used the hBayesDM package (Ahn, Haines, & Zhang, 2017). The hBayesDM package (https://github.com/CCS-Lab/hBayesDM) offers hierarchical and non-hierarchical Bayesian analysis of various computational models and tasks using the Stan software (Carpenter et al., 2016). The hBayesDM function of the hyperbolic model for estimating a single subject's data is dd_hyperbolic_single.
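The hyperbolic value function and the two-option softmax rule can be sketched in a few lines. This is an illustration and not the hBayesDM implementation: the choice rule is assumed to be the standard logistic function of the inverse-temperature-weighted difference between the LL and SS values, and the function names are hypothetical.

```python
import math

def hyperbolic_value(amount, delay, k):
    """Subjective value V = A / (1 + k * D)."""
    return amount / (1.0 + k * delay)

def prob_choose_ll(ss_amount, ss_delay, ll_amount, ll_delay, k, beta):
    """Probability of choosing the later-larger (LL) option.

    Assumed form: logistic(beta * (V_LL - V_SS)), computed in a
    numerically stable way so that large value differences do not overflow.
    """
    v_ss = hyperbolic_value(ss_amount, ss_delay, k)
    v_ll = hyperbolic_value(ll_amount, ll_delay, k)
    x = beta * (v_ll - v_ss)
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)
```

For example, with k = 0.01, $800 delayed by 100 days has a subjective value of $400, so a choice against $400 now yields a probability of 0.5 for any beta.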
Note that updating in our ADO framework is based on each participant's data only. Thus, for fair comparisons between the ADO and staircase methods, we used an individual (non-hierarchical) Bayesian approach for the analysis of data from the staircase method. In ADO sessions, the means and SDs of the posterior distributions of the hyperbolic model's parameters are automatically estimated on each trial. Note that estimation of the discounting rate (k) was of primary interest in this project. Estimates of the inverse temperature rate, β (a measure of response consistency or a degree of exploration/exploitation), are provided in the Supplemental Figures, but will not be discussed further.
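The trial-by-trial logic of ADO (select the most informative design, observe a choice, update the posterior) can be illustrated with a grid-based sketch. Everything specific here is an assumption made for illustration: the parameter and design grids, the uniform prior over the grid, the use of mutual information as the utility, and all function names. It is not the code used in the experiments.

```python
import numpy as np

LL_AMOUNT = 800.0  # later-larger reward, fixed as in the task description

# Hypothetical grids (ranges and resolutions are illustrative only).
k_grid = np.logspace(-5, 0, 50)         # discounting rate k
beta_grid = np.linspace(0.1, 5.0, 20)   # inverse temperature
K, B = np.meshgrid(k_grid, beta_grid, indexing="ij")
delays = np.array([7.0, 14.0, 30.0, 180.0, 365.0, 1095.0, 3650.0])  # days
ss_amounts = np.arange(12.5, 800.0, 12.5)  # sooner reward, delivered now
designs = [(d, a) for d in delays for a in ss_amounts]

def p_ll(delay, ss_amount, k, beta):
    """P(choose LL); tanh form of the logistic function avoids overflow."""
    v_ll = LL_AMOUNT / (1.0 + k * delay)
    return 0.5 * (1.0 + np.tanh(0.5 * beta * (v_ll - ss_amount)))

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def select_design(posterior):
    """Return the design with the largest mutual information between choice and parameters."""
    best, best_mi = None, -np.inf
    for delay, ss in designs:
        p = p_ll(delay, ss, K, B)
        marginal = np.sum(posterior * p)
        mi = binary_entropy(marginal) - np.sum(posterior * binary_entropy(p))
        if mi > best_mi:
            best, best_mi = (delay, ss), mi
    return best

def update_posterior(posterior, design, chose_ll):
    """Bayes rule on the grid after observing one choice."""
    p = p_ll(design[0], design[1], K, B)
    likelihood = np.clip(p if chose_ll else 1.0 - p, 1e-12, 1.0)
    updated = posterior * likelihood
    return updated / updated.sum()

def posterior_mean_sd_k(posterior):
    """Posterior mean and SD of k, marginalizing over beta."""
    marginal_k = posterior.sum(axis=1)
    mean = np.sum(marginal_k * k_grid)
    sd = np.sqrt(np.sum(marginal_k * (k_grid - mean) ** 2))
    return mean, sd

def simulate_session(true_k, true_beta, n_trials=20, seed=0):
    """Run ADO against a simulated participant; returns the final (mean, SD) of k."""
    rng = np.random.default_rng(seed)
    posterior = np.full(K.shape, 1.0 / K.size)  # uniform prior over the grid
    for _ in range(n_trials):
        design = select_design(posterior)
        chose_ll = rng.random() < p_ll(design[0], design[1], true_k, true_beta)
        posterior = update_posterior(posterior, design, chose_ll)
    return posterior_mean_sd_k(posterior)
```

Calling `simulate_session(0.01, 1.0)` runs a 20-trial session against a simulated participant. The posterior SD of k returned by `posterior_mean_sd_k` corresponds to the kind of precision measure reported for ADO sessions, although the values depend entirely on the assumed grids and prior.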

Experiment 2 (patients meeting criteria for a substance use disorder)
In Experiment 2, we recruited 35 patients meeting DSM-V criteria for a substance use disorder (SUD).
Delay discounting task and computational modeling. The task and methods for computational modeling in Experiment 2 were identical to those in Experiment 1. For a subset of participants in Experiment 2 (15 out of 35), the upper bound for the discounting rate (k) during ADO was set to 0.1 for computing efficiency, and we noted that some participants' k values reached the ceiling (=0.1). For the other participants (n=20), the upper bound was set to 1. We report results that are based on all 35 patients (Figure 2A & 2B) as well as results without participants whose k values reached the ceiling of 0.1 (Figure S9).

Experiment 3 (large online sample)
In Experiment 3, we evaluated the durability of the ADO method, assessing it in a less controlled environment than the preceding experiments and with a larger and broader sample of the population (808 Amazon Mechanical Turk workers). Each participant completed two ADO sessions, each of which consisted of 20 trials, a number estimated from Experiments 1 and 2 to be sufficient. All participants received detailed information about the study protocol and gave written informed consent in accordance with the Institutional Review Board at The Ohio State University, OH, USA.

Results
Past work customized the staircase method to yield very good TRR (Green & Myerson, 2004). In visits 1 and 2 of Experiment 1 (college students), we obtained mean values of 0.910 and 0.932, respectively. Nevertheless, ADO bested this performance, yielding values of 0.964 and 0.977, an improvement of approximately 10% in terms of the amount of variance accounted for (Figures S1 and S2; Figures S3 and S4 show the results for all participants, including the outliers).
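The "approximately 10%" figure can be checked by hand, assuming that the amount of variance accounted for is the squared correlation coefficient (an interpretation, since the computation is not spelled out in the text):

```latex
\begin{align*}
\text{Visit 1:}\quad r_{\mathrm{ADO}}^2 - r_{\mathrm{SC}}^2 &= 0.964^2 - 0.910^2 \approx 0.929 - 0.828 = 0.101 \\
\text{Visit 2:}\quad r_{\mathrm{ADO}}^2 - r_{\mathrm{SC}}^2 &= 0.977^2 - 0.932^2 \approx 0.955 - 0.869 = 0.086
\end{align*}
```

Under this reading, ADO accounts for roughly 9-10 percentage points more variance than the staircase method in each visit.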
Where ADO most clearly outperforms the staircase method is in efficiency and precision. We measured the efficiency of each method by calculating how many trials are required to achieve the maximum TRR, which was assessed cumulatively at each trial (Figure 1). With ADO, we achieved over 0.95 TRR within 10-20 trials at visit 1. At visit 2, TRR exceeded 0.95 within 10 trials. With the staircase method, TRR failed to reach 0.9 even at the end of the experiment (42 trials) at visit 1, and reached 0.9 only after 39 trials at visit 2. ADO yielded approximately 3-5 times more precise estimates of the discounting rate, as measured by the smaller standard deviation of the posterior distribution of the parameter k (ADO visit 1: 0.122, visit 2: 0.098; SC visit 1: 0.413, visit 2: 0.537; Figure S5).
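The cumulative TRR curve and the trials-to-criterion measure of efficiency can be sketched as follows. The sketch assumes that each session yields one estimate of k per participant after every trial, stored as a participants-by-trials array; the function names are hypothetical, and any transformation of k applied in the original analysis is not reproduced here.

```python
import numpy as np

def cumulative_trr(est_session_a, est_session_b):
    """Pearson correlation across participants at each trial.

    Both inputs have shape (n_participants, n_trials); column t holds the
    estimate of k based on trials 1..t+1 of that session.
    """
    a = np.asarray(est_session_a, dtype=float)
    b = np.asarray(est_session_b, dtype=float)
    return np.array([np.corrcoef(a[:, t], b[:, t])[0, 1]
                     for t in range(a.shape[1])])

def trials_to_criterion(trr, criterion=0.9):
    """First trial (1-indexed) at which TRR reaches the criterion, or None."""
    hits = np.flatnonzero(np.asarray(trr) >= criterion)
    return int(hits[0]) + 1 if hits.size else None
```

Comparing `trials_to_criterion` between methods for the same criterion gives an efficiency ratio of the kind reported in the text.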
ADO also showed superior performance when examined across visits separated by one month (Figure S6). TRR measures converged at around 0.8 within 10 trials and were highly consistent with each other. In contrast, with the staircase method, the trajectories of the four measures were much more variable and asymptoted, if at all, below 0.8 towards the end of the experiment. The results of Experiment 1 show that ADO leads to rapid, reliable, and precise measures of the discounting rate.

Figures 2A and 2B show that even in the patient population (Experiment 2, patients with a SUD), ADO still led to rapid, reliable, and precise estimates of discounting rates, again outperforming the staircase method. With ADO, maximum TRR was 0.976, and it reached this value within approximately 15 trials. Consistent with the results of Experiment 1, the staircase method led to a smaller maximum TRR (0.899), and it took on average 25 trials to reach this maximum (Figure S7). Precision of the parameter estimate was five times higher when using ADO than when using the staircase method (0.073 vs. 0.371). Figure S8 shows the results for all participants, including the outliers, in Experiment 2. While the upper bound of k was set to 0.1 for 15 patients and some patients' k values reached the ceiling, Figure S9 suggests that the results largely remain the same whether we exclude those patients or not.
In Experiment 3 (Amazon Mechanical Turk workers), ADO again led to an excellent maximum TRR (0.965), exceeding 0.9 within 10 trials, as shown in Figure 2C-D. Figure S10 shows the results for all participants, including outliers. Relative to the staircase method, ADO was approximately 3-8 times more efficient (fewer trials required to reach maximum or 0.9 TRR). As might be expected, when tested in a less controlled environment (Experiment 3), precision suffers (0.371), being more comparable to that found with the staircase method, while reliability and efficiency hardly change.

Discussion
In three different populations, we have demonstrated that ADO led to highly reliable, precise, and rapid measures of discounting rate. ADO outperformed the staircase method in college students (Experiment 1) and in patients meeting DSM-V criteria for SUDs (Experiment 2). It held up very well in a less restrictive testing environment with a broader sample of the population (Experiment 3). The results of this study are consistent with previous studies employing ADO (Cavagnaro et al., 2016; Hou et al., 2016), showing improved precision and efficiency. This is the first study demonstrating the advantages of ADO-driven delay discounting in healthy controls and psychiatric/online populations.
The staircase method is an impressive heuristic method that delivers such good TRR (close to 0.90 in our study) that there is little room for improvement. Nevertheless, ADO is able to squeeze out additional information to increase reliability further. Where ADO excels relative to the staircase method is in precision and efficiency. The model-guided Bayesian inference that underlies ADO is responsible for this improvement. Unlike the staircase method, which follows a simple rule of increasing or decreasing the value of a stimulus, ADO has no such constraint, choosing the stimulus that is expected to be most informative on the next trial. Trial after trial, this flexibility pays significant dividends in precision and efficiency, as the results of the three experiments show.
The benefits of ADO also come with costs. For example, trials that are most informative can be ones that are also difficult for the participant (Ahn & Busemeyer, 2016). Repeated presentation of difficult trials can frustrate and fatigue participants. Another issue is that for participants who respond consistently, the algorithm will quickly narrow to a small region of the design space and present the same trials repeatedly with the goal of improving precision even further. It is therefore important to implement measures that mitigate such behavior. We did so in the present experiment by inserting easy trials among difficult ones once the design space narrowed, keeping the total number of trials fixed. Another approach is to implement stopping criteria, such as ending the experiment once parameter estimation stabilizes for three consecutive trials.
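A stopping criterion of this kind could take the following form. This is a sketch under assumptions: "stabilizes" is taken to mean that the estimate changes by no more than a relative tolerance on each of the last three trials, and the 5% tolerance and the function name `should_stop` are hypothetical choices, not values from the study.

```python
def should_stop(estimates, tol=0.05, window=3):
    """Return True once the last `window` trial-to-trial changes are all small.

    estimates: running list of the parameter estimate after each trial.
    A change counts as small when it is at most `tol` times the previous
    estimate's magnitude (relative tolerance, an assumed definition).
    """
    if len(estimates) < window + 1:
        return False
    recent = estimates[-(window + 1):]
    return all(abs(curr - prev) <= tol * abs(prev)
               for prev, curr in zip(recent, recent[1:]))
```

In an adaptive session, the check would be made after each posterior update, and the task would end at the first trial for which it returns True or at a fixed maximum number of trials, whichever comes first.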
The ADO and staircase methods are different versions of a task, and as such led to slightly different values of discounting rates: the correlation between k from ADO and k from the staircase method is around 0.7. That the association is not higher should not be surprising. As mentioned above, ADO is more flexible than the staircase method in the design choices selected from trial to trial. While the staircase method is constrained to choosing among a few neighboring designs, there are no such limitations on ADO. This difference in flexibility will impact the final parameter estimate, especially in a short experiment. While we cannot say whether estimates using ADO are closest to individuals' true internal states, its high consistency within and especially across visits (Figure S4) demonstrates a degree of trustworthiness.
While we believe that ADO is an exciting, promising method that offers the potential to advance the current state of the art in precision medicine and computational psychiatry, in all fairness, we should mention a few major challenges and limitations in its practical implementation. One is the requirement of ADO that a computational/mathematical model of the experimental task is available. Also, the model should provide a good account (fit) of choice behavior; otherwise ADO might lead to even poorer TRR or other psychometric measures. We believe the success of ADO in the delay discounting task is partly thanks to the availability of a reasonably good and simple hyperbolic model with just two free parameters. The mathematical details of ADO and programming code for ADO experiments can be another serious hurdle. To reduce such barriers and allow even users with limited knowledge of ADO algorithms to utilize ADO in their research, we are developing user-friendly tools (Python-based package, web-based and smartphone platforms) for the research and clinical community.
Lastly, while we demonstrate the promise of an ADO method only in the area of delay discounting in this work, our methodology can be easily extended to other cognitive tasks that are of interest to researchers in psychiatry, psychology, decision neuroscience, and related fields where experimentation is at the core of scientific advances. For example, we can apply ADO to tasks involving value-based or social decision making, including choice under risk and ambiguity (Levy, Snell, Nelson, Rustichini, & Glimcher, 2010) and social interactions (e.g., Xiang, Lohrenz, & Montague, 2013). In addition, ADO can be used to optimize the sequence of stimuli and improve functional magnetic resonance imaging (fMRI) measurement (Bahg et al., 2018), which will reduce the cost of data acquisition and improve the quality of neuroimaging data.
In conclusion, the results of the current study suggest that machine-learning based tools such as ADO can improve the measurement of latent neurocognitive processes and thereby assist in the development of assays for precision medicine in mental health and more generally advance measurement in the behavioral sciences.
[Figure panels (A) and (B): Staircase, across two visits; ADO, across two visits]
Experiment 2 was conducted to assess the performance of ADO in a clinical population of patients with a substance use disorder (SUD). The experimental design was the same as in Experiment 1 except that there was only a single visit.
Participants. Thirty-five individuals meeting Diagnostic and Statistical Manual of Mental Disorders (5th ed.; DSM-V) criteria for a substance use disorder and receiving treatment for addiction problems participated in the experiment (25 males and 10 females; age range 22-57 years; mean 35.8, SD 10.3 years). All patients were recruited through in-patient units at The Ohio State University Wexner Medical Center, where they were seeking treatment for their addiction problems. All patients received the Structured Clinical Interview for DSM-V Axis I disorders (SCID-I), which was conducted by trained graduate students and a study coordinator (Y.S.). Final diagnostic determinations were made by Woo-Young Ahn on the basis of patients' medical records and the SCID-I interview. Exclusion criteria for all individuals included head trauma with loss of consciousness for over 5 minutes, a history of psychotic disorders, a history of seizures or electroconvulsive therapy, and neurological disorders. Participants received gift cards for their participation (worth $10/hr).

For Experiment 3, 808 individuals were recruited through Amazon Mechanical Turk (MTurk; 353 males and 418 females; 37 individuals declined to report their sex; age mean 35.0, SD 10.8 years). They were required to reside in the United States and be at least 18 years of age, and received $10/hr for their participation. Out of 808 participants, 71 participants (8.78%) were excluded based on the exclusion criteria (see Experiment 1).
Delay discounting task. Each participant completed two consecutive ADO-based tasks, each of which consisted of 20 trials (cf. 42 trials per session in Experiments 1 and 2). There was no break between the two tasks, so participants experienced the experiment as a single session. The task was identical to the ADO version in Experiment 1.

Table 1 summarizes the results across the three experiments. Comparison of the two

Table 1. Comparison of ADO and Staircase in their reliability, precision, and efficiency of estimating temporal discounting rates.
Comparison of ADO and Staircase (SC) test-retest reliability of temporal discounting rates when assessed cumulatively in each trial (ADO) or every third trial (SC) (Experiment 1, college students) over two visits separated by approximately one month. In each visit, a participant completed two ADO sessions and two SC sessions. Test-retest reliability was assessed cumulatively in each trial.