Using smartphones to optimise and scale-up the assessment of model-based planning

Model-based planning is thought to protect against over-reliance on habits. It is reduced in individuals high in compulsivity, but effect sizes are small and may depend on subtle features of the tasks used to assess it. We developed a diamond-shooting smartphone game that measures model-based planning in an at-home setting, and varied the game’s structure within and across participants to assess how it affects measurement reliability and validity with respect to previously established correlates of model-based planning, with a focus on compulsivity. Increasing the number of trials used to estimate model-based planning did remarkably little to affect the association with compulsivity, because the greatest signal was in earlier trials. Associations with compulsivity were stronger when transition ratios were less deterministic, and varied depending on the reward drifts utilised. These findings suggest that model-based planning can be measured at home via an app, can be estimated in relatively few trials using certain design features, and can be optimised for sensitivity to compulsive symptoms in the general population.

To check whether this changed the meaning of the responses unduly, we compared the depression and STAI scores in our sample of N=1451 to previously reported associations using the correct version. Prior research found an association between SDS depression scores and STAI scores of r=.84 in 495 individuals 3. In line with this, we found an association between SDS depression scores and STAI scores of r=.88. Additionally, as reported in the main text, factors derived from our dataset, which included participants with the incorrect version, all correlated in excess of .9 with factors drawn from a dataset using the correct version, providing comfort that this error had no substantial impact on findings.

Supplementary Method 3. Computational Modelling
The basic reinforcement-learning (RL) models used in this study were based on those developed and refined across a range of studies modelling behaviour in the classic version of this task. We considered three models previously tested in the context of the validity of model-based assessments 4. These were adapted to the fact that our task does not include a second-stage decision, as per 5. For all models, Q values refer to the expected value of a given state. The first-stage states are the two containers that can be selected, $a_{i,t}$, where i refers to the "mostly pink" or "mostly purple" container and t refers to the trial. The second-stage states correspond to the colour of the ball fired by the cannon, pink or purple, denoted $s_{j,t}$, where j refers to pink or purple. Reward on a given trial is defined as the 'ball quality', i.e. whether or not it explodes ('good' or 'dud') prior to hitting its target, denoted $r_t$.
Model A: This model is based on the original paper describing this task 6 and consists of 5 free parameters: ω (the relative contribution of model-based and model-free Q values to choice), α1 (the stage 1 learning rate), α2 (the stage 2 learning rate), ρ (the perseveration or 'stickiness' parameter) and β (the inverse temperature).
The model-free algorithm updates the Q values, $Q_{MF}$, for the two container choices ($a_{i,t}$) at stage 1 as follows:

$Q_{MF}(a_{i,t+1}) = Q_{MF}(a_{i,t}) + \alpha_1\delta_{1,t} + \alpha_1\delta_{2,t}$,

where $\delta_{1,t}$ is the reward prediction error (RPE) associated with the chosen container ('first stage' in the classic task):

$\delta_{1,t} = Q(s_{j,t}) - Q_{MF}(a_{i,t})$,

and $\delta_{2,t}$ is the RPE associated with the ball (pink/purple) that was released ('second stage' in the classic task):

$\delta_{2,t} = r_t - Q(s_{j,t})$.

The Q values for the two ball colours are updated according to:

$Q(s_{j,t+1}) = Q(s_{j,t}) + \alpha_2\delta_{2,t}$.

The model-based algorithm applies a transition function, using the known probabilities of a given ball being produced following a given container choice, to ensure that terminal rewards are correctly assigned to initial choices. We assumed full knowledge of the transitions in this task (80/20), as the proportion of pink/purple balls for each side was explicitly stated to subjects during the instructions and was visible throughout the experiment as the balls jumped around the two containers. The model-based Q values for each container, $Q_{MB}(a_{i,t})$, were thus computed as:

$Q_{MB}(a_{i,t}) = P(s_{pink} \mid a_i)\,Q(s_{pink,t}) + P(s_{purple} \mid a_i)\,Q(s_{purple,t})$.

To connect the values to the choice of container, we use a softmax choice rule, which assigns a probability to each action according to the weighted combination of the MB and MF estimates:

$Q_{net}(a_{i,t}) = \omega\,Q_{MB}(a_{i,t}) + (1-\omega)\,Q_{MF}(a_{i,t})$.

The probability of choosing each of the two containers is calculated, accordingly, as:

$P(a_{i,t}) = \frac{\exp\big(\beta\,[Q_{net}(a_{i,t}) + \rho\,\mathrm{rep}(a_i)]\big)}{\sum_{a'}\exp\big(\beta\,[Q_{net}(a',t) + \rho\,\mathrm{rep}(a')]\big)}$,

where $\mathrm{rep}(a_i) = 1$ if container $a_i$ was chosen on the previous trial and 0 otherwise.

Model B: This model is based on that described in Otto, et al. 7 and consists of 4 free parameters: α (a single learning rate that is applied to both stages), ρ (the perseveration or 'stickiness' parameter), βMF (the model-free inverse temperature) and βMB (the model-based inverse temperature). The model has two additional features compared to Model A: (i) a rescaling of the rewards and first-stage values ($q_{t1}$) entering the RPE terms, which is designed to reduce parameter collinearity, and (ii) a decay or 'forgetting' of unselected options (based on 1-α). The latter is implemented by multiplying the unchosen model-free Q values by 1-α, as follows:

$Q_{MF}(a_{i',t+1}) = Q_{MF}(a_{i',t}) \times (1-\alpha)$ and $Q(s_{j',t+1}) = Q(s_{j',t}) \times (1-\alpha)$,

where $i'$ indexes the unchosen container and $j'$ the unvisited ball colour. Because the model-based and model-free values carry separate inverse temperatures, the choice rule is implemented as follows:

$P(a_{i,t}) = \frac{\exp\big(\beta_{MB}\,Q_{MB}(a_{i,t}) + \beta_{MF}\,Q_{MF}(a_{i,t}) + \rho\,\mathrm{rep}(a_i)\big)}{\sum_{a'}\exp\big(\beta_{MB}\,Q_{MB}(a') + \beta_{MF}\,Q_{MF}(a') + \rho\,\mathrm{rep}(a')\big)}$.

Model C: This model is based on Decker, et al. 8 and is identical to Model B, except that it omits the decay of unchosen values and the rescaling of rewards.
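For concreteness, the updates above can be sketched in a few lines of R. This is an illustration only, not the study's analysis code: the toy data frame trials, all variable names, and the assumption that the second-stage RPE is carried back to the first stage with an eligibility trace fixed at 1 are ours.

# Model A sketch for a single agent (illustrative parameter values).
w <- 0.5; a1 <- 0.3; a2 <- 0.3; rho <- 0.1; beta <- 5

# Toy trial data: chosen container (1/2), ball colour (1 = pink, 2 = purple),
# outcome (1 = good ball, 0 = dud).
trials <- data.frame(choice = c(1, 2, 1), state = c(1, 2, 2), reward = c(1, 0, 1))

q_mf <- c(0, 0)                     # model-free values of the two containers
q_s  <- c(0, 0)                     # values of the two ball colours
trans <- matrix(c(0.8, 0.2,         # known P(ball colour | container)
                  0.2, 0.8), nrow = 2, byrow = TRUE)
prev <- 0                           # previous choice (none yet)

for (t in seq_len(nrow(trials))) {
  q_mb  <- as.vector(trans %*% q_s)              # model-based container values
  q_net <- w * q_mb + (1 - w) * q_mf             # weighted MB/MF combination
  q_net <- q_net + rho * (seq_len(2) == prev)    # perseveration bonus
  p_choice <- exp(beta * q_net) / sum(exp(beta * q_net))  # softmax; this would
                                                 # enter the likelihood when fitting
  i <- trials$choice[t]; j <- trials$state[t]; r <- trials$reward[t]
  d1 <- q_s[j] - q_mf[i]                         # first-stage RPE
  d2 <- r - q_s[j]                               # second-stage RPE
  q_mf[i] <- q_mf[i] + a1 * d1 + a1 * d2         # eligibility trace fixed at 1
  q_s[j]  <- q_s[j] + a2 * d2
  prev <- i
}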

Model D:
This model is identical to Model B but without the reward-rescaling feature (equivalently, it is the same as Model C but with a decay of unchosen values).

Model E:
This is identical to Model D with an additional free parameter, such that the decay or 'forgetting' of the value of unselected options is governed by a separate parameter αD rather than by 1-α.

Model F: This is identical to Model D but with rescaling applied to reward values only (not to the first-stage values $q_{t1}$).
Model G: This is identical to Model D but with separate learning rates for rewarded and unrewarded trials.

Group-Level Modelling
For each of the models above, hierarchical Bayesian estimation was carried out using Markov chain Monte Carlo (MCMC) techniques (specifically the No-U-Turn variant of Hamiltonian Monte Carlo) as implemented in the Stan modelling language. All parameters were specified as normally distributed; parameters with (0,1) ranges (α1, α2, αD, and ω) were then inverse-logit transformed. Means of inverse temperature parameters (β1, βMF, and βMB) were constrained to be greater than 0. Weakly informative priors were used: for means, normal(0,2.5) for α1, α2, αD, ω, and ρ, or normal(0,5) for β1, βMF, and βMB; for scales, half-Cauchy(0,1), constrained to be greater than 0, for α1, α2, αD, ω, ρ, and βMF, or half-Cauchy(0,2), constrained to be greater than 0, for β1 and βMB. For model selection, we ran four chains of 4,000 samples each, discarding the first 2,000 samples of each chain as burn-in. When modelling the larger dataset, we reduced this to 2,000 samples total for speed.
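To make the sampler settings concrete, the model-selection runs could be launched from R roughly as follows. This is a minimal sketch: the Stan file name cannonblast_modelD.stan, the data-list structure, and the placeholder matrices are assumptions for illustration, not the study's actual code.

library(rstan)

n_subj   <- 100   # model-comparison subset
n_trials <- 200

# Placeholder data list; the real field names depend on the Stan program.
stan_data <- list(N = n_subj, T = n_trials,
                  choice = matrix(1L, n_subj, n_trials),  # container choices (1/2)
                  state  = matrix(1L, n_subj, n_trials),  # ball colours (1/2)
                  reward = matrix(0L, n_subj, n_trials))  # outcomes (0/1)

# Four chains of 4,000 iterations, first 2,000 of each discarded as burn-in;
# Stan's default sampler is the No-U-Turn variant of Hamiltonian Monte Carlo.
fit <- stan(file = "cannonblast_modelD.stan", data = stan_data,
            chains = 4, iter = 4000, warmup = 2000)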

Model Comparison
To determine the best model, we ran each model on a subset of N=100 of the total sample. We compared models using a range of indicators of validity, with a summary presented in the table below.

(i) MCMC diagnostics
We first examined the chains visually for convergence and also computed Gelman and Rubin's (1992) potential scale reduction factors. For this diagnostic, large values indicate convergence problems, whereas values near 1 are consistent with convergence. We ensured that these diagnostics were less than 1.05 for all variables. Model A had 78 divergent transitions and Model B had 371; Models C and D had no divergent transitions, Model E had 1, and Model F had 2.
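In rstan, these checks can be run with a couple of utility calls (a sketch, reusing the hypothetical fit object from the example above):

rhats <- summary(fit)$summary[, "Rhat"]  # Gelman-Rubin potential scale reduction factors
all(rhats < 1.05, na.rm = TRUE)          # TRUE if every variable passes the 1.05 cut-off
get_num_divergent(fit)                   # number of divergent transitions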

(ii) LOOIC
We compared the fit of each model using the LOOIC (loo package in R). The lowest LOOIC was found for Model D, followed closely by Models E and F, then B.
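With the loo package, this comparison looks roughly as follows (a sketch: fit_d and fit_e are hypothetical stanfit objects for Models D and E, and the Stan programs are assumed to save pointwise log-likelihoods under the name log_lik):

library(loo)

loo_d <- loo(extract_log_lik(fit_d, parameter_name = "log_lik"))
loo_e <- loo(extract_log_lik(fit_e, parameter_name = "log_lik"))
loo_compare(loo_d, loo_e)   # differences in elpd_loo; LOOIC = -2 * elpd_loo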

(iii) Posterior Predictive Checks (PPC)
We examined the trial-by-trial predictions for subjects' choices ('y_pred') derived from each model and tested how closely they mirrored actual choices per subject ('y_actual'). For each subject, we calculated the proportion of trials on which the predicted choice matched the actual one and summarised that consistency metric across the entire sample (mean, min, max, and proportion of subjects with consistency >80%). Model G performed best considering all PPC metrics combined, but differences across models were relatively modest.
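The consistency metric reduces to a few lines of R (a sketch; y_pred and y_actual are assumed to be subjects x trials matrices, with y_pred holding, for example, the modal posterior-predicted choice per trial):

consistency <- rowMeans(y_pred == y_actual)        # per-subject proportion predicted correctly
c(mean = mean(consistency), min = min(consistency),
  max = max(consistency), prop_gt_80 = mean(consistency > 0.80))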

(iv) Parameter Recovery
We tested parameter recovery by simulating new choices from agents operating under the estimated median parameters of our N=100 subjects. We then fitted our models to these simulated choices to test whether we could recover the true data-generating parameters, computing correlation coefficients as a measure of recovery. Models B and F showed the strongest recovery of model-based betas at 0.74, but rates were similar: 0.73 for Model G, 0.72 for Model E, 0.71 for Model D and 0.70 for Model C.
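The recovery check itself amounts to correlating generating and recovered values parameter by parameter (a sketch; true_pars and recovered_pars are hypothetical subjects x parameters matrices with matching column names):

sapply(colnames(true_pars),
       function(p) cor(true_pars[, p], recovered_pars[, p]))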
Based on the combination of validation checks, we progressed with Model D for comparison against hierarchical logistic regression (HLR) and the point-estimate (PE) approach. Most notably, Model D was one of the few models with no divergent transitions, had the lowest LOOIC, and performed strongly, if not best, on all other metrics.

Supplementary Method 4. Comparing different estimation approaches
There are various ways to analyse data from this task. Hierarchical logistic regression (HLR) models are a popular choice, but alternatives range from extremely computationally cheap point estimates (PE) to expensive generative models fitted with hierarchical Bayesian modelling (HB). We tried these alternatives and selected the best approach in a data-driven manner, guided by the reliability and external validity of the resulting measures.
Point estimates were calculated as the sum of the probabilities of staying following [common, rewarded] and [rare, unrewarded] trials minus the sum of the probabilities of staying following [common, unrewarded] and [rare, rewarded] trials. Using PE could cut down the computational demand and time needed to calculate estimates, and so may be ideally suited to app-based implementation. However, research suggests point estimates suffer from poor reliability because they fail to consider trial-by-trial individual variability 9. Both HLR and HB methods account for uncertainty at the level of individual subjects and, in the context of the two-step task, have been shown to perform similarly well 4. For the HB approach, we fitted candidate models and selected the best to bring forward for comparison with these other approaches (as described in Analysis S3).
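In R, the point estimate reduces to a handful of lines (a sketch; the data frame dat and its column names are assumptions, with one row per trial and columns subject, stay (0/1), common (TRUE/FALSE) and rewarded (0/1)):

point_estimate <- function(d) {
  p_stay <- function(com, rew) mean(d$stay[d$common == com & d$rewarded == rew])
  (p_stay(TRUE, 1) + p_stay(FALSE, 0)) - (p_stay(TRUE, 0) + p_stay(FALSE, 1))
}
pe <- sapply(split(dat, dat$subject), point_estimate)  # one PE per participant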
We operationalised (i) the internal consistency of the different measures as Cronbach's alpha comparing model-based planning estimates derived from participants' first 100 trials with those from their second 100. We tested this in the N=1451 individuals who had complete demographic and mental health data. Note that we did not use an odd/even split here, because HB relies on information accumulated gradually over many trials occurring in series. We also tested (ii) test-retest reliability in a subset of N=423 who completed two sets of 200 trials of Cannon Blast. Finally, we assessed (iii) the external validity of these measures by testing the relationship between estimates of model-based planning and sociodemographic differences (N=1451).
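These metrics can be computed along the following lines (a sketch; the vectors mb_first100, mb_second100, mb_session1 and mb_session2 are hypothetical per-participant model-based estimates). Note that with exactly two halves, standardised Cronbach's alpha is the Spearman-Brown step-up of the half-half correlation:

split_r <- cor(mb_first100, mb_second100)   # split-half correlation
alpha   <- 2 * split_r / (1 + split_r)      # standardised alpha for two halves

library(psych)
ICC(cbind(mb_session1, mb_session2))        # test-retest; the ICC1 row is the relevant one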
As outlined above, we compared three methods of calculating model-based estimates: a point estimate approach (PE), hierarchical logistic regression (HLR) and hierarchical Bayesian modelling (HB).The three analytic approaches produced highly correlated estimates (HLR vs PE: r=.94, HLR vs HB: r=.82, PE vs HB: r=.74).
Using the split-half method, where model-based scores from the first 100 trials were compared with model-based estimates from the second 100 trials, we found that HLR demonstrated the greatest, though only fair, reliability (r=.56, Table S18). Point estimates and computational modelling also demonstrated fair split-half reliability of model-based estimates (PE: r=.48, HB: r=.44). For test-retest reliability, HLR again showed fair reliability (ICC1=.54), point estimates likewise demonstrated fair test-retest reliability (ICC1=.63), while the hierarchical Bayesian model showed poor reliability (ICC1=.29). In terms of external validity, HLR demonstrated the strongest signal between model-based planning and individual differences (Table S19).

Result: HLR outperforms PE and HB across all metrics
Participant Information Sheet

The Neureka app comes from a group of scientists at the Global Brain Health Institute who are trying to uncover new ways to detect and prevent disorders of brain health. The brain is a big mystery, and we need the help of people like you to solve it.
At Neureka, we think that everyone should be able to participate in science; that's why we are taking experiments out of the lab and onto your smartphone. Our team have been hard at work for the past year making games that are not only fun to play, but also help us learn about the brain at the same time.
We aim to use your anonymous data, combined with data from other players around the world, to conduct one of the biggest neuroscience studies of all time!

What will happen if I agree to take part?
You will be asked to provide an email address and a password in order to register an account on the app. We need your email address so that you can sign into the app across different devices, and so that we can contact you in future about your contribution to science and any future studies you may want to participate in. You can turn email communication off at any time by changing your settings.
After providing your email, you will be asked to complete some basic demographic information and a 'science challenge', where you will play games that tell us things about how your brain works and provide us with some personal information about your physical and mental health, lifestyle and family history. Once complete, you will be free to play the games whenever you like and to complete additional science challenges at your leisure.

How long will it take to participate?
Science challenge 1 takes approximately 1 hour, and it doesn't need to be completed in one go. After that, you can play each game as much or as little as you like. Likewise, you can provide us with as much information about yourself as you like in our 'about me' section. Other science challenges may take longer, with some of them asking you to play the games according to a schedule over days, weeks or even months. These are entirely optional.

What data will I be sharing?
Your email address, the type of device on which you used the app (e.g. iPhone 6), your performance on the games that you choose to play, the demographic information you will be asked to provide during sign-up, and any questionnaires that you complete within the app.

What will happen to my data?
We will use the data you provide through the app for academic research only. We will share fully anonymised data with other university researchers to help with the collective research effort of improving early detection of dementia and mental health disorders. We will never share personally identifiable information, such as your email address, with anyone. We will publish the findings that your contribution has helped generate in scientific journals and make them available for you to read and share online at www.neureka.ie.

Can I access my data?
You are free to download all the data that you have submitted through the app by tapping "settings"->"GDPR"->"download my data". A password will be required to do this to ensure your privacy and security, and the data will be sent to the personal email address you provided at the time of sign-up. Under the Freedom of Information Act 2014, you will have access to any data you provide through the app for up to 10 years after you have provided it, up until the point at which it has been fully anonymised and we can no longer recover your unique contribution.

Is it confidential?
Yes. We will not disclose any personally identifiable information (e.g. your unencrypted email address) to any third party. However, there are legal limits to confidentiality in the Republic of Ireland. Specifically, if you provide unsolicited information to the research team pertaining to risk of harm to yourself or another person, this additional information is not protected by confidentiality and may be disclosed to the relevant authorities. Secondly, any information that you provide (solicited or unsolicited) may be disclosed as part of a legal process or police investigation in the Republic of Ireland without your permission being sought.

Is it secure?
All data you provide through the app is encrypted before it is submitted over the internet. The data you send us through this app is stored and processed in accordance with the EU General Data Protection Regulation (GDPR), and our procedures for storing and processing data have undergone and passed review by the data protection officer at Trinity College Dublin.

Gaining credit through SciStarter
We recently partnered with SciStarter, a website that hosts really cool citizen science projects like Neureka. If you hold a SciStarter account under the same email address as you have registered with Neureka, you will automatically receive credit for your participation on SciStarter. This 'linking' of information for registered SciStarter users is done in a fully encrypted way.

Connecting to other studies
Neureka is a powerful tool for engaging people in research, allowing study participants to take part at home and from all over the world. We want to share it with other researchers to help facilitate other important studies, allowing them to compare the same tasks and tools in different populations. If you are part of a study that is using Neureka to gather data, you will be asked by the researchers to enter a Study ID in the settings section of the app. They will provide you with this ID. By providing us with a valid ID, you agree for us to share your data with that research team. The data that we share with approved studies are specific: only the relevant data outlined in the specific consent you have completed for that study are shared (e.g. scores on a questionnaire, or data from one of our games), and they never contain any identifiable information (e.g. email addresses are never shared).

Do I have to take part in the research to use the app?
If you decide that you don't want to take part in the research, then you should go to the "settings"->"GDPR" screen (accessed through the settings button in the top right of the app) and deselect "participate in research". This will stop any new data from your device being submitted to our server. Your use of the app won't be affected by this. If you change your mind, you can reselect "participate in research" at any time.

How do I withdraw from (i.e., get out of) taking part in the research?
Taking part in this research is completely voluntary. You can stop taking part in the research at any time using the process described above, which stops any new data from your device being sent to our server. If you want to delete data that you have previously submitted to us from the app, you can email the research team by tapping "settings"->"GDPR" and selecting "erase data". This will send us an email requesting deletion, and your data will be deleted from the server and from our offline data store held on a password-protected computer in Trinity College Dublin. You will be notified by email once we have completed the process of deleting your data. The only time we won't be able to delete your data is if we have already performed analyses on it and a scientific report/journal article using this data has been submitted, and/or if the data has already been fully anonymised.

How much Mobile Data (i.e. smartphone access to the internet without wifi) does the app use?
Not very much. The average user will use less than 3 megabytes in total during their time using the app. This doesn't include the data used to download or update the app.

Are there any risks involved in participating?
The risks associated with participating in this research are minimal. However, it is possible that you might find the sensitive nature of some of the questions in the questionnaires you choose to complete upsetting. If you experience upset or distress, we encourage you to contact your General Practitioner/Doctor. We also provide information on support services in a dedicated tab within the app.

Are there any benefits to participating?
This study will not benefit you in any direct way. The data that we collect from you will hopefully allow us to develop a tool to better detect early cognitive changes that are markers of future risk of developing dementia and mental health disorders, and in this way your contribution might benefit others.

Will the research team take responsibility for diagnosing app users with dementia or mental health disorders?
No. The games within the app are not diagnostic in isolation, and your basic performance level is not indicative of risk for disorders of brain health. Users will not be provided with any processed scores or individualised predictions of risk of dementia or mental health disorders based on the data that they provide through the app. While we hope that one day some combination of these measures might help us to detect illness early, the science is simply not there yet.

Who is running this study?
The study is being conducted by a research team based at the Global Brain Health Institute at Trinity College Dublin in the Republic of Ireland. The study has received ethical approval from the School of Psychology Research Ethics Committee at Trinity College Dublin.
In taking part and providing consent, you are agreeing to participate in a study that handles your data in line with the Republic of Ireland Data Protection Act 2018, which gives effect to the EU General Data Protection Regulation (GDPR) 2016/679. For information on data protection law in Ireland, see https://dataprotection.ie/docs/A-guide-to-your-rights-Plain-English-Version/r/858.htm

Figure S2. Two-step reinforcement learning task for assessing model-based learning.
This paradigm (adapted from 8) consisted of two stages: in the first stage, subjects chose between two rockets that probabilistically transitioned to one of two second-stage planets. Each rocket travelled to its preferred planet 70% of the time ('common' transition) or to the alternative planet 30% of the time ('rare' transition). In the second stage, subjects had to choose between two aliens, each with a unique probability of yielding a reward ('space treasure') or no reward ('space dust'), which drifted slowly and independently over the course of the experiment along pre-determined trajectories. Participants had to simultaneously track the distinct reward probabilities of each alien and incorporate knowledge of the transition structure in order to maximise their chance of reward.

Table S3. Demographic Characteristics in Experiment 2 in the Whole Sample and the Two Age-, Gender- and Education-Matched Samples.
a 'Non-cisgender' includes those who identify as Transgender Man, Transgender Woman, Non-Binary, or not listed. SD=Standard deviation.

Table S4. Mental Health Summary in Experiment 2 (N=1451): Mean Symptoms and their Association with Age, Gender, and Education.
AES: Apathy Evaluation Scale; AUDIT: Alcohol Use Disorders Identification Test; BIS: Barratt's Impulsivity Scale; EAT: Eating Attitudes Test; LSAS: Liebowitz Social Anxiety Scale; OCD: Obsessive Compulsive Disorder; OCIR: Obsessive Compulsive Inventory Revised; SCZ: Short Scale Measures for Schizotypy; SDS: Self-rated Depression Scale; STAI: Spielberger's Trait Anxiety Inventory. α: Cronbach's alpha. SD=Standard deviation.
a Mean group difference between cisgender men and cisgender women; positive t-values indicate higher scores among men, while negative t-values indicate higher scores among women.
b Positive F values indicate higher scores among those who did not attain third-level education compared with those who obtained third-level education and those who attained greater than third-level education.

Table S6. Descriptive Information of Reward Drift Sets Assigned at Each Block.
Drift Set B was used in participants' first play ('Risk Factors' section of the app). All ten drift sets (A-J) were used in repeated plays of Cannon Blast ('Free Play' section of the app). Mean reward/SD difference = absolute difference between the highest and lowest mean reward/SD probability.

Table S8. Associations between Model-based Planning and Individual Differences (N=5005) and Clinical Associations (N=1451).
* p<.05, ** p<.01, *** p<.001. SE=standard error, OCD=Obsessive Compulsive Disorder. a Independent models controlling for age, gender and education. b Covariate model controlling for age, gender and education.

Table S14. Assessing the Impact of Increasing Trial Number per Participant on Mean-level Model-based Indices and their Reliability (N=716).
M=Mean, SD=Standard Deviation, CI=Confidence Interval. Split-half reliability coefficient assessed using the odd-even approach.