Uncovering the structure of self-regulation through data-driven ontology discovery

Psychological sciences have identified a wealth of cognitive processes and behavioral phenomena, yet struggle to produce cumulative knowledge. Progress is hamstrung by siloed scientific traditions and a focus on explanation over prediction, two issues that are particularly damaging for the study of multifaceted constructs like self-regulation. Here, we derive a psychological ontology from a study of individual differences across a broad range of behavioral tasks, self-report surveys, and self-reported real-world outcomes associated with self-regulation. Though both tasks and surveys putatively measure self-regulation, they show little empirical relationship. Within tasks and surveys, however, the ontology identifies reliable individual traits and reveals opportunities for theoretic synthesis. We then evaluate predictive power of the psychological measurements and find that while surveys modestly and heterogeneously predict real-world outcomes, tasks largely do not. We conclude that self-regulation lacks coherence as a construct, and that data-driven ontologies lay the groundwork for a cumulative psychological science.

A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals).
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.

For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes.
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated.

Clearly defined error bars
State explicitly what error bars represent (e.g. SD, SE, CI).
Our web collection on statistics for biologists may be useful.

Software and code
Policy information about availability of computer code
Data collection
Data collection was accomplished using jsPsych-5.0 and the Experiment Factory.
Data analysis
Data analysis was accomplished using custom Python code available in the GitHub repository for this project. That repository also indicates the specific library versions used in the manuscript (e.g. scikit-learn, statsmodels), as well as less commonly used packages (expfactoryanalysis, fancyimpute). R libraries were also used, including missForest, psych, qgraph and dynamicTreeCut.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
The imputed data underlying the analyses in this work, as well as the task and survey loading matrices, can be found on OSF.

Blinding
Describe whether the investigators were blinded to group allocation during data collection and/or analysis. If blinding was not possible, describe why OR explain why blinding was not relevant to your study.

Behavioural & social sciences study design
All studies must disclose on these points even when the disclosure is negative.

Study description
This is a quantitative cross-sectional study that evaluated performance on many behavioral measures and responses on multiple self-report surveys.

Research sample
Amazon's Mechanical Turk was used for this study. There were two primary rationales: (1) Amazon's Mechanical Turk provides easy access to a more representative sample than our home institution, and (2) performing this study on MTurk was feasible; it would not have been possible to perform this study in the lab.
83.3% of the sample is White, 6.5% Black, and ~10% distributed amongst other categories; 50% is female, and the mean age of the sample is 33.6 years. This is in line with other work finding that MTurk samples are somewhat younger than the US population as a whole, though our sample does have a larger percentage of White respondents compared to other studies on MTurk demography. This larger percentage of White respondents seems driven by a lower percentage of Black respondents compared to the US population as a whole, which has been observed before in other MTurk studies. Thus while our sample is not perfectly representative of the US population as a whole, it is better than other possible convenience-based samples.

Sampling strategy
Participants were drawn from a convenience sample on Amazon's Mechanical Turk. To be eligible, participants had to have previously completed 2000 HITs (Human-Intelligence Tasks on MTurk), have a >95% approval rating, and be living in the US. We initially aimed for a final sample (after QC) of 500, with 200 used as a "discovery" cohort where most analyses would be developed, under the assumption that most correlations observed would have a small or medium effect size. Due to over-enrollment we ended up with a final sample of 522. The final analyses were done on the entire cohort of 522 to ensure that our estimates were as stable as possible.

Data collection
Data was collected using Amazon's Mechanical Turk. No researcher had individual contact with any participant.

Timing
Data was collected from July, 2016 to September, 2016.

Data exclusions
Data were excluded from individual measures if participants failed generic QC steps or failed measure-specific manipulation checks. In addition, outliers were removed, defined as any datapoint more than 2.5 × IQR below the 1st quartile or above the 3rd quartile.
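The 2.5 × IQR rule above can be sketched as follows. This is a minimal illustration, not the project's actual QC code; the function name and sample values are hypothetical:

```python
import numpy as np

def iqr_outlier_mask(values, k=2.5):
    """Flag datapoints more than k*IQR below the 1st quartile
    or more than k*IQR above the 3rd quartile."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

data = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])
mask = iqr_outlier_mask(data)   # only the extreme value 95.0 is flagged
clean = data[~mask]
```

With the default k=2.5 this is a deliberately conservative fence; the common Tukey rule uses k=1.5 and would flag more points.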

Non-participation
Participants were excluded from analyses if they failed to complete the entire measurement battery or if they failed QC on 4 or more individual measures.
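The participant-level exclusion rule above amounts to a simple per-participant filter; a minimal sketch, assuming hypothetical data structures rather than the project's actual code:

```python
def eligible(completed_battery, n_failed_qc, max_failures=3):
    """Keep a participant only if the full battery was completed
    and fewer than 4 individual measures failed QC."""
    return completed_battery and n_failed_qc <= max_failures

participants = {
    "p1": {"completed_battery": True, "n_failed_qc": 0},
    "p2": {"completed_battery": True, "n_failed_qc": 4},   # excluded: QC failures on 4+ measures
    "p3": {"completed_battery": False, "n_failed_qc": 1},  # excluded: incomplete battery
}
kept = [pid for pid, p in participants.items()
        if eligible(p["completed_battery"], p["n_failed_qc"])]
```

Here only "p1" survives the filter; the threshold of 4 failed measures is the one stated above.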

Randomization
Participants were not allocated to separate groups.

Ecological, evolutionary & environmental sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sampling strategy
Note the sampling procedure. Describe the statistical methods that were used to predetermine sample size OR if no sample-size calculation was performed, describe how sample sizes were chosen and provide a rationale for why these sample sizes are sufficient.

Data collection
Describe the data collection procedure, including who recorded the data and how.

Location
State the location of the sampling or experiment, providing relevant parameters (e.g. latitude and longitude, elevation, water depth).

Access and import/export
Describe the efforts you have made to access habitats and to collect and import/export your samples in a responsible manner and in compliance with local, national and international laws, noting any permits that were obtained (give the name of the issuing authority, the date of issue, and any identifying information).

Disturbance
Describe any disturbance caused by the study and how it was minimized.

Reporting for specific materials, systems and methods

Authentication
Describe the authentication procedures for each cell line used OR declare that none of the cell lines used were authenticated.

Mycoplasma contamination
Confirm that all cell lines tested negative for mycoplasma contamination OR describe the results of the testing for mycoplasma contamination OR declare that the cell lines were not tested for mycoplasma contamination.

Commonly misidentified lines (See ICLAC register)
Name any commonly misidentified cell lines used in the study and provide a rationale for their use.

Palaeontology
Specimen provenance
Provide provenance information for specimens and describe permits that were obtained for the work (including the name of the issuing authority, the date of issue, and any identifying information).

Specimen deposition
Indicate where the specimens have been deposited to permit free access by other researchers.

Dating methods
If new dates are provided, describe how they were obtained (e.g. collection, storage, sample pretreatment and measurement), where they were obtained (i.e. lab name), the calibration program and the protocol for quality assurance OR state that no new dates are provided.
Tick this box to confirm that the raw and calibrated dates are available in the paper or in Supplementary Information.

Animals and other organisms
Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research

Laboratory animals
For laboratory animals, report species, strain, sex and age OR state that the study did not involve laboratory animals.

Wild animals
Provide details on animals observed in or captured in the field; report species, sex and age where possible. Describe how animals were caught and transported and what happened to captive animals after the study (if killed, explain why and describe method; if released, say where and when) OR state that the study did not involve wild animals.

Field-collected samples
For laboratory work with field-collected samples, describe all relevant parameters such as housing, maintenance, temperature, photoperiod and end-of-experiment protocol OR state that the study did not involve samples collected from the field.

Human research participants
Policy information about studies involving human research participants
Population characteristics
50% of the sample was female, and the mean age was 33.6 years (25th, 50th, 75th percentiles: 27/32/39).

Recruitment
Participants were recruited through Amazon's Mechanical Turk. Because of our rejection strategy, which excluded any participant who did not complete the entire battery, our final sample was non-randomly drawn from the Mechanical Turk population as a whole. A number of steps were taken to reduce attrition, outlined in the supplementary methods. Overall, attrition was modest, with over 84% of our participants completing the entire battery. Follow-up on the participants who failed to complete the entire battery suggested that they did not significantly differ from our participants in any obvious way. We do not anticipate this self-selected attrition to affect the results.

nature research | reporting summary
April 2018

ChIP-seq
Data deposition
Confirm that both raw and final processed data have been deposited in a public database such as GEO.
Confirm that you have deposited or provided access to graph files (e.g. BED files) for the called peaks.

Data access links
May remain private before publication.
For "Initial submission" or "Revised version" documents, provide reviewer access links. For your "Final submission" document, provide a link to the deposited data.

Files in database submission
Provide a list of all files available in the database submission.
Genome browser session (e.g. UCSC)
Provide a link to an anonymized genome browser session for "Initial submission" and "Revised version" documents only, to enable peer review. Write "no longer applicable" for "Final submission" documents.

Methodology
Replicates
Describe the experimental replicates, specifying number, type and replicate agreement.

Sequencing depth
Describe the sequencing depth for each experiment, providing the total number of reads, uniquely mapped reads, length of reads and whether they were paired- or single-end.

Antibodies
Describe the antibodies used for the ChIP-seq experiments; as applicable, provide supplier name, catalog number, clone name, and lot number.

Peak calling parameters
Specify the command line program and parameters used for read mapping and peak calling, including the ChIP, control and index files used.

Data quality
Describe the methods used to ensure data quality in full detail, including how many peaks are at FDR 5% and above 5-fold enrichment.

Software
Describe the software used to collect and analyze the ChIP-seq data. For custom code that has been deposited into a community repository, provide accession details.

Flow Cytometry
Plots
Confirm that:
The axis labels state the marker and fluorochrome used (e.g. CD4-FITC).
The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a 'group' is an analysis of identical markers).
All plots are contour plots with outliers or pseudocolor plots.
A numerical value for number of cells or percentage (with statistics) is provided.

Methodology
Sample preparation
Describe the sample preparation, detailing the biological source of the cells and any tissue processing steps used.

Instrument
Identify the instrument used for data collection, specifying make and model number.

Software
Describe the software used to collect and analyze the flow cytometry data. For custom code that has been deposited into a community repository, provide accession details.
Cell population abundance
Describe the abundance of the relevant cell populations within post-sort fractions, providing details on the purity of the samples and how it was determined.

Gating strategy
Describe the gating strategy used for all relevant experiments, specifying the preliminary FSC/SSC gates of the starting cell population, indicating where boundaries between "positive" and "negative" staining cell populations are defined.
Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary Information.

Magnetic resonance imaging
Experimental design
Design type
Indicate task or resting state; event-related or block design.