Introduction

In the era of computational and precision psychiatry, we have two fundamental goals: (1) to robustly characterize mechanisms of human behavior and brain function as they relate to human health and disease, and (2) to apply our understanding of mechanisms to individual-level prediction and individualized treatments [1]. Yet, as in any field of science that seeks to make both population- and individual-level inferences, there is a tension between measuring a phenomenon comprehensively and precisely to characterize mechanisms and measuring that phenomenon across many individuals for generalizability and ultimately prediction. Indeed, the methods of behavioral science range from the comprehensive characterization of individual patients (e.g., the bilateral medial temporal lobectomy patient, H.M.) to the development of generalizable genetic prediction models of psychiatric disease based on coarse diagnostic classification (e.g., genome-wide association studies) [2]. Yet, a science of precision psychiatry will require both rich individual-level characterization and population-level scale (see Fig. 1).

Fig. 1: The need for scale in behavioral research.

Studies that achieve scale tend to do so either by measuring some characteristic coarsely across many participants (e.g., genome-wide association studies) or by comprehensively measuring a limited number of participants (e.g., case studies of rare neurological phenomena, such as bilateral medial temporal lobectomy patient H.M.). Both types of scale are needed for precision psychiatry.

The challenges of precision psychiatry will require a radical rethinking of the way we approach behavioral research, to enable the sort of data collection needed to build models for individual-level inferences. Not only must we address existing issues of power and generalizability that have been major barriers to our science [3,4,5,6,7], but we must also move toward a scale that is beyond the resources of most individual laboratories.

We use the term scalable behavioral research to refer to the application of both traditional and novel tools for measuring and quantifying behavior (e.g., surveys, sensors, cognitive assessments) at the scale necessary for population-based research, large cohort-based longitudinal studies, and high-frequency measurement designs. This includes a dramatic increase in the size and diversity of our samples, as well as the ability to characterize dynamic changes in behavior and cognition, over time.

This review focuses on challenges and a potential framework for translating methods from experimental science toward the measurement of mechanisms for cognition and behavior at scale. We outline some of the main challenges to implementing our current measures of mechanisms, adapted from experimental science, in large diverse samples. Finally, we suggest ways of reconceptualizing the development of measurement tools and the role of the participant, toward achieving a generalizable science of behavior that is rigorous, inclusive, and representative.

Challenges to scale

There are both financial and logistical challenges to a robust precision psychiatry, even when we constrain the problem to measurements of behavior. Below, we outline three major human, technical, and psychometric barriers to scale across the behavioral sciences. Our goal with this review is to identify bottlenecks within behavioral research that, if addressed, would provide a substantial leap forward in our ability to develop a precision psychiatry.

Participant engagement

Lack of participant engagement is one of the most significant barriers to feasibility for large-scale behavioral research studies. Human research participants have resource and time limitations—they are focused on myriad concerns related to family, work, and personal needs that prevent widespread engagement in research studies [8,9,10,11,12]. Moreover, the attention of humans is limited. Attention is captured by information that is compelling and/or goal-related, and fatigued by information that is not [13]. Yet despite these known barriers (and most researchers being humans themselves), we tend to design our studies primarily to meet the needs of researchers and their science. The formidable issue of participant engagement has been well described when it comes to the use of digital health apps [14, 15], as well as in the limitations sections of many research studies. Anticipated burden reduces enrollment and enthusiasm for participating in research [8, 16, 17], whereas actual and perceived burden increases attrition [18, 19], reduces data quality [20], and is among the largest cost drivers for longitudinal research [21, 22]. Moreover, because burden is not evenly distributed across the population or over time [8, 17], differential recruitment and attrition by sociodemographic, diagnostic, or contextual factors threatens generalizability [8, 18, 19, 23]. Participants from already disadvantaged populations (for whom research participation is potentially a prohibitive burden) and in poorer health are least likely to enroll and most likely to attrite from research studies [8,9,10,11,12, 19, 23, 24], making participant burden a major concern for any population-level behavioral science. All else being equal, as participant burden goes up, the feasibility and affordability of research go down. To build a scalable science, attention needs to be paid to the goals and needs of participants themselves. As both the source of our science and the “end user” of our discoveries, participants matter, and as a field we need to build their needs into research study design [8,9,10, 18, 19].

Accessibility: humans and devices

The feasibility of large-scale studies critically depends not only on whether participants are willing to engage in research, but also whether they are able to engage. Accessibility is therefore another important consideration for scalable assessment. We use the term accessibility to refer to both an individual’s ability to access a measurement tool—which might be impeded by physical, logistical, linguistic, or health-related barriers—as well as how easy it is for an individual to interact with that tool [17, 25, 26].

Measurement tools are often developed with a particular participant group or scientific question in mind. When those instruments are later adapted for large-scale studies, beyond their original purpose, accessibility tends to be considered on an ad hoc basis—in a study of aging, for example, one should ensure that fonts are large enough to be readable by older adults [27]. However, if a measurement tool is to be used at scale, across large and diverse populations, then that tool must be as universally accessible as possible [28, 29]. That means considering accessibility for individuals who vary in sociodemographic factors, health status, age, education, and motivation.

What factors contribute to accessibility? Is it possible to take a “design for all” approach [30]? In addition to accommodations for individuals with different sensory or motor capabilities, differences in language or language fluency, and variations in technical skills or experience, a truly accessible tool will adhere to universal design principles that have been well articulated in the literature on human factors and user interface design [30,31,32,33]. These are principles developed to improve the operability, understandability, and perceivability of a tool across individuals [29] and emphasize clarity, simplicity, and consistency [28, 29, 34, 35]. When a research tool fails along these dimensions, it imposes a barrier not just for populations with specific sensory or motor impairments, but also for people with general cognitive difficulties [28], including individuals with mental disorders [29]. As with participant engagement, shortfalls in accessibility limit how well we can reach participants with diverse needs and experiences. Moreover, the same principles that make a particular instrument more accessible will also tend to make it more engaging [36], more trustworthy [37], and improve the quality of data collected [38]. Accessibility of research tools is thus a critical component of a generalizable behavioral science.

From mechanisms to individual differences

The third major barrier to a scalable behavioral science is the measurement gap between basic and applied sciences [39]. In addition to issues of power and reproducibility discussed earlier, behavioral science is currently in the midst of a crisis of measurement that we have only begun to recognize and understand [40,41,42,43,44]. This arises primarily out of the drive to take measures that were developed in basic science laboratories and apply them to the study of individual differences [45].

Many of the most robust experimental measures of human cognition are poor and unreliable measures of individual differences [41]. Take, for example, the well-known and well-characterized Stroop interference effect, whereby participants are slower to name the color of a word when the word text and color are incongruent (e.g., the word blue written in red) than when they are congruent (e.g., the word blue written in blue). Although the effect itself is robust and replicable, brief measures of Stroop interference often have poor reliability [41]. Reliability, in the psychometric sense we use here, refers to the consistency of results from a particular measure. An entirely unreliable or inconsistent measure will produce different results each time, whereas a perfectly reliable measure will produce the same result every time. Reliability can be further divided into test–retest reliability vs. internal reliability. Test–retest reliability refers to the consistency of a measure over longer time periods (beyond a single measurement or test session), whereas internal reliability refers to the consistency of a measure over the time period that measure is delivered or administered. While a measure may have poor test–retest reliability and still be valid (e.g., if the underlying process or behavior being measured is unstable), measures with poor internal reliability are either measuring distinct constructs across test items or, in some cases, not measuring anything at all. Reliability, in some form, is a prerequisite for validity. Returning to our Stroop example: it is not sufficient to confirm that participants have slower response times for incongruent than congruent trials. Rather, the magnitude of an individual’s response time slowing on incongruent trials should be consistent within a test session or between test sessions if a particular measure of Stroop interference is to be considered reliable.
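To make internal reliability of a difference score concrete, the sketch below shows one common way to estimate it—an odd/even split-half correlation with a Spearman–Brown correction—using simulated Stroop-style response times. The trial counts, effect sizes, and noise levels are illustrative assumptions, not values from any cited study.

```python
# A minimal sketch (simulated data, illustrative parameters): estimating the
# internal (split-half) reliability of a Stroop interference score.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_trials = 200, 48   # trials per condition per person (assumed values)

# Simulate response times (ms): a stable per-person interference effect plus trial noise.
true_effect = rng.normal(80, 30, n_subjects)                      # "true" slowing, ~80 ms
congruent = rng.normal(600, 90, (n_subjects, n_trials))
incongruent = congruent + true_effect[:, None] + rng.normal(0, 120, (n_subjects, n_trials))

# Interference score (incongruent minus congruent mean RT), computed separately
# for odd- and even-numbered trials.
score_odd = incongruent[:, ::2].mean(axis=1) - congruent[:, ::2].mean(axis=1)
score_even = incongruent[:, 1::2].mean(axis=1) - congruent[:, 1::2].mean(axis=1)

# Split-half correlation, stepped up to the full test length via Spearman-Brown.
r_half = np.corrcoef(score_odd, score_even)[0, 1]
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}; Spearman-Brown corrected r = {r_full:.2f}")
```

Running the same calculation on real trial-level data is what reveals whether a brief task yields a difference score stable enough to support individual-differences research.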

A growing literature in the affective, social, and cognitive sciences has identified foundational reliability issues with some of the most widely used and richly characterized measures [40,41,42,43,44,45,46,47,48]. At best, unreliable measurement reduces the power and reproducibility of research studies. At worst, unreliable measurement means that ostensible variations in behavior may reflect random variations between people or over time, and are fundamentally uninterpretable [49].

Why does the translation gap for behavioral measurement exist? Part of the problem is that many areas of behavioral science do not have a tradition of reporting reliability statistics [43]. There is, however, a more fundamental issue that is related to the types of variance that are the focus of the basic sciences, including neuroscience and experimental psychology [45].

When we seek to characterize variations in mechanisms across individuals, we draw from a rich and diverse basic science literature whose goal is to characterize those mechanisms. The Stroop interference effect, for example, has helped us understand processes related to automaticity, selective attention, and response inhibition. In translational and clinical science, we want to be able to take mechanisms—such as response inhibition—and look at how variations in those mechanisms might contribute to differences in disease risk and selection of appropriate treatments [1]. Yet, measurement approaches that are the most sensitive to differences between conditions often have the least variability between persons. The Stroop effect is so well-characterized in experimental psychology precisely because nearly all individuals show the expected pattern of response times to incongruent vs. congruent trials. Indeed, the optimal scenario for experimental validation is a mechanism (or its measurement) that is as invariant as possible across individuals, rather than one that is sensitive to individual differences [45]. However, as sensitivity to between-person individual differences is a prerequisite for understanding how variations in mechanisms contribute to human disease, the assumption of reliable between-person variability must be tested [41, 43].

In summary, to understand variations in mechanisms, we need to take the challenge of translation of measurement tools from basic science far more seriously. Such considerations are a foundational part of a scalable behavioral science, and should be central to our study design, interpretation of results, and overall scientific priorities.

Frameworks for scaling behavioral measurement

The limitations articulated above are daunting and, when considered together, paint a negative picture of the feasibility of a broadly scalable behavioral science to drive progress in psychiatry. Yet, we note that these challenges are not restricted to large-scale behavioral research, but exist, in some form, across human individual differences and clinical research. The drive toward larger-scale studies, diagnostics, and interventions acts as a lens—magnifying and bringing into focus the many barriers and limitations that already existed within the silos of our laboratories, institutions, or subfields. The goal in addressing these issues is to build better models of human behavior and disease, advance the progress of science, and ultimately develop better treatments.

In this section, we provide two approaches to behavioral research we believe will help address the barriers described in the previous section. These are approaches used by the authors in their own work, but are not the only potential solutions. Rather, our goal is to spark a conversation about ways that we might reconceptualize the research process and research laboratory in behavioral science, to make scalable research more feasible.

Iterative task development

Our current approach to the development of research methods and studies is approximately linear. Once a basic mechanism has been identified, research measurement tools (or tasks) initially developed to characterize that mechanism are adapted for an applied or clinical context. These tasks may then be piloted to assess feasibility and basic aspects of validity in a small sample drawn from a target patient population or among healthy controls. In this initial piloting phase, perhaps it is discovered that a particular task or condition produces “better” data than another (based on a diverse and heterogeneous set of criteria). This then informs selection of tasks and measures for a larger study. As noted above, reliability metrics at this stage are reported inconsistently and, in some subfields, regularly omitted. If all goes well, a larger study is eventually conducted, leading to a mixture of negative and positive results that then enter the research literature and contribute to progress (or not) in a particular subfield.

We would argue that this process often goes awry at the earliest stage of task development—the translation of tasks from basic science to clinical research (or the study of individual differences). In addition to unknown reliability, many tasks are never evaluated for their participant burden characteristics or accessibility across populations, devices, and contexts. Usually, it is only after a task has been used in one or more large studies that it becomes clear the task falls short along one of these dimensions. A systematic approach to addressing participant engagement, accessibility, and psychometric generalizability is therefore necessary for macroscale behavioral research.

We look to computer science (a field of engineering) for potential solutions. Consumer-oriented software development, in particular, has evolved best practices for the development of applications geared toward addressing many of the same human and logistical barriers articulated above for measurement tools. Such software applications will fail if they are not usable, engaging, accessible, and scalable. Importantly, and as with our research tools, it is often not clear what precise parameters or characteristics will lead to maximum usability, engagement, accessibility, and scalability.

The model used throughout software development—and in other areas of engineering and design—relies on iterative refinement and randomization (also known as A/B testing) [50,51,52]. That is, it is not enough to build an application and assume (based on first principles) that it will work as intended. Rather, a part of the application development process is the successive validation and refinement of the application along multiple simultaneous criteria. And, as in behavioral science, randomization of users to different test conditions (A/B testing) permits the selection of parameters, features, and user interface characteristics in a data-driven and unbiased manner [52]. Here, we describe the application of such an iterative A/B testing framework to the development of measurement tools focused on cognition (see Fig. 2).

Fig. 2: Iterative task development.

Shown is a schematic of a basic iterative task development procedure. The inset graph shows an example visualization of reliability for an accuracy-based cognitive task. Measurement reliability is an often-neglected characteristic of assessments adapted from experimental/basic science.

Iterative task development begins with the selection of parameters for a particular task (overall task procedure, items, length, instructions, formatting, etc.). This defines an initial prototype for further development. The prototype is based on a best guess of what has worked previously, either based on existing research in similar populations or based on measures for which there is a strong foundation of experimental science with well-characterized mechanisms. Next, additional parameters are selected that might be expected to change behavior: for instance, differences in instructions, methods of delivery (auditory vs. visual), methods of eliciting responses (e.g., Likert vs. true/false for questionnaires), and incentives for participation (e.g., return of research results, payment schedules or triggers, lottery). Criteria should be set in advance for determining whether a task meets some minimally acceptable standard of reliability, sensitivity, validity, generalizability, accessibility, and engagement across participants. Participants are then randomized to different versions of the task. The impacts of different task parameters are then evaluated and decisions are made about which parameters led to improvements and which did not. At that point, these parameters can be refined for subsequent rounds of A/B testing and task development. Notably, in this model, some measures would never exit the development stage, as no combination of parameters tested yields versions of the test that meet criteria for minimum acceptability. Based on a priori criteria, these tests would not be considered appropriate for wide-scale deployment in research studies designed to produce generalizable knowledge. Task development is complete when no further improvements are identified.
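To make the logic of a single development round concrete, the sketch below shows one way such an A/B randomization and criterion check might be organized. The variant names, metrics, and thresholds are hypothetical placeholders standing in for the pre-registered criteria described above, not part of any published pipeline.

```python
# A minimal sketch (hypothetical variants, metrics, and thresholds) of one round
# of A/B testing within an iterative task development cycle.
import random
from statistics import mean
from dataclasses import dataclass, field

@dataclass
class VariantResults:
    """Accumulates per-batch metrics for one task variant."""
    reliability: list = field(default_factory=list)    # e.g., split-half r per batch
    completion: list = field(default_factory=list)     # proportion of starters who finish
    duration_min: list = field(default_factory=list)   # median administration time (minutes)

# Pre-registered minimum acceptability criteria (illustrative values).
CRITERIA = {"reliability": 0.70, "completion": 0.85, "duration_min": 8.0}

# Two hypothetical task variants differing in response format and delivery.
variants = {"A_visual_likert": VariantResults(), "B_audio_true_false": VariantResults()}

def assign_variant(participant_id: str) -> str:
    """Randomize each incoming participant to one task variant (deterministic per ID)."""
    return random.Random(participant_id).choice(sorted(variants))

def meets_criteria(v: VariantResults) -> bool:
    """Check a variant against every pre-registered criterion at once."""
    if not v.reliability:
        return False
    return (mean(v.reliability) >= CRITERIA["reliability"]
            and mean(v.completion) >= CRITERIA["completion"]
            and mean(v.duration_min) <= CRITERIA["duration_min"])

# Variants passing all criteria move to the next refinement round; the rest are
# revised or retired, mirroring the decision step described in the text.
```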

Approaches that rely on randomization to different items or parameters for task development and validation are not novel. Similar general models are used in the development of measures that rely on item banks—a large number of potential test or survey items are generated and then tested (using random assignment) to estimate psychometric characteristics of each item, allowing items with better psychometric characteristics to be identified and used in the development of custom applications (e.g., the SAPA Project) [53, 54]. The metastudy approach [55] similarly relies on randomization of participants to many possible variations of a task or experiment. The purpose of the metastudy, however, is to determine whether a particular effect or outcome is robust to variability across a range of nuisance parameters [55], rather than task development. Our iterative task development framework extends the logic of these item and parameter randomization approaches to include multiple successive (iterative) phases of task optimization and metrics that capture human factors such as accessibility and engagement.

When moving beyond psychometric criteria for task development, it is expected that better optimization along one dimension can lead to poorer optimization along another dimension. For example, one way to limit variability in scores due to non-human sources (e.g., for a measure of simple reaction time) is to limit the types of devices and contexts where a measure can be completed. iOS devices, for example, tend to have shorter response time latencies than Android devices (due to the latter’s variation in hardware), making them better suited as a class for measuring response times. At the same time, however, the lower cost of many Android smartphones means that the average education and socioeconomic status of iOS users tends to be higher than that of Android users [56]—factors that have robust and replicable associations with cognition and mental health. Thus, limiting a study to iOS devices will improve precision of measurement (and reduce the influence of a potential confound), but also exclude the majority of smartphone users [56].

Another example is the tension between task length and task reliability. The most robust and generalizable method for increasing the reliability of a measure is to increase its length. Unfortunately, increased test length or administration time contributes to participant burden—which reduces enrollment, increases attrition, and can threaten generalizability [16, 21]. While this concern is potentially less applicable for passive data collection (e.g., GPS, actigraphy, or other sensor-based modalities), more dense or frequent measurement can interfere with device processing speed or battery life, which can disrupt the participant’s use of a device and increase burden [57]. Across modalities, more precise or more comprehensive measurement is usually more burdensome.
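The quantitative form of this length–reliability relationship is commonly summarized by the Spearman–Brown prophecy formula (a standard psychometric result, stated here for reference rather than drawn from the studies cited above): lengthening a test with reliability $r$ by a factor $k$ yields an expected reliability of

$$ r_k = \frac{k\,r}{1 + (k - 1)\,r}, $$

so, for example, halving a task with reliability 0.80 would be expected to reduce its reliability to about 0.67, while doubling it would raise the expected reliability to roughly 0.89.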

One way to address the trade-off between psychometric and usability considerations is to create joint optimization metrics. For instance, one can use a metric that captures both task reliability and participant burden by looking at the minimum duration of a task needed for acceptable reliability across tasks or versions of a task (or minDAR) [44]. Based on an analysis of 25 cognitive tests, Passell et al. [44] reported dramatic variation in the task duration needed to produce acceptably reliable scores (defined as an internal reliability of at least r = 0.7). Some tasks had minDARs of only 30 s. For other tasks, the minDAR exceeded 10 min [44]. One might posit similar joint optimization metrics for the minimum duration needed for acceptable validity (based on associations with some predefined criterion) that allow comparison across task parameters.
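As an illustration, a minDAR-style summary can be computed from trial-level data by growing the number of trials included and finding the shortest administration time at which internal reliability crosses the threshold. The sketch below assumes a fixed per-trial duration and an odd/even split-half estimate of reliability; it is a simplified stand-in for, not a reproduction of, the Passell et al. procedure.

```python
# A minimal sketch (simplified assumptions, not the published minDAR code):
# shortest administration time at which a task reaches internal reliability >= 0.70.
import numpy as np

def split_half_reliability(scores: np.ndarray) -> float:
    """Odd/even split-half correlation with Spearman-Brown correction.
    `scores` is a subjects x trials array of trial-level scores."""
    odd = scores[:, ::2].mean(axis=1)
    even = scores[:, 1::2].mean(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

def min_duration_acceptable_reliability(scores: np.ndarray,
                                        seconds_per_trial: float,
                                        threshold: float = 0.70):
    """Return the shortest duration (in seconds) whose trials reach `threshold`,
    or None if the task never reaches criterion at any tested length."""
    n_trials = scores.shape[1]
    for m in range(4, n_trials + 1, 2):        # at least two trials per half
        if split_half_reliability(scores[:, :m]) >= threshold:
            return m * seconds_per_trial
    return None
```

Comparing this number across candidate task versions gives a single figure that weighs reliability against the time demanded of participants.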

There are two potential critiques of the iterative task development approach described here. First, optimization of certain parameters might threaten the validity and generalizability of the original task. That is, there is a risk that the more a task is modified, the less likely it is that the existing literature and validation for that task (including literature on basic mechanisms) can be applied. This is a valid concern, and appropriate checks should be included in the task development process to track validity. One might evaluate, for example, whether modifications that improve accessibility and reduce burden also reduce the magnitude of important between-condition effects. A brief and reliable measure of Stroop interference (if one exists [41]) should still have longer reaction times for incongruent trials than congruent trials. Otherwise, it is not a measure of Stroop interference. As limitations related to participant engagement, accessibility, and measurement already pose a threat to validity and generalizability, we believe that systematic efforts to address these barriers will tend to improve capacity for scalable assessment that advances behavioral science.

The second critique is that the sample sizes needed for an iterative task development approach are prohibitive for most studies. This is also a valid concern. We focus in this manuscript specifically on the contexts where applications of behavioral or cognitive measures at macroscale are both desirable and potentially achievable. In these cases, the investment in a robust iterative task development phase for translating such measures toward scalable contexts can save both human and financial resources in the long run. We also note that there are now many low-cost and high-throughput methods for large-sample participant recruitment that do not require the same level of resources as a large traditional research study. If it is not possible to iteratively develop a measure intended for schizophrenia research using a large sample of schizophrenia patients, a reasonably good first approximation of task psychometrics, accessibility, and burden can be made using large, diverse samples of mostly healthy participants. There are now numerous platforms for recruiting large numbers of participants to complete research assessments, including Amazon’s Mechanical Turk [58], Prolific [59], and Crowdflower [60]. These platforms can be a rapid and inexpensive source of participants, but researchers should be aware of their challenges and limitations [58].

Yet another approach for engaging participants is a citizen science model of recruitment, which treats the participant as a partner in the scientific discovery process. This approach is described in the next section.

Citizen science: participant as collaborator

Patient-centered, participant-centered, and/or participatory research frameworks have received considerable attention over the past decade, and with good reason: the integration of the patient or participant perspective into research at all stages makes both practical and ethical sense [61,62,63]. Such an approach can help identify new opportunities [64] or fundamental design flaws [10] early in the research process. It also makes the identification and selection of incentives for participation both more comprehensive and clearer [9, 10, 18, 24, 65].

Here, we focus on a citizen science framework for participatory research in behavioral science [65, 66]. In this framework, participation in research is incentivized by the desire to contribute to science, by insight into the research question being studied, and by the return of study data and individual research results [65,66,67,68]. Participants use structured research tools to answer their own research questions or contribute to an overall research program by collecting their own data [69]. In behavioral science, that data collection involves completing surveys, behavioral measures, and cognitive tasks that provide individual-level feedback about major outcome variables or performance. The benefit of data collection using this model is threefold. First, the incentives are aligned for participant and researcher: both want to understand the participant’s capabilities [65]. Participants will tend to exert effort toward better performance in a way that fits the assumptions of our research studies, and this model can produce higher-quality data than financial incentives [70]. Second, participants can recruit other participants in a way that leads to large-scale participation. TestMyBrain.org, for example, receives about 500–1000 participants per day, of whom about two-thirds are new to research participation [44, 65, 67]. Third, it invites and encourages participants to provide feedback on research methods, which can help generate insights about potential technical problems, issues with instructions, user interface improvements, or accessibility barriers that would otherwise be difficult to identify.

Many of the major reservations that researchers have about this approach to data collection concern the ethics of return of research results—specifically, where each participant is provided with their data or some individual-level metric derived from their data. How will a participant interpret the data [71]? Will they know what to do with it [72]? What if results cause distress or lead to decisions about treatment-seeking or care that ultimately have a negative impact [73]? One might flip this question, however, by asking: who has the most fundamental right to a participant’s data? If data are generated using the body and behavior of an individual, should researchers have the right to limit that individual’s access to those data? Rather than asking whether data should be shared with the participant, we should perhaps be asking how best to share data with participants [74, 75]. These are important questions that are being considered in depth elsewhere as part of national and international initiatives [71,72,73,74,75,76,77], including the US Precision Medicine Initiative (All of Us research program) [76].

In the case of low-risk measures where nonclinical interpretations of scores can be made interesting and understandable, we and others have found the return of research results to be a positive incentive for community education [78], engagement in research [65, 67], and developing relationships with participant communities that enhance research and improve public understanding of science [64]. In addition to TestMyBrain.org [65], initiatives that have had similar (or greater!) success at recruiting citizen science participants for studies of human cognition and behavior through return of research results include LabintheWild [79], Games with Words [67], Project Implicit [80], My Social Brain [81], and the SAPA Project [53]. While not suitable for all test development modalities or applications, the combination of crowdsourcing and/or citizen science approaches allows rapid evaluation of task characteristics like participant burden, reliability, and accessibility at relatively low cost. It remains to be seen whether digital citizen science approaches can address engagement barriers in longitudinal research, where personal relationships (e.g., with research personnel) can also serve as engagement incentives. As with any method that relies on digital technology, targeted outreach will also be needed to ensure participation among communities with reduced access to smartphone technologies, including rural and underserved communities.

Future research directions

We have argued that, as we move toward precision psychiatry, there is a pressing need for broadly scalable behavioral research approaches—the development and validation of reliable, accessible, engaging, and generalizable methods are the only way to achieve a precision psychiatry that can precisely and dynamically characterize the behavior of many individuals, over time (Fig. 1). Below, we outline an emerging field of behavioral science that focuses on the temporal dynamics of cognition and behavior, enabled by new technologies and approaches to assessment, that could transform psychiatry and our understanding of the human mind.

Behavior, over time

Cross-sectional psychiatric research often implicitly assumes that between-person variations in mechanisms associated with psychopathology generalize to within-person variations in symptoms or psychopathology over time. Yet, many psychological processes and mechanisms violate this ergodicity assumption [82].

One of the most exciting innovations enabled by digital technology and the shift toward larger-scale behavioral data collection is the ability to measure and monitor change over time in behavior and cognition. Typical approaches to measuring change rely on longitudinal “single-shot” designs [83], in which single time-point assessments are repeated across widely spaced intervals (e.g., annual assessments). The single-shot approach assumes that differences in cognition and behavior are relatively stable over brief time scales, and that meaningful change occurs relatively slowly. However, ecological momentary assessment and measurement burst designs have revealed significant variability in cognition and behavior over hours and days, with distinct patterns of variability associated with different behaviors [84], psychiatric risk [85], and the effectiveness of interventions [86]. Cognition, in particular, demonstrates significant within-person variability that accounts for as much as 40–60% of the total variability in performance [87].
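One way to see what such a statistic entails is a between- versus within-person variance decomposition of burst-design data. The sketch below uses simulated daily scores and a simple one-way random-effects decomposition; the sample sizes and variances are illustrative assumptions, not estimates from the cited studies.

```python
# A minimal sketch (simulated burst data): decomposing repeated scores into
# between-person and within-person variance components.
import numpy as np

rng = np.random.default_rng(1)
n_people, n_occasions = 100, 14                       # e.g., 14 daily assessments
person_mean = rng.normal(0, 1.0, n_people)            # stable between-person differences
scores = person_mean[:, None] + rng.normal(0, 1.0, (n_people, n_occasions))

# One-way random-effects decomposition (people as the random factor).
grand = scores.mean()
ms_between = n_occasions * np.sum((scores.mean(axis=1) - grand) ** 2) / (n_people - 1)
ms_within = np.sum((scores - scores.mean(axis=1, keepdims=True)) ** 2) / (n_people * (n_occasions - 1))

var_between = max((ms_between - ms_within) / n_occasions, 0.0)
var_within = ms_within
within_share = var_within / (var_between + var_within)
print(f"within-person share of total variance ≈ {within_share:.0%}")
```

With equal between- and within-person variance in the simulation, the within-person share comes out near 50%, in the range reported for cognitive performance [87]; applied to real burst data, the same decomposition quantifies how much of what we measure would be invisible to a single-shot assessment.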

Reliance on single-shot assessments to measure what are likely dynamic processes has two important consequences. First, it introduces temporal sampling error—differences in measurement that reflect time-of-testing effects and can differ substantially from a person’s average [83]. Second, single time-point assessments assume that variability in performance is not meaningful for characterizing phenotypes. This is a relatively unsupported assumption, and one that is at odds with recent conceptualizations of dynamic phenotypes, a term originally coined to refer to time-dependent observable characteristics of single cells [88]. Because human behavior and performance are time-dependent, and display meaningful variability at relatively fast time scales (e.g., moments, hours, days), precisely characterizing important phenotypes requires tools that can capture behavior as it unfolds in as near to real time as possible [89].

Behavior, in context

Finally, new technologies for measuring physiology, mood, and environmental variables can now provide richness and context for more traditional behavioral assessments—moving the laboratory out of the confines of brick and mortar and into people’s everyday environments. Digital sensors embedded in everyday wearables and personal digital devices can measure movement, sleep, vocal patterns, and even physiological signals that are related to mood, arousal, and health [90,91,92]. While some of the most innovative applications involve extracting signals from dense multimodal datasets that combine sensors using machine learning for prediction and diagnosis [92], there are also more immediate applications that make traditional tools both more powerful and more interpretable [93]. Processing of speech and text can be used to rapidly extract information from sources that are otherwise hard to process [93, 94], as well as to provide indicators of emotion and psychological status [90, 95]. Sensors embedded in smartphones, together with active measures of behavior such as surveys or cognitive tests, can give information about the context in which a behavior, experience, or cognitive process occurs [96]. Computer vision algorithms can take slices of data from human video and images to understand things like emotional experiences, social behavior, and attention [97, 98]. Actigraphy or sleep data can provide information about circadian rhythms that might track fluctuations in behavior or cognitive performance, providing meaningful signals related to brain and cognitive health [99]. Such applications are being widely tested by researchers in the field, and time will tell which provide the most promising signals related to human cognition and behavior.

As human beings moving through the world, our behavior is both dynamic and exquisitely responsive to social and environmental contexts. Methods that allow us to access that dynamic and context-rich view of human cognition and behavior will open up new areas of investigation and potentially provide better models for understanding psychopathology. The coming years may reveal new architectures of human cognition and behavior based on temporal variation or state-related change, which can yield insights into the pathophysiology of mental disorders that were previously inaccessible due to methods limitations.

Conclusion

The scaling of methods for the assessment of health-related characteristics is happening throughout science and medicine, owing to the explosion of new technologies, new analytic approaches, and the unprecedented connectedness of human societies. We are now able to conduct research at a scale that was previously unimaginable.

In this review, we have attempted to lay out our view of some of the major considerations for scaling the science of behavior across individuals and over time, as well as potential approaches that emphasize iterative design of reliable, engaging, and accessible measures, together with thoughtful integration of participants in the research process. In no way, however, do we imply that the solutions suggested are the only path forward. Indeed, one of the most exciting things about the shift from individual investigators to communities of scientists and participants working together on ambitious projects is the potential to rethink our assumptions about the research process, where it is centered, and how best to drive scientific progress.

Funding and disclosure

Funding was provided by a National Institutes of Health grant to LG (NIMH R01MH121617) and National Institutes of Health grant to MJS (NIA U02AG060408). LG has received compensation as a member of the scientific advisory board of Sage Bionetworks, a 501c3 nonprofit organization. LG is on the Board of Directors of the Many Brains Project, a 501c3 nonprofit organization. The authors declare no competing interests.