Sensor technologies have multiplied across many realms of human activity, from tracking devices in cars to online browsing. Satellites scan and digitize the planet at regular intervals. The development by computer scientists of techniques for processing unstructured data such as text, images, audio and video has animated the conversion of—for example—books1, radio broadcasts2 and television shows3 into data. In the twenty-first century, human behaviour—from mobility to information consumption to various types of interpersonal communication—is increasingly recorded somewhere and potentially computationally tractable. Past communication technologies, from mail to print to fax, typically left far fewer durable and accessible artefacts; those that did have become computationally accessible only in the past decade or so, as the relevant physical artefacts were digitized. The digitization of books is one example: it enables the computational analysis of a massive corpus of human expression that stretches back centuries4.

The emergence of these new data streams has often been compared to the development of the telescope. As Robert Merton famously wrote, “Perhaps sociology is not yet ready for its Einstein because it has not yet found its Kepler….”5. Merton’s provocation was that sociology did not yet have the empirical foundations on which to build great theory. Responding 62 years later, Duncan Watts writes, “…by rendering the unmeasurable measurable, the technological revolution in mobile, Web, and Internet communications has the potential to revolutionize our understanding of ourselves and how we interact. Merton was right: social science still has not found its Kepler. But three hundred years after Alexander Pope argued that the proper study of mankind should lie not in the heavens but in ourselves, we have finally found our telescope.”6.

We believe in the potential of digital data sources to transform the social sciences. However, the metaphor of the data streams from the instrumented society as a ‘telescope’ is misleading in important ways. First, the study of societies is different from the study of the stars, because the patterns that characterize human behaviour will generally differ across time and place. Second, the measures built from these streams are potentially suspect in ways that must be actively interrogated, because these sources were not built with scientific goals in mind. We now turn to the first point; the remainder of the paper is devoted to the second.

Fig. 1: Measurement in social science. Measurement is the bridge connecting scientific motivations and data with insight and applications.

The unstable logics of society and measurement

Empirical social science is largely focused on finding generalizable but not universal patterns in human behaviour. The part of the social sciences that has the intent of finding such universal patterns in human behaviour (for example, evolutionary psychology) is tiny relative to the whole field. The issue of the instability of the rules that govern human society is exacerbated by the very sociotechnical systems that are gathering the data about people, which are actively (and in some cases intentionally) changing the social world that social science would study. Through what social scientists call reflexivity and self-fulfilling prophecies, humans actively change the world that they are observing by acting on the knowledge gained (in part by measurement instruments)7,8.

Reflexivity refers to the loop that links social reality with the theories and the metrics that we devise to explain it. The ‘bandwagon’ and ‘underdog’ effects, for instance, have long been identified in the analysis of electoral politics to explain the impact that polls and forecasts have on voting behaviour. If candidates are projected as likely winners, more people may decide to vote for them (bandwagon effect) or, conversely, more people could mobilize to increase support for the candidate expected to lose (underdog effect)7. These effects reflect the impact that measurements have on attitudes and behaviour8,9, and how our measures can distort the phenomena that they were designed to monitor. These distortions, in turn, can be amplified by algorithmic decision-making in public health, law enforcement, sentencing, education and hiring10,11.

Reflexivity also takes the form of the observer effect, which happens when people modify their behaviour if they know they are being watched12,13. Digital technologies have created a new version of the reflexivity problem, amplifying the performative aspect that is intrinsic in social indicators. When Google launched the Flu Trends project in 2008, the goal was to use search queries to estimate the prevalence of flu symptoms in the population. In 2013, however, Flu Trends substantially overestimated peak flu levels. One of the reasons was the flawed assumption that search behaviour was driven by external events, such as having flu symptoms. In fact, Google’s algorithms were driving those patterns as well: by trying to anticipate the intent of the users through recommended search terms, Google was distorting the information users would have otherwise revealed14. The reaction to the observed phenomenon, in other words, changed the phenomenon itself.

Obfuscation tactics represent another version of the observer effect: we can now disrupt measurements by deliberately adding ambiguous or misleading information to interfere with data collection. Examples of obfuscation include editing profile photographs to prevent facial recognition; using a virtual private network (VPN) to hide one’s location when browsing the web; or using group identity (for example, many people under one user account) to obscure specifics about the actions of one user15. The reflexivity loop here is created by the awareness that behavioural traces feed into metrics and surveillance, so the meaning of that behaviour is intentionally altered. This is similar to when respondents lie to survey researchers, but on a much larger scale. And because the skills needed to recognize that surveillance is happening and to implement obfuscation in response are not randomly distributed across the population, the individuals whose data are altered in such ways will not be random either.

The unobtrusive nature of many digital measures suggests that, overall, observer effects may be less of an issue with these new data sources compared to the past when—for example—the gender, age and race of the person conducting an interview could vastly change the answers that respondents provided16. However, the loop that connects social reality with the metrics that we devise to analyse it has been strengthened—reflexivity is now embedded in the instruments used to monitor and predict human behaviour. It is as if the Hubble telescope were organizing the placement and behaviour of the stars at the same time as it is observing them. Social media, for example, not only capture human behaviour, but also have the potential to alter important patterns of human society, such as the speed of information flows, the scope of media production and the actors responsible for defining public opinion.

As a result of the fluidity of the principles organizing human society, the meaning of a given measure will also evolve. Part of why the social sciences must accommodate these new types of data is that emerging sociotechnical systems are reducing the relevance of some old scientific instruments used to measure human behaviour. Existing measures of key concepts such as gross domestic product and geographical mobility are shaped by the strengths and weaknesses of twentieth century data. If we only evaluate new measures against the old, we simply replicate their shortcomings, mistaking the gold standard of the twentieth century for objective truth. For example, consider the standard question (originally from 1978) from the American National Election Studies17 about radio consumption regarding an election: “Would you say you listened to a good many, several, or just one or two speeches or discussions on the radio about ‘the campaign’?”

This construction of ‘media consumption’ as consisting of a countable number of discrete units is an artefact of the technology of the broadcast era. This question bears little relation to how people access digital media today. It would be futile to attempt to capture behaviour regarding social media by asking questions such as ‘How many tweets did you see today?’ or ‘What Twitter accounts showed up in your feed?’. Many of the ways to measure behaviour developed in the early days of quantitative social science were: (1) necessary given constraints on measurement at the time; and (2) grounded in a social reality that was markedly different.

Figure 1 summarizes how measurement fits into the general scientific process. We discuss below the central challenges of turning data from these sociotechnical systems into scientific measurements. We include in this discussion two motivating examples of data streams that have been the basis of much social science research: location data from mobile phones and social media posts on Twitter. The key questions we turn to now are what and whom we measure with massively instrumented human behaviour, focusing on the key principles of measurement summarized in Box 1.

What trace data measure

The goal of measurement using behavioural trace data is to extract meaning from the raw data generated from instrumentation. All scientific data instrumentation confronts this issue, but the leap from raw data to meaningful measures is often particularly large when we use data recycled from systems designed for other purposes18. For example, mobility data from mobile phones reporting specific latitudes and longitudes are largely uninteresting without further processing, which enables us to measure proximity, mobility and other socially relevant concepts.
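
As a minimal sketch of this processing step, the Python snippet below converts raw latitude and longitude pings into a simple pairwise proximity flag using the haversine distance. The ping schema, distance threshold and time window are all hypothetical choices for illustration; production pipelines would also need to handle GPS error, temporal alignment and missing data.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def proximity_event(ping_a, ping_b, max_m=50, max_s=300):
    """Flag two pings as a proximity event if they fall within max_m metres
    and max_s seconds of each other. Each ping is a hypothetical
    (unix_time, lat, lon) tuple."""
    t1, lat1, lon1 = ping_a
    t2, lat2, lon2 = ping_b
    return abs(t1 - t2) <= max_s and haversine_m(lat1, lon1, lat2, lon2) <= max_m
```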

The key challenge is whether our measurement accurately captures the construct that we want to examine. Does it closely match other measures of the same thing? What is the potential slippage between construct and measure (for example, if measuring physical activity from mobile phone location data, how consequential are missed stationary activities, such as running on a treadmill)? When we examine supposedly unrelated constructs, do our measures reflect the expected lack of association? By and large, twenty-first century observational data are not designed for research and need to be linked to known concepts before we can use them to answer scientific research questions.
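
These convergent and discriminant validity checks can be made concrete with simple correlations. The sketch below (Python 3.10+; all values invented for illustration) compares a hypothetical phone-derived mobility measure against an established self-report scale and against a theoretically unrelated attribute.

```python
from statistics import correlation  # Python 3.10+

# Invented paired observations for the same six individuals:
phone_mobility = [2.1, 0.4, 3.5, 1.2, 0.0, 2.8]  # km travelled per day (GPS)
self_report = [3, 1, 4, 2, 1, 3]                 # reported activity, 1-5 scale
shoe_size = [42, 41, 40, 44, 39, 43]             # theoretically unrelated

# Convergent validity: the new measure should track an established
# measure of the same construct (expect a strong positive correlation).
print(correlation(phone_mobility, self_report))

# Discriminant validity: it should not track unrelated constructs
# (expect a weak correlation).
print(correlation(phone_mobility, shoe_size))
```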

The meaning of measures is derived, in part, from theory. Theoretically driven designs that apply existing knowledge to interpret digital signals can overcome many of the problems of using instrumented behavioural data. Conversely, undertheorized ad hoc operationalizations can make research findings difficult to interpret and inconsistent across studies. As noted previously19, formal theory is useful not only in generating hypotheses, but also in selecting an appropriate way of measuring constructs with big data.

Consider, for example, the use of mobility data to study the spread of COVID-19. Multiple studies used real-time travel data to track the movement of people from Wuhan to other provinces in China20,21. The researchers found that population movements from Wuhan were strongly predictive of the introduction of the coronavirus to a region. Local controls then predicted the subsequent spread of the virus. In these studies there is a well-theorized process based on the assumption that the spread of the virus is driven by the proximity of individuals. The chosen theoretical framework, in turn, informs how generalizable those findings could be to other cases. That is, we might expect similar patterns in the USA22, but not in Australia, given the rigorous testing and isolation procedures that were imposed on visitors in the latter country. The results of any given empirical study are necessarily local, in both time and space; theory is needed for the appropriate movement of any measurement to a new geographical or temporal context23,24.

As we conduct more research using high-volume, complex data sources and formats, methods that offer insights into the validity of new measures become especially valuable. One promising approach is to examine classic validated self-reported scales in conjunction with new ways of measuring related concepts. For instance, self-reported news attention and exposure can be used in conjunction with eye-tracking to capture visual attention to online content25. A similar triangulation of approaches to measurement can also be useful in confirming the validity and robustness of new behavioural constructs26. Researchers have used mobile phone data to design proximity-based measures capturing the amount of time that people spend close to each other27. These metrics can serve a variety of useful purposes. They can be used as a proxy for relationship strength, or give us a way to track possible pathways of virus contagion. There is, however, the potential for error—two people whose devices appear near each other as measured by their Bluetooth beacons may, for instance, be separated by a wall or may simply be charging their phones from the same outlet. In cases such as this, triangulation can come from the inclusion of self-reported data, such as sending a message to someone’s phone to ask them who else is nearby at the time.
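
As a minimal sketch of such a proximity metric, the Python function below aggregates a hypothetical log of pairwise Bluetooth detections into total co-presence time per dyad. The detection schema, scan interval and gap threshold are assumptions for illustration, and failure modes such as the wall and outlet scenarios above would need separate handling (for example, via the self-report triangulation just described).

```python
from collections import defaultdict

def copresence_seconds(detections, scan_interval=60, max_gap=120):
    """Aggregate Bluetooth detections into co-presence time per device pair.

    detections: iterable of (device_a, device_b, unix_time) tuples
    (a hypothetical schema). Detections separated by <= max_gap seconds
    are merged into one episode; each episode is credited with one extra
    scan_interval to count its final scan."""
    by_pair = defaultdict(list)
    for a, b, t in detections:
        by_pair[tuple(sorted((a, b)))].append(t)

    totals = {}
    for pair, times in by_pair.items():
        times.sort()
        total, start, prev = 0, times[0], times[0]
        for t in times[1:]:
            if t - prev > max_gap:  # a long gap closes the current episode
                total += (prev - start) + scan_interval
                start = t
            prev = t
        total += (prev - start) + scan_interval  # close the final episode
        totals[pair] = total
    return totals
```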

For internet-based research, both basic population characteristics and the underlying mechanisms that structure user behaviour on digital platforms remain relatively poorly understood. Many basic concepts remain difficult to measure even in online platforms that offer easy data access to researchers. Despite the thousands of papers based on Twitter data in recent years, identifying the demographic characteristics of individual users remains a major challenge for social media scholars. Additionally, researchers still cannot reliably distinguish humans from non-humans (for example, bots, collective accounts or organizations), although important strides have been made in that direction28,29. As a result, the large majority of Twitter research makes inferences about accounts or tweets; very little can reasonably claim to make statements about the behaviour of humans. For research questions that focus on human behaviour on Twitter, methods that link user accounts to administrative data or to survey responses offer promise in identifying humans and their demographic attributes30.

Even when it is clear that humans are the source of a given behaviour, there may be a challenge in attribution of specific behaviours to specific humans. In its early days, audience research for broadcast television, for instance, encountered challenges with multi-member households31. The data in those cases would suggest the existence of someone with a taste for children’s cartoons and cable news, when, in fact, there were two different individuals involved. Technological sensors can thus be actively misleading when behaviour is divided across humans (two people using the same Netflix account) or across sensors (the same person viewing Twitter on a smartphone and a desktop). Further exacerbating the issue is that the sensor–human mismatch could rapidly evolve over time. Thus, for example, a finding based on desktop browsing data that news consumption has systematically changed could simply be an artefact of the progressive shift from desktop browsers to mobile apps32. The lack of stability of human use of these different systems (and sensors) may make such a comparison over time essentially impossible.

The use of models based on other data can facilitate the measurement of focal behaviour. For example, who uses which device can be modelled from other data, and the outputs of this model will be less sensitive than discrete assumptions about the identity of a device user. The cable news viewer may be the grandparent and the Xbox user the grandchild. However, the data that are included in these models must always come from the past, and the relationship between measures is itself unstable. This is the fundamental problem of induction, and while it cannot be surmounted without a metaphysical revolution, we propose that constantly updated measurements and models represent our best amelioration of the problem. That is, we should plan for the slippage of our measurements and conduct an ongoing assessment of how particular measures capture the current social reality. For example, measures of inflation need to assess how the set of goods that people consume changes over time. This is a useful recalibration, although it also illustrates the limits of this approach, because the emergence of completely new items (no one was buying smartphones in 2000) makes consumption across time inherently incomparable.
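
The inflation example can be made concrete with a worked sketch. Below (Python, made-up prices and quantities) is a Laspeyres-style fixed-basket index: re-basing with an updated basket changes the measured rate, and a good that did not exist in the base period, such as a smartphone in 2000, cannot enter the old index at all.

```python
def laspeyres_index(prices_t, prices_0, basket):
    """Laspeyres price index: the cost of a fixed base-period basket at
    time t relative to its cost in the base period.
    prices_*: {good: unit price}; basket: {good: base-period quantity}."""
    cost_t = sum(qty * prices_t[good] for good, qty in basket.items())
    cost_0 = sum(qty * prices_0[good] for good, qty in basket.items())
    return cost_t / cost_0

# Made-up numbers for illustration only.
prices_2000 = {"bread": 1.00, "landline_min": 0.10}
prices_2020 = {"bread": 1.60, "landline_min": 0.02, "smartphone": 400.0}
basket_2000 = {"bread": 100, "landline_min": 500}

# The fixed 2000 basket prices out at ~1.13 in 2020, but it is silent
# about smartphones, which did not exist in the base period.
print(laspeyres_index(prices_2020, prices_2000, basket_2000))  # ~1.13
```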

The proliferation of communication technologies, driven by the internet, also yields a fragmentation of behaviours into different data silos. Consider a research question that explores whether non-proximate synchronous voice-mediated communication is important to reducing feelings of social isolation. The past half century has seen a steady fracturing of this behaviour into different systems—from government-mandated monopolies (for example, Ma Bell in the USA) to oligopolies to a countless number of internet providers. Furthermore, there are plausibly systematic biases in the data captured in any one of these systems—whom you talk to on your mobile phone might be systematically different from whom you talk to via Zoom, Skype or WhatsApp33. Even the tortured linguistic construction used above reflects the sociotechnical complexity: not too long ago ‘non-proximate synchronous voice-mediated communication’ would have been described simply as a ‘phone call’. One important consequence of this technological fragmentation is that measurements relying on a single digital device or service should be interpreted with considerable caution. The answers that we find could plausibly differ from those we could get by measuring the behaviour in a similar but different technology. Ironically, because of that complexity, an accurate picture of whom someone generally talks to may be better captured through a simple survey question than through records from a single platform.

Conversely, behaviours observed in different silos that seem similar might actually be capturing very different phenomena. Just as various name generators that are used in surveys to generate lists of contacts result in the identification of different social ties34, a friend on Facebook does not denote the same relationship as a Twitter follower or a LinkedIn contact. Moreover, none of these relations denote a ‘friend’ as used either colloquially or scientifically, although there are very likely some strong statistical connections among these concepts. These systems, furthermore, change over time and their affordances—what they allow users to do—also evolve. This in turn means that the causal processes that underlie our online social actions, relationships and structures are constantly changing. As such, we must now be aware of system-varying properties of measures such as temporal and inter-system validity. The challenge then becomes developing measures that provide some degree of generalizability over time or across systems for a given research question.

Another deep problem is the algorithmic confounding of measurement35. Confounding here refers to our inability to distinguish signals that represent typical human behaviour from ones that result from the rules that govern a digital platform. Without knowing how a system is designed, we could easily attribute social motives to behaviour driven by algorithmic decisions. If Twitter’s feed suddenly starts to prioritize sports, a user may find out who won an Olympic competition without any change in their underlying interest in sports. Such changes are often difficult to detect, both because they are sometimes introduced without notice and because they may roll out unevenly, affecting certain user populations before others. This mechanism also functions in more subtle ways, such as when natural human proclivities are enhanced by algorithmic prompts. For instance, if Twitter systematically suggests that users follow back people who already follow them, that prompt can boost the natural tendency to reciprocate social ties36. More generally, internet companies aim to manipulate human behaviour so as to increase engagement on their platforms (for example, Facebook, Twitter and Instagram) or money spent on their products (such as Amazon and eBay). Those machine-learning-based manipulations are pervasive, and any efforts to develop measures from platform data need to evaluate the extent to which algorithms will distort both the measures and any downstream analyses. Because of their importance, those algorithms are worthy of closer study11,37.
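
A toy simulation illustrates how such a prompt confounds a behavioural measure. In the sketch below (Python; every parameter value is invented), the platform prompts half of its users to follow back, and a naive reading of observed reciprocity overstates the ‘natural’ tendency to reciprocate.

```python
import random

random.seed(42)

def observed_reciprocity(n=100_000, p_natural=0.20,
                         prompt_rate=0.5, prompt_lift=0.15):
    """Simulate follow-back decisions when the platform prompts a random
    fraction of users to reciprocate. All parameters are invented."""
    follow_backs = 0
    for _ in range(n):
        prompted = random.random() < prompt_rate
        p = p_natural + (prompt_lift if prompted else 0.0)
        follow_backs += random.random() < p
    return follow_backs / n

# Reading reciprocity straight off the platform data conflates the
# natural tendency (0.20) with the algorithmic lift: expect ~0.275.
print(observed_reciprocity())
```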

Although an in-depth discussion of causal inference is outside the scope of this paper, we should note that a number of the measurement issues identified here present a particular problem for research that aims to establish cause and effect. Lack of stability in measurement over time, for instance, may induce researchers to attribute changes in a focal outcome to an unrelated external event. The discussion above regarding Google Flu Trends is also relevant here. In that case, there was an implicit assumption that the flu was causally related to flu-related searches on Google. However, if Google around 2013 was proposing flu-related searches during flu season because it had, deep in its complex algorithmic machinery, inferred that it was flu season, then the measure of exactly the same behaviour would mean something very different in 2013 from what it did in 2008.

The malleability of human expression and language also poses general challenges around inferences of attitudes and opinions from language and image data38. Expressions of sentiment on Twitter are notoriously difficult for computers to decode, as they typically stumble over sarcasm, irony and hyperbole39. How problematic that is depends on the structure of the noise and, again, on what matters—that is, the research question.

Whom trace data measure

Human behaviour is a multi-level concept that often requires measurements at the individual level to make inferences about the distribution of behaviours, attitudes and attributes at the collective level. The research question should make clear what population is of interest to a particular study. That population could include people everywhere of all types, or it could be specific to a certain geographical region (a city or country), a particular community (a hobby group or company) or a myriad of other subpopulations (youths, immigrants, or politicians). Especially when entire populations are concerned, it is not feasible, logistically or financially, to collect data about everybody. In such cases, researchers should ideally collect data about a random sample of the population, which means that each member of the population has an equal probability of being in the sample. This ideal was never quite achievable, and is even less relevant in a world in which response rates to survey requests are below 10%, with uneven rates of accessibility of people across modalities of recruitment40.

With system-level data, one may be tempted to think that everybody is represented since the actions of all of the users are in the dataset. However, the sampling in this case happens at the level of who is a user of the system from which the data are collected as well as who is most active on said system41. It is, at best, a ‘convenience census’ of the platform under investigation rather than the whole population42. If the scientific objective is to make a statement about the people on the platform, that census might be compelling. However, any leap to generalize beyond that platform must be viewed more critically. This is a particular problem for research on Twitter, the most commonly cited source of emerging data, as it is used by only about 20% of the US population and is even less popular in most other countries43,44,45. Importantly, users of social media platforms do not mirror the general population of internet users either demographically41 or regarding other attributes such as their interests41,44,46. In light of recent progress in promoting the representativeness of research populations in other domains42,47, it is imperative to think carefully about these issues in the realm of social media47. We also note that methods that recalibrate data to make reasonable population-level inferences can be particularly powerful when applied to large-scale data48.
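
Post-stratification is one of the simplest such recalibration methods: cell-level estimates from a non-representative sample are re-weighted to known population shares. A minimal sketch with invented numbers (Python) follows; real applications typically use richer cells and model-based variants such as multilevel regression and post-stratification.

```python
def poststratify(cell_means, population_shares):
    """Re-weight cell-level sample means to known population shares.
    Both dicts are keyed by demographic cell (a hypothetical scheme)."""
    assert abs(sum(population_shares.values()) - 1.0) < 1e-9
    return sum(cell_means[c] * population_shares[c] for c in population_shares)

# Invented numbers: a platform sample that over-represents young users.
cell_means = {"18-29": 0.70, "30-49": 0.50, "50+": 0.30}      # outcome per cell
sample_shares = {"18-29": 0.60, "30-49": 0.30, "50+": 0.10}   # sample make-up
pop_shares = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}      # census benchmark

naive = sum(cell_means[c] * sample_shares[c] for c in sample_shares)
adjusted = poststratify(cell_means, pop_shares)
print(naive, adjusted)  # 0.60 versus 0.45
```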

The issue of generalizability is amplified when only a subset of the platform population is studied. The key question is whether and how the nature of the sample affects the inferences being drawn. Thus, for example, a study of Twitter users who include their names and locations in their profiles49 raises the question: do these findings generalize to Twitter users who do not divulge such details? Similarly, another study50 examines the consumption patterns of political information, based on the small minority of Facebook users who provided partisan labels in their profiles—but do the resulting findings generalize to individuals who do not divulge their political affiliations? The relatively large sample size in these studies—by social science standards—does not alleviate the concern that the sample is not representative of the population using the platform51. This issue is exacerbated by the sometimes large changes in who uses a platform over time (Facebook was once the exclusive domain of Harvard undergraduates), in which case these demographic shifts by themselves affect what happens on the platform.

Other critical problems in generalizability include the fact that different platforms elicit systematically different behaviours. For example, the same person will often behave differently on Facebook and on Twitter52. More generally, some human behaviour is highly dependent on the setting—if we could only observe the same people at work, at home or in a religious setting, we might draw radically different conclusions about humanity. Generalizability is a function not only of the population, but also of the particular observational contexts. Depending on the research question, this may or may not be a problem. A clearly defined question and population will help to establish how well the measurement lines up with the research intent.

Finally, we note the key measurement question of which systematic biases shape sampling. Generally, our data collection systems are biased away from minority and, especially, marginal populations; furthermore, our theoretical questions regarding populations typically focus on the middle of the distribution. Representativeness is an issue of transcendent importance in understanding humanity, now and in the past. Consider studies that analyse the text of Google Books (the largest digitized collection of human knowledge) and aim to draw conclusions about how linguistic shifts in the texts over the centuries correspond to shifts in, say, national sentiment4. This corpus suffers both as a representation of language use, because its composition systematically changed over time (for example, with a much higher representation of scientific texts in the twentieth century)1, and because even a well-curated set of books will reflect the reality of unrepresentative elites. Not even the largest library ever compiled can cast light on those who—although unrepresented in published texts—still had the ability to act and change the course of history.

These representativeness issues were a major concern in the social science methods of the twentieth century. Reaching respondents through postal mail systematically excludes homeless populations, telephone surveys exclude those without a phone, and surveys conducted in person are subject to people’s comfort with and trust in that type of interaction with a stranger.

Observational behavioural streams are potentially subject to similar biases. First, the instrumentation that collects the data is often a consumer good owned by an individual (for example, a mobile phone or a computer), for which costs present a barrier. Second, the instrumentation is often driven by corporate business models aimed at people with money to spend. Third, people more concerned or knowledgeable about privacy may be less represented in systems that track behaviours, as they opt out of using such services.

However, these data streams have some critical compensating features. Sensor technologies may fill in important data gaps, giving visibility to those who would otherwise be erased from the map. Satellite imagery, for instance, has been used to build indicators of wealth and poverty in the Global South when surveys of household income and consumption do not exist53. The banal pervasiveness of modern technology means representation will in many cases be superior to traditional data-collection mechanisms—it is cheaper to own a mobile phone than a home. There are parallels here to the administrative data that W. E. B. Du Bois used to study African-American individuals in the late nineteenth and early twentieth century54. The data of an administrative state that enforced racial hierarchy were surely not neutral, yet still had critical value in providing visibility of those most precariously positioned in society.

Furthermore, large sample sizes allow us to look at the behaviour of subsets of the data, for example, minorities (generally construed) and events that are statistically uncommon but consequential (for example, hate speech or misinformation)49,55,56. In these cases, sample size and our ability to zoom into smaller populations and infrequent data points matters more than the representativeness of the sample57. As Pareto observed long ago, many human behaviours are concentrated in tiny slices of the population58; however, twentieth century methods were generally poorly suited to studying that social reality. Perhaps the social theories of the twenty-first century will be able to use micro-level behavioural data to understand how structures of interdependence yield certain macro-level patterns59.

Access and ethics in measurement

Emerging data streams from sociotechnical systems present two additional challenges, compared with—for example—the data from the Hubble telescope. First, the Hubble telescope is controlled by the scientific establishment, whose goal, presumably, is to answer scientific questions. The institutional goal of a platform, such as Twitter, is clearly not to answer scientific inquiries. The first question is therefore, what can be measured? Second, humans as research participants pose ethical issues that far-away galaxies clearly do not. The question that follows then is what should be measured? We deal with these two questions in turn.

What can be measured varies markedly depending on the system that is generating the data. It is possible to design a small-scale data collection system that relies on consenting participants60; however, access to data from millions of people generally requires partnership with a platform. There is a wide spectrum of availability for internet-based communication data, with access rules that differ greatly across data holders and time. At the least restrictive end, platforms such as Reddit and Wikipedia allow access to nearly everything the end user can view in machine-readable formats. By contrast, companies such as Facebook and Twitter offer far more restrictive access regimes that are limited by time, data volume and the fact that not all publicly visible data are programmatically accessible. Notably, none of the current major platforms offers individual-level data on what people pay attention to, a remarkably large gap in current internet-based measurements61. Furthermore, none of the platforms provides access to information on the extensive randomized controlled trials (in the form of A/B testing) that they conduct, which could—in principle—enable inferences about the influence of their algorithms on individuals62. Generally, any private authority that controls data of interest to researchers can, in the absence of regulation to the contrary, dictate the terms of data access as it chooses. The fact that the actions of platforms such as Twitter and Facebook are a compelling focus for scientific questions of public interest (consider: does a platform amplify the spread of misinformation? What steps does a platform take in response to hate speech?) makes this control deeply problematic63. A duty of scholarship in these spaces is to inform public discourse on these important questions. A corollary to the question of what can be measured must be: is it possible to speak truth to power if the power in question controls access to the data used to construct that ‘truth’? And, if not, is it (ever) possible to trust any measures that are allowed to be extracted from a given system?

Emerging data sources also pose new ethical challenges. We focus on those that intersect with measurement and, in particular, on what can and should be measured. More extensive discussions of trace data ethics, as well as alternative models for data access, are available elsewhere18,64,65,66; here we briefly present five particularly pressing concerns. First, although informed consent is a mainstay of research on human participants, anonymous data acquired by third parties are often not considered ‘human participant data’ and are therefore exempt from review by institutional review boards. What are the ethical obligations of the researcher to consider the circumstances under which the data were collected? In a recent example, over 70 terabytes of data from the far-right social network Parler were publicly released in early January 2021, including GPS-derived location data67. Whether researchers can ethically analyse this dataset is a topic of ongoing debate, particularly in light of the use of the website as a planning space for the US Capitol insurrection of 6 January 2021. Far more generally, people are probably unaware of how different systems track them, whether through mobility data from phones68 or browsing data. What, then, are the ethics of using tracking data from third parties when the targets of that tracking are, at most, nominally aware of that fact?

Second, the level of detail in behavioural datasets means that anonymization that is robust to re-identification efforts is often practically difficult or impossible69. It is important to note that data labelled as de-identified or anonymous span both data that cannot be re-identified and data that can. Approaches based on ‘differential privacy’ have emerged that add calibrated noise to a dataset, providing a quantifiable privacy guarantee and making the data robust to re-identification efforts70,71. There is a trade-off, however, because the privacy-enhancing addition of noise diminishes the utility of the data. This was the approach taken in the Social Science One project that provided analytical access to Facebook data72 (Box 2). One of the struggles confronting the teams granted access was whether the resulting data retained value for answering their questions. (Note: some of the authors are involved in Social Science One and the Facebook 2020 Election Research Project.)
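
To illustrate the mechanics, the sketch below (Python) releases a single count through the Laplace mechanism, the canonical building block of differential privacy. The epsilon values are arbitrary, and real deployments involve far more careful sensitivity analysis and privacy budgeting.

```python
import math
import random

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with epsilon-differential privacy via the Laplace
    mechanism: add Laplace noise with scale = sensitivity / epsilon.
    Smaller epsilon means stronger privacy and a noisier, less useful answer."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5                      # uniform on (-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# The privacy-utility trade-off in miniature (illustrative values):
print(dp_count(1200, epsilon=1.0))   # typically within a few units of 1200
print(dp_count(1200, epsilon=0.01))  # can be off by hundreds
```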

Third, what expectations of privacy are reasonable for publicly visible behaviours, such as tweets? What obligations are incumbent on the researcher to cloak those behaviours? For example, when should researchers avoid mentioning (in publications or presentations) information such as user screen names and complete social media messages, because of the possibility of negative attention or harassment? Some have argued that automatic anonymization of public data may not be the right approach either; rather, content creators should be consulted about their preferences73.

Fourth, the reliance on the principle of individual autonomy is intrinsically limited, for two reasons. In a world of networked information and insight there will generally be informational spillover from what one person discloses to other individuals. The function of networked media, by definition, is to facilitate interpersonal visibility74. An individual who shares their email data, for example, is necessarily providing information from other individuals. The Cambridge Analytica scandal demonstrates the perils of this kind of networked disclosure of information, in which individuals used a Facebook app, which in turn provided access to the behavioural data of the friends of those individuals. However, the risk of informational spillover is a more general principle that is not new with digital trace data: there are almost always potential spillovers from individual disclosure. Genetic data, for instance, potentially shed light on close relatives of an individual75, and almost all data about an individual provide information about others. A response from one individual regarding their political preferences provides insights into the preferences of other household members. Knowledge about the drug use of one individual provides insights into the potential drug use of the friends of that individual.

There is also intra-individual informational inference, where information provided (perhaps with consent) enables inferences that the individual may not have anticipated76. The practical ethical upshot cannot be that all research for which there is the possibility of informational spillover or inference from disclosure is forbidden; however, it does mean that often there will need to be important limits to data sharing and data visibility. It also highlights the importance of data security.

Building on our discussion of whom trace data measure, researchers must take care when attempting to generalize the results of trace-data-based research to populations beyond the platform(s) examined, as well as to the offline lives of the participants41. It is essential to find ways to include participants who are digitally underrepresented, especially when such research is used to inform decisions about wide-ranging social or corporate policies.

Conversely, when digital forms of measurement can offer a better representation of marginal groups compared to a traditional twentieth century approach, our ethical obligation should be to use them, as the example of satellite data above highlights. The choice confronting society is not whether digital technologies will be used to measure human behaviour, but when, how and whether anyone outside of corporate or state surveillance will have access to those data. Ideally, large-scale digital data sources would feed into measures that inform nuanced policies and targeted interventions, going beyond one-size-fits-all initiatives, which tend to work less well for minority groups.

Finally, it is a duty of the field to critique decision-making practices that result from problematic measurement procedures. A previously published study demonstrated that the racial bias of an algorithm used by many hospitals was driven by errors of measurement; it is an excellent example of both the dangers that flawed measurement poses in automated decision-making and the potential for good science to help rectify those issues11.

Outlook

Box 3 summarizes the essential arguments of this paper. The massive instrumentation of global society has enormous potential to transform our understanding of the social world. However, the revolution in instrumenting human behaviour requires a revolution in the measurement of human behaviour. Any new measurement regime needs to match the possibilities of both old and new theories of society, deal with the essential instability of human measurement within these heavily instrumented sociotechnical systems, and develop a new model of ethical research of human participants that balances individual rights and collective benefits.