A Banksy artwork in Cheltenham, UK. Scientists tried to find the artist’s true identity using public data. Credit: Matt Cardy/Getty

British graffiti artist Banksy is renowned for his anonymity. But that status was dented last year when researchers published a paper that cross-referenced the locations of Banksy’s street art with public information about people’s addresses and likely movements (M. V. Hauge et al. J. Spatial Sci. 61, 185–190; 2016). The team, led by academics at Queen Mary University of London, concluded that someone previously suspected to be Banksy probably was the secretive artist.
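The gist of such location-based inference can be shown with a toy calculation. The sketch below is not the model used in the published paper; it simply scores hypothetical candidate "anchor" points (a home or workplace, say) against a set of artwork sites using a distance-decay function, with arbitrary example coordinates and invented candidate names.

```python
# Toy geographic-profiling sketch: which candidate anchor point (e.g. a home
# or workplace) best "explains" a set of artwork locations?
# Coordinates and candidate names are arbitrary examples; this distance-decay
# score is an illustration, not the model from the published study.
import math

def distance_km(a, b):
    """Approximate great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def profile_score(anchor, sites, decay_km=5.0):
    """Sum of exponentially decaying contributions: nearer sites count more."""
    return sum(math.exp(-distance_km(anchor, site) / decay_km) for site in sites)

artwork_sites = [(51.51, -0.13), (51.45, -2.59), (51.90, -2.08)]
candidates = {"candidate A": (51.46, -2.59), "candidate B": (53.48, -2.24)}

for name, anchor in candidates.items():
    print(name, round(profile_score(anchor, artwork_sites), 3))
```

The closer a candidate's anchor points sit to the clustered sites, the higher its score, which is why even coarse public location data can narrow a field of suspects.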

Because the study used public data, a university ethics committee said that the work was exempt from formal review — and informally advised academics that it would do no harm because a UK national newspaper had already identified the person in question as Banksy. But for some ethicists, the paper highlights growing concerns about the potential hazards of research that uses public data. “I think this study should never have been done,” says Jake Metcalf, a technology ethicist at the think tank Data & Society in New York City.

Metcalf is one of several academics calling for new guidelines to steer scientists through ethical quandaries in Internet research. The unprecedented availability of online data — together with tools to draw patterns from it, such as machine learning — is opening up research possibilities that outpace existing ethics frameworks around privacy, consent and harm, says Charles Ess, a research ethicist at the University of Oslo and a member of the Association of Internet Researchers. The association will discuss how to update its guidelines at its annual meeting on 19 October in Tartu, Estonia. 

A flurry of similar initiatives is under way. Earlier this year, the SATORI project, funded by the European Commission, published recommendations on Internet research as part of an effort to standardize and update research-ethics guidelines. In September, the US National Science Foundation funded a US$3-million, 4-year study called PERVADE — of which Metcalf is a part — that aims to chart attitudes to data-research ethics, produce best-practice guidelines and create tools to assess the potential harms of such work. And some British universities are preparing their first guidelines on the ethics of Internet research, after the UK Research Integrity Office, a national advisory body, published non-binding recommendations about it last December. 

Common themes among these efforts include rethinking what counts as ‘public’ data, the ethical use of social media and the need to consider a study’s potential harm to wider society, as well as to individuals. Many countries have long-standing ethical checks for research that intervenes in human lives. But those principles, set up for medical and psychological studies, apply to research on human subjects, the definition of which often excludes Internet research, says Metcalf. 

In the United States, for instance, studies that use public data (including data purchased from a third party) generally do not count as human-subjects research, because they do not access private, identifiable information about people. Such studies need not be vetted by an institutional review board (IRB), nor do they require informed consent. Guidelines issued in 2013 add that researchers should sometimes consider seeking review — if a person incorrectly assumed that access to his or her public information was restricted, for example. But IRBs have no obligation to adopt these proposals, and different committees may come to different verdicts, says Metcalf.

Peter Hedges, head of the research-operations office at the University of Cambridge, UK, argues that even researchers who use information that is undeniably public, such as Twitter data, should review the ethics of their work. The SATORI guidelines advise that regulators and researchers should carefully consider whether publicly available information is actually private, and not fall back on simple classifications.


If someone’s data are considered private and identifiable, using them would usually require obtaining informed consent. But, in practice, such consent is often impossible to acquire for large-scale data studies, says Ess. And anonymizing data is difficult, because search engines can easily identify individuals from even small snippets of anonymized text, or by cross-referencing multiple data sources. The SATORI guidelines recommend that researchers take precautions to ensure the anonymity of study participants, and Ess suggests that scientists can still, without too much effort, seek consent from anyone they explicitly quote in research papers.
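The re-identification problem Ess describes can be sketched in a few lines: a handful of quasi-identifiers is often enough to link an "anonymized" record to a named public profile. The datasets and field names below are entirely hypothetical.

```python
# Hypothetical illustration of re-identification by cross-referencing:
# an "anonymized" research record is matched against a public profile list
# using quasi-identifiers alone. All records here are invented.

anonymized_records = [
    {"id": "u-001", "postcode": "NG7", "birth_year": 1985, "employer": "hospital"},
    {"id": "u-002", "postcode": "OX1", "birth_year": 1990, "employer": "university"},
]

public_profiles = [
    {"name": "Alex Example", "postcode": "OX1", "birth_year": 1990, "employer": "university"},
    {"name": "Sam Sample", "postcode": "OX1", "birth_year": 1972, "employer": "retail"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "employer")

def matches(record, profile):
    """True if every quasi-identifier agrees between the two records."""
    return all(record[key] == profile[key] for key in QUASI_IDENTIFIERS)

for record in anonymized_records:
    hits = [p["name"] for p in public_profiles if matches(record, p)]
    if len(hits) == 1:
        print(f"{record['id']} is likely {hits[0]}")  # uniquely re-identified
```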

When ethics committees do assess data studies, their viewpoint might be too narrow, says Ansgar Koene, an engineer and ethicist at the University of Nottingham, UK. They tend to consider the direct damage to an individual involved in research, rather than a project’s potential to do widespread harm to society. That debate flared up in September when artificial-intelligence researchers at Stanford University in California posted a preprint of research that predicted whether someone is gay from their photo; it used pictures sourced from an online dating site (see https://osf.io/zn79k). The study was approved by Stanford’s IRB, but provoked condemnation from some advocacy groups for lesbian, gay, bisexual, transgender and queer (LGBTQ) people, which branded it dangerous. The study’s lead author, Michal Kosinski, said the work aimed to protect people by exposing an existing threat from widely used technology. Kosinski and his colleague, Yilun Wang, discussed their results afterwards with representatives of the LGBTQ community, but Koene says that the discussion should have happened beforehand and the paper should have addressed their comments.

Computer science is a flashpoint for Internet-research ethics. Researchers in this field are not used to working with human study participants and often don’t consider the ethical impact of their work, says Koene, who has surveyed approaches to ethics in different disciplines. A major concern, academics agree, is how companies use online data for research — much of which they have proprietary access to. In 2014, for example, Facebook altered users’ newsfeeds without telling them, to study how this affected their emotions. A public backlash prompted Facebook to publish some details of its internal review process (M. Jackman and L. Kanerva Wash. Lee Law Rev. Online 72, 442; 2016) — but there is little transparency overall about how this works, says Koene. 

Researchers may not want to see their science slowed by formal ethical review, which can be time-consuming and opaque. Better ethics training is one solution, says Koene. But a failure to align data science with public perceptions of what is acceptable could generate a severe reaction, he warns. “The public will see us as no different from corporate or other special-interest groups pursuing a hidden agenda,” he says.