The data behind analyses, such as this map of mobile-phone use in Rome, present privacy issues. Credit: MIT SENSEABLE CITY LAB.

When web provider AOL's research division published an analysis of search behaviour on the Internet last year1, it had what it thought was a bright idea: it would reach out to academics by making an anonymized version of the data freely available for download from its website. But within hours, it had to pull the site, after bloggers managed to infer many identities from the data and view the associated search histories.

AOL's mistake highlights a growing dilemma for the social sciences. The hottest growth area in the field is computational social science. This is often based on privileged access to electronic data sets such as e-mail records, mobile-phone call logs and web-search histories of millions of individuals.

Such studies are ushering in a revolution in the social sciences, specialists say. But there is a trade-off between the scientific interest in working with such data and concerns about privacy. "It's a huge issue," says David Lazer, a researcher at the John F. Kennedy School of Government at Harvard University.

Lazer is co-author of a social-network analysis based on the largest mobile-phone data set ever studied by academics: 18 weeks of details of who called whom, when and for how long, among 7 million users, representing 20% of the population of an unidentified European country, supplied by an unnamed operator2. A month after it appeared, Microsoft researchers published an even larger study3, this time of instant messages, featuring 30 billion conversations among 240 million people worldwide.

Until now, social science has struggled to obtain tools that do more than scratch the surface of some of its questions. These range from identifying the driving forces behind violence, to the factors influencing how ideas, attitudes and prejudices spread through human populations. The available tools have largely remained in a time warp, consisting of analyses of national censuses, small-scale surveys, or lone researchers with a notebook observing interactions within small groups.

Being able to automatically and remotely obtain such massive amounts of continuous data opens up unprecedented opportunities for social scientists to study organizations and entire communities or populations, says Marshall Van Alstyne, a researcher at Boston University. He is conducting research on the dynamics and productivity of organizations by analysing network patterns of e-mails among volunteers.

"There is enormous potential here for lines of research that shed new light on basic social-science questions," says Jon Kleinberg, a specialist in network analysis at Cornell University in Ithaca, New York.

But the privacy issue looms large. Repetition of the AOL gaffe by other researchers might create a damaging public backlash, warns Myron Gutmann, director of the Inter-university Consortium for Political and Social Research, based in Ann Arbor, Michigan. The consortium acts as a clearinghouse for secure access to conventional sensitive social-science data sets, such as the raw data of the US national census.

"So far, researchers have been careful, and successful at avoiding serious problems," says Kleinberg. "But as the number of these types of study increases, the community is clearly going to need to engage in deeper discussions about the right way to safeguard privacy in working with these kinds of data."

Lazer says the mobile-phone study was possible only after taking many precautions, including anonymizing data before researchers gained access to them, confidentiality agreements between the phone company and Harvard University, and approval and stipulations on access controls by the university's internal review board.

But those ad hoc arrangements can go only so far, says Lazer, and a larger institutional framework is needed to set best practices, especially for safe sharing of data among academics. The social scientists pioneering such research often have computer-science backgrounds. As the software tools developed go mainstream, they get taken up by academics with less expertise in protecting data from abuse.

Although AOL had anonymized the data it released (20 million web queries from around 650,000 users over 3 months), the company overlooked the fact that people's search queries often include vanity searches on their own name. This fact, combined with context from other search queries, often allowed complete search records to be narrowed down to individuals.
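To make the mechanism concrete, here is a minimal Python sketch of how a name search can tie a pseudonymous ID to a person. It is an illustration only, not a reconstruction of what the bloggers actually did: the log entries, the list of names and the helper functions `group_by_user` and `candidate_identities` are all invented for this example.

```python
from collections import defaultdict

# Hypothetical toy log of (pseudonymous user ID, query). All data are invented.
QUERY_LOG = [
    (101, "jane example"),
    (101, "plumbers in springfield"),
    (101, "jane example springfield"),
    (202, "cheap flights"),
    (202, "weather rome"),
    (303, "john sample"),
    (303, "mobile phone plans"),
]

# Hypothetical names an attacker might test, e.g. taken from a public directory.
KNOWN_NAMES = ["jane example", "john sample", "alex nobody"]


def group_by_user(log):
    """Collect the queries of each pseudonymous ID into one history."""
    histories = defaultdict(list)
    for user_id, query in log:
        histories[user_id].append(query.lower())
    return dict(histories)


def candidate_identities(histories, names):
    """Map each ID to the names found in its queries, with hit counts."""
    candidates = {}
    for user_id, queries in histories.items():
        hits = {}
        for name in names:
            count = sum(1 for q in queries if name in q)
            if count:
                hits[name] = count
        if hits:
            candidates[user_id] = hits
    return candidates


histories = group_by_user(QUERY_LOG)
matches = candidate_identities(histories, KNOWN_NAMES)

# Each matched ID now exposes its full search history next to a likely name.
for user_id, hits in sorted(matches.items()):
    print(user_id, hits, histories[user_id])
```

In this toy log, the other queries under a matched ID (a town, a trade) are the kind of context that could narrow a common name down to one person, which is why replacing user names with numbers does not by itself anonymize such records.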

Using algorithms similar to those for cracking cryptography keys, Kleinberg also showed in a May paper4 that more sophisticated attacks can be used to identify people in completely anonymized data. In the model, individuals in the network, and simulated false users, were able first to identify themselves in the network, and subsequently obtain the supposedly confidential data of all those connected to them. "It's hard to guarantee in practice that any data set, however well anonymized, is truly safe from privacy breaches," says Kleinberg.

This work reinforces the need for a systematic, institutional approach to improving the privacy rights of those whose data are used, says Van Alstyne. That echoes the conclusions of a May study by the US National Academies5, which said that safeguarding privacy cannot safely be left to individual researchers. It stated: "Institutional solutions involve establishing tiers of risk and access, and developing data-sharing protocols that match the level of access to the risks and benefits of the planned research."

But Gutmann and other social scientists also stress that the risks should be kept in perspective. Scientists must meet strict rules on any research on human subjects. In contrast, private firms are largely free from such constraints, and already have wide latitude to snoop on, and data mine, their employees' work habits.

Specialized firms, not to mention governments, are also building vast cross-referenced databases of information on every aspect of individuals' lives, including their web search behaviour, interests and personal preferences. An excessive focus on research would be "fixing the leaky faucet when the bath-tub's overflowing", says Van Alstyne. "Businesses seem more prone to misuse private data than scientists of any stripe."