Published online 19 November 2008 | Nature 456, 287-288 (2008) | doi:10.1038/456287a


Web data predict flu

Search engines provide information about epidemics.

Two new studies hint at the public-health and research potential of mining the data created as people search the web. Both teams have successfully detected the onset of US seasonal flu epidemics, by extracting patterns of flu-related search terms from the billions of queries stored by Google and Yahoo.

The work tested the hypothesis that people search the Internet for flu-related terms more frequently when they get sick. One group used Google's search-query logs, the other Yahoo's. Together they generated strikingly concordant findings: search patterns matched official flu surveillance data almost perfectly, and often ran weeks ahead of them.

In the Yahoo study [1], carried out with researchers at the University of Iowa in Iowa City and Harvard University, the team manually selected keywords for testing, such as 'flu' or 'influenza', while eliminating confounding terms such as 'avian' or 'bird'.
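The keyword selection described above can be pictured with a small sketch. It is not the study's code: the function names, the word-level matching rule and the weekly grouping are assumptions made for illustration, and only the four example terms come from the text.

```python
import re

# Example terms quoted in the article; real keyword lists would be longer.
INCLUDE = {"flu", "influenza"}
EXCLUDE = {"avian", "bird"}


def is_flu_query(query: str) -> bool:
    """True if the query has an included word and no confounding word."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return bool(words & INCLUDE) and not (words & EXCLUDE)


def weekly_flu_share(queries_by_week):
    """Map each week to the fraction of its queries that are flu-related.

    `queries_by_week` is a hypothetical dict of week label -> list of
    query strings; weeks with no queries are skipped.
    """
    return {
        week: sum(is_flu_query(q) for q in queries) / len(queries)
        for week, queries in queries_by_week.items()
        if queries
    }
```

Working with a share of all queries, rather than a raw count, keeps the signal comparable between weeks with different overall search volume.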

The researchers compared the relative frequency of the search terms, between 2004 and 2008, with weekly national data on the standard surveillance indicators of flu — positive viral isolates and flu mortality rates. "We found that we could explain weekly variation in seasonal influenza one to three weeks in advance of cultures, and five weeks in advance of mortality," says Philip Polgreen, an epidemiologist at the University of Iowa.
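One simple way to ask how far a search signal runs ahead of surveillance data is to shift one weekly series against the other and see which shift gives the strongest correlation. The sketch below does this with a plain Pearson correlation; it is an illustrative stand-in, not the statistical model used in the study, and the function names and the `max_lead` parameter are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_lead(search, surveillance, max_lead=5):
    """Return (lead_weeks, r) where search[t] best matches surveillance[t + lead].

    Both inputs are weekly series of the same length, longer than max_lead + 1.
    """
    best = None
    for lead in range(max_lead + 1):
        x = search[: len(search) - lead]  # drop the last `lead` weeks
        y = surveillance[lead:]           # drop the first `lead` weeks
        r = pearson(x, y)
        if best is None or r > best[1]:
            best = (lead, r)
    return best
```

A lead of two weeks here would mean that this week's search share lines up best with the surveillance figure reported two weeks later.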

The Google research, in collaboration with the US Centers for Disease Control and Prevention (CDC), modelled combinations of terms to find the best fits against weekly CDC data from 2003 to 2007. It then created predictive models and tested them against the number of influenza-like cases reported by the CDC physician surveillance network during the 2007–08 flu season [2]. The models based on best-fit terms, the researchers claim, were "consistently one to two weeks ahead" of the CDC surveillance reports.
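The fit-then-test design described above, in which a model is fitted on earlier seasons and checked against a season it has not seen, can be sketched as follows. A straight-line least-squares fit is assumed purely for illustration; the form of the published model may differ, and all names here are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to new query shares."""
    slope, intercept = model
    return [slope * a + intercept for a in x]


def mean_abs_error(predicted, actual):
    """Average absolute gap between predictions and reported values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Usage sketch with made-up numbers: fit on earlier weeks, then score the
# model only on held-out weeks that played no part in the fit.
train_queries, train_cases = [1, 2, 3, 4], [3, 5, 7, 9]
held_out_queries, held_out_cases = [5, 6], [11, 14]
model = fit_line(train_queries, train_cases)
error = mean_abs_error(predict(model, held_out_queries), held_out_cases)
```

Scoring only on the held-out season is what separates a genuine forecast from a curve that merely fits the data it was tuned on.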

Although search-term analyses could not replace surveillance monitoring — they could be prone to false alarms, and do not provide key information found through recording case numbers and viral isolates — they could supplement existing methods.

In particular, using logs of billions of search queries is fast and cheap, and seems to yield good temporal and spatial resolution, at least down to the state level. Google.org, Google's philanthropic arm, has turned its findings into Google Flu Trends, a free daily flu forecast for the United States. It plans to roll out the service to other countries, languages and diseases.

"Google Flu Trends nicely illustrates both the broad coverage and fine-grained detail that one can get from search-engine query-log data," says Jon Kleinberg, a specialist in network analysis at Cornell University in Ithaca, New York.

Scientists have only begun to scratch the surface of the potential of large web data sets — often because companies are reluctant to share their data because of both real and perceived privacy concerns, says David Lazer of the John F. Kennedy School of Government at Harvard University. "It is difficult to get access to such data, and so it should be, because there are confidentiality issues," he says. In the flu-prediction case, this was much less of an issue, says Lazer, as the data used were large pooled aggregates of search-term frequencies, from which it would be impossible to obtain information on individuals' searches.


In 2006, lacking access to Google databases, Gunther Eysenbach of the University of Toronto took out a Google ad that pointed users searching for flu-related terms to a flu educational website. The pattern of clicks on the ad correctly anticipated the official Canadian FluWatch reports by around a week [3], for a total monitoring cost per season of about Can$364 (US$295).

The Google and Yahoo work shows that real-time electronic data can improve on government data, says Marshall Van Alstyne, an expert on information economics at Boston University in Massachusetts. "No doubt, it presages a trend toward leading rather than lagging indicators for other socially important phenomena," he says. That might even include the credit crisis, he says — because banks' own activities give them better and earlier data on macroeconomic trends than those that government agencies can collect. 

  • References

    1. Polgreen, P. M., Chen, Y., Pennock, D. M. & Nelson, F. D. Clin. Infect. Dis. 47, 1443–1448 (2008).
    2. Ginsberg, J. et al. Nature doi:10.1038/nature07634 (2008).
    3. Eysenbach, G. AMIA Annu. Symp. Proc. 244–248 (2006).