Big data can help to address many pervasive problems in the field of public health. For instance, large-scale data analyses are helping researchers to understand global patterns of disease, the range of factors that contribute to global health and the policies that provide the greatest potential for improvement [1,2]. In a paper in Nature, Heft-Neal et al. [3] propose, implement and (importantly) scrutinize such an approach, exploring the impact of air quality on infant mortality in sub-Saharan Africa.
The authors’ study joins a growing body of work that explores international patterns of health outcomes through creative analyses of big data — a set of approaches pioneered by many on local geographical scales, but brought to the global-health stage by a project called the Global Burden of Disease Study (GBDS). In these types of study, multiple sources of health, administrative and research data are pooled and subjected to mathematical modelling and complex statistical analysis. But this exciting branch of public-health research is still finding its place amid conventional epidemiological techniques that involve gathering data from direct observations in cases and controls, or in longitudinal studies.
GBDS data have previously been used to estimate links between local air quality and mortality on a global scale (for example, in the project’s 2016 report [4]). But these analyses were dominated by data obtained from air-pollution monitoring stations, which are predominantly found in developed countries. In these areas, air pollution is typically lower than in sub-Saharan Africa.
By contrast, Heft-Neal et al. used satellite-based measurements of air pollution. They combined these measurements with data from 65 household health surveys, which they used to determine mortality for almost 1 million births in 30 countries across sub-Saharan Africa between 2001 and 2015. The authors also focused on infant mortality from all causes, whereas the GBDS emphasized mortality due to respiratory illness.
The results are surprising. Heft-Neal et al. estimate that 22% of infant deaths in sub-Saharan Africa (a total of 449,000) could be avoided by decreasing average levels of air pollution to the lowest levels observed in the region, a concentration of 2 micrograms per cubic metre. This level of comparative improvement is higher than the estimates reached by two previous analyses using the publicly available GBDS data [5,6] (Fig. 1). The authors place their results in the context of the previous work, putting forward several reasons for the different values. These include differing assumptions about what level of improvement in air quality is attainable (improvement from a median of 25 to 2 µg m−3 in the present paper, compared with improvement to 5.8 µg m−3 in the earlier analyses) and different sets of mortality data.
Rather than being satisfied with the headline association alone, Heft-Neal and colleagues carefully review the uncertainty in their estimation. For instance, they detail how the results might be affected by analytical assumptions, such as a linear relationship between air pollution and mortality within the range of observed values, and potential biases associated with using satellite-based measurements as a proxy for air pollution at ground level. They also consider potential confounders such as socio-economic status — it has previously been predicted that wealthier households would be less affected by air pollution than poorer households, but the authors show that this is not the case in their analysis. Such self-reflection is refreshing and essential, and places the results in an appropriate context for consideration by researchers and policy experts.
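To make the scale of these figures concrete, the short Python sketch below works through the arithmetic implied by the quoted numbers. It is a back-of-envelope illustration only, not the authors' statistical model. The function names are hypothetical, and the proportional rescaling assumes that avoidable deaths fall in a straight line with the reduction from the quoted median of 25 µg m−3, which is a deliberate simplification of the linear relationship discussed above.

```python
"""Back-of-envelope arithmetic on the figures quoted in the text.

The proportional scaling below is an illustrative assumption, not the
authors' statistical model.
"""

AVOIDABLE_SHARE = 0.22      # share of infant deaths quoted as avoidable
AVOIDABLE_DEATHS = 449_000  # number of deaths that share corresponds to
BASELINE_PM = 25.0          # median concentration quoted in the text (µg m^-3)
REFERENCE_TARGET_PM = 2.0   # lowest observed level, used as the target


def implied_total_deaths(avoidable_deaths, avoidable_share):
    """Total infant deaths implied by a count and the share it represents."""
    if not 0 < avoidable_share <= 1:
        raise ValueError("avoidable_share must be in (0, 1]")
    return avoidable_deaths / avoidable_share


def scaled_avoidable_share(target_pm):
    """Rescale the quoted share in proportion to the size of the reduction.

    Assumption (hypothetical): avoidable deaths scale linearly with the drop
    from BASELINE_PM, so a drop to REFERENCE_TARGET_PM gives the quoted share.
    """
    if not REFERENCE_TARGET_PM <= target_pm <= BASELINE_PM:
        raise ValueError("target_pm must lie between the target and the baseline")
    reduction = BASELINE_PM - target_pm
    reference_reduction = BASELINE_PM - REFERENCE_TARGET_PM
    return AVOIDABLE_SHARE * reduction / reference_reduction


if __name__ == "__main__":
    total = implied_total_deaths(AVOIDABLE_DEATHS, AVOIDABLE_SHARE)
    print(f"Implied total infant deaths: {total:,.0f}")
    for target in (2.0, 5.8, 10.0):
        share = scaled_avoidable_share(target)
        print(f"Target {target:4.1f}: share {share:.1%}, deaths {share * total:,.0f}")
```

Dividing 449,000 by 0.22 implies roughly 2.04 million infant deaths in total over the period considered. The rescaled values for less ambitious targets are only as good as the straight-line assumption; they do not reproduce the earlier GBDS-based estimates, which rest on different mortality data and modelling choices.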
Heft-Neal et al. outline their data sources in their supplementary information, but future work can go further by filling in the details necessary to replicate and reproduce results from big-data studies. For example, detailed, peer-reviewed descriptions of data curation should be published, and the final data set should itself be deposited in citable repositories such as datadryad.org. By sharing citable analysis details and data, the value of studies such as Heft-Neal and colleagues’ could be even greater.
Is this the final word on associations between air quality and infant mortality? Certainly not, because any observational study runs the risk of confusing correlation with causation. But I would suggest that proof of causation should not be the only motivation for such studies. Rather, the goal of any scientific exploration should be to know more afterwards than we did before. Proving causation might help researchers to pinpoint the direct effects of particular policies on particular aspects of health. But carefully vetted broad-scale associations can point to ways in which small policy changes can yield large improvements (even if indirectly) in addressing challenging public-health goals. This is especially useful for aspects of public health, such as air-pollution analyses, in which tightly controlled experimental studies would be difficult and ethically challenging — it would not be possible, for instance, to randomize levels of air pollution to individuals, nor to easily assign specific exposures to specific locations.
Large-scale data-science studies can offer insight into factors that predict trends in health outcomes, but may have limited use for defining causation, particularly at continental scales. For example, consider Google Flu, which aimed to estimate the numbers of influenza cases in the United States by analysing search-term trends relating to flu symptoms. For many weeks, the system’s data-science-based predictive approach provided more-accurate results than did conventional epidemiological tracking based on physician reports and laboratory confirmation. However, following an adjustment to the prediction algorithm in early 2013, the system vastly overestimated flu cases for two weeks [7]. By relying wholly on associations rather than also incorporating epidemiological risk factors, the algorithm had few checks and balances against over- or underestimation, and offered few insights into the factors driving short-term patterns in flu incidence.
In summary, although big-data analyses cannot replace careful epidemiological studies, they can give broad insight into the potential benefits of public-health policies. In this case, Heft-Neal and colleagues’ work highlights the benefits of aspiring to reduce air pollution to the lowest levels observed in their data set, and provides assessments of the effects of more-modest changes in pollution levels. This type of analysis certainly has a place in the modern public-health toolbox. As noted by Kofi Annan [2]: “Without good data, we’re flying blind. If you can’t see it, you can’t solve it.”
Nature 559, 188-189 (2018)