Arising from Chaves et al. Nature Communications (2020)

In a recent study, Chaves et al.1 find international consumption and trade to be major drivers of ‘malaria risk’ via deforestation. Their analysis is based on a counterfactual ‘malaria risk’ footprint, defined as the number of malaria cases in absence of two malaria interventions, which is constructed using linear regression. In this letter, I argue that their study hinges on an obscured weighting scheme and suffers from methodological flaws, such as disregard for sources of bias. When addressed properly, these issues nullify results, overturning the significance and reversing the direction of the claimed relationship. Nonetheless, I see great potential in the mixed methods approach and conclude with recommendations for future studies.

To construct ‘malaria risk’, Chaves et al.1 regress malaria cases on cumulative tree cover loss and two malaria intervention variables, expressed in shares of usage. Their globally aggregated data cover the period from 2000 until 2015 on a yearly basis. Data on malaria cases and tree cover loss are available for 26 countries in tropical biomes, while the two intervention variables are only available for 13 of these countries in Africa. Figure 1 shows the time series under scrutiny; additional information on the data is provided in Supplementary Note 1.

Fig. 1: Time series under consideration.

Variables are malaria cases (in million), cumulative tree cover loss (in million hectare), percent sleeping under insecticide-treated nets (ITN) and percent of under-5 fevers receiving artemisinin-based combination therapies (ACT).

Chaves et al.1 specify their regression model as (see their paper for notation)

$$\mathop{\sum }\limits_{r}{I}_{r}(t)={\beta }_{0}+{\beta }_{L}\mathop{\sum }\limits_{r}{L}_{r}(t)+{\beta }_{n}n(t)+{\beta }_{a}a(t).$$

However, the actual model is a weighted regression of the type

$$w(t)\mathop{\sum }\limits_{r}{I}_{r}(t)={\beta }_{0}+{\beta }_{L}w(t)\mathop{\sum }\limits_{r}{L}_{r}(t)+{\beta }_{n}w(t)n(t)+{\beta }_{a}w(t)a(t)+\epsilon (t),$$

where w(t) is a scalar weight and ϵ(t) is an error term at time t. Weights were constructed by replicating observations, so that ∑_t w(t) ≠ 1. Because the sample size is not adjusted accordingly, standard errors are too small by a factor of 2.08 on average (see Table 1, column two). The weighting was obscured: it is omitted from the Methods, and the replicated rows only become visible after unhiding them in the spreadsheet provided in the replication files. Chaves et al.1 weight 2005 at 42.86%, 2001 at 17.86%, and 2014 at 16.07%. The unweighted model, as specified in the paper, removes the significance and switches the sign of the forest loss coefficient, as can be seen in columns one and three of Table 1.
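The mechanism behind the understated standard errors can be illustrated with a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data, the replication counts and the helper `weighted_ols` are hypothetical and do not reproduce the data or weights of Chaves et al. Dividing the residual sum of squares by the true number of observations minus the number of parameters is one common convention for such weights, not the only possible correction.

```python
import numpy as np

def weighted_ols(X, y, w, n_eff):
    """Weighted least squares; n_eff is the sample size used for the
    degrees of freedom of the residual variance (hypothetical helper)."""
    p = X.shape[1]
    sw = np.sqrt(w)
    Xw, yw = X * sw[:, None], y * sw
    XtX_inv = np.linalg.inv(Xw.T @ Xw)
    beta = XtX_inv @ Xw.T @ yw
    resid = yw - Xw @ beta
    sigma2 = resid @ resid / (n_eff - p)
    return beta, np.sqrt(sigma2 * np.diag(XtX_inv))

rng = np.random.default_rng(0)
n = 16                                   # assumed: one observation per year
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)   # simulated, not the malaria data
X = np.column_stack([np.ones(n), x])

# Hypothetical replication counts (NOT the weights used in the original study).
counts = rng.integers(1, 6, size=n)
X_rep = np.repeat(X, counts, axis=0)
y_rep = np.repeat(y, counts)
N = len(y_rep)

# Naive fit: replicated rows are treated as if they were independent data.
beta_naive, se_naive = weighted_ols(X_rep, y_rep, np.ones(N), n_eff=N)

# Same point estimates, but degrees of freedom based on the true sample size.
beta_adj, se_adj = weighted_ols(X, y, counts.astype(float), n_eff=n)

# Factor by which the naive standard errors are too small.
ratio = se_adj / se_naive
print(beta_naive, beta_adj, ratio)
```

In this sketch both fits return identical coefficients, while the naive standard errors are smaller by the constant factor sqrt((N - p) / (n - p)), where N is the number of replicated rows, n the number of actual observations and p the number of parameters.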

Table 1 Comparison of original regression results to alternatives.

The study by Chaves et al.1 seeks to estimate a causal effect of deforestation on malaria incidence. Valid estimates of this relation can only be obtained using appropriate techniques and assumptions that require theoretical justification2. The authors do not consider these intricacies and offer no explanation of why their ‘malaria risk’ measure may be interpreted in the way they propose. Instead, they disregard a number of statistical issues that I discuss below.

Chaves et al.1 base their model selection on achieving a ‘sufficient’ R², a procedure that is well known to be inadequate3. To illustrate this, consider a regression of birth rates on stork population. Common seasonal patterns lead to high correlation and high values of R². However, we learn very little about the actual relationship and estimates will be spurious. Chaves et al.1 claim that any model adaptation would only marginally increase R² and hence necessarily mimic their results. This is factually incorrect, missing the relative nature of R². See column (4) of Table 1 for a demonstration of how an additional variable can affect results.

Obtaining unbiased estimates from a linear regression relies on the exogeneity assumption, i.e. no correlation between explanatory variables and the error term. This assumption is commonly violated by simultaneity or omitted variables4. Simultaneity occurs when variables are determined contemporaneously, e.g. due to reciprocal causation. Regressing a disease’s incidence on its interventions is a textbook example of this phenomenon. Valid inference could only be drawn using elaborate methods, such as instrumental variables, or, if theoretically justifiable, by assuming no effects of malaria incidence on the use of nets and therapy. Omitted variable bias occurs when the dependent and explanatory variables are both affected by a third factor. Chaves et al.1 cite Garg5 and Berazneva and Byker6, who establish causal links between deforestation and malaria for specific regions. These studies rely on panel data, allowing for subnational heterogeneity, and an extensive set of control variables in order to distil a causal effect. Chaves et al.1 themselves list a number of malaria determinants in their appendix, which are also drivers of deforestation6. Yet, the authors do not take any of these factors into account. The distortion caused by this oversight becomes noticeable when including a linear time trend, as one of many omitted variables (see Table 1, column (4)).
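Omitted variable bias of this kind can be sketched with simulated data. In the example below, a linear time trend stands in for an unobserved common driver; the sample size, coefficients and noise levels are arbitrary assumptions, and the variables `x` and `y` are hypothetical rather than the series analysed by Chaves et al. By construction, `x` has no effect on `y`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200                                       # assumed sample size
t = np.arange(n, dtype=float)                 # stand-in for an omitted common driver

# Both series follow the trend; x has NO direct effect on y by construction.
x = 0.5 * t + rng.normal(scale=5.0, size=n)
y = 0.3 * t + rng.normal(scale=5.0, size=n)

def fit(X, y):
    """Ordinary least squares coefficients via numpy's least-squares solver."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

ones = np.ones(n)
b_short = fit(np.column_stack([ones, x]), y)[1]      # trend omitted
b_long = fit(np.column_stack([ones, x, t]), y)[1]    # trend included

print(f"slope on x without trend: {b_short:.3f}")
print(f"slope on x with trend:    {b_long:.3f}")
```

Without the trend, the slope on `x` absorbs the influence of the common driver and is clearly positive; once the trend is included, it moves close to its true value of zero.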

In their study, Chaves et al.1 perform a time series regression without considering any of the associated complexities. Crucially, their model relies on stationarity of variables, i.e. their distributions, hence moments such as the mean, must be constant over time4. Non-stationary variables generally lead to the spurious regression problem7. Results would then indicate strong correlation between variables, but do not imply causation. In the study’s model, we cannot reject non-stationarity for any of the variables considered and we find autocorrelated residuals—all at any reasonable level of significance (see Supplementary Table 1 for test results). The variable of interest, cumulative forest loss, is even non-stationary by design. When dealing with this issue in two simple ways, we find completely different results—namely sign-switching and insignificant coefficients. See columns (4) and (5) of Table 1 for a model accounting for a linear time trend and one where the relation of yearly changes of variables is modelled.
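The spurious regression problem can be reproduced in a small Monte Carlo sketch. It regresses pairs of independent random walks on each other, once in levels and once in first differences, and records how often the slope appears significant. The series length, the number of simulations and the critical value of 1.96 (a normal approximation) are assumptions chosen for illustration; the helper `slope_t_stat` is hypothetical, and none of this uses the data of Chaves et al.

```python
import numpy as np

def slope_t_stat(x, y):
    """t-statistic of the slope in an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

rng = np.random.default_rng(1)
T, n_sim = 100, 500          # assumed series length and number of simulations
reject_levels = 0
reject_diffs = 0

for _ in range(n_sim):
    # Two independent random walks: non-stationary and unrelated by construction.
    x = np.cumsum(rng.normal(size=T))
    y = np.cumsum(rng.normal(size=T))
    reject_levels += int(abs(slope_t_stat(x, y)) > 1.96)
    # First differences of a random walk are stationary white noise.
    reject_diffs += int(abs(slope_t_stat(np.diff(x), np.diff(y))) > 1.96)

rate_levels = reject_levels / n_sim
rate_diffs = reject_diffs / n_sim
print(f"share 'significant' in levels:      {rate_levels:.2f}")
print(f"share 'significant' in differences: {rate_diffs:.2f}")
```

Although the two series are unrelated, the regression in levels flags a 'significant' slope far more often than the nominal 5%, whereas the regression in differences stays close to that level.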

Putting aside inadequate methods, there are a number of simplifications that neglect important complexities of both malaria and deforestation dynamics. By aggregating data, Chaves et al.1 implicitly assume international homogeneity of malaria dynamics. This assumption is striking, given weak empirical support8 and the spatial mismatch of malaria and forest loss. Malaria predominantly occurs in Africa, with 93% of global cases in 2018 (ref. 9), while forest loss mostly stems from other regions10. Furthermore, Chaves et al.1 silently equate the distinct concepts of forest loss, deforestation and commodity-driven deforestation. With the Hansen et al.10 data, they use information on forest loss, which is only partly due to deforestation10,11. Deforestation, in turn, is driven by multiple factors, including but not limited to commodity production12. Since commodity-driven deforestation is only a subset of forest loss, with arguably special dynamics, this distinction is relevant for the conclusions that can be drawn.

To sum up, the study by Chaves et al.1 constitutes an important attempt at linking malaria, deforestation and trade, but falls short of this ambitious goal. Their use of an unorthodox weighting scheme lacks justification and pushes results towards showing a link between deforestation and malaria. Their model is plagued by a number of serious methodological issues, including simultaneity, omitted variables and non-stationarity. Each one of them individually is enough to invalidate results. Still, I hope this direction is pursued further and offer some recommendations: (a) be transparent with assumptions made, (b) approach interdisciplinary problems with an interdisciplinary team, (c) be precise and careful with the notion of causality.