Causal attribution fractions, and the attribution of smoking and BMI to the landscape of disease incidence in UK Biobank

Unlike conventional epidemiological studies that use observational data to estimate “associations” between risk factors and disease, the science of causal inference has identified situations where causal estimates can be made from observational data, using results such as the “backdoor criteria”. Here these results are combined with established epidemiological methods to calculate simple population attribution fractions that estimate the causal influence of risk factors on disease incidence, and that can be estimated using conventional proportional hazards methods. A counterfactual argument gives an attribution fraction for individuals. Causally meaningful attribution fractions cannot be constructed for all risk factors or confounders, but they can for the important established risk factors of smoking and body mass index (BMI). Using the new results, the causal attribution of smoking and BMI to the incidence of 226 diseases in the UK Biobank is estimated, and summarised in terms of disease chapters from the International Classification of Diseases (ICD-10). The diseases most strongly attributed to smoking and BMI are identified, finding 11 with attribution fractions greater than 0.5, and a small number with protective associations. The results provide new tools to quantify the causal influence of risk factors such as smoking and BMI on disease, and survey the causal influence of smoking and BMI on the landscape of disease incidence in the UK Biobank population.

Eq. 1 shows that if the relative risks for x, w, and z are positively correlated, then the attributable fraction for disease risk within the population is greater than would be estimated using the average relative risk, so that estimates using the mean relative risk provide a lower bound. If x and z are negatively correlated then the ≥ sign is replaced by ≤.
Another quantity that might be considered is the expected value of the attributed fraction $1 - 1/e^{\eta_x}$, that is $E[1 - 1/e^{\eta_x}] = 1 - E[1/e^{\eta_x}]$. Because $1/e^{\eta_x}$ is a convex function of $e^{\eta_x}$, Jensen's inequality gives,
$$E\left[\frac{1}{e^{\eta_x}}\right] \geq \frac{1}{E\left[e^{\eta_x}\right]}, \tag{2}$$
and as a result,
$$E\left[1 - \frac{1}{e^{\eta_x}}\right] \leq 1 - \frac{1}{E\left[e^{\eta_x}\right]}. \tag{3}$$
If $e^{\eta_x}$ and $e^{\eta_w + \eta_z}$ are positively correlated, then Eqs. 3 and 1 indicate that $E[1 - 1/e^{\eta_x}]$ will also bound Eq. 10 of the main text from below. However, this would not be true if $e^{\eta_x}$ and $e^{\eta_w + \eta_z}$ were negatively correlated.
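The ordering implied by Jensen's inequality can be checked numerically. The following is a minimal sketch, assuming an arbitrary (hypothetical) normal distribution for the linear predictor η_x; the function name `jensen_pair` and the sample parameters are illustrative and are not taken from the study.

```python
import math
import random

def jensen_pair(etas):
    """Return (E[1 - 1/e^eta], 1 - 1/E[e^eta]) for a sample of linear predictors."""
    relative_risks = [math.exp(eta) for eta in etas]
    n = len(relative_risks)
    mean_of_fractions = sum(1.0 - 1.0 / rr for rr in relative_risks) / n
    fraction_of_mean = 1.0 - 1.0 / (sum(relative_risks) / n)
    return mean_of_fractions, fraction_of_mean

# Hypothetical sample: eta_x drawn from N(0.3, 0.5^2), an assumption for illustration only.
rng = random.Random(0)
sample = [rng.gauss(0.3, 0.5) for _ in range(10_000)]
mean_of_fractions, fraction_of_mean = jensen_pair(sample)
print(f"E[1 - 1/RR] = {mean_of_fractions:.4f}, 1 - 1/E[RR] = {fraction_of_mean:.4f}")
```

For any sample, the first value should not exceed the second, because the inequality holds for the empirical distribution itself and not only in the large-sample limit.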

A.2 Relation to other attributable fractions
A recent study [1] with a proportion p exposed to a virus, and an estimated relative risk R, reported an attribution fraction of A = p(R − 1)/R. Here it is briefly outlined when this will approximate Eq. 10 of the main text. Assume that $e^{\eta_x}$, $e^{\eta_w}$, and $e^{\eta_z}$ are uncorrelated, so that with the approximation $F(t) \simeq H(t)$, Eq. 8 in the main text simplifies to Eq. 14 in the main text, which may be approximated as,
$$A_f \simeq 1 - \frac{1}{E\left[e^{\eta_x}\right]}. \tag{4}$$
If we consider a proportion p that are exposed with relative risk R, and a proportion (1 − p) that are unexposed with relative risk 1, then using Eq. 4,
$$\begin{aligned} A_f &\simeq 1 - \frac{1}{pR + (1 - p)} \\ &= \frac{p(R - 1)}{1 + p(R - 1)} \\ &\simeq \frac{p(R - 1)}{R}, \end{aligned} \tag{5}$$
where the approximation in the final line follows if R − 1 is small enough, as it often can be. A better approximation follows from the second line, where $p(R - 1) \ll 1$ ensures that $A_f \simeq p(R - 1)$. If $p \simeq 1$, then $A_f \simeq (R - 1)/R$ as usual, as can be seen from the first or second line above.
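How close A = p(R − 1)/R is to the mean-relative-risk form can be explored with a short calculation. This is a sketch under the assumption that the unexposed group has relative risk 1, so that the mean relative risk is pR + (1 − p); the function names and the example values of p and R are hypothetical.

```python
def fraction_from_mean_relative_risk(p, R):
    """1 - 1/E[RR], with E[RR] = p*R + (1 - p) for a two-group population."""
    return p * (R - 1.0) / (1.0 + p * (R - 1.0))

def reported_fraction(p, R):
    """The form A = p(R - 1)/R quoted in the text."""
    return p * (R - 1.0) / R

# Hypothetical (p, R) pairs chosen only to illustrate the regimes discussed in the text.
for p, R in [(0.1, 1.2), (0.1, 3.0), (0.9, 3.0), (1.0, 3.0)]:
    exact = fraction_from_mean_relative_risk(p, R)
    approx = reported_fraction(p, R)
    print(f"p={p}, R={R}: 1 - 1/E[RR] = {exact:.4f}, p(R-1)/R = {approx:.4f}, p(R-1) = {p * (R - 1.0):.4f}")
```

The two forms agree when R − 1 is small or when p is close to 1, and they separate when R is large and p is small, in which case p(R − 1)/R underestimates the mean-relative-risk form.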

B Unmeasured confounders and mediation - the "frontdoor criteria"
Another important result from causal inference is the "frontdoor criteria" [2, 3]. A well-known example [3] is assessing the influence of smoking on disease risk in the presence of unmeasured confounders that influence both smoking and disease risk, by using an additional measurement of tar in people's lungs (figure 1). Again we consider the adjustment formula for this situation in the limit of rare diseases, as above, and consider the simple specific example with continuous variables for, e.g., the average number of cigarettes per day and the tar content of lungs. Although the estimated incidence rates will differ from those using proportional hazards models, the causal estimate for the influence of smoking on lung cancer is the same as we might (with hindsight) have anticipated from mediation studies.

Figure 1. The "frontdoor criteria" estimates the causal influence of an exposure do(X = x) that is mediated by Z, in the presence of unmeasured confounders U that influence both the disease risk and the exposure X.
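For the situation described in figure 1, the "front door" adjustment formula states [2, 3],
$$P(Y = y \mid do(X = x)) = \sum_{z} P(Z = z \mid X = x) \sum_{x'} P(Y = y \mid X = x', Z = z)\, P(X = x'). \tag{6}$$
Using this, and proceeding as before,
$$F(t \mid do(X = x)) \simeq H_0(t) \sum_{z} P(Z = z \mid X = x)\, e^{\eta_z} \sum_{x'} P(X = x')\, e^{\eta_{x'}}. \tag{7}$$
Next consider the specific example where $\eta_x = \beta_x x$, $\eta_z = \beta_z z$, $P(X = x)$ is a normal distribution $N(\mu_x, \sigma_x^2)$, and $P(Z = z \mid X = x)$ is a normal distribution $N(\alpha x, \sigma_z^2)$, where in the latter case α is a constant and the mean of z is αx. Understanding that the sums should be considered as integrals when variables are continuous, then we have,
$$\sum_{x'} P(X = x')\, e^{\beta_x x'} = e^{\beta_x \mu_x + \beta_x^2 \sigma_x^2 / 2}, \tag{8}$$
and,
$$\sum_{z} P(Z = z \mid X = x)\, e^{\beta_z z} = e^{\beta_z \alpha x + \beta_z^2 \sigma_z^2 / 2}, \tag{9}$$
giving,
$$F(t \mid do(X = x)) \simeq H_0(t)\, e^{\beta_x \mu_x}\, e^{(\beta_x^2 \sigma_x^2 + \beta_z^2 \sigma_z^2)/2}\, e^{\beta_z \alpha x}. \tag{10}$$
The incidence rate at baseline $X = x_0$ is determined by the first three terms, and differs from a proportional hazards estimate that is adjusted by either or both of x and z. The first two terms are equivalent to a proportional hazards estimate with x at the mean exposure $\mu_x$ and z at the baseline value, and the third term quantitatively accounts for the spread in values of x and z about their mean values. The influence of do(X = x) is seen in the last term $e^{\beta_z \alpha x}$, with the change in risk being mediated by z in a very simple and intuitive way.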
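The Gaussian example can be checked numerically. The sketch below assumes that, in the rare-disease limit, the front-door adjustment multiplies the baseline hazard H_0(t) by the product of two averages: the average of e^{β_x x'} over P(X = x'), and the average of e^{β_z z} over P(Z = z | X = x). It compares direct numerical integration with the closed form exp(β_x µ_x + (β_x² σ_x² + β_z² σ_z²)/2 + β_z α x). All parameter values and function names are hypothetical.

```python
import math

def gaussian_mean_exp(beta, mu, sigma, n=4001, width=10.0):
    """Trapezoid-rule estimate of E[exp(beta * V)] for V ~ N(mu, sigma^2)."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n - 1)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n):
        v = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # half weight at the two end points
        density = norm * math.exp(-0.5 * ((v - mu) / sigma) ** 2)
        total += weight * math.exp(beta * v) * density * step
    return total

def frontdoor_factor_numeric(x, beta_x, beta_z, mu_x, sigma_x, alpha, sigma_z):
    """Multiplier of H_0(t) under do(X = x), by direct integration."""
    over_x = gaussian_mean_exp(beta_x, mu_x, sigma_x)       # average over P(X = x')
    over_z = gaussian_mean_exp(beta_z, alpha * x, sigma_z)  # average over P(Z = z | X = x)
    return over_x * over_z

def frontdoor_factor_closed(x, beta_x, beta_z, mu_x, sigma_x, alpha, sigma_z):
    """Closed form of the same multiplier for the Gaussian example."""
    spread = 0.5 * ((beta_x * sigma_x) ** 2 + (beta_z * sigma_z) ** 2)
    return math.exp(beta_x * mu_x + spread + beta_z * alpha * x)

# Hypothetical parameter values, chosen for illustration and not taken from the paper.
params = dict(beta_x=0.1, beta_z=0.2, mu_x=10.0, sigma_x=5.0, alpha=0.5, sigma_z=1.0)
numeric = frontdoor_factor_numeric(20.0, **params)
closed = frontdoor_factor_closed(20.0, **params)
print(f"numeric = {numeric:.6f}, closed form = {closed:.6f}")
```

Under these assumptions the ratio of the multiplier at two exposure levels depends only on β_z α and the difference in x, which is the sense in which the effect of do(X = x) is carried entirely by the mediator.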
For the situation considered here, where there is solely an indirect effect of the exposure through the mediator, this estimate is the same as for a mediation analysis with measured confounding [4]. Interestingly, in the equivalent mediation analysis using a proportional hazards model with measured confounding, the influence of measured confounding on the estimate [1] does not appear in the resulting expressions for natural direct and indirect effects. This appears to explain the agreement between estimates with measured and unmeasured confounding: for the model of figure 1, in the limit of rare diseases and with a proportional hazards model, the estimate is (apparently) unaffected by confounding.
Equation 7 applies to any situation described by figure 1, and the example given can be generalised, e.g. to multivariate normal distributions. A potentially important and apparently overlooked application, which would be worth exploring in greater detail, is to Mendelian Randomisation (MR) studies. In MR, X would be a genetic variant, and Z would be a biological mediator such as cholesterol. MR is intended to allow the causal influence of Z on disease risk to be estimated, by using the genetic associations of the same variant with Z and with disease risk. However, it could be argued that there are unmeasured confounders, such as region of birth, that can influence both your genetics and risk-modifying exposures such as pollution. What the model represented by Eq. 7 appears to suggest is that, with an appropriate analysis, the unmeasured confounders U will not modify the estimated parameters linking X, Z, and the presence of disease Y.

C Relative risks
Using the same approximations used to derive Eq. 4 of the main text, it can be written in several ways, for example,
When education and socio-economic factors are represented by Z, the factor $A_Z$ accounts for changes in risk due to both socio-economic factors and education, and the influence of setting X = x is calculated through the factor $e^{\eta_x}$. If we could set X equal to the baseline values $x_0$, the probability distribution would be proportional to the baseline hazard function $H_0(t)$, amplified or shrunk by the factor $A_Z$. If the baseline values corresponded to the lowest disease risk, then $A_Z H_0(t)$ would be the lowest possible disease incidence rate that could have been achieved through lifestyle changes. Eq. 11 can be written as,
This shows that the relative risk of disease within time t for a population with X = x, compared with a population with baseline values of $X = x_0$, is equal to the relative risk from observational studies, that have,

Figure 2 compares attribution fractions estimated using Eqs. 10 and 14 of the main text (Eq. 14 is the equivalent estimate to that used by the World Health Organisation [8]). Figures 1-4 show equivalent tables and plots to those in the main text, but for attribution fractions due solely to BMI or solely to smoking.

D.3 The estimate $F(t) \simeq H(t)$
Probability densities F(t) for 400 diseases in men and women, without confounding by prior disease, were modelled with Weibull distributions [5]. For age groups of 60, 70, 80, and 90 years, figure 4 provides histograms for the number of diseases having occurred within a given probability interval, and the cumulative proportion of diseases included by that interval. Even at age 90, almost all diseases had an estimated probability of less than 0.2, for which values the estimate $F(t) \simeq H(t)$ is very good.
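The size of the error in replacing F(t) by H(t) can be tabulated as a function of the probability alone. This is a minimal sketch, assuming the standard survival relation F(t) = 1 − exp(−H(t)), with H(t) the cumulative hazard; the function names are illustrative.

```python
import math

def cumulative_hazard_from_probability(F):
    """Invert F = 1 - exp(-H) to obtain the cumulative hazard H."""
    return -math.log(1.0 - F)

def relative_error_of_approximation(F):
    """Relative amount by which H exceeds F when F is approximated by H."""
    return (cumulative_hazard_from_probability(F) - F) / F

for F in (0.01, 0.05, 0.1, 0.2):
    H = cumulative_hazard_from_probability(F)
    error = relative_error_of_approximation(F)
    print(f"F = {F:.2f}: H = {H:.4f}, relative error = {error:.1%}")
```

Under this relation the error grows roughly as F/2 for small probabilities, so it depends only on how likely the disease is by age t and not on the particular hazard model.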
With confounding by prior disease, or with greater than average risk factors, the probabilities would be higher. Unfortunately the approximation $F(t) \simeq H(t)$ becomes less reliable when individuals have risk factors that lead to a much higher relative risk than the general population. To explore where the approximations start to fail, examples with relative risks of 1.1, 2.0, and 5.0 are considered, for a late-onset disease whose risk increases rapidly in later life and for a sporadic disease whose risk is moderate throughout life but increases comparatively slowly with age [5]. The diseases were modelled with a Weibull distribution with survival function $S(t) = \exp(-e^{\eta} (t/L)^k)$, where η is a linear predictor for adjustment so that $e^{\eta} \equiv RR$ is the relative risk, L is a parameter that sets a scale for age t, and k is a dimensionless parameter. For the late-onset disease example we took L = 115 and k = 6.8, and for the sporadic disease example we took L = 190 and k = 2.0. The population will contain a mixture of individuals with a range of relative risks, some of which may be less than one. Figure 5 compares F(t) with its approximation $F(t) \simeq H(t)$, and also explores how the approximation modifies the estimated attribution fractions (if defined by Eq. 8 instead of Eq. 10, both from the main text). Note that many individuals in a population will often have small relative risks, and the overall combination of relative risks from within the population will determine the accuracy of Eq. 8's approximation of Eq. 10. For late-onset diseases, whose risk is small until late in life, the approximations are all very good until age 80, but start to fail for the larger relative risks from age 90 onwards. For the sporadic diseases and relative risks of 1.1 and 2.0, the approximations are fairly good until 80 or 90, but when the relative risk is 5.0 the approximation starts to fail from approximately age 60 onwards. Overall, for most individuals and diseases, the approximation $F(t) \simeq H(t)$ is adequate up to the average UK life expectancy of approximately 80 years. The approximation fails most strongly for individuals with large relative risks, and for diseases that are more likely to be observed earlier in life (sporadic diseases). For unadjusted fits and small relative risks, the estimate $F(t) \simeq H(t)$ is good even for ages approaching 100. An alternative option suggested in the main text is to regard Eq. 8 as an (age-independent) definition for ages t → 0, which will be a reasonable approximation for most individuals, diseases, and ages up to the average UK life expectancy of approximately 80 years.

[Figure 4. Panel title: Age 60. Axis: probability of a specific disease.]

Figure 5. The left and central figures show how the approximation $F(t) \simeq H(t)$ starts to fail at large enough ages, and how larger relative risks (R.R.) make the approximation worse. Comparing the left and central figures, the approximation is much better for late-onset diseases that only occur in old age. The right figure shows how the approximation causes the age-dependent attribution fraction (Eq. 10 of the main text) to deviate from its age-independent approximation (Eq. 8 of the main text) at large enough ages. Even the largest deviations are within about 20% of the age-independent approximation (right figure). (Legend: with approximation; late-onset disease, exact; sporadic disease, exact.)
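The comparison of F(t) with H(t) for the two example diseases can be reproduced in outline with the Weibull parameters quoted above. The sketch assumes the survival function S(t) = exp(−RR (t/L)^k), so that H(t) = RR (t/L)^k and F(t) = 1 − S(t); the ratio H(t)/F(t) is used here as a hypothetical summary of the approximation's accuracy and is not a quantity defined in the main text.

```python
import math

# (L, k) pairs quoted in the text for the two example diseases.
DISEASES = {"late_onset": (115.0, 6.8), "sporadic": (190.0, 2.0)}

def cumulative_hazard(t, rr, L, k):
    """H(t) = RR * (t / L)^k for the Weibull model."""
    return rr * (t / L) ** k

def cumulative_probability(t, rr, L, k):
    """F(t) = 1 - S(t) = 1 - exp(-H(t))."""
    return 1.0 - math.exp(-cumulative_hazard(t, rr, L, k))

def approximation_ratio(t, rr, L, k):
    """H(t) / F(t): close to 1 when F is well approximated by H, larger as it fails."""
    return cumulative_hazard(t, rr, L, k) / cumulative_probability(t, rr, L, k)

for name, (L, k) in DISEASES.items():
    for rr in (1.1, 2.0, 5.0):
        row = ", ".join(
            f"age {age}: {approximation_ratio(age, rr, L, k):.2f}" for age in (60, 80, 90)
        )
        print(f"{name:10s} RR = {rr}: {row}")
```

A ratio near 1 indicates that the approximation holds at that age and relative risk, and the ratio rises with age and with relative risk for both parameter sets.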