Modeling and mapping the burden of disease in Kenya

Precision public health approaches are crucial for targeting health policies to regions most affected by disease. We present the first sub-national and spatially explicit burden of disease study in Africa. We used a cross-sectional study design and assessed data from the Kenya population and housing census of 2009 for calculating YLLs (years of life lost) due to premature mortality at the division level (N = 612). We conducted spatial autocorrelation analysis to identify spatial clusters of YLLs and applied boosted regression trees to find statistical associations between locational risk factors and YLLs. We found statistically significant spatial clusters of high numbers of YLLs at the division level in western, northwestern, and northeastern areas of Kenya. Ethnicity and household crowding were the most important and significant risk factors for YLL. Further positive and significantly associated variables were malaria endemicity, northern geographic location, and higher YLL in neighboring divisions. In contrast, higher rates of married people and more precipitation in a division were significantly associated with less YLL. We provide an evidence base and a transferable approach that can guide health policy and intervention in sub-national regions afflicted by disease burden in Kenya and other areas of comparable settings.

The Global Burden of Disease (GBD) study provides an excellent framework to quantify the magnitude of health loss due to diseases, injuries, and risk factors 1 . GBD studies quantify health loss through both mortality and morbidity by using the so-called disability-adjusted life years (DALYs) 2,3 . In Kenya for example, 78.3% of total DALYs are constituted by years of live lost (YLL) due to premature mortality 4 , with the leading causes HIV/AIDS, lower respiratory infections, diarrheal diseases, tuberculosis, and malaria 5 . However, most of the GBD studies have focused on the national level 6 , missing potentially significant variations at the sub-national level 7 . Although some studies assessed disease burden at the sub-national level [8][9][10][11] , only a few considered spatial patterns in these measures 12,13 . Knowledge about sub-national regions that exhibit significant above or below average disease burden is of particular interest for deciding where to intervene to improve population health.
Yet, there remains much we do not know about the sub-national distributions of risk factors of disease burden so that we have limited knowledge about where health interventions will be most efficient. Many low and middle-income countries lack disaggregated health statistics that are needed for sub-national studies and spatial analyses. A possibility to overcome this issue may be census data that is more often available and could be useful to analyze disease burden due to e.g. premature mortality.
To the best of our knowledge, no study systematically assessed the spatial patterns of disease burden due to premature mortality with sub-national data across an entire country in Africa. One notable exception however, is provided by Manda and Abdelatif 14 , who analyzed the spatial-temporal variation of mortality risk across South African municipalities. However, they did not account for important risk (e.g., infectious disease), environmental or socio-demographic factors (e.g., climate, ethnicity). Furthermore, their study did not explicitly assess spatial clusters of life years lost.
We set out to investigate the spatial distribution of disease burden based on the most recent population and housing census 2009 in Kenya. Specifically, we aimed to 1) detect spatial clusters of YLLs at the division level (n = 612), and to 2) identify variables that are associated with the YLLs at this level.

Results
We noted that YLL exhibited a distinct geographic pattern, with higher YLLs in western, northwestern, and northeastern Kenya (Fig. 1). We found small but significant spatial clustering of YLL across Kenyan divisions (global Moran's I = 0.20, p-value < 0.001). Figure 2 shows significant (p-value < 0.001) local spatial clusters of high YLL rates (a) near Lake Victoria in western Kenya, (b) in Turkana County in the northwest, and (c) near the border with Ethiopia and Somalia in the northeast. Significant spatial clusters of low YLLs were found in central and southern Kenya. Figures 1 and 2 of the supplementary file contain additional information for YLL based on the Kenya specific life expectancy (not reported). Figure 3(a) shows the relative importance of the significant explanatory variables in our model. Table 1 in the supplementary file shows odds ratios and 95% confidence intervals for the ten most important variables from a replicated version (Poisson multivariable regression) of our boosted regression tree (not reported). Higher shares of Luo ethnicity or more crowded households were strongest factors significantly and positively associated with YLL at the division level. Figure 3(b) and (c) depict the partial dependence plots (PDP) for share of Luo ethnicity and household crowding, each illustrating the isolated influence of these risk factors on YLLs while controlling for all other factors. For example, YLL sharply increased with higher share of Luo people until it levelled out at around 65%, after which the strength of association remained constant. Household crowding also had a non-linear influence on YLL. The effect of crowding on YLL was low for less than 3.5 persons per room but crowding above this threshold was associated with rapidly increasing YLLs rates, up to 5.5 persons per room. Shares of Luo ethnicity and crowded households in a division were also significantly interacting with each other (Fig. 4). The association between share of household crowding and YLL rate was stronger in divisions with a share of Luo people above approximately 30%.
Furthermore, higher shares of Kisii ethnicity, higher malaria endemicity, and divisions at higher latitudes were significantly and positively associated with YLL (Fig. 3, Supplement). We also found that the spatial lag coefficient that represented YLL in neighboring divisions was positively associated with YLL. In contrast, higher shares of Kikuyu or Kamba ethnicity, higher shares of married people, or higher precipitation in divisions were significantly and negatively associated with YLL. We also tested our model with a spline function on the precipitation variable (precipitation of the wettest month) but could not find any significant difference to the model reported here (Supplementary File, Fig. 4).

Discussion
Years of life lost due to premature mortality (YLL) were spatially clustered in western, northwestern, and northeastern Kenya and higher shares of Luo people and crowded households exhibited strongest associations with YLL in Kenyan divisions.   While most divisions displayed YLL rates around the national average of 0.4, some divisions had YLL rates up to four times higher (1.7), exhibiting spatial concentration of premature mortality. For example, high YLL rates clustered near Lake Victoria in the southwest (Fig. 2a). This region is characterized by highest HIV prevalence and high malaria endemicity 15 . HIV/AIDS and malaria are the first-and third-most important causes of YLL, constituting 18.9% and 10.0% of Kenya's total YLL, respectively 4 . Hence, these conditions could be an explanation for the high burden in this area. Other significant clusters of high YLL rates were identified in Turkana County of northwestern Kenya (Fig. 2b) and in the border triangle with Ethiopia and Somalia (Fig. 2c). These predominantly remote, (semi-) arid regions are sparsely populated and dominated by (nomadic) pastoralism 16 . There could be several explanations for the high burden in these regions. First, remoteness could imply limited access to health care facilities and services. Second, low agricultural potential, combined with frequent droughts may periodically lead to health-threatening food insecurity 16 . Finally, inter-tribal violence related to resource scarcity and cross-border overflow from armed conflicts in neighboring countries (Somalia, South Sudan, Uganda) may be reasons for the clusters of high YLL in these regions 17,18 . The central and southern regions of Kenya, in which low YLL clustered, are rather characterized by higher agricultural potential, good income opportunities and better food security 16 . Combined with modest HIV prevalence and low malaria endemicity, this may explain the low burden in this area 15,19 .

Principal component Original variable and factor loading
Marital status The share of people being married monogamously (0.8), or never being married (−0.6) were correlated with this component.

Protestant Christian
The share of people being Protestant (0.9), or Catholic (−0.3) were correlated with this component.

Occupation
The share of people working on family-owned farms (0.8) was correlated with this component.

Bicycle possession
The share of households (HH) having a bicycle (0.8).

Modern assets
The share of HH with a mobile (0.6), a TV (0.4) and a radio (0.3) were correlated with this component.

Livestock possession
The mean number of goats (0.7) and chicken (0.5) per HH were correlated with this component.

Poor cooking fuel
The share of HH cooking with firewood (0.9), or with charcoal (−0.5) were correlated with this component.

Poor lighting fuel
The share of HH using tin lamps (0.9) was correlated with this component.

Good roof material
The share of HH having an iron sheet roof (0.8), or a grass roof (−0.6) were correlated with this component.

Poor floor material
The share of HH having earth floor (0.7), or cement floor (−0.7) were correlated with this component.

Poor wall material
The share of HH having mud/wood as wall material (0.9) was correlated with this component.

Good sanitation
The share of HH having a covered pit latrine (0.9), or using the bush for sanitation (−0.4) were correlated with this component.
Poor water source The share of HH using a river as water source (0.9) was correlated with this component.

Good water source
The share of HH using water drawn through pipes (0.7) was correlated with this component.  Higher shares of specific ethnicities (Luo or Kisii) within divisions were positively associated with YLL and this association was the strongest among all variables in the model. Kenya is home to over 70 distinct ethnic groups, with the Kalenjin, Kamba, Kikuyu, Luo, and Luhya being among the largest ones. This rich diversity however has often led to social tensions 20,21 and unequal health outcomes. For example, our finding is consistent with other studies that report highest HIV and tuberculosis prevalence and also child mortality among the Luo compared to other ethnicities in Kenya 22,23 . It is therefore quite understandable that those divisions inhabited primarily by the Luo or Kisii were positively associated with YLL. Our findings underline the importance of considering ethnicities when examining the burden of disease 24 . For example, certain health-related practices (e.g., circumcision, use of cultural medicine, sexual behavior) and people's access to health care can be strongly dependent on ethnicity [24][25][26] . However, we here explicitly point out that we neither can assume a direct relationship between ethnicity and higher risk of YLL since we examined relationships at the ecological and not at the individual level. Nor can we infer causal relationships between ethnic-specific health behavior and YLL from our cross-sectional study. Future studies should look into the ethnic composition and respective health behavior at the individual level to better understand the burden of disease across different population groups and regions across Kenya. Household crowding (over 3 persons per room) was positively associated with YLL, possibly due to a higher risk of communicable diseases such as acute respiratory infections, tuberculosis, or skin diseases with more persons sharing one room 27 . This finding is in line with studies from New Zealand 28 and Uganda 29 that also revealed associations between household crowding and morbidity. In contrast to our results, Ombok et al. 30 did not identify crowding as a risk factor for child mortality in Nyanza Province of Kenya, possibly because they used a dichotomous variable (<5 and ≥5 persons/room) while we employed a continuous measure. There was a statistically significant interaction between household crowding and Luo ethnicity in our study. This indicates a mutually enforcing effect of these two factors so that risk of premature mortality is particular high in a division if both factors are high. While there is little evidence on the health effects of household crowding with respect to ethnicities in the literature, this suggests a need for more in-depth analysis in future studies.
We found higher malaria endemicity in a division was positively associated with YLL. This is consistent with a large body of literature, especially in the sub-Sahara Africa context 5,15,[31][32][33][34][35][36][37] . In contrast, we found that being married can be protective against poor health and YLL; this has also been shown in several studies for different health outcomes [38][39][40] . Using the same data in another study at the individual level in Kenya, Gruebner et al. 40 found reduced risk of child death for mothers who lived in households with married household heads. The authors assumed that being married indicates a stable living arrangement providing a health-promoting environment. In the current study, this may also be true at the ecological level as we found higher rates of married persons in a division was negatively associated with YLL.
Our study found a negative association of higher precipitation in a division with YLL. It is not entirely clear why this is the case. While one study found that malaria mortality was associated with rainfall in western Kenya 41 , a study in Sweden found that higher precipitation decreased the number of deaths in the 18 th and 19 th century 42 . The authors argue that in Sweden a warm spring with good rainfall increased the chance of a rich harvest, on which the pre-industrial population was dependent. This may also be true in our study, as precipitation allows for crop cultivation (e.g. coffee, banana) that would provide income possibilities for the local population with positive effects for health 43,44 .
Divisions that were geographically located further in the north of Kenya were positively associated with YLL. This may mirror findings from our spatial cluster analysis suggesting that these regions may represent remoteness, low agricultural potential, frequent droughts, or inter-tribal violence. More spatial epidemiological studies are needed to further breakdown the geographic distribution of explanatory variables associated with the burden of disease in Kenya. Furthermore, YLL were positively associated with YLL in neighboring divisions. This may indicate spill over, that is, exposure factors in one division (e.g., higher share of specific ethnicities, crowded households, malaria endemicity) may also be associated with higher YLL in adjacent divisions, even when these factors are low there. Another explanation for the spatial lag effect could be that adjacent divisions share similar high values of exposure factors.
We recognize three noteworthy limitations of our study. First, we calculated rates of YLL based on death cases per household within the last twelve months prior to the census that can be related to possible biases. For example, early death of a child is a traumatic event that may influence such reporting. Recall bias may play a role due to exclusion of deaths that occurred within the recall period and may underestimate the level of mortality. In turn, over-reporting of deaths that occurred outside the recall period may have led to an overestimation of mortality 45 . Although recall bias has frequently been regarded as a major issue in case-control studies, it has also been reported to compromise retrospective study designs 46 . For the neighboring country of Tanzania however, Moshiro et al. 47 found that long recall periods of up to 12 months did not affect estimates.
Second, we had to exclude 9.6% of the death cases as they were reported with an unknown age at death. Comparisons between age-specific mortality rates calculated from the Kenyan census data with rates from the GBD 2010 study indicated noticeable lower mortality rates for older ages (>60) in our data. This suggests that the death cases that we excluded in our study were predominantly people of older age. Death cases at older age have a fairly small impact on the YLL due to lower residual life expectancies and hence we assume that it had only a negligible effect on our findings.
Third, we created a single model for YLL attributable to all causes of deaths based on census, that is, on a complete enumeration of the population. Such an approach prevents analysis of YLL specific to communicable diseases, non-communicable diseases, or injuries. Yet, risk factors vary substantially from one group of diseases to another, which needs to be kept in mind when interpreting our findings.
To the best of our knowledge this is the first study that addressed the spatial distribution of the burden of disease due to premature mortality at the division level in Africa. Based on census data, we identified spatial patterns of the years of life lost (YLL) that provide crucial information for better understanding about the locations where people are at higher risk for premature mortality. Moreover, we identified exposure factors that were significantly associated with YLL.
Kenya has made significant improvements in the reduction of the top three causes of premature death in 2016 as compared to 2005 48 . For example, HIV/AIDS as a cause for premature death was reduced by 60.4%, diarrheal diseases by 29.8%, and lower respiratory infections by 23.3% 48 . Furthermore, Malaria as the seventh important cause of premature death in the country was reduced by 59.9% as compared to 2005 48 .
Our spatial epidemiological approach with census data is transferable and should be reapplied with updated census data once these are available. Thereby it will contribute to a precision public health supporting the allocation of scarce resources to regions and specific populations most affected by premature mortality also in contexts beyond Kenya.

Methods
Data set and availability. Micro level data from the most recent Census conducted August 24 th 2009 49 was used. This data is also available to other researchers who meet the criteria for access to confidential information. Interested researchers may request this data at datarequest@knbs.or.ke. Study design and population. As in Gruebner et al. 40 , a cross-sectional study design was used, with data on the general population and for this study aggregated at the division level. We excluded those divisions with preliminary non-residential areas and thereby arrived at N = 612 divisions suitable for our analyses. The population for these divisions ranged from 165 to 870,202, with a median population of 44,661. Explanatory variables. We considered the following variables from the census aggregated at the division level: Population density (population/km 2 ), household crowding (mean number of persons/room), percentage of rural households and ethnic population groups, as well as mean educational attainment (range 0 = no education to 20 = completed university degree). Mean access to health care was calculated based on health facilities obtained from the Kenya Open Data Portal 50 to population ratio. Malaria endemicity (i.e., basic reproductive number for Malaria cases) was taken from Gething et al. 15 and the mean altitude in meter was taken from Jarvis et al. 51 . Six variables represented climate related factors and were taken from Hijmans et al. 52 : Mean annual temperature in degrees centigrade with maximum temperature of warmest month and minimum temperature of coldest month, as well as the mean annual precipitation in millimeter with mean precipitation of wettest month and mean precipitation of driest month. We also included geographic coordinates and a factor representing the spatial lag of YLL (i.e., average value of YLL in adjacent divisions).

SD Median Min Max
Furthermore, we applied a principal components analysis on additional census variables to combine explanatory variables representing socio-demographic characteristics of the population to enhance the interpretability of results 53,54 . All components with Eigenvalues greater than one were extracted and used as uncorrelated explanatory factors in our analyses. Table 1 summarizes all principal components with respective variables and factor loadings and Table 2 provides summary statistics for all variables used in the analysis.

Analysis.
We first performed spatial autocorrelation analysis (Moran's I) to explore spatial clustering of YLL, that is, the degree to which nearby divisions tend to show similar or dissimilar YLLs rates. The global Moran's I characterizes the overall pattern in the entire study area 55 . The local Moran's I identifies local spatial clusters of similar (hotspots) or dissimilar neighboring divisions (outliers) that are significantly different from an expected spatial pattern under normality assumption 56 . Divisions that indicated a significant (p < 0.001) local Moran's I were mapped and classified into High-High (or Low-Low) hotspots, that is, high (or low) YLL in one division next to high (or low) YLL in neighboring divisions, or Low-High or High-Low spatial outliers. We conducted global and local Moran's I with "spdep" in R 57,58 .
Second, we used boosted regression trees (BRTs) to quantify the association between explanatory variables and YLL in Kenya. BRTs draw on techniques from machine learning 59,60 and have been successfully applied to disease modeling [60][61][62][63] . We chose BRTs because they can handle non-linear relationships, are insensitive to outliers, and account for interactions between variables 60,64 . Generally, models based on regression trees partition the variable space into those parts with the most homogenous responses to the explanatory variables 60,64 and the relative importance of these variables determines the strength of their association with the YLL. This relative importance is quantified by the number of times a variable is used for splitting a regression tree, weighted by the model improvements as a result of each additional split, and averaged over all trees 60,64 . In order to examine the nature of the association between a variable and YLL, partial dependence plots (PDPs) were computed. PDPs are fitted functions for a certain explanatory variable along its data range and thus represent the isolated effect of the variable on YLL while holding all other explanatory variables at their mean 60 . Interactions among variables identified and modeled by BRTs can be visualized by three-dimensional PDPs. We applied BRTs using "dismo" and "gbm" packages in R 58,65,66 .
Finally, we tested BRT model residuals for spatial autocorrelation to verify the assumption of independent errors 67 . For all procedures, we followed the guidelines and recommendations of Good Epidemiological Practice (GEP) defined by the German Society for Epidemiology to secure ethical principals in data handling 68 .