The Global Burden of Disease (GBD) study provides a framework for quantifying the magnitude of health loss due to diseases, injuries, and risk factors1. GBD studies quantify health loss from both mortality and morbidity using disability-adjusted life years (DALYs)2,3. In Kenya, for example, 78.3% of total DALYs consist of years of life lost (YLL) due to premature mortality4, with HIV/AIDS, lower respiratory infections, diarrheal diseases, tuberculosis, and malaria as the leading causes5.

However, most GBD studies have focused on the national level6, missing potentially significant variation at the sub-national level7. Although some studies have assessed disease burden at the sub-national level8,9,10,11, only a few have considered spatial patterns in these measures12,13. Knowing which sub-national regions exhibit significantly above- or below-average disease burden is of particular interest for deciding where to intervene to improve population health.

Yet much remains unknown about the sub-national distribution of risk factors for disease burden, which limits our knowledge of where health interventions will be most efficient. Many low- and middle-income countries lack the disaggregated health statistics needed for sub-national studies and spatial analyses. Census data, which are more often available, may help overcome this issue and could be useful for analyzing disease burden due to, for example, premature mortality.

To the best of our knowledge, no study has systematically assessed the spatial patterns of disease burden due to premature mortality with sub-national data across an entire country in Africa. One notable exception is Manda and Abdelatif14, who analyzed the spatio-temporal variation of mortality risk across South African municipalities. However, they did not account for important risk factors (e.g., infectious disease) or for environmental and socio-demographic factors (e.g., climate, ethnicity). Furthermore, their study did not explicitly assess spatial clusters of life years lost.

We set out to investigate the spatial distribution of disease burden in Kenya based on the most recent population and housing census (2009). Specifically, we aimed to (1) detect spatial clusters of YLLs at the division level (n = 612) and (2) identify variables associated with YLLs at this level.


We noted that YLL exhibited a distinct geographic pattern, with higher YLLs in western, northwestern, and northeastern Kenya (Fig. 1). We found small but significant spatial clustering of YLL across Kenyan divisions (global Moran’s I = 0.20, p-value < 0.001). Figure 2 shows significant (p-value < 0.001) local spatial clusters of high YLL rates (a) near Lake Victoria in western Kenya, (b) in Turkana County in the northwest, and (c) near the border with Ethiopia and Somalia in the northeast. Significant spatial clusters of low YLLs were found in central and southern Kenya. Figures 1 and 2 of the supplementary file contain additional information for YLL based on the Kenya-specific life expectancy (not reported).

Figure 1

Years of life lost (YLL) due to premature death at the division level in 2009.

Figure 2

Significant spatial clusters of years of life lost (YLLs) per person at the division level. The map shows three clusters of divisions in which high (above-average) values of YLL were found next to each other: one near Lake Victoria (a), one in Turkana County (b), and one in the border triangle with Ethiopia and Somalia (c).

Figure 3(a) shows the relative importance of the significant explanatory variables in our model. Table 1 in the supplementary file shows odds ratios and 95% confidence intervals for the ten most important variables from a replicated version (Poisson multivariable regression) of our boosted regression tree (not reported). Higher shares of Luo ethnicity and more crowded households were the strongest factors significantly and positively associated with YLL at the division level. Figure 3(b) and (c) depict the partial dependence plots (PDPs) for share of Luo ethnicity and household crowding, each illustrating the isolated influence of the respective risk factor on YLL while controlling for all other factors. For example, YLL increased sharply with a higher share of Luo people until it levelled out at around 65%, after which the strength of association remained constant. Household crowding also had a non-linear influence on YLL. The effect of crowding on YLL was low below 3.5 persons per room, but crowding above this threshold was associated with rapidly increasing YLL rates, up to 5.5 persons per room. The shares of Luo ethnicity and crowded households in a division also interacted significantly with each other (Fig. 4). The association between household crowding and YLL rate was stronger in divisions with a share of Luo people above approximately 30%.

Figure 3

Explanatory variables associated with years of life lost (YLLs). Relative importance of the ten most influential variables (a) and partial dependence plots (PDPs) of the two most important variables: Ethnicity (Luo) (b) and household crowding (c). Rug plots on the x-axes illustrate the data distribution of the respective variable in percentiles. PDPs were smoothed using a spline interpolation.

Table 1 List of principal components used as explanatory variables in this study and the respective original variables with main factor loadings given as Pearson’s correlation coefficients in brackets.

Figure 4

Joint partial dependence plot (PDP) visualizing interaction between ethnicity (Luo) and household crowding.

Furthermore, higher shares of Kisii ethnicity, higher malaria endemicity, and divisions at higher latitudes were significantly and positively associated with YLL (Fig. 3, Supplement). We also found that the spatial lag coefficient, which represented YLL in neighboring divisions, was positively associated with YLL. In contrast, higher shares of Kikuyu or Kamba ethnicity, higher shares of married people, and higher precipitation in divisions were significantly and negatively associated with YLL. We also tested our model with a spline function on the precipitation variable (precipitation of the wettest month) but could not find any significant difference from the model reported here (Supplementary File, Fig. 4).


Years of life lost due to premature mortality (YLL) were spatially clustered in western, northwestern, and northeastern Kenya, and higher shares of Luo people and crowded households exhibited the strongest associations with YLL in Kenyan divisions.

While most divisions displayed YLL rates around the national average of 0.4, some divisions had YLL rates up to four times higher (1.7), indicating a spatial concentration of premature mortality. For example, high YLL rates clustered near Lake Victoria in the southwest (Fig. 2a). This region is characterized by the highest HIV prevalence and high malaria endemicity15. HIV/AIDS and malaria are the first and third most important causes of YLL, constituting 18.9% and 10.0% of Kenya’s total YLL, respectively4. Hence, these conditions could explain the high burden in this area. Other significant clusters of high YLL rates were identified in Turkana County in northwestern Kenya (Fig. 2b) and in the border triangle with Ethiopia and Somalia (Fig. 2c). These predominantly remote, (semi-)arid regions are sparsely populated and dominated by (nomadic) pastoralism16. There could be several explanations for the high burden in these regions. First, remoteness could imply limited access to health care facilities and services. Second, low agricultural potential combined with frequent droughts may periodically lead to health-threatening food insecurity16. Finally, inter-tribal violence related to resource scarcity and cross-border spillover from armed conflicts in neighboring countries (Somalia, South Sudan, Uganda) may be reasons for the clusters of high YLL in these regions17,18. The central and southern regions of Kenya, in which low YLL clustered, are instead characterized by higher agricultural potential, good income opportunities, and better food security16. Combined with modest HIV prevalence and low malaria endemicity, this may explain the low burden in this area15,19.

Higher shares of specific ethnicities (Luo or Kisii) within divisions were positively associated with YLL, and this association was the strongest among all variables in the model. Kenya is home to over 70 distinct ethnic groups, with the Kalenjin, Kamba, Kikuyu, Luo, and Luhya being among the largest. This rich diversity, however, has often led to social tensions20,21 and unequal health outcomes. For example, our finding is consistent with other studies that report the highest HIV and tuberculosis prevalence and child mortality among the Luo compared to other ethnicities in Kenya22,23. It is therefore plausible that higher shares of Luo or Kisii in a division were positively associated with YLL. Our findings underline the importance of considering ethnicity when examining the burden of disease24. For example, certain health-related practices (e.g., circumcision, use of cultural medicine, sexual behavior) and people’s access to health care can be strongly dependent on ethnicity24,25,26. However, we explicitly point out that we cannot assume a direct relationship between ethnicity and higher risk of YLL, since we examined relationships at the ecological and not at the individual level. Nor can we infer causal relationships between ethnicity-specific health behavior and YLL from our cross-sectional study. Future studies should examine ethnic composition and respective health behavior at the individual level to better understand the burden of disease across different population groups and regions of Kenya.

Household crowding (over 3 persons per room) was positively associated with YLL, possibly because more persons sharing one room raises the risk of communicable diseases such as acute respiratory infections, tuberculosis, or skin diseases27. This finding is in line with studies from New Zealand28 and Uganda29 that also revealed associations between household crowding and morbidity. In contrast to our results, Ombok et al.30 did not identify crowding as a risk factor for child mortality in Nyanza Province of Kenya, possibly because they used a dichotomous variable (<5 and ≥5 persons/room) while we employed a continuous measure. There was a statistically significant interaction between household crowding and Luo ethnicity in our study. This indicates a mutually reinforcing effect of these two factors, so that the risk of premature mortality is particularly high in a division if both factors are high. Because there is little evidence in the literature on the health effects of household crowding with respect to ethnicity, this finding suggests a need for more in-depth analysis in future studies.

We found that higher malaria endemicity in a division was positively associated with YLL. This is consistent with a large body of literature, especially in the sub-Saharan African context5,15,31,32,33,34,35,36,37. In contrast, our results suggest that being married can be protective against poor health and YLL; this has also been shown in several studies for different health outcomes38,39,40. Using the same data in another study at the individual level in Kenya, Gruebner et al.40 found reduced risk of child death for mothers who lived in households with married household heads. The authors assumed that being married indicates a stable living arrangement providing a health-promoting environment. In the current study, this may also be true at the ecological level, as higher rates of married persons in a division were negatively associated with YLL.

Our study found a negative association between higher precipitation in a division and YLL. It is not entirely clear why this is the case. While one study found that malaria mortality was associated with rainfall in western Kenya41, a study in Sweden found that higher precipitation decreased the number of deaths in the 18th and 19th centuries42. The authors argue that in Sweden a warm spring with good rainfall increased the chance of a rich harvest, on which the pre-industrial population depended. This may also be true in our study, as precipitation allows for crop cultivation (e.g., coffee, banana) that would provide income opportunities for the local population, with positive effects on health43,44.

A more northerly location of a division was positively associated with YLL. This may mirror findings from our spatial cluster analysis, suggesting that latitude may act as a proxy for remoteness, low agricultural potential, frequent droughts, or inter-tribal violence in these regions. More spatial epidemiological studies are needed to further break down the geographic distribution of explanatory variables associated with the burden of disease in Kenya.

Furthermore, YLL in a division was positively associated with YLL in neighboring divisions. This may indicate spillover; that is, exposure factors in one division (e.g., higher share of specific ethnicities, crowded households, malaria endemicity) may also be associated with higher YLL in adjacent divisions, even when these factors are low there. Another explanation for the spatial lag effect could be that adjacent divisions share similarly high values of exposure factors.

We recognize three noteworthy limitations of our study. First, we calculated YLL rates based on deaths per household reported for the twelve months prior to the census, which may be subject to reporting biases. For example, the early death of a child is a traumatic event that may influence such reporting. Recall bias may lead to underestimation of mortality if deaths that occurred within the recall period are omitted. Conversely, reporting of deaths that occurred outside the recall period may have led to an overestimation of mortality45. Although recall bias has most frequently been regarded as a major issue in case-control studies, it has also been reported to compromise retrospective study designs46. For the neighboring country of Tanzania, however, Moshiro et al.47 found that long recall periods of up to 12 months did not affect estimates.

Second, we had to exclude 9.7% of the death cases because they were reported with an unknown age at death. Comparison of age-specific mortality rates calculated from the Kenyan census data with rates from the GBD 2010 study indicated noticeably lower mortality rates at older ages (>60) in our data. This suggests that the excluded death cases were predominantly people of older age. Deaths at older ages have a fairly small impact on YLL because of lower residual life expectancies, and hence we assume that the exclusion had only a negligible effect on our findings.

Third, we created a single model for YLL attributable to all causes of death based on the census, that is, on a complete enumeration of the population. Such an approach prevents analysis of YLL specific to communicable diseases, non-communicable diseases, or injuries. Yet risk factors vary substantially from one group of diseases to another, which needs to be kept in mind when interpreting our findings.

To the best of our knowledge, this is the first study to address the spatial distribution of the burden of disease due to premature mortality at the division level in Africa. Based on census data, we identified spatial patterns of years of life lost (YLL) that provide crucial information on where people are at higher risk of premature mortality. Moreover, we identified exposure factors that were significantly associated with YLL.

Kenya made significant progress in reducing the top three causes of premature death between 2005 and 201648. For example, premature death due to HIV/AIDS was reduced by 60.4%, diarrheal diseases by 29.8%, and lower respiratory infections by 23.3%48. Furthermore, malaria, the seventh most important cause of premature death in the country, was reduced by 59.9% compared with 200548.

Our spatial epidemiological approach with census data is transferable and should be reapplied once updated census data are available. It can thereby contribute to precision public health, supporting the allocation of scarce resources to the regions and populations most affected by premature mortality, also in contexts beyond Kenya.


Data set and availability

Micro-level data from the most recent census, conducted on August 24th, 200949, were used. These data are also available to other researchers who meet the criteria for access to confidential information. Interested researchers may request this data at

Study design and population

As in Gruebner et al.40, a cross-sectional study design was used, with data on the general population aggregated, for this study, at the division level. We excluded divisions consisting primarily of non-residential areas and thereby arrived at N = 612 divisions suitable for our analyses. The population of these divisions ranged from 165 to 870,202, with a median of 44,661.

Outcome variable

The outcome variable was “Years of Life Lost (YLL)” per person at the division level, calculated from the death cases reported by each household for the 12 months prior to the census and standardized by age and sex. YLLs are defined as the sum of the residual life expectancy of each death case according to the GBD 2010 standard life table, which assumes a life expectancy at birth of 86.02 years for all individuals globally1. The census reports 263,564 death cases in Kenya; however, 9.7% of them were recorded with an unknown age of the deceased person. These cases were excluded from our study since they could not be used for calculating YLLs. Our final dataset included 238,121 death cases that were used to calculate age- and sex-standardized YLL rates at the division level (N = 612).
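The definition above can be illustrated with a minimal sketch. It is not the authors' code: the residual life expectancy is approximated here as 86.02 minus age at death (an assumption made for brevity; the actual GBD 2010 standard life table assigns its own value to each age), the function names and the example division are hypothetical, and the age and sex standardization step is omitted.

```python
# Sketch of the YLL-per-person calculation for one division.
# ASSUMPTION: residual life expectancy is approximated as
# max(86.02 - age, 0). The real GBD 2010 standard life table
# assigns its own (non-linear) value to each age at death.

LIFE_EXPECTANCY_AT_BIRTH = 86.02  # years, as stated in the text


def residual_life_expectancy(age_at_death):
    """Crude stand-in for a life-table lookup (illustrative only)."""
    return max(LIFE_EXPECTANCY_AT_BIRTH - age_at_death, 0.0)


def yll_per_person(ages_at_death, population):
    """Sum residual life expectancy over deaths with known age,
    then divide by the division's population."""
    total_yll = sum(
        residual_life_expectancy(age)
        for age in ages_at_death
        if age is not None  # deaths with unknown age are excluded
    )
    return total_yll / population


# Hypothetical division: four reported deaths, one with unknown age.
deaths = [0, 30, 70, None]
division_yll = yll_per_person(deaths, population=1000)
```

In this toy division the three usable deaths contribute roughly 86, 56, and 16 years, which shows why deaths at older ages add comparatively little to the total.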

Explanatory variables

We considered the following census variables aggregated at the division level: population density (population/km²), household crowding (mean number of persons/room), percentages of rural households and of ethnic population groups, and mean educational attainment (range 0 = no education to 20 = completed university degree).

Mean access to health care was calculated as the ratio of health facilities, obtained from the Kenya Open Data Portal50, to population. Malaria endemicity (i.e., the basic reproductive number for malaria cases) was taken from Gething et al.15, and the mean altitude in meters was taken from Jarvis et al.51. Six variables represented climate-related factors and were taken from Hijmans et al.52: mean annual temperature in degrees Celsius, maximum temperature of the warmest month, minimum temperature of the coldest month, mean annual precipitation in millimeters, mean precipitation of the wettest month, and mean precipitation of the driest month. We also included geographic coordinates and a factor representing the spatial lag of YLL (i.e., the average value of YLL in adjacent divisions).
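The spatial lag described above, the average YLL of adjacent divisions, can be sketched as follows. The adjacency list, division labels, and YLL values are invented for illustration, and how adjacency was actually derived from division boundaries is not specified here.

```python
def spatial_lag(values, neighbors):
    """Return, for each unit, the mean value of its adjacent units.

    Units without any neighbor get None instead of a lag value.
    """
    lag = {}
    for unit, adjacent in neighbors.items():
        if adjacent:
            lag[unit] = sum(values[n] for n in adjacent) / len(adjacent)
        else:
            lag[unit] = None
    return lag


# Hypothetical YLL rates for four divisions and their adjacency.
yll = {"A": 0.4, "B": 0.8, "C": 1.6, "D": 0.2}
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
lagged_yll = spatial_lag(yll, neighbors)
```

Division D has a low YLL of its own but a high lag value because its only neighbor, C, has the highest rate in the example.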

Furthermore, we applied a principal components analysis to additional census variables to combine explanatory variables representing socio-demographic characteristics of the population and thereby enhance the interpretability of results53,54. All components with eigenvalues greater than one were extracted and used as uncorrelated explanatory factors in our analyses. Table 1 summarizes all principal components with the respective variables and factor loadings, and Table 2 provides summary statistics for all variables used in the analysis.

Table 2 Descriptive statistics for all explanatory variables used in the study.


We first performed spatial autocorrelation analysis (Moran’s I) to explore spatial clustering of YLL, that is, the degree to which nearby divisions tend to show similar or dissimilar YLL rates. The global Moran’s I characterizes the overall pattern in the entire study area55. The local Moran’s I identifies local spatial clusters of similar neighboring divisions (hotspots) or dissimilar neighboring divisions (outliers) that differ significantly from the spatial pattern expected under the normality assumption56. Divisions with a significant (p < 0.001) local Moran’s I were mapped and classified into High-High (or Low-Low) hotspots, that is, high (or low) YLL in one division next to high (or low) YLL in neighboring divisions, or into Low-High or High-Low spatial outliers. We conducted global and local Moran’s I analyses with “spdep” in R57,58.
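To make the two statistics concrete, the following plain-Python sketch computes a global and a local Moran’s I on a toy example. It is not the "spdep" workflow used in the study. It assumes row-standardized binary contiguity weights (the weighting scheme is an assumption), requires every unit to have at least one neighbor, and omits the significance test, so the quadrant labels only describe the sign pattern. All names and values are hypothetical.

```python
def morans_i(values, neighbors):
    """Global and local Moran's I with row-standardized weights.

    values: dict unit -> observed value (must not all be equal)
    neighbors: dict unit -> non-empty list of adjacent units
    Returns (global_i, local_i, quadrant).
    """
    units = list(values)
    n = len(units)
    mean = sum(values.values()) / n
    z = {u: values[u] - mean for u in units}    # deviations from mean
    m2 = sum(d * d for d in z.values()) / n     # second moment
    # Row-standardized lag: mean deviation among each unit's neighbors.
    lag = {u: sum(z[v] for v in neighbors[u]) / len(neighbors[u])
           for u in units}
    local_i = {u: z[u] * lag[u] / m2 for u in units}
    # With row-standardized weights the global statistic equals
    # the mean of the local statistics.
    global_i = sum(local_i.values()) / n
    quadrant = {
        u: ("High" if z[u] > 0 else "Low") + "-"
           + ("High" if lag[u] > 0 else "Low")
        for u in units
    }
    return global_i, local_i, quadrant


# Hypothetical chain of four divisions with rising YLL values.
yll = {"d1": 1.0, "d2": 2.0, "d3": 3.0, "d4": 4.0}
neighbors = {"d1": ["d2"], "d2": ["d1", "d3"],
             "d3": ["d2", "d4"], "d4": ["d3"]}
global_i, local_i, quadrant = morans_i(yll, neighbors)
```

Because similar values sit next to each other in the chain, the global statistic is positive, and the two ends fall into the Low-Low and High-High quadrants.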

Second, we used boosted regression trees (BRTs) to quantify the association between explanatory variables and YLL in Kenya. BRTs draw on techniques from machine learning59,60 and have been successfully applied to disease modeling60,61,62,63. We chose BRTs because they can handle non-linear relationships, are insensitive to outliers, and account for interactions between variables60,64. Generally, models based on regression trees partition the variable space into those parts with the most homogeneous responses to the explanatory variables60,64, and the relative importance of these variables indicates the strength of their association with YLL. This relative importance is quantified by the number of times a variable is used for splitting a regression tree, weighted by the model improvement resulting from each additional split, and averaged over all trees60,64. To examine the nature of the association between a variable and YLL, partial dependence plots (PDPs) were computed. PDPs are fitted functions for a certain explanatory variable along its data range and thus represent the isolated effect of the variable on YLL while holding all other explanatory variables at their mean60. Interactions among variables identified and modeled by BRTs can be visualized by three-dimensional PDPs. We applied BRTs using the “dismo” and “gbm” packages in R58,65,66.
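The idea behind a partial dependence curve can be sketched independently of any BRT library. The sketch below uses one common definition, averaging model predictions over the observed rows while one variable is fixed at each grid value. This is an assumption about the computation and may differ from how "gbm" evaluates partial dependence internally or from holding the other variables at their mean. The model is a hypothetical hand-written function, not a fitted BRT, and its coefficients are invented.

```python
def toy_model(row):
    """HYPOTHETICAL stand-in for a fitted model (invented numbers).

    The crowding effect starts above 3.5 persons per room and is
    steeper where the Luo share exceeds 30 percent.
    """
    excess = max(row["crowding"] - 3.5, 0.0)
    slope = 0.4 if row["luo_share"] > 30 else 0.1
    return 0.3 + 0.005 * row["luo_share"] + slope * excess


def partial_dependence(model, rows, variable, grid):
    """Average prediction over all rows with `variable` set to each
    grid value in turn, leaving the other variables as observed."""
    curve = []
    for value in grid:
        predictions = [model({**row, variable: value}) for row in rows]
        curve.append(sum(predictions) / len(predictions))
    return curve


# Two hypothetical divisions and a small grid of crowding values.
rows = [
    {"luo_share": 10.0, "crowding": 2.0},
    {"luo_share": 50.0, "crowding": 4.0},
]
grid = [2.5, 3.5, 4.5]
pdp = partial_dependence(toy_model, rows, "crowding", grid)
```

The resulting curve is flat up to the threshold and rises beyond it, which is the kind of non-linear shape a PDP is meant to reveal.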

Finally, we tested BRT model residuals for spatial autocorrelation to verify the assumption of independent errors67. For all procedures, we followed the guidelines and recommendations of Good Epidemiological Practice (GEP) defined by the German Society for Epidemiology to secure ethical principles in data handling68.