Assessing the population-wide exposure to lead pollution in Kabwe, Zambia: an econometric estimation based on survey data

This study quantitatively assessed the population-wide lead poisoning conditions in Kabwe, Zambia, a town with severe lead pollution. While existing data have reported concerning blood lead levels (BLLs) of residents in pollution hotspots, the data representing the entire population are lacking. Further, selection bias is a concern. Given the lack of compulsory testing schemes, BLLs have been observed from voluntary participants in blood sampling surveys, but such data can represent higher or lower BLLs than the population average because of factors simultaneously affecting participation and BLLs. To illustrate the lead poisoning conditions of the population, we expanded the focus of our surveys and then econometrically estimated the BLLs of individuals representing the population, including those not participating in blood sampling, using background geographic, demographic, and socioeconomic information. The estimated population mean BLL was 11.9 μg/dL (11.6–12.1, 95% CI), lower than existing data because of our wide focus and correction of selection bias. However, the scale of lead poisoning remained immense and 74.9% of residents had BLLs greater than 5 μg/dL, the standard reference level for lead poisoning. Our estimates provide a deeper understanding of the problem and a foundation for policy intervention designs.

adopt these levels for reference-although health damages have been reported for BLLs below 5 μg/dL 1,4 , and we do not imply that BLLs below 5 μg/dL are safe. Previous studies on Kabwe have reported BLLs exceeding 45 μg/ dL, including the levels normally considered fatal 15,[19][20][21][22] . The scale of the problem is also large. According to the Toxic Sites Identification Program (TSIP) 23 , Kabwe's lead contamination affects the largest number of people, with 120,000, among all the confirmed cases of lead contamination around the world.
Despite these alarming reports, representative data of the extent of lead poisoning among the entire population of Kabwe are lacking. This can be attributed to two issues related to data collection. First, most previous surveys have primarily focused on a small sample of data collected from children residing in areas around the mining sites. Although the pollution problem is the most acute in these areas, and children are generally at the highest risk of lead poisoning 4,18 , the data coverage has been far from representative. Second, while BLL data were obtained from those who voluntarily participated in studies, given the lack of public mechanisms for formal and compulsory blood lead testing, there could be selection bias in the data, a problem widely recognised in the medical, statistical, public health, and economics literature [24][25][26][27][28] . Data from voluntary (or self-selected) participants in studies generally can fail to reflect the conditions of the population if certain factors simultaneously affect participation decisions and the outcomes of interest. In the context of our study, the residents voluntarily participating in the blood sampling can be those particularly concerned about or suspected to have lead poisoning, and their BLLs can be lower or higher than the population average. Various background characteristics, including both observable (e.g. education, age, employment and living standards) and unobservable ones (e.g. health preferences), can also form their willingness and constraints to participate and directly affect BLLs. As the demographic composition, socioeconomic conditions and pollution levels are diverse in Kabwe, data reflecting these diversities and correcting for potential selection bias are essential.
The purpose of this study is to quantitatively assess the prevalence of lead poisoning among the entire population of Kabwe (administratively, the district of Kabwe). Our methodology to accomplish this purpose is twofold. First, we chose our sample individuals based on random sampling covering the entire Kabwe district. Although we could obtain BLL data from the subset of the chosen individuals who voluntarily participated in blood sampling, this expanded our focus beyond that of previous studies and helped define the target samples representing the population. Quick results of the blood sampling survey are available elsewhere 22 . Second, we employed econometric models to estimate BLLs for the representative sample individuals. Concurrently with the blood sampling survey, we conducted a household survey to obtain background socioeconomic, demographic, and geographic information for the sample individuals, including those who did not participate in the blood sampling survey. We then used these data to econometrically estimate the equation to determine BLLs and, finally, calculated the BLLs of the entire sample individuals. We paid attention to the potential differences in both observable and unobservable characteristics between the participants and non-participants under two econometric methods: ordinary least squares (OLS) and Heckman's sample selection model 24 .
The contributions of our study are twofold. First, this study illustrates the severity and diversity of the lead pollution problem in Kabwe and help policymakers design remedial measures. This study is the first attempt to systematically obtain representative estimates of the lead poisoning conditions of residents in the entire Kabwe district. The estimated mean BLL was 11.9 μg/dL and 74.9% of residents had BLLs above the standard reference level of 5 μg/dL. Data representing the population are important to fully understand the pollution problem in Kabwe. Representative data can also serve as a foundation for a policy intervention design and cost-benefit analysis. Further, our estimates shed light on the extent of risks facing low-and middle-income countries, contributing to studies quantifying the global burden of general and pollution-related diseases [29][30][31][32] . Despite the substantial impact of lead poisoning on the global disease burden, there remain gaps in the literature, and existing data do not precisely depict the burden of diseases at contaminated sites 31 .
The second contribution is methodological and refers to the management of selection bias. Health surveys are often subject to selection bias analogous to our data, which can lead to biased conclusions 27,28 . The methodology adopted not only is consequential for bias mitigation in the current study. Although the details of our specifications would need modifications, our approach is applicable to many other cases where formal and compulsory testing for a disease is lacking.

Methods
Data collection and potential selection bias. We 16). Further approvals were granted by the Ministry of Health through the Zambia National Health Research Ethics Board and the Kabwe District Medical Office. The data were collected in accordance to the Declaration of Helsinki, and the informed consent was obtained from all the study participants including the parents/legal guardian of the minor subjects for participating in the study.
The two surveys were designed consistently and targeted the same sample households selected in the following two-step approach. In the first step, utilising the Zambia's national census frame which divides the Kabwe district into 384 standard enumeration areas (SEAs), we randomly selected 40 SEAs across the entire district. In the second step, we randomly selected 25 households (and a few replacements) from each sampled SEA. The sampling weights were generated to account for population differences across the SEAs.
The KHSS 2017 conducted interviews with 895 households (4,900 individuals) at houses and collected data on socioeconomic, demographic and geographic information. The response rate was 88.2%, and we could regard the data adjusted by the sampling weights as representative of the entire Kabwe population (for more details of the survey, see the report 33  www.nature.com/scientificreports/ To obtain BLL data, we conducted a blood sampling survey concurrently with the KHSS 2017. For hygiene and ethical considerations, we selected 13 local clinics to perform the blood sampling, instead of collecting blood at houses. We invited up to four members (two children aged 10 years or younger and their parents or guardians) from each sample household for the blood sampling. We prioritised young children over children older than 10 years old. The invitations were made sequentially. We assigned identical venues and dates for households from the same SEAs. The typical assigned dates had a 3-day window from the day after the invitation. However, we allowed for some flexibility and sampled the blood of those who visited the clinic even after the assigned time window, as long as the clinic was operational for households from other SEAs. Therefore, the window for blood sampling was effectively the number of days from the day after the invitation until the pre-set blood sampling period in each clinic was over, which had a substantial variation across households from 3 days to a month. We revisit this feature of the survey window when setting up our econometric model later. A total of 372 households (41.6%) participated in the blood sampling and, on average, 2.2 members from the participating households provided blood samples.
We performed blood digestion and metal extraction as described by our previous study 34 with minor modifications and measured BLLs using an Inductively Coupled Plasma-Mass Spectrometer (ICP-MS). In addition, we also measured BLLs with a portable analyser, LeadCare II, to obtain quick results 22 . However, we in this study focus on the ICP-MS data, considering their general accuracy. See the Supplementary Material Section S1 for details on the methods used to measure BLLs and the difference in the data between the two analysers.
Regardless of the accuracy of the techniques, however, we further need to account for the risk of selection bias in the BLL data. In the absence of formal and compulsory testing mechanisms, we relied on individuals' voluntary (self-selected) visits to the clinics. However, the participants in blood sampling could have traits leading to higher or lower BLLs than the population. Such traits can include education, gender, age and living standards. The survey design did not prioritise children aged 11 years or older, and this could also contribute to the deviation of characteristics, although a small number of such children attended clinics. Moreover, certain unobservable characteristics affecting BLLs can further differ between the participants and non-participants. For example, those with greater preferences for health possibly had low BLLs but tended to participate in the blood sampling surveys, whereas those with a high innate physiological capacity for lead excretion possibly tended not to participate because they had low BLLs and did not perceive symptoms of lead poisoning. These issues can lead to selection bias, and the raw data observed from the voluntary participants can fail to illustrate the lead poisoning conditions of the population.

BLL estimation approach.
To correct for potential selection bias, we first estimated the equations to explain BLLs of children aged 0-10 years and adults aged 19 years or above. Then, using the estimated equations, we calculated BLLs for all individuals, including children aged 11-18 years and those in the other age groups who did not participate in the blood sampling.
BLLs generally depend on the ambient pollution level, the opportunities of exposure to pollution, the physiological capacity of lead absorption and excretion, and the knowledge and technologies used to prevent lead poisoning. We controlled for ambient pollution levels by including the distance, direction, and altitude of household location-the first two variables are with respect to the mine waste dumping site (Black Mountain). The remaining factors were measured by age and various other individual and household characteristics denoted by X i . Data for these variables are available regardless of participation in blood sampling. We assumed the following equation for BLL: The logarithmic form for BLL adjusts its distribution to approximately normal-BLL is bounded from below and has a skewed distribution-and allows the factors on the right-hand side to have proportional effects rather than level effects. ε i is the independent and identically distributed error term that captures noise, such as casual fluctuations and measurement errors in BLLs, and the effects of unobservable factors. While we presented a single equation above, we assumed different equations for children aged 0-10 years and adults aged 19 years or above.
Below, we discuss our specification in detail.

Geographic factors.
Existing studies have examined the relationship between the geographic location and ambient pollution level [12][13][14] . Since lead is transported from the mine waste dumping site through the flow of wind and water, the distance from the site is negatively correlated with ambient lead levels. The soil lead contamination spreads to the western side of the site, particularly towards the west-northwest (WNW), which corresponds to the direction of the prevailing local wind. The contamination also slightly extends to the lowelevation south-eastern side, reflecting pollution transported by water. The northern and southern sides are the least contaminated. We defined distance i as the distance between the mine waste dumping site and the location of i 's household, with β dis < 0 expected. Also, we assumed that the WNW is the most contaminated and, accordingly, we defined direction i as the radian of the acute angle passing through WNW, the mine waste dumping site, and the location of i 's household. That is, the household location is WNW at direction i = 0 , either north-northeast or south-southwest at π/2 , and east-southeast (ESE) at π . We employed a quadratic specification in Eq. (1), which allows BLLs to have two peaks at WNW and ESE if β dir1 < 0 , β dir2 > 0 and −β dir1 /(2β dir2 ) < π . We statistically assessed the appropriateness of the specification for direction in Supplementary Material Section S2. We also www.nature.com/scientificreports/ used altitude in metres, altitude i , considering that elevated areas can be less exposed to dust and water flows, although the general tendency of land elevation can be absorbed by the direction variables.

Age and other covariates. For children, we assumed a non-linear relationship between their ages and
BLLs and defined the following functional form: is an indicator function that takes the value of 1 if the argument condition is satisfied, and mage i denotes age in months. The functional form reflects the findings in the literature. Young children are generally at a high risk of lead poisoning. Playing outside and age-appropriate hand-to-mouth behaviours expose them to lead, and their gastrointestinal absorption of lead is high 4 . Foetuses and infants born to exposed mothers absorb lead in utero and through breastfeeding 35 . Consequently, BLLs often reach a peak at or before the age of 24 months and then decrease as children grow older, reflecting their physical and behavioural growth 1,36 . Thus, we employed a specification that allows an inverted U-shaped relationship between the logarithmic BLL and age up to 23 months, but assume a linearly decreasing relationship between the two factors for children aged 2 years or above. For adults, the physiological foundation of the BLL-age relationship is not clear, but age-related changes in metabolism and lifestyle can affect BLLs. We simply assumed a log linear relationship between BLL and age for adults.
In addition, we used the following individual and household characteristics, denoted as X i , for children: a dummy variable for female; the mothers' education level (grades), which reflects their general, health-related and lead-related knowledge; a dummy variable for children whose mothers were absent (the mothers' education level was set at zero for such children); a dummy variable for female-headed households; household size; dependency ratio (the proportion of household members aged 0-15 years and 65 years or above); and the log of per capita household expenditure, which measures living standards. We also used dummy variables for household location: urban areas, small-scale farming areas, large-scale farming areas, and the Makululu compound-an area of informal settlement where public services are poorly delivered. We set urban area as the base category.
For adults, we continued to use the dummy variables for female and household location, household size and dependency ratio but dropped the variables related to mothers and household heads. The per capita household expenditure was not used, either, because it is not exogenous for adults. Instead, we used their own education level, which reflects living conditions to certain extent as well as knowledge levels. We also used a dummy variable for marital status, which takes the value of one for either married or co-habiting individuals, and the duration of residence in Kabwe (in years) to account for the effects of long-term lead exposure. econometric methods to estimate BLL equation. We considered two methods to estimate Eq. (1).
The first one is OLS, which directly estimates Eq. (1) from the data of the participants in the blood sampling survey. If the bias in BLLs are attributable to the difference in observable factors between the participants and non-participants, then the OLS estimate of Eq. (1) is unbiased and can be used to obtain estimates representing the population. However, as previously mentioned, unobservable characteristics can also affect both BLLs and participation decisions. This can disrupt the error term distribution and bias the OLS estimate of Eq. (1).
To account for this risk, we also adopted Heckman's sample selection model 24 . This model corrects for the bias in unobservable factors by simultaneously estimating the probability of participation (selection equation) for the entire sample, including non-participants. Specifically, we considered the following selection equation: where is the normal distribution function with the probability density function of ψ , X i is the same as in Eq. (1), and g age i has the functional forms identical to f age i . The bias in Eq. (1) can be fixed by estimating Eq. (1) with the inverse Mills ratio, ψ/�.
In the sample selection model, the use of an exclusion restriction variable, which affects the probability of participation but not BLL, is preferable. We used the number of days of the blood sampling window denoted by window i as an exclusion restriction. As described above, the blood sampling window was effectively the number of days that the assigned clinic remained operational for blood sampling after the day following the invitation. Other factors being equal, households that received early invitations and had longer time windows would more easily manage to attend clinics and would have higher probabilities of participation. The exogenous nature of the blood sampling window renders it irrelevant for BLLs. estimation of the representative BLLs. After obtaining the BLL equations, we estimated the BLLs of the representative sample individuals by inputting their characteristics on the right-hand side of the equations. We applied the survey's sampling weights when aggregating the estimated BLLs.
To estimate the BLLs of adolescents aged 11-18 years, who were basically not covered in our BLL survey and thus not used in the BLL equation estimations, we used the equation for children aged 0-10 years, assuming that age-BLL trend, which we expected to be negative, would hold up to the age of 18 years.
Next, we calculated the number of the residents with BLLs above 5 μg/dL by interacting the estimated proportion of those with such BLLs and the total population. Considering the population growth, we used the population estimates of our own 33  (2) f age i = φ 0 + φ 1 mage i + φ 2 mage 2 i × I age i < 2 + φ 3 age i × I age i ≥ 2 .
(3) Pr i participates = �{δ dis log distance i + δ dir1 direction i + δ dir2 direction 2 i + δ alt altitude i + g age i + X i ξ ′ + ζ window i }, www.nature.com/scientificreports/ Further, we present two graphical results. The first one is an in-depth examination of the mean BLLs across age groups. In the second one, we simulated the geographic variation of the mean BLLs. We divided the entire Kabwe district into 1 km × 1 km grids, and estimated the mean BLL in each grid cell. Distance and direction were measured for each cell and other independent variables were measured by the means in the ward-official innerdistrict division-to which the cell corresponds (we provide additional technical notes before showing results).
All estimations were performed using Stata 15 software.

Results
observed BLL data. The observed mean BLL among the participants in the blood sampling survey was 15.9 μg/dL, in which we did not make econometric adjustments ( Table 1). The 50 percentile (median) BLL was 11.3 μg/dL, indicating a skewed distribution. Male BLLs tended to be higher than female ones. BLLs were generally negatively associated with age. Overall, approximately 5.3% of the participants reported BLLs exceeding 45 μg/dL. This proportion was 14.2% among children aged 0-5 years, but only nine adults (2.0%) reported such high BLLs. The observation size for those aged 11-18 years were small.
characteristics of blood sampling participants and non-participants. The characteristics of the participants and non-participants in blood sampling were not identical. Among children aged 0-10 years, the two groups significantly differed in terms of household location, size and living standards, with P values below 0.10 ( Table 2). Among adults, the characteristics of the two groups were more clearly distinct, with P values mostly below 0.01 (Table 3). Therefore, the participants in blood sampling were not a random subset of our study target. Their BLLs (Table 1) can fail to represent the lead poisoning conditions of the population. (Table 4, column I), the coefficients of the distance and direction variables had expected signs with P values below 0.01. BLL was decreasing in the distance, whereas the relationship between BLL and direction was U-shaped, with the highest peak at WNW, the lowest peaks at northeast and south ( direction i ≈ 5π/8 ), and a small peak at ESE. The explanatory powers of distance and direction were so large that R 2 remained at 0.67 even after dropping other independent variables. Considering the strong powers of these factors and given that the values of these variables were similar among neighbouring households, we clustered standard errors for SEAs (in all the subsequent estimations as well). Altitude did not have a significant effect. Age also had a significant effect. BLL peaked at 16.5 months, which is close to the average age of children to stop breastfeeding in Kabwe, 15.8 months 33 . This suggests a role of lead transfer through breastfeeding. BLL decreased by approximately 5% per year from the age of two years.

estimated BLL equation for children. In the BLL equation estimation based on OLS
Among other factors, the dependency ratio raised BLLs, albeit with a marginally significant P value of 0.07. This suggests the possibility that parents in households with high dependency ratios failed to take sufficient precautionary measures for lead exposure. Mothers' education reported a negative coefficient but its effect was insignificant with a P value of 0.10. Similarly, the per capita household expenditure did not have a significant coefficient.
Under Heckman's sample selection model, the probability of participation significantly depended on age and household size ( Table 4, column II). Although household income per capita reported significantly different means between the participants and non-participants (Table 2), its effect on participation was insignificant after other factors were controlled for. Conversely, while the mean age was almost identical between the two groups ( Table 2), age had a significant non-linear effect on the probability of participation. The exclusion restriction, the duration of blood sampling window, had a significant effect with a P value below 0.01. However, the resulting BLL equation was similar to the OLS estimate (Table 4, column III). The inverse Mills ratio did not have a significant effect on BLLs with the P value of selection bias greater than 0.10. Therefore, selection bias was limited in terms of unobservable factors and the OLS estimate of the BLL equation was not significantly biased.  www.nature.com/scientificreports/   www.nature.com/scientificreports/ ratios were more likely to participate. The duration of the blood sampling window significantly increased the probability of participation. However, similar to the results for children, the inverse Mills ratio did not have a significant effect with P value above 0.10.
Representative estimates of lead poisoning conditions. We estimated the BLLs of 4,898 individuals, all but two sample individuals who had missing information, that represent the lead poisoning conditions of the entire population (Table 6). Since the selection bias in terms of unobservable factors was not significantly observed (Tables 4, 5), we used the BLL equations obtained under OLS. All figures hereafter were weighted by the survey's population weights. The representative mean BLL was 11.9 μg/dL, with a 95% confidence interval of 11.6-12.1 μg/dL, which is 2.4 times higher than the standard reference level of 5 μg/dL. 74.9% of the residents had BLLs above 5 μg/dL. This proportion, as of 2017, corresponds to approximately 202,500 individuals based on our population estimate, 270,389 individuals 33 , and to approximately 170,400 individuals based on the relatively moderate population projection of 227,551 individuals by the Central Statistical Office of Zambia 37 . The 50 percentile (median) was 8.7 μg/dL. Men had significantly higher BLLs than women (the P value for zero difference is below 0.01). Notably, only 9.6% of children aged 0-5 years and 9.8% of children aged 6-10 years had BLLs below 5 μg/dL, although this study expanded the focus beyond the immediate neighbourhood of the mine waste dumping site. 4.6% of children aged 0-5 years had BLLs above 45 μg/dL, but our estimates did not predict such high BLLs for adolescents aged 11-18 years and adults. Figure 1 depicts the in-depth relationship between the estimated BLLs and age. After peaking within the ages of 12-23 months, BLLs for children demonstrated a declining trend with age, albeit with fluctuations. Note that the BLLs of those aged 18 years and 19-29 years were continuously connected. This suggests that we successfully estimated the BLLs of those aged 11-18 years from the equation for children aged 0-10 years. Figure 2 illustrates the simulated geographic distributions of BLLs, separately for children (a) and adults (b). To obtain the figure for children, we set age at 16 months, when BLL reaches the maximum. Thus, the figure for

Discussion
This study estimated BLLs representative of the lead poisoning conditions among the entire population of Kabwe, Zambia, using the combined dataset of the ICP-MS measures of BLLs and a socioeconomic household survey. As in the previous studies on Kabwe and other health surveys in general, we were faced with the risk of selection bias in the BLL data in terms of both observable and unobservable factors, owing to non-random participation in the blood sampling survey. To overcome this problem, we employed econometric methods that controlled for differences in observable and unobservable factors between participants and non-participants in the blood sampling survey.
Our estimates showed that the mean BLL for the population was 11.9 μg/dL (Table 6), which is 25.2% lower than the mean of the observed BLLs of the participants (15.9 μg/dL, Table 1). While unobservable factors reported a minor bias (Tables 4, 5), the observable factors were not identical between the participants and non-participants (Tables 2, 3). In particular, the participants (or their parents) tended to have lower education levels and resided in Kabwe for a prolonged period, which were factors positively associated with BLLs. The age composition and household location were also different. These differences led to higher BLLs among the participants. Our estimate of the mean BLL was also lower than the ones in existing studies 15,[19][20][21] , mainly because their focus was placed mostly on pollution hotspots, but their data could be faced by selection bias similar to our observed data. Further, both our estimated and observed mean BLLs were lower than our early results based on LeadCare II analyser 22 . Although LeadCare II analyser is considered fairly accurate, our samples included higher BLLs than ones to which LeadCare II is often applied, and this apparently led to overestimation of BLLs (see the Supplementary Material Section S2).
Nevertheless, our results illustrate the devastating lead poisoning problems in Kabwe. We confirmed critically high BLLs among children residing in the most contaminated areas. Further, the mean BLL of our estimates was considerably higher than the standard reference level of 5 μg/dL, and the proportion of those with BLLs above this level amounted to 74.9%. Based on our population estimate as of 2017 33 , this proportion corresponds to 202,500 individuals (or 170,400 based on another population estimate 37 ), which is greater than an existing estimate of 120,000 in the TSIP 23 .
These estimates provide a foundation for policy intervention designs. Since lead poisoning was widespread across the entire Kabwe district, interventions that span across the entire population are required. Thus, although immediate interventions, such as chelation therapy proposed under a World Bank project 38 , could focus on pollution hotspots, interventions to reduce lead transportation, such as capping the mine waste dumping site with concrete or clean soil, would be of fundamental importance. Our estimates also provide grounds for proper cost-benefit evaluations of interventions. For large-scale interventions, the benefits for the entire population, not only the residents in hotspots, need to be accounted for, and this requires population-level data. Proper cost-benefit evaluations are important for sustainability of interventions as they require large costs and longterm commitment (e.g. monitoring and maintenance). www.nature.com/scientificreports/ Our methodology has an implication for other cases of health studies. Medical and clinical data collected through voluntary participation in testing can be subject to analogous selection bias problems to ours, particularly in cases in which formal and compulsory testing schemes are lacking. The extent of disruptions caused by selection bias can vary by case, and our econometric specifications would require modifications if applied to other cases. Nevertheless, the principle of our approach-collection of background data from the representative sample individuals, including those who did not participate in the medical testing, and correction of deviation in the characteristics of the participants-is applicable to various cases in which selection bias is a concern.
Finally we address the limitations of our study. First, our methodology was not employed to perfectly predict the BLL of each individual. Our estimates reflected variations of BLLs by gender, age groups, areas within Kabwe and various other factors but did not fully reflect idiosyncratic variations. Certain individuals with particular idiosyncratic factors can have high or low BLLs even if their traits and residential locations are associated with low or high BLLs. The second limitation, related to the first one, is the general difficulty to econometrically predict extreme outcomes. Such outcomes are scarce and idiosyncratic factors prevail over systematic ones. In our case, a small proportion of adults did report such BLLs, but our estimates did not predict such BLLs for adults. Finally, while we employed BLL as the measure, a comparison with alternative measures would improve the understandings of the lead poisoning problem in Kabwe. For example, bone and tooth conditions would reflect the effects of long-term lead exposure better, and clinical conditions would reflect idiosyncratic variations in the sensitivity to lead intake. Analysing these alternative measures could be the topic for further research.