Mapping local variation in educational attainment across Africa

Educational attainment for women of reproductive age is linked to reduced child and maternal mortality, lower fertility and improved reproductive health. Comparable analyses of attainment exist only at the national level, potentially obscuring patterns in subnational inequality. Evidence suggests that wide disparities between urban and rural populations exist, raising questions about where the majority of progress towards the education targets of the Sustainable Development Goals is occurring in African countries. Here we explore within-country inequalities by predicting years of schooling across five by five kilometre grids, generating estimates of average educational attainment by age and sex at subnational levels. Despite marked progress in attainment from 2000 to 2015 across Africa, substantial differences persist between locations and sexes. These differences have widened in many countries, particularly across the Sahel. These high-resolution, comparable estimates improve the ability of decision-makers to plan the precisely targeted interventions that will be necessary to deliver progress during the era of the Sustainable Development Goals.

The United Nations Educational, Scientific and Cultural Organization (UNESCO) states that the ultimate mission of the education targets in Sustainable Development Goal (SDG) 4 is to "ensure inclusive and equitable quality education and promote lifelong learning opportunities for all" 1-3 . This is important, because it has been shown that increasing the number of years of schooling that are completed (educational attainment) can lead to higher capital, greater social mobility and increased equity between men and women in these and other socio-economic outcomes 1,2,4-8 . Educational attainment for women of reproductive age is also among the leading social determinants of health, with higher attainment being strongly associated with improved reproductive health and decreased child mortality 9-14 . The causal pathway between education and health is difficult to study, because randomized controlled trial methods are logistically challenging and ethically problematic. Observational studies controlling for other predictors of health status, such as age and income, however, indicate that even small gains in educational attainment may improve health outcomes across a wide variety of low-income contexts. Studies across diverse settings have found that increased education for women of reproductive age is associated with improved child nutrition and decreased child mortality, and this effect is consistently stronger than that of increases in income 15,16 . Importantly, a comprehensive multi-level study found that increases in average attainment in communities are associated with improved survival for infants born to all women in that community, regardless of their own educational attainment or income 17 . This is consistent with research on health behaviours, showing that less-educated women model health behaviours on those of their broader community 18 .
These improved health outcomes have also been shown through increased use of prenatal care, greater adherence to treatment regimens and increased contraception use 9,12,19,20 . Despite these clear benefits, international aid for basic education has been deprioritized as a proportion of total aid expenditure every year since 2010 21 .

Precision public health and education
SDG 4 focuses on the reduction of inequalities in education on the basis of factors such as wealth, sex and location 1,2,22 . In addition, UNESCO's agenda for reforming education access in developing countries is itself centred around equity 22,23 . Global health efforts have included substantial investments in the use of data to guide interventions that will benefit populations more efficiently and increase equity in outcomes, a strategy that has been termed precision public health 24 . The same paradigm should be extended to the social determinants of health that must be addressed for progress to be sustained. Therefore, although comparable indicators of educational attainment exist at the national level, it is increasingly important to measure subnational variation.
While past studies have assessed subnational variation in attainment for specific African countries 25,26 , to our knowledge no comprehensive and comparable set of estimates exists for the continent. Here we build a precisely geolocated database of 173 unique census and survey sources containing information on educational attainment (see Supplementary Figs 1-4 and Supplementary Table 2 for information on data type, coverage and source). We estimate the average number of years of attainment for women of reproductive age (15-49) on a 5 × 5-km grid across 51 countries in Africa from 2000 to 2015. We also estimate attainment for 20-24-year-old women to more closely identify changes over time. Finally, we construct equivalent models for men to examine differences between the sexes at the same local level. We use recently developed Bayesian spatiotemporal methods 27-29 for the analysis of this dataset, leveraging the high-resolution spatial and temporal information from these data. The estimates produced by these models enable comparisons of subnational regions. We focus on geographical inequality at the 5 × 5-km or local level to explore the subnational distribution of educational attainment, for the following reasons. First, data are increasingly geolocated to specific communities, and advances in Bayesian model-based geostatistics enable the modelling of these precise space-time covariance structures. Second, through the increasing availability of satellite imagery and other geospatial modelling endeavours, we have built a collection of covariates at the 5 × 5-km scale that are included in this predictive modelling framework. These are mostly available at only the community level, but allow us to predict outside of our data to estimate mean educational attainment and its uncertainty across all of Africa as a guide for policy formulation and intervention targeting. The utility of community-level and individual-level measurements is discussed in the Supplementary Discussion.

Persistent differences in educational attainment
We used various validation strategies to assess the fit of our models. Across Africa, we use out-of-sample cross-validation to demonstrate that our models have low root mean square errors, low absolute errors, well-calibrated coverages and high concordance with existing small-area estimates (see Supplementary Figs 12-28, Supplementary Tables 8-23).
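The validation metrics named above can be illustrated with a minimal sketch. It assumes that each held-out observation comes with a set of predictive draws; the function names, the toy data and the use of empirical 2.5th and 97.5th percentiles for the interval are illustrative assumptions, not the authors' code.

```python
import math
import random


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def validation_summary(observed, draws):
    """observed: held-out values; draws: one list of predictive draws per value."""
    errors, covered = [], 0
    for y, d in zip(observed, draws):
        d_sorted = sorted(d)
        pred = sum(d) / len(d)  # mean of the draws as the point prediction
        errors.append(pred - y)
        lower, upper = percentile(d_sorted, 0.025), percentile(d_sorted, 0.975)
        covered += lower <= y <= upper  # True counts as 1
    n = len(observed)
    return {
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
        "coverage_95": covered / n,
    }


# Toy hold-out set (hypothetical numbers): observations and draws share a latent mean.
rng = random.Random(0)
latent = [rng.uniform(0, 12) for _ in range(200)]
observed = [m + rng.gauss(0, 1) for m in latent]
draws = [[m + rng.gauss(0, 1) for _ in range(500)] for m in latent]
summary = validation_summary(observed, draws)
```

In this toy set-up the 95% intervals should cover roughly 95% of the held-out values, which is the sense in which coverage is called well calibrated.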
Estimates of mean years of educational attainment for men and women aged 15-49 and 20-24 are shown in Fig. 1a-d and Fig. 2a-d, respectively. These summaries show geographical disparities across Africa, with persistently low levels of attainment across the Sahel region, particularly in northern Nigeria, South Sudan and northern Kenya. In 2015, Ekiti state had the highest mean attainment in Nigeria among women of reproductive age, 11.3 years (95% uncertainty interval, 10.7-11.9), whereas many states in the northern region had averages below two years: Kebbi, 1.6 years (1.0-2.1); Yobe, 1.7 years (1.2-2.3); Sokoto, 1.5 years (1.0-2.1); and Zamfara, 1.6 years (1.1-2.2). For the same age range in Kenya, Nairobi province had the highest average attainment, 11.4 years (10.5-12.4), whereas the more rural North Eastern province had an average of 2.1 years (1.3-3.0). The lowest four regions across all of Africa had averages of less than 0.5 years, and all were rural regions in Chad: Daraba (0.5; 0.1-1.2), Kanem (0.4; 0.1-0.9), Barl El Gazal (0.4; 0.1-0.8) and Lac (0.4; 0.1-0.9). All outputs of these analyses at the national, first administrative subdivision (for example, state), second administrative
and 2015 (f). Maps reflect administrative boundaries, land cover, lakes and population; pixels with fewer than ten people per 1 × 1 km and classified as 'barren or sparsely vegetated' are coloured in grey 32,36-40 .
Marked changes were observed over time when focusing on the 20-24 age range (Fig. 2a-d), with particular improvement observed in urban centres between 2000 and 2015 in Nigeria, Kenya, Ghana, Sudan and South Africa. Several populous urban states in Nigeria showed significant gains in average attainment for women since 2000, such as Abuja state, where attainment increased from 6.0 (4.7-7.2) to 9.7 years (9.0-10.5). In Ghana, the most highly educated urban regions in the southern part of the country demonstrated moderate increases in average attainment for women aged 20-24, such as Ashanti region, where attainment improved from 7.4 (6.9-7.9) to 9.9 years (9.5-10.4). Additionally, Ghana stands out in Western Africa for its improvements in more rural regions; for example, in the Northern region attainment improved from 1.8 (1.4-2.2) to 5.2 years (4.8-5.7) since 2000.

Implications for international goals
An explicit goal of SDG 4 is to eliminate sex-associated disparities across all levels of education by 2030 30 . We illustrate the gap in mean years of attainment between men and women for both age ranges (Figs 1e, f and 2e, f). Average attainment for men was significantly higher across the Sahel and Central Africa, particularly in the northern regions of Nigeria and Kenya that had very low levels of education in women of reproductive age (see Fig. 3). Here we use 'significantly' to refer to areas where 95% of the difference between Bayesian posterior predictive distributions was above zero (see Supplementary Information). These regions showed even stronger differences in the 20-24 age range, for which in some regions attainment in males was more than four years higher than in females (see Extended Data Fig. 1). Across states in 2015, we observed the largest difference in attainment by sex in the Kabia state of Chad, where men had achieved 5.8 more years (4.0-7.8) than women. In terms of statistical significance, 64 out of 77 states in Benin (representing 86% of the national population) had higher levels of attainment in males than females. The same was true for
all districts within Sierra Leone, Guinea, Guinea-Bissau and Togo. By contrast, average attainment trended towards higher levels for women across much of southern Africa in 2015; however, this difference was never significant. We observed no significant differences by sex for any district within South Africa, Botswana, Zimbabwe, Rwanda and others. We further examined these trends in educational opportunity by applying a threshold for attainment. UNESCO defines basic education as completing the first nine years of formal schooling, including primary education (1-6 years of schooling) and lower secondary (7-9 years of schooling) 31 . The mean of 1,000 realizations of our full model is shown in Figs 1, 2. The Bayesian modelling framework that we used enables probabilistic inferences to be made about the likelihood that such targets have been met, on the basis of the confidence of the predictions (see Supplementary Information). In Figure 4, we illustrate the probability of average attainment being above six years in 2015 for women of reproductive age, or the equivalent of completing primary education (see Extended Data Fig. 2 for women aged 20-24). Despite SDG 4 not containing specific targets on years of attainment, this threshold was selected to highlight how substantial work remains in order to achieve even basic levels of education in many subnational regions within Africa.
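The two draw-based summaries used in this section, the probability that mean attainment exceeds a threshold and the 95% rule for calling a difference between the sexes significant, can be sketched as follows. This is an illustration under stated assumptions: the function names and toy draws are hypothetical, and the draws for men and women are assumed to be comparable draw by draw.

```python
import random


def prob_above(draws, threshold):
    """Share of posterior draws that lie above a threshold."""
    return sum(d > threshold for d in draws) / len(draws)


def significant_difference(draws_a, draws_b, level=0.95):
    """True when at least `level` of the draw-wise differences (a - b) exceed zero."""
    diffs = [a - b for a, b in zip(draws_a, draws_b)]
    return prob_above(diffs, 0.0) >= level


# Toy draws for one location (hypothetical numbers, 1,000 draws per sex).
rng = random.Random(1)
women = [rng.gauss(4.0, 0.5) for _ in range(1000)]
men = [rng.gauss(6.5, 0.5) for _ in range(1000)]

p_primary_women = prob_above(women, 6.0)          # probability of exceeding six years
men_higher = significant_difference(men, women)   # 95% rule for the sex gap
```

A location with `p_primary_women` below 0.05 would fall into the group described as very unlikely to have reached primary completion on average.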
We use high-resolution population data to aggregate these probabilities to different administrative levels for increased use in policy development and targeted intervention strategies, as well as to demonstrate the value of geospatial estimation for showing disparities within countries 32 . For instance, at the national level, the average woman of reproductive age in Nigeria has completed primary school in 2015. At lower geographical levels, however, these probabilities ranged from almost 0 to 100% of the population depending on the district or grid cells within the district (Fig. 3). Across Africa, many areas had averages that we could reasonably conclude were less than primary school completion (less than 5% probability of being greater than six years), but others were less certain. These regions may be less certain because our estimates were very close to six years, or because our estimates had wide uncertainty intervals (see Supplementary Information). Using the precision public health paradigm, these results have important implications for investment in education. Areas that were very unlikely (less than 5%) to be achieving primary school completion in 2015 should have investment aimed at improved access to basic education (examples of such measures are discussed in the Supplementary Discussion). Many areas with higher uncertainties probably not only have very low averages, but also require increased data collection efforts. This echoes the call in precision public health to invest in quality data at the local level to target interventions most equitably and efficiently 24 .
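One way to carry pixel-level uncertainty up to an administrative unit is to take a population-weighted mean across pixels separately for every draw, and only then summarize across draws. The sketch below assumes this draw-wise approach; the function name and the two-pixel example are hypothetical and are not taken from the authors' code.

```python
def aggregate_draws(pixel_draws, population):
    """Population-weighted mean across pixels, computed separately for each draw.

    pixel_draws: one list of draws per pixel (all lists the same length).
    population: one population count per pixel.
    """
    total = sum(population)
    n_draws = len(pixel_draws[0])
    return [
        sum(p[k] * w for p, w in zip(pixel_draws, population)) / total
        for k in range(n_draws)
    ]


# Two pixels in one hypothetical district, two draws each.
district_draws = aggregate_draws([[2.0, 3.0], [8.0, 9.0]], [100, 300])
# Share of district-level draws above six years of attainment.
p_above_six = sum(d > 6.0 for d in district_draws) / len(district_draws)
```

Because the more populous pixel has higher attainment, the district-level draws sit closer to its values than a simple average of the two pixels would.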

Discussion, limitations and future work
This study represents a notable application of Bayesian geostatistical methods in a comprehensive, geolocated dataset to model educational attainment with refined spatial and temporal resolution. Our estimates show that although attainment has generally improved for women of reproductive age in Africa since 2000, these gains have now stagnated in many subnational regions. We also demonstrate that in 2015, gaps remain in attainment between the sexes in many areas across Africa; these gaps were relatively stable over time. These findings suggest that both men and women are experiencing progress in educational attainment, but the achievement of greater equity by sex remains out of reach for much of Africa.
Geographical inequality is only one form of inequality that can be used to investigate disparities below the national level. While our framework allows us to explore geographical differences at a refined spatial level, there are many other dimensions that contribute to observed population inequities, such as social stratification by race, ethnicity or wealth (see Supplementary Discussion for limitations). Although further work is needed to explore additional forms of inequality, this predictive analysis has immediate relevance for policy development. First, our analysis maps a human capital indicator across Africa that is particularly relevant for the evolving global development agenda 33 . Second, and even more importantly, we are specifically considering educational attainment in women of reproductive age (and gender disparities in education) as a critical social determinant of maternal and thus child health 9-14 .
Given the intersection between educational attainment for women of reproductive age and maternal and child health targets 34,35 , these results have important implications for targeted investment to improve entrenched geographical and sex disparities. Communities with low education levels for women may be more likely to fail in public health interventions aimed at increasing prenatal care utilization, treatment adherence or contraception use 9,12,19,20 . Targeting precision health interventions without considering the landscape of human capital indexed by educational attainment poses sustainability risks, such as unrealistic assumptions about care-seeking behaviour and retention. In addition to the implications for health intervention, the global health agenda must also consider education and improved attainment as a goal itself in building sustainable, healthy populations.
Clearly the ultimate goal of SDG 4 extends beyond attainment to the quality of education. Nevertheless, as the global policy dialogue shifts to focusing on learning outcomes (see Supplementary Discussion), our results directly identify where gaps in basic education persist. These results can be used to improve accountability in need-based investment strategies from the national to local level. For communities which we have identified as having very low attainment, localized information can help to elucidate the drivers of low attendance and inform effective investment strategies.
Improving educational attainment among women of reproductive age has cross-cutting benefits for the SDG targets related to maternal and child health. This approach demonstrates the benefits of leveraging spatial information for modelling of human capital indicators in which data are correlated across space and time. This study emphasizes how documenting national-level trends in attainment masks pronounced variation across subnational areas. Despite progress, these findings suggest that large areas in sub-Saharan Africa still lag in meeting basic education targets, especially for women. In order to deliver on the promise of inclusive and equitable education for all 3 , it is critical for investments in education to be informed by locally relevant information so that no community is left behind.
Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper.

UNESCO. UNESCO Operational Definition of Basic Education: Thematic Framework (UNESCO, 2007).

Methods
Overview. Our study follows the Guidelines for Accurate and Transparent Health Estimates Reporting (GATHER). Using a Bayesian model-based geostatistical framework and synthesizing geolocated data from 173 household and census datasets, this analysis provides 5 × 5-km estimates of mean years of education for women of reproductive age (15-49), women aged 20-24, and equivalent male age bins from 2000 to 2015 in Africa. This includes 48 countries in mainland Africa, as well as islands for which we had survey data, including Madagascar, Comoros, and São Tomé and Príncipe. We did not estimate for Mauritius, Seychelles or Cape Verde, as no available survey data could be sourced. Analytical steps are described below and additional detail can be found in the Supplementary Information.
Data. We compiled a database of 173 survey and census datasets in Africa that contained geocoding of subnational administrative boundaries or precise coordinates for sampled clusters. These included datasets from the Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS) and Integrated Public Use Microdata Series (IPUMS) 41-43 (see Supplementary Table 2). We extracted demographic, education and sample design variables. The coding of educational attainment varies across survey families. In many surveys, respondents can indicate their level of attainment on a continuous year scale. In others, respondents may only have several aggregate categories such as 'Secondary completion', 'Primary completion' or 'less than primary'. When all that is known is that an individual completed a particular level of education, but it is not known if they continued onto the next level, a theoretical level of completion must be assigned to the individual in order to estimate summary statistics for the population such as mean years of educational attainment.
For example, if the option 'Primary completion' (6 years) is followed by 'Secondary completion' (12 years), it can be assumed that an individual who only selects the former has attained between 6 and 12 years of education. In previous literature examining trends in mean years of education, the assumption is made that all of these individuals have 6 years, or sometimes the midpoint of the feasible range (9) 44,45 . Trends in the single-year data demonstrate that this assumption introduces compositional bias in the estimation of attainment trends over time and space, as differences in true drop-out patterns or binning schema could lead to biased mean estimates.
For this analysis, we used a recently developed method that selects a training subset of similar surveys across time and space to estimate the true single-year distribution of binned datasets (J.F., N.G. & E.G., manuscript in preparation). This algorithmic approach markedly reduces bias in summary statistics estimated from datasets with binned coding schemes. The years in all coding schemes were mapped to the country- and year-specific references in the UNESCO International Standard Classification of Education (ISCED) for comparability 46 . We used a top coding of 18 years on all data; this is a common threshold in many surveys that have a cap and it is reasonable to assume that the importance of education for health outcomes (and other related SDGs) greatly decreases after what is the equivalent of 2 to 3 years of graduate education in most systems.
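The effect of the binning assumption can be made concrete with a small sketch. It compares the lower-bound assumption with a simple redistribution of each bin over single years in proportion to a reference distribution from a comparable survey. This is a generic stand-in for illustration only: the authors' algorithm is unpublished, and the function names, bins, counts and reference distribution below are all hypothetical.

```python
def mean_lower_bound(bin_counts, bins):
    """Assign every respondent the lowest year in their bin."""
    n = sum(bin_counts)
    return sum(c * lo for c, (lo, hi) in zip(bin_counts, bins)) / n


def mean_redistributed(bin_counts, bins, reference):
    """Spread each bin's respondents over single years in proportion to `reference`.

    reference: dict mapping single years of schooling to counts (or shares)
    from a comparable survey that recorded single years.
    """
    n = sum(bin_counts)
    total_years = 0.0
    for count, (lo, hi) in zip(bin_counts, bins):
        years = range(lo, hi + 1)
        weights = [reference.get(y, 0.0) for y in years]
        wsum = sum(weights)
        if wsum == 0:  # no reference information: fall back to the lower bound
            total_years += count * lo
            continue
        total_years += count * sum(y * w for y, w in zip(years, weights)) / wsum
    return total_years / n


# Hypothetical binned survey, top-coded at 18 years.
bins = [(0, 5), (6, 11), (12, 18)]   # less than primary, primary, secondary
counts = [50, 30, 20]
reference = {0: 30, 3: 10, 6: 20, 8: 10, 9: 10, 12: 15, 16: 5}

naive = mean_lower_bound(counts, bins)
adjusted = mean_redistributed(counts, bins, reference)
```

In this toy case the lower-bound assumption gives a mean of 4.2 years and the redistributed estimate gives 5.15 years, which shows how the choice of assumption alone can shift the summary statistic.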
Data were aggregated to mean years for women of reproductive age (15-49) to measure progress towards the SDG 4 target 2 . A subset of the data for a smaller age range of women aged 20-24 was also examined to track temporal shifts as well as the effects of large educational initiatives in Africa since 2000. Equivalent age bins were aggregated for males in order to examine differences in mean years of attainment by sex. Where precise coordinates were available, data were aggregated to mean years at a specific latitude and longitude assuming a simple random sample, as the cluster is the primary sampling unit for the stratified design of all DHS and MICS surveys. Where geographical information was available only at the level of administrative units, data were aggregated according to their sample design. For aggregation to administrative units for which the survey was not sampled to be representative, design effects were re-estimated using a package for analysing complex survey data in R 47 .
Spatial covariates. In order to leverage strength from locations with observations to the entire spatiotemporal domain, we compiled several 5 × 5-km raster layers of possible socio-economic and environmental correlates of education in Africa (see Supplementary Table 3 and Supplementary Fig. 5). Acquisition of temporally dynamic datasets, where possible, was prioritized in order to best match our observations and thus predict the changing dynamics of educational attainment. Of the 29 covariates included, 23 were temporally dynamic. The remaining six covariate layers were temporally static, and were applied uniformly across all modelling years. More information, including plots of all covariates, can be found in the Supplementary Information.
Our primary goal is to provide educational attainment predictions across the African continent at a high resolution, and we have used methods that provide the best out-of-sample predictive performance at the expense of inferential understanding. In order to select covariates and capture possible nonlinear effects and complex interactions between them, an ensemble covariate modelling method was implemented 48 . For each region, three sub-models were fit to our dataset using all of our covariate data as explanatory predictors: generalized additive models, boosted regression trees and lasso regression. Each sub-model was fit using fivefold cross-validation to avoid overfitting, and the out-of-sample predictions from across the five holdouts were compiled into a single comprehensive set of predictions from that model. Additionally, the same sub-models were run using 100% of the data and a full set of in-sample predictions was created. The five sets of out-of-sample sub-model predictions were fed into the full geostatistical model as the explanatory covariates when performing the model fit. The in-sample predictions from the sub-models were used as the covariates when generating predictions using the fitted full geostatistical model. This methodology maximizes out-of-sample predictive performance at the expense of no longer being able to provide statistical inferences on causality. A recent study has shown that this ensemble approach can improve predictive validity by up to 25% over an individual model 48 . More details on this approach can be found in the Supplementary Information.
Analysis. Geostatistical model.
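The ensemble covariate step, in which each sub-model contributes fivefold out-of-sample predictions for fitting and full-data in-sample predictions for prediction, can be sketched as below. To stay self-contained the sketch swaps in three simple stand-in learners (a least-squares line, a nearest-neighbour mean and an intercept-only model) for the generalized additive models, boosted regression trees and lasso regression used in the study; all names and the toy data are hypothetical.

```python
import random


def fit_linear(xs, ys):
    """Least-squares line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=5):
    """k-nearest-neighbour mean; returns a prediction function."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def fit_mean(xs, ys):
    """Intercept-only model; returns a prediction function."""
    m = sum(ys) / len(ys)
    return lambda x: m


# Stand-ins only: the study used GAM, BRT and lasso sub-models.
SUB_MODELS = {"linear": fit_linear, "knn": fit_knn, "mean": fit_mean}


def stack_predictions(xs, ys, n_folds=5, seed=0):
    """Return (out_of_sample, in_sample) prediction columns for each sub-model."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    oos = {name: [None] * len(xs) for name in SUB_MODELS}
    ins = {}
    for name, fit in SUB_MODELS.items():
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            for i in held_out:
                oos[name][i] = model(xs[i])  # predicted without seeing point i
        full = fit(xs, ys)  # refit on 100% of the data
        ins[name] = [full(x) for x in xs]
    return oos, ins


# Toy data: one covariate and a noisy linear response.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2 + 0.8 * x + rng.gauss(0, 0.5) for x in xs]
oos, ins = stack_predictions(xs, ys)
```

The `oos` columns play the role of covariates when the second-stage model is fit, and the `ins` columns play that role when predictions are generated, which mirrors the division of labour described in the text.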
Gaussian data are modelled within a Bayesian hierarchical modelling framework using a spatially and temporally explicit hierarchical generalized linear regression model to fit mean years of education attainment in five regions in Africa as defined in the Global Burden of Diseases, Injuries, and Risk Factors (GBD) study 49 ('Northern', 'Western', 'Southern', 'Central' and 'Eastern'; see Extended Data Fig. 3). The GBD study design sought to create regions on the basis of two primary criteria: epidemiological homogeneity and geographical contiguity 49 . For each GBD region, we approximated the posterior distribution of our Bayesian model. We model the mean years of attainment at cluster i as Gaussian data given precision τ and a fixed scaling parameter s_i. We use the sample size in each cluster as our scaling parameter. We have suppressed the notation, but the means (edu_i), scaling parameters (s_i), predictions from the three sub-models (X_i), and residual terms (ε_*) are all indexed at a space-time coordinate. The means (edu_i) represent an individual's expected educational attainment given that they live at that particular location. Mean attainment was modelled as a linear combination of the three sub-models (GAM, BRT and lasso), X_i, a correlated spatiotemporal error term, ε_i^GP, and an independent nugget effect, ε_i. Coefficients, β, on the sub-models represent their respective predictive weighting on the mean, whereas the joint error term, ε^GP, accounts for residual spatiotemporal autocorrelation between individual data points that remains after accounting for the predictive effect of the sub-model covariates, and the nugget, ε_i, is an independent error term. The residuals, ε^GP, are modelled as three-dimensional Gaussian processes in space-time centred at zero and with a covariance matrix constructed from a Kronecker product of spatial and temporal covariance kernels.
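Based on the definitions in this paragraph, the model can be written compactly as follows. This is a reconstruction for orientation rather than the authors' displayed equation: the symbol y_i for the observed cluster mean and the variance parameterization 1/(s_i τ) are assumptions, and Σ_space and Σ_time denote the spatial and temporal covariance kernels.

```latex
\begin{aligned}
y_i \mid \mathrm{edu}_i, s_i, \tau &\sim \mathrm{Normal}\!\left(\mathrm{edu}_i,\ (s_i \tau)^{-1}\right) \\
\mathrm{edu}_i &= X_i \beta + \varepsilon^{GP}_i + \varepsilon_i \\
\varepsilon^{GP} &\sim \mathrm{GP}\!\left(0,\ \Sigma_{\mathrm{space}} \otimes \Sigma_{\mathrm{time}}\right)
\end{aligned}
```

Under this parameterization, clusters with larger sample sizes s_i have smaller observation variance and therefore carry more weight in the fit.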
The spatial covariance, Σ_space, is modelled using an isotropic and stationary Matérn function 50 , and temporal covariance, Σ_time, as an annual autoregressive (AR1) function over the 16 years represented in the model. This approach leveraged the data's residual correlation structure to more accurately predict attainment estimates for locations with no data, while also propagating the dependence in the data through to uncertainty estimates 51 . The posterior distributions were fit using computationally efficient and accurate approximations in R-INLA (integrated nested Laplace approximation) with the stochastic partial differential equations approximation to the Gaussian process residuals 52 . Pixel-level uncertainty intervals were generated from 1,000 draws (that is, statistically plausible candidate maps) 53 created from the posterior-estimated distributions of modelled parameters.
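The separable space-time covariance described above can be illustrated by building a small dense matrix. This is a sketch under stated assumptions, not the fitting procedure: the study uses the stochastic partial differential equations approximation in R-INLA rather than dense matrices, the smoothness of 3/2 is chosen here only because it has a closed form, and the coordinates, length scale and autocorrelation value are hypothetical.

```python
import math


def matern32(d, sigma2=1.0, length_scale=1.0):
    """Matérn covariance with smoothness 3/2 (closed form); d is a distance."""
    a = math.sqrt(3.0) * d / length_scale
    return sigma2 * (1.0 + a) * math.exp(-a)


def spatial_cov(coords, **kw):
    """Isotropic, stationary covariance: depends only on Euclidean distance."""
    return [[matern32(math.dist(p, q), **kw) for q in coords] for p in coords]


def ar1_cov(n_years, rho):
    """AR1 correlation between years j and k: rho ** |j - k|."""
    return [[rho ** abs(j - k) for k in range(n_years)] for j in range(n_years)]


def kronecker(a, b):
    """Kronecker product of two square matrices stored as nested lists."""
    nb = len(b)
    size = len(a) * nb
    return [
        [a[i // nb][j // nb] * b[i % nb][j % nb] for j in range(size)]
        for i in range(size)
    ]


# Three hypothetical locations (km) and the 16 modelled years.
coords = [(0.0, 0.0), (5.0, 0.0), (0.0, 10.0)]
sigma_space = spatial_cov(coords, length_scale=20.0)
sigma_time = ar1_cov(16, rho=0.9)
sigma = kronecker(sigma_space, sigma_time)  # row index = location * 16 + year
```

With this ordering, the entry for the same location in adjacent years equals the AR1 correlation, and the entry for two locations in the same year equals their Matérn covariance.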
To transform pixel-level estimates into a range of information useful to a wide constituency of potential users, these estimates were aggregated from the 1,000 candidate maps up to district, provincial and national levels using


Software
The models were all fit using R version 3.3.2. The main statistical space-time Gaussian process regression models were fit using R-INLA version 0.0-1440400394.

Materials and reagents
No unique materials were used.
