Main

Education, as a social determinant of health, is closely linked to several facets of the Sustainable Development Goals (SDGs) of the United Nations2. In addition to the explicit focus of SDG 4 on educational attainment, improved gender equality (SDG 5) and maternal, newborn, and child health (SDG 3) have well-documented associations with increased schooling15,16,17. In 2016, after years of deprioritization, aid to education reached its highest level since 200218. Despite this shift, only 22% of aid to basic education—defined as primary and lower-secondary—went to low-income countries in 2016 compared to 36% in 200219. This reflects a persistent pattern in which the distribution of aid does not align with the greatest need, even at the national level. Beyond international aid, domestic policy is also a crucial tool for expanding access to education, especially at higher levels. However, policy-makers often do not have access to a rigorous evidence base at a subnational level. This analysis presents the subnational distribution of education to support the growing evidence base of precision public health data, which shows widespread disparity of health outcomes as well as their social determinants.

Mapping education across gender

Despite widespread improvement in educational attainment since 2000, gender disparity persists in 2017 in many regions. Figure 1 illustrates the mean number of years of education and the proportion of individuals with no primary school attainment for men and women of reproductive age (15–49 years) in 2017. The average educational attainment is very low across much of the Sahel region of sub-Saharan Africa, consistent with previously published data14. In 2017, there was a large gender disparity in many regions, with men attaining higher average education across central and western sub-Saharan Africa and South Asia. Considerable variation remains between the highest- and lowest-performing administrative units within countries in 2017. For Uganda in 2017, this indicator ranged from 1.9 years of education (95% uncertainty interval, 0.8–3.0 years) in rural Kotido to 11.1 years (10.1–12 years) in Kampala, the capital city. Figure 1b, d displays the proportion of men and women aged 15–49 years who have not completed primary school. By considering the variation within populations in different locations, these maps help to identify areas with large populations in the vulnerable lower end of the attainment distribution. We estimated large improvements in the proportions of individuals who have completed primary school in Mexico and China. However, across much of the world women in this age group failed to complete primary school at a much higher rate than their male counterparts.

Fig. 1: Average educational attainment and proportion of individuals with no completed primary education at the first administrative level and absolute difference between women and men aged 15–49 years.
figure 1

a–d, Mean educational attainment for women (a) and men (c) and the proportion of individuals with no primary school education for women (b) and men (d) aged 15–49 years in 2017. Maps were produced using ArcGIS Desktop 10.6.

Despite continued lack of gender parity in education among the reproductive age group, vast progress towards parity has been made among the 20–24 age group. Extended Data Fig. 2 further examines gender parity in 2000 and 2017. This figure highlights two additional advantages of our analytic framework. First, we examined a younger group aged 20–24 years. Although education in this group is less directly relevant to maternal, newborn, and child health than education in the full window of reproductive age, these estimates allowed us to capture how the landscape of education has shifted over time (that is, across successive cohorts) and is therefore more likely to pick up improvements to access and retention in education systems that have been made since 2000. Second, we illustrate the probability that this estimated ratio is credibly different from 1 (parity between sexes) given the full uncertainty in our data and model. In 2000, we estimated that men completed schooling at a higher rate than women across much of the world, particularly for primary school education (that is, the probability that the parity ratio is greater than 1 was over 95%). This was true in most countries for both primary and secondary completion rates, but especially so in Burundi, Angola, Uganda, and Afghanistan (Extended Data Fig. 2a, c). By 2017, many countries moved significantly towards parity in both secondary and primary completion rates with the exception of large regions within central and western sub-Saharan Africa (Extended Data Fig. 2b, d).

Inequalities within and between countries

The subnational estimates of attainment presented here enable a closer examination of within-country inequality and associated trends over time. Figure 2 plots the national change in secondary attainment rates for women aged 20–24 years with the index of dissimilarity across second administrative-level units in 2017. The index of dissimilarity is an intuitive measure of geographical inequality that can be interpreted as the percentage of women with secondary attainment that would have to move in order to equalize secondary rates across all subnational districts. We estimated that countries that experienced more national progress over the period tended to be more spatially equal in 2017. However, the top-right quadrant of the graph highlights several countries that experienced substantial national progress yet remain some of the most geographically unequal countries today.

Fig. 2: National progress in secondary attainment rates for women aged 20–24 years compared with the national index of dissimilarity in 2017.
figure 2

a, Change in secondary attainment rates for women age 20–24 years between 2000 and 2017 compared with the national index of dissimilarity in 2017 (simple linear regression lines are included). b, Map of the national index of dissimilarity in 2017. Maps were produced using ArcGIS Desktop 10.6.

We further examined national progress between 2000 and 2017 in two such countries, India and Nigeria, where rates of secondary attainment increased from 10.9% (8.5–12.5%) to 37.2% (33.6–41.1%) and from 11.5% (6.2–18.3%) to 45.0% (37.0–52.5%), respectively (Fig. 3). The geographical distribution between two cohorts—women aged 20–24 years in 2000 and 2017—was analysed by examining all proportions simultaneously (Fig. 3a, b). We estimate that there has been a massive shift towards primary and secondary completion coupled with greater geographical variability in completion rates (that is, spread of the dots that represent subnational units in the legend). The majority of the 2017 cohort living in the northwest and northeast of India never completed secondary school. Urban centres in the south, such as Bangalore and Mumbai, have seen considerable progress compared with more rural regions. In Nigeria, we estimate substantial national improvement; however, the country remained one of the most spatially unequal in 2017 (Fig. 3d, e). The more-urban south, particularly around Lagos, experienced much faster progress than the more-rural north. The implications of the population distribution were explored by decomposing the improvement in the national rate of secondary completion since 2000 for each country into the additive contributions of rate changes at the second administrative level (Fig. 3c, f). This demonstrates that national progress was largely driven by improvements in populous urban regions (particularly Maharashtra, India, and Lagos, Nigeria), underscoring the importance of how subnational progress (or lack thereof) contributes differentially to narratives surrounding national change.

Fig. 3: Attainment rates and contributions to national change in secondary rates for women aged 20–24 years in India and Nigeria, 2000–2017.
figure 3

a, b, Attainment rates for women aged 20–24 years in 2000 (a) and 2017 (b) at the second administrative level in India. c, Additive contributions of changes in the attainment rates at the second administrative level to change in the rate at the national level between 2000 and 2017 in India. d, e, Attainment rates for women aged 20–24 years in 2000 and 2017 at the second administrative level in Nigeria. f, Additive contributions of changes in the attainment rates at the second administrative level to change in the rate at the national level between 2000 and 2017 in Nigeria. On all ternary maps, the ‘Zero’ category includes all individuals with either no schooling or some primary schooling without completion. Maps were produced using ArcGIS Desktop 10.6.

Discussion and limitations

We have built on previous modelling efforts that focused on the geographical distribution of average education14 by extending our estimation to the distribution of attainment, highlighting not only average attainment but also the proportions of individuals who completed key levels of schooling that are central to policy efforts. As we demonstrate, throughout much of the world women lag behind their male counterparts, and there is significant heterogeneity across subnational regions. Countries such as South Africa, Peru, and Colombia have seen tremendous improvement since 2000 in the proportion of the young adult population who have completed secondary school. As this trend continues, it will be important to focus not only on attainment but also on quality of education. However, many young women across the world still faced obstacles to attaining even a basic level of education in 2017 (Extended Data Fig. 3). This represents a missed opportunity for the global health community to focus on a well-studied determinant of maternal, newborn, and child health. Even with only marginal returns to health in the short term, studies suggest that, on average, communities will also see increased human capital, social mobility, and less engagement in child marriage or early childbearing20,21.

Children and adolescents do not complete formal schooling for many reasons. Many factors differentially affect girls, such as cost, late or no school enrolment, forced withdrawal of married adolescents, and the social influence of family members concerning the traditional roles of girls and women4,20,22,23. A critical step is acknowledging that commercialization in the area of education typically leads to higher inequity24. Treating public education as a societal good by increasing access, particularly in underserved rural communities, reduces inequality. Identifying areas that are stagnating or worsening, particularly in the realm of basic education for young women across the world, is an important first step to targeted, long-term reform efforts that will ultimately have widespread benefits for equity in health and development.

Many recent international calls to improve the social determinants of health have stated that measurement of inequity within countries is critical to understanding and tracking the problem, noting that geography is an increasingly important dimension of inequity24,25,26. Where people are born greatly determines their life chances, and continuing to consider development and human capital formation on a national level is insufficient24. The goal of this analysis is to identify local areas that may have experienced negligible improvements, but further rigorous research is required to contextualize these patterns within the unique mix of structural obstacles that each community faces. There are many indirect costs for attending school and each disadvantaged area that we identify in our analysis may experience them in different ways. These include the demand for children to work, the opportunity or monetary costs of attending school, distance to school, lack of compulsory education requirements, high fees for attendance, political instability, and many other forces. Overcoming these obstacles to improve educational attainment alone will not necessarily result in a more-educated and healthy population for each country as highly educated individuals may be more likely to emigrate, resulting in ‘brain drain’. This is especially true for countries that have been economically crippled over the past two decades and may lack the economic capacity to absorb a more highly educated labour force. Opening access to education will need to be coupled with economic reforms, both internationally and domestically, if countries are to fully experience dividends in human capital and health.

Over the next decade of the SDG agenda, it will be important to maintain the progress that has been made to reprioritise investment in education systems. There remains an alarming lack of distributional accountability in aid, especially to basic education, for which most funding is not going to the countries that need it most19. Connections between educational attainment and health offer promising opportunities for co-financing initiatives. For example, USAID recently invested US$90 million in HIV funding to the construction of secondary schools in sub-Saharan Africa. Global health leaders have noted the need to invest in precise data systems and eliminate data gaps to effectively target resources, develop equitable policy, and track accountability7. Our analysis provides a robust evidence base for such decision-making and advocacy. Decades of research on the effect of basic education on maternal, newborn, and child health positions this issue squarely in the purview of the global health agenda. It is crucial for the global health community to invest in long-term, sustainable improvement in the underlying distribution of human capital, as this is the only way to truly influence health equity across generations.

Methods

Overview

Using a Bayesian model-based geostatistical framework and synthesizing geolocated data from 528 household and census datasets, this analysis provides subnational estimates of mean numbers years of education and the proportion of the population who attained key levels of education for women of reproductive age (15–49 years), women aged 20–24 years, and equivalent male age bins between 2000 and 2017 in 105 countries across all low- and middle-income countries (LMICs). Countries were selected for inclusion in this analysis using the socio-demographic index (SDI) published in the Global Burden of Disease (GBD) study27. The SDI is a measure of development that combines education, fertility, and poverty. Countries in the middle, lower-middle, or low SDI quintiles were included, with several exceptions. Albania, Bosnia, and Moldova were excluded despite middle SDI status due to geographical discontinuity with other included countries and lack of available survey data. Libya, Malaysia, Panama, and Turkmenistan were included despite higher-middle SDI status to create better geographical continuity. We did not analyse American Samoa, Federated States of Micronesia, Fiji, Kiribati, Marshall Islands, Samoa, Solomon Islands, or Tonga, where no available survey data could be sourced. Analytical steps are described below, and additional details can be found in the Supplementary Information.

Data

We compiled a database of survey and census datasets that contained geocoding of subnational administrative boundaries or GPS coordinates for sampled clusters. These included datasets from 528 sources (see Supplementary Table 2). These sources comprised at least one data source for all but two countries on our list of LMICs: Western Sahara and French Guiana. We chose to exclude these two countries from our analysis; 42 of 105 included countries have only subnational administrative level data. We extracted demographic, education, and sample design variables. The coding of educational attainment varies across survey families. In some surveys, the precise number of years of attainment is not provided, with attainment instead aggregated into categories such as ‘primary completion’ or ‘secondary completion’. In such cases, individuals who report ‘primary completion’ may have gone on to complete some portion of secondary education, but these additional years of education are not captured in the underlying dataset. Previous efforts to examine trends in mean years of education have either assumed that no additional years of education were completed (that is, primary education only) or have used the midpoint between primary and secondary education as a proxy28. Trends in the single-year data, however, demonstrate that such assumptions introduce bias in the estimation of attainment trends over time and space, as differences in actual drop-out patterns or binning schema can lead to biased mean estimates29.

For this analysis, we used a recently developed method that selects a training subset of similar surveys across time and space to estimate the unobserved single-year distribution of binned datasets29. In comprehensive tests of cross-validation that leveraged data for which the single-year distributions are observed, this algorithmic approach significantly reduces bias in summary statistics estimated from datasets with binned coding schemes compared to alternatives such as the standard-duration method28. The years in all coding schemes were mapped to the country- and year-specific references in the UNESCO International Standard Classification of Education (ISCED) for comparability30. We used a top coding of 18 years on all data; this is a common threshold in many surveys that have a cap and it is reasonable to assume that the importance of education for health outcomes (and other related SDGs) greatly diminishes after what is the equivalent of 2 to 3 years of graduate education in most systems.

Data were aggregated to mean years of education attained and the proportions achieving key levels of education. The levels chosen were proportion with zero years, proportion with less than primary school (1–5 years of education), proportion with at least primary school (6–11 years of education), and proportion achieving secondary school or higher (12 or more years of education). A subset of the data for a smaller age bin (20–24 years) was also examined to more closely track temporal shifts. Equivalent age bins were aggregated for both women and men to examine disparities in mean years of attainment by sex. Where GPS coordinates were available, data were aggregated to a specific latitude and longitude assuming a simple-random sample, as the cluster is the primary sampling unit for the stratified design survey families, such as the Demographic and Health Survey (DHS) and Multiple Indicator Cluster Survey (MICS). Where only geographical information was available at the level of administrative units, data were aggregated with appropriate weighting according to their sample design. Design effects were estimated using a package for analysing complex survey data in R31.

Spatial covariates

To leverage strength from locations with observations to the entire spatiotemporal domain, we compiled several 5 × 5-km2 raster layers of possible socioeconomic and environmental correlates of education (Supplementary Table 5 and Supplementary Fig. 6). Acquisition of temporally dynamic datasets, where possible, was prioritized to best match our observations and thus predict the changing dynamics of educational attainment. We included nine covariates indexed at the 5 × 5-km2 level: access to roads, nighttime lightstv, populationtv, growing season, ariditytv, elevation, urbanicitytv, irrigation, and yeartv (tv, time-varying covariates). More details, including plots of all covariates, can be found in the Supplementary Information.

Our primary goal is to provide educational attainment predictions across LMICs at a high (local) resolution, and our methods provide the best out-of-sample predictive performance at the expense of inferential understanding. To select covariates and capture possible nonlinear effects and complex interactions between them, an ensemble covariate modelling method was implemented32. For each region, three submodels were fitted to our outcomes using all of our covariate data: generalized additive models, boosted regression trees, and lasso regression. Each submodel was fit using fivefold cross-validation to avoid overfitting and the out-of-sample predictions from across the five folds were compiled into a single comprehensive set of predictions from that model. Additionally, the same submodels were also run using 100% of the data and a full set of in-sample predictions were created. The five sets of out-of-sample submodel predictions were fed into the full geostatistical model as predictors when performing the model fit. The in-sample predictions from the submodels were used as the covariates when generating predictions using the fitted full geostatistical model. This methodology maximizes out-of-sample predictive performance at the expense of the ability to provide statistical inference on the relationships between the predictors and the outcome. A recent study has shown that this ensemble approach can improve predictive validity by up to 25% over an individual model32. More details on this approach can be found in the Supplementary Information.

The primary goal of using the stacking procedure in our analyses was to maximize the predictive power of the raster covariates by capturing the nonlinear effects and complex interactions between covariates to optimize the model performance. It has previously been suggested32 that the primary purpose of the submodel predictions is to improve the mean function of the Gaussian process. Although we have determined a way to include the uncertainty from two of our submodels (lasso regression and generalized additive models (GAM)), we have not determined a way to include uncertainty from the boosted regression tree (BRT) submodel into our final estimates. Whereas GAM and lasso regression seek to fit a single model that best describes the relationship between response variable and some set of predictors, BRT method fits a large number of relatively simple models for which the predictions are then combined to give robust estimates of the response. Although this feature of the BRT model makes it a powerful tool for analysing complex data, quantifying the relative uncertainty contributed by each simple model as well as uncertainty from the complex interactions of the predictor variables is challenging33,34. It is worth noting, however, that our out-of-sample validation indicates that the 95% coverage is fairly accurate (for example, closely ranges around 95%) as shown in the figures and table of Supplementary Information section 4.3.2. This indicates that we are not misrepresenting the uncertainty in our final estimates.

Analysis

Geostatistical model

Gaussian and binomial data are modelled within a Bayesian hierarchical modelling framework using a spatially and temporally explicit hierarchical generalized linear regression model to fit the mean number years of education attainment and the proportion of the population who achieved key bins of school in 14 regions across all LMICs as defined in the GBD study (Extended Data Fig. 1). This means we fit 14 independent models for each indicator (for example, the proportion of women with zero years of schooling). GBD study design sought to create regions on the basis of three primary criteria: epidemiological homogeneity, sociodemographic similarity, and geographical contiguity27. Fitting our models by these regions has the advantage of allowing for some non-stationarity and non-isotropy in the spatial error term, compared to if we modelled one spatiotemporal random-effect structure over the entire modelling region of all LMICs.

For each Gaussian indicator, we modelled the mean number of years of attainment in each survey cluster, d. Survey clusters are precisely located by their GPS coordinates and year of observation, which we map to a spatial raster location i at time t. We model the mean number of years of attainment as Gaussian data given fixed precision τ and a scaling parameter sd (defined by the sample size in the observed cluster). As we may have observed multiple data clusters within a given location i at time t, we refer to the mean attainment, μ, within a given cluster d by its indexed location i, and time t as μi(d),t(d).

$${{\rm{edu}}}_{d}|{\mu }_{i(d),t(d)},{s}_{d},\tau \, \sim \,{\rm{Normal}}({\mu }_{i(d),t(d)},\tau {s}_{d})\,\forall \,{\rm{observed}}\,{\rm{clusters}}\,d$$
$${\mu }_{i,t}=\,{\beta }_{0}+{{\bf{X}}}_{i,t}{\boldsymbol{\beta }}+{Z}_{i,t}+{{\epsilon }}_{{\rm{ctr}}(i)}+{{\epsilon }}_{i,t}\,\forall \,i\in {\rm{spatial}}\,{\rm{domain}}\,\forall \,t\in {\rm{time}}\,{\rm{domain}}$$

For each binomial indicator, we modelled the number of individuals at a given attainment level in each survey cluster, d. We observed the number of individuals reporting a given attainment level as binomial count data Cd among an observed sample size Nd. As we may have observed multiple data clusters within a given location i at time t, we refer to the probability of attaining that level, p, within a given cluster d by its indexed location i and time t as pi(d),t(d).

$${C}_{d}|{p}_{i(d),t(d)},\,{N}_{d}\sim {\rm{Binomial}}({p}_{i(d),t(d)},\,{N}_{d})\,\forall \,{\rm{observed}}\,{\rm{clusters}}\,d$$
$${\rm{logit}}({p}_{i,t})=\,{\beta }_{0}+{{\bf{X}}}_{i,t}{\boldsymbol{\beta }}+{Z}_{i,t}+{{\epsilon }}_{{\rm{ctr}}(i)}+{{\epsilon }}_{i,t}\,\forall \,i\in {\rm{spatial}}\,{\rm{domain}}\,\forall \,t\in {\rm{time}}\,{\rm{domain}}$$

We used a continuation-ratio modelling approach to account for the ordinal data structure of the binomial indicators35. To do this, the proportion of the population with zero years of education was modelled using a binomial model. The proportion with less than primary education was modelled as those with less than primary education of those that have more than zero years of education. The same method followed for the proportion of population completing primary education. The proportion achieving secondary school or higher was estimated as the complement of the sum of the three binomial models.

The remaining parameter specification was consistent between all indicators in both binomial and Gaussian models:

$$\mathop{\sum }\limits_{h=1}^{3}{\beta }_{h}\,=1$$
$${{\epsilon }}_{{\rm{ctr}}}\sim {\rm{iid}}\,{\rm{Normal}}(0,\,{\gamma }^{2})$$
$${{\epsilon }}_{i,t}\sim {\rm{iid}}\,{\rm{Normal}}(0,\,{\sigma }^{2})$$
$${\bf{Z}}\sim {\rm{GP}}(0,{\Sigma }^{{\rm{space}}}\otimes {\Sigma }^{{\rm{time}}})$$
$${\Sigma }^{{\rm{space}}}=\,\frac{{\omega }^{2}}{\varGamma (\nu ){2}^{v-1}}\times {(\kappa D)}^{\nu }\times {{\rm K}}_{\nu }(\kappa D)$$
$${\Sigma }_{j,\,k}^{{\rm{time}}\,}={\rho }^{|k-j|}$$

For indices d, i, and t, *(index) is the value of * at that index. The probabilities pi,t represent both the annual proportions at the space–time location and the probability that an individual had that level of attainment given that they lived at that particular location. The annual probability pi,t of each indicator (or μi,t for the mean indicators) was modelled as a linear combination of the three submodels (GAM, BRT, and lasso regression), rasterized covariate values Xi,t, a correlated spatiotemporal error term Zi,t, country random effects \({{\epsilon }}_{{\rm{ctr}}(i)}\) with one unstructured country random effect fit for each country in the modelling region and all sharing a common variance parameter, γ2, and an independent nugget effect \(\,{{\epsilon }}_{i,t}\) with variance parameter σ2. Coefficients βh in the three submodels h = 1, 2, 3 represent their respective predictive weighting in the mean logit link, while the joint error term Zi,t accounts for residual spatiotemporal autocorrelation between individual data points that remains after accounting for the predictive effect of the submodel covariates, the country-level random effect \({{\epsilon }}_{{\rm{ctr}}(i)}\), and the nugget independent error term, \(\,{{\epsilon }}_{i,t}\). The purpose of the country-level random effect is to capture spatially unstructured, unobserved country-specific variables, as there are often sharp discontinuities in educational attainment between adjacent countries due to systematic differences in governance, infrastructure, and social policies.

The residuals Zi,t are modelled as a three-dimensional Gaussian process (GP) in space–time centred at zero and with a covariance matrix constructed from a Kronecker product of spatial and temporal covariance kernels. The spatial covariance Σspace is modelled using an isotropic and stationary Matérn function36, and temporal covariance Σtime as an annual autoregressive (AR1) function over the 18 years represented in the model. In the stationary Matérn function, Γ is the Gamma function, Kv is the modified Bessel function of order v > 0, κ > 0 is a scaling parameter, D denotes the Euclidean distance, and ω2 is the marginal variance. The scaling parameter, κ, is defined to be \(\kappa =\sqrt{8v}/\delta \) where δ is a range parameter (which is about the distance for which the covariance function approaches 0.1) and v is a scaling constant, which is set to 2 rather than fit from the data37,38. This parameter is difficult to reliably fit, as documented by many other analyses37,39,40 that set this parameter to 2. The number of rows and the number of columns of the spatial Matérn covariance matrix are equal to the number of spatial mesh points for a given modelling region. In the AR1 function, ρ is the autocorrelation function (ACF), and k and j are points in the time series where |k − j| defines the lag. The number of rows and the number of columns of the AR1 covariance matrix are equal to the number of temporal mesh points (18). The number of rows and the number of columns of the space–time covariance matrix, Σspace Σtime, for a given modelling region are equal to: the number of spatial mesh points × the number of temporal mesh points.

This approach leveraged the residual correlation structure of the data to more accurately predict estimates for locations with no data, while also propagating the dependence in the data through to uncertainty estimates41. The posterior distributions were fit using computationally efficient and accurate approximations in R-integrated nested Laplace approximation (INLA) with the stochastic partial differential equations (SPDE) approximation to the Gaussian process residuals using R project version 3.5.142,43,44,45. The SPDE approach using INLA has been demonstrated elsewhere, including the estimation of health indicators, particulate air matter, and population age structure10,11,46,47. Uncertainty intervals were generated from 1,000 draws (that is, statistically plausible candidate maps)48 created from the posterior-estimated distributions of modelled parameters. Additional details regarding model and estimation processes can be found in the Supplementary Information.

To transform grid cell-level estimates into a range of information that is useful to a wide constituency of potential users, these estimates were aggregated from the 1,000 candidate maps up to district, provincial, and national levels using 5 × 5-km2 population data49. This aggregation also enabled the calibration of estimates to national GBD estimates for 2000–2017. This was achieved by calculating the ratio of the posterior mean national-level estimate from each candidate map draw in the analysis to the posterior mean national estimates from GBD, and then multiplying each cell in the posterior sample by this ratio. National-level estimates from this analysis with GBD estimates can be found in Supplementary Table 44.

To illustrate how subnational progress has contributed differentially to national progress (Fig. 3), we decomposed the improvement in the national rate of secondary completion since 2000 for each country into the additive contributions of rate changes at the second administrative level, where C is the national secondary rate change, N is the total number of second-level administrative units, ci is the population proportion in administrative unit i, and ri is the rate of secondary attainment in administrative unit i.

$$C=\,\mathop{\sum }\limits_{i=1}^{N}({c}_{i,2017}{r}_{i,2017})-({c}_{i,2000}{r}_{i,2000})$$

Although the model can predict at all locations covered by available raster covariates, all final model outputs for which land cover was classified as ‘barren or sparsely vegetated’ were masked, on the basis of the most recently available Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (2013), as well as areas in which the total population density was less than 10 individuals per 1 × 1-km2 pixel in 201550. This step has led to improved understanding when communicating with data specialists and policy-makers.

Model validation

Models were validated using source-stratified fivefold cross-validation. To offer a more stringent analysis by respecting some of the source and spatial correlation in the data, holdout sets were created by combining sets of data sources (for example, entire survey- or census-years). Model performance was summarized by the bias (mean error), total variance (root-mean-square error) and 95% data coverage within prediction intervals, and the correlation between observed data and predictions. All validation metrics were calculated on the predictions from the fivefold cross-validation. Where possible, estimates from these models were compared against other existing estimates. Furthermore, measures of spatial and temporal autocorrelation pre- and post-modelling were examined to verify correct recognition, fitting, and accounting for the complex spatiotemporal correlation structure in the data. All validation procedures and corresponding results are provided in the Supplementary Information.

Limitations

Our analysis is not without several important limitations. First, almost all data collection tools conflate gender and sex and we therefore do not capture the full distribution of sex or gender separately in our data. We refer throughout to the measurement of ‘gender (in)equality’, following the usage in SDG 5. Second, it is extremely difficult to quantify quality of education on this scale in a comparable way. Quality is ultimately a large part of the SDG agenda and of utmost importance to achieving equity in opportunity for social mobility. However, many studies across diverse low- and middle-income settings have linked attainment, even very low levels, to measurable improvement in maternal and child health17. As our analysis highlights with the proportional indicators, there are still many subnational regions across the world where large proportions do not complete primary school. A third limitation is that we are unable to measure or account for migration. A concept note released from the forthcoming Global Education Monitoring Report 2019 focuses on how migration and displacement affects schooling51. Our estimates of the modelled outcome, educational attainment for a particular space–time–age–sex, are demonstrated to be statistically unbiased (Supplementary Information section 4.3); however, interpretation of any change in attainment as a change in the underlying education system could potentially be biased by the effects of migration. It is possible that geographical disparities reflect changes in population composition rather than changes in the underlying infrastructure or education system. Pathways for this change are complex and may be voluntary. Those who manage to receive an education in a low-attainment area may have an increased ability to migrate and choose to do so. This change may also be involuntary, particularly in politically unstable areas where displacement may make geographical changes over time difficult to estimate. A shifting population composition is a general limitation of many longitudinal ecological analyses, but the spatially granular nature of the analyses used here may be more sensitive to the effects of mobile populations.

Our analysis is purely predictive but draws heavily in its motivation from a rich history of literature on the role of education in reducing maternal mortality, improving child health, and increasing human capital. Studies have also demonstrated complex relationships between increased education and a myriad of positive health outcomes, such as HIV risk reductions and spillover effects to other household members52,53. The vast majority of these studies are associational and recent attempts at causal analyses have provided more-mixed evidence54,55,56. Although causal analyses of education are very difficult and often rely on situational quasi-experiments, associational analyses using the most comprehensive datasets demonstrate consistent support for the connection between education and health17,57. Looking towards future analyses, it will be important to study patterns of change in these data and how they overlap with distributions of health. Lastly, our estimates cannot be seen as a replacement for proper data collection systems, especially for tracking contemporaneous change. Our analysis of uncertainty at a high-resolution may be used to inform investment in more robust data systems and collection efforts, especially if the ultimate goal is to measure and track progress in the quality of schooling.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.