Despite efforts to improve access to education, many adolescents and youths of school age remain marginalised, disproportionately by geography, socio-economic status, cultural norms and gender (Graetz et al., 2020; UNESCO, 2020a). The 2030 Sustainable Development Goal (SDG) agenda will not be achieved without substantial investment in education and in reducing the associated inequalities (Friedman et al., 2020) for pre-adult age groups (UNESCO, 2015a, 2015b). This age group constitutes ~18% of the global population and represents the future well-being of society and its socio-economic development potential. Lower secondary school age (12–14 years) and upper secondary school age (15–17 years) also span a period when higher-risk (or protective) behaviours start or become entrenched, with a major impact on health and development in adulthood (UNFPA, 2007). While investment and initiatives to improve access to secondary school education have increased in many countries since 2010 (Morgan et al., 2014; Koski et al., 2018), the proportion of out-of-school adolescents of lower secondary age and youth of school age remains unacceptably high. Progress has been at a near standstill for 8 years (UNESCO, 2018), and sub-national heterogeneities and their drivers, such as geographic distance, are poorly understood in many low-income and middle-income countries where policies are targeted. Without education access and protection, the immediate economic and social well-being of any country is at risk.

Distance to school is a recognised barrier to education access alongside socio-economic and demographic household characteristics such as parents' level of education, wealth, and early marriage (Buchmann, 1999; Yu, 2007; ILO, 2010). For example, according to the World Inequality Database on Education (WIDE) for 2020 (UNESCO, 2020b), only 8% of youths completed secondary school in Tanzania, 57% in the Dominican Republic and 21% in Cambodia. To aid in understanding inequalities related to physical or geographic accessibility, information on the location of populations, schools, and socio-demographic characteristics is increasingly available to develop fine spatial resolution maps of geographic accessibility. Further, triangulating data on the spatial location of schools with household-level data from nationally representative household surveys undertaken every 3–5 years (Anderson and Cleland, 1984; Ayad et al., 1997; Burgert-Brucker et al., 2015) can be useful in estimating school attendance at a community level. Finally, the improved mapping of the age-structured global population (Stevens et al., 2015; Wardrop et al., 2018; Worldpop, 2018) provides opportunities for understanding the location of services within populations and improves the estimation of those marginalised from schools. There have been no previous attempts to triangulate available community survey data with spatial databases of schools and fine-scale age-structured population maps at a sub-national level to estimate access and attendance amongst adolescents and school-age youths at a fine geographic scale.

Here we employ a geospatial approach using an example of three countries classified by the World Bank as low- and middle-income (United Republic of Tanzania—Tanzania mainland, Cambodia and the Dominican Republic). The aim was to examine geographic accessibility to secondary schools and associate this with the predicted out-of-secondary school rates at a fine spatial resolution (1 by 1 km). The approach integrates locations of secondary schools with fine-scale geospatial covariates to estimate geographic accessibility in a Geographic Information System (GIS).


Spatial database for schools

Geographic location data for schools were assembled from governmental sources for the three countries. These countries were selected based on the geographic differences and heterogeneities in the distribution of secondary schools in Africa (Tanzania), Southeast Asia (Cambodia), and Latin America (Dominican Republic). For Tanzania, location data for 3258 secondary schools were obtained from the United Republic of Tanzania data portal (The United Republic of Tanzania—Government Basic Statistics Portal, 2015). The designated age range for secondary school in Tanzania is 14–19 years. For Cambodia, data were obtained from the Ministry of Education, Youth and Sport (MoEYS) and consisted of 1615 schools classified as College, LyceeG10-12, and LyceeG7-12, with an age range of 13–18 years. In the Dominican Republic, data were obtained from the Ministerio de Educación de la República Dominicana (n = 4618) (Ministerio De Educación, 2018) and the corresponding age range was 14–17 years.

Cluster-level data on school attendance

The Demographic and Health Surveys (DHS) for Tanzania (2015–16 DHS, n = 595 clusters) (Ministry of Health et al., 2016), Cambodia (2014 DHS, n = 611 clusters) (National Institute of Statistics/Cambodia et al., 2015), and Dominican Republic (2013 DHS, n = 476 clusters) (Centro De Estudios Sociales Y Demográficos—Cesdem/República Dominicana and ICF International, 2015) were first used to derive the rates of attendance adjusted for DHS sampling and stratification. The DHS survey sampling in each country was based on a two-stage stratified sampling design using the national census sampling frame. During the first stage, enumeration areas (EAs), also known as clusters, were selected with probability proportional to population size. During the second stage, households were sampled from a complete household listing in the selected EAs. Specific details on the sampling procedures for the three countries of interest for this work can be found in the DHS final reports, and in the DHS Sampling Manual (ICF International, 2012). DHS clusters were defined as groups of households in the same area or block (if in urban areas) selected for interview within the complex survey design used by the DHS; cluster-level spatial coordinates (latitude and longitude) are usually also provided in the surveys.

Ancillary covariates and population data

Additional covariate data were assembled to aid in the estimation of geographic access and interpolation of cluster-level data. Land use and land cover maps for the three countries were obtained from MERIS GlobCover (Arino et al., 2007). The GlobCover classification uses 22 classes defined on the basis of the United Nations Land Cover Classification System (UN-LCCS) (FAO, 2000). The current GlobCover V.2.3 was derived from a time-series of medium resolution imaging spectrometer (MERIS) satellite imagery acquired from December 2004 to June 2006 at a spatial resolution of 300 m. An improved gap-filled digital elevation model (DEM) was obtained from the HydroSHEDS dataset, based primarily on NASA's Shuttle Radar Topography Mission (SRTM) (Lehner et al., 2008). Roads data were assembled from OpenStreetMap (OSM) and online resources such as the National Geospatial-Intelligence Agency (NGA) (NGA, 2015) and independent data from MapCruzin (an independent open-source data repository). A gridded night-time light dataset based on low-light imaging of the Earth at night was obtained from the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB), operated jointly by NASA and NOAA (Elvidge et al., 2017). This gridded image of lights at night has been shown to correlate highly with the urban population (Small et al., 2005; Shi et al., 2014). Lastly, 1 × 1 km population maps were downloaded from WorldPop (Worldpop, 2018). These were census-based maps disaggregated using a combination of standardised dasymetric mapping approaches informed by population density weights calculated using random forest (RF) methodology (Stevens et al., 2015).

Geospatial modelling of travel time to secondary schools

A gridded layer of travel times to secondary schools was estimated using land use, elevation (the DEM), and roads in AccessMod (version 5.0) (Ray and Ebener, 2008). The other assembled covariates were used in interpolating attendance rates in space. In deriving the travel time grid, each GIS layer was converted into a 1 × 1 km raster surface and each pixel was assigned an impedance value representing the speed of traversing a grid pixel based on land use type (Table 2). The resulting rasters were then combined into a gridded friction layer. Travel speeds were assigned to different land use classes, roads and slope by assuming multiple modes of transport within a single journey. For instance, for primary roads, motorised transport was assumed with a maximum speed of 80 km h−1. On other tertiary roads, a 5 km h−1 walking speed was adopted, with a correction for non-motorised transport (cycling) at 10 km h−1 applied on residential roads. Details of travel speed and mode of travel were selected based on recommendations from previous studies (Noor et al., 2006; Tanser et al., 2006). The DEM was used to derive slope, and different speeds were calculated for each degree rise based on Tobler's equation:

$$V = 6\,\exp \left( { - 3.5\,\left| {\tan \left( {\mathrm{slope\ in\ degrees}/57.296} \right) + 0.05} \right|} \right)$$

where V is the calculated speed (Tobler, 1993). Travel times to each school were computed separately.
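The travel-time calculation can be illustrated with a minimal sketch. It is not the AccessMod implementation: it assumes a small square grid, movement between edge-sharing cells only, and a single speed per cell (an isotropic simplification), and the function names `tobler_speed_kmh` and `travel_time_hours` are hypothetical. It evaluates Tobler's equation for a given slope and then accumulates least-cost travel time outward from school cells across a speed raster.

```python
import heapq
import math


def tobler_speed_kmh(slope_degrees):
    """Walking speed (km/h) from Tobler's equation for a slope given in degrees."""
    gradient = math.tan(math.radians(slope_degrees))  # degrees / 57.296 -> radians
    return 6.0 * math.exp(-3.5 * abs(gradient + 0.05))


def travel_time_hours(speed_kmh, sources, cell_km=1.0):
    """Least-cost travel time (hours) from each cell to its nearest source cell.

    speed_kmh: 2D list of per-cell travel speeds (km/h); 0 or None marks a barrier.
    sources:   list of (row, col) cells that contain a school.
    Assumption: moves are limited to the four edge-sharing neighbours.
    """
    rows, cols = len(speed_kmh), len(speed_kmh[0])
    best = [[math.inf] * cols for _ in range(rows)]
    queue = []
    for r, c in sources:
        best[r][c] = 0.0
        heapq.heappush(queue, (0.0, r, c))
    while queue:
        t, r, c = heapq.heappop(queue)
        if t > best[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if not speed_kmh[r][c] or not speed_kmh[nr][nc]:
                continue  # impassable cell
            # Half a cell is crossed at each cell's own speed.
            step = 0.5 * cell_km / speed_kmh[r][c] + 0.5 * cell_km / speed_kmh[nr][nc]
            if t + step < best[nr][nc]:
                best[nr][nc] = t + step
                heapq.heappush(queue, (t + step, nr, nc))
    return best
```

On flat ground (slope of 0 degrees) the function returns about 5.04 km h−1, which is close to the 5 km h−1 walking speed adopted above.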

Estimating net school attendance rates at cluster level

The calculation of the adjusted net school attendance rate followed the guidelines and coding proposed by the MEASURE DHS programme (Croft et al., 2018). The methodology adopted Stata (Statacorp., 2017) (software for statistics and data science) code for estimating rates adjusting for survey weighting. The adjusted net attendance rate estimated the total number of students of the official secondary school age group who attended secondary education (or primary, or higher education) at any time during the reference academic year. The numerator was the de facto total population of secondary school-age attending primary or secondary or higher school, while the denominator was the total number of de facto secondary school-age adolescents or school-age-youths. It, therefore, included students of official school age who accessed school earlier or later than the normal enrolment age and was expressed as a percentage of the corresponding population (UNESCO, 2019), giving a more precise picture of school participation. Age ranges were established based on guidelines from the National Ministry of Education and the UNESCO Institute for Statistics database. The age at the start of the academic year was used to determine the eligible secondary school-age population used in the numerators and denominators for the net attendance rate. To establish these age ranges, full information on the date of birth of the child in question was triangulated with the start of the academic year, to account for the temporal gap between the interviews and the start of the academic year. For geospatial mapping purposes, these rates were aggregated at cluster level and the proportion of secondary school age attending (or out-of-school) in each cluster was computed. The computed proportions at each georeferenced cluster were then interpolated through modelling at the second stage.
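The numerator and denominator described above can be sketched as a weighted proportion per cluster. This is an illustrative sketch rather than the Stata code used in the study: the record keys (`cluster`, `age_at_year_start`, `attending`, `weight`) and the function name are hypothetical, and `attending` is assumed to be true when the child attends primary, secondary or higher education during the reference year.

```python
from collections import defaultdict


def adjusted_net_attendance(records, age_min, age_max):
    """Weighted percentage of secondary-school-age children attending school, per cluster.

    records: iterable of dicts with hypothetical keys
        'cluster', 'age_at_year_start', 'attending' (bool), 'weight' (survey weight).
    age_min, age_max: official secondary school age range (inclusive).
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for rec in records:
        # Only children of official secondary school age enter either total.
        if not (age_min <= rec["age_at_year_start"] <= age_max):
            continue
        denominator[rec["cluster"]] += rec["weight"]
        if rec["attending"]:
            numerator[rec["cluster"]] += rec["weight"]
    return {
        cluster: 100.0 * numerator[cluster] / total
        for cluster, total in denominator.items()
        if total > 0
    }
```

The out-of-school percentage for a cluster is then 100 minus the returned value.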

Geostatistical modelling of secondary school attendance

A model-based geostatistical method (Diggle et al., 1998) was used to spatially interpolate cluster-level estimates of attendance with gridded covariates to define mean attendance at 1 × 1 km. The renaissance of model-based geostatistical (MBG) approaches has occurred in other fields (Banerjee et al. 2004; Lindgren, 2013), with the added advantage of estimating uncertainty associated with the estimation of school attendance. At the first stage, covariates were selected using a statistical procedure. Covariates considered included the modelled travel time to the nearest school, the enhanced vegetation index, night-time light, minimum and maximum temperature in all three countries. A bestglm (Mcleod and Xu, 2008) procedure was then implemented for each country separately resulting in a parsimonious set for modelling.

The main objective in modelling was to predict net attendance at fine scale for all locations nationally using a parsimonious set of covariates that were statistically important in explaining variation in observed attendance rates. For this purpose, a Bayesian hierarchical spatial model was implemented using the Integrated Nested Laplace Approximation in R (R-INLA) (Rue et al., 2009; Cameletti et al., 2012; Martins et al., 2013) to estimate a continuous map of the proportion attending secondary school-level education at 1 × 1 km spatial resolution. A stochastic partial differential equation (SPDE) approach was adopted using R-INLA, with computation performed via a Gaussian Markov random field (GMRF). A stationary model was implemented using a Matérn covariance with smoothness ν and marginal variance σ², where

$$\alpha = \nu + d/2$$

where d is the spatial dimension and the marginal variance is \(\sigma ^2 = {\Gamma}\left( \nu \right)/\left( {{\Gamma}\left( \alpha \right)\left( {4\pi } \right)^{d/2}k^{2\nu }\tau ^2} \right)\). A linear model was implemented using a Gaussian likelihood for the proportion attending school adjusted for sampling and strata. Thus,

$$z\left( s \right) = x\left( s \right)\beta + w\left( s \right) + \varepsilon \left( s \right)$$

where z(s) are realisations of the underlying attendance process linked additively to a spatially structured predictor, x(s) denotes the set of covariates with coefficients β, and ε(s) is the measurement error. w(s) represents the spatial process capturing the spatial association between clusters. The Bayesian specification was completed by assigning non-informative priors to the fixed effects (covariates) and to the hyper-parameters of the random components (the spatial process and the measurement error). For the SPDE parameters, a penalised complexity (PC) prior framework was used for the model range and the marginal variance (Fuglstad et al., 2019, 2020).
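As a worked special case of the Matérn relations given earlier, assume a two-dimensional domain (d = 2) and α = 2. This is a common choice for SPDE models, but the text does not state that it was used here. Because Γ(1) = Γ(2) = 1, the smoothness and marginal variance reduce to

$$\nu = \alpha - d/2 = 1,\qquad \sigma ^2 = \frac{1}{{4\pi k^2\tau ^2}}$$

Under the usual SPDE convention, the range on which the PC prior is placed is approximately the distance at which the spatial correlation falls to about 0.1,

$$\rho \approx \sqrt {8\nu } /k = \sqrt 8 /k$$

so a smaller scale parameter k implies spatial correlation that persists over longer distances.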

Model calibration (statistical consistency) and sharpness (concentration) were assessed using the probability integral transform (PIT) and the conditional predictive ordinate (CPO), a leave-one-out cross-validation approach in which an estimate was validated based on the fitted model and the remaining data only (Spiegelhalter et al., 2002; Czado et al., 2009). A 20% subset of data selected randomly was used in the computation of the mean prediction error (MPE), the root mean square error (RMSE), and a Pearson's product–moment correlation coefficient that quantified the association between observed and predicted values. Figure 1 shows the overall methodology for geostatistical prediction of out-of-school rates (Breiman and Spector, 1992).
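The hold-out validation described above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the MPE sign convention (predicted minus observed) is assumed, and the split is a simple random 80/20 partition of cluster indices.

```python
import math
import random


def holdout_split(n, fraction=0.2, seed=0):
    """Randomly partition indices 0..n-1 into (training, validation) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    k = round(n * fraction)
    return indices[k:], indices[:k]


def validation_metrics(observed, predicted):
    """Return MPE, RMSE and Pearson's r for paired observed and predicted values.

    Assumes at least two pairs and non-constant inputs (otherwise r is undefined).
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mpe = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {"mpe": mpe, "rmse": rmse, "pearson_r": cov / math.sqrt(var_o * var_p)}
```

In use, the model would be fitted on the training indices and `validation_metrics` applied to the observed and predicted attendance proportions at the validation clusters.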

Fig. 1: Methodology for school attendance modelling.

Overall schematic flow of the geospatial analysis of out-of-school rates.


Summary of data and distance to school

There were 3258 secondary schools in the Tanzanian mainland, 1615 in Cambodia and 4618 in the Dominican Republic. The average straight-line distance from any population centre to the nearest school was estimated as 6.6 km in Tanzania (mainland), 3.3 km in Cambodia and 1.3 km in the Dominican Republic. This suggested that schools were located farther from the population in Tanzania than in the other two countries. The same pattern was reflected in travel time: the mean estimated travel time to the nearest school was 0.8 h in Tanzania (~50 min), 0.4 h in Cambodia (~25 min) and only 0.1 h (~10 min) in the Dominican Republic (Fig. 2).

Covariate selection and model validation

From the covariate selection procedure across the three countries, only temperature variables and night-time light (a proxy for urbanisation) were important statistically in explaining variation in school attendance rates. Travel time to the nearest secondary school (an indicator of geographic accessibility) was not selected for predictive modelling. Therefore, this covariate was used in associating geographic accessibility with predicted estimates of secondary school attendance at sub-national levels (Administrative level 1).

Table 1 lists model prediction performance for each country. For the three models, the Pearson correlation between the predicted estimate and the out-of-sample validation set (20% of clusters) was >60% in all countries. This suggested a good association of the prediction when compared to the observed data. The mean absolute error was calculated based on residuals between observed and predicted estimates and was relatively small at 0.29 (Tanzania), 0.11 (Cambodia), and 0.20 (Dominican Republic).

Table 1 Model goodness of fit and prediction performance for Tanzania, Cambodia and the Dominican Republic.

The predicted rate of secondary school non-attendance

Fig. 3 shows predictions of the percentage of adolescents and school-age youths out of school in Tanzania, Cambodia and the Dominican Republic at a 1 km spatial resolution. The green areas are those with a low percentage of adolescents and school-age youths not attending secondary school. The second panel shows the difference between the upper and lower 95% Bayesian credible interval as a measure of uncertainty in estimates. Uncertainty is contributed by several factors including survey sampling of the clusters, few data points and the goodness-of-fit of the model. Fig. 4 shows a quadrant-level analysis of the percentage out of secondary school and the estimated number of adolescents and school-age youths based on population distribution. Fig. 5 shows scatter plots of travel time against out-of-secondary school rates in the three countries, with a non-linear model fitted via generalised additive model (GAM) regression. The corresponding R² from GAM regression was 73.3% in Tanzania, 68.8% in Cambodia, and 87.5% in the Dominican Republic.

Fig. 2: Secondary school accessibility maps.

Estimated travel time (minutes) to the nearest secondary school in the three countries: (A) Tanzania, (B) Cambodia, and (C) the Dominican Republic. The blue dots represent the spatial distribution of secondary schools in each country.

Fig. 3: Out-of-secondary school rates.

Maps at 1 × 1 km spatial resolution of the predicted (mean) percentage of secondary school age adolescents and school-age-youths who were out-of-school in (A) Tanzania, (B) Cambodia and (C) the Dominican Republic. The lower panel maps show the difference between the upper and lower 95% Bayesian credible interval.

Fig. 4: Quadrant analytical plots.

Scatter plots showing the variation of attendance rates by region and by estimated number out of school in A Tanzania, B Cambodia and C the Dominican Republic. The red and grey line show national averages for percentage (y-axis) and number (x-axis) of adolescents and school-age youths out of secondary school.

Table 2 shows that, on average, approximately 57.3% (54.5–58.3%) of secondary school age adolescents and school-age youths were estimated to be out of school in the Tanzanian mainland. This translated to approximately 2.8 million adolescents and school-age youths out of school in 2016. The regions with the lowest attendance rates were associated with longer travel times, e.g. Tabora, Mbeya and Njombe. There were 8 regions in the Tanzanian mainland with >60% out-of-school rates, as classified in the first quadrant of Fig. 4. These were in Dodoma, Katavi, Mbeya, Mtwara, Njombe, Rukwa, Shinyanga, Simiyu and Tabora. The total number of out-of-school adolescents and school-age youths in these 8 regions was ~1.01 million, representing more than a third of the 2.8 million out of school.

Table 2 Estimates of geographic accessibility to secondary school, and number of adolescents and school-age youths not attending secondary school in the Tanzania mainland by region.

In Cambodia, ~40.0% (37.4–42.3%) were estimated to be out of secondary school, representing ~0.59 million (annexe Table A2). The Môndól Kiri region had the largest proportion, with an estimated 50.2% (44.4–58.1%) of adolescents and school-age youths out of secondary school. In total, 11 out of 25 regions in Cambodia exceeded the national average of adolescents and school-age youths out of secondary school (17%; n = 170,079). For the Dominican Republic, the percentage of adolescents and school-age youths out of secondary school was lower at 10.7% (9.7–11.7%), representing ~0.1 million adolescents and school-age youths. However, half of the regions (n = 17) in the Dominican Republic exceeded the national average, accounting for ~68.2% (n = 70,398) of adolescents and school-age youths out of school.


This study focused on secondary school attendance for adolescents and school-age youths in Tanzania, Cambodia and the Dominican Republic. In Tanzania, more than 50% of this age group (14–19 years) were estimated to be out of secondary school education (mean 53.8%, IQR 51.4–60.2%). Based on estimated distance (Table 2 and Fig. 5), secondary schools were twice the distance (6.6 km, IQR 2.2–19.6 km) and at a greater travel time (0.8 h, IQR 0.2–3.0 h) from the population in Tanzania compared to Cambodia and the Dominican Republic. In Cambodia, the estimated percentage of adolescents and school-age youths aged 13–18 years who were out of secondary school was 40.0% (IQR 37.4–42.3%). In the Dominican Republic, only 10.77% (IQR 9.7–11.7%) of adolescents and school-age youths aged 14–17 years were estimated to be out of secondary school. This represented ~2.8 million out of secondary school in Tanzania in 2016, 0.6 million in Cambodia in 2014 and 0.1 million in the Dominican Republic in 2013. Maps of school attendance and geographic access (Figs. 2 and 3) are important in characterising heterogeneities at a fine geographic scale and can be particularly important when targeting education interventions. For countries such as Tanzania and Cambodia, a possible geographic intervention could be to increase school availability and reduce travel time to secondary schools in regions with poor access (Table 2).

Fig. 5: Travel time and attendance.

Scatter plots at the sub-national level (Administrative level 1) showing the association between mean travel time (x-axis) and the modelled posterior mean secondary school non-attendance (y-axis) for (A) Tanzania, (B) Cambodia and (C) the Dominican Republic. The fitted blue line is the non-linear fit via GAM regression with corresponding 95% CI (grey ribbon).

The secondary school non-attendance rate in mainland Tanzania estimated here corroborates previous education research and enrolment data for Tanzania (The United Republic of Tanzania—Government Basic Statistics Portal, 2016; Human Rights Watch, 2017). It is worth noting that an average distance of 5 km is commuted twice daily for secondary schools without boarding facilities. The long journey to secondary school contributes to the overall out-of-secondary school numbers, estimated here to be 2.8 million, alongside other factors not explored here, e.g. socioeconomic status, individual characteristics (e.g. attitude towards school), cultural factors, home environment, and lack of teachers (Sabates et al., 2010; Inoue et al., 2015; Gubbels et al., 2019). The modelled predictions of the out-of-secondary school population in mainland Tanzania are consistent with the figure of 3 million reported in the 2014 education policy brief (Tanzania Education Network, 2018). However, it is not clear, at sub-national levels, whether those who enrol at the lower secondary level (forms 1–4) complete it and progress to the advanced level (forms 5 and 6) (Mashala, 2019). More research is required on secondary school enrolment, drop-out, and the likelihood of secondary school completion and retention in low- and middle-income countries.

The average travel time to school in the Dominican Republic was approximately 10 minutes (0.1 h, IQR 0.0–0.2), with ~0.1 million estimated to be out of secondary school. The Dominican Republic is smaller in land surface area but had a larger number of secondary schools (n = 4618) than mainland Tanzania (n = 3258) or Cambodia (n = 1615). The Tanzanian mainland is 19 times larger in area than the Dominican Republic, while Cambodia is 4 times larger. This suggests that physical access to school in the Dominican Republic is boosted by school availability (short distance) relative to population distribution.

There were some limitations in the analyses undertaken. Firstly, the travel speeds assigned to different land surfaces involve a degree of subjectivity, although all the assumptions made during this analysis are based on values derived from previously published geographic accessibility studies. Together with measurement error in the covariates, and depending on the country context, these assumptions may lead to over- or under-estimation of the actual rates. Secondly, variations in the sizes of schools were not explicitly modelled. The size of a school could be driven by its location (urban or rural) as well as by the size of the underlying population influencing use. The use of night-time light as a proxy for urbanisation adjusted for differences between urban and rural rates of attendance. However, other barriers such as household socioeconomic status, parents' level of education, decision-making at a household level, and gender differences (Huisman and Smits, 2009) were not explored further. The focus here was on predictive modelling of non-attendance rates, and explanatory modelling should be explored in future studies. Lastly, the geographical displacement of the DHS clusters could influence fine-scale estimates of school attendance. A random displacement is applied to urban clusters (by a maximum of 2 km) and rural clusters (by a maximum of 5 km), and an additional 1% of clusters are displaced by a maximum of 10 km. Given the interaction between cluster values and covariates used in the statistical models, the error introduced by the displacement was not explored analytically, but it has been found to have a minimal impact on such modelling (Burgert, 2014; Gething et al., 2015).
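The displacement rule described above can be illustrated with a short simulation. This is a sketch under stated assumptions, not the DHS procedure itself: coordinates are taken to be planar and in kilometres, the distance and direction are drawn uniformly, the extra 10 km displacement is assumed to apply to rural clusters only, and the function name `displace_cluster` is hypothetical.

```python
import math
import random


def displace_cluster(x_km, y_km, is_urban, rng):
    """Return a randomly displaced copy of a cluster location (planar km coordinates).

    Urban clusters move by up to 2 km and rural clusters by up to 5 km; about 1% of
    rural clusters (an assumption about which clusters are affected) move by up to 10 km.
    """
    if is_urban:
        max_km = 2.0
    else:
        max_km = 10.0 if rng.random() < 0.01 else 5.0
    distance = rng.uniform(0.0, max_km)       # assumed uniform in distance
    angle = rng.uniform(0.0, 2.0 * math.pi)   # assumed uniform in direction
    return x_km + distance * math.cos(angle), y_km + distance * math.sin(angle)
```

Repeating a covariate extraction over many simulated displacements of each cluster is one way to gauge how sensitive fine-scale estimates are to this positional error.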


The 2030 Agenda for Sustainable Development may not be achieved without investment in the education of adolescents and youths. Thus, unearthing the subnational variation in secondary school geographic accessibility and attendance is important in estimating the fine-scale variation of those physically marginalised from education and provides indicators for monitoring SDG 4.3.1 (UNDP, 2019). Using an example of three low-income and middle-income countries (Tanzania mainland, Cambodia, and the Dominican Republic), we show that the number of adolescents and youths of school age who are out of school varies within and between countries and that many are physically marginalised. In general, inequalities in access to secondary education have a future impact on national economies and require national investment to remove disparities and ensure no adolescents and youths are left behind. Alongside improving physical access and reducing inequality in these countries, it would be beneficial to investigate, at a micro (household) and macro level, the role of other factors such as direct and indirect costs and the quality of provision in out-of-school rates.