Mapping out-of-school adolescents and youths in low- and middle-income countries

Education is a human right and a driver of development, but it is still not accessible to a vast number of adolescents and school-age youths. Out-of-school rates for adolescents and youth (SDG 4.3.1) in low- and middle-income countries have been at a virtual halt for almost a decade. Thus, there is an increasing need to understand geographic variation in accessibility and school attendance to aid in reducing inequalities in education. Here, the aim was to estimate physical accessibility and secondary school non-attendance amongst adolescents and school-age youths in Tanzania, Cambodia, and the Dominican Republic. Community cluster survey data were triangulated with the spatial location of secondary schools, non-proprietary geospatial data and fine-scale population maps to estimate accessibility to all levels of secondary school education and the number out of school. School attendance rates for the three countries were derived from nationally representative household survey data, and a Bayesian model-based geostatistical framework was used to estimate school attendance at high resolution. Results show sub-national variation in accessibility and secondary school attendance rates for the three countries considered. Attendance was associated with distance to the nearest school (R2 > 70%). These findings suggest that increasing the number of secondary schools could reduce the long distances commuted to school in low-income and middle-income countries. Future work could extend these findings to fine-scale optimisation models for school location, intervention planning, and understanding barriers associated with secondary school non-attendance at the household level.


Introduction
Despite efforts to improve access to education, many adolescents and youths of school age remain marginalised, disproportionately by geography, socio-economic status, cultural norms and gender (UNESCO, 2020a). The 2030 Sustainable Development Goal (SDG) agenda will not be achieved without substantial investment in education and associated inequalities (Friedman et al., 2020) for pre-adult age groups (UNESCO, 2015a, 2015b). This age group constitutes 18% of the global population and represents the future wellbeing of society and its socio-economic development potential. Lower secondary school age (12-14 years) and upper secondary school age (15-17 years) are also periods when higher risk (or protective) behaviours start or become entrenched, having a major impact on health and development in adulthood (UNFPA, 2007). While there has been increased investment and initiatives in many countries since 2010 to improve access to secondary school education (Morgan et al., 2014;Koski et al., 2018), the proportion of out-of-school adolescents of lower secondary age and youth of school age remains unacceptably high: progress has been at a near standstill for 8 years (UNESCO, 2018), with sub-national heterogeneities and their drivers, such as geographic distance, poorly understood in many low-income and middle-income countries where policies are targeted. Without education access and protection, the immediate economic and social well-being of any country is at risk.
Distance to school is a recognised barrier to education access alongside socio-economic and demographic household characteristics such as parents' level of education, wealth, and early marriage (Buchmann, 1999;Yu, 2007;ILO, 2010). For example, according to the World Inequality Database on Education (WIDE) for 2020 (UNESCO, 2020b), only 8% of youths completed secondary school in Tanzania, 57% in the Dominican Republic and 21% in Cambodia. To aid in understanding inequalities related to physical or geographic accessibility, information on the location of populations, schools, and socio-demographic characteristics is increasingly available to develop fine spatial resolution maps of geographic accessibility. Further, triangulating data on the spatial location of schools with household-level data from nationally representative household surveys undertaken every 3-5 years (Anderson and Cleland, 1984;Ayad et al., 1997;Burgert-Brucker et al., 2015) can be useful in estimating school attendance at a community level. Finally, the improved mapping of the age-structured global population (Stevens et al., 2015;Wardrop et al., 2018;Worldpop, 2018) provides opportunities for understanding the location of services within populations and improves the estimation of those marginalised from schools. There have been no previous attempts to triangulate available community survey data with spatial databases of schools and fine-scale age-structured population maps at a subnational level to estimate access and attendance amongst adolescents and school-age youths at a fine geographic scale.
Here we employ a geospatial approach using an example of three countries classified by the World Bank as low- and middle-income (United Republic of Tanzania-Tanzania mainland, Cambodia and the Dominican Republic). The aim was to examine geographic accessibility to secondary schools and associate this with the predicted out-of-secondary school rates at a fine spatial resolution (1 by 1 km). The approach integrates locations of secondary schools with fine-scale geospatial covariates to estimate geographic accessibility in a Geographic Information System (GIS).

Methods
Spatial database for schools. Geographic location data for schools were assembled from governmental sources for the three countries. These countries were selected based on the geographic differences and heterogeneities in the distribution of secondary schools in Africa (Tanzania), Southeast Asia (Cambodia), and Latin America (Dominican Republic). For Tanzania, location data for 3258 secondary schools were obtained from the United Republic of Tanzania data portal (The United Republic of Tanzania-Government Basic Statistics Portal, 2015). The designated age range for secondary school in Tanzania is 14-19 years. For Cambodia, data were obtained from the Ministry of Education, Youth and Sport (MoEYS) and consisted of 1615 schools classified as College, LyceeG10-12, and LyceeG7-12, with an age range of 13-18 years. In the Dominican Republic, data were obtained from the Ministerio de Educación de la República Dominicana (n = 4618) (Ministerio De Educación, 2018) and the corresponding age range was 14-17 years.
Cluster-level data on school attendance. The Demographic and Health Surveys (DHS) for Tanzania International, 2015) were first used to derive the rates of attendance adjusted for DHS sampling and stratification. The DHS survey sampling in each country was based on a two-stage stratified sampling design using the national census sampling frame. During the first stage, enumeration areas (EAs), also known as clusters, were selected by using a probability proportional-to-population size. During the second stage, households were sampled from a complete household listing in the selected EAs. Specific details on the sampling procedures for the three countries of interest for this work can be found in the DHS final reports, and in the DHS Sampling Manual (ICF International, 2012). DHS clusters were defined as a group of households in the same area or a block (if in urban areas) selected for the interview within the complex survey design used by the DHS, and usually cluster level spatial coordinates (latitude and longitude) are also provided in the surveys.
Ancillary covariates and population data. Additional covariate data were assembled to aid in the estimation of geographic access and interpolation of cluster-level data. Land use and land cover maps for the three countries were obtained from MERIS GlobCover (Arino et al., 2007). The GlobCover classification uses 22 classes defined based on the United Nations Land Cover Classification System (UN-LCCS) (FAO, 2000). The current GlobCover V.2.3 was derived from a time-series of medium resolution imaging spectrometer (MERIS) satellite imagery acquired from December 2004 to June 2006 at a spatial resolution of 300 m. An improved gap-filled digital elevation model (DEM) was obtained from the HydroSHEDS dataset, based primarily on NASA's Shuttle Radar Topography Mission (SRTM) (Lehner et al., 2008). Roads data were assembled from OpenStreetMap (OSM) and online resources such as the National Geospatial-Intelligence Agency (NGA) (NGA, 2015) and independent data from MapCruzin (an independent open-source data repository, http://www.mapcruzin.com/). A gridded night-time light dataset based on low-light imaging of the Earth at night was obtained from the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB) [https://www.ngdc.noaa.gov/eog/viirs.html], operated jointly by NASA and NOAA (Elvidge et al., 2017). This gridded image of lights at night has been shown to correlate highly with the urban population (Small et al., 2005;Shi et al., 2014). Lastly, 1 × 1 km population maps were downloaded from WorldPop (Worldpop, 2018). These represented disaggregated census-based maps produced using a combination of standardised dasymetric mapping approaches informed by population density weights calculated using random forest (RF) methodology (Stevens et al., 2015).
Geospatial modelling of travel time to secondary schools. A gridded layer of travel times to secondary schools was estimated using land use, elevation (the DEM), and roads in AccessMod (version 5.0) (Ray and Ebener, 2008). The other assembled covariates were used in interpolating attendance rates in space. In deriving the travel time grid, each GIS layer was converted into a 1 × 1 km raster surface and each pixel was assigned an impedance value representing the speed of traversing a grid pixel based on land use type (Table 2). The resulting rasters were then combined into a gridded friction layer. Travel speeds were assigned to different land use classes, roads and slope by assuming multiple modes of transport within a single journey. For instance, for primary roads, motorised transport was assumed with a maximum speed of 80 km h⁻¹. On other tertiary roads, a 5 km h⁻¹ walking speed was adopted, with a correction for non-motorised transport (cycling) at 10 km h⁻¹ applied on residential roads. Details of travel speed and mode of travel were selected based on recommendations from previous studies (Noor et al., 2006;Tanser et al., 2006). The DEM was used to derive slope, and different speeds were calculated for each degree rise based on Tobler's equation: $V = 6\exp\{-3.5\,|\tan(\text{slope in degrees}/57.296) + 0.05|\}$, where $V$ is the calculated speed in km h⁻¹ (Tobler, 1993). Travel times to each school were computed separately.
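The slope correction and the friction-surface idea described above can be illustrated with a minimal Python sketch. It is not AccessMod's implementation: the function names, the 4-neighbour grid, and the rule that each step is crossed half at each cell's speed are assumptions made purely for illustration. The slope function uses Tobler's hiking function in its standard published form, including the 0.05 offset.

```python
import heapq
import math


def tobler_speed_kmh(slope_deg: float) -> float:
    """Walking speed (km/h) from Tobler's hiking function for a slope in degrees."""
    gradient = math.tan(math.radians(slope_deg))  # equivalent to tan(slope / 57.296)
    return 6.0 * math.exp(-3.5 * abs(gradient + 0.05))


def travel_time_hours(speed_kmh, schools, cell_km=1.0):
    """Least-cost travel time (hours) from every cell to its nearest school.

    speed_kmh: 2D list of traversal speeds per cell (km/h); 0 marks a barrier.
    schools:   list of (row, col) cells containing a school (must be passable).
    This is plain Dijkstra on a 4-neighbour grid (an illustrative assumption).
    """
    rows, cols = len(speed_kmh), len(speed_kmh[0])
    best = [[math.inf] * cols for _ in range(rows)]
    queue = []
    for r, c in schools:
        best[r][c] = 0.0
        heapq.heappush(queue, (0.0, r, c))
    while queue:
        t, r, c = heapq.heappop(queue)
        if t > best[r][c]:
            continue  # stale queue entry, a shorter route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and speed_kmh[nr][nc] > 0:
                # Half of the step is crossed at each cell's speed.
                step = 0.5 * cell_km * (1 / speed_kmh[r][c] + 1 / speed_kmh[nr][nc])
                if t + step < best[nr][nc]:
                    best[nr][nc] = t + step
                    heapq.heappush(queue, (t + step, nr, nc))
    return best
```

On flat ground the function returns roughly 5 km/h, which matches the walking speed quoted in the text; the maximum of 6 km/h occurs on a gentle downhill gradient of about 5%.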
Estimating net school attendance rates at cluster level. The calculation of the adjusted net school attendance rate followed the guidelines and coding proposed by the MEASURE DHS programme (Croft et al., 2018). The methodology adopted Stata (Statacorp., 2017) code for estimating rates adjusted for survey weighting. The adjusted net attendance rate estimated the total number of students of the official secondary school age group who attended secondary education (or primary, or higher education) at any time during the reference academic year. The numerator was the de facto total population of secondary school age attending primary, secondary or higher school, while the denominator was the total number of de facto secondary school-age adolescents or school-age youths. It therefore included students of official school age who accessed school earlier or later than the normal enrolment age and was expressed as a percentage of the corresponding population (UNESCO, 2019), giving a more precise picture of school participation. Age ranges were established based on guidelines from the National Ministry of Education and the UNESCO Institute for Statistics database. The age at the start of the academic year was used to determine the eligible secondary school-age population used in the numerators and denominators for the net attendance rate. To establish these age ranges, full information on the date of birth of the child in question was triangulated with the start of the academic year, to account for the temporal gap between the interviews and the start of the academic year. For geospatial mapping purposes, these rates were aggregated at cluster level and the proportion of the secondary school-age population attending (or out of school) in each cluster was computed. The computed proportions at each georeferenced cluster were then interpolated through modelling at the second stage.
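As a rough illustration of the numerator and denominator described above, the following Python sketch computes a survey-weighted attendance proportion per cluster. It is not the DHS Stata code: the record keys (`cluster`, `weight`, `age_at_start`, `attending`) are hypothetical names, and the sketch ignores stratification and variance estimation.

```python
from collections import defaultdict


def cluster_attendance_rates(records, age_min, age_max):
    """Survey-weighted proportion attending school, per cluster.

    records: iterable of dicts with hypothetical keys
      'cluster'      cluster identifier
      'weight'       household/person sample weight
      'age_at_start' age at the start of the academic year
      'attending'    True if attending primary, secondary or higher education
    age_min, age_max: official secondary school age range (inclusive).
    """
    attending = defaultdict(float)
    eligible = defaultdict(float)
    for rec in records:
        # Only the official secondary school-age population enters the rate.
        if age_min <= rec["age_at_start"] <= age_max:
            eligible[rec["cluster"]] += rec["weight"]
            if rec["attending"]:
                attending[rec["cluster"]] += rec["weight"]
    return {c: attending[c] / eligible[c] for c in eligible if eligible[c] > 0}
```

The out-of-school proportion for a cluster is then one minus the returned value.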
Geostatistical modelling of secondary school attendance. A model-based geostatistical method (Diggle et al., 1998) was used to spatially interpolate cluster-level estimates of attendance with gridded covariates to define mean attendance at 1 × 1 km. The renaissance of model-based geostatistical (MBG) approaches has occurred in other fields (Banerjee et al. 2004;Lindgren, 2013), with the added advantage of estimating uncertainty associated with the estimation of school attendance. At the first stage, covariates were selected using a statistical procedure. Covariates considered included the modelled travel time to the nearest school, the enhanced vegetation index, night-time light, minimum and maximum temperature in all three countries. A bestglm (Mcleod and Xu, 2008) procedure was then implemented for each country separately resulting in a parsimonious set for modelling.
The main objective in modelling was to predict net attendance at fine scale for all locations nationally using a parsimonious set of covariates that were statistically important in explaining variation in observed attendance rates. For this purpose, a Bayesian hierarchical spatial model was implemented using the Integrated Nested Laplace Approximation in R (R-INLA) (Rue et al., 2009;Cameletti et al., 2012;Martins et al., 2013) to estimate a continuous map of the proportion attending secondary school-level education at 1 × 1 km spatial resolution. A stochastic partial differential equation (SPDE) approach was adopted using R-INLA, with computation performed via a Gaussian Markov random field (GMRF). A stationary model was implemented using a Matérn covariance with smoothness of process $\nu$ and variance $\sigma^2$ given by

$$\mathrm{Cov}(h) = \frac{\sigma^2}{2^{\nu-1}\Gamma(\nu)}\,(\kappa h)^{\nu} K_{\nu}(\kappa h), \qquad \nu = \alpha - d/2,$$

where $h$ is the distance between two locations, $\kappa > 0$ is a scale parameter, $K_{\nu}$ is the modified Bessel function of the second kind, $\alpha$ is the order of the SPDE operator, $d$ is the spatial dimension, and the marginal variance is

$$\sigma^2 = \frac{\Gamma(\nu)}{\Gamma(\alpha)\,(4\pi)^{d/2}\,\kappa^{2\nu}\,\tau^{2}},$$

with $\tau$ a precision scaling parameter. A linear model was implemented using a Gaussian likelihood for the proportion attending school, adjusted for sampling and strata. Thus,

$$z(s) = x(s)^{\top}\beta + w(s) + \varepsilon(s),$$

where $z(s)$ are realisations of the underlying attendance process linked to a spatially structured predictor in an additive way, $x(s)$ denotes the set of covariates with coefficients $\beta$, and $\varepsilon(s)$ is the measurement error. $w(s)$ represents the spatial process capturing the spatial association between clusters. The Bayesian specification was completed by assigning non-informative priors to the fixed effects (covariates) and to the hyper-parameters of the random terms (the spatial process and the measurement error). For SPDE parameters, a penalised complexity (PC) priors framework was used for the model range and the marginal variance (Fuglstad et al., 2019, 2020).
Model calibration (statistical consistency) and sharpness (concentration) were assessed using the probability integral transform (PIT) and the conditional predictive ordinate (CPO), a leave-one-out cross-validation approach in which each estimate was validated based on the model fitted to the remaining data only (Spiegelhalter et al., 2002;Czado et al., 2009). A randomly selected 20% subset of the data was used to compute the mean prediction error (MPE), the root mean square error (RMSE), and Pearson's product-moment correlation coefficient, which quantified the association between observed and predicted values. Figure 1 shows the overall methodology for geostatistical prediction of out-of-school rates (Breiman and Spector, 1992).
Fig. 1 Methodology for school attendance.
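The hold-out validation statistics named above can be written out explicitly. The Python sketch below is illustrative only: the function name is hypothetical, the sign convention for the mean prediction error (predicted minus observed) is an assumption, and the mean absolute error is included because it is reported alongside the other metrics in the results.

```python
import math


def validation_metrics(observed, predicted):
    """Hold-out prediction metrics for paired observed and predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]  # assumed sign convention
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "mpe": sum(errors) / n,                            # mean prediction error (bias)
        "mae": sum(abs(e) for e in errors) / n,            # mean absolute error
        "rmse": math.sqrt(sum(e * e for e in errors) / n), # root mean square error
        "pearson": cov / math.sqrt(var_o * var_p),         # product-moment correlation
    }
```

A prediction that is uniformly 0.1 too high, for example, gives an MPE, MAE and RMSE of 0.1 each and a Pearson correlation of 1, which shows why correlation alone does not capture bias.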

Results
Summary of data and distance to school. There were 3258 secondary schools in the Tanzanian mainland, 1615 in Cambodia and 4618 in the Dominican Republic. The average straight-line distance from any population centre to the nearest school was estimated as 6.6 km in Tanzania (mainland), 3.3 km in Cambodia and 1.3 km in the Dominican Republic. This suggested that schools were located farther from population centres in Tanzania than in the other two countries. This was also reflected in travel time, with a mean estimated travel time to the nearest school of 0.8 h in Tanzania (~50 min), 0.4 h in Cambodia (~25 min) and only 0.1 h (~10 min) in the Dominican Republic (Fig. 2).
Covariate selection and model validation. From the covariate selection procedure across the three countries, only temperature variables and night-time light (a proxy for urbanisation) were statistically important in explaining variation in school attendance rates. Travel time to the nearest secondary school (an indicator of geographic accessibility) was not selected for predictive modelling. Therefore, this covariate was used in associating geographic accessibility with predicted estimates of secondary school attendance at sub-national levels (Administrative level 1). Table 1 lists model prediction performance for each country. For the three models, the Pearson correlation between the predicted estimate and the out-of-sample validation set (20% of clusters) was >60% in all countries. This suggested a good association between predicted and observed values. The mean absolute error was calculated based on residuals between observed and predicted estimates and was relatively small at 0.29 (Tanzania), 0.11 (Cambodia), and 0.20 (Dominican Republic).
HUMANITIES AND SOCIAL SCIENCES COMMUNICATIONS | (2021) 8:213 | https://doi.org/10.1057/s41599-021-00892-w
The predicted rate of secondary school non-attendance. Fig. 3 shows predictions of the percentage of adolescents and school-age youths out of school in Tanzania, Cambodia and the Dominican Republic at a 1 km spatial resolution. The green areas are those with a low percentage of adolescents and school-age youths not attending secondary school. The second panel shows the difference between the upper and lower bounds of the 95% Bayesian credible interval as a measure of uncertainty in the estimates. Several factors contribute to uncertainty, including survey sampling of the clusters, few data points and the goodness-of-fit of the model. Fig. 4 shows a quadrant-level analysis of the percentage out of secondary school and the estimated number of adolescents and school-age youths based on population distribution. Fig. 5 shows scatter plots of travel time against out-of-secondary school rates in the three countries, with a non-linear model fitted via generalised additive model (GAM) regression. The corresponding R2 from GAM regression was 73.3% in Tanzania, 68.8% in Cambodia, and 87.5% in the Dominican Republic. Table 2 shows that, on average, approximately 57.3% (54.5-58.3%) of secondary school age adolescents and school-age youths were estimated to be out of school in the Tanzanian mainland. This translated to approximately 2.8 million adolescents and school-age youths out of school in 2016. The regions with the lowest attendance rates were associated with longer travel times, e.g. Tabora, Mbeya and Njombe. There were 8 regions in the Tanzanian mainland with >60% out-of-school rates as classified in the first quadrant of Fig. 4. These were in Dodoma, Katavi, Mbeya, Mtwara, Njombe, Rukwa, Shinyanga, Simiyu and Tabora. The total number of out-of-school adolescents and school-age youths in these 8 regions was ~1.01 million, representing more than a third of the 2.8 million out of school. In Cambodia, ~40.0% (37.4-42.3%) were estimated to be out of secondary school, representing ~0.59 million (annexe Table A2). The Môndól Kiri region had the largest population, with an estimated 50.2% (44.4-58.1%) of adolescents and school-age youths out of secondary school. In total, 11 out of 25 regions in Cambodia exceeded the national average of adolescents and school-age youths out of secondary school (17%; n = 170,079). For the Dominican Republic, the percentage of adolescents and school-age youths out of secondary school was lower at 10.7% (9.7-11.7%), representing ~0.1 million adolescents and school-age youths. However, half of the regions (n = 17) in the Dominican Republic exceeded the national average, with a population of ~68.2% (n = 70,398) of adolescents and school-age youths out of school.
Table 1 note: The goodness of fit was assessed using DIC. The model prediction performance was assessed using the mean absolute error (MAE), the root mean square error (RMSE) and Pearson correlation between the predicted and a 20% validation set.
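The regional totals reported above combine a predicted out-of-school proportion with a gridded school-age population. A minimal Python sketch of that arithmetic follows; it assumes the rate, population and region label are available as flat, aligned per-pixel sequences, and the function and key names are hypothetical.

```python
from collections import defaultdict


def out_of_school_by_region(rate, population, region):
    """Aggregate pixel-level out-of-school counts into regional totals.

    rate:       predicted proportion out of school per pixel (0-1)
    population: school-age population per pixel
    region:     region label per pixel
    All three are flat sequences of equal length (an illustrative assumption).
    """
    counts = defaultdict(float)
    totals = defaultdict(float)
    for r, p, g in zip(rate, population, region):
        counts[g] += r * p  # expected number out of school in this pixel
        totals[g] += p
    return {
        g: {
            "out_of_school": counts[g],
            # Population-weighted regional rate; NaN where nobody lives.
            "rate": counts[g] / totals[g] if totals[g] > 0 else float("nan"),
        }
        for g in counts
    }
```

Weighting by population in this way means that a sparsely populated region with a high rate can contribute fewer out-of-school adolescents than a densely populated region with a moderate rate, which is the distinction a quadrant-style comparison of rates and counts draws out.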

Discussion
This study focused on secondary school attendance for adolescents and school-age youths in Tanzania, Cambodia and the Dominican Republic. In Tanzania, more than 50% of this age group (14-19 years) were estimated to be out of secondary school education (mean 53.8%, IQR 51.4-60.2%). Based on estimated distance (Table 2 and Fig. 5), secondary schools were at least twice the distance (6.6 km, IQR 2.2-19.6 km) and at a greater travel time (0.8 h, IQR 0.2-3.0 h) from the population in Tanzania compared to Cambodia and the Dominican Republic. In Cambodia, the estimated percentage of adolescents and school-age youths aged 13-18 years out of secondary school was 40.0%, while in the Dominican Republic only 10.77% (IQR 9.7-11.7%) of adolescents and school-age youths aged 14-17 years were estimated to be out of secondary school. This represented ~2.8 million out of secondary school in Tanzania [...] important when targeting education interventions. For countries such as Tanzania and Cambodia, a possible geographic-related intervention could be to increase school availability and reduce travel time to secondary schools in regions with poor access (Table 2).
The secondary school non-attendance rate in mainland Tanzania estimated here corroborates previous education research and enrolment data for Tanzania (The United Republic of Tanzania-Government Basic Statistics Portal, 2016;Human Rights Watch, 2017). It is worth noting that an average distance of 5 km is commuted twice daily to secondary schools without boarding facilities. The long journey to secondary school contributes to the overall out-of-secondary school numbers, estimated here to be 2.8 million, alongside other factors not explored here, e.g. socioeconomic status, individual characteristics (e.g. attitude towards school), cultural factors, home environment, and lack of teachers (Sabates et al., 2010;Inoue et al., 2015;Gubbels et al., 2019). The modelled predictions of the out-of-secondary school population in mainland Tanzania are consistent with the estimate of 3 million from the 2014 education policy brief report (Tanzania Education Network, 2018). However, it is not clear, at sub-national levels, whether those who enrol at lower secondary level (forms 1-4) complete it and progress to the advanced level (forms 5 and 6) (Mashala, 2019). More research is required on secondary school enrolment, drop-out, and the likelihood of secondary school completion and retention in low- and middle-income countries.
The average travel time to school in the Dominican Republic was approximately 10 minutes (0.1 h, IQR 0.0-0.2), with ~0.1 million estimated to be out of secondary school. The Dominican Republic is smaller in terms of land surface area but had a larger number of secondary schools (n = 4618) compared to mainland Tanzania (n = 3258) or Cambodia (n = 1615). The Tanzanian mainland is 19 times larger by geographic size, while Cambodia is 4 times larger than the Dominican Republic. This suggests that physical access to school in the Dominican Republic is boosted by school availability (short distance) relative to population distribution.
There were some limitations in the analyses undertaken. Firstly, the assumptions underlying the analysis of travel times assigned to different land surfaces include a degree of subjectivity, although all the assumptions made during this analysis are based on values derived from previously published geographic accessibility studies. This includes measurement error in the covariates and, depending on the country context, these assumptions may lead to over- or under-estimation of the actual rates. Secondly, variations in the sizes of schools were not explicitly modelled. The size of a school could be driven by its location (urban or rural) as well as by the size of the underlying populations influencing use. The use of night-time light as a proxy for urbanisation adjusted for differences between urban and rural rates of attendance. However, other barriers such as household socioeconomic status, parents' level of education, decision-making at a household level, and gender differences (Huisman and Smits, 2009) were not explored further. The focus here was on predictive modelling of non-attendance rates, and explanatory modelling should be explored by future studies. Lastly, the geographical displacement of the DHS clusters could influence fine-scale estimates of school attendance. A random displacement is applied to urban clusters (by a maximum of 2 km) and rural clusters (by a maximum of 5 km), and an additional 1% of clusters are displaced by a maximum of 10 km. Given the interaction between cluster values and covariates used in the statistical models, the error introduced by the displacement was not explored analytically, but has been found to [...]
Table 2 Estimates of geographic accessibility to secondary school, and number of adolescents and school-age youths not attending secondary school in the Tanzania mainland by region. The number out of school is estimated by gender. For purposes of space, the tables for the Dominican Republic and Cambodia are included as supplementary tables (Supplementary Tables S1 and S2).
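The DHS cluster displacement mentioned in the limitations can be mimicked in a few lines of Python to explore its effect on extracted covariates. This is a sketch under stated assumptions, not the DHS procedure itself: coordinates are taken to be planar offsets in kilometres, the direction and distance are both drawn uniformly, and the 10 km maximum is applied to a random 1% of rural clusters. The function name is hypothetical.

```python
import math
import random


def displace_cluster(x_km, y_km, urban, rng):
    """Return a randomly displaced (x, y) position in kilometres.

    Urban clusters move by at most 2 km and rural clusters by at most 5 km;
    a random 1% of rural clusters (an assumption) may move by up to 10 km.
    """
    if urban:
        max_km = 2.0
    else:
        max_km = 10.0 if rng.random() < 0.01 else 5.0
    angle = rng.uniform(0.0, 2.0 * math.pi)  # random direction
    dist = rng.uniform(0.0, max_km)          # random distance (uniform, assumed)
    return x_km + dist * math.cos(angle), y_km + dist * math.sin(angle)
```

Repeating the displacement many times with a seeded `random.Random` instance and re-extracting travel time at each displaced location would give a simple empirical view of how sensitive a cluster's covariate value is to the offset.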

Conclusion
The 2030 Agenda for Sustainable Development may not be achieved without investment in the education of adolescents and youths. Thus, unearthing the subnational variation in secondary school geographic accessibility and attendance is important in estimating the fine-scale variation of those physically marginalised from education and provides indicators for monitoring SDG 4.3.1 (UNDP, 2019). Using an example of three low-income and middle-income countries (Tanzania mainland, Cambodia, and the Dominican Republic), the number of adolescents and youth of school age who are out of school varies within and between countries, and many are physically marginalised. In general, inequalities in access to secondary education have a future impact on national economies and require national investment to remove disparities and ensure no adolescents and youths are left behind. Alongside improving physical access and reducing inequality in these countries, it would be beneficial to investigate at a micro (household) level and a macro level the role of other factors, such as direct and indirect costs and the quality of provision, on out-of-school rates.

Data availability
School data are publicly available for all three countries as referenced in the main article, and the URLs have been provided. The DHS data are publicly available online through data request at https://www.dhsprogram.com/data/available-datasets.cfm.