Measuring and forecasting progress towards the education-related SDG targets

Education is a key dimension of well-being and a crucial indicator of development1–4. The Sustainable Development Goals (SDGs) prioritize progress in education, with a new focus on inequality5–7. Here we model the within-country distribution of years of schooling, and use this model to explore educational inequality since 1970 and to forecast progress towards the education-related 2030 SDG targets. We show that although the world is largely on track to achieve near-universal primary education by 2030, substantial challenges remain in the completion rates for secondary and tertiary education. Globally, the gender gap in schooling had nearly closed by 2018 but gender disparities remained acute in parts of sub-Saharan Africa, and North Africa and the Middle East. It is predicted that, by 2030, females will have achieved significantly higher educational attainment than males in 18 countries. Inequality in education reached a peak globally in 2017 and is projected to decrease steadily up to 2030. The distributions and inequality metrics presented here represent a framework that can be used to track the progress of each country towards the SDG targets and the level of inequality over time. Reducing educational inequality is one way to promote a fairer distribution of human capital and the development of more equitable human societies.

The ratio of the 90th to the 10th percentile is another commonly used relative inequality metric that does not use the mean of the distribution. For variables such as per-person income, which vary widely between contexts, a relative measure such as the Gini coefficient is an intuitive choice. For phenomena that do not vary greatly between contexts, researchers could consider absolute measures, such as the standard deviation, to be more appropriate.
In the first stage, we explore the relationship between each metric and average years of schooling. This is a useful heuristic to understand how each metric behaves over the observed range of average values. We find that relative measures of inequality, such as the Gini coefficient, coefficient of variation, or 90th-to-10th percentile ratio, are highly collinear with average years of schooling, with correlation coefficients of -0.96, -0.96, and -0.84, respectively. This is expected, as relative metrics like the Gini coefficient have the mean value of the distribution in their denominator. In the context of a highly bounded indicator like education, this induces very high correlation. Highly uneducated populations have high Gini coefficients, as only a few individuals have more than zero years of schooling. As populations become more educated, their Gini coefficients trend steadily towards zero, with minimal variation between populations. Absolute measures of inequality, such as the AID, have more variable relationships with average attainment. Overall, the correlation coefficient between the AID and mean years of schooling was 0.08. Employing either the AID or the standard deviation provides estimates of inequality that vary considerably more at a given level of mean years of schooling than do relative measures of inequality.
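To make the behavior of these metrics concrete, the short sketch below computes each of them for two synthetic populations that share the same mean years of schooling but differ in spread. It is a minimal illustration only: the example values and the pair-averaging convention used for the AID are assumptions for demonstration, not the exact computational definitions used in our estimation pipeline.

```python
import numpy as np

def aid(years):
    """Average interpersonal difference: mean absolute difference in years of
    schooling across all pairs of individuals (an absolute inequality metric)."""
    y = np.asarray(years, dtype=float)
    return np.abs(y[:, None] - y[None, :]).mean()

def gini(years):
    """Gini coefficient: mean absolute pairwise difference scaled by twice the mean."""
    y = np.asarray(years, dtype=float)
    return np.abs(y[:, None] - y[None, :]).mean() / (2 * y.mean())

def coefficient_of_variation(years):
    y = np.asarray(years, dtype=float)
    return y.std() / y.mean()

def p90_p10_ratio(years):
    y = np.asarray(years, dtype=float)
    p90, p10 = np.percentile(y, [90, 10])
    return np.inf if p10 == 0 else p90 / p10

# Two illustrative populations with the same mean (7.5 years) but different spread.
low_spread = np.array([6, 7, 7, 7, 8, 8, 8, 9])
high_spread = np.array([0, 0, 3, 7, 9, 12, 14, 15])
for name, pop in [("low spread", low_spread), ("high spread", high_spread)]:
    print(name, round(aid(pop), 2), round(gini(pop), 2),
          round(coefficient_of_variation(pop), 2), round(np.std(pop), 2))
```

Because the mean is identical in these two toy populations, the relative and absolute metrics move together here; the divergence described above arises when the mean itself varies widely across country-years.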

Supplemental Figure 2. Global Trajectories in Educational Inequality and Mean Attainment by Metric
Educational inequality, as measured by each metric, is shown on the y axis against mean years of schooling, globally and for each super-region. The dots mark 2018, the beginning of the forecasts, and 2030, the SDG target year. Regional trends were aggregated from n=195 national-level modeled estimates.
If we consider the Gini coefficient, we would conclude that the most unequal nations observed in our study are those that are just beginning to educate their populations: those with the lowest mean attainment values and only a few individuals achieving any years of schooling. This kind of distribution of education was commonplace among low-income nations in 1970; however, by 2018 all nations had at least begun to educate their populations to a meaningful degree, and most nations appeared to be near the peak of their Kuznets curve, if not significantly further along it. If we consider the AID metric, however, we would conclude that the most unequal societies are those developing nations that have both a high proportion of people who still receive little to no schooling and another large segment that receives many years of schooling. Supplemental Figure 4 illustrates the kinds of societies that are deemed "most unequal" under each metric. The figure is best interpreted alongside supplemental figure 1 for context. Under the Gini coefficient's assumptions, Niger in 1970 would be considered a "maximally unequal" context, with a Gini of 0.98, because almost none of the population has any schooling. The few years of available schooling are attained by a small number of individuals, virtually all of whom have fewer than 7 years. A "maximally unequal" society under the AID assumptions is exemplified by Yemen in 2018, where over 30% of the population of 25-29-year-olds still has zero schooling, while over 40% of the same population has 12 or more years of schooling. As stated earlier, the choice of metric is normative. The authors of this study consider Yemen in 2018 to have much greater inequality than Niger in 1970 and have chosen the AID as the metric by which to present results in this paper. Recognizing that consensus is unlikely ever to be reached on the best measure of inequality, regardless of its application, we present results for alternative measures in this Supplement and also make our entire database of results available, so that other researchers can easily produce the measure of inequality they believe is most relevant to their research endeavors.

Choice of Inequality Metric - Conclusions
Ultimately, we chose to use the AID as our measure of educational inequality in the main text of this study. We believe that for education, an absolute metric such as the standard deviation or the AID is the most policy-relevant measure of inequality. In considering the range of distributions we observed in our study, we concur with the interpretation yielded by the AID metric: that the most unequal societies, in terms of education, are those at the peak of their Kuznets curve that have large proportions of individuals with zero schooling and large proportions with post-secondary education. Given that a metric of inequality provides an additional measure for policy makers to care about and aim to improve upon, we find that the AID is the most policy-relevant metric, as at a given level of educational attainment it adequately captures how unequally years of schooling are distributed within a population.
We suggest that the degree of relativity of an inequality metric, which can be changed by modifying the value of β in the equation above, should reflect the inherent boundedness of a measure. For income, which has hyperbolic variation between countries, using a β of 1 is intuitive. Likewise, for years of schooling, which is limited in range by the human life course, a β value of 0 is a reasonable choice. For other, intermediately-bounded measures, employing a set of inequality metrics with a range of β values would be advisable.
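As a minimal sketch, assuming the β-indexed family divides the average interpersonal difference by the mean raised to the power β (the referenced equation is not reproduced here), the snippet below shows how β interpolates between an absolute metric (β = 0, AID-like) and a relative metric (β = 1, Gini-like up to a factor of two); the functional form and example values are assumptions for illustration.

```python
import numpy as np

def inequality(years, beta=0.0):
    """Hypothetical beta-indexed inequality metric (assumed form): the average
    interpersonal difference divided by mean**beta. beta=0 is absolute (AID-like);
    beta=1 is relative (Gini-like, up to a factor of two)."""
    y = np.asarray(years, dtype=float)
    mean_abs_diff = np.abs(y[:, None] - y[None, :]).mean()
    return mean_abs_diff / y.mean() ** beta

schooling = np.array([0, 0, 4, 8, 12, 12, 16])
for beta in (0.0, 0.5, 1.0):
    print(f"beta={beta}: {inequality(schooling, beta):.3f}")
```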

Supplemental Figure 4. "Most Unequal" Societies Under Gini and AID Assumptions
The distribution of years of schooling for two example populations illustrating "maximally unequal" populations under the assumptions of the Gini and AID metrics. Each panel reflects individuals aged 25-29 for a single country-year. This figure is best interpreted alongside supplemental figure 1 for context.
For education, we ultimately chose to use the average interpersonal difference as the measure of educational inequality by which to present results, as we find the quantity it represents, the average difference in educational attainment between any two individuals in a population, to be easy to understand and relevant both to our conceptualization of inequality and to policy makers. Given that the mean is "in the denominator" of the Gini coefficient, expanding the mean level of schooling will almost always reduce measured inequality, regardless of whether it occurs among those who are best or least off. The AID, on the other hand, changes depending on which part of the population is targeted for progress. This attribute makes it a more policy-relevant metric: as countries invest in education and increase the mean years of schooling of their populations, it tracks how well they are doing in distributing those years of schooling equally.

Comparison to Previous Education Modeling Efforts
Here we summarize major alternative education modeling efforts and discuss differences in methodology, scope, and results, and their interpretation. The two sets of education estimates that are used most extensively and are closest in scope to the work presented in this study are those produced by Barro and Lee and by the Wittgenstein Centre.
The estimates produced by Barro and Lee are among the most highly cited, have pioneered many aspects of measuring educational attainment, and have been updated several times [7][8][9][10]. Their framework involves drawing mainly on census data and using a timeseries approach. They model broad bins (categories) of educational attainment, as defined by the ISCED classifications 11, and subsequently calculate mean years of schooling using standard duration assumptions. As we have shown previously, the standard duration method is useful in enforcing consistency between mean years of schooling and the modeled distribution of education; however, it includes some assumptions that can introduce bias and error into the estimates 12. Recent estimates draw on about 600 total sources and cover 146 countries from 1950 through 2010.
A body of work produced by the Wittgenstein Centre has a distinct approach to modelling educational attainment that draws on demographic (cohort component) models [13][14][15]. This entails drawing on a recent data source for each country from a year close to the analytical base year, most recently 2015 15. One base dataset is chosen per country, for a total of about 200 sources. Populations are then forecasted to 2100 and backcasted to 1950 using a number of expert-driven and empirical assumptions about education, fertility, migration, life expectancy, and other key factors. This approach is highly useful for studying the interconnection of education with other aspects of development, especially population growth. This kind of model allows for explicit examination of policy levers for changing population-level dynamics, and holds key insights for the scientific understanding of population development, with implications for policy-makers 16. Among other assumptions, the modeling framework generally assumes "global convergence" in forecasts, whereby all countries slowly drift towards global averages of parameters. Education is represented by a number of bins, including complete and incomplete attainment at numerous levels, as defined by ISCED standards.
Our work differs from these two major previous efforts in a few important ways that reflect its central aim of serving as a benchmarking and forecasting exercise towards the education-related SDG targets. We include 3,180 census and survey data sources, which, to our knowledge, represents the largest database of education data. Given the known sampling and non-sampling error in measures of education 17,18, we opt to include all possible sources of data and leverage a modeling framework that synthesizes disparate data using sampling variance and between-source deviation to estimate final results. Our data adjustment approach offers an additional correction for bias between data providers, creating a set of more directly comparable estimates. Including all possible data entails both advantages and challenges compared to other estimation strategies. The Wittgenstein approach benefits from internal consistency, as well as flexibility in the calculation of forecasting scenarios under varying assumptions. However, it has a limited capacity to automatically include multiple data sources across the available timeseries. By using only one data source per country, these estimates assume that the chosen source is valid and reliable and does not suffer from any non-sampling error. Alternatively, our approach allows for the inclusion of temporally overlapping data from numerous data providers, as well as data from multiple time periods. As a prediction effort, our work is more similar to that of Barro and Lee, using a timeseries modelling approach, although the details of our model vary substantially. We use nearly five times the input data and model the full single-year distribution of education, which better captures drop-out patterns relative to descriptions using wider bins of attainment and allows for a more precise measurement of educational inequality 12. We also provide projections through 2030 and novel metrics of educational inequality that are focused on benchmarking regional and national progress towards meeting the education-related SDG targets.
The forecasts produced by the Wittgenstein Centre are highly useful in understanding and quantifying how investments in education will ripple out through other domains of development. For instance, the "SDG Education Scenario" allows us to appreciate how impactful achieving the education-related SDG targets could be for global health and well-being 15 . On the other hand, our work is a descriptive effort seeking to provide timely evidence on which countries are currently on track to achieve these important goals, and which countries have disparities that require renewed investments and attention. We employ a framework that is designed to maximize country-level predictive power, including an unprecedented level of data coverage and predictive validation for the field. Though our work is not directly comparable to the Wittgenstein forecasted scenarios (in methodological approach, the actual indices measured, and the main aims of the estimates), it is highly complementary. We produce country-level evidence about trajectories towards meeting international education targets, and the Wittgenstein scenarios provide strong evidence about the importance of meeting these goals for myriad areas of global development.

Correlation to Previous Education Modeling Efforts
In addition to out-of-sample predictive validity, the reliability of our estimates was assessed by comparing the mean years of schooling predictions to previously published estimates. A comparison to previous estimates of the distribution of education, or educational inequality, was not possible, given the lack of a comparable set of numbers. Other efforts to model the distribution of education have measured attainment using bins such as "primary completed." These bins measure ISCED levels of attainment, rather than a specific number of years that is comparable between countries 13 . Our analysis is focused on number of years, and therefore a direct comparison is not possible. Nevertheless, we were able to compare our mean attainment estimates to previous work, as a general measure of the similarity of data and modeling outcomes. We used the most recent age and sex-specific estimates from Barro and Lee, as well as the Wittgenstein Centre, as comparators 19,20 .
Previous studies have documented systematic differences between the Barro-Lee data preparation process and our own, which result in small but systematic differences in the numbers 12. Nevertheless, a high degree of correlation was expected, and was observed. Overall, comparing the data provided at 5-year intervals from 1970 to 2010, a correlation coefficient of 0.94 was observed. Supplemental figure 5 shows country-year-age-sex specific data from both series of estimates.
Supplemental figure 6 compares mean years of schooling values between the estimates from the Wittgenstein Centre and the current study. It is important to note that mean years of schooling is not the main outcome of our current work, which is focused on other metrics of the distribution and inequality in education that are more relevant to the SDG targets. Nevertheless, comparing this feature of the distribution of education, the mean, may serve as a helpful heuristic in understanding some of the methodological differences between our work and other education modelling exercises, and the implications for the interpretation of results. We compare our estimates to the Wittgenstein Medium (SSP2) education/population growth scenario, which assumes a continuation of recent trends and global convergence. For the in-sample period close to the base year of the Wittgenstein estimates, our results are highly concordant, with correlation coefficients over 0.9 in every 5-year increment observed. However, we observe that our approach produces slightly more optimistic results in the forecasts than the baseline Wittgenstein scenarios. Importantly, we also show a wider range of country-level values, with a greater number of outliers, which is logical in the absence of convergence assumptions. We feel that this is appropriate for a country-level benchmarking and forecasting exercise. Nevertheless, we note that our estimates do show a significant degree of global convergence in the forecasted results, which is to be expected as countries approach inherent limits in bounded phenomena.
Finally, we reiterate that this exercise is not meant to be a comprehensive comparison between our work and prior efforts to model the distribution of education. The main outcome measures and contributions of our work are not directly comparable. Nevertheless, these limited comparisons may shed some light on systematic differences in methodological approach and the interpretation of results.

Supplemental Figure 6. Mean Years of Schooling from Wittgenstein Centre and Current Study
Each boxplot represents the distribution of average years of schooling for a country-year-age-sex group with overlapping data between the most recent Wittgenstein Centre estimates and the current study. All values reflect individuals aged 25-29. The box and central line represent the 25th, 50th (median), and 75th percentiles of each distribution. The top whisker (vertical line) represents the largest value equal to or lower than the 75th percentile plus 1.5 times the inter-quartile range. The bottom whisker represents the smallest value equal to or greater than the 25th percentile minus 1.5 times the inter-quartile range. Values outside this range are shown as points. The distribution of mean values is shown across countries and is not weighted by population size. Pearson correlation coefficients are shown for each year and were fit on n=348 country-sex-year specific estimate pairs for years up to and including 2010, and n=374 country-sex-year specific estimate pairs for 2015 and after.

Limitations
Creating comparable estimates of education is made more challenging by the great degree of heterogeneity in national education systems. For example, instead of modeling attainment of "secondary education" using country-specific standards that can vary widely from 8 to 13 years, we employ globally standardized metrics of the number of individuals completing 12 years of schooling. These have the advantage of creating comparable metrics of attainment between countries, but they cannot acknowledge particularities between schooling systems or capture variations in the social capital that 12 years of education may confer in a given country. Our estimates also do not reflect the quality of education, only the quantity of years attained. Schooling hours and quality of instruction vary between and within countries 8,9,21,22, which is an important mediator of the effect of education on health and development not accounted for here [23][24][25]. We also do not account for non-traditional sources of education, such as open online courses 26, or other forms of learning outside of formal classroom environments.
Our analysis, like any exercise in prediction, is also limited by data availability. Data quality and the number of available data points differ by country. Therefore, we leverage regional information to make predictions about temporal trends in many countries. We may consequently be missing important country-specific trends in data-sparse areas. Nations with smaller populations and less developed economies tend to have more data gaps. These gaps are reflected in wider prediction intervals, which should be considered in the interpretation of results. We also explicitly examine our ability to reliably extrapolate using out-of-sample predictive validity (see supplement).
Our projections assume that trends observed in recent decades will continue into the future. We do not leverage expert knowledge or model the capacity of the education system directly. This approach may fall short in cases where inherent limitations of the educational system, e.g. the number of universities, represent unobserved covariates that lead to unexpected changes in trends. Nevertheless, a comparison to previous efforts modelling education for the set of overlapping countries and years shows a high degree of concordance between our estimates and previous measures. It is also important to note that international migration is an important factor that could affect the interpretation of our results. Differential migration by level of education could, for example, change the apparent equality of education in a country, regardless of how equally the educational system of that country actually operates. Finally, it is important to consider that most data sources conflate sex and gender, and we are unable to differentiate between these dimensions in a meaningful way in this work.

Model Selection and Predictive Validity
In order to determine the hyperparameters and model specifications for each model in our analysis, including the source type adjustment, cohort extrapolation model, K-nearest neighbors algorithm, and forecasting models, we used several out-of-sample predictive-validity (OOS-PV) tasks [27][28][29][30][31]. To test how well each model or hyper-parameter set performed, we "knocked out" a portion of the training database and used the remaining data to make predictions that could be compared directly to the known values. Given that each model varied in its purpose, holdout structures were constructed in a model-specific fashion to best reflect the prediction task at hand. Full details are provided in the sections below.
All models were evaluated using median absolute error, which provides an estimate of central tendency that is robust to outliers. We also include median error as a measure of bias, and root mean squared error as an alternative error summary statistic. Models were evaluated with respect to overall performance in mean years of schooling, as well as other distributional characteristics, such as the proportion with 0 years of schooling. We also assessed the degree to which OOS-PV varied by decade, region, and type of data held out. In general, the best-performing models tended to perform best across almost all geographies and time periods, so it was not necessary to use multiple models for a single step.
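For reference, a minimal sketch of these error summaries is shown below; the function name and input layout are illustrative, not the exact implementation used.

```python
import numpy as np

def oos_pv_summary(predicted, observed):
    """Illustrative OOS-PV error summaries: median absolute error (robust to
    outliers), median error (signed, a measure of bias), and root mean squared error."""
    err = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    return {
        "median_abs_error": float(np.median(np.abs(err))),
        "median_error": float(np.median(err)),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }
```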

Data Adjustment Model
The data adjustment model serves to correct for bias between different data sources. There are known sources of non-sampling error in educational data series, and we leverage information about these systematic differences to make adjustments to input data. This first requires the definition of "gold standard" data sources, which are presumed to be the most correct data source on average. In almost all cases we used IPUMS or DHS data as the gold standard, given their large sample sizes, transparent methods, and standardized data presentation format. For each region, IPUMS or DHS data were chosen as the gold standard depending on which source had more years of available data. That regional gold standard was subsequently also chosen as the country-specific gold standard for each country in the region, if there was at least one data point in that country from the data provider. If, for example, IPUMS was considered the regional gold standard, but a country in that region only had DHS data, then DHS would be chosen for the country. A country with both IPUMS and DHS data would follow the regional choice. The data adjustment model makes corrections for two types of situations: A) when multiple data sources, including gold standard data, are available for the same country, and B) when no gold standard data are available for a country, but we still wish to correct for known underlying biases in the data providers that are available for that country. We therefore employ a model that acts on each of these levels, and consequently design holdouts to test the predictive validity for each level. The models were compared to a baseline approach of simply not adjusting the data.
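The gold-standard selection rule can be summarized in a short sketch; the input table, column names, and function name below are hypothetical, but the logic follows the description above.

```python
import pandas as pd

def pick_gold_standard(sources: pd.DataFrame) -> dict:
    """Return a country -> gold-standard-provider mapping.
    `sources` is assumed to have one row per data point with columns:
    region, country, provider, year."""
    gold = {}
    candidates = sources[sources["provider"].isin(["IPUMS", "DHS"])]
    for region, reg_df in candidates.groupby("region"):
        # Regional gold standard: whichever of IPUMS/DHS has more years of data.
        regional_pick = reg_df.groupby("provider")["year"].nunique().idxmax()
        for country, ctry_df in reg_df.groupby("country"):
            present = set(ctry_df["provider"])
            if regional_pick in present:
                gold[country] = regional_pick   # follow the regional choice
            else:
                gold[country] = present.pop()   # fall back to the other provider
    return gold
```

Countries with neither IPUMS nor DHS data receive no country-specific gold standard, which is the situation addressed by Task 1 below.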

Task 1: No Gold Standard Data Available
In this task, we simulated the total lack of gold standard data for a country by removing all of the IPUMS or DHS data from one third of the locations for which it is present. We subsequently ran the full model and compared, all other variables held constant, how predictive validity varied by model. In effect, the two thirds of remaining gold standard data were used to adjust all non-gold-standard data, which was then compared to the held-out one third of the gold standard data. This had the effect of simulating the instances in which some gold standard data were present in the region, but not in a country of interest, to see how well regional adjustments could apply across countries. Put another way, we assessed the degree to which biases by data provider were consistent between countries.
This was repeated three times, to minimize the impact of random noise on the OOS-PV statistics. The resulting error statistics were averaged over the three iterations to produce the visualizations below. This task sought to capture the ubiquitous bias by data provider that can be introduced via sampling and survey methodologies. For instance, phone-based surveys such as Eurobarometer, because they only access populations with a connection to a phone, can easily overestimate the educational attainment of a population, and so they should be adjusted downward accordingly.
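A minimal sketch of this knockout scheme is shown below, assuming a simple list of locations that have gold standard data; the function and variable names are illustrative.

```python
import numpy as np

def task1_knockouts(locations_with_gold, n_repeats=3, seed=0):
    """For each repeat, drop ALL gold-standard data from a random one third of the
    locations that have it; the returned location sets define the held-out data
    against which the refit adjustment model is evaluated."""
    rng = np.random.default_rng(seed)
    holdouts = []
    for _ in range(n_repeats):
        locs = list(locations_with_gold)
        rng.shuffle(locs)
        holdouts.append(set(locs[: len(locs) // 3]))
    return holdouts
```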

Task 2: Gold Standard Data Present
In this task, we removed some, but not all, of the gold standard data points for locations in which there were more than two observations of gold standard data. In countries where there is more than one observation, we removed half of the data in order to test how well our data adjustment model performed. This assesses how well the model adjusts for bias within a country that does have gold standard data present. By removing half of the gold standard data, and using the remaining half to adjust non-gold-standard sources, we can assess the reliability of these adjustments, relative to the held-out gold standard data. This was repeated twice, and the resulting error statistics were averaged over the two iterations to produce the final validity statistics.
The models assessed for data adjustment all took the basic form:

y_{a,s,t,l} = β_0 + β_1 · x_{a,s,t,l,d} + γ_r · x_{a,s,t,l,d} + γ_{l:r} · x_{a,s,t,l,d}

where y_{a,s,t,l} is the quantity of interest (either the proportion of the population with no education or mean years of schooling) for a given age, sex, year, and location combination; x_{a,s,t,l,d} is the corresponding value reported by data provider d; γ_r is a region-specific random effect which captures the average bias between surveys and censuses across all countries within that region; and γ_{l:r} is a location-specific random effect which captures the additional bias between a location-specific gold standard (where applicable) and the other sources present in that location. The above model was fit separately by group: Model 1 was run separately by region and sex, Model 2 separately by super-region, and Model 3 separately by region.
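As an illustration only (not the exact specification above), a simplified variant of such an adjustment model can be fit as a mixed-effects regression of the gap between each non-gold-standard observation and its matched gold-standard value, with a provider fixed effect, a region-level random effect, and a nested location-level variance component; the data frame and column names below are hypothetical.

```python
import statsmodels.formula.api as smf

def fit_adjustment_model(df):
    """df: one row per matched observation, with columns
    bias (source value minus gold-standard value), provider, region, location."""
    model = smf.mixedlm(
        "bias ~ C(provider)",                        # average bias by data provider
        data=df,
        groups="region",                             # region-specific random intercept
        vc_formula={"location": "0 + C(location)"},  # nested location-level effect
    )
    return model.fit()

# Fitted provider, region, and location effects can then be subtracted from
# non-gold-standard data points to produce adjusted, more comparable estimates.
```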

Supplemental Figure 7. OOS-PV Statistics for Source-Type Adjustment Candidate Models
The baseline model, shown in red, represents no data provider adjustment. It is compared to the 3 candidate data source models, as specified above. Predictive validity statistics were calculated for the 'Gold Standard - Missing All' and 'Gold Standard - Missing Some' prediction tasks using n=17,431 and n=16,489 country-year-age-sex-data provider specific error values, respectively.

Results
All model specifications showed improved performance across all metrics as compared to the baseline model, representing no adjustment. This was true overall (supplemental figure 7) and when stratified by decade (supplemental figure 8). This validated the decision to perform some kind of data adjustment as opposed to leaving the input data unadjusted. The best performing model across the two tasks was model 3, which was run separately by region, though model 2 did perform slightly better in instances where some gold standard data were available. This result suggests that survey series vary consistently by region, and that gender is a less important dimension. While some surveys that only target heads of households or women could have a gender-specific bias, this was not observed in a systematic manner in our results. It is important to note that "region" and "super-region" are used here as defined in the Global Burden of Disease study (GBD) 32. In the main text we use GBD super-regions, and refer to them as "regional groupings" for simplicity.

Supplemental Figure 8. OOS-PV for Source-Type Adjustment Candidate Models by Decade and Task
The baseline model, shown in red, represents no data provider adjustment. It is compared to the 3 candidate data source models, as specified above. Predictive validity statistics were calculated for the 'Gold Standard - Missing All' and 'Gold Standard - Missing Some' prediction tasks using n=17,431 and n=16,489 country-year-age-sex-data provider specific error values, respectively.

Cohort Extrapolation
The cohort extrapolation model seeks to leverage the stability of the educational attainment of cohorts over time. This feature of education has been widely used for modeling purposes in numerous analyses 7,13,33. Within a cohort, education is fairly stable after age 25. For example, the education of 35-year-olds in 2000 is likely to be highly similar to the education of 45-year-olds in 2010. However, we do wish to capture differential mortality by education over age 65. To test how well our extrapolation model performed, we used data from surveys or censuses repeated at regular intervals. We only used data from within a single data family, e.g. DHS surveys from 2000 and 2005 from the same country. We used an OOS-PV approach in which we held out repeat observations of the same cohort over time. In a 3-fold knockout scheme, one third of all sources containing repeat observations of a cohort over time were held out. We subsequently ran only the cohort extrapolation model, and the resulting adjusted data were compared against extrapolated data in which no change in education over time was assumed. The resulting error statistics were averaged over the three iterations to produce final predictive validity statistics.

Supplemental Figure 9. OOS-PV for Cohort-Extrapolation Candidate Models
The performance of the three candidate models is shown for predicting within-cohort change in mean educational attainment over time, as compared to a baseline model which assumes no change. For each model, predictive validity statistics were calculated over n=11,126 country-year-cohort-sex-data provider specific error values.
In each candidate model, I denotes a natural spline with a knot at age 70, intended to capture potential nonlinearity in the rate of change of differential mortality by education over age, and each model includes a random intercept fit at varying levels of geography and sex.
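A minimal sketch of this idea is shown below, substituting a simple piecewise-linear spline with a single knot at age 70 for the natural spline and using purely hypothetical coefficients; it is meant only to illustrate how the basis allows the rate of change to differ above and below the knot.

```python
import numpy as np

def age_spline_basis(age, knot=70):
    """Piecewise-linear spline basis in age with one knot: a column for age and a
    column for the amount by which age exceeds the knot (a simplified stand-in
    for the natural spline described above)."""
    age = np.asarray(age, dtype=float)
    return np.column_stack([age, np.maximum(age - knot, 0.0)])

# Hypothetical coefficients: a small drift before age 70 and a steeper change after,
# reflecting stronger differential mortality by education at the oldest ages.
coefs = np.array([0.001, 0.010])
ages = np.arange(65, 86, 5)
predicted_within_cohort_change = age_spline_basis(ages) @ coefs
```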

Results
All three models showed improvements in OOS-PV over the baseline model, which assumed no differential mortality by education (supplemental figure 9). This was also observed almost uniformly when stratified by decade (shown in supplemental figure 10). Nevertheless, improvements in predictive validity were modest. This is likely due to the scarcity of data tracking cohorts between the ages of 65 and 85. Survey series like the DHS often have small sample sizes at older ages due to sampling techniques, resulting in a diminished signal of change in the mean educational attainment of a cohort over time. Furthermore, global coefficients are applied to all cohorts, even though the effects of education on preventing early adult mortality may differ by geography. Finally, these results do not account for differential migration by educational attainment. Despite these limitations, this model shows an improvement over the baseline model and has important implications for the final estimates.

Supplemental Figure 10. OOS-PV for Cohort-Extrapolation Candidate Models by Decade
The performance of the three candidate models is shown for predicting within-cohort change in mean educational attainment over time, as compared to a baseline model which assumes no change. Results are stratified by decade. For each model, predictive validity statistics were calculated over n=11,126 country-year-cohort-sex-data provider specific error values.

Ensemble K-Nearest Neighbors Distribution Model
The K-nearest neighbors distribution model has a number of hyper-parameters (as defined in the methods section) which were optimized using OOS-PV. Essentially, we wanted to choose the set of hyper-parameters that would best allow us to predict distributions out of sample. We divided our dataset into two portions, 90% training data and 10% testing data. We fit the model using the training data and compared our results out of sample to the held-out testing data. This was repeated 3 times to reduce random noise. Only gold standard data, from IPUMS and DHS, were used as testing data in this exercise, due to their consistently large sample sizes and reporting of data in single-year bins. A "grid search" of hyperparameter values was conducted, meaning that we tested all possible combinations of the below values: Error statistics were combined for each multi-year bin of educational attainment, including no education, 1-6 years, 7-12 years, and 13-18 years. We assessed predictive validity by taking the average median absolute error for each combination of hyperparameters across each level of education (shown in supplemental figure 11).
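A sketch of this grid-search procedure is shown below; the hyperparameter names follow the text, but the candidate values and the fit_and_score helper are illustrative assumptions rather than the configuration actually used.

```python
import itertools
import numpy as np

# Hypothetical candidate values for each hyperparameter (span, K, psi, and the
# cohort/space/age distance weights); the real grid is not reproduced here.
GRID = {
    "span": [0.2, 0.5, 0.8],
    "K": [10, 50, 100],
    "psi": [0.5, 1.0, 2.0],
    "cohort_weight": [1, 2],
    "space_weight": [1, 2],
    "age_weight": [1, 2],
}

def grid_search(fit_and_score, n_repeats=3, seed=0):
    """Evaluate every combination with repeated 90/10 holdouts. `fit_and_score(params, seed)`
    is assumed to fit the KNN distribution model on the training split and return the
    average median absolute error across education bins on held-out gold-standard data."""
    results = []
    for values in itertools.product(*GRID.values()):
        params = dict(zip(GRID.keys(), values))
        errors = [fit_and_score(params, seed + r) for r in range(n_repeats)]
        results.append((float(np.mean(errors)), params))
    return min(results, key=lambda item: item[0])
```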

Results
The algorithm was relatively robust to the choice of hyperparameter specification. The average median absolute error ranged from 0.045 to 0.072 at the extreme ends of the spectrum. Generally speaking, the algorithm performed best with low levels of smoothing (span), a medium number of candidate distributions (K), higher weight on distributions closer in space/time/cohort (Psi), and greater emphasis on cohort and space distance than on age distance (CW, SW, and AW). The best performing combination of hyperparameters was the following:

Forecasting Model
Finally, we assessed the predictive validity of several approaches to forecasting the distribution of education. In this task, the final 10 years of data were held out, and each forecasting model was used to predict the held-out period of the data.
Two model versions, "Forecasting by Means" and "Forecasting by Distributions," were considered for this predictive task. In the former, the modeling framework specified in the main section of this paper was used for all years 1950-2030. That is to say, a linear prior was fit on the data and used with GPR to produce estimates of average educational attainment and of the proportion of the population with no education through 2030. Subsequently, the KNN algorithm was run using these predictions as inputs for choosing candidate distributions. This model assumes that a) it is appropriate to forecast mean years of schooling and the proportion with 0 schooling using linear models, and b) it is appropriate to apply past distributions to these projected parameters using the KNN distribution model. In sum, this model takes the best-performing version of the main model used for past years and runs it forward through 2019-2030 to produce forecasts.
The second model uses the distributions of education previously modeled from 1970 through 2018, as described in the methods section, and extrapolates the rate of change of each component of the distribution. We theorized that because mean years of education have not historically changed linearly, using distributions of single-year bins of educational attainment would be a better choice for extrapolation. If proportions increase more linearly in logit space, then we can more accurately predict these proportions and subsequently calculate a more accurate set of mean years of schooling. This was based on the notion that modeling individual component proportions of the distribution would more closely reflect the underlying mechanism of educational development in populations.
We first calculated the rate of change, separately by country and sex, for each multi-year bin of educational attainment (no education, 1-6 years, 7-12 years, and 13-18 years) over the last 10 years of the estimates. We subsequently applied this rate of change to forecast each timeseries through 2030. After this first stage provided estimates of the coarse bins, we employed a similar rate-of-change model for each single-year bin of educational attainment to predict the granular distribution over time. We ensured all proportions were internally consistent with the aforementioned predicted multi-year bins by raking the sum of the single-year bins to the total proportion of each larger bin. Subsequently, we calculated the mean years of schooling from these distributions. Finally, we tested the predicted distributions against the 10 years of held-out testing data.
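A minimal sketch of the logit-space rate-of-change extrapolation and the raking step is shown below, assuming annual proportions are available as simple arrays; the helper names and the clipping constant are illustrative.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

def extrapolate_logit(props, n_ahead):
    """Extend an annual proportion series by carrying forward its average rate of
    change in logit space over the last 10 years (requires at least 11 values)."""
    z = logit(np.clip(props, 1e-6, 1 - 1e-6))
    rate = (z[-1] - z[-11]) / 10
    return expit(z[-1] + rate * np.arange(1, n_ahead + 1))

def rake(single_year_props, bin_total):
    """Rescale single-year proportions so they sum to the forecast multi-year bin total."""
    p = np.asarray(single_year_props, dtype=float)
    return p * bin_total / p.sum()
```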

Results
The rate-of-change distribution model had better performance in predicting mean attainment than the model that directly forecasts mean attainment. This was evaluated in terms of overall OOS-PV for mean attainment (supplemental figure 12). We also assessed performance with respect to other outcome variables, and by knockout region and data source (supplemental figure 13). This increase in performance is likely due to the more granular, distribution-based model better capturing the nonlinear growth in mean attainment. While economic development does increase mean attainment roughly linearly for a period, there is an inflection point at which increases in mean educational attainment tend to level off. Using distributions to predict mean educational attainment is therefore a better choice than using the means themselves for forecasting, as it better approximates the mechanisms of educational growth in the population. For example, in many cases the most rapid growth in mean attainment is derived from reductions in the proportion completing zero years of schooling and increases in primary attainment. Once these changes are complete, the relatively slower growth in secondary schooling becomes the dominant driver of change, reflected in a gradual leveling off of educational attainment. A completely linear model of mean attainment misses this nuance, whereas a distributional model can better approximate this mechanism. This type of distributional model is also likely more reflective of the actual governmental levers and policies that are used to increase educational attainment among populations. For example, the elimination of school fees for a particular level of education, changing the number of years of compulsory schooling, building new schools for a particular level, or increasing the number of years required to complete a certain level are all mechanisms that might create linear expansion in a specific level of schooling.

Supplemental Figure 12. OOS-PV for KNN Candidate Forecasting Models
OOS-PV is compared for the two candidate forecasting models. All metrics are shown for mean attainment. Predictive validity statistics were calculated over n=13,627 country-year-age-sex-knockout round specific error values.

Supplemental Figure 13. Other Dimensions of OOS-PV Considered for Forecasting Models
Median absolute error (MAE) is shown by outcome variable, data provider, and knockout region to illustrate other dimensions of variation that were considered in selecting a forecasting model. Predictive validity statistics were calculated over n=13,627 country-year-age-sex-knockout round specific error values.