Mapping routine measles vaccination in low- and middle-income countries

The safe, highly effective measles vaccine has been recommended globally since 1974, yet in 2017 there were more than 17 million cases of measles and 83,400 deaths in children under 5 years old, and more than 99% of both occurred in low- and middle-income countries (LMICs)1–4. Globally comparable, annual, local estimates of routine first-dose measles-containing vaccine (MCV1) coverage are critical for understanding geographically precise immunity patterns, progress towards the targets of the Global Vaccine Action Plan (GVAP), and high-risk areas amid disruptions to vaccination programmes caused by coronavirus disease 2019 (COVID-19)5–8. Here we generated annual estimates of routine childhood MCV1 coverage at 5 × 5-km2 pixel and second administrative levels from 2000 to 2019 in 101 LMICs, quantified geographical inequality and assessed vaccination status by geographical remoteness. After widespread MCV1 gains from 2000 to 2010, coverage regressed in more than half of the districts between 2010 and 2019, leaving many LMICs far from the GVAP goal of 80% coverage in all districts by 2019. MCV1 coverage was lower in rural than in urban locations, although a larger proportion of unvaccinated children overall lived in urban locations; strategies to provide essential vaccination services should address both geographical contexts. These results provide a tool for decision-makers to strengthen routine MCV1 immunization programmes and provide equitable disease protection for all children.

likely to be misclassified as routine despite this process -particularly those from recall and from surveys 125 where coding of SIA doses may have been inconsistent. While we aim to produce as representative an

156
National-level time series of survey estimates were examined to assess systematic survey bias 2 . When the 157 estimates of a particular source appeared implausible in the context of other data or external information 158 on the reliability of that source, the survey methodology, case definitions, and data collection for that 159 source were evaluated, along with the underlying spatial patterns of the original survey data. Sources The following data were extracted from each survey source: vaccine card or home-based record (HBR) 171 doses, parental recall vaccine doses, age (in months), survey weight and design variables, and GPS-cluster 172 or areal location. Individuals with evidence of vaccination either from HBR or recall were considered to 173 have been vaccinated. Individuals were excluded from the analysis if they were missing age, spatial, or 174 survey design information or were outside of the study age or year range. The study included all years 175 between 2000 and 2019. A comprehensive overview of data from all study geographies included can be 176 found in Supplementary Fig. 2.

177
Individual age, in months, at the time of survey collection was used to assign each child to a birth 178 cohort (12-23 months, 24-35 months, 36-47 months, and 48-59 months). Data corresponding to each 179 birth cohort were included in the modelling process in the year in which that birth cohort was aged 0-12 180 months old 12 . This method allows the inclusion of additional individuals, which increases overall 181 geographical representation, but requires assumptions such as negligible catch-up vaccination and no 182 differential mortality or migration. However, the influence of including older cohorts in our model on the 183 key findings appeared to be minor (Supplementary Information section 2.2).

184
Wherever possible, we removed doses recorded on vaccination cards or from parental recall that 185 were attributable to SIAs. First, we reviewed the underlying survey methodology to identify, where 186 applicable, the coding used to identify doses delivered through SIAs rather than routine immunisation 187 services. We confirmed the occurrence of each SIA by cross-referencing the summary documentation of

204
To be included in the model-based geostatistical framework, all data must be geo-located at the 205 precision of a specific latitude and longitude coordinate pair. For areally located data that has been 206 matched to an administrative unit but for which no GPS coordinates were available, we used a previously 207 described method to generate a set of candidate cluster locations where survey sampling could plausibly 208 have occurred, using population weights and k-means clustering 14 (Supplementary Fig. 3). In brief, 209 10,000 points were randomly sampled from the area within the administrative boundary that was matched 210 to the areal data. Points were sampled proportionally to the total population within that space and time as 211 estimated by the WorldPop population raster 15 . A set of integration points (1 per 1,000 5 × 5-km pixels) 212 was generated via k-means clustering, representing candidate locations of survey sampling under the 213 assumption that each survey sampled locations proportional to population. These integration points were 214 used as input locations for the areal data in the geospatial model. Each integration point was assigned the 215 mean MCV1 coverage value previously calculated for the entire areal location. A weight was also 216 assigned to each integration point (i.e., candidate observation location) that is proportional to the number 217 of randomly sampled points that were included in the k-means cluster, such that all weights sum to 1 for 218 each areal data observation. The "survey" package in R 16 was used to determine a conservative sample 219 size for the entire areal unit, and the sample size at each integration point was determined using this 220 calculated total sample size and the weights of each integration point. Overall, this method produces a set 221 of candidate observations of vaccine coverage spread over the area in question, with increased weight 222 given to locations that are more densely populated. 223 224 1.3.6 Geospatial covariate selection 225 We used a variance inflation factor (VIF) algorithm for geospatial covariate selection before using the 226 subsequent covariates in child models. We chose to use a covariate selection algorithm before fitting child 227 models to remove collinear covariates to facilitate model convergence. For each coordinate point where 228 data is located, geospatial covariate values were extracted. The VIF algorithm works by computing the 229 variance of each covariate, which is conflated when multicollinearities exist in the covariate values 17 . The

230
VIF was computed for each covariate, removing the largest incrementally, until the VIF for each 231 remaining covariate was less than 3. This algorithm was run for all modelling regions (as discussed in Coverage was modelled in 13 continuous geographic regions, and the outputs of these regional models 239 were combined to create comprehensive maps of antigen-dose-specific coverage in LMICs. Countries 240 were assigned to one of 13 regions. Regions were decided by adapting regions defined by GBD, which 241 are constructed to group countries together by epidemiological similarity and geographical proximity 242 (Extended Data Fig. 2). We modified these regions to ensure geographical contiguity. We included a one-243 degree buffer to each region when modelling to ensure minimal effects of regional edges in final results.

244
Computationally, it is much more efficient to fit multiple regions in parallel that are smaller in area.

245
Additionally, fitting models in separate regions allows for different regional relationships in coverage 246 with geostatistical covariates.

249
Stacked generalisation is an ensemble method for combining models and has shown to increase predictive 250 accuracy in spatiotemporal disease modelling 18 . We use stacked generalisation in our modelling 251 framework to combine predictions of our child models (lasso, GAM and boosted regression trees) for 252 three main reasons. First, its use allows us to increase our predictive accuracy, although they reduce the 253 causal interpretation of the subsequent results. Next, as not many geospatial covariates largely explain 254 vaccination coverage, stacked generalisation maximise the predictive power of the covariates we do have.

255
Third, staked generalisation allows us to account for the complex, non-linear relationships between 256 covariates and outcomes in our model.

257
The choice of using stacked generalisation breaks the ability for our model to be used to describe 258 any causal pathway. Ideally, models would be optimised for prediction and causal inference. However, as 259 our objective is predictive accuracy rather than causal inference, we believe that the benefits of inclusion

281
With this type of spatial mesh, the spatial distance matrix as part of the Matérn covariate, was calculated 282 using the greatest circle distance along the surface of the sphere (i.e. Earth).

283
We define the SPDE on B " , a spherical manifold, and use the great circle distance as described in 284 the SPDE framework. Practically, this is implemented in our modelling pipeline by first converting data 285 locations and geographic region boundaries to three-dimensional coordinates on the unit-sphere. Next, 286 using the INLA "inla.mesh.create" function, we construct an B " , mesh using the three-dimensional data 287 and boundaries.

Model fitting
290 Models were fit using R-INLA as described previously 14 . We have chosen to use R-INLA for model 291 fitting and interpretation to leverage its computational tractability when running models across large 292 spatial scales. This is possible through maximising computational efficiency via an inherent parallelising 293 structure to exploit multi-core processing and also generalising across latent Gaussian models which 294 allows for a significant amount of inference to be automated.

295
Cluster-level observations for which precise latitude and longitudinal coordinates are known were 296 assigned a weight of 1. Areal data that was resampled as described above were given weight as a function 297 of the K-means clustering process. These weights were used in the R-INLA fitting process to ensure both 298 observation types were given equal contribution to the log-likelihood function of the model.

299
During model fitting, we apply a linear sum to 1 constraint on the child model ! / 's using the 300 "extraconstr" argument of R-INLA 20 , such that: 301 302 where A is a matrix of dimensions (1,1, length of vector of the fixed effects (i.e. 3)) and e = 1.

305
Model fitting parameters can be found in Supplementary year, for instance when the children were one year older than the target age (t+1), two years older than the 414 target age (t+2), or three years older than the target age (t+3). We evaluated the effect of the time lag 415 between survey observations for a given birth cohort using the following regression analysis, which 416 yielded the results shown in Supplementary Information

Supplementary Figures
Supplementary Fig. 1: Household surveys included in modelling All data are shown mapped to their corresponding geographic area. This is prior to candidate points being generated for areal data. The sample size represents the number of individual children included in the underlying microdata record for each survey. From 2000 to 2019, MCV1 data availability is shown for a) Caribbean, b) Central America, c) Central Asia and Eastern Europe, d) Central sub-Saharan Africa, e) East Asia, f) Eastern sub-Saharan Africa, g) North Africa and Middle East, h) Oceania, i) South America (Andean), j) South America (Tropical), k) South Asia, l) Southern sub-Saharan Africa, and m) Western sub-Saharan Africa. Countries excluded from this analysis due to lack of reliable survey data are shown in grey. a. b.
Supplementary Fig. 2: Data inclusion flow chart Survey identification from the GHDx and data loss during cleaning and processing is described below. 11 To be included in final analyses, data had to be within the countries included and within temporal range of the study, meet survey design standards, and contain individual-level microdata on age, MCV1 coverage, and subnational geographic location. Panel a indicates survey-level identification criteria and panel b indicates individual-level details after survey identification.
Supplementary Fig. 3: Areal data spatial resampling Areal data was resampled for inclusion in the geospatial models using k-means clustering and underlying population weights, via weighted survey data estimates of MCV1 coverage for each areal unit in a given country-survey-year (panel a), population surface from WorldPop used during the resampling process (panel b), 10,000 randomly sampled points proportional to the underlying population weights of panel b (panel c), and weighted integration points following k-means clustering to be used in the geospatial model (panel d).
Supplementary Fig. 4: Geospatial covariates used in modelling Twenty-six candidate geospatial layers of covariates possibly related to MCV1 coverage were selected for use in the stacked generalisation modelling process via a variance inflation factor selection algorithm per region. Individual covariates are cited in Supplementary  Posterior means and 95% uncertainty intervals are shown at the 5 × 5-km pixel level. Pixels that are grey in colour are either not included in the analysis, or have been classified by being "barren or sparsely vegetated" or had fewer than 10 people per 1 × 1-km pixel 15,26 .
Supplementary Fig. 15: First administrative level mean, lower, and upper uncertainty limits of MCV1 coverage, 2019 Posterior means and 95% uncertainty intervals are shown at the first administrative level. Pixels that are grey in colour are either not included in the analysis, or have been classified by being "barren or sparsely vegetated" or had fewer than 10 people per 1 × 1-km pixel 15,26 .
Supplementary Fig. 16: Second administrative level mean, lower, and upper uncertainty limits of MCV1 coverage, 2019 Posterior means and 95% uncertainty intervals are shown at the second administrative level. Pixels that are grey in colour are either not included in the analysis, or have been classified by being "barren or sparsely vegetated" or had fewer than 10 people per 1 × 1-km pixel 15,26 .
Supplementary Fig. 17: District level MCV1 coverage as function of uncertainty via Coffey-Feingold-Bromberg metric, 2019 Intensity of pink indicates areas that have lower uncertainty and lower coverage; intensity of blue indicates areas with higher coverage but also higher uncertainty. Purple indicates areas that have both low coverage and also higher uncertainty. Pixels that are grey in colour are either not included in the analysis, or have been classified by being "barren or sparsely vegetated" or had fewer than 10 people per 1 × 1-km pixel 15,26 .    Table 4 6 Identify and describe any categories of input data that have potentially important biases (e.g., based on characteristics listed in item 5).
Supplementary Table 4 For data inputs that contribute to the analysis but were not synthesised as part of the study: 7 Describe and give sources for any other data inputs. Supplementary Tables 6-7 For all data inputs: 8 Provide all data inputs in a file format from which data can be efficiently extracted (e.g., a spreadsheet rather than a PDF), including all relevant meta-data listed in item 5. For any data inputs that cannot be shared because of ethical or legal reasons, such as third-party ownership, provide a contact name or the name of the institution that retains the right to the data.  Tables 9-12  13 Describe methods for calculating uncertainty of the estimates. State which sources of uncertainty were, and were not, accounted for in the uncertainty analysis.

Supplementary
Methods (Post-estimation) 14 State how analytic or statistical source used to generate estimates can be accessed.
Data availability; Code availability Results and discussion 15 Provide published estimates in a file format from which data can be efficiently extracted.
Data availability 16 Report a quantitative measure of uncertainty of the estimates (e.g., uncertainty intervals).
Main text (Tracking uneven progress in routine MCV1 coverage); Supplementary Figs. 14-16, 17 17 Interpret results in light of existing evidence. If updating a previous set of estimates, describe the reasons for changes in estimates.

18
Discuss limitations that affect interpretation of the estimates.

Methods (Limitations)
Supplementary   Table 4: Input sources included in final analysis List of sources with GHDx identification number included in the final analysis along with number of GPS and areally located clusters is provided. These sources can also be explored following this link to the GHDx input data sources tool: http://ghdx.healthdata.org/lbd-publication-data-inputsources.