Gridded birth and pregnancy datasets for Africa, Latin America and the Caribbean

Understanding the fine scale spatial distribution of births and pregnancies is crucial for informing planning decisions related to public health. This is especially important in lower income countries where infectious disease is a major concern for pregnant women and new-borns, as highlighted by the recent Zika virus epidemic. Despite this, basic data on the numbers and distribution of births and pregnancies are often of coarse spatial resolution and difficult to obtain, with no co-ordination between countries and organisations to create one consistent set of subnational estimates. To begin to address this issue, under the framework of the WorldPop program, an open access archive of high resolution gridded birth and pregnancy distribution datasets for all African, Latin American and Caribbean countries has been created. Datasets were produced using the most recent and finest level census and official population estimate data available and are at a resolution of 30 arc seconds (approximately 1 km at the equator). All products are available through WorldPop.


Background & Summary
Accurate and detailed information on the spatial distribution and numbers of births and pregnancies is crucial for informing planning decisions related to public health 1 . The survival and health of women and their new-born babies in low income countries is a key priority, with the reduction of maternal and neonatal mortality central for meeting a number of the United Nations Sustainable Development Goals (specifically goals 3.1 and 3.2) 2 . Whilst progress has been made, there were still 303,000 maternal deaths in 2015 (ref. 3) and children in lower income countries are 14 times more likely to die during their first 28 days of life compared to their higher income counterparts. Despite this, the spatial detail of basic data on the numbers and distribution of births and pregnancies is often of a coarse resolution and difficult to obtain 4 , with no co-ordination between countries and organisations to create one consistent set of subnational estimates for planning. Whilst there are clear inequalities of maternal and neonatal healthcare between nations 5 , there are also large disparities within individual countries, with growing recognition that national levels and trends could be masking important sub-national variations 6 . For example, a study in Indonesia found that under-5 mortality was nearly four times higher in the poorest fifth of the population than in the richest fifth 7 , and gaps like these are more likely to occur at the sub-national level [7][8][9] . Although progress has been made in reducing such inequalities, there is still substantial work to be done. As such, understanding subnational variation and inequity in health status, wealth and access to resources is increasingly being recognised as central to meeting developmental goals 10 . 
To understand and tackle inequalities related to maternal and neonatal health, the first step is to have a detailed knowledge of the distribution of births and pregnancies, which is known to vary substantially due to population age and sex distribution and age specific fertility rates (ASFR) 4 . These are also valuable data for subnational planning and estimation, and calculation of subnational indicators that rely on births or pregnancies as a denominator. When considering maternal and neonatal health in lower income countries, infectious disease is a major concern as pregnant women and new-borns are particularly at risk from many diseases, such as malaria 11 and HIV 12 . This issue has recently been highlighted by the Zika virus outbreak in Latin America, further intensifying the need for detailed information on the number and distribution of births and pregnancies. Currently there is a clear lack of data for such analysis, with complete and continuous datasets of numbers of births only available at the national level (e.g., United Nations Population Division 13 ). Whilst sub-national datasets are readily available for some countries, their spatial detail is often coarse, and differences in the recorded metrics, sampling framework and data formats make it extremely difficult to assess burden within and across multiple nations. This study aims to overcome the data gap identified above by producing continental scale, gridded datasets of numbers of births and pregnancies with a spatial resolution of 30 arc seconds (approximately 1 km at the equator). Advances in computational power and spatial econometric techniques, as well as the increasing availability of geo-located data, have increased the ability to produce these fine spatial resolution datasets. As such, in the framework of the WorldPop project (www.worldpop.org), and extending the approaches described by Tatem et al. 4 , an open access archive of gridded birth and pregnancy distribution datasets for all African, Latin American and Caribbean (LAC) countries has been created. This process used the most recent and finest level census, census microdata, household survey data and official population estimate data available to the authors at the time of writing, alongside a range of geospatial datasets.

Methods
Gridded estimates of live births were produced for 50 Latin American and Caribbean and 58 African countries at a spatial resolution of 30 arc seconds. This was achieved by combining the latest datasets on population distribution, population age and sex structure and fertility rates in a GIS environment. Estimates of pregnancies were additionally generated using national-level estimates for stillbirths, miscarriages and abortions from the Guttmacher Institute 14 . The workflow develops the methods presented by Tatem et al. 4 , using a variety of data sources to construct continent-wide datasets. The process is fully automated by a Python script, allowing the rapid processing of multiple countries and alignment to a standard grid for the production of seamless continental scale datasets. The workflow is shown in Fig. 1 and described in detail below. Maps of the data sources and dates for each country, and of whether urban and rural ASFR estimates were available, can be found in Supplementary Figures 1 and 2 respectively.

The basis for estimation: population distributions
The population distribution forms a major component of the births and pregnancy estimation process. The WorldPop project has recently completed construction of gridded population distribution datasets for all low- and middle-income countries at a resolution of 30 arc seconds. Full details are provided on the WorldPop website (www.worldpop.org.uk) along with links describing the methods in detail [15][16][17] . This study uses the relevant regions of Africa (Data Citation 1) and Latin America and the Caribbean (Data Citation 2), whose total population is adjusted to match the most recent United Nations Population Division (UNPD) 18 2015 estimates available when the population distribution datasets were produced. Figure 2a shows the gridded population distribution dataset for Bolivia as an example. To ensure data consistency, a WorldPop standard grid was used in processing; this is a gridded dataset providing ISO country codes at a resolution of 30 arc seconds (Data Citation 3).
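As a rough check on the stated cell size, the east-west width of a 30 arc second cell can be approximated on a spherical Earth. The sketch below assumes an equatorial circumference of 40,075.017 km (a value not taken from this article) and an illustrative latitude; the function name is hypothetical.

```python
import math

# Assumed equatorial circumference in km (not a value from the article).
EQUATORIAL_CIRCUMFERENCE_KM = 40075.017


def cell_width_km(latitude_deg, arc_seconds=30.0):
    """Approximate east-west width of a grid cell on a spherical Earth."""
    km_per_degree = EQUATORIAL_CIRCUMFERENCE_KM / 360.0
    degrees = arc_seconds / 3600.0
    return km_per_degree * degrees * math.cos(math.radians(latitude_deg))


width_equator = cell_width_km(0.0)
width_mid = cell_width_km(-16.5)  # illustrative latitude only
```

Under these assumptions the cell width is a little under 1 km at the equator and narrows with the cosine of latitude, so cell areas are not constant across the study region.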

Calculating the proportion of women of reproductive age
Sub-national information on age and sex structure was collected, specifically for women of childbearing age grouped into seven 5-year age groups (i.e., 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49), as defined by the UNPD 18 . Datasets for the majority of Africa were provided by Pezzulo et al. 19 whilst datasets for the remaining African, Latin American and Caribbean countries were assembled from a variety of sources, following the protocols defined by Pezzulo et al. 19 . Table 1 (available online only) shows the source, spatial detail (i.e., administrative unit level) and reference year used for the countries processed in this study.
With the raw data recorded and documented according to different protocols determined by national governments, the project was presented with a wide range of table data formats and schemas. Data restructuring was achieved using scripting (R 3.3.1, Python 2.7) and table processing software (Microsoft Excel 2013). The resultant standardised tables contained fields corresponding to the proportionate values of people (both sexes) in each 5-year age group, and the overall proportion of males and females in each region. Table 2 shows an example of the standardised tables, for regions of Peru.
Age and sex structure information was matched to vector geographical boundaries from the Global Administrative Areas (GADM) database 20 , with the exceptions of Chile and Colombia where boundaries from the National Statistics Office were used. The extent of these boundaries was standardised to those defined by the WorldPop gridded ISO country code dataset using the Clip and Nibble tools in ArcGIS 10.3, executed as part of the Python 21 script. Figure 2b shows the distribution of females between the ages of 20 and 24 for Bolivia. Similar distributions were created for all other 5-year age groupings in the 15-49-year range.

Estimating fertility rates
Data on fertility was collected on a country-by-country basis to provide the most up to date and spatially detailed information. Data sources were chosen using a hierarchical approach as shown in Fig. 1, prioritising sources which included information on age specific fertility and those of the highest spatial detail. The type of fertility data used for each country is shown in Table 3 (available online only). As with the age and sex structure datasets, restructuring and table processing was carried out using a variety of scripting and software packages (Python 2.7, R 3.3.1, Microsoft Excel 2013) to produce a common format and schema for each data type, as described in detail below.
For DHS and MICS surveys, ASFRs were estimated using a Stata program developed by Pullum 50 , as discussed in Tatem et al. 4 . The program derived ASFRs for each of the seven 5-year age groups covering the reproductive life span from 15-49 years by dividing the number of births to women in each age group during a retrospective 3-year reference period by the number of women-years lived in that age group during the same period. Data restructuring and table processing was carried out using R 3.3.1 and Microsoft Excel 2013 to produce a common format, such as that shown in Table 4.
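The ratio underlying the survey-based ASFRs can be illustrated with a minimal Python sketch. It is not the Stata program referred to above; the function name and the tallies of births and women-years are hypothetical values chosen only to show the arithmetic.

```python
AGE_GROUPS = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]


def age_specific_fertility_rates(births, women_years):
    """Return births per woman-year for each 5-year age group."""
    rates = {}
    for group in AGE_GROUPS:
        exposure = women_years[group]
        # Guard against age groups with no exposure in the sample.
        rates[group] = births[group] / exposure if exposure > 0 else 0.0
    return rates


# Hypothetical tallies for a 3-year reference period (not survey data).
births = {"15-19": 90, "20-24": 240, "25-29": 210, "30-34": 150,
          "35-39": 80, "40-44": 30, "45-49": 5}
women_years = {"15-19": 1500, "20-24": 1200, "25-29": 1050, "30-34": 1000,
               "35-39": 800, "40-44": 750, "45-49": 500}

asfr = age_specific_fertility_rates(births, women_years)
```

A production calculation would also need the survey weights and the exact allocation of exposure to age groups, which this sketch leaves out.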
Datasets representing the boundaries of subnational regions (Table 3 (available online only)) were assembled and the relevant ASFRs matched to them. If the ASFR data was available for urban and rural areas within the subnational regions, the MODIS 500 m Global Urban Extent dataset 51 was used to distinguish urban and rural areas, and the corresponding urban or rural ASFR was allocated as a constant value within each. Figure 2d shows an example ASFR dataset for Bolivia, showing the ASFRs by sub-region for one reproductive age group (20-24). Similar datasets were constructed for all other 5-year age groups within the 15 to 49 range.
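The allocation of constant urban and rural ASFR values to grid cells can be sketched as a simple lookup. The example below uses small nested lists in place of rasters; the function name, region codes and rates are hypothetical, and the published workflow performs this step with ArcGIS tools rather than with this code.

```python
def allocate_asfr(region_grid, urban_grid, asfr_lookup):
    """Build an ASFR grid from each cell's region and urban/rural class."""
    asfr_grid = []
    for region_row, urban_row in zip(region_grid, urban_grid):
        asfr_row = []
        for region, is_urban in zip(region_row, urban_row):
            if region is None:
                # Cell lies outside the country: keep it as no-data.
                asfr_row.append(None)
                continue
            stratum = "urban" if is_urban else "rural"
            asfr_row.append(asfr_lookup[(region, stratum)])
        asfr_grid.append(asfr_row)
    return asfr_grid


# Hypothetical inputs: region codes, an urban mask (1 = urban) and ASFRs.
region_grid = [["A", "A", "B"],
               ["A", "B", None]]
urban_grid = [[1, 0, 0],
              [0, 1, 0]]
asfr_lookup = {("A", "urban"): 0.15, ("A", "rural"): 0.22,
               ("B", "urban"): 0.14, ("B", "rural"): 0.25}

asfr_20_24 = allocate_asfr(region_grid, urban_grid, asfr_lookup)
```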
For countries where ASFRs disaggregated sub-nationally and by urban/rural were not available, information on the spatial variation of age structured fertility was sought from vital registration systems, census records and other national sources. This was typically in the form of births registered per administrative unit per 5-year age grouping. Table 3 (available online only) shows for which countries this type of data (registered births per age group) was used, with Table 5 showing an example of the standardised table format for Venezuela. Sub-national ASFRs were calculated by dividing the number of births in each age group (e.g., Table 5) by the number of females in the corresponding group. The latter was derived from the WorldPop population distribution 15 and age-sex distributions produced following the methodology of Pezzulo et al. 19 and outlined in Table 1 (available online only).
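Deriving sub-national ASFRs from registered births requires the gridded female counts to be totalled per administrative unit before dividing. A minimal sketch of that zonal sum and division for a single age group follows; the function names, unit codes and counts are hypothetical.

```python
from collections import defaultdict


def zonal_sum(zone_grid, value_grid):
    """Total the gridded values falling inside each administrative unit."""
    totals = defaultdict(float)
    for zone_row, value_row in zip(zone_grid, value_grid):
        for zone, value in zip(zone_row, value_row):
            totals[zone] += value
    return dict(totals)


def subnational_asfr(registered_births, zone_grid, females_grid):
    """Divide registered births by gridded female counts, unit by unit."""
    females = zonal_sum(zone_grid, females_grid)
    return {zone: births / females[zone]
            for zone, births in registered_births.items()
            if females.get(zone, 0.0) > 0.0}


# Hypothetical unit codes, female counts and registered births (one age group).
zone_grid = [["X", "X", "Y"],
             ["X", "Y", "Y"]]
females_grid = [[100.0, 150.0, 200.0],
                [250.0, 300.0, 500.0]]
registered_births = {"X": 100, "Y": 150}

asfr_by_unit = subnational_asfr(registered_births, zone_grid, females_grid)
```

Units with no gridded females are dropped rather than assigned a rate, which is a choice made for this sketch only.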
The UNPD provides national estimates of ASFRs by 5-year age grouping for the majority of countries 18,24 . These datasets were used where subnational information was not available. As with all other datasets, the country boundaries defined by WorldPop 15 were used to define the geographical extent. For 9 countries, there was no information available on age specific fertility, either sub-nationally or nationally. In these cases, crude birth rates were obtained from a variety of official sources (Table 3 (available online only)) which were subsequently matched to the appropriate GIS country boundaries supplied by WorldPop 15 .

Estimating the number of births from fertility, population and age structure
For countries where measures of age specific fertility were available, the distribution of live births was estimated by multiplying the number of females in each age group (e.g., Fig. 2c) by the corresponding ASFR gridded dataset (e.g., Fig. 2d) or value (in the case of national ASFR estimates). The resultant seven age specific gridded datasets were summed to generate an estimate of total births. For countries where fertility was expressed simply as a crude birth rate (Table 3 (available online only)), the births distribution was calculated by multiplying the crude birth rate by the initial 30 arc second UNPD adjusted WorldPop population grid for 2015 (ref. 15).
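The two estimation routes can be sketched as cell-by-cell arithmetic on small grids. The example below is illustrative only: the function names and values are hypothetical, only two of the seven age groups are shown, and the crude birth rate is assumed to be expressed per 1,000 population, which the text does not state.

```python
def births_from_asfr(females_by_age, asfr_by_age):
    """Sum females x ASFR over all age groups, cell by cell."""
    groups = list(females_by_age)
    first = females_by_age[groups[0]]
    births = [[0.0] * len(row) for row in first]
    for group in groups:
        females = females_by_age[group]
        rates = asfr_by_age[group]
        for r, row in enumerate(females):
            for c, count in enumerate(row):
                births[r][c] += count * rates[r][c]
    return births


def births_from_crude_rate(population, births_per_1000):
    """Fallback route; the rate is assumed to be per 1,000 population."""
    return [[cell * births_per_1000 / 1000.0 for cell in row]
            for row in population]


# Hypothetical 1 x 2 grids for two of the seven age groups.
females_by_age = {"20-24": [[100.0, 50.0]], "25-29": [[80.0, 40.0]]}
asfr_by_age = {"20-24": [[0.2, 0.1]], "25-29": [[0.15, 0.1]]}
births = births_from_asfr(females_by_age, asfr_by_age)

crude_births = births_from_crude_rate([[1000.0, 500.0]], births_per_1000=30.0)
```

Where only a national ASFR is available, the same function applies if the single value is repeated across the grid.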
Finally, for each country, the distribution was scaled to match the UNPD estimate of the total number of births 18 . For countries where the UNPD does not provide an estimate of the total number of births, the initial modelled total was retained. An example of the final distributed births gridded dataset for Bolivia is shown in Fig. 2e, with results for Africa, Latin America and the Caribbean shown in Fig. 3 and Supplementary Table 1.
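The adjustment to the UNPD total amounts to multiplying every cell by a single country-level factor. A minimal sketch follows, with a hypothetical function name and values; a target of None stands in for countries without a UNPD estimate.

```python
def scale_to_total(grid, target_total):
    """Rescale a births grid so that its cells sum to the target total."""
    current_total = sum(sum(row) for row in grid)
    if target_total is None or current_total == 0:
        # No external estimate (or nothing to scale): keep the modelled values.
        return [row[:] for row in grid]
    factor = target_total / current_total
    return [[cell * factor for cell in row] for row in grid]


# Hypothetical modelled births for one country and an external total.
modelled = [[32.0, 9.0],
            [0.0, 9.0]]
adjusted = scale_to_total(modelled, target_total=60.0)
unadjusted = scale_to_total(modelled, target_total=None)
```

Because the factor is constant within a country, the adjustment changes the magnitude of the estimates but not their relative spatial pattern.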

Estimating the distribution of pregnancies
The Guttmacher Institute has published country specific estimates of the number of stillbirths, miscarriages and abortions at the national level 14 . These estimates for 2014 were integrated with UNPD national estimates of numbers of live births 18 to construct a ratio of pregnancies to live births. This ratio was applied to the live births distribution to generate an estimate of the distribution of pregnancies. For countries not covered by the Guttmacher dataset, the value from the nearest suitable country was used. An example of the final distributed pregnancies gridded dataset for Bolivia is shown in Fig. 2f, whilst results for the whole of Africa, Latin America and the Caribbean are shown in Fig. 3.
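One way to form such a ratio is sketched below. It assumes that pregnancies equal the sum of live births, stillbirths, miscarriages and abortions; this definition, the function names and all figures are assumptions made for illustration and are not taken from the Guttmacher or UNPD datasets.

```python
def pregnancy_ratio(live_births, stillbirths, miscarriages, abortions):
    """National pregnancies per live birth (assumed definition, see lead-in)."""
    pregnancies = live_births + stillbirths + miscarriages + abortions
    return pregnancies / live_births


def pregnancies_from_births(births_grid, ratio):
    """Apply the national ratio to every cell of the live births grid."""
    return [[cell * ratio for cell in row] for row in births_grid]


# Hypothetical national totals.
ratio = pregnancy_ratio(live_births=250000, stillbirths=5000,
                        miscarriages=45000, abortions=25000)

births_grid = [[10.0, 20.0],
               [0.0, 5.0]]
pregnancies_grid = pregnancies_from_births(births_grid, ratio)
```

As the ratio is national, any within-country variation in pregnancy outcomes is not represented in the resulting grid.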

Code availability
The Python code developed for production of the births and pregnancies datasets is publicly and freely available through Figshare 52 . The code consists of a Python programming language script (version 2.7; www.python.org) and relies on the ArcGIS 10.4.1 ArcPy site package for performing GIS specific spatial operations. The script is internally documented both to explain its purpose (including a description of the GIS-specific spatial operations it performs) and, where required, to guide the user through its customisation.

Data Records
The high-resolution births and pregnancies datasets described in this article, covering the 108 countries listed in Table 3 (available online only), are publicly and freely available through the WorldPop Repository (http://www.worldpop.org.uk/data/). Collections of these datasets have been compiled for births in LAC (Data Citation 4) and Africa (Data Citation 5) and for pregnancies in LAC (Data Citation 6) and Africa (Data Citation 7), as described in Table 6.

Technical Validation
All data collected, assembled and used were (i) already validated by the corresponding data collector, owner and/or distributor, and (ii) further checked in the framework of this project. The gridded 5-year age and sex count datasets constructed for Latin America and the Caribbean (e.g., Fig. 2c) were verified following the protocol outlined in Pezzulo et al. 19 , who compiled and assessed similar datasets for Africa and Asia. Briefly, this comprised summing all the layers into a single dataset (representing the total numbers of people for all age and sex groups at the grid cell level) and then subtracting it from the corresponding WorldPop continental gridded population count dataset to make sure that the country totals matched the UNPD estimates for the year in question. All fertility rates used in this study were checked, on a country-by-country basis, to make sure they were within reasonable ranges. Additionally, for countries where additional sources of fertility data were available, estimates were produced using all available sources to compare the adjusted total births. These results showed that, although differences may be observed at the grid cell level, the totals at the administrative unit level are very similar. Endeavours were made to assemble the most recent, reliable and spatially detailed data at the time of writing. However, additional input from readers who may have knowledge of or access to more recent and/or better datasets is welcome for improving future iterations of the outputs.
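The layer-summing check can be expressed as a residual between the summed age and sex layers and the total population grid. The sketch below uses hypothetical values and a hypothetical function name; it illustrates the idea rather than the protocol of Pezzulo et al.

```python
def max_abs_residual(age_sex_layers, total_population):
    """Largest absolute cell difference between summed layers and the total."""
    worst = 0.0
    for r, row in enumerate(total_population):
        for c, total in enumerate(row):
            summed = sum(layer[r][c] for layer in age_sex_layers)
            worst = max(worst, abs(total - summed))
    return worst


# Hypothetical 1 x 2 layers that should add up to the total population grid.
layers = [[[10.0, 20.0]],
          [[30.0, 40.0]],
          [[60.0, 40.0]]]
total_population = [[100.0, 100.0]]

residual = max_abs_residual(layers, total_population)
residual_bad = max_abs_residual(layers[:2], total_population)
```

A residual close to zero indicates that the disaggregated layers reproduce the population grid; the acceptable tolerance would depend on the numeric precision of the rasters.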
The accuracy and quality of fertility estimates from survey data, such as those provided by the DHS, have been assessed in several reports by testing the quality of the birth history data in a large number of countries. These checks aimed mainly to identify potential omission and displacement of births, or misreporting of dates of birth 53,54 . Overall, although a number of issues were identified for some countries, these studies found that most estimates were either good or of acceptable quality. Furthermore, outcomes from Pullum and Becker 53 show that in general the latest DHS surveys are less prone to issues like incomplete birthdates, omissions and displacement of births and deaths. Similarly, a more recent report from Pullum and Staveteig 55 , exploring the quality and consistency of age and date reports in DHS surveys, demonstrates that DHS data is constantly evaluated to improve its quality.
Modelled estimates of total number of births per country prior to adjustments (to match UNPD estimate) were also plotted against the UNPD estimates 18 to assess the size of differences obtained through using subnational data sources. Figure 4 shows the correlation for the 95 countries for which UNPD provides an estimate, with a corresponding R² value of 0.982. Analysis was not possible for the 9 countries for which the UNPD does not provide an estimate, although these make up a very small proportion (0.01%) of the total births across the whole study area.
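A comparison of this kind can be summarised with a squared Pearson correlation between the two sets of country totals. The sketch below is illustrative: whether the published value was computed in exactly this way is an assumption, and the totals and function name are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical country totals (thousands of births), not values from Fig. 4.
modelled_totals = [120.0, 480.0, 950.0, 2100.0]
reference_totals = [130.0, 450.0, 1000.0, 2050.0]

fit = r_squared(modelled_totals, reference_totals)
```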

Usage Notes
The datasets presented here can be used both (i) to support applications measuring sub-national metrics of maternal and new-born health and (ii) to inform planning decisions. However, because they are modelling outputs built on the WorldPop population distribution datasets, which were themselves generated using ancillary covariates, they should not be used to make predictions about, or explore relationships with, any of those ancillary datasets, in order to avoid circularity 56 . Thus, before using the births and pregnancies datasets in correlation analyses against factors which are included in the construction of the population distribution datasets (e.g., correlating birth distribution with land-cover), ideally the population modelling process should be re-run using the WorldPop-RF code 57 with the applicable covariates removed.