Dataset on child vaccination in Brazil from 1996 to 2021

We present a machine-readable and open-access dataset on vaccination results among children under five years old in Brazil from 1996 to 2021. This dataset is interoperable with epidemiological data from the VAX*SIM project and reusable by the research community worldwide for other purposes, such as monitoring vaccination coverage and studying its determinants and impacts on child morbidity and mortality. The dataset gathers official and public information from the Brazilian National Immunisation Program, the Institute of Geography and Statistics, the Institute for Applied Economic Research, and the Ministry of Health. It includes 2,442,863 observations and 35 attributes aggregated by year, policy-relevant geographic unit (country, macroregions, states, municipalities, and capitals), and age group. It covers 1,344,480,329 doses of 28 vaccines intended to prevent 15 diseases, estimates of target-population coverage, indicators of the homogeneity of vaccination coverage, dropout rates, and spatial, demographic, and socioeconomic data. We automated all data processing and curation in the free and open-source software R. The code can be audited, replicated, and reused to produce alternative analyses.

www.nature.com/scientificdata

Data description. The dataset covers 1,344,480,329 doses of 28 vaccines that protect against 15 diseases. It also includes estimates of target population sizes and indicators of vaccination coverage, homogeneity, and dropout rate. We add previously consolidated spatial, demographic, and socioeconomic data from the Brazilian Institute of Geography and Statistics (IBGE), the Institute for Applied Economic Research (IPEA), and the Ministry of Health (MoH), which are described and available elsewhere 8. Figure 1 presents a flow diagram for the dataset on child vaccination in Brazil from 1996 to 2021. Regarding the exclusion criteria, we did not consider vaccines that are not part of the routine childhood immunisation schedule. We also excluded the Polio, Measles, Mumps and Rubella doses administered during the campaigns because they are counted in the total doses applied during the year, composing the Polio vaccination coverage. Doses administered to children between 2 and 3 years old are considered late in the official Brazilian vaccination schedule. Therefore, we excluded this age group from our dataset.

Methods
Data construction: Workflow. The data workflow comprises five main steps (Fig. 2). The first step covered mapping the availability of data on administered doses of vaccines from the routine Brazilian childhood immunisation schedule in the TABNET application 5. TABNET is a generic public-domain tabulator that allows users to organise information from the databases of Brazil's national health system (SUS), including SIPNI, the National Immunisation Program Information System. We checked data availability on 41 vaccines, 12 types of doses, and 16 age groups of children under five years for the 5,570 municipalities in the country from 1996 to 2021. This resulted in 10,530 queries to TABNET.
The second step concerned web scraping data from the requests identified as valid in the previous step, which resulted in 700 separate raw files. We programmatically extracted the vaccination data on March 23, 2022. The third step covered the data combination and transformation processes, whose main characteristics were: (1) renaming variables and filtering observations; (2) correcting the codes and names identifying geographic units; (3) recategorising age groups; (4) cleansing numeric values, e.g., excluding special characters; (5) excluding vaccine doses not included in the routine Brazilian childhood immunisation schedule, as well as doses of scheduled vaccines administered in special situations or outside the recommended age groups; and (6) enriching the municipal datasets with data aggregated at the state, macroregion, and national levels.
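As an illustration of the numeric cleansing mentioned in item 4, the following Python sketch strips non-digit characters from a raw cell value. It is not the authors' R routine: the function name `clean_count` and the sample strings (a dot used as a thousands separator, a dash standing for an empty cell) are assumptions made for the example.

```python
import re


def clean_count(raw):
    """Return a dose count as an int, or None when the cell holds no digits.

    Assumes counts are non-negative integers, so every non-digit character
    (separators, spaces, placeholder dashes) can safely be dropped.
    """
    digits = re.sub(r"[^0-9]", "", raw)
    return int(digits) if digits else None


# Hypothetical raw values, as they might appear in a scraped table.
samples = ["1.234", " 56 ", "-"]
cleaned = [clean_count(value) for value in samples]
print(cleaned)
```

Returning `None` rather than zero for cells without digits keeps missing values distinguishable from true zero counts, which is a design choice of this sketch.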
The fourth step involved the application of business rules for the construction of vaccination indicators, which combine groups of vaccines with different formulations for the same disease (Tables 1, 2).
The fifth step in the workflow involved data integration, harmonisation, and enrichment. We linked the dataset produced in the previous step with a spatial, demographic, and socioeconomic dataset 8 according to the years and codes of the geographic units. The latter includes geocoding of municipality centroids, total population size, child population by age group, birth and mortality measures, the Brazilian Municipal Human Development Index (MHDI), the Gini coefficient, Gross Domestic Product (GDP), and sanitation. Furthermore, we created the following variables: (i) vaccination dropout number (number of children who did not receive the final dose in multi-dose schedules; i.e., the difference between the numbers of initial and final doses); (ii) vaccination dropout rate (vaccination dropout number per number of initial doses); (iii) vaccination coverage (number of final vaccine doses in single and multi-dose schedules per the estimated target population); and (iv) vaccination coverage homogeneity (proportion of municipalities that met the vaccination coverage goal in Brazil, macroregions, and states). The R code and the data processing and curation routines were peer-reviewed, and their results were compared with the information on official sites.
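The four derived variables (i) to (iv) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R code: the function names are hypothetical, rates are expressed as percentages, the dropout number is taken as initial minus final doses so that it is positive when children miss the final dose, and the 95% goal in the example is a placeholder.

```python
def dropout_number(initial_doses, final_doses):
    """(i) Children who started a multi-dose schedule but did not finish it."""
    return initial_doses - final_doses


def dropout_rate(initial_doses, final_doses):
    """(ii) Dropout number per initial dose, as a percentage (assumed unit)."""
    if initial_doses == 0:
        return None  # rate is undefined without initial doses
    return 100.0 * dropout_number(initial_doses, final_doses) / initial_doses


def coverage(final_doses, target_population):
    """(iii) Final doses per estimated target population, as a percentage."""
    if target_population == 0:
        return None
    return 100.0 * final_doses / target_population


def homogeneity(municipal_coverages, goal):
    """(iv) Percentage of municipalities whose coverage met the goal."""
    valid = [c for c in municipal_coverages if c is not None]
    if not valid:
        return None
    met = sum(1 for c in valid if c >= goal)
    return 100.0 * met / len(valid)


# Made-up numbers for one municipality and one state of four municipalities.
print(dropout_number(1000, 850))                        # 150
print(dropout_rate(1000, 850))                          # 15.0
print(coverage(850, 1000))                              # 85.0
print(homogeneity([96.0, 80.0, 101.5, 95.0], 95.0))     # 75.0
```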

Data Records
We provide a machine-readable and open-access dataset on Brazilian National Immunisation Program (PNI) 5 vaccination results in children under five from 1996 to 2021. The dataset has 2,442,863 observations and 35 attributes aggregated by years (1996-2021), policy-relevant geographic units (country, macroregions, states, municipalities, and capitals), and age groups (children under one year old, one year old, and four years old). See Table 3. The data resources described in this paper are freely and openly available on the Synapse repository at https://doi.org/10.7303/syn26453964 9. Synapse is a collaborative workspace for reproducible data-intensive research projects, which supports the integrated presentation of datasets, code, and documentation, fine-grained access control, and provenance tracking. Anyone can browse the content on the Synapse website, but users must register for an account with an email address to download the files and datasets. The download is possible using web or programmatic clients, such as R and Python 10. Table 4 provides an overview of the files and datasets stored in Synapse. We automated all data processing and curation in the free and open-source software R. Data files 1-3 hold the code for the extraction, transformation, and loading routines. We extracted the data in its original format (dataset 1) and separately saved the processed data from each workflow endpoint (datasets 2-4). Data file 4 builds the final dataset (datasets 5 and 6), which was integrated, harmonised, and enriched with spatial, demographic, and socioeconomic data 8. All R code can be audited, replicated, and reused to produce alternative analyses.
The HTML files show type-specific information for the attributes of all intermediate and final datasets, including statistical summaries and frequencies of missing values (data files 5-7). Data file 8 includes information about the requests made to the TABNET website, and data file 9 contains the responses to the web scraping. Data file 10 documents the metadata and attribute descriptions of the final dataset (datasets 5 and 6). Data file 11 presents maps and time trends of vaccination results.
Notably, the final dataset (datasets 5 and 6) gathers 35 attributes described in the dictionary of terms (data file 10), including spatial, demographic, and socioeconomic data, birth and mortality measures, administered dose counts, measurements of coverage, dropout, and homogeneity, and target population sizes according to age groups for 24 vaccination indicators, defined as vaccine groups with different formulations against the same diseases.

Technical Validation
Initially, we mapped the availability of vaccination data in children under five years on the TABNET website, considering all possible combinations of the variables of interest (municipalities, years, vaccines related to the routine Brazilian vaccination schedule, types of administered doses, and age groups). Subsequently, we rechecked the requests that did not return valid results through new programmatic queries to TABNET. In addition, we compared the results obtained from successful requests with those presented in TABNET, considering the absolute numbers of administered doses and the relative estimates of vaccination coverage. All R code for data extraction, processing, and curation was peer-reviewed. We also performed a time-trend analysis of vaccination indicators to detect abnormalities and inconsistencies.

Usage Notes
Several uses and caveats of the final dataset should be noted. First, it has a mixed ecological structure in which observation units are geographically defined populations at different time points. This structure allows interoperability with epidemiological data from the VAX*SIM project, whose main objectives are monitoring vaccination coverage among children under five years in Brazil and studying its determinants and impacts on child morbidity and mortality. The research community worldwide can use the dataset for other purposes, such as health inequality studies, multilevel analysis, and cross-country comparisons of vaccination results.
Additionally, it can be a valuable dataset for Brazilian health managers and professionals to evaluate compliance with vaccination goals, build data visualisation dashboards, and formulate programs or policies aimed at regions with lower vaccination coverage rates. Second, our vaccination indicators combine the administered doses of all antigen-specific vaccines for the same diseases. For example, we calculated poliomyelitis vaccination coverage among children under one year as the ratio of the sum of 3rd doses of five vaccines (OPV, IPV, DTaP/Hib/HepB/IPV, IPV/OPV, and DTaP/Hib/IPV) to the size of the target population.
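The poliomyelitis example can be written out as a short Python sketch. The five vaccine labels come from the text; the function name, the dictionary layout, and the dose counts are hypothetical, and the result is expressed as a percentage by assumption.

```python
# Vaccines whose 3rd doses count towards the polio indicator (from the text).
POLIO_VACCINES = ("OPV", "IPV", "DTaP/Hib/HepB/IPV", "IPV/OPV", "DTaP/Hib/IPV")


def polio_coverage(third_doses_by_vaccine, target_population):
    """Sum 3rd doses over the polio-containing vaccines and divide by the target."""
    total = sum(third_doses_by_vaccine.get(v, 0) for v in POLIO_VACCINES)
    return 100.0 * total / target_population


# Made-up 3rd-dose counts for one municipality and year; BCG is ignored
# because it is not one of the polio-containing vaccines.
third_doses = {"OPV": 300, "IPV": 450, "DTaP/Hib/HepB/IPV": 50, "BCG": 900}
print(polio_coverage(third_doses, target_population=1000))  # 80.0
```

Summing over a fixed list of formulations keeps the indicator comparable across years in which one formulation replaced another.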
Monitoring vaccination indicators by immunising type (e.g., Polio, BCG, and MR) is a common strategy to avoid spurious variations caused by changes in the routine schedule that group or replace vaccines over time. Regarding vaccination coverage, the indicators are based on the age at which children should receive the vaccine. Thus, we chose not to account for children vaccinated outside the usual schedule and excluded from this dataset the doses administered outside the recommended age group. In addition, the vaccination data refer to the place of administration, not necessarily the children's place of residence.
Third, while immunising 100% of the target population is theoretically possible, especially in small cities, true immunisation levels of 100% are unlikely. We observed coverage levels exceeding 100% in the dataset, likely due to systematic errors in the ascertainment of the numerator or denominator, mid-year changes in target age groups, or the inclusion of children from other cities in the numerator 11. Therefore, to analyse the data, we suggest categorising vaccination coverage as Very Low (<50%), Low (50% to the goal), High (the goal to 120%), and Very High (>120%), following the classification of Braz et al. (2016) 12 (Figs. 3, 4). We also capped coverage values at 150% to avoid implausible outliers.
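The suggested categorisation and the 150% cap can be sketched in Python as below. The cut-offs are those given in the text; the function names are hypothetical, and the handling of values falling exactly on a boundary (the goal itself and 120% are both treated as High) is an assumption, since the text does not specify it.

```python
def cap_coverage(coverage, maximum=150.0):
    """Limit coverage (in %) to the maximum value used in the dataset."""
    return min(coverage, maximum)


def classify_coverage(coverage, goal):
    """Map a coverage percentage to one of the four suggested classes.

    `goal` is the vaccine-specific coverage goal in percent; the value
    passed in the examples below is a placeholder.
    """
    if coverage < 50.0:
        return "Very Low"
    if coverage < goal:
        return "Low"
    if coverage <= 120.0:
        return "High"
    return "Very High"


# Made-up municipal coverages, capped first and then classified.
for raw in (42.0, 78.5, 95.0, 118.0, 212.3):
    capped = cap_coverage(raw)
    print(raw, capped, classify_coverage(capped, goal=95.0))
```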
Fourth, our vaccination coverage estimates for one-year-old children (from 2006 onwards) take the target population to be the total live births minus the infant deaths in the previous year. They may therefore diverge somewhat from MoH estimates, which consider only the number of live births in the previous year. Finally, we could only build the dataset at an ecological level. It includes the doses administered in public and private health services, but it was impossible to separate the dose counts of these two sectors. Our data extraction and processing routines are automated and sustainable, and we intend to update this dataset annually. Figure 3 is an example of how to apply the data to describe polio vaccination coverage among Brazilian municipalities in 2021, and Fig. 4 is another example of how to apply the data to describe the time trend of vaccination coverage in Brazil for children at one year old and under one year.
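The effect of the two denominators described in the fourth note can be illustrated with a small Python sketch. All numbers are made up, the variable names are hypothetical, and the comparison assumes that both estimates start from the same live-birth count.

```python
def coverage(final_doses, target_population):
    """Final doses per target population, as a percentage."""
    return 100.0 * final_doses / target_population


# Made-up figures for one municipality.
live_births_prev_year = 1000
infant_deaths_prev_year = 15
final_doses = 900

# Denominator described for this dataset: live births minus infant deaths.
dataset_target = live_births_prev_year - infant_deaths_prev_year
dataset_coverage = coverage(final_doses, dataset_target)

# Denominator attributed to the MoH estimates: live births only.
moh_style_coverage = coverage(final_doses, live_births_prev_year)

print(dataset_target)                 # 985
print(round(dataset_coverage, 2))     # 91.37
print(moh_style_coverage)             # 90.0
```

Because the dataset's denominator is never larger than the live-birth count, its coverage estimate is at least as high as the live-births-only estimate under this assumption.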