Despite a long history of mosquito-borne virus epidemics in the Americas, the impact of the Zika virus (ZIKV) epidemic of 2015–2016 was unexpected. The need for scientifically informed decision-making is driving research to understand the emergence and spread of ZIKV. To support that research, we assembled a data set of key covariates for modeling ZIKV transmission dynamics in Colombia, where ZIKV transmission was widespread and the government made incidence data publicly available. On a weekly basis between January 1, 2014 and October 1, 2016 at three administrative levels, we collated spatiotemporal Zika incidence data, nine environmental variables, and demographic data into a single downloadable database. These new datasets and those we identified, processed, and assembled at comparable spatial and temporal resolutions will save future researchers considerable time and effort in performing these data processing steps, enabling them to focus instead on extracting epidemiological insights from this important data set. Similar approaches could prove useful for filling data gaps to enable epidemiological analyses of future disease emergence events.
Background & Summary
Zika virus (ZIKV) emerged as a pathogen of global concern in 2015 when it rapidly spread through the Americas and was associated with Guillain-Barré syndrome (GBS) in adults and congenital Zika syndrome (CZS) in fetuses and neonates1. Though ZIKV had been discovered several decades earlier, recognition of severe outcomes and the explosive nature of ZIKV epidemics was only established recently2,
ZIKV is transmitted primarily by Aedes aegypti mosquitoes, which also transmit chikungunya, yellow fever, and dengue viruses. Like these other viruses, ZIKV transmission is highly dependent on the environment. Climatic conditions, for example, regulate the population dynamics of vectors10,11, and the built environment plays an important role in human-vector interaction and in providing breeding grounds for mosquitoes12. Even though the importance of these factors is widely recognized, their specific roles are more difficult to understand; model-based analyses that combine epidemiological and environmental data can help to clarify them13.
The availability of spatiotemporal incidence data is critical to both current and near-future responses and to planning for responses to emerging infectious disease outbreaks. For example, during the Ebola epidemic in 2014-2015, mathematical and statistical models using incidence data were critical to informing resource allocation and placement of new hospital beds14, plans for vaccine trials15, estimates of intervention effectiveness, and understanding how the outbreak started and where it spread in time and space16. Similarly, spatiotemporal ZIKV data has informed efforts to estimate the number of people at risk for infection and the number of pregnant women infected6. Such data are also potentially important for selecting sites for ZIKV vaccine trials17.
Despite the widely recognized importance of spatiotemporal incidence data, there is often limited availability of such data sets for emerging infectious diseases18. In the case of Zika, there has been some effort to broaden access to these data (e.g., the cdcepi GitHub repository19), but the data available through these channels are often not internally consistent and are not made available with important covariates, such as population and weather conditions. Colombia is one country for which data have been made available online by its Instituto Nacional de Salud20 and is of particular interest due to the high resolution of data there (available weekly for each of 1,122 municipalities). This data set is also of particular interest for modeling the spatiotemporal spread of ZIKV due to Colombia’s diverse landscape and because of substantial heterogeneity in the timing and intensity of ZIKV transmission there21. Together, these factors offer a unique opportunity to examine the role of environmental and social influences on the spread of ZIKV22.
In addition to spatiotemporal incidence data, several variables are commonly incorporated into analyses of the transmission dynamics of ZIKV and related pathogens23,24. First, temperature plays a dominant role in ZIKV transmission due to its influence on vector and virus life traits25,26. Because the effect of temperature on transmission depends not only on mean temperature but also on daily temperature range27, we include estimates of mean, minimum, and maximum daily temperature. Second, a number of metrics related to moisture, including precipitation, humidity, and normalized difference vegetation index (NDVI), are commonly used for modeling mosquito population dynamics due to their relevance to the immature stages of the mosquito life cycle11. Third, we include spatiotemporal estimates of relative mosquito abundance28, a spatial estimate of purchasing power as a proxy for socioeconomic effects on mosquito-human contact6,29, and spatial estimates of travel time to allow for exploration of the effects of connectivity on spatiotemporal transmission dynamics24. Fourth, we include demographic projections30 of total population and annual births to allow for quantification of the population at risk of ZIKV infection and severe outcomes such as GBS and CZS.
Here, we collated data on the aforementioned variables at three administrative scales on a weekly basis between January 1, 2014 and October 1, 2016, which spans the majority of ZIKV transmission activity in Colombia. Our hope is that this effort will increase access to this data set and reduce duplication of the considerable effort required to process data for epidemiological analyses of ZIKV transmission dynamics.
To achieve our central objective of assembling and collating multiple data sets pertaining to ZIKV transmission in Colombia, we first identified key data and then translated those data to comparable spatiotemporal resolution using a variety of methods. In some cases, this was as simple as downloading raster datasets and clipping them to shape files. In other cases, this involved statistical modelling to transform existing data products from certain scales into a single data product at some other desired scale. In all cases, our methods involved taking input data (Table 1) and generating output data (Table 2 (available online only), Data Citation 1: Dryad Digital Repository http://dx.doi.org/10.5061/dryad.83nj1) at a weekly timescale between January 1, 2014 and October 1, 2016 for each of three administrative scales (Fig. 1). Throughout, we generated output data at the national scale, for each of 33 departments, and for each of 1,122 municipalities, as defined by GIS shapefiles from the National Geographical Information System of Colombia31.
Zika case reports
The weekly number of Zika cases, by municipality, was reconstructed using two data sources. The main data source was a website20 of the Colombian National Institute of Health (Instituto Nacional de Salud) where the official weekly reports on the cumulative number of Zika suspected and confirmed cases for each municipality have been published since the beginning of 2016.
While the peak of the Colombian epidemic occurred in 2016, a significant number of cases were reported during 2015. To capture this initial portion of the epidemic, we used an additional data source, also available on the INS website20. Unfortunately, this secondary source consistently reported fewer cases than the totals reported by the INS at the national scale. For example, while the official data source reports a cumulative number of 11,712 cases by the end of 2015, this secondary source only reports 3,875 cases for this same period. Therefore, to reconstruct the 2015 portion of the epidemic while accounting for the better-known total number of cases, we multiplied the weekly 2015 data by a correction factor. For each municipality, this correction factor was calculated as the ratio of the cumulative number of cases reported up to the first week of 2016 according to the official source to the corresponding number according to the alternative source. The raw and the corrected weekly counts for each municipality are included in the data set. To account for cases from unknown municipalities within a department, we also provide data at the departmental level.
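The rescaling described above can be illustrated with a minimal sketch. The function names are hypothetical, the alternative-source cumulative count is assumed to be the sum of the weekly alternative-source counts, and the handling of municipalities with no alternative-source cases is an assumption rather than something stated in the text.

```python
def correction_factor(official_cumulative, alternative_cumulative):
    """Ratio of official to alternative cumulative counts for one municipality."""
    if alternative_cumulative <= 0:
        # Assumption: with no alternative-source cases there is nothing to rescale.
        return None
    return official_cumulative / alternative_cumulative


def correct_weekly_counts(weekly_alternative, official_cumulative):
    """Scale weekly alternative-source counts so that they sum to the official total."""
    factor = correction_factor(official_cumulative, sum(weekly_alternative))
    if factor is None:
        return list(weekly_alternative)
    return [count * factor for count in weekly_alternative]


# National totals quoted in the text, used here only to illustrate the ratio.
national_factor = correction_factor(11712, 3875)
```

With the national totals quoted in the text, the ratio is roughly 3; the municipal factors in the data set would differ from one municipality to another.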
We obtained gridded population data across Colombia for the year 2015 at a resolution of 3 arc seconds (~93 m) from the WorldPop website (http://worldpop.org.uk). Similarly, we obtained high-resolution (30 arc seconds) unpublished gridded data on the number of births for the year 2015 from the WorldPop project. These high-resolution products were developed to ensure consistencies with subnational data on sex and age structures, as well as subnational age-specific fertility rates, while adjustments on births were made at subnational scales using data from the government of Colombia32,33, followed by national-level adjustments to contemporary numbers based on 2012 and 2015 United Nations Population Division data30,34.
Spatial aggregation of covariates
Aggregation of raster data at the level of administrative units requires some assumption about how raster values should be weighted to obtain a single value for an administrative unit. Because Zika virus transmission occurs predominantly in human-dominated areas, we used human population (WorldPop Project) as our weighting variable. We applied this weighting procedure to aggregate all covariates at municipal (e.g., as in Fig. 2), departmental (e.g., as in Fig. 3), and national levels.
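As a rough illustration of population-weighted aggregation, the sketch below assumes three co-registered arrays of identical shape: the covariate, the population count, and an integer raster of administrative unit identifiers. The function name and the rasterized-identifier layout are assumptions made for this example and are not taken from the published code.

```python
import numpy as np


def population_weighted_mean(values, population, unit_ids, unit):
    """Population-weighted mean of a covariate over the cells of one unit."""
    values = np.asarray(values, dtype=float)
    population = np.asarray(population, dtype=float)
    unit_ids = np.asarray(unit_ids)
    # Keep cells that belong to the unit, have a valid covariate value,
    # and contain at least some population.
    mask = (unit_ids == unit) & np.isfinite(values) & (population > 0)
    if not mask.any():
        return float("nan")  # assumption: units with no populated cells get NaN
    return float(np.average(values[mask], weights=population[mask]))


temperature = np.array([[20.0, 30.0], [10.0, 40.0]])
people = np.array([[100.0, 300.0], [0.0, 50.0]])
municipality = np.array([[1, 1], [1, 2]])
print(population_weighted_mean(temperature, people, municipality, 1))
```

In this toy grid the unpopulated cell contributes nothing, so the value for unit 1 is pulled toward the more populous cell.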
Aedes aegypti abundance
We obtained one hundred posterior samples of Aedes aegypti occurrence probabilities in raster format from the published work of Kraemer et al.28, which we used to derive weekly mosquito abundance measures for all 52 weeks of the year. We assumed that the number of mosquitoes at time t, m(t), follows a Poisson distribution with rate parameter λ = −ln(1 − occurrence probability), consistent with existing ZIKV transmission models29,35. We obtained such an estimate of the relative density of mosquitoes across a 4.65 km x 4.65 km grid for each of 52 weeks. In addition, we generated aggregated values at the municipality, department and national scales after weighting the raster data values by population (see the section on Spatial aggregation of covariates).
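The conversion from occurrence probability to a Poisson rate follows from the fact that, under a Poisson distribution with rate λ, the probability of at least one mosquito is 1 − exp(−λ). A minimal sketch follows; the function name and the clipping threshold `eps` are assumptions added to keep the logarithm finite.

```python
import numpy as np


def relative_abundance(occurrence_prob, eps=1e-9):
    """Poisson rate implied by an occurrence probability: -ln(1 - p)."""
    p = np.asarray(occurrence_prob, dtype=float)
    # Assumption: clip probabilities just below 1 so the rate stays finite.
    p = np.clip(p, 0.0, 1.0 - eps)
    return -np.log1p(-p)


print(relative_abundance([0.1, 0.5, 0.9]))
```

The mapping is monotonic, so cells with higher occurrence probability always receive a higher relative abundance.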
We downloaded meteorological readings from 30 stations across continental Colombia from the National Oceanic and Atmospheric Administration (NOAA)’s Climate Data Online, an online archive of daily meteorological readings36. The variables we extracted from this data set included minimum daily temperature, maximum daily temperature, mean daily temperature, and relative humidity, all on a daily basis between January 1, 2014 and October 1, 2016.
To facilitate interpolation of these climate variables across a more complete spatial coverage of the country, we downloaded a digital elevation dataset at a resolution of 30 arc seconds from the Global 30 Arc-Second Elevation (GTOPO30) product37. Similarly, we downloaded the WorldClim gridded long-term average of monthly minimum temperature, maximum temperature, and precipitation at a 4.65 km x 4.65 km spatial resolution38, as well as NOAA’s Climate Prediction Center (CPC) global monthly mean air temperature at 0.5 arc-degrees resolution39.
To generate smooth, high-resolution surfaces of climate variables based on calibration to point readings from the 30 meteorological stations, we tested two approaches of spatial interpolation: (a) using non-parametric surface fitting with thin plate splines (TPS) with or without fixed-factor covariates40; (b) using spatial models (kriging) with or without covariates41. We selected the best interpolation models for each environmental variable based on leave-one-out cross validation, as described in the Technical Validation section.
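Leave-one-out cross validation over the stations can be sketched independently of the interpolation method. In the example below, inverse-distance weighting is used only as a simple stand-in predictor; it is not one of the TPS or kriging models compared in this work, and all function names are hypothetical.

```python
import numpy as np


def idw_predict(train_xy, train_z, target_xy, power=2.0):
    """Stand-in predictor: inverse-distance weighting (not TPS or kriging)."""
    dist = np.linalg.norm(train_xy - target_xy, axis=1)
    if np.any(dist == 0):
        return float(train_z[dist == 0].mean())
    weights = 1.0 / dist ** power
    return float(np.sum(weights * train_z) / np.sum(weights))


def leave_one_out(xy, z, predict):
    """Predict each station's reading from all of the other stations."""
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    predictions = np.empty(len(z))
    for i in range(len(z)):
        keep = np.arange(len(z)) != i  # drop station i from the training set
        predictions[i] = predict(xy[keep], z[keep], xy[i])
    return predictions


stations = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
readings = [24.0, 26.0, 25.0, 27.0]
held_out = leave_one_out(stations, readings, idw_predict)
```

Any function with the same three-argument signature could be passed as `predict`, which is how competing interpolation models would be compared on identical held-out stations.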
The thin plate spline (TPS) follows the general form

$$Y(x_i) = \mu(x_i) + P(x_i) + \varepsilon_i,$$

where Y is the dependent variable evaluated at location x_i, μ is the fixed effect component of the model with optional covariates at location x_i, P is the implicit spline polynomial function over the spatial coordinates, and ε is measurement error, assumed to be uncorrelated across sites and normally distributed with mean zero and standard deviation σ.
The kriging approach follows the concept that spatial autocorrelation is dependent on distance between locations. We used the krige function in the geoR library of R with parameters chosen based on maximum-likelihood estimation42. The model of a spatial process indexed by spatial locations x_i follows

$$Y(x_i) = \mu(x_i) + S(x_i) + \varepsilon_i,$$

where Y is the dependent variable evaluated at location x_i, μ is the fixed effect component of the model at location x_i, S is a stationary Gaussian process with variance σ² (partial sill) and a correlation function parametrized by φ (range), and ε is the error term with variance τ² (nugget variance). When μ is included, the trend is implemented using lm, the regression model function in R, and S(x) is fitted to the residuals of the regression model41.
Due to Colombia’s proximity to the Equator, we ignored the small effect of distance distortion arising from non-projected spatial layers on both models43. Because our goal is generating daily surfaces of climate variables, rather than developing a predictive model that works for days outside those to which we fitted the model, we treated every day separately and fitted a model for each day between January 1, 2014 and October 1, 2016 for which data was available. In addition to generating daily raster outputs and aggregating them at weekly time steps, we generated aggregated values at the municipality (Figs 2a–c), department (Fig. 3a) and national scales after weighting the raster data values by population (see the section on Spatial aggregation of covariates).
Rather than interpolating relative humidity directly based on station readings (which showed poor estimates in preliminary results), we approached the task of estimating relative humidity indirectly. First, we spatially interpolated weather station measurements of mean dew point temperature from the 30 stations across Colombia. We then calculated relative humidity (RH, in percent) across the 4.65 km x 4.65 km grid based on interpolated mean temperature and dew point temperature, using the August-Roche-Magnus approximation for the saturation vapour pressure of water in air44, which follows

$$RH = 100 \times \frac{\exp\left(\frac{a T_d}{b + T_d}\right)}{\exp\left(\frac{a T}{b + T}\right)},$$

where T and T_d are the mean temperature and dew point temperature in °C, a=17.271, and b=237.7 °C44. Finally, in addition to generating daily raster outputs and aggregating them at weekly time steps, we generated aggregated values at the municipality (Fig. 2d), department (Fig. 3b) and national scales after weighting the raster data values by population (see the section on Spatial aggregation of covariates).
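The relative humidity calculation can be sketched for a single grid cell as follows. The constants are those given in the text; the function name is hypothetical, and the expression is the standard form of the August-Roche-Magnus approximation, assumed here to match the one used for the data set.

```python
import math

A = 17.271  # dimensionless constant from the text
B = 237.7   # degrees Celsius, from the text


def relative_humidity(temp_c, dew_point_c):
    """Relative humidity (%) from mean temperature and dew point in degrees Celsius."""
    gamma_dew = A * dew_point_c / (B + dew_point_c)
    gamma_air = A * temp_c / (B + temp_c)
    # Ratio of saturation vapour pressures at the dew point and at air temperature.
    return 100.0 * math.exp(gamma_dew - gamma_air)


print(round(relative_humidity(25.0, 15.0), 1))
```

When the dew point equals the air temperature the function returns 100, and it decreases as the gap between the two widens.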
Normalized Difference Vegetation Index (NDVI)
Satellite-based technologies have been used to capture spatial variation in environmental factors related to vector population dynamics45,
Among the climate datasets we explored, precipitation proved to be the most spatially variable, making it difficult to rely on spatial models to make accurate estimates. Our attempt at spatial interpolation of precipitation using ordinary kriging resulted in large deviations from the observed values at the 30 stations obtained from NOAA. As an alternative, we used satellite-based data from NOAA’s Center for Satellite Applications and Research (STAR). We downloaded daily layers of the STAR rainfall estimates at ~4 km x 4 km resolution48. After downloading the daily products, we subset and resampled them to our standard resolution (4.65 km x 4.65 km) and to a spatial extent compatible with the other variables considered, before averaging across each consecutive seven days to generate weekly gridded data. In addition, we generated aggregated values at the municipality (Fig. 2g), department (Fig. 3d) and national scales after weighting the raster data values by population (see the section on Spatial aggregation of covariates).
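Averaging daily grids over consecutive seven-day blocks can be sketched with a simple reshape. The example assumes the daily layers are already stacked into an array of shape (days, rows, columns) starting on the first day of a week; dropping an incomplete trailing week is an assumption of this sketch.

```python
import numpy as np


def weekly_means(daily_stack):
    """Average a (days, rows, cols) stack over consecutive seven-day blocks."""
    daily = np.asarray(daily_stack, dtype=float)
    n_weeks = daily.shape[0] // 7
    # Assumption: days beyond the last complete week are dropped.
    trimmed = daily[: n_weeks * 7]
    blocks = trimmed.reshape((n_weeks, 7) + daily.shape[1:])
    # nanmean ignores missing daily values within a week.
    return np.nanmean(blocks, axis=1)


rain = np.arange(15, dtype=float).reshape(15, 1, 1)
weekly = weekly_means(rain)
```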
Geographically based Economic data (G-Econ)
To account for socioeconomic differences, which are potentially associated with contact between humans and the vector, we used one-degree resolution gridded estimates of 2005 purchasing power parity (PPP) adjusted gross domestic product (GDP)49. To express the values per capita, we divided the gridded GDP by the corresponding population, which we obtained from the Gridded Population of the World product (v3)50 and resampled to one-degree resolution. We chose this version of gridded population data for this task given that it was the one originally used to generate the 2005 gridded GDP values. Cells with missing values were imputed with the mean of the surrounding eight grid cell values. Once we obtained a complete grid layer at a resolution of one degree (~111 km at the equator), we resampled the layer, without smoothing, to a resolution of 4.65 km x 4.65 km to match the resolution of all other gridded layers. We additionally computed aggregated results at the municipality, department and national levels after weighting them by the distribution of population (in the year 2005) within each administrative unit (see the section on Spatial aggregation of covariates).
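Imputing a missing cell from its eight neighbours can be sketched as below. Missing values are assumed to be encoded as NaN, and the treatment of edge cells (averaging over whichever neighbours exist) and of cells with no valid neighbour (left missing) is an assumption of this example.

```python
import numpy as np


def impute_from_neighbours(grid):
    """Fill NaN cells with the mean of their valid neighbours in a 3 x 3 window."""
    grid = np.asarray(grid, dtype=float)
    # Pad with NaN so that edge cells can use the same 3 x 3 window.
    padded = np.pad(grid, 1, mode="constant", constant_values=np.nan)
    filled = grid.copy()
    for r, c in zip(*np.where(np.isnan(grid))):
        window = padded[r:r + 3, c:c + 3].copy()
        window[1, 1] = np.nan  # exclude the missing cell itself
        if np.isfinite(window).any():
            filled[r, c] = np.nanmean(window)
    return filled


gdp = np.array([[1.0, 2.0, 3.0], [4.0, np.nan, 6.0], [7.0, 8.0, 9.0]])
print(impute_from_neighbours(gdp)[1, 1])
```

Neighbours are always read from the original grid, so the result does not depend on the order in which missing cells are visited.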
To account for the general accessibility of each municipality and department, we used travel time data downloaded from the European Commission’s Joint Research Center at a resolution of 30 arc seconds51. This definition of travel time is a measure of overall accessibility rather than of frequency of travel: each grid cell gives the land-based travel time (in minutes) to the nearest settlement with a population greater than 50,000 (as of the year 2000). A large travel time is indicative of a region whose population lives relatively far from urban centers. The data were developed using a cost-distance model, which accounts for travel time increments based on the available transport networks and other environmental and political factors51. We aggregated travel time weighted by population at the municipal level to generate estimates of travel time for each municipality and similarly for each department (see the section on Spatial aggregation of covariates).
To identify the level of urbanization in each grid cell, we downloaded the MODIS global 2002 urban extent raster dataset52,53, which has a binary (0 or 1) value for each 500 m x 500 m grid cell around the globe. By counting the number of high-resolution urban grid cells that fall within each standard grid cell of 4.65 km x 4.65 km, we were able to generate a gridded product of percentage of the physical grid cell that is urban. Furthermore, in combination with the population raster we obtained from WorldPop30, we were able to generate a gridded estimate of urban population at each 500 m x 500 m grid cell in Colombia.
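Counting fine-resolution urban cells within each coarse cell can be sketched as a block average of the binary mask. This example makes a strong simplifying assumption: the fine grid nests exactly inside the coarse grid with an integer ratio `factor`, which does not hold for 500 m cells inside 4.65 km cells, where a raster resampling tool would be needed instead. The function name is hypothetical.

```python
import numpy as np


def urban_percentage(urban_mask, factor):
    """Percentage of urban fine cells within each factor x factor coarse block."""
    mask = np.asarray(urban_mask, dtype=float)
    rows = (mask.shape[0] // factor) * factor
    cols = (mask.shape[1] // factor) * factor
    # Assumption: rows and columns that do not fill a complete block are dropped.
    blocks = mask[:rows, :cols].reshape(rows // factor, factor, cols // factor, factor)
    return 100.0 * blocks.mean(axis=(1, 3))


fine = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])
coarse = urban_percentage(fine, 2)
```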
The code used to generate all gridded datasets and to aggregate them at municipal, departmental, and national levels is freely available for download from GitHub at https://github.com/asiraj-nd/zika-colombia54. This code utilizes the R programming language42 and Python version 2.7.10. Further explanation of the code is provided in a readme file in the repository on GitHub54.
All output datasets described in this article (Data Citation 1: Dryad Digital Repository http://dx.doi.org/10.5061/dryad.83nj1) are publicly and freely available through Dryad Digital Repository. The datasets stored in the datadryad.org Repository represent the ones produced at the time of writing, and will be preserved in their published form. Datasets of interest can be obtained by downloading the corresponding zipped archive files (Table 2 (available online only)).
Most datasets obtained from other sources have already been validated by independent studies30,38,39,48,
We used three metrics to compare model performance: mean absolute error (MAE), coefficient of variation (CV), and Pearson’s correlation coefficient (COR). MAE is the mean absolute difference between predictions and observations over n data points:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,$$

where \hat{y}_i and y_i denote the i-th prediction and observation, respectively. We also used the relative MAE of two models, which is the ratio of their MAEs. A relative MAE of m for model A relative to model B, with m less than 1, indicates that predictions from model A were 100 × (1 − m)% closer to the observed values than those from model B. The CV evaluates the extent to which large values are dispersed relative to their mean value. It is the ratio of the root mean square error (RMSE) to the mean of observed values:

$$\mathrm{CV} = \frac{\mathrm{RMSE}}{\bar{y}} = \frac{1}{\bar{y}}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}.$$

Results of our comparison are described in Table 3. Overall, the ordinary kriging approach had higher accuracy for temperature (mean, maximum, and minimum) and relative humidity based on all three metrics. Model results also revealed that using other covariates, such as altitude and secondary climate data, improved interpolation results for temperature and relative humidity.
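The three metrics can be computed from paired predictions and observations as in the sketch below. The function names and the toy values are illustrative only and are not results from Table 3.

```python
import numpy as np


def mae(pred, obs):
    """Mean absolute error between predictions and observations."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(obs, float))))


def relative_mae(pred_a, pred_b, obs):
    """Ratio of the MAE of model A to the MAE of model B."""
    return mae(pred_a, obs) / mae(pred_b, obs)


def cv_rmse(pred, obs):
    """Root mean square error divided by the mean of the observations."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    return float(rmse / np.mean(obs))


def pearson(pred, obs):
    """Pearson correlation between predictions and observations."""
    return float(np.corrcoef(pred, obs)[0, 1])


observed = [10.0, 20.0, 30.0]
model_a = [11.0, 19.0, 31.0]
model_b = [12.0, 18.0, 32.0]
ratio = relative_mae(model_a, model_b, observed)
```

In this toy case the ratio is 0.5, meaning the predictions of model A are 50% closer to the observations than those of model B.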
This compilation of datasets can facilitate a variety of studies relevant to vector-borne disease epidemiology in Colombia. The archive provides ready-to-use data both in raster format at a resolution of approximately 5 km x 5 km and at administrative units of municipal, departmental, and national scales.
These datasets have several limitations. First, the 30 meteorological stations used in generating climate surfaces are sparsely and unevenly distributed over Colombia, leading to uncertainty in the outputs. Moreover, some of the original gridded data we obtained had differing resolutions, including 0.1 arc-degrees (GPM), 0.5 arc-degrees (CPC), and 1 arc-degree (G-Econ). This meant that we had to resample these gridded products (GPM, CPC, G-Econ) with crude estimates based on average values over a large swath of grid cells. Further, unlike all other products we used, which were non-projected geographic WGS1984 raster files, the Terra and Aqua MODIS NDVI products were in sinusoidal projections, causing some distortions when re-projected to match population layers used in weighting.
In addition to spatial discrepancies, we also had to overcome the relatively poor temporal resolution of the Terra and Aqua MODIS NDVI products (which come at 16-day intervals) by linearly interpolating between two data points to fill in the 15 days in between, before aggregating the results at weekly time steps. Furthermore, daily satellite-based rainfall data from NOAA assume a day running from 12:00 to 12:00, which could potentially cause slight inconsistencies, despite the data finally being aggregated at weekly time steps. Other limitations include the modifiable areal unit problem, which arises from disparities in the arbitrary sizes and borders of the administrative units and may bias aggregations based on these borders.
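Filling the days between 16-day NDVI composites by linear interpolation can be sketched for a single grid cell with `numpy.interp`. The function name and the day indexing (composites at days 0, 16, 32, ...) are assumptions of this example, which does not reproduce the handling of composite dates in the published code.

```python
import numpy as np


def daily_from_composites(composite_days, composite_values, n_days):
    """Linearly interpolate sparse composite values to one value per day."""
    days = np.arange(n_days)
    # np.interp holds the first and last values constant outside the sampled range.
    return np.interp(days, composite_days, composite_values)


ndvi_daily = daily_from_composites([0, 16, 32], [0.2, 0.6, 0.4], 33)
```

The resulting daily series can then be averaged over consecutive seven-day blocks to obtain weekly values.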
How to cite this article: Siraj, A. S. et al. Spatiotemporal incidence of Zika and associated environmental drivers for the 2015-2016 epidemic in Colombia. Sci. Data 5:180073 doi: 10.1038/sdata.2018.73 (2018).
Didan, K. NASA EOSDIS Land Processes DAAC https://doi.org/10.5067/modis/myd13a2.006 (2015)
This research was supported by a RAPID award from the National Science Foundation (DEB 1641130).