Finding hotspots: development of an adaptive spatial sampling approach

The identification of disease hotspots is an increasingly important public health problem. While geospatial modeling offers an opportunity to predict the locations of hotspots using suitable environmental and climatological data, little attention has been paid to optimizing the design of surveys used to inform such models. Here we introduce an adaptive sampling scheme optimized to identify hotspot locations where prevalence exceeds a relevant threshold. Our approach incorporates ideas from Bayesian optimization theory to adaptively select sample batches. We present an experimental simulation study based on survey data of schistosomiasis and lymphatic filariasis across four countries. Across all scenarios explored, adaptive sampling produced superior results, suggesting that performance similar to that of random sampling can be achieved with a fraction of the sample size.

where ν controls the smoothness of the process, is a lengthscale parameter and K_ν is a modified Bessel function.
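The terms described above (a smoothness parameter ν, a lengthscale and a modified Bessel function K_ν) match those of a Matérn covariance function. As a reference sketch only, one standard parameterization is given below. The symbols σ² (marginal variance), d (distance between two locations) and ℓ (lengthscale) are assumptions introduced here for illustration and may differ from the notation used elsewhere in this work.

```latex
k(d) = \sigma^{2} \, \frac{2^{1-\nu}}{\Gamma(\nu)}
       \left( \frac{\sqrt{2\nu}\, d}{\ell} \right)^{\nu}
       K_{\nu}\!\left( \frac{\sqrt{2\nu}\, d}{\ell} \right)
```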

Village Finder
The Village Finder algorithm is accessible via a Shiny app that suggests GPS coordinates of populated sites based on 1 km resolution WorldPop gridded population data. A populated site is an area that meets certain size and population criteria and can represent a village, a neighborhood of a crowded city or a large but sparsely populated rural area. The user specifies the following three parameters to define the type of populated sites queried:
- maximum area size, above which a region cannot be considered as a unique location;
- upper population threshold, above which a location should be counted as a unique location;
- lower population threshold, below which a region smaller than the maximum area size should not be counted as a populated location.
The algorithm works iteratively. First, any 1 km grid cells of the WorldPop raster that satisfy the three parameters are identified and their centroids are kept. The gridded population data, minus the grid cells identified in the first round, are then aggregated by a factor of 2, and any aggregated areas that satisfy the parameters are identified. The centroid of the most populated cell in each such aggregated area is then assigned as the village location for that area. The process continues until all aggregated areas have an assigned centroid or until all thresholds are met.
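The iterative aggregation described above can be sketched in Python as follows. This is a minimal illustration, not the Shiny app's implementation. The function name `find_sites`, its arguments and the acceptance rule are all assumptions: here a block is accepted if its population reaches `upper_pop`, or if its area is at most `max_area_km2` and its population reaches `lower_pop`. The sketch also assumes that the aggregation factor doubles at each round and that each input cell covers 1 km².

```python
import numpy as np

def find_sites(pop, max_area_km2, upper_pop, lower_pop, max_levels=6):
    """Return (row, col) indices on the original 1 km grid of suggested sites.

    Hypothetical sketch: the acceptance rule below is one possible reading
    of the three user parameters, not a confirmed specification.
    """
    remaining = np.asarray(pop, dtype=float).copy()
    n_rows, n_cols = remaining.shape
    sites = []
    factor = 1  # block side length in cells; 1 means single 1 km cells
    for _ in range(max_levels):
        for r0 in range(0, n_rows, factor):
            for c0 in range(0, n_cols, factor):
                # Basic slicing returns a view, so edits propagate to `remaining`.
                block = remaining[r0:r0 + factor, c0:c0 + factor]
                total = block.sum()
                area_km2 = block.size  # assumes 1 km x 1 km cells
                accepted = total > 0 and (
                    total >= upper_pop
                    or (area_km2 <= max_area_km2 and total >= lower_pop)
                )
                if accepted:
                    # Use the most populated cell of the block as the site.
                    dr, dc = np.unravel_index(np.argmax(block), block.shape)
                    sites.append((int(r0 + dr), int(c0 + dc)))
                    block[...] = 0.0  # remove the block from later rounds
        if not remaining.any() or factor >= max(n_rows, n_cols):
            break
        factor *= 2  # aggregate by a further factor of 2
    return sites

# Toy 4 x 4 grid: one dense cell and one cluster of sparse cells.
toy = [[50, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 3, 4],
       [0, 0, 2, 5]]
toy_sites = find_sites(toy, max_area_km2=4, upper_pop=100, lower_pop=10)
```

In the toy grid, the dense cell is picked up in the first round, while the four sparse cells only qualify once they are aggregated into a 2 x 2 block, whose most populated cell then represents the site.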

Simulation of gold standard prevalence scenarios
For each of the four settings (schistosomiasis in Côte d'Ivoire and Malawi, and lymphatic filariasis in Haiti and the Philippines), we simulated gold standard prevalence estimates for each cluster by fitting a spatial model to observed survey data. For each dataset, we first used a Generalized Additive Model (GAM) to fit thin plate spline (non-linear) relationships with four WorldClim variables (mean precipitation, mean temperature, precipitation seasonality, temperature seasonality), elevation (NASA SRTM) and distance to nearest waterbody (Digital Chart of the World) using mgcv (v1.8-27). We then fitted a variogram to the residuals from each model and conditionally simulated a single realization at all clusters using geoR (v1.7-5.2.1). This realization was added to the predictions from the GAM, and an inverse logit transformation was applied to return the predictions to the probability scale. This two-step process allowed us to fit complex non-linear relationships with covariates plus a residual spatial effect. To ensure an adequate number of hotspot communities for simulation purposes, the simulated datasets were adjusted slightly so that the mean prevalence was roughly equal to the relevant disease-specific hotspot prevalence threshold (i.e. 10% for schistosomiasis and 2% for lymphatic filariasis). Supplementary Fig. S1 shows histograms of the cluster-level prevalence for each country.
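The combination of a covariate trend, a spatial residual and an inverse logit can be illustrated with a short Python sketch. It is a simplified stand-in for the mgcv and geoR workflow described above, not a reproduction of it: the trend on the logit scale is passed in rather than fitted by a GAM, the residual is an unconditional draw from an assumed exponential covariance rather than a conditional simulation from a fitted variogram, and the mean adjustment is implemented as an intercept shift found by bisection, which is an assumption because the adjustment method is not specified. All names (`simulate_prevalence`, `sill`, `range_km`, `target_mean`) are hypothetical.

```python
import numpy as np

def inv_logit(x):
    """Map values from the logit scale back to the probability scale."""
    return 1.0 / (1.0 + np.exp(-x))

def simulate_prevalence(coords, trend_logit, sill, range_km, target_mean, seed=0):
    """Simulate cluster-level prevalence as trend + spatial residual.

    Hypothetical sketch: exponential covariance and bisection-based
    intercept shift are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Pairwise distances and an assumed exponential covariance.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cov = sill * np.exp(-dist / range_km)
    # One realization of the residual field via a Cholesky factor
    # (a small jitter keeps the factorization numerically stable).
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(n))
    residual = chol @ rng.standard_normal(n)
    eta = np.asarray(trend_logit, dtype=float) + residual
    # Shift the intercept so the mean prevalence matches the target.
    # The mean is monotone in the shift, so bisection applies.
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if inv_logit(eta + mid).mean() < target_mean:
            lo = mid
        else:
            hi = mid
    return inv_logit(eta + 0.5 * (lo + hi))

# Toy example: 50 clusters with a made-up trend, targeting 10% mean prevalence.
_rng = np.random.default_rng(1)
toy_coords = _rng.uniform(0, 100, size=(50, 2))
toy_trend = _rng.normal(-2.0, 0.5, size=50)
toy_prev = simulate_prevalence(toy_coords, toy_trend, sill=1.0,
                               range_km=20.0, target_mean=0.10)
```

Shifting the intercept on the logit scale keeps every simulated prevalence strictly between 0 and 1, which a direct rescaling on the probability scale would not guarantee.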

Validation Statistics
To measure the performance of the classification model we used four different metrics. To define them, we first need to define the following terms:
- True positives (tp): cases where the actual category and the predicted category are both positive (e.g. a site classified as a hotspot actually has a prevalence above the threshold of interest).
- True negatives (tn): cases where the actual category and the predicted category are both negative (e.g. a site classified as not being a hotspot actually has a prevalence below the threshold of interest).
- False positives (fp): cases where the actual category is negative, but the predicted class is positive (e.g. the site is classified as a hotspot, but the actual prevalence is below the threshold of interest).
- False negatives (fn): cases where the actual category is positive, but the predicted class is negative (e.g. the site is classified as not being a hotspot, but the actual prevalence is above the threshold of interest).
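The four terms defined above can be counted directly from the true prevalence values and the predicted hotspot labels. The Python sketch below is illustrative only: the function name `confusion_counts` and its arguments are hypothetical, and it assumes that a site is a true hotspot when its prevalence strictly exceeds the threshold.

```python
import numpy as np

def confusion_counts(true_prev, predicted_hotspot, threshold):
    """Count tp, tn, fp and fn for hotspot classification.

    Assumption: a site is an actual hotspot if its prevalence is
    strictly greater than `threshold`.
    """
    actual = np.asarray(true_prev, dtype=float) > threshold
    predicted = np.asarray(predicted_hotspot, dtype=bool)
    return {
        "tp": int(np.sum(actual & predicted)),    # hotspot, predicted hotspot
        "tn": int(np.sum(~actual & ~predicted)),  # not hotspot, predicted not
        "fp": int(np.sum(~actual & predicted)),   # not hotspot, predicted hotspot
        "fn": int(np.sum(actual & ~predicted)),   # hotspot, predicted not
    }

# Toy example with a 10% threshold.
toy_counts = confusion_counts(
    true_prev=[0.15, 0.05, 0.12, 0.02, 0.30],
    predicted_hotspot=[True, True, False, False, True],
    threshold=0.10,
)
```

The four counts always sum to the number of sites, which gives a quick sanity check before any derived metric is computed.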