Applying a zero-corrected, gravity model estimator reduces bias due to heterogeneity in healthcare utilization in community-scale, passive surveillance datasets of endemic diseases

Data on population health are vital to evidence-based decision making but are rarely adequately localized or updated in continuous time. They also suffer from low ascertainment rates, particularly in rural areas where barriers to healthcare can cause infrequent touch points with the health system. Here, we demonstrate a novel statistical method to estimate the incidence of endemic diseases at the community level from passive surveillance data collected at primary health centers. The zero-corrected, gravity-model (ZERO-G) estimator explicitly models sampling intensity as a function of health facility characteristics and statistically accounts for extremely low rates of ascertainment. The result is a standardized, real-time estimate of disease incidence at a spatial resolution nearly ten times finer than typically reported by facility-based passive surveillance systems. We assessed the robustness of this method by applying it to a case study of field-collected malaria incidence rates from a rural health district in southeastern Madagascar. The ZERO-G estimator decreased geographic and financial bias in the dataset by over 90% and doubled the agreement rate between spatial patterns in malaria incidence and incidence estimates derived from prevalence surveys. The ZERO-G estimator is a promising method for adjusting passive surveillance data of common, endemic diseases, increasing the availability of continuously updated, high quality surveillance datasets at the community scale.


Spatial distribution of health care infrastructure
Monthly disease incidence was simulated for 100 administrative zones (patches, p) over 5 years.The patches were distributed in a 10 x 10 square matrix representing a health district (Figure S1.1).Each patch's population was drawn from a uniform distribution between 800 and 1200 and age-stratification was not considered.The population remained constant over the simulated time period.Eight primary health clinics (j) were randomly distributed across the 10 x 10 matrix (Figure S1).
Clinics differed in the number of staff (random uniform from 5 to 15), whether they offered advanced services (randomly distributed so that 50% of clinics offered advanced services), and whether health care was provided free-of-charge at the clinic (randomly distributed so that 50% of both advanced and basic clinics offered this service).

Disease Dynamics
We simulated consultation rates for constant background disease rates and for two diseases that exhibited annual seasonality in their burdens.We assumed the background disease rate was one infection per person per year.We set the annual incidence of each seasonal disease to one infection per person per year, but varied the seasonality of each disease separately, resulting in a monthly risk of infection for each disease g ( ) (Figure S1.2).Each individual's probability of infection for each disease during each month was defined as the inverse logit of the logit-transformed monthly risk of infection plus a normally distributed random error with a mean of 0 and standard deviation of 0.5, resulting in a probability ranging from 0 -1.This extra error was added so that the simulated data approximated the noisiness seen in field-derived disease notification data.This resulted in a number of cases for each disease g in patch p during month t drawn from a binomial distribution of size equal to the patch's population and probability (Eq.S1).

Reporting Rate
We modeled an individual's probability of seeking health care at the patch level (PCp) as a function of the distance to health clinics and the characteristics of those clinics (Eq.S2).

(Equation S2)
Where is the services provided by each clinic j, is the distance between patch p and clinic j, and is the shape of distance decay relationship, which varied randomly from 0.2-0.8 for each simulation.The services provided by each clinic j were a function of the number of staff of that clinic ( , whether it offered advanced services ( ), and whether healthcare was provided free of charge ( ), scaled to range from 0 -1 (Eq.S3).
(Equation S3) In addition, we simulated instances of zero reported infections per patch for each disease due to 1) a combination of low reporting rates and low disease risk and 2) randomness.These two causes of zero reported infections were simulated independently from each other by a random binomial event given a corresponding probability of a zero.The probability of a zero due to low reporting rates ( ) and low disease risk ( ) for each disease g in patch p at month t was calculated following Equation S4: (Equation S4) The probability of a zero due to randomness (Pzr) was set at a constant value of 0.1.
The number of reported monthly cases for each disease per patch was therefore defined as: (Equation S5) An example of comparing the true vs. reported cases for all three diseases for two patches of differing health care access is shown in Figure S1.3.Notably, this dataset bears characteristics that resemble realistic passive notification datasets, including a high variance around the mean and unexplained missing data reported as zeros.

Figure S1. 1 .
Figure S1.1.Example distribution of primary health clinics (red points) distributed among a matrix of 10 x 10 administrative zones (squares) for one simulation.The size of the point for each clinic corresponds to the level of services it provides (Sj).

Figure S1. 2 .
Figure S1.2.The monthly risk of infection ( ) for each of three diseases across the simulated time period for one simulation.

Figure S1. 3 .
Figure S1.3.The true number of cases and reported number of cases for all three diseases in three example patches with low, intermediate, and high probability of seeking healthcare for one simulation.

Figure S1. 4 .
Figure S1.4.Time series of district-level incidence rates in the simulated dataset, for the true dataset, reported dataset, and ZERO-G adjusted dataset for one simulation.The bars represent the 95% confidence intervals of the ZERO-G adjustment.

Figure S1. 5
Figure S1.5 Scatter plots comparing simulated true incidence data with unadjusted incidence rates, incidence rates adjusted for erroneous zeros, and ZERO-G adjusted incidence estimates for one simulation.

Figure S1. 6 .
Figure S1.6.Spatial pattern of mean annual incidence per administrative zone in the true incidence dataset, reported dataset, and the ZERO-G adjusted dataset for one simulation.The annual incidence is represented by the shaded color of each zone and the location of health clinics are represented by the points.Health clinics offering advanced primary care are

Figure S1. 7 .
Figure S1.7.Biases due to geographic location and financial policies were reduced in the ZERO-G adjusted data relative to the unadjusted data.Left: The average annual incidence per patch relative to a zone's distance to the nearest clinic.Right: The mean monthly incidence in zones with fee reimbursement and zones without fee reimbursement policies.Example is from one simulation.

Figure S2. 1
Figure S2.1 Scatter plots illustrating the relationship between sampling intensity estimated via the floating catchment area (FCA) method and the self-reported healthcare seeking rates from the IHOPE cohort for three survey years (2016, 2018, 2021).

Table S2 .
1. Table of best fit parameters estimated via MLE in the estimation of healthcare access A via Equations 7-12.