Predicting distribution of malaria vector larval habitats in Ethiopia by integrating distributed hydrologic modeling with remotely sensed data

Larval source management has gained renewed interest as a malaria control strategy in Africa but the widespread and transient nature of larval breeding sites poses a challenge to its implementation. To address this problem, we propose combining an integrated high resolution (50 m) distributed hydrological model and remotely sensed data to simulate potential malaria vector aquatic habitats. The novelty of our approach lies in its consideration of irrigation practices and its ability to resolve complex ponding processes that contribute to potential larval habitats. The simulation was performed for the year of 2018 using ParFlow-Common Land Model (CLM) in a sugarcane plantation in the Oromia region, Ethiopia to examine the effects of rainfall and irrigation. The model was calibrated using field observations of larval habitats to successfully predict ponding at all surveyed locations from the validation dataset. Results show that without irrigation, at least half of the area inside the farms had a 40% probability of potential larval habitat occurrence. With irrigation, the probability increased to 56%. Irrigation dampened the seasonality of the potential larval habitats such that the peak larval habitat occurrence window during the rainy season was extended into the dry season. Furthermore, the stability of the habitats was prolonged, with a significant shift from semi-permanent to permanent habitats. Our study provides a hydrological perspective on the impact of environmental modification on malaria vector ecology, which can potentially inform malaria control strategies through better water management.


Input Data.
Using GRASS 7.6, the 1 arc-second digital elevation model (DEM) from SRTM was resampled to 50 m grid resolution. Subsequently, the resampled DEM data was hydro-conditioned to ensure that the drainage networks are connected using the global slope enforcement approach 1 and converted to ground surface slopes as an input to ParFlow. The land cover type data used in CLM was determined by 30 m resolution Landsat-8 data, which was first filtered by cloud cover and processed with an unsupervised classification approach using k-mean clustering into ten classes in ENVI. The ten classes were resampled to 50 m resolution and then grouped into the International Geosphere-Biosphere Programme (IGBP) land cover types and checked against Google Earth images for consistency. The dominant IGBP types for the study area are cropland and grassland.
To characterize the subsurface, the soil taxonomy distribution ( Supplementary Fig S2) for the top 2 m from the surface was referenced from the SoilGrids250m TAXOUSDA dataset 2 and ranked by their relative permeability. Each soil type was then assigned a saturated hydraulic conductivity within the range of 0.0015-0.015 m/h 3 characteristic of either clay or clay loam. The saturated hydraulic conductivity of the deeper zone beyond the top 2 m was assigned an averaged value of 0.11 m/h based on GLHYMPS 2.0. The depth to bedrock data from SoilGrids250m BDRICM dataset 2 was used to delineate the bedrock zone and the hydraulic conductivity was set to 0.00001 m/h to render the corresponding grid cells impermeable.
For the climate forcing, 0.04 degree by 0.04-degree precipitation data was resampled to the model grid using bilinear interpolation while the rest of the forcing data fields were relatively coarser at either 0.25 degree by 0.25 degree or 0.5 degrees by 0.625 degrees so an average value was used for the entire domain. All the forcing data were pre-processed in NCAR Command Language (NCL) and input to the model hourly.
The list of model input data can be found in Supplementary Table S1.

Model Simulation.
In this study, a baseline was generated over the year of 2018 followed by a scenario with the implementation of a synthetic irrigation scheme during the dry season. The watershed model spans an area of 208 km 2 and has a high resolution, both spatially and temporally. The model resolution is 50m and there are 332 by 248 cells. The model was divided into 10 subsurface layers and the thickness of each layer varies, decreasing gradually towards the surface. For the irrigation scenario, the sugarcane plantation in the study area was grouped into four farms. The soil textural classes of the study area are clay and heavy clay soils and the peak admissible infiltration rate of these soils was 4mm/hr. The efficiency of sprinkler irrigation was estimated at 75%. Hence, a 10 mm/day irrigation operational module using the hose-moving sprinkler system was applied on alternate farms. The daily irrigation time was 22 hours per day and the irrigation interval of 10 days proposed by sugar factory agronomists was adopted.
The model was initialized in two stages. In the first stage, the CLM component was not activated and the initial water table was set at the surface without meteorological forcing. The simulation was run until the average water table reached 2 m below the land surface. During this process, the water drains from the mountainous areas to valleys, forming rivers and streams naturally. Next, the CLM component was activated and the transient meteorological forcing from the year of 2017 was repeated three years for the model to reach a sufficient state of equilibrium.
All model simulations were carried out on the High-Performance Computing Cluster at the University of California, Irvine. A 1-year simulation required 88 CPU cores and about 12 days to complete.
Supplementary Figure S4 shows the time series of precipitation, temperature, and simulated surface layers soil saturation for the baseline simulation and streamflow at points A, B, and C (see August to September because it is located close to a small stream. The baseline results show that the surface soil saturation of the four locations was relatively low in the dry months (January to April), ranging from 5% to 15% but could go up to as high as 100% in the rainy season (May to October). In the scenario with irrigation, the soil saturation increased two-fold to about 50 in the dry months. This contributes to surface water detention, which can potentially generate more breeding sites. As the irrigation ended in April with the onset of the rainy season, the increase in the saturation of the irrigation scenario over the baseline gradually reduced until they were both the same after more than a month.

Aquatic Habitat Survey
The aquatic habitats were surveyed during the dry (December 2017-February 2018) and rainy

Calibration and Validation
The soil saturation threshold θ for the rainy season from May to October of the baseline scenario was calibrated using the survey data in the same period to minimize the influence of dry season irrigation on the parameterization. This is because irrigation was not accounted for in the baseline scenario and was only approximated by a simplified scheme in the irrigation scenario. Of the 134 samples for the year 2018, some of the surveyed aquatic habitats such as man-made ponds, tire track puddles, and animal footprints which could not be simulated by the hydrologic model were omitted, leaving 102 samples for calibration and validation as shown in Fig. 3.
The objective of the calibration was to maximize the probability of detection (POD), which determines if the model can predict an aquatic habitat successfully. Other measures which can capture overprediction were not chosen here as the field data only cover locations with ponding and it is challenging to definitively rule out small puddles within the grid cell using satellite data. As shown in Equation (S1), the POD was calculated based on the ratio of the number of successful predictions or hits, H, to the total number of samples, S. In a successful prediction, the WI is at least 1 day given that an aquatic habitat is sampled at the same time and location. The POD ranges from 0 to 1, with 1 corresponding to perfect detection.

=
(S1) As the same dataset was used for both calibration and validation, 75% of the samples were randomly selected for the former and 25% for the latter. To ensure the relevance of the calibrated saturation threshold, bootstrapping was performed by randomly resampling the same dataset based on the 75:25 ratio with replacement to derive 1000 combinations as a representation of all the possible combinations. Each combination was used to calculate the POD corresponding to the saturation threshold within a predefined range of 0.4 and 1, at intervals of 0.02. To determine and assess the reliability of the optimal threshold, the average POD, and the associated 95% confidence interval values were computed for each saturation threshold.
Supplementary Figure S6