A global bionomic database for the dominant vectors of human malaria

Anopheles mosquitoes were first recognised as the transmitters of human malaria in the late 19th Century and have been subject to a huge amount of research ever since. Yet there is still much that is unknown regarding the ecology, behaviour (collectively ‘bionomics’) and sometimes even the identity of many of the world’s most prominent disease vectors, much less the within-species variation in their bionomics. Whilst malaria elimination remains an ambitious goal, it is becoming increasingly clear that knowledge of vector behaviour is needed to effectively target control measures. A database of bionomics data for the dominant vector species of malaria worldwide has been compiled from published peer-reviewed literature. The data identification and collation processes are described, together with the geo-positioning and quality control methods. This is the only such dataset in existence and provides a valuable resource to researchers and policy makers in this field.


Background and Summary
The behaviour and life history characteristics of a mosquito vector contribute to the relative importance of the species in terms of human malaria transmission 1. Biting location, biting time and host preference will influence how effectively a mosquito can transmit malaria. In addition, understanding the behaviour of the vector guides how best it can be controlled and the likelihood that a particular intervention measure will be successful 2. For example, a night-feeding, anthropophilic, endophagic and endophilic mosquito (i.e., a vector that preferentially bites humans indoors during the night when people are asleep and vulnerable and then rests indoors) is likely to be a highly effective transmitter of malaria (e.g., An. funestus 3). These same characteristics make this vector an ideal candidate for indoor insecticide-based control such as indoor residual spraying (IRS), which targets mosquitoes that preferentially rest indoors, or long-lasting insecticide-treated nets (LLINs), which target those species attracted to humans indoors at night. On the other hand, a species that is zoophilic, exophagic and exophilic (i.e., a vector that prefers to bite animals and that spends its adult life outdoors) would not be impacted by these control methods (e.g., An. arabiensis 3), but may be vulnerable to outdoor space spraying or insecticidal zooprophylaxis. Increasingly, malaria researchers are turning to transmission models to predict the impact of control measures on malaria transmission, to focus limited resources toward the most efficient measures of control and to address residual transmission 4. It is becoming more widely accepted that simply scaling up existing insecticide-based intervention methods is insufficient to tackle increasingly resistant vector populations or to impact existing, control-avoiding species 5-7.
Spatially explicit, species-specific behavioural data are needed to populate the emerging transmission models that aim to identify the pathways to achieve elimination 4 .
The dominant vector species (DVS) of Africa, the Americas and the Asia-Pacific region have previously been identified 1, and a brief literature survey of vector bionomics was conducted to accompany a series of papers that mapped the ranges of these species 3,8,9. The survey did not show the proportion of a species showing a particular trait, but instead the proportion of studies reporting the trait for each species. This highlighted two major points: firstly, the lack of published spatial datasets describing the ecology or behaviour of even the most dominant malaria vectors; and secondly, how much variation in behaviour exists within individual species. A comprehensive search for spatial bionomics data, incorporating behaviour, parasite infection and transmission potential plus other pertinent parameters, was therefore conducted (Figure 1).
The focus of this publication is to present the results of this work: a global, species-specific, temporally and spatially categorised database of the bionomics of the DVS of human malaria.

Methods
Bionomics data were abstracted from the published literature detailing research studies that included data on: (i) Vector biology, for example parity and longevity; (ii) Vector infection and transmission, for example sporozoite rate and entomological inoculation rate; (iii) Human biting rate; (iv) Vector host preference (quantifiable measures of anthropo- and zoophily); (v) Human biting preference (quantifiable measures of endo- or exophagy); (vi) Human biting activity (preferred time of biting); (vii) Resting preference (quantifiable measures of endo- or exophily).
Regional datasets were created for Africa, the Americas and the Asia-Pacific region within which all data were attributed to species and location. There is no single standard method for measuring each of the above parameters so full details including mosquito collection date, season and sampling method were recorded, where given.

Published data searches
Publications detailing occurrence data for the DVS were identified from the MAP DVS database 8 (date range of field data: 1985-2010). To ensure an up-to-date dataset, additional searches using the DVS specific names as search terms were conducted in PubMed 10 and Web of Science 11 covering literature published from 2010 to May 2013 for the African DVS and August 2014 for the American and Asia-Pacific DVS. Language restrictions were not placed on these searches. Full text digital copies of all publications were obtained. All articles written in English, French, Portuguese and Spanish were read, and those publications with no useful bionomics data were removed. The decision to only include data collected since 1985 was made to ensure that the dataset reflected the current distribution of the DVS and included specimens identified using more up-to-date identification methods and taxonomy 1 .

Data extraction
Each article was searched for relevant bionomics data related to both a given location and to one of the vector species in question. Data were extracted as reported in the source document, with no assumptions made, and only tabulated data or values reported in the text were accepted (no attempt was made to read graphically reported data from Figures). When possible, bionomics data for individual sibling species were extracted. However, where there was some ambiguity in the species being reported, they were recorded as the species complex (e.g., authors referring to An. gambiae but only relying on morphological identification and with no clear indication whether the specimens were considered An. gambiae species or An. gambiae complex).
www.nature.com/sdata/ SCIENTIFIC DATA | 3:160014 | DOI: 10.1038/sdata.2016.14
In 2013 the An. gambiae species was officially split into two formally named species corresponding to the previous classifications of molecular form 12. Form M is An. coluzzii and Form S is An. gambiae. Our study began before this classification was published and much of the data collated and included in this current work refers to An. gambiae in its old form (i.e., inclusive of An. coluzzii and An. gambiae). Where given, we captured full species details; however, these use the old molecular form classification. Therefore, all mentions of An. gambiae should be taken to include both An. coluzzii and An. gambiae unless specifically stated otherwise. Our dataset also records the previous classification of chromosomal form, where given. On occasion, despite conducting additional identifications to determine sibling species, authors presented their bionomics data for the species complex; this was also recorded as given.
Where possible, all data reported in the source for a specific location, time and species are combined on a single data line. For example, this means there may be information relating to a vector population's host preference, sporozoite infection rate and peak biting time all combined on a single row. However, as not all bionomics parameters were reported by every study, this also means that there are blank cells on each row. Blank cells always represent 'no data'.
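As a minimal sketch of how the 'blank means no data' convention might be respected when reading one of the comma delimited files, the Python below converts blank cells to None rather than to zero. The two sample rows are invented for illustration and the `species` column name is an assumption; `Parity_n`, `Parity_total` and `Parity_percent` are fields of the dataset.

```python
import csv
import io

# Two invented rows; the second has no parity data recorded.
# The 'species' column name is hypothetical.
SAMPLE = """species,Parity_n,Parity_total,Parity_percent
An. funestus,45,60,75
An. arabiensis,,,
"""


def to_number(cell):
    """Return a float, or None when the cell is blank (meaning 'no data')."""
    cell = cell.strip()
    return float(cell) if cell else None


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
parity = [to_number(row["Parity_percent"]) for row in rows]
```

Keeping blanks as None, rather than coercing them to 0, avoids treating an unreported parameter as a measured value of zero.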
Where given, season was recorded. Due to the high influence of season on mosquito behaviour and abundance, when it was not provided it was calculated from the dates given, either in the source or by searching for information detailing when the rainy and dry seasons normally occurred in the specific location. When season has been calculated this is recorded in a separate column, so that users of the dataset are aware that this was not included in the original data source.
For the African dataset a search of the accumulated bionomics library was conducted to identify those authors who were most prolific in publishing pertinent bionomics data. These authors were contacted to ask if they had any further, unpublished data they would be able to contribute. Any unpublished data were added into the dataset as above. Authors were also contacted to clarify details that were unclear or to disaggregate data where the source suggested more detail may have been collected in the study, but had not been presented within the published source. Due to time constraints this step was not carried out for the American and Asia-Pacific datasets.

Site geo-positioning
The majority of sites sampled had previously been geolocated in an earlier study mapping the ranges of the DVS 3,8,9. All additional sites were georeferenced following the same protocol, fully detailed in Hay et al. 1. In brief, site location was determined by searching for the site name in online gazetteers (e.g., Google Earth, Google Maps, GeoNames) or other geolocational resources (e.g., Microsoft Encarta).
Site related contextual information provided in the original reference (e.g., '10 km from the coast') was used to confirm that the correct site had been identified. Data locations were attributed to area types, including point locations (within 10 km²), wide areas (10-25 km²), small polygons (25-100 km²) or large polygons (>100 km²). Single sampling points were identified as point locations. However, data were often reported for several sampling sites combined. In this case, sampling locations were classed as wide areas, small polygons or large polygons depending on the extent of the sampling area. A single set of coordinates for the most central sampling site of the study is used to define the location of the sampling area, with the area type used to give an indication of the geographic spread of the sampling locations.
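The area-type thresholds above can be sketched as a simple classifier. This is an illustration only: the function name and the returned labels are hypothetical, and the treatment of values falling exactly on a boundary (10, 25 or 100 km²) is an assumption, since the text does not state which class they belong to.

```python
def area_type(extent_km2):
    """Classify a sampling area by its extent in square kilometres.

    Boundary values are assigned to the smaller class, which is an
    assumption made for this sketch.
    """
    if extent_km2 < 0:
        raise ValueError("extent cannot be negative")
    if extent_km2 <= 10:
        return "point"
    if extent_km2 <= 25:
        return "wide area"
    if extent_km2 <= 100:
        return "small polygon"
    return "large polygon"
```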

Bionomics datasets
We define a data record as a data point for a unique site-date-species combination. Table 1 reports the total number of data sources reviewed and the number of data records available for the most often reported bionomic parameters. A total of 1,837 published data sources were reviewed and data abstracted from 871 of these. Additional data were incorporated from 6 unpublished sources as a result of author contacts. Figure 2 shows the locations of the bionomics data study sites and indicates the global spread of the data. A list of countries for which at least one data record is available is shown in Table 2. Figure 3 displays the temporal spread of the data, by showing the starting year of the sampling period for each data source. There is no clear pattern to the number of published bionomic studies from each year, with a roughly equal spread of studies across time for each region. The lower number of studies for each region from 2010 onwards is possibly due to the lag period between field work and publication of results. Table 3 presents the number of data sources containing bionomics data for a selection of the most important DVS for malaria transmission, together with the number of individual sampling sites. The number of data records available for these species for a selection of the key parameters is also shown in Table 3. Although the number of data records is summed for members of a species complex in this summary Table, they are recorded separately for each sibling species in the bionomics datasets, if this was reported in the primary data source.
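To make the definition of a data record concrete, the sketch below counts unique site-date-species combinations in a handful of invented rows. The field names `site`, `start_date` and `species` are hypothetical placeholders, not the column names of the published files.

```python
from collections import Counter

# Invented rows; 'site', 'start_date' and 'species' are hypothetical names.
rows = [
    {"site": "Site A", "start_date": "1998-03", "species": "An. funestus"},
    {"site": "Site A", "start_date": "1998-03", "species": "An. arabiensis"},
    {"site": "Site B", "start_date": "2001-07", "species": "An. funestus"},
    {"site": "Site A", "start_date": "1998-03", "species": "An. funestus"},
]


def record_key(row):
    """Key identifying a data record: one site, one date, one species."""
    return (row["site"], row["start_date"], row["species"])


counts = Counter(record_key(row) for row in rows)
n_records = len(counts)  # number of unique combinations
repeated = [key for key, n in counts.items() if n > 1]
```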

Data Records
The three regional databases are publicly available online as comma delimited files (Data Citation 1). The data are also available via the Malaria Atlas Project (MAP) website 13.
Biology_sampling_n. 't' indicates that there are more than three sampling methods.
Parity_n. The number of parous females detected from the total number examined.
Parity_total. The total number of females examined for parity.
Parity_percent. The percentage of parous females in the sample: number of parous females/total number examined*100.
Daily_survival_rate_percent. The estimated proportion of female mosquitoes alive on day d that are still alive on day d+1.
Fecundity. The number of eggs laid per batch.
Gonotrophic_cycle_days. The number of days for a female mosquito to complete one feeding and reproduction (gonotrophic) cycle.
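The Parity_percent definition can be checked against Parity_n and Parity_total with a few lines of Python. This is a sketch rather than the procedure used to build the dataset; the function name is hypothetical, and returning None for missing or zero totals is a choice made here so that 'no data' is never confused with 0%.

```python
def parity_percent(parity_n, parity_total):
    """Percentage of parous females: parous / total examined * 100.

    Returns None when either count is missing or no females were examined.
    """
    if parity_n is None or parity_total is None or parity_total == 0:
        return None
    if not 0 <= parity_n <= parity_total:
        raise ValueError("parous count must lie between 0 and the total")
    return parity_n / parity_total * 100
```

For example, 45 parous females out of 60 examined gives 75%.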

VECTOR INFECTION RATE
Infection_sampling_1. The sampling methods used to collect the specimens detailed in the VECTOR INFECTION RATE section. Three methods can be listed. If more than three methods have been used, this is indicated as 't' in the final column. As 'Biology_sampling_1'.
Infection_sampling_2. As 'Infection_sampling_1'.
Infection_sampling_3. As 'Infection_sampling_1'.
Infection_sampling_n. 't' indicates that there are more than three sampling methods.
SR_dissection_n. The number of sporozoite infected females detected by dissection from the total number examined.
SR_dissection_total. The total number of females dissected for sporozoites.
SR_dissection_percent. The percentage of sporozoite infected females detected by dissection in the sample: number of infected females/total number examined*100.
SR_CSP_n. The number of sporozoite infected females detected by circumsporozoite protein (CSP) analysis from the total number examined.
Outdoor_host_total. The total number of mosquitoes sampled outdoors examined for measures of host preference.
Outdoor_host. The measure of host preference from outdoor sampled mosquitoes.
Combined_host_sampling_1. The sampling methods used to collect the mosquitoes from which host preference is evaluated where data are amalgamated from more than one method, or where the method used is unclear. Three methods can be listed. If more than three methods have been used, this is indicated as 't' in the final column. As 'Biology_sampling_1'.
Combined_host_sampling_2. As 'Combined_host_sampling_1'.
Combined_host_sampling_3. As 'Combined_host_sampling_1'.
Combined_host_sampling_n. 't' indicates that there are more than three sampling methods.
Combined_host_n. The number of mosquitoes positively indicating a measure of host preference collected by a combination of sampling methods.
Combined_host_total. The total number of mosquitoes sampled by a combination of sampling methods, examined for measures of host preference.
Combined_host. The measure of host preference from mosquitoes sampled by a combination of methods.
Host_unit. Indicates the measure used to identify host preference. HBI (%): Human Blood Index as a percentage. ABI (%): Animal Blood Index as a percentage. HBI (%calc): Human Blood Index as a percentage calculated from data given in source. ABI (%calc): Animal Blood Index as a percentage calculated from data given in source. AI: 'Anthropophilic Index', a measure of attraction to humans not included above, for example % individuals attracted to human baited trap over total collected in both human and cattle baited trap, calculated from count data. NB. the units 'HBI (%calc)' and 'ABI (%calc)' are used where the source provides the raw data needed to calculate HBI or ABI but does not actually present these indices. The unit indicates that the calculation has been done here.
Other_host_sampling_1. The sampling methods used to collect the mosquitoes from which host preference is evaluated where additional data are presented examining host preference. Three methods can be listed. If more than three methods have been used, this is indicated as 't' in the final column. As 'Biology_sampling_1'.
Other_host_sampling_2. As 'Other_host_sampling_1'.
Other_host_sampling_3. As 'Other_host_sampling_1'.
Other_host_sampling_n. 't' indicates that there are more than three sampling methods.
Other_host_n. The number of mosquitoes positively indicating a measure of host preference.
Other_host_total. The total number of mosquitoes examined for measures of host preference.
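Because Host_unit distinguishes values reported in the source from values calculated during data abstraction, a user may want to separate the two. The sketch below splits a Host_unit string into the index name and a 'calculated' flag. It assumes the unit strings appear in the files exactly as listed above (for example 'HBI (%calc)'), which has not been verified here; the function name is hypothetical.

```python
def parse_host_unit(unit):
    """Split a Host_unit value into (index, calculated_here).

    Returns None for a blank cell, which represents 'no data'.
    Assumes the unit strings match the documented forms exactly.
    """
    unit = unit.strip()
    if not unit:
        return None
    index = unit.split()[0]  # 'HBI', 'ABI' or 'AI'
    calculated = unit.endswith("(%calc)")
    return index, calculated
```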

HUMAN BITING LOCATION AND TIME
Indoor_number_sampling_nights_biting. The sampling effort, in number of 'man nights', to collect the indoor biting data.
Indoor_biting_sampling. The sampling method used to collect the indoor mosquitoes from which biting location preference is determined. As 'Biology_sampling_1'.
Indoor_biting_n. The number of mosquitoes found biting indoors.
Indoor_biting_total. The total number of indoor and outdoor biting mosquitoes.
Indoor_biting. The percentage or ratio of mosquitoes found biting indoors.
Outdoor_number_sampling_nights_biting. The sampling effort, in number of 'man nights', to collect the outdoor biting data.
Outdoor_biting_sampling. The sampling method used to collect the outdoor mosquitoes from which biting location preference is determined. As 'Biology_sampling_1'.
Outdoor_biting_n. The number of mosquitoes found biting outdoors.
Outdoor_biting_total. The total number of indoor and outdoor biting mosquitoes.
Outdoor_biting. The percentage or ratio of mosquitoes found biting outdoors.
Indoor_outdoor_biting_units. Indicates the data unit for the indoor and outdoor biting data. I:O: Indoor to outdoor ratio. %: % biting indoors (or outdoors) given in source. %calc: % biting indoors (or outdoors) calculated from data given in source. NB. the unit '%calc' is used where the source provides the raw data for indoor and outdoor biting densities but does not calculate the percentage indoors/outdoors. The unit indicates that the calculation has been done here.
Indoor_number_sampling_nights_biting_activity. The sampling effort, in number of 'man nights', relevant to indoor biting activity data.
Indoor_1830_2130. 't' given here if indoor biting activity peaks in the first quarter of the night, includes dusk biting.
Indoor_2130_0030. 't' given here if indoor biting activity peaks in the second quarter of the night.
Indoor_0030_0330. 't' given here if indoor biting activity peaks in the third quarter of the night.
Indoor_0330_0630. 't' given here if indoor biting activity peaks in the fourth quarter of the night, includes dawn biting.
Outdoor_number_sampling_nights_biting_activity. The sampling effort, in number of 'man nights', relevant to outdoor biting activity data.
Outdoor_1830_2130. 't' given here if outdoor biting activity peaks in the first quarter of the night, includes dusk biting.
Outdoor_2130_0030. 't' given here if outdoor biting activity peaks in the second quarter of the night.
Outdoor_0030_0330. 't' given here if outdoor biting activity peaks in the third quarter of the night.
Outdoor_0330_0630. 't' given here if outdoor biting activity peaks in the fourth quarter of the night, includes dawn biting.
Combined_number_sampling_nights_biting_activity. The sampling effort, in number of 'man nights', relevant to biting activity data where data are presented for both indoor and outdoor biting combined.
Combined_1830_2130. 't' given here if combined biting activity peaks in the first quarter of the night, includes dusk biting.
Combined_2130_0030. 't' given here if combined biting activity peaks in the second quarter of the night.
Combined_0030_0330. 't' given here if combined biting activity peaks in the third quarter of the night.
Combined_0330_0630. 't' given here if combined biting activity peaks in the fourth quarter of the night, includes dawn biting.
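The biting fields lend themselves to two small helpers, sketched below under stated assumptions. The first lists the quarters of the night flagged 't' for a given prefix ('Indoor', 'Outdoor' or 'Combined'). The second expresses an indoor biting value as a percentage, and assumes that a value with unit 'I:O' is stored as a single number equal to the indoor count divided by the outdoor count; the text does not specify the stored format, so this should be checked against the files. Both function names are hypothetical.

```python
QUARTERS = ("1830_2130", "2130_0030", "0030_0330", "0330_0630")


def peak_quarters(row, prefix):
    """Return the night quarters flagged 't' for 'Indoor', 'Outdoor' or 'Combined'."""
    return [q for q in QUARTERS if (row.get(f"{prefix}_{q}") or "").strip() == "t"]


def percent_indoors(value, unit):
    """Express an indoor biting value as % biting indoors.

    Assumes an 'I:O' value is a single number: indoor count / outdoor count.
    """
    if unit == "I:O":
        return value / (1 + value) * 100
    if unit in ("%", "%calc"):
        return value
    raise ValueError(f"unrecognised unit: {unit!r}")
```

Under that assumption, an indoor to outdoor ratio of 3 corresponds to 75% of bites occurring indoors.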

VECTOR RESTING LOCATION PREFERENCE
Indoor_resting_sampling. Indoor sampling method used to collect the mosquitoes to assess indoor resting behaviour. As 'Biology_sampling_1'.
Indoor_unfed. Total number of unfed mosquitoes in the sample collected indoors.
Indoor_fed. Total number of fed mosquitoes in the sample collected indoors.
Indoor_gravid. Total number of gravid mosquitoes in the sample collected indoors.
Indoor_total. Total number of mosquitoes in the sample collected indoors, including unfed, fed and gravid females.
Outdoor_resting_sampling. Outdoor sampling method used to collect the mosquitoes to assess outdoor resting behaviour. As 'Biology_sampling_1'.
Outdoor_unfed. Total number of unfed mosquitoes in the sample collected outdoors.
Outdoor_fed. Total number of fed mosquitoes in the sample collected outdoors.
Outdoor_gravid. Total number of gravid mosquitoes in the sample collected outdoors.
Outdoor_total. Total number of mosquitoes in the sample collected outdoors, including unfed, fed and gravid females.
Other_resting_sampling. Sampling methods relevant to 'other' data. These columns are used when additional sampling is reported, for example if indoor and outdoor resting mosquitoes are listed in the previous sections, but the source also reports data from a third sampling method such as mosquitoes resting in animal sheds. As 'Biology_sampling_1'.
Other_unfed. Total number of unfed mosquitoes in the sample collected by additional/'other' methods.
Other_fed. Total number of fed mosquitoes in the sample collected by additional/'other' methods.
Other_gravid. Total number of gravid mosquitoes in the sample collected by additional/'other' methods.
Other_total. Total number of mosquitoes in the sample collected by additional/'other' methods, including unfed, fed and gravid females.
Resting_unit. The unit relating to the indoor, outdoor or other resting data. Count: raw count data. %: percentage. Per man hour: total number collected divided by time spent collecting in hours. Fed:gravid: fed to gravid ratio, total number of fed specimens divided by total number of gravid specimens.
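Two of the resting units are simple quotients, which the following sketch spells out. The function names are hypothetical, and returning None when the denominator is zero is a choice made for this illustration rather than a rule taken from the dataset.

```python
def fed_gravid_ratio(fed, gravid):
    """Fed:gravid ratio: number of fed specimens / number of gravid specimens."""
    if gravid == 0:
        return None  # ratio undefined when no gravid specimens were collected
    return fed / gravid


def per_man_hour(total_collected, hours_collecting):
    """Per man hour: total number collected / time spent collecting in hours."""
    if hours_collecting == 0:
        return None
    return total_collected / hours_collecting
```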

CITATION
Citation. The data source.
PubMed_ID. PubMed ID, when available.

Technical Validation
Bionomics data have been recorded by a large number of researchers, often using different sampling methods and reporting the data using different metrics. Due to the complicated and non-standard nature of the data, all data were reviewed and checked by a second data abstractor. The data were also checked to ensure that recorded values were within the possible ranges (for example between 0 and 100 for parameters recorded as percentages) and that all values had associated units.
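The two automated checks described above (percentages within 0 to 100, and every value accompanied by a unit) might look like the sketch below. It is not the validation code used by the authors. The rule used to recognise a percentage unit (the string starts with '%' or contains '(%') is an assumption based on the unit labels listed in the field descriptions, and the function name is hypothetical.

```python
def check_value(value, unit):
    """Return a list of problems for one value and its unit.

    A value of None means 'no data' and is never flagged.
    """
    problems = []
    if value is None:
        return problems
    unit = (unit or "").strip()
    if not unit:
        problems.append("value has no unit")
    elif unit.startswith("%") or "(%" in unit:
        # Assumed percentage units, e.g. '%', '%calc', 'HBI (%)', 'ABI (%calc)'.
        if not 0 <= value <= 100:
            problems.append("percentage outside 0-100")
    return problems
```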
To ensure all locations were accurately geo-located these were again confirmed by a second data abstractor. As many of the data sources identified in this project had previously been included in mapping projects on parasite rate 15,16 and vector occurrence 3,8,9 the geolocation coordinates for these sites had already been confirmed. Coordinates were also plotted to ensure that they fall on land and in the correct country.

Usage Notes
This is the first time that a comprehensive global database has been compiled of published bionomics data for the DVS of human malaria. The dataset described here will be of value to researchers when assessing the likely impact of vector control measures on malaria transmission and to policy makers when deciding how malaria control resources are allocated. Searching the dataset for data related to a specific DVS, geographic location or bionomic parameter will allow the user to quickly identify the available data, and to link this back to the original data source. In addition, this dataset can be used to identify the current knowledge gaps in the behaviour and life history characteristics of the DVS across their geographic ranges.
The published studies did not use consistent units for each of the parameters of interest, and no attempt has been made to standardise the units as part of this work. It is vitally important that the values for each parameter are not treated as a single dataset that used a common methodology and unit. Users are strongly advised to examine the sampling methods and units fields provided for each parameter when making use of the data.
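One way to follow this advice is to group values by their unit before any summary is computed, so that figures in different units are never pooled. The sketch below does this for the Combined_host and Host_unit fields using invented rows; the function name is hypothetical, and the same pattern would apply to the other value and unit field pairs.

```python
from collections import defaultdict


def group_by_unit(rows, value_field, unit_field):
    """Collect numeric values per unit, skipping blank ('no data') cells."""
    groups = defaultdict(list)
    for row in rows:
        value = (row.get(value_field) or "").strip()
        if value:
            unit = (row.get(unit_field) or "").strip()
            groups[unit].append(float(value))
    return dict(groups)


# Invented rows for illustration only.
rows = [
    {"Combined_host": "62", "Host_unit": "HBI (%)"},
    {"Combined_host": "0.4", "Host_unit": "AI"},
    {"Combined_host": "58", "Host_unit": "HBI (%)"},
    {"Combined_host": "", "Host_unit": ""},
]
grouped = group_by_unit(rows, "Combined_host", "Host_unit")
```

Sampling method is a second source of non-comparability, so a fuller analysis would also group on the relevant sampling fields.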
We will be using these data to test specific hypotheses relating to the DVS of Africa, including the presence of an east-west behavioural cline; whether insecticide control has caused a continent-wide non species-specific shift to exophagy amongst previously endophagic species; whether insecticide control has caused a continent-wide non species-specific shift in biting times amongst night biting species; and whether the DVS are really behaviourally flexible, or if the observed plasticity actually relates to different sub-species or sibling species within a complex.