A long-term dataset on wild bee abundance in Mid-Atlantic United States

With documented global declines in insects, including wild bees, there has been increasing interest in developing and expanding insect monitoring programs. Our objective here was to organize, validate, and share an analysis-ready version of one of the few existing long-term monitoring datasets for wild bees in the United States. Since 1999, the Native Bee Inventory and Monitoring Lab (BIML) of the United States Geological Survey has sampled wild-bee communities in the Mid-Atlantic U.S., but samples were collected in multiple studies and the datasets are not fully integrated. Furthermore, critical information about sampling methodology was often lacking, though these factors can significantly influence collection outcomes and must be considered in analyses. We cleaned and verified BIML data from Maryland, Delaware, and Washington DC, USA, and generated sampling methodology for over 84% of the 99,053 pan-trapped occurrences in this region. We enthusiastically invite creative analyses of this rich dataset to advance understanding of the biology and ecology of wild bees, inform conservation efforts, and perhaps help design a nationwide bee monitoring program.


Background & Summary
Wild bees are crucial pollinators of many crop and wild-plant species. Globally, 87 out of 115 major food crops require insect pollination and 78% of temperate plant species and 94% of tropical plant species require insect pollination for reproductive success and persistence 1,2 . Wild-bee conservation is crucial for global biodiversity as pollinators represent several extremely diverse taxa and have a vital co-evolved role in supporting and selecting for diverse plant communities 3 .
In Europe and the United States, there has been a surge of interest in developing programs to monitor wild-bee populations [4][5][6][7] . There have been documented declines in several North American bumble bees populations and species 8 , but we largely lack baseline data to assess declines in many other taxa. In response to heavy colony losses since 2006, monitoring of honey bee (Apis mellifera L.) populations has greatly increased 9 . These data have greatly contributed to our understanding of honey bee population dynamics, ecological interactions, colony stressors, and effective management.
We expect coordinated, wild-bee monitoring will enable similar advances in understanding wild-bee diversity, natural history, and conservation strategies, but there are still several challenges in collecting and analysing monitoring data. Ideally, a wild-bee monitoring program, or any monitoring program for that matter, would utilize consistent methods over time to detect long-term population trends. In practice, monitoring protocols often shift over time, and many existing datasets were compiled from studies using multiple sampling methods 10 .
Our goal here was to organize, validate, and share an analysis-ready version of one of the few existing long-term, wild-bee monitoring datasets in the United States. Since 1999, the Native Bee Inventory and Monitoring Lab (BIML) of the United States Geological Survey has collected, curated, and identified over 99,000 specimens of more than 314 wild bee species from over 1400 sites in Maryland, Delaware, and Washington DC. The BIML collection is among the most geographically dense, recent sampling of wild bees in the country. Also, all specimens were lethally collected and expert-identified to species, which provides a crucial comparison for observation-based datasets with coarser taxonomic information, including rapidly growing citizen science data 11,12 .
BIML data contain considerable diversity in wild-bee-sampling methods and critical sampling information is often missing or not easily accessible to end-users. One goal of BIML research was to test a variety of passive sampling methods to establish a standard protocol for wild-bee monitoring. They studied sampling efficiency of passive pan traps (bee bowls) with varying trap color, volume, and liquid. Pan traps are a reliable method for monitoring wild bees 13,14 , particularly in combination with net collections or transect walks, but the specific pan-trapping method, especially trap color and visibility, influences the abundance and composition of wild bees collected [15][16][17] . To fully utilize the BIML dataset collected with multiple methods, trap color should be included as a variable in statistical analyses. Currently, this is not possible with the publicly available BIML data 18 as 90% of the pan-trapped occurrences in Maryland, Delaware, and DC have no trap color data. Similarly, 36% of occurrences are missing trap volume and 21% lack sampling effort (the number of traps used during each sampling period). Fortunately, much of the missing information was recorded in BIML field notes, but not in a consistent, accessible format.
To fill these gaps, we cleaned and verified BIML data and extracted trap color, volume, and sampling effort for more than 84% of the pan-trapped occurrences. We focused on occurrences collected in pan traps because they represent more than 82% of the larger BIML dataset. We also created, for all pan-trapped occurrences, identifier variables and verified species binomials, date, and locality. We enthusiastically invite analyses of this rich dataset to advance understanding of bee biology and ecology, inform current wild bee conservation efforts, and perhaps even help design a national wild bee monitoring program.

Methods
Identifier variables. To facilitate aggregation and analysis of the BIML data, we added 'site' , 'site-year' , 'sampling event' , and 'transect' identifier variables. We defined 'sites' as unique combinations of latitude and longitude, and 'site-years' as unique combinations of site and year of sampling. Within site-years, we defined 'sampling events' according to the date of sampling and 'transects' as unique combinations of sampling event and text field notes. For some specimens, field notes included a transect ID, indicating that the BIML used multiple sets of pan traps at the same site. In other cases, field notes recorded differing sampling methods, or different information on the number of missing traps (traps that were cracked, tipped over, or otherwise compromised). If field notes recorded different methods or number of lost traps, we assumed that the BIML deployed multiple sets of traps (transects). We reviewed the field notes for all sampling events with multiple transects and reassigned these occurrences to a single transect if there was no evidence of multiple transects in the field notes.
Locality and taxonomic identification. Next, we reviewed and excluded occurrences lacking critical date and locality information. We removed all occurrences lacking sampling date or latitude and longitude of sampling location and occurrences with duplicated specimen identifiers. We filtered occurrences to a limited geographic area (Maryland, Delaware, and Washington DC, Fig. 1) that represents the densest region of BIML sampling (39.6% of dataset). This filtering removed wild-bee communities collected in desert or tropical biomes, which are likely governed by very different floral resource and climate dynamics 19,20 , and within the Mid-Atlantic USA, limited sampling locations to a region with a consistent dominant forest type 21 . Bee occurrences in 1999 and 2001 represented fewer than three sites per year, so we removed these years, retaining sites sampled from 2002-2016.
We also filtered data to our taxa of interest. We removed non-bee occurrences (species outside superfamily Apoidea, clade Anthophila) and records lacking species-level identity, discarding occurrences identified to family or genus (Online-only Table 1). Almost all non-bee occurrences we removed were wasps in the Vespidae, Crabronidae, and Sphecidae families, which are primarily predators, rather than pollen-collectors like most wild bees. For the transect-level dataset 22 (see 'Data Records' below), we calculated the abundance of Apis mellifera L., then removed A. mellifera from the dataset before calculating total bee abundance per transect, since often A. mellifera specimens likely originated from managed colonies and are not considered to be wild bees. www.nature.com/scientificdata www.nature.com/scientificdata/ We verified species names by cross-referencing all species binomials with the Discover Life database 23 . We corrected genus and species names that were clear spelling errors (Online-only Table 1) and consulted the original data source (S. Droege) for remaining species binomials that did not exist on Discover Life. We also referenced Discover Life occurrence maps to confirm that all species in the BIML dataset occur in the Mid-Atlantic US. After these data cleaning steps, we removed six occurrences of the remaining five unknown or out-of-region species (Online-only Table 1). Some species in the BIML data were identified singularly and as part of a species set. To avoid double counting these species, we created a new variable with cleaned, mutually exclusive species names (termed 'grouped name'). In 'grouped name' , we combined singular species names with their associated species sets (Online-only Table 1). For example, we reclassified occurrences identified as Halictus ligatus/poeyi, Halictus ligatus, or Halictus poeyi to Halictus ligatus/poeyi to avoid inflating future species richness estimates when occurrences might be the same species. In the final occurrence datasets, we included the cleaned, singular names ('name') and cleaned, grouped names ('grouped_name'), so future analysts can select the appropriate taxonomic aggregation for their research objectives. Voucher specimens for most species in the BIML dataset are housed in the Smithsonian collection, but some are not yet permanently archived. We suggest interested parties contact Sam Droege (current email: sdroege@usgs.gov) to access voucher specimens. We also included, to the best of our knowledge, current affiliations for individuals who identified BIML specimens (Supplementary Information, Table S1), and standardized names of identifiers ('identifiedBy') in the final datasets.
Sampling method and effort. To describe sampling method and effort, we used regular expressions to extract these data from field notes. We sought to compare bee communities sampled with a standard methodology, so we discarded bee occurrences collected with vane traps or nets, only retaining occurrences sampled with pan traps (i.e., bee bowls). Using the stringr package in R 24,25 , we searched the text of field notes to document trap volume, trap color, total number of traps, and the number of traps missing or disturbed. The most common BIML pan-trapping method involved setting out traps of multiple colors and combining the bees in all traps into one sample. Consequently, BIML recorded trap color in field notes as the number of traps of each color used for a specific sampling event. We designed regular expressions to extract the number of traps for the eight most common colors (white, blue, yellow, pale blue, fluorescent yellow, fluorescent blue, and florescent pale blue). For some occurrences, our regular expressions yielded no sampling information, so we manually reviewed these field notes and recorded any data missed by the automated search.
Next, we simplified the trap color and volume classification to facilitate future statistical analyses. To reduce the number of trap volume categories, we rounded trap volume to the nearest 0.5 ounces, and removed trap volumes greater than 40 ounces, assuming these were errors in data entry or extraction. When the trap color or volume used at a specific site changed within a year, we manually reviewed the field notes and corrected color or volume classifications when necessary. After correcting these discrepancies, we found the BIML very rarely changed sampling methods within a year, so we filled in most missing trap color or volume information by assuming a constant sampling method for all transects within a site-year. Finally, we combined rarely used color/ volumes (fewer than 1% of transects) into an 'Other' category. In the archived datasets with sampling information, we included original and simplified variables for trap color and volume.
Lastly, we summarized sampling effort and calculated effort-adjusted abundance of wild bees. We calculated the total number of traps for each sampling event, and, when available, we also described the number of traps missing or disturbed. If there was no documentation of missing or disturbed traps, we assumed all traps were recovered successfully. When the total number of traps differed between the field notes and 'number of traps' column, we selected the lower value. We defined the final number of traps in each transect as the original number minus the number missing or disturbed. We calculated the duration of sampling as the difference between the date traps were collected and date traps were set. If traps were set and collected on the same day, we set the duration of sampling to one day. For each occurrence in the BIML dataset, we converted bee abundance to abundance day −1 trap −1 . We conducted all data manipulation and aggregation with the R statistical and computing language 3.6.0 25,26

Data Records
We produced two primary datasets and a third analytical dataset 22 (details below), which are all archived at figshare.com. The first dataset ('Mid-Atlantic USA wild bee occurrences, all records') contains all 99,053 wild bee occurrences collected in pan traps by BIML from 2002 to 2016 in Maryland, Delaware, and Washington DC, USA. We describe all variables included in dataset one in Table 1. The second dataset ('Mid-Atlantic USA wild bee occurrences, records with sampling info') is all pan-trapped occurrences with sampling effort information. We retained 83,583 wild bee occurrences in dataset two representing 300 species from 1225 sites sampled over 1322 site-years. Dataset two documents the trap color and volume when these data were available. See Online-only Table 2 for full meta-data for dataset two.
We also generated an analytical dataset 22 ('Mid-Atlantic USA wild bee abundance per transect, records with sampling info') derived from dataset two. Dataset three documents wild-bee abundance, trapping method, and sampling effort per transect (see 'Identifier Variables' methods above). At the transect level, we defined trap color and volume as the combination of all trap colors and volumes associated with specimens captured in a transect. Table 2 describes all meta-data for dataset three.

technical Validation
We verified our sampling method and effort data in two ways. First, as described above (see 'Sampling method and effort'), we manually reviewed field notes for all occurrences lacking sampling effort information. This allowed us to correct instances where sampling data were missed by our automated processing. Second, for occurrences that had sampling effort information, we quantified the accuracy of our processing procedure. We selected www.nature.com/scientificdata www.nature.com/scientificdata/ a random sample of 1000 occurrences and compared field-note text and sampling-method variables from the original dataset with our sampling effort and method data. We noted whether trap color, trap volume, and the total number of traps were assigned correctly and quantified how much information we created compared with the original BIML dataset.
Overall, we were able to generate or verify pan-trap sampling effort and methodology for 84% of BIML specimens in the Mid-Atlantic U.S. We improved data coverage for trap color, trap volume, and sampling effort by 71%, 4.7%, and 6.1%, respectively. The percentage of occurrences with present, accurate sampling data was good to excellent for all three variables (trap color: 82.2%, trap volume: 88.6%, sampling effort: 95.9%,). Specifically, 888 observations in our validation set were missing trap color data. Using our automated text search, we accurately described trap color for 714 occurrences. Of the remaining 174 occurrences, we generated partial or incorrect data for 27 and 2 occurrences, respectively, and 145 had no color data in the text field notes. For trap volume and sampling effort (the total number of traps used in each sampling event), the original BIML dataset was much www.nature.com/scientificdata www.nature.com/scientificdata/ more complete. Out of 1000 specimens in our validation dataset, sampling effort and trap volume were missing for 50 and 161 occurrences, respectively. We generated accurate sampling effort for 41 occurrences and corrected BIML data for an additional 20. Of the 161 occurrences lacking trap volume, we filled in 47.

Usage Notes
The BIML data is a rich source of information on wild-bee biodiversity but it has some notable limitations. First, the BIML sampled few locations more than three years. The main goal of BIML collections was to establish baseline estimates of wild-bee species richness in the Mid-Atlantic region. They selected sites representing a wide variety of habitats, focusing on sampling many sites rather than repeated sampling at the same sites over time. Consequently, the BIML dataset has excellent geographic coverage in Maryland, Delaware, and Washington DC (Fig. 1), but lacks long time-series of bee abundance or community composition in consistent locations. Similarly, the BIML did not sample an equal number of sites for all habitat types represented in the dataset. For example, there are relatively few sites in highly agricultural landscapes compared with forested areas, so an analysis of the whole dataset may fail to reveal processes specific to agricultural land. For analyses focused on specific habitat types, we recommend sub-sampling or otherwise adjusting for uneven representation among habitat types.
Second, the dominant sampling method changed over the study period. We found that the BIML utilized different colors and volumes of pan traps over time (Figs. 2 and 3). Before 2003, the BIML used a variety of pan-trap colors and volumes. From 2003-2013, the most common sampling method was 3.25 ounce white, blue, and yellow pans, and in 2014, the BIML transitioned almost all their sampling to 12-ounce white, fluorescent blue, and fluorescent yellow bowls.
These features limit, or at best complicate, interpretation of temporal patterns of bee abundance or community composition in the BIML data. For example, a pattern of declining bee abundance over time could be explained by a less effective sampling method or sampling poorer sites or less attractive habitats in later years. In our view, the BIML data are not well suited for assessing temporal trends, as researchers must account for sampling and location changes over time or use a small subset of the data sampled with consistent methodology.
The BIML data we archive here were sampled with pan traps, which impose some inherent biases and limitations. Pan traps significantly under sample large-bodied bees 15,27 , probably because larger-bodied individuals are able to break surface tension in a container of liquid and fly away. Consequently, Bombus sp., Xylocopa virginica, and other large-bodied bees are less common in pan-trapped data than net-collected or observational datasets. Additionally, some evidence suggests pan traps perform poorly in areas with highly abundant flowering plants [28][29][30] , such as field edges adjacent to mass-flowering crops. In proximity to flowering plants, pan traps are less attractive to wild bees than real flowers, leading to an artificially depauperate wild-bee community captured by the pan traps.
Consequently, the BIML datasets we produced are not well suited for studies on large-bodied taxa or in areas with highly abundant flower resources, however these limitations could likely be addressed by combining this dataset with net collections or direct observations. Net collecting has its own limitations of more active time spent per specimen and observer biases, but there are few limitations in common between the two methods. In addition to pan-trapped specimens, the BIML net-collected approximately 16,000 specimens from 2002-2016 in Maryland, Delaware, and Washington DC 18 , but these collections did not necessarily occur at pan-trapping sites. Also, we did not include net-collected occurrences in our data processing pipeline, so species names should be verified before analysis, and sampling effort information may not exist.  www.nature.com/scientificdata www.nature.com/scientificdata/ We also advise future analysts that some BIML sampling locations are very close to each other. We created 'site' designations as all unique combinations of latitude and longitude, no matter the distance from other sites. In the format we provide, the BIML data violates the assumption of independent observations imposed by a frequentist statistical framework. Not all analytical frameworks assume independent observations, so we did not combine or thin sites, but refer interested parties to statistical methods or software designed for this task (k-means clustering or similar, spThin 31 for spatial thinning).

Code availability
All R code used to generate the archived data is available at Zenodo 26 .