Citizen science helps in the study of fungal diversity in New Jersey

The history of fungal diversity of the Northeastern United States is currently fragmentary and restricted to particular functional groups or limited geospatial scales. Here, we describe a dataset that is unique in its size, time span and data originators, and that improves our understanding of species occurrence and distribution across the state and over time. Between 2007 and 2019, over 30 parks and nature preserves were sampled during forays conducted by members of the New Jersey Mycological Association (USA), a nonprofit organization of fungi enthusiasts. The dataset contains over 400 000 occurrence records for over 1400 species across the state, made up mostly of the phyla Basidiomycota (89%) and Ascomycota (11%), with most observations resolved to the species level (>99%). The database is georeferenced and openly accessible through the Global Biodiversity Information Facility (GBIF) repository. This dataset marks a productive endeavor to contribute to our knowledge of the biodiversity of fungi in the Northeastern United States, leveraging citizen science to better resolve the biodiversity of this critical and understudied kingdom.

only local communities, but entire mycological congresses 22 use approaches such as BioBlitz to complement traditional surveys.
At present, nature conservation efforts for fungi are still developing in the United States 23,24 , as prior survey initiatives have been largely scattered and maintained by the voluntary efforts of resident regional mycologists (amateur or professional). Due to the fragmented nature of previous observations, there is a need to resolve fungal species diversity at regional scales in the United States, especially in highly populated areas composed of multiple habitat and ecosystem types influenced by the legacy of citizen science initiatives 5,25,26 . For states with a variety of ecosystem types, such as New Jersey, this presents a major gap in knowledge given the degree of geographic and ecological diversity seen state-wide. While some efforts have been made to characterize fungal diversity for New Jersey, these attempts have largely been restricted to smaller geographic scales or specific taxonomic groups of interest such as lichens 27,28 or parasitic species 29 . Despite these historical limitations, there is a pressing need to better characterize fungal biodiversity at larger spatial scales, with recent work at global scales emerging for specific fungal guilds [30][31][32][33][34] .
The challenge presented by the lack of sufficient fungal diversity data can be resolved, to some extent, using citizen science. With the recent development of digital technologies, citizen science has been successful in contributing to fundamental research [35][36][37] . Several online platforms collect photographic and written observations directly from citizens (e.g. mushroomobserver.org, iNaturalist.org or fundis.org). However, the observations made via these portals are mostly recent. The tremendous efforts of amateur groups who tracked fungal diversity before the digital era often remain closed to the global research community because proper data storage and sharing protocols were lacking: it was not uncommon for local organizations to keep records on a personal computer or in hand-written form. This limitation in protocols for sharing data collected by citizen scientists also presents one of the major opportunities for researchers of biodiversity at larger spatial scales, and making these datasets openly available is critically important to better resolve global biodiversity.
Here, we describe a dataset consisting of fungal taxa for the state of New Jersey collected as part of citizen science forays in 32 parks and nature preserves throughout the state, separated into nine sub regions 38 (Fig. 1, Table 1).
Data included in this dataset were collected between 2007 and 2019 by volunteers of the New Jersey Mycological Association (NJMA, www.njmyco.org) as part of their organization's yearly sampling forays. Established in 1971 as the Lakeland Mycology Club, NJMA is now a non-profit organization with over 800 members motivated by their interest in fungi, and the only organization of its kind in the state of New Jersey. NJMA has amassed a wealth of citizen science data through decades of sampling events. Importantly, NJMA maintains an active herbarium of approximately 3000 vouchered specimens stored at Rutgers University in New Brunswick; however, this repository is currently not a part of the Chrysler Herbarium (CHRB) at Rutgers. Here, we showcase how large quantities of data collected by volunteers across a variety of habitats and locations over a span of 12 years have contributed to scientific knowledge in a cost-effective and data-rich manner. Our aim was to prepare the collected data in a standardized format and make it open access, to increase its applicability to fungal biodiversity research. The resultant dataset is also meant to raise interest among citizen scientists and scientists in increasing the amount of accessible data on the distribution of species 37,39 . Given that the North American Mycological Association (NAMA, https://namyco.org/clubs.php) has records of over 90 similar groups across 37 states, this type of citizen science driven data collection has the potential to substantially increase our knowledge of fungal taxa across the United States.
The dataset presented here 38 highlights the taxonomic diversity for the state of New Jersey from 210 surveys (corresponding to 210 records in the event table), with 400 260 records in the occurrence table. Overall, 1906 taxa with presence/absence information for each survey are published. In total, 96% of the records in the occurrence table are absence data. The taxonomic structure comprises 2 kingdoms (Fungi and Protista for slime molds), 5 phyla, 20 classes, 58 orders, 162 families, 516 genera and 1483 species. Some species names that were assigned to the observations earlier in the data collection are now outdated, but are kept in the dataset as synonyms or under-identified taxa. Together with these records, the total number of taxa sums up to 1850. Taxa varied both by environment type based on primary forest composition (Fig. 2a) and by region (Fig. 2b), although the regional effect may also be influenced by sampling month.
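The record count is consistent with the survey design described above: one presence or absence record for each of the 1906 published taxa in each of the 210 surveys gives 210 × 1906 = 400 260 occurrence records. The Python sketch below illustrates this expansion on a toy example. The event identifiers, species lists and helper names are hypothetical and are not taken from the dataset; only the Darwin Core term names (`eventID`, `scientificName`, `occurrenceStatus`) are real.

```python
from typing import Dict, List, Set


def expand_presence_absence(observed: Dict[str, Set[str]], taxa: List[str]) -> List[dict]:
    """Emit one record per (event, taxon) pair, marking unobserved taxa as absent."""
    rows = []
    for event_id in sorted(observed):
        for name in taxa:
            status = "present" if name in observed[event_id] else "absent"
            rows.append(
                {"eventID": event_id, "scientificName": name, "occurrenceStatus": status}
            )
    return rows


def absence_fraction(rows: List[dict]) -> float:
    """Share of records whose occurrenceStatus is 'absent'."""
    absent = sum(1 for row in rows if row["occurrenceStatus"] == "absent")
    return absent / len(rows)


# Hypothetical toy input: two forays and a three-taxon checklist.
toy_observed = {
    "foray-1": {"Amanita muscaria"},
    "foray-2": {"Amanita muscaria", "Trametes versicolor"},
}
toy_taxa = ["Amanita muscaria", "Trametes versicolor", "Xylaria polymorpha"]
toy_rows = expand_presence_absence(toy_observed, toy_taxa)

# Counts reported in the text: 210 events and 1906 taxa.
expected_records = 210 * 1906
```

Because every survey contributes a record for every taxon in the checklist, most records are necessarily absences. With the reported absence share of 96%, the number of presence records would be on the order of 16 000.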
The dataset is supplemented with openly available environmental variables of interest (average region temperature; maximum region temperature; average region dew point; average region precipitation; average region wind; retrieved from the United States Geological Survey https://www.usgs.gov and National Weather Service https://www.weather.gov databases) for each sampling location in order to provide insight into factors contributing to changes in the distribution of taxa. This information is available for each sampling event (survey) in the dataset at GBIF.org.
Despite the variable nature of collection across this period (such as differences in how often the same sampling sites were visited, Table 2), this dataset presents an opportunity for researchers and citizen scientists interested in fungal biodiversity of the Northeastern United States. The dataset describes relative abundances for common taxa over time. Their trophic types are presented in Fig. 3. To expand our understanding of how this dataset compares to other similar attempts to capture fungal diversity across geographic scales, we compared our guild data to other Agaricomycetes datasets published at GBIF.org for New Jersey, the United States, and globally (Fig. 4). Despite some variability in the specific species present across these spatial scales, we found similar proportions of the guild types (Fig. 4). Together, this suggests that our dataset captures information similar to other datasets of this type when considering functional roles of fungi, while adding to our knowledge of region-specific introduced or newly observed species as biodiversity changes globally (Tables 3, 4).
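A comparison of guild proportions of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used for Fig. 4: the species-level trophic labels are hypothetical, the function names are ours, and the largest absolute difference in proportions is just one simple way to summarize how similar two lists are.

```python
from collections import Counter
from typing import Dict, Iterable


def guild_proportions(trophic_modes: Iterable[str]) -> Dict[str, float]:
    """Convert one trophic label per species into proportions that sum to 1."""
    counts = Counter(trophic_modes)
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}


def max_abs_difference(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Largest absolute gap in proportion across all trophic modes in either list."""
    modes = set(p) | set(q)
    return max(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in modes)


# Hypothetical labels for a regional list and a larger reference list.
regional = ["Saprotroph"] * 5 + ["Symbiotroph"] * 4 + ["Pathotroph"] * 1
reference = ["Saprotroph"] * 11 + ["Symbiotroph"] * 7 + ["Pathotroph"] * 2

regional_p = guild_proportions(regional)
reference_p = guild_proportions(reference)
gap = max_abs_difference(regional_p, reference_p)
```

Working with proportions rather than raw counts keeps lists of very different sizes comparable, which matters when a state-level checklist is set against a national or global one.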
We suggest that this dataset, and the associated collection methods used by the New Jersey Mycological Association, could be used as a model for a systematic approach for evaluating fungal diversity across the United States. By increasing open and digital access to fungal data, we expect that the presented dataset contributes to more complete documentation of life on Earth beyond charismatic taxa.
(Table 1) selected due to their probability of having representative fungal biodiversity for that part of the state. Samples for the foray locations were collected for identification from within the boundaries of the sampling sites.
www.nature.com/scientificdata
Sampling sites were distributed across the entire state of New Jersey (Fig. 1) and were visited with some temporal variability within each year. Some foray locations within the dataset were unique to only part of the 12-year sampling window, while other foray locations were sampled consistently across the entire sampling period (e.g. Pocono Environmental Education Center vs Wawayanda State Park, Table 2). Sites were sampled between May and November of each year, with citizen scientists sampling one site per day. Sites were normally sampled during the same month across years, though some variation in sampling time did occur. To better describe similar regions across the state, we assigned regional identifications to each foray sampling locality based on habitat similarity and available climatological data. The sites sampled were selected to provide a measure of the fungal biodiversity within different ecosystem types representative of the state. Forays stopped in 2020 due to the COVID-19 pandemic, and data after 2019 were not included due to changes in sampling activity.
Data acquisition. The established sampling foray method has been practiced by NJMA for the past 30 years.
Sampling forays were conducted for two hours at each foray location, with any member able to participate in collection. Some forays were made open to the public, and participant numbers ranged from 5 to 30 people, with all participants starting from a specific starting point. Within the two-hour foray period, samplers surveyed approximately a one-mile radius around the starting point, collected any visible sporocarps and returned them to foray leaders for identification. Hypogeous taxa were not explicitly sampled as part of these forays, and the focus was on macroscopic fruit bodies (with select observations of micro fungi). Sampled taxa were identified on site by foray leaders, and records were stored in the NJMA documentation archive, with some samples stored in the herbarium. This combined strategy, using experts for identification and many participants for sample collection, effectively leveraged citizen science to make use of many samplers without formal scientific training in collection when only a limited number of visiting taxonomic experts are available. The lists of observed species were recorded and saved as hand-written, PDF or Microsoft Office documents and stored on personal computers of NJMA members. A summary of the foray results was published in a PDF e-letter from the Association to its members and also shared on their website www.njmyco.org.
Datasets for different regional scales used in Fig. 4 and Table 4 were retrieved from GBIF.org and checked for accuracy to ensure that species names matched across regional lists. A list of Agaricomycetes, a class highly represented in the NJMA dataset 40 , was selected from the dataset and used to compare diversity at higher spatial scales. Checklists for preserved specimens within the Agaricomycetes were retrieved (filtered by basis of record = "preservedSpecimen" and occurrence status = "present") for three regional scales: New Jersey 41 , the United States 42 , and global 43 .
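The filtering step described above can be approximated in Python as follows. This is a sketch under stated assumptions rather than the workflow used for this study: it assumes a tab-separated occurrence table with `species`, `class`, `basisOfRecord` and `occurrenceStatus` columns, tolerates both the camelCase spelling used in the text and an upper-case underscore spelling of the filter values, and runs on a small invented sample.

```python
import csv
import io
from typing import Set


def normalize(value: str) -> str:
    """Compare vocabulary values regardless of case or underscores."""
    return value.replace("_", "").lower()


def preserved_checklist(tsv_text: str, taxon_class: str = "Agaricomycetes") -> Set[str]:
    """Unique species names from present, preserved-specimen records of one class."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    species = set()
    for row in reader:
        if (
            row["class"] == taxon_class
            and normalize(row["basisOfRecord"]) == normalize("preservedSpecimen")
            and normalize(row["occurrenceStatus"]) == normalize("present")
        ):
            species.add(row["species"])
    return species


# Hypothetical sample rows; real downloads contain many more columns.
sample_rows = [
    ["species", "class", "basisOfRecord", "occurrenceStatus"],
    ["Amanita muscaria", "Agaricomycetes", "PRESERVED_SPECIMEN", "PRESENT"],
    ["Amanita muscaria", "Agaricomycetes", "HUMAN_OBSERVATION", "PRESENT"],
    ["Trametes versicolor", "Agaricomycetes", "PRESERVED_SPECIMEN", "ABSENT"],
    ["Xylaria polymorpha", "Sordariomycetes", "PRESERVED_SPECIMEN", "PRESENT"],
    ["Boletus edulis", "Agaricomycetes", "PRESERVED_SPECIMEN", "PRESENT"],
]
sample_tsv = "\n".join("\t".join(cells) for cells in sample_rows)
checklist = preserved_checklist(sample_tsv)
```

Collecting names into a set turns many occurrence records into a checklist, which is the unit compared across the New Jersey, United States and global scales.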
Taxonomic identification. Initial identification of taxa collected during forays was completed by foray leaders in the field using existing literature (listed in the GBIF repository 43 ) by assigning the species name of the closest morphospecies.
Table 3. Species counts for Agaricomycetes datasets published at GBIF.org. NJMA: the dataset of this study 40 . GBIF NJ: the dataset of preserved specimens for the New Jersey region 41 . GBIF USA: the dataset of preserved specimens for the USA 42 . GBIF Global: global records of preserved specimens 43 . Total species counts are presented on the diagonal. Unique or shared species numbers are presented above the diagonal.
Data digitization and unification. The species lists were obtained from various NJMA members and converted from the existing format (Word, PDF, Excel) to Excel-based templates compatible with the EarthCape database (https://earthcape.com/, 44 ). Spell checking, formatting, and association of data with information fields such as locality name or scientific taxon name were carefully performed. EarthCape allowed consolidation of locations into different user-defined regions according to geographic location, habitat type, or climatic zone. The EarthCape database also ensured consistent taxonomic synonymizing by comparing user-assigned species identities against the currently accepted taxonomic names of the GBIF taxonomic backbone at GBIF.org, and allowed the data to be converted to the GBIF format in preparation for dataset publication.

Data records
The dataset contains a description of whether a species (or in rare cases, a genus) was observed during a particular foray event. For all taxa observed across all forays, the presence or absence of each taxon is recorded for a particular foray and supplemented by the foray time and location, geographic data (coordinates, region, etc.), habitat type based on the dominant hardwood in that location, and climatic variables (averages for temperature, precipitation, wind speed, dew point). Data on soil chemistry and geological variables were retrieved from the United States Geological Survey (https://www.usgs.gov). Data for climate variables were retrieved from the National Weather Service (www.weather.gov), with representative collection stations identified and used for each region.
Our database is stored locally and is freely accessible through the GBIF (Global Biodiversity Information Facility) repository (www.gbif.org) under the DOI https://doi.org/10.15468/7scek4 38 . For each occurrence record there are 61 fields of information, recorded using terms of the Darwin Core standard (DwC) 45 (http://rs.tdwg.org/dwc/terms). The database includes a supplemental data table that provides the climatological and geological data for each foray. This "Measurement or fact" extension table can be downloaded together with the source data. The dataset will be updated as new yearly forays occur to keep data consistent across forays. It is our intent that this data collection, made possible by the common interests of citizen scientists and scientists, continues to expand our knowledge of fungal distribution and biodiversity.
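For readers who want to combine the occurrence records with the event-level climatological and geological data, the Python sketch below shows one way to link the three tables through the Darwin Core `eventID` term. It is an illustration only: the in-memory CSV snippets, identifiers and values are hypothetical, the column subset is far smaller than the 61 fields mentioned above, and the assumption that measurements are keyed by `eventID` should be checked against the downloaded archive.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def read_table(text: str) -> List[Dict[str, str]]:
    """Parse a comma-separated table held in a string into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def join_tables(occurrences, events, measurements):
    """Attach event fields and per-event measurements to each occurrence row."""
    event_by_id = {event["eventID"]: event for event in events}
    facts = defaultdict(dict)
    for m in measurements:
        facts[m["eventID"]][m["measurementType"]] = m["measurementValue"]
    joined = []
    for occ in occurrences:
        row = dict(occ)
        row.update(event_by_id.get(occ["eventID"], {}))
        row["measurements"] = dict(facts.get(occ["eventID"], {}))
        joined.append(row)
    return joined


# Hypothetical miniature versions of the three linked tables.
occurrence_csv = """occurrenceID,eventID,scientificName,occurrenceStatus
occ-1,foray-1,Amanita muscaria,present
occ-2,foray-1,Trametes versicolor,absent
"""
event_csv = """eventID,eventDate,locality
foray-1,2015-09-12,Example State Park
"""
measurement_csv = """eventID,measurementType,measurementValue,measurementUnit
foray-1,average temperature,18.5,degrees Celsius
foray-1,average precipitation,3.2,mm
"""

joined_rows = join_tables(
    read_table(occurrence_csv), read_table(event_csv), read_table(measurement_csv)
)
```

Keeping measurements in a nested dictionary avoids inventing one column per variable, so the same code works whichever climatological or geological variables an event carries.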

Technical Validation
The data were validated using a standardized procedure for digitization, formatting, and content checking of the occurrences, as in earlier studies 46 . Integration and digitization of data from various resources (electronic files, hand-written documents etc.) were performed using the built-in validation tools of the EarthCape database software 44 , such as formatting and spelling checks, linked tables, alignment of nomenclature with the GBIF backbone 47 , and synonymizing. To verify the names and authors of all observed taxa, species names were confirmed using Index Fungorum (http://www.indexfungorum.org). Homotypic names were checked to confirm that they refer to the accepted names. Next, the data were exported into three linked tables using the Darwin Core standard 45 : occurrence, event and measurementOrFact. The final data cleaning and processing was performed using Linux command-line scripts (bash and awk) by R. Mesibov 48 , and covered the structure, format and content of the data. All data were checked again once prepared for publication via GBIF to validate the taxonomy, climate data, and the occurrence status. Our dataset was compared to similar records, as well as records at larger spatial scales, from GBIF.org (Table 3). We confirmed that species from our dataset were captured at larger scales, with the exception of a few unique observations. When compared to the GBIF data available for New Jersey, our dataset shows 557 shared species. The difference, with 691 species unique to GBIF records for New Jersey and 605 species unique to our dataset, likely results from the study focus, regions sampled, and changes in the fungal composition of the state over the years. Our dataset is also comparable to some of the largest datasets for fungal biodiversity in GBIF, contributing significantly to the global data pool (6% of global fungal occurrences for the period of 2007-2019) (Table 4).
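Read as a partition of the two checklists being compared, the counts above imply 557 + 605 = 1162 species on the NJMA side of the comparison and 557 + 691 = 1248 species in the GBIF records for New Jersey. The Python sketch below shows how shared and unique counts of this kind can be derived from two species lists with set operations. The species names are hypothetical placeholders and the function name is ours; it is not the script used to build Table 3.

```python
from typing import Dict, Iterable


def compare_checklists(ours: Iterable[str], theirs: Iterable[str]) -> Dict[str, int]:
    """Count species shared by two checklists and species unique to each."""
    a, b = set(ours), set(theirs)
    return {
        "shared": len(a & b),
        "unique_to_ours": len(a - b),
        "unique_to_theirs": len(b - a),
    }


# Hypothetical toy checklists.
njma_like = ["Amanita muscaria", "Boletus edulis", "Trametes versicolor"]
gbif_like = ["Boletus edulis", "Trametes versicolor", "Russula emetica", "Lactarius indigo"]
toy_counts = compare_checklists(njma_like, gbif_like)

# Totals implied by the counts reported in the text.
njma_total = 557 + 605
gbif_nj_total = 557 + 691
```

Each checklist total equals its shared count plus its unique count, which gives a quick consistency check on the numbers reported for any pair of datasets.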
Consistency of the taxonomic names was managed using the GBIF Species API (https://www.gbif.org/developer/species) and the rgbif R package 49,50 . Trophic type was assigned using the R package funguild 51 .
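The name check itself was done in R with rgbif; for readers working in Python, the sketch below shows roughly how a single name could be checked against the GBIF backbone through the name-matching endpoint of the Species API. It makes several assumptions: the endpoint URL and the response fields (`matchType`, `status`, `usageKey`, `acceptedUsageKey`) reflect our understanding of the public API and should be verified against its documentation, the helper names are ours, and the response dictionaries in the example are hypothetical with invented key values. No request is sent by this code.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint of the GBIF name-matching service.
GBIF_MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def build_match_url(name: str, kingdom: str = "Fungi") -> str:
    """Build the query URL for matching one scientific name within a kingdom."""
    return GBIF_MATCH_ENDPOINT + "?" + urlencode({"name": name, "kingdom": kingdom})


def accepted_key(match: dict) -> Optional[int]:
    """Return the backbone key of the accepted taxon, or None if nothing matched."""
    if match.get("matchType") == "NONE":
        return None
    if match.get("status") == "SYNONYM" and "acceptedUsageKey" in match:
        return match["acceptedUsageKey"]
    return match.get("usageKey")


url = build_match_url("Amanita muscaria")

# Hypothetical parsed responses; the numeric keys are invented.
accepted_example = {"matchType": "EXACT", "status": "ACCEPTED", "usageKey": 111}
synonym_example = {
    "matchType": "EXACT",
    "status": "SYNONYM",
    "usageKey": 222,
    "acceptedUsageKey": 333,
}
unmatched_example = {"matchType": "NONE"}
```

Resolving synonyms to the key of the accepted taxon is what allows outdated names kept in the dataset to be counted together with their current names when lists are compared.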

Usage Notes
We suggest that the data within this dataset be used by researchers interested in evaluating large-scale changes in fungal biodiversity across space and time, as well as by researchers interested in studying the ranges of particular fungal taxa or guilds, for example, in remote sensing of mycorrhizal composition 52 . Beyond usage by formal researchers, we suggest that our methods be used by citizen science groups collaborating with universities and data repositories to make such data more accessible. The described methods have proven to be efficient at leveraging citizen science records for fungal biodiversity, and so we encourage similar groups to consider reviewing our dataset with the included collection methods and planning their own forays using similar strategies. By leveraging shared collection methods across enthusiast societies with platforms for sharing data like GBIF and iNaturalist, we can greatly improve knowledge of fungal biodiversity across larger spatial scales.