Community composition of aquatic fungi across the thawing Arctic

Thermokarst activity at permafrost sites releases considerable amounts of ancient carbon to the atmosphere. A large part of this carbon is released via thermokarst ponds, and fungi could be an important organismal group enabling its recycling. However, our knowledge about aquatic fungi in thermokarstic systems is extremely limited. In this study, we collected samples from five permafrost sites distributed across circumpolar Arctic and representing different stages of permafrost integrity. Surface water samples were taken from the ponds and, additionally, for most of the ponds also the detritus and sediment samples were taken. All the samples were extracted for total DNA, which was then amplified for the fungal ITS2 region of the ribosomal genes. These amplicons were sequenced using PacBio technology. Water samples were also collected to analyze the chemical conditions in the ponds, including nutrient status and the quality and quantity of dissolved organic carbon. This dataset gives a unique overview of the impact of the thawing permafrost on fungal communities and their potential role on carbon recycling. Measurement(s) carbon dioxide • methane • dissolved organic carbon • iron(2+) • iron(3+) • total phosphorous • Nitrate Measurement • sulfate • nitrite • ammonium • dissolved nitrogen atom in water • specific ultraviolet absorbance SUVA254 • spectral slope S289 • freshness index BIX • humification index HIX • fluorescence index FI • hydrogen to carbon ratio • oxygen to carbon ratio • m/z Technology Type(s) gas chromatography • total organic carbon analyzer • spectrophotometer • ion chromatography • total nitrogen analyzer • spectrofluorimeter • high-resolution mass spectrometry Factor Type(s) sampling site Sample Characteristic - Organism fungal sp. Sample Characteristic - Environment permafrost • pond water • sediment surface Sample Characteristic - Location Siberia • Sweden • Region of Northern Quebec • State of Alaska • Greenland Measurement(s) carbon dioxide • methane • dissolved organic carbon • iron(2+) • iron(3+) • total phosphorous • Nitrate Measurement • sulfate • nitrite • ammonium • dissolved nitrogen atom in water • specific ultraviolet absorbance SUVA254 • spectral slope S289 • freshness index BIX • humification index HIX • fluorescence index FI • hydrogen to carbon ratio • oxygen to carbon ratio • m/z Technology Type(s) gas chromatography • total organic carbon analyzer • spectrophotometer • ion chromatography • total nitrogen analyzer • spectrofluorimeter • high-resolution mass spectrometry Factor Type(s) sampling site Sample Characteristic - Organism fungal sp. Sample Characteristic - Environment permafrost • pond water • sediment surface Sample Characteristic - Location Siberia • Sweden • Region of Northern Quebec • State of Alaska • Greenland Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.14896713

carbon (DOC) concentrations, as well as various proxies of DOM such as fluorescence index (FI), freshness index (BIX), humification index (HIX), specific ultraviolet absorbance (SUVA 254 ), spectral slope for the intervals 279-299 nm (S 289 ) and average H/C and O/C.
The aim of this data collection was to study how the fungal diversity is affected by permafrost thaw and the resulting inputs of organic matter to the thermokarst ponds. Further, the impact of general chemical conditions in the ponds and their relationship to fungal community composition was addressed. Generally, this dataset gives unique insights into the composition of fungal communities in aquatic habitats in the Arctic. Thus, the data can be used to study the general composition of arctic fungal communities and how the community changes together with their environment, such as the availability of the carbon substrates. Importantly, it also expands the database for fungal ITS sequences with a large number of previously unencountered sequences, widening the knowledge and database available for studying fungal diversity in undersampled biomes. Additionally, this dataset can be useful for studies that explore the Arctic fungal taxonomy and their geographic distribution.

Methods
Study sites. We sampled ponds in the following five sites representing different regional-scale permafrost integrity: Toolik, Alaska, USA; Qeqertarsuaq, Disko Island, Greenland, Denmark; Whapmagoostui-Kuujjuarapik, Nunavik, Quebec, Canada; Abisko, Sweden and Khanymey, Western Siberia, Russia (Online-only Table 1). The aim was to include representatives of different stages of permafrost thaw in order to understand whether responses can be generalized across different geographic and environmental conditions. The sampling site in Alaska is located in a continuous permafrost area, mostly dominated by moss-tundra characterized by tussock-sedge Eriophorum vaginatum and Carex bigelowii, and dwarf-shrub Betula nana and Sequecing of ITS2 region with PacBio Sequell II System, with generation of high accurate reads produced after a circularization step.
Reads processed through SCATA pipeline.  Salix pulchra 15 . The average depth of the active layer in 2017 was ~50 cm 16 . Records of surface air temperature from 1989 to 2014 showed no significant warming trend, and there was no significant increase in the mean maximum thickness of the active layer or maximum thaw depth 17 . The sampling site in Greenland is located in the Blaesedalen Valley, south of Disko Island, and is characterized as a discontinuous permafrost area. From 1991 to 2011, Hollensen et al. 18 observed an increase of the mean annual air temperatures of 0.2 °C per year in the area, while Hansen et al. 19 highlighted that sea ice cover reduced 50% from 1991 to 2004. Soil temperatures recorded by the Arctic Station from the active layer of the coarse marine stratified sediments also showed an increase over the years 18 . The sampling site is comprised of wet sedge tundra, and the dominating species are Carex rariflora, Carex aquatilis, Eriophorum angustifolium, Equisetum arvense, Salix arctophila, Tomentypnum nitens and Aulacomnium turgidum 20 .

Metabarcoding
The Canadian site is located within a sporadic permafrost zone, in a palsa bog, in the valley of Great Whale river, close to the river mouth to Hudson Bay. The vegetation consists of a coastal forest tundra, dominated by the species Carex sp. and Sphagnum sp. 21 Since the mid-1990s, there has been a significant increase in the surface air temperature of the region for spring and fall, which has been correlated to a decline of sea ice coverage in Hudson Bay 22 . This area has experienced an accelerated thawing of the permafrost over the past decades, resulting in the collapse of palsas and the emergence of thermokarst ponds as well as significant peat accumulation 21,23 . In this specific site, thermokarst ponds at different development stage can be found, from recently emerging to older, mature thermokarstic waterbodies. The stage of the ponds was estimated based on the distance between the pond and the edge of the closest palsa, as well as based on satellite images 14 . The edges of the emerging ponds reached a maximum of 1 m from the closest palsa and were less than 0.5 m deep, whereas the edges of the developing ponds had a maximum distance of 2-3 m to the closest palsa and were ~1 m deep. Mature ponds were identified based on satellite images and were up to 60 years old.
The Swedish site is located in a discontinuous permafrost zone at the Stordalen palsa mire, on an area of collapsed peatland affected by active thermokarst. The region has experienced an increase in mean annual air temperature and active layer thickness since the 1980s, which has been followed by a shift to wetter conditions 24 . The vegetation found on the surface of the palsa depressions of Stordalen mire is dominated by sedges (Eriophorum vaginatum, Carex sp.) and mosses (Sphagnum sp.) 24,25 .
The Russian site is located in a discontinuous permafrost area in Western Siberia Lowland, near Khanymey village. The sampling site is a flat frozen palsa bog with a peat depth no more than 2 m, and is affected by active thermokarst, resulting in the emergence of thermokarst ponds 26,27 . The vegetation is dominated by lichens (Cladonia sp.), schrubs (Ledum palustre, Betula nana, Vaccinium vitis-idaea, Andromeda polifolia, Rubus chamaemorus) and mosses (Sphagnum sp.) 28 .
Sample collection. At all sites, water from the depth of 10 cm was collected from 12 ponds, totaling 60 ponds for the full dataset. Unfiltered water samples were collected for total P analysis. For analyzing Fe, various dissolved anions and cations, DOC concentrations, and perform optical and mass spectrometry analyses on DOM, water was filtered through GF/F glass fiber filters (0.7 μm, 47 mm, Whatman plc, Maidstone, United Kingdom). Moreover, water samples were collected in order to measure GHG (CO 2 and CH 4 ) concentrations. Water, detritus and sediment samples were also collected from ponds for fungal community analyses. Water samples were collected and filtered sequentially first through 5 µm Durapore membrane filter (Millipore, Burlington, Massachusetts, USA) and then through a 0.22 µm Sterivex filter (Millipore) to capture fungal cells of different sizes. The samples were filtered until clogging or up to a maximum of 3.5 liters (filtered volume ranging from 0.1 l to 3.5 l). Surface sediments were sampled from each of the ponds, with the exception of the Canadian site, where only one emerging and three developing ponds were sampled for sediments. From the sites in Alaska, Greenland, and Sweden, also detritus samples (dead plant material) were collected. The detritus was washed in the lab using tap water, followed by overnight incubation in 50 ml tap water to induce sporulation. The use of tap water may have added fungal spores to the samples, which should be kept in mind when using the detritus data. After the incubation, the water was filtered through a 5 μm pore size filter and the filter was stored at −20 °C.
All the samples for DNA extraction were transported to the laboratory frozen, with the exception of the Alaskan samples, which were freeze dried prior to transportation. The samples transported frozen were freeze dried prior to DNA extraction to ensure similar treatment of all samples. The samples for nutrient and carbon measurements were transported frozen with the exception of samples for DOC and fluorescence analyses, which were transported cooled.
Chemical analyses. All chemical, optical and mass spectrometry results are provided in OSF 29 . DOC quantification was carried out using a carbon analyzer (TOC-L + TNM-L, Shimadzu, Kyoto, Japan). Accuracy was assessed using EDTA at 11.6 mg C/l as a quality control (results were within + − 5%) and the standard calibration range was of 2-50 mg C/l. Fe(II) and Fe(III) were determined by using the ferrozine method 30 , but instead of reducing Fe(III) with hydroxylamine hydrochloride, ascorbic acid was used 31 . Absorbance was measured at 562 nm on a spectrophotometer (UV/Vis Spectrometer Lambda 40, Perkin Elmer, Waltham, Massachusetts, USA). The samples were diluted with milli-Q water if needed. The concentration of total P was determined using persulfate digestion 32 . The anion NO 3 − was measured on a Metrohm IC system (883 Basic IC Plus and 919 Autosampler Plus; Riverview, Florida, USA). NO 3 − were separated with a Metrosep A Supp 5 analytical column (250 × 4.0 mm) which was fit with a Metrosep A Supp 4/5 guard column at a flow rate of 0.7 ml/min, using a carbonate eluent (3.2 mM Na 2 CO 3 + 1.0 mM NaHCO 3 ). SO 4 was analyzed using Metrohm IC system (883 Basic IC Plus and 919 Autosampler Plus, Riverview), NH 4 + spectrophotometrically as described by Solórzano 33 , and NO 2 − and DN as in Greenberg et al. 34 . For the gas analyses, samples from Alaska and Canada were taken as previously described in Kankaala et al. 35 , except that room air was used instead of N 2 for extracting the gas from the water. Shortly, 30 ml of water was taken www.nature.com/scientificdata www.nature.com/scientificdata/ into 50 ml syringes, which were warmed to room temperature prior to extraction of the gas. To each syringes 0.5 ml of HNO 3 and 10 ml of room air was added and the syringes were shaken for 1 min. Finally, the volumes of liquid and gas phases were recorded and the gas was transferred into glass vials that had been flushed with N 2 and vacuumed. For Greenland, Sweden and Russia 5 ml of water was taken for the gas samples with a syringe and immediately transferred to 20 ml glass vials filled with N and with 150 µL H 2 PO 4 to preserve the sample. All gas samples were measured using gas chromatography (Clarus 500, Perkin Elmer, Polyimide Uncoated capillary column 5 m x 0.32 mm, TCD and FID detector respectively). optical analyses. In order to characterize DOM, we recorded the absorbance of DOM using a UV-visible Cary 100 (Agilent Technologies, Santa Clara, California, USA) or a LAMBDA 40 UV/VIS (PerkinElmer) spectrophotometer, depending on sample origin. SUVA 254 is a proxy of aromaticity and the relative proportion of terrestrial versus algal carbon sources in DOM 36 and was determined from DOC normalized absorbance at 254 nm after applying a corrective factor based on iron concentration 37 . S 289 enlights the importance of fulvic and humic acids related to algal production 38 and were determined for the intervals 279-299 nm by performing regression calculations using SciLab v 5.5.2. 39 We also recorded fluorescence intensity on a Cary Eclipse spectrofluorometer (Agilent Technologies), across the excitation waveband from 250-450 nm (10 nm increments) and emission waveband of 300-560 nm (2 nm increments), or on a SPEX FluoroMax-2 spectrofluorometer (HORIBA, Kyoto, Japan), across the excitation waveband from 250-445 nm (5 nm increments) and emission waveband of 300-600 nm (4 nm increments), depending on sample origin. Based on the fluorometric scans, we constructed excitation-emission matrices (EEMs) after correction for Raman and Raleigh scattering and inner filter effect 40 . We calculated the FI as the ratio of fluorescence emission intensities at 450 nm and 500 nm at the excitation wavelength of 370 nm to investigate the origin of fulvic acids 41 . Higher values (~1.8) indicate microbial derived DOM (autochthonous), whereas lower values (~1.2) indicate terrestrial derived DOM (allochthonous), from plant or soil 42 . HIX is a proxy of the humic content of DOM and was calculated as the sum of intensity under the emission spectra 435-480 nm divided by the peak intensity under the emission spectra 300-445 nm, at an excitation of 250 nm. Higher values of HIX indicate more complex, higher molecular weight, condensed aromatic compounds 43,44 . BIX emphasizes the relative freshness of the bulk DOM and was calculated as the ratio of emission at 380 nm divided by the emission intensity maximum observed between 420 and 436 nm at an excitation wavelength of 310 nm 45 . High values (>1) are related to higher proportion of more recently derived DOM, predominantly originated from autochthonous production, while lower values (0.6-0.7) indicate lower production and older DOM 42,44 . High resolution mass spectrometry. 50 ml water samples were collected from each of the ponds and were filtered with a Whatman GF/F filter for mass spectrometry analyses. For each sample, 1.5 ml of water was dried completely with a vacuum drier, and was then re-dissolved in 100 µL 20% acetonitrile, 80% water with three added compounds as internal standards (Hippuric acid, glycyrrhizic acid and capsaicin, all at 400 ppb v/v). Samples were filtered to an autosampler vials and injected at 50 µL onto the column. In order not to overload the detectors, some of the higher concentration samples were injected at a lower volume, to give a maximum of 20 µg carbon loaded.
High-performance liquid chromatography -high resolution mass spectrometry (ESI-HRMS) was conducted as described in Patriarca et al. 46 using a C18-Evo column (100 × 2.1 mm, 2.6 µm; Phenomenex, Torrance, California, USA). The ESI-HRMS data was averaged from 2-17 min to allow formula assignment to a single mass list. Formulas considered had masses 150-800 m/z, 4-50 carbon (C) atoms, 4-100 hydrogen (H) atoms, 1-40 oxygen (O) atoms, 0-1 nitrogen (N) atoms and 0-1 13 C atoms. Formulas were only considered if they had an even number of electrons, H/C 0.3-2.2 and O/C ≤ 1. The data are presented as a number of assigned formulas and weighted average O/C ratio, H/C ratio and m/z. The analysis was run in two batches (36 and 24 samples per run, respectively) and to the latter run, three samples of Suwannee River fulvic acid (SRFA, reference material) were added. At the moment of the run, the DOC concentration of these samples was unknown, so 50 µL was injected. From high resolution mass spectrometry, average H/C and a number of assigned formulas were obtained. The H/C can be used as a proxy of DOM aliphatic content; higher H/C values ( > 1) indicate more saturated (aliphatic) compounds, whereas values lower than 1 indicate more unsaturated, aromatic molecules 47 .
DNA extraction, ITS2 amplification and sequencing. All samples for molecular analyses (water and detritus filters and sediments) were extracted using DNeasy PowerSoil ® kit (Qiagen, Hilden, Germany), following the manufacturer's recommendations for low input DNA. Extracts were eluted in 100 µl of Milli-Q water and DNA concentrations were measured with Qubit dsDNA HS kit. The fungal ribosomal internal transcribed spacer 2 (ITS2) sequences were amplified using a modified ITS3 Mix2 forward primer from Tedersoo 48 , named ITS3-mkmix2 CAWCGATGAAGAACGCAG, and a reverse primer ITS4 (equimolar mix of cwmix1 TCCTCCGCTTAyTgATAtGc and cwmix2 TCCTCCGCTTAtTrATAtGc) 14 . Each sample received a unique combination of primers containing identification tags generated by Barcrawl 49 . All tags had a minimum base difference of 3 and a length of 8 nucleotides. Both forward and reverse primer tags were extended by two terminal bases (CA) at the ligation site to avoid bias during ligation of sequencing adaptors, and the forward primer tag also had a linker base (T) added to it 50  www.nature.com/scientificdata www.nature.com/scientificdata/ at 72 °C for 10 min. In order to reduce PCR bias, all samples (in duplicates) were first submitted to 21 amplification cycles. In case of insufficient yield, the number of cycles was increased up to 35 cycles (see the records on the number of cycles for each of the samples in Supplementary Table S2).
The PCR products were purified with Sera-Mag TM beads (GE Healthcare Life Sciences, Marlborough, Massachusetts, USA), visualized on a 1.5% agarose gel and quantified using Qubit dsDNA HS kit. The purified PCR products were randomly allocated into three DNA pools (20 ng of each sample), which were purified with E.Z.N.A. ® Cycle-Pure kit (Omega Bio-Tek, Norcross, Georgia, USA). Nine of the samples (4 water, 1 sediment and 4 detritus) were left out of the pools because of too little PCR product, giving a total of 203 samples for sequencing (Online-only Table 1). Negative PCR controls were added to each pool, as well as a mock community sample containing 10 different fragment sizes from the ITS2 region of a chimera of Heterobasidium irregular and Lophium mytilinum, ranging from 142 to 591 bases, as described by Castaño et al. 51 . The size distribution and quality of all the pools were verified with BioAnalyzer DNA 7500 (Agilent Technologies), and purity was assessed by spectrophotometry (OD 260:280 and 260:230 ratios) using NanoDrop (Thermo Fisher Scientific). The libraries were sequenced at Science for Life Laboratory (Uppsala University, Sweden), on a Pacific Biosciences Sequel instrument II, using 1 SMRT cell per pool. This PacBio technology allows the generation of highly accurate reads (>99% accuracy) which are produced based on a consensus sequence after a circularization step.
Quality filtering of reads, clustering and taxonomy identification of clusters. The sequencing resulted in a total of 1071489 sequences, ranging from 397 to 9184 sequences per sample (average on 2551 sequences per sample). The raw sequences were filtered for quality and clustered using the SCATA pipeline (https://scata.mykopat.slu.se/, accessed on May 19 th , 2020). For quality filtering, sequences from each pool were screened for the primers and tags, requiring a minimum of 90% match for the primers and a 100% match for the tags. Reads shorter than 100 bp were removed, as well as reads with a mean quality lower than 20, or containing any bases with a quality lower than 7. After this filtering, 582234 sequences were retained in the data. The sequences were clustered at the species level by single-linkage clustering at a clustering distance of 1.5%, with penalties of 1 for mismatch, 0 for gap open, 1 for gap extension, and 0 for end gaps. Homopolymers were collapsed to 3 and unique genotypes across all pools were removed. For a preliminary taxonomy affiliation of the clusters, hereafter called OTUs (Operational Taxonomic Units), sequences from the UNITE + INSD dataset for Fungi 52 database were included in the clustering process. After the clustering, the data included 518128 sequences, divided among 8218 OTUs. For taxonomical annotation, all OTUs with a minimum of ten total reads in the full dataset were included, retaining 3108 OTUs and 498414 sequences in the taxonomical analysis.

Data records
The raw sequences are deposited in the NCBI SRA database under accession number PRJNA701021 (Biosample accession numbers SAMN17843604-SAMN17843806) 53 . All the raw mass spectrometry data is available at the Mass Spectrometry Interactive Virtual Environment (MassIVE) under the accession number MSV000086952 54 . The fungal OTU sequences and their taxonomic classification, as well as all the environmental data related to the samples are deposited in Open Science Framework (OSF) 29 .

technical Validation
The DNA extractions were done in a laminar flow hood with a UV-C lamp and handled separately for each of the locations to avoid any possible cross-contamination between the samples or sites. For the PCRs, the samples were first randomized into three groups including samples from all locations to minimize the risk of batch effect at the sequencing step. Negative controls were included in the PCR step and added to the pools. The negative controls created 4, 12 and 4 sequences for pools 1, 2 and 3, respectively. For each pool, 100 ng of a positive control containing mock communities (as described in the methods section) was added. The mock communities captured all different fragment sizes. The sequences were controlled rigorously by removing singletons and rare reads, as described in the methods section, to remove any sequences that might be chimeric reads or a result of erroneous amplification. Finally, the sequences were compared against multiple databases and their taxonomy was verified manually to ensure that only fungal sequences were included to the tree shown in Fig. 2. For the resulting data, correlations between the number of OTUs and filtered water volume were checked to verify that the differential filtering volumes did not introduce a bias to the richness of the communities (Supplementary Figure S1).

Usage Notes
To verify that the acquired OTUs were of fungal origin, we scrutinized all the OTUs with a minimum of ten total reads for their phylogenetic origin, retaining 3108 OTUs and 498414 sequences in the analysis. For taxonomic annotation, the Protax-Fungi 55 and massBLASTer analyses available through the Pluto-F platform of the UNITE database (https://plutof.ut.ee/, accessed on May 23 rd 2020) were used. An OTU was assigned to a taxonomic level if the Protax identification probability was at least 95% and matched the taxonomy based on the massBLAST against the UNITE species hypothesis (SH) database. To support the taxonomic identification and to discard all non-fungal OTUs, a phylogenetic tree was built, which included all the OTUs with at least 50 reads in total. For building the tree, the ITS2 region was extracted with ITSx 56 , aligned with MUSCLE 57 and a Neighbor-Joining phylogenic tree was constructed using MEGA7 58 , with a p-distance, bootstrap of 1000 replicates, gamma distributed rates and gaps treated as pairwise deletions. As reference sequences, 29 different eukaryotes were selected from the ITS2 database 59 (http://its2.bioapps.biozentrum.uni-wuerzburg.de/, accessed on August 18 th 2020 -accession numbers: AB084092, AY752993, HQ219352, GQ402831, AF228083, AF053158, AF163102, JN113133, AF353997, GU001158, JQ340345, AM396560, FJ946912, EU812490, KJ925151, KF772413, JX988759, AF315074, AJ400496, AJ566147, AY458037, AY479922, EF060369, FN397599, JF750409, KF524372, U65485, Z48468, AY499004). Eight references were further derived from GenBank (https://www.ncbi.nlm.nih.gov/genbank/, accessed on www.nature.com/scientificdata www.nature.com/scientificdata/ September 29 th 2020 -accession numbers: KC357673, AF508774, AY676020, MN158348, HM161704, AY264773, AB906385, AJ296818). These sequences cover different eukaryote lineages: Centroheliozoa, Choanoflagellida, Ichthyosporea, Oomycetes, Streptophyta, Chlorophyta, Rhodophyta, Cercozoa, Amoebozoa, Apusozoa, Cryptophyta, Haptophyceae, Heterolobosea, Katablepharidophyta, Arthropoda, Picozoa, Alveolata, Cnidaria, Stramenopiles, Protostomia and Porifera. Additionally, 530 fungal sequences (SHs from UNITE database) were selected to represent different fungal lineages and included the closest SH matches to each of the OTUs. Finally, to further establish higher level taxonomy, BLASTn searches (E-value cutoff of 1e-3, against the NCBI nucleotide database (https://www.uppmax.uu.se/resources/databases/blast-databases/) were run on September-October 2020, excluding uncultured organisms and environmental samples), for all the OTUs. The (at least) 10 best hits from the BLASTn search were evaluated as follows: an OTU was classified as a "likely fungi" if all resulting hits would match fungal sequences. The OTUs that had also matched other eukaryotic sequences were checked manually, and classified as a "likely fungi" if the best hits were fungal sequences, and also based on the quality of the alignments (higher query coverage and identity %). Any OTU that could not be classified as a "likely fungi" from the BLASTn searches and/or would cluster with any other eukaryote reference in the phylogenetic tree was discarded. The final taxonomic assignment of all fungal OTUs was based on matches to UNITE SHs and Protax probabilities. OTUs with undetermined phylum, class or order were assigned to higher taxonomic levels whenever supported by the phylogenetic tree (considering a minimal bootstrap value of 70%) or BLASTn results (min of 90% query coverage and 90% identity for class, and 100% query coverage and min 97% identity for order). The final data set with 1334 OTUs (178531 sequences) identified as fungi is presented in Supplementary Figure S2.

Code availability
The code used for the extraction of ITS2 regions using ITSx and for the Blastn searches are provided in Supplementary Note 1.