Worldwide diversity of endophytic fungi and insects associated with dormant tree twigs

International trade in plants and climate change are two of the main factors causing damaging tree pests (i.e. fungi and insects) to spread into new areas. To mitigate these risks, a large-scale assessment of tree-associated fungi and insects is needed. We present records of endophytic fungi and insects in twigs of 17 angiosperm and gymnosperm genera, from 51 locations in 32 countries worldwide. Endophytic fungi were characterized by high-throughput sequencing of 352 samples from 145 tree species in 28 countries. Insects were reared from 227 samples of 109 tree species in 18 countries and sorted into taxonomic orders and feeding guilds. Herbivorous insects were grouped into morphospecies and were identified using molecular and morphological approaches. This dataset reveals the diversity of tree-associated taxa, as it contains 12,721 fungal Amplicon Sequence Variants and 208 herbivorous insect morphospecies, sampled across broad geographic and climatic gradients and for many tree species. This dataset will facilitate applied and fundamental studies on the distribution of fungal endophytes and insects in trees.


Background & Summary
Fungi and insects can have large impacts on tree health, ranging from beneficial to very harmful [1][2][3] . In the last 200 years the number and impact of tree pests (i.e. harmful fungi and insects) has considerably increased 4-6 , mainly because non-native pests have been introduced to new areas through the global trade of plant material 4,7 . However, our current knowledge about the distribution of organisms associated with trees comes from studies that explored the diversity of fungi and insects at local and regional scales [8][9][10][11] . These studies rarely focused on more than one taxon 12 , or sampled multiple hosts in different geographic regions [12][13][14][15] . The general lack of large-scale biodiversity studies is mainly due to a lack of resources, as surveys can be laborious, costly, and often require strong collaborative networks 16 . Large-scale biodiversity assessments have only been done for a few organism groups, including soil fungi 17 , earthworms 18 , plants 19 and terrestrial vertebrates 20 . Taxa associated with a single host species or genus have never been investigated in a standardized study across several continents, although this would be valuable to test general hypotheses related to global biodiversity patterns and to assess the risk associated with the global trade of plants.
We studied overwintering stages of endophytic fungi and insects associated with the twigs of multiple tree species, collected around the globe, during the winter of 2017/2018 (Fig. 1). This was done through the COST Action FP1401 "Global warning", which aimed to create standardized protocols for the establishment and monitoring of sentinel plantings 21 on a global scale, as an early warning system to detect potential tree pests. Specifically, this data set was compiled to determine which pests might be moved through trade of asymptomatic plant material. We sampled dormant twigs on multiple trees: 20 twigs were sampled per tree species and location ("a sample"). In total 145 native and non-native tree species, both angiosperms and gymnosperms, were sampled in 28 countries (total of 352 samples). Fungi were assessed by high-throughput sequencing (HTS). Insects were reared from dormant twigs collected from a subset of 109 tree species in 18 countries (total of 227 samples) and were sorted to taxonomic orders and feeding guilds. Herbivorous insects were identified by morphological and molecular methods. This data set reveals the diversity of tree-associated fungal endophytes and insects across broad geographic and climatic gradients and for many host taxa (Fig. 2). The data set can be used to investigate the biodiversity of tree-associated fungal endophytes and insects, especially in studies comparing different # A full list of authors and their affiliations appears at the end of the paper. DATA DEScrIpTor opEN geographic and climatic regions and different hosts. Furthermore, the results of such analyses can be used in Pest/Pathway Risk Analyses, with the ultimate goal of reducing the likelihood of pest introductions to new areas.
The data set includes raw sequence data obtained from HTS of the internal transcribed spacer region 2 (ITS2) of the ribosomal operon from dormant twig tissues 22 . We also present a sample x Amplicon Sequence Variant (ASV 23 ) matrix, which consists of abundances of 12,721 fungal ASVs in 352 samples 24 and data on the taxonomic classification of detected fungal ASVs 24 . We present similar data for insects: specimen counts for insects belonging to different feeding guilds (i.e. herbivores, predators and parasitoids) and taxonomic orders in each sample 24 and a sample x morphospecies matrix 24 showing abundances of 208 herbivorous insect morphospecies across 227 collected samples, together with data on the tentative identification of herbivorous insect morphospecies 24 . Aligned cytochrome c oxidase subunit 1 (COI) sequences of herbivorous insects are deposited in GenBank 25 . Finally, we provide a table containing information about tree species, location and climate for each sample 24 .

Methods
Field collection. Endophytic fungi and insects were assessed from dormant twig samples from 155 tree species at 51 locations in 32 countries. Sampled tree species belonged to genera that are native to, and occur widely across, either the northern or southern hemisphere, since very few tree genera occur naturally in both hemispheres (e.g., in our study only Podocarpus appears in both hemispheres but has a limited distribution in the northern hemisphere). We sampled largely in botanical gardens and arboreta, which allowed us to sample native and non-native, congeneric and confamiliar, tree species at each location. At each location, one native and one to three non-native congeneric or confamiliar tree species were sampled.
At each location, twenty 50-cm long asymptomatic twigs were collected from 1-5 individual trees per species, from different branches and different parts of the crown (Fig. 1). The number of individual trees per species depended on the number of trees available in the specific botanical garden or arboretum, which was often low ( Table 1). All twigs per tree species and location were pooled and analysed as a single sample. On some occasions two samples of the same tree species at the same location are considered. Sampling was conducted in the month with the shortest day-length in the year (end of December 2017 in the Northern hemisphere, end of June 2018 in the Southern hemisphere). Samples originating from a tropical region (eleven samples from Tanzania) were collected in June 2018. Trees were sampled in winter to align with the timing of trade, i.e. most woody plants are traded in winter or early spring, as plants will be planted in the following spring, and to reduce the risk of introducing foliar pests in deciduous trees. Evergreen gymnosperm and angiosperm tree species, which were also considered, do not lose foliage during winter, and are thus sold with leaves/needles.

Fungal endophytes.
To assess fungal communities, a total of 352 samples from 145 native and non-native tree species, belonging to nine families of angiosperms and gymnosperms, were collected. Sampling was done at 44 locations in 28 countries on five continents (Fig. 1, Table 1).
From each twig in a sample, one bud, one needle/leaf and one 1 cm long twig segment were taken ( Fig. 1). Needles from gymnosperms, and leaves from evergreen angiosperms were sampled to accurately assess the risk of trading these species. Twig segments were cut from the twig bases. The selected plant parts were surface sterilized by immersion in 75% ethanol for 1 min, 4% NaOCl for 5 min, and 75% ethanol for 30 s 26 . After air drying on a sterile bench, the following material from each of 20 twigs per sample was pooled: half of one bud, a 0.5 cm long piece of a needle (from gymnosperms) or a 0.25 cm 2 leaf (for evergreen angiosperms) and a 0.5 cm long piece of twig.
DNA extraction, PCR amplification and Illumina sequencing. Total genomic DNA was extracted from 50 mg of pooled, surface sterilized, and ground tissue ( Fig. 1) using DNeasy PowerPlant Pro Kit (Qiagen, Hilden, Germany), following the manufacturer's instructions. For a total of 31 out of 352 samples, DNA was extracted from different tissues separately, and DNA extracts were then pooled. DNA concentrations were quantified using the Qubit dsDNA BR Assay Kit (Thermo Fisher Scientific, Waltham, USA) on a Qubit 3.0 Fluorometer (Thermo Fisher Scientific) and DNA was diluted to 5 ng/μl. Samples that yielded less than 5 ng/μl were not diluted. The ITS2 region was amplified with the 5.8S-Fung and ITS4-Fung primers 27  Bioinformatics and taxonomical classification of ASVs. Quality filtering and delineation into ASVs were done with a customized pipeline 28 largely based on VSEARCH 29 , as described by Herzog et al. 30 . The output data available on Figshare show the abundances of fungal ASVs in the samples 24 . Taxonomic classification of ASVs was conducted using Sintax 31 implemented in VSEARCH against the UNITE v.7.2 database 32 with a bootstrap support of 80%. The data on the taxonomic classification of fungal ASVs is deposited in Figshare 24 .
Quality filtering, delineation into ASVs, and taxonomical assignments were done on a larger data set (total of 474 samples), which increased the confidence in the selected centroid sequences. This data set consisted of (1) sequences obtained from 352 samples of pooled tree tissues that are presented here 22 , (2) sequences obtained from 33 samples of pooled tree tissues which were not included in this manuscript due to violation of the common protocol, (3) sequences from 21 contaminated samples (positive DNA extraction controls), including sequences from the two control samples (not presented here), and (4) sequences obtained from 66 samples of non-pooled tree tissues of Pinus sylvestris and Quercus robur that were collected from the subset of locations considered in this study, but for a different study, and are thus not presented here.
Herbivorous insects. Insects were assessed from 227 samples of 109 tree species, collected at 31 locations and in 18 countries (Fig. 1, Table 1).
The collected twigs (twenty 50 cm twigs per species per location) were brought to a laboratory close to each sampling location and inspected for the presence of insects that overwinter as adults. Twigs were kept at room temperature with the cut ends immersed in water to induce budding and to allow the development of insects that overwinter as larvae, pupae or eggs. Twigs from each sample were protected with gauze bags to prevent insects moving between samples (Fig. 1). Twigs were inspected for the presence of insects daily for 4 weeks and all collected insects were stored in 95% ethanol for further examination.
Morphological and molecular identification. Insects were inspected using a stereo microscope and sorted to taxonomic orders and feeding guilds (i.e. herbivores, predators, parasitoids and other). The abundance of the different feeding guilds and taxonomic orders in the samples is presented in a file deposited on Figshare 24 . Herbivorous insects were further sorted into morphospecies and at least one specimen per morphospecies was stored at −20 °C for molecular analysis. The abundance of the different morphospecies in each sample is presented in a file deposited on Figshare 24 . Specimens for molecular analysis were photographed with a Leica Fig. 1 Description of sample collection and sampling locations for the study. (a) Twenty twigs per tree were collected from 1-5 trees per species at each location. One sample constitutes all twigs from a given tree species at a particular location. Fungi were assessed by high-throughput sequencing from 50 mg of surface sterilized and ground tissues (i.e. pooled 0.5 cm long twig segments, halves of buds, and 0.5 cm long needle/leaf segments for evergreen species collected from a sample). Insects were reared from collected twigs and identified using morphological and molecular approaches. (b,c) Sampling locations outside Europe (b) and within Europe (c) are shown. Colors indicate the groups that were assessed from collected samples. The size of the circles is proportional to the number of samples collected at each location. The same legend applies for b and c. In several cases, samples collected at multiple locations within a country were merged for better visibility (i.e. AR, AU, EE, HU, ME, SE, TN, UK, ZA; Table 1).
www.nature.com/scientificdata www.nature.com/scientificdata/ DVM6 digital microscope and the Leica Application Suite X (LAS X). Depending on the size of the insects, the whole individual or parts (e.g. legs, head) were used for molecular analysis. Genomic DNA was extracted with a KingFisher (Thermo Fisher Scientific) extraction protocol suitable for insects (35 min incubation at RT, 30 min wash at RT with 3 different washing buffers, 13 min elution at 60 °C) in a 96-well plate. PCR for the COI was carried out in 25 µl reaction volume with 2 µl diluted DNA (1:10), 0.5 µM of each of the primers LCO1490 and HCO2198 33 and 1 x REDTaq ReadyMix Reaction Mix (Sigma-Aldrich) using a Veriti 96-Well Thermal Cycler (Applied Biosystems) with the following setting: 2 min at 94 °C, five cycles of 30 s at 94 °C, 40 s at 45 °C, and 1 min at 72 °C, 35 cycles of 30 s at 94 °C, 50 s at 51 °C, and 1 min at 72 °C, and a final extension step at 72 °C for 10 min. The success of amplification was verified by electrophoresis of the PCR products in 1.5% (w/v) agarose gel at 90 V for 30 min with ethidium bromide staining. A standard Sanger sequencing of the PCR products in both directions with the same primers was done at Macrogen Europe, Amsterdam, Netherlands. Sequences were assembled and edited with CLC Workbench (Version 7.6.2, Quiagen) and compared to reference sequences in BOLD 34 . If no conclusive results were found, sequences were compared to reference sequences in the National Centre for Biotechnology Information (NCBI) GenBank databases 35 . Specimens were assigned to species if the query sequence showed less than 1% divergence from the reference sequence. If two or more taxa matched within the same range, the assignment was ranked down to the next taxonomic level (i.e., genus). When no species match was obtained based on the above criteria, a genus was assigned with a divergence of less than 3%. For lower taxonomic groups the 100 nearest sequences were inspected on the Blast Tree (Fast Minimum Evolution Method) and the taxonomic relationship was evaluated based on that tree. If none of the approaches above revealed a conclusive taxonomic assignment, the morphological identification was taken as reference. The results of morphological and molecular identification of insect specimens are presented in a file deposited on Figshare 24 . Insect sequences are deposited in GenBank database under accession numbers MW441337-MW441767 25 .  www.nature.com/scientificdata www.nature.com/scientificdata/ Sample metadata. Pairwise geographic distances (Euclidean distances) between sampling locations were calculated based on the geographic coordinates of the locations, with function "dist" in the R statistical programme 36 .
Climate data, including mean annual temperature, mean annual precipitation, and temperature seasonality were obtained from the WorldClim database 37 , at a resolution of 2.5 min, and represent averages between 1970 and 2000.
A host-tree phylogeny was constructed with the phylomatic function from the package brranching 38 in R using the "zanne2014" reference tree 39 . One Eucalyptus sample collected in Argentina and two Eucalyptus samples collected in Tunisia were not identified to species. To place them in the phylogeny, we assigned them to different congeneric species that were not sampled in this study and that we considered as representative samples of phylogenetic diversity from across Eucalyptus genus (E. viminalis, E. robusta and E. radiata). Pairwise phylogenetic distances between study tree species were calculated using the "cophenetic" function in R 36 .
The described sample metadata are available in a file on Figshare 24 .

Data records
The raw paired-end Illumina sequencing reads of the ITS2 region obtained from fungal DNA extracted from pooled twig tissues are archived at the NCBI Sequence Read Archive under BioProject accession number PRJNA708148 22 . For each of 352 samples, two fastaq files are deposited, corresponding to reverse and forward sequence reads. The excel file "Sample x fungal Amplicon Sequence Variant matrix" shows the distribution of 11,613,187 sequence reads of 12,721 fungal ASVs among 352 samples that were used for the fungal assessment and is available on Figshare 24 .
A total of 4,751 insect specimens were reared from 227 samples. The number of insects in different feeding guilds and orders is provided for each sample from which insects were collected (i.e. 154 out of 227 samples) in the excel file "Insect specimens per feeding guild, taxonomic order and sample" and is stored on Figshare 24 . Almost 64% of collected specimens were classified as herbivores (3,032), around 10% were classified as parasitoids (499) and only 2% were predators (89). An additional 1,131 out of 4,751 (24%) specimens could not be assigned to a feeding guild.
Data records on host identity (i.e. host species, hemisphere of origin and native or non-native host distribution range), geographic location (i.e. country of collection, location, latitude and longitude) and climate (i.e. mean annual temperature, temperature seasonality, mean annual precipitation) are also provided for each sample in the excel file "Sample Metadata" on Figshare 24 .

Technical Validation
Sampling and sample processing were done in a standardized way as described above. All samples that were not collected and processed following the protocols were discarded (33 fungal samples). Negative controls were used for DNA extractions and PCRs to ensure the absence of external and cross contamination, and if controls were positive, samples were not used (21 fungal sample). Positive PCR controls were used to ensure that PCRs were performed correctly. code availability R functions and databases used for generating the sample metadata are specified in the method section. A customized pipeline used for quality filtering of the raw sequence data obtained from HTS, delineation into ASVs and taxonomic classification of ASVs as described in Herzog et al. 30 is available as a "ITS2.bash" file from the Zenodo repository 28 .