Characterization of genome-wide SNPs for the water flea Daphnia pulicaria generated by genotyping-by-sequencing (GBS)

The keystone aquatic herbivore Daphnia has been studied for more than 150 years in the context of evolution, ecology and ecotoxicology. Although it is rapidly becoming an emergent model for environmental and population genomics, there have been limited genome-wide level studies in natural populations. We report a unique resource of novel Single Nucleotide Polymorphic (SNP) markers for Daphnia pulicaria using the reduction in genomic complexity with the restriction enzymes approach, genotyping-by-sequencing. Using the genome of D. pulex as a reference, SNPs were scored for 53 clones from five natural populations that varied in lake trophic status. Our analyses resulted in 32,313 highly confident and bi-allelic SNP markers. 1,364 outlier SNPs were mapped on the annotated D. pulex genome, which identified 2,335 genes, including 565 within functional genes. Out of 885 EuKaryotic Orthologous Groups that we found from outlier SNPs, 294 were involved in three metabolic and four regulatory pathways. Bayesian-clustering analyses showed two distinct population clusters representing the possible combined effects of geography and lake trophic status. Our results provide an invaluable tool for future population genomics surveys in Daphnia targeting informative regions related to physiological processes that can be linked to the ecology of this emerging eco-responsive taxon.

organism at several levels such as life-history variation and phenotypic plasticity 14 , physiological adjustments 15 , ribosomal RNA and DNA structure 16 , and genomic responses (e.g., gene expression) 13,17 .
Here, we generated a set of SNPs covering the whole D. pulicaria genome from clonal lineages isolated from five natural (lake) populations that varied in trophic (i.e., nutrient) status. By using these populations with different phosphorus load/supply (i.e., different eutrophication status ranging from 21 to 220 μ g/L of P), we characterized outlier SNPs (i.e., particular variable positions with differentiation among populations higher than average that are likely candidates to be affected by natural selection) and mapped them to the D. pulex annotated genome. Our results revealed 294 outlier SNPs that play a role in important metabolic pathways potentially involved in phosphorus use and processing, ranging from cellular to ecosystem levels.
The development and characterization of hundreds of SNPs with a regular distribution across the genome will provide an invaluable genetic resource in this emergent model organism, and should give us the option to screen regions under selection. This new methodological approach will allow us to face challenges such as identifying the mechanisms and processes related to the genetic basis of adaptation in natural populations 18 .

Results
SNP calling and mapping. A total of 159 900 538 good barcoded reads (ranging from 571 683 up to 31 600 240) were obtained for the 53 clones/samples analysed, resulting in 57 440 139 tags (i.e., contigs containing the same sequence). Despite mapping short reads against a different reference genome that can limit identification of DNA variants, we detected 32,313 SNPs by mapping our reads onto the Daphnia pulex genome after filtering by biallelic and highly confident positions, and without linkage disequilibrium (LD) within a scaffold. Finally, we removed those SNPs with expected heterozygosity (He) greater than 0.5 since they might represent paralogous regions (i.e., gene duplications). Confident SNPs were included in a total of 476 scaffolds out of 9,080 described for the D. pulex reference genome (see Fig. 1A). The D. pulex reference genome is available in scaffolds with different sequence lengths (ranging from 1,000 bp to 4.1 Mb), and the number of SNPs per scaffold depends on scaffold length. We found 93.84% of SNPs (i.e., 30,323) in the first 220 scaffolds, whose individual length is greater than 100 Kb.
Population analyses of genotyping results. Analysis of highly confident SNPs for 53 samples performed with PCoA showed five clusters corresponding to the five lakes sampled (Fig. 2). On the other hand, the STRUCTURE clustering analysis estimated that two genetic clusters (i.e., K = 2) were most likely, as determined by the Δ K method of Evanno et al. 19 . Our plot (Fig. 3) showed that the two genetic clusters correspond to northern and southern populations (i.e., HILL and ELK -North; SC, MAD, and SHAOK -South).

Description of outlier SNPs (gene annotation and role in metabolic pathways). 1,499 outlier
SNPs were detected by the Lositan programme (only 1,364 SNPs could be mapped on to the annotated D. pulex reference genome) along 212 different scaffolds (Fig. 1B). The Bayescan analysis found a lower number of outlier SNPs, which were represented in the list obtained from Lositan. Potentially adaptive SNPs identified by Lositan Plots obtained from iPATH v2 resulted in 294 KOG genes associated mainly with three metabolic pathways (172 genes have a role in the metabolism of nucleotides, amino acids and lipids; Fig. 4A) and four regulatory pathways (122 genes have a role in transcription, translation, replication and repair, and folding, sorting and degradation; Fig. 4B).

Discussion
Genomic resources have increased our ability to study the evolutionary ecology and population genetic structure of the Daphnia pulex species complex in more detail in recent years 4 , but can be greatly improved upon by NGS technologies currently available or soon-to-be available in the near future 20 . Using GBS methodology on genomic DNA from 53 samples/clones of D. pulicaria, we developed the first whole-genome resources for this important widely-distributed Nearctic lake species. Our dataset presents complementary information that can be compared with recent studies reporting differential gene expression in the transcriptome of D. pulex in response to phosphorus supply 13,17 . A production of high-quality SNPs was achieved, analysed and illustrated by different approaches, obtaining information on annotations blasted onto the current D. pulex reference genome. These GBS-derived genomic resources provide valuable molecular resources for high-resolution linkage mapping, association analysis, and gene verification, which will be of great use to better understand demographic, adaptive and micro-evolutionary processes in the keystone and emergent eco-responsive model organism, Daphnia.
Population genetic differentiation among populations of passively dispersed aquatic invertebrates in most cases has been found to be strong, despite the potential dispersal capacity facilitated by water birds and other vectors transporting their diapausing eggs [21][22][23] . In Daphnia, where diapausing eggs are enclosed in the protective ephippium, population genetic differentiation has been observed even at small/medium (i.e., less than 1 Km) spatial scales 24,25 . Our results are in agreement with a highly structured population model (Fig. 2) across the whole genome (i.e., 8,190 SNPs). This would indicate a low likelihood of colonization by novel genotypes/clones in already established populations, which has been reported in previous studies [26][27][28] . However, our inference of population structure using a Bayesian approach (i.e., STRUCTURE program) indicates a co-ancestry relationship with clustering of clones into two main lake groups: (i) Hill Lake and Elk Lake, and (ii) Madison Lake, Shaokotan Lake, South Center Lake (Fig. 3). These results could be indicative of a geographic and/or an environmental association. One cluster includes the two northern lakes (Hill and Elk), whilst the other cluster includes the three southern ones. In the present study, we cannot reject an isolation-by-distance effect due to a historical colonization event, but a relationship between genomic response and ecological traits cannot be ruled out since both Hill and Elk lakes are located in a different ecoregion, when compared to the other three lakes (South Center, Madison, and Shaokotan). In addition, the phosphorus (P)-loadings of Hill and Elk lakes are on the lower end of the trophic state range for the five lakes sampled (see Table 1), which could also be affecting the genetic clustering   19 , regardless of the admixture model applied. Red dotted lines indicate the arbitrary 20% threshold for assignment to belong to a particular genetic cluster. For population/lake codes, please, refer to Table 1.
Scientific RepoRts | 6:28569 | DOI: 10.1038/srep28569 relationships noted. Preliminary analyses performed in D. pulicaria sampled from one of our five lakes (South Center), using hyper-variable nuclear markers (i.e., microsatellites), support clear evolutionary consequences of anthropogenic environmental change (i.e., change in P-supply) on population structure through time 12 . These results are consistent with previous reports indicating that Daphnia populations show genotype x environment interactions when subjected to different P-conditions. As a consequence, different clones/genotypes are sensitive to P-availability 15,29 , which can include the differential expression of hundreds of genes 13,17 when subjected to nutritional environments that vary in P-supply.
Our results indicate that the outlier (and potentially adaptive) loci found might, amongst others, have important roles in phosphorus processing including assimilation and maintenance (see Fig. 4A,B). In fact, evidence suggests that variation in the environmental supply of P affects expression of highly conserved genes 15 (for a review). It has also been reported that phosphorus supply drives rapid turnover of membrane phospholipids in phytoplankton 30 . Particularly, in Daphnia it has been shown that under low quality food (i.e., high carbon -C to low phosphorus -P ratio: C:P), this organism can maintain P-homeostasis by reducing C-uptake and/or increasing P-retention, which is in agreement with stoichiometric theory 31 . Additionally, it has been suggested (and assessed) that the role of organismal C:P ratios may be functionally related to the structural and copy number variation of ribosomal-RNA genes 16 . Recent studies have reported large differences in gene expression levels under different P-diets in Daphnia at the intra-specific level 17,31 . These previous studies have found hundreds of differentially expressed genes in a comparison between genotypes run under low and high P environmental conditions. Important KEGG (i.e., Kyoto Encyclopedia of Genes and Genomes) pathways involving differentially-regulated genes have been noted for glycogen synthesis and lipid metabolism, among others. Our results, at the genomics level, indicate that some functional genes annotated on the current Daphnia pulex reference genome, might have an important role in lipid metabolism, mainly when considering exclusive genes involved directly in metabolic pathways (i.e., removing those genes linked to both metabolic and regulatory pathway; see Suppl. Figure). Whereas none of the outlier SNPs involving functional genes are related to three main regulatory pathways (cell mobility, membrane transport or signal transduction; see Fig. 4B), they are highly represented on regulatory pathways such as transcription, translation, replication and repair, and folding, sorting and degradation.
The new genomic resources reported in our study include parts of the genome that appear to be less affected by selection, and thus, may prove useful for future demographic studies. In addition, we found signatures of selection for candidate genes (see Appendix 1) related to abiotic stressors that will help to unravel adaptive molecular processes related to variation in environmental conditions such as lake eutrophication. Future analyses on these and other gene annotations should shed light on the selective forces structuring, at the genome-wide level, different genotypes and natural populations influenced by environmental pressure. Additionally, our resources and results may improve our understanding of how the genome interacts with the environment to produce the diversity of phenotypes found at the intra-specific level.
Future work that integrates molecular markers, ecological traits and geographic parameters will be necessary to shed light on the effects of both abiotic and biotic environmental factors as selective forces impacting the population genetic structure in Daphnia species complexes. In our case, we have identified candidate genes that can be used in the near future to study/manipulate natural populations in assessing physiological responses to varying nutrient environments (i.e., C:P ratios).

Methods
Study system and sampling. The life cycle of most daphniids is characterized by an asexual (parthenogenetic) mode of reproduction, shifting to a sexual phase at the end of the growing season when resources are scarce or environmental conditions are not favourable. During the first stages of sexual reproduction, some direct-developing (i.e., subitaneous) eggs produced by female Daphnia can develop into males, which are genetic replicates of the mothers, since Daphnia have environmental sex determination. These males can then mate with sexuallyreceptive females, which leads to the production of sexual (i.e., recombinant) eggs that are produced inside a scleroterized structure termed an ephippium. When the mother molts, the ephippium is shed into the water column, where it either sinks to the bottom sediments or rafts to the shoreline. Ephippia serve as both a diapausing stage as well as a (passive) dispersal stage, which increases the likelihood for population survival during harsh environmental conditions. Ephippial (diapausing) eggs will hatch when favourable conditions appear in natural populations. However, isolated resting eggs (i.e., different clones from sexual reproduction) can be hatched under laboratory conditions (i.e., resurrection ecology) allowing one to address studies on population genetic structure, evolutionary biology and adaptive response to environmental changes 12,[32][33][34] .
In this study, we hatched diapausing eggs from the upper sediment layers (i.e., down to a depth of ~4 cm) of five lakes located in Minnesota (USA). In addition, we established clones from live-caught animals from the water column. The lakes covered a range of trophic states from mesotrophic to hyper-eutrophic conditions. Considering phosphorus concentration ([P]) as one of the major factors involved in lake eutrophication, we covered a range from ~21 to ~220 μ g/L [P] (Table 1). Additionally, the geographic distribution of the lakes covered up to three different ecoregions in Minnesota as described by the Minnesota Pollution Control Agency (Fig. 5).  SNP calling and population structure analyses. The GBS analysis pipeline (Tassel ver. 4.0), an extension of the Java program TASSEL 2,35 , was used to call SNPs from the sequenced GBS library (.fastq rawdata set; see Appendix 2). Tags were aligned to the Daphnia pulex reference genome wfleabase.org/genome/Daphnia_ pulex/current/fasta/dpulex-all-chromosome-jgi060905.fasta.gz, which contains 9,080 scaffolds. A variant calling format (VCF) file was filtered for variant sites to cover at least 90% of individuals and for biallelic positions. Linkage disequilibrium (LD) between pairs of variant sites was evaluated by using the software package TASSEL (LD type: Full Matrix 35 ), and estimated by using standardized disequilibrium coefficients (D' 36 ) and squared allele-frequency correlations (r 2 37 ). The probabilities of obtaining LD estimates at least as extreme as those observed under a hypothesis of linkage equilibrium (p-values) were calculated by using Fisher's exact test for site pairs with two alleles each. P-values ≤ 0.05 were considered significant to remove all but one linked SNPs for further analyses.
Population genomic analyses were performed as follows. First, we ran a principal coordinate analysis (PCoA) using the codominant genetic distance matrix generated from highly confident SNPs to elucidate the genetic relationships for the geographic area sampled using GenAlEx ver. 6.501 38,39 . Second, the extent of admixture and the origin of the different genetic proportions were investigated using a Bayesian clustering method implemented in the STRUCTURE ver. 2.3.4 program 40 . We calculated the posterior likelihood values for a range of clusters from K = 2 to 5. For each K, we ran four simulations with a burn-in of 30,000 Markov-Chain-Monte-Carlo (MCMC) iterations and 300,000 iterations after burn-in. We applied the correction of Evanno et al. 19 to estimate the most likely value for K by using the CLUMPAK server (http://clumpak.tau.ac.il/bestK.html).
Outlier identification and mapping annotation. We performed a genome scan analysis on all individuals to identify the potential adaptive loci, which can be directly under selection or closely-linked (i.e., "hitchhiking") to loci under selection. We used two methods, Lositan programme 41 and Bayescan 42,43 for this purpose. Lositan uses F DIST 44 based on an infinite island model of populations 45 . Bayescan uses multinomial-Dirichlet distribution and estimates posterior probabilities of a locus to be under selection or not. In Lositan, we used 100 000 simulations with neutral mean F ST (force mean F ST ) and 0.99 confidence interval. In Bayescan, we used 100 000 iterations with a sample size of 10 000 and thinning interval of 10.
Finally, we mapped the outlier SNPs to the Daphnia pulex genome based on the published gene models at http://genome.jgi-psf.org/Dappul/Dappul.download.ftp.html (FrozenGeneCatalog20110204.gff.gz 4 ) using BEDTOOLS window feature (-w 5000), which takes into account the given number of base pairs upstream and downstream of the outlier SNP/position due to the fact that an identified SNP may not be directly under selection, but linked ("hitchhiking") to a locus under selection, covering 10,000 base pairs in total.
Annotated genes included into the KOG_JGI nomenclature (i.e., EuKaryotic Orthologous Groups, a tool for identifying orthologous and paralogous proteins) were plotted by iPATH v2 tool 46,47 (interactive Pathway