Transcriptomic resources for evolutionary studies in flat periwinkles and related species

The flat periwinkles, Littorina fabalis and L. obtusata, comprise two sister gastropod species that have an enormous potential to elucidate the mechanisms involved in ecological speciation in the marine realm. However, the molecular resources currently available for these species are still scarce. In order to circumvent this limitation, we used RNA-seq data to characterize the transcriptome of four individuals from each species sampled in different locations across the Iberian Peninsula. Four de novo transcriptome assemblies were generated, as well as a pseudo-reference using the L. saxatilis reference transcriptome as backbone. After transcripts’ annotation, variant calling resulted in the identification of 19,072 to 45,340 putatively species-diagnostic SNPs. The discriminatory power of a subset of these SNPs was validated by implementing an independent genotyping assay to characterize reference populations, resulting in an accurate classification of individuals into each species and in the identification of hybrids between the two. These data comprise valuable genomic resources for a wide range of evolutionary and conservation studies in flat periwinkles and related taxa.


Background & Summary
Marine gastropods of the genus Littorina are among the most interesting models for studying adaptation and speciation [1][2][3] . For example, the recent genomic resources made available for the rough periwinkle (L. saxatilis) have provided important knowledge about the genomic architecture of adaptation and of parallel ecotype evolution [4][5][6] . Although less studied, the flat periwinkles, L. fabalis and L. obtusata, have recently gained recognition as a model system to inform us about the late stages of ecological speciation and how different reproductive barriers accumulate and interact across the speciation continuum [7][8][9] . These two sister-species usually dwell on brown algae ( Fig. 1), presenting a largely overlapping distribution across the European Atlantic shores 10 . In some countries (e.g. Spain and United Kingdom), their different ecological requirements result in vertical zonation, with L. obtusata mainly inhabiting the mid to upper intertidal, whereas L. fabalis is more commonly found in the lower part of the shore 11 . Despite clear morphological and ecological differences, recent studies suggest that they are not fully reproductively isolated 9,12 , with a remarkable incidence of hybrids in Cabo do Mundo (Portugal) 7,8 . However, these studies were based on a relatively small number of markers, precluding a more detailed characterization of the type of hybrids. In order to circumvent these limitations, here we use RNA-seq to increase the genomic resources available for this system. We present the first transcriptome for flat periwinkles, identify putative fixed SNP differences between the two species and demonstrate the potential of this information to detect hybrids between them. These resources will facilitate additional evolutionary studies, including the role of adaptive introgression and reinforcement in the diversification of flat periwinkles, producer-herbivore interactions, as well as different locations in Northwestern Iberian Peninsula, from Abelleira (Galicia, Spain) in the North to Mindelo (Portugal) in the South, during 2014. Individuals collected in each sampling site (N = 12 to 27) were transported alive to the laboratory (ECIMAT, University of Vigo, Spain) within 24 hours after collection. Individuals were sexed under a dissection microscope. Because the two species can be distinguished based on male genitalia morphology 10 , only males were retained. These were placed in separate aquaria (one per site and species) with running marine water and aeration during 11 to 12 days for acclimation. The maintenance of the samples in similar environmental conditions before RNA extraction was performed to ensure that differences in expression between individuals from the two species were not specifically related to their condition at collection time. The only source of nutrients during this period was the running water, to minimize sequence contamination with DNA from other organisms present in their digestive tract. At the end of the acclimation period, samples were flash frozen in liquid nitrogen and kept at -80 °C until further analyses. Four males per species, each from a different location, were then selected for sequencing (  Fig. 3). In order to certify that individuals were accurately assigned to one of the species, only males were analysed in the reference populations. However, in Cabo do Mundo, where admixture was previously inferred based on microsatellites, two nuclear introns and mtDNA [7][8][9] , both males and females were included. RNA extraction. For each snail, shell and hepatopancreas were first removed (Table 2), and total RNA was extracted from all remaining tissues using the TRIzol ® reagent (Invitrogen) according to manufacturer's instructions. Residual DNA was subsequently eliminated with the TURBO DNA-free kit (Ambion) according to the manufacturer as well. RNA was purified with the RNeasy MinElute Cleanup Kit (Qiagen).
Library preparation and sequencing. Starting with 1 μg of RNA from each sample, library building (with individual barcodes) was performed with the TruSeq RNA Sample Preparation Kit v2 (Illumina) following the manufacturer's protocol and aiming for insert sizes of ~200 bp. The final libraries were pooled altogether in equimolar ratios before sequencing. Sequencing was performed in two different runs (using the same pool), each consisting of 4/5 of a lane in paired-end mode (first run 2 × 100 bp and second 2 × 125 bp) on an Illumina HiSeq 1500 in CIBIO/InBIO laboratories.
Quality control of sequence data. The quality of the raw reads 16 was assessed with FastQC v0.11.5 17 .
Subsequently, reads were clipped to remove Illumina adapters and trimmed for quality with TRIMMOMATIC v0.36 18 with the following steps: 1. Remove the adapters; 2. Cut the initial 10 bases at the start of the reads; 3.
Trimming when average quality of the nucleotides within a 4 bp window was below 15; 4. Remove bases at the start and end of the reads if quality was below 20; and 5. Discard reads shorter than 25 bp. Read quality was re-evaluated with FastQC at the end of this process 19 . transcriptome assembly and annotation. In order to identify highly discriminant SNPs between L. fabalis and L. obtusata, reference transcriptomes were generated using two main approaches. The first one ( Fig. 2) was based on the recently available reference transcriptome for L. saxatilis (deposited at DDBJ/ENA/ GenBank under the accession GHPE00000000.1), a member of the sister clade of flat periwinkles closest relatives. It consisted of applying the Pseudo-it pipeline developed by Sarver et al. 20 to obtain a pseudo-reference minimizing mapping biases towards any of the species. More specifically, this pipeline allows the generation of www.nature.com/scientificdata www.nature.com/scientificdata/ a pseudo-reference by replacing the original nucleotides in the reference (L. saxatilis) transcript sequences (e.g. A) by those that represent a fixed difference among the eight flat periwinkle individuals (e.g. T) or by using the IUPAC ambiguity code (e.g. R) when an alternative allele is fixed among the four L. fabalis (T) and the four L. obtusata (C) sequenced in this study. This was done through an iterative process with multiple steps, including: mapping reads to the reference using bwa mem v0.7.15 21 , removing duplicates with Picard v2.8.2 (http://broadinstitute.github.io/picard), identifying insertions and deletions following realignment and calling haplotypes with Genome Analysis Toolkit -GATK v3.7 22 , and injecting the filtered variants in the original reference making use of the FastaAlternateReferenceMaker within GATK. In line with the authors' recommendations, five iterations were implemented to guarantee a full incorporation of species variability. Further processing (filtering, indexing, merging and sorting) was done with SAMtools v1.3.1 23 .
The second approach consisted of reconstructing species-specific de novo transcriptome assemblies. Four de novo assemblies were generated: two using the reads from the four individuals sequenced for each species (Fab4ind and Obt4ind), and the other two using only the reads from the individual with the highest coverage for each species (fSAM-G03R12 and oMIN-G09R07 for L. fabalis and L. obtusata, respectively). All these assemblies were performed using TRINITY v2.2.0 24 with default parameters, using the cleaned paired reads. The quality and completeness of the produced assemblies was evaluated with Transrate v1.0.3 25 . This tool defines an optimal cut-off for each assembly, enabling the removal of possible chimeras and poorly supported contigs by mapping back the cleaned reads against the generated reference. Only the best-supported contigs ("good transcripts") were de novo pipeline Pseudo-it pipeline Mapping Table 1. Sampling information. Shown is the location and respective coordinates, collection date, sample size (N) and species for transcriptome sequencing (locations 1 to 8), as well as for SNP genotyping (locations 9 to 11). *Previously analysed by Carvalho et al. 7 and Costa et al. 8 .
www.nature.com/scientificdata www.nature.com/scientificdata/ retained for further analyses. In order to reduce redundancy, an extra step was performed with CD-HIT-EST v4.6.4 26 to keep only one transcript among those with more than 95% similarity. CD-HIT-EST was used after Transrate for all de novo transcriptomes, except for L. fabalis based on multiple individuals probably due to its high complexity. Finally, TransDecoder v3.0.0 27 was employed to predict candidate coding regions, discarding possible non-coding RNA and DNA contamination.
The final transcriptome comprises those transcripts with predicted open reading frames and with homologs in the Littorina saxatilis transcriptome reference plus those showing homology with PFAM common protein domains 28 . Raw and curated versions of each transcriptome are available in Figshare 19 . Annotation of transcripts was carried out using Conditional Reciprocal Best BLAST (crb-blast) v0.6.9 29 against the L. saxatilis reference transcriptome and the Swiss-Prot database 30 and is available in Figshare 19 . SNP calling. SNP calling was independently performed for L. fabalis and L. obtusata 16 . First, reads from all individuals were aligned to the pseudo-reference and to the respective species-specific de novo references    23 was used for converting the resulting alignments (SAM format) into binary files (BAM format), sorting and fixing mate-pair information. Duplicate reads were removed with Picard v1.140 using MarkDuplicates. Read realignment was performed with RealignerTargetCreator and IndelRealigner tools within GATK v3.7 22 . The pseudo-reference was chosen for downstream analyses based on mapping statistics (Table 3). However, SNPcall was also performed using the references obtained with the other assemblies 19 . Raw variant calling was carried out using bcftools mpileup and bcftools calls 31

Genetic variation and population structure. Genetic variability across individuals and species was
first assessed by a Principal Component Analysis (PCA) using PLINK v1.90b3.45 32 , which was plotted with the ggplot2 R package 33 . This analysis was based on a subset of SNPs that were called in all eight individuals, with a minimum coverage of 20 and a minimum minor allele count of 2 (i.e. singletons excluded to avoid errors) obtained with VCFtools 34 . Furthermore, only one random biallelic SNP per transcript was retained, reducing the effect of linkage disequilibrium. Population structure analysis was performed with fastStructure v1.0 35 based on the same SNP subset, with 10 cross-validation tests for each of three replicate runs, and considering the number of genetic clusters (K) between 2 and 8. The best number of clusters was inferred with the chooseK python script of fastStructure.
Putative fixed single nucleotide differences between species were identified using the python tools created by Simon Martin (https://github.com/simonhmartin/genomics_general), after filtering the raw VCF file for minimum SNP quality of 30 (Phred-scaled probability that the polymorphism exists), minimum coverage of 5 per sample and minimum number of calls of 8 (1 per individual). A more conservative set of putatively fixed SNP differences was also created by applying a more stringent filter of a minimum coverage of 20 per sample 19 . Gene ontology enrichment. A Gene Ontology enrichment analysis was performed to assess if the annotated subset of transcripts/genes with putatively fixed differences between L. obtusata and L. fabalis were enriched for any functional category. This analysis was performed using the more conservative set of candidate fixed SNPs with the online tool gProfiler 36 . The background panel consisted of all annotated genes with at least one SNP. Over-represented GO categories were identified using a hypergeometric test with a significance threshold of p < 0.05 after Benjamini-Hochberg correction 37 .

Selection of informative SNPs for validation.
A genotyping assay was designed with the goal of validating a subset of the identified SNPs and assessing their informativeness for species discrimination. Sequences of 101 bp with a fixed nucleotide difference in the central position were randomly selected from the list of annotated contigs/genes containing putative diagnostic SNPs (less conservative set). Sequences containing at least one polymorphism across 50 bp to each side of the target SNP were discarded to reduce the probability of missing genotypes or null alleles. Finally, to ensure that the 101 nucleotides were located in the same exon avoiding fragments that span introns, all the selected fragments were blasted against the L. saxatilis genome 38 and only those with a single hit of 101 bp continuous alignment with a maximum of 1 bp difference were retained for developing the assay.

SNP genotyping and informativeness. A final set of 42 putatively diagnostic SNPs was tested with an
independent genotyping assay at the National Genomics Infrastructure (NGI), SNP&SEQ Technology Platform (www.genotyping.se) at Uppsala University, Sweden. This consisted of using multiplexed primer extension (iPLEX) and a MassARRAY analyzer 39 from Agena Bioscience for allele detection by mass spectrometry, and subsequent conversion to a genotype using Typer (Agena Bioscience). Forty-eight samples from the Iberian Peninsula were genotyped (Fig. 3, Table 1). Hardy-Weinberg equilibrium (HWE) was estimated for each locus and reference population combination. Since there was no variability for most SNPs within the reference populations, linkage disequilibrium (LD) between locus pairs was evaluated across all populations. Both HWE and LD (genotypic disequilibrium) were evaluated through exact probability tests in Genepop v4.2 40  www.nature.com/scientificdata www.nature.com/scientificdata/ markers in LD or not conforming to HWE, the ancestry of each individual (L. fabalis, L. obtusata, or admixed) was inferred using STRUCTURE v2.3.4 [42][43][44] , running 10 replicates for K = 2, each consisting of 1,000,000 iterations after a 100,000 burn-in. Since all replicates resulted in a similar outcome, the one with the highest likelihood was plotted with DISTRUCT v1.1 45 .

Data Records
The core dataset of this work comprises the transcriptomic data of flat periwinkles. The raw data (eight SRA experiments) were deposited in NCBI Sequence Read Archive, with the BioProject accession number PRJNA556984 16 (Table 1). FASTQ files were divided by species (L. obtusata -Lobt and L. fabalis -Lfab) and biosample-specimen (oMOI, oCAN, oBAI, oMIN, fSAM, fABE, fTIR and fMUR). For each biosample, four files were submitted, corresponding to two different Illumina HiSeq lanes and one for each of the paired reads. technical Validation RNA integrity. RNA quality was evaluated after electrophoresis on the Bioanalyser 2,100 (Agilent Technologies) with the RNA 6000 Nano Chip Kit (Agilent Technologies). RNA quantity was measured with Qubit using the RNA BR Assay Kit (Life Technologies).

RNA-Seq data quality.
A total of 736 million reads were generated (368,266,256 paired-end reads), with an average of ~92,000 reads per sample (Table 3). From these, ~705 million reads (96%) were kept after removing the adapters and trimming for quality. These post-cleaning reads passed the minimum quality standards of FastQC 17 .
Pseudo-reference information. The pseudo-reference transcriptome is composed of 37,873 contigs (containing a total of 82,360,655 bp) with a mean contig length and N50 of 2,139 bp and 2,963 bp, respectively (Table 4). Annotation using a conditional reciprocal best blast hit approach against the Swiss-Prot database coding sequences resulted in 10,968 annotated transcripts (29%), while 91% of the transcripts had 1to1 orthologs in the L. saxatilis reference transcriptome.
Species-specific transcriptome assembly and annotation. The L. fabalis assembly based on the "cleaned" reads from 4 individuals of this species initially resulted in 396,047 contigs with a mean length of 612 bp, a N50 of 832 bp and a total size of 242,362,528 bp; whereas the L. obtusata transcriptome is 251,985,325 bp long, distributed across 349,459 contigs with a mean length of 721 bp and N50 of 1,133 bp ( Table 5). The assemblies based on a single individual from each species resulted in lower values for all parameters when compared to those based in all four individuals: 177,022 contigs with N50 of 715 bp for L. fabalis; and 208,362 contigs with N50 of 1,159 bp for L. obtusata ( Table 5).
Filtering of these contigs based on CD-HIT-EST and TransRate optimal assembly score (0.08 for L. obtusata   Table 4. Summary statistics for the transcriptome pseudo-reference assembly. www.nature.com/scientificdata www.nature.com/scientificdata/ and a N50 length of 1,080 and 861, for L. fabalis and L. obtusata respectively. The single individual assemblies are composed of 31,279 (fSAM) and 24,047 (oMIN) ORF, with mean lengths of 685 bp and 740 bp, and a N50 length of 801 and 912, respectively (Table 5).
Regarding the annotation of the assemblies based on data from four individuals, a total of 38,439 transcripts in the L. fabalis and 23,237 in L. obtusata were annotated (72% in both species), of which 10,592 and 6,275 were uniquely annotated to the Swiss-Prot database, while 20,246 and 15,690 had homologs in the L. saxatilis reference transcriptome. The annotation of the assemblies based on a single individual resulted in 23,947 annotated transcripts for L. fabalis (fSAM) and 19,228 for L. obtusata (oMIN) 19 .
Mapping, SNP calling, genetic differentiation and gene ontology enrichment. In total, 253,969,198 reads (mean = 36%) were mapped to the pseudo-reference transcriptome ( Table 3). The relatively small number of mapped reads can be due to the large number of transcripts filtered out during the assembly, and also possibly due to exogenous DNA. After filtering, a total of 90,826 SNPs, present in all samples, were inferred (SNPcall_Biallelic_NoMissing_PseudoRef_min20.vcf). Principal Component Analysis was based on the filtered dataset, from which 7,061 SNPs were randomly sampled from independent transcripts. The first component of the PCA explains 64% of the total variance and separates the individuals from different species, while the second component, explains 8% of the variance and essentially separates the L. fabalis individuals according to geographical distance: two from Ria de Vigo (fSAM-G03R12 and fTIR-G10R14) and two from Ria de Muros (fABE-G04R13 and fMUR-G11R15) (Fig. 4). The Bayesian assignment analysis with fastStructure, with K = 2 as the number of clusters that maximized the marginal likelihood and better explained the structure in the data, confirms the classification of the four individuals from each species, without any evidence of admixture (Fig. 5).
Putative species-diagnostic SNPs were identified from the raw SNP call using the python tools created by Simon Martin and two different filtering criteria. In the more stringent filter (minimum coverage of 20), 19,072 SNPs were inferred as putatively species-diagnostic (i.e. alternative alleles fixed within the group of four individuals from each species) whereas in the less stringent filter (minimum coverage of 5), 45,340 were identified as putatively species-diagnostic SNPs (FixedSNPs_PseudoRef_min5.vcf and FixedSNPs_PseudoRef_min20.vcf, respectively 19 ).  Table 5. Summary statistics of the Littorina obtusata and L. fabalis transcriptome assemblies. *Due to the complexity of the L. fabalis assembly based on multiple individuals, the order of the curation steps was reversed. # Based on the data from the four individuals from each species described in Table 2.
www.nature.com/scientificdata www.nature.com/scientificdata/ Among the genes with fixed differences between species no enrichment for a particular GO term was found. However, since RNA was extracted basically from the whole individual (not from individual tissues) and many of the transcripts may not be complete, this result should be interpreted with caution.

SNP genotyping validation and informativeness.
From the 42 putatively diagnostic SNPs selected for validation, two failed the quality control tests for primer design. Among the remaining SNPs, one failed during genotyping, one revealed no variability and three others revealed more than two alleles. All these SNPS were excluded resulting in a final set of 35 SNPs. All loci were genotyped for at least 91.7% of the individuals, whereas all individuals were genotyped for at least 94.3% of the loci. No conflicts were found among genotyping replicates (215), resulting in 100% reproducibility. Among these 35 loci, 26 (74.3%) SNPs are potentially diagnostic between L. fabalis and L. obtusata, i.e. fixed differences between the putatively pure populations. The non-diagnostic set consisted of two (5.7%) nearly diagnostic SNPs (one allele with a frequency ≥0.95 in one species and 0 in the other), five (14.2%) informative SNPs (one allele with a frequency ≥0.70 in one species and 0 to 0.03 in the other), and two others (5.7%) with very little or no discriminatory power between the two species defined morphologically according to male genitalia.
No significant Hardy-Weinberg deviations for any of the 35 loci were detected in the reference samples. However, LD remained significant between eight pairs of loci across the three populations. This resulted in the exclusion of six additional SNPs to avoid using non-independent information in subsequent analysis of ancestry. The STRUCTURE analysis for K = 2 based on these 29 SNPs confirmed the pure genetic ancestry of the reference L. fabalis and L. obtusata individuals and the occurrence of admixture in Cabo do Mundo, demonstrating the usefulness of these markers for species discrimination and detection of hybridization between flat periwinkles (Fig. 6).

Usage Notes
The two species-specific transcriptomes and the SNPs made available in this study are the first genomic/transcriptomic resources made publicly available for flat periwinkles. This information will be useful for a wide range of studies, including evolutionarily-oriented projects concerning adaptation, parallel evolution, comparative immunology, producer-herbivore interactions, hybridization and speciation; as well as for characterization of population genetic diversity and differentiation across the species' distribution ranges to help implementing appropriate management and conservation measures. However, it is important to emphasise that the discriminatory power of the SNPs here developed was based on individuals from the Iberian Peninsula. Although we expect that most of these SNPs will also be diagnostic in other regions, their power to discriminate between L. fabalis and L. obtusata should first be tested in other parts of the species distribution range before they are applied outside Iberia.  Table 1. No signatures of admixture were found in these individuals. The analysis was based on 7,061 random SNPs from different transcripts.