Mountain hare transcriptome and diagnostic markers as resources to monitor hybridization with European hares

We report the first mountain hare (Lepus timidus) transcriptome, produced by de novo assembly of RNA-sequencing reads. Data were obtained from eight specimens sampled in two localities, Alps and Ireland. The mountain hare tends to be replaced by the invading European hare (Lepus europaeus) in their numerous contact zones where the species hybridize, which affects their gene pool to a yet unquantified degree. We characterize and annotate the mountain hare transcriptome, detect polymorphism in the two analysed populations and use previously published data on the European hare (three specimens, representing the European lineage of the species) to identify 4 672 putative diagnostic sites between the species. A subset of 85 random independent SNPs was successfully validated using PCR and Sanger sequencing. These valuable genomic resources can be used to design tools to assess population status and monitor hybridization between species.


Background & Summary
The mountain hare (Lepus timidus) is an Arcto-alpine species that was the most common and widely distributed hare species across Europe during the last glacial periods 1 . Nowadays, the mountain hare is distributed from Fennoscandia to Eastern Siberia, but also occurs in isolated/refuge populations (e.g., Ireland, Scotland, the Alps, Poland, the Baltics and Japan), and in places where it has been introduced (Iceland, England, Faroe Islands and New Zealand) (see Fig. 1). Even though they are a popular game species and abundant within its range, mountain hares have sharply declined in some regions, particularly in areas of contact with the European hare (Lepus europaeus), where the latter tends to invade and replace the range of the former [1][2][3][4] . Mountain and European hares share extensive natural and human-induced contact zones in Western Europe, from the British Isles to Scandinavia and Central Europe (Fig. 1). Climate change is predicted to affect lagomorphs extensively 5,6 and, in particular, to accelerate the replacement of mountain hares by European hares in the contact zones, such as the Alps, Sweden or Ireland 7,8 . The two species may hybridize when in contact, resulting in some genetic introgression [9][10][11][12][13] , with potential effects on local adaptation 14 .
Even though the mountain hare and other hare species have been the subject of several population genetics studies, these have been mostly based on a few markers 10,[15][16][17] . Therefore, permanent genomic resources provide fundamental information to develop monitoring tools to evaluate population status and implement protective policies. In this work, we use high-throughput RNA sequencing to: i) generate genomic resources for the mountain hare; and, ii) use published data on the European hare 18 to pinpoint candidate fixed differences between the species that can be used to build genotyping tools to monitor gene exchange in the contact zones. We here present the first mountain hare transcriptome, and the most complete among the currently available European Lepus transcriptomes.

Methods
A summary of the methodological workflow is shown in the flowchart of Fig. 2.

Sampling procedure and locations
Specimens from the Alps (see Fig. 1) were sampled during regular permit hunting in Grisons, Switzerland. Specimens from Ireland (see Fig. 1 Irish hares were dispatched humanely and in accordance with the licence conditions by means of lethal injection administered by Mr William Fitzgerald, Veterinary Laboratory Service Follow (MVB MVM CertCSM), from the Department of Agriculture, Food and the Marine, Regional Veterinary Laboratory, Hebron Road, Kilkenny, R95 TX39. Total RNA was isolated from 8 individuals.

RNA extraction
Liver tissue was freshly collected, immediately preserved in RNAlater and then stored at −80°C until RNA extraction. Prior to extraction, frozen samples were ground in liquid nitrogen with a ceramic mortar and pestle. Mortar and pestle were washed prior to extraction using a 6-step wash that includes the following washing reagents in order: 70% ethanol, tap water, 10% bleach, milli-Q water, RNase away (Thermo Fisher Scientific) and finishing with molecular grade H 2 O. RNA extraction was performed using RNeasy Mini Kit according to manufacturer instructions.

RNA sequencing library preparation
The SureSelect Strand-Specific RNA Library Prep for Illumina Multiplexed Sequencing (Agilent Technologies) kit was used to prepare cDNA libraries for all samples. Library sizes were estimated using a Bioanalyzer 2,100 and quantified using KAPA Library quantification kit (KAPA BIOSYSTEMS). Equal molar concentrations of each library were pooled together for sequencing.
Sequence data processing and de novo transcriptome assembly A detailed description of tools and commands used in the data analysis is shown in Table 1 (available online only). A first quality evaluation of obtained sequence reads (Data Citation 1) was performed with FastQC v0.11.5 19 . After read quality inspection, adapters were removed and quality trimming performed using TRIMMOMATIC v0.36 20 , with instructions to remove the first ten bases, Illumina adapters, reads below 25 bp long and bases in the ends of reads with quality below 10, and to perform a 4-base sliding window trimming and cutting fragments with an average quality below 10. Trimmed-read quality was rechecked with FastQC (Data Citation 2). A de novo transcriptome assembly was then performed using all properly paired reads from the eight individuals in the dataset using TRINITY v2.2.0 21 , establishing RF as read orientation for a strand-specific assembly. In addition, as a complementary resource, de novo transcriptome assemblies for each of the two sampling localities were also performed. Transrate v1.0.3 22 was used to evaluate assembly quality and completeness and to remove possible chimeras and poorly supported contigs. Cleaned reads were mapped back to the produced assembly and only the wellsupported contigs were retained (Transrate optimal cut-off >0.024). In order to remove redundancy produced by using multi-sample data to perform the assembly, all contigs were clustered using CD-HIT-EST v4.6.4 23

SNP inference
SNP calling was performed separately for mountain hares (Data Citation 1) and European hares (Data Citation 3, from Amoutzias et al. 18 ). The three European hare specimens represent the European lineage of the species 18 . First, reads from all the individuals were mapped to the filtered mountain hare de novo transcriptome with bwa-mem v0.7.15 34 with default parameters and read group information added to each sequencing lane-sample pair. The resulting alignments were converted to a binary file (bam format), sorted and submitted to fixmate step using SAMtools v1.3.1 35 . Duplicate reads were removed using Picard v1.140 (http://broadinstitute.github.io/picard) with the option MarkDuplicates. Realignment and recalibration was performed with Genome Analysis Toolkit v3.6-0 36 . Finally, SNP call was carried out using Reads2snp v2.0.64 37 using a threshold of 20 for site and mapping qualities, the paralog filter, a minimum coverage of 10X and a genotype probability >0.95. The resulting VCF file was deposited in Figshare (Data Citation 2). Only SNPs represented in all sampled specimens were retained.

Differentiation, admixture and Gene Ontology enrichment analysis
A set of random 5 502 SNPs, selected from independent contigs in order to reduce the linkage probability, was identified with VCFtools v0.1.14 38 . PGDSpyder v2.1.1.0 39 was used to convert this file to the required file formats. Partitions of genetic diversity in the dataset were investigated with a Principal Components Analysis, using PLINK v1.90b3.45 40 and ggplot2 R package 41 to plot the results. Additionally, the data were analysed using the admixture model implemented in STRUCTURE 2.3.4 42 , with three replicate runs with 1 million steps after a burn-in period of 200 000, and K = 2. Results were plotted using CLUMPACK 43 . Gene Ontology enrichment analyses were performed for the collection of contigs/genes with fixed differences between mountain and European hare samples, and between mountain hare sampling localities. The analysis was based on the rabbit proteome annotations and performed with g:Profiler 34 , applying the g:SCS multiple test correction and the 'best per parent group' hierarchical filter. The background set of genes was reduced to contigs with SNP information.

Independent SNP genotyping
A random set of 110 SNPs, inferred as potentially diagnostic between L. timidus and L. europaeus, was selected for independent validation using Sanger sequencing. DNA was extracted from two of the previously analysed mountain hare samples (one Alpine, Sample_3112, and one Irish, Sample_3103) and two other European hare specimens (sampled in Clermont-Ferrand-Sample-1569-Font-Romeu, Pyrenees-Sample-1550-in France during the regular hunting season). DNA extraction was performed using JETQUICK Tissue DNA Purification kit (Genomed). PCR primers were designed to be anchored in a single exon (taking into account intron-exon boundaries from the European rabbit reference genome) and to amplify a portion of 110 independent contigs containing at least one putative diagnostic SNP. The Primer sets were designed using the Scrimer pipeline 44 , which depends on Primer3 45 to design and set the primer conditions. A third internal sequencing primer was designed. PCRs were performed using QIAGEN Multiplex PCR Master Mix (Qiagen) and the following thermal cycling profile: initial denaturation at 95°C for 15', 35 cycles of denaturation at 95°C for 30'', annealing at 60-67°C for 20'' and elongation at 72°C for 30'', and a final extension step at 72°C for 5'. PCR products were visually inspected under UV-light after electrophoresis in agarose gels stained with GelRed (Biotium), purified with Exonuclease I (New England Biolabs) and FastAP Thermosensitive Alkaline Phosphatase (Thermo Scientific), and sequenced using internal or, in a few cases, PCR primers in a ABI 3130xl genetic analyzer.

Code availability
Analyses in this work were performed with freely available open access tools mainly using command line versions (Table 1 (available online only)). Parameters are described in the methods section and software versions and commands used are detailed in Table 1 (available online only).

Data Records
Forty-eight raw FASTQ files were submitted to NCBI Sequence Read Archive, with accession number SRP095715 (Data Citation 1 and Tables 2 and 3). FASTQ files were divided in two sets, corresponding to the sampling localities (Ltim_Ireland and Ltim_Alps), and by biosample-specimen (SAMN06186748-3101, SAMN06186761-3102, SAMN06186762-3103 and SAMN06186763-3105; SAMN06186727-3112, SAMN06186728-3113, SAMN06186729-3114 and SAMN06186738-3116). In each biosample, six files were submitted, corresponding to three different Illumina HiSeq sequencing lanes and two read directions. Pre/post-cleaning FASTQC base quality pdf report (FASTQC.pdf) can be accessed in Figshare  (Data Citation 2). This dataset is the core of this work and has not been released or analysed previously. Trinity raw assemblies (Ltimidus_Trinity.fasta, LtimidusIreland_Trinity.fasta and LtimidusAlps_Trinity.fasta) were deposited on Figshare (Data Citation 2 and Table 4). The curated transcriptome assembly fasta files (LtimidusTranscriptome.cds.fasta and LtimidusTranscriptome.pep.fasta) and the annotated database file (LtimidusTranscriptome.xlsx) can also be found in Figshare

RNA integrity
The quality and quantity of each RNA sample was assessed using the 260/280 and 260/230 absorbance ratios estimated by an IMPLEN P330 NanoPhotometer and RNA Integrity Number (RIN) and concentration (μg μl −1 ) with a Bioanalyzer 2,100 (Agilent Technologies). All samples had RIN values above 8.

RNA-Seq data quality
The Illumina HiSeq run produced a total raw output of 103 941 215 100 bp paired-end reads (207 882 430 total reads). Adapter removal and quality trimming decreased this number to 201 569 448 reads (97%) ( Table 4). Final analysed reads passed the minimum quality parameters as established by FastQC.

Transcriptome assembly curation, annotation and quality
Cleaned reads were assembled into 272 183 contigs with a mean length of 594 bp and a N50 length of 839 bp (Table 4). After assembly curation with Transrate optimal cut-off >0.024, clustering with a 95% similarity threshold and open reading frame prediction, were retained 25 868 transcripts with a mean length of 842 bp and a N50 length of 1 182 (Table 4).
Annotation using a conditional reciprocal best blast hit approach results in 16 772 (65%) annotated transcripts, of which 13 641 were annotated to the rabbit transcriptome and 15 955 to the Swiss-Prot database (Fig. 3). In order to reduce the number of non-annotated transcripts, the less stringent unidirectional blast hit was added to the database. Hits were recovered for 25 549 transcripts (99%) (Fig. 3).
The mountain hare transcriptome produced in this study represents an important improvement compared to the currently available transcriptomic resources for European Lepus-L. granatensis 46 and L. europaeus 18 transcriptomes-as it performs better on several assembly statistics, such as reference coverage (42 versus 32% in L. granatensis and 40% in L. europaeus; using the rabbit transcriptome as reference).

Genetic variation, differentiation and gene ontology enrichment
In total, 218 057 526 reads (63%) were mapped to the filtered transcriptome-136 511 846 mountain hare reads (68%) and 81 545 680 European hare reads (57%) (see statistics in Table 5). After filtering, 159 629 high-quality SNPs were inferred, of which 41 182 (26%) were sequenced in all eleven specimens. A summary of polymorphic, shared and fixed SNPs is shown in Fig. 3. 4 672 putative species-diagnostic SNPs (considered when species presented alternative fixed alleles) were inferred (Data Citation 2, Supplementary Tables 1, also deposited in Figshare). The diagnostic power of our SNP set could be strongly reduced if any of the sequenced specimens was admixed (namely from the Alps, where the species overlap). We therefore conducted a Principal Component Analysis and a Bayesian Assignment analysis to assess our ability to separate the species. The results suggest that the analysed mountain and European hares are well differentiated with our SNP set, and only possible limited levels of admixture were found for Sample-3116 (Fig. 4). An extra table of putative species-diagnostic SNPs excluding that individual was therefore produced (Data Citation 2, Supplementary  Irish and Alpine samples respectively, and 126 were fixed between sampling localities (Data Citation 2, Supplementary Tables 5, deposited in Figshare). The 'membrane part' gene ontology term was found enriched in the collection of genes with fixed differences between the Irish and Alpine mountain hare samples, while terms 'lipid metabolic process', 'small molecule catabolic process', 'extracellular space and acyl-CoA dehydrogenase activity' were found enriched in genes with fixed differences between samples of the two species. Note however that even though the background gene set was controlled for, RNAsequencing data does not provide an unbiased sample of information across different genes and these results may represent tissue-related functions.

Usage Notes
These genomic resources (which greatly extend previously available marker sets; e.g. 50 ) will be useful for a variety of studies, particularly in the characterization of genetic diversity in mountain hare populations and on the development of hybridization monitoring tools. Note that SNPs were here inferred from an uneven and small species sample, and therefore any diagnostic genotyping assay built from this data  should be first tested with adequate sample sizes from pure parental populations of the species, before being applied to hybrid zones. material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/ The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/ zero/1.0/ applies to the metadata files made available in this article.