Genome- and transcriptome-derived microsatellite loci in lumpfish Cyclopterus lumpus: molecular tools for aquaculture, conservation and fisheries management

The lumpfish Cyclopterus lumpus is commercially exploited in numerous areas of its range in the North Atlantic Ocean, and is important in salmonid aquaculture as a biological agent for controlling sea lice. Despite the economic importance, few genetic resources for downstream applications, such as linkage mapping, parentage analysis, marker-assisted selection (MAS), quantitative trait loci (QTL) analysis, and assessing adaptive genetic diversity are currently available for the species. Here, we identify both genome- and transcriptome-derived microsatellites loci from C. lumpus to facilitate such applications. Across 2,346 genomic contigs, we detected a total of 3,067 microsatellite loci, of which 723 were the most suitable ones for primer design. From 116,555 transcriptomic unigenes, we identified a total of 231,556 microsatellite loci, which may indicate a high coverage of the available STRs. Out of these, primer pairs could only be designed for 6,203 loci. Dinucleotide repeats accounted for 89 percent and 52 percent of the genome- and transcriptome-derived microsatellites, respectively. The genetic composition of the dominant repeat motif types showed differences from other investigated fish species. In the genome-derived microsatellites AC/GT (67.8 percent), followed by AG/CT (15.1 percent) and AT/AT (5.6 percent) were the major motifs. Transcriptome-derived microsatellites showed also most dominantly the AC/GT repeat motif (33 percent), followed by A/T (26.6 percent) and AG/CT (11 percent). Functional annotation of microsatellite-containing transcriptomic sequences showed that the majority of the expressed sequence tags encode proteins involved in cellular and metabolic processes, binding activity and catalytic reactions. Importantly, STRs linked to genes involved in immune system process, growth, locomotion and reproduction were discovered in the present study. The extensive genomic marker information reported here will facilitate molecular ecology studies, conservation initiatives and will benefit many aspects of the breeding programmes of C. lumpus.


Results
Reduced-representation ddRAD sequencing of two individuals of Cyclopterus lumpus in a two sequencing runs on the Ion PGM ™ NGS platform resulted in a total of 990,653 quality filtered single-end reads (25-532 bp sequence length; 46 percent GC content). The transcriptome assembly based on 13C. lumpus individuals 47 , comprised of 346,430 transcripts from 221,659 trinity genes, while the de novo transcriptome assembly of the Fish-T1K data consists of 49 million assembled bases in 98,767 transcripts from 89,342 trinity 'genes' . The median transcript length was 362 bases, average length 550 bases and N 50 669 bases. Assembly of unigenes by CAP3 from the transcriptome assembly of C. lumpus by Eggestøl et al. 47 generated a total of 255,957 unigenes (52,671 contigs and 203,286 singletons), while for the Fish-T1K transcriptome a total of 53,703 unigenes (23,831 contigs and 29,872 singletons) were produced.
We obtained 2,346 STR-containing consensus sequences by analysing the ddRADseq data generated from two unrelated individuals of Cyclopterus lumpus. A total of 1,791 sequences contained STRs of different motif types, and 555 sequences contained two STRs. STR detection by the QDD-VM pipeline revealed a total of 3,067 g-STRs, of which 2,387 (77.83 percent) were simple repeat motifs and 680 (22.17 percent) were in compound formation (Table 1). Dinucleotide repeat motifs were most frequent (2,736; 89.21 percent), followed by trinucleotide (196, 6.39 percent), mononucleotide (79; 2.57 percent), and tetranucleotide (51; 1.66 percent) repeats, while only three (0.1 percent) hexanucleotide and two (0.07 percent) pentanucleotide repeat units were found (Fig. 1a). The distribution of g-STRs to different repeat motif length classes estimated by MISA varied from 5 to 42, with five repeats (25.69 percent) being most common, followed by 15+ (14.31 percent), six (12.16 percent) and seven (7.76 percent) repeats ( Table 2). The frequency of the dinucleotide repeat motifs was highly represented across repeat length classes. A total of 33 types of consensus (non-redundant) repeat motif were found among the STR-containing sequences (Table S1, Supplementary Material). The dominant repeat motif type was dinucleotide (AC/GT) n , where n refers to the number of times the unit is repeated, with a frequency of 67.80 percent (2,027), followed by (AG/CT) n (463, 15.10 percent), and (AT/AT) n (173, 5.60 percent). Trinucleotide repeats were only represented by an overall frequency of 6.10 percent among the 13 most abundant repeat motif types (Fig. 2a). Of the 2,346 STR-containing sequences, 723 were suitable for microsatellite primer design, and a total of 8,313 primers targeted at producing different amplicon sizes (multiple primer pairs) per locus were successfully designed (( Table 1); Table S2, Supplementary Material).
Using the QDD-VM pipeline to also screen the transcriptome of C. lumpus for Type I STRs, we isolated a total of 116,555 sequences containing STRs of different motif types, where 57,717 sequences contained between

Search parameters
Genomic Transcriptomic  (Fig. 1b). The distribution of EST-STRs to different repeat motif length classes ranged from 5 to 83, with a maximum frequency for five repeats (23.86 percent), followed by 10 (14.35 percent), six (12.05 percent) and 15+ (11.7 percent) repeats (Table 3). Mononucleotide repeat motifs were highly represented across repeat length classes, followed by dinucleotides and trinucleotides. We found a total of 145 types of non-redundant repeat motif among the STR-containing sequences (Table S3, Supplementary Material). The dominant repeat motif type was dinucleotide (AC/GT) n (75,479; 33.00 percent), followed by (A/T) n (62,984; 26.60 percent), and (AG/CT) n (25,631; 11.00 percent). Trinucleotide repeats were only represented by an overall frequency of 13.30 percent among the 13 most abundant repeat motif types (Fig. 2b). From the 116,555 STR-containing sequences, primers could be designed for 6,203 sequences, resulting in a total of 117,374 primers (( Table 1); Table S4, Supplementary Material).
Local BLAST search of the 22 microsatellites reported by Skirnisdottir et al. 13 against our ddRADseq dataset returned significant hits only for loci Clu11 (JX485370.1) and Clu40 (JX485383.1), with a 100 percent and 99.2 percent sequence similarity, respectively (Table S5, Supplementary Material). However, the STR contigs containing Clu11 and Clu40 did not meet our criteria for primer design in our dataset. A similar search for previously-reported C. lumpus loci in our EST dataset yielded significant hits for six loci: Clu07, Clu11, Clu19, Clu36, Clu40, and Clu45 (Table S5). As for the two previously-described g-STRs, these loci were not part of the sequences used for primer design in our study.
Functional annotation of EST-STR-containing unigenes based on the BLASTx analysis through BLAST2GO resulted in a total of 4,931 annotations. Through mapping, we could extract GO terms for assigning gene products into three categories, biological process (BP), cellular component (CC), and molecular function (MF). According   Table 3. Distribution of Cyclopterus lumpus EST-STRs to different repeat motif length classes.
to the GO analysis, 2,009 unigenes were assigned to the BP category, 1,789 to the MF category and 1,119 to the CC category. The small discrepancy between total annotation and the number of unigenes is expected for transcriptome assemblies, since contigs may contain multiple gene regions that are assigned to different categories. Within the BP category, genes involved in cellular, metabolic and biological regulation comprised the largest portion ( Fig. 3), while in the CC category the greatest number of genes were found to encode cellular components and cell parts (Fig. 3). Likewise, many sequences in the MF category encode proteins with binding and enzymatic activity (Fig. 3). We recovered a total of 135 annotations for g-STR-containing sequences, and a local BLAST search of g-STRs in our EST-STR data base yielded 332 significant hits. After removing these sequences, we could retain a final set of 394 actual g-STRs.

Discussion
Developments in high-throughput sequencing (HTS) technologies have afforded us with the opportunity to obtain genomic and transcriptomic sequences suitable for isolating vastly larger sets of STRs distributed across genomes compared to previous enrichment methods involving oligonucleotide hybridisation and cloning 48 . We implemented a STR discovery pipeline suitable for detecting STRs from either assembled (contigs or scaffolds) or non-assembled sequences. From assembled sequences STRs are extracted along with their flanking regions for primer design. Non-assembled sequences are first filtered to remove adaptors and eliminate short reads (<80 bp in our case) prior to identifying reads with STRs. Importantly, STRs can be detected at any given sequence length (a user-defined parameter). Although, to produce functional markers i.e. those loci that will successfully amplify in vitro, the optimum read length range is 150-500 bp for non-assembled sequences 49 . In the present study, 30 percent of the STR-containing non-assembled ddRAD sequences were suitable for primer design while 5 percent of the STR-containing transcriptome assembly sequences were suitable for primer design, indicating that the length and sequence properties of the STR-flanking region is also an important parameter. In all, we report on the identification of a total of some 232,000 genome-wide STR loci in C. lumpus based on an in silico STR development approach. As expected, ca. 98 percent of the genome-wide STRs were isolated from the transcriptome assembly compared to the data generated from reduced-genome sequencing. Nevertheless, the combinatorial use of genomic and transcriptomic sequences allowed us to isolate and design primers for STRs located in both coding and non-coding regions of the C. lumpus genome.
In C. lumpus the most common STR lengths were five, six and 10. We note that dinucleotide repeats motifs were dominant in the genome-derived STRs accounting for 89 percent, which is similar to previous studies on other fish 50,51 . Dinucleotide repeats were overrepresented also in EST-derived STRs, where they accounted for 52 percent, indicating that dinucleotide repeats are the dominant motif in the genome of C. lumpus. This trend is in accordance with EST-STR distributions described earlier in several fish species, including the channel catfish Ictalurus punctatus Rafinesque, 1818 ( 54 Linnaeus, 1758 (44 percent) 54 . Dinucleotide repeat motifs in C. lumpus were www.nature.com/scientificreports www.nature.com/scientificreports/ predominantly composed of AC/GT, which has been found to be the case also in all the aforementioned fishes except the killifish, in which the most common motif was AT/TA. The proportion of trinucleotide motifs in C. lumpus likewise resembles findings from other fishes, however, the nucleotide composition appears to differ since the most abundant trinucleotide motif in C. lumpus is AGG/CCT, while in catfish the two most abundant types are ATA and TTA, and in zebrafish, killifish and crucian carp the AAT/TTA motif dominates 52,53 . These differences suggest that the predominant repeat motif in fish is by no means consistent across species and taxa.
Functional annotation of the STR-containing EST unigenes revealed that the majority of these encoded for proteins involved in protein-binding and catalytic reactions. This is consistent with earlier studies providing compelling evidence that STRs, especially the AC repeat motif, play an important role in protein-binding and transcriptional activity [55][56][57] . The recently assembled transcriptome of C. lumpus by Eggestøl and co-workers 47 was targeted at identifying and mapping the components of the immune system involved in early immune responses of leukocytes following in vitro exposure to the pathogenic bacterium Vibrio anguillarum O1. As such, we anticipated to also uncover STRs linked to genes involved in immunity. Interestingly, an investigation into the cleaning behaviour (sea lice grazing efficacy) and disease resistance in several families of C. lumpus showed significant difference among families 6 . Inquiries into the genetic basis of grazing efficacy and disease resistance traits have been hampered by the limited available genomic resources for C. lumpus. In the present study, we also discovered STRs linked to genes involved in immune system process, growth, locomotion and reproduction to aid in such endeavours. We also noticed that genome-derived STRs to some degree overlapped with transcriptome-derived STRs, both in our g-STR dataset and in that of Skirnisdottir et al. 13 , indicating that genome-derived STRs can also include Type 1 STR loci. To this end, our bioinformatic framework allows for distinguishing Type I and -II STR loci in genome-derived STR databases and could be useful for developing both STR types in other species.
Khimoun et al. 27 assessed whether the patterns and levels of genetic diversity within and between bird populations are similar for EST-and g-STRs, and investigated how the levels of differentiation influence the relative efficiency of the respective marker types. They found that when there is strong genetic differentiation, inferred population-genetic structures were similar for both marker types, but that g-STRs slightly outperformed EST-STRs when differentiation was moderate. On the contrary, the study then provides compelling evidence that EST-STRs have a higher resolution in detecting weak population genetic structure compared to g-STRs. This pattern is consistent with earlier studies in plants 26,58 and, more recently, in sharks 28 . The previous studies also show that when using EST-and g-STRs as a single panel, this can result in the underestimation of the degree of population structure, especially when genetic structuring is weak 27,28 . Functional annotation of STR-containing sequences during marker development is therefore crucial, so that the actual distribution of anonymous g-STRs and EST-STRs can be accurately quantified and then properly applied in downstream genetic analyses.
Bioinformatics workflows for extracting STR locus-specific sequences from HTS dataset are gradually becoming available to facilitate genotyping-by-sequencing (GBS) of STRs 36,39,41 . Barbian and co-workers 36 compared the performance of capillary electrophoresis and HTS to validate and improve the STR-GBS approach. In that study, it was shown that the GBS approach identified new alleles based on sequence differences that were previously masked by size homoplasy. The large STR primer base reported in the present study for C. lumpus used the revised primer design parameters of Meglécz et al. 49 which were empirically validated to improve genotypic success rates. Although the present study did not involve the in vitro validation of the primers to determine amplification efficiency and the level of polymorphism (i.e., number of alleles) at each of the loci, Meglécz et al. 49 reported that the target region complexity had no effect on the polymorphism of STRs, and that the levels of polymorphism increased from di-to tetra-nucleotide repeat motifs in their two focal species. Furthermore, as HTS data generated from a larger panel of individuals becomes available for C. lumpus, it will be possible to possible to perform electronic PCR (e-PCR 59 ) and extract locus-specific genotypes without the need to perform in vitro experiments. For instance, the specificity of e-PCR (in silico) amplification (compared to BLAST) previously enabled breeders to identify the map positions of STRs in rice Oryza sativa L 60 . and potato Solanum tuberosum L. 61 , and anchor the STR loci within linkage groups without the need to perform additional PCR reactions. Moreover, e-PCR was used successfully to align the STR loci on the linkage map of Brassica napus L. to the genome of B. rapa L. and B. oleracea L. to identify candidate genes of QTLs for seed weight through comparative mapping of these Brassica species to Arabidopsis thaliana L. 62 . Therefore, STRs provide significant utility in that they provide a source of genetic variation that has a higher mutation rate and transferability success rate across populations and species than for SNPs 31,63 .
The C. lumpus STR database reported in the present study provides valuable molecular markers to the scientific community for a myriad of downstream applications, such as linkage mapping, parentage analysis, marker-assisted selection (MAS), quantitative trait loci (QTL) analysis, and assessing adaptive genetic diversity in this commercially valuable fish. Moreover, the large STR primer sets reported here readily allows for exploring the cost-efficient HTS-based STR genotyping-by-sequencing approach in C. lumpus or even data generated from HTS-based SNP genotyping experiments.

Methods
Sampling and DnA extraction. We obtained finclip samples of two C. lumpus individuals, of which one originated from southern Norway (Mandal, N 57.99 E 7.48) and one from northern Norway (Hekkingen, N 69.37 E 17.48). The finclip samples used in our study were from dead specimens caught during normal fishing activity and, therefore, no approval from the local ethics committee was necessary. We stored the samples at 4 degrees Celsius in absolute ethanol. To extract total genomic DNA from the samples, we used the DNeasy Blood and Tissue Kit following the manufacturer's instructions (Qiagen). ddRAD library preparation and nGS data processing. DNA extracts were quantified with the Qubit Broad Range dsDNA Assay (Thermo Fisher Scientific), and then diluted to standardised working concentrations in nuclease-free water. Library preparation was then performed following the modified ddRAD digestion-ligation www.nature.com/scientificreports www.nature.com/scientificreports/ protocol of Vivian-Smith and Sønstebø 43 . We then performed a combined double-digest and ligation reaction for each individual using 200 ng of DNA. This reaction was composed of the high fidelity restriction endonucleases PstI-HF and NdeI-HF (specific for CTGCA|G and CA|TATG restriction motifs, respectively; New England Biolabs), together with ligation reaction using the modified adapters P1 and A in a total volume of 60 micro-liters, using the NEB 4 buffer as previously described 43 . Next, we pooled the barcoded libraries and purified them with Agencourt AMPure XP Beads (Cat. No. A63881; Beckman Coulter). The pooled library was then resolved on a pre-cast 1.5 percent agarose gel in a Pippin Prep automated electrophoretic system (Sage Science) set to recover 400-600 bp fragments. Subsequently, we size selected for two fragment-length ranges and constructed two libraries transcriptome sequence retrieval, pre-processing and assembly. We obtained a recently assembled transcriptome of C. lumpus (accession number E-MTAB-6388, Eggestøl et al. 47 ) from the ArrayExpress Archive of Functional Genomics Data (https://www.ebi.ac.uk/arrayexpress/). In that study, the kidney leukocytes were used to generate RNAseq data for the de novo transcriptome assembly. In addition, we downloaded raw RNAseq Illumina paired-end reads of C. lumpus (accession number SRX3153215) from the NCBI Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra) under BioProject PRJNA398732, the Fish-T1K (Transcriptomes of 1000 fishes) Phylogeny Project (Beijing Genome Institute). We submitted the raw sequence reads to a quality control (QC) step in FASTQC as implemented in the BLAST2GO program [64][65][66] . Next, we conducted a de novo transcriptome assembly using the quality-filter sequence data generated from the Fish-T1K project, employing the TRINITY pipeline of Grabherr et al. 67 with the option for read trimming by quality during assembly using TRIMMOMATIC 68 also implemented in BLAST2GO. Then, we filtered out known contaminants (Vibrio and IPNV), mitochondrial DNA and ribosomal DNA from the assembly using BLAST v. 2.7.1+ 69 . Subsequently, we performed further clustering and alignment of each respective transcriptome assembly to form transcript assemblies (unigenes) using the CAP3 program 70 with parameters -p 95, -o 49, and -t 10 000. Finally, we merged the two transcript assemblies (contigs and singletons) for STR discovery (Fig. 4).

Microsatellite mining and primer design.
To detect and extract STR-containing sequences from the quality-filtered and trimmed ddRADseq dataset, we used the QDD-VM v. 3.2.1 pipeline for low-coverage NGS data 49,71 (Fig. 4). First, we used the perl script QDD pipe1.pl to convert the input fastq file to fasta and to extract the STR-containing reads with di-to hexanucleotide motifs in both pure (perfect) and compound (imperfect) form, and longer than 80 bp. Second, we used the QDD pipe2.pl script to compare STR-containing reads of each individual using BLAST+, and the reads with very high sequence identity (>95 percent) were grouped into contigs and then used to create a consensus file with CLUSTALW v. 2.0 72 . To construct the consensus sequence for a given locus, we required greater than 0.66 of the sequences to have the same base at a particular site. This step allowed for identifying polymorphic STR loci i.e., in silico characterisation. Third, we executed the QDD pipe3.pl script to automatically design primers from unique (singletons and consensus) STR-containing sequences using the Primer3 algorithm 73 implemented within QDD-VM. We used the default parameters for designing primers that Meglécz et al. 49 empirically determined to improve genotyping success rate of STRs and to force the design of primer pairs with variable amplicon size in QDD-VM. The optimised parameters were as follows: product size 90-320 bp; primer size = 18-20-27 bp (min-optimal-max); melting temperature (Tm) = 57-60-63 degrees Celsius (min-optimal-max); GC content = 20-50-80 percent; maximum Tm difference = 10 degrees Celsius. The design of multiple primer pairs with different amplicon size per locus facilitates in silico selection of primer pairs for the design of multiplex PCR during the wet laboratory validation experiments. Last, we executed the QDD pipe4.pl script to check for contamination by a BLASTn (query nucleotide against nucleotide database) search (E-value cut-off < 10-20) of all STR-containing sequences with successful primer design against the NCBI database, as well as to compare these sequences to known transposable elements of vertebrates using RepeatMaster v. 4.0.7 (available from http://www.repeatmasker.org/).
To search for EST-STRs we used the QDD-VM pipeline for assembled contigs, which is the same as the above-mentioned pipeline, except for the following parameters: pipe1.pl and pipe3.pl, -contig was set to 1 to extract STRs with 200 bp flanking regions on both side in the assembled unigenes, pipe2.pl, -makecons was set to 0 to avoid paralogs.
Additionally, using the unique STR-containing sequences generated by the QDD pipe2.pl script, we executed the perl script misa.pl (MIcroSAtellite identifcation tool; available from http://pgrc.ipk-gatersleben.de/misa/) of Thiel et al. 74 to obtain further summary statistics on the identified STRs, which include (i) the distribution of STRs to different repeat unit classes, (ii) the distribution of STRs to different repeat motif length classes, and (iii) the frequency of STR motifs. We defined each repeat motif class of STRs using the MISA specification file (misa. ini) with the following parameters: minimum repeat sequence of 10 nucleotides for mononucleotide repeats, and at least five consecutive repeat units for di-, tri-, tetra-, penta-and hexanucleotide motifs. We used the default parameter of less than equal to 100 bp minimum distance between two repetitive units to identify and classify compound repeats. We then estimated the relative abundance of STRs in the transcriptome of C. lumpus by dividing the number of STRs found with the total assembly length length (loci/Mb). functional annotation of contigs containing microsatellites. To classify the putative function of the sequences containing STRs for which primers could be designed, we subjected the EST-derived sequences to a BLASTx (translated query nucleotide against protein database) search with a threshold E-value 10-5 as implemented in BLAST2GO. The three gene ontology (GO) descriptors for functional characterisation of STR-containing sequences were biological process (BP), cellular component (CC), and molecular function (MF). Finally, we executed the mapping function to extract the GO descriptors associated with each of the obtained BLAST hits in BLAST2GO using the annotation cut-off value of 10 −6 . Since our goal was also to provide Type II STRs, we used the same protocol to annotate all g-STR-containing sequences for which primers could be designed, in order to filter out loci associated with coding regions.