Discovery of novel genic-SSR markers from transcriptome dataset of an important non-human primate, Macaca fascicularis

Macaca fascicularis, also known as the cynomolgus macaque, is an important non-human primate animal model used in biomedical research. It is an Old-World primate widely distributed in Southeast Asia and is one of the most abundant macaque species in Malaysia. However, the genetic structure of wild cynomolgus macaque populations in Malaysia has not been thoroughly elucidated. In this study, we developed genic-simple sequence repeat (genic-SSR) markers from an in-house transcriptome dataset generated from the Malaysian cynomolgus macaque via RNA sequencing, and applied these markers on 26 cynomolgus macaque individuals. A collection of 14,751 genic-SSRs were identified, where 13,709 were perfect SSRs. Dinucleotide repeats were the most common repeat motifs with a frequency of 65.05%, followed by trinucleotide repeats (20.55%). Subsequently, we designed 300 pairs of primers based on perfect di- and trinucleotide SSRs, in which 105 SSRs were associated with functional genes. A subset of 30 SSR markers were randomly selected and validated, yielding 19 polymorphic markers with an average polymorphism information content value of 0.431. The development of genic-SSR markers in this study is indeed timely to provide useful markers for functional and population genetic studies of the cynomolgus macaque and other related non-human primate species.

higher degree of transferability across related species, and lower occurrence of null alleles 20,21 . Despite its lower polymorphism level compared to genomic SSR, genic-SSR has been used successfully in population genetic and evolutionary studies in many species [20][21][22][23] In recent years, advancement in sequencing technologies has made whole-genome or transcriptome sequencing of both model and non-model organisms feasible. The massive amount of transcriptome data obtained via RNA sequencing can be used in various applications, from gene identification to comparative functional analysis and differential gene expression. It also serves as an excellent sequence resource for marker development. Transcriptome sequencing coupled with established bioinformatic pipeline have been used effectively for high throughput identification of genic-SSR markers from various organisms [24][25][26] . Some of the tools used for SSR mining include MicroSatellite identification tool (MISA) 27 , FullSSR 28 and Genome-wide Microsatellite analysing tool package (GMATA) 29 .
There are fewer reported studies on the development of SSR markers for the cynomolgus macaque compared to Macaca mulatta, another non-human primate model. Hitherto, development of genic-SSR markers from whole transcriptome sequencing data of cynomolgus macaque has yet to be attempted. Therefore, the present study aimed to develop genic-SSR markers from an in-house transcriptome dataset of the Malaysian cynomolgus macaque generated from our previous studies 30,31 . This study is the first comprehensive report on the development of genic-SSR markers from the transcriptome of cynomolgus macaque. Here, we mined sequences containing SSRs from the transcriptome dataset, designed primers flanking pure di-and trinucleotide SSRs, and identified their associations with functional genes. Some randomly selected markers were further validated. The genic-SSR markers reported in this study are useful for population, functional genomic and comparative mapping studies of cynomolgus macaque and other related species.

Results and Discussion
De novo assembly and functional annotation. De novo assembly of the transcriptome data generated a total of 597,457 contigs with an average contig length of 400 bp; minimum and maximum contig lengths of 178 bp and 21,411 bp, respectively. Of the total contigs generated, 356,560 (~60%) of the contigs had an average coverage of more than 10 reads, and annotation of these contigs revealed 73,880 (~21%) contigs associated with functional genes. Out of the 73,880 annotated contigs, 67,399 contigs matched to M. fascicularis (GCF 000364345.1) RNA sequences. Subsequent protein sequence similarity searched against M. mulatta (GCF 000772875.2), Homo sapiens (GRCH38) and SwissProt databases, further annotated 1,461, 742 and 4,278 contigs respectively.
Identification and classification of genic-SSRs. We identified a total of 14,751 genic-SSRs in this study, reflecting the effectiveness of SSR mining from the transcriptome dataset. Of the total identified genic-SSRs, 13,709 (92.94%) were perfect repeats; while complex and compound repeats constituted the remaining 7.07% (Fig. 1). Among the perfect SSRs, dinucleotide repeats were the most abundant (8,918; 65.05%), followed by tri-(2,817; 20.55%), tetra-(1,062; 7.75%), penta-(767; 5.59%) and hexa-(145; 1.06%) nucleotide repeats. Di-and trinucleotide repeats constituted the largest groups of repeat motifs in our dataset, concurring with the results reported in other animal species such as human 32 , chicken 33 and fish 34 . www.nature.com/scientificreports www.nature.com/scientificreports/ Among the dinucleotide repeats, AC/GT (64.03%) accounted for the highest proportion, while CG/CG repeats were the lowest in proportion (0.29%). Amongst the ten types of trinucleotide repeats identified, AAC/ GTT repeats were the most abundant (23.86%), and ACG/CGT repeats were the least common (~0.1%). The distributions of di-and trinucleotide SSRs according to motif are shown in Fig. 2. As for tetra-, penta-and hexanucleotide repeats, the most common motifs were AAAC/GTTT (8.3%), AAAAC/GTTTT (15.5%) and AAAAAC/ GTTTTT (8.3%) respectively. Analysis of SSR densities in the human genome revealed that dinucleotide (AC/GT and AT/AT) and trinucleotide (AAC/GTT, AAT/ATT, AAG/CTT and AGG/CCT) repeats were the most common in humans 32 . AC/GT repeats were also reported to be the most common dinucleotide repeat in other organisms, including fish 35 and sheep 36 . CG-rich SSR motifs are very rare in the transcriptome of the M. fascicularis, occurring less than 1%, which corroborated the results reported in the genomes of humans 32 and other primate species 37 . CG/CG dinucleotide repeats are significantly low in vertebrates due to the methylation of cytosine, which favours the deamination of cytosine to thymidine 38 . Functional annotation of ssR loci, primer development and screening. Out of the 300 SSR loci used for primer design in this study, 105 loci were associated with genes involved in specific biological processes, cellular component and/or molecular functions. The complete list of these 300 SSRs and their respective predicted functions is provided in Supplementary Table S1. From the 30 SSR markers tested, 20 markers (66.67%) produced clear amplicons of expected sizes across all samples reproducibly. Nineteen out of these 20 markers (Table 1) were  www.nature.com/scientificreports www.nature.com/scientificreports/ polymorphic, demonstrating that more than 60% (19 out of 30) of the markers screened in this study were polymorphic. Alignment of the sequences obtained from the PCR amplicons with the contig sequences used to design the primers also verified successful amplification of the targeted DNA regions. Data analysis. Genetic diversity assessment was performed based on 19 polymorphic markers (  12,13 , where genomic SSR markers were used, the genic-SSR markers used in this study generated lower N A , H O , H E and PIC values for the same species. As the SSR markers developed in our study were generated from transcriptome, it was anticipated that the genetic diversity of these markers would be lower than those of SSR markers derived from genomic DNA regions 40 . The lower values could also be contributed by the lower number of samples (n = 26) and sampling sites in the current work compared to those studies 12,13 . Nonetheless, the identification of 19 polymorphic SSRs out of 30 markers screened based on 26 individual samples is promising. We are confident that higher N A , H O , H E and PIC values would be obtained with more samples.
The average PIC value of 0.431 for the 19 polymorphic loci validated in this study was comparable to those genic-SSRs developed for the Korean quail (mean PIC value = 0.494) 26 and crab (mean PIC value = 0.49) 24 .
Although not all the 19 genic-SSR markers showed high polymorphism and PIC values, all showed the reproducibility and specificity highly desired in genotyping by PCR.
There were very few reported studies on the development of SSR markers for cynomolgus macaque. The first study conducted to develop SSR markers for the cynomolgus macaque was reported in 2007 by Kikuchi et al. 41 . In their work, they crossed-amplified 148 SSR markers selected from human genome database, and discovered 66 (44%) polymorphic SSR markers in the cynomolgus macaque. Later, Higashino et al. 42 identified an additional 499 polymorphic SSR markers from the BAC library of M. fascicularis. They analysed the genetic polymorphisms of cynomolgus macaques originated from Indonesia, the Philippines and Malaysia using these SSR markers. In both studies, the SSR markers employed were derived from genomic DNA regions. The polymorphic genic-SSR markers identified in this study is a good addition to complement existing SSR markers to provide more markers for the investigation of the genetic structure of wild macaque populations. De novo assembly and functional annotation. A transcriptome dataset was generated from a previous RNA sequencing project of the M. fascicularis on liver, kidney, lymph node, spleen and thymus 30,31 . We subjected the raw sequencing reads to quality assessment using FASTQC v0.11.2. Illumina co-sequencing positive control (PhiX) sequences were filtered and cleaned sequence reads were subjected to base quality checking (Q ≥ 30). De novo sequence assembly was performed using CLC Genomics Workbench version 8.5.1 (CLC Bio-Qiagen, Aarhus, Denmark). We subjected the assembled contigs (average coverage ≥ 10 reads) to annotation by sequence similarity searches with BLAST+ version 2.2.31+ 43    www.nature.com/scientificreports www.nature.com/scientificreports/ Identification and classification of genic-SSRs. Genic-SSR identification and classification were performed on the filtered contigs (average coverage ≥ 10 reads) using MIcroSAtellite identification tool (MISA) 44 . The minimum number of repeats for di-, tri-, tetra-, penta-, and hexanucleotides were set at six, five, five, four, and four, respectively. Categorization of perfect, compound and complex SSRs were as follows. Perfect: consisting of a single repeat of n units; compound perfect: consisting of two or more alternate tandem repeats of n units each; complex: consisted of repeats that varied in motifs by a single unit/consisted of alternate repeat motifs interspersed within a single region/consisted of two simple perfect motifs separated by nonrepeating sequences of variable length. primer design. Contig sequences containing SSRs identified from the transcriptome dataset were employed for primer development using Primer3 software 45 . We focused on candidate SSR sequences of perfect di-and trinucleotides with repeat numbers ≥10 and with only one SSR presents in each contig for primer design. All contig sequences used for primer design were checked against genomic sequences to predict the location of introns. Three-hundred SSR primer pairs were designed. All the contigs used for SSR primers design were checked for functional annotation where a cut-off value of E < 1e −15 was used.

Materials and
Sampling, DNA extraction, PCR amplification, and electrophoresis. Thirty genic-SSR primer pairs selected randomly from the 300 pairs designed were used for initial screening on the DNA samples of 26 M. fascicularis individuals. Primers were selected randomly among those that have self-and cross-primer complimentary values of less than 3, low tendency to form secondary structures and 3′-complimentary value of less than 3. To test the robustness of the markers, samples were obtained from nine states in Peninsular Malaysia (Table 2) with the permission and collaboration of DWNP. Three samples from each state were obtained except Terengganu (2 samples). Genomic DNA samples of 24 M. fascicularis individuals were provided in the form of extracted DNA. Two DNA samples were isolated from liver tissue samples provided by DWNP using QIAamp DNA mini kit (Qiagen, Germany) according to the manufacturer's protocol.
PCR was performed in 10 µl reaction volumes containing 10 ng of genomic DNA, 2.0 µM of each primer, 1× PCR buffer, 2.5 mM MgCl 2 , 0.2 mM dNTPs, and 1 U Taq polymerase (Promega, USA), in a thermal cycler T100 (Bio-Rad, USA). Gradient PCR protocol under the following conditions was employed: a single cycle of initial denaturation at 95 °C for 5 minutes, followed by 35 cycles of denaturation at 95 °C for 1 minute, annealing at X °C for 30 seconds, extension at 72 °C for 5 minutes, and ended with a single cycle of final extension at 72 °C for 5 minutes. X denotes the different annealing temperatures (T a ) used for different primers ( Table 3). Primers that were not successfully amplified or produced multiple bands were further tested using touchdown PCR with 1 °C decrements starting from 60 °C. PCR products were separated on a 2.0% agarose gel and 8.0% non-denaturing polyacrylamide gel stained with EtBr (0.5 µg/ml). To further confirm the presence of targeted SSRs in the amplified products, PCR products with the expected fragment sizes were sequenced on an ABI 3730 through services provided by First Base Sdn. Bhd. (Seri Kembangan, Malaysia). Sequences obtained were compared with the contig sequences that the primers were designed from and the targeted SSR repeats were also identified.
Genetic diversity analysis. The 26 individual macaque samples from the nine states in Peninsular Malaysia were arbitrarily divided into two populations, the East Coast and West Coast populations, taking into consideration the Titiwangsa Range as a potential geographical barrier for gene flow between populations on both coasts. The East Coast population comprised of eight individuals from three states (Kelantan, Terengganu and Pahang), while the West Coast population consisted of 18 individuals from six states (Perlis, Kedah, Pulau Pinang, Perak, Selangor and Negeri Sembilan). SSR banding patterns were analyzed with PopGene version 1.32 46

Data Availability
The transcriptome data have been deposited in the NCBI Short Read Archive database under accession SRP096937 and SRX2499144-SRX2499147.