SSR individual identification system construction and population genetics analysis for Chamaecyparis formosensis

Chamaecyparis formosensis is an endemic species of Taiwan, threatened from intensive use and illegal felling. An individual identification system for C. formosensis is required to provide scientific evidence for court use and deter illegal felling. In this study, 36 polymorphic simple sequence repeat markers were developed. By applying up to 28 non-linked of the developed markers, it is calculated that the cumulative random probability of identity (CPI) is as low as 1.652 × 10–12, and the identifiable population size is up to 60 million, which is greater than the known C. formosensis population size in Taiwan. Biogeographical analysis data show that C. formosensis from four geographic areas belong to the same genetic population, which can be further divided into three clusters: SY (Eastern Taiwan), HV and GW (Northwestern Taiwan), and MM (Southwestern Taiwan). The developed system was applied to assess the provenance of samples with 88.44% accuracy rate and therefore can serve as a prescreening tool to reduce the range required for comparison. The system developed in this study is a potential crime-fighting tool against illegal felling.

Illegal logging is a severe problem in many timber-producing countries. Unplanned felling results in forest degradation, affects forest ecosystems and promotes the spread of pests and pathogens. In addition, large-scale deforestation causes forests to lose their soil and water conservation functions, thus leading to water shortages in the dry season and floods in the rainy season. These disasters brought by illegal or unplanned felling cause significant damages to the country.
Incidents of theft of valuable timber continue to occur, yet the crime investigations are complicated due to the lack of effective measures to present court evidence for conviction. To obtain the molecular evidence linking seized timber to illegally felled stumps, it is necessary to develop an individual identification system for tree species of high economic value, which are often the target of illegal felling.
Chamaecyparis formosensis Masam., also known as False Cypress, is endemic tree species in Taiwan distributed majorly in the cloud forest, a zonal forest type in the mid elevation (1700-2600 m) with extremely high biodiversity 1 . Gigantic C. formosensis is known for its superb timber quality. Due to its high quality and market value (4050 USD/m 3 , woodprice.forest.gov.tw), C. formosensis is critically threatened by illegal felling. Unplanned felling and poor management compromised the ecosystem and endangered these endemic species. Similar scenarios also happened to Dalbergia spp. (Leguminosae: Papilionoideae) 2,3 , Fraxinus excelsior (Oleaceae) 4 , Swietenia macrophylla (Meliaceae) 5 , and Intsia palembanica (Fabaceae) 6 . Moreover, timber production countries suffer from illegal felling, particularly in South-east-Asian, African and South American countries 7

Result and discussion
Development of new SSR markers for C. formosensis. For a higher accuracy of court's judgment on illegal felling, it is necessary to establish a complete forensic system. Although some SSR markers of cypress have been published 17,[25][26][27] , the reported detection rates for dried timber were only 20-40% 8,9 . Therefore, it is necessary to develop more SSR markers as a contingency plan. When the sample is in a poor condition, more markers can be applied in order to achieve the threshold of combined power of discrimination (CPD) required for successful comparison between seized timbers and victim trees. In order to maximize potential loci, Next generation sequence (NGS) methods were used. In this study 3 DNA library were constructed. We used the Illumina MiSeq platform (2 × 301 bp; Illumina, San Diego, California, USA) to sequence the DNA libraries ( Fig. 1. and Supplementary 1.).
A total of 70,325,072 raw reads were produced. The raw reads were deposited in the NCBI BioProject (PRJNA454510). After quality-trimming to the raw reads with CLC Genomics Workbench version 10 (QIA-GENE, Aarhus, Denmark), 70,319,509 contigs were generated with the length between 133 and 146 bp on average. De novo assembly was conducted with the following parameters: contig number 208,467, minimum length of contigs 18 28 . We newly designed 100 candidate SSR primer pairs for testing in C. formosensis by BatchPrimer3 29 .
There are 9 validated SSR markers that are polymorphic (success rate 9.00%) were registered in GenBank in NCBI (Table 1) and passed for cross-species tests (Supplementary 2 and 3).
Unlike the traditional SSR cloning method, with next-generation sequencing technology, it is easy to obtain a significant amount of SSR containing sequences from sequenced genomes 30 . However, transforming candidate SSR primer pairs into validated SSR markers is still a time-consuming and expensive step. Qualified SSR markers need to succeed in PCR amplification, have good peak pattern quality with minor stuttering, and be free of non-amplifying (invalid) alleles. The turnover rate from candidate SSR primer pairs to validated SSR markers varies from species to species 30,31 . The success rate in Chamaecyparis plants is between 5.24% and 9.27% 8,17,25,26,32 . Developing C. formosensis individual identification system. In this study, newly developed 9 validated SSR markers and other 27 validated SSR markers 17 polymorphic SSR markers were analyzed against 92 individuals from 4 geographic areas (MM, HV, GW, SY, Fig. 2 and Supplementary 1). The results of developed 36 SSR markers are summarized in Table 2. Among the 92 individuals in this study, each number of alleles of SSR is between 2 and 27, with an average of 7.916. The levels of observed heterozygosity (Ho) 33 are from 0.000 to 0.891,  Ho is the actual proportion of heterozygous individuals in each locus within the population, whereas the He is the expected value estimated per HWE. Ho and He are among the most widely used parameters in estimating genetic diversity in a population. The population structural and even historical information can be obtained from Ho and He. When Ho = He, it means that the population is random mating. When Ho < He, it means that the population is inbreeding. When Ho > He, it means that the population is outcrossing 33 . Most of these loci (36 tested) are Ho < He (except Cred211, Cred220, Cred225, Cred248, Cred276, Cred281, Cred297, Cred298), suggesting the population of C. formosensis has a low genetic divergence and is an inbred strain. HWE describes that under ideal conditions, there exists no mutations, no natural selection, no individuals moving in or out, the population is infinitely large, and random mating within the population. Therefore, gene frequency does not change over time or generation. However, there will always be one or more interfering factors (e.g. genetic drift, natural selection, mutation, gene flow, population bottleneck, founder effect, and inbreeding.) affecting gene frequency in nature 35 . Therefore, HWE is difficult to achieve in nature. In this study, 23 loci out of 36 markers deviated from HWE (63.89% deviation rate). The reason for this deviation could be artificial selection, nonpanmixia or genetic drift. Polymorphism information content, or power of information content (PIC), is an index of the relative ability of the SSR marker's genetic variability. The higher the polymorphism of marker's genotype, the higher the PIC value 36 . Polymorphic markers were highly informative (PIC > 0.50), reasonably informative (0.50 > PIC > 0.25), and slightly informative (PIC < 0.25). Power of discrimination (PD) 37 refers to the ability of genetic markers to distinguish individuals within a population. Obviously, in a population with more allele types and evenly distributed genotypes, the low probability of two random individuals having the same genotype, and the system can identify the greater probability of two random individuals. Probability of identity (P I ) 38 is the probability of two individuals with the same genotype. PD = 1-P I . The value of PIC, PD and P I of individual markers reflects its identification ability in the individual identification system. The greater PIC and PD, the lower P I in value, suggesting the higher identification ability of the marker, and vice versa. The levels of PIC range from 0.097 to   (Table 2). When identifying several independent polymorphic genetic markers simultaneously (polymorphic markers located in different linkage groups), the combined probability of identity (CP I ) is the product of the P I of each genetic marker. At this time, CP I will be greatly reduced, and the combined power of discrimination (CPD) will become very high. As defined above, CP I + CPD = 1. The credibility of the individual identification system is calculated based on "Random match probability in population size and confidence levels'" published by Budowle et al. 40 . Confidence levels (CL) = (1 -CP I ) N , where N is number of individuals. The individual identification system was applied to illegal felling cases 8 . When the seized timber and the victim tree are identified as the same particular plant, under the considerations of fairness and objective, the court usually adopts 99.99%, 99% or 95% confidence level as the credibility standard 41 , ISO ISO/IEC 17,025). In this study, the locus with the lowest P I within a linkage group was used to calculate the CP I ( Table 3). The CP I decreased along with accumulation of loci and the P I of each locus were sorted in ascending order. The system can accumulate up to 28 loci without linkage. When reaching its maximum capability, even under most strict standard (confidence level 99.99%) dictated by court, this system can be used to identify 60 million C. formosensis, with CP I as low as 1.652 × 10 -12 , CPD as high as 0.999999999998348 (almost equal to 1), beyond the known www.nature.com/scientificreports/  42 . Under ideal conditions, a minimum of 6 loci can be applied to the system, with an identifiable C. formosensis population of 2,900 under 95% confidence level. The CP I is as low as 1.728 × 10 -5 , and CPD is as high as 0.999982712603209 (Table 3).
One of the problems with SSR marker is the appearance of null alleles. One possible cause of SSR null alleles is poor primer annealing caused by the nucleotide sequence divergence of the flanking primer on one or both sides (for example, point mutation or indel in the primer sequence) 43 . In addition, due to the competitive nature of PCR, smaller alleles usually have a higher amplification efficiency than larger alleles. Therefore, only the smaller of the two alleles can be detected from heterozygous individuals. The null alleles caused by differential amplification can usually be seen by loading more samples or adjusting the contrast 44 . The third cause of null alleles may be due to inconsistent quality or the low quantity of DNA templates. Some loci are relatively easy to amplify, yet others cannot be amplified within the same DNA preparation 45 . When a null allele is present, the www.nature.com/scientificreports/ observed genotype represents one of the several possible true genotypes 46 . SSR markers inevitably produce null alleles, and each SSR marker has a different background for null alleles. Dakin et al. (2004) 44 reviewed 233 publications by examining how authors detect and deal with null alleles and the methods used to estimate the frequency of null alleles across articles. The authors demonstrate that the frequency of simulated null alleles is usually overestimated, which will lead to underestimating the usability of this marker. It was misunderstood that the existence of null alleles will reduce the availability of paternity testing, individual identification, and population genetic research. However, it has been demonstrated that null alleles do not change the overall result on assignment testing 43,44 . Compared with the presence of null alleles, increasing the number of loci and the degree of genetic differentiation has a more significant impact on the accuracy of assignment testing. This argument is valuable for studying SSR markers and populations prone to invalid alleles, as it allows researchers to use loci affected by invalid alleles 43,44 .
In Huang et al. 8 where C. taiwanensis individual identification system was applied to an illegal felling conviction case, CPD calculations exclude any markers that show homozygous PCR results per ISO/IEC 17,025. The CPDs are calculated only from the possibilities of the markers found in timber and tree samples simultaneously. Null alleles and PCR fail will only reduce the identification rate but will not cause seized timber and victim tree from different individuals to be identified as the same source. However, this is not to say that efforts should not be made to use loci that display low-frequency null alleles. On the contrary, markers that are less prone to invalid alleles should always be preferred because they are less ambiguous and are more potent in assignment testing. However, before many individual identification markers are developed and optimised, the impact of null alleles should not be overemphasized, as it reduces the usability of markers 43,44 . Population genetics analysis. F is , by definition F is = 1-Ho/He, is the inbreeding coefficient of an individual concerning the local subpopulation. When Ho < He, then F is > 0, indicating that the population is an inbreeding. The 36 polymorphic SSRs were used to evaluate the genetic diversity parameters of the four groups (MM, HV, GW, SY) ( Table 4). The number of alleles (A) for each locus is 4.417 and 5.444. Ho and He are ranged from 0.376 to 0.506 and from 0.474 to 0.583, respectively. All the groups show positive inbreeding coefficients, suggesting these four groups are inbreeding lines.
The fixation index (F st ) estimates population differentiation due to genetic structure 47 . A higher F st value means a higher degree of difference between populations. When F st is less than 0.05, there is no differentiation among populations. When F st is between 0.05 and 0.15, there is low differentiation among populations. On the other hand, the estimation of the number of migrants (Nm) is gene flow value 47 . If Nm is more than one, genes frequently exchange, which counteracts the genetic drift and prevents the population differentiation 48 . If Nm is greater than four, the population is a random mating 49 . The analyses of F st and Nm of the four geographic areas were conducted by GeneAlex 6.503 50 ( Table 5). The F st value between HV and GW was 0.035, suggesting no population differentiation in these two populations. The highest F st value (0.074) was found between HV and MM. The F st values ranged from 0.056 to 0.065 were found between the rest geographic areas, indicating a low differentiation in these geographic areas. The highest Nm value (6.832) was found between HV and GW, whereas the lowest value (3.141) was between HV and MM. The Nm values of four geographic areas were greater than 1 (between 3.141 and 6.832), suggesting a frequent gene exchange between the four geographic regions, which offsets genetic drift and prevents population differentiation. For GW/MM (Nm = 4.022), GW/HV (Nm = 6.382), the Nm values of the population are more significant than four, suggesting that these populations are random mating.
STRU CTU RE analysis 51,52 was used to analyze the population genetic structure of C. formosensis (Fig. 3), and the Delta K value was calculated to obtain the optimal number of clusters. K and Delta K are shown in Fig. 3a.
The individuals of C. formosensis were most likely to be three clusters (Fig. 3b): the SY located in Eastern Taiwan is an independent cluster, the MM located in Southwestern Taiwan is another cluster, whereas the two HV and GW geographic areas are in the same genetic cluster. The results of F st (Table 5), Nm (Table 5) and STRU CTU RE analyses (Fig. 3) show that C. formosensis of the four geographical areas belongs to the same genetic population. The F st and STRU CTU RE analyses suggest that the samples fall into three clusters. The hypothesis that Taiwan Island is one of the plant refuges during the Quaternary glaciation 53,54 may help to explain the results. The study of historical biogeography and phylogeny of cypress 54 suggested that C. formosensis in Taiwan diverged from Chamaecyparis in Japan 2.9 million years ago. The arrival of the Quaternary glaciation led to species extinction and the continued retreat of species to lower latitudes 55 ; thus Taiwan Island became a refuge for many ancient species, such as Juniperus morrisonicola (Cupressaceae) 56 , Abies kawakamii (Pinaceae) 57 , Castanopsis carlesii Table 4. Detailed genetic diversity parameters were identified at 36 simple sequence repeat (SSR) in four populations (MM, HV, GW, SY) of Chamaecyparis formosensis. N: the number of individuals, A: number of alleles, Ho: observed heterozygosity, He: expected heterozygosity, F is : inbreeding coefficient. When Ho < He, F is > 0, indicating that the population is inbreeding 67 . www.nature.com/scientificreports/ (Fagaceae) 58 . After the glaciation, species spread from the refuge to the surrounding areas and created species diversity across the latitude gradient 59 . In our results, the low polymorphism of C. formosensis probably indicates that they originally derived from the same large population during the glaciation. After the glacial retreat, these four C. formosensis clusters spread out from the refuge and formed the four populations due to geographic isolation. Studies 55,60 also show that, based on molecular evidence, many plants (eg. Cunninghamia konishii, Cyclobalanopsis glauca, Trochodendron aralioides) in Taiwan island have high genetic diversity, higher than that of mainland China and Japanese archipelago. This remarkable high genetic diversity is associated with the Ice Age history in Taiwan 55,60 . The low genetic diversity of C. formosensis differs from most Taiwanese plants but is similar to another endangered plant of the genus Cypress, C. taiwanensis, in Taiwan Island (A = 6.507, Ho = 0.392, He = 0.501) 8 . Compared to C. obtusa, an endangered cypress plant in the Japanese archipelago, C. formosensis is also inbreeding (F is = 0.034), but the degree of genetic diversity (A = 23.9) is significantly lower. One possible explanation of the low genetic diversity is that a large population of C. formosensis was divided into several smaller populations after ancient glacial retreat in Taiwan, and then they were recently overexploited by humans (REF).
GENECLASS v. 2.0 61 was applied to analyse the provenance of 92 individuals independently. The probability of samples returning to the correct provenance is 95.00% (MM), 88.00% (HV), 69.57% (GW), and 100.00% (SY), with an overall mean correct rate of 88.04% (Table 6). Three HV individuals were misassigned to GW and four GW individuals were misassigned to HV, corresponding to the observation that HV and GW are the same clusters. However, three GW individuals were misassigned to MM, possibly because the geographic location of GW is between HV and MM. Therefore, GW has characteristics of north and south at the same time. Likelihood, one MM was mis-assigned to GW, further supporting the inference that there is partial gene exchange between MM and GW. Our data show that the populations in eastern (SY) and western Taiwan (the rest populations) have distinct genotype differences. Within the western populations, the northern (HV) and the southern ones (MM) have obvious differences. Therefore, when seizing timbers in the future, the genotype can be served as a prefilter to infer the geographic area of the victim tree if the provenance is found to be MM, HV or SY. A further inspection is required if the provenance is GW because of the existence of gene exchange between nearby geographic areas.

Conclusions
In this study, a C. formosensis individual identification system was built with 36 polymorphic SSR markers. When 28 non-linked SSR markers are applied, the system is capable of identifying 60 million C. formosensis individuals with a confidence level of 99.99%. The lowest CP I is 1.652 × 10 -12 , and the highest CPD is 0.999999999998348. This system can provide the scientific evidence to link seized timbers and victim trees required the illegal felling court cases and facilitate future legal sales by profiling timbers. Through population genetics analysis, the system can provide provenance information, which would significantly enhance the efficiency by reducing the range required for investigation. The polymorphic markers developed in this study can be further applied to the conservation and breeding of the endangered species C. formosensis. Bioinformatics analysis was conducted with CLC Genomics Workbench version 10 (QIAGENE, Aarhus, Denmark). The raw reads were prescreened to remove adapter sequences and reads with greater than 0.01 error or an average quality less than QV20. The trimmed sequences were further subjected to de novo assembly.

Materials and methods
MISA (MIcroSAtellite v 1.0) 28 was applied to screen the SSR containing sequences from contigs. To design SSR primers, sequences with at least five di-, tri-, tetra-, penta-, and hexa-nucleotide repeats were selected using BatchPrimer3 29 , with optimized conditions set length at 18-23 bp, melting temperature 45-62 °C, and product size of 80-300 bp. www.nature.com/scientificreports/ A total of 100 candidate SSR primer pairs were newly designed in this study. These markers were subjected to validation test on 92 samples from four C. formosenses geographic areas (MM, HV, GW, SY see Supplementary 1). The samples DNA used in marker validation were extracted using the VIOGENE plant DNA extraction kit (VIOGENE, New Taipei City, Taiwan). The PCR reaction was conducted with a final volume 20 μL containing 2 ng of genomic DNA, 0.25 μL of 10 μM each primer and 10 μL of Q-Amp 2 × Screening Fire Taq Master Mix (Bio-Genesis Technologies, Taipei, Taiwan). The following PCR process was conducted: an initial denaturation of 95 °C for 2 min; 30 cycles of 95 °C for 45 s, a primer-specific annealing temperature for 45 s, and 72 °C for 45 s; followed by a 15-min extension at 72 °C ( Table 1). The amplified products were evaluated on the ABI 3500 (Applied Biosystems, Waltham, Massachusetts, USA) with GeneScan 600 LIZ Size Standard (Applied Biosystems). Fragment size was determined by using GeneMapper ID-X v1.6 (Applied Biosystems). The capillary electrophoresis diagrams for genotyping are shown in Supplementary 4.  www.nature.com/scientificreports/ The cross-species transferability of the designed markers was tested in Chamaecyparis taiwanensis; for details, see Supplementary 2 and 3.
Population genetics analysis. Genetic diversity parameters, genetic differentiation and gene flow among 4 geographic areas (MM, HV, GW, SY) were analyzed using F is , F st and Nm by GenAlex 6.503 50 .
The population genetic structure was analyzed using STRU CTU RE 2.3.4 52 . The program was run for K = 1 to 5 clusters with 20 independent runs to assess simulation stability. Each simulation was run for an initial 1,000,000 burn-in period followed by 100,000 replications based on the Markov chain Monte Carlo (MCMC) 64 . The best grouping was evaluated by Delta K 64 in Structure Harvester Web v0.6.94 65 . Bar graphs were generated by CLUMPP 1.1.2 66 for K ideal.
Individual provenance simulation was conducted with GENECLASS v. 2.0 61 on every 92 individuals independently. A pairwise simulation was also conducted on the pooled database deducted the sample itself.
Plant collecting permit declaration. With legislation compliance of experimental materials, we hereby declare that all of our experimental research and field studies on plants, either cultivated or wild, including the collection of plant material, comply with relevant institutional, national, and international guidelines and legislation.
Software and data use declaration. In this research, the software and the data generated by the software (including commercial software, open-licensed software) are used legally in accordance with regulations, which allow to reproduction, distribution, transmit and modification works (including commercial use).

Data availability
Raw sequence information and developed SSR primer pairs have been deposited to NCBI (BioProject ID PRJNA454510); GenBank accession numbers are provided in Table 1.