Genetic Characterization of Chinese fir from Six Provinces in Southern China and Construction of a Core Collection

Large ex situ germplasm collections of plants generally contain significant diversity. A set of 700 well-conserved Chinese fir (Cunninghamia lanceolata (Lamb.) Hook) clones from six provinces in southern China in the ex situ gene bank of Longshan State Forest, was analyzed using 21 simple sequence repeat markers, with the aim of assessing the genetic diversity of these germplasm resources. Genetic analysis revealed extensive genetic variation among the accessions, with an average of 8.31 alleles per locus and a mean Shannon index of 1.331. Excluding loci with null alleles, we obtained a low level of genetic differentiation among provinces, consistent with the interpopulation genetic variation (1%). Three clusters were identified by STRUCTURE, which did not match the individuals’ geographical provenances. Ten traits related to growth and wood properties were quantified in these individuals, and there was substantial variation in all traits across individuals, these provide a potential source of variation for genetic improvement of the Chinese fir. Screening large collections for multiple-trait selective breeding programs is laborious and expensive; a core collection of 300 accessions, representative of the germplasm, was established, based on genotypic and phenotypic data. The identified small, but diverse, collections will be useful for further genome-wide association studies.


Results
Genetic Diversity among the Loci. Twenty-one SSR markers were used to evaluate the genetic diversity among 700 clones of the Chinese fir from six different provinces. All of these were neutral SSR markers not affected by natural selection. A total of 181 alleles were identified across the loci, ranging from 3.83 at SSR3 to 18.83 at SSR1 (Table S1). The number of effective alleles per marker ranged from 1.39 to 11.05, with an average of 3.80. The mean value for I was 1.331, ranging from 0.588 to 2.601. The mean Ho was 0.561, which differed from the He value (0.604). Of the 21 primers used to characterize the Chinese fir collection, Ho was lower than He at 14 loci. Five loci (SSR1, SSR2, SSR6, SSR 11, and SSR21) showed significant deviation from Hardy-Weinberg equilibrium (HWE) due to heterozygote deficiency or null alleles. Possible null alleles were identified in four loci (SSR1, SSR2, SSR11, and SSR21), with null allele frequencies ranging from 0.06 to 0.29. Of these, SSR1, SSR11, and SSR21, which had higher null allele frequencies, were excluded from the following analyses. The mean fixation index (F IS ) was 0.053 (P < 0.05). Low genetic differentiation among the loci was observed based on F-statistics. The PIC value ranged from 0.25 (SSR12) to 0.92 (SSR1), with an average of 0.57, indicating that the loci were reasonably informative (Table S1).
Population Structure of the Chinese fir Samples. The analysis of the optimal substructure of the genetic relationship among Chinese fir accessions, excluding loci with null alleles, showed a clear ΔK peak at K = 3 (ΔK = 66.0) (Fig. 1); further evidence regarding K values, which ranged from 1 to 20 using Marverick software, also showed that K = 3 was the most likely value in this study ( Figure S1), indicating that all individuals grouped into three major clusters. Among the 10 independent replicate runs with k = 3, the major mode showed exactly identical patterns of individual assignment in each run ( Figure S2). Cluster 3 contained fewer individuals (Cluster 3: n = 222) than did Cluster 1 (Cluster 1: n = 237) and Cluster 2 (Cluster 2: n = 241) (Table S2), with average Q values of 0.628, 0.633, and 0.606, respectively. Sixty-one individuals from JX (53.98%) and 18 individuals from GZ  (Table S3); the population structure was weak, and most individuals from the six provinces belonged to a genetic cluster (Fig. 2). The three clusters were dominated by different gene pools but showed inconspicuous genetic differentiation from each other.
The number of alleles and their frequency in the three clusters showed a certain degree of difference (Table S4). The highest allele frequency (0.903) had a size of 216 bp on SSR17 in Cluster 1. The minimum allele frequency that can be measured was 0.002. Among 176 alleles in 18 loci, there were 118 alleles that could be detected in all clusters, accounting for 67.05%. The remaining 58 rare alleles were detected in parts of the clusters.
Genetic Diversity among Chinese fir Populations. Population genetic parameters, including Na, Ne, I, Ho, and He, were calculated using microsatellite data excluding loci with null alleles to estimate the variation among the samples obtained from six different provinces, as well as the three clusters, at the structure level (Table 1). Among the six provinces, the value of Na and the number of private alleles in GD was higher than those in other provinces, which may have been due to the larger sample size 27 . HN contained the fewest individuals, had the lowest Na, and lacked any private alleles. The three genetic clusters contained 241, 243, and 216 individuals, respectively, with each cluster having a similar value of Na and private alleles. Despite large disparities in sample size among the six provinces, there was little variation in the level of genetic diversity, which corresponded to the level of each cluster. The highest value of Ne was 4.007, seen in JX, while the lowest value (3.409) was seen in HN. The highest Ho and He values in FJ were not significantly higher than the lowest values in GX, with differences of 0.035 and 0.023, respectively. The value of Ho was lower than that of He in all six provinces, but this difference was not significant, which is consistent with the positive fixation index value. Genetic differentiation between any two provinces was calculated for all six provinces using pairwise genetic differentiation values (F ST ) ( Table S5). Most of the pairwise tests of differentiation (F ST ) performed between locations were significant (P < 0.01), with an overall F ST value of 0.009, suggesting that there was weak differentiation among the six provinces. The values ranged from 0.003 (GD-GX; GD-JX) to 0.016 (HN-FJ). A similar pattern of differentiation among provinces was observed using the standard Nei's genetic distance estimate (Table S5). Rousset's genetic distance values [F ST /(1-F ST )] (1997) also indicated that HN is most distant from other provinces, whereas GD-GX and GD-JX were most closely related. Further analysis among the three clusters yielded a mean value for F ST of 0.010, while the mean Nei's genetic distance was 0.034, indicating weak differentiation among them.
Analysis of Molecular Variance (AMOVA). AMOVA was performed for all Chinese fir accessions of the six different provinces to analyze the distribution of genetic diversity among and within the provinces. The results revealed low variation among the provinces ( Table 2). Nearly all the variation (99%) was attributed to differences within provinces. A hierarchical AMOVA of the three genetic clusters using STRUCTURE revealed that 96% of the variance was distributed within the clusters.    (Table 3). Data from outlier trees were discarded. The coefficient of variation was > 10%, and that of stem volume was highest, at >70.5% and a range from 0.0015 to 0.3235 m 3 . Tracheid length had the lowest coefficient of variation (11.9%). The maximum tree height (15.25 m) was more than 10-fold that of the minimum tree height (1.50 m), and the average height was 7.71 m. Further analyses of the phenotypic variation in the three clusters are shown in Table 4. ANOVA for the three clusters revealed there were no significant differences in most traits, except for P. The components of variance in all measured traits ( Table 5) also showed that the environmental variance in each trait was much less than the genetic variance, suggesting that environmental effects were small for these traits, corresponding to the small value of Q ST .

Development of a Core Collection.
To determine the appropriate sampling strategy and optimal core size, different sampling percentages from the whole collection were designed, and this was combined with an M strategy and random sampling method. The average build results for five repeats (Fig. 3) revealed that the number of alleles reserved by the M strategy was always greater than the number reserved by the random sampling method when the core germplasm was constructed using the same sample collection. The latter exhibited inferior efficiency compared to the former, especially for core sets with smaller sample collections, demonstrating a larger allele retention gap. The number of alleles sampled increased rapidly with increased sampling size using the M strategy; however, when the sampling size reached 150 individuals, the curve gradually levelled out, and there was no obvious change in the number of alleles when the sampling quantity increased. Ultimately, the M strategy was considered the preferred strategy for constructing the core selection and, considering that the core germplasm collected will be used for association analysis in the future, a sampling size of 300 was set. In addition, 10 phenotypic traits, including H, DBH, T, V, P, WBD, Hy, L, D and L/D, were used to construct the core germplasm using POWERCORE, which resulted in the identification of 26 individuals exhibiting 100% of the diversity of the whole collection. Finally, a core germplasm of 300 individuals (52 from GX, 34 from JX, 20 from HN, 20 from GZ, 37 from FJ, and 137 from GD) with complete phenotypes were selected based on the results of both methods described above.
Comparisons of the Whole Collection with the Core Collection. The phenotypic data and genetic diversity parameters were computed for the core collection (Tables 6 and 7). The lowest coincidence rate of the range of the core collection was 78.76% of the whole collection, and the highest rate of variation of the CV among the measured traits was only 17.71%. Additionally, t-tests performed for all phenotypic data showed no significant differences between the core collection and the whole collection, except for H ( Table 6). All phenotypic data showed a normal distribution. In addition, the genetic parameters Na, Ne, I, Ho, He and PIC were calculated for  Table 2. Analysis of molecular variance from microsatellite data excluding loci with null alleles using GenAlEx 6.5. a The first analysis included all provinces. b The second analysis included three genetic clusters.  the core collections, which were all very close to those of the entire collection and also showed no significant differences (Table 7). Among them, the values of I and PIC were as high as 96.84% and 99.22%, respectively, of the whole collection, indicating that the core collection was highly representative of the whole collection of 700 accessions.

Discussion
Characterization of the germplasm is essential for germplasm conservation and collection activities. A large sample of Chinese fir individuals from six different provinces was collected, grafted, and grown at a single location. We surveyed this germplasm for growth and wood-property traits, which are all important for the economic value of woody stems 28 . The substantial variation identified provides a potential opportunity for genetic improvement of the Chinese fir. In addition, by using neutral SSR markers, extensive genetic variation was detected. Among the 21 SSR loci, most conformed to HWE, and no population had a particularly large number of loci that deviated from HWE. Four loci may have had null alleles, contributing to positive values for the inbreeding coefficient 29 .
F IS values differed significantly from 0, suggesting that self-pollination may exist, or the Wahlund effect may be present 30 . The mean heterozygosity values Ho and He were 0.561 and 0.604, respectively, similar to the values in red-colored heartwood genotypes of Chinese fir in GX province (Ho = 0.562 and He = 0.584) 26 , and determined using the same microsatellite loci despite the sample size of this study being almost five times that of the previous study. Wang et al. 31 also obtained similar heterozygosity values for different numbers of wild and semi-wild apricots, using the same microsatellites. Number of loci, rather than the number of populations, affects the estimate of genetic diversity, consistent with the report by Ferrer et al. 32     but lower than those in Abies chensiensis and Austrocedrus chilensis 36,37 ; this suggests that the level of genetic diversity in this germplasm is moderate. Outcrossing and wind-pollination may explain the considerable level of polymorphisms in this species [38][39][40][41] . Of the 21 SSR loci used in this study, 14 showed lower values of Ho than He, indicating a deficiency in heterozygotes at these loci. The same heterozygote deficiency was observed at the population level excluding null alleles. The results are consistent with previous findings in the Chinese fir 26 using the same 21 SSR markers. In general, natural selection occurring throughout the life cycle of a tree appears to favor heterozygosity 42 , and hybrids show excess heterozygotes 43 . In this tree species, however, heterozygote deficiency may be explained by self-pollination 41 or by subpopulation structure 30 . The marker data also showed that the genetic distance between some clones was very small, such as between 25 44 , and therefore maintenance of heterozygosity and retention of heterosis are vital, although further analyses such as Mendelian inheritance testing 1 are needed to verify heterozygote deficiency. Previous analyses of the phenotypic traits of this species identified 98 relatively fast-growing genotypes with relatively high wood basic density 45 , and all of them had distant genetic distances. These genotypes can be chosen to be the parent in cross breeding. In the future, it will be necessary to further expand the Chinese fir breeding base and increase gene communication between the trees. The presence of private alleles in this germplasm may indicate useful rare variants and also provides the opportunity to select useful recombinants in the future. In this study, a mean of 8.31 alleles per locus from 700 Chinese fir accessions was obtained, a value higher than those in previous reports [24][25][26] . The high number of alleles identified in this study may be due to the large sample size 1 . The relatively low F ST value (0.015) of loci observed indicated a low level of genetic differentiation. Correspondingly, only 1% genetic variation was seen among provinces, as confirmed by AMOVA. These results confirmed those from a previous study showing that outcrossing in woody plants resulted in increased genetic diversity and reduced genetic differentiation among populations 31 . Furthermore, the levels of genetic diversity in each province except HN were similar, although there was a large disparity in sample size. These results indicate that genetic diversity may not be influenced by the number of samples when a large sample size of Chinese fir is available. The F ST values between pairwise provinces were significant but low, which indicated that most have mixed ancestry and therefore clustered together. Rousset's genetic distance values showed that HN was genetically the most distantly related among the provinces, which may have been due to its small sample size, while the most closely related were GD-GX and GD-JX, which may have been due to their geographical proximity or to human activity, such as artificial cultivation.  STRUCTURE identified three non-distinct clusters among the accessions of Chinese fir. The membership coefficient of the whole accession revealed that only 15.43% of individuals had ancestry values >0.80, and nearly half of the individuals had ancestry values <0.60. ANOVA also revealed no significant differences in most traits among the three clusters. Cluster assignment did not match geographical provenances, indicating that the Chinese fir accessions had a mixed ancestry, consistent with the low F ST values obtained. That may explain the small phenotypic variation obtained among provinces in a previous study 45 . The low divergence in this resource could be due to several factors. The materials used in this study were all plus trees, including excellent genetic materials, good local-type materials, excellent provenance materials, and hybrid materials, and may have been widely used. This may have resulted in gene flow under past anthropogenic activities, lowering the differentiation among populations. Additionally, the relatively small geographic scale of our survey may have also been a factor. Previous studies have shown that the influence of climate on population structure is stronger than that of the geographical distance influence in P. tomentosa, and P. tomentosa clones were generally assigned to three different climate regions, whereas most of the geographical proximities were clustered into different groups 1 . Wang et al. 31 also reported that geographic distance was not the principal factor influencing genetic differentiation in the Siberian apricot. The trees investigated in this study were all from similar climate zones, and a sufficient number of individuals may not have been sampled from all local populations, obscuring the genetic structure of Chinese fir, which may have weakened the genetic structure to a certain extent. In addition, outcrossing rate, population size and life-history traits can also have a strong influence on the genetic structure of plant populations [38][39][40] . A history of genetic drift may also contribute to the phenomenon of non-conspicuous genetic differentiation 26 . Chinese fir is an economically valuable and widely used conifer that is widely distributed in southern China, covering more than 9. 11 × 10 6 hm 2 . To obtain greater economic benefits, directional selection is used in breeding, and random sampling from small groups may lead to genetic drift that may contribute to low genetic differentiation. Such past anthropogenic activities may have significantly influenced the present population structure and patterns of genetic diversity in Chinese fir. Additionally, Chinese fir is a wind-pollinated tree species with a low level of inbreeding, which can lead to genetic drift. The samples of this germplasm were derived from different provinces; therefore, a comprehensive study of genetic structure should improve our ability to assess the distances over which differentiation can occur in Chinese fir 38,40 . Understanding the genetic structure of Chinese fir could facilitate the selection of trees for breeding to maximize genetic diversity as well as to enhance the potential gain from selection, which may have an impact on the ecological adaptation and evolution of this species in the future 40 .
Association analysis in multiple-trait selective breeding programs is a breeding strategy that can accelerate the breeding process; however, association mapping of large germplasm collections is laborious. At the same time, to avoid using too small of a sample to assess the correlation between phenotypes and genotypes we selected 300 accessions from different genetic backgrounds, representing the maximum variability of 700 Chinese fir accessions, for future association mapping studies. The t-tests performed for most phenotypic data and all genetic parameters showed no significant differences between the core and the whole collections, indicating that the core collection is a good representation of the original germplasm. The core collection developed in this study will be useful for genome-wide association studies in the future to accelerate breeding programs for the Chinese fir.

Materials and Methods
Plant Material and DNA Extraction. Chinese fir is widely distributed in southern China, including Guangdong, Fujian, Zhejiang, and 14 other provinces, as well as in Taiwan ( Figure S3). In 2004, 700 Chinese fir plus trees, including excellent genetic materials, good local type materials, excellent provenance materials, and hybrid materials, were collected from six groups based on their geographical locations, including Guangxi, Jiangxi, Hunan, Guizhou, Fujian, and Guangdong provinces ( Figure S3 and Table S6) 46 . These materials come from excellent provenance and a family gene collection area established in 1983, a Chinese fir-type gene resources collection area constructed in 1989, and an excellent hybrid materials genetic resources collection area of a second-generation seed orchard constructed in 1992, including more than 50 provenances with a broad genetic basis. The number of Chinese fir individuals in each provenance ranged from 1 to 67. The selection criteria of original plus trees considered growth indices and morphological parameters, including volume of wood, heightdiameter ratio, crown diameter ratio, percent of bark, disease resistance, stem straightness, the natural level of training, crown vice, collateral thickness degrees, growth vigor, and so on. The dominant comparative method and comprehensive evaluation method were used to choose plus trees, and the distance between any two individuals was more than 50 m. From 2004, scions of these trees were grafted onto 2-year-old Chinese fir rootstocks in the ex situ gene bank of Longshan State Forest Farm, Guangdong Province, China (25°11'N, 113°28'E, 285-296 m above sea level), which is located in a subtropical region with a moderate climate throughout the year and ample rainfall 47 . Each clone had at least four ramets of similar size and vigor. Grafted ramets were planted randomly at the site at a spacing of 3 × 3 m. This study was performed in strict accordance with the recommendations in the Guide for Observation and Field Studies 48 . Total genomic DNA was extracted from the mature leaves of each clone using a QIAGEN Plant DNeasy Kit (QIAGEN, Hilden, Germany) in 2014. The quality and concentration of the extracted DNA were measured using a NanoDrop 2000 Spectrophotometer. SSR Genotyping. Twenty-one previously identified microsatellite markers 26,49 were used to genotype the 700 clones. Amplification was performed in a 25-μL reaction volume containing 1.0 μL genomic DNA (~ 100 ng), 1.0 μL forward primer (10 μM), 1.0 μL reverse primer (10 μM), 12.5 μL 2X QIAGEN Taq Plus PCR MasterMix, and 9.5 μL double distilled water. The reaction was performed, as described previously 26 , in a T100 ™ thermal cycler, with an initial denaturation step at 94 °C for 5 min, followed by 35 cycles at 94 °C for 30 s, 56/58/62 °C (depending on the annealing temperature of the primer used) for 30 s, and 72 °C for 30 s, with a final extension at 72 °C for 10 min. The forward primer of each pair was labeled with a fluorescent dye (ROX, FAM, or HEX) during synthesis. PCR products were separated by capillary electrophoresis using the ABI3730xl DNA Analyzer (Applied Biosystems, Carlsbad, CA, USA). Genotypes were determined using Gene-Marker 2.2.0 software (SoftGenetics LLC, State College, PA, USA).
Phenotypic Data. Ten growth-and wood-property traits were measured in all 700 clones in 2014 in Longshan State Forest Farm, with at least three randomly selected ramets per clone. The growth traits included tree height (H), diameter at breast height (DBH), bark thickness (T), and stem volume (V). The wood-property traits included proportion of heartwood (P), wood basic density (WBD), hygroscopicity (Hy), tracheid length (L), tracheid diameter (D), and the ratio of L to D (L/D). Growth traits, including H, DBH, and T, were measured during field surveys using the method described by Duan et al. 50 . V was calculated according to the formula V = 0.000 058 777 042 × D 1.9699831 × H 0.89646157 . Additionally, a wood core was drilled at breast height from each tree using a tree growth cone and then placed in a plastic tube, which was not completely sealed, to prevent wet rot. P and WBD were measured using the method described by Duan et al. 50 . Hy was evaluated using the formula Hy = (W1 −W2)/W2, where W1 and W2 represent the water-saturated weight and oven-dry weight, respectively 47 . L and D were measured using the methods described by Huang et al. 51 .
Data Analysis. Microsoft Excel 2010 and SAS ver. 8.1 (SAS Institute, Cary, NC, USA) were used to examine trait differences in the phenotypic traits; these analyses excluded abnormal data obtained from weak grafts, including the mean value, standard error, amplitude, and coefficient of variation (CV). Detailed sampling, measurement methods, phenotypic variations, and phenotypic correlations for these 10 traits were described in a previous study 45 .
Microsatellite data were converted into various formats using Convert 1.3.1 52 for further analysis. The level of genetic diversity for all loci, including the number of alleles (Na), effective number of alleles (Ne), Shannon index (I), observed heterozygosity (Ho), expected heterozygosity (He), gene flow (Nm), and F-statistics calculations (F IS , F IT , and F ST ), were calculated using GenAlEx 6 software 53 . Polymorphism information content (PIC) was calculated using PowerMarker v. 3.25 54 . Hardy-Weinberg equilibrium (HWE) for all loci was assessed using Arlequin version 3.5 55 with 100,000,000 steps in the Markov chain and 100,000 dememorization steps 56 . Null alleles were detected using Microchecker 2.2.3 57 . A neutrality test for all loci was performed in Popgene version 1.32 58 . An analysis of molecular variance (AMOVA) was performed to partition the genetic variance among and within the provinces using GenAlEx 6.5 53 together with Microsoft Excel 2010.
Genetic variation among the samples obtained from six different provinces was evaluated by calculating genetic parameters using GenAlEx 6.5 53 . The F ST values were calculated to evaluate the genetic differentiation between any two provinces using FSTAT version 2.9.3 59 . Nei's genetic distance was estimated by GenAlEx 6.5 53 . The pairwise genetic distance [F ST /(1 − F ST )] 60 (1997) among any two provinces was also estimated.
The population structure of the clones was analyzed using the Bayesian model-based clustering algorithm in STRUCTURE ver. 2.3.1 61 . The software implements the Markov chain Monte Carlo (MCMC) algorithm and a Bayesian framework under admixture model, correlated allele frequencies. Subgroups were identified based on distinctive allele frequencies, and individuals were placed into K clusters by estimated membership probability (Q). In this study, the optimum value of K was determined using the model developed by Evanno et al. 62 , with an ad hoc statistic (ΔK), based on the second-order rate of change in the log probability of data between successive K values. The algorithm was run 10 times with a burn-in of 100,000 iterations, followed by 1,000,000 iterations for each value of K and subpopulations (K) ranging from 1 to 20. The height of this modal value was used as an indicator of the strength of the signal that was detected using Structure Harvester 63 . To further compare the evidence for a range of K values, we set Kmin to 1 and Kmax to 20, and the main repeats were 10, using Marverick software with the default parameters 64 . The CLUMPAK web server 65 was used to visualize the bar plot of the probability of membership from the results of Q-matrix and to evaluate modality.
A core collection was constructed. The optimal sampling method and amount were determined using PowerMarker v. 3.25 54 based on M and random sampling strategies. The M strategy is also known as the maximum number of alleles strategy 66,67 . The random sampling strategy was the same for all materials, and a certain number of samples was randomly assigned from germplasm materials 68 . The two sampling strategies were compared among 20,25,30,35,40,45,50,60,70,80,90,100,110,120,130,140,150,200,250, 300, and 400 individuals to assess the efficiency of allele sampling using molecular data. For phenotypic data consisting of continuous variables, 100% of the diversity can be sampled based on the precision of classification using PowerCore software 69 . From the two comprehensive sets of results, a core set available for future association mapping was constructed. Simple t-tests were performed for all phenotypic data and for the genetic parameters of the entire collection and the core collection using SAS ver. 8.1 (SAS Institute, Cary, NC, USA).