Genetic diversity and structure of wild and cultivated Amorphophallus paeoniifolius populations in southwestern China as revealed by RAD-seq

Amorphophallus paeoniifolius, is a commercially important vegetable crop because of its high production potential. In this study, we generated a total of 166 Gb of genomic data from 16 wild and 20 cultivated A. paeoniifolius individuals in southwestern China using restriction site associated DNA sequencing (RAD-seq). We compared the genome-wide variations between the wild and cultivated populations. Wild populations exhibited higher genetic diversity than did cultivated populations based on private allele number, expected heterozygosity, observed heterozygosity and nucleotide diversity. STRUCTURE analysis, principal component analysis (PCA) and a maximum likelihood (ML) tree indicated that A. paeoniifolius populations could be divided into three groups (a cultivated group and two wild groups) with significant genetic differentiation. The low genetic diversity and shallow genetic differentiation found within cultivated populations are likely caused by continuous selection and the clonal propagation methods used during domestication. The significant differentiation between the wild populations may suggest strong genetic drift due to small populations and human disturbance. The genome-wide single nucleotide polymorphisms (SNPs) identified in our study will provide a valuable resource for further breeding improvement and effective use of the germplasm.

There is evidence that domestication has led to a reduction in genetic diversity for several cultivated crops 4,8,9 . Genetic information before domestication and artificial selection might have been reserved in wild populations, which are particular resources for studying the influence of human selection on genetic variation in the A. paeoniifolius genome. However, the natural populations of A. paeoniifolius in China are strongly influenced by human activities, such as harvesting and deforestation, and most remaining populations are restricted to home gardens and agroforestry systems 10 . Thus, conservation measures must be taken to prevent the further decline of A. paeoniifolius resources, and information about genetic diversity and population structure is essential for formulating management and conservation approaches. Amorphophallus paeoniifolius can outcross, but vegetative propagation is usually used during cultivation and bears significant genetic load. The reliance on clonal propagation and the limited diversity of A. paeoniifolius germplasm make it highly vulnerable to many bacterial diseases, such as bacterial soft rot disease 11 . Heredity improvement of A. paeoniifolius cultivars using wild germplasms urgently needs to be addressed. Developing genomic resources, increasing understanding of the A. paeoniifolius gene pool (including wild germplasms), and gaining information about genetic diversity and population structure should speed the progress of biological research and genetic improvement.
Currently limited genomic information hinders genetic studies of A. paeoniifolius. Only a few studies have been carried out to gain information on the genetic diversity of A. paeoniifolius and its relationships with relative species using molecular approaches such as simple sequence repeats (SSRs) 12,13 , inter-simple sequence repeats (ISSRs) 14 and chloroplast DNA loci 15,16 . However, the molecular markers cannot provide sufficient resolution for genetic diversity and genetic structure inferences. Rapid progress in high-throughput sequencing technologies has provided an opportunity to infer genome-wide information from organisms without reference genomes with affordable cost. Reduced-representation genome sequencing allows us to discover thousands of genetic markers from many samples for population genomics studies [17][18][19] . These methods which consisted of restriction site associated DNA sequencing (RAD-seq) and genotyping by sequencing (GBS) have been successfully applied to population genomics studies of mangy species [20][21][22] .
To contribute to the understanding of A. paeoniifolius domestication and accelerate its agricultural application, we generated and analysed genome-wild SNPs for the wild and cultivated populations of A. paeoniifolius (2n = 2x = 28) in southwestern China by RAD-seq to provide a better understanding of the genetic diversity, genetic structure and divergence of this species. Our study will enhance the future genetic improvement of this important crop.

Results
Sequence data quality and processing. For the 36 sequenced samples, 166.2 Gb of raw data with an average of 4.49 Gb per sample were generated, ranging from 3.68 to 5.50 Gb. With quality filtering of the sequence data, a total of 161.0 Gb of clean data (3.68 Gb to 5.47 Gb for each sample, with an average of 4.47 Gb) was kept, presenting an average effective rate of 99.56%. In brief, our sequencing data showed high phred quality (Q20 > 90%, Q30 > 85%), with a stable GC content ranging from 41.07% to 44.58% (Table S1).
To access the genetic diversity of A. paeoniifolius at the germplasm level (between wild and cultivated germplasms), the 36 samples were divided into two groups. When we required loci to be present in at least 80% of the samples of the two groups, 32,536 RAD loci were retained. When comparing genetic diversity among the seven populations, 19,575 loci were retained after requiring loci to be present in 80% individuals of no less than six populations. More than two thirds of the SNPs at the population level (with an average of 73.70%) were confirmed to be transitions, and the observed transition vs. transversion (ts/tv) ratio ranged from 2.63 to 3.10 for each population. The variation numbers (transitions and transversions) and ts/tv ratios of the SNPs were much higher for the two wild populations of A. paeoniifolius than for the populations of A. paeoniifolius 'Yellow' (Fig. 1, Table S2).
Genetic diversity at the germplasm and population levels. For all polymorphic loci at the germplasm level, the private allele numbers (A P ), observed heterozygosity (H O ), expected heterozygosity (H E ) and nucleotide diversity (π) of the wild germplasm were 25440, 0.2289, 0.3463 and 0.3592, respectively. For A. paeoniifolius 'Yellow' , the A P , H O , H E and π values were 880, 0.1963, 0.1022 and 0.1053, respectively. When analysing all nucleotide positions, including the non-polymorphic sites, the observed heterozygosity, expected heterozygosity and nucleotide diversity of wild A. paeoniifolius dropped to 0.0009, 0.0013 and 0.0013, respectively. The three statistics (H O , H E and π), including non-polymorphic sites of A. paeoniifolius 'Yellow' , were 0.0007, 0.0004 and 0.0004, respectively ( Table 1).
As shown in Table 1, the wild germplasm of A. paeoniifolius had much higher genetic diversity than A. paeoniifolius 'Yellow' , which was uncovered by the private allele numbers, the observed heterozygosity, the expected heterozygosity and the nucleotide diversity, regardless of the germplasm or population level.
Population structure. Genetic analysis of population structure with STRUCTURE software and principal component analysis (PCA) revealed similar patterns. Structure Harvester suggested that K = 3 was the most likely genetic cluster number (see Fig. S1). The seven populations were divided into three genetic clusters. Two wild populations of A. paeoniifolius were separated into two clusters, and the five populations of A. paeoniifolius 'Yellow' formed the third genetic cluster (Fig. 2, Fig. 3). Population genetic grouping of K = 2 received support as the second highest ΔK value. Populations of wild A. paeoniifolius and A. paeoniifolius 'Yellow' formed respective clusters (Fig. S1). The genetic structure was also supported by the results of hierarchical AMOVA: the variance among the three clusters (61.6%) was higher than the genetic variance between wild species and the cultivar (32.4%) ( Table 2). Population genetic clustering conformed to that observed in the maximum likelihood (ML) tree constructed with all 36 individuals (Fig. 3c).
Genetic differentiation and AMOVA. A moderate level of genetic differentiation (F ST = 0.194) was found between wild A. paeoniifolius and A. paeoniifolius 'Yellow' at the germplasm level (P < 0.05). The pairwise F ST values between populations varied from 0 to 0.2909, with 9 of the 21 population pairs detected with significant values (P < 0.05) ( Table 3). None of the population pairs in A. paeoniifolius 'Yellow' were significantly differentiated.
The hierarchical analysis of molecular variance (AMOVA) divided the overall genetic variance as 32.4% between the wild and cultivated germplasms, 20.6% among populations within the germplasm and 47.0% within populations (Table 2). AMOVA analysis conducted using the predefined genetic clusters of the population structure analyses suggested that the majority of variance (61.6%) was found among genetic clusters. As the wild germplasm and A. paeoniifolius 'Yellow' were analysed separately, only the statistics for the wild germplasm were significant, with the variation among populations measured as 44.2% (Table 2).

Sequence assemble, annotation and GO enrichment analysis. The resultant SNPs consisted of
gene-derived (genic) SNPs and non-genic SNPs. Genic SNPs, which represent potential function-related variants, are useful for realizing phenotype mutations, genetic drift and gene flow in cultivated and natural populations; they are especially important for describing genes associated with complex traits 23,24 . To examine genic SNPs, paired-end sequences for each catalogue locus with at least one SNP were assembled. Only contigs with a length of 200 or more nucleotides were recorded for further analyses. Overall, the final assembled sequence comprised of 724783 contigs with an average length of 340 bp and a contig N50 size (50% of the genome is in fragments of this length or longer) of 372 bp (Fig. S2, Table S3).
Based on the assembled transcriptome database of Amorphophallus konjac, 107 assembled contigs of A. paeoniifolius (0.01%) were aligned to the unigenes of A. konjac. According to the annotation results against NCBI's non-redundant database, of the 107 unigenes that showed sequence similarity with A. paeoniifolius sequences, 102 unigenes were significantly mapped to known genes with BLASTX. The annotated sequences were implemented using Blast2GO for GO classification. The top two richest subcategories of the biological processes are metabolic processes and cellular processes. The most highly represented genes under the cellular component category are cell and cell part. Binding and catalytic activity represent the majority of molecular function (Fig. 4, Table S4). The top-hit species distribution inferred by BLAST results is Anthurium amnicola, Dioscorea nipponica, Musa acuminate subsp. Malaccensis and Nelumbo nucifera (Fig. 4).

Discussion
We used RAD-seq to assess patterns of genome-wide diversity in the cultivar A. paeoniifolius 'Yellow' and wild accessions of A. paeoniifolius in southwestern China. The analysis revealed that all genetic diversity indices (H O , H E and π) for the cultivated group were much lower than the wild accessions across the entire genome when at both the germplasm and population level ( Table 1). The excess of rare variants was more significant in wild populations, with 5,440-8,151 private alleles per population, which represented a large gene pool of subsistent genetic variation that could be used in future crop improvement programs. In contrast, only 0-7 private alleles per population were found in the domesticated populations. These differences might be because the continuous selection had reduced effective population size and increased genetic drift and hitchhiking during domestication 2,25 . Inbreeding and intensive selection during domestication, which narrow the germplasm genetic base, reduce the genetic diversity and promote adaptive divergence between domesticated crops, have been reported in many plant species 2,26 .
The number of transitions is predicted be much larger than transversions due to the biased mutational processes within plant genomes (e.g., cytosine deamination). Consistent with this expectation, the ts/tv ratio of the seven A. paeoniifolius populations ranged from 2.63 to 3.10. The similar nucleotide mutation pattern was also observed in other plants, such as peanut, maize and Arabidopsis [27][28][29] . Further analysis also indicted that the ts/tv ratio was higher in wild populations of A. paeoniifolius than in cultivated populations. Some researchers claim that transitions are more common than transversions, as they can provide easy tolerance from selection pressure 30 . The relatively low ts/tv ratio in A. paeoniifolius 'Yellow' might be a loss of evolutionary potential as a consequence of long, severe artificial selection.
Inferring from STRUCTURE analysis, PCA analysis and the ML tree, the seven populations were divided into three genetic clusters. The five populations of A. paeoniifolius 'Yellow' formed one cluster, and the two wild populations were split into two clusters. The hierarchical AMOVA also revealed a high level of genetic variation among the three clusters (61.6% of the total variation) and a significant (P < 0.01) level of variation between the     Table 3. Pairwise comparison of genetic distances (F ST values) (above diagonal) and significance levels (below diagonal) values among seven populations with p = 6/r = 0.8. (*Significance at the 5% nominal level; NS, not significant).

Source of variation Sum of Squares Variance components Percentage of variation Fixation index
wild germplasms and the cultivar (32.4%) ( Table 2). Almost no genetic differentiation was found among the cultivated populations. In most populations of A. paeoniifolius 'Yellow' , farmers brought seeds or tubers from a few individuals from the wild population, and then transferred and grew them in their home gardens. The cultivated populations were then maintained in the village from generation to generation. For most vegetative propagated crops, the separation from wild ancestors during domestication could lower the probabilities of sexual crossing in the subsequent populations 31,32 . Clonal propagation methods may have increased the homogeneity of A. paeoniifolius 'Yellow' at the population level. Unlike the cultivated populations, two wild populations showed significant differentiation. The wild accessions of A. paeoniifolius in China are distributed over a narrow geographic range (mostly in southern Yunnan province) as comparatively small populations. Human activities in recent years have dramatically influenced the genetic patterns of A. paeoniifolius populations, and the disturbances and encroachments from humans are still increasing. Deforestation for construction, farming and grazing has caused continuous and serious damage to the natural populations of A. paeoniifolius 10 . The higher F ST values between the two wild populations may suggest stronger genetic drift caused by these factors (Table 3).
We obtained usable information for 102 contigs containing at least one SNP associated with gene function, which might be useful for future studies of the genetic mechanism of the distinct characteristics among wild and cultivated A. paeoniifolius plants and the A. paeoniifolius genetic breeding program. Although our RAD-seq data consisted of sequences of coding and non-coding regions, and gene annotations for all loci were not assumed, the reason why so few SNPs were annotated (0.01%) is mainly the unavailability of the reference genome data. Although we used the transcriptome dataset from the relative species A. konjac 33 as the reference database, the genome of A. paeoniifolius is the best choice as the reference database to find more comprehensive SNP loci which are related to important agronomic traits such as characteristic secondary metabolism and crop defense of diseases. However, the large genome of A. paeoniifolius (proximate 4.21 Gb) 34 has hindered us to obtain genomic information. In the near future, when the A. paeoniifolius plant genome sequencing is complete; we believe that the reference genome data will boost population genetics and functional genetic research in A. paeoniifolius.
Populations of A. paeoniifolius in China have severely declined because of human activities 10 . Although we tried to collect as many samples as possible during field investigation, only 16 wild individuals and 20 cultivated individuals were acquired. It has been shown that using thousands of SNP loci could be powerful for population genetic analysis 35,36 , and the genome-wide SNPs might partly offset limitations due to the small sample size. Nonetheless, the 36 individuals used in our study were not sufficient for assessing the genetic diversity of A. paeoniifolius at the whole species level. The cultivar A. paeoniifolius 'Yellow' is only found in small regions of southern Yunnan province in China 7 . The sample locations in our study represented the full geographical range of the cultivar A. paeoniifolius 'Yellow' . It is believed that this cultivar was domesticated from wild plant materials collected from adjacent areas by farmers 7 . Our sample strategies, which focusing on this small area, should be comprehensive in accessing changes in genetic variation during the domestication of A. paeoniifolius 'Yellow' . Different genetic patterns discovered between wild and cultivated populations in our study could serve as a hint for artificial modifications to the A. paeoniifolius genome during domestication in southwestern China. As A. paeoniifolius has a relatively wide distribution range, further studies with sufficient samples are needed to fully evaluate the genetic diversity of A. paeoniifolius.
In conclusion, we reported the exploration of tens of thousands of SNPs to inspect the genetic relationship and compare the genetic diversity of wild and cultivated A. paeoniifolius populations by using RAD-seq. We believe this is the first study to report the exploration of such a large number of novel SNPs from A. paeoniifolius and that further increases the amount of genomic resources available for this species. The results provide new insights into the genetic consequences of crop domestication.  Table S5). Leaves were randomly selected from each population at intervals of at least 5 metres, and were dried in silica gel in sealed plastic bags until DNA extraction. Total genomic DNA was extracted using a Plant Genomic DNA kit (Tiangen, Beijing, China) following the manufacturer's protocol. De novo assembly and SNP exploitation. The raw data from 36 individuals was first quality-filtered.

Methods
Adapter sequences and paired reads with alternative reads containing ≥50% low-quality bases (quality value ≤5) or ≥10% unidentified nucleotides were removed. The putative duplication reads and reads without intact EcoRI cutting sites were also discarded.
To identify SNP loci for population genetic studies, the Stacks tool set 38 was utilized. The filtered data for each individual were grouped into loci by ustacks with a minimum stack depth (-m) of 5 and a distance allowed between stacks (-M) of 2. The loci for all samples were then merged into catalogues by cstacks with distances between catalogue loci (-n) of 2, and the loci of each individual were matched against the catalogue to obtain genotypes of the loci in each sample with sstacks.
To analyse the genetic diversity at the germplasm level, the genetic diversity statistics in the two germplasm groups were estimated separately. At least 80% of the samples (r = 0.8) and in all two germplasm groups (p = 2) must contain a locus for it to be included in the analyses. When we analysed the genetic diversity and structure at the population level, a locus was required to be present in 80% of the individuals (r = 0.8) and in no less than six populations (p = 6). Only a single SNP was randomly selected per locus to remove tightly linked SNP loci 39 in all analyses. Because the sample size in our study was relatively small, SNP loci with a global minor allele frequency (MAF) < 0.05 were therefore discarded to limit false SNP identification. SNP loci were selected using the 'populations' program implemented in Stacks 38 . Molecular statistics for the SNPs for the seven populations were analysed using Arlequin 3.5.1 40 . Genetic diversity and differentiation. Genetic diversity indices, including the private allele number (A P ), nucleotide diversity (π), heterozygosity (H O and H E ) and inbreeding coefficient (F IS ), were calculated using the 'populations' program in Stacks 38 . To analyse pairwise population differentiation and differentiation between wild and cultivated germplasms, F ST values were also computed with the 'populations' program and tested based on 1000 permutations with Arlequin 3.5.1 40 . Population structure analyses. Genetic structure was investigated by STRUCTURE 2.3.4 41 and principal component analysis (PCA). The 'populations' program in Stacks 38 was employed to output SNPs into the structure-format file. In the STRUCTURE analysis, an admixture model with correlated allele frequencies between populations was used. Ten replicates for each K (K = 1-6) were computed, with 50,000 burn-ins followed by 150,000 Markov chain Monte Carlo steps. The optimal K was inferred based on the ΔK method implemented in STRUCTURE HARVESTER 0.6.94 42 . Admixture proportions from replicate simulations at the optimal K were averaged using CLUMPP 1.1.2 43 . Summary outputs were displayed with the program Distruct 2.1 44 . PCA analysis was conducted using the Adegenet package 45 in R to further investigate the genetic structure between populations. A variant call format (VCF) file was generated using the 'populations' program in Stacks 38 . A maximum likelihood (ML) tree of 36 individuals was constructed with 1000 bootstraps using the SNPhylo software 46 . AMOVA. AMOVA was conducted to quantify genetic variation at three different hierarchical levels: among germplasm groups, among populations within the germplasm group and within populations. Genetic variations were further tested by assigning populations to genetic clusters identified by population structure analyses (i.e., among genetic clusters, among populations within clusters, within populations). AMOVA was also conducted to assess intra-differentiations within wild or cultivated germplasms. The analyses were conducted in Arlequin 3.5.1 40 , and the significant level for the variance components was computed using 1000 permutation steps.
Functional analysis of genic SNP-associated genes. To examine genic SNPs, paired-end sequences for each catalogue locus with at least one SNP were collated using the command line 'sort_read_pairs.pl' implemented in Stacks 38 . The collated sets of reads for each catalogue locus were then assembled into contigs by the command 'exec_velvet.pl' , which executes the Velvet 47 program. To estimate the homology of these contigs to coding regions, a transcriptome dataset of a related species (A. konjac) was downloaded from NCBI (SRR555564) 33 and assembled into unigenes using the Trinity 2.4.0 software 48 . The SNP-associated RAD tag sequences were aligned against the assembled transcriptome of A. konjac using SOAP 2.21 49 with the default settings. To investigate the gene function of loci, a BLASTX search was applied against NCBI's non-redundant database by BLAST 2.2.28 50 with an E-value of 1e −6 . According to the annotation results, Blast2GO 2.5.0 51 was used to obtain functional classification of the sequences using GO terms with an E-value of 1e −6 , which maps sequences to gene functions in term of molecular function, cellular components and biological processes 52 . Data Availability. The raw RAD sequence data for the 36 wild and cultivated Amorphophallus paeoniifolius individuals was deposited in the National Center for Biotechnology Information (NCBI) Sequence-Read Archive (SRA) database with the accession numbers SAMN07175567 -SAMN07175602.