Introduction

Longan (Dimocarpus longan Lour; 2n=2x=30) is a tropical perennial crop in the Sapindaceae (soapberry) family. Longan is indigenous to southern China and Southeast Asia, but is now a commonly cultivated fruit in more than 20 countries.1,2 World production of longan reached 2.35 million tons in 20092 and the top five longan producers (China, Thailand, Vietnam, India and South Africa) jointly account for 90% of global production. Among them, China has been the largest longan producer in terms of both cultivation area (470 000 ha) and total production (610 000 tons).2 Thanks to increasing popularity in non-Asian countries, longan cultivation is now expanding in tropical and subtropical countries throughout the world, including Australia, Israel and the United States.

Longan was domesticated in China more than 2000 years ago.3–5 There are several hundred longan cultivars worldwide, most of which are landraces and farmer varieties. China alone has more than 300 varieties maintained in the national longan germplasm collection.3,5 Wild longan populations still exist in Hainan, Guangdong, Guangxi and Yunnan provinces of China, as well as in northern Vietnam and Myanmar.1,4,5 Despite the large number of longan varieties in various collections, only a small number of varieties are commercially grown worldwide.1,6 Although they are sensitive to low temperatures, many traditional longan varieties have a chilling requirement for flowering and thus are not suitable for tropical regions.4,6

Like many tropical perennial tree crops, longan germplasm is maintained as living trees in field genebanks and varieties are subject to vegetative propagation during the process of germplasm exchange. But records and labels of the varieties have not always been properly maintained and accessions often arrive bearing limited information about their correct identity. The rate of mislabeling is substantial in longan germplasm collections, which restricts the sharing of information and materials among longan researchers and hampers the use of longan germplasm in breeding programs.5,7

Genotypes can be difficult to distinguish morphologically, and accurate identification of longan varieties using molecular markers has been advocated to improve the efficiency of longan germplasm management and utilization.5,7,8 However, published research on molecular characterization of longan germplasm has so far been limited, and reported studies used mostly dominant markers including RAPD,9–11 AFLP,7,12,13 SCAR,14 SCTP15 and SRAP.16,17 Several studies have been done using inter-simple sequence repeat fingerprinting, which does not require specific sequence knowledge.18–20 Cross-species amplification of lychee SSR markers has been reported in longan21 as well as in other Sapindaceae species.22 In addition, a set of 384 putative SSR markers was developed and these markers are being verified.23 Although single nucleotide polymorphism (SNP) markers have been widely used in plant germplasm management and breeding of fruit tree crops,24,25 this powerful tool has not been available for longan.23

SNPs are the most abundant class of polymorphisms in plant genomes.26,27 Compared to SSR markers, SNP analysis can be done without requiring DNA separation by size and can therefore be automated in high-throughput assay formats. The diallelic nature of SNPs offers a much lower error rate in allele calling and raises the level of consistency between laboratories.26,27 These advantages have resulted in SNPs increasingly becoming the markers of choice for accurate genotype identification and diversity analysis in perennial crops, as recently demonstrated in cacao (Theobroma cacao),28 grapevine (Vitis vinifera),29 pummelo (Citrus maxima),30 strawberry (Fragaria spp.)31 and tea (Camellia sinensis).32 As with other perennial horticultural crops, DNA fingerprinting using a small set of SNP markers is in great demand by the longan community for a broad range of research and field applications. These applications include, but are not limited to, identification of mislabeled accessions, parentage and sibship analysis for quality control in breeding and seed programs, and characterization of farmer selections to support the production of high-value varieties for premium markets.

Recently, Lai and Lin33 developed a substantial amount of transcriptome data for somatic embryogenesis from longan cultivar Honghezi using cultured embryos at different developmental stages, and identified numerous unigenes expressed in embryogenic tissues. In addition, a significant amount of lychee transcriptome data has been developed.34,35 The objectives of the present study were to develop SNP markers by mining transcriptome data from longan and lychee and to assess their potential application for longan varietal identification. The results reported herein represent the first validation study of SNPs in longan and demonstrate the utility of transcriptome mining as an approach for de novo SNP identification in species lacking available genomic resources. These SNP markers, as well as the genotyping method, will be particularly useful for varietal identification, germplasm management and longan breeding programs.

Materials and methods

Mining of putative SNPs from transcriptome sequences

Transcriptome sequences of Dimocarpus longan Lour. (SRR412534) were obtained from the NCBI SRA Database (http://www.ncbi.nlm.nih.gov/sra/). We used NGSQCToolkit (v2.3, Patel RK, 2012) with stringent criteria (high-quality paired reads with 90% of bases above the Q20 level were retained) to remove low-quality paired-end reads and reads containing adaptors.36 The resultant 2.63×10^9 clean, high-quality reads (90 bp in length), totaling 4.73 Gbp, were retained for further analysis. The software Trinity was used to produce a transcriptome assembly containing 50 612 sequences. To capture more potential polymorphisms, 47 594 mRNA nucleotide sequences of the closely related species lychee (Litchi chinensis Sonn.) were downloaded from NCBI GenBank (3 April 2014). Redundant lychee entries were identified and excluded using the CD-HIT program with a 95% sequence similarity threshold.37 The FASTA-formatted files of longan and lychee sequences were merged into a single dataset for further data mining. Putative EST-SNPs were detected using the QualitySNP program.38 Only clusters that included at least 4 nucleotide sequences, with a confidence score over two, were accepted. To meet the requirements and constraints of primer design, all candidate SNP markers with less than 50 nucleotides between two neighboring SNPs were removed. A subset of 60 identified SNP sequences was then chosen for design and manufacture of primers to assay for SNPs in longan.
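The acceptance rules described above (cluster depth, confidence score and spacing between neighboring SNPs) can be illustrated with a minimal Python sketch. The `Candidate` record and its field names are hypothetical and do not reflect the actual QualitySNP output format. The sketch also assumes that "nucleotides between two neighboring SNPs" means the bases strictly between the two positions, and that spacing is checked after the depth and confidence filters.

```python
from dataclasses import dataclass


# Hypothetical record layout for illustration only; not QualitySNP's format.
@dataclass(frozen=True)
class Candidate:
    contig: str        # cluster/contig identifier
    position: int      # SNP position on the contig
    n_sequences: int   # number of sequences in the cluster
    confidence: float  # confidence score assigned to the SNP


def filter_candidates(candidates, min_sequences=4, min_confidence=2.0, min_gap=50):
    """Keep SNPs that pass the depth, confidence and spacing criteria."""
    # Depth: at least `min_sequences`; confidence: strictly over `min_confidence`.
    passed = [c for c in candidates
              if c.n_sequences >= min_sequences and c.confidence > min_confidence]

    by_contig = {}
    for c in passed:
        by_contig.setdefault(c.contig, []).append(c)

    kept = []
    for snps in by_contig.values():
        snps.sort(key=lambda c: c.position)
        for i, c in enumerate(snps):
            # Bases strictly between two SNP positions = difference - 1.
            gaps = []
            if i > 0:
                gaps.append(c.position - snps[i - 1].position - 1)
            if i < len(snps) - 1:
                gaps.append(snps[i + 1].position - c.position - 1)
            # Drop the SNP if any neighbor is closer than `min_gap` bases.
            if all(g >= min_gap for g in gaps):
                kept.append(c)
    return kept
```

Both members of a closely spaced pair are dropped in this sketch, since neither would leave enough clean flanking sequence for primer design.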

Validation of putative SNPs

To evaluate the putative SNP markers for suitability of varietal identification, we used a nanofluidic genotyping system and validated the SNPs for 68 samples, representing 50 cultivated and wild longan accessions (Table 1). The cultivated germplasm samples were from the USDA-ARS Tropical Crops Germplasm Repository in Hilo, Hawaii, whereas the wild trees were collected from Mangshi City in Yunnan, China. Healthy young leaf samples of these accessions were harvested and dried in silica gel. DNA was extracted from dried longan leaves with the DNeasy® Plant Mini kit (Qiagen Inc., Valencia, CA, USA), which is based on the use of silica as an affinity matrix. The dry leaf tissue was placed in a 2-mL microcentrifuge tube with one ¼-inch ceramic sphere and 0.15 g garnet matrix (Lysing Matrix A; MP Biomedicals, Solon, OH, USA). The leaf samples were disrupted by high-speed shaking in a TissueLyser II (Qiagen Inc.) at 30 Hz for 1 min. Lysis solution (DNeasy® kit buffer AP1 containing 25 mg mL−1 polyvinylpolypyrrolidone), along with RNase A, was added to the powdered leaf samples and the mixture was incubated at 65 °C, as specified in the kit instructions. The remainder of the extraction method followed the manufacturer’s instructions. DNA was eluted from the silica column with two washes of 50 µL Buffer AE, which were pooled, resulting in 100 µL DNA solution. DNA concentration was determined by absorbance at 260 nm using a NanoDrop spectrophotometer (Thermo Scientific, Wilmington, DE, USA). DNA purity was estimated by the 260∶280 ratio and the 260∶230 ratio.

Table 1 List of longan germplasm accessions used in SNP genotyping.

Sixty putative SNP sequences were submitted to the Assay Design Group at Fluidigm Corporation (South San Francisco, CA, USA) for design and manufacture of primers for a SNPtype™ genotyping panel. The assays were based on competitive allele-specific PCR and enable bi-allelic scoring of SNPs at specific loci (KBioscience Ltd, Hoddesdon, UK). The Fluidigm SNPtype™ Genotyping Reagent Kit was used according to the manufacturer’s instructions.35,36 Using these primers, the isolated DNAs were subjected to Specific Target Amplification36 in order to enrich the SNP sequences of interest. Genotyping was performed on a nanofluidic 96.96 Dynamic Array™ IFC (Integrated Fluidic Circuit; Fluidigm Corp.). This chip automatically assembles PCR reactions, enabling simultaneous testing of up to 96 samples with 96 SNP markers. The use of a 96.96 Dynamic Array IFC for SNP genotyping of human samples was described by Wang et al.39 End-point fluorescent images of the 96.96 IFC were acquired on an EP1™ imager (Fluidigm Corp.). The data were analyzed with Fluidigm Genotyping Analysis Software.40

Data analysis

Key descriptive statistics for measuring the informativeness of the SNP markers were calculated, including minor allele frequency, observed heterozygosity, expected heterozygosity, Shannon’s information index and inbreeding coefficient. The program GenAlEx 6.5 was used for computation.41,42 For genotype identification, pairwise multilocus matching was applied among individual samples using the same program. DNA samples that were fully matched at the genotyped SNP loci were declared the same genotype (or clones). Statistical rigor was assessed for match declaration using the probability of identity (PID) that two individuals may share the same multilocus genotype by chance.39 In computing PID, it was assumed that all individual genotypes were siblings (PID-sib), which was defined as the probability that two sibling individuals drawn at random from a population have the same multilocus genotype.43,44 The overall PID-sib is the upper limit of the possible ranges of PID in a population, and thus provides the most conservative number of loci required to resolve all individuals, including relatives.43 The computation was carried out using the program GenAlEx 6.5.41,42
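For illustration, the sibling probability of identity can be sketched in Python as follows. The source does not print the formula, so the per-locus expression below is an assumption: it is the commonly used sibling form, 0.25 + 0.5Σp² + 0.5(Σp²)² − 0.25Σp⁴, where p denotes the allele frequencies at a locus. The multilocus value is taken as the product over loci, which assumes independent loci. The function names are hypothetical and this is not GenAlEx code.

```python
from math import prod


def pid_sib_locus(freqs):
    """Sibling probability of identity at one locus.

    `freqs` holds the allele frequencies at the locus (summing to 1).
    Assumed formula: 0.25 + 0.5*S2 + 0.5*S2**2 - 0.25*S4,
    with S2 = sum(p^2) and S4 = sum(p^4).
    """
    s2 = sum(p ** 2 for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    return 0.25 + 0.5 * s2 + 0.5 * s2 ** 2 - 0.25 * s4


def pid_sib_multilocus(loci):
    """Product of per-locus values, assuming the loci are independent."""
    return prod(pid_sib_locus(freqs) for freqs in loci)
```

A biallelic locus with equal allele frequencies gives a per-locus value of 0.59375 under this expression, and a monomorphic locus gives 1 (no resolving power), so the multilocus value can only decrease as informative loci are added.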

Distance-based multivariate analysis was used to assess the relationship among the individual varieties, as well as their relationship with the wild germplasm. Pairwise genetic distances as defined by Peakall et al.45 were computed using the DISTANCE procedure implemented in GenAlEx 6.5. The same program was then used to perform Principal Coordinates Analysis (PCoA), based on the pairwise distance matrix. Both distance and covariance were standardized.
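As a rough illustration of the pairwise distance step, the sketch below computes a genotypic distance matrix for biallelic SNP data in Python. It assumes genotypes are coded as the count of one allele (0, 1 or 2) and that the per-locus distance is the squared difference of those counts, so that identical genotypes score 0, homozygote versus heterozygote scores 1 and opposite homozygotes score 4. Loci with a missing call (`None`) in either sample are simply skipped, which is a simplification. The function names are hypothetical, and the exact GenAlEx options used are not stated in the source.

```python
def genotype_distance(a, b):
    """Sum of squared allele-count differences over loci called in both samples."""
    return sum((x - y) ** 2
               for x, y in zip(a, b)
               if x is not None and y is not None)


def distance_matrix(samples):
    """Return sample names and the symmetric pairwise distance matrix.

    `samples` maps a sample name to its list of 0/1/2 genotype codes.
    """
    names = list(samples)
    matrix = [[genotype_distance(samples[i], samples[j]) for j in names]
              for i in names]
    return names, matrix
```

A matrix of this kind is the input to the principal coordinates analysis; the ordination itself (centering and eigen-decomposition) is omitted here.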

A model-based clustering algorithm implemented in the STRUCTURE software program46 was applied to the SNP data. This algorithm attempted to identify genetically distinct subpopulations based on allele frequencies. The admixture model was applied and the number of clusters (K-value), indicating the number of subpopulations the program attempted to find, was set from 1 to 10. The analyses were carried out without assuming any prior information about the genetic group or geographic origin of the samples. Ten independent runs were assessed for each fixed number of clusters (K), each consisting of 1×10^6 iterations after a burn-in of 2×10^6 iterations. The ΔK value was used to detect the most probable number of clusters and the computation was performed using the online program STRUCTURE HARVESTER.47,48 Of the 10 independent runs, the one with the highest Ln Pr (X|K) value (log probability or log likelihood) was chosen and represented as bar plots.
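The ΔK calculation can be sketched in a few lines of Python. This is a simplified illustration rather than the STRUCTURE HARVESTER implementation: it assumes ΔK is the absolute second-order difference of the mean Ln Pr (X|K) across replicate runs, divided by the sample standard deviation of Ln Pr (X|K) at that K. The function name and input layout are hypothetical.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Compute Delta K for every K that has both neighbors K-1 and K+1.

    `lnp_by_k` maps K to the list of Ln Pr(X|K) values from replicate runs.
    """
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # the first and last K have no second-order difference
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        sd = stdev(lnp_by_k[k])  # sample standard deviation (assumption)
        if sd > 0:
            result[k] = second_diff / sd
    return result
```

The most probable K is then the one with the largest value, for example `max(dk, key=dk.get)` for a result `dk`. Because the first and last tested K have no defined ΔK, the statistic cannot select K=1.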

To analyze the genetic diversity in the wild and cultivated longan germplasm groups, the intrapopulation genetic diversity was measured by gene diversity (Hs),49 observed heterozygosity (Ho) and FIS50 using GenAlEx 6.5.41,42 The difference between wild and cultivated longan germplasm was measured using Fst, as implemented in the same program. In addition, analysis of molecular variance (AMOVA) was used to compare the size of molecular variance in wild and cultivated longan germplasm.
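To make the idea of between-group differentiation concrete, the following Python sketch computes a simple Nei-type fixation index for two groups, (Ht − Hs)/Ht, summed over biallelic loci. This is deliberately not the AMOVA-based estimate produced by GenAlEx and will generally give somewhat different values; it is shown only to illustrate the quantity being measured. Genotypes are assumed to be coded as 0/1/2 allele counts, with `None` for missing calls, and the function names are hypothetical.

```python
def allele_freq(genotypes):
    """Frequency of the counted allele among called genotypes (0/1/2 codes)."""
    called = [g for g in genotypes if g is not None]
    return sum(called) / (2 * len(called))


def nei_fst(pop_a, pop_b):
    """Nei-type Fst = (Ht - Hs) / Ht, using sums over loci.

    `pop_a` and `pop_b` are lists of loci; each locus is a list of
    0/1/2 genotype codes for the individuals of that group.
    """
    hs_sum = 0.0
    ht_sum = 0.0
    for locus_a, locus_b in zip(pop_a, pop_b):
        pa = allele_freq(locus_a)
        pb = allele_freq(locus_b)
        # Mean within-group expected heterozygosity (unweighted).
        hs_sum += (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2
        # Expected heterozygosity of the pooled (mean) allele frequency.
        p_bar = (pa + pb) / 2
        ht_sum += 2 * p_bar * (1 - p_bar)
    return (ht_sum - hs_sum) / ht_sum if ht_sum > 0 else float("nan")
```

Two groups fixed for alternative alleles give a value of 1, and two groups with identical allele frequencies give 0.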

Results

SNP discovery

A total of 80 186 mRNA nucleotide sequences from longan and lychee were gathered as previously described. The CAP3 program was used with default parameters to assemble the sequences into 10 001 contigs and 55 961 singlets, with an average of 2.42 sequences per contig. Putative SNPs were detected in only 141 of these contigs using the QualitySNP program. All of these selected clusters included a minimum of six EST sequences. In total, we obtained 1560 putative SNPs, including 70 C/T, 84 A/G, 24 A/T, 21 A/C, 21 T/G, 20 C/G, 1320 Indel and 2 high tri-allelic polymorphisms. To select high-quality SNPs for validation, candidate SNP sites were filtered to retain those with at least 50 bp of sequence before and after the site. We calculated the number of all sequences in a cluster and the number containing the SNP type in this cluster. We then selected 60 SNPs for validation by genotyping a test panel of longan varieties, including both cultivated varieties and wild populations. Among the 60 SNPs, 33 were from longan, 17 were from lychee and the remaining 10 SNPs were found in both longan and lychee.

Frequency of SNP markers and descriptive statistics

Out of the chosen 60 SNP markers, 52 were successfully genotyped. The failure of the remaining eight SNPs was likely due to sequence complexity or the presence of polymorphisms within the flanking sequences. However, among the successful SNPs, 27 were monomorphic across the 68 longan samples (i.e. only one SNP variant was identified in all individuals). These monomorphic markers may have resulted from errors in transcriptome sequencing, which then led to incorrect identification of SNPs. It is also possible that some of these SNPs correspond to rare alleles that were not present in the analyzed longan varieties.

A total of 25 polymorphic SNPs were retained for further analysis. These 25 SNPs were reliably scored across the validation panel and were thus considered true SNPs. Out of the 25 polymorphic SNPs, 22 were longan SNPs and 3 were SNPs shared by both longan and lychee. In contrast, the lychee SNPs either failed to amplify or showed no polymorphism in the test panel. The flanking sequences and SNPs of the 25 selections are listed in Table 2. The minor allele frequencies of these SNPs ranged from 0.061 to 0.458 with an average of 0.307. The mean information index was 0.584, ranging from 0.230 to 0.690. The observed heterozygosity ranged from 0.100 to 0.875 with an average of 0.406, whereas the mean expected heterozygosity was 0.400, ranging from 0.115 to 0.497 (Table 3).
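The per-locus statistics summarized above can be reproduced from genotype calls with a short Python sketch. It assumes biallelic genotypes coded as the count of one allele (0, 1 or 2, with `None` for missing calls) and uses the usual definitions: expected heterozygosity He = 1 − Σp², Shannon's information index I = −Σ p ln p, and inbreeding coefficient F = 1 − Ho/He. The function name and the dictionary keys are hypothetical.

```python
from math import log


def locus_stats(genotypes):
    """Descriptive statistics for one biallelic SNP locus."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        raise ValueError("no called genotypes at this locus")

    p = sum(called) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    ho = sum(1 for g in called if g == 1) / n  # observed heterozygosity
    he = 1 - p * p - q * q                     # expected heterozygosity
    shannon = -sum(f * log(f) for f in (p, q) if f > 0)
    f_is = 1 - ho / he if he > 0 else None     # undefined when monomorphic

    return {"maf": min(p, q), "Ho": ho, "He": he, "I": shannon, "F": f_is}
```

Under these definitions, a locus with a minor allele frequency of 0.458 has an information index of about 0.690 and an expected heterozygosity close to 0.497, and one with a minor allele frequency of 0.061 has values of about 0.230 and 0.115. These agree with the ranges reported for the 25 loci.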

Table 2 The flanking sequences and SNPs of the 25 polymorphic markers.
Table 3 Minor allele frequency, information index, heterozygosity and inbreeding coefficient of the 25 SNP loci scored on 50 longan accessions.

Cultivar identification

SNP profiles of the multiple trees from the same longan cultivar showed that genotyping results were highly consistent (Table 4). ‘Clonality’ for multiple trees within each cultivar was confirmed in varieties ‘Tiger Eye’ (HDIM 2), ‘Fuk Yan’ (HDIM 4), ‘E Daew’ (HDIM 7), ‘Haew’ (HDIM 8), ‘Sri Chompoo’ (HDIM 9), ‘Selection 7803’ (HDIM 11), ‘Egami’ (HDIM 20), ‘Biew Kiew’ (HDIM 23) and ‘Biew Kiew’ (HDIM 26). The multilocus matching also detected an off-type in the cultivar ‘E Wai’ (HDIM 24), in which two different genotypes were found. The probability that two longan varieties will have the same genotype at the 25 SNP loci is approximately 1 in 100 000 for the tested longan varieties, as computed by the multilocus matching procedure implemented in GenAlEx 6.5.41,42
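The logic of multilocus matching can be sketched in Python as follows. This is a minimal illustration with hypothetical function names and input layout, not the GenAlEx procedure: samples are grouped when their genotypes are identical at every locus, and a cultivar is flagged when its trees carry more than one multilocus genotype. Genotypes are assumed to be two-letter strings such as "AG", and missing data are not handled.

```python
from collections import defaultdict


def _profile(genotype):
    """Normalize a profile so that 'GA' and 'AG' are treated as the same call."""
    return tuple("".join(sorted(call)) for call in genotype)


def group_by_genotype(profiles):
    """Group sample names that share an identical multilocus genotype."""
    groups = defaultdict(list)
    for sample, genotype in profiles.items():
        groups[_profile(genotype)].append(sample)
    return [sorted(members) for members in groups.values()]


def off_type_cultivars(profiles, cultivar_of):
    """Return cultivars whose trees carry more than one multilocus genotype."""
    seen = defaultdict(set)
    for sample, genotype in profiles.items():
        seen[cultivar_of[sample]].add(_profile(genotype))
    return sorted(c for c, genotypes in seen.items() if len(genotypes) > 1)
```

A cultivar flagged by `off_type_cultivars` corresponds to the situation described for ‘E Wai’, where the trees under one label do not share a single genotype.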

Table 4 Examples of DNA fingerprints based on the full array of 25 SNPs for longan tree genotype identification.

Genetic diversity in cultivated and wild longan germplasm

After excluding the duplicated samples, the genetic relationships among the 50 longan germplasm accessions (25 wild accessions and 25 cultivated genotypes) are presented in the principal coordinates analysis plot (Figure 1). Each of the accessions has a unique SNP profile. The 50 accessions fall into two clearly different clusters without overlapping. The first cluster includes all the cultivated germplasm and the second one includes all wild germplasm from Yunnan, China. Within the cultivated germplasm, the PCoA separates two distinct subclusters. The first subcluster comprised mainly the varieties from Southern China, including ‘Chu Leon’, ‘Tai Wu Yuen’, ‘Sak Ip’ and ‘Fuk Yan’, as well as the Hawaii cultivar ‘Ikeda’ (HDIM 16). The second subcluster included all Thai varieties, as well as the other varieties from Hawaii. The only exception is the Chinese cultivar ‘Chaer Jum’ (HDIM 22), which falls into the Thailand/Hawaii subcluster.

Figure 1

PCoA plot of 50 longan accessions including 25 cultivated varieties from USDA longan collection in Hilo, Hawaii and 24 wild trees collected from Mangshi, Yunnan Province, China. The plane of the first three main PCO axes accounted for 61.0% of total variation. First axis=41.1% of total information, the second=11.9% and the third=8.0%.

Population stratification of the 50 varieties, based on the ΔK value computed by STRUCTURE HARVESTER,48 revealed two clusters as the most probable number of K (Figures 2 and 3) and this partitioning was fully compatible with the principal coordinates analysis (Figure 1). All the wild germplasm were assigned to one Bayesian cluster, whereas the cultivated germplasm were grouped in another single Bayesian cluster. The only exception is accession ‘No 2-13 (taller)’, which appeared as a hybrid genotype between the cultivated and wild longan groups. To further illuminate the diversity within the cultivated germplasm, the clustering result at K=3 is also presented in Figure 3. The wild germplasm remained as a single cluster at K=3, but the cultivated longan were split into two subclusters, revealing the difference between the Chinese and Thailand/Hawaii accessions. In addition, several hybrid-like accessions that combined both Chinese and Thailand parentage were observed at K=3. These include ‘Ponyai’, ‘Diamond River’ and the aforementioned ‘No 2–13 taller’, which showed significant contribution from Yunnan wild germplasm (Figures 3 and 4).

Figure 2

Plot of ΔK (filled circles, solid line) calculated as the mean of the second-order rate of change in likelihood of K divided by the standard deviation of the likelihood of K, m|L ″(K)|/s[L(K)].

Figure 3

Inferred clusters in the longan varieties using STRUCTURE, where K is the potential number of genetic clusters that may exist in the overall analyzed longan accessions. Each vertical line represents one individual multilocus genotype. Individuals with multiple colors have admixed genotypes from multiple clusters. Each color represents the most likely ancestry of the cluster from which the genotype or partial genotype was derived. Clusters of individuals are represented by colors.

Figure 4

Partition of total molecular variance between the cultivated and the wild germplasm groups using AMOVA. Number of permutations=9999.

The key descriptive statistics for the SNP loci are presented in Table 3, and the level of genetic diversity in cultivated and wild longan germplasm is presented in Table 5 and in Figure 4. Between the cultivated and the wild germplasm groups, gene diversity (expected heterozygosity), observed heterozygosity and inbreeding coefficient were all comparable. However, significant population differentiation was found by the contingency table test of Weir and Cockerham51 (Fst=0.300, P<0.001). AMOVA showed that both the within-collection and the between-collection variations were highly significant (P<0.001). Twenty-seven percent of the total molecular variance was due to difference between the two germplasm groups, whereas 73% was partitioned within collections. The estimated molecular variance was 191.6 in the wild population and 235.7 in the cultivated germplasm groups (Table 5).

Table 5 Comparison of genetic diversity (gene diversity, observed heterozygosity and molecular variance) in cultivated and wild longan germplasm.

Discussion

Genomic research in longan has been scarce and advanced molecular tools to support germplasm management are not available. Developing SNP markers from transcriptome sequences has been considered an efficient strategy for nonmodel species.52,53 In the present study, we identified 60 SNP markers based on the transcriptome sequences of embryos at various developmental stages and validated them using a diverse panel of cultivated and wild germplasm. Although the transcriptome sequences were derived from the embryos of a single cultivar (Honghezi),33 we were able to obtain a moderate rate of success for marker validation, which suggests that a higher success rate could be achieved if the transcriptome sequences were based on multiple genotypes. This approach for SNP marker development, therefore, can serve as a fast alternative for species lacking abundant genomic resources. As shown in the present study, even a small set of SNP markers can significantly improve the accuracy and efficiency of germplasm management.

Longan genotype identification

Unambiguous identification of genotypes is a concern for longan germplasm management, breeding and propagation of planting materials.7,8 The present study demonstrated that a set of only 25 SNP markers was effective for assessing the genetic identity of longan germplasm. Results from multiple trees of the same cultivar showed 100% concordance, demonstrating that the nanofluidic system is a reliable platform for generating longan DNA fingerprints with high accuracy. However, because a major fraction of the germplasm maintained in the USDA longan collection was directly or indirectly introduced from China, Thailand and other Asian countries, the reference standards need to be established based on the ‘original living trees’ of these accessions in China and Thailand. For example, there were two genotypes labeled as ‘E Wai’ (FI-R14-T3 and FI-R14-T4), but the authentic genotype could not be determined without knowing the genotype of the original reference tree. Therefore, assessment of genetic identity in this study was limited to duplicate identification.

Genetic diversity in wild and cultivated longan

The level of genetic diversity in the wild population is lower than in the cultivated germplasm group, as reflected by gene diversity and molecular variance. This result could be explained by the fact that the wild germplasm came from a single population collected from a single location in Yunnan, China. In contrast, the cultivated germplasm comprised varieties originally from Thailand, China and possibly other Asian countries. Nonetheless, the PCoA and the Bayesian clustering analysis both clearly separated the analyzed longan accessions into wild and cultivated clusters. This difference was further quantified by AMOVA, where a significant genetic difference (Fst=0.300; P<0.001) was found. The large difference indicates that, in spite of the available wild germplasm in southwest China, little has been integrated into the longan cultigens so far. The present result thus supports the notion that there remains a large amount of untapped genetic diversity in the primary gene pool of longan, including southwest China.4,5,54,55 It also supports the observation of Lin et al.,7 who reported relatively low levels of genetic variation in the Chinese varieties of longan and hypothesized that the Chinese longan varieties might have suffered a bottleneck during domestication. Wild longan populations have been reported in several regions in southern China, including Guangxi,56 Hainan57 and Yunnan.58 Wild longan fruits differ morphologically from cultivated ones, having small fruit size, warty fruit skin, thin pulp and large seed.5 This wild longan germplasm potentially harbors new genes/alleles for agronomic traits, such as resistance/tolerance to biotic and abiotic stresses. Introgression of the wild germplasm would effectively broaden the genetic background of the cultivated longan. Moreover, given the severe genetic erosion in southwest China due to the rapidly diminishing forests, it is urgent to develop ex situ and in situ conservation plans to ensure proper maintenance of the wild populations.

Among the 25 cultivated accessions, PCoA and the Bayesian approach (K=3) both separated the Chinese germplasm from Thai and Hawaiian varieties, which illustrated the geographic differentiation between Chinese and Thailand longan germplasm. The majority of the Hawaii varieties showed closer affinity to the Thai varieties, indicating parentage or ancestry of Thai germplasm. This result is compatible with that of Lin et al.,7 who showed that the Thai cultivar ‘Miaoqiao’ was different from the 40 Chinese varieties. The same result was reported by Zhong et al.,59 who analyzed 95 longan accessions from China and Thailand. Their result showed that the 95 germplasm accessions could be divided into two groups (i.e., longan from China and longan from Thailand). The difference is also compatible with the assessment of Crane et al.,6 who suggested that the higher chilling requirement of traditional Chinese varieties limited longan production in tropical regions, whereas the varieties from Thailand do not have this problem. Nonetheless, Bayesian clustering analysis (K=3) also revealed two hybrid-type varieties (‘Pongyan’ and ‘Diamond River’), which appeared to be admixed progenies derived from both Thai and Chinese longan parental varieties. In addition, cultivars ‘Chaer Jum’ and ‘No 2–13 taller’ were found to have significant contribution from the wild germplasm. However, so far we have insufficient information about the cultivated longan germplasm from Yunnan to assess the parentage of these cultivated accessions. This information gap will be filled with ongoing research on molecular characterization of longan germplasm in China and Southeast Asia.

In conclusion, we conducted a pilot study on the development of SNP markers for longan and employed them for varietal genotyping, using a nanofluidic array. This technology enabled us to generate high-quality SNP profiles for the purpose of longan varietal identification and genebank management. Our results also revealed a significant genetic difference between wild and cultivated longan germplasm. To our knowledge, this is the first study to apply SNP markers in longan. New efforts to develop more SNP markers are underway, in order to make a comprehensive assessment of genetic diversity in longan and to map quantitative trait loci for important agronomic traits in this crop. This information will be useful for verification of longan varieties and thus has significant potential for practical application.