DNA barcoding of medicinal orchids in Asia

Growing popularity of herbal medicine has increased the demand of medicinal orchids in the global markets leading to their overharvesting from natural habitats for illegal trade. To stop such illegal trade, the correct identification of orchid species from their traded products is a foremost requirement. Different species of medicinal orchids are traded as their dried or fresh parts (tubers, pseudobulbs, stems), which look similar to each other making it almost impossible to identify them merely based on morphological observation. To overcome this problem, DNA barcoding could be an important method for accurate identification of medicinal orchids. Therefore, this research evaluated DNA barcoding of medicinal orchids in Asia where illegal trade of medicinal orchids has long existed. Based on genetic distance, similarity-based and tree-based methods with sampling nearly 7,000 sequences from five single barcodes (ITS, ITS2, matK, rbcL, trnH-psbA and their seven combinations), this study revealed that DNA barcoding is effective for identifying medicinal orchids. Among single locus, ITS performed the best barcode, whereas ITS + matK exhibited the most efficient barcode among multi-loci. A barcode library as a resource for identifying medicinal orchids has been established which contains about 7,000 sequences of 380 species (i.e. 90%) of medicinal orchids in Asia.

www.nature.com/scientificreports/ to understand their original locality (natural habitats) and to monitor that specific area for conservation and sustainable management.
Medicinal orchids are usually traded in the form of dried or fresh parts of plants such as pseudobulbs, tubers, leaves, stems and flowers. Morphological characters of these dried or fresh parts of different species look very similar and are difficult to differentiate from each other, for example, tuber of Gymnadenia and Dactylorhiza ( Fig. 2A,B), stem of different species of Dendrobium (Fig. 2C,D), leaves of Vanda and Aerides, pseudobulb of many species of Coelogyne and Bulbophyllum. Accurate identification of orchid species from such dried or fresh materials based on morphological observation is almost impossible. To overcome this problem, DNA barcoding method can play a vital role for proper species level identification of medicinal orchids. Although few studies have been conducted on DNA barcoding of orchids using nuclear and plastid markers [26][27][28][29][30][31] , no concrete effort has been made so far focusing on DNA barcoding of medicinal orchids. Furthermore, in the vast majority of prior molecular works on orchids (in DNA barcoding and phylogenetic analysis), genomic DNA was extracted from leaf samples (e.g. [28][29][30][31]. But for the medicinal orchids, DNA extraction from tuber, stem and pseudobulb are equally important because these parts are major commodities of trade. Therefore, this research aims to evaluate DNA barcoding of medicinal orchids in Asia (where collection and trade of huge quantity of medicinal orchids have long existed) by using five barcodes (ITS, ITS2, matK, rbcL and trnH-psbA), which have been used in previous studies for DNA barcoding of angiosperms including orchids [29][30][31][32][33][34] . The specific objectives of this study were to: (1) test the use of existing protocol for DNA extraction from tuber, stem and pseudobulb of medicinal orchids and evaluate the success of amplification and sequencing, (2) assess efficacy of barcodes for identification of medicinal orchids and (3) establish a barcode dataset as a resource library for identification of medicinal orchids.

Results
Sequence success and characteristics. In this study, a total of 6986 sequences (including 431 newly generated sequences) were assembled to evaluate the five candidate barcodes (ITS, ITS2, matK, rbcL and trnH-psbA and their possible combinations). Sampling comprised 380 species belonging to 94 genera from five subfamily of Orchidaceae (Table 1). Our sample represents 90% species of medicinal orchids in Asia, which included all highly traded species. Sequence in the data matrix comprised 1823 (ITS), 1833 (ITS2 mainly excised from the aforementioned ITS sequences), 1414 (matK), 1109 (rbcL) and 807 (trnH-psbA) ( Table 2). Summary of taxon  www.nature.com/scientificreports/ sampling including sample ID, tissue source, gene bank accessions of newly generated and accessions of downloaded sequences are provided in Table 1, Table 2, Supplementary Table S1 and S2. In the aligned datasets, the maximum number of representative sequences of a species was limited to 20 individuals, and in the final data matrix about 80% species represented at least two individual sequences except for trnH-psbA. DNA extraction from tuber, pseudobulb and stem was 99% successful whereas extraction for the few samples that failed were mainly from stem and pseudobulb. The PCR amplification and sequencing success rates were high for ITS, rbcL (99%) and trnH-psbA (97%) (Table1). For matK, 96% of the sequences were successfully amplified and sequenced. Sequencing failed for 5 individuals because of polymorphic sites (double peaks) or a poly-G structure in the trace file. Failed sequences were re-sequenced. Some sequences were successfully obtained in the second attempt, but few samples still failed to generate readable sequences which were mainly from Orchidoideae. In total, about 97% of sequences were successfully sequenced. Sequence alignment was most consistent for rbcL and matK, followed by ITS. Conversely, trnH-psbA contains inversions as well as insertions and high level of sequence length variation, which makes the alignment extremely time consuming in comparison to other markers.

Discussion
In the vast majority of the prior studies of DNA barcoding of Orchidaceae, genomic DNA was extracted from leaves (e.g. [28][29][30][31] ) but illegal trade of medicinal orchids usually occurs by exporting their different parts such as tuber, stem, pseudobulb etc. Therefore, it is important to test and develop a protocol for DNA extraction from illegally traded parts of orchids for their accurate identification. In this study, we extracted DNA from the tuber, pseudobulb and stem samples, which were collected from traders or directly from wild habitats. Our study observed that DNA extraction is possible from tuber, pseudobulb and stem of medicinal orchids following existing protocol. This study found a high rate of PCR amplification and sequencing success for ITS, rbcL, matK as well as trnH-psbA, which is consistent with previous studies [29][30][31]34 . Comparatively, trnH-psbA seems to be slightly less successful due to presence of poly (T). Besides, some trnH-psbA sequences obtained with indels (i.e. insertion and inversions) caused complexity in data alignment. Such issue was also observed in previous studies 29,31 . Similarly, sequencing success rate of matK was relatively low due to presence of Poly (G). But generally, DNA extraction, PCR amplification and sequencing for ITS, matK, rbcL and trnH-psbA have no serious issues for DNA barcoding of medicinal orchids. Based on different analytical methods (genetic distance, BM, BCM, NJ), ITS performed the highest identification rate among the single barcode region (Fig. 4, Tables 3, 4, Supplementary Fig. S1). This rapidly evolving nuclear gene has the highest variable sites that contribute to the efficacy of species discrimination 36,37 . Our result is consistent with previous studies 29,31,32,36 . On the other hand, matK, rbcL and trnH-psbA (plastid barcodes) exhibited lower resolution than the ITS (nuclear barcode). This could be due to lower substitution rates found in the plastid region. Therefore, these plastid barcodes alone are not recommended for DNA barcoding of medicinal orchids. The low resolution of the plastid region has been reported in different seed plants including orchids 38,39 . In some studies 27,40 , other plastid barcodes such as ndhF, ycf1, trnL-trnlF (not evaluated in this study) were also found effective in DNA barcoding of Orchidaceae. However, in general these plastid barcodes are not commonly used in Orchidaceae. Therefore, additional studies are required to evaluate efficacy of ndhF, ycf1, trnL-trnlF for the DNA barcoding of the medicinal orchids. www.nature.com/scientificreports/ Different combinations of two and three markers from ITS, ITS2, matK, rbcL were analysed in this study. Due to the comparatively low number of sampled species and sequences along with several indels, we excluded trnH-psbA in concatenation that may create robust missing data in the data matrix. Such missing data may have a negative impact on the results of phylogeny or tree-based DNA barcoding [41][42][43][44] . Although few studies have proposed trnH-psbA as a key barcode 36,45 , in this study trnH-psbA performed weak in species resolution except in BM and BCM analysis (Table 4). Moreover, several problems of trnH-psbA such as high frequency of length variation and the presence of inversions and insertions (which create complications to use it as a DNA barcode) have been reported 46-48 . In the combinations of two barcode region, ITS + matK exhibited highest degree of species discrimination capability (Table 3) which is consistent with previous studies 29,31 . As an alternative to ITS + matK, a combination of ITS + rbcL could be a supplementary choice for the DNA barcoding of medicinal orchids ( Table 4). The usefulness of this option (i.e. ITS + rbcL) is important particularly when matK amplification fails. The use of matK as a barcode has been criticized mainly because primers may be taxon specific or universal primers may not be available for all taxa 49 . Although matK amplification is not an immense problem in orchids, some studies from other groups of plants reported that matK sequencing is successful only after using up to 10-primer pairs 47 .
The combination of matK + rbcL exhibited relatively low efficacy for species identification, possibly because these two plastid markers are more conserved and lack sufficient variable sites. Therefore, matK + rbcL cannot be an ideal barcode for the medicinal orchids. Similar results were also reported in previous studies in different groups of plants 32,38 . By contrast, matK + rbcL was proposed as a core barcode for land plants 36,46,50 . Our samples were delimited to a single family (Orchidaceae) i.e. relatively more closely related species than in the sampling by 36,46,50 , indicating that matK + rbcL is not effective for barcoding of medicinal orchids.
Combinations of 3-locus candidates were unable to increase resolution rates, as they exhibited comparatively lower resolution than the combinations of 2-locus. Such kinds of results have been also reported in the previous studies 30,31 . Thus, combinations of 3-barcode regions are not recommended as efficient DNA barcode for the identification of medicinal orchids.
In this study, some species belonging to Coelogyne, Cymbidium, Dendrobium (Epidendroideae) and Goodyera (Orchidoideae) were not correctly identified by NJ method with majority of markers (Fig. 4, Supplementary  Figs. S2-S12). Within the Coelogyne, two species i.e. Coelogyne fimbriata and Coelogyne ovalis were not distinctly identified. These two species are morphologically very similar and also share their geographical distribution www.nature.com/scientificreports/ ranges, which may lead to misidentification during sample collection. Besides, taxonomic identity of these two species is based on morphological studies but lacks assessment at molecular level, therefore likely to be the same species as assigned in taxonomic revision of Coelogyne section Fuliginosae (Orchidaceae) 51 . In the Dendrobium, D. moniliforme was not monophyletic, although a species should be monophyletic for the effectiveness of DNA barcoding 52 . D. moniliforme was also not resolved in the phylogenetic analysis of Dendrobium 23 where several samples with the name 'Dendrobium moniliforme' were nested into different lineages. Possible reason could be improper taxonomic treatment or existence of cryptic species within D. moniliforme. In the Goodyera, G. kwangtungensis failed to distinguish itself from G. schlechtendaliana. It may be due to incorrect identification during sample collection possibly caused by their similar morphological characters. Another possibility is that these two species may be the same; the assignment of separate taxonomic identity of these species may be due to presence of ambiguous characters from the result of hybridization and polyploidization. G. kwangtungensis and G. schlechtendaliana are also not resolved in the molecular phylogenetic analysis 53 . Further studies are necessary to clarify the taxonomic status of these unresolved species.
A perfect DNA barcode usually should exhibit high interspecific but low intraspecific distances 36,45 , which can be clearly demonstrated either using histograms (for e.g. 31,40 ) or scatter plots (for e.g. 29,32,33 ). In this study, histograms approach failed to detect clear barcoding gaps (Supplementary Fig. S1), indicating that intraspecific genetic distance and interspecific genetic distance distributions overlapped with each other. This result is also in line with previous studies 29, [32][33][34] . Conversely, barcoding gaps were detected using scatter plots where ITS performed the best among five single barcodes and ITS + matK presented the best among the multi-locus barcodes (Fig. 3). In the previous studies, barcode gaps were detected using scatter plots in varying degree between the barcodes 29,32,33 . The results of this study revealed that a rapidly evolving gene ITS is a powerful barcode in DNA barcoding gap assessment as well as efficacy of species identification success rate of medicinal orchids. Besides, ITS region is necessary in each of the most powerful multi-locus barcodes (i.e. ITS + matK and ITS + rbcL) indicating that ITS (having maximum intra-and interspecific genetic divergence comparisons) plays an important role to enhance barcode performance. Therefore, we recommend ITS region to be incorporated into the core barcode of medicinal orchids. This condition was also suggested by authors such as 38,45 , and strong positive effects of ITS locus have been reported in prior studies in different groups of plants 29,31,39 . Although few concerns have been raised about the use of ITS locus mainly due to fungal contamination 54,55 , but in orchids, ITS amplification and sequencing are already established and most commonly used in vast majority of molecular phylogenetic as well as DNA barcoding studies (for e.g. 26,29,31 ).

Conclusions
Based on three different analytical methods (genetic distance, similarity-based and tree-based) with sampling nearly 7000 sequences from five single barcodes (ITS, ITS2, matK, rbcL, trnH-psbA and their seven combinations), this study revealed that DNA barcoding is effective for identifying medicinal orchids. Among single locus, ITS performed the best barcode (amplification, sequencing and species identification). Among combined barcode loci, ITS + matK exhibited the most efficient barcode for the DNA barcoding of medicinal orchids. Alternative to ITS + matK, a combination of ITS + rbcL could be another multi-locus barcode option. This study indicated that a rapidly evolving gene ITS is important for the DNA barcoding of the medicinal orchids. Based on genetic distance analysis, we also suggest using scatter plots instead of histograms to detect the presence of DNA barcoding gaps in the medicinal orchids. Furthermore, the success rate of amplification and sequencing is high and the existing protocol is applicable for DNA extraction from tuber, stem and pseudobulb of medicinal orchids. A barcode library (assembling sequences from five loci ITS, ITS2, matK, rbcL and trnH-psbA) as a resource for identifying medicinal orchids has been established which contains about 7,000 sequences of 380 species (i.e. 90%) of medicinal orchids of Asia. Future studies will enhance this barcode library mainly by adding sequences from the remaining 10% species.

Methods
Sampling strategy. We compiled a checklist of medicinal orchids from Asia based on published literatures (e. g. 10,11 , and then the accepted species name was assigned following The Plant List and various recently published papers. Medicinal orchids in the checklist (within Asia) comprised 422 species and represent all five subfamily (i.e. Apostasioideae, Vanilloideae, Cypripedioideae, Orchidoideae and Epidendroideae) of Orchidaceae.
In total, 6,555 sequences were retrieved from the National Center for Biotechnology Information (NCBI) (see Supplementary Table S1). For this, priorities were given to those sequences, which are already published in papers or have provided voucher specimens so that misidentified sequences can be avoided. In cases where large numbers of sequences were available for a species per marker, we selected the ones, which have good quality, the longest sequences and represent from different geographical regions. Furthermore, the downloaded sequences from NCBI were filtered according to the following criteria: (1) omitted sequences having length less than 300 bp but this criteria is not applied for some ITS2 region; (2) excluded sequences lacking voucher specimens; and (3) discarded sequences having taxa without specific names (such as Habenaria sp. and Bulbophyllum cff. etc.).
In this study, 431 sequences were newly generated from 134 individuals representing 48 species that were collected from different localities of Nepal (mainly from community based forest). The national guidelines were followed for the collection and use of plants. The plant samples collected for the present study are currently neither included in the IUCN red list nor listed as protected plants. Although these plants are included in CITES Appendix II, there was no any provision to take collection permit during the time of fieldwork. The localities of species collected in this work are not from protected area; hence no permits were required. However, we did inform the related community forest user groups (local institutions under the district forest based on Forest Act www.nature.com/scientificreports/ 1993 enacted by Ministry forest and Environment, Government of Nepal) and took verbal consent for specimen collection. We reviewed related floras, monographs and compared specimens with printed as well as online images including available images of type specimens for the species identification. Species were formally identified by Dr. Bhakta Bahadur Raskoti, Biodiversity Conservation Initiative, Nepal. All newly generated sequences have been submitted to NCBI (Supplementary Table S2). Voucher specimens were deposited in National Herbarium and Plant Laboratories (KATH), voucher number are also available publicly in NCBI gene bank accession records. DNA extraction, PCR amplification and sequencing. Total genomic DNA was extracted from plant leaves dried in silica-gel following modified CTAB protocol 56 . We also extracted genomic DNA from tuber, stem, pseudobulb (total 47 samples from 20 species) dried in silica-gel. For this DNA was extracted following STE-CTAB protocol 57 . Amplification of DNA regions was performed using a polymerase chain reaction (PCR) following the reference 58 . The sequencing reactions were performed using the Applied Bio-systems Prism Bigdye Terminator Cycle Sequencing (Applied Bio-systems, Foster City, CA) following the manufacture's instructions. Primer pairs for PCR and sequencing used in this study are provided in Supplementary Table S3. Data analysis. Forward and reverse sequencing output files were edited and assembled using ContigExpress Application 6.0 (InforMax, Inc.). Assembled sequences were initially aligned using Clustal X 59 and then manually adjusted in BIOEDIT version 7 60 . Altogether twelve barcodes were evaluated including five single loci (ITS, ITS2, matK, rbcL, trnH-psbA) and seven combinations (ITS + matK, ITS + rbcL, ITS2 + matK, ITS2 + rbcL, matK + rbcL, ITS + matK + rbcL, ITS2 + matK + rbcL) using following methods.
Genetic distance-based method. The genetic pairwise distance for each marker was calculated in MEGA X 61 using the Kimura 2-parameter model, and we investigated the minimum interspecific distance and maximum intraspecific distance for each species using custom R script. To detect the barcode gaps, scatter plots were generated using R version 3.6.3 62 . In scatter plots, each dot represents a species and the dot above the 1:1 slope indicates a barcoding gap 32,33,63 . We counted the number of species having barcoding gaps for each marker; finally these barcode gaps were calculated in percentage. We also used histograms to detect barcoding gaps for every single and multi-loci barcode. Histograms were generated from the distribution of intraspecific and interspecific genetic distances obtained from pairwise summary function using the program TaxonDNA 64 .
Similarity-based method. To assess the proportion of accurate species identification, best match (BM) and best close match (BCM) functions were implemented in the TaxonDNA 64 . For BM analysis, identification was considered correct when query and best match sequences were from the same species, ambiguous when they were from both the same and different species, or incorrect when they belonged to different species 33,64 . For BCM, species identification was considered correct if a query matched all conspecific sequences within the 95% pairwise genetic threshold 33,64 . In BM and BCM analysis we deleted all species represented by a single sequence.
Tree-based method. To evaluate discriminatory power of single and multi-locus barcodes, unrooted neighbour-joining (NJ) trees were constructed in MEGA X 61 . For this pairwise deletion based on the p-distance model following protocols for species level discrimination in the closely related species were applied 34,64,65 . A species was considered successfully identified only when all conspecific individuals formed a monophyletic clade.