DNA barcodes for delineating Clerodendrum species of North East India

The diverse genus Clerodendrum has a complex evolutionary history that leads to taxonomic confusion. Unlike traditional taxonomic methods, DNA barcoding could be a promising tool for the identification and conservation of Clerodendrum species. This study attempted to develop an efficient barcode locus for Clerodendrum species of North East India. We evaluated four barcode candidates (ITS2, matK, rbcL, ycf1) and their combinations in different Clerodendrum species. The reliability of the barcodes in distinguishing species was assessed using pairwise genetic distances, intra- and inter-specific diversity, the barcode gap, and phylogenetic tree-based methods. The results show that matK possesses the maximum number of variable and parsimony-informative sites (103/100), intra- (0.021 ± 0.001) and inter-specific (0.086 ± 0.005) divergences, and species resolution rate (89.1%), followed by ITS2, ycf1, and rbcL. Among the combined loci, ITS2 + matK showed the best species discrimination with distinct barcode gaps. Therefore, we tentatively suggest the combination ITS2 + matK as the core barcode for Clerodendrum, and we converted it into quick response (QR) codes. These findings indicate that DNA barcoding could provide a consistent resource for species discrimination, help resolve taxonomic controversies in the genus, and offer a preliminary assessment of its biodiversity.

North East India is endowed with an enormous biodiversity of flora and fauna. Clerodendrum is a large, complex, and diverse genus with well-established pharmacological properties, and its ethnomedical importance has been reported in many indigenous systems of medicine 1 . Globally, 540 Clerodendrum species, comprising small trees, shrubs, and herbs, are distributed in tropical and sub-tropical regions 2 . Approximately 23 Clerodendrum species are found in India, of which 18 occur in North East India 3 . The genus was moved from Verbenaceae to Lamiaceae after its evolutionary boundaries were recircumscribed on the basis of molecular evidence 4 . Based on morphological variation, authors have classified the genus into distinct subgenera such as Clerodendrum and Cyclonema; in addition, several species names have been published by more than one author, such as C. floribundum Hort and C. floribundum R.Br., or C. foetidum Bunge and C. foetidum D.Don 5,6 . DNA barcoding could therefore serve as a molecular identifier for the proper documentation and classification of Clerodendrum. DNA barcoding uses short, standardized DNA regions (from the nuclear and/or cytoplasmic genomes) for the rapid and cost-effective authentication of species 7 . Unlike in animals, mitochondrial genes are an unsuitable choice of barcode marker in plants because of their low nucleotide substitution rates. Consequently, numerous nuclear and plastid genes have become the focus of researchers for identifying plant species 8 . So far, no single locus has emerged as a consensus universal barcode for land plants 9 . However, multi-locus barcode combinations could enhance discrimination between closely related species 10 . To date, no authenticated report on the use of DNA barcoding in Clerodendrum has been published.
In this study, we collected only nine species of Clerodendrum from different locations in North East India; the remaining species were not encountered during fieldwork because they are extremely rare and known from only a small number of locations.
The practical application of DNA barcoding has been constrained by the difficulty of retrieving information through direct scanning of DNA sequences 11 . The DNA sequences contain long strings of characters.
Characteristic analysis of each barcode locus. Ambiguous terminal sequences were deleted from the aligned sequences. The lengths of the aligned sequences for each locus and combination of loci ranged from 307 bp for ITS2 to 2455 bp for ITS2 + matK + rbcL + ycf1. Among the single loci, matK had the most variable and parsimony-informative characters, followed by ITS2. Among the combined loci, ITS2 + matK had the most variable and parsimony-informative sites (264/244) (Table 1). In this study, the mean inter-specific distances were much higher than the intra-specific distances. The pairwise intra-specific distances among the fifteen barcodes ranged from 0.0 to 0.044 ± 0.004, and the mean intra-specific distance was highest for matK (0.021 ± 0.001) and lowest for rbcL + ycf1 (0.001 ± 0.000). The pairwise inter-specific distances ranged from 0.0 to 0.151 ± 0.005, and the mean inter-specific distance was highest for matK (0.086 ± 0.005) and lowest for matK + rbcL (0.011 ± 0.003) (Table 2). In short, matK showed the highest mean intra- and inter-specific distances.
DNA barcode gap analysis. An ideal barcode should show a significant "barcode gap", defined as the separation between the ranges of intra- and inter-specific divergence 12 . The existence of a barcode gap was evaluated by comparing the distributions of intra- and inter-specific divergences at class intervals of 0.005 distance units. Among the fifteen barcodes, a significant barcode gap with the least overlap was observed for the plastid gene matK, the nuclear locus ITS2, and ITS2 + matK, whereas the other barcodes showed unclear gaps, with overlapping intra- and inter-specific distances (Fig. 1).
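To make the barcode gap idea concrete, the following is a minimal Python sketch, not the TaxonDNA procedure used in the study. It assumes a precomputed table of pairwise distances and a mapping from sequence identifier to species; the function names, the toy identifiers and the toy distances are all hypothetical and serve only as illustration.

```python
from itertools import combinations


def split_distances(distances, species):
    """Partition pairwise distances into intra- and inter-specific lists.

    distances: dict mapping frozenset({id_a, id_b}) -> distance
    species:   dict mapping sequence id -> species name
    """
    intra, inter = [], []
    for a, b in combinations(sorted(species), 2):
        d = distances[frozenset((a, b))]
        if species[a] == species[b]:
            intra.append(d)
        else:
            inter.append(d)
    return intra, inter


def barcode_gap(intra, inter):
    """Smallest inter-specific minus largest intra-specific distance.

    A positive value means the two ranges do not overlap (a clear gap).
    """
    return min(inter) - max(intra)


def histogram(values, width=0.005):
    """Count values per class interval of the given width."""
    counts = {}
    for v in values:
        # The small offset guards against floating-point rounding
        # at exact class boundaries.
        bin_index = int(v / width + 1e-9)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return counts


# Hypothetical toy data: two species with two sequences each.
toy_species = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
toy_distances = {
    frozenset(("s1", "s2")): 0.01,
    frozenset(("s3", "s4")): 0.02,
    frozenset(("s1", "s3")): 0.08,
    frozenset(("s1", "s4")): 0.09,
    frozenset(("s2", "s3")): 0.07,
    frozenset(("s2", "s4")): 0.10,
}

if __name__ == "__main__":
    intra, inter = split_distances(toy_distances, toy_species)
    print("gap:", barcode_gap(intra, inter))
    print("intra histogram:", histogram(intra))
    print("inter histogram:", histogram(inter))
```

Plotting the two histograms on a common axis would give a figure of the same kind as Fig. 1; overlapping bins correspond to the "unclear gaps" described above.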
Phylogenetic analyses. The barcode loci were analysed with BI, ML and NJ phylogenetic trees, which gave similar discriminatory results with reliable clade support. The posterior probability (PP) values of the BI trees were higher than the bootstrap values of the ML and NJ trees. The discriminatory success of the single- and multi-locus barcodes was estimated as the percentage of species resolved as monophyletic. The single- and multi-locus barcodes showed different levels of species discrimination, ranging from 33.3 to 93.2% (Table 4). Among the single loci, matK (BI-91.6, ML-91.6, NJ-91.6) followed by ITS2 (BI-84.6, ML-84.6, NJ-84.6) showed relatively high discrimination success rates, whereas rbcL (BI-60.2, ML-55.2, NJ-59.6) had the lowest. The combination of ITS2 and matK gave the highest discrimination success rate (BI-93.2, ML-91.9, NJ-93.2) among the combined barcodes. Hence, it could be concluded that species discrimination was high when matK was included in a combination. The phylogenetic tree of ITS2 + matK was reconstructed with the BI method, with the nodal support values of ML and NJ also shown (Fig. 2). In the phylogenetic tree, the Clerodendrum species were well separated from the outgroup and considered to be monophyletic. The tree was divided into 9 clades with moderate to high bootstrap and PP support values. Clade 1 consists of C. colebrookianum, with a BI support of 0.71 and 100% ML and NJ bootstrap support. Clades 3, 4, 6, 7, 8, and 9 form clear individual clusters for C. infortunatum, C. indicum, C. thomsoniae, C. philippinum, C. inerme, and C. serratum. C. cyrtophyllum and C. canescens in Clade 2 (BI-1.00, ML-73%, NJ-89%) and C. japonicum and C. paniculatum in Clade 5 (BI-1.00, ML-100%, NJ-100%) were grouped together within their respective clades, which indicates that they are closely related.
Thus, we tentatively propose that ITS2 + matK, with its significant barcode gap and strong discriminatory power, could be the most suitable barcode for Clerodendrum species.
Two-dimensional DNA barcode generation. At present, a "DNA barcode" refers to a DNA sequence, a format that is poorly suited to data storage, recognition, and information retrieval. This could be resolved with two-dimensional QR codes, which can represent DNA barcode sequences efficiently. The ITS2 + matK barcode sequences of the Clerodendrum species were converted into QR codes to benefit researchers with no prior knowledge of DNA barcoding (Fig. 3).

Discussion
An efficient barcode should be easy to amplify and sequence, and should provide high species discrimination and identification 13,14 . The four barcode markers used in this study are universal plant markers of suitable length that are cost-effective 15 . In the present study, we analysed both the Clerodendrum sequences generated here and sequences retrieved from GenBank. Among the four barcode loci, matK and ycf1 produced higher-quality sequences than ITS2 and rbcL. Among the markers, matK performed best, with high species resolution and clear barcoding gaps, followed by ITS2. The efficacy of matK as a core barcode for many plant genera is also supported by previous research and is attributed to its high variability and its high rate of molecular evolution compared with other coding barcode regions 16,17 . Moreover, ITS2 has been considered a complementary marker to the core barcodes 18,19 , and many studies have reported its high variability in discriminating species 20,21 . ITS2 has been proposed as a standard marker on the basis of the identification of more than 6,600 plant specimens from 753 genera, and as a universal barcode for medicinal plant species 22,23 .
The plastid gene ycf1 was recently proposed as an effective barcode marker in angiosperms because of its high variability 24 . This gene has been reported as a probable phylogenetic marker for plants such as pines and orchids 25 . Likewise, the chloroplast coding region rbcL was proposed as a universal barcode marker for ferns, mosses, and angiosperms 7,26 . In some recent studies, however, rbcL was reported to be an unsuitable barcode marker because of its low inter-specific variation, even among closely related species 20,27 . In this study, both markers had the lowest numbers of variable sites and parsimony-informative sites and the lowest species discrimination rates. The limitations of these chloroplast markers hinder species discrimination, as they represent only maternally inherited variation 28 . Thus, ycf1 and rbcL do not appear suitable for DNA barcoding in Clerodendrum species.
Several combinations of two, three and four barcodes were analysed in this study. The combination matK + rbcL was proposed as the universal barcode for all land plants by CBOL in 2009, but in this study it had the lowest species resolution among all the combined barcode markers because of its low substitution rates. In contrast, the combination ITS2 + matK gave the highest percentage of species resolution with clear barcode gaps compared with both the single markers and the other combinations, consistent with previous findings 29, 30 . The barcoding gap between the highest intra-specific value and the lowest inter-specific value can indicate the limits of species variation within a genus, allowing a species threshold to be set 31,32 . Overlap around the threshold value may signify cryptic species or insufficient variation in the barcode. Among the single- and multi-locus barcodes, matK and ITS2 + matK possess clearer barcode gaps than the other barcode markers. The best match, best close match and all species barcodes statistics in TaxonDNA were used to evaluate the rate of species identification 9,33 , and ITS2 + matK followed by matK showed the highest rates of species discrimination. Based on the phylogenetic tree methods, ITS2 + matK gave the maximum rate of species resolution in Clerodendrum. Similar levels of resolvability across different tree methods were reported in Lamiaceae 34 .
DNA barcoding currently faces limitations in practical applications owing to the lack of information compression and the difficulty of retrieving information through direct scanning of DNA sequences 35 . Therefore, an easy, innovative format that allows rapid retrieval of barcode information is needed. Barcode technology has been well established in the manufacturing and retail industries for a couple of decades.
A QR code carries information in both the vertical and horizontal directions, and therefore holds more data than a one-dimensional barcode of vertical lines (which stores a maximum of 20 digits). The technique can detect symbols that lead to a specific product. If this technology were applied to DNA barcode sequences, it could allow efficient retrieval of information with the largest coding capacity and a high compression ratio, as reported by Liu et al. In this study, ITS2 + matK barcode sequences for 9 Clerodendrum sp. were converted into QR codes carrying the sequence information. The QR codes could be used to track the different Clerodendrum species from their origin, even in the field, and to help detect mislabelling and ensure the safety of commercial products.
Hence, the barcode ITS2 + matK could be used as the preferred locus for determining species boundaries in Clerodendrum. These results should not be generalized to all Clerodendrum species, as this study was restricted to the North East region of India, but they could lay a foundation for the wider use of DNA barcoding in plants. The success rate of species identification could be confirmed more firmly if more species were included 36 . Therefore, the effects of geographical location and sample size on species identification should be investigated further. In the coming years, these findings could help in delineating the large genus Clerodendrum.
(Table 5). For bidirectional sequencing, the amplified products were sent to Eurofins Genomics India Private Limited with the same primers to resolve ambiguities. Additionally, we retrieved all available sequences (ITS2, matK, rbcL and ycf1) of Clerodendrum from the NCBI database. Downloaded sequences were excluded if (i) they were shorter than 300 bp or (ii) they lacked specific voucher names. According to our survey, some species had fewer than five sequences in NCBI, while others had a large number of sequences for a specific barcode region. Therefore, the number of representatives for each species was restricted to between five and eighteen samples. The taxa, voucher names and accession numbers used in this study are provided in Table S1.
41 . Differences between intra- and inter-specific distances for the four single barcodes were evaluated using pairwise distance matrices in MEGA X. The presence of a barcoding gap, which identifies an ideal barcode, was assessed by comparing the intra- and inter-specific distance distributions for each barcode candidate at an interval distance of 0.05 using the 'pairwise summary' function in TaxonDNA 42 .
In TaxonDNA, the best match, best close match, and all species barcodes functions were used to estimate the proportion of correct identifications for each barcode. The 'best match' analysis finds the closest match for each query sequence. If the query and its closest match were from the same species, the identification was considered correct; if they belonged to different species, it was considered incorrect 29 .
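The 'best match' criterion can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not TaxonDNA's code: the distance table, the identifiers and the function names are hypothetical, and the handling of ties (labelled "ambiguous" when equally close matches come from more than one species) is an assumption that the text above does not spell out.

```python
def best_match(query_id, distances, species):
    """Classify one query by the species of its nearest neighbour(s).

    distances: dict mapping frozenset({id_a, id_b}) -> distance
    species:   dict mapping sequence id -> species name
    Returns "correct", "incorrect" or "ambiguous".
    """
    others = [s for s in species if s != query_id]
    nearest = min(distances[frozenset((query_id, o))] for o in others)
    # Species of every sequence tied for the smallest distance.
    hits = {
        species[o]
        for o in others
        if distances[frozenset((query_id, o))] == nearest
    }
    if len(hits) > 1:
        return "ambiguous"  # assumed tie rule, see lead-in
    return "correct" if hits == {species[query_id]} else "incorrect"


def identification_rate(distances, species):
    """Percentage of sequences whose best match is conspecific."""
    results = [best_match(q, distances, species) for q in species]
    return 100.0 * results.count("correct") / len(results)


# Hypothetical toy data: two species with two sequences each.
toy_species = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
toy_distances = {
    frozenset(("s1", "s2")): 0.01,
    frozenset(("s3", "s4")): 0.02,
    frozenset(("s1", "s3")): 0.08,
    frozenset(("s1", "s4")): 0.09,
    frozenset(("s2", "s3")): 0.07,
    frozenset(("s2", "s4")): 0.10,
}

if __name__ == "__main__":
    print(identification_rate(toy_distances, toy_species))
```

A 'best close match' variant would additionally reject any query whose nearest distance exceeds a chosen threshold; that threshold is not given in the text, so it is left out of the sketch.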

Data analysis.
Phylogenetic analysis. The species discrimination efficacy of each single- and multi-locus barcode candidate was assessed with three tree-based methods: neighbour-joining (NJ), maximum likelihood (ML) and Bayesian inference (BI). NJ trees for all markers were constructed in MEGA X 43,44 . Node reliability was assessed by a bootstrap test of 1,000 pseudo-replicates under the K2P distance model. For the ML analysis, phylogenetic trees were constructed in raxmlGUI v1.3.1, a graphical front-end for RAxML 45 . Clade support was assessed using ML with thorough bootstrap analyses, run 10 times from random seeds under the GTRGAMMA model with 1,000 non-parametric bootstrap replicates 46 . Species forming separate clusters in the tree with bootstrap support > 50% were considered distinct. The BI analysis was conducted in MrBayes v3.2.7 47 . The best substitution model for each locus was selected according to the Akaike information criterion (AIC) in jModelTest version 2.1.7 48 . The model suggested by jModelTest was GTR + I + G for all tested barcodes except matK, for which it was GTR + G. Two replicate Markov chain Monte Carlo (MCMC) runs were performed for 5,000,000 generations with four simultaneous chains (one cold and three heated), and trees were sampled every 1,000 generations. Sampling of the posterior probability (PP) distribution was considered adequate when the average standard deviation of split frequencies was lower than 0.01. Subsequently, stationarity was assessed in Tracer v1.7.1 49 , the first 25% of trees were discarded as burn-in, and a 50% majority-rule consensus tree was constructed, with PP used as node support values. All tree topologies were visualized in FigTree v1.4.4. Percentages of species resolution were calculated from the reconstructed trees according to whether each species formed a monophyletic clade. Callicarpa americana was used as the outgroup.
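For readers unfamiliar with the K2P (Kimura two-parameter) distance mentioned for the NJ analysis, the sketch below computes it for two aligned sequences using the standard formula d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. It is an illustration only and does not reproduce MEGA X; the function name and the choice to skip gaps and ambiguity codes site by site are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumed handling: skip gaps and ambiguity codes
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError for saturated pairs, where the
    # distance is undefined under this model.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


if __name__ == "__main__":
    # One transition (A->G) in ten sites.
    print(k2p_distance("AAAAACCCCC", "GAAAACCCCC"))
```

Because transitions and transversions are weighted separately, the K2P distance is slightly larger than the raw proportion of differing sites, and the correction grows as sequences diverge.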
Generation of QR code. A two-dimensional QR code consists of black modules on a white background, with square position markers in three corners of the code, and can hold up to 7,089 numeric characters, 4,296 alphanumeric characters, or 2,953 bytes of binary data 50 . In this study, the QR code images for the candidate barcode marker were generated with the DNA QR Code Web Server 35 .
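As a rough illustration of the capacity figures above, the following Python sketch checks whether a DNA sequence could be stored as plain text in a single QR symbol in alphanumeric mode. It is not the encoding used by the DNA QR Code Web Server, which may compress sequences differently; the function name is hypothetical, and the 4,296-character limit is the maximum quoted above, which applies to the largest symbol at the lowest error-correction level.

```python
# Characters allowed in QR alphanumeric mode: digits, upper-case
# letters, space and the symbols $ % * + - . / :
QR_ALPHANUMERIC = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")

# Maximum alphanumeric capacity quoted in the text.
MAX_ALPHANUMERIC_CHARS = 4296


def fits_in_one_qr(sequence):
    """Return True if the sequence fits one alphanumeric-mode QR symbol.

    The sequence is upper-cased first, because alphanumeric mode has
    no lower-case letters.
    """
    seq = sequence.upper()
    return set(seq) <= QR_ALPHANUMERIC and len(seq) <= MAX_ALPHANUMERIC_CHARS


if __name__ == "__main__":
    # A 2,400-character toy sequence fits; a 4,400-character one does not.
    print(fits_in_one_qr("ACGT" * 600))
    print(fits_in_one_qr("ACGT" * 1100))
```

Under these assumptions, even the longest alignment reported in this study (2455 bp) would fit in one symbol without compression, since nucleotide letters and the gap character "-" all belong to the alphanumeric set.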

Data availability
The GenBank accessions of ITS2, matK, rbcL and ycf1 retrieved for Clerodendrum sp. are listed in Table S1 (Supporting Information). The sequences of Clerodendrum species submitted in this study are listed in Table S2.