Biological composition analysis of a natural medicine, Faeces Vespertilionis, with complex sources using DNA metabarcoding

Faeces Vespertilionis is a commonly used fecal traditional Chinese medicine. Traditionally, it is identified relying only on morphological characters. This poses a serious challenge to the composition analysis accuracy of this complex biological mixture. Thus, for quality control purposes, an accurate and effective method should be provided for taxonomic identification of Faeces Vespertilionis. In this study, 26 samples of Faeces Vespertilionis from ten provinces in China were tested using DNA metabarcoding. Seven operational taxonomic units (OTUs) were detected as belonging to bats. Among them, Hipposideros armiger (Hodgson, 1835) and Rhinolophus ferrumequinum (Schober and Grimmberger, 1997) were the main host sources of Faeces Vespertilionis samples, with average relative abundances of 59.3% and 24.1%, respectively. Biodiversity analysis showed that Diptera and Lepidoptera were the most frequently consumed insects. At the species level, 19 taxa were clearly identified. Overall, our study used DNA metabarcoding to analyze the biological composition of Faeces Vespertilionis, which provides a new idea for the quality control of this special traditional Chinese medicine.

www.nature.com/scientificreports/ Unfortunately, traditional means cannot easily provide species-specific identification for this kind of complicated natural medicine 12 . Broken debris of insects, such as appendages, eyes, setae and thorns 11 , could be observed by microscope. Nevertheless, identifying hosts or other biological composition at the species level is impracticable. Moreover, no species-specific compounds of Faeces Vespertilionis have yet been reported. In the natural medicine identification field, DNA barcoding has emerged as a new molecular tool for species identification 14 . Furthermore, DNA barcoding technology is able to identify traditional medicines with complex species sources 15 . Therefore, there is an urgent need to develop novel diagnostic tools to solve this problem.
DNA metabarcoding has revolutionized species identification methods 16 . This method is based on the highthroughput sequencing (HTS) of DNA barcode regions, amplified using universal polymerase chain reaction (PCR) primers 17 . Currently, HTS technologies can generate millions of sequences concurrently. Therefore, DNA metabarcoding enables us to quickly characterize a very large number of species present in an environmental sample in a single experiment 16,18,19 . Dietary analyses have been facilitated by the advent of metabarcoding and its application to the analysis of faeces or stomach contents 20 , especially for the analysis of the diet of insectivorous bats 21,22 . It is feasible that the biological composition of Faeces Vespertilionis could be discovered by metabarcoding, regardless of whether diversity is from the diet or species introduced due to collection, storage or transportation.
To our knowledge, this report is the first application of metabarcoding to depict the biological composition of a natural medicine with complex sources. Twenty-six Faeces Vespertilionis samples from ten provinces in China were selected, and the taxa involved were explored. We aimed to identify the bat hosts and other biological compositions of Faeces Vespertilionis, which should be an innovation for the quality control of this kind of fecal medicines.

Results
Sequencing results and data filtering. The Illumina sequencing of 26 PCR products generated a total of 17,794,946 paired-end reads. A total of 12,175,994 reads were obtained after filtering out low-quality reads, trimming tags and primers, and merging the paired-end reads. After retaining 232 bp sequences, dereplicating reads and removing chimeras, unique sequences were clustered at a 98% similarity threshold. Then, operational taxonomic units (OTUs) with a relative abundance greater than 0.01% of total reads of the 26 samples were retained. Finally, 243 OTUs and 4,758,016 reads remained. Rarefaction curves analysis indicated that the number of sequences and sequencing depth were sufficient for this study (Fig. S1).
Taxonomic identification of bats. We identified five bat species comprising 1,037,606 reads (Table S1).
Among them, the average relative abundance (i.e., the total relative abundance of each taxon across all samples divided by 26) of H. armiger accounted for 59.3% of the 26 Faeces Vespertilionis samples, followed by R. ferrumequinum (24.1%) (Fig. 1). Two or more bat species could be identified in each sample, with H. armiger and R. ferrumequinum being the main host types within most collected samples in this study (Fig. 2). 23 and non-metric multidimensional scaling (NMDS) 24 were utilized to calculate beta diversity and to analyze the differences between samples ( Fig. 3 and Fig. S2). The PCA results showed that the samples were clustered into three groups. The first was composed of yms1 from HUBEI Province and yms14 from SICHUAN Province. The second group comprised yms7 from JIANGXI Province, yms16 from SICHUAN Province, yms17 from SHANGDONG Province, yms21 from ZHE-JIANG Province, and yms25 from ZHEJIANG Province. The remaining samples formed the third group. Both the PCA and NMDS results implied that the origin was not the main explanation for the biodiversity variation between samples.

Biodiversity analysis. Principal component analysis (PCA)
The alpha diversity 25 of each sample, calculated with observed OTUs, and the Shannon entropy index were used to evaluate the biodiversity within samples. The observed OTUs constituted the number of different OTUs observed within an individual sample. The Shannon entropy index indicated the richness and evenness of the species present. As shown in Fig. 4, approximately 150-230 OTUs belonging to invertebrates could be observed in every 10 g of sample. Although the number of OTUs observed varied from sample to sample, the distribution of Shannon entropy was mainly between 2 and 3 when considering OTU evenness. It is worth noting that although samples yms1 and yms14 had relatively low diversity, the former had fewer OTUs, and the latter presented a more unbalanced species distribution. Both indices were useful to assess the complexity of Faeces Vespertilionis. OTUs were identified at the order level. Among them, 68 were further identified at the family level and assigned to 27 taxa. Twenty-four were identified at the species level and assigned to 19 taxa (Table S2). The OTUs identified were divided into three classes (Fig. 5a) and 11 orders (Fig. 5b), of which 89 OTUs with 543,376 reads belonged to Insecta. The result was clearly dominated by Diptera OTUs. Lepidoptera was the second most abundant, followed by Hemiptera. Several taxa within Coleoptera, Orthoptera, Ephemeroptera, Psocoptera and Blattodea were also recorded at a much lower frequency (Fig. 5b). In contrast to most samples, Coleoptera accounted for the largest proportion in yms1, while Blattodea accounted for a larger proportion in yms14 (Fig. 5c). Tabanidae sp3. represented a significant proportion of Diptera OTUs, while Nymphalidae sp. was representative of Lepidoptera ( Fig. 2-upper panel). In addition, several Cicadidae OTUs were also detected, and Cicadae Species were categorized at the order level, and samples were grouped by sample origin. The asterisks indicate that the species might have been introduced due to hygiene problems. Moreover, the lower part reflects the composition of bat species in each sample. Species were categorized at the family level.   www.nature.com/scientificreports/

Discussion
As hypothesized, DNA metabarcoding was a powerful method for host identification. This is the first report that describes two bat species (H. armiger and R. ferrumequinum) as the main Faeces Vespertilionis hosts. H. armiger 26 and R. ferrumequinum 27 live in groups in caves, abandoned tunnels, roofs or abandoned houses. They are widely distributed in China and mainly prey on nocturnal flying insects, such as Diptera and Lepidoptera. Our approach provided a detailed biological composition of the samples. At the order level, our results were congruent with previous studies, since Diptera and Lepidoptera were frequently observed in fecal samples 28,29 . In fact, these two orders constituted the majority of samples in the whole study, regardless of read counts or OTU number. The non-Insecta OTUs, first reported by us, suggests that a broader range of taxa within Faeces Vespertilionis can be identified using DNA metabarcoding than by traditional identification methods. The three mites parasitizing insects or nematodes discovered also indicate the sensitivity of this method. It is noteworthy that the presence of taxa with bioactivity (e.g., Cicadae Periostracum) that we detected might be fundamental to Faeces Vespertilionis efficacy.
To ensure species-level taxonomy assignment accuracy, the following aspects were considered. First, our study utilized a portion of the mitochondrial cytochrome c oxidase subunit I (COI) "Folmer" region. This region is a widely applied animal barcode 30,31 , especially for the identification of bat species and their diet 21,32 . Second, since the balance between fidelity and amplification success rate must be considered for polymerase selection during biological composition analysis experiments, Tks Gflex DNA Polymerase, which has been successfully used in similar studies, was adopted in our research 33,34 . Finally, the bioinformatics processing parameters we used have been adopted by previous related studies 31,35,36 . In addition, we manually checked the taxonomic assignment results in MEGAN and investigated the geographical distribution and life history of taxa to ensure reliable species-level identification. Additionally, although we evaluated our selected primers through NCBI Primer-BLAST and found that the primers could amplify different Chiroptera families and different Insecta orders, metabarcoding markers (including the COI "Folmer" region) may have skewed our results due to the existence of primer bias 37,38 .
The stability of traditional medicines benefits from controlling the biological composition within them 39 . However, the origins of Faeces Vespertilionis samples are not related to bat species or other biological compositions in their faeces, which suggests that the quality of this medical material could not be controlled by only limiting its sources. However, from the perspective of diversity, the samples were indeed clustered into three groups, which might be related to the presence of other orders from Insecta, such as Coleoptera and Blattodea, and might also be related to non-Insecta taxa. Regarding the abundant biodiversity of the samples, the sample size should be increased in the future to find key factors affecting the stability of the biological level of the samples to control the biodiversity and ensure the stability and uniformity of medicine quality.
Our results shed light on the biological composition of this commercial medicinal material, in which both dietary species and introduced species due to hygiene problems were observed. In our study, Ctenoplusia albostriata (Bremer & Grey, 1853) 40 from Lepidoptera was found to be distributed in all samples from ten provinces in China. Coboldia fuscipes (Meigen, 1830) 41 , an oyster mushroom fly from Diptera, was also found to be distributed in all samples from ten provinces in China. Additionally, the housefly Musca domestica (Linnaeus, 1758) 42,43 , the carpet beetle Anthrenus verbasci (Linnaeus, 1767) 44 , and the carrion beetle Dermestes maculatus (De Geer, 1774) 45 were detected in our samples, which might have been introduced due to hygiene problems.
In summary, exemplified by Faeces Vespertilionis, we have shown that the DNA metabarcoding approach is a practical solution to explore the species composition of traditional medicines with highly complex components. The quality control of natural medicines might benefit from our explorative research.

Methods
Samples and DNA extraction. In this experiment, we collected 26 samples of Faeces Vespertilionis from ten provinces in China (Table 1). Approximately 10 g of each sample was lysed with 80 ml of 1.5% sodium dodecyl sulfate, and the supernatant was collected. A TIANamp Stool DNA Kit (Tiangen Biotech Co., Ltd., Beijing, China) was used to extract genomic DNA according to the manufacturer's protocol. Before each extraction, the PCR and high-throughput sequencing. The primers used in this study were an existing detection system 46 based on the mitochondrial COI sequence. Metabarcoding studies on bulk collections of animals usually target a region within the COI "Folmer" region 30,47,48 . Therefore, the forward primer we used was LCO1490 (5′-GGT CAA CAA ATC ATA AAG ATA TTG G-3′), and the reverse primer was HCO1777 (5′-ACT TAT ATT GTT TAT ACG AGG GAA -3′) 46 . Both primers were tagged with 8 bp tags at the 5' end to distinguish samples during data analysis (Table S3), and the final amplified product was 297 bp. PCR amplification was performed using Tks Gflex DNA Polymerase (Takara Bio Inc.). These PCRs were conducted in a 50 µl reaction volume containing 25 µl of 2× Gflex buffer, 2 µl of DNA, 1 µl each of the forward and reverse primers, 20 µl of ddH 2 O, and 1 µl of Tks Gflex DNA Polymerase. The PCR conditions consisted of an initial denaturation step at 98 ℃ for 1 min, followed by 40 cycles of denaturation at 98 ℃ for 10 s, annealing at 45 ℃ for 15 s, and extension at 68 ℃ for 30 s, with a final extension step at 68 ℃ for 5 min. To avoid contamination, all PCR steps were conducted in a sterile laminar flow hood that was physically separated from locations where DNA extraction or post-PCR sample processing occurred. The laminar flow hood was treated with UV light for 30 min prior to PCR preparation. We included negative controls at the DNA extraction and PCR steps. PCR products were visualized with 1% agarose gel electrophoresis. The PCR product (297 bp) was selected by excision on agarose gel, and non-specific PCR products and primer dimers were discarded. Subsequently, PCR products were mixed in equimolar amounts. Bioinformatics processing. After quality control by FastQC (www. bioin forma tics. babra ham. ac. uk/ proje cts/ fastqc/), the sequencing reads were trimmed using Trimmomatic 49 , a trimming tool for Illumina sequencing data, to filter out the contaminating adapter sequence and low-quality reads (phred quality < 20). Sequencing reads were demultiplexed using fastq-multx 50 and assigned to each sample according to the unique tags. Primer and tag sequences were trimmed using bbduk from BBMap tools 51 . Paired-end reads were merged using QIIME 52 . We dereplicated reads using the USEARCH 53-55 "fastx_uniques" algorithm with the option "minuniquesize 12". Then, we applied the USEARCH UNOISE3 36 algorithm to detect and remove chimeras, substitutions due to incorrect base calls and gaps due to omitted or spurious base calls. USEARCH was used to cluster OTUs at a 98% similarity 35 threshold. Finally, OTUs with a relative abundance less than 0.01% of total reads of the 26 samples were removed from the OTU table using QIIME (filter_otus_from_otu_table.py). A representative sequence from each OTU was then picked.
Taxonomic assignment. We imported the OTU representative sequences into Geneious Prime 2020.2 and translated them into their amino acid sequences. After translation, we removed the sequences containing stop codons, which were regarded as artificial sequences. Then, the OTU table and representative sequences were regenerated. BLASTN 56 was used to compare the OTU representative sequences against the NCBI GENBANK database, and the output was imported into MEGAN COMMUNITY EDITION version 6.10.8 57 . The representative sequences were filtered with identity values of 80% to remove non-metazoans. Order-level taxonomy was assigned at > 95% identity values, family-level taxonomy was assigned at > 96.5%, and species-level taxonomy was assigned when the identity values between the query and reference sequences were above 98% 31,35 . Finally, we manually checked the taxonomic assignment and investigated the distribution and life history of the identified species.
Statistical analysis. We respectively rarefied (QIIME script-single_rarefaction.py 52 ) the number of sequences per sample in the OTU table according to the sample with the smallest reads of bats and other biological compositions in their faeces to ensure comparability. Then, biodiversity analysis was performed after removing the host and non-metazoan sequences. The default parameters of R (v2.12.1) and the R package vegan (v2.0-1) 58 were used for PCA, NMDS analysis and data visualization. The default parameters of the R packages phyloseq (v1.36.0) 59 and ggplot2 (v3.3.3) 60 were used for alpha diversity analysis and data visualization. Heatmap was plotted using the R package pheatmap (v1.0.12) 61 . We calculated the relative abundance (i.e., the value of each taxon divided by the total reads per sample) and the average relative abundance (i.e., the total relative abundance of each taxon across all samples divided by 26) for sample composition analyses 62 .

Data availability
Illumina data sets have been deposited in the Sequence Read Archive (SRA) under BioProject PRJNA688101.