Universal mtDNA fragment for Cervidae barcoding species identification using phylogeny and preliminary analysis of machine learning approach

The aim of the study was to use total DNA obtained from bone material to identify species of free-living animals based on the analysis of mtDNA fragments by molecular methods using accurate bioinformatics tools Bayesian approach and the machine learning approach. In our research, we present a case study of successful species identification based on degraded samples of bone, with the use of short mtDNA fragments. For better barcoding, we used molecular and bioinformatics methods. We obtained a partial sequence of the mitochondrial cytochrome b (Cytb) gene for Capreolus capreolus, Dama dama, and Cervus elaphus, that can be used for species affiliation. The new sequences have been deposited in GenBank, enriching the existing Cervidae mtDNA base. We have also analysed the effect of barcodes on species identification from the perspective of the machine learning approach. Machine learning approaches of BLOG and WEKA were compared with distance-based (TaxonDNA) and tree-based (NJ tree) methods based on the discrimination accuracy of the single barcodes. The results indicated that BLOG and WEKAs SMO classifier and NJ tree performed better than TaxonDNA in discriminating Cervidae species, with BLOG and WEKAs SMO classifier performing the best.

The current state of knowledge of molecular biologists has led to the widespread use of mitochondrial DNA (mtDNA) as a marker for species-specific identification in animals [1][2][3][4][5] . For intraspecific detection of unrelated individuals, sequences with high variability are recommended, e.g. certain nuclear genes 6 . For species identification within Cervidae, we choose conservative sequences shared among the animals with species-specific variables, because doing so brings the best effect [7][8][9] .
Mitochondrial DNA is known to be an effective molecular marker in phylogenetic analyses 10,11 . This is due to the high polymorphism of the control region, as well as a lack of recombination, and very good isolation efficiency, even from small amounts of biological tissue, as well as the resistance of mtDNA to degradation processes. Different gene regions such as mtDNA, have been used for the DNA barcoding approach, but cytochrome c oxidase COI) is a universal DNA barcode for animals 12 , such Cervidae 13,14 . Ward et al. 15 analysed mtDNA COI sequences in animals and indicated that the success of barcoding depended upon recent speciation, incorrect morphological taxonomy and species hybridisation, where barcoding could not differentiate interspecies. There are many drawbacks to the use of a barcoding tool for species identification, so, the scientific community must be cautious in accepting the above factors and use additional genes for further clarification. Generally, biological phenomena, such as the hybridisation process of close species, natural introgression process, error in specimen identification using classical taxonomy and recent speciation process, are known to strongly interfere with the DNA barcoding process, and these phenomena are known to occur at different degrees depending on the animal groups and datasets [16][17][18][19] . So, it is authoritative to have not only more databases on individual species COI gene sequences from different geographical locations and correct identification of species through traditional taxonomy. It should also be noted that taxonomic decisions based on a single molecular marker that is maternally Scientific Reports | (2023) 13:9133 | https://doi.org/10.1038/s41598-023-35637-z www.nature.com/scientificreports/ inherited might not resolve all species identification and should be supported by a second molecular marker, such as cytochrome b (Cytb). The analysis of species-specific variation using the homologous cytochrome b (Cytb) is characterized by high reproducibility and sensitivity of results 11,[20][21][22][23][24] . To distinguish closely related species, selected mtDNA fragments with very high specificity are needed. Often conservative gene sequences encoding proteins are used in studies on interspecies diversity 25 , while the control region is used to provide a reliable source of knowledge about intraspecific variability 7,10,25 .
In some cases, cytochrome b provides excellent phylogenetic information on the taxonomic position of various vertebrates; and thus, it can be used in the analysis of live specimens or for forensic identification purposes [26][27][28][29] , with the same success rate as COI 30 . In addition, this gene is often considered when determining the origins of samples from difficult biological materials, i.e. hair, feathers, tooth fragments or other bones, which mainly utilize mitochondrial DNA polymorphisms 26 . Irwin et al. 31 have determined the rate of evolutionary changes for the genera of some species in different components of cytochrome b amino acid sequences based on fossil DNA analyses. Several recent studies show that when the DNA template is derived from bone material, a 300-500 bp Cytb fragment, is suitable for mammalian species identification 12,27,[32][33][34] .
Machine learning (ML) is a branch of artificial intelligence (AI) where machines are trained to solve selfdesigned problems by learning new rules through repeated trials and feedback. ML enables inferring of models or relationships by learning from data. With the development of AI, machine learning has been rapidly developed and applied in DNA barcoding. Over the years, several different analytical methods were devised for the assessment of the species discrimination ability, such as TaxonDNA, NJ tree and machine learning approaches (BLOG and WEKA). Most machine learning approaches (MLA) bear its origin from statistical methods of regression analysis. Machine learning approaches are computer tools, which can be successfully applied in species identification 35 . BLOG (Barcoding with LOGic) and WEKA (Waikato Environment for Knowledge Analysis) are methods of ML, which can recognize unknown species (query set) present in the reference dataset composed of DNA barcode sequence (training set) of known species 36,37 .
Based on the literature, we have found the cytochrome b mitochondrial gene to be useful in identifying species of wild animals using bone material (mandible, frontal bone) and we tested its usefulness on the real life example. The specific aim of our research was to develop a short universal fragment from mtDNA, which could be used in the species identification of various deer populations. In addition, we proposed the use of machine learning approach methods to classify species with DNA barcode sequences.  (Table S3). Figure 1 show the obtained phylogenetic tree, which was resolved into three distinct clades that consisted of representatives of the three analysed species. Samples were grouped together with each species representative showing a high probability (100%) of assignment, indicating clear species identification. Within the clades, we found substantial polytomy, which is a result of a lack of sequence informativity within the species level.

DNA isolation.
Another phylogenetic analysis inferred from the Cytb sequences was constructed to illustrate the phylogenetic relationship of Cervidae species based on Cytb sequences from GenBank and our studies (Fig. 2). A phylogenetic ML tree was constructed using a pre-trained neural network. The tree important Cervidae species were formed monophyletic clades and found well-supported with bootstrap values (> 80%).
Machine learning approaches of BLOG and WEKA were compared with distance-based (TaxonDNA) and tree-based (NJ tree) methods based on discrimination accuracy and cost-effectiveness of barcode gene ( Table 2, Fig. 2). The results indicated that BLOG and WEKAs SMO classifier and NJ tree performed better than Tax-onDNA in discriminating Cervidae species. Specifically, the single barcode Cytb exhibited the highest species resolution (100%) for identifying 3 Cervidae species when BLOG or WEKAs SMO classifiers were used. This study showed that machine learning approaches provided higher discrimination accuracy and cost-effectiveness over other analytical methods in DNA barcoding of Cervidae species.

Discussion
We present that DNA barcoding is an effective molecular tool for Cervidae species identification and phylogenetic inferences as a result of our research, we have obtained the new sequences that have been deposited in a section of the Genbank belonging to the Cervidae family. Identification of species must be effective and precise even from degraded environmental material. In addition, well-resolved molecular phylogenies derived from these DNA barcode sequences have the potential to improve investigations of the mechanisms underlying community assembly and functional trait evolution. Our proposed methodology can be used in the future as a routine marker in cases when degraded samples will be used. It also contributes to the development of the subject of species identification in different environments. With the development of DNA barcoding, several analytical methods were developed for the assessment of species discrimination ability. There are no criteria for evaluating the quality of the analytical methods for species discrimination. We have analyzed the effect of barcodes on species identification from the perspective of the machine learning approach (MLA). We tested a reinforcement-learning algorithm to solve the challenge of reconstructing phylogenetic trees, which are used to describe the relationships among a set of sequences. Current tools for phylogenetic-tree reconstruction integrate heuristic approaches to evaluate only a subset of all potential trees, thus they suffer from the known trade-off between accuracy and running time. In our study, we tested the methodology for predicting the maximum likelihood tree. Our preliminary results, based on a machine learning approach algorithm demonstrate that the trained algorithm can accurately and efficiently reconstruct maximum-likelihood trees. This development technique ML could provide rapid, simple, and reliable tools for species confirmation and can be applied to the modelling of species distribution.
In the present study, we proposed and tested the use of a relatively short mitochondrial DNA sequence for the species identification of members of the genus Cervidae. Our proposed methodology, based on machine learning, confirmed the identification results of the classical phylogeny-based approach, which will enable its wider use in future routine studies of this type.
Nowadays, DNA analysis of biological samples has become the standard practice in animal identification at various taxonomic levels. Different types of tissues, such as bones, blood, hair (fur), feathers, skin, meat (muscle sample), faecal, and others are often the subject of many studies in various DNA analysis laboratories 22,41 . A universal fragment of genetic information is constantly being sought to use in many areas, e.g. poaching 26,42,43 , illicit trafficking of endangered species 26,[42][43][44] , protection of endangered animal species 27 , or determination of meat origin (for identification purposes) 44,45 . Anna Ramón-Laca et al. 22 , showed that differentiation of species can be achieved by using a species-specific primer that amplifies dissimilar length fragments. There are differences of opinion among the researchers, regarding which of the markers, COI or Cytb, provides more reliable and reproducible results for DNA barcoding analysis. In 2010, a group of researchers led by Tobe et al. 1 carried out an assessment of genetic intraspecific variability based on COI and Cytb sequences from 217 mammalian species. The results showed that the discriminatory power was higher for the Cytb gene, i.e. there was a higher probability that two random individuals from a given population would have sequence differences at the marker locus than for the COI sequence. Research carried out by Wilson-Wilde et al. 46 demonstrated that identification based on the COI gene sequence is suitable for genetically distant species, while in the case of closely related species, it is no longer unambiguous and requires additional tests. However, COI, Cytb, and the mt-CR control region are still used for this purpose 1,20-24,47-50 . To find better molecular tools the compilation of known DNA markers led to the construction of the genetic map of Cervus elaphus 51 54 . The genetic map of deer was used as an annotation for further research, such as the origin and evolution of ruminant genomes 52 , QTL scanning 53 , SNP analyses of the whole genome 55,56 and whole genome sequencing as well as the annotation and assembly of pseudochromosomes 54 .
In the results presented in this study, the total length of DNA fragments of all analysed individuals was 207 bp due to the removal of the last nucleotides in the sequences. The reason for obtaining different lengths was probably due to the inhibition of sequencing reactions by individual matrices. Similar results were obtained by Gupta et al. 33 , who worked on stool samples and also obtained short Cytb sequence fragments of 366, 374 and 503 bp 22,41,57 . We show that when using bone tissue, the primers used in this work for the Cytb gene fragment amplifications work better because firstly, they differentiate closely related species well and have the additional advantage that they can be used for many other mammalian species as well. Our research is confirmed by many studies, not only for the family Cervidae but also other works on the identification of other wild mammalian species 1,5,32,41,48,58 . Our phylogenetic analysis grouped the analysed sequences within individual species with 100% probability (Fig. 1). The Cytb fragment, analysed in this study, allows correct species identification, however, the lack of intraspecific polymorphism results in the inability to use it in population studies. This is clearly shown in the phylogenetic tree obtained (Fig. 1) where most monospecific nodes are polytomous. The lack of the node's solution (polytomy) is in this case is the result of a lack of genetic information from the analysed DNA sequences (soft polytomy). Our results suggest that the intraspecific genetic polymorphism is low for all mammalian species. Similar results were obtained in earlier studies 1,59 .
MLAs extract the distinct features from the DNA sequences by training the reference dataset and then used for identifying the query sequences. Cytb region provided the highest species resolution when using the BLOG method, and WEKAs SMO classifier. In the comparison of TaxonDNA and NJ tree, the BLOG and SMO methods produced a relatively low level of misidentification. Moreover, though the above methods have to achieved > 90% species identification success rate, still there is a need for further improvement in success rate (Fig. 2, Table 2). The discrimination ability of combined different barcodes in the species of the Cervidae genus is still fully unknown.  We present that DNA barcoding is an effective molecular tool for Cervidae species identification and phylogenetic inferences a result of our research. Identification of species must be effective and precise even from degraded environmental material. In addition, well-resolved molecular phylogenies derived from these DNA barcode sequences have the potential to improve investigations of the mechanisms underlying community assembly and functional trait evolution. Our proposed methodology can be used in the future as a routine marker in cases when degraded samples will be used. It also contributes to the development of the subject of species identification in different environments.
With the development of DNA barcoding, several analytical methods were developed for the assessment of species discrimination ability. There are no criteria for evaluating the quality of the analytical methods for species discrimination. We have analyzed the effect of barcodes on species identification from the perspective of the machine learning approach (MLA). We tested a reinforcement-learning algorithm to solve the challenge of reconstructing phylogenetic trees, which are used to describe the relationships among a set of sequences. Current tools for phylogenetic-tree reconstruction integrate heuristic approaches to evaluate only a subset of all potential trees, thus they suffer from the known trade-off between accuracy and running time. In our study, we tested the methodology for predicting the maximum likelihood tree. Our preliminary results, based on a machine learning approach algorithm demonstrate that the trained algorithm can accurately and efficiently reconstruct maximum-likelihood trees. This development technique ML could provide rapid, simple, and reliable tools for species confirmation and can be applied to the modelling of species distribution.

Materials and methods
Sampling DNA. In total 18 skull samples were obtained from wild-living specimens of three ungulate species in 2016-2018. DNA isolations were performed using the column-based method and the GeneMatrix Bond DNA Purification Kit (Eurx). The purity and concentration of DNA from the bone material were determined using a NanoDrop 2000c spectrophotometer (Thermo Scientific) ( Table 1).
Mitochondrial DNA analysis. The following primer pair was used for PCR amplification 32  PCR products were checked by electrophoresis in a 1.5% agarose gel containing ethidium bromide and a TBE buffer (pH 8.0); the gels were visualized under UV and archived using the GeneSys V.1.3.5.0 software (Syngene). The sequences reported in this paper have been deposited in the GenBank nucleotide sequence database with the accession numbers marked '*' in Tables S1, S2. Sequence analysis. At first, the forward and reverse sequences were aligned, and consensus sequences were obtained using Basic Local Alignment Tool software. ClustalW and Mega7.1 software were used to perform multiple sequence alignments 60 . Substitution patterns and rates were estimated under the Kimura 2-parameter model 61 .
The genetic variability of haplotypes was characterized by the total alignment length (bp), the number of monomorphic sites, the number of polymorphic sites, the number of parsimony informative sites (PIC), the number of haplotypes, and the average G + C content in each region using DnaSP6.10.01 62 . www.nature.com/scientificreports/ Species identification. To reveal the species of each sample analysed, we performed phylogeny reconstruction using the Bayesian approach. Seven Cytb sequences for Cervus elaphus, two for Dama dama and 9 for Capreolus capreolus were grouped together along with 131 Cytb sequences (Tables S1, S2) of the three species from Genbank, as well as two outgroup sequences (Antidorcas marsupialis, Beatragus hunteri) for comparison. Next, all sequences were aligned with the Muscle algorithm 61 and cut to obtain the proper alignment set in Seaview 63 . The best-fit substitution model was chosen using jModelTest 2.10 64 . Finally, the tree was constructed with MrBayes 3.2.6 38 using two, randomly started and independent runs, carried out for 20,000,000 generations of Markov chain steps. A consensus tree was constructed based on the set of trees collected after both runs converged-i.e. when the standard deviation of both runs was much below 0.01. ML tree was also searched using DeepNNPhylogeny, pre-trained neural networks to predict the best models of sequence evolution and the best tree topologies 65 . All neural networks have been trained with a large number of alignments simulated with the software PolyMoSim, designed to test phylogenetic tree reconstruction and to train machine learning models for phylogenetic reconstruction. We used also ModelTeller, a machine-learning based algorithm, which is based on the Random Forest, for the prediction of the optimal phylogenetic model for branch-length estimation 66 .
The MLAs BLOG and WEKA (machine learning approach). BLOG 2.0 (Barcoding with LOGic; Institute of Systems Analysis and Computer Science, National Research Council, Rome, Italy) and WEKA (The University of Waikato, Hamilton, New Zealand) were applied. BLOG provides a supervised MLA, which selects suitable nucleotide positions and computes the logic formulae for species identification 36 . The WEKA workbench is used for classification, clustering and selection problems 37,67 . The four classifiers: Naïve Bayes 68 , support vector machines (SMO) 69 , the decision tree C4.5 (J48) 70 and the rule-based RIPPER (Jrip) 71 were implemented to analyse the DNA sequences.
Distance-based analysis (TaxonDNA). The Kimura 72 2-parameter model (K2P)distances between all sequence pairs were calculated with TaxonDNA 1.9 (National University of Singapore, Singapore) and applied to compute the mean and the range of the intra-and interspecific distances for the barcode. The relative distribution of the pairwise intra-and inter-specific distances were estimated with the "best match" and "best close match" functions in the TaxonDNA under the Kimura 72 2-parameter (K2P) distance model. Statement. The consent of the bioethical commission is not required for this type of research in Poland.

Tree-based analysis (neighbour-joining
This is due to the fact that the study material was not taken from live animals, but from bones. In addition, the animals were not caught, euthanized or killed. Samples were taken from fallen animals by North West Forest Districts and sent for this study.

Conclusions
Despite the challenging biological material of bone tissue, the Cytb gene was successfully used to identify individuals of closely related ungulate bone DNA species using PCR analysis, Sanger DNA sequencing and accurate bioinformatics tools such as the Bayesian approach and the machine learning approach. This research will be extended to analyse more sequences of many barcodes for Cervidae species identification using DNA barcode sequences through machine learning approaches. The data obtained will serve used for comparisons with gene bank records. The proposed methodology will be helpful as a routine identification procedure for a variety of tissue sources, even in cases where the samples are degraded.

Limitations
The efficiency of the applied DNA isolation method varied. The resulting DNA concentration values demonstrated over a 100-fold difference between the lowest and highest concentration. The study revealed that one of the most important moments during the DNA extraction process was the preliminary preparation of the bone material. Identification of bone samples depended on the quality and quantity of DNA present in the sample. The efficiency of the applied DNA isolation method varied. The resulting DNA concentration values demonstrated over a 100-fold difference between the lowest and highest concentration. The study revealed that one of the most important moments during the DNA extraction process was the preliminary preparation of the bone material.
Identification of bone samples depended on the quality and quantity of DNA present in the sample.

Data availability
All data generated or analyzed during this study are included in this published article. The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable. All the sequences have been deposited in NCBI GenBank and can be found under accession numbers No: MK575604-MK575603. The Cytb sequences are deposited on the corresponding author's data.