Comparative transcriptome provides strategy for phylogenetic analysis and SSR marker development in Chaenomeles

The genus Chaenomeles has long been considered as an important ornamental, herbal and cash plant and widely cultivated in East Asian. Traditional researches of Chaenomeles mainly focus on evolutionary relationships on phenotypic level. In this study, we conducted RNA-seq for 10 Chaenomeles germplasms supplemented with one related species Docynia delavayi (D. delavay) by Illumina HiSeq2500 platform. After de novo assemblies, we have generated unigenes for each germplasm with numbers from 40 084 to 48 487. By pairwise comparison of the orthologous sequences, 9 659 othologus within the 11 germplasms were obtained, with 6 154 othologous genes identi�ed as single-copy genes. The phylogenetic tree was visualized to reveal evolutionary relationship for these 11 germplasms. GO and KEGG analyses were performed for these common single-copy genes to compare the functional similarities and differences. Selective pressure analysis based on 6 154 common single-copy genes reveals that 45 genes were under positive pressure selection. Most of them involved in plant disease defense system building process. 292 genes containing simple sequence repeats (SSRs) were used to develop SSR markers and compare their function in secondary metabolism pathways. Finally, 10 primers were chosen as SSR markers candidates for Chaenomeles germplasms by comprehensive screening. Our research provides new methodology and reference for future related research in Chaemomeles and is also useful for improvement, breeding and selection project in other related species.


Introduction
Chaenomeles, a genus within the subfamily Maloideae (Rosaceae) comprising four diploid (2n = 34) species 1 , is widely planted in East Asian and has important ecologic, ornamental and economic value.In China, Chaenomeles has more than 3000 years cultivated history.About 700 years ago, people began to recognize Chaenomeles' medicinal values: their effect on curing rheumatoid arthritis and hypopepsia 2 .Now, pharmacologists nd their leaves and owers contain substantial prunasin; their seeds have a large content of amygdalin; their fruits contain abundance chemical ingredients such as oleanolic acid, malic acid, pectinic acid, Chinese wolfberry citric acid, tartaric acid, citric acid, tannin, avonoids saponin and proanthocyanidins which acts as antioxidant and anticancer agents 3 .Rest of these ingredients also have different pharmacological activities 4 .
E cient methods to clarify the taxonomic status of both the wild and the cultivated germplasms are much needed 1 .In present, both morphological traits and various molecular markers are used to solve the taxonomic confusion [12][13][14] .In the past several years, Restriction fragment length polymorphism (RFLP), Random Ampli ed Polymorphic DNA (RAPD), Ampli ed fragment length polymorphism (AFLP), Simple Sequence Repeats (SSR), EST-based microsatellite (EST-SSR), Inter-simple sequence repeat (ISSR), Sequence characterized ampli ed region (SCAR) and single nucleotide polymorphism (SNP) have developed into the mainstream method to detect evolutionary relationship in relative species on genomic level.Lots of related researches were reported.In Chaenomeles, Barthish et al 15 employed RADP to analysis offspring families of C. cathayensis and C. speciosa.Their result showed the RAPD-based proportion of between-family variability is greatly higher in the hybrid populations than in pure species.He et al 16 17 .With the development the next generation sequencing (NGS) technologies and algorithms for data processing, RNA-seq has become an e cient and economical approach for genomic and trascriptomic resources digging, and has been increasingly considered as an e cient tool to identify transcriptwide molecule markers and to solve the phylogeny relationships innon-model species [18][19][20][21] .Although few studies about transcriptome datasets for individual species, increasing attention has been given to comparative transcriptomes in genus Chaenomeles.
Here, we sequenced transcriptomes for 10 Chaenomeles germplasm which belong to ve Chaenomeles species and one D. delavayi germplasm which is near-source species of Chaenomeles by Illumina HiSeq 2500 platform.After denovo assemble and functional annotation, common orthologous genes among the 11 germplasms were obtained.We selected the single-copy genes from all common orthologous genes to conduct phylogenetic tree construction to con rm their evolutionary relationship.By pairwise comparison of the orthologous sequences, speciation differentiation genes under positive selection were obtained to illustrate some key genes during evolution.A plenty of SSR sites were identi ed to be potential molecular markers for Chaenomeles.Common genes containing SSR sites among the 10 Chaenomeles germplasms were selected to detect gene expression model in secondary metabolism pathway.Finally, we developed some SSR marker for these 10 Chaenomeles germplasms.This study displays the rst comparative exploration of Chaenomeles transcriptomes using high-throughput RNA-seq and provides an important methodology and database resources for facilitating further studies on the phylogeny of Chaenomeles and its related genus.

Results
De novo assembly and unigene quali cation result of the  2B).

Screening of othologous genes and single-copy genes
To better understand the similarities and differences among the 11 germplasms, comparative analysis of genes among different germplasms is the most e cient methods.Genes from different germplasms with the best blast hits are considered as orthologs.All these orthologs were commonly used to generate orthologous pairs.We identi ed orthologous groups for all predicted nucleic acid and proteins sequence from the 11 matierals by using OrthoMCL software.After merging same family gene, nally, we obtained 28 181 OrthoMCL genes in C. thibetica (XZ), 28   S1).Then we found the common othologs among 11 germplasms by venn diagram.A total of 9659 othologous genes were obtained (Fig. 3A).Single-copy genes are valuable resources for phylogenetic analysis and SSR marker selection, which only have one copy in genome of corresponding species.We selected single-copy gene within 9 659 othologs also by using venn graph.And 6 154 othologous single-copy genes were obtained in our result (Fig. 3B, Supplementary Table S2).

Phylogenetic analysis based on single-copy orthologous families
In our study, the single-copy gene families were used to conduct phylogenetic analysis.We nally identi ed 6 154 orthologous groups in the 11 germplasms.Using these orthologous families, we constructed an evolutionary tree via iqtree software based on the Maximum likelihood method.
According to the evolutionary tree, the 11 germplasms were divided into different clades (Fig. 4).
C. sinensis (GP) and D. delavayi (DY) are in the same branch and are relatively far away from other species, indicating that they have close phylogenetic relationship.Compared with other species in the genus, C. japonica (RB) was classi ed as a single branch, showing interspeci c differentiation, which is also supported by AFLP studies 22 .C. speciosa (ZPLY), commonly known as 'Yizhoumugua' in Linyi City of Shandong Province, is generally considered to be originated from interspeci c hybridization with other species in the genus 23 .Our study showed that it had a relatively distant genetic relationship with the other three Chaenomeles germplasms.
Three germplasms of C. thibetica, namely XZ, XZSS and MYXZ, clustered in the same branch with C. cathayensis (MY), showing a very close phylogenetic relationship.Previous study 23 suggested that C. thibetica might be a hybrid species of C. cathayensis and D. delavayi (DY).However, this view still needs more su cient evidence to judge.C. speciosa is the most widely distributed species of the genus in China.Due to its outstanding pharmacological effects, it has been domesticated and cultivated for a long time and formed three genuine producing areas, corresponding to the distribution of three local varieties, namely 'Cunmugua' (ZPCA), 'Xuanmugua' (ZPXC) and 'Ziqiumugua' (ZPLP).Our phylogenetic analysis proved their close relationship.

GO and KEGG analyses result for single-copy genes in 11 germplasms
In order to understand the function of 6 154 common single-copy genes in 11 germplasms, we conducted GO and KEGG analyses (Fig. 5).According to the GO and KEGG result, 6 154 common single-copy genes shared similar metabolism process.After merging the top 15 GO terms in each germplasm, we totally obtained 27 common GO terms (Fig. 5A).Common single-copy genes were signi cantly enriched in these 27 GO metabolic processes in all the 10 Chaenomeles germplasms except D. delavayi (DY).In D. delavayi (DY), common single-copy genes were insigni cantly enriched in DNA metabolic process (GO: 0006259) with p. adjust 0.294 and chromosome organization (GO: 0051276) with p. adjust 0.564 respectively indicating its different metabolic process from Chaenomeles (Fig. 5A Supplementary Table S3).All of the 11 germplasms especially enriched in nitrogen compound metabolic process (GO: 0006807), organic cyclic compound metabolic process (GO: 1901360), heterocycle metabolic process (GO: 0046483) cellular nitrogen and compound metabolic process (GO: 0034641), and nucleic acid metabolic process (GO: 0090304) (Fig. 5A, Supplementary Table S3).

Detecting genes under selective pressure
The non-synonymous (dN) substitutions and synonymous (dS) substitutions are used to estimate the change of coding protein sequences.The magnitude of the Ka/Ks ratio provides evidence of genes under strong functional constraints (Ka/Ks < 1) or undergoing adaptive evolution (Ka/Ks > 1) [24][25][26] .The ratio of non-synonymous substitution rate (Ka) and synonymous substitution rate (Ks) are used to assess whether genes under selection.We generated pairwise comparison of different transcriptome datasets.According to the blast results, a total of 6 057 single-copy genes underwent selective process (Fig. 6A, Supplementary Table S5A).Among them, 467 out of 6 057 genes underwent puri cation selection indicating those genes are disadvantage during evolution process.693 out of 6 057 genes experienced weak positive selection and 45 out of 6 057 genes experienced strong positive selection indicating those genes are domain genes that determine evolution direction (Fig. 6B, Supplementary Table S5B-D).Further, we blasted these 45 positive selected genes to Arabidopsis (Supplementary Table S5D).19 out of 45 genes can be found homologous genes in Arabidopsis.ORTHOMCL10413 (RGA2) mainly acts as a repressor of the gibberellin (GA) signaling pathway through transcription coactivator of the zinc nger transcription factors GAF1/IDD2 and ENY/IDD1 to regulate gibberellin homeostasis and signaling 27 .ORTHOMCL9348 (RPL5) participates in cell proliferation and plays a role for translation in leaf dorsoventral patterning to specify leaf adaxial identity 28 .ORTHOMCL8960 (HEL) has a modular structure consisting of an N-terminal hevein-like domain (CB-HEL) and a C-terminal domain (CD-HEL) that is posttranslationally processed.Both domains show a strong antifungal activity indicating this gene is charge for defending plant against pathogen attack 29 .ORTHOMCL9718 (FIP2) can delay owering by repress expression of FLOWERING LOCUS C 30 .ORTHOMCL4564 (CSA1) also play important role in disease defense 31 .ORTHOMCL9773 (MYB27) is member of MYB family, lots of studies indicates that MYB family members can regulate secondary metabolism, control cell shape and defense disease resistance and increase hormone responses 32 .ORTHOMCL8338 (BZIP60) involves in controlling endoplasmic reticulum pressure to enhance immune response in both animal and plants 33 .In addition, although in plants, ORTHOMCL5656 (CYS6), ORTHOMCL6947 (HPR3) and ORTHOMCL6257 (F11P17.9)still have no reports for their function, according GO analysis result, these three genes main participated in disease defense process (Supplementary Table S3).All of these indicate most of the positive selection genes involves in plant disease defense system building.That is why those plants can survive during long period evolution and selection process.
A total of 110 pairs of SSR primers were synthesized, and the 10 Chaenomeles germplasms were used to verify the PCR ampli cation ability.Based on this, a total of 45 pairs of primers were screened out (Supplementary Table S11).Then, capillary electrophoresis was used to detect the polymorphism and speci city of the 45 primers.The results showed that only 10 pairs of primers had good polymorphism.Primers from ORTHOMCL6247, ORTHOMCL8473, ORTHOMCL7263 ORTHOMCL4541, ORTHOMCL9476, ORTHOMCL9834 and ORTHOMCL8700 had the best polymorphism relatively (Supplementary Fig. S1).The primers from ORTHOMCL4735, ORTHOMCL8947 and ORTHOMCL6535 also had relatively good polymorphism, and poor ampli cation result for individual germplasm (Supplementary Fig. S1).Table 1 displayed all selected 10 primers information including tandem repeats unites, annealing temperature and detailed forward and reverse primers (Table 1).

Discussion
Several approaches including morphological markers, cytological markers, and biochemical markers have been explored to classify plant resources 22 .However, each technique is imperfect.For example, great phenotypic difference individuals from geographical isolation create obstacle for taxonomists.
Subjective factors of taxonomists lead to misclassi cation.Limited number of isoenzyme and low resolution of cytology prevent taxonomists from recognizing the plants.Due to the aws, molecular markers have become a prevalent way for evolutionary studies in Chaenomeles and the related species [34][35][36][37] .Various molecular markers providing reliable and convictive evidence for related researches promote phytowaxonomy research, whereas, some technologies are still unaffordable because of high cost.Nowadays, RNA-seq has become a economical and e cient tool for tapping molecular information.
The phylogeny and classi cation of Chaenomeles have been controversial for a long time, especially whether C. sinensis is a legally effective species.
Robertson et al 38 conducted phylogenetic relationships among 88 genera of Rosaceae, and the results show it is far from Chaenomeles group, indicating that C. sinensis is not the member of Chaenomeles.In our research, we found C. sinensis had closer evolutionary relationship with D. delavayi con rming Robertson's result.There was obvious differentiation between C. sinensis (GP) and all other species of Chaenomeles.In terms of phenotype, C. sinensis (GP) with the characters of branches unarmed, owers solitary, coetaneous, sepals re exed, stipules ovate-lanceolate and margin glandular serrate is also signi cant different from the other Chaenomeles species which have the characters of branches armed, owers fascicled, precocious or coetaneous, sepals erect, rarely re exed, stipules herbaceous, reinform or auriculate and margin serrate 23 .Pollen morphology also shows that it is distant from other species 39 .Therefore, based on the phylogenetic results and phenotypic characteristics, we support that C. sinensis should not belong to Chaenomeles, but should be an independent genus Pseudocydonia.This view was also supported by our another study on the phylogeny of Chaenomeles based on chloroplast genome sequencing 40 .
The genetic characteristics of populations should be dictated by the interplay of genetic drift, gene ow and natural selection.These processes may be strongly in uenced by the demography and spatial distribution of populations 41 .However, their extrinsic feature have some disparities, even main chemical components also appears differences, this lead genuine regional drug in doctor of traditional Chinese medicine.From molecular marker perspective, although above same Chaenomeles, they have formed gene-diversity offsprings leading to different bands (Fig. 8), this phenomenon has been reported in Bartish's research, which discriminate 42 Chaenomeles by RAPD method 42 .
For closely related taxa, the advantages of homologous single copy genes for phylogenetic and phylogeographic analysis are clear because of their rapid evolutionary rates and clear avoidance of paralogy [43][44][45][46] .According past researches, there are a lot of single copy genes identi ed in many plants, including Euasterids, Rosaceae, Poaceae, Cycads and other 29 angiosperms [47][48][49][50][51] .For example, Duarte et al. totally identi ed a series of 959 single-copy genes in A. thaliana, P. trichocarpa, V. vinifera and O. sativa, and used 18 single-copy genes to improve resolution of Brassicaceae phylogeny 43 .In this study, we identi ed 6154 single-copy genes according to the strict ltering criterion.All these detected single-copy genes proved high effectiveness in phylogeny construction of Chaenomeles species, which provided a rough phylogenetic framework for the whole genus (Fig. 4).Therefore, this result indicated that all these candidate single copy genes from the transcriptomes of Chaenomeles species have good application in molecular phylogeny studies of Chaenomeles genus.
In our result, 10 out of 45 SSR markers were nally con rmed to distinguish the 10 Chaenomeles germplasms.They have good appearance in both polymorphism and ampli cation.The e ciency of primer developing rate was 22.2% is a little lower than Wang et al's 28.6% who develop SSR marker in Pineaaple.For the developing rate, many factors can in uence the result, such as the selected base number of primers, genetic diversities of tested species, and developing methods.For instance, Wang 52 compared four methods including ISSR, SSR, SCoT and RAPD by using 47 germplasms of Ananas comosus (L.) Merr.Among them, the developing rate of RAPD can reach 47.06% and signi cantly higher than rest methods.However, SCo T and SSR method have good comprehensive performance comparing two other methods.Taken together, our results have proved SSR marker developing based on transcriptome is promising.
Although classi cation studies in Chaenomeles have been made in great progress, system of nomination and classi cation still existing some problem.
The results of this study support the idea of C. sinensis as an independent genus, Pseudochaenomeles.With increasing exploration of edible and medicinal value, it is important to distinguish Chaenomeles assortment and avoid misusing.So the authoritative criterion is urgent.Due to delay of Chaenomeles genome, it's still hard to explored SSR markers e ciently.Because of the low cost for transcriptome sequencing, it will be optimum selection, thus, our research provides new methodology and reference for future related research in Chaemomeles.The large genetic diversity in the genus Chaenomeles as inferred from molecular markers is also useful for improvement, breeding and selection project in related species.A total of 15 duplicates leaves were collected from each germplasm resource.According to the manufacturer's instruction, total RNA was isolated from mature leaf samples as described in Owczarek et al 53 .After testing the quali cation of extracted RAN including concentration, RIN value, ratio of 28S to 18S, OD 260 to OD 280 by using Agilent 2100 Bioanalyzer and NanoDrop, then the RNAs from same germplasm were pooled to prepare cDNA library by using cDNA Synthesis Kit (Illumina Inc., San Diego, CA) following the manufacturer's recommendations 54 .Finally, the constructed cDNA libraries were sequenced using Illumina HiSeq 2500 by paired-end 150 bp strategy.

Methods
Filtering and assembling of reads, quantitative calculation and functional annotation for unigenes Firstly, raw reads were ltered with fastp software by removing adaptor and low quality reads 55 .Then, the cleaned reads were de novo assembled using Trinity with the default parameters for each individual sample 56 .Expression level of unigenes was calculated by RSEM 57 software, and was standardized by RPKM.Unigenes were queried against the non-redundant protein (Nr) database, the Swiss-Prot protein database, the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database and the Cluster of Orthologous Groups (COG) database using BLASTx with an E-value cut-off of 1e-5 58 .. According to the blast results, the coding region sequences (CDS) of unigenes were extracted and translated into peptides.For the unigenes without matching with the above four databases, we predicted the coding region sequences by using TransDecoder.Gene Ontology (GO) annotation of the unigenes were obtained using Blast2GO 59 .Enrichment analyses in GO term and KEGG pathway were conducted as described in Li et al 60 .
Identi cation of orthologs, single-copy genes and multiple-copy genes Diamond 61 and OrthoMCL 62 softwares were used to identify the orthologs among the 11 germplasms.The rst step, multiple sequences comparisons among different germplasms were conducted with cutoff of E-value ≤ 1e-5, query cover ≥ 30% by using Diamond.The second step, the orthologs were merged into different families by using OrthoMCL.These obtained orthologous genes can be divided into single-copy genes and multiple-copy genes according to their copy numbers.
Phylogenetic analysis and selective pressure analysis Based on the above analyses, we rstly constructed sequence alignment by using MUSCLE (multiple sequence alignment with high accuracy and high throughput) for single-copy genes.Then, the phylogenetic tree was performed in iqtree software 63 by Maximum likelihood method 64 .To detect whether genes were affected by nature condition, paml-codeml was used to compute non synonymous substitution rate (Ka), synonymous substitution rate (Ks) and ratio Ka to Ks.We aligned orthologous pairs 65 .Orthologs genes with Ka/Ks > 1, which were considered as genes under strong positive selection; orthologous genes with 0.5 < Ka/Ks < 0.1, which were considered as affected by weak positive selection; orthologous genes with Ka/Ks < 0.1, which were considered as affected by negative selection.The signi cance tests were measured with p values, which were corrected via false discovery rate (FDR).Expression model of single-copy genes in secondary metabolites pathway After performing KEGG analysis, the signi cant enriched pathway-secondary metabolites pathway was selected for detecting gene expression model among the 10 Chaenomeles germplasms.The pathway was depicted by Adobe Illustrator CS5 software 66 , and heatmaps were visualized by TBtools 0.6669 67 .

Identi cation and screening of SSRs among 10 Chaenomeles germplasms
We identi ed SSRs motifs in unigenes using MISA software 68 .According to the tandem repeat including AAC, ACA, CAA, GTT, TGT and TTG, all these SSRs motifs were divided into mono-, di-, tri-, tetra-, penta-and hexanucleotide type.PCR primers were designed for the SSRs using the program Primer3, which have more than 150 bp anking sequences 69 .All of the primers used for SSR development were showed in Supplementary Table S11.
DNA was extracted from leaves of the 11 germplasms as described in Bartish et al 70 .Fluorescently labeled primers were synthesized for amplifying gene fragment as following rules: the primer should have 2, 3, 4, or 5 tandem repeat units; the length of PCR products range from 150bp to 300 bp; the position of genes should not focus on one site; the polymorphic sites should be chosen rstly; different combination of tandem repeat units should be chosen averagely; the repeat base in primer should be less than four; length of primers should be located at about 20 ~ 23bp; TM value of primers should be at 60 ℃; consecutive A or T at 5' or 3' region of primer should be less than two; repeat sequence within primer was forbidden.The PCR products were detected by capillary electrophoresis.Figure 5 GO and KEGG analyses for common single-copy genes among 10 germplasms of Chaenomeles and D. delavay.In this graph, pot size means enriched gene numbers; from red to green, the degree of signi cance decreases.In both of GO and KEGG analysis result, only top15 GO terms and pathways are showed.

Declarations Figures
evaluated genetic relationships among 52 C. specisa accessions grown in China by combining AFLP with leaf morphology characters.208 polymorphic markers were identi ed.Zhang et al, used EST-SSR markers of Malus to research genetic diversity of 33 Chaenomeles germplasms and obtain exciting results Plant material, total RNA extraction and Illumina sequencingAll ve species of Chaenomeles and one related species (D. delavayi) were collected (Fig.1).C. sinensis (GP) was collected in Anji County of Zhejiang Provice.According to the local common name, C. speciosa was divided into four types, including C. speciosa 'Chunmugua' (ZPCA), C. speciosa 'Ziqiumugua' (ZPLP), C. speciosa 'Xuanmugua' (ZPXC) and C. speciosa 'Yizhoumugua' (ZPLY).Three local types of C. thibetica were also collected, which distribute in Lasha (XZ), Bomi (XZSS) and Motuo (MYXZ) of Xizang Province respectively.C. cathayensis (MY) was collected from Shiping County, Yunnan Province.C. japonica (RB) was introduced from Japan and collected from the Germplasm Resource Bank of Chaenomeles in Anji County of Zhejiang Province.D. delavayi (DY) was collected from Lancang County of Yunnan Province.The voucher specimens were deposited in the Herbarium of Research of Institute of Subtropical Forestry, Chinese Academy of Forestry with specimens ID showed in Fig.1.