Introduction

Global warming is changing Earth’s climate with possible negative effects on the growth and reproductive success of plants. Reduced plant productivity due to environmental changes1,2, such as high temperatures, heat waves and drought stress, might compromise global food security3,4. The Mediterranean region will be particularly affected by climate change, with increased aridity expected to occur (Intergovernmental Panel on Climate Change; http://www.ipcc.ch)5. Mediterranean agriculture will need to adapt to the new environmental conditions by growing drought-tolerant crops. Selection and introduction of stress-tolerant cultivars of existing crops is a slow and costly process, requiring intensive research and field trials6. Another option is promoting alternative drought-resistant crop species. Caper (Capparis spinosa L.) is a xerophilous crop showing a remarkable adaptability to harsh environments, and with promising potential for agrosystems under the threat of global warming7.

In the Mediterranean area, populations are generally grouped within one species, C. spinosa8,9,10,11,12,13, although the taxonomic classification of this species is controversial due to a large pattern of morphological and ecological variations14,15,16,17 and to the lack of specific molecular markers. Despite the high polymorphism, two main subspecies, which differ ecologically and morphologically11 are recognized in Europe: C. spinosa subsp. spinosa, showing derived characters and widespread from the Mediterranean to Central Asia, and C. spinosa subsp. rupestris (Sm.) Nyman, characterized by phenotypic features close to the tropical stock of the group, distributed in the Mediterranean Region and the Sahara13,17.

The fruits and flower buds of caper are used as a food ingredient, generally in brine, and are appreciated for their flavor and texture. Caper cultivation is extensive and economically relevant for farmers only in certain areas. In particular, the main current areas of production are located in Morocco, Turkey, Spain and Italy, namely in the minor islands of Salina and Pantelleria. Being rich in bio-active compounds, capers have many important medicinal properties18,19,20,21,22,23,24,25,26,27,28. Moreover, like other Mediterranean species29,30,31, caper is also a source of natural compounds with allelopathic potential32,33. Therefore, extracts from C. spinosa could also be used to develop natural products for eco-friendly agriculture.

Owing to its value as a gourmet food, its medicinal and allelopathic properties and its ability to thrive in arid conditions7, caper has great agricultural potential in areas with increasing drought conditions, such as the Mediterranean basin. The domestication of caper plants has been limited and cultivated varieties are still very similar to wild accessions34, leaving ample margins for enhancement of many traits, such as increased productivity, firmer buds, disease resistance and thornless habit. Breeding programs and an efficient exploitation of this orphan crop are hampered by the confused taxonomy of the genus Capparis and the lack of genomic information. To date, only a few sequences of the chloroplast35,36,37 and mitochondrial genome38 have been reported, with limited value for phylogenetic analyses and breeding programs. Currently, there are no nuclear Simple Sequence Repeat (SSR) markers described for the genus Capparis. Microsatellites, or SSRs, are codominant and highly informative markers already broadly used to genotype a wide range of plant species39,40,41,42,43. Compared to other molecular markers, SSRs are abundant and uniformly distributed throughout plant genomes and show several advantages such as simplicity, high polymorphism, reproducibility, co-dominant inheritance and cross-species transferability44. For species with no annotated genome, as is the case for orphan crops, an effective strategy to uncover SSRs is to rely on transcriptomic sequences. In contrast to genomic SSRs, Expressed Sequence Tag (EST)-SSRs are located in the coding and untranslated regions and are highly transferable to related taxa45. Thus, EST-SSR markers can directly influence phenotype and can be considered efficient functional markers46.

The advent of Next Generation Sequencing (NGS) technologies combined with bioinformatics tools can generate extensive data on non-model species in a very cost-effective way47,48,49. Among NGS strategies, RNA Sequencing (RNA-Seq) approach50 is a high throughput technology that has great advantages in examining the fine structure of a transcriptome51,52 and provides an effective way to obtain large amounts of sequence data without prior genome information53,54,55. RNA-Seq has been widely used in many organisms to obtain mass sequence data for transcriptional analysis, gene discovery and molecular marker development52,54,55,56, showing a great potential as a tool for molecular breeding57.

Here, for the first time, we report the sequencing, de novo assembly, and annotation of the leaf transcriptome of C. spinosa subsp. rupestris, a primitive type closer to the tropical stock of the group13. To identify putative genes controlling the production of bioactive and high-value components, the assembly was functionally annotated using public databases. In addition, polymorphic EST-SSRs were identified in the leaf transcriptome, thereby obtaining the first set of co-dominant markers for the species.

This transcript dataset provides the most comprehensive resource currently available for gene discovery and marker development in C. spinosa. This resource will be instrumental for future breeding programs and phylogenetic studies of capers. In addition, the information now available will contribute to the sustainable adaptation of agricultural production in small islands and marginal areas of the Mediterranean region58 and in other regions affected by aridity and/or climate change.

Results

Sequencing, de novo assembly and functional annotation of C. spinosa leaf transcriptome

We performed RNA-Seq to assemble transcripts, identify genes and develop co-dominant markers for the first time in C. spinosa. Illumina shotgun sequencing of the leaf transcriptome yielded nearly 80 million cleaned reads, which were assembled de novo by Trinity into 208,677 transcripts with an N50 length of 2,431 bp (mean length 1,493 bp) (Table 1). To remove redundant transcripts, the assembled transcripts were clustered by CD-HIT-EST, generating 124,723 unigenes with an N50 length of 2,380 bp (mean 1,417 bp) (Table 1). The quality of the assembled unitranscripts was evaluated by comparing them to the set of Eudicotyledons genes using the BUSCO quality assessment tool. Out of the 2,121 BUSCO groups searched, 87.8% (1,861 BUSCOs) were “complete” (i.e., 916 single-copy and 945 duplicated), 6.7% (142 BUSCOs) were “fragmented” and the remaining 5.5% (118 BUSCOs) were “missing”. In addition, a single peak of GC content typical for plants (around 50%) was found using QUAST59, underlining the absence of bacteria. In total, we identified 0.40% of possible contaminants (e.g. bacteria and viruses), representing only 1.78% of the ‘Other’ category of Fig. 1. Clustered transcripts were searched against the NCBI-nr database, revealing 89,670 (72%) transcripts whose translation was significantly similar to known proteins. In the species distribution analysis, 47,749 (53%) transcripts showed homology (top blast hits) with Tarenaya hassleriana, followed by Eutrema salsugineum, Arabidopsis thaliana, Arabidopsis lyrata, and Camelina sativa with 4,704 (5%), 3,322 (4%), 3,183 (4%) and 2,519 (3%), respectively (Fig. 1). The assembled sequences and their predicted proteins were also queried against the UniProtKB/Swiss-Prot database using BLASTx and BLASTp searches, respectively. Nearly 54% (66,902) of the unitranscripts had a blastx hit and 41% (51,048) of the clustered transcripts with an ORF ≥ 100 bp in length displayed significant homology to annotated protein sequences. When nucleotide and protein sequences were aligned against UniRef90, the number of hits increased to 85,294 (68%) and 62,339 (50%), respectively. Moreover, 46,099 (37%) unique Pfam protein motifs could be assigned and 3,341 (3%) protein sequences were predicted to have signal peptides (Table 2). The complete list of transcript annotations is shown in Supplementary Dataset S1.
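
The N50 and mean length statistics quoted above can be illustrated with a minimal sketch. It is not the script used to produce Table 1; the function name `n50` and the toy lengths are hypothetical, and it assumes the common definition of N50 as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembled bases.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of all assembled bases
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # this sequence pushes the cumulative length past 50%
            return length


# Toy transcript lengths (hypothetical, not taken from the caper assembly)
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
```

Because N50 is weighted by sequence length, it is normally larger than the mean length, as is the case for both the Trinity transcripts and the clustered unigenes reported here.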

Table 1 Overview of sequencing outputs and assembly of Capparis spinosa leaf transcriptome.
Figure 1

Species-based distribution of blastx matches for each clustered unitranscript of Capparis spinosa leaf transcriptome. The species with a match <1% were grouped in the ‘Other’ category.

Table 2 Overview of functional annotation by homology of Capparis spinosa leaf transcriptome.

We extracted 27,035 non-redundant GO terms from 51% of the transcripts and summarized them into 97 GOslim plant categories using CateGOrizer (Supplementary Dataset S2). The annotated clustered transcripts were grouped into the three main categories: most of the assignments (61%) belonged to the biological process (BP) category, while the remainder was shared between the cellular component (CC) (14%) and molecular function (MF) (25%) categories. Within BP, “cellular process”, “metabolic process” and “cellular component organization and biogenesis” were the most represented groups among a total of 44 level-2 categories. Within CC, 26 level-2 categories were identified. The top three groups were “cell”, “intracellular” and “cytoplasm”. Similarly, in MF, 24 level-2 GO terms were identified, and “catalytic”, “transferase” and “hydrolase activities” were the top three (Supplementary Fig. S1). In the KOG classification, 40,765 unitranscripts were classified into 24 KOG groups (Fig. 2). Among these, the cluster for “general function prediction only” (15%) represented the largest group, followed by “transcription” (10%), “replication, recombination and repair” (9%) and “signal transduction mechanism” (8%). “Cell motility” was the smallest group, while no unigenes were classified as “extracellular structures” (Fig. 2).

Figure 2

EuKaryotic Orthologous Groups (KOG) in Capparis spinosa leaf transcriptome. The unigenes with significant homologies in the KOG database were grouped into 24 categories. The number of unigenes belonging to each category was reported in the y-axis, while the subgroups in the KOG classification were represented in the x-axis.

Biological pathway analyses in C. spinosa

To investigate functional biological pathways in C. spinosa, we used Transdecoder to assign KEGG Orthology (KO) identifiers to unitranscripts (e-value ≤ 1 × 10⁻⁵). The unique KOs identified were mapped against the KEGG database to verify the correct sequencing of well represented pathways in C. spinosa. Among the 127 KEGG pathways identified (Supplementary Table S1), purine metabolism (669 sequences; 76 KOs, covering 37% of the pathway) was the most represented pathway in terms of the number of homologous leaf transcripts. Pyrimidine metabolism (504; 56, covering 57%), oxidative phosphorylation (414; 60, covering 28%), phenylpropanoid biosynthesis (316; 18, covering 49%), fatty acid metabolism (biosynthesis and degradation) (297; 22, covering 23%) and α-linolenic acid metabolism (163; 12) were also highly represented.

Because of their high representation and the known role of adenine, jasmonate, and flavonols in abiotic stress tolerance60,61,62,63,64, we analyzed purine, thiamine and α-linolenic acid metabolism and phenylpropanoid biosynthesis in detail. Purine metabolism was highly represented in the C. spinosa leaf transcriptome (Fig. 3A). In thiamine metabolism, we found enzymes involved in the production of thiamine phosphates: thiamine-phosphate synthase (EC 2.5.1.3), which catalyzes thiamine phosphate synthesis; thiamine phosphatase (EC 3.6.1.15), which converts thiamine diphosphate into thiamine phosphate; and thiamine diphosphokinase (EC 2.7.6.2) and thiamine phosphate phosphatase, which convert thiamine into thiamine diphosphate and thiamine phosphate into thiamine, respectively (Fig. 3B).

Figure 3

Analysis of purine (A) and thiamine (B) metabolism pathways by KEGG, showing the identified enzymes in Capparis spinosa leaf transcriptome (Enzyme Code - EC - identified are in green).

Similarly, α-linolenic acid metabolism (12 genes) was highly represented. In particular, we identified jasmonate O-methyltransferase (EC 2.1.1.141) and acetyl-CoA C-acyltransferase (EC 2.3.1.16), involved in jasmonate biosynthesis (Fig. 4).

Figure 4

KEGG analysis showing genes involved in α-linolenic acid metabolism in Capparis spinosa leaf transcriptome (Enzyme Code - EC - identified are in green).

A large proportion of the phenylpropanoid biosynthesis pathway was also reconstructed (18 genes), identifying some important enzymes, such as phenylalanine ammonia lyase (PAL) (EC 4.3.1.24), the first component in the phenylpropanoid pathway; 4-coumarate-CoA ligase (EC 6.2.1.12) and cinnamate-4-hydroxylase (C4H) (EC 1.14.13.1), that convert trans-cinnamic acid (CA) to p-coumaric acid (COA); 4-coumarate CoA ligase involved in p-coumaroyl-CoA synthesis, an intermediate for hydroxycinnamic acids, flavonols and flavonol derivatives (Fig. 5A).

Figure 5

KEGG analysis showing genes involved in phenylpropanoid biosynthesis (A) and glycerolipid metabolism (B) in Capparis spinosa leaf transcriptome (Enzyme Code - EC - identified are in green).

Considering the role of lipids as signaling molecules in plant responses to abiotic stress, the unitranscripts were investigated for coding sequences of lipid metabolism. In this highly represented pathway, we found key enzymes of glycerolipid metabolism involved in phosphatidic acid (PA) synthesis, such as 1-acyl-sn-glycerol-3-phosphate acyltransferase (EC 2.3.1.51) and diacylglycerol kinase (ATP) (EC 2.7.1.107), which convert lysophosphatidic acid and L-1,2-diacylglycerol, respectively, into PA, and in PA transformation (phosphatidate phosphatase; EC 3.1.3.4) (Fig. 5B). We also identified CDP-diacylglycerol-inositol 3-phosphatidyltransferase (EC 2.7.8.11), which catalyzes phosphatidylinositol (PI) synthesis (Fig. 5B).

To complete the analysis of molecules playing an important role during the environmental stress response, the occurrence of components of sphingolipid metabolism was also explored and 10 different enzymes could be retrieved. A large proportion of the sphingolipid metabolism pathway could be reconstructed, including, among others, alkaline ceramidase (EC 3.5.1.23), involved in the synthesis of sphingosine, and serine palmitoyltransferase (EC 2.3.1.50), a key enzyme of sphingolipid metabolism required for the conversion of L-serine and palmitoyl-CoA into 3-dehydrosphinganine (Supplementary Fig. S2A).

Additionally, we identified sequences mapped in different pathways involved in phytochemical biosynthesis, such as terpenoid metabolism (36), carotenoids (12 genes), glucosinolates (9), stilbenoids (4), and anthocyanins (2). Genes involved in oxidative phosphorylation (60) and photosynthesis (28), such as cytochrome c oxidase (EC 1.9.3.1) and photosystem I/II, respectively, were also detected and well represented (Supplementary Fig. S2B,C).

Seven genes, YODA (mitogen-activated protein kinase kinase kinase), ER (ERECTA), EPFL9/STOMAGEN (epidermal patterning factor-like protein 9), TMM (too many mouths), ERL1 (erecta-like1), GTL1 (GT2-like1) and FAMA (FMA/bHLH097), known to be involved in the modulation of stomatal development in response to drought, were also found (Supplementary Dataset S1). In addition, we identified transcripts with homology to Stress Associated Protein (SAP) genes, which are potential candidates to improve abiotic stress tolerance in plants using biotechnological approaches65. We found 32 C. spinosa unitranscripts homologous to ten genes encoding A. thaliana and O. sativa SAPs (Table 3; Supplementary Dataset S1). The average length of these transcripts was 1,607 bp, with values ranging between 249 bp (TRINITY_DN33098_c0_g1_i1; SAP10) and 4,447 bp (TRINITY_DN23049_c0_g1_i3; SAP1).

Table 3 List of Capparis spinosa leaf transcripts homologous to genes encoding for SAPs.

Simple sequence repeats isolation and validation

A total of 5,009 perfect simple sequence repeats (SSRs) with repeat numbers ranging from 4 to 31 (from di- to hexanucleotide motifs) were identified using the MISA tool in the assembled unitranscripts (Supplementary Table S2). Trinucleotide repeats were the most abundant (2,756, 55.0%), followed by hexanucleotide (1,115, 22.3%), tetranucleotide (566, 11.3%), dinucleotide (362, 7.2%) and pentanucleotide (210, 4.2%) repeats (Table 4). The most common repeat number was 7, observed in 1,146 assemblies (22.9%), followed by 8 (854, 17.1%), 4 (828, 16.5%), and >10 (794, 15.9%) tandem repeats (Table 4). The most abundant motifs detected were TCT (320, 6.4%) and TTC (303, 6.1%), followed by GAA (272, 5.4%). More details about the repeat motifs of the isolated EST-SSRs are listed in Supplementary Table S2.
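
The idea behind a perfect-SSR search can be sketched with a regular expression. This is only an illustration of the principle, not a reimplementation of MISA: the function name `find_perfect_ssrs` and the minimum repeat thresholds in `MIN_REPEATS` are hypothetical values chosen for the example and are not the settings used in this study.

```python
import re

# Hypothetical minimum repeat counts per motif length (di- to hexanucleotide).
# These are illustrative assumptions, not the MISA parameters of the study.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for each perfect SSR in seq."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # One motif copy, then at least (min_rep - 1) further exact copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit,
            # e.g. "AA" (mononucleotide) or "TCTTCT" (already a TCT repeat)
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


# Toy sequence with a (TCT)7 and an (AG)6 repeat; not a real caper transcript
toy_seq = "GGGA" + "TCT" * 7 + "AGGGC" + "AG" * 6 + "C"
toy_hits = find_perfect_ssrs(toy_seq)
```

Counting the motif lengths and repeat numbers of such hits over all unitranscripts gives the kind of summary reported in Table 4.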

Table 4 Summary of EST-SSRs and their repeat motif isolated from Capparis spinosa leaf transcriptome.

One hundred and fifty primer pairs were designed using Primer3 (http://primer3.sourceforge.net/) and a first panel of 50 EST-SSRs was tested (Supplementary Table S3). The predicted SSRs were validated and their polymorphism rate was evaluated using a set of 75 C. spinosa genotypes, collected across the distribution area of the species (Supplementary Table S4). Forty-one of the 50 tested EST-SSRs produced amplified fragments. Among them, 14 produced fragments outside the expected size range and were not considered further. The other 27 EST-SSRs produced PCR fragments of the expected size; 14 of these were polymorphic, with the number of alleles per locus ranging from 2 to 11 (mean 6), He from 0.420 to 0.843 (mean 0.630), PIC from 0.332 to 0.826 (mean 0.583), Fis from −0.058 to 0.830 (mean 0.062) and Fst from 0.010 to 0.695 (mean 0.495) (Table 5). The selected EST-SSRs showed strong discrimination power among the different taxa considered here. The UPGMA phylogenetic tree and DAPC analysis based on the SSRs clearly discriminated the thorny group of C. spinosa subsp. spinosa from Italy, Crete and western Asia from the thornless group of C. spinosa subsp. rupestris from different regions and islands of Italy (hereinafter subsp. spinosa and rupestris, respectively) (Fig. 6). Within subsp. spinosa, Sicilian populations were differentiated from eastern Mediterranean and western Asian populations. The subsp. rupestris group was more homogeneous, though samples from the small islands of Salina and, to a lesser extent, Pantelleria and Ustica also diverged from the rest of the populations (Fig. 6).
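
For readers unfamiliar with the diversity statistics above, the following minimal sketch shows how He and PIC can be derived from the allele frequencies at one locus. It assumes the uncorrected expected heterozygosity, He = 1 − Σp², and the PIC definition of Botstein et al., PIC = 1 − Σp² − Σ(i<j) 2p_i²p_j²; the software used in the study may apply sample-size corrections, and the function names and toy alleles are hypothetical.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Relative frequency of each allele (e.g. fragment size in bp) at one locus."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return [n / total for n in counts.values()]


def expected_heterozygosity(freqs):
    """Uncorrected expected heterozygosity: He = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content for one locus."""
    squares = [p * p for p in freqs]
    sum_sq = sum(squares)
    # sum over i<j of 2 * p_i^2 * p_j^2 equals (sum p^2)^2 - sum p^4
    cross = sum_sq ** 2 - sum(s * s for s in squares)
    return 1.0 - sum_sq - cross


# Toy locus with two equally frequent alleles (hypothetical fragment sizes)
toy_freqs = allele_frequencies([180, 180, 183, 183])
toy_he = expected_heterozygosity(toy_freqs)
toy_pic = pic(toy_freqs)
```

PIC is never larger than He for the same locus, which is consistent with the ranges reported for the 14 polymorphic loci.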

Table 5 Main genetic parameters from the 14 polymorphic EST-SSR loci of the population under investigation (sample size 75).
Figure 6

Genetic relationships among genotypes belonging to Capparis spinosa collection sampled across the distribution area of the species. (A) Dendrogram generated by 14 polymorphic EST-SSR developed in the present study, using the UPGMA method and Bruvo’s distance. (B) DAPC analysis clustering of the eight populations studied using the first two principal components (Y-axis and X-axis, respectively). CC: C. spinosa subsp. spinosa; CR: C. spinosa subsp. rupestris. The samples used for the EST-SSR validation were gathered in 8 main groups: CC Sicily, CC world, CR Favignana, CR Italy, CR Pantelleria, CR Salina, CR Sicily and CR Ustica.

Discussion

Although C. spinosa is a rich source of bioactive compounds with important nutritional and medicinal values18,25,66,67,68, the genomic resources available until now were limited. The lack of adequate molecular markers and of identified genes limits the efficient use of this orphan crop, which has considerable agricultural potential and is in high demand for exploitation7. In addition, its natural resistance to drought and harsh environmental conditions makes C. spinosa a potentially important resource in areas threatened by global warming and desertification. Therefore, the transcriptomic data generated in this study provide useful resources to support a full taxonomic revision of the genus Capparis (Capparaceae) and to assist selection in modern breeding programs in order to promote this crop, especially in the Mediterranean countries.

Illumina next-generation RNA-Seq was successfully used to develop a high-quality leaf transcriptome of C. spinosa subsp. rupestris, generating a number of transcripts similar to that of other transcriptome studies on plants54,69,70,71. Although our work focused on only one vegetative tissue (leaf), it provides the first transcriptome profile of C. spinosa grown under natural conditions. About 72% of the unitranscripts were successfully assigned to genes in the NR database, and likewise a large number of unigenes (68%) and predicted proteins (50%) matched entries in the UniRef90 database. In addition, species distribution analysis showed a high homology with Brassicaceae, a sister family to Cleomaceae, underlining the close evolutionary relationship of C. spinosa with this family72.

The identification of biological pathways is crucial for the functional interpretation of transcriptomic data. KEGG is a database resource that integrates genomic, chemical and systemic functional information, and is a useful tool for the interpretation of transcriptomic data and the broad interrogation of an organism’s genome content73. Here, a number of pathways of C. spinosa were highly represented. Among these, we described in detail those involved in abiotic stress tolerance and the production of bioactive compounds. We studied purine and thiamine metabolism, α-linolenic acid metabolism, phenylpropanoid biosynthesis, lipid metabolism, genes involved in stomatal development and distribution, and lastly the presence of SAPs.

Purine metabolism and, in particular, thiamine (vitamin B1) and its phosphate esters, which act as cofactors, are involved in the response to abiotic and biotic stress74. Thiamine metabolism can be altered under environmental stress in Zea mays75, while in A. thaliana the activation of abiotic defenses and stress tolerance were triggered by altered adenine metabolism61. Cellular adenine levels drive plant growth and biomass increase, playing a key role as a signal in modulating the response to abiotic stress and acclimation61. Moreover, a recent study76 suggested a possible connection between purine catabolism and stress phytohormone homeostasis/signaling. Takagi et al.76 showed how allantoin, a metabolic intermediate of purine catabolism, accumulates in plants under abiotic stress, activating jasmonic acid responses via abscisic acid (ABA) and enhancing seedling tolerance to abiotic stress.

Several putative targets involved in the plant abiotic stress response, belonging to α-linolenic acid metabolism, phenylpropanoid biosynthesis and lipid metabolism, were also found in this study. A comparison of the transcriptomic profiles of susceptible and tolerant rice varieties indicated that the α-linolenic acid metabolic pathway is involved in high drought tolerance77 and, recently, a link between α-linolenic acid and jasmonic acid biosynthesis and cold acclimation was uncovered in Camellia japonica through RNA-Seq analysis63. The phenylpropanoid pathway is responsible for the synthesis of a wide range of secondary metabolites in plants. As expected, the analysis revealed that the majority of the metabolic genes of this pathway are expressed in C. spinosa leaves. In particular, we identified PAL and C4H, genes encoding enzymes that catalyze the first and second steps of the phenylpropanoid pathway, respectively, and are responsible for the biosynthesis of lignin. C4Hs have remained highly conserved across the plant kingdom and recent studies78,79 highlighted their key role in the response to stresses (drought and cold) and as scavengers of Reactive Oxygen Species (ROS). In addition, genes linked to stress responses, including ethylene biosynthesis and signaling, showed altered expression levels in PAL knocked-down plants under non-challenging conditions80. PAL is also a biosynthetic source of salicylic acid (SA) in plants81, a master regulator of biotic and abiotic stress responses in plants, including drought stress82,83.

Key enzymes of glycerolipid metabolism driving the PA synthesis were also detected. PA is a diacyl glycerophospholipid used as precursor for complex lipids biosynthesis and transiently generated in response to biotic and abiotic stress in plants. PA plays an essential role in ABA-induced production of ROS, osmotic changes and temperature stress response84,85,86,87. In the same way, since lipid-protein interactions are crucial for deciphering the signaling cascades, we studied and isolated phosphoinositides and sphingolipids, compounds belonging to the highly coordinated signaling network developed in plants, linked to acclimation or survival under abiotic stress88.

C. spinosa is drought tolerant and shows an efficient hydraulic conductivity due to the well-developed xylem vessels in stems7,89. Therefore, based on this evidence, we further assessed the presence of leaf transcripts homologous to genes involved in stomatal development and distribution that can be considered key genes in the response to drought stress and in water use efficiency (WUE). We found seven transcripts related to stomata. In particular, YODA is a MAPKK kinase gene and GTL1 is a transcriptional repressor of SDD1, a negative regulator of stomatal development90 and density91. ERECTA was the first major effector of WUE to be identified, and a recent study92 demonstrated that the EDT1/HDG11-ERECTA-E2Fa genetic pathway reduced stomatal density by increasing cell size, providing a new strategy to improve WUE in crops. The presence of YODA, ERECTA and GTL1 in the assembled unitranscripts might be associated with an adaptive response of C. spinosa to drought.

We also focused our attention to the presence of homologs encoding for SAPs. These A20/AN1 zinc-finger proteins have been shown to confer tolerance to multiple abiotic stresses in plants. In A. thaliana, AtSAP9 regulates abiotic/biotic stress responses probably via the ubiquitination/proteasome pathway93 and AtSAP13 is upregulated in response to Cd, ABA, and salt stresses94. In rice, SAP homologs are activated by multiple abiotic stresses (such as cold, salt, and dehydration)95. In Prunus, water retention and cell growth are regulated by a stress-associated protein (PpSAP1) through the TARGET OF RAPAMYCIN (TOR) pathway96. In poplar, the downregulation of PagSAP1 increases salt stress tolerance97. The finding of SAP homologs highlights the possible mechanisms involved in the adaptability of C. spinosa to a wide range of environmental conditions.

The production of secondary metabolites with medicinal properties could be reflected by the presence in our transcriptome of several genes involved in phytochemicals biosynthesis: prolycopene isomerase (EC 5.2.1.13) belonging to carotenoids and involved in lycopene biosynthesis; farnesyl diphosphate synthase (FPS) (EC 2.5.1.10), that catalyzes the synthesis of farnesyl diphosphate (FPP) in terpenoid metabolism; enzymes involved in chlorogenic acids (CGA) production, an important scavenging and antioxidant compound98; the anthocyanidin 3-O-glucoside 5-O-glucosyltransferase (EC 2.4.1.298) that converts pelargonidin 3-glucoside in pelargonin, compound with antioxidant activity; enzymes involved in glucobrassicin and glucoiberverin biosynthesis (glucosinolates) known as anti-cancer agents99; the MYB transcription factor Rosea1 (Ros1) that, together with Delila, enhances anthocyanin accumulation and abiotic stress tolerance in tobacco100.

Finally, we developed the first panel of co-dominant markers (EST-SSR) in caper. So far, genetic analysis of Capparis germplasm has largely relied on AFLP, RAPD, and ISSR markers34,101,102. The main reasons for using dominant markers were the lack of a genome sequence and/or transcriptome information for this species. Here, we identified 5,009 microsatellites from the assembled transcriptome, in agreement with the frequencies reported in other studies52,103,104,105,106,107. When mono-nucleotide repeats were excluded, tri-nucleotide repeats were the most abundant class of SSRs, with TCT the most frequent motif in our dataset. This finding is consistent with results reported in other species, such as rice, wheat, barley107,108, cotton109 and asparagus110. To determine the level of polymorphism and the discrimination power of this first set (50) of EST-SSRs, the markers were tested and validated using 75 selected samples. Nine primer pairs failed to produce amplicons, possibly due to primers spanning splicing sites, large introns, chimeric primer(s), or poor-quality sequences106,111. Fourteen primer pairs produced amplicons that deviated from the expected size, which might have been caused by the presence of introns106,111, large insertions or repeat number variations, or a lack of specificity. Conversely, 27 EST-SSRs were validated, 14 of which (52%) were polymorphic. The polymorphism level was higher than the values reported in previous studies34,101,102, with genetic diversity values (He) > 0.5 for 11 out of 14 polymorphic EST-SSRs. These results suggest that the isolated sequences are suitable for the development of specific primers and confirm the quality of the assembled transcriptome. Moreover, cluster and DAPC analyses highlighted the ability of the selected EST-SSRs to discriminate among the taxa and origins of the C. spinosa samples analysed here. In particular, for subsp. spinosa, the Sicilian plants grouped together and were separated from the rest of the Mediterranean and Asian samples. In this regard, it is noteworthy that the plants from the Mediterranean island of Cyprus grouped together with Asian plants from Azerbaijan and China, rather than with those from Sicily. It is therefore tempting to speculate that a more extensive analysis of molecular polymorphism, using the SSRs developed here on samples collected worldwide, might reveal unexpected scenarios of diffusion and evolution of caper. For the group of subsp. rupestris, the germplasm available was limited to Southern Italy, mainland Sicily and its minor islands (Supplementary Table S4). Consequently, due to the proximity of the sampling sites, the germplasm was more homogeneous than that observed for subsp. spinosa (Fig. 6). Nevertheless, samples collected on the minor island of Salina clearly formed a separate group. Caper plants from the islands of Pantelleria and Ustica also diverged from those of mainland Sicily and Italy, though less so. Samples from the minor island of Favignana, on the other hand, did not form a separate group. Interestingly, caper cultivation is a major agricultural activity in the islands of Salina, Pantelleria and Ustica, but not in Favignana or in the rest of Sicily. It is therefore possible that selection attempts by local farmers caused the observed deviation from wild type populations. Moreover, farmers in Salina traditionally propagate caper plants by clonal cuttings, as opposed to farmers in Pantelleria and Ustica, who usually employ seeds112. This difference might explain the pronounced separation of the samples from Salina, since individual diversity in the selected plants is faithfully preserved and the cultivated germplasm no longer mixes with the wild ancestors.

The novel molecular markers developed here can be used in future studies assessing genetic diversity, phylogenetic analysis, marker assisted selection (MAS), mapping and association analysis in Capparis species worldwide. In addition, the newly designed EST-SSR primers for C. spinosa can also be tested in other species of the Capparaceae family that currently lack their own genomic resources.

Methods

Plant material

For RNA-Seq analysis, we selected three wild populations of C. spinosa subsp. rupestris to maximize the variety of genetic backgrounds and environmental conditions, in order to enrich the transcriptomic information (Supplementary Table S5). The three populations were used as biological replicates. For each population, mature leaves from three different specimens were collected at the same vegetative stage and packed in situ, then immediately frozen in liquid nitrogen and stored in the laboratory at −80 °C until use. Furthermore, a panel of 75 wild C. spinosa samples (Supplementary Table S4) was collected across the natural distribution area of the species for DNA extraction in order to validate the isolated EST-SSR markers.

RNA isolation and sequencing

RNA from each collected sample was extracted using a NucleoSpin RNA Plant kit (Macherey-Nagel GmbH & Co. KG, 52355 Düren, Germany) and treated with RNase-free DNase. RNA quality (RNA Integrity Number (RIN) > 8.0) was evaluated using an Agilent Bioanalyzer RNA nanochip (Agilent, Wilmington, DE). RNA-Seq libraries were prepared independently for three pools representing the three different populations. Each pool was composed of equal amounts of leaf tissue from three specimens. Sequencing libraries were prepared using the Illumina TruSeq RNA Sample Preparation Kit v2 (Illumina, San Diego, CA, USA) according to the manufacturer’s specifications; quality and insert size distribution were evaluated using an Agilent Bioanalyzer DNA 1000 chip. Sequencing libraries were quantified using qPCR and sequenced in the same lane on an Illumina HiSeq 1000, generating 2 × 100 nt paired-end reads.

De novo assembly and functional annotation of leaf transcriptome

Raw reads were adapter-clipped and quality-trimmed following recommendations from previous studies113,114. Adapter sequence contamination and low-quality nucleotides (PHRED < 5) were removed using Trimmomatic version 0.33115. De novo transcriptome assembly of the cleaned reads was carried out in Trinity (v.2.5.1)116 with default parameters. To generate a transcriptome containing only unique transcripts for downstream analysis, CD-HIT-EST117 was used (identity cut-off ≥ 90%) to remove all repetitive, identical and near-identical transcripts. The quality and completeness of the de novo assembly were evaluated using BUSCO v3118. This quality assessment tool provides high-resolution quantifications for genomes, gene sets and transcriptomes, and checks whether each BUSCO group is complete, duplicated, fragmented or missing in the genome or transcriptome assembly. The unitranscripts were compared to the Eudicotyledons gene set, which contains 2121 BUSCO groups from a total of 40 species, to obtain a quantitative measure of transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs. In addition, QUAST software119 was used to evaluate potential contamination from bacteria and endophytes.
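The end-trimming criterion (PHRED < 5) can be illustrated with a minimal Python sketch. It is not Trimmomatic's implementation: it assumes Phred+33 quality encoding, handles a single read, and only removes low-quality bases from the two ends. The function names are hypothetical.

```python
from __future__ import annotations

MIN_PHRED = 5  # threshold quoted in the text


def phred33_scores(quality: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality]


def trim_low_quality_ends(seq: str, quality: str,
                          min_phred: int = MIN_PHRED) -> tuple[str, str]:
    """Drop leading and trailing bases whose Phred score is below min_phred."""
    if len(seq) != len(quality):
        raise ValueError("sequence and quality string differ in length")
    scores = phred33_scores(quality)
    start = 0
    while start < len(scores) and scores[start] < min_phred:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < min_phred:
        end -= 1
    return seq[start:end], quality[start:end]


# Illustrative read: '#', '!' and '$' decode to Phred 2, 0 and 3; 'I' decodes to 40.
print(trim_low_quality_ends("ACGTACGT", "#!IIII#$"))  # ('GTAC', 'IIII')
```

A production trimmer would additionally clip adapters and discard reads that become too short after trimming; those steps are omitted here.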

Functional annotation of unitranscripts was performed using the Trinotate pipeline (http://trinotate.sourceforge.net/) to identify open reading frames (ORFs) and assign best hits to UniprotKB (1 × 10⁻⁵), UniRef90 (1 × 10⁻⁵), PFAM-A (1 × 10⁻⁵), GO and KOG categories120. Transdecoder (v.3.1.0) (https://transdecoder.github.io/) was also used to predict putative coding regions and protein sequences de novo. A BLASTP search was carried out using the predicted ORFs as the query and the Swiss-Prot non-redundant database as the target. The HMMER package121 and Pfam databases122 were used to predict protein domains, while SignalP 4.1123 was used to predict the presence of signal peptides within the predicted ORFs. CateGOrizer124 was used to map GO terms to parent plant categories, giving a broad overview of the functional classification of the transcripts. Finally, the KEGG Automatic Annotation Server (KAAS, http://www.genome.jp/kaas-bin/kaas_main?mode=est) was employed to map KEGG pathways of assigned caper orthologs125,126,127,128. All figures showing identified and highlighted pathways were produced with KEGG Mapper, Search Pathway, using unique KOs (https://www.genome.jp/kegg/tool/map_pathway1.html). KO assignments were based on the bi-directional best hit of BLAST129.
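The best-hit and bi-directional best-hit logic mentioned above can be sketched in Python as follows. This is an illustration only, not the internal procedure of Trinotate or KAAS. It assumes that the 1 × 10⁻⁵ values quoted for the database searches are E-value thresholds, and that hits are available in the standard 12-column BLAST tabular format (E-value in column 11, bit score in column 12). The function names and the example identifiers are hypothetical.

```python
from __future__ import annotations

E_VALUE_CUTOFF = 1e-5  # assumed to be the meaning of the thresholds in the text


def best_hits(tabular_lines, max_evalue: float = E_VALUE_CUTOFF) -> dict:
    """Return query -> (subject, evalue, bitscore) for the top-scoring hit."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is not significant under the chosen cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def bidirectional_best_hits(forward: dict, reverse: dict) -> dict:
    """Keep query/subject pairs that are each other's best hit."""
    pairs = {}
    for query, (subject, _evalue, _bits) in forward.items():
        back = reverse.get(subject)
        if back is not None and back[0] == query:
            pairs[query] = subject
    return pairs
```

Here `forward` would come from searching transcripts against the reference database and `reverse` from the opposite search; a pair is retained only when both directions agree.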

Identification and validation of polymorphic EST-SSR markers

All clustered transcripts generated from the de novo assembly were examined to identify new co-dominant molecular markers. SSRs based on short tandem repeats were identified through analysis of ESTs (EST-SSRs) and developed into gene-anchored marker loci130. SSR loci were detected using the MicroSAtellite tool (MISA; http://pgrc.ipk-gatersleben.de/misa/misa.html)131. Di-, tri-, tetra-, penta- and hexa-nucleotides were searched with a minimum of 20, 7, 5, 5 and 4 repeat units, respectively. A set of 150 primer pairs was designed using Primer3 software (http://primer3.sourceforge.net/)132, imposing an amplicon size range of 100–400 bp, a GC content between 40 and 60%, and a melting temperature (Tm) between 58 and 60 °C (Supplementary Table S3). A first panel of 50 EST-SSRs was tested and validated by verifying primer specificity and amplicon size (Supplementary Table S3) on the above-mentioned panel of samples (Supplementary Table S4). Genomic DNA was extracted from leaves (200 mg) using the CTAB protocol133. DNA concentration and quality were checked with a NanoDrop ND-1000 (Thermo Scientific). PCR amplifications were carried out in 20 µl reaction mixtures starting from 50 ng of DNA, as previously described110. The fragments were analyzed on an ABI PRISM 3500 Genetic Analyzer (Applied Biosystems) and the alleles were sized with GENEMAPPER 4.0 (Applied Biosystems). Genetic diversity (He), mean allele number, fixation index (Fst), inbreeding coefficient (Fis) and Polymorphism Information Content (PIC) for each EST-SSR used were calculated using PowerMarker v. 3.25134 and R/poppr135. Genetic relationships among the studied genotypes were also investigated by cluster analysis and Discriminant Analysis of Principal Components (DAPC). The UPGMA (Unweighted Pair Group Method with Arithmetic Mean) phylogenetic tree was built using R/poppr135 with Bruvo’s distance136.
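Two of the per-locus statistics named above, He and PIC, follow directly from allele frequencies. The Python sketch below uses the textbook definitions, He = 1 − Σp_i² and PIC = 1 − Σp_i² − Σ_{i<j} 2p_i²p_j²; PowerMarker and R/poppr may apply sample-size or other corrections, so their values need not match exactly. The genotype data and function names are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def allele_frequencies(genotypes) -> dict:
    """Frequencies of alleles at one locus; each genotype is a tuple of allele sizes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def expected_heterozygosity(freqs: dict) -> float:
    """Gene diversity, He = 1 - sum(p_i^2), without sample-size correction."""
    return 1.0 - sum(p * p for p in freqs.values())


def polymorphism_information_content(freqs: dict) -> float:
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    p = list(freqs.values())
    homozygosity = sum(x * x for x in p)
    cross = sum(2 * p[i] ** 2 * p[j] ** 2
                for i in range(len(p)) for j in range(i + 1, len(p)))
    return 1.0 - homozygosity - cross


# Hypothetical diploid genotypes (allele sizes in bp) at a single EST-SSR locus.
locus = [(150, 150), (150, 156), (156, 156), (150, 156)]
freqs = allele_frequencies(locus)
print(expected_heterozygosity(freqs))           # 0.5
print(polymorphism_information_content(freqs))  # 0.375
```

With two equally frequent alleles, He reaches its two-allele maximum of 0.5, while PIC is lower because it discounts matings that are uninformative for linkage.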
Bootstrap support for the UPGMA tree was assessed with 1,000 resamplings. DAPC, implemented in R/adegenet137, was performed to infer the population subdivision of the analysed collection, regardless of geographic origin. Since only one sample (LAM01) from the Lampedusa island population was available, this population was excluded from the DAPC analysis. In the output, samples were gathered in 8 main groups (Fig. 6B; Table S6). The number of principal components (PCs) retained was evaluated using the cross-validation procedure. We also used the K-means algorithm, ‘find.clusters’, to independently verify the assignment of individuals to clusters.
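The SSR search criteria given in this section (at least 20, 7, 5, 5 and 4 repeat units for di- to hexa-nucleotide motifs) can be expressed as a short Python sketch based on regular expressions. This is a simplified stand-in for MISA, not its implementation: it reports perfect repeats only and ignores compound microsatellites. The function names are hypothetical.

```python
from __future__ import annotations

import re

# Motif length -> minimum number of repeat units, as stated in the text.
MIN_REPEATS = {2: 20, 3: 7, 4: 5, 5: 5, 6: 4}


def is_multiple_of_shorter_unit(motif: str) -> bool:
    """True for motifs such as 'AGAG' or 'AAA' that repeat a shorter unit."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(sequence: str, min_repeats: dict = MIN_REPEATS) -> list:
    """Return (start, motif, repeat_units) for each perfect SSR found."""
    seq = sequence.upper()
    hits = []
    for motif_len, min_units in sorted(min_repeats.items()):
        # One motif copy followed by at least (min_units - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_units - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_multiple_of_shorter_unit(motif):
                continue  # counted under the shorter motif class, if at all
            units = len(match.group(0)) // motif_len
            hits.append((match.start(), motif, units))
    return hits


# Hypothetical transcript fragment with one (AG)20 and one (ATG)7 repeat.
fragment = "CC" + "AG" * 20 + "CC" + "ATG" * 7 + "TT"
print(find_ssrs(fragment))  # [(2, 'AG', 20), (44, 'ATG', 7)]
```

Start positions are 0-based. The filter on motifs built from shorter units prevents a long (AG)n tract from also being reported as a tetra- or hexa-nucleotide repeat.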