The identification of the Rosa S-locus and implications on the evolution of the Rosaceae gametophytic self-incompatibility systems

In Rosaceae species, two gametophytic self-incompatibility (GSI) mechanisms are described, the Prunus self-recognition system and the Maleae (Malus/Pyrus/Sorbus) non-self- recognition system. In both systems the pistil component is a S-RNase gene, but from two distinct phylogenetic lineages. The pollen component, always a F-box gene(s), in the case of Prunus is a single gene, and in Maleae there are multiple genes. Previously, the Rosa S-locus was mapped on chromosome 3, and three putative S-RNase genes were identified in the R. chinensis ‘Old Blush’ genome. Here, we show that these genes do not belong to the S-locus region. Using R. chinensis and R. multiflora genomes and a phylogenetic approach, we identified the S-RNase gene, that belongs to the Prunus S-lineage. Expression patterns support this gene as being the S-pistil. This gene is here also identified in R. moschata, R. arvensis, and R. minutifolia low coverage genomes, allowing the identification of positively selected amino acid sites, and thus, further supporting this gene as the S-RNase. Furthermore, genotype–phenotype association experiments also support this gene as the S-RNase. For the S-pollen GSI component we find evidence for multiple F-box genes, that show the expected expression pattern, and evidence for diversifying selection at the F-box genes within an S-haplotype. Thus, Rosa has a non-self-recognition system, like in Maleae species, despite the S-pistil gene belonging to the Prunus S-RNase lineage. These findings are discussed in the context of the Rosaceae GSI evolution. Knowledge on the Rosa S-locus has practical implications since genes controlling floral and other ornamental traits are in linkage disequilibrium with the S-locus.


Rchinensis1_3-Rchinensis2_27 is expressed in pistil and ovary tissue, like the S-RNases. The
S-RNase gene is highly expressed in pistils, in stigmas and styles of flowers at anthesis, but also shows low expression in entire flower buds. It should be noted that S-RNase lineage genes, can also show a similar pattern of expression, since this expression is inferred to be the ancestral expression of the S-RNase lineage genes 24 . Therefore, it is not surprising that SRNase30, and SRNase36, show expression in pistil and ovary (Fig. 2), as previously reported 44 . SRNase26 shows no expression in the tissues here analyzed, as well as those used in 44 . Rchinensis1_3-Rchinensis2_27, shows a similar expression to SRNase30, but with high levels of expression in pistil and ovary, and low expression in stamen (Fig. 2). Rchinensis2_16-Rchinensis3_17-Rchinensis4_40 is not expressed in the tissues here analyzed. These results are compatible with Rchinensis1_3-Rchinensis2_27 gene, being the S-pistil gene determining GSI specificity.  51 analyses. These amino acid sites are, in principle, responsible for GSI specificity 24,35,52 .The location of these sites at the predicted 3D structure is mostly around the active site pocket region (Fig. 3), as observed in other Rosaceae and Solanaceae species 25,52 .

Rosa S-RNases
, and E893 (E459 x Url; S1 S5) individuals. The sequence of these amplification products, reveled two types of identical sequences. One obtained from Ose, E200, E404, and E435 individuals, those individuals having the S2-RNase (GenBank acc. Numbers MW452856-MW452859). The other sequence type was obtained from Url and E893, those individuals having the S5-RNase (GenBank acc. numbers MW452860 and MW452861). These results support co-segregation of the S-locus with the S2-RNase genotypes here surveyed. Therefore, the S-RNase gene here identified is on the S-locus region.
Identification of the Rosa S-pollen F-box genes. The R. rugosa S-locus F-box gene (RrSLF; KY446808), expressed in pollen tissue and phylogenetically related with Prunus SFB gene, was reported as a putative Rosa S-pollen gene 46 . It should be noted that other F-box genes not involved in S-pollen specify determination are also expressed in pollen, as well as in other tissues 24 . The expression of RrSLF gene in other tissues has not, however, been addressed 46 . This gene has 99% homology at nucleotide level with a R. chinensis sequence (CM009583.1) located on chromosome 2, but the Rosa S-locus is located on chromosome 3 ( 44,45 , and this work). Furthermore, all S-pollen genes described in Rosaceae, Solanaceae, and Plantaginaceae are intronless genes, and the R. chinensis RrSLF (PRQ51373) and R. multiflora (Rmu_sc0016061.1;BDJD01015883.1) orthologous genes have one intron. The orthologous gene has been identified in all low coverage Rosa genomes, except R. majalis, covering the entire coding region. Low levels of divergence (0.036 for synonymous and 0.009 for non-synonymous divergence respectively, after Jukes and Cantor correction; N = 10) are obtained for this gene. This is in contrast with the Prunus SFB gene, that presents levels of variability above 20% 6,7,11,36 . Therefore, RrSLF is not the S-pollen gene.
In R. chinensis chromosome 3 there are 30 SFB/SLFL/SFBB like genes (called Fbox − and + according to the 5′ or 3′ position relative to the S-RNase, respectively; Supplementary Table S3; Supplementary File S4). In the two R. multiflora scaffolds where the S-RNase is located, there are five such genes (Supplementary Table S3). The phylogenetic analyses of the R. chinensis and R. multiflora SFB/SLFL/SFBB like genes together with Prunus SFB and SLFL genes, M. domestica S1-SFBBs, Petunia and Nicotiana SLF sequences, and A. thaliana F-box/kelch-repeat, shown in Fig. 4 (Supplementary File S4), revealed that the Rchinensis_F-box + 13 gene clusters with Prunus SFB gene. This gene is expressed in all tissues here analyzed (Fig. 5), but the S-pollen gene(s) are mainly expressed in anthers /pollen only 7, 10, 12-16, 21, 23-25 , and thus is unlikely to be the S-pollen gene. Furthermore, using a 1188 bp gene region obtained from eight Rosa genomes (R. arvensis, R. laevigata, R. moschata, R. xantina, R. rugosa, R. odorata, R. multiflora, and R. chinensis), low average levels of divergence (0.0321 for synonymous and 0.0128 for non-synonymous divergence respectively, after Jukes and Cantor correction) are observed. Moreover, this gene is the neighbor F-box gene of the T2-RNase Rchinensis1_8-Rchinensis2_25, that does not cluster with Prunus or Maleae S-RNases (Fig. 1). Therefore, Rchinensis_F-box + 13 gene is not involved in S-pollen GSI specificity determination.
The F-box genes in the vicinity of the R. chinensis and R. multiflora S-RNase gene cluster, with high support, with different Prunus SLF genes (Fig. 4), that are a sister group of Malus SFBBs 24 . In R. chinensis 14 of these genes (Rchinensis_F-box-3 up to Rchinensis_F-box + 11) show expression in stamen only, compatible with being S-pollen genes (Fig. 5). Furthermore, the R. multiflora orthologs of R. chinensis are not in the same order relatively to the S-RNase gene (Fig. 4), showing that this region is highly rearranged, as observed in the Malus/ www.nature.com/scientificreports/ Pyrus/Sorbus S-locus region 13, 16, 18-20, 24, 25 . This is surprising since we identified the Rosa S-RNase as belonging to the Prunus S-lineage, and in Prunus there is a single S-pollen gene. In Prunus the S-pollen gene presents levels of diversity similar to the S-RNase gene 11,29 . Therefore, we also determined levels of synonymous and non-synonymous divergence for the F-box genes surrounding the R. chinensis S-RNase (Rchinensis_F-box-1 and Rchinensis_F-box + 1, used as query in a blastn to identify contigs containing the orthologous genes in the low coverage Rosa genomes). The low levels of synonymous and non-synonymous divergence for Rchinensis_Fbox-1 (0.070 and 0.005 respectively), and Rchinensis_F-box + 1 (0.112 and 0.02129 respectively) are incompatible with the hypothesis that one of them is determining Rosa S-pollen specificity. Therefore, multiple F-box genes must be involved in Rosa pollen GSI specificity determination, as in Malus/Pyrus/Sorbus and Solanaceae species. On the predicted 3D structure, these amino acid sites are located in the same regions (Fig. 6) as those observed for Petunia S-pollen genes 26 . Evidence for positive selection is also observed for the five F-box genes of the R. multiflora scaffold sca0006888 (results available at http://bposi tive.i3s.up.pt/ under project Rosa S-locus genes; BP2018000004). Therefore, the data suggests that Rosa S-pollen specificity is determined by multiple F-box genes, like in Malus/Pyrus/Sorbus and Solanaceae species.

Discussion
In Rosa, there are very important traits of horticultural interest such as flower development, architecture, senescence, scent biosynthesis and emission, ease of reproduction, and resistance to biotic and abiotic stresses, that have been selected only once during the history of rose selection, and incorporated into many rose varieties 44,45,48 . Indeed, only 10 species have contributed to the genetic make-up of most of the modern rose cultivars, and some old and popular cultivars, such as 'Old Blush' have dominated the history of rose selection 40 . This Chinese rose from the Song dynasty (960-1279) conveys several desirable characters such as recessive reblooming habit, recessive lack of prickles (stem) and dominant flower doubleness 44,45,48 , all co-segregating with the S-locus region. The characterization of the S-locus region here performed is thus, very important in order to help breeding selection, and the control of genetic diversity. The Rosa S-locus is composed of a S-RNase gene that belongs to the Prunus lineage (Fig. 1). This gene shows all expected features of a S-pistil gene since it shows expression in pistil and ovary (Fig. 2), evidence for diversifying selection (Fig. 3), and co-segregation with the S-locus. In Fragaria, that also belongs to the Rosoideae subfamily (the two genera have been diverging for about 50 million years 54 ), a Prunus lineage S-RNase gene has been also reported as the putative S-pistil gene 24 . This suggests that the Rosoideae GSI system could be of the self-recognition type, with one S-pollen gene, as in Prunus ( 24 and references therein). Nevertheless, the Rosa F-box gene that clusters with the Prunus SFB gene (Fig. 4) is not the neighbor of the S-RNase gene, as observed in Prunus, and presents expression (Fig. 5) and polymorphism levels incompatible with being the S-pollen gene. In Rosa there are multiple F-box genes in the vicinity of the S-RNase, that show expression compatible with being www.nature.com/scientificreports/ the S-pollen gene (Fig. 5), and evidence for diversifying selection for F-box genes within a S-haplotype (Fig. 6). These are the expected features of S-pollen genes in a non-self-recognition system, as reported in Malus/Sorbus/ Pyrus 13, 16, 18-20, 24, 25 and Solanaceae species 15,22,23,26,[30][31][32] . Therefore, the Rosa S-locus region encompasses a large region, as reported for other species presenting non-self recognition systems, and this could explain why in rose several traits of horticultural interest are linked to the S-locus. It should be noted that the Rosa S-locus region may be highly rearranged, since we could not align the contig containing the S-RNase gene from R. chinensis with those from R. multiflora as well as the two contigs containing the S-RNase gene from R. multiflora. This observation has been previously reported for the two available R. chinensis genomes, that could not be aligned in this region (conceivably carrying two different S-haplotypes; see Fig. 1 of 55 ). Pollen transcriptome analyses of multiple S-haplotypes are, however needed to determine how many F-box genes in the vicinity of the Rosa S-RNase are involved in pollen GSI specificity, as performed in Malus 25 and Petunia 21 . Since multiple S-pollen genes are determining Rosa S-pollen specificity, as in Maleae species, this suggests that the ancestral Rosaceae GSI system was of the non-self-recognition system type. This implies that during evolution the Malus S-RNase lineage, that does not cluster with the Rosa S-RNase, has been recruited de novo from a duplicate of the ancestral S-RNase gene. Moreover, it implies that the Prunus S-pollen gene, that does not cluster with Rosa S-pollen genes, evolved de novo from an unrelated F-box gene.
Low levels of variation have been reported in roses but the S-locus region, being large and under balancing selection, may help retain a substantial fraction of the variability. Indeed, the evolution within the genus Rosa, occurred via interspecific hybridization, allopolyploidization and genetic reticulation among sympatric species 56 . In Europe, the recent post-glaciation period expedited the process of speciation, as the northward extension of the biotopes increased. The diploid R. arvensis holds a special place in this process, since, although it belongs to the synstylae group, has a substantial genomic promiscuity with the polyploid caninae group 56,57 , that may also help maintain diversity levels. The inbred strains that are being established for R. arvensis 53 are a starting point to investigate the diversity of the S-locus in European wild roses. Mutants breaking down self-incompatibility, will also help refine the S-locus effect on roses molecular diversity.

Methods
Identification of putative S-RNase like sequences in Rosa genomes. The annotations (CDS) of two R. chinensis genomes (available at NCBI RefSeq database (www.ncbi.nlm.nih.gov) and GDR database (https ://www.rosac eae.org) Supplementary Table S1) were downloaded, and the sequences showing similarity with reference S-RNases assigned as Rchinensis1 and Rchinensis3, respectively. Moreover, the corresponding genome sequences were downloaded as a FASTA file, and the S-RNase sequences here annotated labeled as Rchinensis2 Figure 6. Positively selected amino acid sites on the predicted 3D structure of the R. chinensis F-box located in the 5` region of the S-RNase. In yellow are highlighted those sites that appear as positively selected in the two datasets analyzed (14 and 13 R. chinensis F-box genes in the vicinity of the S-RNase that are expressed in stamen), in green those that are located in a region not analyzed when 14 F-box genes are considered, and in blue those that change due to alignment gaps in the two datasets analyzed. www.nature.com/scientificreports/ and Rchinensis4, respectively. R. multiflora genome (NCBI assembly database, Supplementary Table S1) was also downloaded as a FASTA file, since annotations are not available for this species. To find and extract protein encoding segments larger than 100 bp, we have used getorf, using the emboss Docker image available at pegi3s Bioinformatics Docker images project (htttps:pegi3s.github.io/dockerfiles). Then, we selected the protein encoding segments that show similarity with reference S-RNase sequences, using tblastx (Expect value (e) < 0.05), as implemented in SEDA 58, 59 , using as query Prunus S3-RNase (AJ298312), Malus Sh-RNase (AB032247), and Fragaria putative S-RNase (gi561957436, gi561674690 and gi561985884) 24 . Based on this information, we manually annotated the corresponding genome region to identify the exons of each gene. For each putative gene we obtained the corresponding amino acid sequence to calculate IP, using ExPASy software 60 . We used the same protocol for the genomes here assembled using the short reads of nine Rosa genomes downloaded from NCBI SRA database (Supplementary Table S1). In this case, we used FastQC to evaluate read quality, and Cutadapt to trim reads 61 , and ABySS 2.0 62 for the de novo assembly, using the Docker images available at pegi3s Bioinformatics Docker images project (htttps://pegi3s.github.io/dockerfiles).
Identification of the F-box genes located in R. chinensis chromosome 3 and R. multiflora scaffolds where the S-RNase is located. The protocol presented for the S-RNase like sequences was also used to obtain F-box genes located in R. chinensis chromosome 3, and for the two R. multiflora scaffolds where the S-RNase gene has been identified, using as query P. avium SFB3 (AAT72121.1), P. avium SLFL1 (BAG12295.1), and M. domestica SFBB3-beta (BAF47180.1).
Phylogenetic analyses. Phylogenetic analyses of the R. chinensis and R.multiflora S-RNase like sequences and F-box like genes were performed using sequences aligned with MUSCLE alignment algorithm, as implemented in ADOPS 63 . Only codons with a support value above 2 were used for phylogenetic reconstruction. Bayesian trees were obtained using MrBayes 3.1.2 64 as implemented in the ADOPS pipeline 63 . The Generalised Time-Reversible (GTR) model of sequence evolution was implemented in the analyses, allowing for among-site rate variation and a proportion of invariable sites. Third codon positions were allowed to have a gamma distribution shape parameter different from that of first and second codon positions. Two independent runs of 1,000,000 generations with four chains each, were carried out. The average standard deviation of split frequencies was always ~ 0.01 and the potential scale reduction factor for every parameter was ~ 1.00, showing that convergence was achieved. Trees were sampled every 100th generation and the first 5000 samples were discarded (burnin). The tree was converted to Newick format using the Format Conversion website (http://phylo geny.lirmm .fr/ phylo _cgi/data_conve rter.cgi) and edited using Mega7 65 .
The phylogenetic analyses of the low coverage Rosa genomes (Supplementary Table S1) sequences were performed with Mega7 65 , using ClustalW alignment algorithm, Neighbor-Joining method, bootstrap test with 10,000 replicates, the p-distance method for computing the evolutionary distances, and pairwise deletion since sequences can have different sizes (Supplementary Table S2).
Expression of R. chinensis S-RNase like, and F-box genes located on chromosome 3 in pistil and ovary, stamen, leaf, stem and root transcriptomes. To estimate expression of the Rosa S-RNase like sequences located on chromosome 3, we use RNA-seq data from R. chinensis pistil and ovary, stamen, leaf, stem, and root transcriptomes (Supplementary Table S4). We used FastQC to evaluate read quality, and Cutadapt to trim reads 61 . FPKM values were estimated using the RSEM method, as implemented in Trinity 66 , using the R. chinensis Refseq CDS, and the R. chinensis S-RNase like and F-box sequences located on chromosome 3.

R. arvensis genotype-phenotype association experiments.
Genomic DNA was extracted from leaves of 12 R. arvensis individuals, for which the haplotypes were predicted according to their parents and progeny ( Supplementary Fig. S3A,B), using the method of 67 . PCRs were performed using the genomic DNA and primers RA-F 5′ GGA AGC CAR ACT GAA GAT 3′ and RA-R 5´AGC ATC ACA GTY TCG ATC A 3′, designed for conserved regions of the putative Rosa S-RNase sequences here identified. Standard amplification conditions were 35 cycles of denaturation at 94 °C for 30 s, 52 °C for 30 s for primer annealing, and primer extension at 72 °C for 2 min. The amplification products with the expected size (353 bp) for individuals Ose, E200, E404, Url, and E893 were cloned using the TA cloning kit (Invitrogen, Carlsbad, CA). For each individual, the insert of 16 colonies was cut separately with DdeI, and HinfI restriction enzymes, and only one restriction pattern was observed, and thus three colonies only were sequenced. The ABI PRISM BigDye cycle-sequencing kit (Perkin Elmer, Foster City, CA), and specific primers, or the primers for the M13 forward and reverse priming sites of the pCR2.1 vector, were used to prepare the sequencing reactions. Sequencing runs were performed by STABVIDA (Lisboa, Portugal).
Identification of positively selected amino acid sites, their location on the crystal structure, and polymorphism levels. For the six sequences identified as putative Rosa S-RNases, we inferred positively selected amino acid sites, using codeML 51 , as implemented in ADOPS 63 , using Muscle as the alignment method. Such analyses where also performed for the 14 R. chinensis F-box sequences that are in the vicinity of the S-RNase, that cluster with Prunus SLFL, and that are expressed in stamen. Since the inclusion of Rchinensis_Fbox-3 gene sequence excludes from the analyses a large fraction of the 3´region, we performed these analyses after removing this sequence. codeML analyses were also performed for five R. multiflora sc0006888 F-box sequences in the vicinity of the S-RNase. The details of the analyses can be seen at the B + database (bpositive.i3s. up.pt 68  www.nature.com/scientificreports/ positively selected those amino acid sites that show a probability higher than 90% for both naive empirical Bayes (NEB) or Bayes empirical Bayes (BEB) methods.
To visualize these positions in the 3D structure, for the S-RNase (translation of Rchinensis1_3-Rchinen-sis2_27; Supplementary File S1) we first identified the signal peptide using SignalIP (http://www.cbs.dtu.dk/servi ces/Signa lP/) website tool available at ExPASy 60 . After removing the signal peptide, the 3D structure was modeled by I-Tasser 69 , and the model with the highest C-score value used. The same methodology was used for the putative S-pollen gene (translation of R chinensis_F box-1; Supplementary File S4), but in this case after removing the F-box domain (the first 60 amino acid positions). All structural images were produced using PyMOL (The PyMOL Molecular Graphics System, Version 1.7.4 Schrödinger, LLC.).
Levels of polymorphism were obtained with DnaSp 70 .

Data availability
All data generated or analyzed during this study are included in this published article (and its Supplementary Data File S1-S4). R. arvensis S2-and S5-RNase are deposited in GenBank, acc. Numbers MW452856-MW452861.