Portuguese wild grapevine genome re-sequencing (Vitis vinifera sylvestris)

The first genome of Vitis vinifera vinifera (PN40024), published in 2007, boosted grapevine related studies. While this reference genome is a suitable tool for the overall studies in the field, it lacks the ability to unveil changes accumulated during V. v. vinifera domestication. The subspecies V. v. sylvestris preserves wild characteristics, making it a good material to provide insights into V. v. vinifera domestication. The difference in the reproductive strategy between both subspecies is one of the characteristics that set them apart. While V. v. vinifera flowers are hermaphrodite, V. v. sylvestris is mostly dioecious. In this paper, we compare the re-sequencing of the genomes from a male and a female individual of the wild sylvestris, against the reference vinifera genome (PN40024). Variant analysis reveals a low number but with high impact modifications in coding regions, essentially non-synonymous single nucleotide polymorphisms and frame shifts caused by insertions and deletions. The sex-locus was manually inspected, and the results obtained are in line with the most recent works related with wild grapevine sex. In this paper we also describe for the first time RNA editing in transcripts of 14 genes in the sex-determining region, including VviYABBY and VviPLATZ.

. Sampling material of Vitis vinifera sylvestris (wild type) and flowchart of data analysis steps. WM individual show carpel abortion (typical of a male flower type); WF individual has reflexed stamens (representing a female flower type). Reference genome is PN40024 (CRIBI, 12X).

Comparison between the wild-type and reference genomes. Single Nucleotide Polymorphism
(SNP) and Insertions/Deletions (InDels) analysis were performed between the sequences obtained for the two wild-type individuals and the reference genome. Wild-type individuals used on this re-sequencing project present distinct flower characteristics. One of the plants show male characteristics (WM), with no functional carpel and the other exhibits female flowers (WF) with a complete carpel, but with reflexed stamens producing infertile pollen. Both of these individuals had almost the same number of SNPs identified (6,197,145 for WF and 6,575,026 for WM individual), and shared half of the occurrences (3,643,713 SNPs; Fig. 2a). The InDels analysis showed the same tendency as the previous results: the number of occurrences in both individuals were similar (1,151,997 for WF and 1,177,918 events for WM) and about half of them were shared (637,320 events; Fig. 2b).
The ratio between the numbers of transitions over transversions (Ti/Tv) is usually calculated in order to assess the quality of an assembly or SNPs calling. As Ti/Tv ration has been reported dependent of the genomic context of the SNPs 27 , we have evaluated Ti/Tv on introns (ratio 2.1), exons with synonymous SNP (1.9), exons with non-synonymous SNPs (1.4) and intergenic regions (2.2) (Fig. 2c). Our analysis confirmed that Ti/Tv was dependent of the genomic location, but not on the individuals under study, since WM and WF presented the same Ti/Tv pattern (Fig. 2c).
The distribution of SNPs and InDels, on the genome, was evaluated for each chromosome, by calculating the frequency of occurrences (number of SNPs/InDels by chromosome size). In both SNPs and InDels, the chromosome 6 was the one with the lowest number of events. For this chromosome, 1.00% of the bases were found to be changes on a single nucleotide (frequency of 0.011; Fig. 3a and Supplementary Figure S1) and InDels were identified in 0.20% of the chromosome (frequency of 0.0020; Fig. 3b and Supplementary Figure S1). The low number of modifications in chromosome 6 is essentially caused by WF individual, which presents a fraction of SNPs below Tukey 's fence. The least conserved chromosome was the 9 with a percentage of SNPs of 1.5% (frequency 0.015; Fig. 3a and Supplementary Figure S1) and InDels covering 0.29% of this chromosome size (0.0029; Fig. 3b and Supplementary Figure S1). The modification enrichment in chromosome 9 is mainly due to the WM individual. The pairwise comparison between WM and WF, besides chromosome 6 distinct profiles between the two tested individuals, as referred, also highlights the higher number of modifications in WM chromosome 13 than in WF (Supplementary Figure S2).
The analysis of chromosome regions with higher and fewer number of modifications, carried out in both individuals, showed that the density varied along the chromosomes, however most of them do not have a specific location with high density of SNPs and InDels simultaneously (Supplementary Figure S3).
We analysed the nature of the substitutions in SNPs by comparing the nucleotides on the reference genome with the nucleotides observed for the wild type. In accordance with other studies 10,12 , the number of transitions (substitution between adenosine and guanine or cytosine and thymine) were much more frequent on both genomes than transversions (substitutions between adenosine and cytosine, adenosine and thymine, cytosine and guanine or guanine and thymine) ( Fig. 3c and Supplementary Figure S4). Table 1. Number of wild-type reads mapped against the reference genome (PN40024). Unique: reads that map only to one site on the reference genome. Non-unique: reads that map to multiple sites of the reference genome. Singletons: reads that map to the reference genome, without mate pair. Cross-contigs: reads whose pair map to a different reference chromosome. Percentages: percentage of each class to the total mapped reads. WF: Reads obtained from the female wild individual. WM: Reads obtained from the wild type individual with male flower types. www.nature.com/scientificreports/ The effects of SNPs on protein-coding regions were also investigated by analysing the SNP location within gene context and the impact caused. To perform this analysis, we have calculated the percentage of modified features. As expected, the most affected context by SNPs was the intergenic (about 4.1% of intergenic sites reveal differences compared to reference genome; Fig. 4a and Supplementary Figure S5). Upstream, UTR regions and downstream are also affected regions (ranging from 1.3 to 0.9%). Introns and CDS regions reveal almost the same levels of SNPs, in terms of percentages (0.6% and 0.7%, respectively). When SNPs affected CDS, we further studied the in silico impact expected ( Fig. 4b and Supplementary Figure S5). The percentage of CDS modifications with synonymous and non-synonymous impact was almost the same (0.3% and 0.4%, respectively). A high percentage of premature STOP codons was found, essentially considering that this impact is among the most disruptive ones (0.01%).
The overall distribution of InDels is almost identical to SNPs: the majority of indels occurs in intergenic (0.6%), followed by upstream, UTRs and downstream regions (ranging between 0.3% and 0.2%; Fig. 4c and Supplementary Figure S5). However, contrary to SNPs, only a small portion of CDS are affected by InDels (0.03%). In terms of impact (considering CDS context), the majority of InDels caused frame shift (0.02%; Fig. 4d and Supplementary Figure S5). InDels resulting in codon change + deletions, corresponding to events where at least one codon is changed and at least one is deleted (0.003%), codon insertion (0.002%), codon change with insertions and simple codon deletion (both lower than 0.002%) were also found ( Fig. 4d and Supplementary Figure S5).
Heterozygosity levels. The nature of the SNPs and InDels was evaluated regarding their homozygosity.
Most of the events were heterozygous (0/1 and 1/2) ( Fig. 5 and Supplementary Figure S6). The sylvestris individuals had about 4,104,719 heterozygous SNPs and 761,638 heterozygous InDels, being the WM plant more heterozygous than the WF (Fig. 5). The number of homozygous alterations between sylvestris and the reference genome (1/1) was much lower (2,281,366 homozygous SNPs and 403,321 homozygous InDels) ( Fig. 5 and Supplementary Figure S6). Combining SNPs and InDels, the total number of heterozygous loci (0/1 and 1/2) was 4,866,357 while the homozygous (1/1) changes were 631,487. The ratio between heterozygous/homozygous SNPs was 1.80 and1.89 for InDels. www.nature.com/scientificreports/ www.nature.com/scientificreports/ www.nature.com/scientificreports/ An analysis of the heterozygous (0/1 and 1/2) over non-reference homozygous (1/1) SNPs, considering the sylvestris individuals separated, at multiple genomic locations, revealed slightly differences between them (Fig. 5c). Ratios obtained were 2.2 for WM (introns and exons), 2.0 for WM (intergenics), 2.0 for WF (exome) and 1.8 for WF (introns and intergenics) (Fig. 5c). When considering exons, the ratios obtained for synonymous and non-synonymous were the same (data not shown).   Figure S7d). We also validated the occurrences of individual exclusive modifications, where one of the individuals (WF) was homozygous and the other (WM) was found to be heterozygous (Supplementary Figure S7d). All the modifications identified in Sanger analysis were in accordance with Illumina results. Although, a higher number of changes were detected than those presented in the Supplementary Figure S7, the ones presented are an example to validate the results.
De novo assembly metrics. A de novo assembly of the genomes was achieved to better assess the sexdetermined region. All the reads obtained in this project were cleaned to remove low confidence reads. The female and male assembly used 224,542,530 and 287,279,296 reads, respectively ( Table 2). The coverage estimates for this assembly was 58X to female and 74X to male individual ( Table 2). After the assembly, the female genome was comprised into 102,650 contigs with a total size of 364,897,899 bp (N50 of 5,795), whereas the male was represented by 108,550 contigs, totalling 370,798,370 bp (N50 of 5,317; Table 2). A %CG ratio of 34 was identified for both samples under study.

Manually inspection of chromosome 2.
Recently, knowledge about the sex-determining region was updated 18,19 , and the 24 genes located in the sex-determining region were manually inspected from a de novo assembly performed with the raw data obtained for both male (WM) and female (WF) sylvestris individuals. The number of contigs identified for each one of these genes was listed in Supplementary Table S1. Both studies 18,19 indicate that the gene coding for INAPERTU RAT E POLLEN1 (VviINP1) is responsible for male sterility, due to an 8-bp deletion in the CDS region, detected in female individuals, causing a premature stop codon. We found that this gene has homology with VIT_200s0233g00051 and VIT_202s0154g00120, associated with chromosomes "unknown" and 2, respectively. We consider that these two IDs are two alleles of the same gene. In the samples tested under this re-sequencing project, we identified the two alleles in the male individual, but the female individual only has the allele with the 8-bp deletion (VIT_202s0154g00120). Although there was a consensus on the identification of VviINP1 as responsible for male sterility, the results for a female sterility gene were not yet unanimous and several genes were proposed as responsible for this trait. The most promising gene reported for female sterility were VviYABBY (VIT_202s0154g00070) and VviPLATZ (VIT_202s0154g00150) 18 . As far as we were able to identify, both of these genes only presented one allele in the hermaphrodite reference genome. In our wild plants the genes were heterozygous in the male and homozygous in the female plants. We also manually inspected several genes on chromosome 2, some of them described to be involved in flower development like VviNGA1 (VIT_202s0025g03000) 30 and VviSLK2 (VIT_202s0025g04390) 31 . Other 20 genes located on chromosome 2, with distinct transcription patterns were previously 15 detected 16 and were subject to this analysis, in a total of 46 genes. From the 46 genes analysed, the male individual studied was heterozygous for 24 genes (Fig. 6). The alleles of three of WM genes were identical to the reference genome (VIT_202s0241g00140, VIT_202s0154g00230, VIT_202s0033g00020; Fig. 6). In the female individual, 9 out of 46 genes were heterozygous with both alleles presenting differences compared to the reference genome; only two of 46 genes had alleles identical to the reference genome (VIT_202s0025g02630 and VIT_202s0033g00020); and the remaining 36 genes were homozygous with differences when compared to the reference. Also, we found only one identical gene between male and female individuals (VIT_202s0154g00030) being homozygous, but distinct to the reference genome (Fig. 6).
RNA editing was found in transcripts of the sex-determining region. RNA editing differences, defined as differences found in the RNA after transcription, including nucleotides substitutions, were identified in Vitis 26 , using these individuals and previous transcriptomic data 32,33 . It is relevant to report here, that no RNA Table 2. Parameters related to de novo assembly of the wild female (WF) and male (WM) individuals. Total number of reads: the number of reads obtained from the sequencing. Reads used on the assembly: number of reads after cleaning with BBDuk (BBMap package). Coverage: Expected coverage, considering the reference genome of 486 Mbp. Number of contigs: The amount of contigs obtained from the assemblies with more than 1kbp. Total assembly size: The total size (in base pairs) of the obtained genome. N50: The size of the shortest contig at 50% of the total genome length. %CG: Percentage of Cytosine and Guanines over the total genome length.  Figure S8b). The in silico analysis revealed seven genes in which RNA editing (six in SDR and one outside) was found in WM but only VviPLATZ showed a nonsynonymous modification. In WF, nine out of twenty-three genes with RNA editing showed non-synonymous modification, which includes VviPLATZ and VviFSEX (VIT_202s0154g00200) (Fig. 6).
To confirm that RNA editing is not an exclusive event in these two individuals (WM and WF), we performed a second validation using a distinct female plant (WF_2). This validation targeted VIT_202s0033g01400, a gene in chromosome 2 far from the sex-determining region. We found several editing events, some with the two forms of mRNA, edited and unedited and one full edited event (Supplementary Figure S9b). The RNA editing tested by the Sanger sequencing performed on the WF_2 individual, revealed the same changes detect by NGS for WF, suggesting that RNA editing is not an exclusive event of WF and WM individuals.
Modified mRNAs were found inside and outside the sex-determining region, with distinct profiles between male and female individuals (Fig. 6).
The results reported in this paper, provided a comparison between the reference genome (vinifera) and sylvestris. We have identified SNPs and InDels between the two subspecies and investigated the sex locus region, providing a further evidence of which genes may be key players in sex specification of grapevine. We have also found RNA editing in genes belonging to the sex-determined region, namely VviYABBY and VviPLATZ. When a male and a female allele were identical, the same colour was used (green). Question marks refer to genes with low assembly resolution whose sequence was not identified correctly. Stars indicate genes where RNA editing was identified in transcripts, detected by comparing the genome sequences against the previous published transcriptomes, for the same individuals. www.nature.com/scientificreports/

Discussion
The analysis of the wild-type individuals (V. v. sylvestris) reveals a high similarity to PN40024, used as reference genome (V. v. vinifera). In fact, 89% of the sequences maps only once to PN40024 genome. 10% of the reads whose pair was lost or map to a different chromosome may be due to the remaining heterozygosity present in the reference genome (93% homozygous 1 11 . The higher number obtained in our study suggests a large distance between wild-type individuals (sylvestris) and the domesticated grapevine (vinifera). The ratios heterozygous/homozygous are in the same range as calculated by Mercenaro et al. (2017) 10 , and between the findings reported by Zhou et al. (2017) 11 (~ 77%) and D'Onofrio (2020) 34 (~ 40%). It is described that domestication pressure set on cultivated varieties, improving grape characteristics, affected all chromosomes in a similar way 11,35 . In the present study, we were unable to detect a particular chromosome with significantly different number of modifications, measured by a single nucleotide change or Insertion/ Deletion. Nevertheless, chromosomes 6 and 9 were highlighted as the ones with fewer and more alterations, respectively, but the number of modifications is still inside the Tukey's Fences used on the outlier's analysis.
The analysis considering the two separate individuals also supports chromosome 6 as the most relevant to distinguish the WF and WM individuals, and the result for the pooled sample may have a great contribution from the WF, which seems to be particularly conserved on chromosome 6.
The chromosome 13 also contributes to distinguish WM and WF plants in terms of SNPs and InDels. We need to emphasize caution in extrapolating this result to sex, as differences observed here are putatively individualspecific. Chromosome 2, consensually associated to sex, is not highlighted by any one of these analyses.
Analysis of the two plant chromosomes revealed that SNPs and Indels are spread across all of them. There are regions with a very low number of modifications, which coincide with regions poorly sequenced in the reference genome, therefore, the correct read mapping was not possible.
Most of the modifications identified are found in non-coding regions, where the impact is not directly assessed. However, Shabalina et al. 36 associated non-coding and synonymous SNPs to the gene expression, as these modifications may promote modifications at mRNA structure with consequences at stability and decay, or even by interfering in RNA editing. Similar trends of coding/non-coding and synonymous/non-synonymous were previously associated to wild populations, and according to Williamson et al. 37 could be considered an evolutionary trait of the population instead of a mark associated to domestication efforts, essentially in nondomesticated species. However, coding/non-coding and synonymous/non-synonymous ratios were also proved to be associated to domestication, for example in bean 38,39 and tomato 40 . The modifications identified in this paper, were found through a comparison between wild type specimens and a domesticated variety, and so, the domestication impact may not be completely discarded. Furthermore, a recent publication used 261 SNP markers to evaluate the genetic differences between Portuguese varieties and Portuguese wild populations (including the populations referred herein) 41 , to establish kinship relations and gene flow between wild individuals and cultivar varieties, as already reported by other studies 34,41,42 . This kind of genetic flow between wild and domesticated varieties may contribute to dilute the domestication impact in grapevine due to a systematic recap of wild genes/alleles. Considering these findings any direct conclusion or association between the SNPs and their cause, should be drawn with caution. It should be noted that Cunha et al. 41 found no evidence of backcrossing between the domesticated and the wild genomes specifically used in this genome resequencing. Neither the evolutionary pressure on these two independent populations nor the effects of the reference genome domestication, were diluted by back-crosses, which means that both hypotheses could be simultaneously contributing to the observed differences.
One of the most unexpected result is the low Ti/Tv ratio (transitions over transversions) in exons, regardless the SNP impact on protein sequences (synonymous or non-synonymous). Ti/Tv ratios have been associated with the genomic context of the modification and in fact, we clearly observe an association between the genomic context and the ratio. However, the ratio we have obtained on exons (1.4-1.9) highly contrasts with the values reported by Wang et al. 27 (circa 2.8) but is in line with the results identified by Vondras et al. 7 . A lower ratio as we observed, is getting closer to the ratio that should be observed if SNPs were random (0.5 as there are two transitions and four transversions). Low values of Ti/Tv were also observed in Oryza sativa 43 and Brassica rapa 44 . It is evident and intriguing that, in plants, the higher the impact of a SNP, the lower is the Ti/Tv obtained.
The highest percentage of regions affected by modifications, are intergenic followed by upstream regions (5 kbp before a gene start). It is interesting to observe that the percentage of occurrences on the upstream regions are almost twice as the ones observed for downstream (5 kbp after gene ends), essentially as upstream regions are usually associated with the gene promotor region. When analyzing the modifications in CDS regions, we find that the majority of the events have impact on protein (corresponding to non-synonymous modifications or frame shifts). The high impact observed, suggests an intentional evolutionary direction beyond simple mutation randomness. The percentage of modifications within the CDS region is generally low, but the few ones detected are modifications with an expected biological impact. These results support the suggestion by Ramos et al. 32,33 , indicating that both subspecies (sylvestris and vinifera) are mostly similar, however, the few differences found lead to significant protein modification.
The fact that the two individuals used in this study share only half of the identified SNPs and InDels, suggests that these individuals are not parentally related. This idea is also supported by the heterozygous over non-reference homozygous SNPs whose ratio has been associated with lineages 27  www.nature.com/scientificreports/ nonref-hom ratios for each individual for the same genomic regions. Unfortunately, other Vitis studies 10,19 lacks this kind of detail. When considering SNPs and InDels in common (between the two wild individuals), we need to be caution when extrapolating the results to the entire wild population, as we are observing the pattern of only two individuals (one male and one female individuals). Nevertheless, as they were phenotypically distinct individuals (one producing male flowers and the other female flowers), the common modifications may correspond to wild-type specific characteristics.
As both individuals produce distinct flower types, some of the genetic differences found between them may be associated to sex. Curiously, chromosome 2, where the locus associated to sex has been reported by different studies [7][8][9] , has a similar number of SNPs and InDels, when compared to other chromosomes. Nevertheless, we performed a more detailed analysis of the sex-associated locus in chromosome 2. From the 46 genes studied on this chromosome, only one (VIT_202s0154g00030) has identical alleles in both wild-type individuals, and other gene was recently associated with male sterility (INP1) 18,19 , which makes relevant the question of whether or not the other 44 genes could be associated with sex. At present, we are not able to answer this question and further studies should be carried to determine the impact of these 44 genes, using more individuals of each flower type, in order to better associate each one of these genes with its putative role in flower sex determination. Additionally, the two individuals used on this resequencing project had already been tested for VviAPRT3 and VviFSEX 9 and the pattern was the expected for male and female.
Regarding the discussion of SNPs relevance, some of them located in coding regions, we need to be careful, since RNA editing has already been described as present in the nuclear transcripts of plants like Arabidopsis 24,25 and Vitis 26 . In Vitis, the two genomes included in this work, were compared with previous transcriptomic data 17,18 . We need to highlight that the studied individuals were the same.
Eventual differences between DNA and RNA (RDDs) need to be considered and tested before concluding that SNPs have impact in plants. Combining the genomes published in this work and the transcriptomes already made public 32,33 , we identified 14 genes in the sex-determining region whose transcripts show RNA editing, the majority of them exclusive of the female plant (8 out of 14). VviYABBY (VIT_202s0154g00070) is one of those genes and presents a two-variants RNA editing in the 3′UTR region, where edited (T-to-A) and non-edited mRNAs are simultaneously detected, at one of the tested flower developmental stages (D stage). Since the two forms are present, female flowers simultaneously exhibit messenger molecules with uracil (as a thymine is encoded by the genomes of female and male individuals) and other molecules with adenosine. A second mRNA two-variants RNA editing event was detected for 3′UTR of VviYABBY on male individual (WM) on developmental stages D and H. Which is also interesting, is the RNA editing pattern of VviPLATZ (VIT_202s0154g00150), as male and female inflorescence show different RNA editing events, without overlap, including events within the CDS regions.
Considering the relevance of the CRIBI database in Vitis research, we need to stress the urgency of a complete curation of the genome deposited in that database. An effort should be carried out to resolve the sequences deposited in the "unknown" and "X_random" chromosomes, as these sequences usually correspond to alleles of the genes present in the remaining 19 chromosomes.
Summing up, the work described in this article contributes to the identification of putative specific modifications of sylvestris when compared to vinifera. This paper clarifies questions raised on previous studies using the transcriptome 32,33 , where different levels of gene expression were observed between sylvestris and vinifera transcriptomes. The identification of SNPs and InDels in the promotor region of the genes (5 kbp upstream of the gene) may explain the differential expression levels detected. In addition, the nucleotide differences revealed here, within coding regions, suggest that different alleles or gene duplications may have distinct importance in sylvestris and in vinifera.

Materials and methods
Sampling and DNA extraction. Representatives from V. v. sylvestris (wild type) were selected from INIAV, Dois Portos (Lisbon district, Portugal, 39.041395, − 9.181956), where a wild type collection was established. One of the selected individuals exhibited female flowers (WF) and the other displayed male flowers (WM). Fresh leaves were collected from each selected individual. Leaves were grinded in liquid nitrogen and the DNA was extracted and purified with the DNeasy Plant Mini Kit (Qiagen), according to the manufacturer´s instructions. DNA quantity and quality was evaluated using a microplate reader Synergy HT (Biotek, Germany), using the software Gen5 (Biotek, Germany) and confirmed on a 1.7% (w/v) agarose gel.
High-throughput sequencing and reference assembly. Genome sequences were obtained individually from each individual using 1 µg of DNA, on a Genome Sequencer Illumina HiSeq 4000 technology (125 bp PE; Fig. 1). A total of 539,473,738 reads were generated on both sequencings. Low quality reads (phred quality lower than 15) and reads without mate were removed. Additionally, low quality bases were removed from 3′ and 5′ ends. Minimum read size obtained after base trimming was 36. For the WF genome, 216,895,404 reads (91.6% of the total) were used on further analyses, while the WM genome was represented with 277,949,176 high quality reads (91.8%). Considering the V. v. vinifera cv. PN40024 reference genome available 1 , we estimated a genome size of about 486 Mbp, for which our predicted coverage ranges between 56X (WF) and 72X (WM). For general proposes, the results obtained from the different individuals were merged together, to better represent a wild-type genome.
The good-quality reads were mapped against the reference genome 1 using BWA (0.7.15) 45 with the default parameters. About 97% of the input reads mapped correctly to the reference genome (Table 1). www.nature.com/scientificreports/ Considering the artificial coverage created due to PCR amplification during library preparation, the reads were submitted to Picard (1.131; https ://broad insti tute.githu b.io/picar d/). The MarkDuplicates tool from Picard was used to detect duplicated reads whose origin is expected to be a single fragment of DNA. MarkDuplicates marks the reads as duplicates if multiple reads pairs maps in the same genomic location and with the same orientation. The presence of insertions and deletions, on the new obtained genomes, compared to the reference genome, promoted the occurrence of punctual misalignments. These misalignments were locally realigned with GATK (3.5) 46,47 . Taking into consideration the reported quality score, sequencing cycle and context, a final Base Quality Recalibration was performed with GATK, to improve the base quality score of the reads. About 92% of the mapped reads were kept after these three refinement steps. Whenever not explicitly indicated, default parameters were used. Transitions over transversion ratio (Ti/Tv) as well as heterozygous over non-reference homozygous SNPs (he/nonref-hom) were calculated for each genome with SNPs grouped by chromosome (discarding the "Un" and "X_random" virtual chromosomes). Calculations were performed with Python (3.7.4) running over Jupyter Notebook (6.0.1). Modules pandas (0.25.1) and matplotlib (3.1.1) were used to manipulate data and draw boxplots. Whenever not explicitly indicated, default parameters were used. Data from WF and WM was treated separately and pooled only for plotting and conclusions.

Variant analyses. SNP and
De novo assembly. A de novo assembly of the genomes was performed in order to obtain the genomic sequences of loci that were unclear on the reference genome. From the bulk of sequenced reads (Table 2), an assessment was performed using FastqC software (v0.11.7). Considering the quality indicators, the reads were cleaned with BBDuk (v38.01) from BBMap package. A pipeline was set to remove adapters from both ends (ktrim = r and ktrim = lm; with options k = 23, tpe and tbo, using the provided adapters.fa as reference) and low quality reads (options qtrim = rl and trimq = 20). Reads were prepared for assembly using fq2fa function from IDBA package (1.1.3) 52,53 , with options -merge and -filter. Assembly was conducted with idba_ud. The contigs obtained with less than 1,000 bp were removed, as this represent low informative segments.
Contigs obtained on de novo assemblies were queried against the genes annotated on the sex region 15,16 , retrieved from the reference genome (hermaphrodite), on a blastn 54,55 run with e-value set to 1e − 10. The selected contigs were manually inspected. Although the usage of only two individuals, one of each sex, lacks the resolution to differentiate between sex-specific patterns and individual evolutionary mutations, this approach may provide a little insight into the genetic differences leading to different flower types.
The results of this assembly were validated by confirming the presence of contigs associated with VviAPRT3 and VviFSEX as reported by Coito et al. 17 and VviINP1 18,19 , VviPLATZ 18 and VviYABBY 19 and by testing the heterozygosity overlap between both assemblies (Supplementary Table S1). We manually inspected the region of chromosome 2 that has been associated with sex 15,16 . Whenever not explicitly indicated, default parameters were used. RNA editing. The SNPs data obtained here was compared against the transcriptomic data obtained for the same individuals in 2014 32 . RNA was extracted from inflorescences WF and WM individuals at developmental stages B, D and H 56 . Total RNA was extracted with Spectrum Total RNA kit (Sigma-Aldrich, Spain), according to the manufacturer's instructions. cDNA libraries were obtained along standard Illumina pipeline, targeting poly-A tails. The obtained transcriptomic sequences were assembled against the reference genome (PN40024-12X), in transcriptome version 1 2 . SNPs were assessed between the transcriptomes and reference genome with CLC Genomics Workbench (Quiagen). Calls were discarded if: (1) the SNP call was within the first or last 60 bases of a linkage group sequence; (2) the reference genome base was ambiguous at the SNP position or within 60 bases of it; (3) the average copy number of the SNP flanking region was more than two; (4) the Illumina quality score of either allele was less than 20 and (5) the number of reads supporting either allele was less than two reads per accession. The initial list of variants was filtered using the Phred quality scores of the position and surrounding bases. To reduce false positives in variant calling, the minimum variant frequency was set to 15%. The minimum coverage required was 4 reads. Gene expression was measured using the number of Reads per Kilobase of exon model per Million mapped reads (RPKM).
Comparison between the RNA-seq data and the genomes sequenced was performed on R (3.4.3) 49 , running on a RStudio cluster (1.0.143) 48 . The comparison between the genome and the transcriptome was performed by assessing the exact genomic location of the SNPs, as both studies were mapped against the same reference genome. To guarantee low number of false positives, we have defined a set of conserved filters to increase confidence: (1) Each analysed site should have a minimum coverage of 5X (both in DNA-seq and RNA-seq, www.nature.com/scientificreports/ simultaneously); (2) the gene should be expressing in all samples (WF and WM individuals at inflorescence developmental stages B, D and H; RPKM ≥ 1); (3) events as deletions and inserts have been removed and (4) at least 10% of the reads covering the region should identify the nucleotide, otherwise that nucleotide was considered an error and discarded. For sites where a SNP was identified only in DNA-seq or RNA-seq, a deeper analysis was performed, by assessing the information in the BAM files. In these situations, we evaluated if the lack of information was based on similarity between the genome/transcriptome 32,33 sequenced, and the reference genome, or if the position was discarded by the filters defined. To the remaining sites, the homozygosity of the locus was computed considered the frequency of each allele obtained from the genome sequence described in this paper (freq ≥ 0.90, homozygous; freq < 0.90, heterozygous). The comparison between the genome observed nucleotides and the transcriptome was achieved and, when the nucleotides observed in the transcriptome were not supported by the genomic information, we have considered RNA editing sites. We note that false RNA editing positives may be identified where the nucleotide observed in the transcriptome is present in less than 9% of the genome reads.
Interesting genes (reported in Fig. 6) were manually inspected using the Integrated Genome Viewer (IGV) browser 57 , loaded with the reference genome (PN40024-12X), the most recent gene annotation (V2.1) and the BAM files corresponding to the genome and transcriptomes sequencings, to validate the overall R analysis, with success.
As RNA editing was assessed based on cDNA libraries, we decided to refer to the DNA-RNA differences using the DNA and cDNA nomenclature (for example, using Thymine instead of Uracyl).
The R scripts developed will be available upon request, due to script's authors stringencies.

RNA editing validation by Sanger sequencing. For individual WM and a secondary female individual
(WF_2), RNA and DNA were extracted from the same inflorescence sample, at different developmental stages using Spectrum Plant Total RNA Kit (Sigma-Aldrich) and DNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA), respectively, following manufacturer's instructions. This approach allows to exclude putative somatic mutations that may occur along the development of the new branches. Total RNA was additionally treated with 30 units of DNase I (Qiagen). Nucleic acid concentration was measured using a microplate reader Synergy HT (Biotek, Germany), with the software Gen5 (Biotek, Germany). The cDNA was synthesized from 100 ng of total RNA, using oligo(dT) 20 and RevertAid Reverse Transcriptase (Thermo Fisher Scientific) in a 20 μL reaction volume, following the manufacturer's instructions. cDNA concentration was measured as described above. Two genes in chromosome 2, VviYABBY (VIT_202s0154g00070) and VIT_202s0033g01400 were manually inspected using IGV. Primers were designed for exons of VviYABBY and VIT_202s0033g01400 genes using Primer Premier5 (Supplementary Table S2). For VviYABBY, a 336 bp fragment (DNA and cDNA) was amplified and for VIT_202s0033g01400 a 829 bp fragment was obtain in DNA and 632 bp in cDNA amplification.
The DNA and the cDNA amplifications were performed through PCR in 25 μL total volume composed by 0.