Genome reconstruction in Cynara cardunculus taxa gains access to chromosome-scale DNA variation

The genome sequence of globe artichoke (Cynara cardunculus L. var. scolymus, 2n = 2x = 34) is now available for use. A survey of C. cardunculus genetic resources is essential for understanding the evolution of the species, carrying out genetic studies and for application of breeding strategies. We report on the resequencing analyses (~35×) of four globe artichoke genotypes, representative of the core varietal types, as well as a genotype of the related taxa cultivated cardoon. The genomes were reconstructed at a chromosomal scale and structurally/functionally annotated. Gene prediction indicated a similar number of genes, while distinctive variations in miRNAs and resistance gene analogues (RGAs) were detected. Overall, 23,5 M SNP/indel were discovered (range 6,34 M –14,50 M). The impact of some missense SNPs on the biological functions of genes involved in the biosynthesis of phenylpropanoid and sesquiterpene lactone secondary metabolites was predicted. The identified variants contribute to infer on globe artichoke domestication of the different varietal types, and represent key tools for dissecting the path from sequence variation to phenotype. The new genomic sequences are fully searchable through independent Jbrowse interfaces (www.artichokegenome.unito.it), which allow the analysis of collinearity and the discovery of genomic variants, thus representing a one-stop resource for C. cardunculus genomics.

The genus Cynara, a member of Asteraceae family (a.k.a. Compositae), contains eight species and four subspecies, all native to the Mediterranean basin 1 . The members of the species Cynara cardunculus L. (2n = 2x = 34) are globe artichoke (var. scolymus (L.) Fiori), cultivated cardoon (var. altilis DC.) and the wild cardoon (var. sylvestris (Lamk) Fiori). The three C. cardunculus botanical varietas are fully cross-compatible with one another, and their F 1 hybrids are fully fertile. This, together with phenotype data 2 and inferences based on genetic markers 3,4 suggest that both the globe artichoke and cultivated cardoon were domesticated from wild cardoons and it is likely that their domestication occurred in the island of Sicily 5 . Globe artichoke was anthropogenically selected for the production of immature inflorescences (heads or capitula), and cultivated cardoon for its fleshy stalks 4 .
After Italy, the next biggest globe artichoke producers are Egypt and Spain (FAO 19 data, 2013), but its cultivation has spread to the Near East (Turkey and Iran), North Africa (Algeria, and Tunisia), South America (Argentina, Chile and Peru), United States (mainly in California) and China. Italy also harbors the richest primary cultivated gene-pool, which as a rule is classified on the basis of capitulum traits [20][21][22] into: i) 'Spinosi': with long sharp spines on bracts and leaves; ii) 'Violetti': with medium-sized, green-violet-colored capitula and fleshy thorns; iii) 'Romaneschi': with spherical or sub-spherical green capitula and; iv) 'Catanesi': with relatively small, elongated capitula with more or less marked violet streaks.
Recently, we published the sequence of the globe artichoke genome 23 (www.artichokegenome.unito.it). Its assembly, comprising 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ∼133-fold Illumina sequencing data and encodes 26,889 predicted genes. A new highly saturated linkage map was also constructed, which represents a big step forward from the genetic maps we previously developed [24][25][26][27] .

Detection of Presence/Absence variants (PAVs).
Based on gene coordinates, the five datasets were inspected for PAVs and 346 putative PAV genes ( Figure S1) was uncovered. The frequency of PAVs varied among the genomes ranging from 241 in SP and VS to 225 and 261 in C3 and VT respectively. Among the 251 shared PAVs, 173 (50.0%) were absent in all the genotypes, while 30 (8.7%) were absent in 2 genotypes and 14 (4.1%) in three genotypes. A total of 87 (25.1%) of the detected PAV genes were absent only in one genotype (17 in SP,14 in VS, 4 in C3, 27 in VT and 25 in A41). The PAV genes exclusively present in just one accession were 17 in A41, 1 in SP, 4 in C3, 1 in VS and 2 in VT ( Figure S1, Table S4). Functional information for candidate specific absent and present PAV genes was evaluated (Supplementary data) and a GO enrichment analysis conducted for PAV genes (17) present in the genotype A41 (Table S5).
Prediction and annotation of miRNA. From a search against miRBase 34 , 21 high confidence database, species-specific miRNAs were predicted and used for further analyses (File S2). The total number of predicted non-redundant miRNAs varied from 51 (within 74 VS genome regions) and 143 (in 241 genome regions of the reference 2 C), belonging to 32 (for SP) or up to 45 (2 C) miRNA families ( Fig. 1 and Table S6). The Tapir hybrid 35 search for target genes of the identified miRNA in the six genotypes, revealed between 307 (VS) and 1167 (2 C) putative miRNA: mRNA duplexes. Almost 90% of genes encoding predicted target transcripts have functional InterPro annotations. The total number of miRNA families involved in miRNA: mRNA interactions varied according to the genotype, ranging from 20 in VS to 45 in 2 C and C3 ( Fig. 1 and Table S7). Although the main families involved in miRNA: mRNA duplex formation were generally genotype-specific (Table S8) 23 . *Percentage of reconstructed genome compared with the 2 C genome.
The number of RLK, RLP, NB and other-KTM genes detected in each chromosome of the six genotypes in study is shown in Figure S5. In all the genotypes most RGAs were found on chromosome 1, followed by chromosomes 2, 13, 10 and 5. Chromosomes 6, 7, 14 and 16 (with the exclusion of C3) were devoid of RLP genes, while chromosomes 3, 6, 8 (with the exclusion of C3), 11, 12, and 14 were without NB genes. Distinct clusters of RGAs were identified for three of the main classes: RLK, RLP, NB and other-KTM (Table S10). In particular, chromosome 10 was found to harbor two RGA clusters: one of genes with an NB domain in a region of two Mb (16)(17)(18) and one of other-KTM genes at 10-11 Mb. No clusters of RLK genes were identified. RGAs were found to be quite uniformly distributed along the 17 chromosomes, except for the clustering of RLP genes on ch.  Alignments of the amino acid sequences and subsequent RAxML 39 analyses allowed the generation of phylogenetic trees for each of the RGA classes in the study. As expected, each resistance gene and its orthologs clustered together, although in some taxa one or more orthologs were absent (File S5). In particular for the most numerous RGA families, RLK and RLP, taxa missing at least one RGA from at least one genotype were 46 (24%) and 45 (50%) respectively ( Fig. 2 light blue). In addition, 2 clusters for both RLKs and RLPs contained at least one duplicated gene from one genotype (Fig. 2 orange), while clusters with at least one missing gene from a genotype and a duplicated gene from another genotype amounted to 1 for RLK and 4 for RLP (Fig. 2, highlighted in purple).

SNP/indel discovery and heterozygosity estimation.
To identify large-scale polymorphisms of sequenced accessions, reads were aligned against the globe artichoke reference genome (2 C). The mapping rate across different accessions varied from 95.8% to 97.6%, for an average of 96.5%. The whole SNP/indel set contained 23,450,539 entries (File S6), with 815,853 in 2 C rising to 14,495,680 in VS (Table 4). In the sequenced globe artichoke genotypes, similar SNP/Indel numbers were detected in SP and VS (ca 14.4 M) with slightly fewer in C3 and VT (ca 12.8 M), while in the cultivated cardoon A41 SNP/Indels halved to about 6 M. Among the globe artichoke genotypes the highest SNP frequency was found in VS (1/54 bp) with the highest indel frequency in SP (1/122 bp). Table S11 reports heterozygosity levels estimated across the six genotypes. All the genotypes, except the reference (2 C) which has been specifically bred by repeated cycles of selfing to attain a very low level of residual heterozygosity (Fig. 3a, track B), showed SNP-dense regions dispersed genome-wide ( Figure S6). However, many regions with a low frequency of heterozygous SNPs (Fig. 3) and carrying SNPs fixed in the homozygous state ( Figure S6) were observed. Some of them were genotype-specific (Fig. 3a) and occurred in gene-dense regions: chromosomes 4, 8, 11, 15 in SP; ch.10, 13 in VS; ch.5, 12, 17 in VT; ch.11 in A41. Some others are in common between genotypes (ch.1 in VS, C3 and VT). Interestingly, in chromosome 7 a wide region carrying SNPs mainly in the homozygous state was observed in both the SP and VT genotypes. Genetic relatedness among the six genotypes was assessed based on the whole SNP set (23 M) and also the SNPs detected in coding sequences (Fig. 3b). In both cases the reference genotype (2 C) and the cultivated cardoon (A41) clustered at a high level of genetic differentiation from the four globe artichoke genomes. Small differences were detected in the relationships among the latter, as the whole genome SNP analysis revealed a higher similarity between VT and C3, and a significant genetic differentiation between spiny (SP) and non-spiny (VS, VT and C3) globe artichoke types, while the coding SNPs highlighted a higher-similarity VT/SP cluster.  Table 3. Classification of RGAs identified in the six genotypes. RGA proteins classified based on domain identification together with their frequency.
Variants annotation. The analysis of functional variants between the reference and the resequenced genotypes was performed using the SnpEff 40 suite on both heterozygous and homozygous SNP/indel sets. The observed fraction of coding SNPs (500 k, exon-based) was as little as 2% of the 23 M detected SNPs. Two thirds of the total homozygous variants were located outside the gene space (intergenic region: 57.6%; intronic region: 9.3%) with only a small fraction contained within coding sequences (1.41%, Fig. 4). In SP a total of 6.2 M mutations were detected, 5,290,997 were observed as intergenic (Table S12), while 113,099 were in the annotated gene space. VS and VT showed a similar number of intergenic mutations, while variants annotated in CDS were 96,437 and 108,712 respectively. In C3, 3,719,864 intergenic SNPs were found, with 89,177 SNPs located in genic space. Finally, A41 (intergenic mutations: 1,470,424) contained 47,438 SNPs in CDS. About 98% of the variants were classified as modifiers. The fraction of moderate variants ranged from 0.58% to 0.8% according to accessions, and those having low effect from 0.75% to 1.05%. The high effect variants were the smallest class, with 1,464 to 3,761 mutations depending on the specific genotype (Table S12). Among the homozygous variations detected in coding sequences, 53.20% led to synonymous and 46.19% to non-synonymous amino acid changes, while 0.62% gave rise to non-sense mutations (Table S13). In respect to the heterozygous SNP/indels, the highest number of variants were mutations located in the intergenic and intronic regions (65.0% and 8.1%), while only 1.35% occurred in genic space (CDS, Fig. 4, Table S12). The high effect variants represented only 0.048% of total mutations. The fraction of mutations with low effect varied between 0.65% and 0.86%, while moderate variants accounted for between 0.6% and 0.77%. About 98% of the total mutations in all the resequenced genotypes were classified as modifier variants. With regard to the variations detected in coding sequences, 49.60% were annotated as non-synonymous mutations. Synonymous mutations amounted to 49.35% of total variants and only 1.05% were classified as non-sense mutation (Table S13).

Figure 2.
RLP and RLK resistance genes dendrogram of the six C. cardunculus genotypes. Taxa missing one or more RGA gene from at least one genotype are highlighted in light blue; clusters with one or more duplicated genes in one genotype are highlighted in orange, while clusters with at least one gene missing from one genotype and also containing at least one duplicated gene from one genotype are highlighted in purple.  Analysis of variants impact on CQAs and SLs gene classes. The possible impact of variants localized in genes responsible for caffeoylquinic acids (CQAs) and sesquiterpene lactones (SLs) biosynthesis were analyzed. Almost all investigated genes showed variants, with SLs showing a higher number of variants (24.5 SNP/ gene) compared to CQAs (5.5 SNP/gene, Fig. 5). The genes with non-synonymous variants potentially giving rise to high-effect impacts are reported in Table S14. Of the 33 non-synonymous variants belonging to the CQAs related genes, 10 (30.3%) showed a predicted deleterious effect in coding sequences when translated as amino acid substitution. These were located in all the polymorphic genes (Fig. 5). In the SL group, which contained 417 non-synonymous variants, about 83 (19.9%) were deleterious. All the coding SNPs were in the heterozygous state and affected 14 of the 17 genes (Fig. 5). In the CQAs pathway, one significant deleterious mutation, which creates a premature stop-codon, was heterozygous in the C4H gene (in SP, Table S14). No relevant homozygous deleterious variations were identified in the SL gene class. However, many deleterious variants in the heterozygous state were found in the germacrene A synthase (GAS) gene, which mutated exclusively in the four globe artichoke genotypes, and not in the cultivated cardoon (A41, Table S14). The impact of coding SNPs on protein genetic diversity among the analyzed genotypes is shown using family based phylogenetic trees (Fig. 5).

Discussion
Among the species belonging to the Asteraceae family, Cynara cardunculus has remained for a long time relatively unexplored compared to other species such as sunflower and lettuce, for which genomes extensive investigations have been performed. Following our recent release of the first globe artichoke genome sequence 23 , we performed the first WGR (at ~35× coverage) of five C. cardunculus genotypes, four representatives of the core globe artichoke varietal types at present in cultivation and one of the related and inter-fertile taxa cultivated cardoon. By the use of a genome reconstruction strategy based on iterative mapping and reference-guided assembly 41 , the five genomes were assembled and reconstructed at the chromosome scale. Our approach proved to be less demanding in terms of sequencing depth and multiple libraries construction compared to a de novo assembly. The sizes of the reconstructed genomes were comparable to that of the reference genome, with an average percentage of reconstruction close to 98.9%, but smaller compared to the estimated genome size of C. cardunculus (1,084 Mbp). This is likely due to the absence in the reference genome assembly of some repetitive sections, though it included about 95% of the gene space 23 . Indeed, by performing a new gene prediction of the globe artichoke reference sequence following the application of a more stringent AED threshold, we found 28,310 predicted genes, which corresponds to a 5% increase over the 26,889 previously predicted. As reported for other species 42 , globe artichoke and cultivated cardoon suffer major losses from numerous diseases, as a result of many years of cultivation and selection mainly focused on desirable commercial traits at the expense of disease resistance. Plants have developed effective mechanisms to recognize and respond to infections caused by pathogens, among these the RGAs 43-45 play a key role. In the five reconstructed genomes in study, as well as in the reference genome, we computationally mined and characterized RGAs on the basis of their significant structural features and conserved domains.
The RGAs identified here represent on average about 2% of the total number of predicted genes for all the genotypes. The majority of them were located preferentially on six chromosomes: 1, 2, 3, 5, 10 and 13, suggesting their specialization in resistance pathways. Their location assists development of a high-density genome-wide RGA genetic map for the species, which is pivotal for designing diagnostic markers and identifying quantitative trait loci (QTL) or markers associated with plant disease resistance. Cynara cardunculus RGAs mainly fall into RLK, RLP and other-KTM (i.e. genes with a P-kinase and a TM domain) classes (Table 3), while NBS (such as TNL, CNL, and TN) were poorly represented. Previous studies report that TN (TIR-NBS) are the largest group of resistance genes in both Arabidopsis thaliana (64%) 46 and Brassica rapa (64%) 47 , but they are rare in Oryza sativa (1%) 48 and Sorghum bicolor (1%) 49 and absent in Brachypodium distachyon and Zea mays 50 , suggesting their specificity for dicotyledons 51 . Kim et al. 52 , in a survey of RGA genes based on UNIGENE analyses, highlighted that some Asterids, such as Solanaceae, contain functional TNLs, whereas others do not. The same authors identified only 19 and 13 full length CNLs in the two Asteraceae, sunflower and lettuce, but no full length TNLs, and concluded that the latter are relatively poorly distributed or have been species (or family) specifically lost in Asterids. In line with our results, Christopoulu et al. 53 recently identified several RGA classes in lettuce (one of the closest species to C. cardunculus), with the RLK class most represented. Other NBS-LRR like RGAs were found to vary widely in number (for a summary, see Sekhwal et al. 54 ). Finally, it has been reported that NBS-LRR genes underwent gene expansion after speciation in Arabidopsis 46, 55 , rice 56 , corn 57 , Populus and grape vine 58 . From our results, it appears that C. cardunculus RGAs belong almost exclusively to the RLK/RLP families, while a few NBS RGAs were identified. Our results confirm the hypothesis of Kim et al. 52 of a species-specific evolution of TNLs in Asterids. Interestingly, several putative RGAs were identified and showed missing domains compared to the 'canonical' NBS, RLP and RLK families, in line with reports by several authors [59][60][61][62][63][64][65][66][67][68] . It was also reported that maintaining many NBS/resistance genes has potential fitness costs 69,70 and it has been suggested that plants use microRNAs to regulate NBS gene expression [71][72][73][74][75] . Indeed, we found that 9 (in VS) to 41 (in 2 C) identified RGAs (by Blastp approach) were putatively targeted by a miRNA, suggesting that this mechanism could also be present in C. cardunculus.
The number of identified miRNAs varied across the different genotypes, with the highest in 2 C and the lowest in VS. The variable number of identified miRNAs might also arise from SNPs present in some miRNA loci hampering their identification in some genotypes. In addition, as previously reported 23 , conserved miRNAs, such as miR162 and miR482, were not identified possibly because of their loss in the lines' genomes and/or because of genomic loci missing in the assembly. Most of the conserved miRNAs detected in the six genomes were predicted to target well known biological processes; this confirms, on the basis of a more comprehensive data set, what has been previously reported 23,76 . The comparison of GO term enrichments for the five newly reconstructed genomes revealed that GO terms related to binding and transcription were shared among the C. cardunculus genotypes, suggesting an involvement of miRNAs in transcription factors regulation. However, the process GO terms lignin and phenylpropanoids were enriched in 2 C, A41 and C3. This suggests their involvement in the regulation of genes involved in lignin biosynthesis and the flavonol pathway in these three genotypes, although this hypothesis should be furtherly investigated (e.g.: small RNA and degradome sequencing). The comparison of the five reconstructed C. cardunculus sequences with the reference genome led to the discovery of more than 23 million SNP/indels, heterogeneously distributed across all the genotypes and their chromosomes revealing traces of breeding efforts. SNP frequency ranged from about 1/54 bp (VS) to 1/122 bp (A41), while indels frequency varied between about 1/630 bp (SP) to 1/1,634 bp (A41). In a previous study 77 , focused on RAD-tag marker development in globe artichoke, we estimated a SNP frequency of one SNP per 179 bp and one indel every 5000 bp, values which differ significantly from the those presented here (Table 4). This discrepancy may be attributed to the RAD-tag protocol we applied, which was based on the use of one methylation sensitive enzyme, thus the obtained metrics mainly refer to the un-transcribed portion of the genome.
As expected the globe artichoke reference genome (2 C) showed the lowest heterozygosity (0.11%, a frequency of 1/893 bp), which is comparable to that found in inbred species 78 . This reflects its breeding history as 2 C is the result of three cycles of self-fertilization. In contrast, according to Portis et al. 26,79 , the four globe artichoke varietal types showed a high level of heterozygosity, as their heterozygous SNPs ranged from 7.4 M in VT to 9.3 M in VS (Table S11). Of these, the globe artichoke VT, the only genotype to be seed propagated, contained the lowest number of heterozygous SNPs (7,4 M); this is likely due to breeding practices aimed at decreasing its heterozygosity in order to stabilize its commercially-important attributes. Analogously, in the seed-propagated cultivated cardoon A41, a relatively low number (4,2 M) of heterozygous SNP was found. Many regions were observed with a low frequency of heterozygous SNPs (Fig. 3a) and carrying SNPs fixed in the homozygous state ( Figure S6).
In a previous study, we identified Sicily Island, in the South of Italy, as a possible primary center of globe artichoke domestication. Following characterization of 24 landraces collected from small-holdings, through a combination of morphological traits and PCR-based markers, we recognized intermediary spiny forms in the domestication process 5 . The presence/absence of spines has been considered a key trait to understand the origin of the material at present in cultivation. Barbieri 80 hypothesized that the spiny types (Spinosi) were selected first, followed by the violet types (Violetti), which possess less spiny heads, and finally by the non-spiny Romaneschi and Catanesi. However, both UPGMA analyses, performed on the whole set of detected SNPs/Indels and on the ones detected in coding regions, did not cluster separately the spiny and non-spiny types. This seems more consistent with the pattern of evolution proposed by Lanteri et al. 4 according to which the Spinosi and the Violetti, the latter characterized by capitula with bracts harbouring fleshy thorns, evolved side by side with the non-spiny Catanesi and Romaneschi. Since domestication implies the intensification of pressure for traits relevant to farming conditions, and may have had a greater impact on genes than on intergenic regions, our hypothesis appears to be confirmed by the presence SNPs fixed in the homozygous state occurring in gene-dense regions in a genotype-specific fashion (Fig. 3a), which is attributable to a signature of the domestication process addressed to fix distinctive traits in the different varietal types.
The species Cynara cardunculus has interesting applications in pharmacology, since the leaves and heads represent natural sources of bio-active compounds such as mono-and di-caffeoylquinic acids and sesquiterpene lactones, with several medicinal properties. In previous studies we isolated, functionally characterized both 'in vitro' and in vivo' and mapped the genes involved in their bio-synthetic pathway [81][82][83][84] . Our functional annotation of SNPs revealed thousands of coding polymorphisms (Table S12, Fig. 4) and the analysis of variants of genes related to caffeoylquinic acid (CQA) and sesquiterpene lactone (SL) biosynthesis displayed different outcomes. In the CQA pathway, except for one case, non-relevant deleterious SNPs were found. Conversely, in the SL pathway, although in heterozygosis, 54 deleterious mutations were found in 12 out of the 17 SL-related genes. This was confirmed by the remarkably lower diversity detected in CQA-compared to SL-related proteins (Fig. 5b). Many deleterious mutations were located in GAS, a key gene in the biosynthesis of SLs, of which cynaropicrin is the major representative in C. cardunculus 83 . The latter was found to accumulate in leaf tissues while its concentration is lower in inflorescence bracts of globe artichokes 85,86 ; the same results were also confirmed in cultivated cardoon leaves 87 . Cynaropicrin acts in leaves as an antifeedant 88 but is also responsible for the bitter taste of globe artichokes, thus the deleterious mutation in GAS gene detected in globe artichoke genotypes, but not in cultivated cardoon, might be the result of domestic breeding aimed at reducing bitterness in the globe artichoke edible capitula.
Next-generation sequencing is rapidly expanding our knowledge of genetic variation in many crops. The availability of the globe artichoke reference genome and the resequencing of four globe artichoke genotypes, representative of the germplasm in cultivation, has provided clues about their domestication processes as well as a first comprehensive identification of the genetic diversity of the Cynara cardunculus cultivated forms. For years, we have studied the progeny of the cross between the globe artichoke Romanesco with both globe artichoke and cultivated cardoon, with the goal to develop molecular maps on the basis of the two-way pseudo test cross strategy 24, 25 , and we have identified several QTLs for a number of capitula traits 26,89,90 . Based on our resequencing effort, a set of SNPs regularly spaced along the chromosomes was identified and annotated. This resource will be fruitful for the identification of regions responsible for the QTL, and lays the groundwork for a new phase of globe artichoke genomics to further the understanding of the genetic basis of agronomical important traits and for selective breeding. Beyond, our marker catalogue provides a highly valuable resource in terms of polymorphism, and allows foreseeing the future through the detection of fine haplotypes and imputation of SNPs on large accessions panels. Our results also provide key information for further functional, as well as comparative, genomics studies with other important crops such as sunflower, lettuce and chicory within the Asteraceae family.

Materials and Methods
Plant materials and DNA extraction. Six C. cardunculus genotypes (Table 1) were considered in the analysis; they included one cultivated cardoon (' A41' -Altilis 41), and five globe artichokes: '2 C' (reference genome), VS ('Violetto di Sicilia'), VT ('Violetto di Toscana'), C3 ('Romanesco C3') and SP ('Spinoso di Palermo'). SP belongs to the 'Spinosi' type, characterized by long sharp spines on bracts and leaves. VT belongs to the 'Violetti' type, produces medium-sized, green-violet-colored capitula harbouring fleshy thorns and it is seed-propagated. C3 belongs to the 'Romaneschi' type, and its capitula are spherical, green and non-spiny. VS belongs to the 'Catanesi' type, yielding non spiny elongated capitula with more or less marked violet streaks. 2 C is a Brazilian breeding line result of three cycles of selfing and characterized by a reduced level of heterozygosity 23 . A41 is a selected genotype of cultivated cardoon 25 producing small non-spiny capitula and it is seed-propagated.
All of them derived from propagated clones grown in a field trial location at University of Catania (Italy), during the 2014-2015 growing season. For the genotypes VS, VT, SP, total genomic DNA was extracted from fresh leaves of each genotype, using DNA Mini Plant kit (Qiagen). RNAse A was used to remove RNA contamination. DNA quality was checked by 1% (w/v) agarose gel electrophoresis, and its quantity was assessed by Qubit 2.0 (Life Technologies, Carlsbad, CA, USA) based on Qubit dsDNA HS Assay (Life Science). Raw sequence data for the genotypes A41, 2 C, and C3 were obtained from NCBI (Table 2) 23 .
Genome Sequencing. A total amount of one μg DNA was sonicated using 30″/90″ on/off time for 7 cycles with Bioruptor UCD-300 TS instrument (Diagenode, Belgium) to obtain 350 bp long fragments. End-repair and A-tailing procedures followed standard Illumina protocols, except PCR-free barcoded adapters (Biooscientific, Austin, TX, USA), used during the ligation step, and five conclusive enrichment PCR cycles were carried out. Negative selection of 100-150 bp fragments was performed with 0.8 X AMpure XP beads (Beckman Coulter, Inc., (Brea, California)). Quality control (QC) of libraries was performed with Bioanalyzer 2100 instrument (Agilent, Inc., Santa Clara, CA, USA) using High Sensitivity DNA kit and an accurate quantification was made using qPCR with Library Quantification kit (Kapa Biosystem, USA). Library were then pooled and diluted to a final concentration of 10 nM. Sequencing were performed with Illumina NextSeq500 sequencer (Illumina Inc., San Diego, CA, USA) and 150 bp paired-end sequences were generated. Raw reads were analyzed with Scythe (https://github. com/vsbuffalo/scythe) for filtering out contaminant substrings and Sickle (https://github.com/najoshi/sickle), which allows to remove reads with poor quality ends (Q < 30).

Genome reconstruction, gene prediction and annotation.
For the genome reconstruction, a combination of iterative read mapping against the globe artichoke reference and de novo assembly was adopted. De novo assembly was performed with ABySS 1.9.0 assembler 91 using k-mers which allowed to achieve the best assembly. Genome reconstruction of each varietal type was performed submitting the Abyss assembled sequences to IMR/ DENOM 41 (ver. 0.4.1; http://mus.well.ox.ac.uk/) pipeline using default parameters, adopting the globe artichoke genome 2 C (LEKV00000000.1) as a guide. Quality metrics for the assembled genomes were calculated with Assemblathon_stats.pl (http://korflab.ucdavis.edu/). Gene prediction was performed using reiterative runs of the Maker-P suite 92 . HMM models from Augustus 93 and SNAP 94 ab initio gene prediction algorithms previously developed 23 were combined with proteins and ESTs alignments as supporting evidence. All predicted gene models were filtered to retain only those with an AED ≤ 0.5. Gene function was assigned to predicted genes using BlastN 95 and Swissprot 96 database, using default parameters, with the exception of sequence E value = 1 e −5 . Predicted protein sequences were functionally annotated using InterproScan 32  Identification and characterization of PAV genes. Samtools 108 was used to generate a text file containing the number of illumina reads that mapped at each gene location on the reference genome. The number of reads that mapped at each gene location for every globe artichoke varieties were normalized by the total number of reads mapping the whole reference genome for each four globe artichoke varieties. These figures were calculated as follows: samtools view -c -F 4 -q 1 mapping_file.sorted.bam. As approach to identify putative PAV genes, all genes with less than 6 mapped reads from at least 1 varietal type and more than 29 mapped reads from at least another varietal type were selected. A Venn diagram of the shared/exclusive PAV genes was depicted using InteractiVenn 107 . Obtained list of candidate PAV genes were described/GO-categorised using the here produced functional annotation and through a blastP 95 analysis (TAIR10 109 ). GO enrichments in artichoke selected genes were calculated with AmiGO2 110 web service and Panther 111 . miRNA annotation. The MIReNA 112 software was used for the identification of high confidence miRNA-coding sequences (miRBase release 21 34 : high confidence database) in each pseudomolecule and CH0 of all the six genotypes in study. An homology search was conducted with known miRNAs from an array of 13 species (plants and algae), including: Solanum lycopersicum, Solanum tuberosum, Nicotiana tabacum, Vitis vinifera, Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, Medicago trunculata, Zea mays, Picea abies, Triticum aestivum, Physcomitrella patens, Chlamydomonas reinhardtii. MIReNA was run with default parameters and the maximum number of allowed mismatches between known miRNAs and putative miRNAs was set to 10. For each genotypes, miRNA sequences identified were named based on the miRNA family with the addition of the name of the genotype (2 C, A41, VS, VT, C3 and SP). The Tapir 35 standalone software was applied to identify the targets of the identified eggplant miRNA on the predicted CDSs. In particular, we applied the Tapir hybrid function, which is based on the RNA hybrid search engine 113 , an algorithm conceived to determine, with high accuracy, miRNA:mRNA duplexes. Results were parsed with hybrid_parser function using default parameters. GO term enrichment of target sequences for each line was carried out with AGRIGO 30 and REVIGO 36 to find out a representative subset of the GO terms previously identified with the Interproscan 32 pipeline (medium similarity) and to visualize results. The size of the circles have been adjusted to reflect the p-value. AGRIGO 30 cross comparison of SEA (SEACOMPARE) was used to identify common and different enrichment GO terms between the genotypes showing GO terms enrichment.
Resistance genes analogs (RGA) identification and classification. Candidates genes were identified by means of a Blastp 95 analysis against the Plant Resistance Genes database 114 (http://prgdb.crg.eu/). Protein sequences of 2860 Arabidopsis RGAs (from RGDB) were used to perform BlastP searches against all the six proteomes. For each genotype, positive hits (p < 1e-60) were validated via HMMER37,38 v3 software, searching against PFAM 102 Hidden Markov Models (HMMs: available at http://pfam.xfam.org/) using a cutoff E-value of 1e-10. HMMs domains were chosen based on their known involvement in plant resistance against pathogens and included: NB-ARC (PF00931), TIR (PF01582), LRR (PF00560, PF07723, PF007725, PF08263, PF12799, PF13306, PF13516, PF13855 and PF14580), Pkinase (PF00069), ABC transporter (PF00005), and WD40 (PF00400). The CC motifs were predicted with EMBOSS pepcoil 115 , while TM domains were predicted using both TMHMM 2.0 116 and SCAMPI 117 . Resistance genes classification was based as followed: acronym containing the same domains from Christoupolou et al. 53  SNP calling. Reads were mapped onto globe artichoke genome reference using Burrows-Wheeler Aligner program (BWA) 122 and 'mem' command with default parameters. The BAM files were processed and adapted for SNP calling program with Samtools 108 mpileup using default parameters with the exclusion of minimum mapping quality equal to 25 and filtering ambiguous read mapping. Results were filtered taking into account two parameters: the SNPs call quality and depth. SNPs having mapping quality lower than 20 were removed. In addition, we set as lower limit of mapping depth a value of eight and the upper limit was set to 450. Relationships among the genotypes were computed using: i) whole genome and ii) coding (within exons) SNP/indel datasets. Genetic distances were computed based on the two datasets (R package SNPRelate) and dendrogram was computed and drawn using R Graphic-package 123 . The chromosomal locations of SNP densities were visualized in CIRCOS ideograms using the software package from http://circos.ca. SNP annotation. Identified variants were analyzed using SNPeff 40 to predict their effect on the set of gene models of globe artichoke. We investigated the role of missense and non-sense mutations in both homozygous/ heterozygous states, evaluating in which region the variations are found. The effect of each SNP/indel was classified according to SNPeff software into four classes: (1)"modifier", for the variants located outside the genes, in non-transcribed regions or introns; (2) "low effect" for variants in coding regions which do not change the amino acid sequence; (3) "moderate" effect for variants which change the amino acid sequence and (4) "high effect" for variants which modify splice sites, stop or start codons (loss or gain). CDS non-synonymous variants belonging to well characterized pathway and classified as missense, stop codon gained and frameshift effect were also submitted to Provean (Protein Variation Effect Analyzer algorithm) 124 analysis, in order to understand if an amino acid substitution has an impact on the biological protein functions. Provean predicts the functional impact for all classes of protein sequence variation such a single amino acid substitution, insertion, deletion and multiple substitution. The score threshold used was set to −2.5. Non-synonymous variations causing deleterious effects are evaluated in homozygous and heterozygous state for the negative impact on protein functionality. Accession codes. Sequence reads have been deposited in NCBI sequence read archive (SRA) under the number SRP055806. A JBrowse 28 interface, to access genomic data and related annotation, is available at www. artichokegenome.unito.it.