The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny

Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising of 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species.

| K-mer spectra analysis using 19-mer counts. Clone "2C" (left) showed significantly lower heterozygosity compared to that of cultivated "Romanesco C3" artichoke and "Altilis A41" cardoon genotypes (right). Among cultivated genotypes the difference between globe artichoke and cardoon was also evident and in agreement with previous reports (Portis et al., 2005).     A, B, H are possible states of underlying sequence, while a, b and h represent low coverage calls obtained from Illumina reads. Ehet and Ehom are the probability of false heterozygous calling for a given site. C is the probability of obtaining a heterozygous call at a given site by the presence of reads from both alleles. T is the probability of transition from one genotype to another (i.e. crossing-over events).

Figure S6 | Heat maps of the recombinant fraction (upper left sector) and LOD scores (bottom right sector) of linkage maps generated for
Romanesco C3 and Altilis 41 parents. Red corresponds to lowest recombination fraction and highest LOD scores. Figure S8 | Distribution of satellite sequences. Density of repeats (A) and genes (B) are report as heat map. Color-coded blocks (C) report non-consecutive occurrences of a given monomer along the sequence. Monomer are referred as in Supplementary Table S10 Supplementary Tables S1-S10

Genome completeness
The GMAP aligner (Wu et al., 2005) was adopted to assess the completeness of the globe artichoke genome using the existing ESTs of C. cardunculus. Different analyses were conducted varying the -x parameter (x = off, 100, 200, 300 and 400), which enables alignment of chimeric reads, and may help with some non-chimeric reads. The total number of unigenes found was obtained by considering both unique aligments and multi-exonic aligments. It reached a maximum of 35,138 (-x = 400) and 33,729 (-x = 200) with a mapping percentage of 95.59% and 91.81%, respectively. The non-mapping reads (978) remained constant even when varying the -x parameters. All the GMAP outputs are reported in the Supplementary table S1a. The CEGMA (Core Eukaryotic Genes Mapping Approach; Parra et al., 2007), pipeline was adopted to assess the completeness of the globe artichoke genome. A set of 248 core proteins that are present in a wide range of taxa, thereby are highly conserved, have been used to identify their exon-intron structures in genomic sequences. The lack of complete alignments for less conserved ortholog groups are probably due to divergence (CEGMA), while partial alignments confirm an even representation of different ortholog groups, indicating a gene space coverage ranging between 95% to 98.5 % (Supplementary table S1b).

Plastid and mitochondrial sequences
We searched the globe artichoke WGS assembly for fragments with similarity to chloroplast DNA. The search was conducted on a filtered dataset which did not include the scaffolds which were previously attributed to any linkage groups in order to avoid false positive hits to chloroplast hits genes resident in the genome). Using BlastN (e-value ≤10−5; Altschul 1997), we compared the filtered assembly to a draft of the C. cardunculus chloroplast genome (KM035764, Curci et al., 2015). This search revealed few scaffolds with significant similarity, ranging from about 1 kb to 82 Kb in length that consisted primarily of chloroplast DNA. Chloroplast-like scaffolds were filtered out from the draft, using BLASTn data related to pairs presenting more than 95% identity and nucleotide coverage exceeding 10%. We did not detect a complete assembly of the plastid genome among the globe artichoke scaffolds, while many occurrences of plastid-related sequences were observed within the mapped scaffolds of nuclear DNA. The amount of mitochondrial sequence was surveyed as well and a few scaffolds showed similarity to Nicotiana tabacum mitochondrial DNA (NC_006581), none of them contained significant portions of the mitochondrial genome. The lack of organellar scaffolds is probably a consequence of ALLPATHs LG ignoring reads with unexpectedly high read depth.

miRNAs analysis in globe artichoke
Overall 4,832 genomic regions (Supplementary file S1) distributed on the 17 pseudomolecules (on average 284 regions for pseudomolecule) were predicted to be candidate miRNA-coding loci belonging to 73 different miRNAs (Supplementary table S7). Of these, 4,478 regions (93%) represented only 6 C. cardunculus-specific miRNAs (namely miR6105, miR6108, miR6109, miR6110, miR6112, miR6115). Evidence of expression (RNAseq) was found for 122 out of 4,832 predicted putative miRNA-coding regions (2.5%; >10 reads). Considering publicly available unigene (Scaglione et al., 2012) and EST (Lai et al., 2009) datasets, evidence of expression was found for 882 miRNA-related genomic regions for the former, while ESTs gave evidence of expression for 789 miRNA-related genomic regions. Note that because of the presence of several miRNA putative paralogs, different loci were declared as expressed