Origin and evolution of the octoploid strawberry genome

Cultivated strawberry emerged from the hybridization of two wild octoploid species, both descendants from the merger of four diploid progenitor species into a single nucleus more than 1 million years ago. Here we report a near-complete chromosome-scale assembly for cultivated octoploid strawberry (Fragaria × ananassa) and uncovered the origin and evolutionary processes that shaped this complex allopolyploid. We identified the extant relatives of each diploid progenitor species and provide support for the North American origin of octoploid strawberry. We examined the dynamics among the four subgenomes in octoploid strawberry and uncovered the presence of a single dominant subgenome with significantly greater gene content, gene expression abundance, and biased exchanges between homoeologous chromosomes, as compared with the other subgenomes. Pathway analysis showed that certain metabolomic and disease-resistance traits are largely controlled by the dominant subgenome. These findings and the reference genome should serve as a powerful platform for future evolutionary studies and enable molecular breeding in strawberry.


Genome Assembly
1) Reads pre-processing. PCR duplicates, illumina adaptor AGATCGGAAGAGC and Nextera linkers (for MP libraries) were removed. The PE 470bp 2×265bp libraries (Supplemental Table 1) overlapping reads were merged with minimal required overlap of 10bp to create the stitched reads. 2) Error correction. Following pre-processing, merged PE reads were scanned to detect and filter reads with putative sequencing error (contain a sub-sequence that does not reappear in other reads).
3) Contigs assembly. The first step of the assembly consists of building a de bruijn graph (kmer=127 bp) of contigs from the all PE & MP reads (Supplemental Table 1). Next, PE reads were used to find reliable paths in the graph between contigs for repeat resolving and contigs extension. 10X barcoded reads were mapped to contigs ensure that adjacent contigs were connected only in case there is an evidence that those contigs originate from a single stretch of genomic sequence (reads from the same two or more barcodes were mapped to both contigs). 4) Scaffolds assembly. Later, contigs were linked into scaffolds with PE and MP information, estimating gaps between the contigs according to the distance of PE and MP links. In addition, 10X data was used to validate and support correct phasing during scaffolding. 5) Fill Gaps. A final fill gap step used PE and MP links and de bruijn graph information to detect a unique path connecting the gap edges. 6) Scaffolds elongation and refinement. 10X barcoded reads were mapped to the assembled scaffolds and clusters of reads with the same barcode mapped to adjacent contigs in the scaffolds were identified to be part of a single long molecule. Next, each scaffold was scanned with a 20kb length window to ensure that the number of distinct clusters that cover the entire window (indicating a support for this 20kb connection by several long molecules) was statistically significant with respect to the number of clusters that span the left and the right edge of the window. In case where a potential scaffold assembly error was detected the scaffold was broken at the two edges of the suspicious 20kb window. Finally, the barcodes that were mapped to the scaffold edges were compared (first and last 20kb sequences) to generate a scaffolds graph with a link connecting two scaffolds with more than two common barcodes. Linear scaffolds paths in the scaffolds graph were composed into the final scaffolds output of the assembly (Supplementary  Table 2). 7) Genetic map. A RADseq based genetic map 1 was used to correct misassemblies, identified 408.16Mb of haplotype variants, and extracted four homoeologs representing each subgenome. Comparisons to the genetic map was fully supported by comparative genomics with the diploid F. vesca genome 2 (Supplemental Figure 1). Supplemental Figure 1 shows syntenic scaffolds from chromosomes 3 of 'Camarosa'. Colors depict the four diploid progenitor species. The box below is highlighting a region along chromosome 3 where all eight haplotypes were phased. The largest fragments from each subgenome was selected for subsequent scaffolding steps, while the smaller fragment was added to the 'haplotype variant' file and are available along with the remainder of the genome. 8) HiC Scaffolding. The four representative homoeologous scaffolds, raw reads, and Dovetail HiC library reads were used as input data (Supplementary Figure 2) for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies to chromosome-length pseudomolecules (Supplemental Figure 3). A total of 371 million 150bp paired-end reads were sequenced; equating to ~401x sequence depth across the genome. 9) Gap Filling. After HiRise scaffolding, sequences were gap filled with PacBio reads (Supplementary Table 3) using PBJelly 3 , which were first error corrected using Pilon 4 . This resulted in ~7.5Mb of new sequence and bringing the total assembly size to 805,488,706bp across 28 chromosomes ( Figure 1).

Phylogenomics
The goal of this phylogenetic analyses was to identify each of the diploid progenitor species for each of the subgenomes in octoploid strawberry (Supplemental Table 7). Thus, all twelve described diploid species, plus four F. vesca subspecies, collected from around the world were sampled in this study. We included multiple accessions for each species, if available, and all germplasm is available from the National Clonal Germplasm Repository (https://www.ars.usda.gov/pacific-west-area/corvallis-or/national-clonal-germplasm-repository/) located in Corvallis, OR. Polyploid intermediates were not included in this study. A total of 51,737 protein coding genes from 'Camarosa' were assessed for evolutionary history in 8,405 gene trees. Using Phylogenetic iDentification of Subgenomes (https://github.com/mrmckain/PhyDS), we identified how many of each taxon and accession were found in a clade with a Camarosa gene using various BSV cutoff values --see Supplemental Figure 5. We also considered situations where there were only Camarosa and one other taxon in the clade to see if these data sets differed. The same trend was present with Fragaria vesca being the most prevalent with F. vesca var. bracteata having most instances of relatedness of all F. vesca subspecies (1757 subtrees; BSV50: 1135 subtrees; BSV80: 515 subtrees). F. innumae is the second most prevalent taxon (971 subtrees; BSV50: 718 subtrees; BSV80: 336 subtrees). F. viridis is the third most prevalent taxon (700 subtrees; BSV50: 459 subtrees; BSV80: 208 subtrees). The fourth most prevalent taxon is F. nipponica (611 subtrees; BSV50: 390 subtrees, BSV80: 159 subtrees). The hybridization sequence of the four progenitor species towards the formation of the octoploid is depicted in Supplemental Figure 7. Related diploid species were also identified (Supplemental Figure 6) for reasons outlined in Supplemental Figure 8. However, introgression via interspecific hybridization, as well as incomplete lineage sorting (ILS), are also likely sources of incongruence among some trees given that these are common evolutionary processes across higher eukaryotes including plants. However, given the relative abundance of the four identified diploid progenitors and identification of immediate sister species (e.g. F. vesca ssp. vesca & ssp. californica), majority of trees do not support an evolutionary history of ILS and/or interspecific hybridization events.
Chromosome assignments for F. vesca and F. iinumae subgenomes are congruent with a previous study using a genetic mapping approach 5 . These results support our phylogenetic approach for assigning chromosomes to each of these subgenomes. However, future studies are needed to confirm the chromosome assignments for F. viridis and F. nipponica. Reference genomes for these two diploid species are ideally needed to confirm chromosome assignments and regions of homoeologous exchanges (Supplemental Figure 16). Importantly, downstream analyses of subgenome dominance presented in the manuscript are not impacted by the chromosome assignments for the F. viridis and F. nipponica subgenomes. We largely only compare the dominant F. vesca subgenome to all of the other three subgenomes combined, including comparisons of gene expression, gene content, TE content, and tandem gene duplicates (see below Section 5). The three submissive subgenomes are treated individually for only the analysis of homoeologous exchanges with nearly identical results for F. viridis (9.8x bias) and F. nipponica (10.4x bias). Any chromosome mis-assignments among these, the two most submissive subgenomes, would have no impact on the subgenome dominance patterns observed in octoploid strawberry.
Furthermore, each subgenome has undergone a history of gene loss and homoeologous exchanges since the hybridization and polyploid events (see below section 5). The two most submissive subgenomes, F. viridis and F. nipponica, have undergone the most changes since merging into the nucleus with the other two subgenomes. These chromosomes are now remnant fragments of the original chromosomes mixed with homoeologous regions derived from the other progenitor species. Thus, it's possible that some chromosomes may have been misassigned or possibility due to their fragmented nature may not be able to be perfectly assigned to one of these two diploid progenitor species. Chromosome assignments will be confirmed when reference genomes become available for the two extant relatives of these progenitor species.
All of the data, including the raw data and phylogenetic trees, and scripts associated with these phylogenetic analyses are available on the NCBI database under BioProject PRJNA508389, Dryad Digitial Respository (see URLs), and GitHub (see URLs).

Supplementary Figures
Supplemental Figure 1: Syntenic comparison of subgenomes and haplotypes of Chromosome 3. Each color (yellow, blue, red and green) is unique to each of the homoeologous chromosomes compared to F. vesca genome 2 . Figure 2: Distribution of insert sizes in the Dovetail library. The distance between the forward and reverse reads is given on the X-axis in basepairs, and the probability of observing a read pair with a given insert size is shown on the Y-axis. Figure 3: A comparison of the contiguity of the input assembly (shown in blue) and the final HiRise scaffolds (shown in orange). Each curve shows the fraction of the total length of the assembly present in scaffolds of a given length or smaller. The fraction of the assembly is indicated on the Y-axis and the scaffold length in basepairs is given on the X-axis. The two dashed lines mark the N50 and N90 lengths of each assembly. Scaffolds less than 1 kb are excluded.       Table 7: Transcriptome datasets generated in this study for phylogenetic analyses.

Supplemental
. Table 8: Chromosomes shown in Figure 1 with diploid progenitor identified from phylogenetic analysis using PhyDs (Supplemental Figure 6) with assembled chromosome size (length to the basepair is shown). The number of phylogenetically supported genes with bootstrap values >50% identified using PhyDs. The number of syntenic orthologs (syntelogs) with sequence identity as being similar to the "old world" diploid species, shown in Supplemental Figure 15, are provided for the chromosomes identified for F. viridis, F. iinumae, and F. nipponica. The number of syntelogs that are similar to the F. vesca genome, shown in Supplemental Figure 15, for F. vesca chromosomes are provided.

Supplemental
Supplemental Table 9: Gene and transposable element content per subgenome Supplemental