Cultivated strawberry emerged from the hybridization of two wild octoploid species, both descended from the merger of four diploid progenitor species into a single nucleus more than 1 million years ago. Here we report a near-complete chromosome-scale assembly for cultivated octoploid strawberry (Fragaria × ananassa) and uncover the origin and evolutionary processes that shaped this complex allopolyploid. We identified the extant relatives of each diploid progenitor species and provide support for the North American origin of octoploid strawberry. We examined the dynamics among the four subgenomes in octoploid strawberry and uncovered the presence of a single dominant subgenome with significantly greater gene content, gene expression abundance, and biased exchanges between homoeologous chromosomes, as compared with the other subgenomes. Pathway analysis showed that certain metabolomic and disease-resistance traits are largely controlled by the dominant subgenome. These findings and the reference genome should serve as a powerful platform for future evolutionary studies and enable molecular breeding in strawberry.
The cultivated garden strawberry (Fragaria × ananassa), an allo-octoploid (2n = 8x = 56), has a unique natural and domestication history, originating as an interspecific hybrid between wild octoploid progenitor species approximately 300 years before present1. The genomes of the progenitor species, Fragaria virginiana and Fragaria chiloensis, are the products of polyploid evolution: they were formed by the fusion of and interactions among genomes from four diploid progenitor species (that is, subgenomes) approximately 1 million years before present2. Whereas two of the diploid progenitor species have been identified3, the other two diploid progenitor species have remained unknown. Moreover, the history of events leading to the formation of the octoploid lineage and the evolutionary dynamics among the four subgenomes that restabilized cellular processes after ‘genomic shock’4 in allopolyploids remain poorly understood. Here, we present what is, to our knowledge, the first chromosome-scale assembly of an octoploid strawberry genome, the identities of the extant diploid progenitor species of each subgenome, and novel insights into the collective evolutionary processes involved in establishing a dominant subgenome in this highly polyploid species.
The Rosaceae are a large eudicot family including a rich diversity of crops with major economic importance worldwide, such as nuts (for example, almonds), ornamentals (for example, roses), pome fruits (for example, apples), stone fruits (for example, peaches), and berries (for example, strawberries)5. Strawberries are prized by consumers, largely because of their complex array of flavors and aromas. The genus Fragaria was named by the botanist Carl Linnaeus, on the basis of the Latin word ‘fragrans’, meaning ‘sweet scented’, describing its striking, highly aromatic fruit6. A total of 22 wild species of Fragaria have been described, ranging from diploid (2n = 2x = 14) to decaploid (2n = 10x = 70)7. The genus Fragaria is highly interfertile between and within ploidy levels, thus leading to the natural formation of higher-polyploid species8,9.
Polyploid events, also known as whole-genome duplications, have been an important recurrent process throughout the evolutionary history of eukaryotes and have probably contributed to novel and varied phenotypes10,11,12,13. Polyploids are grouped into two main categories: autopolyploids and allopolyploids, involving either a single or multiple diploid progenitor species, respectively14,15. Many crop species are allopolyploids16, thus contributing to the emergence of important agronomic traits such as spinnable fibers in cotton17, diversified morphotypes in Brassica18, and varied aroma and flavor profiles in strawberry19. Allopolyploids face the challenge of organizing distinct parental subgenomes—each with a unique genetic and epigenetic makeup shaped by independent evolutionary histories—residing within a single nucleus15. Previous studies have proposed, as part of the ‘subgenome dominance’ hypothesis20, that the establishment of a single dominant subgenome may resolve various (epi)genetic conflicts in allopolyploids21,22,23,24. However, understanding of the underlying mechanisms and ultimate consequences of subgenome dominance remains largely incomplete25.
Subgenome-level analyses in most allopolyploid systems are greatly hindered by the inability to confidently assign parental gene copies (that is, homoeologs) to each subgenome, owing to both large-scale chromosomal changes and homoeologous exchanges that shuffle and replace homoeologs among parental chromosomes26,27,28,29. Octoploid strawberry still has a complete set of homoeologous chromosomes from all four parental subgenomes, thus greatly simplifying homoeolog assignment. Furthermore, gene sequences from extant relatives of the diploid progenitor species, which probably still exist for octoploid strawberry3, can be used to accurately assign homoeologs to each parental subgenome29. However, a high-quality reference genome for the octoploid is needed to fully exploit strawberry as a model system for studying allopolyploidy as well as to provide a platform for identifying biologically and agriculturally important genes and applying genomic-enabled breeding approaches30. The assembly of the octoploid strawberry genome, with an estimated genome size of 813.4 Mb, has been particularly challenging because of its high heterozygosity and ploidy level31. For example, the most recently published version of the octoploid strawberry genome is highly fragmented, with more than 625,000 scaffolds, and largely incomplete, with less than 660 Mb assembled after removal of the numerous gaps31. Thus, that version of the genome, owing to its overall highly fragmented nature, has not been a useful resource for genome-wide analyses including the discovery of molecular markers for breeding.
Assembly and annotation of the octoploid strawberry genome
Our goal was to obtain a high-quality reference genome for the Fragaria × ananassa cultivar ‘Camarosa’, one of the most historically important and widely grown strawberry cultivars worldwide. We sequenced the genome through a combination of short- and long-read approaches, including Illumina, 10X Genomics, and PacBio, totaling 615-fold coverage of the genome (Supplementary Table 1). Illumina (455-fold coverage) and 10X Genomics (117-fold coverage) data were assembled and scaffolded with the software package DenovoMAGIC3 (NRGene) (Supplementary Table 2), which has recently been used to assemble the allotetraploid wheat (Triticum turgidum) genome32. We further scaffolded the genome to chromosome scale by using Hi-C data (401-fold coverage) in combination with the HiRise pipeline (Dovetail) (Supplementary Figs. 1–3), then performed gap-filling with 43-fold-coverage error-corrected PacBio reads with PBJelly33 (Supplementary Table 3). The total length of the final assembly is 805,488,706 bp, distributed across 28 chromosome-level pseudomolecules (Fig. 1) and representing ~99% of the estimated genome size, on the basis of flow cytometry measurements. A genetic map for Fragaria × ananassa34 was used to correct any misassemblies, and comparisons to Fragaria vesca were used to identify homoeologous chromosomes.
We annotated 108,087 protein-coding genes along with 30,703 genes encoding long noncoding RNAs (lncRNAs), which were subdivided into 15,621 long intergenic noncoding RNAs, 9,265 antisense overlapping transcripts (AOT-lncRNAs), and 5,817 sense overlapping transcripts (SOT-lncRNAs) (Supplementary Table 4). Gene annotation and genome-assembly quality were evaluated with the Benchmarking Universal Single-Copy Orthologs v 2 (BUSCO)35 method (Supplementary Table 5). Most (99.17%) of the 1,440 core genes in the embryophyta dataset were identified in the annotation, thus supporting a high-quality genome assembly. The repetitive components of the nuclear genome were annotated with a custom-repeat-library approach36, including DNA transposons, long-terminal-repeat retrotransposons (LTR-RTs; for example, Copia and Gypsy), and non-LTR retrotransposons (Supplementary Table 6 and Supplementary Fig. 4). Transposable element (TE)-related sequences make up ~36% of the total genome assembly, and LTR-RTs are the most abundant TEs (~28%). The plastid and mitochondrial genomes were also assembled, annotated, and verified for completeness (Supplementary Fig. 5).
Origin of octoploid strawberry
Using the Fragaria × ananassa reference-genome assembly, we sought to identify the extant diploid relatives of each subgenome donor37. Previous phylogenetic studies aimed at identifying these progenitor species, often analyzing a limited number or different sets of molecular markers, have obtained inconsistent results3,38,39. However, F. vesca has long been suspected to be a progenitor, on the basis of meiotic chromosome pairing40; subsequent molecular phylogenetic analyses supported it being one of the diploid progenitors along with Fragaria iinumae and two additional unknown species3. We sequenced and de novo assembled 31 transcriptomes of every described diploid Fragaria species, which we used to identify progenitor species on the basis of the phylogenetic analysis of 19,302 nuclear genes in the genome (Fig. 2, Supplementary Figs. 6–8 and Supplementary Table 7). To our knowledge, this is the most comprehensive molecular phylogenetic analysis of the genus Fragaria to date, including the greatest number of molecular markers and sampling of diploid species, aimed at identifying the extant relatives of the progenitor species of octoploid strawberry (Supplementary Fig. 9 and Supplementary Table 8).
Our phylogenetic analyses provided strong genome-wide support for the two diploid progenitor species that had been previously hypothesized and identified the two previously unknown diploid progenitors. This discovery, together with the geographic distributions, natural history, and genomic footprints of the diploid species, provided a model for the chronological formation of intermediate polyploids that culminated in the formation of the octoploid (Fig. 2). Our phylogenetic analyses revealed F. iinumae and Fragaria nipponica as two of the four extant diploid progenitor species, both of which are endemic to Japan and in geographic proximity to all five described tetraploid species in China. The third species identified in our analyses, Fragaria viridis, is geographically distributed in Europe and Asia, and partially overlaps with the sole hexaploid species, Fragaria moschata. Therefore, we hypothesized that these tetraploid and hexaploid species may be evolutionary intermediates between the diploids and the wild octoploid species. This possibility is supported by a previous phylogenetic analysis identifying F. viridis as a possible parental contributor to both F. moschata and the octoploid event41. Finally, we identified F. vesca subsp. bracteata, which is endemic to the western part of North America, spanning Mexico to British Columbia, as the fourth parental contributor. Our species sampling also included two other F. vesca subspecies: F. vesca subsp. vesca, which is distributed from Europe to the Russian Far East, and F. vesca subsp. californica, which is endemic to the coast of California.
Octoploid strawberry species are geographically restricted to the New World and are largely distributed across North America, with the exception of isolated F. chiloensis populations in Chile and the Hawaiian Islands42. Therefore, our phylogenetic analyses combined with the geographic distributions of extant species not only support a North American origin for the octoploid strawberry but also suggest that F. vesca subsp. bracteata was probably the last diploid progenitor species to contribute to the formation of the ancestral octoploid strawberry. This possibility is further supported by a previous study revealing F. vesca subsp. bracteata as the likely maternal donor of the octoploid event, on the basis of the phylogenetic history of the plastid genome2. This finding is consistent with our analysis of the plastid genome of ‘Camarosa’ (Supplementary Fig. 10). Thus, these data suggest that the hexaploid ancestor probably crossed into North America from Asia and hybridized with native populations of F. vesca subsp. bracteata, an event dated at ~1.1 million years before present2. Our phylogenetic analysis also identified related diploid species possibly arising from ancient hybridization and introgression events with putative progenitor species or issues related to incomplete lineage sorting and/or missing data (Supplementary Fig. 6). Future studies will be able to more thoroughly investigate these possibilities after reference-quality genomes are assembled for these other diploid progenitor species.
Subgenome dominance in allopolyploids
After most ancient allopolyploid events, one of the subgenomes, commonly referred to as the ‘dominant’ subgenome, emerges with significantly greater gene content and more highly expressed homoeologs (that is, postpolyploidy duplicate genes) than those of the other ‘submissive’ subgenome(s)21. Biased fractionation, which results in greater gene content of the dominant subgenome43, was first described in the model plant Arabidopsis thaliana21 and later described in Zea mays (maize)20, Brassica rapa (Chinese cabbage)44, and Triticum aestivum (bread wheat)45. The dominant subgenome has also been shown to be under stronger selective constraints46,47,48 and to be heritable through successive allopolyploid events49, and, as predicted22, it is not observed in ancient autopolyploids50,51,52. Moreover, subgenome expression dominance has recently been shown to occur instantly after interspecific hybridization and to increase over successive generations in monkeyflower23. However, some allopolyploids, including Capsella bursa-pastoris53 and Cucurbita species54, do not exhibit subgenome dominance.
The emergence of a dominant subgenome may resolve various genetic and epigenetic conflicts that arise from the genomic merger of divergent diploid progenitor species4,55, including mismatches between transcriptional regulators and their target genes24. The mechanistic basis of subgenome dominance, at least in part, appears to be related to subgenome differences in the content and regulation of TEs22,56. Gene expression levels are negatively correlated with the density of nearby TEs56 (Supplementary Fig. 11). Thus, the merger of subgenomes with different TE densities results in higher gene expression for the dominant homoeolog with fewer TEs22. The abundance and distribution of TEs can be used to predict gene expression dominance and eventual gene loss at the individual homoeolog level23.
Having identified the extant diploid relatives of octoploid strawberry, we used this information to investigate the evolutionary dynamics among the four subgenomes. We identified a dominant subgenome that was contributed by the F. vesca progenitor (Fig. 1) and has retained 20.2% more protein-coding genes and 14.2% more lncRNA genes, and has overall 19.5% fewer TEs than the other homoeologous chromosomes (Supplementary Table 9). The overall TE densities near genes were also lowest for F. vesca compared with the other parental subgenomes (Supplementary Fig. 11). Furthermore, we identified ~40.6% more tandem gene duplications on homoeologous chromosomes of F. vesca compared with the other subgenomes (Supplementary Table 9). The F. vesca subgenome, compared with the other subgenomes, also contains a greater number of tandem gene arrays as well as larger average tandem-gene-array sizes on six of seven homoeologous chromosomes. These findings suggest that the dominant F. vesca subgenome, compared with the other three subgenomes, has been under stronger selective constraints to retain genes, including tandemly duplicated genes known to be biased toward gene families that encode important adaptive traits57,58. For example, major disease-resistance genes in plants, including nucleotide-binding-site leucine-rich-repeat genes (NBS-LRRs), which are usually clustered in tandem arrays59, are biased toward the dominant F. vesca subgenome (χ² test, P < 0.0001; Supplementary Fig. 12).
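A χ² test of the kind cited above can be sketched as a goodness-of-fit test of gene counts across the four subgenomes. This is only an illustration under stated assumptions: the text does not specify the null expectation, so equal expected fractions are assumed here, and the counts in the example are hypothetical toy values, not data from this study. The function names are likewise illustrative.

```python
import math

def chi_square_statistic(observed, expected_fractions):
    """Pearson chi-square statistic for observed counts vs. expected fractions."""
    total = sum(observed)
    expected = [total * f for f in expected_fractions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of
    freedom (four categories), using the closed form for df = 3."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Hypothetical toy counts for four subgenomes (NOT values from the study).
toy_counts = [40, 20, 20, 20]
# Assumed null: genes spread evenly over the four subgenomes.
null_fractions = [0.25, 0.25, 0.25, 0.25]

stat = chi_square_statistic(toy_counts, null_fractions)
p_value = chi_square_sf_df3(stat)
print(f"chi2 = {stat:.2f}, P = {p_value:.4f}")
```

A more realistic null would weight the expected fractions by the total gene content of each subgenome, which can be passed in through `expected_fractions`.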
Because strawberry production is threatened by several agriculturally important diseases, we analyzed, in greater depth, the major family of plant resistance (R) genes60,61. Collectively, 423 NBS-LRR genes were identified, including 195 encoding an N-terminal coiled-coil (CC), 79 encoding toll interleukin 1 receptor (TIR), and 24 encoding resistance to powdery mildew 8 (RPW8) domains (Supplementary Fig. 12). Recent work has demonstrated that many R proteins recognize pathogen effectors through integrated decoy domains62, and the F. vesca genome encodes 20 such protein models63. Fragaria × ananassa has a greatly expanded set of 105 diverse domains that are fused to the R-protein structures and have the potential to function as integrated decoys62 (Supplementary Fig. 13 and Supplementary Dataset 1). Only a few resistance genes have been phenotypically identified in Fragaria × ananassa, but none have been functionally characterized64,65,66. The annotated genome thus provides a framework for accelerating R-gene discovery, connecting phenotype to genotype, and pyramiding R genes by developing targeted, homoeolog-specific molecular markers.
Although chromosomes contributed by the F. vesca progenitor retained the most genes overall, certain regions on chromosomes from the other progenitor species retained higher numbers of ancestral genes (Fig. 1b and Supplementary Fig. 14). Further analysis revealed that these regions are the products of homoeologous exchanges (HEs) or gene-conversion events28,67,68 (Supplementary Figs. 15 and 16). Notably, most HEs in octoploid strawberry involved replacements of the submissive homoeologs by corresponding regions of the dominant F. vesca subgenome (Supplementary Table 10). For example, our phylogenetic and comparative genomic analyses showed that HEs are 7.3× biased toward the F. vesca subgenome compared with F. iinumae, but they are not unidirectional as previously reported3. HEs were even more biased toward the F. vesca subgenome compared with the other two subgenomes (9.8× for F. viridis and 10.4× for F. nipponica). These analyses validate findings from a previous study in wild octoploid strawberry3 and show that portions of the F. iinumae subgenome have been replaced with the F. vesca subgenome (Fig. 1b). Here, we identified HEs ranging in size from single genes to megabase-sized regions on chromosomes (Supplementary Table 10), findings similar to the patterns observed in other allopolyploids including Brassica napus (rapeseed)27,28, Gossypium hirsutum (cotton)67,69, and bread wheat70. The observed bias of HEs genome wide may be due to selection favoring the maintenance of proper network stoichiometry71 and altered dosage of certain gene products72 during the establishment of the dominant subgenome. Interestingly, 32.6% of NBS-LRR genes encoded on the three submissive subgenomes are derived from HE with the F. vesca subgenome. This result suggests that although the F. vesca subgenome may also dominate disease resistance in strawberry, the maintained diversity of resistance mechanisms contributed by the other three diploid progenitors may also have been under selection.
Finally, we examined gene expression in diverse organs to test whether the dominant F. vesca subgenome is more highly expressed than the submissive subgenomes (Fig. 3), as predicted by the subgenome-dominance hypothesis22,25. The density of TEs near genes was found to be negatively correlated with gene expression across all subgenomes (Supplementary Fig. 11a). Because HEs reshuffled and replaced homoeologs across each of the four parental chromosomes, only homoeolog pairs that had support for subgenome assignment were evaluated for subgenome expression dominance (that is, homoeolog expression bias). Our analyses revealed that the dominant F. vesca subgenome, which had the lowest overall TE densities near genes of all subgenomes (Supplementary Fig. 11b; Kolmogorov–Smirnov test, P < 10^−33), encodes more significantly dominantly expressed homoeologs than the other three submissive subgenomes combined (Fig. 3c). This finding supports the hypothesis that subgenome expression dominance is influenced by overall TE-density differences between subgenomes22. At the individual homoeolog level, many dominantly expressed homoeologs were also contributed by one of the three submissive subgenomes. This observation was expected, given the variation in TE densities near homoeologs in each of the diploid progenitor genomes23,73.
Most HEs in octoploid strawberry resulted in the dominant F. vesca subgenome replacing the corresponding homoeologous regions of one of the submissive subgenomes. Thus, the observed homoeolog expression bias toward the F. vesca subgenome in Fig. 3 is an underestimate of transcriptome-wide expression dominance (68.7% of all transcripts). This bias has resulted in certain biological pathways being largely controlled by a single dominant subgenome. Our analyses revealed that certain metabolic pathways, including those that give rise to strawberry flavor, color, and aroma, are largely controlled by the dominant subgenome. For example, F. vesca homoeologs in octoploid strawberry are responsible for 88.8% of the biosynthesis of anthocyanins, the metabolites responsible for the red pigments in ripening strawberry fruit; 89.2% of the biosynthesis of geranyl acetate, a terpene associated with fruit aroma; and 95.3% of the biosynthesis of fructose associated with sweetness (Supplementary Dataset 2). Similar results have been found in allotetraploid Brassica juncea, in which many dominant homoeologs have been found to be related to glucosinolate biosynthesis and to show signs of positive selection74.
We present what is, to our knowledge, the first chromosome-scale genome assembly for an octoploid strawberry—the highest-level polyploid genome of this quality assembled to date. Analysis of this genome allowed us to identify each of the diploid progenitor species, reconstruct the evolutionary history of the octoploid event, and investigate the evolution of a dominant subgenome. Our data support the hypothesis that subgenome dominance in an allopolyploid is established by TE-density differences near homoeologous genes in each of the diploid progenitor genomes22. Furthermore, our results show that the F. vesca subgenome has increased in dominance over time by having retained significantly more ancestral genes and a greater number of tandemly duplicated genes than the other three subgenomes, and replaced large portions of the submissive subgenomes via homoeologous exchanges. These trends, combined with subgenome expression dominance, have resulted in many traits being largely controlled by a single dominant subgenome in octoploid strawberry. This finding is consistent with results from a recent report indicating that the dominant subgenome in maize contributes more to phenotypic variation than the submissive subgenome48. This reference genome should serve as a powerful platform for breeders to develop homoeolog-specific markers to track and leverage allelic diversity at target loci. Thus, we anticipate that this new reference genome, combined with insights into subgenome dominance, will greatly accelerate molecular breeding efforts in the cultivated garden strawberry.
Sequence Read Archive, https://www.ncbi.nlm.nih.gov/sra/; Dryad, https://doi.org/10.5061/dryad.b2c58pc; PhyDS, https://github.com/mrmckain/PhyDS/; GDR, https://www.rosaceae.org/; CoGe, https://genomevolution.org/r/tx72/; RefTrans, https://github.com/mrmckain/RefTrans/; annoBTD, https://github.com/mrmckain/annoBTD/; Mitofy, http://dogma.ccbb.utexas.edu/mitofy/; dotPlotly, https://github.com/tpoorten/dotPlotly/; NCBI Conserved Domain Database, www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi/; Pfam database, www.ebi.ac.uk/Tools/pfa/pfamscan/; FastQC, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/; R, https://www.r-project.org/; RepeatMasker, http://www.repeatmasker.org/; RepeatModeler, http://www.repeatmasker.org/RepeatModeler/; Google Maps, https://www.google.com/maps/.
The cultivar ‘Camarosa’ was selected because of its importance to the industry; historically, it has been one of the most widely grown short-day varieties worldwide, and it remains an important genotype in breeding programs. The haploid genome size (~813.4 Mb) was estimated through flow cytometry with four technical replicates at the Flow Cytometry Core at Benaroya Research Institute at Virginia Mason (Supplementary Dataset 3).
High-molecular-weight genomic DNA was isolated from young leaf tissue, after a 72-h dark treatment, through a modified nuclei-preparation method75,76, and the quality was verified through pulsed-field gel electrophoresis. A total of five PacBio 20-kb libraries were generated with a SMRTbell Template Prep Kit (PacBio) and were sequenced with 67 SMRT cells on the PacBio RSII platform at the UC Davis DNA Sequencing Facility. A total of 67 Gb (~82.4×) of PacBio sequence data was generated with an N50 read length of 17,699 bp (Supplementary Table 3). DNA fragments longer than 50 kb were used to construct a 10X Gemcode library with a Chromium instrument (10X Genomics) and sequenced on a HiSeqX system (Illumina) with paired-end, 150-bp reads at the HudsonAlpha Institute for Biotechnology. A total of ~95 Gb (~117-fold coverage) of 10X Chromium library data was sequenced (Supplementary Table 1). Finally, five size-selected Illumina genomic libraries ranging from 470 bp to 10 kb were constructed (Supplementary Table 1). The ~470-bp and ~800-bp libraries were made with an Illumina TruSeq DNA PCR-free Sample Preparation V2 Kit. The two ~470-bp libraries were designed to produce ‘overlapping libraries’ after sequencing with paired-end, 265-bp reads on an Illumina HiSeq2500 system, producing ‘stitched’ reads of approximately 265 bp to 520 bp in length. To increase sequence diversity and depth, we constructed three separate mate-pair (MP) libraries with jumps of 2–5 kb, 5–7 kb, and 7–10 kb, with an Illumina Nextera Mate-Pair Sample Preparation Kit. The 800-bp library was sequenced on an Illumina HiSeq2500 system with paired-end, 160-bp reads, and the MP libraries were sequenced on an Illumina HiSeq4000 system with paired-end, 150-bp reads. A total of ~370 Gb (~455-fold coverage) of additional Illumina sequencing data was generated (Supplementary Table 1). Illumina library construction and sequencing were conducted at the Roy J. Carver Biotechnology Center, University of Illinois at Urbana-Champaign.
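The fold-coverage figures quoted for each platform follow from dividing the sequence yield by the estimated haploid genome size. The short sketch below reproduces that arithmetic from the yields and the ~813.4-Mb flow cytometry estimate given in the text; the function and variable names are illustrative only.

```python
# Estimated haploid genome size from flow cytometry (~813.4 Mb), in Gb.
GENOME_SIZE_GB = 0.8134

def fold_coverage(yield_gb, genome_size_gb=GENOME_SIZE_GB):
    """Sequencing depth as total bases sequenced divided by genome size."""
    return yield_gb / genome_size_gb

# Approximate yields (Gb) as reported in the text.
yields_gb = {"PacBio": 67.0, "10X Chromium": 95.0, "Illumina": 370.0}

coverage = {name: fold_coverage(gb) for name, gb in yields_gb.items()}
for name, depth in coverage.items():
    print(f"{name}: ~{depth:.1f}-fold")
```

Because the yields are rounded, the computed depths agree with the reported ~82.4×, ~117× and ~455× values only to within rounding.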
The genome was assembled with the DeNovoMAGIC software platform (NRGene), a De Bruijn graph-based assembler designed for higher polyploid, heterozygous and/or repetitive genomes32,77. The Chromium 10X data were used to phase haplotypes and support scaffold validation and further elongation of the phased scaffolds. Dovetail HiC libraries were prepared as described previously78 and sequenced on an Illumina HiSeqX system with paired-end, 150-bp reads to ~401× sequence depth of the genome (Supplementary Fig. 2). The initial de novo assembly, raw genomic reads, and Dovetail HiC library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity-ligation data to scaffold genome assemblies to chromosome-length pseudomolecules79. After HiRise scaffolding, the sequences were gap filled with PacBio reads with PBJelly33. Gaps filled with PacBio sequences were polished with Pilon (v 1.22)80 with Illumina paired-end data. Illumina reads were quality-trimmed with Trimmomatic81 and aligned to the draft contigs with bowtie2 (v 2.3.0)82 with default parameters. Parameters for Pilon were modified as follows: --flank 7, --K 49, and --mindepth 20. Pilon was run recursively three times, and there were minimal corrections in the third round, thus supporting accurate indel correction. A published genetic map34 and syntenic analyses against the F. vesca genome37 with SynMap within CoGe83 were used to identify any assembly errors and haplotype variants, and to assign homoeologous chromosome sets. Additional assembly details and results are summarized in the supplementary information.
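The three recursive Pilon rounds described above can be laid out as a sequence of commands in which each round polishes the output of the previous one. The sketch below only builds the command lines and does not run them. The file names, the jar path and the helper names are hypothetical; only the --flank, --K and --mindepth values come from the text. Realigning the trimmed Illumina reads to each intermediate assembly (for example with bowtie2) is assumed to happen between rounds and is not shown.

```python
def pilon_command(genome_fasta, bam, out_prefix, pilon_jar="pilon.jar"):
    """Build one Pilon invocation with the non-default parameters from the text."""
    return [
        "java", "-jar", pilon_jar,      # jar path is a placeholder
        "--genome", genome_fasta,
        "--frags", bam,                 # paired-end Illumina alignments
        "--output", out_prefix,
        "--flank", "7",
        "--K", "49",
        "--mindepth", "20",
    ]

def polishing_plan(draft_fasta, rounds=3):
    """Chain the rounds: each round takes the previous round's FASTA as input."""
    plan = []
    current = draft_fasta
    for i in range(1, rounds + 1):
        prefix = f"pilon_round{i}"
        bam = f"round{i}.sorted.bam"    # hypothetical realignment for this round
        plan.append(pilon_command(current, bam, prefix))
        current = f"{prefix}.fasta"     # Pilon writes <output prefix>.fasta
    return plan

plan = polishing_plan("gap_filled.fasta")
for cmd in plan:
    print(" ".join(cmd))
```

Comparing the number of changes reported in each round is one way to judge convergence, in line with the observation that the third round made minimal corrections.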
Tissue collection, RNA library preparation, and sequencing
Plant tissue samples (flower before anthesis, flower at anthesis, leaf collected during the day and at night, leaves treated with methyl jasmonate (30 min, 4 h, and 24 h after treatment), runner, and salt-treated and untreated roots) were collected from Fragaria × ananassa cultivar ‘Camarosa’ grown in a growth chamber and immediately flash frozen in liquid nitrogen. Leaf tissues were also collected from wild diploid species grown in a growth chamber for phylogenetic analyses (Supplementary Table 7). Total RNA was isolated with a KingFisher Pure RNA Plant Kit (Thermo Fisher) and quantified with a Qubit 3 fluorometer (Thermo Fisher). RNA libraries were prepared with the KAPA mRNA HyperPrep Kit protocol (KAPA Biosystems). All samples were submitted to the Michigan State University Research Technology Support Facility Genomics core and sequenced with paired-end, 150-bp reads on an Illumina HiSeq 4000 system.
Transcriptome assembly and translation
Reads were cleaned with Trimmomatic v 0.32 (ref. 81) with adaptor trimming for TruSeq3 paired-end reads with a 1-bp mismatch, a palindrome clip threshold of 30, and a simple clip threshold of 10. Reads were then filtered on the basis of an average phred score calculated from a sliding window of 10 bp with a minimum threshold of 20 (Supplementary Dataset 4). The quality of trimmed reads was assessed afterward with FastQC84. Genome-guided and de novo transcriptome assemblies were generated with Trinity v 2.2.0 (ref. 85) for the genome annotation/expression and phylogenetic analyses, respectively. For genome annotation and expression analyses, reads were aligned to the Fragaria × ananassa cultivar ‘Camarosa’ genome with STAR v 2.5.3a (ref. 86) with default options, except for --alignIntronMax, which was set to 10000. For genome annotation, the coordinate-sorted BAM output files from STAR were used for the genome-guided transcriptome assembly, and name-sorted SAM files were used for gene expression analysis (HTSeq in section 3). For the diploid species libraries used in the phylogenetic analyses, because transcriptome libraries were generated with a stranded method, the ‘SS_lib_type’ parameter with ‘RF’ option was used in the assembly. In addition, reads were normalized to a maximum read coverage of 100 with ‘normalize_max_read_cov’ in Trinity. The normalization option, which decreases the quantity of input reads for highly expressed genes, was used to improve assembly efficiency87. For homoeolog expression bias (HEB) analyses (described in the section below), counts of uniquely mapping reads were generated with HTSeq v 0.6.1 (ref. 88) with default options of htseq-count, except for feature type, which was set to ‘gene’ for all RNA-seq datasets of ‘Camarosa’. The fragments per kilobase per million reads mapped (FPKM) values were derived with the standard formula: FPKM = (read count / ‘per million’ scaling factor) / gene length in kilobases.
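The FPKM formula stated above can be written as a small function. This is a minimal sketch that assumes the ‘per million’ scaling factor is the total number of mapped reads in the library divided by one million; the function name and the example numbers are illustrative, not values from this study.

```python
def fpkm(read_count, total_mapped_reads, gene_length_bp):
    """FPKM = (read count / per-million scaling factor) / gene length in kb."""
    per_million = total_mapped_reads / 1_000_000   # library-size scaling factor
    length_kb = gene_length_bp / 1_000             # gene length in kilobases
    return (read_count / per_million) / length_kb

# Toy example: 500 reads on a 2-kb gene in a library of 20 million mapped reads.
example = fpkm(read_count=500, total_mapped_reads=20_000_000, gene_length_bp=2_000)
print(example)
```

Dividing by library size first and gene length second makes values comparable across both samples and genes of different lengths.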
For phylogenetic analysis, according to McKain et al.89, reads were aligned to the assembled transcripts with bowtie v 1.1.0 (ref. 90), and transcript abundance was estimated with RSEM v 1.2.29 (ref. 91) through the align_and_estimate_abundance.pl script packaged with Trinity. Transcripts were filtered by FPKM, an output from the aforementioned Perl script, with a minimum threshold of 1.0% of fragments per isoform mapped, as implemented in the filter_fasta_by_rsem_values.pl script. Filtered transcripts were BLASTed against the Fragaria vesca v 2.01 coding sequences with TBLASTX with an e-value cutoff of 1 × 10^−10. The RefTrans package (see URLs) was used to translate assembled transcripts by filtering BLAST hits to identify the best hit with at least 75% bidirectional overlap between the transcript and F. vesca coding sequences. Best hits were used to guide translations with GeneWise (Wise2 v 2.2.0)92. The longest translations were used in downstream analyses.
The genome was annotated with the MAKER-P annotation pipeline36. Protein sequences (Araport11 and UniprotKB plant database), expressed sequence tags (NCBI), and ten mRNA-seq datasets (described below) and additional RNA-seq data for Fragaria × ananassa downloaded from NCBI-SRA (BioProject PRJNA394190; red ripening fruit) were used as evidence during annotation. The RNA-seq datasets were assembled into transcripts through the StringTie genome-guided approach93. A custom repeat library (‘Repeat annotation’ section below) and MAKER repeat library94 were used for genome masking. Ab initio gene prediction was performed with the gene predictors SNAP95 and Augustus96, which were previously iteratively trained for F. vesca37. During annotation, gene models with annotation edit distance <1.0 were included in the MAKER gene set and scanned for the presence of protein domains. The predicted gene models were further filtered to remove those with TE-related domains. Briefly, the protein-coding genes were searched (BLASTp, e value = 1 × 10^-10) against a transposase database from a previous study36, and if more than 50% of gene length aligned to the transposases, the gene was removed from the gene set. However, if 60% or more of the amino acid matches were due to only three individual amino acids, the alignment was considered to be caused by low complexity and was excluded. In addition, to assess whether core plant genes were annotated, the gene set was searched against the BUSCO v 2 (ref. 35) plant dataset (embryophyta_odb9). lncRNAs, including long intergenic noncoding RNAs, antisense overlapping transcripts, and sense overlapping transcripts, were identified with the Evolinc lncRNA-discovery pipeline (v 1.5.1)97. Transcripts with fewer than three reads per base pair were discarded. Putative lncRNAs with similarity (BLASTn e value <1 × 10^-10) to known TEs or Rfam’s catalog (v 13.0)98 of housekeeping RNAs were removed.
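The transposase filter described above can be sketched as a simple decision rule. The Python below is a hedged illustration only, not the pipeline used in the study: the function names and inputs are hypothetical, gene length and aligned length are assumed to be measured in amino acids, and ‘matches’ is assumed to mean the residues at matching alignment positions.

```python
from collections import Counter


def is_low_complexity(matched_residues, top_n=3, threshold=0.60):
    """True if the top_n most common residues make up >= threshold of matches."""
    if not matched_residues:
        return False
    counts = Counter(matched_residues)
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / len(matched_residues) >= threshold


def is_te_related(gene_length, aligned_length, matched_residues):
    """Flag a gene for removal when >50% of its length aligns to transposases,
    unless the alignment is attributed to low complexity."""
    if is_low_complexity(matched_residues):
        return False  # alignment excluded, so the gene is kept
    return aligned_length / gene_length > 0.50


# Hypothetical inputs: a diverse match versus one dominated by S, P, and G.
diverse = "ACDEFGHIKLMNPQRSTVWY" * 10
biased = "S" * 100 + "P" * 60 + "G" * 20 + "ACDEFHIKLM" * 2
print(is_te_related(300, 200, diverse), is_te_related(300, 200, biased))
```

Checking low complexity before the 50% length rule reflects the wording of the text, in which a low-complexity alignment is excluded regardless of how much of the gene it covers.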
The Fragaria × ananassa genome was searched for LTR-RTs with LTRharvest99 with parameters ‘-minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1 -similar 85 -vic 10 -seed 20 -seqids yes’ and LTR_finder100 with parameters ‘-D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.9’. The identified LTR-RT candidates were filtered with LTR_retriever101 with default parameters. Miniature inverted-repeat TEs (MITEs) were identified with MITE-Hunter102. Candidate MITEs were manually checked for target site duplications (TSDs) and terminal inverted repeats (TIRs), which were used for superfamily classification. Those with ambiguous TSDs and TIRs were classified as unknowns. The Fragaria × ananassa genome was then masked with both MITE and LTR libraries through RepeatMasker103 (see URLs), and other repetitive elements were identified with RepeatModeler104 (see URLs). The repeats were then grouped into two categories: sequences of known identity and sequences of unknown identity. The latter were then searched against the transposase database, and if they had a match, they were included in the TE library. The library was further filtered with ProtExcluder36 and an in-house Perl script to exclude gene fragments. The final TE library was used to annotate the Fragaria × ananassa genome with RepeatMasker103 with parameters ‘-q -no_is -norna -nolow -div 40’. Annotation results were summarized with the ‘famcoverage.pl’ script from the LTR_retriever package101.
Organellar genome annotation
The chloroplast genome was annotated with Verdant, a web-based software suite specifically designed for plant chloroplast genomes105. Automated annotation of protein-coding genes, tRNAs, and rRNAs was completed with annoBTD (see URLs). Five Rosaceae plastomes in the Verdant database were selected as a reference for annotation, including the Fragaria vesca ‘Hawaii 4’ chloroplast genome37. The previously identified ORFs were BLASTed against the reference genomes with TBLASTX106 with an e-value cutoff of 0.1 and a cutoff of 50% identity between references and high-scoring segment pairs. The best reference for each ORF was used for annotation. An optimized BLASTN106 was used to identify and annotate tRNAs and rRNAs on the basis of reference genomes. The best-scoring references were used to annotate the RNA. Finally, the boundaries of each feature were identified on the basis of the sequence and positional information for the orthologous features from the five reference chloroplast genomes (Supplementary Fig. 5). The mitochondrial genome was annotated with the webserver for Mitofy (see URLs), a program designed to annotate the genes and tRNAs in the mitochondrial genomes of seed plants107. Mitofy uses NCBI-BLASTX to annotate genes on the basis of databases of 41 protein-coding genes and uses NCBI-BLASTN and tRNAscan-SE108 to annotate tRNAs and rRNAs on the basis of databases of 27 tRNAs and 3 rRNAs found in seed-plant mitochondrial genomes. The annotated plastid and mitochondrial genomes have been deposited in Dryad (see URLs).
Synteny and comparative genomics
The ‘Camarosa’ and F. vesca37 genomes were aligned in CoGe’s SynMap program with LAST83. The maximum distance between two matches was set to 20 genes, and the minimum number of aligned pairs was set to ten genes. Neighboring syntenic blocks were merged with ‘Quota Align Merge’109, with the maximum distance between two blocks set to 40 genes. Syntenic depth was calculated with ‘Quota Align’, and the ratio of coverage depth for F. vesca to Fragaria × ananassa genes was set to 1:4. Tandemly duplicated genes were identified and filtered from CoGe outputs with a maximum distance of ten genes. Fractionation bias was then calculated, with the maximum query chromosomes set to 28 and the maximum target chromosomes set to seven. The analyses can be regenerated with CoGe (see URLs). The two genomes were also aligned with MUMmer v 3.2 (ref. 110) to identify homoeologous exchanges (Supplementary Table 10) with parameters (nucmer --maxmatch -l 80 -c 200) and visualized with dotPlotly (see URLs).
Translated transcriptomes and whole-genome protein-coding genes for Fragaria × ananassa, F. vesca v 2.01, A. thaliana TAIR10 (ref. 111), and Malus domestica v 1.0 (ref. 112) (Phytozome v 12)113 were orthogrouped with OrthoFinder v 0.3 (ref. 114) with Diamond v 0.8.36 (ref. 115) for similarity searches. Orthogroups were filtered so that a minimum of five unique accessions were present. Coding sequences and amino acid translations were separated into orthogroup-specific FASTA files. Amino acid sequences were aligned with MAFFT v 7.215 (ref. 116) with the ‘auto’ parameter, and PAL2NAL v 14 (ref. 117) was used under default parameters to create a codon alignment from MAFFT-aligned amino acids. Codon alignments were filtered by removal of alignment columns with 90% or more gaps and transcripts with unaligned lengths less than 30% of the alignment length, with scripts provided with McKain et al.89. Orthogroup trees were reconstructed with RAxML v 8.0.6 with 500 bootstrap replicates under the GTR + gamma evolutionary model. All 108,087 protein-coding genes from the Fragaria × ananassa ‘Camarosa’ genome were used in the initial orthogrouping. After the filtering of orthogroups with fewer than five taxa, 51,737 ‘Camarosa’ genes remained in 8,405 gene trees. A total of 19,302 unique loci identified in large syntenic blocks forming 18,839 paralogous pairs were used to assess the evolutionary history of the subgenomes. Outgroups were chosen from either A. thaliana or M. domestica, with preference given to A. thaliana as an outgroup. To assess the evolutionary history of octoploid strawberry’s subgenomes, a novel tree-searching algorithm called ‘phylogenetic identification of subgenomes’ (PhyDS; see URLs) was developed. The only parameters needed for PhyDS are a list of taxa, if any, to ignore in the gene trees and a minimum bootstrap value to set the threshold for acceptable subtrees.
In this analysis, only genes from the ‘Camarosa’ genome were ignored (that is, PhyDS did not stop when it encountered an Fxa gene other than a sister paralog) to identify each of the diploid progenitors of octoploid strawberry. Results from varying bootstrap support cutoffs are provided. These homoeologs were then mapped back to each of the assembled chromosomes and, on the basis of their relative frequencies, used to assign each chromosome to a diploid progenitor species (Supplementary Table 8).
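The codon-alignment filtering step described above (removal of columns with 90% or more gaps, then removal of transcripts whose unaligned length is less than 30% of the alignment length) can be illustrated with the following Python sketch. It is not the McKain et al. script: the function name and data layout are hypothetical, and it assumes that columns are filtered first, that ‘unaligned length’ means the number of non-gap characters remaining, and that columns are evaluated individually rather than as whole codons.

```python
def filter_codon_alignment(alignment, max_gap_fraction=0.90, min_length_fraction=0.30):
    """alignment: dict mapping sequence name -> aligned sequence ('-' = gap)."""
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    # Step 1: keep only columns whose gap fraction is below the cutoff.
    keep = [
        i for i in range(length)
        if sum(alignment[name][i] == "-" for name in names) / len(names) < max_gap_fraction
    ]
    trimmed = {name: "".join(alignment[name][i] for i in keep) for name in names}

    # Step 2: drop sequences that are too short relative to the trimmed alignment.
    if not keep:
        return {}
    return {
        name: seq for name, seq in trimmed.items()
        if len(seq.replace("-", "")) / len(keep) >= min_length_fraction
    }


# Hypothetical toy alignment: ten sequences, one gappy column, one short sequence.
toy = {"s0": "ATGCCCA", "s9": "A------"}
toy.update({f"s{i}": "ATGCCC-" for i in range(1, 9)})
filtered = filter_codon_alignment(toy)
print(sorted(filtered), filtered["s0"])
```

In this toy example the final column is 90% gaps and is removed, and the sequence ‘s9’ retains only one of six positions and is dropped.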
Gene expression analyses
HEB was assessed with the likelihood-ratio tests described in ref. 23, by analysis of the anther, root, and leaf transcriptome data. This test consists of a set of three nested hypotheses. The null hypothesis, H0, is that the homoeologs are expressed at equal levels after normalization for gene length and sequencing depth. The first alternative hypothesis, H1, is that one of the homoeologs is more highly expressed in all tissues, such that the difference can be explained by a single scaling factor. The second alternative hypothesis, H2, is that the homoeologs are expressed unequally and inconsistently across the three tissues. Homoeolog pairs for which H0 can be rejected for H1, but H1 cannot be rejected for H2, are therefore cases in which one of the homoeologs appears to be up- or downregulated consistently throughout the organism. For the first test, the Benjamini–Hochberg118 correction for multiple testing was applied. For the second test, because the question was whether a hypothesis could not be rejected, no correction was made. Both tests used a 1% significance level. Pairwise genomic alignments, described above, were used to identify homoeologs for each of the subgenomes, retained duplicate genes from tandem duplications, and orthologous genes to A. thaliana111, on the basis of ortholog assignments in F. vesca37. The complete list of Fragaria–Arabidopsis orthologs was then filtered to genes with functional data in the AraGEM Arabidopsis metabolic network72,119 and the STRING global protein interaction network120. These gene lists were used to investigate subgenome- and pathway-level-specific expression in fruit with an available transcriptome dataset in NCBI-SRA (BioProject PRJNA394190) (Supplementary Dataset 2).
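The decision rule for calling consistent homoeolog expression bias can be sketched as follows, assuming the P values from the two likelihood-ratio tests have already been computed (the tests themselves are described in ref. 23 and are not reproduced here). The function names and example P values are hypothetical. The first test is adjusted with the standard Benjamini–Hochberg step-up procedure; the second is left uncorrected, as stated above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def consistent_bias_calls(p_h0_vs_h1, p_h1_vs_h2, alpha=0.01):
    """True where H0 is rejected for H1 (BH-adjusted) and H1 is not rejected
    for H2 (uncorrected), both at significance level alpha."""
    adjusted = benjamini_hochberg(p_h0_vs_h1)
    return [a < alpha and p2 >= alpha for a, p2 in zip(adjusted, p_h1_vs_h2)]


# Hypothetical P values for three homoeolog pairs.
calls = consistent_bias_calls([0.0001, 0.2, 0.0002], [0.5, 0.5, 0.001])
print(calls)
```

Only the first pair is called in this example: the second fails the first test, and the third shows evidence of tissue-inconsistent expression in the second test.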
Analysis of disease-resistance-gene families
NBS-LRR genes were detected with HMMER v 3.1 (ref. 121) with default settings, by searching the protein sequences of the Fragaria × ananassa genome against the raw hidden Markov model for the NB-ARC-domain family downloaded from Pfam (family ID PF00931)122. Only genes identified by both HMMER and BLAST were used for subsequent analysis. TIR subdomains were detected with PfamScan on default settings by searching the identified NB-ARC genes against the Pfam-A hidden Markov model. The 423 Fxa NB-ARC-domain-containing proteins were batch-searched in the NCBI Conserved Domain Database (see URLs)123 and Pfam database (see URLs). Results from the CD database were used to assign the gene models that contained CC, TIR, RPW8, or ‘other’ (none of the three established N-terminal domains); gene models were further mapped onto the assembled octoploid genome to assign positions (Supplementary Fig. 12). The CD results were then filtered to remove established R-gene domains (CC, TIR, RPW8, LRR, and NB-ARC), thus resulting in a list of potential integrated domains (Supplementary Dataset 1). Eight Fxa proteins with predicted Sec7/ADP-ribosylation-factor and G-nucleotide-exchange-factor domains were aligned by ClustalW and FastME 2.0 (ref. 124), and their illustrated domain organization is displayed in Supplementary Fig. 13. The full protein sequences of the 423 Fxa NB-ARC-domain-containing proteins were aligned with MUSCLE v 3.8.31 (ref. 125) under default settings. This alignment was trimmed with trimAl v 1.4.rev22 build 2015-05-21 (ref. 126) under default settings. An unrooted maximum-likelihood tree was constructed with RAxML v 8.2.11 (ref. 127) with the PROTGAMMA substitution model. The tree was visualized with the APE package v 4.1 (ref. 128) in R v 3.3.3 (ref. 129) (see URLs).
The comparison of homoeolog-expression abundance between the dominant subgenome and the three submissive subgenomes was carried out with a likelihood-ratio test and combined with Benjamini–Hochberg correction for multiple testing with a 1% significance level. The Kolmogorov–Smirnov test was used to determine which subgenome had the lowest-overall TE densities near genes. The χ2 test, with three degrees of freedom, was used to analyze the subgenome bias of disease-resistance genes. Bootstrapping, with 500 replicates under the GTR + gamma evolutionary model, was used to assess node support in trees generated by phylogenetic analyses.
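As a worked illustration of a χ2 test with three degrees of freedom across four subgenomes, the Python sketch below computes a goodness-of-fit statistic and its P value. The counts are hypothetical and are not the study's data, and the choice of expected counts (equal across subgenomes here) is an assumption, because the text does not state how expectations were derived. The survival function uses the closed form that holds specifically for three degrees of freedom, which avoids any dependency outside the standard library.

```python
import math


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square distribution with 3 d.f.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Hypothetical counts of resistance genes in four subgenomes.
observed = [40, 20, 20, 20]
expected = [sum(observed) / 4] * 4  # assumption: equal expectation per subgenome
statistic = chi_square_statistic(observed, expected)
p_value = chi2_sf_df3(statistic)
print(statistic, p_value)
```

If expectations were instead scaled to the total gene content of each subgenome, only the `expected` list would change; the statistic and degrees of freedom are computed the same way.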
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
The genome assembly, annotation files, alignments, and phylogenetic trees are available on Dryad (see URLs). Custom software for running PhyDS phylogenetic analyses is available on GitHub (see URLs). The genome assembly and annotation files are also available on the Genome Database for Rosaceae (GDR; see URLs) and the CyVerse CoGe platform (see URLs). ‘Camarosa’ clones are available from most strawberry nurseries. The raw sequence data are available in the Sequence Read Archive under NCBI BioProject PRJNA508389 (see URLs).
This work was supported by Michigan State University AgBioResearch to P.P.E., USDA-NIFA HATCH 1009804 to P.P.E., NSF-DEB 1737898 to P.P.E., the Plant Genomics at MSU REU program funded by NSF-DBI 1757043 (S.J.T. as a participant), USDA-NIFA SCRI 2017-51181-26833 to S.J.K., the California Strawberry Commission to S.J.K., and the University of California to S.J.K.