Super-pangenome analyses highlight genomic diversity and structural variation across wild and cultivated tomato species

Effective utilization of wild relatives is key to overcoming challenges in genetic improvement of cultivated tomato, which has a narrow genetic basis; however, current efforts to decipher high-quality genomes for tomato wild species are insufficient. Here, we report chromosome-scale tomato genomes from nine wild species and two cultivated accessions, representative of Solanum section Lycopersicon, the tomato clade. Together with two previously released genomes, we elucidate the phylogeny of Lycopersicon and construct a section-wide gene repertoire. We reveal the landscape of structural variants and provide entry to the genomic diversity among tomato wild relatives, enabling the discovery of a wild tomato gene with the potential to increase yields of modern cultivated tomatoes. Construction of a graph-based genome enables structural-variant-based genome-wide association studies, identifying numerous signals associated with tomato flavor-related traits and fruit metabolites. The tomato super-pangenome resources will expedite biological studies and breeding of this globally important crop.

Effective utilization of wild relatives is key to overcoming challenges in genetic improvement of cultivated tomato, which has a narrow genetic basis; however, current efforts to decipher high-quality genomes for tomato wild species are insufficient. Here, we report chromosome-scale tomato genomes from nine wild species and two cultivated accessions, representative of Solanum section Lycopersicon, the tomato clade. Together with two previously released genomes, we elucidate the phylogeny of Lycopersicon and construct a section-wide gene repertoire. We reveal the landscape of structural variants and provide entry to the genomic diversity among tomato wild relatives, enabling the discovery of a wild tomato gene with the potential to increase yields of modern cultivated tomatoes. Construction of a graph-based genome enables structural-variant-based genome-wide association studies, identifying numerous signals associated with tomato flavor-related traits and fruit metabolites. The tomato super-pangenome resources will expedite biological studies and breeding of this globally important crop.
Tomato (Solanum lycopersicum L.) is among the most important vegetable crops in terms of global production (http://www.fao.org/faostat/ en/#data/QCL), also serving as a classic model system for genetic, developmental and physiological studies of fleshy fruits 1 . It belongs to the genus Solanum in the nightshade family Solanaceae. Cultivated tomatoes have lost substantial genetic diversity owing to a domestication bottleneck and intensive artificial selection in pursuit of bigger fruits and higher yield 2 , which has impeded tomato improvement.
Article https://doi.org/10.1038/s41588-023-01340-y and utilization of genetic variants in tomato wild relatives. Recently, it was highlighted that a super-pangenome that includes genomic information of many diverse species, especially wild relatives within a genus, could expedite crop improvement 23 . Therefore, it is necessary to assemble additional reference genomes for tomato wild relatives to accelerate biological studies and genetic improvement in tomato. In this study, we construct a section-wide super-pangenome by de novo assembling 11 chromosome-level genomes from ten tomato species, representing major clades of tomato wild relatives and their cultivated counterparts in Lycopersicon. Comparative analyses reveal the panorama of genomic content, evolutionary history and structural variation across tomato species, empowering the discovery of a wild tomato gene that has the potential to increase yield in modern cultivated tomatoes. These results will provide insight for the construction and exploitation of super-pangenomes in other crop species.

Eleven wild and cultivated tomato reference genomes
To represent the diversity of wild and cultivated tomato species, we selected nine wild tomatoes (eight species from Solanum section Lycopersicon: S. habrochaites, Solanum chilense, Solanum peruvianum, Solanum corneliomulleri, Solanum neorickii, Solanum chmielewskii, S. pimpinellifolium and S. galapagense; and one from Solanum section Lycopersicoides: S. lycopersicoides) and two diverse domesticated tomatoes (S. lycopersicum var. cerasiforme and S. lycopersicum var. lycopersicum cv. M82; Table 1). We assembled a high-quality chromosome-scale reference genome of wild tomato S. galapagense 'LA0436', using a hybrid assembly approach integrating Pacific Biosciences (PacBio) sequencing, optical genome mapping (Bionano Genomics) and high-throughput chromosome conformation capture (Hi-C; Supplementary Note and Supplementary Tables 1-5). The 802-Mb final assembly had a contig N50 length of 15.5 Mb, and more than 99.5% of sequences in the final assembly were anchored to the 12 chromosomes, higher than the corresponding percentages for the three existing reference genomes 'LA2093' (99.0%), 'Heinz 1706' (97.5%) and 'LA716' (93.6%) ( Table 1). The ten other tomato genomes were also assembled at chromosome level using the above-mentioned strategy, except that Bionano data were not generated. These By contrast, wild tomatoes in Solanum section Lycopersicon, which have adapted to various ecological environments in western South America including the Galapagos islands, from offshore to 3,600 m above sea level 3 , exhibit broad genetic and phenotypic diversity [4][5][6] . These wild species represent a rich source of allelic variation and harbor genes underlying biotic and abiotic stress tolerance, as well as consumer-preferred traits such as high levels of soluble solid content, lycopene and flavor compounds 2,7 . Hence, effective harnessing of natural diversity from these wild germplasms is essential to facilitate tomato genetic improvement.
The availability of the tomato reference genome (S. lycopersicum var. lycopersicum cv. Heinz 1706) 8,9 has enabled comprehensive characterization of genetic diversity in terms of SNPs and small insertion/ deletions (indels) by resequencing numerous accessions, revealing the domestication and wild introgression history of tomato 10,11 . Despite this, increasing numbers of studies have indicated that large structural variants (SVs), such as presence/absence variants and copy number variants (CNVs), also have vital roles in plant adaptive evolution and functional diversity 12,13 . However, conventional strategies based on a single linear reference genome can only capture a portion of genetic diversity, resulting in strong reference biases, and accurate detection of SVs is still challenging using merely short-read resequencing approaches.
To overcome these limitations, pangenomics, as applied in human and many plant species, has emerged as a promising approach to capture the nearly full spectrum of genetic diversity of crops and their wild relatives 14,15 . Recent genomic advances in tomato include a pangenome of 725 tomato accessions constructed using short reads 13 , a pan-SV map built from Oxford Nanopore long reads of 100 diverse tomato lines 12 and a graph pangenome integrating variant information from 838 tomato genomes 16 . These studies suggest that SVs contribute to phenotypic variance and can be powerful when utilized in genome-wide association studies (GWAS). However, most of the accessions sampled in the studies were domesticated tomatoes and their closely related progenitor species Solanum pimpinellifolium. Currently, genome assemblies for five wild tomato species, Solanum habrochaites 11,17 , Solanum pennellii 11,18,19 , Solanum galapagense 17 , S. pimpinellifolium 20,21 and Solanum lycopersicoides 22 , are available, with different assembly approaches and qualities, which impedes the characterization  Tables 8-10). We combined ab initio prediction, homology search and transcriptome mapping approaches for protein-coding gene prediction (Methods), resulting in gene numbers ranging from 31,613 (S. chmielewskii) to 34,375 (S. chilense), similar to that of Heinz 1706 (35,768) but fewer than that of LA716 (44,965) (Table 1). A total of 81.7% to 89.5% of exons of the predicted genes were supported by transcript data, suggesting the high quality of gene predictions. All assembled genome sequences and annotations are publicly accessible through a web-based database (http://caastomato.biocloud.net).
Eukaryotic genomes are rich in transposable elements (TEs), which shape genome evolution through expansions, eliminations and transpositions 25 . The TE contents of the 11 tomato genomes ranged from 64.3% to 74.5%, with long terminal repeat retrotransposons (LTR-RTs) representing the most abundant class of TE (Fig. 1a). A higher abundance of Gypsy LTR-RTs was found in S. lycopersicoides, which possibly contributed to it having the largest assembled genome size (1.2 Gb) among the tomato species 22 (Fig. 1a). To trace the evolutionary history of the expanded TEs in S. lycopersicoides, we estimated insertion times of 162,216 intact LTR-RTs and detected a lineage-specific burst of Gypsy LTR-RTs occurring c. 2 million years ago (Ma) in S. lycopersicoides, after its divergence from potato, probably leading to its large extant genome ( Supplementary Fig. 3). Notably, we observed recent amplification of Gypsy and Copia LTR-RTs in four wild tomato species (S. lycopersicoides, S. corneliomulleri, S. peruvianum and S. chilense; Supplementary Fig. 3), implying that these wild species may have increasing degrees of genomic diversity and environmental adaptability compared with cultivated tomatoes. These results provide insight into the role of TEs in genome evolution of the Solanum genus.

Phylogeny of Lycopersicon and neighboring species
Reconstructing the phylogeny of Lycopersicon species has been problematic owing to the conflict between gene trees and morphological trees, especially for the wild tomato clade 3 . The phylogenetic relationship between S. pennellii and other tomatoes remains unresolved 6 , owing largely to limited available genomic data, despite S. pennellii being considered to be a unique group based on morphological classification. Using 9,343 single-copy orthologous genes, we inferred the phylogeny of ten wild and three domesticated tomatoes, using potato (Solanum tuberosum) as an outgroup; the results indicated that section Lycopersicoides (including S. lycopersicoides) was sister to section Lycopersicon (Fig. 1b), consistent with previous research 3 . Based on the phylogeny, we resolved the polytomy issue in Lycopersicon and unambiguously classified Lycopersicon species into four main clades. Clade I encompassed two species, S. pennellii and S. habrochaites, which diverged from the common ancestor of the other wild and cultivated tomatoes (except S. lycopersicoides) c. 1.97 Ma. Clade IV, which comprised domesticated tomatoes and two closely related wild species (S. galapagense and S. pimpinellifolium), divided from the ancestor of clade III (S. neorickii and S. chmielewskii) approximately 1. 73 Ma. Similar to a recent study of Oryza genus evolution 26 , a few conflicts were observed between the phylogeny constructed using genes from one chromosome and that built using whole-genome genes ( Fig. 1b and Supplementary Fig. 4). For example, within Lycopersicon, phylogenetic analyses using genes from chromosomes 1, 2, 9 and 11 showed that S. pennellii was sister to other wild and cultivated tomato species, rather than clustering into a monophyletic group with S. habrochaites as inferred from the genome-wide phylogeny ( Supplementary Fig. 4), suggesting possible incomplete lineage sorting and/or hybridization events. These results enhance our understanding of the evolutionary history within Solanum section Lycopersicon.

Super-pangenome of tomato
Although pangenomes for cultivated tomato and its close wild relatives have been reported 13   Article https://doi.org/10.1038/s41588-023-01340-y three Solanum species 13 to a super-pangenome covering 11 species in the Solanum genus. We defined 40,457 pangene families by clustering protein-coding genes of the 11 chromosome-scale genomes assembled herein and two previously released genomes 8,18 ; this number of gene families was higher than that of the Oryza genus 26 but lower than that of soybean 27 . The number of gene families increased rapidly when including more genomes, suggesting that the 13 genomes are diverse and that a single reference genome cannot capture the full genetic diversity in tomato (Fig. 2a). Only 54.0% of gene families were conserved among the 13 tomato genomes (core gene families), and the number of core genes (23,839) was lower than that of the previously reported pangenome of 519 cultivated tomatoes and 67 closely related wild tomato accessions belonging to S. pimpinellifolium and S. galapagense (29,938) 13 , owing largely to the higher level of divergence among the wild tomato species used in this study. The dispensable gene families (present in two to 12 accessions) occupied 38.4% of gene families, and 7.6% of pangene families were categorized as accession-specific. Gene ontology (GO) enrichment analysis showed that core genes were enriched for biological processes including carboxylic acid, lipid or organic substance metabolic process, RNA modification or processing and amide transport, consistent with the results of a previous study 13 (Supplementary Tables 11 and 12), whereas the dispensable genes were enriched for terpenoid biosynthesis, telomere maintenance, mitochondrial electron transport and photosynthesis (Supplementary Fig. 6). Expression levels of core genes were significantly higher than those of dispensable genes at different fruit ripening stages (P < 2.2 × 10 −16 , Wilcoxon rank-sum test; Supplementary Fig. 7  Article https://doi.org/10.1038/s41588-023-01340-y previous tomato pangenome 13 were captured in our super-pangenome (Supplementary Table 13), and we also identified 9,320 nonredundant genes absent from the reported tomato pangenome 13 (Supplementary  Note and Supplementary Table 14), indicating the rich diversity of the 13 wild and domesticated tomatoes. This super-pangenome dataset lays a foundation for exploration and exploiting of genes or alleles in wild tomato species.

Extensive variation among wild and cultivated tomatoes
Despite efforts to characterize genetic variants among cultivated tomatoes and their proposed progenitor species S. pimpinellifolium 12,13,16 , the genetic diversity among distantly related wild tomato species, for example, S. peruvianum, S. habrochaites and S. chilense, remains poorly explored. We identified 2.0-8.1 million SNPs and 0.6-1.5 million small indels (≤50 base pairs (bp) in size) in the 12 tomato genomes, relative to the reference S. galapagense genome. The total number of SNPs and small indels (42.4 M) was much higher than that of each accession (Supplementary Tables 15 and 16 Fig. 3). The majority of insertions, deletions and CNVs were shorter than 2 kb, 2 kb and 8 kb, respectively, and most of the translocations had lengths shorter than 20 kb, whereas some inversions were longer than 300 kb ( Supplementary Fig. 17). We found that insertions and deletions were more likely to be found at both ends of the chromosomes, consistent with previous studies 12,20 , whereas inversions and translocations were randomly distributed along the 12 chromosomes (Fig. 2c). SVs were more likely to occur at repeat regions than nonrepeat genomic regions (Student's t test, P = 1.03 × 10 −4 ). We further identified 5,186 large indels (>50 bp) fixed either in all wild or all domesticated tomato genomes investigated in this study, some of which led to insertions of protein-coding genes present only in certain wild tomato genomes (Supplementary Table 19 and Fig. 2d). Further functional characterization of these variants may enable a better understanding of the genetic basis of phenotypic divergence between domesticated tomatoes and their wild relatives. Previous studies have identified several SVs responsible for phenotypic variation, including a 1.4-kb deletion in the CSR gene resulting in increased fruit weight 28 , a 7.1-kb deletion in the LNK2 locus responsible for a light-conditional clock deceleration 29 , an 85-bp deletion in the promoter of ENO that regulates floral meristem activity 30 and a CNV affecting NSGT associated with biosynthesis of a fruit flavor volatile guaiacol 12 . These SVs were all accurately detected in this study (Supplementary Figs. 18-21), indicating the broad diversity of our collection. Two different alleles (4,724 bp and 4,151 bp) have been identified at 149 bp upstream of TomLoxC, a gene encoding a 13-lipoxygenase; the 4,151-bp allele was reported to contribute to desirable fruit flavor and is rare in cultivated tomatoes 13 . We found that S. pennellii, S. habrochaites, S. chilense and S. neorickii carried the 4,151-bp allele upstream of TomLoxC ( Supplementary Fig. 22), suggesting that these wild species have the potential to improve fruit flavor in cultivated tomato by backcrossing. The extensive variation among wild and cultivated tomato species presented herein provides access for further harnessing of the genetic diversity of distantly related wild tomato species in genomic-based breeding.

Hidden genetic diversity of tomato wild species
Large inversions have been reported to suppress recombination by reducing crossing-over 31,32 , resulting in severe linkage drag when conducting backcross breeding. To overcome this, it is necessary to choose donor lines without inverted segments harboring targeted genes. However, a holistic view of genome-wide inversions is not available, owing to the lack of chromosome-scale wild tomato genomes. Based on the 11 high-quality tomato genomes, we identified 12 (S. lycopersicum var. lycopersicum cv. Heinz 1706) to 42 (S. chmielewskii) megabase-scale inversions compared with the S. galapagense genome (Supplementary Table 20). Notably, a 7.1-Mb inversion on chromosome 3, carrying 55 genes, was present in all clade IV tomato accessions compared with other wild species (except S. pennellii) and was supported by clear chromatin interactions around the breakpoints when Hi-C reads of S. neorickii and S. chmielewskii were mapped to the S. galapagense genome (Fig. 2e). This inversion might occur after the divergence between species from clade IV and other clades. Given that S. pennellii does not carry this inversion within this region, this wild tomato species would be an ideal donor parent to introduce possibly favored genes within this 7.1-Mb segments into elite cultivars by backcrossing.
Previous research reported a tomato pan-SV map, which was built by long-read sequencing of 100 cultivated and closely related wild tomato accessions 12 Supplementary Fig. 24a). The vast majority of SVs displayed relatively low frequencies (<0.25) in all four groups, and accessions from the wild group contained a higher proportion of SVs with presence frequency between 0 and 0.25 ( Supplementary  Fig. 24b). We observed that 8,094 SVs exhibited significant frequency changes between the wild and cultivated (SLL and SLC) groups (Fisher's exact test, false discovery rate (FDR) < 0.01; Supplementary Fig.  24c), affecting upstream regions and exons of 2,585 genes. Functional analyses indicated that these genes were mainly enriched for biological processes such as meristem development and ammonium transport ( Supplementary Fig. 24d). We further identified 388 highly divergent SVs between wild and cultivated tomatoes, which disrupted CDS of 328 genes by causing frameshift, loss of exons or in-frame insertions (Supplementary Table 22). These results suggest that SVs in these distantly related wild tomatoes have undergone distinct evolutionary trajectories compared with cultivated tomatoes and their progenitors. Our analyses also provide a candidate dataset for further characterizing genes underlying phenotypes with great divergence between wild and cultivated tomatoes.

A wild tomato cytochrome P450 gene that increases yield
A major goal of tomato breeding is to increase yield by developing varieties with larger fruit size and/or more effective shoot branches. Regulation of shoot architecture is thus of great interest to the tomato research community 33 . Wild tomato species usually display a markedly greater number of lateral fruit-bearing branches than their domesticated counterpart; however, whether we can introduce this trait into cultivated tomatoes, especially modern processing tomato varieties, remains elusive. Among the 388 highly divergent SVs between wild and cultivated tomatoes that greatly affected gene CDS, a 244-bp deletion, showing the second most significant frequency change (FDR = 1.43 × 10 −8 ; Supplementary Fig. 24c), was present in the first exon of Sgal12g015720 (Fig. 3a,b). This gene encodes a protein belonging to the cytochrome Article https://doi.org/10.1038/s41588-023-01340-y P450 superfamily, which has been reported to play important parts in plant growth, development and secondary metabolite biosynthesis 34 . The 244-bp deletion was found in 22.22% of the 19 wild accessions and 100% of cultivated tomatoes, which represented the derived state, as this deletion was absent from all the nine wild tomato species used in this study (Fig. 3a,b and Supplementary Fig. 25). Sgal12g015720 was expressed at the highest level in stems of the wild tomato S. pennellii (Fig. 3c), whereas its expression in two cultivated tomatoes could barely be detected (Supplementary Fig. 26). These results suggest that the 244-bp deletion event may have occurred during tomato domestication, which might lead to pseudogenization of Sgal12g015720 in cultivated tomato.
To investigate functions of this gene and its potential value for tomato breeding, we generated Sgal12g015720-overexpression (OE) transgenic lines under the background of a tomato cultivar 'Micro-Tom' (Supplementary Fig. 27). Compared with the wild-type (WT) plants, the transgenic lines possessed a greater number of lateral branches, resulting in a greater than twofold increase in total fruit number, whereas only slight reductions in single fruit weight, transverse diameter and longitudinal diameter were observed (red fruits; Fig. 3d-i). To further validate the function of Sgal12g015720, we screened previously reported introgression lines (ILs) 35 generated using wild tomato S. pennellii (LA716, donor parent) and cultivated accession M82 (recurrent parent). As expected, two ILs, IL12-2 and IL12-3, carrying a homozygous introgressed segment that harbors an Sgal12g015720 ortholog from the wild tomato donor, generated markedly more lateral branches and fruits compared with the recurrent parent M82 ( Supplementary  Fig. 28). Therefore, this gene represents a promising target for regulation of plant architecture as well as increasing yield or biomass in tomato breeding. These analyses also present an example of how the super-pangenome could facilitate tomato biological studies and breeding.

Graph-based genome enables SV-based GWAS in tomato
Numerous studies have suggested that SVs are causative variants responsible for agronomically important traits 12,13,36,37 . However, population-scale SV genotyping is challenging in plants, impeding the exploitation of SVs in identifying genotype-phenotype associations. Here, we constructed a tomato graph-based genome by integrating the linear reference genome sequence of S. galapagense and the 360,189 SVs identified from the 12 tomato genomes and the 100 previously reported tomato genomes 12 . Graph-based genomes are capable of storing both reference and alternative allele sequences while retaining the coordinate systems of the linear reference genome, which facilitates Article https://doi.org/10.1038/s41588-023-01340-y mapping of short reads from SV regions and thus SV genotyping 38,39 . We then genotyped these SVs in a tomato population comprising 321 accessions 2 and performed SV-based GWAS for 32 flavor-related compounds 2 and 362 fruit metabolites 40 . For comparison, we also called SNPs and indels from the 321 accessions and employed SNP-based GWAS. Significantly associated signals were detected for 17 flavor volatiles and 249 fruit metabolites. Surprisingly, we observed that only 5.2% (161) of peaks (quantitative trait loci) overlapped (800-kb flanking region) between SV-based and SNP-based GWAS results, and 21.3% (658) could only be identified by SVs. The remaining 2,263 (73.4%) were exclusively detected by SNPs (Fig. 4a,b and Supplementary Table  23). Examples included a peak at 65.2 Mb on chromosome 10 that could only be detected using SVs, which was strongly associated with the content of geranylacetone (P = 7.91 × 10 −9 ), one of the important tomato flavor volatiles contributing a leafy flavor to fruits (Fig. 4c and Supplementary Fig. 29). The leading SV was a 347-bp deletion, and the content of geranylacetone in tomato fruits significantly differed between accessions carrying the reference allele and those carrying the alternative allele (Student's t test, P = 3.7 × 10 −8 , Fig. 4c). Similarly, we detected significantly associated SVs for the content of additional metabolites (Fig. 4d-f, Supplementary Figs. Table 23). Tomato accessions carrying alleles of corresponding leading SVs showed significantly increased content of these metabolites (Fig. 4d-f). This identification of SVs exhibiting significant associations with important tomato fruit flavor compounds and metabolites will pave the way for further fine mapping and isolation of putative candidate genes. Our SV-based GWAS provide an important complement to the conventional SNP-based GWAS, which will be helpful to develop markers for breeding flavor-improved tomato cultivars.

Discussion
Domestication of tomato has led to a substantial loss of genetic diversity in modern varieties due to the bottleneck and successive rounds of artificial selection; therefore, the rich diversity of wild tomato Article https://doi.org/10.1038/s41588-023-01340-y species contains valuable breeding materials. However, the availability of only a few wild tomato genomes has hampered the exploration and utilization of alleles and gene repertoire in those wild species. The chromosome-scale reference genomes for nine wild tomato species presented here offer valuable resources for not only comparative genomics but also biological studies and molecular breeding in tomato. Notwithstanding, our dataset still lacks three wild tomato species in Solanum section Lycopersicon (Solanum cheesmaniae, Solanum huaylasense and Solanum arcanum). S. cheesmaniae is endemic to the Galápagos island with yellow to orange fruits 41 , whereas S. huaylasense and S. arcanum are wild tomatoes segregated from S. peruvianum 42 . Development of their genome sequences and annotation will further enrich our understanding of the biodiversity and evolutionary trajectory within Lycopersicon.
Although pangenomes for numerous crops have been reported, most of them incorporated one or a few species 43 . Here, we constructed a super-pangenome by analyzing 11 distinct tomato species, representative of major wild and cultivated tomato clades. Coupling this with an existing dataset 12 , we identified a wild tomato gene that could increase fruit yield by an average of 67.1% in OE transgenic lines ( Fig. 3d-f). As both OE lines and ILs carrying this gene had higher numbers of fruit-bearing branches ( Fig. 3d and Supplementary Fig. 27), we anticipate its use in modern processing tomatoes. According to tomato population resequencing data, this gene was predominantly found in wild tomato accessions (52% of S. pimpinellifolium, 80% of S. cheesmaniae and 100% of S. galapagense), in contrast to a mere 6% and 19% in cultivated tomato forms S. lycopersicum var. lycopersicum and S. lycopersicum var. cerasiforme, respectively (Supplementary Table 24). These results indicate that this gene, although potentially important, has not been widely utilized in tomato breeding programs. Backcrossing would be an ideal approach to introduce this gene into cultivated tomatoes from wild species. However, hybridization between wild and cultivated crops may lead to severe repression of genetic recombination, owing largely to the presence of large-scale genomic divergence, such as large inversions 10,31 . This may ultimately result in the introduction of exotic genomic fragments carrying unfavorable alleles that are hard to purge 40 . We did not observe chromosomal rearrangements between the genome of Heinz 1706 and those of eight out of the nine wild species surrounding this gene (Supplementary Table 25), suggesting that introgression of this gene by backcrossing, when the donor parent is properly selected, would be less likely to cause linkage drag.
To facilitate the utilization of genetic diversity from our super-pangenome, we constructed a graph-based genome reference for wild and cultivated tomatoes by integrating SV information for 112 tomatoes from 11 Solanum species into the linear reference sequence, offering a powerful platform for population-level SV genotyping. As previous research has suggested that SVs are more likely to be causal variants in tomato 16 , further studies could use this graph-based genome to perform SV-based association analyses to identify additional signals responsible for agronomically important traits. However, the current graph-based tomato genome is only capable of storing certain types of SVs: insertions, deletions and inversions. Other SVs of relatively high complexity, such as inverted duplications and translocations, cannot yet be integrated. Furthermore, SVs with multiple alleles are not represented in the graph, as downstream analytic pipelines can only handle biallelic variants. It is possible that an insertion with distinct inserted fragments in various individuals contributes to different phenotypic outcome. We anticipate further implementation of relevant tools and algorithms that could tackle these issues. This research will accelerate comparative genomics and biological studies in tomato and shed light on the utilization of super-pangenomes in crop improvement.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41588-023-01340-y.

De novo genome assembly
Methods for library construction and sequencing are provided in the Supplementary Note. Contig-level assemblies for the 11 representative accessions were conducted using a pipeline based on Canu (v.1.5) 16,44 with the following procedures: longer seed reads were selected with the settings corOutCoverage = 35; raw read overlapping was detected using a highly sensitive overlapper MHAP 45 (v.2.1.2, parameter corMhapSensitivity = normal), and error correction was performed using the Falcon 46 sense method (option correctedErrorRate = 0.025); error-corrected reads were trimmed of unsupported bases and hairpin adapters to reach their longest supporting range with default parameters, and the draft assemblies were then generated using the top 80% longest trimmed reads. Finally, to ensure base accuracy of assembly results from SMRT molecules, we further polished the consensus genome sequences based on Illumina paired-end reads using Pilon 47 (v.1.22) with parameter: -mindepth 10-fix bases.

Scaffolding using Bionano optical maps
For S. galapagense, we constructed Bionano optical maps. Young leaves were collected after two days of dark treatment. High-molecular-weight DNA was isolated and labeled with the restriction endonuclease Nb.BssSI, and labeled DNA was imaged with a Bionano Irys system. Molecules with lengths >150 kb, label SNR ≥3.0 and average molecule intensity <0.6 were retained for scaffolding. These molecules were de novo assembled into genome maps using IrysSolve v.3.5_12162019 (https://bionanogenomics.com/support/software-downloads/). Pairwise comparison was first performed with RefAligner (https://bionanogenomics.com/support/software-downloads/) to identify overlaps among these molecules, and consensus maps were constructed. All molecules were then mapped back to the consensus maps and recursively refined and extended. The Bionano IrysSolve module 'HybridScaffold' was used to perform hybrid assembly between the assembled contigs and genome maps. Assembled contigs were first converted into cmap format and then aligned to the contig cmaps with RefAligner, followed by label rescaling. The rescaled Bionano cmaps were aligned again to the contig cmaps, and sequences were split at the conflict points. Finally, scaffolds were built according to the alignment information. To improve the contiguity of assembly results, PBJelly 48 (v.15.8.24) was used to fill gaps using the error-corrected PacBio reads.

Pseudomolecule construction
The Hi-C data were mapped to the assemblies using BWA 49 (v.0.7.10-r789) with default parameters. Only uniquely aligned read pairs with mapping quality >20 were retained for further analysis. Duplicate removal, sorting and quality assessment were performed using HiC-Pro 50 (v.2.8.1) with default parameters. Only valid interaction pairs of Hi-C reads were fed into LACHESIS (v.1.0) 51 for chromosome-scale scaffold construction. Briefly, contigs or scaffolds for each tomato assembly were broken into fragments with a length of 200 kb and then clustered using valid interaction read pairs by LACHESIS with the following parameters: 'CLUSTER_MIN_RE_SITES = 22, CLUSTER_MAX_ LINK_DENSITY = 2, CLUSTER_NONINFORMATIVE_RATIO = 2, ORDER_ MIN_N_RES_IN_TRUN = 10, ORDER_MIN_N_RES_IN_SHREDS = 10'. We manually checked the Hi-C interaction heat maps to identify potential genomic regions containing assembled haplotigs due to heterozygosity, which were then excluded from the assembly. The manual curation step was reperformed several times, until the chromatin interaction signals reflecting putative haplotigs were undetectable.

Evaluation of genome assemblies
Completeness of the assembled tomato genomes was first assessed using BUSCO 24 (v.5.2.0) based on the embryophyta_odb9 database. We also assessed the mapping proportions of transcripts assembled with Trinity (v.2.8.5) 52 to corresponding genome assemblies using BLASTN (v.2.12.0+) 53 with minimum alignment length of 300 bp and sequence identity >95%. These assemblies were also evaluated by mapping the Illumina short reads using BWA (default parameters).

Repeat sequence annotation
Both homology-based and de novo strategies were applied to identify repetitive sequences for all the tomato genomes. Four de novo prediction programs were applied: RepeatScout 54 (v.1.0.5), LTR-FINDER 55 (v.1.05), MITE-hunter (v.1.0) 56 and PILER-DF 57 (v.1.0). Results from these four programs were integrated into a repetitive sequence database, which was then merged with Repbase 58 (v. 19.06) and classified into different categories by the PASTEClassifier.py script included in REPET 59 (v.2.5). Using this repeat database, repetitive sequences were identified by homolog searching using RepeatMasker 60 (v.4.0.5) with default parameters. We computed the genetic distance (K) between both ends of an intact LTR-RT using the distmat (default parameters) program in the EMBOSS package (v.6.6.0) 61 , and the insertion time (T) of each intact LTR-RT was estimated using the formula T = K/2μ, where μ is the base substitution rate of 1.3 × 10 −8 (ref. 62).

Gene prediction and functional annotation
De novo, homology-based and transcriptome-based strategies were used to predict protein-coding genes for all tomato genomes assembled in this study. Predicted proteins from four plant genomes (Arabidopsis thaliana, Oryza sativa, S. lycopersicum and S. tuberosum) were used to perform homology-based prediction with GeMoMa 63  . We used AUGUSTUS with parameters trained by unigenes, which were assembled from pooled transcriptome data. As for the third approach, transcriptome data generated from pooled tissues of leaves, stems and roots were assembled using HISAT2 (ref. 65

Phylogenetic tree construction and divergence time estimation
We selected S. tuberosum as the outgroup to infer species phylogeny. Single-copy orthologous genes were identified using quota-alignment (v.1.0) 77 with parameters '-merge-format=raw-Dm 30-Nm 40'. A total of 9,343 orthologous groups were identified among the 14 genomes. Protein sequences of the 9,343 single-copy orthologous genes were aligned using MUSCLE 78 (v.3.8.31) and the alignments were Article https://doi.org/10.1038/s41588-023-01340-y then concatenated. We constructed a phylogenetic tree using phyML (v.3.3.20190909) 79 with parameters '-model JTT -f e -v 0.576 -a 0.886nclasses 4-search SPR -t e'. The divergence time was estimated using the MCMCtree program in the PAML package 80 81 were used to constrain the divergence time.

Analyses of the super-pangenome
To identify homologous relationships among the genomes of 11 tomatoes assembled in this study, S. lycopersicum var. lycopersicum cv. Heinz 1706 and S. pennellii, the longest transcript of each predicted gene in each genome was chosen as a representative. To handle unannotated genes, a common issue during gene prediction, we aligned coding sequences of all predicted genes to each of the 13 tomato genomes using GMAP (v.2015-06-12) 82 . If a gene showed more than 80% alignment coverage and identity, and no gene was predicted within the aligned regions, it was considered to be an unannotated gene in the corresponding genome and was not regarded as 'missing' in the further analysis. An all-against-all comparison was then performed using BLASTP 53 (E-value <1 × 10 −5 ), followed by clustering using OrthoFinder (v.2.5.2) 83 with default parameters. Based on the clustering results, we extracted gene families that were shared among all samples; these were defined as core gene families. Genes that were absent from two or more samples were defined as dispensable gene families, whereas those only present in one individual were considered to be specific gene families. Clade-specific gene families were defined as those exclusively present in one of the four clades of wild and cultivated tomatoes. Enrichment analysis with respect to GO terms was performed using the 'topGO' R package (https://bioconductor.org/ packages/topGO). Details of the methods used for comparison of the super-pangenome and the previously reported tomato pangenome are provided in the Supplementary Note.

Identification of genetic variants
We performed pairwise genome alignments between each of the 12 genomes and the S. galapagense reference genome using the nucmer program in MUMmer 84 (v.4.0.0beta2) with default parameters. The resultant alignments were filtered to retain the one-to-one alignment blocks, and SNPs and indels (<50 bp in length) were identified by the show-snps program within MUMmer with parameters '-Clr -x 1 -T'. For identification of SVs, two sets of SV calling results were generated using SVMU (v.0.4-alpha) 85 and SyRI (v.1.2) 86 , respectively, both using default parameters. For SV detection using SVMU, intergenomic alignments were performed using the nucmer program in MUMmer 84 (v.4.0.0beta2) with default parameters. The results were then parsed in SVMU to produce collinear blocks and insertions, deletions and CNVs. Insertions and deletions larger than 50 bp inside the syntenic alignment regions were also kept for further analysis. For SV calling using SyRI, minimap2 (v.2.21-r1071) 87 was used to generate pairwise genome alignments with parameters '-ax asm5-eqx'. The alignment results were subsequently passed to SyRI, and SVs consisting of insertions, deletions, inversions (<1 Mb in size) and translocations (>50 bp in length) were kept. SVs encompassing 'N' sequences were removed. SVs with ambitious alignment margins and/or poor synteny alignment surrounding the breakpoint were also filtered. We only kept CNVs from SVMU output by applying a filtering of length >50 bp and coverage of reference or coverage of query ≥2 or ≤0.5. Inversions that were >1 Mb were extracted from the results generated from SyRI, followed by a manual check. The identified SVs from each sample were then merged using SURVIVOR (v.1.0.6) 88 with the following parameters: '50 1 0 0 0 0'.

Analyses of presence frequency of SVs in tomato populations
Details of integrating SVs reported in the previous study are provided in the Supplementary Note. The 12 tomato genomes used in this study and the 100 previously reported tomato accessions 12 were divided into four groups: wild (19 accessions from S. galapagense, S. cheesmaniae, S. chmielewskii, S. neorickii, S. corneliomulleri, S. peruvianum, S. chilense, S. habrochaites and S. lycopersicoides), SP (22 S. pimpinellifolium accessions), SLC (24 S. lycopersicum var. cerasiforme accessions) and SLL (47 big-fruited S. lycopersicum var. lycopersicum accessions). We computed presence frequencies of each SV in the four groups and compared those between the wild and cultivated (SLC and SLL) groups using Fisher's exact test. The resultant P-values were next adjusted using the p.adjust function in R (v.4.03), with the 'method = 'fdr'' parameter. SVs with FDR < 0.01 were regarded as highly divergent between wild and cultivated tomatoes, showing significantly altered presence frequencies between the two groups.

Functional characterization of the candidate gene
To generate an overexpression construct, the full-length ORF sequence of the candidate gene Sgal12g015720 was amplified from S. galapagense using specific primers (Supplementary Table 26) and cloned into the plant expression vector pCAMBIA1300 by seamless cloning. Micro-Tom (S. lycopersicum var. lycopersicum) was transformed with the overexpressing transgene using Agrobacterium tumefaciens (strain GV3101)-mediated cotyledon transformation.

Quantitative real-time PCR
Total RNA was isolated from young fresh materials (roots, stems and leaves) of WT and transgenic tomato lines using the a Plant RNA Kit (catalog number DP432, Tiangen), and cDNA sequences were synthesized using 5X All-In-One Master Mix (with AccuRT Genomic DNA Removal Kit; catalog number G492, Applied Biological Materials Quidel) according to the manufacturer's instructions. Real-time quantitative PCR (rt-qPCR) was carried out using a LightCycler96 real-time PCR system. Detection of rt-qPCR product was performed by staining with ChamQ SYBR qPCR Master Mix (catalog number Q311-02/03, Vazyme Biotech Co.). Specific primers are listed in Supplementary Table 26. The relative amplification of the tomato actin gene was used for normalization. The amplification was performed using the following conditions: 95 °C for 2 min followed by 40 cycles of 95 °C for 5 s and 60 °C for 30 s. Three samples (biological replicates) of each treatment were duplicated (technical replicates) in the rt-qPCR experiment. The relative expression level of genes was quantified according to the R = 2 −ΔΔCt mathematical model. The final value of relative quantification was described as the fold change of gene expression in the test sample compared with the internal control (actin).

Graph-based tomato genome construction and SV genotyping
To integrate the linear reference genome and large-scale genomic variant information, we constructed a graph-based genome of tomato using vg (v.1.38.0) 38 . Reference sequences of S. galapagense and SVs in terms of insertions and deletions from the 12 tomato genomes (this study) and the 100 tomato genomes reported from a previous study 12 were built into a variation graph by the 'construct' subcommand in vg without removing any alternate alleles. The preliminary graph was indexed in XG and GBWT by using 'vg index' with the '-L' option to retain alternative allele paths. A GBWT index was then built using 'vg gbwt' with the parameter '-P'. Previously reported Illumina paired-end reads of 321 tomato accessions were subsequently mapped against the indexed graph, and alignments in GAM format were generated by vg giraffe 89 . We then excluded low-quality alignments with mapping quality less than 5 and base quality less than 5. Finally, a compressed coverage index was calculated using 'vg pack', and snarls were generated using 'vg snarls', both with default parameters. SV genotyping in the 321 tomato accessions were performed using 'vg call' (default parameters) by examining the state (including read pair and split read information) and coverage of mapped reads around the SV breakpoints. Genotyped SVs with fewer than two supporting reads were marked as 'missing'. Article https://doi.org/10.1038/s41588-023-01340-y

Genome-wide association studies
We selected the 321 tomato accessions that have been resequenced 2,10 for GWAS. A total of 43,901,591 SNPs were identified using the GATK (v.4.1.4.1) pipeline 90 with the S. galapagense genome as the reference. Population structure was calculated by principal component analysis in PLINK (v.1.9.0b4.6) 91 using 437,028 SNPs showing less linkage disequilibrium, which was extracted using PLINK with parameters '-indep-pairwise 50 5 0.1 (windows, step, r 2 )'. The first five principal components were used as cofactors for population structure correction.
A total of 32 tomato flavor-related metabolite traits reported previously 2 and contents of 362 annotated metabolites from tomato fruits reported previously 40 were analyzed using EMMAX (v.20120210) 92 with the default KN kinship, in which the selected principal components were used as cofactors. SNP-based and SV-based GWAS were performed using SNPs or SVs with minor allele frequency >0.01 and missing call rate <0.1. The genome-wide significance thresholds (7.58 × 10 −7 ) were determined using a uniform threshold of 1/n, where n is the effective number of independent SNPs and SVs calculated using the Genetic type 1 Error Calculator (v.0.2) 93

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
All assembled genome sequences and annotations are accessible through our database (http://caastomato.biocloud.net). Assemblies for the tomato genomes have also been deposited in the National Center for Biotechnology Information (NCBI) under BioProject accession number PRJNA809001. Raw PacBio, transcriptome and Hi-C sequencing reads have been deposited in the NCBI sequence read archive (https://www.ncbi.nlm.nih.gov/sra/) under BioProject accession number PRJNA756391. Tomato whole-genome sequencing data were downloaded from NCBI (BioProjects: PRJNA259308, PRJNA353161, PRJNA454805 and PRJEB5235). The RepBase database was downloaded from https://www.girinst.org/server/RepBase/index.php. Source data are provided with this paper.