Background & Summary

Potato is currently the third most important crop for human consumption with increasing production in the developing world and it is an important sustainable global food security crop for climate-smart agriculture1,2. Breeding for improved nutritious potato varieties suitable for sustainable production and climate change resilience will be aided by introgression of diverse genes from native Andean cultivars and landraces along with wild Solanum relatives of potato3. The genebank at the International Potato Center (CIP) and other genebanks globally, carry diverse germplasm used by breeding programs for genetic improvement of the crop4,5. The CIP genebank has potato cultivars, landraces, and wild Solanum germplasm collected from a wide range of environments in North and South America with variation in abiotic and biotic stress. Native Andean cultivars and landraces also carry beneficial nutrients and bioactive compounds for human health6,7.

A reference genome sequence for a doubled monoploid potato clone DM1-3 516 R44 (DM) was released in 2011 and later improved8,9,10. Low depth sequencing of potato varieties and wild species aligned against the DM reference was used to assess genome diversity, copy number variation (CNVs), introgressions, and selective sweeps11,12,13. Since then, the genomes of additional potato clones, diploid as well as tetraploid, have been sequenced and de novo assembled14,15,16,17,18,19,20.

Potato and its wild relatives are in the genus Solanum section Petota, which consists of over 100 species that form tubers21. Increasing knowledge on the genetic potential of species in section Petota will increase capacity for their use in enhancing potato germplasm. Genomes of several wild species are also available22,23,24,25 and a Petota super pangenome representing 60 species was recently released26. The current data set includes a phased genome assembly of the wild species S. okadae, for which short reads and the mitogenome were previously published26,27. In addition, the study contributes Illumina short read sequences (50–100x sequencing depth) and de novo assemblies (3,328–59,223 N50) for genomes of 16 wild species in section Petota that were noted for stress resistance and of interest to potato breeders. An S. etuberosum genome from outside section Petota was also included in this data set, as were three polyploid hybrids. For most of the species in this project, these are the first genome sequences released of their species. Taxonomic groupings of the species of section Petota into Clade 1 + 2, Clade 3 and Clade 4 were represented in the genomes sequenced.

Transcriptome sequence data was generated and used for annotation. Both nuclear and organellar genome sequences are provided. Comparative analysis of the presence/absence variation against the Petota super pangenome provided insight into the diversity of the species in the phylogenetic analysis.

Methods

Solanum okadae (OKA15) genome sequencing

Botanical S. okadae seed sourced from US Genebank NRSP-6 (PI 458367) were sown on agar media. Individual clones were propagated in vitro. Seventeen lines were planted in the greenhouse and self-pollinated. While abundant flowering was noted, a single line, OKA15, was selected for further study26,27. The 10X Genomics short reads were previously published26,27. High molecular weight DNA for S. okadae OKA15 was prepared using a modified CTAB procedure28 and used for Hi-C (chromatin interactions), Pacific Biosciences (PacBio) and Nanopore sequencing. The Hi-C library preparation was done using Phase Genomics Proximo Plant Hi-C version 4.0, and sequenced at Génome Québec’s Sequencing centre, Montreal, Canada, on Illumina NovaSeq. 6000 platform in PE 100 bp mode with ~3B reads, while both the PacBio libraries (sequenced on Sequel II system using the SMRT technology (HiFi) at a depth of 100X,) and the Nanopore ONT libraries (sequenced on the PromethION at a depth of 100X) were prepared and sequenced by Novogene (Novogene Co., Ltd, Beijing, China).

Genome and transcriptome sequencing

A total of 23 additional accessions representing potato wild relatives or hybrids with agronomically important traits such as disease resistance, cold and drought tolerance, and other biotic and abiotic stresses were selected for this study (Table 1). The genomic DNA of these accessions was isolated from in vitro or greenhouse grown plantlets or leaves. The library construction, quality control, and sequencing were done by Novogene (Novogene Co., Ltd, Beijing, China). The genomic DNA was fragmented, ligated with adapters, and PCR amplified to construct single libraries with an insert size of 350 bps. The sequencing was performed on the Illumina NovaSeq. 6000 platform, in a paired-end mode (2 X 150 bp) at ~50x or ~100x depth (Table 2). Similarly, total RNA was extracted from these accessions, including from OKA15, from leaf, shoot, flower, and/or tuber tissues. The RNA-Seq library construction used standard protocol (Novogene) and sequencing was done on the Illumina NovaSeq. 6000 platform (PE150) (Table 2).

Table 1 A list of the 24 potato wild relatives (Solanum sp) accessions selected for this study with their detailed descriptions.
Table 2 Details of the tissues used and amount of data generated for WGS and RNA-Seq. The raw data is calculated as: number of reads*sequence length (150 bp).

S. okadae OKA15 phased genome assembly and quality control

The Nanopore reads were adapter trimmed using porechop v0.2.4 (https://github.com/rrwick/Porechop) and default parameters. hifiasm v0.16.1-r37529 was used to generate a haplotype-resolved assembly using the PacBio HiFi and the Hi-C reads (Fig. 1). Organellar genomes were removed from both resulting haplotype assemblies. The HiFi and Nanopore reads were used to scaffold the haplotype assemblies. Hi-C reads were trimmed using fastp v0.23.130, and mapped to haplotype assemblies individually using the Arima mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline) and default parameters. The alignments were filtered to keep only the unique alignments (samtools view -q 40)31. The scaffolding was performed using YaHS32. Hi-C maps were generated using Juicer33 and manually reviewed using JBAT33. Any duplicated contigs from the assemblies were removed using mummer34. Gaps were closed using tgsgapcloser35 with PacBio and Nanopore reads. A custom repeat library was constructed using Repeatmodeler36 and repeats were masked using Repeatmasker37 and default parameters. RNA-Seq reads were filtered using Kraken238 and trimmed using fastp30, then aligned to the assembly using hisat239. Structural annotations were identified using Braker340, and curated using gfacs41, and functional annotations were generated using ahrd (https://github.com/groupschoof/AHRD) and interproscan42.

Fig. 1
figure 1

The haplotypes of the phased Solanum okadae (OKA15) genome.

Quality control and analysis of WGS and RNA-Seq for additional potato wild relatives

Quality control was performed on the raw sequencing data (WGS and RNA-Seq) using FastQC v0.11.943. An adaptor removal and read quality trimming was performed on the raw reads using Trimmomatic v0.3944. Genome heterozygosity and estimated size were determined using the trimmed WGS reads. First, k-mer frequencies were calculated for each genome (k = 21) and a k-mer histogram was generated using Jellyfish v2.3.045. The k-mer histogram was used to generate various genome characteristics with GenomeScope 2.046. The polyploid genomes of S. brevicaule (=oplocense) X S. tuberosum (B1595106), S. tarnii X S. tuberosum (SH7_18_3), and S. commersonii X S. andigena (cxa70P5) hybrids showed higher rates of heterozygosity (Fig. 2), as seen in previous potato polyploid genomes20, while the S. etuberosum (etb0161) genome showed very low levels of heterozygosity. The estimated haploid genome sizes ranged from 575 Mbp in S. pinnatisectum (S7) to 795 Mbp in S. tarnii X S. tuberosum (SH7_18_3).

Fig. 2
figure 2

The k-mer coverage plots of each of 23 genome sequences from potato wild relatives (Solanum sp).

De novo assembly and annotation

Each genome was assembled into contigs from the raw Illumina reads using the de novo assembler MaSuRca v4.0.547. Organelle sequences were removed from the resulting assemblies using publicly available potato plastome48,49 and mitogenome sequences27,50. The de novo assemblies were aligned against the organellar reference using nucmer from Mummer v4.0.0beta2 utilities34 and the alignments were filtered for 95% identity and 200 bp alignment length. Any contigs with >95% coverage against the organelles were removed, as well as organellar sequences at the start and ends of contigs. The resulting filtered contigs were queried against the non-redundant nucleotide database using blast+ v2.11.051 to remove contaminant sequences. Any contig with a reliable match (90% query converge with 90% sequence identity) to organisms outside of green plants were removed from the assembly. Finally, contigs that are less than 200 bp in length were removed and the remaining contig ids were modified to create a final clean assembly file using BBmap v38.8652 (Table 3). The plastomes of each genome were assembled and annotated from raw Illumina reads using the Plastaumatic53 pipeline. The quality of each de novo assembly was calculated using QUAST v5.0.254 to determine the assembly lengths, N50 values etc, and BUSCO v5.2.255 to check for the completeness of the assembly by looking for the presence of single copy orthologs from Viridiplantae (Fig. 3). An assembly was aligned against its reference genome22 when available using nucmer from mummer v4.0.0beta234 and the alignment statistics were generated using dnadiff.

Table 3 De novo genome assembly statistics and the heterozygosity information.
Fig. 3
figure 3

BUSCO quality assessment results of short reads de novo genome assemblies for 24 potato wild relatives or hybrid clones (Solanum sp). The bar plot shows the percent of BUSCO genes (Viridiplantae dataset) present in each genome assembly. Species information and accession numbers are detailed in Table 1.

The genome assemblies were annotated using the RNA-Seq data as the evidence for gene prediction along with homology-based prediction. The two accessions (B1595106 and SH7_18_3) with no RNA-Seq data were annotated only with the protein sequences. A repeat masking was performed on each genome assembly using RepeatMasker v4.1.2-pl37 with a custom repeat library of potato reference sequences constructed by concatenating repeat libraries from the Petota super pangenome26 and DMv6.19. Table 4 details the number of bases masked in each genome and size of different types of transposable elements found in them. Then the structural annotation was performed using BRAKER v2.1.640,56,57 in two runs, one with the RNA-Seq data as evidence and another run with protein sequences as evidence. The trimmed RNA-Seq reads of each accession were aligned to its genome assembly using HiSAT2 v2.2.139 for BRAKER157. Most of the accessions have an optimal RNA-Seq alignment rate (>65%) against their short read de novo assemblies. The alignment files were sorted and indexed using SAMtools v1.16.131. Then the structural annotation was performed using the BRAKER pipeline58,59,60,61,62,63 on the masked genome assembly with the alignment file. The BRAKER pipeline performed gene prediction by AUGUSTUS v3.4.064 and GeneMark-ET65,66 to generate gene structural annotations. For the homology-based prediction (BRAKER2), Solanales protein sequences from OrthoDB along with the protein sequences from potato reference genomes9,18,67 were combined to create a protein database for BRAKER. The BRAKER1 and BRAKER2 results were combined to select transcripts based on support by extrinsic evidence using TSEBRA v1.0.368 to get a high-confidence gene set. The annotations for the accessions with no RNA-Seq data (B1595106 and SH7_18_3) were obtained from the BRAKER2 run with protein sequences as evidence. Since no RNA-Seq evidence was available for these two, all the predicted structural genes were added to the final gene set. Figure 4 shows the number of genes and transcripts predicted in these accessions by the BRAKER pipeline. A total of 31,247–47,705 high confidence genes (i.e. genes with RNA-Seq evidence) were found in these genomes, with S. etuberosum having the lowest number of genes and the S. commersonii X S. andigena hybrid having the greatest number of genes. A high confidence gene set was not generated for B1595106 and SH7_18_3 genomes due to lack of the RNA-Seq data. A functional annotation was performed using Interproscan v5.52-8642 against the PFAM database. Approximately 65–70% of the annotated genes in all the genomes were found to have PFAM domains, except for the SH7_18_3, and B1595106 genomes (for which RNA-Seq data was not available).

Table 4 Results of the repeat analysis with different classes of TEs and their size.
Fig. 4
figure 4

The number of high-confidence genes and transcripts found in 22 potato wild relatives accessions (Solanum sp). The two columns for each genome represent the total number of predicted genes and transcripts (larger number on top of bar) along with their proportions of functional annotations (smaller number inside bar). Species information and accession numbers are detailed in Table 1.

Presence/absence variation analysis

The trimmed reads of each genome were aligned to the Petota super pangenome26 using bwa v0.7.1769. The resulting alignments were filtered to keep only properly paired (-f 2) and remove secondary alignments (-F 2048), then sorted and indexed using SAMtools v1.1331. The gene presence/absence variations (PAVs) were identified in each genome using SGSGeneLoss v0.170 with minCov = 2 and lostCutoff = 0.2. These PAVs were used to generate a maximum-likelihood phylogenetic tree using IQ-TREE v2.1.371 with GTR2 + FO + R4 as a substitution model and 1000 bootstrap replicates and S. etuberosum set as an outgroup (Fig. 5). A multiple sequence alignment of the 23 plastomes was made using MAFFT v772, followed by a plastome based parsimonious phylogenetic tree constructed using paup v4.0a73 with 1000 bootstrap replicates and S. etuberosum as an outgroup (Fig. 6). The phylogenetic trees were visualized in FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/).

Fig. 5
figure 5

A whole genome presence-absence variation (PAV) based phylogenetic tree showing the relatedness of potato wild relatives (Solanum sp). The S. etuberosum genome sequence was used as an outgroup. The numbers at the nodes represent bootstrap support values. For clarity of the figure, only key values are given, and the ones omitted are all 100.

Fig. 6
figure 6

A plastome sequence based phylogenetic tree of 24 potato wild relatives (Solanum sp). The S. etuberosum was used as an outgroup. The numbers at the nodes represent bootstrap support values.

Data Records

S. okadae (OKA 15) genome sequence data is deposited at NCBI under BioProject PRJNA68456574 and BioSample SAMN1786056075: the Hi-C data under SRR2608197276; Nanopore SRR2087005177; PacBio SRR2087005278; Illumina reads SRR1448238479; and the phased haplotype genome assemblies are available under BioProject PRJNA101811580 (haplotype 1) JAWDCX00000000081 and PRJNA101811582 (haplotype 2) JAWDCY00000000083. The OKA 15 RNA-Seq data BioSamples are available under SAMN37429684- SAMN3742968684,85,86 and the RNA-Seq Illumina reads are available under SRR26082554-SRR2608255687,88,89.

All other data used in this study is deposited in NCBI and available under the BioProject PRJNA77936890. The sample descriptions of the 23 genomes used in the genome sequencing are available under the BioSample accession numbers (SAMN2344097791, SAMN23440980 – SAMN2344098392,93,94,95, SAMN23440986 – SAMN2344099096,97,98,99,100, SAMN23440993 – SAMN23440998101,102,103,104,105,106, SAMN23441000 – SAMN23441002107,108,109, SAMN23441004 – SAMN23441007110,111,112,113). The Illumina WGS reads of each genome were deposited with their respective Biosample IDs under the Sequence Read Archive (SRA) submission (SRR17078416 – SRR17078418114,115,116, SRR17078420117, SRR17078422 – SRR17078424118,119,120, SRR17078426 – SRR17078432121,122,123,124,125,126,127, SRR17078435 – SRR17078439128,129,130,131,132, SRR17078441 – SRR17078444133,134,135,136). The final genome assemblies were deposited under the WGS assembly projects with the accession numbers JAJONR000000000 – JAJONU000000000137,138,139, JAJONW000000000 – JAJOOD000000000140,141,142,143,144,145,146,147,148, JAJOOF000000000 – JAJOOJ000000000149,150,151,152,153, JAJOOM000000000 – JAJOOP000000000154,155,156,157, JAJOOS000000000158, JAKRZT000000000159 and the 23 plastome sequences were deposited in the GenBank database with the accession numbers OM638053-OM638057160,161,162,163,164, OM638059-OM638064165,166,167,168,169,170, OM638066-OM638068171,172,173, OM638071-OM638072174,175, OM638074-OM638077176,177,178,179, OM638081-OM638083180,181,182. A complete list of WGS raw data and assemblies are available in Table 5. The sample descriptions of the RNA-Seq samples are available under the BioSample accession numbers SAMN32886741 – SAMN32886746183,184,185,186,187,188, SAMN32886748 – SAMN32886751189,190,191,192, SAMN32886756 – SAMN32886763193,194,195,196,197,198,199,200, SAMN32886766 – SAMN32886771201,202,203,204,205,206, SAMN32886782 – SAMN32886784207,208,209, SAMN32886786210, SAMN32886787211, SAMN32886790 – SAMN32886796212,213,214,215,216,217,218, SAMN32886798219, SAMN32886799220 and their raw sequencing reads are available under the SRA accession numbers SRR23225904 - SRR23225909221,222,223,224,225,226, SRR23225913227, SRR23225914228, SRR23225916 – SRR23225918229,230,231, SRR23225929232, SRR23225930233, SRR23225933234, SRR23225938 – SRR23225940235,236,237, SRR23225942 – SRR23225944238,239,240, SRR23225947 – SRR23225951241,242,243,244,245, SRR23225953 – SRR23225955246,247,248, SRR23225960 – SRR23225962249,250,251, SRR23225964252, SRR23225966 – SRR23225971253,254,255,256,257,258 (Table 6). The gene annotations are available at the following link: https://potatogenomeportal.org/download

Table 5 A detailed list of genomic data descriptor records. It includes BioSample, SRA, genome, and plastome assembly accession numbers of each genome.
Table 6 A detailed list of transcriptomic data descriptor records. It includes BioSample and their respective SRA accession numbers of individual samples.

Technical Validation

Sequencing data quality control

The quality of the raw WGS and RNA-Seq reads were checked using FastQC43. Various metrics such as per base sequence quality, overall read quality score, GC content, N content, and overrepresented sequences were analyzed to detect low quality sequences or biases in the data. We have found no issues or bias in the WGS reads, whereas the RNA-Seq data is found to have overrepresented sequences in some of the datasets, which is not unusual for RNA-Seq data. We have also checked for viral genome sequences in the RNA-Seq data since potato plants are frequently found to have various types of potato viruses. The reads were classified using kraken2 v2.1.238 by searching against their own genome assemblies and potato virus sequences downloaded from NCBI. Most genomes have very few hits to the viral sequences, and we removed reads that have hits to viruses from each sample and used the clean reads in the following analyses.

Quality assessment of the genome assemblies

The de novo genome assembly quality was assessed by QUAST54 and BUSCO55 where contiguity and completeness of the assemblies were analyzed. The OKA15 genome assembly haplotype 1 and 2 have a complete BUSCO score of 98.3% and 98.5%, respectively, when compared to the (solanales_odb10). Merqury259 was used to estimate the completeness (QV). The complete OKA15 assembly was 99.4%, while haplotype 1 and haplotype 2 were 65 and 65% complete individually (meaning 15–30% of the reads map equally well to both haplotypes). Haplotype 1 (hap1) has 214 contigs with a total length of 729 Mb. The largest contig is 84.7 Mb and the N50 is 58.6 Mb. Haplotype 2 (hap2) has 131 contigs with a total length of 726 Mb, the largest contig is 83.2 Mb and the N50 58.4 Mb. The heterozygosity of the oka15 genome was calculated using jellyfish45 and found to be 0.87%.

The S. etuberosum genome, which has been self-pollinated for several generations, has the best assembly of all with an N50 value of 59,223, 34,061 contigs, and 610 Mbp genome size. The three tetraploids (B1595106, SH7_18_3, and cxa70P5), and a diploid S. chacoense (chc0917) have fragmented assemblies (Table 2). Nine genome assemblies from this study were also compared with long-read assemblies of the same species from a recent study22. Overall, the amount of sequences that mapped to their individual reference ranged from ~85–99% with ~79–98% of the assembly coverage (Table 7). The BUSCO assessment of the assemblies revealed the majority of them are complete with presence of >80% complete genes (Fig. 2). Overall, the S. etuberosum genome has the highest percent of core plant orthologous genes with 94.1% of genes present as complete copies, followed by the chq762573 and pur1868 genomes with 92% and 89.6% of complete BUSCO genes. The SH7_18_3, cxa70P5, and B1595106 genomes have a higher percentage of fragmented genes among the 23 genomes. High rates of heterozygosity and higher ploidy levels are the major contributing factors to highly fragmented assemblies38.

Table 7 Alignment statistics of nine genomes from this study compared against reference genomes.

Phylogenetic inference

Previous studies classified section Petota (tuber-bearing) into major Clades and subgroups13,26,260,261. Here we have reconstructed phylogenetic trees from the PAV data (Fig. 5) as well as plastome sequences (Fig. 6) to understand the relationship between these genomes. The results have shown similar groupings as previous studies where S. pinnatisectum (Clade 1 + 2), S. paucissectum, S. piurae, and S. chiquidenum (Clade 3), and the remaining accessions (Clade 4) separated into different clades. Within Clade 4, the S. megistacrolobum accessions formed a sister clade to Clade 4 South, something which was also seen in the Petota pangenome26. S. okadae placed in the Clade 4 South as also seen in the Petota pangenome. The S. tarnii X S. tuberosum (SH7_18_3) genome, which is a somatic hybrid between S. tarnii (Clade 1 + 2 species) and S. tuberosum (Clade 4) obtained by in vitro fusions of protoplasts, placed differently in the PAV and plastome phylogenetic trees. In the plastome phylogenetic tree, it is grouped with the Clade 1 + 2 species, whereas, in the PAV tree it is grouped with another S. tuberosum hybrid (Clade 4) showing the phylogenetic difficulty with hybrids and the importance of including both nuclear and organellar data. Moreover, genomes that are known to be closely related are grouped together, and genomes representing the same species grouped together, which validates the use of PAVs in determining phylogenetic relationships26.

Usage Notes

The Solanum section Petota has over 100 cultivated and wild species. The potato crop wild relatives are an excellent source of genetic variation; however, not many sequencing efforts have been carried out to study these wild species. Here, we present the 24 phased chromosomes for S. okadae, and genome and transcriptome data for an additional 23 potato wild relative accessions. Though most of these are short-read assemblies, they are a useful tool for exploring genetic diversity that can provide insight in potato pre-breeding. The WGS reads can be used in various analyses such as determining SNPs, structural variations, and PAVs to understand the genetic diversity that exists within the section Petota. The RNA-Seq data of these wild species with important agronomic traits is especially crucial to understand their transcriptome profile. S. okadae foliage has been shown to contain the glycoalkaloid tomatine, and undetectable levels of solanine and chaconine, and carries Colorado potato beetle resistance, drought and cold tolerance (but not frost) and moderate late blight resistance262,263,264,265.