Background & Summary

The Nicotiana genus belongs to the Solanaceae family, which also includes tomato (Solanum lycopersicum), potato (Solanum tuberosum), and eggplant (Solanum melongena)1,2. While most of the Solanaceae are diploids with 12 chromosome pairs, tobacco (Nicotiana tabacum L.) is an allotetraploid (2n = 4x = 48) resulting from a hybridization event that likely occurred in the Andes within the last 200,000 years between ancestors of Nicotiana sylvestris (S-genome; 2n = 2x = 24) and Nicotiana tomentosiformis (T-genome; 2n = 2x = 24)3,4. In addition to being a modern descendant of the N. tabacum maternal progenitor, N. sylvestris, which is nowadays largely cultivated as an ornamental plant, is also one of the closest descendants of the ancestral species from the Alatae/Sylvestres section that hybridized as the paternal donor with an ancestral species from the Noctiflorae/Petunioides section to give rise to the almost all-Australian clade of allopolyploid species constituting the Nicotiana section Suaveolentes5.

Similar to other members of the Nicotiana genus, N. sylvestris, N. tomentosiformis, and N. tabacum produce a wide range of alkaloids that are known to be toxic to insects and are a well-established mechanism of defense against herbivores6. While N. sylvestris accumulates similar amounts of alkaloids in roots and leaves (3.5 mg/g in roots and 2.1 mg/g in leaves), N. tomentosiformis accumulates more alkaloids in roots (8.8 mg/g in roots and 0.6 mg/g in leaves), and N. tabacum has more in leaves (1.3 mg/g in roots and 12.5 mg/g in leaves)7. The composition of the accumulated alkaloids varies between the three species, with N. tabacum benefiting from both of its progenitors’ genetic and regulatory contributions. In N. sylvestris roots, 87% of the alkaloids is nicotine, 11% is anatabine, and 1.9% is anabasine, while in leaves, 100% of the alkaloids is nicotine. In N. tomentosiformis roots, 56% of the alkaloids is nornicotine, 28% is anatabine, 14% is nicotine, 1.6% is anabasine, and 0.57% is cotinine, while in leaves, 73% of the alkaloids is nicotine and 27% is nornicotine. In N. tabacum roots, 87% of the alkaloids is nicotine, and 13% is nornicotine, while in leaves, 92% of the alkaloids is nicotine, 5.1% is nornicotine, and 2.6% is anatabine7.

The Nicotiana genus is also a rich source of terpenoids, which play a significant role as attractants to several pollinator insects. In N. tabacum, both cembranoid and labdanoid diterpenoids are synthesized in the trichome glands, whereas N. sylvestris produces predominantly cembranoid diterpenoids and N. tomentosiformis predominantly labdanoid diterpenoids8.

Although several Nicotiana species genomes have been published in the last decade, including those of N. sylvestris9, N. tomentosiformis9, and N. tabacum10,11, these genomes are primarily based on the assembly of second-generation sequencing data and are therefore highly fragmented, with only partial anchoring to chromosomes.

In the present study, we integrated Illumina short-read sequencing (Illumina, San Diego, CA, USA) with third-generation Oxford Nanopore long-read sequencing and Oxford Nanopore chromosome conformation capture (PoreC) technology (Oxford Nanopore Technologies, Oxford, UK) to generate high-quality chromosome-level reference genomes for N. tabacum, N. sylvestris, and N. tomentosiformis. These new resources will broaden our understanding of the contributions of both N. tabacum progenitors to the genes and the pathways of tobacco and enable more efficient synteny-based cross-species Solanaceae research.

Methods

DNA Extraction and Sequencing

Young leaves from N. tabacum L. cultivar K326 (PVY resistant derived from USDA ARS GRIN Global NPGS: PI 552505), N. sylvestris Speg. TW136 (USDA ARS GRIN Global NPGS: PI 555569) and N. tomentosiformis Goodsp. TW142 (USDA ARS GRIN Global NPGS: PI 555572) were snap-frozen with liquid nitrogen and finely ground in a mortar. High molecular weight genomic DNA for long-read sequencing was extracted using the Promega Wizard HMW DNA Extraction Kit (Promega AG, Madison, WI, USA).

Short genomic DNA fragments were removed using Circulomics short-read eliminator kits from PacBio (PacBio, Menlo Park, CA, USA), and long-read sequencing libraries were prepared using Oxford Nanopore Technologies SQK-LSK109 Ligation Sequencing Kits before sequencing on Oxford Nanopore Technologies PromethION R9.4.1 flowcells. About 139 Gb of raw data were collected for N. tabacum, 159 Gb for N. sylvestris, and 76 Gb for N. tomentosiformis.

To conduct chromosome-level assembly, frozen leaves were cut into one square centimeter pieces and treated with formaldehyde to fix the DNA. The fixed genomic DNA was then digested overnight using the NlaIII restriction enzyme, and the 3′ overhangs were re-ligated using T4 ligase before extraction. PoreC sequencing libraries were prepared using Oxford Nanopore Technologies SQK-LSK109 Ligation Sequencing Kits before sequencing on Oxford Nanopore Technologies PromethION R9.4.1 flowcells. About 40 Gb of raw data were collected for N. tabacum, 66 Gb for N. sylvestris, and 63 Gb for N. tomentosiformis.

To polish and validate the assembled genomes, Illumina short-read libraries were prepared for N. tabacum using Tecan Celero EZ DNA-Seq Library Preparation Kits (Tecan, Männedorf, Switzerland) and sequenced as 2 × 151 bp paired-end reads on an Illumina NovaSeq 6000 to generate a total of 139 Gb. Illumina short-reads from ERR27452712 and ERR27452813 for N. sylvestris and from ERR27454014 and ERR27454215 for N. tomentosiformis were retrieved from the Sequence Read Archive.

De novo Assembly and Chromosome Construction

For N. tabacum, Oxford Nanopore basecalling was performed with Guppy 6.3.7 using the plant super model. Long-read sequences were filtered using seqkit16 2.2.0 to remove short reads (length <5000) and low-quality reads (average qscore <9), resulting in 98 Gb (N50 length: 28.5 kb).

For N. sylvestris and N. tomentosiformis, Oxford Nanopore basecalling was performed with Guppy 6.1.1 using the plant super model. Long-read sequences were filtered using seqkit16 2.2.0 to remove short reads (length <2500) and low-quality reads (average qscore <9), resulting in 108 Gb (N50 length: 25.9 kb) and 41 Gb (N50 length: 28.2 kb) for N. sylvestris and N. tomentosiformis, respectively.
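The length and quality filter and the N50 statistic reported above can be illustrated with a minimal Python sketch. It is not the seqkit implementation: the function names are hypothetical, and averaging the quality in error-probability space is an assumption about how the average qscore is defined.

```python
import math

def mean_qscore(quality_string, offset=33):
    """Mean Phred score of a read, averaged in error-probability space.
    Assumption: this mirrors the 'average qscore' used for filtering."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))

def keep_read(sequence, quality_string, min_length=5000, min_qscore=9):
    """Keep reads that are neither short nor low quality."""
    return (len(sequence) >= min_length
            and mean_qscore(quality_string) >= min_qscore)

def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

With min_length set to 5000 this corresponds to the N. tabacum thresholds; min_length=2500 would correspond to those used for the two diploid species.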

Genomes were assembled with flye17 2.9.1 using the nano-hq input pre-set and a read error rate of 0.03.

The Illumina short-reads were processed for each species using fastp18 0.23.2 to trim adapters and low-quality bases, merge pairs, and remove low complexity and short (length <75) reads. During processing, the reads were split into two sets, one for assembly polishing which contained 80% of the processed Illumina reads and one for assembly validation containing 20% of the processed Illumina reads.
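How the reads were assigned to the two sets is not described here. As one possible approach, the sketch below assigns each read pair deterministically from a hash of its name, so that roughly 80% of the pairs go to polishing and 20% to validation. The function name and the hashing scheme are illustrative assumptions, not the procedure used in this study.

```python
import hashlib

def assign_read_set(read_name, polishing_share=80):
    """Assign a read pair to 'polishing' or 'validation' from its name.
    Hypothetical scheme: hashing keeps the assignment reproducible and
    puts both mates of a pair (same name) in the same set."""
    digest = hashlib.md5(read_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "polishing" if bucket < polishing_share else "validation"
```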

The assembled genomes were polished with processed Illumina short-reads using fmlrc219 0.1.7. The remaining haplotig sequences were removed from the assemblies using purge_dups20 1.2.6, with cut-offs set to 3, 8, and 1000 for N. tabacum, to 5, 10, and 1000 for N. sylvestris, and to 2, 3, and 1000 for N. tomentosiformis.

Illumina short-reads were mapped to the assembly contigs using minimap221,22 2.24, duplicates marked with samblaster23 0.1.26, and filtered using samtools24 1.15.1. The coverage of the assembly contigs by Illumina sequencing was then calculated using samtools24 1.15.1, and contigs with less than 70% of their length with a coverage of at least 5 for N. tabacum and 15 for N. sylvestris and N. tomentosiformis were removed.
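The contig filter described above amounts to a breadth-of-coverage threshold. A minimal sketch, assuming the per-base depths of a contig are available as a list (for example parsed from a per-position depth table), could look as follows; the function names are hypothetical.

```python
def breadth_of_coverage(depths, min_depth):
    """Fraction of positions whose depth is at least min_depth."""
    if not depths:
        return 0.0
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)

def keep_contig(depths, min_depth, min_breadth=0.70):
    """Keep a contig when at least 70% of its length reaches min_depth
    (5 for N. tabacum, 15 for the two diploid species)."""
    return breadth_of_coverage(depths, min_depth) >= min_breadth
```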

Because the biological material used for sequencing originated from inbred plants that can be considered homozygotes, variants were called using freebayes25 1.3.6 with the ploidy parameter set to 1 and ignoring sites with coverage higher than 200. The calls were filtered with vcflib26 1.0.3 vcffilter using the parameters --filter-sites --info-filter “QUAL >20 & QUAL/AO >10 & SAF >0 & SAR >0 & RPL >1 & RPR >1”. Variants were then applied to the genomes using bcftools24 1.15.1 consensus to generate the polished assembly contigs.
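The filter expression can be read as a conjunction of simple conditions on the site quality and on the freebayes INFO fields AO (alternate allele observations), SAF and SAR (alternate observations on the forward and reverse strands), and RPL and RPR (alternate observations supported by reads placed to the left and right). The sketch below restates that logic in Python for a single-alternate-allele record; it is an illustration of the expression, not of how vcffilter evaluates it internally.

```python
def passes_filter(qual, info):
    """Return True when a variant satisfies every condition of the filter.
    `info` is a dict with integer AO, SAF, SAR, RPL and RPR values
    (assumption: one alternate allele per record)."""
    ao = info["AO"]
    return (
        qual > 20
        and ao > 0 and qual / ao > 10   # quality per supporting observation
        and info["SAF"] > 0             # seen on the forward strand
        and info["SAR"] > 0             # seen on the reverse strand
        and info["RPL"] > 1             # supported by reads placed left
        and info["RPR"] > 1             # supported by reads placed right
    )
```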

Assembly contigs originating from the plastid and mitochondrial genomes were removed by mapping the polished assembly contigs to the N. tabacum plastid and mitochondrion sequences (NC_001879.227 and NC_006581.128, respectively) using minimap221,22 2.24 and filtering out contigs that mapped over more than 50% of their length.
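The 50% criterion requires the fraction of each contig covered by organellar alignments, counting overlapping alignments only once. A minimal sketch, assuming alignments are given as half-open (start, end) intervals on the contig (as in PAF output), is shown below; the function names are hypothetical.

```python
def covered_length(intervals):
    """Number of bases covered by the union of half-open (start, end) intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total

def is_organellar(contig_length, intervals, max_fraction=0.5):
    """Flag contigs aligned to organelle sequences over more than half their length."""
    return covered_length(intervals) / contig_length > max_fraction
```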

Assembly contigs from possible contamination were identified with kraken229 2.1.2 using the k2_pluspfp_20220908 database30 and removed by retaining only contigs identified as belonging to Nicotiana or Solanum species.

PoreC reads were mapped to the cleaned assembly contigs using minimap221,22 2.24. Alignments with a mapping quality lower than 60 for N. tabacum and 30 for N. sylvestris and N. tomentosiformis were discarded, and contact pairs were created from the remaining alignments. The positions on the contigs of each contact pair were recorded as two consecutive lines in a BED file. The scaffolding of the contigs to a chromosome-level assembly was performed using yahs31 1.2a1. Contact maps were prepared using PretextMap32 0.1.9, manually curated and annotated in PretextView33 0.2.5, and the resulting scaffolds exported as chromosome-level sequences.
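Because a single PoreC read can align to several loci, converting alignments into contact pairs requires a rule for pairing them. The sketch below assumes that all pairwise combinations of the retained alignments of a read are emitted, each pair written as two consecutive BED lines. The pairing rule, the column layout, and the function names are assumptions made for illustration; the exact fields expected by the scaffolder should be checked against its documentation.

```python
from itertools import combinations

def contact_pairs(alignments, min_mapq):
    """Pairwise contacts among the alignments of one read.
    Each alignment is (contig, start, end, mapq); alignments with a
    mapping quality below min_mapq are discarded first."""
    kept = [aln for aln in alignments if aln[3] >= min_mapq]
    return list(combinations(kept, 2))

def bed_lines(read_name, alignments, min_mapq):
    """Write every contact pair as two consecutive BED lines sharing one name."""
    lines = []
    for index, pair in enumerate(contact_pairs(alignments, min_mapq)):
        for contig, start, end, mapq in pair:
            lines.append(f"{contig}\t{start}\t{end}\t{read_name}_{index}\t{mapq}")
    return lines
```

A read with three retained alignments therefore yields three contact pairs and six BED lines.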

The N. tabacum chromosome-level sequences were named and oriented using the PT markers, which were mapped to the sequences using hisat234 2.2.1, together with the tobacco genetic map35. Similarly, the N. tomentosiformis chromosome-level sequences were named and oriented using the N genetic map36 combined with the tobacco PT markers35. The chromosome-level assembly of the N. tomentosiformis genome was then used as a reference to name and orient the N. sylvestris chromosome-level sequences based on minimap221,22 2.24 mapping (Fig. 1).

Fig. 1

PoreC contact maps. Intra-chromosomal and inter-chromosomal contacts are shown for the Nicotiana sylvestris, Nicotiana tomentosiformis, and Nicotiana tabacum genome assemblies. The black bottom and right edges correspond to unplaced sequences.

The proportion of the assembly anchored to chromosomes reached 99.5%, 95.9%, and 97.6% of the total assembly lengths for N. sylvestris, N. tomentosiformis, and N. tabacum, respectively (Table 1).

Table 1 Chromosome length, total assembly length, and percentage of the assembly anchored to chromosomes for Nicotiana sylvestris, Nicotiana tomentosiformis, and Nicotiana tabacum.

Compared with the previously available N. tabacum genome assembly11, which was generated from short-read sequencing, whole genome profiling, and optical and genetic mapping data, the new N. tabacum genome assembly has fewer contigs (a decrease from 1,257,801 to 1,410) with a larger N50 length (an increase from 9.1 kb to 11.8 Mb), and the proportion of the assembly anchored to chromosomes consequently improved from 64% to 97.6%.

Retrotransposon Prediction and Annotation

Nested retrotransposons were annotated by iteratively running genometools 1.6.2 ltrharvest37 using the parameters -similar 70 -seed 20 -minlenltr 100 -maxlenltr 7000 -mindistltr 1000 -maxdistltr 15000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 3 -vic 10 -overlaps best, retaining the predictions matching to the RepeatExplorer Viridiplantae 3.0 dataset38 using diamond39 2.1.6 blastx with the parameters --max-target-seqs 1 --ultra-sensitive --frameshift 15, and excising them from the assembly using samtools24 1.17. At most, 20 prediction-filtering-excision iterations were performed.

The predicted retrotransposons were classified by their homology to the RepeatExplorer Viridiplantae 3.0 dataset38 sequences. Their age was estimated under the assumption that their long terminal repeats (LTRs) were identical at the time of insertion by aligning their 3′ and 5′ LTRs using clustalo40,41 1.2.4, calculating their divergence (K) using the Kimura-2-parameter distance, and dividing it by twice the substitution rate (r) of 1.5 × 10⁻⁸ substitutions per site per year42, i.e. T = K/(2r).
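The age estimate can be illustrated with a short Python sketch that computes the Kimura-2-parameter distance between two aligned LTRs and converts it to an insertion age with T = K/(2r). The function names are hypothetical, and skipping alignment columns that contain gaps or ambiguous bases is an assumption.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def k2p_distance(ltr5, ltr3):
    """Kimura-2-parameter distance between two aligned sequences.
    Columns with gaps or ambiguous bases are ignored (assumption)."""
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable alignment columns")
    n = len(pairs)
    p = sum(1 for pair in pairs if pair in TRANSITIONS) / n        # transitions
    q = sum(1 for a, b in pairs
            if a != b and (a, b) not in TRANSITIONS) / n           # transversions
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

def insertion_age_years(k, rate=1.5e-8):
    """Insertion age T = K / (2r), with r in substitutions per site per year."""
    return k / (2 * rate)
```

Under this rate, a divergence of K = 0.03 between the two LTRs corresponds to an insertion about one million years ago.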

The predicted retrotransposons covered 26.6%, 32.2%, and 29.3% of the N. sylvestris, N. tomentosiformis, and N. tabacum genomes, respectively (Table 2). Regardless of the species, the most frequent element subclass is Ty3/gypsy|chromovirus|Tekay, representing between 40% and 56% of the total predicted retrotransposon length. The only element subclass that shows a marked difference between the three species is Ty3/gypsy|non-chromovirus|OTA|Tat|Ogre, which covers 116,167,517 bp (18.8% of the total predicted retrotransposon length) in N. sylvestris, and only 21,672,795 bp (3.9%) in N. tomentosiformis. In N. tabacum, it covers 135,653,424 bp (11.6%), close to the sum of its coverage in the two progenitor species (137,840,312 bp). The predicted insertion ages indicate a recent expansion of the Alesia and Angela subclasses of Ty1/copia and of the Ogre subclass of Ty3/gypsy retrotransposons in N. sylvestris and N. tabacum, but not in N. tomentosiformis (Fig. 2).

Table 2 Predicted retrotransposons length and genome coverage statistics.
Fig. 2

Predicted retrotransposon insertion ages. (a) Predicted insertion ages in millions of years for retrotransposons of the Ty1/copia superfamily; (b) Predicted insertion ages in millions of years for retrotransposons of the Ty3/gypsy superfamily.

Coding-gene Prediction and Annotation

Genomes were masked using blast43,44 2.14.0 windowmasker with dusting, and augustus45 3.5.0 was used for gene prediction. A training dataset was created by separately mapping S. lycopersicum, S. tuberosum, and Nicotiana attenuata cDNA and CDS from Ensembl 56 using minimap221,22 2.26 to the N. sylvestris and N. tomentosiformis genomes. Sequences with an annotation matching ‘hypothetical’, ‘unknown’, ‘polyprotein’, ‘domain-containing’, ‘chloroplast’, or ‘mitochondria’ were omitted from the mapping. Gene models were constructed from the mapped sequences using bedtools46 2.30.0 and filtered using gffread47 0.12.7 with the parameters -V -H -U -N -P -J -M -K -Q -Y -Z -F --keep-exon-attrs. Training sequences were then extracted from the genomes using the obtained GFF annotation file and adding 1,000 bp flanking regions. One-fourth of the gene models were set aside for testing for each combination of species and dataset. After merging the training and testing datasets, a Nicotiana model was trained using the etraining and optimize_augustus.pl programs bundled with augustus45 3.5.0. A total of 10,092 loci were used for training, and 3,362 loci were used for testing.

To provide hints for the augustus predictions, Ensembl 56 proteins from S. lycopersicum, S. tuberosum, and N. attenuata were mapped to the genomes using miniprot48 0.11, and aletsch49 1.0.3 was used to construct transcripts from Illumina paired-end RNA-Seq reads from SRR1191245750, SRR210653151, ERR27438752, ERR27438853, ERR27438954, ERR27439055, ERR27439156, ERR27439257, ERR27439358, ERR27439459, ERR27439560, ERR27439661, ERR27439762, ERR27439863, ERR27439964, ERR27440065, ERR27440166, ERR27440267, ERR27440368, ERR27440469, and ERR27440570 mapped using hisat234 2.2.1, and Oxford Nanopore long cDNA reads from SRR1204599171, SRR1204599272, SRR1204599373, and SRR1204599474 mapped with minimap221,22 2.26.

Augustus45 3.5.0 predictions were obtained using the trained Nicotiana model, the extrinsic.MPE.cfg extrinsic configuration file, and hints derived from the miniprot48 0.11 and aletsch49 1.0.3 output with priorities of 4 and 3, respectively. Other augustus45 3.5.0 parameters used were --alternatives-from-evidence=off --alternatives-from-sampling=off --softmasking=1 --strand=both --genemodel=complete --UTR=on. Predicted gene models without supporting hints were removed if they did not encode a protein found in a filtered UniProt eudicotyledons protein dataset when searched using diamond39 2.1.6 blastx with the parameters --max-target-seqs 1 --min-score 200 --ultra-sensitive --frameshift 15. The dataset was filtered to omit proteins with annotations matching ‘uncharacterized’, ‘unknown’, ‘hypothetical’, ‘genome’, ‘domain-containing’, ‘family’, ‘transmembrane’, ‘putative’, ‘probable’, ‘predicted’, ‘member’, ‘fragment’, ‘truncated’, ‘superfamily’, ‘chloroplast’, ‘mitochond’, ‘low quality’, or ‘At.g’.

To complement the augustus predictions, additional gene models were created by separately mapping the predicted N. sylvestris, N. tomentosiformis, and N. tabacum cDNA and CDS and the S. lycopersicum, S. tuberosum, and N. attenuata cDNA and CDS from Ensembl 56 to the genomes using minimap221,22 2.26. Models that overlapped augustus predictions by 25% or more according to bedtools46 2.30.0 intersect were then filtered out by IDs using gffread47 0.12.7 with the parameters -P -M -K -Q -Y -Z -F, and the remaining gene models were added to those predicted with augustus45 3.5.0.

Functional annotation of the gene models was performed using diamond39 2.1.6 blastx with the parameters --max-target-seqs 1 --min-score 200 --ultra-sensitive --frameshift 15 against UniProt eudicotyledons proteins filtered to omit proteins with annotations matching ‘uncharacterized’, ‘unknown’, ‘hypothetical’, ‘genome’, ‘domain-containing’, ‘family’, ‘transmembrane’, ‘putative’, ‘probable’, ‘predicted’, ‘member’, ‘fragment’, ‘truncated’, ‘superfamily’, ‘chloroplast’, ‘mitochond’, ‘low quality’ or ‘At.g’. Gene models overlapping with retrotransposons by 75% or more according to bedtools46 2.30.0 intersect and those with annotations matching ‘transposon’, ‘transposase’, ‘polyprotein’, ‘gagpol’, or ‘gag-pol’ were excluded to yield the final set of annotated gene models.
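The final exclusion step combines an overlap criterion with a keyword criterion. The sketch below illustrates that logic under two assumptions that are not stated above: the overlap is measured as a fraction of the gene model length against a single retrotransposon, and keywords are matched as case-insensitive substrings. The function names are hypothetical.

```python
EXCLUDED_KEYWORDS = ("transposon", "transposase", "polyprotein", "gagpol", "gag-pol")

def overlap_fraction(gene, element):
    """Fraction of a gene interval covered by one element (half-open coordinates)."""
    gene_start, gene_end = gene
    start = max(gene_start, element[0])
    end = min(gene_end, element[1])
    return max(0, end - start) / (gene_end - gene_start)

def exclude_gene(gene, annotation, elements, min_overlap=0.75):
    """Exclude gene models annotated as transposon-related or overlapping
    a predicted retrotransposon by 75% or more."""
    text = annotation.lower()
    if any(keyword in text for keyword in EXCLUDED_KEYWORDS):
        return True
    return any(overlap_fraction(gene, element) >= min_overlap
               for element in elements)
```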

Data Records

The genomes and annotations are available from Zenodo under records 825625275, 825625476, and 825625677. The trained Nicotiana model for augustus gene prediction is available from Zenodo under record 825628078.

The genomes have been deposited at DDBJ/ENA/GenBank under the accessions ASAF0000000079, ASAG0000000080 and AWOJ0000000081.

Raw sequencing data are available from the National Center for Biotechnology Information Sequence Read Archive under accessions SRR2568512682, SRR2568512783, SRR2568512884, SRR2568512985, and SRR2568513086 in BioProject PRJNA182500, SRR2568503487, SRR2568503588, SRR2568503689, SRR2568503790, SRR2568503891, SRR2568503992, and SRR2568504093 in BioProject PRJNA182501, and SRR2568538694, SRR2568538795, SRR2568538896, SRR2568538997, SRR2568539098, SRR2568539199, SRR25685392100, SRR25685393101, SRR25685394102, SRR25685395103, and SRR25685396104 in BioProject PRJNA208210 for N. sylvestris, N. tomentosiformis, and N. tabacum, respectively.

Technical Validation

The quality and completeness of the assemblies were assessed with yak105 0.1 using the 20% of the processed Illumina short-reads that were set aside for that purpose. For N. tabacum, a Quality Coverage of 0.982 and a Quality Value of 38.1 were obtained; for N. sylvestris, the values were 0.993 and 41.5; and for N. tomentosiformis, 0.991 and 43.2.
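Assuming the Quality Value follows the standard Phred scale, QV = −10 log10(e), it can be converted to an estimated per-base error rate e. The short sketch below shows the conversion; the function names are hypothetical.

```python
import math

def qv_to_error_rate(qv):
    """Per-base error rate implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)

def error_rate_to_qv(error_rate):
    """Phred-scaled quality value for a given per-base error rate."""
    return -10 * math.log10(error_rate)
```

On this scale, a Quality Value of 38.1 corresponds to roughly 1.5 errors per 10,000 bases, and each additional 10 units corresponds to a tenfold lower error rate.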

The quality of the gene predictions from the trained Nicotiana model was evaluated using the prepared testing sets and compared with results obtained using the already available arabidopsis, tomato, and coyote_tobacco models (Table 3).

Table 3 Augustus testing metrics with the arabidopsis, tomato, coyote_tobacco, and Nicotiana models.

The completeness of the gene model sets was evaluated using BUSCO106 5.4.7 with the solanales_odb10 lineage dataset. Completeness of 98.1%, 95.1%, and 96.1% at the transcript level and of 97.0%, 92.8%, and 93.4% at the protein level were obtained for N. tabacum, N. sylvestris, and N. tomentosiformis, respectively (Table 4). These values are similar to those obtained for S. lycopersicum (95.0% at the transcript level and 92.3% at the protein level).

Table 4 Statistics of the BUSCO genome, transcripts, and proteins completeness evaluation using the solanales_odb10 lineage dataset for Nicotiana sylvestris, Nicotiana tomentosiformis and Nicotiana tabacum.