Chromosome-level genome assemblies of Nicotiana tabacum, Nicotiana sylvestris, and Nicotiana tomentosiformis

Sierro, Nicolas; Auberson, Mehdi; Dulize, Rémi; Ivanov, Nikolai V.

doi:10.1038/s41597-024-02965-2

Data Descriptor
Open access
Published: 26 January 2024

Scientific Data volume 11, Article number: 135 (2024)

Abstract

The Solanaceae species Nicotiana tabacum, an economically important crop plant cultivated worldwide, is an allotetraploid species that appeared about 200,000 years ago as the result of the hybridization of diploid ancestors of Nicotiana sylvestris and Nicotiana tomentosiformis. The previously published genome assemblies for these three species relied primarily on short-reads, and the obtained pseudochromosomes only partially covered the genomes. In this study, we generated annotated de novo chromosome-level genomes of N. tabacum, N. sylvestris, and N. tomentosiformis, which contain 3.99 Gb, 2.32 Gb, and 1.74 Gb of sequence data, respectively, with 97.6%, 99.5%, and 95.9% anchored to chromosomes, and which represent 99.2%, 98.3%, and 98.5% of the near-universal single-copy orthologous Solanaceae genes. The completion levels of these chromosome-level genomes for N. tabacum, N. sylvestris, and N. tomentosiformis are comparable to other reference Solanaceae genomes, enabling more efficient synteny-based cross-species research.

Background & Summary

The Nicotiana genus belongs to the Solanaceae family, which also includes tomato (Solanum lycopersicum), potato (Solanum tuberosum), and eggplant (Solanum melongena)^1,2. While most of the Solanaceae are diploids with 12 chromosome pairs, tobacco (Nicotiana tabacum L.) is an allotetraploid (2n = 4x = 48) resulting from a hybridization event that likely occurred in the Andes within the last 200,000 years between ancestors of Nicotiana sylvestris (S-genome; 2n = 2x = 24) and Nicotiana tomentosiformis (T-genome; 2n = 2x = 24)^3,4. In addition to being a modern descendant of the N. tabacum maternal progenitor, N. sylvestris, which is nowadays largely cultivated as an ornamental plant, is also one of the closest descendants of the ancestral species from the Alatae/Sylvestres section that hybridized as the paternal donor with an ancestral species from the Noctiflorae/Petunioides section to give rise to the almost all-Australian clade of allopolyploid species constituting the Nicotiana section Suaveolentes⁵.

Similar to other members of the Nicotiana genus, N. sylvestris, N. tomentosiformis, and N. tabacum produce a wide range of alkaloids that are known to be toxic to insects and are a well-established mechanism of defense against herbivores⁶. While N. sylvestris accumulates similar amounts of alkaloids in roots and leaves (3.5 mg/g in roots and 2.1 mg/g in leaves), N. tomentosiformis accumulates more alkaloids in roots (8.8 mg/g in roots and 0.6 mg/g in leaves), and N. tabacum has more in leaves (1.3 mg/g in roots and 12.5 mg/g in leaves)⁷. The composition of the accumulated alkaloids varies between the three species, with N. tabacum benefiting from both of its progenitors’ genetic and regulatory contributions. In N. sylvestris roots, 87% of the alkaloids is nicotine, 11% is anatabine, and 1.9% is anabasine, while in leaves, 100% of the alkaloids is nicotine. In N. tomentosiformis roots, 56% of the alkaloids is nornicotine, 28% is anatabine, 14% is nicotine, 1.6% is anabasine, and 0.57% is cotinine, while in leaves, 73% of the alkaloids is nicotine and 27% is nornicotine. In N. tabacum roots, 87% of the alkaloids is nicotine, and 13% is nornicotine, while in leaves, 92% of the alkaloids is nicotine, 5.1% is nornicotine, and 2.6% is anatabine⁷.

The Nicotiana genus is also a rich source of terpenoids, which play a significant role as attractants to several pollinator insects. In N. tabacum, both cembranoid and labdanoid diterpenoids are synthesized in the trichome glands, whereas N. sylvestris produces predominantly cembranoid diterpenoids and N. tomentosiformis predominantly labdanoid diterpenoids⁸.

Although several Nicotiana species genomes have been published in the last decade, including for N. sylvestris⁹, N. tomentosiformis⁹, and N. tabacum^10,11, these genomes are primarily based on the assembly of second-generation sequencing data and therefore suffer from substantial fragmentation, resulting in only partial anchoring to chromosomes.

In the present study, we integrated Illumina short-read sequencing (Illumina, San Diego, CA, USA) with third-generation Oxford Nanopore long-read sequencing and Oxford Nanopore chromosome conformation capture (PoreC) technology (Oxford Nanopore Technologies, Oxford, UK) to generate high-quality chromosome-level reference genomes for N. tabacum, N. sylvestris, and N. tomentosiformis. These new resources will broaden our understanding of the contributions of both N. tabacum progenitors to the genes and the pathways of tobacco and enable more efficient synteny-based cross-species Solanaceae research.

Methods

DNA Extraction and Sequencing

Young leaves from N. tabacum L. Cultivar K326 (PVY resistant derived from USDA ARS GRIN Global NPGS: PI 552505), N. sylvestris Speg. TW136 (USDA ARS GRIN Global NPGS: PI 555569) and N. tomentosiformis Goodsp. TW142 (USDA ARS GRIN Global NPGS: PI 555572) were snap-frozen with liquid nitrogen and finely ground in a mortar. High molecular weight genomic DNA for long-read sequencing was extracted using the Promega Wizard HMW DNA Extraction Kit (Promega AG, Madison, WI, USA).

Short genomic DNA fragments were removed using Circulomics short-read eliminator kits from PacBio (PacBio, Menlo Park, CA, USA), and long-read sequencing libraries were prepared using Oxford Nanopore Technologies SQK-LSK109 Ligation Sequencing Kits before sequencing on Oxford Nanopore Technologies PromethION R9.4.1 flowcells. About 139 Gb of raw data were collected for N. tabacum, 159 Gb for N. sylvestris, and 76 Gb for N. tomentosiformis.

To conduct chromosome-level assembly, frozen leaves were cut into one square centimeter pieces and treated with formaldehyde to fix the DNA. The fixed genomic DNA was then digested overnight using the NlaIII restriction enzyme, and the 3′ overhangs were re-ligated using T4 ligase before extraction. PoreC sequencing libraries were prepared using Oxford Nanopore Technologies SQK-LSK109 Ligation Sequencing Kits before sequencing on Oxford Nanopore Technologies PromethION R9.4.1 flowcells. About 40 Gb of raw data were collected for N. tabacum, 66 Gb for N. sylvestris, and 63 Gb for N. tomentosiformis.

To polish and validate the assembled genomes, Illumina short-reads were prepared for N. tabacum using Tecan Celero EZ DNA-Seq Library Preparation Kits (Tecan, Männedorf, Switzerland) and sequenced as 2 × 151 bp paired-end reads on an Illumina NovaSeq 6000 to generate a total of 139 Gb. Illumina short-reads from ERR274527¹² and ERR274528¹³ for N. sylvestris and from ERR274540¹⁴ and ERR274542¹⁵ for N. tomentosiformis were retrieved from the Sequence Read Archive.

De novo Assembly and Chromosome Construction

For N. tabacum, Oxford Nanopore basecalling was performed with Guppy 6.3.7 using the plant super model. Long-read sequences were filtered using seqkit¹⁶ 2.2.0 to remove short (length <5000) and low-quality reads (average qscore <9), resulting in 98 Gb (N50 length: 28.5 kb).

For N. sylvestris and N. tomentosiformis, Oxford Nanopore basecalling was performed with Guppy 6.1.1 using the plant super model. Long-read sequences were filtered using seqkit¹⁶ 2.2.0 to remove short (length <2500) and low-quality reads (average qscore <9), resulting in 108 Gb (N50 length: 25.9 kb) and 41 Gb (N50 length: 28.2 kb) for N. sylvestris and N. tomentosiformis, respectively.
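The read filtering and N50 statistics described above can be illustrated with a minimal Python sketch. It is not the seqkit implementation: the function names are hypothetical, and the mean quality is averaged here in error-probability space, which is an assumption about how the "average qscore" is defined.

```python
import math

def mean_qscore(quality_string, offset=33):
    """Mean Phred score of a read, averaged over per-base error probabilities.

    Assumption: this is one common definition; the tool used in the study
    may compute the average differently.
    """
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))

def keep_read(sequence, quality_string, min_length, min_qscore=9):
    """Apply the length and quality thresholds quoted in the text."""
    return len(sequence) >= min_length and mean_qscore(quality_string) >= min_qscore

def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

With these helpers, `keep_read(seq, qual, min_length=5000)` would mirror the N. tabacum thresholds and `min_length=2500` those used for the two diploid species.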

Genomes were assembled with flye¹⁷ 2.9.1 using the nano-hq input pre-set and a read error rate of 0.03.

The Illumina short-reads were processed for each species using fastp¹⁸ 0.23.2 to trim adapters and low-quality bases, merge pairs, and remove low complexity and short (length <75) reads. During processing, the reads were split into two sets, one for assembly polishing which contained 80% of the processed Illumina reads and one for assembly validation containing 20% of the processed Illumina reads.
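The text does not state how the 80%/20% split was made. One possible approach, shown here purely as an illustrative sketch with a hypothetical function name, is to assign each read pair to a set deterministically from a hash of its identifier, so that both mates of a pair always end up in the same set.

```python
import hashlib

def assign_set(read_id, validation_fraction=0.2):
    """Deterministically assign a read (pair) to 'polishing' or 'validation'.

    Assumption: read_id is the identifier shared by both mates of a pair.
    """
    digest = hashlib.sha256(read_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "validation" if value < validation_fraction else "polishing"
```

Because the assignment depends only on the identifier, rerunning the split reproduces the same two sets without storing any state.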

The assembled genomes were polished with processed Illumina short-reads using fmlrc2¹⁹ 0.1.7. The remaining haplotig sequences were removed from the assemblies using purge_dups²⁰ 1.2.6, with cut-offs set to 3, 8, and 1000 for N. tabacum, to 5, 10, and 1000 for N. sylvestris, and to 2, 3, and 1000 for N. tomentosiformis.

Illumina short-reads were mapped to the assembly contigs using minimap2^21,22 2.24, duplicates marked with samblaster²³ 0.1.26, and filtered using samtools²⁴ 1.15.1. The coverage of the assembly contigs by Illumina sequencing was then calculated using samtools²⁴ 1.15.1, and contigs with less than 70% of their length with a coverage of at least 5 for N. tabacum and 15 for N. sylvestris and N. tomentosiformis were removed.
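The coverage criterion can be sketched as follows. This is a minimal illustration, assuming per-base depth in the three-column text layout written by `samtools depth` (contig, position, depth); the function name and arguments are hypothetical and not taken from the study.

```python
from collections import defaultdict

def well_covered_contigs(depth_lines, contig_lengths, min_depth, min_fraction=0.7):
    """Return contigs with at least min_fraction of their bases at depth >= min_depth.

    depth_lines: iterable of 'contig<TAB>position<TAB>depth' strings.
    contig_lengths: dict mapping contig name to its length in bases.
    """
    covered = defaultdict(int)
    for line in depth_lines:
        contig, _position, depth = line.rstrip("\n").split("\t")
        if int(depth) >= min_depth:
            covered[contig] += 1
    return {
        name for name, length in contig_lengths.items()
        if length > 0 and covered[name] / length >= min_fraction
    }
```

Under the thresholds quoted above, `min_depth` would be 5 for N. tabacum and 15 for N. sylvestris and N. tomentosiformis; contigs absent from the returned set are the ones to discard.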

Because the biological material used for sequencing originated from inbred plants that can be considered homozygotes, variants were called using freebayes²⁵ 1.3.6 with the ploidy parameter set to 1, ignoring sites with coverage higher than 200, and filtered with vcflib²⁶ 1.0.3 vcffilter using the parameters --filter-sites --info-filter "QUAL > 20 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPL > 1 & RPR > 1". Variants were then applied to the genomes using bcftools²⁴ 1.15.1 consensus to generate the polished assembly contigs.

Assembly contigs from plastid and mitochondrion were removed by mapping the polished assembly contigs to the N. tabacum plastid and mitochondrion sequences (NC_001879.2²⁷ and NC_006581.1²⁸, respectively) using minimap2^21,22 2.24 and filtering out contigs that mapped over more than 50% of their length.
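The organelle filter can be illustrated with a short sketch that reads alignments in PAF format (the default minimap2 output) with the contigs as queries. The function name is hypothetical, and merging overlapping alignments before computing the covered fraction is an assumption about how the 50% threshold was evaluated.

```python
def organelle_contigs(paf_lines, max_fraction=0.5):
    """Return contigs whose fraction aligned to organelle references exceeds max_fraction.

    paf_lines: PAF records; columns 1-4 are query name, query length,
               query start (0-based) and query end.
    """
    intervals = {}
    lengths = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        name, length = fields[0], int(fields[1])
        start, end = int(fields[2]), int(fields[3])
        lengths[name] = length
        intervals.setdefault(name, []).append((start, end))

    flagged = set()
    for name, spans in intervals.items():
        covered, current_end = 0, 0
        # Merge overlapping alignments so that shared bases are counted once.
        for start, end in sorted(spans):
            start = max(start, current_end)
            if end > start:
                covered += end - start
                current_end = end
        if covered / lengths[name] > max_fraction:
            flagged.add(name)
    return flagged
```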

Assembly contigs from possible contamination were identified using kraken2²⁹ 2.1.2 using the k2_pluspfp_20220908 database³⁰ and removed by only retaining contigs identified as belonging to Nicotiana or Solanum species.

PoreC reads were mapped to the cleaned assembly contigs using minimap2^21,22 2.24. Alignments with a mapping quality lower than 60 for N. tabacum and 30 for N. sylvestris and N. tomentosiformis were discarded, and contact pairs were created from the remaining alignments. The positions on the contigs of each contact pair were recorded as two consecutive lines in a BED file. The scaffolding of the contigs to a chromosome-level assembly was performed using yahs³¹ 1.2a1. Contact maps were prepared using PretextMap³² 0.1.9, manually curated and annotated in PretextView³³ 0.2.5, and the resulting scaffolds exported as chromosome-level sequences.
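The conversion of multi-way PoreC alignments into pairwise contacts can be sketched as below. This is an illustration rather than the procedure actually used: the function name, the tuple layout, and the BED name column (`read/1`, `read/2`) are assumptions; only the mapping-quality filter and the two-consecutive-lines layout come from the text.

```python
from itertools import combinations

def contact_pairs_to_bed(read_id, alignments, min_mapq):
    """Turn the alignments of one PoreC read into pairwise contacts.

    alignments: list of (contig, start, end, mapq) tuples for a single read.
    Returns BED lines; each contact occupies two consecutive lines.
    """
    # Discard alignments whose mapping quality is below the threshold.
    kept = [a for a in alignments if a[3] >= min_mapq]
    lines = []
    # Every unordered pair of retained alignments is one contact.
    for first, second in combinations(kept, 2):
        for mate, (contig, start, end, mapq) in enumerate((first, second), start=1):
            lines.append(f"{contig}\t{start}\t{end}\t{read_id}/{mate}\t{mapq}")
    return lines
```

A read with k retained alignments yields k(k-1)/2 contacts, which is why a single PoreC read can contribute several pairs to the contact map.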

The N. tabacum chromosome-level sequences were named and oriented using the PT markers, mapped to the sequences with hisat2³⁴ 2.2.1, together with the tobacco genetic map³⁵. Similarly, the N. tomentosiformis chromosome-level sequences were named and oriented using the N genetic map³⁶ combined with the tobacco PT markers³⁵. The chromosome-level assembly of the N. tomentosiformis genome was then used as a reference to name and orient the N. sylvestris chromosome-level sequences based on minimap2^21,22 2.24 mapping (Fig. 1).

The proportion of the assembly anchored to chromosomes reached 99.5%, 95.9%, and 97.6% of the total assembly lengths for N. sylvestris, N. tomentosiformis, and N. tabacum, respectively (Table 1).

Table 1 Chromosome length, total assembly length, and percentage of the assembly anchored to chromosomes for Nicotiana sylvestris, Nicotiana tomentosiformis, and Nicotiana tabacum.

When compared to the previously available N. tabacum genome assembly¹¹ generated from short-read sequencing, whole genome profiling, and optical and genetic mapping data, the new N. tabacum genome assembly has fewer contigs (decrease from 1,257,801 to 1,410) with a larger N50 length (increase from 9.1 kb to 11.8 Mb), and the proportion of the assembly anchored to chromosomes consequently improved from 64% to 97.6%.

Retrotransposon Prediction and Annotation

Nested retrotransposons were annotated by iteratively running genometools 1.6.2 ltrharvest³⁷ using the parameters -similar 70 -seed 20 -minlenltr 100 -maxlenltr 7000 -mindistltr 1000 -maxdistltr 15000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 3 -vic 10 -overlaps best, retaining the predictions matching to the RepeatExplorer Viridiplantae 3.0 dataset³⁸ using diamond³⁹ 2.1.6 blastx with the parameters --max-target-seqs 1 --ultra-sensitive --frameshift 15, and excising them from the assembly using samtools²⁴ 1.17. At most, 20 prediction-filtering-excision iterations were performed.

The predicted retrotransposons were classified by their homology to the RepeatExplorer Viridiplantae 3.0 dataset³⁸ sequences. Their age was estimated under the assumption that their long terminal repeats (LTRs) were identical at the time of insertion by aligning their 3′ and 5′ LTRs using clustalo^40,41 1.2.4, calculating their divergence (K) using the Kimura-2-parameter distance, and dividing it by twice the substitution rate (r) of 1.5 × 10⁻⁸ substitutions per site per year⁴².
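The age estimate corresponds to T = K / (2r). A minimal Python sketch of the calculation is given below, using the standard Kimura 2-parameter formula K = -½ ln[(1 - 2P - Q) √(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function names are hypothetical, and skipping gapped or ambiguous columns is an assumption about how the alignments were handled.

```python
import math

def k2p_distance(ltr5, ltr3):
    """Kimura 2-parameter distance between two aligned LTR sequences."""
    purines = {"A", "G"}
    pyrimidines = {"C", "T"}
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # ignore gaps and ambiguous bases (assumption)
        sites += 1
        if a == b:
            continue
        if {a, b} <= purines or {a, b} <= pyrimidines:
            transitions += 1
        else:
            transversions += 1
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

def insertion_age(k, rate=1.5e-8):
    """Age in years, T = K / (2 r), with r in substitutions per site per year."""
    return k / (2 * rate)
```

The factor of two reflects that both LTRs accumulate substitutions independently after insertion; with the rate quoted above, a divergence of K = 0.03 corresponds to about one million years.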

The predicted retrotransposons covered 26.6%, 32.2%, and 29.3% of the N. sylvestris, N. tomentosiformis, and N. tabacum genomes, respectively (Table 2). Regardless of the species, the most frequent element subclass is Ty3/gypsy|chromovirus|Tekay, representing between 40% and 56% of the total predicted retrotransposon length. The only element subclass that shows a marked difference between the three species is Ty3/gypsy|non-chromovirus|OTA|Tat|Ogre, which covers 116,167,517 bp (18.8% of the total predicted retrotransposon length) in N. sylvestris, and only 21,672,795 bp (3.9%) in N. tomentosiformis. In N. tabacum, it covers 135,653,424 bp (11.6%), close to the sum of its coverage in the two precursor species (137,840,312 bp). Looking at the predicted insertion ages, a recent expansion of the Alesia and Angela subclasses of Ty1/copia and of the Ogre subclass of Ty3/gypsy retrotransposons in N. sylvestris and N. tabacum, but not in N. tomentosiformis, is observed (Fig. 2).

Table 2 Predicted retrotransposons length and genome coverage statistics.

Coding-gene Prediction and Annotation

Genomes were masked using blast^43,44 2.14.0 windowmasker with dusting, and augustus⁴⁵ 3.5.0 was used for gene prediction. A training dataset was created by separately mapping S. lycopersicum, S. tuberosum, and Nicotiana attenuata cDNA and CDS from Ensembl 56 using minimap2^21,22 2.26 to the N. sylvestris and N. tomentosiformis genomes. Any sequence with an annotation matching ‘hypothetical’, ‘unknown’, ‘polyprotein’, ‘domain-containing’, ‘chloroplast’, or ‘mitochondria’ was omitted from the mapping. Gene models were constructed from the mapped sequences using bedtools⁴⁶ 2.30.0 and filtered using gffread⁴⁷ 0.12.7 with the parameters -V -H -U -N -P -J -M -K -Q -Y -Z -F --keep-exon-attrs. Training sequences were then extracted from the genomes using the obtained GFF annotation file and adding 1,000 bp flanking regions. One-fourth of the gene models were set aside for testing for each combination of species and dataset. After merging the training and testing datasets, a Nicotiana model was trained using the etraining and optimize_augustus.pl programs bundled with augustus⁴⁵ 3.5.0. A total of 10,092 loci were used for training, and 3,362 loci were used for testing.

To provide hints for the augustus predictions, Ensembl 56 proteins from S. lycopersicum, S. tuberosum, and N. attenuata were mapped to the genomes using miniprot⁴⁸ 0.11, and aletsch⁴⁹ 1.0.3 was used to construct transcripts from Illumina paired-end RNA-Seq reads from SRR11912457⁵⁰, SRR2106531⁵¹, ERR274387⁵², ERR274388⁵³, ERR274389⁵⁴, ERR274390⁵⁵, ERR274391⁵⁶, ERR274392⁵⁷, ERR274393⁵⁸, ERR274394⁵⁹, ERR274395⁶⁰, ERR274396⁶¹, ERR274397⁶², ERR274398⁶³, ERR274399⁶⁴, ERR274400⁶⁵, ERR274401⁶⁶, ERR274402⁶⁷, ERR274403⁶⁸, ERR274404⁶⁹, and ERR274405⁷⁰ mapped using hisat2³⁴ 2.2.1, and Oxford Nanopore long cDNA reads from SRR12045991⁷¹, SRR12045992⁷², SRR12045993⁷³, and SRR12045994⁷⁴ mapped with minimap2^21,22 2.26.

Augustus⁴⁵ 3.5.0 predictions were obtained using the trained Nicotiana model, the extrinsic.MPE.cfg extrinsic configuration file, and hints derived from the miniprot⁴⁸ 0.11 and aletsch⁴⁹ 1.0.3 output with priorities of 4 and 3, respectively. Other augustus⁴⁵ 3.5.0 parameters used were --alternatives-from-evidence=off --alternatives-from-sampling=off --softmasking=1 --strand=both --genemodel=complete --UTR=on. Predicted gene models without supporting hints that did not encode a protein found in a uniprot eudicotyledons proteins dataset filtered to omit proteins with annotations matching ‘uncharacterized’, ‘unknown’, ‘hypothetical’, ‘genome’, ‘domain-containing’, ‘family’, ‘transmembrane’, ‘putative’, ‘probable’, ‘predicted’, ‘member’, ‘fragment’, ‘truncated’, ‘superfamily’, ‘chloroplast’, ‘mitochond’, ‘low quality’, or ‘At.g’ when using diamond³⁹ 2.1.6 blastx with the parameters --max-target-seqs 1 --min-score 200 --ultra-sensitive --frameshift 15 were removed.

To complement the augustus predictions, additional gene models were created by separately mapping the predicted N. sylvestris, N. tomentosiformis, and N. tabacum cDNA and CDS and the S. lycopersicum, S. tuberosum, and N. attenuata cDNA and CDS from Ensembl 56 to the genomes using minimap2^21,22 2.26. Models that overlapped augustus predictions by 25% or more according to bedtools⁴⁶ 2.30.0 intersect were then filtered out by IDs using gffread⁴⁷ 0.12.7 with the parameters -P -M -K -Q -Y -Z -F, and the remaining gene models were added to those predicted with augustus⁴⁵ 3.5.0.
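The overlap criterion can be illustrated with a small sketch. It is not a reimplementation of bedtools intersect: the function names are hypothetical, intervals are simplified to (chromosome, start, end) tuples, and the fraction is assumed to be measured relative to the length of the mapped model, one prediction at a time.

```python
def overlap_fraction(a, b):
    """Fraction of interval a covered by interval b; intervals are (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return max(shared, 0) / (a[2] - a[1])

def drop_overlapping(models, predictions, min_fraction=0.25):
    """Keep models that no prediction overlaps by min_fraction or more."""
    return [
        m for m in models
        if all(overlap_fraction(m, p) < min_fraction for p in predictions)
    ]
```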

Functional annotation of the gene models was performed using diamond³⁹ 2.1.6 blastx with the parameters --max-target-seqs 1 --min-score 200 --ultra-sensitive --frameshift 15 and uniprot eudicotyledons proteins filtered to omit proteins with annotations matching ‘uncharacterized’, ‘unknown’, ‘hypothetical’, ‘genome’, ‘domain-containing’, ‘family’, ‘transmembrane’, ‘putative’, ‘probable’, ‘predicted’, ‘member’, ‘fragment’, ‘truncated’, ‘superfamily’, ‘chloroplast’, ‘mitochond’, ‘low quality’ or ‘At.g’. Gene models overlapping with retrotransposons by 75% or more according to bedtools⁴⁶ 2.30.0 intersect and those with annotations matching ‘transposon’, ‘transposase’, ‘polyprotein’, ‘gagpol’, or ‘gag-pol’ were excluded to yield the final set of annotated gene models.
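The exclusion of uninformative protein descriptions can be sketched as a pattern filter. How the terms were matched is not stated in the text; the sketch below assumes case-insensitive substring matching for the plain terms and treats ‘At.g’ as a case-sensitive regular expression, so that the dot can match the chromosome digit of an Arabidopsis-style locus identifier. The function name is hypothetical.

```python
import re

EXCLUDED_TERMS = [
    "uncharacterized", "unknown", "hypothetical", "genome", "domain-containing",
    "family", "transmembrane", "putative", "probable", "predicted", "member",
    "fragment", "truncated", "superfamily", "chloroplast", "mitochond",
    "low quality", "At.g",
]

# Assumption: plain terms are matched as case-insensitive substrings.
_PLAIN = re.compile(
    "|".join(re.escape(t) for t in EXCLUDED_TERMS if t != "At.g"),
    re.IGNORECASE,
)
# Assumption: 'At.g' is a regular expression in which '.' matches any character.
_LOCUS = re.compile(r"At.g")

def is_informative(description):
    """True if a protein description matches none of the excluded terms."""
    return not (_PLAIN.search(description) or _LOCUS.search(description))
```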

Data Records

The genomes and annotations are available from Zenodo under records 8256252⁷⁵, 8256254⁷⁶, and 8256256⁷⁷. The trained Nicotiana model for augustus gene prediction is available from Zenodo under record 8256280⁷⁸.

The genomes have been deposited at DDBJ/ENA/GenBank under the accessions ASAF00000000⁷⁹, ASAG00000000⁸⁰ and AWOJ00000000⁸¹.

Raw sequencing data are available from the National Center for Biotechnology Information Sequence Read Archive under accessions SRR25685126⁸², SRR25685127⁸³, SRR25685128⁸⁴, SRR25685129⁸⁵, and SRR25685130⁸⁶ in BioProject PRJNA182500, SRR25685034⁸⁷, SRR25685035⁸⁸, SRR25685036⁸⁹, SRR25685037⁹⁰, SRR25685038⁹¹, SRR25685039⁹², and SRR25685040⁹³ in BioProject PRJNA182501, and SRR25685386⁹⁴, SRR25685387⁹⁵, SRR25685388⁹⁶, SRR25685389⁹⁷, SRR25685390⁹⁸, SRR25685391⁹⁹, SRR25685392¹⁰⁰, SRR25685393¹⁰¹, SRR25685394¹⁰², SRR25685395¹⁰³, and SRR25685396¹⁰⁴ in BioProject PRJNA208210 for N. sylvestris, N. tomentosiformis, and N. tabacum, respectively.

Technical Validation

The quality and completeness of the assemblies were assessed with yak¹⁰⁵ 0.1 using the 20% of the processed Illumina short-reads that were set aside for that purpose. For N. tabacum, a Quality Coverage of 0.982 and a Quality Value of 38.1 were obtained; for N. sylvestris, the values were 0.993 and 41.5; and for N. tomentosiformis, 0.991 and 43.2.
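Assuming the Quality Value is Phred-scaled (QV = -10 log10 of the per-base error probability), it can be converted into an approximate error rate with a short sketch such as the one below. The function name is hypothetical.

```python
def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)

# Quality Values reported in the text for the three assemblies.
for species, qv in [
    ("N. tabacum", 38.1),
    ("N. sylvestris", 41.5),
    ("N. tomentosiformis", 43.2),
]:
    rate = qv_to_error_rate(qv)
    print(f"{species}: QV {qv} is roughly one error per {1 / rate:,.0f} bases")
```

On this scale, each increase of 10 in the Quality Value corresponds to a tenfold lower error rate.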

The quality of the gene predictions from the trained Nicotiana model was evaluated using the prepared testing sets and compared with results obtained using the already available arabidopsis, tomato, and coyote_tobacco models (Table 3).

Table 3 Augustus testing metrics with the arabidopsis, tomato, coyote_tobacco, and Nicotiana models.

The completeness of the gene model sets was evaluated using BUSCO¹⁰⁶ 5.4.7 with the solanales_odb10 lineage dataset. Completeness values of 98.1%, 95.1%, and 96.1% at the transcript level and of 97.0%, 92.8%, and 93.4% at the protein level were obtained for N. tabacum, N. sylvestris, and N. tomentosiformis, respectively (Table 4). These values are similar to those obtained for S. lycopersicum (95.0% at the transcript level and 92.3% at the protein level).

Table 4 Statistics of the BUSCO genome, transcripts, and proteins completeness evaluation using the solanales_odb10 lineage dataset for Nicotiana sylvestris, Nicotiana tomentosiformis and Nicotiana tabacum.

Code availability

All software used in this work is publicly available, with versions and parameters clearly described in Methods. If no detailed parameters were mentioned for a software, the default parameters suggested by the developer were used. No custom code was used during this study for the curation and/or validation of the datasets.

References

Knapp, S., Bohs, L., Nee, M. & Spooner, D. M. Solanaceae—A model for linking genomics with biodiversity. Comp. Funct. Genomics 5, 285–291 (2004).
Olmstead, R. G. et al. A molecular phylogeny of the Solanaceae. Taxon 57, 1159–1181 (2008).
Clarkson, J. J. et al. Phylogenetic relationships in Nicotiana (Solanaceae) inferred from multiple plastid DNA regions. Mol. Phylogenet. Evol. 33, 75–90 (2004).
Clarkson, J. J. et al. Long‐term genome diploidization in allopolyploid Nicotiana section Repandae (Solanaceae). New Phytol. 168, 241–252 (2005).
D’Andrea, L. et al. Polyploid Nicotiana section Suaveolentes originated by hybridization of two ancestral Nicotiana clades. Front. Plant Sci. 14 (2023).
Baldwin, I. T. Inducible Nicotine Production in Native Nicotiana as an Example of Adaptive Phenotypic Plasticity. J. Chem. Ecol. 25, 3–30 (1999).
Kaminski, K. P. et al. Alkaloid chemophenetics and transcriptomics of the Nicotiana genus. Phytochemistry 177, 112424 (2020).
Tissier, A. Trichome Specific Expression: Promoters and Their Applications. in Transgenic Plants - Advances and Limitations (InTech, 2012).
Sierro, N. et al. Reference genomes and transcriptomes of Nicotiana sylvestris and Nicotiana tomentosiformis. Genome Biol. 14, R60 (2013).
Sierro, N. et al. The tobacco genome sequence and its comparison with those of tomato and potato. Nat. Commun. 5, (2014).
Edwards, K. D. et al. A reference genome for Nicotiana tabacum enables map-based cloning of homeologous loci implicated in nitrogen utilization efficiency. BMC Genomics 18, (2017).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274527 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274528 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274540 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274542 (2013).
Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11, e0163962 (2016).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Chen, S. Ultrafast one‐pass FASTQ data preprocessing, quality control, and deduplication using fastp. Imeta 2, (2023).
Mak, Q. X. C., Wick, R. R., Holt, J. M. & Wang, J. R. Polishing De Novo nanopore assemblies of bacteria and eukaryotes with FMLRC2. Mol. Biol. Evol. 40, (2023).
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19, (2018).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).
Faust, G. G. & Hall, I. M. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics 30, 2503–2505 (2014).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, (2021).
Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing https://doi.org/10.48550/ARXIV.1207.3907 (2012).
Garrison, E., Kronenberg, Z. N., Dawson, E. T., Pedersen, B. S. & Prins, P. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput. Biol. 18, e1009123 (2022).
NCBI Genome Project. Nicotiana tabacum plastid, complete genome. Nucleotide https://identifiers.org/nucleotide/NC_001879.2 (2000).
NCBI Genome Project. Nicotiana tabacum mitochondrion, complete genome. Nucleotide https://identifiers.org/nucleotide/NC_006581.1 (2004).
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).
Langmead, B. Kraken 2, KrakenUniq and Bracken indexes https://benlangmead.github.io/aws-indexes/k2 (2022).
Zhou, C., McCarthy, S. A. & Durbin, R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics 39, btac808 (2023).
High Performance Algorithms Group. The Wellcome Sanger Institute. Paired REad TEXTure Mapper https://github.com/wtsi-hpag/PretextMap (2022).
High Performance Algorithms Group. The Wellcome Sanger Institute. OpenGL Powered Pretext Contact Map Viewer https://github.com/wtsi-hpag/PretextView (2022).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Bindler, G. et al. A high density genetic map of tobacco (Nicotiana tabacum L.) obtained from large scale microsatellite marker development. Züchter Genet. Breed. Res. 123, 219–230 (2011).
Wu, F. & Tanksley, S. D. Chromosomal evolution in the plant family Solanaceae. BMC Genomics 11, 182 (2010).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, (2008).
Neumann, P., Novák, P., Hoštáková, N. & Macas, J. Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification. Mob. DNA 10, (2019).
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
Sievers, F. & Higgins, D. G. Clustal Omega for making accurate alignments of many protein sequences: Clustal Omega for Many Protein Sequences. Protein Sci. 27, 135–145 (2018).
Sievers, F. et al. Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, (2011).
Mokhtar, M. M., Alsamman, A. M. & El Allali, A. PlantLTRdb: An interactive database for 195 plant species LTR-retrotransposons. Front. Plant Sci. 14, (2023).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 1–9 (2009).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).
Li, H. Protein-to-genome alignment with miniprot. Bioinformatics 39, btad014 (2023).
Shao, M. Assembler for multiple RNA-seq samples https://github.com/Shao-Group/aletsch (2020).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR11912457 (2020).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR2106531 (2016).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274387 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274388 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274389 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274390 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274391 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274392 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274393 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274394 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274395 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274396 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274397 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274398 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274399 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274400 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274401 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274402 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274403 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274404 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:ERR274405 (2013).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR12045991 (2021).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR12045992 (2021).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR12045993 (2021).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR12045994 (2021).
Sierro, N. Nicotiana sylvestris genome assembly and annotation. Zenodo https://doi.org/10.5281/zenodo.8256252 (2023).
Sierro, N. Nicotiana tomentosiformis genome assembly and annotation. Zenodo https://doi.org/10.5281/zenodo.8256254 (2023).
Sierro, N. Nicotiana tabacum genome assembly and annotation. Zenodo https://doi.org/10.5281/zenodo.8256256 (2023).
Sierro, N. Nicotiana model for augustus gene prediction, Zenodo, https://doi.org/10.5281/zenodo.8256280 (2023).
Sierro, N. & Ivanov, N. V. Nicotiana sylvestris, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:ASAF00000000 (2023).
Sierro, N. & Ivanov, N. V. Nicotiana tomentosiformis, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:ASAG00000000 (2023).
Sierro, N. & Ivanov, N. V. Nicotiana tabacum cultivar K326, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:AWOJ00000000 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685126 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685127 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685128 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685129 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685130 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685034 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685035 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685036 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685037 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685038 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685039 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685040 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685386 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685387 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685388 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685389 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685390 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685391 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685392 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685393 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685394 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685395 (2023).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRR25685396 (2023).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).

Acknowledgements

We thank Simon Goepfert and Nicolas Bakaher for scientific discussions, and Rebecca Higgins for manuscript editorial revision.

Author information

Authors and Affiliations

PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, CH-2000, Neuchâtel, Switzerland
Nicolas Sierro, Mehdi Auberson, Rémi Dulize & Nikolai V. Ivanov

Authors

Nicolas Sierro
Mehdi Auberson
Rémi Dulize
Nikolai V. Ivanov

Contributions

N.S. and N.V.I. conceived this project; M.A. and R.D. performed the experiments; N.S. assembled the genomes, generated the annotation sets, and performed the data analysis; N.S. and N.V.I. wrote and revised the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Nicolas Sierro.

Ethics declarations

Competing interests

N.S., M.A., R.D., and N.V.I. are employees of Philip Morris International.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sierro, N., Auberson, M., Dulize, R. et al. Chromosome-level genome assemblies of Nicotiana tabacum, Nicotiana sylvestris, and Nicotiana tomentosiformis. Sci Data 11, 135 (2024). https://doi.org/10.1038/s41597-024-02965-2

Received: 09 November 2023
Accepted: 12 January 2024
Published: 26 January 2024
DOI: https://doi.org/10.1038/s41597-024-02965-2