Genomic signatures of the evolution of defence against its natural enemies in the poisonous and medicinal plant Datura stramonium (Solanaceae)

Tropane alkaloids and terpenoids are widely used in the medicine and pharmaceutic industry and evolved as chemical defenses against herbivores and pathogens in the annual herb Datura stramonium (Solanaceae). Here, we present the first draft genomes of two plants from contrasting environments of D. stramonium. Using these de novo assemblies, along with other previously published genomes from 11 Solanaceae species, we carried out comparative genomic analyses to provide insights on the genome evolution of D. stramonium within the Solanaceae family, and to elucidate adaptive genomic signatures to biotic and abiotic stresses in this plant. We also studied, in detail, the evolution of four genes of D. stramonium—Putrescine N-methyltransferase, Tropinone reductase I, Tropinone reductase II and Hyoscyamine-6S-dioxygenase—involved in the tropane alkaloid biosynthesis. Our analyses revealed that the genomes of D. stramonium show signatures of expansion, physicochemical divergence and/or positive selection on proteins related to the production of tropane alkaloids, terpenoids, and glycoalkaloids as well as on R defensive genes and other important proteins related with biotic and abiotic pressures such as defense against natural enemies and drought.

www.nature.com/scientificreports/ also affected the number of genes annotated. Nonetheless, this number in both genomes approximately is equal to the expected number in Solanaceae species. Furthermore, the percentage of missing BUSCOs was relatively low for both genomes, transcriptomes and proteomes 25 . Here, the number of complete BUSCOs for our genome assemblies, transcriptomes and proteomes is very similar to that reported for Tomato, Potato, Eggplant, Pepper, Tobacco and its wild relatives, as well as P. inflata and P. axilaris [9][10][11]13,14,17,26,27 .
Repetitive landscape of Datura genomes. Datura genomes are rich in repetitive DNA (as are most other plant genomes 28 ). The repetitive landscape of our genomes revealed that 76.04% and 74.11% of the genomes are composed by repetitive elements (Supplementary Table S6 online, Fig. 2). These results reveal a higher proportion of repetitive elements than in other Solanaceae genomes, such as tomato, potato and Petunia species, and nearly similar to the repetitive landscapes of Nicotiana and Capsicum genomes 9,10,14,26,27 (Supplementary Table S7 online). Long terminal repeats (LTR) elements are the most abundant in the D. stramonium genomes (Supplementary Table S6 online, Fig. 2), covering 65.88% and 63.41% of the genomes for Ticumán an Teotihuacán, respectively (Supplementary Table S6 online, Fig. 2). The Gypsy family is the most LTR represented in both genomes covering 61.33% and 58. 71% for Ticumán and Teotihuacán genomes, respectively (Fig. 2). The Copia family represents almost the rest of the repetitive landscape for both genomes (Fig. 2). An analysis of the history of repetitive elements between Nicotiana and Solanum species revealed that all Nicotiana species experienced a recent independent wave of Gypsy retrotransposon expansion 12,26 and this seems to have happened also in the Datura species.
Comparative genomic analyses. Protein coding genes from 11 genomes were sourced from the Sol Genomics Network (https ://solge nomic s.net/, see "Materials and Methods" section). We used these genomes along with both D. stramonium genomes to construct orthogroups (gene families) using OrthoFinder v2.3.3 29 . This program assigned 480,594 genes out of 536,483 (89.6% of total) to 35  On the x-axis, contigs are ordered from the largest to smallest. The y-axis gives the size of the x largest contigs in the assembly. This is the total genome assembled (red line = Ticumán, blue line = Teotihuacán). (e) BUSCO plots for the two Datura stramonium genomes, transcriptomes and proteomes predicted by MAKER program. The plot shows quantitative measures for the assessment of the genome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from the "Solanaceae odb10*" database. See Supplementary Table S3 Table S11 online). Likewise, we found with the MapMan4 annotation a total of 14 enriched proteins with signal of expansion (Supplementary Table S12 online), 23 proteins with positively selected conserved amino acids (Supplementary Table S13 online), and 54 enriched proteins with physicochemical divergence (Supplementary Table S14 online). We found that either over-represented domains (enrichment test using InterproScan database) as well as over-represented proteins (enrichment test using MapMan4 database) with signal of expansion, positive selection or physicochemical divergence are related with immunity and defence against pathogens, virus, fungi and insect herbivores as well as related with responses to abiotic stresses such as drought and nutrients deficiency (Tables 2, 3, Supplementary Table S14 online).
Domains associated with defensive proteins (R genes). Several domains have been associated as fundamental components of the R genes in D. stramonium genomes ( Table 2) but some notable domains in expanded and positively selected proteins were found in D. stramonium (Table 2); the Virus X resistance proteinlike (IPR038005) and Rx, N-terminal (IPR041118) are domains that confer resistance against the potato virus X 31,32 . IPR038005 domain has been identified in a family of resistance proteins that recognize pathogen effector proteins and trigger a response that may be as severe as localized cell death 32 . The NB-ARC (IPR002182) and Leucine-rich repeat (LRR) domain superfamily (IPR032675) ( Table 2), interact and release a signal to initiate an event of immunity against pathogens 31,33 .
Resistance to a diverse range of pathogens, including nematodes, fungi, bacteria, and viruses involves LRR proteins either as resistance proteins or as proteins required for resistance proteins to function 34 . We found that the LRR-XII kinase and SD-1 kinase proteins had positively selected codons in Datura genomes (Supplementary Table S13 online). Magalhães et al. 35 found a large expansion of LRR-XII in Citrus genomes, suggesting that it might play a key role in adaptive responses in host-pathogen co-evolution, related to the perennial life cycle and domestication of the citrus crop species. Likewise, it has demonstrated that SD-1 kinase protein is a plant receptor with roles in signaling and plant defense 36 . Moreover, we found several proteins belonging to the Kinase superfamily with significant physicochemical divergence (Supplementary Table S14 online). Kinase superfamily proteins have been related with different stresses including pathogen invasion 37 .
The Late blight resistance domain R1 (IPR021929) ( Table 2) showed a significant expansion signal. The R1 is a protein for resistance to late blight, the most destructive disease in potato cultivation worldwide 38 . On the other  (Table 2). This domain is involved in non-host resistance (NHR) or plant immunity to non-adapted pathogen species 39 . Domains and proteins related with the biosynthesis of Terpenoids. We found in both enrichment analyses (InterPro and MapMan4 annotation) proteins with signal of expansion, physicochemical divergence and positively selected that are directly related with the biosynthesis of terpenoids (Table 3, Supplementary Tables S9-S14 online). For instance, Cytochrome P450 domain is related in the biosynthesis of terpenoids and has been associated as a key domain in the production of different terpenoids that mediate plant defence against herbivores 40 . We also found domains directly related in the biosynthesis of terpenoids in D. stramonium protein families with significant expansion and with positively selected codons. These families comprise Terpene synthases with the N-terminal domain (IPR001906), Terpenoid cyclases/protein prenyltransferase alphaalpha toroid (IPR008930), Terpene synthase, metal-binding domain (IPR005630) and Terpene cyclase-like 1, C-terminal (IPR034741) ( Table 3). Enrichment analysis with MapMan4 revealed that mono/sesquiterpene/ diterpene synthase family proteins showed signal of expansion and with positively selected codons (Supplementary Tables S12, S13 online). Several studies have indicated that terpene synthases are the primary enzymes in the formation of terpene metabolites 41 . It has reported that terpene compounds can act as defense metabolites against herbivores or they also play a role as attractants to carnivorous arthropods that prey upon or parasitise herbivores, and so reduce further damage [41][42][43][44] . A recent study carried out with experimental populations of D. stramonium has identified a triterpenoid compound involved as defense against the most dangerous herbivore of this species, the larvae of Lema daturaphila (Chrysomelidae) 21 . Also, several terpenoids have been reported to mediate plant-plant communication 45 .
Enriched proteins related with abiotic stresses. A notable domain, the SNF1-related protein kinase regulatory subunit beta-2 (IPR030070) was detected in proteins with physicochemical divergence (Supplementary Tables S9, S15 online). This domain has been implicated in the response against drought, and in the efficiency of carbohydrate metabolism and in the response to glucose limitation 46,47 . Also, we found over-represented domains in proteins with signal of expansion and with positively selected codons containing Galactose oxidase/kelch, beta-propeller (IPR011043) domain (Supplementary Table S15 online). This domain is also involved in the stress responses induced under Fe deficiency in the roots and also related as defence protein 48 .
Kinase proteins showed signal of physicochemical divergence (Supplementary Table S15 online), and these proteins have been related with different abiotic stresses including light, temperature, and nutrient deprivation 35 .
Genes involved in the tropane alkaloid biosynthesis. Notable domains involved in the tropane alkaloids pathway were proteins of families expanded in the Datura branch and with positively selected conserved Table 2. Classifications of domains related with defence against natural enemies (pathogens, viruses, fungi, oomycete, herbivores) subject to expansion, positive selection or physicochemical divergence. Some domains were detected to be expanded and positively selected. p value is showed for each analysis. The entire list for each analysis is showed in Supplementary Tables S9, S10 and S11. Ex expanded, PS positive selected, FQ physicochemical divergence. www.nature.com/scientificreports/ amino acids (codons) (Supplementary Tables S10, S11 online). For instance, Cytochrome P450 (IPR001128), Transferase (IPR003480), NADH:ubiquinone oxidoreductase (IPR003918) and Phosphoethanolamine N-methyltransferase (IPR025771) ( Table 3). Cytochrome P450 is involved in the rearrangement of Littorine (a kind of tropane alkaloid) to produce atropine/hyosciamine and scopolamine 49,50 . This step is very important in the biosynthesis of scopolamine via the Hyoscyamine (6S)-dioxygenase gene (h6h) 49,50 . Indeed, the enzymes that participate in the tropane alkaloid biosynthesis belong to the classes of oxidoreductases and transferases 22 , such as we detected in enriched proteins with signal of expansion, positively selected and proteins with physicochemical divergence (Table 3). Within tropane alkaloids, the pmt gene family showed significant gene expansion during the evolution of the Datura genus; the last common ancestor of D. stramonium had only one gene (Supplementary Table S16 online), while D. stramonium Ticumán and Teotihuacán have three and two gene copies, respectively (Supplementary material S16 online). Pfam annotation of pmt genes showed that the gene dati7568, which belongs to the Ticumán genome, has an extra domain of spermine-synthase in comparison with its homolog from Teotihuacán (Supplementary Fig. S2 online). Kasukabe et al. 51 found that the overexpression of spermidine-synthase enhanced tolerance to multiple environmental stresses including herbivory and pathogenesis. Moreover, pmt is the key gene catalyzing the formation of N-methylputrescine from putrescine and S-adenosyl-L-methionine and this enzyme triggers the production of hygrine and other different tropane alkaloids 22 . HPLC-TOF-MS results revealed that the plant of Ticumán showed 26.63-fold of hygrine concentration than the plant of Teotihuacán (Supplementary Table S17 online, Fig. 5 a). In fact, a differentiation of 59-fold was obtained in total tropane alkaloid concentration Table 3. Classifications of domains related with the biosynthesis of secondary compounds and that act as defence against natural enemies (pathogens, viruses, fungi, oomycete, herbivores) subject to expansion, positive selection or physicochemical divergence. Some domains were detected to be expanded and positively selected. p value is showed for each analysis. The entire list for each analysis is showed in Supplementary Tables S9, S10 and S11. Ex expanded, PS positive selected, FQ physicochemical divergence.  Table S16 online). However, we observed four copies of this gene in both Datura genomes (Fig. 6). The gene date9161 (Teotihuacán) and dati33033 (Ticumán) have two domains absent in the other tpr I Datura genes; Sulfakinin (PF08257) and ECR1_N domain (PF14382) (Fig. 6). Wei et al. 52 found that the sulfakinins may reduce the sensitivity of taste receptor of Schistocerca gregaria (Acrididae). Here, we found that sulfakinin domain is observed in D. stramonium but not in the other Solanaceae species (Fig. 6). Likewise, ECR1_N is an N-terminal region of the exosome complex of resistance proteins 33 . The Datura tpr I genes date9170 and dati33044 had a domain architecture different from the tpr genes of the other studied Solanaceae (Fig. 6). However, the gene dati33044 (Ticumán) has one additional domain adh_short_C2 (PF13561), in comparison with its homolog date9170 (Teotihuacán) (Fig. 6). Likewise, dati33027 showed an additional adh_short_C2 domain in comparison with its homolog date20542 (Fig. 6). We observed that the Ticumán plant produced ca. 32 times more tropine than the Teotihuacán plant (Supplementary Table S17 online, Fig. 5 a). This chemical difference in tropine concentration could be related with this additional domain (adh_short_C2) that was observed in the Ticumán genome. It has reported that the domain adh_short_C2 plays a role as anti-microbial and anti-parasitic molecules 53 .  www.nature.com/scientificreports/ Other gene copies of tpr I, date11128 (Teotihuacán) and dati22507 (Ticumán) have the domain RHH_5 (PF07878) (Fig. 6), this domain has been described as a toxin-antitoxin system (TA) 54 . Toxin-antitoxin genes are often inherited through horizontal gene transfer and are associated with pathogenic bacteria 55 . TA systems are numerous in many plant-associated bacteria, but very little is known regarding their function and distribution in phytopathogens 54 . This TA system is also found in N. attenuata, N. tabacum, N. sylvestris and C. annuum gabriusculum (Fig. 6). Several studies about the production of secondary metabolites as defence against natural enemies have been reported for these species 1 .
Evidence of expansion for the gene family tpr II was detected (Supplementary Table S16 online). The analysis revealed that the last common ancestor of D. stramonium had one tpr II gene, while five and three copies of tpr II genes showed to be expanded from Teotihuacán and Ticumán, respectively (Supplementary Table S16 online). An interesting domain, (PapA_C) is present in the gene copy date754 (Supplementary Fig. S3 online). This domain is so-called Polyketide-associated protein (Pap) that belongs to the subfamily of acyltransferases and has been found to be involved in the biosynthesis of secondary metabolites 53 . A Cobalamin (CobU) domain is found in the Datura gene dati23799 (Supplementary Fig. S3 online). Vascular plants neither synthesize nor require vitamin B12 because they contain cobalamin-independent methionine synthase (MetE) 56 . However, herbivores have been found to obtain their dietary quota of cobalamin from plants contaminated with cobalamin-producing soil bacteria (rhizobia) that grow in roots and nodules of plants 56 . Thus, more studies are needed to prove and assess the interaction between mycorrhizal and D. stramonium.
Hyoscyamine (6S)-dioxygenase gene (h6h) is the last rate-limiting enzyme directly catalyzing the formation of atropine and scopolamine in tropane alkaloids biosynthesis pathway 22 . As we have pointed out, scopolamine is the main secondary metabolite of D. stramonium with pharmaceutic and medical interest 57 . First, our results revealed that two copies of h6h are found in both genomes of D. stramonium (Fig. 5b). However, these copies were distributed in two different gene families (gene families "OG0028637" and "OG0043057") ( Fig. 5b). Indeed, both gene families were composed by only two genes; one from the Ticumán plant and one from the Teotihuacán plant. Therefore, we used 17 h6h genes (13 genes retrieved from Uniprot database and four from our D. stramonium genomes, see "Materials and Methods" section) to construct an artificial gene family. We carried out a multiple sequence alignment and reconstruct the phylogeny. Also, using the Pfam database the protein domain architecture of these genes was identified. The h6h copy gene of D. stramonium (Daturastramonium-Tic8550_OG0028637) from Ticumán has two domains of DIOX_N (PF14226), while the above homologs have only one DIOX_N domain (Fig. 5b). In contrast, only three genes in this h6h family are composed of a single domain (2OG-FeII_Oxy, PF03171); two of them belong to our D. stramonium genomes and one are found in the Medicago truncatula genome (Fig. 5b). The DIOX_N domain is composed of a highly conserved N-terminal region of proteins with Oxoglutarate/iron-dependent dioxygenase activity (2OG-FeII_Oxy) 57 . This extra domain (DIOX_N) observed in one h6h gene from the Ticuman plant (DaturastramoniumTic8550_OG0028637) could be involved in the higher production of atropine, anisodamine and scopolamine than the plant from Teotihuacán. (Supplementary Table S16 online, Fig. 5a,b).
In summary, the Ticumán genome showed 57.31-fold of total tropane alkaloids than the Teotihuacán genome. Differences in several tropane alkaloid concentrations between Ticumán and Teotihuacán seem to be related with the differences in domain architecture of the four genes studied here involved in the tropane alkaloid biosynthesis. However, these results must be confirmed with expression data and/or qPCR of these important genes. Different evidence has pointed out that tropane alkaloids are implicated in resistance against herbivores in D. stramonium 21,[58][59][60] and that the selective value of tropane alkaloids preventing or reducing herbivory varies among populations of this plant species, depending on the type of enemies (specialist or generalists herbivores, fungi, pathogens, oomycete) 21,58-60 .

Conclusions
The information generated here will provide support for future studies in the non-model species D. stramonium. Understanding the evolution, adaptation and the ecological role of tropane alkaloids and other secondary metabolites such terpenoids is necessary to disentangle its role in defence against natural enemies. We described how the D. stramonium genome expanded and we detected positive selection and physicochemical divergence on terpenoids, tropanes, glycosides, R genes, and proteins related with abiotic stresses such as drought. Indeed, the availability of these draft genomes provides a tool for future studies to better understand the genome evolution of the Solanaceae family and for other scientific fields such as medicine.

Materials and methods
Selection of the parent genomes. The two selected genomes were extracted from two different populations, Ticumán in the State of Morelos, 18° 45′ 39.90" N, 99° 7′ 13.86" W, and Teotihuacán, in the State of Mexico, 19° 41′ 6.96" N, 98° 52′ 19.63" W 3,19 . We analyzed 21 tropane alkaloids via HPLC-TOF-MS from 47 (Ticumán) and 45 (Teotihuacán) plants under controlled conditions (green house experiment) and we selected these two individuals that were the most differentiated in their total tropane alkaloid concentration (Teotihuacán = 1018 ng/g; Ticumán = 59,051 ng/g). Chemical conditions and details of samples extraction and mass spectrometry analyses can be consulted in De-la-Cruz et al. 3 .

DNA extraction, genomic library preparation and sequencing.
To obtain a high-quality de novo assembly, we combined data generated from short insert paired-end libraries from Illumina sequencing, with long read sequencing by PacBio Sequel I sequencing. First, gDNA was extracted from the two individuals. gDNA was isolated from fresh leaves with a modified CTAB mini-prep protocol 61  Preprocessing of sequenced short reads. Reads quality has a major impact on the quality of the resulting assembly, and the use of error-corrected reads increases the size of the contigs 62 . Illumina paired-end reads were trimmed using a phred quality score > 30 in TRIMMOMATIC v0.32 63 , following a sliding window trimming approach. We verified visually the quality (including contamination with Illumina paired-end adaptors) before and after trimming using the program FastQC 64 . This allowed us to only keep high-quality reads prior to the assembly steps.
Genome size estimation. Quality trimmed Illumina paired-end sequences were used to estimate the genome size using KmerGenie v1.7016 65 . The best k-mer sizes were 95 and 93 for Teotihuacán and Ticumán, respectively. Also, the genome size was estimated through cell flow cytometry for both individuals carried out at National Laboratory of Flow Cytometry of the National Autonomous University of Mexico. To estimate the genome size, Arabidopsis col-1 ecotype and human PBMCs (male donor) were used as a reference. The nuclear DNA content of the sample was calculated with the formula: where pg is picograms and IMF is average fluorescence intensity.
De novo genome assembly. Each individual of D. stramonium was assembled independently de novo. We followed the workflow of Chakraborty et al. 66 for both plants, with modifications 66 . First, PacBio raw subreads (in bam format) were transformed to fasta format and assembled with Canu v1.8 pipeline that includes three stages: correction, trimming, and assembly 67 . PacBio only assembles of high error, long molecule sequences, depend upon redundancy between the various low-quality reads to 'vote out' errors and identify the true sequence in the sequenced individual 66 . Therefore, we used also a hybrid assembly approach suggested by Ye et al. 68 . For this, Illumina short reads were used to perform De Bruijn graph assembly with SparseAssembler 68 . The generated contigs from SparseAssembler were used with PacBio raw sequences to carry out a hybrid assembly using the program DBG2OLC 68 . An advantage of DBG2OLC program is that uses multiple sequence alignment to clean the PacBio reads and remove reads with structural errors (the so-called chimeras) 68 .
The program MUMmer v3 24 was used to run the NUCmer wrap and the program delta-filter to compute unique alignments between the contigs from the Hybrid assembly (DBG2OLC) and PacBio assembly (Canu). DBG2OLC assembly was used as reference and Canu assembly was used as query. This last step allowed us to merge both assemblies (DBG2OC and Canu) using the program Quickmerge 66 . As the two assemblies used for merging come from the same genome, gaps in one assembly can be bridged using the corresponding sequences from the other assembly 66 . Thus, Quickmerge program improved the contiguity of both genome assemblies.
Polishing, consensus and scaffolding. Genome assemblies were polished using the programs Pilon 69 and Arrow (https ://githu b.com/Pacifi cBio scien ces/Genom icCon sensu s). Polishing the contigs using both programs brings the error rate down to 0.01% or lower 66 . First, raw Illumina sequences were aligned to its corresponding merged assembly (draft genome) with Bowtie2 70 . We used SAMtools v1.8 71 to transform, sort and index the alignments outputs and then Pilon was used to polishing the draft genome with these Illumina aligned reads.
We used the program pbalign (https ://githu b.com/Pacifi cBio scien ces/pbali gn) to align the PacBio raw sequences to the new corresponding polished draft genome from Pilon. Then, the program Arrow was implemented as a second polishing step and to generate a consensus genome. After this, we used the program OPERA-LG v2.0.6 72 for scaffolding and then a third step of polishing with Pilon (which implied align the raw Illumina sequences against its corresponding genome) to improve the accuracy of the final genome assembly.
To evaluate the sequence and structural similarity between the two draft genomes (i.e., single nucleotide polymorphisms or SNPs, breakpoints, insertions, relocations, translocations, inversions, average sequence similarity), we used the wrapper dnadiff from MUMmer v3 24 . The Ticumán assembly was used as reference and the Teotihuacán assembly as query. Likewise, NOVOPlasty v3.8.2 was used to extract and reconstruct the chloroplast and mitochondrial genomes from the whole genome shotgun data of the two plants of D. stramonium. This program is capable to assemble the incidentally sequenced chloroplast and mitochondrial DNA that is present in almost all plant sequencing projects, due to the extraction of whole cellular DNA. The complete report and results of the chloroplast and mitochondrial genomes of these plants can be consulted in De-la-Cruz and Núñez-Farfán 73 . www.nature.com/scientificreports/ Nuclear genome validation. We evaluated the genome assemblies using the standard assembly statistics (average contig size, number of contigs, assembled genome size, N50, etc.) with the package Quast v5.0.2 74 . Also, BUSCO v.2.0.1 25 was used to assess the assembly quality through the gene completeness for both genomes. BUSCO was run in its three assessment modes; genome, transcriptome and proteins. BUSCO inspects de novo assemblies searching for single-copy orthologs (BUSCOs) and assess the completeness of the genomes according with the number of BUSCOs found 25 . In our case, the "Solanaceae odb10*" dataset loaded in the program was used to find 3052 orthologs. BUSCOs were classified as complete and single-copy (S), complete and duplicated (D), fragmented (F) or missing (M). As an additional evaluation of the genome assembly quality, we assessed the mapping rate of the Illumina sequences of each individual to their corresponding assembly using Bowtie2. We also used the program Merqury v1.1 23 to evaluate the quality of genome assemblies. Merqury generates assembly assessment metrics using k-mers alone 23 . This program compares a set of k-mers derived from unassembled, high accuracy sequencing reads to a genome assembly for evaluation (e.g., Illumina short-reads). The generated assembly metrics include consensus quality (QV) and k-mer completeness, Thus, Merqury is able to estimate base-level accuracy and completeness of genome assemblies 23 .

Repetitive elements analysis.
To characterize the repetitive elements in the genomes of D. stramonium, we followed the pipeline "Repeat Library Construction-Basic for MAKER v2.31.10" (http://weath erby.genet ics.utah.edu/MAKER /wiki/index .php/Repea t_Libra ry_Const ructi on-Basic) by Campbell et al. 75 . With this method, we followed a de novo approach to identified and collect repetitive sequences from the genomes. This was achieved using RepeatModeler v2.0 76 . The repetitive elements derived from this pipeline were concatenated with the databases RepBase-20181026 (https ://www.girin st.org/serve r/RepBa se/index .php) and Dfam_Consensus-20181026 (https ://dfam.org/help/tools ). These databases contain a comprehensive number of repetitive elements from all plant species 76 . Then, the program RepeatMasker v4.0.9 77 was run to identify the final interspersed repeats and low complexity DNA sequences. The output of the program was a detailed annotation of the repeats that are present in the genome sequence, as well as a modified version of the genome sequence in which all the annotated repeats have been masked 77 .
Gene prediction and structural annotation. The program BUSCO was used for genome assembly assessment and the annotated BUSCO gene models built during genome assessments were used to optimize the Hidden Markov search model (HMM) to train the gene predictor program Augustus v.3.2.2 78 using the-long option (BUSCO uses Augustus to search the conserved genes) and produce a trained HMM of our genes models for the program MAKER v.2.31.10 75 . MAKER identifies and masks out repeat elements based on repeat annotation from RepeatMasker, aligns RNA-seq data from the same species and/or related species to the genome; also, aligns proteins from related species and use gene predictors to synthesizes all these data into final structural annotations and produces evidence-based quality values for downstream annotation management 75 . The D. stramonium annotation workflow consisted of a total of four MAKER runs which is the recommended number to obtain the best annotation 75 . For the first run, we used the gene trained models from Augustus, 443,235 proteins from UniProtKB/TrEMBL from all Solanaceae species database (searching for the word "Solanaceae" with date 30/03/2019), expression sequence tags (ESTs) and a transcriptome of D. stramonium provided by an alternative experiment from our laboratory (other plants; NCBI BioProject PRJNA669339), 328,166 ESTs of five Solanaceae species (Solanum lycopersicum, Solanum tuberosum, Nicotiana attenuata, Nicotiana tabacum and Capsicum annuum) from EnsemblPlants and our specific repeat library to masks out the genome. This first step produced a set of draft gene models. The gene models from this first run were used to train another ab initio gene predictor called SNAP v.2006-07-28 79 . Once SNAP was trained with the draft gene models, we ran MAKER again using the same parameters. This process was repeated twice to retrain SNAP for three times in total 79 . Therefore, we used the gene models from the one round to train ab initio SNAP program to improve the inference of gene models in the next round. Only the HMM gene models from SNAP in the MAKER configuration file was changed in each run. Retraining of SNAP was performed using gene models (from each previous MAKER run) with an annotation edit distance (AED ≤ 0.25) and amino acids length ≥ 50 bp. AED ranges from 0 to 1 and quantifies the congruency between a gene annotation and its supporting evidence (gene models, EST, protein and mRNAseq alignments) 80,81 . Lower AED values imply higher congruency between the intron-exon coordinates of annotation and its aligned evidence, whereas AED = 1 indicates no evidence for support of predicted genes 80 . Only sequences with AED < 0.5 were retained in the final set of predicted genes 80 . AED scores were calculated following the formulate given by Holt and Yandell 80 . We used the Perl scripts from the GitHub repository Genome Assembly Annotation Service (GAAS) (https ://githu b.com/NBISw eden/GAAS/tree/maste r/annot ation ) in order to retrieve AED scores and summary statistics from the MAKER annotation.

Functional annotation.
Blastp v.2.6.0 82 was used to functionally annotate the genes from both Datura genomes against all the Solanaceae sequences from the UniProt/TrEMBL and UniProt/Swiss-Prot database. We used the program Automated Assignment of Human Readable Descriptions (AHRD) (https ://githu b.com/asish allab /AHRD) to assign gene descriptions that were concise, informative and precise 10 . Gene Ontology terms were annotated using MapMan4 through the Mercator webtool 83 . In addition, protein motifs and domains were annotated using Interproscan v.5.24 84  Orthology, reconstruction of orthogroups (protein families) and construction of species and gene family trees. To gain insight into the evolution of D. stramonium genome, we used the thirteen proteomes as input to OrthoFinder program 29 . We used in OrthoFinder v2.3.3, DIAMOND blast (E-value < 1e −5 ) 100 for orthogroup inference, and the MCL clustering algorithm for sequence similarity and clustering 101 . For each orthogroup or gene family we used MAFFT v7 102 as multiple protein sequence aligner and FastTree2 v2.1.10 103 for maximum likelihood gene trees inference. The inference of species tree is constructed by OrthoFinder, using a concatenated alignment of single-copy orthogroups (those with at most one gene per species) 29 . For some species sets which have been diverging for a very long time, there are not enough single copy orthogroups. In those cases, orthogroups that are mostly single-copy are also used for the concatenated alignments by only using sequences for the species that are single-copy in that orthogroup and gap characters for the other species 29 . The species tree was inferred with FastTree2 29 . The rooting is done via STRIDE algorithm (Specie Tree Root Inference from Duplication Events) 29 and according with OrthoFinder, P. inflata, was selected as outgroup of the whole phylogeny.
Inferring the species ultrametric phylogeny. To build an ultrametric phylogeny for the analysis of gene family evolution (expansions/contractions in gene families; see below), the rooted species tree obtained from OrthoFinder was used to search in TimeTree webtool 104 the divergence times between the branches, the rooted species tree and the information of divergence times were used to create the ultrametric species tree using the chronos function of the R package ape (v.3.4 on R v.3.2.1) 105 . The tip to root length was adjusted to match the approximately 40 million-year evolutionary history of Solanaceae species 14,104 .

Identification and analysis of gene expansions/contractions. To assess the gene family expansion
and contractions of the thirteen Solanaceae species, we used only the gene families with more than four genes per family (24,235) and the species ultrametric tree as inputs to the CAFE v4.2.1 106 open access program (Computational Analysis of gene Family Evolution). The main goal of CAFE is to estimate the birth-death (λ) parameters for the provided tree and gene family counts, the λ parameter describes the probability that any gene will be gained or lost 106 . First, the python scripts provided by CAFE pipeline were used to estimate the error in our dataset and to removed gene families with large variance 106 . This last filter was carried out because gene families that have large gene copy number variance can cause parameter estimates to be non-informative 106 . The CAFE software was then run using the mode in which the gain and loss rates are estimated together (λ) for the whole phylogeny. For the entire analysis, the CAFE overall p value threshold was kept at its default value (0.01). We used a custom script (https ://githu b.com/asish allab /SlydG eneFa msAna lyses /blob/icruz /exec/parse CafeR esult .R) to parse the CAFE output for functional enrichment analysis (see below).

Physicochemical protein divergence.
We used all the multiple sequence alignments of the 24,235 (protein families with more than four proteins) protein families to carried out a Multivariate Analysis of Protein Polymorphism (MAPP program) 107 . MAPP estimates the average deviation from six physicochemical properties (hydropathy, polarity, charge, volume, free energy in alpha-helix conformation, and free energy in beta-strand conformation) at an amino acid position across a multiple sequence alignment to assess the effect of a substitution at a particular amino acid site (physicochemical divergence) 107 . Thus, we used MAPP to estimate the physiochemical divergence in each gene family. First, we used the script readAndParseOrthogroupsTxt.R (https :// githu b.com/asish allab /SlydG eneFa msAna lyses /blob/icruz /exec/readA ndPar seOrt hogro ups Txt.R) to parse and create folders from each gene family and stored its corresponding protein tree and multiple sequence alignment from OrthoFinder results. Then, we used MAPP program 107 with default parameters in each one of the protein families. We used the script readMappResults.R (https ://githu b.com/asish allab /SlydG eneFa msAna lyses /blob/ icruz /exec/readM appRe sults .R) to parse and read all the MAPP results of the gene families. This script reads the MAPP results for all families, adjust p value, find Datura genes of families with good multiple sequence alignments (Valdar Score > 0.6) and only retains significant sites with physicochemical divergence that fell into conserved domain proteins. Valdar Score method allows to score residues in a multiple sequence alignment and assigns a score ranging from 0 for low and 1 for high conservation 108 . This program can be found in https ://githu b.com/asish allab /SlydG eneFa msAna lyses /blob/icruz /exec/compu teVal darMs aScor es. R and was used into the readMappResults.R script.
Positive selection in gene families. We performed a codon-level analysis of positive natural selection with FUBAR program (Fast, Unconstrained Bayesian AppRoximation) 109 on 24,235 gene families. FUBAR is a Bayesian approach to infer non-synoymous (dN) and synonymous (dS) substitution rates on a per-site basis for a given coding alignment and corresponding gene phylogeny 109 . To run FUBAR, first we retrieved the coding sequences (CDS) for each of the 13 Solanaceae species mentioned above. We removed trailing stop codons from the CDS, then we applied PAL2NAL 110 to produce a codon alignment for each gene family. PAL2NAL is a program that converts a multiple sequence alignment of proteins and the corresponding DNA (CDS) sequences into a codon alignment 110 . Thus, we used the protein tree that we already had from each protein family to run PAL-2NAL. FUBAR was run for all the codon alignments of each protein family. A custom Python script was used to transform the ".json" format from FUBAR result to tabular format. Then, the R script "loadFubarResults.R" Scientific Reports | (2021) 11:882 | https://doi.org/10.1038/s41598-020-79194-1 www.nature.com/scientificreports/ from the R package GeneFamilies (https ://githu b.com/asish allab /GeneF amili es/blob/maste r/exec/loadF ubarR esult s.R) was used to obtain a table with the significant posterior probabilities of a codon being subject to positive selection for each gene family (significant posterior probabilities ≥ 0.98; and Bayes Factor > 100).

Enrichment analysis.
For enrichment test (Fisher's exact test 111 ), we used as background all the proteins from both genomes of D. stramonium (64,790 proteins) to detect over-represented proteins that showed signal of expansion, physicochemical divergence (MAPP), and with positively selected conserved amino acids (codons) (FUBAR). Functional annotation of the proteins was done using MapMan4 83 and InterproScan. MapMan4 was used to annotate the general function of the proteins in order to retrieve the function of significant proteins resulted from MAPP, FUBAR and CAFE analyses, while InterProscan was used to identify and annotate domains overlapping the proteins with significant expansion signal, proteins with physicochemical divergence as well as positively selected conserved amino acids (codons). These analyses were done using the scripts "enrichedAnnosInExpContrFams.R (CAFE)", "identifyDomainsAtSelectedSites.R (FUBAR)" and "readMappResults.R (MAPP)" of the R package SlydGeneFamsAnalyses (https ://githu b.com/asish allab /SlydG eneFa msAna lyses ).
Genes involved in the tropane alkaloids biosynthesis. We investigated four families that contain genes involved in the pathway of tropane alkaloids that is stored in the KEEG database; https ://www.genom e.jp/ kegg-bin/show_pathw ay?map=map00 960&show_descr iptio n=show 22 ). These genes correspond to Putrescine N-methyltransferase (pmt), Tropinone reductase I (tpr I), Tropinone reductase II (tpr II), and Hyoscyamine (6S)-dioxygenase (h6h). Multiple sequence alignments and protein trees for each family were generated from the previous analyses. We analyzed into our CAFE results if these eight protein families experienced expansions. Since proteins were already functional annotated, we also investigated the differences in the protein domain architecture in each gene family.

Data availability
The complete workflow, all supplemental materials as well as commands used in this study are available in https ://githu b.com/icruz 1989/Datur a-stram onium -genom e-proje ct. Genome assemblies, Illumina and PacBio raw sequences from the two plants of D. stramonium have been deposited at DDBJ/ENA/GenBank under the BioProject PRJNA622882: Teotihuacan assembly, acc. JAAWWX000000000, Ticumán assembly acc. JAAWWY000000000. Illumina and PacBio sequences for the Ticumán genome: acc. SRR11474700, SRR11474698, respectively. Illumina and PacBio sequences for the Teotihuacán genome: SRR11474701, SRR11474699, respectively.