Thellungiella parvula1 is related to Arabidopsis thaliana and is endemic to saline, resource-poor habitats2, making it a model for the evolution of plant adaptation to extreme environments. Here we present the draft genome for this extremophile species. Exclusively by next generation sequencing, we obtained the de novo assembled genome in 1,496 gap-free contigs, closely approximating the estimated genome size of 140 Mb. We anchored these contigs to seven pseudo chromosomes without the use of maps. We show that short reads can be assembled to a near-complete chromosome level for a eukaryotic species lacking prior genetic information. The sequence identifies a number of tandem duplications that, by the nature of the duplicated genes, suggest a possible basis for T. parvula's extremophile lifestyle. Our results provide essential background for developing genomically influenced testable hypotheses for the evolution of environmental stress tolerance.
According to phylogenetic studies based on fossil evidence3, the split between A. thaliana and the main Brassica group encompassing T. parvula in the subclade Eutremeae is thought to have occurred about 43 million years ago. Both T. parvula and A. thaliana have similar genome sizes, and their close taxonomic relationship provides unique opportunities for tracing evolutionary rearrangements between the two species.
The main goal of this project was to produce a de novo, scaffold-level, gap-free assembly of the T. parvula genome. To achieve this, we used second generation sequencing exclusively, including ROCHE-454 GS FLX Titanium sequencing for its read length advantage and Illumina GA2 sequencing for its higher quality reads. We included varying insert sizes of paired-end libraries in addition to single-end reads (Online Methods). In total, we obtained 7.8 × 10⁹ high quality base pairs, equivalent to ∼50-fold genome coverage. Of these, 85% came from the 454 sequencing (Supplementary Fig. 1 and Supplementary Table 1).
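The fold-coverage figure is the number of sequenced bases divided by the genome size. The following Python sketch illustrates the arithmetic using the 7.8 × 10⁹ bp total together with the flow-cytometry estimate (about 160 Mb) and the assembled sequence space (137.09 Mb) reported in this paper; the function and variable names are illustrative and not part of the original analysis.

```python
def fold_coverage(total_bases: float, genome_size_bp: float) -> float:
    """Average sequencing depth: sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


total_bases = 7.8e9       # high quality base pairs reported in the text
flow_estimate = 160e6     # flow-cytometry genome size estimate (bp)
assembled = 137.09e6      # assembled sequence space (bp)

depth_flow = fold_coverage(total_bases, flow_estimate)
depth_assembled = fold_coverage(total_bases, assembled)
print(f"{depth_flow:.1f}x vs. flow estimate, {depth_assembled:.1f}x vs. assembly")
```

Depending on which genome size is used as the denominator, the depth comes out slightly below or above 50-fold, which is consistent with the approximate figure quoted above.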
In the absence of genetic maps, with only limited physiological studies4, without prior genome information and with only very limited transcriptome sequences, we used an iterative hybrid approach to construct a draft genome (Online Methods). The result was a total of 1,496 meta-contigs (scaffolds) of merged primary contigs, ranging in size from 1 kb to 13.08 Mb (Table 1). However, unlike typical scaffold sequences, these meta-contigs were free of gaps. Overall, 73% of the length of the T. parvula draft genome was represented in 20 contigs longer than 1.5 Mb, and 85% of the sequenced genome was represented by the largest 60 contigs, each 100 kb or greater in length. Based on flow cytometry of propidium-iodide–stained nuclei5, the T. parvula genome had previously been estimated to be about 160 Mb, or 15% larger than that of A. thaliana. The total size of the curated and assembled T. parvula genome sequence space, however, was 137.09 Mb. This discrepancy is similar to those for A. thaliana (estimated as ∼150 Mb, or 25% longer than the sequenced genome6) and Cucumis sativus (estimated at 30% greater than the draft genome7).
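Contiguity summaries of this kind (the share of the assembly held in the largest contigs, and the related N50 statistic used in the Online Methods) can be computed directly from a list of contig lengths. The Python sketch below shows one way to do so; the toy lengths are hypothetical and are not taken from the T. parvula assembly.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


def fraction_in_largest(lengths, n):
    """Fraction of total assembly length contained in the n largest contigs."""
    ordered = sorted(lengths, reverse=True)
    return sum(ordered[:n]) / sum(ordered)


toy_lengths = [50, 30, 10, 5, 5]  # hypothetical contig lengths
print(n50(toy_lengths), fraction_in_largest(toy_lengths, 2))
```

Applied to the real contig lengths, `fraction_in_largest(lengths, 20)` and `fraction_in_largest(lengths, 60)` would correspond to the 73% and 85% figures given above.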
Syntenic regions between T. parvula contigs and other Brassicaceae chromosomes were apparent after aligning T. parvula contigs with the A. thaliana genome (Fig. 1) and chromosome A3 of Brassica rapa8 (Supplementary Fig. 2). The 20 longest contigs covered all five A. thaliana chromosomes, with the exception of positions that approached and included centromeric regions. The largest T. parvula contig, c1 (13.08 Mb), aligned with the entire length of one arm of A. thaliana chromosome 1 (Fig. 1a).
For T. parvula contigs and A. thaliana chromosomes, we annotated repetitive elements (Online Methods). Overall, repetitive sequences amounted to 7.5% of the T. parvula genome based on similarity searches against genomic repeat databases and de novo clustering of repetitive sequences (Supplementary Tables 2,3). Figure 1a and b show repeat distributions in combination with overall sequence alignment comparisons using Circos plots.
Repetitive sequences were distributed unevenly in both species. Repeat-rich sequences were concentrated near the centromeric regions in A. thaliana chromosomes6, as reported for other plant genomes9,10; these sequences were, however, enriched toward the ends of T. parvula contigs (Fig. 1a). As a result of established difficulties in assembling repetitive sequences11, we found repeat-rich sequences more frequently among the smaller T. parvula contigs (Fig. 1b). Thus, the average repeat content in the largest 20 contigs was 5.5%, whereas the next 40 contigs, c21–c60, contained 17.5% repeat content.
We predicted gene models using FGENESH++, GENSCAN and BLAST (see URLs) searches to minimize false positive predictions. We based annotations on sequence similarity identified using independent BLAST searches and the Blast2GO pipeline (Online Methods). We manually inspected predicted open reading frames (ORFs) whose length deviated more than 20% from the putative A. thaliana homologs for exon merging or splitting. T. parvula contained a total of 28,901 predicted protein-coding ORFs. This is about 7% more than A. thaliana, which contains 27,059 protein-coding complementary DNAs (cDNAs) (excluding chloroplast and mitochondrial genes and based on the TAIR9 release). We mapped Illumina short read sequences from the transcriptome of young T. parvula plant tissues to 19,176 of these predicted ORFs (Online Methods and Supplementary Table 4).
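The 20% length criterion used to select ORFs for manual inspection can be expressed as a simple filter. The sketch below is only an illustration of that rule; the function name and the example lengths are hypothetical, and it assumes that the ORF and homolog lengths are measured in the same units.

```python
def needs_manual_inspection(orf_len: int, homolog_len: int,
                            tolerance: float = 0.20) -> bool:
    """True if the ORF length deviates from its putative homolog by more than tolerance."""
    if homolog_len <= 0:
        raise ValueError("homolog length must be positive")
    return abs(orf_len - homolog_len) > tolerance * homolog_len


# Hypothetical (ORF length, homolog length) pairs in bp.
pairs = [(1000, 1000), (1300, 1000), (1150, 1000), (700, 1000)]
flagged = [p for p in pairs if needs_manual_inspection(*p)]
print(flagged)
```

ORFs that are flagged in this way are candidates for exon merging or splitting errors in the ab initio prediction.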
The mean size of the predicted ORFs was 1,252 bp, with 71% of the ORFs between 201 bp and 1,500 bp in length (Fig. 2a). This distribution is similar to that of A. thaliana protein-coding cDNAs (Supplementary Fig. 3). The GC contents were substantially higher in exons than in introns and intergenic regions (Table 1). Based on sequence similarity searches to the NCBI nucleotide database, the primary matches for the T. parvula predicted ORFs were most frequently coding regions from Arabidopsis lyrata (53%), A. thaliana (29%) and B. rapa (5%) (Supplementary Table 5). BLASTn searches of T. parvula ORFs against A. thaliana cDNAs identified 25,783 (89%) hits (e value < 0.00001). Among these, 21,523 ORFs were of very similar lengths (80–120%) to their putative A. thaliana homologs (Fig. 2b). The arrangement of predicted ORFs in the T. parvula genome showed extensive macro-synteny with A. thaliana with infrequent rearrangements (Supplementary Table 4), mirroring the genome-wide alignments observed between A. thaliana chromosomes and T. parvula contigs. Each of the 20 largest T. parvula contigs consisted mostly of ORFs that shared sequence similarity with genes from a single A. thaliana chromosome, the exception being contig c3, which shared similarity with genes from three chromosomes (Supplementary Table 6).
A total of 3,118 predicted ORFs had no BLASTn hits to A. thaliana cDNAs even at lowered stringency levels (e value > 0.001). We have listed these as unidentified ORFs in Supplementary Table 4. Notably, these putative ORFs were enriched in regions containing larger numbers of repetitive sequences, possibly indicating T. parvula–specific transposable elements (for example, contigs c17 and c18 in Supplementary Table 6 and the histograms in the outer circle of Fig. 1a). The draft genome also includes 86.6 kb of noncoding RNAs based on sequence searches against microRNA (miRNA) and other noncoding RNA databases (Supplementary Tables 2,7).
We assigned Gene Ontology (GO) terms for the T. parvula predicted ORFs using the Blast2GO pipeline12 and compared them with the A. thaliana transcriptome (Fig. 2 and Supplementary Table 8). In the GO class 'biological processes', subcategories of 'response to abiotic or biotic stimulus' and 'developmental processes' were enriched in T. parvula, whereas genes in the subcategory 'signal transduction' were underrepresented (Fig. 2c). In the GO class 'molecular function', we found the subcategories of 'transporter activity' and 'receptor binding or activity' to be significantly different between the species (Fig. 2d). Among genes annotated as performing transporter activities, the numbers of ATPase and nucleotide, cation and sugar transporters were significantly higher in T. parvula than in A. thaliana (Table 2). These differences may reflect the different habitats and environmental pressures to which the species adapted. ATPase and nucleotide transporters with functions in pH homeostasis and cellular energy generation have, for example, been related to protection under salinity stress13,14, whereas transport and accumulation of soluble sugars or polyols are considered key mechanisms that provide osmotic stress tolerance15. We found the most significant difference in gene copy numbers for transporters of cations other than Na+ and K+, perhaps reflecting the adaptation of T. parvula to soil not only containing saline but also imbalanced in other ions2,16.
Gene copy number variation has also been proposed as a major mechanism of phenotypic differentiation and as reflecting evolutionary adaptation to the environment17,18,19. The T. parvula genome included 1,842 more predicted ORFs than protein coding cDNAs in A. thaliana (Supplementary Fig. 3). Mirroring the observed differences in the GO subcategories, the T. parvula genome contained higher copy numbers of orthologous genes related to stress adaptation, for example, AVP1, HKT1, NHX8 (ref. 20), CBL10 (ref. 21) and MYB47 (Supplementary Table 4).
Gene duplication as a vehicle for evolution has long been hypothesized22, and experimental evidence of this has recently been accumulating7,23,24,25. In both T. parvula and A. thaliana, the major role in generating copy number variation has been played by tandem gene duplication (Fig. 3a) rather than by large gains or losses in segment composition following the A. thaliana and T. parvula divergence after the most recent whole-genome duplication3. We found a total of 1,278 and 1,113 tandem duplication events in the T. parvula and A. thaliana genomes, respectively. Only half of these were shared between the two species (Fig. 3b and Supplementary Table 9). Inspection of the GO class representation of tandem duplications revealed significantly different GO 'biological process' (Fig. 3c) and GO 'molecular function' (Fig. 3d) subcategories. Differences in gene numbers in the subcategories 'response to abiotic or biotic stimulus' and 'developmental processes' (Fig. 2c) were most prominent among genes multiplied by tandem duplication (Fig. 3c), as supported by substantially lower P values (Supplementary Table 8).
Finally, Figure 4 and Supplementary Table 10 show the assembly of the T. parvula contigs into the seven chromosomes that characterize this species. The evolution of chromosome structures in Brassicaceae has previously been traced through comparative 'chromosome painting' techniques using BAC-size sequence probes from the A. thaliana genome26. With these techniques, Lysak and colleagues26 identified large genome segments, termed A to X, derived from an ancestral karyotype (n = 8). These ancestral karyotypes can be found in different assemblages in chromosome structures of different Brassicaceae clades26, including A. thaliana27 (Fig. 4a) and Eutremeae28 (n = 7). Using these as guides, the 40 largest T. parvula contigs could be unambiguously assembled into seven chromosomes28 covering 114.39 Mb (83% of the draft genome) (Fig. 4b and Supplementary Table 10). Each of these has a distinct, repeat-rich region, signifying the centromere (Fig. 4c, outer histogram). That the five largest contigs (c1–c5) covered the entire lengths of single chromosome arms attests to the quality of the de novo assembly. It is further noteworthy that the genomic regions in T. parvula contigs c2 and c3, although showing extensive rearrangements compared to the A. thaliana genome sequence, matched distinct ancestral karyotype blocks (Supplementary Fig. 4a, ancestral karyotype blocks R and W for c2, and Supplementary Fig. 4b, ancestral karyotype blocks V, K, L, Q, V′ and X for c3). Thus, our model for the T. parvula chromosomes provides sequence-based evidence for the Lysak model for crucifer species with n = 7, including the clade Eutremeae28. It also defines the boundaries of ancestral karyotype blocks more clearly and suggests more detailed structure than can be captured by chromosome painting experiments alone. This is particularly clear with respect to ancestral karyotype block V, which, based on sequence information, was divided into the blocks V and V′ in the T. parvula genome (Fig. 4b and Supplementary Fig. 4b). Also, ancestral karyotype block I extended to the pericentromeric region of T. parvula chromosome 4 (Fig. 4b,c) rather than falling entirely to one side of the centromere, as previously indicated by the chromosome painting experiments in various crucifer species with n = 7 (ref. 28).
A number of angiosperm families include extremophile species, although fewer than 10% of all plant species may be classified this way. Extremophiles' presence in evolutionarily distinct lineages reveals genetic complexities that appear to have evolved from the common genetic makeup of all plants. In adaptation to various combinations of environmental stresses, these extremophiles show tolerance of stresses against which crop plants in particular have no defenses. Knowing how extremophiles operate can, however, instruct us about the underlying genetic requisites and mechanisms for successful stress defenses. In this report, we have now shown that it is possible to determine the genome sequence of extremophiles, as well as model glycophytes, exclusively relying on next-generation DNA sequencing tools and de novo assembly.
The availability of the T. parvula genome provides a unique view of chromosome structure, organization and gene complement. Of particular importance is the comparison of this genome with that of the related A. thaliana, which is unquestionably a stress-sensitive species. In our initial analysis, this halophyte, with a genome only ∼15% larger than that of A. thaliana, shows striking differences in gene complement. The differences are partly because of tandem duplications in T. parvula of single copy genes in A. thaliana and preferential amplification of genes with known or assumed functions in stress defense responses. Within these differences, we expect, lie the unique solutions to understanding T. parvula's particular lifestyle and adaptation to its demanding ecological niche. More detailed examination of genome structure, coding complexity, and gene structure and expression in stress response pathways in comparative studies will point the way toward correlating the T. parvula phenotype with its genetic makeup.
SeqAnswers online forum, http://seqanswers.com/; GENSCAN, http://genes.mit.edu/GENSCAN.html; BLAST, http://blast.ncbi.nlm.nih.gov/Blast.cgi; FGENESH++, http://linux1.softberry.com/berry.phtml?topic=fgeneshplus2; TAIR GOslim, ftp://ftp.arabidopsis.org/home/tair/Ontologies/; sff_extract program, http://bioinf.comav.upv.es/sff_extract/; Vmatch suite, http://www.vmatch.de/; AMOS pipeline, http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOS; Repbase, http://www.girinst.org/repbase/; Plant Repeat database, ftp://ftp.plantbiology.msu.edu/pub/data/TIGR_Plant_Repeats/; TAIR9 cDNA database, ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR9_blastsets/TAIR9_cdna_20090619; miRBase, http://www.mirbase.org/; Rfam database, http://rfam.sanger.ac.uk/; GraphPad QuickCalcs, http://www.graphpad.com/quickcalcs/contingency1.cfm.
Plant material and DNA extraction.
Total DNA was isolated from 10-day-old seedlings of T. parvula. The seeds were derived from a single plant propagated from single seeds over eight successive generations. The original accession was collected from a salt lake in Tuz Golu, central Turkey at an elevation of 905 m above sea level. At the collection site, the soil bulk density was 1.225 g/cm³ with 32.4% salts by weight. Genomic DNA was prepared using the Nucleon Phytopure Genomic DNA Extraction kit (GE Healthcare).
Strategy for a highly contiguous draft genome.
Compared to Sanger sequencing, the shorter reads associated with either 454 or Illumina sequencing manifest decreased connectivity. As a result, considerably deeper coverage is required to generate contiguous assemblies. Deeper coverage alone, however, does not in itself solve the problem of fragmented assemblies; if reads are shorter than a repeat, gaps are unavoidable, and with deeper coverage, accumulated sequencing errors make assembly more computationally challenging. In assembling the T. parvula genome, the problem was mitigated by (i) using reads from different technologies, (ii) using paired reads with different insert lengths to span different repeat lengths and (iii) computationally selecting high quality reads.
Overview of sequencing, assembly and annotation.
Library construction and sequencing were performed in the W.M. Keck Center for Comparative and Functional Genomics at the University of Illinois at Urbana-Champaign. Random shotgun genomic libraries were constructed according to the manufacturer's recommendations for each of the two sequencing platforms, GS FLX Titanium (454 Life Sciences) and Illumina GA2 (Illumina). Newbler (454-Roche), ABySS29 and Minimus2 (ref. 30) were used as the main assembly programs to generate the draft genome, and FGENESH++ (SoftBerry), GENSCAN, BLAST (see URLs) and Blast2GO12 were used to predict and annotate gene models.
DNA library preparation and sequencing.
For 454 pyrosequencing, both shotgun and paired-end libraries were constructed. Genomic DNA was randomly sheared by nebulization to fragments of 500–800 bp in length to construct two shotgun libraries. Additional DNA was processed to construct paired-end libraries with size spans of 3 kb (three libraries), 8 kb (two libraries) and 20 kb (two libraries). All libraries were constructed, clonally amplified and sequenced on the 454 Genome Sequencer FLX-Titanium according to the manufacturer's kits and protocols (454 Life Sciences). Signal processing and base calling were performed using the bundled GS FLX software version 2.0.01.
For Illumina sequencing, genomic DNA was nebulized, and fragments 200–500 bp in length were size selected to construct a shotgun library using the Illumina Genomic DNA Sample Prep kit (Illumina). The library was sequenced on three lanes of a flowcell from one end (single read) for 81 cycles on a Genome Analyzer IIx. The Illumina Pipeline 1.5 was used to generate fastq sequence files from the raw data.
Hybrid genome assembly.
A combined total of 7.8 × 10⁹ bases resulted from sequencing using both platforms. Average read sizes were 355 bp and 80 bp for the 454 and Illumina sequences, respectively. Approximately 85% of all sequences were derived from the 454 sequencing. We followed an iterative approach for assembly starting from raw sequence reads assembled into primary contigs. We used two assembly programs and combined the primary contigs and paired-end data to build scaffolds in successive assemblies. Single and paired-end 454 sequences were assembled using the Roche GS assembler, Newbler (version 2.0.01.14), with a 40 bp minimum overlap and 90% identity. In both instances, reads were first assembled as single-end reads, after which the paired-end information was used to construct scaffolds.
To assemble Illumina reads, we tested both Velvet31 (v1.3) and ABySS29 (v1.2) short-read assemblers using only reads that passed the Illumina chastity filter (base call values for chastity greater than 0.6 in the first 25 cycles); the k-mer size was set to 31 bp, and the coverage cutoff was set to 4. Both assemblers produced comparable results, but ABySS was much faster and was, therefore, chosen for further optimized short read assemblies. We also used Newbler contigs as single reads with the Illumina reads as the input in ABySS. We tested every odd-numbered length from 29 bp to 61 bp as the k-mer size to find the optimal size, meaning that which yielded the longest N50 and the fewest total contigs while maintaining total contig length near the flow-cytometry–estimated genome size of 160–180 Mb.
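The k-mer selection rule can be sketched as a small ranking step over per-assembly summary statistics. The Python example below is a hypothetical illustration only: the way the three criteria are combined (a size window around the estimate, then N50, then contig count as a tie-breaker), the `slack` parameter and all numbers in the example are assumptions and are not taken from the original analysis.

```python
from typing import NamedTuple


class AssemblyStats(NamedTuple):
    k: int            # k-mer size used for the assembly
    n50: int          # N50 of the resulting contigs (bp)
    n_contigs: int    # total number of contigs
    total_len: int    # summed contig length (bp)


def pick_k(stats, size_range=(160e6, 180e6), slack=0.25):
    """Choose k: keep assemblies of plausible total length, then prefer
    the longest N50 and, among ties, the fewest contigs."""
    low = size_range[0] * (1 - slack)
    high = size_range[1] * (1 + slack)
    plausible = [s for s in stats if low <= s.total_len <= high]
    if not plausible:
        raise ValueError("no assembly has a total length near the estimate")
    best = max(plausible, key=lambda s: (s.n50, -s.n_contigs))
    return best.k


candidate_ks = list(range(29, 62, 2))  # every odd k from 29 to 61

# Invented statistics for four of the candidate k values.
example = [
    AssemblyStats(31, 12_000, 90_000, 150_000_000),
    AssemblyStats(41, 25_000, 60_000, 145_000_000),
    AssemblyStats(51, 25_000, 70_000, 140_000_000),
    AssemblyStats(61, 40_000, 30_000, 60_000_000),
]
chosen = pick_k(example)
```

In this invented example the assembly at k = 61 has the longest N50 but is rejected because its total length is far below the estimated genome size.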
Because ABySS can be very sensitive to sequencing errors such as short indels, when using raw 454 reads in ABySS, custom Perl scripts were used to remove any raw 454 read that had homopolymers exceeding 10 bases (not all 454 homopolymer-error reads can be removed this way, but their number can be minimized). To enable the scaffold generating step in ABySS to proceed when 454 raw paired-end reads were used, the program sff_extract (see URLs) was used to process the standard flowgram (sff) files generated by the GS FLX sequencer. Different k-mer sizes were selected based on the different paired-end libraries used for scaffolding, but in most instances, the optimum k-mer size found with ABySS was 41.
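The homopolymer filter was implemented with custom Perl scripts that are not reproduced here. As an illustration of the same idea, the following Python sketch discards any read containing a run of more than 10 identical bases; the function names and example reads are hypothetical.

```python
import re


def has_long_homopolymer(seq: str, max_run: int = 10) -> bool:
    """True if any base is repeated more than max_run times consecutively."""
    # One base followed by at least max_run copies of itself: a run > max_run.
    pattern = re.compile(r"([ACGTN])\1{%d,}" % max_run, re.IGNORECASE)
    return pattern.search(seq) is not None


def filter_reads(reads, max_run: int = 10):
    """Keep only reads without homopolymer runs longer than max_run."""
    return [r for r in reads if not has_long_homopolymer(r, max_run)]


reads = [
    "ACGT" + "A" * 10 + "CGT",   # run of exactly 10: kept
    "ACGT" + "A" * 11 + "CGT",   # run of 11: removed
    "ttttttttttttGGC",           # lower-case run of 12: removed
]
kept = filter_reads(reads)
```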
The collection of contigs and scaffolds created in the primary assemblies was an overlapping set with a high level of redundancy. To select a non-redundant set, we used mkvtree in the Vmatch suite (see URLs) to index the sequences by length; Vmatch was used to cluster sequences, including clusters with size of 1 (singlets). We matched for 100% identity and full coverage of the smaller sequences in pairs. Contigs longer than 1,000 bp were used for further processing. This set was further inspected with all-against-all BLAST searches and the aligner NUCmer in MUMmer 3.22 (ref. 32) to remove duplicate contigs that may have been assembled for the same region of the genome.
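The containment criterion (100% identity over the full length of the smaller sequence) can be illustrated with a naive Python sketch. This is not how Vmatch works internally: Vmatch relies on an index built by mkvtree, whereas the quadratic substring search below is only practical for small inputs. Checking the reverse complement, assuming upper-case input and using `min_len` as a parameter are choices made for this illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def nonredundant(contigs, min_len: int = 1000):
    """Drop sequences fully contained (either strand) in a longer kept
    sequence, then keep only those longer than min_len."""
    kept = []
    for seq in sorted(contigs, key=len, reverse=True):
        if any(seq in big or revcomp(seq) in big for big in kept):
            continue  # exact, full-length match inside a longer sequence
        kept.append(seq)
    return [s for s in kept if len(s) > min_len]


# Toy sequences; min_len is lowered so the short examples survive.
toy = ["ACGTACGTAA", "CGTAC", "GTACG", "TTTT"]
result = nonredundant(toy, min_len=0)
```

In the toy input, the second and third sequences are exact substrings of the first and are removed, whereas the fourth is retained as a singlet.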
The meta-assembly of selected contigs from primary assemblies was carried out with the overlap-layout-consensus assembler, Minimus2 in the AMOS pipeline (see URLs), using a minimum 40-bp overlap with 95% identity. The resulting contigs and singlets were combined and purged of further redundancy, contaminating DNA and mitochondrial and chloroplast DNA using BLAST searches. This resulted in 1,496 contigs with a total length of ∼137 Mb.
The T. parvula draft genome was masked for repetitive sequences by RepeatMasker33 searching Repbase 14.01 and with BLASTn using the Plant Repeat database (see URLs). The masked contigs for known repetitive elements were further analyzed with NUCmer and custom scripts to search for long tandem repeats and for T. parvula–specific unclassified, non-exact, long repeats. Any sequences that were found more than five times were considered as repeats in this search.
FGENESH++ (SoftBerry) was used to predict protein coding ORFs in the T. parvula draft genome masked for repetitive sequences, with parameters optimized for dicot plants and protein sequences from the NCBI non-redundant (NR) database as reference. A total of 29,338 ORFs were predicted, of which 437 were further annotated as transposable elements based on BLASTn searches. Genomic regions that contained FGENESH++-predicted ORFs with lengths similar to their Arabidopsis homologs (± 20%) were tested with another gene prediction program, GENSCAN. When the predictions from the two programs deviated for the same genomic region, the ORF closest in length to another known homologous cDNA was taken as the more likely prediction. All genomic contigs and predicted ORFs were searched against NCBI nucleotide and protein databases and TAIR9 cDNA database (see URLs) using BLASTn and BLASTx searches. The predicted proteins were further annotated with the Blast2GO pipeline12 to assign GO and GOslim-plant terms based on NCBI plant databases and InterProScan34. To obtain experimental evidence for our ab initio predictions, we mapped the ORFs to high quality Illumina reads trimmed to 80 bp from a transcriptome sequence library generated from young seedlings. Using the program Bowtie35 with 100% identity to a minimum length of 50 bp and with '-m' set to 1 to ensure unique mapping, we found that 73% of the high quality reads mapped uniquely to the predicted ORFs (Supplementary Table 4). The remaining reads are too repetitive in nature, map to multiple ORFs or contain low complexity regions and are therefore unusable in mapping.
BLAST searches were performed to identify miRNA genes and other RNA genes by searching against the miRBase database of plant miRNA collections (release 16) and the Rfam database (see URLs) (release 10) for other non-coding RNA families including rRNA and tRNA genes.
When comparing the distributions of GO subcategories between A. thaliana and T. parvula (Figs. 2 and 3), two-tailed χ² tests were used (see URLs). For each GO subcategory, a 2 × 2 contingency table was constructed by recording the numbers of genes included or not included in a subcategory for each species and ranking the statistical significance of the differences.
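The 2 × 2 test can be reproduced without an online calculator. The Python sketch below computes the Pearson χ² statistic and its two-tailed P value for one degree of freedom. It applies no Yates continuity correction, which is an assumption, because the text does not state which variant was used. The gene totals are those reported in this paper; the in-category counts are hypothetical placeholders.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (uncorrected) and two-tailed P for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


tp_total, at_total = 28_901, 27_059   # predicted ORFs and cDNAs (from the text)
tp_in, at_in = 1_200, 1_000           # hypothetical counts in one GO subcategory

chi2, p_value = chi2_2x2(tp_in, tp_total - tp_in, at_in, at_total - at_in)
print(f"chi2 = {chi2:.2f}, P = {p_value:.3g}")
```

Repeating this calculation for each GO subcategory and sorting by P value gives a ranking of the kind described above.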
The raw reads for this project are deposited in the NCBI SRA project under the accession number SRA026763. The Illumina reads can be accessed under SRX047632 and the 454 reads under SRX032604. The genome assembly is deposited with the NCBI Genome Project ID 63843, and the sequences are deposited with the GenBank ID AFAN00000000.1.
We thank M.P. D'Urzo (Purdue University, West Lafayette, Indiana, USA) for providing plant materials and J.-H. Mun (National Academy of Agricultural Science, Suwon, Korea) for providing the B. rapa chromosome sequence. We also gratefully acknowledge M. Vaughn (University of Texas, Austin, Texas, USA), S. Jackman, M. Krzywinski (Michael Smith Genome Sciences Center, Vancouver, British Columbia, Canada) and SeqAnswers online forum (see URLs) for advice on genome assembly and visualization. Funding has been provided by King Abdullah University of Science and Technology, Thuwal, Saudi Arabia, by the World Class University Program (R32–10148) at Gyeongsang National University, Republic of Korea and the Next-generation BioGreen21 Program (SSAC, PJ008025), Rural Development Administration, Republic of Korea.
Repetitive sequences in T. parvula draft genome
List of T. parvula predicted ORFs and their annotations
List of non-coding RNAs in T. parvula draft genome
List and comparison of GO annotations of the T. parvula predicted ORFs and A. thaliana cDNAs
Tandem local duplications in the T. parvula draft genome and the A. thaliana genome
Assignments of the largest 40 T. parvula contigs in seven chromosomes