Background & Summary

Noccaea caerulescens, also known as Alpine pennycress, is a metal hyperaccumulating plant of the Brassicaceae family, previously classified as Thlaspi caerulescens1. Hyperaccumulation is a very rare characteristic in plants, with around 500 species identified2. Metal hyperaccumulation was first defined in relation to Ni hyperaccumulation3. A Ni hyperaccumulator was defined as a plant that could accumulate Ni in shoots at levels >1000 μg g−1 of dry weight. Hyperaccumulation has been extended to other metals with metal-specific thresholds. For Zn, levels of 3000 μg g−1 are used and for Cd 100 μg g−1 (ref. 2). Plant hypertolerance refers to plants that are able to grow under high metal concentrations without showing symptoms of toxicity. Metallophytes, plants that occur on metal-enriched soils, can be obligate and require the presence of a particular metal, or facultative, which can grow with or without the metal present. Only a small subset of metallophytes are metal hyperaccumulators. Accessions of N. caerulescens are facultative hyperaccumulators of Ni, Zn and Cd, with Zn hyperaccumulation being a species-wide trait, and Ni and Cd hyperaccumulation population-level traits4. N. caerulescens is used as a model plant species for studies on heavy metal hyperaccumulation due to its small genome size and the high degree of variation in metal hypertolerance and hyperaccumulation profiles between different accessions2,5,6.
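The metal-specific thresholds quoted above can be written as a small lookup. This is a minimal sketch for illustration: the function name and the example concentrations are hypothetical, and applying a strict "greater than" comparison to all three metals is an assumption (the text gives ">" explicitly only for Ni).

```python
# Shoot concentration thresholds (ug per g dry weight) as quoted in the text.
HYPERACCUMULATION_THRESHOLDS = {"Ni": 1000.0, "Zn": 3000.0, "Cd": 100.0}


def is_hyperaccumulator(metal: str, shoot_conc_ug_per_g: float) -> bool:
    """Return True if the shoot concentration exceeds the metal's threshold.

    Assumption: a strict '>' comparison is used for every metal.
    """
    return shoot_conc_ug_per_g > HYPERACCUMULATION_THRESHOLDS[metal]


# Hypothetical measurements, for illustration only.
print(is_hyperaccumulator("Zn", 4500.0))
print(is_hyperaccumulator("Cd", 40.0))
```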

Metal hyperaccumulating plants are of interest for several reasons. These include biofortification, where attempts are made to increase levels of nutrients in plants, e.g. Fe and Zn in staple crops7,8; phytoremediation, where plants can be used to concentrate polluting or contaminating metals, which can then be removed from the environment9; and reducing levels of toxic metals in plants, e.g. Cd in rice10.

Here we provide transcriptomes of four commonly studied accessions for which detailed Zn, Ni and Cd accumulation and tolerance data are available6. Two calamine accessions, La Calamine (LC) and Ganges (GA), are much more tolerant to Zn and Cd than the nonmetallicolous accession Lellingen (LE) and the serpentine accession Monte Prinzera (MP). Furthermore, the GA accession is a Cd hyperaccumulator, whereas MP is sensitive to Cd but hyperaccumulates Ni. The LE accession is the least tolerant to Zn, but also has the most efficient Zn translocation capacity among the four accessions. Overall, the accessions show metal-specific root to shoot translocation rates. These mechanisms may be related to gene expression level11, but variation in hyperaccumulation or tolerance may also originate from differences in protein sequences that, for example, alter the metal specificity of a metal transporter protein.

Sequence information available for N. caerulescens includes 454-sequencing of the transcriptome of the GA accession12 yielding 23725 sequences, and an EST library of 4289 sequences from the LC accession13. Genome sequencing of the GA accession is underway. SOLiD sequencing of root transcriptomes of GA, LC and MP accessions has been utilised for gene expression analysis11 but not for transcriptome assembly and sequence analysis.

The present data consist of assembled transcriptome sequences of the roots and shoots of the N. caerulescens accessions GA, LC, LE and MP grown in hydroponics under optimal Zn and Ni exposure. The transcriptomes have been annotated and clustered into ortholog groups with other closely related plant species. The transcriptome data can be used for genome, whole transcriptome and gene level studies, serving as a reference sequence, and also providing a sequence resource for primer design. The ortholog clustering will support comparative gene level studies for linking protein sequence variation to phenotypes. Assembly and release of annotated transcriptomes from Illumina data for the four accessions will serve as a valuable sequence resource for future studies.

Methods

Experimental design

Seeds of the N. caerulescens accessions GA, LC, MP and LE were germinated in soil, and plants with eight to ten leaves were rinsed and transferred to 10-l containers filled with half-strength Hoagland solution (modified from Schat et al.14): 3 mM KNO3, 2 mM Ca(NO3)2, 1 mM NH4H2PO4, 0.5 mM MgSO4, 1 μM KCl, 25 μM H3BO3, 2 μM MnSO4, 0.1 μM CuSO4, 0.1 μM (NH4)6Mo7O24, 20 μM Fe(Na)EDTA. For GA and LC, 10 μM ZnSO4 was added, and for MP and LE, 2 μM ZnSO4. In addition, 10 μM NiSO4 was added to MP. MES (2 mM) was added and the pH was adjusted to 5.5 with KOH. The plants were grown in three climate chambers: 20/15 °C day/night, 250 μmol m−2 s−1, 75% RH, light period 14 h per day. Continuously aerated solutions were changed twice a week. After three weeks, twelve plants of uniform appearance (with approx. 14–16 leaves) were pooled from each chamber to obtain three independent biological replicates (roots and shoots separately), frozen in liquid N2 and stored at −80 °C.

Generation of the datasets

RNA was extracted using the RNeasy Plant Mini kit (Qiagen). The quality and quantity of the RNA samples were verified by Bioanalyzer (Agilent) analysis. Library preparation and sequencing were performed at the Weill Cornell Medical College Genomics Resources Core Facility (NY, USA). RNA libraries were prepared using the Illumina TruSeq RNA-Seq Sample Prep Kit following the manufacturer's instructions. Libraries were multiplexed, pooled and sequenced using the Paired End Clustering protocol with 2 × 51 cycles of sequencing on four lanes of an Illumina HiSeq2000 (Data Citation 1).

Processing of the datasets

The overall process for transcriptome assembly, annotation, ortholog clustering and validation is summarised in Fig. 1. After checking the technical quality of the sequencing with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), root and shoot samples for each accession were combined and assembled using the Trinity15 de novo assembly program at kmer values of 25 and 32. Quality of the assemblies was assessed using BUSCO (ref. 16) (Benchmarking Universal Single-Copy Orthologs) and TransRate17. For the MP accession, which had a higher number of reads, subsampling to 105 million reads was performed using seqtk (https://github.com/lh3/seqtk.git). This step was performed because it has previously been reported that there is an optimum coverage for de novo transcriptome assembly18. Assembly for the MP accession was conducted on both the subsampled and the complete sets of reads.

Figure 1: Overview of data processing.

Raw reads (1) were assembled using the Trinity Assembler (2) at two kmer values: 25 and 32. Assembly quality was assessed using BUSCO and TransRate (3) utilising external sequence and protein data along with initial raw read sequences. A final assembly was then chosen for each accession (4). For MP accession, reads were also subsampled to the same read depth using seqtk (5) and assembled at both read depths. The predicted protein sequences were obtained using Transdecoder (6). Blast searches were carried out on the protein and transcript sequences against the uniprot and uniref databases (7). These were then combined into an annotation using Trinotate (8). Protein sequences were also clustered into orthogroups using OrthoFinder (9) and protein sequences from other plant species. A multiple alignment was produced from each orthogroup using Muscle (10). Key—Yellow, input data; blue, processing steps; orange, intermediate data/files produced during the process; green, data from public databases; red, final output data.

Quality of the assemblies was assessed using TransRate and BUSCO. The kmer 32 assemblies and the MP subsampled kmer 32 assembly were chosen for annotation and ortholog identification. These assemblies are available in the NCBI Transcriptome Shotgun Assembly Sequence Database (Data Citations 2–5). Annotation for each assembly was conducted using the Trinotate program. Orthologs were identified using OrthoFinder. As a final step in the pipeline, each assembly was filtered to remove sequences that did not have a top blast hit to Viridiplantae (green plant) sequences. The BUSCO assessment was then repeated on the filtered datasets to check whether the filtering had reduced coverage.

De novo assembly

Reads for all samples (three biological replicates of both roots and leaves) from each accession were combined, and each accession was assembled separately using the Trinity v2.0.6 de novo transcriptome assembler15. The total number of reads assembled for each accession is shown in Table 1. The settings that were used for Trinity included quality and adapter trimming using Trimmomatic19. No path merging was set so that all sequences with small differences were included in the output. Other settings were kept at default values. Reads were assembled using kmer values of 25 (default) and 32. For the MP accession 219 million reads were sequenced compared to approximately 105 million for the GA, LC and LE accessions. Since it has previously been reported that there is an optimum sequencing depth for transcriptome assembly18, we also subsampled 105 million reads from MP using seqtk and assembled these at kmer values of 25 and 32.
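The idea behind subsampling paired-end reads can be sketched as follows. This is an illustrative Python sketch, not seqtk's implementation: the function names, the seed value and the toy reads are hypothetical, and it assumes plain 4-line FASTQ records. The key point is that the same set of record indices is applied to both mate files so that read pairs stay in sync.

```python
import random
from typing import Iterable, Iterator, List, Set


def choose_records(total: int, n: int, seed: int = 11) -> Set[int]:
    """Pick n record indices out of total, reproducibly for a given seed."""
    return set(random.Random(seed).sample(range(total), n))


def subsample_fastq(lines: Iterable[str], keep: Set[int]) -> Iterator[List[str]]:
    """Yield the 4-line FASTQ records whose 0-based index is in keep."""
    record: List[str] = []
    index = 0
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if index in keep:
                yield record
            record = []
            index += 1


# Toy paired-end data (hypothetical reads), five pairs in two mate "files".
r1 = [x for i in range(5) for x in (f"@read{i}/1", "ACGT", "+", "IIII")]
r2 = [x for i in range(5) for x in (f"@read{i}/2", "TGCA", "+", "IIII")]

keep = choose_records(total=5, n=2)       # same indices for both mates
left = list(subsample_fastq(r1, keep))
right = list(subsample_fastq(r2, keep))
print([rec[0] for rec in left], [rec[0] for rec in right])
```

For read sets of the size described here, holding every chosen index in memory would be costly, so a dedicated tool such as seqtk is the practical choice; the sketch only shows why a shared selection keeps mates paired.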

Table 1 Raw number of reads for each accession.

Assessment of assembly quality

The quality of each assembly was checked using TransRate to generate metrics for comparison. The trimmed reads generated during assembly were supplied to TransRate to calculate mapping statistics. For the MP subsampled assembly, the complete read files (before subsampling) were used for the mapping. The protein set from Eutrema salsugineum20 was downloaded from Phytozome 10.2 (ref. 21) and used for TransRate comparative metrics. Assemblies were compared against the BUSCO (ref. 16) plant early release dataset to calculate the extent of coverage (Table 2).

Table 2 Assembly quality metrics.

Existing sequences for GA from a 454-sequencing experiment were obtained from the Transcriptome Shotgun Assembly database (accession GASZ01000000; ref. 12). These sequences were used for validation and to compare coverage of the assemblies. TransRate and BUSCO quality assessments were performed on this dataset. The highest TransRate scores were obtained for the kmer 32 assemblies and, in the case of MP, for the kmer 32 assembly from subsampled reads.

Annotation

The transcripts for each accession for the kmer 32 assemblies were annotated using the Trinotate15,22–30 annotation pipeline following the method outlined at http://trinotate.github.io/. Initially, the transcripts were searched against the custom UniProt and UniRef90 databases using blastx, allowing one hit and with output in tabular format. No e-value cut-off was set. The expected protein translations were obtained using TransDecoder and then searched against UniProt and UniRef90 using blastp. The same blast parameters were used as for the blastx searches. The blast searches were loaded into the Trinotate.sqlite database that was obtained from the Trinity ftp site and an annotation report was generated. An e-value of 1e-5 was used as the threshold for the blast results during the report generation.
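Applying an e-value threshold to tabular blast output at report time, as described above, can be sketched in a few lines. This is not Trinotate's code: it assumes the default 12-column tabular layout (e-value in the eleventh column), treats the threshold as inclusive, and uses hypothetical query and subject identifiers.

```python
import csv
from typing import Dict, Iterable

EVALUE_COLUMN = 10  # 0-based index in the default 12-column tabular output


def best_hits(rows: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, str]:
    """Map each query ID to its first reported subject passing the cut-off.

    Assumption: rows are tab-separated with the default tabular columns.
    """
    hits: Dict[str, str] = {}
    for fields in csv.reader(rows, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = fields[0], fields[1]
        if float(fields[EVALUE_COLUMN]) <= max_evalue and query not in hits:
            hits[query] = subject
    return hits


# Hypothetical hits: one strong match and one that fails the threshold.
example = [
    "TR1|c0_g1_i1\tsp|P12345|X_ARATH\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t200",
    "TR2|c0_g1_i1\tsp|Q99999|Y_ARATH\t40.0\t30\t18\t0\t1\t90\t1\t30\t0.5\t25",
]
print(best_hits(example))
```

Running the searches without a cut-off and filtering afterwards, as the text describes, keeps the raw results reusable if a different threshold is wanted later.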

OrthoFinder

Protein sequences from six other plant species were obtained to identify ortholog groups. Arabidopsis thaliana (ATH)31, Arabidopsis lyrata (ALY)32, Thellungiella parvula (TPA)33, Brassica rapa (BRA)34 and Capsella rubella (CRU)37 protein sequences were downloaded from Plaza v 3.0 (ref. 38). Eutrema salsugineum (EUT)20 sequences were downloaded from Phytozome 10.2 (ref. 21). OrthoFinder37 was used to identify groups of orthologs between the species.

Filtering by top blast hit

As the annotated transcripts could still include non-plant sequences, all transcripts were also searched against the NCBI non-redundant protein sequences (nr) database using blastx and the nucleotide collection (nt) database using blastn, both with an e-value cut-off of 1e-5. The blast output format was set as -outfmt '6 qseqid staxids sseqid' to output the taxonomic information for each hit. A Python script available in Data Citation 6 was used to parse the taxonomic group information from the NCBI Taxonomy database. Transcripts with a top blast hit to Viridiplantae ('green plants') were retained. The fasta files were filtered using cdbfasta (https://sourceforge.net/projects/cdbfasta/) providing the ID of the transcripts to be retained. The BUSCO scores were calculated for the filtered transcript sets to ensure that the assembly coverage was not reduced by the filtering (Table 3). Filtered transcript sequences have been deposited in the NCBI Transcriptome Shotgun Assembly (TSA) sequence database (Data Citations 2–5).
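The top-hit filter can be illustrated with a short sketch. This is not the script from Data Citation 6: the function names are hypothetical, the parent map is a heavily abbreviated toy lineage (a real one would be built from the NCBI Taxonomy dump), and it assumes that hits for each query are listed best first. The only fixed value is the NCBI Taxonomy ID for Viridiplantae, 33090.

```python
from typing import Dict, Iterable, Set

VIRIDIPLANTAE_TAXID = 33090  # NCBI Taxonomy ID for Viridiplantae


def in_clade(taxid: int, clade: int, parent: Dict[int, int]) -> bool:
    """Walk up the parent map until the clade or the root is reached."""
    seen = set()
    while taxid not in seen:
        if taxid == clade:
            return True
        seen.add(taxid)
        if taxid not in parent:
            return False
        taxid = parent[taxid]
    return False  # reached a self-parented root without meeting the clade


def plant_transcripts(rows: Iterable[str], parent: Dict[int, int]) -> Set[str]:
    """Return query IDs whose first listed hit belongs to Viridiplantae."""
    top_hit_taxids: Dict[str, str] = {}
    for row in rows:
        qseqid, staxids, _sseqid = row.rstrip("\n").split("\t")
        top_hit_taxids.setdefault(qseqid, staxids)  # keep only the first hit
    keep = set()
    for qseqid, staxids in top_hit_taxids.items():
        # staxids may hold several IDs separated by ';' or a non-numeric value
        ids = [int(t) for t in staxids.split(";") if t.isdigit()]
        if any(in_clade(t, VIRIDIPLANTAE_TAXID, parent) for t in ids):
            keep.add(qseqid)
    return keep


# Toy lineage (child -> parent), abbreviated for illustration only.
toy_parent = {3702: 33090, 33090: 2759, 9606: 2759, 2759: 1, 1: 1}
toy_rows = [
    "t1\t3702\tsubjA",   # top hit: Arabidopsis thaliana
    "t1\t9606\tsubjB",   # second hit, ignored
    "t2\t9606\tsubjB",   # top hit: Homo sapiens
    "t3\tN/A\tsubjC",    # no taxonomy information
]
print(plant_transcripts(toy_rows, toy_parent))
```

The resulting set of IDs is the kind of list that would then be passed to a fasta indexing tool to extract the retained transcripts.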

Table 3 BUSCO quality metrics after assembly filtering.

Multiple alignment

Ortholog groups that contained one or more N. caerulescens sequences after top blast hit filtering were retained. The sequences for each group were collected into a separate fasta file per cluster. Sequences for each cluster were multiply aligned using MUSCLE v3.8.31 (ref. 38). Output was selected in fasta and html format. Fasta files and html alignment files for each cluster are available in Data Citation 6.
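The retention rule and the per-cluster fasta files can be sketched as below. This is an illustration under stated assumptions: the function names, orthogroup names and sequence identifiers are hypothetical, and the sketch keeps all members of a retained group as they are, since the text does not say whether filtered-out sequences were also dropped from the groups.

```python
from typing import Dict, List, Set


def retained_orthogroups(
    orthogroups: Dict[str, List[str]], noccaea_ids: Set[str]
) -> Dict[str, List[str]]:
    """Keep orthogroups with at least one retained N. caerulescens sequence."""
    return {
        name: members
        for name, members in orthogroups.items()
        if any(member in noccaea_ids for member in members)
    }


def to_fasta(members: List[str], sequences: Dict[str, str]) -> str:
    """Format the members of one orthogroup as FASTA text."""
    return "".join(f">{m}\n{sequences[m]}\n" for m in members)


# Hypothetical identifiers and sequences, for illustration only.
groups = {"OG1": ["ATH_g1", "GA_t1"], "OG2": ["ATH_g2", "BRA_g7"]}
seqs = {"ATH_g1": "MKTL", "GA_t1": "MKTI", "ATH_g2": "MAAA", "BRA_g7": "MAAS"}
kept = retained_orthogroups(groups, noccaea_ids={"GA_t1"})
for name, members in kept.items():
    print(name)
    print(to_fasta(members, seqs), end="")
```

Each FASTA string would be written to its own file and passed to the aligner, one cluster at a time.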

Code availability

The Python code used to parse taxonomy information is available in Data Citation 6.

Data Records

The raw sequence data (Data Citation 1 and Table 4) were deposited in the NCBI Sequence Read Archive. The dataset contains 24 records. For each accession (GA, LC, LE and MP) three replicates were sequenced for root and shoot samples. Each replicate comprised 12 plants.
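The record count follows directly from the design: four accessions, two tissues and three replicates. A minimal sketch that enumerates the combinations is shown here; the sample label format is hypothetical and does not correspond to the identifiers used in the archive.

```python
from itertools import product

ACCESSIONS = ["GA", "LC", "LE", "MP"]
TISSUES = ["root", "shoot"]
REPLICATES = [1, 2, 3]

# One label per sequenced sample (label format is an assumption).
samples = [
    f"{acc}_{tissue}_rep{rep}"
    for acc, tissue, rep in product(ACCESSIONS, TISSUES, REPLICATES)
]
print(len(samples))  # 24
```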

Table 4 Description of samples that have been submitted to the NCBI Sequence Read Archive.

The assemblies for each accession at a kmer size of 32 and with subsampled reads for MP (Data Citations 2–5 and Table 5) were deposited in the NCBI Transcriptome Shotgun Assembly Sequence Database.

Table 5 Description of the Accession numbers for the sequences that have been submitted to the NCBI Transcriptome Shotgun Assembly Sequence Database.

Full annotation information for the assemblies (as Excel files) and fasta files of the ortholog groups are available on Dryad (Data Citation 6).

Technical Validation

Computational Validation

Comparison against the BUSCO plant early release dataset showed that 90 to 91% of single-copy orthologs in the benchmarking dataset were present and complete in the assemblies before and after filtering (Tables 2 and 3). TransRate statistics for both mapping and reference-based metrics were also high, with over 90% of reads mapping to the assemblies and over 80% classed as good mappings (Table 2).

Manual validation of the assemblies

To manually validate the assembly results, GenBank was searched for complete protein sequences from the accessions. There were results for GA and LC but no sequences were available for LE or MP. In total, 14 sequences for GA corresponding to 9 genes and 10 sequences for LC corresponding to 8 genes were analysed. First, a search using blastp was conducted to obtain the matching sequence from the de novo assemblies. The sequences were then grouped, where more than one GenBank sequence matched the same assembled sequence, and a multiple alignment was performed. The similarity of known sequences to the assembly and the length of the alignment were recorded (Table 6). From these sequences, 14 out of 17 had at least 98.9% identity. Sequences that were difficult to assemble from the transcriptome included genes that are known to have multiple copies, e.g. HMA4 (ref. 39) and IRT1 (ref. 40).
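A percent identity figure of this kind can be computed from two rows of an alignment. The sketch below is one possible definition, not necessarily the one used for Table 6: it counts identical residues over columns where neither sequence has a gap, and the function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over columns where neither row has a gap.

    Assumption: gap columns are excluded from the denominator; other
    definitions (e.g. dividing by full alignment length) give other values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * matches / compared


# Hypothetical aligned fragments: 4 identical residues out of 5 compared.
print(percent_identity("MKT-LV", "MKTALI"))
```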

Table 6 Comparison of assembled sequences to sequences available in GenBank.

Additional information

How to cite this article: Blande, D. et al. De novo transcriptome assemblies of four accessions of the metal hyperaccumulator plant Noccaea caerulescens. Sci. Data 4:160131 doi: 10.1038/sdata.2016.131 (2017).
