Introduction

The recent advent of next-generation sequencing (NGS) technologies has opened unique avenues to address ecological and evolutionary questions involving non-model biological systems for which there are limited genomic resources (Hudson, 2008; Ekblom and Galindo, 2010). This is particularly relevant for complex and redundant genomes of polyploid species, which represent a major fraction of eukaryotic lineages (Otto, 2007). Although sequencing performance is rapidly improving in read depth, technologies generating long-sequence fragments such as 454 Roche pyrosequencing have proven particularly useful in de novo sequencing and development of new resources for non-model species, without an available reference genome (Wheat, 2008). High-throughput transcriptome sequencing allows assembly of reference transcriptomes that may be used for various purposes in evolutionary ecology, such as functionally important gene annotation or discovery (for example, Alagna et al., 2009; Barakat et al., 2009; Sun et al., 2010; Logacheva et al., 2011), molecular marker (for example, microsatellite, single-nucleotide polymorphism (SNP)) detection (Barbazuk et al., 2007; Novaes et al., 2008; Bundock et al., 2009) or gene expression variation (Buggs et al., 2010; Swarbreck et al., 2011; Ilut et al., 2012; Yoo et al., 2012). As polyploidy is a recurrent process, many lineages exhibit superimposed traces of genome duplication. Large-scale sequencing and deep read coverage offer a unique opportunity to explore the redundant genome and transcriptome of polyploids, even when diploid progenitors are unidentified or extinct, which makes identification of duplicated homoeologous gene copies particularly challenging.

Recurrent polyploidy is particularly well illustrated in the genus Spartina (Poaceae), where all extant species are polyploids (reviewed in Ainouche et al. (2012)). The grass genus Spartina belongs to the Chloridoideae subfamily, a genomically poorly explored Poaceae lineage, contrasting with well-investigated crops, such as rice, sorghum, maize or wheat that belong to other grass subfamilies. Divergence between Spartina and these grass models is currently estimated to be 35–40 million years ago (MYA) with Panicoideae (including Sorghum and maize) and at least 50 MYA with Ehrhartoideae (including rice) (Christin et al., 2008). Spartina is composed of 13–15 perennial species (Mobberley, 1956), colonizing coastal or inland salt marshes. The basic chromosome number in Spartina is x=10, as in most Chloridoideae (Marchant, 1968). Spartina species exhibit various ploidy levels ranging from tetraploid to dodecaploid (Ainouche et al., 2004a). Two closely related hexaploid species, Spartina maritima (Curt.) Fern., and S. alterniflora Lois., are derived from a common hexaploid ancestor (Baumel et al., 2002a; Fortune et al., 2007); although divergence time has not been definitively ascertained, analysis of chloroplast DNA divergence suggests that they diverged less than 3 MYA. They have a critical ecological role in coastal salt marshes at the interface of land and sea, and represent classical models involved in reticulate evolution and recent polyploid speciation (Ainouche et al., 2004a, 2004b; Ainouche et al., 2009). They thus make a good model in evolutionary ecology to investigate the consequences of polyploidy at different evolutionary time scales in natural populations, and to explore the adaptive processes accompanying hybridization, polyploid species formation and expansion.

As for most Spartina species, S. alterniflora is native to the New World, where it is distributed from Canada to southern Argentina along the North and South American Atlantic coast (Mobberley, 1956), whereas S. maritima is distributed along the western European and African Atlantic coasts. Divergence between the two species across the Atlantic Ocean was accompanied by ecological and phenotypic differentiation. Spartina alterniflora has a larger distribution and displays invasive abilities in most regions where it was introduced: in California (Ayres et al., 2004; Civille et al., 2005), in China (Li et al., 2009) and in western Europe (Campos et al., 2004; Ainouche et al., 2009; Querné et al., 2011). In contrast, S. maritima populations are regressing. The recession of S. maritima in its northern range limit (southern England and Brittany) is interpreted as a consequence of climatic changes and anthropogenic habitat disturbance (Raybould et al., 1991), but may also be related to the biological and morphological differences between these two species. Spartina alterniflora exhibits strong rhizomes facilitating lateral expansion and sediment accretion, and thus has an important role in the salt marsh dynamics where it is considered as an ecosystem engineer, whereas S. maritima is a non-rhizomatous, genetically depauperate species (Yannic et al., 2004) with very low seed production (Marchant and Goodman, 1969; Castellanos et al., 1994; Castillo et al., 2008). Spartina maritima and S. alterniflora also exhibit chromosome number differences, as the former has a regular hexaploid number (2n=6x=60) whereas the latter presents aneuploidy (2n=62), and genome size differences (2C=3.8 pg for S. maritima and 2C=4.3 pg for S. alterniflora, Fortune et al., 2008). 
Less than 5% nucleotide divergence was encountered at 10 putative orthologous-coding loci between the two species, but consistent gene expression differences (13% of the examined genes) were detected using heterologous rice microarrays (Chelaifa et al., 2010a). Genes involved in cellular growth were found highly expressed in S. alterniflora and downregulated in S. maritima, whereas stress-related genes were highly expressed in S. maritima (Chelaifa et al., 2010a).

Spartina alterniflora and S. maritima are involved in one of the textbook examples of recent allopolyploid speciation (reviewed in Ainouche et al., 2004b; Ainouche et al., 2009). Spartina alterniflora was accidentally introduced during the 19th century in Europe, where it hybridized with the native S. maritima. In England, hybridization (with S. alterniflora as maternal genome donor, Ferris et al., 1997; Baumel et al., 2001) resulted in Spartina × townsendii, a perennial sterile hybrid first recorded around 1870 (Groves and Groves, 1880), that gave rise by chromosome doubling to a fertile, vigorous and highly invasive allododecaploid species, Spartina anglica, which has now been introduced on several continents. An independent hybridization event between S. maritima and S. alterniflora also occurred in southwest France with S. alterniflora as the maternal parent (Baumel et al., 2003), contributing to the formation of another sterile F1 hybrid, Spartina × neyrautii.

Recent studies have examined the evolutionary fate of the homoeologous parental genomes from S. maritima and S. alterniflora in the neo-allododecaploid species to understand the genomic determinants of the ecological success of the invasive neopolyploid (Baumel et al., 2002b; Ainouche et al., 2004a; Salmon et al., 2005; Parisod et al., 2009). These studies have revealed that epigenetic reprogramming (for example, DNA methylation, Salmon et al., 2005; Parisod et al., 2009) and evolution of gene expression (Chelaifa et al., 2010b) represent important components of the speciation process in polyploid Spartina species, and most likely have a critical role in the ecology of the species. However, the previously employed technology for transcriptome analysis (heterologous hybridization on rice microarrays) had several limitations (for example, only a fraction of the genes that hybridized on the array could be analyzed, only global gene expression variation could be evaluated, with no possible distinction of the copies duplicated by polyploidy). Developments in sequencing technology mean that there is now the potential to develop more advanced genomic resources in this important model system for understanding the ecological and evolutionary consequences of hybridization and polyploidy.

When analyzing species such as these where genomic resources are lacking, constitution of a reference transcriptome represents a first critical step to explore the genic compartment. In polyploids, assembled contigs from sequence reads represent consensus sequences among potentially different alleles at strictly orthologous loci, more or less divergent homoeologues (parental orthologues duplicated by polyploidy), or recent paralogues (resulting from individual gene duplication); thus necessitating a more complicated analytical strategy than for diploids. The goal of this study is to build a reference ‘consensus’ transcriptome in the hexaploid parental species S. alterniflora and S. maritima using NGS technology, which will allow annotation and identification of specific Spartina genes, including genes of ecological or evolutionary (that is, genes whose expression is altered following speciation) interest. The strategy was then to (i) choose the appropriate high-throughput sequencing that generates long reads facilitating de novo assemblies in the absence of a reference genome (that is, the GS-FLX Roche 454 technology) and (ii) to sequence as many diverse transcripts as possible (to annotate a maximum of genes), by using different types of complementary DNA (cDNA) libraries (normalized and non-normalized) from different tissues (leaves, roots) and from different (natural or controlled) environmental conditions. Sequence heterogeneity at putative homologous loci (within ‘consensus’ contigs) is discussed in the context of the (hexaploid) redundant genomes of S. maritima and S. alterniflora. Beyond the Spartina model, the procedure presented here may be applicable to any polyploid system for which no reference genome is available and whose parental species (that is, homoeologous copies) are unknown.

Materials and methods

Plant material

Samples from S. alterniflora were collected in Landerneau (Finistère, France). Spartina maritima was collected at two sites from the French Atlantic coast: Pointe du Verdon (Morbihan) and Noirmoutier (Vendée). Several individuals were collected at each site, and plants were transplanted in the greenhouse (University of Rennes 1).

To maximize detection and annotation of various expressed Spartina genes, RNA extraction was performed on different organs (leaves and roots) from plants sampled either from wild populations and so grown in variable natural conditions (normalized cDNA libraries) or transplanted in a common greenhouse environment (non-normalized cDNA libraries) (Figure 1). Non-normalized libraries usually offer an overview of the most transcribed genes, whereas normalization facilitates the assessment of rare transcripts by decreasing the prevalence of abundant transcripts. For practical reasons, the normalized library could be done only on one species (the European native S. maritima), which was chosen because a larger population sampling was available as part of an ongoing project in our laboratory, involving genome sequencing of this species.

Figure 1

Sampling strategy and construction of the normalized (S. maritima) and non-normalized (S. maritima and S. alterniflora) cDNA libraries.

Non-normalized cDNA libraries for both S. maritima (from Pointe du Verdon) and S. alterniflora (from Landerneau) were created from plants grown in the same conditions in the greenhouse (30 cm³ daily watered pots containing a mixture of soil, fertilizer and sand) under a day temperature of 20 °C and night temperature of 14 °C. After 21 days of acclimatization, 1–2 g of young leaves and roots per plant were collected separately from three different individuals (from the same population), frozen in liquid nitrogen and stored at −80 °C until RNA extraction.

A normalized library (for S. maritima) was created using leaves from eight individuals collected in the population from Noirmoutier and sampled along a tidal gradient to capture subtleties in gene expression under varying environmental conditions. Two additional S. maritima individuals collected from Pointe du Verdon and transplanted in the greenhouse were also included in the normalized library. Five young leaves were selected for each individual plant, and stored in RNAlater solution (Ambion Inc., Austin, TX, USA) at −20 °C until RNA extraction. For practical reasons, the root normalized library was prepared from the same greenhouse-transplanted plants that were used for the non-normalized library. Roots were carefully washed in distilled water, and then young roots were cut and collected in liquid nitrogen.

For each sample, total RNA was extracted from frozen leaves and roots with Trizol reagent (Sigma-Aldrich Inc., St. Louis, MO, USA) using three cycles of precipitation with isopropanol (Sigma-Aldrich), according to a procedure previously described for Spartina (Chelaifa et al., 2010a, 2010b). All RNA samples were quantified using a Nanodrop Spectrophotometer ND 1000 (Nanodrop Technologies, Thermo Fisher Scientific Inc. Waltham, MA, USA) and the RNA quality (absence of degradation and DNA contamination) was checked on an Agilent 2100 Bioanalyzer (DNA 7500 Chip, Agilent Technologies, Santa Clara, CA, USA). After processing, RNA was stored at −80 °C.

cDNA preparation

cDNA synthesis was performed with 1 μg of total RNA using the SMARTer cDNA Synthesis Kit (Clontech, Mountain View, CA, USA), following the manufacturer's recommended protocol. Briefly, first-strand cDNA synthesis was primed with a modified oligo(dT) primer (the 3′SMART CDS Primer II A). When SMARTScribe RT reaches the 5′-end of the mRNA, the enzyme adds a few additional nucleotides to the 3′-end of the cDNA. After a second-strand cDNA synthesis reaction, double-stranded cDNAs were amplified (21 cycles with primer 5′ PCR Primer II A). This procedure yielded about 2–6 μg of cDNAs that were purified using the Qiaquick PCR Purification Kit (Qiagen, Hilden, Germany). An equimolar mix of samples was constituted for each organ and each species to reach 10 μg of total cDNA and stored at −20 °C until sequencing.

Normalization of S. maritima cDNA

A total of 1 μg of cDNAs from each organ (leaves and roots) of S. maritima was separately normalized as follows: 4 μl of 4 × hybridization buffer was added, and the samples were denatured at 95 °C for 5 min and then allowed to anneal at 68 °C for 5 h. The following preheated reagents from the Trimmer kit (Evrogen, Moscow, Russia) were added to the hybridization reaction at 68 °C: 3.5 μl milliQ water, 1 μl 5 × DNAse buffer, 1 μl double-strand nuclease (DSN) enzyme. After incubation at 68 °C for 25 min, the DSN enzyme was inactivated by adding 10 μl of DSN stop solution and heating at 68 °C for 5 min. The normalized cDNA samples were diluted by adding 40.5 μl milliQ water and used for two PCR amplifications. The first PCR (50 μl) contained 1 μl diluted cDNA, 5 μl 10 × Advantage 2 PCR buffer (Clontech), 1 μl 50 × dNTPs mix, 1.5 μl PCR primer M1 10 μM (Evrogen), 1 μl 50 × Advantage 2 Polymerase mix (Clontech) and was amplified as follows: initial denaturation at 95 °C for 1 min, followed by 18 cycles (95 °C for 15 s, 66 °C for 20 s, 72 °C for 3 min). The second PCR reaction (100 μl) was performed using 2 μl of diluted normalized cDNA, 1 μl of 10 × Advantage 2 PCR Buffer (Clontech), 2 μl 50 × dNTP mix, 4 μl PCR Primer M2 10 μM (Evrogen), 2 μl 50 × Advantage 2 Polymerase mix (Clontech) and was amplified following an initial denaturation at 95 °C for 1 min, then 12 cycles (95 °C for 15 s, 64 °C for 20 s, 72 °C for 3 min), and a final extension step (64 °C for 15 s and 72 °C for 3 min). The normalized double-stranded cDNAs were checked on an agarose gel and on an Agilent 2100 bioanalyzer DNA chip (DNA 7500 chip), quantified with a ND 1000 Spectrophotometer (Nanodrop Technologies Inc., Wilmington, DE, USA), and stored at −20 °C.

Sequencing, cleaning and assembly

The four non-normalized cDNA libraries (roots and leaves from S. maritima and S. alterniflora) were sheared by nebulization and sequenced at the Genoscope Platform (Evry). A total of 500 ng of cDNAs were sequenced for each library in two runs on a 454 GS XLR70 Titanium Genomic Sequencer (Roche Inc., Basel, Switzerland). The tissues (leaves and roots) were distinctly distributed on two half regions of the sequencing plate.

Sequencing of the normalized S. maritima cDNA libraries was performed at the Environmental Genomic Platform of the University of Rennes 1. A total of 500 ng of each normalized cDNA library from S. maritima leaves and roots were nebulized and sequenced separately in two half-plates on a 454 GS XLR70 Titanium Genomic Sequencer (Roche Inc.).

The 454 sequence primers (Roche Inc.) and low-quality sequences were removed during signal processing. GS Assembler version 2.3 (Roche, Inc.) was employed to assemble reads into contigs; this program has already been used successfully for assembly in transcriptome analyses (Bellin et al. (2009) in Vitis vinifera; Gedye et al. (2010) in S. pectinata; Sun et al. (2010) in Panax ginseng).

Different assemblies were performed for each separate library or for combined data sets per species, tissue and normalization type. Finally, a global assembly of all the obtained reads provided the reference transcriptome for both hexaploids.

As hexaploid Spartina species are expected to potentially express up to six allelic transcripts per locus (resulting from three duplicated pairs of homoeologous genes), the assembly strategy aimed at assembling potentially homologous reads (orthologues and homoeologues) with relatively low stringency to construct consensus contigs constituting the ‘hexaploid reference transcriptome’ that will be used for identification and annotation of Spartina genes. With this aim, the effects of different minimum match percentages (90, 95, 96 and 97%) on the assembly process were explored. Analyses presented in this paper are based on de novo assemblies executed with a minimum match of 90% over at least 100 bp and GS Assembler version 2.3 (Roche, Inc.) default parameters for cDNA. This low minimum match percentage (90%) was chosen to maximize assembly of reads corresponding to putative orthologous and homoeologous transcripts, although we cannot rule out assembling weakly divergent paralogs. Useful information (such as the number of reads used in the assembly, the number of contigs and singletons, mean length and read depth) was extracted from assembly files. Read depth is estimated by GS Assembler as the total number of included bases from all the obtained 454 sequence reads aligned to generate the consensus contig sequence, divided by the contig length. To test the validity of the assembly, we aligned 10 contigs against homologous expressed sequence tags (ESTs), which were sequenced using the Sanger method (Chelaifa et al., 2010a), and sequence identities were calculated.
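The read depth definition given above can be illustrated with a minimal Python sketch. The function name and the example values are hypothetical and are not taken from GS Assembler output; the sketch only reproduces the stated ratio (aligned bases divided by contig length).

```python
def mean_read_depth(aligned_read_lengths, contig_length):
    """Mean read depth = total aligned bases / contig length.

    aligned_read_lengths: number of bases each read contributes to the contig
    (hypothetical input; the assembler reports this internally).
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return sum(aligned_read_lengths) / contig_length


# Hypothetical 600 bp contig built from five reads.
example_depth = mean_read_depth([300, 250, 310, 280, 360], 600)
print(f"mean read depth: {example_depth:.2f}")
```

In this toy case 1500 aligned bases over a 600 bp contig give a depth of 2.5.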

Contigs from S. maritima and S. alterniflora were then mapped to the Sorghum bicolor genome, the closest related species to Spartina that has a fully sequenced and annotated genome (Paterson et al., 2009), to compare the distribution and density of the identified Spartina homologous genes across the different Sorghum chromosomes. The Sorghum bicolor gene annotation was retrieved from the Sbicolor_79_gene.gff3 annotation file available at http://genome.jgi-psf.org/Sorbi1/ and gene density was estimated from the proportion of annotated genes per 100 kb intervals. Colinearity between Spartina and Sorghum has not been investigated previously, but conservation of gene colinearity is expected according to what is known from related lineages (for example, finger millet, Chloridoideae and rice, Ehrhartoideae, Srinivasachary et al. (2007)) in the grass family. The BLASTn algorithm was used with an e-value threshold of 10⁻⁵, and the Best BLAST Hit (corresponding to the lowest e-value and highest bit score) was parsed for each query sequence. The proportion of Spartina homologs was calculated by 100 kb windows (delimited on the Sorghum genome) and the results were represented using the Circos v.0.55 software (Krzywinski et al., 2009). To evaluate the genome-wide representation of the assembled contigs on the Sorghum genome, Pearson’s correlations and linear regressions were calculated between gene densities (number of genes per 100 kb window) in Sorghum and corresponding homologs in the investigated Spartina species. Both statistics were calculated for all 10 Sorghum chromosomes and by individual chromosomes using the R software (R Development Core Team, 2011).
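The window-based comparison described above can be sketched as follows. This is an illustrative Python sketch, not the R code used in the study: it assumes consecutive, non-overlapping 100 kb windows and 0-based start coordinates, and all positions in the example are invented.

```python
from collections import Counter
from math import sqrt

WINDOW = 100_000  # 100 kb windows, as in the text


def window_counts(positions, chrom_length, window=WINDOW):
    """Count features whose start coordinate falls in each consecutive window."""
    n_windows = (chrom_length + window - 1) // window
    counts = Counter(pos // window for pos in positions)
    return [counts.get(i, 0) for i in range(n_windows)]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Hypothetical 400 kb chromosome: annotated gene starts and best-hit positions.
sorghum_genes = [5_000, 40_000, 90_000, 120_000, 250_000,
                 260_000, 270_000, 280_000, 350_000]
spartina_hits = [10_000, 95_000, 255_000, 265_000, 390_000]

sorghum_counts = window_counts(sorghum_genes, 400_000)
spartina_counts = window_counts(spartina_hits, 400_000)
r = pearson_r(sorghum_counts, spartina_counts)
print(sorghum_counts, spartina_counts, round(r, 3))
```

Applying the same two functions chromosome by chromosome would give the per-chromosome correlations; a significance test, as reported by R, is outside the scope of this sketch.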

Annotation

BLASTn and tBLASTx (Altschul et al., 1990) analyses of contigs and singletons were conducted against two nucleotide databases: the Oryza sativa EST database (http://rapdb.dna.affrc.go.jp), and a home-built, regularly updated Poaceae database, including ESTs from Oryza sativa, Zea mays, Brachypodium distachyon and Sorghum bicolor (www.gramene.org). All BLAST searches were performed with an e-value threshold of 10⁻⁵. The Best BLAST Hit for each query was parsed from the BLAST results for homology-based functional annotation.
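Parsing a best hit per query can be sketched in a few lines of Python. The sketch assumes the BLAST results are available in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); the output format actually used in the study is not stated, and the identifiers below are invented.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # threshold quoted in the text


def best_hits(handle, evalue_cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} keeping the top-scoring hit."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > evalue_cutoff:
            continue  # hit does not pass the e-value threshold
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical tabular output: two hits for contig1, one weak hit for contig2.
example = io.StringIO(
    "contig1\tgeneB\t92.0\t300\t24\t0\t1\t300\t10\t309\t1e-20\t100\n"
    "contig1\tgeneA\t98.0\t400\t8\t0\t1\t400\t5\t404\t1e-50\t200\n"
    "contig2\tgeneC\t85.0\t40\t6\t0\t1\t40\t1\t40\t0.01\t30\n"
)
hits = best_hits(example)
print(hits)
```

Here contig1 is assigned geneA (higher bit score, lower e-value) and contig2 remains unannotated because its only hit fails the threshold.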

GO annotations using BLAST2Go (Conesa et al., 2005; Götz et al., 2008) were performed using tBLASTx (e-value 10⁻⁶) on assembled contigs against the Arabidopsis thaliana database from the TAIR website (www.arabidopsis.org) (with GO IDs and term assigned), with an annotation e-value hit filter of 10⁻⁶ and a cutoff of 55 (maximum similarity).

The annotated Spartina transcriptome was examined to identify genes of potential ecological interest (for example, genes involved in salt stress response, oxidative stress, heavy metal tolerance or growth). Genes whose expression was previously found altered following hybridization and genome duplication from a rice microarray-based study on these species (Chelaifa et al., 2010a) were investigated. The corresponding accession numbers of the rice oligos spotted on Agilent microarrays (44 K Agilent G2519F) employed in that study were used to retrieve putative homologs in our Spartina reference transcriptome using BLASTn (e-value 10⁻⁵).

Sequence heterogeneity at homologous gene copies

As both Spartina species studied here are hexaploid, sequence read heterogeneity is expected in the assembled contigs, resulting from both genome duplication and allelic variation within homoeologues (heterozygosity at orthologous loci). In this study, we chose the 454 technology because it generates long read sequences to facilitate de novo assembly, but this sequencing method offers less read depth than alternative technologies generating short reads to capture all the allelic variants that may be transcribed at each locus. As a preliminary evaluation of sequence heterogeneity among assembled reads obtained with the 454 pyrosequencing technique, we have selected contigs with relatively good coverage (at least 50 reads) that were present in both S. maritima and S. alterniflora data sets.

We looked at polymorphisms within contigs by mapping the corresponding reads (using Genome Assembler v 2.5.3, Roche) to a subset of selected homologous contigs between the two species. We then scanned the resulting alignments for SNPs using the Ace.py program from the biopython package (http://biopython.org/). Rare SNPs or SNPs detected within homopolymeric regions were removed from the analysis to avoid putative false-positive SNPs. We then assembled reads presenting 100% similarity (using at least one shared SNP) to maximize the consensus sequence length. This consensus sequence was then considered as a haplotype, representing a particular copy in the corresponding contig.
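The filtering logic described above (discarding rare variants and variants inside homopolymeric regions) can be sketched in Python. This sketch does not parse ACE files and is not the Ace.py workflow used in the study: it assumes reads already padded to the consensus length ('-' for gaps, '.' for uncovered positions), and both thresholds are hypothetical because the text does not report them.

```python
from collections import Counter

MIN_ALLELE_READS = 2  # hypothetical: alleles seen in fewer reads count as rare
MIN_HOMOPOLYMER = 4   # hypothetical: run length treated as homopolymeric


def homopolymer_positions(consensus, min_run=MIN_HOMOPOLYMER):
    """Return 0-based positions lying inside runs of >= min_run identical bases."""
    flagged = set()
    start = 0
    for i in range(1, len(consensus) + 1):
        if i == len(consensus) or consensus[i] != consensus[start]:
            if i - start >= min_run:
                flagged.update(range(start, i))
            start = i
    return flagged


def call_snps(consensus, reads, min_reads=MIN_ALLELE_READS):
    """Return {position: {base: read_count}} for columns with >= 2 kept alleles."""
    skip = homopolymer_positions(consensus)
    snps = {}
    for pos in range(len(consensus)):
        if pos in skip:
            continue
        counts = Counter(r[pos] for r in reads if r[pos] in "ACGT")
        alleles = {b: n for b, n in counts.items() if n >= min_reads}
        if len(alleles) >= 2:
            snps[pos] = alleles
    return snps


# Hypothetical contig and six aligned reads.
consensus = "ACGTAAAAGC"
reads = [
    "ACGTAAAAGC",
    "ACGTAAAAGC",
    "ATGTAAAAGC",
    "ATGTAATAGC",  # T at position 6 lies inside the AAAA run and is ignored
    "ATGTAATAGC",
    "ACGCAAAAGC",  # C at position 3 is seen once and is treated as rare
]
snps = call_snps(consensus, reads)
print(snps)
```

Reads sharing the same allele at the retained positions could then be grouped to build haplotype consensus sequences, as described in the text.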

Results

De novo assemblies and contig annotation

Spartina maritima

Sequencing of the non-normalized and normalized cDNA libraries from roots and leaves resulted in 425 274 reads (average length 314±147.3 bp) and 558 732 reads (average length 203±102.8 bp), respectively. Data are available in Genbank under accession references SRP015701 and SRP015702 for S. maritima and S. alterniflora, respectively.

Assemblies and annotations were first performed separately on the sequences obtained from the non-normalized and normalized cDNA libraries for each tissue, respectively, then on the pooled reads from both normalized and non-normalized libraries. A total of nine different assemblies (as presented in Table 1) were performed using individual (by tissue and normalization) or combined data sets, allowing the comparison of annotated contigs by tissue and evaluation of the normalization process efficiency.

Table 1 Summary of assemblies and annotations of the Spartina maritima and Spartina alterniflora complementary DNA libraries

After trimming the adapter sequences and removing sequences shorter than 50 bases, 405 386 and 359 159 reads remained for the S. maritima non-normalized library and S. maritima normalized library, respectively. Assembly of the trimmed reads resulted in 12 309 contigs for the non-normalized library and 17 182 contigs for the normalized library. The mean contig length was 617 bp (s.d.=540.3, range=50–8036) and 415 bp (s.d.=246.9, range=50–2252) for the non-normalized and normalized libraries, respectively.

Separate assemblies for roots and leaves were also processed for each library, as well as a global assembly of all the reads from S. maritima to get a global gene annotation for this species. Unequal read numbers were obtained for leaf and root cDNA sequencing in both the normalized and non-normalized libraries. In the non-normalized cDNA library, the read number in leaves was twice that of roots. In the root normalized library, the read number was three times larger than the number obtained in the non-normalized library (Table 1). Similar numbers of contigs were assembled for leaves (5866) and roots (5910) in the non-normalized cDNA library, but many more contigs were assembled for roots (13 315) than for leaves (3654) in the normalized library. When pooling all reads from S. maritima (normalized and non-normalized for both organs), 25 239 contigs were assembled. Separate assemblies of roots and leaves resulted in 19 069 and 10 098 contigs, respectively.

Functional annotation was performed by sequence comparisons with public databases. The different S. maritima data sets (from non-normalized and normalized cDNAs in each tissue) were first compared with the Oryza sativa EST database, then to a larger database including four sequenced Poaceae genomes. As expected, the use of this homemade Poaceae database improved the number of annotated genes (Table 1). In the non-normalized library, 5705 different genes were annotated with the O. sativa database and 7290 with the Poaceae database. In the normalized library, 8195 were annotated with the O. sativa database and 10 629 with the Poaceae database. The normalization of the cDNA library significantly increased the number of annotated genes, as among these 10 629 annotated genes, 3620 were common to both libraries and 6642 genes were specific to the normalized data set (Figure 2a).

Figure 2

Common annotated contigs (using the Poaceae database) of S. maritima (a) between non-normalized and normalized cDNA libraries (leaves+roots) (b) between roots and leaves.

The Poaceae database allowed annotation of 6100 different genes for S. maritima leaves and 11 149 genes for roots (Table 1). Among these, 2938 genes were found in both root and leaf transcriptomes (Figure 2b). When pooling all the read data sets (both tissues and both normalization types), 13 786 genes were annotated in total for S. maritima with the Poaceae database (Table 1).

Spartina alterniflora

Sequencing of the S. alterniflora non-normalized cDNA library from roots and leaves resulted in 495 749 reads, with an average length of 285±160.6 bp. After trimming, 344 723 reads were used for the assembly, which resulted in 14 137 contigs (Table 1). The S. alterniflora contigs have an average length of 759 bp (s.d.=637.1, range=50–12 334) and a mean read depth of 14.3. Separate assemblies of roots and leaves were processed as for S. maritima and resulted in 3217 contigs for leaves and 11 155 contigs for roots. More reads and more contigs were obtained for roots than for leaves, as observed in S. maritima (Table 1).

Functional annotation of the S. alterniflora contigs using the Oryza and Poaceae databases, respectively, resulted in 1806 and 2169 different genes annotated in leaves. For roots, 5281 (Oryza database) and 6811 (Poaceae database) genes were annotated. When pooling root and leaf data sets, 6430 genes were annotated when using the Oryza database, and 8370 genes were annotated in total for S. alterniflora, when using the Poaceae database (Table 1).

Spartina leaf and root transcriptomes

To maximize the number of contigs and annotated genes per tissue, S. maritima and S. alterniflora reads were pooled, which resulted in 13 824 and 29 187 assembled contigs for leaves and roots, respectively (Table 1). When using the Poaceae database for functional annotation, 7773 and 14 135 different genes were annotated for leaves and roots, respectively. Among these, 4019 (22.5%) genes were common to root and leaf Spartina transcriptomes (Figure 3d).
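The 22.5% value is consistent with expressing the shared genes as a fraction of the union of the leaf and root gene sets; this reading is inferred from the reported counts rather than stated in the text. A short Python check under that assumption (the function name is hypothetical):

```python
def shared_percentage(n_a, n_b, n_shared):
    """Percentage of the union of two sets that is common to both."""
    union = n_a + n_b - n_shared
    return 100.0 * n_shared / union


# Counts reported in the text for the pooled leaf and root annotations.
leaves, roots, shared = 7773, 14135, 4019
union = leaves + roots - shared
pct = shared_percentage(leaves, roots, shared)
print(f"union = {union} genes, shared = {pct:.1f}%")
```

With these counts the union is 17 889 genes, of which the 4019 shared genes represent about 22.5%.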

Figure 3

Common annotated contigs (using the Poaceae database) between S. maritima and S. alterniflora. (a) Comparison between S. maritima and S. alterniflora leaves. (b) Comparison between S. maritima and S. alterniflora roots. (c) Comparison between S. maritima and S. alterniflora (combined data from leaves and roots). (d) Comparison between roots and leaves (combined data from both species).

When examining leaf and root transcriptomes between species, 978 and 3251 annotated genes were found common to S. maritima and S. alterniflora for leaves and roots, respectively (Figures 3a and b). Overall, S. maritima and S. alterniflora share 4298 expressed genes (pooled leaf and root data sets) with 9488 genes annotated only in S. maritima and 4072 genes only in S. alterniflora (Figure 3c). The total data set (both species and organs) resulted in 38 478 contigs and 16 753 annotated Spartina genes (Table 1), which represent the first reference transcriptome for the hexaploid Spartina species.

Distribution of the contigs on the Sorghum genome

The number of homologous genes sequenced in Spartina hexaploid species was about half the number found in Sorghum bicolor per 100 kb sliding window. Mapping of the Spartina contigs to the Sorghum genome revealed similar relative gene densities for both Spartina EST libraries among the 10 chromosomes (Figure 4b, Supplementary Figure 1). High correlations between Sorghum gene densities along chromosomes and the number of homologous Spartina genes in a 100-kb Sorghum window were found for most chromosomes. A relatively lower correlation was found for chromosomes 5 and 8 (Supplementary Figure 1), which could suggest more extensive rearrangements during evolution of these taxa. Furthermore, we observed that Spartina gene densities were higher in the corresponding subtelomeric Sorghum chromosome positions than in pericentromeric ones, as expected from gene distributions in Sorghum (Paterson et al., 2009).

Figure 4

(a) Spartina contigs mapped to the Sorghum genome. The 10 individual chromosomes are shown in the outer circle. Relative gene densities on each chromosome are displayed successively inward as follows: (i) gene density in Sorghum bicolor, (ii) gene density in Spartina maritima, (iii) Spartina maritima gene density relative to Sorghum gene density, (iv) gene density in Spartina alterniflora, (v) Spartina alterniflora gene density relative to Sorghum gene density (by 100 kb region). (b) Correlations between Sorghum density and S. maritima and S. alterniflora homologous gene densities by 100 kb region (P-value<2.2 e−16).

Most-represented genes in the normalized and non-normalized Spartina data sets

The 20 most-represented transcripts (according to read depth) in the non-normalized libraries appear very similar in S. alterniflora and S. maritima (Supplementary Table 1). In both leaves and roots, they are mainly involved in respiratory pathways (for example, cytochrome c oxidase, ATP synthase), and in RNA and ribosomal protein synthesis. In roots, NADH-ubiquinone oxidoreductase and acylCoA-binding protein were also well-represented. Genes involved in stress responses were observed mainly in root transcripts. Among the most represented are the metallothionein and zinc finger (A20 and AN1) domains involved in metal binding and control of oxidative stress. A transcription elongation factor (EF) was also well represented in the root transcriptome of S. maritima and S. alterniflora (Supplementary Table 1); this gene is involved in protein elongation during translation (Andersen et al., 2003) and is also found highly represented in the roots of other grass species (for example, in Zea mays, Poroyko et al. (2005) or Avena barbata, Swarbreck et al. (2011)). The chaperone protein DnaJ gene was also encountered in the root transcriptome of S. maritima. This gene is induced by heat shock and prevents apoptosis (Gotoh et al., 2004). In addition, in S. alterniflora, two contigs annotated with a pathogenesis-related Bet V family protein were highly represented. This gene can be induced by different pathogens, such as viruses, bacteria and fungi (Liu and Ekramoddoullah, 2006).

The most abundant sequences annotated from the normalized cDNA data set in S. maritima belong to a larger set of gene categories compared with those encountered in the non-normalized data sets for both S. alterniflora and S. maritima. In leaves, all of the important functions are represented: we encountered genes involved in flowering control (tetratricopeptide repeat protein 1), in cell wall structure (glycine-rich protein), in the C4 assimilation process (phosphoenol-pyruvate carboxykinase, carbonic anhydrase) and in fatty acid metabolism (Acyl coA-binding protein). The thioredoxin gene has a critical role in redox regulation in the apoplast, which regulates cell division (Tian et al., 2009), cell differentiation (Takeda et al., 2003), pollen germination (Ge et al., 2011) and stress responses (Song et al., 2011).

In the normalized root cDNA data set, apart from three highly represented contigs annotated as ribosomal genes, all others were genes and proteins involved in primary metabolism, such as cell transport (ADP-ribosylation factor, ranBP1 domain-containing protein), cell organization (mps 1 binder kinase activator-like 1A, steroid-binding protein, FYVE zinc finger domain-containing protein), plant growth (peptidase T1 family, tetratricopeptide repeat protein 1) and stress response (calreticulin precursor protein, phosphatase 2C, cytosolic ascorbate peroxidase gene, peroxiredoxin).

GO (Gene ontology) annotation and biological process analyses

Functional annotation

Using the A. thaliana protein database of the TAIR website, GO functions could be assigned to Spartina transcripts. Among the various biological processes, cellular (5865) and metabolic (5660) processes, as well as biological regulation (2125), were the most highly represented (Figure 5). Other important functions were also identified, such as response to stimulus, protein localization and transport, and developmental process. Similarly, cell and organelle were the most represented cellular component categories, and binding and catalytic activities were the most represented molecular functions (Figure 5).

Figure 5

Functional classification of the leaf transcriptome of S. maritima and S. alterniflora. GO annotations were used for classification for GO cellular component, GO molecular function and GO biological process.

Identification of ecologically relevant genes

Annotated Spartina genes with potential ecological relevance are listed in Supplementary Table 2, with the corresponding number of putative homologous regions identified in the Sorghum genome. Transcription factors, such as zinc finger proteins, anti-oxidants (for example, GDP-mannose pyrophosphorylase) and osmolyte synthetic transporters were identified. Heat shock proteins and zeaxanthin epoxidase, an enzyme of the biosynthetic pathway of abscisic acid (ABA), a hormone involved in the response to abiotic stress (including salt and heavy metal tolerance), were also encountered.

Among the known genes of the lignin biosynthetic pathway (Humphreys and Chapple, 2002), we were able to identify the cinnamoyl-CoA reductase and cinnamyl alcohol dehydrogenase genes. Gene families associated with the production of cellulose, such as cellulose synthases (CesA) and glycosyl transferases, were also identified in the reference Spartina transcriptome (Supplementary Table 2).

Identification of genes whose expression is altered following speciation in Spartina

When searching for the genes differentially expressed between the parental species (S. maritima and S. alterniflora) and between the parents and their hybrid (Spartina × townsendii) or allopolyploid (S. anglica) derivatives, as detected using rice microarrays by Chelaifa et al. (2010a, 2010b), we found 409 Spartina contigs exhibiting similarities to rice sequences (Supplementary Table 3). A Blast2GO analysis was performed on these 409 sequences, of which 271 were found to have different functional annotations. Sequences whose expression is altered following speciation according to Chelaifa et al. (2010a, 2010b), such as transcription factors, retrotransposons, peptide transport system genes, glutathione transferases, peroxidases and cytochrome c oxidase, were parsed to provide a sequence database. This database now constitutes a reference for future studies regarding the genomic and transcriptomic consequences of polyploid speciation in Spartina.

Polymorphism analysis at homologous genes

Because of the polyploid nature of these highly redundant genomes, up to three duplicated homoeologs may be encountered at each locus, leading to sequence heterogeneity among reads. Contigs from four genes (phosphoenol-pyruvate carboxykinase, HECT domain-containing protein, homeobox domain-containing protein and a heat shock protein) were analyzed in detail to identify homologous sequences, polymorphic sites and putative haplotypes. In these contigs, three to four haplotypes could be distinguished within individuals when comparing the homologous sequences for each gene (Table 2). The polymorphism analysis is illustrated in Table 3, for a 200-bp region of the HECT domain-containing protein gene. In this window, seven haplotypes (over the two species) were aligned. Six polymorphic sites were detected in each species, including four polymorphic sites shared between S. maritima and S. alterniflora, and two species-specific polymorphic sites. The shared polymorphisms allow distinction of two divergent haplotypes (where all six polymorphic sites are different) present in both hexaploids, and one (in S. maritima) or two (in S. alterniflora) additional less divergent variants (one or two nucleotide difference). Although the number of polymorphic sites defining haplotypes is variable among the other analyzed contigs, we observed the same pattern distinguishing two divergent haplotypes and one or two less divergent variants within individuals (Table 2).
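The way polymorphic sites and pairwise differences are read from aligned haplotypes can be illustrated with a short sketch. The sequences below are invented toy data, not the HECT contig, and the function names are hypothetical; the sketch assumes that the haplotypes are already aligned to the same length.

```python
def polymorphic_sites(haplotypes):
    """Return 0-based alignment columns at which the sequences differ."""
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for i in range(length):
        column = {h[i] for h in haplotypes}
        if len(column) > 1:
            sites.append(i)
    return sites


def n_differences(a, b):
    """Number of mismatching positions between two aligned sequences."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Toy haplotypes (invented): one divergent copy and one close variant.
hap_1 = "ACGTACGTAC"
hap_1_variant = "ACGTACGTAT"  # differs from hap_1 at a single site
hap_2 = "ACCTATGTGC"          # more divergent copy

haplotypes = [hap_1, hap_1_variant, hap_2]
print(polymorphic_sites(haplotypes))
print(n_differences(hap_1, hap_1_variant), n_differences(hap_1, hap_2))

# Sites shared between two species are the intersection of their site sets.
species_a_sites = set(polymorphic_sites([hap_1, hap_2]))
species_b_sites = set(polymorphic_sites([hap_1_variant, hap_2]))
print(sorted(species_a_sites & species_b_sites))
```

With real 454 reads, sequencing errors (homopolymers in particular) would need to be filtered before a site is called polymorphic, for example by requiring each variant to be supported by several reads.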

Table 2 Nucleotide polymorphisms detected among reads within four annotated contigs from S. maritima and S. alterniflora
Table 3 Single-nucleotide polymorphisms among assembled reads of the gene coding the HECT domain-containing protein in Spartina maritima and Spartina alterniflora

Discussion

We have explored the transcriptome of two related Spartina species (S. maritima and S. alterniflora) using 454 sequencing technology. Before this study, only a limited number of Spartina ESTs were deposited in the NCBI EST database. If we exclude a recent transcriptome analysis in the tetraploid Spartina pectinata that generated 556 198 ESTs (Gedye et al., 2010), a few hundred sequences only were available for S. maritima (Chelaifa et al., 2010a) and S. alterniflora (Baisakh et al., 2008). Our work represents the first effort to analyze the transcriptome of the hexaploid Spartina species, resulting in a reference transcriptome of more than 16 700 annotated genes from leaves and roots.

De novo transcriptome assembly using 454 sequencing technology

Compared with other NGS technologies, the Roche platform offers long read lengths that facilitate assembly and annotation (Morozova et al., 2009), and for this reason it is the most widely used technology for de novo EST sequencing (Sun et al., 2010). In total, 25 239 (normalized and non-normalized libraries) and 14 317 contigs were assembled for S. maritima and S. alterniflora, respectively, representing 65.1% and 57.8% of the reads, with the remaining reads left as singletons. Using a similar technology and assembly software, Gedye et al. (2010) assembled 65% of the reads into contigs for S. pectinata. The contig lengths found for both species are comparable to the length range reported in similar studies on other species (for example, 299 bp in Oryza longistaminata, Yang et al. (2010); 394 bp in S. pectinata, Gedye et al. (2010); 526 bp in Panax quinquefolius, Sun et al. (2010)). From this data set, 17 307 contigs were annotated for S. maritima and 14 123 contigs were annotated for S. alterniflora (38 089 total annotated contigs for both species), corresponding to 16 753 different genes. These results are situated in the range of reported studies in non-model species (69.8% in ginseng; 72.6% in S. pectinata; 82% in amaranth; 85.5% in Cicer). Functional annotation could be assigned to 68.6% of the S. maritima contigs and 98.6% of the S. alterniflora contigs. Nonetheless, a large number of unique reads (singletons) were found, that is, 15% for our data set compared with other studies using the same assembler: 13% in S. pectinata (Gedye et al., 2010); 10–25% in Mytilus galloprovincialis (Craft et al., 2010); 8.8% in Palomero maize (Vega-Arreguin et al., 2009) and 7% in Amaranthus and Ginseng (Sun et al., 2010; Délano-Frier et al., 2011). This could result from various causes such as the presence of rare transcripts from lowly expressed genes. The 454 sequencing technology also has some limitations resulting mainly from sequencing errors associated with homopolymers (Margulies et al., 2005; Moore et al., 2006; Wicker et al., 2006), A/T bias (Moore et al., 2006; Wicker et al., 2006) or random nucleotide misincorporation (Huse et al., 2007; Holt and Jones, 2008). The error rate for 454 sequencing is higher than the rate usually observed with Sanger sequencing (0.04 and 0.01%, respectively (Ewing and Green, 1998; Margulies et al., 2005; Moore et al., 2006)). Nevertheless, the error rate drops significantly to 0.4 bp errors per 10 kb after assembly (Margulies et al., 2005; Moore et al., 2006). We checked the quality of our sequence assemblies from 454 sequencing by comparing 10 assembled contigs to their putative homologs in S. maritima ESTs sequenced with the Sanger method (Chelaifa et al., 2010a, 2010b). The identity between the sequences was found to be very high (99.5%), which validates the procedures employed.

As there is no reference genome for Spartina, we used information from several EST and protein databases for gene annotation, a procedure successfully employed for other non-model species (for example, Barakat et al., 2009; Gedye et al., 2010; Franssen et al., 2011; Garg et al., 2011). In de novo sequencing projects, transcriptome coverage efficiency has been evaluated by comparing the number of unique genes to the nearest transcriptome available (Parchman et al., 2010). We compared our data to the nearest sequenced grass genomes: Oryza sativa (51 258 protein-coding transcripts, Yu et al., 2005 and the Rice Genome Annotation project, http://rice.plantbiology.msu.edu//) and Sorghum bicolor (36 338 protein-coding transcripts, Paterson et al., 2009). Using combined cDNA libraries, we identified 16 753 putative (non-redundant) genes by homology searches, which represent roughly one-third to one-half of the protein-coding transcripts reported for these fully sequenced related genomes. Interestingly, these genes appear distributed among the different Sorghum chromosomes, particularly in high gene density subtelomeric regions. Global gene colinearity is known to be well conserved among grass genomes (Feuillet and Keller, 2002; Srinivasachary et al., 2007) and the comparison here between hexaploid Spartina and Sorghum bicolor validates the utilization of Sorghum as a comparative model, as first observed in Gedye et al. (2010) for S. pectinata. The percentage of contigs without a BLAST hit in our study is quite low (1.01%), with 389 contigs that did not match any putative homolog in the Poaceae database. This fraction varies among other studies, fluctuating from 14.5% in Cicer (Garg et al., 2011) to 30.2% in Panax (Sun et al., 2010), for instance. These sequences without a homology hit can be attributed to technical biases, such as low-quality data, inaccurate assembly, assembly parameters and contamination by genomic DNA. The causes can also be biological: some cDNAs are non-coding, lineage-specific or highly variable (Logacheva et al., 2011). Specific Spartina (or Chloridoideae) sequences also might be too divergent from the grass model species used.
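The coverage comparison described above reduces to simple ratios. The sketch below uses only the counts quoted in the text; taking the protein-coding transcript counts as denominators is an assumption made for illustration, and a comparison against gene (locus) counts would give different values.

```python
# Counts quoted in the text; transcript counts serve as denominators here.
REFERENCE_TRANSCRIPTS = {
    "Oryza sativa": 51_258,
    "Sorghum bicolor": 36_338,
}
SPARTINA_GENES = 16_753  # putative non-redundant Spartina genes


def coverage_fraction(n_found, n_reference):
    """Fraction of a reference set matched by the annotated genes."""
    return n_found / n_reference


for species, n_ref in REFERENCE_TRANSCRIPTS.items():
    fraction = coverage_fraction(SPARTINA_GENES, n_ref)
    print(f"{species}: {fraction:.1%}")
```

Under this assumption the Spartina gene set corresponds to about 32.7% of the rice and 46.1% of the Sorghum protein-coding transcripts.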

In this study, among the 13 786 genes annotated in S. maritima, 6642 were retrieved in the normalized library, 3201 in the non-normalized library and only 3620 in both libraries, which indicates that normalization significantly improved the number of annotated genes. Normalization reduces oversampling of abundant transcripts and maximizes the potential to sequence less abundant transcripts (Zhulidov et al., 2004). RNA-Seq studies on zebra finch and rice have reported a higher efficiency in gene discovery using normalized cDNA libraries compared with non-normalized libraries (Yang et al., 2010; Ekblom et al., 2012). In contrast, Hale et al. (2009) demonstrated that normalization has a limited influence on increasing the number of sequenced genes. Ekblom et al. (2012) suggest that differences in the technologies used and in sequencing effort can affect the outcome of the comparison between normalized and non-normalized libraries. In our present study, the normalized library was constructed from plants grown under natural conditions along a tidal gradient, which might also have increased the number of transcripts annotated. The transcriptome size, unknown in most non-model species, may also affect the coverage and the sequencing effort. Therefore, it can indirectly affect the efficiency of normalization: normalized libraries are less efficient when the non-normalized library already covers the whole transcriptome. This suggests that the combination of both normalized and non-normalized libraries is essential for gene discovery in non-model species, particularly in species exhibiting redundant genomes such as hexaploid Spartina.
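The comparison between libraries amounts to partitioning the annotated gene identifiers into those found in one library only and those found in both. A minimal sketch, with invented gene identifiers and a hypothetical function name, could look like this:

```python
def partition_by_library(normalized, non_normalized):
    """Split gene identifiers into library-specific and shared sets."""
    normalized, non_normalized = set(normalized), set(non_normalized)
    return {
        "normalized_only": normalized - non_normalized,
        "non_normalized_only": non_normalized - normalized,
        "shared": normalized & non_normalized,
    }


# Toy identifiers (invented) standing in for annotated genes.
normalized_genes = ["g1", "g2", "g3", "g4", "g5"]
non_normalized_genes = ["g4", "g5", "g6"]

parts = partition_by_library(normalized_genes, non_normalized_genes)
for name, genes in parts.items():
    print(name, len(genes), sorted(genes))

# The three classes are disjoint, so their sizes sum to the total gene count.
total = len(set(normalized_genes) | set(non_normalized_genes))
print(total == sum(len(genes) for genes in parts.values()))
```

Because the three classes are disjoint, their sizes provide a quick consistency check against the total number of annotated genes.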

Functional aspects: biology and ecology of Spartina

The 16 753 Spartina unigenes annotated in this study represent an important resource to explore genes involved in functions of ecological and adaptive interest. The genus Spartina exhibits a C4-type photosynthesis, which evolved in the Chloridoideae between 25 and 32 MYA (Christin et al., 2008), and which uses the ATP-dependent phosphoenolpyruvate carboxykinase (PCK) as decarboxylating enzyme (Christin et al., 2009). C4 metabolism confers high plant productivity under warm, arid and saline conditions, although Spartina species (and most particularly the hexaploids) colonize temperate regions (Long et al., 1975). In the study conducted by Christin et al. (2009), one PCK sequence type was found in S. maritima, whereas two sequence types were found in S. anglica, one being sister to the maritima-type sequence and the other one most likely originating from the other parent of S. anglica (S. alterniflora, which was not analyzed by these authors). When analyzing an 830-bp partial PCK-coding region in S. maritima and S. alterniflora, Chelaifa et al. (2010a) found high nucleotide identity (99.7%) between S. maritima and S. alterniflora. In our study, a fragment of the PCK gene was found well represented in the leaf transcriptome of both S. maritima (623 bp) and S. alterniflora (470 bp); this is less than 25% of the total CDS length in O. sativa (2820 bp), but provides an indication of levels of heterogeneity. SNPs examined in this region revealed the presence of up to four haplotypes for each species. The identity between the two most divergent haplotypes of S. maritima was 98.5%, whereas the two less divergent sequences exhibited 99.4% identity. Our results thus indicate that at least two different, putative homoeologous PCK sequences are expressed in the leaves of the hexaploid S. maritima and S. alterniflora species.
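The two quantities used in this comparison, the fraction of the reference CDS covered by a fragment and the percent identity between aligned haplotypes, can be computed as in the following sketch. The fragment and CDS lengths are those quoted above; the aligned sequences are invented toy data, the function names are hypothetical, and skipping gapped columns is one possible convention among several.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over aligned columns, ignoring gapped columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        matches += x == y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def cds_fraction(fragment_length, cds_length):
    """Fraction of a reference CDS represented by a fragment."""
    return fragment_length / cds_length


RICE_PCK_CDS = 2820  # bp, value quoted in the text
for species, fragment in [("S. maritima", 623), ("S. alterniflora", 470)]:
    print(f"{species}: {cds_fraction(fragment, RICE_PCK_CDS):.1%} of the CDS")

# Toy alignment (invented sequences): 1 mismatch over 10 compared columns.
print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))
```

Both fragments fall below the 25% mark mentioned in the text (about 22.1% and 16.7% of the rice CDS, respectively).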

S. alterniflora and S. maritima are low-marsh species that have developed particular adaptations to tolerate several hours of immersion under seawater at high tide (Adams and Bate, 1995; Daehler and Strong, 1996). Survival of low-marsh Spartina species in anoxic sediments is facilitated by their ability to develop aerenchyma systems (studied particularly in S. alterniflora) that supply the submerged plants with atmospheric oxygen and efficiently transport oxygen to the roots (Maricle and Lee, 2002). High salinity can be damaging through salt toxicity and dehydration caused by low water potential. Thus, plants living in saline, high-light environments are adapted to minimize water loss to prevent dehydration, and have developed particular adaptive anatomical features in this regard (Maricle et al., 2007). Salt marsh Spartina species have thick leaves with pronounced ridges on the adaxial side. They are adapted to controlling water loss by having stomata on the adaxial side and by having large leaf ridges that fit together as the leaf rolls during water stress (Maricle et al., 2009). To prevent salt toxicity, Spartina species have large vacuoles for salt storage (Munns and Tester, 2008) and salt-secreting glands to excrete inorganic ions (Zhu, 2001). Phenotypic adaptations are well documented but little is known about the genes involved in these responses. The first Spartina transcriptome analyses under salt stress were performed in S. alterniflora using cDNA amplified fragment length polymorphism (Baisakh et al., 2006) and EST analyses (Baisakh et al., 2008); these analyses identified various transcripts involved in ion transport and compartmentalization, osmolyte production, cell division, metabolism and protein synthesis, as well as previously unknown genes induced by salt stress. Although our transcriptome analysis of S. maritima and S. alterniflora was not performed under salt stress, we retrieved 937 (4642 contigs) of the 1266 ESTs generated by Baisakh et al. (2008) and Subudhi and Baisakh (2011). Using A. thaliana as a functional reference transcriptome, we were also able to annotate 130 genes (305 contigs) involved in salt stress response. These genes include transcription factors, heat shock protein and cytochrome c oxidase that have been found to respond to salt and oxidative stress by balancing ion concentrations in Spartina (Maricle et al., 2006).

We also annotated 71 genes (190 contigs) involved in heavy metal tolerance. Spartina species are of high interest regarding their ecological role in polluted coastal environment, where they exhibit particular tolerance to oil spill and where they are considered for phytoremediation purposes (Maricle and Lee, 2002; Martinez-Dominguez et al., 2008; Mateo-Naranjo et al., 2008). Ramana Rao et al. (2011) found 28 differentially expressed genes following experimental petroleum hydrocarbon exposure in S. alterniflora. We retrieved in our data set 8 of these genes (52 contigs).

Genes involved in stress response or in developmental and cellular growth were found to be differentially expressed in controlled conditions in the two species, the former being overexpressed in S. maritima, whereas the latter were overexpressed in S. alterniflora (Chelaifa et al., 2010a). Most of these genes have also been found to be predominantly affected following hybridization between these species (that is, in Spartina × townsendii) and subsequent genome duplication in S. anglica (Chelaifa et al., 2010b). Here, we identified 409 contigs corresponding to 271 different genes matching the putative homologous rice probes described in Chelaifa et al. (2010b). Our Spartina sequence data set may provide useful information to target genes of ecological and evolutionary interest (that is, whose expression is affected by divergent and reticulate speciation). Specific primers may be now designed to explore gene expression evolution in natural conditions and under various ecological situations.

Sequence polymorphism at homologous loci in hexaploid Spartina

The contigs assembled from the 454 reads in each of these hexaploid species actually represent a consensus sequence among strictly homologous (that is, orthologous) sequences but may also include homoeologous sequences (generated by polyploidy). Within homoeologues (at strictly orthologous loci), levels of heterozygosity have been poorly investigated in S. maritima, although this species is well known for its predominant clonal propagation and weak inter-individual genetic variation (Yannic et al., 2004). Spartina alterniflora has a mixed, predominantly outcrossing mating system (Travis et al., 2004); thus, more allelic variation within homoeologues might be expected than for S. maritima. Reads were assembled using a 90% identity threshold, to avoid potential comparisons involving divergent paralogs, but homoeologous sequences are expected to exhibit more similarity at each locus, and thus will most likely be aligned in the same contig. Homology assessment requires sequence comparison examined in a phylogenetic context. Such an analysis was performed for Spartina for the granule-bound starch synthase I (Waxy) gene (Fortune et al., 2007). Molecular cloning, sequencing and phylogenetic analyses allowed detection of paralogous, homoeologous and orthologous copies. In S. alterniflora, three homoeologous Waxy copies were detected, exhibiting substitution rates ranging from 0.0218 to 0.0479. When analyzing sequence polymorphism among the assembled reads at four putative homologous loci between S. maritima and S. alterniflora, we found at each of these loci four different haplotypes that include two divergent sequences and two other less divergent variants. These results suggest the presence of two expressed homoeologous sequences with, respectively, two allelic variants; complementary phylogenetic analyses involving tetraploid Spartina species and outgroups will help to elucidate the evolutionary origin of these different sequences. As S. maritima and S. alterniflora are hexaploid, up to three duplicated homoeologs may be expected per locus. The fact that only two homoeologs were encountered in the analyzed transcripts might result from homoeologous silencing as observed in the various cases of subfunctionalization reported in allopolyploids (reviewed in Osborn et al., 2003; Adams and Wendel, 2005; Doyle et al., 2008), from physical loss of the duplicated copies that may occur more or less rapidly following polyploid speciation (for example, Gaeta et al., 2007; Tate et al., 2009; Koh et al., 2010) or from homoeologous recombination (Cifuentes et al., 2010; Salmon et al., 2010; Gaeta and Pires, 2010). For the Waxy gene mentioned above, Fortune et al. (2007) found a variable number of retained copies per homologous locus. Two paralogs (A and B) were identified in the genus Spartina; only one B copy was found in S. maritima, whereas three distinct B copies were encountered in S. alterniflora. The A copy was apparently lost in these two species but is maintained in the hexaploid S. foliosa, which is the sister species of S. alterniflora.

Conclusions

NGS technologies open new opportunities to screen large sets of genes and their evolution in polyploid species (Buggs et al., 2012). This first reference transcriptome, coupled with ongoing studies in our laboratory involving deeper coverage from Illumina (Illumina Inc., San Diego, CA, USA) RNA-Seq and high-throughput genomic DNA sequencing, will facilitate a more accurate estimate of the level of duplicated homoeologous gene retention and relative expression in the hexaploid Spartina species and their hybrid and allopolyploid derivatives, in controlled and natural conditions. The analysis of the retained gene copies will also shed light on the origin of the hexaploid lineage and improve our understanding of the deepest Spartina history.

Data archiving

Data have been deposited at GenBank (Sequence Read Archive, SRA) under accession references SRP015701 and SRP015702 for Spartina maritima and Spartina alterniflora, respectively.