Transcriptome sequencing of a thalloid bryophyte; Dumortiera hirsuta (Sw) Nees: assembly, annotation and marker discovery

Singh, Harpal; Rai, Krishan Mohan; Upadhyay, Santosh Kumar; Pant, Poonam; Verma, Praveen Chandra; Singh, Ajit Pratap; Singh, Pradhyumna Kumar

doi:10.1038/srep15350

Download PDF

Article
Open access
Published: 20 October 2015

Transcriptome sequencing of a thalloid bryophyte; Dumortiera hirsuta (Sw) Nees: assembly, annotation and marker discovery

Harpal Singh¹,
Krishan Mohan Rai¹,
Santosh Kumar Upadhyay²,
Poonam Pant¹,
Praveen Chandra Verma¹,
Ajit Pratap Singh¹ &
…
Pradhyumna Kumar Singh¹

Scientific Reports volume 5, Article number: 15350 (2015) Cite this article

2322 Accesses
10 Citations
2 Altmetric
Metrics details

Subjects

Abstract

Bryophytes are the first land plants but are scarcely studied at the molecular level. Here, we report transcriptome sequencing and functional annotation of Dumortiera hirsuta, as a representative bryophyte. Approximately 0.5 million reads with ~195 Mb data were generated by sequencing of mRNA using 454 pyrosequencer. De novo assembly of reads yielded 85,240 unigenes (12,439 contigs and 72,801 singletons). BlastX search at NCBI-NR database showed similarity of 33,662 unigenes with 10-¹⁰ e-value. A total of 23,685 unigenes were annotated at TAIR10 protein database. The annotated unigenes were further classified using the Gene Ontology. Analysis at Kyoto Encyclopedia of Genes and Genomes pathway database identified 95 pathways with significant scores, among which metabolic and biosynthesis of secondary metabolite were the major ones. Phenylpropanoid pathway was elucidated and selected genes were characterized by real time qPCR. We identified 447 transcription factors belonging to 41 families and 1594 eSSRs in 1479 unigenes. D. hirsuta unigenes showed homology across the taxa from algae to angiosperm indicating their role as the connecting link between aquatic and terrestrial plants. This could be a valuable genomic resource for molecular and evolutionary studies. Further, it sheds light for the isolation and characterization of new genes with unique functions.

De novo transcriptome and tissue specific expression analysis of genes associated with biosynthesis of secondary metabolites in Operculina turpethum (L.)

Article Open access 18 November 2021

Bhagyashree Biswal, Biswajit Jena, … Laxmikanta Acharya

De novo transcriptome assembly of Dalbergia sissoo Roxb. (Fabaceae) under Botryodiplodia theobromae-induced dieback disease

Article Open access 22 November 2023

Ummul Buneen Zafar, Muhammad Shahzaib, … Iqrar Ahmad Rana

Analyses of transcriptomes and the first complete genome of Leucocalocybe mongolica provide new insights into phylogenetic relationships and conservation

Article Open access 03 February 2021

Mingzheng Duan, Haiying Bao & Tolgor Bau

Introduction

Bryophytes are the oldest known terrestrial plants with high ecological and economic value. These are among the highly stress tolerant fauna of the world. Some of them resurrect even after complete dryness. Although several economically important secondary metabolites, anti-fungal, anti-microbial compounds and enzymes are reported from bryophytes, characterization of the later at genetic and molecular levels has not yet been done. Genome and transcriptome data of this group can be helpful in evolutionary and taxonomic studies and can also contribute significantly in the determination of its ecological attributes and economic importance. Even after 8 years of Next Generation Sequencing (NGS) technology being available, genome sequence of only one bryophyte (Physcomitrella patens) has been reported till date¹.

Dumortiera hirsuta is a thalloid liverwort reported from all bryo-geographical regions in India. It grows on moist rocks in shady places around the waterfalls² and has been widely used for the isolation of several economically important compounds. The aqueous extract of D. hirsuta is effective against several phyto-pathogens³. The plant has inherent abilities of stress tolerance⁴. It produces superoxide radicals (a common response to biotic and abiotic stresses) as an extracellular oxidative burst during rehydration after desiccation stress⁵. This leads to the signal transduction and formation of suberin, melanin and lignin like protective substances for defense against phyto-pathogens^6,7,8. Different isoforms of peroxidases (EC 1.11.1.7) and tyrosinases (EC 1.10.3.1) are found in the apoplast of D. hirsuta and are responsible for the oxidative burst⁹. Despite the economic significance and cosmopolitan occurrence, D. hirsuta has not been studied at molecular level. Moreover, it is difficult to study an important metabolic pathway of uncommon/non-model organisms where the reference genome/transcriptome is not available.

High-throughput sequencing is an important step towards the development of nucleotide sequence database of non-model organisms like D. hirsuta. De novo sequencing of D. hirsuta on NGS platform like 454 pyrosequencer will be very useful as it generates longer read lengths which makes the assembly easy and accurate. The sequence data generated by transcriptome sequencing facilitates elucidation of metabolic pathways, global gene expression and many others. Present study describes the transcriptome sequencing of D. hirsuta and their assembly to get a non-redundant set of unigenes. The present study also describes the annotation of unigenes with different proteins datasets, their gene ontology, elucidation of important metabolic pathways, identification of transcription factor families and EST-SSRs markers. As D. hirsuta has some unique properties, the study is expected to help in designing new strategy for the better utilization of plant resource for mankind.

Results

Sequencing and assembly

D. hirsuta cDNA library was prepared from the mRNA of gametophyte and used for sequencing at 454 pyrosequencer. Signal processing of raw images acquisited during sequencing run on gsRunBrowser v2.5.3 generated 5,48,123 reads, having a total of 19,55,92,000 bases (~195.6 Mb). Average read length was 357 bases with length distribution 40 to 737 bases, (Table 1, Fig. 1A). A total of 1,968 reads having read length less than 50 bases and low quality were removed before further analysis. Remaining reads were mapped on pool of plants ribosomal RNAs using gsMapper v2.5.3 and 1, 21,117 reads of ribosomal origin were omitted from the analysis. After the quality filtration, 4, 25,038 reads were passed with 95.44% of Q40plus bases and used for de novo assembly. 80.7% reads were aligned to assemble into 12, 439 contigs (having length ≥100 bases); rest19.3% reads were singletons. About 64% contigs (8, 007 out of 12, 439) were classified as large contigs having length ≥500 bases (Table 2, Fig. 1B). The average contig size was 989 bases with largest contig of 4843 bases. However the N50 contig size of assembled data was 1071 bases. All the contigs and singletons (12, 439 and 72, 801, respectively) were pooled together to form 85, 240 non-redundant set of unigenes. The transcriptome sequencing data is submitted to NCBI (SRA2A1172).

Table 1 Details of D. hirsuta transcriptome sequencing using 454 GS FLX pyrosequencer.

Full size table

Table 2 De novo assembly outputs of D. hirsuta transcriptome using gsAssembler (Newbler v2.5.3).

Full size table

Annotations with different databases

D. hirsuta unigenes were searched for their annotation against different databases (Table 3). Total 33, 662 unigenes showed similarity with NCBI non-redundant (NR) protein database in BlastX search with cut-off E-value 10⁻¹⁰ (File S1). These unigenes represents 20760 proteins of NR database. Similarly, 23, 685 unigenes were annotated against TAIR10 protein database representing 9, 014 unique TAIR locus IDs (File S2). Unigenes with TAIR10 locus IDs were used for further functional annotation.

Table 3 Details of annotation of D. hirsuta unigenes with different public databases and pathway analysis.

Full size table

GO annotation and KEGG pathway analyses

Unigenes with TAIR10 locus IDs were analysed on Gene Ontology to classify them into different GO categories. Out of total 62, 685 GO terms, 20, 582 were assigned to cellular component, 14, 493 to molecular function and 27, 610 to biological processes (Table 3, Fig. 2). Among the cellular components, nucleus (13%) and chloroplast (10.5%) were at the top. Similarly, nucleotide binding (11%), hydrolase (9.7%) and transferease (8.9%) activities were among the most enriched molecular processes. Protein metabolism (8.2%), response to stress (6.6%), cell organisation and biogenesis (6.4%) and response to abiotic and biotic stimulus (6.4%) were among the most enriched biological processes (File S3).

Unigenes with TAIR locus IDs were searched for KEGG Orthologous IDs and 4305 IDs were identified. These were further searched for various pathways involved in the D. hirsuta and 95 significant pathways were identified. Metabolic pathways (comprised of 721 unigenes), biosynthesis of secondary metabolites (309 unigenes), spliceosome (92 unigenes), oxidative phosphorylation (87 unigenes) and RNA transport (80 unigenes) were the top five pathways (Fig. 3, File S4).

Elucidation of Pathway involved in Stress

Unigenes with TAIR locus IDs were searched for KEGG Orthologous IDs and were mapped for phenylpropanoid pathway. Several important enzymes of this pathway were significantly identified from the unigenes. Cinnamyl-alcohol dehydrogenase (EC: 1.1.1.195) was found to be most expressive with log2 TPM value 7.7, followed by PRDX6; peroxiredoxin 6, 1-Cys peroxiredoxin (EC: 1.11.1.7) with log2 TPM value 7.2. The enzymes with more than 3 log2 TPM value are given in Table 4. Mapped enzymes from the phenyl propanoid pathway (Figure S1) show that most of the important steps have been covered in this study. These results indicate towards the possible important role of phenylpropanoid pathway in the stress tolerance of bryophytes.

Table 4 Major enzymes of phenylpropanoid pathway expressing in D. hirsuta.

Full size table

Expression analysis of genes involved (log2TPM >3) in phenylpropanoid pathway was analysed by real time PCR. Expression level was compared with the actin gene. We found that all the genes were expressed in gametophyte of D. hirsuta (Fig. 4). Significant transcript abundance was observed for each gene when compared to the expression level of actin.

Transcription Factors

Among the D. hirsuta unigenes, a total of 447 transcription factor encoding genes (TFs) were identified (File S5A) with the help of A. thaliana transcription factor database (AGRIS). All the TFs were further classified into 41 families (File S5B). C2H2 (100 unigenes), C3H (81 unigenes), GRAS (25 unigenes), bZIP (22 unigenes) and Homeobox (21 unigenes) were among the top five families. bHLH, TUB, NAC, WRKY and MYB TF families were also found present (Fig. 5).

Similarity with other plant genomes

D. hirsuta unigenes were analysed for the pattern of genes distribution across the representatives of four phyla, C. reinhardtii (Blue green algae), P. patens (Bryophyte), S. moellendorffii (Pteridophyte) and A. thaliana (Angiosperm) with total number of proteins 19526, 38354, 22285 and 35386 respectively (Fig. 6). Proteins sequences of these plants were downloaded from Phytazome server (http://www.phytozome.net/) and compared with D. hirsuta unigenes. A total of 13653 unigenes with C. reinhardtii, 28519 with P. patens, 23959 with S. moellendorffii and 23685 with A. thaliana were found similar. A total of 12615 unigenes (representing 14.8%) were found present across all the four genera used to perform this analysis.

A total of 3990 unigenes were found uniquely in P. patens. Similarly 133, 445 and 367 unigenes were also found unique to C. reinhardtii, S. moellendorffii and A. thaliana, respectively. Out of total, 1672 unigenes did not have any similarity to P. patens, the closest reported genome of a bryophyte.

Identification of EST-SSRs

A total of 81,838 D. hirsuta unigenes with more than 100 bases of length were used for EST-SSRs mining using MIcro SAtellite (MISA) identification tool with standard criteria of 5 motifs depth. After excluding mono-nucleotide repeats from the analysis, 1594 SSRs have been identified within 1479 unigenes. Out of total, 102 unigenes were found having more than one SSRs whereas 89 SSRs were found in compound formation i.e. within 100 bases of proximity. Among the identified SSR motifs, 1040 DNRs (65.2%), 490 TNRs (30.7), 43 TtNRs, 14 PNRs and 7 were HNRs (Table S1, Fig. 7). AG/CT with 601 motifs in DNRs (57.8% of DNRs) and AGC/CTG with 131 motifs in TNRs (26.7% of TNRs) were the most abundant motifs (File S6).

Discussion

Here we report the transcriptome analysis of the bryophyte D. hirsuta. Bryophytes occupy an important phylogenetic position between aquatic and land plants. They are largely divided into three sub-groups- liverworts, hornworts and mosses. D. hirsuta is a rare liverwort, grows in shady and humid rocks. Although genome sequence of moss P. patens is reported¹, the NGS data of a liverwort, the first land plants are expected to be of much more relevance and expected to unveil some of the evolutionary mysteries. Further, they are wonderful examples of stress tolerance. They might have unique pathways for adaptation in aquatic and terrestrial mode of life. Their gene pool might be useful in exploring/modifying the crops for stress tolerance.

A total of 85,240 unigenes identified in the present study shows development of comprehensive NGS data for D. hirsuta. We have identified 33,662 unigenes through NCBI-Nr protein database and 23,685 unigenes through TAIR10 protein database. 23,309 unigenes are common. Thus, 376 genes/proteins are additionally identified by TAIR10 database. Therefore, a total of 34,038 protein coding unigenes are mapped. This number is very close to the 35,938 genes reported in case of P. patens, another bryophyte with closest genome available¹.

Mapping of unigenes at very low (10⁻¹⁰) e-value to different databases indicates the high quality of sequencing and assembly. Besides the mapped unigenes, we have found several sequences which do not show similarity with the database. Majority of such sequences are non-coding RNAs (ncRNAs)¹⁰. ncRNAs are reported to play significant role in transcriptional and post transcriptional gene regulation as well as in maintenance of genome stability^11,12. However, there is also possibility of transcripts of uncharacterized genes in the sequence pool.

Annotated unigenes had been further used for functional annotation and classification using GO and KEGG, which provide information about probable biological functions and biosynthesis pathways^13,14. Functional annotation of these unigenes has covered almost entire biological processes. GO analysis for biological processes shows that besides the protein metabolism, 6.6% unigenes are related to stress response and 6.4% to abiotic and biotic stimulus. This establishes the presence of stress tolerance mechanism in bryophytes.

Unigenes have also been mapped into 95 significant KEGG pathways; in which metabolism and biosynthesis of secondary metabolites pathways have higher representations. In secondary metabolites pathways, strong representation of genes for terpenoid, abscisic acid and flavonoids biosynthesis were observed (Supplementary Figure 1). These were supposed to be the major components of biotic and abiotic stress tolerance as well as thought to involve in signalling¹⁴.

Phenylpropanoids contribute to all aspects of plant responses towards biotic and abiotic stimuli. Phenylpropanoids are a diverse group of compounds derived from the carbon skeleton of phenylalanine that are involved in plant defense, structural support and survival¹⁵. The pathway also gives rise to an array of other small molecules such as the flavonoids, coumarins, hydroxycinnamic acid conjugates and lignans^15,16. Flavonoids exhibit central functions in various aspects of plant life and an enhanced amount of these compounds can be observed under different stress conditions. Highest log2 TPM valued Cinnamyl-alcohol dehydrogenase [EC: 1.1.1.195] converts aldehydes to alcohols which are the key substances for the production of final product in phenylpropanoid pathway. PRDX6; peroxiredoxin 6, 1-Cys peroxiredoxin [EC: 1.11.1.7] with log2 TPM value produces lignins in plants which is most important structural molecule although lignin is not present in bryophytes. Bryophytes do not synthesize lignin, but are widely found to accumulate soluble phenylpropanoids, such as flavonoids and lignans^17,18. Phenylalanine ammonia-lyase [EC: 4.3.1.24] being the most enzyme of the phenylpropanoid pathway shows high TPM value. In qPCR expression analysis transcript of all the genes were observed and found to be following the similar pattern as observed in sequencing data.

We have analysed the representation of transcription factors (TFs) in D. hirsuta and found 447 TFs, representing 41 families. C2H2 and C3H TFs are highly represented followed by GRAS, bZIP and Homeobox. C2H2 and C3H are Zinc finger proteins and are known to be involved in various developmental pathways in plants^19,20. Further, role of C2H2 in different biotic and abiotic stresses has also been reported²¹. GRAS family of proteins play major role in gibberellin (GA) signalling, which regulates plant growth and development²². They are also involved in Mycorrhizal signaling in legumes²³. The basic leucine zipper (bZIP) families of transcription factors have several important roles in processes like development, general physiology and stress responses in all eukaryotes²⁴. Homeobox family of proteins are also involved in the plant development especially in meristematic tissues²⁵.

Bryophytes are the connecting link between aquatic and terrestrial plants²⁶, thus sequences similarity of the annotated unigenes of D. hirsuta has been compared with some selected organisms from different phyla. C. reinhardtii from algae, P. patens from bryophytes, S. moellendorffii from pterydophytes and A. thaliana from angiosperms are selected for the homology analysis. Significant gene similarity is observed with phyla across the plant kingdom supporting bryophytes being the connecting link between aquatic and terrestrial plants. Maximum homology with P. patens is not surprising as both belong to bryophytes and closest to each other among the analysed organisms. We have found that about 30% of the annotated unigenes show similarity with algae C. reinhardtii, in which 133 matched uniquely. This indicates that the bryophytes retained several features from algae. About 60% genes are similar to P. patens, S. moellendorffii and A. thaliana and about 10% annotated unigenes are exclusively shared with P. patens. These results indicate that the bryophytes have acquired several new features to adapt in new ecosystem. Most of these features are retained in higher plants in course of evolution. About 10% of the total unigens were exclusive to bryophytes, which might be related to their characteristic features. Since all of them are predicted or hypothetical proteins, we could not correlate them with functions. Characterization of bryophyte specific genes in depth will be taken up in future studies.

We have also analysed D. hirsuta transcriptome data for SSR markers. Being highly polymorphic, SSR markers are widely used in several evolutionary and genomics based studies^27,28. High quality transcriptome data opens opportunity for marker discovery. We have identified 1594 SSRs, mostly dinucleotides and trinucleotide. Dinucleotides repeats are almost double of the trinucleotides. The result is in agreement with the earlier reports in higher plants^29,30,31,32. AG/CT repeats are most abundant in dinucleotide SSRs, as reported in higher plants^33,34. In trinucleotide repeats, AAG/CTT is most frequent and similar to the dicotyledonous plants³⁴.

Conclusion

Bryophytes occupy a very critical place in the evolution of terrestrial plants. D. hirsuta was selected as a representative species and a large transcriptome dataset was generated using 454 NGS technology. The data was used for functional annotation, gene ontology, elucidation of metabolic pathways, identification of transcription factor families and EST-SSRs markers. The study provides a comprehensive transcriptome data of rarely studied plant and would be useful for scientific community.

Methods

Plant materials

D. hirsuta was collected from Pachmarhi Biosphere Reserve Madhya Pradesh, India and conserved in moss house of National Botanical Research Insititute, Lucknow during this study.

Library preparation and sequencing

Total RNA was prepared from the whole plant using Sigma Spectrum Plant Total RNA Isolation Kit and was further used for double stranded cDNA synthesis following the standard protocol. DS cDNA (5 μg) was purified using PCR purification column (Qiagen, GmbH, Germany). DNA was randomly sheared to an average fragment size 300–800 bp by applying 30 psi (2.1 bar) pressure of gaseous nitrogen for 1 minute using a nebulizer. The size of DNA fragments was analyzed on Agilent BioAnalyzer DNA 7500 Chip. Sheared cDNA ends were repaired by T4 DNA polymerase and T4 polynucleotide kinase. Small cDNA fragments were removed by Ampure beads purification. Sequencing library was prepared as per standard protocol (Roche, Switzerland) and the quality and quantity of the library was analysed on Agilent BioAnalyzer High sensitivity chip. An emulsion PCR based amplification of sequencing library was performed using GS emPCR kit and sequencing of the amplified fragments was done as per manufacture’s protocol on 454 Genome Sequencer FLX (Titanium) using half of the Pico Titer Plate (PTP, 70 × 75).

De novo assembly

Reads with low quality score and less than 50 bp in length were removed before data processing (http://hiv.sanbi.ac.za/tools/qtrimwebcite). All the quality filtered reads were assembled using gsAssembler (Newbler v2.5.3) with the criteria of 50 bp overlap and 96% similarity. After assembly, both the contigs and singletons were pooled together to form a non-redundant unigenes set for further analysis.