Jute (Corchorus sp.) is one of the most important sources of natural fibre, covering ∼80% of global bast fibre production1. Only Corchorus olitorius and Corchorus capsularis are commercially cultivated, though there are more than 100 Corchorus species2 in the Malvaceae family. Here we describe high-quality draft genomes of these two species and their comparisons at the functional genomics level to support tailor-designed breeding. The assemblies cover 91.6% and 82.2% of the estimated genome sizes for C. olitorius and C. capsularis, respectively. In total, 37,031 C. olitorius and 30,096 C. capsularis genes are identified, and most of the genes are validated by cDNA and RNA-seq data. Analyses of clustered gene families and gene collinearity show that jute underwent shared whole-genome duplication ∼18.66 million years (Myr) ago prior to speciation. RNA expression analysis from isolated fibre cells reveals the key regulatory and structural genes involved in fibre formation. This work expands our understanding of the molecular basis of fibre formation laying the foundation for the genetic improvement of jute.
Bast (phloem) fibres are obtained from the stem of the plants such as jute, flax, hemp, ramie and kenaf. The annual global production of jute generates a farm value of ∼US$2.3 billion1. The cultivated species of jute, C. olitorius and C. capsularis, are morphologically and physiologically distinct (Supplementary Fig. 1), and a combination of useful traits from these species into a single genotype is highly desirable3. However, interspecific hybridization is limited because of their cross-incompatibility4,5. To facilitate comparative functional genomics and to understand the molecular basis of bast fibre biogenesis, genomes of two popular jute cultivars C. olitorius var. O-4 and C. capsularis var. CVL-1 are sequenced and analysed.
We performed whole-genome shotgun (WGS) sequencing with the Roche/454 platform (Supplementary Table 1) and assembled the genomes using CABOG6. The resulting assemblies were 445 Mb (scaffold N50 length, 3.3 Mb; longest scaffold, 45.5 Mb) for C. olitorius and 338 Mb (scaffold N50 length, 4.1 Mb; longest scaffold, 28.5 Mb) for C. capsularis (Table 1 and Supplementary Table 2). Eighty per cent of the C. olitorius and C. capsularis assemblies were covered with 415 scaffolds (minimum length 76 kb) and 231 scaffolds (minimum length 120 kb), respectively. We estimated the genome sizes for C. olitorius and C. capsularis to be ∼448 and ∼404 Mb (Supplementary Information and Supplementary Fig. 2), respectively, which were consistent with reported estimates7. Whole-genome optical mapping was used to improve the assemblies, resulting increase in N50 of the scaffolds to 4.0 Mb for C. olitorius and 8.5 Mb for C. capsularis (Supplementary Information and Supplementary Tables 3–6). We anchored ∼60% of each assembly to seven genetic linkage groups using a set of 1,389 molecular markers from a consensus genetic linkage map8,
We examined the evolutionary relationship between jute and 13 other sequenced plant genomes with representatives from the Malvids (cacao, cotton and Arabidopsis), Fabids (castor bean, flax, Medicago, soybean, Fragaria and Populus), Asterids (tomato and potato), Vitales (grape) and Monocots (rice). Phylogenetic analysis, based on a concatenated alignment of 13 single-copy gene families from 15 sequenced plant genomes, supported the placement of jute with cacao and cotton in the Malvaceae family (Fig. 1a). This inclusion is consistent with the analysis done with chloroplast DNA sequences18. All protein-coding genes from 15 genomes (Supplementary Table 20) were clustered into 47,186 gene families (two or more members), of which 8,177 were common to all five groups and 8,816 were confined to the Malvids (Supplementary Fig. 5). Among the Malvids-specific gene families only 613 and 152 are unique to C. olitorius and C. capsularis, respectively (Fig. 1b). These jute-specific gene families were significantly enriched with genes related to the oxidation–reduction process, transcription factors, transposases and defence-related proteins (Supplementary Tables 21 and 22). To investigate the expression of jute-specific genes in fibre cells, the RNA-seq data from isolated fibre cells and seedlings of C. olitorius and C. capsularis were compared. Among the jute-specific genes, Myb/SANT-like domain, Zinc finger, PHD-type, F-box domain cyclin-like proteins are highly expressed in fibre cells than seedlings (Supplementary Tables 23 and 24) indicating their involvement in bast fibre formation.
The natural genetic diversity within the jute species is very narrow19,20 and it is one of the impediments for the breeder to develop high-yield and quality varieties. The extent of gene duplications in the C. olitorius and C. capsularis genomes were examined. By calculating the synonymous substitution rates (Ks) for paralogous gene pairs, two peaks at 0.24–0.32 and 0.72–0.92 for both of the genomes were found (Fig. 1c). The first peak reveals the whole-genome duplication (WGD) event occurred ∼18.6 (16.0–21.3) Myr ago prior to their separation at ∼6 Myr ago (Supplementary Fig. 6). The second peak is indicating an ancient WGD event occurred in jute ∼129.2 (110.7–141.5) Myr ago (Fig. 1c). The ancient duplication event corresponds to the ancient hexaploidization that is shared among the eudicots21. Comparison of the two genomes revealed that they share 160 syntenic blocks (five or more genes per block) with the linkage groups covering 58% and 65% of the assembled genome of C. olitorius and C. capsularis, respectively (Fig. 1d; Supplementary Fig. 7 and Supplementary Tables 25 and 26). It indicates that extensive synteny and conserved gene order exists between the genomes. A one-to-one relationship of the predominantly aligned syntenic regions denotes no WGD after speciation (Supplementary Fig. 8). The occurrence of tandem duplications, which tend to be biased towards genes involved in responses to environmental stimuli22, was relatively low for C. olitorius (7.2% of total genes) and C. capsularis (5.9% of total genes) (Supplementary Table 27) compared with other plant genomes23.
The genomics information of jute fibre biogenesis is merely available for the improvement of its yield and quality. RNA-seq data obtained from isolated fibre cells (elongated cells undergoing secondary cell wall (SCW) deposition) and seedlings of C. olitorius and C. capsularis were analysed to investigate the molecular events of jute fibre development (Supplementary Fig. 9 and Supplementary Tables 28 and 29). We identified 6,077 upregulated and 6,809 downregulated genes for C. olitorius and 7,695 upregulated and 7,809 downregulated genes for C. capsularis (Supplementary Fig. 10 and Supplementary Tables 30 and 31). Among them, 329 C. olitorius and 344 C. capsularis genes were identified based on the analysis of homologous genes reportedly involved in plant fibre formation24 which facilitated us to propose a model for bast fibre biogenesis in jute (Fig. 2a). It was found that 174 (53%) C. olitorius and 216 (63%) C. capsularis genes were expressed significantly within the fibre cells and seedlings (Supplementary Table 32). Genes encoding the WOX4, APL and HAT22 transcription factors and the TDIF signalling peptide, which are involved in vascular cambium initiation and proliferation25,
The relatively high lignin content (∼15%) in jute fibre makes it coarser than other bast fibres such as flax and ramie (<5% lignin)30. Among the lignin biosynthetic genes detected in the C. olitorius and C. capsularis genomes, there were expansions of the 4-coumarate:CoA ligase (4CL), cinnamoyl-CoA reductase (CCR), trans-caffeoyl-CoA 3-O-methyltransferase (CCoAOMT) and caffeic acid O-methyltransferase (COMT) gene families compared with flax (Supplementary Table 33). The expression profiles of the lignification genes reveal that only a few homologues appeared to be preferentially expressed at high levels in the fibre cells for most of the gene families (Fig. 2b; Supplementary Table 32 and Supplementary Fig. 12a1), highlighting possible targets for engineering low-lignin jute fibres. Cellulose, synthesized by the cellulose synthase (CesA) complex, makes up the majority of the SCW in jute fibres (∼60%). We identified 10 CesA and 32 cellulose synthase-like (Csl) genes, similar to several other plants (Supplementary Table 34). SCW synthesis specific genes CesA4 and CesA7 were distinctly upregulated in fibre cells (Fig. 2c; Supplementary Table 32 and Supplementary Fig. 12a2) indicating their association with SCW cellulose deposition, whereas significant expression of CesA1, CesA3 and CesA6 in seedlings suggest their involvement in primary cell wall cellulose deposition (Fig. 2c; Supplementary Table 32 and Supplementary Fig. 12a2).
Senescence and cell death are the final phases of fibre biogenesis related to autophagy and proteolysis pathways. KEGG pathway enrichment analysis with fibre cell transcriptome data indicated an upregulation of autophagy and proteolysis pathways whereas most of the metabolic pathways were downregulated (Supplementary Tables 35 and 36). In flax phloem and poplar xylem fibres, a gradual degradation of the nucleus and cytoplasm is likely to be mediated by autophagy while deposition of the SCW continues31,32. In poplar xylem fibres32, all copies of ATG8 were upregulated in jute fibre cells and the expressions were the highest among the autophagy-related genes (Supplementary Table 32) suggesting a similar cell death programme.
RNA-seq results for fibre biogenesis pathway genes were validated with quantitative polymerase chain reaction with reverse transcription (RT–qPCR) on randomly selected genes (Supplementary Fig. 13). For most of the genes, similar upregulation and downregulation patterns were observed.
To explain the morphological and physiological differences between the species, we focused on comparing the transcripts and genes that are related to lignin deposition in SCW, fibre colour and response to abiotic stress. As C. olitorius fibre contains more lignin and less cellulose than that of C. capsularis (Supplementary Table 37)33, expression patterns of lignin and cellulose biosynthesis genes between them were different (Fig. 2b,c). Genes encoding most of the key enzymes for proanthocyanidin biosynthesis were highly expressed in the fibres of C. olitorius (golden colour) than in C. capsularis (whitish colour) (Supplementary Fig. 14), indicating their involvement for the differences in fibre pigmentation.
C. capsularis is comparatively tolerant to flood and drought but slightly more susceptible to diseases and pests than C. olitorius34. Moreover, C. capsularis is somewhat tolerant to salt stress compared with C. olitorius and can survive up to 60 mM concentrations of NaCl in nutrient media (Supplementary Fig. 15). We categorized gene ontology (GO) using the Blast2GO pipeline35 to identify gene copy number variation which is a major mechanism of phenotypic differentiation and evolutionary adaptation to the environment36,37 (Supplementary Tables 38 and 39). In the GO class ‘biological processes’, genes with GO terms associated with response to important environmental factors including salt and osmotic stress were over-represented in C. capsularis (Fig. 3a; Supplementary Table 40). Besides, in the GO class ‘molecular function’, we found that genes responding to abiotic stress were also significantly over-represented in C. capsularis (Fig. 3b; Supplementary Table 40). Among them transmembrane transport, vacuolar transport, homeostatic process, ATPase activity, transmembrane transporter activity, signal transducer activity, oxidoreductase activity were significantly higher in C. capsularis, indicating its adaptability in different habitats and environmental pressure. For example, transmembrane transport proteins mediate ion fluxes, including sodium ATPase, vacuolar H+-ATPase, chloride channel protein, and ABC transporters play important roles in ionic and osmotic homeostasis under salt environments38. To corroborate GO analysis results, correlated genes were identified (Supplementary Table 41) and most of them are associated with abiotic stress tolerance.
The comparative genome analysis opens up opportunities for the development of improved breeding strategies to meet the increasing demand of natural fibres in industry. The genome sequences provide a valuable resource to advance our understanding of bast fibre biogenesis in jute, thus serving as the foundation of genetic improvement for productivity and fibre quality.
Genome sequencing and assembly
The genomes of C. olitorius var. O-4 and C. capsularis var. CVL-1 were sequenced using the WGS approach on the 454 platform. A total of 13.04 Gb of sequence data was generated for the C. olitorius genome, consisting of 5.65 Gb of shotgun sequences, 2.56 Gb of 3 kb paired-end sequences, 2.47 Gb of 8 kb paired-end sequences and 2.36 Gb of 20 kb paired-end sequences. For the C. capsularis genome, 13.69 Gb of sequence data was generated, consisting of 7.87 Gb of shotgun sequences, 2.04 Gb of 3 kb paired-end sequences, 2.26 Gb of 8 kb paired-end sequences and 1.51 Gb of 20 kb paired-end sequences (Supplementary Table 1).
We used the CABOG tool sffToCA to identify mated reads and remove duplicate mate pairs. The remaining read sequence data were converted to the fastq format, trimmed to a length of 65 bases and used in the assembly. The CABOG v7.0 pipelines were then run with default parameters using a kmer parameter of 22, which was selected after testing a range of kmer settings.
We used whole-genome optical mapping technology to improve and validate the assemblies (Supplementary Table 3). A total of 360,906 and 260,615 single-molecule restriction maps longer than 250 kb each, with an average size of 356.37 and 356.99 kb, were generated using the KpnI restriction enzyme for C. olitorius and C. capsularis, respectively (Supplementary Table 4). Super-scaffolding with optical map data was performed using Genome-Builder software (OpGen). Super-scaffolds and scaffolds were anchored to seven linkage groups using a combination of traditional markers and whole-genome mapping data using ALLMAPS software.
The accuracy and completeness of the assemblies were assessed by aligning isotigs that were generated from transcriptome sequencing onto the WGS scaffolds using BLAT (Supplementary Table 9). We also checked the relative completeness of the assemblies by performing core gene annotation using the CEGMA v2.5 pipelines (Supplementary Table 10).
Repetitive elements were identified and masked by RepeatModeler v1.0.7 and RepeatMasker Open-3.0 with default parameters. Gene prediction was performed using a combination of homology, de novo and transcript-based approaches (Supplementary Fig. 4). The predicted genes were analysed for functional domains and homologies by using InterProScan and Basic Local Alignment Search Tool (BLAST) against the NCBI non-redundant database, TrEMBL and SwissProt with Protein BLAST (BLASTP) (e < 1 × 10−5) and Blast2GO v3.3 with default parameters.
Genome comparison and evolution
Comparative analysis was performed to identify orthologous gene families among the genomes of C. olitorius, C. capsularis, Arabidopsis thaliana, Theobroma cacao, Gossypium raimondii, Glycine max, Populus trichocarpa, Ricinus communis, Fragaria vesca, Linum usitatissimum, Medicago truncatula, Vitis vinifera, Solanum lycopersicum, Solanum tuberosum and Oryza sativa. All predicted protein sequences of these plants (Supplementary Table 20) were searched against each other using BLASTP (e < 1 × 10−5) and clustered into orthologous groups using MCL-10-201 (inflation parameter, 1.5). Clusters containing single-copy orthologues were identified with exact one member per species. Phylogenetic relationships were determined from these single-copy orthologues using the maximum likelihood method.
Paralogous genes in C. olitorius and C. capsularis were detected by all-against-all protein sequence similarity searches using BLASTP. The synonymous substitution rate (Ks) was calculated for each gene pair. Paralogous genes were determined to be tandem duplicates if they were located within five genes from each other. Orthologous genes between C. olitorius and C. capsularis were identified using the reciprocal best hit method and the Ks values were calculated for each pair. Intra- and intergenomic regions of synteny were identified and visualized by SyMAP v4.0.
Metabolic and regulatory pathways were reconstructed with Pathway Studio software based on the Resnet-Plant 4.0 database. Predicted jute interologues and pathways were imported into a new Pathway Studio database for manual pathway reconstruction and genome analysis.
Fibre cell transcriptome sequencing and analysis
The transcriptomes of isolated fibre cells and whole seedlings were sequenced with an Illumina HiSeq 2,500 at HudsonAlpha Institute for Biotechnology, Huntsville, Alabama (Supplementary Tables 28 and 29). Expression patterns were compared by aligning the RNA-seq reads against the C. olitorius and C. capsularis genome sequences and quantifying the transcript abundances using the Cufflinks v2.2.1 package, which was visualized by R libraries. KEGG Orthology Based Annotation System (KOBAS) was used to identify the pathways in the C. olitorius and the C. capsularis genome using the model organism A. thaliana. KEGG (Release 74.0) and Biocyc v19.0 pathways were utilized to run R package piano v1.8.0 for Gene set analysis (GSA). Pathways in the distinct direction were selected for subsequent analysis based on adjusted P < 0.05. The differential gene expression from the in silico analysis were validated by RT–qPCR with randomly selected several fibre biosynthesis pathway genes. All primers used in this study are provided in Supplementary Table 42.
Two-tailed chi-squared tests were used to compare the distributions of GO subcategories between C. olitorius and C. capsularis (Fig. 3). For each GO subcategory, a 2 × 2 contingency table was constructed by recording the existence of the number of genes in a subcategory for each species and ranking the statistical significance of the differences.
Detailed methods and their associated references are in the Supplementary Information.
The WGS projects have been deposited at NCBI GenBank under BioProject ID PRJNA215141 and accession no. AWUE00000000 for C. olitorius and BioProject ID PRJNA215142 and accession no. AWWV00000000 for C. capsularis. The genomic and transcriptomic raw data have been deposited in the NCBI Sequence Read Archive (SRA) under SRP049494 and SRP053213 for C. olitorius and C. capsularis, respectively.
How to cite this article: Islam, M. S. et al. Comparative genomics of two jute species and insight into fibre biogenesis. Nat. Plants 3, 16223 (2017).
This paper is dedicated to A. Shamsul Islam, in memory of his contributions to biotechnological research on jute. We thank C. Detter (US Defense Threat Reduction Agency), P. Chain (Los Alamos National Laboratory), M. Zafar Iqbal (Shahjalal University of Science and Technology) and M. Mutahar Hossain (Hossain & Khan Associates) for suggestions and comments on the project and manuscript. This research was funded by the Government of Bangladesh.
Supplementary Tables 1–42.