A high-quality reference genome of wild Cannabis sativa

Gao, Shan; Wang, Baishi; Xie, Shanshan; Xu, Xiaoyu; Zhang, Jin; Pei, Li; Yu, Yongyi; Yang, Weifei; Zhang, Ying

doi:10.1038/s41438-020-0295-3

Article
Open access
Published: 02 May 2020

Horticulture Research volume 7, Article number: 73 (2020)

Abstract

Cannabis sativa is a well-known plant species that has great economic and ecological significance. An incomplete genome of cloned C. sativa was obtained by using SOAPdenovo software in 2011. To further explore the utilization of this plant resource, we generated an updated draft genome sequence for wild-type varieties of C. sativa in China using PacBio single-molecule sequencing and Hi-C technology. Our assembled genome is approximately 808 Mb, with scaffold and contig N50 sizes of 83.00 Mb and 513.57 kb, respectively. Repetitive elements account for 74.75% of the genome. A total of 38,828 protein-coding genes were predicted, 98.20% of which were functionally annotated. We provide the first comprehensive de novo genome of wild-type varieties of C. sativa distributed in Tibet, China. Due to long-term growth in the wild environment, these varieties exhibit higher heterozygosity and contain more genetic information. This genetic resource is of great value for future investigations of cannabinoid metabolic pathways and will aid in promoting the commercial production of C. sativa and the effective utilization of cannabinoids. The assembled genome is also a valuable resource for further in-depth investigation of the C. sativa genome.

Introduction

Cannabis sativa L., a native plant of Central Asia, was first cultivated in Asia and Europe and is now one of the most widely cultivated plants worldwide¹. In China, hemp fiber has been used to produce textiles for the past 6000 years².

C. sativa is one of the most valuable agricultural crops. It is widely used to produce paper, textiles, building materials, food, and medicine, and one of its secondary metabolites, tetrahydrocannabinol (THC), is also used to produce well-known drugs. Frequent, long-term, selective breeding has produced both hemp fiber and medicinal cannabis strains, with medicinal cannabis showing promise in effectively treating various diseases³ by relieving an array of symptoms, including pain, nausea, anxiety, and inflammation^4,5,6,7. The therapeutic efficacy of medicinal cannabis depends mainly on cannabinoids, which are metabolites unique to C. sativa⁸, among which THC and cannabidiol (CBD) are the main compounds.

Although cannabis has considerable economic and medical value, information about its genome is limited. While a draft genome was published in 2011⁹, its assembly was neither of good quality nor complete, which limits its usefulness.

Cannabis is mostly dioecious, with a diploid genome (2n = 20) containing nine pairs of autosomes and one pair of sex chromosomes (XX in female plants and XY in male plants). The Y chromosome is larger than the X chromosome, and the female plant’s haploid genome is estimated to be 818 Mb in size, while the male plant’s genome is estimated to be 843 Mb^10,11. However, the published genome is for cloned C. sativa and was assembled using SOAPdenovo software to obtain a genome of approximately 786 Mb⁹. It has a contig N50 of 2.8 kb and a scaffold N50 of 16.2 kb, and genome annotations are missing. Additionally, the assembly quality is poor because gene regions and repeat sequences are incompletely assembled.

The cannabis genome has been sequenced⁹, but the sequenced plant came from a cultivated variety. Generally, cultivated varieties lose substantial genetic diversity through successive bottlenecks due to domestication and selection for traits to increase yield under intensive human cultivation¹². Therefore, wild-type varieties are an important source of genetic diversity for molecular breeding. In this report, we performed genomic sequencing, assembly, annotation, and evolutionary analysis in wild-type varieties of C. sativa. The genetic data obtained in this study will be a valuable resource for future studies assessing the pharmacology, chemical constituents, cultivation, and genetic improvement of the traits of these plants and could be used as a reference in future population genetic studies of C. sativa.

Results

Sample collection and sequencing

One female of the wild-type C. sativa “JL” variety was used for whole-genome sequencing in this study. The sequencing depth was 153×, and 124 Gb of genomic data were obtained. The subread N50 was 13.5 kb (Table 1). Additionally, we performed next-generation, paired-end sequencing and finally obtained a total of 95.97 Gb of clean data. These data were then used to evaluate the quality of the wild-type C. sativa “JL” variety genome.
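The reported depth can be cross-checked with simple arithmetic: mean depth is the total number of sequenced bases divided by the genome size. The sketch below assumes the approximate 808 Mb assembly size from the abstract as the denominator; the function name is illustrative and not part of the authors' pipeline.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Mean depth = total sequenced bases / genome size (both in bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


pacbio_bases = 124e9   # 124 Gb of PacBio subreads, as reported in the text
genome_size = 808e6    # ~808 Mb assembly size, taken from the abstract (assumed denominator)

depth = sequencing_depth(pacbio_bases, genome_size)
print(f"{depth:.1f}x")  # close to the 153x quoted in the text
```

Using a different genome-size estimate as the denominator shifts the result by about one unit of depth.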

Table 1 Statistical results for PacBio sequencing data

RNA sequencing data were used for genome function annotation. Using paired-end sequencing, we obtained 42.1 Gb of clean data, with each sample producing 6.5–9.8 Gb (Supplementary Information; Table S1).

Genome assembly

Several methods were used for genome assembly, and the initial assembly yielded a genome size of 811,814,330 bp, with a contig N50 of 632,748 bp. After assembling the third-generation subreads, the next-generation data were used to correct the assembly. For this purpose, we used BWA (v0.7.9a, RRID:SCR_010910) to align the next-generation clean reads to the assembled sequence, and based on the alignment results, we corrected the sequence using Pilon (v1.22, Broad Institute, MA, USA)¹³. The post-correction genome size was 812,295,151 bp, with a contig N50 of 633,146 bp. Statistical analysis of the base composition of the corrected genome showed that the average GC content was 33.8%. The contents of the other bases are presented in Supplementary Information; Table S2.
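For readers unfamiliar with the N50 statistic quoted here, the following minimal sketch shows one common way to compute it from a list of contig lengths. The toy lengths are invented for illustration and are not data from this study.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical): total = 300, half = 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 reaches 150, so the N50 is 70
```

The same function applies to scaffolds; only the input lengths change.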

Hi-C

The Hi-C approach uses high-throughput sequencing to determine the state of genome folding by measuring the frequency of contact between pairs of loci^14,15. Originally, this technique was developed for genome-wide chromosome conformation capture, but it was subsequently found to be useful for generating chromosome-level genome assemblies¹⁶.

Nearly 487 million raw reads (146.25 Gb) were collected and then reduced to 424 million clean reads after filtering out low-quality reads, reads with more than 5% N bases, adapter reads, and single reads. We then successfully clustered, ordered and oriented 2,506 contigs into 10 groups according to the agglomerative hierarchical clustering method in Lachesis (https://github.com/shendurelab/LACHESIS)¹⁷, representing 91% and 99% of the total genome by contig number and base count, respectively (Table 2). Along with the Hi-C analysis, we visually inspected contig orientation and suspicious fragments, found few errors, and corrected them using in-house scripts. We obtained a high-quality chromosome-level cannabis genome with a contig N50 of 513 kb and a scaffold N50 of 83 Mb (Table 3). According to a heatmap of the contig contact matrix with Hi-C data (Supplementary Information; Fig. S1), we concluded that the clustering, ordering, and orientation of the contigs were valid (Table 4). The scaffold N50 was 162 times greater than that of the preliminary assembly. In total, 2,506 scaffolds were anchored to the chromosomes and 245 remained unanchored (Table 5). Our C. sativa genome is a solid genomic resource for cultivar identification, population analysis, and functional analysis.
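One of the filtering criteria above, discarding reads with more than 5% N bases, can be sketched as follows. This is only an illustration of that single criterion; the function names are hypothetical, and the quality, adapter, and pairing filters that the authors also applied are not modeled.

```python
def n_fraction(seq: str) -> float:
    """Fraction of bases in a read that are uncalled (N)."""
    if not seq:
        return 1.0  # treat an empty read as entirely uninformative
    return seq.upper().count("N") / len(seq)


def keep_read(seq: str, max_n_fraction: float = 0.05) -> bool:
    """Keep a read only if its N fraction does not exceed the threshold."""
    return n_fraction(seq) <= max_n_fraction


# Toy reads (hypothetical), each 100 bases long.
reads = [
    "ACGT" * 25,          # 0% N: kept
    "ACGTN" * 20,         # 20% N: discarded
    "N" * 6 + "A" * 94,   # 6% N: discarded
]
kept = [r for r in reads if keep_read(r)]
print(len(kept))  # only the first read passes the filter
```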

Table 2 Summary of contig/scaffold clustering results

Table 3 Summary of Hi-C auxiliary assembly results

Table 4 Contig/scaffold sorting results

Table 5 Summary of Hi-C-assisted assembly pseudomolecule lengths

Assessment of genomic integrity

BUSCO (v3.0)¹⁸ was employed to evaluate the accuracy and completeness of our genome assembly, gene set, and transcripts. Based on the OrthoDB (http://cegg.unige.ch/orthodb) database, BUSCO built several large, single-copy gene sets covering the branches of the evolutionary tree. When comparing the gene set to the genome, it was noted that the proportion of complete BUSCOs was 92.6% (Supplementary Information; Table S3), indicating that the genome assembly integrity was very good.

Because contamination can occur during sequencing and assembly, we further evaluated our genome assembly by using GC depth analysis. The GC depth scatter plot showed no distinct separation into clusters, and points were concentrated around 34% GC, indicating high assembly quality without any bacterial contamination (Supplementary Information; Fig. S2). Finally, the per-base sequencing depth was close to a Poisson distribution, further indicating high assembly quality (Supplementary Information; Fig. S3).
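The GC axis of such a plot comes from computing GC content in windows along the assembly; windows from a contaminant with a markedly different base composition would form a separate cluster. The sketch below covers only the windowed GC calculation, with an assumed non-overlapping window scheme and a toy sequence; the per-window depth would come from read alignments and is not shown.

```python
def gc_fraction(seq: str) -> float:
    """GC content of a sequence, ignoring any characters other than A, C, G, T."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq: str, window: int) -> list:
    """GC content of consecutive non-overlapping windows (a trailing partial window is dropped)."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        gc_fraction(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


# Toy sequence (hypothetical): one AT-rich window followed by one GC-only window.
toy_sequence = "ATGCAT" + "GGCCGG"
print(windowed_gc(toy_sequence, window=6))
```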

To evaluate consistency with the next-generation data, we mapped the sequencing reads to the assembled scaffold sequences; the resulting mapping rate and genome coverage showed that the assembly was deeply and completely covered (Supplementary Information; Fig. S4). The mapping rate of the next-generation data was 96.77% (Supplementary Information; Table S4), indicating that the assembled genome was of high quality.

Genome annotation

Repeat sequences, including tandem repeats and interspersed repeats, are important components of the genome, and there are two strategies for predicting such sequences: homology-based prediction and de novo prediction. The total lengths of transposable elements obtained from the genome sequence were 118,700,582 and 161,847,743 bp, representing 14.61% and 19.92% of the genome, respectively. For de novo prediction, RepeatModeler (v1.0.8) was used to first establish a de novo repeat sequence library, after which repeats in the genome sequence were predicted using RepeatMasker (v4.0.6). These results revealed a total length of 584,319,477 bp, representing 71.93% of the genome. In addition, we used Tandem Repeat Finder (v4.09, https://tandem.bu.edu/trf/trf.html) to identify tandem repeats in the genome de novo. The total length obtained by this method was 22,382,718 bp, representing 2.76% of the genome. Finally, we merged the repeats identified by these procedures and removed overlapping portions, resulting in 612,733,451 bp of non-redundant repeats, which accounted for 74.75% of the assembled genome (Supplementary Information; Table S5; Figshare 1). Long terminal repeats (LTRs) represented 50% of the repeat sequences in the assembled genome (Supplementary Information; Table S6).
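Removing overlaps between repeat annotations from different tools amounts to taking the union of intervals, so that each base is counted once. The sketch below illustrates that idea for a single sequence under the assumption of half-open (start, end) coordinates; it is not the authors' script, and the toy intervals are invented.

```python
def nonredundant_length(intervals):
    """Total number of bases covered by half-open (start, end) intervals
    on one sequence, counting overlapping regions only once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # A new block begins: close the previous one, if any.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Toy annotations (hypothetical) from two tools on the same contig.
toy_repeats = [(0, 100), (50, 150), (200, 250), (240, 260)]
print(nonredundant_length(toy_repeats))  # 150 + 60 = 210, versus 220 if summed naively
```

For a whole genome, the same calculation would be run per chromosome or contig and the results summed.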

After gene prediction, we used CPC (v0.9-r2)¹⁹ to assess the coding potential of the predicted genes, which yielded 38,828 predicted genes. The average length of the predicted genes was 551.10 bp, and the average length of their coding sequences was 1153.47 bp. There were an average of four exons per gene with a length of 281.91 bp per exon, and the average intron length was 2147.77 bp. After comparing our predicted gene set with the functional databases SwissProt, NT (https://www.ncbi.nlm.nih.gov/nucleotide/), NR, PFAM²⁰, eggNOG (http://eggnogdb.embl.de/)²¹, GO (http://geneontology.org/page/go-database)²², and KEGG²³, 38,129 genes were annotated, accounting for 98.20% of the predicted genes (Table 6, Figshare 2). The overlap among the annotations from the NT, NR, UniProt-BLASTX, and UniProt-BLASTP searches is summarized in Fig. 1.
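The annotation rate quoted above follows directly from the two gene counts in the text, with the number of predicted genes as the denominator. A minimal check, using only those two reported numbers:

```python
annotated_genes = 38_129   # genes with at least one functional annotation (from the text)
predicted_genes = 38_828   # total predicted protein-coding genes (from the text)

annotation_rate = 100 * annotated_genes / predicted_genes
print(f"{annotation_rate:.2f}%")  # 98.20%
```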

Table 6 Summary of gene function annotations

**Fig. 1: Gene annotation Venn diagram.**

We predicted non-coding RNAs, such as rRNAs, tRNAs, snRNAs, and miRNAs, by comparing their sequences with the known non-coding RNA library Rfam²⁴. A total of 2,441 rRNAs, 214 snRNAs, and 281 miRNAs were thus predicted (Supplementary Information; Table S7, Figshare 3, 4, 5). tRNAscan-SE²⁵ was used to predict tRNA sequences in the genome, resulting in 712 tRNAs (Supplementary Information; Table S7). To further verify our gene annotation results, we conducted a BUSCO evaluation using the embryophyta_odb10 database, producing a result of 93%, indicating that the annotation results were acceptable (Supplementary Information; Table S8).

Gene families and phylogenetic analysis

OrthoMCL (v1.4)²⁶ was used to classify single-copy and multi-copy gene families from both closely related and distantly related species (Supplementary Information; Table S9 and Fig. S5), resulting in the identification of 930 C. sativa-specific genes. C. sativa shares more genes with Trema orientale and Morus notabilis than with other species (Supplementary Information; Fig. S6). We used MUSCLE software (v3.8.31)²⁷ to perform multiple sequence alignments for all single-copy gene family sequences. After we constructed the integrated supergene sequence, which was based on the fourfold degenerate sites (4DTv sites) of orthologous family genes, we used PhyML (v3.0)²⁸ to construct the species phylogenetic tree (ML-Tree). As shown in Fig. 2, Vitis vinifera and Fragaria vesca (Rosaceae) diverged from one another earlier than T. orientale, M. notabilis, and Ziziphus jujuba diverged from each other, and C. sativa is most closely related to T. orientale.

**Fig. 2: *C. sativa* phylogenetic tree.**

Based on our phylogenetic analysis of the integrated supergene sequence, we used PAML’s MCMCTREE software (v4.4)²⁹ and the Bayesian relaxed molecular clock method to estimate divergence times. We used calibration times obtained directly from TimeTree^30,31,32,33 (see Supplementary Material 10). The divergence time corresponding to the crown clade of Eudicots was 115 Ma (Fig. 3).

**Fig. 3: Phylogenetic relationships and divergence time.**

Whole-genome duplication and gene family expansion/contraction analysis

Gene family expansion and contraction were analyzed using statistical tests. After the cluster analysis of gene families, those with abnormal gene numbers in individual species were filtered out, and CAFE (v4.1)³⁴ and probabilistic graphical models (PGMs) were then used to model the gain and loss of genes along the specified phylogenetic tree and to test for gene family expansion and contraction (Fig. 4). We found 12,801 gene families in the MRCA (most recent common ancestor). In comparison to M. notabilis, T. orientale, V. vinifera, F. vesca, Musca domestica, Z. jujuba, and Papaver somniferum, there were 2,599 gene families showing expansion and 1,298 gene families showing contraction in C. sativa.

**Fig. 4: Fourfold degenerate site (4DTv) value distribution.**

Whole-genome duplication events are widespread in plants and are a potent force that drives plant genome evolution. We used MCScan (v0.8)^35,36 to identify genome synteny blocks within C. sativa and other related species. Using MUSCLE (v3.8.31)²⁷, we performed multiple sequence alignment of the internal sequences of the blocks and then calculated 4DTv site values. Based on the distribution of 4DTv site values, we estimated the relative timing of whole-genome duplication (WGD) or species split events. In general, three significant peaks were seen in the C. sativa genome (4DTv ~0.19, ~0.42, and ~0.92; Fig. 5), suggesting that C. sativa has experienced three WGD events. We also calculated the 4DTv values from collinear blocks between C. sativa and the genomes of three closely related species, M. notabilis, T. orientale, and Z. jujuba. Additionally, we identified two ancient WGD events in T. orientale with two peaks at ~0.58 and ~1.0. The γ event occurred after the divergence between monocots and dicots approximately 185 ± 55 million years ago (Mya)³⁷. Therefore, this pattern indicates that C. sativa experienced large-scale gene duplication more recently than T. orientale, 35 Mya. The divergence of C. sativa and T. orientale occurred at ~52 Mya (4DTv ~0.28), and C. sativa and T. orientale shared two WGD events (4DTv ~0.42 to ~0.58 and ~0.92 to ~1.0).
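As an illustration of the 4DTv statistic used here, the sketch below computes the uncorrected transversion rate at fourfold degenerate third codon positions for one pair of aligned coding sequences. It assumes the standard genetic code and in-frame, equal-length alignments; published pipelines often apply a substitution-model correction, which is omitted, and the function names and toy sequences are hypothetical.

```python
# Two-base codon prefixes whose third position is fourfold degenerate
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def is_transversion(base_a: str, base_b: str) -> bool:
    """True if one base is a purine and the other a pyrimidine."""
    return (base_a in PURINES) != (base_b in PURINES)


def four_dtv(cds_a: str, cds_b: str) -> float:
    """Uncorrected 4DTv: transversions at fourfold degenerate sites
    divided by the number of fourfold degenerate sites compared."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned, equal-length, and in frame")
    sites = 0
    transversions = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if set(codon_a + codon_b) - set("ACGT"):
            continue  # skip codons containing gaps or ambiguous bases
        if codon_a[:2] == codon_b[:2] and codon_a[:2] in FOURFOLD_PREFIXES:
            sites += 1
            if is_transversion(codon_a[2], codon_b[2]):
                transversions += 1
    if sites == 0:
        raise ValueError("no fourfold degenerate sites to compare")
    return transversions / sites


# Toy aligned gene pair (hypothetical): three fourfold sites, one transversion.
gene_a = "CTAGTCGGGATG"
gene_b = "CTCGTTGGGATG"
print(four_dtv(gene_a, gene_b))
```

A histogram of such values over all syntenic gene pairs gives the kind of distribution whose peaks are interpreted above.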

Genomic synteny block analysis can be used to determine the evolutionary source of chromosomes between species^38,39,40. In this study, we used BLASTP (v2.2.31+) to align the protein sequences of C. sativa and Z. jujuba (Rhamnaceae) and then used MCScan (v0.8) to identify genome synteny blocks from those alignments. Our results showed that C. sativa and Z. jujuba present a strong genomic synteny relationship (Fig. 6a).

**Fig. 6: Synteny analysis and LTR analysis.**

LTR retrotransposons constitute the majority of TEs in plant genomes, so identifying LTR sequences, characterizing their structure, and reconstructing their insertion history are important in plant genome research. The study of LTR evolution also elucidates the evolution of plant genome structure and function, because LTRs are involved in shaping genomic structure and size and thus affect the regulation and variation of genes and the origin of new genes. Research has shown that differences in LTR occurrence can be identified by comparing insertion times between the sequenced target species and related species⁴¹. In this study, LTRharvest (v1.5.10)⁴² and LTR_Finder (v1.0.5)⁴³ were used to identify LTR regions de novo in C. sativa, T. orientale, and Z. jujuba. LTR_retriever (https://github.com/oushujun/LTR_retriever)⁴⁴ was used to integrate the results from LTRharvest and LTR_Finder to obtain a high-quality LTR-RT library and to perform genome-wide LTR-RT annotation. We then generated LTR annotation regions and sequence information for the three species (Supplementary Information; Tables S11–S13). Through this analysis, we found that LTR retrotransposons, with a content of 37.1%, constituted most of the TEs. Our estimation of LTR insertion times revealed that the recent LTR burst in C. sativa differed from those of both T. orientale and Z. jujuba (Fig. 6b).
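The text does not state how insertion times were estimated. A common approach, sketched here as an assumption rather than as the authors' method, uses the fact that the two LTRs of an element are identical at insertion: their divergence K is converted to time as T = K / (2r), where r is a substitution rate per site per year. The rate below is a placeholder, not a value from this study, and the toy LTR pair is invented.

```python
import math


def jukes_cantor_distance(ltr5: str, ltr3: str) -> float:
    """Jukes-Cantor corrected distance between two aligned LTR sequences."""
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to equal length")
    pairs = [
        (a, b)
        for a, b in zip(ltr5.upper(), ltr3.upper())
        if a in "ACGT" and b in "ACGT"  # ignore gaps and ambiguous bases
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_time_years(distance: float, rate_per_site_per_year: float) -> float:
    """T = K / (2r): both LTRs accumulate substitutions independently."""
    return distance / (2 * rate_per_site_per_year)


# Toy 100-bp LTR pair (hypothetical) differing at three positions.
ltr_5prime = "ACGT" * 25
ltr_3prime = "CCGT" + "ACGT" * 11 + "ACTT" + "ACGT" * 11 + "ACGA"

ASSUMED_RATE = 1.3e-8  # placeholder substitution rate per site per year (assumption)
k = jukes_cantor_distance(ltr_5prime, ltr_3prime)
print(f"K = {k:.4f}, T = {insertion_time_years(k, ASSUMED_RATE):.3g} years")
```

Because the estimate scales inversely with the assumed rate, comparisons between species are only meaningful when comparable rates are used.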

Discussion

C. sativa is an invaluable plant species given its economic and ecological significance. Here, we generated an updated draft genome for wild-type varieties of C. sativa in China using PacBio single-molecule sequencing and Hi-C technology. Our assembled genome is approximately 808 Mb, with scaffold and contig N50 sizes of 83.00 Mb and 513.57 kb, respectively. Our new genome version is more contiguous than the previously assembled genome, which had a contig N50 of 12 kb. Therefore, the genome assembled in this study is superior to the previously assembled genome version.

In this study, the PacBio read assembly had a contig N50 of 632 kb. After further assembly with the help of the Hi-C physical map, the contig N50 became 513 kb. Hi-C technology concatenates contigs into superlong scaffolds at the chromosome level, so if the contigs were initially assembled correctly, the contig N50 would not change. However, due to the high heterozygosity of the wild-type varieties of C. sativa, chimeric contigs were prone to occur during the assembly process, and such contigs were split during Hi-C-assisted assembly. Splitting contigs lowers the contig N50 slightly, and we therefore considered the slight decrease in the contig N50 to be normal. Scaffolding with Hi-C further facilitated the assignment of all scaffolds to chromosomal positions. In this study, the new genome assembly presented higher genome contiguity and sequence quality than the previous genome assembly.

Because of the high repeat content and high heterozygosity of the cannabis genome, no high-quality cannabis genome had been generated previously. There are unknown regions in the cannabis genome assembled using SOAPdenovo software by Van Bakel et al. in 2011⁹. That genome was not assembled to the chromosome level, and the number and length of its scaffolds are much lower than the values expected for a plant genome. Therefore, the cannabis genome assembled in 2011 is of poor quality and does not contain annotation information, which greatly limits its applicability to research on cannabis. In this study, we reassembled a wild-type cannabis genome by using third-generation sequencing data and thus obtained a high-quality cannabis genome.

Single-molecule real-time sequencing offers high throughput and long read lengths, which reduce the number of contigs and improve the contiguity and completeness of the genome assembly. We combined third-generation sequencing (TGS) and next-generation sequencing (NGS) with Hi-C assembly technology to construct a high-density wild-type cannabis genome sequence map.

After obtaining the high-quality cannabis genome, we annotated its genes and thus considerably improved upon the 2011⁹ version of the genome. After assembling and statistically analyzing its repeat sequences, we found that cannabis has abundant repeat regions, which may be the cause of the poor quality of the cannabis genome assembled by Van Bakel et al. in 2011⁹. This high-quality reference genome will benefit researchers in the exploration and manipulation of the agronomic characteristics of C. sativa.

To understand the evolutionary status of cannabis, we analyzed its phylogeny and divergence times. Through these analyses, we found that cannabis and T. orientale are closely related at the molecular level. The quality of the T. orientale genome is still relatively poor, and the high-quality cannabis genome that we obtained in this study could therefore provide useful information for the future study of the T. orientale genome and its evolution. By analyzing whole-genome duplication events in cannabis, we found three recent WGD events and one large-scale duplication event in cannabis and that cannabis shares two WGD events with T. orientale. Our data further elucidate the evolutionary status of cannabis.

Materials and methods

DNA extraction and sequencing

All samples were collected from the Kyirong Gully (28°28′–28°66′ N, 85°13′–85°28′ E) in Tibet, China. The Kyirong Gully is a plateau gorge with an altitude of 1,700–6,000 m located on the south slope of the Himalayas and is very isolated from the outside world. Its special geography and environment make the gorge a typical vertical ecological system of Himalayan areas, and it is considered “the richest species gene bank in a plateau area”. We extracted high-quality genomic DNA from the leaves of female C. sativa using a Plant DNA kit (TIANGEN Biotech, Beijing, China). A 10–20 kb SMRTbell DNA library (Pacific Biosciences of California, CA, USA)⁴⁵ was prepared using BluePippin for DNA size selection (Sage Science, MA, USA) and then sequenced on the PacBio Sequel platform (Pacific Biosciences of California, CA, USA) with P6/C4 chemistry. Finally, 124 Gb of subreads were obtained, with 153× coverage of the C. sativa genome.

A paired-end library with an insert size of 350 bp was constructed and sequenced using the Illumina HiSeq X Ten platform.

Total RNA was extracted from the roots, stems, leaves, and seeds using the QIAGEN Total RNA Extraction Kit (QIAGEN, Hilden, Germany), and RNA-seq libraries were then constructed using the TruSeq RNA Library Kit (Illumina, CA, USA). These data were used for gene structure prediction.