Introduction

Durian (Durio zibethinus) is a flowering tree from the family Malvaceae. It produces a large, spiky fruit with a strong husk and pungent edible flesh. The fragrance can be so pungent that the fruit is often banned from indoor public spaces. Durian is referred to as the ‘king of fruit’ and is commonly grown in Southeast Asian countries, such as Thailand, Indonesia and Malaysia. Fruit production is seasonal and the price is quite high relative to other fruits, making durian a valuable crop species. One of the most popular varieties in Thailand is Monthong, which has a large fruit, relatively mild odour and soft creamy flesh that is on the sweeter side of durian varieties.

A high-quality draft assembly of the durian genome was published in 2017 using PacBio reads and Chicago Hi-C, where approximately 95% of the ~738 Mb genome was covered by 30 scaffolds1. The durian chloroplast genome was assembled from Illumina sequence data into a 164 kb cyclic sequence2. Chloroplast sequences often, but not always, have a quadripartite structure consisting of a large single copy sequence and a small single copy sequence separated by a pair of large inverted repeats (IR) that range in size between 10 and 30 kb depending on the species3. The durian chloroplast was reported to have a large single copy sequence of 95.7 kb and a small single copy sequence of 20.9 kb, with a 23.6 kb inverted repeat2.
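The reported region sizes can be checked against the reported total with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the inputs are the rounded sizes quoted above, so the sum only approximates the 164 kb total.

```python
def quadripartite_length(large_sc_bp, small_sc_bp, ir_bp, ir_copies=2):
    """Total length of a circular plastome: both single copy regions plus the IR copies."""
    return large_sc_bp + small_sc_bp + ir_copies * ir_bp


# Rounded sizes reported for the published durian chloroplast (ref. 2).
published_total = quadripartite_length(95_700, 20_900, 23_600)
print(published_total)  # 163800
```

The result is within rounding error of the 164 kb reported for the published assembly.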

It is generally accepted that the chloroplasts, found in all members of the plant kingdom, are derived from a single ancestral origin. This is reflected in both the chloroplast genome size and organisation with high levels of synteny, among several other similarities (for review see4). The chloroplast genome in most species is highly conserved, which has led some to consider a functional role for the IR, such as the initiation of replication5, gene conservation6,7, or to help stabilise the genome6. However, some species have experienced IR expansion or contraction8,9,10 and some have lost an entire copy of the IR6,11 with no apparent negative consequences to the plant, which suggests that any function the IR may have is not essential. We used long PacBio reads to assemble the durian chloroplast genome.

Results and discussion

Durian chloroplast assembly and annotation

We have assembled the durian chloroplast from the Monthong variety using long PacBio reads and the CANU assembly program12 (with ABruijn assembler13 also returning the same assembly structure). The chloroplast assembled into a 142,733 bp cyclic contig that contained 111 genes (Fig. 1, Table 1). There were 46 direct repeats ranging in size from 45 to 586 bp and 5 small inverted repeats ranging in size from 63 to 169 bp. The majority of repeats were imperfect and contained several mismatches or gaps. The most striking finding was an absence of the IR that is common in plant chloroplasts. The assembled durian chloroplast contains only a single copy of the sequence that typically comprises the IR (Fig. 1). This was unexpected since the published durian chloroplast genome (MG138151.1), also from a Thai Monthong variety, was 163,974 bp and included an IR2. The junction of the IR in the published chloroplast is a 169 bp small inverted repeat in our assembly. This repeat is slightly longer than the short reads that were used for the published chloroplast assembly, so it is likely that this repeat caused an assembly error.

Figure 1

Structure of the Durio zibethinus chloroplast genome showing gene location and exon structure. Gray arrows at the top indicate transcription direction and gene location on the plus or minus strand is indicated by the exon being outside or inside the circle, respectively. GC content is indicated as a histogram on the inner circle. The sequence that typically comprises the IR is marked using the black line.

Table 1 Genes encoded on the Durio zibethinus chloroplast genome, grouped according to function.

To investigate the absence of the IR, we mapped the long reads (> 10 kb) to both the chloroplast sequence that we generated and to the published chloroplast sequence. The mapped reads showed that our chloroplast assembly was well supported: the long reads mapped with a fairly uniform distribution, with overlaps of more than 10 kb between reads at every point along the assembly (supplementary table S1). Multiple long reads confirmed that the contig was cyclic, with half of each read mapping to the start of the contig and the other half mapping to the end of the contig (supplementary table S2). In contrast, the published chloroplast assembly had a junction not supported by our sequence data. This unsupported junction is evident from a gradual decline in read depth leading into and out from the junction (Fig. 2). The unsupported junction occurs at the point of the IR that Cheon et al.2 reported, which is consistent with the lack of an IR in our assembly. The chloroplast genome tends to have relatively low diversity within a species4,14,15, so we do not expect that both assemblies would be correct, especially since they are from the same variety. Interestingly, when we used BLAST to search the published Musang King whole genome assembly1 with our chloroplast assembly, only two chloroplast contigs, totalling 40 kb, were found, meaning that the whole genome assembly lacks most of the chloroplast genome. This is likely the result of a filtering step during Hi-C scaffolding.
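The logic of testing whether long reads support a junction can be sketched in a few lines. The example below is a minimal illustration and not the pipeline used in this study: it assumes alignments have already been reduced to 0-based, half-open (start, end) intervals on the reference, and the function name, the flank parameter and the toy coordinates are all hypothetical.

```python
def spanning_reads(alignments, position, flank=1000):
    """Count alignments that cover `position` with at least `flank` bp on each side."""
    return sum(
        1
        for start, end in alignments
        if start <= position - flank and end >= position + flank
    )


# Toy alignments on a 100 bp reference. Reads stop at, or start from,
# position 50, mimicking a junction that no read crosses.
toy_alignments = [(0, 40), (10, 50), (20, 50), (30, 48), (50, 90), (52, 100), (50, 95)]

supported = spanning_reads(toy_alignments, 30, flank=5)
unsupported = spanning_reads(toy_alignments, 50, flank=5)
print(supported, unsupported)  # 3 0
```

A position with no spanning reads, flanked by read depth that tapers off on both sides, is the pattern expected at a misassembled junction.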

Figure 2

Read depth of long PacBio reads mapped against the published durian chloroplast genome (MG138151.1).

Comparison to other durian varieties

We performed short read sequencing of 24 Thai varieties (supplementary table S3) of durian and downloaded the Illumina reads from the Musang King variety1 (SRX3204603). We mapped the reads from these 25 durian varieties to both the published chloroplast assembly and our chloroplast assembly. Repeat regions and regions of extreme GC content (0–20% GC) showed significant drops in read depth in both assemblies (Fig. 3, Supplementary Figure S1-S24). Monthong was also included in the short read sequences and showed the same read depth drops as the other samples, indicating that the drops are a property of the short read data rather than of sequence differences between the samples. The most striking drop in read depth occurred at the IR of the published assembly, with all samples showing approximately half read depth at both copies of the IR (Fig. 3, Supplementary Figure S1-S24). This indicates that the IR does not exist in any of the 25 varieties, as the reads that should all map to a single copy are being divided between the two, consistent with our assembly showing no IR. However, read depth calling programs tend to set a maximum limit for each position, and with high copy number genomes, such as the chloroplast, this limit can easily be reached by even a moderate amount of sequence data, resulting in saturation of read depth at each location. Such was the case when we used the full data for the Musang King variety (Supplementary Figure S25): all positions showed fairly equal coverage when default settings were used, and the halved read depth signal was only visible when a small portion of the reads was used.
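The saturation effect can be illustrated with a toy calculation. The sketch below is not the analysis performed here; the depth values, the cap of 8000 and all function names are hypothetical, chosen only to show how a per-position maximum can hide a halved-depth signal that reappears after subsampling.

```python
def capped(depths, cap):
    """Report each depth as min(true depth, cap), as a depth-limited tool would."""
    return [min(d, cap) for d in depths]


def mean(values):
    return sum(values) / len(values)


def ir_to_single_copy_ratio(depths, single_copy, ir_copy):
    """Ratio of mean depth in an IR copy to mean depth in a single copy region."""
    return mean(depths[ir_copy]) / mean(depths[single_copy])


single_copy = slice(0, 10)   # hypothetical single copy positions
ir_copy = slice(10, 15)      # hypothetical positions in one falsely duplicated IR copy

full_data = [20000] * 10 + [10000] * 5      # true depth: IR copy at half depth
subsampled = [d // 10 for d in full_data]   # 10% of the reads
cap = 8000                                  # hypothetical maximum depth

ratio_full = ir_to_single_copy_ratio(capped(full_data, cap), single_copy, ir_copy)
ratio_sub = ir_to_single_copy_ratio(capped(subsampled, cap), single_copy, ir_copy)
print(ratio_full, ratio_sub)  # 1.0 0.5
```

With the full data every position reaches the cap and the ratio is 1.0; after subsampling, the expected ratio of 0.5 is recovered.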

Figure 3

Read depth of Musang King (SRX3204603) against our chloroplast and the published chloroplast genome sequences (MG138151.1).

We called variants in the 24 Thai varieties and the Musang King variety of durian mapped against our version of the chloroplast genome. Since the data is from the whole genome, there are also reads from the mitochondrial and nuclear genomes, which could be identified based on relative read depth in most cases. There were nine SNPs identified in six of the varieties that could be reliably identified as chloroplast based on read depth (Table 2). We also found 100 variants where the read depth was consistent with chloroplast, yet split between two alleles, suggesting heteroplasmy (supplementary table S4). There are a number of reports of chloroplast heteroplasmy in a variety of species, which is proposed to originate from bi-parental inheritance or through spontaneous mutation16,17,18,19. These studies mostly used PCR on purified chloroplast fractions, so confounding results from mitochondrial or nuclear DNA have been accounted for. Hoang et al.20 found that true chloroplast reads accounted for approximately 70% of whole genome shotgun reads that mapped to a chloroplast reference, with the other reads coming from the nuclear or mitochondrial genomes, fairly consistent with what we found, and suggested this could be used to identify cases of heteroplasmy. However, since we did not map to the whole genome, it is possible that these variants occur in high copy number sequence from the mitochondrial and/or nuclear genomes. In addition, there was no overlap with the nine SNPs that could be assigned as chloroplast only. Thus, these variants represent potential heteroplasmic variants, but lack sufficient evidence to confirm true heteroplasmy.
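The depth-based reasoning above can be expressed as a simple decision rule. The sketch below is a hypothetical illustration rather than the variant-calling procedure used in this study: the function name, the thresholds (30% depth tolerance, 5% and 95% allele fractions) and the category labels are all assumptions.

```python
def classify_variant(depth, alt_fraction, chloroplast_depth,
                     depth_tolerance=0.3, min_alt_fraction=0.05,
                     fixed_alt_fraction=0.95):
    """Label a variant using its read depth and alternate allele fraction.

    All thresholds are hypothetical and would need tuning per data set.
    """
    # Depth far from the typical chloroplast depth suggests mitochondrial
    # or nuclear origin.
    if abs(depth - chloroplast_depth) > depth_tolerance * chloroplast_depth:
        return "depth inconsistent with chloroplast"
    if alt_fraction < min_alt_fraction:
        return "likely sequencing error"
    if alt_fraction >= fixed_alt_fraction:
        return "chloroplast SNP"
    # Chloroplast-like depth but split alleles: a candidate only, since
    # high copy number nuclear or mitochondrial sequence is not excluded.
    return "candidate heteroplasmy"


print(classify_variant(1000, 0.99, 1000))  # chloroplast SNP
print(classify_variant(950, 0.45, 1000))   # candidate heteroplasmy
```

As noted above, a "candidate heteroplasmy" label from depth alone does not rule out contributions from other genomes.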

Table 2 Chloroplast SNPs identified from 24 Thai varieties and the Musang King variety of durian.

Comparison with other species

The protein sequences of the durian chloroplast genes were compared to 19 other species plus the published durian chloroplast and a phylogenetic tree was constructed (Fig. 4). The results show durian sharing ancestry with Tilia species and Theobroma cacao, consistent with analyses performed using the whole genome assembly1 and the chloroplast genome2. All of the closest species to durian are reported to contain an IR, so the loss of the IR most likely occurred after durian split from these species. It should be noted, however, that the chloroplast genomes for these species were also assembled from short read data, which, considering our findings, raises some doubt regarding their degree of accuracy.

Figure 4

Phylogenetic tree using chloroplast genes.

Conclusions

We have assembled the durian chloroplast from the Monthong variety using long PacBio reads that spanned all low complexity sequence, allowing the whole chloroplast genome to be assembled into a single high-confidence contig. Despite a published assembly that includes an IR, we found that the durian chloroplast lacks the IR that is common to plant chloroplast genomes. We then used publicly available short read data to show that it can be difficult to identify the assembly error using only short read data, which highlights the value of using long read data for de novo assembly.

Materials and methods

Sample and DNA extraction

The durian sample is a Monthong variety maintained at the Chantaburi Horticultural Research Center, Thailand. Leaf tissue was collected from a single plant and used for DNA extraction with the standard CTAB method. The DNA sample was subsequently purified with Ampure PB beads (Pacific Biosciences, Menlo Park, USA), and the DNA integrity was assessed using the Pippin Pulse Electrophoresis System (Sage Science, Beverly, USA).

Sequencing and assembly

DNA was used to prepare libraries for the PacBio RSII following the Pacific Biosciences ‘Procedure and Checklist-20 kb Template Preparation Using BluePippin Size-Selection System’ protocol. DNA (10 µg) was sheared with a Covaris gTube at 4500 rpm for 2 min, and the BluePippin cassette used was ‘0.75%DF Marker S1 high-pass 15–20 kb’ with selection of 12–50 kb. Sequencing was performed for 14 cells on the PacBio RSII. Raw reads longer than 20 kb were used as seed reads and reads shorter than 20 kb were used to correct them by the RS_PreAssembler.1 protocol with default settings from the Pacific Biosciences SMRTanalysis (v2.3.0) software package. The corrected reads were then assembled using CANU (version 1.8)12 and ABruijn assembler (version 2.0b)13. Quiver (part of the SMRTanalysis suite) was then run on the final assembly to fix PacBio sequencing errors. Annotation was performed using the online tool CpGAVAS21. The genome was plotted using OGDraw22. Repeats were identified by blasting the assembly against itself.
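A self-BLAST can be turned into a list of direct and inverted repeats by inspecting hit orientation. The sketch below is an assumption-laden illustration, not the script used here: it assumes the standard 12-column BLAST tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), in which minus-strand hits have sstart greater than send. The function name, the minimum length and the example hits are hypothetical.

```python
def classify_self_hits(lines, min_len=45):
    """Classify self-BLAST tabular hits as direct or inverted repeats."""
    repeats = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        length = int(fields[3])
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        # Skip the trivial hit of a region against itself.
        if qstart == sstart and qend == send:
            continue
        if length < min_len:
            continue
        # On the minus strand the subject coordinates are reversed.
        kind = "inverted" if sstart > send else "direct"
        repeats.append((kind, length, qstart, qend, sstart, send))
    return repeats


# Hypothetical hits, for illustration only.
example_hits = [
    "cp\tcp\t100.0\t142733\t0\t0\t1\t142733\t1\t142733\t0.0\t260000",
    "cp\tcp\t98.2\t169\t3\t0\t1000\t1168\t90168\t90000\t1e-70\t280",
    "cp\tcp\t95.0\t60\t3\t0\t2000\t2059\t5000\t5059\t1e-20\t90",
    "cp\tcp\t100.0\t30\t0\t0\t10\t39\t500\t529\t1e-5\t50",
]
example_repeats = classify_self_hits(example_hits)
```

Each repeat pair appears twice in a self-BLAST (once in each direction), so a real analysis would also need to merge reciprocal hits.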

The published chloroplast genome assembly2 and the raw reads from the whole genome assembly of the Musang King variety durian1 (SRX3204603) were downloaded from NCBI. All of our corrected reads and the SRX3204603 reads were then mapped to our assembly and the published assembly, using BWA MEM for the PacBio reads and bowtie23 for the Illumina reads, to confirm that the assembly was supported by the majority of reads. Read depth was calculated from each set of mapping data using samtools depth and plotted using ggplot2 in R.
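Per-position output from samtools depth (tab-separated reference name, 1-based position and depth) is usually summarised before plotting. The sketch below shows one way to compute windowed mean depth; it is an illustration with a hypothetical function name and window size, not the script used for the figures, and plotting is left out.

```python
def windowed_mean_depth(lines, window=1000):
    """Average samtools-depth style records into fixed-size windows.

    Positions missing from the input are treated as zero depth, so a
    partial final window will be underestimated.
    """
    totals = {}
    for line in lines:
        if not line.strip():
            continue
        chrom, pos, depth = line.split()[:3]
        index = (int(pos) - 1) // window  # positions are 1-based
        totals[(chrom, index)] = totals.get((chrom, index), 0) + int(depth)
    return {key: total / window for key, total in sorted(totals.items())}


example_depth = ["cp\t1\t10", "cp\t2\t20", "cp\t3\t30", "cp\t4\t40"]
example_windows = windowed_mean_depth(example_depth, window=2)
print(example_windows)  # {('cp', 0): 15.0, ('cp', 1): 35.0}
```

Each key is a (reference name, window index) pair, so window i covers positions i × window + 1 to (i + 1) × window.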

Sequence comparison and phylogenetic tree

A phylogenetic tree was constructed using 19 species (Gossypium thurberi, Gossypium barbadense, Gossypium arboreum, Gossypium robinsonii, Abelmoschus esculentus, Talipariti hamabo, Hibiscus syriacus, Althaea officinalis, Tilia paucicostata, Tilia oliveri, Tilia mandshurica, Tilia amurensis, Theobroma cacao, Aquilaria sinensis, Arabidopsis thaliana, Brassica napus, Citrus sinensis, Panax ginseng, Sesamum indicum). Gene sequences from each species for 111 common genes (Table 3) were compared and a phylogenetic tree was constructed using MEGA-X with the maximum likelihood method and 1000 bootstrap replicates24.

Table 3 List of genes that were used to construct a phylogenetic tree.