Organellar genome assembly methods and comparative analysis of horticultural plants

Although organellar genomes (including chloroplast and mitochondrial genomes) are smaller than nuclear genomes in size and gene number, organellar genomes are very important for the investigation of plant evolution and molecular ecology mechanisms. Few studies have focused on the organellar genomes of horticultural plants. Approximately 1193 chloroplast genomes and 199 mitochondrial genomes of land plants are available in the National Center for Biotechnology Information (NCBI), of which only 39 are from horticultural plants. In this paper, we report an innovative and efficient method for high-quality horticultural organellar genome assembly from next-generation sequencing (NGS) data. Sequencing reads were first assembled by Newbler, Amos, and Minimus software with default parameters. The remaining gaps were then filled through BLASTN search and PCR. The complete DNA sequence was corrected based on Illumina sequencing data using BWA (Burrows–Wheeler Alignment tool) software. The advantage of this approach is that there is no need to isolate organellar DNA from total DNA during sample preparation. Using this procedure, the complete mitochondrial and chloroplast genomes of an ornamental plant, Salix suchowensis, and a fruit tree, Ziziphus jujuba, were identified. This study shows that horticultural plants have similar mitochondrial and chloroplast sequence organization to other seed plants. Most horticultural plants demonstrate a slight bias toward A+T rich features in the mitochondrial genome. In addition, a phylogenetic analysis of 39 horticultural plants based on 15 protein-coding genes showed that some mitochondrial genes are horizontally transferred from chloroplast DNA. Our study will provide an important reference for organellar genome assembly in other horticultural plants. Furthermore, phylogenetic analysis of the organellar genomes of horticultural plants could accurately clarify the unanticipated relationships among these plants.


Introduction
Horticultural plants, which are grown for aesthetic value or as food in a home garden, can improve mental and physical health 1 . In plant cells, chloroplasts and mitochondria are the necessary organelles forming the powerhouse of the cell. Chloroplasts conduct photosynthesis, and mitochondria indirectly supply energy. In addition, both possess their own DNA. A horticultural plant cell generally has one copy of the nuclear genome and multiple copies of organellar genomes (including chloroplast and mitochondrial genomes). For example, the plastid genome in plant leaf cells has 400 to 1600 copies 2 . The chloroplast genomes of horticultural plants are highly conserved and possess a circular DNA structure varying from 120 3 to 163 kb 4 . The chloroplast genomes of horticultural species consist of four parts, two copies of inverted repeats (IR) of 20-28 kb in size, an LSC (large single-copy) area of 80-90 kb, and an SSC (small singlecopy) area of 16-27 kb 5 . The LSC and SSC areas are separated by the IRs. The mitochondrial genomes of horticultural plants are very complex and have distinct characteristics including large genome size, foreign DNA uptake, and continued recombination 6 . As a result of non-coding sequence extension and a large repetitive section 7 , the lengths of the published mitochondrial genomes of angiosperms, especially horticultural plants, vary in size 8,9 , ranging from 258 kb in Raphanus sativus 10 to 983 kb in Cucurbita pepo 11 . Sequence data of plant organellar genomes are accumulating at a very rapid pace. Currently, over 1193 chloroplast and 199 mitochondrial genome sequences of land plants are included in the NCBI Gen-Bank Organelle Genome Resources (http://www.ncbi.nlm. nih.gov/genome/browse/). However, only 39 organellar genomes of horticultural plants are present in the database.
Most strategies for assembling organellar genomes require the isolation of chloroplast or mitochondrial DNA from total DNA during the sample preparation. For chloroplast genome assembly, one of the time-consuming steps in the traditional method is to extend overlapping fragments by the polymerase chain reaction (PCR) from conserved gene loci. An alternative approach is to first isolate chloroplasts and then identify sequences using high-throughput sequencing techniques 12 . Similarly, there are several approaches for mitochondrial genome assembly. For example, Unseld et al. determined the sequence of the mitochondrial DNA of Arabidopsis thaliana using a shotgun-based approach. Mitochondrial DNA was first isolated from cosmid libraries of total Arabidopsis thaliana DNA. Random fragments were obtained from entire trimmed and subcloned cosmids. These fragments were then sequenced and assembled into contigs for unique mitochondrial sequences 13 . There are other two strategies for mitochondrial genome assembly: physical map-based 14 and gene-based 15 . For these methods, the key step is isolating organellar DNA. However, this step is challenging and time consuming 16 . In addition, the large size of replication and the dynamic nature of the mitochondrial genome, including foreign DNA uptake and genome recombination, make the sequence assembly complex.
Next-generation sequencing (NGS) technologies using Roche or Illumina platforms provide new high-throughput, low-cost, and efficient methods for chloroplast and mitochondrial genome assembly [17][18][19] . In this paper, we introduce an innovative and efficient method for de novo horticultural organellar genome assembly from next generation whole-genome sequencing data without organellar DNA isolation. We have successfully assembled the complete chloroplast and mitochondrial genomes of an ornamental plant, Salix suchowensis, and a fruit tree, Ziziphus jujube, which is the first plant in the Rhamnaceae family to have its chloroplast genome sequenced 20 . Whole-genome sequencing of these two plants was conducted at Nanjing Forestry University. Our study paves the way for the organellar genome assemblies of other horticultural plants 21 .

Materials and methods
Two approaches for completing high quality organellar genome sequences from NGS data are shown in Figs. 1 and 2. The assembly process includes data preparation, assembly of raw reads according to read depth of contigs, the creation of the contig graph, and the construction of the organellar genome sequences. Unlike the traditional method, there is no need to isolate chloroplast or mitochondrial DNAs from a mixture of nuclear and organellar DNAs during the sample preparation.

Data preparation
Whole-genome sequencing of an ornamental plant, S. suchowensis, was conducted on the Roche 454 and Illumina HiSeq 2000 sequencing systems at Nanjing Forestry University.
The fruit tree Z. jujuba was grown at Nanjing Forestry University, and its total DNA was extracted using a DNeasy Plant Mini kit 22 . The 454 pyrosequencing was Fig. 1 The pipeline flow chart of the assembly of the chloroplast genome of Ziziphus jujuba. The assembly process includes data preparation, assembly of raw reads, the creation of the contig graph, and the construction of the chloroplast genome sequences performed on a 454 GS FLX Sequencer with XLR 70 Titanium kit (Roche Diagnostics) following the manufacturer's standard protocol (Roche Diagnostics) 23 .

Chloroplast genome assembly of Z. jujuba
The pipeline used for the assembly of the chloroplast genome of Z. jujuba is shown in Fig. 1. The chloroplast genomes of homologous species are similar and can be used as reference genomes to obtain the order of contigs. Sequencing reads from the Roche 454 system were initially mapped to land plant chloroplast genome sequences through BLASTN search 24 . Amos 25 , Minimus 26 , and Phrap software 27 were then used to assemble the sequences. The detailed parameters for BLASTN were: blastn -db database_name -query input_file -out out-put_file -evalue 1e-5 -word_size 9 -outfmt 6.
Default parameters were used for Amos, Minimus, and Phrap. Connected contigs were linked, and the gaps were filled by BLASTN and PCR experiments. The wholegenome sequence was corrected based on Illumina sequencing data using BWA software 28 .
A novel method for organellar genome assembly A novel method for organellar genome assembly is shown in Fig. 2. Most chloroplast genomes are conserved and have a quadripartite organization, consisting of two copies of inverted repeats, a large-single-copy region, and a small single-copy region. The pipeline shown in Fig. 1 can be used to complete most chloroplast genome assemblies. However, assembling the mitochondrial genomes of related species by homology is more complicated, as reference genomes provide less information. Furthermore, the pipeline in Fig. 1 cannot determine the contig connection order. Thus, the method cannot fully complete the mitochondrial genome assembly. The pipeline shown in Fig. 2 can obtain the structural information and connect contigs easily.
The input of the procedure is the sequencing reads from the Roche 454 sequencing system. Newbler software was first used to assemble the raw reads and produce longer contigs. Mitochondrial and chloroplast genome-related contigs were then isolated from nuclear contigs. Contigs Fig. 2 The flow chart of a novel method for organellar genome assembly. The assembly process includes data preparation, assembly of raw reads according to read depth of contigs, the creation of the contig graph, and the construction of the organellar genome sequences. The mitochondrial genome of Ziziphus jujuba and the organellar (mitochondrial and chloroplast) genomes of Salix suchowensis were assembled using this method Step 1: prepare the query sequence; Step 2: "setup", specify related options and database, create a lookup table; Step 3: "BLASTN search"; Step 4: "back-track", input the preliminary matches and locate the insertions and deletions of uncertain sequences; Step 5: output the results in a file; Step 6: Phrap software assembles these alignments were divided into three categories: high read depth contigs, medium read depth contigs, and low read depth contigs. According to statistics from different plant species, high read depth contigs mainly belong to chloroplast genomes and nuclear repeat sequences, medium read depth contigs mainly belong to mitochondrial genomes and nuclear repeat sequences, and low read depth contigs belong to the nuclear genome. In this paper, we used read depth contigs over 100× as chloroplast genome candidate contigs and contigs between 50× and 100× as mitochondrial genome candidate contigs. Notably, the parameters for this step can be adjusted based on the user's own sequencing data.
The mitochondrial genome of Z. jujuba and the organellar (mitochondrial and chloroplast) genomes of S. suchowensis were assembled. Organellar contig graphs were plotted through Perl scripts. A visualized map was constructed using OmniGraffle software 29 .

Gap filling and correction
In our study, database indexing was used to fill the remaining gaps between sequences. As shown in Table 1, there are six steps for filling the remaining gaps. The input is related contigs with remaining gaps and the raw reads database. The first step is to prepare the query sequence with gaps. In the second step, we specified related options and searched the database using BLASTN to create a lookup table. The output format of the results can be adjusted through user options 30 . The third step is to discover matches between sequences and the database using BLASTN with an E value of 1e−5 30 . During this process, the position may not be located accurately, therefore this step should be iterated additional times. Finally, we assemble these alignments by the program Phrap 27 .
After obtaining a reference genome, shorter reads from Illumina sequencing platform are mapped to reference genomes through BWA 31 , forming a consensus sequence to determine whether there are base differences in the reference genome.
The detailed procedure of aligning Illumina short reads against the reference genomes using BWA are as follows: 1. PCR experiments have verified that this method can effectively correct errors in the assembled genome 22 .

Organellar genome analysis
To identify the phylogenetic position of horticultural plants, 39 horticultural plant mitochondrial genomes were downloaded from NCBI. A phylogenetic tree was constructed based on 15 protein-coding genes (atp1, atp9, ccmB, cob, cox1, cox3, nad1, nad3, nad4, nad4L, nad6, nad7, nad9, rps3, and rps4). The sequences of these genes were extracted by local Perl scripts. The program MEGA 32 was used for the alignment of conserved genes, building a tree of the species, and calculating GC content 32 . MEGA integrates multiple functions including aligning multiple sequences by ClustalW and the algorithms of neighbor-joining (NJ), maximum likelihood (ML), and minimum evolution (ME). The alignment of conserved genes was modified manually to remove gaps.

Sequencing data
The sequencing reads of Z. jujuba were generated using the Roche 454 GS FLX sequencer. A total of 573,141 raw reads were obtained with a mean length of 360 bp. After the quality checking by the program FastQC 33 , we retained 70,931 sequences (~34.50 Mb) and 2950 contigs whose quality was acceptable 22 . The sequencing of S. suchowensis was performed on the Roche 454 and Illumina HiSeq 2000 systems. A total of 1,240,387 raw reads were produced with a total length of 702,204,081 bp, and the mean size was 567 bp. After checking quality by FastQC 33 , we retained 235,005 contigs, and the longest length of a contig was 349,758 bp.

Complete chloroplast genome
The Amos 20 and Minimus software 34 with default parameters were used to assemble the chloroplast genome sequences of Z. jujuba (shown in Fig. 3). The sequences and detail information of each contig were stored in a fasta formatted file called "Contigs.fasta" and a text file called "Contigs.contig", respectively. In this process, 70,931 sequences (~34.50 Mb) and 2950 contigs were assembled. We further obtained 62 contigs by Phrap software 27 with default parameters. To confirm the location of the contigs in the Z. jujuba chloroplast genome, the final contigs were mapped to the Arabidopsis thaliana chloroplast genome. The N50 of contigs and the percentage of the organellar genome covered by the contigs of Z. jujuba were 84,718 bp and 98.38%, respectively.
Two methods, BLASTN search and PCR amplification with Sanger sequencing, were used to fill the remaining gaps. The gaps were assembled by Phrap 24 . We filled 2611 bp gaps completely by BLASTN and PCR for the Z.  35 . To obtain a high quality chloroplast genome, the assembled sequences were corrected based on high quality Illumina sequencing data by using BWA software. The Illumina sequencer could produce reads with high accuracy 36 . In this process, we successfully corrected 165 errors in the complete mitochondrial genome of S. suchowensis.
The chloroplast genome of S. suchowensis (Fig. 4) was assembled using the novel approach shown in Fig. 2. For S. suchowensis (NC_029317.1), 1,240,387 raw reads with a total length of 702,204,081 bp were first input into Newbler. Newbler software was used to assemble the Roche 454 GS FLX sequencing shorter reads and to produce contigs with longer length 37 . A contig graph was also plotted, in which the nodes are contigs and the edges are the reads spanning them. All the information on this Fig. 4 Salix suchowensis circular chloroplast genome map. Genes that belong to different functional groups are color-coded. GC content is represented on the inner circle by the dark gray plot. A total of 110 unique genes were annotated, including 76 protein-coding genes, 30 tRNA genes, and 4 rRNA genes graph, except the actual read alignments and consensus contig, is included in the 454 ContigGraph.txt file. There are several sections in the file. The first section is contig statistics, including contig number, name length, and contig read depth. The second section is the edge information, including the letter "C", the contig number on the left end of the edge, 5′ or 3′ to indicate which end of the contig the left edge refers to, the contig number at the right end of the edge, 5′ or 3′ to indicate which end of the contig the right edge refers to, and the depth of the edge ( Table 2). The first and second sections were used to assemble the organellar genomes. After calculation, we obtained 235,005 contigs, of which the longest contig was 349,730 bp. The chloroplast genome of S. suchowensis has been submitted to http://bio.njfu.edu.cn/gb2/gbrowse/ Salix_su_cp_sun/.

Complete mitochondrial genomes
Our previous study showed that the contig read depths in the nuclear DNA, mitochondrial, and chloroplast DNA were~1-2×, 50-100×, and over 100×, respectively 38 . According to read depth, we filtered out mitochondrial contigs that contained essential mitochondrial genes for further assembly. An initial mitochondrial contig graph was then constructed by Perl scripts based on the file 454ContigGraph.txt. In this process, the contigs in the first row of the file were used as a starting point to transverse all adjacent contigs; if there was a breakpoint, a new contig was selected to repeat the process. Contigs already connected with the original seed were considered as new seeds for searching its connected contigs recursively. In addition, because of the high frequency of chloroplast genomic DNA in the mitochondrial genome 39 , chloroplast-like contigs that were partially in a path were also saved for further analysis. At the same time, false links and forks that might belong to different genomes were removed according to the read depths of the contigs. A revised graph with repetitive contigs was constructed and is shown in Fig. S1. Eventually, a highquality mitochondrial genome including 13 contigs with a total length of 644,437 bp was completed 40 (Fig. 5). Similarly, we successfully assembled the mitochondrial genome of Z. jujuba and submitted it to the NCBI Genome database (NC_029809.1). The circular mitochondrial genome of Z. jujuba is shown in Fig. 6.

Analysis of organellar genomes
Chloroplasts and mitochondria are thought to have been developed during the formation of membrane compartments in eukaryotic cells in evolution. Nevertheless, some studies of their gene organization and content indicate that chloroplasts and mitochondria originated from cyanobacteria and alpha-proteobacteria, respectively 41 . Mitochondrial genome size, genome reorganization, and number of genes transferred from chloroplast genome into mitochondrial genome show a notable difference among higher plants because of homologous recombination during the evolution of the mitochondrial genome. Therefore, it is difficult to detect mitochondrial ancestry 42 .
Organellar  (Table S2), a slight bias toward A+T rich features was shown in the mitochondrial genomes of these two plants.
The chloroplast genomes of Beta macrocarpa, Butomus umbellatus, Cucurbita pepo, Malus domestica, and Vaccinium macrocarpon have not been included in NCBI. The average length of the completed chloroplast genomes of the 34 remaining horticultural plants is 151,720 bp. Among of them, Nelumbo nucifera has the longest length at 163,330 bp and Welwitschia mirabilis has the shortest length at 119,726 bp. Similar to mitochondrial genomes, in horticultural plants, A+T bases occupy a large proportion of the chloroplast genomes (Table S3).
Phylogenetic analysis of complete organellar genomes can identify plant evolutionary relationships accurately. In this study, a phylogenetic tree was constructed by an alignment of 15 protein-coding genes from 39 horticultural plants. As illustrated in Fig. 7, the 39 horticultural plants were categorized into two major groups: gymnospermae (colored by blue) and angiospermae (colored by red). The phylogenetic tree supported the separation of angiospermae and gymnospermae with 65% bootstrap value. A total of 27 dicotyledons in these plants were grouped in the category of angiospermae. The bootstrap value for the separation of eudicots and monocots is 66%. According to the phylogenetic tree, Z. jujuba is evolutionally closer to Malus domestica than to other plants. The sister relationship between S. suchowensis and S. purpurea is strongly supported 43 .
In plant evolution, the number of protein-coding genes in mitochondrial genomes declines (Table S1). As a representative species of dicot, the mitochondrial genome of Vitis vinifera has 61 protein-coding genes, which is almost the maximum number for all horticultural plants. Protein-coding genes such as PetA and Ycf4 in the mitochondrial genome of Vitis vinifera have been horizontally transferred from chloroplast DNA. In contrast, the mitochondrial genome of Geranium maderense, Allium cepa, and Vigna angularis have the minimum number of MttB, which encodes a transport membrane protein, was lost in Beta macrocarpa and V. angularis. More unusually, contrasting with three species in gymnospermae, the protein-coding genes of S. suchowensis and Z. jujuba include the same ATP synthesis genes (Atp1, Atp4, Atp6, Atp8, and Atp9) and NADH dehydrogenase subunits (Nad1, Nad2, Nad3, Nad4, Nad4L, Nad5, Nad6, Nad7, and Nad9). However, all three plants of the Fig. 6 The circular mitochondrial genome of Ziziphus jujuba. Genes belonging to different functional groups are color-coded. GC content is represented on the inner circle by the dark gray plot. This long circular mitochondrial genome encodes 58 unique genes (37 protein-coding genes, 19 tRNA genes, and 2 rRNA genes) Fig. 7 The neighbor-joining tree was constructed based on 15 conserved protein-coding genes of 39 horticultural plant mitochondrial genomes. The red and blue represent the categories angiospermae and gymnospermae, respectively. The numbers at the nodes are bootstrap support values gymnospermae have lost rpl10, and thus, it can be inferred that rpl10 has gradually developed into a pseudogene during the evolution of gymnosperms.
Some tRNA genes from chloroplast genomes have been inserted into mitochondrial genomes through intercellular transfers 39 . Our data show that the chloroplast genes trnM, trnH, and trnS are found in S. suchowensis and Z. jujube. The same gene insertion event was observed in other 28 horticultural species. The tRNA gene transformation of these plants may indicate that this phenomenon occurred before the formation of angiosperms. In addition, a mitochondrial-like gene, trnE, is found in many plants, with the exception of Cocos nucifera and Ginkgo biloba. All horticultural species can be generally separated into two groups according to their types of ribosomal genes. One group has the rRNA genes rrn5, rrn18, and rrn26, including Ajuga reptans, and the other group has the genes rrn5, rrnL, and rrnS, including G. maderense. Two plants are considered the exceptions: Z. jujuba lacks the rrn18 gene, and B. umbellatus has rrn16 in its mitochondrial genome.

Discussion
In this paper, we proposed an innovative and efficient assembly approach (shown in Fig. 2) for organellar genome assembly of horticultural plants using next generation sequencing data without isolating organellar DNA. We assembled the mitochondrial genome of Z. jujuba and the mitochondrial and chloroplast genomes of S. suchowensis using this pipeline. This study proved that our method can assemble both chloroplast and mitochondrial genomes.
Compared to other sequencing platforms such as SOLiD 44 and Illumina HiSeq 36 , Roche 454 sequencing is a high-throughput and low-cost sequencing technology, which can produce longer and relatively accurate reads (Table 3). In addition, a single lane of the Roche 454 platform is sufficient for organellar genome assembly 37 . Chloroplast or mitochondrial sequences can be well separated based on the read depths of the contigs derived from the sequencing reads.
To ensure high assembly quality, some quality control steps were included in this study. First, FastQC was used to check the raw sequence reads, which can provide a global picture of the quality of the sequencing data. Second, if the same species had both 454 sequencing data and Illumina data, Illumina sequencing data can be used for the correction of its organellar genome assembly using BWA. PCR experiments have proved that the BWA-based method can efficiently correct genome assemblies 22 . After obtaining the complete organellar genomes of horticultural plants, related genes, including protein genes, tRNAs, and rRNAs, were identified subsequently. GC content was also analyzed by a Perl script. Repeat sequences can be detected, which provide useful information to characterize mitochondrial genomes 45 , to investigate the influence of repeat sequences on mitochondrial genome size and to identify evolutionary changes in mitochondrial genome organization and structure 46,47 . In the process of evolution, mitochondria and chloroplast have a prokaryotic ancestry that could be suggested by their functions and genome organizations 48 . Moreover, most activities of the mitochondrial and chloroplast genomes are occasional and have an immediate or delayed impact on nuclear genome evolution because the nuclear genome and organellar genomes work together 48 . As a result, complete organellar genomes provide important to support breeding projects 49 and a better understanding of DNA transfers within and between the genomes and genomic recombination, which will facilitate the biological studies of horticultural plants in the future 21 .

Conclusions
In this paper, we have successfully applied a new, efficient approach to determine the complete chloroplast and mitochondrial genomes of two horticultural plants from Roche 454 GS FLX sequencing data. The Roche 454 GS FLX sequencer could generate longer sequencing reads 37 . Newbler, an efficient assembly software, also enabled the organellar genome assembly with high quality 50 . The read depths of contigs in the chloroplast and mitochondrial genomes rely on the proportion of total DNA and their copy numbers in the cell 37 . According to the read depths of the contigs and the copy numbers of the organellar genomes, we assembled chloroplast and mitochondrial DNA from the NGS data. Unlike the traditional method, there is no requirement to isolate organellar DNAs from total DNAs. Our method can also be extended to other platforms. We believe that this approach can be used for organellar genome assembly in other horticultural plants. Our method can also be applied to evaluate other sequencing platforms 51 .
A comparative analysis of the mitochondrial and chloroplast genomes of horticultural plants shows that they share most common genomic features with other plants. Mitochondrial gene comparison with other horticultural species will contribute to a systemic understanding of plant evolution. Complete horticultural organellar genomes and a phylogenetic analysis of these organellar genomes would provide useful clues for better understanding intra-genomic and inter-genomic DNA transfers and genomic recombination in horticultural plants 21 .