The two chromosomes of the mitochondrial genome of a sugarcane cultivar: assembly and recombination analysis using long PacBio reads

Sugarcane accounts for a large portion of the world's sugar production. Modern commercial cultivars are complex hybrids of S. officinarum and several other Saccharum species. Historical records identify New Guinea as the origin of S. officinarum and indicate that a small number of plants originating from there were used to generate all modern commercial cultivars. The mitochondrial genome can be a useful way to identify the maternal origin of commercial cultivars. We used the PacBio RSII to sequence and assemble the mitochondrial genome of a South East Asian commercial cultivar known as Khon Kaen 3. The long read length of this sequencing technology allowed the mitochondrial genome to be assembled into two distinct circular chromosomes with all repeat sequences spanned by individual reads. Comparison of five commercial hybrids, two S. officinarum and one S. spontaneum with our assembly reveals no structural rearrangements between our assembly, the commercial hybrids and an S. officinarum from New Guinea. The S. spontaneum, from India, and one sample of S. officinarum (unknown origin) are substantially rearranged and have a large number of homozygous variants. This supports the record that S. officinarum plants from New Guinea are the maternal source of all modern commercial hybrids.

cases have been identified where the master circle no longer exists and the genome consists of multiple circular strands of DNA without shared sequence that could facilitate recombination 6,13. Plant mitochondrial genomes are unlikely to be limited to a single origin of replication 14,15. Break-induced repair and recombination has been proposed as a potential source of genome expansion and could be the cause of the long repeat sequences often found in plant mitochondria 16. These long repeats, together with DNA sequence shared with the nuclear and plastid genomes, can confound efforts to assemble plant mitochondrial genomes by introducing branch points that lead to multiple sequences including mitochondrial, nuclear or chloroplast sequence. This sequence sharing, the highly repetitive nature and the relatively large size of plant mitochondrial genomes make them difficult to assemble.

Results and Discussion
Sugarcane mitochondrial assembly. An assembly from CAP3 using a subset of corrected reads > 30 Kb consisted of 20 contigs, which included four mitochondrial contigs and two chloroplast contigs, with the remaining contigs coming from nuclear DNA based on BLAST results. The high number of nuclear contigs reflects the choice of a loose BLAST e-value cut-off, made so that all of the mitochondrial reads would be included to facilitate a complete final assembly. The four mitochondrial contigs could be joined to form two distinct circular chromosomes (Fig. 1) by using all of the corrected reads. The median read depth of all corrected reads to the two circular chromosomes was 13 and the mean read depth was 14. The larger chromosome, chromosome 1, is 300778 bp and includes a 15 Kb direct repeat sequence at 97558:113073 and 285262:300778 bp. There were reads that spanned both copies of the 15 Kb repeat sequence, indicating that both copies occur in a single circular chromosome, and no reads that supported any subgenomic circles from this sequence. The other chromosome, chromosome 2, is 144698 bp and forms a circular chromosome with no reads linking any sequence to chromosome 1.
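The coordinate arithmetic and the notion of a read "spanning" a repeat can be made concrete with a minimal sketch. The `Interval` type, the `spans` helper and the 500 bp flank are illustrative assumptions, not part of the published pipeline, and the check uses linear coordinates only (a read crossing the origin of the circle would need to be unrolled first).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A 1-based, inclusive coordinate range on one chromosome."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def spans(read: Interval, repeat: Interval, min_flank: int = 500) -> bool:
    """True if the read covers the whole repeat plus min_flank bp on each side.

    min_flank is a hypothetical threshold chosen for illustration.
    """
    return (read.start <= repeat.start - min_flank
            and read.end >= repeat.end + min_flank)


# Values reported in the text.
CHR1_LENGTH = 300778
CHR2_LENGTH = 144698
REPEAT_COPY_1 = Interval(97558, 113073)
REPEAT_COPY_2 = Interval(285262, 300778)

# Hypothetical read alignments on chromosome 1, for illustration only.
spanning_read = Interval(95000, 116000)
partial_read = Interval(98000, 116000)

if __name__ == "__main__":
    print("repeat copy lengths:", REPEAT_COPY_1.length, REPEAT_COPY_2.length)
    print("total assembly size:", CHR1_LENGTH + CHR2_LENGTH)
    print("spanning read spans copy 1:", spans(spanning_read, REPEAT_COPY_1))
    print("partial read spans copy 1:", spans(partial_read, REPEAT_COPY_1))
```

A read that only enters the repeat from one side cannot distinguish between the two copies, which is why full spanning with unique flanking sequence is the useful criterion here.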
While it is common for plant mitochondrial genomes to exist as a master circle with minicircles resulting from recombination between repeats, this is not the case for sugarcane. There were a total of 111 repeats in the mitochondrial assembly. The two largest repeats were the 15 Kb direct repeat and a 4 Kb inverted repeat. The remaining repeats were shorter than 360 bp with repeats in the range of 30-80 bp accounting for 87% of the total number of repeats. There were 47 repeats on chromosome 1, 11 repeats on chromosome 2 and 53 repeats shared between the two circular chromosomes. While the repeats shared between the two chromosomes could potentially facilitate recombination, the largest was only 296 bp so any recombination would have been easily detected by the long read lengths, yet none were found.
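The breakdown of repeats by location can be sketched as a simple tally. The record layout `(chromosome of copy A, chromosome of copy B, length in bp)` and the toy records below are assumptions made for illustration; they are not the Reputer output format and not the actual repeat list.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical record: (chromosome of copy A, chromosome of copy B, length in bp)
RepeatPair = Tuple[str, str, int]


def tally_repeats(pairs: List[RepeatPair]) -> Counter:
    """Count repeats per chromosome, or as 'shared' when the copies sit on different chromosomes."""
    counts: Counter = Counter()
    for chrom_a, chrom_b, _length in pairs:
        counts[chrom_a if chrom_a == chrom_b else "shared"] += 1
    return counts


def longest_shared(pairs: List[RepeatPair]) -> int:
    """Length of the longest repeat whose copies lie on different chromosomes (0 if none)."""
    return max((length for a, b, length in pairs if a != b), default=0)


# Toy input for illustration only.
toy_pairs: List[RepeatPair] = [
    ("chr1", "chr1", 15516),
    ("chr1", "chr1", 4076),
    ("chr2", "chr2", 120),
    ("chr1", "chr2", 296),
    ("chr2", "chr1", 45),
]

# Counts reported in the text: chromosome 1, chromosome 2, shared.
REPORTED = {"chr1": 47, "chr2": 11, "shared": 53}

if __name__ == "__main__":
    print(tally_repeats(toy_pairs))
    print("longest shared repeat (toy data):", longest_shared(toy_pairs))
    print("reported total:", sum(REPORTED.values()))
```

The longest shared repeat is the quantity that matters for the argument: it bounds how much sequence a read needs to cross in order to reveal an inter-chromosomal recombination product.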
While no recombination between the chromosomes was found, a single alternate arrangement was identified for chromosome 1 that involves a 4 Kb inverted repeat that occurs at 45730-49805 bp and 169987-174062 bp, with long reads spanning both copies. The alternate arrangement results in an inversion of the 120 Kb segment between the two repeats and deletion of one of the inverted repeats, with five reads supporting the inversion versus seven reads supporting the arrangement we have presented. The lack of large repeats shared between the two chromosomes, or of sequences derived through recombination, is solid evidence that the sugarcane mitochondrial genome exists not as a master circle with minicircles, but as two completely separate DNA circles. The mechanism by which plant mitochondrial genomes go from a combination of the master circle plus minicircles to multiple discrete DNA circles could be the break-induced repair and recombination events discussed by Christensen 16.
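As a rough illustration of the inversion, the sketch below reverse-complements the segment lying strictly between two repeat copies. The function names are hypothetical, the toy sequence is invented, and the sketch models only the inversion, not the accompanying deletion of one repeat copy described above.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def invert_between(seq: str, left_end: int, right_start: int) -> str:
    """Invert the segment strictly between 1-based positions left_end and right_start.

    left_end is the last base of the first repeat copy and right_start is
    the first base of the second copy; both copies are left in place.
    """
    head = seq[:left_end]
    middle = seq[left_end:right_start - 1]
    tail = seq[right_start - 1:]
    return head + reverse_complement(middle) + tail


def segment_between(left_end: int, right_start: int) -> int:
    """Number of bases strictly between two 1-based positions."""
    return right_start - left_end - 1


# Coordinates reported in the text for the inverted repeat on chromosome 1.
FIRST_COPY_END = 49805
SECOND_COPY_START = 169987

# Read support reported in the text.
READS_INVERSION = 5
READS_PRESENTED = 7

if __name__ == "__main__":
    toy = "AAAA" + "CCGT" + "TTTT"  # invented sequence, illustration only
    print(invert_between(toy, left_end=4, right_start=9))
    print("segment between repeats (bp):",
          segment_between(FIRST_COPY_END, SECOND_COPY_START))
    print("fraction of reads supporting the inversion:",
          READS_INVERSION / (READS_INVERSION + READS_PRESENTED))
```

With five reads against seven, neither arrangement is clearly dominant at this depth, which is consistent with reporting the inversion as an alternate arrangement rather than an assembly error.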
Sugarcane mitochondria annotation. We identified 66 unique open reading frames plus 26 duplicate copies and 17 partial chloroplast gene fragments (Table 1). These genes were primarily from the oxidative phosphorylation pathway (22 genes) and ribosome (10 genes). Fifty-seven genes are encoded by a single exon and eight genes are encoded across multiple exons. We found trans-splicing of group II introns in three genes: nad1, nad2 and nad5 (for review see 17). The exons of nad1 are separated by as much as 80 Kb and encoded on both the plus and minus strands of chromosome 1, consistent with findings in other species 17. The genes nad2 and nad5 have exons split between chromosome 1 and chromosome 2, similar to what was found for S. vulgaris 6. The RNA-seq data for six varieties plus genomic sequence for eight varieties from the DRA database were used to identify C → U RNA-editing in start and stop codons. Only nad1 had confirmed RNA editing at a start codon: all of the database varieties had the base identified in our assembly (cytosine) at this location, and all of the RNA-seq varieties had a uracil. No other cases of RNA-editing at start or stop codons were detected.
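The check for C → U editing at a start codon can be sketched as a comparison between the genomic codon and the codon observed in RNA-seq reads. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the ACG → AUG example codon is illustrative, since the text reports only that the genomic base is a cytosine and the transcript base a uracil.

```python
from typing import List, Optional


def c_to_u_edit_positions(genomic: str, transcript: str) -> Optional[List[int]]:
    """Return 0-based positions where a genomic C reads as U (T) in the transcript.

    Returns None if the sequences differ in any way other than C -> U,
    because such a difference is not explained by this type of editing.
    """
    if len(genomic) != len(transcript):
        raise ValueError("sequences must have the same length")
    dna = genomic.upper()
    rna = transcript.upper().replace("U", "T")
    positions: List[int] = []
    for index, (g_base, t_base) in enumerate(zip(dna, rna)):
        if g_base == t_base:
            continue
        if g_base == "C" and t_base == "T":
            positions.append(index)
        else:
            return None
    return positions


def editing_creates_start(genomic_codon: str, transcript_codon: str) -> bool:
    """True if C -> U editing turns a non-ATG genomic codon into AUG."""
    edits = c_to_u_edit_positions(genomic_codon, transcript_codon)
    return bool(edits) and transcript_codon.upper().replace("U", "T") == "ATG"


if __name__ == "__main__":
    # Illustrative codons only; not taken from the assembly.
    print(c_to_u_edit_positions("ACG", "AUG"))
    print(editing_creates_start("ACG", "AUG"))
```

Requiring that every mismatch be C → U keeps ordinary SNPs and sequencing errors from being reported as editing sites.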
There were 18 tRNA genes identified, three of which occurred twice in the assembled mitochondrial genome (Table 1). Seven of the tRNA genes plus six other genes are from the sugarcane chloroplast (indicated with '-cp' in Table 1) and primarily occur in the large sections of transferred chloroplast DNA. Gene transfer to and from the nucleus occurs commonly in mitochondria 18. Sugarcane showed the same gene loss and gain as Sorghum, with one exception: sugarcane regained trnL-CAA (Fig. 2).
Comparison with other species and sugarcane cultivars. We constructed a phylogenetic tree using 28 mitochondrial genes from seven species and found that sugarcane is most closely related to Sorghum (Fig. 2). Among species with database sequence, Sorghum has previously been identified as the closest relative of sugarcane by comparison of sugarcane BAC sequence 19. The two species are close enough that Sorghum could be used as a template to assemble the majority of sugarcane genic DNA 19. A BLAST comparison against the Sorghum mitochondrial genome showed that 345 Kb of the 468 Kb mitochondrial genome is represented in our assembly, although substantially rearranged (Fig. 3). This shows that a large amount of mitochondrial repeat sequence is shared between the two species. This includes 3 Kb of the inverted repeat and the entire direct repeat split into two parts, in both cases existing in the Sorghum genome as a single copy.
Table 1 (excerpt). Complex III: cob. Complex IV: cox1, cox2 [2], cox3. Cytochrome-c biogenesis: ccmB, ccmC, ccmFc [2], ccmFn.

Database sequence from eight varieties, including one S. spontaneum, two S. officinarum and five hybrids, was used to identify variants. A large number of structural variants were identified between the S. spontaneum, S. officinarum and hybrids that we checked (Table 2). All of the structural variants found were in SES205A (S. spontaneum, accession: SRR922216) and sample 82-72 (S. officinarum, accession: SRR922217). The clone SES205A originates from India, but the origin of cultivar 82-72 could not be traced. Interestingly, the other sample of S. officinarum, IJ76-514 (accession: SRR528718), originally sourced from Irian Jaya (New Guinea), did not have the same structural variants and instead was consistent with both our assembly and the other commercial hybrids. This is consistent with the hypothesis that all modern commercial varieties are derived from a New Guinean S. officinarum 3. We performed de novo assemblies of the two samples with the structural variants (SES205A and 82-72), and the contigs from these assemblies supported the structural variants identified by the mapped reads. However, both samples had a large number of contigs that could not be constructed into complete genomes because of the high number of structural variants found. The most notable differences between these two samples and the others are multiple cases of reads joining chromosome 1 with chromosome 2. It is possible that these two samples have a single circular DNA strand instead of the two in our assembly, or just a different arrangement involving two or more circles. A total of 2,243 small variants were identified from the eight database samples, consisting of 183 small InDels and 2,060 SNPs. We excluded any small variants with a per-sample minor allele proportion of less than 10% in an effort to exclude sequencing errors, which could not be reliably estimated from the database samples. The proportion of homozygous variants was small in most samples, in the range of 0 to 1%, with the exception of SES205A and 82-72, which both had 10%. This is consistent with the structural variations observed, where these two samples were markedly different from the others. The majority of small variants identified (2,088) were shared by one to five samples (Table 3) and are therefore likely to have originated after sugarcane development. The remaining 155 variants were common to six or more samples and thus likely occurred early during sugarcane development.
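The split of small variants by how many samples carry them can be sketched as follows. The variant identifiers, the sample names in the toy input and the `split_by_sharing` helper are hypothetical; only the threshold of six samples and the reported totals come from the text.

```python
from typing import Dict, List, Set, Tuple


def split_by_sharing(
    carriers: Dict[str, Set[str]], threshold: int = 6
) -> Tuple[List[str], List[str]]:
    """Split variants into those carried by fewer than `threshold` samples and the rest.

    `carriers` maps a variant identifier to the set of samples carrying it.
    Variants carried by no sample are ignored.
    """
    few = [v for v, samples in carriers.items() if 1 <= len(samples) < threshold]
    many = [v for v, samples in carriers.items() if len(samples) >= threshold]
    return few, many


# Totals reported in the text.
REPORTED_INDELS = 183
REPORTED_SNPS = 2060
REPORTED_FEW = 2088   # shared by one to five samples
REPORTED_MANY = 155   # common to six or more samples

# Toy input with invented variant and sample names, for illustration only.
toy_carriers: Dict[str, Set[str]] = {
    "var_a": {"s1"},
    "var_b": {"s1", "s2", "s3", "s4", "s5"},
    "var_c": {"s1", "s2", "s3", "s4", "s5", "s6"},
    "var_d": {"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"},
}

if __name__ == "__main__":
    few, many = split_by_sharing(toy_carriers)
    print("one to five samples:", few)
    print("six or more samples:", many)
    print("reported totals agree:",
          REPORTED_INDELS + REPORTED_SNPS == REPORTED_FEW + REPORTED_MANY)
```

Both reported breakdowns (by variant type and by sharing class) sum to the same total of 2,243 variants.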

Conclusions
We have assembled the mitochondrial genome of a commercial sugarcane hybrid, Khon Kaen 3, and annotated coding sequence with the aid of RNA-seq data. Phylogenetic analysis supports the previous finding of Sorghum being the closest relative to sugarcane in the database. Although we only have two samples of S. officinarum, the similarity between the modern hybrid cultivars and IJ76-514 (SRR528718) is consistent with the hypothesis that S. officinarum from New Guinea was used to generate all modern commercial cultivars.
We have shown that the sugarcane mitochondrial genome exists as two discrete DNA circles with no evidence of recombination between them. However, the pronounced rearrangement between sugarcane and Sorghum shows that significant rearrangements have taken place in the past. The large number of sequences linking the two chromosomes in the sample of S. spontaneum and one of the samples (82-72) listed as S. officinarum show that the events leading to the separate chromosomes we identified here must have occurred relatively recently. This is consistent with divergence estimates from chloroplast sequence that show S. officinarum diverged from S. spontaneum between 580 and 780 thousand years ago 20 .
The large differences in size, structural arrangement and level of recombination between published mitochondrial genomes of different species suggest that plant mitochondrial genomes are in an interesting phase of evolution 11. Sequencing additional species with long read length technologies is likely to yield additional insight into the evolution of plant mitochondrial genomes.

Materials and Methods
Sample and DNA extraction. The sugarcane sample is a commercial hybrid developed in Thailand, known as Khon Kaen 3. This cultivar was generated by crossing K84-200 (ROC1 x CP63-588) with 85-2-352 (SP70-1143 x Q76) and is commonly used in South East Asia. Leaf tissue was collected from a single plant and used for DNA extraction with the standard CTAB method followed by clean-up using a DNeasy Mini spin column from Qiagen.  All the corrected reads were then aligned with BLAST against the final assembly, and contigs that had overlapping sequence, or that were joined by reads, were merged to form a circular DNA strand. The corrected reads were mapped to this assembly using BWA MEM to confirm that the assembly was supported by the majority of reads and to check for evidence of alternate genome configurations that could result from recombination 22. Quiver (part of the SMRTanalysis suite) was then run on the final assembly to fix PacBio sequencing errors. An RNA-seq data set for six pooled sugarcane cultivars was obtained from DDBJ (SRR849062) and used to check for non-canonical start codons and RNA-editing 23. Open Reading Frames (ORFs) were predicted using Open Reading Frame Finder [https://www.ncbi.nlm.nih.gov/gorf/gorf.html]. The tRNA genes were identified using tRNAscan-SE 24. The annotated genes were also checked with the plant mitochondrial genome annotation program Mitofy 25. All predicted ORFs, tRNA genes and rRNA genes were searched against the publicly available mitochondrial nucleotide and protein sequence database. Codon usage was calculated from all (33) of the mitochondrial coding genes. Repeats were identified using Reputer v3.0 26.
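The codon usage calculation can be illustrated with a minimal sketch that counts in-frame codons across a list of coding sequences and reports relative frequencies. The `codon_usage` function and the toy sequence are assumptions for illustration; the text does not state which tool was used for this step.

```python
from collections import Counter
from typing import Dict, Iterable


def codon_usage(coding_sequences: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each codon across in-frame coding sequences.

    Each sequence is read from its first base in frame 0; trailing bases
    that do not complete a codon are ignored.
    """
    counts: Counter = Counter()
    for cds in coding_sequences:
        seq = cds.upper()
        usable = len(seq) - len(seq) % 3
        for start in range(0, usable, 3):
            counts[seq[start:start + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    # Invented coding sequence, for illustration only.
    usage = codon_usage(["ATGGCTGCTTAA"])
    for codon, frequency in sorted(usage.items()):
        print(codon, round(frequency, 3))
```

In practice the input would be the annotated coding sequences, with multi-exon genes concatenated in transcript order before counting.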
Sugarcane mitochondrial sequence comparison. A total of eight datasets of genomic sequence data from Illumina runs were downloaded from DDBJ 27,28. These data sets included one S. spontaneum sample (SRR922216), two S. officinarum samples (SRR922217 and SRR528718) and five samples listed as Saccharum hybrid (SRR528717, SRR871522, SRR922218, SRR922219, SRR922220). The reads from each data set were mapped to the sugarcane mitochondrial assembly generated in this work using Bowtie2 29, variants were called using GATK v3.4-46 30, and structural variants were identified visually using IGV 31. Small variants identified by GATK were only considered if the minor variant accounted for at least 10% of the reads on a per-sample basis. Variants within repeat regions, including chloroplast sequence, were excluded. In addition, a de novo assembly using Ray 32 was performed for the two samples with the most structural variants (SRR922216 and SRR922217). The sugarcane mitochondrial genome was compared to the Sorghum bicolor mitochondrial genome using BLAST and graphed using the BLAST Ring Image Generator 33.
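The 10% per-sample filter can be sketched as a threshold on the fraction of reads carrying the variant allele. This is one reading of the criterion, adopted as an assumption here: the variant allele, rather than the less frequent of the two alleles, must reach 10%, since homozygous variants are retained in the results. The function names and read depths are hypothetical.

```python
def variant_read_fraction(ref_depth: int, alt_depth: int) -> float:
    """Fraction of reads at a site that carry the variant (alternate) allele."""
    total = ref_depth + alt_depth
    if total == 0:
        return 0.0
    return alt_depth / total


def keep_variant(ref_depth: int, alt_depth: int, min_fraction: float = 0.10) -> bool:
    """True if the variant allele accounts for at least min_fraction of reads in this sample."""
    if ref_depth < 0 or alt_depth < 0:
        raise ValueError("read depths must be non-negative")
    return variant_read_fraction(ref_depth, alt_depth) >= min_fraction


if __name__ == "__main__":
    # Hypothetical (reference depth, variant depth) pairs for one sample.
    for ref_depth, alt_depth in [(95, 5), (90, 10), (0, 40), (0, 0)]:
        print(ref_depth, alt_depth, keep_variant(ref_depth, alt_depth))
```

Applying the threshold per sample, rather than on pooled reads, prevents a variant that is well supported in one accession from being diluted by the other seven.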