The architecture of the Plasmodiophora brassicae nuclear and mitochondrial genomes

Plasmodiophora brassicae is a soil-borne pathogen that attacks roots of cruciferous plants causing clubroot disease. The pathogen belongs to the Plasmodiophorida order in Phytomyxea. Here we used long-read SMRT technology to clarify the P. brassicae e3 genomic constituents along with comparative and phylogenetic analyses. Twenty contigs representing the nuclear genome and one mitochondrial (mt) contig were generated, together comprising 25.1 Mbp. Thirteen of the 20 nuclear contigs represented chromosomes from telomere to telomere characterized by [TTTTAGGG] sequences. Seven active gene candidates encoding synaptonemal complex-associated and meiotic-related protein homologs were identified, a finding that argues for possible genetic recombination events. The circular mt genome is large (114,663 bp), gene dense and intron rich. It shares high synteny with the mt genome of Spongospora subterranea, except in a unique 12 kb region delimited by shifts in GC content and containing tandem minisatellite- and microsatellite repeats with partially palindromic sequences. De novo annotation identified 32 protein-coding genes, 28 structural RNA genes and 19 ORFs. ORFs predicted in the repeat-rich region showed similarities to diverse organisms suggesting possible evolutionary connections. The data generated here form a refined platform for the next step involving functional analysis, all to clarify the complex biology of P. brassicae.

www.nature.com/scientificreports www.nature.com/scientificreports/ Application of long-read sequencing greatly supports not only reliable pathotyping (diagnostics) or identification of important structural variants when genomic gaps with repetitive or unique sequences are closed but also accurately resolve chromosomes in complex genomes. Genomic information of high quality is crucial for enhancing our understanding of important evolutionary events, which could give us insights into related organisms together with events behind lost and gained traits. Such information could in case of plant pathogens be extra valuable for development of durable control strategies. Reliable genomic information is essential not least for experimentally challenging obligate biotrophic soil-borne pathogen as P. brassicae where a number of questions in its life-cycle and differences between pathotypes remain to be clarified.
We previously made a de novo genome assembly of the P. brassicae strain e3 based on a combination of Illumina and 454 sequencing approaches 16 . This achievement was followed by genome information for isolates from Canada, China and Germany based on similar technologies [21][22][23] . The former sequencing of the P. brassicae e3 genome 16 generated two mitochondrial contigs whose complete assembly could not be achieved, thus the mitochondrial sequence was not made public. Here, we applied long-read PacBio RSII single molecule real-time (SMRT) sequencing technology to fill in sequence gaps and further resolve regions with repetitive sequences of the P. brassicae e3 genome. We present new data based on twenty contigs representing the nuclear genomic content of P. brassicae e3 and one contig covering the entire mitochondrial genome generated by this approach. We were able to identify telomere sequences similar to those found in green algae and Theileria annulata, an apicomplexan parasite. The assembly and de-novo annotation of the mitochondrial genome revealed a 114,663 bp large genome with a complex sequence organization and a distinct 12 kb repeat-rich region. These findings emphasize the value of long-reads in resolving unrecognized genomic variation and highlight the importance of distinguishing biological from technical sequence differences.

Results and Discussion
P. brassicae has a t 4 AG 3 telomere repeat composition and possess meiosis-related proteins in the nuclear genome. By using the long-read SMRT technology, the numbers of nuclear assembled contigs were reduced from 165 16 to 20 (Supplementary Table S1). In the 25 Mb nuclear genome coding sequences were evenly distributed on each scaffold, interspersed with minor repeat regions (Fig. 1). Repetitive sequences were looked for manually and by using the tandem repeats finder 24 . Centromeric candidates with different length could be found on all nuclear contigs (Fig. 1). Thirteen of the 20 contigs of the nuclear genome (ranging from sizes between 692,149 bp to 2,120,846 bp) represent complete chromosomes from telomere to telomere ( Fig. 1; Supplementary Table S2). The [TTTTAGGG] or T 4 AG 3 telomeric sequences of P. brassicae ( Supplementary  Fig. S1) are identical with those found in Chlamydomonas reinhardtii, the green alga 25 and Theileria annulata, an apicomplexan animal parasite 26 . Among Archaeplastida, the T 3 AG 3 telomeric motif is most common in plants for example in the P. brassicae e3 host Arabidopsis thaliana whereas the telomeres in red and green algae vary between T 4 AG 3 , T 3 AG 3 , T 2 AG 3 , and T 2 AG 6 27,28 . Five of the remaining seven P. brassicae contigs were terminated by a single telomere. Efforts to further assemble all the remaining sequences into longer contigs terminated with telomeres were not successful. This could be explained by the enrichment of repetitive sequences at the ends making overlapping in silico analysis and additional PCR analysis followed by Sanger sequencing unsuccessful. Alternatively, the seven contigs may be dispensable or form supernumerary chromosomes. The organization of the nuclear chromosomes into two groups: a core set (permanent chromosomes) and accessory, dispensable, or supernumerary chromosomes is common among plant pathogenic fungi 29 . The extra chromosomes are known to play important roles for evolution, high recombination rates and adaptation to external changes 30  and mitochondrial genome (no. 21). Telomeres are marked with black ends. Density of coding (red) and repeat (blue) sequences in non-overlapping sliding 10 kbp windows. The color intensity is proportional to the given feature density.
www.nature.com/scientificreports www.nature.com/scientificreports/ dispensable chromosomes are enriched for genes involved in host colonization, infection and traits attributed to polymorphism observed between different isolates. Any such information and connection is yet not reported in P. brassicae. Earlier analyses based on pulse-field gel electrophoresis revealed between 6 to 16 chromosomal bands ranging from 680 kb to 2.2 Mb in size, including chromosomal polymorphisms between different single-spore and field isolates 12,31 . Whether the different observations reflect biological differences or are technical related remains to be clarified.
The MAKER 32 annotation pipeline combined with several ab-initio gene predictors generated 9,231 protein-encoding genes in the P. brassicae nuclear genome (Supplementary Table S1). In a comparison of P. brassicae with B. natans, R. filosa and the transcriptome of S. subterranea 5,605 proteins were earlier assigned to be P. brassicae specific 16 . In an extended and new comparison including additional transcriptome data from marine species 15 a joint core of 476 proteins shared with 12 other Rhizaria species was found. This revised analysis generated 3,017 P. brassicae-specific proteins whereof a majority lacked functional annotations (Supplementary Fig. S2; Supplementary Table S3).
Synaptonemal complexes (SC) were found by using serial thin sections of P. brassicae for electron microscopy 33 . SC, a proteinaceous structure, is known to assemble at the interface between pairs of homologous chromosomes at prophase I (zygotene) of meiosis I. SC formation is an essential feature of meiosis but is not strictly conserved in all organisms [34][35][36] . We used our new P. brassicae genome sequence to search for SC-associated and meiotic-related gene information (Supplementary Table S4) and next monitored their activity in enriched life stages of P. brassicae ( Supplementary Fig. S3). Candidates such as HOP1, ZIP1, ASY2, REC8, MER3, MSH5, FKBP6 were found to be highly active whereas animal orthologs such as C(3)G, SYCP1 and SYP2 were suppressed. SC proteins and important proteins for SC modification are known to vary on sequence level 36,37 which implies that additional genes with similar functions but not identified here can be present in P. brassicae. However, much remains to be learnt on SC components, their dynamics and regulation in general, and not least in P. brassicae. P. brassicae has a large mitochondrial genome with a 12 kb repeat-rich region. The mitochondrial genome represented by a single contig has a circular structure and a size of 114,663 bp (Supplementary Table S2). A uniform coverage with an average depth of ~800x was obtained across the genome, except from 47,000 to 50,000 bp where the depth decreased to ~400×. The possibility of miss-assembly was minimized by well-aligned reads spanning over this 3 kb stretch of AT-rich sequences that caused the reduced depth. The AT-rich sequences were found to be part of a 12,500 bp long repeat-rich region (42,150 bp) by a dot-plot self-similarity comparison ( Supplementary Fig. S4a). A closer look into the dot plot indicated presence of tandem minisatellite-and microsatellite repeats with partially palindromic sequences in this region ( Supplementary Fig. S4b).
Mt sequences are available for three P. brassicae strains: Pb3 from Canada 21 , ZJ-1 from China 22 and eH from Germany 23 . In BLASTn search of the whole-genome shotgun database the mt sequence of e3 shared high identity (>99.95%) with these strains (Supplementary Table S5). However, the e3 mt sequence is about 11, 13 and 21 kb longer than eH, Pb3 and ZJ-1, respectively. We next compared the mt sequences from these three P. brassicae strains with e3, here visualized by dot-plots. While the sequences show high similarity for most of their length, the 12 kb repeat-rich region in e3 was not resolved in any other strain (Supplementary Figs. S5-S7). Neither was the repeat-rich region identified in the mt contigs we generated earlier by applying Illumina/454 technologies on the e3 genome 16 . The previously excluded sequence is provided here (Supplementary Table S6). The difference between the updated and former e3 mt sequence is visualized in Supplementary Fig. S8.
When analyzing mitochondrial synteny between the four P. brassicae strains, two locally collinear blocks (LCB) with highly conserved sequences and no rearrangements were identified by the whole genome alignment tool Mauve 38 ( Supplementary Fig. S9). The area outside LCBs corresponds to the repeat-rich region identified in the e3 genome and lacked detectable homology in the other three genomes. If the homologous LCBs from eH, Pb3 and ZJ-1 are aligned to the e3 mt genome as illustrated in Fig. 2, absence of the repeat-rich region considerably contributes to genome size differences observed between the strains.
In conclusion, the mt intragenomic variation observed between the four strains ( Fig. 2) is most likely technical rather than biological related. Several studies have reported unrecognized variation by short-read sequencing technologies and demonstrated that long reads which can span highly repetitive regions and thereby facilitate assembly, are essential for correct genome resolution 39  www.nature.com/scientificreports www.nature.com/scientificreports/ the P. brassicae mt genome has a complex gene organization. We initially annotated the P. brassicae e3 mt sequence with MFannot, an automated tool commonly used for gene prediction in organelle genomes 41 . However, in the MFannot output many protein-coding and RNA genes were fragmented and with numerous introns predicted in intergenic regions, indicating interruption of coding sequences and incomplete annotations. Similar "mosaic" gene structure was reported in the eH mt genome annotated by MFannot 23 . To optimize de novo annotation, we combined automated predictions done with Prokka 42 and MAKER2 43 using Repeatmasker 44 , tRNAscan-SE 45 , Uniprot/Swiss-Prot mitochondrial proteins, the ribosomal database 46 and Rfam 47 . P. brassicae e3 transcriptome data 16 were re-assembled and mapped to the PacBio mt sequence to provide further information. The annotated mt genome of S. subterranea 48 was used as an additional source. All data were uploaded to Web Appollo 49 , a web based annotation editing platform to facilitate manual curation.
This extensive annotation procedure identified seventy-nine genes in the P. brassicae e3 mt genome. Thirty-two are predicted as protein-coding, 28 as structural RNA genes and 19 are ORFs ( Fig. 3a; Supplementary  Table S7). Seventeen protein-coding genes are involved in the mitochondrial respiratory chain, ten genes code for small ribosomal subunit proteins, four for proteins of the large ribosomal subunit and one gene (rdp) is coding for a RNA-directed DNA polymerase (Supplementary Table S8). Structural RNAs include large and small ribosomal subunits (rnl and rns), 5S ribosomal RNA (rrn5), a ribonuclease P type B (rnpB) and 24 transfer RNAs (Supplementary Table S9). By sequence similarity searches of the Uniprot/Swiss-Prot database (e-value < 10e-6) functional information was retrieved for ten ORFs (3 intergenic and 7 within introns) while nine ORFs (4 intergenic and 5 within introns) had no significant similarity (Supplementary Table S10). Seven intron group II and 2 transposon-like elements (Supplementary Table S11) and few simple repeats (Supplementary Table S12) were also found in the e3 mt genome.
In the P. brassicae eH strain sixty mt genes were identified 23 , divided on 32 protein-coding and 28 structural RNA genes corresponding to numbers and functional categories in e3. No transposable elements, ORFs, intron group II or repeats were reported. A detailed comparison of translated coding sequences could not be carried out here because amino acid sequences are not available for eH.
In comparison to its closely related potato scab pathogen S. subterranea 48 , the P. brassicae mt genome is roughly three times larger ( Fig. 3a-f; Supplementary Table S7). When aligned, the genomes shared two syntenic blocks with well-conserved sequences, free from rearrangements (Fig. 4). Homology with the S. subterranea mt genome was not found in the 12 kb repeat-rich region in P. brassicae e3. Otherwise, the gene order between the genomes is nearly identical as shown in the gene maps (Fig. 3a,f). The map view demonstrates that presence of the repeat-rich region, high numbers of introns and variation in intergenic regions contributed to larger size of the P. brassicae e3 mt genome. Due to a larger mt genome size, P. brassicae e3 has a slightly lower coding density (54%) than S. subterranea (66%) (Supplementary Table S7).
When comparing the incidence of protein-coding genes, three genes (atp8, rpl5 and rps10) were missing in the S. subterranea mt sequence (Supplementary Table S8). Several small ribosomal subunit proteins share overlapping sequences in the e3 mt genome, including rps7/rps12 (19 bp), rps11/rps13 (30 bp) and rps3/rps19 (22 bp). Similar sequence overlaps between ribosomal genes are also seen in the S. subterranea mt genome 48 . The majority of e3 protein-coding genes (23 out of 32) start with the standard ATG codon (methionine) whereas TTA and TTG (leucine) seem to be used by five genes, and ATC and ATT (isoleucine) by three genes as alternative initiation codons (Supplementary Table S13). The GTG codon (valine) is employed only by nad5, in contrast to its extensive use among mt proteins in Lotharella oceanica, the rhizarian chlorarachniophyte alga 50 . Most of the e3 genes are terminated by TAA and in few cases with TAG codon. TGA, the third standard stop codon is translated into tryptophan in the mt genomes of P. brassicae e3 and S. subterranea 48 . This TGA usage pattern is therefore important to consider when selecting a correct genetic code for translation of P. brassicae mt coding sequences.
All structural RNA genes except arginine (trnR), proline (trnP) and tryptophan (trnW) were encoded on an 11,332 bp segment in the P. brassicae e3 mt genome (Fig. 3a). Densely located ribosomal RNA genes have been found in several other mt genomes and are believed to be involved in balanced co-transcription and RNA turnover 51 . Total number of tRNAs (24) was identical to that found in S. subterranea, with the difference that the e3 mt genome codes for glutamine (trnQ) and one copy of histidine (trnH) (Supplementary Table S9).
In contrast to S. subterranea, the P. brassicae e3 mt genome is strikingly intron-rich. Twelve genes in the e3 mitochondrial respiratory chain harbor the majority (41 out of 54) of the introns (Supplementary Table S8) while the remaining 13 introns are found in the large and small ribosomal subunit (Supplementary Table S9). Only five introns in three mt protein-coding genes were found in S. subterranea 48 , whereas chlorarachniophyte algae, B. natans and L. oceanica lack introns in their mt genomes 50 . Defining exon-intron splicing sites, especially in the genes with multiple exons was a challenge during annotation of the e3 mt genome and some positions remain to be exactly clarified. Application of the canonical acceptor-donor sites (GT-AG, GC-AG, AT-AC) resulted in reading frame shifts and disruptions of coding sequences predicted based on sequence alignments with known proteins. According to our current annotations supported by transcriptome data 16 , a new splicing pattern with the GT-AT and GA-CA acceptor-donor sites is likely employed by the P. brassicae e3 mt genome. Further, seven group II introns and two transposon-like elements, including Long Terminal Repeat (LTR) and Enhancer/ Suppressor-mutator (En/Spm) were identified in the e3 mt genome (Fig. 3a; Supplementary Table S11). All these sequences except En/Spm were found to overlap exon-intron boundaries and may have an important role in the splicing mechanism. The MFannot tool 41 predicted much higher number of group II introns (45) whereof 34 in gene regions (Supplementary Table S14). These introns were not submitted to the European Nucleotide Archive since no other annotation software supported their prediction. Annotated mt sequences from other plasmodiophorids are needed to resolve whether the higher number of group II introns represents an overestimated or a common landmark. Out of nineteen ORFs predicted in the P. brassicae e3 mt genome, the majority (12) was found within introns of the protein-coding genes (Supplementary Table S7). Seven of these ORFs showed significant similarity (e-value < 10e-6) with known proteins including among others three maturase-like proteins (mat1 and www.nature.com/scientificreports www.nature.com/scientificreports/ mat2), a group II intron-encoded protein (LtrA) and a DNA binding endonuclease (spmit.06) (Supplementary Table S10). Much remains to understand about evolution and regulatory function of the complex intron pattern in the P. brassicae e3 mt genome. It is intriguing to speculate that intron-encoded maturases could assist in the splicing processes, maybe promoting self-splicing as observed in fungal mitochondria 52 .
Overall GC content is close to identical in the mt genomes of P. brassicae e3 (26.2%) and S. subterranea (26.8%) 48 while being lower compared to the chlorarachniophyte algae B. natans (42.2%) and L. oceanica (50.1%) 50 . However, large variation in the GC content was observed across the 12 kb repeat-rich region in P. brassicae e3 (Fig. 3b). A drastic increase above 75% of GC content was present from 42,800 to 47,200 bp, in the region where tandem minisatellites were identified (Supplementary Fig. S4b). Two ORFs coding for hypothetical proteins and one ORF with high similarity to thrombospondin motifs were annotated in this region ( Fig. 3a; Supplementary Table S10). A sharp drop to below 25% of GC content was observed from 47,300 to 49,800 bp, corresponding to the AT-rich stretch of tandem microsatellite repeats (Supplementary Fig. S4b). This partially palindromic 2.5 kb sequence was annotated as a En/Spm transposon-like element ( Fig. 3a; Supplementary Table S11). A second increase of GC content (>75%) occurred from 49,900 to 52,150 bp, in the region with tandem minisatellites ( Supplementary Fig. S4b), and an ORF with high similarity to a proline-rich protein ( Fig. 3a; Supplementary  Table S10). The GC content varied between 75% and 50% from 52,200 to 55,000 bp, where minisatellites were detected on the end of repeat-rich region (Supplementary Fig. S4b)   www.nature.com/scientificreports www.nature.com/scientificreports/ as genomic signatures of possible horizontal or lateral gene transfers particularly in bacteria 53,54 . Whether the observed fluctuations in GC content in the P. brassicae-specific mitochondrial region 42,800 to 55,000 bp region could be a remnant of ancient genomic events remain to be elucidated.
The SAR group (Stramenopila, Alveolata, Rhizaria) comprises numerous diverse eukaryotic organisms on which revised phylogenetic relationships have been presented 2 . We attempted to generate new information based on the P. brassicae e3 mt genome, since no phylogeny is yet proposed for the Phytomyxea plant pathogens based on a larger set of mitochondrial genes. Amino acid sequences for 12 protein-coding genes (cob, cox1, cox2, cox3,  atp6, nad1, nad2, nad3, nad4, nad4L, nad5 and nad6) from 67 organisms were used to infer maximum likelihood RAxML single-gene and concatenated trees. Twelve single-gene trees showed a well-supported close relationship between P. brassicae and S. subterranea as well as between B. natans and L. oceanica. When 773 mt proteins were analyzed, the five species in Rhizaria clustered together and P. brassicae and S. subterranea remained the two most closely related plasmodiophorids in the concatenated tree (Fig. 5). Moderate support for Rhizaria and several other major clades most likely reflects limitations in sequence information. However, no evidence for horizontal gene transfer events to P. brassicae neither from any species in Viridiplantae including Arabidopsis thaliana, which can act as a host, nor from other organisms was detected.

conclusion
Our long-read sequencing approach provided new insights into the nuclear and mitochondrial genomes of the clubroot pathogen P. brassicae. For the first time telomeric sequences are presented in the supergroup Rhizaria. Whether P. brassicae and algae share evolutionary history is intriguing. Both the telomeric T 4 AG 3 repeats and some mitochondrial ORFs in P. brassicae showed high sequence similarities with algae and point to that direction.  www.nature.com/scientificreports www.nature.com/scientificreports/ The close relationship between P. brassicae and S. subterranea was further visualized upon mt genome comparisons. A number of peculiarities were found in the P. brassicae mt genome such as: a unique repeat-rich region, a new splicing pattern with the GT-AT and GA-CA acceptor-donor sites, group II introns spanning exon-intron boundaries, and sequence similarities in ORFs to functional genes present in various organisms.

Materials and sequencing.
Resting spores from clubs of Brassica rapa grown in Plasmodiophora brassicae strain e3 infested soils were isolated and used for DNA extraction 55 . High-quality DNA was sent to SciLifeLab, Uppsala, Sweden for PacBio RSII sequencing according to the manufacturer's protocol.
nuclear assembly and annotation. Raw reads were assembled using FALCON v0.4 and HAGP3 from SMRTportal v2.3. The two assemblies were manually merged and polished using Quiver from SMRTportal v2.3. The gene annotation pipeline MAKER v2.3 32 was used in combination with ab-initio gene predictors: Augustus v2.5.5, SNAP and GeneMark-ES v2.3. Augusts and SNAP were trained on the previously annotated and manually curated P. brassicae nuclear genes 16 . The UniProt/Swiss-Prot database and all rhizarian ESTs and proteins found at NCBI, as well as transcripts assembled from strand-specific RNASeq P. brassicae libraries 16 were used as evidence integrated in gene predictions. Annotation of repeats was performed within MAKER 32 , using a P. brassicae specific repeat library constructed de novo using RepeatModeler v1.0.7 as well as MAKER's internal library of transposable elements and the Repbase repeat library rm-20130422. MAKER was run using default parameters except pred_flank that was set to 100 bp, split_hit. Telomeric sequences were identified by using Tandem Repeats Finder 24 . For biological pathway classifications we used the WebMGA and the KOG classification tools. Additional protein analysis was done using OrthoMCL, sequence similarity searches, BLASTP searches against GenBank non-redundant protein database, HMM-searches against Pfam database, and RPS-BLAST searches against NCBI KOG (March 2017).
Mitochondrial genome de novo assembly and annotation. The contig encoding the mitochondrial genome was assembled using Canu v1.5 56 . Visualization of the raw assembly with Bandage generated a circular contig (133,222 bp) in which overlapping sequences were identified by Gepard v1.40 57 . After removing overlaps (18,559 bp) and circularization, the final 114,663 bp long PacBio sequence was polished using Quiver. Mapping of the Illumina data 16 to the PacBio sequence using BWA v0.7.15 and SAMtools v1.5 and polishing with Pilon revealed 2 × 1 bp difference, which were corrected. Next, the coverage was tested by aligning the PacBio reads to the assembled mitochondrial genome using GraphMap v0.5.2 58 .
To optimize de novo annotation several tools and sources were used and combined with manual curation. Automated annotations were done using MFannot v1.33 41 with the genetic code 4 "Mold, protozoan and coelenterate mitochondrial; Mycoplasma/Spiroplasma", Prokka v1.1 42 with mitochondria and archaea kingdoms and MAKER2 43 using Repeatmasker 44 , tRNAscan-SE 45 , Uniprot/Swiss-Prot mitochondrial proteins (Nov. 2016), the ribosomal database 46 and Rfam v12 47 . Further information was provided by transcriptome data 16 , re-assembled using Trimmomatic, Tophat and Stringtie. The S. subterranea annotated mitochondrial genome 48 was used as an additional source. Based on the different lines of annotation and sources, the gene models have been manually created through Web Apollo 49 . Translated CDS features were blasted against the Uniprot/Swiss-Prot reference data set (Aug. 2016) and filtered using a maximal e-value of 10e-6 and run against InterProScan v5.21-60. All retrieved functional information have been integrated into the final annotated data set. Predicted ORFs were used as query sequences for sequence similarity searches of the Uniprot/Swiss-Prot database (March 2019).
Mitochondrial sequence comparison. Dot-plots created by Gepard v1.40 57 were used for comparison of P. brassicae mitochondrial sequences from four strains: e3 (generated in this study, GeneBank accession LS992577), eH 23 (GeneBank accession POCA01000043), ZJ-1 22 (GeneBank accession MCBL01000050) and the Pb3 strain 21 (GeneBank accession RZOB01000060). Additional sequence comparison was done using two mitochondrial contigs of the e3 strain generated by Illumina/454, excluded from 16 but provided in Supplementary  Table S6. Synteny between mitochondrial sequences from four P. brassicae strains and S. subterranea 48 (GeneBank accession KF738139) was tested using the Mauve genome alignment tool v2.4.0 38 with default settings. phylogenetic analysis. Amino acid sequences were retrieved from public databases for 12 mitochondrial protein-coding genes (cob, cox1, cox2, cox3, atp6, nad1, nad2, nad3, nad4, nad4L, nad5 and nad6) conserved across 67 organisms. Sixty-three organisms were selected to represent major eukaryotic groups with available complete mitochondrial genomes and if possible, deep-branching positions. Four α-protebacteria were selected as outgroup species. The sequences were aligned using Clustal Omega v1.2.1 59 with default settings. Multiple alignments were visualized with AliView, examined and automatically trimmed with trimAL v1.4. Maximum likelihood (ML) phylogenetic trees with rapid bootstrap (RB) analyses were generated with RAxML v8.2.11 60 . The best amino acid substitution model was estimated with PROTGAMMAUTO option for each single gene tree and run with 250 to 650 RB, a number of iterations predicted to be sufficient by the autoFC stopping criteria. For concatenated trees, 12 protein alignments were concatenated (the script is available at https://github.com/nylander/ catfasta2phyml) into a super-matrix comprising 773 genes and 3,819 aligned amino acid positions. ML analyses were inferred under GAMMA rate of heterogeneity with the substitution models specified for each partition and run with 250 RB iterations. Trees were displayed and edited with Dendroscope v3.5.9 61 . Information on organisms and proteins and substitution models are listed in Supplementary Table S15. www.nature.com/scientificreports www.nature.com/scientificreports/

Data availability
Data retrieved in this study are deposited in the European Nucleotide Archive under the project PRJEB24736. For the nuclear sequence under accession number OVEO01000001-OVEO01000020 and the mt sequence under accession number LS992577.