Deep-level phylogeny of Cicadomorpha inferred from mitochondrial genomes sequenced by NGS

Recent development and advancement of next-generation sequencing (NGS) technologies have enabled the determination of mitochondrial genome (mitogenome) at extremely efficiency. In this study, complete or partial mitogenomes for 19 cicadomorphan species and six fulgoroid species were reconstructed by using the method of high-throughput sequencing from pooled DNA samples. Annotation analyses showed that the mitogenomes obtained have the typical insect mitogenomic content and structure. Combined with the existing hemipteran mitogenomes, a series of datasets with all 37 mitochondrial genes (up to 14,381 nt total) under different coding schemes were compiled to test previous hypotheses of deep-level phylogeny of Cicadomorpha. Thirty-seven species representing Cicadomorpha constituted the ingroup. A taxon sampling with nine species from Fulgoroidea and six from Heteroptera comprised the outgroup. The phylogenetic reconstructions congruently recovered the monophyly of each superfamily within Cicadomorpha. Furthermore, the hypothesis (Membracoidea + (Cicadoidea + Cercopoidea)) was strongly supported under the heterogeneous CAT model.

Taxon sampling. A total of 25 taxa were selected for mitogenome sequencing, with emphasis on the superfamily relationships within Cicadomorpha and the possible sister group lineage of Cicadomorpha (i.e. Fulgoroidea) 10 . Specimen identification were conducted by checking adult morphological characters, and by blasting in online identification tool of BOLD systems (Barcode Of Life Database: http://www.boldsystems.org -Identification section) and by the standard nucleotide BLAST in NCBI. The primary specimen materials can be accessed by the Entomological Museum of Henan Agricultural University. The detailed classification information and voucher numbers of species sequenced in this study are listed in Table S1.
In addition, we included all the current available mitogenomes of Membracoidea, Cicadoidea, Cercopoidea and Fulgoroidea in the GenBank (up to September 2016). In total, the ingroup included 37 taxa from Cicadomorpha: 17 species representing Membracoidea, eight species representing Cicadoidea, and 12 species representing Cercopoidea.
Outgroup choice included a taxon sampling based on the results of recent molecular studies 5,6,10,11 . Nine species representing the superfamily of Fulgoroidea served as the close outgroups. Six species representing Heteroptera were selected as the relatively distant outgroups. The complete list of taxa included in this study is given in Table S1. DNA extraction. Total genomic DNA was isolated from each 95-100% ethanol preserved specimen individually using the TIANamp Micro DNA Kit (TIANGEN BIOTECH CO., LTD) following the manufacturer's protocol. DNA concentration was measured by Nucleic acid protein analyzer (QUAWELL TECHNOLOGY INC.).
Mitogenome reconstruction. The assembly strategy of complete mitogenome is largely identical to that of Gillett et al. (2014) 33 . The minor differences lied in the universal primers designed to amplify bait sequences as those in Song et al. (2016) 37 . Additionally, the Illumina HiSeq X Ten sequencing platform was utilized in the present study. Similar amounts of genomic DNA for each individual were pooled to improve the sequencing efficiency for every species. The amount of pooled DNA was quantified at 1.5 μg. Moreover, species with phylogenetically distant relations were mixed in a library to avoid the chimera formation. Totally, three libraries were constructed using Illumina TruSeq TM DNA Sample Prep Kit (Illumina, San Diego, CA, USA). Genomic DNA were fragmented with a Covaris sonicator to an average insert size of 350 bp. The subsequent de novo genome sequencing was conducted on a single lane of Illumina HiSeq X Ten by Beijing Novogene Bioinformatics Technology Co., Ltd (China). Approximately 10 Gb paired-end reads of 150 bp length were generated for each library. FastQC 38 was used for quality control of raw sequence data to remove reads containing adapters and ploy-N, or low quality reads from the raw data. All the downstream analyses were based on clean data of high quality (avg. Q20 > 90%, and avg. Q30 > 85%). No less than 8 Gb high-quality reads for each library were used in de novo assembly using IDBA-UD v. 1.1.1 39 , with default settings.
The information of bait sequences blasted against each mitochondrial contig determined are presented in Table S2. Only the assembled contigs to which at least one Sanger sequence could be matched with certainty were retained for further analysis. Mapping to identified mitochondrial contigs were performed using BWA v 0.7.13 40 under default parameters. Mapping statistics were obtained with Qualimap 41 and Tablet 42 , in order to check the quality of the assembled contigs.
The preliminary annotation for the contigs identified by bait sequences were conducted using MITOS 43 , with default settings and the invertebrate genetic code for mitochondria. The resultant gene boundaries were further checked and corrected by alignment against 20 published mitogenome sequences from Cicadomorpha and Fulgoroidea (see details in Table S1). Furthermore, hand alignment checking was also conducted by blasting each predicted gene for every mitogenome against GenBank data in Web BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi), to ensure the identity of the mitochondrial genes identified.
A part from two datasets mentioned above, a series of datasets were compiled by the following methods to test the influences of recoding schemes on the phylogenetic estimate: (1) PCG_AA: translating nucleotides of protein-coding genes into amino acid sequences; (2) PCGDegen: 13 protein-coding genes were re-coded by degenerating all sites including synonymous substitutions to IUPAC ambiguity codes through Degen v1.4 (Degen-code) 48,49 ; (3) PCGDegenRNA: 13 protein-coding genes with Degen-coding combined with 24 RNA genes. Additionally, the datasets of PCG and PCGRNA were masked by using Aliscore version 2.0 50, 51 , combined with Alicut version 2.0 50,51 . Finally, to investigate the potential long-branch effect on the tree topology, a series of reduced datasets (i.e. 43 taxa datasets without the outgroup Fulgoroidea) were created to further phylogenetic analyses.
Sequence saturation in the combined protein-coding genes and RNA genes were assessed using the index of substitution saturation (Iss) 52 as implemented in the DAMBE 5 53 , respectively. Estimates of nonsynonymous (dN) and synonymous (dS) substitution rates of concatenated protein-coding genes were obtained by the method of Yang and Nielsen (2000) 54 using the program yn00 as implemented in PAML 4.9 55  Prior to ML analyses, PartitionFinder 56 was employed to infer the optimal partitioning strategy. Simultaneously, the Baysian Information Criterion (BIC) were used to choose the best models for the combined nucleotide and amino acid datasets under a greedy search with RAxML 57, 58 . The data blocks were defined by gene types (each of 13 PCGs, 22 tRNAs and two rRNAs) and by codon positions (each of three codon positions for PCGs), totally 63 independent blocks were employed for the datasets of 52taxa_PCGRNA, 52taxa_ PCGDegenRNA, 43taxa_PCGRNA, and 43taxa_PCGDegenRNA. The partition schemes and best-fit models selected for each dataset are provided in Table S3. Gene types and codon positions cannot be distinguished when data were masked, thus no partition analyses were applied to these datasets (i.e. Alicut-52taxa_PCG, Alicut-52taxa_PCGRNA, Alicut-43taxa_PCG, and Alicut-43taxa_PCGRNA).
ML analyses were conducted using IQ-TREE 59 as implemented on the multicore version of IQ-TREE 1.5.5. The partition schemes and best-fit models selected by PartitionFinder were applied to corresponding dataset. Branch support was estimated using Ultrafast option for bootstrap analysis, with 1000 replicates. The detailed commands are as following: iqtree-omp -s dataset.nex -st DNA (or AA) -bb 1000 -alrt 1000.
Bayesian analyses were performed using a parallel version of PhyloBayes (pb_mpi1.5a) 60, 61 as implemented on a HP server with twenty-four CPU and 320 G memory. For all Phylobayes analyses, two independent runs were performed, and started from random topology, respectively. Each run implemented two differentially heated chains, with at least 30,000 cycles. The CAT-GTR model was used for nucleotide analyses, while the CAT model for amino acids. Convergence was monitored using bpcomp ("maxdiff " value < 0.1) and tracecomp (minimum effective size > 100). A 25% burn-in was applied after checking for stationarity, and a consensus tree was calculated.
For each phylogenetic analysis, we utilized FigTree v1.4.3 62 to visualize the consensus tree and the corresponding branch lengths. The one-way ANOVA analyses for branch lengths of major groups are performed in Excel 2016. The bootstrap supports (BS) of ≥75 and posterior probabilities (PP) of ≥0.95 were considered to be credible support values for tree nodes. All sequence alignment files and tree files built in this article are available in the Treebase: http://purl.org/phylo/treebase/phylows/study/TB2:S19876.

Hypothesis testing.
To test the statistical significance of alternative hypotheses of Cicadomorpha, we compared all possible relationships among three superfamilies, namely (Membracoidea + (Cicadoidea + Cercopoid ea)), (Cicadoidea + (Membracoidea + Cercopoidea)), and (Cercopoidea + (Membracoidea + Cicadoidea)). The topology tests were conducted using the datasets with the full 52 taxa (i.e. 52taxa_PCG, 52taxa_PCGDegen, 52taxa_PCG_AA, 52taxa_PCGRNA, 52taxa_PCGDegenRNA, Alicut_52taxa_PCG, and Alicut_52taxa_ PCGRNA). The site-log-likelihood values were calculated under the GTR + I + G model for nucleotides and the MtREV + I + G model for amino acids using TREE-PUZZLE 5.3 63 (the command-line option: -wsl). The obtained values were used as input for the software CONSEL 64 . Constraint likelihood trees were constructed on the basis of dataset of 52taxa_PCGRNA with RAxML 57, 58 using the model GTRGAMMA and using partitions selected by PartitionFinder. Three competing hypotheses were statistically tested among each other by AU 65 , KH 66 , SH 67 , WKH and WSH.

Results
Assembly of mitogenomes. Twenty-five hemipteran insect mitogenomes determined were identified from the individual contigs by bait sequences. No chimeric formation was found when we inspected the base coverage along each mitogenome with the software Tablet 42 . Because every site in each mitochondrial contig corresponded to an identical nucleotide. The statistics from the program BWA 40 showed that read coverage for all gene regions in each contig was no less than 30-fold. Mean coverage for every mitogenome varied between 103-fold and 1,922-fold. The average depth of coverage was 475-fold, with 17 contigs ranging from 220-fold to 742-fold coverage and four contigs at >1,000-fold coverage (Fig. 1). Although with the higher sequence coverage, complete mitogenomes (including the full 37 mitochondrial genes and the entire control region) were identified for seven insect species. The remainder were nearly complete (12 mitogenomes including the full 37 mitochondrial genes and the partial control region) or partial mitogenomes (six mitogeonomes including 20-35 mitochondrial genes and no partial control region). The lengths of the mitogenome sequences ranged from 8,455 nt to 16,226 nt, of which 17 mitogenomes had sequences' length more than 15,033 nt, four ones ranged from 12,231 nt to 14,803 nt, and the rest had sequences < 10,000 nt. For the incomplete mitogenomes, the missing segments were mainly located adjacent to the putative control region. The complete or nearly complete mitogenomes have the consistent gene content and organization with other published auchenorrhynchan insect mitogenomes. All new mitogenome sequences have been deposited in GenBank (accession numbers are presented in Table S1).
The results of the substitution saturation tests based on the concatenated protein-coding genes with 52 taxa showed that the values of substitution saturation index (Iss) for the first and/or second codons and all sites of PCG were significantly smaller than the critical values (Iss.cSym or Iss.cAsym). However, the Iss values of the third codon positions were larger than the Iss.cSym and Iss.cAsym (Table 1). This result indicated that the third codon positions might have a negative impact on phylogenetic analysis. Removal of the long-branched outgroups (i.e. Fulgoromorpha) did not significantly reduced the saturation degree. There still showed saturation in the third codon positions and in the RNA genes only (Table 1). From the point of view of the lower Iss values, data masking reduced the degree of saturation.
To explore the correlation between sequence evolution rate and tree topology, we compared the nonsynonymous (dN) and synonymous (dS) evolutionary rates for each sequence pair within five groups base on the concatenated protein-coding genes ( Table 2). The one-way ANOVA analysis revealed incongruence in the dN values between groups (P = 0.0429). This result was mainly due to the lower dN values of Cercopoidea. When we omitted the Cercopoidea to rerun one-way ANOVA analysis, there was no significant difference across all  Phylogenetic analyses. ML analyses based on the full taxa datasets (i.e. 52taxa_PCG, 52taxa_PCGDegen, 52taxa_PCGRNA, 52taxa_PCGDegenRNA, 52taxa_PCG_AA, Alicut_52taxa_PCG, and Alicut_52taxa_ PCGRNA) consistently resulted in a paraphyletic Cicadomorpha, with respect to the nested position of Fulgoroidea (Fig. 2). Monophyly of each superfamily within Cicadomorpha was very strongly supported (BP = 100). The sister group relationship between Cicadoidea and Cercopoidea was favored with strong nodal support (BP ≥ 85) in all analyses. Table 3 provides the nodal supports and branch lengths for major lineages in each tree. Comparing the support values of deep nodes in each tree, we found that data treatment decreased the support for the inter-superfamily relationships. Specially, the dataset PCGDegenRNA showed the lowest support for the affinity of Cicadoidea with Cercopoidea. This suggested synonymous sites may provide information for the deep-level relations within Cicadomorpha. The one-way ANOVA analysis showed no significant incongruence in the branch lengths between major groups (P > 0.05) in the ML analyses. At the family level, the Membracidae were congruently supported as a monophyletic lineage (BP = 100). However, the Cicadellidae was recovered as a paraphyletic assemblage, owing to the close affinity of Olidiana sp. to the clade (Aetalionidae + Membracidae). Within Cercopoidea, both families Aphrophoridae and Cercopidae were supported as the non-monophyletic groups.
Bayesian analyses based on the full taxa datasets under the heterogeneous model provided the distinct tree topology from those in the ML analyses. All Bayesian analyses recovered the monophyly of Cicadomorpha (Fig. 3), except for the analysis based on the 52taxa_PCG_AA. The Cicadoidea formed a sister group to the Cercopoidea, with posterior probability 1 in the analyses of 52taxa_PCGRNA, 52taxa_PCGDegenRNA, Alicut_52taxa_PCG, and Alicut_52taxa_PCGRNA. The Membracoidea was sister to the clade (Cicadoidea + Cercopoidea). The datasets with recoding schemes displayed the weaker power in resolving the deep-level phylogeny of Cicadomorpha. The branching pattern of (Membracoidea + (Cicadoidea + Cercopoidea)) was not retrieved in the Bayesian trees Other ML trees reconstructed in this paper can be available from the TreeBase: http://purl.org/phylo/ treebase/phylows/study/TB2:S19876.
from PCGDegen and PCG_AA. The dataset PCGDegen showed the lowest nodal support for the sister group relation of Cicadoidea + Cercopoidea. The one-way ANOVA analysis revealed significant incongruence in the branch lengths between major groups (P < 0.05) in the Bayesian analyses. When removing the Fulgoroidea and Membracoidea to rerun the ANOVA analysis, there was no difference between Cicadoidea and Cercopoidea. This result demonstrated that the potential long-branch effect was introduced by the Fulgoroidea and Membracoidea.
Taxon deletion experiment. The taxon deletion experiments were conducted to test for the effect of long-branch attraction. For that, a series of ML analyses were rerun based on the 43 taxa datasets. As a result, all ML analyses with 43 taxa datasets yielded strong evidence for a monophyletic Cicadomorpha (BP = 100), with an inter-superfamily relationship of (Membracoidea + (Cicadoidea + Cercopoidea)). The deep nodes defining superfamily relationships received strong bootstrap support values (BP ≥ 78).
Similarly, the majority of Bayesian analyses based on the 43 taxa datasets reproduced the superfamily relationships within Cicadomorpha as ML analyses based on the 43 taxa datasets. With the exception of 43taxa_ PCGDegenRNA, all 43 taxa datasets showed Cicadoidea as sister group to Cercopoidea, and together they were sister to Membracoidea. In the Bayesian tree of 43taxa_PCGDegenRNA, Membracoidea was placed as sister group to Cercopoidea, while Cicadoidea was recovered as an independent lineage within Cicadomorpha.
Hypothesis testing. Various tests consistently supported the arrangement of (Membracoidea + (Cicadoid ea + Cercopoidea)) as the best tree topology, regardless of data coding schemes applied (Table 4). In contrast, all tests but for those on the ALICUT_52taxa_PCGRNA significantly rejected the hypothesis of (Cicadoidea + (Mem bracoidea + Cercopoidea)). Additionally, the comparison of topologies clearly rejected the alternative hypothesis of (Cercopoidea + (Membracoidea + Cicadoidea)).

Discussion
Efficiency of reconstructing complete mitogenomes. The present study demonstrates the utility of next-generation sequencing of mixed DNA samples for reconstructing hemipteran mitogenomes. This method  Table 3. Nodal supports and branch lengths for major lineages in each tree. Note: The branch lengths were calculated from the longest terminal taxon of each lineage to the common ancestor to the Heteroptera. "−" denotes the monophyletic lineage not to be recovered by the dataset. NS: nodal support; BL: branch length.
shows a great improvement in the efficiency of achieving a large number of mitogenome data, compared with the traditional Sanger sequencing via primer walking. We used this method to successfully reconstruct 19 cicadomorphan insect mitogenomes and six fulgoroid insect mitogenomes. The number of mitogenome with full-length sequence is relatively small, with only seven ones. For the remaining nearly complete or partial mitogenomes, the missing segments are mainly located in the control region and/or including the adjacent regions. The low capture specificity of the control region could be attributed to the following reasons: 1) the complex nucleotide motifs found in this region (e.g. A + T-rich elements); 2) non-uniform read coverage along the mitogenome determined, especially with the significant drops occurred in the segments corresponding to the control region. Both factors may lead to the sequencing failure and lower assembly efficiency of this specific region. According to previous studies 33,34,68 , the current sequencing depth should be sufficient to cover the whole mitogenome. To some extent, more taxon sampling can be added to the pools to improve the sequencing efficiency. Therefore, in order to ensure the completeness of mitogenome, how to increase the sequencing depth for the particular regions of genome and how to assemble the complete mitogenome by developing more efficient assembler will be other important issues.
Outgroup selection for phylogenetic analysis of Cicadomorpha. Outgroup    phylogeny should include representatives from Heteroptera as outgroups 10 . In the current study, we applied an outgroup taxon sample including Heteroptera and Fulgoroidea to the tree reconstruction of Cicadomorpha. Unfortunately, long branch lengths shared by outgroup Fulgoroidea and ingroup Membracoidea were found in the current mitogenome data (Table 3). Moreover, evolutionary rate analyses indicated the distinct values of nonsynonymous substitution across five major groups analyzed. These results made us to suspect that incongruence between analyses of two inference methods (i.e. ML analysis under homogeneous model and Bayesian analysis under heterogeneous model) from the full datasets might be the consequence of long-branch effect. Excluding long-branched outgroup Fulgoroidea resulted in a congruent result from both inference methods. Cicadomorpha was recovered as a monophyletic group, with the internal relationship of (Membracoidea + (Cicadoidea + Cerc opoidea)). Therefore, the Heteroptera may be a more suitable outgroup choice to the Cicadomorpha phylogeny estimate on the current mitogenome data, due to the similar evolutionary rate shared by them.
Deep-level phylogeny within Cicadomorpha. For the superfamily relationship of (Membracoidea + (Cicadoidea + Cercopoidea)) within Cicadomorpha, Boulard (1991a,b) 70, 71 listed two character states supporting this hypothesis: (1) structures of the alimentary canal; (2) the larval behavior of applying excreted liquid. Liang & Fletcher (2002) 14 proposed the common antennal features shared by Cicadoidea and Cercopoidea. Rakitov (2002) 15 listed structural characters of brochosomes and proteinaceous particles secreted by glandular regions of the Malpighian tubules as synapomorphic for the grouping (Cicadoidea + Cercopoidea). In addition, fossil researches proved the close affinity of Cicadoidea with Cercopoidea 7, 17 . Recent molecular studies 10,11 , based mainly on the nuclear gene fragments, also recovered a solid branch pattern of (Membracoidea + (Cicadoidea + Cercopoidea)). Results presented in the current study are the first molecular investigation to utilize mitogenome data only for reconstructing deep-level phylogeny of Cicadomorpha. The majority of our analyses recovered the topology of (Membracoidea + (Cicadoidea + Cercopoidea)), corroborating earlier studies with new data. Nevertheless, the causes for another recovery of (Cicadoidea + (Membracoidea + Cercopoidea)) by only one analysis based on 43taxa_PCGDegenRNA under PhyloBayes may be complex. It is possible that a combined effect of Degen-coding and additional RNA genes contributed to the latter tree structure. Taken together, the deep-level phylogeny of Cicadomorpha was explored using mitogenome data, with various coding strategies and different algorithms. Our results validate the power of mitogenome for resolving the relationships at the superfamily level in Cicadomorpha. Although the number of mitogenomes available for Cicadomorpha has been almost doubled by this study, phylogenetic result is still preliminary for this megadiverse insect group. In particular, analysis of family level relationship within Cicadomorpha will require more taxon sampling. The related research project is being carried out by the authors.