Transcriptome sequencing and phylogenetic analysis of four species of luminescent beetles

The evolution of bioluminescence has prompted scientific attention to illuminate phylogenetic relationships of luminescent beetles. However, genomic resources are virtually lacking in rhagophthalmids (Rhagophthalmidae) and their related firefly beetles lampyrids (Lampyridae). Here, we employed the Illumina Hiseq 2000 platform and sequenced the whole-body transcriptomes of the four luminescent beetles: one rhagophthalmid (Rhagophthalmus sp.) and three fireflies (Asymmetricata circumdata, Aquatica ficta, and Pyrocoelia pectoralis). We obtained 55.4, 43.4, 38.6, and 36.7 million clean reads for the four species, respectively. All reads were assembled into contigs from which unigenes were derived. All unigenes were annotated by publicly available databases, and a total of 4325 orthologous genes were identified. Using multiple phylogenetic approaches, our transcriptome data confirmed the distinctiveness of Rhagophthalmidae from Lampyridae, which was also supported by our mitogenome analysis using three newly determined mitogenome sequences and 12 previously published ones. Together, this study is the first report of whole transcriptome sequencing data in Rhagophthalmidae and Lampyridae species, representing a valuable genomic resource for studying the origin and evolution of some remarkable traits in these beetles such as bioluminescence. Moreover, our transcriptome and mitogenome data provide useful phylogenetic information that could be of importance in future studies of phylogenetic inference.

Bioluminescence is among the most spectacular features in living organisms, including numerous species of marine fishes, marine invertebrates, terrestrial invertebrates, fungi, bacteria, and protists 1,2 . The uses of bioluminescence in nature involve a range of vital functions: camouflage, attraction, defense, warning, communication, mimicry, and illumination 2 . The chemical basis of the natural light-producing molecules has been elucidated in the last century 3 , permitting discoveries of countless valuable applications in biology and medicine using luciferase-based systems 2 . In addition, industrial designers have been ambitious in utilizing the natural light-producing systems in bioluminescent organisms for decoration and street lighting 4 .
Among these bioluminescent organisms, luminescent insects were primarily found in members of the three orders: Collembola (springtails), Diptera (flies), and Coleoptera (beetles) 5 . Within the superfamily Elateroidea of the order Coleoptera, several groups of beetles are able to produce and emit light such as fireflies (Lampyridae), railroad worms (Phengodidae), click beetles (Elateridae), and Rhagophthalmidae 5,6 . The origin and evolution of bioluminescence in these beetles has prompted a number of studies to illuminate the phylogenetic relationships of these bioluminescent beetles [7][8][9][10][11][12][13][14][15][16] . Although both morphological 10-12, 14, 16 and molecular 7-9, 13, 15, 17 features were involved in these studies, molecular evidence has been expected to resolve taxonomic status and phylogenetic relationships that have been contentious based on morphological evidence 18 , because the evolutionary history of morphological features is usually complex 18,19 . Unfortunately, the incongruence of molecular evidence has also been observed in phylogenetic analyses concerning the phylogenetic relationships of these beetles 7-9, 13, 15 , possibly because a single or a few genes lack sufficient phylogenetic signals 20 . For instance, Rhagophthalmidae had previously been assigned to the family Phengodidae inferred from morphological data 11 ; other studies based on morphological, embryological and molecular evidence had assigned Rhagophthalmidae to be a subfamily or genus in the family Lampyridae 8,14,15,21 ; but more recent evidence from both morphological and molecular data had supported the familial status of Rhagophthalmidae 7,9,10,13,16,[22][23][24] , which is distinct from both Lampyridae and Phengodidae. In addition, the taxonomic studies focusing on Lampyridae and Rhagophthalmidae have been scarce, especially for species distributed in China 15,25,26 . The scarcity calls for additional data of taxonomic importance. Phylogenomics, i.e. phylogenetic analysis involving genome-scale data, has been believed to outcompete single-gene phylogenetics, which frequently yielded conflicting results caused by stochastic errors from small-scale data sets 20 . However, genomic resources are extremely limited in Rhagophthalmidae and its related beetles, which may prevent their phylogenetic relationships from in-depth investigations.
To provide novel genomic resources from Rhagophthalmidae and its related firefly beetles (Lampyridae), we employed the Illumina Hiseq 2000 platform and sequenced the whole-body transcriptomes of the four beetles: one rhagophthalmid beetle (Rhagophthalmus sp.) and three representatives of lampyrid beetles (Asymmetricata circumdata, Aquatica ficta, and Pyrocoelia pectoralis). To validate the phylogenetic inference based on transcriptome data, we employed the traditional Sanger sequencing to obtain two complete mitogenomes (Aquatica ficta and A. wuhana) and one nearly complete mitogenome (Lamprigera yunnana); we also identified the 13 protein coding genes (PCGs) in mitogenomes from each of the four whole-body transcriptomes (Supplementary Table S1); and we additionally retrieved 12 published complete mitogenomes from members of Rhagophthalmidae, Lampyridae, Phengodidae, Elateridae, Lycidae, Cantharidae, and Tenebrionidae (see the species names and GenBank accessions in Materials and Methods). Our transcriptome analysis recognized the distinctiveness of Rhagophthalmidae from Lampyridae, which was also supported by our mitogenome analysis.

Results
Transcriptome sequencing and assembly. Total RNA was isolated from each of the frozen whole-bodies of beetles. mRNA was purified from the total RNA using oligo-dT attached magnetic beads. Following mRNA fragmentation, cDNA synthesis, end repair, 3' adenylation, adaptor ligation, and PCR enrichment, four RNA sequencing (RNA-Seq) libraries were constructed using the Illumina TruSeq RNA sample preparation kit. The raw sequence data were filtered by removing adaptors, low-quality reads, and ambiguous reads. We ultimately obtained 55.4, 43.4, 38.6, and 36.7 million clean reads for Rhagophthalmus sp., A. circumdata, A. ficta, and P. pectoralis, respectively (Table 1). These clean reads were separately assembled into 44883, 57254, 71424, and 80017 contigs, in which 15418, 24275, 31520, and 31356 unigenes (i.e. unique putative genes) were derived, respectively (Table 1). It appears that the transcriptome data in greater size tend to have less assembled contigs or unigenes. For example, the largest data (Rhagophthalmus sp.) contains the least contigs, while the smallest data (P.pectoralis) consists of the most contigs ( Table 1). Details of clean reads, assembled contigs and unigenes of the four transcriptomes generated in this study were given in Table 1. In addition, we compared the 13 PCGs in A. ficta mitogenome identified from the Illumina-based transcriptome assembly with the same genes obtained from the traditional Sanger sequencing for another individual of the same species, and found that a total of 99.3% sequences from both sequencing approaches are identical. This finding suggests that our transcriptome assemblies are reliable and our sequencing coverages are acceptable.  Table 2, Supplementary Dataset S1). We used all unigenes as query and ran the TBLASTX program to search against the Swiss-prot, NR, COG, and GO databases with an E-value of 1.0E-5. For example, we identified 10322 best blast hits in the NR database when annotating the transcriptome of Rhagophthalmus sp., comprising 66.95% of all unigenes ( Table 2); while only 4845 best hits were detected in the GO database for the same species, covering 31.42% of all unigenes (Table 2). Indeed, the NR database generally annotated the highest percentages of unigenes, whereas the GO database typically annotated the lowest (Table 2). Overall, Rhagophthalmus sp. has a higher percentage of unigenes being annotated compared to the other three beetles (Table 2), possibly because a greater size of sequencing data can give a better assembly for the Rhagophthalmus beetle (Table 1). In addition, the distributions of both GO terms and COG classes were similar among the four transcriptomes, except A. circumdata showing a unique expansion in a single GO term (Fig. 1).

Ortholog identification. Genes from the reference genome of the red flour beetle Tribolium castaneum
were respectively compared with all unigenes from each of the four transcriptome assemblies using the TBLASTX searches 27 . By examining reciprocal (or bi-directional) best blast hits, putative orthologs were identified. If the BLAST score ratio of the second best-hit to the first best-hit is greater than 0.8, we excluded the putative ortholog for further analysis, because the candidate ortholog is possibly a paralog due to the high level of sequence similarity. The resulting putative orthologs range from 6992 to 7612 (Table 2). Each ortholog from the red flour beetle and the four beetles with transcriptome data was aligned by PRANK version 100802 28 , and poorly aligned positions and divergent regions in alignments were filtered by GBLOCKS version 0.91b 29 . After discarding the alignments with an aligned region shorter than 100 nt (nucleotides), we obtained 4325 putative orthologs with high-quality alignments for subsequent analyses.
Mitogenome sequencing. Using the traditional Sanger sequencing approach, we sequenced two complete mitogenomes (A. ficta and A. wuhana) and one nearly complete mitogenome (L. yunnana) ( Table 3). We were unable to sequence the region between ND2 and 12S rRNA in L. yunnana despite multiple attempts (Table 3), possibly because this region contains the A+T-rich region which may pose technical issues in sequencing 30 . All the 13 protein coding genes (PCGs) from each of the three mitogenomes were complete except the ND2 in  Table 3. Annotations of the three newly sequenced mitochondrial genomes. Incomplete sequenced region was indicated with an asterisk. Abbreviation: n.a., not available (due to incomplete sequencing).

Gene Direction
L. yunnana lacking a segment near its 5' end ( Table 3). All the 13 PCGs, 2 rRNAs, 22 transfer RNAs (tRNAs) and 1 A-T-rich region common to the vast majority of animal mitogenomes 31 were identified in the three newly sequenced mitogenomes, except for the three tRNAs (tRNA Trp , tRNA Cys , tRNA Tyr ) and the A-T-rich region in L. yunnana (Table 3). Arrangements and orientations of all genes in the three mitogenomes are identical to other beetles 25,[32][33][34] . Similar to other firefly beetles, all the PCGs employed traditional mitochondrial start codons ATN and terminated with TAA, TAG or single T (Table 3).

Phylogenetic analysis based on nuclear gene sequences. Phylogenetic analysis was first undertaken
with transcriptome-derived nuclear gene sequences. Given that Lampyridae was divided into two monophyletic clades, with one clade consisting of Lampyrinae and the other clade comprising Cyphonocerinae, Ototetrinae, Luciolinae 13 , we selected one species P. pectoralis (Lampyrinae) from the first clade and two species A. ficta (Luciolinae) and A. circumdata (Luciolinae) from the second clade to perform whole-body transcriptome sequencing, in addition to the species Rhagophthalmus sp. The resulting transcriptome data were assembled and annotated, and a total of 4325 putative orthologs were identified after careful filtering of sequencing artifacts. A data set is considered to be mutationally saturated when multiple substitutions are dominating. Substitution saturation decreases phylogenetic information, and a highly saturated data set will produce an incorrect phylogeny 20 . To assess the impact of substitution saturation, we randomly selected 100 nuclear orthologs and performed substitution saturation tests. Results showed that the third codon sites of most nuclear orthologs have undergone substantial substitution saturation (Supplementary Table S2), we thus only used the first and second codon sites of the 4325 nuclear orthologs (Supplementary Dataset S2) for subsequent phylogenetic analysis with the concatenation method 35 . Based on the 1854774-nt concatenated alignment of nuclear orthologs from the four species with transcriptome data and the red flour beetle with reference genome data, we used both Maximum Likelihood (ML) and Bayesian Inference (BI) approaches to reconstruct phylogenetic trees. Both maximum likelihood bootstrap values and Bayesian inference posterior probability values (shown as percentages) were 100% at all nodes of the phylogenetic tree (Fig. 2). Specifically, both species from Luciolinae were first clustered into a sister clade, which next grouped with the single species from Lampyrinae (Fig. 2). All the three firefly beetles (Lampyridae) formed a monophyletic group, and the single species from Rhagophthalmidae appeared to locate outside of Lampyridae (Fig. 2).
To reduce the influence of gene tree heterogeneity, we also undertook phylogenetic analysis with the coalescent model using the MP-EST method 36 . Based on the alignment of each orthologous gene with all three codon positions, the ML gene trees with 100 bootstrap replicates were respectively reconstructed by PhyML version 3.0 37 using the best-fit models generated by jModelTest version 2.1.4 38 . Because poorly supported individual gene trees could reduce phylogenetic signals in inferring species tree 39, 40 , only 2,555 genes with average bootstrap support values greater than 70% were used for MP-EST tree reconstruction. Results showed that the tree topology and bootstrap values of the MP-EST tree using the coalescent method were identical to those of the ML and BI trees reconstructed using the concatenation method (Fig. 2).
For comparison, we additionally reconstructed a phylogeny using a bioluminescence related gene. Because bioluminescence in beetles is produced by catalyzing the oxidation of luciferin by luciferase 41 , the luciferase gene is preferable for this analysis. We used the luciferase gene of Lampyris noctiluca (Genbank accession: X89479) as the query to search against the assembled transcripts of each transcriptome using TBLASTN. The best hit from each transcriptome was selected as a candidate, and candidate luciferase genes of the four beetles were searched against NR (NCBI non-redundant proteins) to ensure accuracy in gene identification. Taking the luciferase gene of Tribolium castaneum as the outgroup, we generated an alignment using the five luciferase genes, and phylogenetic trees were built using similar approaches as described elsewhere 19,[42][43][44][45][46][47] . The resulting tree topologies ( Supplementary Fig. S1) were identical to those inferred from 4325 genes (Fig. 2).
The 13 mitochondrial PCGs from each of the 18 beetle species were aligned and concatenated. Substitution saturation analysis revealed that the third codon sites of all examined mitochondrial PCGs have experienced substantial substitution saturation, while the first and second codon sites have not (Supplementary Table S3), we thus just used the first and second codon sites of the 13 mitochondrial genes for subsequent phylogenetic analysis with the concatenation method (Supplementary Dataset S3). The concatenated alignment of the 13 mitochondrial PCGs is 7382 nt in length. Both ML and BI phylogenetic analyses recovered identical phylogenetic trees (Fig. 3). Our trees showed that Rhagophthalmus sp. clustered with two other rhagophthalmids with published mitogenomes, and the three rhagophthalmids formed a monophyletic group with a strong support of 100% in both ML and BI analyses (Fig. 3), confirming that our sampled Rhagophthalmus beetle indeed belongs to Rhagophthalmidae phylogenetically. As depicted from the trees, Rhagophthalmidae was first allied with Phengodidae as sister groups, and Rhagophthalmidae and Phengodidae formed a monophyletic group with Lampyridae (Fig. 3), confirming that Rhagophthalmidae is not a subgroup within Lampyridae. The long branches in Phengodidae species would not affect our phylogenetic analysis, because we recovered similar trees after removing the two phengodids ( Supplementary Fig. S2). We did not perform the coalescent method in the phylogenetic analysis based on mitogenomes, because mitochondrial genes exhibit limited incongruence and the impact of mitochondrial gene tree heterogeneity has been believed to be minimal 48 .

Discussion
In this study, we present the first whole transcriptome shotgun sequencing data in Rhagophthalmidae and Lampyridae species using massive parallel mRNA sequencing (RNA-seq), providing valuable genome resources for studying the evolution of intriguing traits in these beetles such as bioluminescence. We also newly determined two complete and one nearly complete mitogenome sequences in Lampyridae species. Using various phylogenetic approaches, our transcriptome and mitogenome data unambiguously demonstrate Rhagophthalmidae being distinct from Lampyridae and reject previous hypotheses supporting Rhagophthalmidae as a subgroup within Lampyridae. Our molecular phylogenetic study supports the familial status of Rhagophthalmidae.
The phylogenetic position of Rhagophthalmidae was controversial due to conflicting results inferred from molecular and morphological data. For example, Rhagophthalmidae had been assigned to the family Phengodidae inferred from morphological data 11 ; Rhagophthalmidae had been considered to be a subfamily or genus in the family Lampyridae based on morphological, embryological and molecular evidence 8,12,14,15 . However, the overwhelming molecular evidence has consistently recognized the distinctiveness of Rhagophthalmidae, which appeared to be different from both Lampyridae and Phengodidae 7, 9, 13, 17, 22-24, 26, 49, 50 . As a result, the familial status of Rhagophthalmidae has been well established. In this study, we used 4325 nuclear genes derived from transcriptome data and all 13 mitochondrial protein coding genes to conduct phylogenetic inference. Our analyses based on various analytical approaches and several datasets consistently recognized the distinctiveness of Rhagophthalmidae (Figs 2 and 3). Although our study did not involve multiple individuals or populations from each species, intraspecific genetic variation should not impact the phylogenetic resolution at the family or subfamily level. Although our samples are limited due to the lack of genetic material, we included taxa that may represent basal lineages. More importantly, our phylogenetic resolution of Rhagophthalmidae remained consistent after using different analytical approaches and multiple data sets (Figs 2 and 3). In support of our findings, morphological evidence has identified the distinctiveness of Rhagophthalmidae from Lampyridae and Phengodidae 10, 16 , and molecular evidence has revealed that Rhagophthalmidae is a sister group to Phengodidae 9, 13 . In contrast to our findings, two molecular studies argued that Rhagophthalmidae should be placed within Lampyridae 8,15 . However, this argument was inferred from a short 16S rRNA gene segment with a length of 506-nt 8,15 and cannot be supported by multi-locus data sets 9,13 . Although broad sampling of taxa is important in molecular phylogenetic studies 8,15 , sufficient number of gene loci is also required to build a reliable phylogenetic tree 39,51 . According to our mitogenome analysis, the three luminescent beetle families (Rhagophthalmidae, Lampyridae, and Phengodidae) formed a monophyletic group (Fig. 3), suggesting a single origin of bioluminescence in the common ancestor of these beetles. This finding is consistent with a recent phylogenetic analysis based on mitogenomes 9 but inconsistent with another study based on a combination of mitochondrial and nuclear genes 13 . The contrasting results between mitochondrial and nuclear genes call for an in-depth study by adding more taxa and more genome-scale data, which will help reconstruct a reliable phylogeny and make a conclusive inference on the evolution of bioluminescence in beetles.
Phylogenomics refers to phylogenetic analysis involving genome-scale data. Phylogenomics has been believed to outcompete single-gene phylogenetics, which frequently yielded conflicting results caused by stochastic errors from small-scale data sets 20,51 . However, systematic errors are still present after adding more data 52 . There are three major challenges that could generate strong incongruence in phylogenomic analysis: substitution saturation, nucleotide compositional bias, and different tree reconstruction methods 20 . We took the following steps to overcome these challenges. First, we undertook substitution saturation tests and removed the third codon positions of all genes that may have undergone substantial substitution saturation. Both nuclear and mitochondrial gene data sets with the first and second codon positions consistently supported Rhagophthalmidae to be an independent group that is distinct from Lampyridae (Figs 2 and 3). Second, we used the deduced protein sequences to reduce nucleotide compositional bias. Our result showed that the BI tree topologies inferred from protein sequences (Supplementary Fig. S3) were identical to those inferred from nucleotide sequences (Figs 2 and  3), suggesting that nucleotide compositional bias is not a major factor affecting our phylogenetic analysis. Third, we used both ML and BI methods with nuclear and mitochondrial gene data sets, and identified no incongruence between the two tree reconstruction methods (Figs 2 and 3). In addition, given the prevalence of gene tree heterogeneity [53][54][55] , we also conducted phylogenetic analysis with the coalescent method, which has been proved to generate accurate and congruent phylogenies in the presence of heterogeneous gene trees 56,57 . The coalescent method implemented in the MP-EST program 36 was applied to our transcriptome-derived nuclear gene data set of 2555 putative orthologous genes, and yielded a phylogenetic tree identical to the concatenation-based ML and BI trees (Fig. 2). Indeed, after examining the proportions of 14 possible topologies inferred from the 2555 orthologous genes, we found that most genes (75.7%) supported the topology shown in Fig. 2 and the average proportion of other 13 topologies is 1.9% (Fig. 4 and Supplementary Table S4), suggesting a low level of gene tree heterogeneity among beetles studied in this work. This study proved usefulness of our genome-scale data in phylogenetic analysis, and will help to illuminate the origin and evolution of bioluminescence in Lampyridae and other luminescent beetles.

Materials and Methods
Ethics statement. All beetle species used in this study were sampled in the field. No specific permits were required, and no endangered or protected species were involved. All experiments on the beetles were conformed to the rules and guidelines on animal experimentation in China.
Taxon sampling and DNA extraction. Live larvae or adults of beetles studied here were sampled at five locations in China (Supplementary Table S5 De novo assembly and unigene annotation. Sequence quality of each RNA-seq sample was assessed by FastQC version 10.1 (www.bioinformatics.bbsrc.ac.uk/projects/fastqc). Trimmomatic version 0.32 59 was used to trim out low quality sequences, ambiguous sequences and artificial sequences such as residual adaptors and Illumina specific sequences as described elsewhere 60 . Four de novo transcriptome assemblies were constructed by the Trinity program 61 with default settings. Because de novo transcriptome assemblies involve many contigs in a contig cluster, we required a minimum expression filter of one fragment per kilobase of exon per million fragments mapped (FPKM) and filtered the contigs with FPKM value smaller than one; the contig with the highest expression level in a contig cluster was selected to be a unigene for downstream analyses 60 . The resulting unigenes were used for BLASTX searches 27 and unigene annotations based on the NR, Swiss-prot and COG databases, with an E-value cutoff of 1e-5. The GO annotation was conducted by Blast2GO software 62 with default parameters using the NR blast results in XML format.

PCR amplifications and mitogenomes annotation.
To amplify the three mitogenomes (A.ficta, A.wuhana and L.yunnana), dozens of primer pairs were designed according to published firefly mitogenome sequences [32][33][34] . Details of PCR amplification and sequencing procedure were described in our previous study 33 . We defined the 13 PCGs of the three species by multiple sequence alignments with related species using MEGA version 5.20 63 . The tRNAs were identified by tRNAscan-SE 64  Ortholog identification and phylogenomic analysis. The cDNA sequences of the red flour beetle (Tribolium castaneum 3.0 Assembly) were downloaded from BeetleBase 65 (http://www.Beetlebase.org), and compared with the transcriptome-derived unigenes by reciprocal (or bi-directional) TBLASTX searches 27 . We identified putative orthologs by examining reciprocal best blast hits. We discarded putative orthologs with a BLAST score ratio of the second best-hit to the first best-hit greater than 0.8, which can exclude potential paralogs in our phylogenetic analysis. All putative orthologs were aligned by PRANK version 100802 28 , and poorly aligned positions and divergent regions were removed by GBLOCKS version 0.91b 29 . In addition, the alignments with an aligned region shorter than 100-nt were discarded.
All the 13 mitochondrial PCGs and 100 randomly selected nuclear orthologs were used to examine the substitution saturation by DAMBE 66 . According to the Akaike information criterion (AIC) and Bayesian information criterion (BIC) 67 , we ran the jModelTest version 2.1.4 program 38 separately to select the best-fit models of nucleotide substitution for concatenated nuclear and mitochondrial gene alignments after removal of the saturated third codon positions. For the concatenated nuclear genes, RAxML version 7.2.6 68 was used to reconstruct the ML tree under the GTR+GAMMA model with 100 bootstrap replicates, and MrBayes version 3.2.6 69 was applied to reconstruct the BI tree using the recommended GTR+I+G model with 0.5 million generations 45,47 . To reduce the impact of gene tree heterogeneity, we also undertook phylogenetic analysis with the coalescent model using the MP-EST method 36 . For the concatenated 13 mitochondrial PCGs, the ML tree was reconstructed by RAxML version 7.2.6 with 1000 bootstrap replicates under recommended GTR+GAMMA model, and the concatenated sequence was partitioned by different genes to estimate and optimize individual α-shape parameters, GTR-rates, and base frequencies for each gene. MrBayes version 3.2.6 was used to reconstruct the BI tree with one million generations and the concatenated sequence was partitioned according to different models: HKY+G model for ND4L, GTR+G model for ND1 and ATP6, and GTR+I+G model for the other 10 mitochondrial genes.
In addition, we used deduced protein sequences in our phylogenomic analysis, aiming to reduce the impact of nucleotide compositional bias. Briefly, protein sequences of the nuclear and mitochondrial genes were deduced and aligned by MEGA version 5.20 63 . ProtTest version 3.41 70 was applied to select the best-fit model of protein sequence evolution following the Akaike information criterion (AIC). The LG+I+G model was determined to provide the best fit to the concatenated protein sequences of nuclear genes, while the MtRev+I+G model was selected for the concatenated protein sequences of mitochondrial genes. Phylogenetic trees were reconstructed by MrBayes version 3.2.6 69 with as many generations as required.