Genomic adaptations to aquatic and aerial life in mayflies and the origin of wings in insects

The first winged insects underwent profound morphological and functional transformations leading to the most successful animal radiations in the history of earth. Despite this, we still have a very incomplete picture of the changes in their genomes that underlay this radiation. Mayflies (Ephemeroptera) are one of the extant sister groups of all other winged insects and therefore are at a key phylogenetic position to understand this radiation. Here, we describe the genome of the cosmopolitan mayfly Cloeon dipterum and study its expression along development and in specific organs. We discover an expansion of odorant-binding proteins, some expressed specifically in the breathing gills of aquatic nymphs, suggesting a novel sensory role for gills. In contrast, as flying adults, mayflies make use of an enlarged set of opsins and utilise these visual genes in a sexually dimorphic manner, with some opsins expressed only in males. Finally, to illuminate the origin of wings, we identify a core set of deeply conserved wing-specific genes at the root of the pterygote insects. Globally, this is the first comprehensive study of the structure and expression of the genome of a paleopteran insect and shows how its genome has kept a record of its functional adaptations.


Introduction
The first insects colonized the land more than 400 Myr 1 . But it was only after insects evolved wings that this lineage (the Pterygotes) became the most prominent animal group in terms of number and diversity of species, and completely revolutionized Earth ecosystems. The development of wings also marked the appearance of hemimetabolous development. Concomitant with the development of wings, pterygotes also developed an hemimetabolous life cycle 2 , with two clearly differentiated living phases (non-flying juveniles and flying adults). This allowed them to specialize functionally and exploit two entirely different ecological niches. This is still the life cycle of Ephemeroptera (mayflies) and Odonata (damselflies and dragonflies), the extant Paleoptera ("old wing") orders, the sister group of all other winged insects (Figure 1a), and which undergo a radical ecological switch where aquatic nymphs metamorphose into terrestrial flying adults. The appearance of wings and the capacity to fly greatly increased the capability of insects for dispersal, escape and courtship and allowed them access to previously unobtainable nutrient sources, while establishing new ecological interactions. This 'new aerial dimension in which to experience life' 3 created new evolutionary forces and constraints that since, have been continuously reshaping the physiology, metabolism, morphology and sensory capabilities of different pterygote lineages-evolutionary changes that should ultimately be traced back to genomic features associated to these adaptations. However, our knowledge of these genomic features is still very incomplete, mostly because paleopteran lineages (Figure 1a), have not been extensively studied from a genomic and transcriptomic perspective.
Mayflies are an ideal group to fill this gap of knowledge. By living in both terrestrial and aquatic environments, mayflies had to develop different sensory, morphological and physiological adaptations to each of these ecological niches.
For example, mayflies have abdominal gills during the aquatic stages, a feature that places them in a privileged position to assess the different hypotheses accounting for the origin of wings [4][5][6][7][8][9] . Moreover, some mayfly families exhibit a striking sexual dimorphism in their visual systems, which in the case of the Baetidae family includes the presence of a second set of very large compound eyes in males ( Figure 1b). All these features make mayflies a fundamental order to investigate the origin of evolutionary novelties associated with the conquest of new habitats. The recent establishment of a continuous culture system of the mayfly C. dipterum, a cosmopolitan Baetidae species with a life cycle of just six weeks, makes it now possible the access to all embryonic and postembryonic stages 10 , overcoming past challenges to study paleopterans which are generally not very amenable to rear in the lab.
Here we sequenced the genome and profiled a comprehensive set of stage-and organ-specific transcriptomes of Cloeon dipterum. Our analyses identify potential genomic signatures associated to the mayfly adaptations to the aquatic lifestyle, to innovations in its visual system and novel genetic players in the evolution of wings. The results from this work establish C.
dipterum as a new platform to investigate insect genomics, development and evolution from a phylogenetic vantage point.

C. dipterum genome assembly
We sequenced and assembled the genome of an inbred line of the mayfly species C. dipterum using both Illumina and Nanopore technologies (see adult organs. These datasets included four embryonic stages (4 days post fertilization (dpf) stage: "germ disc"; 6 dpf stage: "segmentation"; 10 dpf stage: "revolution" and 14 dpf stage: "pre-nymph"), heads of three different nymphal stages, nymphal gut, nymphal malpighian tubules, gills, wing pads, adult muscle, ovaries, testes, female adult brain and male and female adult heads (Figure 1c, Supplementary Table 6).

Temporal gene expression profiles reflect life cycle adaptations
C. dipterum spends its life cycle in three very different environments: within the abdomen of the mother during embryonic stages (as C. dipterum is one of the few ovoviviparous mayfly species 18 ); freshwater streams and ponds as nymphs; and land/air as adults 19 . To explore the expression of its genome during these three major phases, we performed soft clustering analysis of stage-specific transcriptomes using M-Fuzz 20 Table 7). Clusters of genes transcribed preferentially during embryonic stages, such as cluster 21 and 8, showed enrichment in Gene Ontology (GO) categories that reflected the processes of embryogenesis and organogenesis happening during these stages (i.e. regulation of transcription, ectoderm development, dendrite morphogenesis, etc.). On the other hand, clusters with genes highly expressed during the aquatic phase (e.g., clusters 18 and 9) presented GO enrichment in categories that evidence the continued moulting process that mayfly nymphs undergo, such as chitin-based cuticle development. In addition, GO terms related to sensory perception of chemical stimulus or odorant binding were also enriched in these clusters (  Table 8). Finally, cluster 30, which contained genes with the highest expression during adulthood showed a striking enrichment of GO categories associated with visual perception. Therefore, the embryonic and postembryonic (aquatic and aerial) phases are characterized by very distinct gene expression profiles. Specifically, comparison between aquatic nymphs and aerial adults highlight expression differences that indicate very distinct sensory modalities in these two free-living phases. The aquatic phase is characterised by genes involved in the perception of chemical cues, whereas vision becomes the main sensory system during the terrestrial/aerial adult phase (Figure 1d).

Expansion and aquatic role of Odorant Binding Proteins
Since temporal gene expression profiling indicated a prominent role of genes involved in perception of chemical cues during C. dipterum aquatic phase, we investigated in its genome the five main chemosensory (CS) gene families in arthropods. These include the odorant receptor/odorant receptor cofactor (OR/ORCO) and the odorant binding protein family (OBP), both of which have been suggested to play essential roles during the evolution of terrestrialization in hexapods and insects [21][22][23][24][25][26][27] .
We  Table 9). Importantly, we discovered 171 different OBP genes, which represents the largest repertoire of this family described until now, and a 57% increase with respect to the largest OBP gene complement previously described, that of the cockroach Blattella germanica (109 OBP genes) 22 Our previous GO analyses pointed to an important role of these CS genes during nymphal stages. To investigate this further, we asked in which of those M-fuzz clusters the different CS gene families were incorporated. We found that more than half of CS genes, and among these, nearly 80% of OBPs (147 out of 276 CS genes and 121 out of 152 OBPs included in the soft clustering analysis) were assigned to just four clusters (18, 12, 9 and 10), all of them related to nymphal or pre-nymphal stages. Cluster 10 contained genes that had a peak of expression at 14 dpf stage, just prior to the hatching of the first swimming nymph 10 , and clusters 9, 12 and 18 are nymph-specific.
Next, we examined the expression of CS genes in specific tissues and organs and found that most of the CS genes were expressed in a highly tissuespecific manner, as indicated by high (> 0.8) tau values (an index used as a proxy for tissue specific expression 29 ). As expected, many CS genes were expressed in the head, where the antennae, the main olfactory organs in insects, are located. There was however, an additional major chemosensory tissue, the gills, where 34% of the CS genes were expressed, ( To better characterise these clusters, we co-stained the gills with HRP, a marker of insect neurons 30 . Remarkably, OBP-expressing cell clusters were HRP-positive, suggesting a neurosensory nature. Globally, these results strongly suggest that, beyond their respiratory role 31 , the gills are a major chemosensory organ of the aquatic mayfly nymph (Figure 2f-g).

Expansion of light sensing opsins in C. dipterum
Visual cues are essential during adulthood in mayflies ( Figure 1d). During their short adult phase, they must be able to find mating partners while flying in big swarms, copulate and finally find a suitable place to deposit the eggs 32 .
Similar to what has been observed in other diurnal insect lineages, such as Odonata 33 Figure 5). In addition, we also found that the blue-sensitive opsin underwent independent duplications in the Ephemeridae and Baetidae. E. danica has three different blue-Ops, while Baetis species with available transcriptomic data (B. sp. AD2013, B. sp. EP001, 1 ), and C. dipterum have two blue-Ops, which in the latter case are located together in tandem (Figure 3a, c).
Surprisingly, and in contrast to other insect lineages, where the UV sensing opsin (UV-Ops) has been usually kept as a single copy, we also found a large expansion of UV-Ops in the genome of C. dipterum. Duplicated UV-Ops genes were also present in different Baetis species (Figure 3a 34 . Indeed, the most highly upregulated gene in male heads compared to female heads was one of the UV-sensitive opsins, UV-Ops4 followed by blue-Ops2, whose protein sequence is highly modified (Figure 3b The terrestrial/aerial adult phase of C. dipterum is not only characterised by its visual system, but even more prominently by the key feature ancestral to all pterygote insects: the wings. As paleopterans, mayflies can provide important insights into the origin of the genetic programs responsible for the evolution of this crucial morphological novelty. To investigate this, we first generated modules of co-regulated gene expression across several tissues (Supplementary Table 11), including developing wings, for C. dipterum and D. melanogaster (see Methods, Supplementary Data 1). To address transcriptomic conservation, we tested which modules showed significant orthologue overlap in pairwise comparisons between the two species. This analysis revealed deeply conserved co-regulated gene modules associated with muscle, gut, brain and Malpighian tubules (insect osmoregulatory organ), among others, indicating that the shared morphological features of the pterygote body plan are mirrored by deep transcriptomic conservation ( Figure 4a). Remarkably, when we extend this comparison to include the centipede Strigamia maritima, the majority of these co-regulated modules, especially those corresponding to neural functions, muscle and gut, have been conserved as well, indicating that these genetic programs date back, at least, to the origin of Mandibulata (Supplementary Figure 6). Importantly, one of the highly correlated pairs of modules between mayfly and fruitfly corresponded to the wings (wing pad and wing imaginal disc modules, respectively). A total of 130 genes were shared between these modules, defining a core set of genes associated with this organ since the last common ancestor of pterygotes. This gene set exhibited an enrichment in GO terms that corresponded to wing development (Figure 4c), and in agreement with this, some of these orthologues have been shown to participate in wing development in Drosophila (e.g. abnormal wing discs (awd), inturned (in) or wing blister (wb), etc.). In addition, among them there were also numerous genes (96 genes out of 130, Supplementary Table 12) for which no previous wing function has been described so far. However, our comparative approach strongly indicated a putative function in wing development. We functionally tested this hypothesis for eight of these genes in Drosophila (Supplementary   Table 13), by RNAi knockdown assays using the wing specific nubbin-GAL4 driver 35 . Indeed, these experiments resulted in wing phenotypes in all cases (8/8) Figure 6). In general, our results were in agreement with previously described phylostratigraphic patterns, in which older genes exhibit broad expression patterns while novel genes are expressed in a tissue-specific manner 36,37 . We did not observe any enrichment of new genes expressed in the wings in the Pterygota stratum that could be linked to the origin of this morphological novelty, despite the high number of novel genes that were gained in this lineage (Figure 1a, Supplementary Figure 1). These results suggested that the transcriptomic program responsible for the wings was assembled from genes that were already forming part of pre-existing gene networks in other tissues 38 . Therefore, we investigated the similarities between the transcriptomic programs of wings and other mayfly tissues (Figure 4e-f). Several hypotheses to explain the origin of wings have been put forward since the last century 4,5,7-9,45 , including the pleural origin hypothesis in which the wings would be the thoracic serial homologues of the abdominal gills. Our results uncovered expression similarities between gills and wings, which would be consistent with both pleural and dual scenarios, in which ancestral thoracic gills would contribute to wing structures. Alternatively, in a scenario in which gills and the aquatic nymphs would not be the ancestral state to pterygotes 46,47 these similarities could represent gene co-option between these two organs.
Whatever the case the transcriptomic similarities observed between gills and wings suggest that they share a common genetic program.

Genome sequencing and assembly
Genomic DNA was extracted from C. dipterum adult males from an inbred line kept in the laboratory for seven generations 10 . Illumina and Nanopore technologies were used to sequence the genome ( Supplementary Table 1-3).
The assembly was generated by hybrid assembly using MaSuRCA v3.2.3 48,49 with 95.9x short-read Illumina short-read and 36.3x ONT long-read coverage.
RNA sequencing and assembly 35 RNA-seq datasets of multiple developmental stages (four embryonic stages) and dissected tissues and organs (nymphal and adult dissected organs and whole heads) were generated using the Illumina technology. Samples were processed immediately after dissection and RNA was extracted using RNeasy Mini Kit (Qiagen) or RNAqueous™-Micro Total RNA Isolation Kit (Ambion) following manufacturers instructions. Single-end and paired-end libraries were generated using Illumina (TruSeq) RNA-Seq kit.

Genome annotation
Reads from all paired-end transcriptomes were assembled altogether using Trinity 50 and subsequently aligned to the genome using the PASA pipeline 51 .
High-quality transcripts were selected to build a Hidden-Markov profile for de novo gene prediction in Augustus 52 .
Reads were aligned to the genome using STAR 53 and transcriptome assembled using Stringtie 54 . The assemblies were merged using Taco 55 .
Consensus transcriptome assembly and splice-junctions were converted into hints and provided to Augustus gene prediction tool which yielded 16364 evidence-based gene models. These models contain 4308 non-redundant PFAM models as assigned using the PfamScan tool. RepeatModeller was used to identify repeats in C. dipterum assembled genome.
We integrated these RNA-seq data into a UCSC Genome Browser track hub together with two additional tracks, insect sequence conservation and annotation of repetitive elements.

Comparative transcriptomics
Mfuzz software 20 was used to perform soft clustering of genes according to developmental and life history expression dynamics using normalised read counts. We selected eight developmental and post-embryonic stages.
We used DESeq2 R package 56 to analyse differential gene expression between male and female adult heads.
We used the cRPKM metric (corrected-for-mappability Reads Per Kilobasepair of uniquely mappable positions per Million mapped reads 57 to perform WGCNA gene expression analyses. We performed the analyses using as datasets genes that were present in a least two species in our family reconstructions and showed variance across samples (coef. var ≥ 1). Each module was designated with a tissue or a biological/molecular category. Finally, we analysed the overlap between homologous groups for each pair of modules for each of the species in a pairwise manner. To evaluate significance we performed hypergeometric tests.

Gene orthology
To obtain phylogeny-based orthology relationships between different taxa, the predicted proteomes of 14 species (Supplementary Table 4) representing major arthropod lineages and outgroups were used as input for OrthoFinder2 58 .

Chemosensory genes identification and phylogeny
We created a dataset containing well-annotated members of the chemosensory (CS) gene family (i.e., GR, OR, IR/iGluR, OBP and CSP) from a group of representative insects (Supplementary Table 9

Acknowledgments
We

Competing interests
The authors declare no competing interests.

Data Availability
The sequence reads and the genome assembly have been deposited in the European Nucleotide Archive (ENA) with the project accessions PRJEB34721 and PRJEB35103.