Main

Animals, and eukaryotes generally, exhibit a striking range of genome sizes across species1, seemingly uncorrelated with morphological complexity and gene content. This has been deemed the ‘C-value enigma’2. Animal genomes often increase in size due to the expansion of transposable elements (TE) (for example, in rotifers3, chordates4,5 and insects6) and through chromosome rearrangements and polyploidization (for example, in vertebrates7,8,9 and insects10), which is usually counterbalanced through TE removal11, DNA deletions12,13 and rediploidization14. Although the adaptive impact of these changes is complex and probably often influenced by neutral non-adaptive population dynamics15,16, genome expansions might also provide new genetic material that can stimulate species radiation7 and the evolution of new genome regulatory contexts17 and gene architectures18. By contrast, the evolutionary drivers of genome compaction are more debated and hypotheses are often based on correlative associations1; for example, with changes in metabolic19 and developmental rates20, cell and body sizes1,21 (as in some arthropods22,23, flatworms22 and molluscs24) and the evolution of radically new lifestyles, such as powered flight in birds and bats13,25 and parasitism in some nematodes26,27 and orthonectids28. However, these correlations often suffer from multiple exceptions; for example, not all parasites have small genomes27 neither does the insect with arguably the smallest body size have a compact genome29 and thus they probably reflect lineage-specific specializations instead of general trends in animal evolution. In addition, genomic compaction leading to minimal genome sizes, as in some free-living species of nematodes30, tardigrades31,32 and appendicularians5,33, apparently co-occurs with prominent changes in gene repertoire34,35, genome architecture (for example, loss of macrosynteny36) and genome regulation (for example, trans-splicing and operons37,38,39), yet these divergent features are also present in closely related species with larger genomes5,32,40. Therefore, it is unclear whether these are genomic changes required for genomic streamlining or lineage specializations unrelated to genome compaction.

The marine annelid Dimorphilus gyrociliatus (O. Schmidt, 1857) (formerly Dinophilus gyrociliatus) has been reported to have a C-value (haploid genome size) of only 0.06–0.07 pg (~59–68 megabases, Mb)41, the smallest ever reported for an annelid42, and a haploid karyotype of 12 chromosomes43. D. gyrociliatus is a free-living meiobenthic species44 whose adults show strong sexual dimorphism, evident already during embryogenesis (Fig. 1a). The adult females are ~1 mm long and display a typical, albeit simplified, annelid segmental body plan45 with only six segments, reduced coelom, and no appendages, parapodia or chaetae (Supplementary Note 1). D. gyrociliatus males are, however, only 50 µm long, comprise just a few hundred cells, lack a digestive system but still possess highly specialized sensing and copulatory organs46. Despite their miniature size, D. gyrociliatus retain ancestral annelid traits, such as a molecularly regionalized nervous system in the female47,48 and the typical quartet spiral cleavage49 (Fig. 1b). With only a few genomes sequenced (Supplementary Table 1), annelids have retained ancestral spiralian and bilaterian genomic features50. Therefore, D. gyrociliatus, with its reduced genome size and small body, is a unique system in which to investigate the genome architecture and regulatory changes associated with genome compaction and to assess the interplay between genomic and morphological miniaturization.

Fig. 1: D. gyrociliatus has the smallest annelid genome.
figure 1

a, Differential interference contrast images of adults and embryos of D. gyrociliatus. The adults are miniature annelid worms with an extreme sexual dimorphism, already apparent during early embryogenesis. b, The life cycle of D. gyrociliatus comprises a 6-day-long embryogenesis with a canonical early spiral cleavage programme, followed by a juvenile and an adult, reproductively active stage. c, Flow cytometry analysis using the nematode C. elegans as reference and propidium iodide (PI) nuclear intensity estimates the genome size of D. gyrociliatus as 73.82 Mb. d, K-mer counts estimate the genome size of D. gyrociliatus and T. axi to be 70.95 Mb and 92.47 Mb, respectively. e, D. gyrociliatus and T. axi belong to Dinophiliformia, the sister group to Sedentaria and Errantia, and their genome sizes are the smallest known among annelids. dc, dorsal ciliary field; dg, digestive system; ey, eye; oo, oocyte; pe, penis; ph, pharynx. Drawings are not to scale.

Results

We performed long-read PacBio sequencing (Extended Data Fig. 1a) to generate a highly contiguous (N50, 2.24 Mb) and complete (95.8% BUSCO genes) ~78 Mb-long haploid assembly, comparable in quality to other published annelid genomes (Extended Data Fig. 1d,e and Supplementary Table 1). Flow cytometry measurements and K-mer based analyses estimated the size of D. gyrociliatus genome to be 73.82 Mb and 70.95 Mb, respectively (Fig. 1c,d), agreeing with previous estimations41. While their simple morphology originally prompted them to be considered as early-branching annelids51 (‘Archiannelida’), molecular phylogenies later placed D. gyrociliatus either within Sedentaria52 or as sister to Errantia and Sedentaria53, the two major annelid clades (Supplementary Note 2). Gathering an extensive dataset of annelid sequences54, we robustly placed D. gyrociliatus together with Trilobodrilus axi, Dinophilus vorticoides and Lobatocerebrum sp.—all miniature annelids—in a clade we name Dinophiliformia that is sister to Errantia and Sedentaria, thus confirming the previous proposal53 (Fig. 1e and Extended Data Fig. 2). Given the generally larger bodies and genome sizes found in annelid lineages outside Dinophiliformia (Fig. 1e), and that T. axi also has a compact, 92.47 Mb genome (Fig. 1d), our data suggest genome size reduction and morphological miniaturization both occurred in the lineage leading to D. gyrociliatus and its relatives.

To assess how changes in repeat content contributed to genome reduction in D. gyrociliatus, we annotated the complement of TEs, uncovering a much lower percentage (4.87%) than in other annelid genomes (Fig. 2a and Extended Data Fig. 3a,b). Most TEs (91.5%) group in four classes and, as in the annelid Helobdella50, TEs are either old copies or very recent expansions (Fig. 2b). The most abundant TE class is a Ty3-gypsy-like long terminal repeat (LTR) retrotransposon that appears to be an annelid- or D. gyrociliatus-specific subfamily, and thus we name it Dingle (Dinophilidae Gypsy-like elements) (Extended Data Fig. 3c). As in some insect and nematode clades55, where LTR retrotransposon envelope (env) proteins are apparently related to env proteins of DNA viruses, Dingle envelope (env) protein shows similarities with envelope glycoprotein B precursors of cytomegalovirus (CMV) and herpesviridae-1 (HSV-1) (Extended Data Fig. 3d,e). Compared to species with minimal genome sizes, D. gyrociliatus TE load is three to four times lower than in the appendicularian Oikopleura dioica and the tardigrade Ramazzottius varieornatus but around four times larger than in insects with larger, still compact genomes (~100 Mb) (Supplementary Table 5). Therefore, TE depletion contributed to genome compaction in D. gyrociliatus but this does not appear to be the main driving factor since other small animal genomes show even lower fractions of TEs.

Fig. 2: D. gyrociliatus has a reduced transposable element and intronic landscape.
figure 2

a, The percentage of the genome assigned to TEs and repeats in three annelid genomes. D. gyrociliatus has considerably less TEs and simple repeats than other annelids. b, TE abundance according to sequence divergence (Kimura distance) to family consensus. TE expansions are limited in size and correspond to either very recent bursts or old elements. c, Number of annotated genes in 28 animal genomes plotted against genome size. D. gyrociliatus has a reduced gene repertoire compared to other annelids but comparable to other animals of similar genome size. d, Size distribution of orthologous introns in seven bilaterian species. Intron size is comparable between D. gyrociliatus and the annelid C. teleta and the centipede S. maritima, which are both slow-evolving lineages with larger genomes. Dashed horizontal line indicates D. gyrociliatus median intron size. e, Rates of intron gain (green), intron loss (violet) and introns per kb of CDS (blue) in representative spiralian lineages and a consensus phylogeny. D. gyrociliatus has lost introns, yet at a much lower rate and preserving many more ancestral animal introns than other fast-evolving spiralian lineages, such as flatworms and rotifers. Note that intron densities in the platyhelminthes S. mediterranea and M. lignano are underestimated due to the low fraction of single-copy complete orthologues detected in these species for the BUSCO gene dataset.

To explore how changes in gene architecture influenced genome compaction, we used transcriptomic data and ab initio predictions to annotate 14,203 protein-coding genes in the D. gyrociliatus genome, a smaller gene repertoire than that of other annelids (Fig. 2c, Extended Data Fig. 1b,c and Supplementary Table 1). However, the gene number is comparable to free-living species with similar genome sizes, such as O. dioica33 (~15,000 genes) and R. varieornatus32 (~14,000 genes). With a gene density (208.86 genes per Mb) double that in the annelids Capitella teleta (99.96 genes per Mb) and Helobdella robusta (97.5 genes per Mb), D. gyrociliatus has shorter intergenic regions and transcripts, but similar exon lengths and even larger untranslated regions (UTRs) (Extended Data Fig. 4a,b,d–f), suggesting that intron shortening might have contributed to genome compaction. However, although D. gyrociliatus shows overall very short introns (median 66 base pairs, bp) and its splicing is thus more efficient at removing short intron sizes (Extended Data Fig. 4i), introns are not shorter on average than in C. teleta (median 57 bp) and even similar to the centipede Strigamia maritima (median 67 bp) (Fig. 2d and Extended Data Fig. 4h), both with larger genomes than D. gyrociliatus. Instead, D. gyrociliatus has fewer introns than other annelids (Fig. 2e) and exhibits an intron density comparable to other animals with small genome sizes, such as O. dioica and C. elegans, but with a much higher retention of ancestral introns (Extended Data Fig. 4j,k). Therefore, gene and intron loss, rather than short intron size—which was probably a pre-existing condition—correlates with genome compaction in D. gyrociliatus, unlike in free-living nematodes of similar genome size56.

To investigate how gene loss shaped the D. gyrociliatus genome and morphology, we first reconstructed clusters of orthologous genes using a dataset of 28 non-redundant proteomes covering major animal groups and estimated gene loss and gain rates. Over 80% of D. gyrociliatus genes are assigned to multispecies gene families; the highest percentage in any annelid sequenced so far (Extended Data Fig. 5a). However, 38.9% of the genes in D. gyrociliatus are in orthogroups where there is only one D. gyrociliatus sequence, and thus D. gyrociliatus has the smallest average gene family size among annelids (1.63 genes per orthogroup; Supplementary Table 7). Although the rate of gene family loss is greater than in C. teleta, an annelid species with a conservatively evolving genome50, gene loss in D. gyrociliatus is similar to those of the annelids H. robusta and Hydroides elegans, species with larger genomes (Fig. 3a and Extended Data Fig. 5b). Therefore, our data suggest that reduction of gene family size outweighs complete gene family loss, and thus probably underpins the reduced total gene number of D. gyrociliatus, as also observed in certain Caenorhabditis species of small genome size56,57

Fig. 3: D. gyrociliatus has retained a conserved developmental toolkit and ancestral linkeage blocks.
figure 3

a, Number of gene family gains (green) and losses (violet) in representative spiralian lineages under a consensus tree topology. Gene loss in D. gyrociliatus is similar to or lower than that observed in other fast-evolving spiralian lineages. b, D. gyrociliatus has a conserved Hox complement, organized in a compact cluster (top). Whole-mount in situ hybridization during embryogenesis reveals that Hox genes exhibit staggered anteroposterior domains of expression, but not temporal collinear expression domains (arrowheads) along the trunk region, with Hox1, Hox5 and Antp further exhibiting anterior head expression domains (arrows). Dashed lines in lateral views of early and late elongation timepoints demarcate the head–trunk boundary and asterisks mark the anterior end. Scale bar, 50 µm. c, Oxford dot plots of orthologous genes between the scallop M. yessoensis and three annelid genomes. Orthologous genes are coloured according to their position in M. yessoensis linkage groups. The presence of an organized Hox cluster correlates with the preservation of some macrosyntenic blocks (areas of higher density of shared orthologues) in D. gyrociliatus, which are lost in the fast-evolving H. robusta.

Consistent with the streamlining of its gene repertoire, we detected only nine expanded gene families in D. gyrociliatus (but 73 and 42 in C. teleta and H. robusta, respectively), most of them corresponding to locally duplicated genes implicated in immune responses (Extended Data Fig. 5c–e). In addition, D. gyrociliatus shows canonical repertoires of gene families expanded in other annelids, such as G-protein-coupled receptors (GPCRs) and epithelial sodium channels (ENaCs)50 (Extended Data Fig. 6a,b and Supplementary Table 8). The GPCR complement of genomes is dynamic and often linked to specific (neuro)physiological adaptations, as seen in lineages with miniature genomes that have experienced either losses (for example, O. dioica lacks Class C, glutamate receptors) or expansions (for example, C. elegans58 and R. varieornatus59 expanded Class A, rhodopsin receptors) (Extended Data Fig. 6b). Thus, the conserved GPCR repertoire and the canonical neuropeptide complement (Extended Data Fig. 6c) further support that D. gyrociliatus nervous system is functionally equivalent to, although morphologically smaller than, that of larger annelids47,48.

Despite its miniature body plan, D. gyrociliatus has an overall conserved developmental toolkit at the level of both transcription factors and signalling pathways (Extended Data Fig. 5f,g). D. gyrociliatus, and Dinophilidae generally, exhibit a limited repertoire of certain extracellular signalling molecules (for example, Wnt and TGF-β ligands) and lacks bona fide FGF and VEGF ligands (Extended Data Fig. 5g–i). However, these simplifications do not affect the receptor repertoire (Extended Data Fig. 5j). Unlike appendicularians60, tardigrades32 and nematodes32 with compact genomes, D. gyrociliatus exhibits a compact, ordered Hox cluster, only lacking lox2 and post1 (Fig. 3b and Extended Data Fig. 7a,b). In other annelids61,62, post1 is separate from the main Hox cluster, and as in brachiopods63, it is expressed in chaetoblasts62, supporting the homology of this new cell-type63. Remarkably, the distantly related H. robusta and D. gyrociliatus both lack chaetae, post1 and FGF ligand (also expressed in annelid chaetoblasts; Extended Data Fig. 5k–r), suggesting that the secondary loss of chaetae followed convergent routes of gene loss in different annelid species.

To investigate whether the clustered Hox genes of D. gyrociliatus exhibit temporal collinearity, we first performed comparative transcriptomics at four different stages of the D. gyrociliatus female life cycle (Extended Data Fig. 8a,b). Genome-wide expression dynamics revealed five main clusters of coregulated genes (Extended Data Fig. 8c), corresponding to major developmental events, such as cell proliferation in early development or during adult growth (clusters 5 and 4, respectively), sex differentiation (cluster 2), nervous system maturation during late embryogenesis and postembryogenesis (cluster 1) and increased metabolism after hatching (cluster 3). While there is a gradual increase in gene upregulation as embryogenesis proceeds, which stabilizes in the juvenile to adult transition (Extended Data Fig. 8d–f), all Hox genes but Hox5, Antp and post2 are expressed during early embryogenesis (days 1–3; Extended Data Fig. 7c). Using whole-mount in situ hybridization, we identified late gastrula (~3 d after egg deposition) as the earliest stage at which most Hox genes become simultaneously transcribed, including post2 (Fig. 3b), altogether suggesting that D. gyrociliatus Hox genes lack temporal collinearity. Different from other annelid species64,65,66, D. gyrociliatus embryogenesis is slow, taking ~6 d from egg laying to hatching (Fig. 1b), and thus it is unlikely that Hox temporal collinearity is compressed to span a short and quick early morphogenesis. During body elongation and segment formation, Hox genes are expressed in staggered anteroposterior domains along the developing trunk, in patterns resembling those of C. teleta62, further supporting that D. gyrociliatus retains the ancestral annelid molecular body patterning (Fig. 3b and Extended Data Fig. 7d). Therefore, D. gyrociliatus Hox genes show only staggered expression domains along the anteroposterior axis (Extended Data Fig. 7e), providing a compelling case where temporal collinearity is not driving Hox cluster compaction and maintenance67.

Animal groups with reduced genome sizes show altered gene orders, as exemplified by their disorganized Hox clusters60,68 and the loss of conserved gene linkage blocks that represent the ancestral chromosomal organization36,50. In O. dioica, this loss has been related to the loss of the classical non-homologous end-joining, double-strand DNA break repair pathway69. In addition to an ordered Hox cluster, D. gyrociliatus shows residual conservation of ancestral linkage blocks, which appear eroded but still visible (Fig. 3c). These blocks are almost intact in C. teleta but completely lost in H. robusta (Fig. 3c and Extended Data Fig. 7f). Moreover, D. gyrociliatus has a conserved double-strand DNA break repertoire (Supplementary Table 9), with the exception of BRCA1, which is however also absent in other invertebrates capable of homologous recombination, such as Drosophila melanogaster70. Therefore, mutation-prone double-strand DNA break repair mechanisms that can increase DNA loss do not underpin genomic compaction in D. gyrociliatus, which occurred without drastic genome architecture rearrangements.

Changes in genome size have been positively correlated to differences in cell and body sizes in a range of animal groups1,21,22,23,24. Given the miniature body size and the compact genome of D. gyrociliatus, we thus suggested that the molecular mechanisms controlling cell and organ growth might exhibit critical divergences in this lineage, should these two traits be connected. To test this, we used genome-wide KEGG annotation (Supplementary File 4) to reconstruct signalling pathways known to be involved in the control of cell growth and proliferation (cyclin/CDKs71 and PI3K/Akt/mTOR72) and organ size (Hippo pathway73) in metazoans (Fig. 4a). D. gyrociliatus shows orthologues of all core components of these pathways (Supplementary Table 10), with the exception of PRR5—an mTOR complex 2 interactor that is, however, dispensable for complex integrity and/or kinase activity74—and a clear orthologue of p21/p27/p57 kinases, general inhibitors of cyclin-CDK complexes among other roles75. Besides, the Myc transduction pathway, which regulates growth and proliferation76 and sits downstream of the Hippo and PI3K/Akt/mTOR pathways73,77, lacks the regulators mad (in D. gyrociliatus) and mnt (in all Dinophilidae), a condition also shared with the appendicularian O. dioica (Fig. 4b and Supplementary Table 11). In Dinophilidae, MYC additionally has a W135 point mutation in the broadly conserved MYC box II (MBII) transactivation domain that has been shown to impair MYC function in human cells, in particular its ability to repress growth arrest genes78 (Fig. 4c). Myc downregulation in vertebrates and flies causes hypoplasia79, which could explain the miniature size of dinophilids, and slows down DNA replication80, which could act as a selective pressure favouring smaller genomes. Although the full extent of these genomic changes is hard to evaluate given the poor understanding of cell and organ growth in annelids, our data provide a substrate for studying whether there is a mechanistic link between genome size reduction and organism miniaturization in D. gyrociliatus.

Fig. 4: D. gyrociliatus exhibits a divergent MYC pathway.
figure 4

a, Schematic representation of signalling pathways involved in cell growth/proliferation and organ size in animals. D. gyrociliatus shows conserved Hippo and PI3K/Akt/mTOR pathways (green boxes), but also divergences in the MYC pathway (orange box), one of the downstream regulators. See main text and Supplementary Table 11 for a complete list of genes. b, Schematic representation of the MYC/MAX/MAD pathway and the interactions between the main protein partners. D. gyrociliatus lacks bona fide MAD and MNT proteins (the latter also absent in other members of Dinophilidae). c, Multiple protein alignment of the MBII repressor domain of MYC, highlighting how Dinophilidae exhibit point mutations in the critical tryptophan (W) residue.

To investigate how compaction affected genome regulation, we first used assay for transposase-accessible chromatin using sequencing (ATAC-seq) to identify ~10,000 reproducible open chromatin regions in adult D. gyrociliatus females (Extended Data Fig. 9a–d). Open chromatin regions are short in D. gyrociliatus and mostly found in promoters (Fig. 5a,b), consistent with its small genome size and small intergenic regions. Despite the generally short intron size in D. gyrociliatus, 944 ATAC-seq peaks were in intronic regions substantially larger than non-regulatory introns (Fig. 5c). We recovered a canonical regulatory profile (Fig. 5d), which together with the lack of putative spliced leaders in 5′ UTRs (Extended Data Fig. 4g), suggests that trans-splicing and operons do not occur in D. gyrociliatus, similar to other annelids81. The CTCF DNA-binding motif was the most abundant in active regulatory regions, located mostly in promoters and as single motifs (Fig. 5e and Extended Data Fig. 9e–h). Unlike nematodes with compact genomes82, which lack CTCF, the D. gyrociliatus genome encodes for a CTCF orthologue (Supplementary Fig. 8). However, localization of CTCF DNA-binding motifs, for the most part close to transcriptional start sites, instead of in intergenic regions, suggests that CTCF might play a role in regulating gene expression in D. gyrociliatus rather than in chromatin architecture as seen in vertebrates83. Thus, our data indicate that D. gyrociliatus has retained conserved genomic regulatory features (for example, lack of operons and trans-splicing, and presence of CTCF) but streamlined regulatory regions and potentially lost distal intergenic cis-regulatory elements with genome compaction.

Fig. 5: The regulatory genomic landscape of D. gyrociliatus.
figure 5

a, Violin plot depicting ATAC-seq peak size distribution in D. gyrociliatus compared to the median values in the fly D. melanogaster and humans. The open chromatin regions are shorter in D. gyrociliatus than in other animal genomes. b, Distribution of ATAC-seq peaks according to genomic feature. Most of the open chromatin regions are found in promoters, intergenic regions and (first) introns. c, Violin plots of size distributions in introns with and without ATAC-seq peaks. The presence/absence of open chromatin regions in introns correlates positively with size. d, Metagene profile of ATAC-seq signal. All gene lengths are adjusted to 2 kb. e, Top ten most-significant motifs identified in D. gyrociliatus ATAC-seq peaks. The most abundant motif in open chromatin regions corresponds to CTCF. f,g, Tag clusters centred on the dominant CAGE-supported TSS (CTSS) are usually narrow (based on interquantile range q0.1–q0.9) (f) and retain the canonical metazoan polymerase II initiation pyrimidine (C, T)/purine (A, G) dinucleotides (g). h, Most (11,245 out of 13,693) of the CTSS have a TATA-box and/or a downstream promoter element (DPE). i,j, Nucleosomes are consistently located after the CTSS (i), regardless of the promoter type (j). k,l, While genes with a TATA-box tend to be slightly narrower on average (l), there are no major differences in expression levels between genes with different promoter elements (k).

Since most regulatory information is restricted to promoter regions (<1 kilobase (kb) upstream of the transcription start site, TSS), we applied cap analyses gene expression (CAGE)-seq to characterize promoter architecture (Extended Data Fig. 10a). Promoters are narrow (<150 bp) in D. gyrociliatus and use pyrimidine–purine dinucleotides as preferred initiators (Fig. 5f,g and Extended Data Fig. 10e). Upstream TA and downstream GC enrichment, respectively, revealed the presence of TATA-box and downstream promoter elements (DPE) in D. gyrociliatus, with TATA-box generally associated with short promoters (Fig. 5h and Extended Data Fig. 10f). Similar to vertebrates84, strength of nucleosome positioning correlates with promoter broadness in D. gyrociliatus (Fig. 5i) and thus narrow TATA-box dependent promoters have lower +1 nucleosome occupancy than wide non-TATA-box promoters (Fig. 5j). As in other eukaryotes, TATA-box containing D. gyrociliatus promoters have somewhat higher expression levels, while promoters with DPE motif have no particular features, indicating this element might be non-functional (Fig. 5k,l). Therefore, the general D. gyrociliatus promoter architecture resembles that of other bilaterians (Extended Data Fig. 10g), further supporting that genomic compaction did not alter genome regulation.

Discussion

Our study demonstrates that genome compaction and morphological miniaturization are specificities of D. gyrociliatus (Fig. 1e), grounded in a nested phylogenetic position within Annelida, TE depletion, intergenic region shortening, intron loss and streamlining of the gene complement and genome regulatory landscape (Fig. 2a,e, Fig. 3a and Fig. 5a,f). Traditionally, morphological miniaturization in D. gyrociliatus and Dinophiliformia has been considered a case of progenesis (underdevelopment)45,52, yet the exact underlying mechanisms are unknown. As in other animal lineages34,35,85, our data support that morphological change might be partially explained by gene loss in D. gyrociliatus (Fig. 6a), as we identified a reduced repertoire of extracellular signalling ligands and the loss of developmental genes related to missing organs, such as chaetae (post1 and FGF ligand) and mesodermal derivatives like coeloms (VEGF ligand). However, cis-regulation of gene expression is mostly restricted to the proximal regions in Dimorphilus (Fig. 5b). Therefore, our study suggests that coordinated distal gene regulation, which is an animal innovation86 whose emergence has been associated with the evolution of sophisticated gene regulatory landscapes and morphological diversification87,88, is also limited in D. gyrociliatus.

Fig. 6: A new conservative route to genome compaction in D. gyrociliatus.
figure 6

a, Schematic diagram of the genomic changes which occurred during genome compaction and morphological miniaturization in D. gyrociliatus and Dinophilidae. b, D. gyrociliatus genome represents a more conservative evolutionary pathway to genome compaction compared to the more drastic genomic changes experienced by other bilaterian lineages with compact genomes, such as O. dioica and C. elegans.

Unlike in other cases of genomic compaction5,30,31,32,33,36,37,38,39, but similar to what has been reported for the teleost fish Takifugu rubripes89,90, our work provides compelling evidence that genome miniaturization did not trigger drastic changes in genome architecture and regulation in D. gyrociliatus (Fig. 3c, Fig. 5c,e,h and Fig. 6b). Therefore, the genomic features observed in appendicularians, tardigrades and some nematodes are lineage specificities that might have eventually facilitated genome compaction, but that are not always associated with genome size reduction, thus questioning the assumed causal link between fast-evolving genomic traits and genome compaction. Altogether, our study characterizes an alternative, more conservative route to genome compaction, and furthermore provides an exciting new system and genomic resources to investigate the evolutionary plasticity and function of core cellular mechanisms in animals.

Methods

Genome sequencing and assembly

Adult females of D. gyrociliatus were used to isolate genomic DNA following standard guanidium isothiocyanate protocol and RNase A treatment. Library was prepared using Pacific Biosciences 20-kb library preparation protocol and size-selected using BluePippin with 5-kb cutoff. The library was sequenced on a Pacific Bioscience RS II instrument using P6-C4 chemistry at the Norwegian Sequencing Centre. An Illumina library of median insert size of 298 bp was sequenced in 101 bases paired end mode on an Illumina HiSeq 2500 instrument at GeneCore (EMBL). All raw sequence data associated with this project are available under primary accession PRJEB37657 in the European Nucleotide Archive.

PacBio reads were filtered with SMRTAnalysis v.2.3.0.140936 and assembled with PBcR v.8.3rc2 (refs. 91,92) using default options, except for K-mer = 14 and asmMerSize = 14. Four rounds of decontamination using Blobtools v.0.9.16 (ref. 93) were applied, removing contigs with similarity to bacteria, algae, fungi or unicellular eukaryotes. A consensus assembly was generated with Quiver and improved with Pilon v.1.16 (ref. 94) using the Illumina paired end reads previously filtered for adaptors with cutadapt v.1.4.2 (ref. 95). We used HaploMerger2 v.20151124 (refs. 96,97) to reconstruct a high-quality haploid reference assembly, which we further scaffolded with SSPACE-LongRead v.1.1 (ref. 98).

We collected hundreds of adult individuals of T. axi Remane, 1925, at the intertidal beach of Königshafen, Sylt (Germany)44 and extracted genomic DNA as described above to prepare a TruSeq v.3 Illumina library that was sequenced in 101 bases paired end mode on a full lane of an Illumina HiSeq 2500 instrument at GeneCore (EMBL). Before assembly, we removed adaptors and low-quality regions with cutadapt v.1.4.2 (ref. 95) and Trimmomatic v.0.35 (ref. 99), error correction with SPAdes v.3.6.2 (ref. 100) and deduplication with Super_Deduper v.2.0. Cleaned reads were assembled with Platanus v.1.2.4 (ref. 101) and contigs with similarity to proteobacteria were identified with Blobtools v.0.9.16 (ref. 93). After removal of bacterial contigs, we generated the final assembly with Velvet v.1.2.10 (ref. 102).

We used BUSCO v.2 pipeline (ref. 103) to validate the completeness of the genome assemblies. Out of the 978 metazoan BUSCO genes, 930 were complete (95.1%), seven were fragmented (0.7%) and 41 were missing (4.2%) (Extended Data Fig. 1e) in the D. gyrociliatus genome assembly. Only 27 (2.8%) of the BUSCO genes were complete and duplicated. BUSCO analysis on the T. axi genome resulted in 835 complete (85.4%), 27 complete and duplicated (2.8%), 75 fragmented (7.7%) and 68 missing (6.9%) (Extended Data Fig. 1e). Finally, we used KAT v.2.4.2 (ref. 104) to estimate the completeness and copy number variation of the assemblies (Supplementary Fig. 1).

Genome size measurements

For flow cytometry measures, adult D. gyrociliatus females and C. elegans worms (reference) were starved for 3–4 d before analysis. D. gyrociliatus and C. elegans were chopped with a razor blade in General-Purpose Buffer105 and the resulting suspension of nuclei was filtered through a 30-µm nylon mesh and stained with propidium iodide (Sigma; 1 mg ml–1) on ice. We used a flow cytometer Partex CyFlow Space fitted with a Cobalt Samba green laser (532 nm, 100 mW) to analyse the samples, performing three independent runs with at least 5,000 nuclei per run. For K-mer-based measures, we used the raw Illumina paired end reads of D. gyrociliatus and T. axi. We removed adaptors using cutadapt v.1.4.271 (ref. 95), quality trimmed the reads using Trimmomatic v.0.3575 (ref. 99), performed error correction using SPAdes v.3.6.276 (ref. 100) and removed duplicated reads using Super-Deduper v.2.0. We identified and removed contaminant reads using BlobTools v.1.1.1, and normalized read coverage to 100 times in both datasets using BBNorm from BBTools suite v.38.86 to mitigate the effects of a strong GC content bias in D. gyrociliatus and reduce the impact of highly abundant repeats in T. axi. We used Jellyfish v.2.2.386 (ref. 106) to count and generate a histogram of canonical 31-mers, and GenomeScope 2.0 (refs. 107,108) to estimate the genome size and heterozygosity (Fig. 1d and Supplementary Fig. 2). We also used Smudgeplot107 to estimate ploidy and analyse the genome structure (Supplementary Fig. 3).

Transcriptome sequencing and assembly

A publicly available dataset (Sequence Read Archive (SRA), accession number SRX2030658) was used to generate a de novo transcriptome assembly as previously described47. Redundant contigs were removed using the cd-hit-est program with default parameters of CD-HIT (ref. 109) and CAP3 (ref. 110). Additionally, we used that dataset to generate a genome-guided assembly using Bowtie2 (ref. 111) and Trinity v.2.1.1 (ref. 112). Supplementary Table 2 shows standard statistics for the de novo and genome-guided assemblies calculated with Transrate113. Transcriptome completeness was evaluated with BUSCO v.2 (ref. 103).

Stage-specific RNA-seq

Two biological replicates of four developmental stages of D. gyrociliatus (early embryo, 1–3-days-old; late embryo, 4–6-days-old; juvenile females, 7–9-days-old; and adult females, 20–23-days-old) were used to isolate total RNA with TRI Reagent Solution (Applied Biosystems) following manufacturer’s recommendations and generate Illumina short-reads on a NextSeq 500 High platform in 75 base paired end reads mode and a ~270 bp library mean insert size at GeneCore (EMBL). We pseudo-aligned reads to D. gyrociliatus filtered gene models with Kallisto v.0.44.0 (ref. 114), and followed the standard workflow of DESeq2 (ref. 115) to estimate counts, calculate size factors, estimate the data dispersion, and perform a gene-level differential expression analysis between consecutive stages (Supplementary Data 1). Datasets were first corrected for low count and high dispersion values using the apeglm log-fold change shrinkage estimator116, and then compared using Wald tests between contrasts. For clustering and visualization, we homogenized the variance across expression ranks by applying a variance-stabilizing transformation to the DESeq2 datasets. We used the pheatmap package to create heatmaps117, the package EnhancedVolcano for volcano plots118 and ggplot2 for the remaining plots119. To characterize and identify enriched gene ontology terms, we used the package clusterProfiler120. All analyses were performed in R (ref. 121) using the RStudio Desktop122.

Phylogenetic analysis

Annelid transcriptomes (Supplementary Data 1) were downloaded from SRA and assembled using Trinity v.2.5.1 (ref. 112) with the Trimmomatic99 read trimming option. Transcriptomes were then translated using Transdecoder v.5.0.2 (ref. 112) after searching for similarity against the metazoan Swissprot database. Predicted proteins were searched using HMMER123 for 1,148 single-copy phylogenetic markers previously described124 using reciprocal BLAST to discard possible paralogues and character supermatrix was assembled as described before124. From this initial dataset, we selected the 264 genes with lowest saturation, yielding a concatenate alignment of 71,508 positions (as the analysis of the full dataset with site-heterogeneous models was not computationally tractable). Phylogenetic analyses were performed on the concatenated alignment using IQTREE125 with a C60 mixture model, an LG matrix to account for transition rates within each profile, the FreeRate heterogeneity model (R4) to describe across sites evolution rates, and an optimization of amino acid frequencies using maximum likelihood. Support values were drawn from 1,000 ultrafast bootstraps with NNI optimization. We also carried out Bayesian reconstruction using a site-heterogeneous CAT + GTR + Gamma model running two chains for <1,000 iterations. We reached reasonable convergence for one of the datasets (bpdiff > 0.19).

Annotation of repeats and transposable elements

We used RepeatModeler v.1.0.4 9 (ref. 126) and RepeatMasker ‘open-4.0’ (ref. 126) to generate an automated annotation of TEs and repeats (Supplementary Table 3). We performed a BLAST analysis using the TE sequences recovered with RepeatModeler and PFAM sequence collections corresponding to entries RVT_1 (PF00078) (Supplementary Table 4) to uncover non-LTR retrotransposons and Helitrons represented by only a few copies. Using MITE Digger127, we identified MITEs whose terminal inverted repeats matched Mariner transposons in the D. gyrociliatus genome. D. gyrociliatus DNA transposons belong to Mariner and Mutator-Like Elements (MULE) on the basis of the amino acid signature of their transposases. To establish the gene arrangement in LTR retrotransposons, we performed six-frame translations of most intact copies, identified as such by having two identical LTRs and being flanked by short direct repeats created by target site duplication. LTR retrotransposons were further compared to other elements of the Ty3/gypsy clade using a set of protein sequences comprising the reverse transcriptase domain and the integrase core domain. The phylogeny of Ty3/gypsy was established with a collection of sequences from the Gypsy database128, including hits obtained with TBLASTN (databases NR and TSA) using D. gyrociliatus sequences as queries. To look for distant homologues of the protein found downstream from the integrase in LTR retrotransposons, we submitted a multiple sequence alignment of ten peptide sequences (corrected to the original coding frame when recovered from disrupted genes) to HHPred (database PDB_mmCIF70_28_Dec). Using MODELLER, the three best hits (P > 99, E value < e–29) were used to model the three-dimensional structure of the Dingle-1 envelope.

Gene prediction and functional annotation

The predicted set of core eukaryotic genes generated by CEGMA129 was used to train and run AUGUSTUS v.3.2.1 (ref. 130). The predicted proteomes of the annelids C. teleta and H. robusta were aligned to the D. gyrociliatus genome using EXONERATE v.2.2.0 (ref. 131) and PASA v.2.0.2 (ref. 132) was used to align the transcriptome to the genome with BLAT and GMAP aligners133,134. EvidenceModeler v.1.1.1 (ref. 135) was used to generate weighted consensus gene predictions, giving a weight of 1 to ab initio gene predictions and spliced protein alignments, and a weight of 10 to the PASA transcript assemblies. EvidenceModeler output was used to refine PASA gene models and generate alternative splice variants. Predictions with BLAST hit against transposons and/or with an overlap ≥90% on masked regions were removed. The final prediction set contains 14,203 coding-protein loci that generate 17,409 different peptides. We used ORFik136 to refine TSS with CAGE-seq data. Functional annotation for the 17,409 different transcripts was performed with Trinotate v.3.0. We retrieved a functional annotation for 13,437 gene models (77.18%).

Gene structure evolution

We compared genome-wide values of gene structure parameters among D. gyrociliatus, C. teleta, H. robusta, D. melanogaster, C. elegans and O. dioica (Supplementary Table 6). To identify splice leader sequences in D. gyrociliatus, we predicted protein-coding sequences in the de novo assembled transcriptome with Transdecoder v.5.5.0 (ref. 112) and used the scripts nr_ORFs_gff3.pl (from Transdecoder) and gff3_file_UTR_seq_extractor.pl (from PASA) to extract the non-redundant 5′ UTR sequences of protein-coding transcripts. We used these sequences and Jellyfish v.2.2.3 (ref. 106) to identify over-represented 22-mer and 50-mer sequences that would correspond to the splice leader.

Intron evolution analysis

We compared distributions of intron lengths between D. gyrociliatus, Homo sapiens, C. teleta, Crassostrea gigas, Lottia gigantea, Strigamia maritima and Branchiostoma lanceolatum (Supplementary Table 6) using only introns in genes with orthologues across the seven species (as defined by OrthoFinder; see below) and orthogroups with less than four paralogues per species. To identify conserved and new D. gyrociliatus introns, we aligned each D. gyrociliatus protein against each annotated protein isoform of each orthologous gene of the abovementioned six species and added the intron positions into the alignments137. To identify high-confidence conserved intron positions, we required that a given D. gyrociliatus intron position was found at the exact position of the alignment and with the same phase (0, 1 or 2) in at least four out of six other species. To define high-confidence non-conserved (probably new) introns, we required that a D. gyrociliatus intron position did not match an intron position with the same phase within 25 alignment positions in any of the other six species. To assess the impact of intron length on splicing efficiency on D. gyrociliatus, S. maritima and H. sapiens, we used RNA-seq-based quantifications of intron retention as previous described138 and implemented by vast-tools139. Only introns that had sufficient read coverage138,139 were used to calculate average PIR.

To quantify intron gain and loss in D. gyrociliatus we generated a database of homologous introns from 28 metazoan genomes (Supplementary Table 6), obtaining one-to-one orthologous genes using BUSCO v.3 (ref. 103) (prot mode and 1 × 10–4 E value) and the OrthoDB v.9 (ref. 140) dataset of 978 single-copy animal orthologues. We aligned the predicted peptides using MAFFT v.7.310 G-INS-i algorithm141 and used Malin142 to identify conserved intron sites and infer their conservation status in ancestral nodes. We estimated the rates of intron gain and loss in each node with Malin’s built-in model maximum-likelihood optimization procedure. We used this model to estimate the posterior probabilities of intron presence, gain and loss in extant and ancestral nodes. For each node, we calculated the intron density expressed as introns per kb of coding sequence (introns per CDS kb), as follows:

$${\mathrm{intron}}\,{\mathrm{density}}_{\it{i}} = \left( {\left( {{\mathrm{num}}\,{\mathrm{introns}}_{\it{i}}/{\mathrm{num}}\,{\mathrm{genes}}} \right)/{\mathrm{median}}\,{\mathrm{gene}}\,{\mathrm{length}}} \right) \times 1{,}000$$

where num intronsi is the number of introns present in a given node (extant or ancestral, corrected by missing sites), num genes = 978 (number of alignments of one-to-one orthologues) and median gene length = 682.5 bp (as obtained from the lengths of the seed proteins curated in the OrthoDB v.9 Metazoa dataset; ‘ancestral’ FASTA file). We used the same strategy to obtain the rates of intron gain and loss per node in terms of introns per CDS kb. In addition, we inferred the uncertainty of the estimated intron gains, losses and presence values with Malin and 1,000 bootstrap iterations. To visualize the evolution of intron content, we used the ape library v.5.0 (ref. 143) from the R statistical package v.3.6 (ref. 121). To calculate the percentage of ancestral metazoan introns retained in each species, we retrieved all introns present in the last common metazoan ancestor (at >99% probability, n = 3,024) and calculated the sum of their presence probabilities in extant species.

Gene family evolution analyses

We used OrthoFinder v.2.2.7 (ref. 144) with default values to reconstruct clusters of orthologous genes between D. gyrociliatus and 27 other animal proteomes (Supplementary Table 6). OrthoFinder gene families were used to infer gene family gains and losses at different nodes using the ETE 3 library145. Gene expansions were computed for each species using a hypergeometric test against the median gene number per species for a given family. We used the functionally annotated gene sets of D. gyrociliatus, C. teleta and H. robusta to identify their repertoires of transcription factors, ligands and receptors. If a gene was not in the annotated D. gyrociliatus genome assembly, we performed manual search via BLAST on the de novo and genome-guided transcriptome. For T. axi and D. vorticoides, gene identification was conducted on the assembled transcriptome via manual BLAST searches. To reconstruct KEGG pathways via KEGG Mapper146, we used the functional annotations obtained from Trinotate to extract KEGG IDs. GPCR sequences in D. gyrociliatus and other animals (Supplementary Table 7) were retrieved using HMMER v.3.2.1 (refs. 123,147) (E value cutoff < 0.01) with Pfam profiles of class A (PF00001), class B (PF00002), class C (PF00003) and class F (PF01534) GPCRs (according to GRAFS classification). Sequences from each class were tested for false positives from other classes (including cAMP slime-mold class E GPCRs, PF05462). Phylogenetic analyses of GPCRs were performed as described elsewhere148. Neuropeptide candidates (Supplementary Data 2) were retrieved by a combination of BLAST searches (E value cutoff < 0.1) and the use of a customized script148 to detect cleavage patterns on precursors.

Orthology assignment

Multiple protein alignments were constructed with MAFFT v.7 (ref. 141); poorly aligned regions were either removed by hand or with gBlocks149. Maximum likelihood trees were constructed with FastTree 2 (ref. 150) using default parameters and visualized with FigTree.

Gene expression analyses

D. gyrociliatus embryos were collected in their egg clusters and manually dissected. The embryonic eggshell was digested in a solution of 1% sodium thioglycolate (Sigma-Aldrich, T0632) and 0.05% protease (Sigma-Aldrich, P5147) in seawater, pH 8, for 30 min at room temperature, followed by relaxation in MgCl2 and fixation. Whole-mount in situ hybridization (WMISH) was performed as described elsewhere47. Images were taken with a Zeiss Axiocam HRc connected to a Zeiss Axioscope Ax10 using bright-field Nomarski optics. C. teleta embryos were fixed and WMISH was performed as previously described151. C. teleta orthologues Ct-fgf8/17/18/24 (ref. 152) (protein ID: 218971), Ct-pvf1 (ref. 153) (protein ID: 153454) and Ct-pvf2 (ref. 153) (protein ID: 220370) were mined from the publicly available genome50. WMISH samples were imaged on a Leica DMRA2 compound microscope coupled with a QIClick camera. Animals stained for F-actin were fixed for 30 min at room temperature, incubated with 1:100 BODIPY FL-Phallacidin (Life Technologies, catalogue no. B607) and imaged with an IXplore SpinSR (Olympus). Capitella WMISH images were rendered using Helicon Focus (HelSoft). Contrast and brightness of images were edited with Photoshop (Adobe Systems) when needed.

Macrosynteny analysis

Single-copy orthologues obtained using the mutual best hit approach were used to generate Oxford synteny plots comparing sequentially indexed orthologue positions as previously described154. Plotting order was determine by hierarchical clustering of the shared orthologue content using the complete linkage method155.

ATAC-seq

ATAC-seq libraries were performed as described elsewhere156, using 50,000–70,000 cells (~18 adult females). Cell dissociation and lysis was improved by disaggregating the tissue with a syringe in lysis buffer. Transposed DNA fragments were amplified by 16 cycles of PCR. Two biological replicates were sequenced in an Illumina NextSeq500 in rapid paired end mode and 75 base read length. Adaptor contamination was removed with cutadapt v.1.2.1 (ref. 95) and cleaned reads were aligned to the unmasked genome with bowtie2 (ref. 111). Peaks were called with MACS2 v.2.1.1.20160309 (ref. 157) with the options --nomodel --extsize 70 --shift -30 --call-summits --keep-dup 1. Irreproducible discovery rates (IDR) were calculated with IDRCode158. A final set of 10,241 consistent peaks (IDR ≤ 0.05) was used for de novo motif enrichment analysis using HOMER159, with default parameters, except -size given (Supplementary Data 1).

CAGE-seq

Total RNA from adult D. gyrociliatus was isolated using Trizol followed by RNeasy RNA clean-up protocol (Qiagen). CAGE libraries were prepared for two biological replicates (barcodes ATG and TAC) using the latest nAnT-iCAGE protocol160. The libraries were sequenced in single-end 50 base mode (Genomic Facility, MRC LMS). Demultiplexed CAGE reads (47 bp) were mapped to the D. gyrociliatus genome assembly using Bowtie2 (ref. 111) and resulting Bam files were imported into R using the standard CAGEr package (v.1.20.0) and G-correction workflow161. Normalization was performed using a referent power-law distribution162 and CAGE-derived TSSs that passed the threshold of 1 transcript per million (TPM) were clustered together using distance-based clustering (Supplementary Data 1). Genomic locations of tag clusters were determined using the ChIPseeker package and gene model annotations, where promoters were defined to include 500 bp upstream and 100 bp downstream of the annotated transcript start site. Visualization of motifs, sequence patterns or reads coverage was performed using Heatmaps and seqPattern Bioconductor packages.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.