Publisher Correction: Conservative route to genome compaction in a miniature annelid

A Correction to this paper has been published: https://doi.org/10.1038/s41559-020-01327-6.

, apparently co-occurs with prominent changes in gene repertoire 34,35 , genome architecture (e.g. loss of macrosynteny 36 ) and genome regulation (e.g. trans-splicing and operons [37][38][39] ), yet these divergent features are also present in closely related species with larger genomes 5, 32,40 . Therefore, it is unclear whether these are genomic changes required for genomic streamlining or lineage specialisations unrelated to genome compaction.
The marine annelid Dimorphilus gyrociliatus (O. Schmidt, 1857) (formerly Dinophilus gyrociliatus) has been reported to have a C-value (i.e. haploid genome size) of only 0.06-0.07 pg (~59-68 Mb) 41 , the smallest ever reported for an annelid 42 , and a haploid karyotype of 12 chromosomes 43 . D. gyrociliatus is a free-living meiobenthic species 44 whose adults exhibit a strong sexual dimorphism, evident already during embryogenesis (Fig. 1a). The adult females are about 1mm long and display a typical albeit simplified annelid segmental body plan 45 with only six segments, reduced coelom, and no appendages, parapodia or chaetae (Supp. Note 1).
D. gyrociliatus males are however only 50 µm long, comprise just a few hundreds of cells, lack a digestive system, but still possess highly specialised sensing and copulatory organs 46 . Despite their miniature size, D. gyrociliatus retains ancestral annelid traits, such as a molecularly regionalised nervous system in the female 47,48 and the typical quartet spiral cleavage 49 (Fig.   1b). With only a handful of genomes sequenced (Supp . Table 1), annelids have retained ancestral spiralian and bilaterian genomic features 50 . Therefore, D. gyrociliatus, with its reduced genome size and small body, emerges as a unique system to investigate the genome architecture and regulatory changes associated with genome compaction and assess the interplay between genomic and morphological miniaturisation. gyrociliatus comprises a 6-days long embryogenesis with a canonical early spiral cleavage programme, followed by a juvenile and an adult, reproductively active stage. (c) Flow cytometry analysis using the nematode C. elegans as reference and propidium iodide (PI) nuclear intensity estimates the genome size of D. gyrociliatus in 73. 82
gyrociliatus either within Sedentaria 52 or as sister to Errantia+Sedentaria 53 , the two major annelid clades (Supp. Note 2). Gathering an extensive dataset of annelid sequences 54  Given the generally larger bodies and genome sizes found in annelid lineages outside Dinophiliformia (Fig. 1e), and that T. axi also has a compact, 92.47 Mb genome (Fig. 1d), our data suggest genome size reduction and morphological miniaturisation both occurred in the lineage leading to D. gyrociliatus and its relatives.
To assess how changes in repeat content contributed to genome reduction in D. gyrociliatus, we annotated the complement of transposable elements (TEs), uncovering a much lower percentage (4.87%) than in other annelid genomes ( Fig. 2a Fig. 6b). Thus, the conserved GPCR repertoire and the canonical neuropeptide complement (Extended Data Fig. 6c) further support that D.
gyrociliatus nervous system is functionally equivalent to, though morphologically smaller than that of larger annelids 47,48 .    Moreover, D. gyrociliatus has a conserved DSB repertoire (Supp . Table 9), with the exception of BRCA1, which is however also absent in other invertebrates capable of homologous recombination, such as Drosophila melanogaster 70 . Therefore, mutation prone DSB repair mechanisms that can increase DNA loss do not underpin genomic compaction in D.
gyrociliatus, which occurred without drastic genome architecture rearrangements.
Changes in genome size have been positively correlated to differences in cell and body size in a range of animal groups 1,21-24 . Given the miniature body size and the compact genome of D.
gyrociliatus, we thus hypothesised that the molecular mechanisms controlling cell and organ growth might exhibit critical divergences in this lineage, should these two traits be connected.
To test this hypothesis, we used genome wide KEGG annotation (Supp. File 4) to reconstruct signalling pathways known to be involved in the control of cell growth and proliferation (cyclin/CDKs 71 , PI3K/Akt/mTOR 72 ) and organ size (Hippo pathway 73 ) in metazoans (Fig. 4a).
D. gyrociliatus shows orthologs of all core components of these pathways (Supp . Table 10), with the exception of PRR5 -an mTOR complex 2 interactor that is however dispensable for complex integrity and/or kinase activity 74 -and a clear ortholog of p21/p27/p57 kinases, general inhibitors of cyclin-CDK complexes among other roles 75 . Besides, the Myc transduction pathway, which regulates growth and proliferation 76 Table 11). In Dinophilidae, MYC additionally has a W135 point mutation in the broadly conserved MYC box II (MBII) transactivation domain that has been shown to impair MYC function in human cells, in particular its ability to repress growth arrest genes 78 (Fig. 4c). Myc downregulation in vertebrates and flies causes hypoplasia 79 , which could explain the miniature size of dinophilids, and slows down DNA replication 80 , which could act as a selective pressure favouring smaller genomes. Although the full extent of these genomic changes is hard to evaluate given the poor understanding of cell and organ growth in annelids, our data provide a substrate for studying whether there is a mechanistic link between genome size reduction and organism miniaturisation in D. gyrociliatus. To investigate how compaction affected genome regulation, we first used ATAC-seq to identify ~10,000 reproducible open chromatin regions in adult D. gyrociliatus females (Extended Data Fig. 9a-d). Open chromatin regions are short in D. gyrociliatus and mostly found in promoters (Fig. 5a, b), consistent with its small genome size and small intergenic regions. Despite the generally short intron size in D. gyrociliatus, 944 ATAC-seq peaks were in intronic regions significantly larger than non-regulatory introns (Fig. 5c) (Fig. 5d), which together with the lack of putative spliced leaders in 5' UTRs (Extended Data Fig. 4g), suggests that trans-splicing and operons do not occur in D. gyrociliatus, similar to other annelids 81 . The CTCF DNA-binding motif was the most abundant in active regulatory regions, located mostly in promoters and as single motifs ( Fig.   5e; Extended Data Fig. 9e-h). Unlike nematodes with compact genomes 82 , which lack CTCF, the D. gyrociliatus genome encodes for a CTCF ortholog (Supp. Fig. 8). However, localisation of CTCF DNA-binding motifs, for the most part close to transcriptional start sites, instead of in intergenic regions, suggests that CTCF might play a role in regulating gene expression in D.
gyrociliatus rather than in chromatin architecture as seen in vertebrates 83 . Thus, our data indicate that D. gyrociliatus has retained conserved genomic regulatory features (e.g. lack of operons and trans-splicing, presence of CTCF) but streamlined regulatory regions and potentially lost distal intergenic cis-regulatory elements with genome compaction.  Meta-gene profile  5i) and thus narrow TATA-box dependent promoters have lower +1 nucleosome occupancy than wide non-TATA-box promoters (Fig. 5j). As in other eukaryotes, TATA-box containing D. gyrociliatus promoters have somewhat higher expression levels, while promoters with DPE motif have no particular features, indicating this element might be non-functional (Fig. 5k, l).
Therefore, the general D. gyrociliatus promoter architecture resembles that of other bilaterians (Extended Data Fig. 10g), further supporting that genomic compaction did not alter genome regulation.

Discussion
Our study demonstrates that genome compaction and morphological miniaturisation are specificities of D. gyrociliatus (Fig. 1e), grounded in a nested phylogenetic position within Annelida, TE depletion, intergenic region shortening, intron loss and streamlining of the gene complement and genome regulatory landscape (Fig. 2a, e; Fig. 3a; Fig. 5a, f). Traditionally, morphological miniaturisation in D. gyrociliatus and Dinophiliformia has been considered a case of progenesis (underdevelopment) 45,52 , yet the exact underlying mechanisms are unknown. As in other animal lineages 34,35,85 , our data support that morphological change might be partially explained by gene loss in D. gyrociliatus (Fig. 6a), as we identified a reduced repertoire of extracellular signalling ligands and the loss of developmental genes related to missing organs, such as chaetae (post1 and FGF ligand) and mesodermal derivatives like coeloms (VEGF ligand). However, cis-regulation of gene expression is mostly restricted to the proximal regions in Dimorphilus (Fig. 5b). Therefore, our study suggests that coordinated distal gene regulation, which is an animal innovation 86 whose emergence has been associated with the evolution of sophisticated gene regulatory landscapes and morphological diversification 87,88 , is also limited in D. gyrociliatus. Unlike in other cases of genomic compaction 5, 30 Genomic features miniaturisation did not trigger drastic changes in genome architecture and regulation in D. gyrociliatus ( Fig. 3c; Fig. 5c, e, h; Fig. 6b). Therefore, the genomic features observed in appendicularians, tardigrades and some nematodes are lineage specificities that might have eventually facilitated genome compaction, but that are not always associated with genome size reduction, thus questioning the assumed causal link between fast evolving genomic traits and genome compaction. Altogether, our study characterises an alternative, more conservative route to genome compaction, and furthermore provides an exciting new system and genomic resources to investigate the evolutionary plasticity and function of core cellular mechanisms in animals.

Genome sequencing and assembly
Adult females of D. gyrociliatus were used to isolate genomic DNA (gDNA) following standard guanidium isothiocyanate protocol and RNase A treatment. Library was prepared using Pacific Biosciences 20 kb library preparation protocol and size-selected using BluePippin with 5 kb cut-off. The library was sequenced on a Pacific to estimate the completeness and copy number variation of the assemblies (Supplementary Figure 1). gyrociliatus and reduce the impact of highly abundant repeats in T. axi. We used Jellyfish v.2.2.386 106 to count and generate a histogram of canonical 31-mers, and GenomeScope 2.0 107,108 to estimate the genome size and heterozygosity (Fig. 1d, Supplementary Figure 2). We also used Smudgeplot 107 to estimate ploidy and analyse the genome structure (Supplementary Figure 3).

Transcriptome sequencing and assembly
A publicly available dataset (Sequence Read Archive, accession number SRX2030658) was used to generate a de novo transcriptome assembly as previously described 47 . Redundant contigs were removed using cd-hit-est program with default parameters of CD-HIT 109 and CAP3 110 . Additionally, we used that dataset to generate a genome-guided assembly using Bowtie2 111 and Trinity v2.

Stage-specific RNA-seq
Two biological replicates of four developmental stages of D. gyrociliatus (early embryo, 1-3 days old; late embryo, 4-6 days old; juvenile females, 7-9 days old; and adult females, 20-23 days old) were used to isolate total RNA with TRI Reagent® Solution (Applied Biosystems) following manufacturer's recommendations and generate Illumina short-reads on a NextSeq 500 High platform in 75 base paired end reads mode and a ~270 bp library mean insert size at GeneCore (EMBL). We pseudo-aligned reads to D. gyrociliatus filtered gene models with Kallisto v0.44.0 114 , and followed the standard workflow of DESeq2 115 to estimate counts, calculate size factors, estimate the data dispersion, and perform a gene-level differential expression analysis between consecutive stages (Supplementary Data Files 1-3). Datasets were first corrected for low count and high dispersion values using the apeglm log fold change shrinkage estimator 116 , and then compared using Wald tests between contrasts. For clustering and visualization, we homogenized the variance across expression ranks by applying a variance stabilizing transformation to the DESeq2 datasets. We used the pheatmap package to create heatmaps 117 , the package EnhancedVolcano for volcano plots 118 , and ggplot2 for the remaining plots 119 . To characterize and identify enriched gene ontology terms, we used the package clusterProfiler 120 . All analyses performed in R 121 using the RStudio Desktop 122 .

Phylogenetic analysis
Annelid transcriptomes (Supplementary Data File 4) were downloaded from SRA and assembled using Trinity v2.5.1 112 with the Trimmomatic 99 read trimming option. Transcriptomes were then translated using Transdecoder v5.0.2 112 after searching for similarity against the metazoan Swissprot database. Predicted proteins were searched using HMMER 123 for 1,148 single-copy phylogenetic markers previously described 124 using reciprocal BLAST to discard possible paralogues and character supermatrix was assembled as described before 124 . From this initial dataset, we selected the 264 genes with lowest saturation, yielding a concatenate alignment of 71,508 positions (as the analysis of the full dataset with site-heterogeneous models was not computationally tractable).
Phylogenetic analyses were performed on the concatenated alignment using IQTREE 125 with a C60 mixture model, a LG matrix to account for transition rates within each profile, the FreeRate heterogeneity model (R4) to describe across sites evolution rates and an optimisation of amino acid frequencies using maximum likelihood (FO). Support values were drawn from 1,000 ultrafast bootstraps with NNI optimisation. We also carried out Bayesian reconstruction using a site-heterogeneous CAT+GTR+Gamma model running two chains for more than a thousand iterations. We reached reasonable convergence for one of the datasets (bpdiff > 0.19).

Annotation of repeats and transposable elements
We used RepeatModeler v.1.0.4 126 and RepeatMasker "open-4.0" 127 to generate an automated annotation of transposable elements (TEs) and repeats (Supplementary Table 3). We performed a BLAST analysis using the TE sequences recovered with RepeatModeler and PFAM sequence collections corresponding to entries RVT_1 (PF00078) (Supplementary Table 4 LTR retrotransposons were further compared to other elements of the Ty3/gypsy clade using a set of protein sequences comprising the reverse transcriptase domain and the integrase core domain. The phylogeny of Ty3/gypsy was established with a collection of sequences from the Gypsy database 129 , including hits obtained with TBLASTN (databases NR and TSA) using D. gyrociliatus sequences as queries. To look for distant homologues of the protein found downstream the integrase in LTR retrotransposons, we submitted a multiple sequence alignment of ten peptide sequences (corrected to the original coding frame when recovered from disrupted genes) to HHPred (database PDB_mmCIF70_28_Dec). Using MODELLER, the three best hits (Probability > 99, Evalue < e -29 ) were used to model the 3D structure of the Dingle-1 envelope.

Gene prediction and functional annotation
The predicted set of core eukaryotic genes generated by CEGMA 130 Table 6). To identify splice leader sequences in D. gyrociliatus, we predicted protein coding sequences in the de novo assembled transcriptome with Transdecoder v.5.5.0 112 and used the scripts nr_ORFs_gff3.pl (from Transdecoder) and gff3_file_UTR_seq_extractor.pl (from PASA) to extract the nonredundant 5' UTR sequences of protein coding transcripts. We used these sequences and Jellyfish v.2.2.3 106 to identify over-represented 22-mer and 50-mer sequences that would correspond to the splice leader.

Gene family evolution analyses
We used OrthoFinder v.2.2.7 145 with default values to reconstruct clusters of orthologous genes between D.
gyrociliatus and 27 other animal proteomes (Supplementary Table 6). OrthoFinder gene families were used to infer gene family gains and losses at different nodes using the ETE 3 library 146 . Gene expansions were computed for each species using a hypergeometric test against the median gene number per species for a given family. We Gene expression analyses D. gyrociliatus embryos were collected in their egg clusters and manually dissected. The embryonic eggshell was digested in a solution of 1% sodium thioglycolate (Sigma-Aldrich, T0632,) and 0.05% protease (Sigma-Aldrich, P5147) in sea water, pH 8, for 30 min at room temperature (RT), followed by relaxation in MgCl2 and fixation.
Whole mount in situ hybridization (WMISH) was performed as described elsewhere 47 . Images were taken with a Zeiss Axiocam HRc connected to a Zeiss Axioscope Ax10 using bright-field Nomarski optics. C. teleta embryos were fixed and WMISH was performed as previously described 152

Macrosynteny analysis
Single-copy orthologues obtained using the mutual best hit approach (MBH) were used to generate Oxford synteny plots comparing sequentially indexed orthologue positions as previously described 155 . Plotting order was determine by hierarchical clustering of the shared orthologue content using the complete linkage method 156 .
Cell dissociation and lysis was improved by disaggregating the tissue with a syringe in lysis buffer. Transposed Irreproducible discovery rates (IDR) were calculated with IDRCode 159 . A final set of 10,241 consistent peaks (IDR ≤ 0.05) was used for de novo motif enrichment analysis using HOMER 160 , with default parameters, except -size given (Supplementary Data File 6).

Cap Analysis Gene Expression (CAGE)-seq
Total RNA from adult D. gyrociliatus was isolated using Trizol followed by RNeasy RNA clean-up protocol (Qiagen). CAGE libraries were prepared for two biological replicates (barcodes ATG and TAC) using the latest nAnT-iCAGE protocol 161 . The libraries were sequenced in single-end 50 base mode (Genomic Facility, MRC LMS). Demultiplexed CAGE reads (47 bp) were mapped to the D. gyrociliatus genome assembly using Bowtie2 162 and resulting Bam files were imported into R using the standard CAGEr package (version 1.20.0) and G-correction workflow 163 . Normalization was performed using a referent power-law distribution 164 and CAGEderived TSSs that passed the threshold of 1 TPM were clustered together using distance-based clustering wrote the manuscript.

Data availability
All new raw sequence data associated to this project are available under project with primary accession PRJEB37657 in the European Nucleotide Archive (ENA). Genome annotation files and additional datasets are available in https://github.com/ChemaMD/DimorphilusGenome.

Code availability
All custom code used in this study is freely available in https://github.com/fmarletaz/comp_genomics. Figure 1 Sequencing approach and assembly statistics (a-c) Diagram of the approach taken to sequence and assemble Dimorphilus gyrociliatus genome and transcriptome, and to annotate coding genes in the genome. (d) Comparison of genome assembly statistics between D. gyrociliatus and the annelids C. teleta and H. robusta. D. gyrociliatus genome is smaller than one third of C. teleta's genome, and the assembly is contained in only ~350 scaffolds, with an N50 of 2.24 Mb, the second-best contiguity value for an annelid genome assembly to date. (e) Genome completeness, as indicated by metazoan BUSCO repertoire, in genome assemblies of different annelid lineages. D. gyrociliatus completeness is comparable to C. teleta, the most conservative annelid genome sequence to date.