Main

Rapid regeneration from tiny pieces of tissue makes planarians a prime model system for regeneration. Abundant adult pluripotent stem cells, termed neoblasts, power regeneration and the continuous turnover of all cell types1,2,3, and transplantation of a single neoblast can rescue a lethally irradiated animal4. Planarians therefore also constitute a prime model system for stem cell pluripotency and its evolutionary underpinnings5. The taxonomic clade Platyhelminthes (‘flatworms’) also includes parasitic lineages that have substantial effects on human health, such as blood flukes (Trematoda) and tapeworms (Cestoda)6. Here, the phylogenetic position of planarians as free-living flatworms7 provides a reference point towards an understanding of the evolution of parasitism8.

Despite the modest genome sizes of planarians (mostly in the range of 1–2 gigabase pairs (Gb)), genome resources relating to these animals are limited. Although the model species S. mediterranea was sequenced by Sanger sequencing, even 11.6× coverage of around 600-bp Sanger reads yielded only a highly fragmented assembly (N50 19 kb)9. Recent high-coverage, short-read approaches yielded similarly fragmented assemblies10,11. The high A–T content (about 70%) represents one known assembly challenge. Furthermore, standard DNA isolation procedures perform poorly on planarians, which has so far precluded the application of long-read sequencing approaches or BAC-clone scaffolding.

We here report a highly contiguous PacBio SMRT long-read sequencing12 assembly of the S. mediterranea genome. Giant gypsy/Ty3 retroelements, abundant AT-rich microsatellites and inbreeding-resistant heterozygosity collectively provide an explanation for why previous short-read approaches were unsuccessful. We find a loss of gene synteny in the genome of S. mediterranea and other flatworms. In analysis of highly conserved genes, we find a loss of MAD1 and MAD2, suggesting a MAD1–MAD2-independent spindle assembly checkpoint (SAC)13,14. Our S. mediterranea genome assembly provides a resource for probing the evolutionary plasticity of core cell biological mechanisms, as well as the genomic underpinnings of regeneration and the many other phenomena that planarians expose to experimental scrutiny.

De novo long read assembly of the planarian genome

In preparation for genome sequencing, we inbred the sexual strain of S. mediterranea (Fig. 1a) for more than 17 successive sib-mating generations in the hope of decreasing heterozygosity. We also developed a new DNA isolation protocol that meets the purity and high molecular weight requirements of PacBio long-read sequencing12 (Extended Data Fig. 1a–d, Supplementary Information S1, S2). We used MARVEL, a new long-read genome assembler developed for low complexity read data15 (Supplementary Information S3). An initial de novo MARVEL assembly of reads of more than 4 kb with approximately 60× genome coverage showed an improvement over the PacBio assembly tool (Canu16) and substantial improvements over existing S. mediterranea assemblies based on short read sequencing (Extended Data Table 1). We further made use of the Chicago/HiRise in vitro proximity ligation method17 for scaffolding (Extended Data Fig. 1e, Supplementary Information S4). The polished haplotype-filtered (see below) and error-corrected (Supplementary Information S5) S. mediterranea assembly consists of 481 scaffolds with an N50 length of 3.85 Mb (Extended Data Table 1).
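The N50 values quoted for these assemblies summarize contiguity: N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. A minimal Python sketch of that calculation is given here for orientation; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the MARVEL pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Hypothetical helper for illustration only.
    """
    if not lengths:
        raise ValueError("need at least one length")
    # Walk through the scaffolds from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that brings the running sum to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Toy example (made-up lengths, total = 300): the two longest
# scaffolds (80 + 70 = 150) reach half of the total.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # prints 70
```

Because the statistic depends only on the sorted length distribution, a few very long scaffolds can raise the N50 substantially, which is why it is reported alongside the scaffold count.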

Figure 1: Long-range contiguous genome assembly of S. mediterranea.

a, Top, individual of the sequenced sexual strain. Bottom left, egg cocoons. Bottom right, karyotype (2N = 8). Scale bars, 2 mm (top) and 2.5 μm (bottom right). b, Chicago quality control of the assembly. c, Treemap comparison between the MARVEL S. mediterranea assembly and the most contiguous existing Sanger S. mediterranea assembly10. Squares encode the relative contributions of individual scaffolds or contigs to assembly size.

To assess the quality of this genome assembly, we back-mapped a transcriptome of the sequenced strain (Supplementary Information S6) and found that more than 99% of transcripts were mapped, thus confirming that the assembly was both near-complete and accurate (Supplementary Information S7, Extended Data Fig. 1f, g). To assess the contiguity of the global assembly, we analysed structural conflicts between the MARVEL assembly and Chicago/HiRise scaffolding. Out of 51 such events across the 782.1 Mb of assembled genome sequence, only two represented unambiguous MARVEL assembly mistakes (Fig. 1b, Supplementary Information S4.3). Furthermore, high-stringency back-mapping of high-confidence cDNA sequences (Supplementary Information S7.3) confirmed assembly contiguity below the approximately 1-kb resolution limit of the Chicago/HiRise method, with small-scale sequence duplications near assembly gaps as only minor inconsistencies (Extended Data Fig. 2).

Our S. mediterranea genome assembly represents a major improvement over existing S. mediterranea assemblies10 (Fig. 1c) and, to our knowledge, is the first long-range contiguous assembly of the genome of a non-parasitic flatworm species. A UCSC genome browser instance with supplementary quality control, annotation and experimental data tracks (Supplementary Information S8) is available at PlanMine18 (http://planmine.mpi-cbg.de). All analyses in this manuscript refer to the assembly release version dd_Smed_g4. The current source code of the MARVEL assembler is available at https://github.com/schloi/MARVEL. The execution scripts used for S. mediterranea can be found in the smed subfolder of the examples folder.

Assembly challenges in the S. mediterranea genome

To understand why the S. mediterranea genome was recalcitrant to earlier short-read assembly, we first analysed its repeat content (Supplementary Information S9). The genome has a repetitive fraction of 61.7% (Fig. 2a), substantially exceeding the 38% or 46% repeat content of the mouse or human genomes, respectively19. We detected more than 7,000 insertions of 11 distinct families of long terminal repeat (LTR) retroelements (Fig. 2b, Extended Data Fig. 3a, Supplementary Information S10). These do not cluster with known Metaviridae (Fig. 2b), suggesting that they represent either extremely divergent or so far undescribed retroelement families. Three LTR families were more than 30 kb long, an exceptional size that is more than three times longer than the 5–10 kb typically observed in vertebrates (Fig. 2c, Extended Data Fig. 3b). The only known similar-sized LTRs are the plant-specific Ogre elements20, which is why we refer to the giant S. mediterranea repeat families as Burro (big, unknown repeat rivalling ogre; Supplementary Information S10.3). Burro elements are pervasively transcribed (Extended Data Fig. 3c, d, Supplementary Information S10.4), yet their high degree of intra-family sequence divergence suggests a relatively ancient invasion (Supplementary Table 1, Supplementary Information S10.5, Extended Data Fig. 3e). Burro-1, the most abundant giant retroelement with 130 fully assembled copies, is highly overrepresented at contig ends, and 50% of all current scaffolds terminate in a Burro-1 element (Fig. 2d, Supplementary Information S10.6). Therefore, these abundant, over 30-kb repeat elements still limit the size of the current assembly. In addition, abundant AT-rich microsatellite regions disrupt the alignment of spanning reads and thus also reduce contig contiguity (Extended Data Fig. 4, Supplementary Information S11). Finally, the S. mediterranea assembly graphs showed substantial structural heterogeneity (Supplementary Information S12) in the form of bubbles (transient divergences in sequencing read alignments) and spurs (divergences without re-connection), which were largely absent from a comparable genome assembly (Drosophila melanogaster using PacBio sequencing and MARVEL assembly; Fig. 2e, Supplementary Information S12.1) or assemblies of 17 other species (Supplementary Table 2). Heterozygous mobile element insertions and microsatellite tracts were prominent causes of assembly divergences (Fig. 2f, Extended Data Fig. 4d, Supplementary Information S12.3). The persistence of substantial genomic heterozygosity in spite of more than 17 successive sib-mating generations confirms that meiotic recombination is inefficient in S. mediterranea21.

Figure 2: S. mediterranea assembly challenges.

a, Repeat content of the assembly. b, LTR family phylogeny. Known LTR families are shown in colour, S. mediterranea LTR families in black. Red arcs delimit clusters for consensus calculation. Scale bar: 0.2 substitutions per site. c, Domain annotation of the 11 S. mediterranea LTR families (SLFs). d, Enrichment analysis of indicated repeat elements within the terminal 1,000 bp of all scaffolds (n = 962). Expected values represent mean repeat frequency with 95% bootstrap confidence interval (n = 1,000). e, Graphical representation of representative S. mediterranea (left, ~1.6 Mb) and D. melanogaster (right, ~1.7 Mb) MARVEL PacBio assembly graph segments. Thick lines, consensus sequence; thin lines, individual read alignments; colour-coding, alignment quality (blue, low; red, high; see spectra at bottom); black marks, repeats. The contig tour of the final haploid genome assembly is shown offset to the right and alternative regions are shown in red. f, Dot plot comparison between a representative alternative region and the corresponding main contig. Fwd, forward match; Rev, reverse match; Break, insertions or deletions over 99 bp. Break annotations (1–11, right) list repeat categories that cover more than 60% of the insertion/deletion sequence; ‘mixed’ indicates contributions of multiple repeat classes.
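The expected values in panel d are given as a mean with a 95% bootstrap confidence interval. As a rough illustration of how such an interval can be obtained, the following Python sketch applies a percentile bootstrap to a list of 0/1 flags marking whether a sampled 1,000-bp window overlaps a given repeat family. The function name `bootstrap_mean_ci`, the toy flags and the choice of the percentile method are assumptions made for this example; the procedure actually used is described in the Supplementary Information and may differ.

```python
import random


def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Hypothetical helper for illustration; returns (mean, lower, upper).
    """
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    # Resample with replacement n_boot times and record each mean.
    boot_means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_boot)
    )
    # Cut off alpha/2 of the resampled means at each tail.
    tail = int(n_boot * alpha / 2)
    lower = boot_means[tail]
    upper = boot_means[n_boot - 1 - tail]
    return sum(values) / n, lower, upper


# Toy data (made up): 30 of 100 sampled windows overlap the repeat family.
flags = [1] * 30 + [0] * 70
mean, lower, upper = bootstrap_mean_ci(flags)
print(f"expected frequency {mean:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An observed terminal frequency that lies well above the upper bound of such an interval would count as enrichment at scaffold ends in this simplified picture.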

Overall, the combination of giant repeat elements, low-complexity regions and inbreeding-resistant heterozygosity provides an explanation for why previous short-read sequencing assemblies of S. mediterranea have proven so challenging. The long-range contiguity that we achieved in the S. mediterranea genome assembly, and similarly substantial improvements in the PacBio genome assembly of the flatworm Macrostomum lignano22 (Supplementary Table 2), further emphasize the improvements that a combination of long-read sequencing with the MARVEL assembler offers in the assembly of challenging genomes.

Comparative analysis of the planarian gene complement

We next annotated the S. mediterranea gene complement, relying on our planarian transcriptome resources18 (Supplementary Information S13). Our analysis showed a high divergence of S. mediterranea gene sequences (Supplementary Information S14), on par with Caenorhabditis elegans (Fig. 3a). By contrast, the low degree of sequence substitutions between the sexual and asexual S. mediterranea strains (Fig. 3a) and nearly identical mapping statistics of the two transcriptomes to the genome (Supplementary Information S7.1, Extended Data Fig. 1f) establish the utility of our assembly for both strains.

Figure 3: Genome divergence of S. mediterranea and other flatworms.

a, Protein sequence divergence amongst 51 single copy genes (Supplementary Table 3). Branch length shows substitutions per site. Red, flatworms; blue, lophotrochozoan outgroups. b, Whole genome alignments of S. mediterranea, M. lignano and H. sapiens against the indicated reference genomes. The distributions of the alignment score (top) and alignment span (bottom) of the top 10,000 chains of co-linear alignments are shown as box plots, with boxes indicating the first quartile, median and third quartile with whiskers extending up to 1.5 times the interquartile distance. Outliers are defined as more than 1.5 times the interquartile distance and are shown as dots. c, Presence (green) or absence (red) of highly conserved genes in the indicated species. The yellow box highlights S. mediterranea. Asterisks mark homologues secondarily identified by manual searches.

To evaluate the S. mediterranea genome structure, we performed whole-genome alignments (Supplementary Information S15) with the available parasitic flatworm genomes6 and a draft genome of the platyhelminth M. lignano22 (Fig. 3b). The highest alignment similarity was found between S. mediterranea and the parasitic flatworm Schistosoma mansoni, which is consistent with the platyhelminth phylogeny7. However, alignments were mostly limited to individual exons of specific genes, irrespective of the quality of the various assemblies (Extended Data Fig. 5a, b). In general, flatworm genome comparisons resulted in alignment chains that were much shorter and lower scoring than those obtained from comparisons across the tetrapod (human–frog) or vertebrate (human–zebrafish) clades (Fig. 3b). Together with more than 1,000 likely planarian-specific protein coding genes (Supplementary Information S16, Supplementary Table 5, Extended Data Fig. 6a–g), our data show a high degree of genome divergence between S. mediterranea and other flatworms.

We therefore next investigated gene loss in planarians. Our analysis deliberately focused on highly conserved genes, such that the absence of sequence similarity alone provides a strong indication of loss (Supplementary Information S17). We identified 452 highly conserved genes that were lost in both S. mediterranea and other planarians (Fig. 3c), which compares to 284 and 757 such losses in D. melanogaster and C. elegans, respectively (Extended Data Fig. 5c). Gene loss in planarians is therefore broadly in the same range as in established invertebrate model organisms. However, the lost genes included 124 homologues of genes that are essential in humans or mice (Supplementary Table 6) and are generally key components of multiple cell biological core mechanisms (Fig. 3c). Specifically, planarians lack multiple highly conserved components of DNA double-stranded break (DSB) repair, including RAD52, XRCC4, NHEJ1 (also known as XLF), SMC5, SMC6 and the entire condensin II complex23. A possibly consequent reliance on mutagenic DSB repair pathways (for example, micro-homology-mediated end joining)24 could account for both the abundance of microsatellite repeats and the structural divergence of the S. mediterranea genome (Fig. 3b), but raises questions regarding the extraordinary resistance of planarians to DSB-inducing γ-irradiation4.

Planarians are also lacking recognizable homologues of key metabolic genes. Loss of the fatty acid synthase (FASN) gene is striking in the face of its essential role in de novo fatty acid synthesis in eukaryotes, and may indicate that planarians are particularly dependent on dietary lipids. The loss of the haem breakdown enzyme genes HMOX1 and BLVRB despite maintained haem biosynthesis capacity25 is similarly unusual for a free-living eukaryote (both are lost in C. elegans26). Remarkably, the above and multiple other genes were missing not only in planarians, but also in the parasite genomes6 and the transcriptome of the macrostomid M. lignano27 (Fig. 3c). Given their broad conservation in the lophotrochozoan sister clade, the broad absence of these genes in flatworms is likely to represent an ancestral loss. This complicates, for example, the interpretation of FASN loss in the parasitic lineages as a specific adaptation to parasitism6. Conversely, the absence of key metabolic genes as phylogenetic signal underscores the utility of free-living flatworms as model systems for the parasitic lineages and the development of anti-helminth reagents8.

A MAD1–MAD2-independent spindle checkpoint?

The apparent absence of MAD1 and MAD2 in planarians (Fig. 3c) raises the question of whether planarians have a functional SAC, and how essential cellular functions can be maintained in the absence of supposed core components. Both MAD1 and MAD2 (also known as MAD1L1 and MAD2L1) are near-universally conserved owing to their essential roles in the SAC, which guards against aneuploidy28 by inhibiting cell cycle progression as long as even a single chromosome remains unattached to the mitotic spindle14. Although MAD1 and MAD2 homologues are easily identifiable in all other flatworms examined (Extended Data Figs 7, 8), not even flatworm queries could identify significant homologues in S. mediterranea or the transcriptomes of five other planarian species. Therefore, planarians are likely to have lost MAD1, MAD2 and multiple other SAC components (Fig. 4a). The known M-phase arrest of planarian cells upon pharmacological interference with spindle function29 (Fig. 4b) is therefore remarkable, as it indicates the maintenance of a SAC-like response despite a lack of supposed SAC core components.

Figure 4: Spindle assembly checkpoint (SAC) function in the likely absence of MAD1–MAD2.

a, Cartoon illustration of SAC core components and function. Black and red denote components conserved or missing in S. mediterranea, respectively. KMN network: KNL1, MIS12 complex, NDC80 complex. b, Fractional abundance of mitotic cells under RNAi targeting the indicated SAC component genes, with (red) and without (cyan) nocodazole pre-treatment. Values are shown as mean with 95% confidence intervals (n = 4 biological replicates, 10 pooled animals, 5 technical replicates with 5 or 6 images each). Cells treated with RNAi targeting CDC20 are shown as single replicates owing to rapid stem cell loss (Supplementary Information S18, Extended Data Fig. 9a, b). Significance assessed by two-way ANOVA, followed by Dunnett’s post-hoc test (****P < 0.0001; NS, not significant), excluding RNAi targeting CDC20.

To explore the underlying mechanisms of the SAC-like response in S. mediterranea, we targeted remaining components of the SAC network (Fig. 4a) by RNA interference (RNAi) and quantified the fraction of M-phase arrested cells with or without the microtubule depolymerizing drug nocodazole (Fig. 4b, Supplementary Information S18). The marked increase in the proportion of M-phase cells and the subsequent loss of dividing cells under RNAi targeting CDC20 (Fig. 4b, Extended Data Fig. 9a) or the anaphase-promoting complex/cyclosome (APC/C) subunit gene CDC2330 indicate that APC/C inhibition remains rate limiting for progression from M-phase in planarians. The SAC-mediated regulation of CDC20 in human cells involves the recruitment of MAD1 and MAD2 to the kinetochore by two molecular complexes thought to act in parallel, the broadly conserved KNL1–BUB3–BUB1 (KBB) complex and the ROD–ZW10–ZWILCH (RZZ) complex, which has been studied less because of its absence in yeast31 (Fig. 4a). The lack of clear KNL1 and MIS12 homologues, and of a cell-cycle phenotype of RNAi targeting BUB3 (Fig. 4b), indicates that planarians have lost KBB complex function. However, we could identify clear RZZ complex homologues and, notably, knockdown of these homologues prevented nocodazole-mediated M-phase arrest without affecting basal stem cell numbers or proliferation (Fig. 4b, Extended Data Fig. 9b). Therefore, planarian RZZ components control APC/C–CDC20 either independently of MAD1 and MAD2 or in concert with homologues that have lost defining sequence features (Extended Data Figs 6, 7). Our results motivate the examination of putative MAD1 and MAD2-independent roles of the RZZ complex in other model systems and, together with the striking evolutionary plasticity of the SAC network in eukaryotes13, generally challenge our understanding of a core cell biological mechanism.

Discussion

We have described the highly contiguous genome sequence of the planarian model species S. mediterranea, which enables the genomic analysis of whole-body regeneration, stem cell pluripotency, lack of organismal ageing and other notable features of this model system. The resulting bird’s eye view of a ‘difficult’ genome using long-read sequencing and de novo assembly also highlights important challenges that remain to be overcome. In the case of S. mediterranea, these include an abundance of low-complexity microsatellite repeats, inbreeding-resistant heterozygosity and a new class of extraordinarily long LTR elements. However, the fact that the scaffold size of newly reported genome assemblies often remains substantially below the 3.85 Mb of the S. mediterranea assembly (Extended Data Table 1) indicates that similar challenges may be widespread. We therefore expect that the specific improvements of the MARVEL assembler towards heterozygous and/or compositionally biased sequencing data15 will be useful for enhancing assembly contiguity in de novo genome sequencing projects.

We have also found a high degree of structural rearrangement and the absence of a number of conserved genes in the S. mediterranea genome. However, D. melanogaster, C. elegans and other animals also show loss of ‘essential’ genes13,26,32, which raises a general conundrum: how can animals survive and compete while lacking core components of essential mechanisms? In cell biological terminology, a core mechanism signifies a chain of molecular interactions that explain a given process in multiple species, while essentiality indicates importance for organismal survival. The emergence of viable yeast strains upon deletion of essential genes33 or the competitiveness of hundreds of extant planarian species in a diversity of habitats worldwide34 both make it clear that essentiality is relative. The demonstration of SAC function in the likely absence of MAD1 and MAD2 suggests that our genetic and mechanistic understanding of SAC function is incomplete. Further studies on planarians and other ‘non-traditional’ model organisms are needed to understand the basis and mechanism of these cellular functions. Such a function-oriented, rather than gene-centric, view of biological mechanisms abstracts general function from individual molecules and is therefore likely to ultimately facilitate the reverse engineering of biology.

Data Availability

The S. mediterranea genome assembly is accessible at GenBank under accession number NNSW00000000 and can also be browsed at and downloaded from http://planmine.mpi-cbg.de. All DNA and RNA reads were deposited at the Sequence Read Archive under the bioproject accession PRJNA379262 and under the following SRA accession numbers: PacBio P4/C2 data, SRX2700681 and SRX2700682; PacBio P6/C4 data, SRX2700683; PacBio CCS data, SRX2700684; DNA shotgun, SRX2700686; DNA Chicago, SRX2700687; and RNA-seq, SRX2700685.

Code Availability

The current source code of the MARVEL assembler is available at https://github.com/schloi/MARVEL. The execution scripts used for S. mediterranea can be found in the smed subfolder of the examples folder.