Main

Most cancers arise from oncogenic mutations in host cells and remain confined to the body of that host; however, a small number of transmissible cancer lineages exist in which cancer cells metastasize repeatedly to new hosts, living past the death of their original hosts as asexually reproducing unicellular organisms1. Observed cases of transmissible cancer in nature include canine transmissible venereal tumor (CTVT) in dogs2,3, two unrelated lineages of devil facial tumor disease (DFTD) in Tasmanian devils4,5 and at least eight bivalve transmissible neoplasia (BTN) lineages observed in several marine bivalve species6,7,8,9,10. Although transmissible cancers and their host genomes have been well characterized in dogs11,12,13 and Tasmanian devils14,15,16, little is known about the evolutionary history of the BTN lineages, which have only recently been recognized as transmissible cancers. Here we perform a genome-wide analysis of a BTN lineage found in the soft-shell clam (Myaarenaria) or MarBTN.

BTN is a fatal leukemia-like cancer characterized by high numbers of cancer cells in the circulatory fluid of the bivalve and dissemination into tissues in the later stages of disease. BTN cells can survive for days to weeks in seawater17,18 and likely spread from animal to animal by transmission through the water column. This cancer, referred to in the literature as disseminated neoplasia or hemic neoplasia, was first reported in soft-shell clams in the 1970s19,20 and has since been found across much of the soft-shell clam’s native range along the east coast of North America (Fig. 1a). In the 1980s in New England and in the 2000s in Prince Edward Island, Canada, severe outbreaks were documented with prevalence as high as 90% followed by severe population losses21,22. The disease is still observed throughout this range, although no more recent large-scale population die-offs have been reported. All disseminated neoplasia isolates tested in a 2015 study were shown to be of clonal origin and it was hypothesized that historical observations of the cancer dating back to the 1970s were occurrences of this same clonal lineage6; however, it is not known how long this lineage has propagated, or how the genome has evolved since the original cancer initiated. To address these and other questions, we assembled a high-quality soft-shell clam reference genome and characterized the genome evolution of the MarBTN lineage by comparative analysis of healthy clam and MarBTN sequences. We show a notable pattern of mutation occurrence and evolution, suggestive of an unstable genome with the potential to rapidly mutate despite its long-term survival.

Fig. 1: MarBTN distribution and sequencing.
figure 1

a, Locations of samples sequenced (circles) and disseminated neoplasia observations (indicated by x) along the east coast of North America. Circles colored for healthy clams (black) and MarBTN sampled from the PEI (red) or USA (blue) coast. b,c, Image of healthy clam used to assemble reference genome (MELC-2E11) (b) and hemolymph of the same clam (c), with hemocytes extending pseudopodia. The healthy reference clam (open black circle from a) was included in WGS analysis. d, Hemolymph from a clam infected with MarBTN (FFM-22F10), with distinct rounded morphology and lack of pseudopodia of cancer cells (representative of similar images from n = 8 MarBTN samples in this study). Scale bars, 10 mm (clam) and 50 µm (hemolymph). e, Phylogeny of cancer samples built from pairwise differences of SNVs not found in healthy clams, excluding regions that show evidence of LOH. Numbers along branches indicate the number of SNVs unique to and shared by individuals in that clade. All nodes have 100 of 100 bootstrap support.

Source data

Results

Sample sequencing and genome assembly

We assembled a soft-shell clam reference genome from a single healthy female clam collected from Larrabee Cove, Machiasport, Maine, USA (Fig. 1b,c; MELC-2E11). We assembled PacBio long reads into contigs using FALCON-Unzip23, scaffolded contigs to the chromosome-level with Hi-C sequences using FALCON-Phase, polished the scaffolds using 10x Chromium reads and annotated with RNA-seq reads using MAKER to yield a high-quality reference genome. The final reference genome is 1.22 Gb, organized into 17 phased scaffolds, matching the 17 chromosomes expected based on karyotype data24. The contig N50 is 3.4 Mb and the metazoan BUSCO (Benchmarking Universal Single Copy Orthologs25) score is 94.9%. Our assembly is similar in size, GC and repeat content of a recently published M.arenaria genome26 but with drastically improved contiguity and completeness (Supplementary Table 1), allowing for comprehensive genomic investigation into the evolutionary history of MarBTN.

We performed whole-genome sequencing (WGS) on three healthy uninfected clams and eight isolates of MarBTN from the hemolymph of highly infected clams (for example Fig. 1d) sampled from five locations across the established MarBTN range27 (Fig. 1a and Supplementary Table 2) and called single-nucleotide variants (SNVs) against the reference genome. Contaminating host variants were removed from MarBTN sequences via variant calling thresholds, rather than using paired tissue sequences as has been conducted for other transmissible cancers, as MarBTN hemolymph isolates were of high purity (>96% cancer DNA), whereas paired tissue samples from the host often contained high cancer DNA due to dissemination (Extended Data Fig. 1).

To investigate somatic evolution of the MarBTN lineage, it is important to distinguish between founder variants, those present in the genome of the founder clam from which the cancer initially arose, and somatic mutations, which occurred during the propagation and evolution of the cancer lineage. We observed that 10.7 million SNVs were shared by all MarBTN samples but not present in the reference genome. Of these, 8.1 million were found in at least one of the three healthy clams, indicating that these variants are likely from the germline of the founder.

A MarBTN phylogeny, built from pairwise SNV differences between samples, confirmed the previous analysis identifying two distinct sub-lineages of MarBTN6, here referred to as the Prince Edward Island (PEI) and United States of America (USA) sub-lineages (Fig. 1e). While the original founder clam is lost, we are able to leverage this deep split between the sub-lineages to identify those mutations likely to be somatic and not founder, as SNVs that occurred after the divergence of the two subgroups would be somatic. Most SNVs identified in the cancers and also found in healthy animals (and therefore highly likely to be founder variants) were present in both sub-lineages of MarBTN, but we observed some genomic regions with clusters of these founder SNVs in one sub-lineage but not the other. These are unlikely to be somatic mutations, instead they likely indicate loss-of-heterozygosity (LOH) events that took place after divergence of the sub-lineages. LOH was identified in 8% and 13% of the USA and PEI sub-lineage genomes, respectively (Extended Data Fig. 2). LOH regions were excluded during identification of somatic mutations in the following SNV analysis unless otherwise noted, since we are unable to determine which mutations are founders and which are somatic in these regions. SNVs found in all cancer samples, but no healthy samples, represent a mix of both founder variants and somatic mutations (2.3 million), whereas SNVs found in just one or the other sub-lineage represent likely somatic mutations (700,242). The majority of these SNVs were shared by all individuals in a sub-lineage and are herein referred to as ‘high-confidence somatic mutations’ (320,715 for PEI and 331,167 for USA).

Mutational biases in MarBTN

By analyzing all identified SNVs and their trinucleotide context, we observed a distinct SNV mutational bias in somatic mutations within both the PEI and USA sub-lineages that was not found in healthy clams (Fig. 2a). These biases are nearly identical in somatic SNVs from both sub-lineages and were also present in more recent mutations, such as SNVs unique to each MarBTN sample (Extended Data Fig. 3a). De novo signature extraction, which deconvolutes mutational biases in their trinucleotide context between samples28, yielded four mutational signatures (Extended Data Fig. 3b). Three signatures were found in both healthy clams and MarBTN samples and thus are likely endogenous within the germline of clam genomes. One signature closely resembles COSMIC signature 1 (termed Sig1′), showing a characteristic bias for C > T mutations at CpG sites, which is associated with the deamination of methylated CpGs in humans29. Sig1′ represents a greater fraction of mutations in the PEI sub-lineage (Extended Data Fig. 4), which may indicate that PEI has more methylated CpG sites than USA. Sig1′ also represents a greater fraction of mutations in coding regions, fitting previous observations that methylation is elevated in gene regions in bivalves30. The other two signatures are ‘flatter’ and less distinctive, most closely resembling COSMIC signatures 5 and 40 (termed Sig5′ and Sig40′), which are both associated with aging in humans31,32.

Fig. 2: Unique mutational signature found in somatic mutations dates cancer to >200 years old.
figure 2

a, Trinucleotide context of SNVs found in healthy clams (top) and high-confidence somatic mutations in PEI (middle) or USA (bottom) sub-lineages, corrected for mutational opportunities in the clam genome. The trinucleotide order is the same as in b. b, De novo extracted mutational biases for SigS. c,d, Sig5′ (c) and SigS (d) attributed mutations per Mb (signature fitting estimates with fitting error) across USA MarBTN samples (n = 5) by sampling date. Results of linear regression with 95% CI (gray) overlaid. SNVs found in healthy clams, PEI MarBTN samples or LOH regions are excluded. e, Fraction of SNVs attributed to SigS from healthy clams (black), variants found in all MarBTN samples (gray) and high-confidence somatic mutations (colored). Variants found in all MarBTN samples are divided by whether they are found in healthy clams and whether they are homozygous (hmz) or heterozygous (htz). Dashed lines display SigS fraction estimates for likely somatic mutations and likely founder variants. f, Age estimate of the most recent common ancestor (MRCA) of the USA and PEI sub-lineages using Sig5′ and SigS and of the BTN origin from SigS mutations. g, dN:dS ratios (ratio of 1 indicates neutrality) for SNVs found in healthy clams (black), SNVs found in all MarBTN samples (gray) and high-confidence somatic mutations (colored) (n = 20,075,227, 7,676,209, 2,596,657, 320,715, 331,167 and 651,882 as shown from left to right). Error bars in all plots display 95% CI.

Source data

A single signature captured the biases specific to the somatic mutations in MarBTN, termed SigS (Fig. 2b). The closest analog in the COSMIC database of human mutational signatures is signature 9, which shares a T > G bias in A/T trinucleotide contexts31. Signature 9 in humans represents mutations induced by polymerase eta during somatic hypermutation and translesion synthesis in humans31,33. This may indicate that an error-prone polymerase with similar biases to human polymerase eta is broadly upregulated in cancer or induced due to a high level of DNA lesions during MarBTN replication. In addition to the notable T > G bias in A/T contexts, there is also a notable bias toward C > A mutations compared to healthy clam SNVs, particularly CC > CA and TCT > TAT. Notably, both C > A and T > G mutations have been linked to oxidative DNA damage34. Clam hemolymph is strongly hypoxic in late stages of the disease35, so this environment may also be contributing to these mutational biases.

MarBTN is several centuries old

Signatures 1 and 5 are considered clock-like in humans and other mammals36,37 and signature 1 was used to date CTVT’s origin to 4,000–8,500 years before present12. We took advantage of the temporal distribution of our USA samples to test whether any signatures were clock-like in MarBTN. We fitted somatic mutations for each sample (SNVs not in other sub-lineages and outside LOH regions) to the four extracted signatures and regressed mutations attributed to each signature against sample collection date (Extended Data Fig. 5a). Sig1′ did not correlate with time, perhaps due to methylation changes affecting CpG > TpG mutation rates and/or inherent differences between clams and mammals. Sig5′ mutations did display a strong correlation with time within the USA samples (Fig. 2c; P = 0.013). Assuming the Sig5′ mutation rate has remained steady since USA diverged from PEI, this corresponds to the sub-lineages diverging 319 years ago (95% CI 199–801 years); however, PEI samples have 33% fewer Sig5′ mutations than USA samples, indicating that the Sig5′ mutation rate differs between sub-lineages. SigS mutations also seem to increase with time and although the correlation is not statistically significant within the USA sub-lineage (Fig. 2d; P = 0.085), the number of SigS mutations in PEI samples fall within the range predicted by the linear regression of USA samples (Extended Data Fig. 5a). Minimal deviation in the SigS accumulation over time across both sub-lineages, despite their deep divergence, indicates that the mechanism producing SigS mutations is remarkably steady, although the lack of recent PEI samples does not allow us to independently test whether SigS continues to accumulate at the same rate in PEI. Based on the rate calculated from the USA samples, the sub-lineages diverged 315 years ago (95% CI 139–infinity years), in close agreement with our Sig5′ estimate. This estimate lacks an upper bound due to the small number of USA samples and higher deviation of SigS in comparison to Sig5′; however, we can be more confident in the stability of the SigS mutation rate than Sig5′ given the consistency in SigS between the sub-lineages.

As SigS is specific to somatic mutations, we can use it to estimate how many of the mutations shared by all cancers are somatic mutations and therefore estimate how long before the sub-lineage divergence the cancer first arose in the founder clam and began horizontal transmission. SigS contributed roughly half of high-confidence somatic mutations in each sub-lineage but was virtually absent from SNVs in the healthy clam population (Fig. 2e). If we assume that the SigS mutation rate has remained constant since oncogenesis and that the founder clam SNVs have a similar profile of genomic SNVs to those observed in healthy clams, we estimate that 3.1% of heterozygous SNVs found in all cancer samples, but no healthy samples, are somatic mutations attributed to SigS. This corresponds to 108 years by the SigS rate estimate above, for a total cancer age estimate of 423 years (95% CI 187–infinity years) (Fig. 2f), long before the first recorded observations of disseminated neoplasia in soft-shell clams in the 1970s19,20.

If we also assume the fraction of SigS somatic mutations has remained constant since oncogenesis, we estimate that, in addition to the 3.1% SigS SNVs estimated above, approximately 3.7% (95% CI 3.4–4.0%) of heterozygous SNVs found in all cancer samples, but no healthy clams, are somatic mutations due to the other three signatures. Combining this estimate (116,765 mutations) with sub-lineage-specific mutations (320,715 and 331,167) we calculate a total somatic SNV estimate of 441 and 452 mutations per Mb for the PEI and USA sub-lineages, respectively. This is a much higher mutation density than that estimated for the <40-year-old DFTD lineages (DFT1, <3.1 mutations per Mb; DFT2, <1.3 mutations per Mb)15, but less than the >4,000-year-old CTVT (~867 mutations per Mb from exome data)12, showing that mutation density generally scales with age across the small number of characterized transmissible cancer lineages.

Selection on SNVs is largely neutral

We used the ratio of nonsynonymous to synonymous coding changes (dN:dS) to infer selection acting on coding regions in our sample set. After correcting for mutational opportunities in coding regions, a ratio of one indicates neutral selection, >1 indicates positive selection and <1 indicates negative/purifying selection. We used dNdScv38 to determine that the global dN:dS for healthy clam SNVs was 0.454 (95% CI 0.451–0.457), indicating that genes are generally under negative selection in clam genomes, as expected. On a gene-by-gene basis, 70% of intact coding genes (16,222 out of 23,273) in healthy clams have significantly negative dN:dS, whereas 0.4% (88 out of 23,273) are significantly positive. Genes under positive selection in hosts may be those at the host–pathogen interface that are under selection for continued nonsynonymous mutation. In the case of clams, some of these genes may be a response to MarBTN evolution itself, though this hypothesis cannot be tested by the current study.

High-confidence somatic mutations had a global dN:dS of 0.982 (95% CI 0.943–1.024), indicating that MarBTN is largely dominated by neutral selection, reflecting observations in human cancers39 and CTVT12 (Fig. 2g). We found no genes with a dN:dS ratio significantly <1, indicating that no genes are under significant negative (or purifying) selection, but we did identify five genes with a dN:dS ratio significantly >1, indicating positive selection (Supplementary Table 3). For all five of these genes, nearly all somatic mutations were found in a single sub-lineage. Only one of these genes has a dN:dS ratio above one in healthy clams, suggesting that four of five genes are truly under positive selection in only a single sub-lineage and they are not founder or host clam SNVs. The only characterized gene among the four is a TEN1-like gene that is under positive selection in the USA sub-lineage. TEN1 is a component of the CTC1–STN1–TEN1 complex, which plays a crucial role in telomere replication and genome stability40.

Widespread structural mutation

Polyploidy has been described in disseminated neoplasia in several bivalve species27,41. In M.arenaria, disseminated neoplasia cells have approximately double the chromosome count and genome content of healthy clam cells24. Given the discovery that these cells are of clonal origin6, we had hypothesized that a full genome duplication occurred early in the cancer’s evolution and that most of the MarBTN genome should be 4N. To test this theory, we called copy number states across each non-reference sample genome based on read depth (Fig. 3a). As expected, both healthy clams were 2N across nearly the entire genome (Fig. 3b). Notably, MarBTN samples displayed a wide variety of copy number states.

Fig. 3: Widespread copy number gain and structural mutation.
figure 3

a, Copy number calls across clam genome, rounded to the nearest integer (black) and unrounded (gray) in 100-kB segments. The healthy clam is a representative individual and the MarBTN sub-lineages are averages of each individual sample from that sub-lineage, which were in close agreement. b, Summary of copy number states across entire genomes for two non-reference healthy clams and MarBTN sub-lineages. Gray lines display copy number summaries for individual samples within each sub-lineage, which are in close agreement. c, Number of SVs in each sample. The reference clam was excluded as one haplotype from that animal was used to build the reference genome and thus does not contain SVs. Values were normalized to the average number of SVs in non-reference healthy clams for each SV type (numbers below SV type labels). P values are from two-sided unequal variance t-test between MarBTN samples (n = 8) and non-reference healthy clams (n = 2). Exact P values are 1.9 × 10−5, 2.9 × 10−2, 1.0 × 10−5 and 8.0 × 10−11, respectively. Labels follow DELLY abbreviations of SV types: BND, translocations; DEL, deletions; DUP, tandem duplications; INV, inversions. Bars indicate means and error bars indicate s.d. d, Size distribution of tandem duplications in each non-reference sample. Dashed line indicates 11 kB. e, Telomere length estimated by TelSeq for each sample. f, Tandem duplicate copies of the mitochondrial D-loop region per sample. Healthy clams are black, MarBTN from PEI are red and MarBTN samples from USA are blue.

Source data

PEI samples were predominantly 4N with substantial 3N and 2N portions, whereas USA samples were more evenly distributed between 4N, 3N and 2N (Fig. 3b). Copy number calls in cancer samples displayed close agreement within sub-lineages (R2 > 0.94). There was a positive correlation between copy number calls between the two sub-lineages, but large differences could be observed suggesting that copy number changes have occurred since sub-lineage divergence (R2 = 0.53–0.56) (Extended Data Fig. 6a). Variant allele frequencies (VAFs) for high-confidence somatic mutations largely support copy number calls (Extended Data Fig. 6b,c), with some off-target VAF peaks, most notably in the lower copy number regions (<3N), indicating that some of these regions have higher copy numbers than called through this method but seemed lower likely due to reduced read mapping in polymorphic genome regions.

To estimate timing of duplication events we looked at VAF in regions called CN4 across both sub-lineages (14% of the genome; Extended Data Fig. 6d–g). While the majority of founder variants were distributed around a VAF of 0.5 (2 of 4 alleles) in both sub-lineages, as expected for a CN2 > CN4 duplication, USA also had VAF distributions around 0.25 and 0.75 (1 of 4 and 3 of 4 alleles) that were absent in PEI, indicative of CN2 > CN3 > CN4 duplication where not all haplotypes duplicated evenly. Additionally, we observe more 2 of 4 high-confidence somatic mutations in PEI than USA, indicative of later duplication events. The fraction of 2 of 4 somatic mutations in the USA sub-lineage was low in nearly all CN4 segments of the genome, indicating most segments duplicated before or shortly after the USA–PEI sub-lineage split, with a low rate of duplications occurring after that time. In contrast, many segments in PEI sub-lineage have around 20% of the somatic mutations at 2 of 4 alleles, suggesting a burst of duplications at some point after the USA–PEI sub-lineage split. Overall, these frequencies indicate the USA and PEI sub-lineages arrived at CN4 largely via independent duplication events, rather than the assumed single whole-genome duplication and that duplication events have occurred at multiple points throughout MarBTN evolution.

Many mid-chromosome breakpoints were apparent in the copy number calls, indicating that the MarBTN genome has likely undergone widespread structural alterations in addition to whole-chromosome and within-chromosome copy number gain. We are unable to resolve the structure of the MarBTN genome with the short sequence reads in our current dataset but were able to call likely structural variants (SVs) from split reads. Relative to non-reference healthy clams, MarBTN samples had a significantly higher number of deletions, inversions, tandem duplications and inter-chromosomal translocations, indicating substantial somatic structural alterations (Fig. 3c).

Comparing likely somatic SVs specific to each sub-lineage, USA samples had significantly more translocations and tandem duplications than PEI (Extended Data Fig. 6h). Median somatic tandem duplication sizes displayed a distinct distribution around a mode of ~11 kB (Fig. 3d and Extended Data Fig. 6i). In human cancers, tandem duplication phenotypes of this same size distribution are thought to be driven by the loss of TP53 and BRCA1 (ref. 42), indicating that a parallel mutational process may be influencing the observed genome instability in MarBTN and more active in the USA sub-lineage.

Maintenance of telomere length is a requirement for an immortalized cell line such as MarBTN and would be necessary for long-term survival. We estimated telomere lengths for each sample and found them to be highly variable within the USA sub-lineage (8–47 kB), whereas they were short but relatively stable within the PEI sub-lineage (8–11 kB) compared to healthy clams (18–19 kB) (Fig. 3e). Variable telomere lengths in the USA sub-lineage may relate to the TEN1-like gene that is under positive selection in that sub-lineage, as the CTC1–STN1–TEN1 complex inhibits telomerase and is involved in telomere length homeostasis40.

Mitochondrial genome evolution

A tree built from pairwise mitochondrial SNV differences between samples reflects a similar phylogeny to that built from genomic SNVs (Extended Data Fig. 7a). This indicates no evidence of mitochondrial uptake or recombination with host mitochondria, which has been observed in other transmissible cancers8,43,44. Transitions were highly overrepresented in both healthy and cancer samples, with C > T mutations composing 41 of 50 likely somatic mutations (Extended Data Fig. 7b). Somatic mutations resulted in missense mutations in at least 10 of the 12 mitochondrial genes, and the genes seem to be under relaxed selection, with dN:dS ratios of 0.97 (95% CI 0.45–2.1) versus 0.26 (95% CI 0.11–0.58) for SNVs in healthy clams (Extended Data Fig. 7c).

When aligned to the published M.arenaria mitochondrial genome45, short read sequences from all MarBTN and healthy samples display increased coverage across the mitochondrial D-loop (Extended Data Fig. 7d), indicating the region is multi-copy. The D-loop is part of the non-coding control region of the mitochondrial genome and is the origin of both replication and transcription. We resolved this region with PacBio long reads from the healthy reference clam, revealing three copies in tandem. Two of the copies contain a 236-bp insertion not found in the published mitochondrial genome. The insert includes an 80-bp region with 70% guanine content, likely complicating previous PCR-based efforts to resolve it. Altogether, the observed copies extend the D-loop region of the reference clam genome from 845 bp to 2,727 bp and the full mitochondrial genome to 19,815 bp.

Read coverage of the D-loop region suggest that there have been additional somatic tandem duplications in the MarBTN mitogenome. While read coverage indicates 3–4 copies in the non-reference healthy clams, PEI MarBTN samples have 5–6 copies and USA MarBTN samples have 8–11 (Fig. 3f). These somatic tandem duplications likely arose via replication errors and the trend toward increased copies in cancer suggests that they may be under selection. Selection can act on the level of the mitogenome itself, giving it a replicative advantage over other mitogenomes (as hypothesized for CTVT) or on the level of the cancer cell, if this duplication provides cancer cells a replicative advantage over others. Notably, the mitogenome site suspected to be under selection during repeated mitochondrial capture in CTVT is in the control region44, the same region we see amplified in MarBTN.

Transposable element mobilization

MarBTN is known to contain the LTR retrotransposon, Steamer, at a much higher copy number than healthy clams, indicating likely somatic expansion46. To test whether Steamer activity is ongoing we identified Steamer insertion sites using split reads spanning Steamer and the reference genome. Only 5–11 sites were found in each healthy sample, versus 275–460 sites in each cancer sample. A total of 193 sites are shared by all cancer samples, indicating that Steamer expansion likely began early in the cancer’s evolution, whereas sub-lineage-specific Steamer integrations indicate that Steamer has continued to replicate somatically in the MarBTN genome (Fig. 4a); however, Steamer has generated more insertions within the USA sub-lineage (n = 248) than the PEI sub-lineage (n = 64), indicating the regulatory environments of the sub-lineages have not remained stable since they diverged.

Fig. 4: Somatic expansions of Steamer and other TEs.
figure 4

a, Phylogeny of all samples built from pairwise differences of Steamer insertion sites, colored by healthy (black), USA MarBTN (blue) and PEI MarBTN (red). Numbers along branches indicate the number of insertions unique to and shared by individuals in that clade, numbers on nodes indicate bootstrap support, with bootstrap values below 75 not shown. b, Logo plot of insertion bias relative to the 5-bp target site duplication (TSD) of all Steamer insertions, normalized by nucleotide content of the genome. c, Steamer insertion probability in annotated genome regions, normalized by read mapping rates and relative to full genome. Displayed for insertions found in all MarBTN samples but no healthy clams and unique to each sub-lineage but shared by all individual in that sub-lineage. Dashed line indicates expectation given random insertions. d, Volcano plot comparing copy number of all repeat elements in MarBTN and healthy clam samples by two-sided unequal variance t-test. Dashed lines correspond to significance threshold (P = 0.05, Bonferroni-corrected) and fivefold differences. Elements annotated as DNA transposons are marked in gray.

Source data

We also observed strong biases for Steamer to insert at specific genomic sequences. Steamer has a palindromic bias for NATG outside the five bp target site duplication (CATNnnnnnNATG), inserting at these locations 45× more frequently than expected by chance (Fig. 4b). Steamer was also >3× more likely to insert within 1,000 bp upstream of genes than would be expected by chance (Fig. 4c). We also observed early Steamer insertions (those found in all MarBTN samples) upstream of cancer-associated orthologs more often than expected by chance in the reverse but not the forward orientation (Extended Data Fig. 8a and Supplementary Table 4). This bias, which could indicate either an insertion preference for those locations or a selective advantage to MarBTN cells, was associated with those insertions.

We further investigated whether other transposable elements (TEs) in addition to Steamer have expanded somatically by identifying a library of repeat sequences (putative TEs) found in clam genomes and counting the copy number of each TE type in each sample. Forty-five TEs were present at a significantly higher copy number in cancer samples relative to healthy clams after removing TEs with fewer than fivefold differences (Fig. 4d). TEs annotated as DNA transposons were enriched in this dataset (8 of 45, 17.8%) compared to the total TE library (171 of 4,471, 3.8%), indicating this TE type may have been particularly successful in somatically expanding its copy number in MarBTN. LTR retrotransposons (such as Steamer) seem to have had more success in the USA versus PEI sub-lineage. Thirty-six TEs have significantly more copies in the USA sub-lineage than PEI and eight of those are LTR retrotransposons, compared to 0 LTR retrotransposons out of 20 of those more highly expanded in PEI (Extended Data Fig. 8b). Reduced copy numbers of LTR retrotransposons and other TEs in the PEI sub-lineage could be linked to the increased methylation indicated by mutational signature analysis, as methylation is thought to repress TE mobilization30,47. Our finding of widespread increases in TE copy numbers alongside structural mutations indicate general genome instability of the MarBTN lineage and provides further evidence of a higher rate of certain mutation types in the USA sub-lineage, which cannot be explained by the temporal distribution of the samples alone (Extended Data Fig. 5b).

MarBTN gene expression

To investigate the role of genes implicated in MarBTN evolution we sequenced RNA from a new set of five MarBTN isolates from the USA sub-lineage, six tissues (hemocytes, foot, gill, adductor muscle, mantle and siphon) across three healthy clams and hemocytes from an additional two clams (Supplementary Table 5). Both principal-component analysis and hierarchical clustering clearly separate MarBTN and hemocytes from all solid tissue samples (Fig. 5a and Extended Data Fig. 9a), indicating MarBTN likely originated as a hemocyte. This origin has been hypothesized due to MarBTN being most obviously detectable in the hemolymph6,48, but had not previously been tested.

Fig. 5: Expression indicates hemocyte origin and possible mutagenic pathways in MarBTN.
figure 5

a, Principal-component analysis of normalized expression across all genes, with PC1 separating MarBTN and hemocytes from all other tissues. b, Volcano plot of expression of polymerase genes (n = 28 genes) for MarBTN (n = 5 isolates) compared to hemocytes (n = 5 clams). c, Normalized expression, in reads per gene, of TP53, HSPA9 (mortalin) and BRCA1 for MarBTN (n = 5 isolates), hemocytes (n = 5 clams) and non-hemocyte tissues (n = 15: 5 tissues for three clams). Error bars display standard deviation, differential expression comparison results from Wald test displayed as *P < 0.05; **P < 1×10−5; NS, not significant. Exact P values, adjusted for multiple comparisons, are 5.5 × 10−1, 6.8 × 10−1, 8.4 × 10−8, 5.0 × 10−7, 3.3 × 10−2 and 1.6 × 10−9, respectively.

Source data

MarBTN-specific SigS resembles an error-prone polymerase signature in humans, so we first compared the expression of the 28 polymerase genes identified in the clam genome. We observed widespread upregulation across polymerases in MarBTN (Fig. 5b and Extended Data Fig. 9b), likely facilitating increased cellular replication and/or DNA damage repair. The most highly upregulated polymerase is homologous to polymerase Nu (POLN), a very low fidelity polymerase that plays a role in translesion synthesis and cross-link repair by homologous recombination49,50. Polymerase Nu frequently mis-incorporates dT opposite a template dG in humans51,52, a bias which does not match SigS; however, given the distance between bivalves and humans, it is possible that this polymerase introduces different biases in clams and is in part responsible for the observed SigS biases and/or genome instability.

We next looked at the expression of four genes under putative positive selection as identified by dN:dS (Extended Data Fig. 9c). Positive selection in cancer can indicate repeated selection for either loss-of-function or gain-of-function mutations. Two genes were not expressed in MarBTN, including the TEN1-like gene, indicating a potential loss of function, whereas two genes were upregulated in MarBTN versus healthy hemocytes, indicating a potential gain of function.

Finally, we investigated genes implicated by the distinct ~11 kB tandem duplication phenotype; TP53 and BRCA1. Previous work identified the deactivation of p53 via cytoplasmic sequestration by overexpressed mortalin53, so we investigated the expression of genes homologous to TP53 and mortalin-encoding HSPA9 (Fig. 5c). Indeed, whereas TP53 had no nonsynonymous MarBTN mutations and was not differentially regulated, HSPA9 was significantly upregulated in MarBTN samples, supporting the proposed model of inactivation by mortalin sequestration. Similarly, clam BRCA1 homolog has no obvious loss of function mutations (three missense SNVs were observed in all MarBTN samples, which do not correspond to known loss-of-function mutations and could be either somatic mutations or inherited founder variants). The tandem duplicator phenotype was reported to be strongly associated with loss of function of BRCA142 in humans, but, in MarBTN, BRCA1 was upregulated (Fig. 5c). We speculate that either (1) BRCA1 is rendered non-functional by some other mechanism (similar to p53); (2) it is functionally overwhelmed by genome instability over the long timescale of this cancer lineage, resulting in a similar phenotype to loss of function; and/or (3) a different pathway in bivalves is responsible for the tandem duplication phenotype; although we are unable to test these hypotheses in this study. Overall, MarBTN gene expression illuminates possible mechanisms behind the lineage’s observed genome instability, though much remains unknown about the forces generating and tolerating such widespread genomic alterations.

Discussion

Our genome analyses reveal a diverse set of somatic mutations occurring in MarBTN, with continued accumulation of SNVs and widespread structural mutations indicative of genome instability. It is unclear whether these mutations have consistently occurred over time or have been generated in multiple punctuated chromothripsis-like events, but the continued accumulation of these changes between the sub-lineages shows that this instability was not confined to a single ancestral event. Genomic studies of the dog and Tasmanian devil transmissible cancers have shown contrastingly stable genomes, remaining predominantly diploid, despite thousands of years of evolution in the case of CTVT11,54. Polyploidy has been reported in other BTN lineages in other bivalves27, indicating that genome instability may be a common driver mechanism or a tolerated by-product of conserved processes in BTN evolution. Of note, while there is ongoing instability in both sub-lineages, we observed differences in the number of structural mutations, duplication timing, telomere length and TE amplification between the two sub-lineages, suggesting that genome instability or mutation tolerance may have changed over time in MarBTN after the sub-lineages diverged. These changes in fundamental mutational mechanisms observed in distinct sub-lineages post-divergence highlight the fact that oncogenesis is not a single event, but an ongoing evolutionary process.

In contrast to the above unstable and variable structural processes, we observed a pattern of consistent single-nucleotide mutation biases in both sub-lineages. Most notable is the distinct profile and consistent accumulation of mutational signature S. We hypothesize that this signature is due in part to an upregulated error-prone polymerase and that its consistent accumulation may be due to consistent MarBTN replication rates over time, as seen in human somatic cells with defective proofreading polymerases55 or due to continual damage of chromosomal DNA and its repair using translesion synthesis. Both SigS and Sig5′ (analogous to the clock-like signature five in humans) generate consistent age estimates for the most recent common ancestor of our sample set and estimate the cancer is at least 200 years old, though uncertainty in the calculated mutation rates means the actual age of the cancer could be far greater. This indicates that MarBTN is likely an intermediate age compared to DFTD (<40 years16) and CTVT (4,000–8,500 years12).

We observed that the MarBTN genome is largely dominated by neutral selection, reflecting observations in human cancers39 and CTVT12, with a few notable genes under positive selection in a single sub-lineage, which may reflect selection for repeated mutations involved in critical oncogenic processes; however, we also note that selection is not simply relevant at the level of the cancer cell, but also on the level of the gene (as seen in MarBTN TE expansions), mitochondria (as seen in CTVT horizontal transfer44 and MarBTN mitogenome expansion) and hosts (as seen in DFTD56). Further analysis of MarBTN and other cancers will help us to understand how these selective forces interact to influence cancer evolution and perhaps how we can manipulate those forces to our advantage to combat conventional and transmissible cancers.

Our analysis of the MarBTN genome is presented simultaneously to an independent analysis of two lineages in the common cockle (Cerastodermaedule) or CedBTN, by Bruzos and colleagues57. CedBTN infection presents as a similar leukemia-like disseminated neoplasia phenotype to MarBTN and gene expression points toward a hemocyte origin for BTN in both species. The CedBTN genomes display signatures of ongoing instability such as MarBTN, supporting the hypothesis that genome instability is a common feature of BTN evolution and confirming that long-term survival of a cancer lineage can be maintained despite remarkably widespread and continued genome rearrangement. This level of instability might be expected to lead to an error catastrophe58, yet these cancers have continued to replicate for centuries, changing our understanding of what is possible in cancer evolution. Tandem duplications in the mitochondrial control region were also observed in both studies and may represent convergent evolution driven by the same selective mechanisms. Similar tandem duplications in the D-loop have also been observed in human cancers59,60, though the functional consequences of these mutations remain unclear. Repeated expansion of this region in independent BTN lineages, along with their long history of coevolution with their hosts, make BTNs unique model systems for the understanding of the functional significance of mitogenome mutations on cancer cell growth and the potential for selfish selection at the level of the mitogenome in cancer.

In contrast, our finding of a distinctive polymerase-associated mutational signature, evidence of positive selection, variable telomere length and amplification of the Steamer retrotransposon and other TEs may be unique features of the BTN in clams. We find no evidence of mitochondrial genome transfer events or host co-infection by multiple clones as observed in cockles, though this may be due to the smaller sample size of our study and the low level of polymorphisms in mitochondrial DNA in soft-shell clams. Given the apparent abundance of BTNs, continuing to analyze BTN lineages in other species may reveal both common and unique pathways that have allowed these cancers to repeatedly circumvent new host immune systems and spread through host populations as contagious cancers. These cancers therefore provide unique models for the understanding of cancer evolution and exemplify what genomic changes are possible in long-lived cancers evolving together with their hosts.

Methods

M. arenaria genome assembly

Reference animal collection and sequencing

Due to the high rate of heterozygosity in bivalves, a single clam was chosen to be the source of all DNA used in the generation of the reference genome and a diploid phased assembly strategy was used. The reference animal (MELC-2E11, 62 mm shell length; Fig. 1b) was collected from Larrabee Cove, Machiasport, Maine, USA in June 2018 and shipped to the Pacific Northwest Research Institute laboratories. Hemolymph was drawn from the pericardial sinus using a 0.5 in 26-gauge needle on a 3-ml syringe and it was checked for the presence of MarBTN through morphological analysis (Fig. 1c) and with a sensitive cancer-specific qPCR assay17, with no evidence of detectible BTN. Examination of the gonad region revealed the presence of eggs, showing that this individual was female.

High molecular weight (HMW) DNA, used for PacBio sequencing, was extracted from snap-frozen mantle tissue using a modified CTAB extraction protocol (adapted from elswhere61). HMW DNA was also extracted using the MagAttract HMW DNA kit (QIAGEN) and used for 10x Chromium sequencing. RNA was extracted from six tissues, frozen at −80 °C in RNAlater (Invitrogen) and RNA sequenced (1, mantle; 2, foot; 3, siphon; 5, adductor muscle; 6, gills; and 7, hemocytes). Extraction details are outlined in the Supplementary Note.

Diploid assembly, Hi-C scaffolding, gap-filling and polishing

HMW DNA extracted using the CTAB protocol was sequenced using the PacBio core facility at the University of Washington Department of Genome Sciences. Due to the high heterozygosity in bivalve genomes, the FALCON-Unzip pipeline was run to generate a diploid-aware de novo assembly. The resulting assembly can be expressed as either as two pseudo-haploid reference genomes or as a primary assembly with alternate ‘haplotigs’ in genomic regions where the two copies of the diploid genome in the reference individual differ. The purge_haplotigs pipeline62 was used to remove pairs of contigs that were called as separate primary contigs by FALCON-Unzip but which are more likely to be alternate alleles, generating a new curated assembly (Mar.3.2.3_curated.FALC.fasta).

Chromatin conformation capture data was generated using a Phase Genomics Proximo Hi-C Animal kit, which is a commercially available version of the Hi-C protocol63 and Phase Genomics’ standard Hi-C alignment protocol64.

PBJelly was run to gap-fill the scaffolded assembly using pbsuite65 (v.15.8.24, slightly modified; https://github.com/esrice/PBJelly) using blasr (v.5.1) and networkx (v.2.2) with Python v.2.7, with the protocol file Protocol_MELC.xml.

We used a phase-aware polishing strategy, modified from the pipeline described in the Vertebrate Genome Project (https://github.com/VGP/vgp-assembly/tree/master/pipeline/freebayes-polish), using 10x linked reads.

Assembly details are outlined in the Supplementary Note.

Genome annotation

RNA-seq reads from the six tissues were concatenated and used to assemble a transcriptome using Trinity (v.2.8.5)66. Repeat elements in the genome assembly were called using RepeatModeler (v.2.0) and masked using RepeatMasker (v.4.1.0)67. The genome was annotated using MAKER (v.2.31.10) and exonerate (v.2.2.0), with two rounds of SNAP training, following previous methods68. The M.arenaria transcriptome was used as input into the MAKER annotation, along with the proteins identified from five well-annotated bivalve genomes. Putative gene identification was made by BLASTP search of the uniprot database (accessed 2 March 2021) and the five well-annotated bivalve genomes using blast+ (v.2.10.0). Annotation details are outlined in the Supplementary Note.

Genome assembly statistics for the current and previous M.arenaria assemblies can be found in Supplementary Table 1. Genome size, GC content, scaffold N50 and contig N50 were calculated using BBTools stats.sh (v.38.86)69. Repeat content was estimated by running RepeatMasker (v.4.1.0) using the RepeatModeler repeat library generated above. BUSCO scores were calculated against the metazoa_odb10 database using BUSCO v.3 (ref. 25).

MarBTN genome sequence analysis

Sample collection, DNA extraction and sequencing

MarBTN samples were collected from highly neoplastic clams from Maine and New York, USA, and PEI, Canada (Fig. 1a and Supplementary Table 2). Several MarBTN samples were previously reported (those collected between 2009 and 2014)6,46 and remaining samples (those collected between 2020 and 2022) were shipped live on ice from a seafood supplier in Maine. Hemolymph was drawn and screened for highly neoplastic animals (as above) and genomic DNA was extracted using the protocol previously used (DNeasy Blood and Tissue kit, QIAGEN)6,46. Two healthy clams were collected and DNA was extracted from the siphon or mantle tissue as reported previously6, in addition to the healthy reference clam. Previous reports of likely BTN in M.arenaria (Fig. 1a; denoted by x) are described in the Supplementary Note.

All samples were sequenced on an Illumina HiSeq (paired end 150-bp reads, Genewiz). Healthy tissue and cancer hemolymph were sequenced using a full lane with a target read depth of 50×. Paired tissue samples for a subset of cancer samples were sequenced with a target read depth of 30×. Illumina sequences were purged of optical duplicates using BBTools clumpify (v.38.86)69, trimmed using trimmomatic (v.0.36) with a read quality threshold of 20 and mapped to the reference genome using BWA-MEM70 with default settings.

SNV calling

SNVs and indels were called using somatypus (v.1.3), a platypus-based variant calling pipeline designed for closely related cancer data without a paired normal sample, ideal for transmissible cancer genomes12. Variants were called as present in a healthy clam if they were called by somatypus and supported by >3 reads. For cancer samples, we used more stringent thresholds to eliminate contaminating host DNA from being called as cancer alleles. Paired host tissue samples proved to be too highly contaminated by cancer to be useful and were only used as a downstream confirmation that we were eliminating host alleles with our read thresholds. Unlike mammalian transmissible cancers, which form solid tumors and allow collection of uncontaminated healthy host DNA, BTN disseminates into the tissues of the host as the cancer progresses, resulting in tissue samples that include significant BTN cells in late stages of the disease; however, we find DNA extracted from hemolymph of animals with late-stage disease to be so highly composed of BTN cells and so few host hemocytes (Extended Data Fig. 1) that we were able to effectively remove host variants using these thresholds. Thresholds for SNV calling are described in the Supplementary Note.

We used median allele frequency of MarBTN-specific homozygous nuclear SNVs in copy number 2 regions as a proxy for cancer isolate purity and host tissue purity, as shown in Extended Data Fig. 1.

LOH region identification

To call genome regions where one of the two original founder haplotypes was lost in one sub-lineage but retained by the other sub-lineage (termed LOH for loss of heterozygosity), we focused on SNVs for which we had high confidence that they came from the founder clam germline, using methods described in the Supplementary Note. A region with germline SNVs transitioning to homozygous from heterozygous (with the ancestral heterozygous state being captured in the other sub-lineage) would indicate regions that had lost a parental haplotype in the homozygous sub-lineage. We then calculated signature S mutation fraction and dN:dS ratio for each and plotted the values against the threshold used for the test calling (Extended Data Fig. 2c–e). To validate that our LOH calling method was successfully removing LOH regions we filtered for a different set of SNVs than those used to call LOH: sub-lineage-specific founder variants (variants found in a healthy clam and all individuals of one sub-lineage but none in the other sub-lineage). The density of USA-specific founder variants SNVs was 36× higher in PEI LOH regions versus non-LOH regions and PEI-specific founder variants SNVs was 20× higher in USA LOH regions versus non-LOH regions (Extended Data Fig. 2b), confirming these regions were likely lost from the other sub-lineage.

MarBTN phylogeny

To build the phylogeny in Fig. 1e we concatenated all variant loci into an alignment for all eight cancer samples with the reference genome sequence at those loci as the tree root. SNVs found in any healthy clam samples were excluded before this analysis, as nearly all those SNVs were likely present in the founder clam. SNVs in LOH regions were also excluded to remove founder variants from the sub-lineage branches. We then used R package ‘ape’ (v.5.5) to calculate the pairwise distance between sequences using the dist.dna(model = ‘raw’) function, build a neighbor-joining tree using the and nj() function and calculated bootstrap support using the boot.phylo() function, revealing high confidence (100 of 100) at all nodes.

Mutational signature extraction and fitting

We categorized SNVs into 25 bins based on which samples they were found in and the MarBTN phylogeny (see Extended Data Fig. 10 or code reference below). We further divided each SNV bin by annotated genome regions into additional nested bins (full genome, genes, exons, CDS, 5′ UTR and 3′ UTR), with the thought that some mutational processes may have different exposures across the genome. We used Helmsman (v.1.5.2)71 to count SNVs for each bin in their trinucleotide context and R package ‘Biostrings’ (v.2.54.0) to count trinucleotide opportunities in each genome region. We performed de novo signature extraction on this dataset using R package ‘sigfit’ (v.2.0.0)72, correcting for opportunities in each genome region. The unbiased estimate for the best number of signatures to fit our data was 3, though extracting four signatures revealed a signature of unmistakable resemblance to COSMIC signature 1 (CpG > TpG), so we proceeded with four signatures. SNV bins were then reanalyzed with these four signatures, again correcting for mutational opportunities, to reveal the fraction of SNVs in each category that could be attributed to each signature.

Cancer dating

To estimate the age of the MarBTN lineage we only wanted to consider likely somatic mutations, so we excluded regions that were called as LOH in either sub-lineage from these analyses (as true founder SNVs in a region lost in one sub-lineage would appear to be unique to the other sub-lineage and could be falsely considered to have occurred after the divergence of the sub-lineages if those regions were not removed). We only included genomic SNVs for this analysis, as there were a limited number of MarBTN-specific mitochondrial SNVs and they displayed a different mutational profile than genomic mutations. We then filtered remaining SNVs from each MarBTN sample to remove any SNVs that were found in a healthy clam or the other sub-lineage using the same thresholds as described above (SNV calling). We have high confidence that the remaining SNVs for each sample should be somatic mutations that occurred since the time the two sub-lineages diverged (as the MRCA). We counted the number of mutations in their trinucleotide contexts using Helmsman71 for each MarBTN sample and fitted this to our de novo extracted mutational signatures to estimate contributions of each of the four signatures. We then performed a linear regression of the mutation count attributed to each signature for each sample against the date the sample was collected (Extended Data Fig. 5). We performed regression across USA samples only, with the thought that this set would be less susceptible to small changes in mutation rates between the sub-lineages and would not be confounded by the timing or number of copy number differences between the sub-lineages. Within the USA sub-lineage, Sig5′ was the best fit with time. When considering PEI samples, SigS seemed to be more clock-like, in that PEI samples fall within the 95% CI of the USA regression. Additionally, to test whether structural mutation types that were higher in USA than PEI were due to sampling date, we performed the same analysis on somatic tandem duplications, somatic translocations, total Steamer insertion sites and total mitochondrial D-loop copies.

The x intercept of the regressions calculated above indicates the age of the MRCA of the two sub-lineages (when mutation count separating them equals zero). To estimate the total age of the cancer, we first estimated the number of somatic SigS mutations in the trunk of the MarBTN lineage (SNVs shared by all MarBTN samples) and we then used this estimation to further estimate the total number mutations as described in the Supplementary Note.

dN:dS

We ran R package ‘dNdScv’ (v.0.0.1.0)38 to calculate global dN:dS, the overall ratio across all genes in the genome, for each SNV subset in Fig. 2g (details are provided in the Supplementary Note). We also calculated dN:dS for individual genes. We filtered for genes under significantly positive or negative selection (corrected P value < 0.05). For the five hits generated when dN:dS was run for somatic mutations, we performed an NCBI blastp query for each of these genes. We checked each gene visually/manually using IGV, noting that in each case nearly all SNVs seem to be on a single haplotype. We calculated the dN:dS for SNVs found in any healthy clam for each of these five genes, removing one that was also under positive selection in the observed healthy clam genomes (presumed to be due to missed founder variants in a gene under positive selection in the healthy clam population). Results and notes for each gene are summarized in Supplementary Table 3.

Copy number calling

Most cancer copy number calling tools rely on having paired tissue samples; we instead developed a custom copy number calling script that uses cn.mops (v.1.32.0)73 to call read depth and depth relative to the reference clam (MELC-2E11) to determine copy number, with the assumption that this reference clam is diploid. The Supplementary Note provides the details and validation with allele frequency using bedtools (v.2.29.1)74.

To estimate duplication timing, we filtered for 100-kB segments that were called CN4 in both USA and PEI sub-lineages. We calculated VAFs for founder germline variants (found in all cancers and at least one healthy sample) and for high-confidence somatic mutations in each sub-lineage by taking the mean VAF for each SNV across the five USA samples and the three PEI samples. For each 100-kB segment, we calculated the fraction of 2/4 somatic mutations by taking mutations with VAF 0.375–0.625 and dividing by total mutations.

Structural variant and telomere calling

We used DELLY (v.0.8.5)75 to call deletions, small (<100 bp) insertions, tandem duplications, inversions and translocations in each sample individually from split read mapping. DELLY is sensitive to read depth, so we subsampled all sample sequences to only include 600,000,000 reads (which is a lower count than the lowest sequenced sample) before running DELLY using ‘samtools view -s’. We only considered SVs supported by reads mapping to precise breakpoints in the genome. We used default settings, except for setting a minimum paired end read mapping quality threshold to 30 to minimize false positives. We merged all called SVs into a single file based on shared breakpoints. We removed SVs called in the reference clam from all samples and compared the number of each SV type and size of each intra-chromosomal SV type. To narrow in on high-confidence somatic SVs we then filtered out SVs found in any healthy clam or the opposite sub-lineage from each sample (similar to our approach for identifying somatic SNVs) and compared the number and size of SVs. To compare SV counts between healthy/MarBTN and USA/PEI, we used a two-sided t-test (unequal variance) and to compare sizes we used a two-sided Wilcoxon signed-rank test.

We used telseq (v.0.0.2)76 using default settings to estimate telomere lengths. Telseq takes raw bam alignments for all samples (generated above) as an input and uses TTAGGG-repeat content to estimate mean telomere length for each sample as an output (Fig. 3e).

Identifying Steamer insertion sites

We called Steamer insertion sites in all samples via a custom pipeline which uses split reads that map to both the reference genome and Steamer itself (details are provided in the Supplementary Note).

We noticed a bias for ATG in positions 7–9 in both our upstream and downstream Steamer flanking reads. To investigate this bias, we extracted the 35 bp surrounding each Steamer insertion sites from the reference genome (15 bp upstream, 5 bp TSD and 15 bp downstream) using bedtools getfasta74. We then counted the number of occurrences of each nucleotide at each position, normalized by the GC content of the genome (35%) and created logo plots using ggseqlogo. This bias held whether we looked at Steamer sites across all samples, just cancer samples, sites shared by all cancer samples, sites unique to the USA sub-lineage and sites unique to the PEI sub-lineage. For sites found in any cancer sample, we also counted the number of sites that had an ATG in positions 7–9 upstream, downstream (note ATG in read in reverse is CAT) and both upstream and downstream. Compared to the frequency expected based on the frequency of ATG in the genome (2.2% of trinucleotides), these sites were 8.5, 7.4 and 44.6 times more frequent than expected by chance, respectively.

To investigate where Steamer inserted relative to genes, we found the closest gene to each insertion site using bedtools closest74, excluding insertion sites within genes. There was a noticeable bias in the 1–2 kB upstream genes (Extended Data Fig. 8) and these genes were more likely to be cancer-associated than expected by chance, as described in the Supplementary Note.

TE copy number analysis

We did not observe Steamer in our RepeatModeler run on the reference genome, likely due to it being present at low copy number in healthy clams and thus not clearing the threshold to be called as a repeat element. To capture other repeat elements such as Steamer that might have a copy number in MarBTN but be low in the reference genome, we ran REPdenovo77, a repeat element identifier that can be run on raw WGS data, as opposed to the assembled genome required for RepeatModeler. We ran REPdenovo on the healthy reference clam (MELC-2E11), a USA MarBTN sample (MELC-A11) and a PEI MarBTN sample (PEI-DN08) to capture repeat elements at high copy number in either sub-lineage, as well as a healthy clam to control for biasing repeat element identification toward MarBTN. We then ran RepeatClassifier, a component of RepeatModeler used for classifying repeats based on sequences, on the output repeat elements.

To generate a consensus repeat library, we used CD-HIT (v.4.8.1)78 to merge the libraries generated from the RepeatModeler and REPdenovo runs, using the same CD-HIT settings as those used by RepeatModeler itself to merge repeats with greater than 80% identity (-aS 0.8 -c 0.8 -g 1 -G 0 -A 80 -M 10,000). We then used BWA-MEM to map reads from each sample to the repeat library and calculated the average read depth across each repeat element and normalized by read depth across the genome, calculated previously, to yield an estimate of the number of copies of each repeat element in each sample. Note that this copy number is relative to the haploid genome for all samples.

For each repeat element, we calculated the average copy number among our three healthy clams, eight MarBTN samples and each MarBTN sub-lineage individually. We calculated the ratio of copies in healthy clams versus MarBTN samples and PEI sub-lineage versus the USA sub-lineage, followed by a two-tailed unequal variance t-test to calculate the significance of each difference (Fig. 4d). We removed repeats with fewer than one copy in any sample, as these likely represent TEs that are only present in a subset of the clam population and would yield a highly significant difference simply due to the absence in some samples and presence in others. We additionally divided and plotted the dataset by repeat type classified by RepeatClassifier (DNA transposon, LTR, LINE, rolling circle, rRNA, simple repeat, SINE, snRNA or tRNA). We performed chi-squared tests to determine whether certain elements were higher copy number in one group versus another. The magnitude of repeat expansions may be overestimated as we are comparing an average from three difference clams to an average from eight samples of a clonal lineage; however, the strong skew toward more copies in MarBTN compared to healthy clams indicates that either (1) the founder clam had more copies of many TEs than the healthy animals sequenced here or (2) many TEs have increased their copy number through somatic expansion.

Mitochondrial analysis

We mapped each whole-genome sequenced sample to the previously published mitochondrial genome45 using BWA-MEM70. We then ran somatypus12 using default settings to call SNVs and indels. We excluded SNVs around the multi-copy region in positions 12,060–12,971. We did not see evidence of heteroplasmy outside this region, so an SNV was counted as present if it was present in a sample at >0.5 VAF. To infer relatedness of mitochondrial genotypes we built a neighbor-joining tree, as conducted for genome SNVs, from an alignment of sequences built by concatenating all variant allele positions versus the reference mitochondrial genome (170 loci).

To look at mutational biases, we included 12 possible single-nucleotide substitution types rather than the traditional 6, as the heavy/light strand differences of mtDNA result in unequal C/G and A/T in the forward or reverse direction (forward: A, 0.29%; T, 0.37%; C, 0.12%; and G, 0.23%). We counted SNVs of each substitution type for SNVs found in healthy clams (39), shared among all MarBTN samples but not found in healthy clams (13), those found in all samples of the USA (21) or PEI (26) sub-lineages and all high-confidence somatic mutations (50; those found in only a subset of MarBTN samples). We also calculated the expected number of substitutions of each type based on the nucleotide content of the mitochondrial genome assuming no mutational biases for comparison.

We used dndscv38 as described previously to calculate the global dN:dS in the mitochondrial genome. We calculated dN:dS for SNVs found in healthy clams, SNVs shared among all cancer samples but not found in healthy clams and high-confidence somatic mutations (those found in just the USA or PEI sub-lineages). 95% CIs from dndscv are quite large due to the small number of coding mitochondrial mutations in our samples used for this calculation.

We calculated the read depth at each position using SAMtools depth. To estimate the number of copies of the D-loop region, we calculated the average read depth in positions 12,300–12,500 relative to the average read depth across the full mitochondrial genome, excluding that region. This region was chosen because it is within the multi-copy D-loop region but should not have reads that border the duplication breakpoint or the insertion that is only present in some copies and may cause errors in amplification due to its G-rich sequence. Copy numbers were compared between the groups using a t-test (two-sided and unequal variance).

We confirmed the presence of a D-loop tandem duplication in a healthy clam using inverse PCR (Extended Data Fig. 7e), with outward-facing primers that would only amplify if the copies or the region are in tandem (Supplementary Table 4). PCR amplification and long-read assembly confirms tandem duplication of the region (details provided in the Supplementary Note).

RNA-sequence analysis

Samples from multiple tissues were collected and RNA was extracted/sequenced as described above for two healthy clams (to add to the previously RNA-sequenced reference clam, MELC-2E11) and for hemocytes only for two additional healthy clams. Hemolymph was drawn from five heavily diseased clams and MarBTN isolates were further purified by allowing to settle for 1 h in a 24-well plate at 4 °C. Remaining host hemocytes adhered to the plate and purified MarBTN cells were gently collected by pipetting. RNA was extracted and sequenced as described above (six samples per Illumina HiSeq 4000 lane for 20–30 million reads per sample).

We aligned reads for all samples to the indexed annotated genome using STAR (2.7.5a_2020-06-29)79 and quantified reads mapped per gene using quantMode GeneCounts. We confirmed that MarBTN isolates were all part of the USA sub-lineage at 48 of 48 mitochondrial loci differentiating USA versus PEI and that the VAFs of USA-specific mitochondrial SNVs were 96–99% in all samples, confirming high BTN purity. We merged counts per gene for all samples and ran DESeq2 (v.1.26.0)80, using tissue (or BTN) as the condition on which to test differential expression. We performed principal-component analysis by applying variance-stabilizing transformation using vst() and plotPCA() from the DESeq2 package. We determined the top tissue-specific genes for each tissue by comparing each to the five others using DESeq2, sorting by the ‘stat’ output and taking the top 100 overexpressed genes for each tissue. We normalized read counts for each sample by calculating total mapped reads and multiplying so that each sample totaled the same number of reads as the maximum sample. We then performed hierarchical clustering on expression of the 600 tissue-specific genes using the pheatmap package with clustering_distance_cols = ‘canberra’. For individual gene comparisons of MarBTN versus healthy samples, we compared MarBTN separately to hemocytes and to non-hemocyte solid tissues. Bar plots are comparisons of normalized read counts per gene, whereas statistical results for differential expression are adjusted P values from DESeq2.

Statistics and reproducibility

No statistical method was used to predetermine sample size, but our sample sizes are similar to those reported in previous publications11,14. Two cancer samples were excluded from this analysis due to high host contamination of samples, as described in Extended Data Fig. 1. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. For all t-tests, data distribution was assumed to be normal but this was not formally tested.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.