Emergence of novel cephalopod gene regulation and expression through large-scale genome reorganization

Coleoid cephalopods (squid, cuttlefish, octopus) have the largest nervous system among invertebrates that together with many lineage-specific morphological traits enables complex behaviors. The genomic basis underlying these innovations remains unknown. Using comparative and functional genomics in the model squid Euprymna scolopes, we reveal the unique genomic, topological, and regulatory organization of cephalopod genomes. We show that coleoid cephalopod genomes have been extensively restructured compared to other animals, leading to the emergence of hundreds of tightly linked and evolutionary unique gene clusters (microsyntenies). Such novel microsyntenies correspond to topological compartments with a distinct regulatory structure and contribute to complex expression patterns. In particular, we identify a set of microsyntenies associated with cephalopod innovations (MACIs) broadly enriched in cephalopod nervous system expression. We posit that the emergence of MACIs was instrumental to cephalopod nervous system evolution and propose that microsyntenic profiling will be central to understanding cephalopod innovations.

Restriction digestion with HindIII (from NEB; 2,000 U per 0.25 million cells) was performed overnight with agitation (950 rpm at 37°C). After chromatin digestion, the restriction sites were filled with biotin-14-dATP (Thermo Fisher (Waltham, Massachusetts, United States)), dCTP, dGTP, and dTTP, with Klenow (50 U per 0.5 million cells) for 1 h at 37°C with repeated agitation (700 rpm 10 sec and rest 30 sec for 1 h in a thermal cycler).
Ligation was performed overnight at 18°C (2000 U of T4 DNA ligase). After ligation, crosslinking was reversed by proteinase K treatment overnight at 65°C. An additional proteinase K incubation at 65°C for 2 h was followed with RNase A treatment and two sequential phenol/chloroform extractions. After DNA precipitation, the DNA was spun down (centrifugation with max-speed for 30 min at 4°C). The pellets were resuspended in 20 μl TE and the DNA concentration was determined using Qubit 2 device (Thermo Fisher). 20 μg of Biotinylated DNA was used for the library preparation. The biotin from non-ligated fragment ends was removed with T4 DNA polymerase (NEB) for 30 min at 37°C and EDTA was added to stop the reaction (10 mM f. c.). DNA was sonicated using the Covaris system to generate DNA fragments with a size peak around 400 bp (Covaris S2 settings: duty factor: 10%; peak incident power: 5; time: 60 sec). After end repair with T4 DNA polymerase, Klenow Large fragment and T4 DNA polynucleotide kinase in the presence of dNTPs in T4 DNA ligation buffer (for 30 min at room temperature), the DNA was purified (QIAGEN mini purification kit (QIAGEN ( Hilden, Germany))).
A double-size selection using DNA purification beads was performed and the size selected DNA was eluted in Elution buffer (QIAGEN). Biotinylated ligation products were isolated using prewashed MyOne Streptavidin C1 Dynabeads (Life Technologies) on a magnet stand in binding buffer (5 mM Tris pH8, 0.5 mM EDTA, 1 M NaCl) for 30 min at room temperature. dA-tailing was carried on beads: dATP was added with Klenow exo-(for 1h at 3°C, NEB), then the enzyme was heat-inactivated (20 min at 65°C). After two washes in binding buffer and one wash in T4 DNA ligation buffer, Illumina adapters were ligated onto Hi-C ligation products bound to streptavidin beads in T4 DNA ligase by slow rotation (for 2 h at room temperature). After washing twice with wash buffer (5 mM Tris pH 8.0, 0.5 mM EDTA, 1 M NaCl, 0.05 % Tween -20) and then once with binding buffer, the DNA-bound beads were resuspended in a final volume of 20 μl 1x NEBBuffer2. Captured biotinylated Hi-C DNA was amplified by 9 cycles PCR amplification (with NEBNext® High-Fidelity 2X PCR Master Mix (NEB)). After PCR amplification, the Hi-C libraries were purified with DNA purification beads. Hi-C libraries were sequenced by paired-end 50 bp mode at VBCF NGS facility (Vienna BioCenter, Vienna, Austria).

Genome scaffolding
Genome scaffolding based on Hi-C data was done using Lachesis 8 . Scaffolds of the published Euprymna scolopes genome 9 were filtered to be at least 50k in size (changing the N50 from 3.5 MBp to 3.7 MBp) and used as draft de-novo assembly. Bam files aligned to the draft assembly of the two Hi-C datasets were used as Hi-C read input. Alignment was done as a step in the HiC-Pro 10 pipeline, thus HiC-Pro settings for bowtie 11 were used (see next section). As the exact number of chromosomes is unknown for E. scolopes different numbers for CLUSTER_N (expected chromosome number) were tried (35, 40, 45, 46, 48, 50,

Annotation lift-over
Genes were lifted over from the published annotation using an in-house script. The script uses the sizes of old and new scaffolds and the ordering files provided by Lachesis, which contain a list of all scaffolds assembled to a pseudo-chromosome in the assembled order. Between each scaffold 1001 bp are added as Ns.

Hi-C mapping and contact matrices
Two biological Hi-C replicates were mapped against the unmasked reference genome after filtering out scaffolds smaller than 50k, without trimming, using HiC-Pro 10 . Bowtie settings were: Global options: --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder. Local options= --very-sensitive -L 20 --score-min L,-0  15,16 . Blocks were constrained to consist of least three genes, with no more than five intervening genes. The maximum number of paralogues was set to 100. Further filtering settings: minimum overlap 0.3 (at least 30% of orthogroups must overlap in any given syntenic loci pair) and minimum species overlap 0.5 (orthogroup present in at least half of the species in the block).
Idiosepius paradoxus, Doryteuthis pealeii and Acanthaster plancii which were used for the ortholog assignment were excluded from further analysis, thus 24 species were used for clustering. Using the resulting clustering file, we defined metazoan syntenies to be present in at least seven species out of the 24 initial species. Cephalopod-specific synteny was defined to be present in at least two cephalopods, but none of the other species. Then clusters present in E. scolopes were used for further analysis, thus all clusters that are present in E. scolopes and six other species (275 clusters) were used for the analysis of ancestral, metazoan clusters in E.
scolopes and all clusters present in E. scolopes and shared with at least one other cephalopod (505 clusters, of which five are paralogous clusters) were used for the analysis of novel, cephalopod-specific clusters.

Counting of novel microsyntenies
Novel microsyntenies emerging at specific branches of the taxon tree (Supplementary Figure 2) were filtered. Each microsyntenic cluster had to contain at least wto species out of a list defining the group (e.g., cephalopods -Octopus bimaculoides, Callistoctopus minor, Euprymna scolopes) and no species that was not in the list.
Those pre-filtered microsyntenic clusters were then used to calculate the number of novel micro-syntenic clusters for each branch. Paralogous clusters were excluded by using the clustering file with information of all connections between species in each syntenic block and only counting each species once (even if they have more than one connection to a syntenic block). Specific filtering parameters are listed in Supplementary Table 3. Random microsyntenic blocks were modelled after the distribution of observed microsyntenic blocks as described in 16 . Randomized syntenic blocks were computed as described in 16 (randomization for the phylogenetic tree was computed as described in 15 see below) Twenty randomizations were modelled after the distribution of either cephalopod-specific or metazoan synteny, resulting in 10101 and 5501 random blocks respectively.

Circos plot
Files were fomatted as follows: if a syntenic block was present in at least seven metazoans out of 24 species, it was counted as metazoan. If it was present in at least two cephalopods, but no other species, it was counted as cephalopod. Paralogous clusters were excluded. If a block was counted in at least five molluscs but no other species it was defined as mollusk specific. Results were then filtered to a subset of species to make the plot more easily readable. The circos plot was then plotted for 14 of the species in R using the circlize 17 package.

Density analysis of micro-syntenies, microsyteny distribution and karyoplots
The distribution of length in bp of metazoan microsyntenies and cephalopod-specific microsyntenies in E. scolopes (computed as described in 1.1-Microsyntenies) from start of first gene to start of last gene in the syntenic block was computed in with ggplot2's (https://ggplot2.tidyverse.org/) kernel density estimate. The same was done for the density of gene-counts of micro-syntenic blocks. Intergenic distances were calculated with a custom script, which uses a bed file as input that only contains whole genes (start-stop) and only the longest mapped transcripts for gene locations. Karyoplots for both Mizuhopecten yassoensis and Euprymna scolopes were plotted with KaryoplotR 18 .

GO term enrichment analysis
Go term enrichment analysis of microsyntenic clusters was done with the topGO package 19 . E.

Synteny and TAD composition
Mizuhopecten yessoensis and E. scolopes pseudo-chromosomes were plotted with KaryoplotR as described in 1.3. Euprymna Hi-C maps, RNA-seq and Atac-seq tracks were plotted with HiCExplorer. First, HiC-Pro output files were converted to h5 format with the hicConvertFormat function, specifying the chromosome of interest (hicConvertFormat --matrices Sample1_40000_abs.bed --resolutions 40000).
Mapped RNA-seq bam files (see also Supplementary Note 8) were converted to bigwig format using samtools 24 and deeptools 25  for one of each sample using the Genrich (https://github.com/jsh58/Genrich) narrowpeak file. All of these were added to an ini file and then plotted following hicexplorers tutorial.

TAD averaging
To understand the distribution of microsyntenies within TADs (as predicted by HiC explorer) we found the center of each microsynteny and its location in a TAD, which was then normalized to The regions were then checked for enriched motifs using homer 26 findMotifsGenome.pl with default -size (200) parameter against the non-masked genome.

Protein analysis
An smc1 orthogroup was identified from the orthofinder output. CTCF and smc3 proteins could not be identified in orthogroups, thus smc3 sequences were downloaded from NCBI. CTCF sequences were kindly provided by 27 and only one sequence for each species was used. E.
scolopes orthologues were identified using BLASTP 28 . Pseudo-energy of 3D chromosome conformations, calculated as the sum of kinetic and potential energy in the system, is monitored throughout the simulation as an indicator of convergence.

The chromatin bead dynamics
The dynamics of a coarse-grained chromatin bead is governed by the following Newtonian equation of motion: where ⃗ a i and ⃗ v i are the instantaneous acceleration and velocity of the bead, respectively; m is the mass of the bead; γ is the drag coefficient; Hi−C are forces implemented in the model to characterize the mutual volume exclusion between beads, the interaction between genomically consecutive beads, and the interaction between genomically distant beads with high Hi-C frequency. Computationally, Verlet integration is applied to calculate the trajectories of chromosome beads over time.

The volume exclusion force
The volume exclusion between any two spatially overlapping beads is assumed linearly elastic.
The contribution of this force to a bead is described by the following equation: Where K rep is the spring contant reflective the imcompressibility of genetic content within the beads in contact; d i , j is the distance between the centre of two consecutively connected beadsi and j; d rep 0 is the rest length of the linearly elastic spring; u i , j is a unit vector pointing from bead i to bead j.

The chromatin tension force
The interaction between two genomically consecutive beads is assumed linearly elastic. The contribution of this force to a bead is described by the following equation: Where K ten is the spring constant of the inter-bead 'chromatin' linker, d i ,i +1 is the distance between the centre of two consecutively connected beads and i+1 ; c 2 is the rest length of the linearly elastic spring; u i , i+1 is a unit vector pointing from bead i to bead i+1.

The Hi-C restraint force
The interaction between genomically distant beads is also assumed linearly elastic. The contribution of this force to a bead is described by the following equation: Where K Hi−C is a constant reflective of the constraint strength implied by Hi-C and applies to any pairs of coarse-grained beads that have pair-wise Hi-C frequency greater than a threshold value, namely, p i , j > p rep 0 ; d hic 0 is the rest length of the linearly elastic spring; u i , j is a unit vector pointing from bead i to bead j .

Data preparation and modelling
Normalized sparse matrix from Hi-C experiment was parse into separate single chromosome matrices, containing only intra-chromosomal contacts without any scaffold interactions. We removed all inter-chromosomal interactions due to multi-cell nature of Hi-C experiment.
Furthermore, an interaction frequency (IF) cut-off was applied to further filter out desired contacts that were used as spatial constraints for modelling. Specifically, cut-off with IF value 5,

Model analysis • Surface accessible solvent area
We modified the freesasa python package (https://freesasa.github.io/python/) in order to make it applicable for SASA calculations of 3D chromosome scaffold models. We also calculated 'normalized SASA', SAS A n , contribution by cephalopod or metazoan synteny per chromosome as:

SAS A n = SAS A i n
where SAS A i is SASA contribution of all metazoan or cephalopod microsynteny on selected chromosome scaffold, and n is the genomic size of cluster (bp).
In addition to normalization, we can calculate proportion coverage of chromosome scaffold surface p coverage as: where SAS A chromosome is the total SASA of chromosome and ∑ SASA ceph i (or ∑ SASA meta i ) is the sum of all the SASA of cephalopod/metazoan clusters on the particular chromosome scaffold.
Alternatively, we can calculate the proportional coverage p coverage i of chromosome scaffold surface by occupied each individual microsynteny cluster: Depth of cluster is defined as the distance between centre of mass of the microsynteny cluster d ceph ∨ d meta and the closest point of chromosome scaffold located on its surface, .
Cluster depth is an exposure measure complementing the information provided by SASA.
Supplementary Note 6. Neighbour-joining method for TAD syntenic consistency profiling All scripts for this part can be found in the folder Tree_method directory in the bitbucket repository. Only chromosome scale scaffolds were considered for this analysis. Unassigned and unordered scaffolds were filtered out prior to analysis. The normalized Hi-C matrix and bed files (HiC-Pro output) were split into separate matrix and bed files for each chromosome and sorted by the stop column. "Interaction trees" for each chromosome were computed. Using the intensity of interaction between bins we clustered pairs of bins together; two bins with the highest interaction become "sister groups" to each other, then they cluster with the bins with the next highest interaction until all bins are integrated in the tree. Every chromosome is saved as a newick tree file and the length of the branches indicate interaction intensities.
To understand how well a region is defined by its interactions we extracted the last common ancestor of that region (the bins in that region) from the whole tree for a chromosome. We then constructed a table with information about the syntenic clusters and their sub-tree structure e.g.
synteny-type (cephalopod-specific, ancestral (metazoan) synteny), chromosome, synt-id, farthest node, distance root farthest node, distance root to farthest leaf, number of nodes in the extr tree, number of nodes in the microsyntenic cluster, number of genes, bins total on that chromosome, count, number of leaves in the extracted tree (for 503 of the 505 cephalopodspecific clusters, 274 out of the 275 metazoan clusters, 10074 of random-cephalopod clusters and 5490 of random-metazoan clusters, as not all sub-trees could be extracted).
We calculated the ratio between the number of bins in the extracted tree for a syntenic cluster (by last common ancestor) and the number of initial bins in the syntenic cluster. If a synteny location is well defined by its interaction, the ratio between the nodes of the tree and the bins making up the synteny id should be close to 1. Ratios higher than 1 were excluded. Wilcox test (unpaired, two-sided) was used to test differences between samples, using ggstatsplot To test whether genes in microsyntenic clusters are more likely to be co-expressed than genes that randomly sit in close proximity in the genome we sampled the genome 20x by the distribution of cephalopod-specific and metazoan syntenies (number of genes)(see 2.1.), extracting clusters of genes in close proximity that are not syntenic. We then calculated the coexpression coefficient of genes in observed and random microsyntenies followin 16 [38][39][40][41] ). Other settings were: -max_target_seqs 10 -max_hsps 1000task megablast -template_length 16 -penalty -2 -word_size 11evalue 1 -template_type coding_and_optimal. Multimapping regions were excluded if they overlapped by more than 50% and occured more than 3 times using BEDOPS 42 prefiltered_megablast.bed | awk '$1<4' | cut -f2-|sort-bed -| uniq and bedops -merge 35 .Any region overlapping with an exon by 1bp or more was excluded using bedtools 43 subtract with the -A parameter. To exclude repetitive regions, fasta sequences were extracted from the filtered putative CNE locations and meme's dust (cut-off 10) function was used to mask repeats. Any region with more than 25% Ns was excluded. Additionally, two datasets were created for each similarity score of at least 100bp or 50bp regions or if regions had fewer than 50 or 100bp non N nucleotides. For similarity scores of 0%, only 100bp regions were kept (custom python script available on https://bitbucket.org/hannahschm/ceph_regulation_microsynteny/ ). To remove any remaining coding sequences, the remaining putative CNE sequences were blasted against the NCBI 44 NR database (mirrored on Mar 17 2021 for 0% similarity and Jan 3 2022 for others) and any regions overlapping with a BLAST match were removed with bedtools intersect -A.  Reads were mapped to the E. scolopes chromosomal assembly using bowtie2 11  Motifs were then identified for all ATAC peaks overlapping a specific cluster using homer`s findMotifsGenome.pl using the default size parameter (=200). Summarized results for pvalues <=1e-3 are found in Supplementary Table 1. Motifs were annotated with information from the uniprot 50 website.

Supplementary
Transposable elements were annotated using the repeat element library of 9 Table 6).

In-situ hybridization (ISH) and fluorescence in-situ hybridization (FISH)
The following primary steps were the same for both FISH and ISH.
Negative controls were done for all genes using sense probes, as well as one round of adding all ingredients except probes. Probes were reused several times. In-situs were done on early

Steps for FISH
The antibody was then added to samples in blocking buffer (Roche) in a dilution of 1:500.
Antibody incubation was followed by 1x quick, 3x 5min and 5x 1h washing steps in TBST.
Samples were washed 3x 15min in TBST, followed by two washing steps at 72°, first in detergent, then 30min in Solution X and 3x 15min in TBST at RT. Samples were stored in TBST at 4° until imaging. Nuclei were stained with DAPI (Sigma-Aldrich) and samples were mounted in Fluoromount-G® (SouthernBioTech (Birmingham, Alabama, United States)). Samples were imaged on an inverted Zeiss (Oberkochen, Germany) LSM780 multiphoton laser scanning confocal microscope at the Marine Biological Laboratory in Woods Hole.
Steps for ISH The antibody was then added to samples in a blocking buffer (Roche) in a dilution of 1:5000.
scolopes embryos were infiltrated for 1h at 37°C, then embedded in gelatine in custom molds.