Early vertebrate origin of CTCFL, a CTCF paralog, revealed by proximity-guided shark genome scaffolding

The nuclear protein CCCTC-binding factor (CTCF) contributes as an insulator to chromatin organization in diverse animals. The gene encoding this protein has a paralog which was first identified to be expressed exclusively in the testis in mammals and designated as CTCFL (also called BORIS). CTCFL orthologs were reported only among amniotes, and thus CTCFL was once thought to have arisen in the amniote lineage. In this study, we identified elasmobranch CTCFL orthologs, and investigated its origin with the aid of a shark genome assembly improved by proximity-guided scaffolding. Our analysis employing evolutionary interpretation of syntenic gene location suggested an earlier timing of the gene duplication between CTCF and CTCFL than previously thought, that is, around the common ancestor of extant vertebrates. Also, our transcriptomic sequencing revealed a biased expression of the catshark CTCFL in the testis, suggesting the origin of the tissue-specific localization in mammals more than 400 million years ago. To understand the historical process of the functional consolidation of the long-standing chromatin regulator CTCF, its additional paralogs remaining in some of the descendant lineages for spatially restricted transcript distribution should be taken into consideration.

Scientific RepoRtS | (2020) 10:14629 | https://doi.org/10.1038/s41598-020-71602-w www.nature.com/scientificreports/ an earlier origin of CTCFL than the split between chondrichthyan and osteichthyan lineages. This evolutionary scenario would be more reliably corroborated with accumulating information from recent genome sequencing of chondrichthyans 5 . A typical solution for dating gene duplication in an early age of vertebrate evolution involves genome expansion, referred to as two-round whole genome duplications (WGDs) 18,19 . This event gave rise to multiple arrays of chromosomal regions containing a similar set of genes, termed conserved synteny 20 . This strategy of phylogenetic characterization has not been applied to CTCF or CTCFL genes. Exploration of gene repertoire and epigenome regulation has been facilitated by the recent release of largescale molecular-level resources for multiple shark species 5 . This study included the landscape of CTCF protein binding in the cloudy catshark and the bamboo shark, as well as the whole genome assembly of the latter species 5 whose completeness and continuity are comparable or superior to those of a member of Holocephali, Callorhinchus milii, that stood long as the only chondrichthyan species with the sequenced genome 21 . While the CTCF orthologs have been characterized even in jawless and cartilaginous fishes 5,6 , the available resources have not allowed the identification of CTCFL orthologs outside amniotes.
In this study, we improved the quality of the existing bamboo shark genome assembly with long-range scaffolding to reliably identify a CTCFL ortholog and characterize its phylogenetic property based on conserved synteny spanning the flanking genomic regions. With a further effort to identify CTCF and CTCFL orthologs in more diverse vertebrates, we inferred molecular phylogeny and provided a rigorous assessment of its output. Our study, supported by novel identification of elasmobranch CTCFL orthologs, consolidated an early origin of CTCFL through WGD which was only ambiguously suggested previously 6 . Our tissue-by-tissue transcriptome data also supported an early establishment of the testis-associated expression documented earlier solely for mammalian CTCFL.

Results and discussion
Proximity-guided genome scaffolding of the bamboo shark. Previously, the whole genome shotgun reads and mate-pair reads of the brownbanded bamboo shark Chiloscyllium punctatum were assembled by the program Platanus 22 to reconstruct its genome sequences 5 , which marked the N50 scaffold length of 1.96 Mbp (assembly version Cpunctatum_v1.0; NCBI Entry GCA_003427335.1). This assembly resulted from decontamination and a length cut-off at 500 bp for the Platanus output (see Methods of Ref. 5 ). To improve the completeness and continuity of this assembly, we extracted high molecular weight genomic DNA extracted from the residual piece of the liver used for the production of the previously released assembly Cpunctatum_v1.0 (see "Methods" section). The genomic DNA was processed with in vitro chromatin reconstruction and proximity ligation to prepare two Chicago libraries (see "Methods" section), and they were sequenced to obtain 495 million read pairs in total. The obtained reads were used for long-range scaffolding by the program HiRise 23 . The scaffolding was performed in two separate runs with the minimum lengths for input sequences of 1,000 bp and 300 bp (versions Cpunctatum_v2.0 and Cpunctatum_v2.1, respectively), which both resulted in higher continuity than the input assembly that was previously released (version Cpunctatum_v1.0) as visualized in Fig. 1. Possibly because of the decreased cutoff length for the input sequences in scaffolding, the output with the cutoff of 300 bp (ver- Cpunctatum_v2.0 Cpunctatum_v2.1  ). An additional cloudy catshark CTCF homolog was identified by a BLASTP search in the deduced amino acid sequences of the cloudy catshark genes predicted on its whole genome assembly Storazame_v1.0 (GCA_003427355.1) using the amino acid sequence of the human CTCFL gene (NP_001255969.1). This search resulted in the highest bit score for the gene Scyto0009998 predicted on the genome scaffold scf_scyto00004224 whose sequences are different from those of the cloudy catshark CTCF ( Supplementary Fig. S1). We also identified transcript contigs derived from our RNA-seq data 5 that have overlapping nucleotide sequences to a part of Scyto0009998. One of the transcript contig sequences included a putative upstream region that partially matched the genome scaffold scf_scyto00086509, while the other two included a potential 3′ untranslated region (UTR). Using oligonucleotide primers designed in the potential 5′ and 3′ UTR of the putative second CTCF homolog, we amplified a fragment of cDNA reverse transcribed with the total RNA extracted from the adult testis. The 2,264 nt-long nucleotide sequence covering the whole putative ORF (637 amino acids, compared with its shorter predicted ORF of Scyto0009998 with 558 amino acids) was deposited as the entry KY883980 in NCBI GenBank. This gene is tentatively designated as the cloudy catshark CTCFL gene. We also identified potential orthologs of CTCFL in the whale shark and the brownbanded bamboo shark whose sequences are distinct from those of their CTCF orthologs (Rhity2000076 and Chipu0005442), respectively. The putative ORFs of these shark CTCFL genes, as well as their CTCF genes (Supplementary Data 1), were all revealed to possess eleven zing finger domains, as known for CTCF and CTCFL genes of osteichthyans including the human ( Fig. 2A, C). The ORF lengths of the shark CTCF and CTCFL respectively resembled those of the mammalian counterparts rather than the lamprey homologs ( Fig. 2A). The amino acid sequences of CTCFL orthologs (whose phylogenetic classification is confirmed below) exhibited a much lower similarity among them, compared with the CTCF counterparts, which is featured by the absence of the YDF motif in the amino acid sequences of CTCFL orthologs, with which CTCF interacts with cohesion and contributes to the formation of CTCF-anchored chromatin loops 25 (Supplementary Fig. S2).
phylogenetic relationships among vertebrate ctcf relatives. We previously showed orthology of elasmobranch CTCF genes to osteichthyan CTCF genes 5 . To infer the phylogenetic relationships including the newly identified putative elasmobranch CTCFL genes, we reconstructed phylogenetic trees of the CTCF gene family with the maximum-likelihood (ML) method and the Bayesian approach using the amino acid sequences of the zinc finger domains (see "Methods" section). We have also included the newly identified sequences of the putative Iberian ribbed newt CTCF and CTCFL orthologs in this analysis. The ML tree displayed phylogenetic proximity of the putative elasmobranch CTCFL genes to the tetrapod CTCFL genes indicating their orthologous relationship (Fig. 2B). The putative elasmobranch CTCFL and tetrapod CTCFL genes exhibited long branches in comparison with their counterpart CTCF genes, showing that the CTCFL gene accepted much more amino acid substitutions than the CTCF did. We also observed a large heterogeneity of branch lengths among the different lineages of CTCFL genes. Importantly, the phylogenetic proximity between the putative elasmobranch CTCFL and tetrapod CTCFL genes were poorly supported in this ML tree (bootstrap value, 23; posterior probability, < 0.50).
To dissect the ambiguity in the phylogenetic relationship of the putative elasmobranch CTCFL genes in more detail, we performed an exhaustive likelihood computation for all possible tree topologies (see "Methods" section for details). In this analysis, internal relationships within several major operational taxonomic unit (OTU) (e.g., Table 1. Improvement of the brownbanded bamboo shark genome assembly. Sequences shorter than 500 bp are not taken into consideration. Gene space completeness was estimated by BUSCO v3 with the CVG, a set of 233 single-copy reference orthologs 36 .

Metric
Cpunctatum_v1.0 Cpunctatum_v2.  Table 2, in which top ten tree topologies are listed in the descending order of the log-likelihoods, followed by the tree topologies exhibiting the largest log-likelihoods with the elasmobranch CTCFL or chondrichthyan CTCF (including the Callorhinchus milii CTCF) proximally clustering with either remaining OTU ( Table 2). As a result, all tree topologies with the proximal cluster of the putative elasmobranch CTCFL with the chondrichthyan CTCF were statistically rejected by AU and Kishino-Hasegawa (KH) tests (p < 0.05; Table 2). In other words, chondrichthyan lineage-specific gene duplication between their putative CTCFL genes and the CTCF was not supported. On the other hand, proximal clustering of the putative elasmobranch CTCFL with the osteichthyan CTCF could not be rejected at the significance level of 0.05 in all the tests performed (e.g., Rank 416 in Table 2). Similarly, proximal clustering of the putative elasmobranch CTCFL with either of the lamprey CTCF, the lamprey CTCF2, or the hagfish CTCF remained unrejected (Rank 82, 101, and 163 in Table 2). Overall, regarding the phylogenetic position of the putative elasmobranch CTCFL genes, the molecular phylogenetic analysis did not provide unequivocal support, which prompted us to report to a different strategy, namely synteny analysis described below. www.nature.com/scientificreports/ Synteny analysis for the orthology between divergent ctcfL orthologs. To investigate molecular phylogeny of putative CTCFL genes of elasmobranchs, we consulted possible synteny conserved across different vertebrate taxa. First, we employed the previously released cloudy catshark shark genome assembly Storazame_v1.0, only to find that it does not contain a scaffold sequence spanning more genes than the CTCFL ortholog ( Supplementary Fig. S1). Therefore, we employed the previously released genome assembly of the brownbanded bamboo shark Cpunctatum_v1.0. In this genome assembly, the putative CTCFL gene was localized in the approximately 254 Kbp-long scaffold scf_chipu00001848, which however harbored one additional protein-coding gene (Fig. 3A). To overcome this situation, the abovementioned, newly built version of the genome assembly Cpunctatum_v2.1 was adopted for mapping this gene, which shows its localization in an approximately 2.3 Mbp-long scaffold ccg_chipu00000311 harboring seven additional predicted protein-coding genes (Fig. 3A, B). We compared the composition of the genes flanking the putative bamboo shark CTCFL gene with two selected amniote species (the human and the softshell turtle) (Fig. 3B). Our molecular phylogenic analysis on the flanking genes supported the one-to-one orthology among these species, indicating that the gene array in those genomic regions is derived from the jawed vertebrate ancestor (Fig. 3C for the PCK1 gene). Although our abovementioned phylogenetic analysis on CTCF/CTCFL did not provide unambiguous results, this observation of conserved synteny ascertains the orthology of the putative shark CTCFL genes with the previously identified amniote CTCFL genes (Fig. 3B).
Synteny analysis for the paralogy between ctcf and ctcfL genes. Whereas the abovementioned synteny analysis scrutinized the orthology between CTCFL genes, the following analysis focuses on paralogy between CTCF and CTCFL. This analysis investigates whether these two genes arose in small-scale gene duplication or WGD whose timing is easier to pinpoint. In the human genome, the CTCF and CTCFL genes are located on chromosome 16 and 20, respectively, but the CTCFL-containing region is thought to have undergone frequent rearrangement of the gene order 26 . This prompted us to intensively analyze the homologous region in the chicken genome instead, although the chicken CTCFL ortholog is missing. In the chicken genome, the CTCF gene is localized in chromosome 11, while the genomic region from which the CTCFL ortholog was lost is localized on chromosome 20, still maintaining the neighboring genes. Between these chromosomes as well as the chicken chromosome 2, we observed quite a few gene families that have paralogs duplicated in early vertebrate evolution in common, such as FAM65c/FAM65a/FAM65b and CHD9/CHD6/CHD7 (Fig. 4). This is consistent with the observation in the previous study based on genome-wide synteny analysis 26 . Although our analysis did not unveil the fourth chromosome or chromosome part that has maintained the equivalent gene array in the chicken genome, this observation, consistent with the previously documented pattern 18,19 , suggests that CTCF  38 . c p value of the Shimodaira-Hasegawa (SH) test 39,40 . The parentheses include standard errors. The underlined items in the tree topologies refer to the top-rank tree that supports their proximal clustering. www.nature.com/scientificreports/ and CTCFL also split as a part of WGD rather than small-scale duplication. Altogether, our study shows that the CTCF-CTCFL duplication occurred around the emergence of vertebrates, as a part of the two-round WGDs.

Rank by lnL
Asymmetric expression patterns between shark ctcf and ctcfL. To examine possible commonalities of expression patterns with mammals, we analyzed tissue distribution of shark CTCF and CTCFL expression. Using the RNA-seq data released previously 5 , we quantified their expression levels in embryos and adult tissues (Fig. 5). This analysis revealed an intensive expression of the CTCFL ortholog in the catshark testis, as described in mammals, while the CTCF ortholog is widely expressed. It should be noted that the catshark CTCFL is also expressed in the epididymis, whereas no equivalent expression has been documented for its mammalian ortholog 27 . It is suggested that the CTCFL ortholog was recruited for some role in the male reproductive organ before the split between the chondrichthyan and osteichthyan lineages. Later, at least the shark lineage, as  www.nature.com/scientificreports/ well as the therian mammal lineage, have retained the testis-associated expression, while other lineages, including the chicken and anuran lineages, secondarily lost the CTCFL orthologs (Fig. 6).

conclusions
This study challenged the previous understanding of the timing of the duplication between CTCF and CTCFL (BORIS). By exploiting the emerging genome and transcriptome sequence information of formerly underrepresented taxa, we performed in-depth molecular phylogenetic analysis, reinforced by evolutionary interpretation of syntenic gene location. This investigation suggested that CTCF and CTCFL were duplicated earlier than previously thought, namely before the divergence between the osteichthyan and chondrichthyan lineages, possibly around the time of the occurrence of vertebrates (Fig. 6). Our analysis revealed testis-associated expression of the shark CTCFL orthologs, suggesting that the CTCFL was already intensively expressed in the testis at the osteichthyan-chondrichthyan divergence (Fig. 6). Altogether, the well-studied chromatin regulator CTCF has a complex evolutionary history, with its sister gene retained by some of the descendant gnathostome lineages with restricted expression domains.

Methods
Genomic DNA extraction and genome scaffolding with Dovetail Chicago. We used a residual piece of the liver dissected from the brownbanded bamboo shark C. punctatum individual used in our initial genome sequencing 5  . Length distribution of the genomic DNA was analyzed by pulsed-field gel electrophoresis, which exhibited an average length of over 2 Mbp. Using the genomic DNA, two Chicago libraries were constructed, which were sequenced at Dovetail Genomics. Scaffolding with the program HiRise 23 was performed twice using the Chicago sequencing data and the previously generated C. punctatum genome assembly which contains additional sequences shorter than 500 bp 5 (Table 2) was performed with CONSEL 32 v1.20 and RAxML using the PROT-GAMMAWAG model. For all possible tree topologies and statistical tests, the internal relationships of the sequences used in the phylogenetic analysis were constrained to the following eight groups at the locations of the black circles plotted at each node in Fig. 2B; osteichthyan CTCF, chondrichthyan CTCF, tetrapod CTCFL, putative elasmobranch CTCFL, lamprey CTCF, lamprey CTCF2, inshore hagfish CTCF, and the outgroup. Synteny analysis. Detection of conserved synteny was performed as described previously 20 Figure 5. Expression profiles of CTCF and CTCFL in cloudy catshark tissues. Expression levels of cloudy catshark CTCF and CTCFL in adult tissues and embryos at different developmental stages were quantified in TPM (transcripts per kilobase million mapped reads) by the eXpress program using reads mapped to the coding nucleotide sequences of the cloudy catshark (see "Methods" section). Note that the scales are not equal between the genes. Cloudy catshark embryos were staged according to the existing literature 35 . The details of the RNAseq data used for the analysis are included in Supplementary Table S2. The equivalent expression profiles of the brownbanded bamboo shark CTCF and CTCFL genes are included in Supplementary Fig. S4 5 were mapped to the genome assembly Cpunctatum_v2.1 by the program BLAT v36. Phylogenetic properties of the genes located in the regions harboring the orthologs of CTCF or CTCFL as well as the regions homologous to them were analyzed by inferring molecular phylogenetic trees using the webserver aLeaves 33 and the method described above. The selection of the candidate gene families for the phylogenetic analysis was assisted by the OHNOLOGS database (https ://ohnol ogs.curie .fr) 34 .

Gene expression quantification.
We used the RNA-seq data of various cloudy catshark tissues produced in our previous study 5 . Gene expression levels were quantified as described previously 6