CTCF binding landscape in jawless fish with reference to Hox cluster evolution

The nuclear protein CCCTC-binding factor (CTCF) contributes as an insulator to chromatin organization in animal genomes. Currently, our knowledge of its binding property is confined mainly to mammals. In this study, we identified CTCF homologs in extant jawless fishes and performed ChIP-seq for the CTCF protein in the Arctic lamprey. Our phylogenetic analysis suggests that the lamprey lineage experienced gene duplication that gave rise to its unique paralog, designated CTCF2, which is independent from the previously recognized duplication between CTCF and CTCFL. The ChIP-seq analysis detected comparable numbers of CTCF binding sites between lamprey, chicken, and human, and revealed that the lamprey CTCF protein binds to the two-part motif, consisting of core and upstream motifs previously reported for mammals. These findings suggest that this mode of CTCF binding was established in the last common ancestor of extant vertebrates (more than 500 million years ago). We analyzed CTCF binding inside Hox clusters, which revealed a reinforcement of CTCF binding in the region spanning Hox1-4 genes that is unique to lamprey. Our study provides not only biological insights into the antiquity of CTCF-based epigenomic regulation known in mammals but also a technical basis for comparative epigenomic studies encompassing the whole taxon Vertebrata.


Materials and Methods
cDNA Cloning and Sanger Sequencing. For LjCTCF, cDNA was synthesized from total RNA extracted from pooled stage 27 embryos, using SuperScript III Reverse Transcriptase (Thermo Fisher Scientific) and a gene specific primer. RT-PCR was performed to amplify cDNA containing the fulllength ORF of approximately 3.7 Kbp using KAPA HiFi HotStart ReadyMix (KAPA Biosystems) and 1 M Betaine (Sigma) in the reaction. The PCR product was ligated into the pCR4Blunt-TOPO vector (Thermo Fisher Scientific) and sequenced on a 3730xl DNA Analyzer (Thermo Fisher Scientific).
For LjCTCF2, 3´RACE was performed using total RNA extracted from pooled stage 25 embryos, with 3´ RACE kit (Thermo Fisher Scientific) and gene specific primers. Total RNA extracted from pooled stage 25 embryos was reverse transcribed into cDNA, with ThermoScript Reverse Transcriptase (Thermo Fisher Scientific) and a primer specific to LjCTCF2. RT-PCR was performed to amplify cDNA containing the putative full-length ORF of approximately 1.3 Kbp using Phusion High-Fidelity DNA Polymerase (NEB) and the supplied GC buffer. PCR products were Atailed using Taq HS Perfect Mix (Takara Bio), ligated into the pCRII-TOPO vector (Thermo Fisher Scientific) and sequenced as above. Oligonucleotide primers used here are included in Supplementary Table S7. Immunoprecipitation was carried out using 1-3 mg of protein lysate and 10 μl of anti-CTCF antibody in RIPA buffer at 4 °C for 4 hours under gentle rotation. Following the addition of Protein A beads (Thermo Fisher Scientific), the reaction mixture was incubated for another 1 hour with rotation at 4 °C. Beads were washed with RIPA buffer three times, and proteins were eluted and denatured by boiling the beads in SDS sample buffer for 5 minutes. Following electrophoresis, proteins were stained with silver staining kit (Wako). Gel bands were excised and analyzed by mass spectrometry (LC-MS/MS) as described previously 1 .
Gene Prediction and Curation. Previous genome analysis on L. camtschaticum did not provide gene annotation to be readily used in RNA-seq and ChIP-seq data analysis 2 . For this reason, we performed gene prediction and its evidence-based improvement as follows (see Supplementary Fig.   S1, for a flow chart of this entire procedure). First, the program BUSCO v1 3 was ran on the L. camtschaticum genome assembly LetJap1.0, referring to the metazoan gene set (mBUSCO), which resulted in 535 mBUSCO component genes identified on the genomic scaffold sequences that were longer than 100 Kbp. To increase sequence-level heterogeneity among them, we ran blastclust 4 with options '-L 0.1 -S 20'. In order for individual genes to represent distinct genomic regions with their flanking non-coding sequences, we discarded one of the neighboring gene pairs that are located with an intergenic sequence shorter than 3 Kbp. 450 genes that remained after these selections were used in a training of the gene prediction program AUGUSTUS v3.1 5 , according to the developer's instruction (http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/training.html). Second, in order to improve the accuracy of identification of exon-boundaries later by AUGUSTUS, we aligned RNA-seq reads to the genome assembly (LetJap1.0) using TopHat2 v2.0.11 6 and Bowtie2 v2.2. 2 7 with the default setting without providing any gene model, and constructed 'intron hints' according to the developer's instruction (http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=IncorporatingRNAseq.Tophat) (see Supplementary Table S1 for mapping rates). In addition, in order to generate 'CDSpart hints' based on homologs of other species, a set of vertebrate protein sequences were used as queries in TBLASTN with a threshold of 1e-10, followed by exact identification of protein-coding regions with exon-intron boundaries taken into account using Exonerate v2.2.0 8 . The vertebrate protein sequence set consisted of RefSeq 'known' proteins of human (39,582 sequences), chicken (6,189 sequences), and amniote vertebrates (44,675 sequences), as well as predicted sea lamprey genes (24, 271 sequences 9 ). Third, the LetJap1.0 genome assembly was subjected to identification of conventional and de novo repetitive sequences with RepeatModeler v1.0.4 10 with default parameters, followed by masking of the identified repetitive sequences with RepeatMasker v4.0.5 11 with the options '-nolowxsmall'. Fourth, the program AUGUSTUS was run with the trained parameter files on the genome assembly in which the identified repeats are soft-masked, incorporating intron hints and CDSpart hints with the options '--singlestrand=false --alternatives-from-evidence=false --allow_hinted_splicesites=atac --softmasking=1'.
These predicted genes were then further improved by making use of transcriptional evidence from RNA-seq, as well as sequence similarity to the aforementioned vertebrate protein sequence set.
The RNA-seq reads were used to build a transcriptome assembly via mapping-first and assemblyfirst approaches. In the mapping-first approach, RNA-seq reads were first aligned to the Arctic lamprey genome assembly with Tophat2 v2.0.11. This alignment data was used in combination with the AUGUSTUS predicted gene model to generate a transcriptome assembly by Cufflinks v2.2.1 12 .
In the assembly-first approach, the same RNA-seq reads were utilized by Trinity r20140413p1 13 to produce de novo transcriptome assemblies, which were aligned to the Arctic lamprey genome with PASA v2.0.2 14 . The coding regions of the assembled transcripts from both approaches were deduced using Transdecoder v2.0.1 15 , which employed homology matches to the UniProtKB/Swiss-Prot release 2014_07 peptides. Once this was completed, the transcripts with putative ORFs were collected, and their deduced amino acid sequences were compared against their best-hit match to the vertebrate proteins. From this comparison, a protein-coding transcript evidence set, consisting of the assembled transcripts which had at least 80% of their length matching to at least 80% of a RefSeq protein, was obtained.
The transcript evidence set was aligned back against the AUGUSTUS predicted genes.
From this analysis, certain AUGUSTUS genes, depending on how they lined up against the transcript evidence, were placed into one of four different categories, (i) split genes, (ii) fusion genes, (iii) truncated genes, and (iv) unannotated genes (see Supplementary Fig. S1b). (i) A split gene had the putative ORF of the transcript evidence overlapping with more than two AUGUSTUS genes; if these AUGUSTUS genes matched to different parts of the same vertebrate protein, they were replaced by the exons composing the transcript evidence. (ii) A fusion gene occurred when multiple vertebrate protein-genome alignments, which were generated with Exonerate as described above, overlapped with different parts of a single AUGUSTUS gene; if at least 60% of the individual protein-genome alignments uniquely overlapped with at least 60% of the transcript evidence in different AUGUSTUS gene positions, the AUGUSTUS gene was replaced by the exons of the overlapping transcript evidence. (iii) A truncated gene was specified as an AUGUSTUS gene that was entirely overlapped by a coding sequence in the transcript evidence; the AUGUSTUS gene was replaced with the exons composing the transcript evidence if its match to the vertebrate protein was longer than the AUGUSTUS gene's match to the vertebrate protein. (iv) An unannotated gene was a case where at least 60% of a vertebrate protein aligned to a region of the genome in which no genes were predicted by AUGUSTUS; if a coding region in the transcript evidence overlapped with the vertebrate protein, the exons composing the transcript evidence was added to the gene set. In each of these cases, similarity searches of the amino acid sequences of both the AUGUSTUS genes and the assembled transcripts against the vertebrate proteins were performed using FASTA v36.3.8b 16 , with the numbers of replaced genes in each category shown in Supplementary Table S2. As a result, a total of 34,435 genes were identified. The completeness of the gene set assessed by BUSCO increased by 1 after the procedure for correcting mispredictions outlined above (Supplementary Table S2). The produced gene model is available at our laboratory web site (http://www2.clst.riken.jp/phylo/GRAS-LJ.gff3.gz).
Gene Expression Quantification using RNA-seq Data. Gene expression quantification was performed by mapping RNA-seq reads to nucleotide coding sequences of the genes predicted as above, except that incompletely predicted LjCTCF and LjCTCF2 sequences were replaced with their exonic nucleotide sequences containing their full ORFs that we confirmed with cDNA cloning.
ChIP-seq Data Analysis. ChIP-seq data for mouse E14.5 embryonic brain (SRR392354 and SRR505014), E14.5 MEF (SRR207080 and SRR207071), and ES cells (SRR207089 and SRR207081) were obtained from NCBI SRA. ChIP-seq data for adult dog liver (ERR022285, ERR022304) and adult opossum liver (ERR022303, ERR022307, ERR022306, and ERR022301) were obtained from EBI ENA. ChIP-seq data for fly Drosophila melanogaster (SRR066831-SRR066836), was obtained from NCBI SRA. ChIP-seq reads were processed by Trim Galore! v0.3.7 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) with default parameters except for the opossum data. ChIP-seq reads for mouse, dog, opossum, and fly were mapped to mm10, CanFam3.1, MonDom5, and Dmel r6.11 genome assemblies using Bowtie v0.12.8 18 , respectively. For mouse and fly data, only uniquely mapped reads were obtained using the options '-m 1 -n 2 -a -best --strata'. For dog and opossum data, uniquely mapped reads were first obtained using the options '-m 1 -v 2 -a --best --strata'. Subsequently, unmapped reads were mapped using the options '-m 5 -n 2 -a --best --strata', allowing each read to align with up to five locations, and combined with the uniquely mapped reads. Peak calling was performed using MACS2 v2.0.10 19 with its default parameters. Identification of the CTCF 'core' and 'upstream' motifs was performed for mouse, dog, and opossum as described in Materials and Methods, using sequences from the top 2000 peak regions. Only the 'core motif' was identified in the top 500 peak regions for the fly. The location and orientation of CTCF motifs within peak regions inside Hox clusters were identified by FIMO v4.10.1 20 as described in Materials and Methods.

Protein Identification by Chromatography-coupled Tandem Mass Spectrometry (LC-MS/MS).
Proteins from gel slices were subjected to reduction/alkylation, followed by digestion with 10 μg/ml trypsin at 37 °C for 16 hours. Peptides were extracted with 5 % formic acid and 50 % acetonitrile, dried, and dissolved in 2 % acetonitrile and 0.1 % formic acid. Peptides were fractionated by reverse-phase chromatography (ADVANCE UHPLC) and applied to ion trap mass spectrometer (LTQ Orbitrap Velos Pro; ThermoFisher) with advanced captive spray source. The Mascot software (Matrix Science) was used to identify the match of molecular masses to nonredundant protein from the NCBI database (ver. 20151116), while for L. camtschaticum, peptides were searched against the gene set predicted by AUGUSTUS from the L. camtschaticum genome (LetJap1.0) containing the CTCF protein sequences determined in this study.
Chromatin Immunoprecipitation Protocol. Chromatin immunoprecipitation was performed basically as described 21 but with modifications to adapt to embryos and tissues of the Arctic lamprey L. camtschaticum and chicken.
Step-1: Sample Preparation. Lamprey embryos were collected and pooled at stage 27. The 1 % (w/v) TritonX-100, 2mM EDTA, 500 mM NaCl). The immune complexes were eluted from the beads by agitation in elution buffer (10 mM Tris-HCl (pH8.0), 300 mM NaCl, 5 mM EDTA, 1 % SDS) at room temperature for 15 minutes, and supernatants were transferred to a new tube and incubated overnight at 65 °C for reverse-crosslinking. Eluates were further treated with RNase A and Protease K, and they were precipitated in ethanol to obtain ChIP DNA, as described for the input DNA.
Step-6: ChIP DNA QC. Prior to library preparation, enrichment of the CTCF bound regions in the ChIP DNA was analyzed by quantitative qPCR using primers designed for human H19, HoxA6 and Gapdh loci, for chicken HoxB8, HoxD8, and Actb loci, and for lamprey Hox-5, Hox-6 and Actb loci. CTCF binding sites in human were identified by analyzing ChIP-seq datasets from a public database, while for chicken and lamprey, CTCF binding sites were predicted using the database JASPAR 2014 (http://jaspar.genereg.net/). Enrichment folds were calculated as the ratio between "quantitated concentrations by qPCR" and "actual DNA concentration". Samples that displayed enrichment greater than 10 fold were used for library preparation.
Step-7: Library Preparation and Library QC. Libraries were prepared from 20-30 ng input DNA or 1-5 ng ChIP DNA using KAPA LTP Library Preparation Kit (KAPA Biosystems, KK8232) with TruSeq index adaptors. We modified the manufacturer's protocol and carried out reaction at 1/5 scale. The amount of PEG/NaCl solution used for reaction clean-ups were 4× up to the adaptor ligation step, and 1.3× after the adaptor ligation step. Size selection of DNA was carried out after the end repair reaction by AMPure XP beads (Beckman Coulter, A63881). The library preparation procedure in brief was as follows. End repair reaction was carried out in 14 μl volume at 20 °C for 30 minutes. Prior to size selection, the volume was increased to 30 μl by adding 16 μl of TE buffer.
Amounts of AMPure XP beads used were 0.7× (21 μl) to remove DNA ≥400 bp, and additional 3.3× PCR cycle that reached the intensity of the fluorescence standard 1 was determined to be the optimal cycle number for library amplification in each sample. Once the PCR cycle was determined, library was amplified in 20 μl reaction volume using KAPA HiFi HotStart ReadyMix (KAPA Biosystems, KK2601) and 8.5 μl of the adaptor ligated DNA. PCR products were purified by adding 1.3× amount (26 μl) of AMPure XP beads to the reaction, and eluted in 20 μl of Tris-HCl buffer (pH 8.0). In order to confirm the quality and quantity of the prepared libraries, their concentration and size distribution were analyzed with Qubit dsDNA HS Assay kit and Agilent High Sensitivity DNA kit. Library qPCR was also carried out to calculate the enrichment folds at positive and negative control regions, using equal amounts of input and ChIP library DNA as templates in the reaction. Enrichment folds were calculated as the ratio of quantified concentrations of "ChIP libraries" and "input libraries".
Libraries that displayed enrichment greater than 10 fold at positive control regions were used for sequencing in Illumina HiSeq1500.

P. sinensis
: .: : :     DNAs used for qPCR were input and ChIP DNAs for the ChIP-qPCR experiment (on the left), or the input and ChIP DNAlibraries for the library-qPCR experiment (on the right). For ChIP-qPCR, enrichment folds were calculated as the ratio between the "qPCR-quantitated concentration" and "actual DNA concentration" . For library-qPCR, equal amounts of template DNAs were used for the qPCR reaction, and enrichment folds were calculated as the ratio of quantitated values of "ChIP library" and "input library" . ChIP libraries were confirmed to display no less than 10 folds of enrichment in at least one positive control region before sequencing.    Table S7).
Shown are the fold enrichment values of ChIP samples against the input sample.    Interestingly, all the identified CTCF motifs were oriented towards the 3´ end of the cluster.
It was also shown that the ANT-C has two outstanding CTCF peaks, between pb and Dfd genes as well as within the Antp gene. The ChIP-seq data were obtained from NCBI SRA (SRR066831-SRR066836) and processed as described in Supplementary Materials and Methods.
These data, originally in triplicates, were merged into one and mapped to the D. melanogaster genome assembly Dmel r6.11. with the 3´ probe is included in Fig. 1c. Scale bars: 500 µm for the whole embryo panels, and 200 µm for the enlargement panels.   101 nt-long paired-end reads were produced by the High-Output mode of HiSeq1500, and only the read 1 trimmed to 80 bp were used for data analysis. † Lamprey and chicken ChIP-seq reads were mapped against LetJap1.0 and galGal5 genome assemblies, respectively, allowing multi-map (-m 5), whereas human reads were mapped against hg19 genome at single map (-m 1) condition of Bowtie (see Materials and Methods for details).  RepeatMasker analysis was carried out for the whole genome and Hox clusters using the custom repeat library generated by RepeatModeler for each species.