Multiomic atlas with functional stratification and developmental dynamics of zebrafish cis-regulatory elements

Zebrafish, a popular organism for studying embryonic development and for modeling human diseases, has so far lacked a systematic functional annotation program akin to those in other animal models. To address this, we formed the international DANIO-CODE consortium and created a central repository to store and process zebrafish developmental functional genomic data. Our data coordination center (https://danio-code.zfin.org) combines a total of 1,802 sets of unpublished and re-analyzed published genomic data, which we used to improve existing annotations and show its utility in experimental design. We identified over 140,000 cis-regulatory elements throughout development, including classes with distinct features dependent on their activity in time and space. We delineated the distinct distance topology and chromatin features between regulatory elements active during zygotic genome activation and those active during organogenesis. Finally, we matched regulatory elements and epigenomic landscapes between zebrafish and mouse and predicted functional relationships between them beyond sequence similarity, thus extending the utility of zebrafish developmental genomics to mammals.


Data Processing
Sequencing files from the DANIO-CODE DCC were processed with standardised pipelines, see https://gitlab.com/danio-code for details. The resulting files were uploaded to the DCC annotated with the version of the used pipeline and linked to their source files. RNA-seq, methylation data, HiC, 4C-seq data were processed using custom processing pipelines. ChIP-seq and ATAC-seq data were processed using adapted ENCODE pipelines 1 and CAGE-seq and 3P-seq data were processed using adapted FANTOM pipeline 2 .

WashU Epigenome Browser tracks
In order to visualize the developmental timeline, bigwig files of the different assays were combined to one matplotlib track and each track was shifted by adding a fixed amount to their score column.
To improve readability, the empty regions were set to a score of 0 before the shift. The drawing of the adult stage in Figure 1 was created with BioRender.com.

Track hubs
Processed data track can be uploaded to UCSC Genome browser as a track hub 4 . The DCC track hub was generated using custom scripts and it is available in the following link: https://daniocode.zfin.org/trackhub/DANIO-CODE.hub.txt, but also registered at UCSC as a public track hub.

Box plot definition
Box plots throughout the manuscript were defined in the standard way with the center depicting the median, the hinges of the box depicting lower and upper quartiles, and the whiskers extending 1.5 * inter-quartile width from the lower and upper box hinges. The outliers were depicted as points.

Nanti-CAGE-seq on zebrafish developmental stages
Total RNA was extracted from multiple stages of zebrafish development (Fertilized egg, 16-cell, 128-cell, 512-cell, 30%-epiboly, 4 somite, Prim-5, and High-pec) using the miRNeasy kit (Qiagen), according to the manufacturer's instructions. nAnT-iCAGE libraries were prepared as described previously 5 using the CAGE™ Preparation Kit (DNAFORM). All libraries were sequenced on Illumina HiSeq2500 except the high pec library which has been sequenced on NextSeq500.

Total and Nuclear Tagging CAGE-seq
Total RNA was collected from multiple stages of zebrafish development (30%-epiboly, 9 somite, Prim-5, High-pec and Larvae) using the miRNeasy kit (Qiagen), according to the manufacturer's instructions. Nuclear RNA was also collected, through a nuclear isolation protocol based on the Nuclei EZ isolation kit (Sigma) but optimized for zebrafish tissue. In summary, embryos were dechorionated and then dissociated and de-yolked in a cell swelling buffer (250mM Sucrose, 10mM Tris-HCL (pH 7.9), 10mM MgCl2, 1mM EGTA), through vigorous pipetting. Embryo slurry was left to stand for 5 minutes and then filtered using a 50uM tube top filter. It was then centrifuged at 500 g, 5 mins, 4°C and the pellet resuspended in Freezing buffer (Sigma), after which they were stored at -80°C. Upon defrosting Nuclei EZ Lysis Buffer was added to the cells, vortexed and left to stand for 5 mins. Nuclei were then centrifuged at 500 g, 5 mins, 4°C and the pellet washed once in Nuclei EZ Lysis Buffer. Upon a final centrifugation of the nuclei, the pellet was resuspended in Qiazol Lysis Reagent (Qiagen) and RNA extracted using the miRNeasy kit (Qiagen), according to the manufacturer's instructions. Tagging-CAGE libraries were prepared from both the total and nuclear fractions following the protocol described previously 6 . All libraries were sequenced on Illumina HiSeq2500 by the SNP&SEQ Technology Platform in Uppsala, Sweden.

Embryo preparation and ChIP experiments
Approximately 1500 Dome/30%-epiboly, 25 5-9 somites and 200 Prim-5/Long-pec embryos from natural crossing of wild type AB strain adults were collected. The embryos were enzymatically dechorionated using pronase and fixed in 1.85% formaldehyde in Hanks media for 20 minutes at room temperature. After crosslinking, samples were washed once with PBS and the fixation was stopped by incubating in 1x glycine for 10 minutes at room temperature followed by three washes with ice-cold PBS.
The embryos at somite stages were fixed in 1% formaldehyde for 10 minutes at room temperature upon dissociation in 1 ml of PBS and 1x protease inhibitor cocktail. Fixation was stopped with 115 μl of glycine for 5 minutes at room temperature. The cell suspension was centrifuged at 300 g for 10 minutes at 4°C and pellets were washed with Hank's balanced salt solution (HBSS) supplemented with 1x protease inhibitor cocktail (PIC) and centrifuged at 300 g for 10 minutes.
Supernatants were removed and pellets were kept on ice.
ChIP experiments were carried out using the ChIP-IT Express Enzymatic kit

Library preparation for ChIP-seq experiments
The DNA libraries were prepared using the Microplex library preparation Kit V2 (Diagenode, Cat. No. C05010013) following the manufacturer's instructions. Briefly, templates were prepared with 2 μl of template preparation buffer and 1 μl of preparation enzyme and the reaction was carried in a thermocycler as follows: 25 minutes at 22ºC, 20 minutes at 55ºC and hold at 4ºC. Upon template preparation libraries were synthesized with 1 μl of library synthesis buffer and 1μl of library synthesis enzyme incubate the samples at 22ºC for 40 minutes before proceeding to library amplification. Finally, libraries were amplified adjusting the number of cycles to the amount of DNA used for the preparation, according to the instructions. All libraries were purified and size selected with AMPure XP beads (Beckman Coulter) using the ratios recommended by the microplex protocol. The quality and concentration of the samples was verified by High Sensitivity D5000 ScreenTape (Agilent) and Qubit dsDNA High sensitivity assays (Thermo fisher scientific).

ATAC-seq experiments
Chromatin was prepared using 200 embryos, 100 embryos and 50 embryos in the 32-cell, 128-cell and 256-cell stages, respectively. Embryos were de-chorionated enzymatically using pronase and collected in 1.5ml Eppendorf tubes. 1ml Cell Dissociation Buffer (13151014, Gibco) was added into the tubes and cells were dissociated by pipetting up and down several times. Dissociated cells were then collected by spinning at 500 g for 10 min at 4°C. The following steps followed the ATAC-seq protocol as previously published 7 .

Cell preparation for Hi-C
Samples were prepared as previously described 8,9 . Briefly, Tg(Buc-GFP) heterozygous embryos were dissociated in 500 ml of HBSS supplemented with 0.25% BSA and 10 mM Hepes by pipetting for 2 minutes with a glass pipette. Excess of yolk was removed by two rounds of 3 minutes centrifugation at 350 x g at 4ºC, while pelleted cells were resuspended in 1 ml HBSS supplemented with 0.25% BSA and 10mM Hepes prior to filtering. The cell suspension was kept on ice during sorting in a FACS Aria Fusion sorter.

Low input somatic cells in situ Hi-C library generation
Briefly, we performed low input in situ Hi-C 9 with small modifications on sorted somatic cells in biological duplicates. 6,000 FACS-sorted somatic cells were crosslinked in 1% formaldehyde, incubated for 10 min at room temperature with rotation (20 rpm) and quenched by adding 0.2M glycine for 5 min at room temperature with gentle rotation (20 rpm). Cells were then washed three times with 1 mL cold PBS (centrifuged at 300 g for 5 min at 4C) and lysed using ice-cold in situ Hi-C buffer (10 mM Tris-Cl pH 8.0, 10 mM NaCl, 0.2% IGEPAL CA-630, cOmplete Ultra protease inhibitors) by incubating them on ice for 15 min. Centrifuged nuclei were then resuspended in 125 mL ice-cold NEB2 buffer, pelleted (13,000 g for 5 min at 4°C) and permeabilized in 12.5 mL of 0.4% SDS at 65C for 10 min. SDS was quenched by adding 12.5 mL of 10% Triton X-100 and 85 mL of nuclease-free water and incubating at 37°C for 45 min with shaking. Chromatin was digested with 100 U of MboI restriction enzyme (90 min at 37°C with rotation; New England Biolabs) and the enzyme was heat-inactivated at 62°C for 20 min. Digestion generated overhangs were filled in with a mixture of 0.4 mM biotin-14-dCTP (10 mL; Thermo Fisher Scientific) and 10 mM dATP/dGTP/DTTP (0.5 mL each; Thermo Fisher Scientific) by incubating with DNA polymerase I Klenow (4 mL; New England Biolabs) for 90 min at 37°C with rotation. DNA fragments were ligated using T4 DNA ligase (Thermo Fisher Scientific) and a mix (60 mL of T4 DNA ligase buffer, 50 mL of Triton X-100, 6 mL of BSA 20mg/ml, 3.5 mL of T4 DNA ligase and 328.5 mL of nuclease-free water) during 4 h at 20°C with gentle rotation. Nuclei were gently pelleted (2,500 g for 5 min at room temperature) and resuspended in 250 mL extraction buffer. Protein was digested with 10 mL Proteinase K (20mg/ml; Applichem) for 30 min at 55°C with shaking (1,000 rpm) followed by adding 65 mL of 5M NaCl and an overnight incubation at

4-C data generation
4C-seq experiments were performed and analysed as described earlier 10 . For zebrafish, 500 embryos at the 24hpf stage were dechorionated with pronase and deyolked using the Ginzburg Fish Ringer buffer (55 mM NaCl, 1.8 mM KCl and 1.25 mM NaHCO3). Then they were fixed using 2% PFA for 15 minutes at room temperature. Fixations were stopped by adding glycine and washing several times with PBS. Then, the fixed samples were lysed (lysis buffer: 10 mM Tris-HCl pH 8, 10 mM NaCl, 0.3% Igepal CA-630 (Sigma-Aldrich, I8896) and 1x protease inhibitor cocktail (Complete, Roche, 11697498001)), and the DNA was digested with DpnII (New England BioLabs, R0543M) and Csp6I (Fermentas, Thermo Scientific, FD0214) as primary and secondary enzymes, respectively. T4 DNA ligase (Thermo scientific, EL0014) was used for both ligation steps. Specific 4C-seq primers containing single-end Illumina adaptors were designed using as bait the different promoters of interest. For each library, eight independent PCRs were performed using the Expand Long Template PCR System (Roche, 11759060001) that were subsequently pooled and purified using AMPure XP beads (Beckman Coulter, A63883). Then libraries were sent for single-end sequencing. 4C-seq sequencing reads were aligned to the zebrafish reference genome using bowtie. Reads located in fragments flanked by two restriction sites of the same enzyme, in fragments smaller than 40 bp or within a window of 10 kb around the viewpoint were filtered out.
Mapped reads were then converted to reads per first enzyme fragment ends and smoothened using a 30-fragment mean running window algorithm. This signal is ready to be visualized. More details and supporting code are available in the GitLab repository (https://gitlab.com/danio-code/danio-code_4c-seq).

Guide RNA preparation and microinjection
Single guide RNAs (dried pellet, Agilent) were resuspended with H2O into 100 mM stock concentration. Zebrafish codon optimized dCas9-Neon was cloned into pCS2+ vector, then the plasmid was linearized with NotI and dCas9-Neon mRNA was synthesized in vitro with mMESSAGE mMACHINE™ SP6 Transcription Kit (Ambion). Single guide RNA (80 nM/ul) and dCas9neon mRNA (500 ng/ul) were injected into newly fertilized one cell stage zebrafish embryos at 2 nl volume.

KEGG pathway enrichment analysis
KEGG Pathways enrichment analysis of alternative promoter usage was performed using the Enrichr and FishEnrichr tools 11

Genome browser visualisation of chromatin marks and annotated regions
Chromatin marks and gene tracks (Figure 3A, Figure 5C), as well as the corresponding annotation tracks (ChromHMM, PADREs, and ensembles) were visualized with the Gviz package 12 .

eRNA calling
We identified eRNA using CageFightR v1.8.0 13 on the 24 total and nuclear Tagging CAGE-seq samples with the standard parameters. We applied an expression filter, keeping only clusters with an expression greater than zero (unexpressed=0) in more than two samples (minSamples=2). Furthermore, only intergenic and intronic clusters were kept. The code is available in the GitHub repository.

Identification of CTCF binding sites
The CTCF PWM matrix was taken from the JASPAR database 14 under the ID MA0139.1.
PADREs were scanned for CTCF motif using TFBSTools 15 , using a threshold of 90% identity.
For the conservation analyses, the cyprinid (grass carp, common carp, goldfish, and zebrafish) phastCons 17 score from Chen et al. 18

Cell-type specificity assignment
Genome-wide cell-type specificity of the genome version danRer11 was assigned from single-cell ATAC-seq data using the scregseg-pi algorithm described in McGarvey et al. 33 . In brief, the scregseg-pi annotations are based on a hidden Markov model (HMM) that takes as input ATACseq read counts across cell-type clusters in 500 bp bins across the genome. An HMM was fitted with 30 states that describe the cross-cell type accessibility profile using Dirichlet multinomial emission probabilities. The HMM was employed to group the genomic regions into states with similar accessibility profiles. Broadly speaking, three types of states were identified: 1) states that reflect accessibility in individual cell types, 2) states that represent accessibility in multiple cell types and 3) background states that reflect genomic background noise. By investigating state summary statistics and comparing state call proximity to cell-type marker genes (e.g., marker genes derived from Zfin and corresponding scRNA-seq 34

Expression clustering
We grouped 27,781 consensus clusters based on their expression log fold changes using selforganising maps choosing a 5 x 5 grid as in previous work 37 . These analyses separated consensus clusters mostly by maternal and zygotic expression. Afterwards we manually grouped the resulting 25 cells into maternal (blue), zygotic (red), ZGA-peaking (brown) and constitutively expressed at distinct levels (gray).

Identification of Genomic Regulatory Blocks (GRBs)
The GRBs used were called from whole genome alignment between zebrafish and mexican cavefish as described previously 40 , with the following parameters: 70% match and 18kb window.
Extreme conservation is a (fat) tail of a continuum, so some "non-GRB TADs" might be TADs with higher turnover of noncoding conservation. Our comparison looked for enrichments, not sharp differences.

Hi-C data processing and normalisation.
Raw sequencing reads were processed using HiCUP v0. 6

Aggregate contact map analysis around H3K27ac ensembles
Ensembles were categorised as GRB or non-GRB ensembles, and 50-150 kb long ensembles were selected for this analysis. Control sets of ensembles were generated by shifting the ensemble coordinates by 10 mb, or randomly uniformly sampling GRB/non-GRB TADs with/without ensembles. All sets were downsampled to the smallest set. The aggregate coverage-and distancecorrected Hi-C maps around these ensemble sets were computed at 10 kb resolution in a +/-500 kb or +/-150 kb window around the centre of ensembles, using HOMER's analyzeHiC function with the options -hist 10000 -superRes 10000 -norm -normTotal given -normArea given.

Compartment analysis.
The compartment signal was computed as the first principal component of the normalised (coverage-and distance-corrected) interaction profile correlation matrix at 50 kb bin resolution using the runHiCpca.pl function of HOMER. Chromosomes where the contact enrichment within chromosome arms was stronger than the compartment signal, we used the second principal component. Chromosomes 4 and 7 were excluded from the analysis as the compartment calls were poor quality due to assembly issues. TADs were categorised into 4 categories, based on whether they were GRB or non-GRB TADS and whether they did or did not contain ensembles. For all TADs, a compartment score was defined as the compartment signal of the central 50 kb bin of the TAD. The distributions of compartment scores for the GRB/non-GRB TADs with/without ensembles were compared using two-sided two-sample unpaired Wilcoxon test.

Expression profiles of genes in TADs with H3K27ac ensembles
CAGE promoters located in TADS with H3K27ac ensembles are classified as H3K27ac associates if their promoter is located within the ensemble or in the 12.5kb flanking region of the ensemble.
The expression of each promoter in H3K27ac TADs throughout development was centered and scaled, and the promoters were clustered using the kohonen package 20 . The clustering was performed using a 3x3 hexagonal grid, which resulted in 9 classes ordered in a 3x3 grid. The obtained classes were visualized using the ComplexHeatmap package 44 .

Expression of target and bystander genes
The list with prediction of human target and bystander genes was used from Tan, 2017. Zebrafish orthologs were identified using a homology table from biomaRt 4,45 . To account for ohnologs, only predicted target genes with a homolog in a GRB TAD were considered in the analysis. The expression of each promoter in GRB TADs throughout development was centered and scaled, and the expression was visualized using the ComplexHeatmap package.

H3K27ac signal across TADs
H3K27ac signal across TADs was calculated using the ScoreMatrixBin function from the genomation package with 400 bins. Signals of H3K27ac for each stage were used as targets. For windows, each TADs were ordered by length and each TAD was extended to the size of the largest TAD. The calculated matrices were smoothed and visualized using the heatmaps package (Perry, 2021, doi: 10.18129/B9.bioc.heatmaps).

Motif search for regulatory element projection
Motif for cross-species comparison hits were computed using seqPattern on a set of position weight matrices for 120 transcription factor families. Regions with scores of >85 were considered as motif hits.