Epigenomic profiling of primate lymphoblastoid cell lines reveals the evolutionary patterns of epigenetic activities in gene regulatory architectures

Changes in the epigenetic regulation of gene expression have a central role in evolution. Here, we extensively profiled a panel of human, chimpanzee, gorilla, orangutan, and macaque lymphoblastoid cell lines (LCLs), using ChIP-seq for five histone marks, ATAC-seq and RNA-seq, further complemented with whole genome sequencing (WGS) and whole genome bisulfite sequencing (WGBS). We annotated regulatory elements (RE) and integrated chromatin contact maps to define gene regulatory architectures, creating the largest catalog of RE in primates to date. We report that epigenetic conservation and its correlation with sequence conservation in primates depends on the activity state of the regulatory element. Our gene regulatory architectures reveal the coordination of different types of components and highlight the role of promoters and intragenic enhancers (gE) in the regulation of gene expression. We observe that most regulatory changes occur in weakly active gE. Remarkably, novel human-specific gE with weak activities are enriched in human-specific nucleotide changes. These elements appear in genes with signals of positive selection and human acceleration, tissue-specific expression, and particular functional enrichments, suggesting that the regulatory evolution of these genes may have contributed to human adaptation.

quality and multiple-mapping reads were removed using Samtools (version 1.2) 5 with options '-q30 -F 1796' ('-q30 -F 1804 -f 2' for sample GM12878 that was sequenced with 75 bp PE reads). PCR duplicates were removed using Picard (version 1.95) (http://broadinstitute.github.io/picard/). Reads mapping to the corresponding species genome were then selected and only autosomal chromosomes were kept for downstream analysis. We identified enriched regions or peaks with MACS2 (version 2.1.1) 6 using the following parameters -nomodel --shift 0 -extsize sample_specific_fragment_size -keep-dup all for histones H3K4me3, H3K4me1 and H3K27ac, and adding -broad for histones H3K36me3 and H3K27me3. We only kept reproducible peaks, which we defined following a partition concordance approach. For each sample, we created two pseudo-replicates (defined by randomly choosing half of each sample reads without replacement) and used them to call peaks using the abovestated command. We defined as reproducible peaks those that overlapped at least 50% with peaks from both sample pseudo-replicates. Samples peaks were filtered based on their corresponding 50-kmer mappability tracks and only peaks with at least 80% mappable bp were retained. Histone peaks were

Background noise normalization of enrichment values
To obtain background noise corrected histone enrichment signals, we implemented a 3-step approach.
First, sample matched immunoprecipitated (IP) and input counts were normalized by sequencing depth.
The median number of aligned reads across samples IP and input datasets (for any given histone mark) was used as the total reference count. IP and input counts were then scaled to the total reference count.
Second, we removed background noise in a sample-and histone modification-specific fashion. The input-IP relationship in genomic regions with no significant enrichment (non-peaks) was used to estimate the noise contribution to the IP in peaks. We defined a confident set of non-peaks with the same size and same mappability requirements as for the identified samples peaks that did not overlap with any peak detected in any of the five species considered. These strict requirements prevented the inclusion of putative false negatives, enriched regions that were overlooked by the peak calling algorithm. We used a Deming regression to evaluate the linear relationship between the input and IP in non-peaks. Deming regression assumes random measurement error in both the dependent y-variable and the independent x-variable 7 , and thus, is more suitable for these analyses than simple linear regression where only the response variable Y is measured with error ( Supplementary Fig. 34). The noise contribution to the IP was estimated for each genomic region of interest using the inferred linear input-IP relationship in non-peaks with the corresponding depth-normalized input count ( Supplementary Fig. 34). Then, the estimated noise is subtracted from the depth-normalized IP counts in each genomic region, resulting in the background-noise normalized IP enrichment. Finally, the normalized enrichment signal was transformed with the inverse hyperbolic sin (!"#$%): !"#$% = ($ ( * + , * ! + -) This transformation makes the data homoscedastic, that is, with approximately equal spreads despite pronounced variations in enrichment levels. This eases the handling and interpretation of the data and is fundamental in the regression models that will be implemented later on.

RNA-seq alignment and gene expression quantification
RNA-seq 101bp PE reads were aligned to the corresponding reference genome along with the genome of the Epstein-Barr virus using hisat2 (version 2.0.4) 8 with default parameters. Low-quality and multiple-mapping reads were removed using Samtools (version 2.1) 5 with options '-q30 -F 1804 -f 2'.
Potential PCR duplicates were removed using PICARD v1.91 (http://picard.sourceforge.net). Read counts in genes were computed using htseq-count 9 . We used the R package DESeq2 (version 1.14.1) 10 to evaluate the coherence between technical replicates, exploring sample-to-sample distances and PCA results. Technical replicates were highly correlated and hence collapsed and used thereafter

Establishment of orthologous relationships
To find 1-to-1 orthologous genes, we mined the Ensembl 11 database was using biomaRt 12 . We retrieved the following features for the non-human species studied: Ensembl genes ID, gene coordinates, homology type and orthology confidence. Only autosomal one-to-one orthologous genes with the highest orthology confidence value (1) were kept for defining the set of 1-to-1 orthologous genes across the five species (11,249 genes). Of these, 9,936 were annotated as protein-coding in all species and 7,850 were expressed (TPM ≥ 0.5) in at least one species.
To find orthologous regulatory regions, we followed the approach described in Supplementary Fig. 36.
First, we mapped non-human regulatory elements to the human reference genome and considered orthologous regulatory elements those with a minimum overlap of 50% and for which the overlapping regions had similar size (50% of each other) ( Supplementary Fig. 36a). In the case of several overlapping regions, we prioritized the regions with the largest overlap. For every orthologous region for which we could not recover an orthologous regulatory element in at least one species, we defined the primate orthologous regulatory region coordinates as a genomic region of size equal to the average size of the species orthologous elements and located in the middle of the collapsed orthologous elements coordinates. We used these primate orthologous regions to recover overlooked orthologous relationships, this is, regulatory elements that either overlapped or were close (200 bp) to the primate orthologous region (Supplementary Fig. 36b). If no regulatory element was recovered, the coordinates of the primate orthologous region were mapped to the corresponding species assembly ( Supplementary   Fig. 36c). For species-specific regulatory elements for which we found no orthologous regulatory elements in any of the other species, we mapped their coordinates to the other species reference genome assemblies and considered those regions as the corresponding orthologous region ( Supplementary Fig.   36d). All inter-species projections were performed using the liftOver tool from the UCSCTOOLS/331 suite 13 . We used pairwise best reciprocal chains. For every inter-species projection, coordinates were mapped twice, going forward and backward, and only regions that could be properly mapped in both directions were kept. Overlaps were performed using the intersectBed tool from the BEDTools suite 14 .
Note here that our dataset of orthologous regulatory regions is restricted to genomic regions that can be mapped across species (regions for which an orthologous region at the sequence level can be found). This is a restrictive protocol that ensures the comparability and balance between different species, which have very different levels of annotation and genome assembly qualities. However, this comparability comes at the price of not being able to study the epigenomic changes associated with genomic gains and losses. To assign a regulatory state to those orthologous regions not associated with a regulatory element, we used the underlying enrichments in histone modifications and open chromatin (see Assignment of a regulatory state to regulatory elements). For those analyses comparing the regulatory state at orthologous regulatory regions associated with genes, we assigned a consensus regulatory component type to each orthologous regulatory region considering the different quality of the assemblies and gene annotations. Specifically, we assigned to the group of orthologues the type of regulatory component assigned in more species. In the case of a draw, we established the following hierarchy: human > chimpanzee > macaque > gorilla > orangutan.

Normalization of gene expression and enrichment signals across species
First, we used batch correction to remove the technical variation derived from the integration of previously published data for the cell line GM12878. The batch effect was corrected using the ComBat function implemented in the R package sva (version 3.22.0) 15 . Then, we developed a method to normalize gene expression across samples. This method is based on the identification of a set of internal reference controls (IRCs) (genes) whose expression is constant across samples and from which a normalization factor can be derived ( Supplementary Fig. 37). We defined internal reference controls as genes that were not outliers in any pairwise sample comparison, where outliers were defined as any gene with pairwise sample differences larger than the pairwise sample-specific median difference plus two median absolute deviations. We first standardized (robust standardization) each sample signal in IRCs. Then, we defined the representative IRCs values as the means of the standardized values of each IRC in the different samples. These representative IRCs are used to define a common scale for all the samples. We modeled the linear relationship between IRCs of each sample and the representative IRC values using Deming regressions. Sample-specific normalization factors are the result of dividing each sample-specific slope by the mean slope of all samples. To normalize the signal of all genes, we first scaled each sample signal in the robustly standardized distribution of IRCs using the corresponding sample median and median absolute deviation computed using these standardizations. Then, we proceeded to do the normalization with the corresponding sample normalization factor. Finally, to retrieve normalized signals, destandardization was performed using the mean and standard deviation of the samples' IRC median and median absolute deviations.
To assess the performance of the developed calibration method, we evaluated the effect of both the batch correction and expression signal normalization for every sample-pairwise comparison. To do so, we measured the angle to the identity line (x=y). If the normalization procedure removed noisy variance effectively, differences between samples were expected to be reduced. This would reflect in regression lines closer to the identity line ( Supplementary Fig. 38). Our method effectively removed technical noise in the expression signal, the angles were successfully shrunk both at IRCs and pairwise orthologous expressed protein-coding genes excluding IRCs ( Supplementary Fig. 39).
We further compared our normalization method with the widely used quantile normalization 16 and found that our method outperformed the latter, particularly at normalizing the tails of the distributions (Supplementary Figs. 40 and 41). Normalization of the tails is a known problem associated with quantile normalization, which cannot properly handle the greater inter-samples differences at extreme values.
Our parametric approach overcomes this limitation and correctly calibrates signal values throughout the whole distribution.
We applied the same method to normalize the enrichment signals of the histone modifications at regulatory elements associated with orthologous protein-coding genes.

Genotyping
WGS was used for genotyping every cell line and hard-filtering criteria were applied to identify SNPs, Indels and STRs. Variant discovery was performed using GATK version v3.7 18 . 'HaplotypeCaller' was run for each sample independently with default parameters using the '--emitRefConfidence GVCF' mode. Subsequently, both biological replicates were jointly genotyped using 'GenotypeGVCFs'. Hard   26 , which are, in general, lower than those observed in non-transformed tissues, e.g., ~72% global methylation in great-ape blood 27 . Nonetheless, orangutan sample O1 stood out, due to its unusually low global methylation (average sample-specific methylation value of 48.6%). As a result, Pearson correlation values between orangutan replicates were much lower (64%) than among human (81%), chimpanzee (77%), gorilla (77%) and macaque (81%) samples.
The R package MethylseekR (version 1.12.0) 28 was used to identify unmethylated regions (UMRs) and low methylated regions (LMRs). Methylation levels and FDR parameters were inferred from the data, as suggested by the MethylseekR workflow.
UMRs tend to be CpG-rich regions and are commonly regarded as proximal regulatory elements; LMRs, CpG-poor, have been associated with the binding of transcription factors, which cause the local reduction of otherwise high methylation levels, and are largely considered as distal regulatory elements, highly dynamic and tissue-specific 29 . Overall, similar numbers of UMRs and LMRs were found across samples and species ( Supplementary Fig. 43). Notably, sample O1 showed a discordant pattern, with particularly elevated numbers of both UMRs and LMRs.
The relationship between UMRs and LMRs and chromatin states was explored, and the expected trends were observed. On the one hand, UMRs showed an enrichment pattern highly akin to that of CGI. On

Supplementary Figures
Supplementary Figure 1. Schematic illustration of the approach followed to annotate and classify regulatory elements. a, DNA strand represents a gene annotation track, wherein dark grey regions correspond to coding annotated regions. The second row represents the binarized output from ChromHMM, wherein each box corresponds to a 200 bp bin. Light grey indicates bins without evidence of promoter or enhancer states, whereas the different colors represent the different learned chromatin states. Shorter DNA strands represent the genomic coordinates defined for regulatory elements that result from merging adjacent 200 bp bins with epigenetic signals associated with promoter or enhancer states. We defined species regulatory elements from the union of the regulatory elements detected in each biological replicate. b, We a b c established a hierarchy between chromatin states based on the combination of chromatin marks found within each regulatory region and classified regulatory elements into epigenetic promoter (P) and enhancer (E) states with three different activity levels: strong (s), weak (w) or poised (p). Then, we applied a linear discriminant analysis (LDA) using normalized histone and open chromatin enrichments to refine this epigenetic classification. c, We linked regulatory elements to genes based on gene proximity and using previously published 3D chromatin maps in GM12878 cells, we recovered physical interactions between regulatory elements. We additionally assigned regulatory elements associated with genes to a type of regulatory component. We classified regulatory elements into genic promoters (gP), genic enhancers (gE), proximal enhancers (prE), promoter-interacting enhancers (PiE) and enhancer-interacting enhancers (EiE) (    Color-coded the different regulatory states a, pre-LDA composition; b, post-LDA composition. aP and aE correspond to regulatory elements with consensus states (both promoters or enhancers, respectively) but with different activities between replicates (e.g., sE in one replicate and pE in the other replicate). P/E correspond to regulatory elements with a promoter state in one replicate and an enhancer state in the other replicate. P/Non-RE and E/Non-RE correspond to regulatory elements with a promoter or enhancer state in one replicate and no evidence of regulatory activity in the other replicate.

Supplementary Figure 17. Number of distal enhancers annotated as PiE and EiE per species through the integration of 3D chromatin data.
Promoter-interacting-enhancers are gene-associated enhancers that interact with genic promoters. Enhancer-interacting-enhancers are gene-associated enhancers that interact with other enhancers.                Tissue expression patterns (median TPM) from bulk RNA-seq data obtained from the latest GTEx release (v8) in genes that have conserved regulatory elements that are a, promoter-interacting enhancers with strong enhancer activities; b, proximal enhancers with weak enhancer activities and c, enhancer-interacting enhancers with weak enhancer activities. Box plots show medians and the first and third quartiles (the 25th and 75th percentiles), respectively. The upper and lower whiskers extend the largest and smallest value no further than 1.5 × IQR.  Box plots show medians and the first and third quartiles (the 25th and 75th percentiles), respectively. The upper and lower whiskers extend the largest and smallest value no further than 1.5 × IQR. b, Effect sizes corresponding to significant pairwise comparison from (a) (Dwass-Steel-Critchlow-Fligner test; P < 0.05). c, Comparison of the proportion of brain-specific (tauBrain > 0.8) and testis-specific genes (tauTestis > 0.8) associated with either human-specific (n = 90 genes) or conserved intragenic enhancers with weak enhancer states (n = 524 genes). Two-tailed Fisher's exact test; P-values below and above 0.05 are indicated by an asterisk (*) or ns, respectively. Distribution of hSNC densities in human genes associated with intragenic enhancers (grey), in the subset of genes associated with fully conserved weak intragenic enhancers (purple) and in the subset of genes with human-specific weak intragenic enhancers (hswEgE, pink). P-values in the plot correspond to the indicated one-tail Mann-Whitney U tests. b, Three hswEgE that accumulate more hSNCs than expected. For each of the significant hits, the name of the hswEgE-containing gene, the enhancer id (t), the number of hSNCs in the enhancer and the P-value is shown. c, Significance of the observed number of hits. Distribution of the simulated number of hits after 10,000 simulations. In b, and c, the vertical dashed line represents the value of the observed density for each enhancer and the observed number of hits in our data, respectively, and P-values are adjusted using Bonferroni correction.  RE which is then recovered as the corresponding orthologous. c, Neither the chimpanzee nor macaque REs overlap with REs from the remaining species. When the representative orthologous region is mapped to the chimpanzee reference genome, a RE is found in close proximity and recovered as orthologous RE. When the representative orthologous region is mapped to the macaque reference genome, no RE is found, but the coordinates can be mapped and this genomic region is recovered as the corresponding orthologous region in macaque. In downstream quantitative analysis, reads will be counted and normalized in defined orthologous regions despite absence of RE. This approach allows the recovery of overlooked RE. d, Gorilla orthologous RE is recovered through proximity to the reference orthologous RE in gorilla coordinates. Orangutan and macaque are assigned their corresponding orthologous regions. Figure 37. Schematic representation of the approach employed to calibrate the expression and epigenetic signals between species. Step 1. Color gradient represents different degrees of expression/enrichments (green: low expression/enrichment; yellow: medium expression /enrichment; red: high expression/enrichment). IRCs are composed of units (genes, regulatory elements, epigenetic signals associated with a particular type of gene component) with similar enrichments across samples. Step 2. After robust standardization, values are normally distributed and more similar across samples (blue gradient; light blue: low enrichment; blue: medium enrichment; dark blue: high enrichment). Normalization factors are estimated considering the linear relationships between sample-specific and representative robust standardized values at IRCs. Step 3. Values at genes/elements/components are scaled to the common space where normalization factors are computed and after the calibration, values are brought back to the original scale. S stands for sample and j ∈ [1,10]. N is the number of observations and I ∈ [1,N].