Pig genome functional annotation enhances biological interpretations of complex traits and 1 comparative epigenomics 2

The functional annotation of livestock genomes is crucial for understanding the molecular 47 mechanisms that underpin complex traits of economic importance, adaptive evolution and 48 comparative genomics. Here, we provide the most comprehensive catalogue to date of regulatory 49 elements in the pig ( Sus scrofa ) by integrating 223 epigenomic and transcriptomic data sets, 50 representing 14 biologically important tissues. We systematically describe the dynamic epigenetic 51 landscape across tissues by functionally annotating 15 different chromatin states and defining their 52 tissue-specific regulatory activities. We demonstrate that genomic variants associated with 53 complex traits and adaptive evolution in pig are significantly enriched in active promoters and 54 enhancers. Furthermore, we reveal distinct tissue-specific regulatory selection between Asian and 55 European pig domestication processes. Compared with human and mouse epigenomes, we show 56 that porcine regulatory elements are more conserved in DNA sequence, under both rapid and slow 57 evolution, than those under neutral evolution across pig, mouse, and human. Finally, we provide 58 novel biological insights on tissue-specific regulatory conservation and demonstrate that, 59 depending on the traits, mouse or pig might be more appropriate biomedical models for different 60 complex traits and diseases in humans through integrating comparative epigenomes with 47 human 61 genome-wide association studies.


9
We clustered the entire genome into 12 modules based on their relative frequency of chromatin 176 states and observed that these modules exhibited distinct enrichments for protein-coding genes, 177 non-coding genes and CpG islands (Fig. 3a). For instance, module 2 (M2) was characterized by 178 active promoters and accessible enhancers, showed the highest enrichment for genes and CpG 179 islands, and the lowest levels of DNA methylation, and the highest gene expression levels (Fig.  180 3b). Compared to modules 11 and 12, module 10 showed similar enrichment for Polycomb 181 repression but higher enrichment for TssBiv, in which genes exhibited significantly lower 182 expression levels, suggesting the crucial role of TssBiv for regulating gene repression (Fig. 3b). In 183 addition, we noticed that module 1 had high enrichment for TssAHet, high levels of DNA 184 methylation, and high representation of genes located on the X chromosome and these genes are 185 relevant with histone modification Gene Ontology (GO) terms (Supplementary Table 3). This may 186 indicate potential roles of TssAHet in heterochromatin on the Chromosome X 35 . 187 By examining the distribution of chromatin states among all 14 tissues, we found that enhancer 188 activity was the most variable between tissues, while promoters were least variable ( Fig. 3c- Table 4). We also observed that TSE were enriched for active states 198 (promoters, transcribed regions and enhancers) and depleted for repressed states in the 2kb regions 199 around their TSS in the corresponding tissue compared to other tissues (Fig. 3f). Furthermore, we 200 found that predicted target enhancers of TSE in a tissue were more constitutive among biologically 201 similar tissues compared with other tissues (Fig. 3g), which was consistent with promoters of TSE 202 ( Supplementary Fig. 6). 203

Functional characterization of tissue-specific chromatin states 204
As enhancers were most variable among tissues compared to other chromatin states, we 205 identified an average of 6,895 tissue-specific EnhAs among 14 tissues, ranging from 1,393 in 206 jejunum to 14,811 in skeletal muscle (Fig. 4a). To further investigate the biological functions of 207 such enhancers, we defined three other types of EnhA, including all-common EnhA (shared among 208 all tissues), gut-common EnhA (shared among gut tissues) and brain-common EnhA (shared 209 among brain tissues). Gene Ontology (GO) analysis of putative target genes of these different types 210 of EnhAs revealed distinct biological functions (Fig. 4b, Supplementary Table 5). For instance, 211 all-common EnhAs were involved in fundamental biological processes (e.g., regulation of mRNA 212 catabolic processes and responses to wounding), whereas gut-common EnhAs were significantly 213 involved in intestinal development, digestion and absorption, and immune response. EnhAs that 214 were specifically active in individual gut tissues showed distinct functions, clearly matching the 215 known biological functions of the tissue in question. For example, jejunum-specific EnhAs were 216 involved in biological processes relevant to T cell and lymphocyte function 36 , while colon-specific 217 EnhAs were mainly engaged in stress-activated MAPK cascades 37 (Fig. 4b). We observed that 218 intestine-and spleen-specific EnhAs shared many immune functions, and brain-specific EnhAs 219 were significantly involved in memory and learning (Fig. 4b). Furthermore, we observed that 220 genes whose topologically associated with tissue-specific EnhAs (Methods) were specifically 221 12

Chromatin states predictions enhanced the biological interpretations of adaptive evolution 245 and complex traits in pigs 246
To determine whether genomic regions associated with adaptive evolution are significantly 247  10a). In examining tissue-specific regulation, our analysis revealed that the all-common TssA 253 were significantly enriched within regions under selective pressure in both populations (Fig. 5b). 254 Interestingly, spleen-specific REs were most enriched in Asian pig domestication, wheras cortex-255 specific REs were most enriched in European pig domestication (Fig. 5b). Consequently, tissue-256 specific gene regulation may have played an essential role in the adaptive selection processes that 257 resulted in Asian and European pig domestication. This result was also in agreement with the 258 observation that Asian domesticated pigs being more disease resistant 41 , whereas European 259 domesticated pigs are more active and aggressive 42,43 . 260 To ask whether SNPs associated with complex traits in pigs are enriched in regulatory regions, 261 we integrated GWAS signal enrichment analysis for 44 complex traits (Supplementary Table 9) 262 with all 15 chromatin states, and demonstrated that GWAS signals were most enriched in TssA 263 ( Fig. 5c), which was consistent with previous findings in humans 44 . We also found that enrichment 264 for variants associated with complex traits was significantly positively correlated with signatures 265 of selection ( Supplementary Fig. 10b,c). We then asked if tissue-specific REs were involved in 266 genetic control of specific complex traits. To answer this question, we conducted GWAS signal 13 enrichment analysis for average daily gain (ADG) in three separate breeds (i.e., Duroc, Landrace 268 and Yorkshire), with emphasis on tissue-specific TssA and EnhA. As we expected, muscle, 269 adipose, liver, and gut-common regulatory elements were the most relevant for ADG (Fig. 5d). In 270 further examining the top ADG QTLs in Landrace (Fig. 5e), we found that the top hit SNPs that 271 are within a muscle-specific EnhA (Fig. 5f) that appears to target two genes (ZNF532 and ALPK2) 272

Comparative analysis of pig, mouse and human epigenomes 282
The distribution of individual histone epigenetic marks and chromatin accessibility with respect 283 to genomic features (e.g., 5'UTRs and exons) was consistent between pig, mouse, and human 284 ( Supplementary Fig. 11). To determine if the chromatin states are similarly conserved between 285 these species, we predicted 15 chromatin states in mouse and human based on the same epigenetic 286 marks in pig. The resulting chromatin state predications demonstrated general similarity among 287 the three species in terms of genome coverage, genomic distribution and sequence conservation 288 ( Fig. 6a, Supplementary Fig. 12). 289 To explore the relationship between the epigenome and DNA sequence conservation among 290 three species, we divided each genome into regions corresponding to 50 different levels of 291 sequence conservation (0 th -49 th ) (Methods). Our results revealed that the majority of chromatin 292 states showed higher conservation levels in sequences under both rapid and slow evolution than 293 those under neutral evolution, following a U-shaped distribution 47 (Fig. 6b, Supplementary Fig.  294 13a). We also found that the densities of chromatin states and gene elements followed the similar 295 U-shaped distribution ( Supplementary Fig. 13b,c), supporting the hypothesis that conserved epi-296 modifications may buffer negative selective pressures by providing the genome more elastic room 297 to adapt 47 . Furthermore, we categorized orthologous genes into 50 groups based on the degree of 298 conservation of gene expression between species and observed that genes with more conserved 299 expression levels also demonstrated more conserved TssA and TssBiv signatures (Fig. 6c). In 300 further examining sequence extremely conserved (49 th ) or extremely variable regions (0 th ), genes 301 linked to TssA shared by human and pig are involved in basic biological processes, such as ncRNA 302 metabolic process and mRNA catabolic process ( Supplementary Fig.14). For the sequence 303 extremely conserved region(49 th ), we found that genes proximal (± 2 kb) by human-specific 304 (comparing with pig) TssA in brain (e.g., FOXG1 48 ) are engaged in neuron fate commitment, 305 cerebral cortex development, learning and memory (Fig. 6d, Supplementary Fig. 15, 306 Supplementary Table 10). 307 Next, we evaluated the evolutionary basis of complex traits in humans. Heritability enrichment 308 analysis of 47 complex traits across 15 chromatin states that were mapped from pigs to orthologous 309 regions in humans found that promoters and TSS-proximal transcribed regions were most enriched 310 for variants (Fig. 6e). We further revealed that the more conserved (species-shared) chromatin 311 states showed significantly higher enrichment of complex traits heritability than the more 312 divergent (species-specific) chromatin states (Fig. 6f). Then we further examined the role of tissue-313 specific gene regulation on human complex traits. Our heritability enrichment analysis of complex 314 traits, based on human orthologous regions of tissue-specific EnhAs identified in pigs, 315 demonstrated that tissue-specific enhancers were significantly enriched for the corresponding 316 human complex traits relevant to biological functions of specific tissues (Fig. 6g). For instance, 317 the lung-specific EnhAs were significantly enriched for the heritability of lung forced expiratory 318 volume 1 (FEV1), liver for fasting glucose and cholesterol, colon for Crohn's disease, and cortex 319 for intelligence (Fig. 6g). 320 Finally, we sought to determine if this annotation of regulatory elements substantiated the use 321 of pig as an appropriate animal model for different human diseases by comparing human, mouse 322 and pig epigenomes in specific tissues. In brain cortex, the mouse-human shared EnhAs exhibited 323 significantly higher heritability enrichment than the pig-human shared EnhAs for most brain-324 relevant traits, such as attention deficit hyperactivity disorder (ADHD), intelligence, depression 325 and reaction time, with the exception of Alzheimer's disease, for which heritability was 326 significantly enriched in pig-human shared EnhAs rather than the mouse-human shared EnhAs 327

337
In this study, we provided the most comprehensive catalog of porcine regulatory elements to 338 date, spanning 14 tissues including six gut-associated tissues, and characterized the dynamic 339 chromatin state landscape across these tissues and uncovered extensive tissue-specific regulation 340 of gene expression. 341 The annotation of functional elements in human and mouse has proven highly effective for the 342 identification of causative variants of complex traits 23,27 . Our results also demonstrated that variants 343 of complex traits and eQTLs of growth-related traits were significantly enriched in the active 344 promoters and enhancers annotated by this study. Specifically, we speculate that a potential 345 causative SNP, which was associated with average daily gain and which was found within a 346 muscle-specific enhancer, may regulate the expression of ALPK2 45,46 , a gene demonstrating 347 muscle-specific expression (0.5Mb away). In addition, our annotation of functional elements in 348 pigs allows us to evaluate the potential role of regulatory elements on pig domestication. Our 349 analysis illustrated that signatures of domestication were significantly enriched in porcine 350 regulatory elements. Specifically, genetic variants in the spleen-specific promoters were enriched 351 during Asian pig domestication, whereas variants within cortex-specific promoters were enriched 352 during European pig domestication. This novel insight may reflect the observed distinct 353 phenotypic difference between Asian (more disease resistance 41 ) and European domesticated pigs 354 (more active and aggressive 42,43 ). Further investigation is warranted to deepen our understanding 355 of genetic selection and domestication in the pig. This regulatory element atlas will serve as a 356 valuable source for the livestock community to inform GWAS and eQTL findings, genomic 357 selection program, and genome editing strategies, as well as to enhance our understanding of 358 genome evolution and adaptation. With continued efforts by the FAANG Consortium 50 , more 359 epigenomic data will be available from diverse samples such as reproduction related tissues, 360 additional developmental stages, and different physiological states. The systemic integration of 361 "omics" data, for instance, the on-going pig GTEx effort will contribute additional insight into the 362 biological mechanisms that underpin agronomic traits, and thereby enhancing genetic

Clustering of large-scale chromatin structure 464
To examine genome-wide chromatin structure, we first divided the genome (excluding chrUn) 465 into 1,224 fragments of 2Mb in length. Then we calculated the state frequencies (state bin/total 466 bin) in each 2Mb fragment for each tissue and then the average frequency across tissues. To 467 identify modules, column clustering was performed by k-means=12, and rows were clustered using 468 k=3. In addition, we calculated number of protein-coding, lncRNA, and CpG islands for each 2Mb 469  Fig. 4d). Finally, followed by the order from high to low 479 based on the GL/TGL value in each tissue, we calculated the total genomic length of accumulated 480 tissues (aGL) by adding one tissue each time until all 14 tissues were added, and the cumulative 481 state coverage was calculated as aGL/ TGL. States whose cumulative coverage changed faster than 482 others were considered to be less constitutive (more variable) states. 483

Chromatin state switching between tissues 484
Chromatin state switching between tissues was calculated by pairing two tissues. Given a 485 pairing of A and B tissues, we first counted total bins of chromatin state "e" in A (TbAe), then 486 obtained the overlap bins of chromatin state "e" (Obe) in A and B, then computed the state 487 switching probabilities using Obe/TbAe for the tissue A to B transition and Obe/TbBe for the 488 tissue B to A transition. By averaging these calculations for a pair of tissues, we obtained the pair 489 switching probabilities. We calculated the state switching probabilities in between intestinal 490 tissues, between brain tissues ( Supplementary Fig. 6a,b) and between 8 distinguishable tissues 491 (jejunum, cortex, adipose, liver, lung, muscle, spleen). 492

Hierarchical epigenome clustering 493
We first calculated the an epigenetic mark's signal confidence scores (-log 10 (Poisson P value)) 494 within 200 bp of the genomic regions for each mark of each sample as described in 495 http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/02_peak-calling/peak-496 calling_report.html#data_sets. Then, we extracted a specific mark's signal confidence score of 497 each sample for specific state of RRATs regions. For example, we extract H3K4me1 signal 498 confidence scores for EnhA. After combining all samples' mark confidence scores for each tissue 499 and each state, we constructed a distance matrix using the ward.D2-linkage hierarchical clustering 500 following by Euclidean distance method in R. 501

Promoter enrichment analysis of tissue-specific expressed genes among 14 tissues 502
To evaluate how chromatin state changes at promoter regions of TSE genes across tissues, we 503 first performed a Student's t-test among 14 tissues to identify tissue-specific expressed genes based 504 on TPM. We further grouped some tissues into different sub-groups such as small intestine 505 (Jejunum, Ileum, Duodenum), large intestine (Cecum, Colon), and brain (Cortex, Cerebellum, 506 Hypothalamus), and identified tissue-specific expressed genes by excluding the tissues in the same 507 sub-group. Then we selected genes with the top 5% t-value as TSE genes 68 . The biological process 508 of GO enrichment for these TSE genes were identified by WebGestalt2019 69 using the default 509 significance level (FDR< 0.05). Then we calculated the chromatin state fold enrichment of TSE 510 (up and down stream 2000bp around TSS) in each tissue and the change in enrichment by TSE 511 enrichment in specific tissue minus other tissues. 512

Chromatin state switching of target enhancer (EnhA) of TSE gene 513
To evaluate how enhancers of TSE genes switch among tissues, we first identified the target 514 enhancers of TSE genes following the method described in our recent study 29 . Briefly, we 515 generated the predicted TADs from CTCF ChIP-seq data by FIMO 70 following the method 516 described in Oti,et al. 71 . Then we predicted the enhancer-gene pairs according to the Spearman's 517 rank correlation of every possible combination of regulatory element H3K27ac signal and gene 518 expression value within each TAD. Benjamini-Hochberg adjustment (FDR < 0.05) was used to 519 define putative interacting pairs. The enhancers in the enhancer-gene pairs that corresponded to 520 TSE genes were considered as TSE genes' target enhancers. Finally, we computed enhancer state 521 switching probabilities of TSE genes among tissues using the method described above. 522

TSR of enhancer, promoter and their putative functional regulation 523
For strong enhancer (EnhA) identified in each tissue, we counted the bins of overlapping RRATs 524 by comparing to other tissues. If the number of bins >= 1, the tissue of this RRATs region would 525 be assigned 1, otherwise it was assigned 0. We generated a total of 17 modules of tissue-specific 526 regulatory elements (TSR) enhancers. The 17 modules included all-common (presented in all 527 tissues), gut-common (presented in all 5 intestinal tissues), brain-common (presented in all 3 brain 528 tissues) and 14 tissue-specific modules. The same method was used to obtain TSR for promoters 529 (1_TssA). In addition, we performed enrichment analyses (GO, Human Phenotype Ontology 530 (HPO), Mouse Phenotype) based on genes proximal to TSR using the GREAT 72 tool with default 531 parameters except for TSR promoters (proximal 2kb upstream, 1kb downstream, plus distal up to 532 3kb). We used a cut-off of FDR<0.05 for both the binomial and the hypergeometric distribution-533 based tests. 534

25
The motifs of tissue-specific EnhAs were identified by HOMER 73 (v.4.11) with cutoff 535 FDR<0.05. We selected the top three enriched or tissue function relevant motifs for each tissue as 536 the candidate tissue-specific EnhAs motifs and generated a total of 51 motifs enriched in tissue-537 specific EnhAs. In addition, we used these 51 motifs as known TF motifs to conduct the enrichment 538 for all tissues by HOMER. The mRNA expression of corresponding TFs in pigs were used to 539 calculate the correlation with motif enrichment. MarkDuplicates with default parameters. The SNPs of Gvcf for each sample were called by GATK 545 HaplotypeCaller. All Gvcf were then combined and the variants for each sample were called by 546 GenotypeGVCFs. After SNP calling, the variants were filtered using VariantFiltration (QD < 2.0, 547 MQ < 40.0, FS > 60.0, SOR > 3.0, MQRankSum < -12.5, ReadPosRankSum < -8.0) to remove 548 low-quality SNPs. We then performed Fst analysis between Asian wild and domestic pigs, and 549 between European wild and domestic pigs, and calculated the fold enrichment of selection 550 signature for chromatin states using the same method for gene elements enrichment described 551 above. 552

GWAS and eQTL signal enrichment of chromatin state 553
The pig GWAS data of 44 traits was described previously 76,77 (Supplementary Table 9). First, we 554 filtered out all SNPs with minor allele frequency below 0.5%, with a large deviation from Hardy-555 Minimac4 of less than 0.4. We performed GWAS signal enrichment of 44 pig complex traits (3 557 daily gain related, 20 lipid related, and 21 feed efficiency related) for each chromatin state across 558 14 tissues using a 10,000 times genotype cyclical permutation tests 68 . The eQTLs data in pig 559 muscle 78 with FDR<0.05 were used to calculate the fold enrichment for the chromatin states using 560 the same method above. 561

Interspecies conservation of chromatin state 562
We collected data from ENCODE 4,9 , Roadmap Epigenomics 8 and published articles 79 (9 tissues 563 in human and 7 tissues in mouse, Supplementary Table 11,12), including ChIP-seq (H3K4me3, 564 H3K4ac, H3K4me1, H3K27me3, Input), ATAC-seq, DNase-seq, and RNA-seq. In total, we 565 obtained six matched tissues (small intestine, liver, spleen, lung, adipose, cortex) among pig, 566 human, and mouse. All the data were processed following the same pipeline used in pig. The 567 GRCh38 (human) and GRCm38 (mouse) assemblies with Ensembl annotations (v100) were used 568 for data analysis. Chromatin states of human and mouse were also trained by ChromHMM and 15 569 chromatin states were identified. To explore the relationship between sequence conservation and 570 epi-conservation among the three mammals, we first divided the genome into 50 equal sized sets 571  Fig. 13d). To quantify epigenomic 576 conservation, we downloaded the whole genome alignments UCSC chain files among human 577 (hg38), pig (SusScr11), mouse (mm10) and processed as described in the UCSC Genome Wiki 578 website (http://genomewiki.ucsc.edu/index.php/HowTo:_Syntenic_Net_or_Reciprocal_Best) to 579 derive reciprocal best chains. Then we converted genomic coordinates between assemblies using 580 the UCSC Liftover tool (https://genome.sph.umich.edu/wiki/LiftOver) based on 0.65 sequence 581 identity. All the chromatin states in pig and mouse were lifted over to human. The conservation 582 rate (0~1) of each region of each state from pig to human was calculated based on state region 583 coverage of pig over human. If there was no overlap it was assigned 0, if completely occupied it 584 was assigned 1. The same analysis was conducted for pig to mouse and mouse to human. 585 Furthermore, we performed genomic and epigenomic conservations for every pair of mammalian 586 species in each tissue. Finally, we conducted the same analysis on mammalian conserved score 587 based Genomic Evolutionary Rate Profiling (GERP) using 103 mammalian genomes 588 (ftp://ftp.ensembl.org/pub/release-589 100/compara/conservation_scores/103_mammals.gerp_conservation_score/) 590 To examine the biological relevance of sequence extremely variable (0 th -2 th sets) and conserved 591 regions (47 th -49 th sets), we extracted the human-pig shared and human-specific chromatin state 592 TssA from these regions. Then the GREAT tool with parameter of proximal 2kb upstream, 1kb 593 downstream, plus distal up to 3kb was used to conduct GO function enrichment analysis. 594

Expression conservation versus epi-conservation 595
The TPM of 14302 orthologous genes from pig, human, and mouse were used to identify 596 differentially expressed genes in each tissue using the Student's t-test. We sorted the genes by p-597 value within each species and divided them into 50 equally sized sets. Then we calculated the 598 average epi-conservation score of states in the 20kb region around TSS of gene in each set. 599

Heritability enrichment of human complex traits in chromatin state 600
To explore how conserved or species-specific chromatin states affects complex traits in humans, 601 we extracted six types of species-share or species-specific regulatory elements (all_shared, 602 human_mouse_shared, human_pig_shared, human_specific, mouse_specific, pig_specific). We 603 applied stratified linkage disequilibrium score regression (LDSC) to partition heritability of 47 604 human complex traits into distinct functional categories 44 , which revealed which functional regions 605 explained more genetic variation of complex traits from an evolutionary point of view. These 606 functional categories included six types of species-shared/specific regulatory elements, chromatin 607 state of each tissue, and TSR of EnhA/TssA. We calculated the stratified LD scores using 1000G 608

Code availability 630
The pipeline for RNA-seq, ATAC-seq, DNase-seq and ChIP-seq processing is available at 631 https://github.com/kernco/functional-annotation. RRBS pipeline and other processing codes are 632 publicly available at https://github.com/zhypan/Functional-Annotation-of-Pig. Correspondence and requests for materials should be addressed to H.Z., and L.F. 660 Reprints and permissions information is available at www.nature.com/reprints. 661 662