Introduction

Annotating functional elements in the human genome is a major goal in human genetics. Despite years of efforts from both experimental and computational scientists, functional annotation remains challenging, especially in the non-protein-coding regions. It is estimated that approximately 98% of the human genome is non-protein-coding1. Because of the apparent importance of coding regions, many computational tools have been developed to annotate DNA variants in the coding regions2,3,4. Although the non-coding regions were considered “junk DNA” for many years, much has been learned on the potential roles of these regions in the last decade. First, extensive comparative genomic studies have shown that the majority of mammalian-conserved regions consist of non-coding elements5. Second, results from genome-wide association studies show that close to 90% of the significant variants associated with human diseases reside outside of the coding regions6, only slightly less underrepresented among all the variants in the human genome, where about 95% of known variants are from the non-coding regions. Third, high-throughput experiments, e.g. the ENCODE project7, also suggest that a large fraction of the human genome are functionally relevant. All of this evidence suggests the importance and need for extending the annotation tools from the coding regions to the entire human genome.

Despite the increasing need to functionally annotate the human genome, there is no universal definition of genomic function8,9, which differs among geneticists, evolutionary biologists and molecular biologists. The experimental approaches and analysis techniques of detecting functional genomic elements among these scientists also vary greatly. Extensive work in some genomic regions such as the β-globin gene complex has shown that no single approach is sufficient to identify all the regulatory activities in the non-coding regions8,10. In order to obtain a comprehensive picture of the genomic functional structure, all the valuable information acquired through different approaches needs to be combined using appropriate statistical learning techniques.

Several annotation tools focusing on the non-coding regions have been established recently11,12,13,14,15. Similar to the long list of deleteriousness prediction tools developed for the coding regions, most of these new methods aim to distinguish tolerable variants from the deleterious ones. Though important, prediction of deleteriousness does not cover every aspect of functional annotation. The potential of these variant classifiers in understanding the genomic architecture on a large scale and in detecting regulatory elements such as cis-regulatory modules remains to be thoroughly investigated. Moreover, scientists now routinely analyze different cell types7 and even single cells16. In order to keep up with these technological advances, it is critical to develop a functional annotation framework that can be generalized to different species, cell types and single cells. Such a generalizable framework can be achieved through biologically-motivated and statistically-justified models. As for choosing between a supervised approach, where some gold standard datasets are needed to train the model and an unsupervised approach, where no labeled data are used, we focus on developing an unsupervised learning method in this article. This is because current supervised-learning-based annotation tools suffer from highly biased training data, which is largely due to our limited knowledge of non-coding regions. This may become less of an issue after we have gained a deeper understanding of non-coding functional mechanisms. However, at such an early stage, we think unsupervised learning techniques would be advantageous.

In this paper, we present GenoCanyon (inspired by the canyon-like plots it generates), a whole-genome annotation tool based on unsupervised statistical learning. From a collection of the comparative genomic conservation scores and biochemical signals obtained from the ENCODE project17, the posterior probability of a genomic position being functional is used as the prediction score. Compared to existing methods, GenoCanyon not only measures the deleteriousness of variants, but also the functional potential of each genomic location. Its flexible and generalizable statistical framework could also benefit future applications.

Results

Estimating the Proportion of Functional Regions in the Human Genome

Genetic approaches that focus on studying the consequences of genetic perturbations are often referred to as a gold standard for defining function8. Such a genetic definition is also directly related to causal inference, which is at the core of developmental biology and disease research9. In this study, we also adopt this genetically meaningful definition of genomic function. On the other hand, we treat the conservation measures and the biochemical signals as consequences of genomic function (Fig. 1A). For a specific location in the human genome, define Z to be the latent indicator of function. We collected 22 different annotations, denoted as (Supplementary Table 1). We also assumed that the 22 annotations are conditionally independent given Z (Fig. 1B). Then, the posterior probability serves as the prediction score of the functional potential at this location (See Methods, Fig. 1C).

Figure 1
figure 1

Modeling of causal relationship among variables. (a) We adopt the biologically meaningful definition of function and treat conservation measures and biochemical signals as consequences. (b) The latent functional indicator Z is modeled as the parental variable and all the 22 annotations are treated as consequences. Also, we assume there is no direct causal relationship between any two annotations. Therefore the annotations are conditionally independent given Z. (c) Workflow of GenoCanyon functional prediction.

We have pre-calculated the prediction scores for the entire human genome (hg19). Overall, when using 0.5 as the cutoff for defining functionality, 33.3% of the human genome was predicted to be functional. The proportion of functional elements is mostly stable across chromosomes (Supplementary Table 2; Supplementary Figure 1). We note that the functional proportion of the human genome has been estimated using many different approaches8,18,19,20,21,22 and results differed drastically. Comparative genomic analysis of multiple mammals revealed that constrained elements consist of approximately 4.5% of the human genome18,19. At the other extreme, the ENCODE project found that 80% of the human genome has detectable biochemical activities in at least one cell line7. However, it has been discussed recently that several corrected constraint estimations would each suggest two to three times increase to the original estimate of 4.5%20,21,22. Also, it still remains non-trivial to distinguish real biochemical signals from biological noises in the ENCODE data8. The large amount of observed biochemical activities have also been criticized to be more like an “effect” rather than “function”9. Our prediction falls in the middle of these highly diverging estimates of functional regions in the literature. It is worth noting that the GenoCanyon functional prediction represents a mixed probability involving multiple tissues. A smaller proportion of the human genome would be expected to be functional for a particular tissue.

Prediction for cis-regulatory Modules in the HBB Gene Complex

The intensively studied β-globin (HBB) gene complex on chromosome 11 contains embryonically expressed HBE1, fetally expressed HBG1 and HBG2 and adult globin genes HBD and HBB, along with a pseudogene HBBP1. This locus is known to provide a paradigm for developmental gene expression and regulation23,24. A large number of cis-regulatory modules (CRMs) that control both the developmental timing and the spatial pattern of gene expression have been discovered in the HBB gene complex10. More interestingly, the epigenetic and evolutionary signals at these CRMs differ substantially8. Therefore, the HBB gene complex provides a perfect example to test if GenoCanyon could effectively combine different sources of signals and successfully predict the functional segments.

We analyzed the prediction results in the HBB gene complex. On the entire chromosome 11, 32.2% of the DNAs were predicted to be functional. Strong enrichment of signals was observed at this locus. Using 0.5 as the cutoff, 62.2% of HBB gene complex and 97.0% of the CRMs were predicted as functional (Fig. 2A). Remarkably, a cluster of five DNase I hypersensitive CRMs upstream of the HBB gene complex, known as the locus control region (LCR)25, showed strong functional signals as a whole (Fig. 2B). The 3’HS1 enhancer blocker (chr11: 5226013-5226493; hg19) downstream of the HBB gene complex was also successfully predicted with high resolution. Interestingly, these CRMs showed highly variable patterns of annotations (Fig. 2C). This proved that GenoCanyon could effectively combine different sources of information. Recent research revealed several new regulatory elements at this locus, including one in the intergenic region between HBBP1 and HBG124 and another one upstream of HBD23. These elements also reside in the highly scored regions. Moreover, it is worth noting that the understanding of CRMs is still incomplete even in a relatively well-studied region such as the HBB complex. Some of the apparent false positives might actually be regulatory elements not yet discovered. The functional regions provided by our method could potentially offer a guideline for further studies.

Figure 2
figure 2

Functional prediction for the HBB gene complex. (a) Histogram of the prediction scores in chromosome 11, HBB gene complex and the 23 CRMs. 32.2%, 62.2% and 97.0% are predicted as functional, respectively. (b) Prediction results for the HBB complex. Dark blue bars show the prediction score at each location. All the 23 CRMs are marked in red. There appears to be fewer than 23 red bars because some of the CRMs are very close to each other. Red dots indicate the locations of known pathogenic SNPs downloaded from the NCBI Variation Viewer. (c) The posterior probabilities given a single group of annotations could be used to measure the relative contribution of different sources of information (See Methods). Four CRMs are plotted to illustrate that prediction scores are driven by different annotations in different CRMs. (d) Prediction results for the HBB gene and its promoter. The promoter, UTRs, introns and exons are marked with different colors. Red dots show the prediction scores of the pathogenic variants. (e) Prediction results for the HBB promoter. Known protein binding sites in the HBB promoter are marked in blue. Red dots show the prediction scores of the pathogenic variants. (f) Boxplot of the prediction scores of HBB promoter, known protein binding sites and pathogenic variants.

Among the 23 CRMs being reviewed10, only the promoter of HBB did not get the perfect score (Table 1). Therefore, we analyzed the HBB gene and its promoter in more details (Fig. 2D). Within the HBB gene, the 600 bp segment near the 3’UTR was predicted to be functional. 77 pathogenic or likely pathogenic SNPs were downloaded from the NCBI Variation Viewer ( http://www.ncbi.nlm.nih.gov/variation/view/). Interestingly, 14 of these pathogenic SNPs, including 4 in the 3’UTR and 6 in the second intron, lie in this 600 bp functional segment. In the upstream half of the second intron, prediction scores were substantially lower. No pathogenic variants could be found in that region. Overall, using 0.5 as the cutoff, 89.6% (69 out of 77) of the pathogenic SNPs located at functional locations. Within the HBB promoter, 75% (6 out of 8) of the pathogenic variants located at functional locations (Fig. 2E). Moreover, bumps of high scores could be observed at the known protein binding sites in the HBB promoter26. When comparing the entire HBB promoter, known protein binding sites and the pathogenic variants within the promoter, there was a substantial increase in prediction score (Fig. 2F). All of this evidence suggests that important functional segments could still be detected locally even in a generally lower-scored region.

Table 1 Mean prediction scores of the known CRMs in the HBB gene complex.

Prediction for ZRS, an enhancer of the SHH gene

Zone of polarizing activity regulatory sequence (ZRS) is one of the most studied developmental enhancers. It is located in the fifth intron of the protein-coding gene LMBR1, approximately 1 Mb upstream of SHH’s transcriptional start site27,28. Through linkage mapping of several large families with preaxial polydactyly (PPD) and triphalangeal thumb, an associated locus of approximately 500 kb was identified on chromosome 7q36. In later studies, the region was further narrowed down to the fifth intron of LMBR129,30,31. As a highly conserved 774 bp region in this intron, ZRS has been intensively studied. It has been shown to be crucial for limb development not only in humans, but also in mice, dogs, cats and even chickens27.

We investigated the prediction results in gene LMBR1. A highly scored plateau could be observed in its fifth intron (Fig. 3A). The mean predicted score for this intron was 0.595. This was higher than the mean score of the entire LMBR1 transcript (0.385), of all the introns in LMBR1 (0.384) and even of all the exons in LMBR1 (0.448). These results showed strong signs of function in the fifth intron. The ZRS region got an even higher mean predicted score 0.871, which confirmed its importance (Fig. 3B). When observing its surrounding region, ZRS could be easily identified as a dense region with high prediction scores (Fig. 3C). Moreover, the ZRS region serves as one of the most well studied examples for pathogenic variants in an enhancer. A total of 13 single nucleotide variants in ZRS have been identified to cause human limb malformations27. All these 13 SNVs were predicted to be highly functional, with the mean prediction score 0.987 (Fig. 3D).

Figure 3
figure 3

Prediction results for the SHH enhancer in LMBR1. (a) Prediction scores in the LMBR1 gene. The fifth intron and ZRS are highlighted in light blue and red, respectively. (b) Boxplot of the prediction scores in LMBR1, 16 introns, 17 exons, the 5th intron and ZRS. The results highlighted the function in the 5th intron of LMBR1 and confirmed the importance of ZRS. (c) Prediction results for the surrounding region of ZRS, which is highlighted in pink. An obvious highly scored plateau can be observed at ZRS. (d) The prediction results within the ZRS. 13 pathogenic variants are discovered in ZRS. The predicted scores at their locations are marked with red dots. There appears to be only 11 dots because three variants all reside at location 156584166 (hg19).

In conclusion, our method successfully identified the fifth intron of LMBR1 as a functional region. It also further confirmed the importance of ZRS. It is notable that the large number of identified pathogenic variants in ZRS is possibly subject to the ascertainment bias. In fact, mutations in ZRS did not account for the limb malformation in all the studied families32. Our prediction in the surrounding regions has the potential to guide future studies.

Prediction for Functional Elements in the Human X-inactivation Center

X-chromosome inactivation, originally described 50 years ago33, is the mechanism for X-chromosome dosage compensation in mammals. The long non-coding RNA Xist has been shown to be both necessary and sufficient to induce X-chromosome inactivation in mouse ES cells34. The surrounding genomic region, often referred to as the X-inactivation center (Xic for mouse and XIC for human), contains several crucial regulatory elements for mouse X-inactivation35. However, recent studies have suggested the existence of substantial variations in the mechanism of achieving X-inactivation among species36,37,38,39. We applied GenoCanyon on human XIC to predict the functional potential of the orthologs of known regulatory elements in mouse models (Fig. 4A; Table 2).

Table 2 Prediction results for the 6 protein-coding genes, 4 lncRNA genes and 1 pseudogene in the human XIC, as well as all their transcripts in RefSeq.
Figure 4
figure 4

Prediction results for regions involved in human X-inactivation. Each dark blue line shows the prediction score at a single base. (a) Functional prediction for the human XIC. All the RefSeq transcripts in this region are plotted. The master lncRNA XIST is highlighted in red. Red dots show the locations of known pathogenic variants downloaded from the NCBI variation viewer. (b) Functional prediction for the intergenic region between AMOT and HTR2C on chromosome Xq23. A red and a blue arrow represent the recently discovered transcripts XACT and T113.3, respectively.

Xist and its antisense ncRNA Tsix, as well as two upstream ncRNAs Ftx and Jpx have all been shown to haven cis-regulatory roles in mouse X-inactivation40,41,42,43. Our prediction confirmed the function of the master ncRNA XIST in human. Both the XIST gene and its transcribed regions got nearly perfect prediction scores. Moreover, a XIST-specific peak of high score could be observed on Fig. 4A, showing satisfying resolution of prediction. Studies suggesting a truncated form of TSIX in human have led to some debate in its function. Compared to its mouse ortholog, the human TSIX gene has lost the CpG island as well as the enhancer elements Dxpas34 and Xite44. In our prediction, TSIX got mean score 0.383, which is low for such an active genomic region. When considering only the region that does not overlap with XIST, the number even dropped to 0.197. A recently discovered lncRNA, Linx, has been hypothesized to take part in Tsix expression in mice45. In the mouse genome, the Linx gene lies between two protein-coding genes Nap1l2 and Cdx4. However, its human ortholog has not yet been discovered. The intergenic region between human NAP1L2 and CDX4 has a low mean prediction score 0.016, which argues against not only the existence of LINX in human, but also TSIX function. Jpx and Ftx both showed the potential to activate X-inactivation in mice42,43. But the functions of their human orthologs have not been studied37. The mean prediction scores of the transcribed regions in JPX and FTX are 0.256 and 0.304, respectively. This suggested only moderate functional potential of these two human lncRNAs. However, both scores received a substantial boost when the entire gene was considered. On Fig. 4A, several functional peaks could also be clearly observed in the untranscribed regions in JPX and FTX. These results might guide the detection of novel regulatory elements in human XIC.

Besides the mentioned lncRNA genes, the human XIC also contains 6 protein-coding genes, NAP1L2, CDX4, CHIC1, ZCCHC13, SLC16A2 and RLIM. It is notable that most of their exons clearly reside in the functional peaks in Fig. 4A, showing the ability of GenoCanyon to capture the functional landscape of this genomic region. We calculated the mean prediction scores for all the RefSeq transcripts of these genes. In CDX4, CHIC1 and SLC16A2, all the transcript scores were substantially larger than the scores of untranscribed regions. Among the 6 protein-coding genes, Rlim (also referred to as Rnf12) produces the U3 ubiquitin ligase that acts in a dose-dependent manner on the initiation of X-inactivation46. The human RLIM gene has a high mean predicted score 0.973. Two of its RefSeq transcripts both got 0.930 as the mean score, which is also very high. In Fig. 4A, the RLIM gene perfectly lies in an isolated functional plateau, which suggests its strong functional potential in human. It has been observed that the homologous pairing of two regions (Tsix/Xite and Xpr/Slc16a2) might have impacts on Xist upregulation47,48. In human XIC, the TSIX/XITE region has been truncated, but the region surrounding the SLC16A2 gene showed its functional potential in our prediction. The exons of SLC16A2 lie in two large separate functional peaks, suggesting the importance of the transcribed region as well as a large bulk of untranscribed region in SLC16A2. Whether these regions serve as the human XPR remains to be investigated. More interestingly, 8 pathogenic SNPs in SLC16A2 have been submitted to ClinVar49. These variants were believed to be involved in Allan-Herndon-Dudley syndrome, showing that SLC16A2 has its crucial function in other processes as well. The other genes in Xic have not been related to X-inactivation yet. Our prediction suggested that the exons of NAP1L2, CDX4 and CHIC1 all showed different levels of functional potential, which is not surprising because of their protein-coding nature. The human XIC also contains several microRNA genes and one pseudogene MAP2K4P1. MAP2K4P1 did not get a high score, which was in agreement with its pseudogene status. The microRNA transcript might partially explain the large functional plateau near the 5’ end of FTX.

XACT, a recently discovered lncRNA coating the active X chromosome in human pluripotent cells, has been shown to take part in X-inactivation initiation uniquely in human37,38. It lies in a 1.7 Mb large intergenic region between protein-coding genes AMOT and HTR2C. A shorter transcript T113.3 upstream of XACT was also identified. But its function has not been studied. We investigated this region using GenoCanyon. The AMOT gene and the HTR2C exons both showed substantial functional potential. A clear plateau of high scores could also be observed in the intergenic domain (Fig. 4B). The mean prediction score for the entire intergenic region, XACT and T113.3 were 0.148, 0.383 and 1.000, respectively. Although the mean predicted score for XACT was only moderate, it still confirmed the functional signal in such a lowered-scored intergenic domain. Also, our prediction suggested the importance of T113.3 and its surrounding region.

Investigating the Ability of Classifying Variants

GenoCanyon was not designed as a variant classifier. However, enrichment in prediction score is still expected for the known pathogenic variants. We downloaded all the annotated variants from ClinVar in June 201449. The subset of single nucleotide variants annotated as “Pathogenic”, “Likely Pathogenic”, or “Pathogenic/Likely Pathogenic” was treated as the positive set. Similarly, the subset of SNVs annotated as “Benign”, “Likely Benign”, or “Benign/Likely Benign” was treated as the negative set. The positive set contained 19,242 variants and the negative set contained 8,874 variants. The mean prediction score in the positive set and the negative set were 0.912 and 0.735, respectively. When using 0.5 as the natural cut-off, the sensitivity was as high as 0.915, with a low specificity of 0.263. The AUC was 0.727.

It is worth noting that GenoCanyon measures the functional potential of genomic locations, not the tolerability of specific variants. The transcribed regions in a crucial protein-coding gene should be expected to have a high functional score. However, it would still be natural to observe many tolerable synonymous SNPs in that gene. All these tolerable SNPs become “false-positives” in the analysis above, leading to a low specificity. Moreover, many of the known “benign” variants are by-products of association studies. Their properties were investigated because they lie in candidate regions in the disease pathway, which explains why the mean prediction score of benign variants was also high. On the other hand, if a variant were shown to be pathogenic in experiments, the underlying region would surely have some functions related to the disease. In this sense, the high sensitivity of GenoCanyon suggests that it may be a good indicator of its prediction ability. Finally, the performance of supervised-learning-based methods is highly sensitive to the choice of training data. For example, when using common variants with matched regions as the negative training set, the performance of GWAVA on its own training data dropped substantially (AUC = 0.71)14.

Discussion

The HBB gene cluster, ZRS and the X-inactivation center all have been paradigms for studying the complex genomic regulatory network. The prediction results in these regions showed that GenoCanyon is capable of detecting functional regions in the human genome, which is a unique feature most existing whole-genome annotation tools do not have. With the wide adoption of next-generation sequencing, GenoCanyon may help researchers focus on candidate regions that are likely to be functional and reduce the spurious signals among the overwhelming genomic information.

Throughout this article, we have discussed the differences between GenoCanyon and variant classifiers in that GenoCanyon measures the functional potential of genomic locations instead of the pathogenicity of a specific variant and a high score does not necessarily imply deleteriousness. However, in some scenarios that variants distribute across the entire genome, GenoCanyon may still serve well as a conservative tool for noise reduction. For example, sequencing technology is rapidly becoming a focus of efforts in genomic epidemiology. However, the overwhelming number of rare variants in the human genome brings the issue of extreme multiple testing. It has been discussed recently that the sample size required for a well-powered RVAS (rare variants association study) using sequencing is similar to that of a traditional GWAS (genome-wide association study)50. Without a huge cohort, true signals could be easily overshadowed by extreme yet spurious observations. In this case, GenoCanyon could be used to filter the SNPs and reduce 2/3 of the tests as more than 2/3 of the human genome is less likely to be functional. Moreover, the high sensitivity of GenoCanyon ensures that the true signal is still kept in the dataset. The ability of predicting functional potential at each nucleotide is another useful feature of GenoCanyon. In association studies, genetic variants are used as markers capturing signals for nearby regions. Therefore, for each SNP, the mean prediction score for its surrounding region may serve well as a prior in post-GWAS prioritization. Existing variant classifiers cannot achieve this task because they only predict the deleteriousness of genotyped variants. It is worth noting that most of the annotation data have a resolution ranging from tens to hundreds of nucleotides due to the limitation of current experimental techniques. However, data input of these annotations is at nucleotide level, which makes it possible to measure the functional potential for each base pair.

Based on unsupervised learning, GenoCanyon does not suffer from the highly biased knowledge of the non-coding DNA. More importantly, the model can be generalized in many directions. Firstly, the ENCODE annotations used in GenoCanyon were clustered across several or even nearly a hundred different cell lines. Therefore, the current functional regions predicted by GenoCanyon are in fact the union of functional elements in different cell types. Using the annotations for one single cell type, a cell type-specific functional prediction tool could be built under the same framework. In studies where several candidate cell types are of interest, prediction based on the cell-type-specific models would have higher specificity. Secondly, the model can be extended to other species. The functional elements in model organisms are generally better studied. Such tools for different species could potentially benefit the multi-species comparison and help detecting functional orthologs in human. Thirdly, in order to simplify the model, we transformed the biochemical annotations into binary variables (See Methods). Therefore, the information of signal strength has not been used. When these information as well as more annotations are incorporated using more complex modeling techniques, the specificity may be improved. Finally, the current model assumes the leading role of genetic function and treats conservation measures and the biochemical signals as consequences. Among different annotations, conditional independence was also assumed (Fig. 1). However, it would be interesting to investigate the correlations among variables in either the functional or the non-functional group. In that case, statistical graphical models could be implemented to make the model more flexible. These are all very interesting directions to generalize GenoCanyon. However, complex models lead to higher variance, intensive computation and less interpretability. Dealing with these trade-offs has never been trivial. The good prediction results show that GenoCanyon has reached a nice balance. The current powerful features as well as its generalizable potential make GenoCanyon a unique and useful tool for whole-genome annotation.

Methods

Statistical Model

For each location in the human genome, define Z to be the latent indicator of function, where Z = 1 indicates that location is functional and 0 otherwise. We selected 22 different annotations corresponding to either conservation score or biochemical activity, including 2 genomic conservation measures, 2 indicators of open chromatin, 8 histone modifications and 10 TFBS peaks (Supplementary Table 1). These annotations are selected because their functional impacts are relatively well studied and easier to model. DNA methylation is not included in the model because the gene silencing mechanism requires modeling the functional impact of methylation to other nucleotides that are possibly far away, which is a challenging task. Genomic data for all the 22 annotations were downloaded from the UCSC Genome Browser except GERP (Supplementary Table 3). We denote the vector of all the annotations as .

When a genomic location is functional (Z = 1), we assume that the annotations have a joint probability density ; similarly, when a genomic location is non-functional (Z = 0), we assume that the annotations have another joint density . Since Z is unknown, the distribution of the observed data would be a mixture of and . Instead of modeling direct causal relationships among these 22 annotations, we assume that they are connected only through Z. In other words, the 22 annotations are all modeled to be consequences of Z. Under these assumptions, the 22 different annotations are conditionally independent when Z is given51. Therefore, the conditional joint density of given Z can be factorized as

Finally, for a genomic location, assume to be the prior probability of being functional, i.e.

Then, given the annotations, the posterior probability of Z = 1 can be used as a reasonable functional measure when the parameter estimates are plugged in.

We chose GERP52 and PhyloP53 as the conservation measures because both of them are approximately normally distributed and therefore easier to model. PhyloP46way was chosen instead of PhyloP100way because a large phylogenetic distance would bring too little conserved signal as well as many incomplete data. All the other annotations were cell-type-specific, so we coded them into binary variables to cluster the signal across cell lines. If signal was detected in at least one cell line, we coded the corresponding . Otherwise, . For DNase I, FAIRE and TFBS, there were downloadable cluster files on the UCSC Genome Browser. A total of 125, 25 and 91 cell lines were clustered, respectively. We made our own histone peak cluster files across 16 cell lines from the Broad histone track on ENCODE (Supplementary Table 4). The 8 histone modifications were chosen because they are relatively well-studied54. We chose the top 10 Transcription Factors with the highest binding site coverage after being transformed into binary variables.

Finally, normal distribution and Bernoulli distribution were used to model the continuous and binary annotations, respectively.

Estimation

In total, our model has 49 parameters.

where

The GWAS Catalog55 was downloaded from the NHGRI GWAS Catalog website ( http://www.genome.gov/gwastudies/) in July 2014. It contained 13,070 unique SNPs that were significant in GWAS studies. For each SNP, we marked the interval between its 500 bp upstream and 499 bp downstream. In this way, 13,070 intervals were collected. Each interval spanned 1 k bp. After deleting the overlapping coordinates, the entire region spanned 12,801,840 bp. Each significant SNP in the GWAS Catalog hints the existence of functional elements nearby. These functional elements differ in their sizes and in the distance to the probed SNP. Since each interval was 1,000 bp in length and a large number of intervals were collected, the whole collection was a large enough and reasonably chosen set on which we could learn the distributions of annotations in both functional and non-functional groups. All the 22 annotations were then collected at each location in this set. The PhyloP scores and GERP scores were not available at 221,643 and 28,741 locations, respectively. After removing these locations, the final dataset contained 12,580,197 genomic locations. None of the other annotations have the issue of incomplete data. Finally, the Expectation-Maximization (EM) algorithm was used to estimate the parameters. As expected, the estimates showed solid differences between the functional and non-functional groups (Supplementary Table 5). We also tried replacing the missing conservation measures with the neutral score 0. Then the entire 12,801,840 locations were used to estimate the parameters. Little differences in parameter estimates were observed between the two approaches (Supplementary Table 6). Moreover, in order to test if the estimates are stable under different choices of datasets, we randomly sampled two subsets on chromosome 1, containing 2,000,000 and 6,000,000 bp, respectively. After adding these locations into the original 12,801,840 bp dataset, the parameters were estimated using the EM algorithm again. No substantial differences were observed in the estimates (Supplementary Tables 7 and 8). Based on these results, the GWAS-loci-based dataset containing 12,801,840 bp seems to contain enough functional elements for accurate parameter estimation and is general enough so that genome heterogeneity does not have a strong impact on estimation. Finally, in order to check the sensitivity of our model to the perturbation in annotation data, we re-fitted the model multiple times after removing several annotations (Supplementary Table 9). The parameter estimates remained consistently stable in all these cases, suggesting that the framework we propose is robust to the choice of annotations. The stable estimates of marginal parameters also show that the potential correlations among annotations do not have a strong impact on model fitting.

Marginal Effect of Different Annotations

For each binary annotation , its effect on the final prediction can be measured using the odds ratio.

We calculated the odds ratios for all 20 binary annotations (Supplementary Table 5). The annotation with the least effect was the histone modification H3K27me3. According to our estimation, the probabilities to detect the H3K27me3 signal in functional and non-functional classes are almost the same (0.80 and 0.72). In fact, H3K27me3 has been discovered to be associated with Polycomb-repressed regions56,57, which could partially explain the phenomenon. All the other binary annotations showed variable yet substantial signals of function. The marginal effect of a continuous annotation depends on its value. The interpretation is also less straightforward. More importantly, although these statistics could help us gain some intuition of how each annotation works marginally, the final prediction relies on all of them. The effectiveness of the method needs to be tested as a whole.

In order to visualize the relative contribution of different sources of information (Fig. 2C), posterior probabilities given a particular group of annotations were calculated for each location.

Then, for each CRM, the mean posterior probabilities were plotted.

Estimating the Functional Proportion

After plugging in the parameter estimates, the prediction score could be calculated using formula (4). If the PhyloP or the GERP score was not available, the neutral value 0 was used. Using the cutoff 0.5, 33.3% of the human genome was predicted to be functional. However, it is notable that the EM algorithm also gave an estimate for the functional proportion, 42.7% in our case (Supplementary Table 5). This estimation was based on the 12,580,197 locations we chose, which might not represent the entire genome. 42.7% could be treated as the prior knowledge, but the final prediction will be driven by the actual annotations at each location. Therefore, 33.3% would still be a better estimation. To see if the prior had a strong effect, we estimated the functional proportion of chromosome 22 using different values for while keeping other parameters unchanged. When using 0.3 and 0.5 as the values, the estimated functional proportions were 0.376 and 0.389, respectively. Compared to the original estimate 0.383, there was not a substantial change.

Figures and Web Application

All figures were plotted using R. The “ggbio” package was used to plot the chromosomes and transcripts58. The GenoCanyon web application was developed using the “shiny” package in R. The “bigmemory” package was implemented to access and manipulate massive datasets59. The GenoCanyon web application is available at http://genocanyon.med.yale.edu. The web server is implemented using Apache running on CentOS version 6.

Additional Information

How to cite this article: Lu, Q. et al. A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data. Sci. Rep. 5, 10576; doi: 10.1038/srep10576 (2015).