The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These elements include genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression1. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million cCRE annotations have been generated for human and more than 300,000 for mouse, and these have provided a valuable resource for the scientific community.
The ENCODE Project was launched in 2003, as the first nearly complete human genome sequence was reported2. At that time, our understanding of the human genome was limited. For example, although 5% of the genome was known to be under purifying selection in placental mammals3,4, our knowledge of specific elements, particularly with regard to non-protein-coding genes and regulatory regions, was restricted to a few well-studied loci2,5.
ENCODE commenced as an ambitious effort to comprehensively annotate the elements in the human genome, such as genes, control elements, and transcript isoforms, and was later expanded to annotate the genomes of several model organisms. Mapping assays identified biochemical activities and thus candidate regulatory elements.
Analyses of the human genome in ENCODE proceeded in successive phases (Extended Data Fig. 1). Phase I (2003–2007) interrogated a specified 1% of the human genome in order to evaluate emerging technologies6. Half of this 1% was in regions of high interest, and the other half was chosen to sample the range of genomic features (such as G+C content and genes). Microarray-based assays were used to map transcribed regions, open chromatin, and regions associated with transcription factors and histone modification in a wide variety of cell lines, and these assays began to reveal the basic organizational features of the human genome and transcriptome. Phase II (2007–2012) introduced sequencing-based technologies (for example, chromatin immunoprecipitation with sequencing (ChIP–seq) and RNA sequencing (RNA-seq)) that interrogated the whole human genome and transcriptome7. General assays such as transcript, open-chromatin and histone modification mapping were used on a wide variety of cell lines, while more specific assays, such as mapping transcription factor binding regions, were performed extensively on a smaller number of cell lines to provide detailed annotations on, and to investigate the relationships of, many regulatory proteins across the genome. Transcriptome analysis of subcellular compartments (the nucleus, cytosol and subnuclear compartments) of these cells enabled the locations of transcripts to be analysed7.
ENCODE phase III
ENCODE 3 (2012–2017) expanded production and added new types of assays8 (Fig. 1, Extended Data Fig. 1), which revealed landscapes of RNA binding and the 3D organization of chromatin via methods such as chromatin interaction analysis by paired-end tagging (ChIA-PET) and Hi-C chromosome conformation capture. Phases 2 and 3 delivered 9,239 experiments (7,495 in human and 1,744 in mouse) in more than 500 cell types and tissues, including mapping of transcribed regions and transcript isoforms, regions of transcripts recognized by RNA-binding proteins, transcription factor binding regions, and regions that harbour specific histone modifications, open chromatin, and 3D chromatin interactions. The results of all of these experiments are available at the ENCODE portal (http://www.encodeproject.org). These efforts, combined with those of related projects and many other laboratories, have produced a greatly enhanced view of the human genome (Fig. 2), identifying 20,225 protein-coding and 37,595 noncoding genes (Fig. 2a), 2,157,387 open chromatin regions, 750,392 regions with modified histones (mono-, di- or tri-methylation of histone H3 at lysine 4 (H3K4me1, H3K4me2 or H3K4me3), or acetylation of histone H3 at lysine 27 (H3K27ac)), 1,224,154 regions bound by transcription factors and chromatin-associated proteins (Fig. 2c), 845,000 RNA subregions occupied by RNA-binding proteins, and more than 130,000 long-range interactions between chromatin loci. These annotations have greatly enhanced our view of the human genome from its original annotation in 2003 to a much richer and higher-resolution view (for example, Fig. 2d, e). Indeed, although the number of human protein-coding genes known has changed only modestly, the number of transcript isoforms, long noncoding RNAs (lncRNAs), and potential regulatory regions identified has increased greatly since the project began (Fig. 2a–c).
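As an illustration of how the experiments at the portal might be retrieved programmatically, the following is a minimal sketch that only builds a search URL and does not contact the portal. The `/search/` path and the parameter names (`type`, `assay_title`, `status`, `limit`, `format`) are assumptions made for illustration and should be checked against the portal's own documentation; `build_search_url` is a hypothetical helper, not part of any ENCODE software.

```python
from urllib.parse import urlencode

# Portal address as given in the text (https form).
PORTAL = "https://www.encodeproject.org"


def build_search_url(assay_title, limit=25):
    """Return a portal search URL that requests JSON results.

    The path and parameter names are assumptions for illustration;
    consult the portal documentation before relying on them.
    """
    params = [
        ("type", "Experiment"),        # restrict results to experiments
        ("assay_title", assay_title),  # e.g. "ChIP-seq" (assumed field name)
        ("status", "released"),        # only publicly released records
        ("limit", str(limit)),         # maximum number of results
        ("format", "json"),            # machine-readable output
    ]
    return f"{PORTAL}/search/?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("ChIP-seq", limit=5))
```

The resulting URL could then be fetched with any HTTP client; keeping URL construction separate from the network call makes the query easy to inspect before it is sent.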
An important part of ENCODE 3 is that the regulatory mapping efforts have now been integrated and synthesized into the first version of an encyclopedia, highlighting a registry of 0.9 million cCREs in human and 0.3 million cCREs in mouse. Details can be found in the accompanying ENCODE paper8 and companion papers in this issue and other journals9,10,11,12,13,14.
Technology, quality control and standards
Reaching the present annotation required a substantial expansion of technology development, from ENCODE groups and others, as well as the establishment of standards to ensure that the data are reproducible and of high quality. Most ENCODE 2 assays used sequence-based readouts (for example, RNA-seq15,16 and ChIP–seq17,18) rather than the array-based methods19,20 used in the pilot phase, and in ENCODE 3, methods such as global mapping of 3D interactions13 and RNA-binding regions14 were added. Throughout the project, computational and visualization approaches were developed for mapping reads and integrating different data types (Supplementary Note 1).
A key feature of ENCODE is the application of data standards, including the use of independent replicates (separate experiments on two or more biological samples5,21), except when precluded by the limited availability of materials (for example, postmortem human tissues). Of the 8,699 ENCODE 2 and ENCODE 3 experiments, 6,101 have independent replicates. Of equal importance was the use of well-characterized reagents, such as antibodies for mapping sites of transcription factor binding, chromatin modifications and protein–RNA interactions22. ENCODE developed protocols to test each antibody ‘lot’ to demonstrate its experimental suitability, captured extensive metadata, and implemented controlled vocabularies and ontologies. Standards for reagents, experimental data, and metadata are on the ENCODE website: https://www.encodeproject.org/data-standards/.
Many metrics, including sequencing depth, mapping characteristics, replicate concordance, library complexity, and signal-to-noise ratio, were used to monitor the quality of each data set, and quality thresholds were applied21. A minority of experiments that fell short of the standards (for example, because of insufficiently validated antibodies) are still reported, but are marked with a badge to indicate that an issue was found. This compromise favours having some data over none when an experiment did not meet ENCODE-defined thresholds.
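The report-but-badge policy described above can be sketched as follows. This is only an illustration: the threshold values, field names, and the `Experiment`, `audit`, and `release_record` names are hypothetical and do not reflect ENCODE's actual cut-offs or software.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; not ENCODE's actual values.
MIN_READ_DEPTH = 20_000_000
MIN_REPLICATE_CONCORDANCE = 0.8
MIN_LIBRARY_COMPLEXITY = 0.7


@dataclass
class Experiment:
    accession: str
    read_depth: int
    replicate_concordance: float
    library_complexity: float
    antibody_validated: bool = True


def audit(exp):
    """Return the issues found; an empty list means every check passed."""
    issues = []
    if exp.read_depth < MIN_READ_DEPTH:
        issues.append("low read depth")
    if exp.replicate_concordance < MIN_REPLICATE_CONCORDANCE:
        issues.append("low replicate concordance")
    if exp.library_complexity < MIN_LIBRARY_COMPLEXITY:
        issues.append("low library complexity")
    if not exp.antibody_validated:
        issues.append("insufficiently validated antibody")
    return issues


def release_record(exp):
    """Keep every experiment, attaching a badge when any issue was found."""
    issues = audit(exp)
    return {"accession": exp.accession, "badge": bool(issues), "issues": issues}


if __name__ == "__main__":
    ok = Experiment("EXAMPLE-1", 30_000_000, 0.95, 0.90)
    flagged = Experiment("EXAMPLE-2", 10_000_000, 0.95, 0.90, antibody_validated=False)
    for exp in (ok, flagged):
        print(release_record(exp))
```

The point of the design is that failing a check never drops a record: the data remain available, and the badge lets users decide whether the flagged issue matters for their analysis.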
An important component is uniform data processing. Data from the major ENCODE assays (ChIP–seq, DNase I hypersensitive sites sequencing (DNase-seq), RNA-seq, and whole-genome bisulfite sequencing (WGBS)) are uniformly processed, and the processing pipelines are available for users to apply to their own data, either by downloading the code from GitHub (http://github.com/ENCODE-DCC) or by accessing the pipelines at the DNAnexus cloud provider. The standards and pipelines will continue to evolve as new technologies arise and are implemented.
The ENCODE Consortium is a good example of how large-scale group efforts can have a large impact on the scientific community, and many other national and international projects—including the NIH Roadmap Epigenomics Program, The Cancer Genome Atlas (TCGA), the International Human Epigenome Consortium (IHEC), BLUEPRINT, the Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC), the Genotype and Tissue Expression Project (GTEx), PsychENCODE, Functional Annotation of Animal Genomes (FAANG), the Global Alliance for Genomics and Health (GA4GH), the 4D Nucleome Program (4DN), the Human Cell Atlas and the FANTOM consortium—have now formed (Supplementary Note 1). ENCODE has engaged with most of these consortia to share standards for data quality control, submission, and uniform processing and has helped to facilitate the use of common ontologies with some of these consortia. Data from the now-completed NIH Roadmap Epigenomics Program have been reprocessed and are available in the ENCODE database and are part of the Encyclopedia annotation. ENCODE continues to work with other consortia, individually and as part of the IHEC and GA4GH (for example, http://epishare-project.org) to increase data interoperability and the value of its resources.
ENCODE as a resource
The purpose of ENCODE is to provide valuable, accessible resources to the community. ENCODE data and derived features are available from a publicly accessible data portal (https://www.encodeproject.org), and consent was obtained from donors to make data freely available to the public. Raw and processed data are available directly from the cloud as an Amazon Public Data Set (https://registry.opendata.aws/encode-project/). The data are widely used by the scientific community—more than 2,000 publications from researchers outside of ENCODE have used ENCODE data to study diverse topics (Fig. 3). Because most disease-associated common variants are noncoding and show substantial enrichment in candidate cell-type-specific cis regulatory elements23,24, ENCODE-derived resources, both in isolation and in conjunction with data from other resources (for example, GTEx), can help to identify and interpret disease-associated noncoding variants (Fig. 3a). Users engage with the data in many ways, ranging from downloads of multiple data sets to detailed investigations of specific loci. Anyone navigating a major genome browser has access to thousands of biochemical, functional, and computational annotations to display at any genomic scale or to overlay on any sequence variant. Maps of epigenomic features relevant to gene regulation have been integrated to form a registry of discrete elements that are candidates for enhancers, promoters, or other regulatory elements. A specialized browser, SCREEN (http://screen.encodeproject.org), is an interface that can be used to identify and study these cCREs and associated ENCODE data and other annotations. This dynamic registry will be regularly updated as additional information is acquired.
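Overlaying a sequence variant on a registry of discrete elements amounts to an interval lookup. The sketch below shows one way this could be done with a sorted index and binary search. The coordinates and labels are invented for illustration, intervals are assumed to be half-open, 0-based, and non-overlapping within a chromosome, and `build_index` and `element_at` are hypothetical helper names rather than functions of SCREEN or any ENCODE tool.

```python
from bisect import bisect_right


def build_index(elements):
    """Group (chrom, start, end, label) intervals by chromosome.

    Returns {chrom: (starts, intervals)} with intervals sorted by start.
    Assumes half-open, 0-based, non-overlapping intervals per chromosome.
    """
    grouped = {}
    for chrom, start, end, label in elements:
        grouped.setdefault(chrom, []).append((start, end, label))
    index = {}
    for chrom, intervals in grouped.items():
        intervals.sort()
        starts = [start for start, _, _ in intervals]
        index[chrom] = (starts, intervals)
    return index


def element_at(index, chrom, pos):
    """Return the label of the element containing pos, or None."""
    if chrom not in index:
        return None
    starts, intervals = index[chrom]
    # Rightmost interval whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    if i >= 0:
        start, end, label = intervals[i]
        if start <= pos < end:
            return label
    return None


if __name__ == "__main__":
    # Invented coordinates and labels, for illustration only.
    registry = build_index([
        ("chr1", 1000, 1350, "promoter-like"),
        ("chr1", 5000, 5200, "enhancer-like"),
        ("chr2", 300, 650, "enhancer-like"),
    ])
    for chrom, pos in [("chr1", 1200), ("chr1", 1350), ("chr2", 300)]:
        print(chrom, pos, element_at(registry, chrom, pos))
```

Binary search keeps each lookup logarithmic in the number of elements per chromosome, which matters when many variants are checked against a registry of this size; overlapping intervals would need an interval tree or a similar structure instead.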
Mouse ENCODE and modENCODE
Model organism studies have produced essential insights into almost every aspect of biology, including genome organization and function. During ENCODE 2, mapping of mouse epigenomic and transcriptomic features was conducted in adult mouse tissues and cell lines through the Mouse ENCODE Project25, which identified 21,978 protein-coding regions, 32,168 noncoding genes, 1,192,301 open chromatin regions, 722,334 regions with modified histones H3K4me1, H3K4me2, H3K4me3, or H3K27ac, and 686,294 regions bound by transcription factors.
During ENCODE 2, a model organism ENCODE project (modENCODE26,27) was conducted to characterize the transcriptome, epigenome, and transcription factor binding sites in Drosophila melanogaster and Caenorhabditis elegans tissues, developmental stages and cell lines (Extended Data Fig. 1). These organisms provided the opportunity to develop detailed records of epigenomic features and transcriptome maps throughout development, which is difficult to accomplish in humans. Deep mapping of the spatial and temporal transcriptomes of these species has substantially enhanced the annotation of both genomes. Similarly, detailed mapping of the regulatory circuits that govern gene regulation in Drosophila and C. elegans has provided insights into general principles of genome organization and function. Since modENCODE ended, mapping of transcription factor binding sites in Drosophila and C. elegans has continued in a project called the model organism Encyclopedia of Regulatory Networks (modERN), which to date has characterized more than 262 transcription factors in Drosophila and 217 transcription factors in C. elegans28. Collectively, the modENCODE Project has provided new insights about how the genomes of multicellular organisms direct development and maintain homeostasis.
In ENCODE phase III, experiments were carried out to characterize dynamic histone marks and accessibility, DNA methylomes, and transcriptomes in samples taken during eight mouse fetal developmental stages with up to twelve tissues per stage28,29,30 (Fig. 4). The resulting more than 1,500 datasets comprise, to our knowledge, the most comprehensive study of epigenomes and transcriptomes during the prenatal development of a mammal. Integrative analysis of these datasets has expanded our knowledge of the transcriptional regulatory networks that regulate mammalian development and underscored the role of gene regulatory mechanisms in human disease. At least 214,264 of the candidate enhancers identified in fetal mouse tissues are conserved in the human genome8. The human orthologues of these potential regulatory elements are significantly enriched for genetic variants that are associated with common illnesses in a tissue-restricted manner, providing information for investigations of the molecular basis of human disease29,30.
The mouse data from ENCODE 3 also include the results of more than 400 experiments using transgenic reporter mice designed to assess the function of cCREs in three embryonic tissues at two developmental stages. The results of this systematic study have helped to predict the in vivo activities of cCREs. For example, stronger enrichment for epigenetic signatures of enhancer activity correlated with higher rates of validation in the corresponding tissue29,31.
Finally, comparisons of epigenome and transcriptome maps across species have led to insights into the evolution of transcribed regions and regulatory information25,32. Combinatorial histone modification patterns at cis-regulatory elements and other genomic features are broadly conserved in metazoans. These chromatin states and transcript levels are highly correlated across tissues and developmental stages in all species examined. However, a notable fraction of specific cis-regulatory elements undergoes sequence and functional turnover during evolution, indicating that some regulatory components show substantial plasticity in their evolution while operating in a conserved regulatory network33.
Current limitations: phase IV and beyond
It is now apparent that elements that govern transcription, chromatin organization, splicing, and other key aspects of genome control and function are densely encoded in the human genome; however, despite the discovery of many new elements, the annotation of elements that are highly selective for particular cell types or states is lagging behind. For instance, very few examples of condition-specific activation or repression of transcriptional control elements are currently annotated in ENCODE. Similarly, information from human fetal tissue, reproductive organs and primary cell types is limited. In addition, although many open chromatin regions have been mapped, the transcription factors that bind to these sequences are largely unknown, and little attention has been devoted to the analysis of repetitive sequences. Finally, although transcript heterogeneity and isoforms have been described in many cell types, full-length transcripts that represent the isoform structure of spliced exons and edits have been described for only a small number of cell types.
Thus, as part of ENCODE 4, considerable effort is being devoted to expanding the cell types and tissues analysed (see URLs in Supplementary Note 1) as well as mapping the binding regions for many more transcription factors and RNA-binding proteins. These efforts are largely focused on a few reference cell lines, with the hope that improved knowledge will help with imputation or predictions in other cell states34. Single-cell transcriptome capture agents35 and open chromatin assays36 are also being applied to increase our understanding of the cellular heterogeneity of different tissues and samples. These efforts will supplement the many related activities that are also being pursued by HCA, HuBMAP and others37,38. Extensive mapping efforts of all types will continue in both the human and mouse, and parallel efforts to map transcription factor binding sites are being pursued in Drosophila and C. elegans by the modERN Project28. Full-length transcript isoforms are being elucidated in different cell types using long-read sequencing technologies39. ENCODE will continue to work with other consortia, and the data from different groups and individual laboratories will need to be consolidated into a common repository.
Importantly, although very large numbers of noncoding elements have been defined, the functional annotation of ENCODE-identified elements is still in its infancy. High-throughput reporter-based assays40, CRISPR-based genome and epigenome editing methods41, and other high-throughput approaches are being used in the current phase of ENCODE to assess the functions of many thousands of elements and to relate those functional results to their biochemical signatures. These targeted functional assays, combined with the large-scale annotation of biochemical features, should further enhance the value of ENCODE data.
Through these and other efforts, it is expected that many more elements in the human genome will be identified across a variety of cell types and conditions, their activities will be revealed (often at the single-cell level), and their biological functions will be inferred more accurately. The development of a systems-wide understanding of function and integration with genetic information associated with human traits will greatly enhance our understanding of human biology and disease.
Kellis, M. et al. Defining functional DNA elements in the human genome. Proc. Natl Acad. Sci. USA 111, 6131–6138 (2014).
ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306, 636–640 (2004).
Lindblad-Toh, K. et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011).
Waterston, R. H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 9, e1001046 (2011).
Birney, E. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007). The results of the pilot phase of ENCODE included extensive functional assays across a selected one per cent of the human genome with experiments conducted on a variety of cell lines and largely with array-based technology.
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012). The results of the second phase of ENCODE were based mostly on a large number of genome-wide assays that leveraged high-throughput sequencing technologies and were done across two ‘tier one’ cell lines with large-scale assays across several hundred cell and tissue types.
The ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature https://doi.org/10.1038/s41586-020-2493-4 (2020).
Partridge, E. C. et al. Occupancy maps of 208 chromatin-associated proteins in one human cell type. Nature https://doi.org/10.1038/s41586-020-2023-4 (2020).
Meuleman, W. et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature https://doi.org/10.1038/s41586-020-2559-3 (2020).
Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature https://doi.org/10.1038/s41586-020-2528-x (2020).
Breschi, A. et al. A limited set of transcriptional programs define major cell types. Preprint at https://doi.org/10.1101/857169 (2020).
Grubert, F. et al. Landscape of cohesin-mediated chromatin loops in the human genome. Nature https://doi.org/10.1038/s41586-020-2151-x (2020).
Van Nostrand, E. L. et al. A large-scale binding and functional map of human RNA binding proteins. Nature https://doi.org/10.1038/s41586-020-2077-3 (2020).
Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009).
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods 5, 621–628 (2008).
Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. Genome-wide mapping of in vivo protein–DNA interactions. Science 316, 1497–1502 (2007).
Robertson, G. et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods 4, 651–657 (2007).
Iyer, V. R. et al. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409, 533–538 (2001).
Ren, B. et al. Genome-wide location and function of DNA binding proteins. Science 290, 2306–2309 (2000).
Landt, S. G. et al. ChIP–seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22, 1813–1831 (2012). A consortium-wide effort to standardize performance, quality control and outputs of ChIP–seq experiments, including validation of antibodies, to facilitate experimental reproducibility and data utility.
Sundararaman, B. et al. Resources for the comprehensive discovery of functional RNA elements. Mol. Cell 61, 903–913 (2016).
Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
Schaub, M. A., Boyle, A. P., Kundaje, A., Batzoglou, S. & Snyder, M. Linking disease associations with regulatory information in the human genome. Genome Res. 22, 1748–1759 (2012).
Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014). Results of a large-scale effort of the mouse ENCODE consortium, presenting regulatory and transcript maps of the mouse.
Gerstein, M. B. et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330, 1775–1787 (2010).
The modENCODE Consortium et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787–1797 (2010).
Kudron, M. M. et al. The ModERN Resource: genome-wide binding profiles for hundreds of Drosophila and Caenorhabditis elegans transcription factors. Genetics 208, 937–949 (2018).
Gorkin, D. U. et al. An atlas of dynamic chromatin landscapes in mouse fetal development. Nature https://doi.org/10.1038/s41586-020-2093-3 (2020).
He, P. et al. The changing mouse embryo transcriptome at whole tissue and single-cell resolution. Nature https://doi.org/10.1038/s41586-020-2536-x (2020).
He, Y. et al. Spatiotemporal DNA methylome dynamics of the developing mouse fetus. Nature https://doi.org/10.1038/s41586-020-2119-x (2020).
Cheng, Y. et al. Principles of regulatory information conservation between mouse and human. Nature 515, 371–375 (2014).
Stefflova, K. et al. Cooperativity and rapid evolution of cobound transcription factors in closely related mammals. Cell 154, 530–540 (2013).
Keilwagen, J., Posch, S. & Grau, J. Accurate prediction of cell type-specific transcription factor binding. Genome Biol. 20, 9 (2019).
Tang, F., Lao, K. & Surani, M. A. Development and applications of single-cell transcriptome analysis. Nat. Methods 8 (Suppl), S6–S11 (2011).
Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Hu, B. C.; HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).
Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
Rhoads, A. & Au, K. F. PacBio sequencing and its applications. Genomics Proteomics Bioinformatics 13, 278–289 (2015).
Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).
Klein, J. C., Chen, W., Gasperini, M. & Shendure, J. Identifying novel enhancer elements with CRISPR-based screens. ACS Chem. Biol. 13, 326–332 (2018).
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Paudyal, A. et al. The novel mouse mutant, chuzhoi, has disruption of Ptk7 protein and exhibits defects in neural tube, heart and lung development and abnormal planar cell polarity in the ear. BMC Dev. Biol. 10, 87 (2010).
We thank S. Moore, E. Cahill, M. Kellis and J. Li for their assistance, and B. Wold for helpful comments. This work was supported by grants from the NIH: U01HG007019, U01HG007033, U01HG007036, U01HG007037, U41HG006992, U41HG006993, U41HG006994, U41HG006995, U41HG006996, U41HG006997, U41HG006998, U41HG006999, U41HG007000, U41HG007001, U41HG007002, U41HG007003, U41HG007234, U54HG006991, U54HG006997, U54HG006998, U54HG007004, U54HG007005, U54HG007010 and UM1HG009442.
B.E.B. declares outside interests in Fulcrum Therapeutics, 1CellBio, HiFiBio, Arsenal Biosciences, Cell Signaling Technologies, BioMillenia, and Nohla Therapeutics. P.F. is a member of the Scientific Advisory Boards of Fabric Genomics, Inc. and Eagle Genomics, Ltd. M.P.S. is cofounder and scientific advisory board member of Personalis, SensOmics, Mirvie, Qbio, January, Filtricine, and Genome Heart. He serves on the scientific advisory board of these companies and Genapsys and Jupiter. Z.W. is a cofounder of Rgenta Therapeutics and she serves on its scientific advisory board. R.M.M. is an advisor to DNAnexus and Decheng Capital, and has outside interests in IMIDomics, Accuragen and ReadCoor, Inc. The authors declare no other competing financial interests.
Extended data figures and tables
Pilot phase: September 2003–September 2007; ENCODE 2: September 2007–September 2012; ENCODE 3: September 2012–January 2017; ENCODE 4: February 2017–present; modENCODE: April 2007–April 2012; mouse ENCODE: 2009–2012.
Cite this article
The ENCODE Project Consortium., Snyder, M.P., Gingeras, T.R. et al. Perspectives on ENCODE. Nature 583, 693–698 (2020). https://doi.org/10.1038/s41586-020-2449-8