Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.
The ENCODE Project Consortium*

We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

The human genome is an elegant but cryptic store of information. The roughly three billion bases encode, either directly or indirectly, the instructions for synthesizing nearly all the molecules that form each human cell, tissue and organ. Sequencing the human genome (refs 1–3) provided highly accurate DNA sequences for each of the 24 chromosomes.
However, at present, we have an incomplete understanding of the protein-coding portions of the genome, and markedly less understanding of both non-protein-coding transcripts and genomic elements that temporally and spatially regulate gene expression. To understand the human genome, and by extension the biological processes it orchestrates and the ways in which its defects can give rise to disease, we need a more transparent view of the information it encodes.

The molecular mechanisms by which genomic information directs the synthesis of different biomolecules have been the focus of much of molecular biology research over the last three decades. Previous studies have typically concentrated on individual genes, with the resulting general principles then providing insights into transcription, chromatin remodelling, messenger RNA splicing, DNA replication and numerous other genomic processes. Although many such principles seem valid as additional genes are investigated, they generally have not provided genome-wide insights about biological function.

The first genome-wide analyses that shed light on human genome function relied on observing the actions of evolution. The ever-growing set of vertebrate genome sequences (refs 4–8) is providing increasing power to reveal the genomic regions that have been most and least acted on by the forces of evolution. However, although these studies convincingly indicate the presence of numerous genomic regions under strong evolutionary constraint, they have less power in identifying the precise bases that are constrained and provide little, if any, insight into why those bases are biologically important. Furthermore, although we have good models for how protein-coding regions evolve, our present understanding about the evolution of other functional genomic regions is poorly developed. Experimental studies that augment what we learn from evolutionary analyses are key for solidifying our insights regarding genome function.
The Encyclopedia of DNA Elements (ENCODE) Project (ref. 9) aims to provide a more biologically informative representation of the human genome by using high-throughput methods to identify and catalogue the functional elements encoded. In its pilot phase, 35 groups provided more than 200 experimental and computational data sets that examined in unprecedented detail a targeted 29,998 kilobases (kb) of the human genome. These roughly 30 Mb, equivalent to ~1% of the human genome, are sufficiently large and diverse to allow for rigorous pilot testing of multiple experimental and computational methods. These 30 Mb are divided among 44 genomic regions; approximately 15 Mb reside in 14 regions for which there is already substantial biological knowledge, whereas the other 15 Mb reside in 30 regions chosen by a stratified random-sampling method (see http://www.genome.gov/10506161). The highlights of our findings to date include:

• The human genome is pervasively transcribed, such that the majority of its bases are associated with at least one primary transcript and many transcripts link distal regions to established protein-coding loci.
• Many novel non-protein-coding transcripts have been identified, with many of these overlapping protein-coding loci and others located in regions of the genome previously thought to be transcriptionally silent.
• Numerous previously unrecognized transcription start sites have been identified, many of which show chromatin structure and sequence-specific protein-binding properties similar to well-understood promoters.
• Regulatory sequences that surround transcription start sites are symmetrically distributed, with no bias towards upstream regions.
• Chromatin accessibility and histone modification patterns are highly predictive of both the presence and activity of transcription start sites.
• Distal DNaseI hypersensitive sites have characteristic histone modification patterns that reliably distinguish them from promoters; some of these distal sites show marks consistent with insulator function.
• DNA replication timing is correlated with chromatin structure.
• A total of 5% of the bases in the genome can be confidently identified as being under evolutionary constraint in mammals; for approximately 60% of these constrained bases, there is evidence of function on the basis of the results of the experimental assays performed to date.
• Although there is general overlap between genomic regions identified as functional by experimental assays and those under evolutionary constraint, not all bases within these experimentally defined regions show evidence of constraint.
• Different functional elements vary greatly in their sequence variability across the human population and in their likelihood of residing within a structurally variable region of the genome.
• Surprisingly, many functional elements are seemingly unconstrained across mammalian evolution. This suggests the possibility of a large pool of neutral elements that are biochemically active but provide no specific benefit to the organism. This pool may serve as a 'warehouse' for natural selection, potentially acting as the source of lineage-specific elements and functionally conserved but non-orthologous elements between species.

*A list of authors and their affiliations appears at the end of the paper.
Nature, Vol 447, 14 June 2007, doi:10.1038/nature05874. ©2007 Nature Publishing Group.

Below, we first provide an overview of the experimental techniques used for our studies, after which we describe the insights gained from analysing and integrating the generated data sets. We conclude with a perspective of what we have learned to date about this 1% of the human genome and what we believe the prospects are for a broader and deeper investigation of the functional elements in the human genome. To aid the reader, Box 1 provides a glossary for many of the abbreviations used throughout this paper.
Experimental techniques

Table 1 (expanded in Supplementary Information section 1.1) lists the major experimental techniques used for the studies reported here, relevant acronyms, and references reporting the generated data sets. These data sets reflect over 400 million experimental data points (603 million data points if one includes comparative sequencing bases). In describing the major results and initial conclusions, we seek to distinguish 'biochemical function' from 'biological role'. Biochemical function reflects the direct behaviour of a molecule(s), whereas biological role is used to describe the consequence(s) of this function for the organism. Genome-analysis techniques nearly always focus on biochemical function but not necessarily on biological role. This is because the former is more amenable to large-scale data-generation methods, whereas the latter is more difficult to assay on a large scale.

The ENCODE pilot project aimed to establish redundancy with respect to the findings represented by different data sets. In some instances, this involved the intentional use of different assays that were based on a similar technique, whereas in other situations, different techniques assayed the same biochemical function. Such redundancy has allowed methods to be compared and consensus data sets to be generated, much of which is discussed in companion papers, such as the ChIP-chip platform comparison (refs 10, 11). All ENCODE data have been released after verification but before this publication, as befits a 'community resource' project (see http://www.wellcome.ac.uk/doc_wtd003208.html). An experiment is considered verified when it has been reproducibly confirmed (see Supplementary Information section 1.2).
The main portal for ENCODE data is provided by the UCSC Genome Browser (http://genome.ucsc.edu/ENCODE/); this is augmented by multiple other websites (see Supplementary Information section 1.1).

A common feature of genomic analyses is the need to assess the significance of the co-occurrence of features or of other statistical tests. One confounding factor is the heterogeneity of the genome, which can produce uninteresting correlations of variables distributed across the genome. We have developed and used a statistical framework that mitigates many of these hidden correlations by adjusting the appropriate null distribution of the test statistics. We term this correction procedure genome structure correction (GSC) (see Supplementary Information section 1.3).

In the next five sections, we detail the various biological insights of the pilot phase of the ENCODE Project.

Box 1 | Frequently used abbreviations in this paper

AR | Ancient repeat: a repeat that was inserted into the early mammalian lineage and has since become dormant; the majority of ancient repeats are thought to be neutrally evolving.
CAGE tag | A short sequence from the 5′ end of a transcript
CDS | Coding sequence: a region of a cDNA or genome that encodes proteins
ChIP-chip | Chromatin immunoprecipitation followed by detection of the products using a genomic tiling array
CNV | Copy number variants: regions of the genome that have large duplications in some individuals in the human population
CS | Constrained sequence: a genomic region associated with evidence of negative selection (that is, rejection of mutations relative to neutral regions)
DHS | DNaseI hypersensitive site: a region of the genome showing a sharply different sensitivity to DNaseI compared with its immediate locale
EST | Expressed sequence tag: a short sequence of a cDNA indicative of expression at this point
FAIRE | Formaldehyde-assisted isolation of regulatory elements: a method to assay open chromatin using formaldehyde crosslinking followed by detection of the products using a genomic tiling array
FDR | False discovery rate: a statistical method for setting thresholds on statistical tests to correct for multiple testing
GENCODE | Integrated annotation of existing cDNA and protein resources to define transcripts with both manual review and experimental testing procedures
GSC | Genome structure correction: a method to adapt statistical tests to make fewer assumptions about the distribution of features on the genome sequence. This provides a conservative correction to standard tests
HMM | Hidden Markov model: a machine-learning technique that can establish optimal parameters for a given model to explain the observed data
Indel | An insertion or deletion; two sequences often show a length difference within alignments, but it is not always clear whether this reflects a previous insertion or a deletion
PET | A short sequence that contains both the 5′ and 3′ ends of a transcript
RACE | Rapid amplification of cDNA ends: a technique for amplifying cDNA sequences between a known internal position in a transcript and its 5′ end
RFBR | Regulatory factor binding region: a genomic region found by a ChIP-chip assay to be bound by a protein factor
RFBR-Seqsp | Regulatory factor binding regions that are from sequence-specific binding factors
RT-PCR | Reverse transcriptase polymerase chain reaction: a technique for amplifying a specific region of a transcript
RxFrag | Fragment of a RACE reaction: a genomic region found to be present in a RACE product by an unbiased tiling-array assay
SNP | Single nucleotide polymorphism: a single base pair change between two individuals in the human population
STAGE | Sequence tag analysis of genomic enrichment: a method similar to ChIP-chip for detecting protein factor binding regions but using extensive short sequence determination rather than genomic tiling arrays
SVM | Support vector machine: a machine-learning technique that can establish an optimal classifier on the basis of labelled training data
TR50 | A measure of replication timing corresponding to the time in the cell cycle when 50% of the cells have replicated their DNA at a specific genomic position
TSS | Transcription start site
TxFrag | Fragment of a transcript: a genomic region found to be present in a transcript by an unbiased tiling-array assay
Un.TxFrag | A TxFrag that is not associated with any other functional annotation
UTR | Untranslated region: part of a cDNA either at the 5′ or 3′ end that does not encode a protein sequence

Transcription

Overview. RNA transcripts are involved in many cellular functions, either directly as biologically active molecules or indirectly by encoding other active molecules. In the conventional view of genome organization, sets of RNA transcripts (for example, messenger RNAs) are encoded by distinct loci, with each usually dedicated to a single biological role (for example, encoding a specific protein). However, this picture has substantially grown in complexity in recent years (ref. 12). Other forms of RNA molecules (such as small nucleolar RNAs and micro (mi)RNAs) are known to exist, and often these are encoded by regions that intercalate with protein-coding genes. These observations are consistent with the well-known discrepancy between the levels of observable mRNAs and large structural RNAs compared with the total RNA in a cell, suggesting that there are numerous RNA species yet to be classified (refs 13–15). In addition, studies of specific loci have indicated the presence of RNA transcripts that have a role in chromatin maintenance and other regulatory control.
We sought to assay and analyse transcription comprehensively across the 44 ENCODE regions in an effort to understand the repertoire of encoded RNA molecules.

Transcript maps. We used three methods to identify transcripts emanating from the ENCODE regions: hybridization of RNA (either total or polyA-selected) to unbiased tiling arrays (see Supplementary Information section 2.1), tag sequencing of cap-selected RNA at the 5′ or joint 5′/3′ ends (see Supplementary Information sections 2.2 and S2.3), and integrated annotation of available complementary DNA and EST sequences involving computational, manual, and experimental approaches (ref. 16) (see Supplementary Information section 2.4). We abbreviate the regions identified by unbiased tiling arrays as TxFrags, the cap-selected RNAs as CAGE or PET tags (see Box 1), and the integrated annotation as GENCODE transcripts. When a TxFrag does not overlap a GENCODE annotation, we call it an Un.TxFrag. Validation of these various studies is described in papers reporting these data sets (ref. 17) (see Supplementary Information sections 2.1.4 and 2.1.5).

These methods recapitulate previous findings, but provide enhanced resolution owing to the larger number of tissues sampled and the integration of results across the three approaches (see Table 2). To begin with, our studies show that 14.7% of the bases represented in the unbiased tiling arrays are transcribed in at least one tissue sample. Consistent with previous work (refs 14, 15), many (63%) TxFrags reside outside of GENCODE annotations, both in intronic (40.9%) and intergenic (22.6%) regions. GENCODE annotations are richer than the more-conservative RefSeq or Ensembl annotations, with 2,608 transcripts clustered into 487 loci, leading to an average of 5.4 transcripts per locus. Finally, extensive testing of predicted protein-coding sequences outside of GENCODE annotations was positive in only 2% of cases (ref. 16), suggesting that GENCODE annotations cover nearly all protein-coding sequences.
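Coverage figures such as the share of tiled bases detected as either a GENCODE exon or a TxFrag are fractions of bases covered by the union of two interval sets. The sketch below shows one simple way such a fraction could be computed. It assumes half-open (start, end) coordinates on a single region; the function names and the toy coordinates are hypothetical and are not ENCODE data or the consortium's pipeline. The last line only re-derives the transcripts-per-locus average quoted in the text.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def covered_fraction(intervals, region_length):
    """Fraction of a region's bases covered by at least one interval."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / region_length


# Toy example with made-up coordinates on a 1,000-base region
gencode_exons = [(100, 200), (400, 450)]
txfrags = [(150, 250), (700, 730)]
union_fraction = covered_fraction(gencode_exons + txfrags, 1000)

# Arithmetic check of a figure quoted in the text: 2,608 transcripts in 487 loci
transcripts_per_locus = 2608 / 487
print(f"{union_fraction:.1%} covered; {transcripts_per_locus:.1f} transcripts per locus")
```

Merging before summing matters because bases covered by both an exon and a TxFrag would otherwise be counted twice, which is why the "either" column of Table 2 is smaller than the sum of the other two columns.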
The GENCODE annotations are categorized both by likely function (mainly, the presence of an open reading frame) and by classification evidence (for example, transcripts based solely on ESTs are distinguished from other scenarios); this classification is not strongly correlated with expression levels (see Supplementary Information sections 2.4.2 and 2.4.3).

Analyses of more biological samples have allowed a richer description of the transcription specificity (see Fig. 1 and Supplementary Information section 2.5). We found that 40% of TxFrags are present in only one sample, whereas only 2% are present in all samples. Although exon-containing TxFrags are more likely (74%) to be expressed in more than one sample, 45% of unannotated TxFrags are also expressed in multiple samples. GENCODE annotations of separate loci often (42%) overlap with respect to their genomic coordinates, in particular on opposite strands (33% of loci). Further analysis of GENCODE-annotated sequences with respect to the positions of open reading frames revealed that some component exons do not have the expected synonymous versus non-synonymous substitution patterns of protein-coding sequence (see Supplementary Information section 2.6) and some have deletions incompatible with protein structure (ref. 18). Such exons are on average less expressed (25% versus 87% by RT-PCR; see Supplementary Information section 2.7) than exons involved in more than one transcript (see Supplementary Information section 2.4.3), but when expressed have a tissue distribution comparable to well-established genes.

Critical questions are raised by the presence of a large amount of unannotated transcription with respect to how the corresponding sequences are organized in the genome: do these reflect longer transcripts that include known loci, do they link known loci, or are they completely separate from known loci? We further investigated these issues using both computational and new experimental techniques.

Table 1 | Summary of types of experimental techniques used in ENCODE

Feature class | Experimental technique(s) | Abbreviations | References | Number of experimental data points
Transcription | Tiling array, integrated annotation | TxFrag, RxFrag, GENCODE | 117, 118, 19, 119 | 63,348,656
5′ ends of transcripts* | Tag sequencing | PET, CAGE | 121, 13 | 864,964
Histone modifications | Tiling array | Histone nomenclature†, RFBR | 46 | 4,401,291
Chromatin‡ structure | QT-PCR, tiling array | DHS, FAIRE | 42, 43, 44, 122 | 15,318,324
Sequence-specific factors | Tiling array, tag sequencing, promoter assays | STAGE, ChIP-Chip, ChIP-PET, RFBR | 41, 52, 11, 120, 123, 81, 34, 51, 124, 49, 33, 40 | 324,846,018
Replication | Tiling array | TR50 | 59, 75 | 14,735,740
Computational analysis | Computational methods | CCI, RFBR cluster | 80, 125, 10, 16, 126, 127 | NA
Comparative sequence analysis* | Genomic sequencing, multi-sequence alignments, computational analyses | CS | 87, 86, 26 | NA
Polymorphisms* | Resequencing, copy number variation | CNV | 103, 128 | NA

* Not all data generated by the ENCODE Project.
† Histone code nomenclature follows the Brno nomenclature as described in ref. 129.
‡ Also contains histone modification.

Table 2 | Bases detected in processed transcripts either as a GENCODE exon, a TxFrag, or as either a GENCODE exon or a TxFrag

Measure | GENCODE exon | TxFrag | Either GENCODE exon or TxFrag
Total detectable transcripts (bases) | 1,776,157 (5.9%) | 1,369,611 (4.6%) | 2,519,280 (8.4%)
Transcripts detected in tiled regions of arrays (bases) | 1,447,192 (9.8%) | 1,369,611 (9.3%) | 2,163,303 (14.7%)

Percentages are of total bases in ENCODE in the first row and bases tiled in arrays in the second row.

Unannotated transcription. Consistent with previous findings, the Un.TxFrags did not show evidence of encoding proteins (see Supplementary Information section 2.8). One might expect Un.TxFrags to be linked within transcripts that exhibit coordinated expression and have similar conservation profiles across species. To test this, we clustered Un.TxFrags using two methods.
The first method (ref. 19) used expression levels in 11 cell lines or conditions, dinucleotide composition, location relative to annotated genes, and evolutionary conservation profiles to cluster TxFrags (both unannotated and annotated). By this method, 14% of Un.TxFrags could be assigned to annotated loci, and 21% could be clustered into 200 novel loci (with an average of ~7 TxFrags per locus). We experimentally examined these novel loci to study the connectivity of transcripts amongst Un.TxFrags and between Un.TxFrags and known exons. Overall, about 40% of the connections (18 out of 46) were validated by RT-PCR. The second clustering method involved analysing a time course (0, 2, 8 and 32 h) of expression changes in human HL60 cells following retinoic-acid stimulation. There is a coordinated program of expression changes from annotated loci, which can be shown by plotting Pearson correlation values of the expression levels of exons inside annotated loci versus unrelated exons (see Supplementary Information section 2.8.2). Similarly, there is coordinated expression of nearby Un.TxFrags, albeit lower, though still significantly different from randomized sets. Both clustering methods indicate that there is coordinated behaviour of many Un.TxFrags, consistent with them residing in connected transcripts.

Transcript connectivity. We used a combination of RACE and tiling arrays (ref. 20) to investigate the diversity of transcripts emanating from protein-coding loci. Analogous to TxFrags, we refer to transcripts detected using RACE followed by hybridization to tiling arrays as RxFrags. We performed RACE to examine 399 protein-coding loci (those loci found entirely in ENCODE regions) using RNA derived from 12 tissues, and were able to unambiguously detect 4,573 RxFrags for 359 loci (see Supplementary Information section 2.9).
Almost half of these RxFrags (2,324) do not overlap a GENCODE exon, and most (90%) loci have at least one novel RxFrag, which often extends a considerable distance beyond the 5′ end of the locus. Figure 2 shows the distribution of distances between these new RACE-detected ends and the previously annotated TSS of each locus. The average distance of the extensions is between 50 kb and 100 kb, with many extensions (>20%) being more than 200 kb. Consistent with the known presence of overlapping genes in the human genome, our findings reveal evidence for an overlapping gene at 224 loci, with transcripts from 180 of these loci (~50% of the RACE-positive loci) appearing to have incorporated at least one exon from an upstream gene.

To characterize further the 5′ RxFrag extensions, we performed RT-PCR followed by cloning and sequencing for 550 of the 5′ RxFrags (including the 261 longest extensions identified for each locus). The approach of mapping RACE products using microarrays is a combination method previously described and validated in several studies (refs 14, 17, 20). Hybridization of the RT-PCR products to tiling arrays confirmed connectivity in almost 60% of the cases. Sequenced clones confirmed transcript extensions. Longer extensions were harder to clone and sequence, but 5 out of 18 RT-PCR-positive extensions over 100 kb were verified by sequencing (see Supplementary Information section 2.9.7 and ref. 17). The detection of numerous RxFrag extensions coupled with evidence of considerable intronic transcription indicates that protein-coding loci are more transcriptionally complex than previously thought. Instead of the traditional view that many genes have one or more alternative transcripts that code for alternative proteins, our data suggest that a given gene may both encode multiple protein products and produce other transcripts that include sequences from both strands and from neighbouring loci (often without encoding a different protein).
Figure 3 illustrates such a case, in which a new fusion transcript is expressed in the small intestine, and consists of at least three coding exons from the ATP5O gene and at least two coding exons from the DONSON gene, with no evidence of sequences from two intervening protein-coding genes (ITSN1 and CRYZL1).

Figure 1 | Annotated and unannotated TxFrags detected in different cell lines. The proportion of different types of transcripts detected in the indicated number of cell lines (from 1/11 at the far left to 11/11 at the far right) is shown. The data for annotated and unannotated TxFrags are indicated separately, and also split into different categories based on GENCODE classification: exonic, intergenic (proximal being within 5 kb of a gene and distal being otherwise), intronic (proximal being within 5 kb of an intron and distal being otherwise), and matching other ESTs not used in the GENCODE annotation (principally because they were unspliced). The y axis indicates the per cent of tiling array nucleotides present in that class for that number of samples (combination of cell lines and tissues). [Plot not reproduced. Panels: annotated transcripts, novel transcripts. Legend: intronic proximal, intronic distal, intergenic proximal, intergenic distal, other ESTs, GENCODE exonic. Axis: tiling array nucleotides (%).]

Figure 2 | Length of genomic extensions to GENCODE-annotated genes on the basis of RACE experiments followed by array hybridizations (RxFrags). The indicated bars reflect the frequency of extension lengths among different length classes. The solid line shows the cumulative frequency of extensions of that length or greater. Most of the extensions are greater than 50 kb from the annotated gene (see text for details). [Plot not reproduced. Axes: extension length (kb); per cent of RxFrag extensions (shaded boxes); cumulative per cent of extensions this length or greater (line).]

Pseudogenes. Pseudogenes, reviewed in refs 21 and 22, are generally considered non-functional copies of genes, are sometimes transcribed and often complicate analysis of transcription owing to close sequence similarity to functional genes. We used various computational methods to identify 201 pseudogenes (124 processed and 77 non-processed) in the ENCODE regions (see Supplementary Information section 2.10 and ref. 23). Tiling-array analysis of 189 of these revealed that 56% overlapped at least one TxFrag. However, possible cross-hybridization between the pseudogenes and their corresponding parent genes may have confounded such analyses. To assess better the extent of pseudogene transcription, 160 pseudogenes (111 processed and 49 non-processed) were examined for expression using RACE/tiling-array analysis (see Supplementary Information section 2.9.2). Transcripts were detected for 14 pseudogenes (8 processed and 6 non-processed) in at least one of the 12 tested RNA sources, the majority (9) being in testis (see ref. 23). Additionally, there was evidence for the transcription of 25 pseudogenes on the basis of their proximity (within 100 bp of a pseudogene end) to CAGE tags (8), PETs (2), or cDNAs/ESTs (21). Overall, we estimate that at least 19% of the pseudogenes in the ENCODE regions are transcribed, which is consistent with previous estimates (refs 24, 25).

Non-protein-coding RNA. Non-protein-coding RNAs (ncRNAs) include structural RNAs (for example, transfer RNAs, ribosomal RNAs, and small nuclear RNAs) and more recently discovered regulatory RNAs (for example, miRNAs).
There are only 8 well-characterized ncRNA genes within the ENCODE regions (U70, ACA36, ACA56, mir-192, mir-194-2, mir-196, mir-483 and H19), whereas representatives of other classes (for example, box C/D snoRNAs, tRNAs, and functional snRNAs) seem to be completely absent in the ENCODE regions. Tiling-array data provided evidence for transcription in at least one of the assayed RNA samples for all of these ncRNAs, with the exception of mir-483 (expression of mir-483 might be specific to fetal liver, which was not tested). There is also evidence for the transcription of 6 out of 8 pseudogenes of ncRNAs (mainly snoRNA-derived). Similar to the analysis of protein-pseudogenes, the hybridization results could also originate from the known snoRNA gene elsewhere in the genome.

Many known ncRNAs are characterized by a well-defined RNA secondary structure. We applied two de novo ncRNA prediction algorithms, EvoFold and RNAz, to predict structured ncRNAs (as well as functional structures in mRNAs) using the multi-species sequence alignments (see below, Supplementary Information section 2.11 and ref. 26). Using a sensitivity threshold capable of detecting all known miRNAs and snoRNAs, we identified 4,986 and 3,707 candidate ncRNA loci with EvoFold and RNAz, respectively. Only 268 loci (5% and 7%, respectively) were found with both programs, representing a 1.6-fold enrichment over that expected by chance; the lack of more extensive overlap is due to the two programs having optimal sensitivity at different levels of GC content and conservation.

We experimentally examined 50 of these targets using RACE/tiling-array analysis for brain and testis tissues (see Supplementary Information sections 2.11 and 2.9.3); the predictions were validated at a 56%, 65%, and 63% rate for EvoFold, RNAz and dual predictions, respectively.

Primary transcripts.
The detection of numerous unannotated transcripts coupled with increasing knowledge of the general complexity of transcription prompted us to examine the extent of primary (that is, unspliced) transcripts across the ENCODE regions. Three data sources provide insight about these primary transcripts: the GENCODE annotation, PETs, and RxFrag extensions. Figure 4 summarizes the fraction of bases in the ENCODE regions that overlap transcripts identified by these technologies. Remarkably, 93% of bases are represented in a primary transcript identified by at least two independent observations (but potentially using the same technology); this figure is reduced to 74% in the case of primary transcripts detected by at least two different technologies. These increased spans are not mainly due to cell line rearrangements because they were present in multiple tissue experiments that confirmed the spans (see Supplementary Information section 2.12). These estimates assume that the presence of PETs or RxFrags defining the terminal ends of a transcript implies that the entire intervening DNA is transcribed and then processed. Other mechanisms, thought to be unlikely in the human genome, such as trans-splicing or polymerase jumping, would also produce these long termini and potentially should be reconsidered in more detail.
[Figure 3 image: human chromosome 21, positions 33,900,000 to 34,200,000. Tracks: RxFrag, PETs, cloned RT–PCR product and GENCODE reference genes (DONSON, CRYZL1, ITSN1, ATP5O).]
Figure 3 | Overview of RACE experiments showing a gene fusion. Transcripts emanating from the region between the DONSON and ATP5O genes. A 330-kb interval of human chromosome 21 (within ENm005) is shown, which contains four annotated genes: DONSON, CRYZL1, ITSN1 and ATP5O. The 5′ RACE products generated from small intestine RNA and detected by tiling-array analyses (RxFrags) are shown along the top. Along the bottom is shown the placement of a cloned and sequenced RT–PCR product that has two exons from the DONSON gene followed by three exons from the ATP5O gene; these sequences are separated by a 300 kb intron in the genome. A PET tag shows the termini of a transcript consistent with this RT–PCR product.
[Figure 4 image. Legend categories: no coverage; one technology, one observation; one technology, two observations; two technologies; all three technologies.]
Figure 4 | Coverage of primary transcripts across ENCODE regions. Three different technologies (integrated annotation from GENCODE, RACE-array experiments (RxFrags) and PET tags) were used to assess the presence of a nucleotide in a primary transcript. Use of these technologies provided the opportunity to have multiple observations of each finding. The proportion of genomic bases detected in the ENCODE regions associated with each of the following scenarios is depicted: detected by all three technologies, by two of the three technologies, by one technology but with multiple observations, and by one technology with only one observation. Also indicated are genomic bases without any detectable coverage of primary transcripts.
NATURE | Vol 447 | 14 June 2007 ARTICLES 803 Nature ©2007 Publishing Group
Previous studies have suggested a similar broad amount of transcription across the human 14,15 and mouse 27 genomes. Our studies confirm these results, and have investigated the genesis of these transcripts in greater detail, confirming the presence of substantial intragenic and intergenic transcription. At the same time, many of the resulting transcripts are neither traditional protein-coding transcripts nor easily explained as structural non-coding RNAs. Other studies have noted complex transcription around specific loci or chimaeric-gene structures (for example refs 28–30), but these have often been considered exceptions; our data show that complex intercalated transcription is common at many loci.
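The coverage categories summarized in Figure 4 can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the consortium's pipeline: the function name `coverage_categories`, the half-open interval convention and the toy intervals are hypothetical, and each tuple is treated as one independent observation made with one technology.

```python
from collections import Counter

def coverage_categories(region_length, observations):
    """Classify every base by the technologies and observations that cover it.

    observations: list of (technology, start, end) tuples, half-open [start, end).
    Returns a Counter mapping category name -> number of bases.
    """
    techs = [set() for _ in range(region_length)]   # technologies seen per base
    n_obs = [0] * region_length                     # observations seen per base
    for tech, start, end in observations:
        for pos in range(max(0, start), min(region_length, end)):
            techs[pos].add(tech)
            n_obs[pos] += 1

    counts = Counter()
    for pos in range(region_length):
        if n_obs[pos] == 0:
            counts["no coverage"] += 1
        elif len(techs[pos]) >= 3:
            counts["all three technologies"] += 1
        elif len(techs[pos]) == 2:
            counts["two technologies"] += 1
        elif n_obs[pos] >= 2:
            # Assumption: "two observations" is read as two or more.
            counts["one technology, two observations"] += 1
        else:
            counts["one technology, one observation"] += 1
    return counts

# Hypothetical 100-base region with five observations.
toy = [
    ("GENCODE", 0, 40),
    ("PET", 20, 60),
    ("RxFrag", 30, 50),
    ("PET", 70, 90),
    ("PET", 80, 100),
]
result = coverage_categories(100, toy)
```

Dividing each count by the region length gives the per-category fractions of the kind plotted in Figure 4.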
The results presented in the next section show extensive amounts of regulatory factors around novel TSSs, which is consistent with this extensive transcription. The biological relevance of these unannotated transcripts remains unanswered by these studies. Evolutionary information (detailed below) is mixed in this regard; for example, it indicates that unannotated transcripts show weaker evolutionary conservation than many other annotated features. As with other ENCODE-detected elements, it is difficult to identify clear biological roles for the majority of these transcripts; such experiments are challenging to perform on a large scale and, furthermore, it seems likely that many of the corresponding biochemical events may be evolutionarily neutral (see below).
Regulation of transcription
Overview. A significant challenge in biology is to identify the transcriptional regulatory elements that control the expression of each transcript and to understand how the function of these elements is coordinated to execute complex cellular processes. A simple, commonplace view of transcriptional regulation involves five types of cis-acting regulatory sequences: promoters, enhancers, silencers, insulators and locus control regions 31. Overall, transcriptional regulation involves the interplay of multiple components, whereby the availability of specific transcription factors and the accessibility of specific genomic regions determine whether a transcript is generated 31. However, the current view of transcriptional regulation is known to be overly simplified, with many details remaining to be established. For example, the consensus sequences of transcription factor binding sites (typically 6 to 10 bases) have relatively little information content and are present numerous times in the genome, with the great majority of these not participating in transcriptional regulation. Does chromatin structure then determine whether such a sequence has a regulatory role?
Are there complex inter-factor interactions that integrate the signals from multiple sites? How are signals from different distal regulatory elements coupled without affecting all neighbouring genes? Meanwhile, our understanding of the repertoire of transcriptional events is becoming more complex, with an increasing appreciation of alternative TSSs 32,33 and the presence of non-coding 27,34 and anti-sense transcripts 35,36.
To better understand transcriptional regulation, we sought to begin cataloguing the regulatory elements residing within the 44 ENCODE regions. For this pilot project, we mainly focused on the binding of regulatory proteins and chromatin structure involved in transcriptional regulation. We analysed over 150 data sets, mainly from ChIP-chip 37–39, ChIP-PET and STAGE 40,41 studies (see Supplementary Information sections 3.1 and 3.2). These methods use chromatin immunoprecipitation with specific antibodies to enrich for DNA in physical contact with the targeted epitope. This enriched DNA can then be analysed using either microarrays (ChIP-chip) or high-throughput sequencing (ChIP-PET and STAGE). The assays included 18 sequence-specific transcription factors and components of the general transcription machinery (for example, RNA polymerase II (Pol II), TAF1 and TFIIB/GTF2B). In addition, we tested more than 600 potential promoter fragments for transcriptional activity by transient-transfection reporter assays that used 16 human cell lines 33. We also examined chromatin structure by studying the ENCODE regions for DNaseI sensitivity (by quantitative PCR 42 and tiling arrays 43,44, see Supplementary Information section 3.3), histone composition 45, histone modifications (using ChIP-chip assays) 37,46, and histone displacement (using FAIRE, see Supplementary Information section 3.4). Below, we detail these analyses, starting with the efforts to define and classify the 5′ ends of transcripts with respect to their associated regulatory signals.
Following that are summaries of generated data about sequence-specific transcription factor binding and clusters of regulatory elements. Finally, we describe how this information can be integrated to make predictions about transcriptional regulation.
Transcription start site catalogue. We analysed two data sets to catalogue TSSs in the ENCODE regions: the 5′ ends of GENCODE-annotated transcripts and the combined results of two 5′-end-capture technologies, CAGE and PET-tagging. The initial results suggested the potential presence of 16,051 unique TSSs. However, in many cases, multiple TSSs resided within a single small segment (up to ~200 bases); this was due to some promoters containing TSSs with many very close precise initiation sites 47. To normalize for this effect, we grouped TSSs that were 60 or fewer bases apart into a single cluster, and in each case considered the most frequent CAGE or PET tag (or the 5′-most TSS in the case of TSSs identified only from GENCODE data) as representative of that cluster for downstream analyses.
The above effort yielded 7,157 TSS clusters in the ENCODE regions. We classified these TSSs into three categories: known (present at the end of GENCODE-defined transcripts), novel (supported by other evidence) and unsupported. The novel TSSs were further subdivided on the basis of the nature of the supporting evidence (see Table 3 and Supplementary Information section 3.5), with all four of the resulting subtypes showing significant overlap with experimental evidence using the GSC statistic. Although there is a larger relative proportion of singleton tags in the novel category, when analysis is restricted to only singleton tags, the novel TSSs continue to have highly significant overlap with supporting evidence (see Supplementary Information section 3.5.1).
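The grouping rule described above (TSSs 60 or fewer bases apart fall into one cluster, represented by the most frequent tag) can be sketched as follows. This is a minimal sketch rather than the authors' code: it assumes single-linkage grouping of sorted positions on one strand, breaks ties between equally frequent tags by taking the lowest coordinate, and leaves out the separate 5′-most rule used for GENCODE-only TSSs. The function name `cluster_tss` and the toy tag positions are hypothetical.

```python
from collections import Counter

def cluster_tss(tag_positions, max_gap=60):
    """Group TSS positions whose neighbouring gap is max_gap bases or fewer.

    tag_positions: one integer per CAGE/PET tag (repeats = several tags at a site).
    Returns a list of (cluster_start, cluster_end, representative) tuples.
    """
    counts = Counter(tag_positions)
    clusters, current = [], []
    for pos in sorted(counts):
        # Start a new cluster when the gap to the previous position is too large.
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)

    result = []
    for members in clusters:
        # Representative = position with the most tags; ties go to the lowest
        # coordinate (an assumption made for this sketch).
        representative = max(members, key=lambda p: (counts[p], -p))
        result.append((members[0], members[-1], representative))
    return result

# Hypothetical tag positions on one strand.
tags = [100, 100, 100, 130, 185, 185, 400, 455, 455, 600]
clusters = cluster_tss(tags)
```

With these toy tags, positions 100, 130 and 185 chain into one cluster because each neighbouring gap is at most 60 bases, even though the cluster itself spans 85 bases.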
Correlating genomic features with chromatin structure and transcription factor binding. By measuring relative sensitivity to DNaseI digestion (see Supplementary Information section 3.3), we identified DNaseI hypersensitive sites (DHSs) throughout the ENCODE regions. DHSs and TSSs both reflect genomic regions thought to be enriched for regulatory information and many DHSs reside at or near TSSs. We partitioned DHSs into those within 2.5 kb of a TSS (958; 46.5%) and the remaining ones, which were classified as distal (1,102; 53.5%). We then cross-analysed the TSSs and DHSs with data sets relating to histone modifications, chromatin accessibility and sequence-specific transcription factor binding by summarizing these signals in aggregate relative to the distance from TSSs or DHSs. Figure 5 shows representative profiles of specific histone modifications, Pol II and selected transcription factor binding for the different categories of TSSs. Further profiles and statistical analysis of these studies can be found in Supplementary Information section 3.6.
In the case of the three TSS categories (known, novel and unsupported), known and novel TSSs are both associated with similar signals for multiple factors (ranging from histone modifications through DNaseI accessibility), whereas unsupported TSSs are not. The enrichments seen with chromatin modifications and sequence-specific factors, along with the significant clustering of this evidence, indicate that the novel TSSs do not reflect false positives and probably use the same biological machinery as other promoters. Sequence-specific transcription factors show a marked increase in binding across the broad region that encompasses each TSS. This increase is notably symmetric, with binding equally likely upstream or downstream of a TSS (see Supplementary Information section 3.7 for an explanation of why this symmetrical signal is not an artefact of the analysis of the signals). Furthermore, there is enrichment of SMARCC1 binding (a member of the SWI/SNF chromatin-modifying complex), which persists across a broader extent than other factors. The broad signals with this factor indicate that the ChIP-chip results reflect both specific enrichment at the TSS and broader enrichments across ~5-kb regions (this is not due to technical issues, see Supplementary Information section 3.8).
Table 3 | Different categories of TSSs defined on the basis of support from different transcript-survey methods
Category      Transcript survey method        Number of TSS clusters (non-redundant)*   P value†       Singleton clusters‡ (%)
Known         GENCODE 5′ ends                 1,730                                     2 × 10^-70     25 (74 overall)
Novel         GENCODE sense exons             1,437                                     6 × 10^-39     64
Novel         GENCODE antisense exons         521                                       3 × 10^-8      65
Novel         Unbiased transcription survey   639                                       7 × 10^-63     71
Novel         CpG island                      164                                       4 × 10^-90     60
Unsupported   None                            2,666                                     -              83.4
* Number of TSS clusters with this support, excluding TSSs from higher categories.
† Probability of overlap between the transcript support and the PET/CAGE tags, as calculated by the Genome Structure Correction statistic (see Supplementary Information section 1.3).
‡ Per cent of clusters with only one tag. For the ‘known’ category this was calculated as the per cent of GENCODE 5′ ends with tag support (25%) or overall (74%).
We selected 577 GENCODE-defined TSSs at the 5′ ends of a protein-coding transcript with over 3 exons, to assess expression status. Each transcript was classified as: (1) ‘active’ (gene on) or ‘inactive’ (gene off) on the basis of the unbiased transcript surveys, and (2) residing near a ‘CpG island’ or not (‘non-CpG island’) (see Supplementary Information section 3.17). As expected, the aggregate signal of histone modifications is mainly attributable to active TSSs (Fig. 5), in particular those near CpG islands.
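The aggregate signals discussed here are described in the Figure 5 legend as normalization to mean 0 and standard deviation 1, summation at each position around a class of sites, and smoothing. The sketch below illustrates that sequence of steps on a toy signal. It is an assumption-laden illustration: the function names are hypothetical, one signal value per base is assumed, and a Gaussian-weighted moving average stands in for the kernel density method, whose details are given only in the Supplementary Information.

```python
import math
import statistics

def zscore(signal):
    """Rescale a signal to mean 0 and standard deviation 1."""
    mean = statistics.fmean(signal)
    sd = statistics.pstdev(signal)
    return [(value - mean) / sd for value in signal]

def aggregate_profile(signal, anchors, flank):
    """Sum the z-scored signal at each offset in [-flank, flank] around the anchors."""
    z = zscore(signal)
    profile = [0.0] * (2 * flank + 1)
    for anchor in anchors:
        for offset in range(-flank, flank + 1):
            pos = anchor + offset
            if 0 <= pos < len(z):          # skip offsets that fall off the region
                profile[offset + flank] += z[pos]
    return profile

def gaussian_smooth(values, bandwidth=1.0):
    """Gaussian-weighted moving average (a stand-in for the kernel method)."""
    smoothed = []
    for i in range(len(values)):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2)
                   for j in range(len(values))]
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return smoothed

# Toy signal with peaks at two hypothetical TSS positions.
signal = [0.0] * 50
signal[10] = 5.0
signal[30] = 5.0
profile = aggregate_profile(signal, anchors=[10, 30], flank=3)
smoothed = gaussian_smooth(profile)
```

In this toy case both the raw and the smoothed profile peak at offset 0, which sits at index `flank` of the returned list.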
Pronounced doublet peaks at the TSS can be seen with these large signals (similar to previous work in yeast 48) owing to the chromatin accessibility at the TSS. Many of the histone marks and Pol II signals are now clearly asymmetrical, with a persistent level of Pol II into the genic region, as expected. However, the sequence-specific factors remain largely symmetrically distributed. TSSs near CpG islands show a broader distribution of histone marks than those not near CpG islands (see Supplementary Information section 3.6). The binding of some transcription factors (E2F1, E2F4 and MYC) is extensive in the case of active genes, and is lower (or absent) in the case of inactive genes.
Chromatin signature of distal elements. Distal DHSs show characteristic patterns of histone modification that are the inverse of TSSs, with high H3K4me1 accompanied by lower levels of H3K4me3 and H3ac (Fig. 5). Many factors with high occupancy at TSSs (for example, E2F4) show little enrichment at distal DHSs, whereas other factors (for example, MYC) are enriched at both TSSs and distal DHSs 49. A particularly interesting observation is the relative enrichment of the insulator-associated factor CTCF 50 at both distal DHSs and TSSs; this contrasts with SWI/SNF components SMARCC2 and SMARCC1, which are TSS-centric. Such differential behaviour of sequence-specific factors points to distinct biological differences, mediated by transcription factors, between distal regulatory sites and TSSs.
[Figure 5 image: paired plots for six classes of sites (a, GENCODE TSS; b, novel TSS; c, unsupported tags; d, distal DHS; e, gene on CpG; f, gene off CpG). x axis: distance to nearest TSS or DHS (−5,000 to 5,000); y axis: aggregate normalized intensity. Series: H3K4me1, H3K4me2, H3K4me3, H3ac, H4ac, FAIRE, DNaseI, MYC, E2F1, E2F4, CTCF, SMARCC1 and Pol II.]
Figure 5 | Aggregate signals of tiling-array experiments from either ChIP-chip or chromatin structure assays, represented for different classes of TSSs and DHS. For each plot, the signal was first normalized with a mean of 0 and standard deviation of 1, and then the normalized scores were summed at each position for that class of TSS or DHS and smoothed using a kernel density method (see Supplementary Information section 3.6). For each class of sites there are two adjacent plots. The left plot depicts the data for general factors: FAIRE and DNaseI sensitivity as assays of chromatin accessibility and H3K4me1, H3K4me2, H3K4me3, H3ac and H4ac histone modifications (as indicated); the right plot shows the data for additional factors, namely MYC, E2F1, E2F4, CTCF, SMARCC1 and Pol II. The columns provide data for the different classes of TSS or DHS (unsmoothed data and statistical analysis shown in Supplementary Information section 3.6).
Unbiased maps of sequence-specific regulatory factor binding. The previous section focused on specific positions defined by TSSs or DHSs. We then analysed sequence-specific transcription factor binding data in an unbiased fashion. We refer to regions with enriched binding of regulatory factors as RFBRs. RFBRs were identified on the basis of ChIP-chip data in two ways: first, each investigator developed and used their own analysis method(s) to define high-enrichment regions, and second (and independently), a stringent false discovery rate (FDR) method was applied to analyse all data using three cut-offs (1%, 5% and 10%). The laboratory-specific and FDR-based methods were highly correlated, particularly for regions with strong signals 10,11.
For consistency, we used the results obtained with the FDR-based method (see Supplementary Information section 3.10). These RFBRs can be used to find sequence motifs (see Supplementary Information section S3.11).
RFBRs are associated with the 5′ ends of transcripts. The distribution of RFBRs is non-random (see ref. 10) and correlates with the positions of TSSs. We examined the distribution of specific RFBRs relative to the known TSSs. Different transcription factors and histone modifications vary with respect to their association with TSSs (Fig. 6; see Supplementary Information section 3.12 for modelling of random expectation). Factors for which binding sites are most enriched at the 5′ ends of genes include histone modifications, TAF1 and RNA Pol II with a hypo-phosphorylated carboxy-terminal domain 51, confirming previous expectations. Surprisingly, we found that E2F1, a sequence-specific factor that regulates the expression of many genes at the G1 to S transition 52, is also tightly associated with TSSs 52; this association is as strong as that of TAF1, the well-known TATA box-binding protein associated factor 1 (ref. 53). These results suggest that E2F1 has a more general role in transcription than previously suspected, similar to that for MYC 54–56. In contrast, the large-scale assays did not support the promoter binding that was found in smaller-scale studies (for example, on SIRT1 and SPI1 (PU1)).
Integration of data on sequence-specific factors. We expect that regulatory information is not dispersed independently across the genome, but rather is clustered into distinct regions 57. We refer to regions that contain multiple regulatory elements as ‘regulatory clusters’. We sought to predict the location of regulatory clusters by cross-integrating data generated using all transcription factor and histone modification assays, including results falling below an arbitrary threshold in individual experiments.
Specifically, we used four complementary methods to integrate the data from 129 ChIP-chip data sets (see Supplementary Information section 3.13 and ref. 58). These four methods detect different classes of regulatory clusters and as a whole identified 1,393 clusters. Of these, 344 were identified by all four methods, with another 500 found by three methods (see Supplementary Information section 3.13.5). Of the 344 regulatory clusters identified by all four methods, 67% (or 65% of the full set of 1,393) reside within 2.5 kb of a known or novel TSS (as defined above; see Table 3 and Supplementary Information section 3.14 for a breakdown by category). Restricting this analysis to previously annotated TSSs (for example, RefSeq or Ensembl) reveals that roughly 25% of the regulatory clusters are close to a previously identified TSS. These results suggest that many of the regulatory clusters identified by integrating the ChIP-chip data sets are undiscovered promoters or are somehow associated with transcription in another fashion. To test these possibilities, sets of 126 and 28 non-GENCODE-based regulatory clusters were tested for promoter activity (see Supplementary Information section 3.15) and by RACE, respectively. These studies revealed that 24.6% of the 126 tested regulatory clusters had promoter activity and that 78.6% of the 28 regulatory clusters analysed by RACE yielded products consistent with a TSS 58. The ChIP-chip data sets were generated on a mixture of cell lines, predominantly HeLa and GM06990, and were different from the CAGE/PET data, meaning that tissue specificity contributes to the presence of unique TSSs and regulatory clusters. The large increase in promoter-proximal regulatory clusters identified by including the additional novel TSSs, coupled with the positive promoter and RACE assays, suggests that most of the regulatory regions identifiable by these clustering methods represent bona fide promoters (see Supplementary Information section 3.16).
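The proximity criterion used here (a regulatory cluster residing within 2.5 kb of a TSS) amounts to a nearest-neighbour lookup against a sorted list of TSS coordinates. The sketch below shows one way to compute such a fraction. It rests on assumptions not stated in the text: each cluster is reduced to a single coordinate, the 2.5-kb boundary is treated as inclusive, and all positions lie on one chromosome. The function name and the toy coordinates are hypothetical.

```python
import bisect

def fraction_near_tss(cluster_positions, tss_positions, window=2500):
    """Return the fraction of clusters lying within `window` bases of any TSS."""
    tss_sorted = sorted(tss_positions)
    near = 0
    for pos in cluster_positions:
        i = bisect.bisect_left(tss_sorted, pos)
        # Only the TSS just before and just after the cluster can be the nearest.
        neighbours = tss_sorted[max(0, i - 1): i + 1]
        if neighbours and min(abs(pos - tss) for tss in neighbours) <= window:
            near += 1
    return near / len(cluster_positions)

# Hypothetical coordinates on a single chromosome.
tss = [10_000, 50_000, 90_000]
regulatory_clusters = [9_000, 12_500, 30_000, 52_501, 88_000, 70_000]
fraction = fraction_near_tss(regulatory_clusters, tss)
```

Re-running the same function with a smaller TSS list (for example, only previously annotated TSSs) shows how the proximal fraction depends on the completeness of the TSS catalogue.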
Although the regulatory factor assays were more biased towards regions associated with promoters, many of the sites from these experiments would have previously been described as distal to promoters. This suggests that commonplace use of RefSeq- or Ensembl-based gene definition to define promoter proximity will dramatically overestimate the number of distal sites.
Predicting TSSs and transcriptional activity on the basis of chromatin structure. The strong association between TSSs and both histone modifications and DHSs prompted us to investigate whether the location and activity of TSSs could be predicted solely on the basis of chromatin structure information. We trained a support vector machine (SVM) by using histone modification data anchored around DHSs to discriminate between DHSs near TSSs and those distant from TSSs. We used a selected 2,573 DHSs, split roughly between TSS-proximal DHSs and TSS-distal DHSs, as a training set. The SVM performed well, with an accuracy of 83% (see Supplementary Information section 3.17). Using this SVM, we then predicted new TSSs using information about DHSs and histone modifications; of 110 high-scoring predicted TSSs, 81 resided within 2.5 kb of a novel TSS. As expected, these show a significant overlap with the novel TSS groups (defined above) but without a strong bias towards any particular category (see Supplementary Information section 3.17.1.5).
To investigate the relationship between chromatin structure and gene expression, we examined transcript levels in two cell lines using a transcript-tiling array. We compared these transcript data with the results of ChIP-chip experiments that measured histone modifications across the ENCODE regions. From this, we developed a variety of predictors of expression status using chromatin modifications as variables; these were derived using both decision trees and SVMs (see Supplementary Information section 3.17). The best of these correctly predicts expression status (transcribed versus non-transcribed) in 91% of cases.
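As a rough illustration of the DHS classifier described in this section, the sketch below fits a soft-margin linear SVM to toy histone-mark features. It is not the model used in the study: the kernel, features and training procedure are described only in the Supplementary Information, so the linear form, the batch subgradient-descent training loop, the two features (H3K4me3 and H3K4me1 signal at a DHS) and all numbers are assumptions chosen for illustration.

```python
def train_linear_svm(features, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit a soft-margin linear SVM by batch subgradient descent on the hinge loss.

    labels must be +1 (TSS-proximal DHS) or -1 (TSS-distal DHS).
    Returns the weight vector w and the bias b.
    """
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]      # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(features, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:                   # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 for a predicted TSS-proximal DHS, -1 for a distal one."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy features per DHS: (H3K4me3 signal, H3K4me1 signal). Values are invented.
features = [(2.0, 0.2), (1.8, 0.4), (2.2, 0.1),   # proximal-like
            (0.3, 1.9), (0.1, 2.1), (0.4, 1.7)]   # distal-like
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(features, labels)
training_accuracy = sum(predict(w, b, x) == y
                        for x, y in zip(features, labels)) / len(labels)
```

The toy data follow the pattern reported in the text (higher H3K4me3 near TSSs, higher H3K4me1 at distal DHSs); with real data, accuracy would be estimated on held-out sites rather than on the training set.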
This success rate did not decrease dramatically when the predicting algorithm incorporated the results from one cell line to predict the expression status of another cell line. Interestingly, despite the striking difference in histone modification enrichments in TSSs residing near versus those more distal to CpG islands (see Fig. 5 and Supplementary Information section 3.6), including information about the proximity to CpG islands did not improve the predictors. This suggests that despite the marked differences in histone modifications among these TSS classes, a single predictor can be made, using the interactions between the different histone modification levels.
[Figure 6 image: scatter plot with axes "Fraction of TSSs near RFBRs" and "Fraction of RFBRs near TSSs". Labelled factors: E2F1, Pol II, TAF1, MYC, CTCF, SIRT1, SPI1, H3K27me3, STAT1, SMARCC1, SMARCC2, H3K4me2, H3K4me3 and H3K4me1. Circle-size legends are given separately for sequence-specific and general factors.]
Figure 6 | Distribution of RFBRs relative to GENCODE TSSs. Different RFBRs from sequence-specific factors (red) or general factors (blue) are plotted showing their relative distribution near TSSs. The x axis indicates the proportion of TSSs close (within 2.5 kb) to the specified factor. The y axis indicates the proportion of RFBRs close to TSSs. The size of the circle provides an indication of the number of RFBRs for each factor. A handful of representative factors are labelled.
In summary, we have integrated many data sets to provide a more complete view of regulatory information, both around specific sites (TSSs and DHSs) and in an unbiased manner. From analysing multiple data sets, we find 4,491 known and novel TSSs in the ENCODE regions, almost tenfold more than the number of established genes.
This large number of TSSs might explain the extensive transcription described above; it also begins to change our perspective about regulatory information: without such a large TSS catalogue, many of the regulatory clusters would have been classified as residing distal to promoters. In addition to this revelation about the abundance of promoter-proximal regulatory elements, we also identified a considerable number of putative distal regulatory elements, particularly on the basis of the presence of DHSs. Our study of distal regulatory elements was probably most hindered by the paucity of data generated using distal-element-associated transcription factors; nevertheless, we clearly detected a set of distal-DHS-associated segments bound by CTCF or MYC. Finally, we showed that information about chromatin structure alone could be used to make effective predictions about both the location and activity of TSSs.
Replication
Overview. DNA replication must be carefully coordinated, both across the genome and with respect to development. On a larger scale, early replication in S phase is broadly correlated with gene density and transcriptional activity 59–66; however, this relationship is not universal, as some actively transcribed genes replicate late and vice versa 61,64–68. Importantly, the relationship between transcription and DNA replication emerges only when the signal of transcription is averaged over a large window (>100 kb) 63, suggesting that larger-scale chromosomal architecture may be more important than the activity of specific genes 69.
The ENCODE Project provided a unique opportunity to examine whether individual histone modifications on human chromatin can be correlated with the time of replication and whether such correlations support the general relationship of active, open chromatin with early replication.
Our studies also tested whether segments showing interallelic variation in the time of replication have two different types of histone modifications consistent with an interallelic variation in chromatin state.
DNA replication data set. We mapped replication timing across the ENCODE regions by analysing Brd-U-labelled fractions from synchronized HeLa cells (collected at 2 h intervals throughout S phase) on tiling arrays (see Supplementary Information section 4.1). Although the HeLa cell line has a considerably altered karyotype, correlation of these data with other cell line data (see below) suggests the results are relevant to other cell types. The results are expressed as the time at which 50% of any given genomic position is replicated (TR50), with higher values signifying later replication times. In addition to the five ‘activating’ histone marks, we also correlated the TR50 with H3K27me3, a modification associated with polycomb-mediated transcriptional repression 70–74. To provide a consistent comparison framework, the histone data were smoothed to 100-kb resolution, and then correlated with the TR50 data by a sliding window correlation analysis (see Supplementary Information section 4.2).
The continuous profiles of the activating marks, histone H3K4 mono-, di-, and tri-methylation and histone H3 and H4 acetylation, are generally anti-correlated with the TR50 signal (Fig. 7a and Supplementary Information section 4.3). In contrast, H3K27me3 marks show a predominantly positive correlation with late-replicating segments (Fig. 7a; see Supplementary Information section 4.3 for additional analysis).
Although most genomic regions replicate in a temporally specific window in S phase, other regions demonstrate an atypical pattern of replication (Pan-S) where replication signals are seen in multiple parts of S phase.
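The TR50 measure defined in this section (the time at which 50% of a genomic position is replicated) can be illustrated with a short calculation. The sketch below is one plausible reading, not the procedure from the Supplementary Information: it assumes that a probe's signal in each 2 h fraction is proportional to the share of cells replicating that position in that interval, that replication is uniform within an interval, and that five fractions were collected. The function name and the toy signals are hypothetical.

```python
def tr50(signal_per_fraction, interval_hours=2.0):
    """Estimate the time (hours into S phase) at which half of the signal has accrued.

    signal_per_fraction: array signal for one probe in consecutive S-phase fractions.
    Uses linear interpolation inside the interval where the cumulative share
    crosses 50%.
    """
    total = sum(signal_per_fraction)
    if total <= 0:
        raise ValueError("no replication signal for this probe")
    cumulative = 0.0
    for index, signal in enumerate(signal_per_fraction):
        previous = cumulative
        cumulative += signal / total
        if cumulative >= 0.5:
            within = (0.5 - previous) / (cumulative - previous)
            return (index + within) * interval_hours
    return len(signal_per_fraction) * interval_hours  # guard against rounding

# Hypothetical probes with five 2 h fractions each.
early_probe = [8, 2, 0, 0, 0]   # mostly replicated in the first interval
late_probe = [0, 0, 1, 3, 6]    # mostly replicated in the last interval
pan_s_probe = [2, 2, 2, 2, 2]   # signal spread across all of S phase
```

A probe with signal spread evenly across fractions, like `pan_s_probe`, gets a mid-S TR50 even though it is not temporally specific, which is one reason such regions need to be identified separately.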
We have suggested that such a pattern of replication stems from interallelic variation in the chromatin structure 59,75. If one allele is in active chromatin and the other in repressed chromatin, both types of modified histones are expected to be enriched in the Pan-S segments. An ENCODE region was classified as non-specific (or Pan-S) when >60% of the probes in a 10-kb window replicated in multiple intervals in S phase. The remaining regions were sub-classified into early-, mid- or late-replicating based on the average TR50 of the temporally specific probes within a 10-kb window 75. For regions of each class of replication timing, we determined the relative enrichment of various histone modification peaks in HeLa cells (Fig. 7b; Supplementary Information section 4.4). The correlations of activating and repressing histone modification peaks with TR50 are confirmed by this analysis (Fig. 7b). Intriguingly, the Pan-S segments are unique in being enriched for both activating (H3K4me2, H3ac and H4ac) and repressing (H3K27me3) histones, consistent with the suggestion that the Pan-S replication pattern arises from interallelic variation in chromatin structure and time of replication 75. This observation is also consistent with the Pan-S replication pattern seen for the H19/IGF2 locus, a known imprinted region with differential epigenetic modifications across the two alleles 76.
[Figure 7 image. a, H3K27me3 enrichment, H3K4me2 enrichment and TR50 plotted against genomic position (152,800 to 153,800 kb) across a 1.6-Mb region (ENm006). b, per cent enrichment in the early, mid, late and Pan-S partitions for H3K27me3, H3K4me1, H3K4me2, H3K4me3, H3ac and H4ac in HeLa and for H3K4me1, H3K4me2, H3K4me3, H3ac and H4ac in GM.]
Figure 7 | Correlation between replication timing and histone modifications. a, Comparison of two histone modifications (H3K4me2 and H3K27me3), plotted as enrichment ratio from the ChIP-chip experiments and the time for 50% of the DNA to replicate (TR50), indicated for ENCODE region ENm006. The colours on the curves reflect the correlation strength in a sliding 250-kb window. b, Differing levels of histone modification for different TR50 partitions. The amounts of enrichment or depletion of different histone modifications in various cell lines are depicted (indicated along the bottom as ‘histone mark.cell line’; GM = GM06990). Asterisks indicate enrichments/depletions that are not significant on the basis of multiple tests. Each set has four partitions on the basis of replication timing: early, mid, late and Pan-S.
The extensive rearrangements in the genome of HeLa cells led us to ask whether the detected correlations between TR50 and chromatin state are seen with other cell lines. The histone modification data with GM06990 cells allowed us to test whether the time of replication of genomic segments in HeLa cells correlated with the chromatin state in GM06990 cells. Early- and late-replicating segments in HeLa cells are enriched and depleted, respectively, for activating marks in GM06990 cells (Fig. 7b). Thus, despite the presence of genomic rearrangements (see Supplementary Information section 2.12), the TR50 and chromatin state in HeLa cells are not far from a constitutive baseline also seen with a cell line from a different lineage. The enrichment of multiple activating histone modifications and the depletion of a repressive modification from segments that replicate early in S phase extends previous work in the field at a level of detail and scale not attempted before in mammalian cells. The duality of histone modification patterns in Pan-S areas of the HeLa genome, and the concordance of chromatin marks and replication time across two disparate cell lines (HeLa and GM06990) confirm the coordination of histone modifications with replication in the human genome.
Chromatin architecture and genomic domains
Overview.
The packaging of genomic DNA into chromatin is intimately connected with the control of gene expression and other chromosomal processes. We next examined chromatin structure over a larger scale to ascertain its relation to transcription and other processes. Large domains (50 to >200 kb) of generalized DNaseI sensitivity have been detected around developmentally regulated gene clusters 77, prompting speculation that the genome is organized into 'open' and 'closed' chromatin territories that represent higher-order functional domains. We explored how different chromatin features, particularly histone modifications, correlate with chromatin structure, both over short and long distances.

Chromatin accessibility and histone modifications. We used histone modification studies and DNaseI sensitivity data sets (introduced above) to examine general chromatin accessibility without focusing on the specific DHSs (see Supplementary Information sections 3.1, 3.3 and 3.4). A fundamental difficulty in analysing continuous data across large genomic regions is determining the appropriate scale for analysis (for example, 2 kb, 5 kb, 20 kb, and so on). To address this problem, we developed an approach based on wavelet analysis, a mathematical tool pioneered in the field of signal processing that has recently been applied to continuous-value genomic analyses. Wavelet analysis provides a means for consistently transforming continuous signals into different scales, enabling the correlation of different phenomena independently at differing scales in a consistent manner.

Global correlations of chromatin accessibility and histone modifications. We computed the regional correlation between DNaseI sensitivity and each histone modification at multiple scales using a wavelet approach (Fig. 8 and Supplementary Information section 4.2).
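The idea of correlating two continuous tracks separately at each scale can be illustrated with a deliberately simplified sketch. It is not the procedure of Supplementary Information section 4.2: it uses repeated pairwise averaging (the Haar approximation) as a stand-in for a full wavelet decomposition, and the function names (`haar_smooth`, `pearson`, `correlation_by_scale`) and the toy tracks are assumptions made purely for illustration.

```python
from math import sqrt


def haar_smooth(signal, levels):
    """Haar approximation: average adjacent pairs `levels` times (bin width doubles each time)."""
    out = list(signal)
    for _ in range(levels):
        out = [(out[i] + out[i + 1]) / 2.0 for i in range(0, len(out) - 1, 2)]
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_by_scale(x, y, max_levels):
    """Map scale (in original bins) to the correlation of the two smoothed tracks."""
    return {
        2 ** level: pearson(haar_smooth(x, level), haar_smooth(y, level))
        for level in range(max_levels + 1)
    }


# Toy tracks (invented): a shared slow step plus opposite fine-scale ripple.
n = 16
trend = [0.0 if i < n // 2 else 1.0 for i in range(n)]
ripple = [3.0 if i % 2 == 0 else -3.0 for i in range(n)]
track_a = [t + r for t, r in zip(trend, ripple)]
track_b = [t - r for t, r in zip(trend, ripple)]

by_scale = correlation_by_scale(track_a, track_b, max_levels=3)
for scale, r in sorted(by_scale.items()):
    print(f"scale {scale:>2} bins: r = {r:+.3f}")
```

In this toy case the two tracks are anticorrelated at the finest scale and positively correlated once the ripple is averaged out, which is the kind of scale dependence that a single fixed window size would hide.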
To make quantitative comparisons between different histone modifications, we computed histograms of correlation values between DNaseI sensitivity and each histone modification at several scales and then tested these for significance at specific scales. Figure 8c shows the distribution of correlation values at a 16-kb scale, which is considerably larger than individual cis-acting regulatory elements. At this scale, H3K4me2, H3K4me3 and H3ac show similarly high correlation. However, they are significantly distinguished from H3K4me1 and H4ac modifications (P < 1.5 × 10^-33; see Supplementary Information section 4.5), which show lower correlation with DNaseI sensitivity. These results suggest that larger-scale relationships between chromatin accessibility and histone modifications are dominated by sub-regions in which higher average DNaseI sensitivity is accompanied by high levels of H3K4me2, H3K4me3 and H3ac modifications.

Figure 8 | Wavelet correlations of histone marks and DNaseI sensitivity. As an example, correlations between DNaseI sensitivity and H3K4me2 (both in the GM06990 cell line) over a 1.1-Mb region on chromosome 7 (ENCODE region ENm013) are shown. a, The relationship between histone modification H3K4me2 (upper plot) and DNaseI sensitivity (lower plot) is shown for ENCODE region ENm013. The curves are coloured with the strength of the local correlation at the 4-kb scale (top dashed line in panel b). b, The same data as in a are represented as a wavelet correlation. The y axis shows the differing scales decomposed by the wavelet analysis from large to small scale (in kb); the colour at each point in the heatmap represents the level of correlation at the given scale, measured in a 20-kb window centred at the given position. c, Distribution of correlation values at the 16-kb scale between DNaseI sensitivity and the indicated histone marks. The y axis is the density of these correlation values across ENCODE; all modifications show a peak at a positive-correlation value.
[Panels a and b: signal/control and scale (kb) versus genomic position (kb) across the 1.11-Mb region ENm013. Panel c: density versus correlation value for H3K4me2, H3K4me3, H3ac, H3K4me1 and H4ac.]

Local correlations of chromatin accessibility and histone modifications. Narrowing to a scale of ~2 kb revealed a more complex situation, in which H3K4me2 is the histone modification that is best correlated with DNaseI sensitivity. However, there is no clear combination of marks that correlate with DNaseI sensitivity in a way that is analogous to that seen at a larger scale (see Supplementary Information section 4.3). One explanation for the increased complexity at smaller scales is that there is a mixture of different classes of accessible chromatin regions, each having a different pattern of histone modifications. To examine this, we computed the degree to which local peaks in histone methylation or acetylation occur at DHSs (see Supplementary Information section 4.5.1). We found that 84%, 91% and 93% of significant peaks in H3K4 mono-, di- and tri-methylation, respectively, and 93% and 81% of significant peaks in H3ac and H4ac acetylation, respectively, coincided with DHSs (see Supplementary Information section 4.5). Conversely, a proportion of DHSs seemed not to be associated with significant peaks in H3K4 mono-, di- or tri-methylation (37%, 29% and 47%, respectively), nor with peaks in H3 or H4 acetylation (both 57%). Because only a limited number of histone modification marks were assayed, the possibility remains that some DHSs harbour other histone modifications.
The absence of a more complete concordance between DHSs and peaks in histone acetylation is surprising given the widely accepted notion that histone acetylation has a central role in mediating chromatin accessibility by disrupting higher-order chromatin folding.

DNA structure at DHSs. The observation that distinctive hydroxyl radical cleavage patterns are associated with specific DNA structures 78 prompted us to investigate whether DHS subclasses differed with respect to their local DNA structure. Conversely, because different DNA sequences can give rise to similar hydroxyl radical cleavage patterns 79, genomic regions that adopt a particular local structure do not necessarily have the same nucleotide sequence. Using a Gibbs sampling algorithm on hydroxyl radical cleavage patterns of 3,150 DHSs 80, we discovered an 8-base segment with a conserved cleavage signature (CORCS; see Supplementary Information section 4.6). The underlying DNA sequences that give rise to this pattern have little primary sequence similarity despite this similar structural pattern. Furthermore, this structural element is strongly enriched in promoter-proximal DHSs (11.3-fold enrichment compared to the rest of the ENCODE regions) relative to promoter-distal DHSs (1.5-fold enrichment); this element is enriched 10.9-fold in CpG islands, and the enrichment is higher still (26.4-fold) in CpG islands that overlap a DHS.

Large-scale domains in the ENCODE regions. The presence of extensive correlations seen between histone modifications, DNaseI sensitivity, replication, transcript density and protein factor binding led us to investigate whether all these features are organized systematically across the genome. To test this, we performed an unsupervised training of a two-state HMM with inputs from these different features (see Supplementary Information section 4.7 and ref. 81). No other information except for the experimental variables was used for the HMM training routines.
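To make the notion of unsupervised two-state segmentation concrete, the following minimal sketch fits a two-state HMM to a single toy track. It rests on several assumptions that are not taken from the text or from Supplementary Information section 4.7: one input track instead of several, Gaussian emissions with a fixed standard deviation, a fixed probability of staying in the same state, and Viterbi training (hard EM) in place of a full Baum–Welch fit. The names `viterbi` and `segment_two_state` and the toy values are hypothetical.

```python
from math import log


def viterbi(obs, means, sd=1.0, stay=0.9):
    """Most likely state path for a two-state HMM with Gaussian emissions."""
    log_stay, log_switch = log(stay), log(1.0 - stay)

    def log_emit(x, state):
        # Gaussian log-density up to a constant shared by both states.
        return -((x - means[state]) ** 2) / (2.0 * sd * sd)

    score = [log(0.5) + log_emit(obs[0], s) for s in (0, 1)]
    back = []  # back[t][s] = best previous state for state s at step t + 1
    for x in obs[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [
                score[p] + (log_stay if p == s else log_switch) for p in (0, 1)
            ]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + log_emit(x, s))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segment_two_state(obs, iterations=10):
    """Viterbi training: alternate decoding and re-estimating the two state means."""
    means = [min(obs), max(obs)]  # state 0 starts low, state 1 starts high
    path = viterbi(obs, means)
    for _ in range(iterations):
        for s in (0, 1):
            members = [x for x, st in zip(obs, path) if st == s]
            if members:
                means[s] = sum(members) / len(members)
        new_path = viterbi(obs, means)
        if new_path == path:
            break
        path = new_path
    return path, means


# Toy track (invented): a low-signal domain, a high-signal domain, then low again.
signal = [0.1, -0.2, 0.0, 0.3, 2.1, 1.9, 2.2, 2.0, 0.1, -0.1]
states, state_means = segment_two_state(signal)
print(states)
print(state_means)
```

No labels are supplied to the fit; the two states acquire their meaning only from the data, which mirrors the statement above that nothing other than the experimental variables entered the training.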
We consistently found that one state ('active') generally corresponded to domains with high levels of H3ac and RNA transcription, low levels of H3K27me3 marks, and early replication timing, whereas the other state ('repressed') reflected domains with low H3ac and RNA, high H3K27me3, and late replication (see Fig. 9). In total, we identified 70 active regions spanning 11.4 Mb and 82 inactive regions spanning 17.8 Mb (median size 136 kb versus 104 kb, respectively). The active domains are markedly enriched for GENCODE TSSs, CpG islands and Alu repetitive elements (P < 0.0001 for each), whereas repressed regions are significantly enriched for LINE1 and LTR transposons (P < 0.001). Taken together, these results demonstrate remarkable concordance between ENCODE functional data types and provide a view of higher-order functional domains defined by a broader range of factors at a markedly higher resolution than was previously available 82.

Evolutionary constraint and population variability

Overview. Functional genomic sequences can also be identified by examining evolutionary changes across multiple extant species and within the human population. Indeed, such studies complement experimental assays that identify specific functional elements 83–85. Evolutionary constraint (that is, the rejection of mutations at a particular location) can be measured by either (i) comparing observed substitutions to neutral rates calculated from multi-sequence alignments 86–88, or (ii) determining the presence and frequency of intra-species polymorphisms. Importantly, both approaches are indifferent to any specific function that the constrained sequence might confer.

Previous studies comparing the human, mouse, rat and dog genomes examined bulk evolutionary properties of all nucleotides in the genome, and provided little insight about the precise positions of constrained bases.
Interestingly, these studies indicated that the majority of constrained bases reside within the non-coding portion of the human genome. Meanwhile, increasingly rich data sets of polymorphisms across the human genome have been used extensively to establish connections between genetic variants and disease, but far fewer analyses have sought to use such data for assessing functional constraint 85.

Figure 9 | Higher-order functional domains in the genome. The general concordance of multiple data types is shown for an illustrative ENCODE region (ENm005). a, Domains were determined by simultaneous HMM segmentation of replication time (TR50; black), bulk RNA transcription (blue), H3K27me3 (purple), H3ac (orange), DHS density (green), and RFBR density (light blue) measured continuously across the 1.6-Mb ENm005. All data were generated using HeLa cells. The histone, RNA, DHS and RFBR signals are wavelet-smoothed to an approximately 60-kb scale (see Supplementary Information section 4.7). The HMM segmentation is shown as the blocks labelled 'active' and 'repressed' and the structure of GENCODE genes (not used in the training) is shown at the end. b, Enrichment or depletion of annotated sequence features (GENCODE TSSs, CpG islands, LINE1 repeats, Alu repeats, and non-exonic constrained sequences (CSs)) in active versus repressed domains. Note the marked enrichment of TSSs, CpG islands and Alus in active domains, and the enrichment of LINE and LTRs in repressed domains.
[Panel a: tracks for TR50, RNA, H3K27me3, H3ac, DHS, RFBR, HMM state and GENCODE genes across chr21: 33,000,000–34,000,000. Panel b: percentage enrichment or depletion in active versus repressed domains.]
The ENCODE Project provides an excellent opportunity for more fully exploiting inter- and intra-species sequence comparisons to examine genome function in the context of extensive experimental studies on the same regions of the genome. We consolidated the experimentally derived information about the ENCODE regions and focused our analyses on 11 major classes of genomic elements. These classes are listed in Table 4 and include two non-experimentally derived data sets: ancient repeats (ARs; mobile elements that inserted early in the mammalian lineage, have subsequently become dormant, and are assumed to be neutrally evolving) and constrained sequences (CSs; regions that evolve detectably more slowly than neutral sequences).

Comparative sequence data sets and analysis. We generated 206 Mb of genomic sequence orthologous to the ENCODE regions from 14 mammalian species using a targeted strategy that involved isolating 89 and sequencing 90 individual bacterial artificial chromosome clones. For an additional 14 vertebrate species, we used 340 Mb of orthologous genomic sequence derived from genome-wide sequencing efforts 3–8,91–93. The orthologous sequences were aligned using three alignment programs: TBA 94, MAVID 95 and MLAGAN 96. Four independent methods that generated highly concordant results 97 were then used to identify sequences under constraint (PhastCons 88, GERP 87, SCONE 98 and BinCons 86). From these analyses, we developed a high-confidence set of 'constrained sequences' that correspond to 4.9% of the nucleotides in the ENCODE regions. The threshold for determining constraint was set using a false discovery rate (FDR) of 5% (see ref. 97); this level is similar to previous estimates of the fraction of the human genome under mammalian constraint 4,86–88 but the FDR was not chosen to fit this result. The median length of these constrained sequences is 19 bases, with the minimum being 8 bases, roughly the size of a typical transcription factor binding site.
These analyses, therefore, provide a resolution of constrained sequences that is substantially better than that currently available using only whole-genome vertebrate sequences 99–102.

Intra-species variation studies mainly used SNP data from Phases I and II, and from the 10 ENCODE regions re-sequenced in 48 individuals, of the HapMap Project 103; nucleotide insertion or deletion (indel) data were from the SNP Consortium and HapMap. We also examined the ENCODE regions for the presence of overlaps with known segmental duplications 104 and CNVs.

Experimentally identified functional elements and constrained sequences. We first compared the detected constrained sequences with the positions of experimentally identified functional elements. A total of 40% of the constrained bases reside within protein-coding exons and their associated untranslated regions (Fig. 10) and, in agreement with previous genome-wide estimates, the remaining constrained bases do not overlap the mature transcripts of protein-coding genes 4,5,88,105,106. When we included the other experimental annotations, we found that an additional 20% of the constrained bases overlap experimentally identified non-coding functional regions, although far fewer of these regions overlap constrained sequences compared to coding exons (see below). Most experimental annotations are significantly different from a random expectation for both base-pair and element-level overlaps (using the GSC statistic, see Supplementary Information section 1.3), with a more striking deviation when considering elements (Fig. 11). The exceptions to this are pseudogenes, Un.TxFrags and RxFrags. The increase in significance moving from base-pair measures to the element level suggests that discrete islands of constrained sequence exist within experimentally identified functional elements, with the surrounding bases apparently not showing evolutionary constraint. This notion is discussed in greater detail in ref. 97.
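The distinction between base-pair and element-level overlap can be shown with a small sketch. It computes only the two observed fractions; it does not implement the GSC statistic or any randomization, and the function name `overlap_fractions`, the half-open coordinate convention and the toy intervals are assumptions introduced for illustration.

```python
def overlap_fractions(elements, constrained):
    """Return (fraction of element bases constrained, fraction of elements with any overlap).

    Both inputs are lists of half-open (start, end) intervals; intervals within
    each list are assumed not to overlap one another.
    """
    total_bases = 0
    covered_bases = 0
    elements_hit = 0
    for e_start, e_end in elements:
        total_bases += e_end - e_start
        shared = 0
        for c_start, c_end in constrained:
            # Length of the intersection, or 0 if the intervals are disjoint.
            shared += max(0, min(e_end, c_end) - max(e_start, c_start))
        covered_bases += shared
        if shared > 0:
            elements_hit += 1
    return covered_bases / total_bases, elements_hit / len(elements)


# Toy annotation (invented): four 100-base elements and three short constrained islands.
elements = [(0, 100), (200, 300), (400, 500), (600, 700)]
constrained = [(40, 60), (250, 280), (690, 720)]

base_fraction, element_fraction = overlap_fractions(elements, constrained)
print(f"bases overlapping constraint:    {base_fraction:.2f}")
print(f"elements overlapping constraint: {element_fraction:.2f}")
```

Here three of the four elements touch a constrained island although only a small share of their bases is constrained, which is the pattern expected if short constrained islands sit inside larger experimentally defined elements.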
We also examined measures of human variation (heterozygosity, derived allele-frequency spectra and indel rates) within the sequences of the experimentally identified functional elements (Fig. 12). For these studies, ARs were used as a marker for neutrally evolving sequence. Most experimentally identified functional elements are associated with lower heterozygosity compared to ARs, and a few have lower indel rates compared with ARs. Striking outliers are 3′ UTRs, which have dramatically increased indel rates without an obvious cause. This is discussed in more depth in ref. 107.

These findings indicate that the majority of the evolutionarily constrained, experimentally identified functional elements show evidence of negative selection both across mammalian species and within the human population. Furthermore, we have assigned at least one molecular function to the majority (60%) of all constrained bases in the ENCODE regions.

Conservation of regulatory elements. The relationship between individual classes of regulatory elements and constrained sequences varies considerably, ranging from cases where there is strong evolutionary constraint (for example, pan-vertebrate ultraconserved regions 108,109) to examples of regulatory elements that are not conserved between orthologous human and mouse genes 110. Within the ENCODE regions, 55% of RFBRs overlap the high-confidence constrained sequences. As expected, RFBRs have many unconstrained bases, presumably owing to the small size of the specific binding site. We investigated whether the binding sites in RFBRs could be further delimited using information about evolutionary constraint. For 7 out of 17 factors with either known TRANSFAC or Jaspar motifs, our ChIP-chip data revealed a marked enrichment of the appropriate motif within the constrained versus the unconstrained portions of the RFBRs (see Supplementary Information section 5.1). This enrichment was seen for levels of stringency used for defining ChIP-chip-positive sites (1% and 5% FDR level), indicating that combining sequence constraint and ChIP-chip data may provide a highly sensitive means for detecting factor binding sites in the human genome.

Figure 10 | Relative proportion of different annotations among constrained sequences. The 4.9% of bases in the ENCODE regions identified as constrained is subdivided into the portions that reflect known coding regions, UTRs, other experimentally annotated regions, and unannotated sequence.
[Chart values: all 44 ENCODE regions (29,998 kb), 4.9% constrained; of the constrained bases, coding 32%, UTRs 8%, other ENCODE experimental annotations 20%, unannotated 40%.]

Table 4 | Eleven classes of genomic elements subjected to evolutionary and population-genetics analyses

Abbreviation    Description
CDS             Coding exons, as annotated by GENCODE
5′ UTR          5′ untranslated region, as annotated by GENCODE
3′ UTR          3′ untranslated region, as annotated by GENCODE
Un.TxFrag       Unannotated region detected by RNA hybridization to tiling array (that is, unannotated TxFrag)
RxFrag          Region detected by RACE and analysis on tiling array
Pseudogene      Pseudogene identified by consensus pseudogene analysis
RFBR            Regulatory factor binding region identified by ChIP-chip assay
RFBR-SeqSp      Regulatory factor binding region identified only by ChIP-chip assays for factors with known sequence-specificity
DHS             DNaseI hypersensitive sites found in multiple tissues
FAIRE           Region of open chromatin identified by the FAIRE assay
TSS             Transcription start site
AR              Ancient repeat inserted early in the mammalian lineage and presumed to be neutrally evolving
CS              Constrained sequence identified by analysing multi-sequence alignments

Experimentally identified functional elements and genetic variation.
The above studies focus on purifying (negative) selection. We used nucleotide variation to detect potential signals of adaptive (positive) selection. We modified the standard McDonald–Kreitman test (MK-test 111,112) and the Hudson–Kreitman–Aguadé (HKA) 113 test (see Supplementary Information section 5.2.1), to examine whether an entire set of sequence elements shows an excess of polymorphisms or an excess of inter-species divergence. We found that constrained sequences and coding exons have an excess of polymorphisms (consistent with purifying selection), whereas 5′ UTRs show evidence of an excess of divergence (with a portion probably reflecting positive selection). In general, non-coding genomic regions show more variation, with both a large number of segments that undergo purifying selection and regions that are fast evolving.

We also examined structural variation (that is, CNVs, inversions and translocations 114; see Supplementary Information section 5.2.2). Within these polymorphic regions, we encountered significant over-representation of CDSs, TxFrags, and intra-species constrained sequences (P < 10^-3, Fig. 13), and also detected a statistically significant under-representation of ARs (P = 10^-3). A similar over-representation of CDSs and intra-species constrained sequences was found within non-polymorphic segmental duplications.

Unexplained constrained sequences. Despite the wealth of complementary data, 40% of the ENCODE-region sequences identified as constrained are not associated with any experimental evidence of function. There is no evidence indicating that mutational cold spots account for this constraint; they have similar measures of constraint to experimentally identified elements and harbour equal proportions of SNPs. To characterize further the unexplained constrained sequences, we examined their clustering and phylogenetic distribution.
These sequences are not uniformly distributed across most ENCODE regions, and even in most ENCODE regions the distribution is different from constrained sequences within experimentally identified functional elements (see Supplementary Information section 5.3). The large fraction of constrained sequence that does not match any experimentally identified elements is not surprising considering that only a limited set of transcription factors, cell lines and biological conditions have thus far been examined.

Unconstrained experimentally identified functional elements. In contrast, an unexpectedly large fraction of experimentally identified functional elements show no evidence of evolutionary constraint, ranging from 93% for Un.TxFrags to 12% for CDS. For most types of non-coding functional elements, roughly 50% of the individual elements seemed to be unconstrained across all mammals.

Figure 11 | Overlap of constrained sequences and various experimental annotations. a, A schematic depiction shows the different tests used for assessing overlap between experimental annotations and constrained sequences, both for individual bases and for entire regions. b, Observed fraction of overlap, depicted separately for bases and regions. The results are shown for selected experimental annotations. The internal bars indicate 95% confidence intervals of randomized placement of experimental elements using the GSC methodology to account for heterogeneity in the data sets. When the bar overlaps the observed value one cannot reject the hypothesis that these overlaps are consistent with random placements.
[Panel b axis: fraction of experimental annotation overlapping constrained sequence, for bases and regions; annotation groups: RNA transcription, open chromatin, DNA/protein.]

Figure 12 | Relationship between heterozygosity and polymorphic indel rate for a variety of experimental annotations. 3′ UTRs are an expected outlier for the indel measures owing to the presence of low-complexity sequence (leading to a higher indel rate).
[Axes: rate of polymorphic indels versus heterozygosity (× 10^-4).]

There are two methodological reasons that might explain the apparent excess of unconstrained experimentally identified functional elements: the underestimation of sequence constraint or overestimation of experimentally identified functional elements. We do not believe that either of these explanations fully accounts for the large and varied levels of unconstrained, experimentally identified functional sequences. The set of constrained bases analysed here is highly accurate and complete due to the depth of the multiple alignment. Both by bulk fitting procedures and by comparison of SNP frequencies to constraint there is clearly a proportion of constrained bases not captured in the defined 4.9% of constrained sequences, but it is small (see Supplementary Information sections 5.4 and 5.5). More aggressive schemes to detect constraint only marginally increase the overlap with experimentally identified functional elements, and do so with considerably less specificity.
Similarly, all experimental findings have been independently validated and, for the least constrained experimentally identified functional elements (Un.TxFrags and binding sites of sequence-specific factors), there is both internal validation and cross-validation from different experimental techniques. This suggests that there is probably not a significant overestimation of experimentally identified functional elements. Thus, these two explanations may contribute to the general observation about unconstrained functional elements, but cannot fully explain it.

Instead, we hypothesize five biological reasons to account for the presence of large amounts of unconstrained functional elements. The first two are particular to certain biological assays in which the elements being measured are connected to but do not coincide with the analysed region. An example of this is the parent transcript of an miRNA, where the current assays detect the exons (some of which are not under evolutionary selection), whereas the intronic miRNA actually harbours the constrained bases. Nevertheless, the transcript sequence provides the critical coupling between the regulated promoter and the miRNA. The sliding of transcription factors (which might bind a specific sequence but then migrate along the DNA) or the processivity of histone modifications across chromatin are more exotic examples of this. A related, second hypothesis is that delocalized behaviours of the genome, such as general chromatin accessibility, may be maintained by some biochemical processes (such as transcription of intergenic regions or specific factor binding) without the requirement for specific sequence elements. These two explanations of both connected components and diffuse components related to, but not coincident with, constrained sequences are particularly relevant for the considerable amount of unannotated and unconstrained transcripts.
The other three hypotheses may be more general: the presence of neutral (or near neutral) biochemical elements, of lineage-specific functional elements, and of functionally conserved but non-orthologous elements. We believe there is a considerable proportion of neutral biochemically active elements that do not confer a selective advantage or disadvantage to the organism. This neutral pool of sequence elements may turn over during evolutionary time, emerging via certain mutations and disappearing by others. The size of the neutral pool would largely be determined by the rate of emergence and extinction through chance events; low information-content elements, such as transcription factor-binding sites 110, will have larger neutral pools. Second, from this neutral pool, some elements might occasionally acquire a biological role and so come under evolutionary selection. The acquisition of a new biological role would then create a lineage-specific element. Finally, a neutral element from the general pool could also become a peer of an existing selected functional element and either of the two elements could then be removed by chance. If the older element is removed, the newer element has, in essence, been conserved without using orthologous bases, providing a conserved function in the absence of constrained sequences. For example, a common HNF4A binding site in the human and mouse genomes may not reflect orthologous human and mouse bases, though the presence of an HNF4A site in that region was evolutionarily selected for in both lineages. Note that both the neutral turnover of elements and the 'functional peering' of elements have been suggested for cis-acting regulatory elements in Drosophila 115,116 and mammals 110. Our data support these hypotheses, and we have generalized this idea over many different functional elements.
The presence of conserved function encoded by conserved orthologous bases is a commonplace assumption in comparative genomics; our findings indicate that there could be a sizable set of functionally conserved but non-orthologous elements in the human genome, and that these seem unconstrained across mammals. Functional data akin to the ENCODE Project on other related species, such as mouse, would be critical to understanding the rate of such functionally conserved but non-orthologous elements.

Conclusion

The generation and analyses of over 200 experimental data sets from studies examining the 44 ENCODE regions provide a rich source of functional information for 30 Mb of the human genome. The first conclusion of these efforts is that these data are remarkably informative. Although there will be ongoing work to enhance existing assays, invent new techniques and develop new data-analysis methods, the generation of genome-wide experimental data sets akin to the ENCODE pilot phase would provide an impressive platform for future genome exploration efforts. This now seems feasible in light of throughput improvements of many of the assays and the ever-declining costs of whole-genome tiling arrays and DNA sequencing. Such genome-wide functional data should be acquired and released openly, as has been done with other large-scale genome projects, to ensure its availability as a new foundation for all biologists studying the human genome. It is these biologists who will often provide the critical link from biochemical function to biological role for the identified elements.

The scale of the pilot phase of the ENCODE Project was also sufficiently large and unbiased to reveal important principles about the organization of functional elements in the human genome. In many cases, these principles agree with current mechanistic models.
For example, we observe trimethylation of H3K4 enriched near active genes, and have improved the ability to accurately predict gene activity based on this and other histone modifications. However, we also uncovered some surprises that challenge the current dogma on biological mechanisms. The generation of numerous intercalated transcripts spanning the majority of the genome has been repeatedly suggested 13,14, but this phenomenon has been met with mixed opinions about the biological importance of these transcripts. Our analyses of numerous orthogonal data sets firmly establish the presence of these transcripts, and thus the simple view of the genome as having a defined set of isolated loci transcribed independently does not seem to be accurate. Perhaps the genome encodes a network of transcripts, many of which are linked to protein-coding transcripts and to the majority of which we cannot (yet) assign a biological role. Our perspective of transcription and genes may have to evolve, and this finding also poses some interesting mechanistic questions. For example, how are splicing signals coordinated and used when there are so many overlapping primary transcripts? Similarly, to what extent does this reflect neutral turnover of reproducible transcripts with no biological role?

Figure 13 | CNV enrichment. The relative enrichment of different experimental annotations in the ENCODE regions associated with CNVs. CS_non-CDS are constrained sequences outside of coding regions. A value of 1 or less indicates no enrichment, and values greater than 1 show enrichment. Starred columns are cases that are significant on the basis of this enrichment being found in less than 5% of randomizations that matched each element class for length and density of features.
[Axis: relative enrichment (0 to 1.6) for each annotation class.]
We gained subtler but equally important mechanistic findings relating to transcription, replication and chromatin modification. Transcription factors previously thought to primarily bind promoters bind more generally, and those which do bind to promoters are equally likely to bind downstream of a TSS as upstream. Interestingly, many elements that previously were classified as distal enhancers are, in fact, close to one of the newly identified TSSs; only about 35% of sites showing evidence of binding by multiple transcription factors are actually distal to a TSS. This need not imply that most regulatory information is confined to classic promoters, but rather it does suggest that transcription and regulation are coordinated actions beyond just the traditional promoter sequences. Meanwhile, although distal regulatory elements could be identified in the ENCODE regions, they are currently difficult to classify, in part owing to the lack of a broad set of transcription factors to use in analysing such elements. Finally, we now have a much better appreciation of how DNA replication is coordinated with histone modifications.

At the outset of the ENCODE Project, many believed that the broad collection of experimental data would nicely dovetail with the detailed evolutionary information derived from comparing multiple mammalian sequences to provide a neat 'dictionary' of conserved genomic elements, each with a growing annotation about their biochemical function(s). In one sense, this was achieved; the majority of constrained bases in the ENCODE regions are now associated with at least some experimentally derived information about function. However, we have also encountered a remarkable excess of experimentally identified functional elements lacking evolutionary constraint, and these cannot be dismissed for technical reasons. This is perhaps the biggest surprise of the pilot phase of the ENCODE Project, and suggests that we take a more 'neutral' view of many of the functions conferred by the genome.
METHODS

The methods are described in the Supplementary Information, with more technical details for each experiment often found in the references provided in Table 1. The Supplementary Information sections are arranged in the same order as the manuscript (with similar headings to facilitate cross-referencing). The first page of Supplementary Information also has an index to aid navigation. Raw data are available in ArrayExpress, GEO or EMBL/GenBank archives as appropriate, as detailed in Supplementary Information section 1.1. Processed data are also presented in a user-friendly manner at the UCSC Genome Browser’s ENCODE portal (http://genome.ucsc.edu/ENCODE/).

Received 2 March; accepted 23 April 2007.

1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
3. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
4. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
5. Rat Genome Sequencing Project Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493–521 (2004).
6. Lindblad-Toh, K. et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438, 803–819 (2005).
7. International Chicken Genome Sequencing Consortium. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716 (2004).
8. Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005).
9. ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306, 636–640 (2004).
10. Zhang, Z. D. et al.
Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions. Genome Res. 17, 787–797 (2007).
11. Euskirchen, G. M. et al. Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array and sequencing based technologies. Genome Res. 17, 898–909 (2007).
12. Willingham, A. T. & Gingeras, T. R. TUF love for “junk” DNA. Cell 125, 1215–1220 (2006).
13. Carninci, P. et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genet. 38, 626–635 (2006).
14. Cheng, J. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308, 1149–1154 (2005).
15. Bertone, P. et al. Global identification of human transcribed sequences with genome tiling arrays. Science 306, 2242–2246 (2004).
16. Guigó, R. et al. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 7, (Suppl. 1; S2) 1–31 (2006).
17. Denoeud, F. et al. Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 17, 746–759 (2007).
18. Tress, M. L. et al. The implications of alternative splicing in the ENCODE protein complement. Proc. Natl Acad. Sci. USA 104, 5495–5500 (2007).
19. Rozowsky, J. et al. The DART classification of unannotated transcription within ENCODE regions: Associating transcription with known and novel loci. Genome Res. 17, 732–745 (2007).
20. Kapranov, P. et al. Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res. 15, 987–997 (2005).
21. Balakirev, E. S. & Ayala, F. J. Pseudogenes: are they “junk” or functional DNA? Annu. Rev. Genet. 37, 123–151 (2003).
22. Mighell, A. J., Smith, N. R., Robinson, P. A. & Markham, A. F. Vertebrate pseudogenes. FEBS Lett. 468, 109–114 (2000).
23. Zheng, D. et al. Pseudogenes in the ENCODE regions: Consensus annotation, analysis of transcription and evolution. Genome Res.
17, 839–851 (2007).
24. Zheng, D. et al. Integrated pseudogene annotation for human chromosome 22: evidence for transcription. J. Mol. Biol. 349, 27–45 (2005).
25. Harrison, P. M., Zheng, D., Zhang, Z., Carriero, N. & Gerstein, M. Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability. Nucleic Acids Res. 33, 2374–2383 (2005).
26. Washietl, S. et al. Structured RNAs in the ENCODE selected regions of the human genome. Genome Res. 17, 852–864 (2007).
27. Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563 (2005).
28. Runte, M. et al. The IC-SNURF–SNRPN transcript serves as a host for multiple small nucleolar RNA species and as an antisense RNA for UBE3A. Hum. Mol. Genet. 10, 2687–2700 (2001).
29. Seidl, C. I., Stricker, S. H. & Barlow, D. P. The imprinted Air ncRNA is an atypical RNAPII transcript that evades splicing and escapes nuclear export. EMBO J. 25, 3565–3575 (2006).
30. Parra, G. et al. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res. 16, 37–44 (2006).
31. Maston, G. A., Evans, S. K. & Green, M. R. Transcriptional regulatory elements in the human genome. Annu. Rev. Genomics Hum. Genet. 7, 29–59 (2006).
32. Trinklein, N. D., Aldred, S. J., Saldanha, A. J. & Myers, R. M. Identification and functional analysis of human transcriptional promoters. Genome Res. 13, 308–312 (2003).
33. Cooper, S. J., Trinklein, N. D., Anton, E. D., Nguyen, L. & Myers, R. M. Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 16, 1–10 (2006).
34. Cawley, S. et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116, 499–509 (2004).
35. Yelin, R. et al. Widespread occurrence of antisense transcription in the human genome. Nature Biotechnol. 21, 379–386 (2003).
36.
Katayama, S. et al. Antisense transcription in the mammalian transcriptome. Science 309, 1564–1566 (2005).
37. Ren, B. et al. Genome-wide location and function of DNA binding proteins. Science 290, 2306–2309 (2000).
38. Iyer, V. R. et al. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409, 533–538 (2001).
39. Horak, C. E. et al. GATA-1 binding sites mapped in the β-globin locus by using mammalian CHIp-chip analysis. Proc. Natl Acad. Sci. USA 99, 2924–2929 (2002).
40. Wei, C. L. et al. A global map of p53 transcription-factor binding sites in the human genome. Cell 124, 207–219 (2006).
41. Kim, J., Bhinge, A. A., Morgan, X. C. & Iyer, V. R. Mapping DNA–protein interactions in large genomes by sequence tag analysis of genomic enrichment. Nature Methods 2, 47–53 (2005).
42. Dorschner, M. O. et al. High-throughput localization of functional elements by quantitative chromatin profiling. Nature Methods 1, 219–225 (2004).
43. Sabo, P. J. et al. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nature Methods 3, 511–518 (2006).
44. Crawford, G. E. et al. DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nature Methods 3, 503–509 (2006).
45. Hogan, G. J., Lee, C. K. & Lieb, J. D. Cell cycle-specified fluctuation of nucleosome occupancy at gene promoters. PLoS Genet. 2, e158 (2006).
46. Koch, C. M. et al. The landscape of histone modifications across 1% of the human genome in five human cell lines. Genome Res. 17, 691–707 (2007).
47. Smale, S. T. & Kadonaga, J. T. The RNA polymerase II core promoter. Annu. Rev. Biochem. 72, 449–479 (2003).
48. Mito, Y., Henikoff, J. G. & Henikoff, S. Genome-scale profiling of histone H3.3 replacement patterns. Nature Genet. 37, 1090–1097 (2005).
49. Heintzman, N. D. et al.
Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genet. 39, 311–318 (2007).
50. Yusufzai, T. M., Tagami, H., Nakatani, Y. & Felsenfeld, G. CTCF tethers an insulator to subnuclear sites, suggesting shared insulator mechanisms across species. Mol. Cell 13, 291–298 (2004).
51. Kim, T. H. et al. Direct isolation and identification of promoters in the human genome. Genome Res. 15, 830–839 (2005).
52. Bieda, M., Xu, X., Singer, M. A., Green, R. & Farnham, P. J. Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res. 16, 595–605 (2006).
53. Ruppert, S., Wang, E. H. & Tjian, R. Cloning and expression of human TAFII250: a TBP-associated factor implicated in cell-cycle regulation. Nature 362, 175–179 (1993).
54. Fernandez, P. C. et al. Genomic targets of the human c-Myc protein. Genes Dev. 17, 1115–1129 (2003).
55. Li, Z. et al. A global transcriptional regulatory role for c-Myc in Burkitt’s lymphoma cells. Proc. Natl Acad. Sci. USA 100, 8164–8169 (2003).
56. Orian, A. et al. Genomic binding by the Drosophila Myc, Max, Mad/Mnt transcription factor network. Genes Dev. 17, 1101–1114 (2003).
57. de Laat, W. & Grosveld, F. Spatial organization of gene expression: the active chromatin hub. Chromosome Res. 11, 447–459 (2003).
58. Trinklein, N. D. et al. Integrated analysis of experimental datasets reveals many novel promoters in 1% of the human genome. Genome Res. 17, 720–731 (2007).
59. Jeon, Y. et al. Temporal profile of replication of human chromosomes. Proc. Natl Acad. Sci. USA 102, 6419–6424 (2005).
60. Woodfine, K. et al. Replication timing of the human genome. Hum. Mol. Genet. 13, 191–202 (2004).
61. White, E. J. et al. DNA replication-timing analysis of human chromosome 22 at high resolution and different developmental states. Proc. Natl Acad. Sci. USA 101, 17771–17776 (2004).
62. Schubeler, D. et al.
Genome-wide DNA replication profile for Drosophila melanogaster: a link between transcription and replication timing. Nature Genet. 32, 438–442 (2002).
63. MacAlpine, D. M., Rodriguez, H. K. & Bell, S. P. Coordination of replication and transcription along a Drosophila chromosome. Genes Dev. 18, 3094–3105 (2004).
64. Gilbert, D. M. Replication timing and transcriptional control: beyond cause and effect. Curr. Opin. Cell Biol. 14, 377–383 (2002).
65. Schwaiger, M. & Schubeler, D. A question of timing: emerging links between transcription and replication. Curr. Opin. Genet. Dev. 16, 177–183 (2006).
66. Hatton, K. S. et al. Replication program of active and inactive multigene families in mammalian cells. Mol. Cell. Biol. 8, 2149–2158 (1988).
67. Gartler, S. M., Goldstein, L., Tyler-Freer, S. E. & Hansen, R. S. The timing of XIST replication: dominance of the domain. Hum. Mol. Genet. 8, 1085–1089 (1999).
68. Azuara, V. et al. Heritable gene silencing in lymphocytes delays chromatid resolution without affecting the timing of DNA replication. Nature Cell Biol. 5, 668–674 (2003).
69. Cohen, S. M., Furey, T. S., Doggett, N. A. & Kaufman, D. G. Genome-wide sequence and functional analysis of early replicating DNA in normal human fibroblasts. BMC Genomics 7, 301 (2006).
70. Cao, R. et al. Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science 298, 1039–1043 (2002).
71. Muller, J. et al. Histone methyltransferase activity of a Drosophila Polycomb group repressor complex. Cell 111, 197–208 (2002).
72. Bracken, A. P., Dietrich, N., Pasini, D., Hansen, K. H. & Helin, K. Genome-wide mapping of Polycomb target genes unravels their roles in cell fate transitions. Genes Dev. 20, 1123–1136 (2006).
73. Kirmizis, A. et al. Silencing of human polycomb target genes is associated with methylation of histone H3 Lys 27. Genes Dev. 18, 1592–1605 (2004).
74. Lee, T. I. et al. Control of developmental regulators by Polycomb in human embryonic stem cells.
Cell 125, 301–313 (2006).
75. Karnani, N., Taylor, C., Malhotra, A. & Dutta, A. Pan-S replication patterns and chromosomal domains defined by genome tiling arrays of human chromosomes. Genome Res. 17, 865–876 (2007).
76. Delaval, K., Wagschal, A. & Feil, R. Epigenetic deregulation of imprinting in congenital diseases of aberrant growth. Bioessays 28, 453–459 (2006).
77. Dillon, N. Gene regulation and large-scale chromatin organization in the nucleus. Chromosome Res. 14, 117–126 (2006).
78. Burkhoff, A. M. & Tullius, T. D. Structural details of an adenine tract that does not cause DNA to bend. Nature 331, 455–457 (1988).
79. Price, M. A. & Tullius, T. D. How the structure of an adenine tract depends on sequence context: a new model for the structure of TnAn DNA sequences. Biochemistry 32, 127–136 (1993).
80. Greenbaum, J. A., Parker, S. C. J. & Tullius, T. D. Detection of DNA structural motifs in functional genomic elements. Genome Res. 17, 940–946 (2007).
81. Thurman, R. E., Day, N., Noble, W. S. & Stamatoyannopoulos, J. A. Identification of higher-order functional domains in the human ENCODE regions. Genome Res. 17, 917–927 (2007).
82. Gilbert, N. et al. Chromatin architecture of the human genome: gene-rich domains are enriched in open chromatin fibers. Cell 118, 555–566 (2004).
83. Nobrega, M. A., Ovcharenko, I., Afzal, V. & Rubin, E. M. Scanning human gene deserts for long-range enhancers. Science 302, 413 (2003).
84. Woolfe, A. et al. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 3, e7 (2005).
85. Drake, J. A. et al. Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nature Genet. 38, 223–227 (2006).
86. Margulies, E. H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D. & Green, E. D. Identification and characterization of multi-species conserved sequences. Genome Res. 13, 2507–2518 (2003).
87. Cooper, G. M. et al.
Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901–913 (2005).
88. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
89. Thomas, J. W. et al. Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Res. 12, 1277–1285 (2002).
90. Blakesley, R. W. et al. An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Res. 14, 2235–2244 (2004).
91. Aparicio, S. et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301–1310 (2002).
92. Jaillon, O. et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957 (2004).
93. Margulies, E. H. et al. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc. Natl Acad. Sci. USA 102, 4795–4800 (2005).
94. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708–715 (2004).
95. Bray, N. & Pachter, L. MAVID: constrained ancestral alignment of multiple sequences. Genome Res. 14, 693–699 (2004).
96. Brudno, M. et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13, 721–731 (2003).
97. Margulies, E. H. et al. Relationship between evolutionary constraint and genome function for 1% of the human genome. Genome Res. 17, 760–774 (2007).
98. Asthana, S., Roytberg, M., Stamatoyannopoulos, J. A. & Sunyaev, S. Analysis of sequence conservation at nucleotide resolution. PLoS Comp. Biol. (submitted).
99. Cooper, G. M., Brudno, M., Green, E. D., Batzoglou, S. & Sidow, A. Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res. 13, 813–820 (2003).
100. Eddy, S. R.
A model of the statistical power of comparative genome sequence analysis. PLoS Biol. 3, e10 (2005).
101. Stone, E. A., Cooper, G. M. & Sidow, A. Trade-offs in detecting evolutionarily constrained sequence by comparative genomics. Annu. Rev. Genomics Hum. Genet. 6, 143–164 (2005).
102. McAuliffe, J. D., Jordan, M. I. & Pachter, L. Subtree power analysis and species selection for comparative genomics. Proc. Natl Acad. Sci. USA 102, 7900–7905 (2005).
103. International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
104. Cheng, Z. et al. A genome-wide comparison of recent chimpanzee and human segmental duplications. Nature 437, 88–93 (2005).
105. Cooper, G. M. et al. Characterization of evolutionary rates and constraints in three Mammalian genomes. Genome Res. 14, 539–548 (2004).
106. Dermitzakis, E. T., Reymond, A. & Antonarakis, S. E. Conserved non-genic sequences - an unexpected feature of mammalian genomes. Nature Rev. Genet. 6, 151–157 (2005).
107. Clark, T. G. et al. Small insertions/deletions and functional constraint in the ENCODE regions. Genome Biol. (submitted) (2007).
108. Bejerano, G. et al. Ultraconserved elements in the human genome. Science 304, 1321–1325 (2004).
109. Woolfe, A. et al. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 3, e7 (2005).
110. Dermitzakis, E. T. & Clark, A. G. Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol. Biol. Evol. 19, 1114–1121 (2002).
111. McDonald, J. H. & Kreitman, M. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351, 652–654 (1991).
112. Andolfatto, P. Adaptive evolution of non-coding DNA in Drosophila. Nature 437, 1149–1152 (2005).
113. Hudson, R. R., Kreitman, M. & Aguade, M. A test of neutral molecular evolution based on nucleotide data. Genetics 116, 153–159 (1987).
114. Feuk, L., Carson, A. R. & Scherer, S. W.
Structural variation in the human genome. Nature Rev. Genet. 7, 85–97 (2006).
115. Ludwig, M. Z. et al. Functional evolution of a cis-regulatory module. PLoS Biol. 3, e93 (2005).
116. Ludwig, M. Z. & Kreitman, M. Evolutionary dynamics of the enhancer region of even-skipped in Drosophila. Mol. Biol. Evol. 12, 1002–1011 (1995).
117. Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 7, (Suppl. 1; S4) 1–9 (2006).
118. Emanuelsson, O. et al. Assessing the performance of different high-density tiling microarray strategies for mapping transcribed regions of the human genome. Genome Res. advance online publication, doi: 10.1101/gr.5014606 (21 November 2006).
119. Kapranov, P. et al. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296, 916–919 (2002).
120. Bhinge, A. A., Kim, J., Euskirchen, G., Snyder, M. & Iyer, V. R. Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE). Genome Res. 17, 910–916 (2007).
121. Ng, P. et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nature Methods 2, 105–111 (2005).
122. Giresi, P. G., Kim, J., McDaniell, R. M., Iyer, V. R. & Lieb, J. D. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. 17, 877–885 (2006).
123. Rada-Iglesias, A. et al. Binding sites for metabolic disease related transcription factors inferred at base pair resolution by chromatin immunoprecipitation and genomic microarrays. Hum. Mol. Genet. 14, 3435–3447 (2005).
124. Kim, T. H. et al. A high-resolution map of active promoters in the human genome. Nature 436, 876–880 (2005).
125. Halees, A. S. & Weng, Z. PromoSer: improvements to the algorithm, visualization and accessibility. Nucleic Acids Res. 32, W191–W194 (2004).
126. Bajic, V. B. et al.
Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment. Genome Biol. 7, (Suppl. 1; S3) 1–13 (2006).
127. Zheng, D. & Gerstein, M. B. A computational approach for identifying pseudogenes in the ENCODE regions. Genome Biol. 7, S13.1–S13.10 (2006).
128. Stranger, B. E. et al. Genome-wide associations of gene expression variation in humans. PLoS Genet. 1, e78 (2005).
129. Turner, B. M. Reading signals on the nucleosome with a new nomenclature for modified histones. Nature Struct. Mol. Biol. 12, 110–112 (2005).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We thank D. Leja for providing graphical expertise and support. Funding support is acknowledged from the following sources: National Institutes of Health, The European Union BioSapiens NoE, Affymetrix, Swiss National Science Foundation, the Spanish Ministerio de Educación y Ciencia, Spanish Ministry of Education and Science, CIBERESP, Genome Spain and Generalitat de Catalunya, Ministry of Education, Culture, Sports, Science and Technology of Japan, the NCCR Frontiers in Genetics, the Jérôme Lejeune Foundation, the Childcare Foundation, the Novartis Foundations, the Danish Research Council, the Swedish Research Council, the Knut and Alice Wallenberg Foundation, the Wellcome Trust, the Howard Hughes Medical Institute, the Bio-X Institute, the RIKEN Institute, the US Army, National Science Foundation, the Deutsche Forschungsgemeinschaft, the Austrian Gen-AU program, the BBSRC and The European Molecular Biology Laboratory. We thank the Barcelona SuperComputing Center and the NIH Biowulf cluster for computer facilities. The Consortium thanks the ENCODE Scientific Advisory Panel for their advice on the project: G. Weinstock, M. Cherry, G. Churchill, M. Eisen, S. Elgin, J. Lis, J. Rine, M. Vidal and P. Zamore.

Author Information Reprints and permissions information is available at www.nature.com/reprints.
The authors declare no competing financial interests. The list of individual authors is divided among the six main analysis groups and five organizational groups. Correspondence and requests for materials should be addressed to the co-chairs of the ENCODE analysis groups (listed in the Analysis Coordination group) E. Birney (birney@ebi.ac.uk); J. A. Stamatoyannopoulos (jstam@u.washington.edu); A. Dutta (ad8q@virginia.edu); R. Guigó (rguigo@imim.es); T. R. Gingeras (Tom_Gingeras@affymetrix.com); E. H. Margulies (elliott@nhgri.nih.gov); Z. Weng (zhiping@bu.edu); M. Snyder (michael.snyder@yale.edu); E. T. Dermitzakis (md4@sanger.ac.uk) or collectively (encode_chairs@ebi.ac.uk).

The ENCODE Project Consortium

Analysis Coordination
Ewan Birney 1, John A. Stamatoyannopoulos 2, Anindya Dutta 3, Roderic Guigó 4,5, Thomas R. Gingeras 6, Elliott H. Margulies 7, Zhiping Weng 8,9, Michael Snyder 10,11 & Emmanouil T. Dermitzakis 12

Chromatin and Replication
John A. Stamatoyannopoulos 2, Robert E. Thurman 2,13, Michael S. Kuehn 2,13, Christopher M. Taylor 3, Shane Neph 2, Christoph M. Koch 12, Saurabh Asthana 14, Ankit Malhotra 3, Ivan Adzhubei 14, Jason A. Greenbaum 15, Robert M. Andrews 12, Paul Flicek 1, Patrick J. Boyle 3, Hua Cao 13, Nigel P. Carter 12, Gayle K. Clelland 12, Sean Davis 16, Nathan Day 2, Pawandeep Dhami 12, Shane C. Dillon 12, Michael O. Dorschner 2, Heike Fiegler 12, Paul G. Giresi 17, Jeff Goldy 2, Michael Hawrylycz 18, Andrew Haydock 2, Richard Humbert 2, Keith D. James 12, Brett E. Johnson 13, Ericka M. Johnson 13, Tristan T. Frum 13, Elizabeth R. Rosenzweig 13, Neerja Karnani 3, Kirsten Lee 2, Gregory C. Lefebvre 12, Patrick A. Navas 13, Fidencio Neri 2, Stephen C. J. Parker 15, Peter J. Sabo 2, Richard Sandstrom 2, Anthony Shafer 2, David Vetrie 12, Molly Weaver 2, Sarah Wilcox 12, Man Yu 13, Francis S. Collins 7, Job Dekker 19, Jason D. Lieb 17, Thomas D. Tullius 15, Gregory E.
Crawford 20, Shamil Sunyaev 14, William S. Noble 2, Ian Dunham 12 & Anindya Dutta 3

Genes and Transcripts
Roderic Guigó 4,5, France Denoeud 5, Alexandre Reymond 21,22, Philipp Kapranov 6, Joel Rozowsky 11, Deyou Zheng 11, Robert Castelo 5, Adam Frankish 12, Jennifer Harrow 12, Srinka Ghosh 6, Albin Sandelin 23, Ivo L. Hofacker 24, Robert Baertsch 25,26, Damian Keefe 1, Paul Flicek 1, Sujit Dike 6, Jill Cheng 6, Heather A. Hirsch 27, Edward A. Sekinger 27, Julien Lagarde 5, Josep F. Abril 5,28, Atif Shahab 29, Christoph Flamm 24,30, Claudia Fried 30, Jörg Hackermüller 32, Jana Hertel 30, Manja Lindemeyer 30, Kristin Missal 30,31, Andrea Tanzer 24,30, Stefan Washietl 24, Jan Korbel 11, Olof Emanuelsson 11, Jakob S. Pedersen 26, Nancy Holroyd 12, Ruth Taylor 12, David Swarbreck 12, Nicholas Matthews 12, Mark C. Dickson 33, Daryl J. Thomas 25,26, Matthew T. Weirauch 25, James Gilbert 12, Jorg Drenkow 6, Ian Bell 6, XiaoDong Zhao 34, K.G. Srinivasan 34, Wing-Kin Sung 34, Hong Sain Ooi 34, Kuo Ping Chiu 34, Sylvain Foissac 4, Tyler Alioto 4, Michael Brent 35, Lior Pachter 36, Michael L. Tress 37, Alfonso Valencia 37, Siew Woh Choo 34, Chiou Yu Choo 34, Catherine Ucla 22, Caroline Manzano 22, Carine Wyss 22, Evelyn Cheung 6, Taane G. Clark 38, James B. Brown 39, Madhavan Ganesh 6, Sandeep Patel 6, Hari Tammana 6, Jacqueline Chrast 21, Charlotte N. Henrichsen 21, Chikatoshi Kai 23, Jun Kawai 23,40, Ugrappa Nagalakshmi 10, Jiaqian Wu 10, Zheng Lian 41, Jin Lian 41, Peter Newburger 42, Xueqing Zhang 42, Peter Bickel 43, John S. Mattick 44, Piero Carninci 40, Yoshihide Hayashizaki 23,40, Sherman Weissman 41, Emmanouil T. Dermitzakis 12, Elliott H. Margulies 7, Tim Hubbard 12, Richard M. Myers 33, Jane Rogers 12, Peter F. Stadler 24,30,45, Todd M. Lowe 25, Chia-Lin Wei 34, Yijun Ruan 34, Michael Snyder 10,11, Ewan Birney 1, Kevin Struhl 27, Mark Gerstein 11,46,47, Stylianos E.
Antonarakis 22 & Thomas R. Gingeras 6

Integrated Analysis and Manuscript Preparation
James B. Brown 39, Paul Flicek 1, Yutao Fu 8, Damian Keefe 1, Ewan Birney 1, France Denoeud 5, Mark Gerstein 11,46,47, Eric D. Green 7,48, Philipp Kapranov 6, Ulaş Karaöz 8, Richard M. Myers 33, William S. Noble 2, Alexandre Reymond 21,22, Joel Rozowsky 11, Kevin Struhl 27, Adam Siepel 25,26†, John A. Stamatoyannopoulos 2, Christopher M. Taylor 3, James Taylor 49,50, Robert E. Thurman 2,13, Thomas D. Tullius 15, Stefan Washietl 24 & Deyou Zheng 11

Management Group
Laura A. Liefer 51, Kris A. Wetterstrand 51, Peter J. Good 51, Elise A. Feingold 51, Mark S. Guyer 51 & Francis S. Collins 52

Multi-species Sequence Analysis
Elliott H. Margulies 7, Gregory M. Cooper 33†, George Asimenos 53, Daryl J. Thomas 25,26, Colin N. Dewey 54, Adam Siepel 25,26†, Ewan Birney 1, Damian Keefe 1, Minmei Hou 49,50, James Taylor 49,50, Sergey Nikolaev 22, Juan I. Montoya-Burgos 55, Ari Löytynoja 1, Simon Whelan 1†, Fabio Pardi 1, Tim Massingham 1, James B. Brown 39, Haiyan Huang 43, Nancy R. Zhang 43,56, Peter Bickel 43, Ian Holmes 57, James C. Mullikin 7,48, Abel Ureta-Vidal 1, Benedict Paten 1, Michael Seringhaus 11, Deanna Church 58, Kate Rosenbloom 26, W. James Kent 25,26, Eric A. Stone 33, NISC Comparative Sequencing Program*, Baylor College of Medicine Human Genome Sequencing Center*, Washington University Genome Sequencing Center*, Broad Institute*, Children’s Hospital Oakland Research Institute*, Mark Gerstein 11,46,47, Stylianos E. Antonarakis 22, Serafim Batzoglou 53, Nick Goldman 1, Ross C. Hardison 50,59, David Haussler 25,26,60, Webb Miller 49,50,61, Lior Pachter 36, Eric D. Green 7,48 & Arend Sidow 33,62

Transcriptional Regulatory Elements
Zhiping Weng 8,9, Nathan D. Trinklein 33†, Yutao Fu 8, Zhengdong D. Zhang 11, Ulaş Karaöz 8, Leah Barrera 68, Rhona Stuart 68, Deyou Zheng 11, Srinka Ghosh 6, Paul Flicek 1, David C.
King 50,59, James Taylor 49,50, Adam Ameur 69, Stefan Enroth 69, Mark C. Bieda 70, Christoph M. Koch 12, Heather A. Hirsch 27, Chia-Lin Wei 34, Jill Cheng 6, Jonghwan Kim 71, Akshay A. Bhinge 71, Paul G. Giresi 17, Nan Jiang 72, Jun Liu 34, Fei Yao 34, Wing-Kin Sung 34, Kuo Ping Chiu 34, Vinsensius B. Vega 34, Charlie W.H. Lee 34, Patrick Ng 34, Atif Shahab 29, Edward A. Sekinger 27, Annie Yang 27, Zarmik Moqtaderi 27, Zhou Zhu 27, Xiaoqin Xu 70, Sharon Squazzo 70, Matthew J. Oberley 73, David Inman 73, Michael A. Singer 72, Todd A. Richmond 72, Kyle J. Munn 72,74, Alvaro Rada-Iglesias 74, Ola Wallerman 74, Jan Komorowski 69, Gayle K. Clelland 12, Sarah Wilcox 12, Shane C. Dillon 12, Robert M. Andrews 12, Joanna C. Fowler 12, Phillippe Couttet 12, Keith D. James 12, Gregory C. Lefebvre 12, Alexander W. Bruce 12, Oliver M. Dovey 12, Peter D. Ellis 12, Pawandeep Dhami 12, Cordelia F. Langford 12, Nigel P. Carter 12, David Vetrie 12, Philipp Kapranov 6, David A. Nix 6, Ian Bell 6, Sandeep Patel 6, Joel Rozowsky 11, Ghia Euskirchen 10, Stephen Hartman 10, Jin Lian 41, Jiaqian Wu 10, Alexander E. Urban 10, Peter Kraus 10, Sara Van Calcar 68, Nate Heintzman 68, Tae Hoon Kim 68, Kun Wang 68, Chunxu Qu 68, Gary Hon 68, Rosa Luna 75, Christopher K. Glass 75, M. Geoff Rosenfeld 75, Shelley Force Aldred 33, Sara J. Cooper 33, Anason Halees 8, Jane M. Lin 9, Hennady P. Shulha 9, Xiaoling Zhang 8, Mousheng Xu 8, Jaafar N. S. Haidar 9, Yong Yu 9, Ewan Birney*,1, Sherman Weissman 41, Yijun Ruan 34, Jason D. Lieb 17, Vishwanath R. Iyer 71, Roland D. Green 72, Thomas R. Gingeras 6, Claes Wadelius 74, Ian Dunham 12, Kevin Struhl 27, Ross C. Hardison 50,59, Mark Gerstein 11,46,47, Peggy J. Farnham 70, Richard M. Myers 33, Bing Ren 68 & Michael Snyder 10,11

UCSC Genome Browser
Daryl J. Thomas 25,26, Kate Rosenbloom 26, Rachel A. Harte 26, Angie S.
Hinrichs 26, Heather Trumbower 26, Hiram Clawson 26, Jennifer Hillman-Jackson 26, Ann S. Zweig 26, Kayla Smith 26, Archana Thakkapallayil 26, Galt Barber 26, Robert M. Kuhn 26, Donna Karolchik 26, David Haussler 25,26,60 & W. James Kent 25,26

Variation
Emmanouil T. Dermitzakis 12, Lluis Armengol 76, Christine P. Bird 12, Taane G. Clark 38, Gregory M. Cooper 33†, Paul I. W. de Bakker 77, Andrew D. Kern 26, Nuria Lopez-Bigas 5, Joel D. Martin 50,59, Barbara E. Stranger 12, Daryl J. Thomas 25,26, Abigail Woodroffe 78, Serafim Batzoglou 53, Eugene Davydov 53, Antigone Dimas 12, Eduardo Eyras 5, Ingileif B. Hallgrímsdóttir 79, Ross C. Hardison 50,59, Julian Huppert 12, Arend Sidow 33,62, James Taylor 49,50, Heather Trumbower 26, Michael C. Zody 77, Roderic Guigó 4,5, James C. Mullikin 7, Gonçalo R. Abecasis 78, Xavier Estivill 76,80 & Ewan Birney 1.

*NISC Comparative Sequencing Program
Gerard G. Bouffard 7,48, Xiaobin Guan 48, Nancy F. Hansen 48, Jacquelyn R. Idol 7, Valerie V.B. Maduro 7, Baishali Maskeri 48, Jennifer C. McDowell 48, Morgan Park 48, Pamela J. Thomas 48, Alice C. Young 48 & Robert W. Blakesley 7,48

Baylor College of Medicine, Human Genome Sequencing Center
Donna M. Muzny 63, Erica Sodergren 63, David A. Wheeler 63, Kim C. Worley 63, Huaiyang Jiang 63, George M. Weinstock 63 & Richard A. Gibbs 63

Washington University Genome Sequencing Center
Tina Graves 64, Robert Fulton 64, Elaine R. Mardis 64 & Richard K. Wilson 64

Broad Institute
Michele Clamp 65, James Cuff 65, Sante Gnerre 65, David B. Jaffe 65, Jean L. Chang 65, Kerstin Lindblad-Toh 65 & Eric S. Lander 65,66

Children’s Hospital Oakland Research Institute
Maxim Koriabine 67, Mikhail Nefedov 67, Kazutoyo Osoegawa 67, Yuko Yoshinaga 67, Baoli Zhu 67 & Pieter J.
de Jong 67

Affiliations for participants:
1 EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
2 Department of Genome Sciences, 1705 NE Pacific Street, Box 357730, University of Washington, Seattle, Washington 98195, USA.
3 Department of Biochemistry and Molecular Genetics, Jordan 1240, Box 800733, 1300 Jefferson Park Ave, University of Virginia School of Medicine, Charlottesville, Virginia 22908, USA.
4 Genomic Bioinformatics Program, Center for Genomic Regulation, 5 Research Group in Biomedical Informatics, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, c/o Dr. Aiguader 88, Barcelona Biomedical Research Park Building, 08003 Barcelona, Catalonia, Spain.
6 Affymetrix, Inc., Santa Clara, California 95051, USA.
7 Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
8 Bioinformatics Program, Boston University, 24 Cummington St., Boston, Massachusetts 02215, USA.
9 Biomedical Engineering Department, Boston University, 44 Cummington St., Boston, Massachusetts 02215, USA.
10 Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA.
11 Department of Molecular Biophysics and Biochemistry, Yale University, PO Box 208114, New Haven, Connecticut 06520, USA.
12 The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
13 Division of Medical Genetics, 1705 NE Pacific Street, Box 357720, University of Washington, Seattle, Washington 98195, USA.
14 Division of Genetics, Brigham and Women’s Hospital and Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115, USA.
15 Department of Chemistry and Program in Bioinformatics, Boston University, 590 Commonwealth Avenue, Boston, Massachusetts 02215, USA.
16 Genetics Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA. 17 Department of Biology and Carolina Center for Genome Sciences, CB# 3280, 202 Fordham Hall, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA. 18 Allen Institute for Brain Sciences, 551 North 34th Street, Seattle, Washington 98103, USA. 19 Program in Gene Function and Expression and Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, 364 Plantation Street, Worcester, Massachusetts 01605, USA. 20 Institute for Genome Sciences & Policy and Department of Pediatrics, 101 Science Drive, Duke University, Durham, North Carolina 27708, USA. 21 Center for Integrative Genomics, University of Lausanne, Genopode building, 1015 Lausanne, Switzerland. 22 Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland. 23 Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan. 24 Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria. 25 Department of Biomolecular Engineering, University of California, Santa Cruz, 1156 High Street, Santa Cruz, California 95064, USA. 26 Center for Biomolecular Science and Engineering, Engineering 2, Suite 501, Mail Stop CBSE/ITI, University of California, Santa Cruz, California 95064, USA. 27 Department of Biological Chemistry & Molecular Pharmacology, Harvard Medical School, 240 Longwood Avenue, Boston, Massachusetts 02115, USA. 28 Department of Genetics, Facultat de Biologia, Universitat de Barcelona, Av Diagonal, 645, 08028, Barcelona, Catalonia, Spain. 29 Bioinformatics Institute, 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Singapore. 
30 Bioinformatics Group, Department of Computer Science, 31 Interdisciplinary Center of Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany. 32 Fraunhofer Institut für Zelltherapie und Immunologie - IZI, Deutscher Platz 5e, D-04103 Leipzig, Germany. 33 Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA. 34 Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672, Singapore. 35 Laboratory for Computational Genomics, Washington University, Campus Box 1045, Saint Louis, Missouri 63130, USA. 36 Department of Mathematics and Computer Science, University of California, Berkeley, California 94720, USA. 37 Spanish National Cancer Research Centre, CNIO, Madrid, E-28029, Spain. 38 Department of Epidemiology and Public Health, Imperial College, St Mary's Campus, Norfolk Place, London W2 1PG, UK. 39 Department of Applied Science & Technology, University of California, Berkeley, California 94720, USA. 40 Genome Science Laboratory, Discovery and Research Institute, RIKEN Wako Institute, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan. 41 Department of Genetics, Yale University School of Medicine, 333 Cedar Street, New Haven, Connecticut 06510, USA. 42 Department of Pediatrics, University of Massachusetts Medical School, 55 Lake Avenue, North Worcester, Massachusetts 01605, USA. 43 Department of Statistics, University of California, Berkeley, California 94720, USA. 44 Institute for Molecular Bioscience, University of Queensland, St. Lucia, QLD 4072, Australia. 45 The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA. 46 Department of Computer Science, Yale University, PO Box 208114, New Haven, Connecticut 06520-8114, USA. 47 Program in Computational Biology & Bioinformatics, Yale University, PO Box 208114, New Haven, Connecticut 06520-8114, USA. 
48 NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA. 49 Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA. 50 Center for Comparative Genomics and Bioinformatics, Huck Institutes for Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA. 51 Division of Extramural Research, National Human Genome Research Institute, National Institutes of Health, 5635 Fishers Lane, Suite 4076, Bethesda, Maryland 20892-9305, USA. 52 Office of the Director, National Human Genome Research Institute, National Institutes of Health, 31 Center Drive, Suite 4B09, Bethesda, Maryland 20892-2152, USA. 53 Department of Computer Science, Stanford University, Stanford, California 94305, USA. 54 Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 6720 MSC, 1300 University Ave, Madison, Wisconsin 53706, USA. 55 Department of Zoology and Animal Biology, Faculty of Sciences, University of Geneva, 1205 Geneva, Switzerland. 56 Department of Statistics, Stanford University, Stanford, California 94305, USA. 57 Department of Bioengineering, University of California, Berkeley, California 94720-1762, USA. 58 National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894, USA. 59 Department of Biochemistry and Molecular Biology, Huck Institutes of Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA. 60 Howard Hughes Medical Institute, University of California, Santa Cruz, California 95064, USA. 61 Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA. 62 Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, USA. 
63 Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA. 64 Genome Sequencing Center, Washington University School of Medicine, Campus Box 8501, 4444 Forest Park Avenue, Saint Louis, Missouri 63108, USA. 65 Broad Institute of Harvard University and Massachusetts Institute of Technology, 320 Charles Street, Cambridge, Massachusetts 02141, USA. 66 Whitehead Institute for Biomedical Research, 9 Cambridge Center, Cambridge, Massachusetts 02142, USA. 67 Children's Hospital Oakland Research Institute, BACPAC Resources, 747 52nd Street, Oakland, California 94609, USA. 68 Ludwig Institute for Cancer Research, 9500 Gilman Drive, La Jolla, California 92093-0653, USA. 69 The Linnaeus Centre for Bioinformatics, Uppsala University, BMC, Box 598, SE-75124 Uppsala, Sweden. 70 Department of Pharmacology and the Genome Center, University of California, Davis, California 95616, USA. 71 Institute for Cellular & Molecular Biology, The University of Texas at Austin, 1 University Station A4800, Austin, Texas 78712, USA. 72 NimbleGen Systems, Inc., 1 Science Court, Madison, Wisconsin 53711, USA. 73 University of Wisconsin Medical School, Madison, Wisconsin 53706, USA. 74 Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, SE-75185 Uppsala, Sweden. 75 University of California, San Diego School of Medicine, 9500 Gilman Drive, La Jolla, California 92093, USA. 76 Genes and Disease Program, Center for Genomic Regulation, c/o Dr. Aiguader 88, Barcelona Biomedical Research Park Building, 08003 Barcelona, Catalonia, Spain. 77 Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, Massachusetts 02142, USA. 78 Center for Statistical Genetics, Department of Biostatistics, SPH II, 1420 Washington Heights, Ann Arbor, Michigan 48109-2029, USA. 79 Department of Statistics, University of Oxford, Oxford OX1 3TG, UK. 80 Universitat Pompeu Fabra, c/o Dr. 
Aiguader 88, Barcelona Biomedical Research Park Building, 08003 Barcelona, Catalonia, Spain. †Present addresses: Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA (G.M.C.); Department of Biological Statistics & Computational Biology, Cornell University, Ithaca, New York 14853, USA (A.S.); Faculty of Life Sciences, University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK (S.W.); SwitchGear Genomics, 1455 Adams Drive, Suite 2015, Menlo Park, California 94025, USA (N.D.T.; S.F.A.). ARTICLES NATURE | Vol 447 | 14 June 2007 816 Nature ©2007 Publishing Group "