Annotating non-coding regions of the genome

Key Points

  • Most of the human genome consists of DNA that does not code for proteins.

  • Annotating functional regions in the non-coding genome involves two complementary analysis techniques: comparative analysis, which involves examining DNA sequences, and functional analysis, which involves examining the output of functional genomics experiments.

  • With the exponential increase in DNA sequence data, it is now possible to compare sequences within a single human haplotype, between cell types in a single person, across the human population and between species. Integrating the analysis across all these scales is useful.

  • There are two main methods of sequence comparison: scanning for regions of high sequence similarity above some operational threshold, and building statistical models of sequence families. Model-based sequence analysis can incorporate more biological knowledge than sequence similarity scans and provide more refined results.

  • The output of most high-throughput functional genomics experiments can be treated as a continuous signal mapped onto the genome and analysed with a standardized signal processing approach.

  • Signal processing involves smoothing the raw signal, then thresholding and segmenting the signal into discrete annotated blocks.

  • Integrating multiple types of signal generates progressively more complex annotations: small initial annotations are clustered into groups and then into functional networks that begin to represent the state of biological knowledge about the genome.

  • A chronic problem with annotation based on functional genomics data is the lack of sufficient validation by more low-throughput methods.

  • Techniques such as paired-end sequencing and chromosome conformation capture (and its descendants) enable annotation of connectivity between elements and necessitate a move beyond the one-dimensional signal approach to annotation.
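The first of the two sequence-comparison methods named in the key points, scanning for regions of similarity above an operational threshold, can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to equal length (with "-" marking gaps); the function name `identity_windows`, the window size and the threshold are illustrative choices, not values taken from the article.

```python
def identity_windows(seq_a, seq_b, window, min_identity):
    """Return (start, end, identity) for each window at or above min_identity.

    Assumes seq_a and seq_b are pre-aligned and of equal length; windows are
    half-open [start, end). Gap columns ("-") never count as matches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = [a == b and a != "-" for a, b in zip(seq_a, seq_b)]
    hits = []
    count = sum(matches[:window])  # matches in the first window
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window one column: add the new column, drop the old one.
            count += matches[start + window - 1] - matches[start - 1]
        identity = count / window
        if identity >= min_identity:
            hits.append((start, start + window, identity))
    return hits


seq_a = "ACGTACGTAC"
seq_b = "ACGTTTTTAC"
strict = identity_windows(seq_a, seq_b, window=4, min_identity=1.0)
relaxed = identity_windows(seq_a, seq_b, window=4, min_identity=0.75)
print(strict)   # windows that are 100% identical
print(relaxed)  # windows that are at least 75% identical
```

Lowering the threshold reports more windows, which is the trade-off any operational cutoff imposes; a model-based method would replace the fixed cutoff with a statistical model of the sequence family.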

Abstract

Most of the human genome consists of non-protein-coding DNA. Recently, progress has been made in annotating these non-coding regions through the interpretation of functional genomics experiments and comparative sequence analysis. One can conceptualize functional genomics analysis as involving a sequence of steps: turning the output of an experiment into a 'signal' at each base pair of the genome; smoothing this signal and segmenting it into small blocks of initial annotation; and then clustering these small blocks into larger derived annotations and networks. Finally, one can relate functional genomics annotations to conserved units and measures of conservation derived from comparative sequence analysis.
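As one possible illustration of the first step, turning experimental output into a signal at each base pair, the sketch below computes per-base read depth from mapped read intervals. The input representation (half-open `(start, end)` intervals on a single sequence) and the function name `coverage_signal` are assumptions made for this example; real experiments need normalization and control comparison that are not shown.

```python
def coverage_signal(reads, genome_length):
    """Per-base read depth from half-open [start, end) mapped-read intervals."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # A running sum of the differences gives the depth at each position.
    signal = []
    depth = 0
    for delta in diff[:genome_length]:
        depth += delta
        signal.append(depth)
    return signal


reads = [(0, 4), (2, 6), (3, 5)]  # hypothetical mapped reads
signal = coverage_signal(reads, genome_length=8)
print(signal)  # [1, 1, 2, 3, 2, 1, 0, 0]
```

The difference-array approach touches each read once and each base once, so it scales linearly with the number of reads plus the length of the region.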

Figure 1: Annotation process for non-coding regions: an overview.
Figure 2: Signal resolution and signal thresholding.
Figure 3: Matrix showing how to correlate genomic elements.

Acknowledgements

The authors thank members of the Gerstein laboratory for helpful discussions and careful reading of the manuscript. We acknowledge support from the US NIH and from the Albert L. Williams Professorship funds.

Author information

Corresponding author

Correspondence to Mark B. Gerstein.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

Mark B. Gerstein's homepage

1000 Genomes Project

Berkeley Drosophila Genome Project

Database of Genomic Variants

FlyBase

GTEx project

The ENCODE Project

Human Genome Structural Variation Project

miRBase

The modENCODE Project

Pseudogene.org

Saccharomyces Genome Database

UCSC Genome Browser

WormBase

Glossary

Targeted exome sequencing

A technique that involves filtering genomic DNA by capturing regions of interest (often protein-coding exons) on a microarray, then sequencing the captured DNA using next-generation techniques.

Structural variants

Chromosomal rearrangements (deletions, duplications, novel sequence insertions or inversions) that are inherited and polymorphic across the human population. Structural variants are by definition longer than SNPs and can be hundreds of thousands of base pairs long.

Copy-number variants

Structural variants that arise from deletion or duplication and thus lead to a change in copy number of the underlying region of the genome.

Segmental duplication

The operational definition of a segmental duplication rests on finding two regions in the same genome ranging in length from a thousand to several million nucleotides with at least 90% sequence identity. Segmental duplications are inherited but not necessarily polymorphic across the human population.

Pseudogenes

Copies of protein-coding genes with mutations that disrupt their coding sequence and abolish their original protein-coding function.

Syntenic blocks

Segments that align between genome sequences from two species and that are believed to define an orthologous relationship.

DNA-based transposons

Transposable DNA elements that rely on a transposase enzyme to excise themselves from one region of the genome and insert themselves into a different region, without increasing in copy number.

RNA-based retrotransposons

Transposable elements generated when reverse transcriptase enzymes copy RNA elements into DNA and insert the DNA copies back into the genome.

Duplicated pseudogenes

Pseudogenes that result from whole-genome or segmental duplications, in which one copy maintains its ancestral function and the other copy degrades into a pseudogene.

Processed pseudogenes

Pseudogenes that arise when the mRNA of a parent gene is retrotranscribed back into DNA and inserted into the genome.

Unitary pseudogenes

A rare class of pseudogene in which a single-copy parent gene becomes non-functional.

Chromatin immunoprecipitation

(ChIP.) A technique for identifying potential regulatory sequences that are bound by the protein of interest. Soluble DNA–chromatin extracts (complexes of DNA and protein) are isolated by using antibodies that recognize specific DNA-binding proteins. In ChIP–chip, the ChIP step is followed by microarray analysis, whereas in ChIP–seq, it is followed by sequencing.

Tiling arrays

A class of microarray in which probes of a specific length and spacing provide uniform coverage of an entire genome or portion of a genome to a desired resolution.

RNA sequencing

The use of high-throughput sequencing of RNA that has been reverse-transcribed into DNA to characterize the set of RNA transcripts produced by a cell.

Smoothing

The process of filtering noise from a signal by removing fine-scale variation.

Thresholding

The process of discretizing a continuous signal by choosing a signal value above which the signal is considered 'on' or 'active' and below which the signal is considered 'off' or 'inactive'.

Segmenting

The result of thresholding in signal processing — that is, segments are those regions defined as 'on' or 'active' after discretization of the signal.
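The three glossary terms above (smoothing, thresholding and segmenting) chain together naturally. The following is a minimal sketch under stated assumptions: a centred moving average stands in for smoothing (the article does not prescribe a particular filter), and the window size, cutoff and function names are illustrative.

```python
def smooth(signal, window=3):
    """Centred moving average; the window is truncated at the array ends."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def threshold(signal, cutoff):
    """Mark each position 'on' (True) if its value is above the cutoff."""
    return [value > cutoff for value in signal]


def segment(active):
    """Collapse runs of 'on' positions into half-open (start, end) blocks."""
    blocks = []
    start = None
    for i, on in enumerate(active):
        if on and start is None:
            start = i
        elif not on and start is not None:
            blocks.append((start, i))
            start = None
    if start is not None:
        blocks.append((start, len(active)))
    return blocks


raw = [0, 0, 5, 6, 0, 7, 5, 0, 0, 0]  # hypothetical per-base signal
raw_blocks = segment(threshold(raw, cutoff=3))
smoothed_blocks = segment(threshold(smooth(raw, window=3), cutoff=3))
print(raw_blocks)       # [(2, 4), (5, 7)]
print(smoothed_blocks)  # [(2, 7)]
```

In this toy input a single-base dropout splits the raw signal into two blocks, whereas smoothing first merges them into one. The choice of window and cutoff therefore changes the annotation and would need to be tuned to the resolution of the experiment.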

Heterochromatin

Highly compact and therefore inactive regions of the genome. Largely composed of repetitive DNA, heterochromatin forms dark bands after Giemsa staining.

Euchromatin

The lightly staining regions of the genome that are generally decondensed during interphase and contain transcriptionally active regions.

Fosmid

A low-copy vector for the construction of stable genomic libraries that uses the Escherichia coli F-factor origin of replication. Each fosmid clone can store 40 kb of library DNA. Cloned sequences are more stable in fosmids than in high-copy vectors.

Specificity

A measure of the proportion of true negatives correctly identified as such (for example, the percentage of healthy people who are identified as not having a disease).

Regulatory forests

Regions of the genome that are enriched with binding sites for regulatory factors, such as transcription factors.

Principal components analysis

A statistical method used to simplify data sets by transforming a series of correlated variables into a smaller number of uncorrelated factors.
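To make the definition concrete, the sketch below finds the first principal component of a small data set by power iteration on the covariance matrix, using only the standard library. This is one of several ways to compute principal components and is chosen here for brevity; the function name and the toy data are assumptions, and the all-ones starting vector would fail in the unlikely case that it is orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction, per-row scores) for the leading component."""
    n = len(rows)
    dims = len(rows[0])
    # Centre each variable on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Power iteration: repeated multiplication converges to the eigenvector
    # with the largest eigenvalue, i.e. the direction of greatest variance.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Project each centred row onto the component to get its score.
    scores = [sum(r[j] * vec[j] for j in range(dims)) for r in centered]
    return vec, scores


# Two perfectly correlated variables (y = 2x) collapse onto one factor.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(rows)
print(direction)  # proportional to (1, 2)
```

Here two correlated variables are summarized by a single score per row, which is the reduction the definition describes.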

Non-allelic homologous recombination

Recombination between segmental duplications that leads to local duplication, deletion or inversion of genome sequence.

Ultraconserved elements

Operationally defined as non-coding elements that are hundreds of base pairs long and 100% identical across human, mouse and rat genomes.
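The operational definition above can be sketched as a search for runs of alignment columns that are identical in every species. The example assumes a gapped multiple alignment supplied as equal-length strings; the function name `ultraconserved_runs` and the toy sequences are hypothetical, and the minimum length used here is far shorter than the hundreds of base pairs in the definition so that the example stays readable.

```python
def ultraconserved_runs(aligned, min_length):
    """Half-open (start, end) runs where all aligned sequences are identical."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    start = None
    # Iterate one position past the end so a run reaching the end is closed.
    for i in range(length + 1):
        if i < length:
            column = {seq[i] for seq in aligned}
            identical = len(column) == 1 and "-" not in column
        else:
            identical = False
        if identical and start is None:
            start = i
        elif not identical and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


# Toy alignment standing in for human, mouse and rat sequences.
human = "ACGTACGTTT"
mouse = "ACGTACGATT"
rat = "ACGTACGTTA"
runs = ultraconserved_runs([human, mouse, rat], min_length=5)
print(runs)  # [(0, 7)]
```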

Sensitivity

A measure of the proportion of true positives that are correctly identified as such (for example, the percentage of sick people who are identified as having a disease).
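Sensitivity and specificity, as defined in this glossary, can be computed together from predicted and true labels. The sketch below is a minimal illustration: the labels are invented, and in an annotation setting they might represent whether each candidate element was predicted, and then experimentally validated, as functional.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) from paired boolean labels."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if a and p:
            tp += 1   # true positive
        elif a and not p:
            fn += 1   # false negative
        elif not a and p:
            fp += 1   # false positive
        else:
            tn += 1   # true negative
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical validation outcome for ten candidate elements.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]
sens, spec = sensitivity_specificity(predicted, actual)
print(sens)  # 0.75 (3 of 4 true elements recovered)
print(spec)  # 5 of 6 true negatives correctly rejected
```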

Paired-end sequencing

Determination of the sequence at both ends of a fragment of DNA of known size.

Chromosome conformation capture

A technique used to study the long-distance interactions between genomic regions, which in turn can be used to study the three-dimensional architecture of chromosomes within a cell nucleus.

About this article

Cite this article

Alexander, R., Fang, G., Rozowsky, J. et al. Annotating non-coding regions of the genome. Nat Rev Genet 11, 559–571 (2010). https://doi.org/10.1038/nrg2814
