5′ end–centered expression profiling using cap-analysis gene expression and next-generation sequencing

Journal name: Nature Protocols


Cap-analysis gene expression (CAGE) provides accurate high-throughput measurement of RNA expression. CAGE allows mapping of all the initiation sites of both capped coding and noncoding RNAs. In addition, transcriptional start sites within promoters are characterized at single-nucleotide resolution. The latter allows the regulatory inputs driving gene expression to be studied, which in turn enables the construction of transcriptional networks. Here we provide an optimized protocol for the construction of CAGE libraries on the basis of the preparation of 27-nt-long tags corresponding to initial bases at the 5′ ends of capped RNAs. We have optimized the methods using simple steps based on filtration, which altogether takes 4 d to complete. The CAGE tags can be readily sequenced with Illumina sequencers, and upon modification they are also amenable to sequencing using other platforms.

At a glance


  1. Figure 1: Workflow of CAGE library preparation.

    cDNA is reverse transcribed by reverse transcriptase using a random primer including the EcoP15I sequence (yellow) and polyadenylated and nonpolyadenylated RNA as template in Steps 1–4. Cap and 3′ ends are biotinylated, and after RNase digestion of nonhybridized single-stranded RNA (represented by scissors), 5′ complete cDNAs hybridized to biotinylated capped RNAs are captured by streptavidin-coated magnetic beads in Steps 5–22. The cDNA is next released from RNA and ligated to a 5′ linker including a bar-coded sequence (red) and EcoP15I sequence (yellow) in Steps 23–32. The double-strand 5′ linkers are then denatured at 94 °C to allow the biotin-modified second SOL primer to anneal to the single-stranded cDNA and prime second-strand cDNA synthesis in Steps 33–39. Subsequently, cDNA is digested with EcoP15I, which cleaves 27 bp inside the 5′ end of the cDNA in Steps 40–42. Next, a 3′ linker containing the 3′ Illumina primer sequence (purple) is ligated at the 3′ end in Steps 43 and 44. The 96-bp CAGE tags are amplified with the forward primer (green) and reverse primer, which are both compatible with the Illumina flow cell surface, in Steps 45–58. C, cap; B, biotin; SMB, streptavidin-coated magnetic beads.

  2. Figure 2: Strategies for eliminating noncapped biotinylated molecules.

    (a) In addition to the 5′ end of capped RNAs, biotinylation also takes place on the diol group at the 3′ end of capped RNAs and at the 3′ end of ribosomal/other uncapped RNAs, which must be subsequently eliminated to avoid contamination of 5′ complete cDNA. Careful usage of random primers has been instrumental in achieving this. Blue strand indicates 5′ capped RNA and green strand indicates noncapped RNA. Pink strand includes random primer (with restriction enzyme site in yellow) and shows first-strand cDNA extension. Examples 1–6 show different potential random priming patterns. C, cap; B, biotin. (b) RNase I is used to cleave single-strand mismatched regions produced by cDNA synthesis using random primers. The two examples show different random priming patterns on 5′ capped RNA; the upper example (from example 1 in a) results in capture of 5′ complete cDNA, whereas the bottom example (from example 2 in a) shows incomplete cDNA that did not extend to the 5′ end. The incomplete cDNA is subsequently eliminated because RNase I cleavage separates it from the biotinylated cap. (c) Uncapped/incomplete cDNAs derived from primers that perfectly matched the 3′ end of the RNA and biotinylated at the 3′ end need to be eliminated from the library to reduce bias due to ribosomal RNA contamination. Infrequent cases of perfectly aligned random priming at the 3′ end would cause capture through the 3′ end biotin on ribosomal RNA. However, long random primers (N15; pink) leave mismatches that are cleaved by RNase I treatment, as described in b. Heating to 65 °C after RNase treatment releases the biotin at the 3′ end from the cDNA/RNA hybrid, which is then washed out at the cap-trapping step. (d) Removal of ribosomal cDNA sequence tags by RNase treatment and heating at 65 °C. In a previous protocol that used the GS20/GSFLX sequencer (454 Life Sciences; ref. 39), ribosomal CAGE tags represented ~30% of the tags without treatment at 65 °C (n = 6). Other samples were incubated at 65 °C for 5 min, which decreased ribosomal RNA tags to 5.41% (n = 6). Error bars indicate s.d. of experiments. A third sample, prepared with the protocol presented here for Illumina sequencing instruments, shows a further decrease in ribosomal RNAs. Other RNAs include RNA sequences mappable to the genome (60–89%) or unmapped RNA-derived sequences.

  3. Figure 3: Linker dimer elimination strategies.

    (a) Phosphate- and NH2-modified 5′ linkers ligate only to cDNA (pink) and not with themselves. (b) Bottom right, small amounts of linker dimers may form after ligation of the NN single strand of the 3′ linker to the 5′ end linker extended by the second SOL primer. If left untreated, these dimers appear as artifacts in the sequencing data. Bottom left, Antarctic phosphatase treatment prevents ligation of the 3′ linker by removing the phosphate group at the end of this artifact, eliminating linker dimers.

  4. Figure 4: Cap-trapped cDNA size distribution.

    Quality check result of cap-trapped, single-stranded cDNA obtained at Step 28. A volume of 1 μl of purified cDNA is measured with the Agilent RNA pico kit. cDNA length should range from a few hundred base pairs up to about 4 kb. FU, fluorescence unit; dashed green lines, baseline; 25-nt peak, molecular size marker.

  5. Figure 5: Measurement of PCR products.

    Example of PCR cycle optimization by using the Agilent Bioanalyzer DNA 1000 kit. (a–d) The amount of applied PCR product is 1 μl; shown are 9 cycles (a), 13 cycles (b), 15 cycles (c) and 18 cycles (d). Peak values indicate the height in fluorescence units (FU). With 9 cycles, only the primer peak (25 bp) is visible; the CAGE peak is not. With 13 cycles, there are two peaks, the primer peak and the CAGE peak. The measured size (103–105 bp) may differ slightly from the actual 96 bp because of inherent instrument error. CAGE tag peaks with FU values between 5 and 10 (molarity: ~10 nmol per liter) are suitable for bulk PCR. With 15 cycles, the FU exceeds 20 (molarity: ~30 nmol per liter), and with 18 cycles the reaction shows a broad peak because of overcycling (compared with a and b). (e) The final product molarity of the single peak was estimated to be 17.6 nmol per liter at 13 cycles. PCR primers are subsequently removed during Steps 55 and 57 of the PROCEDURE. After Step 58, the single-peak products are ready for sequencing. (f,g) Example of small linker dimer contamination (70–80 bp) (f), which does not affect sequencing, and large linker contamination (g), which prevents the use of the library.

  6. Figure 6: Scatter plot of cluster expression between two biological replicates (K562 whole cell).

    Points in red represent measurements above a 0.1 IDR (ref. 21) threshold. The x axis and y axis indicate the log2 of the sequence tags per million.
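
    The axes of Figure 6 are expressed as log2 of tags per million. As an illustration only, the normalization can be sketched in Python as follows. The function names, the cluster identifiers and the pseudocount of 1 (added so that clusters with zero tags have a defined logarithm) are assumptions made for this sketch and are not taken from the protocol.

```python
import math


def tags_per_million(counts):
    """Scale raw tag counts per cluster to tags per million (TPM).

    `counts` maps a cluster identifier to its raw tag count in one library.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no tags")
    return {cluster: n * 1_000_000 / total for cluster, n in counts.items()}


def replicate_points(counts_a, counts_b, pseudocount=1.0):
    """Return (log2 TPM in A, log2 TPM in B) for every cluster seen in A or B.

    The pseudocount is an assumption of this sketch; it keeps clusters
    that are absent from one replicate on the plot.
    """
    tpm_a = tags_per_million(counts_a)
    tpm_b = tags_per_million(counts_b)
    points = {}
    for cluster in sorted(set(tpm_a) | set(tpm_b)):
        x = math.log2(tpm_a.get(cluster, 0.0) + pseudocount)
        y = math.log2(tpm_b.get(cluster, 0.0) + pseudocount)
        points[cluster] = (x, y)
    return points


if __name__ == "__main__":
    # Hypothetical counts for two replicates, for illustration only.
    rep1 = {"cluster_1": 1, "cluster_2": 3}
    rep2 = {"cluster_1": 2, "cluster_2": 2}
    for name, (x, y) in replicate_points(rep1, rep2).items():
        print(name, round(x, 2), round(y, 2))
```

    Each returned pair corresponds to one point of a replicate scatter plot; reproducibility thresholds such as the IDR are computed separately and are not part of this sketch.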


  1. Schena, M., Shalon, D., Davis, R.W. & Brown, P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995).
  2. Cheng, J. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308, 1149–1154 (2005).
  3. Forrest, A.R. & Carninci, P. Whole genome transcriptome analysis. RNA Biol. 6, 107–112 (2009).
  4. Velculescu, V.E., Zhang, L., Vogelstein, B. & Kinzler, K.W. Serial analysis of gene expression. Science 270, 484–487 (1995).
  5. Kodzius, R. et al. CAGE: cap analysis of gene expression. Nat. Methods 3, 211–222 (2006).
  6. Kanamori-Katayama, M. et al. Unamplified cap analysis of gene expression on a single-molecule sequencer. Genome Res. 21, 1150–1159 (2011).
  7. Kawaji, H. et al. Dynamic usage of transcription start sites within core promoters. Genome Biol. 7, R118 (2006).
  8. Ponjavic, J. et al. Transcriptional and structural impact of TATA-initiation site spacing in mammalian core promoters. Genome Biol. 7, R78 (2006).
  9. Frith, M.C. et al. Evolutionary turnover of mammalian transcription start sites. Genome Res. 16, 713–722 (2006).
  10. Hoskins, R.A. et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 21, 182–192 (2011).
  11. Carninci, P. et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 38, 626–635 (2006).
  12. Gustincich, S. et al. The complexity of the mammalian transcriptome. J. Physiol. 575, 321–332 (2006).
  13. Vitezic, M. et al. Building promoter aware transcriptional regulatory networks using siRNA perturbation and deepCAGE. Nucleic Acids Res. 38, 8141–8148 (2010).
  14. Frith, M.C. et al. A code for transcription initiation in mammalian genomes. Genome Res. 18, 1–12 (2008).
  15. Suzuki, H. et al. The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line. Nat. Genet. 41, 553–562 (2009).
  16. Faulkner, G.J. et al. The regulated retrotransposon transcriptome of mammalian cells. Nat. Genet. 41, 563–571 (2009).
  17. Hestand, M.S. et al. Tissue-specific transcript annotation and expression profiling with complementary next-generation sequencing technologies. Nucleic Acids Res. 38, e165 (2010).
  18. Wei, C.L. et al. 5′ Long serial analysis of gene expression (LongSAGE) and 3′ LongSAGE for transcriptome characterization and genome annotation. Proc. Natl. Acad. Sci. USA 101, 11701–11706 (2004).
  19. Valen, E. et al. Genome-wide detection and analysis of hippocampus core promoters using DeepCAGE. Genome Res. 19, 255–265 (2009).
  20. Shiraki, T. et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl. Acad. Sci. USA 100, 15776–15781 (2003).
  21. Myers, R.M. et al. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 9, e1001046 (2011).
  22. Takahashi, H., Kato, S., Murata, M. & Carninci, P. CAGE (Cap Analysis of Gene Expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol. Biol. 786, 181–200 (2012).
  23. Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563 (2005).
  24. Carninci, P. et al. Thermostabilization and thermoactivation of thermolabile enzymes by trehalose and its application for the synthesis of full length cDNA. Proc. Natl. Acad. Sci. USA 95, 520–524 (1998).
  25. Carninci, P., Shiraki, T., Mizuno, Y., Muramatsu, M. & Hayashizaki, Y. Extra-long first-strand cDNA synthesis. Biotechniques 32, 984–985 (2002).
  26. Carninci, P. et al. High-efficiency full-length cDNA cloning by biotinylated CAP trapper. Genomics 37, 327–336 (1996).
  27. Shibata, K. et al. RIKEN integrated sequence analysis (RISA) system—384-format sequencing pipeline with 384 multicapillary sequencer. Genome Res. 10, 1757–1771 (2000).
  28. Plessy, C. et al. Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nat. Methods 7, 528–534 (2010).
  29. Maeda, N. et al. Development of a DNA barcode tagging method for monitoring dynamic changes in gene expression by using an ultra high-throughput sequencer. Biotechniques 45, 95–97 (2008).
  30. Janscak, P., Sandmeier, U., Szczelkun, M.D. & Bickle, T.A. Subunit assembly and mode of DNA cleavage of the type III restriction endonucleases EcoP1I and EcoP15I. J. Mol. Biol. 306, 417–431 (2001).
  31. Raghavendra, N.K. & Rao, D.N. Exogenous AdoMet and its analogue sinefungin differentially influence DNA cleavage by R.EcoP15I–usefulness in SAGE. Biochem. Biophys. Res. Commun. 334, 803–811 (2005).
  32. Pfaffl, M.W. A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res. 29, e45 (2001).
  33. Lassmann, T., Hayashizaki, Y. & Daub, C.O. TagDust—a program to eliminate artifacts from next generation sequencing data. Bioinformatics 25, 2839–2840 (2009).
  34. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
  35. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  36. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
  37. Fujita, P.A. et al. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 39, D876–D882 (2011).
  38. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
  39. Carninci, P. Cap-analysis Gene Expression (CAGE): The Science of Decoding Gene Transcription (Pan Stanford, 2010).
  40. Li, Q., Brown, J.B., Huang, H. & Bickel, P.J. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 5, 1752–1779 (2011).


Author information


  1. RIKEN Omics Science Center, RIKEN Yokohama Institute, Yokohama, Japan.

    • Hazuki Takahashi,
    • Timo Lassmann,
    • Mitsuyoshi Murata &
    • Piero Carninci


H.T. performed most experiments. M.M. performed the background reduction experiment. T.L. performed the computational analysis. H.T. and P.C. wrote the manuscript. P.C. designed the project.

Competing financial interests

P.C. is an inventor on various patents owned by RIKEN and Dnaform on the Cap-trapper technology, full-length cDNA cloning technologies and the CAGE technology.


Supplementary information

Image files

  1. Supplementary Fig. 1 (652 KB)

    Oligo-dT priming enhances the capture of CAGE tags on exons and 3′ UTRs.
    CAGE libraries were made from THP-1 cells. Data were displayed with the ZENBU genome browser (J. Severin, unpublished data). (a) The Actin beta gene is transcribed from right to left (violet arrow) on chromosome 7. (b) The GAPDH gene is transcribed from left to right (green arrow) on chromosome 12. The RT reactions for the CAGE libraries were primed with (1) random and oligo-dT primers (ratio 4:1), (2) oligo-dT primers only and (3) random primers only. Both panels indicate that oligo-dT primers could enhance the capture of transcripts on 3′ exons and on internal exons, compared with random primers alone.

Text files

  1. Supplementary Data 1 (1 KB)

    The make_ctss script, which is used to cluster the CTSS (Step 65).
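
    The make_ctss script itself is not reproduced here. Purely as an illustration of the kind of operation involved, the Python sketch below counts the 5′ ends of mapped tags per strand-specific position and then merges nearby positions into clusters. It is not the authors' script: the input layout (0-based, half-open BED-like intervals), the function names and the 20-nt merging distance are all assumptions made for this example.

```python
from collections import Counter


def ctss_counts(alignments):
    """Count mapped tags per 5' end position.

    `alignments` yields (chrom, start, end, strand) tuples with 0-based,
    half-open coordinates (an assumed, BED-like layout).
    """
    counts = Counter()
    for chrom, start, end, strand in alignments:
        # The 5' end is the first base on "+" and the last base on "-".
        five_prime = start if strand == "+" else end - 1
        counts[(chrom, strand, five_prime)] += 1
    return counts


def cluster_ctss(counts, max_gap=20):
    """Merge positions on the same chromosome and strand into clusters.

    Positions no more than `max_gap` nt apart are joined; the default of
    20 nt is an arbitrary value chosen for this sketch.
    """
    clusters = []
    for (chrom, strand, pos), n in sorted(counts.items()):
        last = clusters[-1] if clusters else None
        if (last is not None and last["chrom"] == chrom
                and last["strand"] == strand
                and pos - last["end"] <= max_gap):
            last["end"] = pos
            last["tags"] += n
        else:
            clusters.append({"chrom": chrom, "strand": strand,
                             "start": pos, "end": pos, "tags": n})
    return clusters


if __name__ == "__main__":
    # Hypothetical 27-nt tag alignments, for illustration only.
    tags = [
        ("chr1", 100, 127, "+"),
        ("chr1", 100, 127, "+"),
        ("chr1", 105, 132, "+"),
        ("chr1", 500, 527, "+"),
        ("chr1", 73, 100, "-"),
    ]
    for cluster in cluster_ctss(ctss_counts(tags)):
        print(cluster)
```

    Distance-based merging is only one simple way to group start sites; the supplementary script may use a different criterion.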
