The FANTOM5 consortium described the promoter-level expression atlas of human and mouse by using CAGE (Cap Analysis of Gene Expression) with single molecule sequencing. In the original publications, GRCh37/hg19 and NCBI37/mm9 assemblies were used as the reference genomes of human and mouse respectively; later, the Genome Reference Consortium released newer genome assemblies GRCh38/hg38 and GRCm38/mm10. To increase the utility of the atlas in forthcoming researches, we reprocessed the data to make them available on the recent genome assemblies. The data include observed frequencies of transcription starting sites (TSSs) based on the realignment of CAGE reads, and TSS peaks that are converted from those based on the previous reference. Annotations of the peak names were also updated based on the latest public databases. The reprocessed results enable us to examine frequencies of transcription initiations on the recent genome assemblies and to refer promoters with updated information across the genome assemblies consistently.
Machine-accessible metadata file describing the reported data (ISA-tab format)
Background & Summary
A complete genome sequence provides an essential infrastructure to study an organism at the molecular level. It does not only represent an entire set of genetically inherited information, but also a coordinated system to describe genomic entities, such as nucleotide polymorphisms, genes, and regulatory elements. The representation of genome entities along a genomic reference is termed genome annotation1. The Genome Reference Consortium has been providing a variety of reference genome assemblies, updated in a timely manner in order to reflect the outcomes of recent research2. The latest assemblies available for human and mouse are versioned as GRCh38/hg38 and GRCm38/mm10. Recent efforts to improve the genome annotations based on the current assemblies were consequently carried, including the update of RefSeq3 and GENCODE transcripts4.
The FANTOM (Functional ANnoTation Of Mammalian genomes) project was launched to provide functional annotation for mammalian genomes, in particular focusing on transcriptome profiles5. The fifth edition of the FANTOM project (FANTOM5) profiled genome-wide transcription start sites (TSSs) in mammalian genomes by using CAGE (Cap Analysis of Gene Expression) with single-molecule sequencers (HeliScope); it then revealed a promoter-level expression atlas across mammalian primary cells, tissues, cell lines and time course samples, as well as a transcribed enhancer atlas6,
In the original publications of the FANTOM5 papers, the GRCh37/hg19 human and NCBI37/mm9 mouse genome assemblies were used. Here we reprocessed the FANTOM5 data to make it available on the current assemblies GRCh38/hg38 and GRCm38/mm10. We remapped individual CAGE reads, to fully benefit of improvements such as correction of erroneous genome sequences. In contrast, we converted the genomic coordinates of the CAGE peaks into the current assemblies by liftOver tool10, so that we can easily make corresponding between the reprocessed CAGE peaks and the existing FANTOM5 resources across the assemblies. Following the conversion, we added new CAGE peaks introduced in the latest genome assemblies, for example, ones for newly introduced genes. For this purpose, we used the result of peak-calling by the same method as reported in the original report8 based on the realigned CAGE reads with the latest genome assemblies (Data Citation 1: Zenodo http://doi.org/10.5281/zenodo.545682), and chose non-overlapped CAGE peaks in the result to merged with the converted CAGE peaks.
Unique identifiers were assigned to the CAGE peaks to provide the consistent reference to the CAGE peaks even after future updates. Majority of the peaks were successfully converted, and a few unexpected conversions were manually corrected. As a consequence, expression and gene annotations of the CAGE peaks were updated (Fig. 1). The reprocessed data of the FANTOM5 human and mouse CAGE datasets (Data Citations Data Citation 2: DDBJ Sequence Read Archive DRA000991,Data Citation 3: DDBJ Sequence Read Archive DRA001026,Data Citation 4: DDBJ Sequence Read Archive DRA001027,Data Citation 5: DDBJ Sequence Read Archive DRA001028,Data Citation 6: DDBJ Sequence Read Archive DRA001101,Data Citation 7: DDBJ Sequence Read Archive DRA002216,Data Citation 8: DDBJ Sequence Read Archive DRA002711,Data Citation 10: DDBJ Sequence Read Archive DRA002748) are publicly available from the FANTOM5 data web site (http://fantom.gsc.riken.jp/5/datafiles/reprocessed/), LSDB Archive (Data Citation 11: LSDB Archive http://doi.org/10.18908/lsdba.nbdc01389-000.V002) and Figshare (hg38 (Data Citation 12: Figshare http://doi.org/10.6084/m9.figshare.c.3743288) and mm10 (Data Citation 13: Figshare http://doi.org/10.6084/m9.figshare.c.3743378)).
Realignment of CAGE reads
The FANTOM5 CAGE reads (Data Citations Data Citation 2: DDBJ Sequence Read Archive DRA000991,Data Citation 3: DDBJ Sequence Read Archive DRA001026,Data Citation 4: DDBJ Sequence Read Archive DRA001027,Data Citation 5: DDBJ Sequence Read Archive DRA001028,Data Citation 6: DDBJ Sequence Read Archive DRA001101,Data Citation 7: DDBJ Sequence Read Archive DRA002216,Data Citation 8: DDBJ Sequence Read Archive DRA002711,Data Citation 10: DDBJ Sequence Read Archive DRA002748) were realigned by Delve version 0.95 with the GRCh38/hg38 and GRCm38/mm10 genome assemblies following the same procedure to the previous report6,
Conversion of the CAGE peaks genomic coordinates
The genomic coordinates of the original CAGE peaks (hg19 (Data Citation 16: Figshare http://doi.org/10.6084/m9.figshare.4883819) and mm9 (Data Citation 17: Figshare http://doi.org/10.6084/m9.figshare.4883822); or (Data Citation 11: LSDB Archive http://doi.org/10.18908/lsdba.nbdc01389-000.V002)) were computationally converted to the latest genome assemblies, followed by manual curation (Fig. 1). First we applied a software utility to convert between genomes, liftOver10, with option --minMatch=1 and the chain files from hg19 to hg38 (Data Citation 18: The UCSC Genome Browser http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz) and from mm9 to mm10 (Data Citation 19: The UCSC Genome Browser http://hgdownload.cse.ucsc.edu/goldenPath/mm9/liftOver/mm9ToMm10.over.chain.gz). A few of the CAGE peaks couldn’t be located in the latest genome assemblies (termed ‘dropped’), where the ratios of dropped peaks were less than 0.1% (Table 1).
Next we examined whether the mapped peaks were supported by the presence of any realigned CAGE reads, to confirm that the converted locations still have experimental evidences. We found that 335 and 76 peaks for human and mouse respectively had no support, and excluded them from downstream analysis by classifying them as ‘problematic’. We also examined internal overlap within the converted peaks, since they are not supposed overlap each other, based on their definition8. We found ten peaks overlapping in four loci only on hg38, which were caused by the fusion of originally distinct genomic regions or by unintentional stretch of a single CAGE peak after the conversion. We excluded four peaks, one peak for each overlapping locus, and classified them as ‘problematic’ as well (Table 2).
Lastly we examined whether the length of CAGE peaks was altered after the conversion to the latest genomic coordinates. We found eight peaks whose lengths were changed more than one base due to insertions and/or deletions in the updated genome assemblies. We manually updated the genomic coordinates of two out of eight CAGE peaks, to make them consistent with the CAGE read realignments (Table 2). As a result, we successfully generated a clean set of CAGE peaks (called ‘fair’) on the latest genome assemblies, representing more than 99% of the original ones (Table 1).
Integration of newly identified CAGE peaks in the latest genome assemblies
We complemented the converted CAGE peaks with a new set of CAGE peaks identified to the same method to the previous study (DPI, decomposition-based peak identification)8 based on the realigned CAGE reads with the latest genome assemblies (Data Citation 1: Zenodo http://doi.org/10.5281/zenodo.545682). To identify their overlaps, we used intersectBed command with option -s from Bedtools13. Non-overlapped CAGE peaks identified by DPI were merged with the converted CAGE.
Assignment of accession numbers to the CAGE peaks
Identifiers assigned to the CAGE peaks (CAGE peak ID) were formatted based on their genomic coordinates (Table 3). To have a consistent way to refer individual peaks, we assigned accessions to them, in the form of ‘hg_(serial_num).(version_num)’ and ‘mm_(serial_num).(version_num)’ for human and mouse, respectively. This enabled us to refer them independently of their genome assemblies and their coordinates, with explicit indication of versions. For backward traceability, the new ID is concatenated with the original ID by semicolon (;) in the final files. To avoid confusion of genome versions, we added ‘hg19’ or ‘mm9’ as prefix of the original ID.
Re-annotation of CAGE peaks with the latest gene and protein databases
To take advantage of recent genome annotations, in particular long-noncoding RNAs5, we updated the association of the CAGE peaks with genes, transcripts, and proteins. The process consists of the two steps: (i) associate CAGE peaks to transcripts, and subsequently to genes and proteins (ii) assign human-readable short names. The first step, association with transcripts, was achieved by finding the TSS of transcripts within 500 bp flanking region of the CAGE peak (flanking regions are limited to 50 bp for 5′-end not derived from transcription initiation by RNA Polymerase II, such as small nucleolar RNA), choosing the nearest transcripts, and associating CAGE peaks to gene and protein models based on the nearest transcripts. In the second step, human-readable short names were assigned in the same form to the original ones, ‘p<serial>@<HGNC/MGI/Entrez Gene name>’. Serial numbers were chosen not to conflict with already existing names, so that all names are unique not only within the current genome assembly but also across genome assemblies of a species.
We used the following annotation databases for re-annotation as of March 31, 2016: GENCODE4, Entrez Gene14, HUGO Gene Nomenclature Committee (HGNC) database15, the Mouse Genome Database (MGD)16 and the UCSC Genome Browser12. Transcript sets are from GENCODE (human/mouse), RefSeq (human/mouse), UCSC genes (human/mouse) and mRNAs in the UCSC genome browser (human/mouse). Gene models used for the annotation are from HGNC (human), MGI (mouse) and Entrez Gene (human/mouse). The protein sets are from UniProt (human/mouse).
Recalculation of promoter-level expression tables
As in the previous report8, we counted the read counts under the CAGE peaks on the current genome assemblies. The counts were normalized as TPM (tags per million) after scaling by normalization factors calculated by RLE (Relative Log Expression) method17,18.
The latest version of the source code of the Delve mapper is openly available from http://fantom.gsc.riken.jp/5/suppl/delve/delve-0.95.tgz.
The latest version of the source code of the DPI is openly available from https://github.com/hkawaji/dpi1.
The original data of raw reads (CAGE tags) for human and mouse is available from the DDBJ Sequence Read Archive (DRA) (Data Citations Data Citation 2: DDBJ Sequence Read Archive DRA000991,Data Citation 3: DDBJ Sequence Read Archive DRA001026,Data Citation 4: DDBJ Sequence Read Archive DRA001027,Data Citation 5: DDBJ Sequence Read Archive DRA001028,Data Citation 6: DDBJ Sequence Read Archive DRA001101,Data Citation 7: DDBJ Sequence Read Archive DRA002216,Data Citation 8: DDBJ Sequence Read Archive DRA002711,Data Citation 10: DDBJ Sequence Read Archive DRA002748). The human reprocessed dataset is available at (Data Citation 12: Figshare http://doi.org/10.6084/m9.figshare.c.3743288) and mouse reprocessed dataset is available at (Data Citation 13: Figshare http://doi.org/10.6084/m9.figshare.c.3743378). The results of realignment of CAGE reads in BAM format (.bam) with their index files (.bai) are located under ‘basic/*’ directories at http://fantom.gsc.riken.jp/5/datafiles/reprocessed/ or in LSDB Archive (Data Citation 11: LSDB Archive http://doi.org/10.18908/lsdba.nbdc01389-000.V002). The reprocessed CTSS file of each sample in BED format is also located in the same directory.
The genomic coordinates of the reprocessed CAGE peak regions in the BED format are in (Data Citations Data Citation 11: LSDB Archive http://doi.org/10.18908/lsdba.nbdc01389-000.V002,Data Citation 12: Figshare http://doi.org/10.6084/m9.figshare.c.3743288,Data Citation 13: Figshare http://doi.org/10.6084/m9.figshare.c.3743378). The tab-delimited annotation files of CAGE peaks are in (Data Citations Data Citation 11: LSDB Archive http://doi.org/10.18908/lsdba.nbdc01389-000.V002,Data Citation 12: Figshare http://doi.org/10.6084/m9.figshare.c.3743288,Data Citation 13: Figshare http://doi.org/10.6084/m9.figshare.c.3743378). The expression profiles of CAGE peaks in OSC table (Order Switchable Column table) (https://sourceforge.net/projects/osctf/) can be found in (Data Citations Data Citation 11: LSDB Archive http://doi.org/10.18908/lsdba.nbdc01389-000.V002,Data Citation 12: Figshare http://doi.org/10.6084/m9.figshare.c.3743288,Data Citation 13: Figshare http://doi.org/10.6084/m9.figshare.c.3743378).
Sample metadata is stored in sample and data relationship format (SDRF)19 at the ‘basic/’ directory at http://fantom.gsc.riken.jp/5/datafiles/reprocessed/.
Number of realigned CAGE reads
The numbers of CAGE reads successfully aligned to the genome assemblies are shown in (Table 4). Slightly less number of CAGE reads were successfully aligned with the current genome assemblies than the original ones for both human and mouse. This can be explained by the incorporation of complex and duplicated regions into the current genome assemblies, which made uniquely aligned CAGE reads in the original assemblies as aligned with multiple locations in the current assemblies. Overall, very similar number of CAGE reads are available before and after the realignments.
Number of converted CAGE peaks
The numbers of CAGE peaks converted from the original assemblies are shown in (Table 1). The majority of the peaks, more than 99%, were classified into the category ‘fair’. The detailed numbers of problematic and dropped peaks are shown in (Tables 5,6,7,8). We further inspected the problematic CAGE peaks, and found that the CAGE reads corresponding to the original peak were discarded due to low mapping quality (that is, aligned to multiple regions). This is again likely due to duplicated genomic regions that were introduced in hg38 and mm10.
Comparison of the fair CAGE peaks and DPI-called CAGE peaks
We compared the converted CAGE peaks in the ‘fair’ category with the peaks identified based on the read realignments with the latest genome assemblies (DPI CAGE peaks). The number of their overlaps (Table 9) shows that approximately 95% of a peak set is covered by the other. This result indicates high concordance of the converted CAGE peaks to the ones based on realignments, and underline a certain amount of additional CAGE peaks, ~8,600 and ~5,700 in human and mouse, were newly included in the CAGE peak set provided here.
Expressions of the CAGE peaks
Next, we examined expressions of the CAGE peaks on the original genome assemblies and the current ones (Fig. 2). Their expression levels are tightly consistent across the assemblies, as correlation coefficients are more than 0.99, using Spearman’s and Pearson’s correlation for human and mouse. At the same time, the scatter plot indicates a limited number of peaks that were quantified differently. For the majority of those inconsistent peaks, the expression levels were low, very likely due to the incorporation of duplicated regions in the current assemblies.
Annotation of the CAGE peaks
We compared CAGE peak annotations based on the latest genomes (hg38/mm10) with the original ones based on hg19/mm9. First, we compared the number of CAGE peaks that was associated with any transcripts, genes in Entrez Gene (human and mouse), HGNC (human), and proteins in UniProt (Table 10). The numbers of associated UniProt proteins were almost similar. However, the number of associated transcripts and genes were increased from the original ones by 15–20%. This may reflect the growing pace of the gene annotations in public databases, in particular non-coding RNAs and complex isoforms of protein coding genes. Next, we compared the number of CAGE peaks classified based on gene categories as defined in Entrez Genes (Table 11). The number of CAGE peaks with protein coding genes increased in the same ratio as with all genes. On the other hand, CAGE peaks with non-coding genes were drastically increased, which may reflect the recent numerous findings in non-coding RNA studies5.
How to cite this article: Abugessaisa, I. et al. FANTOM5 CAGE profiles of human and mouse reprocessed for GRCh38 and GRCm38 genome assemblies. Sci. Data 4:170107 doi: 10.1038/sdata.2017.107 (2017).
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The UCSC Genome Browser http://hgdownload.cse.ucsc.edu/goldenPath/mm9/liftOver/mm9ToMm10.over.chain.gz (2016)
FANTOM5 was made possible by a Research Grant for RIKEN Omics Science Center from MEXT to Yoshihide Hayashizaki and a Grant of the Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from the MEXT, Japan to YH and to the RIKEN Center for Life Science Technologies. This study is also supported by Research Grants from the Japanese Ministry of Education, Culture, Sports, Science and Technology through RIKEN Preventive Medicine and Diagnosis Innovation Program to YH and RIKEN Centre for Life Science, Division of Genomic Technologies to P.C. We would like to thank all members of FANTOM5 consortium for contributing to generation of samples and analysis of the data-set and thank GeNAS for data production.