Monitoring transcription initiation activities in rat and dog

The promoter landscape of several non-human model organisms is far from complete. As a part of FANTOM5 data collection, we generated 13 profiles of transcription initiation activities in dog and rat aortic smooth muscle cells, mesenchymal stem cells and hepatocytes by employing CAGE (Cap Analysis of Gene Expression) technology combined with single molecule sequencing. Our analyses show that the CAGE profiles recapitulate known transcription start sites (TSSs) consistently, in addition to uncover novel TSSs. Our dataset can be thus used with high confidence to support gene annotation in dog and rat species. We identified 28,497 and 23,147 CAGE peaks, or promoter regions, for rat and dog respectively, and associated them to known genes. This approach could be seen as a standard method for improvement of existing gene models, as well as discovery of novel genes. Given that the FANTOM5 data collection includes dog and rat matched cell types in human and mouse as well, this data would also be useful for cross-species studies.

myoblast-like state, while the other set is made differentiate towards myotubes (where the suffix 'diff' is used to tell them apart). Total RNA was extracted using miRNeasy kit (QUIAGEN Valencia, CA, USA), following manufacturer's instructions. Age or developmental stage information for neither dog nor rat samples could be obtained. A detailed list of all samples used in this study is available in Supplementary  Table 1.
Library preparation and sequencing CAGE libraries were prepared for single molecule sequencing as described previously 26 . Two protocols were applied for the preparation of dog samples. The standard library preparation method used 5 ug of total RNA, whereas a low quantity protocol was used for a sample of 1 ug or less total RNA (dog aortic smooth muscle cells donor1). CAGE libraries were subsequently sequenced on HeliScope sequencers 27 , following the manufacturer's instructions.

Mapping and data processing
Sequenced libraries were first filtered for reads aligning to ribosomal DNA with up to two mismatches and for artifacts. We used our own developed tool Tag-dust 28 to remove artifactual sequences. Retained reads were aligned to canFam3 (Sept 2011, Broad canFam3.1/canFam3) and rn6 (July 2014, RGSC 6.0/rn6) reference genomes by delve (see Code availability), which leverages on a hidden markov model (HMM) to assign probability scores to all mapped positions in the genome. It then uses all the probability scores obtained to calculate the most likely alignment. Mapping of each library was achieved by performing the following commands: 'delve seed -l 12 -s 8 -o SEEDED_FILE.SAM -t 8 FILTERED.FA GENOME' and 'delve align -u 1 -o ALIGNED.BAM -t 8 SEEDED_FILE.SAM GENOME' SEEDED_FILE.SAM is the output seeded file from FILTERED.FA, ALIGNED.BAM is the final alignment and GENOME is the genome assembly used to map the reads to.
The first command is used to seed the alignment, with a seed length of 12 bases (-l option), step size 8 (-s) and with max 8 threads (-t). The second command is used to build the HMM model and assign the scores to the aligned reads. Option -u ensures that only one alignment is reported (with the highest score).
Expression quantification was performed as described previously 16 . Simply, frequencies of the observed transcription start sites (TSSs) at a single base resolution were extracted from the alignments by using a combination of samtools 29 and bedtools 30 , such that all mapped CAGE reads aligning at the same 5′ position accounted for the expression of the TSS at that position. We retained only aligned reads with 99% accuracy (mapping quality q>20) and with sequence identity above 85%. In practice, each alignment, for both dog and rat, was processed as follows: 'samtools view -q 20 -F 768 -u ALIGNED.BAM|bamToBed -i stdin| awk 'BEGIN{OFS = "\t"}{if ($6 = = "+"){print $1,$2,$5}}'| sort -k1,1 -k2,2n| groupby -i stdin -g 1,2 -c 3 -o count| awk 'BEGIN {OFS = "\t"}{print $1,$2,$2+1, $1":"$2".."$2+1",+",$3,"+"}' The commands ensure that the reads are filtered for any other bad alignments (using the flag value 768) that may still pass the mapping quality threshold, then converted to a bed file. TSS-level expression is then obtained by grouping, and counting, the reads with the same start position (done for plus and minus strand orientation separately, here shown only for '+' strand). The total number of uniquely mapped reads for each sample is given in Supplementary Table 1. Aligned reads (BAM format) and 1-base resolution frequencies of transcription initiation (BED format) are publicly available via DDBJ (see Data Records, Data Citations 1 and 2).

Identification of promoter regions
CAGE peaks were defined by using a Decomposition Peak Identification (DPI) method as described previously 16 . Briefly, regions of continuous, composite signal were identified genome-wide, were subsequently decomposed across samples in order to discriminate the signal coming from distinct samples, and finally peaks were called based on distance and a minimum read counts metrics. Peaks were obtained by running the software with default setting, taking as input the BED files of 5′-end frequencies produced as described above. Two sets of peaks were generated: a 'permissive' set of promoters with minimum 3 read counts in a single position in at least one sample, and a 'robust' set with minimum 1 tpm and 10 read counts in a single position in at least one sample. This ensured that the sets of peaks for dog and rat were comparable to those generated for human and mouse 16 . Since HeliScopeCAGE is an amplification-free protocol based on single-molecule sequencing, individual reads represent independent evidence of capped 5′-ends. In defining the peak sets, read counts corresponding to the number of observations were primarily employed as a threshold, and the expression level (1 tpm threshold) was added to ensure the robust set included only peaks with a certain minimum level of expression.
The peaks were annotated based on Ensembl transcripts, Augustus, GeneScan, GeneID, RefSeq and EST gene models obtained from UCSC genome browser 22 (download date 06/2016). As an annotation rule we followed what was done previously 16 , that is a CAGE peak overlapping a 1 kb region centred on the gene's 5′ end on the same strand orientation is associated to that gene.

Projections of human promoters to dog and rat
The conversion of the promoters' genomic coordinates from human to dog and rat was performed using liftOver tool (see Code availability). Default parameters with pairwise alignment chain files 'hg19ToCanFam3.over.chain.gz' and 'hg19ToRn6.over.chain.gz' were used to convert human coordinates into rat and dog (downloaded from the UCSC site http://hgdownload.cse.ucsc.edu/goldenPath/hg19/ liftOver/). In total, 129,287 (64%) and 111,218 (55%) human CAGE promoters could be projected to dog and rat, respectively. In order to make sure that we considered likely conserved promoters only, we required that a projected promoter be within 50 bp both upstream and downstream of the promoter in the destination genome, following a similar rationale as applied in Young et al. 31 . Past works on promoter architecture and evolution across species showed that conservation decreases beyond a 70 nt distance 32 . Moreover, since the set of human CAGE peaks was obtained from profiling thousands of samples, we filtered the set to only use those likely conserved peaks that were expressed in the corresponding cell types to rat and dog. A summary of the number of projected promoters is given in Table 1.

RNA-seq data processing
Publicly available RNA-seq data were downloaded for dog (Data Citations 3 to 5) and rat (Data Citation 6). The rat (Fisher 344 strain) data sets comprise tissues sampled from 11 different organs (muscle, spleen, brain, uterus, testis, kidney, heart, thymus, lung, liver, and adrenal gland), of both sexes and at varying development stages (2 to 100 weeks old). We arbitrarily chose three female and three male samples for each tissue origin, with the exceptions of uterus and testis. RNA-seq samples for dog represented heart (normal and diseased), pituitary, adrenal cortex, and lymph node tissues (normal and B-cell lymphoma). Ages of the dog samples used in the heart disease study varied from 3 to 17 years, and pituitary and adrenal samples were collected from adult dogs; we couldn't find age information for the lymphoma study samples. A summary of the data sets used is given in Supplementary Table 2 for rat and  Supplementary Table 3 for dog.
Data sets were reprocessed using HISAT2, an improved version of HISAT tool 33 , to align them to the canFam3 and rn6 reference genomes.

Additional data processing
Genomic distribution of the CAGE peaks and screening for TATA-box motifs was performed via HOMER 34 (v4.8.1) using the 'annotatePeaks.pl' function with standard parameters, except for the TATA motif annotation where a region 500 bp upstream and 200 bp downstream around the peak was scanned. The HOMER default reference gene set for genomic distribution annotations is RefSeq, and the default promoter region window is from 1,000 bp upstream to 500 bp downstream of a gene TSS: 'annotatePeaks.pl PEAK_FILE.BED GENOME -size -500,200 -m data/knownTFs/motifs/tata.motif -multi -CpG -noann>tata_annotation.txt' 'annotatePeaks.pl PEAK_FILE.BED GENOME -size given -go DEST_DIR -genomeOntology DEST_DIR -annStats stats_record.txt>annotation.txt' Student's t-test was performed on the two sets of CpG-and TATA-associated promoters using the R language function t.test() with default parameters. Clustering of samples was performed using R bioconductor edgeR package 35 . Specifically, normalization of expression values was calculated with the function 'calcNormFactors()' and the plot was visualized with the function 'plotMDS()'. Figures were all generated using R, version 3.1.3.

Code availability
Mapping was performed by our own algorithm delve, version 0.95, described in Djebali et al. 37 specifically developed for handling data derived from HeliScope single molecule sequencing. The software is available at this URL (fantom.gsc.riken.jp/5/suppl/delve/). The DPI peak calling was performed using our own software, detailed in Forrest et al. 16 . It can be freely downloaded at this URL (https://github.com/hkawaji/dpi1/).

Data Records
The data resulting from sequencing (fastq format), mapping (BAM format) and TSS profiling (BED format) can be found in DDBJ data repository (Data Citations 1 and 2). The same datasets, together with the identified CAGE peaks sets and association to gene models can also be found within the FANTOM data repository (http://fantom.gsc.riken.jp/5/datafiles/latest). The folder 'basic' contains the primary mapping data (BAM format) and TSSs (BED format). The folder 'extra' contains the subfolder 'CAGE_peaks' for the coordinates of the identified peaks (BED format), peak-based expression tables for raw and normalized counts (tabular text format) and associations to gene models (tabular text format). A detailed description of the samples used in this study, such as cell type, strain, species, tissue of origin, is also given in Supplementary Table 1.

Technical Validation
Primary cell data show high reproducibility across replicates Biological replicates of each cell type were tested in order to assess the reproducibility of the experiments.
Results show agreement within all three replicates for all cell types (Table 2), as seen, for instance, for two of the rat aortic smooth muscle (AoSM) cells replicates where Spearman correlation is 0.97 (Fig. 2a). The same level of reproducibility is evident also in the dog primary cell replicates (Fig. 2b), although we obtained less amount of RNA for one of the dog AoSM cell replicate. That forced us to adopt a variation of the standard library preparation protocol (see Methods), which resulted in a lower total read count and consequently in a slightly lower correlation (0.83, Spearman). The correlation between the differentiated and genuine AoSM cells is still relatively high, whereas it decreases when AoSM cells are compared to phenotypically distinct cell types such as hepatocytes, which means that the CAGE expression alone is already enough to distinguish between those two states. Clustering analysis via multi-dimensional scaling (see Methods) shows that samples group based on cell type (Fig. 2).

Characterization of the dog and rat CAGE promoters
All samples were sequenced on HeliScope single molecule sequencer 27 . While the number of mapped reads across samples is comparable in general, many factors can affect it, like RNA amount, RNA quality, protocol efficiency or sequencing errors. The total number of mapped reads for the rat samples varied between 1 M (mesenchymal stem cell donor 2 and 3) and 7 M (AoSM cell), while the universal RNA tissue sample topped 10 M uniquely mapped reads. Uniquely mapped reads for dog samples were, as stated above, generally lower than for rat but with some exceptions, varying between 500,000 of mesenchymal stem cells and 10 M of hepatocytes (Supplementary Table 1). We observed a consistent lower number of uniquely mapped reads for the mesenchymal stem cells in both species, compared to the other cell types. At any rate, the percentage of mapped reads at putative TSSs is more or less constant, around 70% (Fig. 3a), suggesting that the majority of the CAGE signal is concentrated around specific genomic locations (i.e., CAGE peaks) identifying genuine transcription initiation rather than being scattered everywhere.
We calculated the CAGE based expression levels in each sample separately, and then aggregated the signal into CAGE peaks representing transcription initiation events, or promoters (see Methods). The

AoSMCdiff1
AoSMCdiff2 total number of CAGE peaks is comparable across species: we obtained 28,497 and 23,147 'robust', and 92,031 and 85,324 'permissive' promoters for rat and dog, respectively. Unless explicitly stated, all the analyses presented in this work are based on the robust set of promoters only. One of the main points of the DPI method in identifying peaks is the ability to decompose big regions of CAGE signal into smaller, non overlapping segments with expression. In order to verify such ability, we calculated the distribution of CAGE peaks sizes and confirmed that they were a) short and b) were consistent between species. Distributions of peak sizes show that they tend to be less than 150 bp long, with the majority of them being 10 to 30 bp (Fig. 3b). By comparison, human and mouse promoters can be longer, up to 300 and 200 bp, respectively 16 , but the majority of them are around 20-25 nucleotides long, resembling the distributions in rat and dog datasets.
Finally, we evaluated the ratio of CpG-and TATA-rich regions. Promoters are generally classified as either sharp, TATA-rich, shorter regions that harbour most of the expression within a few nucleotides, or broad, CG-rich ones where the expression is distributed on a larger region with sub-peaks of sharp expression 32,39 . Moreover, TATA-rich promoters tend to be associated to housekeeping genes 16,32 . CpG islands are one of the most prominent indications of promoters 32 , and indeed we found 57% (16,327) and 60% (13,863) of rat and dog CAGE peaks, respectively, overlapping a CpG. Since there is no publicly available dataset for TATA-rich regions in dog and rat, we used HOMER motif finding tool 34 to check for enrichment of TBP (TATA binding protein) sites around our promoters. An appreciable number of CAGE peaks harboured TATA-box motifs in the region 500 bp upstream and 200 bp downstream of the peak region (32.7% (9,338) and 29% (6,747) in rat and dog respectively), with a clear preference for the region 30-35 bp upstream of a TSS (Fig. 3c). We next subdivided the CAGE peaks into three categories, TATA-only (3,775 dog and 4,763 rat), CpG-only (10,891 and 11,752 dog and rat respectively) and both, and compared their sizes. Although it is not as evident as in human promoters, we could observe that TATA-only promoters tend to be shorter than the CpG-only ones (Student's t-test, P-valueo 2.2e-16 in both species) (Fig. 3d). These results are in line with what is known for other model organisms 13,16 .  CAGE can contribute to refine the promoter landscape of poorly characterized genomes In order to have a rough estimate of what genomic features the CAGE promoters are closest to, we took the dog and rat peaks and annotated them by using HOMER's annotation tool (see Methods). We found that 58% (16,625) of rat CAGE peaks are located near TSSs of known genes (Table 3). In the case of dog CAGE peaks, the picture that emerged was different, with only 1,394 peaks annotated as promoters of known RefSeq genes, while the majority of them were annotated as CpG-island (Table 3), highlighting the fragmentary coverage of the dog genome. In fact, the total number of RefSeq genes for dog is as low as about 2,300. As we would like to employ CAGE as an accessory means to improve resolution, coverage and accuracy of the existing gene models, we first downloaded all available gene models, either manually curated or predicted, from UCSC genome browser 22 for both dog and rat, and associated them to the CAGE peaks (see Methods) to see how many we could recover.
For the rat CAGE peaks, we could link 19,254 of them (68%) to the TSS of a known Ensembl transcript and 17,188 (60%) to that of a known RefSeq gene (Fig. 3e). Distributions of the distances between all CAGE-defined TSSs and the genes' TSSs show that CAGE technology captures loci of genuine transcription initiation (Fig. 3f). For 2,289 (8%) CAGE peaks we found no association to any of the gene references. Of those, however, 1,761 (77%) were overlapping with one or more of CpG islands (harbouring the potential for novel TSSs), repeats, gene bodies (i.e., introns or exons more than 500 bp downstream of the TSS). The remaining 520 un-annotated peaks are either intergenic, potential bidirectional enhancers 11 , or are located more than 500 bp upstream of a known gene TSS.
In the case of the dog dataset we could associate 10,542 (46%) of the CAGE peaks to Ensembl transcripts, and only 1,350 (6%) to known RefSeq genes (Fig. 3e). The large difference is likely due to gene coverage, as only~2,200 genes are annotated by RefSeq, which adopts strict experimental validation criteria 40 , against the~39,000 transcripts (corresponding to about 29,000 genes) reported in Ensembl. On the other hand, the proportion of Ensembl transcripts covered by a CAGE peak in dog is as low as 21% (Table 4), not much lower than what observed in rat (27%). This, and the fact that CAGE peaks tend to align at the TSS of known genes (Fig. 3f) together suggest likely incorrect or incomplete gene annotations. The number of un-annotated peaks for the dog data set was 4,550 (19%), with 873 of them found far   from other known features. The peaks-to-genes associations are summarised in Table 4, while the total numbers of CAGE peaks associated to zero or more gene models are listed in Table 5.
To check whether we could increase the gene annotation rate, we converted the coordinates all human robust CAGE promoters identified in a previous study 16 to both dog and rat genomes via the liftOver tool (see Methods). These human projections would thus serve to confirm the already annotated peaks and reduce the number of the un-annotated ones, in a similar manner to a guilt-by-association exercise. We chose the human CAGE promoters set primarily because it is the most accurate, comprehensive and characterized, but also because a lift-over of full transcripts to other species is error-prone due to length, splicing, or directionality of genes. For all peaks, we adopted a distance rule requiring the human projected peak be within 50 bp of a dog or rat peak in order to consider the alignments reliable (see Methods). Overall, 79% of dog and 67% of rat CAGE peaks were aligned with a projected human peak expressed in the same cell types (Table 1). Regarding the un-annotated peaks, we required in addition that their aligned human peak be annotated and within 10 kb of a dog or rat orthologous gene to be eligible for annotation. We used Ensembl genes for this survey. Following this principle, we isolated 205 and 801 unique rat and dog peaks respectively, and named them Rescued CAGE Peaks (RCP) (Data Citation 7).
As an additional support, we also used publicly available RNA-seq data for dog and rat (Data Citation 3 to 6), as close to the cell types we profiled as possible, and used them to check whether the RCP are located at the 5′-end of the RNA-seq model as well. Even though it is not a general rule, we've noticed several cases where if the RCP are near the dominant human peak (most expressed, noted as 'p1@gene') they exhibit higher expression, the RNA-seq model supporting the gene is better defined and chances of them representing a better TSS for the gene are higher.
One example is LOXL3 gene, which was not associated to any CAGE peak in dog because the TSS of its corresponding Ensembl transcript (ENSCAFT00000013310.3) is more than 1 kb away; however, there are two lifted-over human LOXL3 peaks within 50 bases of the un-annotated dog CAGE peaks (Fig. 4a). RNA-seq data shows that the 5′ end of LOXL3 gene is indeed further upstream than the current defined model, judging by the drop of signal at the start of the CAGE peak. As a comparison, we checked the rat genome, where Loxl3 gene is associated to a CAGE peak, instead, and exhibits a similar RNA-seq profile defining the 5′ end together with the same human LOXL3 peaks (Fig. 4b).
We tried to find out whether those RCP or their associated genes had some features in common. We observed that all RCP in both dog and rat were relatively short, with a median size of 11 bp and tend to be expressed at low levels (Data Citation 7). We next looked at the RCP nearby genes, noticing that, for the vast majority, they are very big, with a median length of 80 kb in rat and 65 kb in dog (Data Citation 7). Since Ensembl genes are predicted, many of them may represent mere open reading frames (ORFs), and in that case the reported TSS would not represent the real one, but just the beginning of the ORF; this scenario could explain why several peaks could not be associated to genes in the first place. We checked this by calculating the distance between the genes' transcription and coding sequence start, which was    7). In comparison, of the genes associated to a CAGE peak only 12.5% of them in rat are ORFs, while for dog, where the number of curated genes (i.e., RefSeq) is overall lower, the proportion is 52%. Alternatively, the observation that most genes were associated to multiple RCP found within their body could signify the existence of previously unknown isoforms 13,16,32 (i.e., Ncor2 gene in rat, http://fantom.gsc.riken.jp/zenbu/ gLyphs/#config = 3eolUxzyN1nm167YHayYLC;loc = rn6::chr12:36831460..37074148+). Relative to the type of the RCP associated genes, we found that 25% (36/147) of those in rat and 32% (155/478) in dog are known TFs in human; 8 TF genes (AHR, AHRR, HOXB6, JARID2, KDM2A, NCOR2, TSC22D2, ZC3H4) are listed in both RCP sets. This simple survey showed that, although a systematic validation of all RCP should be pursued, CAGE profiling can be considered as a useful tool to refine existing gene models, even with a limited number of cell types at disposal.

Usage Notes
The FANTOM web portal (fantom.gsc.riken.jp) provides a starting point for data download and exploration. Genomic coordinates and frequencies of TSSs at a single base-pair resolution can be found in fantom.gsc.riken.jp/5/datafiles/latest/basic/, whereas the CAGE peaks, their expression across samples and their gene annotation can be found in fantom.gsc.riken.jp/5/datafiles/latest/extra/. Transcription initiation signals can also be used to identify transcribed enhancers 11 ; however, it has to be noted that the signal from enhancers are quite weak in general. Given the quite limited number of samples, it was not possible to identify enhancer profiles in the two species dog and rat.
All data available through the web portal are public and thus free to use.