Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network

Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.

Captured cDNA was released from the beads by heat shock and RNaseI treatment.
Beads were resuspended in 35ul of release buffer (1x RNaseONE buffer, 0.01% Tween20), incubated at 95C for 5min and chilled on ice immediately. The supernatant containing cDNA was transferred to a new tube. The beads were washed with 30ul of release buffer, and the supernatant was pooled together with the first elution. Then the sample was treated with RNase (0.1ul of 60U/ul RNaseH (TaKaRa) and 2ul of 10U/ul RNaseI for 30min at 37C) to remove RNA completely. Then the Cap-Trapped cDNA was purified with Agencourt AMPure XP (Beckman coulter) according to the manufacture's protocol. The cDNA quantity was determined with the Quant-iT OliGreen ssDNA Assay kit (ThermoFisher Scientific).

USER
To shorten the long polyT stretch of RT primer, the U residues in the RT primer were digested with USER enzyme (NEB). We added 2ul of USER enzyme (2U/ul), 5ul of 10x CutSmart buffer (NEB) and 3ul of water to 40ul of 5'linker ligated cDNA. We incubated the reaction solution at 37C for 30min and chilled on ice.
Then the dT stretch at 5'end of cDNA became 5nt. The cDNA was purified with AMPure XP beads.

3'SSLL
The cDNA solution was dried up using a SpeedVac (80C for 35min). The pellet was dissolved in 4ul of water. After incubation of cDNA solution at 95C for 5min and chilled on ice for 2min, 1ul of 3'CTR-Seq linker (10uM), which was incubated at 65C for 5min and chilled on ice, was added. Then 10ul of Mighty Mix was added, mixed gently and incubated at 16C for 16h. The sample after ligation was purified with AMPure XP.

SAP treatment
To digest excessed 3'linker and dephosphorylate the 3'end of 5'linker down strand, the cDNA was treated with 1ul of SAP (Affymetrics) and 2ul of USER in 1x SAP buffer, incubated at 37C for 30min. After reaction, the cDNA was purified with AMPure XP.

2 nd strand synthesis
The cDNA solution was concentrated to 5ul using a SpeedVac (80C for 35min Then the reads were oriented in the RNA forward-strand direction: analyze-linkers.py reads.fa linkers-reads.maf > reads-fwd.fa 2> linkers-reads.txt The .txt files have some statistics on linker analysis failures.
Finally, the reads were aligned to the genome: And alternative alignment formats were prepared: maf-convert -j1e6 psl reads.maf | grep -v linker > reads.psl pslToBed reads.psl reads.bed ## Warnings * The results include low-confidence alignments. In the maf files, each alignment has a "mismap" probability, which is the estimated probability that it's aligned to the wrong place.
* There are probably some incorrect alignments to processed pseudogenes. It's hard to avoid these completely. (There may also be correct alignments to processed pseudogenes.) * There may be an artifactual tendency for first exons to begin just after AG, and last exons to end just before GT. This is because the spliced alignment method does not treat linkers differently.
Supplementary  Figure 1d. Median CAGE signal for intragenic STRs = 0.57143 ; median CAGE signal for intergenic STRs = 0.52174 (two-sided Wilcoxon test p-value < 2.2e-16). The statistical significance of the test is merely due to the high number of elements considered in each case. By reducing that number to, for instance, 500 randomly chosen elements, this p-value can increase to 0.1544. We therefore concluded that there is no drastic difference between the CAGE signals observed at intra-and intergenic STRs.
Supplementary  Figure 6 -Features associated to transcription directionality [2] A. U1 binding sites. U1 PWM was built using MEME [3] and sequences encompassing -3/+10bp around FANTOM CAT 5' donor splice sites (exon 3' end). We then used this PWM to scan, with FIMO [4], 2kb regions centered around (T ) n 3' ends (top 50,000 with the highest CAGE signal) and FANTOM CAT TSSs. B. PolyA sites. We used the UCSC track corresponding to the predictions made by Cheng et al. [5], as a bed file and used it in bedtools intersect [6] to look at polyA site distribution in regions encompassing 1 kb around (T ) n 3' ends (top 50,000 with the highest CAGE signal) and FANTOM CAT TSSs. PolyA sites are enriched downstream FANTOM CAT TSSs, looking at the antisense orientation (or upstream in the sense orientation), as previously reported [2] . .  Figure 8 -Impact of the length of the sequences used as input of the CNN models. Spearman (orange) and Pearson (blue) correlations (y-axis)were computed between the predicted and the observed CAGE signal. Different sequence size were tested as input (50bp, 100bp, 150bp and 200bp). The size is indicated as multiples of 50bp on the x-axis. Only 6 representative STR classes are shown. . The mouse models are overall less accurate than human models ( Figure 6C). The (CT T T T ) n mouse model performs poorly in mouse and human (ρ < 0.2). Likewise, this model hardly predicts transcription at human (T ) n (ρ < 0.2). For other classes, the Spearman correlation between the signal predicted by the mouse model and the observed human signal was > 0.3, confirming that several features are conserved between human and mouse.   Figure 7C. Note that very few pathogenic variants are detected between -11 and 0 explaining why variance is close to 0 at these positions.