Multi-landmark alignment of genomic signals reveals conserved expression patterns across transcription start sites

Vilar, Jose M. G.; Saiz, Leonor

doi:10.1038/s41598-023-37140-x

Article
Open access
Published: 05 July 2023

Jose M. G. Vilar^1,2 &
Leonor Saiz³

Scientific Reports volume 13, Article number: 10835 (2023)

Abstract

The prevalent one-dimensional alignment of genomic signals to a reference landmark is a cornerstone of current methods to study transcription and its DNA-dependent processes but it is prone to mask potential relations among multiple DNA elements. We developed a systematic approach to align genomic signals to multiple locations simultaneously by expanding the dimensionality of the genomic-coordinate space. We analyzed transcription in human and uncovered a complex dependence on the relative position of neighboring transcription start sites (TSSs) that is consistently conserved among cell types. The dependence ranges from enhancement to suppression of transcription depending on the relative distances to the TSSs, their intragenic position, and the transcriptional activity of the gene. Our results reveal a conserved hierarchy of alternative TSS usage within a previously unrecognized level of genomic organization and provide a general methodology to analyze complex functional relationships among multiple types of DNA elements.

Introduction

Genomic signals encapsulate highly detailed quantitative information up to the nucleotide level¹ on key aspects of DNA transcription, the subsequent RNA processing, and multiple DNA-dependent processes, including DNA methylation², transcription factor binding³, and CRISPR-Cas9 efficiency⁴. At the core of interpreting this information, there are specific genomic locations, or genomic landmarks, such as TSSs, transcription factor binding sites, RNA splice junctions, or the midpoint of a DNA extended region⁵. These landmarks provide anchoring points to summarize general trends and characterize different types of DNA regions.

The prototypical approaches to analyze these data start with the alignment of the signals to a landmark along a one-dimensional coordinate for subsequent processing. In mathematical terms, the alignment of a genomic signal $g\left( z \right)$ along the coordinate $z$ to a landmark with position denoted by $z_{U}$ leads to a relative coordinate $x = z - z_{U}$ and an aligned signal $g\left( {x + z_{U} } \right)$. The most widely used type of processing is the aggregation of alignments for multiple positions $z_{U}$ of the landmark, which leads to an average signal $G\left( x \right) = \left\langle {g\left( {x + z_{U} } \right)} \right\rangle _{{z_{U} }}$. This approach has provided general information as diverse as the sharp dependence of CRISPRi/a activity on both the proximity of a TSS and nucleosome occupancy^6,7; how the directionality of promoters reflects on the asymmetry of DNA accessibility and histone methylation signals around TSSs⁸; and the enrichment or depletion of single nucleotide variation occurrence around multiple landmarks in the genomes of human populations⁹. To capture the inherent heterogeneity, aligned signals are often structured into heatmaps¹⁰, which can be sorted and clustered according to specific parameters¹¹ and can be incorporated into automated machine-learning pipelines¹². This type of one-dimensional alignment is also the usual approach to link genomic signals with the results of methodologies, such as chromosome conformation capture techniques^13,14, that map the three-dimensional DNA looping^15,16 interactions between distal DNA elements.
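As a minimal sketch of the aggregated one-dimensional alignment $G\left( x \right)$ described above, the following Python function averages a signal over landmark positions for a set of relative offsets. The function name `aggregate_1d`, the dense-array representation of the signal, and the choice to ignore strand orientation are assumptions made for illustration only; they are not taken from the original analysis.

```python
import numpy as np

def aggregate_1d(signal, landmarks, offsets):
    """Average g(x + z_U) over landmark positions z_U for each offset x.

    Assumes `signal` is a dense array indexed by genomic coordinate and
    that all landmarks lie on the same strand (hypothetical simplification).
    """
    signal = np.asarray(signal, dtype=float)
    totals = np.zeros(len(offsets))
    counts = np.zeros(len(offsets))
    for z_u in landmarks:
        for i, x in enumerate(offsets):
            z = z_u + x
            if 0 <= z < signal.size:  # skip positions beyond the array ends
                totals[i] += signal[z]
                counts[i] += 1
    # Offsets that no landmark could reach are reported as NaN.
    return np.divide(totals, counts,
                     out=np.full(len(offsets), np.nan),
                     where=counts > 0)

# Toy example: a ramp signal aligned to two landmarks.
profile = aggregate_1d(np.arange(10), [2, 5], [-1, 0, 1])
```

In this toy case the offset $x = 0$ averages the signal values at positions 2 and 5, so the profile reflects only the mean trend and carries no information about how the two landmarks are positioned relative to each other.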

The alignment with respect to a single position, however, is frequently ambiguous because regulatory regions often involve multiple relevant landmarks^17,18. The presence of a landmark, such as a TSS, can often affect the functioning of another one and, in general, multiple landmarks can affect each other's function. To analyze functional relationships among multiple types of DNA elements, we develop a method to consider multiple landmarks at the same level (Fig. 1). The main idea is to align the signal to multiple locations through the expansion of the dimensionality of the genomic-coordinate space by considering relative coordinates from the different landmarks.

Results

Simultaneous alignment to multiple positions

To consider genomic signals in two dimensions, we expand the genomic coordinate $z$ with respect to the positions of the upstream, $z_{U}$, and downstream, $z_{D}$, landmarks into $x = z - z_{U}$ and $y = z - z_{D}$. Explicitly, this transformation assigns the value of the signal $g\left( z \right)$ to the coordinates $\left( {x,y} \right) = \left( {z - z_{U} ,z - z_{D} } \right)$ for each value of the genomic coordinate $z$ and for each pair of landmarks. To eliminate the dependence on the genomic coordinate $z$, we consider first that $x$ and $y$ correspond to the same genomic coordinate, which leads to a line in the $x,y$-plane defined by $y + z_{D} = x + z_{U}$. Secondly, we consider the signal along this line in the two-dimensional space described mathematically by $f\left( {x,y} \right) = g\left( {x + z_{U} } \right)\delta_{{y,x + z_{U} - z_{D} }}$, where $\delta_{i,j}$ represents the Kronecker delta function, which is one if $i = j$ and zero otherwise. Finally, we also consider the unit signal $n\left( {x,y} \right) = \delta_{{y,x + z_{U} - z_{D} }}$ along the same line in the $x,y$-plane. This description allows the efficient computation of the two-dimensional average signal density, $G\left( {x,y} \right) = \left\langle {g\left( {x + z_{U} } \right)\delta _{{y,x + z_{U} - z_{D} }} } \right\rangle _{{R\left( {x,y} \right),~\{ z_{U} ,z_{D} \} }}$, over pairs of landmarks $\{ z_{U} ,z_{D} \}$ and a two-dimensional sliding window $R\left( {x,y} \right)$ around $\left( {x,y} \right)$. 
The average is defined as the sum of the two-dimensional representation of the signal over the same sum for the unit signal, which in mathematical terms leads to $G\left( {x,y} \right) = \frac{1}{{N\left( {x,y} \right)}}\mathop \sum \limits_{{\left( {x^{\prime},y^{\prime}} \right) \in R\left( {x,y} \right)}} \mathop \sum \nolimits_{{\left\{ {z_{U} ,z_{D} } \right\}}} f\left( {x^{\prime},y^{\prime}} \right)$, where the normalization factor is $N\left( {x,y} \right) = \mathop \sum \limits_{{\left( {x^{\prime},y^{\prime}} \right) \in R\left( {x,y} \right)}} \mathop \sum \nolimits_{{\left\{ {z_{U} ,z_{D} } \right\}}} n\left( {x^{\prime},y^{\prime}} \right)$. Intuitively, the approach generates a two-dimensional representation because different pairs of landmarks lead to distinct lines on the plane where $f\left( {x,y} \right)$ is different from zero. Collectively, these lines cover a two-dimensional area. (Specific details are provided in the “Methods” section.) The use of the Kronecker delta function is also useful because it allows the straightforward extension of the methodology to multiple dimensions. For instance, in the case of three locations, the aligned three-dimensional signal is given by $g\left( {x + z_{U} } \right)\delta_{{y,x + z_{U} - z_{D} }} \delta_{{v,x + z_{U} - z_{F} }}$, where $v = z - z_{F}$ is the relative position associated with the landmark with position $z_{F}$.
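The construction above can be illustrated with a brute-force sketch in Python. It accumulates the signal $f\left( {x,y} \right)$ and the unit signal $n\left( {x,y} \right)$ on an integer grid and then takes their windowed ratio. The function names, the dense grid, and the toy coordinates are assumptions for illustration; a genome-scale implementation would use the more efficient region sums given in the Methods.

```python
import numpy as np

def accumulate_2d(signal, pairs, x_range, y_range):
    """Sum f(x, y) and n(x, y) over landmark pairs on an integer grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    shape = (x_max - x_min + 1, y_max - y_min + 1)
    f_sum = np.zeros(shape)
    n_sum = np.zeros(shape)
    for z_u, z_d in pairs:
        for x in range(x_min, x_max + 1):
            y = x + z_u - z_d   # Kronecker delta: only this y contributes
            z = x + z_u         # genomic coordinate shared by x and y
            if y_min <= y <= y_max and 0 <= z < len(signal):
                f_sum[x - x_min, y - y_min] += signal[z]
                n_sum[x - x_min, y - y_min] += 1
    return f_sum, n_sum

def density(f_sum, n_sum, i, j, half_x, half_y):
    """G at grid index (i, j): windowed sum of f over windowed sum of n."""
    rows = slice(max(i - half_x, 0), i + half_x + 1)
    cols = slice(max(j - half_y, 0), j + half_y + 1)
    n = n_sum[rows, cols].sum()
    return f_sum[rows, cols].sum() / n if n > 0 else float("nan")

# Toy example: two landmark pairs with different separations trace
# two distinct diagonal lines on the (x, y) grid.
f, n = accumulate_2d(np.arange(20.0), [(5, 8), (10, 12)], (-2, 4), (-4, 2))
g_center = density(f, n, 2, 1, 1, 1)  # window centered at (x, y) = (0, -3)
```

Each pair contributes along the line $y = x + z_{U} - z_{D}$, so pairs with different separations populate different diagonals, and the sliding window pools neighboring diagonals into a density.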

Transcriptional activity shows a complex dependence on multiple TSSs

The resulting multidimensional signal density provides a precise general description to analyze any function of a genomic coordinate in terms of the distances from multiple genomic landmarks. We use this approach to study the dependence of transcription, as reported by RNA sequencing (RNA-seq), on pairs of consecutive TSSs. Specifically, we focus on how transcription in humans at a given genomic location depends on the relative positions of two TSSs, including how the presence of a TSS correlates with transcription at another TSS. The transcription of mammalian genomes^19,20, with an average of four TSSs per gene²¹, is particularly relevant. This is because the arrangement of TSSs according to different positional patterns, such as those in focused or dispersed promoters, is associated with different types of transcriptional programs²². TSS locations were obtained from the comprehensive gene annotation on the reference chromosomes of Gencode V19. By considering the comprehensive set of annotated TSSs rather than only the ones expressed in each particular cell type, we could also investigate the factors that correlate with alternative TSS expression.

As a representative case, we explicitly consider K562 human myeloid leukemia cells for the first and second TSSs (Figs. 2A and S1A) and the second and third TSSs (Figs. 2B and S1B) of each protein-coding gene. Here, TSSs are ordered according to their genomic position, starting the enumeration from the most upstream TSS. The two-dimensional RNA-seq signal density $G\left( {x,y} \right)$ reveals a strong dependence on the relative position of pairs of TSSs. There are dominant trends, such as a high transcriptional signal density downstream of the TSSs and the suppression of the signal upstream of a TSS.

Many key features, however, are strongly dependent on the intragenic position of the TSSs and the transcriptional activity of the gene, which we have stratified as high, medium–high, medium, medium–low, low, and zero (Figure S2). Without this stratification, the signal would be dominated by highly transcribed genes. The most salient general feature is the absence of substantial transcription at the first annotated TSS of highly transcribed genes irrespective of its distance to the second one. Transcription at the first annotated TSS becomes more prominent only as the activity of the gene decreases. Another general salient feature is the high RNA-seq signal density just downstream of two TSSs that are close to each other.

Quantitative characterization of TSS-proximity dependent effects on gene expression

To accurately characterize the observed dependence patterns, we consider seven regions of the two-dimensional signal density (Fig. 2C). Five of the regions are located immediately downstream of one of the TSSs and are distinguished by the relative position of the other TSS. The additional TSS can be located downstream at distal and at intermediate distances (regions A and B, respectively) or upstream at proximal, at intermediate, and at distal distances (regions C, D, and E, respectively). The other two regions are located at intermediate distances downstream of a TSS and at distal and at non-distal distances upstream of the next TSS (regions Af and Bf, respectively).

Explicitly, comparing RNA-seq densities in region B with those of region A indicates that the proximity of the 2nd TSS strongly correlates with reduced transcription at the 1st TSS. These upstream effects of the 2nd annotated TSS extend up to distances of ~1 kbp. In contrast, transcription in region D is higher than in region E, which shows that the downstream effects of the 1st annotated TSS statistically enhance transcription at the 2nd TSS. This effect, which we have termed proximal cooperativity, is even more marked when comparing transcription in region C with transcription in region E, indicating that on average there is more transcription at the 2nd TSS the closer it is to the 1st TSS. To compare transcription when the two TSSs are far from each other, we consider regions A and E. For highly transcribed genes, transcription is much more prominent at the 2nd than at the 1st TSS. This distal positional dominance of the downstream TSS shifts to the upstream TSS as the transcriptional activity of the gene decreases.

The statistical interdependence of the RNA-seq signal at the first pair of annotated TSSs is also present to a large extent at the second and third TSSs (Fig. 2B). Proximal, intermediate, and distal effects, except for the intermediate upstream effects for high transcription, closely parallel those of the first pair of annotated TSSs. Intermediate upstream effects change from negative to positive for highly transcribed genes for TSS pairs after the pair comprising the first and second TSS. This effect could originate from enhanced DNA accessibility due to high transcription initiated at upstream TSSs, which is not present at the first TSS.

After transcription initiation, the average RNA-seq signal is expected to be lost progressively due to multiple processes, including transcription abortion, transcription termination, and RNA processing^23,24. The persistence of the RNA-seq signal is strongly influenced by the position of the downstream TSS (Figs. 2 and S1). Explicitly, persistence with a non-distal downstream TSS (average signal in region Bf compared to that of region B) is substantially higher than persistence with a distal downstream TSS (average signal in region Af compared to that of region A), especially for low and medium values of the transcriptional activity. Therefore, the presence of a nearby downstream TSS correlates with lower transcription initiation but, at the same time, with more persistent RNA-seq signals. Any dependence on the TSS arrangement of the processes that lead to loss of the average RNA-seq signal could affect persistence. For instance, a downstream TSS located nearby could favor a lower intron-to-exon RNA ratio between the two TSSs, thus promoting more persistent RNA-seq signals. The presence of a downstream TSS could also lead to an increase in DNA accessibility between the two TSSs, positively correlating with higher transcriptional progression and lower abortion rates. Regarding the absolute value of the average RNA-seq signal between TSSs, it tends to be higher as the downstream TSS gets closer to the upstream TSS (average signal in region Bf compared to that of region Af), which we refer to as signal dominance (Figs. 2 and S1).

Transcription initiation is statistically dependent on neighboring TSSs

To investigate the statistical interdependence of transcription initiation at neighboring annotated TSSs, we computed the two-dimensional signal densities for RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression (RAMPAGE) data²⁵ in the same way as for RNA-seq data (Fig. 3). This technique provides specific sequencing of 5′‐complete complementary DNAs and avoids counting transcripts that initiate at other TSSs. The results show that the interdependence of RAMPAGE densities at the TSSs mimics to a large extent the phenomenology observed for RNA-seq densities immediately downstream of the TSSs (Fig. 2), including proximal, intermediate, and distal effects. Outside the TSS region, RAMPAGE densities are zero. The qualitative similarities between transcription initiation and transcription immediately downstream of the TSSs are consistent with a hierarchy of alternative TSS usage in delineating the overall RNA-seq signal. There are, however, general trends in the two-dimensional RNA-seq signal density space, such as differential persistence depending on the position of the closest downstream TSS, that extend beyond transcription initiation.

Interdependence of transcription on neighboring TSSs is regulated

We investigated how the interdependence of the RNA-seq signal on consecutive pairs of TSSs is associated with known transcriptional regulation features. Explicitly, we considered three types of data: chromatin immunoprecipitation followed by sequencing (ChIP-seq) data for POLR2A as a reporter of RNA polymerase II (Pol II) occupancy (Figs. 4A and S3A); DNase I hypersensitivity analysis followed by sequencing (DNase-seq) data as a reporter of DNA accessibility (Figs. 4B and S3B), which is required for transcription factors and other regulatory proteins to bind DNA; and ChIP-seq data for the active chromatin marker H3K4me3 (Figs. 4C and S3C). The results show that changes in the transcriptional activity downstream of a TSS are correlated with changes in the transcriptional regulation features, indicating that the interdependence of transcription initiation and transcription at neighboring TSSs originates at the regulatory level.

Explicitly, the presence of a downstream TSS negatively impacts both Pol II occupancy and DNA accessibility around TSSs positioned upstream at intermediate distances. As in the case of transcription, these effects are much more marked for the first pair (Fig. 4A and B) than for the second pair of TSSs (Figures S3A and S3B). Pol II occupancy and DNA accessibility are systematically enhanced as well by the presence of an upstream TSS at intermediate distances and by the cooperative actions of two proximal TSSs. For pairs of TSSs that are far from each other, the relative contributions of Pol II occupancy and DNA accessibility decrease at the downstream TSS and increase at the upstream TSS as the transcriptional activity of the gene decreases, paralleling the shift in positional dominance observed for transcription. Outside the transcription initiation regions, the RNA-seq signal closely follows the main trends of Pol II occupancy, which overlap to a large extent with DNA accessibility. Therefore, a downstream TSS affects not only transcription initiation but also transcription progression.

Pol II occupancy and the presence of DNase I hypersensitivity sites are two general indicators of transcription and of transcription initiation and regulation, respectively²⁶. Similarly, the active chromatin marker H3K4me3 (Figs. 4C and S3C) shows differentiated patterns on the two-dimensional RNA-seq signal densities that are consistent with active transcription initiation. Namely, H3K4me3 is high downstream of transcription initiation and significantly lower just upstream. This pattern is clearly observed, for instance, for the first pair of distal TSSs, where it appears around the 2nd TSS for high transcription and switches to the 1st TSS as transcription decreases. In general, we observe that, for high transcriptional activity, H3K4me3 is high downstream of the two TSSs, low immediately upstream of both TSSs, and changes from low to high between the two TSSs depending on their relative positions.

The interdependence of transcription on neighboring TSSs is conserved across human cell types

To study to what extent these general trends are present in other cell types, we obtained the two-dimensional RNA-seq signal densities for the GM12878 human lymphoblastoid cell line (Figure S4) and for H1-hESC human embryonic stem cells (Figure S5), which together with K562 constitute the three Tier 1 cell types of the Encyclopedia of DNA Elements (ENCODE) project^27,28,29. The main features, involving proximal cooperativity, upstream effects, downstream effects, and positional dominance, are very similar for all three cell types.

We quantified the presence of these general trends across the full spectrum of cell types for each of the pairs of consecutive TSSs up to the 11th TSS in all human experiments in the ENCODE project with high replicate concordance (Table S1). These included 191 experiments with 122 different cell types (biosamples), covering all different biosample types. The results show that the complex interdependence of the transcriptional signal at multiple TSSs observed in K562, GM12878, and H1-hESC cells is conserved across the full variety of human cell types (Figs. 5, S6, and S9).

Explicitly, upstream and downstream effects are extremely marked for the first pair of TSSs, substantially decrease for the second pair, and are highly suppressed for the other pairs further downstream in the gene, except for highly transcribed genes. In this latter case, the presence of an additional TSS nearby, either upstream or downstream, is always associated with enhanced transcription. Similarly, positional dominance also ranges from very marked for the first pair of TSSs to highly suppressed for the other pairs further downstream in the gene. In contrast, proximal cooperativity is always maintained at a high level irrespective of the transcriptional activity of the gene and the relative position of the TSS pair within the gene.

We also quantified the presence of general trends in transcription initiation using RAMPAGE data for each of the pairs of consecutive TSSs up to the 11th TSSs in all human experiments in the ENCODE project with high replicate concordance (Table S2). These included 65 experiments with 56 different cell types, covering all biosample types. The results show that the main trends observed for the transcription initiation in K562 are conserved across human cell types (Figs. 6, S7, and S10). There are broad similarities with RNA-seq data but also notable differences.

Positional dominance for RAMPAGE data across multiple cell types closely mimics the results for RNA-seq data, indicating that transcription initiation as well as transcription generally shift from the upstream to the downstream TSS of the distal TSS pair as the transcriptional activity of the gene increases. Upstream effects are also remarkably similar for both processes, except for the first pair of TSSs with low transcriptional activity of the gene. Downstream effects and proximal cooperativity are positive in both RNA-seq and RAMPAGE data but are much more marked in the latter. In general, it is observed that these effects become more pronounced in RAMPAGE data as the positional order of the TSS pair in the gene increases. The fact that these marked transcription initiation effects are reduced to a large extent in transcription as the positional order of the TSS pair increases is consistent with transcription at a given position accounting for the cumulative effects of transcription initiated at the upstream TSSs.

The quantification of the average RNA-seq signal between consecutive TSSs indicates that the general trends observed qualitatively for Tier 1 cell types are indeed conserved across the full variety of human cell types (Figs. 7, S8, and S11). Explicitly, the expected reduced average signal after transcription initiation is observed for any location of the downstream TSSs (Figs. 7, S8, and S11), except for the first TSS pair with a non-distal downstream TSS. In the case of the K562 leukemia cell line, which we analyzed explicitly at the level of the regulatory features, this behavior is also present at the level of the Pol II occupancy (Fig. 4), thus indicating that it is a general feature of transcription itself. Comparing the effects of the downstream TSS location, the persistence of the average RNA-seq signal is systematically higher for a non-distal than for a distal downstream TSS for all TSS pairs (Figs. 7, S8, and S11), which we refer to as persistence dominance. In absolute terms, the average RNA-seq signal between TSSs does not depend on the downstream TSS distance for the first pair of TSSs of the gene, but tends to be higher in the presence of a non-distal downstream TSS for subsequent TSS pairs in the gene (Figs. 7, S8, and S11).

Multiple levels of variability across biosamples

Alternative transcription has important implications for gene expression as it determines the variability of the repertoire of isoform proteins. In our analysis, we have observed high variability in the means of replicates along the main trends, which is considerably higher than the variability between replicates (Figures S6, S7, and S8). This variability has both a random-like component and a bias that is determined by the genomic context. The bias, resulting from the general interdependence patterns across TSSs we have identified, is strongly evident in the averages of all experiments within each biosample type, which exhibit little variation across different biosample types (Figs. 5, 6, and 7). Therefore, the main trends in the interdependence of transcriptional processes on the TSSs arrangements are present in the same form for all cell types, regardless of their mutational background, specific origin, or function within the organism. Compounded with these general trends, there are multiple levels of variability, such as replicate noise, cell-type-specific TSS usage, and specific responses to different conditions.

Discussion

Directed analyses on specific systems have shown that many fundamental mechanisms involved in transcription regulation strongly depend on the precise distances among the locations of multiple DNA elements^15,30, but it has been unclear to what extent this dependence could be present along the genome after the confluence of many of these potentially opposing mechanisms³¹. Especially relevant is the case of alternative transcription³². There is ample evidence that multiple TSSs in most genes have independent cell-type-specific expression profiles²¹. These profiles have been found to be connected to disease states, including alternative transcription initiation at multiple TSSs that is deregulated across cancer types and patients³³ and that exhibits well-defined, specific signatures in type 2 diabetes³⁴. The types of regulation comprise a wide range of modalities, including the TSSs of a gene being coregulated, namely increasing or decreasing their expression proportionally³⁵, and, on the opposite side, switching expression from one TSS to another³⁶.

The multiple-landmark-alignment methodology we have developed provides an avenue to elucidate how the precise positioning of multiple landmarks reflects in DNA-dependent processes on a genome-wide scale. The simultaneous consideration of multiple distances (stratified as proximal, intermediate, and distal) has been a fundamental element of our approach to uncover the existence of regulated interdependence patterns of gene expression at alternative TSSs and between TSSs across human cell lines, primary cells, in vitro differentiated cells, and tissues. This interdependence comprises proximal cooperativity, upstream and downstream interactions, positional dominance, enhancement of transcription persistence, and attenuation of the transcriptional signal. In general, these effects are highly dependent on the intragenic position of the TSSs, the transcriptional activity of the gene, and the precise distances between TSSs, but at the same time, they are consistently conserved across human cell types, irrespective of their specific origin or function within the organism.

Among the most salient phenomena are proximal cooperativity and downstream effects, which encompass higher transcription downstream of a TSS the closer it is to an upstream TSS within the gene. This type of enhancement observed in transcription is also present, even more prominently, in transcription initiation. On the opposite side, our results show the presence of marked upstream effects, namely, the attenuation of the transcriptional signal and transcription initiation at an upstream TSS by the presence of a nearby downstream TSS. Simultaneously with the negative effects on the absolute levels of transcription, a downstream TSS enhances the persistence of transcription after its initiation. Concomitantly, DNA accessibility and Pol II densities show lower but more sustained profiles for a non-distal than for a distal downstream TSS. These results can be understood mechanistically considering that the assembly of the transcription initiation complex upstream of a TSS interferes with transcription initiation and that the overall transcription process promotes upstream and downstream DNA accessibility.

Recent genome-wide analyses have concluded that multiple alternative transcription initiation is largely nonadaptive and results predominantly from imprecise events³⁷. At the molecular level, fundamental biochemical principles dictate that non-specific effects, as quantified and validated in simpler prokaryotic gene expression systems, cannot generally be suppressed completely and that they are affected by regulatory processes in the usual way³⁸. In this context, the existence of general patterns within the multiple levels of variability we have identified shows both a consistent signal across cell types determined by the genomic context and a random-like component dependent on the cell type and conditions, akin to non-specific transcription initiation.

Our analysis has also shown that there are clear positional dominance effects when the two TSSs are far from each other, resulting in most of the transcription and transcription initiation shifting from the upstream to the downstream TSS of the distal TSS pair as the transcriptional activity of the gene increases. This effect is extremely marked for the 1st and 2nd annotated TSSs of a gene and it is generally more pronounced for transcription initiation than for transcription. These types of results also have a practical side as they can be used to refine and complement TSS annotations. Explicitly, the case of positional dominance implies that the 1st annotated TSS of a gene is essentially not active, or it is not an actual TSS, if the transcriptional activity of the gene is high.

The most remarkable finding of our work is therefore the discovery of the existence of general regulated interdependence patterns of gene expression at and between alternative TSSs of protein-coding genes. We showed that these effects are conserved across cell types through a comprehensive analysis of the hundreds of human transcription and transcription initiation experiments of the ENCODE project. Compounded with these general patterns, there are multiple levels of variability, such as replicate noise, cell-type-specific TSS usage, and adaptation to different conditions. The identification of these general patterns in the alternating structure of transcription has important implications for gene expression as they determine the variability of the repertoire of isoform proteins.

On the methodological side, our approach can generally be applied to virtually any combination of landmarks and any genomic signal in the same way as we have applied them to TSSs and RNA-seq, RAMPAGE, DNase-seq, and ChIP-seq signals. Therefore, our results open an avenue to find novel distance-dependent functional relationships among multiple DNA elements in a wide variety of systems.

Materials and methods

Genomic signals in two dimensions

The average of the signal $f\left( {x,y} \right) = g\left( {x + z_{U} } \right)\delta_{{y, x + z_{U} - z_{D} }}$ over a rectangular region from $x_{0}$ to $x_{1}$ along the $x$ coordinate and from $y_{0}$ to $y_{1}$ along the $y$ coordinate for all TSS pairs in the set $V$ is expressed as

$$R_{V} \left[ {\left( {x_{0} ,y_{0} } \right),\left( {x_{1} ,y_{1} } \right)} \right] = \frac{1}{N}\mathop \sum \limits_{{\{ z_{U} ,z_{D} \} \in V}} \mathop \sum \limits_{{x = x_{0} }}^{{x_{1} }} \mathop \sum \limits_{{y = y_{0} }}^{{y_{1} }} g\left( {x + z_{U} } \right)\delta_{{y, x + z_{U} - z_{D} }} ,$$

where $N$ is the normalization factor, which is given by

$$N = \mathop \sum \limits_{{\{ z_{U} ,z_{D} \} \in V}} \mathop \sum \limits_{{x = x_{0} }}^{{x_{1} }} \mathop \sum \limits_{{y = y_{0} }}^{{y_{1} }} \delta_{{y, x + z_{U} - z_{D} }} .$$

In our analysis, we use multiple sets $V$ corresponding to a specific contiguous pair of TSSs of genes with transcriptional activities within a range of values (e.g., the set of 1st and 2nd TSSs of all protein-coding genes with high transcription).

Two-dimensional region averages

To compute the region average from the previous expression efficiently, we take into account that $\mathop \sum \limits_{{y = y_{0} }}^{{y_{1} }} g\left( {x + z_{U} } \right)\delta_{{y, x + z_{U} - z_{D} }}$ is $g\left( {x + z_{U} } \right)$ if $y_{0} \le x + z_{U} - z_{D} \le y_{1}$ and zero otherwise. Therefore, the sum over $x$ is different from zero only for $x \ge y_{0} - z_{U} + z_{D}$ and $x \le y_{1} - z_{U} + z_{D}$, which leads to

$$R_{V} \left[ {\left( {x_{0} ,y_{0} } \right),\left( {x_{1} ,y_{1} } \right)} \right] = \frac{1}{N}\mathop \sum \limits_{{\begin{array}{*{20}c} {\{ z_{U} ,z_{D} \} \in V} \\ \\ \end{array} }} \mathop \sum \limits_{{x = \max \left( {x_{0} , y_{0} - z_{U} + z_{D} } \right)}}^{{\min \left( {x_{1} ,y_{1} - z_{U} + z_{D} } \right)}} g\left( {x + z_{U} } \right).$$

Similarly, the normalization factor is expressed as

$$N = \mathop \sum \limits_{{\{ z_{U} ,z_{D} \} \in V}} \mathop \sum \limits_{{x = \max \left( {x_{0} , y_{0} - z_{U} + z_{D} } \right)}}^{{\min \left( {x_{1} ,y_{1} - z_{U} + z_{D} } \right)}} 1.$$

Note that the region average is defined only if there exists at least one pair $\{ z_{U} ,z_{D} \}$ in $V$ such that $y_{0} - x_{1} \le z_{U} - z_{D} \le y_{1} - x_{0}$, which is equivalent to the condition $\max \left( {x_{0} , y_{0} - z_{U} + z_{D} } \right) \le \min \left( {x_{1} ,y_{1} - z_{U} + z_{D} } \right)$.
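The reduced sums can be sketched in Python as follows. The use of prefix sums to evaluate each inner sum in constant time is an addition for the example and is not described in the text; the names `make_prefix` and `region_average`, the `origin` offset, and the toy data are likewise assumptions. The signal is assumed to be stored as a list in which `signal[k]` holds $g(\text{origin} + k)$ and to cover every position that is queried.

```python
from itertools import accumulate


def make_prefix(signal):
    """prefix[i] is the sum of signal[0:i]."""
    return [0.0] + list(accumulate(signal))


def region_average(prefix, origin, pairs, x0, y0, x1, y1):
    """Region average R_V using the max/min summation bounds."""
    total, count = 0.0, 0
    for z_u, z_d in pairs:
        lo = max(x0, y0 - z_u + z_d)
        hi = min(x1, y1 - z_u + z_d)
        if lo > hi:
            continue  # this pair does not intersect the rectangle
        a = lo + z_u - origin      # list index of g(lo + z_U)
        b = hi + z_u - origin + 1  # one past the index of g(hi + z_U)
        total += prefix[b] - prefix[a]
        count += hi - lo + 1
    return total / count if count else None


# Toy data (hypothetical): g(100) = 2, g(101) = 4, g(102) = 6.
prefix = make_prefix([2.0, 4.0, 6.0])
avg = region_average(prefix, 100, [(100, 105)], 0, -5, 2, -4)
```

Returning `None` when no pair contributes mirrors the condition under which the region average is undefined.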

Two-dimensional signal densities

To compute the signal densities, we use a moving window defined by a rectangular domain centered at $\left( {x,y} \right)$ with dimensions $2n_{X} + 1$ along the $x$ coordinate and $2n_{Y} + 1$ along the $y$ coordinate. The signal density $G\left( {x,y} \right)$ averaged over this domain for all TSS pairs in the set $V$ is given by

$$G\left( {x,y} \right) = R_{V} \left[ {\left( {x - n_{X} ,y - n_{Y} } \right),\left( {x + n_{X} ,y + n_{Y} } \right)} \right].$$

The explicit values of $n_{X}$ and $n_{Y}$ used in our analysis are $n_{X} = 99$ for $- 500 \le x \le 1k$, $n_{X} = x/4$ for $1k < x \le 20k$, $n_{Y} = 99$ for $- 1k \le y \le 1k$, and $n_{Y} = - y/4$ for $- 20k \le y < - 1k$.
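A minimal sketch of the window construction is given below. It assumes that $k$ denotes 1,000 positions and that the fractional half-widths $x/4$ and $-y/4$ are rounded down to integers; the rounding rule and the function names are assumptions, since the text does not specify them.

```python
def half_width_x(x):
    """Half-width n_X of the moving window along the x coordinate."""
    if -500 <= x <= 1000:
        return 99
    if 1000 < x <= 20000:
        return x // 4  # rounding down is an assumption
    raise ValueError("x outside the analyzed range")


def half_width_y(y):
    """Half-width n_Y of the moving window along the y coordinate."""
    if -1000 <= y <= 1000:
        return 99
    if -20000 <= y < -1000:
        return (-y) // 4  # rounding down is an assumption
    raise ValueError("y outside the analyzed range")


def window(x, y):
    """Corners ((x - n_X, y - n_Y), (x + n_X, y + n_Y)) of the window."""
    n_x, n_y = half_width_x(x), half_width_y(y)
    return (x - n_x, y - n_y), (x + n_x, y + n_y)


near = window(0, 0)
far = window(4000, -8000)
```

The two corners returned by `window` are the arguments of $R_{V}$ in the expression for $G\left( {x,y} \right)$, so the window widens with the distance from the landmarks beyond 1k.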

Average transcription in a region

The average transcription in a region $W$, $T_{W} = \left\langle {g\left( {x + z_{U} } \right)\delta _{{y,~x + z_{U} - z_{D} }} } \right\rangle _{{\left\{ {z_{U} ,z_{D} } \right\},\left( {x,y} \right)}}$ with $\left( {x,y} \right) \in W$, is computed explicitly for the representative regions as $T_{A} = R_{V} \left[ {\left( {0, - 20k} \right),\left( {200, - 10k} \right)} \right],$ $T_{Af} = R_{V} \left[ {\left( {300, - 20k} \right),\left( {1k, - 10k} \right)} \right],$ $T_{B} = R_{V} \left[ {\left( {0, - 900} \right),\left( {200, - 200} \right)} \right],$ $T_{Bf} = R_{V} \left[ {\left( {300, - 900} \right),\left( {1k, - 200} \right)} \right],$ $T_{C} = R_{V} \left[ {\left( {0, 0} \right),\left( {200, 200} \right)} \right],$ $T_{D} = R_{V} \left[ {\left( {300, 0} \right),\left( {1k,200} \right)} \right],$ and $T_{E} = R_{V} \left[ {\left( {10k, 0} \right),\left( {20k,200} \right)} \right].$

TSSs

TSSs were obtained from the comprehensive gene annotation on the reference chromosomes of Gencode V19 (https://www.gencodegenes.org/human/release_19.html).

TSS order

TSSs are ordered according to their genomic position, starting the enumeration from the most upstream TSS. Therefore, according to this notation, the 1st TSS does not necessarily correspond to the TSS with the highest expression.
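The ordering can be sketched as follows, taking into account that the most upstream TSS has the lowest coordinate for genes on the plus strand and the highest coordinate for genes on the minus strand. The function names and the removal of duplicated positions are assumptions made for the example.

```python
def order_tss(positions, strand):
    """Return the distinct TSS positions of a gene, most upstream first."""
    unique = sorted(set(positions))  # dropping duplicates is an assumption
    return unique if strand == "+" else unique[::-1]


def contiguous_pairs(ordered):
    """Pairs (1st, 2nd), (2nd, 3rd), ... of consecutive TSSs."""
    return list(zip(ordered, ordered[1:]))


plus = order_tss([500, 100, 300], "+")
minus = order_tss([500, 100, 300], "-")
pairs = contiguous_pairs(plus)
```

With this enumeration, the rank of a TSS reflects only its position within the gene and carries no information about its expression level.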

Genomic signals

RNA-seq, RAMPAGE, DNase-seq, and ChIP-seq genomic signals were downloaded from the Encyclopedia of DNA Elements (ENCODE) consortium repository (http://www.encodeproject.org/) as bigWig files for the hg19 mapping assembly/V19 genome annotation. Gene quantifications for the corresponding RNA-seq signals were downloaded as tsv files. Signals were normalized by their average value over the whole genome before analysis.
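The normalization step can be illustrated with a short sketch. It assumes that the signal has already been read into per-chromosome lists with one value per position, that positions without coverage are stored as zeros, and that the genome-wide mean is positive; the function name and the data layout are hypothetical and do not reflect how the bigWig files were actually processed.

```python
def normalize_by_genome_mean(signal_by_chrom):
    """Divide every value by the mean of the signal over all chromosomes."""
    total = sum(sum(values) for values in signal_by_chrom.values())
    n = sum(len(values) for values in signal_by_chrom.values())
    mean = total / n  # assumed to be positive
    return {chrom: [v / mean for v in values]
            for chrom, values in signal_by_chrom.items()}


# Toy genome (hypothetical) with two short chromosomes.
raw = {"chr1": [1.0, 3.0], "chr2": [2.0, 2.0]}
normalized = normalize_by_genome_mean(raw)
```

After this step the genome-wide average of each signal equals one, which makes signals from experiments with different sequencing depths comparable.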

RNA-seq experiment selection

RNA-seq experiments were selected in two steps. First, we considered all the experiments that matched the search criteria "hg19" for assembly and "polyA mRNA RNA-seq" or "total RNA-seq" for assay title, which produced 594 results. Subsequently, we selected RNA-seq experiments that included plus and minus strand signal of unique reads and that had high replicate concordance (Spearman correlation > 0.8 between gene quantifications of the replicates), which resulted in 191 experiments.
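The replicate-concordance criterion can be sketched in plain Python as shown below. The Spearman correlation is computed as the Pearson correlation of average ranks, which is the standard definition in the presence of ties. The function names and the toy quantifications are assumptions, and the sketch presumes that both replicates list the same genes in the same order with non-constant values.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the ranks i + 1, ..., j + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))


def is_concordant(rep1, rep2, threshold=0.8):
    """Selection criterion: Spearman correlation above the threshold."""
    return spearman(rep1, rep2) > threshold


# Toy gene quantifications (hypothetical) for two replicates.
rep1 = [0.0, 1.5, 3.0, 10.0]
rep2 = [0.1, 2.0, 2.5, 50.0]
keep = is_concordant(rep1, rep2)
```

Because the criterion depends only on ranks, it is insensitive to differences in the overall scale of the quantifications between replicates.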

RAMPAGE experiment selection

RAMPAGE experiments were selected in two steps. First, we considered all the experiments that matched the search criteria "hg19" for assembly and "RAMPAGE" for assay title, which produced 155 results. Subsequently, we selected RAMPAGE experiments that included plus and minus strand signal of unique reads and that had high replicate concordance (Spearman correlation > 0.8 between gene quantifications of the replicates), which resulted in 65 experiments.

Data analysis

The data analysis was performed using custom Python 3.8 scripts implemented in Jupyter Notebooks available in the supplemental information.

Data availability

The datasets analyzed during the current study are available from the Encyclopedia of DNA Elements (ENCODE) consortium repository (http://www.encodeproject.org/) and the comprehensive gene annotation on the reference chromosomes of Gencode V19 (https://www.gencodegenes.org/human/release_19.html). Accession codes for the data used are provided in the corresponding figure legends and Supplementary Tables S1 and S2.

References

Mayer, A. et al. Native elongating transcript sequencing reveals human transcriptional activity at nucleotide resolution. Cell 161, 541–554. https://doi.org/10.1016/j.cell.2015.03.010 (2015).
Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49. https://doi.org/10.1038/nature09906 (2011).
Rhee, H. S. & Pugh, B. F. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147, 1408–1419. https://doi.org/10.1016/j.cell.2011.11.013 (2011).
Kuscu, C., Arslan, S., Singh, R., Thorpe, J. & Adli, M. Genome-wide analysis reveals characteristics of off-target sites bound by the Cas9 endonuclease. Nat. Biotechnol. 32, 677–683. https://doi.org/10.1038/nbt.2916 (2014).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461. https://doi.org/10.1038/nature12787 (2014).
Gilbert, L. A. et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell 159, 647–661. https://doi.org/10.1016/j.cell.2014.09.029 (2014).
Horlbeck, M. A. et al. Nucleosomes impede Cas9 access to DNA in vivo and in vitro. Elife 5, e12677. https://doi.org/10.7554/eLife.12677 (2016).
Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82. https://doi.org/10.1038/nature11232 (2012).
Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Nat. Acad. Sci. U. S. A. 113, 11901–11906. https://doi.org/10.1073/pnas.1613365113 (2016).
Neph, S. et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83–90. https://doi.org/10.1038/nature11212 (2012).
Erb, M. A. et al. Transcription control by the ENL YEATS domain in acute leukaemia. Nature 543, 270–274. https://doi.org/10.1038/nature21688 (2017).
Kundaje, A. et al. Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements. Genome Res 22, 1735–1747. https://doi.org/10.1101/gr.136366.111 (2012).
Li, G. et al. Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell 148, 84–98. https://doi.org/10.1016/j.cell.2011.12.014 (2012).
Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680. https://doi.org/10.1016/j.cell.2014.11.021 (2014).
Levine, M., Cattoglio, C. & Tjian, R. Looping back to leap forward: Transcription enters a new era. Cell 157, 13–25. https://doi.org/10.1016/j.cell.2014.02.009 (2014).
Saiz, L. & Vilar, J. M. G. DNA looping: The consequences and its control. Curr. Opin. Struct. Biol. 16, 344–350. https://doi.org/10.1016/j.sbi.2006.05.008 (2006).
Levine, M. & Tjian, R. Transcription regulation and animal diversity. Nature 424, 147–151. https://doi.org/10.1038/nature01763 (2003).
Maston, G. A., Evans, S. K. & Green, M. R. Transcriptional regulatory elements in the human genome. Annu. Rev. Genomics Hum. Genet. 7, 29–59. https://doi.org/10.1146/annurev.genom.7.080505.115623 (2006).
Lenhard, B., Sandelin, A. & Carninci, P. Metazoan promoters: Emerging characteristics and insights into transcriptional regulation. Nat Rev Genet 13, 233–245. https://doi.org/10.1038/nrg3163 (2012).
Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108. https://doi.org/10.1038/nature11233 (2012).
The FANTOM Consortium the RIKEN PMI & CLST (DGT). A promoter-level mammalian expression atlas. Nature 507, 462–470. https://doi.org/10.1038/nature13182 (2014).
Juven-Gershon, T., Hsu, J. Y., Theisen, J. W. & Kadonaga, J. T. The RNA polymerase II core promoter—the gateway to transcription. Curr. Opin. Cell Biol. 20, 253–259. https://doi.org/10.1016/j.ceb.2008.03.003 (2008).
Proudfoot, N. J. Transcriptional termination in mammals: Stopping the RNA polymerase II juggernaut. Science 352, aad9926. https://doi.org/10.1126/science.aad9926 (2016).
Licatalosi, D. D. & Darnell, R. B. RNA processing and its regulation: Global insights into biological networks. Nat. Rev. Genet. 11, 75–87. https://doi.org/10.1038/nrg2673 (2010).
Batut, P., Dobin, A., Plessy, C., Carninci, P. & Gingeras, T. R. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 23, 169–180. https://doi.org/10.1101/gr.139618.112 (2013).
Boyle, A. P. et al. High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311–322. https://doi.org/10.1016/j.cell.2007.12.014 (2008).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74. https://doi.org/10.1038/nature11247 (2012).
Sloan, C. A. et al. ENCODE data at the ENCODE portal. Nucleic Acids Res. 44, D726-732. https://doi.org/10.1093/nar/gkv1160 (2016).
ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710. https://doi.org/10.1038/s41586-020-2493-4 (2020).
Saiz, L. & Vilar, J. M. G. Ab initio thermodynamic modeling of distal multisite transcription regulation. Nucleic Acids Res. 36, 726–731. https://doi.org/10.1093/nar/gkm1034 (2008).
Ptashne, M. & Gann, A. Genes & signals. (Cold Spring Harbor Laboratory Press, 2002).
de Klerk, E. & t Hoen, P. A. Alternative mRNA transcription, processing, and translation: Insights from RNA sequencing. Trends Genet. 31, 128–139. https://doi.org/10.1016/j.tig.2015.01.001 (2015).
Demircioglu, D. et al. A pan-cancer transcriptome analysis reveals pervasive regulation through alternative promoters. Cell 178, 1465–1477. https://doi.org/10.1016/j.cell.2019.08.018 (2019).
Varshney, A. et al. A transcription start site map in human pancreatic islets reveals functional regulatory signatures. Diabetes https://doi.org/10.2337/db20-1087 (2021).
Karlsson, K., Lonnerberg, P. & Linnarsson, S. Alternative TSSs are co-regulated in single cells in the mouse brain. Mol. Syst. Biol. 13, 930. https://doi.org/10.15252/msb.20167374 (2017).
Hollerer, I. et al. Evidence for an integrated gene repression mechanism based on mRNA isoform toggling in human cells. G3 (Bethesda) 9, 1045–1053. https://doi.org/10.1534/g3.118.200802 (2019).
Xu, C., Park, J. K. & Zhang, J. Evidence that alternative transcriptional initiation is largely nonadaptive. PLoS Biol. 17, e3000197. https://doi.org/10.1371/journal.pbio.3000197 (2019).
Vilar, J. M. G. & Saiz, L. Reliable prediction of complex phenotypes from a modular design in free energy space: An extensive exploration of the lac operon. ACS Synth. Biol. 2, 576–586. https://doi.org/10.1021/sb400013w (2013).

Acknowledgements

This work was supported by Ministerio de Ciencia e Innovación (MCI/AEI/FEDER, UE PGC2018-101282-B-I00 and PID2021-128850NB-I00 to J.M.G.V.) and the University of California, Davis (to L.S.).

Author information

Authors and Affiliations

Biofisika Institute (CSIC, UPV/EHU), University of the Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain
Jose M. G. Vilar
IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Spain
Jose M. G. Vilar
Department of Biomedical Engineering, University of California, 451 E. Health Sciences Drive, Davis, CA, 95616, USA
Leonor Saiz

Authors

Jose M. G. Vilar
Leonor Saiz

Contributions

J.M.G.V. and L.S. conceived, designed, and performed the research.

Corresponding authors

Correspondence to Jose M. G. Vilar or Leonor Saiz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Supplementary Information 3.

Supplementary Information 4.

Supplementary Information 5.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Vilar, J.M.G., Saiz, L. Multi-landmark alignment of genomic signals reveals conserved expression patterns across transcription start sites. Sci Rep 13, 10835 (2023). https://doi.org/10.1038/s41598-023-37140-x

Received: 29 November 2022
Accepted: 16 June 2023
Published: 05 July 2023
DOI: https://doi.org/10.1038/s41598-023-37140-x
