Introduction

Genomic signals encapsulate highly detailed quantitative information up to the nucleotide level1 on key aspects of DNA transcription, the subsequent RNA processing, and multiple DNA-dependent processes, including DNA methylation2, transcription factor binding3, and CRISPR-Cas9 efficiency4. At the core of interpreting this information are specific genomic locations, or genomic landmarks, such as transcription start sites (TSSs), transcription factor binding sites, RNA splice junctions, or the midpoint of an extended DNA region5. These landmarks provide anchoring points to summarize general trends and characterize different types of DNA regions.

The prototypical approaches to analyzing these data start with the alignment of the signals to a landmark along a one-dimensional coordinate for subsequent processing. In mathematical terms, the alignment of a genomic signal \(g\left( z \right)\) along the coordinate \(z\) to a landmark with position denoted by \(z_{U}\) leads to a relative coordinate \(x = z - z_{U}\) and an aligned signal \(g\left( {x + z_{U} } \right)\). The most widely used type of processing is the aggregation of alignments for multiple positions \(z_{U}\) of the landmark, which leads to an average signal \(G\left( x \right) = \left\langle {g\left( {x + z_{U} } \right)} \right\rangle _{{z_{U} }}\). This approach has provided general information as diverse as the sharp dependence of CRISPRi/a activity on both the proximity of a TSS and nucleosome occupancy6,7; how the directionality of promoters reflects on the asymmetry of DNA accessibility and histone methylation signals around TSSs8; and the enrichment or depletion of single nucleotide variation occurrence around multiple landmarks in the genomes of human populations9. To capture the inherent heterogeneity, aligned signals are often structured into heatmaps10, which can be sorted and clustered according to specific parameters11 and can be incorporated into automated machine-learning pipelines12. This type of one-dimensional alignment is also the usual approach to link genomic signals with the results of methodologies, such as chromosome conformation capture techniques13,14, that map the three-dimensional DNA looping15,16 interactions between distal DNA elements.
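The single-landmark aggregation \(G\left( x \right)\) described above can be illustrated with a minimal sketch. It assumes the signal is stored as a plain list indexed by the genomic coordinate \(z\) and that strand orientation is ignored; the function name `aggregate_alignment` and the `half_width` parameter are hypothetical and serve only as illustration.

```python
from collections import defaultdict

def aggregate_alignment(signal, landmarks, half_width):
    """Average signal[z_U + x] over all landmarks z_U for |x| <= half_width."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for z_u in landmarks:
        for x in range(-half_width, half_width + 1):
            z = z_u + x
            if 0 <= z < len(signal):  # skip positions outside the sequence
                sums[x] += signal[z]
                counts[x] += 1
    return {x: sums[x] / counts[x] for x in sorted(counts)}

# Toy signal that peaks one position downstream of each landmark.
signal = [0, 0, 1, 5, 0, 0, 0, 2, 6, 0]
landmarks = [2, 7]
profile = aggregate_alignment(signal, landmarks, half_width=2)
print(profile)
```

In this toy case the averaged profile peaks at \(x = 1\), which is the kind of general trend that aggregation over many landmark positions is meant to expose.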

The alignment with respect to a single position, however, is frequently ambiguous because regulatory regions often involve multiple relevant landmarks17,18. The presence of a landmark, such as a TSS, can often affect the functioning of another one and, in general, multiple landmarks can affect each other's function. To analyze functional relationships among multiple types of DNA elements, we develop a method to consider multiple landmarks at the same level (Fig. 1). The main idea is to align the signal to multiple locations through the expansion of the dimensionality of the genomic-coordinate space by considering relative coordinates from the different landmarks.

Figure 1

Constructing multidimensional representations of genomic signals. Starting with a genomic signal \(g\left( z \right)\) along the genomic coordinate \(z\), we perform a coordinate expansion using multiple landmarks, such as TSSs (depicted by black arrows), to obtain a multiple-landmark alignment of the signal. For pairs of landmarks, genomic locations in the neighborhood of two landmarks, such as those in the intervals \(z_{1} - z_{4}\) and \(z_{5} - z_{8}\), are mapped into a two-dimensional representation with respect to the distances from each of the landmarks. Taking the average of \(g\left( z \right)\) in the expanded space in windows centered at \(\left( {x,y} \right) = \left( {z - z_{U} ,z - z_{D} } \right)\) for all the relevant pairs of landmarks \(\left\{ {z_{U} ,z_{D} } \right\}\) provides a multidimensional signal density, depicted by \(G\left( {x,y} \right)\) in two dimensions.

Results

Simultaneous alignment to multiple positions

To consider genomic signals in two dimensions, we expand the genomic coordinate \(z\) with respect to the positions of the upstream, \(z_{U}\), and downstream, \(z_{D}\), landmarks into \(x = z - z_{U}\) and \(y = z - z_{D}\). Explicitly, this transformation assigns the value of the signal \(g\left( z \right)\) to the coordinates \(\left( {x,y} \right) = \left( {z - z_{U} ,z - z_{D} } \right)\) for each value of the genomic coordinate \(z\) and for each pair of landmarks. To eliminate the dependence on the genomic coordinate \(z\), we consider first that \(x\) and \(y\) correspond to the same genomic coordinate, which leads to a line in the \(x,y\)-plane defined by \(y + z_{D} = x + z_{U}\). Secondly, we consider the signal along this line in the two-dimensional space described mathematically by \(f\left( {x,y} \right) = g\left( {x + z_{U} } \right)\delta_{{y,x + z_{U} - z_{D} }}\), where \(\delta_{i,j}\) represents the Kronecker delta function, which is one if \(i = j\) and zero otherwise. Finally, we also consider the unit signal \(n\left( {x,y} \right) = \delta_{{y,x + z_{U} - z_{D} }}\) along the same line in the \(x,y\)-plane. This description allows the efficient computation of the two-dimensional average signal density, \(G\left( {x,y} \right) = \left\langle {g\left( {x + z_{U} } \right)\delta _{{y,x + z_{U} - z_{D} }} } \right\rangle _{{R\left( {x,y} \right),~\{ z_{U} ,z_{D} \} }}\), over pairs of landmarks \(\{ z_{U} ,z_{D} \}\) and a two-dimensional sliding window \(R\left( {x,y} \right)\) around \(\left( {x,y} \right)\). 
The average is defined as the sum of the two-dimensional representation of the signal over the same sum for the unit signal, which in mathematical terms leads to \(G\left( {x,y} \right) = \frac{1}{{N\left( {x,y} \right)}}\mathop \sum \limits_{{\left( {x^{\prime},y^{\prime}} \right) \in R\left( {x,y} \right)}} \mathop \sum \nolimits_{{\left\{ {z_{U} ,z_{D} } \right\}}} f\left( {x^{\prime},y^{\prime}} \right)\), where the normalization factor is \(N\left( {x,y} \right) = \mathop \sum \limits_{{\left( {x^{\prime},y^{\prime}} \right) \in R\left( {x,y} \right)}} \mathop \sum \nolimits_{{\left\{ {z_{U} ,z_{D} } \right\}}} n\left( {x^{\prime},y^{\prime}} \right)\). Intuitively, the approach generates a two-dimensional representation because different pairs of landmarks lead to distinct lines on the plane where \(f\left( {x,y} \right)\) is different from zero. Collectively, these lines cover a two-dimensional area. (Specific details are provided in the “Methods” section.) The use of the Kronecker delta function is also useful because it allows the straightforward extension of the methodology to multiple dimensions. For instance, in the case of three locations, the aligned three-dimensional signal is given by \(g\left( {x + z_{U} } \right)\delta_{{y,x + z_{U} - z_{D} }} \delta_{{v,x + z_{U} - z_{F} }}\), where \(v = z - z_{F}\) is the relative position associated with the landmark with position \(z_{F}\).
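As a rough illustration of the two-dimensional average, the sketch below accumulates the signal \(f\) and the unit signal \(n\) along the line \(y = x + z_{U} - z_{D}\) for each landmark pair and then divides the two sums. It is a simplified sketch under stated assumptions: the sliding window \(R\left( {x,y} \right)\) is replaced by fixed, non-overlapping square bins, strand orientation is ignored, and the names `two_landmark_density`, `flank`, and `bin_size` are hypothetical.

```python
from collections import defaultdict

def two_landmark_density(signal, pairs, flank, bin_size):
    """Return {(bin_x, bin_y): mean signal} over all (z_U, z_D) pairs.

    flank sets how far beyond each landmark pair the coordinate z is scanned.
    """
    f_sum = defaultdict(float)  # accumulates the signal f(x, y)
    n_sum = defaultdict(int)    # accumulates the unit signal n(x, y)
    for z_u, z_d in pairs:
        lo = max(0, min(z_u, z_d) - flank)
        hi = min(len(signal), max(z_u, z_d) + flank + 1)
        for z in range(lo, hi):
            x, y = z - z_u, z - z_d  # point on the line y = x + z_U - z_D
            key = (x // bin_size, y // bin_size)
            f_sum[key] += signal[z]
            n_sum[key] += 1
    return {key: f_sum[key] / n_sum[key] for key in n_sum}

# Toy example: g(z) = z and two pairs with the same landmark separation.
signal = list(range(20))
pairs = [(5, 10), (6, 11)]
density = two_landmark_density(signal, pairs, flank=2, bin_size=1)
print(density[(0, -5)])
```

Pairs with different separations \(z_{D} - z_{U}\) populate different lines of the plane, so a realistic set of landmark pairs fills a two-dimensional area rather than the single line of this toy case.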

Transcriptional activity shows a complex dependence on multiple TSSs

The resulting multidimensional signal density provides a precise general description to analyze any function of a genomic coordinate in terms of the distances from multiple genomic landmarks. We use this approach to study the dependence of transcription, as reported by RNA sequencing (RNA-seq), on pairs of consecutive TSSs. Specifically, we focus on how transcription in human cells at a given genomic location depends on the relative positions of two TSSs, including how the presence of a TSS correlates with transcription at another TSS. The transcription of mammalian genomes19,20, with an average of four TSSs per gene21, is particularly relevant because the arrangement of TSSs according to different positional patterns, such as those in focused or dispersed promoters, is associated with different types of transcriptional programs22. TSS locations were obtained from the comprehensive gene annotation on the reference chromosomes of Gencode V19. By considering the comprehensive set of annotated TSSs rather than only the ones expressed in each particular cell type, we could also investigate the factors that correlate with alternative TSS expression.

As a representative case, we consider explicitly K562 human myeloid leukemia cells for the first and second TSSs (Figs. 2A and S1A) and the second and third TSSs (Figs. 2B and S1B) of each protein-coding gene. Here, TSSs are ordered according to their genomic position, starting the enumeration from the most upstream TSS. The two-dimensional RNA-seq signal density \(G\left( {x,y} \right)\) reveals a strong dependence on the relative position of pairs of TSSs. There are dominant trends, such as a high transcriptional signal density downstream of the TSSs and the suppression of the signal upstream of a TSS.

Figure 2

Transcription in the K562 leukemia cell line shows a complex dependence on the distance from pairs of TSSs, their intragenic position, and the transcriptional activity of the gene. (A, B), two-dimensional density of normalized RNA-seq signal for pairs of the first (TSS 1) and second (TSS 2) TSSs (A) and the second (TSS 2) and third (TSS 3) TSSs (B) of genes with high, medium, and low levels of transcription. (C), seven representative regions of the two-dimensional (density) signal used to characterize the interdependence on pairs of TSSs. TSSs are ordered according to their genomic position. Regions A and B correspond to transcription at the upstream TSS (\(0 \le x \le 200\)) when the downstream TSS is far away (\(- 20{\text{k}} \le y \le - 10{\text{k}}\)) and at an intermediate distance (\(- 900 \le y \le - 200\)), respectively. Regions Af and Bf correspond to transcription at intermediate distances from the upstream TSS (\(300 \le x \le 1{\text{k}}\)) when the downstream TSS is far away (\(- 20{\text{k}} \le y \le - 10{\text{k}}\)) and at an intermediate distance (\(- 900 \le y \le - 200\)), respectively. Regions C, D, and E correspond to transcription at the downstream TSS (\(0 \le y \le 200\)) when the upstream TSS is nearby (\(0 \le x \le 200\)), at an intermediate distance (\(300 \le x \le 1{\text{k}}\)), and far away (\(10{\text{k}} \le x \le 20{\text{k}}\)), respectively. For the quantification of proximal, intermediate, and distal effects between TSSs, we define the average transcription \(T_{W}\) in a given region \(W\) as \(T_{W} = \left\langle {g\left( {x + z_{U} } \right)\delta_{{y, x + z_{U} - z_{D}}} } \right\rangle_{{ \left\{ {z_{U} ,z_{D} } \right\},\left( {x,y} \right)}}\) with \(\left( {x,y} \right) \in W\) (see “Materials and Methods” section).
Selecting \(W\) as one of the representative regions leads to the definitions of proximal cooperativity as \(T_{C} /T_{E}\); upstream effects as \(T_{B} /T_{A}\); downstream effects as \(T_{D} /T_{E}\); positional dominance as \(T_{E} /T_{A}\); persistence with a distal downstream TSS as \(T_{Af} /T_{A}\); persistence with a non-distal downstream TSS as \(T_{Bf} /T_{B}\); and signal dominance as \(T_{Bf} /T_{Af}\). Data is available from the ENCODE consortium (experiment accession number ENCSR000AEL, Thomas Gingeras lab, CSHL). The accession numbers of the minus and plus strand RNA-seq signals and gene quantifications are ENCFF652ZSN, ENCFF091RAW, and ENCFF782PCD, respectively.

Many key features, however, are strongly dependent on the intragenic position of the TSSs and the transcriptional activity of the gene, which we have stratified as high, medium–high, medium, medium–low, low, and zero (Figure S2). Without this stratification, the signal would be dominated by highly transcribed genes. The most salient general feature is the absence of substantial transcription at the first annotated TSS of highly transcribed genes irrespective of its distance to the second one. Transcription at the first annotated TSS becomes more prominent only as the activity of the gene decreases. Another general salient feature is the high RNA-seq signal density just downstream of two TSSs that are close to each other.

Quantitative characterization of TSS-proximity dependent effects on gene expression

To accurately characterize the observed dependence patterns, we consider seven regions of the two-dimensional signal density (Fig. 2C). Five of the regions are located immediately downstream of one of the TSSs and are distinguished by the relative position of the other TSS. The additional TSS can be located upstream at distal and at intermediate distances (regions A and B, respectively) or downstream at proximal, at intermediate, and at distal distances (regions C, D, and E, respectively). The other two regions are located at intermediate distances downstream of a TSS and at distal and at non-distal distances upstream of the next TSS (regions Af and Bf, respectively).
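The region averages \(T_{W}\) and the ratios built from them can be sketched as follows. The region bounds and ratio definitions are taken from the caption of Fig. 2; everything else is an assumption made for illustration, including the inclusive treatment of the bounds, the neglect of strand orientation, and the hypothetical names `REGIONS`, `region_mean`, and `interdependence_ratios`.

```python
# Region bounds ((x_lo, x_hi), (y_lo, y_hi)) in bp, as listed in the Fig. 2 caption.
REGIONS = {
    "A":  ((0, 200), (-20_000, -10_000)),
    "B":  ((0, 200), (-900, -200)),
    "Af": ((300, 1_000), (-20_000, -10_000)),
    "Bf": ((300, 1_000), (-900, -200)),
    "C":  ((0, 200), (0, 200)),
    "D":  ((300, 1_000), (0, 200)),
    "E":  ((10_000, 20_000), (0, 200)),
}

def region_mean(signal, pairs, region):
    """Average signal[z] over all (z_U, z_D) pairs and all z with (x, y) in region."""
    (x_lo, x_hi), (y_lo, y_hi) = region
    total, count = 0.0, 0
    for z_u, z_d in pairs:
        # z must satisfy x_lo <= z - z_U <= x_hi and y_lo <= z - z_D <= y_hi.
        lo = max(z_u + x_lo, z_d + y_lo, 0)
        hi = min(z_u + x_hi, z_d + y_hi, len(signal) - 1)
        for z in range(lo, hi + 1):
            total += signal[z]
            count += 1
    return total / count if count else None  # None when no position qualifies

def interdependence_ratios(t):
    """Ratios of region averages, named as in the Fig. 2 caption."""
    return {
        "proximal_cooperativity": t["C"] / t["E"],
        "upstream_effects": t["B"] / t["A"],
        "downstream_effects": t["D"] / t["E"],
        "positional_dominance": t["E"] / t["A"],
        "persistence_distal": t["Af"] / t["A"],
        "persistence_non_distal": t["Bf"] / t["B"],
        "signal_dominance": t["Bf"] / t["Af"],
    }

# Toy check with a small custom region and the signal g(z) = z.
toy_signal = [float(z) for z in range(100)]
toy_mean = region_mean(toy_signal, [(10, 40)], ((0, 2), (-32, -28)))
toy_ratios = interdependence_ratios(
    {"A": 1.0, "B": 0.5, "Af": 0.25, "Bf": 0.25, "C": 4.0, "D": 2.0, "E": 1.0}
)
```

In practice, `region_mean` would be evaluated once per entry of `REGIONS` on a genome-scale signal, and the resulting dictionary passed to `interdependence_ratios`.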

Explicitly, comparing RNA-seq densities in region B with those of region A indicates that the proximity of the 2nd TSS strongly correlates with reduced transcription at the 1st TSS. These upstream effects of the 2nd annotated TSS extend up to distances of ~1 kbp. In contrast, transcription in region D is higher than in region E, which shows that the downstream effects of the 1st annotated TSS statistically enhance transcription at the 2nd TSS. This effect is even more marked when comparing transcription in region C with transcription in region E, which we have termed proximal cooperativity, indicating that on average there is more transcription at the 2nd TSS the closer it is to the 1st TSS. To compare transcription when the two TSSs are far from each other, we consider regions A and E. For highly transcribed genes, transcription is much more prominent at the 2nd than at the 1st TSS. This distal positional dominance of the downstream TSS shifts to the upstream TSS as the transcriptional activity of the gene decreases.

The statistical interdependence of the RNA-seq signal at the first pair of annotated TSSs is also present to a large extent at the second and third TSSs (Fig. 2B). Proximal, intermediate, and distal effects, except for the intermediate upstream effects for high transcription, closely parallel those of the first pair of annotated TSSs. Intermediate upstream effects change from negative to positive for highly transcribed genes for TSS pairs after the pair comprising the first and second TSS. This effect could originate from enhanced DNA accessibility due to high transcription initiated at upstream TSSs, which is not present at the first TSS.

After transcription initiation, the average RNA-seq signal is expected to be lost progressively due to multiple processes, including transcription abortion, transcription termination, and RNA processing23,24. The persistence of the RNA-seq signal is strongly influenced by the position of the downstream TSS (Figs. 2 and S1). Explicitly, persistence with a non-distal downstream TSS (average signal in region Bf compared to that of region B) is substantially higher than persistence with a distal downstream TSS (average signal in region Af compared to that of region A), especially for low and medium values of the transcriptional activity. Therefore, the presence of a nearby downstream TSS correlates with lower transcription initiation but, at the same time, with more persistent RNA-seq signals. Any dependence on the TSS arrangement of the processes that lead to loss of the average RNA-seq signal could affect persistence. For instance, a downstream TSS located nearby could favor a lower intron-to-exon RNA ratio between the two TSSs, thus promoting more persistent RNA-seq signals. The presence of a downstream TSS could also lead to an increase in DNA accessibility between the two TSSs, positively correlating with higher transcriptional progression and lower abortion rates. Regarding the absolute value of the average RNA-seq signal between TSSs, it tends to be higher as the downstream TSS gets closer to the upstream TSS (average signal in region Bf compared to that of region Af), which we refer to as signal dominance (Figs. 2 and S1).

Transcription initiation is statistically dependent on neighboring TSSs

To investigate the statistical interdependence of transcription initiation at neighboring annotated TSSs, we computed the two-dimensional signal densities for RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression (RAMPAGE) data25 in the same way as for RNA-seq data (Fig. 3). This technique provides specific sequencing of 5′‐complete complementary DNAs and avoids counting transcripts that initiate at other TSSs. The results show that the interdependence of RAMPAGE densities at the TSSs mimics to a large extent the phenomenology observed for RNA-seq densities immediately downstream of the TSSs (Fig. 2), including proximal, intermediate, and distal effects. Outside the TSS region, RAMPAGE densities are zero. The qualitative similarities between transcription initiation and transcription immediately downstream of the TSSs are consistent with a hierarchy of alternative TSS usage in delineating the overall RNA-seq signal. There are, however, general trends in the two-dimensional RNA-seq signal density space, such as differential persistence depending on the position of the closest downstream TSS, that extend beyond transcription initiation.

Figure 3

Transcription initiation in the K562 leukemia cell line shows a complex dependence on the distance from pairs of TSSs, their intragenic position, and the transcriptional activity of the gene. (A, B), two-dimensional density of RAMPAGE signal for pairs of the first (TSS 1) and second (TSS 2) TSSs (A) and the second (TSS 2) and third (TSS 3) TSSs (B) of genes with high, medium, and low levels of transcription. Data is available from the ENCODE consortium (experiment accession number ENCSR000AER, Thomas Gingeras lab, CSHL). The accession numbers of the minus and plus strand RAMPAGE signals and gene quantifications are ENCFF198YEH, ENCFF707TAV, and ENCFF782PCD, respectively.

Interdependence of transcription on neighboring TSSs is regulated

We investigated how the interdependence of the RNA-seq signal on consecutive pairs of TSSs is associated with known transcriptional regulation features. Explicitly, we considered three types of data: chromatin immunoprecipitation followed by sequencing (ChIP-seq) data for POLR2A as a reporter of RNA polymerase II (Pol II) occupancy (Figs. 4A and S3A); DNase I hypersensitivity analysis followed by sequencing (DNase-seq) data as a reporter of DNA accessibility (Figs. 4B and S3B), which is required for transcription factors and other regulatory proteins to bind DNA; and ChIP-seq data for the active chromatin marker H3K4me3 (Figs. 4C and S3C). The results show that changes in the transcriptional activity downstream of a TSS are correlated with changes in the transcriptional regulation features, indicating that the interdependence of transcription initiation and transcription at neighboring TSSs originates at the regulatory level.

Figure 4

RNA polymerase II occupancy, DNA accessibility, and H3K4me3 epigenetic chemical modification of the histone H3 protein in the K562 leukemia cell line show a complex dependence on the distance from pairs of TSSs and the transcriptional activity of the gene. (A, B, C), two-dimensional density of POLR2A ChIP-seq signal (A), DNase-seq signal (B), and H3K4me3 ChIP-seq signal (C) for pairs of the first (TSS 1) and second (TSS 2) TSSs of genes with high, medium, and low levels of transcription. Data is available from the ENCODE consortium (experiment accession numbers ENCSR000FAJ, Sherman Weissman lab, Yale; ENCSR000EKS, Gregory Crawford lab, Duke; and ENCSR000AKU, Bradley Bernstein lab, Broad). The accession numbers of the POLR2A ChIP-seq signal, DNase-seq signal, H3K4me3 ChIP-seq signal, and gene quantifications are ENCFF000YWY, ENCFF000SVL, ENCFF000BYB, and ENCFF782PCD, respectively.

Explicitly, the presence of a downstream TSS negatively impacts both Pol II occupancy and DNA accessibility around TSSs positioned upstream at intermediate distances. As in the case of transcription, these effects are much more marked for the first pair (Fig. 4A and B) than for the second pair of TSSs (Figures S3A and S3B). Pol II occupancy and DNA accessibility are systematically enhanced as well by the presence of an upstream TSS at intermediate distances and by the cooperative actions of two proximal TSSs. For pairs of TSSs that are far from each other, the relative contributions of Pol II occupancy and DNA accessibility decrease at the downstream TSS and increase at the upstream TSS as the transcriptional activity of the gene decreases, paralleling the shift in positional dominance observed for transcription. Outside the transcription initiation regions, the RNA-seq signal closely follows the main trends of Pol II occupancy, which overlap to a large extent with DNA accessibility. Therefore, a downstream TSS affects not only transcription initiation but also transcription progression.

Pol II occupancy and the presence of DNase I hypersensitivity sites are two general indicators of transcription and of transcription initiation and regulation, respectively26. Similarly, the active chromatin marker H3K4me3 (Figs. 4C and S3C) shows differentiated patterns on the two-dimensional signal densities that are consistent with active transcription initiation. Namely, H3K4me3 is high downstream of transcription initiation and significantly lower just upstream. This pattern is clearly observed, for instance, around the 2nd TSS of the first pair of distal TSSs for high transcription, and it switches to the 1st TSS as transcription decreases. In general, we observe that, for high transcriptional activity, H3K4me3 is high downstream of two TSSs, low immediately upstream of both TSSs, and changing from low to high between the two TSSs depending on their relative positions.

The interdependence of transcription on neighboring TSSs is conserved across human cell types

To study to what extent there are general trends present in other cell types, we obtained the two-dimensional RNA-seq signal densities for the GM12878 human lymphoblastoid cell line (Figure S4) and for H1-hESC human embryonic stem cells (Figure S5), which together with K562 constitute the three Tier 1 cell types of the encyclopedia of DNA elements (ENCODE) project27,28,29. The main features, involving proximal cooperativity, upstream effects, downstream effects, and positional dominance, are very similar for all three cell types.

We quantified the presence of these general trends across the full spectrum of cell types for each of the pairs of consecutive TSSs up to the 11th TSS in all human experiments in the ENCODE project with high replicate concordance (Table S1). These included 191 experiments with 122 different cell types (biosamples), covering all different biosample types. The results show that the complex interdependence of the transcriptional signal at multiple TSSs observed in K562, GM12878, and H1-hESC cells is conserved across the full variety of human cell types (Figs. 5, S6, and S9).
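The replicate-concordance criterion (a Spearman correlation above 0.8 among replicates, as stated in the caption of Fig. 5) could be checked along the following lines. This is only a sketch: the text does not specify which per-replicate values are correlated, so the inputs here are assumed to be two equal-length lists of matched quantifications, and the names `average_ranks`, `spearman`, and `is_concordant` are hypothetical. Ties receive average ranks, which is the usual convention.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average rank of the tied block i..j
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5  # undefined if either input is constant

def is_concordant(rep1, rep2, threshold=0.8):
    """True when the replicates exceed the concordance threshold."""
    return spearman(rep1, rep2) > threshold

rho = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```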

Figure 5

The complex interdependence of transcription at multiple TSSs is conserved across human cell types. The replicate mean and noise of the \({\text{log}}_{2}\) values of upstream effects, downstream effects, proximal cooperativity, and positional dominance are shown in terms of the transcriptional activity in region C stratified in five groups for the first and second TSSs, for the second and third TSSs, and for the average of all subsequent pairs of consecutive TSSs up to the 10th and 11th TSSs for all experiments in ENCODE with Spearman correlation > 0.8 among replicates. In total, there are 191 experiments (indicated by small symbols) comprising 122 different cell types. Different symbols indicate different biosample types, which include primary cell (62 experiments), cell line (93 experiments), tissue (27 experiments), and in vitro differentiated cells (9 experiments). Large symbols indicate the average of experiments within a biosample type. The replicate mean, represented in blue color, corresponds to the average of the \({\text{log}}_{2}\) values of two replicates [i.e., \(1/2\left( {\log_{2} \left( {T_{C}^{1} /T_{A}^{1} } \right) + \log_{2} \left( {T_{C}^{2} /T_{A}^{2} } \right)} \right)\), where the superscript indicates the replicate number]. The replicate noise, represented in orange color, corresponds to the difference of the \({\text{log}}_{2}\) value of replicate 1 from the replicate mean [i.e., \(1/2\left( {\log_{2} \left( {T_{C}^{1} /T_{A}^{1} } \right) - \log_{2} \left( {T_{C}^{2} /T_{A}^{2} } \right)} \right)\)]. Data is available from the ENCODE consortium (Brenton Graveley lab, UConn; Eric Lécuyer lab, IRCM; Michael Snyder lab, Stanford; and Thomas Gingeras lab, CSHL). For ENCODE accession numbers, see Table S1.
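The replicate mean and replicate noise defined in the caption above amount to half the sum and half the difference of the \({\text{log}}_{2}\) ratios of the two replicates. A minimal sketch, with the hypothetical function name `replicate_mean_and_noise` and assuming both ratios are strictly positive:

```python
import math

def replicate_mean_and_noise(ratio_rep1, ratio_rep2):
    """Return (mean, noise) of the log2 ratios of two replicates."""
    l1, l2 = math.log2(ratio_rep1), math.log2(ratio_rep2)
    mean = 0.5 * (l1 + l2)
    noise = 0.5 * (l1 - l2)  # equals l1 - mean
    return mean, noise

# Ratios of 4 and 1 have log2 values of 2 and 0.
mean, noise = replicate_mean_and_noise(4.0, 1.0)
print(mean, noise)
```

A noise value that is small relative to the mean indicates that the two replicates agree on the size and sign of the effect.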

Explicitly, upstream and downstream effects are extremely marked for the first pair of TSSs, substantially decrease for the second pair, and are highly suppressed for the other pairs further downstream in the gene, except for highly transcribed genes. In this latter case, the presence of an additional TSS nearby, either upstream or downstream, is always associated with enhanced transcription. Similarly, positional dominance also ranges from very marked for the first pair of TSSs to highly suppressed for the other pairs further downstream in the gene. In contrast, proximal cooperativity is always maintained at a high level irrespective of the transcriptional activity of the gene and the relative position of the TSS pair within the gene.

We also quantified the presence of general trends in transcription initiation using RAMPAGE data for each of the pairs of consecutive TSSs up to the 11th TSSs in all human experiments in the ENCODE project with high replicate concordance (Table S2). These included 65 experiments with 56 different cell types, covering all biosample types. The results show that the main trends observed for the transcription initiation in K562 are conserved across human cell types (Figs. 6, S7, and S10). There are broad similarities with RNA-seq data but also notable differences.

Figure 6

Transcription initiation parallels the conserved interdependence patterns of transcription at multiple TSSs. The same quantities as in Fig. 5 are shown computed with RAMPAGE data instead of with RNA-seq data. In total, there are 65 experiments comprising 56 different cell types, which include, as biosample types, primary cell (11 experiments), cell line (25 experiments), tissue (24 experiments), and in vitro differentiated cells (5 experiments). Data is available from the ENCODE consortium (Thomas Gingeras lab, CSHL). For ENCODE accession numbers, see Table S2.

Positional dominance for RAMPAGE data across multiple cell types closely mimics the results for RNA-seq data, indicating that transcription initiation as well as transcription generally shift from the upstream to the downstream TSS of the distal TSS pair as the transcriptional activity of the gene increases. Upstream effects are also remarkably similar for both processes, except for the first pair of TSSs with low transcriptional activity of the gene. Downstream effects and proximal cooperativity are positive in both RNA-seq and RAMPAGE data but are much more marked in the latter. In general, it is observed that these effects become more pronounced in RAMPAGE data as the positional order of the TSS pair in the gene increases. The fact that these marked transcription initiation effects are reduced to a large extent in transcription as the positional order of the TSS pair increases is consistent with transcription at a given position accounting for the cumulative effects of transcription initiated at the upstream TSSs.

The quantification of the average RNA-seq signal between consecutive TSSs indicates that the general trends observed qualitatively for Tier 1 cell types are indeed conserved across the full variety of human cell types (Figs. 7, S8, and S11). Explicitly, the expected reduced average signal after transcription initiation is observed for any location of the downstream TSSs (Figs. 7, S8, and S11), except for the first TSS pair with a non-distal downstream TSS. In the case of the K562 leukemia cell line, which we analyzed explicitly at the level of the regulatory features, this behavior is also present at the level of the Pol II occupancy (Fig. 4), thus indicating that it is a general feature of transcription itself. Comparing the effects of the downstream TSS location, the persistence of the average RNA-seq signal is systematically higher for a non-distal than for a distal downstream TSS for all TSS pairs (Figs. 7, S8, and S11), which we refer to as persistence dominance. In absolute terms, the average RNA-seq signal between TSSs does not depend on the downstream TSS distance for the first pair of TSSs of the gene, but tends to be higher in the presence of a non-distal downstream TSS for subsequent TSS pairs in the gene (Figs. 7, S8, and S11).

Figure 7

The complex interdependence of transcription between multiple TSSs is conserved across human cell types. The replicate mean and noise of the \({\text{log}}_{2}\) values of transcription persistence with a distal downstream TSS, persistence with a non-distal downstream TSS, signal dominance, and persistence dominance are shown in terms of the transcriptional activity in region C for the same cases and conditions as in Fig. 5.

Multiple levels of variability across biosamples

Alternative transcription has important implications for gene expression as it determines the variability of the repertoire of protein isoforms. In our analysis, we have observed high variability in the means of replicates along the main trends, which is considerably higher than the variability between replicates (Figures S6, S7, and S8). This variability has both a random-like component and a bias that is determined by the genomic context. The bias, resulting from the general interdependence patterns across TSSs we have identified, is strongly evident in the averages of all experiments within each biosample type, which exhibit little variation across different biosample types (Figs. 5, 6, and 7). Therefore, the main trends in the interdependence of transcriptional processes on the TSS arrangements are present in the same form for all cell types, regardless of their mutational background, specific origin, or function within the organism. Compounded with these general trends, there are multiple levels of variability, such as replicate noise, cell-type-specific TSS usage, and specific responses to different conditions.

Discussion

Directed analyses on specific systems have shown that many fundamental mechanisms involved in transcription regulation strongly depend on the precise distances among the locations of multiple DNA elements15,30, but it has been unclear to what extent this dependence could be present along the genome after the confluence of many of these potentially opposing mechanisms31. Especially relevant is the case of alternative transcription32. There is ample evidence that multiple TSSs in most genes have independent cell-type-specific expression profiles21. These profiles have been found to be connected to disease states, including alternative transcription initiation at multiple TSSs that is deregulated across cancer types and patients33 and that exhibits well-defined, specific signatures in type 2 diabetes34. The types of regulation comprise a wide range of modalities, including the TSSs of a gene being coregulated, namely increasing or decreasing their expression proportionally35, and, on the opposite side, switching expression from one TSS to another36.

The multiple-landmark-alignment methodology we have developed provides an avenue to elucidate how the precise positioning of multiple landmarks reflects in DNA-dependent processes on a genome-wide scale. The simultaneous consideration of multiple distances (stratified as proximal, intermediate, and distal) has been a fundamental element of our approach to uncover the existence of regulated interdependence patterns of gene expression at alternative TSSs and between TSSs across human cell lines, primary cells, in vitro differentiated cells, and tissues. This interdependence comprises proximal cooperativity, upstream and downstream interactions, positional dominance, enhancement of transcription persistence, and attenuation of the transcriptional signal. In general, these effects are highly dependent on the intragenic position of the TSSs, the transcriptional activity of the gene, and the precise distances between TSSs, but at the same time, they are consistently conserved across human cell types, irrespective of their specific origin or function within the organism.

Among the most salient phenomena are proximal cooperativity and downstream effects, which encompass higher transcription downstream of a TSS the closer it is to an upstream TSS within the gene. This type of enhancement observed in transcription is also present, even more prominently, in transcription initiation. On the opposite side, our results show the presence of marked upstream effects, namely, the attenuation of the transcriptional signal and transcription initiation at an upstream TSS by the presence of a nearby downstream TSS. Simultaneously with the negative effects on the absolute levels of transcription, a downstream TSS positively enhances the persistence of transcription after its initiation. Concomitantly, DNA accessibility and Pol II densities show lower but more sustained profiles for a non-distal than for a distal downstream TSS. These results can be understood mechanistically considering that the assembly of the transcription initiation complex upstream of a TSS interferes with transcription initiation and that the overall transcription process promotes upstream and downstream DNA accessibility.

Recent genome-wide analyses have concluded that multiple alternative transcription initiation is largely nonadaptive and results predominantly from imprecise events37. At the molecular level, fundamental biochemical principles dictate that non-specific effects, as quantified and validated in simpler gene expression prokaryotic systems, cannot generally be suppressed completely and that they are affected by regulatory processes in the usual way38. In this context, the existence of general patterns within multiple levels of variability we have identified shows both a consistent signal across cell types determined by the genomic context and a random-like component dependent on the cell type and conditions, akin to non-specific transcription initiation.

Our analysis has also shown that there are clear positional dominance effects when the two TSSs are far from each other, resulting in most transcription and transcription initiation shifting from the upstream to the downstream TSS of the distal TSS pair as the transcriptional activity of the gene increases. This effect is extremely marked for the 1st and 2nd annotated TSSs of a gene and it is generally more pronounced for transcription initiation than for transcription. These types of results also have a practical side as they can be used to refine and complement TSS annotations. Explicitly, the case of positional dominance implies that the 1st annotated TSS of a gene is essentially not active, or it is not an actual TSS, if the transcriptional activity of the gene is high.

The most remarkable finding of our work is therefore the discovery of the existence of general regulated interdependence patterns of gene expression at and between alternative TSSs of protein-coding genes. We showed that these effects are conserved across cell types through a comprehensive analysis of the hundreds of human transcription and transcription initiation experiments of the ENCODE project. Compounded with these general patterns, there are multiple levels of variability, such as replicate noise, cell-type-specific TSS usage, and adaptation to different conditions. The identification of these general patterns in the alternating structure of transcription has important implications for gene expression as they determine the variability of the repertoire of isoform proteins.

On the methodological side, our approach can generally be applied to virtually any combination of landmarks and any genomic signal in the same way as we have applied it to TSSs and RNA-seq, RAMPAGE, DNase-seq, and ChIP-seq signals. Therefore, our results open an avenue to find novel distance-dependent functional relationships among multiple DNA elements in a wide variety of systems.

Materials and methods

Genomic signals in two dimensions

The average of the signal \(g\left( z \right) = g\left( {x + z_{U} } \right)\delta_{{y, x + z_{U} - z_{D} }}\) over a rectangular region from \(x_{0}\) to \(x_{1}\) along the \(x\) coordinate and from \(y_{0}\) to \(y_{1}\) along the \(y\) coordinate for all TSS pairs in the set \(V\) is expressed as

$$R_{V} \left[ {\left( {x_{0} ,y_{0} } \right),\left( {x_{1} ,y_{1} } \right)} \right] = \frac{1}{N}\mathop \sum \limits_{{\{ z_{U} ,z_{D} \} \in V}} \mathop \sum \limits_{{x = x_{0} }}^{{x_{1} }} \mathop \sum \limits_{{y = y_{0} }}^{{y_{1} }} g\left( {x + z_{U} } \right)\delta_{{y, x + z_{U} - z_{D} }} ,$$

where \(N\) is the normalization factor, which is given by

$$N = \mathop \sum \limits_{{\{ z_{U} ,z_{D} \} \in V}} \mathop \sum \limits_{{x = x_{0} }}^{{x_{1} }} \mathop \sum \limits_{{y = y_{0} }}^{{y_{1} }} \delta_{{y, x + z_{U} - z_{D} }} .$$

In our analysis, we use multiple sets \(V\) corresponding to a specific contiguous pair of TSSs of genes with transcriptional activities within a range of values (e.g., the set of 1st and 2nd TSSs of all protein-coding genes with high transcription).
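As a minimal sketch of the definition above, the following Python function evaluates the region average literally, with the Kronecker delta written as an explicit equality test. The names `region_average_naive`, `g` (a NumPy array indexed by genomic coordinate), and `pairs` (a list of \(\left( {z_{U} ,z_{D} } \right)\) tuples standing for the set \(V\)) are illustrative assumptions and are not taken from the scripts of the supplemental information.

```python
import numpy as np

def region_average_naive(g, pairs, x0, y0, x1, y1):
    """Literal evaluation of R_V[(x0, y0), (x1, y1)] with a triple loop.

    g     : 1D array, g[z] is the signal at genomic coordinate z (assumed layout)
    pairs : iterable of (z_U, z_D) positions, i.e. the set V
    Assumes every x + z_U visited falls inside the array g.
    """
    total, count = 0.0, 0
    for z_u, z_d in pairs:
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                if y == x + z_u - z_d:      # Kronecker delta
                    total += g[x + z_u]     # numerator of R_V
                    count += 1              # normalization factor N
    return total / count if count else None  # undefined when N = 0

# Toy signal with g[z] = z and two hypothetical TSS pairs.
g = np.arange(100, dtype=float)
pairs = [(40, 50), (60, 65)]
example = region_average_naive(g, pairs, x0=0, y0=-10, x1=5, y1=-5)
print(example)  # 45.0
```

In this toy case the first pair contributes the six positions \(g(40), \ldots ,g(45)\) and the second pair only \(g(60)\), so the average is \(315/7 = 45\).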

Two-dimensional region averages

To compute the region average from the previous expression efficiently, we take into account that \(\mathop \sum \limits_{{y = y_{0} }}^{{y_{1} }} g\left( {x + z_{U} } \right)\delta_{{y, x + z_{U} - z_{D} }}\) is \(g\left( {x + z_{U} } \right)\) if \(y_{0} \le x + z_{U} - z_{D} \le y_{1}\) and zero otherwise. Therefore, the sum over \(x\) is different from zero only for \(x \ge y_{0} - z_{U} + z_{D}\) and \(x \le y_{1} - z_{U} + z_{D}\), which leads to

$$R_{V} \left[ {\left( {x_{0} ,y_{0} } \right),\left( {x_{1} ,y_{1} } \right)} \right] = \frac{1}{N}\mathop \sum \limits_{{\begin{array}{*{20}c} {\{ z_{U} ,z_{D} \} \in V} \\ \\ \end{array} }} \mathop \sum \limits_{{x = \max \left( {x_{0} , y_{0} - z_{U} + z_{D} } \right)}}^{{\min \left( {x_{1} ,y_{1} - z_{U} + z_{D} } \right)}} g\left( {x + z_{U} } \right).$$

Similarly, the normalization factor is expressed as

$$N = \mathop \sum \limits_{{\{ z_{U} ,z_{D} \} \in V}} \mathop \sum \limits_{{x = \max \left( {x_{0} , y_{0} - z_{U} + z_{D} } \right)}}^{{\min \left( {x_{1} ,y_{1} - z_{U} + z_{D} } \right)}} 1.$$

Note that the region average is defined only if there exists at least one pair \(\{ z_{U} ,z_{D} \}\) in \(V\) such that \(y_{0} - x_{1} \le z_{U} - z_{D} \le y_{1} - x_{0}\), which is equivalent to the condition \(\max \left( {x_{0} , y_{0} - z_{U} + z_{D} } \right) \le \min \left( {x_{1} ,y_{1} - z_{U} + z_{D} } \right)\).
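The reduced sums above can be sketched in Python as follows. The limits of the sum over \(x\) follow the expressions in the text. The use of a cumulative sum to evaluate each inner sum in constant time is an additional assumption made here for illustration, as are the names `region_average`, `g`, and `pairs`.

```python
import numpy as np

def region_average(g, pairs, x0, y0, x1, y1):
    """R_V[(x0, y0), (x1, y1)] using the max/min limits on the sum over x.

    g     : 1D array, g[z] is the signal at genomic coordinate z (assumed layout)
    pairs : iterable of (z_U, z_D) positions, i.e. the set V
    Assumes every x + z_U within the limits falls inside the array g.
    """
    # prefix[k] = g[0] + ... + g[k - 1], so sum(g[a..b]) = prefix[b + 1] - prefix[a]
    prefix = np.concatenate(([0.0], np.cumsum(g)))
    total, count = 0.0, 0
    for z_u, z_d in pairs:
        lo = max(x0, y0 - z_u + z_d)
        hi = min(x1, y1 - z_u + z_d)
        if lo > hi:
            continue  # this pair does not intersect the region
        total += prefix[hi + z_u + 1] - prefix[lo + z_u]
        count += hi - lo + 1  # contribution to the normalization factor N
    return total / count if count else None  # undefined when no pair contributes

# Toy signal with g[z] = z and two hypothetical TSS pairs.
g = np.arange(100, dtype=float)
pairs = [(40, 50), (60, 65)]
example = region_average(g, pairs, x0=0, y0=-10, x1=5, y1=-5)
print(example)  # 45.0
```

Returning `None` when no pair satisfies the condition mirrors the statement that the region average is undefined in that case.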

Two-dimensional signal densities

To compute the signal densities, we use a moving window defined by a rectangular domain centered at \(\left( {x,y} \right)\) with dimensions \(2n_{X} + 1\) along the \(x\) coordinate and \(2n_{Y} + 1\) along the \(y\) coordinate. The signal density \(G\left( {x,y} \right)\) averaged over this domain for all TSS pairs in the set \(V\) is given by

$$G\left( {x,y} \right) = R_{V} \left[ {\left( {x - n_{X} ,y - n_{Y} } \right),\left( {x + n_{X} ,y + n_{Y} } \right)} \right].$$

The explicit values of \(n_{X}\) and \(n_{Y}\) used in our analysis are \(n_{X} = 99\) for \(- 500 \le x \le 1k\), \(n_{X} = x/4\) for \(1k < x \le 20k\), \(n_{Y} = 99\) for \(- 1k \le y \le 1k\), and \(n_{Y} = - y/4\) for \(- 20k \le y < - 1k\).
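The position-dependent window can be sketched as below. The function names are hypothetical, \(k\) is read as 1000, and rounding \(x/4\) and \(- y/4\) down to an integer is an assumption, since the text does not state how non-integer half-widths are handled.

```python
def half_width_x(x):
    """Half-width n_X of the moving window along the x coordinate."""
    if -500 <= x <= 1000:
        return 99
    if 1000 < x <= 20000:
        return x // 4  # assumption: x/4 rounded down
    raise ValueError("x outside the range specified in the text")

def half_width_y(y):
    """Half-width n_Y of the moving window along the y coordinate."""
    if -1000 <= y <= 1000:
        return 99
    if -20000 <= y < -1000:
        return (-y) // 4  # assumption: -y/4 rounded down
    raise ValueError("y outside the range specified in the text")

def density_window(x, y):
    """Corners (x - n_X, y - n_Y), (x + n_X, y + n_Y) of the domain used for G(x, y)."""
    n_x, n_y = half_width_x(x), half_width_y(y)
    return (x - n_x, y - n_y), (x + n_x, y + n_y)

print(density_window(0, 0))         # ((-99, -99), (99, 99))
print(density_window(2000, -4000))  # ((1500, -5000), (2500, -3000))
```

The two corners returned by `density_window` are the arguments that would be passed to the region average \(R_{V}\) to obtain \(G\left( {x,y} \right)\).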

Average transcription in a region

The average transcription in a region \(W\), \(T_{W} = \left\langle {g\left( {x + z_{U} } \right)\delta _{{y,~x + z_{U} - z_{D} }} } \right\rangle _{{\left\{ {z_{U} ,z_{D} } \right\},\left( {x,y} \right)}}\) with \(\left( {x,y} \right) \in W\), is computed explicitly for the representative regions as \(T_{A} = R_{V} \left[ {\left( {0, - 20k} \right),\left( {200, - 10k} \right)} \right],\) \(T_{Af} = R_{V} \left[ {\left( {300, - 20k} \right),\left( {1k, - 10k} \right)} \right],\)\(T_{B} = R_{V} \left[ {\left( {0, - 900} \right),\left( {200, - 200} \right)} \right],\) \(T_{Bf} = R_{V} \left[ {\left( {300, - 900} \right),\left( {1k, - 200} \right)} \right],\) \(T_{C} = R_{V} \left[ {\left( {0, 0} \right),\left( {200, 200} \right)} \right],\) \(T_{D} = R_{V} \left[ {\left( {300, 0} \right),\left( {1k,200} \right)} \right],\) and \(T_{E} = R_{V} \left[ {\left( {10k, 0} \right),\left( {20k,200} \right)} \right].\)
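For reference, the representative regions listed above can be collected in a small lookup table. This is only a convenience sketch: the dictionary name `REGIONS` and the helper `region_size` are hypothetical, and \(k\) is read as 1000.

```python
# Corners ((x0, y0), (x1, y1)) of the representative regions, in base pairs.
REGIONS = {
    "T_A":  ((0, -20000), (200, -10000)),
    "T_Af": ((300, -20000), (1000, -10000)),
    "T_B":  ((0, -900), (200, -200)),
    "T_Bf": ((300, -900), (1000, -200)),
    "T_C":  ((0, 0), (200, 200)),
    "T_D":  ((300, 0), (1000, 200)),
    "T_E":  ((10000, 0), (20000, 200)),
}

def region_size(name):
    """Number of integer positions spanned along x and along y (inclusive limits)."""
    (x0, y0), (x1, y1) = REGIONS[name]
    return x1 - x0 + 1, y1 - y0 + 1

for name in REGIONS:
    print(name, region_size(name))
```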

TSSs

TSSs were obtained from the comprehensive gene annotation on the reference chromosomes of Gencode V19 (https://www.gencodegenes.org/human/release_19.html).

TSS order

TSSs are ordered according to their genomic position, starting the enumeration from the most upstream TSS. Therefore, according to this notation, the 1st TSS does not necessarily correspond to the TSS with the highest expression.
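The enumeration can be illustrated with the short sketch below. It assumes that "upstream" is defined relative to the direction of transcription, so that TSSs of minus-strand genes are sorted by decreasing coordinate. The text does not spell out the strand handling, and the function name `order_tss` is hypothetical.

```python
def order_tss(positions, strand):
    """Map TSS rank (1 = most upstream) to genomic position for one gene.

    positions : iterable of TSS coordinates annotated for the gene
    strand    : "+" or "-" (assumption: upstream follows the transcription direction)
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    unique = sorted(set(positions), reverse=(strand == "-"))
    return {rank: pos for rank, pos in enumerate(unique, start=1)}

print(order_tss([500, 100, 300], "+"))  # {1: 100, 2: 300, 3: 500}
print(order_tss([500, 100, 300], "-"))  # {1: 500, 2: 300, 3: 100}
```

The rank depends only on position, which is why the 1st TSS need not be the most highly expressed one.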

Genomic signals

RNA-seq, RAMPAGE, DNase-seq, and ChIP-seq genomic signals were downloaded from the Encyclopedia of DNA Elements (ENCODE) consortium repository (http://www.encodeproject.org/) as bigWig files for the hg19 mapping assembly/V19 genome annotation. Gene quantifications for the corresponding RNA-seq signals were downloaded as tsv files. Signals were normalized by their average value over the whole genome before analysis.
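The normalization step can be sketched as follows, assuming that the signal has already been read from the bigWig files into one dense NumPy array per chromosome and that the average is taken over all positions of all chromosomes. The function name and the dictionary layout are illustrative assumptions.

```python
import numpy as np

def normalize_by_genome_mean(signal_by_chrom):
    """Divide every per-chromosome signal by the genome-wide average value.

    signal_by_chrom : dict mapping chromosome name to a 1D array of signal
                      values, one entry per base pair (assumed layout)
    """
    total = sum(float(np.sum(v)) for v in signal_by_chrom.values())
    n_positions = sum(len(v) for v in signal_by_chrom.values())
    genome_mean = total / n_positions
    if genome_mean == 0:
        raise ValueError("signal is zero everywhere; cannot normalize")
    return {chrom: np.asarray(v, dtype=float) / genome_mean
            for chrom, v in signal_by_chrom.items()}

# Toy example with two short hypothetical chromosomes (genome-wide mean = 2).
raw = {"chr1": np.array([1.0, 3.0]), "chr2": np.array([2.0, 2.0, 2.0])}
normalized = normalize_by_genome_mean(raw)
print(normalized["chr1"])  # [0.5 1.5]
```

After this step the genome-wide average of the signal is one, which makes signals from different experiments directly comparable.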

RNA-seq experiment selection

RNA-seq experiments were selected in two steps. First, we considered all the experiments that matched the search criteria "hg19" for assembly and "polyA mRNA RNA-seq" or "total RNA-seq" for assay title, which produced 594 results. Subsequently, we selected RNA-seq experiments that included plus and minus strand signal of unique reads and that had high replicate concordance (Spearman correlation > 0.8 between gene quantifications of the replicates), which resulted in 191 experiments.

RAMPAGE experiment selection

RAMPAGE experiments were selected in two steps. First, we considered all the experiments that matched the search criteria "hg19" for assembly and "RAMPAGE" for assay title, which produced 155 results. Subsequently, we selected RAMPAGE experiments that included plus and minus strand signal of unique reads and that had high replicate concordance (Spearman correlation > 0.8 between gene quantifications of the replicates), which resulted in 65 experiments.
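The replicate concordance criterion used for both the RNA-seq and RAMPAGE selections can be sketched with plain NumPy as shown below. The sketch assumes two replicates per experiment, each given as a vector of gene quantifications in the same gene order. The function names are hypothetical, and the Spearman correlation is computed as the Pearson correlation of average ranks.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks within each group of ties by the group mean.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def is_concordant(rep1, rep2, threshold=0.8):
    """True if the replicates pass the Spearman > 0.8 criterion of the text."""
    return spearman(rep1, rep2) > threshold

# Toy gene quantifications for two hypothetical replicates.
rep1 = [1.0, 2.0, 3.0, 4.0, 5.0]
rep2 = [1.1, 2.5, 2.9, 4.2, 6.0]
print(is_concordant(rep1, rep2))  # True
```

With SciPy available, `scipy.stats.spearmanr` could be used in place of the `spearman` helper.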

Data analysis

The data analysis was performed using custom Python 3.8 scripts implemented in Jupyter Notebooks available in the supplemental information.