RNA proximity sequencing reveals the spatial organization of the transcriptome in the nucleus


The global, three-dimensional organization of RNA molecules in the nucleus is difficult to determine using existing methods. Here we introduce Proximity RNA-seq, which identifies colocalization preferences for pairs or groups of nascent and fully transcribed RNAs in the nucleus. Proximity RNA-seq is based on massive-throughput RNA barcoding of subnuclear particles in water-in-oil emulsion droplets, followed by cDNA sequencing. Our results show RNAs of varying tissue-specificity of expression, speed of RNA polymerase elongation and extent of alternative splicing positioned at varying distances from nucleoli. The simultaneous detection of multiple RNAs in proximity to each other distinguishes RNA-dense from sparse compartments. Application of Proximity RNA-seq will facilitate study of the spatial organization of transcripts in the nucleus, including non-coding RNAs, and its functional relevance.


Fig. 1: Proximity RNA-seq.
Fig. 2: RNA cobarcoding and proximal transcriptomes.
Fig. 3: Bipartite nuclear transcriptome.
Fig. 4: RNA proximities of nascent and processed transcripts.
Fig. 5: RNA valency.
Fig. 6: RNA valency and chromatin territories.

Data availability

Proximity RNA-seq and Hi-C raw sequencing data are available from the Gene Expression Omnibus under accession GSE129732.

Code availability

Code for Hi-C and Proximity RNA-seq analysis is available on GitHub: https://github.com/3DGenomes/TADbit and https://github.com/StevenWingett/CloseCall, respectively.


References

1. Zhao, R., Bodnar, M. S. & Spector, D. L. Nuclear neighborhoods and gene expression. Curr. Opin. Genet. Dev. 19, 172–179 (2009).
2. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
3. Quinodoz, S. A. et al. Higher-order inter-chromosomal hubs shape 3D genome organization in the nucleus. Cell 174, 744–757 (2018).
4. Beagrie, R. A. et al. Complex multi-enhancer contacts captured by genome architecture mapping. Nature 543, 519–524 (2017).
5. Stahl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353, 78–82 (2016).
6. Lee, J. H. et al. Highly multiplexed subcellular RNA sequencing in situ. Science 343, 1360–1363 (2014).
7. Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).
8. Shah, S. et al. Dynamics and spatial genomics of the nascent transcriptome by Intron seqFISH. Cell 174, 363–376.e16 (2018).
9. Weidmann, C. A., Mustoe, A. M. & Weeks, K. M. Direct duplex detection: an emerging tool in the RNA structure analysis toolbox. Trends Biochem. Sci. 41, 734–736 (2016).
10. Nguyen, T. C. et al. Mapping RNA-RNA interactome and RNA structure in vivo by MARIO. Nat. Commun. 7, 12023 (2016).
11. Kudla, G., Granneman, S., Hahn, D., Beggs, J. D. & Tollervey, D. Cross-linking, ligation, and sequencing of hybrids reveals RNA-RNA interactions in yeast. Proc. Natl Acad. Sci. USA 108, 10010–10015 (2011).
12. Sugimoto, Y. et al. hiCLIP reveals the in vivo atlas of mRNA secondary structures recognized by Staufen 1. Nature 519, 491–494 (2015).
13. Ramani, V., Qiu, R. & Shendure, J. High-throughput determination of RNA structure by proximity ligation. Nat. Biotechnol. 33, 980–984 (2015).
14. Dressman, D., Yan, H., Traverso, G., Kinzler, K. W. & Vogelstein, B. Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc. Natl Acad. Sci. USA 100, 8817–8822 (2003).
15. Shendure, J. et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728–1732 (2005).
16. Ameur, A. et al. Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nat. Struct. Mol. Biol. 18, 1435–1440 (2011).
17. Scheer, U. & Hock, R. Structure and function of the nucleolus. Curr. Opin. Cell Biol. 11, 385–390 (1999).
18. Neve, J. et al. Subcellular RNA profiling links splicing and nuclear DICER1 to alternative cleavage and polyadenylation. Genome Res. 26, 24–35 (2016).
19. Gondran, P., Amiot, F., Weil, D. & Dautry, F. Accumulation of mature mRNA in the nuclear fraction of mammalian cells. FEBS Lett. 458, 324–328 (1999).
20. Engreitz, J. M. et al. RNA-RNA interactions enable specific targeting of noncoding RNAs to nascent pre-mRNAs and chromatin sites. Cell 159, 188–199 (2014).
21. Padeken, J. & Heun, P. Nucleolus and nuclear periphery: Velcro for heterochromatin. Curr. Opin. Cell Biol. 28, 54–60 (2014).
22. Fagerberg, L. et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol. Cell. Proteomics 13, 397–406 (2014).
23. Kryuchkova-Mostacci, N. & Robinson-Rechavi, M. A benchmark of gene expression tissue-specificity metrics. Brief. Bioinform. 18, 205–214 (2017).
24. Whyte, W. A. et al. Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell 153, 307–319 (2013).
25. van Groningen, T. et al. Neuroblastoma is composed of two super-enhancer-associated differentiation states. Nat. Genet. 49, 1261–1266 (2017).
26. Busch, A. & Hertel, K. J. HEXEvent: a database of Human EXon splicing Events. Nucleic Acids Res. 41, D118–124 (2013).
27. van der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2015).
28. Edstrom, J. E., Grampp, W. & Schor, N. The intracellular distribution and heterogeneity of ribonucleic acid in starfish oocytes. J. Biophys. Biochem. Cytol. 11, 549–557 (1961).
29. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
30. Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381–385 (2012).
31. Sexton, T. et al. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell 148, 458–472 (2012).
32. Serra, F. et al. Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural features of the fly chromatin colors. PLOS Comput. Biol. 13, e1005665 (2017).
33. Bernhard, W. A new staining procedure for electron microscopical cytology. J. Ultrastruct. Res. 27, 250–265 (1969).
34. Jonkers, I., Kwak, H. & Lis, J. T. Genome-wide dynamics of Pol II elongation and its interplay with promoter proximal pausing, chromatin, and exons. eLife 3, e02407 (2014).
35. Veloso, A. et al. Rate of elongation by RNA polymerase II is associated with specific gene features and epigenetic modifications. Genome Res. 24, 896–905 (2014).
36. Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
37. Ziv, O. et al. COMRADES determines in vivo RNA structures and interactions. Nat. Methods 15, 785–788 (2018).
38. Battaglia, S. et al. RNA-dependent chromatin association of transcription elongation factors and Pol II CTD kinases. eLife 6, e25637 (2017).
39. Rybak-Wolf, A. et al. Circular RNAs in the mammalian brain are highly abundant, conserved, and dynamically expressed. Mol. Cell 58, 870–885 (2015).
40. Mifsud, B. et al. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nat. Genet. 47, 598–606 (2015).
41. Rubin, A. J. et al. Lineage-specific dynamic and pre-established enhancer-promoter contacts cooperate in terminal differentiation. Nat. Genet. 49, 1522–1528 (2017).
42. Tsanov, N. et al. smiFISH and FISH-quant – a flexible single RNA detection approach with super-resolution capability. Nucleic Acids Res. 44, e165 (2016).
43. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
44. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
45. Kaimal, V., Bardes, E. E., Tabar, S. C., Jegga, A. G. & Aronow, B. J. ToppCluster: a multiple gene list feature analyzer for comparative enrichment clustering and network-based dissection of biological systems. Nucleic Acids Res. 38, W96–102 (2010).
46. Yu, G. et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978 (2010).
47. Wingett, S. et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Research 4, 1310 (2015).
48. Imakaev, M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999–1003 (2012).
49. Trussart, M. et al. Assessing the limits of restraint-based 3D modeling of genomes and genomic domains. Nucleic Acids Res. 43, 3465–3477 (2015).
50. Gerchman, S. E. & Ramakrishnan, V. Chromatin higher-order structure studied by neutron scattering and scanning transmission electron microscopy. Proc. Natl Acad. Sci. USA 84, 7802–7806 (1987).
51. Pettersen, E. F. et al. UCSF Chimera – a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605–1612 (2004).



J.M. was supported by a Swiss National Science Foundation early postdoc mobility fellowship, a Human Frontier Science Program long-term fellowship and the Babraham Science Policy Committee. The work of I.F. and M.A.M.-R. was partially supported by the European Research Council under the 7th Framework Program FP7/2007-2013 (ERC grant no. 609989) and the European Union’s Horizon 2020 research and innovation program (grant no. 676556) to M.A.M.-R. M.A.M.-R. also acknowledges the support of the Spanish Ministry of Economy and Competitiveness (grant nos. BFU2013-47736-P and BFU2017-85926-P), Centro de Excelencia Severo Ochoa 2013-2017 (grant no. SEV-2012-0208) and the Agency for Management of University and Research Grants (AGAUR). M.F.-M. was supported by UNAM Technology Innovation and Research Support Program PAPIIT IA201817 and PAPIIT IN207319. We acknowledge Sphere Fluidics for their contribution of microfluidic knowhow and time and their free donation of surfactants. We thank Babraham sequencing, fluorescence-activated cell sorting and imaging facilities for technical support, and Peter Rugg-Gunn, Paulo Amaral and Lucas Edelman for helpful discussions.

Author information




J.M. designed the study, performed experiments, analysed data and wrote the manuscript with contributions from all authors. S.W.W. provided conceptual advice and analysed data. I.F., J.C., A.S.-P. and M.A.M.-R. analysed data. M.F.-M., L.F.J.-G. and S.W. performed experiments. X.L. and F.F.C. provided technical help. S.A. provided conceptual advice. P.F. designed and supervised the study and wrote the manuscript.

Corresponding author

Correspondence to Jörg Morf.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Emulsion droplets and molecular biology of library preparation.

A) Image of a water-in-oil emulsion with polydisperse droplets as used in Proximity RNA-seq. A 1-micron magnetic bead is indicated. Similar images were taken repeatedly (n = 3) and emulsions were monitored by microscopy routinely. B) A pulsed reverse transcription protocol of 60 cycles of primer annealing and extension was used to capture RNA of nuclear particles and synthesize cDNA in emulsion droplets. Actinomycin D was added to prevent DNA-templated synthesis. After breakage of the emulsion, crosslinks were reversed and RNA digested. 3’ cDNA ends were extended by terminal transferase and a mixture of dGTP and ddGTP. The fraction of 2',3'-dideoxyguanosine-5'-triphosphate used terminates the synthesis of guanosine homopolymers after an average of 20 bases (see C). Pre-amplification of cDNA fragments on beads by 5 cycles of PCR using one of the Illumina primers flanked by a poly-cytosine stretch was followed by in-droplet PCR amplification using Illumina small RNA library primers. C) Urea/polyacrylamide gel separation (n = 1) of G-tailed single-stranded 50-mer DNA oligos shows that the tailing reaction is invariant across a range of substrate concentrations (10–100 ng). The gel picture on the right was acquired after longer exposure of the same gel as shown on the left.

Supplementary Figure 2 Proximity RNA-seq pipeline and transcriptome annotation.

A) Pipeline overview. B) Transcriptome annotation. Features were genes from Ensembl 78 and RNA repeats from RepeatMasker (coloured bars with capital letters); reads were assigned to these features after the modifications described below. The top row of features in the scheme represents the gene and RNA repeat annotation; the bottom row shows the retained features, or stretches of features, used by the Proximity RNA-seq pipeline. Opposing strands (turquoise and orange, respectively) were treated separately. Reads mapping to regions of overlap between gene features on the same strand were excluded from further analysis (overlap between features A and E). However, when a gene was contained entirely within a surrounding gene on the same strand (feature F within B), the area of overlap was assigned to the inner gene. For example, the majority of genes encoding snoRNAs in humans are located within intronic regions of host genes and therefore belong to the class of features contained within others. RNA repeat annotations (grey, G), which are assigned to both strands by RepeatMasker, were likewise prioritised over overlapping gene annotations on both strands (overlap of G with C).

Supplementary Figure 3 Data characteristics, sample correlations and random simulation.

A) Number of proxy reads, barcodes and co-barcoded pairs after all possible permutations within a particle as a function of barcode group size (see also Supplementary Table 2). B) Pairwise Spearman rank correlations between different Proximity RNA-seq datasets using the number of proxy reads (left) or the number of detected co-barcoding events between two transcripts (right) as variables. For co-barcoding datasets, an initial cutoff of five or more observations for RNA pairs was applied. Then, all datasets were trimmed by removing the top and bottom 10% pairs based on number of co-barcoding observations. Trimming removes highly abundant transcripts or RNA pairs, which inflate correlation coefficients, as well as very low transcript counts or spurious pairs, which lower coefficients. Abundance datasets were trimmed by removing the top and bottom 10% transcripts. Sequencing libraries p2, p5, p7 and p8 are replicates with crosslinked nuclear homogenates from SH-SY5Y cells. p1 is the control library with randomly barcoded beads; p3 is the control with reverse-crosslinked RNA. p4 was prepared in parallel and is comparable in sequencing depth to p1 and p3. Subsequently, p4 was sequenced more deeply to generate p5. p7 was generated with a homogenate equivalent of 100 ng RNA instead of 50 ng as for all the other libraries. While an increased input amount has minor effects on the ranking of transcript abundance and co-barcoding events between transcript pairs, it reduces sensitivity in the measurement of transcript valency. We therefore excluded p7 from further analysis, except for intron regression. p2, p5 and p8 were pooled into one library. For abundance (n: number of transcripts), p1 n = 18,461, p2 n = 16,752, p3 n = 20,088, p4 n = 18,183, p5 n = 19,396, p7 n = 18,512, p8 n = 15,715; for co-barcoding (n: number of co-barcoded RNA pairs), p1 n = 2,067, p2 n = 16,100, p3 n = 1,305, p4 n = 11,760, p5 n = 15,343, p7 n = 25,787, p8 n = 13,170.
C) Monte Carlo randomisation of transcript observations, or proxy reads, on beads while preserving the total number of proxy reads for each transcript and the size of the barcode groups observed in an experiment. D) Pairwise RNA co-barcoding counts (blue, n = 735,247 pairwise co-barcoding events) were plotted against the average co-barcoding counts from 100,000 simulations. Coloured in yellow are contacts with at least 3 observations and a local background-corrected p value <= 0.01 (see methods for derivation of p value), in red are contacts with at least 3 observations and a Benjamini-Hochberg corrected p value <= 0.05. E) A single random dataset (n = 708,196 random co-barcoding counts) was used as input for a further 1,000,000 simulations and plotted as in D). F) Boxplot showing how the p value distribution changed as the Monte Carlo simulation progressed using n = 735,247 pairwise RNA co-barcoding events. The 100,000 simulations were divided into 20 batches of 5,000 simulations. Batches were sequentially added to one another and the log2 fold change of the p value for each feature before and after the addition of a new batch of simulations was calculated. The graph shows that as the simulation progressed the p values stabilised. The borders, bar and whiskers of the box plot represent the first (Q1) and third (Q3) quartiles, the median and the most extreme data points within 1.5x the interquartile range from Q1, Q3, respectively.
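
The randomisation in panel C can be illustrated with a minimal Python sketch. It is not the CloseCall implementation: the function names are hypothetical, the p value is a plain empirical one (the fraction of simulations reaching the observed count) rather than the local background-corrected p value described in the methods, and the Benjamini-Hochberg step is the textbook procedure.

```python
import random
from collections import Counter
from itertools import combinations


def count_pairs(groups):
    """Count co-barcoded transcript pairs over all barcode groups."""
    counts = Counter()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            if a != b:  # ignore two reads of the same transcript
                counts[(a, b)] += 1
    return counts


def shuffle_groups(groups, rng):
    """Permute proxy reads across beads.

    Group sizes and the total number of reads per transcript are preserved.
    """
    reads = [t for g in groups for t in g]
    rng.shuffle(reads)
    shuffled, start = [], 0
    for g in groups:
        shuffled.append(reads[start:start + len(g)])
        start += len(g)
    return shuffled


def empirical_p_values(groups, n_sim=1000, seed=0):
    """Empirical p value per observed pair (assumption: simple exceedance count)."""
    rng = random.Random(seed)
    observed = count_pairs(groups)
    exceed = Counter()
    for _ in range(n_sim):
        sim = count_pairs(shuffle_groups(groups, rng))
        for pair, obs in observed.items():
            if sim.get(pair, 0) >= obs:
                exceed[pair] += 1
    return {pair: (exceed[pair] + 1) / (n_sim + 1) for pair in observed}


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p values keyed like the input dict."""
    items = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(items)
    adjusted, running_min = {}, 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        key, p = items[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[key] = running_min
    return adjusted
```

Calling `empirical_p_values(groups)` on a list of barcode groups (each a list of transcript names) returns one p value per observed pair, which can then be passed to `benjamini_hochberg`.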

Supplementary Figure 4 Sequencing saturation.

Evaluation of Proximity RNA-seq sequencing saturation for A) pool 2 (p2), B) pool 5 (p5) and C) pool 8 (p8). The number of co-barcoded proxy read pairs was counted after running the Proximity RNA-seq pipeline with increasing numbers of sequencing reads.

Supplementary Figure 5 Comparison of crosslinking conditions.

Comparison of standard crosslinking with long crosslinking. We compared by Proximity RNA-seq a sample using standard crosslinking (10 minutes EGS before addition of formaldehyde for 10 min resulting in a total of 20 minutes of EGS crosslinking, n = 1) with a sample incubated for 20 minutes with EGS before addition of formaldehyde for 40 min (total of 60 min EGS crosslinking, n = 1). Both samples were subjected to four sonication cycles with settings as described in the methods section. A) Nucleic acid length from standard or long crosslinked, sonicated nuclear homogenates. Genomic DNA and RNA were length-separated on an agarose gel after crosslink reversal and purification (lane 1: double-stranded DNA ladder, lane 2: genomic DNA, standard crosslinking, lane 3: DNA, long crosslinking, lane 4, red: single-stranded RNA ladder (low range ssRNA ladder, NEB), lane 5, blue: single-stranded RNA ladder (ssRNA ladder, NEB), lane 6: RNA, standard crosslinking, lane 7: RNA, long crosslinking). B) Bioanalyzer profiles of standard (top) and long crosslinking sample. C) Percentage of reads mapping onto ribosomal RNA (left), mitochondrial RNA and antisense within gene annotations (middle) and onto introns and exons (right, plotted as for Fig. 2b). D) As described in Supplementary Fig. 3b. Pairwise Spearman rank correlations between standard and long crosslinked Proximity RNA-seq datasets using the number of proxy reads (top, standard n = 14,589, long n = 11,213 transcripts) or the number of detected co-barcoding events between two transcripts (bottom, standard n = 17,103, long n = 8,168 RNA pairs) as variables.
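
The trimmed Spearman comparison used in panel D, and described in Supplementary Fig. 3b, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each dataset is trimmed independently and the correlation is then computed over the pairs that remain in both, which the legend does not specify. All function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; inputs must not be constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def trim(counts, min_count=5, fraction=0.10):
    """Apply the minimum-count cutoff, then drop the top and bottom fraction."""
    kept = sorted((k for k, v in counts.items() if v >= min_count),
                  key=lambda k: counts[k])
    cut = int(len(kept) * fraction)
    return {k: counts[k] for k in kept[cut:len(kept) - cut]}


def trimmed_spearman(counts_a, counts_b, min_count=5, fraction=0.10):
    """Return (rho, number of shared pairs) after trimming both datasets."""
    a = trim(counts_a, min_count, fraction)
    b = trim(counts_b, min_count, fraction)
    shared = sorted(set(a) & set(b))
    rho = spearman([a[k] for k in shared], [b[k] for k in shared])
    return rho, len(shared)
```

For abundance datasets the same function applies with `min_count=0`, since the legend mentions the cutoff of five observations only for co-barcoding counts.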

Supplementary Figure 6 Species-mixing experiment to estimate false positive rate.

To human nuclear particles equivalent to 50 ng RNA we added another 50 ng RNA equivalent of fly particles before generating droplets in which we performed reverse transcription and particle barcoding. We chose 100 ng as the total amount of input for the species-mixing experiment knowing that correlations to 50 ng samples are good (Supplementary Fig. 3b, for comparison, library p7 was prepared with 100 ng RNA of only human particles) and to ensure that we would obtain a significant fraction of species mixing in order to test our computational pipeline. For example, using less total input material, 2x 25 ng, would have resulted in fewer droplets with species mixing, lower counts of mixed transcripts and lack of statistical power. By keeping the human RNA content identical to standard 50 ng experiments and adding 50 ng of fly particles as spike-in in the species-mixing experiment, we chose a more cautious and conservative design to assess the specificity of Proximity RNA-seq. After sequencing, we mapped to a combined human and fly genome, which resulted in a human to fly read ratio of approximately 3:2. A) Species-specific reads (n = 42,312) in barcode groups consisting of two human or two fly transcripts or one transcript of each of the two species were counted and compared to expected probabilities using a chi-squared goodness of fit test. Intra-species counts were increased and inter-species counts depleted compared to expected probabilities (chi-squared p 0). B) Observed counts of proxy reads, the corresponding observed read fractions and expected read fractions in human-human, fly-fly and mixed barcode groups as used in A). Expected values were derived from products of read fractions. The number of reads from one species divided by the total number of reads was multiplied by itself for human-human and fly-fly barcode groups.
For mixed barcode groups, the expected value was derived from the number of reads from one species divided by total number of reads multiplied by the number of reads from the other species divided by total number of reads. The product was multiplied by 2. C) P value distributions for pairwise, intra- and inter-species RNA associations with more than one observation. P values were derived after comparison of observed co-barcoding counts with random simulations as described in online methods (RNA pairs: human-human n = 1,941, fly-fly n = 221, mixed n = 657). The borders, bar and whiskers of the box plot represent the first (Q1) and third (Q3) quartiles, the median and the most extreme data points within 1.5x the interquartile range from Q1, Q3, respectively. D) Co-barcoding between ribosomal RNAs 28S and 18S for human and fly, respectively (p = 0) and for inter-species rRNA pairs (p = 1). E) The rate of false positives was estimated based on the number of inter-species RNA pairs, whose p value was found below a given p value threshold, as a fraction of the total number of inter-species RNA pairs. The percentages of false positives for p values 0.01, 0.05 and 0.1 were 0%, 2%, and 6.4%, respectively. To compare different groups of spatially associated transcripts in follow-up analyses (Figs. 4 and 5), we used a p value cut-off of 0.1. For dimensionality reduction techniques, we used a –log10 p value matrix for pairwise proximities, again with a cut-off at 0.1.
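
The expected fractions in panel B, the goodness-of-fit test in panel A and the false positive estimate in panel E can be written out as a short Python sketch. The function names are hypothetical, and the chi-squared p value assumes 2 degrees of freedom (three categories with fixed expected probabilities), for which the survival function is exactly exp(-x/2). No numbers from the study are used.

```python
import math


def expected_fractions(human_reads, fly_reads):
    """Expected group fractions from products of species read fractions."""
    total = human_reads + fly_reads
    h, f = human_reads / total, fly_reads / total
    return {
        "human-human": h * h,
        "fly-fly": f * f,
        "mixed": 2 * h * f,  # either species can come first, hence the factor 2
    }


def chi_squared_gof(observed, expected_frac):
    """Chi-squared goodness of fit for three categories (assumes 2 d.f.)."""
    n = sum(observed.values())
    stat = sum(
        (observed[k] - n * expected_frac[k]) ** 2 / (n * expected_frac[k])
        for k in observed
    )
    p_value = math.exp(-stat / 2)  # chi-squared survival function for 2 d.f.
    return stat, p_value


def false_positive_rate(inter_species_p_values, threshold):
    """Fraction of inter-species RNA pairs with a p value below the threshold."""
    below = sum(1 for p in inter_species_p_values if p < threshold)
    return below / len(inter_species_p_values)
```

With a 3:2 human to fly read ratio, `expected_fractions(3, 2)` gives 0.36, 0.16 and 0.48 for human-human, fly-fly and mixed groups; an excess of intra-species groups over these values is what the test in panel A detects.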

Supplementary Figure 7 RNA-RNA proximities and genomic distance between corresponding genes.

Spatial RNA association (-log10 p value of pairwise RNA association) as a function of genomic distance between the two transcript-encoding genes. A) Full data set (RNA pairs in distance bins, from left to right: n = 24, 13, 13, 25, 19, 12, 11, 433, 8,453), B) intronic reads only (RNA pairs in distance bins: n = 21, 8, 15, 28, 32, 17, 19, 602, 11,206). The first distance bin excludes pairs closer than 10 kb to reduce possible false positive RNA associations due to multiple sense, overlapping annotations of the same transcription unit. The borders, bar and whiskers of the box plot represent the first (Q1) and third (Q3) quartiles, the median and the most extreme data points within 1.5x the interquartile range from Q1, Q3, respectively.

Supplementary Figure 8 Gene ontology terms and proximity to super-enhancers for transcripts in different principal component 2 quantiles.

A) Enrichments of gene ontology terms for biological processes across proximity principal component 2 quantiles (n = 1,390 transcripts in total) as derived by ToppCluster. Negative log10 of p values (<0.05, Bonferroni corrected, online methods) in grey scale with maximal value of 10 in black. Recurrent GO terms after semantic grouping are indicated. B) Cumulative distribution of genomic distances between genes encoding transcripts assigned to PC2 quantiles and the closest super-enhancer. Super-enhancers, as published in ref. 25 based on H3K4me3 and H3K27ac signals with p values < 0.05, were included if present in SH-SY5Y cells and 4 additional neuroblastoma cell lines. Principal component 2 quantiles were pooled into compartment I (quantiles 1–3, shown in red, n = 413 distances) and compartment II (quantiles 4–8, orange, n = 856 distances) (Kolmogorov-Smirnov, p 0.0007, two-sided).

Supplementary Figure 9 RNA valency.

A) Heatmap showing transcript valency 1, 2 and 3 for Proximity RNA-seq triplicates (p2, p5 and p8). Proxy reads for each transcript were counted in barcode groups of different size, i.e. valencies. For each transcript, counts in valency 1, 2 and 3 were divided by the sum of all three valencies of that given transcript. Subsequently, the transcriptome-wide distributions for the first three valencies were separately transformed into z-scores (shown here). High- and low-valency transcripts were retained if reproducibly assigned to high- and low-valency classes in Proximity RNA-seq triplicates (n = 2,486 transcripts with assigned valency). B) Comparison of observed and simulated transcript valency measures. Scatterplots for valency 1, 2 and 3 are shown in a single plot. Instead of z-scores, for each transcript the fraction of reads from random simulations was plotted against the fraction of reads from observed data for valency groups 1, 2 and 3. Only transcripts that were reproducibly assigned to high- or low-valency classes (as described in B.2. Proximity RNA-seq follow-up analysis, valency) are shown. Red depicts transcripts assigned to the high-valency, blue to the low-valency classes. Of note, the separation of high- and low-valency transcripts in observed data (x-axis) is absent in random simulated valencies (y-axis). C), D) Valency measures along the principal component 2 axis after transcript binning on length and expression, respectively. The borders, bar and whiskers of the box plot represent the first (Q1) and third (Q3) quartiles, the median and the most extreme data points within 1.5x the interquartile range from Q1, Q3, respectively. Transcripts in PC2 quantiles 1–3 and 4–8 were merged to generate compartments I and II. The grouping of quantiles to give rise to the two compartments was based on the switch from high to low valency between quantiles 3 and 4.
Four groups based on transcript length as indicated were generated for compartments I and II and transcript valency 1, 2 and 3 plotted. Compartment I valency 1, 2 and 3 are coloured in red, orange and yellow, while compartment II valency 1, 2 and 3 are coloured in blue, light blue and white, respectively. C) Compartment I (q 1–3) showed higher valency and compartment II (q 4–8) low valency for all length groups except for transcripts shorter than 5 kb. The compartment II group of transcripts shorter than 5 kb, however, consisted of only 6 transcripts, including the high-valency non-coding RNAs SNORA22, SNORD22, SNORD10, SNORA63 and RNase_MRP, which are also strongly associated with the nucleolus of compartment I. The number of transcripts for each group is indicated by n. D) Four groups based on transcript expression quartiles (rpkm) as indicated were generated for compartments I and II and valencies plotted as described in C). Compartment I (q 1–3) showed high valency, compartment II (q 4–8) low valency for all expression quartiles.
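
The valency normalisation described in panel A (per-transcript fractions, then transcriptome-wide z-scores for each valency) can be sketched in Python as below. The function names are hypothetical and the use of the population standard deviation is an assumption; the legend does not state which estimator was used.

```python
from statistics import mean, pstdev


def valency_fractions(counts):
    """counts: {transcript: (reads in valency 1, 2, 3)} -> per-transcript fractions."""
    fractions = {}
    for transcript, c in counts.items():
        total = sum(c)
        if total > 0:  # transcripts without reads cannot be normalised
            fractions[transcript] = tuple(x / total for x in c)
    return fractions


def valency_z_scores(fractions):
    """Transform each valency column into z-scores across all transcripts."""
    transcripts = list(fractions)
    z = {t: [] for t in transcripts}
    for v in range(3):
        column = [fractions[t][v] for t in transcripts]
        mu, sd = mean(column), pstdev(column)  # assumption: population s.d.
        for t in transcripts:
            z[t].append((fractions[t][v] - mu) / sd if sd > 0 else 0.0)
    return {t: tuple(values) for t, values in z.items()}
```

A transcript whose reads fall mostly in barcode groups of size 1 gets a positive z-score for valency 1 and negative z-scores for the higher valencies, which is the pattern the heatmap displays.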

Supplementary Figure 10 RNA valency characteristics of genomic A, B regions derived by Hi-C.

A) Expression density measures for all A and B regions and for A and B regions of high-, low- and no-valency regions were derived as follows. For every region, the number of transcripts with abundance larger than or equal to the median expression level (rpkm) of the transcriptome as measured by Proximity RNA-seq was counted. The count of the top 50% expressed transcripts per region was then normalised by the length of the region. The borders, bar and whiskers of the box plot represent the first (Q1) and third (Q3) quartiles, the median and the most extreme data points within 1.5x the interquartile range from Q1, Q3, respectively. B) Mean valency for every A, B region (x axis) was plotted against length of genomic regions (y axis). Categories of high, low and no valency are coloured in red, orange and green, respectively. C) Mean number of transcripts with assigned valency in high-, low- and no-valency groups of genomic A, B regions. D) Genomic regions were grouped according to their mean expression (rpkm) into high-, medium- and low-expression regions and cumulative distributions of Hi-C contact enrichments (p <= 0.05) plotted as in Fig. 6c (n = 68,692 contacts between highly expressed, n = 66,494 contacts between medium and n = 52,651 between lowly expressed regions). Regions tend to form stronger contacts with decreasing expression (high versus low, Kolmogorov-Smirnov, two-sided, p 0). In turn, high-valency regions, with stronger contacts than low-valency regions (Fig. 6b, c), showed similar or higher expression than low-valency regions, which suggests that increased Hi-C contact enrichments for high-valency regions cannot be explained by gene expression levels within such regions.
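
The expression density measure in panel A reduces to a count and a division, shown here as a minimal Python sketch. The data layout and the function name are hypothetical, and the density is reported per base pair; the legend does not state the length unit used for normalisation.

```python
from statistics import median


def expression_density(regions, rpkm):
    """Count transcripts at or above the median rpkm per region, per bp.

    regions: {region name: (length in bp, [transcript names])}
    rpkm:    {transcript name: expression level}
    """
    cutoff = median(rpkm.values())  # transcriptome-wide median expression
    density = {}
    for name, (length_bp, transcripts) in regions.items():
        n_high = sum(1 for t in transcripts if rpkm.get(t, 0.0) >= cutoff)
        density[name] = n_high / length_bp
    return density
```

Multiplying the result by 1,000,000 gives transcripts per megabase, which may be easier to compare across A and B regions of very different lengths.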

Supplementary Figure 11 Electron microscopy of SH-SY5Y cell nuclei.

Electron microscopy of SH-SY5Y cell nuclei after Bernhard’s regressive EDTA staining for RNA. SH-SY5Y cell nuclei show RNA foci (example regions highlighted by red arrows) close to heterochromatin (yellow) surrounding nucleoli (turquoise). Similar results were observed in two independent experiments.

Supplementary Figure 12 Measures derived from 3D Hi-C model of chromosome 14.

A) Box plot summarising Fig. 6f. Local accessibility (here expressed as % buried) of 100 kb bins of high- and low-valency genomic regions for a virtual object with 150 nm radius. For comparison, randomly selected 100 kb bins were used, matched in number to the high- (n = 64,000) and low-valency (n = 319,000) 100 kb bins, respectively (all bins, n = 873,999). Mann-Whitney U tests, two-sided: p 10^-16 for all versus high, p 0 for all versus low, p 10^-7 for high versus random high, p 10^-233 for low versus random low. The borders, bar and whiskers of the box plot represent the first (Q1) and third (Q3) quartiles, the median and the most extreme data points within 1.5x the interquartile range from Q1, Q3, respectively. Chromosome view tracks of model consistency (B), one-dimensional chromatin density (bp/nm) (C) and angle between consecutive 100 kb bins (D).
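The box plot convention used throughout these legends (quartile borders, median bar, whiskers at the most extreme data points within 1.5x the interquartile range of Q1 and Q3) can be sketched as below. The function name is hypothetical, and the quartile interpolation rule is an assumption: plotting packages differ on it, and the legend does not state which one was used.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Return the box plot elements for a sample of at least two values.

    Quartiles use the 'inclusive' interpolation method (an assumption of this
    sketch; other software may use a different rule).
    """
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        # Whiskers end at the most extreme observed points inside the fences
        "lower_whisker": min(v for v in data if v >= lower_fence),
        "upper_whisker": max(v for v in data if v <= upper_fence),
        # Points beyond the fences are usually drawn individually
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Toy sample, invented for illustration only
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

Note that the whiskers stop at observed data points rather than at the fences themselves, which matches the wording "most extreme data points within 1.5x the interquartile range".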

Supplementary Figure 13 Transcription elongation rate estimates.

Linear regression of intronic read density using different minimal read density cutoffs. A) Minimum of 25 reads/intronic kb, B) Minimum of 30 reads/intronic kb, C) Minimum of 35 reads/intronic kb, D) Minimum of 40 reads/intronic kb. Slopes of intronic read density decays (multiplied by 1,000,000) were grouped into different intron length bins (left panels). We observed a tendency towards decreasing slopes, indicating faster transcription elongation, with increasing intron size. However, for most length bins and irrespective of the cutoff, introns in low-valency regions tended to exhibit steeper slopes, and therefore slower transcription elongation rates, than introns in high-valency regions. Actual intron lengths in high- and low-valency regions, binned as in the left panels, are shown in the middle panels. Right panels show mean transcription elongation rates for high- and low-valency regions. Sample sizes (n) are indicated in the figure and give the number of introns in the analysis (left and middle panels) or the number of genomic regions (right panels). P values (Kolmogorov-Smirnov, two-sided) obtained with the different read density cutoffs were combined using Fisher’s method, resulting in p = 0.045. The borders, bar and whiskers of the box plot represent the first (Q1) and third (Q3) quartiles, the median and the most extreme data points within 1.5x the interquartile range from Q1, Q3, respectively.
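Fisher’s method, used above to combine the P values from the four cutoffs, can be sketched with the standard library alone. The function name is hypothetical and the input values are invented; they are not the P values from this figure. The method sums -2 ln(p) over k tests and compares the result to a chi-squared distribution with 2k degrees of freedom, and it formally assumes that the tests are independent.

```python
import math


def fisher_combined_p(p_values):
    """Combine P values with Fisher's method.

    The statistic X^2 = -2 * sum(ln p_i) follows a chi-squared distribution
    with 2k degrees of freedom under the null. For even degrees of freedom the
    upper tail has the closed form
        P(X^2 >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0 or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("need at least one P value, each in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # equals X^2 / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total


# Invented example values, for illustration only
print(fisher_combined_p([0.5, 0.5]))
```

With a single input the function returns that P value unchanged, which is a quick sanity check on the closed-form tail.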

Supplementary Figure 14 Transcription elongation rates and genomic contact strength as derived from Hi-C.

Analysis of Hi-C contact enrichments for different genomic A, B regions in K562 cells. Hi-C data were prepared as described in the Methods section. Kinetic RNA labelling and sequencing data (BruDRB-seq) were used to assign mean transcription elongation rates to A, B regions. A) Similarly to fast-transcribing high-valency regions in SH-SY5Y cells (Fig. 6f), fast-transcribing regions in K562 cells formed stronger contacts with each other (n = 6,696) than slow-transcribing regions (n = 9,383) (Kolmogorov-Smirnov, two-sided: p 0, Cliff’s delta effect size: 0.21). B) Mean contact enrichments between different classes of genomic regions (NA: no transcription elongation rate assigned).
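Cliff’s delta, the effect size reported in panel A, is the proportion of cross-group pairs in which the first group's value is larger minus the proportion in which it is smaller. A minimal sketch follows; the function name and the toy values are illustrative assumptions, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y - #pairs with x < y) / (len(xs) * len(ys)).

    Ranges from -1 (every x below every y) to +1 (every x above every y);
    0 means neither group tends to be larger. Ties contribute nothing.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    smaller = sum(1 for x in xs for y in ys if x < y)
    return (greater - smaller) / (len(xs) * len(ys))


# Invented contact-enrichment values, for illustration only
fast = [3, 4, 5]
slow = [1, 2, 3]
print(cliffs_delta(fast, slow))
```

Because it depends only on the ordering of values, Cliff's delta pairs naturally with non-parametric tests such as the Kolmogorov-Smirnov test used here, and it stays interpretable when sample sizes are large enough to make P values very small.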

Supplementary information

Supplementary Information

Supplementary Figs. 1–14

Reporting Summary

Supplementary Table 1

Primers and sequencing adapters, RNA-FISH probes

Supplementary Table 2

Mapping and read processing statistics of Proximity RNA-seq and Hi-C

Supplementary Table 3

Proximity RNA-seq pairwise RNA associations using gene-based transcript annotation

Supplementary Table 4

Transcripts with complex proximal transcriptomes

Supplementary Table 5

Transcript lists for PCA quantiles 1–8

Supplementary Table 6

Proximity RNA-seq pairwise RNA associations using intron/exon junction split annotation

Supplementary Table 7

Valencies (V1, V2, V3 for valency 1, 2 and 3) of transcripts from three replicates (p2, p5, p8)

Cite this article

Morf, J., Wingett, S.W., Farabella, I. et al. RNA proximity sequencing reveals the spatial organization of the transcriptome in the nucleus. Nat Biotechnol 37, 793–802 (2019). https://doi.org/10.1038/s41587-019-0166-3
