Nuclear gene proximity and protein interactions shape transcript covariances in mammalian single cells

Single-cell RNA sequencing studies into gene co-expression patterns could yield important new regulatory and functional insights, but have so far been limited by the confounding effects of cell differentiation and the cell cycle. We apply a tailored experimental design that eliminates these confounders, and report >80,000 intrinsically covarying gene pairs in mouse embryonic stem cells. These covariances form a network with biological properties, outlining known and novel gene interactions. We provide the first evidence that miRNAs naturally induce transcriptome-wide covariances, and compare the relative importance of nuclear organization, transcriptional and post-transcriptional regulation in defining covariances. We find that nuclear organization has the greatest impact, and that genes encoding for physically interacting proteins specifically tend to covary, suggesting importance for protein complex stoichiometry. Our results lend support to the concept of post-transcriptional ‘RNA operons’, but we further present evidence that nuclear proximity of genes on the same or even distinct chromosomes also provides substantial functional regulation in mammalian single cells.

Here, we apply carefully designed experimental conditions to remove the confounding 69 extrinsic effects of differentiation and cell cycle progression, and apply sensitive Smart-70 Seq2 single-cell sequencing to profile the transcriptomes of hundreds of mouse 71 embryonic stem cells (mESCs). Specifically, using stringent cut-offs we report >80,000 72 gene pairs that intrinsically covary in expression; more than have been described in 73 previous single-cell studies. The covarying gene pairs interlink to form a network with 74 well-established biological features, following a so-called power-law distribution, and 75 recover known regulatory patterns and pathways. We further apply a novel 76 computational framework to study the relative importance of distinct regulatory 77 mechanisms for gene expression covariances. We find that genes regulated by the same 78 4 transcription factors or miRNAs tend to covary, while the strongest effect is seen with 79 genes that are in nuclear proximity on the same chromosomes. A similar but weaker 80 effect is seen for genes that are in nuclear proximity on distinct chromosomes. We  Seq2 sequencing of hundreds of mouse single-cell transcriptomes   102   To obtain reliable and reproducible measurements of gene expression for our study, we   103 applied the Smart-Seq2 protocol to sequence the transcriptomes of 567 individual 104 mouse embryonic stem cells divided between three well-plates which serve as biological 105 replicates. The Smart-Seq2 protocol is one of the gold standards in the single-cell field.

106
While labor-intensive and not easily scalable, it is highly sensitive and precise, which 107 means that each measurement is less affected by technical noise 6,10,11 . It also reliably 108 detects both exons and introns, which is useful for distinguishing transcriptional and  Table 1). Gene expression values in each cell were normalized to the sum of mRNA 112 sequence reads in the given cell, as this approach was found to best eliminate biases 113 from technical factors (Methods, Suppl. Fig. 2, 3). We considered a gene to be reliably 114 profiled if its transcript was detected in at least half of the cells in each of the three 115 replicates. Importantly, we estimated that for genes that are expressed at above 16 116 reads per million biological variation between cells exceeds technical noise (as 117 estimated by ERCC spike-ins, Suppl. Fig. 4). The vast majority of the genes that fulfill our 118 criteria for reliable detection also exceeds this expression threshold. Overall our 119 analysis yielded reliable gene expression measurements for 8,983 genes (Suppl . Table   120 2).

329
We additionally investigated the "reverse covariance enrichment". Here, we observe the 330 set of all significantly covarying gene pairs and ask how often they are regulated by the 331 same miRNA, compared to a background set (Methods). We find that covarying genes 332 are 12% more likely to be co-regulated by the top 16 miRNAs, and 35% more likely to 333 be regulated by the seven most highly expressed conserved miRNAs ( Figure 2C Figure 7A). A substantial number of the genes that cease to covary were 343 miRNA targets and the ratio of lost to gained covariances increase when high-confidence 344 targets were considered (Suppl. Figure 7B). The genes that lost covariances were 345 enriched in functions in RNA biology (Suppl. Figure 7F

400
Next, we ranked the relative importance of transcription factors, miRNAs and nuclear 401 proximity for the regulation of covariation (Figure 3). We found that miRNA targets 402 were 35% more likely to covary, transcription factor targets were 39% more likely to 403 covary, genes in nuclear proximity on different chromosome were 3-fold more likely to 404 covary, and, remarkably, genes that are proximal on the same chromosome are 5.3-fold 405 more likely to covary. In summary, we find that transcriptional regulation, miRNA-

422
Proteins that are in surplus can be degraded, mis-folded or form aggregates.

424
Protein interaction rather than shared function drives gene covariances 425 Next, we examined putative functions of the covariances that we observe in single cells.

426
We formulated two hypotheses. The first hypothesis can be described as the "pathway 427 hypothesis" -that genes involved in the same pathway are coordinated in expression, for 428 instance to avoid bottlenecks in the production of metabolic intermediates 41 . The 429 second hypothesis is the "complex hypothesis" -that covariances ensure correct 430 stoichiometry among proteins that are part of the same heteromeric protein complex, 431 since surplus proteins may mis-fold or even cause aggregates 42 . 432 433 As stated earlier, genes that share the same Gene Ontology function or process or the 434 same KEGG pathway annotation are significantly enriched for gene covariances (Figure   435 1F). The same is true for genes that physically interact on the protein level according to 436 experimental evidence gathered by the STRING database ( Figure 4A). For these 437 interactions we can see a gradient with those interactions with the highest 438 confidence/affinity score also having the highest enrichment for covariances. We further 439 find that genes that contribute to the same complex are 2.6-fold more likely to covary, 440 compared to just 1.6-fold for genes that are part of the same pathway ( Figure 4B To further test the two hypotheses, we split the set of functionally related genes into one 444 set with genes that share a functional annotation and protein interaction and those that 445 share functional annotation but no protein interaction. If the "pathway hypothesis" 446 holds, we would expect both gene sets to covary, since they share functions. If the 447 "complex hypothesis" holds, we would expect only the genes whose proteins physically

471
This gene is known to regulate cell adhesion and has been linked to various cancer 472 phenotypes 44 . We then ranked all other genes according to how many of the top 100 473 Ctnnb1 targets they covary with, and observed that the more covariances a gene 474 exhibited, the more likely it is to be bound by Ctnnb1 in a second ChIP-seq experiment 43 475 ( Figure 5A). In other words, the more significant covariances a gene had with the high 476 confidence targets identified in the first experiment, the more often is was observed 477 among the high confidence targets identified in a second independent experiment. While 478 the predictive power of our method is limited (target probability ~ 25%, Figure 5A) it 479 serves as a proof of principle that single-cell transcriptome data can be used for 480 predicting regulatory relations even in a homogeneous cell population. This approach 481 could be used to make sparse data sets more complete, through "guilt-by-association" 482 with previously identified targets or to identify targets that escape current technologies 483 due to biases. Last, we found that the function of genes could be inferred by surveying  Figure 4D). Therefore, we suggest that establishing expression covariance of such 524 genes already on the RNA level might be an advantage in evolutionary terms. 525 526 19 In this study we measure RNA rather than protein with the latter being closer to the 527 cellular phenotype. However, when inferring upstream regulation, it may be more 528 informative to measure RNA. Further, many of the interesting and biologically 529 meaningful covariances that we discover may not be detectable at the protein level, even 530 in single cells. For instance, transcript covariances may be important for co-folding, but 531 they may not be visible at the proteome level for proteins that have long half-lives and 532 that are therefore stably expressed. It will be exciting to study covariances at the protein