## Introduction

Gene expression is key to cellular identity and function. Its regulation is complex, allowing for gene-specific control of the abundance of gene products and the speed at which their abundances may be adapted to changes or perturbations. Considering mRNA, the gene expression intermediate whose cytoplasmic presence allows for protein synthesis, its cytoplasmic abundance is a function of numerous molecular reactions that are grouped within the broader terms of nuclear mRNA synthesis (transcription initiation, elongation), mRNA processing and export (e.g., capping, splicing, polyadenylation, and transport), and cytoplasmic degradation. Studies of immune response gene expression have been insightful as the dynamical nature of immune gene expression reveals the underlying kinetics of the regulatory steps. In response to pathogen-derived substances such as endotoxin, the expression of hundreds of genes is rapidly induced1. While the primary regulatory steps control nuclear mRNA synthesis2, mRNAs show widely different cytoplasmic half-lives (ranging from just a few minutes to many hours), which also contribute to the responsiveness of gene expression2,3.

What remains less well characterized is to what extent mRNA processing and export regulate gene expression. Genetic perturbation of the splicing machinery can diminish the abundance of mature mRNAs4 and incompletely spliced mRNAs may be degraded via the nuclear exosome5,6,7. Indeed, mRNA export is mediated by RNA-binding proteins that are recruited to exon–exon junction complexes (EJCs). Recent studies have shown that while 3′-end cleavage and polyadenylation are always rapid, many genes have one intron that is spliced post-transcriptionally, potentially introducing delays in the appearance of cytoplasmic mature mRNA8,9. However, the resulting effective transport rates have not been measured quantitatively and it remains unknown to what extent these rates may be gene-specific or whether they contribute to the regulation of gene expression.

Here, we leveraged the high inducibility of innate immune gene expression programs to measure effective mRNA chromatin-to-cytoplasmic transport rate (“export rate”) associated with each immune response gene. We produced genome-wide mRNA measurements in chromatin, nucleoplasmic, and cytoplasmic compartments at high temporal resolution, and developed a mathematical modeling workflow to infer kinetic rate constants and their associated confidence intervals. We report that the mRNA export rates vary over a 100-fold range among genes, but surprisingly do not contribute much to the temporal responsiveness of immune response gene expression, which is primarily controlled by cytoplasmic mRNA half-life. Instead, export rates determine the efficiency of transport (in the face of nucleoplasmic decay) and show a high correlation with cytoplasmic mRNA-degradation rates. Thereby, highly responsive genes with short half-lives are expressed highly thanks to highly efficient transport, and later waves of immune responsive genes with long half-lives are not disproportionately overexpressed due to lower efficiency transport.

## Results

### A detailed, quality dataset of endotoxin-induced mRNA synthesis and transport

To study the kinetics of post-transcriptional mRNA transport and decay of immune response genes, we developed an experimental approach to follow mRNA expression within the cell. When mRNA is being transcribed it is linked to the chromatin-bound polymerase and may be isolated as chromatin-associated RNA (caRNA). It is then released by 3′-end cleavage and polyadenylation into the nucleoplasm (npRNA) and exported to the cytoplasm (cytoRNA) (Fig. 1A). We produced high spatio-temporal resolution RNA-seq data of three biological replicates (see Supplementary Fig. S1 for reproducibility of the replicates) by deeply sequencing RNA from three subcellular fractions (chromatin-associated, nucleoplasmic, and cytoplasmic) prepared from murine bone-marrow-derived macrophages (BMDMs) at twelve timepoints within 2 h of stimulation with the endotoxin analog Lipid A (Fig. 1B, see “Methods”). As observed for the Tnf gene, intronic reads are still present in the caRNA samples but are less abundant in the npRNA samples, and transcripts are fully spliced in the cytoplasmic samples (Fig. 1C). To enable a reliable quantification of mRNA expression for downstream analysis, we selected strongly inducible genes (based on caRNA data, Fig. 1D). For the 273 selected genes, the npRNA expression profile is more similar to the caRNA expression profile than the cytoRNA expression profile is (Fig. 1E). Correlation analysis shows that, overall, npRNA only slightly lags behind caRNA but that cytoRNA expression is less well correlated and more delayed (Fig. 1F).

To avoid bias in the caRNAseq data due to partially transcribed mRNA, genes were quantified based on the exonic portion of their last 5 kb. This required highly accurate annotation of the dominant transcription end site (TES). Thus, every selected gene track coverage was checked against GENCODE annotation database and 62 discrepant genes were removed, leaving 211 for further analysis (see “Methods”, Supplementary Fig. S2A–F, and Supplementary Data 1). The observed TESs were largely consistent (within +/−100 bp) with an established database of polyA sites10 based on 3′-end sequencing data (Supplementary Fig. S2G, right panel). For some genes (bottom right of Supplementary Fig. S2G, right panel) the observed TESs are even more consistent with the database than the information provided in the gene annotation (discrepant by >500 bp).

Tracking RNA expression in time and space illustrates that some genes, such as Cd74 and Btg2, that have very similar caRNA expression, may exhibit very different npRNA or cytoRNA expression profiles (Fig. 1G, top panel); or that some genes with different caRNA expression profiles exhibit very similar cytoRNA expression profiles, such as Arl5b and Ccr3 (Fig. 1G, bottom panel). This suggests that kinetic parameters of post-transcriptional processes may be gene-specific and may regulate gene expression dynamics.

### A mathematical model of mRNA dynamics to derive kinetic transport and decay rates

We developed a simple mechanistic mathematical model (Fig. 2A) to simulate the abundance of mRNAs in the different subcellular fractions. The model uses the measured caRNA expression profile as input to calculate npRNA and cytoRNA abundances over time as a function of kinetic parameters describing transport and decay. Given that RNA sequencing allows for relative quantification across genes and samples, the parameter values that may be derived by fitting the model to the data are also relative. These relative parameters are denoted: k1’ for the fractional appearance rate of mRNA in the nucleoplasm (in npFPKM/caFPKM min−1); k2 for the mRNA disappearance rate from the nucleoplasm (in min−1), determined by both nucleoplasmic decay and nucleoplasm-to-cytoplasm transport; k2’ for the fractional appearance rate of mRNA in the cytoplasm (in cytoFPKM/npFPKM min−1), termed “the nucleoplasm-to-cytoplasm transport rate”; and kcyto_deg for the cytoplasmic decay rate (in min−1).
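One way to write a model consistent with these parameter definitions is as a pair of linear ordinary differential equations. The symbols $C(t)$, $N(t)$, and $Y(t)$ for caRNA, npRNA, and cytoRNA abundance are notation introduced here for illustration; this is a sketch under those assumptions rather than a restatement of Fig. 2A.

```latex
\begin{aligned}
\frac{dN}{dt} &= k_1'\,C(t) - k_2\,N(t),\\[4pt]
\frac{dY}{dt} &= k_2'\,N(t) - k_{\mathrm{cyto\_deg}}\,Y(t).
\end{aligned}
```

For a constant input $C$, setting both derivatives to zero gives $N_{ss} = k_1' C / k_2$ and $Y_{ss} = (k_1' k_2' / k_2)\, C / k_{\mathrm{cyto\_deg}}$, so in this sketch the cytoplasmic plateau scales with the composite k1’k2’/k2 and inversely with the cytoplasmic decay rate.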

We fit the model to the expression data for each gene to estimate these kinetic parameters using an optimization pipeline with a cost function defined by the negative log-likelihood of reproducing the experimental data given the model; for the error model, we used a negative binomial distribution to account for both biological variability and sampling error for lowly expressed timepoints (schematized in Fig. 2B and described in “Methods”). A visual comparison of model-simulated and measured data, graphed for all genes in a heatmap (Fig. 2C) or as line graphs for two sample genes (Fig. 2D), illustrates the quality of the fits (see Supplementary Document 1 for detailed graphs of individual genes). As the negative log-likelihood depends on the expression level, we developed a “fit quality” metric that also includes autocorrelation of the residuals (see “Methods”). Most genes exhibit excellent fits (Supplementary Fig. S3A) with fit quality values <0.06 (e.g., Tnfaip2 and Il10 have fit quality scores of 0.011/0.014 and 0.050/0.041, for the two replicate datasets, respectively, Supplementary Fig. S3B, top row). Only 9 genes have scores ≥0.06, and 2 have scores ≥0.1 for both replicates (e.g., Cpd and Cd44 have fit quality scores of 0.096/0.084 and 0.15/0.012, respectively, Supplementary Fig. S3B, bottom row). Poor fit quality is typically due to discrepancies with the data from the nucleoplasmic fraction.
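The cost function described above can be sketched as follows. This is a minimal illustration, not the pipeline used here: the function names, the mean/size parameterization of the negative binomial, the fixed dispersion value, and the treatment of observations as counts are all assumptions made for the example.

```python
import math

def nb_log_pmf(count, mean, size):
    """Log-probability of `count` under a negative binomial distribution
    parameterized by its mean and a dispersion parameter `size`
    (variance = mean + mean**2 / size). Parameterization is an assumption."""
    return (
        math.lgamma(count + size) - math.lgamma(size) - math.lgamma(count + 1)
        + size * math.log(size / (size + mean))
        + count * math.log(mean / (size + mean))
    )

def negative_log_likelihood(observed, simulated, size=10.0):
    """Cost of a candidate parameter set: the summed negative log-probability
    of the observed values given the model-simulated means."""
    return -sum(nb_log_pmf(o, m, size) for o, m in zip(observed, simulated))

# Hypothetical timecourse: a simulation that tracks the data has a lower cost.
observed = [3, 12, 40, 55]
close_simulation = [3.0, 12.0, 40.0, 55.0]
flat_simulation = [10.0, 10.0, 10.0, 10.0]
print(negative_log_likelihood(observed, close_simulation))
print(negative_log_likelihood(observed, flat_simulation))
```

An optimizer would then search the kinetic parameters that minimize this cost; because the cost grows with expression level, a separate fit quality metric is useful for comparing genes.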

### Model fitting reveals which kinetic rate constants are identifiable from the data

Using the profile likelihood method, we also computed the 95% confidence interval of the estimated parameters (Supplementary Data 2). While k2’ and kcyto_deg were identifiable for almost every gene (k2’: 207 and 202 for replicates 1 and 2, respectively; kcyto_deg: 199 for both replicates out of the 211 fitted genes), k1’ and k2 were identifiable for only ~130 genes (k1’: 142 and 117; k2: 144 and 116 for replicates 1 and 2, respectively; Fig. 3A, left panel; see “Methods” for confidence interval estimation for each individual gene). However, in most cases where these parameters were not fully identifiable, the lower bound could be found (k1’: 46/69 and 50/95; k2: 44/67 and 49/94 for replicates 1 and 2, respectively). Moreover, the confidence interval of the identifiable parameters was relatively narrow, around threefold between the upper bound and lower bound (i.e., +/− 1.8×) for k2’ and kcyto_deg, and about tenfold (i.e., +/− 3.2×) for k1’ and k2 (Fig. 3A, right panel). To compare genes, we first examined the distribution of parameters that were identifiable; their values spread over a broad range, around 100-fold for k2’ and kcyto_deg, and about 30-fold for k1’ and k2 (Fig. 3B). The parameter distributions resulting from fitting each replicate separately were similar (with a Kolmogorov–Smirnov distance of <0.14 between replicates for all parameters). Examining each parameter individually revealed that optimally fitted values (for genes that yielded identifiable parameter values) were reproducible across replicates (Fig. 3C), with the estimated values differing by less than +/− 2× for most genes (red dashed lines). Even when the k1’ and k2 parameters were not identifiable, their ratio (k1’/k2) was almost always identifiable (202 and 198 for replicates 1 and 2, respectively) and highly reproducible (Fig. 3D), and this ratio’s spread across the tested genes was much narrower than that of the individual parameters (only threefold).

To obtain an external validation, we compared the model-inferred mRNA half-life values to estimates obtained using Actinomycin-D treatment (Supplementary Fig. S4). The two methods led to very similar results (Fig. 3E) with a Spearman rank correlation of 0.8 (P value <2 × 10−16). Interestingly, model inference led to a broader range of estimated half-lives than the Actinomycin-D experiments. The Actinomycin-D approach estimated half-lives within 15–300 min, whilst model-inferred half-lives ranged from 1 to over 1000 min. This reflects the shortcomings of Actinomycin-D experiments in estimating very short and very long half-lives11.

### Nuclear export efficiencies and effective transport rates are highly gene-specific

One meaningful composite variable, k2’/k2, represents the efficiency of the nuclear export, i.e., how much mRNA arrives in the cytoplasm versus how much leaves the nucleoplasm by either export or nucleoplasmic decay. This measure spreads over 30-fold (10^1.5, Fig. 4A, left panel), meaning that if we assume no loss for the most efficiently transported genes, then for the least efficiently transported genes only ~3% of the nucleoplasmic mRNA will actually arrive in the cytoplasm. Similar to k2, the export efficiency was not always identifiable (Fig. 4A, right panel), but we were able to identify it with the present dataset for ~130 genes out of the 211 (144 and 118 for replicates 1 and 2, respectively) fitted genes, and for most unidentifiable genes the 95% confidence upper bound could still be defined (43/67 and 49/93 for replicates 1 and 2, respectively). For the genes for which this composite variable was identifiable the estimated value was also highly reproducible (Fig. 4A, middle panel).

Another composite variable is k1’k2’/k2, which describes how fast a gene’s caRNA transcript reaches the cytoplasmic fraction. We denote this composite measure “effective transport rate”, and it is composed of the fractional nucleoplasmic appearance rate k1’ multiplied by the nuclear export efficiency k2’/k2. This composite variable spreads across an even wider range of values: 100-fold between the fastest genes and the slowest (Fig. 4B, left panel). This effective transport rate is highly reproducible (Fig. 4B, middle panel) and almost always identifiable (203 and 198 for replicates 1 and 2, respectively, out of the 211 fitted genes; Fig. 4B, right panel).
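The two composite variables can be computed directly from the fitted parameters. The short sketch below uses hypothetical function names and made-up parameter values purely for illustration, and also reproduces the arithmetic behind the ~3% figure for a 10^1.5-fold spread in export efficiency.

```python
def export_efficiency(k2_prime, k2):
    """Ratio of cytoplasmic appearance (k2') to total nucleoplasmic outflux (k2)."""
    return k2_prime / k2

def effective_transport_rate(k1_prime, k2_prime, k2):
    """Composite k1' * k2' / k2."""
    return k1_prime * export_efficiency(k2_prime, k2)

# Hypothetical parameter values for one gene (not taken from the dataset).
k1_prime, k2, k2_prime = 0.5, 0.2, 0.1
print(export_efficiency(k2_prime, k2))
print(effective_transport_rate(k1_prime, k2_prime, k2))

# If the most efficient genes lose nothing, a 10**1.5-fold spread implies that
# the least efficient genes deliver about 1 / 10**1.5 of their npRNA.
least_efficient_fraction = 10 ** -1.5
print(f"{least_efficient_fraction:.1%}")
```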

These composite measures, “export efficiency” and “effective transport rate”, correlate strongly (Fig. 4C, Pearson’s correlation of ~0.82; 0.83 and 0.81 for replicates 1 and 2, respectively) for the genes for which they are identifiable. The genes for which the export efficiency was not identifiable tend to have a lower effective transport rate. Given that the effective transport rate is identifiable for almost all genes and is strongly correlated with export efficiency, we focused on this composite parameter in subsequent analyses.

Interestingly, cytokines and chemokines have relatively high effective transport rates (Fig. 4D), as do some inflammatory transcription factors (Junb, Egr1/2, Fos, Fosb), while for others it is lower (Fosl2, Irf1, Rel, Relb). Negative feedback genes span a broad range in effective transport rates, with negative regulators of MAPK having higher values than negative regulators of NFκB. Genes involved in cell growth and cell adhesion tend to locate at the lower to medium range of the effective transport distribution.

Examining two genes on opposite ends of the effective transport rate range (Egr1 and Malt1), we observe that even though they both reach similar levels on the chromatin, Egr1, which has a higher effective transport rate, is present at higher levels in the cytoplasmic fraction than Malt1 (Fig. 4E). This is the case even though Malt1 has a longer cytoplasmic half-life than Egr1 (~135 min for Malt1 and ~15 min for Egr1).

### Transport parameters correlate with gene structure and sequence motifs rather than epigenetic signatures

To determine if the effective transport rate is an intrinsic gene characteristic or if it is context-dependent, we examined gene structure characteristics. We found that the effective transport rate is significantly anti-correlated with gene length (Fig. 5A) and the number of introns (Fig. 5B). Short genes with few introns have higher effective transport rates (potentially mediated by TPR) than longer ones with more introns, whose transport may depend on exon–exon junction complexes. Interestingly, even though cytokines and chemokines have similar gene lengths and intron numbers, cytokines tend to have higher effective transport rates than chemokines.

To examine if the need to splice in the nucleoplasm (rather than on the chromatin) might slow the effective transport rate, we calculated the percentage of spliced junction reads over the total of junction reads (spliced or unspliced) for each intron in the nucleoplasmic fraction using SIRI (ref. 12) and assumed that each intron is independently spliced. Interestingly, this measure of post-transcriptional splicing in the nucleoplasm correlates with the effective transport rate even more strongly (Fig. 5C), though not necessarily due to a single bottleneck intron (Supplementary Fig. S5A). These data suggest that when splicing is not completed co-transcriptionally, it slows the effective mRNA transport rate to the cytoplasm.
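The per-intron measure described above, and one way of combining introns under the independence assumption, can be sketched as follows. The function names and read counts are hypothetical, and taking the product of per-intron fractions as a gene-level summary is an assumption made for this example; it does not represent the output format of any particular tool.

```python
def spliced_fraction(spliced_reads, unspliced_reads):
    """Fraction of junction reads that are spliced for one intron,
    or None when the intron has no junction reads."""
    total = spliced_reads + unspliced_reads
    if total == 0:
        return None
    return spliced_reads / total

def fully_spliced_probability(intron_counts):
    """Probability that all introns are spliced, assuming each intron is
    spliced independently (product of per-intron spliced fractions)."""
    probability = 1.0
    for spliced_reads, unspliced_reads in intron_counts:
        fraction = spliced_fraction(spliced_reads, unspliced_reads)
        if fraction is None:
            continue  # no junction reads: intron carries no information
        probability *= fraction
    return probability

# Hypothetical gene with two introns: (spliced, unspliced) junction reads.
introns = [(90, 10), (50, 50)]
print([spliced_fraction(s, u) for s, u in introns])
print(fully_spliced_probability(introns))
```

Under this summary, a low value can arise either from one poorly spliced intron or from several moderately retained ones, which is why a single bottleneck intron is not required.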

To examine if RNA-binding proteins (RBPs) may also play a role in regulating the effective transport rate, we tested whether the number of motifs for known RBPs in the 5′-UTR and 3′-UTR of the genes was associated with the estimated effective transport rate (Supplementary Fig. S5B). Some of these RBP motifs were correlated with a higher effective transport rate, e.g., HNRNPK and QKI13,14,15,16. Conversely, there are some RBP motifs on the 3′-UTR that correlate with a lower effective transport rate, e.g., HNRNPLL17,18,19 and YBX120,21,22,23. These proteins were shown to be involved in mRNA splicing, transport, and decay.

Next, we examined if the effective transport rate may be affected by the chromatin context of the gene. We measured four histone marks using ChIP-seq in cells prior to stimulation: H3K27ac is associated with active enhancers, H3K4me3 with promoters, and H3K36me3 and H3K79me2 with actively transcribed gene bodies24,25. The average peak ChIP-seq signal of all peaks assigned to the closest gene TSS was quantified (see “Methods”), but none of the marks showed a strong correlation with the effective transport rate (Fig. 5D). In addition, machine-learning models were trained to assess if the ChIP-seq signal would add information to the other metrics in predicting the effective transport rate values (see “Methods”). While predicting the effective transport rate using the combination of gene length, intron number, and intron retention significantly improved the prediction over using just one of these variables, adding ChIP-seq signals in a variety of different windows or measures did not increase the predictive power of the models (Fig. 5E).

### Transport parameters are unaffected by tolerance-inducing pre-stimulation

To further examine whether effective transport rates are unaffected by the genes’ chromatin, we produced equivalent experimental datasets in macrophages that had been pre-stimulated with Lipid A and thus rendered into a so-called tolerized state in which the epigenome is substantially altered and gene expression is less responsive to stimulation with the second dose of Lipid A. Similar to the naive condition, replicate data demonstrated high reproducibility (Supplementary Fig. S6A, C), but some genes showed such a low level of expression that they had to be removed from subsequent analyses (Supplementary Fig. S6B), leaving 186 genes to be fitted. Tolerized macrophages showed slightly increased basal levels for most genes (Fig. 6A, top), but, as expected, they exhibited a strong reduction of induction as observed in the chromatin fraction (Fig. 6A, bottom), and also in subsequent fractions (Fig. 6B).

Likely due to the lower fold gene induction, the fits were not as good as for naive macrophages (see Fig. 6C and Supplementary Fig. S7A, B), with more genes having a fit quality metric of ≥0.06 for both replicates (37 vs. 9 for tolerized vs. naive macrophages). However, the number of genes for which parameters were identifiable was similar to the naive condition (Supplementary Fig. S7B, top panel), though the associated 95% confidence interval was wider for k2’ and kcyto_deg (Supplementary Fig. S7B, middle panel). The estimated parameters for genes with good fits showed good reproducibility in replicates (Supplementary Fig. S7B, bottom panel) and were remarkably similar to the estimated parameters from the naive condition (Fig. 6D), with k2’ and kcyto_deg being mostly within a ±2-fold range. Further, the composite parameters of transport efficiency and effective transport rate for the well-fitted genes were close to indistinguishable between naive and tolerized macrophages, with only poorly fitted genes showing some differences (Fig. 6E). This suggests that while the context of chromatin and trans-acting factors regulates transcriptional initiation of immune response genes, effective transport rates are primarily regulated by context-independent gene structure and sequence features.

### Transport parameters do not regulate the responsiveness but the abundance of mRNA induction

Two hypotheses address how the mRNA responsiveness of immune response genes is regulated. Intuitively, the effective transport rate, which includes any delays in mRNA processing and splicing, would control responsiveness. The alternative hypothesis posits that the cytoplasmic mRNA decay rate determines responsiveness, based on theoretical considerations and actual experimental observations3,26 in studies that, however, did not consider nuclear-to-cytoplasmic transport. We simulated the gene expression induction of nine hypothetical genes combining a high, medium, or low transport rate with a short, medium, or long mRNA half-life (Fig. 7A). The results demonstrated that the mRNA half-life was a primary determinant of responsiveness, which can be quantified as time to half-maximal expression. Examining the control of responsiveness further, we identified regimes in which transport rates may be important. However, plotting the actual transport and degradation rates of immune response genes onto this map, we found that almost all genes fall into the regime where the responsiveness is controlled almost entirely by cytoplasmic mRNA half-life (Fig. 7B). Quantifying the mRNA responsiveness of immune response genes, we observed that it is strongly correlated with their estimated mRNA half-life, but also showed a nonlinear relationship with the effective transport rate (Fig. 7C). Comparing mRNA-degradation rates and effective transport rates, we then found an unexpected correlation (Fig. 7D), with short-lived mRNAs having a higher effective transport rate. We rationalized that the need for rapid responsiveness requires a high cytoplasmic decay rate, which in turn would decrease the magnitude of gene expression; by increasing the effective transport rate, short-lived mRNAs would then be expressed at high cytoplasmic levels (Fig. 7E). Despite the correlation, the magnitude of the mRNA-degradation rate tends to be lower than the nuclear export rate, ensuring that degradation is generally rate limiting. These observations suggest that the effective transport rate is not primarily a determinant of the responsiveness but of the magnitude of gene expression in innate immune responses.
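The separation between responsiveness and magnitude can be illustrated with a small calculation. The sketch below assumes a two-compartment linear model (nucleoplasm, cytoplasm) driven by a step increase of caRNA at time zero, which is a simplification chosen for this example and not necessarily the input used for Fig. 7; the function names and rate values are hypothetical. Under these assumptions, the time to half-maximal cytoplasmic expression depends only on the nucleoplasmic outflux rate k2 and the cytoplasmic decay rate, while the plateau scales with the effective transport rate divided by the decay rate.

```python
import math

def fraction_of_plateau(t, k2, k_deg):
    """Cytoplasmic mRNA at time t, as a fraction of its plateau, after a
    step increase of caRNA at t = 0 (analytic solution of the linear model)."""
    if math.isclose(k2, k_deg):
        # Limit of the general expression when the two rates coincide.
        return 1.0 - (1.0 + k_deg * t) * math.exp(-k_deg * t)
    return 1.0 - (k2 * math.exp(-k_deg * t) - k_deg * math.exp(-k2 * t)) / (k2 - k_deg)

def time_to_half_max(k2, k_deg, tol=1e-6):
    """Time at which cytoplasmic mRNA reaches half its plateau (bisection)."""
    lo, hi = 0.0, 1.0
    while fraction_of_plateau(hi, k2, k_deg) < 0.5:
        hi *= 2.0  # expand the bracket until it contains the half-max time
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fraction_of_plateau(mid, k2, k_deg) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def plateau(effective_transport_rate, ca_level, k_deg):
    """Steady-state cytoplasmic level for a constant caRNA level."""
    return effective_transport_rate * ca_level / k_deg

# Hypothetical grid: three half-lives (min) crossed with three k2 values (1/min).
for half_life in (15.0, 60.0, 240.0):
    k_deg = math.log(2) / half_life
    for k2 in (0.02, 0.2, 2.0):
        print(half_life, k2, round(time_to_half_max(k2, k_deg), 1))
```

When k2 is much larger than the decay rate, the time to half-maximum approaches the mRNA half-life, matching the regime in which responsiveness is set by cytoplasmic decay; only when k2 becomes comparable to or slower than the decay rate does nuclear outflux limit responsiveness.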

## Discussion

Here, we report that the effective nucleo-cytoplasmic transport rate of immune response genes varies over a 100-fold range. Our measurements were based on triplicate deeply sequenced RNA-seq data from chromatin-associated, nucleoplasmic and cytoplasmic compartments, to which a kinetic model of mRNA transport and decay was fit, yielding confidence intervals for the inferred kinetic parameters. Our method leveraged the relative quantitation afforded by RNA-seq without the use of spike-in RNAs, whose addition often introduces variability. Further, estimates for the effective chromatin-to-cytoplasm transport rate had tight confidence intervals, even for genes for which chromatin-release and nucleoplasmic outflux rates (by transport or decay) were less well definable when mRNA abundances in chromatin and nucleoplasmic fractions were more similar. Thus, the effective transport rates we report could serve for further downstream analysis.

Our regression approach indicated that the effective transport rate may be accounted for to some degree by intrinsic features of the gene such as length and exon/intron structure. In contrast, chromatin context does not seem to be a determinant of effective transport, as a range of histone ChIP-seq data showed little predictive power despite testing numerous features, and the altered epigenetic state triggered by prior endotoxin exposure did not alter the kinetic parameter values. While intron retention that results from experimentally perturbing the splicing machinery has been shown to diminish cytoplasmic mRNA4, our work suggests that it also contributes to the differences in effective transport seen among immune response genes in wild-type cells. As RBPs mediating export are recruited to exon–exon junction complexes (EJCs), genes that are spliced co-transcriptionally may be favored. But not all introns necessarily need to be co-transcriptionally spliced, as the availability of some EJCs may be sufficient. We do observe that some introns are retained in the nucleoplasm, and that the correlation of transport with splicing is not perfect. Thus, future work may determine whether specific exon–exon junctions are important for recruiting export factors, whether all exon–exon junctions could in principle make that contribution, or whether the TPR pathway may apply, allowing fast transport without EJCs.

An unexpected finding of our analysis was that effective transport rates are highly correlated with cytoplasmic mRNA-degradation rates among immune response genes. This correlation may not be incidental but mechanistically linked: our motif analysis identified a host of RNA-binding proteins that prior literature had implicated in both mechanisms: for example, HNRNPK is involved in splicing27,28 and shuttles between the nucleus and cytoplasm in a manner consistent with an involvement in mRNA export29,30 but other work demonstrated its involvement in regulating the mRNA stability of the thymidine phosphorylase gene31; further, QKI was shown to be involved in splicing32 and nuclear export of MBP33 but other work had implicated it in regulating the mRNA stability of the AIP gene34.

Why would cytoplasmic mRNA decay be mechanistically linked with effective nuclear mRNA export? Prior work has established that the cytoplasmic mRNA half-life controls the responsiveness of immune response gene expression3,26. However, if genes that must be highly responsive to immune signals have evolved a high cytoplasmic mRNA-degradation rate, their expression level would also be dramatically lowered. Mechanistically linking the effective transport rate to cytoplasmic mRNA degradation allows rapidly induced immune response genes to also be expressed at a high level. Indeed, our modeling analysis shows that the hundred-fold range of effective transport rates may ensure similar levels of expression for both rapidly inducible and long half-life mRNAs. Thus, contrary to expectation, effective transport rates do not directly regulate the stimulus-responsiveness of immune gene expression but regulate the magnitude of gene expression in immune response programs.

## Methods

### Macrophage cell culture and stimulation

Bone-marrow cells were isolated from wild-type C57BL/6 mice (females, 3 months, approved and maintained by the University of California, Los Angeles Division of Laboratory Animal Medicine accredited by Association for Assessment and Accreditation of Laboratory Animal Care International, AAALAC) and plated with 30% L929-conditioned IMDM 10% serum (Gibco ES) supplemented with Penicillin/Streptomycin, 2-Mercaptoethanol, and L-Glutamine. At day 7, “naive cells” were kept in media for an extra 24 h, while “tolerized cells” were generated by pre-stimulation for 12 h with 100 ng/ml Lipid A, followed by three washes with warm PBS and 12 h rest with differentiation medium. At day 8, naive and tolerized cells were stimulated with 100 ng/ml Lipid A (Invivogen cat. no. tlrl-mpls) with no change of medium. Three biological replicates were prepared several weeks apart.

### RNA preparation and sequencing

After stimulation, BMDMs were harvested at desired timepoints. Subcellular fractions were prepared as described8. Cytoplasmic and nucleoplasmic RNA were isolated using DIRECT-zol micro prep kit (Zymo Research). For chromatin-associated RNA, chloroform extraction was done first, and the aqueous phase was then used to isolate RNA using the DIRECT-zol micro prep kit. In all cases, DNase I digestion was carried out to remove DNA. The nuclear fraction of one replicate was mishandled and thus not available for analysis; however, the chromatin and cytoplasmic fractions were still used when possible. Strand-specific libraries were generated from 250 ng–1 μg of RNA using KAPA Stranded RNA-seq with RiboErase Library Preparation kit (KAPA Biosystems, Wilmington, MA) according to the manufacturer’s instructions. Resulting cDNA libraries were paired-end sequenced multiple times for appropriate depth with a length of 101 bp on an Illumina HiSeq 2000 or Illumina HiSeq4000 (Illumina, San Diego, CA). Replicates run on either platform revealed little technical variability.

### RNA-seq data analysis

Adapter sequences were removed and low-quality 3′ ends were trimmed, if needed, using cutadapt v1.12 (ref. 35). Reads were aligned to the mm10 genome build with STAR 2.5.2b (ref. 36) using Gencode vM14 as reference annotation37. Pairs with unmapped reads were filtered out using samtools 1.3.1 (ref. 38). Pairs falling onto chrM, chrY and known rRNA regions were also filtered out based on the UCSC table browser39. Bam files for each sample were merged to a single bam file using samtools38. All RNA-seq data (FASTQ and BAM files) were deposited on ENCODE DCC (https://www.encodeproject.org/awards/U01HG007912/). For gene selection, chromatin-level expression was calculated using featureCounts 1.5.1 (ref. 40) on the whole gene body for uniquely mapped fragments based on Gencode vM14 annotations37. Genes for which all samples had fewer than 32 fragments were removed from further analysis. Highly expressed genes (3 data points with FPKM ≥ 1) were kept for further analysis. Then significantly induced genes were selected using R 3.5.1 (ref. 41) with the edgeR package 3.22.5 (ref. 42). Induction had to be at least tenfold relative to basal and within the first 40 min after LPA stimulation, with an FDR-corrected P value ≤0.01. Non-protein coding genes and predicted genes were filtered out, leaving a total of 288 genes for the naive condition. For the tolerized condition, the same list of genes was used if they also had ≥32 fragments in at least one sample and three samples with FPKM ≥1 on the chromatin, leaving 249 out of the 288 for further analysis. Furthermore, sample tracks of chromatin-associated RNA were examined using IGV43 to filter out false positives due to read-through from a close gene (14 genes for the naive and 6 for the tolerized condition, see examples in Supplementary Fig. S2). In addition, one gene was removed because the chromatin tracks/junction reads seemed to combine annotated transcripts from a different gene. Thus, we considered 273 genes for the naive condition and 242 for the tolerized condition.

### Actinomycin-D mRNA half-life measurement

Transcription was inhibited by adding 10 µg/ml of Actinomycin D (ActD; A9415, Sigma-Aldrich) at 0, 1, and 3 h of LPA stimulation. Cells were harvested at 0, 30, 60, 90, 120, 240, and 360 min after ActD addition (Supplementary Fig. S4A). Cells were harvested in TRIzol and RNA was isolated using the DIRECT-zol kit (Zymo) after DNase treatment. During RNA library preparation, 2 µl of 1:100 diluted RNA spike-in (Ambion ERCC Spike-In Mix Part no 4456740) was added for external normalization. Sequencing was done with a length of 50 bp on an Illumina HiSeq4000 sequencer. RNA sequencing processing included trimming of remaining adapter sequences using cutadapt v1.12 (ref. 35) and alignment of the reads with STAR36 to the mm10 genome with Gencode vM6 annotations37 supplemented by ERCC spike-in sequences. Unmapped reads were filtered out using samtools38. Gene-wise counts were generated with featureCounts40 using uniquely mapped reads. All sequencing FASTQ files were deposited to the Sequence Read Archive44 under BioProject ID PRJNA641336.

We created an R package, called ActDanalyser (https://github.com/dlefaudeux-ucla/ActDAnalyser), which allows users to easily calculate half-life from Actinomycin-D RNA sequencing data. Within the package, functions were implemented to render every step as easy as possible. First, genes with low counts (≤32) in all samples were removed from the analysis. Then the median ratio of each sample’s spike-ins to their geometric mean across samples was used as the normalization factor and applied to the corresponding sample’s gene set. The library size after normalization decreased with time after ActD addition, as expected given that mRNAs were decaying. This helped to flag some experiments that clearly did not follow that pattern and were not included in downstream analysis, for example, the sample corresponding to 120 min after ActD addition of the 1 h post Lipid A stimulation experiment (Supplementary Fig. S4B).
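The spike-in normalization described above resembles a median-of-ratios size factor computed on spike-in counts only. The following Python sketch illustrates that idea; it is not code from the R package, the function names and counts are hypothetical, and dividing gene counts by the factor is an assumption about how the factor is applied.

```python
import math
from statistics import median

def spikein_size_factors(spike_counts):
    """spike_counts maps sample name -> list of spike-in counts (same order
    in every sample). Returns sample name -> normalization factor."""
    samples = list(spike_counts)
    n_spikes = len(spike_counts[samples[0]])
    # Geometric mean of each spike-in across samples; spike-ins with a zero
    # count in any sample are skipped because their log is undefined.
    geometric_means = {}
    for i in range(n_spikes):
        values = [spike_counts[s][i] for s in samples]
        if min(values) > 0:
            geometric_means[i] = math.exp(sum(math.log(v) for v in values) / len(values))
    return {
        s: median(spike_counts[s][i] / g for i, g in geometric_means.items())
        for s in samples
    }

def normalize(gene_counts, factor):
    """Scale a sample's gene counts by its spike-in factor (assumed: division)."""
    return [count / factor for count in gene_counts]

# Hypothetical example: sample "b" was sequenced twice as deeply as "a".
spikes = {"a": [10, 100, 1000], "b": [20, 200, 2000]}
factors = spikein_size_factors(spikes)
print(factors)
print(normalize([50, 400], factors["a"]), normalize([100, 800], factors["b"]))
```

Because the spike-in amount is constant per sample, normalized library size is free to decrease over the ActD timecourse, which is the pattern used to flag aberrant samples.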

To derive mRNA half-life, the following issues were considered: (i) after ActD addition, polymerase arrest is not instantaneous; (ii) late timepoints are less reliable because short-lived mRNAs are at low levels. Therefore, the regression was allowed to start at any timepoint within the first hour, and the last timepoint was picked such that the regression gives the highest adjusted R² (Supplementary Fig. S4C). The steps are summarized as follows:

• Identify potential start points as timepoints t ≤ 1 h after ActD addition for which the normalized counts (in log2) are not lower than the maximum of the first hour minus 0.25× the maximum decay per hour.

• For each ActD timecourse: (i) identify possible end timepoints, (ii) run a linear regression between the log2 normalized counts and time, allowing removal of one point as long as at least three timepoints remain. Allow the intercept to be different for each replicate.

• Select the negative-slope regression that has the highest adjusted R².

• Calculate the slope confidence interval (CI) using the confint function from the R stats package41.

• Convert the slope to mRNA half-life.
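The core of the steps above (regression on log2 counts, end-point selection by adjusted R², and slope-to-half-life conversion) can be sketched as below. This is a simplified illustration, not the ActDanalyser code: it assumes a single replicate, always starts at the first timepoint, and omits the start-point rule, the optional removal of one point, replicate-specific intercepts, and the confidence interval. Because the counts are in log2, a slope of s per minute corresponds to a half-life of -1/s minutes.

```python
def linear_fit(t, y):
    """Ordinary least squares of y on t; returns (slope, adjusted R^2)."""
    n = len(t)
    mt, my = sum(t) / n, sum(y) / n
    sxx = sum((ti - mt) ** 2 for ti in t)
    sxy = sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = my - slope * mt
    ss_res = sum((yi - (intercept + slope * ti)) ** 2 for ti, yi in zip(t, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)  # one predictor
    return slope, adj_r2


def estimate_half_life(times_min, log2_counts, min_points=3):
    """Try every end timepoint and keep the negative slope with best adjusted R^2."""
    best = None
    for end in range(min_points, len(times_min) + 1):
        slope, adj_r2 = linear_fit(times_min[:end], log2_counts[:end])
        if slope < 0 and (best is None or adj_r2 > best[1]):
            best = (slope, adj_r2)
    if best is None:
        return None  # no decaying fit found
    slope, adj_r2 = best
    return {"half_life_min": -1.0 / slope, "adj_r2": adj_r2}


# Hypothetical mRNA with a 60 min half-life that reaches a detection floor late.
times = [0, 30, 60, 90, 120, 240, 360]
log2_counts = [10.0, 9.5, 9.0, 8.5, 8.0, 7.5, 7.5]
result = estimate_half_life(times, log2_counts)
```

In this hypothetical timecourse the two late points sit on a floor, so the best regression ends before them and recovers a half-life of about 60 min.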

As a resource to the community, a web interface called ActDBrowser (https://www.signalingsystems.ucla.edu/ActDBrowser) was implemented using the R shiny package, allowing users to search for the half-life of specific mRNA(s) in specific cell types and conditions.

### ChIP-sequencing

ChIP-seq was conducted according to published methods45 with 5 µg of antibody against H3K4me3 (05-745R, Millipore), H3K36me3 (ab9050, Abcam), H3K27ac (39133, Active-Motif), and H3K79me2 (ab3594, Abcam). ChIP-seq libraries were generated using the Kapa Hyper Prep Kit (KAPA Biosystems, Wilmington, MA) and were single-end sequenced on an Illumina HiSeq 2000 (Illumina, San Diego, CA) with a read length of 50 bp. FASTQ reads were aligned using the ENCODE-defined analysis pipeline for ChIP-seq read mapping46,47. Histone ChIP peaks were called using the ENCODE-defined analysis pipeline for histone ChIP-seq and annotated to the closest gene with HOMER suite v4.1148. Biological replicate histone signals were normalized to peak sequence depth using the ENCODE pipeline. Histone mark signals were averaged between replicates. All ChIP-seq data (FASTQ files) were deposited on ENCODE DCC (https://www.encodeproject.org/awards/U01HG007912/) and are publicly available.

### Regression and machine-learning modeling with ChIP-seq signals

Total, upstream, and downstream histone mark levels were calculated by summing across the mentioned ranges. Alternatively, histone mark levels were partitioned into windows of fixed width based on the average width of each mark. Histone windows were symmetrically centered around the TSS. H3K27ac had four windows with a width of 7500 bp. H3K36me3 had eight windows with a width of 6250 bp. H3K4me3 had four windows with a width of 2500 bp. H3K79me2 had six windows with a width of 8333 bp. For each window, the average was calculated for all ChIP signals located within its bounds. Boundary ranges for windows were manually defined to incorporate a threshold of at least 50% of genome-wide peaks for each histone mark, also taking into account the function of each mark. Thus, for classification, each histone mark had several associated features: fixed-width windows, total, upstream, and downstream signal. Extreme Gradient Boosting (XGBtree) models were trained to predict the derived mRNA transport parameters using various combinations of input features (histone peak features, gene length, number of introns, and splicing probability as defined previously), with fivefold cross-validation repeated three times and an 80/20 data split49. Hyperparameters of each model were tuned to improve model R² in the following order: number of rounds, learning rate, maximum depth, child weight, column and row sampling, and gamma. R² metrics were calculated for the resamples of each model and used to compare predictive performance. Plots were generated using the R packages ggplot250 and ComplexHeatmap51. All p-values were determined using the Mann–Whitney U test. Feature importance represents the gain of each feature (calculated with the varImp function from the caret R package) over the total gain of all features (i.e., importances sum to 1). This analysis was done in R.

### Estimating splicing probability

The nuclear intron percentage was measured using SIRI12 on the nucleoplasmic RNA-seq data. SIRI counts the number of spliced reads across the exon–exon junction (EE) and the number of reads spanning the exon–intron junctions on both sides of the intron (EI and IE). The percentage of intron (PI) for each individual intron was calculated as:

$${PI}=\frac{\frac{{EI}+{IE}}{2}}{{EE}+\frac{{EI}+{IE}}{2}}$$
(1)

For each gene having a single main isoform, the splicing probability (SP) was calculated assuming that each intron was independently spliced:

$${SP}=\mathop{\prod}\limits_{{intron}}(1-{{PI}}_{{intron}})$$
(2)

The splicing probability was calculated using the average PIs of the last two timepoints (90 and 120 min), as these correlate well with the basal steady-state PIs while having more reads on the junctions, which allows for more accurate quantification. Moreover, the splicing probability was calculated only if the gene had more than ten junction reads (spliced or unspliced) for all its introns.
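Equations (1) and (2) can be illustrated with a short sketch. The function names and the input layout (one tuple of EE, EI, IE read counts per intron) are hypothetical, and the read-count filter reflects one reading of the criterion above, namely that every intron must have more than ten junction reads.

```python
from math import prod


def percent_intron(ee, ei, ie):
    """Eq. (1): unspliced junction signal over total junction signal."""
    unspliced = (ei + ie) / 2
    return unspliced / (ee + unspliced)


def splicing_probability(introns, min_junction_reads=10):
    """Eq. (2): product over introns of (1 - PI), assuming independent splicing.

    introns: list of (EE, EI, IE) read counts, one tuple per intron.
    Returns None if any intron has too few junction reads (assumed rule).
    """
    for ee, ei, ie in introns:
        if ee + ei + ie <= min_junction_reads:
            return None
    return prod(1 - percent_intron(ee, ei, ie) for ee, ei, ie in introns)


# Hypothetical two-intron gene: PI = 0.1 and 0.2, so SP = 0.9 * 0.8.
sp = splicing_probability([(90, 10, 10), (80, 20, 20)])
```

Because SP is a product over introns, genes with many introns tend toward lower splicing probabilities even when each individual PI is small.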

### RNA-binding protein (RBP) motif analysis

RBP sequence motif analysis used the AME tool (Analysis of Motif Enrichment) from the MEME suite52. The 5′-UTR and 3′-UTR sequences of the main isoform of each gene were used, and any number of RBP motifs from human and mouse motif databases53 were searched for. Enrichment was tested by Spearman correlation (--method spearman) between the total number of hits (--scoring totalhits) in the respective UTR sequence and the effective transport rate derived from the modeling, using a threshold of 0.25 times the maximum log-odds score of the motif for a site to be considered a hit (--hit-lo-fraction 0.25).

### Mathematical model formulation

To describe mRNA transport through the different cellular compartments, a two-step model was written as a system of two ordinary differential equations:

$$\frac{d{{{{{{\rm{RNA}}}}}}}_{{{{{{\rm{np}}}}}}}}{{dt}}={k}_{{ca}\to {np}}\cdot {{{{{{\rm{RNA}}}}}}}_{{{{{{\rm{ca}}}}}}}-\left({k}_{{np}\to {cyto}}+{k}_{{np}{\deg }}\right)\cdot {{{{{{\rm{RNA}}}}}}}_{{{{{{\rm{np}}}}}}}$$
(3)
$$\frac{d{{{{{{\rm{RNA}}}}}}}_{{{{{{\rm{cyto}}}}}}}}{{dt}}={k}_{{np}\to {cyto}}\cdot {{{{{{\rm{RNA}}}}}}}_{{{{{{\rm{np}}}}}}}-{k}_{{cyto\_deg}}\cdot {{{{{{\rm{RNA}}}}}}}_{{{{{{\rm{cyto}}}}}}}$$
(4)

This model describes the absolute number of RNA transcripts in the different cellular compartments. As RNA-seq measurements are only relative, normalization factors $$\alpha$$, $$\beta$$, $$\gamma$$ were included:

$$x=\alpha \cdot {{{{{{\rm{RNA}}}}}}}_{{{{{{\rm{ca}}}}}}}$$
$$y=\beta \cdot {{{{{{\rm{RNA}}}}}}}_{{{{{{\rm{np}}}}}}}$$
$$z=\gamma \cdot {{{{{{\rm{RNA}}}}}}}_{{{{{{\rm{cyto}}}}}}}$$
$$\frac{dy}{dt}=\underbrace{\frac{{\beta}}{{\alpha}}{k}_{ca \to np}}_{{k}_{1}^{{\prime} }} \cdot x- \underbrace{({k}_{np \to cyto}+{k}_{npdeg})}_{{k}_{2}} \cdot y$$
(5)
$$\frac{dz}{dt}=\underbrace{\frac{\gamma }{\beta }{k}_{np \to cyto}}_{{k}_{2}^{{\prime} }} \cdot y-{k}_{cyto\_deg} \cdot z$$
(6)

The value of each normalization factor is the same for all genes, allowing comparisons between genes.

Summary of mathematical model parameters

| Parameter | Description | Unit |
| --- | --- | --- |
| k1’ | Transport rate constant from chromatin to the nucleoplasmic fraction | (npFPKM/caFPKM)/min |
| k2 | Rate of disappearance from the nucleoplasmic fraction (either by export or degradation) | min−1 |
| k2’ | Transport rate constant from nucleoplasmic to cytoplasmic fraction | (cytoFPKM/npFPKM)/min |
| kcyto_deg | Degradation rate constant in the cytoplasmic fraction | min−1 |
| k1’/k2 | Chromatin-release efficiency | npFPKM/caFPKM |
| k2’/k2 | Transport efficiency (from nucleoplasm to cytoplasm) | cytoFPKM/npFPKM |
| k1’k2’/k2 | Effective transport rate (rate constant from chromatin to the cytoplasmic fraction) | (cytoFPKM/caFPKM)/min |
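To make the behavior of Eqs. (5) and (6) concrete, the sketch below integrates them with a fixed-step fourth-order Runge–Kutta scheme. It is an illustration only: the integrator, the function name, and the parameter values are assumptions, and the chromatin input x(t) is passed as an arbitrary function rather than a fitted spline. For a constant input x, the system relaxes to y = (k1’/k2)·x and z = (k1’k2’/k2)·x/kcyto_deg, which shows how the chromatin-release efficiency and the effective transport rate set the steady-state levels.

```python
def simulate(x_of_t, k1p, k2, k2p, kcyto_deg, t_end, dt=0.1, y0=0.0, z0=0.0):
    """Integrate dy/dt = k1'*x - k2*y and dz/dt = k2'*y - kcyto_deg*z (RK4)."""

    def deriv(t, y, z):
        x = x_of_t(t)
        return k1p * x - k2 * y, k2p * y - kcyto_deg * z

    t, y, z = 0.0, y0, z0
    for _ in range(int(round(t_end / dt))):
        a1, b1 = deriv(t, y, z)
        a2, b2 = deriv(t + dt / 2, y + dt / 2 * a1, z + dt / 2 * b1)
        a3, b3 = deriv(t + dt / 2, y + dt / 2 * a2, z + dt / 2 * b2)
        a4, b4 = deriv(t + dt, y + dt * a3, z + dt * b3)
        y += dt / 6 * (a1 + 2 * a2 + 2 * a3 + a4)
        z += dt / 6 * (b1 + 2 * b2 + 2 * b3 + b4)
        t += dt
    return y, z


# Hypothetical rate constants (per minute) and a constant chromatin signal of 10.
k1p, k2, k2p, kdeg = 0.5, 0.1, 0.2, 0.05
y_end, z_end = simulate(lambda t: 10.0, k1p, k2, k2p, kdeg, t_end=500.0)
y_expected = (k1p / k2) * 10.0               # chromatin-release efficiency * x
z_expected = (k1p * k2p / k2) * 10.0 / kdeg  # effective transport rate * x / kcyto_deg
```

After 500 min the simulated values agree with the analytical steady state (50 and 200 in these hypothetical units).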

### Gene annotation for quantifying mRNA abundances

Reliable quantification of mRNA abundances is critical for modeling, and this relies on accurate gene annotations. The model considers full-length transcripts, so to estimate chromatin-associated transcripts we considered the exonic regions within the last 5 kb of each gene. To ensure accurate annotation of the transcription end site (TES), genome browser tracks were manually checked and annotated. The Gencode annotation and the manual reannotation of the TSS and TES were compared against external databases, using CAGE peaks54 and 3′-end sequencing data10 (Supplementary Fig. S2G). The manual annotation was often closer to the TSS and TES from these external databases than the Gencode annotation37. One gene with very low cytoplasmic mRNA levels was removed because it did not allow for reliable identification of the expressed isoform; another gene was removed because its transcripts overlapped with those of another gene. Genes for which expression tracks/splice junctions seemed to come from unannotated transcripts were also removed (35 for naive, 33 for tolerized). Additionally, genes for which the TES location was uncertain were removed (4 for naive, 3 for tolerized). Moreover, genes having more than one isoform corresponding to ≥10% of the total expression and ≥1 FPKM in more than ¼ of the samples (custom script in Python 3.7) were considered to have multiple isoforms expressed. This curation was based on examining tracks as well as on isoform expression estimated with cufflinks55. When genes were deemed to have multiple isoforms expressed, the last 5 kb was reduced to the exonic portion shared by all expressed isoforms. If the exonic portion of the last 5 kb region was shorter than 500 bp (21 for naive, 19 for tolerized), the gene was removed from further analysis, because such a short length undermines the reliability of expression estimation.
Overall, 77 genes were removed in the naive condition (15 genes based on chromatin RNA filtering and 62 based on cytoplasmic RNA filtering), leaving 211 genes for modeling. In the tolerized condition, 102 genes were removed (46 based on chromatin RNA filtering and 56 based on cytoplasmic RNA filtering), leaving 186 genes. Details of the manual curation can be found in Supplementary Data 1 and examples in Supplementary Fig. S2. To fit the model, relative gene expression was estimated using FPKM as it is proportional to the number of transcripts.

### Error model for RNA-seq analysis

In order to fit the model to each replicate sample individually, we developed an error model. The main sources of error are measurement error, timepoint sampling error, and biological variability between individual samples. We first considered timepoint sampling error and biological variability between samples. Timepoint sampling error affects highly dynamic gene expression trajectories and is therefore a function of the derivative at that timepoint56. Let g be the gene expression and lg be the gene expression in log scale; then, to first order, $${lg}_{observed}(timepoint)={lg}\left(t+\Delta t\right)+{\varepsilon }_{b}\approx {lg}\left(t\right)+\Delta t\cdot {lg}^{\prime }\left(t\right)+{\varepsilon }_{b}$$, where εb represents the biological variability and Δt the temporal variability. We assume that $${\varepsilon }_{b} \sim N\left(0,{\sigma }_{b}^{2}\right)$$ and $$\Delta t \sim N(0,{\sigma }_{t}^{2})$$, so that $$\Delta t\cdot {lg}^{\prime } \sim N\left(0,{slope}^{2}\cdot {\sigma }_{t}^{2}\right)$$, where $${slope}={lg}^{\prime }$$. Hence $${lg}_{observed}(timepoint) \sim N\left({lg}\left(t\right),{slope}^{2}\cdot {\sigma }_{t}^{2}+{\sigma }_{b}^{2}\right)$$ and thus $${g}_{observed}\left(timepoint\right)=g\left(t\right)\cdot {\varepsilon }_{total}$$, with $${\varepsilon }_{total} \sim \log N\left(0,{\sigma }_{total}^{2}={\sigma }_{b}^{2}+{slope}^{2}\cdot {\sigma }_{t}^{2}\right)$$.

Moreover, mRNA abundance measurements are subject to sampling error, especially when the number of reads for a given gene is small. Sampling error is usually represented by a binomial distribution but given that any given gene will be represented by only a small proportion of the total number of reads (which is large), this can be approximated by a Poisson distribution.

When n is large and the proportion p small: $${Binomial}\left(n,p\right)\simeq {Poisson}\left(\lambda=n\cdot p\right)$$, thus $$c{ount}{s}_{{observed}}\left({timepoint}\right) \sim {Poisson}\left(\lambda=N\cdot p\right)$$, where N is the total number of reads and p is the proportion of reads that should belong to the given gene and is proportional to the true mRNA abundance, which is assumed to follow a log-normal distribution. We approximated the log-normal distribution to a Gamma distribution with equivalent mean and variance. Therefore:

$$g\left({timepoint}\right) \sim {\log }N\left({{{{{\rm{lg}}}}}}\left(t\right),{\sigma }_{{total}}^{2}\right) \\ \simeq \varGamma \left(k=\frac{1}{{{{{{\rm{exp }}}}}}\left({\sigma }_{{total}}^{2}\right)-1},\theta=\left({{{{{\rm{exp }}}}}}\left({\sigma }_{{total}}^{2}\right)-1\right)\cdot {{{{{\rm{exp }}}}}}\left({lg}\left(t\right)+\frac{{\sigma }_{{total}}^{2}}{2}\right)\right)$$
(7)

This leads to the expression that the observed counts follow a negative binomial:

$${counts}_{observed} \sim {NB}\left(r=\frac{1}{\exp \left({\sigma }_{total}^{2}\right)-1},\; p=1-\frac{1}{1+\mu \cdot \left(\exp \left({\sigma }_{total}^{2}\right)-1\right)\cdot \exp \left(\frac{{\sigma }_{total}^{2}}{2}\right)}\right)$$
(8)

where μ is the true expected number of counts.

Such a negative binomial distribution is commonly used for representing counts distributions in RNA-seq analyses, for example, in software packages edgeR, DESeq, cuffdiff from cufflinks.
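The chain from the error model to the negative binomial can be checked numerically with a small sketch. It assumes natural-log units for the log-normal (as implied by the exponentials in Eq. (7)), uses hypothetical function names and values, and takes the standard result that a Poisson variable whose rate is Gamma(k, θ)-distributed is negative binomial with r = k and p = θ/(1 + θ).

```python
import math


def total_variance(sigma_b, sigma_t, slope):
    """sigma_total^2 = sigma_b^2 + slope^2 * sigma_t^2 (error model above)."""
    return sigma_b ** 2 + slope ** 2 * sigma_t ** 2


def gamma_from_lognormal(log_mu, sigma2):
    """Eq. (7): Gamma(k, theta) with the same mean and variance as logN(log_mu, sigma2)."""
    k = 1.0 / (math.exp(sigma2) - 1.0)
    theta = (math.exp(sigma2) - 1.0) * math.exp(log_mu + sigma2 / 2.0)
    return k, theta


def negative_binomial_from_gamma(k, theta):
    """Poisson counts with a Gamma(k, theta) rate: NB(r = k, p = theta / (1 + theta))."""
    return k, theta / (1.0 + theta)


# Hypothetical values: expected count scale of 100 reads, modest variability.
sigma2 = total_variance(sigma_b=0.1, sigma_t=5.0, slope=0.02)
k, theta = gamma_from_lognormal(math.log(100.0), sigma2)
r, p = negative_binomial_from_gamma(k, theta)

lognormal_mean = math.exp(math.log(100.0) + sigma2 / 2.0)
nb_mean = r * p / (1.0 - p)        # equals k * theta
nb_var = r * p / (1.0 - p) ** 2    # equals mean + mean^2 / r
```

The negative binomial mean matches the log-normal mean, and its variance exceeds the Poisson variance by mean²/r, which is the overdispersion contributed by biological and timepoint variability.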

### Cost function for model fitting

The cost function was defined using the likelihood of the model reproducing the log2(FPKM) data, with the counts following the negative binomial distribution described above (FPKM are counts normalized for library size and gene length). The total cost sums the negative log-likelihood of each timepoint t of each compartment (cpt):

$${Cost}=-\log \left({\mathcal{L}}\left(\theta,\alpha \mid {\mathcal{D}}\right)\cdot P\left(\alpha \right)\right)=\mathop{\sum}\limits_{t}\mathop{\sum}\limits_{cpt}-\log \left(P\left({\log }_{2}\left({FPKM}_{cpt}\left(t\right)\right)\mid \theta,\alpha \right)\right)-\log \left(P\left(\alpha \right)\right)$$
(9)

Here, θ represents the model parameters k1’, k2, k2’, and kcyto_deg, and α represents additional parameters, such as those for smoothing (spar), time variability (σt), and sample variability (σb), shown below. These were also included in the error model; however, their effects are assumed to be relatively small for a single replicate, so these parameters were regularized in the cost function by giving them a prior that assigns a higher probability to low values (see below). In one replicate the unstimulated timepoint for the chromatin data was missing; this was added as a parameter (ca0) and also regularized as follows:

Prior distribution used for cost function parameters

| Parameter | Prior distribution |
| --- | --- |
| spar | $${\mathscr{N}}\left(\mu=0.45,{\sigma }^{2}=0.0025\right)$$ |
| σt | $${\mathscr{N}}_{1/2}(\mu=0,{\sigma }^{2}=25)$$ |
| σb | $${\mathscr{N}}_{1/2}(\mu=0,{\sigma }^{2}=0.01)$$ |
| ca0 | $${\mathscr{N}}\left(\mu={ca}\left({t}_{1}\right)-\Delta {ca},{\sigma }^{2}=0.25\right)$$ with $$\Delta {ca}={\rm{mean}}_{r\in rep}\left({ca}_{r}({t}_{2})-{ca}_{r}({t}_{1})\right)$$ |

where $${{{{{{\mathscr{N}}}}}}}_{1/2}$$ represents the half-normal distribution.

### Model simulation

Each replicate was used separately for fitting the model parameters, allowing a comparison of the optimal parameter sets. For each replicate, the chromatin-associated expression was interpolated using the smooth.spline function in R (with each point weighted based on its log2(FPKM) probability, accounting for sampling error), then the model was simulated using a defined set of parameters. The numerical simulations were done using the deSolve R package57 as well as the compiler package41 for faster execution of the ODE model and cost function calculation.

### Parameter estimation

A local optimization method (BFGS as implemented in the R optim function) was run from 1000 different random initial parameter sets to find the best parameter set, as this multi-start strategy has been shown to be as efficient as global optimization methods58. The initial parameters were sampled from the distributions listed below. This pipeline was run separately for each replicate.

Distributions used to sample the 1000 initial parameter sets

| Group | Parameter | Distribution |
| --- | --- | --- |
| θ | log10(k1’) | $${\mathscr{U}}(a=-5,b=5)$$ |
| θ | log10(k1’/k2) | $${\mathscr{U}}(a=-5,b=5)$$ |
| θ | log10(k2/k2’) | $${\mathscr{U}}(a=-5,b=5)$$ |
| θ | log10(k2’/kcyto_deg) | $${\mathscr{U}}(a=-5,b=5)$$ |
| α | spar | $${\mathscr{N}}(\mu=0.45,{\sigma }^{2}=0.0025)$$ |
| α | σt | $${\mathscr{N}}_{1/2}(\mu=0.45,{\sigma }^{2}=0.0025)$$ |
| α | σb | $${\mathscr{N}}_{1/2}(\mu=0,{\sigma }^{2}=0.01)$$ |
| α | ca0 | $${\mathscr{N}}(\mu=ca({t}_{1})-\Delta ca,{\sigma }^{2}=0.25)$$ with $$\Delta ca={\rm{mean}}_{r\in rep}(c{a}_{r}({t}_{2})-c{a}_{r}({t}_{1}))$$ |

### Fit quality assessment

To assess the fit quality of parameterized models, we considered that the likelihood used for fitting is not a sufficient measure on its own. For example, if the fits of two different genes had the same likelihood, but for one gene the simulations were consistently below the data, that fit would be perceived as worse than the fit for the other gene, whose simulations are sometimes below and sometimes above the data, especially if the data for the second gene are more jagged. Similarly, if one compartment is not well fitted even though the others are, the perceived fit quality is dominated by the poorly fitting compartment. Hence, we developed the following metric, which was used to report the perceived fit quality:

$$\mathop{{{{{{\rm{max }}}}}}}\limits_{{cpt}}\left(\left|{{{{{\rm{autocorr}}}}}}\left({{error}}_{{cpt}}\right)\right|\cdot \frac{{{{{{\rm{mean}}}}}}\left(\left|{{error}}_{{cpt}}\right|\right)}{{{range}}_{{cpt}}}\right)$$
(10)

A good fit should have independent residuals, i.e., no autocorrelation and a relatively small remaining error.
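A minimal sketch of the metric in Eq. (10) is given below. Several details are not specified in the text and are assumptions here: the error is taken as observed minus simulated values, the autocorrelation is computed at lag 1 around zero (residuals of a good fit are expected to have zero mean, so a constant offset registers as strongly autocorrelated), and the range is the range of the observed data in each compartment.

```python
def lag1_autocorrelation(errors):
    """Lag-1 autocorrelation around zero (assumed definition, not mean-centered)."""
    denom = sum(e * e for e in errors)
    if denom == 0:
        return 0.0  # perfect fit: no residual structure
    return sum(errors[i] * errors[i + 1] for i in range(len(errors) - 1)) / denom


def fit_quality(observed, simulated):
    """Eq. (10): worst compartment of |autocorr(error)| * mean(|error|) / range.

    observed, simulated: dict mapping compartment -> values at matching timepoints.
    Lower values indicate a better perceived fit.
    """
    scores = []
    for cpt, obs in observed.items():
        errors = [o - s for o, s in zip(obs, simulated[cpt])]
        mean_abs_error = sum(abs(e) for e in errors) / len(errors)
        data_range = max(obs) - min(obs)
        scores.append(abs(lag1_autocorrelation(errors)) * mean_abs_error / data_range)
    return max(scores)


# Hypothetical data: same error magnitude, different residual structure.
observed = {"np": [1.0, 2.0, 3.0, 4.0, 5.0]}
always_above = {"np": [1.5, 2.5, 3.5, 4.5, 5.5]}   # simulation consistently too high
unstructured = {"np": [0.5, 1.5, 3.5, 4.5, 4.5]}   # residual signs without a pattern
score_biased = fit_quality(observed, always_above)
score_unstructured = fit_quality(observed, unstructured)
```

Under these assumptions the consistently offset simulation receives a higher (worse) score than the one with unstructured residuals, although both have the same mean absolute error.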

To assess the impact of every parameter a profile likelihood approach was used59. For each parameter, its profile can be estimated by:

$$P{L}_{{\theta }_{i}}(x)=\mathop{\max }\limits_{\theta|{\theta }_{i}=x}\,\log ( {\mathcal L} (\theta|{{{{{\mathcal{D}}}}}}))$$
(11)

The profile likelihood can also be used to estimate the confidence intervals of the parameters59:

$$C{I}_{{\theta }_{i},\alpha }=\left\{{\theta }_{i}=x|-P{L}_{{\theta }_{i}}(x)\le \mathop{\min }\limits_{\theta }(-\,\log ( {\mathcal L} (\theta|{{{{{\mathcal{D}}}}}})))+\frac{1}{2}\Delta (\alpha )\right\}$$
(12)

Here, α represents the chosen confidence level. For a sufficient amount of data:

$$\Delta \left(\alpha \right)={{{{{\rm{icdf}}}}}}\left({\chi }_{1}^{2},\alpha \right)$$
$$\Delta \left(0.95\right)=3.841459$$
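The threshold Δ(α) and the confidence interval rule of Eq. (12) can be illustrated with a toy example. The sketch is not the dMod workflow: it evaluates a hypothetical one-parameter negative log-likelihood on a grid instead of re-optimizing the remaining parameters, and it obtains the χ² quantile with one degree of freedom as the square of a standard normal quantile.

```python
from statistics import NormalDist


def delta(alpha):
    """Quantile of the chi-square distribution with 1 degree of freedom."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    return z * z


def profile_confidence_interval(grid, neg_profile, alpha=0.95):
    """Eq. (12): grid values whose -PL lies within delta(alpha)/2 of the minimum."""
    threshold = min(neg_profile) + 0.5 * delta(alpha)
    inside = [x for x, value in zip(grid, neg_profile) if value <= threshold]
    return min(inside), max(inside)


# Toy profile: Gaussian negative log-likelihood with optimum 2.0 and sd 0.5.
grid = [i / 1000 for i in range(0, 4001)]
neg_profile = [(x - 2.0) ** 2 / (2 * 0.5 ** 2) for x in grid]
lower, upper = profile_confidence_interval(grid, neg_profile)

print(round(delta(0.95), 6))  # 3.841459
```

For this toy profile the interval is close to 2.0 ± 1.96 × 0.5, as expected for a quadratic log-likelihood.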

We used the R package dMod60 as well as the numDeriv package to calculate for each parameter from θ and combinations such as k1’/k2, k2’/k2, and k1’k2’/k2 the profile likelihood up to the confidence interval limits (or 1000-fold lower to 1000-fold higher, whichever condition was met first). To be able to apply the profile likelihood measure to the compound parameters and estimate their confidence interval, the model was modified to represent those quantities. Specifically, for k1’/k2, k2’/k2 the model was reparametrized as follows, where four parameters of the model are now k2, k1’/k2, k2’/k2, and kcyto_deg:

$$\frac{{dy}}{{dt}}={k}_{2}\cdot \frac{{k}_{1}^{{\prime} }}{{k}_{2}}\cdot x-{k}_{2}\cdot y$$
(13)
$$\frac{{dz}}{{dt}}={k}_{2}\cdot \frac{{k}_{2}^{{\prime} }}{{k}_{2}}\cdot y-{k}_{{cyto\_deg}}\cdot z$$
(14)

With this new parameterization, the profile likelihood was computed only for k1’/k2 and k2’/k2, as the confidence intervals for k2 and kcyto_deg were already assessed with the original parameterization.

For k1’k2’/k2, the model was re-parameterized using the four parameters k2, k1’k2’/k2, k2’/k2, and kcyto_deg:

$$\frac{{dy}}{{dt}}=\frac{1}{\frac{{k}_{2}^{{\prime} }}{{k}_{2}}}\cdot \frac{{k}_{1}^{{\prime} }\cdot {k}_{2}^{{\prime} }}{{k}_{2}}\cdot x-{k}_{2}\cdot y$$
(15)
$$\frac{{dz}}{{dt}}={k}_{2}\cdot \left(\frac{{k}_{2}^{{\prime} }}{{k}_{2}}\right)\cdot y-{k}_{{cyto\_deg}}\cdot z$$
(16)

Similarly, the profile likelihood was employed only for k1’k2’/k2 as the confidence interval was already assessed for the other parameters.
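That the reparameterizations in Eqs. (13) to (16) leave the dynamics unchanged can be verified numerically. The sketch below evaluates the right-hand sides of the three parameterizations at an arbitrary state; function names and numerical values are hypothetical.

```python
import math


def rhs_original(x, y, z, k1p, k2, k2p, kcyto_deg):
    """Eqs. (5) and (6)."""
    return k1p * x - k2 * y, k2p * y - kcyto_deg * z


def rhs_efficiencies(x, y, z, k2, release_eff, transport_eff, kcyto_deg):
    """Eqs. (13) and (14), with release_eff = k1'/k2 and transport_eff = k2'/k2."""
    return k2 * release_eff * x - k2 * y, k2 * transport_eff * y - kcyto_deg * z


def rhs_effective(x, y, z, k2, effective_rate, transport_eff, kcyto_deg):
    """Eqs. (15) and (16), with effective_rate = k1'*k2'/k2."""
    dy = (1.0 / transport_eff) * effective_rate * x - k2 * y
    dz = k2 * transport_eff * y - kcyto_deg * z
    return dy, dz


# Hypothetical parameter values and state.
k1p, k2, k2p, kdeg = 0.5, 0.1, 0.2, 0.05
x, y, z = 10.0, 3.0, 7.0
reference = rhs_original(x, y, z, k1p, k2, k2p, kdeg)
via_efficiencies = rhs_efficiencies(x, y, z, k2, k1p / k2, k2p / k2, kdeg)
via_effective = rhs_effective(x, y, z, k2, k1p * k2p / k2, k2p / k2, kdeg)
same = all(
    math.isclose(a, b, rel_tol=1e-12) and math.isclose(a, c, rel_tol=1e-12)
    for a, b, c in zip(reference, via_efficiencies, via_effective)
)
```

Since all three forms give identical derivatives, profiling a compound parameter changes only which quantity is held fixed during re-optimization, not the model being fitted.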

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.