Bias toward long gene misregulation in synaptic disorders can be an artefact of amplification-based methods

Several recent studies have suggested that genes that are longer than 100 kilobases are more likely to be misregulated in neurological diseases associated with synaptic dysfunction, such as autism and Rett syndrome. These length-dependent transcriptional changes are modest in Mecp2-mutant samples, but, given the low sensitivity of high-throughput transcriptome profiling technology, the statistical significance of these results needs to be re-evaluated. Here, we show that the apparent length-dependent trends previously observed in MeCP2 microarray and RNA-Sequencing datasets, particularly in genes with low fold-changes, disappeared after accounting for baseline variability estimated from randomized control samples. As we found no similar bias with NanoString technology, this long-gene bias seems to be particular to PCR amplification-based platforms. In contrast, authentic long gene effects, such as those caused by topoisomerase inhibition, can be detected even after adjustment for baseline variability. Accurate detection of length-dependent trends requires establishing a baseline from randomized control samples. HIGHLIGHTS Length-dependent gene misregulation is not intrinsic to Mecp2 disruption. Topoisomerase inhibition produces an authentic long gene bias. PCR amplification-based high-throughput datasets are biased toward long genes.


23
Several recent studies have suggested that genes that are longer than 100 kilobases are more 24 likely to be misregulated in neurological diseases associated with synaptic dysfunction, such as autism 25 and Rett syndrome. These length-dependent transcriptional changes are modest in Mecp2-mutant 26 samples, but, given the low sensitivity of high-throughput transcriptome profiling technology, the 27 statistical significance of these results needs to be re-evaluated. Here, we show that the apparent length-28 dependent trends previously observed in MeCP2 microarray and RNA-Sequencing datasets, particularly 29 in genes with low fold-changes, disappeared after accounting for baseline variability estimated from 30 randomized control samples. As we found no similar bias with NanoString technology, this long-gene 31 bias seems to be particular to PCR amplification-based platforms. In contrast, authentic long gene effects, 32 such as those caused by topoisomerase inhibition, can be detected even after adjustment for baseline 33 variability. Accurate detection of length-dependent trends requires establishing a baseline from 34 randomized control samples.

40
• Topoisomerase inhibition produces an authentic long gene bias.

41
• PCR amplification-based high-throughput datasets are biased toward long genes.

99
The first data that we analyzed were those from a study that evaluated transcriptional effects of

106
Given that these untreated samples were obtained from littermates, we did not expect to observe any 107 differences in gene expression and predicted that a running average plot comparing gene expression 108 between vehicle-treated control samples would yield a horizontal line through y=0. However, we found 109 that genes over 100kb in length tended to be down-regulated on average (blue curve in Figure 1A) when 110 gene expression levels between control samples are compared. This effect was found for both RNA-Seq 111 and microarray datasets ( Figure 1A) and indicates that a portion of the length-dependent trend observed in 112 the topotecan datasets is due to a length-dependent bias (i.e. noise) that can be observed even in the 113 control samples.

114
To determine the significance of average fold-change trends, we applied a Student's t-test to each 115 of the matching data bins from the drug vs. vehicle (D/V) and vehicle vs. vehicle (V/V) comparisons, 116 followed by an adjustment for multiple hypothesis testing. For consistency, these plots are referred to as 117 overlap plots (Experimental Procedures, Figure S1). At a false discovery rate of 0.05, only the long gene 118 bins were statistically significant and showed preferential downregulation following topotecan treatment 119 in both RNA-Seq and microarray datasets (lower panel in Fig 1A, red dots indicate statistically significant 120 bins; Figure S1). In other words, although the control samples showed that long genes are downregulated 121 at baseline (i.e. when comparing controls to controls), topotecan treatment produced an even stronger suggested that loss of MeCP2 function causes preferential upregulation of long genes and, conversely, 132 that gain of MeCP2 function leads to preferential downregulation of long genes (Gabel et al., 2015). We 133 chose to delve deeper into these datasets to explore the extent of the contribution of long genes to RTT 134 pathology. We applied our method to eleven MeCP2 datasets (Table 2)

145
A few long gene bins showed significant preferential upregulation in Mecp2-null mice (FDR < 146 0.05) in these datasets. For example, in hypothalamus dataset, we found 12 bins of long genes to be 147 significant ( Figures 1B, right panel). However, we observed a similar or even larger number of 148 significant bins for genes less than 100k (Figures 1B-1C; see also Figures S2B-S2C, S2F-S2G, and S2J). 149 Likewise, no preferential repression of long genes was observed for datasets from Mecp2-overexpression 150 models (Tg) ( Figure 1C; see also Figure S2L). Indeed, we found more short genes to be preferentially 151 dysregulated in the Mecp2-overexpression models ( Figure 1C). Thus, when assessing the bins of genes 152 with the significant difference in expression between WT and KO mice, we found that genes with a 153 variety of lengths were altered in KO and Tg mice. Additionally, while there are certainly some long 154 genes with significantly altered expression in both KO and Tg mice, there is no consistent and preferential 155 long gene trend observed in the Mecp2 datasets.

165
We reanalyzed the data using overlap plots and observed no significant downregulation of long 166 genes in wildtype or Mecp2-mutant excitatory neurons from 18-week old T158M or R106W female mice 167 ( Figures 1D and S4E). In excitatory neurons bearing the R106W mutation, we observed few bins that are 168 significantly different from WT expression levels. Notably, bins with significant gene expression changes 169 were not due to the downregulation of long genes in mutant samples. Rather, these bins were significant 170 due to the downregulation of long genes in control (WT) samples, as indicated by the downward slopes of 171 the running average plots comparing WT vs. WT samples (blue lines in Figures 1D and S4E). Similarly, 172 we observed no significant repression of long genes in nuclear RNA-Seq datasets of excitatory and 173 inhibitory neurons from 6-week old male mice with the same mutation type ( Figures S4A-S4D). Finally, 174 when we examined downregulation of long genes from the GRO-Seq (global nuclear run-on with high-175 throughput sequencing) data collected from these mice, we confirmed a marginal significance in the 176 downregulation of long genes ( Figure S3A), but upregulation of long genes was not observed in whole 177 cell RNA-Seq data ( Figure S3B). These results suggest that the transcriptome changes in long genes that 178 appear in RNA isolation-based methods are independent of the sex, age, or mutation type of the mouse.

179
Together, these results suggest that when the fold-change difference is 50% or more, as it is in the 180 topotecan datasets, there is likely to be a genuine long gene bias. When the fold-change effect is small 181 (<15%), however, as it is with the long genes observed in the Mecp2 datasets, it is more likely that the 182 observed long gene trend is due to inherent variation among samples. The reported long gene trend in the 183 Mecp2 datasets is in the same range as the noise that we derived from the intra-sample comparison in the 184 control groups, and this effect was seen in all the Mecp2 datasets that we assessed. This further suggests 185 that the length-dependent variability estimated from microarray and RNA-Seq platforms is not sensitive 186 enough to capture small transcriptional changes. We, therefore, recommend that baseline gene length  Figure 2D, left panel) when it was compared to its age-223 matched control (age = 18 years, n = 1 each). To further probe whether the long-gene trend might be 8 present in the early stages of the disease, we compared a RTT postmortem male sample from frontal 225 cortex (age = 1 year, n = 1) to an age-matched control sample (age = 2 days, n = 1) and again could find 226 no significant upregulation of long genes ( Figure 2D theory, the ratio should be independent of gene length in the brain and non-brain tissues. After assessing 271 various SEQC datasets, we found that the Novartis dataset had nominal batch effects and the β ratio was 272 close to 2. Therefore, this dataset would be ideal, as it would not bias downstream analyses ( Figure 4A).

273
We separated Human Brain Reference (sample type B) RNA-Seq samples into two groups of 32 274 samples each, based on their y-axis coordinates of the MDS plot, and computed a running average plot.

275
Since these samples were technical replicates of the same reference RNA sample type, we expected the 276 mean log2 fold-change to be a horizontal line along the x-axis with a y-intercept equal to zero (i.e., y=0 on 277 an xy plane). Instead, we found that long genes deviated from the expected pattern, with the fold-changes 278 of long genes being overestimated ( Figure S6A, left panel).

279
We then investigated whether the fold-change of long genes is constant for the β ratio samples.

280
The expected average log2 fold-change should be a horizontal line along the x-axis with a y-intercept 281 equal to two. We found, however, that the expected ratio was not maintained for long genes and was 282 overestimated ( Figure 4B). Moreover, we observed a similar bias in the β ratio with respect to transcript 283 length, with longer transcripts being overrepresented ( Figure S6A, right panel). Overall, the range of 284 overestimation in the RNA-Seq dataset was between 3% and 40%. Consistent with our findings, another 285 study (using a different dataset) previously reported that long genes were more likely to be identified as  were expressed in brain samples ( Figure S8B). We again computed the running average plots against their 318 average gene length, and we did not observe any long gene bias between the brain samples or when 319 computing the β ratio of the samples (Figures S8C-S8D).

320
We next compared the mean expression levels of all the common genes across the RNA-Seq, 321 microarray and nCounter datasets. Our analysis shows that fold-changes of long genes are overestimated 322 in the RNA-Seq (P-value < 2.7 e-07; Figure 4E) and microarray datasets (P-value < 0.021; Figure 4F); in 323 contrast, the nCounter dataset showed no difference in the average expression of long and short genes (P-324 value = 0.86; Figure 4G). Although it is possible that the smaller number of genes (~680) might make it 325 more difficult to detect a preference, the proportion of long genes in this dataset (~180 out of ~680 genes, 326 or 26%) is twice that found in the human transcriptome (~3200 long genes out of ~ 24,000 genes, or 327 13%). Any preference for long genes should thus be revealed even more strongly in this dataset. These 328 results lead us to posit that the long gene overestimation we observed in RNA-Seq and microarray 329 datasets might be caused by a length-dependent bias in PCR amplification.

335
To understand this reciprocal relationship, we divided Human Brain Reference samples (B) into 3 groups 336 (n = 16 each) based on different library preparation ID numbers from the Novartis SEQC dataset. The 337 PCA plot clearly clustered the brain samples based on the library preparation group to which they 338 belonged ( Figure S9A). Comparing the brain samples of library preparation ID 2 (green) to library 339 preparation ID 1 (red) and ID 3 (blue) separately reversed the running average plot ( Figures S9B-S9C).

340
These results show that a reciprocal relationship can be observed in the gene expression data between any 341 groups that form three distinct clusters on a PCA plot.

342
We next assessed the influence of the fold-change threshold on differential expression analysis 343 using brain samples. Although we did not expect to see a trend between replicates, preferential regulation 344 of long genes was observed ( Figure S9D) when the fold-change was small (<10%, or log2FC ~ 13%).  Table 3) and NanoString (n = 3 each; Table 4)