We introduce a method for analyzing small interfering RNA (siRNA) genetic screens based entirely on off-target effects. Using a screen for members of the Wnt pathway, we demonstrate that this method identifies known pathway components, some of which are not present in the screening library. This technique can be applied to siRNA screen results retroactively to confirm positives and identify genes missed using conventional methods for on-target gene selection.
Large-scale siRNA screening is an increasingly popular method of interrogating signaling pathways, both to identify relevant genes and to discover novel drug targets. siRNA screens typically involve transfecting small numbers of cells with a library designed to target the entire genome or some subset (such as a set of genes coding for theoretically druggable domains)1. The systematic knockdown of each gene enables high-throughput loss-of-function analysis by assessing pathway-dependent phenotypes, measured through changes in the appearance of specific markers or induction of reporter constructs (Fig. 1a).
Unfortunately, siRNA screens have demonstrated a high false positive rate2. Researchers typically perform labor-intensive follow-up work on hundreds of hits to confirm a handful of relevant genes. Many false positives are likely due to off-target effects3,4, wherein partial complementarity between an siRNA and multiple transcripts, typically in the 3′UTR, results in their down-regulation, adding unintended silencing to the screen (Fig. 1b). Previous work on attenuating off-target effects has largely focused on identifying lower-risk sequences, introduction of chemically modified siRNAs, or use of multiple siRNA sequences in additional screens5.
In order to understand and exploit the off-target effects present in siRNA screening data, we implemented a predictive model of down-regulation due to siRNA off-targeting. Existing predictors are microRNA-related and often use conservation or other criteria not applicable to siRNA off-targeting6. We trained a simple linear model specific for siRNAs using published gene expression profiles in which off-targeting mediated by the seed (positions 2–8 of the guide strand) has been detected7. Our model for off-target seed-based down-regulation is:
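The model can be written out from the predictors defined in the next paragraph. The form below is a reconstruction under stated assumptions: the coefficient names, the length symbol $L_k$ and the presence of an intercept $\beta_0$ are illustrative choices, not notation confirmed by the source.

$$
O_{n,k} = \beta_0 + \beta_{\mathrm{PM}}\,\mathrm{PM}_{n,k} + \beta_{\mathrm{M1}}\,\mathrm{M1}_{n,k} + \beta_{\mathrm{M8}}\,\mathrm{M8}_{n,k} + \beta_{\mathrm{M18}}\,\mathrm{M18}_{n,k} + \beta_{L}\,L_k
$$

Here $\mathrm{PM}_{n,k}$, $\mathrm{M1}_{n,k}$, $\mathrm{M8}_{n,k}$ and $\mathrm{M18}_{n,k}$ are the counts of each match type between the seed of siRNA $n$ and the 3′UTR of transcript $k$, and $L_k$ is the length of that 3′UTR.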
The model includes four types of seed match, that is, reverse complementarity between the guide strand seed sequence and the 3′UTR of the transcript: PM, a perfect match to guide bases 2–8 followed by an adenine opposite base 1; M1, the same seed match with no adenine opposite base 1; M8, a mismatch opposite base 8; and M18, with both terminal mismatches. As predictive variables we calculate the number of times each match type occurs between the seed sequence of siRNA n and the 3′UTR of transcript k. In addition, we use the length of the 3′UTR as a predictive variable, since in our analyses longer 3′UTRs were correlated with up-regulation of transcripts in the absence of siRNA seed matches. Up-regulation of genes with long 3′UTRs and a lower density of siRNA matches has been previously observed in the literature8. Model parameters (β) are derived from linear regression against the mean log ratio of mRNA expression data (see Fig. 2, Supplementary Data Set 1 and Methods).
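The match-type counts can be illustrated with a short Python sketch. It is one reading of the definitions above, not the authors' Perl/R implementation: it assumes that all four types share pairing to guide bases 2–7, that the types are mutually exclusive, and that sites are counted in overlapping 8-nt windows. The function names are hypothetical.

```python
# Assumed interpretation: every site pairs with guide bases 2-7; the four
# types differ only in pairing at base 8 and in the A opposite base 1.
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A"}


def reverse_complement(rna):
    """Reverse complement of an RNA string (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def count_seed_matches(guide, utr):
    """Count PM, M1, M8 and M18 sites for one guide strand in one 3'UTR.

    Both sequences are read 5'->3'; DNA input (T) is converted to RNA (U).
    """
    guide = guide.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    core = reverse_complement(guide[1:7])  # pairs with guide bases 2-7
    pair8 = COMPLEMENT[guide[7]]           # nucleotide pairing with base 8
    counts = {"PM": 0, "M1": 0, "M8": 0, "M18": 0}
    # Window layout on the UTR: [opposite base 8][opposite 7..2][opposite 1]
    for i in range(len(utr) - 7):
        window = utr[i:i + 8]
        if window[1:7] != core:
            continue
        match8 = window[0] == pair8
        a_opposite_1 = window[7] == "A"
        if match8 and a_opposite_1:
            counts["PM"] += 1
        elif match8:
            counts["M1"] += 1   # seed match, no adenine opposite base 1
        elif a_opposite_1:
            counts["M8"] += 1   # mismatch opposite base 8
        else:
            counts["M18"] += 1  # both terminal mismatches
    return counts
```

Under these assumptions, the four counts plus the 3′UTR length form the predictor row for one siRNA and transcript pair.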
We hypothesize that off-targeting of pathway members by siRNAs can in part explain the phenotypes observed in genome-wide screens (Fig. 1c). To investigate this, we introduce Haystack, a new computational method for identifying the most statistically significant genes that explain screen results, based entirely on off-target effects. First, we predict all On,k, the off-target down-regulation of every transcript k by every siRNA n. Second, for each transcript we calculate the significance of the correlation between the predicted down-regulation by every siRNA and the siRNAs' screen results z, using a t-test. Third, the most statistically significant transcript is selected and a linear parameter ak is estimated, approximating z as a function ak * Ok + c. Finally, the residual between z and the values of z predicted by this linear model is calculated, and the next transcript is selected according to the significance of the correlation of each remaining Ok with the residual. In this stepwise manner, the most statistically significant transcripts are selected and added iteratively as features to a linear model, until no transcript has a Bonferroni-corrected correlation p-value less than 0.01. The final model can be viewed as predicting the phenotypic score zn associated with siRNA n as a linear combination of the predicted off-target effects of that siRNA on a set of transcripts K (with some constant intercept term c):
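In symbols, following the description above (the approximation sign and the set-membership notation are assumed conventions):

$$
z_n \approx c + \sum_{k \in K} a_k\, O_{n,k}
$$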
We assess the transcripts selected based on ak, the magnitude of the off-target effects on transcript k in explaining the screening results. The directionality of ak indicates the effect (either positive or negative) that down-regulation of each transcript has on the assay readout.
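The stepwise selection can be sketched in a few lines of Python. This is a minimal illustration, not the released Haystack code. It makes two assumptions: each new coefficient is fitted to the current residual rather than by refitting the full model, and the two-sided p-value uses a normal approximation to Student's t-distribution, which is reasonable only for large numbers of siRNAs. All function names are hypothetical.

```python
import math
import random


def pearson_t(x, y):
    """t-statistic for the Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    r = sxy / math.sqrt(sxx * syy)
    r = max(-0.999999999999, min(0.999999999999, r))  # avoid division by zero
    return r * math.sqrt((n - 2) / (1 - r * r))


def two_sided_p(t):
    """Assumption: normal approximation to the t-distribution (large n)."""
    return math.erfc(abs(t) / math.sqrt(2))


def haystack_select(z, offtarget, alpha=0.01):
    """Greedy selection of transcripts whose predicted off-target
    down-regulation best explains the screen scores z.

    offtarget maps a transcript name to its list of predicted O values,
    one per siRNA, in the same order as z.
    Returns a list of (transcript, coefficient a_k, uncorrected p-value).
    """
    residual = list(z)
    remaining = dict(offtarget)
    n_tests = len(offtarget)  # Bonferroni factor
    selected = []
    while remaining:
        best = max(remaining,
                   key=lambda k: abs(pearson_t(remaining[k], residual)))
        o = remaining.pop(best)
        p = two_sided_p(pearson_t(o, residual))
        if p * n_tests >= alpha:
            break
        n = len(o)
        mo, mr = sum(o) / n, sum(residual) / n
        a = (sum((x - mo) * (y - mr) for x, y in zip(o, residual))
             / sum((x - mo) ** 2 for x in o))
        c = mr - a * mo
        residual = [y - (a * x + c) for x, y in zip(o, residual)]
        selected.append((best, a, p))
    return selected


def simulate_screen(n=300, seed=0):
    """Toy data: scores depend on transcripts A and B, not on C."""
    rng = random.Random(seed)
    o = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("A", "B", "C")}
    z = [2.0 * a - 1.0 * b + rng.gauss(0, 0.1) for a, b in zip(o["A"], o["B"])]
    return z, o
```

On the toy data from `simulate_screen`, the strongest contributor is picked first and the sign of each fitted coefficient follows the sign used to generate the scores, mirroring how the direction of ak is read in the text.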
We applied Haystack to 19,815 siRNAs used in screening 6,605 theoretically druggable genes for activity in the Wnt/β-catenin signaling pathway (Supplementary Data Set 2). The Wnt/β-catenin pathway is constitutively active in many human cancers. To screen for novel factors in the Wnt/β-catenin pathway, HT1080 sarcoma cells were engineered to contain a firefly luciferase reporter coupled to a β-catenin-driven promoter, activated in the screen by conditioned media containing Wnt-3a. A control EF1α-driven Renilla luciferase reporter was used for normalization. Three siRNAs per gene were transfected individually into the reporter cell line in three separate screens. We calculated z-scores for the siRNAs from the log ratio of reporter intensities.
Table 1 lists, ordered by p-value, the top 10 genes included in the model built via Haystack from the three siRNA screens combined. Predicted activities per gene correlated well between screens when analyzed separately (Fig. 3). Supplementary Table 1 contains all 61 hits identified. In the case of the Wnt pathway, a large number of “canonical” pathway members have been previously identified. To measure pathway enrichment in screening results, we used 158 Wnt-related genes from the KEGG pathway database9. Of the top 10 most statistically significant transcripts, 6 (LEF1, AXIN2, CCND1, RAC1, CTNNB1 and LRP6) are in the KEGG list of Wnt pathway genes, a remarkable level of enrichment (P < 3.7e-11; hypergeometric test for enrichment).
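The enrichment test can be illustrated with a short Python sketch of the upper-tail hypergeometric probability. The background population size is an assumption: the text does not state it, and the sketch uses the 20,692 3′UTR sequences in the off-target matrix described in Methods, treating all 158 KEGG genes as members of that population.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement
    from `population` items, of which `successes` are pathway members."""
    upper = min(draws, successes)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, draws)


# Assumed background: 20,692 transcripts, 158 of them KEGG Wnt genes;
# 6 of the top 10 hits are KEGG Wnt genes.
p_enrichment = hypergeom_tail(20692, 158, 10, 6)
print(f"P(X >= 6) = {p_enrichment:.2e}")
```

Under that assumed background the tail probability is of the same order of magnitude as the reported bound; a different background (for example, only the genes represented in the library) would change the value.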
We reviewed the literature on the genes present in Table 1 but not listed in KEGG to check for other associations with the Wnt pathway. SUMO2, the third most statistically significant hit, has been identified as sumoylating and repressing LEF1 (ref. 10). Consistent with that role, its parameter estimate in the regression indicates that down-regulation of SUMO2 increases reporter activity. Another hit (Supplementary Data Set 3), SENP7, has been shown to reverse sumoylation by SUMO2/3 (ref. 11), and accordingly the regression analysis shows that its down-regulation decreases Wnt reporter activity. Finally, the transcription factor POU4F2 (BRN3B) is involved in both cell development12 and cancer13, as is the Wnt pathway, making a role for POU4F2 in Wnt signaling seem plausible.
It is worth noting that three transcripts identified in the top 10 hits (RAC1, CLEC18B and POU4F2) did not have siRNAs designed against them in the druggable library screened here. Although RAC1 is not included in the library, it is found by Haystack analysis and is a canonical member of the Wnt pathway. Since off-target effects span the transcriptome, Haystack can detect biologically relevant genes for which no siRNA exists in our screening library.
Haystack analysis can be applied to any previous siRNA screen of sufficient scale, preferably with normally distributed scores, to confirm hits via an orthogonal metric or to identify genes originally missed due to library composition. Unfortunately, it is not yet standard for raw data from siRNA screens to be published, as microarray expression profiling data are. Consequently, Haystack is available as a download (http://rnai.nih.gov/haystack/Software.zip) or web-based service (http://rnai.nih.gov/haystack/) for researchers who have performed screens to analyze their data and confirm hits or find new hits potentially missed by “on-target” analysis.
Finally, although a genome-specific siRNA library would be expected to have relatively poor on-target performance against cells from another organism, a large-scale library would likely have the seed sequence diversity required to perform Haystack analysis on any organism that has an RNA interference pathway. Thus Haystack also presents a novel solution for siRNA screens in cells from organisms for which no library is yet available.
Methods
Statistical analyses and graphs were generated using R (ref. 14). ROC curves were generated using the ROCR package15. The kernel density plot of 3′UTR length distributions was generated using the sm package16.
Human 3′UTR sequences (34,268 in total) from the RefSeq17 collection were downloaded using the UCSC Table Browser18. These sequences were matched against the IDs for the gene expression data sets for the purpose of training the off-target model. To create the matrix of predicted off-target effects, 170 sequences shorter than 20 nucleotides and 2,554 non-coding sequences (NR prefix in accession) were removed. In addition, 10,852 sequences were removed because they were identical to a 3′UTR already included. The remaining 20,692 3′UTR sequences were used to create the off-target matrix.
Gene expression data
Expression data previously shown to be enriched with transcripts that were down-regulated and contained seed matches to transfected siRNAs in their 3′ UTRs7 were downloaded from the GEO website19 and used in training and testing the siRNA off-target model.
Training and testing the off-target model
Features for predicting siRNA off-target down-regulation were selected based on the previously observed importance of the seed sequence7,20 (bases 2–8 of the guide strand) and the observed preference for adenine opposite base 1 of the guide strand21. Although all possible match types to the seed, including one- or two-base mismatches, were considered, only four match types (PM, M1, M8, M18) were statistically significant predictors of down-regulation. The reverse complement of the guide strand seed sequence followed by an adenine forms a perfect match (PM); together with sites carrying a mismatch at base 1 (M1), base 8 (M8), or both (M18), this gives the four seed match types selected. In addition, we used the length of the 3′UTR as a predictive variable, since in our analyses longer 3′UTRs were correlated with up-regulation of transcripts in the absence of siRNA seed matches. Linear models predicting the down-regulation of a transcript from the number of each match type observed in the 3′UTR and the total length of the 3′UTR were trained separately on gene expression data from the transfection of two different siRNAs. These models were then cross-validated on the data not used to train them, and finally the data sets (Supplementary Data Set 1) were combined to train a final model (Fig. 2).
The custom siRNA libraries used in screening were synthesized by Sigma–Proligo. siRNAs were designed to have sequence asymmetry to increase efficacy and had fewer than 17 bp of complementarity to other genes22. The library contains 6,605 sets of 3 siRNAs, each set designed against a theoretically druggable gene.
siRNA screening data
Data used to evaluate these methods came from a genome-scale siRNA screen23,24 in the HT1080 sarcoma cell line for genes involved in the Wnt/β-catenin pathway. Cells were engineered to contain a luciferase reporter coupled to a β-catenin-driven promoter, which was subsequently activated with conditioned media containing Wnt-3a. In primary screening, both a whole genome and a druggable library were screened in pools with at least three replicates. In addition, all single siRNAs contained within the druggable pools were assayed separately. All assay results were normalized to a plate-based non-targeted control to yield a z-score.
Correlation and statistical significance calculations
The correlation between the screen results z (or, in later steps, the residual) and the predicted off-target effects of the library on a given transcript was calculated using the formula:
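In standard Pearson form, with $r_k$, $\bar{z}$, $\bar{O}_k$ and the total count $N$ as assumed notation for the coefficient, the two sample means and the number of siRNAs:

$$
r_k = \frac{\sum_{n=1}^{N} (z_n - \bar{z})\,(O_{n,k} - \bar{O}_k)}{\sqrt{\sum_{n=1}^{N} (z_n - \bar{z})^2 \; \sum_{n=1}^{N} (O_{n,k} - \bar{O}_k)^2}}
$$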
where On,k is our prediction of the off-target down-regulation of each mRNA transcript k by each siRNA n. The t-statistic of the null hypothesis that the correlation is actually zero is then calculated as:
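Using the usual test statistic for a Pearson coefficient (the symbols $t_k$ and $r_k$ are assumed notation, and $N$ is the number of screening results, the count the text calls n when giving the degrees of freedom):

$$
t_k = r_k \sqrt{\frac{N - 2}{1 - r_k^{2}}}
$$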
with the associated p-value of the null hypothesis being derived from the Student's t-distribution with n-2 degrees of freedom, where n is the number of siRNA screening results.
Since the off-target model makes the same predictions for every siRNA with the same heptamer seed sequence (bases 2–8 of the guide strand) and there are only 16,384 possible heptamers, we can pre-compute the predictions for each 3′UTR. With this simplification, we do not need a separate set of predictions for each siRNA library. To use this pre-computed matrix to calculate Haystack t-statistics for a set of single siRNAs used in a screen, we first calculate the mean assay response for all siRNAs that contain the same seed sequence. We then calculate the weighted correlation coefficient and corresponding t-statistic between the predicted down-regulation caused by each seed sequence and the mean observed response of that seed sequence in the assay (or the residual in later steps), where the weights are the number of occurrences of each seed sequence in the library. A weighted linear regression is performed at each iteration between the selected Ok and the mean assay response for each seed. Likewise, Haystack analysis can be applied to siRNA screens performed using pools by calculating the mean assay response for all pools in which one of the component siRNAs has a given seed sequence, and then weighting the correlation coefficient and linear regression by the number of pools in the screen containing that seed sequence. Perl and R code for pre-computing off-target predictions from a FASTA-formatted file of 3′UTRs and for calculating these t-statistics from mean seed sequence responses is available from the Haystack website (http://rnai.nih.gov/haystack/Software.zip).
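The seed-collapsing step and the weighted correlation can be sketched in Python as follows. This is an illustration under stated assumptions, not the distributed Perl/R code: the function names are hypothetical, guide strands are assumed to be given 5′ to 3′, and the weighted coefficient is the standard weighted Pearson correlation with the seed counts as weights.

```python
import math
from collections import defaultdict

# Four bases at seven positions give the heptamer count quoted in the text.
N_HEPTAMERS = 4 ** 7


def seed_of(guide):
    """Heptamer seed: bases 2-8 of the guide strand (0-based slice 1:8)."""
    return guide.upper().replace("T", "U")[1:8]


def collapse_by_seed(guides, scores):
    """Map each seed to (mean assay response, number of siRNAs with that seed)."""
    groups = defaultdict(list)
    for guide, score in zip(guides, scores):
        groups[seed_of(guide)].append(score)
    return {seed: (sum(v) / len(v), len(v)) for seed, v in groups.items()}


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)
```

With integer weights, the weighted coefficient equals the ordinary coefficient computed after repeating each seed's values as many times as the seed occurs, which is why one pre-computed matrix indexed by seed can serve any library.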
Validation of statistical techniques
The validity of using Pearson's correlation coefficient, the t-statistic and the corresponding p-value to estimate the statistical significance of non-normally distributed vectors has been a subject of some debate25 and seems to depend on the size of the vectors and the particular characteristics of the distributions considered. Although the z-scores generated from this set of screening data are fairly normal in their distribution, the predictions of off-target effects to which we are comparing them are non-normal. Therefore, we tested how well Student's t-distribution approximates the distribution expected at random.
Z-scores from the druggable singles collection were randomly permuted against their original heptamers. Fig. 4a shows a q-q plot of the t-statistics resulting from the random permutation (in red) and the distribution observed for the original non-permuted data (in blue) compared to Student's t-distribution with 10,029 degrees of freedom (the length of the vectors for which the null hypothesis of zero correlation is being tested, minus 2). As expected, the quantiles from our randomly permuted data track well against Student's t-distribution, while those from the non-permuted data do not.
We next performed 1,000 random permutations of the same data set and noted transcripts for which the Bonferroni-corrected p-value (calculated from the t-statistic) was less than 0.01. Fig. 4b shows the distribution of false positives observed in the 1,000 trials. Although the majority of permutations did not result in any false positives, 20.3% of the permutations resulted in one or more false positives, which is significantly greater than the 1% that we would expect if Student's t-distribution perfectly modeled the null hypothesis we are testing. The mean number of false positives per random permutation was 0.263. Compared with the 1,217 putative positives observed for the original, non-permuted data, we judge the risk of false positives to be small and the error in this approximation to be acceptable. The alternative would be to use non-parametric statistics or random permutation to estimate p-values, either of which would significantly reduce our sensitivity and increase the already significant computation time required.
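The permutation check can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the analysis code: it uses unweighted correlations on per-siRNA vectors rather than the seed-weighted version, a normal approximation to Student's t-distribution for the p-value, and hypothetical function names.

```python
import math
import random


def pearson_t(x, y):
    """t-statistic for the Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    r = sxy / math.sqrt(sxx * syy)
    r = max(-0.999999999999, min(0.999999999999, r))
    return r * math.sqrt((n - 2) / (1 - r * r))


def count_bonferroni_hits(z, offtarget, alpha=0.01):
    """Number of transcripts whose Bonferroni-corrected p-value is below alpha."""
    n_tests = len(offtarget)
    hits = 0
    for o in offtarget.values():
        # Assumption: normal approximation to the t-distribution (large n).
        p = math.erfc(abs(pearson_t(o, z)) / math.sqrt(2))
        if p * n_tests < alpha:
            hits += 1
    return hits


def permutation_false_positives(z, offtarget, n_perm=1000, alpha=0.01, seed=0):
    """Shuffle the scores against the predictions n_perm times and return
    the number of (false) positives found in each permutation."""
    rng = random.Random(seed)
    shuffled = list(z)
    counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        counts.append(count_bonferroni_hits(shuffled, offtarget, alpha))
    return counts
```

The fraction of permutations with a non-zero count and the mean count per permutation correspond to the two summary figures quoted above; comparing them with the nominal 1% shows how far the parametric p-values drift from the permutation null.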
To further investigate the source of false positives, we plotted the distribution of 3′UTR length for the false positives identified in the 1,000 random permutations against that of all transcripts analyzed and that of the transcripts identified as positives in the non-permuted data (Fig. 4c). There is a significant shift towards shorter 3′UTRs in the set of false positives compared to the general population. This may indicate that the approximation by Student's t-distribution is slightly less valid for transcripts with shorter 3′UTRs. Conversely, transcripts identified as positives in the non-permuted set are shifted towards longer 3′UTRs than the set of all 3′UTRs. We believe this may be because true positives with short 3′UTRs have a smaller “off-target signature”, being down-regulated off-target by fewer siRNAs and thus presenting less signal for detection by the method. It is also possible that short 3′UTRs represent incorrect or incomplete sequences. Thus we conclude that transcripts with short 3′UTRs may be more likely to be both false positives and false negatives in Haystack analysis.
References
Mohr, S., Bakal, C. & Perrimon, N. Genomic Screening with RNAi: Results and Challenges. Annual Review of Biochemistry 79, 37–64 (2010).
Echeverri, C. J. et al. Minimizing the risk of reporting false positives in large-scale RNAi screens. Nat Methods 3, 777–779 (2006).
Birmingham, A. et al. 3′ UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nat Methods 3, 199–204 (2006).
Ma, Y., Creanga, A., Lum, L. & Beachy, P. A. Prevalence of off-target effects in Drosophila RNA interference screens. Nature 443, 359–363 (2006).
Jackson, A. L. & Linsley, P. S. Recognizing and avoiding siRNA off-target effects for target identification and therapeutic application. Nature Reviews Drug Discovery 9, 57–67 (2010).
Hammell, M. Computational methods to identify miRNA targets. Seminars in Cell & Developmental Biology 21, 738–744 (2010).
Jackson, A. L. et al. Widespread siRNA “off-target” transcript silencing mediated by seed region sequence complementarity. RNA 12, 1179–1187 (2006).
Khan, A. A. et al. Transfection of small RNAs globally perturbs gene regulation by endogenous microRNAs. Nat Biotechnol 27, 549–555 (2009).
Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M. & Hirakawa, M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research 38, D355–D360 (2010).
Sachdev, S. et al. PIASy, a nuclear matrix-associated SUMO E3 ligase, represses LEF1 activity by sequestration into nuclear bodies. Genes Dev 15, 3088–3103 (2001).
Shen, L. N., Geoffroy, M. C., Jaffray, E. G. & Hay, R. T. Characterization of SENP7, a SUMO-2/3-specific isopeptidase. Biochem J 421, 223–230 (2009).
Xiang, M. et al. Role of the Brn-3 family of POU-domain genes in the development of the auditory/vestibular, somatosensory and visual systems. Cold Spring Harb Symp Quant Biol 62, 325–336 (1997).
Budhram-Mahadeo, V. S. & Latchman, D. S. Targeting Brn-3b in breast cancer therapy. Expert Opin Ther Targets 10, 15–25 (2006).
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing (2008).
Sing, T., Sander, O., Beerenwinkel, N. & Lengauer, T. ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941 (2005).
Bowman, A. W. & Azzalini, A. Computational aspects of nonparametric smoothing with illustrations from the sm library. Computational Statistics & Data Analysis 42, 545–560 (2003).
Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35, D61–65 (2007).
Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res 32, D493–496 (2004).
Barrett, T. et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Research 37, D885–D890 (2009).
Lin, X. et al. siRNA-mediated off-target gene silencing triggered by a 7 nt complementation. Nucleic Acids Research 33, 4527–4535, 10.1093/nar/gki762 (2005).
Lewis, B. P., Burge, C. B. & Bartel, D. P. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15–20 (2005).
Majercak, J. et al. LRRTM3 promotes processing of amyloid-precursor protein by BACE1 and is a positional candidate gene for late-onset Alzheimer's disease. Proc Natl Acad Sci U S A 103, 17967–17972 (2006).
Major, M. B. et al. New regulators of Wnt/beta-catenin signaling revealed by integrative molecular screening. Sci Signal 1, ra12 (2008).
Tang, W. et al. A genome-wide RNAi screen for Wnt/beta-catenin pathway components identifies unexpected roles for TCF transcription factors in cancer. Proc Natl Acad Sci U S A 105, 9697–9702 (2008).
Kowalski, C. J. On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient. Journal of the Royal Statistical Society. Series C (Applied Statistics) 21, 1–12 (1972).
Acknowledgements
We thank Christina Leslie of Sloan-Kettering Institute and Steven Bartz, Peter Linsley and Alan Sachs of Merck & Co., Inc. for their guidance, encouragement and support.
The authors declare no competing financial interests.
Cite this article
Buehler, E., Khan, A., Marine, S. et al. siRNA off-target effects in genome-wide screens identify signaling pathway members. Sci Rep 2, 428 (2012). https://doi.org/10.1038/srep00428