## Introduction

Here, we combine CRISPR/Cas9 genome engineering, next-generation sequencing, and machine learning to decipher cis-regulatory RNA sequence and structural elements that affect ADAR-mediated RNA editing. As proof-of-concept, we choose three representative RNA editing substrates and introduce hundreds of mutations at the endogenous loci in human cells, using the CRISPR-mediated approach. We use supervised machine learning to build predictive models of substrate-specific RNA editing levels based on a variety of cis-sequence and structural features. We identify highly edited structures different from wild-type (WT) structure (referred to as alternative structures), and general, as well as idiosyncratic features that determine editing efficiency of individual substrates, highlighting the complexity of the cis-regulatory editing code. Our integrative approach, named predicting RNA editing using sequence and structure (PREUSS), lays the foundation for developing predictive models of RNA editing.

## Results

### CRISPR/Cas9-mediated mutagenesis to interrogate endogenous RNA editing

To interrogate the effects of cis-regulatory elements of RNA editing, we applied the CRISPR/Cas9 technology (Fig. 1a) to introduce mutations at the endogenous loci of three natural ADAR1 substrates (NEIL1, TTYH2, and AJUBA; Fig. 1b, see “Methods”). The mutations were introduced both in the strand containing the editing site (“editing strand”) and in the complementary sequence involved in forming the secondary structure, which we refer to as editing complementary sequence (ECS). Briefly, we designed CRISPR guide RNAs (gRNAs) targeting the regions of interest, as well as oligonucleotide donors carrying mutations to direct knock-in (KI) mutations through the CRISPR/Cas9-mediated homology-directed repair (HDR) pathway20. To measure the RNA editing levels of the resulting variants, we performed targeted amplicon deep sequencing. Because the variant and the associated editing site are in the same transcript, there is no need to perform laborious clonal selection for homozygotes of the variants. Because each designed variant has a unique sequence, we successfully performed large-scale multiplex mutagenesis and measured the editing levels without the aid of barcodes (see “Methods” for details).

In a pilot experiment to introduce one or more mutations near the editing site, we used a single degenerate donor oligonucleotide with mutations at each position to randomize the region from −3 to +3 positions of the editing strand for NEIL1 (Fig. 1c) and a 10 nt region on the ECS of TTYH2 (Supplementary Fig. 1a). These random mutations provide a rapid means to evaluate the CRISPR/Cas9 KI efficiency and the effects of mutations. We observed that three or more mutations almost always lead to an abolishment of editing events (Fig. 1d, e and Supplementary Fig. 1b). Therefore, to generate variants that lead to a wide range of editing levels, we next performed targeted mutagenesis using a pool of 200–300 donor oligonucleotides with designed mutations, focusing on single and double mutations with larger mutagenesis regions around the editing site in the editing strand and the ECS (Figs. 1b and 2a, and Supplementary Fig. 3a–c). For NEIL1 and TTYH2, we designed all possible single mutations both in the editing strand and the ECS (with the exception of –1 and +1 positions for NEIL1 where A-to-G mutations, would be indistinguishable from A-to-I editing). For AJUBA, we only designed mutations in the editing strand due to its long-range ECS. We selectively designed a subset of double-transversion mutations (11% all possible double mutations) that theoretically disrupt the original base-pairing at the mutation position. For NEIL1, we also introduced compensatory mutation variants that theoretically maintain base-pairing. In addition, we designed indel variants to study the effects of selected secondary structure features of NEIL1, such as bulges, internal loops, and stem length (RNA sequences and editing levels see Supplementary Data 7).

Overall, we achieved high KI efficiency as 10–20% of the sequenced RNAs at the target locus carried mutations, similar to previous reports for a similar approach20. We were able to reliably detect >90% of our designed variants after using stringent quality control filters. The KI results and editing measurements were highly reproducible (Fig. 1f and Supplementary Fig. 1c–f). Interestingly, we discovered that similar KI efficiency and editing results were achieved when using ssDNA oligonucleotides or dsDNA (e.g., PCR product) as the donor for CRISPR-mediated HDR (Supplementary Fig. 1g–j). Using dsDNA PCR products greatly simplified the procedures and reduced the cost of the experiments. The coverage of RT-PCR product for each variant was generally well correlated with the corresponding coverage of the product amplified from gDNA (R2 = 0.87 for NEIL1 and R2 = 0.25 for TTYH2; Supplementary Fig. 2a, d), suggesting that the RNA abundance is generally not affected by the introduced variants. There is no correlation between the RNA or gDNA coverage with the editing level for all three substrates, which is consistent with previous reports21,22 that argues against potential influence of the substrate expression level on the editing level (Supplementary Fig. 2b,c, e–g).

### Intertwined effects of primary sequence and secondary structure on editing levels

We compared the effects of single and double mutations in terms of the type (transition or transversion) and location of the mutation across all three RNA substrates (Fig. 2a and Supplementary Fig. 3a–c). We used the computationally predicted minimum free energy (MFE) secondary structure of each RNA variant to dissect the associations between mutations and structure. The mutational effects are summarized for each RNA substrate below.

#### NEIL1

For NEIL1, most single mutations led to minor decreases in editing (−1 < Z-score < 0), with the largest effects observed at positions +1 and +2 relative to the editing site (Figs. 2c and 3b). Exceptions from this pattern were the large effects (Z-score < −1) of G mutations downstream from the editing site in the ECS strand. An example of how a G mutant changes secondary structure is shown in Supplementary Fig. 4d. Some RNA variants have the same predicted RNA structure but different editing level, most simply suggesting a primary sequence effect (Supplementary Fig. 4e). To decouple sequence and structural effects at each position, we considered six categories of single mutations. The simple “transition” (i.e., purine to purine, pyrimidine to pyrimidine) and “transversion” (pyrimidine to purine and vice versa) categories indicate that the RNA structure was unchanged compared to the WT. “Transition + break” and “transversion + break” categories indicate that the mutation disrupted the base-pair at the mutation site, and the “transition + shift” and “transversion + shift” categories include all other scenarios, such as the formation of a new base pair or disruption of more than one base pair (Fig. 3a). We observed moderate effects (Z-score < −1) for the transversion mutations that also disrupted base-pairing (transversion + break) or caused other structural changes (transversion + shift) at positions in close vicinity to the editing site (−5, −1, +1 to +3, +9), and at the 3′ side of the editing site (*+1 to *+8). Double-transversion mutations in the editing strand of NEIL1 had overall pronounced effects on editing (Fig. 2d), and the strongest effects (Z-score < −2) were observed when at least one of the mutations was in close proximity to the editing site (positions –6 and +2). The effect was generally smaller (Z-score > −2) if one of the mutations was in a non-base-paired region (+4, +5, +11; Fig. 2d).

#### TTYH2

In contrast to NEIL1, where the vast majority of mutations decreased editing, a large proportion of TTYH2 single mutations increased the editing efficiency (Z-score > 0; Supplementary Figs. 3a, b and 4a). This difference may be explained by the lower WT editing level for TTYH2 (0.31) than for NEIL1 (0.66). Similar to NEIL1, single mutations closer to the editing site (−2 to +1, *−3 to *+4) tended to have larger negative effects on editing levels (Z-score < −1; Supplementary Fig. 4a). Interestingly, several single mutations located both upstream (–6 to −3) and downstream (+6 to +8, *+6) of the editing site increased editing levels (Z-score > 0). For TTYH2 double mutations, the effects were most negative when at least one mutation was located around the editing site (–2 to +6 in the editing strand and *−3 to *+5 in the ECS; Supplementary Fig. 3a, b).

#### AJUBA

We only examined the mutations on the editing strand of AJUBA. In contrast to NEIL1 and TTYH2, many single mutations of AJUBA were sufficient to disrupt editing. Also, all double mutations abolished editing regardless of positions (Supplementary Fig. 3c). Most of the “transition + break”, “transition + shift”, and the “transversion + shift” single mutations had large effects (Z-score < −2; Supplementary Fig. 3b). Many AJUBA single mutations have much larger effects (Z-score < −3) than single mutations in NEIL1 and TTYH2. This difference may be explained by the long distance (524 nt) between AJUBA editing strand and ECS, such that mutations may lower the probability of forming this long-range structure relative to alternative proximal structures; alternatively, primary sequence might have a larger influence on editing of the AJUBA RNA.

Taken together, these results are consistent with previous observations of intertwined sequence and structure effects on editing. These effects also vary among the three different RNA substrates, suggesting substrate-specific cis-regulation rules.

### RNA structural features affect editing levels

Next, we systematically explored the effects of changes to the RNA secondary structure on editing levels. We found that compensatory double mutations in NEIL1 that did not affect secondary structure resulted in only minor reduction of editing levels (Fig. 4a–c). To investigate how a specific structural change affects editing efficiency, we designed several indels that change the predicted secondary structure of NEIL1 (Fig. 4d). Shortening the 5′ stem or breaking base-pairs within this stem abolished editing, suggesting the importance of this region for editing; increasing the stem length by 2-bp did not increase editing efficiency (Fig. 4d and Supplementary Fig. 4f). The 3′ base-pairing is also critical because breaking it led to nearly complete disruption of editing (Supplementary Fig. 4f). When we replaced the downstream 3′ internal loop with either a canonical base-pair or wobble base-pair, the editing efficiency decreased by 50% (Z-score = −1), suggesting the importance of this structural feature (Fig. 4d and Supplementary Fig. 4f). Enlarging the loop with additional nucleotides resulted in mild (−1 < Z-score < 0) reductions in editing levels (Fig. 4d). As expected, editing site structures containing an A:C mismatch (1:1 internal loop) exhibited higher editing levels on average than when the editing site resided in a larger loop (P < 0.0001 by Wilcoxon test, Supplementary Fig. 4g). However, several editing site structures harboring non-A:C mismatches also showed strong editing levels for NEIL1 and TTYH2 (Supplementary Fig. 3d, e), indicating that additional factors affect editing efficiency.

We reasoned that structural and thermodynamic features affecting RNA stability could also affect editing efficiency15. We observed significantly greater predicted structural stability in highly (highest 25 percentile of editing level in each RNA library) compared to lowly (lowest 25 percentile) edited NEIL1 (P < 0.0001) and AJUBA (P < 0.001) variants, based on both the MFE structure (Fig. 5a) and the predicted structural ensemble (Fig. 5b). We also observed significantly higher MFE frequency for NEIL1 (P < 0.01) and TTYH2 (P < 0.0001; Fig. 5c) and lower ensemble diversity for NEIL1 (P < 0.0001; Fig. 5d) in highest edited quantile. The same observation held when stability was approximated by the number of base-pairs formed (Fig. 5e).

We hypothesized that RNA variants that are structurally more similar to the WT would result in editing levels similar to WT. We quantified structural similarity using two measures: the probability of active conformation, which indicates the probability of forming WT-like secondary structure in the predicted structural ensemble for each RNA variant (Fig. 5f) and a similarity score that indicates the degree to which the MFE structure of a variant is similar to the MFE structure of WT (scale is 0–1, from least similar to identical structure, Fig. 5g). A higher probability of active conformation was observed in the highly edited variants compared to lowly edited variants for all three substrates (P < 0.0001, Fig. 5f). We found significant differences (P < 0.0001) of similarity score between highly and lowly editing variants for NEIL1 and AJUBA (Fig. 5g).

However, when we considered all variants across the entire editing spectrum, instead of highly versus lowly edited variants, no significant correlations were observed between individual features and RNA editing levels (Supplementary Figs. 5 and 6). These results show that individual sequence, structure, and stability features of variants can explain the differences between highest and lowest edited substrates, but only have limited predictive association with quantitative editing levels. Therefore, we decided to carry out integrative analyses of RNA sequence and structure features to model quantitative editing levels.

### RNA clustering reveals alternative structures that support efficient editing

Given that no single property of the RNA substrates correlated strongly and consistently with editing efficiency (Supplementary Figs. 5 and 6), we used machine learning to dissect the collective effects of different features on editing. First, we performed a hierarchical clustering analysis based on variant sequence and structure. We clustered the NEIL1 and TTYH2 libraries because the editing levels are widely distributed compared to the AJBUA results. We used the locARNA pipeline23,24 that takes into account both the sequence and the MFE structure (Fig. 6a). Because the sequence variation is relatively small, the similarity and difference among the MFE structures was weighted highly for the resulting hierarchical clustering. The resulting clusters of RNA variants generally share a similar core structure and show similar editing levels (Fig. 6b and Supplementary Figs. 7–9). Interestingly, we found clusters of RNA with predicted structures distinct from WT (alternative structures) that are edited with near-WT efficiencies, both for NEIL1 (e.g., clusters 2, 4, and 8 in Fig. 6), and for TTYH2 (e.g., clusters 3, 2, 7, and 8 in Supplementary Figs. 8 and 9). As an example, positioning of the NEIL1 editing site in an asymmetric 2:3 internal loop (cluster 8 in Fig. 6b) instead of the 1:1 A:C loop seen in WT (cluster 1 in Fig. 6b) appears to maintain, and even enhance, the editing efficiency. In contrast, a 1:2 internal loop in cluster 7 (Fig. 6b) with similar downstream structures as cluster 8 is mostly poorly edited. Cross-cluster comparisons further illuminate the contributions of certain structural features to editing. For example, comparing NEIL1 cluster 5 with cluster 1 (Fig. 6) suggests a negative effect of a bulge in the 5′ stem. While NEIL1 prefers a good 5′ stem structure, the TTYH2 can tolerate symmetric internal loops in the 5′ stem (Supplementary Figs. 8 and 9).

### Machine learning models accurately predict substrate-specific RNA editing levels from sequence and structure features

To quantitatively capture the complex relationship between editing levels and multidimensional RNA sequence and structure features, we turned to machine learning models. A set of 125 features were derived to annotate the RNA variants (see “Methods” and Supplementary Datas 14 for feature annotations for all variants of NEIL1, TTYH2, and AJUBA, respectively). The sequence features summarize various properties of the primary RNA sequence of each variant at and around the vicinity of the editing site where the mutations were made. We used the bpRNA25 tool to assign all residues in each variant to diverse structural elements, such as hairpin loops, bulges, internal loops, stems, multi-loops and closing pairs (Fig. 7a). We chose to featurize the bpRNA structural annotations at the editing site and adjacent regions (up to 3 bpRNA structural elements upstream and downstream from the editing site structure element, detailed in Supplementary Data 1; Fig. 7b), as these regions within the RNA substrate fully encompass the interaction site with the ADAR deaminase domain (Fig. 7c)26,27. The 125 features were further grouped into nine major categories for purposes of feature interpretation (Supplementary Data 1). Gradient boosted trees (GBTs) were trained via the XGBoost algorithm28. We trained and tuned GBTs on distinct subsets of RNA variants to map their feature annotations to corresponding real-valued editing levels or binarized labels obtained by thresholding editing levels into two classes (edited versus not edited).

First, we evaluated the prediction performance of our model for each substrate. We trained and tuned models on a subset of variants and then tested model performance on a held-out test set of variants of the same substrate. For NEIL1, the models accounted for 85.6% of the variance (R2) in ADAR editing levels for variants in the held-out test set, with a Spearman correlation (Rs) of 0.92 between observed and predicted editing levels. Binary editing status was also predicted accurately (area under precision-recall curve, auPR = 0.97). Similarly, high test set predictive performance was obtained for TTYH2 variants (R2 = 0.68, Rs = 0.91, auPR = 0.81) and AJUBA variants (R2 = 0.79, Rs = 0.90, auPR = 0.93). Augmenting the training set for each substrate with variants from the other substrates did not result in any significant improvements in model performance (Supplementary Data 5 and Supplementary Fig. 10). These results indicate that it is possible to predict RNA editing levels of new mutations in a substrate with high accuracy from sequence and structure features, using integrative machine learning models trained on a subset of mutations from the same substrate (Fig. 7d).

Next, we tested whether models trained on variants of one or more substrates could predict editing effects of mutations in a different substrate. We observed a significant drop in model performance for cross-substrate prediction of RNA editing (Supplementary Fig. 11,). For example, a model trained on NEIL1 variants yielded lower performance on AJUBA variants (R2 < 0.05, Rs = 0.68, auPR = 0.69) and TTYH2 variants (R2 < 0.05, Rs = 0.46, auPR = 0.59), as compared to a model trained and tested on NEIL1 variants (R2 = 0.88, Rs = 0.93, auPR = 0.97). Similarly, a model trained on all NEIL1 and TTYH2 variants also yielded lower predictive performance when tested on AJUBA variants (R2 < 0.05, Rs = 0.66, auPR = 0.29), albeit higher than the model trained on either substrate independently. The same held true for all models trained on two of the substrates and evaluated on the third—lower performance compared to within-substrate training and evaluation.

The inability of our current models to accurately generalize predictions to new substrates is not entirely surprising considering the diversity of the substrates and the small number (three) of distinct substrates available for model training. It is likely that the challenge of cross-substrate training may be solved by training on a larger number of variants from diverse substrates, and future efforts will focus on this task. However, given the success of our substrate-specific models in predicting editing effects for unseen mutations within each substrate, we decided to interpret these models to investigate the features that may be predictive of RNA editing levels.

### Model interpretation provides insights into common and substrate-specific features associated with RNA editing efficiency

For each of the three substrate-specific models, we used the TreeExplainer SHAP (SHapley Additive exPlanations) algorithm to quantify the contributions (or importance) of all features to the RNA editing predictions of each variant in the test sets29. The SHAP importance score of a feature with a specific value for a variant of a substrate estimates how much the feature contributes to pushing the model’s output from a baseline editing level to the predicted RNA editing level for the variant. The baseline editing level is defined as the average editing level across all variants in the test set. Examples of how SHAP scores illuminate feature importance are illustrated in Fig. 7e,f. Figure 7e illustrates the SHAP scores for the five most important features for the NEIL1 test variant (NEIL1 RNA ID 092, U31G) with a high observed editing level (0.78) agrees with model prediction (0.78). For the NEIL1 test set, the baseline (mean) predicted editing value is 0.25. We display the contribution of all feature values for this variant in pushing the prediction from the baseline of 0.25 to the predicted output value of 0.78. The feature “sim_nor_score (same as in Fig. 5g, normalized similarity score comparing MFE structure of variant to WT) = 0.99”, is estimated to have the highest importance and increases the prediction of editing level by 0.09 (SHAP value) from the baseline. The contribution of the editing site with an A:C mismatch has a SHAP value of 0.05, and so on. Although a larger number of mutations generally decreases editing level (Fig. 7g), the “num_mutations = 3” has a positive SHAP value (red) for this variant, highlighting the ability of the model to pick up different feature combinations. Conversely, Fig. 7f illustrates how feature values unfavorable to ADAR editing result in a predicted editing level of 0 for another variant (NEIL1 RNA ID 142, G41C/C45G) of the NEIL1 substrate relative to the baseline. This variant has two mutations in the substrate (num_mutations = 2). This feature has a SHAP score of −0.05 (blue), indicating that a higher number of mutations in this RNA is unfavorable to editing. There is no A:C mismatch at the editing site, and this feature value has a SHAP score of −0.06. These and other highlighted feature values serve to drive the prediction down from the baseline of 0.25–0 (Fig. 7f).

To illustrate the directionality of predictive association of the features with RNA editing levels, we plotted the SHAP scores of the top 20 features for all test set variants of the three substrates (Fig. 7g). We also summarized the relative importance of features for each of the three substrates by computing the percent contribution from each feature to the mean of absolute SHAP values across all examples in the test sets of each substrate, and highlighted the six new features unique to this study in red (“probability_active_conf”, “sim_nor_score”, “ensemble diversity”, “mfe_frequency”, “site_5prm_cp_internal:C:G”, and “d2_5prm_cp_internal:G:C”, Fig. 8a). The closing pairs for loops and bulges are previously unexplored, but highly ranked features for NEIL1 and TTYH2 (Fig. 7g). Closing pair can be a readout for alternative active structure, such as in some highly edited NEIL1 variants the d2 internal loop’s 5′ closing pair is a G:C sequence (“d2_5prm_cp_internal:G:C”; Supplementary Fig. 7) compared to the U:A pair in WT (Fig. 7b). These results also corroborate with trends observed in the abovementioned clustering analysis (Fig. 6, and Supplementary Figs. 7–9).

Eight features were illuminated as most important for driving model predictions across substrates. Number of mutations (“num_mutations”) was the strongest contributor for AJUBA (47.69%) and NEIL1 (19.18%), and in the top six most important features for TTYH2 (4.51%; Fig. 8a). Increasing number of mutations had a negative influence on editing levels (Fig. 7g). This effect supports the proposal that RNA structure plays a big role in editing activity because in our library design the more mutations (single versus double-transversion) the larger changes occur in the structure (Fig. 4c). An A:C mismatch at the editing site (“site_1_1:A:C”) had a high relative contribution for NEIL1 (18.03%) and TTYH2 (15.95%), but contributed less to AJUBA editing levels (0.31%), consistent with previous proposals that A:C mismatch facilitates the flip-out of the adenosine for ADAR editing26,27 (Fig. 7g and Supplementary Fig. 4g). The probability of the active conformation (“probability_active_conf”) accounted for a mean of 8.8% relative contribution across substrates. The structure-similarity score of variants compared to WT feature (“sim_nor_score”) was positively correlated with editing levels. The lower MFE is positively associated with editing for NEIL1 and TTYH2, but not for AJUBA. The higher ensemble diversity is positively associated with editing levels for TTYH2 and AJBUA but bidirectional for NEIL1. Higher MFE frequency is positively associated with editing levels for NEIL1 and TTYH2, but not for AJUBA (Fig. 7g). These features corroborate with previous results that the overall structural stability of RNA substrate is positively correlated with the editing activity15 and reveal that RNA conformational diversity plays important and specific roles in different substrates. The seventh ranking feature was the position of the mutation along the RNA molecule (“mut_pos”). “Mut_pos” values are numbered beginning at the 5′ end of the RNA molecule, so higher values indicate positions further from 5′ and closer to 3′. This result indicates that the nucleotides adjacent to the editing site in the structure is the hot spot dictating activity. Though the “mut_pos” feature had a strong impact on editing level, the directionality varied across substrates, reflecting the interplay of the mutation position with other structural features.

In addition to top individual features, a sparse set of features collectively contribute to the accurate predictions made by the models. For the NEIL1 substrate, 90% of the explained variance could be attributed to the 26 top features, compared with the 32 top features for TTYH2 and 23 top features for AJUBA (Supplementary Data 8). To illustrate the contributions of different types of features and to draw biological insights, we looked at feature groups and subgroups. We categorized the group of all structure and sequence features excluding mutation-related features to four subgroups (Fig. 6b, full list of feature groups and subgroups in Supplementary Data 1). Overall, the thermodynamics and the editing site structure have the largest contributions, consistent with prior proposals that the overall thermodynamics (RNA stability and conformational diversity), and the structure of the editing site dictate the editing efficiency11,15. Notably, upstream and downstream structure features are also important, such as the downstream features in TTYH2. The −1 and +1 nt sequence motif (the 5′ and 3′ nearest neighbor, termed “site_prev_nt” and “site_next_nt” in Fig. 7f and Supplementary Data 1) also contributes to the prediction albeit to a lesser extent.

This systematic interpretation of our models reveals not only several biologically relevant features that are globally predictive across the three substrates, but also some that are highly predictive for specific substrates. These results showcase the promise of predictive cis-regulatory models of RNA editing, but also highlight the need for much larger datasets spanning diverse substrates to learn more generalizable models of RNA editing.

## Discussion

The ultimate goal in understanding the cis-regulation of RNA editing is to develop a model that accurately predicts the ADAR editing efficiency in vivo, namely an “editing code”. Unlike protein–DNA or protein–ssRNA interactions, where the primary cis-sequence largely dictates the interaction, ADAR substrates are required to bear double-stranded secondary structure. The difficulty of associating RNA sequence and secondary structure features to editing activity is a major challenge in studying the cis-regulation of RNA editing. To tackle this challenge, we integrated high-throughput measurements of ADAR editing with computational analysis of >100 RNA sequence and structure features simultaneously. The CRISPR/Cas9 engineering allowed us to study the cis-regulation of RNA editing by introducing desired mutations at the endogenous locus with the minimal perturbation of the RNA editing process. Our key results can be summarized in four main points. First, we found alternative structures that can be equally or better edited than the WT structure (Fig. 6, and Supplementary Figs. 7–9). Second, our models confirmed all known features identified by previous biochemical and transcriptomic studies (A:C mismatch at the editing site, the 5′ and 3′ nearest neighbors, the length and stability of the substrate including 5′ stem length and 3′ loop structure)11,12,13,14,17 and revealed previously unexplored features, including ensemble diversity and closing pairs (Figs. 7 and 8, and Supplementary Data 8). Third, our substrate-specific machine learning models integrated diverse sequence and structural features to quantitatively predict editing levels of new variants for a given target (Fig. 7d). Fourth, both general and substrate-specific features synergistically contribute to editing levels, and the degree of contribution of each feature varies across different RNAs, suggesting complex and context-dependent cis-regulation of the RNA editing landscape (Fig. 7e–g). A lot of progress has been made in recent years in deciphering the RNA splicing code30,31,32. However, we just began to uncover the ADAR-mediated RNA editing code which harbors complex regulation via RNA secondary structure.

Our approach opens several new lines of inquiry for further improvement. Measuring the structure of RNA variants in cells33,34,35 at native endogenous loci (versus relying on predicted structure) would greatly enhance the RNA structure analysis. In addition, advanced experimental methods such as irCLASH36 may be applied together with gene-specific amplification to validate the ECS sequence. Further, while we focused on cis-elements adjacent to the editing site, long-range interactions, such as the editing inducer elements37 important for editing can be investigated using our approach. Although our data focused on the editing level, which is largely determined by the deaminase domain14,38, the dsRBD (illustrated in Fig. 5c) also contributes to substrate recognition39,40. In the future, high-throughput in vitro RNA-binding experiments can be performed to combine with existing SELEX data for dsRBDs41 to identify features specific to dsRBDs of human ADARs using our pipeline. To tease apart trans-regulation effects by RNA-binding proteins (RBP)42, in vitro competitive binding experiments and the editing measurements in cells can be conducted in the knockout or overexpression background for the RBP of interest. Because we observed no correlation of RNA abundance on editing level (Supplementary Fig. 2) and our experiment measures pre-mRNAs, our editing analysis is likely not affected by potential effects of sequence variation on RNA processing. Nevertheless, understanding the interplay between RNA editing and various RNA biology, such as RNA processing pathways and RBP presents an important question for future investigations.

Several systems were recently developed to recruit ADAR enzymes to specific sites for site-directed RNA editing43,44,45,46,47,48,49, providing novel tools to study biological function, and a safer and reversible alternative to gene therapy43,50,51,52. Currently, these RNA engineering methods mainly use antisense gRNAs that form perfect duplex with the target region except for an A:C mismatch at the editing site. Our results strongly support additional imperfectly base-paired designs to mimic the highly selective and efficient editing observed in the natural ADAR substrates with complex structure features. Such features include relatively short 5′ (upstream) stem required for ADAR1 (ref. 14) compared to ADAR2 and specific non-stem 3′ (downstream) structure, where the internal loops at the 3′ likely contribute to the ADAR selectivity17. Notably, each of the three RNA substrates we tested has substrate-specific features that dictate the editing efficiency (Fig. 7g). This showcases that a screen of possible designs of gRNA would be a valuable and cost-effective strategy to learn the best features that lead to the most specific and efficient editing for each different target site in transcriptome engineering. In this regard, our experimental methods and computational pipeline are readily applicable.

There are several limitations to the modeling approaches utilized in this study. While our current results give rise to models with substantial predictive power for individual substrates, their generalizability remains low (Supplementary Fig. 11). This tendency to overfit will be mitigated by expanding the training set to include more RNA substrates, allowing the model to learn the shared properties of RNA substrates. Furthermore, the SHAP interpretation of feature importance in the XGBoost model highlighted the significance of features related to mutation number, structure, sequence, and position. This result suggests that an effective featurization of the data relies upon knowledge of a WT substrate structure and sequence, which may not be available for all substrates. Nonetheless, any variant in a library can be arbitrarily assigned as the WT for featurization purpose.

Building on our work using the PREUSS pipeline, ADAR editing can be further investigated in larger scale and in different cell types, tissues and disease states to explore the full spectrum of cis-regulation. Ultimately, establishing the “RNA editing code” will help us better understand the underlying rules of RNA editing, and facilitate efficient and precise transcriptome engineering for studying RNA biology and treating human disease.

## Methods

### Cell culture and transfection

HEK293T cells (ATCC) were cultured in Dulbecco’s modified Eagle medium (Life Technologies) supplemented with 10% FBS (Gibco, Thermo Fisher) and penicillin streptomycin (Life Technologies). Cells were maintained at 70–90% confluency. One day before transfection, ~700,000 cells were split to 6-well plates. The next day, 500 ng of Cas9–sgRNA construct in the px330 backbone (https://www.addgene.org/42230/) was co-transfected with 500 ng of the DNA donor using lipofectamine 2000 (Invitrogen). Cells were maintained at 50–90% confluency for 5 days.

### Design of the CRISPR/KI donor oligos

We selected three natural ADAR1 substrates (NEIL1, TTYH2, and AJUBA; Fig. 1b and Supplementary Data 7) for the mutagenesis studies based on the observations from available RNA-seq data that (1) the editing sites for all three substrates are highly edited (30–60%) in HEK293T cells, in which ADAR1 is expressed but ADAR2 is lowly expressed; (2) the editing sites are not edited when ADAR1 activity is abolished; and (3) they represent three different types of dsRNA substrates. The NEIL1 editing site is in the coding region. The editing event leads to an amino acid change from lysine (K) to arginine (R), which has been shown to increase the enzymatic activity of the NEIL1 glycosylase53. The TTYH2 editing site is intronic and the AJUBA editing site is located in its 3′ UTR. The functional impact of these two editing sites is currently unknown.

Two types of CRISPR KI donors were designed in this study: the degenerate donor and the fixed donor. For the degenerate donor oligos, a single-stranded DNA oligo was synthesized in which degenerate sequences were introduced at the interrogated regions. In the NEIL1 donor (Fig. 1c), −3 to −1 and +1 to +3 were interrogated and equal molar of four nucleotides were introduced at these positions during DNA synthesis. To avoid cutting by the Cas9, a point mutation was also introduced at the PAM sequence. In the TTYH2 donor (Supplementary Fig. 1a), a 10 nt region in the ECS was studied, and equal molar of C or T was introduced. The PAM sequence was also mutated along with a compensatory mutation to maintain the secondary structure.

For donors used for targeted mutagenesis, individual DNA sequences were designed to carry desired mutation(s). Briefly, a 15–20 nt region around the target editing site and the corresponding region on the opposite strand were subject to mutagenesis. All possible nucleotide at any single position was tested, with exceptions where A-to-G mutation was avoided in the +1 and −1 nt of NEIL1 because it potentially becomes indistinguishable with A-to-I editing in RNA-seq results. Combination mutations at two positions were also designed, in which each of the positions is mutated to the nucleotide in the opposite strand to disrupt the original structure. In addition, individual donors with altered length for interrogation of specific features of the RNA substrate were included. For NEIL1, we were able to use donor oligos to introduce compensatory mutation variants because the ECS and editing site are close in sequence space. All oligo sequences used in this study are listed in Supplementary Data 6.

### Generation of the CRISPR/KI donor pool

For NEIL1 donors, 80mer oligos were purchased from IDT and pooled at equal molar ratio. Oligo pairs NEIL1_leftarm/NEIL1_rightarm, or asymmetrical labeled primer pairs NEIL1_leftarm_biotin/NEIL1_rightarm and NEIL1_leftarm/NEIL1_rightarm_biotin (Supplementary Data 6) were used separately to add additional sequences to obtain 200mers in PCR reactions using Phusion polymerase. Around 400 µl PCR products were purified using MinElute PCR purification kit (Qiagen) to obtain the dsDNA donor pool, which was verified by agarose gel electrophoresis. For single-stranded donors, 100 µl MyOne Streptavidin Dynabeads (Thermo Fisher) were added to the purified products that were amplified with asymmetrical biotin label, and then the mixtures were denatured at 95 °C for 10 min and chilled on ice immediately. The unbound single-stranded oligos were collected from the supernatant and then purified with column MinElute PCR Purification Kit (Qiagen) to obtain the ssDNA donors. For TTYH2 and AJUBA donors, 100–120mer pooled oligos were purchased from Agilent and amplified using individual primers (Supplementary Data 6). Primers donor_F and donor_R of each target gene were used to specifically amplify the oligo library from the oligo chip. A second PCR using Donor_F_70 and Donor_R_70 was performed to elongate the homologous arms of each donor. The PCR products were purified using MinElute PCR purification kit (Qiagen) and used as dsDNA donors later. NEIL1_degenerate_donor and TTYH2_degenerate donors were synthesized as Ultramer by IDT and used directly in the transfection.

### Guide RNA design and cloning

gRNA was predicted by the web-based software CRISPR.mit.edu. The higher ranked gRNA with a PAM sequence close to the interrogation region was selected. For the TTYH2 and AJUBA loci, different sets of gRNAs were designed for the KI regions in two opposite strands. To construct gRNA plasmids, two reverse complementary single-stranded oligos with overhangs were synthesized by IDT and annealed on a thermocycler (Bio-Rad) before ligation to BbsI-linearized PX330 backbone. The ligation mix was transformed into Stbl3 chemical competent cells (Invitrogen) and single clones were sequence verified by Sanger sequencing.

### CRISPR mutagenesis and library construction

We used 600 ng single-stranded oligo donor library or 1200 ng double-stranded oligo donor library along with 500 ng gRNA construct to co-transfect into 1 million HEK293T cells using lipofectamine 2000 (Invitrogen). For degenerate donor mediated KI, 1 µl of 10 µM degenerate donor was used. A total of 1 µM L755507 (Sigma Aldrich) was added to the media 1 day after transfection to enhance the HDR efficiency54. Two biological replicates were included for each assay. The transfected cells were grown for 5 days before they were seeded onto 10 cm dishes for an additional two days. A total of 10% of the cells were harvested for genomic DNA using Quick-DNA kits (Zymo Research). The remaining cells were used for nuclear extraction using the Nuclear/Cytosolic Fractionation Kit (Cell Biolabs) following the manual. Nuclear RNA was purified from the nuclear extract using the Trizol method. Genomic DNA was removed from the RNA samples using the TURBO DNase (Thermo Fisher Scientific). The primers were designed to make sure pre-mRNA species were amplified for RNA editing analysis. RT was performed using SuperScript III kit (Thermo Fisher Scientific) and the gene-specific primers. All RT products were used in total of 300 µl (50 µl × 6) PCR reaction with Phusion polymerase (Thermo Fisher Scientific), and gene-specific primers with Fluidigm mmPCR adaptor sequences55. Genomic DNA library was amplified using a similar approach, except for the different primer set. All first round PCR products were size-selected on 1.5% agarose gel and purified using Gel purification Kit (Qiagen). Diluted PCR product (1:50) was used in the second round of PCR to add the Illumina sequencing adapter and individual barcode sequences, using Fludigm_universal_F/fludigm_barcode_R (ref. 55). The library was size selected and purified as in the previous step.

### Next-generation sequencing and data analysis

All libraries were sequenced on a NextSeq550 using Basespace (Illumina) for data collection. NEIL1 libraries were sequenced for 75 cycles paired-end and TTYH2 and AJUBA libraries 150 cycles paired end. Quality of reads were evaluated, and reads were filtered by FastQC default settings. To map the variants of the target gene, a reference genome was first built using the GMAP package, where designed mutations were included as SNPs. Briefly, GSNAP was used to detect variants with mismatches inside the interrogated region but not indels. The mapped reads were separated into individual variants based on the unique mutations carried in the region except for the editing site, and RNA editing was called and measured for each variant, as described previously56. The indel variants were mapped individually. The editing level Z-score of each variant is calculated for each RNA library by Eq. (1) as:

$$Z_i = \frac{{\mathrm{EL}_i - \mathrm{EL}_{\mathrm{WT}}}}{S}$$
(1)

where $$S = \sqrt {\frac{{\mathop {\sum }\nolimits_{i = 1}^N \left( {x_i - \overline x } \right)^2}}{N}}$$; EL = editing level; $$x_i = {\mathrm{EL}}_i - {\mathrm{EL}}_{\mathrm{WT}}$$; $$\bar x =$$ mean editing level for a given library.

### Computational RNA secondary structure prediction

The sequence used for WT NEIL1, TTYH2, and AJUBA are shown in Supplementary Data 7 and Fig. 1b. We chose the ECS sequence according to the reported method18 by examining the 100–1000 bp sequence flanking the editing site and choosing the most sData RNA duplex. For AJUBA, the ECS sequence we predicted also matches the predicted duplex using the RNAhybrid method63. We chose the region flanking the editing site of AJUBA to fold a “minimum” hairpin structure as the AJUBA RNA substrate (Fig. 1b) by omitting 524 nt sequences in lieu of the full length (>800 bp). This is because this minimum AJUBA RNA structure preserved both the base-pairing and a natural loop structure (shown as the hairpin loop in Fig. 1b) based on the resulting duplex structure, using the ECS prediction results mention above. The secondary structures with the MFE of the RNA variants (Supplementary Data 7) for all three RNAs are calculated from the Vienna RNAfold64 2.4.14, using default parameters except for allowing lone pairs (parameter: -p -d2).

To evaluate if the computationally predicted MFE structure is similar to the RNA structure probed experimentally in vitro, we used the SimTree65 method version 1.2.3 to compare the MFE structure between computationally predicted to the experimentally inferred structures (described above) for all NEIL1 variants and a portion of the TTYH2 variants that we were able to experimentally measure. The MFE structures are largely similar between experimental and computational results shown by the high average value of the pairwise normalized similarity score calculated using SimTree65 (NEIL1 = 0.96 ± 0.08, TTYH2-ECS library = 0.97 ± 0.04, where 1 means identical structures, Supplementary Fig. 4c). We reasoned that the results from this comparison justify the suitability of using the computationally predicted structure in our analysis. Therefore, for all of the structural feature analysis, we used the computationally predicted RNA structures to be consistent for all three RNA libraries.

### Clustering analysis of RNA sequence and structure

We performed clustering analysis for each RNA library using the LocARNA pipeline (version 2.0.0RC8)23,24 and associated tools. First, we used both the RNA sequence and the MFE structure (Supplementary Data 7) as the input to the mlocarna module23 to generate a multiple alignment. The resulting multiple alignment was then input into RNAclust66 (RNAclust.pl, version 1.3, modified to suit current computing environment) to generate a hierarchical cluster tree file (Fig. 6a and Supplementary Fig. 8) and the consensus RNA structure for each cluster (Fig. 6b, and Supplementary Figs. 7b and 9). These hierarchical clustering and the consensus RNA structure for each cluster can also be viewed in the SoupViewer66. The hierarchical clustering was illustrated (Fig. 6 and Supplementary Fig. 8) using dendrogram generated by the iTOL web server67.

### Calculating the probability of forming wild-type secondary structure

The probability of forming WT-like RNA secondary structure was calculated with Vienna RNAfold64 version 2.1.9. The probability of forming the WT-like secondary structures was calculated by Eq. (2) as:

$$\mathrm{Probability} = \frac{{{\mathrm{e}}^{\left( {\frac{{ - E\_\mathrm{wt}}}{{kT}}} \right)}}}{Z}$$
(2)

where kT = 0.6 kcal at temperature T = 37 °C. Z is the unconstrained partition function (calculated with RNAfold -p). E_wt is the energy of the state with the WT-like secondary structure, calculated using both the constraints and reference information (listed below) in RNAfold. The details of the usage and the selection criteria for these constrains and reference information are also explained below.

The structure constraints are:

NEIL1:

…………………………………..>>>….>…>>…>…>…

TTYH2:

…………………………………………………….>>>>>>>>>>>.>.>.>>>>>>>>>>….

AJUBA:

…………………………………………………..>>.>>>>.>.>>>.>>>>.>>>>>>>.

where “>” indicates that the given base must be paired with a residue that comes before it (5′) in the sequence and “.” indicates no constraint for the given base.

The reference information is:

NEIL1

………………..(((((.(((.(((((……))))).)))..)))))…………………

TTYH2:

…………………((((((.((.((((((((((((……………………….)))))))))))).)).))))))………………….

AJUBA:

(((((((((((((((.((((((((.((((((.((.(((((((((.((((……….)))).))))))))).)).)))))).)))))))).))))))))))))))).

The constraints are used to specifically calculate the free energy when fold the RNA sequence according to the constraints. The reference information was used to calculate a penalty by further comparing the base-pairing between the folded structure and the structure indicated by the reference information. We incrementally tested different versions of both the constraints and the reference information, starting from the exact secondary structure of WT to gradually relax the base-pairing starting from the first base-pair below the hairpin loop. This way the constrains and reference information together would preserve majority of the base-pairing information in WT structure near the core region of the RNA (regions that flanking the editing site). The reference information is used so that an additional penalty was added if base pairs could not be formed in the core region of the RNA. Note that penalties are not applied for any additional base pairs that form in the unpaired regions of the reference information. The probability was divided by the number of noncanonical base pairs that would be formed in the core of the WT secondary structure (defined by the reference information) to roughly account for the additional energetic penalty that these base pairs should incur. These constraints and reference information also ensure that the “active conformation” we calculated includes a wide group of core conformations that closely resembles the MFE structure of the WT, but are not limited to the single WT MFE structure. Additionally, the final versions we choose (listed above) are the ones that can compute conformations existed in the majority of RNA variants (88% for NEIL1, 96% for TTYH2 and AJUBA). The script in is available on GitHub (URL: https://github.com/kundajelab/PREUSS.)

### Modeling the 3D structure of ADAR1 bound to NEIL1

A 3D model of ADAR1 bound to NEIL1 was built through homology modeling and the Rosetta RNP-denovo method68. First, a homology model of human ADAR1 deaminase was built using Phyre2 (ref. 69). The conformation of the core RNA residues (nucleotides corresponding to NEIL1 residues 30–39 and 44–53) was taken from the previously solved structure of human ADAR2 bound to double-stranded RNA (PDB ID: 5HP3). The RNA was positioned relative to the protein by aligning the previously solved ADAR2 structure (in complex with dsRNA) to the homology model of ADAR1, then copying the RNA coordinates from the ADAR2-dsRNA structure. Protein residues in the ADAR1 homology model that clashed with the RNA were removed (the final residues included in the model were: 823–973, 996–1003, and 1010–1223). This model was used as input to RNP-denovo with the -s option. Helical regions of the NEIL1 RNA were modeled as ideal A-form helices, also included with the -s option. Conformations of protein residues were not optimized (-minimize_protein_sc false and -rnp_high_res_cycles 0). Default settings were used for all other options. The complete RNP-denovo command line used is provided below:

rna_denovo -fasta fasta.txt -secstruct_file secstruct.txt -s ADAR1_homology_model_and_core_RNA_from_5hp3.pdb RNA_helix_1.pdb RNA_helix_2.pdb RNA_helix_3.pdb RNA_helix_4.pdb RNA_helix_5.pdb -new_fold_tree_initializer true -minimize_rna true -minimize_protein_sc false -out:file:silent build_full_wt_neil1.out -rna_protein_docking true -rnp_min_first false -rnp_pack_first false -cycles 10000 -rnp_high_res_cycles 0 -minimize_rounds 2 -nstruct 2000 -ignore_zero_occupancy false -convert_protein_CEN false -FA_low_res_rnp_scoring true -ramp_rnp_vdw true -dock_each_chunk_per_chain false -use_legacy_job_distributor true -no_filters

where RNA_helix_1.pdb, RNA_helix_2.pdb, etc. are ideal A-form helices for base-paired regions of Neil1. Possible placements of the double-stranded RNA-binding domains were visualized by aligning the previously solved structure of the ADAR2-dsRNA-binding motif bound to dsRNA (PDB ID: 2L2K) to our model of NEIL1 bound to ADAR1.

### Machine learning models of RNA editing levels

All feature extraction and model training code are available to access on github: https://github.com/kundajelab/PREUSS

#### Feature extraction

RNA structures for NEIL1, AJUBA, and TTYH2 were annotated with the bpRNA algorithm9. The bpRNA annotations were in turn utilized to extract structural and positional features for each variant. A feature matrix with structure-specific features from the bpRNA (annotations, sequence-specific features, features that take into account the mutation type and position, and thermodynamic-specific features was engineered (how the featured were derived are described in Supplementary Data 1)) for each substrate and included a total of 122 features (Supplementary Datas 2–4).

#### Model training

The XGBoost28 Python library (v. 0.81) was used to train gradient boosted regression trees to predict Adar editing levels from feature matrices described above. Training was performed both within-substrate and across substrates. Several training and test combinations of datasets were utilized and summarized in Supplementary Data 5.

The dataset was randomly separated into three splits: training on 70% of variants, model validation on 15%, and testing on the remaining 15%. To avoid train/test contamination, base-pair positions along the RNA molecules were assigned to one of the three splits (training, tuning, or test). All features associated with a given base pair position were assigned to the corresponding split. Any feature that was null or non-varying across all variants in a given training split was removed from analysis. A number of variants were characterized by two or more mutations relative to the WT. For these, features were defined for each mutated base pair separately. To avoid train/test contamination, any base pairs that were both mutated in a given variant were included in the same split. The rationale for defining features relative to bases rather than variants is that different combinations of mutations may lead to variations in editing level, and a number of features were derived in reference to mutation type and position (Supplementary Data 1).

XGBoost was trained for a maximum of 1000 iterations, with early stopping after ten subsequent rounds with no reduction in root mean square error (RMSE) on the validation split. Default parameters were used.

The R2 value was calculated on the test set to determine the percent of total variance explained by the feature matrix. Other metrics to measure model performance included:

• Spearman correlation from the scipy.stats Python library.

• Pearson correlation the scipy.stats Python library.

• Mean absolute error (MAE) from sklearn.metrics Python library.

• Mean absolute percent error (MAPE).

• Root mean square error (RMSE) from sklearn.metrics Python library.

• Area under the precision-recall curve (auPRC) from sklearn.metrics Python library.

• Area under the receiver operating characteristic (auROC) from sklearn.metrics Python library.

#### Feature importance analysis

Feature importance analysis was performed to identify the subset of features most informative in predicting Adar editing levels. The XGBoost “plot_importance” function was used to calculate the F score for each feature. The TreeSHAP algorithm29 was applied to interpret feature importance from the XGBoost model. SHAP summary values were computed for each feature as a measure of feature importance using the “shap_values” function within the “TreeExplainer” class. Pairwise interaction values from TreeShap were also calculated to identify highly correlated feature values.

SHAP values were applied to calculate the combined relative importance of feature subsets. Feature subsets (Supplementary Data 1) were defined as follows; some features were parts of multiple subsets:

• Structure features: stem length, free energy, probability of active conformation.

• Number of mutations in the variant.

• Mutation-specific sequence features: mutation position, mutation site reference allele, mutation site alternate allele, distance of mutation site from edited base.

• Mutation-specific structure features: bpRNA structure designation for the mutation site, bpRNA structure designation for the adjacent upstream site, bpRNA structure designation for the adjacent downstream site, boolean indication of whether or not the mutation is part of the same structure as the editing site.

• “Other” mutation-specific features: type of mutation (indel, SNP), presence/absence of mutation (WT/mutated) in the variant.

• Editing site sequence features.

• Editing site structure features.

• Characterization of the 3 bpRNA structural features upstream of the editing site.

• Characterization of the 3 bpRNA structural features downstream of the editing site.

For each feature subset, the mean absolute SHAP values across variants were calculated. These were in turn summed across all features in the subset and compared to the total sum of mean absolute SHAP values across all features.

Overall feature rankings were computed by calculating the mean absolute value of SHAP values for each feature across the test set samples. These mean(|SHAP|) values were summed across all features, and the percent contribution to the total was obtained for each feature. These percent contributions for each feature were averaged across substrates to determine features that were ranked as high importance consistently across all substrates.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.