Abstract
Nucleic acid aptamers are generated by an in vitro molecular evolution method known as systematic evolution of ligands by exponential enrichment (SELEX). Various candidates are limited by actual sequencing data from an experiment. Here we developed RaptGen, which is a variational autoencoder for in silico aptamer generation. RaptGen exploits a profile hidden Markov model decoder to represent motif sequences effectively. We showed that RaptGen embedded simulation sequence data into lowdimensional latent space on the basis of motif information. We also performed sequence embedding using two independent SELEX datasets. RaptGen successfully generated aptamers from the latent space even though they were not included in highthroughput sequencing. RaptGen could also generate a truncated aptamer with a short learning model. We demonstrated that RaptGen could be applied to activityguided aptamer generation according to Bayesian optimization. We concluded that a generative method by RaptGen and latent representation are useful for aptamer discovery.
Main
Aptamers are short singlestranded oligonucleotides that bind to specific targets through their threedimensional folding structure. They are analogous to antibodies and have a variety of applications, including therapeutics^{1,2}, biosensors^{3} and diagnostics^{4}. The advantages of aptamers are that they are rapidly developed by in vitro generation, are low immunogenic^{5} and have a wide range of binding targets, including metal ions^{6}, proteins^{7}, transcription factors^{8}, viruses^{9}, organic molecules^{10} and bacteria^{11}. Aptamers are generated by the systematic evolution of ligands by exponential enrichment (SELEX)^{12,13}. SELEX involves iterations of affinitybased separation and sequence amplification. This iterative process results in an enriched pool that is analyzed for candidate selection. Recent advances in highthroughput sequencing have enabled us to conduct highthroughput SELEX (HTSELEX) to collect a vast number of aptamer candidates^{14,15,16}. Current sequencing techniques can evaluate a limited number of reads: approximately 10^{6}. Micrograms of a SELEX input library only contains around 10^{14} copies of RNA, whereas an RNA library containing a 30 nt random region theoretically has 10^{18} (~4^{30}) unique sequences. Hence we can only evaluate a very small portion of the theoretical diversity, and thus computational approaches that efficiently process highthroughput sequencing data are critical in aptamer development.
Several computational approaches that identify aptamers using HTSELEX data have been reported. Aptamer identification tools utilize parameters associated with the SELEX principle, such as frequency, enrichment and secondary structure^{17,18,19,20}. Although they are useful for identifying sequences from HTSELEX data, various candidates are limited by the actual sequence existence in the data. Simulationbased methods have been reported for sequence generation^{21,22,23}; however, these methods require preceding motif information and are therefore not suitable for identifying aptamers against an unfamiliar target. Computational approaches have also been developed to predict aptamer motifs. Motif prediction is useful not only for candidate discovery but also for aptamer development processes such as truncations and chemical modifications. Several methods have been developed for motif detection by using secondary structures^{24}, enrichment of subsequences during SELEX experiments^{25} and emphasis on various loop regions^{26}. In addition to these approaches, AptaMut utilizes mutational information from SELEX experiments^{22}. As nucleotide substitutions can increase aptamer affinity, mutational information is beneficial for candidate discovery. However, although insertions and deletions are also important factors for altering aptamer activity, in silico methods that deal with these mutations are poorly developed; thus, a method that generates sequences from experimental data is needed to expand the exploratory space, and including motif information and nucleotide mutations confer an increased opportunity for aptamer discovery.
We focused on a neural network to develop a procedure for aptamer generation and motif finding. As reported previously, neural networks are suitable for analyzing large datasets and are compatible with highthroughput sequencing data. DeepBind adopts a convolutional neural network (CNN) to distinguish DNA motifs from transcription factors and find sequence motifs by visualizing network parameters^{27}. Recurrent neural networks can also be used for sequence discovery^{28,29}. Neural networkdriven generative models are currently being applied in a broad range of research areas. Some examples of neural networkdependent generative models include deep belief networks^{30}, variational autoencoders (VAEs)^{31}, and generative adversarial networks^{32}. For a probabilistic generation of nucleic sequences, using long shortterm memory (LSTM) was proposed to mimic sequence distribution^{33}. Generative adversarial networkbased sequence generation methods have also been proposed^{34}.
Variational autoencoderbased compound designs have been reported in small molecule discovery. VAEs learn a representation of the data by reconstructing the input data from a compressed vector^{31}. Kusner and colleagues used grammarbased VAEs and SMILES sequences to generate chemical structures for activity optimization^{35}, and GómezBombarelli et al. used the representation learned by the VAE to design chemical compounds^{36}. Unlike other generative models, VAEs exploit the relationship between compressed feature space and inputs in a bidirectional manner; they are therefore suitable for visualizing similarityoriented classifications and emphasizing important sequence features. Using VAEs to convert HTSELEX data into lowdimensional space would be useful for candidate discovery; thus, VAEbased aptamer generation systems are worth investigating. When conducting VAE modeling for HTSELEX data, having a profile hidden Markov model (HMM) decoder should be beneficial for aptamer discovery; it captures motif subsequences—robust with substitutions, deletions and insertions—and can easily monitor effects from the subsequences.
Here we present RaptGen, a VAE for aptamer generation. RaptGen uses a profile HMM decoder to efficiently create latent space in which sequences form clusters based on motif structure. Using the latent representation, we generated aptamers not included in the highthroughput sequencing data. Strategies for sequence truncation and activityguided aptamer generation are also proposed.
Results
Overview of RaptGen and its applications
RaptGen is a probabilistic generative model that enables us to generate new aptamer sequences that are not included in the input SELEX dataset. To realize this, RaptGen employs a VAE with a profile HMM for decoder distribution and embeds RNA sequences from the input dataset into lowdimensional latent space (Fig. 1a). Using a profile HMM for the decoder renders RaptGen more robust for substitutions and indels in RNA aptamers, thereby achieving better generative performance than existing models (see the ‘Motifdependent embeddings using simulation data’ section). See Methods for details on RaptGen’s procedures.
In this study we propose three important applications of RaptGen for aptamer discovery. First, a latent space learned by RaptGen is visualized with a sequence motif, and new aptamer sequences are generated for an arbitrary point in the latent space (see Fig. 1b and the ‘Real data evaluation with RaptGen’ section). Second, optimized aptamer sequences are searched in the latent space by considering additional experimental information, such as the binding affinity of a subset of sequences (see Fig. 1c and the ‘RaptGen application in aptamer discovery’ section). Third, RaptGen enables in silico the design of truncated aptamer sequences using a shorterprofile HMM decoder (see Fig. 1d and the ‘RaptGen application in aptamer discovery’ section).
Motifdependent embeddings using simulation data
We first attempted to construct a VAE with an encoder and decoder applicable to aptamer discovery. In the aptamer representation space, sequences containing the same motif should be in a neighboring area. Robustness against nucleotide mutations and motif positions should also be considered. We investigated different types of sequence representation models to identify a desirable decoder. We constructed VAEs with a CNN encoder and three different types of probabilistic models (the multicategorical model, autoregressive model and profile HMM) as a decoder. Simulation data, including ten different motifs, were created to assess the visualizing capability of these VAEs (Fig. 2a). We observed that profile HMMembedded sequences in a motifdependent manner after training the data, whereas the multicategorical and autoregressive models displayed indistinctive distributions (Fig. 2b). The evidence lower bound (ELBO) was calculated to evaluate the model. Although the multicategorical model and profile HMM had almost the same ELBO (20.71 and 20.60), and had similar reconstitution errors (15.32 and 16.02) and Kullback–Leibler divergence scores (5.39 and 4.59), the embedding space of the multicategorical model failed to visualize a motif cluster. This is thought to be due to the inability of the multicategorical model to consider motif positions. As the nucleotide probability of each position was independently estimated in the multicategorical model, the same motifs in the shifted position might not be aligned in latent space. The autoregressive model had the lowest ELBO (19.50); however, the reconstitution error was the worst (18.32). Furthermore, the classification result was not optimal. We suppose that latent representation is dispensable in the autoregressive model as the model itself has context information. We also compared the different encoder types. Long shortterm memory^{37} and CNN–LSTM were evaluated in combination with the above three decoders. Long shortterm memory is used in characterlevel text modeling. The embedding space from the multicategorical and autoregressive models was still inadequate using either encoder (Supplementary Section 8). Profile HMM created distinguishable embedding with LSTM, whereas a learning deficiency was observed in combination with CNN–LSTM (Supplementary Section 8). Collectively, we concluded that the profile HMM decoder is favorable for motifdependent embedding. A VAE composed of a CNN encoder and a profile HMM decoder was examined in the following study.
We next tested whether our VAE model could distinguish split motifs. Subsequence cooccurrence at distances is often observed in RNA due to intramolecular basepairing and internalloop structures^{38}. We applied simulation data with a pair of 5 nt split motifs to the VAE (Fig. 3). The multicategorical model decoder was used for comparison. Figure 3b shows the results of embedding split motifs. Plots are displayed in three groups: right motif, left motif and both motifremaining sequences. Profile HMM output sequences related to the motif, whereas the multicategorical model scattered the sequences. We sampled representative profile HMM distributions from each population. Profile HMM visualization shows that the yellow point skips the left motif. The red point skips the right motif, both by allocating a high probability of jumping to the deletion state from the matching state (Fig. 3c). Visualization of the purple point shows that the middle of two points has a low probability of skipping either of the motif fragments. The transition probability to skip the left motif (\({a}_{{M}_{1},{D}_{2}}\)) and the right motif (\({a}_{{M}_{10},{D}_{11}}\)) for rightonly, both and leftonlymotif models was (0.995, 0), (0.107, 0.002) and (0, 0.987), respectively. Interestingly, the point located between these two motifs has a high probability of including both motifs. These results show that a profile HMM decoder is also applicable for split motifs. Hereafter, we called a VAE with a profile HMM decoder RaptGen.
Real data evaluation with RaptGen
We further evaluated RaptGen using SELEX sequence data obtained from our previous study^{20}. As real data are more complex than simulation data, we first investigated the dimensions of the latent space. Raw HTSELEX data have 30 or 40 nt variable regions and fixed primer regions at both ends. In the present study, we used the variable region to create latent space. We tested up to twelve spatial dimensions and trained the model 50 times on datasets A and B (Supplementary Fig. 1). For Dataset A, the minimum loss was in four dimensions, and the secondlowest was in two dimensions. For Dataset B, the minimum loss was in three dimensions and the secondlowest was in two dimensions. Loss tended to increase as the embedding dimension increased; however, the loss of onedimensional space was higher than that of the tendimensional space. The lower dimension would be favorable for visualization, and performing Bayesian optimization would be advantageous, as described in later sections. We therefore adopted a twodimensional space for analysis.
We next subjected two independent HTSELEX datasets (datasets A and B) to RaptGen. The resulting latent embeddings are shown in Fig. 4 and Supplementary Section 4. We previously demonstrated that aptamers from datasets A and B exhibit continuous and split motifs, respectively. As the SELEX experiment sequences are amplified with specific binding motifs, we reasoned that they would form clusters in a latent space based on their motifs. We thus used the Gaussian mixture model (GMM), which hypothesizes that data consists of a mixture of Gaussian distributions, to classify the distributions. We chose ten different points representing the latent cluster center of the GMM (Fig. 4). We observed that sequences with an uncertain profile HMM such as AGMM2, AGMM5 and BGMM0 were embedded near the latent space center. Unenriched aptamer sequences remained after the SELEX experiments. We predicted that these junk sequences formed clusters in the latent space. By contrast, the nearedge area contained sequences that emit nucleotides preferentially. We also confirmed that similar profiles were embedded in similar areas (Supplementary Section 4). These results provide support for the use of RaptGen to analyze highthroughput SELEX data.
We attempted to generate the most probable sequence from the profile HMM of each GMM center for activity evaluation. We calculated the model state path with the highest probability and derived the most probable sequence according to the path. When the path included insertion states, we generated up to 256 sequences with no duplication by randomly replacing each insertion state with a single nucleotide and selected a sequence with the highest probability. The resulting reconstituted sequences and their probabilities are shown in Fig. 4. After connecting with their fixed primer sequences, aptamer RNAs were produced by in vitro transcription and their binding activities were assessed by surface plasmon resonance assay. Aptamers identified in our previous study were used as positive controls^{20}. Although more than half of the candidates were found to have weak or no activity, some sequences such as AGMM1, BGMM4 and BGMM8 had evident binding activity. To determine whether these aptamers exist in the original data, we calculated each sequence’s edit distance from the nearest HTSELEX sequence (Supplementary Table 1). It should be noted that all candidate sequences were not included in the original SELEX data. Collectively, we concluded that RaptGen enables us to generate aptamers from the latent space and reduces the limitations of working with actual sequence data.
RaptGen application in aptamer discovery
We proposed further applications of RaptGen for aptamer development. Shortening the aptamer length is important for industrial application. Aptamer truncation can reduce the cost of manufacturing and facilitate material quality assurance. It also prevents unexpected biochemical interactions. Hence, aptamers should be shortened as much as possible. As the profile HMM can handle variable sequence lengths, learning settings could diverge from the original SELEX library. For example, a decoder model does not require the same length of the random region. We attempted to generate shorter aptamers than SELEX with RaptGen. We introduced a shortprofile HMM with truncated length by 5 or 10 nt from the original SELEX design. Dataset A was analyzed with a 20 nt and 25 nt model (called AL20 and AL25), where the initial library was 30 nt. Dataset B was analyzed with a 30 nt and 35 nt model (called BL30 and BL35), where the initial library was 40 nt. After creating latent space, ten sequences for each length were created in a GMMdependent manner described above. Figure 5 shows the relative activity of proposed aptamers with their lengths. For Dataset A, the 28 nt candidate showed binding activity where the initial library was 30 nt. For Dataset B, the 29 nt candidate showed considerable activity compared with the original setting, which was 40 nt. These results suggest that RaptGen can generate a shorter aptamer than the experimentally expected length. We found that sequences with low reconstitution probability tended to have low binding activity and that sequences showing binding activity had relatively high probability (Fig. 5). This observation would be helpful for effective candidate selection. We observed a tendency of sequence extension in datasets AL20, AL25 and BL35. For instance, in Dataset A, 26 nt sequences were generated from the 20 nt RaptGen setting. We speculate that the profile HMM is prone to imitating the original length in some situations. The optimal truncation length was different for each dataset. We did not identify the cause of this difference. Further studies should be performed to determine efficient truncation.
In another application of RaptGen, we generated aptamers using activity information. Aptamer derivatives harboring nucleotide mutations should be distributed around the mother sequence in the latent space. To predict effective candidates from the neighboring area of an active aptamer, binding activity distribution should be predicted. We used a Bayesian optimization algorithm for learning an activity distribution. As the distribution for the Bayesian optimization process is required to be of low dimension, RaptGen is suitable for this strategy. To implement Bayesian optimization, we first embedded activity data in the latent space. The sequences listed in Fig. 4 were reconverted into the space. Several locations moved from the initial GMM center (Fig. 6a,b). We used these reembedded positions to perform Bayesian optimization. The resulting predicted activity distributions are shown in Fig. 6a,b. We used the local penalization function to propose multiple candidates in parallel^{39}. Ten profile HMMs were proposed and evaluated for their activity. As shown in Fig. 6a,b, candidates were generated from the peripheral area of the positive clone. We confirmed that new aptamers incorporated nucleotide substitutions (Fig. 4). In addition, most of them had binding activity. Similar results were obtained for both datasets A and B. We further tested the hypothesis that repeated Bayesian optimization could support the generation of superior aptamers. We conducted an additional Bayesian optimization round (BO2) against Dataset A. After generating ten new candidates, we obtained an aptamer with approximately 20% greater binding activity (Supplementary Table 2 and Fig. 6c,d). These results indicate that RaptGen can propose aptamer derivatives in an activityguided manner and can provide opportunities to optimize their activities.
The present version of RaptGen does not consider the secondary structure of aptamers. Secondary structure information is critical for identifying active aptamers^{19,20}. In this subsection, we performed RNA secondary structure analyses for the aptamers obtained in the previous sections. Here we utilized the Rtools webserver^{40} for secondary structure analyses; the results are shown in Supplementary Fig. 3 (Dataset A), Supplementary Fig. 4 (Dataset B) and Supplementary Data 1 (more detailed results for datasets A and B). For Dataset A, the root of the structures (that is, structures around 3′part) gradually changed according to the relative activity (Supplementary Fig. 3), whereas the stemloop region around the middle of the sequence was conserved. This result indicates that our Bayesian optimization process optimizes the root of the structures to increase binding abilities. For Dataset B, we observed that highaffinity aptamers tended to form a specific structure (for example, BGMM4, BBO4 and BBO5) compared with the positive control (Supplementary Fig. 4). To confirm the reliability of this analyses, we also performed RNA secondary structure analyses using RNAfold^{41} (Supplementary Figs. 5 and 6 for datasets A and B, respectively), where a consistent result was obtained for Dataset A. For Dataset B, both tools showed highaffinity aptamers tend to form similar structures (note that both tools also suggest that secondary structures in Dataset B are unstable, that is, including relatively low basepairing probabilities in predicted structures).
Discussion
One of the popular models for handling highthroughput sequencing data (such as the HTSELEX data this study focused on) is a discriminative model that distinguishes real aptamer sequences from nonaptamer sequences; examples include DeepBind^{27} and DeeperBind^{28}. In training the discriminative models, both positive and negative sequences are necessary; positive sequences are usually shuffled to provide the negative data. As discriminative models are specialized for classification, they cannot generate new sequences. Conversely, RaptGen is a generative model that does not require negative data for training and can generate new aptamer sequences not included in the input sequences. This feature is essential for the three applications (see Fig. 1b–d) investigated in this study. Note that we compared RaptGen with DeepBind regarding motif detection performance (Supplementary Section 9). Jinho et al. proposed LSTMbased sequence generation using SELEX data^{33}. However, they did not consider sequence cluster information. RaptGen embeds sequences into a feature representation space, and thus RaptGen could visualize sequence classification and generate representative sequences from each cluster (Fig. 1). Moreover, this lowdimensional representation enables us to conduct Bayesian optimization, which is beneficial for generating variant sequences (Fig. 4). Hence, we believe that RaptGen is a superior generative model compared to LSTM. Hoinka et al. introduced several tools for aptamer discovery, such as AptaCluster, AptaMut and AptaSim^{22}. AptaCluster and AptaMut consider mutation information that is derived from the base substitution error rate of the polymerase enzyme. AptaCluster evaluates only actual sequencing data, whereas RaptGen can generate sequences that are not included in the sequencing data. We confirmed that the generated sequences did not appear in the SELEX experiment (see Supplementary Table 1, in which positive edit distances indicate that the corresponding sequence was not included in the original SELEX data). In addition, AptaMut deals with base substitutions but not insertions or deletions. As RaptGen has a profile HMM, it can embed indel information. This capability was confirmed using simulation data (Fig. 2). Because of indel tolerance, RaptGen could also generate sequences shorter than the actual sequencing data (Fig. 5), whereas AptaMut does not estimate such virtual sequences. We therefore believe that RaptGen incorporates mutational information better than AptaMut. In summary, to the best of our knowledge, there are no other datadriven methods to design optimized and truncated aptamers directly from HTSELEX data, and we believe that RaptGen will be a key tool for efficient aptamer discovery.
In this study, we demonstrated that RaptGen could propose candidates according to activity distribution. According to Bayesian optimization, a sequential construction of posterior distribution would allow us to optimize activity in the latent space. For another instance of Bayesian optimization application, one could set the acquisition function to various indicators other than the binding activity. We could therefore generate candidates according to other properties of interest, including inhibitory activity against enzymes or protein–protein interactions. The application of RaptGen for this purpose is promising.
Although RaptGen helps visualize and understand sequence motifs, this method has computational cost due to sequence probability calculation. Compared with the multicategorical model, which can calculate the sequence independently by position, and the autoregressive model, which only needs calculation on the previous nucleotides, profile HMM requires calculation on all possible state paths and previous (sub)sequences. The offset calculation cost for multicategorical, autoregression and profile HMM is \({{{\mathcal{O}}}}(1)\), \({{{\mathcal{O}}}}(l)\) and \({{{\mathcal{O}}}}(lm)\), respectively, where l is the number of previous characters including itself, and m is the model length of the profile HMM. Profile HMM also needs to frequently calculate the costly logsumexp function, leading to a longer training time. Additional studies are necessary to improve these issues.
There are two possible extensions of RaptGen. First, as shown in the previous section (see the ‘RaptGen application in aptamer discovery’ section), including the secondary structure in the sequence probabilistic model would improve RaptGen performance. In this direction, an alternative model such as profile stochastic contextfree grammar^{42} will be tested in followup studies. Another direction of the extension of RaptGen is to consider RNA sequences in all rounds in HTSELEX experiments.
RaptGen could advance HTSELEX datadriven RNA aptamer generation. As an RNA aptamer binds to the target protein by the structural complementarity, not by hybridization, interaction between the RNA and the protein is hardly predicted without binding experiments such as SELEX. Once enough number of aptamer–protein pairs and binding data is accumulated, de novo aptamer design without wetlab experiments will be realized in the future. Additionally, simulationbased methods such as molecular dynamics will also be effective to improve computational aptamer design (for example, optimization of aptamers^{43}).
Methods
Overall study parameters
The VAE proposed in this study is a CNNbased encoder with skip connections and a profile HMM decoder with several training methods. Two simulation datasets containing different types of motifs were generated to assess the interpretability of the decoder. Two independent HTSELEX datasets were subjected to the VAE, and the GMM was used for multiple candidate selection. Furthermore, Bayesian optimization was performed based on the activities of tested sequences proposed by GMM, and sequences were truncated by shortening the model length. The process is explained in detail in the following sections. An overview is shown in Fig. 1.
Architecture of the RaptGen model
VAE
Variational autoencoders consist of an encoder neural network that transforms input sequence x into latent distribution q_{ϕ}(z∣x) and a decoder neural network that reconstructs the input data from latent representation z by learning p_{θ}(x∣z) where φ and θ are model parameters. As VAE is a generative model, it can be evaluated by model evidence. However, given a dataset \({\bf{X}}=\{{\bf{x}}^{(i)}\}_{i = 1}^{N}\), the model evidence \({p}_{{{{\mathbf{\uptheta }}}}}\left({{{\bf{X}}}}\right)\) is not computationally tractable. Alternatively, we can maximize the ELBO, \({{{\mathcal{L}}}}({{{\mathbf{\uptheta }}}},{{{\mathbf{\upphi }}}};{{{\bf{X}}}})\) to calculate how the model describes the dataset using Jensen’s inequality,
where
where D_{KL}(p∣∣q) is the Kullback–Leibler divergence between distributions p and q. The first term on the righthandside is the regularization error, whereas the second term is the reconstruction error. Modeling this reconstruction error to suit the problem determines the structure of the latent space. Note that ELBO can be utilized as a measure to determine the optimal dimension of the latent space (that is, model selection)^{44}. In this paper we refer to the negative value of ELBO as model loss or loss.
CNNbased encoder with skip connections
The RaptGen encoder network consists of a stack of convolutional layers with skip connections. Each character was first embedded into a 32channel vector and went through seven convolutional layers with skip connections. Max pooling and fully connected layering then transform the vector into the distribution parameters of latent representation q_{ϕ}(z∣x). The structure is shown in detail in Supplementary Section 5.
Profile HMM decoder model
For modeling insertions and deletions, we used the profile HMM as the decoder for RaptGen. The profile HMM is a model that outputs by probabilistically moving from state to state (Supplementary Fig. 2). The profile HMM consists of match (M), insertion (I) and deletion (D) states. Each state emits specific outputs introduced to represent multiple sequence alignments^{45}. The match state has a high probability of emitting a particular character, the insertion state has an equal chance and the deletion state always emits a null character. These probabilities are called emission probabilities. The other probabilistic parameter is the transition probability. This defines the likeliness of transition from a state to the next state. In a profile HMM, the emission probability e_{S}(c) is the probability of output character c from state S, and transition probability \({a}_{S,S^{\prime} }\) is the probability of changing state from S to \(S^{\prime}\). These are defined as e_{S}(c) = p(c∣S) and \({a}_{S,S^{\prime} }=p(S^{\prime}  S)\), respectively.
As profile HMM is a model in which the state transition depends only on the previous single state, the sequence probability p(x) can be written by using the Markov chain rule:
where π is the possible state path, π_{last} is the last state in the path, L is the length of the sequence, x_{j:k} is the subsequence of x from the jth character to the kth character on both ends, x_{0} is a null character that indicates the start of the sequence, x_{L+1} is a null character that indicates the end of the sequence, and m is the number of matching states in the model. It is computationally expensive to calculate the sequence probability for all possible paths. Introducing a forward algorithm can lower the computational cost to \({{{\mathcal{O}}}}(Lm)\). The forward algorithm consists of a forward variable defined as \({f}_{j}^{S}(i)=p({x}_{0:i},{\uppi }_{{{\mbox{last}}}}={S}_{j})\), and the probability can be calculated recurrently by
The emission probability of the insertion state does not depend on the position of the motif; therefore, it is set to a constant of onequarter for RNA sequences. We set the probability to output the final endofsequence token p(x_{L+1}∣M_{m+1}) to 1.
Other tested decoders
Three probabilistic models were tested: the multicategorical model, the autoregressive model and profile HMM. The probabilistic models each have different sequence probability assignments. The multicategorical model assigns a categorical distribution to each position of the sequence. Given the representation vector z and the probability of the sequence x, p(x∣z) is calculated by \(p({{{\bf{x}}}} {{{\bf{z}}}})=\mathop{\prod }\nolimits_{i = 1}^{L}p({x}_{i} z)=\mathop{\prod }\nolimits_{i = 1}^{L}{{{\rm{Cat}}}}({x}_{i} {f}_{{{{\mathbf{\uptheta }}}}}({{{\bf{z}}}}))\), where Cat is a categorical distribution and f_{θ} is a neural network. The autoregressive model outputs a probability according to previous data. The probability of the sequence p(x∣z) is calculated by \(p({{{\bf{x}}}} {{{\bf{z}}}})=\mathop{\prod }\nolimits_{i = 1}^{L}p({x}_{i} {x}_{0:i1},{{{\bf{z}}}})=\mathop{\prod }\nolimits_{i = 1}^{L}{{{\rm{Cat}}}}({x}_{i} {g}_{{{{\mathbf{\uptheta }}}}}({x}_{0:i1},{{{\bf{z}}}}))\), where g_{θ} is a recurrent neural network. The architectures of networks f_{θ} and g_{θ} are described in Supplementary Section 5.
Training techniques
State transition regularization was introduced to train RaptGen. Weighed regularization loss was also introduced for all VAEs, including RaptGen.
State transition regularization
A VAE can be trained with backpropagation by treating ELBO as a loss function. In addition to ELBO, a Dirichlet prior distribution was used on the transition probabilities to avoid unnecessary state transitions in the early rounds of training RaptGen. By penalizing transitions other than matchtomatch at the beginning of the learning process, insertions and deletions are forced to occur less. This allows continuous motifs to be learned and lowers the probability of obtaining models with meaningless transitions traversing deletion states.
The probability of categorical variable p = {p_{k}} sampled from a Dirichlet distribution is
where α = {α_{k}} is the Dirichlet distribution parameter. The regularization term is the sum of the logodds ratio of the training probability from the matching state over each position i, defined as
where p_{i} is \([{a}_{{M}_{i1},{M}_{i}}\,{a}_{{M}_{i1},{I}_{i}}\,{a}_{{M}_{i1},{D}_{i}}]\) which indicates the transition probabilities from the ith matching state, and α(w_{m}) = [1 + w_{m} 1 1] is the parameter representing the induction weight w_{m}. To make this loss zero at a specific round R, w_{m} was set to 4(1 − e/R), where e is the training epoch. This regularization term was added to the ELBO during training.
Weighted regularization loss
The scaling param eter for the regularization was introduced to train the VAE. Scaling the regularization term of the loss function of the VAE to minimize the value in the early epoch of training improves latent embedding^{46}. The scale is defined as e/E, where e is the training epoch, and E is the maximum number of epochs to have scaling. After the E epochs of training have finished, the scale is set to 1.
Training settings
All sequences in the training set were filtered first. Sequences with exact matching adapters, exact matching sequence design lengths, and sequences read more than once remained. The sequences were split into training and test datasets in a 9:1 ratio. The model with the smallest test loss was selected through iterations. For the weighted regularization loss, the maximum number to have scaling E was set to 50. The state transition regularization parameter R was set to 50 for the profile HMM decoder. Adam was used as the training optimizer with default parameters^{47}. All of the networks were trained up to 2,000 epochs with early stopping when the test loss was not updated for 50 epochs.
RaptGen evaluation
Simulation data
For the simulation data shown in Fig. 2a, ten different motif sequences of length ten were generated and single nucleotide modification with a 10% error rate was added. In other words, each motif sequence had a 3.33 … % chance of deletion, insertion or modification at a specific position. After this procedure, sequences were randomly extended to reach 20 nt by adding nucleotides to the right and the left. We made 10,000 sequences in total, with no duplication.
For the simulation data shown in Fig. 3a, sequences containing paired motifs were generated. Two 5 nt motifs were made, and then one of the motifs was randomly deleted at a probability of 25% each. If both motifs remained, 2 to 6 nt were randomly inserted between the left and right motifs. Sequences were then randomly extended to reach 20 nt, and 5,000 of these sequences were generated.
SELEX data
SELEX data used in this study were obtained previously^{20}. The sequences are available as DRA009383 and DRA009384, which we call datasets A and B, respectively. These SELEX were conducted using a conventional selection method. Briefly, the target proteins were immobilized on beads. After washing, bound RNA was recovered and amplified using reversetranscriptionPCR. Dataset A, targeting human transglutaminase 2, consists of nine SELEX rounds from 0 to 8, and Dataset B, targeting human integrin alpha V beta 3, consists of four rounds from 3 to 6. The round with the smallest unique ratio U(T) with the restriction of U(T) > 0.5 was used, defined as
where \({{{\mathcal{D}}}}(T)\) are the whole sequences, read in round T. The fourth round was selected for each dataset.
RaptGen applications in aptamer discovery
GMM for initial sequence selection
We used the GMM for initial sequence selection from the obtained latent space. To efficiently select ten points to be evaluated, GMM was run 100 times with ten components, and the mean vectors of the model with the best evidence (likelihood) were selected.
Surface plasmon resonance assay
The surface plasmon resonance assays were performed using a Biacore T200 instrument (GE Healthcare) as described previously with slight modifications^{20}. The target proteins of datasets A and B were human recombinant transglutaminase 2 (R&D systems, catalogue no. 4376TG) and human recombinant integrin alpha V beta 3 (R&D systems, catalogue no. 3050AV), respectively. Aptamers were prepared with fixed primer regions and 16mer poly(A)tails as follows: 5′–GGGAGCAGGAGAGAGGUCAGAUG–(variable sequence)–CCUAUGCGUGCUAGUGUGA–(polyA)–3′ for dataset A and 5′–GGGAGAACUUCGACCAGAAG–(variable sequence)–UAUGUGCGCAUACAUGGAUCCUC–(polyA)–3′ for Dataset B. Previously reported aptamers were used as positive controls. All evaluated sequences are listed in Supplementary Section 2 (Supplementary Table 3). Aptamers were prepared by in vitro transcription using a mutant T7 RNA polymerase and 2′fluoropyrimidine NTPs. The running buffer consisted of 145 mM NaCl, 5.4 mM KCl, 0.8 mM MgCl_{2}, 1.8 mM CaCl_{2}, 0.05% Tween20 and 20 mM TrisHCl (pH 7.6). A 5′biotinylated dT16 oligomer was immobilized to both active and reference flow cells of the streptavidin sensor chip (BR100531, GE Healthcare). The poly(A)tailed RNA was captured in the active flow cell by complementary hybridization at a concentration of 300 nM and a flow rate of 20 μl min^{−1}, with an association time of 60 s. The proteins were injected into the flow cells of the sensor chip at a concentration of 50 nM and a flow rate of 20 μl min^{−1}, with an association time of 60 s. To regenerate the sensor chip, bound aptamers were completely removed by injecting 6 M urea. Data were obtained by subtracting the reference flow cell data from the active flow cell data. The ratio of the proteinbinding level to aptamercapturing level was used as binding activity. Percent relative binding activities of positive control aptamers are shown in the results and discussion section. For normalization of Dataset A, the cycle numberdependent reduction of control aptamer binding was estimated.
Multipoint Bayesian optimization via local penalization
Bayesian optimization uses both the search for sequences that have not been explored to a reasonable extent and the utility of utilizing sequences with known affinity to select the next sequence for evaluation. The local penalization function is a method that can determine the multipoint expected improvement of candidates by considering the smoothness of the potential function^{48}. As it converges faster than qEI^{49} and other methods for simultaneous optimization. We used this method to perform multipoint optimization. Implementation was performed with the GPyOpt package^{50}.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The HTSELEX sequences are available as DRA009383 (Dataset A) and DRA009384 (Dataset B) in DDBJ. Source Data are provided with this paper.
Code availability
The RaptGen program is available via ref. ^{51}.
References
Ni, S. et al. Recent progress in aptamer discoveries and modifications for therapeutic applications. ACS Appl. Mater. Interfaces 13, 9500–9519 (2020).
Adachi, T. & NakamuraAptamers, Y. A review of their chemical properties and modifications for therapeutic application. Molecules 24, 4229 (2019).
Song, S., Wang, L., Li, J., Fan, C. & Zhao, J. Aptamerbased biosensors. Trends Anal. Chem. 27, 108–117 (2008).
Zhou, W., Huang, P.J. J., Ding, J. & Liu, J. Aptamerbased biosensors for biomedical diagnostics. Analyst 139, 2627–2640 (2014).
Eyetech Study Group. et al. Preclinical and phase 1A clinical evaluation of an antiVEGF pegylated aptamer (EYE001) for the treatment of exudative agerelated macular degeneration. Retina 22, 143–152 (2002).
Ciesiolka, J., Gorski, J. & Yarus, M. Selection of an RNA domain that binds Zn^{2+}. RNA 1, 538–550 (1995).
Tombelli, S., Minunni, M., Luzi, E. & Mascini, M. Aptamerbased biosensors for the detection of HIV1 TAT protein. Bioelectrochemistry 67, 135–141 (2005).
Jolma, A. et al. DNAbinding specificities of human transcription factors. Cell 152, 327–339 (2013).
Binning, J. M. et al. Development of RNA aptamers targeting Ebola virus VP35. Biochemistry 52, 8406–8419 (2013).
Baker, B. R. et al. An electronic, aptamerbased smallmolecule sensor for the rapid, labelfree detection of cocaine in adulterated samples and biological fluids. J. Am. Chem. Soc. 128, 3138–3139 (2006).
Labib, M. et al. Aptamerbased viability impedimetric sensor for bacteria. Anal. Chem. 84, 8966–8969 (2012).
Tuerk, C. & Gold, L. Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science 249, 505–510 (1990).
Ellington, A. D. & Szostak, J. W. In vitro selection of RNA molecules that bind specific ligands. Nature 346, 818–822 (1990).
Zhao, Y., Granas, D. & Stormo, G. D. Inferring binding energies from selected binding sites. PLoS Comput. Biol. 5, e1000590 (2009).
Jolma, A. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20, 861–873 (2010).
Kupakuwana, G. V., Crill, J. E. II, McPike, M. P. & Borer, P. N. Acyclic identification of aptamers for human alphathrombin using overrepresented libraries and deep sequencing. PLoS ONE 6, e19395 (2011).
Jiang, P. et al. MPBind: a metamotifbased statistical framework and pipeline to predict binding potential of SELEXderived aptamers. Bioinformatics 30, 2665–2667 (2014).
Caroli, J., Taccioli, C., Fuente, A. D. L., Serafini, P. & Bicciato, S. APTANI: a computational tool to select aptamers through sequencestructure motif analysis of HTSELEX data. Bioinformatics 32, 161–164 (2016).
Caroli, J., Forcato, M. & Bicciato, S. APTANI2: update of aptamer selection through sequencestructure analysis. Bioinformatics 36, 2266–2268 (2020).
Ishida, R. et al. RaptRanker: in silico RNA aptamer selection from HTSELEX experiment based on local sequence and structure information. Nucl. Acids Res. 48, e82–e82 (2020).
Kim, N., Izzo, J. A., Elmetwaly, S., Gan, H. H. & Schlick, T. Computational generation and screening of RNA motifs in large nucleotide sequence pools. Nucl. Acids Res. 38, e139–e139 (2010).
Hoinka, J. et al. Large scale analysis of the mutational landscape in HTSELEX improves aptamer discovery. Nucl. Acids Res. 43, 5699–5707 (2015).
Zhou, Q., Xia, X., Luo, Z., Liang, H. & Shakhnovich, E. Searching the sequence space for potent aptamers using SELEX in silico. J. Chem. Theory Comput. 11, 5939–5946 (2015).
Hiller, M., Pudimat, R., Busch, A. & Backofen, R. Using RNA secondary structures to guide sequence motif finding towards singlestranded regions. Nucl. Acids Res. 34, e117–e117 (2006).
Dao, P. et al. AptaTRACE elucidates RNA sequencestructure motifs from selection trends in HTSELEX experiments. Cell Syst. 3, 62–70 (2016).
Hoinka, J., Zotenko, E., Friedman, A., Sauna, Z. E. & Przytycka, T. M. Identification of sequencestructure rna binding motifs for SELEXderived aptamers. Bioinformatics 28, i215–i223 (2012).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNAand RNAbinding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Hassanzadeh, H. R. & Wang, M. D. Deeperbind: enhancing prediction of sequence specificities of DNA binding proteins. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 178–183 (IEEE, 2016).
Pan, X., Rijnbeek, P., Yan, J. & Shen, H.B. Prediction of RNAprotein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC genomics 19, 511 (2018).
Hinton, G. E. Deep belief networks. Scholarpedia 4, 5947 (2009).
Kingma, D. P. & Welling, M. Autoencoding variational bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).
Goodfellow, I. et al. Generative adversarial nets. Adv. Neural Information Process. Syst. 27, 2672–2680 (2014).
Im, J., Park, B. & Han, K. A generative model for constructing nucleic acid sequences binding to a protein. BMC Genomics 20, 1–13 (2019).
Killoran, N., Lee, L. J., Delong, A., Duvenaud, D. & Frey, B. J. Generating and designing DNA with deep generative models. Preprint at https://arxiv.org/abs/1712.06148 (2017).
Kusner, M. J., Paige, B. & HernándezLobato, J. M. Grammar variational autoencoder. Preprint at https://arxiv.org/abs/1703.01925 (2017).
GómezBombarelli, R. et al. Automatic chemical design using a datadriven continuous representation of molecules. ACS Central Sci. 4, 268–276 (2018).
Hochreiter, S. & Schmidhuber, J. Long shortterm memory. Neural Comput. 9, 1735–1780 (1997).
Lozupone, C., Changayil, S., Majerfeld, I. & Yarus, M. Selection of the simplest RNA that binds isoleucine. RNA 9, 1315–1322 (2003).
Gonzalez, J., Longworth, J., James, D. C. & Lawrence, N. D. Bayesian optimization for synthetic gene design. Preprint at https://arxiv.org/abs/1505.01627 (2015).
Hamada, M. et al. Rtools: a web server for various secondary structural analyses on single RNA sequences. Nucl. Acids Res. 44, W302–307 (2016).
Lorenz, R. et al. ViennaRNA Package 2.0. Algorithms Mol Biol 6, 26 (2011).
Sakakibara, Y. et al. Stochastic contextfree grammers for tRNA modeling. Nucl. Acids Res. 22, 5112–5120 (1994).
Bell, D. R. et al. In silico design and validation of highaffinity RNA aptamers targeting epithelial cellular adhesion molecule dimers. Proc. Natl Acad. Sci. USA 117, 8486–8493 (2020).
Corduneanu, A. & Bishop, C. Variational bayesian model selection for mixture distributions. In Proc. 8th International Conference on Artificial Intelligence and Statistics 27–34 (Morgan Kaufmann, 2001).
Krogh, A., Brown, M., Mian, I. S., Sjolander, K. & Haussler, D. Hidden Markov models in computational biology. applications to protein modeling. J. Mol. Biol. 235, 1501–1531 (1994).
Bowman, S. R. et al. Generating sentences from a continuous space. Preprint at https://arxiv.org/abs/1511.06349 (2015).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
González, J., Dai, Z., Hennig, P. & Lawrence, N. Batch Bayesian optimization via local penalization. In Proc. 19th International Conference on Artificial Intelligence and Statistics 648–657 (PMLR, 2016).
Ginsbourger, D, Le Riche, R. & Carraro, L. Kriging is wellsuited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems 131–162 (Springer, 2010).
The GPyOpt authors. GPyOpt: A Bayesian Optimization Framework in Python (GitHub, 2016); http://github.com/SheffieldML/GPyOpt
The RaptGen authors. Raptgen Version 1.0 (Zenodo, 2022); https://doi.org/10.5281/zenodo.6470866
Auer, P. Using confidence bounds for exploitationexploration tradeoffs. J. Mach. Learn. Res. 3, 397–422 (2002).
Acknowledgements
Computation for this study was performed in part on the NIG supercomputer at ROIS National Institute of Genetics. N.I. and M.H. thank members of Hamada Laboratory for their valuable comments. This work was supported by JST CREST (grant nos. JPMJCR1881 and JPMJCR21F1) Japan.
Author information
Authors and Affiliations
Contributions
M.H. designed and supervised the research. N.I. developed the RaptGen program and performed computational experiments. A.T., A.K. and N.Y. performed wet experiments. M.H., N.I. and A.T. analyzed the data, discussed the results and wrote the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Ruhong Zhou and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–6, Tables 1 and 2, and Sections 1–9
Supplementary Data
The detailed results for secondary structure analyses.
Source data
Source Data Fig. 2
Statistical Source Data
Source Data Fig. 3
Statistical Source Data
Source Data Fig. 4
Statistical Source Data
Source Data Fig. 5
Statistical Source Data
Source Data Fig. 6
Statistical Source Data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Iwano, N., Adachi, T., Aoki, K. et al. Generative aptamer discovery using RaptGen. Nat Comput Sci 2, 378–386 (2022). https://doi.org/10.1038/s43588022002496
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s43588022002496
This article is cited by

Deep generative design of RNA family sequences
Nature Methods (2024)

In vitro selection of aptamers and their applications
Nature Reviews Methods Primers (2023)

AIpowered aptamer generation
Nature Computational Science (2022)