De novo generation of hit-like molecules from gene expression signatures using artificial intelligence

Finding new molecules with a desired biological activity is an extremely difficult task. In this context, artificial intelligence and generative models have been used for molecular de novo design and compound optimization. Herein, we report a generative model that bridges systems biology and molecular design, conditioning a generative adversarial network with transcriptomic data. By doing so, we can automatically design molecules that have a high probability to induce a desired transcriptomic profile. As long as the gene expression signature of the desired state is provided, this model is able to design active-like molecules for desired targets without any previous target annotation of the training compounds. Molecules designed by this model are more similar to active compounds than the ones identified by similarity of gene expression signatures. Overall, this method represents an alternative approach to bridge chemistry and biology in the long and difficult road of drug discovery.

The difficulty of the drug discovery process arises from the fact that only a small fraction of the theoretically possible 10 60 drug-like molecules are therapeutically relevant 1,2 . At the beginning, one of the most challenging tasks in this scenario is the hit identification, namely the identification of small molecules with an adequate (but usually weak) activity on a specific target that then could be used as a starting point for the chemical optimization process. Hit identification can be achieved by knowledge-based approaches that use previous information coming from endogenous ligands, patents, scientific literature or even structural information of the biomolecule 3 . This task becomes even more difficult when little or none previous information is available, which usually happens when working with a novel target family or the so called orphan targets. These cases are restricted to serendipity-based (also known as 'brute-force') methods such as the use of combinatorial libraries or high-throughput screening (HTS) 3 .
Despite these methods generate copious bioactivity data, they are not very efficient since large amount of resources are required compared to the small number of hits discovered 4 .
One alternative is to use computational methods and data-driven approaches to aid hit identification 4,5 . Techniques such as virtual screening aim to identify hits from virtual libraries containing large number of molecules usually by similaritybased searches or by molecular docking 4,6,7 . Another technique is automated molecular generation or automated de novo design where new molecules with specific properties are automatically generated by methods such as inverse QSAR or genetic algorithm [8][9][10][11] . Recently, artificial intelligence, in particular generative models, has been extensively used for molecular de novo design, compound optimization and hit identification 12,13 . Generative models are very attractive since they can 'learn' the properties of specific real training examples and then automatically generate new synthetic entities with similar characteristics. Several groups in industry and academia have reported the use of recurrent neural networks combined with reinforcement learning as a generative model to design focused compound libraries for HTS with particular physicochemical properties or with activity towards a specific target [14][15][16][17][18] . Other generative models such as variational autoencoders (VAE) have been used to automatically optimize molecules to improve their physicochemical and drug-likeness properties 19 . In a similar way, generative adversarial networks (GAN) have been used to produce sets of new molecules with similar properties to know active molecules or photovoltaic materials 20,21 .
Until now, molecular generative models have been designed as chemocentric approaches that barely take into account the resulting biology of the ligandtarget interaction. Herein, we report the first generative model that bridges systems biology and molecular design. By doing this we could generate molecules that not only have high probability to act on a biological target, but also have high probability to produce a desired biological effect at cellular level.
For this we combined a generative adversarial network with transcriptomic data 22 , which already has been shown to be useful in the identification of new active molecules [23][24][25] , drug repurposing 26,27 , mode of action 28,29 , prediction of side-effects 30-32 , among other applications. This approach present several advantages such as the generation of hit-like molecules without the need of previous knowledge of active compounds, biological activity data or target annotations. In addition, it can be seen as a one-fit-many-targets approach since the same model can generate molecules for several targets or biological states.

Expression Signatures
Generative adversarial networks (GANs) are powerful generative models that produce new data points with a similar distribution of the real data 33 . These specific networks are composed by two models, namely generator 0 ( ) and discriminator 0 ( ), which compete with each other. The generator is optimized to produce new data points similar to those in the real data distribution. On the other hand, the discriminator is optimized to distinguish between synthetic data points produced by the generator and those data points coming from the real data distribution. These two models can be seen as the players of a min-max game where at each training step the generator tries to produce synthetic data points more similar to the real ones in order to fool the discriminator, whereas the discriminator becomes better in distinguish real data points from synthetic ones.
Due to the great range of applications of GANs, many other extensions to this architecture have been reported. In this work we used a combination of two of them, namely conditional GANs 34 and the Wasserstein GAN with gradient penalty (WGAN-GP) 35,36 . In the former one, the generator is conditioned by a variable ( 0 ( , )), meaning that the synthetic entities created by the generator will fulfil this condition. On the other hand, the WGAN-GP is a variation which minimizes an approximation of the Earth-Mover distance (or Wasserstein-1 distance), instead of minimizing the Jensen-Shannon divergence as in normal GANs. In this particular implementation, the 1-Lipschitz continuity is enforced by using a gradient penalty as an alternative of the gradient clipping scheme (see original paper 36 for more details). In this way, the final loss functions for 0 ( , ) and 0 ( ) are: where and are a molecule representation and a gene expression signature, respectively, sampled from the real data distribution , is a vector with random noise sampled from a Gaussian distribution ( ) and 0 is a function (in this case, a neural network) that measures the probability of a gene expression signature corresponds to a molecular representation. The λ and terms are regularization parameters, where the former one balances the influence of the gradient penalty term into the discriminator loss. Similarly, the term weights the influence of the 0 function in the generator loss. Both, the λ and terms were empirically set to a value of 10.
Recent reports suggest that stacking two or more GANs produce synthetic data with higher definition compared to just using a single GAN [37][38][39] . In this work we staked two GANs where the second one (Stage II) refines results of the first one (Stage I). The setup of Stage II is similar to Stage I, i.e. it is also composed by a generator ( 1 ( 0 , )) and a discriminator ( 1 ( )). The only difference is that instead of taking random noise as input, 1 takes the output of 0 ( 0 = 0 ( , )) and the gene expression signature ( ). In this sense, the loss functions for 1 ( 0 , ) and 1 ( ) can be written as: (see methods for details) to be able to generate compounds from a desired gene expression signature during the prediction phase.

Generating molecules from compound-induced gene expression signatures
Molecule generation is not an easy task, especially when generated molecules are required to meet specific properties. In this case generated molecules are required to induce a particular gene expression signature when exposed to a cell. To evaluate our approach, we generated 1000 molecular representations for each of the 327 gene expression signatures corresponding to 200 compounds not present in the training set. Notice that the number of gene expression signatures is larger than the number of compounds since some of these were profiled in more than one condition (i.e. cell line or concentration). In average, each signature produced ~10% of valid molecules, most of them (~7% of the total) correspond to unique smiles representation, but only a small fraction (~1.5%) were considered easy to synthesize (synthetic accessibility score 41 < 4.5). Interestingly, no improvement in the number of generated molecules was observed in stage II compared to the results in stage I of the stacked GAN. Figure 2a shows the distribution of valid and synthesizable compounds generated for each of the 327 gene expression signatures.
Following the similarity principle 42,43 we would expect that molecules inducing similar gene expression signatures will have to some extend similar molecular structure or at least share some pharmacophoric features. Figure

Generating inhibitor-like molecules from knock-out gene expression signatures
We evaluate if our approach is capable to generate inhibitor-like molecules without any previous information of the molecular target but just using the gene expression signature of the knocked-out target. The hypothesis behind is that a protein knocked-out will present a gene expression signature that resemble the one observed when the same target protein is inhibited by a potent and   Figure 3b shows examples of generated molecules and their closest known active molecule for each of the ten targets. It is surprising to see that in many cases the generated molecule shares functional groups and even similar molecular scaffold with the active molecule. As seen from these examples, the knock-out gene expression signature of the target was able to direct the molecular generation to a specific areas of the chemical space associated with active molecules.

Optimizing scaffolds towards a gene expression signature
Despite that stage II of the stack GAN was designed to refine results from Stage I, it can also be used to refine any other molecule or scaffold. As a proof of concept, we evaluated if the Stage II of our approach is capable to optimize the benzene ring (the most common scaffold in the dataset) towards active-like compounds for different targets. For this, we encoded the SMILES of the benzene ring into a latent space representation using the encoder of the SMILEStogrammar model which then was fed into the Stage II generator

Molecular generation vs similarity search
Previous studies have performed similarity searches between gene expression signatures induced by different compounds to find molecules that can produce similar effects (e.g. drug repurposing) or between the signatures induced by a compound and a knocked-out target protein in order to find new active molecules. [23][24][25]27,46 Although there are several success stories only using similarity search, the major constraint of this approach is that the chemical space is restricted to the initial pool of compounds with measured gene expression signatures. In this sense, using a generative model can help to escape from the limitation of the chemical space by generating new compounds tailored to match the query gene expression signature.
To evaluate the advantages presented by generative model over classical similarity search we compared these methods in two different tasks. First, we  Finally, we showed that our method generates molecules more similar to known active compounds than the ones that can be found by a similarity search, which is the state-of-the-art method to navigate the L1000 data. The fact that our method does not rely on target annotations or activity data makes it very useful in cases where such information is not available such as in target deorphanization projects. We expect to apply this method to design directed chemical libraries in order to increase the chances to find hits in HTS campaigns and in this sense evaluate the performance of this method in a real drug discovery setting. In addition, we are planning to expand this method to automatically generate molecules with multi target signature or able to reverse toxicological related or disease gene expression signatures.

Methods
Data set: In this study we used the L1000 Cmap database recently reported. 22 This database contains induced gene expression profiles of more than 25 where the initial xi is the concatenation of the molecular representation and the processed gene expression signature, and W1 and W2 are trainable weights.
The output of the residual block is xi+1 which is used as input of the next residual block. Finally, after the series of residual blocks (n = 2 in this work), the output is fed into a dense layer with tanh as activation function which acts as output layer.  was recently proposed as an efficient way to evaluate the efficacy of generative models. 54 This metric takes the mean and covariance of the real distribution (µr and Cr, respectively), together with the mean and covariance of the generated distribution (µg and Cg) in the following formulation: 2 ((µ r , C r ), (µ g , C g )) = ‖µ g − µ r ‖ 2 2 + (C g + C r − 2(C g C r ) 1/2 ) Generating molecules from gene expression signatures: As validation procedure we challenged the model to generate molecules from a gene expression signature coming from knocking-out a target protein. For this, we used 705 gene expression signatures from the L1000 database 22 corresponding to the knock-outs of 53 target proteins in MCF7 generated by CRISPR technology. Know active molecules for these targets were extracted from the Excape database, 44 where 28 of the 53 protein targets were present and only 10 had more than 1000 active molecules. For these 10 targets we generated 1000 molecular representation for each gene expression signature (148 in total) and evaluated the model by comparing the generated molecules to the known inhibitors of these targets. We evaluate the similarity between the generated compounds and the known active molecules using Fraggle similarity and Tanimoto similarity using MACCS keys and Morgan Fingerprints (radius = 3, 1024 bits) with RDKIT. 55 It is worth mentioning that during these validation tasks both the number of valid molecules (validity measure) and the number of unique molecules (uniqueness measure) were recorded as sanity check for the generative model.