Bidirectional generation of structure and properties through a single molecular foundation model

Recent successes of foundation models in artificial intelligence have prompted the emergence of large-scale chemical pre-trained models. Despite the growing interest in large molecular pre-trained models that provide informative representations for downstream tasks, attempts at multimodal pre-training in the molecular domain have been limited. To address this, here we present a multimodal molecular pre-trained model that incorporates the modalities of structure and biochemical properties, drawing inspiration from recent advances in multimodal learning techniques. Our proposed pipeline of data handling and training objectives aligns the structure and property features in a common embedding space, which enables the model to capture bidirectional information between a molecule's structure and its properties. These contributions yield synergistic knowledge, allowing us to tackle both multimodal and unimodal downstream tasks through a single model. Through extensive experiments, we demonstrate that our model can solve various meaningful chemical challenges, including conditional molecule generation, property prediction, molecule classification, and reaction prediction.


Introduction
Capturing complex relations between chemical objects and their properties is the essence of numerous chemical challenges. During the last decade, artificial intelligence has emerged as a promising tool in chemistry research for estimating many biochemical properties and interactions between molecules, polymers, and proteins, which are difficult to obtain experimentally [1][2][3]. Various deep learning-based approaches in the chemical domain employed deep neural networks to extract desired characteristics like intrinsic properties, biochemical activities, and chemical reactions from raw molecule data [4][5][6]. In particular, de novo molecule design has been extensively studied using recurrent networks 7, variational autoencoders 8,9, graph networks 10, etc. [11][12][13]. More recently, unsupervised approaches for learning better representations of chemical inputs have been suggested [14][15][16] to overcome the limitation of learning separate features for each task in a supervised manner. These recent approaches are on the same track as the concept of foundation models, which are trained with large datasets and are often considered a new paradigm of deep learning 17,18.
Specifically, the concept of pre-training a neural network in a self-supervised manner for a better feature representation has been adapted for various chemical fields [14][15][16]. N-Gram Graph 19 and GROVER 20 used a graph neural network and a graph transformer network, respectively, to obtain a pre-trained model from the molecular graph. ChemBERTa-2 21 trained a RoBERTa model with 77 million molecules to build a molecular foundation model, by training the model to predict 200 different chemical property values.
Meanwhile, in the computer vision field, multimodal pre-training methods like Vision-Language Pre-training (VLP) 22 have achieved outstanding performance in downstream tasks that require an understanding of both image and text. Most modern VLP models utilize the Transformer 23 architecture and its cross-attention mechanism to learn the correlation between different modalities 24,25.
Moreover, several works introduced contrastive learning, which assimilates features with the same context and distances semantically unrelated features, to align image and language features in a common feature space [26][27][28]. VLP enables various tasks such as visual question answering 29, image-text retrieval 30, text-driven image generation 31, image-driven text generation 32, etc., which are not possible using single-modality foundation models.
Inspired by the success of multimodal learning, several recent works tried to obtain a better feature of a molecule by leveraging knowledge from different data representations. Winter et al. trained a translation model between Simplified Molecular-Input Line-Entry System (SMILES) and International Chemical Identifier (InChI) key to get a feature vector with meaningful information that both molecular representations have in common 33. Zhu et al. used the self-supervised training method of BYOL 34 between different molecule representations of SMILES and molecular graphs to build a dual-view model 35. However, these works introduced multimodality only for the enhancement of a molecule feature for unimodal tasks, not for the interplay between those different modalities. Furthermore, since SMILES, InChI, and graph representations contain almost identical information about the connection between atoms in a molecule, new emergent properties are unlikely to arise from multimodal learning between these different molecule representations.
In this work, we are interested in the cross-modal comprehension between molecule structure and the associated properties, which facilitates solving meaningful tasks in many applications like property prediction, conditional molecule design 36,37, etc. Taking a step further from multi-task learning methods 38, which use the prepared properties as labels to extract general features 21, our approach regards a set of properties as a stand-alone modality that represents the input molecule and suggests that multimodal learning for molecules with this property modality can provide much more informative features. Specifically, we propose a novel molecule Structure-Property Multi-Modal foundation model (SPMM), which is pre-trained with a wide range of molecules' structures and vectors of their properties and allows various chemistry experiments in silico. By employing a Transformer architecture 23, the intramodal feature extraction and intermodal fusion can be done with self-attention and cross-attention mechanisms, respectively.
Our experimental results show that simultaneous learning of structural features with information from the associated properties through a single foundation model gives us a better representation that can be fine-tuned for various downstream tasks. Specifically, by treating both structure and property symmetrically, the model can perform bidirectional generation and prediction with a single pre-trained model, which was not possible before.
Fig. 1(a) illustrates the overall model architecture and training objectives for SPMM. The framework of SPMM extends the structure of the dual-stream VLP models 27,28,39. Dual-stream VLP models encode the input for each modality with a unimodal encoder, then use another encoder module to perform cross-attention by using one modality feature as a query and the other modality feature as a key/value. When a training molecule is given, SPMM takes the molecule's SMILES string and its property vector (PV) as multimodal data inputs as shown in Fig. 1(a). The SMILES and PV are passed through their corresponding unimodal encoders, which perform self-attention where the embedded inputs become the key, query, and value. After the two unimodal features are obtained, contrastive learning aligns the SMILES and PV features into the same embedding space by assimilating the features that contain the same context. This is known to improve the model performance by making cross-modal encoding easier and guiding the unimodal encoded features to reflect more semantics of the input 27. Then, the encoded SMILES and PV features are passed through the fusion encoder, which performs cross-attention between the SMILES and PV features. This single fusion encoder can perform cross-attention with an alternation of its query and key/value input because the contrastive learning aligns the output of the SMILES encoder and the PV encoder into the same feature space 39. The fusion encoder is pre-trained with Next Word Prediction (NWP) for SMILES, Next Property Prediction (NPP), and SMILES-PV Matching loss (SPM). Prediction of the next component from the given transformer input is a commonly used self-supervised learning objective, and our NWP and NPP tasks make the model learn the contextual relationship between SMILES tokens and properties with the aid of the other modality's semantic feature. Additionally, SPM predicts whether a given pair of SMILES and PV represents the same molecule or not.
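The contrastive alignment step described above can be illustrated with a minimal sketch. It assumes a symmetric cross-modal InfoNCE-style loss on L2-normalized features; the function names, the temperature of 0.07, and the toy features are illustrative assumptions, and the sketch omits the intra-modal terms, feature queue, and momentum teacher that the full training objective uses.

```python
import math
import random


def normalize(vector):
    """Scale a feature vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def contrastive_alignment_loss(smiles_feats, pv_feats, temperature=0.07):
    """Symmetric SMILES-to-PV / PV-to-SMILES loss; pair i matches pair i."""
    s = [normalize(v) for v in smiles_feats]
    p = [normalize(v) for v in pv_feats]
    n = len(s)
    # Cosine similarity between every SMILES feature and every PV feature.
    sim = [[sum(a * b for a, b in zip(s[i], p[j])) / temperature
            for j in range(n)] for i in range(n)]
    s2p = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    p2s = sum(cross_entropy_row([sim[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return 0.5 * (s2p + p2s)


random.seed(0)
smiles_feats = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
# Hypothetical PV features: one set nearly aligned with the SMILES features,
# one set unrelated to them.
paired_pv = [[x + random.gauss(0, 0.01) for x in v] for v in smiles_feats]
unrelated_pv = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_paired = contrastive_alignment_loss(smiles_feats, paired_pv)
loss_unrelated = contrastive_alignment_loss(smiles_feats, unrelated_pv)
```

In this toy setting the loss is lower when each PV feature lies close to its own SMILES feature, which is the behaviour the alignment objective rewards.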
Once trained, SPMM can be used for various bidirectional downstream tasks that require an understanding of both SMILES and properties, like property prediction (SMILES-to-properties) and property-conditioned molecule generation (properties-to-SMILES, also referred to as inverse-QSAR 37), as shown in Fig. 1(b). Furthermore, the pre-training objectives that we used allow the pre-trained SPMM to be applied to single-modality tasks as well, such as molecule classification and reaction prediction (see Fig. 1(c)). The pre-trained SPMM showed comparable performance to state-of-the-art models in these unimodal tasks, which suggests the model's generalization ability as a foundation model.

Results
The model learns bidirectional comprehension between SMILES and properties. Once SPMM was pre-trained, we made the model generate SMILES with given PV inputs only, which is a crucial challenge for many chemical tasks such as de novo molecule design. As one of the major approaches for drug discovery, various methods have been suggested for generating molecules with desired properties [9][10][11]13. In the approaches presented so far, the maximum number of simultaneously controllable properties was not very large. Also, the length of the input property vector cannot be changed: whenever the target properties change, the model needs to be trained again for the new conditions. In contrast, the pre-trained SPMM can take the 53 properties used in pre-training as input conditions and generate molecules that satisfy all of them, without separate additional training for each property combination. Moreover, for the properties that we do not want to control, we can let the model ignore those conditions by replacing them with the [UNK] token that we used in pre-training. This is very useful because controlling all 53 input properties is not a usual scenario in practice, and is also not easy since the properties are correlated and entangled (e.g., '5 atoms & 30 bonds' or '2 rings & 5 aromatic rings' is unlikely to be a valid PV input).
To demonstrate the molecule generation capability of SPMM, we prepared a number of PV-to-SMILES generation scenarios and let the pre-trained SPMM autoregressively generate SMILES using the input properties. This process is very similar to sequence-to-sequence translation tasks in terms of the model pipeline (see Supplementary Figure S3-(a)). Results are reported as the mean value and standard deviation. For deterministic sampling, we ran the experiment with four different random sets of 1,000 unseen PVs. In the case of stochastic scenarios, four different random seeds were used for each experiment.
For the first PV-to-SMILES generation scenario, we prepared 1,000 PVs of SMILES from PubChem 40 that are not contained in the pre-training dataset and fed them to the pre-trained SPMM to generate appropriate SMILES. Here, the sampling process was done in a deterministic manner (greedy sampling): starting from the SMILES [CLS] token ([CLS]_S), the model predicts the probability distribution of the next token and chooses the option with the highest probability. The first row of Table 1 shows the results. Among the outputs of deterministic PV-to-SMILES generation for 1,000 PVs, 98.2% were valid SMILES. The mean RMSE of the 53 normalized properties was 0.194, which implies that the properties of the generated samples agree with the property input.
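The decoding loop can be sketched as follows. This is an illustration only: `toy_next_token_probs` is a hypothetical stand-in for the fusion encoder and its prediction head, and the `condition` dictionary is a toy replacement for a real PV. Passing a random generator switches the loop from greedy selection to sampling from the predicted distribution.

```python
import random


def decode(next_token_probs, condition, rng=None,
           cls_token="[CLS]", sep_token="[SEP]", max_len=64):
    """Autoregressive decoding: greedy if rng is None, else stochastic."""
    tokens = [cls_token]
    while len(tokens) < max_len:
        probs = next_token_probs(tokens, condition)
        if rng is None:
            # Greedy sampling: take the most probable next token.
            token = max(probs, key=probs.get)
        else:
            # Stochastic sampling: draw from the predicted distribution.
            token = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        if token == sep_token:
            break
        tokens.append(token)
    return "".join(tokens[1:])


def toy_next_token_probs(tokens, condition):
    """Hypothetical model: prefers `n_carbons` 'C' tokens, then one 'O'."""
    emitted = len(tokens) - 1
    target = condition["n_carbons"]
    if emitted < target:
        return {"C": 0.8, "O": 0.15, "[SEP]": 0.05}
    if emitted == target:
        return {"C": 0.1, "O": 0.7, "[SEP]": 0.2}
    return {"C": 0.05, "O": 0.05, "[SEP]": 0.9}


condition = {"n_carbons": 2}
greedy_smiles = decode(toy_next_token_probs, condition)
sampled_smiles = [decode(toy_next_token_probs, condition, rng=random.Random(seed))
                  for seed in range(4)]
```

Greedy decoding always returns the same string for a given condition, while the sampled variant can return several different strings for one condition.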
Application fields like drug discovery often require generating multiple molecules for a single target property condition. This can be done by sampling the next token stochastically from the modeled probability distribution instead of using the token with the highest probability. To verify our model's ability to generate multiple molecules from a single PV input, we generated 1,000 SMILES with stochastic sampling on a fixed PV. The validity, uniqueness, and novelty of the generated molecules under the conditions in Figure 2 are listed in the "stochastic" rows of Table 1. The validity fluctuated depending on how feasible or difficult the property input is, and it was between 0.75 and 0.9 in most cases. The uniqueness, the ratio of the number of unique molecules to the number of validly generated molecules, was almost 100% in every condition we experimented with. More examples of the generated
With the same approach as SMILES generation, the pre-trained SPMM can also be used to generate a PV with SMILES input only. This task is equivalent to performing 53 property predictions of a given SMILES at once. Similar to the PV-to-SMILES generation, properties are predicted in an autoregressive manner: the model predicts the first property value using only the property [CLS] token ([CLS]_P), then takes all previous outputs again to get the next prediction value, and so on (see Supplementary Figure S3-(b)). Although the 53 properties that we used can be calculated using the RDKit Python module, the purpose of this experiment is to verify that the data-driven way of property estimation coincides with the analytic approach.
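The generation metrics discussed above can be computed with a few lines of code. The sketch below follows the stated definition of uniqueness (unique molecules over validly generated molecules); the novelty definition (unique valid molecules absent from the training set) is an assumption, since the text does not spell it out. `toy_canonicalize` is a hypothetical placeholder that only checks parentheses; in practice a cheminformatics toolkit such as RDKit would parse and canonicalize each SMILES.

```python
def generation_metrics(generated, training_set, canonicalize):
    """Return (validity, uniqueness, novelty) for a list of generated SMILES.

    `canonicalize` maps a SMILES string to a canonical form, or None if the
    string is not a valid molecule.
    """
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty


def toy_canonicalize(smiles):
    """Hypothetical validity check: only rejects unbalanced parentheses."""
    depth = 0
    for char in smiles:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return None
    return smiles if depth == 0 else None


generated = ["CCO", "CCO", "C(C", "c1ccccc1", "CCN"]
training_set = {"CCO"}
validity, uniqueness, novelty = generation_metrics(
    generated, training_set, toy_canonicalize)
```

Here four of the five strings pass the toy check, three of those four are distinct, and two of the three distinct molecules are absent from the toy training set.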
Specifically, we fed 1,000 SMILES from the ZINC15 dataset 41, which are not contained in the pre-training dataset, to the pre-trained SPMM and generated their corresponding PVs. Figure 4 is the scatter plot of the real property value against the generated output for 12 selected properties
In Figure 5, we plotted the cross-attention score from the last fusion layer of our pre-trained SPMM when a SMILES and its property vector were given as inputs. Since there are multiple heads for the cross-attention, we took the mean of their attention scores. Interestingly, the cross-attention scores followed the intuitive relations between chemical properties and molecular fragments.
Generalization ability as a molecular foundation model. So far, we have demonstrated that the pre-trained SPMM can be applied to tasks that require an understanding of the relationship between SMILES and properties. However, we can also employ the pre-trained SPMM for challenges that only use SMILES data, such as molecular property prediction. One advantage of having a dual-stream VLP model structure is that SPMM's multimodal pre-training process includes adjusting the output of one unimodal encoder to contain contextual information from the other modality, by aligning it with the other unimodal encoder's output. This implies that the SMILES encoder output is a unimodal representation vector that not only embeds the input molecule's structural information but is also enhanced by its property information.
We analyzed whether our pre-trained model had learned an informative representation that can be readily used for other tasks, even for a single modality. To this end, we utilized only the SMILES encoder of the pre-trained SPMM (see Supplementary Figure S3-(c)) and made a benchmark study on nine MoleculeNet 42 downstream tasks and a Drug-Induced Liver Injury (DILI) prediction task. Each MoleculeNet task is a regression or classification task for pharmaceutical/biochemical applications like solubility, toxicity, and brain penetrability. The DILI classification task was done to overcome the potential limitation of open databases 43,44 and to verify whether SPMM could be extended to more complex endpoints. The task is to classify whether the given molecule has a risk of causing liver injury. Since many proposed DILI machine learning models have built their own dataset rather than using common benchmarks, we took the dataset preparation from a known publication 45 and compared the performance with it for a fair evaluation.
We also trained SPMM for the forward and retro-reaction prediction tasks, which require the model to predict the product SMILES from the reactant SMILES and vice versa. Regarding both tasks as sequence-to-sequence generation, the model pipeline for these reaction prediction tasks is the same as the PV-to-SMILES generation tasks, except that the PV encoder is replaced with the SMILES encoder (see Supplementary Figure S3-(d)). The detailed task definition and dataset preparation are described in the Methods section. Table 4 shows the performances of SPMM and other benchmark models on forward and retro-reaction prediction tasks. Although the reaction prediction tasks are not the best scenario for the property-emergence features to play significant roles, SPMM showed the highest top-1 accuracy in the forward-reaction task with a relatively small pre-training data size (i.e., 20M molecules, compared to 100M molecules of Chemformer). SPMM also achieved the second-best top-1 accuracy among the string-based retro-reaction task models.

Discussion
In this work, we proposed a transformer-based multimodal chemical foundation model, SPMM. The proposed model allows for bidirectional generation/prediction of molecular structure and properties, as well as unimodal tasks like reaction prediction. During the process, we introduced a method of treating property collections as a language so that the model could learn the relationship between SMILES tokens and each property independently. We demonstrated that the pre-trained SPMM showed remarkable performance in problems involving interactions between the SMILES and PV domains.
Beyond multimodal challenges, even SPMM's unimodal SMILES feature provides a useful representation that can be fine-tuned for many molecular downstream tasks. It is important to note that all of these results were obtained with a pre-training of 20 million molecules, which is relatively small compared to other large pre-training approaches; there is still room for better performance with more data and parameters. We also note that we gathered our 53 properties to cover the widest range possible, rather than making every effort to select the most effective combination of properties. This implies the proposed structure-property multimodal training can be flexibly adopted with different property selections, according to the given scenario.
Despite the noticeable performance of SPMM, there is room for improvement. One limitation comes from using the SMILES notation. Although SMILES can contain full details about the 2D structure of the molecule, the information on how atoms and bonds are connected only exists implicitly. Also, a slight modification in molecular structure can cause a drastic change in SMILES.
The graph format is another widely used modality for molecule representation that contains the explicit information of the adjacency matrix, which can be an alternative to SMILES. Another limitation of our current SPMM is that the 53 properties we used happen to be invariant to changes in the stereochemistry of the given molecule. It is known that considering stereochemistry plays a crucial part in various biochemical tasks. However, the 53 properties we used cannot provide any stereochemical information since their values are unchanged across different stereoisomers. This makes the SMILES encoder outputs of different stereoisomers converge, since the contrastive loss aligns them to the same PV feature. We believe this is the prominent factor that lowered the performance of SPMM in MoleculeNet tasks, which could be resolved by using more properties that reflect the molecule's stereochemistry. Moreover, validation through wet-lab experiments to verify the model's predicted/generated properties is another possible further study.
Overcoming these drawbacks of the current study and making the model more applicable to other chemical tasks are left for future work.
Nevertheless, we believe that our approach can provide a pre-trained model capable of encompassing each input domain and their multimodal domain simultaneously, which has vast potential utility. We expect this approach to be applied to more diverse and practical chemical situations by using broader and richer molecular modalities and, possibly, different biochemical domains like polymers and proteins.

Methods
Handling SMILES and property values as a language. Molecules can be represented with various formats such as fingerprints, strings like SMILES and InChI, or a molecular graph. Since these different notations contain almost the same information about complete molecular structure, we employed SMILES to describe a molecule structure. SMILES is a sequence of characters that represents the connection structure of the molecule. Many researchers treat SMILES as a variant of language data and utilize the concept of language models for chemical tasks on SMILES data 11,21,57. Once the model is pre-trained, the [CLS] token output of the given sequence can be considered as an input representation vector and be used for classification/regression downstream tasks, as in many BERT variations for images 59,60 and VLP 27. In the SMILES tokenization, our tokenizer splits a given SMILES into fragments that are contained in a prepared token dictionary of 300 subwords. This dictionary was obtained from the pre-training SMILES corpus by the BPE algorithm 61, which starts from a set of simple characters and iteratively appends the most frequent token pairs as a merged subword. Being widely adopted for various language models 62,63, the BPE algorithm has provided a subword dictionary containing common functional groups and substructures like benzene rings, carbonyl groups, two-letter atoms, and amino groups. Compared to naive character-wise tokenization, which considers each character as a separate token, the merged subwords help the model's chemical inference for chemical groups and reduce the total number of tokens.
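The merge procedure of BPE can be illustrated with a small self-contained sketch. It is a toy re-implementation on a five-molecule corpus, not the tokenizer used for SPMM; the function names and the corpus are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(sequence, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(sequence):
        if i + 1 < len(sequence) and (sequence[i], sequence[i + 1]) == pair:
            merged.append(sequence[i] + sequence[i + 1])
            i += 2
        else:
            merged.append(sequence[i])
            i += 1
    return merged


def learn_bpe_merges(corpus, num_merges):
    """Start from single characters and repeatedly merge the most frequent pair."""
    sequences = [list(smiles) for smiles in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for sequence in sequences:
            pair_counts.update(zip(sequence, sequence[1:]))
        if not pair_counts:
            break
        best_pair = pair_counts.most_common(1)[0][0]
        merges.append(best_pair)
        sequences = [merge_pair(sequence, best_pair) for sequence in sequences]
    return merges


def tokenize(smiles, merges):
    """Apply the learned merges, in order, to a new SMILES string."""
    sequence = list(smiles)
    for pair in merges:
        sequence = merge_pair(sequence, pair)
    return sequence


corpus = ["c1ccccc1O", "c1ccccc1N", "CC(=O)O", "CC(=O)N", "c1ccccc1C(=O)O"]
merges = learn_bpe_merges(corpus, num_merges=10)
query = "c1ccccc1C(=O)N"
tokens = tokenize(query, merges)
```

Concatenating the resulting tokens reproduces the input string, and the token count is smaller than the character count because frequent character pairs have been merged into subwords.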
Meanwhile, a set of chemical properties carries the same information regardless of the internal order of its elements, but the properties are certainly correlated with one another. It is also known that a transformer architecture performs well for different modalities like images, by giving a specific order to its components and treating them as a sequence. For this work, we built a PV for each molecule that contains 53 molecular properties and considered this as a sentence with a length of 53. These properties from the RDKit Python module 64 cover a wide range from simple ones, such as the number of rings and molecular weight, to complex properties like solubility, TPSA, and druggability.
The transformer architecture of our model considers each element of PV as a token to perform the attention mechanism, which is equivalent to regarding PV as a semi-sentence of 53 properties.Although the size of the vocabulary is more limited and their order is fixed compared to natural language, it provides much more precise and compact information about the 53 properties.
One benefit of regarding PV as a language is that we do not have to collect all elements to build a valid PV. In contrast to a simple vector input, some property elements can be removed or masked in our approach. The order of these 53 properties is predetermined. Each value in the PV is encoded to a feature vector using a linear layer as a value encoding. Then we randomly replace 50% of the property features with the [UNK] token, which is the special token used to indicate that the property is unknown. This is possible since a molecule can still be described using only a part of these properties. Random property feature masking prevents the model from depending too heavily on a specific property, has the effect of data augmentation, and improves the model's generalization ability. Although every property we used in this work can be easily and thoroughly prepared by the computer, this might not be the case for other properties in real-world situations. SPMM can still be trained when some properties for certain training molecules are not known, by replacing those unknown properties with the [UNK] token.
On top of the randomly-masked value encoding, options for being the next token, but using a one-hot label for ground truth might ignore this.
To resolve this issue, we built the momentum teacher model 27,65 and utilized its output for contrastive learning and NWP. The momentum teacher performs knowledge distillation by providing a pseudo-label that reflects how the teacher model comprehends the input. Specifically, the labels for contrastive learning and NWP are mixed with the momentum model's outputs $s_{*,\mathrm{momentum}}$ ($* \in \{s2p, p2s, s2s, p2p\}$) and $p^{NWP}_{\mathrm{momentum}}(s_n \mid s_{0:n-1}, P)$, with an adjusting hyperparameter α. The detailed formulas for utilizing the momentum model for contrastive learning and NWP are described in Eqs. (8)-(9) and Eqs. (10)-(11).
After the student model's parameters $w_{\mathrm{model}}$ are updated for each batch, the parameters of the momentum teacher model $w_{\mathrm{momentum}}$ are updated by the exponential moving average (EMA) using $w_{\mathrm{model}}$ and an EMA hyperparameter λ according to Equation 12.
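The momentum-distillation steps can be sketched in a few lines. The exact forms of Eqs. (8)-(12) are not reproduced in this text, so the sketch assumes the convention common in momentum-distillation work: the soft target is a convex combination of the one-hot label and the teacher distribution, and the teacher weights follow a standard EMA. The values α = 0.4 and λ = 0.995 and the linear warm-up of α over the first epoch are taken from the implementation details; the function names are illustrative.

```python
def mix_with_teacher(one_hot, teacher_probs, alpha):
    """Assumed soft target: (1 - alpha) * one-hot label + alpha * teacher output."""
    return [(1.0 - alpha) * y + alpha * q for y, q in zip(one_hot, teacher_probs)]


def ema_update(teacher_params, student_params, lam=0.995):
    """Assumed EMA form: w_momentum <- lam * w_momentum + (1 - lam) * w_model."""
    return {name: lam * teacher_params[name] + (1.0 - lam) * student_params[name]
            for name in teacher_params}


def alpha_schedule(step, steps_per_epoch, alpha_max=0.4):
    """Linear ramp of alpha from 0 to alpha_max during the first epoch."""
    return alpha_max * min(1.0, step / steps_per_epoch)


# Toy example: a three-token vocabulary and a single scalar parameter.
soft_target = mix_with_teacher([0.0, 1.0, 0.0], [0.2, 0.5, 0.3], alpha=0.4)
teacher = ema_update({"w": 1.0}, {"w": 0.0})
alpha_mid_epoch = alpha_schedule(step=50, steps_per_epoch=100)
alpha_later = alpha_schedule(step=500, steps_per_epoch=100)
```

Because both the one-hot label and the teacher output sum to one, the mixed target remains a valid probability distribution for any α in [0, 1].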
The overall pre-training objective is the combined loss of Contrastive, NWP, NPP, and SPM loss.
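Training for downstream tasks. Supplementary Figure S3 describes how we utilized our pre-trained model for downstream tasks. For PV generation and SMILES generation (Supplementary Figure S3-(a), (b)), we do not need additional fine-tuning since their training objectives are already included in the pre-training (NWP, NPP). For the inference procedure, the model generates the PV or SMILES with autoregressive sampling. Specifically, starting from the [CLS] token of the modality that we want to generate, the model predicts the first component and repeatedly takes the previous outputs to predict the next component until generation is complete or a stop signal is met. A causal mask has to be used in the self-attention of the fusion encoder and the unimodal encoder of the generating modality to enforce the autoregressive generation.
The forward reaction prediction task provides a reactant SMILES (including multiple reagent molecules) and a product SMILES. We encode these two inputs with the SMILES encoder, then feed them into the fusion encoder and prediction head. The model is trained to autoregressively generate the original product SMILES (Supplementary Figure S3-(d)). In the inference stage, starting from the [CLS]_S token, the model predicts the next token until it generates the [SEP] token. Similar to the SMILES generation, the self-attention of the fusion encoder and the reactant SMILES encoder uses a causal mask. The retro-reaction prediction task was done in the same way, but the roles of the reactant and product SMILES are swapped. We fine-tuned SPMM for the forward reaction prediction task with a 'mixed task' approach, meaning that the information about the major reactant is not given to the model. For both forward and retro-reaction tasks, we replaced the input reactants and products with their random non-canonical augmented SMILES 67 with a probability of 0.5.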
Data preparation. We obtained 20,000,000 SMILES of general molecules from PubChem 40 for pre-training. All 53 properties we used can be calculated from SMILES using the RDKit Python module 64. The dataset for the MoleculeNet downstream tasks is provided by the DeepChem 68 Python library. We split every dataset into train/valid/test sets in a ratio of 8:1:1 using a scaffold splitter from DeepChem, which is a harsher condition for the model than random splitting.
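To illustrate why scaffold splitting is harsher than random splitting, the sketch below groups molecules by a scaffold key and assigns whole groups to a split, so no scaffold appears in more than one split. It is a simplified illustration rather than DeepChem's implementation: the scaffold strings are hypothetical precomputed labels (a real pipeline would derive them from the molecular structure), and the greedy largest-group-first assignment is an assumption.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that each scaffold group stays in one split."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)
    # Largest groups first; ties broken by first occurrence for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold labels for ten molecules.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole", "furan"]
train, valid, test = scaffold_split(scaffolds)
train_scaffolds = {scaffolds[i] for i in train}
heldout_scaffolds = {scaffolds[i] for i in valid + test}
```

Since the validation and test molecules come from scaffolds never seen in training, the model is evaluated on structurally different chemistry rather than on close analogues of training molecules.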
For the reaction prediction task, we used the USPTO-480k dataset, which contains 479,035 pairs of reactants and the major product of their reaction. The retro-reaction prediction task used the USPTO-50k dataset, containing 50,037 product-reactant pairs with corresponding reaction types. Although the USPTO-50k dataset provides tags of reaction type for each reaction, we did not use them, following previous retro-reaction prediction publications.
Implementation details. We employed the architecture of 6 BERT base encoder layers for our PV encoder and SMILES encoder, and 6 BERT base encoder layers with cross-attention layers for our fusion encoder. With given $Q \in \mathbb{R}^{len_q \times d_k}$, $K \in \mathbb{R}^{len_k \times d_k}$, and $V \in \mathbb{R}^{len_k \times d_v}$ as query, key, and value inputs, the self-attention and cross-attention layers in BERT compute the output of the scaled dot-product attention according to the following formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
We pre-trained the model until it converged, using a batch size of 128 and the AdamW optimizer with a weight decay of 0.02. The learning rate is warmed up to 1e-4 and decreased to 1e-5 with a cosine scheduler. We used a momentum-adjusting hyperparameter α of 0.4. Since the pseudo-label from the momentum teacher is not useful in the early stages of training, we linearly increased α from 0 to 0.4 during the first epoch. The EMA hyperparameter λ was fixed to 0.995, and the size of the PV and SMILES queue k was set to 32,768. The momentum models are not used for downstream tasks. The full description of training for downstream tasks is in Supplementary Table S1.
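As a worked illustration of scaled dot-product attention, the following sketch computes the attention output for small list-based matrices, with an optional causal mask of the kind used for autoregressive generation. It is a single-head, dependency-free toy rather than the BERT implementation; the function names and example inputs are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices.

    With causal=True, query position i may only attend to key positions j <= i.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


# Self-attention on three toy token embeddings (Q = K = V).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal_out, causal_weights = attention(X, X, X, causal=True)
full_out, full_weights = attention(X, X, X)
```

In self-attention the same sequence provides Q, K, and V, whereas in cross-attention Q comes from one modality and K and V from the other; the causal mask simply zeroes the weights on future positions.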
Correspondence
Correspondence and requests for materials should be addressed to Jong Chul Ye (email: jong.ye@kaist.ac.kr).

Figure 1: (a) Overview of the model architecture and pre-training objectives of SPMM. The contrastive loss aligns the output features of the two unimodal encoders into the same embedding space. The fusion encoder learns the relations between the two modalities, trained with Next Word Prediction (NWP), Next Property Prediction (NPP), and SMILES-Property Matching loss (SPM). (b) Downstream tasks that require multimodal comprehension: i) PV-to-SMILES generation, ii) SMILES-to-PV generation. (c) Downstream tasks for single-modality inputs: i) property prediction, ii) forward and retro-reaction prediction.

Figure 2 shows the property distributions of 1,000 molecules generated from a single PV input. The mode of each property distribution lands on the input property value (Fig. 2-(a)). When only some of the properties are given, the model only regards the known properties while the other masked properties are not restricted (Fig. 2-(b), Fig. 2-(c)). SPMM can generate molecules even with no property information at all; when all input properties are replaced with the [UNK] token (Fig. 2-(d)), the model performs unconditional molecule generation, and the output follows the distribution of the pre-training dataset.

Figure 2: Property distribution of the generated molecules with different PV inputs and [UNK] token masking. The red vertical dotted lines are the input property values, and the grey vertical lines are the mean of that property in the pre-training dataset. The controlled properties are colored in red, and uncontrolled properties (masked with the [UNK] token) are colored in blue. Due to the lack of space, only 12 out of 53 properties are shown for each case. For each PV-to-SMILES scenario, we included the structure of two of the generated molecules. (a) All 53 properties are controlled, without using the [UNK] token. The input PV was obtained from the molecule 1. (b) Molecular weight is controlled to 150, and the other property inputs are masked. (c) #ring, #aromatic ring, TPSA, and QED are controlled to 2, 1, 30, and 0.8. The other property inputs are masked. (d) Every property is replaced with the [UNK] token.

Figure 3: Examples of molecule editing, by changing specific values from the original PV and performing PV-to-SMILES generation with it. The colored output values correspond to the changed properties from the original PV. (1) The output of the same PV of the source molecule. (2) The output when #aromatic ring is changed to 0. (3) The output when #ring is changed to 2 and #aromatic ring is changed to 1. (4) The output when logP is changed to 7. (5) The output when #rotatable bond is changed to 12. For the generation, the other 41 property conditions are masked by the [UNK] token.
out of 53 that we used for pre-training. It is clear that SPMM's predicted properties are very close to the actual values, and most of the data points lie on the y = x line. Although the model has virtually never seen a fully filled PV in pre-training due to the 50% random property masking, the model could autoregressively predict all 53 properties as a whole. The mean r² score of the 53

Figure 4: Scatter plots of the 1,000 ZINC15 molecules' real property value against the generated output, for 12 selected properties. The x-axis is the real property value, and the y-axis is the model output. The grey dotted line is the y = x line.

Figure 5: The mean attention score from the attention heads in the SPMM fusion encoder's final cross-attention layer for two sample molecules. A darker green means a higher attention score. For the attention process, the property features were used as queries, and the SMILES features were used as keys and values. The corresponding fragments for each token are indicated with ivory boxes on the molecular structure, while fragments for duplicated tokens are color-coded with purple. We have calculated cross-attention scores for all 53 properties and SMILES tokens, but only 12 of those properties are shown.
lar fragments. The properties related to hydrogen bonding (NumHDonors, NumHAcceptors) show high attention scores for tokens with oxygen and nitrogen atoms. The property RingCount focuses on the tokens that are involved with rings while showing weak attention to side groups, and the property NumAromaticRings gives high attention scores only to the components of aromatic rings. When different SMILES tokens played a similar role in the molecule, such as 'c1ccccc1)' and 'c1ccccc1' in molecule 15, their attention patterns were similar as well. This result demonstrates that SPMM could capture the relations between molecular structures and chemical properties without explicitly given supervision between them.
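The head-averaged cross-attention map described here can be illustrated with a small sketch. This is not the SPMM implementation: the function names, the toy vectors, and the use of plain scaled dot-product attention are assumptions made for illustration. Property features act as queries and SMILES token features as keys, and the per-head attention weights are averaged.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_scores(queries, keys):
    """Scaled dot-product attention weights: one row per query, one column per key."""
    d = len(keys[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]

def mean_over_heads(per_head_scores):
    """Average attention maps of identical shape across heads."""
    n_heads = len(per_head_scores)
    n_q = len(per_head_scores[0])
    n_k = len(per_head_scores[0][0])
    return [
        [sum(h[i][j] for h in per_head_scores) / n_heads for j in range(n_k)]
        for i in range(n_q)
    ]

# Toy example (hypothetical values): one property query per head, three SMILES tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_1 = cross_attention_scores([[1.0, 0.0]], keys)
head_2 = cross_attention_scores([[0.0, 1.0]], keys)
mean_map = mean_over_heads([head_1, head_2])
```

Each row of `mean_map` is still a probability distribution over the SMILES tokens, so a row can be read as how strongly one property attends to each token on average.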

Figure 6-(a) illustrates our embedding procedure for the input SMILES. The raw SMILES string is tokenized by the tokenizer and embedded by the SMILES encoder with the [CLS]_S token and the [SEP] token. Here, the [CLS] token is a special token attached to the beginning of every input sequence 58 . Although the [CLS] token itself does not carry any meaning, the bidirectional attention mechanism of the model allows the [CLS] token to contain contextual information about the entire input. Once the model is pre-trained, the [CLS] token output of the given sequence can be

Figure 6: Embedding process for SMILES and the corresponding PV.

Figure 6-(b) shows our embedding procedure for the input PV. Each property element in the PV is a numerical value and is normalized with the mean and standard deviation of that property.
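The normalization step can be written out as a standard z-score, z = (x - mean) / std, applied per property. The following is a minimal sketch under stated assumptions: the helper names and the two-property toy data are hypothetical, and the use of the population standard deviation is a choice made here, not a detail given in the text.

```python
import statistics

def fit_property_stats(pv_rows):
    """Per-property mean and standard deviation over a list of property vectors."""
    columns = list(zip(*pv_rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation (assumption); fall back to 1.0 for constant columns.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def normalize_pv(pv, means, stds):
    """Z-score each property with that property's own statistics."""
    return [(x - m) / s for x, m, s in zip(pv, means, stds)]

def denormalize_pv(z, means, stds):
    """Invert the normalization, e.g. to read generated property values."""
    return [v * s + m for v, m, s in zip(z, means, stds)]

# Toy data: three molecules, two hypothetical properties (molecular weight, logP).
toy_pvs = [[150.0, 1.0], [300.0, 2.0], [450.0, 3.0]]
means, stds = fit_property_stats(toy_pvs)
normalized = [normalize_pv(pv, means, stds) for pv in toy_pvs]
restored = denormalize_pv(normalized[0], means, stds)
```

Normalizing each property separately puts quantities with very different scales, such as molecular weight and QED, on a comparable footing before embedding.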

Figure S3-(a), (b)), we do not need additional fine-tuning since their training objectives are already included in the pre-training (NWP, NPP). For the inference procedure, the model generates a PV or SMILES with autoregressive sampling. Specifically, starting from the [CLS] token of the modality that we want to generate, the model predicts the first component and then repeatedly takes the previous outputs to predict the next component, until generation is complete or a stop signal is met. A causal mask has to be used in the self-attention of the fusion encoder and of the unimodal encoder of the generating modality to enforce autoregressive generation.
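The inference loop described above can be sketched generically. This is an illustration only: `toy_model` is a hypothetical stand-in for the network, the token set is invented, and using [SEP] as the stop signal is an assumption. The causal mask is shown as a lower-triangular boolean matrix, so each position can only attend to itself and earlier positions.

```python
import random

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def sample_autoregressively(next_token_probs, start_token, stop_token, max_len, rng):
    """Feed previous outputs back in until the stop token appears or max_len is reached."""
    sequence = [start_token]
    while len(sequence) < max_len:
        probs = next_token_probs(sequence)  # dict: token -> probability
        tokens, weights = zip(*probs.items())
        nxt = rng.choices(tokens, weights=weights, k=1)[0]
        sequence.append(nxt)
        if nxt == stop_token:
            break
    return sequence

def toy_model(prefix):
    """Hypothetical next-token distribution that depends only on the prefix length."""
    if len(prefix) >= 4:
        return {"[SEP]": 1.0}
    return {"C": 0.6, "O": 0.3, "[SEP]": 0.1}

generated = sample_autoregressively(
    toy_model, "[CLS]", "[SEP]", max_len=10, rng=random.Random(0)
)
mask = causal_mask(3)
```

Because sampling is stochastic, repeated calls with different random states can return different sequences for the same condition.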

Figure S2: The scatter plots of the model's generated PVs for 1,000 unseen ZINC15 molecules, against the actual property values for all 53 properties. The r² score and RMSE for each property are given at the top of each plot.
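For reference, the two metrics reported in these plots can be computed as below. This sketch assumes that the r² score means the coefficient of determination, 1 - SS_res / SS_tot; the text does not state which definition was used, and the sample values are invented.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error between actual and generated values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical values for a single property.
actual = [1.0, 2.0, 3.0, 4.0]
generated = [1.1, 1.9, 3.2, 3.8]
r2 = r2_score(actual, generated)
err = rmse(actual, generated)
```

Points lying exactly on the y = x line give an r² of 1 and an RMSE of 0.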

Figure S3: Overview of the inference and fine-tuning of SPMM for various downstream tasks: (a) The inference process of pre-trained SPMM for molecule generation. (b) The inference process of pre-trained SPMM for PV generation. (c) The model architecture for MoleculeNet downstream tasks. The SMILES encoder of pre-trained SPMM is used as a backbone. (d) The model architecture for the reaction prediction task. We adopted the SMILES encoder and the fusion encoder of pre-trained SPMM and built a sequence-to-sequence model.

Table 1: Quantitative and qualitative results on various scenarios of PV-to-SMILES generation tasks, with

Regarding the overall molecule generation performance of SPMM, we want to emphasize that SPMM can generate suitable SMILES for many property conditions that the model has not seen in its pre-training. When we trained SPMM without the 50% random property masking with the [UNK] token, the model only worked when all 53 properties were given, since it had not seen partially given properties. However, even with [UNK] token masking, the model cannot encounter most of the 2^53 possible property combinations during the pre-training process. SPMM's ability to handle arbitrary property conditions for SMILES generation comes from treating the PV as a 'language with 53 words' and focusing on each property separately, rather than simply considering the entire property input as a single condition. This approach to conditional molecule generation has not been demonstrated with existing methods and thus can be useful in many important chemical fields.
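The [UNK] masking idea can be sketched as follows. The helper names are hypothetical, the string `"[UNK]"` stands in for the model's special token, and masking exactly half of the entries (rather than masking each property independently with probability 0.5) is an assumption; the text only states that 50% of the properties are randomly masked.

```python
import random

UNK = "[UNK]"  # placeholder standing in for the special token

def mask_properties(pv, mask_ratio=0.5, rng=None):
    """Replace a random subset of property values with the [UNK] placeholder."""
    rng = rng or random.Random()
    n_mask = round(len(pv) * mask_ratio)
    masked_idx = set(rng.sample(range(len(pv)), n_mask))
    return [UNK if i in masked_idx else v for i, v in enumerate(pv)]

def condition_on(n_props, controlled):
    """Inference-time PV: controlled indices keep their value, the rest are [UNK]."""
    return [controlled.get(i, UNK) for i in range(n_props)]

# Training-time view: a 53-property vector with roughly half of the entries hidden.
pv = [float(i) for i in range(53)]
masked = mask_properties(pv, 0.5, random.Random(0))

# Inference-time view: control only one property (index 0 is a hypothetical choice).
partial = condition_on(53, {0: 150.0})

# Number of distinct known/unknown patterns over 53 properties.
n_patterns = 2 ** 53
```

With 53 properties there are 2^53 possible known/unknown patterns, far more than any pre-training run can enumerate, which is why handling each property as a separate token matters.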
molecule can be found in Supplementary result S1. The aforementioned results demonstrate that SPMM can perform molecule generation with arbitrary PV inputs, which enables simple molecule design and editing. Figure 3 contains the output of SPMM's stochastic molecule generation for five PV inputs, all of which originated from the PV of molecule 1, with certain values changed in four of them. The generated molecules follow the input modifications while keeping the unmodified properties similar. SPMM is even able to generate molecules under out-of-domain conditions such as log P = 7 (note that ∼5% of the pre-training dataset has log P > 7).

Table 2 contains the performance of SPMM and other models for MoleculeNet. Using only 6 BERT encoder layers, SPMM showed comparable performance to state-of-the-art models for all tasks. It achieved the best performance for the Clearance, BBBP, and Clintox tasks, showing its

Table 2: Benchmark results on MoleculeNet downstream tasks. The best performance for each task is written in bold, and the second-best performance is underlined. For each task, we fine-tuned our model with four random seeds and recorded the mean and the standard deviation of those results. The benchmark model results were taken from ChemRL-GEM and ChemBERTa-2. * The standard deviation cannot be found in the source of the benchmark results. † Unofficial results, obtained from the official checkpoint under our data preparation.

model Acc in % [↑] Selectivity in % [↑] Specificity in % [↑] AUROC in % [↑]

Table 3: The DILI classification task performance of Ai et al. 45 and SPMM. The best performance for each metric is written in bold.

Table 4: The performance of SPMM and other works on the forward and retro-reaction prediction tasks. For the retro-reaction prediction task, we only prepared the benchmark results of string-based models. The highest accuracy is written in bold, and the performance of the runner-up model is underlined. The benchmark model results are from the papers of LocalTransform 6 and Chemformer 52 .

Table S1: Detailed training hyperparameters for fine-tuning SPMM in the MoleculeNet downstream tasks, the DILI classification task, the forward reaction prediction task, and the retro-reaction prediction task.