Regression Transformer: Concurrent sequence regression and generation for molecular language modeling

Despite significant progress of generative models in the natural sciences, their controllability remains challenging. One fundamentally missing aspect of molecular or protein generative models is an inductive bias that can reflect continuous properties of interest. To that end, we propose the Regression Transformer (RT), a novel method that abstracts regression as a conditional sequence modeling problem. This introduces a new paradigm of multitask language models which seamlessly bridge sequence regression and conditional sequence generation. We thoroughly demonstrate that, despite using a nominal-scale training objective, the RT matches or surpasses the performance of conventional regression models in property prediction tasks of small molecules, proteins and chemical reactions. Critically, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a substructure-constrained, property-driven molecule generation benchmark. Our dichotomous approach is facilitated by a novel, alternating training scheme that enables the model to decorate seed sequences by desired properties, e.g., to optimize reaction yield. In sum, the RT is the first report of a multitask model that concurrently excels at predictive and generative tasks in biochemistry. This finds particular application in property-driven, local exploration of the chemical or protein space and could pave the road toward foundation models in material design. The code to reproduce all experiments of the paper is available at: https://github.com/IBM/regression-transformer


Introduction
Transformers [1] are now ubiquitous in natural language processing (NLP) and have also enjoyed large success in molecular [2,3,4] and protein language modeling [5,6]. The invention of Transformers was in alignment with the steady decline of inductive biases in ML, a trend that started with the rise of deep learning: CNNs outperformed traditional feature descriptors in object recognition [7], self-attention generalized dense layers to learn sample-dependent instead of static affine transformations [8] and Transformers exploited self-attention to supersede RNNs as the de-facto standard in NLP. The success of vision transformers has questioned the need for translation equivariance in image processing [9] and now, even frozen Transformers pretrained on text achieve SOTA results in object detection and protein classification [10]. Given that Transformers are today's most generic model, it is not surprising that attempts have been made to abstract entire domains like RL to sequence modeling in order to leverage Transformers [11].

Figure 1: Overview of the Regression Transformer (RT). The RT is a multitask language model designed to handle combinations of text and numbers. a) Traditional approach in generative chemistry: property predictors and generative models are trained independently from one another. b) Our approach: Training the RT yields a dichotomous model that seamlessly switches between property prediction and conditional text generation. The model's task is to fill the content behind the [MASK] tokens. Depending on the mask location, the same model either predicts numerical tokens given textual tokens, thus performing a regression task (blue stream, top); or predicts textual tokens given both numerical and textual tokens, thus performing property-driven conditional generation (yellow stream, bottom). c)-f): This novel formulation finds application across a wide range of domains. We demonstrate the flexibility of the RT in predictive and generative tasks in modeling small molecules, proteins, chemical reactions and even natural text.
A provocative next step toward reducing inductive biases might be to refrain from explicitly modeling target variables as functions of input variables. Instead of following this discriminative modeling approach when tuning task-specific language heads in Transformers, learning the joint distribution over input and target variables could effectively further blur the lines between predictive and conditional generative models. The feasibility of such an approach can be assessed via permutation language modeling (PLM), an extension of masked language modeling to autoregressive models [12]. Such dichotomous models (that concurrently excel at regression and conditional sequence generation) are, beyond applications in NLP, of special interest for chemical and material design. Molecules are often labelled with continuous properties (e.g., drug efficacy or protein solubility) and design tasks are intertwined with bio- or physicochemical properties. But despite the rise of deep generative models in molecular [13,14] and protein design [15,16], current approaches still develop property predictors and generative models independently. Transformer-based architectures have been used widely on chemical tasks but either focused on property prediction [17,18] or on conditional molecular design [19,20], never on both. This semantic gap persists across architectural flavors (e.g., GANs [21], RL [22], VAEs [23], GNNs [24,20], flow [25,26] and diffusion models [27]). To our knowledge, all existing approaches either tune task-specific heads [28] or limit the communication between both modules to a reward/loss and thus fail to "entangle" constrained structure generation with property prediction. This critically violates the intuitive expectation that a property-driven generative model should, in the first place, excel at recognizing this property.
In this paper, we aim to close this gap by reformulating regression as a sequence modeling task. We propose the Regression Transformer (RT), a novel multitask model that can be trained on combinations of numerical and textual tokens (see Figure 1). This circumvents the canonical way of addressing regression in Transformers, i.e., tuning a designated regression head [29]. Despite solely relying on tokenization of numbers and cross-entropy loss, the RT can successfully solve regression tasks. Notably, the same model can conditionally generate text sequences given continuous properties. This is achieved simply by moving the [MASK] location and does not require finetuning specific heads; thus constituting a true multitask model. To equip the RT with an inductive bias for handling floating-point properties, numbers are first tokenized into a sequence of tokens preserving the decimal order. We then devise numerical encodings to inform the model about the semantic proximity of these tokens. To allow for concurrent optimization of regression and conditional generation, we derive a PLM-inspired, alternating training scheme that includes a novel self-consistency loss for improved text generation based on continuous primers.
In the remainder of this paper, we describe the capabilities of the RT on a diverse set of predictive and generative tasks in chemical and protein language modeling. We commence with small-molecule modeling, validate the RT on a synthetic dataset of drug-likeness [30] and then test it on three property prediction datasets from the MoleculeNet benchmark [31]. The property prediction results are compared with previous approaches relying on a regression loss and demonstrate that regression can be cast as a conditional sequence generation task without losing accuracy. These experiments rely on SELFIES [32], a chemical language devised for generative tasks that, as we show, has comparable predictive power to SMILES. Although we aim to concurrently excel at predicting properties and generating sequences conditioned on properties, we start training with the PLM objective [12], which does not explicitly model those tasks. We then refine this objective and devise a training scheme that alternates between optimizing property prediction and text generation. For the latter, we derive a novel self-consistency loss that exploits the dichotomy of the RT by querying itself with the generated candidate sequence. To assess performance in conditional sequence generation, we systematically vary the continuous properties of interest and investigate the model's ability to adapt a seed sequence according to the primed property value. We show applications on property-driven local chemical space exploration by decorating scaffolds with a continuum of properties and evaluate the novel molecules using the RT itself as well as an independent property predictor [33]. The RT is then challenged against specialized molecular generative models on a property-driven molecular generation benchmark [34], where it significantly outperforms prior art.
Next, the RT is investigated on protein sequence modeling, where it matches the performance of conventional Transformers on two regression datasets from TAPE [35]. In experiments on chemical reactions, we notice that the RT constitutes a generalization of forward reaction and retrosynthesis models. We then demonstrate on two reaction datasets that the RT can not only predict reaction yields with similar accuracy to conventional Transformers [36], but that it can also substitute specific precursors and thus generate novel reactions with higher predicted yield than a seed reaction.

To test the feasibility of concurrent property prediction and conditional generation, we start by optimizing the vanilla permutation language objective (Equation 3) on a synthetic QED dataset (see Figure A1 for an illustration of how the mixed alphanumeric sequences are tokenized and embedded). Since this objective masks tokens randomly in the sequence, evaluating such models on property prediction (i.e., masking only numerical tokens; cf. Figure 1b top) does not closely mimic their training dynamics. Despite this (and the unconventional formulation of a regression task as sequence modeling), all models generated sequences of numerical tokens that allowed decoding floats, and even achieved an RMSE < 0.06 (cf. Figure 2a). For the generative task, in contrast, the same models were queried 10 times for every validation molecule, with property primers equidistantly spaced in [0, 1] and 40% of the textual tokens masked. The high rank correlation ρ (between primers and the QED of unique, generated molecules) shows that the model successfully learned to complete the corrupted scaffolds into full molecules with a desired QED. Here, the SELFIES models exceeded the SMILES models by far, because SMILES, unlike SELFIES, can be syntactically invalid. Due to the comparable results for property prediction (cf. Figure 2a), the remaining experiments focus exclusively on SELFIES.
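The primer-sweep evaluation can be sketched as follows. This is a hypothetical illustration only: the primer token format (`<qed>…|`) and the helper `primer_queries` are our assumptions, not the paper's actual implementation.

```python
import numpy as np

def primer_queries(tokens, mask_fraction=0.4, n_primers=10, seed=0):
    """Build one masked query per property primer: a primer token is
    prepended and ~mask_fraction of the textual tokens are replaced
    by [MASK]; primers are equidistantly spaced in [0, 1]."""
    rng = np.random.default_rng(seed)
    n_mask = round(mask_fraction * len(tokens))
    masked = list(tokens)
    for i in rng.choice(len(tokens), size=n_mask, replace=False):
        masked[i] = "[MASK]"
    return [f"<qed>{p:.3f}|{''.join(masked)}"
            for p in np.linspace(0.0, 1.0, n_primers)]

# Example: a toy 5-token SELFIES-like sequence, 40% masked -> 2 [MASK] tokens.
queries = primer_queries(["[C]", "[C]", "[O]", "[N]", "[Ring1]"])
```

Each of the 10 resulting strings would then be decoded by the model, and the rank correlation between the primer values and the QED of the unique generated molecules is computed.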
Notably, the novelty score (i.e., percentage of conditionally generated molecules not present in training data) was > 99% for all models. This demonstrates that the RT can generate novel chemical matter that adheres to a continuous property of interest. Moreover, the numerical encodings (NE) slightly improved performance in all tasks. Further ablation studies on different types of NEs and related work on encoding numbers with Transformer are reported in appendix A2.1.
Next, the models were refined based on our proposed training scheme with alternating objectives: For every model in Figure 2a, two models were trained, without (α = 0) and with (α = 1) the self-consistency term in the text loss (cf. Equation 7), respectively. As shown in Figure 2b, the performance in regression as well as conditional generation improved significantly, demonstrating the effectiveness of the refined objectives. Moreover, all configurations of the Regression Transformer (RT) outperformed a baseline k-NN regressor based on Tanimoto similarity, and our best configuration even surpassed the SMILES-BERT model [17], which achieved an MAE of 0.02 after pretraining on ∼9M SMILES with a regular regression loss (see Figure 2c). The self-consistency term further improved the model's ability to generate tailored ensembles of molecules and led to consistently higher correlation scores. This is exemplified in Figure 3 (top), where a single seed molecule is decorated according to the property primers to cover the full range of QED scores. Generally, the better performance of the self-consistency models (α = 1) in the generative tasks comes at the cost of slightly inferior regression performance (cf. Table 2b). Presumably, this is because the model weights in charge of the regression are confounded with the gradients from the self-evaluation (cf. Equation 7). The novelty scores for the molecules generated in this setting were even slightly higher than for the PLM training (> 99.3% for all models). A particularly challenging application for property-driven, local exploration of the chemical space is scaffold hopping; see appendix A3.1 for an example. For ablation studies on the SMILES language and other types of numerical encodings, see appendix A2.1.
Learning embeddings of numbers. We sought to understand why the ablation studies on the numerical encodings (NE) on the QED dataset (Table 2a and 2b) reveal only mild rather than substantial superiority of models with NEs. Interestingly, in the absence of static NEs, the model learns the natural ordering of digits from the data (cf. Figure 2d). A large fraction of embedding dimensions (47% and 36% for the decimal places −1 and −2, respectively) directly and significantly encoded the ordering of digits (i.e., p < 0.05 and |PCC| > 0.62 between the 10 embedding values and a strictly monotonic vector). For example, in Figure 2d (left), the digit value is monotonically related to its embedding value. Notably, this ordering trend was much less present in the models using NEs (∼16%). For reference, with random weights, 5% would be expected. In general, attention weights in Transformers can capture complex semantics such as protein folding structure [38] or atom-mapping in chemical reactions [4]. For a qualitative comparison of the RT's attention across the predictive and generative tasks, see appendix A3.2.
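The ordering analysis above can be reproduced in outline. This is a sketch under assumptions: the 0.62 |PCC| threshold mirrors the text, but the significance test (p < 0.05) is omitted for brevity and the toy embedding matrix is fabricated for illustration.

```python
import numpy as np

def ordered_fraction(emb, thresh=0.62):
    """Fraction of embedding dimensions whose ten values (one per digit
    token 0-9) correlate monotonically with the digit value, |PCC| > thresh.
    `emb` is a (10, d) matrix: row i holds the embedding of digit i."""
    digits = np.arange(10.0)
    pcc = np.array([np.corrcoef(digits, emb[:, j])[0, 1]
                    for j in range(emb.shape[1])])
    return float(np.mean(np.abs(pcc) > thresh))

# Toy check: two perfectly ordered dimensions, two zig-zag dimensions.
ordered = np.arange(10.0)
zigzag = np.array([0.0, 1.0] * 5)
emb = np.stack([ordered, -ordered, zigzag, 1.0 - zigzag], axis=1)
```

On this toy matrix, exactly the two monotone dimensions pass the threshold, giving an ordered fraction of 0.5.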

Regression benchmark (MoleculeNet)
After the successful initial experiments, we evaluated the RT on three regression benchmarks from MoleculeNet [31]. The regression performance on ESOL, FreeSolv and Lipophilicity is shown in Table 2e and compared to prior work. The strongest baseline model from MoleculeNet, XGBoost, is outperformed by all our models on all tasks. Even the MPNN [39], a message-passing GNN, is slightly surpassed on FreeSolv and Lipophilicity by some of our models. However, all our models are outperformed by BERT-based approaches [17,18]. Notably, these models leveraged large-scale self-supervised pretraining before finetuning a regression head. Since these results might not be directly comparable to the RT with its XLNet backbone, we also finetuned an XLNet model with a conventional regression head. Notably, despite the absence of a regression loss, the RT is on par with (Lipophilicity) or only mildly inferior (i.e., within standard deviation range; ESOL, FreeSolv) to XLNet.
But in stark contrast to all those approaches, only the RT can also be used to conditionally generate molecules similar to the training samples (cf. Figure 2f). Since the properties of the generated molecules are intractable to evaluate in silico, we could predict them, handily, using the RT. However, as this might be a biased estimator, we evaluated them using Grover [33], a self-supervised Graph Transformer. Hence, the Spearman ρ reported in Figure 2f is based on Grover's predictions. Overall, the generative results underline the benefit of the self-consistency loss (α = 1) and demonstrate that the RT can adapt unseen seed molecules even according to complex molecular properties like water solubility. For a qualitative evaluation, we depict the generations for one exemplary seed molecule of the solubility dataset in Figure 3 (bottom). Last, corroborative for our work was the high correlation of our property predictions (RT) with Grover's for molecules generated by the ESOL, FreeSolv and Lipo models (0.86, 0.84 and 0.75, respectively). Thus, the Spearman ρ scores obtained with RT predictions are consistent with those obtained with Grover (cf. Table A4).

Figure 3: For each row, the seed molecule is shown in the middle alongside its true property. Based on 10 property primers, 10 molecules were decoded, but duplicates were discarded. Samples were generated with the self-consistency model. Top: QED dataset. Bottom: ESOL dataset of aqueous solubility. The solubility of the novel molecules was predicted by the RT itself and is externally validated by Grover [33].

Conditional molecular generation benchmark
To assess whether the RT is a powerful conditional generative model, we benchmarked it on a property-driven molecular generation task, namely pLogP-constrained optimization [34]. Given a seed molecule and a similarity constraint to the seed molecule (δ, given in Tanimoto similarity), the goal is to generate molecules with higher pLogP values. The results in Table 1 demonstrate that, for both similarity thresholds δ, the RT obtained the best results. Across both similarities, it outperforms a Junction-Tree-VAE [34] and a GCPN by 614% and 103% in average improvement, respectively. While the success rate of GCPN is higher than ours, we emphasize that both JT-VAE and GCPN applied gradient optimization schemes at inference time. The RT, in contrast, not only requires no optimization at this stage, but was also never trained explicitly to produce molecules with high pLogP. This finding demonstrates that the RT is able to compete with specialized conditional generative models in goal-directed molecular generation. At the same time, the RT also predicted the pLogP value with a Pearson correlation of 0.92, a task that cannot be addressed with regular conditional generative models. The results in Table 1 were obtained with the RT including the self-consistency loss; for ablation studies on the RT and further results on δ = 0.2 and δ = 0, see appendix A2.3.

To assess the generality of the RT beyond chemical languages, we benchmarked it in protein language modeling. On the synthetic pretraining data, the RT obtained nearly perfect results in predicting Boman's index (Spearman ρ > 0.994; Table 2a) and outperformed a baseline k-NN using Levenshtein distance [42]. But the RT also successfully generated peptides with a desired Boman index, given a partially corrupted amino-acid sequence (cf. Spearman ρ of 0.84, see Table 2b). Also, a higher fraction of masked tokens led to better results in protein generation tasks (cf. appendix Figure A3).

TAPE datasets (protein fluorescence & protein stability)
Next, the RT performed competitively on two realistic protein regression datasets from TAPE (cf. Table 2a). This is remarkable given that the TAPE models were pretrained at scale on unlabelled protein sequences and finetuned with a regression loss. For example, the RT outperforms all reported methods in Spearman correlation on the Fluorescence task, which has a bimodal distribution with modes for bright and dark proteins, respectively. Inspecting the predictions in more depth showed that the RT excels at recognizing the mode of a protein but struggles with intra-mode precision (see appendix A4.2). Overall, the competitive predictive performance of the RT demonstrates that the benefits of self-supervised pretraining can extend to numerically labelled datasets. This yields, en passant, a conditional generative model for property-driven local exploration of the protein sequence space. Evidence of this can be found in Table 2b: Whereas all TAPE models as well as the UniRep method are incapable of addressing this generation task, the RT was able to modify the test proteins such that their (predicted) stability correlated with the primed property (ρ = 0.44).

Modeling chemical reactions
Language models advanced reaction chemistry significantly [43,4] and also showed superior performance on yield prediction [36], yet models incorporating yield into (partial) reaction generation are lacking entirely.
We therefore optimized the RT for concurrent yield prediction and precursor generation on two reaction-yield datasets: Buchwald-Hartwig aminations [44] and Suzuki-Miyaura cross-couplings [45]. On yield prediction, the RT (trained on SELFIES) outperforms fingerprint-based or quantum-mechanics methods, and matches (Suzuki dataset) or almost matches (Buchwald dataset) the performance of language models like Yield-BERT, trained with a regression loss on SMILES (cf. Table 4a).
The same model learned to reconstruct missing precursors in Buchwald-Hartwig aminations, which can be useful to infer missing solvents or reagents in automatically extracted reactions (cf. Table 4b). This is partly achieved with great accuracy (e.g., 98.2% for aryl-halides). Interestingly, inferring additives proved challenging, possibly because they are the dominant precursor type for the reaction yield [44]. However, upon masking the additive only partially (rather than completely), the reconstruction performance increases significantly (ablation study with p_mask ∈ {0.25, 0.5, 1} in Table A5). On the Suzuki couplings, the reconstruction results are more balanced among the five precursor types; the average Tanimoto similarity to the true precursor was > 0.65 in all cases (cf. Table 4c). Moreover, across both datasets we observed mild benefits in reconstruction performance when providing the true yield rather than masking it (cf. Table A6/Table A7). In addition to yield prediction and precursor reconstruction, the RT can also decorate existing reactions by adapting specific precursors toward a higher yield (cf. Table 4b/4c). Consistently across both datasets and all precursor types, 40-80% of the top-5 predicted sequences contained reactions with entirely novel precursors and higher predicted yield. Figure 4d visualizes exemplary adaptations of the base and aryl-halide of a BH amination with very low yield (< 5%). Notably, for this unseen reaction, the RT found novel adaptations of each of the four precursor types that resulted in an increase of predicted yield by 11-85% (see Figure A4 for full details). With the forward reaction prediction model in IBM RXN [2], we confirmed that all reactions indeed result in the desired product. Notably, the confidence from the forward model rank-correlated strongly with the yield predicted by the RT (ρ = 0.90, p < 0.05).

Table 4: For reconstruction, we show the percentage of cases where the exact right precursor was among the top-3 predicted sequences and the Tanimoto similarity of the most similar of those molecules. For decoration, we show the percentage of cases where the top-5 predicted reactions contained a reaction with higher (predicted) yield than the seed reaction (success rate), alongside the associated average yield improvement. (d) Together with a BH amination from the validation dataset (top), we show two RT-generated reactions with adaptations of the base and halide, respectively, both with higher (predicted) yield according to the RT. The RXN confidence stems from the forward model by Schwaller et al. [2], which confirmed that the reaction would result in the shown product in all cases. For improvements of the additive and ligand of the same reaction, please see Figure A4.

Discussion
Here, we have presented the Regression Transformer (RT), demonstrated that regression can be cast as a conditional sequence learning task, and introduced a flexible multitask language model with wide application in scientific discovery. Our main contribution is a "Swiss army knife" Transformer that bridges tasks previously considered disjoint (property prediction and conditional generation), excels at both, and could thus pave the road toward foundation models in material design.
Regarding molecular property prediction, we find that the RT learns continuous properties even from small datasets, surpasses conventional regression models on several benchmarks and sometimes competes with Transformers trained on regression loss. Remarkably, this is achieved without providing ratio-scale information about the property, potentially even challenging the necessity of using regression rather than classification objectives.
The experiments on conditional text generation underline the versatility of the RT: Across a wide range of tasks, we conditionally generated novel sequences (molecules, proteins, reactions) that seemingly adhere to primed, continuous properties. We foresee this to be useful for property-driven, substructure-constrained molecular or protein design. Our experiments on the constrained molecular generation benchmark further demonstrate that the RT can surpass specialized conditional generative models.
Moreover, we emphasize that even though all experiments reported herein examined singular properties, the RT naturally scales to multiproperty prediction (see "Software" section on how to access pretrained multiproperty models).
Future work could, for example, intensify the work on reaction modeling (the RT effectively generalizes forward reaction and retrosynthesis models) or improve the ability of the RT to perform fine-grained regression (for an interesting failure mode, see appendix A4.1). Finally, our work resonates with the recent trend toward multitask Transformers [47,48,49] and we envision it as a means to accelerate the development of foundation models for scientific discovery applications.

Software and Data
Reproduction The codebase to facilitate reproduction of all experiments is publicly available at: https://github.com/IBM/regression-transformer.

Data
The data for the MoleculeNet experiments can be obtained from: https://moleculenet.org/datasets-1

The data for the molecular optimization experiments can be obtained from: https://github.com/wengong-jin/icml18-jtnn/tree/master/data/zinc

The data for the protein language modeling experiments can be obtained from: https://github.com/songlab-cal/tape

The data for the reaction yield experiments can be obtained from: https://github.com/rxn4chemistry/rxn_yields/tree/master/data

Usage of trained models
The RT is implemented in the Generative Toolkit for Scientific Discovery (GT4SD [50]), which provides ready-to-use pipelines for inference and training/finetuning on custom data. Via GT4SD, versions of the RT trained on the QED and ESOL datasets (small molecules) and the stability dataset (proteins) are available. Moreover, GT4SD also distributes two additional versions of the RT trained on multi-property prediction tasks. A notebook with a short demo can be found under: https://github.com/GT4SD/gt4sd-core/blob/main/notebooks/regression-transformer-demo.ipynb. The datasets used for benchmarking are available from the respectively referenced papers.

XLNet backbone
The Regression Transformer (RT) is built upon an XLNet backbone [12] to retain the benefits of autoregressive modeling in combination with a bidirectional context. At its core, XLNet is an autoregressive language model, but due to its novel training objective it obtains, in expectation, full bidirectional attention. This bidirectionality is critical because the RT is required to fill multiple tokens at arbitrary positions in a sequence while attending to the full remaining sequence. Moreover, the independence assumption in bidirectional but non-autoregressive models (like BERT) becomes increasingly disruptive as more masked tokens are filled. This limits BERT's applicability for generative tasks in biochemistry, such as scaffold decoration, where large portions of a molecule might be masked and the generation of individual atoms can critically alter the molecule's functional properties; it makes XLNet the better choice. In general, it is important to note that the proposed framework can be applied to all Transformer flavors, but it certainly benefits from autoregressive generation with full sequence attention even for discontiguous mask locations, as in XLNet or MPNet [51].

Tokenization
This section describes the processing of alphanumeric sequences, i.e., strings consisting of a mixture of numerical and textual symbols (for a visualization of the tokenization, see Figure A1, top). Unlike previous approaches that modelled 8-bit integers (i.e., pixels [52]) with a classifier, we strive to represent real numbers with arbitrary floating-point precision. Since representing every number as a single token is suboptimal due to a lack of generalization to new numbers and sparsity of the provided tokens, we formulated regression as a sequential categorical task. In turn, this necessitates a scheme for converting text representing numbers into a sequence of tokens. First, the following regular expression splits a string denoting a numerical: \s*(\+|-)?(\d+)(\.)?(\d+)?\s* Each of the resulting matches containing a number is converted to tokens t_{v,p}, where v ∈ {0, ..., 9} is the value/digit and p ∈ Z is the decimal place (e.g., 12.3 is split into [1_1, 2_0, ., 3_{-1}]). We call these numerical tokens. This representation has the advantage that it allows easy decoding of the digit sequence but also distinguishes the decimal order of digits by adhering to classic positional notation. Negative numbers are preceded by a special token. Regarding alphabetic tokens, we represent molecules as SELFIES [32] strings and tokenize them with their internal tokenizer. In one ablation study, we instead use SMILES [53] and tokenize with the regular expression from Schwaller et al. [43]. Protein sequences are tokenized per amino acid.
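As an illustration, the digit-and-decimal-place tokenization can be sketched as follows. This is a minimal re-implementation for illustration only; the `v_p` string format is our own shorthand for the token t_{v,p}, and the helper is not the paper's actual tokenizer.

```python
import re

# Simplified version of the number-matching pattern from the text.
NUM_RE = re.compile(r"(\+|-)?(\d+)(\.)?(\d+)?")

def tokenize_number(text: str) -> list:
    """Split a number into tokens 'v_p' (digit v at decimal place p),
    following classic positional notation; e.g. 12.3 -> 1_1, 2_0, ., 3_-1."""
    sign, integer, dot, fraction = NUM_RE.fullmatch(text.strip()).groups()
    tokens = ["-"] if sign == "-" else []   # negative numbers: special token
    for i, d in enumerate(integer):         # integer part, highest place first
        tokens.append(f"{d}_{len(integer) - 1 - i}")
    if dot:
        tokens.append(".")
    for i, d in enumerate(fraction or ""):  # fractional part: places -1, -2, ...
        tokens.append(f"{d}_{-(i + 1)}")
    return tokens
```

Decoding is the inverse: each token contributes v · 10^p to the reconstructed float, which is why the representation generalizes to unseen numbers.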

Numerical encodings (NE)
Due to the inherent structure of numbers, learning the embeddings of numerical tokens in a purely data-driven way might be ineffective. Moreover, since the RT is trained with cross-entropy loss, no notion of similarity between numerical tokens is conveyed. As a remedy, we propose numerical encodings (NE), a simple inductive bias about the semantic proximity of numerical tokens, similar to positional encodings [1]. In practice, we sum the NEs with the regular word embeddings and the relative positional encodings from XLNet (see appendix Figure A1 for a workflow). Our proposed numerical encodings are zero vectors for all but numerical tokens of the dictionary. We follow positional notation as above. Given a token t_{v,p} (with digit value v and decimal place p), the numerical encoding at embedding dimension j is defined as:

NE(t_{v,p})[j] = (−1)^j · (v · 10^p) / (j + 1)

Thus, the amplitude of the NE scales with the numerical value of the token. The NEs are perfectly correlated among embedding dimensions but alternate between positive and negative values for even and odd dimensions and vanish for higher dimensions (see example in Figure A2a). Critically, the pairwise distances of the numerical encodings are symmetric and decay monotonically with the float value (see Figure A2b). Note that we also experimented with integer-based numerical encodings (cf. Supplementary Material A2 for additional experiments).
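A minimal sketch of such a float-based encoding is shown below. The exact functional form is our assumption, chosen to be consistent with the stated properties (amplitude proportional to v · 10^p, alternating sign between even and odd dimensions, decay over higher dimensions); the paper's implementation may differ in detail.

```python
import numpy as np

def numerical_encoding(v: int, p: int, d_model: int = 16) -> np.ndarray:
    """Candidate encoding for token t_{v,p}: amplitude scales with the
    float value v * 10^p, the sign alternates between even and odd
    embedding dimensions, and the magnitude decays as 1/(j+1)."""
    j = np.arange(d_model)
    return (-1.0) ** j * (v * 10.0 ** p) / (j + 1)
```

Under this form, the pairwise distances between encodings are symmetric and grow monotonically with the difference of the encoded float values, which is exactly the proximity signal cross-entropy alone cannot provide.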

Training objectives
The input x for an RT is defined by a concatenation of k property tokens [x_p]_k and l textual tokens [x_t]_l, such that:

x = [x_p^1, ..., x_p^k, x_t^1, ..., x_t^l]

The full sequence length is T = k + l, and x_p and x_t are property and textual tokens, respectively.
Permutation language modeling (PLM) objective. The idea of PLM [12] is to fill masked tokens autoregressively by sampling a factorization order z for a sequence x at runtime. Decomposing the likelihood p_θ(x) according to the factorization order yields, in expectation, a bidirectional autoregressive model. Let z ∈ Z_T denote one of the T! permutations of our sequence x. If z_i and z_<i are the i-th and first i−1 elements of z, the PLM objective is:

max_θ  E_{z∼Z_T} [ Σ_{i=1}^{T} log p_θ(x_{z_i} | x_{z_<i}) ]    (3)

In practice, partial prediction is performed, i.e., only the last tokens of the factorization order z (those after a cutoff c) are predicted. Following XLNet, z is split into a (masked) target subsequence z_{>c} and an unmasked input sequence z_{≤c}, s.t. the objective becomes:

max_θ  E_{z∼Z_T} [ Σ_{i=c+1}^{T} log p_θ(x_{z_i} | x_{z_<i}) ]    (4)

where c is a hyperparameter, usually sampled per batch such that the fraction of masked tokens is roughly 1/c. We note that (4) does not make any specific choices on x_p and x_t. It thus constitutes our baseline objective. While (4) is a generic objective, it is computationally exhaustive to optimize due to the permutations. Moreover, it is not ideal for our needs because it does not distinguish between textual and property tokens. Instead, we aim to develop a single model that can either predict numerical tokens (when given text sequences) or text tokens (when given a combination of numerical and text tokens). To that end, we propose to train on two alternating objectives, one designed for property prediction and one for text generation.
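The partial-prediction split can be sketched in a few lines. This is a hypothetical helper for intuition only; XLNet's actual batching and attention-mask construction differ in detail, and the parameter K (controlling the predicted fraction) is our naming.

```python
import numpy as np

def partial_prediction_split(T: int, K: int = 5, seed: int = 0):
    """Sample a factorization order z of a length-T sequence and split it
    at a cutoff c so that roughly 1/K of the tokens become prediction
    targets (the last elements of z); the rest serve as unmasked context."""
    rng = np.random.default_rng(seed)
    z = rng.permutation(T)
    c = T - max(1, T // K)
    return z[:c], z[c:]  # context indices z_<=c, target indices z_>c
```

Each target token x_{z_i} is then predicted autoregressively, conditioned on the context tokens plus all targets earlier in the factorization order, as in Equation 4.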
Property prediction objective. Instead of randomizing which tokens are masked, this objective exclusively masks all property tokens. Specifically, we constrain the factorization order z by setting the first l elements to x^t and fixing c = l. This guarantees that only property tokens are masked. Let Z^p_T denote the set of possible permutations under this constraint. The objective then becomes

J_P = max_θ E_{z∼Z^p_T} [ Σ_{i=c+1}^{T} log p_θ(x^p_{z_i} | x^t, x^p_{z_{>c,<i}}) ]     (5)

where x^p_{z_{>c,<i}} denotes the c-th to the (i−1)-th elements of the factorization order z. We emphasize that this "tailored" property objective J_P is still optimized with a cross-entropy loss in practice. Note that this loss cannot convey any notion of the qualitative proximity of the prediction to the labels, because the level of measurement of tokens in a language model is nominal. Thus, predicting a sequence of numerical tokens corresponding to a property score of 0.91 for a sample with a true property of 0.11 will not generally result in a higher loss than predicting 0.21. A traditional regression loss, in contrast, operates on a ratio scale.
Conditional text generation objective. This objective facilitates the generation of textual tokens given a property primer and partial textual tokens. We constrain the factorization order z by setting the first k elements to x^p and sampling the cutoff c s.t. c ≥ k. This ensures that masking only occurs on textual tokens. With this constraint, we denote the set of permutations by Z^t_T and the objective becomes

J_G = max_θ E_{z∼Z^t_T} [ Σ_{i=c+1}^{T} log p_θ(x^t_{z_i} | x_{z_≤c}, x^t_{z_{>c,<i}}) ]     (6)

Intuitively, this objective applies regular PLM while sparing the numerical tokens. It then aims to reconstruct the full text sequence (i.e., molecule) given the uncorrupted property tokens and the partially corrupted textual tokens.
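The masking constraints of the two alternating objectives can be sketched as follows (a simplified illustration with our own names; the actual implementation realizes the constraint through XLNet's permutation masks):

```python
import random

def constrained_order(n_prop, n_text, mask_properties, seed=0):
    """Factorization orders for the two alternating objectives:
    the property objective masks only property tokens (cutoff c = l),
    while the generation objective samples a cutoff c >= k so that
    masking only hits textual tokens."""
    rng = random.Random(seed)
    prop = list(range(n_prop))                   # property positions
    text = list(range(n_prop, n_prop + n_text))  # textual positions
    if mask_properties:                          # J_P: only x^p masked
        rng.shuffle(prop)
        return text + prop, n_text               # cutoff c = l
    rng.shuffle(text)                            # J_G: only x^t masked
    cutoff = rng.randint(n_prop, n_prop + n_text - 1)
    return prop + text, cutoff

order, c = constrained_order(n_prop=6, n_text=20, mask_properties=True)
```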
Self-consistency (SC) objective. On its own, the above conditional text generation objective (6) does not reward generated sequences for adhering to the primed property. This is critical because in chemical as well as natural languages, changes in single tokens (i.e., atoms, amino acids or (sub)words) can drastically change the property (meaning) of a sequence (sentence). As a remedy, we extend the text generation objective J_G by a self-consistency term that exploits the dichotomy of the Regression Transformer. The full objective is given by:

J_SC = J_G + α · J_P(x̂)     (7)

where the second addend is the self-consistency term, weighted by a factor α. Intuitively, it reflects the difference between the property of the sample and the predicted property of the generated sample x̂. Here, x̂ is obtained by greedily decoding the masked tokens and combining the result with the non-corrupted tokens of x. To be precise,

x̂ = m ⊙ x̄ + (1 − m) ⊙ x     (8)

where m is an indicator vector of whether masking occurred at a given position and x̄ = argmax p_θ(x_{z_>c} | x_{z_≤c}) is the result of greedy decoding. In such a formulation, the RT acts as an oracle during its own optimization, resembling an additional layer of self-supervision. While this scheme risks undesired side effects when the model performs poorly at property prediction, it introduces a notion of self-consistency and rewards the generation of molecules that differ from training samples as long as they adhere to the property.
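The assembly of the self-consistency input x̂ amounts to a token-wise selection between decoded and original tokens; a minimal sketch (the token lists are illustrative):

```python
def self_consistency_input(x, x_decoded, mask):
    """Assemble x_hat: take the greedily decoded tokens at masked
    positions and the original, non-corrupted tokens elsewhere. x_hat
    is then scored with the property-prediction objective J_P."""
    return [d if m else o for o, d, m in zip(x, x_decoded, mask)]

x       = ["<qed>", "0", ".", "8", "C", "C", "O"]
decoded = ["<qed>", "0", ".", "8", "C", "N", "O"]
mask    = [False] * 5 + [True, False]     # only position 5 was masked
x_hat = self_consistency_input(x, decoded, mask)
```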

Regression.
For the regression (or property prediction) task, we convert the sequence of predicted (numerical) tokens into a floating-point prediction (notably, the model never produced a token sequence that did not correspond to a valid number). We then report the root-mean-squared error (RMSE), Pearson's correlation coefficient (PCC) or the coefficient of determination (R²), depending on the dataset and previously reported methods.
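Converting the predicted digit tokens t_{v,p} back to a float is a simple positional sum; a sketch under the notation above (the function name is ours):

```python
def tokens_to_float(tokens):
    """Convert predicted numerical tokens t_{v,p} (digit value v at
    decimal place p) into a floating-point prediction, e.g. for RMSE."""
    return sum(v * 10 ** p for v, p in tokens)

# Tokens for the digits of 0.91: t_{0,0}, t_{9,-1}, t_{1,-2}
pred = tokens_to_float([(0, 0), (9, -1), (1, -2)])
```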

Conditional sequence generation.
Depending on the application domain, different metrics are utilized.
Small molecule and protein modeling. We strive to assess the model's ability to decorate an arbitrary, possibly discontiguous fractional input sequence (e.g., a molecular scaffold) according to a property of interest. Therefore, we randomly mask a fraction of the tokens of the text sequence and then query the model with ten equidistant property primers spanning the full range of property values. The metric is the average Spearman's ρ between the ten primers and the actual properties. Spearman is preferable to Pearson here because it is sensitive only to ranks. Note that due to constraints induced by the fragmented sequence, covering the entire property spectrum is usually impossible, such that, e.g., RMSE is inappropriate for this task (e.g., priming a highly toxic scaffold with low toxicity cannot yield a non-toxic molecule). As a sanity check, we also report 0-Var, i.e., the percentage of samples for which the generation was unaffected by the primer (the lower the better). On the property optimization benchmark from Jin et al. [34], we report the same metrics as in their work: the success rate in generating molecules with higher logP (while adhering to the similarity constraint δ), the Tanimoto similarity δ to the seed molecule, and the average improvement in logP.
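The primer-correlation metric can be sketched as follows (a pure-Python Spearman without tie handling; the property values are illustrative):

```python
def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank vectors
    (assumes no ties, which holds for continuous properties)."""
    def ranks(x):
        order = sorted(range(len(x)), key=x.__getitem__)
        r = [0.0] * len(x)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    var = (sum((p - ma) ** 2 for p in ra)
           * sum((q - mb) ** 2 for q in rb)) ** 0.5
    return cov / var

# Ten equidistant property primers vs. measured properties of generations
primers = [i / 9 for i in range(10)]
props = [0.1, 0.15, 0.2, 0.18, 0.3, 0.35, 0.5, 0.48, 0.6, 0.7]
rho = spearman(primers, props)
```

In the benchmark, this ρ would be averaged over all seed sequences.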
Chemical reaction modeling. For the reaction yield datasets, we challenge the model with two sequence generation tasks. The first is fully reconstructing a precursor solely from the remaining precursors and the reaction yield. The top-3 predicted sequences (decoded via beam search) are considered, s.t. top-3 accuracy is reported. Additionally, we report the average Tanimoto similarity of the most similar of the top-3 molecules to the seed molecule (fingerprint: ECFP4). Second, we measure the capability of decorating existing reactions to obtain a (potentially) higher yield. To that end, the model is prompted with incomplete reactions consisting of an increased yield, an entirely masked precursor and the complete remaining precursors. We consider the top-3 predicted sequences (decoded via beam search) and report the fraction of samples where at least one of the generated reactions had a higher (predicted) yield (success rate). The second response metric is the mean improvement in (predicted) reaction yield (yield y ∈ [0, 100]; the distributions are right-skewed). Note that we exclude trivial solutions by removing all predicted precursors that exist in the training dataset.
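The similarity metric can be sketched as follows, assuming fingerprints are given as sets of on-bits (e.g., precomputed ECFP4 bits from RDKit; the example sets are illustrative):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints represented as
    sets of on-bits: |intersection| / |union|."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def best_of_topk(seed_fp, topk_fps):
    """Similarity of the most similar of the top-k decoded molecules
    to the seed molecule."""
    return max(tanimoto(seed_fp, fp) for fp in topk_fps)

sim = best_of_topk({1, 2, 3, 4}, [{1, 2}, {1, 2, 3, 5}, {9}])
```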

Chemical language modeling
Synthetic QED dataset. Starting from ∼1.6M bioactive molecules from ChEMBL [54], we created a synthetic dataset by computing the QED [30] score (q ∈ [0, 1]) for all molecules with RDKit and rounding it to 3 decimal places. We used ∼1.4M molecules for training, 1k for validation and 10k for testing.
MoleculeNet datasets. We focused on 3 regression datasets from the MoleculeNet benchmark [31]: ESOL, FreeSolv and Lipophilicity, where the task is to predict water solubility, hydration free energy and lipophilicity of a molecule, respectively. For each dataset, we performed 3 random splits (as recommended by [31]) with 15% validation data. Because the datasets are small (< 5000 samples), we used SMILES augmentation [55] to augment the dataset by a factor of 16.
Property-optimization benchmark. This is a benchmark for property-driven, conditional molecular generation. The goal is to adapt a seed molecule such that a property is maximized while adhering to a fixed similarity constraint. We obtained the data from Jin et al. [34], which ships with a fixed split of 215,381 training and 799 test molecules and their penalized logP (pLogP) values [56]. pLogP is the octanol-water partition coefficient (logP) penalized by the synthetic accessibility score and the number of cycles with > 6 atoms. Hence, pLogP, just like QED, can be computed deterministically from the molecule. To maximize comparability, we followed the candidate assembly process of Jin et al. [34], described in Appendix A1.1.3.

Protein sequence language modeling
Synthetic Boman dataset. As a large-scale, labelled dataset we focused on the Boman index, a measure of potential protein interaction for peptides. It is the average of the solubility values of the residues [57]. We collected all 2,648,205 peptides with 15 to 45 AAs from UniProt [58], computed their Boman index, and used 10k and 1k samples for testing and validation, respectively.
TAPE datasets. We focused on two datasets from the TAPE benchmark [35]: Fluorescence [59] and Stability [60]. The goal is to predict, respectively, the fluorescence and the intrinsic folding stability of a protein that is one to four mutations away from a training protein. Both datasets ship with fixed splits. The fluorescence (stability) dataset has 21,446 (53,416) training, 5,362 (2,512) validation and 27,217 (12,851) test samples.

Chemical reaction datasets
We investigated two high-throughput experimentation (HTE) yield datasets that examine specific reaction types: Buchwald-Hartwig aminations [44] and Suzuki-Miyaura cross-coupling reactions [45]. Both datasets were investigated with the same 10 random splits as in Schwaller et al. [36], with a 70/30% train/validation ratio.
Buchwald-Hartwig. This dataset, produced by Ahneman et al. [44], investigates HTE of palladium-catalysed Buchwald-Hartwig C-N cross-coupling reactions. The reaction space comprises 3,955 reactions, spanned by 15 unique aryl and heteroaryl halides, 4 Buchwald ligands, 3 bases and 22 isoxazole additives. A palladium catalyst and 4-methylaniline are the fifth and sixth precursors, respectively; however, they are identical across all reactions. Each reaction is associated with a yield y ∈ [0, 100], and the 10 random splits were identical to the ones released by Sandfort et al. [46] that are also used by all competing methods in Table 4b.
Suzuki cross-couplings. This dataset was provided by Perera et al. [45] and investigates HTE of Suzuki-Miyaura reactions across 15 pairs of electrophiles and nucleophiles, each leading to a different product. For each pair, a combination of 4 solvents, 12 ligands and 8 bases (reagents) was measured, resulting in a total of 5,760 reaction yields that we scale to the range [0, 100]. The catalyst is identical for all reactions; some reactions omitted the ligand or the base, while others contained electrophiles, nucleophiles, ligands, bases or solvents that were composed of multiple fragments (e.g., salts).
USPTO. Before training on the narrow yield datasets, we warmed up the model to learn generic reaction chemistry. We used reactions from the US Patent Office (USPTO), the largest open-source dataset of chemical reactions [61]. Since no yield information was available, the numerical property used was the total molecular weight of all precursors. The dataset contained n = 2,830,616 reactions and was obtained from Schwaller et al. [4].

Appendix A1 Training and evaluation procedure
All experiments build upon the XLNet [12] backbone from the HuggingFace library [62]. We expanded the XLNet backbone with our proposed tokenization scheme, an additional encoding layer for the numerical embeddings (N_dim = 16) and the custom training objectives (cf. Figure A1). As shown at the bottom of Figure A1, the RT is trained with an alternating training scheme, derived from the PLM objective [12] and designed to concurrently optimize property prediction and conditional generation. The RT naturally scales to multiple property tags, e.g.:

<QED>0.428|…|<ESOL>-2.92|N#[N+][N-]c1ccc(C)cc1
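A hypothetical sketch of how such a property prefix could be tokenized into a property tag plus digit tokens t_{v,p} (the `_v_p_` token format and the helper name are assumptions for illustration):

```python
import re

def tokenize_property(prop):
    """Split a property expression like '<qed>0.428' into a property
    tag and digit tokens carrying their decimal place (t_{v,p})."""
    tag, value = re.match(r"(<[^>]+>)(-?[\d.]+)", prop).groups()
    magnitude = value.lstrip("-")
    digits = magnitude.replace(".", "")
    int_len = len(magnitude.split(".")[0])   # digits before the decimal point
    tokens = [tag] + (["-"] if value.startswith("-") else [])
    for i, d in enumerate(digits):
        tokens.append(f"_{d}_{int_len - 1 - i}_")   # token t_{v,p}
    return tokens

toks = tokenize_property("<qed>0.428")
```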
Regarding architectural hyperparameters, we used 32 hidden layers in the Transformer encoder, with a dimensionality of 256, 1024 units in the feed-forward layer and 16 attention heads (20% dropout). Altogether, this model has ∼27M trainable parameters (exact numbers vary depending on the vocabulary size). During evaluation, greedy decoding was used for property prediction and beam-search decoding for conditional sequence generation. We used PyTorch 1.3.1 [63] and the XLNet backbone from Transformers 3.1.0 [62]. All models were trained on single GPUs (NVIDIA Tesla A100 or V100).
In the following sections, we elaborate on the training procedures for each dataset. For the MoleculeNet datasets, the models were warm-started using the QED initialization and trained for only 50k steps (batch size 4) with early stopping. Since the QED pretraining utilized numerical values in [0, 1], we normalized the regression values of the MoleculeNet datasets to the same range and also rounded them to three decimal places. For all objectives, unless otherwise constrained, we set the masking hyperparameter c = 5 and restricted the span of consecutively masked tokens to a maximum of 5 tokens.
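The span-constrained masking can be sketched as follows (a simplified illustration under our own names; trimming overshooting spans back to the target count is our own choice):

```python
import random

def sample_mask(seq_len, c=5, max_span=5, seed=0):
    """Randomly mask ~1/c of the positions, drawn in spans of at most
    `max_span` consecutive tokens."""
    rng = random.Random(seed)
    target = max(1, round(seq_len / c))
    masked = set()
    while len(masked) < target:
        span = rng.randint(1, max_span)
        start = rng.randint(0, seq_len - 1)
        masked |= set(range(start, min(start + span, seq_len)))
    return sorted(masked)[:target]   # trim any overshoot to the target count

mask = sample_mask(seq_len=50, c=5, max_span=5)
```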

A1.1.3 Property-optimization benchmark
For this task, the models were also warm-started using the QED initialization and trained for 50k steps with early stopping on perplexity. To assemble the candidates for the optimization of one seed molecule, we followed the process of Jin et al. [34] as closely as possible. Jin et al. [34] applied 80 gradient steps, then decoded 80 molecules and reported the molecule with the highest pLogP score that satisfies the similarity constraint δ. Instead, we formed a pool of molecules by prompting the model 80 times with the same seed molecule while varying the fraction and the maximum span of masked tokens. From this pool of decodings, we report the molecule with the highest pLogP, just like Jin et al. [34] and You et al. [40].

A1.2.2 TAPE datasets
Following the ablation study on the loss functions (see Table A4), which revealed the best results for the self-consistency objective, we focused the finetuning exclusively on this configuration. For both datasets, three models were warm-started using the Boman initialization and trained until validation performance saturated (∼100k steps). The numerical values were again scaled to [0, 1]. On the Fluorescence data, a small amount of Gaussian noise was added to some training samples due to an interesting failure mode (see A4.1). For the evaluation of the conditional generation task, the models were given more flexibility: 60% of the tokens were masked (i.e., c = 1.7 in Equation 3) and the maximum span was 7 AA residues. We did not evaluate the RT on conditional generation for the Fluorescence dataset because of a massive pretraining-finetuning mismatch: while the Boman dataset used for pretraining consisted of peptides with 15 to 45 residues (mean/std: 36 ± 7), the fluorescence proteins were significantly larger (246 ± 0.2 residues). In contrast, the proteins in the stability dataset were similar in size to the pretraining data (45 ± 3 residues).

A1.3 Reaction yield datasets
Pretraining. Since the two reaction yield datasets only cover narrow regions of the chemical space (one template applied to many precursor combinations), we warmed up the model on broader reaction chemistry extracted from patents (USPTO). 5,000 reactions were held out for validation and the model was trained until validation performance on the two alternating objectives (Equation 5 and Equation 7 with α = 1) saturated. The masking hyperparameter c was set to 2.5 and the model was trained for ∼2 days (single GPU). The vocabulary for reaction SELFIES contained 861 tokens.
Finetuning. For both the Buchwald-Hartwig reactions [44] and the Suzuki couplings [45], ten models were finetuned, one per repeated random split. The training objectives again alternated every 50 steps between property prediction (Equation 5) and conditional generation (Equation 7 with α = 1) for a maximum of 50k steps (∼1 day). Notably, during the conditional generation task, we sampled one precursor per batch and then entirely, but exclusively, masked this precursor. Thus, the model's objective became to reconstruct a missing precursor from the remaining precursors and the reaction yield (or to produce an alternative precursor with a similar predicted yield).

A1.4.1 k-Nearest-Neighbor (k-NN)
For small molecule and protein modeling, we reported property prediction results with a k-NN baseline model. For small molecules, the distance measure was the (inverted) Tanimoto similarity [64] of ECFP4 fingerprints [65]. For the protein language models, the Levenshtein distance between the protein sequences was used [42]. For the k-NN baseline models, k was determined based on the best performance on the validation data. This led to k = 25 for the drug-likeness/QED task, k = 21 for the protein interaction (Boman index) task, k = 50 for the fluorescence task and k = 15 for the stability task.
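A minimal sketch of the Levenshtein-based k-NN baseline for proteins (the toy training set and the mean-aggregation over neighbors are illustrative):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (iterative DP, O(len(a)*len(b)))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[-1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def knn_predict(query, train, k):
    """Predict a property as the mean over the k nearest training
    sequences under Levenshtein distance."""
    nearest = sorted(train, key=lambda s: levenshtein(query, s[0]))[:k]
    return sum(y for _, y in nearest) / k

y = knn_predict("ACDE", [("ACDF", 1.0), ("WWWW", 5.0), ("ACDE", 2.0)], k=2)
```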

A1.4.2 XLNet with regression head
For the molecular property prediction on the MoleculeNet datasets, we trained an XLNet [12] model with a conventional regression loss. This maximizes comparability to the RT since it, unlike the other models in Table 2e, also uses an XLNet backbone. This model was initialized with the XLNet-base-cased weights from HuggingFace, and subsequently the SequenceClassification head was finetuned with an L2 loss. The model contained ∼93M parameters and was finetuned for 200 epochs without any hyperparameter optimization. Early stopping was used to determine the best epoch.

A2.1.1 Description of Integer encodings.
As an alternative to the float-based numerical encodings (NE), we experimented with an encoding scheme relying solely on positive integers. Note that any regression problem can trivially be cast to a regression problem where all labels are positive integers. Under this consideration, we need to define NEs only for positive integers, similar to positional encodings. We therefore propose to directly utilize the definition from Vaswani et al. [1] as NEs:

NE(n)[2j] = sin(n / 10000^{2j/d_e}),  NE(n)[2j+1] = cos(n / 10000^{2j/d_e})

where d_e is the embedding size. The advantage of this integer-based encoding is that every embedding dimension captures fluctuations of different frequencies, using trigonometric functions as continuous analogs of alternating bits. Practically, to use the integer NEs, the property values were cast to the range [0, 1000] and rounded.

Table A1 provides extended results of Table 2a in the main paper, including the standard deviations across several runs of the Regression Transformer. In this setting, of the two types of proposed numerical encodings, the float-based encodings yielded slightly superior results to the integer-based encodings. Similarly, Table A2 shows extended results of Table 2b in the main paper, including standard deviations and the ablation study on integer vs. float encodings. Here, integer encodings (IE) are superior for regression but inferior for conditional generation. Due to that, and the non-applicability of IEs to floating-point numbers, we decided not to explore them further.

Table A1: Performance evaluation of PLM training. FE refers to our main float encodings, whereas Int refers to the integer encodings described above.

Table A2: Performance evaluation on alternating objectives. The decrease in perplexity compared to the vanilla PLM training is expected given the discrepancy between the refined, alternating objective and the PLM objective.
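The integer-based encoding directly reuses the sinusoidal positional-encoding definition; a sketch (d_e = 16 and the function name are illustrative):

```python
import math

def integer_encoding(n, d_e=16):
    """Sinusoidal encoding of a positive integer n: each pair of
    dimensions captures fluctuations at a different frequency."""
    enc = []
    for j in range(0, d_e, 2):
        freq = 1.0 / (10000 ** (j / d_e))
        enc.append(math.sin(n * freq))
        enc.append(math.cos(n * freq))
    return enc

# Property values are first scaled to [0, 1000] and rounded
enc = integer_encoding(round(0.428 * 1000))
```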

Summation vs. concatenation of numerical encodings. We decided to follow the common approach of summing additional encodings with the learned embeddings [1,12], but note that disentangling content and position embeddings can improve language models [66]. Thus, instead of summing the numerical encodings with the regular embeddings, we also experimented with concatenation (dimensionality of 32 for the NEs). This produced slightly inferior but nearly identical results (see Table A3). We propose to use summation for two reasons: first, it avoids additional hyperparameters and model weights; second, it likely yields approximately orthogonal subspaces of token embeddings and numerical encodings (due to the high dimensionality), which obviates the need to enforce orthogonality via concatenation. While we conjectured that using NEs improves the performance in both tasks (property prediction and conditional generation), we emphasize that providing this prior might not be necessary given enough data. We hypothesize that refining our NEs might yield better results and, in particular, faster convergence, but leave further refinement to future work, especially given the plethora of research on positional encodings [67,68,69].

A2.2 Conditional generation: External evaluation vs. self-evaluation
Generally, it is intractable to evaluate the performance in most property-driven molecular generation tasks because the property of interest can only be measured in the wet lab. In the main paper, we reported the predicted ESOL, FreeSolv and Lipophilicity values based on the GROVER approach [33], a graph Transformer with large-scale self-supervised pretraining. Table A4 shows that a self-evaluation with the Regression Transformer would have led to very similar results in all three conditional generation tasks. This is reassuring because the RT is, at least in the self-consistency setting (α = 1), a biased estimator, since the model itself is used to optimize the conditional generation process. Based on this finding, we refrained from seeking external validation for the conditional protein and reaction generation tasks.
Table A4: Conditional generation for the MoleculeNet datasets. Average performances across all splits for training with alternating objectives are given. "ρ with RT" refers to the self-evaluation, whereas "ρ with GROVER" refers to predictions obtained with the model from [33].

A2.3 Conditional molecular generation (constrained property optimization benchmark)
On the constrained property optimization benchmark, we conducted ablation studies of the Regression Transformer on the use of float-based numerical encodings (NE) as well as the self-consistency loss function. The main metric of this task is the mean improvement in pLogP compared to the seed molecule. The results can be found in Table A3. The value of α refers to Equation 7: α = 0 means that no self-consistency loss was used, and α = 1 implies that the self-consistency loss was used with a weight equal to the regular conditional text generation objective (cf. Equation 6). The results of the ablation study indicate that the RT consistently outperformed the JT-VAE and GCPN in the main metric (improvement) by a wide margin.
Table A3: Further results on the constrained property optimization benchmark. JT-VAE is from Jin et al. [34] and GCPN from You et al. [40]. NE refers to the use of numerical encodings.
Like for the QED dataset, for protein sequence modeling we also investigated the impact of the three training-loss setups. The results in Table A4 show that the proposed training scheme with alternating optimization of property tokens and text tokens was highly effective for both the regression and the generation task. In addition, as on the QED dataset, the self-consistency loss led to better results in conditional generation, but at the expense of slightly reduced accuracy in regression. As stated in the main text, this is most likely caused by the self-evaluations of the decoded sequences: these sequences might differ significantly from the training sequences but are still used with the property value of the original sequences. Since the Boman index can be computed directly from the sequence, this hypothesis could, in principle, be confirmed by correcting the property value during the self-evaluation call. However, limited value would come from such an approach because real datasets involve more complex properties.
Apart from that, Figure A3 reveals a general trend in conditional generation with the Regression Transformer: more freedom in the generative process (i.e., a higher fraction of masked amino acid residues) leads to better results in terms of Spearman ρ between the property primers and the obtained properties. This comes, however, at the cost of reduced similarity to the seed sequence.
Figure A3: Correlation between property primer and property of generated protein sequences. The model's ability to generate protein sequences with a desired protein interaction index. The self-consistency loss yielded the best results and, generally, a higher fraction of masked tokens led to generated peptides that adhere better to the primed property value. Note that the Boman/protein interaction index can be assessed in silico from the sequence alone.

A2.5 Chemical reaction modeling
This subsection lists additional results related to reaction yield modeling. Table A5 reports an ablation study on the impact of p_mask (i.e., the probability of masking a specific token) on the reconstruction of additives in Buchwald-Hartwig aminations. Table A6 and Table A7 report an ablation study assessing whether co-encoding the reaction yield enables the model to better reconstruct precursors.
Figure A4: Adapting an unseen Buchwald-Hartwig amination toward higher yield. Alongside a seed reaction and its reported yield, the RT can generate reactions that selectively replace individual precursors. In this case, upon priming for higher yield and a given precursor type, the RT indeed generated reactions with higher yield (as predicted by the RT) as well as higher confidence for the reaction to succeed in general (predicted with the forward model from [2]). Note that no adaptations of 4-methylaniline and the palladium catalyst are generated since they are constant across the dataset. This is the full version of Figure 4d in the main manuscript.
A3 Case studies

A3.1 Case study on scaffold hopping
Scaffold hopping is a technique in medicinal chemistry with the goal of discovering novel compounds by modifying the central core structure (i.e., removing substituents while retaining rings and their linker fragments) of known compounds [70]. We simulated this task on the QED dataset by determining the scaffold with RDKit and masking only the non-scaffold tokens (in contrast to the regular evaluation, where 40% of the tokens were masked at random). This task was only performed with the SMILES models since scaffolds cannot be determined trivially in SELFIES. In general, this task is more challenging because the molecule is more constrained: on average, fewer tokens are masked, and in most cases the full range of drug-likeness cannot be covered given the scaffold. This explains the higher percentage of molecules for which the primer did not influence the generations (cf. Table A8). Note, however, that this includes cases where the molecule is itself a scaffold and thus no tokens are masked (we do not control for that explicitly). The generations for one exemplary molecule are shown in Figure A5. In this example, it is interesting to see that the model decorated the scaffold with specific atoms on the rightmost six-ring. These atoms (iodine, chlorine and bromine), which were produced for low to high QED primers respectively, seem to be indicative of different levels of drug-likeness. One drawback, however, is that the RT cannot fill zero or multiple tokens at the position of one [MASK] location. For example, in the case of the last primer (0.86), the provided scaffold already had a QED of 0.87, so not adding any new atoms would have been the best choice.

A3.2 Case study on attention visualization: Interpreting attention heads
We visualized the attention scores using BertViz [71]. Here, we aimed to compare the inference patterns across the two tasks, property prediction and conditional generation. The results for the first 4 (out of 32) layers are shown in Figure A6. In general, many attention patterns commonly described in natural language models are also present in the Regression Transformer. For example, the bag-of-words pattern (i.e., evenly distributed attention; e.g., all heads of the first layer), the next-token pattern (e.g., layer 4, heads 4 and 5) and the previous-token pattern (e.g., layer 2, head 2) are clearly visible. While these patterns are consistently present in both tasks, probably because they are useful irrespective of the particular task, some patterns distinctive to either of the tasks can be found. For example, in the conditional generation task (Figure A6, right), many triangles with their right angle in the upper right are present. In these positions the property tokens reside, so these patterns indicate that the representations of all other tokens, especially also the masked ones, are heavily influenced by the property value. In the property prediction task (Figure A6, left), in contrast, many triangles with their right angle in the lower right are present. This implies heavy attention on the [END] token, which marks the end of the sequence and is a useful indicator for the QED score because the latter is critically influenced by the size/weight of the molecule. One particularly interesting attention head is head 3 in layer 2. In the property prediction task, its role is to make the masked property tokens aware of the sequence length. In the conditional generation task, its role is to make all tokens aware of the property values.
Figure A6: Comparing attention scores across both tasks with BertViz [71]. Attention scores for all heads of the first four layers. Rows depict layers, columns depict attention heads. Within each cell, the tokens are ordered from top to bottom.
Left: Property prediction task. Right: Conditional generation task. Plot performed with SELFIES model with float encodings, trained on the self-consistency loss.

A4.1 Failure mode in fluorescence dataset from TAPE
This dataset is particularly interesting due to its bimodal distribution: one mode corresponding to bright proteins, the other to dark proteins (cf. Figure A7). An interesting failure mode was observed when initially training on the Fluorescence dataset. Figure A7 shows that the dark mode has one sharp spike, exactly at a log-fluorescence value of 1.301. Almost 10% of all training samples and almost 50% of the proteins in the dark mode have this exact value. The Regression Transformer is trained on a classification loss, so the training loss for such samples is distributed across the five tokens t_{1,0}, ".", t_{3,-1}, t_{0,-2} and t_{1,-3}. In many cases, the model collapsed to always predicting 3.301, where the first token (t_{3,0}) was correct for all samples in the bright mode and the remaining tokens (t_{3,-1}, t_{0,-2} and t_{1,-3}) were correct for most samples in the dark mode. This happened because no weighting of the individual numerical tokens was applied. As a non-algorithmic remedy, we added Gaussian noise to those training samples. Figure A7 reveals the improved performance of the RT compared to the finetuned TAPE Transformer: fewer samples were predicted in the wrong mode. However, the RT had difficulties with fine-grained regression, in particular in the bright mode. This becomes particularly apparent when inspecting the detailed results, grouped by bright and dark test proteins, respectively
Figure A7: Bimodal distribution of the fluorescence data. The upper part of the plot has been copied from Rao et al. [35] (Figure 3). It shows the bimodal distribution of the training data and the test predictions from the TAPE Transformer. At the bottom, we show our remake of the above plot, replacing the predictions from the pretrained TAPE Transformer with the predictions from the Regression Transformer.
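The noise-based remedy can be sketched as follows (the noise magnitude σ is an assumption; the paper does not state the value used):

```python
import random

def despike(labels, spike=1.301, sigma=0.01, seed=0):
    """Add small Gaussian noise to training labels that sit exactly on
    the degenerate spike, so the token-level cross-entropy no longer
    rewards collapsing onto a single numeric token sequence."""
    rng = random.Random(seed)
    return [y + rng.gauss(0.0, sigma) if y == spike else y for y in labels]

noisy = despike([1.301, 3.3, 1.301, 2.9])
```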

(cf. Table A9). While the RT achieved the best results in the overall Spearman ρ, the metric recommended by [35], it does not dominate any of the mode-specific metrics. This is a noteworthy finding because it reflects the tendency of the RT to perform a multi-class classification rather than a full regression. It is also interesting that the baseline models (k-NN and TAPE one-hot) achieved the best results in MSE on bright proteins.