Introduction

Late-stage functionalization (LSF) is a powerful technique in medicinal chemistry. The magic methyl effect describes the ability of a single methyl group, even one distal to the binding motif, to dramatically improve (or reduce) potency, solubility, and metabolic stability1. However, methyl groups are not the only motif that can radically change pharmacological properties. Fluoro2, chloro3, trifluoromethyl4, and hydroxyl groups5 are known beneficial motifs and/or temporary functional handles towards other beneficial motifs. Over the past several decades, numerous methods have been developed to diversify lead compounds and selectively install these biologically privileged groups directly6,7,8,9. One methodology commonly utilized in LSF is the Minisci-type functionalization, whereby a radical species adds to an electron-deficient (hetero)arene (Fig. 1A)10,11,12. However, the promiscuity of this single-electron method in conjunction with the inherent structural complexity of LSF molecules makes regioselectivity prediction challenging. Regiochemical predictions for Minisci-type reactions were first summarized by O’Hara et al. who developed a set of guidelines to determine sites of reactivity based on the nucleophilicity of the alkyl radical species, pH of the reaction, solvent effects, and electronics of the heteroarene13. These observations were later formalized when they were noted to correlate well with the indices from Fukui functions, i.e., functions that describe the change in electron density upon the addition or removal of an electron. In the literature, Fukui-based reactivity indices predict the most reactive sites of Minisci functionalization with an average accuracy of 93% (average F-score of 0.77), albeit usually on smaller, minimally functionalized molecules14,15.

Fig. 1: Overview of the model framework, reactions modeled, and model dataset.

Source data for dataset breakdown is provided in the source data Excel file. LSF = late-stage functionalization, NN = neural network. A Mechanistic difference between the one-electron-based transformations of the two major types of reactions in the dataset: Minisci and P450. B Graphical overview of the basic message passing neural network (MPNN) model. Molecules are represented as graphs and passed through the MPNN, where each atom’s information is propagated to its through-bond neighbors. The resulting embedded molecule (featurized molecule) is then concatenated with the one-hot encoded reaction information. The resulting vector is given to the final neural network to predict the probability of functionalization of each atom. C Distribution of reaction sites per molecule and molecule size in the dataset. The inclusion of negative data (0 reactive sites) was key to model performance. The majority of LSF molecules had between 20 and 40 heavy atoms (non-hydrogen atoms).

There are two main approaches in the literature for regiochemical predictions: quantum chemical and data-driven. Quantum chemistry-based approaches predict reactivity and regioselectivity by computing energy barriers using techniques such as density functional theory (DFT) or machine-learning (ML) approximations of DFT energies16,17,18. Data-driven approaches work directly with experimental data, fitting statistical models to correlate known chemical features with observed regioselectivity outcomes19,20,21,22,23,24,25. Whilst computational data is more plentiful and significantly less noisy than real-world data, notable performance can be achieved with carefully curated literature datasets. Some experimentally based reactivity models can reach human expert performance in their predictions19. However, ML-based regiochemical prediction is still difficult. Due to the challenges of characterizing the regiochemical outcomes of thousands of reactions, experimental data-based models must often operate in low-data environments and, when the data is gathered from the literature, often with few negative data points, i.e., molecules that do not react. In contrast, datasets that include easily extractable yield information often contain ten-fold more data26. This scarcity of regiochemical data makes it more difficult for ML to find relationships between molecular structure and LSF outcomes.

Herein, we report a solution to this problem: the utilization of open-source 13C nuclear magnetic resonance (NMR) data in conjunction with LSF data. We hypothesized that an ML model, with its high parameterization, would offer an improvement in accuracy when predicting the regiochemical outcomes of more complex molecules (Fig. 1B). Our model is a graph-based model that does not require pre-computed molecular properties nor any 3D molecular information for accurate regioselectivity prediction. As a proof of concept, we highlight our framework’s predictive ability on both Baran and Molander-type Minisci and P450 LSFs, transformations whose substrate scope is well defined. We show that our model outperforms the Fukui function-based index predictions, and two accurate, previously reported, reactivity-based machine-learning models: one 1-electron-based enzymatic reactivity model and one 2-electron-based small-molecule model. This LSF predictive framework has application towards the development of rapid and facile access to a diverse array of drug-like compounds, specifically with respect to structure-activity relationship (SAR)-probing synthesis and expanding the known chemical space available for exploration.

Results and discussion

The dataset

Data were sourced from Pfizer’s internal medicinal chemistry dataset, which consisted of ~2600 reactions, 647 unique molecules, and 823 unique LSF conditions. The majority of these reactions were Minisci-type functionalizations (1928 reactions), including Minisci reactions utilizing the Baran Diversinates™ (463 reactions)27. Classic Minisci conditions were included in the training set; however, the majority of the training data consisted of Baran and Molander Minisci reactions (Table S1). Additionally, other single-electron-based late-stage functionalizations were included in the training data, such as P450-catalyzed oxidations (642 reactions), electrochemical methylations (12 reactions), and photoredox alkylations (93 reactions) (see Table S2 for further breakdown of the dataset). Reactions that yielded oxidative cleavage or hydrolyzed side products were kept. A key facet of our dataset was the inclusion of unsuccessful conditions that led to no significant product formation (zero reactive sites). Despite the significant mechanistic differences between these reaction classes, we hypothesized that additional chemical information relating to the inherent reactivity of both the reagent and the molecule would be advantageous to regiochemical outcome prediction (Fig. S1). A mixture of reaction classes has seen success when utilized in other reactivity-based predictions19,25. To implicitly distinguish between the reaction types, each unique reagent, oxidant, solvent, additive, and acid was one-hot encoded to form a reaction vector specific to each unique reaction condition. Much like an organic chemist, the selectivity neural network (Fig. 1B) would need to interpret the mechanism type from the collection of reagents.
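The reaction-vector encoding described above can be illustrated with a minimal sketch. The function names, the dictionary layout, and the example reagents are hypothetical and chosen only for illustration; the encoding actually used in the study may differ in ordering and detail.

```python
def build_vocabulary(conditions):
    """Collect every distinct (role, component) pair in a stable, sorted order."""
    return sorted({(role, name) for cond in conditions for role, name in cond.items()})


def encode_condition(condition, vocabulary):
    """Return a 0/1 vector with a 1 for each vocabulary entry present in this condition."""
    present = set(condition.items())
    return [1 if entry in present else 0 for entry in vocabulary]


# Hypothetical reaction conditions (illustrative only, not taken from the dataset).
conditions = [
    {"reagent": "Zn(SO2CF3)2", "oxidant": "TBHP", "solvent": "DMSO"},
    {"reagent": "NaSO2CF2H", "oxidant": "TBHP", "solvent": "DMSO", "acid": "TFA"},
]

vocabulary = build_vocabulary(conditions)
reaction_vectors = [encode_condition(c, vocabulary) for c in conditions]
```

Two conditions that share an oxidant and solvent share the corresponding bits, which is one way a downstream network could relate conditions to one another without being told the mechanism.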

When deciding how to split the data into training and testing sets, we opted for a scaffold-based split instead of a random split. It has been hypothesized that a random split encourages the model to simply memorize the inherent reactivity of a molecule, instead of applying its learned chemical knowledge to new scaffolds28. A scaffold split, where every molecule in the test set is an unseen molecule, provides a more challenging target. The retrospective test set consisted of 25 reactions comprising 5 unique molecules and 17 unique reaction conditions. Of the reaction conditions, 22 were Minisci-type functionalizations with 4 utilizing the Baran Diversinates™, one was a P450 oxidation, and one was a metalloenzyme oxidation (Fig. 2).
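The idea of a scaffold-based split can be sketched as follows. This is an illustrative sketch, not the splitting procedure used in the study: it assumes each record already carries a precomputed scaffold key (the `scaffold` field and the example keys are hypothetical) and assigns whole scaffold groups to the test set, so no test scaffold is seen during training.

```python
from collections import defaultdict


def scaffold_split(records, test_fraction=0.2):
    """Assign whole scaffold groups to train or test so the two sets share no scaffold."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["scaffold"]].append(rec)

    # Smallest groups first; ties broken by scaffold key so the split is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (len(kv[1]), kv[0]))

    target = test_fraction * len(records)
    train, test = [], []
    for _, members in ordered:
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return train, test


# Hypothetical records: 5 pyridine, 3 quinoline, and 2 imidazole entries.
records = (
    [{"id": i, "scaffold": "pyridine"} for i in range(5)]
    + [{"id": i, "scaffold": "quinoline"} for i in range(5, 8)]
    + [{"id": i, "scaffold": "imidazole"} for i in range(8, 10)]
)
train, test = scaffold_split(records, test_fraction=0.2)
```

A random split would instead scatter entries of the same scaffold across both sets, which is the memorization risk described above.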

Fig. 2: The retrospective test set used for optimization of the models.

The conditions used for each test set molecule are shown below the corresponding structure. LSF = late stage functionalization. A quinidine—the standard Baran and Molander conditions, B loratadine—containing the widest variety of conditions indicative of its reactivity, C nevirapine—the standard Baran and Molander conditions, D lepidine—both zinc and sodium sulfinate Baran Diversinates™ are used, E imatinib—both zinc and sodium sulfinate Baran Diversinates™ are used.

The model

One artificial intelligence architecture that has performed well is the message passing neural network (MPNN), a subset of graph convolutional neural networks (GCNNs), first utilized by Duvenaud et al., Li et al., and Gilmer et al. in the mid-2010s29,30,31. MPNNs are a robust and versatile way to predict macro properties (e.g., solubility, compound assay activity, IR spectra, energy)30,32,33,34 and micro properties (e.g., 13C and 1H NMR shifts, regioselectivity)24,35 of molecules by representing molecules as graphs. Graphs, in mathematics, are structures made up of nodes and edges; nodes are concrete entities (events, people, atoms, etc.) and edges indicate that two nodes are connected (these events happened due to the same cause, these people know each other, these atoms share a bond). Briefly, MPNNs work by transmitting information from one node to another along the edges. Each message pass transmits an atom’s information one bond further away, radially, with the intention that after a sufficient number of message passes, each atom will have a comprehensive understanding of its local environment (Fig. 1B)30.
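The radial spread of information with each message pass can be illustrated with a toy example. This sketch uses plain summation in place of the learned message and update functions of a real MPNN, and the four-atom chain and single-number features are hypothetical; it only shows that k passes carry a signal k bonds away.

```python
def message_pass(features, neighbors):
    """One message pass: each atom adds the summed features of its bonded neighbors."""
    updated = []
    for atom, own in enumerate(features):
        message = [sum(features[n][k] for n in neighbors[atom]) for k in range(len(own))]
        updated.append([o + m for o, m in zip(own, message)])
    return updated


# A linear chain of four atoms (0-1-2-3) with a signal placed on atom 0 only.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
state = [[1.0], [0.0], [0.0], [0.0]]

reach = []  # atoms that have received the signal after each pass
for _ in range(3):
    state = message_pass(state, neighbors)
    reach.append([atom for atom, h in enumerate(state) if h[0] != 0.0])
```

After one pass only atoms 0 and 1 carry the signal; atom 3 receives it only after the third pass.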

We developed an MPNN that sits at ~100 lines of code, making it fast, easy to work with, and flexible. The implementation of the MPNN and the trained models can be found at https://github.com/emmaking-smith/SET_LSF_CODE (ref. 36). We believe this is the first study that discloses predictive LSF models trained on a large-scale dataset across a drug-like chemical space comprising both positive and negative results. The MPNN was designed to take in basic atomic information (atomic number, atomic symbol, whether the atom was a hydrogen acceptor or donor, its hybridization, whether the atom was aromatic, and the number of explicit hydrogens) and basic structural information (the connectivity of each atom to its neighbors and the type(s) of bonds used in those connections). If a chemist could not determine a molecular property by looking at the structure, that property was not given to the model either. Rather, the model must infer relevant chemical and spatial information from the structure. From this information, the MPNN would synthesize an embedded molecule vector, which would then be concatenated with the reagent-specific one-hot encoding and run through a feed-forward neural network to classify each atom within a molecule as reactive or not reactive (unreactive).

Finding the reaction centers

The first challenge to overcome was to establish automated extraction of reactive sites, the labels for the ML task at hand. Reaction center identification is a challenging area of research37,38 and for our regioselectivity prediction, we required the atom indices of the carbon atoms that changed in oxidation state. Visually, this is a trivial task, but due to the arbitrary nature of atom indices across chemoinformatics programs, it becomes much more challenging to perform automatically. One possible solution is to use atom-mapped SMILES strings, where every atom in the product has been traced back to its corresponding atom in the starting material39. However, we believed a more user-friendly approach was possible. For our style of LSFs, the core structure of the molecule remained unchanged, with only the extremities exchanging a hydrogen atom for a more complex motif. Therefore, the starting material and product were mathematically linked: the starting material was a subgraph of the product. In mathematical terms, a subgraph is a graph formed only from nodes and edges of its parent graph. From the molecular point of view, a subgraph could be a moiety within a molecule or the core of a molecule. The recent development of the fast, accurate, open-source Glasgow Subgraph Solver was the key to automatically finding the starting material subgraph within the product structure, facilitating the extraction of reactive sites40. Code for the molecule SMILES to reactive site pipeline can be found at https://github.com/emmaking-smith/SET_LSF_CODE (ref. 36). In addition to automating the task of finding the LSF reaction centers without the need for atom mapping, the workflow is specifically set up to deal with symmetry in molecules. The Glasgow Subgraph Solver was directed to find all possible subgraph solutions for a given starting material and product, elucidating all possible starting material-to-product atom mappings. Upon identification of the carbon atom indices whose oxidation state had changed, all corresponding starting material atom indices, including the symmetric indices, were identified and labeled as reactive (Fig. 3). For degradation byproducts, the fragmentation from the resulting oxidation was oftentimes too dramatic for the starting material to remain a subgraph of the product, resulting in 6% of the reactions needing manual elucidation of the reaction center.
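The subgraph-based labeling described above can be sketched in a few lines. This is a simplified stand-in, not the published pipeline: a brute-force backtracking search replaces the Glasgow Subgraph Solver, molecules are heavy-atom graphs without bond orders, and a reactive carbon is detected by a gain in heavy-atom neighbors rather than by a change in oxidation state. All function names are hypothetical.

```python
def adjacency(n_atoms, bonds):
    """Build a neighbor-set lookup from a list of (atom, atom) bonds."""
    adj = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def subgraph_mappings(sm_atoms, sm_adj, prod_atoms, prod_adj):
    """Enumerate every injective SM -> product mapping that preserves elements and bonds."""
    mappings = []

    def extend(mapping):
        i = len(mapping)  # index of the next starting-material atom to place
        if i == len(sm_atoms):
            mappings.append(tuple(mapping))
            return
        for cand in range(len(prod_atoms)):
            if cand in mapping or prod_atoms[cand] != sm_atoms[i]:
                continue
            # Every bond from atom i to an already-placed atom must exist in the product.
            if all(mapping[j] in prod_adj[cand] for j in sm_adj[i] if j < i):
                mapping.append(cand)
                extend(mapping)
                mapping.pop()

    extend([])
    return mappings


def reactive_sites(sm_atoms, sm_bonds, prod_atoms, prod_bonds):
    """Label every SM carbon that gains a neighbor in the product under any valid mapping."""
    sm_adj = adjacency(len(sm_atoms), sm_bonds)
    prod_adj = adjacency(len(prod_atoms), prod_bonds)
    sites = set()
    for mapping in subgraph_mappings(sm_atoms, sm_adj, prod_atoms, prod_adj):
        for sm_atom, prod_atom in enumerate(mapping):
            gained = len(prod_adj[prod_atom]) > len(sm_adj[sm_atom])
            if sm_atoms[sm_atom] == "C" and gained:
                sites.add(sm_atom)
    return sorted(sites)


# Pyridine skeleton: N is atom 0, ring carbons are atoms 1-5.
pyridine_atoms = ["N", "C", "C", "C", "C", "C"]
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
product_atoms = pyridine_atoms + ["C"]  # one added methyl carbon (atom 6)

c2_sites = reactive_sites(pyridine_atoms, ring, product_atoms, ring + [(1, 6)])
c4_sites = reactive_sites(pyridine_atoms, ring, product_atoms, ring + [(3, 6)])
```

For the C2-methylated product, two mappings exist (the identity and the mirror image), so both symmetry-equivalent positions, atoms 1 and 5, are labeled reactive; for the C4-methylated product only atom 3 is labeled.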

Fig. 3: Diagram of workflow for identification of reactive sites in symmetric molecules.

All possible starting material (SM) to product atom mappings are generated with the Glasgow Subgraph Solver (creation of atom number conversion table). The reactive site is identified via a change in carbon oxidation (blue highlight) and all corresponding SM indices are labeled as reactive. This technique preserves the symmetry of the SM atom sites.

The loss function

With a model architecture and accurately labeled data in place, we turned our attention to the choice of the loss function, the system that penalizes the model and directs the learning. Loss functions can be broadly divided into two categories, regression and classification, where regression loss functions are used with regression tasks and classification loss functions with classification tasks41. Our task was to classify each atom in a molecule as a member of the reactive class or not a member of the reactive class (unreactive); thus, classification loss functions were appropriate. The Binary Cross Entropy (BCE) loss, which penalizes the model based on the log-likelihood of correct class prediction, was chosen (Eq. S2). A challenge with reactivity and regioselectivity prediction is that most atoms in a given molecule are unreactive. Our most reactive molecule had only 30% of its structural atoms reacting, leaving 70% of its atoms unreactive, and most molecules in our training data had 1 or fewer reactive structural atoms (Fig. 1C). Therefore, a model can be technically accurate by simply predicting that all sites are unreactive, though such a model would be practically useless. What was required was a loss function that could more heavily penalize incorrect predictions and give less weight to correct unreactive predictions. To this end, a variety of BCE loss weightings were investigated, whose central theme was that the weight given to correct class predictions was inversely correlated to the frequency with which that class was predicted (Eq. S2–Eq. S4); the value of each correct reactive site prediction was tempered by how often the model predicted any given atom was reactive, and vice versa for unreactive site prediction.
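The effect of class weighting on the BCE loss can be seen in a small numerical sketch. The weighting below uses inverse label frequency, a common generic choice that is assumed here purely for illustration; it is not the weighting of Eq. S2–Eq. S4, which depends on predicted class frequencies and is given in the Supplementary Information.

```python
import math


def bce(probs, labels, weighted=True, eps=1e-7):
    """Mean binary cross entropy, optionally with inverse-frequency class weights."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if weighted:
        # Assumed weighting: the rarer a class, the larger its weight.
        w_pos = n / (2 * n_pos) if n_pos else 0.0
        w_neg = n / (2 * n_neg) if n_neg else 0.0
    else:
        w_pos = w_neg = 1.0

    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p)
    return total / n


# A hypothetical ten-atom molecule with one reactive site (atom 0).
labels = [1] + [0] * 9
lazy = [0.05] * 10              # predicts "unreactive" everywhere
informed = [0.9] + [0.05] * 9   # finds the reactive site

lazy_plain = bce(lazy, labels, weighted=False)
lazy_weighted = bce(lazy, labels, weighted=True)
informed_weighted = bce(informed, labels, weighted=True)
```

Under the unweighted loss the all-unreactive prediction is penalized only mildly; the weighted loss penalizes it several times more heavily, while the prediction that finds the reactive site remains cheap.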

Model results: retrospective test set

The baseline model was a random forest, which is known to be an excellent predictor of molecular features (e.g., compounds increasing the lifespan of C. elegans, IC50 measurement prediction of drug-like molecules, excitation energies, and associated oscillator strengths of fluorophores), especially in low-data environments42,43,44. Molecules were encoded as matrices of atom-wise Morgan fingerprints, in which each row corresponds to the Morgan fingerprint of a specific atom within the molecule. The corresponding one-hot encoded reaction vector was concatenated to the atom-wise Morgan fingerprint, and a random forest classifier was then used to predict whether each atom in the molecule was reactive. We used the well-established classification metric of the F-score, which balances precision and recall, to judge model performance. Two other metrics, accuracy (total correct reactive sites predicted/all possible reactive sites) and area under the receiver operating curve (AUROC), are also given for additional interpretability of performance45. Initial results on our test set revealed a modest F-score of 0.42 (accuracy = 94%, AUROC = 0.67), with Fukui-index-based predictions yielding a lower F-score of 0.19 (accuracy = 90%, AUROC = 0.57) (Fig. 4A). Fukui indices are calculated only for the molecule, not for the reagent; however, distinctions between different reagents are still possible. Nucleophilic Fukui indices, Fi(+), correspond to regiochemical outcomes utilizing electrophilic radicals (•CF3), and radical Fukui indices, Fi(0), correspond to regiochemical outcomes utilizing nucleophilic radicals (•CF2H, •cBu) (see SI pg. S5 for a mathematical description of each index)14,15. For any radical whose electrophilic or nucleophilic character was uncertain, the Fukui indices that best fit the experimental reactivity were used for the calculation of the F-score.
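Why the F-score is reported alongside accuracy can be shown with a short sketch. It uses the standard textbook definitions of precision, recall, F-score, and per-atom accuracy, which is an assumption about how the metrics map onto per-atom labels; the example labels are hypothetical.

```python
def classification_metrics(predicted, actual):
    """Precision, recall, F-score, and accuracy for per-atom 0/1 labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


# A hypothetical 20-atom molecule with two reactive sites (atoms 0 and 1).
actual = [1, 1] + [0] * 18
all_unreactive = [0] * 20           # never predicts a reactive site
partial = [1, 0, 1] + [0] * 17      # one hit, one miss, one false alarm

lazy_metrics = classification_metrics(all_unreactive, actual)
partial_metrics = classification_metrics(partial, actual)
```

Both predictions score 90% accuracy, but the F-score separates them: 0 for the all-unreactive prediction and 0.5 for the prediction that finds one of the two reactive sites.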

Fig. 4: Model performance on retrospective and P450-only test sets.

Average model performance on 5 initializations (n = 5) with each architecture on the test sets. A basic message-passing neural network (MPNN) is the baseline graph neural network (n = 5). The universal node is the MPNN architecture with the inclusion of a universal node (n = 5). Nuclear magnetic resonance = NMR. NMR transfer learn is the transfer learned model without Fukui-index augmentation (n = 5). NMR transfer learn (Fukui) is the transfer learned model with Fukui-index augmentation (n = 5). The best model on the retrospective test set is highlighted in light blue (n = 1). Fukui is prediction solely from Fukui indices (n = 1). Random Forest predictions are from a random forest classifier (n = 5). The bars in the bar charts represent the average when n > 1, with gray dots representing the individual data points (initializations with identical values are shown as a single point). Standard error bars are shown. Source data for each bar chart can be found in the source data Excel file. A Performance (F-score, accuracy, and area under the receiver operating curve (AUROC)) on the retrospective set. B Performance (F-score, accuracy, AUROC) on P450-only test set with 13C NMR transfer learning. C Comparison of top-1 accuracy for two graph reactivity models originally developed for 2-electron-based (ml-QM-GNN) and 1-electron-based (Meta-UGT) transformations (n = 5 for all).

Evaluation of these initial predictions suggested that the model struggled with extended conjugated systems, such as those present in loratadine (2) and imatinib (5). We hypothesized that this was due to the difficulty of atoms in one hemisphere of the molecule seeing atoms in the other hemisphere in the MPNN. Whilst increasing the number of bonds that every atom’s information travels across (the range of the atom’s message) did not improve performance, the incorporation of a universal node did. This universal node, as described by Gilmer et al. (who used the term master node), is an all-seeing node: information from every atom is given to the universal node, which in turn gives information to every atom about distant atoms30. Implementation of a universal node MPNN led to a model with a modest increase in F-score to 0.46 (accuracy = 94%, AUROC = 0.72) (Fig. 4A).
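The shortcut provided by a universal node can be illustrated with a toy graph. As a simplifying assumption, the universal node is modeled here as one extra node bonded to every atom, and summation stands in for the learned message and update functions; the six-atom chain is hypothetical.

```python
def message_pass(features, neighbors):
    """One message pass: each node adds the summed features of its neighbors."""
    updated = []
    for node, own in enumerate(features):
        message = [sum(features[n][k] for n in neighbors[node]) for k in range(len(own))]
        updated.append([o + m for o, m in zip(own, message)])
    return updated


def add_universal_node(features, neighbors):
    """Append one extra node that is connected to every atom in the graph."""
    n = len(features)
    ext_features = [list(f) for f in features] + [[0.0] * len(features[0])]
    ext_neighbors = {atom: list(nbrs) + [n] for atom, nbrs in neighbors.items()}
    ext_neighbors[n] = list(range(n))
    return ext_features, ext_neighbors


def run(features, neighbors, passes):
    """Apply the given number of message passes and return the final node states."""
    state = features
    for _ in range(passes):
        state = message_pass(state, neighbors)
    return state


# A six-atom chain (0-1-2-3-4-5) with a signal on atom 0 only.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
start = [[1.0]] + [[0.0] for _ in range(5)]

plain = run(start, chain, passes=2)
uni_features, uni_neighbors = add_universal_node(start, chain)
with_universal = run(uni_features, uni_neighbors, passes=2)
```

After two passes the far end of the chain (atom 5) is still blind to atom 0 in the plain graph, whereas with the universal node it has already received the signal: one pass to reach the universal node and one pass back out.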

At this point, we suspected we were running up against the limit of the data. Ideally, this would be solved by performing additional LSF reactions; however, such data is laborious and expensive to generate. Every regioisomer must be isolated and characterized for every new substrate, which can be cost and/or time prohibitive. Another obvious solution would be to increase the amount of information in each atom’s featurization for a deeper understanding of chemical environments. However, given the poor performance of QM-derived atomic descriptors for MPNN regioselectivity prediction in LSF, alternative solutions were sought out first (see the Quantum Chemistry Augmentation Section for a detailed discussion)24. Thus, transfer learning was employed. This is a technique whereby a model is trained on off-task data before being trained on the desired-task data to boost performance46. It was crucial to choose a transfer learning task that had significantly more data than our current training set, which would allow more complex correlations between structure and reactivity to be inferred. However, it was also imperative that this off-task data bore some relationship to atomic reactivity. We hypothesized that 13C NMR shift prediction, which can be abstracted as quantification of local chemical environments, would be uniquely suited to our goal. In addition, the inherent symmetry of a molecule is represented in NMR spectra, as atoms with identical chemical environments have identical NMR shifts47. This should carry over to reactivity, since atoms with identical chemical environments also have identical reactivity. Thus, ~27,000 open-source 13C NMR shifts were obtained from Jonas et al.'s previous work (originally sourced from NMRShiftDB), and transfer learning from 13C NMR shift to LSF regioselectivity prediction commenced35. This step enabled a major improvement in model performance, with the top-performing model, MPNNLSF, yielding an F-score of 0.62 (accuracy = 96%, AUROC = 0.79) (for every 1 true positive, 1.25 incorrect sites are obtained) and an average F-score over 5 initializations of 0.57 (accuracy = 96%, AUROC = 0.75) (Fig. 4A). Interestingly, we observed that negative data was important for model performance. Removing the entries with zero reactive sites (unproductive reaction conditions) led to a substantial decrease in model performance (Fig. S4). We hypothesize that this is because the negative data allows the model to infer similarities between different one-hot encoded reaction conditions.

Comparison to other machine-learning models

To highlight the difficult nature of predicting Minisci-type transformations without this 13C NMR pretraining protocol, we investigated how other graph-based architectures would perform on our retrospective test set. A recently developed neural network by Jensen et al. utilized a joint network approach for 2-electron-based regioselectivity prediction. Their first neural network predicted on-the-fly QM properties, which were then given to their second neural network that classified which product was the major product from a user-generated list of possible structures. This approach, dubbed ml-QM-GNN, saw excellent top-1 accuracy performance even in low training data regimes and was validated on a broad range of 2-electron-based transformation classes, with a top-1 accuracy of over 85%. To investigate Minisci-based transformations, we transformed our dataset into the correct format, first elucidating all possible mono-addition C-H functionalizations given our reagent, followed by complete atom mapping of each reaction48. Using default parameters, ml-QM-GNN was trained on our training dataset and tested against our retrospective test set. Accuracy was determined using ml-QM-GNN’s criteria of top-1 accuracy, where the overall retrospective test set accuracy was the ratio of correctly predicted major products to the total number of reactions. As many reactions contained multiple correct possible products, the ml-QM-GNN’s classification was deemed correct if its top-1 prediction was any of the valid possible products. Over 5 initializations, the average top-1 accuracy of ml-QM-GNN was 11%, compared to an average top-1 accuracy of 71% for our 13C NMR transfer learning model (Fig. 4C).
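The relaxed top-1 criterion described above, in which a prediction counts as correct if the top-ranked product is any of the experimentally valid products, can be written as a short sketch. The function name and the product identifiers are hypothetical and serve only to illustrate the scoring rule.

```python
def top1_accuracy(ranked_predictions, valid_products):
    """Fraction of reactions whose top-ranked prediction is any experimentally valid product."""
    hits = sum(
        1
        for ranked, valid in zip(ranked_predictions, valid_products)
        if ranked and ranked[0] in valid
    )
    return hits / len(valid_products)


# Three hypothetical reactions; each list is ranked from most to least likely product.
ranked_predictions = [
    ["C2-product", "C4-product"],
    ["C4-product", "C2-product"],
    ["C6-product", "C2-product"],
]
valid_products = [
    {"C2-product"},                 # single observed product
    {"C2-product", "C4-product"},   # two observed regioisomers, either counts
    {"C2-product"},                 # the correct product is ranked second: a miss
]

score = top1_accuracy(ranked_predictions, valid_products)
```

Here two of the three reactions are scored as correct; the third is a miss because only the first-ranked prediction is considered.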

Finally, we compared our results to a graph-based model specifically developed to predict the outcomes of single-electron-based transformations: Meta-UGT49. Meta-UGT was developed to predict the site of metabolism of UDP-glucuronosyltransferases (UGTs). The natural promiscuity of these phase II metabolic enzymes renders reactivity prediction challenging. The model works in two phases, first predicting if a small molecule is a substrate for the enzyme, followed by the site-specific predictions. When tested upon drug-like molecules, Meta-UGT achieved a top-1 site of reactivity prediction accuracy of 89%, making it a suitable candidate to test our model against. Thus, Meta-UGT was trained with default parameters on our training data and tested on the retrospective test set, yielding an average top-1 accuracy of 42% (Fig. 4C).

Model results—P450-only test set

To investigate this training technique’s performance, we devised a different regioselectivity task: P450 oxidation. P450 oxidation plays a central role in drug metabolism, determining the efficacy and duration of a pharmaceutical. Additionally, the interactions of some drugs with human P450s are known to inhibit and/or induce P450 activity, leading to drug–drug interactions50,51. Due to their inherent promiscuity52,53, P450 oxidations are a promising LSF and an excellent test for our framework. Mechanistically distinct from Minisci functionalizations, the Fe(IV)-oxo complex acts upon the substrate via radical rebound or through a concerted mechanism to release the newly oxidized compound (Fig. 1A)54,55. Site of metabolism (SoM) prediction, which deduces the most likely positions for human P450 oxidation on a given compound, has seen great strides in the past two decades56,57,58,59,60,61. We offer this framework as a jumping-off point to develop an applicable, isoform-agnostic SoM methodology. Fukui-based indices have also been shown to be effective at determining the regiochemical outcomes of P450 oxidations and thus were used as a baseline measure62,63,64. A P450-only test set of 31 reactions and 19 unique molecules (Fig. S6), reacting with 18 unique P450s, was curated. Applying the aforementioned transfer learning technique to the P450-only test set resulted in an average F-score of 0.48 (accuracy = 94%, AUROC = 0.70) over 5 initializations. The top-performing of these initializations, MPNNP450, achieved an F-score of 0.52 (accuracy = 94%, AUROC = 0.73) (Fig. 4B). Despite only 25% of the training data containing P450 oxidations, MPNNP450 outperformed the Fukui-index-based reactivity predictions, showcasing the utility of 13C NMR transfer learning.

Quantum chemistry augmentation

A lingering question was whether incorporating 3D information and/or quantum mechanical features as input to the graph would help model performance. Conformer generation and quantum chemistry calculations add computational overhead, which would limit this model’s applicability in practice. However, many MPNNs that utilize QM-derived information see a performance improvement. To this end, a variety of augmentations to the initial atomic features were attempted. However, neither 3D atomic coordinates generated from molecular dynamics (MD) simulations nor electronic information derived from atomic density functions improved overall performance (Fig. S5, SI pg. S4–S5). Interestingly, the addition of each atom’s electrophilic, nucleophilic, and radical Fukui indices (see SI pg. S5 for a mathematical description of each index) did not produce an appreciable F-score increase in either the prospective or retrospective test sets (Figs. 4A and 5E). It is possible that the Fukui indices do not provide any additional information for the MPNN. There have been numerous prior reports that indicate that MPNNs can accurately predict quantum chemical properties from basic atomic information, implying that an MPNN could extract the necessary quantum chemical information from barebones atom featurization, obviating the need for explicit pre-computation of quantum chemical properties30,34,65. This observation is congruent with Nippa et al., who independently and concurrently published an MPNN for LSF C-H borylation regiochemical and yield prediction24. They noted that similar augmentation of their atomic information with quantum mechanical features did not lead to a noticeable improvement of regioselectivity prediction and that incorporation of 3D atomic coordinates only yielded a modest improvement over 2D molecular representations (scaffold splits). It is possible that the lack of improvement with 3D atomic featurization stems from the difficulty in characterizing properties of the LSF reaction transition state with descriptors that refer to an unperturbed substrate molecule.

Fig. 5: Results on the prospective test set and best models overall.

Model performance was judged from a single run (n = 1) (nuclear magnetic resonance = NMR). Source data for each bar chart can be found in the source data Excel file. A Experimental results. Color-coded by reagent-specific reactivity. Split circles imply more than one reagent functionalized that position. B MPNNLSF (best retrospective model) predictions on the prospective test set. C Fukui predictions on the prospective test set. D F-score, accuracy, and AUROC (area under the receiver operating curve) reported for MPNNLSF (highlighted in light blue). NMR transfer learn (Fukui) is the transfer learned model with Fukui-index augmentation. E Comparison of the best models on each of the 3 test sets (n = 1). MPNNP450 = Best P450-only test set model. Color coding corresponds to the test set used for evaluation.

Prospective validation

With the success of our architecture in a variety of LSF regiochemical predictions, we turned our attention to assessing its ability in a completely unbiased setting through prospective prediction. Three maximally structurally different molecules were selected from Enamine’s High-Throughput Experimentation catalogue via Butina clustering. The three compounds were confirmed to not be present within the training or testing data and none had a Tanimoto similarity score over 0.35 with any molecule in the training/testing datasets, indicating low structural similarity between the three prospective compounds and the training/testing data. Each molecule was subjected to CF2H-, CF3-, and cBu-functionalization (Fig. 5A), and these experimental results were compared to the Fukui-derived indices and MPNNLSF predictions (Fig. 5B, C). Gratifyingly, MPNNLSF once again outperformed both the Fukui predictions (Fig. 5D) and the random forest baseline, even though Fukui performed respectably on this prospective test set. All of MPNNLSF’s predictions made chemical sense, with predicted functionalizations occurring at known inherently reactive sites or probable sites of oxidation. Fukui predictions often yielded functionalizations at fully oxidized carbons, something that is rarely seen in these LSFs. This is perhaps due to the mechanistically agnostic behavior of Fukui-based predictions, which highlight the site(s) of the highest probability for nucleophilic/radical attack, regardless of whether those sites lead to productive pathways.
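The Tanimoto similarity check used to confirm novelty can be sketched as follows. Fingerprints are represented here as sets of on-bit indices, and the bit values are made up for illustration; the fingerprint type used in the study is not specified in this section, and the Butina clustering step is not reproduced.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def is_novel(candidate, reference_fps, threshold=0.35):
    """True if the candidate's similarity to every reference fingerprint is at most the threshold."""
    return all(tanimoto(candidate, ref) <= threshold for ref in reference_fps)


# Hypothetical fingerprints for two reference molecules and two candidates.
reference_fps = [{1, 2, 3, 4}, {2, 3, 5, 8}]
candidate_a = {1, 9, 10, 11}   # shares one bit with the first reference
candidate_b = {1, 2, 3, 7}     # shares three bits with the first reference

similarity_b = tanimoto(candidate_b, reference_fps[0])
novel_a = is_novel(candidate_a, reference_fps)
novel_b = is_novel(candidate_b, reference_fps)
```

Candidate A passes the 0.35 cutoff against every reference, while candidate B (similarity 0.6 to the first reference) would be rejected as too similar to known data.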

A deeper look at our prospective results sheds light on MPNNLSF’s current utility, specifically its highly precise nature. For compound 6, we see a generally good understanding of inherent pyridine electronics, which is naturally activated at the C2, C4, and C6 positions. However, the effect of the urea motif must be taken into account for a complete picture of regioselectivity. Per the governing heuristics, the π-donating nature of the urea would indicate increased reactivity at the C4 and C6 positions for electrophilic radicals (•CF3) and reduced reactivity for nucleophilic radicals (•CF2H, •cBu)13. Experimentally, the urea motif makes little impact on the electronics of the pyridine; however, MPNNLSF does not capture this. It instead hedges its bets, correctly finding C2 to be reactive for all three radicals but failing to predict the full chemical reactivity at C4 and C6. This may be in part due to the rarity of the urea motif within our dataset. Out of the ~2,600 training and testing molecules, only 12 contained a urea motif (~0.5% of the data), and of those 12 molecules, functionalization occurred on heterocycles distal to the urea motif. Despite this, MPNNLSF found 5/9 reactive sites and none of the sites it predicted to be reactive were incorrect.

For compound 7, we once again see correct ortho reactivity for •CF2H; however, MPNNLSF misses the para reactivity for all radicals, perhaps owing to the more sterically congested landscape at that site. The clear failure of MPNNLSF, however, was its inability to capture the promiscuous nature of •CF3 functionalization on 7. In the majority of Minisci functionalizations, the role of nucleophile is played by the radical, even for electrophilic radicals like •CF3, and the role of electrophile is played by the heteroarene13. Functionalization generally occurs at a (reasonably) electron-deficient site. However, compound 7 does not completely follow this trend: all but one of the •CF3 functionalizations occur on non-heterocyclic, more electron-rich arenes instead of the canonical pyridinyl motif. This atypical substitution pattern plays a large role in the lower performance of MPNNLSF and is unlikely to be predicted even by an expert chemist, highlighting the current limitations of our model: surprising experimental outcomes also surprise MPNNLSF66.

In compound 8 we finally see a small decrease in MPNNLSF’s precision. Instead of identifying the inherently most reactive site on the imidazole, a benzylic oxidation is predicted. Under difluoromethylation conditions on 8, the model is likely predicting the major product to be an oxidation byproduct, in which the benzylic hydrogen is abstracted by the generated alkyl radical and the resulting intermediate is subsequently quenched via TBHP67. A prediction of this nature is most likely due to the decision to include byproduct reactions in the training data and lends credence to the hypothesis that the model understands general chemical reactivity trends.

From this analysis, we see that a general trend is the high precision of MPNNLSF. This has ramifications in SAR studies, which seek to identify the best decoration of molecular scaffolds for optimal pharmacokinetic properties68. In a typical SAR synthesis, one motif is varied and the rest of the molecular structure is held constant. Syntheses of SAR derivatives are generally convergent, with the varying motifs brought into the synthesis modularly. Despite this workflow’s streamlined approach, it still requires each SAR derivative to have its own unique route. A more efficient synthesis would use one reaction to generate multiple desired products. Take compound 6 as an example, with a known route from commercial nicotinoyl chloride (9) in an efficient 2-step procedure (Fig. 6)69,70. Aryl isocyanate 10 is formed via a Curtius rearrangement, followed by quenching with amine 11 to produce 6. Current trends in therapeutic molecules have seen the incorporation of fluorinated functional groups as substituents on aromatic systems, such as CF3 and CF2H, to yield molecules with improved pharmacokinetic properties including lipophilicity, metabolic stability, and cell membrane permeability71,72,73. Indeed, approximately 20% of all approved pharmaceuticals contain some fluorine-based group74. An SAR campaign to investigate the effect of a trifluoromethyl group at C2 and C6 would require purchasing the corresponding trifluoromethylated nicotinic acid 12/nicotinoyl chloride 13. However, in addition to the added cost of these starting materials (84- and 33-fold more expensive per gram, respectively), the chemist is faced with the challenging task of optimizing and characterizing the outcomes of two small-scale, multi-component, multi-step routes75,76.
With MPNNLSF’s precision, a chemist could be confident that a single route could provide multiple desired derivatives in one fell swoop, saving the cost of starting material and, most importantly, time, both in reaction optimization and in compound characterization. The lower recall is less problematic, as any additional bonus products can be isolated from the crude reaction mixture concurrently with the correctly predicted functionalizations. The benefit of MPNNLSF becomes more apparent when more exotic functional groups are investigated in SAR. Exploration of difluoromethylation at C2 and C6 by purchasing the necessary difluoromethyl starting pyridines 14 and 15 would be exceptionally expensive: 296- and 56-fold more expensive per gram, respectively, and 15 additionally requires a carbonylation, further increasing the time to derivatization77,78. Thus, even without perfect accuracy, MPNNLSF can guide SAR syntheses to produce a multitude of functionalized compounds with minimal time burden.

Fig. 6: Potential challenges of a structure-activity relationship (SAR) campaign of compound 6.
Literature synthesis of compound 6 and the cost of purchasing fluorinated starting materials for a potential SAR campaign.

The regiochemical outcomes of LSF radical-based transformations are governed by many factors: the nucleophilicity of the radical, the BDE of the molecule’s atoms, and the steric and electronic landscape, to name a few. Interestingly, it has been observed that additional QM-derived or MD-derived data does not yield appreciable improvements in regiochemical outcome prediction. We showcase a transfer learning methodology based upon 13C NMR shift prediction which boosts the performance of zinc sulfinate and BF3K salt Minisci reaction regiochemical outcome prediction above that of the accurate Fukui-index reactivity scores, and of two reactivity prediction machine-learning models, on a narrow yet well-defined slice of chemical space. Promising predictive accuracy was also achieved on P450 enzymatic oxidations, a chemistry with a broader scope than the aforementioned Minisci conditions. Model performance was also highly contingent on the inclusion of negative data in the training set. This paradigm stands as a proof of concept for future applications in other LSF regiochemical predictions, with the current best model showing potential in diversity-oriented SAR synthesis. Our 13C NMR data are open-source, and we anticipate that the incorporation of larger proprietary 13C NMR datasets as the first step in this transfer learning methodology will expand this methodology to include other LSF chemistry.

Methods

Materials

Liver microsomes were purchased from the following vendors: female mouse, male rat, male cynomolgus monkey and non-transfected microsomes (Corning, Woburn, MA); dexamethasone-induced male rat, male hamster, male dog and pooled male & female human (prepared in-house at Pfizer, Groton, CT); and male guinea pig and male rabbit (Xenotech, Lenexa, KS). Recombinant human P450 enzymes heterologously expressed in microsomes from Sf9 cells were custom-prepared by Panvera (Madison, WI).

High-throughput biocatalytic screens

The reactions were set up in two 96-well arrays using miniature 8 × 20 mm (0.2 mL) glass vials under standard glove box conditions (H2O and O2 < 20 ppm). Into an 8 × 20 mm (0.2 mL) glass vial equipped with a stir bar was dispensed the reaction solvent (100 μL, 4.0 mM), followed by 1 (5.0 μL, 0.4 μmol), added as a 0.1 M solution in dichloroethane. Stirring was initiated before the metalloporphyrin (4.0 μL, 0.04 μmol) was charged as a 10.0 mM solution in dichloroethane. The vial was treated with a 0.1 M solution of imidazole (2.4 μL, 0.24 μmol) in H2O, followed by a 0.4 M solution of formic acid (4.0 μL, 0.16 μmol) in H2O. Finally, the oxidant (8.0 μL, 0.08 μmol) was added as a 0.1 M solution in dichloroethane. The reaction vial was crimp-sealed with a polytetrafluoroethylene (PTFE)/silicone/PTFE septum under the glove box atmosphere before the reaction was left to stir at 25 °C for 18 h. After this time, the reaction was diluted with acetonitrile (0.2 mL) and analyzed directly by ultra-performance liquid chromatography-mass spectrometry (UPLC/MS). The UPLC/MS method used a 0.1% AcOH/NH4CO2H/H2O gradient over 0.8 min, running from 5–95% acetonitrile on a Waters Acquity UPLC BEH C18 30 × 2.1 mm column at 100 °C with a flow rate of 2.5 mL/min and a detection wavelength of 210–360 nm. Injections of 0.5 μL were made directly from the diluted reaction mixtures, and ionization was monitored in positive mode.

Baran diversinate™ late-stage functionalizations

To a 1-dram pressure-release vial containing Diversinate™ sulfinate reagent as the sodium or zinc salt, e.g., RSO2Na or (RSO2)2Zn (3–6 eq), was added a solution of the test substrate molecule (~2 µmol, 1 eq) in dimethyl sulfoxide (DMSO) (~70–100 µL, 30 mM) and TFA (4 eq), followed by tert-butyl hydroperoxide, 70% in water (5 eq), at room temperature. The resulting reaction mixture was capped and heated to 50 °C overnight. The crude reaction mixture was dissolved in 3:1 acidic mobile phase (1% acetonitrile, 0.1% formic acid) and acetonitrile (~3 mL), then purified via HPLC (XSelect 5 µm C18 130 Å, 250 × 10 mm @ 2 mL/min). The respective fractions were pooled, and the solvent was removed using the EZ-2 Elite Genevac (3-h HPLC setting, 34 °C/238 mbar to 41 °C/7 mbar). Each isolate was characterized by MS and NMR. Due to the low amounts of isolates generated, gravimetric mass analysis was not possible; qNMR, with the enhanced sensitivity of a 1.7 mm micro-cryoprobe in DMSO-d6 solvent, was used to determine the concentration of the sample.

Molander BF3K salt late-stage functionalization

To a 1-dram pressure-release vial containing the test substrate molecule (~2 µmol, 1 eq) and the potassium trifluoroborate salt of the radical (1.5–2 eq) in a 1:1 mixture of acetic acid and water (to make a 30 mM solution) was added Mn(OAc)3 in one portion. The resulting reaction mixture was capped and heated to 50 °C overnight. The crude reaction mixture was dissolved in 3:1 acidic mobile phase (1% acetonitrile, 0.1% formic acid) and acetonitrile (~3 mL), then purified via HPLC (XSelect 5 µm C18 130 Å, 250 × 10 mm @ 2 mL/min). The respective fractions were pooled, and the solvent was removed using the EZ-2 Elite Genevac (3-h HPLC setting, 34 °C/238 mbar to 41 °C/7 mbar). Each isolate was characterized by MS and NMR. Due to the low amounts of isolates generated, gravimetric mass analysis was not possible; qNMR, with the enhanced sensitivity of a 1.7 mm micro-cryoprobe in DMSO-d6 solvent, was used to determine the concentration of the sample.

Molecular dynamics simulations

Molecule conformations were generated with MOPAC at the PM7 level of theory79. The underlying molecular dynamics (MD) driver was the atomic simulation environment (ASE) package80. A Langevin thermostat controlled the temperature. First, the molecular geometry was optimized, followed by equilibration at 500 K for 2.5 ps with a timestep of 0.25 fs. Upon equilibration, conformations were sampled every 2 ps from a production run of 200 ps in the NVT ensemble (constant temperature, constant volume) at 500 K, using a timestep of 0.5 fs with the same thermostat. This yielded a total of 100 configurations per molecule.
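
The step counts implied by this protocol can be checked with a short calculation. The sketch below is bookkeeping only and does not run any dynamics; the `Stage` class and variable names are hypothetical, and only the durations, timesteps, and sampling interval are taken from the text.

```python
from dataclasses import dataclass

FS_PER_PS = 1000.0  # femtoseconds per picosecond


@dataclass(frozen=True)
class Stage:
    """One MD stage described by its total duration and integration timestep."""
    duration_fs: float
    timestep_fs: float

    @property
    def n_steps(self) -> int:
        # Number of integration steps needed to cover the stage.
        return round(self.duration_fs / self.timestep_fs)


equilibration = Stage(duration_fs=2.5 * FS_PER_PS, timestep_fs=0.25)
production = Stage(duration_fs=200 * FS_PER_PS, timestep_fs=0.5)
sample_interval_fs = 2 * FS_PER_PS

# A conformation is saved every `steps_between_samples` production steps.
steps_between_samples = round(sample_interval_fs / production.timestep_fs)
n_conformations = production.n_steps // steps_between_samples

print(equilibration.n_steps, production.n_steps,
      steps_between_samples, n_conformations)
```

The final value reproduces the 100 configurations per molecule stated above.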

Calculating the Fukui indices

Reactivity indices for electrophilicity and nucleophilicity for the i-th atom were computed by multiplying the corresponding Fukui-index [Fi(+) or Fi(-), respectively] of the i-th atom by global electrophilicity/nucleophilicity for the given molecule. Fukui indices of the i-th atom Fi(+), Fi(-), and Fi(0) were computed as differences between the atomic charge of the i-th atom in the original molecule qi(N) with N electrons, the charge of the same atom after adding one electron to the molecule qi(N + 1), and the charge of the same atom after removing one electron from the molecule qi(N – 1):

$$F_{i}\left(+\right)=q_{i}\left(N\right)-q_{i}\left(N+1\right)$$
$$F_{i}\left(-\right)=q_{i}\left(N-1\right)-q_{i}\left(N\right)$$
$$F_{i}\left(0\right)=\frac{q_{i}\left(N-1\right)-q_{i}\left(N+1\right)}{2}$$
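
A minimal sketch of these finite-difference definitions, operating on per-atom partial charges, is given below. The function names and the toy charge values are hypothetical and for illustration only. The global electrophilicity/nucleophilicity is treated as a user-supplied number because its calculation is not specified in this section.

```python
def fukui_indices(q_n, q_n_plus_1, q_n_minus_1):
    """Per-atom Fukui indices (F+, F-, F0) from partial charges.

    q_n, q_n_plus_1, q_n_minus_1: per-atom charges of the molecule with
    N, N + 1, and N - 1 electrons, in the same atom order.
    """
    if not (len(q_n) == len(q_n_plus_1) == len(q_n_minus_1)):
        raise ValueError("charge lists must have one entry per atom")
    f_plus = [qn - qp for qn, qp in zip(q_n, q_n_plus_1)]
    f_minus = [qm - qn for qm, qn in zip(q_n_minus_1, q_n)]
    f_zero = [(qm - qp) / 2 for qm, qp in zip(q_n_minus_1, q_n_plus_1)]
    return f_plus, f_minus, f_zero


def reactivity_index(fukui, global_index):
    """Scale per-atom Fukui indices by a molecule-level (global) index."""
    return [global_index * f for f in fukui]


# Toy charges for a three-atom fragment (made up for illustration).
q_n = [-0.20, 0.10, 0.10]           # neutral, sums to 0
q_n_plus_1 = [-0.50, -0.20, -0.30]  # one electron added, sums to -1
q_n_minus_1 = [0.30, 0.30, 0.40]    # one electron removed, sums to +1

f_plus, f_minus, f_zero = fukui_indices(q_n, q_n_plus_1, q_n_minus_1)
most_reactive = max(range(len(f_minus)), key=f_minus.__getitem__)
print(f_plus, f_minus, f_zero, most_reactive)
```

Because the atomic charges sum to the total molecular charge, each set of indices sums to 1 over the molecule, and F(0) is the average of F(+) and F(-).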

For electrophilicity and radical indices, quantum chemical computations were run with PBE/6-311G, and for nucleophilicity with B3LYP/6-311G**. Mulliken charges were used as the partial atomic charges. These DFT functionals, basis sets, and types of atomic charges were chosen by optimizing the predictive performance of the reactivity indices in SNAr and EAS reactions of an internal dataset of small organic molecules (unpublished). Quantum chemical computations were run in TeraChem81.

Equations

TP = true positive, FP = false positive, FN = false negative.

w = weighting value, x = predicted value, y = true value (always 0 or 1).

The variables predp and truep refer to the ratio of predicted positives to all reactive sites. A predp of 1 indicates a model that predicts that all sites react, and a predp of 0 indicates a model that predicts that every molecule is unreactive.

$$F\ \mathrm{score}=\frac{2\cdot \mathrm{TP}}{2\cdot \mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$$
$$\mathrm{BCE\ Loss}=\sum_{i=0}^{n}w_{i}\cdot \left(y_{i}\cdot \log \left(x_{i}\right)+\left(1-y_{i}\right)\cdot \log \left(1-x_{i}\right)\right)$$
$$\begin{aligned}\mathrm{BCE\ weight\ 1}={}&x\cdot y\cdot \log \left(\mathit{pred}_{p}\right)+\left(1-y\right)\cdot \left(1-x\right)\cdot \log \left(1-\mathit{pred}_{p}\right)\\ &+\left(1-y\right)\cdot x\cdot \log \left(\mathit{true}_{p}\right)+y\cdot \left(1-x\right)\cdot \log \left(1-\mathit{true}_{p}\right)\end{aligned}$$
$$\begin{aligned}\mathrm{BCE\ weight\ 2}={}&\left[x\cdot y\cdot \log \left(\mathit{pred}_{p}\right)+\left(1-y\right)\cdot \left(1-x\right)\cdot \log \left(1-\mathit{pred}_{p}\right)\right.\\ &\left.+\left(1-y\right)\cdot x\cdot \log \left(\mathit{true}_{p}\right)+y\cdot \left(1-x\right)\cdot \log \left(1-\mathit{true}_{p}\right)\right]\\ &+\left[y\cdot x\cdot \log \left(\mathit{true}_{p}\right)+\left(1-y\right)\cdot \left(1-x\right)\cdot \log \left(1-\mathit{true}_{p}\right)\right.\\ &\left.+\left(1-y\right)\cdot x\cdot \log \left(1-\mathit{pred}_{p}\right)+y\cdot x\cdot \log \left(\mathit{pred}_{p}\right)\right]\end{aligned}$$
$$\begin{aligned}\mathrm{BCE\ weight\ 3}={}&2\left[x\cdot y\cdot \log \left(\mathit{pred}_{p}\right)+\left(1-y\right)\cdot \left(1-x\right)\cdot \log \left(1-\mathit{pred}_{p}\right)\right.\\ &\left.+\left(1-y\right)\cdot x\cdot \log \left(\mathit{true}_{p}\right)+y\cdot \left(1-x\right)\cdot \log \left(1-\mathit{true}_{p}\right)\right]\\ &+\left[y\cdot x\cdot \log \left(\mathit{true}_{p}\right)+\left(1-y\right)\cdot \left(1-x\right)\cdot \log \left(1-\mathit{true}_{p}\right)\right.\\ &\left.+\left(1-y\right)\cdot x\cdot \log \left(1-\mathit{pred}_{p}\right)+y\cdot x\cdot \log \left(\mathit{pred}_{p}\right)\right]\end{aligned}$$
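
For readers who prefer code, the expressions above can be transcribed directly. The following is a plain-Python sketch, not the training implementation: the function names are hypothetical, predp and truep are assumed to lie strictly between 0 and 1 so that every logarithm is defined, and the small `eps` clamp on the predictions is an added numerical safeguard that does not appear in the equations.

```python
import math


def f_score(tp: int, fp: int, fn: int) -> float:
    """F score from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bce_weight_1(x, y, pred_p, true_p):
    """BCE weight 1, term by term as written above."""
    return (x * y * math.log(pred_p)
            + (1 - y) * (1 - x) * math.log(1 - pred_p)
            + (1 - y) * x * math.log(true_p)
            + y * (1 - x) * math.log(1 - true_p))


def _second_bracket(x, y, pred_p, true_p):
    """Second bracketed group shared by BCE weights 2 and 3."""
    return (y * x * math.log(true_p)
            + (1 - y) * (1 - x) * math.log(1 - true_p)
            + (1 - y) * x * math.log(1 - pred_p)
            + y * x * math.log(pred_p))


def bce_weight_2(x, y, pred_p, true_p):
    return bce_weight_1(x, y, pred_p, true_p) + _second_bracket(x, y, pred_p, true_p)


def bce_weight_3(x, y, pred_p, true_p):
    return 2 * bce_weight_1(x, y, pred_p, true_p) + _second_bracket(x, y, pred_p, true_p)


def weighted_bce(xs, ys, ws, eps=1e-7):
    """Weighted sum over sites: w_i * (y_i*log(x_i) + (1 - y_i)*log(1 - x_i))."""
    total = 0.0
    for x, y, w in zip(xs, ys, ws):
        x = min(max(x, eps), 1 - eps)  # clamp (assumption) to avoid log(0)
        total += w * (y * math.log(x) + (1 - y) * math.log(1 - x))
    return total


# Example: one reactive site predicted with probability 0.5.
w = bce_weight_1(x=1, y=1, pred_p=0.5, true_p=0.5)
print(w, weighted_bce([0.5], [1], [w]))
```

Each weight is a sum of logarithms of values below 1 and is therefore negative, so its product with the (also negative) log-likelihood term gives a positive loss without an explicit leading minus sign.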