Chemical shifts in molecular solids by machine learning

Due to their strong dependence on local atonic environments, NMR chemical shifts are among the most powerful tools for strucutre elucidation of powdered solids or amorphous materials. Unfortunately, using them for structure determination depends on the ability to calculate them, which comes at the cost of high accuracy first-principles calculations. Machine learning has recently emerged as a way to overcome the need for quantum chemical calculations, but for chemical shifts in solids it is hindered by the chemical and combinatorial space spanned by molecular solids, the strong dependency of chemical shifts on their environment, and the lack of an experimental database of shifts. We propose a machine learning method based on local environments to accurately predict chemical shifts of molecular solids and their polymorphs to within DFT accuracy. We also demonstrate that the trained model is able to determine, based on the match between experimentally measured and ML-predicted shifts, the structures of cocaine and the drug 4-[4-(2-adamantylcarbamoyl)-5-tert-butylpyrazol-1-yl]benzoic acid.

The calculation of chemical shifts in solids has enabled methods to determine crystal structures in powders.The dependence of chemical shifts on local atomic environments sets them among the most powerful tools for structure elucidation of powdered solids or amorphous materials.Unfortunately, this dependency comes with the cost of high accuracy firstprinciple calculations to qualitatively predict chemical shifts in solids.Machine learning methods have recently emerged as a way to overcome the need for explicit high accuracy first-principle calculations.However, the vast chemical and combinatorial space spanned by molecular solids, together with the strong dependency of chemical shifts of atoms on their environment, poses a huge challenge for any machine learning method.Here we propose a machine learning method based on local environments to accurately predict chemical shifts of different molecular solids and of different polymorphs within DFT accuracy (RMSE of 0.49 ppm ( 1 H), 4.3ppm ( 13 C), 13.3 ppm ( 15 N), and 17.7 ppm ( 17 O) with R 2 of 0.97 for 1 H, 0.99 for 13 C, 0.99 for 15 N, and 0.99 for 17 O).We also demonstrate that the trained model is able to correctly determine, based on the match between experimentally-measured and ML-predicted shifts, structures of cocaine and the drug 4-[4-(2-adamantylcarbamoyl)-5-tert-butylpyrazol-1-yl]benzoic acid in an chemical shift based NMR crystallography approach.
Solid-state nuclear magnetic resonance (NMR) spectroscopy is among the most powerful methods for determining the atomiclevel structure and dynamics of powdered and amorphous solids.Notably, solid-state NMR directly probes the local atomic environments and thus allows for characterization without the need for long-range order.This has led to its broad use today in many fields including for instance materials and pharmaceutical chemistry.In the latter the determination of structure and packing is essential to elaborate structure-property relations for formulations in the drug development process.
33][34][35][36][37] Recent studies also suggest that the structural accuracy of chemical shift based solid-state NMR crystallography is at least comparable with more traditional methods, such as single crystal Xray diffraction.38 The power of the method arises from the fact that plane wave DFT with the GIPAW method is accurate enough to reproduce the exquisite sensitivity of chemical shifts to changes in local atomic environments.However, this approach also has severe limitations.The scaling of the computational cost with system size prevents the application to larger and more complex crystals, or non-equilibrium structures.If one wanted to use more accurate ab initio calculations, the expense is prohibitive.
0][41][42][43] Notably, prediction of chemical shifts for the specific case of proteins in solution using methods based on large experimental databases, using traditional [44][45][46][47][48][49][50][51] or machine learning approaches, [52][53][54] have met with considerable success in predicting shifts based on local sequence and structural motifs, and are widely used today.While there are some examples of machine learned experimental and ab-initio chemical shifts of liquid and gas phase molecules, [55][56][57][58][59] to date there is only one example of machine learning being applied to calculations of chemical shifts in solids, which deals with the specific case of silicas. 60Molecular solids are characterized by the combinatorial complexity and diversity of organic chemistry, the subtle dependence on conformations, and the long and short range effects of crystal packing, which leads to a considerably broader range of chemical environments and possible chemical shieldings than found e.g. in proteins.All of these aspects make this class of systems particularly challenging for machine learning.
Here, we develop a machine learning framework to predict chemical shifts in solids which is based on capturing the local environments of individual atoms, and which make it well suited for the prediction of such local properties.The protocol is schematically illustrated in Figure 1.We train the model on structures taken from the Cambridge Structural Database (CSD), 61 chosen to be as diverse as possible, and then show that the method can predict chemical shifts in a test set with a R 2 coefficients between the chemical shifts calculated with DFT and with ML of 0.97 for 1 H, 0.99 for 13 C, 0.99 for 15 N, and 0.99 for 17 O, corresponding to RMSEs of 0.49ppm for 1 H, 4.3ppm for 13 C, 13.3 ppm for 15 N, and 17.7 ppm for 17 O.Predicting the chemical shifts for a polymorph of cocaine, with 86 atoms in the unit-cell, using the ML method takes 46 seconds of CPU time, thus reducing the computational time by a factor of between 5 to 10 thousand, without any significant loss in accuracy as compared to DFT.Most significantly, we show that the model has sufficient accuracy to be used in a chemical shift driven NMR crystallography protocol to correctly determine, based on the match between experimentally-measured and ML-predicted shifts, the correct structure of cocaine, and the drug 4-[4-(2-adamantylcarbamoyl)-5-tert-butylpyrazol-1-yl]benzoic acid (AZD8329).We also show that it is possible to calculate the NMR spectrum of very large molecular crystals.For this we calculate the chemical shifts of six structures from the CSD with between 768 and 1,584 atoms in the unit-cells.

Results and discussion
The approach we take to predicting chemical shifts in molecular solids is illustrated in Figure 1.We use the Gaussian Process Regression (GPR) framework 62 to predict the chemical shift of a new atomic configuration based on a statistical model that identifies he correlations between structure and shift for a reference where  and  ( correspond respectively to a description of the chemical environment of the atom for which we are making a prediction, and that of one of the training configurations.The weights  ( are obtained by requiring that Eq. ( 1) is consistent with the values computed by DFT for the reference structures.The essential ingredient that differentiates one GPR-based framework from another is the kernel function (,  ( ) which describes and assesses the similarity between atomic environments, and provides basis functions to approximate the target properties.
Here our model relies on the Smooth Overlap of Atomic Positions (SOAP) kernel, 63,64 in which any atomic environment is represented as a three dimensional neighbourhood density given by a superposition of Gaussians, one centred at each of the atom positions in a spherical neighbourhood within a cut-off radius rc from the core atom.This framework, combined with GPR, has been used to model the stability and properties of a number of different systems, 40,63,64 and has been extended to the prediction of tensorial properties. 65We can see that this choice of kernel should be particularly well adapted to predicting chemical shifts, since it describes the local environments around each atom without any simplification, and this is indeed what the chemical shift also probes, as it is determined by the screening of the nucleus from the main magnetic field by the electron density at the nucleus.Note that it should be possible to tune and train other ML methods to accurately predict chemical shifts of molecular crystals.While these possibilities will be explored in future work, the model we present here is already accurate enough to substitute for DFT calculations in chemical shift based NMR crystallography.
As shown in Figure 1, the model is developed by using a reference training set of structures for which chemical shifts are known.To obtain a model which is robust and general, the training set should be as large, as reliable, and as diverse as possible.We first extract from the CSD a large set of about 61,000 structures, corresponding to all the structures in the CSD with structure set (CSD-61k)  ).Given that performing a GIPAW calculation for all of these structures would be prohibitively demanding, we then select a random subset of 500 structures (CSD-500) that are representative of the chemical diversity in the CSD, and we use it to test the accuracy of our model.For cross-validation and training, instead, we select 2,000 structures (corresponding to about 70,000 atomic environments) out of the CSD-61k using a farthest point sampling algorithm (FPS) 66,67 (CSD-2k).This step ensures near-uniform sampling of the conformational space, improving the quality of the model when using a relatively small number of reference calculations.
To avoid including spurious environments in the model, e.g.environments which are not well described by DFT, we also automatically detect and discard from the training set all atomic environments with values of the DFT calculated shifts that are anomalous based on a cross validation procedure described in the SI.This pruning as well as the parameter optimization procedure (see below) were done exclusively using cross validation on the CSD-2k set.(Notably the test sets were not subject to any curation.) In order to reduce the computational cost of the training and testing procedures we then finally remove from the training set all the symmetrically equivalent environments.In case of 1 H, this reduced the size of the training set from 70,000 to about 35,000 different atomic environments.(Details of the selection method and the members of the different sets used are given in SI).
All the atomic positions of the structures in the training and testing sets were relaxed with DFT, using the Quantum Espresso suite, 68,69 prior to calculation of the chemical shieldings using the GIPAW DFT method. 4,5Parameters for the DFT calculations are given in the SI.The calculated chemical shieldings s are converted to the corresponding chemical shifts d through the relationship d = srefs.Here, we used a sref of 30.8 ppm (for 1 H) and 169.5 ppm (for 13 C), found through linear regression between the calculated and experimental chemical shifts for cocaine.
Figure 2 shows the results of the predictions of the chemical shifts of the CSD-500 set, which is representative of the expected accuracy for the entire CSD-61k.The figure shows the overall prediction accuracy for 1 H chemical shifts as root-meansquare-error (RMSE) in ppm between the shifts calculated with DFT and with ShiftML as a function of the cut-off radius (rc) and as a function of the number of training structures included from CSD-2k.The effect of the different cut-off radii is clearly visible.For example, for rc=2Å the prediction error for a small training set (<10 structures or <100 atomic environments) can be smaller than for the larger radii, but does not improve significantly with increasing size of the training set.On the contrary, for rc=7Å we observe a relatively large prediction error for a small training set, but even with 2,000 structures (35,000 environments), the prediction error is still decreasing.A similar behaviour is observed for the prediction errors of the 13 C, 15 N and 17 O chemical shifts (SI).
The observed differences in the behaviour of the prediction error with respect to rc clearly indicates the influence of the different extents of the local environment on the chemical shift.Short range interactions are sufficient to explain the rough order of magnitude of the shift, but long range interactions are required to learn about the higher order influences of next-nearest neighbours on shifts.However, for long range interactions, a much larger number of environments is needed in order to determine the correlation between environment and shift.
We exploit these differences to generate a combined SOAP kernel consisting of a linear combination of the single local environment kernels, 40 with weightings of 256 (rc=2Å), 128 (rc=3Å), 32(rc=4Å), 8 (rc=5Å and rc=6Å) and 1 (rc=7Å).This weighting was determined by rough optimization around values inspired by previous experience, 40 and by cross-validation on the CSD-2k training set (as described in SI).It is clear that learning with the combined kernel leads consistently to lower prediction errors than any of the single kernels, although the improvement in performance varies between nuclei (SI).
Figure 3a-d shows correlation plots between 1 H,  We observe that the performance of the model degrades considerably if one does not use the procedure we developed to remove environments for which the DFT shifts appear to be inconsistent.Note that the CSD-500 set used for testing is selected randomly from CSD-61k and not curated.Indeed, we find that many of the atomic environments in the CSD-500 set with a relatively high prediction RMSE possess either unusual cavities inside their crystal structure, possibly indicating an organic cage surrounding non-crystalline solvent or other atoms, or exhibit strongly delocalised π-bonding networks.While there is no theoretical reason preventing the machine learning model from correctly describing such environments, they are rare and not well represented within the training set.CSD-500 thus constitutes a fairly demanding test set.
Having evaluated the power of the trained model to predict the diverse CSD-500 set, we now look at the capacity to predict potentially subtler differences by looking at a set of polymorphs of a given structure.Figures 4a and b show the correlation between the 1 H shifts calculated by GIPAW DFT and by ShiftML for 30 polymorphs of cocaine and 14 polymorphs of AZD8329, all of which were previously generated with a crystal structure prediction (CSP) procedure. 18,32The figure clearly shows that ShiftML is able to accurately predict the differences in 1 H chemical shift for different polymorphs.
We find a chemical shift prediction error (RMSE) for 1 H for the cocaine polymorphs of 0.37 ppm and for AZD8329 of 0.46 ppm, which is very comparable to the expected GIPAW DFT accuracy.
Note that for these cases the DFT structure optimization and GIPAW chemical shift calculation were done with a different DFT program (CASTEP) 71 , which suggests that ShiftML is robust with respect to small deviations from the fully optimized structures.(As shown in the SI, performing the prediction using Quantum Espresso consistently leads to comparable prediction accuracy.) For the heteronuclei we obtain for cocaine 3.8 ppm for      Further, the significance of the method is illustrated by comparison to experimentally measured shifts, where we find that the predicted shifts are accurate enough to allow crystal structure determination for both cocaine and AZD8329 from powder samples in a chemical shift driven NMR crystallography approach.
Figures 4c and d show the correlation between experimentally measured 1 H chemical shifts and the 1 H chemical shifts calculated by ShiftML for crystal structures of the six molecules shown in scheme 1 (numerical values of the experimental chemical shifts, the crystal structures, and the shifts calculated with ShiftML are given in the SI).The comparison between experimental and calculated 1 H chemical shifts for all crystal structures (for a total of 68 shifts) gives an error (RMSE) of 0.39 ppm and a R 2 coefficient of 0.99.
Figures 4e and f show in blue the RMSE between DFT calculated and experimental 1 H chemical shifts for the 30 polymorphs predicted by CSP to have the lowest energy for cocaine and the 14 cis polymorphs of AZD8329.For both molecules the only structure in agreement with the GIPAW DFT calculations, to below a 1 H DFT chemical shift confidence interval of 0.49 ppm, 15 is the correct crystal structure.In the same plots we overlay the result where the experimental shifts are now compared to shifts predicted with ShiftML.Note that the RMSE between experiment and the predicted chemical shifts follows the same trends as for the DFT calculated shifts, and that here again the only structures below the confidence interval of 0.49 ppm are the two correct crystal structures.Note, that the cut-off of 0.49 ppm with respect to experiment has been evaluated for GIPAW DFT chemical shifts 15,70 and to rigorously repeat the CSP procedure for the ML method, the accuracy should be re-evaluated using more extensive benchmarking of ShiftML to experiment, which will be the subject of further work.
Finally, we note that the accuracy of the method does not depend on the size of the structure, and that the prediction time is linear in the number of atoms.For the structures we calculate here the prediction time actually appears nearly constant, because it is dominated by the loading time of the reference SOAP vector (see Figure 5a).We have used this method to calculate the NMR spectra (shown in Figure 5b-g) for six structures from the CSD having among the largest numbers of atoms per unit cell (containing only H,C,N,O), with between 768 and 1,584 atoms per unit cell.(See scheme S1 for the chemical formula).The values of the predicted chemical shifts are given as CSD-6 in the SI. Figure 5a shows the comparison between the GIPAW calculation time and the required ML prediction time.We estimate that the whole calculation would require around 16 CPU years by GIPAW.ShiftML requires less than 6 CPU minutes to calculate the shifts for all the compounds.In summary, we have presented a ML model based on local environments to predict chemical shifts of molecular solids containing HCNO to within current DFT accuracy.The R 2 coefficients between the chemical shifts calculated with DFT and with ShiftML are 0.97 for 1 H, 0.99 for 13 C, 0.99 for 15 N, and 0.99 for 17 O.The approach allows the calculation of chemical shifts for structures with ~100 atoms in less than 1 minute, reducing the computational cost of chemical shift predictions in solids by a factor of between 5 to 10 thousand compared to current DFT chemical shift calculations, and thereby relieves a major bottleneck in the use of calculated chemical shifts for structure determination in solids.
Far from being just a benchmark of a machine-learning scheme, the method is accurate enough to be used to determine structures by comparison to experimental shifts in chemical shift based NMR crystallography approaches to structure determination, as shown here for cocaine and AZD8329.The ML model only scales linearly with the number of atoms and, for the prediction of individual structures, is dominated by a constant I/O overhead.Here it allows the calculation of chemical shifts for a set of six structures with between 768 and 1584 atoms in their unit cells in less than six minutes (an acceleration of a factor 10 6 for the largest structure).
The code used for the ML shift calculations is available as an online server at http://shiftml.epfl.ch.The accuracy of the method is likely to increase further with the size of the training set, and subsequently with the future evolution of the accuracy of the method used to calculate the reference shifts used in training (here DFT), or by using experimental shifts if a large enough set were available.
The model used here can easily be extended to organic solids including halides or other nuclei, and to network materials such as oxides, and these will be the subject of further work.

Methods
For the SOAP kernels, 63,64 each atomic environment is represented as a three dimensional neighbourhood density given by a superposition of Gaussians, one centred at each of the atom positions in a spherical neighbourhood within a cut-off radius rc from the core atom.The Gaussians have a variance  2 , and a separate density is built for each atomic species.The kernel is then constructed as the symmetrized overlap between the amplitudes representing  and ′.This degree of overlap thus measures the similarity between the environments X and X'.
The SOAP and GPR parameters are given in the SI.SOAPbased structural kernels contain several adjustable hyper-parameters, which are discussed in refs. 64However, we have not systematically explored the full parametric space here, instead we chose reasonable values of the parameters without extensive fine-tuning, based on previous experience 40 and with some optimization by cross-validation on the CSD-2k training set (see SI for details).We also combine kernels computed for different cutoff radii to capture the contributions to shifts from different length scales, 40 as is described in detail above.The calculations of the local environment, the similarity kernel and the weighted correlations were done using the GLOSIM2 package. 78e program used to calculate shifts using the protocol described here is called ShiftML, and is publically available as an online server at http://shiftml.epfl.ch.

Supporting Information.
Details on the structure selection, crystal structure prediction, the DFT calculations, the GPR method, the SOAP kernels, the FPS algorithm, the detection procedure of unusual environments, NMR crystallography and the DFT calculation time estimates.Prediction parameters optimization, learning curves and evaluation curves for 13 C, 15 N and 17 O.Predicted and GIPAW chemical shieldings for all cocaine and AZD8329 polymorphs.Chemical formula and predicted chemical shieldings of the CSD-6 set predicted with ShiftML.The Refcodes for CSD-2k and CSD-500.

I. Methods and Theoretical Background
All the crystal structures of CSD-61k and CSD-500 were obtained from the Cambridge Structural Database (CSD). 1 A total of 88,648 structures was downloaded from the CSD, using two different selection criteria: the maximum number and the type of atoms contained in the unit-cell.We selected only structures with a maximum of 200 atoms, containing either (i) only H and C or (ii) H, C and one heteroatom between N and O or both.From this set we extracted a subset of 61,012 (CSD-61k) structures by removing (i) structures with ambiguous atom positions, and (ii) structures where the distance of at least one pair of atoms was smaller than the sum of their covalent radii minus 0.3 Å.The remaining structures were then used to create both the training (CSD-2k) and the testing set (CSD-500) for the 1 H, 13 C, 15 N and 17 O chemical shift prediction as described in the main text.The test set (CSD-500) was created by randomly picking 500 structures from the CSD-61k excluding the structures already selected for the training set.

Crystal Structure Prediction.
Here we use a set of polymorphs predicted by CSP for cocaine and the drug 4-[4-(2-adamantylcarbamoyl)-5-tertbutylpyrazol-1-yl]-benzoic acid (also referred as AZD8329).General details on the CSP protocol can be found in ref.
2. In chemical shift based NMR crystallography, the CSP polymorphs are tested against experimental parameters ( 1 H chemical shifts) to determine the experimental crystal structure.In this work we used 30 polymorph structures of cocaine and 14 structures of AZD8329 generated with CSP.The 30 structures of cocaine were obtained from the Electronic Supporting Information (ESI) of ref. 3, and correspond to the most stable polymorphs obtained with CSP.Crystal structures of AZD8329 were obtained from the ESI of ref. 4, and correspond to the 14 most stable predicted polymorphs with the cis conformation of the amide bond.From the same sources we obtained chemical shifts for each structure calculated with GIPAW 5,6 using the DFT program CASTEP 7 and the experimental chemical shifts.Labels for the different polymorphs of each structure are based on their energy, with 1 being the most stable polymorph of a given molecule.

DFT Calculations.
All the DFT calculations were carried out using the DFT program Quantum ESPRESSO. 8,9For all structures in the CSD-2k and CSD-500 databases we first carried out geometry optimization using plane wave DFT.We used ultrasoft pseudopotentials with GIPAW 5,6 10 The optimizations were done with the generalized-gradient-approximation (GGA) density functional PBE, 11 using a wave-function energy cut-off of 60 Ry, a charge density energy cut-off of 240 Ry and without k-points.The Grimme van der Waals dispersion correction 12 was included in order to account for van der Waals interactions.The geometry optimization was done relaxing all atomic positions while keeping the lattice parameters fixed.A single point energy (scf) was then computed for the relaxed geometry, using higher wave-function and charge density energy cut-offs which were set to 100 Ry and 400 Ry respectively.For this calculation we also used a Monkhorst-Pack grid of k-points 13 corresponding to a maximum spacing of 0.06 Å -1 in the reciprocal space.The kpoints and energy cut-off values were optimized to ensure scf convergence.Finally, we calculated the chemical shielding sDFT using the GIPAW method, with the same parameters as used in the scf calculation.

Machine Learning.
We model the isotropic chemical shielding as a function of the local environment  using a Gaussian Process Regression framework, that assumes that chemical shift values predicted by the model can be written as () = () + , where the function  is a Gaussian Process 14 and  represents the error of the prediction, which is modeled as independent identically distributed Gaussian variates, with variance  * + .Following the Gaussian Process Regression framework, the isotropic chemical shielding function becomes: , where { .} .34  2 is a training set of N reference local environments for which the isotropic chemical shieldings are known,  is a kernel function measuring the covariance between local environments and  is a hyperparameter controlling the sensitivity of the kernel.
The weights can be computed by inverting the kernel matrix  .9= : .,  9 ; computed between the reference configurations, including a regularization that depends on an estimate of the intrinsic uncertainty in the fit, due to errors in the training set, the limitations of the model or the reduced number of training configurations .9 9

𝜎:𝑋 9 ;
To assess the correlation between local atomic environment A and B, we use the SOAP kernel 15 defined by the rotationally invariant overlap between smooth representations of their atomic density: where the density is built as a superimposition of Gaussians having width ς, centered on the atoms within a cutoff distance of the central atom in the environment The details of the construction, and the extension to the case with many atomic species, are given in refs.16 and 17.
Farthest Point Sampling Algorithm.
Given that a GPR model is essentially an interpolation procedure between the reference configurations, it is crucial that training points are chosen to cover as uniformly as possible the space of structures for which one wants to perform predictions.To achieve this uniform sampling we use a farthest point selection algorithm 18,19 to sort the CSD-61k in descending order of "diversity".Essentially, we select a first structure at random, and then pick the others in the sequence such that where the distance is the kernel-induced distance associated with an average SOAP kernel for the entire structure. 16he CSD-2k set corresponds to the first 2,000 configurations identified with this procedure.

Detection of Unusual Environments.
The quality of the training set is essential to ensure the optimal performance of a machine learning algorithm.However, the individual curation of the 2,000 molecular crystals of the CSD-2k dataset would be very time consuming and cumbersome.Note, that the 2,000 molecular crystals correspond to around 35,000 symmetrically non-equivalent atomic environments for 1 H alone and the following detection procedure is applied directly to the individual atomic environments instead of the whole molecular crystals.
We automate this detection procedure by assessing the 'instability' of the prediction of the shielding of a given local environment using the difference between the predictions of several GPR models and the reference DFT-shielding.We define this indicator as: , where each of the M models is made using a 2-fold split of the shuffled training set that does not include the structure X.In total we generate M=40 models, where each is generated using a different random shuffling of the data.Environments with a large value of |()| are not well-described by the rest of the training set within the SOAP-GPR framework.Note, that the error would cancel out in the case of random noise within the prediction, while a large value of |()| corresponds to a systematic error in the predicted chemical shielding, that could be associated to the limitations listed below.We define local environments to be unusual when |()| is larger than three times the standard deviation of |()| over the whole training set, and we then do not use them for training.
We perform this elimination procedure on the CSD-2k dataset using a single kernel for each element (rc = 4.5 Å for 1 H, 4 Å for 13 C, 4 Å for 15 N and 3 Å for 17 O).The hyperparameters of the single kernels used in the elimination procedure were determined using a grid search and 3-fold cross validation on the uncleaned CSD-2k training set.
The 1 H environments excluded with this approach are shown in Figure S1, while further details for 1 H and the other nuclei are listed in section VII.
It is interesting to see that in several cases we can trace the unusual behavior of the environment to subtle errors in the DFT calculations, or to physical phenomena that are ill described within our DFT model (metallic systems, zwitterions,…).Most of the environments detected as "unusual" are part of zwitterionic structures or charged structures (such as VIWYEH, ZACSOO or EKUJIF).Others are metallic structures (ELUMO -EHOMO = 0), such as HAZQUV, QUICNA02, DMEBQU01 or AYUKIP, or have a partially empty unit cell (QAHVUQ).An intrinsic limit of this procedure is the fact that it might detect structures with uncommon functional groups as "anomalies" (e.g.TIMCHX, which is an aziridine -a three membered heterocycle with one amine group, or FIGMAJ which has a cubane group), due to the fact that these structures are not well represented by the used training set.However, with increasing training size, we expect these structures to be better represented and they will not be detected as anomalies anymore.

NMR Crystallography.
To validate the accuracy of the chemical shifts calculated with ShiftML, we replicated the last step of the protocol for the ab initio crystal structure determination of powdered solids 3,4,20 using predicted shifts.This step consists in the comparison between experimental and predicted 1 H chemical shifts for the candidate crystal structures selected from a crystal structure prediction method.We perform this analysis for cocaine and form 4 of AZD8329. 3,4The value sref for the conversion between chemical shieldings to chemical shifts is calculated for each structure with a linear regression between calculated and experimental shifts, imposing a slope equal to 1.This procedure is done independently for the 1 H chemical shieldings calculated with DFT and ShiftML.The geometry of the structures predicted with CSP, as well as their relative chemical shift values calculated with GIPAW and the experimental chemical shifts of the observed polymorphs were obtained from refs. 3 and 4. Remarkably, the high accuracy shown in Figure 4 was obtained using crystal structures with only 1 H positions relaxed and DFT chemical shift calculations carried out using a different program (CASTEP) to the one we used to build our training set (Quantum Espresso).
Figure S2 shows the results obtained for cocaine and AZD8329 after all-atom optimization and calculation of GIPAW chemical shifts with Quantum Espresso.Here we show fewer structures compared to Figure 4, due to the fact that we limit ourselves to calculate DFT chemical shifts of structures with less than 250 atoms.This selection removes structures 15 for cocaine and structures 2, 11 and 14 for AZD8329.The accuracy is consistent with that reported in Figure 4, although the all-atom optimization leads to some significant structural differences compared to the only 1 H relaxed structures, especially for AZD8329.We find a chemical shift prediction error (RMSE) for 1 H for cocaine of 0.40 ppm and for AZD8329 of 0.51ppm, which is very comparable to the expected GIPAW DFT accuracy.For the heteronuclei we obtain, for cocaine and AZD 8329 respectively, 3.5 and 3.4 ppm for 13 C, 9.3 ppm and 11.0 ppm for 15 N and 12.2 ppm and 11.5 ppm for 17 O.
Experimental chemical shifts were referenced to the 1 H resonance observed for adamantane at 1.87 ppm with respect to TMS.We used assigned chemical shifts values and we account for rotational dynamics of the methyl groups by averaging the chemical shift values of the three 1 H positions to a single value for each methyl group.For AZD8329 the chemical shifts of the CH2 groups were also averaged.The RMSE calculation was carried out in MATLAB using a home-written script.The chemical structures of cocaine and AZD8329, together with the assignment of the experimental chemical shifts are shown in Figure S3 and Table S1. 1 H chemical shieldings calculated with GIPAW and ShiftML are given as separate .csfiles, named according to the corresponding structure and method used for DFT calculations.Each file contains the 1 H chemical shift calculated with GIPAW (first column) and predicted with ShiftML (second column) ordered according to Table S1.

Cocaine AZD8329
Atom Label Table S1.Experimental chemical shifts of cocaine and the form 4 of AZD8329.Labelling scheme is given in Figure S3.When more than one atom corresponds to a single chemical shift value, their values were average

II. DFT Calculation Times
Figure S4 shows the CPU time needed for part of the GIPAW DFT calculations done for this work.The calculations shown in Figure S4a were done on polymorph 1 of the cocaine dataset, which contains 86 atoms per unit-cell, while the one in Figure S4b were done on 500 structures of the CSD-2k set.In Figure S4a the calculation time is plotted as a function of the number of Monkhorst-pack k-points per axis for three different energy-cut-off (Ecutoff) values: 40 Ry (blue), 70 Ry (red), 100 Ry (yellow).When increased, these two parameters improve the accuracy of the calculation, but at the same time they drastically increase the computational time needed to carry out the calculation.Figure S4b shows the CPU time for the GIPAW chemical shift calculations (blue dots) and for the DFT structure optimizations (green squares) as a function of the number of valence electrons (Ne) per unit-cell.For the GIPAW chemical shift calculations the energycut-off was 100 Ry, using a Monkhorst-pack grid with a k-point spacing of 0.06 Å -1 .For the DFT structure optimizations the energy-cut-off was 60 Ry and no k-points were used.The red line shows the best fit between the number of valence electrons and the required CPU time for the GIPAW chemical shift calculations as  _op =  e + +  e O , where the  e O scaling accounts for the general DFT scaling and the  e + describes the scaling of the matrix inversion, which dominates for small system sizes.The best fit parameters are given as 8.83e-04 (a) and 1.02e-06 (b).
Currently the machine learning model has only been rigorously tested and applied for structures optimized with DFT.Also slight structural changes away from the equilibrium geometry of a molecular crystal have been shown to result in significant changes in the chemical shifts. 21For this reason, the predictive accuracy of ShiftML for non-equilibrium structures has not yet been quantified.This will be the subject of further work.However, Figure S4b clearly shows that the computational cost for the structure optimization is negligible compared to the computational cost of the GIPAW chemical shift calculations.For structures with  e ≈ 100 the GIPAW shift calculations require around 10x more CPU time as the DFT structure optimization, and for  e ≈ 1,000, 80x more CPU time is required.

III. ShiftML Prediction Times
The ShiftML prediction times scales linearly with the number of atoms per unit cell.However, for all the structures investigated here (from 20 to 1,500 atoms per unit-cell) the required prediction time is dominated by a constant prefactor associated with the used training set.
Prior to the prediction step, the SOAP reference vector between the test and the training structures is created.This step should be linear in the size of the test-structure, but is currently dominated by the size of the training set.As a result this takes around one CPU minute for any of the investigated structures here.The actual subsequent chemical shift prediction, which is linear in the number of atoms within the teststructure, requires at most 10-20 CPU seconds for the large investigated structures.Note that prior to the chemical shift predictions, the single kernels for all the atomic species must be loaded into virtual memory and the multiscale kernel created.On one CPU this currently takes around 45 minutes.Note, that this has to be done only once, independently of the number and size of the test-structures that are subsequently calculated.
k-points Table S2 and S3 shows the parameters used for the single and the multi-scale kernel predictions respectively.Using these parameters, we obtained the curves shown in Figure 2 in the main text and the ones shown in Figures S5, S6, S7 and S8.Figures S5 and S6 show the RMSE and MAE learning curves for 1 H, 13 C, 15 N and 17 O for the different local environment cut-off radii, and for the multi-kernel.The training was done on up to 1500 randomly selected frames, while testing on 400 structures selected randomly from the CSD-2k set excluding the structures already selected for the training set.For each point, the random sampling was repeated N times (where N is equal to 300, 255, 215, 170, 130, 85, 45, 5 respectively for training set sizes of 40, 100, 200, 400, 600, 1000, 1400 and 1500 structures) Figure S7 and S8 show the results of the predictions of the chemical shifts of the CSD-500 set as a function of the cut-off value and the size of the training set.The parameters for the multi-scale kernel prediction were optimized using 3-fold cross validation on the CSD-2k set.The results for 1 H are listed in table S4.S4.Optimization of kernel weights and GPR parameters (sn and z) for multi-scale kernel prediction of 1 H chemical shifts.The optimization was carried out on the CSD-2k set, using 3-fold cross validation.For each configuration are reported the corresponding mean absolute error (MAE), root-mean-square error (RMSE), the R-squared (R 2 ) coefficient and the supremum (SUP).In bold is shown the set of parameters that we selected.S5.Optimization of kernel weights and GPR parameters (sn and z) for multi-scale kernel prediction of 13 C chemical shifts.The optimization was carried out on the CSD-2k set, using 3-fold cross validation.For each configuration are reported the corresponding mean absolute error (MAE), root-mean-square error (RMSE), the R-squared (R 2 ) coefficient and the supremum (SUP).In bold is shown the set of parameters that we selected.Table S6.Optimization of kernel weights and GPR parameters (sn and z) for multi-scale kernel prediction of 15 N chemical shifts.The optimization was carried out on the CSD-2k set, using 3-fold cross validation.For each configuration are reported the corresponding mean absolute error (MAE), root-mean-square error (RMSE), the R-squared (R 2 ) coefficient and the supremum (SUP).In bold is shown the set of parameters that we selected.S7.Optimization of kernel weights and GPR parameters (sn and z) for multi-scale kernel prediction of 17 O chemical shifts.The optimization was carried out on the CSD-2k set, using 3-fold cross validation.For each configuration are reported the corresponding mean absolute error (MAE), root-mean-square error (RMSE), the R-squared (R 2 ) coefficient and the supremum (SUP).In bold is shown the set of parameters that we selected.

V. Comparison to Experiments
Comparison between 1 H experimental chemical shifts and 1 H chemical shifts calculated with ShiftML were carried out analysing 68 chemical shifts obtained from 6 crystal structures.The names, IUPAC IDs, CSD refererence codes (when available) and references to the experimental NMR data of the analysed crystal structures are the following: The crystal structures (i-iv) were obtained from Ref. 25, where the experimentally determined crystal structures were subjected to all-atom geometry optimization with fixed lattice parameters, as described in the reference.Crystal structures (v) and (vi) were obtained from Refs. 3 and 4 respectively.
We used assigned chemical shift values and we account for rotational dynamics of the methyl groups by averaging the chemical shift values of the three 1 H positions to a single value for each methyl group.The calculated chemical shieldings s are converted to the corresponding chemical shifts d through the relationship d = sref -bs.For each structure, we calculated the value of sref and b by a linear regression between calculated and experimental shifts.The calculations were carried out in MATLAB using a home-written script.The chemical structures, together with the assigned experimental chemical shifts and the parameters for conversion between shieldings and shifts are shown in Figure S9 and Table S8.S8.Experimental and calculated chemical shifts of naproxen, uracil, the co-crystal of 3,5dimethylimidazole and 4,5-dimethylimidazole, theophylline, cocaine and AZD8329.Labelling scheme is given in Figure S3.When more than one atom corresponds to a single chemical shift value, their values were averaged

Figure 1 .
Figure 1.Scheme of the machine learning model used for the chemical shift predictions.

Figure 2 .
Figure 2. 1 H chemical shift prediction error of the trained model on the CSD-500 set.The prediction error as RMSE is shown for the different local environment cut-off radii, and for the multi-kernel (labelled as msk), as a function of the training set size.

Figure 3 .
Figure 3. Histograms and scatterplots showing the correlation between 1 H (a), 13 C (b), 15 N (c) and 17 O (d) chemical shifts (shieldings) calculated with GIPAW DFT and ShiftML.The black lines indicate a perfect correlation and the grey zones represent the confidence limits between experiment and DFT calculated shifts currently used.15

Figure 4 .
Figure 4. Comparison of ShiftML predictions to experimental shifts and to GIPAW predictions for polymorphs of cocaine and AZD8329.(a) Histogram showing the distribution of the differences between 1 H chemical shifts calculated with GIPAW and ShiftML.The blue bars were calculated for the polymorphs of cocaine, and the orange ones for the polymorphs of AZD8329.(b) Scatterplot showing the correlation between GIPAW and ShiftML 1 H chemical shifts for cocaine (blue) and AZD8329 (orange).(c) Histogram showing the distribution of differences between experimentally measured 1 H chemical shifts and 1 H chemical shifts calculated with ShiftML for six different crystal structures (see SI for the structures and numerical values of the shifts).(d) Scatterplotshowing the correlation between these experimentally measured 1 H chemical shifts and shifts calculated with ShiftML.(e-f) Comparison between calculated and experimental 1 H chemical shifts for the most stable structures obtained with CSP for cocaine (e) and AZD8329 (f).For each candidate structure an aggregate RMSE is shown between experimentally measured shifts and shifts calculated using either GIPAW (blue) or ShiftML (red).The highlighted bar correspond to the expected errors, and candidates that have RMSEs within this range would be determined as correct crystal structures using a chemical shift driven solid-state NMR crystallography protocol.The black line in (b) and (d) indicates a perfect correlation.In (a-f) the grey zones represent the confidence intervals of the 1 H chemical shift RMSD, as described in the text.15

Figure S1. 1 H
Figure S1. 1 H chemical shifts of the 76,214 environments in the CSD-2k set.The environments excluded using the unusual structures detection procedure described in the text are shown in red.

Figure S2 . 20 0Figure S3 .
Figure S2.(a) Histogram showing the distribution of the differences between chemical shifts calculated with GIPAW and ShiftML.The blue bars were calculated for the polymorphs of cocaine, and the orange ones for the polymorphs of AZD8329.(b) Scatterplot showing the correlation between GIPAW and ShiftML chemical shifts for cocaine (blue) and AZD8329 (orange).The black line indicates a perfect correlation.(c-d)Comparison between calculated and experimental 1 H chemical shifts for the most stable structures obtained with CSP for cocaine (c) and form 4 of AZD8329 (d).Chemical shifts were calculated using GIPAW (blue) and ShiftML (red).The highlighted bars correspond to the candidates that would be selected as correct crystal structures using the chemical shift based solidstate NMR crystallography protocol.In (a-d) the grey zones represent the confidence intervals of the 1 H chemical shift RMSD, as described in the text.20

Figure S4 .
Figure S4.CPU time for NMR chemical shift calculations using the GIPAW method.(a) The CPU time is shown as function of the DFT accuracy, determined by the plane-wave cutoff energy  Yvghww and the number of k-points in each dimension for polymorph 1 of cocaine.The charge density energy cut-offs were set to  x = 4 Yvghww .(b) The CPU time is shown as function of increasing system size in CSD-2k.The green squares and blue dots show individual geometry optimization and GIPAW chemical shift DFT calculations, respectively.The red line shows the best fit between the number of valence electrons and the required CPU time as  _op =  e + +  e O .
Parameters and Evaluation Curves of 13 C, 15 N and 17 O

Figure S8 .
Figure S8.MAE evaluation curves of 13 C (a), 15 N (b) and 17 O (c) for different training set sizes, evaluated on the CSD-500 test set.The MAE evaluation curves were acquired as described in the paper.The multi-kernel learning-curve is labelled as msk.

Figure S9 .
Figure S9.Chemical structure of cocaine (a), 3,5-dimethylimidazole and 4,5-dimethylimidazole (b), AZD8329 (c), naproxen (d), theophylline (e) and uracil (f), and the labelling scheme used here. 13 C,170 and17O chemical shifts calculated by DFT and by ShiftML for the CSD-500 set trained on the whole CSD-2k combined kernel.Using the combined kernel, we reach a prediction error of 0.49 ppm for 1 H (4.3 ppm for 13 C, 13.3 ppm for 15 N and 17.7 ppm for 17 O).This is very comparable with reported DFT chemical shift accuracy for 1 H of 0.33-0.43ppm,15,70whilerequiringafraction of the computational time and cost: 46 CPU seconds compared to ~62-150 CPU hours for DFT chemical shift calculation on structures containing 86 atoms (around 350 valence electrons) (see SI).For the other nuclei, the ML accuracy is slightly lower than reported values (1.9-2.2 ppm for 13 C, 5.4 ppm for 15 N and 7.2 ppm for 17 O),15,70which is not surprising as there are (currently) significantly less training environments for the heteronuclei than for 1 H.The R 2 coefficients between the chemical shifts calculated with DFT and with ShiftML are 0.97 for 1 H, 0.99 for 13 C, 0.99 for 15 N, and 0.99 for 17 O.
For AZD8329 the15N and17O RMSEs are proportionally larger(17.7 and 54.7 ppm), and we attribute this to the fact that the molecule contains a rather unusual C-O…H-N / C-O…H-O H-bonded dimer structure, for which the learning is thus even sparser than for the heteronuclei in general.To illustrate the unusual nature of this motif, we note that the calculated17O shifts using DFT also change by up to 50 ppm for structures relaxed either by the CASTEP protocol used in ref.30, or the Quantum Espresso protocol used here (the RMSE between ML and DFT for the Quantum Espresso relaxed structures is reduced to 10.9 and 11.5 ppm for 15 N and 17 O!).The RMSE of 4.0 ppm for13C for AZD8329 is in line with the other systems.
13C, 12.1 ppm for 15 N and 15.7 ppm for17O.These values for cocaine are again very comparable to the best accuracy so far reported for DFT calculations.

Table S2 .
Kernel and GPR parameters.The GPR parameters (sn and z) are the ones used in single kernel predictions.

Table S3 .
Kernel weights and GPR parameters used for multi-scale kernel prediction.