NMR shift prediction from small data quantities

Prediction of chemical shift in NMR using machine learning methods is typically done with the maximum amount of data available to achieve the best results. In some cases, such large amounts of data are not available, e.g. for heteronuclei. We demonstrate a novel machine learning model that is able to achieve better results than other models for relevant datasets with comparatively low amounts of data. We show this by predicting \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^{19}F$$\end{document}19F and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^{13}C$$\end{document}13C NMR chemical shifts of small molecules in specific solvents. Graphical Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s13321-023-00785-x.


Introduction
Prediction of chemical shift in nuclear magnetic resonance (NMR) is a long-standing problem in chemoinformatics.[Dailey and Shoolery(1955) Dailey, and Shoolery] is perhaps the earliest publication in the field.We define prediction here as methods using existing data as opposed to ab-initio calculations.Over time, various methods for such predictions have been developed.In particular, machine learning methods have been applied, starting with early methods, like small neural networks [Kvasnicka et al.(1992) Kvasnicka, Sklenak, and Pospichal], up to the latest developments in convolutional and graph neural networks.We refer the reader to the recent review [Jonas et al.(2022)Jonas, Kuhn, and Schlörer] for an overview.
For supervised learning methods, like the mentioned neural networks, annotated datasets are needed, and the number of data points used is a significant factor in the quality of the predictions.A review of the literature shows, that the datasets used are generally big, consisting of tens of thousands of molecules.Table 1 gives an overview of the number of structures used in recent publications.It should be noted, that sometimes preliminary selection was employed, e.g. the 17,000 structures of [Tsai et al.(2022)Tsai, Amichetti, Zanardi, Grimson, Daranas, and Sarotti] are a selection (using Morgan fingerprints and the MaxMin algorithm from RDKit) from 170,000 molecules by structural diversity.
Literature reference [Jonas and Kuhn(2019) Unfortunately, many studies do not take the influence of the amount of training data on the quality of the prediction into account.An exception is [Kuhn et al.(2022) Kuhn, Borges, Venturini, and Sansotera] , which shows that some machine learning methods only show suitable predictive power with more than 5000 training examples.However, in many practical applications, the amount of experimental data available is quite limited.shifts of heteronuclei, specific classes of compounds, NMR spectra measured with particular solvents, or other certain experimental conditions.For such cases of small amounts of data (defined as less than 5000 structures here), the existing models either do not provide a good solution or are yet untested on such small datasets.
In this paper, we present a graph neural network that is able to achieve good predictions with small amounts of data and is competitive with state-of-the-art models for larger amounts of data.We restrict our research to small organic molecules in solution.The prediction of solid-state NMR chemical shifts, biological macromolecules, or inorganic compounds requires different, and more specific models.Therefore, we exclude those cases from the current research.

Materials and Methods
Predictive models can be applied to many different NMR parameters, such as coupling constants, relaxation time, or peak shape.Here, we want to focus on the prediction of NMR chemical shifts.Modern (solution) NMR experiments are an excellent source of data, as they provide isotropic chemical shift information with little noise [Zangger(2015)].In NMR spectroscopy, the chemical shift, which is equivalent to the resonance frequency of an atom, is determined by the (chemical) environment of a nucleus.The representation of this environment is a hard task.A single atom might be represented by a vector, in which its properties are stored.However, that representation won't work for an entire molecule.Because of this, we disregard tensor-like representation and constitute the molecule as a graph instead.There, the atoms can be described by nodes, and the connecting bonds can be described by edges.Geometric learning with graph neural networks [Battaglia et al.(2018)Battaglia, Hamrick, Bapst, Sanchez-Gonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, Gulc Gori et al.(2005)Gori, Monfardini, and Scarselli] provides a suitable tool to run ML algorithms on such specific structures.In the graph, information is passed along the edges, making information from connected atoms the most important information.This is equivalent to real molecules, where neighbouring atoms, which are connected with a low number of bonds, have the most impact on NMR chemical shift values.Following [Fischer et al.(2022)Fischer, Schwarze, Ristic, and Scheidt] , we use a type of message-passing graph network block with an additional edge aggregation function, shown in Figure 1.Since we employ the network to predict node-level features of small molecules, we disregard the second stage of the graph network described in [Fischer et al.(2022)Fischer, Schwarze, Ristic, and Scheidt] .We use a set of features which is given in Tables 2 and 3 to describe atoms and bonds.This specific feature selection was chosen because of preliminary experiments, which measured the impact of each feature on the final prediction.After that, the best-performing features were combined, until the prediction quality was no longer improving.
To optimize the prediction accuracy, we carefully selected the best hyperparameters for the 2023 model.The most important hyperparameters were the number of message-passing steps, the learning rate and the weight decay.The hyperparameters were optimized on the training set of 19 F data using 4-fold cross-validation.The best performing model used 6 message-passing steps, a learning rate of 10 −3 , and a weight decay of 0.01.
All programming is done in Python using RDKit [rdk(2023(accessed February 24, 2023))] version 2022.9.5 as the main library.Furthermore, mendeleev [men(2014-)] version 0.12.1 is used to calculate some atomic properties.A Jupyter notebook, containing the code and explanations, is contained in the Supplementary Information of this paper.
For comparison, we use two other prediction methods.One are hierarchically ordered spherical environment (HOSE) codes [Bremser(1978)], a long-established method that describes atoms and their environments as strings.With those, the chemical shifts of other, similar atoms are looked up and used for prediction.From the point of view of machine learning, this could be called a nearest neighbour search.The HOSE code implementation used is a port of the HOSE code implementation of the Chemistry Development Kit (CDK) [Willighagen et al.(2017)Willighagen, Mayfield, Alvarsson, Berg, Carlsson, Jeliazkova, Kuhn, Pluskal, ó, Spjuth, Torrance, Evelo, Guha and available at [hos(2023(accessed February 24, 2023))] .This produces standard HOSE codes, not the stereoenhanced HOSE codes of [Kuhn and Johnson(2019) Kuhn, and Johnson] .
We use the model from [Jonas and Kuhn(2019)Jonas, and Kuhn] as a modern machine learning model, which we call the "2019 model" in this paper.This uses a convolutional graphical neural network to combine feature vectors for an atom with those of its neighbours to do the prediction.Evaluation of the methods was generally done using a 75:25 training-to-test split.We have decided against a separate validation set, due to the small size of the datasets.
All data was taken from nmrshiftdb2, an open NMR database [Kuhn et al.(2012) Kuhn, Schlörer, Kolshorn, and Stoll].It contains lists of chemical shift values as well as raw data of various 1D and 2D NMR experiments for a number of different nuclei.We focus on particular subsets here, as explained in the subsections of Section 3. It should be noted that the datasets we used consist of random selections of structures.It might be possible to optimize the training process with small datasets by ensuring structural diversity or even distribution in chemical space.We did not follow this and assumed the random distribution of data.In particular, we include all experimental data from nmrshitdb2 (if they fit the subsets used in Section 3).This is opposed to other work, e.g.[Jonas and Kuhn(2019)Jonas, and Kuhn] where the choice is restricted to molecules with only common elements.This also explains slightly different results using the 2019 model with data from nmrshiftdb2, apart from changes to the database over time.
For comparing the performance of the models in various conditions we report three values: The mean absolute error (MAE), the root mean squared error (RMSE), and the standard deviation σ of the error.The standard deviation is calculated over all of the predictions of the model and is used to measure the amount of variation of the error from the mean.

Overall behaviour
First, we wanted to compare the new 2023 model to HOSE codes and the 2019 model.In order to do so, we analyzed the predictive performance of the different models when trained on an increasing number of molecules.The results are shown in Figure 2 and Table 4.It can be clearly seen that the new model outperforms the 2019 model when trained on up to 2500 data points (structures), whereas the 2019 model performs better from 5000 data points onward.The sharp improvement of the 2019 model in that range was already seen in [Kuhn et al.(2022) Kuhn, Borges, Venturini, and Sansotera], and an explanation of that spontaneous improvement is yet outstanding.
The HOSE codes offer good predictive power that was previously observed in other published work, however, in some cases, a prediction based on HOSE codes is not possible, as noted in the amount of data available is much smaller.Therefore, they are a good test case for our model, where we use 19 F spectra as an example.In nmrshiftdb2, there are currently 957 structures with measured 19 F spectra.We disregard the spectra that are calculated via ab-inito calculations in nmrshiftdb2 and use only one spectrum per compound in the rare case that several spectra are recorded.
We use the same machine learning models and HOSE codes as for 13 C in Section 3.1.It might be possible to improve the prediction by optimizing a model specifically for a nucleus, but this is not within the scope of this work.Generally, 19 F should behave similar to 13 C and 1 H, which might not be the case e.g. for metals.
Table 5 and Figure 3 show the results of the predictions based on the 957 19 F spectra in nmrshiftdb2, using 100, 250, 500, and 957 (all) spectra for training the models.The HOSE code and 2019 model results are similar to those in [Kuhn et al.(2022)Kuhn, Borges, Venturini, and Sansotera] (differences are due to an older version of nmrshiftdb2 used in the paper), with the HOSE codes being significantly better than the 2019 model for those small amounts of data.We expect the model to improve with more data, similar to 13 C, however, the amount of data is limited for this nucleus.Our new model, shown in blue in Figure 3, outperforms the 2019 model when trained on 100 spectra and improves significantly, almost reaching the quality of the HOSE code model when trained on 957 spectra (note Figure 3 uses a logarithmic scale).The standard deviation of the 2023 model surpasses that of HOSE codes with 957 spectra, giving a more stable prediction here.Figure 4 shows the distribution of the 2019 model and HOSE codes, showing the significant improvement with more data.
Predictions based on HOSE codes are fairly accurate but have the inherent disadvantage that they might not give a prediction at all, as discussed previously.Machine learning models will always predict chemical shifts even for less similar molecules, as they are able to generalize.Therefore, the category "Missing Predictions" is only used for the HOSE codes.

Solvents
The solvent used is one of the major factors influencing the chemical shift values of a particular compound due to its influence on the chemical environment of the molecule, the possibility of forming hydrogen bonds, changes in the charge state of the investigated molecule, and more.For prediction purposes, it is common practice to ignore the solvent (e.g.[Jonas and Kuhn(2019)Jonas, and Kuhn] or [Kwon et al.(2020)Kwon, Lee, Choi, Kang, and Kang]).More accurate predictions would require using solvent information.One problem with this is the relatively low number of spectra for particular solvents, even for 13 C and 1 H spectra.For example, nmrshiftdb2 currently has 2324 13 C NMR spectra in Chloroform-D1, 456 spectra in Dimethylsulphoxide-D6, and 351 spectra in Methanol-D4 (those being the most common solvents in the database).
We are using those data to train separate models for each solvent and compare the results to the values achieved by using all 13 C spectra.The results are shown in Table 6.It should be noted that the models are the same as used for the previous prediction with all solvents and the 19 F nuclei and are not optimized for a solvent-specific prediction.We can still make the following observations: • The solvent-specific training produces much better results compared to the overall model.For example, for Chloroform-D1, the 2019 model and the 2023 model reach an MAE of 24.06 respectively 4.55 ppm with 2324 spectra, whereas with all solvents the MAE is 31.81respectively 14.60 with 2500 spectra.Figure 5 shows the MAEs achieved by the two models trained with all data and CDCl3 only.It is clearly visible that the 2023 model outperforms the 2019 model.Furthermore, the CDCl3 predictions are not only better with each model than the predictions with all data, but the improvement with the 2023 model is higher than with the 2019 model.
To further verify our results, we have predicted the shifts of only CDCl3 spectra, but with all data used for training, using the 2019 model.For this, we have divided the CDCl3 spectra into four equal parts and trained models with all non-CDCl3 data and three of those parts.The fourth part was then used as a test set, predicting all shifts of it.The average errors of those for test sets were: MAE 2.04, RMSE 2.65, and δ 2.60.This confirms that the 2019 model is able to achieve good results also on the CDCl3 data alone, given enough data and that the good results of the 2023 model with CDCl3 data are not due to those data.
In order to verify that the distribution of the compounds is not dependent on solvents, we have plotted the compounds in a chemical space chart in Figure 6.Here, all three solvents show a distribution similar to the overall database.It should be noted that all methods would be equally affected by any potential distortions.

Discussion
In this paper, we have designed a new model based on message-passing graph neural networks for predicting chemical shifts.The new network is intended, in contrast to existing models, to work with small amounts of training data.
Testing this new model with 19 F data shows that it is possible to decrease the error rates significantly below the values of a standard deep learning model.Specifically, when trained on all 19 F data from nmrshiftdb2, our model achieves an MAE of 9.95 ppm, whereas the standard deep learning model only achieves an MAE of 57.82 ppm.In a similar fashion, it is also possible to improve the predictions on 13 C chemical shifts with a particular solvent.With the new model, we get an MAE of 4.5 ppm for all spectra, whereas the standard model only achieves 24 ppm.This clearly shows our model performs significantly better on smaller datasets.
Analysing Tables 4 and 5, we can conclude that good results can be achieved by training the new model on roughly 500 structures.This result is empirical and might depend on e.g. the diversity of the structures, but since it holds true for 19 F and 13 C it can be considered a rough threshold value.
Our model is only optimized for the prediction of 19 F chemical shift data and was used as-is for the other test cases.This means, that a more specialized model might perform better still for e.g. a particular nucleus or solvent.One way to improve performance would be to adjust the feature selection specifically for a dataset.The approach of using one model should work within solution NMR of organic compounds, where it could be useful to train a model specifically for a certain compound class.On the other hand, there are areas where this is unlikely to work, e.g. when predicting inorganic compounds or solids.The overall approach however could still prove useful as the availability of data is a problem often faced in research.
In this work, we have not tested all currently available models with different amounts of data.Some of them were published after the 2019 model and claim to have slightly better results using large datasets.Therefore, it is possible that they do better with small datasets, however, we expect the differences to the 2019 model to be negligible, as they have not been specifically built for and tested on smaller datasets.

Conclusion
We have introduced a novel machine learning model which is able to achieve reasonable NMR shift prediction results with a low number of samples.On 13 C NMR spectra, the model achieves an MAE of less than 5 ppm when trained on 1000 data points.Tests were performed on 19 F shifts and solvent-specific 13 C shifts.In all cases, our model outperforms an existing state-of-the-art machine learning model.Further improvements might be possible by optimizing aspects of the model for specific datasets, but in this work, we wanted to focus on the performance boost even the generalized model achieves.This approach might be expanded into other areas such as inorganic compounds, but there further adjustments are likely needed.

Figure 1 :
Figure 1: Information flow in the message-passing graph network of the 2023 model.In addition to node and edge aggregation functions, there is also an additional edge aggregation function feeding into the global update function (from [Fischer et al.(2022)Fischer, Schwarze, Ristic, and Scheidt]).

Figure 2 :
Figure 2: MAE of a 13 C NMR shift prediction, using increasing number of samples.

Figure 4 :
Figure 4: Comparison of the accuracies and their distribution of the HOSE code and GNN prediction.

Figure 5 :
Figure 5: MAE of a 13 C NMR shift prediction, using increasing number of samples.

Figure 6 :
Figure 6: A plot of the compounds of nmrshiftdb2, distinguished by solvent, in chemical space.The calculation uses Extended Connectivity (ECFP) fingerprints to calculate descriptors and t-distributed stochastic neighbor embedding (t-SNE) for dimension reduction.The two major components are plotted.Using code from [Simon(2023 (accessed March 8, 2023))] .

Table 1 :
Jonas, and Kuhn] [Unzueta et al.(2021)Unzueta, Greenwell, and Beran] 13 C [Kwon Examples of papers about chemical shift prediction and the number of structures used.

Table 2 :
Atom features used in the 2023 model.

Table 3 :
Bond features used in the 2023 model.

Table 4 :
Prediction results for 13 C shifts using increasing numbers of spectra

Table 4 .
This can happen if no examples with high enough similarity (i.e. at least one sphere) exist in the training set.The table also shows that the standard deviation of the 2019 model's results is significantly lower than with HOSE codes, indicating that the model is more stable than the HOSE code prediction.
19F NMR shift prediction, using increasing number of samples, on a logarithmic scale.

Table 6 :
• The overall tendency is similar to what we have seen before: The predictive quality of the 2019 model starts off with high errors and significantly improves beyond 1000 spectra.The 2023 model outperforms the 2019 model on smaller datasets, due to its quick improvements when trained on up to 2500 spectra.HOSE codes are generally doing well, but do not improve much.• The 2023 model achieves errors of less than 5 ppm with 1000 spectra for Chloroform-D1.That is better than the 2023 model with all data.For the 2019 model to become better than 5 ppm, almost 5000 spectra are needed.This means that, given the available data, our new model outperforms the 2019 model.For Dimethylsulphoxide-D6 and Methanol-D4, our new model achieves reasonably good results when trained on only 456, or respectively 351 spectra.Those results are much better than the 2019 model trained on the same number of spectra.Prediction results for 13 C shifts using increasing numbers of spectra.n/a indicates that not enough data were available, numbers in brackets indicate number of compounds used, deviating from the top header.