Rapid prediction of NMR spectral properties with quantified uncertainty

Accurate calculation of specific spectral properties for NMR is an important step for molecular structure elucidation. Here we report the development of a novel machine learning technique for accurately predicting chemical shifts of both \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${^1\mathrm{H}}$$\end{document}1H and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${^{13}\mathrm{C}}$$\end{document}13C nuclei which exceeds DFT-accessible accuracy for \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${^{13}\mathrm{C}}$$\end{document}13C and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${^1\mathrm{H}}$$\end{document}1H for a subset of nuclei, while being orders of magnitude more performant. Our method produces estimates of uncertainty, allowing for robust and confident predictions, and suggests future avenues for improved performance. Electronic supplementary material The online version of this article (10.1186/s13321-019-0374-3) contains supplementary material, which is available to authorized users.

Mean absolute error: 2 Other methods

HOSE codes
Hierarchically Ordered Spherical Environment (HOSE) codes [1] encode the environment of an atom in a molecule in a sphere-wise manner. Spheres are defined by bond distance, so the atoms one bond away form the first sphere, those two bonds away the second sphere etc. For each atom, information including element symbol and bond order is recorded. These codes are then written in a single character sequence, starting with the center atom and going up as many spheres as requested. A typical HOSE code would be: Here, the central atom, for which the HOSE code is generated, is a carbon atom with three neighbours. The first sphere (i.e. the atom in distance of one bond to the central atom) consists of a double bonded (indicated by "=") oxygen and a single-single bonded carbon and nitrogen atom (single-bonds are default). The next sphere is separated by "(" and inside the sphere, the atom bonded to each atom in the previous sphere are separated by ",". So here, no atom is bonded to the oxygen, an aromatic (indicated by "*") carbon and nitrogen atom are bonded to the carbon atom, and two carbon atoms are bonded to the nitrogen. The further spheres are described in a similar fashion, with the sphere deliminator changing to "/" and ")".
NMR prediction using HOSE codes is done per atom. For each atom in the molecule to predict we want to find atoms in the reference data set which have an environment as similar as possible to the atom in question. This is done by generating the HOSE codes for all atoms in the reference dataset for a sufficiently high number of spheres (typically six are used). We also generate the HOSE code with the maxiumum number of spheres of the atom to predict a shift for. If an with this HOSE code is found in the reference data set, the shifts assigned to this atom is the prediction. If more than one shift exists, the mean of all shifts is taken as the prediction. If none is found, the number of spheres used is reduced by one and the search done again. As soon as a shift is found, the shift is used and the prediction for this atom is finished.
Clearly the higher the number of spheres used the better the prediction. Therefore, the number of spheres used is a confidence measure, with six meaning a very good prediction and one a very weak one. In theory a zero-sphere prediction, where the element and neighbour count of the central atom are the only information used is the lowest level possible, but this is too unspecific to be of any use. Furthermore, a reference data set not having a one-sphere HOSE code for all possible cases is probably not good for the intended purpose. HOSE codes do not encode stereochmistry.
The use of HOSE codes for chemical shift prediction has been demonstrated e.g. in [4,2]. It is easily possible to include solvents in the prediction by only using shift values from spectra with a certain solvent. The prediction is deterministic, which makes it possible to exactly trace back how a prediction was composed. This is useful since it makes finding errors in the database possible.
For this paper, we have used the HOSE code prediction of the nmrshiftdb2 code, which in turn uses the Chemistry Developement Kit (CDK) [8] for the generation of the HOSE codes. In nmrshiftdb2 molecules are written to a relational database, with an entry for each atom in a table. This table also includes the HOSE code for the atom. Chemical shifts are listed in a separate table and linked to atoms. We can therefore search in the database for all shifts linked to an atom which has a certain HOSE code. If the fallback to lower spheres is needed, we search for shifts linked to atoms the HOSE code of which start with a certain character sequence. The code of nmrshiftdb2 is available at https://sourceforge.net/p/nmrshiftdb2/code/HEAD/tree/trunk/ nmrshiftdb2/.

DFT Methods
For ab initio calculations we followed the best practices outlined in [5] and [9]. Briefly we perform molecular conformation search using Macromodel [7] identifying the most probable conformers by energy. Each conformation then undergoes DFT-based geometry optimization at the B3LYP / 6-31+G(d,p) level of theory via Gaussian16 vA03 [3]. DFT-based gauge-independent atomic orbitals are used to calculate isotropic shielding values for each conformation at the mPW1PW91 / 6-311+G(2d,p) level of theory with a PCM solvent model with chloroform solvent (again with Gaussian16 vA03). Isotropic shielding values are Boltzmann-weighted via the calculated molecular energies from this last phase, and the weighted average shielding values are calculated. Following the procedure in [5], we then convert calculated isotropic shieldings to chemical shift values via a linear fit to the experimentally observed values, with two modifications: we use a linear fit method robust to outliers (RANSAC, from Scikit-Learn [6]) and we fit to all the data, as opposed to separate train-test sets. This results in a potential over-estimate of the accuracy of DFT, but this is fine for our purposes of comparison. The full inputs and outputs of all experiments are provided in the supplementary data section.