An automated framework for NMR chemical shift calculations of small organic molecules

When using nuclear magnetic resonance (NMR) to assist in chemical identification in complex samples, researchers commonly rely on databases for chemical shift spectra. However, authentic standards are typically depended upon to build libraries experimentally. Considering complex biological samples, such as blood and soil, the entirety of NMR spectra required for all possible compounds would be infeasible to ascertain due to limitations of available standards and experimental processing time. As an alternative, we introduce the in silico Chemical Library Engine (ISiCLE) NMR chemical shift module to accurately and automatically calculate NMR chemical shifts of small organic molecules through use of quantum chemical calculations. ISiCLE performs density functional theory (DFT)-based calculations for predicting chemical properties—specifically NMR chemical shifts in this manuscript—via the open source, high-performance computational chemistry software, NWChem. ISiCLE calculates the NMR chemical shifts of sets of molecules using any available combination of DFT method, solvent, and NMR-active nuclei, using both user-selected reference compounds and/or linear regression methods. Calculated NMR chemical shifts are provided to the user for each molecule, along with comparisons with respect to a number of metrics commonly used in the literature. Here, we demonstrate ISiCLE using a set of 312 molecules, ranging in size up to 90 carbon atoms. For each, calculation of NMR chemical shifts have been performed with 8 different levels of DFT theory, and with solvation effects using the implicit solvent Conductor-like Screening Model. The DFT method dependence of the calculated chemical shifts have been systematically investigated through benchmarking and subsequently compared to experimental data available in the literature. Furthermore, ISiCLE has been applied to a set of 80 methylcyclohexane conformers, combined via Boltzmann weighting and compared to experimental values. We demonstrate that our protocol shows promise in the automation of chemical shift calculations and, ultimately, the expansion of chemical shift libraries. Electronic supplementary material The online version of this article (10.1186/s13321-018-0305-8) contains supplementary material, which is available to authorized users.

of biologically important molecules within all of chemical space-greater than 10 60 compounds [4]-remains a grand challenge in biology. The prevention and treatment of metabolic diseases, determining the interactions between plant and soil microbial communities, and uncovering the building blocks that led to abiogenesis will all strongly depend on confidently identifying small molecules, and thus understanding the mechanisms involved in the complex processes of metabolic networks [5][6][7]. The current gold standard for chemical identification requires matching chemical features to those measured from an authentic chemical standard. However, this is not the case with the vast majority of molecules. For example, only 17% of compounds found in the Human Metabolome Database (HMDB) and less than 1% of compounds found in exposure chemical databases like the U.S. Environmental Protection Agency (EPA) Distributed Structure-Searchable Toxicity (DSSTox) Database [8] can be purchased in pure form [9,10]. Although analytical techniques like nuclear magnetic resonance (NMR) spectroscopy [11][12][13] and mass spectrometry (MS) [14][15][16] have been applied for the identification of metabolites and to build libraries [17][18][19][20][21], determining the complete composition of entire metabolomes is still non-trivial for both technical and economic reasons. In this regard, libraries constructed of experimentally obtained data are too limited, expensive, and slow to build, even for libraries with thousands of metabolites [22][23][24][25].
The most practical approach expand reference libraries for comprehensive identification of compounds detected in metabolomics studies is through in silico calculation of molecular attributes. Molecular properties that can be both accurately predicted computationally and consistently measured experimentally may be used in "standards free" metabolomics identification approaches. The metabolomics community has made many advances in calculations of measurable chemical attributes, such as chromatographic retention time [26,27], tandem mass spectra [28][29][30], ion mobility collision cross section [31,32], and NMR chemical shifts [33]. Recently, high throughput computation of chemical properties has been demonstrated using machine learning approaches [34][35][36][37]. These tools are a good resource for the metabolomics community, however, machine learning methods are limited by the size and scope of the initial training set, and thus ultimately limited by the number of authentic chemical standards available for purchase. In contrast, structure-based approaches, utilizing first principles of quantum chemical calculations, leverage our understanding of the underlying chemistry and physics to directly predict chemical properties of any chemically valid molecule. Thus, quantum chemical calculations enable us to overcome the reliance on authentic chemical standards in metabolomics. In this study, we focus on expanding the utility of density functional theory (DFT), a widely used electronic structure approach, which has been applied to predict NMR chemical shifts [38][39][40][41]. DFT enables examination of molecular conformers [42][43][44][45] and allows custom solvent conditions [46][47][48]. Ultimately, computational modeling can be used in the rapid identification and study of thousands of metabolites, culminating in in silico metabolome libraries of multiple chemical properties. Furthermore, the same tools that can be used to aid identification of small molecules in complex samples can also be used for structure confirmation and correction. For example, we recently used the tool described in this manuscript to help correct the misidentification of the isoflavonoid wrightiadione to the actual structure as an isobaric isostere, the alkaloid tryptanthrin [49].
Metabolomics researchers unfamiliar with DFT or similar calculations may find the application of quantum chemical calculations complicated or challenging to apply quickly, and thus avoid these techniques. To this end, and to help bring DFT calculations to large sets of small organic molecules relevant to the mainstream metabolomics community, we have developed a Python-based workflow and analysis package, the ISiCLE (in silico Chemical Library Engine) NMR chemical shift module employs DFT methods through use of NWChem [50], a high-performance quantum chemistry software package developed at Pacific Northwest National Laboratory (PNNL). The module automates calculations of NMR chemical shifts, including solvent effects, via the COnductor-like Screening Model (COSMO) [51] of userspecified NMR-active nuclei for a given set of molecules for multiple DFT methods. ISiCLE also calculates the corresponding errors if experimental values are available. In this paper, we describe ISiCLE's NMR module, provide a working tutorial example, demonstrate its use through the calculation of chemical shifts for a large set of small molecules, and, finally, show how ISiCLE can be applied to rapidly calculate chemical shifts of arrays of Boltzmann-weighted conformers to yield high accuracy chemical shift calculations.

In silico Chemical Library Engine (ISiCLE)-NMR module
ISiCLE is a Python module that provides straightforward automation of DFT using NWChem, an open source, high-performance computational quantum chemistry package, developed at Pacific Northwest National Laboratory (PNNL), for geometry optimization and chemical shift and solvent effect calculations. Figure 1 shows a schematic representation of ISiCLE. For typical use, ISiCLE requires only a list of molecules and a list of desired levels of DFT theory from the user. For more advanced use cases, users may adjust NWChem parameters by modifying the provided .nw template file.
Here, we describe each step of a typical ISiCLE run (see Fig. 2 for a general workflow for using the ISiCLE NMR module).
To start, users must prepare File A, containing a list of molecules, and File B, containing a list of DFT combinations, which both are required to be in Excel format (.xls or .xlsx). File A must contain all input molecules either as (i) International Union of Pure and Applied Chemistry (IUPAC) International Chemical Identifier (InChI) strings [52,53] or (ii) XYZ files, a free-format text file having XYZ coordinates of atoms. In subsequent versions, alternative file formats will be supported, such as TSV for inputs and outputs.
Once prepared, the user runs ISiCLE. First, ISiCLE opens File A for the input molecules. OpenBabel, an open-source chemical informatics toolbox available with Python wrappers [54,55], is called to generate geometry files. For InChI inputs, OpenBabel generates .xyz files for each molecule, unless .xyz files are provided, and converts InChI to InChIKey for naming files (otherwise, the base names of XYZ files are used for naming subsequent files). Next, OpenBabel applies the Merck molecular force field (MMFF94) [56] to generate a rough threedimensional (3D) structure for each molecule, resulting in associated .mol files. ISiCLE then prepares NWChem input files based on the specified DFT methods, solvents, shielding parameters and regarding task directives given by the user-prepared File B. Finally, ISiCLE submits the appropriate files to, if relevant, a remote NWChem installation (typically on a non-local, networked, highperformance computer), and then retrieves the output files once the calculations are complete. Additional information and further details about ISiCLE is provided in Additional file 1 (S1). Note that future versions of ISiCLE will automatically generate conformers of a given molecule, as part of the seamless pipeline.
For each molecule, ISiCLE generates MDL Molfiles (.mol) [57] that contain isotropic shieldings and NMR chemical shifts. ISiCLE exports isotropic shieldings for each molecule and appends them to a MDL Molfile in the same atomic order of the original XYZ files. Then, ISiCLE converts isotropic shieldings to NMR chemical shifts by subtracting the isotropic shielding constants for the specified nuclei of the molecule of interest from those of a reference compound computed at the same level of theory (Eq. 1). For this manuscript, tetramethylsilane (TMS) is used as a reference compound. The experimental chemical shifts of TMS are assigned a value of zero, thus the calculation of NMR chemical shifts needs only isotropic shieldings of TMS [58][59][60][61]. Any molecule can be used as reference in ISiCLE as long as it has the specified nuclei and its experimental (or calculated) chemical shifts are supplied. It is explained in detail in a supplemental tutorial how a user inputs experimental data. The equation for calculating chemical shifts from isotropic shieldings is:  ISiCLE also calculates errors in NMR chemical shifts if experimental data is provided in the MDL Molfiles in a required way as explained in the tutorial. The errors are quantified in terms of mean absolute error (MAE) (Eq. 2), corrected mean absolute error (CMAE) (Eq. 3), root mean square error (RMSE) (Eq. 4), and maximum absolute error (Eq. 5).
where N is the total number of chemical shifts, and δ calc and δ exp are the lists of calculated and experimental chemical shits, respectively.
Empirical scaling of isotropic shieldings or NMR chemical shifts is the most common approach to remove systematic errors. If experimental data is provided, ISiCLE uses two optional approaches for its linear regression method, where slope and intercept values are derived from (i) regression of computed NMR chemical shifts versus experimental NMR chemical shifts using (Eq. 6), and/or (ii) regression of computed isotropic shieldings versus experimental NMR chemical shifts using (Eq. 7).
where σ calc is the list of isotopic shielding constants of molecules. (1) Alternatively, if the user does not provide experimental NMR chemical shifts, ISiCLE can scale NMR chemical shifts using provided intercept and slope values. The scaled NMR chemical shifts are appended to MDL Molfiles.
A detailed description of InChIs and InChIKeys, and why they were chosen, can be found in the Additional file 1 (S2). Similarly, justification for the use of MDL Molfiles is explained in Additional file 1 (S2). In the next version, ISiCLE will be compatible with other file formats, such as the NMReDATA [62] format that has been recently designed for NMR data use. To help ease the use of our data, we provide NMReDATA files for the demonstration set in the Additional file 2.
Furthermore, installation details for OpenBabel and other required Python packages are provided in the tutorial (see Additional file 2). The Windows-based tutorial provides step-by-step instructions for running ISiCLE for the first time, including information for installation of packages, properly preparing input files, running a calculation, and obtaining output files. The tutorial includes example molecules with anticipated output files for use as a practice set and for benchmarking purposes. It is designed to guide users of ISiCLE and NWChem in the use of the input files and scripts, demonstrated using three small molecules: methanol, methyl-isothiocyanate, and nitromethane. Calculation time may vary (depending on network speed, local computational power, etc.), but it is expected to take less than 10 min.

Demonstration set
For an initial demonstration of ISiCLE, we have compiled a molecule set of 312 compounds from previous studies: Alver [ [93]. We aimed to cover a broad chemical space and distribution of sizes. Our criteria also included the existence of all 1 H and/or 13 C NMR experimental data in chloroform solvent, referenced to TMS at room temperature, for comparisons. Note that the NMR spectra of each molecule set were not recorded at the same magnetic field strengths. A summary of the demonstration set compounds are given in Table 1. Detailed information about the individual sets is given in Additional file 1 (S3).
For a second demonstration of ISiCLE, the NMR chemical shifts, along with frequency calculations (and subsequent Boltzmann weighting), two sets of axial and equatorial conformers (40 conformers each) of methylcyclohexane were processed. We performed in vacuo molecular dynamics (MD) simulations, using the sander MD software program from AmberTools (version 14) [106], to generate 80 conformers of the methylcyclohexane compound. These conformers were generated in four stages. First, the initial geometries of axial and equatorial conformers were taken from the study of Willoughby et al. [92]. Second, a short energy minimization run was performed to relax the initial structure and to remove any non-physical atom contacts. Third, a short 50 ps MD run was performed (in 0.5 fs time steps) to heat the structure from 0 to 300 K, without non-bonded cutoffs. In the fourth step, we performed 8 simulated annealing cycles, where each cycle was run for 1600 ps in 1 fs MD steps with the following temperature profile: heating from 300 to 600 K (0-300 ps), equilibration at 600 K (300-800 ps), cooling from 600 to 300 K (800-1100 ps), and equilibration at 300 K (1100-1600 ps). Ten conformers from the equilibration stage at 300 K, of each simulated annealing cycle, were randomly selected to obtain the 80 conformers. After the conformers were obtained, M06-2X was used with the basis set of 6-31 + G(d,p) for the geometry optimization and frequency calculations and B3LYP with 6-311 + G(2d,p) method for the calculations of NMR chemical shifts. Relative free energies of the conformations and Boltzmann weighted NMR chemical shifts were compared to those found in the literature [92,[107][108][109].
All results shown in this manuscript were generated using the Cascade high-performance computer (1440 compute nodes, 23,040 Intel Xeon E5-2670 processor cores, 195,840 Intel Xeon Phi 5110P coprocessor cores, and 128 GB memory per compute node [110]), in EMSL (a U.S. national scientific user facility) located at PNNL. Cascade is available for external users through a free, competitive proposal process. ISiCLE can utilize local clusters or high-performance computing resources available to the user. NWChem is freely available and can be downloaded from the website [111,112].

Results and discussion
NMR chemical shift calculations have been used successfully to identify new molecules, determine metabolite identifications, and eliminate structural misassignments [59,113]. In the last two decades, many research groups have performed benchmark DFT studies on the accuracy of optimized molecular geometry [92,[114][115][116], functionals [117,118], basis sets [88,119], and solvation models [90,120,121] for NMR chemical shifts [60,[122][123][124]. Each group uses a molecule set focusing on a unique chemical class [78,[125][126][127][128][129] and several groups have recommended different exchange-correlation (XC) energy functionals with a different basis set for a particular condition or suitable to specific chemical functionalities and properties [70,77,[130][131][132][133]. The prevailing opinion is that reliable isotropic NMR chemical shifts strongly depend on accurate calculations of molecular geometries and inclusion of HF exchange in selected DFT methods, to an extent [134,135]. On the other hand, the size of the basis set does not increase the accuracy after a point [136,137].
The ISiCLE software can be installed locally. As seen in Fig. 1, it requires only two input files, prepared in Excel: a sequence of InChI or XYZ molecule geometry files, and a sequence of DFT methods of the user's choice. Preparation of NWChem "run files, " 3D molecule geometry files, and/or Linux/Unix shell script "drivers" are not required. As output, ISiCLE prints isotropic shielding, calculated by NWChem, and calculated chemical shifts with respect to a reference molecule and/or application of a userspecified linear regression technique. ISiCLE is a promising tool contributing to standards-free metabolomics, which depends on the ability to calculate properties for thousands of molecules and their associated conformers.

Application 1: chemical shift calculations for a demonstration set of molecules
To test ISiCLE, we generated a set of 312 molecules. This set is large relative to other metabolomic molecule sets found in the literature, which in our literature survey averaged 34 molecules (Table 1). Our molecule set ranges from small-to large-sized molecules (number of carbon atoms ranging from 1 to 90), and experimental 13 C and 1 H NMR data in chloroform were available for each of them. Our set also spans a wide array of chemical classes, including acetylides, alkaloids, benzenoids, hydrocarbons, lipids, organohalogens, and organic nitrogen and oxygen compounds. ISiCLE was used to successfully perform DFT calculations for this set under chloroform solvation using eight different levels of DFT theory (4 different functionals and 2 basis sets for 13 C and 1 H).
A total of 2494 carbon nuclei and of 3127 hydrogen nuclei were calculated for all 312 molecules of the demonstration set and compared with experimental data. Deviation bars indicating MAE and MAXAE are plotted for each method in Fig. 3. For both 13 C and 1 H NMR chemical shifts, the MAE of each method with cc-pVTZ is higher than those with cc-pVDZ. For 13 C, the MAE of each method with cc-pVTZ (7-10 ppm) is higher than those with cc-pVDZ (5-6 ppm). MAE of methods with a larger basis set deviate more compared to those with a smaller basis set. The smallest deviations are observed for B3LYP and B35LYP, both in MAE and MAXAE results. The same situation is observed for 1 H NMR chemical shifts as well: MAE of each method with cc-pVTZ (~ 0.35 ppm) is higher than those with cc-pVDZ (~ 0.30 ppm). In contrast to 13 C NMR chemical shifts, 1 H NMR chemical shifts are better predicted with methods using larger basis sets (cc-pVTZ). Although the error differences among each method may be too low to confidently identify the outperforming method, B3LYP/ cc-pVDZ is the most successful combination in the calculation of 13 C and 1 H NMR chemical shifts for our application shown here. Figure 4 shows computational costs of DFT combinations for the demonstration set. We found that the smaller basis set (cc-pVDZ) in the calculation of both   13 C and 1 H NMR chemical shifts was an acceptable compromise between accuracy and computational performance, compared with the larger cc-pVTZ basis. This finding is similar to a recent benchmark study [138] that showed B3LYP/cc-pVDZ is a reliable combination, balancing accuracy with computational cost in 13 C chemical shifts calculation. The larger basis set (cc-pVTZ) took 2-3 times longer to complete than cc-pVDZ (in terms of total CPU time). The computational times of the isotropic shielding and chemical shift calculations for this demonstration set are given in the file of Demonstration-Set_CPUtimes.xlsx in the Additional file 2.

Effect of scaling by linear regression
We performed the most general approach to error reduction, empirical scaling. Our molecule set has 1554 and 1830 experimental 13 C and 1 H NMR chemical shifts, respectively. It provides confidence for applying linear regression effectively as it reduces the possibility of overfitting. Empirical scaling was applied to the data obtained with the best combination, B3LYP/cc-pVDZ, using two different relationships: computed shifts versus experimental chemical shifts (Eq. 6), and computed isotropic shieldings versus experimental shifts (Eq. 7). Once the empirical scaling was applied, the accuracy for 13 C chemical shifts and 1 H chemical shifts improved by 0.7 and 0.11 ppm, respectively. Our computed NMR chemical shifts and shieldings deviate from unity (desired slope = 1) by 0.02 for both 13 C and 1 H NMR chemical shifts. Linear fits with correlation coefficients of 0.99 (Fig. 5a, b) and 0.93 (Fig. 5c, d) for 13 C and 1 H NMR chemical shifts, respectively, were observed, which also shows that B3LYP/cc-pVDZ is able to produce data free from random error. Results of linear regression to the 13 C and 1 H NMR chemical shifts obtained by other DFT methods are given in the Additional file 1 (S5).

Detailed look at 13 C NMR chemical shifts
The carbon ( 13 C) magnetic shieldings and chemical shifts derived from the various DFT methods are highly correlated, as shown by a correlation coefficient of 0.99 (Fig. 5). The inclusion of a scaling factor enhances the performance of theoretical calculations with B3LYP/cc-pVDZ//B3LYP/6-31G(d) and decreases the MAE in 13 C NMR chemical shifts for this set by approximately 13%.
There has been a trend toward using multiple references, such that each molecule should be referenced to a molecule with similar properties to improve accuracy of NMR chemical shifts [123,134,139]. Sarotti et al. examined the influence of the reference compound used in the 13 C [83] and 1 H [82] NMR chemical shift calculations over a set of organic compounds, all of which were included in our calculations. They recommended the use of benzene and methanol as a reference standard in the calculations of chemical shifts of sp-sp 2 -and sp 3 -hybridized carbon atoms, respectively, instead of TMS for all type of carbon atoms [140]. Propelled by the discussion in the study of Grimblat et al. [141] about the distribution of the errors observed in sp 2 -and sp 3 -carbons, we determined the distribution of the data of chemical shifts of sp 2 -(933 carbons) and sp 3 -(745 carbons) hybridized carbons (Fig. 6, the sp 2 -and sp 3 -derived series of carbons show two separate chemical shift distributions and two separate error distributions over a much larger variety of compounds than Grimblat et al. For our demonstration set, both errors between calculated and experimental sp 2 -and sp 3 -chemical shifts more closely resemble a Student's t-distribution [58,142], rather than a normal distribution [138,143]. The correlation coefficients of the errors of sp 2 -and sp 3 -carbons are 0.93 and 0.78, and 0.98 and 0.95 for Student's t-distribution and normal distribution, respectively. Furthermore, we looked for the bonded neighbors of each carbon and hydrogen extensively in Fig. 7. For carbon shifts, error was measured for carbon (n = 1709), chlorine (n = 149), fluorine (n = 8), hydrogen (n = 1161), nitrogen (n = 199), oxygen (n = 251) and sulfur (n = 20) attachments. The largest deviations occur in carbonchlorine and carbon-sulfur attachments with MAEs of 11.2 and 5.8 ppm and MAXAE 39.7 and 16.5 ppm, respectively. The study by Li et al. [76], which used a set of chlorinated carbons, reports the same conclusion: calculation accuracy decreases as the size of the basis set used increases, but improvement was obtained after linear regression corrections for B3LYP/6-31 + G(d,p) with slope of 0.98. Other than chlorine and sulfur, carbonhydrogen attachments also make the 13 C NMR chemical shift DFT calculations deviate significantly from experimental values, with MAE of 4.7 and 3.9 ppm and with MAXAE of 51.6 and 34.9 ppm. Carbon was found in rings in 70% of the cases, and these carbons show a MAE of 4.6 ppm. Also, the MAE of 13 C NMR chemical shift is 3.9 ppm for carbons bonded to a hydrogen atom but reaches 9.7 ppm in all other cases.
Oxygen and nitrogen attachments to carbon led to 13 C NMR chemical shifts with MAE up to 3.1 and 4.2 ppm and MAXAE of 32.5 and 23.6 ppm, respectively. Interestingly, half the C-O attachments found in ring-form had chemical shifts with a MAE of 2.8 ppm, compared to the chemical shifts of C-O attachments not found in a ring, which had a relatively higher MAE of 3.2 ppm, leading to a percent difference of 14.1%. NMR chemical shifts of C-N attachments, present in a ring or not, show close MAE of 4.19 and 4.26 ppm, respectively, a percent difference of 1.6%. This is to be expected, since C-O attachments are expected to show some deviation in chemical shift due to the polarization of the electron distribution caused by the high electronegativity of oxygen, while nitrogen atoms have a lower electronegativity, leading to a lower deviation in C-N chemical shifts.
The correlation plot (Fig. 5a, b) shows a linear pattern with only minor deviations of the predicted 13 C shieldings or 13 C chemical shifts from the fitted line. It is verified by the correlation coefficient of 0.99, as observed in previous studies [130,144], that the deviation of the slope from unity within the range of 0.95 and 1.05 is an indicator of a reliable method. However, when placed in subgroups of different attachment types, distant outliers are observed, with some more than 15 ppm away (Fig. 7a). Most outliers are observed in the C-C and C-H attachments, respectively. C-Cl and C-N chemical shifts have a high occurrence of outliers, which may be due to the chemical properties of chlorine and nitrogen such as being remarkably close to first ionization energies. We suspected that some cases of high calculation errors could be due to the consideration of only a single conformer. For the highest accuracy, proper conformational sampling must be considered, as demonstrated below in "Application 2: Boltzmann-weighted NMR chemical shifts of methylcyclohexane" section  Fig. 7b, error bars are shown for 1 H chemical shifts when the hydrogen attaches to carbon (n = 1793), H-O attachments are highly sensitive (to concentration, solvent, temperature, etc.), and it is non-trivial to determine the NMR chemical shift value of arbitrary protons experimentally as well as predict them by using a single, "catch all" DFT method, which explains the relatively low correlation coefficient of 0.93 ( Fig. 5c-d). For future studies, we may need to consider the use of different DFT methods, including the use of explicit solvation, particularly in the calculation of 1 H NMR chemical shifts in the presence of H-O attachments.
Application of empirical scaling to functional groups are given in detail in the Additional file 1 (see S7 and S8).

Cross-validation
As a final assessment for the data collected with the demonstration set, we assessed the stability and accuracy of the linear regression approach using cross-validation [145]. Cross-validation is a technique mostly used in prediction problems to evaluate how much a given model generalizes to an independent set of data. Specifically, we performed Monte Carlo cross-validation [146,147]. The procedure of application of Monte Carlo cross-validation method is explained in Additional file 1 (S9).
We observed that the estimated linear model parameters (i.e. slope and intercept) from the training set do not differ from that of the entire set. Therefore, the predictive linear model is stable to be accurately estimated and the subsets of 13 C and 1 H NMR chemical shifts generalize well to the groups that are not represented in the training fold.

Application 2: Boltzmann-weighted NMR chemical shifts of methylcyclohexane
Metabolites were experimentally interrogated using solution-state NMR, where the observed signal arises from the combined signals of present conformers. It is routine that NMR chemical shifts calculations are carried out on a single dominant conformer. However, it is well known that metabolites do not comprise a single conformer in solution and are instead found in a collection of various conformers [148], and the accuracy of NMR chemical shifts heavily depends on molecular geometries and conformation consideration [46]. It has been shown that for the highest accuracy NMR chemical shift calculations, consideration of conformers is critical, even for relatively small molecules [149]. As a second demonstration of ISi-CLE, a conformational analysis based on DFT was performed on a set of 80 conformers of methylcyclohexane using a Boltzmann distribution technique. Boltzmann weighting determines the fractional population of each conformer based on its energy level [92]. High-throughput and straightforward DFT-based NMR chemical shift calculations of all 80 methylcyclohexane conformers was performed by ISiCLE then compared to experimental values.
It has been shown by Willoughby et al. [92] that the effects of molecular flexibility on NMR chemical shifts can be captured by Boltzmann weighting analysis, as demonstrated with methylcyclohexane (Fig. 8). Methylcyclohexane is a well-studied small molecule [150][151][152][153][154]. It is flexible, composed of a single methyl group attached to a six-membered ring, and known to exist as an assembly of two chair conformers. There are two distinct conformations of which chair-chair interconversion is Fig. 8 The a equatorial and b axial structures of methylcyclohexane rapid and dominated by equatorial to axial conformation. We weighted 40 axial and 40 equatorial conformers in chloroform and obtained a relative free energy of 1. 99 [109]) values. We compared the Boltzmann-weighted 1 H and 13 C chemical shifts (a ratio of 3% axial to 97% equatorial) to experimental values reported by Willoughby et al. [92]. The Boltzmann-weighted, scaled MAE was 0.017 ppm for 1 H chemical shifts ( δ exp = 1.00 × δ comp + ∼ 0.00 ), similar to the experimental value of 0.018 ppm in the study of Willoughby et al. Also, the MAE for 13 C chemical shifts was 4.4 ppm and decreases to 0.8 ppm when the chemical shifts were scaled ( δ exp = 0.99 × δ comp + 0.13 ) (Fig. 9). Further details can be found in the Additional file 1 (S10).

Conclusions
We introduce the first release of ISiCLE, which predicts NMR chemical shifts of any given set of molecules relevant to metabolomics for a given set of DFT techniques. ISiCLE calculates the unscaled or scaled NMR chemical shifts (depending on the user's choice of DFT method) and writes the data to appended MDL Molfiles. It also quantifies the error in calculated NMR chemical shifts if the user provides experimental values.
The functionality of ISiCLE is demonstrated on a molecule set consisting of 312 molecules, with experimental chemical shifts reported in chloroform solvent. 1H and 13C NMR chemical shifts were calculated using 8 different levels of DFT (BLYP, B3LYP, B35LYP, BHLYP and, cc-pVDZ, and ccpVTZ), referenced to TMS in chloroform by carrying initial geometry optimizations out at B3LYP/6-31G(d) for all molecules. The optimal combination for this set was found to be B3LYP/cc-pVDZ// B3LYP/6-31G(d) with mean absolute error of 0.33 and 3.93 ppm for proton and carbon chemical shifts, respectively. We show that DFT calculations followed by linear scaling do in fact provide an analytically useful degree of accuracy and reliability. Finally, we used ISiCLE for the calculation of NMR chemical shifts of 80 Boltzmannweighted conformers of methylcyclohexane and compared our results with earlier studies in the literature.
ISiCLE is a promising automated framework for accurate NMR chemical shift calculations of small organic molecules. Through this tool, we hope to expand chemical shift libraries, without the need for chemical standards run in the laboratory, which could lead to significantly more identifiable metabolites. Future work includes wrapping individual steps of the ISiCLE NMR module into a formal workflow management system such as Snakemake, to include better fault tolerance, modularization, and improved data provenance. Furthermore, additional chemical properties will be included, such as ion mobility collision cross section and infrared spectra. Finally, ISiCLE will be adapted to run seamlessly on cloud computer resources such as Amazon AWS, Microsoft Azure, and Google Cloud. ISiCLE is a promising tool contributing to standards-free metabolomics, which depends on the ability to calculate properties for thousands of molecules and their associated conformers.

Additional files
Additional file 1. Supporting information document.
Additional file 2. Supporting information files. Including tutorial, code, and all other files.