Reliable and accurate prediction of basic pK\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_a$$\end{document}a values in nitrogen compounds: the pK\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_a$$\end{document}a shift in supramolecular systems as a case study

This article presents a quantitative structure–activity relationship (QSAR) approach for predicting the acid dissociation constant (pK\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_a$$\end{document}a) of nitrogenous compounds, including those within supramolecular complexes based on cucurbiturils. The model combines low-cost quantum mechanical calculations with QSAR methodology and linear regressions to achieve accurate predictions for a broad range of nitrogen-containing compounds. The model was developed using a diverse dataset of 130 nitrogenous compounds and exhibits excellent predictive performance, with a high coefficient of determination (R\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^2$$\end{document}2) of 0.9905, low standard error (s) of 0.3066, and high Fisher statistic (F) of 2142. The model outperforms existing methods, such as Chemaxon software and previous studies, in terms of accuracy and its ability to handle heterogeneous datasets. External validation on pharmaceutical ingredients, dyes, and supramolecular complexes based on cucurbiturils confirms the reliability of the model. To enhance usability, a script-like tool has been developed, providing a streamlined process for users to access the model. This study represents a significant advancement in pK\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_a$$\end{document}a prediction, offering valuable insights for drug design and supramolecular system optimization. Graphical Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s13321-023-00763-3.


Introduction
The concepts of acidity and basicity are fundamental to the understanding of chemistry and have been defined by various theories throughout history [2,18,19,54,55].One such theory was introduced by Svante Arrhenius in 1887 [2], who suggested that certain compounds can dissociate into ions in solution and identified acids as those that yield a proton (H + ) and bases as those that yield a hydroxide ion (OH − ).Another influential theory, the General Acid-Base Theory of Brönsted and Lowry [18,19,55], emerged in 1923, defining acidity and basicity in terms of the tendency to donate or accept a H + .
Understanding the strength of a base is crucial in comprehending its behavior and acidity plays a pivotal role in this regard.The strength of a base is commonly expressed by considering the strength of its conjugated acid, with a weaker conjugated acid indicating a stronger base.The acidity constant (K a ), which represents the equilibrium constant for the reaction between the acid and water, is used to quantitatively assess the strength of a base.For practical purposes, the pK a value, defined as the negative logarithm of the K a , is commonly employed [40].The pK a value is an essential tool that serves as an indicator of the relative acidity or basicity of a compound and enables predictions of its protonation state or protomeric forms under different pH conditions [40].Therefore, accurate determination of pK a holds immense significance across diverse fields, including medicinal chemistry [22,29,41,60], biochemistry [11,23,62,68], environmental science [12,15,48,50], chemistry of dyes [44,65,93] and supramolecular chemistry [10,56,88].
Recent research has focused on the phenomenon of supramolecular pK a shift, which involves a significant shift in the pK a value of nitrogenous compounds by forming supramolecular complexes with macrocyclic molecules such as cucurbiturils [7,8,14,59,64,95].This phenomenon is crucial for designing and optimizing supramolecular systems, with far-reaching implications in materials science [92,98], catalysis [3,28,77] and the development of drugs and their delivery methods [26,31,37,49,75].
In the pharmaceutical industry, accurate pK a values play a critical role in the design process of new drugs, as the acid/base character of a substance defines its biopharmaceutical properties, which have a direct impact on the pharmaceutical formulation of the drug [60,61].Nitrogen-containing heterocycles are of particular interest in the pharmaceutical industry due to their diverse biological activities [52,53], with over 75% of FDA-approved drugs containing such structures [47].The pK a values of these heterocycles provide information on the absorption, distribution, metabolism, excretion, and toxicity (ADMET) of the drug, which are crucial for the design process.Poor ADMET properties have been revealed as the cause for high attrition in the development phase [84].
Experimental methodologies for pK a determination have been extensively reviewed [74], but there are still samples that are difficult or impossible to measure accurately.To overcome this challenge, computational approaches have emerged as a promising alternative, as they can simulate virtually any set of working conditions without requiring physical samples [81,83,94].Consequently, significant efforts have been directed toward developing accurate and reliable computational methods for predicting pK a values.
Among the most common computational approaches are First Principles and quantitative structure-activity relationship (QSAR) methods [81].However, the accuracy of First Principles calculations in predicting pK a values relies heavily on the precise determination of the Gibbs free energy difference in solution, which poses a significant challenge [27,73,81].This is mainly due to the difficulty in calculating the Gibbs free energy of the proton and solvation energies, which can lead to deviations in pK a values of up to 3 units [13,73].One way to address these systematic errors is by using the relative pK a approach [73,81], which has demonstrated high accuracy and effectiveness in various solvents [83].Nevertheless, this approach is limited by the availability of accurate pK a values for reference systems with structural similarity to the sample.
QSAR is a time-efficient and computationally less costly approach that predicts physical properties by constructing a multiple linear regression equation for a specific physical property as a function (P) of the selected molecular descriptors ( X i ).The equation, represented as Eq. 1, assigns numerical coefficients ( a i ) to each molec- ular descriptor, which serve as weighting factors to determine the respective contributions of the predictor variables [25,81].
Although QSAR models are less costly than First Principles, traditional QSAR methods have been hindered by lengthy calculation times, especially when quantummechanical electronic descriptors are involved, particularly in large molecules.As a result, the prediction of pK a has been impractical.However, the B97-3c method, based on density functional theory (DFT), has recently emerged as a reliable and low-cost solution [16].This method effectively reduces computational time, thereby presenting a viable option for predicting pK a values in large molecules and supramolecular complexes.
QSAR models have extensively employed a wide range of descriptors to predict crucial properties, including pK a values [36,42,45,46,70,76,79,81,90].Some of these descriptors include charge [39,45,79], electronic energy differences ( E) [5,6,42,79,97], and the highest occupied molecular orbital energy ( ǫ HOMO ) [80,86].Despite the flexibility in choosing chemical descriptors and their (1) combinations, reported QSAR models are limited to specific datasets or structures, resulting in varying levels of accuracy and model performance across different subsets of data.Such is the case of QSAR models for predicting pK a values [96], which show diminished accuracy when predicting pK a of basic compounds such as nitrogenous compounds [78].These discrepancies may arise due to the lack of appropriate relationship or representation between the selected descriptors and the structural diversity present in the complete dataset.Consequently, developing reliable QSAR models for predicting pK a values in heterogeneous datasets has proven to be a persistent challenge.To improve the generalizability of QSAR models and achieve higher precision in predicting pK a values, a more careful selection of chemical descriptors that effectively capture the structural variability within the entire dataset is necessary.This may involve the use of more specific descriptors or advanced variable selection methods.
The present incapacity to efficiently and accurately predict basic pK a values from heterogeneous data highlights the need for enhanced QSAR methodologies capable of achieving high accuracy regardless of the size and structure of the compound, as well as its inclusion in supramolecular host-guest systems.In light of this challenge, we propose and validate a comprehensive QSAR approach that utilizes the B97-3C low-cost density functional theory to predict the basic pK a values of nitrogencontaining compounds in aqueous solution at 25 • C, both independently and within cucurbituril-based supramolecular complexes.This proposed approach represents a significant advancement toward predicting pK a shifts in supramolecular systems.

Results and discussion
The QSAR model By employing a comprehensive approach that integrates DFT [30], Conceptual Density Functional Theory (CDFT) [30], Molecular Electron Density Theory (MEDT) [24], and quantitative analysis of molecular surface, prediction models for estimating pK a values were evaluated through statistical analysis.The best performing model was selected based on the results of this evaluation (see "Methods" section) The selected prediction model is represented in Eq. 2. 1 The model was trained using a diverse set of 130 nitrogenous compounds, which encompassed aromatic and non-aromatic cyclic amines as well as aliphatic amines (primary, secondary, and tertiary).A coefficient of determination (R 2 ) of 0.9905 indicates an excellent fit of the data to the proposed model.The robustness of the model is further supported by a high Fisher statistic (F) of 2141.9289 and a relatively low standard error (s) of 0.3066.The root mean squared error (RMSE), 0.2982, and mean absolute error (MAE), 0.2440, provide additional evidence of the predictive accuracy of the model.
Our model (Eq.2) includes a variety of descriptors calculated in a vacuum environment and demonstrates remarkable performance when applied to a diverse set of nitrogenous compounds.The descriptors included in the model are the following: • Energy of deprotonation ( E) in kcal/mol: this descriptor measures the energy required to remove a proton from an acid.A higher E value signifies that a greater amount of energy is needed to carry out the deprotonation, which results in a higher pK a value.• HOMO-LUMO gap of deprotonation ( HL Gap ) in eV: this descriptor represents the change of energy gap between the highest occupied and lowest unoccupied orbitals of the acid-base equilibrium.A higher HL Gap suggests a less reactive base, hence a lower pK a value.• Mulliken electronegativity ( χ M ) in eV: this descrip- tor quantifies the ability of the base to donate a pair of electrons and accept a proton.A lower χ M indicates a higher basicity, which contributes to a higher pK a value.• Nonpolar surface area percentage ( %NPSA) of the base: this descriptor measures hydrophobicity and its influence on the solubility of the base in water.
Bases with a higher %NPSA tend to have lower solubility, which leads to increased stability of the conjugated acid in a polar environment compared to the base.This reduced solubility directly affects the ability of the base to donate and accept protons, thus altering the acid-base equilibrium and resulting in an increase in the pK a value.Notably, the % NPSA descriptor demonstrates one of the most significant individual correlations with the experimental pK a values, as evidenced by its correlation coefficient (r) of 0.7514 (refer to Additional file 1: Table S1).By considering such a comprehensive range of independent parameters (with a correlation between parameters ≤ |0.7744|, see Additional file 1: Table S2), our model provides valuable insights into the electronic structure, stability, solubility, hydrophobicity, and local electronic effects during proton transfer in the bases under investigation.This holistic approach contributes to the accurate prediction of pK a values and enhances our understanding of the underlying factors governing acid-base behavior in nitrogenous compounds.
In comparison with the trading software Chemaxon, our model shows a higher R 2 (0.9905 versus 0.9583), lower s (0.3066 versus 0.6346), and lower RMSE and MAE (see Additional file 1: Table S3).These results demonstrate the superiority of the presented approach over the widely used Chemaxon software.
While the cited works focused on a particular class of amines (non-heterogeneous data) and selected up to three parameters, we have expanded the predictive capabilities to a broader set of nitrogenous compounds by incorporating a comprehensive range of descriptors.This approach provides a more accurate estimation of the pK a value and greatly expands the range of applicability of our model.
Yu et al. [96] presented a comparative performance of ACD and SPARC software.While the statistical results of both software are similar to those of the present work, it is important to note that these software are commercial and employ proprietary methodologies, which may limit their accessibility and customization.A key advantage of our model is its generalizability to diverse types of amines and its ability to handle heterogeneous data, which is a common challenge in real-world applications.

External validation
The performance of our prediction model (Eq.2), in estimating pK a values, was evaluated a total of 40 compounds, which were part of an external validation dataset comprising pharmaceutical ingredients and dyes.It is important to note that the 40 compounds in the external validation dataset are not part of the training dataset of 130 compounds.Additionally, our method was tested on 6 cucurbituril-based supramolecular complexes, which were not included in either of the two previous datasets.These results are summarized in Tables 1 and 2, respectively.
According to data in Table 1, our model achieved a low external RMSE (RMSE_ext) of 0.32 and an external MAE (MAE_ext) of 0.28 when predicting pK a values for the external validation dataset of 40 pharmaceutical ingredients and dyes.Comparing these values with the corresponding metrics obtained from Chemaxon (RMSE_ext of 1.09 and a MAE_ext of 0.79), it is evident that our model outperforms Chemaxon in terms of accuracy and precision in predicting pK a values, especially when estimating pK a for supramolecular complexes, for which Chemaxon is currently incompatible.
For the dataset of 6 cucurbituril-based supramolecular complexes (Table 2), our model exhibited a slightly higher RMSE_ext of 0.54, indicating a moderate average deviation of the predicted pK a values for this dataset.Similarly, the MAE_ext value of 0.42 suggests a moderate average error of the estimated pK a values for the supramolecular complexes.

Internal validation
Furthermore, an internal validation assessment corroborates the excellent performance of our model.For internal validation within the training set of 130 compounds, we employed the "leave-one-out" cross-validation model.The small difference between the coefficient of determination (R 2 ) and the leave-one-out cross-validation correlation coefficient (Q 2 loo), indicated by the stability value of 0.0013, attests to the robustness of our model and suggests that it is not overfitted.
These results validate the effectiveness and reliability of our prediction model in estimating basic pK a values of nitrogen-containing compounds, both as isolated species and as guests in cucurbiturils complexes.Further optimization and refinement of the model may enhance its performance in predicting pK a values for diverse chemical systems, extending beyond nitrogenous compounds.

Script-like tool description
To enhance user experience with our model, we have developed a script-like tool that automates the determination of descriptors and pK a values for nitrogenous compounds.Users can easily access the estimation process through our tool by inputting the structures of the base and the conjugate acid.The tool is available at the following link: https:// github.com/ Jacks onalc azar/ Basic-pKa-Estim ation-Nitro gen-Compo unds.This streamlined process provided by our tool offers users a more convenient and efficient way to utilize our model.

Conclusion
The comprehensive QSAR approach presented in this article offers a powerful tool for rapidly and accurately predicting pK a values of nitrogenous compounds, including those within supramolecular complexes based on cucurbiturils.Our model, which combines quantum mechanical calculations and QSAR methodology, exhibits excellent predictive performance and provides valuable insights into various molecular properties relevant to proton transfer.The superiority of our approach over existing methods has been demonstrated through extensive comparisons.Furthermore, we have developed a user-friendly script-like tool that streamlines the determination of descriptors and pK a values, enhancing the accessibility and practicality of our model.This work represents a significant advancement in the field of pK a prediction and holds great potential for applications in drug discovery, supramolecular chemistry, and other related disciplines.Through further optimization and refinement, the model can extend its predictive capabilities to diverse chemical systems beyond nitrogenous compounds.

Methods
Prediction of pK a values of nitrogenous compounds using QSAR approach requires the calculation of relevant descriptors related to basicity or acidity.Chemical descriptors can be local or global, depending on whether they are related to a specific atom or group of atoms within the molecule or to the molecule as a whole.However, obtaining a comprehensive understanding of the electronic structure of a molecule and its acidity requires an advanced computational approach able to provide a detailed picture of the molecular parameters and properties relevant for predicting pK a values.In this section, we describe a comprehensive methodology for predicting basic pK a values of nitrogenous compounds.

Prediction of pK a values using chemical descriptors
We employed a multi-faceted approach, using DFT as a fundamental computational method to determine electronic properties.DFT is widely used in quantum chemistry, since it is regarded as a reliable approach to predict molecular properties and structures [20].However, DFT alone may not provide a complete picture of the molecular electronic structure.Therefore, additional post-processing techniques were employed.CDFT [30] was used to obtain a broader perspective of the molecular electronic structure by making use of global and local descriptors based on the conceptualization of electron density.Moreover, MEDT [24] and quantitative analysis of the molecular surface [58] were used to obtain properties that account for the electronic distribution of the molecule.
To obtain a comprehensive set of chemical descriptors, a combination of DFT, CDFT, MEDT, and quantitative analyses of molecular surface was used.Subsequently, a predictive model for pK a values was developed, using a diverse set The most relevant descriptors for the prediction of pK a values were identified using QSARINS software for statistical analysis [32,33].QSARINS is a powerful tool for identifying the molecular descriptors that contribute most significantly to the predictive power of the model.QSARINS uses iterative techniques to add or remove descriptors from a multivariable linear equation, based on their statistical significance, to create a set of models from which the most suited is selected.The performance of the selected model was evaluated by testing it on a set of 40 validation compounds, which were not part of the training set of 130 compounds.This set included pharmaceutical ingredients and dyes with known pK a values.

Determination of selected descriptors for pK a prediction
This section outlines the methodology for calculating the parameters involved in determining the pK a values according to Eq. 2 in the main section.The parameters include deprotonation energy ( E), HOMO-LUMO deprotonation gap ( HL Gap ), Mulliken electronegativity ( χ M ), the percentage of nonpolar surface area ( %NPSA), and the change in average local ionization energy at the nitrogen atom ( ALIE N ).
To calculate these parameters swiftly and reliably, DFT calculations were run on optimized geometries under vacuum conditions.The B97-3c low-cost Density Functional Method [16] and the ORCA software package (Program Version 5.0.3)[66] were employed for this purpose.A more detailed description of the methodology used for each parameter is provided in the following sections.

Energy of deprotonation ( E)
The electronic structure optimization and energy calculation for determining the energy of deprotonation ( E) were carried out using the ORCA software package [66] with the B97-3c low-cost functional to obtain the most stable conformation at a local minimum of the base and conjugate acid, swiftly and reliably [16].Protocol tightSCF was employed to ensure convergence.The total energy of the molecule was calculated, taking into account the Becke-Johnson dispersion damping (DFT-D3BJ) [34,35] and short-range basis incompleteness SRB correction of the basis set [16,87].The energy change during the deprotonation process of the conjugate acid was calculated as the difference between the total energy of the base and the total energy of the conjugate acid: (3) E = Total energy of base − Total energy of conjugate acid It is important to note that this methodology requires the base to have a net charge of zero and the conjugate acid tohave a net charge of + 1.If the base is an ion, it should be neutralized with its respective counterion (Cl − or Na + ).

HOMO-LUMO gap of deprotonation ( HL Gap )
The HOMO-LUMO gap (HL Gap ) is a crucial parameter for characterizing the electronic properties of a system.The HOMO and LUMO energies were automatically determined at the end of the electronic structure optimization in the previous step.Specifically, the HOMO and LUMO energies were obtained from the eigenvalues of the highest occupied and lowest unoccupied molecular orbitals, respectively.Subsequently, the HOMO-LUMO gap energy was calculated by subtracting the LUMO energy from the HOMO energy.
Thus, the variation of the HL Gap in the deprotonation process of the conjugate acid was calculated as:

Mulliken electronegativity ( χ M )
Quantifies the ability of the base to donate a pair of electrons and accept a proton.χ M was determined by Eq. 6 [69,72] Thus, the variation of the HL Gap in the deprotonation process of the conjugate acid was calculated as: Were VIP and VEA are the vertical ionization potential and vertical electron affinity of the base, respectively.The energy of the neutral molecule was calculated by single point from optimized base using the same quantum chemistry software package and level of theory employed in the preceding calculations.Next, the energy of the cation (N − 1) was obtained by removing an electron from the neutral molecule (N) using the same software package and level of theory, setting the charge of the molecule to + 1 in the input file.The VIP was calculated as the energy difference between the cation (E N −1 ) and the neutral molecule (E N ) [17], The energy of the anion (N + 1) was obtained by adding an electron to the neutral molecule (N) using the same software package and level of theory, setting the charge of the molecule to − 1 in the input file.The VEA was calculated as the energy difference between the neutral molecule (E N ) and the anion (E N +1 ) [17],

Nonpolar surface area percentage ( %NPSA)
The percentage of nonpolar surface area of the base was calculated using Multiwfn version 3.8 [57], a software package for post-processing wavefunction analysis, with an improved Marching Tetrahedra algorithm [58].The molecular structure of the base was loaded into the software in Gaussian Binary Wavefunction format (.gbw) and analyzed using the "quantitative analysis of molecular surface" function with electrostatic potential (ESP) as the mapped function.The analysis was conducted under default settings, with an electron density contour value of 0.00100 used to define the isovalue of the electron density surface.The grid point spacing of 0.250000 was selected for generating the molecular surface, and the ratio of van der Waals radius was set to 1.7000 to extend the spatial region of cubic grids, which determines the size of the molecular surface by expanding the van der Waals radii of the atoms in the molecule.

Change in average local ionization energy at nitrogen atom ( ALIE N )
The reactivity of the acid-base reaction center was investigated using the concept of Average Local Ionization Energy (ALIE) [71,85].To compute the ALIE values for the nitrogen atoms in both the base and conjugate acid, Eq. 9 was employed.
where ρ i (N) denotes the density of the i-th orbital of the nitrogen atom, ǫ i refers to the corresponding orbital energy, and ρ(N) denotes the total electron density on the nitrogen atom.The calculations were performed using Multiwfn software (version 3.8) [57] by importing the optimized molecular structures in.gbw format.
The difference between the ALIE N of the conjugate acid and the ALIE N of the base was calculated to determine ALIE N , which provides a quantitative measure of the change in the electronic structure and potential energy of the nitrogen atom upon the acid-base reaction.The parameter ALIE N was calculated using the following equation:

Table 1
Experimental and predicted basic pK a values for nitrogen compounds in pharmaceutical ingredients and dyes at 25 • C in aqueous solution c Does not have a common or short name.The IUPAC name is found in Additional file 1

Table 2
Predicted pK a values of nitrogen compounds in CB7-complexed states at 25 • C in aqueous solution (Predicted pK a CB7 ) and their respective pK a shift a Calculated by Eq. 2. RMSE_ext = 0.54 b Calculated pKa Shift using the Eq. 2 for both the free and complexed substrate.RMSE_ext = 0.50