LogD7.4 prediction enhanced by transferring knowledge from chromatographic retention time, microscopic pKa and logP

Lipophilicity is a fundamental physical property that significantly affects various aspects of drug behavior, including solubility, permeability, metabolism, distribution, protein binding, and toxicity. Accurate prediction of lipophilicity, measured by the logD7.4 value (the distribution coefficient between n-octanol and buffer at physiological pH 7.4), is crucial for successful drug discovery and design. However, the limited availability of data for logD modeling poses a significant challenge to achieving satisfactory generalization capability. To address this challenge, we have developed a novel logD7.4 prediction model called RTlogD, which leverages knowledge from multiple sources. RTlogD is pre-trained on a chromatographic retention time (RT) dataset, since RT is influenced by lipophilicity. Additionally, microscopic pKa values are incorporated as atomic features, providing valuable insights into ionizable sites and ionization capacity. Furthermore, logP is integrated as an auxiliary task within a multitask learning framework. We conducted ablation studies and present a detailed analysis showcasing the effectiveness and interpretability of RT, pKa, and logP in the RTlogD model. Notably, RTlogD demonstrated superior performance compared to commonly used algorithms and prediction tools. These results underscore the potential of the RTlogD model to improve the accuracy and generalization of logD prediction in drug discovery and design. In summary, the RTlogD model addresses the challenge of limited data availability in logD modeling by leveraging knowledge from RT, microscopic pKa, and logP. Incorporating these factors enhances the predictive capabilities of our model, and it holds promise for real-world applications in drug discovery and design scenarios.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-023-00754-4.


Introduction
Lipophilicity reflects a compound's ability to dissolve in both octanol and water. In drug-like molecules, lipophilicity affects their physicochemical properties, such as absorption, distribution, metabolism, elimination and toxicology [1,2]. High lipophilicity has been associated with an increased risk of toxic events, as reported in animal studies conducted by Pfizer [3], while low lipophilicity could limit drug absorption and metabolism [4,5]. Optimal lipophilicity gives a drug molecule better safety and pharmacokinetic profiles [6]. Therefore, accurately determining the lipophilicity of potential drugs is critical to increasing their chances of success in the development and evaluation processes.
Lipophilicity is generally expressed quantitatively as the n-octanol/water partition coefficient (logP) or the n-octanol/buffer solution distribution coefficient (logD) [1,7]. LogP describes the differential solubility of a neutral compound with a single form in n-octanol and water. However, 95% of drugs contain ionizable groups and exist in both ionized and unionized forms. Thus, logD, which is pH dependent and measures the lipophilicity of an ionizable compound in a mixture of ionic species, is more relevant to drug research. Of particular interest is the logD at physiological pH 7.4 (logD7.4). According to Bhal's studies, logD should be taken into consideration in the "Rule of 5" instead of logP [8]. Yang et al. demonstrated that the molecular feature of logD can help distinguish aggregators from nonaggregators in drug discovery [9]. Furthermore, compounds with moderate logD7.4 values exhibit optimal pharmacokinetic and safety profiles, leading to improved therapeutic effectiveness [6]. Overall, logD7.4 plays a crucial role in drug discovery by providing a more comprehensive assessment of a drug's lipophilicity than the commonly used logP value. Accurate prediction of logD7.4 is essential for evaluating drug candidates and optimizing compound properties in the drug discovery process.
Several experimental techniques have been developed to measure logD7.4, including shake-flask, chromatographic and potentiometric approaches. The most commonly used method is the shake-flask method, where n-octanol serves as the organic phase and buffer acts as the aqueous phase [10]. However, this method is labor-intensive and requires large amounts of synthesized compounds. Chromatographic techniques, particularly high-performance liquid chromatography (HPLC) systems, rely on the distribution behavior between the mobile and stationary phases [11]. Although the HPLC method is simple and robust against impurities, it provides only an indirect assessment of logD7.4 and is less accurate. Potentiometric titration approaches involve dissolving samples for logD7.4 determination in n-octanol and titrating them with potassium hydroxide or hydrochloric acid. However, these approaches are limited to compounds with acid-base properties and require high sample purity [12].
Several in silico strategies have been devised to estimate logD due to the complicated experimental determination process. These strategies rely on the quantitative structure-property relationship (QSPR) [13][14][15][16][17]. Artificial intelligence (AI) methods, particularly graph neural networks (GNNs), which use graph representation learning of entire molecules, have been successfully employed in QSPR modeling [18][19][20][21][22]. However, the availability of logD experimental datasets is limited due to proprietary data and the time-consuming shake-flask method, which restricts the generalization capability of the GNNs. To address this, Lapins et al. and Galushka et al. augmented the training datasets with nearly 1.6 million calculated logD7.4 values (ACD/logD7.4) from the ChEMBL database [15,16]. Although this method uses a large amount of data, Fu et al. pointed out that utilizing predicted values can magnify the discrepancy between the predicted and the actual values, leading to suboptimal model performance for new molecules [17].
Pharmaceutical companies have harnessed their proprietary models to predict logD values. In comparison to academic endeavors, these models exhibit superior performance owing to the utilization of their extensive and confidential datasets. Bayer generates thousands of new data points annually [23], AstraZeneca has an expansive in-house database containing experimental drug metabolism and pharmacokinetics values [24], and Merck & Co. is significantly investing in leveraging institutional knowledge to guide their experimental endeavors [25]. Notably, AstraZeneca's AZlogD74 model is trained on a dataset of over 160,000 molecules, which they continuously update with new measurements [24].
Previous academic studies have incorporated logP and pKa to estimate logD [26,27], considering the inherent limitations in data quantity and quality for logD predictions. The pKa is the negative logarithm of the acid dissociation constant, the equilibrium constant relating the protonated and deprotonated forms of a compound in a solvent. Unlike logP, which disregards the molecule's ionization form, pKa provides information about a compound's ionization state and capacity, which logD takes into account. Therefore, it is important to acknowledge the correlation between logD, logP, and pKa. One theoretical approach assumes that logD can be calculated from logP and pKa [27,28]. However, this calculation assumes that only the neutral species are distributed in the organic phase, disregarding the fact that octanol can dissolve a significant amount of water, allowing the ionic species to partition into octanol through water. The presence of both charged and uncharged species in the organic phase can lead to a significant error. To address the scarcity of data, data-driven methods such as transfer learning and multitask learning can uncover the underlying contributions of pKa and logP to logD, enabling a more comprehensive and reliable utilization of the data [29,30]. Wu et al. employed transfer learning using experimental pKa and logP data [26], while Aliagas et al. utilized macroscopic pKa values of molecules predicted by the commercial software Moka as a descriptor of logD [31]. Lukashina et al. and Wieder et al. employed multitask learning to simultaneously learn logD and logP tasks, resulting in improved prediction performance compared to learning the logD task alone [32,33].
In addition to logP and pKa, chromatographic retention time has also shown a strong correlation with logD. Parinet et al. used calculated logD and logP as descriptors to predict retention time [34], highlighting the link between molecular chromatographic behavior and logD. Chromatographic techniques offer rapid high-throughput analysis, producing a volume of chromatographic retention time data that surpasses the available logP and pKa data. Win et al. incorporated liquid chromatography retention time as a descriptor to improve the accuracy of logD prediction [35]. However, their dataset included only 2070 molecules, leaving most of the almost 80,000 molecules in the chromatographic retention time dataset unused [36]. To the best of our knowledge, no previous research has employed chromatographic retention time as a source task in transfer learning for logD prediction. Including retention time (RT) through transfer learning will expand the molecule dataset, encompassing more compounds and making valuable contributions to the logD task.
In this study, we have established RTlogD, a framework designed to predict a molecule's logD by integrating relevant information, such as RT, logP and pKa. First, we used RT prediction as a source task by constructing a pre-trained model trained on a dataset of nearly 80,000 molecules. Fine-tuning this RT model enhances the generalization capability of logD prediction because it has been exposed to a large number of molecules. Second, we incorporated logP as an additional task in parallel with logD prediction, creating a multitask model for lipophilicity prediction. The domain information contained in the logP task serves as an inductive bias that improves the learning efficiency and prediction accuracy of the logD model. Lastly, we integrated the predicted acidic and basic microscopic pKa values as atomic features. The microscopic pKa of ionizable atoms can offer more specific ionization information, enabling enhanced lipophilicity prediction for different molecular ionization forms. To validate our method, we curated a time-split dataset consisting of molecules reported within the past 2 years and compared the performance of the RTlogD model with widely used tools such as ADMETlab2.0 [14], PCFE [37], ALOGPS [38], FP-ADMET [13] and the commercial software Instant Jchem [39].

Data sets

DB29-data
The DB29-data consists of experimental logD values gathered from ChEMBLdb29 [40]. This dataset serves as modeling data due to its comprehensive coverage, facilitating the construction of a logD model with optimal performance. To ensure data quality, it exclusively includes experimental logD values obtained from the shake-flask method, chromatographic techniques, and potentiometric titration approaches. The following pretreatment steps were taken: (1) Records with pH values outside the range of 7.2-7.6 were removed. (2) Records with solvents other than octanol were eliminated. (3) All data were manually verified, and errors were corrected. We identified two types of errors: partition coefficients that had not been logarithmically transformed, and transcription errors where the values recorded in ChEMBLdb29 do not match those in the primary literature sources. Rectifying the first type of error is relatively straightforward, as these values are conspicuously large and hence easy to spot. To address the second type of error, we cross-referenced the logD records in ChEMBLdb29 with logD values predicted by Instant Jchem. Whenever a record deviated notably from the predicted logD value, we manually corrected it based on its literature source. (4) For a molecule with multiple experimental values that did not vary significantly, the arithmetic mean of these values was adopted as the experimental value for that molecule. Otherwise, the molecule was excluded. (5) Chemical structures were standardized by removing all salts from molecules, computing the normalized tautomer of the molecule, neutralizing charged molecules, and standardizing SMILES strings using the RDKit package [41]. After these pretreatments, a logD7.4 modeling data set with 19,128 compounds was obtained for training models.
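Steps (1), (2) and (4) of the pretreatment can be illustrated with a minimal sketch. This is not the authors' curation script: the record layout `(smiles, logd, ph, solvent)`, the accepted solvent spellings, and the `MAX_SPREAD` tolerance are all assumptions made for illustration (the text does not state what counts as a significant variation), and the SMILES are assumed to have been standardized already as in step (5).

```python
from collections import defaultdict
from statistics import mean

# Assumed tolerance in log units; the paper does not give the threshold it used.
MAX_SPREAD = 0.5
# Assumed spellings for the octanol phase in the source records.
OCTANOL_NAMES = {"octanol", "n-octanol", "1-octanol"}


def curate(records, ph_range=(7.2, 7.6), max_spread=MAX_SPREAD):
    """Filter records by pH and solvent, then merge replicate measurements.

    `records` is an iterable of hypothetical (smiles, logd, ph, solvent) tuples.
    Returns a dict mapping each retained SMILES to its averaged logD value.
    """
    by_molecule = defaultdict(list)
    for smiles, logd, ph, solvent in records:
        if not (ph_range[0] <= ph <= ph_range[1]):
            continue  # step 1: drop records outside pH 7.2-7.6
        if solvent.lower() not in OCTANOL_NAMES:
            continue  # step 2: drop records measured in other solvents
        by_molecule[smiles].append(logd)

    curated = {}
    for smiles, values in by_molecule.items():
        # step 4: average replicates that agree; otherwise exclude the molecule
        if max(values) - min(values) <= max_spread:
            curated[smiles] = mean(values)
    return curated
```

Manual verification against the literature (step 3) is not something a script like this can replace; the sketch covers only the mechanical filters.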

T-data
To build an external test dataset that was not used in model training, for comparison with existing logD prediction tools, we also processed ChEMBLdb32 to create a time-split external test dataset following the same protocol described above. This yielded 2753 newly added logD7.4 data points, which were compiled as T-data.

Lipo dataset
Additionally, the Lipo dataset was used to conduct a comparative analysis between several GNN-based logD models and RTlogD. The Lipo dataset, deposited in MoleculeNet by AstraZeneca, includes 4200 compounds [42] and is widely recognized as a benchmark for logD prediction models. Here, the Lipo dataset was randomly split into training, validation, and test sets with a ratio of 8:1:1.

RT dataset
We gathered the METLIN small molecule retention time (SMRT) dataset as an auxiliary dataset to improve logD prediction performance. The SMRT dataset contains chromatographic retention time data for 80,038 small molecules measured by high-performance liquid chromatography-mass spectrometry, with values ranging from 0.3 to 1471.7 s. After removing molecules with no retention time, the remaining 79,957 molecules were used to construct chromatographic retention time models [36].

LogP dataset
A total of 13,553 logP values were collected from PhysProp [43], 2534 logP values from NCI Open Database Compounds [44], 773 values from OChem [45] and 707 logP values from DiverseDataset [46]. After normalization and deduplication, the resulting logP dataset contains 13,688 molecules. Table 1 summarizes the different types of data sets used in this work.
The RT dataset, logP dataset, DB29-data, T-data and the Lipo dataset can be found in our GitHub repository.

Baseline models
In this study, we employed four machine learning algorithms as baseline models for logD prediction: random forest [47] (RF), support vector machine [48] (SVM), artificial neural network [49] (ANN) and extreme gradient boosting [50] (XGBoost) (Additional file 1: Fig. S1). XGBoost was implemented using the XGBoost package, while SVM, RF and ANN were implemented using the scikit-learn package [51]. To encode the molecular structures as input to the models, we used extended connectivity fingerprints (ECFPs) with a diameter of 4 and a fingerprint length of 2048 bits. Additionally, we implemented several GNN-based methods for comparison with our model, including MolMapNet, MGA, StructGNN, KEMPNN, CoMPT, ALipSol and ALipSol+ [26].

Attention-based graph neural network
The foundational structure of RTlogD is derived from our previously developed graph attention model called Attentive FP [18], which was implemented using Deep Graph Library (DGL) [52]. This method incorporates a graph attention mechanism into the graph neural network (GNN) so that the model concentrates on the most relevant parts of the input when making a prediction. Initially, we utilized the DGL package and RDKit toolkit to convert the molecule's SMILES string into an undirected graph, incorporating nine types of atom features including microscopic pKa values (see "Introducing pKa features") and four types of bond features (Additional file 1: Table S1). Subsequently, the molecular graph was passed through three graph neural layers to facilitate message passing and update node representations. The readout operation computed graph representations from node features, followed by a multilayer perceptron (MLP) with two fully connected layers to predict graph labels based on the obtained graph features (Fig. 1a).

Pre-training model of RT
To initialize the network parameters of the logD model, we adopted a pre-training strategy. We initially trained an RT model using the aforementioned attention-based graph neural network and the SMRT dataset. During training, the SmoothL1Loss was used as the loss function. Optimization was performed using Adam with weight decay. Grid search was applied for hyperparameter tuning, determining the best hyperparameter set for each model based on the validation set. The search ranges and optimal values of these hyperparameters are detailed in Additional file 1: Table S2. To enhance regularization and mitigate neuron co-adaptation, a dropout layer was integrated during training, randomly setting elements in the pooling output vector to zero with a probability of p = 0.2. Additionally, batch normalization was applied to expedite and stabilize the training process. Performance on the evaluation dataset was computed after each epoch. Lastly, the weights of the best-performing model were employed as the initial parameters for the subsequent fine-tuning model.

Multitask learning for logD and logP
We fine-tuned the pre-trained RT model within a parallel multitask learning architecture [55], aiming to predict logP and logD values simultaneously. In contrast to the single-task attention-based graph neural network employed in the RT model, the multilayer perceptron generates two outputs: one for logD and the other for logP. The hyperparameters remained consistent with those of the pre-trained RT model, except for the learning rate, which was reduced by a factor of 10 to preserve the RT information acquired during pre-training. We combined the compounds from the DB29-data and logP dataset, subsequently allocating them into training, validation, and test sets at an 8:1:1 ratio based on their molecular scaffolds. Early stopping was implemented based on the averaged squared Pearson correlation coefficient for the logP and logD tasks on their respective internal validation sets. During training, for each molecule, we computed the SmoothL1Loss using the available value for either logD or logP, omitting the unknown value. In cases where both logD and logP values were available, we computed the mean loss for these two tasks.
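The handling of missing labels in the multitask loss can be sketched as follows. This is an illustrative per-molecule version in plain Python rather than the authors' DGL/PyTorch code; the function names are hypothetical, and `beta = 1.0` is an assumed setting for the SmoothL1 transition point, which the text does not specify.

```python
def smooth_l1(pred, target, beta=1.0):
    """SmoothL1 loss for one value: quadratic below `beta`, linear above it.

    beta=1.0 is an assumption; the paper does not report this setting.
    """
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def multitask_loss(pred_logd, pred_logp, true_logd=None, true_logp=None):
    """Loss for one molecule, averaged over whichever labels are known.

    A label passed as None is treated as unknown and contributes nothing.
    """
    losses = []
    if true_logd is not None:
        losses.append(smooth_l1(pred_logd, true_logd))
    if true_logp is not None:
        losses.append(smooth_l1(pred_logp, true_logp))
    if not losses:
        raise ValueError("molecule has neither a logD nor a logP label")
    return sum(losses) / len(losses)
```

A molecule with only a logD label therefore contributes its logD loss alone, while a molecule with both labels contributes the mean of the two task losses, as described above.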

Introducing pKa features
We modified the Attentive FP GNN model to incorporate molecular pKa features, based on our previously developed multi-instance learning framework Graph-pKa [18,53]. Specifically, we concatenated the predicted acidic microscopic pKa and basic microscopic pKa as new features at the atomic level in our Attentive FP model. This expanded the initial 74-dimensional atomic feature calculated by RDKit to 76 dimensions. The acidic microscopic pKa is only assigned to non-carbon atoms connected to at least one hydrogen atom, with lower values indicating stronger acidic ionization ability. The basic microscopic pKa is assigned to nitrogen atoms without a positive formal charge, with higher values indicating a stronger basic ionization ability. Both acidic and basic microscopic pKa values were normalized to a range of zero to one. Graph-pKa was also used to predict macro-pKa values of the molecules used for the calculation in CALlogD.
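The feature expansion from 74 to 76 dimensions can be pictured with a small sketch. Several details here are assumptions rather than facts from the text: the normalization range (`PKA_MIN`, `PKA_MAX`), the placeholder used for atoms that carry no predicted pKa (`MISSING`), and the function names are all hypothetical.

```python
# Assumed normalization bounds; the paper only says values are scaled to [0, 1].
PKA_MIN, PKA_MAX = 0.0, 14.0
# Assumed placeholder for atoms with no acidic or basic site; not stated in the paper.
MISSING = 0.0


def normalize_pka(pka):
    """Min-max scale a predicted microscopic pKa to [0, 1], clipping outliers."""
    if pka is None:
        return MISSING
    clipped = min(max(pka, PKA_MIN), PKA_MAX)
    return (clipped - PKA_MIN) / (PKA_MAX - PKA_MIN)


def add_pka_features(atom_features, acidic_pka=None, basic_pka=None):
    """Append normalized acidic and basic pKa values to one atom's feature vector."""
    return list(atom_features) + [normalize_pka(acidic_pka), normalize_pka(basic_pka)]
```

For example, `add_pka_features([0.0] * 74, acidic_pka=4.2)` yields a 76-dimensional vector whose last two entries hold the scaled acidic value and the placeholder for the absent basic value. How the original model distinguishes an absent site from a genuinely low pKa is not described in the text, so the placeholder choice should be read with caution.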

Evaluation metrics
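This study used three metrics to assess the model's performance: the mean absolute error (MAE), root-mean-squared error (RMSE) and coefficient of determination (R²). We also used Spearman's correlation coefficient ($r_s$) to measure the monotonicity of the relationship between two datasets.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{1}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{2}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{3}$$
$$r_s = \frac{\mathrm{cov}(R(X), R(Y))}{\sigma_{R(X)}\,\sigma_{R(Y)}} \tag{4}$$
In Eqs. (1) through (3), $y_i$ and $\hat{y}_i$ are the measured and predicted values for molecule $i$, respectively, $n$ is the number of molecules, and $\bar{y}$ is the mean of the measured values over all molecules in the dataset. In Eq. (4), $\mathrm{cov}(R(X), R(Y))$ is the covariance of the rank variables, and $\sigma_{R(X)}$ and $\sigma_{R(Y)}$ are the standard deviations of the rank variables.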

The implementation of RTlogD
The implementation strategies of RTlogD are presented in Fig. 1, which comprises two main parts. The first part, shown in Fig. 1a, illustrates an attention-based graph neural network that incorporates molecular pKa features, which is the backbone structure of RTlogD (see "Method").
The second part, shown in Fig. 1b, depicts the overall workflow of RTlogD, including pre-training on RT and multitask learning for logD and logP. To initialize the logD model's network parameters, we first pre-trained an RT model (see "Method"). We evaluated our pre-trained RT model against the current state-of-the-art (SOTA) model, GNN-RT [54]. We observed that our model achieved satisfactory performance, comparable to the best model to date (Additional file 1: Table S7). Then, we fine-tuned the pre-trained RT model within a parallel multitask learning architecture, aiming to predict logP and logD values simultaneously (see "Method").

Performance evaluation
To evaluate the predictive performance of RTlogD, we compared it with the theoretical method and GNN-based logD models. The GNN-based logD models and RTlogD were trained on the Lipo training/validation sets and evaluated on the Lipo test set (see "Method"). The theoretical method, known as CALlogD, estimates logD7.4 from the predicted logP and pKa values using Eq. 5 [28].
$$\log D_{\mathrm{pH}} = \log P - \log\left(1 + 10^{(\mathrm{pH} - \mathrm{p}K_a)\,\delta_i}\right) \tag{5}$$
where $\delta_i = \{1, -1\}$ for acids and bases, respectively. In this method, the prediction of logD requires logP and pKa as known input parameters. Here, logP is predicted by the auxiliary task of RTlogD and pKa is predicted by Graph-pKa.
The performance of the RTlogD model and the baseline models is presented in Table 2. Among all the methods, the CALlogD method displayed the poorest performance.
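The CALlogD relation in Eq. 5 is simple enough to check numerically. The sketch below assumes a single ionizable site per molecule and treats `cal_logd` and its argument names as hypothetical; it is an illustration of the formula, not the authors' implementation.

```python
import math


def cal_logd(logp, pka, is_acid, ph=7.4):
    """Estimate logD from logP and pKa for a single ionizable site (Eq. 5).

    delta is +1 for acids and -1 for bases, so the correction term grows
    as the site becomes more ionized at the given pH.
    """
    delta = 1.0 if is_acid else -1.0
    return logp - math.log10(1.0 + 10.0 ** ((ph - pka) * delta))
```

For an acid with logP = 3.0 and pKa = 4.4, the site is almost fully ionized at pH 7.4 and `cal_logd(3.0, 4.4, is_acid=True)` falls to roughly zero, three log units below logP. For a base whose pKa equals the pH, half of the molecules are ionized and the correction is log10(2), about 0.30. Because the formula assumes that only neutral species enter the octanol phase, it tends to underestimate logD for strongly ionized compounds, which is consistent with the limitation noted in the Introduction.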

Comparison with logD prediction tools
To further investigate the performance of RTlogD in the logD prediction task, we conducted a more stringent time-split evaluation to ensure that the newly collected test data were not exposed to model training. We compared our proposed RTlogD model with five commonly used tools: Instant Jchem, ADMETlab2.0, PCFE, FP-ADMET and ALOGPS. The results presented in Table 3 demonstrate the advantages of RTlogD over the other tools: RTlogD exhibited a higher R² value and lower RMSE and MAE values. The PCFE model, which ranked second, utilized 1.71 million computational logD values for pre-training before being fine-tuned with experimental logD7.4 data. In contrast, our model achieved superior results using only approximately 80,000 chromatographic data points for pre-training. Despite its smaller pre-training dataset, our model's performance suggests that incorporating auxiliary information, such as RT, logP, and pKa, through reasonable training strategies contributes effectively to its superior performance. Therefore, we further investigated the individual contributions of the different modules, RT, logP, and pKa, to the final prediction performance.

Ablation experiments
We first conducted ablation studies to evaluate the impact of auxiliary information on the logD prediction performance of the RTlogD model. Specifically, we examined the model's performance on T-data when it was not pre-trained on the RT dataset, did not incorporate microscopic pKa as atom features, or did not include the logP multitask. Table 4 presents the comparison of the complete RTlogD model with variations that exclude certain components. Additionally, Additional file 1: Table S6 presents the performance of various logD prediction model variations, including incorporating all auxiliary information as multitask, using logP as the pre-training task while employing RT as multitask, and substituting macroscopic pKa for microscopic pKa. Table 4 shows that the "w/o RT" model, "w/o microscopic pKa" model, "w/o RT and microscopic pKa" model, and "w/o logP" model all showed decreased logD prediction performance. This emphasizes the importance of incorporating auxiliary information from RT, microscopic pKa features, and logP to enhance the overall performance of the RTlogD model. Notably, the RTlogD model, which combines pre-training on the RT dataset, incorporation of microscopic pKa as atomic features, and inclusion of the logP multitask, outperformed other strategies (Additional file 1: Table S6) and achieved the highest level of performance.
The effectiveness of QSPR models usually relies heavily on the similarity between the predicted molecules and those in the training set. This relationship is evident in Additional file 1: Fig. S2, where the model's prediction accuracy improves as the molecules in the T-data become more similar to those in the training set. Pre-training on RT data, which is closely related to logD, can address the challenge of data scarcity for logD modeling and enhance the generalization ability for predicting novel molecules.
To verify this, we conducted an experiment to gradually expand the chemical space of the training dataset with and without RT pre-training, respectively. Various numbers of molecules were sampled from the DB29-data as training data to train a series of models with expanding chemical space. The T-data was then used to evaluate the prediction performance of these models.
Specifically, to simulate out-of-domain prediction tasks commonly encountered in real-world applications, we adopted a prioritized sampling approach that selects molecules from the DB29-data that are less similar to the T-data. For each molecule in the T-data, we calculated its maximum Tanimoto similarity with molecules in the DB29-data using ECFP4 fingerprints. Subsequently, we sorted the molecules in the DB29-data in ascending order based on their similarity scores. The top N molecules were then sampled to construct a series of models. The initial model was built using the top 1000 molecules with the lowest similarity scores, while the subsequent models were constructed using the top 2000, 4000, 6000, 8000, 10,000, 12,000, and 14,000 molecules, with the dataset size from 2000 onward increasing by 2000 compounds at each step.
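The prioritized sampling can be sketched as below. Two assumptions are made explicit: fingerprints are represented as plain Python sets of on-bit indices (a stand-in for ECFP4 bit vectors, which would normally come from RDKit), and each DB29 molecule is scored by its maximum Tanimoto similarity to any T-data molecule, which is one reading of the procedure described above. All function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def least_similar_first(train_fps, test_fps):
    """Order training molecules by ascending max similarity to the test set.

    Assumed scoring: each training molecule gets the highest Tanimoto
    similarity it reaches against any test molecule.
    """
    scores = {
        name: max(tanimoto(fp, test_fp) for test_fp in test_fps.values())
        for name, fp in train_fps.items()
    }
    return sorted(scores, key=scores.get)


def nested_subsets(ordered, sizes):
    """Top-N prefixes of the ordered list, one per requested training size."""
    return {n: ordered[:n] for n in sizes if n <= len(ordered)}
```

With the ordering in hand, `nested_subsets(ordered, [1000, 2000, 4000, 6000, 8000, 10000, 12000, 14000])` would give the training sets for the series of models, each containing the previous one plus progressively more test-like molecules.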
The variation in prediction performance with respect to the size of the training data is depicted in Fig. 3a. Overall, the prediction performance improved as the training data size increased, regardless of whether RT pre-training was used or not. This improvement can be attributed to the data-hungry nature of GNNs, which require more data to fit the model and prevent overfitting. However, the performance improvement achieved by adding thousands of molecules to models without RT pre-training can be attained simply by incorporating RT pre-training. As shown in Fig. 3a, increasing the logD training data size from 1000 to 4000 reduces the MAE from 1.212 to 0.914 for models without RT. With the RT pre-training strategy, comparable performance (MAE = 0.924) is achieved with only 1000 logD training data points (the red dashed line in Fig. 3a). These results suggest that incorporating RT as pre-training may reduce the number of instances required for model training. Furthermore, with a smaller training data size, the performance gap between the models with and without RT pre-training is wider, indicating that RT pre-training has a more pronounced impact when data are scarce and helps the model generalize to novel molecules.
In addition, we investigated the reasons behind the trend in Fig. 3a and propose that the RT data enable the model to leverage relevant knowledge. To analyze the chemical space of the RT and logD datasets, we employed t-distributed stochastic neighbor embedding (t-SNE) [57] based on ECFP4 molecular fingerprints. When training with only 1000 molecules, the training set in the DB29-data covers only a fraction of the T-data (Fig. 3c). When training with 4000 molecules, the coverage of the T-data by the training set is increased (Fig. 3d). However, adding more data does not improve the coverage further, as the chemical space of the training set remains relatively constant (Fig. 3e). Meanwhile, the chemical space of RT directly encompasses most of the T-data, except for a small portion representing peptides (points at the bottom right-hand corner in Fig. 3b). The incorporation of RT exposes the model to a wider range of molecules and improves its inductive bias. These visualization results are consistent with the performance statistics presented in Fig. 3a. Although the performance gap between the two models diminishes as the logD training data size increases, the model with RT pre-training consistently generalizes better.
In addition to investigating the positive impact of the RT source task on logD prediction, we explored the rationale behind incorporating logP as an auxiliary task in a multitask learning approach. Multitask learning provides an inductive bias through the inclusion of auxiliary tasks, guiding the model to favor hypotheses that explain multiple tasks simultaneously. Consequently, incorporating relevant tasks can lead to improved model performance [58]. The analysis depicted in Fig. 4 shows a positive correlation between logP and logD values for molecules, with a Spearman's correlation coefficient of 0.628. This observation suggests that integrating logP as an auxiliary task, given its monotonic relationship with logD, has the potential to enhance the accuracy of logD predictions. In summary, the ablation studies conducted in this research highlight the importance of incorporating RT, logP, and pKa information in logD modeling. Pre-training the model with RT data allows it to be exposed to a broader chemical space, improving its ability to generalize. Furthermore, utilizing logP as a multitask provides a strong inductive bias, further improving the model's performance. These strategies collectively contribute to the development of solutions that exhibit better generalization capabilities.

Interpretability analysis of pKa values
To investigate the significance of incorporating pKa information into logD prediction, we analyzed the prediction accuracy of RTlogD and other models for highly ionizable molecules (Mol 1 to Mol 4), as depicted in Fig. 5. Predicting accurate logD values for such molecules is challenging, as ionization can alter a molecule's solubility and distribution characteristics from those of its neutral form. Highly ionizable molecules can exhibit complex pH-dependent partitioning behavior, which further complicates logD prediction.
An interpretability analysis was conducted to understand the microscopic pKa features by visualizing atomic attention. For each atom in a given molecule, attention weight scores ranging from 0 to 1 were obtained and normalized. Additionally, we created a model similar to RTlogD but without microscopic pKa as atom features (referred to as "w/o microscopic pKa" in ablation studies). Figure 5 displays the changes in atomic attention weight scores between models with and without microscopic pKa features. It is evident from Fig. 5 that the inclusion of the pKa feature enables the RTlogD model to identify the strongest ionization sites in the molecules, which are assigned higher attention weights, consistent with the ionization sites predicted by the Graph-pKa model. Consequently, RTlogD exhibits the lowest prediction errors for these challenging molecules compared to the model without pKa features and other tools. This indicates that microscopic pKa features, which reflect the ionization ability of chemical compounds and determine ionization sites, can improve logD prediction in a rational manner.

Conclusion
In this study, we present a novel in silico logD7.4 prediction model called RTlogD. Our model combines a pre-training model on a chromatographic retention time dataset with a fine-tuning model that includes the multitask of logD and logP. We also incorporated microscopic acidic pKa and basic pKa into atomic features. Our model exhibited superior performance compared to existing tools and models, such as Instant Jchem, ADMETlab2.0, PCFE, FP-ADMET and ALOGPS. We conducted case studies and analyses to validate the strategies proposed in this paper. Our findings underscore the effectiveness of incorporating RT, logP, and microscopic pKa information, as well as utilizing transfer learning and multitask learning to enhance the performance of the RTlogD model. Pre-training the model with RT data enables it to capture a broader range of chemical space beyond the logD dataset alone. Moreover, employing logP as a multitask imparts a robust inductive bias, while incorporating microscopic pKa features provides valuable information about the compound's ionization ability and ionization sites. These strategies contribute to the rational development of solutions that demonstrate improved generalization capabilities.
In conclusion, our study has implications for drug discovery and design, as it enables more accurate predictions of the lipophilicity of novel molecules. Reliance on high-quality internal data is crucial for achieving robust model performance. In contrast to the commercial tools employed by pharmaceutical companies, academic models often rely on literature data, which inherently carries biases and may undermine model accuracy. RTlogD aims to address the limited generalization capability of existing models caused by data scarcity. This is achieved through pre-training and multitask learning, which mitigate the constraints posed by insufficient open-source data. Additionally, RTlogD employs carefully designed descriptors that incorporate microscopic pKa features, providing essential ionization information. This contributes to enhanced generalization capability compared to other open-source models. In the future, we intend to periodically update RTlogD with newly available substantial datasets to ensure its adaptability. Moreover, we plan to expand our analysis to predict not only lipophilic fragments but also transformation procedures, providing alternative or improved fragment suggestions. This will be of great importance in optimizing molecular structures toward moderate lipophilicity and improving the success rate of drug candidates.

Fig. 1 The architecture of the RTlogD model. a The graph neural network used in RTlogD. b Transfer learning of RT and the multitask learning of logP and logD module

... data have not been exposed to the model training. We trained RTlogD using 19,128 logD samples in the ChEMBLdb29 database (DB29-data) and tested it on the newly disclosed samples in ChEMBLdb32, which consist of 2753 recently measured logD values (T-data) (see "Method"). On the one hand, T-data provides a relatively fair benchmark for comparing RTlogD with existing logD prediction tools, such as ADMETlab2.0, PCFE, ALOGPS, FP-ADMET and Instant Jchem. On the other hand, given the discrepancy in chemical space between T-data and DB29-data, T-data is valuable for assessing the generalization capability of the RTlogD model. To assess the structural dissimilarity between T-data and DB29-data, we employed the molecular fingerprint ECFP4 [56] to calculate both the maximum internal similarities within DB29-data and the maximum similarity of each molecule in T-data relative to DB29-data. Figure 2 illustrates that most molecules in T-data are structurally dissimilar to DB29-data, as evidenced by low maximum similarities ranging from 0.2 to 0.4. This enabled us to perform an independent evaluation and comparison of various predictive tools based on T-data.
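The maximum-similarity analysis described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: each fingerprint is represented as a set of on-bit indices, which in practice would be produced by a cheminformatics toolkit (ECFP4 corresponds to a Morgan fingerprint of radius 2). The function names and the toy fingerprints are hypothetical.

```python
from typing import FrozenSet, List, Sequence

# A fingerprint is modeled as the set of indices of its on-bits (assumption).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


def max_similarity_to_reference(
    queries: Sequence[Fingerprint], reference: Sequence[Fingerprint]
) -> List[float]:
    """For each query molecule, the highest similarity to any reference molecule."""
    return [max((tanimoto(q, r) for r in reference), default=0.0) for q in queries]


def max_internal_similarity(reference: Sequence[Fingerprint]) -> List[float]:
    """For each reference molecule, the highest similarity to any other one."""
    result = []
    for i, fp in enumerate(reference):
        others = (tanimoto(fp, o) for j, o in enumerate(reference) if j != i)
        result.append(max(others, default=0.0))
    return result


# Toy fingerprints standing in for a training set and a test set (illustrative).
training_fps = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({7, 8, 9})]
test_fps = [frozenset({1, 2, 10, 11, 12, 13}), frozenset({7, 8, 9})]

external = max_similarity_to_reference(test_fps, training_fps)
internal = max_internal_similarity(training_fps)
```

Comparing the distribution of `external` with that of `internal` is what indicates whether a test set lies outside the chemical space covered by the training set.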

Fig. 2 Comparison of the distributions of maximum Tanimoto similarity within DB29-data (red) and between T-data and DB29-data (blue) using ECFP4

Fig. 3 Effect of training data size on the prediction performance on T-data. a Model performance variation with and without RT pre-training. b t-SNE distribution of T-data and RT by ECFP4. c t-SNE distribution of T-data and 1000 training data sampled from DB29-data by ECFP4. d t-SNE distribution of T-data and 4000 training data. e t-SNE distribution of T-data and 8000 training data

Fig. 4 Scatter plot of experimental logP and logD values in the dataset. Spearman's correlation coefficient values can range from −1 to 1, where values of 1, 0, and −1 indicate perfect positive correlation, no correlation, and perfect negative correlation, respectively

Fig. 5 Visualization of attention weight distribution. After normalization, blue indicates an attention weight less than 0.5 and red indicates an attention weight greater than 0.5. The prediction errors of the different methods are denoted by ∆logD7.4 and shown as the length of the error bars

Table 1 Different types of data sets used in this work

Table 2 Different model performances on the Lipo data set. Values in bold represent the superior performance among the various methods

Table 3 Comparison with existing prediction tools on T-data. Values in bold represent the superior performance among the various methods

Table 4 Comparison with ablated models on T-data