DLIGAND2: an improved knowledge-based energy function for protein–ligand interactions using the distance-scaled, finite, ideal-gas reference state

Performance of structure-based molecular docking largely depends on the accuracy of scoring functions. One important type of scoring functions are knowledge-based potentials derived from known three-dimensional structures of proteins and/or protein–ligand complex structures. This study seeks to improve a knowledge-based protein–ligand potential based on a distance-scale finite ideal-gas reference (DFIRE) state (DLIGAND) by expanding the representation of protein atoms from 13 mol2 atom types to 167 residue-specific atom types, and employing a recently updated dataset containing 12,450 monomer protein chains for training. We found that the updated version DLIGAND2 has a consistent improvement over DLIGAND in predicting binding affinities for either native complex structures or docking-generated poses. More importantly, DLIGAND2 has a 52% increase over DLIGAND in enrichment factors in top 1% predictions based on the DUD-E decoy set, and consistently improves over Autodock Vina and other statistical energy functions in all three benchmark tests. We further found that DLIGAND2 outperforms empirical and machine-learning methods compared for virtual screening on new targets that are not homologous to the DUD-E training set. Given the best performance as a parameter-free statistical potential and among the best in all performance measures, DLIGAND2 should be useful for re-assessing the poses generated by docking software, or acting as one term in other scoring functions. The program is available at https://github.com/sysu-yanglab/DLIGAND2. Electronic supplementary material The online version of this article (10.1186/s13321-019-0373-4) contains supplementary material, which is available to authorized users.


Introduction
Structure-based molecular docking is one of the key components in computer-aided drug design [1][2][3]. Docking is a two-step process: conformational sampling of ligands bound to their receptors, followed by assessment of binding free energy between them. Due to advances in computing power and numerical algorithms, the success of docking is no longer restricted by inadequacy of conformational sampling but limited instead by the lack of a precise and reliable scoring function to evaluate the free energy of interactions between proteins and ligands [4]. Developing an accurate scoring function is challenging because molecular interaction is contributed by a delicate balance between several different types of interactions including van der Waals and columbic interactions in between, and interactions with solvent environment in addition to the difficulty in capturing entropic contributions [5,6].
A wide variety of scoring functions has been developed to approximate energy functions. Based on the derivation ways, scoring functions are usually classified into physics-based methods, empirical scoring functions, knowledge-based potentials, and descriptor-based scoring functions [7]. Physics-based methods, widely employed in molecular dynamics simulation studies, are obtained by combing quantum mechanical calculations of small molecular fragments and empirical fitting to known experimental data. Some examples are linear interaction energy (LIE) [8,9], linear response approximation (LRA) [10] and MM-PBSA/GBSA [11][12][13]. Since this type of methods require intensive computing time to perform kinetic integration for entropic effects, they are limited to assess a small number of compounds. Differently, virtual screening usually docked millions of molecules into a protein receptor to locate active compounds. Thus, the requirement of fast computation leads to the dominance of computationally efficient empirical scoring functions in docking as shown in the score-function assessment [5,6]. Empirical scoring functions are based on a linear combination of various energetic terms to approximate binding free energy. Notable examples are ChemScore [14,15], X-Score [16], Glide-Score [17,18], and etc. Typically, the weight factors for individual energetic terms in an empirical scoring function are obtained by regression to achieve the highest correlation to experimental binding affinities (scoring power). More recently, machine learning methods have been used to combine energetic terms and/or employ protein-ligand distances for training. Examples are RF-Score [19], ID-Score [20], SVM-SP [21], and DrugVQA [22]. However, these scoring functions are often sensitive to docking poses and don't perform well to separate decoys from true binding ligands in actual docking experiments [23]. Knowledge-based potentials (or statistical potentials) are derived from statistical analysis of known protein structures. A typical knowledge-based potential considers only the distances between atom pairs that allow efficient calculations. Different knowledge-based functions differ in how proteinligand atom pair potentials and their reference states are defined. Examples are SmoG [24,25], DrugScore [26], IT-Score [27,28], and ASP [29]. Knowledge-based scoring functions are also used in combination with solvation and entropic terms to improve performance. Examples are DSX [30], SmoG2016 [31] and ITScore/SE [32].
Previously, a knowledge-based scoring function called DLIGAND [9] was developed based on the distancescaled finite ideal-gas reference (DFIRE) state [33,34], which has successfully been used for protein interactions with DNA [35], RNA [36], and carbohydrate [37] molecules. DLIGAND was developed by representing both protein and ligand atoms by a few mol2 atom types, and trained on a small set of 200 protein complex structures. Here, we developed DLIGAND2 by substituting 13 mol2 atom types by 167 residue-specific atom types for protein atoms and using a large protein structural dataset for training. We showed that DLIGAND2 not only significantly improves over DLIGAND but also has superior performance in separating true ligands from decoys in Database of Useful Decoys-Enhanced (DUD-E).

Scoring function DLIGAND2 potential
We have used the same approach as the DLIGAND [38] to derive the distance-dependent interaction energy function between atomic pairs based on the distancescale finite ideal-gas reference (DFIRE) state [33] as where R is the gas constant, T = 300 K, α = 1.61, r cut = 15 Å, η is a scaling factor simply set as 0.01/RT. N obs (i,j,r) is the number of atomic pair (i,j) within the spherical shell of distance r observed in a given structure database, and ∆r(∆r cut ) is the bin width at r(r cut ). A constant value of 0.5 Å was used for ∆r at all bins and ∆r cut = ∆r. Here, we employed residue-specific atomic types for protein atoms that leads to 167 atomic types for protein atoms. This is different from DLIGAND, where both protein and ligand atoms were represented by mol2 atom types, and thus only 13 atom types were utilized for protein atoms.
We derived the protein-ligand interactions from protein structures because there is only a small number of non-redundant protein-ligand complex structures. From protein structures, we obtained the N obs for the number of observed pairs between protein atoms, which are converted to protein-ligand interactions by mapping indices for protein atoms to 11 mol2 atom types (see Additional file 1: Table S1) and summing over all pairs that are mapped to the same mol2 atom type as where i is protein atom type, δ map j , k is 1 only when the protein atom type j is mapped to mol2 atom type k, otherwise 0. Based on the N ′ obs (i, k, r) , we can derive the potential function in the same manner as DFIRE. This design enables us to obtain the scoring function purely from protein atoms without requiring their binding partners, so we employed our recently collected 12,450 non-redundant protein monomer chains [39] to obtain a sufficient number of observations. This training set represents more than 60 times bigger than the dataset (195 complex structures) used for deriving DLIGAND. For ligand mol2 atom types not existed in proteins, they were mapped to the closest atom type, as detailed in Additional file 1: Table S1. We also adopted the low-count correction according to Bayesian statistics as the previous study [40].

Benchmark datasets
Four benchmark datasets were employed to evaluate DLIGAND2.  [48,49], and iGEMDOCK (version 2.1) [50]. Docking ligands are confined to a 10 Å box enclosing the centroid of co-crystalized ligand. The maximum number of docking poses for each ligand was set to 10. After removing complexes failing to yield any complex structures in our selected docking programs, a collection of 4044 complexes remained for evaluation. The full list of 4044 complexes can be found in Additional file 2: Table S2. The scoring ability of functions were evaluated by the Pearson correlation coefficient (PCC) between the predicted and experimental values, as well as the root mean squared error (RMSE) after linear regression. The ability of DLIGAND2 to perform virtual screening was also evalued on the DUD-E dataset [51]. There are 22, 886 active ligands binding with 102 targets, with an average of 224 ligands per target. For each target, the DUD-E database provides an abundant number of decoys (50 decoys for each active) that have similar physicalchemical properties but dissimilar two-dimensional (2D) topology. We employed the 3D structure of a target protein with the highest resolution in the protein data bank for docking. This is different from original DUD-E test where the 3D structure of the best performance was selected for each target [51]. For each pair of protein target and ligand compound, we employed Autodock Vina with default options to generate one pose, which are re-scored by 5 scoring functions (ΔvinaRF 20 , ID-Score, X-Score, DLIGAND, and DLIGAND2).
The accuracy of each scoring function was evaluated by the LogAUC and enrichment factor (EF).
As described in DUD-E Ref. [51] and our previous studies [52,53], LogAUC takes the logarithm of x-axis in area under curve (AUC) to show more information on enrichment at a low false positive rate. We chose three regions of EF in top x% of the DUD-E dataset, where x equals to 1, 5 and 10 respectively.
where N x% True , N x% Selected , N x% Selected and N Total are the number of true positives, the number of selected candidates at top x% screened candidates, the number of active compounds, and the total number of compounds in the screened library, respectively.
For a fair comparison with the machine-learningbased scoring function (RF-Score-VS [54]) trained on the DUD-E dataset, we selected protein targets from the DEKOIS 2.0 benchmark [55] if it has sequence identity less than 95% to any protein in the DUD-E according to the BLAST [56]. Finally, 55 targets were kept and sorted by their sequence identity, as detailed in Additional file 3: Table S3.

Results and discussion
The DLIGAND2 potential Different from the united mol2 atom type used by DLI-GAND, the improved version DLIGAND2 has employed residue-specific types for protein atoms, which expanded atom types from 12 types to 169 atom types. Sufficient statistics for this larger number of atom types is ensured by using 12,450 protein chains for training. Residue-specific atom types enable the discrimination of the properties (e.g. partial charge) and surrounding environments of atoms. As shown in Fig. 1a, the potential energy between ligand atom S.3 and the main-chain O atom of ASP is significantly lower than between the atom and the mainchain O atom of ARG likely because S.3 atom has a weak but negative partial charge, which is repulsive to the negative charged ASP but attractive to the positive charged ARG residue. By comparison, DLIGAND provides an average potential over 20 amino acids. Significant differences also exist for interactions involving non-polar atoms. As shown in Fig. 1b, the CB atom of GLU and the CE atom of LYS belong to C.3 as defined in mol2, despite their very different electrostatic and steric environment. Their interactions with the ligand type N.am are very different when derived independently (DLIGAND2), and enclose the average energy function from DLIGAND.

Docking power
The docking power refers to whether a scoring function can correctly identify the native ligand poses from the predicted poses. Table 2 shows the evaluation results of docking power compared to the results by Li et al. [5] using the same docking sets in the CASF-2013 benchmark. DLIGAND2 achieves 14% improvement in success rate over DLIGAND in detecting native poses as the first ranked pose. Among all methods compared, DLIGAND2 has a moderate performance in term of success rates in ranking the native pose within top 1, 2, and 3 (at 45.1%, 61% and 75.4%, respectively). Nevertheless, DLIGAND2 ranks the second best in all knowledge-based/statistical potential scoring functions, behind ASP@GOLD, but better than PMF@SYBYL, PMF04@DS and PMF@DS. However, ASP@GOLD is not a pure statistical energy function but an empirical mix of a statistical potential with physical-based energetic terms in ChemScore@ GOLD. Thus, DLIGAND2 has the best performance for parameter-free statistical potentials.

Ranking power
The ranking power of a scoring function refers to its ability to correctly rank binders of a given target protein by their predicted binding affinities based on the poses from the crystal structures and optimized structures. Table 3 compares DLIGAND and DLIGAND2 to the evaluation results of other scoring functions of ranking power collected by Li et al. [5]. A high-level success rate indicates a completely correct ranking of all members within each ligand cluster whereas a low-level success rate denotes ranking of the best as top 1 within a cluster. Again, DLIGAND2 has a small improvement over DLIGAND in high level success rates (1.6% on crystal structures and 3% on optimized structures) but identical in low-level success rates. Compared to other statistical potentials (PMF@DF, ASP@GOLD, PMF@SYBYL), DLI-GAND2 has the highest high-level success rate in crystal and optimized structures and the highest low-level success rate in optimized structures but not in crystal structures. This suggests that DLIGAND2 is less sensitive to structural changes, compared to ASP@GOLD that has the large drop in low-level success rate from crystal to optimized structures. Empirical scoring functions such as X-Score and ChemScore@SYBYL have the best performance in this test.

Evaluation results on PDBbind data set
The above benchmark study is based on experimentally determined, protein-ligand complex structures. We further tested DLIGAND2's ability to predict protein-ligand binding affinities by using predicted complex structures from docking. To remove random fluctuations, we generated 10 poses for each pair of protein and ligand by each docking method, and the highest score among 10 poses by each scoring function was used to represent the predicted binding affinity, respectively. As shown in Table 4, when scored by docking methods' own scoring functions, AutoDock Vina yields the best correlation and lowest error with experimental values (PCC of 0.501 and

Table 2 Success rates for the evaluation of docking power ranked by top three poses
Results (excluding DLIGAND2 and DIGAND) cited from Li [5]. The RMSD value between one best-scored binding pose and the native binding pose is less than 2.0 Å

Scoring function Success rates (%)
The top pose Top two poses Top three poses

Table 3 Success rates (%) for the evaluation of ranking power ranked by high-level results on optimized structures
Results (excluding DLIGAND2 and DIGAND) cited from Li [5] Score  [4]. When re-assessed by DLIGAND2, the PCCs of predicted binding affinity consistently improve over all eight docking methods to the levels from 0.498 to 0.537 with an average of 0.523, and the RMSE from 1.69 to 1.76 with an average of 1.71. This indicates the main bottleneck of current docking method is the scoring function, as also disclosed in the previous study [4]. By comparison, DLIGAND can improve PCC values for five docking programs but decrease PCC values for 3 others with an average PCC of 0.455 and RMSE of 1.79, which are 13% lower and 4.7% higher than DLIGAND2, respectively. On the basis of average value, X-Score has a performance comparable to DLIGAND2 in PCC but a slightly higher error in RMSE. It should be noted that X-Score was trained on the complex structures homologous to the CASF-2013 benchmark dataset used here, whereas DLIGNAD2 was trained only by independent monomer structures. We also noted that DLIGAND2 is about 5 times faster than X-Score, which takes 2.7 and 13.3 h, respectively to complete this dataset (a total of 40,440 docking poses) by one CPU core of the Intel E5-2692V2 (2.2 GHz). Here, we did not compare to RF-Score (including RF-Score-v4 [59]), ΔvinaRF 20 and ID-Score because they were trained on the PBDbind refined set.

Evaluation results on DUD-E data set
The DUD-E dataset is used to examine the ability to separate true ligands from decoys, a practically important problem in virtual screening. Here, we employed the DUD-E dataset to evaluate the screening power of scoring functions. The performance of DLIGAND and DLIGAND2 is compared to those of three top ranked scoring functions in the CASF-2013 benchmark (ID-Score, ΔvinaRF 20 , and X-Score) using the poses generated by AutoDock Vina.
As shown in the Table 5 (The detailed data can be found in Additional file 4: Table S4), DLIGAND2 achieved the best performance with an average logAUC of 10.14% and enrichment factors of 6.67 for EF 1% . DLI-GAND2 achieved an average EF 1% of 30% higher than Autodock Vina, 52% and 64% higher than DLIGAND and X-Score, separately, and above 3 times higher than IDscore. The logAUC and enrichment factors of all targets are detailed in Additional file 4: Table S4. Notably, Autodock Vina ranks the 2nd by LogAUC and the first on EF 5% and EF 10% , with EF 1% of 26% and 86% higher than those by X-Score, and ID-Score despite the fact that they can provide higher correlation coefficients than Autodock Vina to experimental binding affinities in the CASF-2013 dataset. This is likely because ID-Score and X-Score were all trained by the PDBbind dataset that are homologous to CASF-2013 dataset. The over-training issues in empirical or machining learning based scoring functions have Table 4 Pearson correlation coefficients and root mean squared error between experimental binding affinity and binding affinity predicted by DLIGAND, DLIGAND2, and X-Score using docking poses generated by eight docking programs along with the results from the docking programs  also been observed in several previous studies [23]. The improvement of DLIGAND2 relative to Autodock Vina is more consistent in this independent test. As for RF-Score, the general version (RF-Score v3) for predicting binding affinity doesn't achieve a good performance with 5.42 for EF 1% [53], ranking even behind DLIGAND. Although RF-Score-VS version specifically trained based on DUD-E was reported to achieve EF 1% values up to 38.96 [53], the per-target cross validation tends to have an over-estimate due to protein homologs between training and test sets [60]. We will employ an external DEKOIS 2.0 dataset to evaluate DLIGAND2 and RF-Score-VS separately below.
To further compare the performance of each scoring function for different protein categories, 102 targets of DUD-E dataset are separated into eight categories and evaluated by the average EF 1% as shown in Table 6. DLI-GAND2 has the highest values of EF 1% in Cytochrome P450, GPCR, Kinase, and Protease. Especially in the category of GPCR and Kinase, DLIGAND2 has obvious advantages compared with other scoring functions by 2.03 times and 1.42 times better than the second ranked methods (ΔvinaRF 20 and DLIGAND), respectively. By comparison, AutoDock Vina performs the best in the ion channel, and is far superior to DLIGAND. The scoring function ΔvinaRF 20 performs the best in miscellaneous, nuclear receptors and other enzymes. DLIGAND2 doesn't perform well in targets of kinesin-like protein 1 (KIFF11, miscellaneous) and poly (ADP-ribose) polymerase-1 (PARP1, other enzymes), likely because their binding ligand contains halogen and phosphate elements that don't appear in training protein chains. Currently, DLI-GAND2 simply treats phosphate elements equivalent to the sulfate atom type. This issue may be solved in future study by including additional ligand atoms from proteinligand complex structures.
Among the best examples of DLIGAND2 performance, we plotted the receiver operating characteristic (ROC) for the case of PTN1 protein (protein-tyrosine phosphatase 1B). As shown in Fig. 3, DLIGAND2 has the highest area under the curve (AUC) of 0.769, followed by AutoDock Vina (0.75), X-Score (0.729) and DLIGAND (0.639). The differences are more significant at lower false positive rate, the most important region for virtual screening. Indeed, the EF 1% are 28.89, 9.12, 18.32, 9.33 and 9.33 for DLIGAND2, Autodock Vina, ΔvinaRF20, DLIGAND and X-Score, respectively. The AUC of ID-Score is 0.553, close to 0.5 by the random selection.

Evaluation results on DEKOIS 2.0 data set
To compare with the latest RF-Score-VS v2 (https ://githu b.com/oddt/oddt) scoring function that was trained on the DUD-E, we have compiled a new dataset from the DEKOIS 2.0 benchmark, with all targets sorted according to their sequence identity to the DUD-E targets according to the blastpgp. Figure 4 plots the average enrichment factor (EF 1% ) as a function of the number of targets sorted according to sequence identity. The average EF 1% for RF-Score-vs increases as the sequence identity increases, suggesting the performance of RF-Score-VS v2 is strongly depending on similarity to its training set. AutoDock Vina also has some dependence on similarity to DUD-E targets. By comparison, DLIGAND and DLIGAND2 have the least dependence except when the number of targets is low (< 10) likely due to natural fluctuations. DLIGAND2 has the highest performance when homologous targets are excluded for sequence identity less then 30% with an average EF 1% at 5.72, compared to 2.34 by AutoDock Vina, 2.73 by RF-Score-VS and 3.25 by DLIGAND.

Conclusion
We have developed a new knowledge-based scoring function DLIGAND2 by extending to 167 atom types for protein atoms from 13 types in the original DLI-GAND. Residue-specific atom types for proteins allow a more accurate description of the interaction of a ligand atom with different residues. To ensure sufficient statistics, DLIGAND2 is based on an updated non-redundant dataset of 12,450 protein chains, 62 times bigger than the dataset (195 structures) used in the original DLIGAND. DLIGAND2 consistently improves over DLIGAND in binding affinity prediction using either native or docking-predicted complex structures. The improvement in Pearson correlation coefficient is 8.7% for the CASF-2013 dataset by using native complex structures and 15% for the PDBbind dataset by using predicted complex structures. In addition, DLIGAND2 has significantly higher enrichment than DLIGAND in discriminating true ligands from decoys using the DUD-E dataset according to re-ranking of docked structures. These results suggest the usefulness of expanding protein atomic types in generating the DLIGAND 2 statistical potential.
DLIGAND2 is the best knowledge-based energy score but not as accurate as a few empirical (X-Score) or machine-learning based (RF-Score-v2 and ID-Score) scores trained by CASF-2013 or PDBbind. The X-Score and ID-Score methods outperform Autodock vina in the CASF-2013 and PDBbind, but they all have lower performance in decoy discrimination, a practically more important problem. We have also shown that the performance of RF-score-vs strongly depends on the sequence identity of the target protein to the dataset for training the method. Though RF-score-vs was reported to perform well in the DUD-E that includes many homologous proteins to its training set, it doesn't perform well on protein targets that are not homologous to its training set. By comparison, DLIGAND2 was derived from only protein monomer structures, ensuring a balanced performance for all targets. Considering the simplicity and fast computation, DLIGAND2 will be useful for re-scoring after docking, or being included as a term for other scoring functions.