Incorporating structural similarity into a scoring function to enhance the prediction of binding affinities

In this study, we developed a novel algorithm that improves the screening performance of an arbitrary docking scoring function by recalibrating the docking score of a query compound according to its structural similarity to a set of training compounds, at negligible extra computational cost. Two popular docking methods, Glide and AutoDock Vina, were adopted as the original scoring functions to be processed with our new algorithm, and similar improvements were achieved for both. Predicted binding affinities were compared against experimental data from the ChEMBL and DUD-E databases. Eleven representative drug receptors from diverse drug target categories were used to evaluate the hybrid scoring function. The effects of four fingerprints (FP2, FP3, FP4, and MACCS) and four compound similarity effect (CSE) functions were explored. Encouragingly, the screening performance was significantly improved for all 11 drug targets, especially when CSE = $S^4$ (where S is the Tanimoto structural similarity) and the FP2 fingerprint were applied. The average predictive index (PI) values increased from 0.34 to 0.66 and from 0.39 to 0.71 for the Glide and AutoDock Vina scoring functions, respectively. To evaluate the performance of the calibration algorithm in drug lead identification, we also imposed an upper limit on the structural similarity to mimic the real scenario of screening diverse libraries, in which query ligands are general-purpose screening compounds that are not necessarily structurally similar to reference ligands. Encouragingly, our hybrid scoring function still outperformed the original docking scoring function. The hybrid scoring function was further evaluated using external datasets for two systems, and the PI values increased from 0.24 to 0.46 and from 0.14 to 0.42 for the A2AR and CFX systems, respectively.
In conclusion, our calibration algorithm can significantly improve virtual screening performance in both the drug lead optimization and identification phases at negligible computational cost.


Introduction
In order to save time and cost in drug discovery projects, various in silico approaches have been developed and applied to reduce the number of compounds that must be experimentally synthesized and tested. Among these computer-aided drug design/discovery (CADD) approaches, molecular docking predicts whether a compound can favorably interact with a receptor (protein or nucleic acid) at its binding site and, if so, determines the binding mode and the binding affinity as measured by the docking scoring function [1,2]. Similarity search is a typical ligand-based virtual screening (LBVS) method, which predicts the activity of query compounds from their similarities or dissimilarities to known reference ligands by using numerical similarity descriptors (fingerprints) [3]. Both docking and similarity methods have been applied successfully, independently or hierarchically, to screen out confidently inactive compounds for specific receptors of interest. Similarity search is even faster than docking and is therefore suitable for filtering very large databases before docking takes place. However, both similarity search and docking suffer from poor accuracy in ranking potentially active ligands and prioritizing top candidates for experiments. Similarity search is based on the Similarity Property Principle (SPP) [4], i.e., structurally similar molecules are likely to possess similar biological properties and activities. However, this hypothesis does not always hold: small chemical differences sometimes give rise to very different activities ('activity cliffs') [5]. The accuracy of docking methods is limited by the lack of modelling of the structural flexibility of target receptors, solvation effects, entropy changes, etc. These limitations of docking and scoring methods may be overcome by more accurate methodologies, such as end-point methods (MM-PBSA, MM-GBSA, LIE, etc.) [6,7] or rigorous alchemical free energy methods (FEP, TI, etc.) [8,9], at the price of much higher computational cost and much longer run time.
This raises the question: is there a way to improve the accuracy of docking and scoring methods without much extra computational cost? With advances in high-throughput screening techniques, more and more compounds have been measured against drug targets. ChEMBL [10] is a curated database that collects the binding affinities of bioactive molecules for drug targets. First, how can we use the information on known structures and activities to improve screening performance? Second, can we combine docking and scoring methods with the extremely fast similarity calculations to improve the accuracy of binding affinity estimation? If so, how can we incorporate the two types of scores into one hybrid scoring function? In this work, we attempted to develop a novel algorithm that makes good use of this valuable information on known bioactive compounds.
Docking programs use scoring functions to estimate binding affinities. Currently, scoring functions can generally be classified into four categories according to how the protein-ligand energy is predicted: (1) force-field-based, (2) empirical, (3) knowledge-based and (4) descriptor-based [11]. Although the underlying mechanisms of the four categories differ, all of these scoring functions aim at reliable prediction of protein-ligand binding. Considering the progress that existing scoring functions have made over the past few decades, we believe it is more valuable to improve on currently available scoring functions. Therefore, instead of developing a completely novel scoring function for binding affinity prediction, we introduce in this work a more practical and universal approach that can improve the scoring power of an arbitrary docking scoring function. Our new scoring algorithm suits the following scenario: the binding affinities against a specific receptor have been experimentally measured for a certain number of compounds (hereafter called reference compounds or reference ligands), but many more compounds (hereafter called query compounds) need to be estimated in silico. Such a scenario is frequently encountered in real drug design projects. Our new algorithm incorporates two components: (1) the structural difference between the query compounds and the reference ligands; (2) the deviation between the docking scores of the reference ligands and their experimental binding affinities. Both components serve as weights to calibrate the original docking scores of ligands.

Work outline
The outline of our work is shown in the flow chart of Fig. 1. To evaluate the feasibility and performance of our algorithm, we collected the X-ray crystal structures of 11 receptors of various categories from the Protein Data Bank [12] (https://www.rcsb.org) and their available ligand data from the ChEMBL database [13,14] (https://www.ebi.ac.uk/chembl/). The compounds collected for each receptor were divided into a reference set (reference ligands) and a validation set (query compounds); see Table 1 and Additional file 1. Compounds were docked to their corresponding targets with the state-of-the-art docking program Glide, which was selected as the basic scoring function for further improvement [15,16]. The Tanimoto coefficient Tc [3] was calculated with the Open Babel program version 2.3.1 (http://openbabel.org) [17]. Four popular 2D fingerprints (FP2, FP3, FP4, MACCS) [17] available in the Open Babel package were adopted and compared in this study. The proposed hybrid scoring function was applied to calibrate the Glide docking scores. The scoring and ranking power of the hybrid scoring function and of the original Glide docking scoring function was measured by the root-mean-square error (RMSE), mean absolute error (MAE), squared Pearson's correlation coefficient ($R^2$) and predictive index (PI) between the docking scores and the experimental binding affinities [18][19][20]. In addition, we evaluated the screening power of the hybrid scoring function by using the enrichment factor (EF) and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve [21]. We also explored the choice of fingerprint type and CSE function form to optimize the screening performance. More details about the preparation of the receptor and ligand datasets, the docking software and procedures, and the evaluation methods are given in the following sections.

Preparation of receptor datasets
To test the algorithm developed in this study, 11 receptors were selected according to the experimental records in the ChEMBL database. The receptors can be divided into three classes: (1) 6 top receptors from diverse categories; (2) 4 top receptors from the GPCR family; (3) one RNA receptor. Coagulation factor X (CFX), dopamine D2 receptor (D2R), µ opioid receptor (MOR), extracellular signal-regulated kinase 2 (ERK2), vascular endothelial growth factor receptor 2 (VEGFR2) and estrogen   (3), the ribosomal RNA (rRNA) A-site was selected as the receptor. The receptor categories were selected according to the number of members in each category family, while the receptors themselves were selected according to the number of compounds recorded in binding assays. This information was collected from the ChEMBL database and is shown in Table 1. The X-ray crystal structures of the 11 target receptors were then retrieved from the Protein Data Bank (detailed information is given in Additional file 1: Table S1). The source organism of all targets is Homo sapiens.

Preparation of ligand datasets
To better compare the binding energies of ligands that bind to the same receptor, we collected 3D structures (SDF format) of ligands with inhibition constant ($K_i$) values recorded from binding assays in the ChEMBL database. For the ribosomal RNA receptor, because of the limited number of $K_i$ records, compounds with dissociation constant ($K_d$) values were collected. Note that the activities of the compounds collected for each receptor were measured using the same methods. To balance the distribution of $K_i$ activities for each receptor, compounds were hierarchically classified into 4 levels according to their $K_i$ values: $K_i$ < 10 nM, 10 nM ≤ $K_i$ < 1 µM, 1 µM ≤ $K_i$ < 100 µM and $K_i$ ≥ 100 µM. In each level, up to 300 compounds (all of them if the level contained fewer than 300) were randomly selected using numpy.random.choice in Python 3.7 [22]. For a selected compound with two or more $K_i$ values from different assays, the average $K_i$ was used. To evaluate the screening power of our approach, we categorized the selected compounds into active and inactive sets using a cutoff of $K_i$/$K_d$ = 100 nM. This value is lower than the usual threshold of 10 µM, but it better balances the numbers of compounds in the active and inactive sets. The experimental $K_i$/$K_d$ of each collected compound was then converted to the experimental ligand-receptor binding energy (kcal/mol) by Eq. (1):

$$\Delta G_{\text{binding}} = RT \ln K_i \qquad (1)$$

where $\Delta G_{\text{binding}}$ is the binding energy of the ligand, R is the gas constant with a value of 8.314 J mol$^{-1}$ K$^{-1}$ and T is room temperature under standard pressure, with a value of 298.15 K.

To exclude compounds with very weak binding affinities and make our evaluation more reliable, compounds with an experimental binding energy higher than −4 kcal/mol were removed from the selected datasets of all 11 receptors. We then randomly separated the compounds into training and testing datasets in a 4:1 proportion. The numbers of compounds in the training and testing sets for each receptor are listed in Table 1. The structures of the compounds were stored in mol2 format and also converted into different 2D fingerprints for the further exploration of our algorithm with the Open Babel program, a chemical toolbox for the interconversion of chemical data formats [23]. Four 2D fingerprints are available in Open Babel according to its documentation (http://openbabel.org/docs/dev/Features/Fingerprints.html): (1) FP2, a path-based fingerprint stored in a 1024-bit vector; (2) FP3, a series of SMARTS queries stored in 55 bits; (3) FP4, a series of SMARTS queries stored in 307 bits; (4) MACCS, a series of SMARTS patterns stored in 166 bits.
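The conversion from $K_i$ to binding energy, the −4 kcal/mol filter and the 4:1 split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names `ki_to_dg` and `prepare_dataset`, the input layout (a dictionary of $K_i$ values in nM) and the use of the standard-library `random` module in place of `numpy.random.choice` are all assumptions made to keep the example self-contained.

```python
import math
import random

# Gas constant converted from J mol^-1 K^-1 to kcal mol^-1 K^-1 (1 kcal = 4184 J).
R_KCAL = 8.314 / 4184.0
T_KELVIN = 298.15


def ki_to_dg(k_molar):
    """Binding energy in kcal/mol from K_i (or K_d) in mol/L: dG = RT ln(K)."""
    return R_KCAL * T_KELVIN * math.log(k_molar)


def prepare_dataset(ki_nm_by_compound, dg_cutoff=-4.0, train_fraction=0.8, seed=0):
    """Average repeated K_i values, convert to dG, filter weak binders, split 4:1.

    ki_nm_by_compound is a hypothetical input: {compound name: [K_i in nM, ...]}.
    """
    dg = {}
    for name, values_nm in ki_nm_by_compound.items():
        mean_ki_molar = sum(values_nm) / len(values_nm) * 1e-9
        energy = ki_to_dg(mean_ki_molar)
        # Compounds with binding energy higher than the cutoff are removed.
        if energy <= dg_cutoff:
            dg[name] = energy
    names = sorted(dg)
    random.Random(seed).shuffle(names)
    n_train = int(round(train_fraction * len(names)))
    return dg, names[:n_train], names[n_train:]


if __name__ == "__main__":
    # The 100 nM activity cutoff corresponds to roughly -9.5 kcal/mol.
    print(round(ki_to_dg(100e-9), 2))
```

With these constants, RT is about 0.592 kcal/mol, so the −4 kcal/mol filter corresponds to a $K_i$ of roughly 1 mM.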
To critically evaluate the performance of our calibration algorithm, we performed an extra validation test on compounds with activities ($K_i$) against A2AR and CFX from an additional database, DUD-E [24,25]. After excluding the compounds with a binding energy higher than −4 kcal/mol and those already included in the reference datasets, two external test datasets containing 1973 and 1599 unique compounds were compiled for the A2AR and CFX systems, respectively. All compounds from DUD-E were treated as query molecules, and their docking scores were calibrated using the compounds in the corresponding ChEMBL dataset as references.

Docking software and procedure
We docked the selected compounds (including the training sets, test sets and external test sets) to their corresponding receptors using the Glide docking program implemented in the Schrödinger software (Maestro 11.2). Before docking, the downloaded SDF files of the ligands were processed with the LigPrep module in Maestro. The downloaded PDB files of the receptors were processed with the Protein Preparation Wizard module in Maestro: co-crystallized solvent and ions were removed, hydrogen atoms and missing side-chain atoms were added, and the hydrogen atoms were energy-minimized. We then defined the binding site based on the geometric center of the native bound ligand, without taking constraints or rotatable groups into consideration. Flexible docking with post-docking minimization was conducted for the 11 systems with the following settings: the van der Waals radius scaling factor was 0.80, the partial charge cutoff for ligands was 0.15, intramolecular hydrogen bond formation was rewarded, and the number of poses to include per ligand was 10. The top binding pose with the best docking energy score was retained and stored.
To investigate the impact of different docking scoring functions on our calibration algorithm, the performance of our hybrid scoring functions was also evaluated for the AutoDock Vina docking scoring function [26]. Again, the selected compounds were docked to their corresponding receptors, this time with the AutoDock Vina docking program. The receptors were prepared following the same protocol as for the Glide docking program. The binding site and search space were defined based on the geometric center and the size of the native bound ligand, without taking constraints or rotatable groups into consideration. Considering the different docking mechanisms of the two programs, in this case compounds with an experimental binding energy higher than −5 kcal/mol were removed from the selected datasets to exclude compounds with weak binding affinities. The compounds were then randomly separated into training and testing datasets in the same proportion, and the docking scores were calibrated using our proposed calibration approach. The Glide and AutoDock Vina docking scores as well as the experimental binding affinities (converted from $K_i$ values) of the compounds in the reference and validation sets are listed in Additional file 2: Table S2A and Additional file 3: Table S2B.

Algorithm for docking score calibration
The new algorithm we propose to calibrate the docking scores from a standard docking program is described below, where $DS_j^0$ and $DS_j$ are the docking scores of the jth query compound before and after the calibration, respectively. $S_{ij}$ is the structural similarity between the jth query compound and the ith reference ligand. The exponent p is treated as an integer constant whose value varies from 1 to 4 in this study, for the exploration of the developed formula. We refer to $S_{ij}^p$ as the compound similarity effect (CSE) function for convenience of discussion. n is the total number of reference ligands in the reference dataset. $DS_i$ is the docking score of the ith reference ligand. $\Delta G_i$ is the experimental binding energy (kcal/mol) of the ith compound in the reference dataset, which is converted from the experimental $K_i$/$K_d$ by Eq. (1).
In this study, the structural similarity $S_{ij}$ between two compounds is represented by the Tanimoto coefficient (Tc) calculated from their 2D fingerprints [3]:

$$T_c = \frac{z}{x + y - z}$$

where x and y are the numbers of bits set in the fingerprints of compounds X and Y, respectively, and z is the number of set bits shared by compounds X and Y. Tc ranges between 0 and 1, and a larger value means higher structural similarity between two compounds. The Tc calculation was carried out with Open Babel under a Python 3.7 environment.
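The Tanimoto definition above can be illustrated with a few lines of Python. This sketch represents a fingerprint as a set of "on" bit positions, which is an illustrative assumption; in the study the fingerprints and Tc values were produced by Open Babel, and the function name `tanimoto` is hypothetical.

```python
def tanimoto(bits_x, bits_y):
    """Tanimoto coefficient z / (x + y - z) for two sets of 'on' bit positions."""
    x = len(bits_x)            # bits set in fingerprint X
    y = len(bits_y)            # bits set in fingerprint Y
    z = len(bits_x & bits_y)   # bits set in both fingerprints
    union = x + y - z
    if union == 0:
        # Two empty fingerprints: returning 0.0 is a convention chosen here.
        return 0.0
    return z / union


if __name__ == "__main__":
    fp_a = {1, 2, 3, 4}
    fp_b = {3, 4, 5}
    # x = 4, y = 3, z = 2, so Tc = 2 / (4 + 3 - 2)
    print(tanimoto(fp_a, fp_b))
```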
As a simple example of how the algorithm works, assume there are only two reference compounds, i and j, whose docking scores are −8.0 and −10.0 kcal/mol and whose experimental values are −7.0 and −8.0 kcal/mol, respectively. The docking score of the query compound is −9.0 kcal/mol, and the similarities between the query compound and reference compounds i and j are 0.9 and 0.5, respectively. Assuming p is 4, the CSE values of reference compounds i and j are $0.9^4 \approx 0.66$ and $0.5^4 \approx 0.06$, respectively, and the new docking score of the query compound is obtained from them by the calibration. Evidently, reference compound i has more impact than j on the docking score calibration for this query compound. Note that our algorithm may not always improve the performance of the docking score; the docking score may become worse after the calibration. However, we expect the similarity-based calibration to improve the binding affinity prediction in most scenarios, given a reference set of sufficient size.
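The worked example can be reproduced numerically once a concrete form of the calibration is fixed. The calibration equation itself is not reproduced in this text, so the sketch below assumes one plausible form consistent with the symbol definitions: the query score is shifted by the $S_{ij}^p$-weighted average of the reference residuals $DS_i - \Delta G_i$, normalized by the sum of the weights. Both the normalization and the function name `calibrate` are assumptions, not the authors' confirmed formula.

```python
def calibrate(ds_query, similarities, ds_refs, dg_refs, p=4):
    """Calibrate one docking score under an ASSUMED weighted-average form:

        DS_j = DS_j^0 - sum_i S_ij^p (DS_i - dG_i) / sum_i S_ij^p

    similarities, ds_refs and dg_refs are parallel lists over reference ligands.
    """
    weights = [s ** p for s in similarities]      # CSE values S_ij^p
    total = sum(weights)
    if total == 0.0:
        # No similar reference at all: leave the original score unchanged.
        return ds_query
    correction = sum(
        w * (ds - dg) for w, ds, dg in zip(weights, ds_refs, dg_refs)
    ) / total
    return ds_query - correction


if __name__ == "__main__":
    # Two references (i, j) from the example in the text, p = 4.
    new_score = calibrate(
        ds_query=-9.0,
        similarities=[0.9, 0.5],
        ds_refs=[-8.0, -10.0],
        dg_refs=[-7.0, -8.0],
        p=4,
    )
    print(round(new_score, 3))  # about -7.913 under this assumed form
```

Under this assumed form, the residual of reference i (−1.0 kcal/mol) dominates the correction because its weight, 0.6561, is more than ten times that of reference j, 0.0625.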

Performance evaluation
To reduce the systematic error in the calculation, the random separation of compounds into training and testing sets before the calibration was repeated ten times for each target. The mean value and 95% CI of all performance metrics were then calculated. To evaluate the scoring and ranking performance of our algorithm, for each receptor the docking scores of the compounds in the test set were compared with their experimental energies using four measurements: RMSE, MAE, $R^2$ and PI. Comparing the mean calibrated docking score with the original docking score, the scoring function was considered improved if RMSE and MAE decreased while $R^2$ and PI increased. We also calculated the difference between the calibrated and the original docking scores for each metric, denoted dRMSE, dMAE, d$R^2$ and dPI, respectively. To evaluate the screening power, the AUC of the ROC curve and the enrichment factor (EF) at the 10% (EF$_{10\%}$) and 40% (EF$_{40\%}$) levels were adopted as the performance metrics. Compared with the plain docking scoring function, our calibration algorithm was considered to have better screening power if AUC and EF increased. We applied the same protocols to evaluate the performance of the calibration algorithm on the datasets of the A2AR and CFX targets from the DUD-E database, except that the enrichment factors were calculated with different hit rates. Because the datasets for these two targets in the DUD-E database are relatively large, we used EF$_{1\%}$ and EF$_{10\%}$ to evaluate the screening power. The equations for the calculation of PI and EF are described in detail in the Supplementary Information. We defined two scenarios, "focus library" and "diverse library", which are applicable to drug lead optimization and lead identification in drug discovery, respectively, to evaluate the algorithm by limiting the range of Tc.
In other words, the training compounds that did not meet the Tc range criterion were excluded from the calibration. In the "focus library" scenario, we set a lower bound on Tc for each system; below this threshold, Tc is too low to improve the performance of the scoring function. In the "diverse library" scenario, we set an upper bound on Tc in addition to the lower bound. We randomly collected 10,000 screening compounds from the ZINC database [27] (https://zinc.docking.org/) and calculated the structural similarity between each testing compound and each screening compound. The Tc value below which more than half of the screening compounds fall was selected as the upper bound. Note that this is a very stringent way to determine the Tc upper bound.
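The scoring, ranking and screening metrics named in this section can be sketched as below. The exact equations used by the authors are given in their Supplementary Information and are not reproduced in this text, so the PI here follows the commonly used pairwise definition (weights equal to the absolute difference of experimental energies) and the EF follows the usual ratio of hit rates; both are assumptions, as are all function names. Lower (more negative) scores are assumed to rank first.

```python
import math


def rmse(pred, exp):
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, exp)) / len(exp))


def mae(pred, exp):
    return sum(abs(p - e) for p, e in zip(pred, exp)) / len(exp)


def r_squared(pred, exp):
    """Square of Pearson's correlation coefficient."""
    n = len(exp)
    mean_p, mean_e = sum(pred) / n, sum(exp) / n
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(pred, exp))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_e = sum((e - mean_e) ** 2 for e in exp)
    if var_p == 0 or var_e == 0:
        return 0.0
    return cov * cov / (var_p * var_e)


def predictive_index(pred, exp):
    """Pairwise ranking agreement, weighted by |E_j - E_i| (assumed definition)."""
    num = den = 0.0
    for i in range(len(exp)):
        for j in range(i + 1, len(exp)):
            d_exp = exp[j] - exp[i]
            d_pred = pred[j] - pred[i]
            weight = abs(d_exp)
            if d_pred == 0 or d_exp == 0:
                sign = 0
            else:
                sign = 1 if d_exp * d_pred > 0 else -1
            num += weight * sign
            den += weight
    return num / den if den else 0.0


def enrichment_factor(scores, is_active, fraction):
    """Hit rate in the top-scoring fraction divided by the overall hit rate."""
    n = len(scores)
    n_top = max(1, int(round(n * fraction)))
    order = sorted(range(n), key=lambda k: scores[k])  # most negative first
    hits_top = sum(1 for k in order[:n_top] if is_active[k])
    hits_all = sum(1 for flag in is_active if flag)
    if hits_all == 0:
        return 0.0
    return (hits_top / n_top) / (hits_all / n)
```

A PI of 1 means every pair of compounds is ranked in the experimental order, −1 means every pair is inverted, and values near 0 indicate essentially random ranking.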

The impact of fingerprint and CSE function
We first studied the calibration performance using the Glide scoring function. For all 11 systems, the scoring power measured by RMSE and MAE and the ranking power measured by $R^2$ and PI of the original docking scoring function and of the hybrid scoring functions with different fingerprints and CSE functions are shown in Additional file 1: Tables S3, S4 and Fig. 2. According to Additional file 1: Tables S3, S4, the developed algorithm improves the accuracy of the original docking score for most systems, whichever fingerprint or CSE function is used. Specifically, when the FP2 fingerprint was used for the similarity calculation, the docking scores improved regardless of the system or CSE function. Similarly, when CSE = $S_{ij}^4$, the performance of the scoring function was enhanced for all 4 fingerprints. The comparison of the performance of the algorithm with different fingerprints and CSE functions is illustrated in Fig. 2. For most systems, the enhancement is largest for p = 4 and decreases as p goes from 4 to 1. As to fingerprint type, Fig. 2 also shows that FP2 stands out, as it mostly gives lower RMSE and MAE and higher $R^2$ and PI than the other fingerprints.
To test the generalizability of our calibration algorithm, we also studied the calibration performance using docking scores generated by another commonly used docking program, AutoDock Vina. As shown in Additional file 1: Table S5 and Figure S1, the same conclusion was reached for this docking program, i.e., $S^4$ outperforms the other CSE functions and FP2 outperforms the other fingerprint types.
It is easy to understand why the performance of the algorithm improves as p increases. As p rises, the impact of the reference compounds that are structurally similar to the query compound increases, while the impact of the reference compounds with low similarity decreases. As such, the weight of the similarity contribution grows as the power of the similarity in the formula increases. It can be expected that with p higher than 4 the algorithm might improve the scoring function even further. However, to balance the contributions from the original docking scores and the compound similarity effect, we capped p at 4.
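The effect of the exponent p on the relative influence of each reference compound can be made concrete with a short calculation. The sketch assumes the CSE values $S_{ij}^p$ are compared as relative shares (each divided by their sum); that normalization is an assumption made for illustration, and `relative_weights` is a hypothetical helper.

```python
def relative_weights(similarities, p):
    """Share of each reference ligand's CSE value S^p in the total (assumed form)."""
    raw = [s ** p for s in similarities]
    total = sum(raw)
    if total == 0.0:
        return [0.0 for _ in raw]
    return [w / total for w in raw]


if __name__ == "__main__":
    # One highly similar reference (Tc = 0.9) and one moderately similar (Tc = 0.5).
    for p in range(1, 5):
        shares = relative_weights([0.9, 0.5], p)
        print(p, [round(w, 3) for w in shares])
```

Running the loop shows the share of the more similar reference growing monotonically with p, which mirrors the explanation given in the text.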
Another factor that affects the performance of the algorithm is the type of fingerprint. The different underlying mechanisms of the fingerprints are likely to explain their different effects. Unlike FP3, FP4 and MACCS, which are substructure-based fingerprints built on sets of SMARTS patterns, FP2 is a path-based fingerprint that indexes small-molecule fragments based on linear segments of up to 7 atoms, which might explain why FP2 performed better in our algorithm [28]. As FP2 is more specific and can be used in any initial chemical search, we assume that the similarity calculation based on FP2 amplifies the weights of structurally similar references to a greater extent and better offsets the shortcomings of traditional scoring functions. For example, traditional docking scores tend to be high on average even for ligands with relatively low binding affinity, leading to insufficient differentiation of docking results. On the other hand, the different performances of the three substructure-based fingerprints (FP3, FP4 and MACCS) probably have more complicated causes. One probable reason is their different numbers of descriptors. For example, FP3 improved the scoring function only marginally, which might be explained by its limited size of 55 bits, versus 307 bits for FP4 and 166 bits for MACCS in Open Babel [29].
The calibration performance using the best hybrid function is similar for the two docking scoring functions (Table 2). The improvement in screening power obtained with our calibration algorithm further validated our approach. As shown in Table 3, EF$_{10\%}$ and EF$_{40\%}$ of the docking results after the calibration are better than those before the calibration in all scenarios except EF$_{10\%}$ of VEGFR2, for which the value before the calibration is slightly better (2.18 vs. 2.15). Given the small sample size for the rRNA receptor, it is reasonable to find significantly larger enrichment factor values for this target. After excluding the rRNA target, the mean EF$_{10\%}$ and EF$_{40\%}$ increased on average by about 25% and 22%, respectively, after the calibration of the docking scores. Similarly, the mean AUC of the docking results after the calibration is significantly improved for all the targets, as shown in Table 3 and illustrated in Fig. 3. Excluding the rRNA target, the AUC improved on average by approximately 20% after the calibration. Of note, the measurement of screening power may be biased for some drug targets because of the imbalance between the numbers of actives and inactives; for example, in some cases the actives account for only about 12% of the total compounds in the test sets.

The impact from receptor categories on the calibration performance
Table 2 Mean RMSE (kcal/mol), MAE (kcal/mol), $R^2$ and PI before and after calibration using the best hybrid scoring function (FP2 fingerprint with CSE = $S^4$) for 11 receptors. "cali" and "orig" represent the calibrated and original docking scores, respectively.

As shown in Table 2, Fig. 2 and Additional file 1: Figure S1, the baseline docking performance differs among the systems. Taking the Glide docking scoring function as an example, for the ERK2 drug target the performance of the original docking results is acceptable, with a low mean MAE (1.20 kcal/mol) and RMSE (1.55 kcal/mol) and a relatively high mean $R^2$ (0.35) and PI (0.74). On the other hand, for some receptors such as MOR, the performance of the original docking results is unsatisfactory, with a high mean MAE (3.28 kcal/mol) and RMSE (3.94 kcal/mol) and a low mean $R^2$ (0.02) and PI (0.11). Therefore, to compare the improvement achieved by the algorithm across systems independently of their different baselines, we quantified the difference between the calibrated and original docking scores using dRMSE, dMAE, d$R^2$ and dPI, which measure the change in each metric before and after the calibration (Additional file 1: Table S4 and Fig. 4). The extent of the improvement varies from system to system, which indicates that although the algorithm is applicable to various receptors, its effect still depends on the receptor to some extent. The improvement in scoring power and ranking power is more prominent for systems with poor docking performance, such as D2R, MOR, 5HT2AR and CB1. After the calibration using the best hybrid scoring function, the PI increases by 356%, 273%, 174% and 220% for these four systems, respectively (Table 2).
After the calibration, not only is the docking performance enhanced, but the standard deviations and CIs of the performance metrics across the different drug receptors are also reduced.

Application of calibration in drug lead identification
Fig. 3 The mean ROC curves of screening results before and after calibration of Glide docking scores using the best hybrid scoring function (FP2 fingerprint with CSE = $S^4$) for 11 receptors. "cali" and "orig" represent the calibrated and original docking scores, respectively.

Above, we showed that our hybrid scoring function can enhance the screening performance for focused compound libraries in drug lead optimization. Next, we evaluated how well the best hybrid function (FP2 fingerprint and CSE = $S^4$) performs for diverse compound libraries in drug lead identification, using two studies. First, we created "diverse libraries" by imposing an upper limit on Tc as described in the Methods section. When calibrating the docking score of a query compound, we applied an upper limit on Tc so that reference compounds structurally similar to the query compound were excluded from the calibration. Encouragingly, even with relatively low upper-bound Tc values, our best hybrid scoring function still enhanced the docking performance for all receptors, as shown in Table 4, although the extent of the enhancement became much smaller, as expected.
The average values of the mean MAE and RMSE decreased by 14.1% and 14.7%, respectively, and $R^2$ and PI increased by 26.7% and 17.1%, respectively, after the calibration. In a real situation, the enhancement may be significantly larger, as demonstrated in the second study.
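The Tc window used to build the "focus library" and "diverse library" scenarios amounts to a filter on the reference ligands before calibration. The sketch below is one reading of that procedure: taking the upper bound as the median similarity to the screening compounds and treating both bounds as inclusive are assumptions, and the function names are hypothetical.

```python
import statistics


def upper_bound_from_screening(similarities_to_screening):
    """Assumed reading of the text: the Tc value that about half of the
    screening compounds fall below, i.e. the median similarity."""
    return statistics.median(similarities_to_screening)


def select_references(similarities, tc_lower, tc_upper=1.0):
    """Indices of reference ligands whose Tc to the query lies in the window.

    'Focus library': only tc_lower is set (tc_upper stays at 1.0).
    'Diverse library': both bounds are set. Inclusive bounds are an assumption.
    """
    return [
        index
        for index, tc in enumerate(similarities)
        if tc_lower <= tc <= tc_upper
    ]


if __name__ == "__main__":
    tc_to_references = [0.10, 0.30, 0.60, 0.90]
    print(select_references(tc_to_references, tc_lower=0.2))                # focus
    print(select_references(tc_to_references, tc_lower=0.2, tc_upper=0.7))  # diverse
```

Only the references returned by the filter would then enter the calibration of that query compound.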
In the second study, we recalculated the docking scores of the Glide scoring function for a set of external test compound libraries collected from the DUD-E database for two drug targets, A2AR and CFX. Unlike in the first study, we did not impose an upper bound on Tc when selecting reference compounds, in order to mimic the real situation in virtual screening studies. However, we eliminated from the test compound libraries all entries that duplicated reference compounds. The performance of the best hybrid docking scoring function is summarized in Additional file 1: Tables S6, S7 and shown in Additional file 1: Figures S2, S3. For A2AR, the MAE and RMSE dropped from 1.44 to 1.05 kcal/mol and from 1.78 to 1.37 kcal/mol, respectively; for CFX, the two scoring power metrics decreased from 2.71 to 1.66 kcal/mol and from 3.25 to 2.13 kcal/mol. Similarly, the ranking power metrics $R^2$ and PI increased significantly for both systems. For A2AR, $R^2$ changed from 0.05 to 0.19 and PI from 0.24 to 0.46 (a 92% increase); for CFX, $R^2$ changed from 0.02 to 0.16 and PI from 0.14 to 0.42 (a 200% increase). As for the screening power, EF$_{1\%}$ and EF$_{10\%}$ increased from 1.18 to 1.35 and from 1.12 to 1.32, respectively, for CFX, and from 0.98 to 1.06 and from 1.13 to 1.28, respectively, for A2AR. Finally, the AUC values increased from 0.58 to 0.71 for CFX and from 0.61 to 0.71 for A2AR.
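The quoted relative increases in PI follow directly from the before and after values, taking the relative change as (after − before) / before:

```latex
\[
\text{A2AR: } \frac{0.46 - 0.24}{0.24} \approx 0.92 \;(92\%),
\qquad
\text{CFX: } \frac{0.42 - 0.14}{0.14} = 2.00 \;(200\%).
\]
```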
Taken together, in the scenario of drug lead identification, our calibration algorithm can still significantly improve the docking performance as measured by MAE and RMSE for scoring power, $R^2$ and PI for ranking power, and EF and AUC for screening power. On the other hand, we point out that our method is based on docking results; hence the final ranking of compounds after calibration may not meet the high standard of correctly ranking and prioritizing top compounds required in the subsequent lead optimization stage, for which more rigorous but much more expensive methods, such as alchemical free energy calculations using free energy perturbation [9] and thermodynamic integration [30,31], are usually adopted.

Conclusions
Table 3 Mean AUC, EF$_{10\%}$ and EF$_{40\%}$ of screening results before and after calibration of Glide docking scores using the best hybrid scoring function (FP2 fingerprint with CSE = $S^4$) for 11 receptors. "cali" and "orig" represent the calibrated and original docking scores, respectively. The average numbers of compounds allocated to the active and inactive sets are also shown.

In summary, we developed a novel algorithm for quickly improving the scoring power and ranking power of a general scoring function used in a docking program by calibrating the docking score according to the structural similarities between the query compound and a set of reference compounds whose experimental binding affinities are known. Thus, we successfully developed an algorithm that integrates structure-based docking scores and ligand-based structural similarity scores into a hybrid scoring function and makes good use of known experimental values. With more and more measured binding affinity data being collected in public databases such as ChEMBL, our calibration algorithm could find increasingly broad application in structure-based drug design. After all, the significantly enhanced performance is achieved by a simple calibration algorithm whose computational cost is negligible.
Additional file 1: Table S1 lists the name, entry code, resolution, release date and deposition author for each receptor studied in this paper. Table S3 lists the RMSE, MAE, $R^2$ and PI values before and after calibration of the Glide docking scores for different CSE functions and fingerprints. Table S4 lists the differences in the docking performance metrics before and after the calibration, i.e., dRMSE, dMAE, d$R^2$ and dPI, for the Glide scoring function. Table S5 lists and Figure S1 shows the RMSE, MAE, $R^2$ and PI values before and after calibration of the AutoDock Vina docking scores for different CSE functions and fingerprints. Table S6 shows the RMSE, MAE, $R^2$ and PI values before and after calibration of the Glide docking scores for compounds in the external test sets from the DUD-E database. Figure S2 shows the comparison of the RMSE, MAE, $R^2$ and PI values before and after the calibration of the Glide docking scores for the A2AR and CFX external test sets using the best hybrid scoring function (FP2 fingerprint with CSE = $S^4$). Table S7 displays the AUC, EF$_{1\%}$ and EF$_{10\%}$ values before and after calibration of the Glide docking scores for compounds in the external test sets from the DUD-E database. Figure S3 shows the ROC curves before and after calibration of the Glide docking scores for the A2AR and CFX external test sets.
Additional file 2: Table S2A. Glide docking scores (kcal/mol) and experimental energies (kcal/mol) of selected compounds for all 11 targets in reference set and validation set.
Additional file 3: Table S2B. AutoDock Vina docking scores (kcal/mol) and experimental energies (kcal/mol) of selected compounds for all 11 targets in reference set and validation set.