Improving structural similarity based virtual screening using background knowledge

Girschick, Tobias; Puchbauer, Lucia; Kramer, Stefan

doi:10.1186/1758-2946-5-50

Research article
Open access
Published: 16 December 2013

Journal of Cheminformatics volume 5, Article number: 50 (2013)

Abstract

Background

Virtual screening in the form of similarity rankings is often applied in the early drug discovery process to rank and prioritize compounds from a database. This similarity ranking can be achieved with structural similarity measures. However, their general nature can lead to insufficient performance in some application cases. In this paper, we provide a link between ranking-based virtual screening and fragment-based data mining methods. The inclusion of binding-relevant background knowledge into a structural similarity measure improves the quality of the similarity rankings. This background knowledge, in the form of binding-relevant substructures, can be derived either by hand selection or by automated fragment-based data mining methods.

Results

In virtual screening experiments we show that both variants of our approach clearly improve enrichment factors: extending the structural similarity measure with background knowledge in the form of a hand-selected relevant substructure, and extending it with background knowledge derived with data mining methods.

Conclusion

Our study shows that adding binding-relevant background knowledge can lead to significantly improved similarity rankings in virtual screening, and that even basic data mining approaches can lead to competitive results, making hand selection of the background knowledge less crucial. This is especially important in drug discovery and development projects where no receptor structure is available or, more frequently, no verified binding mode is known, so that mostly ligand-based approaches can be applied to generate hit compounds.

Background

Medical needs are the starting point for every drug discovery and development project. Apart from the classical in vitro and in vivo studies used in this process, pharmaceutical research relies more and more on in silico methods such as (high-throughput) virtual screening or molecular docking simulations [1, 2]. Computational methods promise to shorten the typically time-consuming efforts that come with the development of new market-approved drug compounds. In the early drug discovery process, virtual screening is used to rank or select compounds from huge databases of potential drug candidates that are later assessed in wet-lab and animal studies. If one or more ligand structures of the target protein are known and available, virtual screening based on ligand similarities can be used to calculate a ranking of candidate compounds in a database. This approach is applied if neither a binding mode of the reported ligands nor an X-ray or NMR structure of the protein target is available, so that receptor-based approaches are not easily accessible. Yet even in these cases virtual screening is certainly a valid orthogonal way to derive interesting and promising structures and scaffolds for the drug discovery pipeline.

In this paper, we present a concept of how structural similarity based methods used in virtual screening can be improved by integrating chemical background knowledge in the form of binding-relevant or informative structural elements. Improvement in this case means higher enrichment of chemical compounds related to the query compound in the similarity ranking of a compound database. Consequently, more potentially biologically active and fewer potentially inactive compounds are selected in virtual screening for further processing in the drug discovery pipeline (e.g. in vitro, in vivo). To achieve an improved enrichment we extract binding-relevant substructures from known ligands and transform them into a fingerprint. This fingerprint is then used to extend a structural similarity measure. We present two approaches to extract the binding-relevant information. First, we use visual inspection of a known ligand as well as literature review to identify binding-relevant substructures. Second, we test a relatively basic data mining approach: we apply the Free Tree Miner (FTM) software [3], which takes a set of two-dimensional chemical structures as input. FTM mines for and returns all substructures that occur frequently (more often than a user-defined minimum support threshold) in the given set. These relevant substructures are then fragmented, and the fragments’ occurrences in a chemical structure are used as bits in a binary occurrence fingerprint. A limitation of the data mining based approach is the need for more than one known ligand (active compound). Its advantages are that it can still be applied if no literature information on the binding-relevant substructures or structural patterns is available, and that it saves human effort.
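The step from frequent substructures to a binary occurrence fingerprint can be illustrated with a minimal sketch. It is not the FTM software and performs no graph mining or fragmentation: each ligand is assumed to be already described by a set of fragment labels, and the function names, labels and threshold below are hypothetical and chosen for illustration only.

```python
from collections import Counter

def frequent_fragments(ligand_fragments, min_support):
    """Return fragments that occur in more than `min_support` ligands.

    `ligand_fragments` is a list of sets of fragment labels (assumption:
    a substructure matcher has already produced these labels).
    """
    counts = Counter(f for frags in ligand_fragments for f in set(frags))
    # Sorting fixes the bit order of the fingerprint.
    return sorted(f for f, c in counts.items() if c > min_support)

def occurrence_fingerprint(mol_fragments, vocabulary):
    """One bit per vocabulary fragment: 1 if it occurs in the molecule."""
    return [1 if f in mol_fragments else 0 for f in vocabulary]

# Toy ligand set with hypothetical fragment labels.
ligands = [
    {"phenyl", "carboxyl", "fluoro"},
    {"phenyl", "carboxyl", "amine"},
    {"phenyl", "carboxyl"},
]
vocabulary = frequent_fragments(ligands, min_support=2)
fingerprint = occurrence_fingerprint({"phenyl", "amine"}, vocabulary)
print(vocabulary, fingerprint)
```

In this toy set only the fragments present in all three ligands pass the threshold, so a database compound is described by two bits, one per retained fragment.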

In our experiments we extend two structural similarity measures with background knowledge and apply them to rank compounds in a database according to their similarity to a known active structure. The first similarity measure is based on the size of the maximum common substructure (MCS; see, e.g., Raymond et al. [4]) of two molecules, the second is based on Extended Connectivity Fingerprints (ECFP) [5]. No other factors such as drug-likeness, cytochrome P450 interaction or physico-chemical properties are used. This enables an isolated view of the effects of the similarity methods used for the rankings. The extended similarity measures are compared to their non-extended versions to assess their performance by calculating enrichment factors for 1%, 5% and 10% of the database.
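As an illustration of the enrichment factors at 1%, 5% and 10%, the following minimal sketch assumes the commonly used definition: the hit rate among the top x% of the ranking divided by the hit rate in the whole database. The function name, the rounding of the cutoff and the toy ranking are assumptions made for this example, not the evaluation code used in the experiments.

```python
def enrichment_factor(ranked_labels, fraction):
    """Enrichment factor for the top `fraction` of a ranking.

    `ranked_labels` is a list of booleans (True = known ligand),
    ordered from most to least similar to the reference compound.
    """
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty ranking with at least one active")
    # Assumption: the cutoff is rounded and contains at least one compound.
    n_top = max(1, round(n_total * fraction))
    hits_top = sum(ranked_labels[:n_top])
    # (hits_top / n_top) / (n_actives / n_total), written with one division.
    return (hits_top * n_total) / (n_top * n_actives)

# Toy ranking: 100 compounds, 5 known ligands at the given rank positions.
active_positions = {0, 2, 4, 30, 60}
ranking = [i in active_positions for i in range(100)]
ef = {f: enrichment_factor(ranking, f) for f in (0.01, 0.05, 0.10)}
print(ef)
```

A value of 1 corresponds to a ranking no better than random selection; larger values mean that known ligands are concentrated at the top of the list.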

We show that adding background knowledge on important binding components of ligands to both the MCS similarity and the ECFP similarity changes the virtual screening ranking in such a way that the top structures have improved docking scores, related structures are ranked at better positions, and clearly improved enrichment factor values are obtained. We also show that replacing the visual inspection and literature search by a data mining approach improves the similarity rankings for most assessed data sets. The data mining approach performs slightly worse than the by-hand approach, but gives competitive results.

The remainder of the paper is organized as follows: In the next section we give detailed information on the data and methods we use for the similarity calculations and on our experimental setup. This is followed by a presentation and discussion of our results before we conclude. Additional result tables can be found in Additional file 1.

Materials and methods

In this section we give detailed information on our experimental setting, on how we extend a similarity measure and on the data sources and evaluation measures used in our virtual screening experiments.

Experimental setup

When virtual screening by means of similarity ranking is performed in a drug discovery project, the similarities of all compounds in the screening database are calculated with respect to one or more known ligands of the protein target (used as reference compounds). The compounds in the database are subsequently sorted according to their similarity scores in descending order so that the compounds most similar to the reference appear first in the ranking. A good similarity measure will find structures that are related to the reference – or that potentially interact with the target protein – in the first few percent of the list. To assess the performance of different similarity measures we mix a set of known ligands into a set of decoys to form a screening database. As reference compound for the similarity rankings we use a randomly selected representative of the known ligands. After applying the standard similarity ranking procedure individually with each similarity measure, we can evaluate the performance of the similarity measures by examining the results for the known ligands in the screening database. The better a similarity measure is, the more known ligands will be in the top section of the ranking.
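The ranking procedure can be sketched as follows. This is a minimal illustration under stated assumptions: compounds are represented by sets of hypothetical feature labels, a Tanimoto coefficient on these sets stands in for the MCS-based and ECFP-based measures, the reference is picked by hand rather than at random, and the reference is left out of the screening database. All names and data are invented for the example.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def rank_database(reference, database):
    """Return compound names sorted by descending similarity to `reference`.

    `database` maps compound names to feature sets; ties are broken by name
    so that the ranking is deterministic.
    """
    scored = [(tanimoto(reference, feats), name) for name, feats in database.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]

# Toy data: known ligands and decoys with hypothetical feature labels.
ligands = {
    "lig_a": {"f1", "f2", "f3"},
    "lig_b": {"f1", "f2", "f4"},
    "lig_c": {"f1", "f2", "f3", "f5"},
}
decoys = {
    "dec_1": {"f6", "f7"},
    "dec_2": {"f1", "f8", "f9"},
    "dec_3": {"f7", "f9"},
}

reference_name = "lig_a"  # in the experiments the reference is selected at random
database = {**ligands, **decoys}
reference = database.pop(reference_name)  # assumption: reference not ranked against itself

ranking = rank_database(reference, database)
ligand_ranks = [ranking.index(name) + 1 for name in ligands if name != reference_name]
print(ranking, ligand_ranks)
```

Here both remaining ligands occupy the first two positions, which is the behaviour expected of a good similarity measure on such a mixed database.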

The experiments on extending a structural similarity measure fall into two lines: line “A” considers the by-hand selection of the binding-relevant information that is used to extend the similarity measure, and line “B” considers the data mining based selection of this information.

Table 1 shows a comparison of the steps necessary to apply the two presented approaches to extend similarity measures and rank a screening database.

Table 1 Overview of the steps necessary to apply the two presented approaches to extend similarities
