Deterministic clustering of the available chemical space

Thiel, Philipp; Peltason, Lisa; Ottmann, Christian; Kohlbacher, Oliver

doi:10.1186/1758-2946-5-S1-P53

Volume 5 Supplement 1

8th German Conference on Chemoinformatics: 26 CIC-Workshop

Poster presentation
Open access
Published: 22 March 2013

Philipp Thiel¹,³, Lisa Peltason², Christian Ottmann¹ & Oliver Kohlbacher³

Journal of Cheminformatics volume 5, Article number: P53 (2013)

Clustering compound libraries by their 2D binary fingerprints is a fundamental task in chemoinformatics, and various methods have been described to solve it [1]. These methods can roughly be grouped into deterministic and non-deterministic approaches, distinguished by two key characteristics. First, deterministic approaches are algorithmically more demanding, whereas non-deterministic methods often use heuristics to save time and memory. Second, deterministic clustering algorithms, especially agglomerative hierarchical techniques, have been shown to yield good results and often perform better than non-deterministic approaches [2]. As a consequence, small to medium-sized libraries of up to 1 million compounds are regularly clustered with deterministic techniques, whereas libraries comprising millions of compounds are mostly clustered using heuristics such as k-means [3].

Here, we present a deterministic approach for clustering huge compound libraries based on all pairwise compound similarities. For this purpose, we use an extremely fast and flexible algorithm for similarity calculations, which we developed to be purely CPU-based and which therefore needs no specialized hardware. Using this similarity method, we implemented a workflow with the following steps. First, we create a set of unique input fingerprints by filtering out duplicates; the duplicates are stored and, at the end, remapped onto the clusters of their representatives. Second, we calculate all pairwise similarities and construct a similarity network, applying a fixed Tanimoto threshold to select the edges to be inserted into the network. Third, the connected subgraphs are extracted from this similarity network and forwarded to the last step. Finally, connected subgraphs exceeding a predefined size are hierarchically clustered.
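The workflow steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: it stores fingerprints as plain Python integers, uses a naive quadratic loop instead of the fast similarity algorithm, and assumes complete linkage with a similarity cut for the hierarchical step, since the abstract names neither the linkage nor the cut. All function names and the parameters `threshold`, `max_size` and `cut` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


def connected_components(n, edges):
    """Group node indices 0..n-1 into connected components via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())


def complete_linkage(members, fps, cut):
    """Naive agglomerative clustering (assumed linkage): merge the two
    clusters whose least similar pair is most similar, until that
    similarity drops below `cut`."""
    clusters = [[m] for m in members]

    def link(c1, c2):
        return min(tanimoto(fps[i], fps[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        s, x, y = max((link(clusters[x], clusters[y]), x, y)
                      for x, y in combinations(range(len(clusters)), 2))
        if s < cut:
            break
        clusters[x] += clusters.pop(y)  # x < y, so index x stays valid
    return clusters


def cluster_library(fingerprints, threshold=0.7, max_size=3, cut=0.8):
    # Step 1: deduplicate, remembering which inputs share each fingerprint.
    inputs_of = defaultdict(list)
    for idx, fp in enumerate(fingerprints):
        inputs_of[fp].append(idx)
    unique = list(inputs_of)

    # Step 2: all pairwise similarities; keep edges at or above the threshold.
    edges = [(i, j) for i, j in combinations(range(len(unique)), 2)
             if tanimoto(unique[i], unique[j]) >= threshold]

    # Steps 3 and 4: extract connected subgraphs, split only oversized ones.
    clusters = []
    for comp in connected_components(len(unique), edges):
        big = len(comp) > max_size
        clusters.extend(complete_linkage(comp, unique, cut) if big else [comp])

    # Remap duplicates onto the cluster of their representative fingerprint.
    return [sorted(i for u in c for i in inputs_of[unique[u]])
            for c in clusters]
```

Calling `cluster_library` on a list of integer fingerprints returns clusters as lists of input indices, with duplicates placed in the same cluster as their representative. The quadratic edge list is only workable for toy inputs; at the scale described here the similarity step dominates, which is why the fast similarity algorithm matters.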

As a result, we show that our algorithm for similarity calculation is competitive with recently published CPU-based methods and can perform up to 380 million Tanimoto calculations per second on a current desktop computer. This efficiency allows our workflow to process medium to large libraries on current desktop computers within minutes. Finally, to demonstrate the power of our clustering workflow, we processed the commercially available chemical space comprising about 17 million compounds [4]. The entire clustering workflow took 63 hours on a compute server using 64 cores and 100 GB of main memory.
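To give a feel for the scale of the all-pairs step, the following back-of-envelope sketch counts the pairs among 17 million fingerprints and divides by the reported peak rate. It assumes that no duplicates are removed and that the peak desktop rate is sustained throughout, so it is only a rough orientation. It is not comparable to the reported 63 hours, which cover the entire workflow on different hardware. The function names are hypothetical.

```python
def pairwise_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def hours_at_rate(n: int, rate_per_second: float) -> float:
    """Hours needed for all pairwise comparisons at a constant rate."""
    return pairwise_count(n) / rate_per_second / 3600


if __name__ == "__main__":
    n = 17_000_000        # about 17 million compounds (from the text)
    rate = 380_000_000    # up to 380 million Tanimoto calculations per second
    print(f"pairs: {pairwise_count(n):.3e}")
    print(f"hours at peak rate: {hours_at_rate(n, rate):.1f}")
```

Under these assumptions there are roughly 1.4 × 10^14 pairs, which take on the order of 100 hours at the peak rate.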

References

[1] Olah MM, Bologa CG, Oprea TI: Strategies for compound selection. Curr Drug Discovery Technol. 2004, 3: 211-220.
[2] Downs G, Willett P, Fisanick W: Similarity Searching and Clustering of Chemical-Structure Databases Using Molecular Property Data. J Chem Inf Model. 1994, 34: 1094-1102. 10.1021/ci00021a011.
[3] Boecker A, Derksen S, Schmidt E, Teckentrup A, Schneider G: A hierarchical clustering approach for large compound libraries. J Chem Inf Model. 2005, 45: 807-815. 10.1021/ci0500029.
[4] Irwin JJ, Sterling T, Mysinger MM, Bolstad E, Coleman RG: ZINC: A Free Tool to Discover Chemistry for Biology. J Chem Inf Model. 2012, 52: 1757-1768. 10.1021/ci3001277.

Author information

Authors and Affiliations

Chemical Genomics Centre of the Max Planck Society, Dortmund, 44227, Germany
Philipp Thiel & Christian Ottmann
F. Hoffmann-La Roche AG, CH-4070, Basel, Switzerland
Lisa Peltason
Applied Bioinformatics, Center for Bioinformatics, Quantitative Biology Center and Dept. of Computer Science, University of Tübingen, Tübingen, 72076, Germany
Philipp Thiel & Oliver Kohlbacher

Corresponding author

Correspondence to Philipp Thiel.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Thiel, P., Peltason, L., Ottmann, C. et al. Deterministic clustering of the available chemical space. J Cheminform 5 (Suppl 1), P53 (2013). https://doi.org/10.1186/1758-2946-5-S1-P53
