Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2

Ehrlich, Hans-Christian; Rarey, Matthias

doi:10.1186/1758-2946-4-13

previous work on this topic

Andrew Dalke, Andrew Dalke Scientific, AB

6 August 2012

Ehrlich and Rarey state "We presented, to our knowledge, the first comparison between Ullmann and VF2 subgraph isomorphism algorithm on molecular data and the first data set to perform such a benchmark." I wish to point out previous research on this topic.

The CDK and RDKit projects compared the VF2 algorithm performance with the Ullmann algorithm, and published their results online. A 2008 summary of the CDK work reports a speedup of about 5x.

The RDKit work was summarized in 2009, with a follow-up. The follow-up specifically includes a test set of "450 queries and 1000 molecules", and concludes "So the search time has been more or less halved."

Both conclude that the VF2 algorithm is faster than Ullmann, and both implementations have switched to using VF2 instead of Ullmann.

Secondly, this paper essentially compares two algorithm implementations, yet those implementations do not seem to be available as part of the supplementary data. The authors are aware of that, but believe that the data presented is sufficient to "allow general conclusions." I disagree.

While we know that the timings were done using an "Intel(R) Xeon(R) CPU E5630 2.53GHz cluster node", we know nothing about the implementation language, the compiler used, choice of compiler options, and other factors which are well-known issues that affect performance numbers. The performance numbers themselves can be sensitive to small implementation differences. From the CDK link above, I read that sorting the atoms by element type can make a 50% difference in performance, and from the RDKit links I read how an extra data transformation can make VF2 slower than Ullmann. But I have no idea if these techniques were also applied to the implementations under study.
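To illustrate how sensitive such numbers can be to atom ordering, here is a minimal sketch. It is a naive backtracking matcher, not Ullmann or VF2, and it is not taken from CDK, RDKit, or the paper. The function name `count_matches`, the toy molecule, and the "rare element first" ordering are all assumptions made for illustration. It counts how many candidate target atoms are tried for two different query atom orders.

```python
def count_matches(q_labels, q_edges, t_labels, t_edges, order):
    """Naive backtracking substructure matcher (hypothetical; not Ullmann or VF2).

    Returns (number of embeddings found, number of candidate target atoms tried).
    `order` is the sequence in which query atoms are assigned.
    """
    q_adj = {i: set() for i in range(len(q_labels))}
    for a, b in q_edges:
        q_adj[a].add(b)
        q_adj[b].add(a)
    t_adj = {i: set() for i in range(len(t_labels))}
    for a, b in t_edges:
        t_adj[a].add(b)
        t_adj[b].add(a)

    mapping = {}   # query atom -> target atom
    used = set()   # target atoms already assigned
    stats = {"matches": 0, "tried": 0}

    def extend(depth):
        if depth == len(order):
            stats["matches"] += 1
            return
        q = order[depth]
        for t in range(len(t_labels)):
            if t in used:
                continue
            stats["tried"] += 1
            if t_labels[t] != q_labels[q]:
                continue
            # Every already-mapped query neighbour must map to a neighbour of t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                used.add(t)
                extend(depth + 1)
                del mapping[q]
                used.discard(t)

    extend(0)
    return stats["matches"], stats["tried"]


# Toy target: a chain of eight carbons ending in one nitrogen.
target_labels = ["C"] * 8 + ["N"]
target_edges = [(i, i + 1) for i in range(8)]

# Toy query: C-N.
query_labels = ["C", "N"]
query_edges = [(0, 1)]

carbon_first = count_matches(query_labels, query_edges,
                             target_labels, target_edges, order=[0, 1])
nitrogen_first = count_matches(query_labels, query_edges,
                               target_labels, target_edges, order=[1, 0])
print("carbon first:  ", carbon_first)
print("nitrogen first:", nitrogen_first)
```

Both orders find the same single match, but starting from the rarer element tries far fewer candidates. Whether a comparable ordering was applied in a given implementation is exactly the kind of detail that cannot be judged without the source code.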

As an example of what can be done using publicly available source code: the RDKit distribution uses VF2 by default, but if three lines of the form "#if 0" are changed to "#if 1" in Code/GraphMol/Substruct/SubstructMatch.cpp and the code is recompiled, it will use the older Ullmann implementation.

Competing interests

None declared

aim of this publication

Matthias Rarey, University of Hamburg

19 September 2012

Thank you for listing the additional internet resources with respect to Ullmann/VF2 comparison studies. I just want to point out that the intention of this paper is not just to show that one algorithm is a factor x faster than another. Our aims are:
- presenting the algorithms in pseudo code making it easier for others to follow the algorithmic strategy
- presenting a large data collection making detailed comparisons possible, especially studying run time effects with respect to pattern size, molecular size, etc.
- making this data collection available such that algorithm developers have a point of reference in the future.
- making some general conclusions about the asymptotic run time behavior of both algorithms.

Absolute computing times are given to compare data and methods within the paper. Our implementation is in C++, but with the aims listed above I believe this is of minor importance. Algorithmic details (incl. atom sorting) are, to the best of our knowledge, included.

Competing interests

Author

CDK Benchmarks

John May, European Bioinformatics Institute

10 July 2013

Very nice article, in particular the concise pseudo code for each algorithm is very useful.

As Andrew mentioned, there is a more dramatic difference seen in Chemistry Development Kit (CDK) comparisons. This is because the CDK currently uses an edge list representation, which is sub-optimal for checking adjacent vertices. Both these algorithms of course require many of these operations. The choice of data structure particularly impacts the speed of Ullmann's algorithm, which requires more adjacency checks. Using an adjacency/incidence list representation will yield closer results, as is reported here (and with the RDKit benchmarks).
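As a rough illustration of why the representation matters, the sketch below contrasts an adjacency check on an edge list with one on an adjacency list. The class names and the `checks` counter are hypothetical and are not taken from the CDK or RDKit source. The adjacency list here uses per-atom neighbour sets, which is one possible layout among several.

```python
class EdgeListGraph:
    """Bonds stored as a flat list of (u, v) pairs (hypothetical sketch)."""

    def __init__(self, edges):
        self.edges = list(edges)
        self.checks = 0  # number of bond comparisons performed so far

    def adjacent(self, u, v):
        # Worst case: every bond in the molecule is inspected.
        for a, b in self.edges:
            self.checks += 1
            if (a == u and b == v) or (a == v and b == u):
                return True
        return False


class AdjacencyListGraph:
    """Each atom keeps a set of its neighbours (hypothetical sketch)."""

    def __init__(self, edges):
        self.neighbors = {}
        for a, b in edges:
            self.neighbors.setdefault(a, set()).add(b)
            self.neighbors.setdefault(b, set()).add(a)

    def adjacent(self, u, v):
        # Only the neighbours of u are consulted, independent of molecule size.
        return v in self.neighbors.get(u, ())


# A six-membered ring, e.g. the carbon skeleton of benzene.
ring = [(i, (i + 1) % 6) for i in range(6)]
edge_graph = EdgeListGraph(ring)
adj_graph = AdjacencyListGraph(ring)

print(edge_graph.adjacent(0, 3), edge_graph.checks)  # atoms 0 and 3 are not bonded
print(adj_graph.adjacent(0, 3))
```

For a non-bonded pair the edge list scan inspects every bond, and that cost is paid on each adjacency check, so an algorithm that performs more such checks is penalised more heavily.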

Competing interests

None declared

Benchmark Datasets

Matthias Rarey, Author

27 January 2016

update available

Competing interests

Due to the large interest in the dataset presented in this paper, we created a web page with an update and a release history. We greatly appreciate comments from Andrew Dalke related to errors in some of the patterns. We have now created a new version that not only fixes these errors but also includes a new SMARTS set from the literature and updated compound collections. Since all results presented in this paper relate to the original dataset, the supplementary material remains unchanged. Information on the updated dataset can be found here:

http://www.zbh.uni-hamburg.de/smartsdataset

Journal of Cheminformatics
