previous work on this topic
6 August 2012
Ehrlich and Rarey state "We presented, to our knowledge, the first comparison between Ullmann and VF2 subgraph isomorphism algorithm on molecular data and the first data set to perform such a benchmark." I wish to point out previous research on this topic.
The CDK and RDKit projects compared the VF2 algorithm performance with the Ullmann algorithm, and published their results online. A 2008 summary of the CDK work reports a speedup of about 5x.
A 2009 summary of the RDKit work is also available online, along with a follow-up. The latter specifically includes a test set of "450 queries and 1000 molecules", and concludes "So the search time has been more or less halved."
Both conclude that the VF2 algorithm is faster than Ullmann, and both projects have switched to using VF2 instead of Ullmann.
Secondly, this paper essentially compares two algorithm implementations, yet those implementations do not seem to be available as part of the supplementary data. The authors are aware of that, but believe that the data presented is sufficient to "allow general conclusions." I disagree.
While we know that the timings were done using an "Intel(R) Xeon(R) CPU E5630 2.53GHz cluster node", we know nothing about the implementation language, the compiler used, the choice of compiler options, or other factors that are well known to affect performance numbers. The performance numbers themselves can be sensitive to small implementation differences. From the CDK link above, I read that sorting the atoms by element type can make a 50% difference in performance, and from the RDKit links I read how an extra data transformation can make VF2 slower than Ullmann. But I have no idea whether these techniques were also applied to the implementations under study.
As an example of what can be done with publicly available source code: the RDKit distribution uses VF2 by default, but changing three lines of the form "#if 0" to "#if 1" in Code/GraphMol/Substruct/SubstructMatch.cpp and recompiling makes it use the older Ullmann implementation.
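To illustrate how much a small implementation choice such as atom ordering can matter, here is a minimal sketch in Python. It is a plain backtracking search, not the Ullmann or VF2 implementation from the paper, CDK, or RDKit. The function name `count_embeddings`, the toy molecule, and the use of tested candidate pairs as a stand-in for run time are all assumptions made for illustration.

```python
def count_embeddings(q_labels, q_adj, t_labels, t_adj, order):
    """Plain backtracking subgraph search (illustrative; neither Ullmann nor VF2).

    Query atoms are placed in the given order. Returns a tuple
    (number of embeddings found, number of candidate atom pairs tested).
    """
    mapping = {}   # query atom index -> target atom index
    used = set()   # target atoms already taken by the partial mapping
    tested = 0

    def extend(depth):
        nonlocal tested
        if depth == len(order):
            return 1  # every query atom is mapped
        q = order[depth]
        found = 0
        for t in range(len(t_labels)):
            tested += 1
            if t in used or q_labels[q] != t_labels[t]:
                continue
            # Each already-mapped neighbour of q must map to a neighbour of t.
            if any(n in mapping and mapping[n] not in t_adj[t] for n in q_adj[q]):
                continue
            mapping[q] = t
            used.add(t)
            found += extend(depth + 1)
            del mapping[q]
            used.remove(t)
        return found

    return extend(0), tested


# Hypothetical toy data: hexan-1-ol heavy atoms as the target, C-C-O as the query.
t_labels = ["C", "C", "C", "C", "C", "C", "O"]
t_adj = [{1}, {0, 2}, {1, 3}, {2, 4}, {3, 5}, {4, 6}, {5}]
q_labels = ["C", "C", "O"]
q_adj = [{1}, {0, 2}, {1}]

# Same query, same target, two different atom orderings.
carbon_first = count_embeddings(q_labels, q_adj, t_labels, t_adj, order=[0, 1, 2])
oxygen_first = count_embeddings(q_labels, q_adj, t_labels, t_adj, order=[2, 1, 0])
```

Both orderings find the same single embedding, but starting from the rarer oxygen atom tests far fewer candidate pairs than starting from a carbon. Whether a real implementation benefits in the same way depends on details that only its source code reveals.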
Competing interests
None declared
aim of this publication
Matthias Rarey, University of Hamburg
19 September 2012
Thank you for listing the additional internet resources on Ullmann/VF2 comparison studies. I just want to point out that the intention of this paper is not merely to show that one algorithm is a factor x faster than another. Our aims are:
- presenting the algorithms in pseudocode, making it easier for others to follow the algorithmic strategy
- presenting a large data collection that makes detailed comparisons possible, especially for studying run-time effects with respect to pattern size, molecular size, etc.
- making this data collection available so that algorithm developers have a point of reference in the future
- drawing some general conclusions about the asymptotic run-time behavior of both algorithms
Absolute computing times are given to compare data and methods within the paper. Our implementation is in C++, but given the aims listed above I believe this is of minor importance. Algorithmic details (incl. atom sorting) are, to the best of our knowledge, included.
Competing interests
Author
CDK Benchmarks
John May, European Bioinformatics Institute
10 July 2013
Very nice article; in particular, the concise pseudocode for each algorithm is very useful.
As Andrew mentioned, a more dramatic difference is seen in the Chemistry Development Kit (CDK) comparisons. This is because the CDK currently uses an edge list representation, which is sub-optimal for checking whether two vertices are adjacent. Both algorithms of course require many of these operations. The choice of data structure particularly impacts the speed of Ullmann's algorithm, which requires more adjacency checks. Using an adjacency/incidence list representation will yield closer results, as is reported here (and with the RDKit benchmarks).
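The difference between the two representations can be sketched in Python as follows. This is a minimal illustration, not CDK or RDKit code; the function names and the toy bond list are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy molecule: heavy atoms of propan-1-ol, C0-C1-C2-O3,
# with bonds stored as pairs of atom indices (an edge list).
edges = [(0, 1), (1, 2), (2, 3)]


def adjacent_edge_list(edges, u, v):
    """Adjacency check on an edge list: scans every bond, O(|E|) per query."""
    return any((a == u and b == v) or (a == v and b == u) for a, b in edges)


def build_adjacency(edges):
    """Convert the edge list once into per-atom neighbour sets."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def adjacent_adj_list(adj, u, v):
    """Adjacency check on neighbour sets: expected O(1) per query."""
    return v in adj.get(u, ())


adj = build_adjacency(edges)

# Both representations must answer every adjacency query identically.
answers_agree = all(
    adjacent_edge_list(edges, u, v) == adjacent_adj_list(adj, u, v)
    for u in range(4)
    for v in range(4)
)
```

The two checks give the same answers; they differ only in cost. The edge list scans all bonds for each query, while the neighbour sets answer in expected constant time after a one-off conversion, which matters more for an algorithm that performs more adjacency checks.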
Competing interests
None declared
Benchmark Datasets
Matthias Rarey, Author
27 January 2016
update available
Competing interests
Due to the large interest in the dataset presented in this paper, we have created a web page with an update and a release history. We greatly appreciate the comments from Andrew Dahlke about errors in some of the patterns. The new version not only fixes these errors but also includes a new SMARTS set from the literature and updated compound collections. Since all results presented in this paper relate to the original dataset, the supplementary material remains unchanged. Information on the updated dataset can be found here:
http://www.zbh.uni-hamburg.de/smartsdataset