Archived Comments for: Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2

  1. previous work on this topic

    Andrew Dalke, Andrew Dalke Scientific, AB

    6 August 2012

    Ehrlich and Rarey state "We presented, to our knowledge, the first comparison between Ullmann and VF2 subgraph isomorphism algorithm on molecular data and the first data set to perform such a benchmark." I wish to point out previous research on this topic.

    The CDK and RDKit projects compared the VF2 algorithm performance with the Ullmann algorithm, and published their results online. A 2008 summary of the CDK work reports a speedup of about 5x.

    A summary of the RDKit work in 2009 is here with follow-up. The latter specifically includes a test set of "450 queries and 1000 molecules", and concludes "So the search time has been more or less halved."

    Both conclude that the VF2 algorithm is faster than Ullmann, and both implementations have switched to using VF2 instead of Ullmann.

    Secondly, this paper essentially compares two algorithm implementations, yet those implementations do not seem to be available as part of the supplementary data. The authors are aware of that, but believe that the data presented is sufficient to "allow general conclusions." I disagree.

    While we know that the timings were done using an "Intel(R) Xeon(R) CPU E5630 2.53GHz cluster node", we know nothing about the implementation language, the compiler used, the choice of compiler options, or other factors that are well known to affect performance numbers. The performance numbers themselves can be sensitive to small implementation differences. From the CDK link above, I read that sorting the atoms by element type can make a 50% difference in performance, and from the RDKit links I read how an extra data transformation can make VF2 slower than Ullmann. But I have no idea if these techniques were also applied to the implementations under study.

    As an example of what can be done using publicly available source code: the RDKit distribution uses VF2 by default, but if three lines of the form "#if 0" in Code/GraphMol/Substruct/SubstructMatch.cpp are changed to "#if 1" and the code is recompiled, it will use the older Ullmann implementation.

    Competing interests

    None declared

  2. aim of this publication

    Matthias Rarey, University of Hamburg

    19 September 2012

    Thank you for listing the additional internet resources with respect to Ullmann - VF2 comparison studies. I just want to point out that the intention of this paper is not just to show that one algorithm is a factor x faster than another. Our aims are:
    - presenting the algorithms in pseudo code making it easier for others to follow the algorithmic strategy
    - presenting a large data collection making detailed comparisons possible, especially studying run time effects with respect to pattern size, molecular size, etc.
    - making this data collection available such that algorithm developers have a point of reference in the future.
    - making some general conclusions about the asymptotic run time behavior of both algorithms.

    Absolute computing times are given to compare data and methods within the paper. Our implementation is in C++, but with the aims listed above I believe this is of minor importance. Algorithmic details (incl. atom sorting) are to the best of our knowledge included.

    Competing interests

    Author

  3. CDK Benchmarks

    John May, European Bioinformatics Institute

    10 July 2013

    Very nice article; in particular, the concise pseudo code for each algorithm is very useful.

    As Andrew mentioned, there is a more dramatic difference seen in the Chemistry Development Kit (CDK) comparisons. This is because the CDK currently uses an edge list representation, which is sub-optimal for checking adjacent vertices. Both of these algorithms of course require many such operations. The choice of data structure particularly impacts the speed of Ullmann's algorithm, which requires more adjacency checks. Using an adjacency/incidence list representation will yield closer results, as is reported here (and with the RDKit benchmarks).
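
    The data-structure point can be illustrated with a minimal Python sketch. This is not CDK or RDKit code; the toy ring graph and all function names are hypothetical and chosen only for illustration. It contrasts an adjacency check that scans an edge list with one that looks up a prebuilt neighbour set:

```python
from collections import defaultdict

# Hypothetical toy molecule graph: atoms are integer indices,
# bonds are pairs of atom indices (here a six-membered ring).
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]


def adjacent_edge_list(bonds, u, v):
    """Adjacency test on an edge list: scans the bonds, O(|E|) per query."""
    return any((a == u and b == v) or (a == v and b == u) for a, b in bonds)


def build_adjacency(bonds):
    """Build the adjacency structure once: atom index -> set of neighbours."""
    neighbours = defaultdict(set)
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def adjacent_adjacency_list(neighbours, u, v):
    """Adjacency test on the prebuilt structure: expected O(1) per query."""
    return v in neighbours.get(u, ())


neighbours = build_adjacency(bonds)
```

    Both functions give the same answers; only the cost per query differs. Because a subgraph isomorphism search performs a very large number of such checks, the per-query cost can plausibly dominate the measured run time, which is consistent with the effect described above.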

    Competing interests

    None declared

  4. Benchmark Datasets

    Matthias Rarey, Author

    27 January 2016

    update available

    Competing interests

    Due to the large interest in the dataset presented in this paper, we created
    a web page with an update and a release history. We greatly appreciate comments
    from Andrew Dalke related to errors in some of the patterns. We have now created
    a new version that not only fixes these errors but also includes a new SMARTS set
    from the literature and updated compound collections. Since all results presented
    in this paper relate to the original dataset, the supplementary material remains
    unchanged. Information on the updated data set can be found here:

    http://www.zbh.uni-hamburg.de/smartsdataset
