MET: a Java package for fast molecule equivalence testing

An important task in cheminformatics is to test whether two molecules are equivalent with respect to their 2D structure. Mathematically, this amounts to solving the graph isomorphism problem for labelled graphs. In this paper, we present an approach which exploits chemical properties and the local neighbourhood of atoms to define highly distinctive node labels. These characteristic labels are the key for clever partitioning molecules into molecule equivalence classes and an effective equivalence test. Based on extensive computational experiments, we show that our algorithm is significantly faster than existing implementations within SMSD, CDK and RDKit. We provide our Java implementation as an easy-to-use, open-source package (via GitHub) which is compatible with CDK. It fully supports the distinction of different isotopes and molecules with radicals.


Background
The analysis of molecules is an important part of cheminformatics. As atoms and bonds can be combined in a multitude of ways, a huge number of different molecules can be formed. Databases like PubChem, KEGG, or ChEBI store millions of molecules. Querying whether some molecule is included in such a database requires to test whether this molecule is equivalent to some existing one, i.e. whether they share the same structural and chemical properties. For example, it is not at all obvious that the two molecules (PubChem CID: 50934716 and 6397461), visualized in Fig. 1, are equivalent in 2D (not regarding stereochemistry). In this paper, we study the problem of testing whether two molecules are equivalent.
In an abstract mathematical way, molecules can be represented as graphs with atoms as nodes and bonds as edges. Chemical properties of atoms can be modelled as integral or real-valued node labels. To test whether two molecules are equivalent thus means to test whether the two associated labelled graphs are isomorphic.
The general graph isomorphism problem is neither known to be in P nor to be NP-complete [1,2]. In a recent breakthrough paper, Babai presented a quasipolyonomial-time algorithm for the graph isomorphism problem in general graphs [3]. This algorithm runs in time O(n polylog(n) ) where n is the number of nodes and polylog(n) is some polynomial in log(n) . For many special cases, stronger results are known. For example, for planar graphs, the problem can be solved in polynomial time [4]. A lot of molecules like DNA, RNA, or fullerenes can be represented as planar graphs. However, it is known that other molecule classes like inorganics or linked polymer networks cannot [2,5]. In pioneering work, Luks presented a polynomial-time isomorphism algorithm for graphs of bounded degree [6]. Since bond and atom types are bounded, this immediately implies the polynomialtime solvability of molecular equivalence via standard transformations from molecular, labelled graphs to simple graphs [2]. The corresponding polynomials are, however, of high degree so that algorithms require different techniques to be usable in practice.
There are many algorithms to test whether two labelled graphs are isomorphic. One possibility is to first transform molecule graphs into simple graphs without node labels, and then to solve the classical graph isomorphism problem [7]. In contrast, many other approaches, like ours, work directly on labelled graphs. McKay's famous nauty algorithm [8] is based on the idea of finding a canonical form of a graph G, i.e. a labelled graph C(G) that is isomorphic to G, often referred to as canonical labelling, such that every graph that is isomorphic to G has the same canonical form. Then testing whether two graphs are isomorphic reduces to merely checking equality of their corresponding canonically labelled graphs, which is easy. The hard part is to compute the canonical labellings. The nauty package is freely available; although known to have exponential runtime on some inputs, it performs very well in practice. The probably first practically usable algorithm goes back to Ullmann [9], who used recursive backtracking to solve the isomorphism problem (actually, his approach even solves the NP-complete subgraph isomorphism problem). A substantially improved version appeared in [10]. The general idea of iteratively extending a partial solution using certain feasibility criteria has been employed in several variants of VF/VF2/VF2 Plus/VF2++ algorithm [4,[11][12][13]. Experimental studies showed that the VF2++ algorithm is very efficient in practice [4]. It seems to be the fastest available code up to now.
In the context of cheminformatics, algorithms for testing molecule equivalence are implemented in widelyused software packages like RDKit [14], CDK [15] and SMSD [16]. These tools work well with a multitude of molecules. However, they do not always respect all chemical properties, like deuterium and radicals (see Figs. 2,3). Other algorithms implemented in CDK and SMSD do respect these properties but suffer from large running times.
Many scientists in cheminformatics use canonical Simplified Molecular Input Line Entry Specification (SMILES) [17,18] or InChI (International Chemical Identifier) [19,20] to represent molecules as strings [21,22]. To test whether two molecules are equivalent, it suffices to check whether the associated strings are identical. To create such strings, a canonical labelling problem has to be solved. SMILES or InChI strings can be created by CDK or RDKit [15,21]. Several limitations of the SMILES format exist, most importantly, that there is no standard way to generate a canonical representation [22]. We will later compare these approaches with ours.
Contribution To improve the current situation, we developed and implemented an algorithm for testing molecule equivalence in 2D. Our software is called MET (Molecule Equivalence Tester) and is available at https ://www.githu b.com/jasch ueler /MET/. Our software is designed and has been engineered to 1. consider chemical properties like deuterium and radicals in the equivalence test, 2. be highly competitive with the isomorphism algorithms from CDK and SMSD as well as with established SMILES and InChI methods, and 3. be compatible with CDK such that it can be easily integrated into existing CDK applications.
For testing molecular equivalence, our key contribution is to define highly distinctive node labels encoding both chemical properties and the local structural neighbourhood of an atom up to a certain depth. In order to achieve In the following section, we describe our method and the associated techniques. Afterwards, we experimentally evaluate our algorithm and show that it largely outperforms existing implementations from CDK, SMSD, and RDKit.

Methods
Let us start by formally defining the equivalence of molecules.

Preliminaries: molecule equivalence
In our application, we model a molecule as a graph G = (V , E) with node set V and edge set E. Each node v has an associated node label ℓ(v) ∈ R s of s (integral or real) numbers, each of which represents a certain atom property. A very simple choice of node labels would be, for example, to simply use the atomic number of the corresponding atom (in which case s would be one). Later, we will explain in detail how more sophisticated chemical and structural properties can be encoded into labels. We define that two molecules G 1 = (V 1 , E 1 ) and G 2 = (V 2 , E 2 ) are equivalent if and only if there is an isomorphism f between G 1 and G 2 i.e. a bijective function f : The first condition ensures that both graphs have the same structure while the second guarantees that also the node labels are compatible.

High-level description
Our algorithmic approach works as follows. The algorithm gets as input two molecules. For simplicity, we assume that these are given by their CDK representations. Alternatively, one may pass these molecules by some standard file description, for example, in SDF format. Our algorithm consists of two phases.
1. In the first phase, the algorithm converts both molecules into labelled graphs. In doing so, each molecule is transformed into a graph G = (V , E) with node set V and edge set E. For each atom, we introduce a node v ∈ V and for each bond, we introduce an edge e ∈ E .
In doing so, we create two graphs G 1 = (V 1 , E 1 ) and In the second step, we create node labels ℓ 1 : V 1 → R s and ℓ 2 : V 2 → R s . It is crucial that two nodes gain identical labels if and only if their associated atoms have identical properties. Our key contribution is to compute highly distinctive node labels. 2. In the second phase, we first run quick pre-test to check whether G 1 and G 2 are certainly not isomorphic. If this pre-test does not preclude the isomorphism, we use an isomorphism algorithm to decide whether the labelled graphs are isomorphic.
In the following, we give a detailed description of both phases.

Phase one
In the first phase, we transform the CDK molecule representations (regardless of stereochemistry) to labelled graphs. For this purpose, we replace each atom by a node and each bond by an edge. To conserve the chemical properties of each atom, each node gets a label, on which we will focus next.

Atom properties
Our software is designed to let the user choose the chemical properties that should be respected during the equivalence test.
In our applications, we consider the following atom properties: • atomic number • count of hydrogens and deuterium • formal charge • count of single electrons (radicals) • count of single bonds, double bonds and triple bonds Additional properties can easily be integrated into our algorithm, as long as they can be represented as real or integral numbers. For example, chemical properties like radius or partial charges can be represented as real numbers and might be helpful, if available. Table 1 shows the atom properties of the example molecule in Fig. 4.
Based on the selected properties, each node v gets an associated label ℓ(v) = (p 1 , p 2 , . . . , p s ) , where each p i represents a certain atom property. It is crucial that two nodes gain identical labels if and only if their associated atoms have identical properties.

Neighbourhood descriptors
In addition to the chemical properties, each node v gets an additional structural property d(v) called neighbourhood descriptor. This is a (real or integral) number that encodes information on the local neighbourhood of this node. We intend to give nodes identical neighbourhood descriptors if (a) they have the same set of chemical properties, and (b) their local neighbourhood is identical.
To compute the neighbourhood descriptors, we iteratively define for each node v an integer d i (v) that characterizes its local neighbourhood up to a depth of i. Initially, we define Here, hash denotes a generic hash function that maps tuples of (real or integral) numbers to integers. Consequently, the initial descriptor d 0 (v) contains information only on each node itself. For 1 ≤ i ≤ k (where the maximum depth k is a parameter on which we focus soon), we define Having calculated d i (v) for 0 ≤ i ≤ k , we use as the neighbourhood descriptor of the node v. Table 2 demonstrates the calculation of the neighbourhood descriptors on the example shown in Fig. 5.
We briefly point to some instructive observations.
• As the nodes C and D are supposed to have identical chemical properties, and they additionally have the same neighbourhood structure, they gain identical values of d i for all i ≥ 0 . Consequently, they gain the same neighbourhood descriptor d (not shown in the example). • The nodes G and H are supposed to have different chemical properties (thus d 0 (G) = d 0 (H) ) but their neighbourhood structure is exactly symmetric (thus However, their local neighbourhood differs when considering a depth of two or more.

Maximum depth
It is an important question of how to choose the maximum neighbourhood depth k. On the one hand, a large value of k will include more information on the local  neighbourhood of each node, and will thus potentially decrease the running time of the isomorphism algorithm. On the other hand, calculating the neighbourhood descriptors for each node of a molecule graph G = (V , E) requires a running time of O(k · |E|) . Thus, we might want to choose k as a small constant. Later, in the experimental results section, we will study the influence of k on the running time in a benchmark experiment.

Phase two
In the second phase, we test whether two labelled graphs G 1 and G 2 with node labels ℓ 1 and ℓ 2 are isomorphic.

Pre-test
Before starting the actual isomorphism test, we run a fast pre-test to quickly detect whether the graphs are certainly not isomorphic. For this purpose, we compare the number of atoms, bonds, hydrogens, deuterium, formal charge and single electrons of both molecules, which all must be equal. In addition, we construct the label sets {ℓ(v) : v ∈ V 1 } and {ℓ(v) : v ∈ V 2 } and test whether both sets are identical. If not, we can be sure that the two molecules are not equivalent.
Equivalence test After the pre-test, we try to find an isomorphism between G 1 and G 2 . In this paper, we tested two algorithms: • First, we tested the VF2++ algorithm included in the Lemon library [13,23]. Recent work showed that this algorithm is very fast on large instances [4]. However, this algorithm is implemented in C++. This is a small disadvantage for native Java application, like our implementation, as calling this algorithm requires system calls and thus reduces the portability of the Java code. • Second, we tested our own implementation of a generic backtracking algorithm, which we will describe in a moment. In contrast to VF2++, this algorithm is implemented in native Java and is thus more portable. As a drawback, our algorithm has been less engineered and might therefore be less efficient than VF2++.
We now give a rough description of our isomorphism algorithm. The algorithm gains as input two graphs G 1 and G 2 with associated node labels ℓ 1 and ℓ 2 .

Candidate sets
For each node v ∈ V 1 , the algorithm maintains a candidate set of nodes in G 2 that may be assigned to the node v. Vice versa, the algorithm maintains for each node w ∈ V 2 in G 2 the candidate set of nodes in G 1 that may be assigned to w. Backtracking The heart of the isomorphism algorithm is the following backtracking algorithm (see Algorithm 1).
The algorithm selects a node v ∈ V 1 with a minimumsized candidate set and tries to assign v to one of its candidate nodes, say to w ∈ can 1 (v) . By assigning v to w, we may reduce several candidate sets (see Algorithm 2): • Obviously, v and w can be removed from every candidate set in which they were included so far. • For each edge {u, v} ∈ E 1 , there must be an edge {z, w} ∈ E 2 . Thus, for each edge {u, v} ∈ E 1 , we can remove from can 1 (u) each node that is not adjacent to w. • Symmetrically, we can remove from can 2 (z) each node that is not adjacent to v.
• Furthermore, the candidate sets of v and w can be cleared.
In doing so, we reduce the candidate sets of the nodes. (Note that candidate sets will only be reduced, never increased by the assignment of two nodes.) Recursion After assigning two nodes to each other, we recursively try to find an assignment to the remaining nodes. This may or may not be successful.
• If we successfully assigned each node from G 1 to some node from G 2 , we successfully determine that both graphs are isomorphic. We return with a positive answer and the isomorphism function f. • If we did not succeed in recursively assigning the remaining nodes, we undo our latest modifications on the candidate lists, backtrack and try to assign to v the next node from its candidate set. After all candidate nodes have been tested this way (and we did not return with a positive answer), we know that G 1 and G 2 cannot be isomorphic.
Note that the isomorphism f : V 1 → V 2 is easily obtained by this procedure. The efficiency of our algorithm highly depends on the sizes of the candidate sets. Recall that, the larger the maximum depth k, the smaller the candidate sets will be, as the node labels become more distinctive.

Experiments, results, and discussion
To quantify the performance of our implementation in comparison to existing methods, we designed a benchmark experiment.
Benchmark experiment In each of the following experiments, we partitioned a certain SDF file with molecules into its equivalence classes, i.e. into maximum sized subsets of molecules in which all molecules are pairwise equivalent. Each equivalence class has a designated member called it's representative. Our general process is sketched as follows.
1. We read one molecule after the other from the input file and construct the associated graph G = (V , E) with node labels ℓ. 2. To insert G into its equivalence class, we use the property set {ℓ(v) : v ∈ V )} to compile a list of representatives that have the same property set and thus maybe equivalent to G. Let L G denote this list. 3. For each representative R ∈ L G , we test whether G is equivalent to R. To this end, we use one of the following algorithms/approaches:  [15].
If G and R are found to be equivalent, we add G to R's equivalence class and continue with the next molecule. Note that each of the algorithms may give a different result.
4. If G is equivalent to none representative in L G , we create a new equivalence class represented by G. Experimental environment All experiments were conducted on a Debian GNU/Linux 10 system with 38 CPU cores and 250 GB main memory. We used Java on version 12, CDK 2.3, RDKit 2018.09.1 (Ubuntu package) and SMSD 2.2.0.

Parameter tuning and optimization
In the first four experiments, we tried to find an optimum value of the neighbourhood depth parameter k. In addition, we evaluated which graph isomorphism algorithm works best in the second phase of our algorithm. Experiment 1 In a first experiment, we studied the influence of the neighbourhood depth parameter k on the running time of the MET and VF2++ algorithms. For this purpose, we ran our benchmark experiment for several values of k and measured the associated total running time. As the input data set, we used the set of 95 945 527 molecules that are listed in the PubChem database (October 2019), excluding molecules that solely consist of hydrogen or deuterium. Figure 6 shows the result of this experiment. We summarize some essential observations.
• We observe that a very small value of k induces a large running time. This is not surprising, as, with small k, the node descriptors carry little information on the local neighbourhood of each atom. The total running time reaches a minimum at k = 6. • Increasing the value of k beyond 6 does not significantly decrease the running time. Instead, the running time stays nearly constant for k ≥ 6 . We con-clude that a value of k = 6 might be a reasonable value for our type of application. Our observations agree with the results from related experiments on molecular fingerprints for small organic molecules [24,25], where it has been found that a diameter of 4 or 6 gives the best performance. • We further observe that the MET algorithm slightly outperforms VF2++ for all values of k. This may have several reasons. For example, since VF2++ is implemented in C++, we need to call it using the Java Native Interface, which induces additional overhead. In any case, we conclude that MET is slightly superior to VF2++.
In further experiments, we use k = 6 as the maximum neighbourhood depth. Experiment 2 Next, we further studied in more detail whether MET or VF2++ work best in the second phase of the equivalence test. In contrast to the previous experiment, we now consider data sets of different molecule sizes. For this purpose, we partitioned the PubChem data set into seven groups of molecules of approximately similar size.  We ran our benchmark experiment for each group individually and measured the associated running times for MET and VF2++. By calculating the quotient of both running times, we quantify the speed-up of one algorithm to the other. A quotient of one means that both algorithms are equally fast, a quotient of less than one means that MET is faster than VF2++. Figure 7 shows the result of this experiment.
• We observe that for each group, MET is superior to VF2++ in the majority of cases (as the medians lie below one). However, there are outliers in both directions. • While MET is superior for the majority of small molecules, it becomes less efficient when the number of atoms grows larger.
We conclude that our isomorphism algorithm is very well suited for molecules with up to 400 atoms, while VF2++ is comparably efficient but suffers from additional overhead. Experiment 3 Based on Experiment 2 we next considered the ratio between the running time of the equivalence test and the total running time. Therefore, we use the data set from Experiment 2 and individually measure the running time of the second phase only. Dividing this time through the total running time gives the ratio of the isomorphism algorithm (plus pre-test) on the total running time. In Fig. 8 we show the result of our experiment.
• We observe that the ratio of the isomorphism algorithm of MET is smaller than the algorithm of VF2++ for all molecule groups. • Especially, for the molecules with 50-200 atoms the algorithm of MET is significantly faster.
This result underlines the result of Experiment 2 and thus the suggestive usage of MET for molecules with up to 400 atoms. Experiment 4 In the previous experiment, we tested a lot of molecule pairs which are not equivalent. Most equivalence test failed during our pre-tests. In this experiment, we studied the efficiency of MET and VF2++ algorithms on a data set for which we a priori know that all molecules are equivalent.
For each group of molecule sizes, we selected a subset of equivalence classes as input for our benchmark experiment. In doing so, we made sure that all pairs of tested molecules are equivalent. We measured the running time of our benchmark experiment on these input files. Figure 9 shows the result of this experiment.
• We observe that both algorithms work approximately equally well for very small and very large molecules. • For medium-sized molecules MET is superior to VF2++.

Performance comparision
In our final set of experiments, we evaluated how the MET algorithm compares to the established equivalence tests from SMSD and to the methods based on canonical SMILES and InChI from CDK and RDKit.

Experiment 5
As it will become apparent from our experiment, the running time of some equivalence algorithms is very large. Consequently, we did not run all algorithms on the complete data set. In contrast, we just partitioned a small subset (23,254 molecules) of the PubChem database into its equivalence classes and measured the associated running time. Table 3 shows the running times for this process.
• First, we observe that two algorithms (CDKMCS, MCSPlus) from the SMSD do not correctly identify all pairs of equivalent molecules. For example, the structures of methadone hydrochloride (PubChem CID: 14184) and levomethadone hydrochloride Interestingly, these molecules do not contain radicals or isotopes. • Considering the running time, we observe that our approach outperforms all algorithms from SMSD by several orders of magnitude. Consequently, we exclude these algorithms from further experiments. • We observe that in this small sample the equivalence tests based on canonical SMILES and InChI give correct results. We further observe that the running time of RDKit is noticeable larger than that of CDK.
From this experiment, we conclude that only the SMILES and InChI based algorithms from CDK are competitive with MET and VF2++. Thus, we will further examine those algorithms on the large data set. Experiment 6 In our final experiment, we compare the running time of the SMILES and InChI based CDK methods with that of MET and VF2++. For this purpose, we ran these algorithms on the large data set from Experiment 2. Table 4 and Figures 10 and 11 show the results.
• Considering CDK's InChI method, we observe that MET outperforms CDK by a factor of about 2. In addition, we observe that the CDK method produce deviating results for 3245 molecule pairs, including 3 pairs for which at least one InChI could not be constructed by the CDK method. The remaining deviating results are false positives from CDK. For example, the molecules with CID 18671247 and 60160843 gain the same InChI although they differ in the position of some double bond. Another example is the molecules with CID 5151983 and 379953, which differ in the position of some proton but gain the same InChI. • Considering the SMILES method, we observe that MET's running time is just one third of CDK's. There   From our final experiment, we conclude that MET is a significant improvement to existing tools as it does not depend on the error-prone construction of canonical SMILES or InChIs and is furthermore considerably faster than CDK and RDKit.

Conclusion
In this article, we presented an algorithm for detecting the equivalence of molecules. Our algorithm exploits the chemical and structural properties of molecules to transform a molecule to a labelled graph. Our method is based on the construction of highly distinctive node labels that are used to decrease the running time of a isomorphism algorithm. Experimentally, we showed that it suffices to consider the local neighbourhood up to a depth of six. In its second phase, our algorithm uses a generic isomorphism algorithm for labelled graphs. Our experiments showed that our generic backtracking algorithm is competitive with the previously fastest implementation VF2++. In a set of experiments, we showed that our algorithm is faster than all algorithms currently implemented in SMSD, CDK, and RDKit. In addition, we found that our method is more robust than the methods included in CDK as it avoids the construction of SMILES or InChIs. As our software is compatible with CDK, it can easily be used to replace all current algorithms for equivalence testing from CDK or SMSD.
In the future, we plan to integrate our algorithm to the molecule fragmentation software ChemFrag [26]. Furthermore, we want to analyse the applicability of our algorithm for large molecules like proteins. In addition, we are going to consider the related problem whether some molecule is part of some larger molecule [27]. For this purpose, we need to solve the subgraph isomorphism problem, which is known to be NP-complete [1].