Efficient ring perception for the Chemistry Development Kit

May, John W; Steinbeck, Christoph

doi:10.1186/1758-2946-6-3

Software
Open access
Published: 30 January 2014


Journal of Cheminformatics volume 6, Article number: 3 (2014)


Abstract

Background

The Chemistry Development Kit (CDK) is an open source Java library for manipulating and processing chemical information. A key aspect in handling chemical structures is the determination of the chemical rings. The rings of a structure are used in areas including descriptors, stereochemistry, similarity, screening and atom typing. The CDK includes multiple algorithms for determining the rings of a structure on demand. Non-unique descriptions of rings were often used due to the slower performance of the unique alternatives.

Results

Efficient algorithms for handling chemical ring perception have been implemented and optimised in the CDK. The algorithms provide much faster computation of new and existing types of rings. Several optimisation and implementation considerations are discussed which improve real case usage. The performance is measured on several publicly available data sets and in several cases the new implementations were found to be more than an order of magnitude faster.

Conclusions

Algorithmic improvements allow handling of much larger datasets in reasonable time. Faster computation allows more appropriate rings to be utilised in procedures such as aromaticity perception. Several areas that require ring perception have also seen a noticeable improvement. The time taken to compute the unique rings is now comparable to that of the non-unique alternatives, allowing correct usage throughout the toolkit. All source code is open source and freely available.

Background

The Chemistry Development Kit (CDK) [1, 2] is an open source Java library for manipulating chemical information. A key aspect of manipulating and querying chemical information is the ability to define and reason about attributes of chemical structures. Describing the rings in a structure is fundamental and a prerequisite of other attributes.

There is often a disconnect between how chemical rings are numbered and what is useful for computation. Conflicting definitions of rings contribute towards discrepancies between chemistry toolkits, for example when assigning aromaticity. The CDK does not provide a single strict definition of what rings are present in a structure. The ring information is considered auxiliary, with different algorithms utilised for specific use-cases. Some considerations of the differences will be touched upon but a thorough review is provided by [3, 4] and [5].

There are several key properties we wish to know: is an atom or bond in a ring, what size is the ring and what are the other atoms and bonds in the ring? This information can be stored as an attribute of each atom or bond, as a collection of rings on the structure or computed on demand. With the provision of multiple algorithms it is undesirable to store all the information but invariant properties including membership and smallest ring size could be stored as an attribute of an atom or bond.

The ring properties can be used in many procedures throughout the library. In similarity searching and screening the creation of chemical fingerprints [6] may include ring size or membership to reduce the number of false positives. When matching atoms and bonds between structures the ring properties can be used in early elimination of infeasible matches or to disfavour ring opening and closing. Ring properties are also utilised in structure patterns (SMARTS [7]) where ring membership, size and number of rings can be queried.

It is essential that different structure resonance forms are treated as equivalent; one approach is to treat bonds in aromatic ring systems as delocalised. Conversely, a delocalised structure may have been provided without specified bond orders. The ring properties can be used to localise and delocalise the bonds between aromatic and Kekulé representations.

Geometric isomers (double-bond stereochemistry) should not be encoded when the bond is involved in a rigid ring. Rigidity is approximated by only allowing stereoconfigurations in rings with more than seven atoms. Groups of interdependent stereocenters can be identified by recursively checking the rings in a structure [8].

Improving the core ring perception algorithms can influence many areas and it is important that efficient algorithms are used.

Graph theory preliminaries

Although more comprehensive and accurate methods exist, chemical structures can be represented and efficiently modelled as graphs [9]. The algorithms used for ring perception are not specific to chemical structures and require several formal definitions. The basic concepts for these are briefly introduced here. A graph is composed of a set of vertices V and a set of edges E. Each vertex or edge may be labelled with a value. Two vertices are adjacent if an edge exists which contains the two vertices. The vertices of an edge are known as the endpoints; each endpoint is said to be incident to the edge. The degree of a vertex is the number of incident edges. If the endpoints are unordered, an edge is said to be undirected. Simple graphs have no edges connecting the same vertex (loops) and no edges which share the same endpoints (multiedges). We model a chemical structure as a simple undirected labelled graph where the atoms and bonds are labels on the vertices and edges. Although the edges have a numeric value (bond order) they are not treated as weighted.

A walk is a sequence of vertices and edges connecting two vertices. If the start and end of the walk are the same, the walk is closed. Otherwise the walk is open. A walk is simple if it contains no repeated edges and elementary if there are no repeated vertices. A simple walk that is also open is referred to as a path. Two vertices are connected if there is a path between them. A graph is connected if each vertex can be reached from every other vertex. A connected component (ConnComp(G)) in an undirected graph is a subgraph in which every vertex is connected.

A cycle is a closed walk. A graph containing a cycle is said to be cyclic, and a graph with no cycle is said to be acyclic. A connected acyclic simple graph is referred to as a tree. A ring in a chemical structure is best described as an elementary cycle. The cycle has no repeating vertices or edges and each vertex has a degree of 2 (in the cycle). This definition includes envelope rings of structures like naphthalene and azulene. As we are primarily concerned with chemical structures, herein we use the term cycle to refer to an elementary cycle.

A cycle basis is a set of cycles which can be used to generate all other cycles (cycle space) of the graph. Representing a cycle as a set of edges, a new cycle can be generated using the symmetric difference (XOR, ⊕-summing) of the edge sets of two cycles whose edge sets intersect. A minimum cycle basis is a cycle basis of minimum weight; in an unweighted graph the weight is simply the number of edges. When there is more than one basis with the same weight the choice between them is arbitrary, as either can be used to generate the cycle space.
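The symmetric difference of two cycles can be illustrated with a short sketch. The class name CycleXor, the helper methods and the edge numbering below are assumptions made for this example and are not part of the CDK. The two six-membered rings of a naphthalene-like graph share one edge (index 0), and their XOR yields the ten-edge envelope cycle.

```java
import java.util.BitSet;

final class CycleXor {

    // Builds an edge set from a list of edge indices.
    static BitSet edges(int... indices) {
        BitSet set = new BitSet();
        for (int i : indices) set.set(i);
        return set;
    }

    // Symmetric difference of two cycles given as edge-index sets.
    // Edges present in both cycles cancel out.
    static BitSet xor(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.xor(b);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical numbering: 11 edges, edge 0 is the shared fusion bond.
        BitSet left = edges(0, 1, 2, 3, 4, 5);
        BitSet right = edges(0, 6, 7, 8, 9, 10);
        BitSet envelope = xor(left, right);
        System.out.println("envelope edges: " + envelope.cardinality());
    }
}
```

The shared edge is absent from the result, which is why the envelope ring has ten edges rather than twelve.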

Cycle membership

The first step in cycle processing for a chemical structure is to efficiently determine which vertices and edges of the graph belong to a cycle. In PubChem-Compound [10] (Aug 2013) 97.3% of structures (47,745,887) contained a cycle. Although the proportion of structures containing a cycle is high, only 59.3% of the heavy atoms and 57.3% of bonds were cyclic. Eliminating the acyclic vertices and edges from further processing reduces the size of the computation.

The SpanningTree was introduced in the CDK to eliminate acyclic vertices and edges, reducing the runtime of existing algorithms [11]. A graph H is a subgraph of a graph G if the vertices and edges of H are subsets of the vertices and edges of G. A subgraph H is said to be a spanning subgraph of G if every vertex of G is present in H. The edges in chemical structures are unweighted and so the minimum spanning tree is a tree with the smallest number of edges. Given an input structure a spanning tree is created which contains a subset of the edges that span the vertices but contains no cycles. The SpanningTree class uses a greedy algorithm [12] to sequentially build up this tree. Cyclic vertices and edges are determined by finding a path in the tree between the two endpoints of an edge which was not included. Any edge that is not in the spanning tree is cyclic and any path in the tree which connects the two endpoints contains vertices and edges that are also cyclic. The number of paths to find depends on the number of edges not included in the spanning tree. Structures containing a large number of rings will have more edges removed and more paths to find. Discovery of a path in the tree is implemented as a depth-first search and the entire tree may be traversed for each removed edge.
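The idea of marking cyclic vertices from a spanning tree can be sketched as follows. This is a minimal illustration and not the CDK SpanningTree class: the class and method names are hypothetical, the tree is built by breadth-first search rather than the greedy algorithm cited, and the tree path is found by following parent pointers instead of a depth-first search. Only vertex membership is reported, and a simple graph (no loops or multiedges) is assumed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class SpanningTreeMembership {

    // Returns cyclic[v] == true when vertex v lies on at least one cycle.
    static boolean[] cyclicVertices(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<List<Integer>>();
        for (int v = 0; v < n; v++) adj.add(new ArrayList<Integer>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }

        int[] parent = new int[n];
        int[] depth = new int[n];
        boolean[] visited = new boolean[n];
        Arrays.fill(parent, -1);

        // Build a spanning forest, one tree per connected component.
        for (int root = 0; root < n; root++) {
            if (visited[root]) continue;
            visited[root] = true;
            Deque<Integer> queue = new ArrayDeque<Integer>();
            queue.add(root);
            while (!queue.isEmpty()) {
                int u = queue.poll();
                for (int w : adj.get(u)) {
                    if (visited[w]) continue;
                    visited[w] = true;
                    parent[w] = u;
                    depth[w] = depth[u] + 1;
                    queue.add(w);
                }
            }
        }

        boolean[] cyclic = new boolean[n];
        for (int[] e : edges) {
            int u = e[0], v = e[1];
            // Tree edges join a vertex to its parent; only non-tree edges close a cycle.
            if (parent[u] == v || parent[v] == u) continue;
            // Walk both endpoints up to their common ancestor, marking the tree path.
            while (u != v) {
                if (depth[u] < depth[v]) { int t = u; u = v; v = t; }
                cyclic[u] = true;
                u = parent[u];
            }
            cyclic[u] = true; // the common ancestor is also on the cycle
        }
        return cyclic;
    }

    public static void main(String[] args) {
        // Three-membered ring (0, 1, 2) with one substituent (3) on vertex 0.
        int[][] edges = {{0, 1}, {1, 2}, {2, 0}, {0, 3}};
        System.out.println(Arrays.toString(cyclicVertices(4, edges)));
    }
}
```

As in the text, the work grows with the number of edges left out of the tree, since each such edge requires one path to be traced.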

Cycle sets

In addition to determining if a vertex or edge is cyclic, one would also like to know the sizes of cycles and the walks. There is an exponential number of elementary cycles in a graph and smaller subsets of this have subsequently been defined and used in various aspects of chemical information processing.

Smallest set of smallest rings/minimum cycle basis

A well known set of cycles is the Smallest Set of Smallest Rings (SSSR). The SSSR was originally defined as a minimum length Kirchhoff-fundamental basis but has evolved to refer to a minimum cycle basis (MCB). The original definition of SSSR does not always contain the shortest cycles and was computationally intractable [3]. To avoid confusion the term SSSR will now only be used in reference to CDK implementation names. As introduced previously the MCB is a polynomial set of cycles which can be used to generate the cycle space. As the MCB may not be unique it has little direct use in similarity, aromaticity, depiction or other descriptive features. The MCB is also not required to find the shortest cycle through each edge or vertex, which can be accomplished without checking that the cycles form a basis. Although the MCB is not unique, the number of cycles it contains is. This value is the circuit rank^a and is the number of edges that would need to be removed to make the graph acyclic (a spanning tree). For these reasons the size of the MCB agrees with de-facto standards and chemical nomenclature (Figure 1). The formula |E| − |V| + |ConnComp(G)| provides the circuit rank without computation of the cycle walks [5].
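The circuit rank formula can be evaluated directly from an edge list. The sketch below is illustrative only: the class and method names are hypothetical and it counts connected components with a simple union-find, which is one of several ways to obtain |ConnComp(G)|.

```java
final class CircuitRank {

    // Finds the representative of v's component, halving the path as it goes.
    static int find(int[] parent, int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }

    // Computes |E| - |V| + (number of connected components).
    static int circuitRank(int n, int[][] edges) {
        int[] parent = new int[n];
        for (int v = 0; v < n; v++) parent[v] = v;
        int components = n;
        for (int[] e : edges) {
            int a = find(parent, e[0]);
            int b = find(parent, e[1]);
            if (a != b) {
                parent[a] = b; // merge the two components
                components--;
            }
        }
        return edges.length - n + components;
    }

    public static void main(String[] args) {
        // Naphthalene-like graph: 10 vertices, 11 edges, fusion bond between 4 and 5.
        int[][] naphthalene = {
            {0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 0},
            {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 4}
        };
        System.out.println(circuitRank(10, naphthalene));
    }
}
```

For the naphthalene-like graph the result is 11 − 10 + 1 = 2, matching the two rings expected from nomenclature.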

The original algorithm [13] utilised in the CDK was shown to be incorrect and cannot guarantee completion on all graphs [3]. Although one may consider such cases rare, in four of the five tested chemical data sets (Table 1) at least one structure was found which caused the CDK implementation to halt indefinitely. The algorithm is still partially used in other cheminformatics libraries [14]. The implementation was replaced with a correct algorithm [3] (SSSRFinder) which also provides uniquely defined cycle sets as alternatives to the MCB.

Table 1 Chemical structure sets used to measure performance


In general the CDK library has been relying less on the MCB as it has little use beyond counting the number of rings and generating the cycle space. Both of these tasks can be achieved more efficiently with other procedures. The implementations provided in the CDK are primarily for reference and for use in computing other uniquely defined cycle sets.

Essential and relevant cycles

The essential and relevant cycles are uniquely defined sets of cycles. The MCB is non-unique when there are multiple minimum cycle bases, and an arbitrary choice of a single basis can generate the cycle space. The essential cycles are the intersection of these minimum cycle bases whilst the relevant cycles are the union. When a graph has a single unique MCB it is equal to both the essential and relevant cycles. As a subset of the MCB the essential cycles do not necessarily form a basis and, when they do not, cannot be used to generate the cycle space. Like the MCB the essential cycles are always polynomial in number. Counter-intuitively, structures such as barrelene (Figure 1) contain no essential cycles. The relevant cycles do form a basis but may be exponential in number.
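The barrelene case can be worked through by hand. The sketch below assumes a hypothetical edge numbering in which the three bridges between the two bridgehead atoms use edges 0-2, 3-5 and 6-8, so that each six-membered ring is the union of two bridges. The three minimum cycle bases are listed explicitly for illustration; this is not how the sets are computed in practice, and the class and method names are not part of the CDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class EssentialRelevant {

    // Builds an edge set from a list of edge indices.
    static BitSet edges(int... indices) {
        BitSet set = new BitSet();
        for (int i : indices) set.set(i);
        return set;
    }

    // Essential cycles: those present in every minimum cycle basis.
    static Set<BitSet> essential(List<Set<BitSet>> bases) {
        Set<BitSet> result = new HashSet<BitSet>(bases.get(0));
        for (Set<BitSet> basis : bases) result.retainAll(basis);
        return result;
    }

    // Relevant cycles: those present in at least one minimum cycle basis.
    static Set<BitSet> relevant(List<Set<BitSet>> bases) {
        Set<BitSet> result = new HashSet<BitSet>();
        for (Set<BitSet> basis : bases) result.addAll(basis);
        return result;
    }

    // The three minimum cycle bases of a barrelene-like graph (circuit rank 2).
    static List<Set<BitSet>> barreleneBases() {
        BitSet ab = edges(0, 1, 2, 3, 4, 5);
        BitSet ac = edges(0, 1, 2, 6, 7, 8);
        BitSet bc = edges(3, 4, 5, 6, 7, 8);
        List<Set<BitSet>> bases = new ArrayList<Set<BitSet>>();
        bases.add(new HashSet<BitSet>(Arrays.asList(ab, ac)));
        bases.add(new HashSet<BitSet>(Arrays.asList(ab, bc)));
        bases.add(new HashSet<BitSet>(Arrays.asList(ac, bc)));
        return bases;
    }

    public static void main(String[] args) {
        List<Set<BitSet>> bases = barreleneBases();
        System.out.println("essential: " + essential(bases).size());
        System.out.println("relevant: " + relevant(bases).size());
    }
}
```

No ring appears in all three bases, so the essential set is empty, while the union contains all three six-membered rings.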

The uniqueness of these cycle sets makes them desirable for describing chemical entities. The essential cycles have been utilised in the CDK for similarity searching techniques including generation of fingerprints and for the structure query patterns. Unfortunately the computation of the unique essential and relevant cycles (using the SSSRFinder) takes much longer than the non-unique MCB. The increased computation runtime has generally meant the MCB has been favoured.

All elementary cycles

When considering all cycles, the number of cycles can be very large and infeasible to compute for fullerene-like and cyclophane-like structures. The set of all cycles can be generated using a cycle basis or computed directly [20]. Direct computation is more efficient and is provided in the CDK as the AllRingsFinder. One major drawback of the existing implementation is the dependence on a time measure to determine feasibility. The time was measured from when the algorithm started, and the computation was aborted if the elapsed time exceeded a set threshold. Whether the algorithm completes then depends on the machine specification and also the current load on the processor. The timeout was also generally left at a value too high (5 seconds) for larger datasets. To demonstrate this, the timeout threshold was varied and tested on a small dataset. The number of structures that the algorithm successfully completed was measured. Increasing the threshold to longer than a second provides only a small gain in coverage (Figure 2). A timeout of just 50 ms allowed 99.4% of the structures to complete in 32 seconds. Leaving the timeout at the default value of 5000 ms allowed 99.8% of structures to complete but took roughly nine times longer (291 seconds) (Figure 3). This could be an artefact of hardware improvements but highlights the difficulties in choosing an appropriate value when using a timeout.

The set of all cycles was used throughout the library in fingerprint generation, similarity searching [21], descriptors, kekulisation and fragmentation. The cycles were also partially utilised in aromaticity perception.

Implementations

The processing of cycles in the CDK has been streamlined and optimised. Improved algorithms for determining cycle membership and the uniquely defined essential and relevant cycles have been implemented in the CDK library. The algorithms are split across several classes allowing an expert user to pick and choose. For simplicity a facade, Cycles (Figure 4), provides generation of the cycle sets described and applies preprocessing optimisations. Specific implementation details are discussed and measured in the results.

Graph representations

A graph can be represented and stored using several data structures [22] (Figure 5). The choice of data structure can dramatically affect performance. A coordinate or edge-list representation stores the vertices and edges as separate lists. The edge list is memory efficient but inefficient for determining adjacency, where every edge must be checked. An adjacency matrix is a square matrix with a boolean value indicating whether two vertices are adjacent. Matrix representations offer constant time adjacency checking but require every vertex to be checked in order to obtain a list of neighbours or the degree. The matrix representation is less memory efficient and requires quadratic space to store. In an adjacency/incidence list each vertex stores adjacent vertices or incident edges. Testing adjacency is bounded by the number of adjacent vertices, the degree [22]. The degree and the set of adjacent vertices can be obtained in constant time.

The choice of data structure depends on properties being modelled, and which algorithms will be used. Chemical structures are generally small (|V|<100) and each vertex is only adjacent to a few other vertices (sparse). Although more costly in memory and for modifications the attributes of chemical structures make the adjacency (or incidence) list representation preferable.
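The trade-off can be made concrete with a small conversion from an edge list to an adjacency list. This is a minimal sketch using plain integer arrays; the class and method names are hypothetical and do not correspond to the CDK data types.

```java
import java.util.Arrays;

final class AdjacencyList {

    // Converts an edge list over vertices 0..n-1 into an adjacency list.
    static int[][] fromEdgeList(int n, int[][] edges) {
        int[] degree = new int[n];
        for (int[] e : edges) {
            degree[e[0]]++;
            degree[e[1]]++;
        }
        int[][] adj = new int[n][];
        for (int v = 0; v < n; v++) adj[v] = new int[degree[v]];
        int[] filled = new int[n];
        for (int[] e : edges) {
            adj[e[0]][filled[e[0]]++] = e[1];
            adj[e[1]][filled[e[1]]++] = e[0];
        }
        return adj;
    }

    // Adjacency test whose cost is bounded by the degree of u.
    static boolean adjacent(int[][] adj, int u, int v) {
        for (int w : adj[u]) {
            if (w == v) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Three-vertex chain 0-1-2.
        int[][] adj = fromEdgeList(3, new int[][]{{0, 1}, {1, 2}});
        System.out.println(Arrays.deepToString(adj));
        System.out.println("degree of 1: " + adj[1].length);
    }
}
```

Once converted, the degree is the length of a vertex's neighbour array and the neighbours are available without scanning every edge, at the cost of storing each edge twice.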

The CDK currently uses an edge list representation to store chemical structures. Conversion of the CDK native data type to an adjacency list is relatively quick but can become significant if carried out multiple times. Many of the existing algorithms used an optimised representation but benefit was seen by avoiding the slower CDK native objects and minimising reconversion. The overhead introduced for converting the CDK objects (Table 2) could be minimised by loading directly into a more optimal data structure. When comparing to existing methods the conversion time is included in the measurements.

Table 2 Average (n = 15) time taken to convert CDK structure representations to adjacency and incidence list data structures


Results and discussion

Here we describe the optimisations and measure the performance on several chemical datasets (Table 1). All measurements were performed on a 2.66 GHz Intel Core i7 processor using Java version 1.7.0_21. The unprocessed benchmark results are provided as Additional files 1, 2 and 3.