 Research article
 Open Access
 Published:
Efficient enumeration of monocyclic chemical graphs with given path frequencies
Journal of Cheminformatics volume 6, Article number: 31 (2014)
Abstract
Background
The enumeration of chemical graphs (molecular graphs) satisfying given constraints is one of the fundamental problems in chemoinformatics and bioinformatics because it leads to a variety of useful applications including structure determination and development of novel chemical compounds.
Results
We consider the problem of enumerating chemical graphs with monocyclic structure (a graph structure that contains exactly one cycle) from a given set of feature vectors, where a feature vector represents the frequency of the prescribed paths in a chemical compound to be constructed and the set is specified by a pair of upper and lower feature vectors. To enumerate all treelike (acyclic) chemical graphs from a given set of feature vectors, Shimizu et al. and Suzuki et al. proposed efficient branchandbound algorithms based on a fast tree enumeration algorithm. In this study, we devise a novel method for extending these algorithms to enumeration of chemical graphs with monocyclic structure by designing a fast algorithm for testing uniqueness. The results of computational experiments reveal that the computational efficiency of the new algorithm is as good as those for enumeration of treelike chemical compounds.
Conclusions
We succeed in expanding the class of chemical graphs that are able to be enumerated efficiently.
Introduction
The enumeration of chemical structures satisfying given constraints is an important topic in chemoinformatics [1–3]. Applications of the enumeration of chemical structures include structure determination using massspectrum and/or NMRspectrum [4, 5], virtual exploration of the chemical universe [6, 7], reconstruction of molecular structures from their signatures [8, 9], and classification of chemical compounds [10]. The enumeration problem is also important for development of novel chemical compounds because virtual exploration of chemical universe and reconstruction of molecular structures from their signatures are considered to be important elementary technologies. The enumeration of chemical structures has a long history. Cayley [11] considered the enumeration of structural isomers of alkanes in the 19th century. The seminal work of Pólya on counting the number of isomers using group theory is also famous [12].
In this paper, we consider the problem of enumerating chemical structures having monocyclic graph structures satisfying a given constraint, where a monocyclic graph is an undirected connected graph containing exactly one cycle (a graph is connected if there exists a path connecting every pair of vertices), and a constraint is given in the form of a set of feature vectors (i.e., a set of descriptors). We assume that each feature vector specifies the number of occurrences of each labeled path of length up to a given constant K, where a labeled path is an alternating sequence of atom names and bond types (see Figure 1 for an example of a feature vector). We also assume that a set of feature vectors is given by specifying the minimum and maximum numbers of occurrences of each labeled path. We develop an efficient algorithm for this enumeration problem by extending existing algorithms [13, 14] for enumerating treelike chemical structures (i.e., chemical structures without cycles). In this extension, some novel concepts are introduced and rigorous mathematical analysis is performed in orer to guarantee the correctness of the algorithm.
In order to verify the computational efficiency of our proposed algorithm, we perform computational experiments using a set of some chemical compounds from the KEGG LIGAND database [15]. The results suggest that the proposed algorithm enumerates chemical structures having monocyclic graph structures as nearly efficiently as treelike chemical graphs have been enumerated.
The rest of this paper is organized as follows. First, we review some mathematical definitions and give a formal definition of the enumeration problem for chemical structures with monocyclic graph structures. Next, we review background and related work. Then, we present the algorithm and the results of computational experiments. Finally, we conclude with future work. Mathematical proofs, pseudocodes for the algorithm, and some details on computational experiments are given in Additional file 1.
Preliminaries and problem formulation
This section reviews some basic definitions on graphs and formalizes the problem to be addressed in this work. Before providing formal descriptions, we briefly explain the problem definition using an example in Figure 2.
Our basic problem is to enumerate all chemical structures each of which is consistent with a given feature vector and the valence condition. Each coordinate of a feature represents the number of occurrences of vertex and edgelabeled paths. In order to keep the size of a feature vector moderate, we restrict the length of paths to be no greater than a constant K. In the example of Figure 2, we consider paths of lengths 0 and 1, where a path of length 0 corresponds to a single atom and a path of length 1 corresponds to a bond including its endpoint atoms. For example, the columns O, N, and C of feature vector f_{1}(G) mean that each target structure must contain exactly one oxygen, two nitrogen, and three carbon atoms, respectively. The columns N1O, N1C, and C2C mean that each target structure must contain exactly one single bond connecting N and O, two single bonds connecting C and N, and one double bond connecting C and C. It should be noted that one single bond connecting N and O is counted by both O1N and N1O. Then, the chemical structure G is consistent with f_{1}(G). However, another chemical structure may be consistent with a given feature vector. For example, the feature vector remains the same even if the double bond (along with the branching carbon atom) is moved into the backbone chain. Therefore, it is desirable to enumerate all chemical structures consistent with a given feature vector and the valence condition (specified by val(…)). On the other hand, there may not exist any consistent chemical structure if K is large; thus it may not be appropriate to uniquely specify a feature vector. Therefore, we assume in our target problem that upper and lower bounds of the number of occurrences of each labeled path are given as shown in Figure 3.
A multigraph is a graph that can have multiple edges between the same pair of vertices, where vertices correspond to atoms and multiedges correspond to double and triple bonds in chemical compounds. We call a connected multigraph a kaugmented tree if the number of adjacent vertex pairs (i.e., vertex pairs connected by edges) minus the number of vertices is k−1 (hence a multitree is a 0augmented tree). That is, a kaugmented tree is a graph obtained by adding edges to k different pairs of nonadjacent vertices in a multitree (see Figure 4). The problem considered in this paper is to enumerate all 1augmented trees satisfying specified upper and lower bound conditions on feature vectors. In the following, we provide the mathematical definition of the problem. We assume that readers have some familiarity with basic concepts in graph theory. For those who are not familiar with graph theory, we suggest referring to an appropriate textbook (e.g., [16]). Readers not interested in mathematical details can skip this part.
A graph is defined to be an ordered pair (V,E) of a finite set V of vertices and a finite set E of edges, where an edge is an unordered pair of distinct vertices (thus no selfloop exists), where an edge with two endvertices u and v is denoted by uv. A graph is called a multigraph when E is not necessarily composed of distinct pairs of vertices (thus multiple edges are allowed in a multigraph and E is no longer a set, but a multiset), and is called a simple graph if no multiple edges are allowed. The multiplicity (the number of multiple edges) between two vertices u and v is denoted by m(u,v). An edge in a multigraph G is called simple if its multiplicity in G is one. We denote the vertex set and edge set of a graph G by V(G) and E(G), respectively. For a vertex v in a multigraph G, let deg(v;G) denote the number of edges incident to vertex v (i.e., degree of v). In this paper, a cycle is a closed path with a length at least three (two edges with the same endvertices are not treated as a cycle), and a connected multigraph (resp., simple graph) with no cycle is a multitree (resp., simple tree). A kaugmented tree is a connected multigraph such that the number of adjacent vertex pairs (i.e., vertex pairs connected by edges) minus the number of vertices is k−1. For two vertices u and v in a multigraph G, let G+u v denote the multigraph obtained by adding a new edge uv to G; when u v∈E(G), let G−u v denote the multigraph obtained by removing uv from G. Let {\mathbb{Z}}_{+} denote the set of nonnegative integers, and let Σ be a set of colors, which correspond to chemical elements such as H, O and C. Let each color c∈Σ be associated with a valence\text{val}(c)\in {\mathbb{Z}}_{+}. A multigraph G is said to be Σ−colored if each vertex v has a color c(v)∈Σ. Chemical compounds can be viewed as Σcolored, selfloopless connected multigraphs, where vertices and colors represent atoms and elements, respectively.
Let d\in {\mathbb{Z}}_{+} be a prescribed integer, which corresponds to the maximum multiplicity of chemical graphs, and Σ^{k,d} denote the set of all alternating sequences (c_{0},m_{1},c_{1},…,m_{ k },c_{ k }) consisting of colors c_{0},c_{1},…,c_{ k }∈Σ and m_{1},m_{2},…,m_{ k }∈{1,2,…,d}. We denote the union of Σ^{0,d}, Σ^{1,d}, Σ^{2,d},…,Σ^{k,d} by Σ^{≤k,d}. Let {\mathcal{F}}_{k}(\Sigma ,d) be the set of all mappings g from Σ^{≤k,d} to {\mathbb{Z}}_{+}, i.e., {\mathcal{F}}_{k}(\Sigma ,d)=\{g:{\Sigma}^{\le k,d}\to {\mathbb{Z}}_{+}\}.
For a path P=(v_{0},m_{1},v_{1},…,m_{ k },v_{ k }) such that V(P)={v_{0},v_{1},…,v_{ k }}, E(P)={v_{0}v_{1},v_{1}v_{2},…,v_{k−1}v_{ k }}, and m_{ i }=m(v_{i−1},v_{ i }) is the multiplicity of edge v_{i−1}v_{ i }, the length of P is defined to be k=V(P)−1, and the color sequence c(P) of P is defined to be the sequence c(P)=(c(v_{0}),m_{1},c(v_{1}),…,m_{ k },c(v_{ k }))∈Σ^{k,d}.
Given a multigraph G and a sequence t∈Σ^{k,d} for some k, let o c c(t,G) denote the number of paths P in G such that c(P)=t. For an integer K\in {\mathbb{Z}}_{+}, the feature vector f_{ K }(G) of level K in G is defined to be the Σ^{≤K,d}dimensional vector f_{ K }(G) whose value at each entry t∈Σ^{≤K,d} is given by f_{ K }(G)[ t]=o c c(t,G). In this paper, we treat hydrogensuppressed chemical graphs with carbon C, nitrogen N or oxygen O, which are represented by Σcolored multigraphs G with color set Σ={O, N, C }. Figure 2 illustrates an example of Σcolored multigraph G that represents a hydrogensuppressed chemical graph and its feature vector f_{1}(G).
Note that in hydrogensuppressed chemical graph G, deg(v;G)<val(c(v)) may hold for some vertex v∈V(G). Let us define the residue degree res(v) of a vertex v to be val(c(v))−deg(v;G). In a multigraph G, we interpret res(v) of a vertex v as the number of hydrogen atoms attached to the vertex v (in our proposed procedure, we also interpret res(v) as the number of new edges/bonds that can be attached to v when G is being constructed by adding more edges).
For a vector g\in {\mathcal{F}}_{K}(\Sigma ,d) of level K≥1, a multigraph G with f_{ K }(G)=g is a multigraph such that the occurrence of each path t=(c_{0},m_{1},c_{1},…,m_{ p },c_{ p }) in G with length of at most K is completely specified by g[ t], in particular V(G)={t∣g[ t]≥1,t∈Σ^{0,d}=Σ} (i.e., G has exactly g[ t] vertices of color t), E(G)={t∣g[ t]≥1,t=(c,m,c^{′})∈Σ^{1,d}} (i.e., G has exactly g[ (c,m,c^{′})] edges of multiplicity m that join a vertex of color c and a vertex of color c^{′}).
For two vectors {g}_{L},{g}_{U}\in {\mathcal{F}}_{K}(\Sigma ,d) and an integer k≥0, let {\mathcal{G}}_{k}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}) denote the set of all Σcolored kaugmented trees G such that g_{ L }≤f_{ K }(G)≤g_{ U } (i.e., f_{ K }(G)=g^{′} for some g with g_{ L }≤g^{′}≤g_{ U }) and deg(v;G)≤val(c(v)), v∈V(G).
Our problem is to enumerate all kaugmented trees G on a given set of atoms each of which is consistent with one of the feature vectors between the lower and upper vectors {g}_{U},{g}_{L}\in {\mathcal{F}}_{K}(\Sigma ,d), such that g_{ L }≤g_{ U } (where g_{ L }[ t]=g_{ U }[ t] for all t∈Σ^{0,d} since the vertex set is fixed for all G).
In what follows, we fix a color set Σ and an upper bound d on multiplicity. We define the problem of enumerating kaugmented trees as follows.
Enumerating chemical graphs with given upper and lower path frequency (EULF)
Given a maximum path length K\in {\mathbb{Z}}_{+} and feature vectors {g}_{U},{g}_{L}\in {\mathcal{F}}_{K}(\Sigma ,d) such that g_{ L }[ t]=g_{ U }[ t] for all t∈Σ^{0,d}, enumerate all multigraphs G\in {\mathcal{G}}_{k}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}).
Figure 3 illustrates an example of an input of EULF with upper and lower feature vectors g_{ L } and g_{ U } and part of its output, multitrees {G}_{1},{G}_{2}\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}) and a 1augmented tree {G}_{3}\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}).
For k=0, we have developed an efficient algorithm for EULF [13, 14]. The purpose of this work is to describe an algorithm for EULF with k=1. We assume that the maximum valence is 4 and mainly enumerate a 1augmented tree such that the cycle contains an edge of multiplicity one (a single bond), since otherwise a 1augmented tree is a single cycle consisting of edges of multiplicity two, which can be separately handled as a special rare case.
Background
As mentioned in Introduction, enumeration of chemical structures has a long history and many studies have been done. In the field of machine learning, a similar problem, which is called the preimage problem, has been studied [17, 18]. In this problem, a desired object is computed as a feature vector in a feature space, and then the feature vector is mapped back to the input space, where this mapped back object is called a preimage. The definition of the feature vectors based on the frequency of labeled paths [19, 20] or small fragments [10, 21] has been widely used. Akutsu and Fukagawa [22] formulated the graph preimage problem as the problem of inferring graphs from the frequency of paths of labeled vertices and proved that the problem is computationally intractable (NPhard) even for planar graphs with bounded degrees [22]. Nagamochi [23] proved that a graph determined by the frequency of paths with length one can be found in polynomial time if any exists.
The preimage problem has also been studied in the field of chemoinformatics as a part of inverse QSAR/QSPR (quantitative structureactivity relationship/quantitative structureproperty relationship) studies. Indeed, the problem is essentially the same as reconstruction and/or enumeration of molecules from their descriptors in inverse QSAR/QSPR [8, 9, 24, 25], where the descriptors correspond to feature vectors in the preimage problem. Wong and Burkowski developed a practical preimage based method and demonstrated that it actually generated the structure of a new drug candidate [26]. For enumeration of molecules from descriptors, useful tools such as MOLGEN have been developed [27]. However, they are not very efficient if large structures are to be enumerated because many of them treat general graph structures (under the valence constraint).
It might be possible to develop significantly faster algorithms for the preimage problem if we restrict the class of target chemical structures and employ recent techniques for enumeration of graph structures. Fujiwara et al.[28] studied enumeration of treelike chemical graphs that satisfy a given feature vector which specifies frequency of paths of up to a prescribed length K in a chemical compound to be constructed. They proposed a branchandbound algorithm that consists of a branching procedure based on the tree enumeration algorithm by Nakano and Uno [29, 30] and bounding operations designed by properties on path frequency and atomatom bonds. They showed by means of computational experiments on enumeration of alkane isomers that their algorithm works at least as efficiently as the fastest algorithm while using much less memory space.
To reduce the size of the search space, Ishida et al. [31] have introduced a new bounding operation, called the detachmentcut, based on the result of Nagamochi [23]. In this problem formulation, it is required that the path frequency of a chemical structure is exactly the same as the specified one. However, there does not exist such a structure in many cases because a mapping between chemical structures and feature vectors is not surjective and thus there are many vectors in a feature space that do not have preimages. To seek solutions effectively in a relaxed constraint, Shimizu et al. [13] recently introduced a problem of enumerating treelike hydrogensuppressed chemical graphs that satisfy one of a given set of feature vectors which is specified by a pair of upper and lower feature vectors. They proposed a branchandbound algorithm for the problem, called 1Phase algorithm, and afterward Suzuki et al. [14] proposed a more efficient and effective algorithm, called 2Phase algorithm. Implementations of these algorithms [13, 14] for enumerating treelike hydrogensuppressed/hydrogenretained chemical graphs with given upper and lower bounds on path frequencies are available on a web server ( http://sunflower.kuicr.kyotou.ac.jp/tools/enumol2/).
As shown by Nakano and Uno [29, 30], the class of trees admits a nice scheme for computer representation of their structures (called “leftheavy trees”) which enables us to generate trees significantly faster (in constant time per tree) without executing any explicit test on the uniqueness of structure representations of temporarily generated labeled graphs. Development of algorithms for enumerating chemical graphs with a “nontree structure” is thereby a challenging task if we still wish to attain high computational efficiency as we have achieved for enumeration of treelike chemical graphs, because no such effective representation scheme is known for general graphs. It should be noted that although polynomialtime algorithms have been developed for equivalence test and unique representation form problems for bounded degree graphs [32, 33] and chemical compounds [34], they are not directly applicable to efficient enumeration of chemical graphs.
In the NCI database ( http://cactus.nci.nih.gov/ncidb2.2/), the ratio of the number of chemical compounds with kaugmented tree structures to that of all registered chemical compounds is approximately 9%, 22%, 28%, 20%, and 11% for k=0,1,2,3, and 4, respectively. This implies that we have been able to treat only 9% of all of chemical compounds with high computational efficiency. As the first step toward efficient enumeration of nontree chemical graphs, we consider the problem of hydrogensuppressed chemical graphs with 1augmented tree (monocyclic) structure. If we can solve this problem, we can treat 31% (=9%+22%) of chemical compounds. Although no effective representation scheme is known even to 1augmented trees, we can create a tree by removing one edge in the unique cycle in a 1augmented tree (two multiple edges with the same endvertices is not called a cycle in this paper). Additionally, 2Phase algorithm [14], which enumerates treelike hydrogensuppressed chemical graphs, can be used without any major modification to enumerate such trees T=G−e with one edge deficit from 1augmented trees G to be constructed. Thus the main task is to efficiently test the uniqueness of generated labeled 1augmented trees. To design such a procedure, we use a wellreflected definition of a parent 0augmented tree T=G−e of a 1augmented tree G. As a result, we can combine the new procedure with 2Phase algorithm to obtain an algorithm for enumerating hydrogensuppressed chemical graphs with 1augmented tree structure from upper and lower bounds on feature vectors.
Method
Our proposed algorithm is based on existing algorithms to enumerate colored trees [29, 30] and colored multitrees [13, 14, 28, 31]. The basic strategy of our algorithm is to generate a multitree first and then extend it to a 1augmented tree by adding an edge. In enumeration algorithms, it is important not to miss any possible structures and not to duplicate identical structures. In order to efficiently cope with these conditions, the concept of the family tree has been widely employed in various enumeration algorithms. To define a family tree for graphs, we need to define a parentchild relationship between graph structures so that a parent structure is uniquely determined from a child structure, where each child structure is obtained by adding a vertex or an edge to its parent structure. Because extension of a multitree to a 1augmented tree is the core part of our proposed algorithm, we need to provide a proper definition of the parentchild relationship between a multitree and a 1augmented tree. As will be shown later, there may exist multiple possible ways of having a parent structure. How to define the unique parent of a given 1augmented tree is one of the novel points of our proposed algorithm. Another important issue on generating 1augmented trees is not to generate identical 1augmented trees from the same multitree. As will be shown later, there is a case in which additions of different edges result in identical structures. How to efficiently prevent this kind of duplicate generation of identical structures is the other novel point of our proposed algorithm. In the following, we give a detailed description of the algorithm including these novel points. Again, readers not interested in mathematical details can skip this part.
Overview of a new algorithm for 1Augmented trees
Let {\mathcal{G}}_{0}^{\prime} be the set of 0augmented trees (multitrees) T=G−e obtained from each 1augmented tree G\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}) by removing a simple edge e in the unique cycle of G.
Then we have {\mathcal{G}}_{0}^{\prime}\subseteq {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}) for a modified lower vector {g}_{L}^{\prime} in a vector set {g}_{L}^{\prime}. We construct such a vector set {g}_{L}^{\prime} from g_{ L } as follows: For each t∈Σ^{k,d} with k≥2, let {g}_{L}^{\prime}[\phantom{\rule{0.3em}{0ex}}t]=0; and for each t∈Σ^{1,d}, let
Thus our first task is to generate all multitrees T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}) by using fast conventional algorithms such as 2Phase algorithm.
Our next task is to generate 1augmented trees G from each multitree T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}) such that no 1augmented tree in G\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}) will be duplicated during the entire enumeration over all T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}). To attain this objective without storing all generated 1augmented trees for a comparison with a newly generated 1augmented tree, we define a mapping \pi :{\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U})\to {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}); the multitree T=π(G) for a 1augmented tree G is called the parent of G. For a multitree T, a 1augmented tree G with π(G)=T is called a child of T (possibly T has more than one child), and is called a feasible child of T if G\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}). Note that any of definition of such a mapping will suffice as long as π(G) is determined only by the information of an “unlabeled graph” G (i.e., topological structure) except for a possible difference in computational efficiency to avoid duplication of solutions.
In the following, we show the 2Phase algorithm, present details of our definition of parents π and design an efficient procedure for generating all children G from a given multitree T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}).
Summary of 2Phase algorithm for 0Augmented tree
In this section, we summarize 2Phase Algorithm [14] for generating all multitrees in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}). In the first phase, we simplify input feature vectors by adding the frequencies of the paths that include multiple edges to the corresponding paths which consist of only simple edges and then enumerate simple trees for the simplified upper and lower feature vectors. Figure 5 illustrates feature vectors g_{ U } and {g}_{L}^{\prime} and simplified feature vectors g_{ u } and {g}_{L}^{\prime}.
In the second phase, we assign multiplicities of edges for each of the simple trees to satisfy the feature vector constraint and the valence constraint. The inputs and outputs of the first phase and second phase in 2Phase Algorithm are described as follows:
From the given multitree T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}), our efficient procedure generates all 1augmented trees in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}).
Parentchild relationship
In this section, to avoid duplication of a 1augmented tree during the entire enumeration over all T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}), we introduce a parentchild relationship between a 0augmented tree and a 1augmented tree.
Signature of rooted multitrees
To define the parent π(G) of a 1augmented tree G using only topological structure, we first introduce the concepts of “canonical form” and “signature” for a class of multigraphs.
We fix the total order of colors in Σ arbitrarily, e.g., O <N <C, and regard each color c∈Σ as a small integer in {\mathbb{Z}}_{+}. We define the lexicographical order among sequences with elements in \Sigma \cup {\mathbb{Z}}_{+} as follows. A sequence A=(a_{1},a_{2},…,a_{ p }) is lexicographically smaller than a sequence B=(b_{1},b_{2},…,b_{ q }) (denoted by A<B) if and only if there is an index k such that (i) a_{ i }=b_{ i }(1≤i≤k); and (ii) a_{k+1}<b_{k+1}(k+1≤ min{p,q}) or k=p<q; otherwise A=B, i.e., p=q and a_{ i }=b_{ i }(1≤i≤p), or B<A. Let A≤B denote A<B or A=B.
A multigraph is called labeled if each vertex has a unique name or an index such as v_{0},v_{1},…,v_{n−1}, and we usually record a multigraph as labeled in our computer. Hence, testing isomorphism of two multigraphs is to find labels for these “unlabeled graphs” such that the two labeled graphs completely match each other including the adjacency between every two vertices. For a class of multigraphs, if we have a way of choosing a label for each multigraph G\in \mathcal{G} that is unique up to automorphisms of G, then we can test the isomorphism of two graphs directly with their labels. Such a labeling for G is called the canonical form of G. Once such a canonical form is obtained, we can easily encode each multigraph G\in \mathcal{G} into a code σ(G) (which is an integer or a sequence of integers/colors), called the signature of G, such that two multigraphs G,{G}^{\prime}\in \mathcal{G} are isomorphic if and only if σ(G)=σ(G^{′}). Without loss of generality we assume a total order over \{\sigma (G)\mid G\in \mathcal{G}\} by introducing, if necessary, a total order over all colors and a lexicographical (total) order over all sequences of integers and colors.
A rooted graph is a multigraph in which a vertex is designated as the root, and two rooted graphs are isomorphic if there is an isomorphism that maps their roots onto each other.
Any tree T has either a vertex or a pair of adjacent vertices removal of which leaves no component with at least V(T)/2 vertices [35], where the former is called the centroid and the latter is called the bicentroid.
In a rooted multitree T, the parent vertex of a nonroot vertex v is denoted by p(v) and the depth of a vertex v is denoted by depth(v), where the depth of a vertex is its distance to the root. For a vertex v in T, let T_{ v } denote the subtree induced from T by all descendants of v including v. For an edge e=u v in T (where u=p(v)), let T_{ e } (T_{ u v }) denote the subtree of T that consists of T_{ v } and u=p(v) joined by edge e=u v.
For the class of rooted multitrees, a canonical form of a rooted multitree T is given by an “ordered tree” τ of it (i.e., determination of a total order among children of each vertex). Let dfs(τ) denote the total order of vertices in τ visited by the depthfirstsearch order according to the order for children in τ. For example, Figure 6 illustrates three ordered trees τ_{1}, τ_{2} and τ_{3}, which are obtained from the same multitree T rooted at the centroid, where the number beside each vertex v indicates dfs(v).
We let δ(τ) denote the alternating sequence (c_{0},d_{0},c_{1},d_{1},…,c_{n−1},d_{n−1}) such that c_{ i } and d_{ i } denote the color and depth, respectively, of the ith vertex v_{ i } in dfs(τ), and M(τ) denote the sequence (m_{1},m_{2},…,m_{n−1}) of the multiplicity m_{ i }=m(v_{ i },p(v_{ i })) of the edge joining the ith vertex and its parent p(v_{ i }) in T. For a vertex v, let dfs(v) denote the labeling number of v in dfs(τ). For example, Figure 6 illustrates δ(τ_{ i }) of ordered trees τ_{ i }, i=1,2,3 and M(τ_{2}) and M(τ_{3}).
A leftheavy tree of a rooted multitree T is an ordered tree τ that has the maximum code δ(τ) among all ordered trees of T (hence a leftheavy tree τ is a canonical form and δ(τ) is a signature of it when we ignore the multiplicity of rooted multitrees). We define the canonical form of a rooted multitree T to be the leftheavy tree τ that has the maximum code M(τ) among all leftheavy trees of T, and let σ(T) denote a signature of T (a code of the canonical form τ such as (δ(τ),M(τ))). For example, in Figure 6, τ_{2} and τ_{3} are leftheavy trees of T, since they have lexicographically maximum sequences δ(τ_{2})=δ(τ_{3}) among all ordered trees τ of the rooted multitree T, and τ_{3} is the canonical form of T and (δ(τ_{3})=(C,0,C,1,N,2,C,1,N,2,N,1,O,2),M(τ_{3})=(2,1,1,2,1,1)) is the signature of T since it is a leftheavy tree with the lexicographically maximum M(τ_{3}) among all leftheavy trees τ of T.
Using the canonical form for rooted multitrees, we can define a canonical form for “unrooted” multitrees T by regarding them as trees rooted at the centroid or bicentroid.
Defining parents π
We are now ready to define parents π for 1augmented trees (note that there is no root for any 1augmented tree). Let G be a 1augmented tree with a unique cycle C of length p which by our assumption contains at least one simple edge. Then there are p possible choices G−e_{ i }, i=1,2,…,p, for the parent of G. We introduce a rule to choose one of them based only on the topological information on G and C. For each vertex v in C, let N(v) denote the set of vertices in V−V(C) adjacent to v. Removing an edge vw with v∈V(C) and w∈N(v) leaves a multitree containing w, which we denote by T_{ w }. For each vertex v in C, we encode all multitrees T_{ w }, w∈N(v) into a signature σ^{∗}(v) using the signature σ for rooted multitrees; we set
such that \sigma ({T}_{{w}_{1}})\ge \sigma ({T}_{{w}_{2}})\ge \cdots \ge \sigma ({T}_{{w}_{h}}) holds for N(v)={w_{1},w_{2},…,w_{ h }}. Note that two vertices v and v^{′} in C have the same color and an identical set of subtrees in N(v) and N(v^{′}) if and only if σ^{∗}(v)=σ^{∗}(v^{′}). For each simple edge e=u v in C, we define a code c^{∗}(e) as follows. We encode the unique path u_{1} (=u),u_{2},…,u_{ h } (=v) from u to v along C into σ^{∗}(u,v)=(σ^{∗}(u_{1}),m_{1},σ^{∗}(u_{2}),m_{2},…,m_{h−1},σ^{∗}(u_{ h })), where m_{ i }=m(u_{ i },u_{i+1}). Symmetrically, we define σ^{∗}(v,u) = (σ^{∗}(u_{ h }),m_{h−1},σ^{∗}(u_{h−1}),m_{h−2},…,m_{1},σ^{∗}(u_{1})). The code c^{∗}(e) is defined to be lexicographically the maximum one between two sequences σ^{∗}(u,v) and σ^{∗}(v,u). Furthermore, let E^{∗}(C) be the set of simple edges e^{∗} in C such that c^{∗}(e^{∗}) is lexicographically maximum among c^{∗}(e) for all simple edges e in C.
We call an edge vw with v∈V(C) and w∈N(v) a heavy edge if T_{ w } has at least V(G)/2 vertices. We distinguish two cases to define parent π.

Case 1.
There is no heavy edge around C: For an arbitrary edge e∈E^{∗}(C), we define π(G) to be G−e (note that when E^{∗}(C)≥2, G is symmetric around C and G−e and G−e^{′} will be isomorphic for any two edges e,e^{′}∈E^{∗}(C)). Figure 7 illustrates how the parent π(G) of a 1augmented tree G in Case 1 is determined on these signatures σ(u,v) and σ(v,u) of simple edges uv in the cycle C.

Case 2.
There is a heavy edge v^{∗}w^{∗}: Note that no other edge can be a heavy edge. Let e_{1} and e_{2} be the two edges in C that are adjacent to v^{∗}, where at least one of them is a simple edge since deg(v)≤4. If exactly one of them, e.g., e_{1} is a simple edge, then we define π(G) to be G−e_{1}. When e_{1} and e_{2} are simple edges, we choose one of them as follows. We first ignore all trees T_{ w } with w∈N(v^{∗}), which are symmetric at the vertex v^{∗} commonly shared by e_{1} and e_{2} and hence useless to construct a signature for distinguishing e_{1} and e_{2}. Without using T_{ w } with w∈N(v^{∗}), we construct the code c^{∗}(e_{1}) and c^{∗}(e_{2}). Finally we choose any edge e_{ i } such that c^{∗}(e_{ i }) is lexicographically maximum between c^{∗}(e_{1}) and c^{∗}(e_{2}), and define π(G) to be G−e_{ i }.
Generating children
Recall that our algorithm for enumerating 1augmented trees consists of two major stages: the first stage enumerates all multitrees T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}) by 2Phase algorithm, and the second stage generates all feasible children G for each T\in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L}^{\prime},{g}_{U}), i.e., 1augmented trees G\in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}) with π(G)=T. This section describes a procedure for generating all children G=T+e of a given multitree T by adding a new edge e.
For simplicity, we consider the case where a given multitree T has the centroid (the case where it is rooted at the bicentroid can be treated with a minor technical modification). In the following, we assume that a given multitree T is represented as its canonical form (a leftheavy tree) τ rooted at its centroid, and that its sequences δ(τ)=(c_{1},d_{1},…,c_{ n },d_{ n }) and M(τ)=(m_{2},m_{3},…,m_{ n }) over the labeling dfs(τ) have been already computed after the first stage (2Phase algorithm can deliver not only solutions T but also τ and these sequences together).
It should be noted that the canonical form of leftheavy trees enjoys the following recursive structure. For any vertex v in T, the subtree T_{ v } of T rooted at v induces an ordered tree τ_{ v } from the leftheavy tree τ and τ_{ v } is again the canonical form of T_{ v }, since dfs(τ_{ v }) is a subsequence of dfs(τ) with consecutive vertices and its ordered pair (δ(τ_{ v }),M(τ_{ v })) is also lexicographically maximized over all ordered trees of T_{ v }.
Testing generated 1Augmented trees
Given the leftheavy tree τ of a multitree T, we add a new edge xy for two nonadjacent vertices x,y∈V(T)(dfs(x)<dfs(y)) to obtain G=T+x y. Let C denote the cycle created in G. We check the following condition to test whether T is the parent of Gor not.

Case I.
C contains the root (centroid) of T:

(A)
σ^{∗}(x y) is lexicographically maximum among σ^{∗}(e) for all simple edges e in C.

(A)

Case II.
Otherwise:

(B)
x is the ancestor of y;

(C)
σ^{∗}(x y)≥σ^{∗}(e) if the edge e incident to x in C is simple.

(B)
Then, we have the following lemma, where the proof is given in S1.1 (of the Additional file 1).
Lemma 1.
For a multitree T and two nonadjacent vertices x,y∈V(T), testing whether T=π(T+x y) can be done by checking the above condition in O(V(C)^{2}) time.
Avoiding duplication of children
In previous sections, we have showed that all children (i.e., 1augmented trees) of a given multitree T can be generated by the definition of the parentchild relationship between multitree and 1augmented tree. However, if T has two isomorphic subtrees T_{ u } and T_{ v } then we would have two children T+x y and T+x^{′}y^{′}, which are isomorphic to each other. To avoid such duplication, we test if T+x y is isomorphic to T+x^{′}y^{′} for some other vertices x^{′} and y^{′} when we add an edge xy to T. In fact, we do not try to find other such pair x^{′} and y^{′} explicitly. Instead we introduce a rule that we do not generate T+x y by any edge xy that has such an “isomorphic” vertex pair x^{′} and y^{′} on the left hand side of x and y in T. To detect this situation efficiently, we first compute data on each vertex v in T that indicates whether the left hand side of v contains another vertex u such that its subtree T_{p(u)u} is isomorphic to T_{p(v)v}. Using such data, we show that given a vertex pair x and y whether there is an isomorphic pair x^{′} and y^{′} in the left hand side can be checked in constant time. In other words, we show an O(n^{2}) time algorithm extracting all the leftmost side vertex pairs from T.Among all isomorphic vertex pairs, we call the leftmost one an “admissible pair”, where “isomorphic” means here that the connection of vertices in each pair results in an isomorphic tree. Figure 8 shows an example of a case in which additions of different edges result in identical structures and the admissible pair.
In this section, we show the validity that we only need to add an edge between each admissible pair to avoid duplication and omission of 1augmented trees generated from one multitree T. Finally, for a multitree, we provide an efficient algorithm extracting all vertex pairs to generate all children of the multitree.
Admissible pairs
We write T+u v∼T+u^{′}v^{′} if and only if T+u v and T+u^{′}v^{′} are isomorphic. For a tree T, let c_{ T } denote its centroid, which is either a vertex (unicentroid) or an edge (bicentroid). Let T be a leftheavy tree rooted at its centroid c_{ T }. When c_{ T } is a bicentroid r r^{′}, r and r^{′} will be the vertices that have no parent in the parentchild relationship in T. We shall now introduce “rootedisomorphism” among 1augmented trees obtained from T by adding a new edge. We regard a 1augmented tree G=T+u v obtained by adding new edge uv between two nonadjacent vertices u,v∈V(T) as a graph rooted at c_{ T }. When c_{ T } is a vertex r, we say that two 1augmented trees G=T+u v and G^{′}=T+u^{′}v^{′} are rootedisomorphic if they admit an isomorphism ψ such that c_{ T } in G=T+u v corresponds to c_{ T } in G^{′}=T+u^{′}v^{′} (i.e., ψ(r)=r when c_{ T } is a vertex r, and {ψ(r),ψ(r^{′})}={r,r^{′}} when c_{ T } is an edge r r^{′}). We write T+\mathit{\text{uv}}\underset{r}{\approx}T+{u}^{\prime}{v}^{\prime} if and only if T+u v and T+u^{′}v^{′} are rootedisomorphic with root r. Then, the following theorem holds, where the proof is given in Additional file 1: S1.2.
Theorem 2.
Let T be a leftheavy tree rooted at its centroid c_{ T } and {u,v},{u^{′},v^{′}}⊆V(T) be two pairs of nonadjacent vertices. If T+u v∼T+u^{′}v^{′} then T+\mathit{\text{uv}}\underset{r}{\approx}T+{u}^{\prime}{v}^{\prime}.
Theorem 2 tells us that two 1augmented trees G=T+u v and G^{′}=T+u^{′}v^{′} are isomorphic if and only if they are rootedisomorphic (i.e., c_{ T } in G corresponds to c_{ T } in G^{′} in the isomorphism ψ, where possibly ψ(r)=r^{′} and ψ(r^{′})=r when c_{ T }=r r^{′}).
Now we consider how to generate a set {\mathcal{G}}_{T} of 1augmented trees T+u v such that the 1augmented tree T+u v for any pair of nonadjacent vertices u,v∈V(T) is isomorphic to exactly one 1augmented tree G in the set {\mathcal{G}}_{T}. By Theorem 2, we only need to check the rootedisomorphism among 1augmented trees T+u v for all pairs of nonadjacent vertices u,v∈V(T). Based on this, we can modify a given tree T with bicentroid c_{ T }=r r^{′} into a tree T^{′} with unicentroid r^{∗} by inserting a new vertex on the edge r r^{′}. Since this does not change the rootedisomorphism among 1augmented trees T^{′}+u v or the leftheaviness of T, we assume in the following that a given tree T has a unicentroid c_{ T }=r.
Let T be a leftheavy tree. We shall introduce some terminology. Let x be a nonroot vertex x in T. Denote by left(x) the immediate left sibling of a nonroot vertex x (if any). We define data copy as follows.
Let u and v be two vertices in T. We denote by P(u,v) the unique path in T that connects u and v, where P(u,v)=P(v,u). Let lca(u,v) denote the least common ancestor of u and v, i.e., the highest vertex in P(u,v) (where we define lca(u,v) to be the edge c_{ T }=r r^{′} when T is rooted at the bicentroid c_{ T }=r r^{′}, and u∈V(T_{ r }) and v\in V({T}_{{r}^{\prime}})). When dfs(u)<dfs(v), we define the greatest uncommon ancestor gua of u and vas follows:
● Let gua(u,v) denote the child of lca(u,v) that is closest to u in T, where gua(u,v) is an ancestor of u (including u itself) if lca(u,v)≠u;
● Let gua(v,u) denote the child of lca(u,v) that is an ancestor of v (including v itself), where gua(u,v)=gua(v,u) if lca(u,v)=u.
We call a pair of nonadjacent vertices u,v∈V(T) with dfs(u)<dfs(v)admissible if it satisfies the following conditions (see Figure 9 for conditions (1) and (2) and Figure 10for condition (3)):

(1)
copy(w) = 0 for all vertices w∈ V(P(lca(u,v),r))−{r};

(2)
copy(w)=0 for all vertices w∈V(P(u,gua(u,v)))∪V(P(v,gua(v,u)))−{lca(u,v),gua(v,u)};

(3)
if copy(gua(v,u))=1 then

(i)
gua(u,v)=left(gua(v,u)) (hence u≠lca(u,v)); and

(ii)
For the copy \xfb of vertex u in T_{gua(v,u)}, it holds \text{dfs}(v)\ge \text{dfs}(\xfb) (where \text{dfs}(\xfb)=\text{dfs}(u)+V({T}_{\text{gua}(u,v)})).

(i)
Note that (3)(i) implies that copy(gua(v,u)) in (2) needs to be 0 when lca(u,v)=u.
The next lemma indicates that we only need to add an edge between each admissible pair to avoid duplication of 1augmented trees, where the proof is given in Additional file 1: S1.3.
Lemma 3.
For a leftheavy tree T rooted at its unicentroid c_{ T }=r, let {\mathcal{G}}_{T}=\{T+\mathit{\text{uv}}\mid admissible pairs u,v∈V(T)}. Then the 1augmented tree T+u v for any pair of nonadjacent vertices u,v∈V(T) is isomorphic to exactly one 1augmented tree G in {\mathcal{G}}_{T}.
Algorithm
In this section, we describe an algorithm of the second stage to generate all children of a given multitree T without duplication and omission, and show the computational complexity of the second stage. To generate all children of T, we first find all admissible pairs (u,v) for T and test whether T is the parent π(T+u v) of T+u v or not. Notice that a straightforward method would take O(n) time to check whether a pair (u,v) is admissible or not. Since there are at most _{ n }C_{2} vertex pairs in a multitree T, finding all admissible pairs for T may take O(n^{3}) time. That is, from Lemma 1, we may need O(n^{3}V(C)^{2})=O(n^{5}) time to generate all children of T.
In what follows, we design a faster O(n^{4})time algorithm to generate all children of a given multitree T. For this, we find only a subset of all admissible pairs, called the set of “candidate” pairs defined as follows (see also Figure 11). We see that no pair (x,y) generates a child T+x y of T if a heavy edge is created in T+x y and x is not an ancestor of y, since such (x,y) does not satisfy any of Cases I and II for generating children of T. Hence we are not interested in storing such pairs (x,y), and call an admissible pair (x,y) a candidate pair when (i) no heavy edge is created in T+x y; or (ii) x is an ancestor of y, where (i) (resp., (ii)) is a necessary condition of Case I (resp., Case II). By definition, every candidate vertex pair (x,y) is admissible, whereas any admissible pair (x,y) such that T+x y is a child of T is always a candidate pair. Therefore, to generate all children of T, we do not need to find all admissible pairs and only have to extract all candidate pairs.
To facilitate this, we examine all vertex pairs (u,v) (dfs(u)<dfs(v)) in T in a lexicographical order with respect to (dfs(u),dfs(v)), i.e., we choose each vertex v_{ i } from v_{0} to v_{n−1} as u and then choose each vertex v_{ j } from v_{i+1} to v_{n−1} as v. We call the lexicographical order over vertex pairs a dfs order. For each of the generated vertex pairs, we check whether it is a candidate pair or not. Finally, for each candidate pair (u,v), we test whether T+u v is a child of T in Case I or II.
We can find all candidate pairs for a multitree T in O(n^{2}) time in total as stated below. The proof is given in Additional file 1: S1.4.
Lemma 4.
For a leftheavy multitree T, all candidate pairs of T can be found in O(n^{2}) time.
Finally, by Lemma 1 and Lemma 4, we can generate all children of a multitree T in O(n^{4}) time, as stated in the next lemma.
Lemma 5.
Given a leftheavy multitree T, all children of T can be generated in O(n^{4}) time.
Proof.
For a leftheavy multitree T, we can find all candidate pairs in O(n^{2}) time by Lemma 4. For each candidate pair (u,v), we can test whether T=π(T+u v) or not in O(V(C)^{2}) time by Lemma 1. Thus, for a leftheavy multitree T, all children of T can be generated in O(n^{2}V(C)^{2})=O(n^{4}) time. □
Experimental and results
This section reports experimental results of our algorithm enumerating 1augmented trees. Tests were carried out on a PC with an Intel Core i5 processor running at 3.20 GHz and the Linux operating system using the C language, employing instances based on chemical compounds selected from the KEGG LIGAND database [15] ( http://www.genome.jp/ligand/).

(I)
First we select four chemical compounds “C00062,” “C03343,” “C03690,” and “C07178” as chemical graphs with 0augmented tree (acyclic) structure and four chemical compounds “C00095,” “C00270,” “C00645,” and “C00837” as chemical graphs with 1augmented tree structure (see Additional file 1: Figure S21 for illustrations of these chemical graphs), wherein each benzene ring in chemical compounds “C03343,” “C03690,” and “C07178” is regarded as a virtual atom b of valence 6. These compounds are heuristically selected based on the following criteria: (i) each compound is a 0augmented tree or 1augmented tree (except benzene ri ngs), (ii) each compound consists of C,O,H (or, C,O,N,H) atoms, (iii) compounds are not very similar to each other, and (iv) compounds have varying sizes but are not too large.
The virtual atom b is treated as one atom so that we discard all possible regioisomers of benzene. Thus in our experiment, we consider the cycles not caused by benzenes but by other substructures in these 1augmented trees. We remark that an efficient algorithm has been developed for generating all possible regioisomers of a given 0augmented tree structure with virtual atoms b by Li et al. [36], and an implementation of the algorithm is available on a web server ( http://sunflower.kuicr.kyotou.ac.jp/tools/enumol2/).
To generate problem instances from each of the selected chemical graphs, we define w\in {\mathbb{Z}}_{+} to be a width between upper and lower feature vectors. From the feature vector g=f_{ K }(G) of a chemical graph G at level K, we construct two feature vectors g_{ U } and g_{ L } of width w as follows. For each entry t with g[ t]≥1, let g_{ U }[ t]=g[ t]+w and g_{ L }[ t]= max{g[ t]−w,0}; and for each entry t with g[ t]=0, let g_{ U }[ t]=g_{ L }[ t]=0. See Additional file 1: Figure S23 (resp., Additional file 1: Figure S24) for the lower and upper feature vectors g_{ L } and g_{ U } with K=1 and w=1 (resp., K=2 and w=1) created from C00062.
To examine the computational efficiency, we compare the time per output multitree/1augmented tree by our algorithm and by 2Phase algorithm [14]. Our algorithm enumerates not only the 1augmented trees in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}) but also the multitrees in {\mathcal{G}}_{0}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}). Therefore, if time per output graph of our algorithm is close to that of 2Phase algorithm, then we can enumerate 1augmented trees as fast as 2Phase algorithm enumerates multitrees.
Table 1 shows the result of the comparison of 2Phase algorithm and our algorithm for varying K with fixed w=1, where the meanings of columns are as follows.
Note on tables:

(1)
C00062, C00095, C00270 C00645, C00837, C03343, C07178, and C03690 are the chemical compounds in the KEGG LIGAND database, respectively;

(2)
in Table 1, the width for constructing upper and lower feature vectors is 1;

(3)
n is the number of vertices without hydrogen atoms in an instance preprocessed by replacing each benzene ring with a new atom with six valences;

(4)
w is the width for constructing upper and lower feature vectors;

(5)
K is the level of given feature vectors;

(6)
“time (s)” is the CPU time in seconds;

(7)
T.O. means the “time over” (the time limit is set to be 1,800 seconds);

(8)
“time/graph” is the time per enumerating one graph;

(9)
“tree” is the number of all possible solutions of treelike chemical graph in the time limit;

(10)
“cycle” is the number of all possible solutions of 1tree chemical graph in the time limit;

(11)
“ratio” is a number such that “time/graph” of our algorithm is divided by that of 2Phase algorithm;

(12)
for any real numbers x and y, let xEy denote x×10 ^{y}.
It is to be noted that in some instances, the number of enumerated trees by 2Phase algorithm and that of our algorithm are different because of the time limit. Hence, the “tree” and “cycle” columns show the number of incomplete solutions in instances whose “time” column is “T.O.”. However, this is not a critical issue because we mainly want to know the “time per graph” and its “ratio” between 2Phase algorithm and our algorithm. We can make use of them as beneficial results from “tree,” “cycle,” and “time” columns even if they are incomplete and “T.O.”.
We find that almost all instances solved within the time limit by 2Phase algorithm are also solved by our algorithm within the time limit. Moreover, the “ratio” of instances is less than 10 except 4 out of 38 cases, and that of many instances is less than 5. This means that the time per output by our algorithm is close to that by 2Phase algorithm. Therefore, we have demonstrated that our algorithm maintains the high computational efficiency of 2Phase algorithm even if K changes. Note that our algorithm does not output any 1augmented trees in {\mathcal{G}}_{1}(\phantom{\rule{0.3em}{0ex}}{g}_{L},{g}_{U}) in “C00062,” “C03343,” “C07178,” and “C03690” when K is large. This is because the instances are acyclic chemical compounds: 1augmented trees become less able to satisfy the feature vector constraint as K increases and only multitrees can satisfy the feature vector constraint.
Table 2 shows the result of the comparison of 2Phase algorithm and our algorithm for varying w with fixed K=3. Just like with Table 1, almost all instances solved within the time limit by 2Phase algorithm are also solved by our algorithm within the time limit. The “ratio” of instances is less than 10 except 7 out of 48 cases, and that of many instances is less than 5. In particular, with respect to “C00095,” “C00645,” and “C00837,” which have 1augmented tree structure, the “ratio” is less than 1 or close to 1. This implies that our algorithm can enumerate 1augmented trees and multitrees faster than 2Phase algorithm enumerates multitrees. These results mean that the time per output by our algorithm is close to that by 2Phase algorithm. Therefore, we have demonstrated that our algorithm maintains the high computational efficiency of 2Phase algorithm even if w changes.
Finally, from Table 1 and Table 2, we compare 2Phase algorithm and our algorithm in terms of varying n, where n is the size of an instance. Note that n is the number of vertices without hydrogen atoms in an instance preprocessed by replacing each benzene ring with a new atom with six valences. We notice that there is no large difference in the “ratio” between all cases in spite of the fact that the instance size of C03690 is almost twice as large as that of C00062. This implies that our algorithm maintains the high computational efficiency of 2Phase algorithm even if the instance size becomes large.

(II)
Next we select four chemical compounds, prostaglandin (D08040), allobarbital (D00332), gabapentin (D00555), and histamine (D00079) as chemical graphs with 1augmented tree structure (see Additional file 1: Figure S22 for illustrations of these chemical graphs), all of which are existing drug compounds. We conducted the same experiment as we did for (I): Table 3 shows the result of the comparison of 2Phase algorithm and our algorithm for varying K with fixed w=1; Table 4 shows the result of the comparison of 2Phase algorithm and our algorithm for varying w with fixed K=3. In this experiment, we observe that there still is no large difference in the “ratio” between all cases except for the instance of D00079.
Discussions and conclusions
We considered the problem of enumerating all chemical graphs of 1augmented tree structure from a given set of pathfrequency based feature vectors specified by upper and lower feature vectors, and proposed a new exact algorithm by extending 2Phase algorithm [14]. The experimental results reveal that the computational efficiency of the new algorithm remains high, considering the hardness of treating 1augmented trees compared with 0augmented trees.
One of our future works is to introduce new bounding operations for 1augmented trees in 2Phase algorithm and our procedure for creating a cycle. Additionally, it is important to extend the proposed algorithm for enumeration of kaugmented trees with k≥2 because we can cover 59%, 79%, and 90% of chemical compounds by 2augmented trees, 3augmented trees, and 4augmented trees, respectively. In this paper, we used the assumption that chemical graphs we treat contain only atoms with valence at most 4 (except benzene rings) in order to define the parent of a 1augmented tree G as a 0augmented tree T that is obtained by removing an edge corresponding to a single bond in G. However, it is not difficult to extend our enumeration algorithm for chemical graphs possibly with atoms with valence more than 4 just by modifying the definition so that the parent of a 1augmented tree G is allowed to be a 0augmented tree T obtained by removing an edge that corresponds to a double or triple bond in G. Although benzene rings have already been treated as virtual atoms of valence 6, regioisomers are ignored in the proposed algorithm. As mentioned in “Experimental and results” section, an efficient algorithm for generating all possible regioisomers of a given 0augmented tree structure with virtual atoms b has been developed [36]. Therefore, combination of the proposed algorithm with that algorithm is left as future work as well as further extensions for including atoms with valence more than 4 and furan and more general structures.
Although we do not aim to develop enumeration algorithms that are directly applicable to drug design, this is a future target of our research. In order to apply enumeration algorithms to drug design, considering features based on the path frequency is far from sufficient. Factors such as hydrogen bond donors, hydrogen bond acceptors, positive charges, negative charges, and hydrophobic centers should be taken into account. In addition, the binding site information of the target molecule and geometric information such as the occurrence of rotatable bonds should be reflected. In order to include these factors in enumeration algorithms, we should develop efficient methods that can relate chemical graphs with such physicochemical and geometric factors. However, such a development is not an easy task even for one type of factor and thus is longterm future work.
References
Bytautas L, Klein DJ: Chemical combinatorics for alkaneisomer enumeration and more. J Chem Inf Comput Sci. 1998, 38: 10631078. 10.1021/ci980095c.
Bytautas L, Klein DJ: Formula periodic table for acyclic hydrocarbon isomer classescombinatorially averaged graph invariants. Phys Chem Chem Phys. 1999, 1: 55655572. 10.1039/a906137a.
Bytautas L, Klein DJ: Isomer combinatorics for acyclic conjugated polyenes: enumeration and beyond. Theoret Chem Acc. 1999, 101: 371387. 10.1007/s002140050455.
Buchanan BG, Feigenbaum EA: DENDRAL and MetaDENDRAL: their applications dimension. Artif Intell. 1978, 11: 524. 10.1016/00043702(78)900103.
Funatsu K, Sasaki S: Recent advances in the automated structure elucidation system, CHEMICS. Utilization of twodimensional NMR spectral information and development of peripheral functions for examination of candidates. J Chem Inf Comput Sci. 1996, 36: 190204. 10.1021/ci950152r.
Fink T, Reymond JL: Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery. J Chem Inf Comput Sci. 2007, 47: 342353. 10.1021/ci600423u.
Mauser H, Stahl M: Chemical fragment spaces for de novo design. J Chem Inf Comput Sci. 2007, 47: 318324. 10.1021/ci6003652.
Faulon JL, Churchwell CJ: The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. J Chem Inf Comput Sci. 2003, 43: 721734. 10.1021/ci020346o.
Hall LH, Dailey ES: Design of molecules from quantitative structureactivity relationship models. 3. Role of higher order path counts: Path 3. J Chem Inf Comput Sci. 1993, 33: 598603. 10.1021/ci00014a012.
Deshpande M, Kuramochi M, Wale N, Karypis G: Frequent substructurebased approaches for classifying chemical compounds. IEEE Trans Knowl Data Eng. 2005, 17: 10361050.
Cayley A: On the analytic forms called trees with applications to the theory of chemical combinations. Rep Br Assoc Adv Sci. 1875, 45: 257305.
Pólya G: Kombinatorische anzahlbestimmungen für gruppen, graphen, und chemische verbindungen. Acta Math. 1937, 68: 45254.
Shimizu M, Nagamochi H, Akutsu T: Enumerating treelike chemical graphs with given upper and lower bounds on path frequencies. BMC Bioinformatics. 2011, 12 (Suppl 14): 53
Suzuki M, Nagamochi H, Akutsu T: A 2phase algorithm for enumerating treelike chemical graphs satisfying given upper and lower bounds. IPSJ SIG Technical Reports. BIO 2012, 2012, 28 (17): 18.
Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010, 38: D355D360,
Harary F: Graph Theory. 1969, MA: Addison Wesley
Bakir GH, Weston J, Schölkopf B: Learning to find preimages. Adv Neural Inform Process Syst. 2003, 16: 449456.
Bakir GH, Zien A, Tsuda K: Learning to find graph preimages. Lect Notes Comput Sci. 2004, 3175: 253261. 10.1007/9783540286493_31.
Kashima H, Tsuda K, Inokuchi A: Marginalized kernels between labeled graphs. Proceedings of the 20th International Conference Machine Learning. 2003, Palo Alto: AAAI Press, 321328.
Ueda N, Akutsu T, Perret JL, Vert JP, Mahé P: Graph kernels for molecular structureactivity relationship analysis with support vector machines. J Chem Inf Model. 2005, 45: 939951. 10.1021/ci050039t.
Byvatov E, Fechner U, Sadowski J, Schneider G: Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. J Chem Inf Comput Sci. 2003, 3: 18821889.
Akutsu T, Fukagawa D, Jansson J, Sadakane K: Inferring a graph from path frequency. Discrete Appl Math. 2012, 160: 14161428. 10.1016/j.dam.2012.02.002.
Nagamochi H: A detachment algorithm for inferring a graph from path frequency. Algorithmica. 2009, 53: 207224. 10.1007/s0045300891840.
Kier L, Hall L, Frazer J: Design of molecules from quantitative structureactivity relationship models. 1. Information transfer between path and vertex degree counts. J Chem Inf Comput Sci. 1993, 33: 143147. 10.1021/ci00011a021.
Miyao T, Arakawa M, Funatsu K: Exhaustive structure generation for inverseQSPR/QSAR. Mol Inform. 2010, 29: 111125. 10.1002/minf.200900038.
Wong WWL, Burkowski FJ: A constructive approach for discovering new drug leads: Using a kernel methodology for the inverseQSAR problem. J Cheminf. 2009, 1: 410.1186/1758294614.
Gugisch R, Kerber A, Laue R, Meringer M, Rucker C: History and progress of the generation of structural formulae in chemistry and its applications. MATCH Comm Math Comput Chem. 2007, 58: 239280.
Fujiwara H, Wang J, Zhao L, Nagamochi H, Akutsu T: Enumerating treelike chemical graphs with given path frequency. J Chem Inf Model. 2008, 48: 13451357. 10.1021/ci700385a.
Nakano S, Uno T: Generating colored trees. Lect Notes Comput Sci. 2005, 3787: 249260. 10.1007/11604686_22.
Nakano S, Uno T: Efficient generation of rooted trees. NII Technical Report. 2003, NII2003005E,
Ishida Y, Zhao L, Nagamochi H, Akutsu T: Improved algorithms for enumerating treelike chemical graphs with given path frequency. Genome Inform. 2008, 21: 5364.
Luks EM: Isomorphism of graphs of bounded valence can be tested in polynomial time. J Comput Syst Sci. 1982, 25: 4265. 10.1016/00220000(82)900095.
Fürer M, Schnyder W, Specker E: Normal forms for trivalent graphs and graphs of bounded valence. Proceedings of 15th Annual ACM Symp Theory of Computing. 1983, NY: ACM Press, 161170.
Faulon JL: Isomorphism, automorphism partitioning, and canonical labeling can be solved in polynomialtime for molecular graphs. J Chem Inf Comput Sci. 1998, 38: 432444. 10.1021/ci9702914.
Jordan C: Sur les assemblages de lignes. J Reine Angew Math. 1869, 70: 185190.
Li J: Enumerating benzene isomers of treelike chemical graphs. Master’s Thesis. Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University 2014,
Acknowledgements
This work was supported in part by GrantinAid for Scientific Research (KAKENHI) no. 22240009 from JSPS, Japan.
Author information
Authors and Affiliations
Corresponding authors
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HN designed the algorithm and performed theoretical analysis on it. MS detailed the algorithm, implemented it, and performed computational experiments. TA participated in discussions for analyses of the algorithm and the results of computational experiments. HN and MS wrote the manuscript and TA revised it. All authors read and approved the final manuscript.
Electronic supplementary material
13321_2013_606_MOESM1_ESM.pdf
Additional file 1: Proofs of Lemmas and Theorems. Proofs of Lemma 1, Theorem 2, Lemma 3, and Lemma 4, descriptions of the 2phase algorithm and the main algorithm, and some figures for chemical graphs and upper and lower feature vectors used in the computational experiment are given in this supplemen (PDF 2 MB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Suzuki, M., Nagamochi, H. & Akutsu, T. Efficient enumeration of monocyclic chemical graphs with given path frequencies. J Cheminform 6, 31 (2014). https://doi.org/10.1186/17582946631
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/17582946631
Keywords
 Chemical graphs
 Enumeration
 Monocyclic structure
 Feature vector
Comments
View archived comments (1)