Skip to main content

Molecular structures enumeration and virtual screening in the chemical space with RetroPath2.0



Network generation tools coupled with chemical reaction rules have been mainly developed for synthesis planning and more recently for metabolic engineering. Using the same core algorithm, these tools apply a set of rules to a source set of compounds, stopping when a sink set of compounds has been produced. When using the appropriate sink, source and rules, this core algorithm can be used for a variety of applications beyond those it has been developed for.


Here, we showcase the use of the open source workflow RetroPath2.0. First, we mathematically prove that we can generate all structural isomers of a molecule using a reduced set of reaction rules. We then use this enumeration strategy to screen the chemical space around a set of monomers and predict their glass transition temperatures, as well as around aminoglycosides to search structures maximizing antibacterial activity. We also perform a screening around aminoglycosides with enzymatic reaction rules to ensure biosynthetic accessibility. We finally use our workflow on an E. coli model to complete E. coli metabolome, with novel molecules generated using promiscuous enzymatic reaction rules. These novel molecules are searched on the MS spectra of an E. coli cell lysate interfacing our workflow with OpenMS through the KNIME Analytics Platform.


We provide an easy to use and modify, modular, and open-source workflow. We demonstrate its versatility through a variety of use cases including molecular structure enumeration, virtual screening in the chemical space, and metabolome completion. Because it is open source and freely available on, workflow community contributions should likely expand further the features of the tool, even beyond the use cases presented in the paper.


The number of known chemical reactions is huge, at the time this manuscript was written there were ~84 million single- and multi-step reactions in the Chemical Abstract Service database (CAS) [1]. Yet, many reactions in CAS are redundant because the same reactions are applied to different reactants. Identifying identical reactions can be performed by computing reaction rules. Reaction rules represent reactions at the reaction center only. In other words, a reaction rule comprises only the substructures of the reactants and the products for which the atoms are either directly involved in bond rearrangements or are deemed to be essential for the reactivity of the reaction center. While a set of reaction rules is of course not available for all known chemical reactions, rules have been compiled for focused applications, such as retrosynthesis planning [2, 3], the discovery of novel chemical entities in medicinal chemistry [4], xenobiotic (including drug) degradation [5], metabolomics [6], and metabolic engineering [7,8,9,10]. Depending on the application, the number of rules varies from less than one hundred to few thousands, but in all cases the number of known reactions per application far exceeds the number of rules (there are for instance more than 14,000 reactions in metabolic databases such as MetaNetX [11]). There are several ways of coding reaction rules (for instance, BE-matrices [12] and fingerprints [13]) but most of the time the rules can be represented by reaction SMARTS [14], as it is done in the current paper.

The purpose of reactions rules is to generate reaction networks. The rules can be used in a forward manner to find for instance the metabolic degradation products of a drug, or in a reverse manner to find the reactions producing a desired product from a set of available reactants. In this later usage one produces retrosynthesis reaction networks. Several tools have been developed in the past to generate (retrosynthesis) reaction networks and reviews are available for synthesis planning [2, 3] and for metabolic engineering [15]. Disregarding if the rules are applied in a forward or reverse manner, network generation tools are making use of the same core algorithm. Starting from a source set of compounds the core algorithm applies the rules in an iterative fashion either a predefined number of times or until a sink set of compounds has been produced. At each iteration, the algorithm fires the rules on the source set producing new molecular structures and determines the new source set of molecules the rules will be fired upon at the next iteration. That set must comprise molecules that have not been processed before. Further details on the core algorithm and the differences between the various implementations are provided in Faulon et al. [16] and Delépine et al. [17]. In the current paper we make use of an open source workflow (RetroPath2.0 [17]), which follows the above core algorithm. This workflow is not based on original codes but instead was constructed entirely by assembling KNIME nodes [18] developed by the cheminformatics community (primarily RDKit nodes [19]). RetroPath2.0 is the first open source release of a retrosynthesis reaction network generation, and its usage in the current paper beyond network generation demonstrates its versatility.

As already mentioned, reaction network generation tools coupled with reactions rules have been developed and used primarily for synthesis planning and metabolic engineering, but can they be used to enumerate molecules (isomers for instance) and more generally to search chemical structures in the chemical space?

In principle yes if one can devise reaction rules enabling the production of any molecule in the chemical space. Such a set of rules necessarily exists for all known molecules (such as those in the CAS database) since they have been produced through either natural or synthetic chemical reactions. In practice and as already stated, reaction rules so far developed are application limited. Yet, within their respective application fields, specific rules have been used to discover novel molecules and reaction pathways. Taking experimentally validated examples, the rules associated with the ligand-based de novo design software DOGS ( LLC) [4] have enabled the production of new chemical entities inhibitors of DAPK3 (death-associated protein kinase 3) [20], metabolic rules for promiscuous enzymes have allowed the discovery of novel metabolites in E. coli [21] and have also been used to engineer metabolic pathways producing 1,4-butanediol [9] and flavonoids [22].

Going beyond application limited reaction rules, the main contribution of the present paper is to propose a set of transformation rules that enables the generation of any isomer of any given molecule of the chemical space. Precisely, we prove the claim that any isomer of any given molecule of N atoms, can be reached applying at most O(N 2) rules.

As illustrations, our transformation rules are used to screen the chemical space for structures that are similar to a given set of well-known monomers and to search aminoglycosides structures maximizing antibacterial activities. The compounds produced by our rules are not necessarily chemically accessible, since our transformation rules are not constructed based on chemical synthesis schema. To probe the (bio)synthetic accessibility of our solutions, we also perform search in the (bio)chemical space using enzymatic reaction rules. The enzymatic rules are also used to propose novel molecules completing E. coli metabolic network and for which masses are found in cell lysate mass spectra.

All results presented in this paper have been produced making use of the open source workflow RetroPath2.0. RetroPath2.0 and the associated data are provided as Additional file 1 and can be downloaded at The only differences between the various usages we have made of the RetroPath2.0 are within (1) the set of reaction rules and (2) the way molecules are selected at each iteration during the network generation process.

Results and discussions

The purpose of this section is to showcase the versatility of RetroPath2.0 by taking use cases of interest to the community. We first propose reaction rules to enumerate isomers (“Isomer enumeration” section), we then use the rules to screen in the chemical space structures that are similar to some known monomers (“Virtual screening in the chemical space” section) and compute property distribution (Glass transition temperature) in both the Chemical Space and PubChem, we next use a QSAR to search aminoglycosides types molecules for which antibacterial activity is maximized using both isomer transformation rules and enzymatic rules (“Search for molecules maximizing biological activities”), and we finally use enzymatic rules to find novel metabolites in E. coli and annotate the MS spectra of an E. coli cell lysate interfacing RetroPath2.0 with OpenMS [23] (“Metabolome completion and metabolomics” section).

Isomer enumeration

Isomer enumeration is a long-standing problem that is still under scrutiny [24, 25]. Our intent here is not to provide the fastest enumeration algorithm but to demonstrate how RetroPath2.0 can perform that job once appropriate reaction rules are provided. However, we provide in Additional file 2: Figure S1 a comparison of Retropath2.0’s execution time with the OMG and PMG software tools [25, 26] specifically dedicated to isomers enumeration. Retropath2.0 is found faster than OMG but slower than PMG. Thereafter, we outline two approaches making use of RetroPath2.0. The first is based on the classical canonical augmentation algorithm [27] and the second consists of iteratively transforming a given molecule such that all its isomers are produced. We name this latter approach isomer transformation. In both cases we limit ourselves to structural (constitutional) isomers, as there already exist workflows to enumerate stereoisomers [28] (Additional file 2: Figure S2).

Canonical augmentation

The principle of canonical augmentation, which is an orderly enumeration algorithm, is to grow a molecular graph by adding one atom at a time and retaining only canonical graphs for the next iteration [27]. The algorithm first proposed by Brendan McKay has been used to generate the GDB-17 database of small molecules [29]. The original algorithm has also been modified such that at each step a bond (not an atom) is added to the growing molecules [25]. In the present implementation we use the original McKay algorithm [27], consequently, the number of iterations is the number of atoms one wishes the molecule to have. The algorithm can easily be implemented into RetroPath2.0 by choosing as a source set a single unbonded atom, and a rule set depicting all possible ways an atom can be added to a molecular graph (see “Methods” section for more information). Considering that an atom can be added to a growing molecule through one, two, or more bonds (depending on its valence), the set of reaction rules is straightforward however cumbersome if one starts to consider all possible atoms types. For this reason we limit ourselves to carbon skeleton as it is usually done in the first step of isomer enumeration algorithm. Figure 1 below depicts the set of rules that generate all triangle free carbon skeletons.

Fig. 1
figure 1

Reaction rules for canonical augmentation of carbon skeletons. The corresponding reaction SMARTS string is provided for each rule

We note that rules R2–R4 will generate cycles since the added atom is attached to the growing molecule by 2–4 four bonds, thus only rule R 1 is necessary to grow acyclic molecules (alkanes for instance). The Table below provides the numbers of structural isomers of alkanes found up to 18 carbon atoms running RetroPath2.0 with rule number 1 in Fig. 1.

Isomer transformation

The isomer canonical augmentation algorithm becomes more complex when one starts to consider different atom and bond types. To overcome these difficulties the idea of the transformation enumeration approach is to start with one fully-grown molecule to which one applies all possible transformations such that all the structural isomers of the initial molecule are generated. This approach can be implemented in RetroPath2.0 using a hydrogen saturated molecule as a source and a reaction rule set enabling to transform the molecule while keeping the correct valence for each atom. Because atom valences are maintained the total number of bonds must remain the same after the transformations have taken place. In order to maintain the number of bonds constant, for any reaction rule the number of bonds created must equal the number of bonds deleted.

RetroPath2.0 applies a reaction rule to a given molecule by first searching all occurrences in the molecule of the subgraph representing the reactant (left side of the rule). To this end the labels on the subgraph are removed. Then for each occurrence of the unlabeled subgraph in the molecule, the labels are restored and the bonding patterns on the molecule are changed accordingly. The process is illustrated in the Fig. 2 below where it can be seen that rules R a and R b are identical (i.e. they produce the same solutions). In general, two rules R a = (La, A) and R b = (Lb, B) will produce the same solutions if a one-to-one mapping π can be found between the labels La and Lb of the rules such that the set of edges (A) in R a is transformed by π into the edges (B) of R b, i.e. π(A) = B.

Fig. 2
figure 2

Identical rules. There are two different ways (two different possible matchings for the reactants of the rules) of applying rules R a and R b, each rule produces molecules M1 and M2. The molecules produced by R a are identical to those produced by R b because the rules are identical. R a is identical to R b because when applying the one-to-one label mapping π(1, 2 , 3, 4) = 2, 1, 4, 3 on the edges of the R a one obtains the edges of R b


The 19 rules described in Fig. 3 allow us to generate all isomers of a given molecule at most 3/4 * (N 2 − N) iterations, where N is the number of atoms, respecting the following constraints: the maximal valence is 4 and there cannot be two double bonds on the same atom in a 3 or 4 membered ring.

Fig. 3
figure 3

Isomer transformation rule set. All reactions rules are solutions of system of Eq. (2) and are not identical (see text and Fig. 2 for definition of identical rules). Reactions in green move bonds around without creating or deleting cycles. Reactions in blue change bond order by creating or deleting at least one cycle. To each reaction corresponds a reverse reaction. The reverse reaction of R 1 is R 1, for R 2 it is R 4, for R 3: R 7, for R 5: R 5, for R 6: R 8 and the reverse reaction of R 9 is R 9. The reverse reaction for R 10 is R 15, for R 11, R 18, for R 12, R 19, for R 13, R 16 and for R 14, R 17. The bond order a13 and a24 can take any value from 0 to 3. The full list of rules excluding triple bonds can be found in Additional file 2: Figure S2

Lemma 1

The minimal number of bonds one can change is 4 and the 19 rules described in Fig. 3 generate all minimal transformations respecting the following constraints: the maximal valence is 4 and there cannot be two double bonds on the same atom in a 3 or 4 membered ring.


The minimal transformation one can perform consists of deleting one bond and creating another one. Since the bond created must be different from the one deleted at least three atoms (A1, A2, A3) must be involved. Let a12, a13, and a23 be the bond orders between the three atoms and let b12, b13 and b23 the bond orders after the reaction has taken place. Because the atom valence is maintained the following system of equations holds:

$$\begin{aligned} & \left( {{\text{L}}_{1} } \right){\text{a}}_{12} + {\text{a}}_{13} = {\text{b}}_{12} + {\text{b}}_{13} \\ & \left( {{\text{L}}_{2} } \right){\text{a}}_{12} + {\text{a}}_{23} = {\text{b}}_{12} + {\text{b}}_{23} \\ & \left( {{\text{L}}_{3} } \right){\text{a}}_{13} + {\text{a}}_{23} = {\text{b}}_{13} + {\text{b}}_{23} \\ \end{aligned}$$

(L1) + (L2) – (L3) − > a12 = b12, which implies a23 = b23 and a13 = b13.

It is therefore impossible to proceed to a minimal transformation with only 3 bonds involved.

Let us consider 4 atoms. There are 6 possible bonds between those atoms. Let us consider that we are changing 4 bonds, since we aim to find minimal transformations. Let us call a13 and a24 the two fixed bonds, without loss of generality.

Valence conservation (with b13 = a13 and b24 = a24) gives us the following system:

$$\begin{aligned} & \left( {{\text{L}}_{1} } \right){\text{a}}_{12} + {\text{a}}_{14} = {\text{b}}_{12} + {\text{b}}_{14} \\ & \left( {{\text{L}}_{2} } \right){\text{a}}_{12} + {\text{a}}_{23} = {\text{b}}_{12} + {\text{b}}_{23} \\ & \left( {{\text{L}}_{3} } \right){\text{a}}_{23} + {\text{a}}_{34} = {\text{b}}_{23} + {\text{b}}_{34} \\ & \left( {{\text{L}}_{4} } \right){\text{a}}_{14} + {\text{a}}_{34} = {\text{b}}_{14} + {\text{b}}_{34} \\ \end{aligned}$$

We can notice that (L1) + (L3) = (L2) + (L4): we therefore have a system of 3 equations with 4 unknowns, so we can set an unknown and calculate the other solutions.

As we are looking for minimal transformations, we can assume that we are changing a bond order by 1 on this unknown that we can set. Since valence is conserved, if a bond order is increased, then a bond order from the same atom has to be decreased. As the problem is perfectly symmetrical in all variables at this point, we can thus assume without loss of generality (at least one bond has to be deleted) that b12 = a12 − 1. Then solving the system immediately gives us b14 = a14 + 1, b23 = a23 + 1 and b34 = a34 − 1. This system can only be solved in our case (positive bond orders, no quadruple bonds) if a14 and a23 are either 0, 1 or 2 and a12 and a34 are either 1, 2 or 3. This means that we have at most 81 (34) cases for initial bond orders where our isomer problem has a solution. However, this solution space can be further reduced by problem symmetry arguments. We can see that the roles of a12 and a34 are symmetrical, as well as the roles of a23 and a14.

Let us call A1 the atom with the highest considered sum of bound orders (neglecting the fixed orders a13 and a24). Therefore, it is such that

(Condition 1):

$${\text{a}}_{12} + {\text{a}}_{14} \ge {\text{a}}_{12} + {\text{a}}_{23} \left( {{\text{higher}}\;{\text{considered}}\;{\text{sum}}\;{\text{of}}\;{\text{bound}}\;{\text{orders}}\;{\text{than}}\; {\text{A}}_{2} } \right),\quad {\text{or}}\;\;{\text{a}}_{14} \ge {\text{a}}_{23}$$

(Condition 2):

$${\text{a}}_{12} + a_{14} \ge {\text{a}}_{14} + {\text{a}}_{34} \left( {{\text{higher}}\;{\text{considered}}\;{\text{sum}}\;{\text{of}}\;{\text{bound}}\;{\text{orders}}\;{\text{than}}\;{\text{A}}_{4} } \right),\quad {\text{or}}\;{\text{a}}_{12} \ge {\text{a}}_{34}$$

a12 + a14 ≥ a23 + a34 (higher considered sum of bound orders than A3), is automatically verified when the other two are verified.

Condition 1 is not respected when a23 = 2 and a14 = 0 or 1 or when a23 = 1 and a14 = 0, without constraints on a12 and a34: (2 + 1) * 9 = 27 solutions. For the same reason, 27 solutions do not respect condition 2. The solutions that do not respect both Conditions 1 and 2 are (2 + 1) * (2 + 1) = 9. By symmetry arguments, we therefore reduced the solution space from 81 to 81 − 27 − 27 + 9 = 36. These 36 reaction rules are presented in Fig. 4.

Fig. 4
figure 4

Rules before solution space reduction due to valence and structure considerations. Reactions in green move bonds around without creating or deleting cycles. Reactions in blue change bond order by creating or deleting at least one cycle. Reactions in purple were deleted because valence is limited to 4, and reactions in black were deleted because there cannot be two double bonds on the same atom in a 3 or 4 membered ring

We can further reduce the solution space by considering that the maximum atom valence is 4. The solutions that do not respect this constraint are such that a12 + a14 = 5, so a12 = 3 and a14 = 2 (and this automatically verifies Conditions 1 and 2). Since there are no constraints on a23 and a34, we have 9 such solutions: the solution space has been reduced to 36 − 9 = 27 reactions.

One more constraint, imposed by 3D conformation of the molecule, is that there cannot be two double bonds on the same atom in a 3 or 4 membered ring.

This must be true for our initial molecule as well as for the produced molecule. For the initial molecule (as can be seen in the 4 black rules under rule 13 in Fig. 4), when a12 = a14 = 2, since a34 > 1 (bond whose order will be reduced), there is a cycle if a23 ≠ 0. There are therefore 4 solutions where the initial molecule is invalid: when a12 = a14 = 2, and a34 is 1 or 2 (smaller than a12) and a23 is 1 or 2.

This must also be true for the produced molecule. Two double bonds will be produced around atom 1 with a12 = 3 and a14 = 1 (this can be seen in the 4 black rules under rule 18 in Fig. 4). There will be a cycle if a34 ≠ 1. There are therefore 4 solutions where the produced molecule is invalid: when a12 = 3, a14 = 1, and a34 is 2 or 3 and a23 is 0 or 1 (smaller than a14).

Since these solutions respect valence constraints and problem symmetry, they are not included in the previous solution space reductions and therefore the solution space is reduced to 27 − 8 = 19 solutions. A summary table of solution space reduction is given in Additional file 3: Table S1. Since we have found 19 different working solutions for all the cases we have left, we have proved that the minimal number of bonds one can change is 4 and the 19 rules described in Fig. 3 generate all these minimal transformations.

Lemma 2

Let us consider M b an isomer of M a . We can apply a rule from this set of 19 rules that will reduce the sum of absolute order differences between those two molecules by at least 2 and at most 4.


Let (aij) be the order of bonds in Ma, (bij), j Є [2, N], i Є [1, j − 1], the order of bonds in Mb, where N is the number of atoms in Ma and Mb. Since Mb is different from Ma, we can find i,j such that aij > bij. By valence conservation in atom Aj, we can find k such that ajk < bjk, and by valence conservation of atom Ak, we can also find l such that akl > bkl. Therefore, we are considering 4 atoms and 4 bonds between those atoms, with at least 3 of their orders changing by 1. According to Lemma 1 the minimal number of bonds one can change is 4, so we will also have to change the bond order between Ai and Al. We are therefore considering a minimal transformation, so we know thanks to Lemma 1 that we can apply a rule from our set of rules to generate that transformation. Let us call \({\text{M}}_{\text{a}^\prime}\) the molecule produced that way, and \({\text{a}}_{\text{ij}^\prime}\) its bond orders. Let us now calculate the sum of orders of \({\text{M}}_{\text{a}^\prime}\). Then, by applying the rule, we have \({\text{a}}_{\text{ij}^\prime}\) = aij − 1 and therefore \(\left| {b_{\text{ij}} - a_{\text{ij}^\prime} } \right| = \left| {{\text{b}}_{\text{ij}} {-}{\text{a}}_{\text{ij}} } \right| - 1\). For the same reason, \(\left| {{\text{b}}_{\text{kl}} - {\text{a}}_{\text{kl}^\prime}} \right| = \left| {{\text{b}}_{\text{kl}} - {\text{a}}_{\text{kl}} } \right| - 1\). Moreover, \({\text{a}}_{\text{jk}^\prime} {\text{ = a}}_{\text{jk}} { + 1}\), and since ajk is smaller than bjk, we also have \(\left| {{\text{b}}_{\text{jk}} - {\text{a}}_{\text{jk}^\prime} } \right|{ = }\left| {{\text{b}}_{\text{jk}} - {\text{a}}_{\text{jk}} } \right| - 1\). The only bond we did not choose to change is ali. The order \({\text{a}}_{\text{ij}^\prime}\) of the transformed bond is either closer to bli than was ali, then the difference of the sum of absolute order differences is reduced by 4, or is further from bli, and this sum is reduced by 2. Therefore, if Ma and Mb are different, we can apply a rule from this set of rules that will decrease the sum of absolute order differences by at least 2 and at most 4.

Lemma 3

Considering M a and M b an isomer of M a , the 19 rules described in Fig. 3 allow us to transform M a into M b using at most 3/4*(N 2 − N) single transformations, where N is the number of atoms, respecting the following constraints: the maximal valence is 4 and there cannot be two double bonds on the same atom in a 3 or 4 membered ring.


Let us consider Mb an isomer of Ma. If the sum of absolute order differences is not null, then Mb is different from Ma and using Lemma 2, we know we can apply a rule that will strictly decrease the sum of absolute order differences. This sum is obviously positive, is an integer, and is strictly decreasing each time we apply a transformation rule so it will converge to 0 in S/2 transformations at most, where S is the sum of absolute order differences between Ma and Mb. When this sum is null, all bond orders are the same, which means the molecules are the same. An upper estimation of the maximum bond order difference is obtained when Ma only has triple bonds, which all have to be deleted. In that case, the sum of absolute order differences is: S = 3*(N 2 − N)/2, where N is the number of atoms and (N 2 − N)/2 the number of defined orders (since aij = aji). Therefore, since the sum decreases by at least 2, the maximum number of transformations we need to apply is 3*(N 2 − N)/4.

Proof of the main claim

Given the workings of the algorithm (breadth-first, as explained in “RetroPath2.0 core algorithm” section), the number of iterations for generating all isomers is the number of iterations for generating the furthest one in term of bond order difference from our starting molecule. Therefore, applying Lemma 3, we know the maximum number of iterations of the algorithm is 3 * (N 2 − N)/4.

Notice that although the number of iterations of the algorithm scales O(N 2), the number of transformation rules applied (i.e.: single reactions) is proportional to the number of isomers.

Corollary 1

The maximum number of iterations to generate all alkanes is N − 1, where N is the number of carbon atoms (hydrogens are not considered here).


Adapting the demonstration of Lemma 3, we have to consider the sum of absolute order differences of the farthest isomers that can be reached. Since alkanes are acyclic, the number of bonds is N − 1 (proven by a simple recurrence, the new atom being joined at a single point to the chain since the molecule is acyclic). Therefore, considering all bonds are different in the new molecule, the sum of absolute order differences is at most 2(N  1). Therefore, the maximum number of iterations of the algorithm is N  1.

The isomer transformation algorithm was applied to generate all alkanes up to 18 carbon atoms using rule R 1 of Fig. 3, since it is the only rule with only single bonds. Results are presented in Table 1, where it can be seen that Corollary 1 is verified in practice.

Table 1 Number of generated alkane isomers by canonical augmentation algorithm and isomer transformation algorithm

Virtual screening in the chemical space

In this section we used RetroPath2.0 to search all molecules that are at predefined distances of a given set of molecules. Such queries are routinely carried out in large chemical databases for drug discovery purposes [31], but in the present case we search similar structures in the entire chemical space. To perform search in the chemical space, we used a source set composed of 158 well-known monomers having a molecular weight up to 200 Da. Our rule set included the transformations colored green in Fig. 3 (i.e. transformation rules where double bonds are not transformed into cycles and conversely). For each monomer, RetroPath2.0 was iterated until no new isomers were generated. Each generated structures at a Tanimoto similarity greater than 0.5 from its corresponding monomer were retained (Tanimoto was computed using MACCS keys fingerprints [32]).

Next, we wanted to probe if the generated structures exhibited interesting properties as far as polymer properties are concerned. To that end we first developed a QSPR model taking properties from [33]. We focused on polymer glass transition temperature T g data [34]. The QSPR model was based on a random forest trained using RDKit fingerprints descriptors [19]. The obtained model had a leave-one-out cross-validation performance of Q2 = 0.75. The model was then applied to predict the T g for the set of enumerated isomers. Figure 5 compares the distribution of predicted T g values for the enumerated isomers with those obtained from isomer structures available from PubChem. T g values for enumerated isomers appeared evenly distributed around 301.86 ± 25.69 K compared with the isomers that were available in PubChem (331.66 ± 46.19 K). This shift in the T g values could be explained by the difference in distribution that necessarily exists between the isomers that are present in PubChem and the total number of enumerated isomers. As we lower the Tanimoto threshold, some monomers might become underrepresented in terms of isomer availability in PubChem. Additional file 2: Figure S3 shows the distributions of both sets of isomers in function of the threshold. The increased ability of selecting polymers with T g above or below room temperature for the enumerated set compared with the PubChem isomers is a desirable feature, as this parameter will determine the mechanical properties of the polymer [35]. In that way, performing a virtual screening of the chemical space of isomers of the reference monomers opens the possibility to engineering applications with improved polymer design.

Fig. 5
figure 5

Distributions of predicted Tg values for enumerated isomers and for isomers found in PubChem. Distribution of predicted polymer glass transition temperature T g for enumerated isomers and for isomers found in Pubchem of a reference set of 158 monomers with a Tanimoto similarity greater than 0.5

Moreover, we were interested in determining how many of the starting 158 monomers were accessible through biosynthesis. Namely, how many of the compounds can be synthesized by engineering a metabolic pathway in a chassis organism. This computation can be accomplished by RetroPath2.0 by defining all naturally produced chemicals as sinks in the workflow and using a collection of known enzymatic reaction rules in reversed mode. The process has been described in detailed elsewhere [17]. Through the application of the rules in a retrosynthetic fashion, it is possible to determine the routes that connect the target compounds to the natural precursors. Of the 158 available monomers, using the RetroPath2.0 workflow downloaded from [36], we were able to identify 17 compounds that can be naturally synthetized (Fig. 6a). We provide in an archive containing the list of pathways for those 17 compounds.

Fig. 6
figure 6

a Initial 158 monomers (green big circles) represented in the chemical space of chemical descriptors using the two main principal components computed from the MACCS fingerprints as axes. Monomers that can be produced through biosynthesis are represented as big circles in red. b Covering of the chemical space generated by the 574,186 isomers (blue) enumerated for the 158 monomers (green) with a Tanimoto similarity greater than 0.5 and associated predicted T g property of the resulting polymer. Virtual monomers are depicted as small circles to facilitate visualisation of their distribution around the starting monomers

The QPSR model for T g was applied to the set of enumerated isomers. As shown in Fig. 6b, the resulting set provided a good covering of the chemical space surrounding the starting monomer set. Moreover, a significant number of enumerated isomers show a high predicted T g value, which may indicate a good candidate as a building block replacement for known monomers. Interestingly, those isomers that were close to biosynthetic accessible monomers (Tanimoto based on MACCS keys fingerprint > 0.8) have a distribution of predicted T g values that significantly differ from the full set (p value < 1e − 12 Welch t-test), with a mean T g = 352.1 K (T g = 301.9 K in the full distribution). These close isomers to biosynthetically accessible monomers might be considered as good candidates for alternative biosynthesis since reaching them through biosynthesis may require only few modifications of the original catalytic route.

Search for molecules maximizing biological activities

In this section we are interested in searching chemical structures in the chemical space optimizing biological activities. This type of search can be solved using inverse QSAR procedures [37]. Inverse QSAR requires to first building a QSAR equation predicting activities from structure and then either (1) inverting the equation and enumerating structures matching a given activity [37] or (2) searching in the chemical space structures similar to those used to build the QSAR equation [33] but having optimized activities. The second approach makes use of either deterministic methods such as lattice enumeration [38] or stochastic searches.

We propose here to use RetroPath2.0 to solve the inverse QSAR problem using a stochastic approach with isomer transformation rules and enzymatic rules for biosynthetic accessibility. To this end, we selected a dataset of 47 aminoglycosides structures for which antibacterial activities have been measured using a MIC assay [39]. The dataset is composed of natural aminoglycosides (gentamicin, tobramycin, neomycin, kanamycin A and B, paromomycin, ribostamycin and neamine) to which are added synthetic structures built on a neamine scaffold. This dataset has already been used to build a QSAR model based on CoMFA analysis leading to a Q2 of 0.6 for a Leave-One-Out (LOO) procedure [39]. We provide in Additional file 1 a QSAR workflow that makes use of RDKit fingerprints [19] and random forest as a learner leading to a higher Q2 (0.7) for LOO. With that QSAR in hand we run RetroPath2.0 with a source set composed of the 47 aminoglycosides used in the training set, and two different reaction rules sets. The first set is extracted from the transforming enumeration rules depicted in Fig. 3, the second set is composed of enzymatic reaction rules leading to neamine (an aminoglycoside) biosynthesis from glucose. Reaction rules for the second set were computed as explained in the method section resulting in 94 rules specific to the biosynthesis of aminoglycosides.

In both cases, reactions rules were fired on the initial source set composed of 47 structures. All rule products were ranked according to their predicted activities as calculated by the QSAR and were selected for the next iteration according to a tournament procedure described in the methods section which derives from [40]. The Fig. 7a, b below gives the top activity and the average population activity vs. iteration. The most active structures found by each rule set are also drawn in Fig. 7c, d.

Fig. 7
figure 7

a, b Evolution versus iteration number of the best predicted activity (red) and average population predicted activity (blue) from amongst the newly generated structures using a transformation enumeration rules or b enzymatic rules. c, d Selected best structure generated after 500 iterations using either c transformation enumeration rules or d enzymatic rules

We observe in Fig. 7 that the average curve in B is lower to the one in A. This is due to the fact that enzymatic rules generated a lot of compounds that are structurally far from aminoglycoside (i.e.H2O, NH4+, O2…). Moreover, the rules used for A, allow more transformation/modification, thus enabling to better explore the chemical space, and ultimately finding more active compounds. We note that the structure in Fig. 7c have a slightly better predicted activity (pACT = 9.015) than the initial compounds used in the training set, while the structure in Fig. 7d have the same predicted activity than gentamicin (pACT = 8.867).

Metabolome completion and metabolomics

In this last example we use enzymatic reaction rules in an attempt to complete the metabolome of species used in biotechnology. We are motivated here by current efforts invested to complete the knowledge on the metabolism of various organisms [6, 7, 15]. The benefits are numerous and include the identification of relevant biomarkers for many diseases; for personalized nutrition advice; and also for searching for relevant indicators and metabolites of plant and animal stress in agricultural practices and breeding programs. Additionally, knowing the metabolic space of microbes is an essential step for optimizing metabolic engineering and creating synthesis pathways for new compounds for industrial applications.

Experimental evidences from metabolomics analyses are often informing us that with currently known metabolites one cannot cover the ranges of masses found in actual samples, and consequently there is a need of completing the metabolomes of interest. This need is clearly seen in the Human Metabolome Database (HMDB) where the number of reported masses has recently grown from 20,931 in 2013 [41] to 74,461 (at the time this manuscript was written), while annotated metabolites in metabolic databases are still in the range of 1847 (HumanCyc). Despite such a growth in databases, a significant amount of spectral peaks remains unassigned. This high fraction of unassigned peaks might be due to several factors including isotope, adduct formation, ion fragmentation, and multimers. Besides such sources of uncertainty in samples, many unassigned peaks should also be due to promiscuous activities of enzymes not yet characterized because of the lack of an appropriate description of the mechanisms of enzyme promiscuity.

To gain insights into those mechanisms enabling promiscuity, reaction rules have been shown to be appropriate [21] in particular the rules allowing to focus on the center of the reactions. To this end, several enzymatic reaction rules have been proposed such as those derived from bond-electron matrices [42], on the smallest molecular substructure changing during transformations [9], or on reaction rules that code for variable environments at reaction centers (see [7] and “Methods” section). That latter reaction rule system codes for changes in atom bonding environments where the reaction is taking place and the environment can range from including only the atoms participating to the reaction center to the entire set of atoms and molecules participating to the reaction. The advantage of that latter approach is that the size of the environment (named diameter) can be tuned to control the combinatorial explosion of possible products.

The degree of plasticity in metabolic networks that is uncovered by variable reaction center diameter is actually revealing an intrinsic feature of organisms linked to their adaptability, i.e. enzyme promiscuity. Promiscuity stands for the ability of enzymes to catalyze more than one reaction or to accept more than one substrate, a mechanism which can be traced to the evolutionary origins of enzymatic functions. Mimicking nature, such enzyme versatility can provide novel ways for biosynthesizing metabolite and even bioproducing non-natural molecule. To that end, the variable diameter method has shown itself to be especially well-suited for modeling the mechanisms of enzyme promiscuity as it has already enabled the experimentally validated discovery of a novel metabolite in E. coli and of the promiscuous enzymes producing it [21].

In this study, we make use of RetroPath2.0 to exemplify how variable reaction rule diameters can be used to complete the metabolome of E. coli. More precisely, we used as a source set all the metabolites present in E. coli iJO1366 model [43]. We first tested the two rule sets aforementioned, a set of about 100 reaction rules part of the BNICE framework [42] and a set of 50 reaction rules developed with the Sympheny software [9]. The reaction rules coded in the form of SMARTS string are provided in Supplementary at, along with the EC numbers corresponding to the rules. While the two rule sets were not developed to code only for E. coli reactions, for each EC number there is a corresponding enzyme annotated in E. coli so we kept all rules in the two sets. We then tested reaction rules with variable diameters using the procedure described in the method section to code for all E. coli metabolic reactions extracted from iJO1366 model. Rules were calculated for each reaction with diameters ranging from 2 to 16. Table 2 below provides the number of compound generated running RetroPath2.0 for one iteration on the metabolites of the iJO1366 model and the rules sets mentioned above (see “Methods” section for additional details).

Table 2 Compounds generated by RetroPath2.0 using various reaction rules applied on E. coli iJO1366 model metabolites [43]

Table 2 shows that the number of compounds generated increases as the diameter decreases. This is consistent with the fact that shorter diameters will accept more substrates than higher ones and will thus produce more products. Although they were not constructed with diameters, the BNICE and Sympheny rule sets generally correspond to small environments comprising only few atoms and bonds around reaction centers, which explain why these two systems generate more products than high diameter rule sets. Nonetheless, even with high diameters, all variable diameter rule sets produce more molecules found in E. coli model than the BNICE and Sympheny rule sets. This might indicate that the variable rule sets correspond to a more accurate coding of metabolic reactions than the other systems.

To further probe the coverage of the various rules sets listed in Table 2 we searched if the compounds produced could be found in MS spectra. To this end, we downloaded MS spectra from Metabolight [44] where masses have been measured on E. coli cell extracts. The spectra downloaded corresponded to a study aimed at probing the dynamics of isotopically labeled molecules (i.e. 13C labeled glucose) [45]. Since we are concerned here with wild type E. coli metabolome, we considered only the spectra where E. coli cells had not yet been exposed to labeled glucose (spectra acquired at time t = 0). All compounds generated by our various rules sets were prepared to be read by OpenMS nodes [23] and a workflow was written with these nodes to annotate the MS spectra peaks (cf. “Methods” section for details).

The results presented in Table 2 show that as the diameter decreases the number of peak assignment increases, which is not surprising considering that the number of compounds generated increases as well. We observe that the Sympheny and BNICE rules sets give results similar to those obtained by the D6 rule set, albeit with a higher number of annotations per peak.

In all cases the rule sets produced compounds not present in the E. coli model but with corresponding masses in the MS spectra. Additional file 4: Table S2 give a list of 40 such compounds having an identifier in MetaNetX [11] and produced by three identical reactions (i.e. reactions having the same substrates and products) generated using the Sympheny, BNICE and D6 rule sets. The compounds were produced by 53 reactions, some compounds being produced by more than one reaction. We note that the 40 compounds have been generated by rule sets for which at least one gene in E. coli has been annotated with the same corresponding EC number. The 40 compounds are thus potential new E. coli metabolites and their presence should be further verified using for instance MS/MS analysis (Additional file 4: Table S2).


In this paper we have presented a general method allowing one to explore the chemical space around a given molecule, or around a given set of molecules. The originality of the method is that the exploration is performed through chemical reactions rules. We have given a set of rules allowing us to generate any isomer of any given molecule of the chemical space. We also provide examples making use of reaction rules computed from enzymatic reactions. Using rules computed on known reactions has a definite advantage regarding the (bio)synthetic accessibility of the molecule produced, which is not necessarily the case for other techniques producing molecules de novo [33, 37, 40, 49,50,51,52,53].

Our method has been implemented into RetroPath2.0, a workflow running on the KNIME Analytics Platform [18]. RetroPath2.0 can easily be used with source molecules and reaction rules different than those presented in the paper. For instance the workflows provided in Additional file 1 can be used with the reaction SMARTS rules and fragment libraries (as source compounds) of the DOGS software ( LLC [4]) developed for de novo drug design, other technique evolving molecules toward specific activities or properties [40, 50,51,52] could also be implemented in RetroPath2.0 provided that one first codes reaction rules in SMARTS format.

Aside from searching molecules having interesting properties and activities RetroPath2.0 can also be used to complete metabolic maps by proposing new metabolites biosynthesized through promiscuous enzymes, these new metabolites can in turn be used to annotate MS spectra and to that end we provide an interface with OpenMS [23]. Finally, RetroPath2.0 was originally developed to enumerate pathways producing a given target product from a source set of reactants. While we have benchmarked the workflow in the context of metabolic engineering, [17] it can also be used for synthesis planning as long as synthesis reaction rules are available.


Generating reaction rules

All our reaction rules are represented in the form of reaction SMARTS [14]. Reaction rules used for canonical augmentation are provided in Fig. 1 and for isomer transformation in Figs. 3, 4, Additional file 2: Figure S2 and Additional file 5: Table S3. Enzymatic reaction rules were computed taking enzymatic reactions from MetaNetX version 2.0 [11]. To compute rules, we first performed an Atom–Atom Mapping (AAM) using the tool developed by [46] (Fig. 8a). Next, multiple substrates reactions were decomposed into components (panel C and D in Fig. 8). There are as many components as there are substrates and each component gives the transformation between one substrate and the products. Each product must contain at least one atom from the substrate according to the AAM. This strategy enforces that only one substrate can differ at a time from the substrates of the reference reaction when applying the rule.

Fig. 8
figure 8

RetroPath2.0 rules and corresponding SMARTS for reaction at various diameters. a Full reaction with atom mapping. b The list of broken bonds (− 1) and bonds formed (+ 1) is given by their atom numbers. c The corresponding SMARTS for the component modelling promiscuity on 6′-Oxo-paromamine: Substrate + l-Glutamate = Product + 2-Oxoglutarate. d The corresponding SMARTS for the component modelling promiscuity on l-Glutamate: Substrate + 6′-Oxo-paramamine = Neamine + Product. c, d Rules are encoded as reaction SMARTS and characterized by their diameter (∞ purple, 6 blue or 2 green), that is the number of bonds around the reaction center (atoms 19, 20 and 23, 24) defining the atoms kept in the rule

The following step consisted in computing reactions rules as reaction SMARTS for each component. We did it for diameters 2–16 around the reaction center (panels C and D in Fig. 8) by removing from the reaction components all atoms that were not in the spheres around the reaction center atoms.

We extracted more than 24,000 reaction components from MetaNetX reactions, each one of those leading to a rule at each diameter (from 2 to 16).

We provide in Supplementary at a subset of 14,300 rules for E. coli metabolism. The rules were selected based on the MetaNetX binding to external databases and the iJO1366 whole-cell E. coli metabolic model [43]. We also provide enzymatic rules enabling the biosynthesis of aminoglycosides from Glucose. The reactions were extracted from the map00524 KEGG map [47], and rules were computed as above on reactions for which a MetaNetX identifier could be retrieved. The resulting set comprised 94 rules calculated for each diameter ranging from 2 to 16.

RetroPath2.0 core algorithm

The RetroPath2.0 workflow essentially follows an algorithm proposed by some of us [16, 17] and its workflow implementation, which has already been described in details [17], is summarized in Fig. 9. We here focus on the different usages of RetroPath2.0 for the use cases provided in “Results and discussions” section.

Fig. 9
figure 9

RetroPath2.0 KNIME workflow. Inner view of the “Core” node where the computation takes place. The “Source, Sink…” and “Rules” nodes parse the source, sink and rules input files provided by the user and standardize data so that it can be processed by downstream nodes. Definitions for source, sink, and rule sets are provided in the text. The outer loop (“Source” loop) iterates over each source compounds, while the inner loop (“Length” loop) allows to iterate the process up to a maximum number of steps predefined by the user. The nodes (1) “FIRE”, (2) “PARSE”, (3) “UPDATE SOURCE…” and (4) “BUILD” are sequentially executed at each inner iteration. Respectively, they (1) apply all the rules on source compounds, (2) parse and standardize new products, (3) update the lists of source and sink compounds for the next iteration and (4) merge results that will be written by the node “Write global results”. Once the maximum number of steps is reached (or no new product is found), the “Compute scope” node identify the scope linking each source to the sink compounds, then these results are written by the node “Write per source results”. Only the main nodes involved in the process are shown

In all cases the workflow performs the generation of structures in a breadth-first way by applying iteratively the same procedure. An iteration starts by applying reaction rules to each of the compounds of a source set. For each compound, the products are computed using the RDKit KNIME nodes one-component or two-component reactions [19]. Products are sanitized (removal of structures having incorrect valence), standardised and duplicates are merged. The set of products will become the new source set for the next iteration. The workflow iterates until a predefined number of iterations is reached or until the source set is empty.

In the case of isomer augmentation (workflow RetroPath2.0-Mods-isomer-augmentation, “Isomer enumeration” sections) the initial source set is composed of a single carbon atom and the rule used is R 1 in Fig. 1, since it is the only rule that will produce acyclic molecules. The rule is fired on the source set, and the products become the new source set in the next iteration. The workflow is iterated a number of times equal to N − 1, where N is the number of atoms one wishes the final molecule to have.

In the case of isomer transformation (workflow RetroPath2.0-Mods-isomer-transformation, “Isomer enumeration” and “Virtual screening in the chemical space” section) the initial source set is composed of a molecule that is filled with the appropriate number of hydrogens using the RDKit KNIME node Add Hs. At each iteration rules are fired on the source set and the products obtained become the new source set for the next iteration. As an additional last step of each iteration, products that have already been processed in a previous iteration are filtered out before building the next source set. This necessitates maintaining a set (named sink) comprising all molecules so far generated. All products that have already been obtained are removed from the product set and the remaining molecules are (1) added to the sink set and (2) used as the new source set for the next iteration. This avoids applying reactions on the same products during subsequent iterations. Disconnected structures are removed from the results by filtering out any product having several disconnected components (according to the SMILES representation). When enumerating alkane, disconnected structures represents between 50 and 66% (depending of the alkane size) of the generated structures before filtering and merging duplicates. To generate the results of Table 1, since we are enumerating alkanes (no multiple bonds or cycles), the rule to be used is R 1 in Fig. 3. To enumerate the isomers of the monomers in “Virtual screening in the chemical space” section, if we prohibit the transformation of multiple bonds into cycles and thus keep the number of single, double and triple bonds constant, the rules to be used are R 1, R 5 and R 9 in Fig. 3 (also found in Additional file 2: Figure S2 since the monomers used do not contain triple bonds). Since this algorithm can become computationally intensive, we also provide an additional workflow (called RetroPath2.0-Mods-isomer-transformation-queue) to deal with memory management. This workflow illustrates how to introduce a FIFO data structure for the source set (i.e. queue containing structures upon which rules will be fired) and use it for iteratively firing rules on small chunks of structures (e.g. chunk of 20 structures), new products obtained are then added to the source queue. Interestingly, the breadth-first approach for generating the structures can be replaced by a depth-first approach by replacing the queue (first in, first out structure) by a stack (last in, first out structure).”

In the case of inverse-QSAR (workflow RetroPath2.0-Mods-iQSAR, “Search for molecules maximizing biological activities” section), the source set initially comprises the molecules used in the training set when building the QSAR. At each iteration, one or two molecules are chosen at random from the source set depending on the rule set that is being used (one molecule with enzymatic reaction rules, two molecules with isomer transformation rules). Rules are then fired on the selected molecules and an activity is predicted for each product using the QSAR equation. The source set is updated retaining molecules according to a selection tournament procedure borrowed from [40]. Briefly, the initial source set (i.e. the set of structures used at the start of the current iteration) is merged with the product set (i.e. the set of structures obtained after firing the rules). This merged set is then randomly split into 10 subsets and the 10 top best structures from each subset are retained according to their predicted activity. Finally, all the retained structures are pooled together to form the updated source set to be used at the next iteration. The workflow is iterated a (user) predefined number of times.

In the case of E. coli metabolic network completion (workflow RetroPath2.0-Mods-metabolomics, “Metabolome completion and metabolomics” section), we provide three workflows. The first workflow is RetroPath2.0, which is fully described in [17] and is similar to the isomer transformation one. Here, RetroPath2.0 produces a list of molecules obtained using E. coli enzymatic reaction rules (see “Generating reaction rules” section). The second workflow takes as input the products generated by RetroPath2.0, computes the exact mass for each product and prepare files to be read by OpenMS nodes for MS data peak assignment [23]. The last workflow is built with OpenMS nodes, it reads several MS data files in mzML format, two lists of adducts in positive and negative modes, and the files generated by the second workflow (containing RetroPath2.0 generated products with masses). The workflow searches for each compound the corresponding peak in the MS spectra. The workflow was parameterized for metabolomics analysis as described in OpenMS manual [48], the AccurateMassSearch node was set to negative ion mode as the experiment were carried out with an LTQ-Orbitrap instrument operating in negative FT mode (cf. protocols in [44]).

Further details on how to run all the above workflows are provided in the Supplementary at


  1. Chemicals Abstracts Service, CASREACT Accessed 28 June 17

  2. Warr WA (2014) A short review of chemical reaction database systems, computer-aided synthesis design, reaction prediction and synthetic feasibility. Mol Inform 33(6–7):469–476

    Article  CAS  Google Scholar 

  3. Szymkuc S, Gajewska EP, Klucznik T, Molga K, Dittwald P, Startek M, Bajczyk M, Grzybowski BA (2016) Computer-assisted synthetic planning: the end of the beginning. Angew Chem Int Ed Engl 55(20):5904–5937

    Article  CAS  Google Scholar 

  4. Schneider G (2014) Future de novo drug design. Mol Inform 33(6–7):397–402

    Article  CAS  Google Scholar 

  5. Moriya Y, Shigemizu D, Hattori M, Tokimatsu T, Kotera M, Goto S, Kanehisa M (2010) PathPred: an enzyme-catalyzed metabolic pathway prediction server. Nucleic Acids Res 38(Web Server issue):W138–W143

    Article  CAS  Google Scholar 

  6. Jeffryes JG, Colastani RL, Elbadawi-Sidhu M, Kind T, Niehaus TD, Broadbelt LJ, Hanson AD, Fiehn O, Tyo KE, Henry CS (2015) MINEs: open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics. J Cheminform 7:44

    Article  Google Scholar 

  7. Carbonell P, Parutto P, Herisson J, Pandit SB, Faulon JL (2014) XTMS: pathway design in an eXTended metabolic space. Nucleic Acids Res 42:W389–W394

    Article  CAS  Google Scholar 

  8. Hadadi N, Hafner J, Shajkofci A, Zisaki A, Hatzimanikatis V (2016) ATLAS of biochemistry: a repository of all possible biochemical reactions for synthetic biology and metabolic engineering studies. ACS Synth Biol 5(10):1155–1166

    Article  CAS  Google Scholar 

  9. Yim H, Haselbeck R, Niu W, Pujol-Baxley C, Burgard A, Boldt J, Khandurina J, Trawick JD, Osterhout RE, Stephen R, Estadilla J, Teisan S, Schreyer HB, Andrae S, Yang TH, Lee SY, Burk MJ, Van Dien S (2011) Metabolic engineering of Escherichia coli for direct production of 1,4-butanediol. Nat Chem Biol 7(7):445–452

    Article  CAS  Google Scholar 

  10. Liu M, Bienfait B, Sacher O, Gasteiger J, Siezen RJ, Nauta A, Geurts JM (2014) Combining chemoinformatics with bioinformatics: in silico prediction of bacterial flavor-forming pathways by a chemical systems biology approach “reverse pathway engineering”. PLoS ONE 9(1):e84769

    Article  Google Scholar 

  11. Moretti S, Martin O, Van Du Tran T, Bridge A, Morgat A, Pagni M (2016) MetaNetX/MNXref–reconciliation of metabolites and biochemical reactions to bring together genome-scale metabolic networks. Nucleic Acids Res 44(D1):D523–D526

    Article  CAS  Google Scholar 

  12. Ugi I, Bauer J, Bley K, Dengler A, Dietz A, Fontain E, Gruber B, Herges R, Knauer M, Reitsam K, Stein N (1993) Computer-assisted solution of chemical problems—the historical development and the present state of the art of a new discipline of chemistry. Angew Chem Int Ed Engl 32(2):164–189

    Article  Google Scholar 

  13. Carbonell P, Planson AG, Fichera D, Faulon JL (2011) A retrosynthetic biology approach to metabolic pathway design for therapeutic production. BMC Syst Biol 5:122

    Article  Google Scholar 

  14. Daylight Theory: SMARTS Accessed 28 June 2017

  15. Hadadi N, Hatzimanikatis V (2015) Design of computational retrobiosynthesis tools for the design of de novo synthetic pathways. Curr Opin Chem Biol 28:99–104

    Article  CAS  Google Scholar 

  16. Faulon J-L, Carbonell P (2010) Reaction network generation. In: Faulon J-L, Bender A (eds) Handbook of chemoinformatics algorithms. CRC Press, Boca Raton

    Chapter  Google Scholar 

  17. Delépine B, Duigou T, Carbonell P, Faulon JL (2017) RetroPath2.0: A retrosynthesis workflow for metabolic engineers. Metab Eng.

  18. Berthold MR et al (2008) KNIME: The Konstanz Information Miner. In: Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R (eds) Data analysis, machine learning and applications. Studies in classification, data analysis, and knowledge organization. Springer, Berlin, Heidelberg, pp 319–326

  19. Landrum G (2016) RDKit: open-source cheminformatics Accessed 2 Aug 2016

  20. Rodrigues T, Reker D, Welin M, Caldera M, Brunner C, Gabernet G, Schneider P, Walse B, Schneider G (2015) De novo fragment design for drug discovery and chemical biology. Angew Chem Int Ed Engl 54(50):15079–15083

    Article  CAS  Google Scholar 

  21. Mellor J, Grigoras I, Carbonell P, Faulon JL (2016) Semisupervised Gaussian process for automated enzyme search. ACS Synth Biol 5(6):518–528

    Article  CAS  Google Scholar 

  22. Feher T, Planson AG, Carbonell P, Fernandez-Castane A, Grigoras I, Dariy E, Perret A, Faulon JL (2014) Validation of RetroPath, a computer-aided design tool for metabolic pathway engineering. Biotechnol J 9(11):1446–1457

    Article  CAS  Google Scholar 

  23. Rost HL, Schmitt U, Aebersold R, Malmstrom L (2014) pyOpenMS: a python-based interface to the OpenMS mass-spectrometry algorithm library. Proteomics 14(1):74–77

    Article  Google Scholar 

  24. Thiagarajan D, Mehta DP (2016) Faster algorithms for isomer network generation. J Chem Inf Model 56(12):2310–2319

    Article  CAS  Google Scholar 

  25. Peironcely JE, Rojas-Cherto M, Fichera D, Reijmers T, Coulier L, Faulon JL, Hankemeier T (2012) OMG: open molecule generator. J Cheminform 4(1):21

    Article  CAS  Google Scholar 

  26. Jaghoori MM, Jongmans STQ, de Boer F, Peironcely J, Faulon JL, Reijmers T, Hankemeier T (2013) PMG: multi-core metabolite identification. Electron Notes Theor Comput Sci 299:53–60

    Article  Google Scholar 

  27. McKay B (1998) Isomorph-free exhaustive generation. J Algorithm 26:306–324

    Article  Google Scholar 

  28. Gally JM, Bourg S, Do Q-T, Aci-Seche S, Bonnet P (2017) VSPrep: a general KNIME workflow for the Preparation of molecules for virtual screening. Mol Inf 36:1–11

    Article  Google Scholar 

  29. Ruddigkeit L, van Deursen R, Blum LC, Reymond JL (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inf Model 52(11):2864–2875

    Article  CAS  Google Scholar 

  30. Isomers of Alkanes, Univ Bayreuth Accessed 28 June 17

  31. Willett P (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 11(23–24):1046–1053

    Article  CAS  Google Scholar 

  32. Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42(6):1273–1280

    Article  CAS  Google Scholar 

  33. Brown WM, Martin S, Rintoul MD, Faulon JL (2006) Designing novel polymers with targeted properties using the signature molecular descriptor. J Chem Inf Model 46(2):826–835

    Article  CAS  Google Scholar 

  34. Subramanian MN (2017) Polymer properties. In: Polymer blends and composites: chemistry and technology. Wiley, Hoboken, NJ, USA.

  35. Gerdeen JC, Rorrer RA (2011) Engineering design with polymers and composites, vol 30. CRC Press, Boca Raton

    Google Scholar 

  36. RetroPath2.0 at Accessed 29 Apr 2017

  37. Churchwell CJ, Rintoul MD, Martin S, Visco DP Jr, Kotu A, Larson RS, Sillerud LO, Brown DC, Faulon JL (2004) The signature molecular descriptor. 3. inverse-quantitative structure–activity relationship of ICAM-1 inhibitory peptides. J Mol Graph Model 22(4):263–273

    Article  CAS  Google Scholar 

  38. Martin S (2012) Lattice enumeration for inverse molecular design using the signature descriptor. J Chem Inf Model 52(7):1787–1797

    Article  CAS  Google Scholar 

  39. Setny P, Trylska J (2009) Search for novel aminoglycosides by combining fragment-based virtual screening and 3D-QSAR scoring. J Chem Inf Model 49(2):390–400

    Article  CAS  Google Scholar 

  40. Kawai K, Nagata N, Takahashi Y (2014) De novo design of drug-like molecules by a fragment-based molecular evolutionary approach. J Chem Inf Model 54(1):49–56

    Article  CAS  Google Scholar 

  41. Wishart DS, Jewison T, Guo AC, Wilson M, Knox C, Liu Y, Djoumbou Y, Mandal R, Aziat F, Dong E, Bouatra S, Sinelnikov I, Arndt D, Xia J, Liu P, Yallou F, Bjorndahl T, Perez-Pineiro R, Eisner R, Allen F, Neveu V, Greiner R, Scalbert A (2013) HMDB 3.0—the human metabolome database in 2013. Nucleic Acids Res 41(Database issue):D801–D807

    CAS  Google Scholar 

  42. Henry CS, Broadbelt LJ, Hatzimanikatis V (2010) Discovery and analysis of novel metabolic pathways for the biosynthesis of industrial chemicals: 3-hydroxypropanoate. Biotechnol Bioeng 106(3):462–473

    CAS  Google Scholar 

  43. Orth JD, Conrad TM, Na J, Lerman JA, Nam H, Feist AM, Palsson BO (2011) A comprehensive genome-scale reconstruction of Escherichia coli metabolism-2011. Mol Syst Biol 7:535

    Article  Google Scholar 

  44. MetaboLights, MTBLS229 study Accessed 28 June 2017

  45. Kiefer P, Schmitt U, Muller JE, Hartl J, Meyer F, Ryffel F, Vorholt JA (2015) DynaMet: a fully automated pipeline for dynamic LC–MS data. Anal Chem 87(19):9679–9686

    Article  CAS  Google Scholar 

  46. Rahman SA, Torrance G, Baldacci L, Martinez Cuesta S, Fenninger F, Gopal N, Choudhary S, May JW, Holliday GL, Steinbeck C, Thornton JM (2016) Reaction Decoder Tool (RDT): extracting features from chemical reactions. Bioinformatics 32(13):2065–2066

    Article  Google Scholar 

  47. KEGG Pathway, Neomycin, kanamycin and gentamicin biosynthesis Accessed 28 June 2017

  48. Walzer M, Sachsenberg T, Aicheler F, Rurik M, Veit J, Isabell B, Pedrioli P, Pfeuffer J, Liang X, Reinert K, O Kohlbacher. OpenMS tutorial Accessed 28 June 2017

  49. Visco DP Jr, Pophale RS, Rintoul MD, Faulon JL (2002) Developing a methodology for an inverse quantitative structure–activity relationship using the signature molecular descriptor. J Mol Graph Model 20(6):429–438

    Article  CAS  Google Scholar 

  50. van Deursen R, Reymond JL (2007) Chemical space travel. ChemMedChem 2(5):636–640

    Article  Google Scholar 

  51. Yu MJ (2011) Natural product-like virtual libraries: recursive atom-based enumeration. J Chem Inf Model 51(3):541–557

    Article  CAS  Google Scholar 

  52. Hoksza D, Skoda P, Vorsilak M, Svozil D (2014) Molpher: a software framework for systematic chemical space exploration. J Cheminform 6(1):7

    Article  Google Scholar 

  53. Virshup AM, Contreras-Garcia J, Wipf P, Yang W, Beratan DN (2013) Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds. J Am Chem Soc 135(19):7296–7303

    Article  CAS  Google Scholar 

Download references

Authors’ contributions

The work was directed by JLF who sketched the proof of the claim, designed all workflows, and generated the results presented in “Isomer enumeration” and “Metabolome completion and metabolomics” sections. MK wrote the proofs in “Isomer enumeration” section, TD wrote all workflows and produced the results in “Search for molecules maximizing biological activities” section. PC used the workflows to generate the results in “Virtual screening in the chemical space” section. All authors contributed to the manuscript write-up. All authors read and approved the final manuscript.


The authors acknowledge Baudoin Delépine for his contribution to the general workflow architecture and enzymatic reaction rules calculations.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Availability of data and materials

The workflows and associated data supporting the conclusions of this article are available on,

Ethics approval and consent to participate

Not applicable.


This work was supported by the French National Research Agency [ANR-15-CE1-0008], the Biotechnology and Biological Sciences Research Council, Centre for synthetic biology of fine and speciality chemicals [BB/M017702/1]; and Synthetic Biology Applications for Protective Materials [EP/N025504/1]. MK acknowledges funding provided by the DGA and École Polytechnique.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Jean-Loup Faulon.

Additional files

Additional file 1. Monomers maps obtained running Retropath2.0 in section “Virtual screening in the chemical space”. The 17 compounds of the 158 available monomers that can be naturally synthesized and the corresponding synthesis pathways.


Additional file 2. Figure S1 represents the execution time of the tested softwares for alkane enumeration. Figure S2 represents the reduced set of transformation rules excluding triple bounds. Figure S3 represents the distributions of predicted T g values for enumerated isomers and for isomers found in PubChem with varying Tanimoto threshold.


Additional file 3: Table S1. Solution space reduction arguments or rules corresponding to bond configurations in section “Isomer transformation”. This is shown for the 36 rules after reduction of the solution space by problem symmetry arguments.


Additional file 4: Table S2. List of 40 compounds from section “Metabolome completion and metabolomics” having an identifier in MetaNetX and produced by three identical reactions (i.e. reactions having the same substrates and products) generated using the Sympheny, BNICE and D6 rule sets.

Additional file 5: Table S3. List of the 19 SMARTS rules that were used in this study.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Koch, M., Duigou, T., Carbonell, P. et al. Molecular structures enumeration and virtual screening in the chemical space with RetroPath2.0. J Cheminform 9, 64 (2017).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: