Molecular structures enumeration and virtual screening in the chemical space with RetroPath2.0

Koch, Mathilde; Duigou, Thomas; Carbonell, Pablo; Faulon, Jean-Loup

doi:10.1186/s13321-017-0252-9

Methodology
Open access
Published: 19 December 2017

Molecular structures enumeration and virtual screening in the chemical space with RetroPath2.0

Mathilde Koch¹,
Thomas Duigou¹,
Pablo Carbonell² &
…
Jean-Loup Faulon^1,2,3

Journal of Cheminformatics volume 9, Article number: 64 (2017) Cite this article

4885 Accesses
12 Citations
8 Altmetric
Metrics details

Abstract

Background

Network generation tools coupled with chemical reaction rules have been mainly developed for synthesis planning and more recently for metabolic engineering. Using the same core algorithm, these tools apply a set of rules to a source set of compounds, stopping when a sink set of compounds has been produced. When using the appropriate sink, source and rules, this core algorithm can be used for a variety of applications beyond those it has been developed for.

Results

Here, we showcase the use of the open source workflow RetroPath2.0. First, we mathematically prove that we can generate all structural isomers of a molecule using a reduced set of reaction rules. We then use this enumeration strategy to screen the chemical space around a set of monomers and predict their glass transition temperatures, as well as around aminoglycosides to search structures maximizing antibacterial activity. We also perform a screening around aminoglycosides with enzymatic reaction rules to ensure biosynthetic accessibility. We finally use our workflow on an E. coli model to complete E. coli metabolome, with novel molecules generated using promiscuous enzymatic reaction rules. These novel molecules are searched on the MS spectra of an E. coli cell lysate interfacing our workflow with OpenMS through the KNIME Analytics Platform.

Conclusion

We provide an easy to use and modify, modular, and open-source workflow. We demonstrate its versatility through a variety of use cases including molecular structure enumeration, virtual screening in the chemical space, and metabolome completion. Because it is open source and freely available on MyExperiment.org, workflow community contributions should likely expand further the features of the tool, even beyond the use cases presented in the paper.

Background

The number of known chemical reactions is huge, at the time this manuscript was written there were ~84 million single- and multi-step reactions in the Chemical Abstract Service database (CAS) [1]. Yet, many reactions in CAS are redundant because the same reactions are applied to different reactants. Identifying identical reactions can be performed by computing reaction rules. Reaction rules represent reactions at the reaction center only. In other words, a reaction rule comprises only the substructures of the reactants and the products for which the atoms are either directly involved in bond rearrangements or are deemed to be essential for the reactivity of the reaction center. While a set of reaction rules is of course not available for all known chemical reactions, rules have been compiled for focused applications, such as retrosynthesis planning [2, 3], the discovery of novel chemical entities in medicinal chemistry [4], xenobiotic (including drug) degradation [5], metabolomics [6], and metabolic engineering [7,8,9,10]. Depending on the application, the number of rules varies from less than one hundred to few thousands, but in all cases the number of known reactions per application far exceeds the number of rules (there are for instance more than 14,000 reactions in metabolic databases such as MetaNetX [11]). There are several ways of coding reaction rules (for instance, BE-matrices [12] and fingerprints [13]) but most of the time the rules can be represented by reaction SMARTS [14], as it is done in the current paper.

The purpose of reactions rules is to generate reaction networks. The rules can be used in a forward manner to find for instance the metabolic degradation products of a drug, or in a reverse manner to find the reactions producing a desired product from a set of available reactants. In this later usage one produces retrosynthesis reaction networks. Several tools have been developed in the past to generate (retrosynthesis) reaction networks and reviews are available for synthesis planning [2, 3] and for metabolic engineering [15]. Disregarding if the rules are applied in a forward or reverse manner, network generation tools are making use of the same core algorithm. Starting from a source set of compounds the core algorithm applies the rules in an iterative fashion either a predefined number of times or until a sink set of compounds has been produced. At each iteration, the algorithm fires the rules on the source set producing new molecular structures and determines the new source set of molecules the rules will be fired upon at the next iteration. That set must comprise molecules that have not been processed before. Further details on the core algorithm and the differences between the various implementations are provided in Faulon et al. [16] and Delépine et al. [17]. In the current paper we make use of an open source workflow (RetroPath2.0 [17]), which follows the above core algorithm. This workflow is not based on original codes but instead was constructed entirely by assembling KNIME nodes [18] developed by the cheminformatics community (primarily RDKit nodes [19]). RetroPath2.0 is the first open source release of a retrosynthesis reaction network generation, and its usage in the current paper beyond network generation demonstrates its versatility.

As already mentioned, reaction network generation tools coupled with reactions rules have been developed and used primarily for synthesis planning and metabolic engineering, but can they be used to enumerate molecules (isomers for instance) and more generally to search chemical structures in the chemical space?

In principle yes if one can devise reaction rules enabling the production of any molecule in the chemical space. Such a set of rules necessarily exists for all known molecules (such as those in the CAS database) since they have been produced through either natural or synthetic chemical reactions. In practice and as already stated, reaction rules so far developed are application limited. Yet, within their respective application fields, specific rules have been used to discover novel molecules and reaction pathways. Taking experimentally validated examples, the rules associated with the ligand-based de novo design software DOGS (inSili.com LLC) [4] have enabled the production of new chemical entities inhibitors of DAPK3 (death-associated protein kinase 3) [20], metabolic rules for promiscuous enzymes have allowed the discovery of novel metabolites in E. coli [21] and have also been used to engineer metabolic pathways producing 1,4-butanediol [9] and flavonoids [22].

Going beyond application limited reaction rules, the main contribution of the present paper is to propose a set of transformation rules that enables the generation of any isomer of any given molecule of the chemical space. Precisely, we prove the claim that any isomer of any given molecule of N atoms, can be reached applying at most O(N ²) rules.

As illustrations, our transformation rules are used to screen the chemical space for structures that are similar to a given set of well-known monomers and to search aminoglycosides structures maximizing antibacterial activities. The compounds produced by our rules are not necessarily chemically accessible, since our transformation rules are not constructed based on chemical synthesis schema. To probe the (bio)synthetic accessibility of our solutions, we also perform search in the (bio)chemical space using enzymatic reaction rules. The enzymatic rules are also used to propose novel molecules completing E. coli metabolic network and for which masses are found in cell lysate mass spectra.

All results presented in this paper have been produced making use of the open source workflow RetroPath2.0. RetroPath2.0 and the associated data are provided as Additional file 1 and can be downloaded at MyExperiment.org. The only differences between the various usages we have made of the RetroPath2.0 are within (1) the set of reaction rules and (2) the way molecules are selected at each iteration during the network generation process.

Results and discussions

The purpose of this section is to showcase the versatility of RetroPath2.0 by taking use cases of interest to the community. We first propose reaction rules to enumerate isomers (“Isomer enumeration” section), we then use the rules to screen in the chemical space structures that are similar to some known monomers (“Virtual screening in the chemical space” section) and compute property distribution (Glass transition temperature) in both the Chemical Space and PubChem, we next use a QSAR to search aminoglycosides types molecules for which antibacterial activity is maximized using both isomer transformation rules and enzymatic rules (“Search for molecules maximizing biological activities”), and we finally use enzymatic rules to find novel metabolites in E. coli and annotate the MS spectra of an E. coli cell lysate interfacing RetroPath2.0 with OpenMS [23] (“Metabolome completion and metabolomics” section).

Isomer enumeration

Isomer enumeration is a long-standing problem that is still under scrutiny [24, 25]. Our intent here is not to provide the fastest enumeration algorithm but to demonstrate how RetroPath2.0 can perform that job once appropriate reaction rules are provided. However, we provide in Additional file 2: Figure S1 a comparison of Retropath2.0’s execution time with the OMG and PMG software tools [25, 26] specifically dedicated to isomers enumeration. Retropath2.0 is found faster than OMG but slower than PMG. Thereafter, we outline two approaches making use of RetroPath2.0. The first is based on the classical canonical augmentation algorithm [27] and the second consists of iteratively transforming a given molecule such that all its isomers are produced. We name this latter approach isomer transformation. In both cases we limit ourselves to structural (constitutional) isomers, as there already exist workflows to enumerate stereoisomers [28] (Additional file 2: Figure S2).

Canonical augmentation

The principle of canonical augmentation, which is an orderly enumeration algorithm, is to grow a molecular graph by adding one atom at a time and retaining only canonical graphs for the next iteration [27]. The algorithm first proposed by Brendan McKay has been used to generate the GDB-17 database of small molecules [29]. The original algorithm has also been modified such that at each step a bond (not an atom) is added to the growing molecules [25]. In the present implementation we use the original McKay algorithm [27], consequently, the number of iterations is the number of atoms one wishes the molecule to have. The algorithm can easily be implemented into RetroPath2.0 by choosing as a source set a single unbonded atom, and a rule set depicting all possible ways an atom can be added to a molecular graph (see “Methods” section for more information). Considering that an atom can be added to a growing molecule through one, two, or more bonds (depending on its valence), the set of reaction rules is straightforward however cumbersome if one starts to consider all possible atoms types. For this reason we limit ourselves to carbon skeleton as it is usually done in the first step of isomer enumeration algorithm. Figure 1 below depicts the set of rules that generate all triangle free carbon skeletons.

We note that rules R₂–R₄ will generate cycles since the added atom is attached to the growing molecule by 2–4 four bonds, thus only rule R ₁ is necessary to grow acyclic molecules (alkanes for instance). The Table below provides the numbers of structural isomers of alkanes found up to 18 carbon atoms running RetroPath2.0 with rule number 1 in Fig. 1.

Isomer transformation

The isomer canonical augmentation algorithm becomes more complex when one starts to consider different atom and bond types. To overcome these difficulties the idea of the transformation enumeration approach is to start with one fully-grown molecule to which one applies all possible transformations such that all the structural isomers of the initial molecule are generated. This approach can be implemented in RetroPath2.0 using a hydrogen saturated molecule as a source and a reaction rule set enabling to transform the molecule while keeping the correct valence for each atom. Because atom valences are maintained the total number of bonds must remain the same after the transformations have taken place. In order to maintain the number of bonds constant, for any reaction rule the number of bonds created must equal the number of bonds deleted.

RetroPath2.0 applies a reaction rule to a given molecule by first searching all occurrences in the molecule of the subgraph representing the reactant (left side of the rule). To this end the labels on the subgraph are removed. Then for each occurrence of the unlabeled subgraph in the molecule, the labels are restored and the bonding patterns on the molecule are changed accordingly. The process is illustrated in the Fig. 2 below where it can be seen that rules R _a and R _b are identical (i.e. they produce the same solutions). In general, two rules R _a = (L_a, A) and R _b = (L_b, B) will produce the same solutions if a one-to-one mapping π can be found between the labels L_a and L_b of the rules such that the set of edges (A) in R _a is transformed by π into the edges (B) of R _b, i.e. π(A) = B.

Claim

The 19 rules described in Fig. 3 allow us to generate all isomers of a given molecule at most 3/4 * (N ² − N) iterations, where N is the number of atoms, respecting the following constraints: the maximal valence is 4 and there cannot be two double bonds on the same atom in a 3 or 4 membered ring.

Lemma 1

The minimal number of bonds one can change is 4 and the 19 rules described in Fig. 3 generate all minimal transformations respecting the following constraints: the maximal valence is 4 and there cannot be two double bonds on the same atom in a 3 or 4 membered ring.

Proof

The minimal transformation one can perform consists of deleting one bond and creating another one. Since the bond created must be different from the one deleted at least three atoms (A₁, A₂, A₃) must be involved. Let a₁₂, a₁₃, and a₂₃ be the bond orders between the three atoms and let b₁₂, b₁₃ and b₂₃ the bond orders after the reaction has taken place. Because the atom valence is maintained the following system of equations holds:

$$\begin{aligned} & \left( {{\text{L}}_{1} } \right){\text{a}}_{12} + {\text{a}}_{13} = {\text{b}}_{12} + {\text{b}}_{13} \\ & \left( {{\text{L}}_{2} } \right){\text{a}}_{12} + {\text{a}}_{23} = {\text{b}}_{12} + {\text{b}}_{23} \\ & \left( {{\text{L}}_{3} } \right){\text{a}}_{13} + {\text{a}}_{23} = {\text{b}}_{13} + {\text{b}}_{23} \\ \end{aligned}$$

(1)

(L₁) + (L₂) – (L₃) − > a₁₂ = b₁₂, which implies a₂₃ = b₂₃ and a₁₃ = b₁₃.

It is therefore impossible to proceed to a minimal transformation with only 3 bonds involved.

Let us consider 4 atoms. There are 6 possible bonds between those atoms. Let us consider that we are changing 4 bonds, since we aim to find minimal transformations. Let us call a₁₃ and a₂₄ the two fixed bonds, without loss of generality.

Valence conservation (with b₁₃ = a₁₃ and b₂₄ = a₂₄) gives us the following system:

$$\begin{aligned} & \left( {{\text{L}}_{1} } \right){\text{a}}_{12} + {\text{a}}_{14} = {\text{b}}_{12} + {\text{b}}_{14} \\ & \left( {{\text{L}}_{2} } \right){\text{a}}_{12} + {\text{a}}_{23} = {\text{b}}_{12} + {\text{b}}_{23} \\ & \left( {{\text{L}}_{3} } \right){\text{a}}_{23} + {\text{a}}_{34} = {\text{b}}_{23} + {\text{b}}_{34} \\ & \left( {{\text{L}}_{4} } \right){\text{a}}_{14} + {\text{a}}_{34} = {\text{b}}_{14} + {\text{b}}_{34} \\ \end{aligned}$$

(2)

We can notice that (L₁) + (L₃) = (L₂) + (L₄): we therefore have a system of 3 equations with 4 unknowns, so we can set an unknown and calculate the other solutions.

As we are looking for minimal transformations, we can assume that we are changing a bond order by 1 on this unknown that we can set. Since valence is conserved, if a bond order is increased, then a bond order from the same atom has to be decreased. As the problem is perfectly symmetrical in all variables at this point, we can thus assume without loss of generality (at least one bond has to be deleted) that b₁₂ = a₁₂ − 1. Then solving the system immediately gives us b₁₄ = a₁₄ + 1, b₂₃ = a₂₃ + 1 and b₃₄ = a₃₄ − 1. This system can only be solved in our case (positive bond orders, no quadruple bonds) if a₁₄ and a₂₃ are either 0, 1 or 2 and a₁₂ and a₃₄ are either 1, 2 or 3. This means that we have at most 81 (3⁴) cases for initial bond orders where our isomer problem has a solution. However, this solution space can be further reduced by problem symmetry arguments. We can see that the roles of a₁₂ and a₃₄ are symmetrical, as well as the roles of a₂₃ and a₁₄.

Let us call A₁ the atom with the highest considered sum of bound orders (neglecting the fixed orders a₁₃ and a₂₄). Therefore, it is such that

(Condition 1):

$${\text{a}}_{12} + {\text{a}}_{14} \ge {\text{a}}_{12} + {\text{a}}_{23} \left( {{\text{higher}}\;{\text{considered}}\;{\text{sum}}\;{\text{of}}\;{\text{bound}}\;{\text{orders}}\;{\text{than}}\; {\text{A}}_{2} } \right),\quad {\text{or}}\;\;{\text{a}}_{14} \ge {\text{a}}_{23}$$

(Condition 2):

$${\text{a}}_{12} + a_{14} \ge {\text{a}}_{14} + {\text{a}}_{34} \left( {{\text{higher}}\;{\text{considered}}\;{\text{sum}}\;{\text{of}}\;{\text{bound}}\;{\text{orders}}\;{\text{than}}\;{\text{A}}_{4} } \right),\quad {\text{or}}\;{\text{a}}_{12} \ge {\text{a}}_{34}$$

a₁₂ + a₁₄ ≥ a₂₃ + a₃₄ (higher considered sum of bound orders than A₃), is automatically verified when the other two are verified.

Condition 1 is not respected when a₂₃ = 2 and a₁₄ = 0 or 1 or when a₂₃ = 1 and a₁₄ = 0, without constraints on a₁₂ and a₃₄: (2 + 1) * 9 = 27 solutions. For the same reason, 27 solutions do not respect condition 2. The solutions that do not respect both Conditions 1 and 2 are (2 + 1) * (2 + 1) = 9. By symmetry arguments, we therefore reduced the solution space from 81 to 81 − 27 − 27 + 9 = 36. These 36 reaction rules are presented in Fig. 4.

We can further reduce the solution space by considering that the maximum atom valence is 4. The solutions that do not respect this constraint are such that a₁₂ + a₁₄ = 5, so a₁₂ = 3 and a₁₄ = 2 (and this automatically verifies Conditions 1 and 2). Since there are no constraints on a₂₃ and a₃₄, we have 9 such solutions: the solution space has been reduced to 36 − 9 = 27 reactions.

One more constraint, imposed by 3D conformation of the molecule, is that there cannot be two double bonds on the same atom in a 3 or 4 membered ring.

This must be true for our initial molecule as well as for the produced molecule. For the initial molecule (as can be seen in the 4 black rules under rule 13 in Fig. 4), when a₁₂ = a₁₄ = 2, since a₃₄ > 1 (bond whose order will be reduced), there is a cycle if a₂₃ ≠ 0. There are therefore 4 solutions where the initial molecule is invalid: when a₁₂ = a₁₄ = 2, and a₃₄ is 1 or 2 (smaller than a₁₂) and a₂₃ is 1 or 2.

This must also be true for the produced molecule. Two double bonds will be produced around atom 1 with a₁₂ = 3 and a₁₄ = 1 (this can be seen in the 4 black rules under rule 18 in Fig. 4). There will be a cycle if a₃₄ ≠ 1. There are therefore 4 solutions where the produced molecule is invalid: when a₁₂ = 3, a₁₄ = 1, and a₃₄ is 2 or 3 and a₂₃ is 0 or 1 (smaller than a₁₄).

Since these solutions respect valence constraints and problem symmetry, they are not included in the previous solution space reductions and therefore the solution space is reduced to 27 − 8 = 19 solutions. A summary table of solution space reduction is given in Additional file 3: Table S1. Since we have found 19 different working solutions for all the cases we have left, we have proved that the minimal number of bonds one can change is 4 and the 19 rules described in Fig. 3 generate all these minimal transformations.

Lemma 2

Let us consider M _b an isomer of M _a . We can apply a rule from this set of 19 rules that will reduce the sum of absolute order differences between those two molecules by at least 2 and at most 4.

Proof

Let (a_ij) be the order of bonds in M_a, (b_ij), j Є [2, N], i Є [1, j − 1], the order of bonds in M_b, where N is the number of atoms in M_a and M_b. Since M_b is different from M_a, we can find i,j such that a_ij > b_ij. By valence conservation in atom A_j, we can find k such that a_jk < b_jk, and by valence conservation of atom A_k, we can also find l such that a_kl > b_kl. Therefore, we are considering 4 atoms and 4 bonds between those atoms, with at least 3 of their orders changing by 1. According to Lemma 1 the minimal number of bonds one can change is 4, so we will also have to change the bond order between A_i and A_l. We are therefore considering a minimal transformation, so we know thanks to Lemma 1 that we can apply a rule from our set of rules to generate that transformation. Let us call ${\text{M}}_{\text{a}^\prime}$ the molecule produced that way, and ${\text{a}}_{\text{ij}^\prime}$ its bond orders. Let us now calculate the sum of orders of ${\text{M}}_{\text{a}^\prime}$. Then, by applying the rule, we have ${\text{a}}_{\text{ij}^\prime}$ = a_ij − 1 and therefore $\left| {b_{\text{ij}} - a_{\text{ij}^\prime} } \right| = \left| {{\text{b}}_{\text{ij}} {-}{\text{a}}_{\text{ij}} } \right| - 1$. For the same reason, $\left| {{\text{b}}_{\text{kl}} - {\text{a}}_{\text{kl}^\prime}} \right| = \left| {{\text{b}}_{\text{kl}} - {\text{a}}_{\text{kl}} } \right| - 1$. Moreover, ${\text{a}}_{\text{jk}^\prime} {\text{ = a}}_{\text{jk}} { + 1}$, and since a_jk is smaller than b_jk, we also have $\left| {{\text{b}}_{\text{jk}} - {\text{a}}_{\text{jk}^\prime} } \right|{ = }\left| {{\text{b}}_{\text{jk}} - {\text{a}}_{\text{jk}} } \right| - 1$. The only bond we did not choose to change is a_li. The order ${\text{a}}_{\text{ij}^\prime}$ of the transformed bond is either closer to b_li than was a_li, then the difference of the sum of absolute order differences is reduced by 4, or is further from b_li, and this sum is reduced by 2. Therefore, if M_a and M_b are different, we can apply a rule from this set of rules that will decrease the sum of absolute order differences by at least 2 and at most 4.

Lemma 3

Considering M _a and M _b an isomer of M _a , the 19 rules described in Fig. 3 allow us to transform M _a into M _b using at most 3/4*(N ² − N) single transformations, where N is the number of atoms, respecting the following constraints: the maximal valence is 4 and there cannot be two double bonds on the same atom in a 3 or 4 membered ring.

Proof

Let us consider M_b an isomer of M_a. If the sum of absolute order differences is not null, then M_b is different from M_a and using Lemma 2, we know we can apply a rule that will strictly decrease the sum of absolute order differences. This sum is obviously positive, is an integer, and is strictly decreasing each time we apply a transformation rule so it will converge to 0 in S/2 transformations at most, where S is the sum of absolute order differences between M_a and M_b. When this sum is null, all bond orders are the same, which means the molecules are the same. An upper estimation of the maximum bond order difference is obtained when M_a only has triple bonds, which all have to be deleted. In that case, the sum of absolute order differences is: S = 3*(N ² − N)/2, where N is the number of atoms and (N ² − N)/2 the number of defined orders (since a_ij = a_ji). Therefore, since the sum decreases by at least 2, the maximum number of transformations we need to apply is 3*(N ² − N)/4.

Proof of the main claim

Given the workings of the algorithm (breadth-first, as explained in “RetroPath2.0 core algorithm” section), the number of iterations for generating all isomers is the number of iterations for generating the furthest one in term of bond order difference from our starting molecule. Therefore, applying Lemma 3, we know the maximum number of iterations of the algorithm is 3 * (N ² − N)/4.

Notice that although the number of iterations of the algorithm scales O(N ²), the number of transformation rules applied (i.e.: single reactions) is proportional to the number of isomers.

Corollary 1

The maximum number of iterations to generate all alkanes is N − 1, where N is the number of carbon atoms (hydrogens are not considered here).

Proof

Adapting the demonstration of Lemma 3, we have to consider the sum of absolute order differences of the farthest isomers that can be reached. Since alkanes are acyclic, the number of bonds is N − 1 (proven by a simple recurrence, the new atom being joined at a single point to the chain since the molecule is acyclic). Therefore, considering all bonds are different in the new molecule, the sum of absolute order differences is at most 2(N − 1). Therefore, the maximum number of iterations of the algorithm is N − 1.

The isomer transformation algorithm was applied to generate all alkanes up to 18 carbon atoms using rule R ₁ of Fig. 3, since it is the only rule with only single bonds. Results are presented in Table 1, where it can be seen that Corollary 1 is verified in practice.

Table 1 Number of generated alkane isomers by canonical augmentation algorithm and isomer transformation algorithm

Full size table

Virtual screening in the chemical space

In this section we used RetroPath2.0 to search all molecules that are at predefined distances of a given set of molecules. Such queries are routinely carried out in large chemical databases for drug discovery purposes [31], but in the present case we search similar structures in the entire chemical space. To perform search in the chemical space, we used a source set composed of 158 well-known monomers having a molecular weight up to 200 Da. Our rule set included the transformations colored green in Fig. 3 (i.e. transformation rules where double bonds are not transformed into cycles and conversely). For each monomer, RetroPath2.0 was iterated until no new isomers were generated. Each generated structures at a Tanimoto similarity greater than 0.5 from its corresponding monomer were retained (Tanimoto was computed using MACCS keys fingerprints [32]).

Next, we wanted to probe if the generated structures exhibited interesting properties as far as polymer properties are concerned. To that end we first developed a QSPR model taking properties from [33]. We focused on polymer glass transition temperature T _g data [34]. The QSPR model was based on a random forest trained using RDKit fingerprints descriptors [19]. The obtained model had a leave-one-out cross-validation performance of Q² = 0.75. The model was then applied to predict the T _g for the set of enumerated isomers. Figure 5 compares the distribution of predicted T _g values for the enumerated isomers with those obtained from isomer structures available from PubChem. T _g values for enumerated isomers appeared evenly distributed around 301.86 ± 25.69 K compared with the isomers that were available in PubChem (331.66 ± 46.19 K). This shift in the T _g values could be explained by the difference in distribution that necessarily exists between the isomers that are present in PubChem and the total number of enumerated isomers. As we lower the Tanimoto threshold, some monomers might become underrepresented in terms of isomer availability in PubChem. Additional file 2: Figure S3 shows the distributions of both sets of isomers in function of the threshold. The increased ability of selecting polymers with T _g above or below room temperature for the enumerated set compared with the PubChem isomers is a desirable feature, as this parameter will determine the mechanical properties of the polymer [35]. In that way, performing a virtual screening of the chemical space of isomers of the reference monomers opens the possibility to engineering applications with improved polymer design.

Moreover, we were interested in determining how many of the starting 158 monomers were accessible through biosynthesis. Namely, how many of the compounds can be synthesized by engineering a metabolic pathway in a chassis organism. This computation can be accomplished by RetroPath2.0 by defining all naturally produced chemicals as sinks in the workflow and using a collection of known enzymatic reaction rules in reversed mode. The process has been described in detailed elsewhere [17]. Through the application of the rules in a retrosynthetic fashion, it is possible to determine the routes that connect the target compounds to the natural precursors. Of the 158 available monomers, using the RetroPath2.0 workflow downloaded from MyExperiment.org [36], we were able to identify 17 compounds that can be naturally synthetized (Fig. 6a). We provide in an archive containing the list of pathways for those 17 compounds.

The QPSR model for T _g was applied to the set of enumerated isomers. As shown in Fig. 6b, the resulting set provided a good covering of the chemical space surrounding the starting monomer set. Moreover, a significant number of enumerated isomers show a high predicted T _g value, which may indicate a good candidate as a building block replacement for known monomers. Interestingly, those isomers that were close to biosynthetic accessible monomers (Tanimoto based on MACCS keys fingerprint > 0.8) have a distribution of predicted T _g values that significantly differ from the full set (p value < 1e − 12 Welch t-test), with a mean T _g = 352.1 K (T _g = 301.9 K in the full distribution). These close isomers to biosynthetically accessible monomers might be considered as good candidates for alternative biosynthesis since reaching them through biosynthesis may require only few modifications of the original catalytic route.

Search for molecules maximizing biological activities

In this section we are interested in searching chemical structures in the chemical space optimizing biological activities. This type of search can be solved using inverse QSAR procedures [37]. Inverse QSAR requires to first building a QSAR equation predicting activities from structure and then either (1) inverting the equation and enumerating structures matching a given activity [37] or (2) searching in the chemical space structures similar to those used to build the QSAR equation [33] but having optimized activities. The second approach makes use of either deterministic methods such as lattice enumeration [38] or stochastic searches.

We propose here to use RetroPath2.0 to solve the inverse QSAR problem using a stochastic approach with isomer transformation rules and enzymatic rules for biosynthetic accessibility. To this end, we selected a dataset of 47 aminoglycosides structures for which antibacterial activities have been measured using a MIC assay [39]. The dataset is composed of natural aminoglycosides (gentamicin, tobramycin, neomycin, kanamycin A and B, paromomycin, ribostamycin and neamine) to which are added synthetic structures built on a neamine scaffold. This dataset has already been used to build a QSAR model based on CoMFA analysis leading to a Q2 of 0.6 for a Leave-One-Out (LOO) procedure [39]. We provide in Additional file 1 a QSAR workflow that makes use of RDKit fingerprints [19] and random forest as a learner leading to a higher Q2 (0.7) for LOO. With that QSAR in hand we run RetroPath2.0 with a source set composed of the 47 aminoglycosides used in the training set, and two different reaction rules sets. The first set is extracted from the transforming enumeration rules depicted in Fig. 3, the second set is composed of enzymatic reaction rules leading to neamine (an aminoglycoside) biosynthesis from glucose. Reaction rules for the second set were computed as explained in the method section resulting in 94 rules specific to the biosynthesis of aminoglycosides.

In both cases, reactions rules were fired on the initial source set composed of 47 structures. All rule products were ranked according to their predicted activities as calculated by the QSAR and were selected for the next iteration according to a tournament procedure described in the methods section which derives from [40]. The Fig. 7a, b below gives the top activity and the average population activity vs. iteration. The most active structures found by each rule set are also drawn in Fig. 7c, d.

We observe in Fig. 7 that the average curve in B is lower to the one in A. This is due to the fact that enzymatic rules generated a lot of compounds that are structurally far from aminoglycoside (i.e.H₂O, NH4⁺, O₂…). Moreover, the rules used for A, allow more transformation/modification, thus enabling to better explore the chemical space, and ultimately finding more active compounds. We note that the structure in Fig. 7c have a slightly better predicted activity (pACT = 9.015) than the initial compounds used in the training set, while the structure in Fig. 7d have the same predicted activity than gentamicin (pACT = 8.867).

Metabolome completion and metabolomics

In this last example we use enzymatic reaction rules in an attempt to complete the metabolome of species used in biotechnology. We are motivated here by current efforts invested to complete the knowledge on the metabolism of various organisms [6, 7, 15]. The benefits are numerous and include the identification of relevant biomarkers for many diseases; for personalized nutrition advice; and also for searching for relevant indicators and metabolites of plant and animal stress in agricultural practices and breeding programs. Additionally, knowing the metabolic space of microbes is an essential step for optimizing metabolic engineering and creating synthesis pathways for new compounds for industrial applications.

Experimental evidences from metabolomics analyses are often informing us that with currently known metabolites one cannot cover the ranges of masses found in actual samples, and consequently there is a need of completing the metabolomes of interest. This need is clearly seen in the Human Metabolome Database (HMDB) where the number of reported masses has recently grown from 20,931 in 2013 [41] to 74,461 (at the time this manuscript was written), while annotated metabolites in metabolic databases are still in the range of 1847 (HumanCyc). Despite such a growth in databases, a significant amount of spectral peaks remains unassigned. This high fraction of unassigned peaks might be due to several factors including isotope, adduct formation, ion fragmentation, and multimers. Besides such sources of uncertainty in samples, many unassigned peaks should also be due to promiscuous activities of enzymes not yet characterized because of the lack of an appropriate description of the mechanisms of enzyme promiscuity.

To gain insights into those mechanisms enabling promiscuity, reaction rules have been shown to be appropriate [21] in particular the rules allowing to focus on the center of the reactions. To this end, several enzymatic reaction rules have been proposed such as those derived from bond-electron matrices [42], on the smallest molecular substructure changing during transformations [9], or on reaction rules that code for variable environments at reaction centers (see [7] and “Methods” section). That latter reaction rule system codes for changes in atom bonding environments where the reaction is taking place and the environment can range from including only the atoms participating to the reaction center to the entire set of atoms and molecules participating to the reaction. The advantage of that latter approach is that the size of the environment (named diameter) can be tuned to control the combinatorial explosion of possible products.

The degree of plasticity in metabolic networks that is uncovered by variable reaction center diameter is actually revealing an intrinsic feature of organisms linked to their adaptability, i.e. enzyme promiscuity. Promiscuity stands for the ability of enzymes to catalyze more than one reaction or to accept more than one substrate, a mechanism which can be traced to the evolutionary origins of enzymatic functions. Mimicking nature, such enzyme versatility can provide novel ways for biosynthesizing metabolite and even bioproducing non-natural molecule. To that end, the variable diameter method has shown itself to be especially well-suited for modeling the mechanisms of enzyme promiscuity as it has already enabled the experimentally validated discovery of a novel metabolite in E. coli and of the promiscuous enzymes producing it [21].

In this study, we make use of RetroPath2.0 to exemplify how variable reaction rule diameters can be used to complete the metabolome of E. coli. More precisely, we used as a source set all the metabolites present in E. coli iJO1366 model [43]. We first tested the two rule sets aforementioned, a set of about 100 reaction rules part of the BNICE framework [42] and a set of 50 reaction rules developed with the Sympheny software [9]. The reaction rules coded in the form of SMARTS string are provided in Supplementary at MyExperiment.org, along with the EC numbers corresponding to the rules. While the two rule sets were not developed to code only for E. coli reactions, for each EC number there is a corresponding enzyme annotated in E. coli so we kept all rules in the two sets. We then tested reaction rules with variable diameters using the procedure described in the method section to code for all E. coli metabolic reactions extracted from iJO1366 model. Rules were calculated for each reaction with diameters ranging from 2 to 16. Table 2 below provides the number of compound generated running RetroPath2.0 for one iteration on the metabolites of the iJO1366 model and the rules sets mentioned above (see “Methods” section for additional details).

Table 2 Compounds generated by RetroPath2.0 using various reaction rules applied on E. coli iJO1366 model metabolites [43]

Full size table

Table 2 shows that the number of compounds generated increases as the diameter decreases. This is consistent with the fact that shorter diameters will accept more substrates than higher ones and will thus produce more products. Although they were not constructed with diameters, the BNICE and Sympheny rule sets generally correspond to small environments comprising only few atoms and bonds around reaction centers, which explain why these two systems generate more products than high diameter rule sets. Nonetheless, even with high diameters, all variable diameter rule sets produce more molecules found in E. coli model than the BNICE and Sympheny rule sets. This might indicate that the variable rule sets correspond to a more accurate coding of metabolic reactions than the other systems.

To further probe the coverage of the various rules sets listed in Table 2 we searched if the compounds produced could be found in MS spectra. To this end, we downloaded MS spectra from Metabolight [44] where masses have been measured on E. coli cell extracts. The spectra downloaded corresponded to a study aimed at probing the dynamics of isotopically labeled molecules (i.e. ¹³C labeled glucose) [45]. Since we are concerned here with wild type E. coli metabolome, we considered only the spectra where E. coli cells had not yet been exposed to labeled glucose (spectra acquired at time t = 0). All compounds generated by our various rules sets were prepared to be read by OpenMS nodes [23] and a workflow was written with these nodes to annotate the MS spectra peaks (cf. “Methods” section for details).

The results presented in Table 2 show that as the diameter decreases the number of peak assignment increases, which is not surprising considering that the number of compounds generated increases as well. We observe that the Sympheny and BNICE rules sets give results similar to those obtained by the D6 rule set, albeit with a higher number of annotations per peak.

In all cases the rule sets produced compounds not present in the E. coli model but with corresponding masses in the MS spectra. Additional file 4: Table S2 give a list of 40 such compounds having an identifier in MetaNetX [11] and produced by three identical reactions (i.e. reactions having the same substrates and products) generated using the Sympheny, BNICE and D6 rule sets. The compounds were produced by 53 reactions, some compounds being produced by more than one reaction. We note that the 40 compounds have been generated by rule sets for which at least one gene in E. coli has been annotated with the same corresponding EC number. The 40 compounds are thus potential new E. coli metabolites and their presence should be further verified using for instance MS/MS analysis (Additional file 4: Table S2).

Conclusions

In this paper we have presented a general method allowing one to explore the chemical space around a given molecule, or around a given set of molecules. The originality of the method is that the exploration is performed through chemical reactions rules. We have given a set of rules allowing us to generate any isomer of any given molecule of the chemical space. We also provide examples making use of reaction rules computed from enzymatic reactions. Using rules computed on known reactions has a definite advantage regarding the (bio)synthetic accessibility of the molecule produced, which is not necessarily the case for other techniques producing molecules de novo [33, 37, 40, 49,50,51,52,53].

Our method has been implemented into RetroPath2.0, a workflow running on the KNIME Analytics Platform [18]. RetroPath2.0 can easily be used with source molecules and reaction rules different than those presented in the paper. For instance the workflows provided in Additional file 1 can be used with the reaction SMARTS rules and fragment libraries (as source compounds) of the DOGS software (inSili.com LLC [4]) developed for de novo drug design, other technique evolving molecules toward specific activities or properties [40, 50,51,52] could also be implemented in RetroPath2.0 provided that one first codes reaction rules in SMARTS format.

Aside from searching molecules having interesting properties and activities RetroPath2.0 can also be used to complete metabolic maps by proposing new metabolites biosynthesized through promiscuous enzymes, these new metabolites can in turn be used to annotate MS spectra and to that end we provide an interface with OpenMS [23]. Finally, RetroPath2.0 was originally developed to enumerate pathways producing a given target product from a source set of reactants. While we have benchmarked the workflow in the context of metabolic engineering, [17] it can also be used for synthesis planning as long as synthesis reaction rules are available.

Methods

Generating reaction rules

All our reaction rules are represented in the form of reaction SMARTS [14]. Reaction rules used for canonical augmentation are provided in Fig. 1 and for isomer transformation in Figs. 3, 4, Additional file 2: Figure S2 and Additional file 5: Table S3. Enzymatic reaction rules were computed taking enzymatic reactions from MetaNetX version 2.0 [11]. To compute rules, we first performed an Atom–Atom Mapping (AAM) using the tool developed by [46] (Fig. 8a). Next, multiple substrates reactions were decomposed into components (panel C and D in Fig. 8). There are as many components as there are substrates and each component gives the transformation between one substrate and the products. Each product must contain at least one atom from the substrate according to the AAM. This strategy enforces that only one substrate can differ at a time from the substrates of the reference reaction when applying the rule.

The following step consisted in computing reactions rules as reaction SMARTS for each component. We did it for diameters 2–16 around the reaction center (panels C and D in Fig. 8) by removing from the reaction components all atoms that were not in the spheres around the reaction center atoms.

We extracted more than 24,000 reaction components from MetaNetX reactions, each one of those leading to a rule at each diameter (from 2 to 16).

We provide in Supplementary at MyExperiment.org a subset of 14,300 rules for E. coli metabolism. The rules were selected based on the MetaNetX binding to external databases and the iJO1366 whole-cell E. coli metabolic model [43]. We also provide enzymatic rules enabling the biosynthesis of aminoglycosides from Glucose. The reactions were extracted from the map00524 KEGG map [47], and rules were computed as above on reactions for which a MetaNetX identifier could be retrieved. The resulting set comprised 94 rules calculated for each diameter ranging from 2 to 16.

RetroPath2.0 core algorithm

The RetroPath2.0 workflow essentially follows an algorithm proposed by some of us [16, 17] and its workflow implementation, which has already been described in details [17], is summarized in Fig. 9. We here focus on the different usages of RetroPath2.0 for the use cases provided in “Results and discussions” section.

In all cases the workflow performs the generation of structures in a breadth-first way by applying iteratively the same procedure. An iteration starts by applying reaction rules to each of the compounds of a source set. For each compound, the products are computed using the RDKit KNIME nodes one-component or two-component reactions [19]. Products are sanitized (removal of structures having incorrect valence), standardised and duplicates are merged. The set of products will become the new source set for the next iteration. The workflow iterates until a predefined number of iterations is reached or until the source set is empty.

In the case of isomer augmentation (workflow RetroPath2.0-Mods-isomer-augmentation, “Isomer enumeration” sections) the initial source set is composed of a single carbon atom and the rule used is R ₁ in Fig. 1, since it is the only rule that will produce acyclic molecules. The rule is fired on the source set, and the products become the new source set in the next iteration. The workflow is iterated a number of times equal to N − 1, where N is the number of atoms one wishes the final molecule to have.

In the case of isomer transformation (workflow RetroPath2.0-Mods-isomer-transformation, “Isomer enumeration” and “Virtual screening in the chemical space” section) the initial source set is composed of a molecule that is filled with the appropriate number of hydrogens using the RDKit KNIME node Add Hs. At each iteration rules are fired on the source set and the products obtained become the new source set for the next iteration. As an additional last step of each iteration, products that have already been processed in a previous iteration are filtered out before building the next source set. This necessitates maintaining a set (named sink) comprising all molecules so far generated. All products that have already been obtained are removed from the product set and the remaining molecules are (1) added to the sink set and (2) used as the new source set for the next iteration. This avoids applying reactions on the same products during subsequent iterations. Disconnected structures are removed from the results by filtering out any product having several disconnected components (according to the SMILES representation). When enumerating alkane, disconnected structures represents between 50 and 66% (depending of the alkane size) of the generated structures before filtering and merging duplicates. To generate the results of Table 1, since we are enumerating alkanes (no multiple bonds or cycles), the rule to be used is R ₁ in Fig. 3. To enumerate the isomers of the monomers in “Virtual screening in the chemical space” section, if we prohibit the transformation of multiple bonds into cycles and thus keep the number of single, double and triple bonds constant, the rules to be used are R ₁, R ₅ and R ₉ in Fig. 3 (also found in Additional file 2: Figure S2 since the monomers used do not contain triple bonds). Since this algorithm can become computationally intensive, we also provide an additional workflow (called RetroPath2.0-Mods-isomer-transformation-queue) to deal with memory management. This workflow illustrates how to introduce a FIFO data structure for the source set (i.e. queue containing structures upon which rules will be fired) and use it for iteratively firing rules on small chunks of structures (e.g. chunk of 20 structures), new products obtained are then added to the source queue. Interestingly, the breadth-first approach for generating the structures can be replaced by a depth-first approach by replacing the queue (first in, first out structure) by a stack (last in, first out structure).”

In the case of inverse-QSAR (workflow RetroPath2.0-Mods-iQSAR, “Search for molecules maximizing biological activities” section), the source set initially comprises the molecules used in the training set when building the QSAR. At each iteration, one or two molecules are chosen at random from the source set depending on the rule set that is being used (one molecule with enzymatic reaction rules, two molecules with isomer transformation rules). Rules are then fired on the selected molecules and an activity is predicted for each product using the QSAR equation. The source set is updated retaining molecules according to a selection tournament procedure borrowed from [40]. Briefly, the initial source set (i.e. the set of structures used at the start of the current iteration) is merged with the product set (i.e. the set of structures obtained after firing the rules). This merged set is then randomly split into 10 subsets and the 10 top best structures from each subset are retained according to their predicted activity. Finally, all the retained structures are pooled together to form the updated source set to be used at the next iteration. The workflow is iterated a (user) predefined number of times.

In the case of E. coli metabolic network completion (workflow RetroPath2.0-Mods-metabolomics, “Metabolome completion and metabolomics” section), we provide three workflows. The first workflow is RetroPath2.0, which is fully described in [17] and is similar to the isomer transformation one. Here, RetroPath2.0 produces a list of molecules obtained using E. coli enzymatic reaction rules (see “Generating reaction rules” section). The second workflow takes as input the products generated by RetroPath2.0, computes the exact mass for each product and prepare files to be read by OpenMS nodes for MS data peak assignment [23]. The last workflow is built with OpenMS nodes, it reads several MS data files in mzML format, two lists of adducts in positive and negative modes, and the files generated by the second workflow (containing RetroPath2.0 generated products with masses). The workflow searches for each compound the corresponding peak in the MS spectra. The workflow was parameterized for metabolomics analysis as described in OpenMS manual [48], the AccurateMassSearch node was set to negative ion mode as the experiment were carried out with an LTQ-Orbitrap instrument operating in negative FT mode (cf. protocols in [44]).

Further details on how to run all the above workflows are provided in the Supplementary at MyExperiment.org.

References

Chemicals Abstracts Service, CASREACT https://www.cas.org/content/reactions. Accessed 28 June 17
Warr WA (2014) A short review of chemical reaction database systems, computer-aided synthesis design, reaction prediction and synthetic feasibility. Mol Inform 33(6–7):469–476
Article CAS Google Scholar
Szymkuc S, Gajewska EP, Klucznik T, Molga K, Dittwald P, Startek M, Bajczyk M, Grzybowski BA (2016) Computer-assisted synthetic planning: the end of the beginning. Angew Chem Int Ed Engl 55(20):5904–5937
Article CAS Google Scholar
Schneider G (2014) Future de novo drug design. Mol Inform 33(6–7):397–402
Article CAS Google Scholar
Moriya Y, Shigemizu D, Hattori M, Tokimatsu T, Kotera M, Goto S, Kanehisa M (2010) PathPred: an enzyme-catalyzed metabolic pathway prediction server. Nucleic Acids Res 38(Web Server issue):W138–W143
Article CAS Google Scholar
Jeffryes JG, Colastani RL, Elbadawi-Sidhu M, Kind T, Niehaus TD, Broadbelt LJ, Hanson AD, Fiehn O, Tyo KE, Henry CS (2015) MINEs: open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics. J Cheminform 7:44
Article Google Scholar
Carbonell P, Parutto P, Herisson J, Pandit SB, Faulon JL (2014) XTMS: pathway design in an eXTended metabolic space. Nucleic Acids Res 42:W389–W394
Article CAS Google Scholar
Hadadi N, Hafner J, Shajkofci A, Zisaki A, Hatzimanikatis V (2016) ATLAS of biochemistry: a repository of all possible biochemical reactions for synthetic biology and metabolic engineering studies. ACS Synth Biol 5(10):1155–1166
Article CAS Google Scholar
Yim H, Haselbeck R, Niu W, Pujol-Baxley C, Burgard A, Boldt J, Khandurina J, Trawick JD, Osterhout RE, Stephen R, Estadilla J, Teisan S, Schreyer HB, Andrae S, Yang TH, Lee SY, Burk MJ, Van Dien S (2011) Metabolic engineering of Escherichia coli for direct production of 1,4-butanediol. Nat Chem Biol 7(7):445–452
Article CAS Google Scholar
Liu M, Bienfait B, Sacher O, Gasteiger J, Siezen RJ, Nauta A, Geurts JM (2014) Combining chemoinformatics with bioinformatics: in silico prediction of bacterial flavor-forming pathways by a chemical systems biology approach “reverse pathway engineering”. PLoS ONE 9(1):e84769
Article Google Scholar
Moretti S, Martin O, Van Du Tran T, Bridge A, Morgat A, Pagni M (2016) MetaNetX/MNXref–reconciliation of metabolites and biochemical reactions to bring together genome-scale metabolic networks. Nucleic Acids Res 44(D1):D523–D526
Article CAS Google Scholar
Ugi I, Bauer J, Bley K, Dengler A, Dietz A, Fontain E, Gruber B, Herges R, Knauer M, Reitsam K, Stein N (1993) Computer-assisted solution of chemical problems—the historical development and the present state of the art of a new discipline of chemistry. Angew Chem Int Ed Engl 32(2):164–189
Article Google Scholar
Carbonell P, Planson AG, Fichera D, Faulon JL (2011) A retrosynthetic biology approach to metabolic pathway design for therapeutic production. BMC Syst Biol 5:122
Article Google Scholar
Daylight Theory: SMARTS http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html. Accessed 28 June 2017
Hadadi N, Hatzimanikatis V (2015) Design of computational retrobiosynthesis tools for the design of de novo synthetic pathways. Curr Opin Chem Biol 28:99–104
Article CAS Google Scholar
Faulon J-L, Carbonell P (2010) Reaction network generation. In: Faulon J-L, Bender A (eds) Handbook of chemoinformatics algorithms. CRC Press, Boca Raton
Chapter Google Scholar
Delépine B, Duigou T, Carbonell P, Faulon JL (2017) RetroPath2.0: A retrosynthesis workflow for metabolic engineers. Metab Eng. https://doi.org/10.1016/j.ymben.2017.12.002
Berthold MR et al (2008) KNIME: The Konstanz Information Miner. In: Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R (eds) Data analysis, machine learning and applications. Studies in classification, data analysis, and knowledge organization. Springer, Berlin, Heidelberg, pp 319–326
Landrum G (2016) RDKit: open-source cheminformatics http://www.rdkit.org/. Accessed 2 Aug 2016
Rodrigues T, Reker D, Welin M, Caldera M, Brunner C, Gabernet G, Schneider P, Walse B, Schneider G (2015) De novo fragment design for drug discovery and chemical biology. Angew Chem Int Ed Engl 54(50):15079–15083
Article CAS Google Scholar
Mellor J, Grigoras I, Carbonell P, Faulon JL (2016) Semisupervised Gaussian process for automated enzyme search. ACS Synth Biol 5(6):518–528
Article CAS Google Scholar
Feher T, Planson AG, Carbonell P, Fernandez-Castane A, Grigoras I, Dariy E, Perret A, Faulon JL (2014) Validation of RetroPath, a computer-aided design tool for metabolic pathway engineering. Biotechnol J 9(11):1446–1457
Article CAS Google Scholar
Rost HL, Schmitt U, Aebersold R, Malmstrom L (2014) pyOpenMS: a python-based interface to the OpenMS mass-spectrometry algorithm library. Proteomics 14(1):74–77
Article Google Scholar
Thiagarajan D, Mehta DP (2016) Faster algorithms for isomer network generation. J Chem Inf Model 56(12):2310–2319
Article CAS Google Scholar
Peironcely JE, Rojas-Cherto M, Fichera D, Reijmers T, Coulier L, Faulon JL, Hankemeier T (2012) OMG: open molecule generator. J Cheminform 4(1):21
Article CAS Google Scholar
Jaghoori MM, Jongmans STQ, de Boer F, Peironcely J, Faulon JL, Reijmers T, Hankemeier T (2013) PMG: multi-core metabolite identification. Electron Notes Theor Comput Sci 299:53–60
Article Google Scholar
McKay B (1998) Isomorph-free exhaustive generation. J Algorithm 26:306–324
Article Google Scholar
Gally JM, Bourg S, Do Q-T, Aci-Seche S, Bonnet P (2017) VSPrep: a general KNIME workflow for the Preparation of molecules for virtual screening. Mol Inf 36:1–11
Article Google Scholar
Ruddigkeit L, van Deursen R, Blum LC, Reymond JL (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inf Model 52(11):2864–2875
Article CAS Google Scholar
Isomers of Alkanes, Univ Bayreuth http://www.mathe2.uni-bayreuth.de/sascha/oeis/alkane.html. Accessed 28 June 17
Willett P (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 11(23–24):1046–1053
Article CAS Google Scholar
Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42(6):1273–1280
Article CAS Google Scholar
Brown WM, Martin S, Rintoul MD, Faulon JL (2006) Designing novel polymers with targeted properties using the signature molecular descriptor. J Chem Inf Model 46(2):826–835
Article CAS Google Scholar
Subramanian MN (2017) Polymer properties. In: Polymer blends and composites: chemistry and technology. Wiley, Hoboken, NJ, USA. https://doi.org/10.1002/9781119383581.ch3
Gerdeen JC, Rorrer RA (2011) Engineering design with polymers and composites, vol 30. CRC Press, Boca Raton
Google Scholar
RetroPath2.0 at MyExperiment.org https://www.myexperiment.org/workflows/4987.html. Accessed 29 Apr 2017
Churchwell CJ, Rintoul MD, Martin S, Visco DP Jr, Kotu A, Larson RS, Sillerud LO, Brown DC, Faulon JL (2004) The signature molecular descriptor. 3. inverse-quantitative structure–activity relationship of ICAM-1 inhibitory peptides. J Mol Graph Model 22(4):263–273
Article CAS Google Scholar
Martin S (2012) Lattice enumeration for inverse molecular design using the signature descriptor. J Chem Inf Model 52(7):1787–1797
Article CAS Google Scholar
Setny P, Trylska J (2009) Search for novel aminoglycosides by combining fragment-based virtual screening and 3D-QSAR scoring. J Chem Inf Model 49(2):390–400
Article CAS Google Scholar
Kawai K, Nagata N, Takahashi Y (2014) De novo design of drug-like molecules by a fragment-based molecular evolutionary approach. J Chem Inf Model 54(1):49–56
Article CAS Google Scholar
Wishart DS, Jewison T, Guo AC, Wilson M, Knox C, Liu Y, Djoumbou Y, Mandal R, Aziat F, Dong E, Bouatra S, Sinelnikov I, Arndt D, Xia J, Liu P, Yallou F, Bjorndahl T, Perez-Pineiro R, Eisner R, Allen F, Neveu V, Greiner R, Scalbert A (2013) HMDB 3.0—the human metabolome database in 2013. Nucleic Acids Res 41(Database issue):D801–D807
CAS Google Scholar
Henry CS, Broadbelt LJ, Hatzimanikatis V (2010) Discovery and analysis of novel metabolic pathways for the biosynthesis of industrial chemicals: 3-hydroxypropanoate. Biotechnol Bioeng 106(3):462–473
CAS Google Scholar
Orth JD, Conrad TM, Na J, Lerman JA, Nam H, Feist AM, Palsson BO (2011) A comprehensive genome-scale reconstruction of Escherichia coli metabolism-2011. Mol Syst Biol 7:535
Article Google Scholar
MetaboLights, MTBLS229 study http://www.ebi.ac.uk/metabolights/MTBLS229. Accessed 28 June 2017
Kiefer P, Schmitt U, Muller JE, Hartl J, Meyer F, Ryffel F, Vorholt JA (2015) DynaMet: a fully automated pipeline for dynamic LC–MS data. Anal Chem 87(19):9679–9686
Article CAS Google Scholar
Rahman SA, Torrance G, Baldacci L, Martinez Cuesta S, Fenninger F, Gopal N, Choudhary S, May JW, Holliday GL, Steinbeck C, Thornton JM (2016) Reaction Decoder Tool (RDT): extracting features from chemical reactions. Bioinformatics 32(13):2065–2066
Article Google Scholar
KEGG Pathway, Neomycin, kanamycin and gentamicin biosynthesis http://www.genome.jp/kegg-bin/show_pathway?hsa00524. Accessed 28 June 2017
Walzer M, Sachsenberg T, Aicheler F, Rurik M, Veit J, Isabell B, Pedrioli P, Pfeuffer J, Liang X, Reinert K, O Kohlbacher. OpenMS tutorial https://www.openms.de/wp-content/uploads/2016/02/handout1.pdf. Accessed 28 June 2017
Visco DP Jr, Pophale RS, Rintoul MD, Faulon JL (2002) Developing a methodology for an inverse quantitative structure–activity relationship using the signature molecular descriptor. J Mol Graph Model 20(6):429–438
Article CAS Google Scholar
van Deursen R, Reymond JL (2007) Chemical space travel. ChemMedChem 2(5):636–640
Article Google Scholar
Yu MJ (2011) Natural product-like virtual libraries: recursive atom-based enumeration. J Chem Inf Model 51(3):541–557
Article CAS Google Scholar
Hoksza D, Skoda P, Vorsilak M, Svozil D (2014) Molpher: a software framework for systematic chemical space exploration. J Cheminform 6(1):7
Article Google Scholar
Virshup AM, Contreras-Garcia J, Wipf P, Yang W, Beratan DN (2013) Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds. J Am Chem Soc 135(19):7296–7303
Article CAS Google Scholar

Download references

Authors’ contributions

The work was directed by JLF who sketched the proof of the claim, designed all workflows, and generated the results presented in “Isomer enumeration” and “Metabolome completion and metabolomics” sections. MK wrote the proofs in “Isomer enumeration” section, TD wrote all workflows and produced the results in “Search for molecules maximizing biological activities” section. PC used the workflows to generate the results in “Virtual screening in the chemical space” section. All authors contributed to the manuscript write-up. All authors read and approved the final manuscript.

Acknowledgements

The authors acknowledge Baudoin Delépine for his contribution to the general workflow architecture and enzymatic reaction rules calculations.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Availability of data and materials

The workflows and associated data supporting the conclusions of this article are available on MyExperiment.org, http://www.myexperiment.org/packs/728.html.

Ethics approval and consent to participate

Not applicable.

Funding

This work was supported by the French National Research Agency [ANR-15-CE1-0008], the Biotechnology and Biological Sciences Research Council, Centre for synthetic biology of fine and speciality chemicals [BB/M017702/1]; and Synthetic Biology Applications for Protective Materials [EP/N025504/1]. MK acknowledges funding provided by the DGA and École Polytechnique.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations

Micalis Institute, INRA, AgroParisTech, Université Paris-Saclay, 78350, Jouy-en-Josas, France
Mathilde Koch, Thomas Duigou & Jean-Loup Faulon
SYNBIOCHEM Centre, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK
Pablo Carbonell & Jean-Loup Faulon
CNRS-UMR8030/Laboratoire iSSB, Université Paris-Saclay, 91000, Évry, France
Jean-Loup Faulon

Authors

Mathilde Koch
View author publications
You can also search for this author in PubMed Google Scholar
Thomas Duigou
View author publications
You can also search for this author in PubMed Google Scholar
Pablo Carbonell
View author publications
You can also search for this author in PubMed Google Scholar
Jean-Loup Faulon
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jean-Loup Faulon.

Additional files

13321_2017_252_MOESM1_ESM.zip

Additional file 1. Monomers maps obtained running Retropath2.0 in section “Virtual screening in the chemical space”. The 17 compounds of the 158 available monomers that can be naturally synthesized and the corresponding synthesis pathways.

13321_2017_252_MOESM2_ESM.docx

Additional file 2. Figure S1 represents the execution time of the tested softwares for alkane enumeration. Figure S2 represents the reduced set of transformation rules excluding triple bounds. Figure S3 represents the distributions of predicted T _g values for enumerated isomers and for isomers found in PubChem with varying Tanimoto threshold.

13321_2017_252_MOESM3_ESM.xlsx

Additional file 3: Table S1. Solution space reduction arguments or rules corresponding to bond configurations in section “Isomer transformation”. This is shown for the 36 rules after reduction of the solution space by problem symmetry arguments.

13321_2017_252_MOESM4_ESM.xls

Additional file 4: Table S2. List of 40 compounds from section “Metabolome completion and metabolomics” having an identifier in MetaNetX and produced by three identical reactions (i.e. reactions having the same substrates and products) generated using the Sympheny, BNICE and D6 rule sets.

Additional file 5: Table S3. List of the 19 SMARTS rules that were used in this study.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Koch, M., Duigou, T., Carbonell, P. et al. Molecular structures enumeration and virtual screening in the chemical space with RetroPath2.0. J Cheminform 9, 64 (2017). https://doi.org/10.1186/s13321-017-0252-9

Download citation

Received: 03 July 2017
Accepted: 08 December 2017
Published: 19 December 2017
DOI: https://doi.org/10.1186/s13321-017-0252-9

Molecular structures enumeration and virtual screening in the chemical space with RetroPath2.0

Abstract

Background

Results

Conclusion

Background

Results and discussions

Isomer enumeration

Canonical augmentation

Isomer transformation

Claim

Lemma 1

Proof

Lemma 2

Proof

Lemma 3

Proof

Proof of the main claim

Corollary 1

Proof

Virtual screening in the chemical space

Search for molecules maximizing biological activities

Metabolome completion and metabolomics

Conclusions

Methods

Generating reaction rules

RetroPath2.0 core algorithm

References

Authors’ contributions

Acknowledgements

Competing interests

Consent for publication

Availability of data and materials

Ethics approval and consent to participate

Funding

Publisher’s Note

Author information

Authors and Affiliations

Corresponding author

Additional files

13321_2017_252_MOESM1_ESM.zip

13321_2017_252_MOESM2_ESM.docx

13321_2017_252_MOESM3_ESM.xlsx

13321_2017_252_MOESM4_ESM.xls

Additional file 5: Table S3. List of the 19 SMARTS rules that were used in this study.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Journal of Cheminformatics

Contact us