Representations of synthesis plans
In a synthesis plan, an intermediate molecule may be used as input in several reactions. Our starting point is the observation that in such cases, this intermediate molecule clearly should only be synthesized in one way: given two different ways to synthesize a given intermediate molecule, one will have the smallest asymptotic cost, and even if the two have equal costs, using both induces extra overhead.
This means that if an intermediate molecule appears more than once as a vertex in a unary-binary tree representation of a synthesis plan, its subtrees should be identical. To illustrate, in the tree in Fig. 2a, molecule C is synthesized in two different ways. However, either the tree Fig. 2b or the tree Fig. 2c must be cheaper, depending on the costs of the two ways of synthesizing C. In short, while synthesis plans can be represented by unary-binary trees with nodes labeled by molecules, not all such unary-binary trees are realistic synthesis plans.
As a consequence, we believe synthesis plans are better described as directed acyclic graphs (DAGs) in which each vertex has a unique vertex label, each vertex has out-degree zero, one or two (depending on whether it represents a starting material, a product of a cyclization, or a product of an affixation, respectively), there is exactly one vertex t with in-degree zero (representing the target molecule), and there is a path from t to any other vertex in the DAG. Such a DAG can be obtained from the unary-binary tree structure by merging vertices with the same label. Vice versa, a tree structure can be obtained from the DAG by a depth-first search from the root of the DAG, in a version which allows revisits to vertices. In Fig. 3b, the DAG arising from the unary-binary tree in Fig. 3a is shown.
In the DAG (as well as in unary-binary trees), labels of vertices are molecules. For this labeling to be chemically meaningful, the labels cannot be arbitrary, but should reflect the reactions involved. We now formalize this, using graph models of molecules as labels.
If edge (v, u) is in the DAG, we denote vertex u a child of v. A vertex with out-degree zero we denote a leaf. A molecule is an undirected, connected, labeled graph, where labels are atom types. A building block of a DAG is a non-leaf vertex v together with every child u of v and its corresponding edge (v, u). In a DAG with nodes labeled by molecules, a building block is a reaction if it satisfies the following: (1) v has one or two children. (2) If v has one child u, the label of v is obtained from the label of u by adding exactly one edge. This is a cyclization. (3) If v has two children \(u_1\) and \(u_2\), then the label of v is obtained from the labels of \(u_1\) and \(u_2\) by adding exactly one edge connecting these labels. This is an affixation. A starting material is a molecule that can be acquired without the need to plan how to synthesize it.
Definition 1
A synthesis plan for t is a labeled DAG with the following properties:
-
1.
Vertex labels are molecules, and each label is unique.
-
2.
Each building block is a reaction.
-
3.
There is exactly one vertex with in-degree zero, namely t.
-
4.
The label of each leaf is a starting material.
Our next observation is that such DAG models of synthesis plans can be represented as directed hypergraphs. A hypergraph differs from a standard graph in that edges connect sets of vertices rather than single vertices. More precisely, a hyperarc is an ordered pair of vertex (multi)sets. In chemistry, a reaction is a multiset of reactants and a multiset of products, and can therefore be seen as a hyperarc. As we only model construction reactions, all heads are singletons in this paper. In Fig. 3c, the reactions of Fig. 3b are depicted as hyperarcs.
When we say that a hyperarc represents a chemical reaction, we mean that its vertices are labeled with molecules, and that the label of the head is obtained from the labels of the tail under the reaction in question, analogously to the definition of a reaction in a DAG given above.
Hypergraphs are well suited for modeling chemistry, because they make the relationship between all molecules involved in a reaction explicit. This is in contrast to a DAG or a tree, where the two reactants in a reaction are only indirectly related via their common product.
In a hypergraph representation of the synthesis plan, adding a dummy source vertex s, together with a directed hyperarc \((\{s\}, \{i\})\) to each starting material i of the synthesis plan (as shown in Fig. 3c), gives a hypergraph which is in fact a hyperpath. This allows us to think of a synthesis plan as a form of path, hence to think of optimal synthesis plans as optimal paths. Given a larger hypergraph modeling the reactions in a part of chemistry under consideration, the search for optimal synthesis plans within this chemistry then becomes a search for optimal paths in the hypergraph.
Below, we recap basic hypergraph terminology, and then show how to use hypergraphs for finding synthesis plans.
Hypergraphs
A directed hypergraph is a set V of vertices and a set E of hyperarcs, where each hyperarc \(e=(T(e), H(e))\) is an ordered pair of non-empty multisets of vertices. The set T(e) is denoted the tail of the hyperarc and H(e)) the head. If \(|H(e)| = 1\), the hyperarc is denoted a B-hyperarc, and the notation for the single head vertex is h(e). A hypergraph with only B-hyperarcs is denoted a B-hypergraph. In this paper, we only consider B-hypergraphs. The size of a B-hypergraph is given by \({{\mathrm{size}}}(H)=\sum _{e\in E}(|T(e)| + 1)\). A hypergraph \(H'=(V',E')\) is a subhypergraph of \(H=(V,E)\) if \(V'\subseteq V\) and \(E'\subseteq E\).
A (plain) path \(P_{st}\) from s to t in a B-hypergraph is a sequence \(P_{st}=\langle s, e_1, v_1, e_2, v_2, \dots , v_{q-1}, e_q,t \rangle\) of vertices and B-hyperarcs such that \(s \in T(e_1)\), \(t=h(e_q)\) and \(v_i=h(e_i) \in T(e_{i+1})\) for \(i=1..q-1\). Its length \(|P_{st}|\) is the number q of hyperarcs. If \(t\in T(e_1)\), then \(P_{st}\) is a cycle. A hypergraph is acyclic if it does not contain any cycles.
The above concept of paths is only used for defining cycles. The proper generalization of directed paths to hypergraphs is that of hyperpaths. There are different ways of defining this—we use a variation based on [24]. For a general overview see [25]. Examples of what constitutes a hyperpath and what does not are given in “Appendix 1: Hyperpaths”.
Definition 2
A hyperpath \(\pi _{st}=(V_\pi , E_\pi )\) from a source vertex s to a target vertex t in a B-hypergraph H is a subhypergraph of H with the following properties: If \(t=s\), then \(V_\pi = \{s\}\) and \(E_\pi = \emptyset\). Otherwise, \(E_\pi\) can be ordered in a sequence \(\langle e_1, e_2, \ldots , e_q \rangle\) such that
-
(1)
\(T(e_i)\subseteq \{s\} \cup \{h(e_1), h(e_2), \ldots , h(e_{i-1})\}\) for all i
-
(2)
\(t=h(e_q)\)
-
(3)
Every \(v \in V_\pi {\setminus } \{ t \}\) has at least one outgoing hyperarc in \(E_\pi\), and t has zero.
-
(4)
Every \(v \in V_\pi {\setminus } \{s\}\) has exactly one ingoing hyperarc in \(E_\pi\), and s has zero.
Note that Definition 2(4) gives a 1–1 correspondence between \(E_\pi\) and \(V_\pi {\setminus } \{s\}\), hence we can define unique indices for the vertices in \(V_\pi {\setminus } \{s\}\) by \(v_i = h(e_i)\). We let \(v_0\) be s. The hyperarc \(e_i\) is called the predecessor hyperarc of \(v_i\) and the corresponding map \(p: V_\pi {\setminus } \{s\} \mapsto E_\pi\) is called the predecessor function of \(\pi _{st}\) [20]. We use the notation \(\pi _{st} = \langle p(v_1), p(v_2),\ldots ,p(v_{q-1}), p(t) \rangle\) for hyperpaths from now on. If a subhypergraph of a hyperpath \(\pi _{st}\) is a hyperpath itself, it is called a subhyperpath of \(\pi _{st}\). We will later need the following fact.
Lemma 1
Let \(\pi _{st} = \langle p(v_1), p(v_2),\ldots ,p(v_{q-1}), p(t) \rangle\) be a hyperpath from s to t. Then for any \(v_i\), there is a unique subhyperpath \(\pi _{sv_i}\) of \(\pi _{st}\) from s to \(v_i\).
Proof
By Definition 2(2) and (4), any subhyperpath from s to \(v_i\) must contain the set \(\pi _{sv_i}\) of hyperarcs returned by the procedure Backtrack listed in Algorithm 1. It is easy to verify that \(\pi _{sv_i}\) fulfills the requirements Definition 2(1)–(4). For uniqueness, let \(\pi '_{sv_i}\) be a hyperpath from s to \(v_i\), let \(E'\) be the hyperarcs of \(\pi '_{sv_i}\) not in \(\pi _{sv_i}\), and let \(e_j\) be the hyperarc of \(E'\) with highest index j. By Definition 2(3), \(v_j\) must have an outgoing hyperarc e. By Definition 2(1) and the maximality of j, we cannot have \(e \in E'\), and by the marking strategy of the algorithm, it cannot be a hyperarc of \(\pi _{sv_i}\). Hence, \(E'\) is empty and \(\pi _{sv_i} = \pi '_{sv_i}\). \(\square\)
Finding synthesis plans via hypergraphs
The overall goal of this paper is to find synthesis plans within a given chemistry. We assume the chemistry is described by a (possibly large) set of construction reactions, i.e., affixations and cyclizations. Above, we described how to model synthesis plans as hypergraphs. In this section, we show how to view the set of reactions of the given chemistry as a single, large hypergraph, and how this in turn will allow us to find synthesis plans within the chemistry simply by looking for hyperpaths in this hypergraph.
Let R be a set of reactions, and let S be a set of starting materials. We define the hypergraph, called the Hypergraph of Reactions (\({{\mathrm{HoR}}}\)) induced by R and S, as follows.
Definition 3
Let \(E_R\) be the representation of R as a set of hyperarcs. Let \(V_R\) be the set of vertices appearing in the heads and tails of the hyperarcs in \(E_R\). Vertices with the same label (i.e., representing the same molecule) are considered identical. Let \(V_S\) be the set of vertices with labels in S. Then, the hypergraph of reactions (\({{\mathrm{HoR}}}\)) is the hypergraph
$$H=(V_S \cup V_R\cup \{s\}, E_R\cup E_s),$$
where s is a dummy vertex and \(E_s = \{ (\{s\},\{v\}) \mid v \in V_S \}\) is a set of dummy hyperarcs.
An example \({{\mathrm{HoR}}}\) with R and S being the combined sets of reactions and starting materials from the three synthesis plans for decalin in Fig. 1a is depicted in Fig. 4.
The crux of our paper is captured by the following theorem.
Theorem 1
Let
\(H=(V,E)\)
be a HoR induced by reactions R and starting materials S. Let v be a vertex from
\(V_R\)
and let t be its label. Then there is a 1-1 correspondence between (a) the synthesis plans for t with reactions from R and starting materials from S and (b) the hyperpaths in H from s to v.
Proof
Let \(\sigma\) be a synthesis plan from (a). To map it to a hyperpath in (b), do as follows. Add a dummy vertex s and a dummy edge from each leaf of \(\sigma\)–s. This result is still a DAG, hence admits a topological ordering of its vertices, i.e., a linear order on its vertices such that all edges point in the same direction. Convert each reaction and dummy edge of the DAG into a hyperarc in the natural way, cf. Fig. 3, reversing the direction. Let \(E_{\pi }\) be the resulting hyperarcs, and let \(V_{\pi }\) be the vertices of the DAG. Then \(\pi = (V_{\pi }, E_{\pi })\) is a hyperpath: Definition 2(4) follows because a building block of a DAG vertex contains all the outgoing DAG edges of the vertex. Due to Definition 2(4), the topological ordering of the DAG vertices induces an ordering of \(E_{\pi }\) fulfilling Definition 2(1). Definition 1(iii) induces Definition 2(3) and, combined with the topological ordering, also Definition 2(2).
Conversely, let \(\pi\) be a hyperpath from (b). To map it to a synthesis plan in (a), do as follows: From \(\pi\) remove s and its outgoing dummy hyperarcs, and convert every hyperarc to a building block in the natural way, reversing the direction. The result is a synthesis plan: Definition 1(i) and (ii) follow by what it means for a hyperarc to represent a reaction and by the uniqueness of vertex labels in Definition 3. Definition 1(i) follows from Definition 2(3). Definition 1(iv) follows from Definition 2(4) and the fact that only dummy hyperarcs have been removed, each of which by definition points to a starting material.
Clearly, the two mappings are each other’s inverses.\(\square\)
As a consequence of Theorem 1, algorithms computing hyperpaths in hypergraphs are also algorithms computing synthesis plans. In particular, we claim that an algorithm by Nielsen et al. [20] for finding the K shortest hyperpaths in a B-hypergraph can be used to find the K best synthesis plans for a target molecules t, given a set of reactions and a set of starting materials. We now verify the details of this claim.
The algorithm by Nielsen et al. is based on Yens classic algorithm [26] for shortest paths in directed (standard) graphs. The overall idea is to find the single shortest hyperpath and then recursively consider all ways in which a hyperpath can deviate from the shortest hyperpaths found so far, using single shortest hyperpath computations as a subroutine. For an acyclic B-hypergraph \(H=(V,E)\), the single shortest hyperpath can be computed using dynamic programming [9] in \(O(|V|+{{\mathrm{size}}}(H))\) time, leading to a runtime for the algorithm by Nielsen et al. of \(O(K|V|(|V|+{{\mathrm{size}}}(H)))\) on such hypergraphs. The requirement of acyclicity can be lifted (at the expense of a slight increase in runtime) by using different algorithms for finding single shortest hyperpaths [27], whereas the algorithm by Nielsen et al. assumes the hypergraph to be a B-hypergraph.
Any \({{\mathrm{HoR}}}\) H is acyclic and satisfies \({{\mathrm{size}}}(H)\le 3|E|\). The former is because every reaction has a strict increase from reactants to product in the number of edges in the molecule labels, and the latter is because every hyperarc e in a \({{\mathrm{HoR}}}\) is a B-hyperarc with \(|T(e)| \le 2\).
Finally, the algorithm by Nielsen et al. requires the lengths of hyperpaths to be given by what is called an additive weight function. We will later demonstrate that the very generic quality measure \({{\mathrm{TW}}}\) for synthesis plans can be expressed in this form.
These properties combined with Theorem 1 give the following result.
Theorem 2
Given a HoR \(H = (V,E)\) and a target \(t \in V\), the K best synthesis plans for t in H, ranked according to the measure TW, can be found in time \(O(K|V|(|V|+|E|))\).
A detailed exposition of the algorithm by Nielsen et al. is given in “Appendix 2: The K best plans algorithm”.
The sets R and S of reactions and starting materials in the \({{\mathrm{HoR}}}\) can arise from many sources. As in Fig. 4, a set of known synthesis plans for a target t can be combined to a \({{\mathrm{HoR}}}\) which could then contain further, unknown plans. Another approach could be to generate the reactions of the \({{\mathrm{HoR}}}\) by recursively breaking the bonds of the target t in all possible ways. This could be all bonds of t (possibly stopping the recursion when a specified minimum size of intermediate molecules is met), or it could be a subset of bonds (i.e., a bond set) selected by methods from the classic retrosynthetic approach [1]. In “Appendix 3: Bond set based HoR construction”, we provide the algorithmic details of efficiently breaking a bond set in all possible ways. More generally, any database of reactions and starting materials describing a chemistry under consideration can be used as R and S. These databases can be based on published literature and patents, such as Reaxys [28] and SciFinder [29]. Recent developments for retrosynthetic reaction prediction [30,31,32] even allow for the inference of unknown reactions. As these automated prediction methods have already proven to be highly accurate, finding the K shortest hyperpaths also has potential for discovering novel synthesis plans employing such predicted reactions.
Quality measures
The algorithm of Nielsen et al. requires the lengths of hyperpaths to be given by an additive weight function. In this section, we investigate whether existing quality measures for synthesis plans can be expressed in this form, and we show that the very generic measure \({{\mathrm{TW}}}\) indeed can. On the other hand, this turns out to not be the case for the classic measure \({{\mathrm{EPL}}}\). The reasons are inconsistencies that we demonstrate are inherent in the measure. This is a bit surprising in light of its use in previous work, but the implication seems to be that the \({{\mathrm{EPL}}}\) measure should be used with caution.
An additive weight function W assigns weights to hyperpaths in an inductive manner. For our purposes, we only need a special case, often denoted a value function (for a more general definition of additive weight functions, see [27]): For each hyperarc e and each vertex v in its tail T(e), let \(a_{v,e}\) be a non-negative real number. Then for a hyperpath \(\pi _{st}\) from s to t, the weight \(W(\pi _{st})\) is one if \(t = s\), and is otherwise given recursively by
$$\begin{aligned} W(\pi _{st}) = \sum \limits _{v\in T(p(t))} a_{v,p(t)} W(\pi _{sv}), \end{aligned}$$
(2)
where the \(\pi _{sv}\)’s are the subhyperpaths from s to the vertices v in the tail of last hyperarc p(t) of \(\pi _{st}\). These subhyperpaths exist and are unique by Lemma 1. They also contain strictly fewer hyperarcs, so the recursion stops eventually. Hence, W is well-defined.
Total weight of starting materials We defined the quality measure total weight of starting materials (\({{\mathrm{TW}}}\)) in “Synthesis planning basics” section, expressed in unary-binary trees. Recall that \({{\mathrm{TW}}}\) expresses how much starting material is needed to produce one gram of target molecule, taking yields of reactions into account. In the hyperpath setting, the definition becomes the following. For a reaction e with a reactant v, the retro yield \(r_{v,e}\) is the amount of v in grams needed in reaction e to create one gram of the product h(e). Thus, \(r_{v,e}\ge 0\), and by mass conservation \(\sum _{v}{r_{v,e}}\ge 1\) for any reaction e. Figure 5 shows the HoR from Fig. 4 decorated with example retro yields.
Each (plain) st-path \(P_{st}=\langle s,e_1,v_1,e_2,v_2,\ldots ,e_{|P_{st}|}, t\rangle\) contained in the hyperpath of the synthesis plan induces a use of starting material \(v_1\) given by the product of the retro yields along the path. The product along a path is \(\prod _{i=1}^{|P_{st}|}{r_{v_{i-1},e_{i}}}\), where we for each hyperarc e from the dummy vertex s to a starting material define \(r_{s,e} = 1\). Thus, the total weight of starting materials needed for a synthesis plan represented by a hyperpath \(\pi _{st}\) is
$$\begin{aligned} {{\mathrm{TW}}}(\pi _{st}) = \sum _{P_{st} \,\text {in}\, \pi _{st}}{\prod _{i=1}^{|P_{st}|}{r_{v_{i-1},e_{i}}}} \,. \end{aligned}$$
(3)
This can be rephrased inductively as follows
$$\begin{aligned} {{\mathrm{TW}}}(\pi _{st}) = \left\{ \begin{array}{ll} 1 &{}\quad \text{ if } \;t=s \\ \sum \limits _{v\in T(p(t))}{r_{v,p(t)}{{\mathrm{TW}}}(\pi _{sv})} &{}\quad \text{ otherwise } \end{array} \right. \end{aligned}$$
(4)
This is most easily seen in the traditional tree notation for synthesis plans, cf. Fig. 3. Thus, \({{\mathrm{TW}}}\) can indeed be expressed as an additive weight function, cf. Eq. (2). The measure \({{\mathrm{TW}}}\) is actually very generic in nature [9]. For example, it can easily be adjusted to calculate the total price of the starting materials if a price per gram \(p_v\) is known for each starting material v, simply by setting \(r_{s,e} = p_v\) for the hyperarc e from s to v. It can also incorporate non-chemical expenses of reactions, such as cost of energy usage or cost of disposal of waste products, simply by adding s to the tail of any hyperarc e representing a reaction and setting \(r_{s,e}\) to the non-chemical expense per gram product produced in e. Even more generally, we note that any measure which can be described as an additive weight function [27] is compatible with our method.
External Path Length Recall from “Synthesis planning basics” section that the quality measure external path length (\({{\mathrm{EPL}}}\)) [8] is the sum of the lengths of all paths from the root to the leaves. When trying (and failing) to express \({{\mathrm{EPL}}}\) as an additive weight function, we discovered a certain peculiarity inherent in the measure: the optimal synthesis plan for a molecule depends on how the molecule is later used. More precisely, what constitutes the best (sub-)synthesis plan for an intermediate molecule inside a larger synthesis plan depends on where in the large plan the molecule is used. This does not correspond to physical reality, since different instances of a molecule are not distinguished after creation. As a consequence, in an optimal plan (w.r.t. \({{\mathrm{EPL}}}\)) where the same intermediate molecule appears twice, the plan could ask for different subplans for it. As we argued at the start of the “Results” section, such a plan would never be used in practice.
We now demonstrate the above inconsistency by an example, expressed using unary-binary trees (the model in which \({{\mathrm{EPL}}}\) was originally defined [8]).
Consider a (hypothetical) molecule M, and assume it can be synthesized using two different bond sets \(M'\) and \(M''\), as depicted in the top row of Fig. 6. \(M'\) admits only linear synthesis plans, whereas \(M''\) has many, including a fully convergent one. These synthesis plans are depicted in the bottom row of Fig. 6.
For synthesizing M by itself, the measure \({{\mathrm{EPL}}}\) is minimized by the fully convergent plan for the bond set \(M''\). However, consider the molecule T depicted in Fig. 7a, with the synthesis plan depicted in Fig. 7b. In this synthesis plan, the molecule M appears twice, with different sub-synthesis plans. According to \({{\mathrm{EPL}}}\) these different synthesis plans for M are indeed the optimal choices at these two positions in the remaining plan, as the reader can readily verify. For instance, the two alternatives shown in Fig. 8a, b have \({{\mathrm{EPL}}}=98\) and \({{\mathrm{EPL}}}=97\), while that of Fig. 7b has \({{\mathrm{EPL}}}=96\). In other words, according to \({{\mathrm{EPL}}}\), the position of M in the large plan determines how it should be made. We note that already in [8], Hendrickson expressed some reservations about the reliability of the measure as a cost function, but in less explicit terms than the phenomenon demonstrated above.