CompRet: a comprehensive recommendation framework for chemical synthesis planning with algorithmic enumeration

In computer-assisted synthesis planning (CASP) programs, providing as many chemical synthetic routes as possible is essential for considering optimal and alternative routes in a chemical reaction network. As the majority of CASP programs have been designed to provide one or a few optimal routes, it is likely that the desired one will not be included. To avoid this, an exact algorithm that lists possible synthetic routes within the chemical reaction network is required, alongside a recommendation of synthetic routes that meet specified criteria based on the chemist’s objectives. Herein, we propose a chemical-reaction-network-based synthetic route recommendation framework called “CompRet” with a mathematically guaranteed enumeration algorithm. In a preliminary experiment, CompRet was shown to successfully provide alternative routes for a known antihistaminic drug, cetirizine. CompRet is expected to promote desirable enumeration-based chemical synthesis searches and aid the development of an interactive CASP framework for chemists.

1 Definition of AND/OR tree and chemical reaction network.
In this study, we utilize a tree structure to represent a synthetic route as shown in Fig.1.
A directed graph G is a pair of a node set V and an edge set E ⊆ V × V. Let π be an alternating sequence of vertices and edges π = (v_1, e_1, v_2, e_2, ..., e_{k−1}, v_k) such that for each i ∈ {1, ..., k − 1}, e_i = (v_i, v_{i+1}). π is a path if π consists of distinct vertices. In particular, we call such π a path from v_1 to v_k. π is a cycle if v_1, ..., v_{k−1} are mutually distinct and v_k = v_1. G is connected if for any node pair u, v in G, there is a path from u to v or from v to u. G is acyclic if G has no cycle. A directed acyclic graph G is a directed tree if G satisfies the following conditions: (1) there is a unique special node r, called the root, in G such that for each node v ∈ V, there is a path from r to v, and (2) for each node v ∈ V, there is at most one node u ≠ v in V such that (u, v) ∈ E. We call such u the parent of v and we say v is a child of u. Note that u may have more than one child. A node is called a leaf if it has no child. A node is an interior node if it is not a leaf. We can see that, in a directed tree, a path from u to v, if it exists, is unique for any node pair u and v. The depth of a node v in a directed tree is the number of edges contained in the path from the root to v. A graph G′ = (V′, E′) is a subgraph of G if V′ ⊆ V and E′ ⊆ E ∩ (V′ × V′) hold.
In the following, we give some key notations in this paper. An AND/OR tree T satisfies the following conditions.
1. Each node is assigned either AND or OR such that AND and OR appear alternately in all the paths on T.
2. All the leaves are assigned OR.
3. Each node has a label whose value is either True, False, or unknown.
4. For each OR node n in T, if there is a child node of n whose label is True, then the label of n is True. Otherwise, the label of n is False.
5. For each AND node n in T, if the labels of all child nodes of n are True, then the label of n is True. Otherwise, the label of n is False.
Here, OR and AND nodes correspond to a molecule (target) and a reaction template, respectively. The root is typically an OR node because the root corresponds to the target compound. A terminal node in an AND/OR tree is defined as a leaf whose label is True or False. Note that, in an AND/OR tree, merging at an OR node is allowed, i.e., an OR node may have two or more parents, but there is no cycle.
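The labeling rules in conditions 4 and 5 can be sketched as a bottom-up evaluation. This is a minimal illustration only: the `Node` class, the `evaluate` function, and the toy molecules are hypothetical names that do not appear in the original, and an unknown label is represented by `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                       # "OR" (molecule) or "AND" (reaction template)
    label: Optional[bool] = None    # True, False, or None for "unknown"
    children: List["Node"] = field(default_factory=list)


def evaluate(node: Node) -> Optional[bool]:
    """Propagate labels bottom-up following conditions 4 and 5."""
    if not node.children:           # a leaf keeps the label it already has
        return node.label
    child_labels = [evaluate(child) for child in node.children]
    if node.kind == "OR":
        # OR node: True if at least one child is True, otherwise False
        node.label = any(lab is True for lab in child_labels)
    else:
        # AND node: True only if every child is True, otherwise False
        node.label = all(lab is True for lab in child_labels)
    return node.label


# Hypothetical toy tree: the target can be made by reaction r1 (needs A and B)
# or by reaction r2 (needs C). B is labeled False, A and C are labeled True.
r1 = Node("AND", children=[Node("OR", True), Node("OR", False)])
r2 = Node("AND", children=[Node("OR", True)])
target = Node("OR", children=[r1, r2])
result = evaluate(target)
```

In this toy tree, `r1` fails because one of its reactants is False, while `r2` succeeds, so the target is labeled True through `r2`.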
Next, we introduce the notion of a proof tree to represent a single synthetic route. A proof tree satisfies the following conditions.
1. The label of the root node is True.
2. An OR node has at most one child node.
3. Each AND node (i.e., reaction template) has molecules that are required for its corresponding reaction as its children.
The second condition indicates that, for each molecule in a proof tree, just one synthetic route exists to make it.
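As a rough illustration of the three proof-tree conditions, the check below represents a candidate route as a mapping from each synthesized molecule to the reactants of its single chosen reaction, so that condition 2 holds by construction. The function name `is_proof_tree`, the dictionary encoding, and the molecule names are assumptions made for this sketch, not part of the original.

```python
from typing import Dict, FrozenSet, Set, Tuple

Route = Dict[str, Tuple[str, ...]]  # molecule -> reactants of its one chosen reaction


def is_proof_tree(route: Route, target: str, stock: Set[str]) -> bool:
    """Return True if `route` proves `target` from molecules in `stock`.

    Assumption: molecules in `stock` play the role of terminal leaves
    labeled True; any other leaf counts as not proved.
    """

    def proved(mol: str, seen: FrozenSet[str]) -> bool:
        if mol in stock:                     # terminal leaf labeled True
            return True
        if mol not in route or mol in seen:  # no reaction chosen, or a cycle
            return False
        # Condition 3: every reactant of the chosen reaction must be proved.
        return all(proved(r, seen | {mol}) for r in route[mol])

    # Condition 1: the root (target) must evaluate to True.
    return proved(target, frozenset())


stock = {"B", "D"}
complete = {"T": ("A", "B"), "A": ("D",)}
incomplete = {"T": ("A", "B")}               # no reaction chosen for A

ok = is_proof_tree(complete, "T", stock)
not_ok = is_proof_tree(incomplete, "T", stock)
```

The second route is rejected because reactant A is neither in stock nor produced by a chosen reaction.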
A chemical reaction network of a given target molecule is represented as a graph structure of multiple synthetic routes (AND/OR trees) as shown in Fig. 2. Hence, the third condition implies that to make the label of an AND node n True, we need to pick all the required nodes of n from the chemical reaction network for a target molecule.
Definition 1.3. [Chemical reaction network] A chemical reaction network for a target molecule c is a directed acyclic graph that satisfies the following conditions.
In this paper, we propose an algorithm to construct a chemical reaction network for a given target molecule using the depth-first proof number search (DFPN) algorithm. We refer to this algorithm as Ex-DFPN (Extended DFPN). Pseudocode for Ex-DFPN is shown in Algorithms S1 and S2. We implemented the algorithm with reference to the DFPN implementation by Nagai [1,2]. To perform the DFPN-based search, we use a variant of the AND/OR tree with additional properties. In the tree, each node has the following values:
• Proof number (pn)
• Disproof number (dn)
• pn threshold (pnTh)
• dn threshold (dnTh)
and each edge has an edge cost (e).
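One possible way to hold these per-node and per-edge values is sketched below. The class names, the initial values of 1 for pn and dn, the infinite initial thresholds, and the unit edge cost are all assumptions borrowed from common proof-number-search conventions; the original does not state them.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchNode:
    kind: str                    # "OR" (molecule) or "AND" (reaction template)
    pn: float = 1                # proof number (assumed initial value)
    dn: float = 1                # disproof number (assumed initial value)
    pn_th: float = math.inf      # pn threshold (pnTh), assumed unbounded at start
    dn_th: float = math.inf      # dn threshold (dnTh), assumed unbounded at start
    edges: List["Edge"] = field(default_factory=list)


@dataclass
class Edge:
    child: SearchNode
    cost: float = 1.0            # edge cost e; a unit cost is an assumption


# Hypothetical usage: a target molecule with one applicable reaction template.
target = SearchNode(kind="OR")
template = SearchNode(kind="AND")
target.edges.append(Edge(child=template))
```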
First, we explain the basic procedure of DFPN, which is given as pseudocode in Algorithm S1. DFPN consists of three steps (Select, Expand, and Update), illustrated in Fig. S1 (a). In the Select step, DFPN selects a node to expand based on an evaluation function. We will explain the details of the evaluation function later. In the Expand step, DFPN expands the selected node by computing reactants using reaction templates. When DFPN visits an OR node for the first time, it checks whether the molecule is commercially available. If it is, the values of the evaluation functions of the node are set to 0 and DFPN returns to the parent node. DFPN also returns when the depth of the node reaches the depth threshold, which is a parameter that restricts the search space. In the Update step, DFPN updates the value of the evaluation function for the node.
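The two stopping rules of the Expand step (commercial availability and the depth threshold) can be illustrated with a plain depth-first recursion. This sketch is deliberately not DFPN: it has no proof numbers, thresholds, or Select and Update steps. The `templates` dictionary standing in for reaction-template application, the `stock` set, and the function name `solve` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Templates = Dict[str, List[Tuple[str, ...]]]  # product -> list of reactant sets


def solve(mol: str, templates: Templates, stock: Set[str],
          depth: int = 0, max_depth: int = 3) -> bool:
    """Return True if `mol` can be made from `stock` within `max_depth` steps."""
    if mol in stock:              # commercially available: proved, return to parent
        return True
    if depth >= max_depth:        # depth threshold reached: stop expanding
        return False
    # Expand: each applicable template is an AND node over its reactants.
    for reactants in templates.get(mol, []):
        if all(solve(r, templates, stock, depth + 1, max_depth) for r in reactants):
            return True
    return False


templates = {"T": [("A", "B")], "A": [("C",)]}
stock = {"B", "C"}

found = solve("T", templates, stock)                      # two steps are enough
too_shallow = solve("T", templates, stock, max_depth=1)   # A cannot be expanded
```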
We developed Ex-DFPN for constructing chemical reaction networks by adding some procedures that make the basic DFPN algorithm continue searching even after finding a solution. These procedures are implemented in the traceProofTree function, which is called in line 34 of Algorithm S1 and described in Algorithm S2. After finding a solution, Ex-DFPN performs the additional procedures illustrated in Fig. S1 (b) to continue searching. The basic idea of these procedures is to continue searching by assuming that the newly found solution does not exist. When Ex-DFPN finds a solution, there always exists a last found AND node or OR node. After determining this node in the trace step, Ex-DFPN updates the values of the evaluation functions for the nodes on the path from the root, assuming that the node does not exist. With these procedures, Ex-DFPN can construct a chemical reaction network for a given target molecule.
Figure S1: Illustration of (a) basic depth-first proof number search and (b) procedures to continue searching after finding a proof tree. Circles and squares represent OR and AND nodes, which denote a molecule and a reaction template, respectively.
We designed evaluation functions based on the proof number and the disproof number. A proof number denotes the number of nodes that must be proved to show that the node is True. A disproof number denotes the number of nodes that must be disproved to show that the node is False. Because the proof and disproof numbers of OR nodes and AND nodes are dual to each other, we can simply define them as below.
where n.φ denotes the proof number and n.δ denotes the disproof number for an OR node n, while the contrary holds for an AND node. In addition, we utilized tree height information. The tree height is calculated for each node and denotes the number of edges to the descendant leaf node. Using this evaluation, Ex-DFPN prefers synthetic routes with fewer steps.
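For orientation, the sketch below computes proof and disproof numbers under the classic proof-number-search formulation, without the edge costs and tree height used by Ex-DFPN. In that classic form, an OR node takes the minimum proof number and the sum of disproof numbers of its children, and an AND node does the opposite; with φ and δ swapped at AND nodes, both cases reduce to n.φ = min over children c of c.δ and n.δ = sum over children c of c.φ. The class and function names and the leaf initial values (1, 1) are assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PNNode:
    kind: str                       # "OR" or "AND"
    label: Optional[bool] = None    # True, False, or None for "unknown"
    children: List["PNNode"] = field(default_factory=list)


def numbers(node: PNNode) -> Tuple[float, float]:
    """Return (proof number, disproof number) in the classic formulation."""
    if not node.children:
        if node.label is True:
            return (0, math.inf)    # already proved, can never be disproved
        if node.label is False:
            return (math.inf, 0)    # already disproved, can never be proved
        return (1, 1)               # unknown leaf (assumed initial values)
    child = [numbers(c) for c in node.children]
    if node.kind == "OR":
        # Proving one child suffices; disproving needs all children.
        return (min(p for p, _ in child), sum(d for _, d in child))
    # AND node: proving needs all children; disproving one child suffices.
    return (sum(p for p, _ in child), min(d for _, d in child))


# Hypothetical tree: r1 needs an unknown molecule and a proved one,
# r2 needs two unknown molecules.
r1 = PNNode("AND", children=[PNNode("OR"), PNNode("OR", True)])
r2 = PNNode("AND", children=[PNNode("OR"), PNNode("OR")])
root = PNNode("OR", children=[r1, r2])
pn, dn = numbers(root)
```

Here `r1` has proof number 1 and `r2` has proof number 2, so a proof-number-guided search would try `r1` first.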
Algorithm S1 Retrosynthesis using depth-first proof number search
    n_c.∆ ← min(node.φ, δ_2 + 1)
    if node has not been expanded yet then
 3:     node.delta ← node.children.size
 4: end if
 5: if node has only a child node then
 6:     OrSearch(child node)
 7: n_c.δ ← min(node.φ, δ_2 + 1)
    if node.lastSelectNodeIndex != -1 then
 4:     next ← node.children.get(node.lastSelectNodeIndex)
 5:     sideRoute.add(next)
 6:     if node.getProofNumber() = 0 then

The procedure of our SRE algorithm is illustrated in Fig. 4, and the pseudocode is shown in Algorithm S3. Our SRE algorithm enumerates all synthetic routes by recursively partitioning the set of solutions into two disjoint subsets. First, the algorithm picks the root node, which is node 1 in Fig. 4, and divides the set of solutions into two disjoint sets: one consists of the solutions containing node 1, and the other consists of the remaining solutions. Next, the algorithm enumerates all the solutions containing node 1 by focusing on node 3 and dividing the set into two disjoint sets again, i.e., one consists of the solutions containing both nodes 1 and 3, and the other consists of the solutions containing node 1 but not node 3. The SRE algorithm repeats this procedure recursively for all nodes. When the SRE algorithm terminates, it is ensured that all synthetic routes have been enumerated; the proof of this guarantee is given in the next section. Note that when the algorithm enumerates solutions containing node 1 but not node 5, node 6 must be contained in all solutions, and thus the algorithm can skip the case of containing node 1 and containing neither node 5 nor node 6. This pruning contributes to the efficiency of the algorithm.
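The include/exclude partitioning idea can be sketched on a toy reaction network as follows. This is an illustrative simplification, not Algorithm S3: the network is a dictionary from each molecule to its candidate reactions (AND nodes), it is assumed to be acyclic, and the pruning described above is not implemented. All names (`partition`, `enumerate_routes`, `banned`, the molecules) are hypothetical.

```python
from typing import Dict, FrozenSet, Iterator, List, Set, Tuple

Reactions = Dict[str, List[Tuple[str, ...]]]   # molecule -> candidate reactant sets
Route = Dict[str, Tuple[str, ...]]             # molecule -> chosen reactant set


def partition(reactions: Reactions, stock: Set[str], choice: Route,
              open_mols: List[str],
              banned: FrozenSet[Tuple[str, int]]) -> Iterator[Route]:
    """Yield every complete route that extends `choice`, each exactly once."""
    if not open_mols:                  # no unresolved molecule left: a proof tree
        yield dict(choice)
        return
    n = open_mols[0]                   # OR node to resolve next
    candidates = [i for i in range(len(reactions.get(n, [])))
                  if (n, i) not in banned]
    if not candidates:                 # n cannot be made any more: dead end
        return
    p = candidates[0]                  # AND node used to split the solution set
    reactants = reactions[n][p]

    # Subset 1: routes in which n is made by reaction p.
    new_choice = {**choice, n: reactants}
    new_open = open_mols[1:] + [
        r for r in dict.fromkeys(reactants)
        if r not in stock and r not in new_choice and r not in open_mols
    ]
    yield from partition(reactions, stock, new_choice, new_open, banned)

    # Subset 2: routes in which n is NOT made by reaction p.
    yield from partition(reactions, stock, choice, open_mols, banned | {(n, p)})


def enumerate_routes(reactions: Reactions, stock: Set[str], target: str) -> List[Route]:
    start = [] if target in stock else [target]
    return list(partition(reactions, stock, {}, start, frozenset()))


reactions = {"T": [("A", "B"), ("C",)], "A": [("D",)], "C": [("D",), ("E",)]}
stock = {"B", "D", "E"}
routes = enumerate_routes(reactions, stock, "T")
```

Because the two subsets produced at each split differ in whether reaction `p` is used for molecule `n`, no route can appear in both, which mirrors the disjointness argument used for the SRE algorithm.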
    if neighborsToSearch is empty then
    partition(g, n)
41: end function

In this section, we prove that the SRE algorithm can list all synthetic routes in a given chemical reaction network without duplication. To prove this, we need to show the following: the output does not include invalid synthetic routes or any duplication, and the output of the algorithm includes all the synthetic routes (proof trees) in a chemical reaction network.
We call the former property the soundness and the latter property the completeness of the algorithm.
To show the soundness, we first prove that no duplication appears in the output of the algorithm.
Proof. Let R_1 = partition(g_1, n_1) and R_2 = partition(g_2, n_2) be two different recursions. It is enough to show that these two recursions output different solutions. We consider the following cases:
R_1 is an ancestor or a descendant recursion of R_2: Without loss of generality, we can assume that R_1 is an ancestor of R_2. The output of R_1 includes p by line 26 of Algorithm S3, while g_2 of R_2, which is called under the recursion in line 40, never includes p. By this observation, the output of R_2 never includes p. Therefore, the outputs of R_1 and R_2 are different from each other.
Otherwise: Let R_j be a common ancestor of R_1 and R_2. Such R_j always exists because R_1 and R_2 are not the root recursion. In lines 38 and 40 of Algorithm S3, R_j generates two child recursions whose input graphs g′ and g″ are different. Therefore, the outputs of R_1 and R_2 are different.
Hence, different recursions always output different proof trees.
We then show the soundness of the SRE algorithm. What remains to be shown is that the output contains only correct solutions. Because n in Algorithm S3 is an OR node, p is an AND node. From lines 26-31 in Algorithm S3, all interior OR nodes in g have exactly one child AND node, while AND nodes have all their child OR nodes. Once an AND node is added as a child node of an OR node, the OR node is no longer a leaf node. Therefore, no further AND node is added as a child node of that OR node.
From line 33, all leaf nodes in an output tree are terminal nodes, and thus, from the property of an AND/OR tree, the label of the root node is also True.
Finally, we prove that Algorithm S3 outputs all the solutions.
Proof. If the root is a terminal node, then by line 15, the solution is output. Hence, in what follows, assume that the root is not a terminal node.
We say that Q is a partial proof tree of T , abbreviated as a ppt of T , if Q is a connected subgraph of T such that all the leaves in Q are assigned OR and the root of Q is the root of T . Let s(Q) be the number of AND nodes in Q.
We prove, by induction on s(·), the following statement: For any ppt Q with k AND nodes, Algorithm S3 makes a recursive call (Q′, n) such that Q′ is a ppt of Q, Q′ has k − 1 AND nodes, and n has an AND neighbor in Q \ Q′. If this statement holds, then we see that for any ppt Q, there is a recursive call in which g = Q after line 31. In addition, clearly, a proof tree is also a ppt. A proof tree has no leaf that is not a terminal node. Hence, because of line 34 and because Algorithm S3 makes the recursive call for an empty ppt in line 11, all proof trees are output.
Clearly, when the size is one, that is, for a ppt for the root node, the statement holds by line 11. Next we assume that for any ppt Q with k AND nodes, the hypothesis holds. Let S be a ppt such that s(S) = k + 1 and Q is a ppt of S.
From the assumption, there is a ppt Q′ of Q such that Algorithm S3 makes a recursive call C which receives Q′ and an OR node n satisfying s(Q′) = k − 1 and n has an AND neighbor q in Q \ Q′. In the remainder, we show that there is a recursive call which receives Q and a desired node n′.
If q is not in P, then clearly, there is some descendant recursive call C′ that satisfies the hypothesis. Suppose that q is in P. This implies that there is an ancestor recursive call C″ of C such that during the execution of C″, q is added to P. Note that n is the unique parent of q from the definition of input graphs. Hence, from line 17, C″ receives n as an input and C″ makes a recursive call C′ which receives Q.
Let n′ be the parent of p′, where p′ is an AND node in S \ Q. Because Q is a ppt of S, such n′ and p′ always exist. Note that S must contain n′. If n′ is not in Q, then there is no AND node p* such that (p*, n′) ∈ Q. Because n′ is in S, this contradicts that p′ is the only AND node in S \ Q. Suppose that n′ is not a leaf in Q. However, this creates a contradiction because n′ is a leaf in S − p′. Hence n′ is a leaf in Q. Because (n′, p′) is the unique incoming edge to p′, by a similar argument for p′, there is a descendant recursive call of C receiving Q and n′ such that p′ is not in P. Hence, the statement holds.

Table S1. For the REF score, the distribution shows that routes selected based on REF in Figure 8 are better than other synthetic routes. For the MSCS score, the distribution shows that routes selected based on MSCS in Figure 8 are better than the mean value.