Conditional variational autoencoder with Stack augmented RNN (Stack-CVAE)
We designed a novel generative model, Stack-CVAE, that combines a stack-augmented RNN (Stack-RNN) with a conditional variational autoencoder (CVAE) to generate chemical structures expressed in SMILES. Stack-CVAE is based on CVAE, which produces substances similar to, but not identical to, the substances used for training. The objective function of CVAE is as follows:
$$E_{z\sim Q\left(z|X,c\right)}\left[\mathrm{log}\,P\left(X|z,c\right)\right]-{D}_{KL}\left[Q\left(z|X,c\right)\Vert P\left(z|c\right)\right]$$
(1)
In formula (1), \(Q(z|X,c)\) is the approximate posterior modeled by the encoder and \(P(X|z,c)\) is the likelihood modeled by the decoder, while \(P(z|c)\) is the conditional prior over the latent variable. The term \({D}_{KL}\) is the Kullback–Leibler divergence, and \(X\) and \(z\) denote the input data and the latent variable, respectively. The term \(c\) is a condition vector that is supplied to both encoding and decoding.
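As a minimal sketch of formula (1), the training objective can be written as a token reconstruction loss plus a KL term; this assumes a PyTorch-style model whose encoder outputs `mu` and `logvar` and whose prior \(P(z|c)\) is a standard normal, which is an illustrative simplification rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def cvae_loss(recon_logits, target_tokens, mu, logvar):
    """Negative of formula (1): token reconstruction loss plus KL divergence.

    recon_logits:  (batch, seq_len, vocab) decoder outputs for P(X|z, c)
    target_tokens: (batch, seq_len) token indices of the input SMILES X
    mu, logvar:    parameters of Q(z|X, c); the prior P(z|c) is taken to be
                   a standard normal here for simplicity (an assumption).
    """
    recon = F.cross_entropy(
        recon_logits.reshape(-1, recon_logits.size(-1)),
        target_tokens.reshape(-1),
        reduction="sum",
    )
    # Closed-form KL divergence between N(mu, exp(logvar)) and N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```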
The biggest difference between Stack-CVAE and CVAE is that CVAE uses a regular RNN, LSTM, or GRU, whereas Stack-CVAE uses a stack-augmented RNN (Stack-RNN). Stack-RNN is a recurrent network augmented with structured, growing memory and supports three stack operations: POP removes the top element of the stack, PUSH adds a new element to the top of the stack, and NO-OP leaves the stack unchanged. One of the three operations is selected at each time step by a three-dimensional variable \({a}_{t}\), which is calculated from the hidden state \({h}_{t}\) as in formula (2).
$${a}_{t}=f\left({Ah}_{t}\right)$$
(2)
In formula (2), \(A\) is a \(3\times m\) matrix (\(m\) is the size of the hidden layer) and \(f\) is the softmax function. We denote by \({a}_{t}\left[PUSH\right]\), \({a}_{t}\left[POP\right]\), and \({a}_{t}\left[NO-OP\right]\) the probabilities of the PUSH, POP, and NO-OP operations, respectively. The stack is stored in a vector \({s}_{t}\) of size \(p\) at time \(t\); \(p\) is not fixed, in order to increase the capacity of the model. The top element is stored at position 0 with value \({s}_{t}\left[0\right]\), and the PUSH operation adds a new element at position 0 as in formula (3).
$${s}_{t}\left[0\right]={a}_{t}\left[PUSH\right]\sigma \left({Dh}_{t}\right)+{a}_{t}\left[POP\right]{s}_{t-1}\left[1\right]+{a}_{t}\left[NO-OP\right]{s}_{t-1}\left[0\right]$$
(3)
where \(D\) is a \(1\times m\) matrix. If \({a}_{t}\left[POP\right]\) is equal to 1, the top element is popped and all other elements move up one position. If \({a}_{t}\left[PUSH\right]\) is 1, all elements move down one position and the new element is added to the top of the stack. If \({a}_{t}\left[NO-OP\right]\) is 1, the stack is unchanged. Similarly, for elements stored at depth \(i>0\), the values of the previous stack are carried over to the current stack as in formula (4).
$${s}_{t}\left[i\right]={a}_{t}\left[PUSH\right]{s}_{t-1}\left[i-1\right]+{a}_{t}\left[POP\right]{s}_{t-1}\left[i+1\right]+{a}_{t}\left[NO-OP\right]{s}_{t-1}\left[i\right],\quad i>0$$
(4)
An element of the stack is propagated to the next hidden state, which is calculated by formula (5).
$${h}_{t}=\sigma \left(U{x}_{t}+R{h}_{t-1}+P{s}_{t-1}^{k}\right)$$
(5)
In formula (5), \(P\) is an \(m\times k\) matrix, \({s}_{t-1}^{k}\) is the top element of the stack at time point \(t-1\), and \(k\) is the size of a stack element. Because Stack-RNN is an RNN whose cell is coupled to a stack structure, it is better able to learn longer and more complex sequences. The stack memory increases the probability of generating valid SMILES, because an RNN without a stack structure cannot learn ring closures or matched brackets. Furthermore, an RNN without a stack structure tends to generate SMILES that are similar to the training SMILES [16]. Because Stack-CVAE uses a Stack-RNN as its encoder, its proportion of valid and unique SMILES can be higher than that of a CVAE whose encoder is an RNN, GRU, or LSTM. We used a GRU as the decoder. The model is depicted in Fig. 1.
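A minimal sketch of one Stack-RNN time step, covering formulas (2) through (5), is given below. It treats stack elements as scalars and feeds the top \(k\) of them back to the hidden layer, which is consistent with the stated shapes of \(D\) (\(1\times m\)) and \(P\) (\(m\times k\)); the actual stack-width/stack-depth implementation may differ, and all tensor names are illustrative:

```python
import torch

def stack_rnn_step(x_t, h_prev, s_prev, U, R, P, A, D, k):
    """One Stack-RNN step (formulas 2-5) with scalar stack elements.

    x_t: (input_dim,)  h_prev: (m,)  s_prev: (depth,) stack, top at index 0
    U: (m, input_dim)  R: (m, m)  P: (m, k)  A: (3, m)  D: (1, m)
    """
    # Formula (5): hidden state from the input, the previous hidden state,
    # and the top elements of the previous stack.
    h_t = torch.sigmoid(U @ x_t + R @ h_prev + P @ s_prev[:k])

    # Formula (2): probabilities of PUSH, POP, and NO-OP.
    a_t = torch.softmax(A @ h_t, dim=0)
    push, pop, noop = a_t[0], a_t[1], a_t[2]

    # Formula (3): update the top of the stack.
    s_t = torch.empty_like(s_prev)
    s_t[0] = (push * torch.sigmoid(D @ h_t).squeeze()
              + pop * s_prev[1] + noop * s_prev[0])

    # Formula (4): update elements at depth i > 0.
    s_t[1:-1] = push * s_prev[:-2] + pop * s_prev[2:] + noop * s_prev[1:-1]
    s_t[-1] = push * s_prev[-2] + noop * s_prev[-1]  # bottom: nothing to pop from below
    return h_t, s_t
```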
First, the '<' character, which indicates the start of a SMILES string, is placed in front of the SMILES string, and the '>' character, which indicates the end of a SMILES string, is placed at its end. When an input SMILES is fed into the generation model, an embedding vector is generated and combined with a pre-computed condition vector to form an input matrix. The condition vector is made by calculating the molecular weight, LogP, and TPSA of the input SMILES using RDKit [26]. The resulting input matrix is passed to the encoder of Stack-CVAE and converted into a latent vector. The process for generating a latent vector is illustrated in Fig. 1a. The encoder of Stack-CVAE was a 3-layer Stack-RNN with 512 hidden nodes, a stack width of 50, and a stack depth of 10, and the decoder was a 3-layer GRU with 512 hidden nodes. Cross entropy was used as the cost function for the reconstruction error, and a linear layer was applied to each output of the decoder cell.
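A minimal sketch of how the three-property condition vector could be computed with RDKit; any scaling or normalization of these values is not specified in the text and is omitted here:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def condition_vector(smiles: str):
    """Return the [molecular weight, LogP, TPSA] condition vector for a SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return [Descriptors.MolWt(mol),
            Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol)]

# Example usage
print(condition_vector("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```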
The generated latent vector is combined with the condition vector and input to the decoder. Finally, the output of the decoder is passed through a softmax to obtain a probability distribution over the tokens, and the decoder is trained to predict each token correctly. This process is illustrated in Fig. 1b.
After training, a new molecule can be generated using the generator, that is, the decoder of Stack-CVAE. When the start token '<' is entered, the decoder generates a new token. Each generated token is fed back into the decoder, and this process is repeated until the decoder generates the end token '>'. This process is illustrated in Fig. 1c.
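The generation loop in Fig. 1c can be sketched as follows; `decoder_step`, `token2idx`, and `idx2token` are hypothetical wrappers around the trained decoder and its vocabulary, since the text does not specify these interfaces:

```python
import torch

def generate_smiles(decoder_step, z, cond, token2idx, idx2token, max_len=120):
    """Sample tokens autoregressively from the trained decoder.

    decoder_step(token_idx, z, cond, hidden) -> (logits, hidden) is assumed to
    run one step of the Stack-CVAE decoder given the latent vector z and the
    condition vector cond.
    """
    token = token2idx["<"]            # start token
    hidden, out = None, []
    for _ in range(max_len):
        logits, hidden = decoder_step(token, z, cond, hidden)
        probs = torch.softmax(logits, dim=-1)
        token = torch.multinomial(probs, num_samples=1).item()
        if idx2token[token] == ">":   # end token
            break
        out.append(idx2token[token])
    return "".join(out)
```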
Reinforcement learning for drug properties
In the pre-training stage, the Stack-CVAE model learns the rules of SMILES given conditions such as molecular weight, LogP, and TPSA. Then, reinforcement learning is applied to the pre-trained Stack-CVAE model to generate new chemical compounds with the desired properties. In reinforcement learning, an agent chooses actions according to a policy that specifies the probability of taking each action in a given state. The environment then returns a reward to the agent, and the agent is trained to maximize the accumulated rewards returned by the environment.
In the proposed model, the agent corresponds to the decoder of Stack-CVAE, and the policy is the probability distribution over tokens produced by the decoder. An action of the agent corresponds to generating one token, and a state corresponds to the currently generated token. When all actions are completed, a chemical formula in the form of SMILES has been generated, and the reward is measured by calculating the RAscore and the binding affinity of the generated SMILES.
More specifically, the state of the policy is the current token. The current token is combined with a latent vector z drawn from a normal distribution with mean 0 and standard deviation 1 and fed into the decoder of Stack-CVAE, which yields the probability distribution over the next token given the latent vector z. The next token is sampled from this distribution, and this process is repeated until the end character '>' is generated. Until then, the reward is zero; once generation finishes, the environment calculates the reward based on two predicted properties of the generated SMILES.
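A sketch of one rollout during reinforcement learning, reusing the hypothetical `decoder_step` and vocabulary helpers from the generation sketch above; storing the per-step log-probabilities is what the loss in formula (8) below consumes:

```python
import torch

def rollout(decoder_step, cond, token2idx, idx2token, latent_dim, max_len=120):
    """Generate one SMILES during RL and record per-token log-probabilities."""
    z = torch.randn(latent_dim)       # latent vector ~ N(0, 1)
    token = token2idx["<"]            # start token
    hidden, tokens, log_probs = None, [], []
    for _ in range(max_len):
        logits, hidden = decoder_step(token, z, cond, hidden)
        dist = torch.distributions.Categorical(logits=logits)
        next_token = dist.sample()
        log_probs.append(dist.log_prob(next_token))
        token = next_token.item()
        if idx2token[token] == ">":   # end token: the reward is computed now
            break
        tokens.append(idx2token[token])
    return "".join(tokens), log_probs
```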
The first property is the predicted Retrosynthetic Accessibility score (RAscore), which is calculated with the synthesis planning tool AiZynthFinder [27]. RAscore indicates whether synthesis is feasible: it has the value 1 if synthesis is possible and 0 if not. The reward based on RAscore, reward1, is calculated by formula (6).
$$reward1= \left\{\begin{array}{cc}1, & if\;RAscore=0\\ RAscore \times 5+1, & otherwise\end{array}\right.$$
(6)
The second property is the binding affinity of the generated SMILES to the target proteins. The binding affinity was calculated using DeepPurpose, and for multiple target proteins the binding affinities were averaged. We had two groups of target proteins, group A and group D. Group A included target proteins that should bind to the generated SMILES, and group D included target proteins that should not. The averaged binding affinities for groups A and D are affA and affD, respectively. The reward based on the binding affinity to group A, reward2, is calculated by formula (7).
$$reward2= \left\{\begin{array}{cc}{(affA-4)}^{2}+1, & if \; affA > 5\\ 1, & otherwise\end{array}\right.$$
(7)
The reward based on the binding affinity to group D, reward3, is 6 if affD < 5.5 and 1 otherwise. The total reward is the sum of reward1, reward2, and reward3.
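Combining formulas (6) and (7) with the reward3 rule, the total reward could be sketched as follows; the RAscore, affA, and affD values are assumed to be precomputed with AiZynthFinder and DeepPurpose, respectively:

```python
def total_reward(ra_score: float, aff_a: float, aff_d: float) -> float:
    """Sum of reward1 (formula 6), reward2 (formula 7), and reward3.

    ra_score: retrosynthetic accessibility score of the generated SMILES
    aff_a:    mean predicted binding affinity to group A (desired targets)
    aff_d:    mean predicted binding affinity to group D (undesired targets)
    """
    reward1 = 1.0 if ra_score == 0 else ra_score * 5 + 1
    reward2 = (aff_a - 4) ** 2 + 1 if aff_a > 5 else 1.0
    reward3 = 6.0 if aff_d < 5.5 else 1.0
    return reward1 + reward2 + reward3
```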
The decoder of Stack-CVAE is trained to increase the reward by decreasing the loss, which is calculated by formula (8).
$$Loss=-\sum_{t=0}^{T-1}{\log}\left({y}_{t}\right)\times discount\;reward$$
(8)
In formula (8), \(T\) is the length of the generated SMILES string, and \({y}_{t}\) is the probability assigned to the \((t+1)\)-th token. The discounted reward is the reward multiplied by a discount rate, which we set to 0.1. The overall reinforcement learning process is illustrated in Fig. 2.
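A sketch of formula (8) as a policy-gradient update, using the per-step log-probabilities collected in the rollout sketch above; applying the discount per step as gamma**t is an assumption about how the discount rate is used, since the text only states the rate itself:

```python
def policy_gradient_loss(log_probs, reward, gamma=0.1):
    """Formula (8): negative sum of per-token log-probabilities
    weighted by the discounted reward.

    log_probs: sequence of log y_t for the tokens generated at steps t = 0..T-1
    reward:    scalar reward of the finished SMILES (reward1 + reward2 + reward3)
    gamma:     discount rate (0.1 in the text)
    """
    loss = 0.0
    for t, log_p in enumerate(log_probs):
        loss = loss - log_p * reward * (gamma ** t)
    return loss
```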