Efficient learning of non-autoregressive graph variational autoencoders for molecular graph generation

With the advancements in deep learning, deep generative models combined with graph neural networks have been successfully employed for data-driven molecular graph generation. Early methods based on the non-autoregressive approach have been effective in generating molecular graphs quickly and efficiently but have suffered from low performance. In this paper, we present an improved learning method involving a graph variational autoencoder for efficient molecular graph generation in a non-autoregressive manner. We introduce three additional learning objectives and incorporate them into the training of the model: approximate graph matching, reinforcement learning, and auxiliary property prediction. We demonstrate the effectiveness of the proposed method by evaluating it for molecular graph generation tasks using QM9 and ZINC datasets. The model generates molecular graphs with high chemical validity and diversity compared with existing non-autoregressive methods. It can also conditionally generate molecular graphs satisfying various target conditions.


Introduction
In recent years, machine learning has been actively adopted to accelerate the discovery of novel molecules with desired properties [1][2][3][4]. While traditional approaches, such as manual design and enumeration [5][6][7][8], depend highly on domain knowledge and intuition of human experts, machine learning approaches have allowed automated design of desired molecules in a data-driven manner. Thus, they have attracted considerable attention from academia and industry. A typical way to accomplish so is to construct a generative model by learning from data to capture the underlying distribution of the data, and use it to generate new molecules [9,10]. The large number of known molecules that have been discovered in the past can be a valuable source for the data [11,12]. Although this research area is currently in the incipient stage and remains challenging, studies are increasingly performed, demonstrating improved effectiveness.
The main consideration is how to represent molecules to be processed within a model and to be generated by a model. Various types of molecular representations can be considered. Among them, most existing studies have employed a simplified molecular-input line-entry system (SMILES) [13] to represent molecules. SMILES is a line notation that encodes its graph structure via depth-first traversal into a sequence of symbols with a simple vocabulary and grammar rules. There have been implemented deep generative models based on a recurrent neural network (RNN), which learn a generative distribution from data to produce a SMILES string from a latent vector. They include RNN language models [14][15][16][17], variational autoencoders (VAEs) [18][19][20][21] and generative adversarial networks (GANs) [22,23]. To conditionally generate SMILES strings with specified target properties, additional optimization procedures have been adopted along with the generative models, including Bayesian optimization [18] and reinforcement learning [14,15,22,23].
Conditional generative models, which directly sample new SMILES strings from a conditional generative distribution without any extra optimization procedure, have also been presented for targeted generation [19,24].
SMILES representation has proven useful in learning generative models for molecular design tasks owing to its simplicity. However, several limitations hinder the generation of chemically valid molecules. This representation cannot fully represent the entire chemical space, but only a part of the chemical space can be represented as SMILES strings. It is inadequate to capture the similarity between molecules, because a small perturbation in a molecule may result in a significant change in the representation. To address such limitations, numerous research attempts have been made to generate molecular graphs directly using generative models. A molecular graph G is a natural representation of a molecule, which is expressed as an undirected graph of varying size whose nodes and edges correspond to atoms and bonds, respectively. This representation provides high coverage of the chemical space, and can convey various chemical features within it.
With recent advances in graph neural networks [25][26][27], various methods for extending deep generative models to molecular graph generation have been proposed. They can be categorized into two main approaches: nonautoregressive and autoregressive. The non-autoregressive approach allows more principled use of generative models, generating a molecular graph G directly from a latent vector z using a non-autoregressive distribution p(G|z) without any iterative procedure. Few methods based on this approach have been presented, owing to the challenge imposed by graph isomorphism, meaning that a molecular graph is invariant to permutations of its nodes. Notably, Simonovsky and Komodakis [28] presented a non-autoregressive VAE that generates molecular graphs, named GraphVAE. Its training requires calculating the reconstruction loss for generated graphs. The problem of graph isomorphism is addressed by a graph matching procedure that entails expensive computation. De Cao and Kipf [29] sidestepped graph isomorphism by using a non-autoregressive GAN for molecular graph generation, named MolGAN. Its training however suffers from the mode-collapse problem, thus less diverse molecular graphs are generated. The non-autoregressive approach successfully generates small molecular graphs in a very fast and efficient manner but suffers from difficulty in model training and a low validity ratio when generating larger graphs. Owing to such limitations, the autoregressive approach, which aims at sequentially generating a molecular graph node by node using an autoregressive distribution, has been a main research direction [30][31][32][33]. These methods successfully generate molecular graphs with a high validity ratio, while involves an iterative procedure for each generation which makes them less efficient.
This work focuses on the non-autoregressive approach for fast and efficient generation of molecular graphs. Here we propose an efficient learning method to train a VAE that generates molecular graphs in a non-autoregressive manner. To improve the molecular graph generation, we train the VAE by incorporating three additional learning objectives: approximate graph matching, reinforcement learning, and auxiliary property prediction. The trained VAE can almost always generate chemically valid molecular graphs with improved diversity compared with existing non-autoregressive methods. The proposed method also allows constrained generation of molecular graphs with specified target properties.

Molecules as graphs
In this work, we use the molecular graph representation defined as follows. A molecule is represented by an undirected graph G = (V, E) with up to m nodes, where V and E represent the set of nodes and the set of edges, respectively. The node vectors v i ∈ V and edge vectors e i,j ∈ E are associated with heavy atoms and their bonds, respectively, in the molecule. It should be noted that e i,j = e j,i because we use an undirected graph. For the i-th atom, v i = (v i,1 , . . . , v i,p ) is a p-dimensional vector formed by concatenating three one-hot vectors indicating the atom type, formal charge, and number of explicit hydrogens. The dimension p depends on the dataset used. For the bond between the i-th and j-th atoms, e i,j = (e i,j,1 , . . . , e i,j,q ) is a q-dimensional one-hot vector associated with the bond type. We kekulize the molecule for simplicity so that the only bond types to consider are single, double, triple, and none, hence q = 4 . Additionally, the properties of the molecule are represented as a property vector y = (y 1 , . . . , y l ).

Graph variational autoencoder
We construct a conditional version of the graph VAE [34] in a non-autoregressive manner. It seeks to find the generative distribution of G conditioned on a latent vector z and a property vector y and parameterized by θ , denoted as p θ (G|z, y) . The prior distributions of z and y are assumed to be p(z) = N (z|0, I) and p(y) = N (y|µ y , � y ) , respectively. To address the intractability of the posterior distribution p θ (z|G, y) , we introduce an approximate posterior distribution q φ (z|G, y) = N (z|µ z (G, y), diag(σ 2 z (G, y))) , which has a normal distribution and is parameterized by φ.
The distributions q φ (z|G, y) and p θ (G|z, y) are called the encoder and decoder of the VAE, respectively. For the encoder, we use a message passing neural network (MPNN) [27], which is a variant of a graph neural network that operates directly on graphs of different sizes and is invariant to graph isomorphism. The encoder takes G and y and outputs the mean vector µ z (G, y) and variance vector σ 2 z (G, y) , from which z is sampled based on reparametrization as µ z (G, y) + ǫ ⊙ σ 2 z (G, y) with ǫ ∼ N (0, I) . The decoder is modeled as a fully-connected neural network that outputs mp + m(m − 1)q/2 values at once from z and y with node-wise and edge-wise softmax activation. The output values form a probabilistic graph g(z, y) composed of m nodes and m(m − 1)/2 edges.
The original learning objective of the VAE is given with respect the parameters φ and θ as: where the first and second terms on the right-hand side are regarded as the reconstruction loss and regularization loss, respectively. Owing to graph isomorphism, the calculation of the reconstruction loss necessitates a graph matching procedure that involves comparing an input graph and its probabilistic reconstruction which is computationally expensive. For example, the max-pooling matching algorithm has computational complexity of O(m 4 ) [28].
To make the learning more efficient, we introduce an approximate graph matching procedure, which aims to alleviate the computational burden for the reconstruction loss. Additionally, we incorporate reinforcement learning and auxiliary property prediction into the training of the VAE to further improve the generation performance. Details regarding the learning objectives utilized in this work are presented in the following subsection.

Approximate graph matching
The reconstruction loss E z∼q φ (z|G,y) [− log p θ (G|z, y)] involves comparing an original input graph G = (V, E) and its reconstruction by the VAE. In this subsection, we denote the reconstruction of G as a probabilistic graph Because the reconstruction loss must be invariant to graph isomorphism, a graph matching procedure that seeks the best possible matching between the two graphs is needed. To avoid expensive computation, we propose to use approximate graph matching. The main idea is to approximate the distance between G and G by comparing the numbers of atom types, bond types, atom-bond pair types, and atombond-atom pair types.
Assuming that each edge vector is represented as a four-dimensional vector as e i,j = (e i,j(single) , e i,j(double) , e i,j(triple) , e i,j(none) ) , the reconstruction loss is approximated as follows: where v i ∈ V , e i,j ∈ E , v i ∈ V , and e i,j ∈ E . When calculating the approximated reconstruction loss, we discard the non-atom and non-bond types from the vectors. The first, second, third, and fourth terms on the right-hand side correspond to the comparison of the numbers of atom types, bond types, atom-bond pair types, and atombond-atom pair types, respectively. They are independent of node ordering because they summate over the nodes in a graph, and are thus invariant to graph isomorphism. All the operations in the above equation are differentiable.

Reinforcement learning
We further improve the VAE via reinforcement learning with the aim of generating chemically valid molecules. We adopt a deterministic policy gradient framework, in which the decoder of the VAE is regarded as a policy network that takes the two vectors z and y as state inputs. It outputs a probabilistic graph as an action from the state. The reward for the action is the chemical validity of the probabilistic graph, which is evaluated using an external reward function R. The reward function returns 1 if the probabilistic graph can be decoded into a chemically valid molecular graph and 0 otherwise. In this work, the chemical validity of a molecular graph G is evaluated via a sanitization check. With the reward function, the policy network learns how to generate a probabilistic graph that fulfills the expected reward of 1.
We wish to optimize the molecular graph generation of the VAE for maximizing the external reward function R. Because the reward function is non-differentiable, it cannot be incorporated directly into the learning procedure. We build a reward network r that approximates the reward function R. The reward network is modeled as an MPNN with a sigmoid output. It takes a probabilistic graph as input and predicts its actual reward value. The reward network can backpropagate the VAE. We train the VAE to generate a probabilistic graph towards maximizing the output of the reward network.
For the learning objective, we derive two additional losses L RL (φ, θ) and L RL (r) as: The VAE is trained for minimizing L RL (φ, θ) , while the reward network r is trained to minimize L RL (r).
It should be noted that we can impose extra domainspecific constraints regarding structures or properties on the external reward function R for constrained generation. For example, we can define a blacklist of undesired substructures and make the value of the reward function 0 when its input contains any substructure in the blacklist. This prevents generated molecules from having undesired substructures.

Auxiliary property prediction
Augmenting a generative model with side information has known to improve the quality of generated samples as well as the stability of model training [19,35]. We incorporate auxiliary property prediction into the VAE learning to enable generating probabilistic graphs that correspond to desired properties as well as to diversify the generated outcomes. We build a predictor network as an MPNN with l linear outputs. It learns from the training dataset to predict the property vector y of a given graph. Because the predictor network can backpropagate the VAE, we train the VAE to generate a probabilistic graph whose corresponding y is to be reconstructed by the predictor network.
For learning with auxiliary property prediction, we derive two additional losses L Y (φ, θ) and L Y (f ) as: Only the probabilistic graphs that are deemed valid by the external reward function R are incorporated into the first loss L Y (φ, θ) . The VAE is trained to minimize L Y (φ, θ) . Simultaneously, the predictor network f is trained to minimize L Y (f ).

Learning from data
The proposed model is composed of four main components: the encoder network q φ , decoder network (4) p θ , reward network r, and predictor network f. The full learning objective combines the vanilla objective of VAE (1) along with objectives for approximate graph matching (2), reinforcement learning (3)(4), and auxiliary property prediction (5-6). The objective functions for the VAE part and the other part are J 1 and J 2 , respectively, given as: where β 1 and β 2 are hyperparameters that control the trade-off between different learning objectives.
Given an empirical data distribution p data (G, y) , we train the entire model for minimizing the two objective functions J 1 (φ, θ) and J 2 (r, f ) simultaneously. For each iteration, a training batch X is sampled from the data distribution. The VAE parameters φ and θ are updated via gradient descent of J 1 (φ, θ) on X, and the reward network r and predictor network f are updated via gradient descent of J 2 (r, f ) . Algorithm 1 presents the pseudocode of the learning procedure.

Molecular graph generation
Once the model is trained, we use the decoder p θ (G|z, y) to generate molecular graphs. For unconditional generation, y * is sampled from its prior distribution p(y) . To conditionally generate molecular graphs, y * is sampled from a conditional distribution. For example, if the target condition is given as y k = τ , then y * ∼ p(y|y k = τ ) . We sample z * from p(z) . Given y * and z * , the decoder returns a probabilistic output, which is decoded based on nodewise and edge-wise argmax to obtain a molecular graph G * as We use a simple decoding method to discretize probabilistic outputs for deriving molecular graphs. Some studies reported that post-processing of probabilistic outputs (7) based on such methods as maximum spanning tree [28] and beam search [36] can improve the validity of the generated molecular graphs.

Datasets
We conducted experiments to investigate the effectiveness of the proposed method. In this work, we utilized two molecular datasets: QM9 and ZINC. They are publicly available and have commonly been used for the evaluation of molecular graph generation.
The QM9 dataset [37,38] originally contains 133,885 organic molecules with at most nine heavy atoms of the following types: {C, N, O, F, none}. We sampled 100,000 unique molecular graphs from the original set. For each atom in a molecular graph, the formal charge and the number of explicit hydrogens belonged to {− 1, 0, 1} and {0, 1, 2, 3} , respectively. Thus, the dimensions p and q were 12 and 4, respectively.
The ZINC dataset was constituted by sampling 100,000 unique molecular graphs from the drug-like part of the ZINC database [11]. Each molecular graph in the dataset contained up to 38 heavy atoms with ten atom types: {C, N, O, F, P, S, Cl, Br, I, none}. In the dataset used, the formal charge and the number of explicit hydrogens were in {− 1, 0, 1} and {0, 1, 2, 3} , respectively. The dimensions p and q were 17 and 4, respectively.
We employed two properties that are readily calculable using the RDKit package [39]: the molecular weight (MolWt) and Wildman-Crippen partition coefficient (LogP) [40]. The property vector of each molecular graph was constituted with the three properties. Thus, the dimension l was 2. These properties can be evaluated efficiently without high costs for newly generated molecular graphs.

Experiments
In the proposed method, the model architecture was determined as follows. For the molecular graph representation, the hyperparameter m was set as 5 plus the maximum number of heavy atoms in a molecule in the dataset used, i.e., m = 14 for QM9 and m = 43 for ZINC. The encoder, reward, and predictor networks were modeled as MPNNs. Each MPNN had three message passing layers with 50 hidden units, followed by a 100-dimensional node aggregation layer, which was then further processed by two fully-connected layers with 100 tanh hidden units. The decoder network was a fully-connected network with three fully-connected layers having dimensions of 500 with LeakyReLU activation. The dimension of z was set as 100. Each property in y was normalized to have a mean of 0 and a standard deviation of 1 for training. For the prior distribution p(y) , we estimated the mean vector μ y and covariance matrix ˆ y from the training dataset. For reinforcement learning, the external reward function was implemented using the RDKit package [39]. Figure 1 shows the architectural details of the model used in the experiments. For the objective functions, the hyperparameters β 1 and β 2 were both set as 1. Given a dataset, the model was trained for 50 epochs using the RMSProp optimizer with a learning rate of 0.0005 and a batch size of 20.
We compared the proposed method with two baseline methods that generate molecular graphs in a non-autoregressive manner: GraphVAE [28] and MolGAN [29]. We also performed an ablation study on our method to investigate the efficacy of reinforcement learning and auxiliary property prediction. We evaluated two ablation models, denoted A1 and A2. For model A1, we removed both the reward and predictor networks by setting β 1 = 0 and β 2 = 0 , which indicates that the reinforcement learning and auxiliary property prediction were excluded from the learning objective. For model A2, the predictor network was removed by setting β 1 = 1 and β 2 = 0 to discard auxiliary property prediction only.
The performance of each model for molecular graph generation was evaluated with regard to three metrics: Validity, Uniqueness, and Novelty. We sampled 10,000 molecular graphs from each model. Then, Validity was calculated as the fraction of valid molecular graphs out of 10,000 generated molecular graphs. Uniqueness was the fraction of valid graphs that were not duplicates. Novelty was the fraction of valid graphs that were not included in the training dataset. The values of each metric ranged from 0 to 1. Because the primary goal is to generate molecular graphs that are valid, unique, and novel, we calculated the geometric mean (G-mean) of the three metrics for an overall comparison among the models.
All the experiments were implemented based on GPUaccelerated TensorFlow in Python. Table 1 presents the results of unconditional molecular graph generation for QM9 and ZINC. We report the validity, uniqueness, novelty, and G-mean for the five compared models. The best value in each row is highlighted in italics. Figures 2 and 3 show example molecular graphs generated randomly by the proposed models trained with QM9 and ZINC, respectively.

Results
The proposed model exhibited the highest G-mean for both the QM9 and ZINC datasets, as the model achieved good performance for all three metrics. The proposed model exhibited a validity score of > 90% for both datasets, indicating that it almost always generated chemically valid molecules. Additionally, it generated more diverse molecules, as evident from its comparable uniqueness and novelty scores. In the case of ZINC, all the compared models yielded novelty of 1, indicating that all the generated molecules were not included in the training set. The baseline models yielded relatively good performance for QM9 but rarely generated valid molecular graphs for ZINC.  Compared with the ablated models, reinforcement learning helped the proposed model generate chemically valid molecules, as evident from the observation that model A2 was superior to model A1 in terms of validity. The auxiliary property prediction contributed to the diversification of the generated molecules, as the proposed model exhibited higher uniqueness than model A2. Table 2 presents summary statistics for the newly generated molecules of each target condition by the proposed model. For conditional generation, the target conditions for MolWt and LogP were set as {120, 125, 130} and {− 0.4, 0.2, 0.8} , respectively, for QM9, and {300, 350, 400} and {1.5, 2.5, 3.5} , respectively, for ZINC. The property distributions obtained via unconditional generation were similar to those of the training set. Compared with the unconditional generation, the conditional generation results exhibited a slightly lower G-mean and a smaller number of unique molecular graphs. As shown in Table 2, when a target condition was set, the proposed model successfully generated molecular graphs whose properties were close to the target value.

GuacaMol benchmarks
We further investigated the effectiveness of the proposed method with the distribution-learning benchmarks in the GuacaMol framework [41]. The benchmarks are based on a standardized subset of the ChEMBL database [42]. As the training set for the proposed model, we used molecules with up to 50 heavy atoms from the original dataset. We trained the model for 20 epochs with the same experimental settings as before. Then, we evaluated Validity, Uniqueness, and Novelty of generated molecular graphs by the model. In addition, whether the model is able to reproduce the distribution of the training set was assessed with Kullback-Leibler Divergence (KLD) and Fréchet ChemNet Distance (FCD). For baselines, four SMILES generation models (LSTM, VAE, AAE, and ORGAN) and one molecular graph generation model (GraphMCTS) were compared, as implemented in [41].
The results for the distribution-learning benchmarks are shown in Table 3. Compared with the baseline models, the proposed model exhibited comparable or higher scores on validity, uniqueness, and novelty metrics.  However, the proposed model yielded relatively lower KLD and FCD scores like GraphMCTS, which indicates that the proposed model was inferior on reproducing the underlying property distributions of the training set. Overall, the molecular graph generation models were very useful in generating chemically valid and diverse molecular graphs, while they were inferior to the SMILES generation models in distribution-learning.

Conclusion
While the non-autoregressive approach has the advantage of fast and computationally efficient generation of molecular graphs without any iterative procedure, the existing methods suffer from low performance. To overcome this limitation, we proposed an efficient learning method to build a graph VAE that generates molecular graphs in a non-autoregressive manner. In order to improve the generation performance, we introduced approximate graph matching, reinforcement learning, and auxiliary property prediction into the training of the model. Experimental validation using QM9 and ZINC datasets demonstrated the effectiveness of the proposed method. Compared with existing methods, the proposed method exhibited higher performance for molecular graph generation with a validity score of > 90% as well as improved diversity of the generated molecular graphs for both datasets. We also demonstrated that the proposed model can conditionally generate novel molecular graphs satisfying specified target conditions. We believe that the proposed method can serve as an efficient tool for the discovery of new materials. Successful application of the method will allow automatic design of desired chemical structures with targeted conditions by learning implicit knowledge from data without explicit knowledge from human experts, thereby meriting further investigations. The generated molecular graphs can be examined further to obtain realistic chemical structures with desired properties.
One downside of the proposed method its high complexity with regard to space and time. Both the space and time complexity increase with the size of the graphs, owing to the use of graph neural networks. Thus, the  proposed method would be impractical for learning and generating large molecular graphs, e.g., hundreds of heavy atoms in a molecule. In the future, research will be performed to reduce the complexity for generating larger molecular graphs efficiently.