Compressed graph representation for scalable molecular graph generation

Recently, deep learning has been successfully applied to molecular graph generation. Nevertheless, mitigating the computational complexity, which increases with the number of nodes in a graph, has been a major challenge. This has hindered the application of deep learning-based molecular graph generation to large molecules with many heavy atoms. In this study, we present a molecular graph compression method to alleviate the complexity while maintaining the capability of generating chemically valid and diverse molecular graphs. We designate six small substructural patterns that are prevalent between two atoms in real-world molecules. These relevant substructures in a molecular graph are then converted to edges by regarding them as additional edge features along with the bond types. This reduces the number of nodes significantly without any information loss. Consequently, a generative model can be constructed in a more efficient and scalable manner with large molecules on a compressed graph representation. We demonstrate the effectiveness of the proposed method for molecules with up to 88 heavy atoms using the GuacaMol benchmark.


Introduction
Deep learning has revolutionized the design of novel molecules required for real-world industrial applications. Whereas traditional approaches have mostly been based on human knowledge and intuition, the use of deep learning has enabled the autonomous design of molecules by learning from previously accumulated data [1][2][3]. Most existing methods use deep generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs). Their capabilities depend on how a molecule is represented. Common representations include the simplified molecular-input line-entry system (SMILES) and the molecular graph.
Although the SMILES representation has been demonstrated to be useful, recent research tends to employ the molecular graph representation, which is a natural and intuitive way of representing a molecule by regarding its atoms and bonds as nodes and edges, respectively [1].
A major challenge for molecular graph generation is addressing the scalability issue caused by its high computational complexity [4]. The representation of a molecular graph $G = (\mathcal{V}, \mathcal{E})$ on which a model learns, where $\mathcal{V}$ and $\mathcal{E}$ are the sets of nodes and edges in $G$, typically involves an adjacency expression between its nodes, yielding $O(|\mathcal{V}|^2)$ complexity. A naïve approach is to regard only heavy atoms in a molecule as nodes in the corresponding graph representation by treating hydrogen atoms implicitly as node features. This approach, however, is not scalable to large molecules with many heavy atoms, which are abundant in the real world [5,6]. Consequently, existing methods were evaluated by limiting the size of the molecules in the training dataset, often to fewer than 50 heavy atoms. Benchmark datasets of small molecules, such as QM9 [7,8] and ZINC [9], have been commonly employed in the literature.
For scalable molecular graph generation, there have been research attempts to alleviate the complexity $O(|\mathcal{V}|^2)$ via representational simplification. One approach involves representing a molecular graph as a sequence of vectors and then building an autoregressive model on the sequence representation for the sequential generation of nodes and edges that form a graph. You et al. presented GraphRNN, which constructs a model on a node-level sequence representation with $M$-dimensional adjacency vectors, where $M$ is set to less than $|\mathcal{V}|$, by employing breadth-first-search node ordering with which the complexity is reduced to $O(|\mathcal{V}|M)$ [10]. Goyal et al. presented GraphGen, which transforms a molecular graph into an edge-level sequence based on minimum depth-first-search coding, which leads to a complexity of $O(|\mathcal{E}|)$ [4]. However, as in the SMILES representation, the sequential nature imposes constraints on the model architecture and prevents the model from capturing molecular similarity and retaining chemical validity. Another approach is to reduce the number of nodes $|\mathcal{V}|$ directly in the representation. Jin et al. presented junction tree VAE (JTVAE), which represents a molecular graph as a junction tree, whose nodes correspond to valid chemical substructures, using tree decomposition [11]. The compressed representation can be generally applicable to any model architecture. Nevertheless, JTVAE can suffer from high dimensionality due to the dramatic increase in the number of node features, because of the large variety of chemical substructures that appear in the dataset.
For a more practical application of molecular graph generation, we focus on the latter approach which involves reducing the number of nodes directly in the representation. This study aims to improve the scalability of molecular graph generation to large molecules while maintaining the capability of generating chemically valid and diverse molecular graphs. We present a novel method for the compression of molecular graph representation for scalable molecular graph generation. We designate six small substructural patterns that commonly appear between two heavy atoms in practice and regard their appearances as additional edge features along with the bond types. A molecular graph is compressed by substituting the relevant substructures with new edges. This compression reduces the number of nodes without drastically increasing the number of edge features, making it scalable to large molecules. In addition, the compressed graph can be reconstructed into the original graph without any information loss.

Molecular graph compression
The conventional graph representation of a molecule is an undirected graph whose nodes and edges correspond to heavy atoms and their bonds in the molecule, respectively. Hydrogen atoms are treated implicitly as node features, and thus, they are not regarded as explicit nodes. Formally, a molecular graph is defined as $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ denote the sets of nodes and edges, respectively. Each node corresponding to the $i$-th heavy atom is represented by a node vector $v_i \in \mathcal{V}$ with a dimensionality of $p$, whose features indicate the atom type, formal charge, and valence information. An edge corresponding to the connection between the $i$-th and $j$-th atoms is represented by an edge vector $e_{i,j} \in \mathcal{E}$ with a dimensionality of $q$, whose features are associated with a bond type. The property vector $y = (y_1, \ldots, y_l)$ represents the properties of the molecule.
We compress the graph representation by reducing the number of nodes. We employ six small substructural patterns that commonly appear between two heavy atoms, which are listed in Fig. 1. Each of the substructural patterns contains only one or two heavy atoms with the atom types corresponding to C, N, and O, which are abundant in real-world molecules. We represent the appearances of these six substructural patterns using additional edge features, which may be sufficient for most real-world datasets. Nevertheless, depending on the training dataset, we can additionally designate more substructural patterns to be regarded as edge features for further compression.
Formally, we define a compression function $\Psi$ that compresses an input graph. For an original graph $G$, the corresponding compressed graph $G'$ is obtained as $G' = \Psi(G)$. Given the input graph $G$, the function finds the substructures that match the six designated patterns. With canonical ordering of the atoms in $G$, each substructure is sequentially converted to an edge by representing its appearance using the corresponding edge feature. The canonical numbers of atoms are used to prioritize which substructure is converted first. When multiple substructures overlap, the one whose non-overlapping atoms have smaller canonical numbers is chosen to be replaced by an edge.
With the addition of edge features, the edge vector of the compressed graph $G'$ has higher dimensionality than that of the original graph $G$. The compression removes one or two nodes per substructure. Multiple substructures may exist between an atom pair, and a larger molecule may contain more relevant substructures. A graph is compressed further if more of the substructural patterns exist in it. Figure 2 shows an illustrative example of the compressed graph representation for two molecules. In the first example, the original graph contains eight nodes because the corresponding molecule has eight heavy atoms. For the original graph, the substructures 1-2-3, 2-3-4, and 4-6-7 are relevant to patterns 6, 2, and 2, respectively. The substructures 1-2-3 and 2-3-4 overlap, and therefore, one of them needs to be chosen for compression. Because 1-2-3 has smaller canonical numbers, we choose 1-2-3 to be replaced. After 1-2-3 and 4-6-7 are replaced by the respective edges, the number of nodes is reduced to six. The second example involves an original graph that contains seven nodes. Two substructures, 2-3-4-5 with pattern 3 and 2-7-6-5 with pattern 4, appear simultaneously between the 2nd and 5th nodes. After they are substituted by edges, the compressed graph contains three nodes.
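The compression and decompression steps can be illustrated with a minimal Python sketch. It is not the implementation used in this study: it handles a single hypothetical pattern (a carbon atom with exactly two single bonds, folded into an edge feature named `bridges`) rather than the six patterns of Fig. 1, it uses ascending atom ids as a stand-in for canonical ordering, and the helper names `compress`, `decompress`, and `signature` are assumptions made for illustration.

```python
from collections import Counter

def compress(atoms, bonds):
    """atoms: {id: symbol}; bonds: {(i, j): order} with i < j.
    Folds each matching carbon into an edge between its two neighbors."""
    neighbors = {a: set() for a in atoms}
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)
    key = lambda i, j: (min(i, j), max(i, j))
    edges = {pair: {"bond": o, "bridges": 0} for pair, o in bonds.items()}
    removed, locked = set(), set()
    for c in sorted(atoms):  # ascending ids stand in for canonical order
        if c in locked or atoms[c] != "C" or len(neighbors[c]) != 2:
            continue
        u, w = sorted(neighbors[c])
        if u in removed or w in removed:
            continue  # overlaps a substructure that was already replaced
        if bonds[key(c, u)] != 1 or bonds[key(c, w)] != 1:
            continue
        removed.add(c)
        locked.update((u, w))  # endpoints must survive as nodes
        del edges[key(c, u)], edges[key(c, w)]
        edges.setdefault((u, w), {"bond": None, "bridges": 0})["bridges"] += 1
    kept = {a: s for a, s in atoms.items() if a not in removed}
    return kept, edges

def decompress(kept, edges):
    """Re-insert one carbon per folded bridge (new ids, same structure)."""
    atoms, bonds = dict(kept), {}
    next_id = max(atoms) + 1
    for (u, w), feat in sorted(edges.items()):
        if feat["bond"] is not None:
            bonds[(u, w)] = feat["bond"]
        for _ in range(feat["bridges"]):
            atoms[next_id] = "C"
            bonds[(u, next_id)] = 1
            bonds[(w, next_id)] = 1
            next_id += 1
    return atoms, bonds

def signature(atoms, bonds):
    """Relabeling-invariant summary used to compare two graphs."""
    degree = Counter()
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return Counter((atoms[a], degree[a]) for a in atoms), Counter(bonds.values())

# Toy graph with eight heavy atoms (hypothetical, not the molecule of Fig. 2).
atoms = {1: "N", 2: "C", 3: "C", 4: "C", 5: "O", 6: "C", 7: "N", 8: "O"}
bonds = {(1, 2): 1, (2, 3): 1, (3, 4): 1, (4, 5): 2, (4, 6): 1, (6, 7): 1, (7, 8): 1}
kept, edges = compress(atoms, bonds)
restored_atoms, restored_bonds = decompress(kept, edges)
```

In this toy graph, atoms 2 and 3 both match the pattern but overlap, so only atom 2 (the smaller id) is folded, together with atom 6. The restored graph carries new ids for the re-inserted atoms, which is why the comparison uses a relabeling-invariant signature rather than exact equality; the signature is a necessary check only, not a full isomorphism test.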
The main advantages of the compressed graph representation are as follows. First, the compressed representation reduces the number of nodes (i.e., $|\mathcal{V}'| \le |\mathcal{V}|$), thereby providing better scalability to large molecules. Second, the compression is reversible, meaning that the compressed graph can be reconstructed into the original one without any information loss using a decompression function $\Psi^{-1}$ (i.e., $G = \Psi^{-1}(\Psi(G))$). Third, it does not drastically increase the dimensionality of edge vectors because only pre-chosen substructural patterns are additionally involved as edge features in the compressed representation (i.e., $q' - q$ is a small constant). The increase in edge dimensionality does not significantly affect the scalability.

Learning on graph representation
In this study, we build a non-autoregressive graph VAE (NAGVAE), presented in [12], on the compressed graph representation. The model seeks to find the generative distribution $p_\theta(G \mid z, y)$ parameterized by $\theta$. The prior distributions $p(z)$ and $p(y)$ are set to $\mathcal{N}(z \mid 0, I)$ and $\mathcal{N}(y \mid \mu_y, \Sigma_y)$, respectively. We introduce an approximate posterior distribution $q_\phi(z \mid G, y) = \mathcal{N}(z \mid \mu_z(G, y), \mathrm{diag}(\sigma_z^2(G, y)))$ parameterized by $\phi$ to address the intractability of the posterior distribution $p_\theta(z \mid G, y)$.
The architecture of the model is illustrated in Fig. 3. The model consists of five components: the encoder network $q_\phi(z \mid G, y)$, decoder network $p_\theta(G \mid z, y)$, reward network $r(G)$, predictor network $f(G)$, and external reward function $R(G)$. The encoder network $q_\phi(z \mid G, y)$, which corresponds to the approximate posterior distribution, is modeled as message passing neural networks (MPNNs) [13] to be invariant to graph isomorphism. The encoder network takes $G$ and $y$ as inputs to produce $\mu_z(G, y)$ and $\sigma_z^2(G, y)$, so that $z$ is sampled from $\mathcal{N}(z \mid \mu_z(G, y), \mathrm{diag}(\sigma_z^2(G, y)))$ based on the reparameterization trick. The decoder network $p_\theta(G \mid z, y)$, which captures the generative distribution, is modeled as a fully-connected neural network. The decoder network takes $z$ and $y$ to generate a probabilistic graph $\tilde{G}$. The reward and predictor networks are modeled as MPNNs.
The reward network $r(G)$ takes $G$ or $\tilde{G}$ as input to predict the reward $R(G)$ or $R(\tilde{G})$. The predictor network takes the same input to predict $y$. The external reward function $R(G)$ is designed based on chemical rules to return a reward of 1 if its input can be decoded as a chemically valid molecular graph and 0 otherwise.
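The sampling step of the encoder can be sketched in a few lines of Python. This is a generic illustration of the reparameterization trick for a diagonal Gaussian posterior and of the closed-form KL divergence to the standard normal prior $\mathcal{N}(z \mid 0, I)$; the function names and the example values of `mu` and `var` are assumptions, and the full training objective of the model is described in [12].

```python
import math
import random

def reparameterize(mu, var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element-wise."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

# Hypothetical encoder outputs for a two-dimensional latent vector.
mu = [0.3, -1.2]
var = [0.5, 2.0]
z = reparameterize(mu, var, random.Random(0))
kl = kl_to_standard_normal(mu, var)
```

Writing the sample as a deterministic function of `mu`, `var`, and independent noise is what allows gradients to flow through the encoder outputs during training.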
Given $N$ molecules and their properties, we form a training dataset $D = \{(G'_t, y_t)\}_{t=1}^{N}$ with the compressed representation, where $G'_t = \Psi(G_t)$. Then, the model is trained using the dataset. The objective function for the training involves the original learning objective of the VAE as well as approximate graph matching, reinforcement learning, and auxiliary property prediction. The details of the model are described in [12].
The training involves the processing of a graph $G$ in the form of a pair $(V, E)$ comprising a node matrix $V \in \mathbb{R}^{|\mathcal{V}| \times p}$, where $V_i \in \mathbb{R}^{p}$ is the node vector $v_i \in \mathcal{V}$, and an edge tensor $E \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}| \times q}$, where $E_{i,j} \in \mathbb{R}^{q}$ is the edge vector $e_{i,j} \in \mathcal{E}$ if it corresponds to a bond or substructure and is a zero vector otherwise. This leads to a computational complexity of $O(|\mathcal{V}|^2)$. Because the use of the compressed graph representation directly reduces $|\mathcal{V}|$, the model becomes more scalable to large molecules.
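The dense form of the pair described above can be sketched as follows. This is a simplified illustration that uses nested Python lists and one-hot atom types only; the actual node features (formal charge, valence) and edge features (substructural patterns) are listed in Tables 1 and 2, and the function name `to_dense` and its arguments are hypothetical.

```python
def to_dense(atoms, bonds, atom_types, num_bond_types):
    """atoms: list of symbols; bonds: {(i, j): bond_type_index}.
    Returns a node matrix (n x p) and an edge tensor (n x n x q)."""
    n = len(atoms)
    node_matrix = [[1.0 if t == a else 0.0 for t in atom_types] for a in atoms]
    edge_tensor = [[[0.0] * num_bond_types for _ in range(n)] for _ in range(n)]
    for (i, j), t in bonds.items():
        edge_tensor[i][j][t] = 1.0
        edge_tensor[j][i][t] = 1.0  # undirected graph: keep the tensor symmetric
    return node_matrix, edge_tensor

# Toy input: three heavy atoms, a single bond (index 0) and a double bond (index 1).
node_matrix, edge_tensor = to_dense(
    atoms=["C", "C", "O"],
    bonds={(0, 1): 0, (1, 2): 1},
    atom_types=["C", "N", "O"],
    num_bond_types=3,
)
```

The edge tensor stores a vector for every ordered node pair, including pairs without a bond, which is the source of the quadratic dependence on the number of nodes.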

Molecular graph generation
After training the model, the decoder $p_\theta(G \mid z, y)$ is used to generate new molecular graphs. To generate a molecular graph, we sample $z^{*}$ and $y^{*}$ from their prior distributions $p(z)$ and $p(y)$. They are fed into the decoder to produce a probabilistic output, which is then decoded via node-wise and edge-wise argmax to obtain a compressed graph $G'^{*}$ as $G'^{*} = \arg\max_{G'} p_\theta(G' \mid z^{*}, y^{*})$.

Fig. 3 Schematic diagram of model architecture
Because $G'^{*}$ is in the form of the compressed representation, we decompress it into its original representation with the decompression function $\Psi^{-1}$ as $G^{*} = \Psi^{-1}(G'^{*})$. The output $G^{*}$ can be interpreted as the chemical structure of a molecule.
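The argmax decoding step of the generation procedure can be sketched as follows. The layout of the decoder output is an assumption made for illustration: each node slot and each node pair is given a single categorical distribution in which class 0 means "absent", whereas the actual output format, including the multiple edge features of the compressed representation, follows [12]. The function names are hypothetical.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=values.__getitem__)

def decode(node_probs, edge_probs):
    """node_probs[i]: class probabilities for node slot i.
    edge_probs[i][j]: class probabilities for the node pair (i, j).
    Class 0 is assumed to mean 'no node' or 'no edge'."""
    labels = [argmax(p) for p in node_probs]
    present = [i for i, c in enumerate(labels) if c != 0]
    nodes = {i: labels[i] for i in present}
    edges = {}
    for a, i in enumerate(present):
        for j in present[a + 1:]:
            c = argmax(edge_probs[i][j])
            if c != 0:
                edges[(i, j)] = c
    return nodes, edges

# Toy probabilistic output with three node slots; the third slot is empty.
node_probs = [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]
no_edge, single = [0.9, 0.1], [0.2, 0.8]
edge_probs = [
    [no_edge, single, no_edge],
    [single, no_edge, no_edge],
    [no_edge, no_edge, no_edge],
]
nodes, edges = decode(node_probs, edge_probs)
```

The decoded graph is still in the compressed representation, so a decompression step would follow before the result is read as a chemical structure.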

GuacaMol benchmark
We investigated the effectiveness of the proposed method using the GuacaMol distribution-learning benchmark [14]. The training dataset for the benchmark is a standardized subset of the ChEMBL database [6], consisting of 1,591,378 molecules with up to 88 heavy atoms.
In the benchmark, the performance of a model for generating chemically valid and diverse molecular graphs is evaluated in terms of the Validity, Uniqueness, and Novelty of 10,000 molecular graphs generated by the model. Validity is the ratio of generated molecular graphs that are valid, where a molecular graph is counted as valid if it can be processed successfully with RDKit. Uniqueness is the ratio of valid graphs that are not duplicates. Novelty is the ratio of valid graphs that are not present in the training dataset. In addition, Kullback-Leibler Divergence (KLD) and Fréchet ChemNet Distance (FCD) are used to evaluate the success of a model in reproducing the distribution of the training dataset.
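The three ratios can be written out as a short Python sketch that follows the definitions given above literally. It is not the GuacaMol implementation: molecules are plain strings, the validity check is passed in as a hypothetical predicate `is_valid` (the benchmark relies on RDKit for this), and the toy inputs are invented for illustration.

```python
def distribution_metrics(generated, training, is_valid):
    """Return (validity, uniqueness, novelty) for a list of generated molecules."""
    valid = [g for g in generated if is_valid(g)]
    if not valid:
        return 0.0, 0.0, 0.0
    validity = len(valid) / len(generated)
    uniqueness = len(set(valid)) / len(valid)  # share of valid graphs that are distinct
    novelty = sum(g not in training for g in valid) / len(valid)
    return validity, uniqueness, novelty

# Toy data: five generated strings, one of them treated as invalid.
generated = ["CCO", "CCO", "CCN", "bad", "c1ccccc1"]
training = {"CCO"}
validity, uniqueness, novelty = distribution_metrics(
    generated, training, is_valid=lambda s: s != "bad"
)
```

For the toy data, four of the five strings are valid, three of those four are distinct, and two of the four are absent from the training set.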

Implementation
We used a NAGVAE [12] trained on the compressed graph representation of the training dataset (NAGVAE$_{\text{compress}}$) as the proposed model. The node and edge features that we used for the compressed representation are listed in Tables 1 and 2, respectively. It should be noted that the type and dimensionality of each feature depend on the training dataset. The model was trained for 10 epochs with a batch size of 10. The hyperparameters in the objective function were set to $\beta_1 = 5$ and $\beta_2 = 1$. All other settings followed the defaults in [12].

Molecular graph compression
Each molecular graph in the training dataset was compressed using the compression function. Figure 4 shows the results of molecular graph compression on the dataset, the summary statistics of which are listed in Table 3. The number of nodes was reduced significantly with the compressed representation. By frequency analysis on the dataset, we found that patterns 1-6 appeared 1.10, 1.31, 1.44, 1.03, 0.65, and 0.60 times, respectively, per molecule on average. Consequently, the average and maximum number of nodes per molecule were reduced by 33.70% and 40.91%, respectively. In the cases of the two largest molecular graphs containing 88 nodes, the numbers of nodes were reduced to 30 and 40.
As evident from the results, the compression function effectively reduced the number of nodes in the molecular graphs. In particular, molecular graphs tended to be better compressed when the number of nodes was large. The high compression rate contributes to reducing the computational cost and memory usage involved in molecular graph generation.

Molecular graph generation
Table 4 shows a performance comparison between the baseline and proposed models. The experimental results for the baseline models were obtained from [14]. Among the baseline models, GraphMCTS was superior in generating chemically valid and diverse molecular graphs in terms of the validity, uniqueness, and novelty scores. LSTM yielded better performance in reproducing the underlying property distributions of the training dataset in terms of the KLD and FCD scores. JTVAE and NAGVAE$_{\text{original}}$ failed to provide results owing to the scalability issue. The proposed model, NAGVAE$_{\text{compress}}$, was successful in generating molecular graphs. Notably, NAGVAE$_{\text{compress}}$ yielded performance comparable or superior to that of the baseline models in terms of the validity, uniqueness, and novelty scores. One drawback was its low distribution-learning performance: it yielded lower KLD and FCD scores than the SMILES generation models.
From a computational perspective, the use of the compressed representation reduced the computational burden for both the training and inference phases. Because the complexity $O(|\mathcal{V}|^2)$ increases with the number of nodes, training and inference on a more compact representation with fewer nodes are faster and require lower computational cost and memory usage. This is also evident from the fact that NAGVAE$_{\text{original}}$ could not be trained, whereas NAGVAE$_{\text{compress}}$ was successfully trained with the training dataset. Additionally, the decompression of the compressed graph representation had little effect on the computational burden. Molecular graph generation by NAGVAE$_{\text{compress}}$ involves inference with the decoder network $p_\theta(G \mid z, y)$ and decompression with the function $\Psi^{-1}$, which took only around 0.004 s and 0.001 s per molecular graph on average, respectively.
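The effect of the node reduction on the quadratic term can be checked with a short calculation. The sketch below uses only the figures reported above (a maximum of 88 nodes and a 40.91% reduction of that maximum) and counts node pairs; it deliberately leaves out the per-pair feature dimensionality, which grows from $q$ to $q'$ under compression, and the function name is hypothetical.

```python
def num_node_pairs(num_nodes):
    """Number of entries along the two node axes of a dense edge tensor."""
    return num_nodes * num_nodes

max_nodes = 88
# Maximum node count after compression, derived from the reported 40.91% reduction.
max_nodes_compressed = round(max_nodes * (1 - 0.4091))

pairs_before = num_node_pairs(max_nodes)
pairs_after = num_node_pairs(max_nodes_compressed)
ratio = pairs_after / pairs_before
```

Under these figures, the largest compressed graph has 52 nodes, and the number of node pairs drops from 7744 to 2704, which is roughly 35% of the original count.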
As demonstrated by the experimental results, the use of compressed graph representation makes molecular graph generation scalable to large molecular graphs without performance degradation with regard to the generation of chemically valid and diverse molecular graphs. We expect that molecular graph compression will shed

Conclusion
In this paper, we presented a molecular graph compression method to address the scalability issue of molecular graph generation. We identified six small substructural patterns that commonly appear between atom pairs in real-world molecules. Given a molecular graph, we converted the relevant substructures into new edges by representing them using additional edge features in the compressed graph representation. A generative model was constructed in a more efficient and scalable manner by training the model on the compressed representation. By conducting an experimental investigation using the GuacaMol benchmark, we found that the proposed method reduced the number of nodes significantly without any information loss. The generative model constructed on the compressed representation achieved performance comparable to that of the baseline methods regarding molecular graph generation. Although mitigating the high computational complexity intrinsically imposed on molecular graph generation has been challenging, this work successfully demonstrated that the molecular graph compression approach can effectively alleviate the complexity. We expect that this approach will be more effective with the better identification of data-specific substructural patterns that can be regarded as edge features. The use of the compressed representation contributes to a substantial reduction in the computational cost and memory usage, making it scalable to large molecules. This approach can be applied to other molecular graph generation methods to improve their efficiency and scalability, which merits further investigations.