Before describing the details of the kGCN system, basic implementation techniques for the graph representation of molecules and graph convolution are discussed.
Graph representation of molecules for GCN
This section first describes the formalization of a molecule as an input for GCNs. A molecule is formalized as a tuple \(\mathcal{M} \equiv (V,E,F)\), where V is a set of nodes. A node represents an atom in a molecule. A node has features \(\mathbf {f}_i \in F\ (i \in V)\), and F is a set of feature vectors representing atom properties such as atom type, formal charge, and hybridization. These features should be appropriately designed by users. E is a set of edges, and an edge \(e \in E\) represents a bond between two atoms, i.e., \(e \in V \times V \times T\), where T is a set of bond types. For each bond type \(t \in T\), an adjacency matrix \(\mathbf {A}^{(t)}\) is defined as follows:
$$\begin{aligned} (\mathbf {A}^{(t)})_{i,j} = {\left\{ \begin{array}{ll} 1 & (v_i,v_j,t) \in E \\ 0 & (v_i,v_j,t) \notin E \end{array}\right. }, \end{aligned}$$
where \((\cdot )_{i,j}\) represents the element in the i-th row and j-th column. Similarly, the feature matrix is defined as:
$$\begin{aligned} (\mathbf {F})_{j,k}=(\mathbf {f}_j)_k \end{aligned}$$
where \((\cdot )_{k}\) represents the k-th element of a vector.
Using these matrices, a molecule is represented by \(\mathcal{M'} = (\mathbf {A},\mathbf {F})\), where \(\mathbf {A} = \{\mathbf {A}^{(t)} \mid t \in T\}\). The present framework uses RDKit [29] to create the adjacency and feature matrices and employs \(\mathcal{M'}\) as the input for the GCN.
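As an illustration, the following minimal sketch builds the adjacency matrices \(\mathbf {A}^{(t)}\) and a feature matrix \(\mathbf {F}\) with RDKit and NumPy. The bond-type list and the three atom features are simplified assumptions for illustration, not kGCN's exact preprocessing.

```python
# A minimal sketch (not kGCN's exact preprocessing) of building the
# adjacency matrices A^(t) and the feature matrix F with RDKit.
import numpy as np
from rdkit import Chem

# Assumed bond-type set T; kGCN's actual set may differ.
BOND_TYPES = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE,
              Chem.BondType.TRIPLE, Chem.BondType.AROMATIC]

def mol_to_graph(smiles):
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    # One symmetric N x N adjacency matrix per bond type t.
    adjs = np.zeros((len(BOND_TYPES), n, n), dtype=np.float32)
    for bond in mol.GetBonds():
        t = BOND_TYPES.index(bond.GetBondType())
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        adjs[t, i, j] = adjs[t, j, i] = 1.0
    # Toy atom features: atomic number, formal charge, hybridization.
    feats = np.array([[a.GetAtomicNum(), a.GetFormalCharge(),
                       int(a.GetHybridization())]
                      for a in mol.GetAtoms()], dtype=np.float32)
    return adjs, feats

adjs, feats = mol_to_graph("c1ccccc1O")  # phenol: 7 heavy atoms
```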
Graph convolutional network
kGCN supports GCNs in addition to standard feed-forward neural networks. Therefore, GCNs for molecules are described first. The graph convolution layer, graph dense layer, and graph gather layer are defined as described below.
Graph convolution layer
The graph convolution is calculated from the input \(\mathbf {X}^{(\ell )}\) of the \(\ell\)-th layer as follows:
$$\begin{aligned} \mathbf {X}^{(\ell +1)} = \sigma \left( \sum _t \tilde{\mathbf {A}}^{(t)} \mathbf {X}^{(\ell )} \mathbf {W}^{(\ell )}_t \right) , \end{aligned}$$
where \(\mathbf {X}^{(\ell )}\) is an \(N \times D^{(\ell )}\) matrix, \(\mathbf {W}^{(\ell )}_t\) is the parameter matrix (\(D^{(\ell )} \times D^{(\ell +1)}\)) for bond type t, \(\sigma\) is the activation function, and \(\tilde{\mathbf {A}}^{(t)}\) is the normalized adjacency matrix (\(N \times N\)). The normalization and the implementation of the layers follow Kipf's model [30] by default. There are various choices for implementing graph convolution layers; in the kGCN system, the implementation used for the layers can be easily switched by changing the initial setting file for building the model.
The GCN is based on this graph convolution operation. The input of the first layer, \(\mathbf {X}^{(1)}\), often corresponds to the feature matrix \(\mathbf {F}\).
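The following is a minimal sketch of this operation as a Keras layer, assuming the adjacency matrices have already been normalized; kGCN's actual implementation may differ in its details.

```python
# Sketch of the graph convolution: X^(l+1) = sigma(sum_t A~^(t) X^(l) W_t).
import tensorflow as tf

class GraphConv(tf.keras.layers.Layer):
    def __init__(self, out_dim, num_bond_types):
        super().__init__()
        self.out_dim = out_dim
        self.num_bond_types = num_bond_types

    def build(self, input_shape):
        in_dim = input_shape[1][-1]  # inputs are (adjs, X)
        # One parameter matrix W_t per bond type t.
        self.w = self.add_weight(name="w",
                                 shape=(self.num_bond_types, in_dim, self.out_dim),
                                 initializer="glorot_uniform")

    def call(self, inputs):
        adjs, x = inputs  # adjs: (T, N, N), normalized; x: (N, D)
        out = tf.add_n([adjs[t] @ x @ self.w[t]
                        for t in range(self.num_bond_types)])
        return tf.nn.relu(out)  # sigma = ReLU here
```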
Graph dense layer
The graph dense layer takes \(\mathbf {X}^{(\ell )}\) as input, and \(\mathbf {X}^{(\ell +1)}\) is calculated as follows:
$$\begin{aligned} \mathbf {X}^{(\ell +1)} = \mathbf {X}^{(\ell )} \mathbf {W}^{(\ell )}, \end{aligned}$$
where \(\mathbf {X}^{(\ell )}\) is an \(N \times D^{(\ell )}\) matrix and \(\mathbf {W}^{(\ell )}\) is a parameter matrix (\(D^{(\ell )} \times D^{(\ell +1)}\)).
Graph gather layer
This layer converts a graph into a vector [31]; i.e., the input \(\mathbf {X}^{(\ell )}\) is an \(N \times D^{(\ell )}\) matrix, and the output \(\mathbf {X}^{(\ell +1)}\) is a \(D^{(\ell )}\)-dimensional vector computed as
$$\begin{aligned} (\mathbf {X}^{(\ell +1)})_{j}=\sum _i (\mathbf {X}^{(\ell )})_{i,j}, \end{aligned}$$
where \((\cdot )_{j}\) represents the j-th element of a vector. This operation converts a matrix into a vector by summing the feature vectors over all nodes.
Figure 2 shows an example of a GCN for a prediction task. The GCN model is a neural network consisting of a graph convolutional layer (GraphConv) with batch normalization (BN) [32] and rectified linear unit (ReLU) activation, a graph dense layer with ReLU activation, a graph gather layer, and a dense layer with softmax activation. By assigning a label suitable for each task to the compounds, this model can be applied to many types of tasks, e.g., ADMET prediction based on chemical structures.
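Under the same assumptions, this architecture can be sketched for a single (unbatched) molecule as follows; GraphConv is the sketch class from the previous listing, and the layer sizes are illustrative.

```python
# Sketch of the Fig. 2 model: GraphConv+BN+ReLU, graph dense, gather, softmax.
import tensorflow as tf

class GCNModel(tf.keras.Model):
    def __init__(self, num_bond_types, hidden_dim=64, num_classes=2):
        super().__init__()
        self.conv = GraphConv(hidden_dim, num_bond_types)  # GraphConv + ReLU
        self.bn = tf.keras.layers.BatchNormalization()     # BN over nodes
        self.gdense = tf.keras.layers.Dense(hidden_dim, activation="relu")
        self.out = tf.keras.layers.Dense(num_classes, activation="softmax")

    def call(self, inputs):
        adjs, x = inputs
        h = self.bn(self.conv((adjs, x)))  # graph convolution block
        h = self.gdense(h)                 # graph dense layer (node-wise)
        g = tf.reduce_sum(h, axis=0)       # graph gather: sum over atoms
        return self.out(g[None, :])        # dense layer with softmax
```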
Figure 3 shows an example of a multi-task GCN for a prediction task. The only difference from the single-task model is that multiple labels are predicted as output. With this type of neural network, multiple labels associated with a molecule, such as several types of ADMET properties, can be predicted simultaneously. Multi-task prediction is well known to improve performance compared with individual single-task predictions [33].
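For instance, a multi-task variant can be sketched by attaching one softmax head per task to the gathered graph vector; the task count and layer sizes below are illustrative assumptions.

```python
# Sketch: one softmax output per task on top of the shared graph vector.
import tensorflow as tf

num_tasks, num_classes, hidden_dim = 3, 2, 64
g = tf.keras.Input(shape=(hidden_dim,))  # gathered graph vector
heads = [tf.keras.layers.Dense(num_classes, activation="softmax",
                               name=f"task_{i}")(g)
         for i in range(num_tasks)]
multi_task_head = tf.keras.Model(inputs=g, outputs=heads)
```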
Figure 4 shows an example of a multi-modal neural network employing a graph representing a compound and the sequence of a protein. In addition to the information derived from the molecular structure, information from other modalities can also be used as input. An example of activity prediction using compound- and protein-related information is described in detail in the Experiment section.
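A conceptual sketch of such a multi-modal network is shown below: a compound branch (the gathered graph vector) and a protein branch are concatenated before the final classifier. The protein encoder and all sizes are illustrative assumptions, not the architecture used in the Experiment section.

```python
# Sketch of a multi-modal network combining a compound and a protein branch.
import tensorflow as tf

compound_vec = tf.keras.Input(shape=(64,))   # output of the graph gather layer
protein_seq = tf.keras.Input(shape=(1000,))  # numerically encoded sequence
h_p = tf.keras.layers.Dense(64, activation="relu")(protein_seq)
h = tf.keras.layers.Concatenate()([compound_vec, h_p])
out = tf.keras.layers.Dense(2, activation="softmax")(h)
multimodal_model = tf.keras.Model([compound_vec, protein_seq], out)
```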
The kGCN system supports the operations described above and some additional operations for building a neural network. These operations are implemented using TensorFlow [34] and are compatible with Keras [35], allowing users to combine them with Keras operations to construct neural networks such as convolutional neural networks and recurrent neural networks [13].
These neural networks include hyper-parameters such as the number of layers in a model and the number of dimensions of each layer. To determine these hyper-parameters, the kGCN system includes Bayesian optimization.
Visualization of graph convolutional network
To identify the features of a molecule that influence the prediction result, a visualization system using the integrated gradients (IG) method [22] was developed. After the construction of the prediction model, the atom importance in the molecular structure can be visualized based on the IG value \(\mathcal{I}(x)\) derived from the prediction model.
The IG value \(\mathcal{I}(x)\) is defined as follows:
$$\begin{aligned} \mathcal{I}(x) = \frac{x}{M}\sum _{k=1}^M \nabla S\left(\frac{k}{M}x\right), \end{aligned}$$
where x is the input representing an atom of a molecule, M is the number of divisions of the input, S(x) is the prediction score, i.e., the neural network output for input x, and \(\nabla S(x)\) is the gradient of S(x) with respect to the input x. In the default setting, M is set to 100. The atom importance is defined as the sum of the IG values of the features of each atom, and it is calculated on a compound-by-compound basis.
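The formula above can be transcribed numerically as follows; `score_fn` is a hypothetical function returning the scalar prediction score S for an input feature matrix.

```python
# Numerical sketch of the IG approximation and atom importance.
import tensorflow as tf

def integrated_gradients(score_fn, x, m=100):
    """x: (N, D) float feature matrix; returns per-feature IG values
    and the per-atom importance (sum of IG values over the features)."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    total = tf.zeros_like(x)
    for k in range(1, m + 1):
        xk = (k / m) * x
        with tf.GradientTape() as tape:
            tape.watch(xk)
            s = score_fn(xk)        # prediction score S((k/M) x)
        total += tape.gradient(s, xk)
    ig = x / m * total              # I(x) = (x / M) * sum_k grad S((k/M) x)
    atom_importance = tf.reduce_sum(ig, axis=1)
    return ig, atom_importance
```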
The evaluation of visualization results depends on the individual case. Methods for visualizing deep learning results are still under development, and their effectiveness for common problems has not yet been established; however, a quantitative evaluation of the IG values of molecules was previously reported for reaction prediction [36].
Hyper-parameter optimization
To optimize neural network models, hyper-parameters such as the number of graph convolution layers, the number of dense layers, the dropout rate, and the learning rate should be determined. As it is difficult to determine all these hyper-parameters manually, kGCN allows automatic hyper-parameter optimization with Gaussian-process-based Bayesian optimization using the Python library GPyOpt [37].
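A minimal sketch of such an optimization loop with GPyOpt is shown below; `train_and_validate` is a hypothetical user-supplied function (stubbed here with a dummy score) that trains a model with the given hyper-parameters and returns the validation loss.

```python
# Sketch of Gaussian-process-based Bayesian optimization with GPyOpt.
import GPyOpt

domain = [
    {"name": "num_conv_layers", "type": "discrete",   "domain": (1, 2, 3, 4)},
    {"name": "learning_rate",   "type": "continuous", "domain": (1e-5, 1e-2)},
    {"name": "dropout_rate",    "type": "continuous", "domain": (0.0, 0.5)},
]

def train_and_validate(num_layers, lr, dropout):
    # Placeholder: train the model with these hyper-parameters and
    # return the validation loss (dummy value here).
    return (num_layers - 2) ** 2 + abs(lr - 1e-3) + dropout

def objective(params):
    num_layers, lr, dropout = params[0]  # GPyOpt passes a 2D array
    return train_and_validate(int(num_layers), lr, dropout)

opt = GPyOpt.methods.BayesianOptimization(f=objective, domain=domain)
opt.run_optimization(max_iter=30)
print(opt.x_opt, opt.fx_opt)  # best hyper-parameters and score
```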
Interfaces
This section describes three interfaces in the kGCN system.
Command-line interface
The kGCN system provides a command-line interface suitable for batch execution. Although data processing is designed according to the aim of each analysis, there are standard procedures common to many designs, e.g., the series of steps required for cross-validation. The kGCN commands cover these common procedures; i.e., the kGCN system allows preprocessing, learning, prediction, cross-validation, and Bayesian optimization using the following commands:
- kgcn-chem command: allows preprocessing of molecule data, e.g., structure-data files (SDF) and SMILES.
- kgcn command: allows batch execution related to prediction tasks: supervised training, prediction, cross-validation, and visualization.
- kgcn-opt command: allows batch execution related to hyper-parameter optimization.
These commands can be combined with Linux commands, enabling users to construct automatic scripts, e.g., Bash scripts, as sketched below. Because such batch execution is suitable for large-scale experiments on workstations and for reproducible experiments, this interface is useful for the evaluation of neural network models.
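For example, a hypothetical batch pipeline combining these commands might look as follows; the subcommand and option names are illustrative assumptions and may differ from those of the actual release.

```bash
#!/bin/bash
# Hypothetical batch pipeline; flags are assumptions for illustration.
kgcn-chem --input compounds.sdf --label labels.csv --output dataset.jbl
kgcn train --config config.json          # supervised training
kgcn train_cv --config config.json       # cross-validation
kgcn visualize --config config.json      # IG-based visualization
kgcn-opt --config opt_config.json        # Bayesian optimization
```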
KNIME interface
The kGCN system supports KNIME modules as a GUI. KNIME is a platform for preparing workflows, which consist of KNIME nodes for data processing, and is particularly popular in the field of data science. The kGCN KNIME nodes described below are useful for executing various kGCN functions in combination with existing KNIME nodes. Whereas the command-line interface allows batch execution, the KNIME interface is suitable for the early steps of the machine learning process, such as prototyping and data preparation.
To train and evaluate the model, kGCN provides the following two nodes.
- GCNLearner: trains the model from a given dataset. This node receives the training dataset and provides the trained model as an output. Detailed settings such as batch size and learning rate can be set as node properties.
- GCNPredictor: predicts the labels from a given trained model and a new dataset.
Figure 5 shows an example workflow using the kGCN nodes mentioned above. This workflow can be separated into the parts before and after GCNLearner. The former part is for data preparation, for which kGCN includes the following KNIME nodes:
- CSVLabelExtractor: reads labels from a CSV file for training and evaluation.
- SDFReader: reads the molecular information from an SDF.
- GraphExtractor: extracts the graph from each molecule.
- AtomFeatureExtractor: extracts the features from each molecule.
- GCNDatasetBuilder: constructs the complete dataset by combining input and label data.
- GCNDatasetSplitter: splits the dataset into training and test datasets.
The test dataset is used for the evaluation and interpretation of the results. kGCN also provides the following modules to display the results.
- GCNScore: provides the scores of the prediction model, such as accuracy.
- GCNScoreViewer: displays the graph of ROC scores in an image file.
- GCNVisualizer: computes the IG values and atom importance.
- GCNGraphViewer: displays the atom importance in an image file.
Another example of the workflow is shown in Fig. 6, which includes an example of multi-modal neural networks. To design multi-modal neural networks, the kGCN system provides the following modules:
- AdditionalModalityPreprocessor: reads the data of another modality from a given file.
- AddModality: adds the data of another modality to the dataset.
To change the single-task workflow into a multi-modal one, the AddModality node should be added immediately after the GCNDatasetBuilder node.
The visualization process shown at the bottom right of Fig. 6 requires computation time that depends on the number of molecules to be visualized, as the integrated gradient computation takes 1–5 s per molecule during GPU execution. To reduce the size of the dataset, GCNDatasetSplitter can be used to select a part of the dataset.
Python interface
The kGCN system also provides a Python library that allows programmers to tune the settings of an analysis more precisely. The kGCN system can be used like any standard Python library and supports pip, the standard Python package manager. Furthermore, the kGCN system can be used in Jupyter Notebook, an interactive interface; therefore, users can easily explore this library on Google Colaboratory, a cloud environment for executing Python programs.
The kGCN system adopts an interface similar to scikit-learn, the de facto standard machine learning library in Python. Therefore, a process using the kGCN library consists of preprocessing, training by the fit method, and evaluation by the pred method, in this order, so users familiar with scikit-learn can access the kGCN library in a similar manner. Furthermore, because kGCN is compatible with the Keras library, designing a neural network, which is necessary for using kGCN, is easy for users who are familiar with Keras.
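A minimal sketch of this flow is shown below, using a hypothetical wrapper class; the real kGCN classes and signatures may differ, but the fit/pred order mirrors the description above.

```python
# Hypothetical scikit-learn-like wrapper around a Keras-style model.
import tensorflow as tf

class SimpleEstimator:
    """Illustrative wrapper exposing fit/pred, as described in the text."""
    def __init__(self, model, epochs=50):
        self.model = model
        self.epochs = epochs

    def fit(self, x, y):
        self.model.compile(optimizer="adam",
                           loss="sparse_categorical_crossentropy")
        self.model.fit(x, y, epochs=self.epochs)
        return self

    def pred(self, x):
        return self.model.predict(x)

# Usage: estimator = SimpleEstimator(model); estimator.fit(X, y); estimator.pred(X_new)
```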
To demonstrate the wide applicability of the present framework, three sample programs comprising datasets and scripts using the standard functions of kGCN are available on the framework web pages. In addition to these examples, the application of kGCN to reaction prediction, built on the kGCN system, has been reported in a prior study [36], where the reaction centers visualized from the GCN predictions were consistent with those reported in the literature.
Flexible user interfaces
As described in the introduction and implementation sections, kGCN provides a KNIME GUI, a command-line interface, and a programming interface to support users with various skill levels. For example, the easy-to-use, high-level GUI can help chemists with limited programming knowledge use kGCN and understand SAR at the molecular level. Conversely, machine learning professionals with good programming skills are expected to focus on improving algorithms using the lower-level Python interface. Using the Python interface, users can make machine learning procedures more flexible and incorporate kGCN functions into their own programs, such as web services. Users with good programming skills can also use the command-line interface to automate data-analysis procedures with kGCN functions, because it is easy to construct pipelines combined with other commands, such as Linux commands.