A self-attention based message passing neural network for predicting molecular lipophilicity and aqueous solubility

Efficient and accurate prediction of molecular properties, such as lipophilicity and solubility, is highly desirable for rational compound design in chemical and pharmaceutical industries. To this end, we build and apply a graph-neural-network framework called self-attention-based message-passing neural network (SAMPN) to study the relationship between chemical properties and structures in an interpretable way. The main advantages of SAMPN are that it directly uses chemical graphs and breaks the black-box mold of many machine/deep learning methods. Specifically, its attention mechanism indicates the degree to which each atom of the molecule contributes to the property of interest, and these results are easily visualized. Further, SAMPN outperforms random forests and the deep learning framework MPN from Deepchem. In addition, another formulation of SAMPN (Multi-SAMPN) can simultaneously predict multiple chemical properties with higher accuracy and efficiency than other models that predict one specific chemical property. Moreover, SAMPN can generate chemically visible and interpretable results, which can help researchers discover new pharmaceuticals and materials. The source code of the SAMPN prediction pipeline is freely available at Github (https://github.com/tbwxmu/SAMPN).


Introduction
Accurate and reliable prediction of molecular properties is an important ingredient in drug discovery and chemical material projects [1][2][3]. Characterizing quantitative structure-bioactivity/structure-property relationships (QSAR/QSPR) of compounds has always been a hot topic in medicinal and material chemistry [2,4], but such relationships are usually difficult to elucidate with heuristic rules or empirical measurements. Machine learning (ML) methods, such as random forests (RF) and support vector machines (SVM), have aided the discovery process of new chemical drugs and materials [2,5,6]. For example, random forests models with atom pair descriptors have been used by many pharmaceutical companies to construct QSAR models [7], and Bayesian optimization models have been used to design nanostructures for phonon transport [8]. More recently, however, neural-network-based methods have greatly accelerated this field and will be briefly discussed below.
Many ML methods first convert chemical molecules into a computer-interpretable representation, utilizing physicochemical properties from experimental/computational measurements [9] or by using molecular fingerprints. Physiochemical properties include mass, charge, refractivity, and many other physical features of the molecules. The most widely used molecular conversion, however, is the molecular fingerprint, which encodes a molecular structure into a series of binary digits (a bit vector) [10] based on substructures that may or may not be pre-defined, depending on the class of fingerprints being used. For example, extended-connectivity fingerprints (ECFP) can split one molecule into many substructures (not pre-defined) and encode all of them into just one bit vector with different identifiers [11]. ENREF_10 Alternatively, bit vectors may be extended into count vectors that indicate the number of each substructure found in the molecule, not just its presence/absence.
Compared to the previously-mentioned traditional methods, artificial neural networks (ANNs) have become increasingly popular in predicting molecular properties. For example, a three-layered ANN with E-state indices was used to predict aqueous solubility of organic molecules [15]. More recently, graph-based networks were applied to predict lipophilicity and solubility [16]. These network-based models have shown impressive results and made good contributions for developing new methods.
Fixed fingerprint feature extraction rules of molecules are useful to accurately reflect underlying chemical substructures, though these may not be the best-suited representation for all tasks. Hence, researchers have to spend much time and effort to carefully determine which features are most relevant to their models. This is especially problematic with the utilization of physical features, which may require advanced variable selection techniques or a high-level of empirical knowledge. In contrast, some deep learning networks based on the simplified molecular input line entry system (SMILES) [12] codes can automatically learn the molecular features [13,14]. However, this may cause the model to focus on the SMILES grammar and not the implicated molecular structure. This limitation of the SMILES-based deep learning models is hard to avoid as the SMILES representation is not designed to capture molecular similarity. Generally, molecules with similar chemical structures can be encoded into very different SMILES strings. Even for the same molecular structure, there are often nonunique SMILES strings as Fig. 1A displays. Though the process of generating canonical SMILES is well known, the process is inconsistent among chemical toolkits. For example, the 'canonical' SMILES code for caffeine is CN1C=NC2=C1C(=O)N(C)C(=O)N2C according to RDKit, Cn1cnc2c1c(=O)n(C)c(=O)n2C according to Obabel, and CN1C=NC2=C1C(=O)N(C(=O)N2C)C according to PubChem.
Using the natural chemical graph instead of the SMILES representation may be more suitable for chemical property predictions. Briefly, a graph consists of nodes and edges that connect two or more nodes to one another. Analogously, a chemical graph considers atoms as nodes and bonds as the edges connecting atoms to Fig. 1 Conversion of a chemical structure into a mathematical graph. a A chemical structure usually has a unique graph but multiple SMILES strings. b Relationship list between node indices and edge indices, which are converted from the chemical graph. c The lists of Node2Edge, Edge2Node, Edge2Revedge and Node2NeiNode, derived from (b) one another. Our formulation considers these edges as bidirectional, meaning that the bond connecting atom A to atom B is the same as the bond connecting atom B to atom A. An example chemical graph can be seen in Fig. 1a.
Essential chemical properties such as molecular validity are more easily represented in two-dimensional chemical graphs than linear SMILES. Unlike SMILES codes chemical graphs are invariant to molecule permutations, i.e., one molecular structure has one graph but multiple SMILES representations. Recently, graph-based deep learning models are reported in QSAR and QSPR studies [7,[17][18][19][20][21]. However, according to these references, predictions are difficult to interpret, since most neural networks act as black boxes [22].
In this paper, we describe a self-attention-based message-passing neural network (SAMPN) model, which is a modification of Deepchem's MPN [16] and is state-ofthe-art in deep learning. It directly learns the most relevant features of each QSAR/QSAPR task in the learning process and assigns the degree of importance for substructures to improve the interpretability of prediction. Our SAMPN graph network utilizes the chemical graph structure described above, where each edge is derived from the chemical bond and each atom is the node. Both our message passing neural network (MPN) and SAMPN model can be used as multi-target models (Multi-MPN or Multi-SAMPN), which can learn not only the relationship between chemical structures and properties, but also the relationship between intrinsic attributes of molecules. To demonstrate our computational methods, we chose lipophilicity and aqueous solubility as the target properties as they were very important chemical descriptors that pervade every aspect of bioactivity, drug metabolism and pharmacokinetic (DMPK) profiles [23].
To our knowledge, this is the first time that a model like SAMPN has been used to predict chemical properties from experimental data for QSPR studies. The results from our experiments demonstrate that our SAMPN network yields superior performance relative to traditional ML-based models and previous deep-learning models (i.e., Deepchem's MPN [16]). Furthermore, the predictions of SAMPN are easily understood and visualized, since the integrated attention mechanism can color the atoms of the molecule based on their contributions to the property of interest.

Datasets and data process
Datasets of molecular lipophilicity and aqueous solubility were used for developing and testing our method. Lipophilicity is usually quantified by the n-octanol/ water partition coefficient P and preferentially displayed in a logarithmic form as logP. The raw lipophilicity data was downloaded from CHEMBL3301361 deposited by AstraZeneca [24] and includes 4200 molecules. Aqueous solubility is the saturated concentration of the chemical in the aqueous phase, which is usually displayed with unit log(mol/L) and is represented as logS. This dataset was downloaded from the online chemical database and modeling environment (OCHEM) [25] and includes 1311 experimental records. The dataset distributions are plotted in Additional file 1: Fig. S1.
As both datasets are small relative to the typical size requirements of deep learning models, we use tenfold stratified cross-validation [13,23,35], where each dataset was randomly split into a training and validation set (80% and 10%, respectively) for parameter selection and a test dataset (10%) for model comparisons. Then, we repeated all experiments three times with different random seeds. This process ensures that the model does not simply memorize the training and is capable of generalizing to new molecules.
For the initial data preprocessing, duplicate molecules were removed so that each chemical structure in the data was unique, while the maximum one of the related properties was kept. Molecules unrecognized by RDkit (version 2019.3) [26], a cheminformatics toolkit implemented in Python, were also deleted. Only two columns ('smiles' and 'experimental value') were kept as the input data to our models. Each downloaded SMILES representation was then converted into a directed graph before training the SAMPN model using the MPN encoder, which was adapted from Deepchem and Chemprop [27,28]. The directed graphs were mainly composed of index lists of nodes and edges shown in Fig. 1c. Take the substructure of N-C as an example: a chemical bond between the N and C atoms can derive two edges (C:0 → N:1 and N:0 → C:1). The number of nodes is equal to the number of atoms and the number of edges is always double the number of bonds, since we consider edges to be bidirectional.

Message passing network encoder
Instead of manually selected features, using molecular graph structures directly was first reported in 1994 [29]. In recent years, graph-based methods have been used to analyze various aspects of chemical systems [14,30] and compare with fingerprints [31]. Graph-based models provide a natural way to describe chemical molecules, where atoms in the molecule are equivalent to nodes and chemical bonds to the edges in a graph. The message-passing network is a variant of the graph-theoretical approaches, which gradually merges information from distant atoms by extending radially through bonds as displayed in Fig. 2. Those passing messages were used to encode all substructures of a molecule by an adaptive learning approach, which extracts useful representations of molecules suited to the target predictions.
The message passing network encoder works as follows in Eqs. (1)(2)(3). The passing message M from atom x to atom y in the d-th iteration (message passing depth) is calculated as follows: Here, Re is the activation function (Relu). W inp and W h are the learned weight matrices. As we use the edgedependent neural network to pass a message, node feature f x is concatenated with edge feature f xy to form the merged node-edge feature f x f xy . Node feature f x , is derived by atom type, formal charge, valence, and aromaticity. Similarly, edge feature f xy is derived from bond order, ring status and direction connection. The definitions of node f x features and edge f xy features are displayed in Table 1. The initial message M d=1 xy , which x sends to y, is generated from the merged node-edge feature f x f xy by a neural network as described in Eq. (1).
In a chemical graph, atoms denote the node set x∈V, and bonds denote the edge set (x,y)∈E. Each edge has its own direction in the SAMPN model. N(x) or N(y) stands for the group of neighbor nodes of x or y, respectively. z ∈ N (x)\y means the neighbors of x do not contain y.
Node x is allowed to send a message to a neighbor node y only after node x has received messages from all neighbor nodes except y. We use the skip connection in the message passing steps as in Fig. 2 (displayed in between neighbor features and self-features). This skip connection allows the message to pass a very long distance without vanishing gradient problem when using backpropagation. The generated messages exchange and update based on the merged node-edge feature and the previous message passing step as Eq.
(2) defined. The latent vector h y of each node, take Node 2's latent vector h 2 as an example in Fig. 2, is obtained by  where, h y captures the local chemical structure features based on the passing depth, and W o and W ah are the learned weight matrices. More detailed information of SAMPN algorithm can be found in Additional file 1: Table S1 in Supporting Materials. Applying the above Eqs. (1-3) on a chemical graph generates the final graph representation G = {h 1 … h i … h n }, which combines with the self-attention mechanism and fully-connected neural networks to make the final prediction.

Self-attention mechanism
All hidden states of a node are directly combined into a single vector, which may not make the difference among the learned features explainable [32]. A better way is to apply the attention mechanism to obtain a context vector for the target node by focusing on its neighbors and local environment. Take Node 2 as an example (the blue node in Fig. 2), after several message passing steps, Node 2 has hidden state h 2 , which represents the substructure centered at Atom 2. Meanwhile, all the rest nodes have the same process and h n represents the substructure centered at Atom n. Since different substructures have different contribution to the molecular property, we can use the attention mechanism to capture the different influences of substructures in contributing to the target molecular property. A self-attention layer is then added to identify the relationship between the substructure contribution to the target property of a molecule. A dot-product attention algorithm was implemented to take the whole molecular graph representation G as the input. The self-attentive weighted molecule graph embedding can be formed as follows: where W att is the self-attention score that implicitly indicates the contribution of local chemical graph to the target property. As G = {h 1 … h i … h n }, each row of W att is the attention weight between the i-th atom and the rest atoms in the molecule. E G is the attentive embedding matrix, where each row corresponds to the attention weighted hidden vector of the node. Then, the global average pooling is used on the sum of G and E G to get the molecule latent vector as Fig. 2 shows in the purple rectangle. Finally, the latent vector is combined with several layers of fully connected networks for the target property prediction.

Model training and hyperparameter optimization
The code for the MPN encoder was mainly adapted from Deepchem and Chemprop [27,28]. Both the MPN encoder and self-attention mechanism were implemented with Python and Pytorch version 1.0, an open-source framework for deep learning [33]. MPN, Multi-MPN, SAMPN and Multi-SAMPN models were trained with the Adam optimizer using the same learning rate schedule in [34].
Multiple metrics were used to evaluate the performance of our models: mean absolute error (MAE), root mean squared error (RMSE), mean squared error (MSE), coefficient of determination (R 2 ) and Pearson correlation coefficient (PC). Lower values of MAE, MSE, and RMSE indicate better predictive performance. Conversely, higher values for PC and R 2 indicate better models or better fits for the data. While some of these metrics tell the same story, the inclusion of all of these values may provide a rich benchmark for future studies.
A grid search algorithm was used to adjust the hyperparameters with Hyperopt package version 0.1.2 [35]. Table 2 shows the hyperparameters to be optimized and the search space. We chose RMSE on the validation set as the metric to find the most suitable combination of the hyperparameters within the search space. In the lipophilicity-QSPR task, one of the best combinations of hyperparameters was {'activation': 'ReLU'; 'depth': 4; 'dropout': 0.25; 'layers of fully connected networks': 2; 'hidden size': 384}. All the message passing neural network models (MPN, SAMPN, Multi-MPN and Multi-SAMPN) utilized the above hyperparameters to test the final performance with using the tenfold stratified cross-validation on the whole dataset. In addition to using the published results from Deepchem's MPN, we also built a pure MPN model to establish a baseline without the self-attention and all the rest configurations were kept the same to SAMPN. To compare the single-task and multi-target based deep learning network, we built the multi-MPN and multi-SAMPN. The multi-target-based model used a merged molecule dataset from 'Lipophilicity' and 'Water Solubility' as described in Supporting Materials. All the used parameters were kept the same between MPN and SAMPN.

Random forest
To compare our SAMPN method with the traditional machine learning methods, we chose a random forest model as the baseline. Random forest (RF) [36] is a supervised learning algorithm with an ensemble of decision trees generated from a bootstrapped (bagged) sampling of compounds and features. It is widely used in the traditional structure-property relation research [37], and was considered as a "gold standard" according to its robustness, easy usage and high prediction accuracy in structure-property relationship research [38]. Here, the ECFP with a fixed length of 1024 [12] was used with the RF model, which was implemented in Python 3.6.3 [39] with the package Scikit-learn, version 0.21.2 [40]. For the RF model, more trees generally increase performance and make predictions more stable, but it also slows down the computation heavily. We set 500 trees for a good balance point as suggested in [16] for most QSPR studies.

Lipophilicity and solubility prediction
In each QSPR task, we built RF, MPN, SAMPN, multi-MPN and multi-SAMPN models to explore the relationship between the target property and the molecular structure. For the lipophilicity prediction, both singletarget based and multiple-target based model have good performance according to RMSE (SAMPN: 0.579 ± 0.036; Multi-SAMPN: 0.571 ± 0.032). Without the self-attention mechanism, the performance of MPN decreased as Table 3 and Fig. 3a show. Nevertheless, the result of our new formulation of the MPN or Multi-MPN was still much better than the one from the MPN version of Deepchem (0.719 ± 0.031) [16], and performance increased even more with the inclusion of the attention mechanism. Our MPN model is different from Deepchem in that we did not use any recurrent neural networks (RNN) in our network architecture, which improved the speed of our MPN model in training.
For the solubility prediction, message passing based networks also greatly improved the performance over the traditional model (RF) as displayed in Table 3 and Fig. 3b. The MPN from Deepchem also displayed a good performance (0.580 ± 0.030 RMSD) on the water solubility prediction [16]; however, they used a small water solubility dataset (1128 molecules). For comparison purposes, we used their default setting MPN (Deepchem) on our water solubility data (1311 molecules) with our stratified cross-validation protocol. The scripts of the detail calculation process were available in our Github repository. After that, MPN (Deepchem) shows similar performance (0.676 ± 0.022) with our MPN (0.694 ± 0.050). Based on our model results (performance: SAMPN > MPN; Multi-SAMPN > Multi-MPN), the self-attention mechanism can improve the performance of message passing neural networks in both lipophilicity and solubility prediction. And multi-target models have better performance than single task-based model (performance: Multi-SAMPN > SAMPN; Multi-MPN > MPN). In our study, no matter choosing which metric as Fig. 3 displayed, we can get the same conclusion as we motioned above. The better predictive performance of the multi-target model is probably from the benefit that Multi-SAMPN or Multi-MPN can use the learned feature from lipophilicity to help the solubility prediction, and vice versa. It is worth mentioning that Multi-SAMPN or Multi-MPN can predict the lipophilicity and aqueous solubility simultaneously rather than step by step prediction like SAMPN Table 3 Models' performance (root-mean-square error) on lipophilicity database Italics represents the best performance in the results a Values were reported in [16]. In the lipophilicity prediction, we use the same dataset with Deepchem. In the water solubility prediction, our used dataset is larger than Deepchem used (1128 molecules) b Values were calculated from the same data and the same stratified crossvalidation protocol in our work

Visualize the attention
While higher prediction accuracy is always desirable, the ability to interpret a QSPR model is also important. Model comparison and interpretation can be facilitated by a visualization technique, making it possible to identify the learned features that drive compound property predictions. In the SAMPN model, we can obtain the attention weight scores from the self-attention mechanism. For a specific molecule, we obtain the difference between each atom's weight score and the average attention weight of the molecule. We define the above difference as the attention coefficient of each atom and those attention weight coefficients are very useful to gain insight into which parts of a molecule increase the target molecular property and which decrease it.
By using heat map coloring on each molecule (such as in Fig. 4a-f ), it is easy to see which parts of molecule play a more important role in the lipophilicity or water solubility of molecule. The lipophilicity and solubility heat maps are helpful for chemists to optimize the lipophilicity and solubility of a particular molecule. Consider Fig. 4b, a depiction of 1H-indazole after using our model. This molecule has a relatively high lipophilicity, as it has a large π-electron-conjugated system in its fused aromatic ring. However, the nitrogen-containing section of the molecule displays strong anti-lipophilic properties relative to the rest of the molecule. This may, in part, be due to nitrogen's contribution (as 'N' or 'NH') to a hydrogen bonding-network with its surroundings. Thus, altering 1H-indazole to disrupt that potential network may increase the molecule's lipophilicity. To test this hypothesis, we used SAMPN to predict the lipophilicity of benzo[d]isothiazole (Additional file 1: Fig. S2), the molecule made by exchanging the 'NH' of 1H-indazole with 'S' (sulfur). As expected, this change did increase the molecule's lipophilicity. Another example is the primary  Fig. 4f, which can easily form hydrogen bonds with water molecules. This is reflected in red for a predicted soluble feature.

Conclusions
In this work, we have proposed a self-attention-based message passing neural network for identifying the relationship between molecular lipophilicity/solubility and structure. Our SAMPN model outperforms the conventional random forests and the previous graph neural network-based model. By applying the attention mechanism, SAMPN can provide some insights on the atomic sources of lipophilicity/solubility, which is different from black box approaches that most machine learning and deep learning methods used. The results from SAMPN are easy to understand by coloring the attention scores directly on the molecular graph, which is useful as a guide to adjust the lipophilicity or solubility of one molecule. In addition, our message-passing neural networks can be easily trained as a multi-target model, which makes it better in computational efficiency and predictive performance. With this approach and our case study, we believe our method can be applied to other quantitative structure-property relationship studies and help chemists optimize molecular properties directly from the chemical structures.