Investigation of the structure-odor relationship using a Transformer model

The relationships between molecular structures and their properties are subtle and complex, and the properties of odor are no exception. Molecules with similar structures, such as a molecule and its optical isomer, may have completely different odors, whereas molecules with completely distinct structures may have similar odors. Many works have attempted to explain the molecular structure-odor relationship from chemical and data-driven perspectives. The Transformer model is widely used in natural language processing and computer vision, and the attention mechanism included in the Transformer model can identify relationships between inputs and outputs. In this paper, we describe the construction of a Transformer model for predicting molecular properties and interpreting the prediction results. The SMILES data of 100,000 molecules are collected and used to predict the existence of molecular substructures, and our proposed model achieves an F1 value of 0.98. The attention matrix is visualized to investigate the substructure annotation performance of the attention mechanism, and we find that certain atoms in the target substructures are accurately annotated. Finally, we collect 4462 molecules and their odor descriptors and use the proposed model to infer 98 odor descriptors, obtaining an average F1 value of 0.33. For the 19 odor descriptors that achieved F1 values greater than 0.45, we also attempt to summarize the relationship between the molecular substructures and odor quality through the attention matrix.


Introduction
Smell plays an important role in all aspects of life and is thus an important property of all compounds. The relationship between molecular structure and odor quality is an essential research topic. Studies on this relationship may lead to predictions of the odor of a molecule, odor synthesis, and even the artificial synthesis of molecules with specific odors. However, studying the odors of different substances is challenging. A previous study [1] showed that molecules with similar structures may have very different odors, while molecules with similar odors may have completely distinct structures. In addition to the subtle relationship between molecular structure and odor, aspects such as sex, age, and disease history can affect odor perception. Therefore, special training is required to label the odors of substances, which increases the difficulty of labeling the odors of chemical compounds. Thus, to date, the relationship between molecular structure and odor remains difficult to specify.
Zheng et al. Journal of Cheminformatics (2022) 14:88
Machine learning has been applied in a wide range of fields, including physics and chemistry, and various molecular structure property prediction methods have been proposed [2][3][4][5]. These methods can be divided into feature-based methods and feature-free methods according to the type of data that are input into the model. Feature-based methods take the generated fixed molecular features (such as molecular fingerprints and molecular parameters) as model inputs and use various algorithms (e.g., random forest and support vector machines) to predict the molecular properties. Feature-free methods predict specific molecular properties by automatically extracting molecular features that are related to those properties using methods such as graph neural networks [6] or graph kernels [7]. In addition to predicting molecular properties such as water solubility and lipophilicity, feature-free methods use artificial neural networks to predict additional essential properties, such as the molecular energy, dipole moment and molecular dynamics [8,9], allowing us to compute this information faster than using computational chemistry methods. In molecular property prediction, the interpretability of the model is particularly important [10], as model interpretability allows us to investigate the relationship between molecular structure and different properties at the molecular, atomic, and subatomic levels. Although feature-based methods use fixed features, the resulting model usually provides some interpretability. In contrast, feature-free methods flexibly extract features according to the properties to be predicted; however, the models are often not interpretable. Therefore, we aim to develop a feature-free method that allows interpretation of the extracted features.
At present, approximately 4000 odorants have been labeled with their corresponding odor. The smells of odorants have been labeled with odor descriptors (ODs), such as 'sweet,' 'fruity,' and 'green.' These data introduce the possibility of using data-driven approaches in molecular structure-odor studies. Several studies have used machine learning methods for OD prediction. For example, Keller et al. [11] used molecular parameters to predict the scores of 19 kinds of odors, achieving a correlation coefficient of 0.55. In contrast to most studies on OD prediction, this study attempted to predict scores corresponding to ODs through regression rather than classification, making it difficult to compare the results with those of the studies mentioned below. Shang et al. [12] predicted 10 ODs using molecular parameters, achieving an F1 value greater than 0.8. However, data augmentation was applied by synthesizing similar data points based on the original dataset before dividing the dataset into training and test sets. Therefore, the test set was essentially contaminated. Sanchez-Lengeling et al. [13] combined two datasets and predicted 138 ODs using a graph neural network (GNN) [3,14], with the previous output layer applied to cluster the ODs. Although the average F1 value was 0.36, the clustering results showed that the outputs in the last layer were closer to each other when the corresponding molecules were labeled with ODs in similar categories. Chacko et al. [15] used the same dataset as Keller et al. [11] to predict the pleasantness and intensity of odors, as well as two ODs (sweet and musky). The corresponding F1 values of the two ODs on the test set were 0.84 and 0.69. The dataset used by Chacko et al. contained 480 samples, and the ratio of the training set to the test set was 9:1. Thus, the results may not be stable because of the small number of samples in the test dataset. Debnath and Nakamoto [16] predicted three ODs (fruity, green, and sweet) using the mass spectra of different molecules and achieved an average F1 value of 0.51.
In recent years, the Transformer model has been widely used in image processing [17,18] and natural language processing [19,20] because of its flexible attention mechanism. In addition to processing sentences and images, the Transformer model can take more flexible input forms (such as graphs) by using relative positional embedding [21,22]. In terms of interpretability, the Transformer model results can naturally be interpreted according to its attention mechanism. Several Transformer models for molecular property prediction have been developed in recent years. Karpov et al. [23] used the SMILES data of molecules in the form of strings as the model input and predicted various molecular properties, such as the melting and boiling points. When molecules are represented in nonstring forms, the relative positional information between atoms must be used as one of the inputs to the model. Maziarka et al. [24,25] predicted molecular properties by adding the relative positional information of the atoms to the attention matrix, and Maziarka et al. [26] used carefully designed functions to express the positional relationship between atoms based on Maziarka et al. [24]. Both of these works interpret the model by visualizing the attention mechanism in the encoder. Hutchinson et al. [27] and Thölke [28] predicted several more essential properties, such as the molecular energy, dipole moment, and molecular dynamics. They not only used carefully designed functions to express the positional relationship between atoms but also computed the outputs according to a more physical approach. For example, they predicted the atomic forces by computing the derivative of the predicted atomic energies with respect to the relative position.
In this research, we adopt a feature-free method and use the Transformer model to predict ODs. We first predict the existence of molecular substructures using the Transformer model and then evaluate the performance of the attention mechanism in terms of model interpretability by visualizing the attention matrix. Finally, we use the model to predict ODs and visualize the attention matrices.
The main contributions of this study can be summarized as follows:
• We fine-tune a Transformer model for predicting molecular properties and interpreting the results.
• Experiments are conducted to predict the existence of various substructures and to investigate the interpretability of the attention mechanism in the Transformer model.
• The developed Transformer model is used to predict ODs, and the attention matrix is visualized to identify OD structural features.

Model
The original Transformer model [19] was developed for machine translation and consists of an encoder and a decoder. Each layer in the encoder contains one attention module, which can be regarded as a self-attention mechanism through which each word in the input sentence interacts with related words. Each layer in the decoder contains two attention modules. The first attention module is also a self-attention mechanism that enables the word that is currently being translated to communicate with other translated words. The second attention module is used to obtain information about the source language for the current word. A sentence is considered as a sequence of words. By adding position information as a positional embedding to the embedding of the input word, the original Transformer can consider the word order. Molecules are three-dimensional (3D) structures that are composed of atoms. The relationship between atoms in a molecule cannot be represented by the positional embeddings used in the original Transformer because the bonds between atoms must be represented.
The Molecular Attention Transformer (MAT) model [24] was developed to predict molecular properties such as water solubility and blood-brain barrier penetration. The MAT model provides a creative solution for identifying the relationship between atoms. As shown on the left side of Fig. 1, the MAT model replaces positional embedding by adding adjacency and distance matrices to the attention matrix. The attention mechanism in the MAT model is formulated as

$$\mathrm{Attention}(Q, K, V) = \left( \lambda_1\, \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) + \lambda_2\, g(D) + \lambda_3\, A \right) V \qquad (1)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters; $Q$, $K$, and $V$ are the query matrix, key matrix, and value matrix (as in the original Transformer); $d_k$ is the dimension of the key vectors; $D$ and $A$ are the distance matrix and adjacency matrix, respectively; and $g(d) = \exp(-d)$ is an elementwise function.
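As a rough illustration of the weighted-sum attention that the text attributes to the MAT model, the following NumPy sketch combines a softmax self-attention term with $g(D) = \exp(-D)$ and the adjacency matrix. The function name `mat_attention`, the equal weights of 1/3, and the toy four-atom chain are assumptions made for this example and are not taken from the MAT implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def mat_attention(Q, K, V, D, A, lambdas=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of self-attention, g(D) = exp(-D), and adjacency.

    The three weights play the role of the hyperparameters in Eq. (1);
    the equal values used by default are an arbitrary choice for this sketch.
    """
    lam_attn, lam_dist, lam_adj = lambdas
    d_k = Q.shape[-1]
    weights = (
        lam_attn * softmax(Q @ K.T / np.sqrt(d_k))
        + lam_dist * np.exp(-D)
        + lam_adj * A
    )
    return weights @ V, weights


# Toy molecule: a chain of four atoms with random features and coordinates.
rng = np.random.default_rng(0)
n_atoms, d_k = 4, 8
Q, K, V = (rng.normal(size=(n_atoms, d_k)) for _ in range(3))
A = np.array(
    [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float
)
coords = rng.normal(size=(n_atoms, 3))
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

out, weights = mat_attention(Q, K, V, D, A)
```

Because the three terms are added, a pair of distant, unbonded atoms still receives a nonzero weight whenever its softmax term is large.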
In this study, we propose a model based on the original Transformer and MAT models. We do not use more complex interatomic distance formulas or more distant neighborhood information as direct inputs to the model, as done by Maziarka et al. [26]. Instead, we expect the model to automatically learn more complex distance and adjacency relationships through multiple heads and multiple encoder layers. The proposed model has three key features: (1) it changes the attention calculation; (2) it adds a decoder-like structure to the model to improve interpretability; and (3) it introduces a contrastive loss function to the model. In the MAT model, attention is calculated by summing the inner product between the atom attributes, the adjacency matrix, and the distance matrix. According to Eq. (1), if the inner product between two atoms is large and these two atoms are far away from each other, information is still exchanged between the two atoms, which is undesirable. We therefore calculate the attention as

$$\mathrm{Attention}(X) = \left( \mathrm{softmax}\!\left( \frac{Q_{adj} K_{adj}^{\top}}{\sqrt{d_k}} \right) \odot A \right) V_{adj} + \left( \mathrm{softmax}\!\left( \frac{Q_{dist} K_{dist}^{\top}}{\sqrt{d_k}} \right) \odot g(D) \right) V_{dist} \qquad (2)$$

where $\odot$ denotes the elementwise product and $Q_{adj} = X W^{Q}_{adj}$ ($X$ is the input to the encoder layer and $W^{Q}_{adj}$ is a learnable parameter); $Q_{dist}$, $K_{adj}$, $K_{dist}$, $V_{adj}$, and $V_{dist}$ can be obtained in the same way. On the basis of Eq. (2), message passing between two atoms based on their inner product value occurs only when the atoms are connected by a chemical bond or the atoms are close to each other.
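The gating idea described above can be sketched as follows. This is one possible reading of the text, not the authors' implementation: softmax attention weights are computed separately for an adjacency branch and a distance branch, and each is multiplied elementwise by $A$ or by $g(D) = \exp(-D)$, so that a large inner product alone cannot pass messages between distant, unbonded atoms. The function name `gated_attention`, the dictionary of weight matrices, and the placement of the gate after the softmax are all assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def gated_attention(X, A, D, W):
    """Two-branch attention gated by adjacency and by exp(-distance).

    W is a dict of hypothetical projection matrices with keys
    'adj_q', 'adj_k', 'adj_v', 'dist_q', 'dist_k', 'dist_v'.
    """

    def branch(prefix, gate):
        Q = X @ W[prefix + "_q"]
        K = X @ W[prefix + "_k"]
        V = X @ W[prefix + "_v"]
        d_k = Q.shape[-1]
        # Gate the softmax weights so that they vanish where the gate is zero.
        weights = softmax(Q @ K.T / np.sqrt(d_k)) * gate
        return weights @ V, weights

    out_adj, w_adj = branch("adj", A)
    out_dist, w_dist = branch("dist", np.exp(-D))
    return out_adj + out_dist, w_adj, w_dist


# Toy molecule: a chain of four atoms with random features and coordinates.
rng = np.random.default_rng(0)
n_atoms, d_model, d_k = 4, 16, 8
X = rng.normal(size=(n_atoms, d_model))
A = np.array(
    [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float
)
coords = rng.normal(size=(n_atoms, 3))
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
names = ("adj_q", "adj_k", "adj_v", "dist_q", "dist_k", "dist_v")
W = {name: rng.normal(size=(d_model, d_k)) for name in names}

out, w_adj, w_dist = gated_attention(X, A, D, W)
```

In this sketch the adjacency branch assigns exactly zero weight to every unbonded pair, while the distance branch decays smoothly with interatomic distance.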
In the MAT model, the output of the encoder is directly passed through a pooling layer before the molecular properties are predicted by the fully connected layers. We add a decoder-like module, similar to the original Transformer, to visualize the relationship between the atoms and outputs. The proposed model is shown in Fig. 2. In natural language, the words in a sentence are related to each other. However, in most cases, ODs are not necessarily related to each other. Therefore, we use a decoder-like module, namely, the Transformer decoder without the self-attention mechanism, as shown on the right side of Fig. 1. The output of the Transformer encoder is transmitted to this decoder-like module. As shown in Fig. 2, the input to the decoder-like module is $\mathrm{embedded\ cls}_i$, which is obtained by passing a scalar of value 1 through a single fully connected network. Thus, $\mathrm{embedded\ cls}_i$ is a learnable input for target $i$. The attention in the decoder-like module is computed by considering $\mathrm{embedded\ cls}_i$ as the query and the outputs of the encoder as the key and value. This attention mechanism in the decoder-like module is expected to obtain better predictions by emphasizing atoms that are related to the molecular properties, thereby enabling the visualization of important substructures that affect the prediction results.
The contrastive loss function has been widely used in self-supervised learning in recent years [29,30]. The application of the contrastive loss to supervised learning [31] can also improve model performance. We directly apply this contrastive loss to our model, and the definition of the loss function is shown in Eq. (3):

$$\mathcal{L} = \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \qquad (3)$$

where $P(i)$ is a set that includes all samples whose labels are the same as sample $i$, $|\cdot|$ is a function that counts the number of elements in a set, $z_i$ is a feature vector with unit length corresponding to sample $i$, $A(i)$ is a set that includes all samples in the batch except sample $i$, and $\tau$ is a hyperparameter.
The contrastive loss function brings feature vectors of samples with the same label closer while separating feature vectors of samples with different labels.
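The behavior of the supervised contrastive loss can be checked with a small NumPy sketch. It follows the standard form of the loss from the supervised contrastive learning literature [31] for a single batch with one integer label per sample; the function name, the default temperature of 0.1, and the single-label setting are assumptions for illustration, and how the loss is combined with the multi-label OD targets is not specified here.

```python
import numpy as np


def supervised_contrastive_loss(z, labels, tau=0.1):
    """Supervised contrastive loss summed over all anchors in a batch.

    z: (n, d) feature vectors, labels: (n,) integer labels.
    Anchors without any same-label partner in the batch are skipped.
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-length features
    n = len(labels)
    sim = z @ z.T / tau
    not_self = ~np.eye(n, dtype=bool)  # A(i): every sample except i

    # Log of the denominator, computed with the log-sum-exp trick.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    positives = (labels[:, None] == labels[None, :]) & not_self  # P(i)
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / n_pos[has_pos]).sum())


# Two tight clusters of features.
z = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
loss_matching = supervised_contrastive_loss(z, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(z, [0, 1, 0, 1])
```

When the labels agree with the clusters, the loss is close to zero; when they cut across the clusters, it is much larger, which is the pull-together and push-apart behavior described above.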

Experiment
We conducted two experiments in this research. The first experiment aimed to predict whether the input molecule has some specific substructure using our proposed Transformer model. The second experiment predicted ODs using the proposed Transformer model. We used two different datasets for these two experiments. We collected SMILES data for 100,000 molecules from ChEMBL [32] for the substructure predictions. The ChEMBL database is a bioactive dataset covering more than 2 million compounds, which ensures that we have sufficient data for substructure prediction. For the OD prediction experiment, we collected 4,240 odorants and their corresponding ODs from TheGoodScentsCompany [33]. Among the datasets that provide OD labels, TheGoodScentsCompany provides more data that is easier to obtain. In addition to odorants, we collected 222 molecules that were annotated as odorless from TheGoodScentsCompany. RDKit with default settings was used to compute the atomic properties, adjacency matrices, and distance matrices of all molecules. For both datasets, we removed molecules for which the distance matrix could not be calculated and molecules with more than 60 atoms. Finally, 98,324 and 4,365 samples were used in the substructure prediction and OD prediction experiments, respectively. The model inputs were the atomic properties presented in Table 1. The code used for the experiments can be found at [34].

Substructure prediction
The purpose of this experiment was to test the performance of the Transformer model in predicting the existence of substructures and to investigate the interpretation ability of the model by visualizing the attention mechanism in the decoder-like module. We designed 24 substructures and combinations of multiple substructures and predicted these substructures with our proposed model (Fig. 3). The 98,324 samples were divided into training and test sets at a ratio of 5:1. Because we have sufficient data in this experiment and predicting the existence of substructures is a relatively simple task, we did not consider a wide range of hyperparameter settings. The hyperparameter settings examined in this experiment are listed in Table 2. In addition to the parameters listed in Table 2, similar to the original Transformer, each encoder and decoder layer includes an attention module and a two-layer pointwise feedforward network with the same number of units as the dimension of the atomic attributes, both of which end with a dropout layer with a rate of 0.1. The layers used to convert the Q, K, and V matrices do not use an activation function, and the final output layer uses the sigmoid activation function; all other fully connected layers use ReLU as the activation function. The learning rate was set to 7e-5 in this experiment.
In the multihead attention mechanism, each decoder layer contains multiple attention matrices. Hence, our visualization results are the sum of the attention matrices of multiple heads. An example of visualizing the attention in the decoder-like module is shown in Fig. 4. In addition to visualizing the attention matrices, we attempted to quantify the performance of the attention mechanism. In later experiments, we visualized the sum of the attention matrices of all decoder layers and all heads. Therefore, the sum of the attention values of all atoms in a molecule is the product of the number of heads ($n_h$) and the number of decoder layers ($n_{dc}$). The performance of the attention mechanism in terms of identifying related atoms is evaluated as

$$s_i = \frac{1}{n_h\, n_{dc}} \sum_{j \in T_i} a_{ij} \qquad (4)$$

where $a_{ij}$ is the summed attention value of atom $j$ in molecule $i$ and $T_i$ is the set of atoms in molecule $i$ that belong to the target substructure. (The value of Eq. (4) should be between 0 and 1, with larger values indicating that the attention mechanism better identifies atoms related to the target substructure.) In addition to computing the sum of the attention values of the target atoms, we calculated the variance of the attention values of the target atoms. A large variance indicates that the attention mechanism tends to identify only some of the target atoms, while a small variance denotes that the attention mechanism uniformly identifies the target atoms. For each molecule, the variance is calculated as

$$v_i = \frac{1}{|T_i|} \sum_{j \in T_i} \left( a_{ij} - \bar{a}_i \right)^2, \qquad \bar{a}_i = \frac{1}{|T_i|} \sum_{j \in T_i} a_{ij}$$

where $|T_i|$ is the number of atoms that belong to the target.
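The two quantities described above can be computed per molecule as in the following sketch. It assumes that the attention values have already been summed over all heads and decoder layers, so that they add up to $n_h \times n_{dc}$ over the atoms, and that the variance is taken over the raw summed values of the target atoms. The function name and the toy numbers are hypothetical.

```python
import numpy as np


def target_attention_stats(attn, target_idx, n_heads, n_dec_layers):
    """Share of attention on the target atoms and its variance.

    attn: summed attention value per atom (all heads, all decoder layers).
    target_idx: indices of the atoms that belong to the target substructure.
    """
    attn = np.asarray(attn, dtype=float)
    target = attn[list(target_idx)]
    # Normalize by the total attention mass so that the score lies in [0, 1].
    score = target.sum() / (n_heads * n_dec_layers)
    # Population variance over the target atoms only.
    variance = target.var()
    return score, variance


# Toy molecule with four atoms, two heads, and one decoder layer;
# the attention values therefore sum to 2.
attn = [0.8, 0.8, 0.2, 0.2]
score, variance = target_attention_stats(attn, [0, 1], n_heads=2, n_dec_layers=1)

# Same total attention on the target, but concentrated on one atom.
score_peaked, variance_peaked = target_attention_stats(
    [1.5, 0.1, 0.2, 0.2], [0, 1], n_heads=2, n_dec_layers=1
)
```

Both toy cases place 80% of the attention on the target atoms, but only the second has a large variance, which corresponds to the situation in which the attention mechanism marks only some of the target atoms.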

OD prediction
In the OD prediction experiment, we compared the following six models:
• Proposed model: a model with the attention calculated using Eq. (2).
• MAT-attn: a model with the attention calculated using Eq. (1).
• ADJ-only: a model with the attention calculated using Eq. (7).
• DIST-only: a model with the attention calculated using Eq. (8).
• Simplified decoder: the model shown in Fig. 5, which was created based on the proposed model by simplifying the decoder-like module to a sum pooling layer.
• MAT-model: the original MAT model.
ADJ-only and DIST-only were used to investigate the role of the adjacency and distance matrices. The simplified decoder model was used to investigate the effect of the decoder-like module.
The ratio of the training set to the test set was fixed at 5:1.
The hyperparameter settings used in this experiment are listed in Table 3. In this experiment, class weights were used in the loss function; i.e., for each OD, the weight of each negative sample was 1, and the weight of a positive sample was equal to (number of all samples − number of positive samples)/number of positive samples.
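The class-weight rule can be written out as a short sketch. The weight computation follows the formula given above; the weighted binary cross-entropy that consumes it is an assumption (a natural choice for a sigmoid output layer), and both function names are hypothetical.

```python
import math


def positive_class_weights(labels):
    """Per-OD weight for positive samples: (n_all - n_pos) / n_pos.

    labels: list of rows, one row per molecule, with a 0/1 entry per OD.
    Negative samples always have weight 1.
    """
    n_all = len(labels)
    weights = []
    for j in range(len(labels[0])):
        n_pos = sum(row[j] for row in labels)
        # An OD without positive samples has no defined weight; use 0 here.
        weights.append((n_all - n_pos) / n_pos if n_pos else 0.0)
    return weights


def weighted_bce(y_true, y_prob, pos_weight, eps=1e-12):
    """Binary cross-entropy for one OD with up-weighted positive samples."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


# Four molecules, two ODs: the first OD is rare, the second is balanced.
labels = [[1, 0], [0, 0], [0, 1], [0, 1]]
weights = positive_class_weights(labels)
loss = weighted_bce([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5], weights[0])
```

With this rule, the total weight of the positive samples of an OD equals the total weight of its negative samples, so rare descriptors are not dominated by their negatives.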

Substructure prediction results
The best average F1 value for the 24 substructure prediction experiment was 0.983, which was achieved with 12 15-dimensional heads, six encoder layers, and one decoder layer. The individual F1 values for the 24 substructures were all greater than 0.9, as shown in Fig. 3. The average F1 values for the other hyperparameter settings are listed in Table 2 and are generally very similar to one another. In summary, our proposed Transformer model can detect the existence of substructures and combinations of substructures. Next, we investigated the ability of the attention mechanism to interpret the prediction results by visualizing the attention matrix in the decoder-like module. We visualized only true positive (TP) samples (positive samples that were predicted correctly). The visualization results of the model that achieved the best average F1 value (six encoder layers and one decoder layer) are shown in Fig. 6. We visualize the results of 3 substructures in Fig. 6; the visualizations of the other substructures show the same trends as these 3 substructures. More TP results corresponding to each substructure can be found at [34]. According to Fig. 6, for No. 1, the attention mechanism identifies only part of the atoms in the target instead of all the atoms included in the target substructure. For substructure No. 11, the attention mechanism identifies only O-O in the target and does not identify the single O. This result shows that the attention mechanism does not identify the atoms in the substructures that are similar to the target. For substructure No. 23, even molecules that contain only cCc are identified as positive, and the attention mechanism identifies both CC(C)C and cCc. This result shows that the attention mechanism can identify atoms in all composition substructures related to the target. Figure 6 shows that the attention mechanism clearly identifies several atoms contained in the target substructures. 
To investigate the role of the attention mechanism in multiple decoder layers, we visualized models with two, three, and four decoder layers. Figure 7 shows the visualization results for each individual decoder layer. When the model has multiple decoder layers, the attention mechanism in each decoder layer can identify atoms related to the target substructure, which inspired us to visualize the sum of the attention mechanisms in all decoder layers. Figure 8 shows the visualization results of the summed attention, illustrating that models with

OD prediction results
The hyperparameter settings used in the OD prediction experiment and the optimal OD prediction settings are presented in Table 3. The results of our proposed model and the comparison models are shown in Table 4. The proposed model and the ADJ-only model achieve very similar results. Therefore, the attention values calculated by Eqs. (2) and (7) have similar effects on the results. We expected to introduce the 3D structure information of the molecules through g(D) in Eq. (2); however, the experimental results show that adding the distance information in this way does not enable the model to use the 3D structure information. This finding may be because there are relatively few samples, or the model itself may not have the ability to learn 3D structural information according to the distance matrix. The proposed model and the MAT-attn model obtain similar F1 results. Therefore, we conducted an approximate randomization test to verify whether the differences between these two results were meaningful. The p value was 0.009 when we compared the proposed and MAT-attn models. The best average F1 value was achieved by the model with two decoder layers. Unlike the substructure prediction experiment, visualizing the attention of the first decoder layer shows that the attention mechanism tends to identify all atoms with similar values. This result may be caused by having relatively few samples. In fact, the same phenomenon was observed in the substructure prediction experiment when using the same number of samples as in the OD prediction experiment. However, even if we increase the number of samples to approximately 100,000 and perform the OD prediction experiment, there may still be a tendency for the attention mechanism of the first decoder layer to mark all atoms with similar values, because the factors affecting the odor of a molecule are highly complex and it may be necessary to collect information about the whole molecule to predict the odor.
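For readers unfamiliar with the approximate randomization test, the following is a minimal sketch of a common paired variant for comparing the F1 values of two models on the same test samples. It handles a single binary label, whereas the comparison above concerns F1 values averaged over ODs; the function names, the number of trials, and the toy data are assumptions and do not reproduce the reported p value.

```python
import random


def f1_score(y_true, y_pred):
    """F1 value for binary labels; returns 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def approximate_randomization(y_true, pred_a, pred_b, trials=2000, seed=0):
    """Two-sided p value for the F1 difference between two models.

    In every trial, the two predictions for each sample are swapped with
    probability 0.5, and the absolute F1 difference is recomputed.
    """
    rng = random.Random(seed)
    observed = abs(f1_score(y_true, pred_a) - f1_score(y_true, pred_b))
    hits = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(f1_score(y_true, shuffled_a) - f1_score(y_true, shuffled_b))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
pred_a = [1, 1, 1, 0, 0, 0, 1, 0]
pred_b = [1, 0, 1, 0, 1, 0, 0, 0]
p_value = approximate_randomization(y_true, pred_a, pred_b)
p_identical = approximate_randomization(y_true, pred_a, pred_a)
```

A small p value indicates that the observed F1 difference would rarely arise if the two sets of predictions were interchangeable.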
Regarding attention visualization, we first visualized the attention of the model that achieved the best F1 value. We visualized the attention of the second decoder-like layer. More visualization results can be found at [34]. Nineteen ODs obtained F1 values greater than 0.45. To ensure that the visualization results are meaningful, we visualize only these 19 ODs. Figure 9 shows several visualization results of TP samples for 'fruity', 'musk', 'aldehydic', and 'fatty'. For these four ODs, the attention mechanism tends to identify C(=O)O, carbon in a large ring, C=O, and long carbon chains, respectively. However, for the remaining ODs, no obvious features are marked in the corresponding positive samples.
According to the substructure visualization experiment results, the attention mechanism annotates only certain atoms in the substructures instead of all related atoms. The atoms in each substructure are randomly annotated by the attention mechanism; that is, the marked atoms vary depending on the model initialization. To determine the substructures associated with the ODs, we repeatedly trained the models with the same hyperparameter settings and visualized the atoms that were frequently annotated by the attention mechanisms in the different models. Specifically, we trained the models with the same hyperparameters 100 times and created a counter for each atom in each molecule in the samples. For each model, we then identified the top k atoms in a given molecule with the largest attention values and increased the counters corresponding to these k atoms by 1. Finally, we determined the atoms with counter values greater than n. We visualized the attention mechanisms of 100 models with k = 5 and n = 50.
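The counting procedure described above can be sketched for a single molecule as follows. The attention values of the repeatedly trained models are simulated with random noise here, and the function name and toy molecule are hypothetical; the example uses k = 2 because the toy molecule has only eight atoms, whereas the experiment above used k = 5 and n = 50.

```python
import numpy as np


def frequently_annotated_atoms(attn_per_model, k=5, n=50):
    """Atoms that rank among the top-k attention values in more than n models.

    attn_per_model: array of shape (n_models, n_atoms) for one molecule.
    Returns the selected atom indices and the per-atom counters.
    """
    attn_per_model = np.asarray(attn_per_model, dtype=float)
    counts = np.zeros(attn_per_model.shape[1], dtype=int)
    for attn in attn_per_model:
        top_k = np.argsort(attn)[::-1][:k]  # indices of the k largest values
        counts[top_k] += 1
    return np.flatnonzero(counts > n), counts


# Simulate 100 trained models on a molecule with 8 atoms in which
# atoms 2 and 5 consistently receive high attention.
rng = np.random.default_rng(0)
n_models, n_atoms = 100, 8
attn_per_model = rng.uniform(0.0, 0.1, size=(n_models, n_atoms))
attn_per_model[:, [2, 5]] += 1.0

selected, counts = frequently_annotated_atoms(attn_per_model, k=2, n=50)
```

Aggregating over many runs filters out atoms that are marked only because of a particular initialization, leaving those that are annotated consistently.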
Since we considered 100 models, when we visualized the TP and TN samples (true negatives, i.e., negative samples that were predicted correctly), we chose positive samples that 90 of the 100 models predicted as positive and negative samples that 85 models predicted as negative. Figure 10 shows the partial results of the TP samples of 19 ODs. In Fig. 10, because k is limited to five atoms, the 'fatty' and 'musk' visualization results are not as good as those in Fig. 9. For the other ODs, we can observe some clear features. According to the TP sample visualization results, we attempted to summarize the feature substructures for each OD, and the summary results are shown in the 4th column of Table 5. The number of positive samples in the test set corresponding to the 19 ODs is shown in the second column of Table 5. (We note that an OD corresponds to multiple feature substructures, and we summarize the features that appear most frequently in the visualization results.)
odorant, and we took the average score assigned by these 55 individuals as the odorant label. The dataset used in this study was labeled by different people, and the labeling standards may vary from person to person. The F1 score of 'sweet' was approximately 0.50 with our datasets. This result may be influenced by the small number of samples and the lack of consistency in the labels across the large amount of collected data.

Conclusion
In this study, we used a machine learning approach to investigate the relationship between molecular structure and odor. We first built a Transformer model to predict the molecular properties and interpret the prediction results. We modified the attention calculation in the encoder based on the MAT model and used a decoder-like module to interpret related substructures associated with ODs. We applied the proposed model to predict substructures in molecules and investigated the role of the attention mechanisms in the decoder layers. The results show that when we have a sufficient number of samples, the attention mechanisms can identify some, but not all, of the atoms in the target substructures. This result demonstrates that the prediction results can be interpreted by visualizing the attention mechanism. Finally, we predicted 98 ODs with the proposed model and summarized the substructures associated with the 19 ODs by visualizing the attention mechanism. With additional odor labeling data, we expect to obtain better F1 results and clearer attention visualization results, thereby enabling a better understanding of the relationship between molecular structure and odor.