Relative molecule self-attention transformer

The prediction of molecular properties is a crucial aspect of drug discovery that can save a lot of money and time during the drug design process. The use of machine learning methods to predict molecular properties has become increasingly popular in recent years. Despite advancements in the field, several challenges remain to be addressed, such as finding an optimal pre-training procedure to improve performance on small datasets, which are common in drug discovery. In our paper, we tackle these problems by introducing the Relative Molecule Self-Attention Transformer for molecular representation learning. It is a novel architecture that uses relative self-attention and 3D molecular representations to capture the interactions between atoms and bonds, enriching the backbone model with domain-specific inductive biases. Furthermore, our two-step pretraining procedure allows us to tune only a few hyperparameter values to achieve performance comparable with state-of-the-art models on a wide selection of downstream tasks.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-023-00789-7.


INTRODUCTION
Predicting molecular properties is of central importance to applications such as drug discovery or material design. Without accurate prediction of properties such as toxicity, a promising drug candidate is likely to fail clinical trials (Chan et al., 2019; Bender & Cortés-Ciriano, 2021). Many molecular properties cannot be feasibly computed (simulated) from first principles and instead have to be extrapolated from an often small experimental dataset. The prevailing approach is to train a machine learning model such as a random forest (Korotcov et al., 2017) or a graph neural network (Gilmer et al., 2017) from scratch to predict the desired property for a new molecule.
Machine learning is moving away from training models from scratch. In natural language processing (NLP), advances in large-scale pretraining (Devlin et al., 2018; Howard & Ruder, 2018) and the development of the Transformer (Vaswani et al., 2017) have culminated in large gains in data efficiency across multiple tasks (Wang et al., 2019a). Instead of training models purely from scratch, the models in NLP are commonly first pretrained on large unsupervised corpora. The chemistry domain might be at the brink of an analogous revolution, which could be transformative due to the high cost of obtaining large experimental datasets. A recent work has proposed Molecule Attention Transformer (MAT), a Transformer-based architecture adapted to processing molecular data (Maziarka et al., 2020) and pretrained using self-supervised learning for graphs (Hu et al., 2020). Several works have since shown further gains by improving network architecture or the pretraining tasks (Chithrananda et al., 2020; Fabian et al., 2020; Rong et al., 2020).
However, pretraining has not yet led to such transformative data-efficiency gains in molecular property prediction. For instance, non-pretrained models with extensive handcrafted featurization tend to achieve very competitive results (Yang et al., 2019a). We reason that architecture might be a key bottleneck. In particular, most Transformers for molecules do not encode the three-dimensional structure of the molecule (Chithrananda et al., 2020; Rong et al., 2020), which is a key factor determining many molecular properties. On the other hand, performance has been significantly boosted by enriching the Transformer architecture with proper inductive biases (Dosovitskiy et al., 2021; Shaw et al., 2018; Dai et al., 2019; Ingraham et al., 2021; Huang et al., 2020; Romero & Cordonnier, 2021; Khan et al., 2021; Ke et al., 2021). Motivated by this perspective, we methodically explore the design space of the self-attention layer, a key computational primitive of the Transformer architecture, for molecular property prediction. In particular, we explore variants of relative self-attention, which has been shown to be effective in various domains such as protein design and NLP (Shaw et al., 2018; Ingraham et al., 2021).

Our main contribution is a new design of the self-attention formula for molecular graphs that carefully handles various input features to obtain increased accuracy and robustness in numerous chemical domains. We tackle the aforementioned issues with Relative Molecule Attention Transformer (R-MAT), our pre-trained transformer-based model, shown in Figure 1. We propose Relative Molecule Self-Attention, a novel variant of relative self-attention, which allows us to effectively fuse distance and graph neighbourhood information (see Figure 2). Our model achieves state-of-the-art or very competitive performance across a wide range of tasks. Satisfyingly, R-MAT outperforms more specialized models without using extensive handcrafted featurization or adapting the architecture specifically to perform well on quantum prediction benchmarks. The importance of effectively representing distance and other relationships in the attention layer is evidenced by large performance gains compared to MAT.
An important inspiration behind this work was to unlock the potential of large pretrained models for the field, as they offer unique long-term benefits such as simplifying machine learning pipelines. We show that R-MAT can be trained to state-of-the-art performance by tuning only the learning rate. We also open-source weights and code as part of the HuggingMolecules package (Gaiński et al., 2021).

RELATED WORK
Pretraining coupled with an efficient Transformer architecture unlocked state-of-the-art performance in molecule property prediction (Maziarka et al., 2020; Chithrananda et al., 2020; Fabian et al., 2020; Rong et al., 2020; Wang et al., 2019b; Honda et al., 2019). First applications of deep learning did not offer large improvements over more standard methods such as random forests (Wu et al., 2018; Jiang et al., 2021; Robinson et al., 2020). Consistent improvements were enabled by more efficient architectures adapted to this domain (Mayr et al., 2018; Yang et al., 2019a; Klicpera et al., 2020). In this spirit, our goal is to further advance modeling for any chemical task by redesigning self-attention for molecular data.
Efficiently encoding the relation between tokens in self-attention has been shown to substantially boost the performance of Transformers in vision, language, music, and biology (Shaw et al., 2018; Dai et al., 2019; Ingraham et al., 2021; Huang et al., 2020; Romero & Cordonnier, 2021; Khan et al., 2021; Ke et al., 2021). The vanilla self-attention includes absolute encoding of position, which can hinder learning when the absolute position in the sentence is not informative. Relative positional encoding featurizes the relative distance between each pair of tokens, which led to substantial gains in the language and music domains (Shaw et al., 2018; Huang et al., 2020). However, most Transformers for the chemical domain used no positional encoding in the self-attention layer (Chithrananda et al., 2020; Fabian et al., 2020; Rong et al., 2020; Wang et al., 2019b; Honda et al., 2019; Schwaller et al., 2019), which gives rise to similar issues with representing relations between atoms. We directly compare to (Maziarka et al., 2020), who introduced the first self-attention module tailored to molecular data, and show large improvements across different tasks. Our work is also closely related to (Ingraham et al., 2021), which used relative self-attention fusing three-dimensional structure with positional and graph-based embeddings, in the context of protein design.

MOLECULAR SELF-ATTENTIONS
We first give a short background on how prior works have applied self-attention to molecules and point out their shortcomings.
Text Transformers Multiple works have applied the Transformer directly to molecules encoded as text using the SMILES representation (Chithrananda et al., 2020; Fabian et al., 2020; Wang et al., 2019b; Honda et al., 2019; Schwaller et al., 2019). SMILES is a linear encoding of a molecule into a string of characters according to a deterministic ordering algorithm (Weininger, 1988; Jastrzębski et al., 2016). For example, the SMILES encoding of carbon dioxide is C(=O)=O.
Adding a single atom can completely change the ordering of atoms in the SMILES encoding. Hence, the relative positions of individual characters are not easily related to their proximity in the graph or in space. This is in contrast to natural language processing, where the distance between two words in the sentence can be highly informative (Shaw et al., 2018; Huang et al., 2020; Ke et al., 2021). We suspect this makes the use of self-attention in SMILES models less effective. Another readily visible shortcoming is that the graph structure and the distances between atoms of the molecule are either encoded only implicitly or thrown out entirely.
Graph Transformers Several works have proposed Transformers that operate directly on a graph (Maziarka et al., 2020; Rong et al., 2020; Nguyen et al., 2019). The GROVER and U2GNN models take as input a molecule encoded as a graph (Rong et al., 2020; Nguyen et al., 2019).
In both of them, the self-attention layer does not have direct access to information about the graph. Instead, the information about the relations between atoms (existence of a bond or distance in the graph) is indirectly encoded by a graph convolutional layer that is run within each layer in GROVER, and only at the beginning in U2GNN. Similarly to Text Transformers, Graph Transformers also do not take into account the distances between atoms.
Structured Transformer, introduced in (Ingraham et al., 2021), uses relative self-attention that operates on amino acids in the task of protein design. The self-attention proposed by (Ingraham et al., 2021), similarly to our work, provides the model with information about the three-dimensional structure of the molecule. Like R-MAT, which encodes the relative distances between pairs of atoms, Structured Transformer uses relative distances between the modeled amino acids, although it encodes them in a slightly different way. We incorporate their ideas and extend them to enable processing of molecular data.
Molecule Attention Transformer Our work is closely related to Molecule Attention Transformer (MAT), a transformer-based model with self-attention tailored to processing molecular data (Maziarka et al., 2020). In contrast to the aforementioned approaches, MAT incorporates the distance information in its self-attention module. MAT stacks N Molecule Self-Attention blocks followed by a mean pooling and a prediction layer. For a D-dimensional state $x \in \mathbb{R}^{D}$, the standard, vanilla self-attention operation is defined as

$$\mathcal{A}(x) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad (1)$$

where $Q = xW_Q$, $K = xW_K$, and $V = xW_V$. Molecule Self-Attention extends Equation (1) to include additional information about bonds and distances between atoms in the molecule as

$$\mathcal{A}(x) = \left(\lambda_a\,\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) + \lambda_d\, g(D) + \lambda_g\, A\right)V,$$

where $\lambda_a, \lambda_d, \lambda_g$ are the weights given to the individual parts of the attention module, $g$ is a function given by either a softmax or an element-wise $g(d) = \exp(-d)$, $A$ is the adjacency matrix (with $A_{(i,j)} = 1$ if there exists a bond between atoms $i$ and $j$ and 0 otherwise), and $D$ is the distance matrix, where $D_{(i,j)}$ represents the distance between atoms $i$ and $j$ in 3D space.
Self-attention can relate input elements in a highly flexible manner. In contrast, there is little flexibility in how Molecule Self-Attention can use the information about the distance between two atoms. The strength of the attention between two atoms depends monotonically on their relative distance. However, molecular properties can depend in a highly nonlinear way on the distance between atoms. This has motivated works such as (Klicpera et al., 2020) to explicitly model the interactions between atoms using higher-order terms.
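To make the weighted combination above concrete, the following is a minimal, framework-free sketch of MAT-style Molecule Self-Attention on toy inputs. The function `mat_attention` and its argument names are illustrative, not the authors' implementation; the `scores` argument stands in for a precomputed $QK^{\top}/\sqrt{d_k}$.

```python
import math

def softmax(row):
    # numerically stable softmax over one row of attention scores
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def mat_attention(scores, A, D, V, lam_a, lam_d, lam_g):
    # scores: precomputed QK^T / sqrt(d_k); A: adjacency matrix;
    # D: 3D distance matrix; V: value vectors, one per atom
    n = len(scores)
    att = [softmax(scores[i]) for i in range(n)]
    # element-wise distance transform g(d) = exp(-d)
    g = [[math.exp(-D[i][j]) for j in range(n)] for i in range(n)]
    # mix the three attention sources with weights lam_a, lam_d, lam_g
    w = [[lam_a * att[i][j] + lam_d * g[i][j] + lam_g * A[i][j]
          for j in range(n)] for i in range(n)]
    dim = len(V[0])
    # weighted average of value vectors
    return [[sum(w[i][j] * V[j][k] for j in range(n)) for k in range(dim)]
            for i in range(n)]
```

Note how, for a fixed $\lambda_d$, the distance term contributes a strictly monotone function of distance, which is exactly the inflexibility discussed above.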

RELATIVE POSITIONAL ENCODING
In natural language processing, a vanilla self-attention layer does not take into account the positional information of the input tokens (i.e. if we permute the layer input, the output will stay the same).
In order to add positional information to the input data, the vanilla Transformer enriches it with an encoding of the absolute position. On the other hand, relative positional encoding (Shaw et al., 2018) adds the relative distance between each pair of tokens, which leads to substantial gains on the learned task. In our work, we use relative self-attention to encode information about the relative neighbourhood, distances, and physicochemical features between all pairs of atoms in the input molecule (see Figure 2).

ATOM RELATION EMBEDDING
Our core idea to improve Molecule Self-Attention is to add flexibility in how it processes graph and distance information. Specifically, we adapt relative positional encoding to processing molecules (Shaw et al., 2018; Dai et al., 2019; Huang et al., 2020; Ke et al., 2021), which we note was already hinted at in (Shaw et al., 2018) as a high-level future direction. The key idea in these works is to enrich the self-attention block to efficiently represent information about the relative positions of items in the input sequence.
What reflects the relative position of two atoms in a molecule? Similarly to MAT, we delineate three inter-related factors: (1) their relative distance, (2) their distance in the molecular graph, and (3) their physicochemical relationship (e.g. whether they are within the same aromatic ring).
In the next step, we depart from Molecule Self-Attention (Maziarka et al., 2020) and introduce new factors into the relation embedding. Given two atoms represented by vectors $x_i, x_j \in \mathbb{R}^{D}$, we encode their relation using an atom relation embedding $b_{ij} \in \mathbb{R}^{D}$. This embedding is then used in the self-attention module after a projection layer. Next, we describe the three components that are concatenated to form the embedding $b_{ij}$.
Neighbourhood embeddings First, we encode the neighbourhood order between two atoms as a 6-dimensional one-hot encoding, carrying information about how many other vertices lie between nodes i and j in the original molecular graph (see Figure 2 and Table 4 in Appendix A).
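The neighbourhood order can be computed by breadth-first search on the molecular graph. The sketch below is illustrative: the function names are ours, and the exact index layout (orders of four or more sharing one bucket, index 5 reserved for dummy nodes, following Table 4 in Appendix A) is our reading of the paper rather than the reference implementation.

```python
from collections import deque

def hop_distance(adj, i, j):
    # BFS shortest-path length (in bonds) between atoms i and j
    if i == j:
        return 0
    seen, queue = {i}, deque([(i, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in range(len(adj)):
            if adj[node][nxt] and nxt not in seen:
                if nxt == j:
                    return d + 1
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # disconnected

def neighbourhood_one_hot(adj, i, j, is_dummy=()):
    # 6-dimensional one-hot: orders 1..3 get their own slot,
    # orders >= 4 share one bucket, slot 5 flags dummy nodes
    vec = [0] * 6
    if i in is_dummy or j in is_dummy:
        vec[5] = 1
        return vec
    d = hop_distance(adj, i, j)
    if d is not None and d >= 1:
        vec[min(d, 4)] = 1
    return vec
```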
Distance embeddings As we discussed earlier, we hypothesize that a much more flexible representation of the distance information should be facilitated in MAT. To achieve this, we use the radial basis distance encoding proposed by (Klicpera et al., 2020):

$$\tilde{e}_{\mathrm{RBF},n}(d) = \sqrt{\frac{2}{c}}\,\frac{\sin\!\left(\frac{n\pi}{c}d\right)}{d},$$

where d is the distance between two atoms, c is the predefined cutoff distance, $n \in \{1, \ldots, N_{\mathrm{emb}}\}$, and $N_{\mathrm{emb}}$ is the total number of radial basis functions that we use. The obtained values are then passed through the polynomial envelope function

$$u(d) = 1 - \frac{(p+1)(p+2)}{2}d^{p} + p(p+2)d^{p+1} - \frac{p(p+1)}{2}d^{p+2},$$

with p = 6, in order to obtain the final distance embedding.
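A compact numerical sketch of this encoding, assuming the envelope is applied to the cutoff-scaled distance as in (Klicpera et al., 2020); the function names and default $N_{\mathrm{emb}}$ are ours, for illustration only.

```python
import math

def envelope(d_scaled, p=6):
    # polynomial envelope u from (Klicpera et al., 2020): u(0) = 1 and
    # u(1) = 0, decaying smoothly towards the cutoff
    return (1.0
            - (p + 1) * (p + 2) / 2.0 * d_scaled ** p
            + p * (p + 2) * d_scaled ** (p + 1)
            - p * (p + 1) / 2.0 * d_scaled ** (p + 2))

def distance_embedding(d, c, n_emb=8):
    # radial basis functions sqrt(2/c) * sin(n*pi*d/c) / d for
    # n = 1..n_emb, damped by the envelope (d > 0 assumed)
    env = envelope(d / c)
    return [env * math.sqrt(2.0 / c) * math.sin(n * math.pi * d / c) / d
            for n in range(1, n_emb + 1)]
```

Unlike the monotone $g(d)=\exp(-d)$ used in MAT, this basis lets downstream layers express non-monotone functions of distance.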
Bond embeddings Finally, we featurize each bond to reflect the physical relationship between pairs of atoms that might arise from, for example, being part of the same aromatic structure in the molecule. Molecular bonds are embedded as a 7-dimensional vector following (Coley et al., 2017) (see Table 5 in Appendix A). When two atoms are not connected by a true molecular bond, all 7 dimensions are set to zero. We note that while these features can easily be learned in pretraining, we hypothesize that this featurization might be highly useful for training R-MAT on smaller datasets.
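A sketch of this 7-dimensional featurization, following the layout of Table 5 in Appendix A. The dict schema used here is hypothetical, standing in for an RDKit bond object; the function name is ours.

```python
def bond_embedding(bond):
    # bond: None, or a dict like {"order": 1.5, "aromatic": True,
    # "conjugated": True, "in_ring": True} -- a hypothetical schema
    vec = [0.0] * 7
    if bond is None:
        return vec  # no true molecular bond: all zeros
    orders = [1, 1.5, 2, 3]  # indices 0-3: bond-order one-hot (Table 5)
    vec[orders.index(bond["order"])] = 1.0
    vec[4] = float(bond["aromatic"])
    vec[5] = float(bond["conjugated"])
    vec[6] = float(bond["in_ring"])
    return vec
```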

RELATIVE MOLECULE SELF-ATTENTION
Equipped with the embedding b ij for each pair of atoms in the molecule, we now use it to define a novel self-attention layer that we refer to as Relative Molecule Self-Attention.
First, mirroring the key-query-value design in the vanilla self-attention (cf. Equation (1)), we transform $b_{ij}$ into key- and value-specific vectors $b^{K}_{ij}, b^{V}_{ij}$ using two neural networks $\phi_K$ and $\phi_V$. Each neural network consists of two layers: a hidden layer shared between all attention heads, and an output layer that creates a separate relative embedding for each attention head.
Consider Equation (1) in index notation:

$$z_i = \sum_{j} \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{ik})}\,(x_j W_V),$$

where the unnormalized attention is

$$e_{ij} = \frac{(x_i W_Q)(x_j W_K)^{\top}}{\sqrt{d_k}}.$$

By analogy, in Relative Molecule Self-Attention, we compute $e_{ij}$ as

$$e_{ij} = \frac{x_i W_Q (x_j W_K)^{\top} + x_i W_Q (b^{K}_{ij})^{\top} + u (x_j W_K)^{\top} + v (b^{K}_{ij})^{\top}}{\sqrt{d_k}},$$

where $u, v \in \mathbb{R}^{D}$ are trainable vectors. We then define the Relative Molecule Self-Attention operation as

$$z_i = \sum_{j} \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{ik})}\,\big(x_j W_V + b^{V}_{ij}\big).$$

In other words, we enrich the self-attention layer with the atom relation embedding. When computing the attention weights, we add a content-dependent positional bias, a global content bias, and a global positional bias (Dai et al., 2019; Huang et al., 2020), with the positional terms calculated from $b^{K}_{ij}$. Then, when computing the attention-weighted average, we also include the information carried by the other embedding, $b^{V}_{ij}$.
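The four-term score and the value-side relation embedding described above can be sketched as follows on toy inputs. This is a didactic, single-head version under our own naming; the inputs are assumed to be already projected (i.e. `Q[i]` stands for $x_i W_Q$, and so on), which is an assumption, not the authors' code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relative_self_attention(Q, K, V, BK, BV, u, v):
    # Q, K, V: per-atom query/key/value vectors (already projected);
    # BK[i][j], BV[i][j]: key- and value-side relation embeddings;
    # u, v: trainable global content / positional bias vectors
    n, dk = len(Q), len(Q[0])
    # four-term unnormalized attention score e_ij
    e = [[(dot(Q[i], K[j]) + dot(Q[i], BK[i][j])
           + dot(u, K[j]) + dot(v, BK[i][j])) / math.sqrt(dk)
          for j in range(n)] for i in range(n)]
    out = []
    for i in range(n):
        m = max(e[i])
        w = [math.exp(x - m) for x in e[i]]
        s = sum(w)
        w = [x / s for x in w]
        # attention-weighted average of (value + value-side relation embedding)
        out.append([sum(w[j] * (V[j][k] + BV[i][j][k]) for j in range(n))
                    for k in range(len(V[0]))])
    return out
```

With all relation embeddings and bias vectors set to zero, the layer reduces exactly to vanilla self-attention.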

RELATIVE MOLECULE ATTENTION TRANSFORMER
Finally, we use Relative Molecule Self-Attention to construct the Relative Molecule Attention Transformer (R-MAT). The key changes compared to MAT are: (1) the use of Relative Molecule Self-Attention, (2) extended atom featurization, and (3) an extended pretraining procedure. Figure 1 illustrates the R-MAT architecture.
The input is embedded as a matrix of size $N_{\mathrm{atom}} \times 36$, where each atom of the input is embedded following (Coley et al., 2017; Pocha et al., 2020); see Table 6 of Appendix A. We process the input using N stacked Relative Molecule Self-Attention layers. Each attention layer is followed by a position-wise feed-forward network (as in the classical Transformer model (Vaswani et al., 2017)), which consists of 2 linear layers with a leaky-ReLU nonlinearity between them.
After processing the input using the attention layers, we pool the representation into a constant-sized vector. We replace simple mean pooling with an attention-based pooling layer. After applying the N self-attention layers, we use the following self-attention pooling (Lin et al., 2017) to obtain the graph-level embedding of the molecule:

$$A = \mathrm{softmax}\big(W_2 \tanh(W_1 H^{\top})\big), \qquad g = \mathrm{flatten}(AH),$$

where H is the hidden state obtained from the self-attention layers, and $W_1 \in \mathbb{R}^{P \times D}$ and $W_2 \in \mathbb{R}^{S \times P}$ are pooling attention weights, with P equal to the pooling hidden dimension and S equal to the number of pooling attention heads. Finally, the graph embedding g is passed to a two-layer MLP with leaky-ReLU activation in order to make the prediction.
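A minimal sketch of this pooling step, following (Lin et al., 2017), on plain Python lists; the function name and the toy shapes are ours. Each of the S heads produces its own softmax over atoms, and the S pooled D-dimensional vectors are concatenated.

```python
import math

def attention_pooling(H, W1, W2):
    # H: N x D hidden states; W1: P x D; W2: S x P
    # A = softmax(W2 @ tanh(W1 @ H^T)) per head; g = flatten(A @ H)
    n, d = len(H), len(H[0])
    # tanh(W1 H^T): P x N
    t = [[math.tanh(sum(w1k * H[j][k] for k, w1k in enumerate(row)))
          for j in range(n)] for row in W1]
    # W2 @ t: S x N, then softmax over atoms for each head
    scores = [[sum(W2[s][p] * t[p][j] for p in range(len(W1)))
               for j in range(n)] for s in range(len(W2))]
    A = []
    for row in scores:
        m = max(row)
        ex = [math.exp(x - m) for x in row]
        tot = sum(ex)
        A.append([x / tot for x in ex])
    # flatten(A @ H): concatenate one pooled D-vector per head
    return [sum(A[s][j] * H[j][k] for j in range(n))
            for s in range(len(W2)) for k in range(d)]
```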
Pretraining We use a two-step pretraining procedure. In the first step, the network is trained with the contextual property prediction task proposed by (Rong et al., 2020), where we mask not only selected atoms but also their neighbours. The goal of the task is to predict the whole atom context. This task is much more demanding for the network than the classical masking approach presented by (Maziarka et al., 2020), since the network has to encode more specific information about the masked atom's neighbourhood. Furthermore, the size of the context vocabulary is much bigger than the size of the atom vocabulary in the MAT pretraining approach. The second task is the graph-level prediction proposed by (Fabian et al., 2020), in which the goal is to predict a set of real-valued descriptors of physicochemical properties. For more detailed information about the pretraining procedure and ablations, see Appendix B.
Other details Similarly to (Maziarka et al., 2020), we add an artificial dummy node to the input molecule. The distance of the dummy node to any other atom in the molecule is set to the maximal cutoff distance, and the edge connecting the dummy node with any other atom has its own unique index (see index 5 in Table 4 of Appendix A). Moreover, the dummy node has its own index in the input atom embedding. We calculate distance information in a similar manner to (Maziarka et al., 2020). The 3D molecular conformations that are used to obtain distance matrices are calculated using the UFFOptimizeMolecule function from the RDKit package (Landrum, 2016) with default parameters. Finally, we consider a variant of the model extended with 200 RDKit features, as in (Rong et al., 2020). The features are concatenated to the final embedding g and processed using a prediction MLP.

SMALL HYPERPARAMETER BUDGET
Industrial drug discovery pipelines focus on fast iterations of compound screenings and adjusting the models to new data incoming from the laboratory. We start by comparing R-MAT in this setting to DMPNN (Yang et al., 2019a), MAT (Maziarka et al., 2020) and GROVER (Rong et al., 2020), representative state-of-the-art models on popular molecular property prediction tasks. We followed the evaluation in (Maziarka et al., 2020), where the only changeable hyperparameter is the learning rate, for which 7 different values were checked.
The BBBP and Estrogen-β datasets use scaffold splits, while all the other datasets use random splits.
For every dataset we calculate scores based on 6 different splits, and we report the mean test score for the hyperparameters that obtained the best validation score. In this and the following experiments, we denote models extended with additional RDKit features (see Section 3.5) as GROVER_rdkit and R-MAT_rdkit. More information about the models and datasets used in this benchmark is given in Appendix C.4.
Table 1 shows that R-MAT outperforms other methods in 4 out of 6 tasks. For comparison, we also cite representative results of other methods from (Maziarka et al., 2020). Satisfyingly, we observe a marked improvement on the solubility prediction tasks (ESOL and FreeSolv). Understanding solubility depends to a large degree on a detailed understanding of spatial relationships between atoms. This suggests that the improvement in performance might be related to better utilization of the distance or graph information.
Table 1: Results on the molecule property prediction benchmark from (Maziarka et al., 2020). We only tune the learning rate for models in the first group. The first two datasets are regression tasks (lower is better); the other datasets are classification tasks (higher is better). For reference, we include results for non-pretrained baselines (SVM, RF, GCN (Duvenaud et al., 2015), and DMPNN (Yang et al., 2019a)) from (Maziarka et al., 2020).

LARGE HYPERPARAMETER BUDGET
In contrast to the previous setting, we test R-MAT against a similar set of models, but using a large-scale hyperparameter search (300 different hyperparameter combinations). This setting was proposed in (Rong et al., 2020). For comparison, we include results under the small (7 different learning rates) hyperparameter budget. All datasets use a scaffold split. Scores are calculated based on 3 different data splits. While the ESOL dataset is the same as in the previous section, here it uses a scaffold split and the labels are not normalized (unlike in the previous section). Additional information about the models and datasets used in this benchmark is given in Appendix D.2.
Table 2 summarizes the experiment. The results show that R-MAT outperforms other methods in 3 out of 4 tasks, both in the large-grid mode and in the learning-rate-only tuning mode.

LARGE-SCALE EXPERIMENTS
Finally, to better understand how R-MAT performs in a setting where pretraining is likely to influence results less, we include results on the QM9 dataset (Ramakrishnan et al., 2014). QM9 is a quantum mechanics benchmark that encompasses prediction of 12 simulated properties across around 130k small molecules with at most 9 heavy (non-hydrogen) atoms. The molecules are provided with the atomic 3D positions for which the quantum properties were initially calculated. For these experiments, we used a learning rate equal to 0.015 (we selected this value as it returned the best results for the α dataset among the 4 different learning rates that we tested: {0.005, 0.01, 0.015, 0.02}). Additional information about the dataset and models used in this benchmark is given in Appendix C.6.

Table 2: Results on the benchmark from (Rong et al., 2020). Models are fine-tuned under a large hyperparameter budget. Additionally, models fine-tuned with only tuning the learning rate are presented in the last group. The last dataset is a classification task (higher is better); the remaining datasets are regression tasks (lower is better). For reference, we include results for non-pretrained baselines (GraphConv (Kipf & Welling, 2016), Weave (Kearnes et al., 2016), and DMPNN (Yang et al., 2019a)) from (Rong et al., 2020). A rank-plot for these experiments is in Appendix D.2. We bold the best scores over all models and underline the best scores for learning-rate-tuned models only.

Figure 3 compares R-MAT performance with various models. More detailed results can be found in Table 8 in Appendix D.3. R-MAT achieves highly competitive results, with state-of-the-art performance on 4 out of the 12 tasks. We attribute the higher variability of performance to the limited hyperparameter search we performed. These results highlight the versatility of the model, as tasks in QM9 have very different characteristics than the datasets considered in previous sections.
Most importantly, targets in QM9 are calculated using quantum simulation software, rather than measured experimentally.

Achieving strong empirical results hinged on a methodical exploration of the design space of different variants of the self-attention layer. We document here this exploration and the relevant ablations. Due to space limitations, we defer most results to Appendix E. We perform all experiments on the ESOL, FreeSolv and BBBP datasets with 3 different scaffold splits. We did not use any pretraining for these experiments. We follow the same fine-tuning methodology as in Section 4.1.

EXPLORING THE DESIGN SPACE OF SELF-ATTENTION LAYER
Importance of different sources of information in self-attention The self-attention module in R-MAT incorporates three auxiliary sources of information: (1) distance information, (2) graph information (encoded using neighbourhood order), and (3) bond features. In Table 3 (Left), we show the effect on performance of ablating each of these elements. Importantly, we find that each component is important to R-MAT performance, including the distance matrix.

Maximum neighbourhood order We take a closer look at how we encode the molecular graph. (Maziarka et al., 2020) used a simple binary adjacency matrix to encode the edges. We enriched this representation by adding a one-hot encoding of the neighbourhood order. For example, an order of 3 for a pair of atoms means that there are two other vertices on the shortest path between this pair of atoms. In R-MAT we used 4 as the maximum order of neighbourhood distance. That is, we encoded as separate features whether two atoms are 1, 2, 3 or 4 hops away in the molecular graph. In Table 3 (Right) we ablate this choice. The result suggests that R-MAT performance benefits from including a separate feature for each of the considered orders.

Closer comparison to Molecule Attention Transformer Our main motivation for improving self-attention in MAT was to make it easier to represent attention patterns that depend in a more complex way on the distance and graph information. We qualitatively explore here whether R-MAT achieves this goal, comparing its attention patterns to those of MAT. From Figure 4 one can see that R-MAT indeed seems capable of learning more complex attention patterns than MAT. We add a more detailed comparison, with more visualised attention heads, in Appendix D.4.

CONCLUSIONS
The Transformer has been successfully adapted to various domains by incorporating a minimal set of inductive biases into its architecture. In a similar spirit, we methodically explored the design space of the self-attention layer, and identified a highly effective Relative Molecule Self-Attention.
Relative Molecule Attention Transformer, a model based on Relative Molecule Self-Attention, achieves state-of-the-art or very competitive results across a wide range of molecular property prediction tasks. R-MAT is a highly versatile model, showing state-of-the-art results both on quantum property prediction tasks and on biological datasets. We also show that R-MAT is easy to train and requires tuning only the learning rate to achieve competitive results, which, together with the open-sourced weights and code, makes it highly accessible.
Relative Molecule Self-Attention encodes an inductive bias to consider relationships between atoms that are commonly relevant to a chemist, but on the other hand leaves flexibility to unlearn them if needed.Relatedly, Vision Transformers learn global processing in early layers despite being equipped with a locality inductive bias (Dosovitskiy et al., 2021).Our empirical results show in a new context that picking the right set of inductive biases is key for self-supervised learning to work well.
Learning useful representations for molecular property prediction is far from solved. Achieving state-of-the-art results, while less dependent on handcrafted features than prior work, still relied on using certain large sets of such features both in fine-tuning and pretraining. At the same time, these features are beyond doubt learnable from data. Developing methods that will push representation learning towards discovering these and better features automatically from data is an exciting challenge for the future.

A R-MAT NODE AND EDGE FEATURES
In the following section, we present the node and edge features used by R-MAT.

A.1 EDGE FEATURES
In R-MAT, all atoms are connected with an edge. The vector representation of every edge contains information about the atoms' neighbourhood, the distances between them, and the physicochemical features of a bond if it exists (see Figure 2).

Neighbourhood embeddings The neighbourhood information of an atom pair is represented by a 6-dimensional one-hot encoded vector, with features presented in Table 4. Every neighbourhood embedding contains information about how many other vertices lie between nodes i and j in the original molecular graph.
Table 4: Featurization used to embed neighbourhood order in R-MAT.
Indices Description
1 Atoms i and j are connected with a bond
2 In the shortest path between atoms i and j there is one atom
3 In the shortest path between atoms i and j there are two atoms
4 In the shortest path between atoms i and j there are three or more atoms
5 Any of the atoms i or j is a dummy node

Bond embeddings Molecular bonds are embedded as a 7-dimensional vector following (Coley et al., 2017), with features specified in Table 5. When the two atoms are not connected by a true molecular bond, all 7 dimensions are set to zeros.
Table 5: Featurization used to embed molecular bonds in R-MAT.

Indices Description
0−3 Bond order as a one-hot vector over {1, 1.5, 2, 3}
4 Is aromatic
5 Is conjugated
6 Is in a ring

A.2 NODE FEATURES
The input molecule is embedded as a matrix of size $N_{\mathrm{atom}} \times 36$, where each atom of the input is embedded following (Coley et al., 2017; Pocha et al., 2020). All features are presented in Table 6.

B PRETRAINING
We extend the pretraining procedure of (Maziarka et al., 2020), who used a masking task based on (Devlin et al., 2018; Hu et al., 2020): they masked the types of some of the graph atoms and treated them as labels to be predicted by the neural network. Such an approach works well in NLP, where models pretrained with the masking task create state-of-the-art representations (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019b). However, in chemistry, unlike in NLP, the atom vocabulary is much smaller. Moreover, usually only one type of atom fits a given place, and thus a representation trained with the masking task has problems encoding meaningful information in chemistry.

B.1 CONTEXTUAL PRETRAINING
Instead of atom masking, we used a two-step pretraining that combines the procedures proposed by (Rong et al., 2020; Fabian et al., 2020). In the first step, the network is trained with the contextual property prediction task (Rong et al., 2020), where we mask not only the selected atoms, but also their neighbours. The task is then to predict the whole atom context. For example, if the selected atom is a carbon connected to a nitrogen by a double bond and to an oxygen by a single bond, we encode the atom neighbourhood as C_N-DOUBLE1_O-SINGLE1 (we list all the node-edge count terms in alphabetical order); the network then has to predict the specific type of the masked neighbourhood for every masked atom. This task is much more demanding for the network than the classical masking approach presented by (Maziarka et al., 2020), as the network has to encode more specific information about the masked atom's neighbourhood. Furthermore, the size of the context vocabulary is much bigger than the size of the atom vocabulary in the MAT pretraining approach (2925 for R-MAT vs 35 for MAT).
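The context-label construction described above can be sketched in a few lines. The function name and input schema (a list of neighbour symbol/bond-label pairs) are ours for illustration; only the resulting label format follows the paper.

```python
from collections import Counter

def context_label(center_symbol, neighbours):
    # neighbours: (atom symbol, bond label) pairs around the masked atom;
    # node-edge count terms are listed in alphabetical order
    counts = Counter("%s-%s" % (sym, bond) for sym, bond in neighbours)
    parts = sorted("%s%d" % (term, n) for term, n in counts.items())
    return "_".join([center_symbol] + parts)
```

For the carbon example from the text, this yields C_N-DOUBLE1_O-SINGLE1, and repeated identical neighbours are collapsed into a count, which is what makes the context vocabulary (2925 labels) so much larger than the atom vocabulary (35 types).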

B.2 GRAPH-LEVEL PRETRAINING
The second task is the graph-level property prediction proposed by (Fabian et al., 2020). In this pretraining procedure, the task is to predict 200 real-valued descriptors of the physicochemical characteristics of each given molecule.
The list of all 200 descriptors from RDKit is as follows:

Models The comparison also includes two pretrained models: MAT (Maziarka et al., 2020) and GROVER (Rong et al., 2020).
Datasets The benchmark is based on important molecular property prediction tasks in the drug discovery domain. The first two datasets are ESOL and FreeSolv, in which the task is to predict the solubility of a molecule in water (a key property of any drug), with the error measured using RMSE. The goal in BBBP and Estrogen−β is to classify correctly whether a given molecule is active against a biological target. For details on the other tasks, please see (Maziarka et al., 2020). BBBP and Estrogen−β used a scaffold split; the remaining datasets used a random split. For every dataset, 6 different splits were created. Labels of the regression datasets (ESOL and FreeSolv) were normalized before training. We did not include the Estrogen−α dataset that was also used by (Maziarka et al., 2020), due to GPU memory limitations (the biggest molecule in this dataset consists of over 500 atoms).
Training hyperparameters We fine-tune R-MAT on the target tasks for 100 epochs, with a batch size of 32 and the Noam optimizer with warm-up equal to 30% of all steps. The only hyperparameter that we tune is the learning rate, selected from a set of 7 possible options: {1e−3, 5e−4, 1e−4, 5e−5, 1e−5, 5e−6, 1e−6}. This small hyperparameter selection budget reflects the long-term goal of this paper: developing easy-to-use models for molecular property prediction. Fine-tuning was conducted on an NVIDIA V100 GPU.
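The Noam schedule referred to above can be sketched as follows (the model dimension, scaling factor and dataset size are illustrative assumptions; only the 30% warm-up fraction comes from the text):

```python
def noam_lr(step, d_model=256, factor=1.0, warmup_steps=1000):
    """Noam learning-rate schedule (Vaswani et al., 2017): linear warm-up
    followed by inverse-square-root decay."""
    step = max(step, 1)  # guard against step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# With warm-up set to 30% of all fine-tuning steps:
total_steps = 100 * (1000 // 32)   # 100 epochs over a hypothetical 1000-molecule dataset
warmup = int(0.3 * total_steps)
peak_lr = noam_lr(warmup, warmup_steps=warmup)  # the schedule peaks at the end of warm-up
```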

C.5 LARGE HYPERPARAMETER BUDGET
Models For the large hyperparameter budget, we compared R-MAT with three models trained from scratch: GraphConv (Kipf & Welling, 2016), Weave (Kearnes et al., 2016) and DMPNN (Yang et al., 2019a), and with two pretrained models: MAT (Maziarka et al., 2020) and GROVER (Rong et al., 2020). For the small hyperparameter budget, we compared R-MAT to MAT and GROVER. We note that R-MAT, MAT and GROVER use different pretraining methods: MAT was pretrained with 2M molecules from the ZINC database, and GROVER was pretrained with 10M molecules from the ZINC and ChEMBL databases.
Datasets All datasets were split using a scaffold split. The resulting splits differ from those in the MAT benchmark. For every dataset, 3 different splits were created. In our comparison, we included only the subset of single-task datasets from the original GROVER work (Rong et al., 2020), which is why we use a smaller number of datasets. The regression scores obtained for ESOL differ significantly from the small hyperparameter budget benchmark because this time the labels are not normalized.
Training hyperparameters For learning rate tuning, we used the same hyperparameter settings as in the MAT benchmark (see Appendix C.4).
For the large hyperparameter budget, we ran a random search over the hyperparameters listed in Table 7.
Datasets The QM9 dataset (Ramakrishnan et al., 2014) consists of molecules composed of H, C, N, O and F atoms, with up to 9 heavy atoms and up to 29 atoms overall per molecule. Each atom in this dataset is additionally associated with a 3D position. The dataset comprises 12 different regression tasks: α, Δε, ε_HOMO, ε_LUMO, μ, C_v, G, H, R², U, U₀ and ZPVE, for which the mean absolute error is the standard metric. The dataset has over 130k molecules. We use the data splits proposed by (Anderson et al., 2019), which gives us 100k training molecules, 18k molecules for validation and 13k molecules for testing.
Training hyperparameters We trained R-MAT for 1000 epochs, with a batch size of 256 and a learning rate of 0.015. We report the test set MAE for the epoch with the lowest validation MAE. We selected this learning rate value as it returned the best results for α among the 4 different learning rates that we tested: {0.005, 0.01, 0.015, 0.02}.

C.7 ABLATIONS
Datasets For the ablations section, we used the BBBP, ESOL and FreeSolv datasets, split using a scaffold split, with 3 different splits each. Labels of the regression datasets (ESOL and FreeSolv) were normalized before training. Scores obtained in this section differ significantly from the previous benchmarks due to the different data splits, different model hyperparameters and the absence of pretraining.
Training hyperparameters As in our main benchmarks, we tuned only the learning rate, selected from a set of 7 possible options: {1e−3, 5e−4, 1e−4, 5e−5, 1e−5, 5e−6, 1e−6}. We used a batch size of 32 and the Noam optimizer with warm-up equal to 20% of all steps. Moreover, we use a single layer instead of a two-layer MLP as the classification head.

D.1 SMALL HYPERPARAMETER BUDGET
In Figure 5, one can find rank plots for the results from Table 1. R-MAT and R-MAT rdkit obtained the best median rank among all compared models.

D.2 LARGE HYPERPARAMETER BUDGET
In Figure 6, one can find rank plots for the results from Table 2. We present separate plots for models trained with the large grid search (left) and for models with only learning rate tuning (right).

D.3 LARGE-SCALE EXPERIMENTS
In Table 8, one can find detailed results comparing R-MAT's performance with various other models. R-MAT achieves highly competitive results, with state-of-the-art performance on 4 out of the 12 tasks, which demonstrates the versatility of the model.

D.4 CLOSER COMPARISON TO MOLECULE ATTENTION TRANSFORMER
Our main motivation for improving self-attention in MAT was to make it easier to represent attention patterns that depend in a more complex way on distance and graph information. Here we qualitatively explore whether R-MAT achieves this goal by comparing its attention patterns to those of MAT. For this purpose, we compared the attention patterns learned by pretrained MAT (weights taken from (Maziarka et al., 2020)) and R-MAT for a selected molecule from the ESOL dataset. Figure 7 shows that different heads of Relative Molecule Self-Attention focus on different atoms in the input molecule. Self-attention strength is concentrated on the input atom (head 5), on the closest neighbours (heads 0 and 11), on the second-order neighbours (head 7), on the dummy node (head 1), or on a substructure that occurs in the molecule (heads 6 and 10 concentrate on atoms 1 and 2). In contrast, self-attention in MAT focuses mainly on the input atom and its closest neighbours; information from other regions of the molecule is not strongly propagated. This likely happens due to the construction of Molecule Self-Attention in MAT (cf. Equation (2)), where the output atom representation is calculated from equally weighted messages based on the adjacency matrix, the distance matrix and self-attention. Due to this construction, it is more challenging for MAT than for R-MAT to learn to attend to a distant neighbour.

E EXPLORING THE DESIGN SPACE OF MOLECULAR SELF-ATTENTION
Identifying the Relative Molecule Self-Attention layer required a large-scale and methodical exploration of the self-attention design space. In this section, we present the experimental data that informed our choices. We also hope it will inform future efforts in designing attention mechanisms for molecular data. We follow the same evaluation protocol as in Section 4.4 and show how different natural variants compare against R-MAT.

E.1 SELF-ATTENTION VARIANTS
Relative Molecule Self-Attention is designed to better incorporate the relative spatial positions of atoms in the molecule. The first step is to embed each pair of atoms; the embedding is then used to re-weight self-attention. To achieve this, Relative Molecule Self-Attention combines ideas from natural language processing (Shaw et al., 2018; Dai et al., 2019; Huang et al., 2020), works that focus on better encoding the relative positions of tokens in the input.
We compare to three specific variants from these works, which can be written using our previously introduced notation:
1. Relative self-attention (Shaw et al., 2018);
2. Relative self-attention with attentive bias (Dai et al., 2019);
3. Improved relative self-attention (Huang et al., 2020).
Table 9 shows that the attention operation used in R-MAT outperforms the other variants across the three tasks. This might be expected, given that Relative Molecule Self-Attention combines these ideas (cf. Equation (3)).
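For reference, the standard formulations from the cited works can be written as follows, in generic relative-attention notation (which does not necessarily match the paper's Equation (3)): $x_i$ are token embeddings, $a_{ij}$ and $r_{ij}$ relative-position embeddings, and $u$, $v$ learned bias vectors.

```latex
% 1. Relative self-attention (Shaw et al., 2018):
e_{ij} = \frac{x_i W^Q \left(x_j W^K + a_{ij}\right)^{\top}}{\sqrt{d}}

% 2. Relative self-attention with attentive bias (Dai et al., 2019):
e_{ij} = \frac{\left(x_i W^Q + u\right)\left(x_j W^K\right)^{\top}
       + \left(x_i W^Q + v\right)\left(r_{ij} W^R\right)^{\top}}{\sqrt{d}}

% 3. Improved relative self-attention (Huang et al., 2020):
e_{ij} = \frac{x_i W^Q \left(x_j W^K\right)^{\top}
       + x_i W^Q \, r_{ij}^{\top} + x_j W^K \, r_{ij}^{\top}}{\sqrt{d}}
```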

E.2 ENRICHING BOND FEATURES WITH ATOM FEATURES
In Relative Molecule Self-Attention, we use a small number of bond features to construct the atom pair embedding.We investigate here the effect of extending bond featurization.
Inspired by (Shang et al., 2018), we added information about the atoms that an edge connects. We tried three different variants. In the first, we extend the bond representation with the concatenated input features of the atoms that the bond connects. In the second, instead of the raw atom features, we use a one-hot encoding of the type of the bond connection (i.e. when the bond connects atoms C and N, we encode it as a bond 'C_N' and take the one-hot encoding of this information). Finally, we combined the two approaches. The results are shown in Table 10. Surprisingly, we find that adding this type of information to the bond features negatively affects the performance of R-MAT. This suggests that R-MAT can already access these features efficiently from the input (which we featurize using the same set of features). It could also be that, after a few layers, attention is no longer calculated over the input atoms: it operates over hidden embeddings, which can themselves be mixed representations of multiple atom embeddings (Brunner et al., 2019), whereas the proposed additional representation contains only information about the input features.
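The second variant can be sketched as follows (the vocabulary and the sorting-based canonicalization are illustrative assumptions; the paper does not specify how atom-pair labels are ordered):

```python
def bond_connection_type(symbol_a, symbol_b):
    """Canonical label for the pair of atom types a bond connects.
    Sorting makes C-N and N-C map to the same label (an assumption)."""
    return "_".join(sorted([symbol_a, symbol_b]))

def one_hot_connection(symbol_a, symbol_b, vocab):
    """One-hot encode the bond connection type over a fixed vocabulary."""
    label = bond_connection_type(symbol_a, symbol_b)
    return [1 if label == v else 0 for v in vocab]

vocab = ["C_C", "C_N", "C_O", "N_O"]  # illustrative, not the paper's vocabulary
vec = one_hot_connection("N", "C", vocab)  # encodes the bond as 'C_N'
```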

E.3 DISTANCE ENCODING VARIANTS
R-MAT uses a specific radial basis distance encoding proposed by (Klicpera et al., 2020), followed by the envelope function, with N_emb = 32. We compare here against several other natural choices.
We tested the following distance encoding variants: (1) removing the envelope function, (2) increasing the number of distance radial functions to 128, and (3) using the distance embedding from the popular SchNet model (Schütt et al., 2017). The distance in SchNet is encoded as e_n(d) = exp(−γ‖d − μ_n‖²), for γ = 10 Å and 0 Å ≤ μ_n ≤ 30 Å divided into N_emb equal sections, with N_emb set to 32 or 128.
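A minimal sketch of this SchNet-style expansion (placing the centers μ_n evenly over [0 Å, 30 Å] is our reading of "divided into N_emb equal sections"; the original implementation may position them differently):

```python
import math

def schnet_rbf(d, n_emb=32, mu_min=0.0, mu_max=30.0, gamma=10.0):
    """SchNet-style radial basis expansion of an interatomic distance d (in Å):
    e_n(d) = exp(-gamma * (d - mu_n)**2) for evenly spaced centers mu_n."""
    mus = [mu_min + i * (mu_max - mu_min) / (n_emb - 1) for i in range(n_emb)]
    return [math.exp(-gamma * (d - mu) ** 2) for mu in mus]

emb = schnet_rbf(1.5)  # a typical bond length in Å gives a 32-dimensional vector
```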
The results are shown in Table 11. They corroborate that a proper representation of distance information is key to adapting self-attention to molecular data: all variants underperform compared with the radial basis encoding used in Relative Molecule Self-Attention. As pretraining is nowadays a main component of large Transformer architectures (Devlin et al., 2018; Liu et al., 2019; Clark et al., 2020), we decided to devote more attention to this issue. For this purpose, we compared various graph pretraining methods to identify the best one and used it in the final R-MAT model.
Pretraining methods We used various pretraining methods proposed in the molecular property prediction literature (Hu et al., 2020; Maziarka et al., 2020; Rong et al., 2020; Fabian et al., 2020). Specifically, we tried R-MAT with the following pretraining procedures:
• No pretraining - as a baseline, we include results for R-MAT fine-tuned from scratch, without any pretraining.
• Masking - the masked pretraining used in (Maziarka et al., 2020). This is an adaptation of the standard MLM pretraining used in NLP (Devlin et al., 2018) to graph data. In this approach, we mask the features of 15% of all atoms in the molecule, pass it through the model, and the goal is to predict the masked features.
• Contextual - the contextual pretraining method proposed by (Rong et al., 2020), described further in Appendix B.
• Graph-motifs - the graph-level motif prediction method proposed by (Rong et al., 2020), where for every molecule we obtain a fingerprint indicating which predefined molecular functional groups are present. The network's task is multi-label classification: predicting, for every predefined functional group, whether it occurs in the given molecule.
• Physicochemical - the graph-level prediction method proposed by (Fabian et al., 2020), described further in Appendix B.
• GROVER - the pretraining used by the authors of GROVER (Rong et al., 2020); a combination of two pretraining methods: contextual and graph-motifs.
• R-MAT - the pretraining used in this paper; a combination of two pretraining methods: contextual and physicochemical.
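The masking procedure described above can be sketched as follows (a simplified illustration; the handling of the mask token and of edge cases in MAT/R-MAT may differ):

```python
import random

def select_masked_atoms(num_atoms, mask_ratio=0.15, rng=None):
    """Pick ~15% of atom indices to mask (at least one), as in MLM-style
    graph pretraining."""
    rng = rng or random.Random()
    k = max(1, round(num_atoms * mask_ratio))
    return sorted(rng.sample(range(num_atoms), k))

def apply_mask(atom_features, masked, mask_vector):
    """Replace the features of masked atoms with a dedicated mask embedding;
    the masked originals become the prediction targets."""
    targets = {i: atom_features[i] for i in masked}
    corrupted = [mask_vector if i in masked else f
                 for i, f in enumerate(atom_features)]
    return corrupted, targets
```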

Figure 1 :
Figure 1: Relative Molecule Attention Transformer uses a novel relative self-attention block tailored to molecule property prediction. It fuses three types of features: distance embedding, bond embedding and neighbourhood embedding.

Figure 2 :
Figure 2: The Relative Molecule Self-Attention layer is based on the following features: (a) the neighbourhood embedding one-hot encodes graph distances (neighbourhood order) from the source node marked with an arrow; (b) the bond embedding one-hot encodes the bond order (numbers next to the graph edges) and other bond features for neighbouring nodes; (c) the distance embedding uses radial basis functions to encode pairwise distances in 3D space. These features are fused according to Equation (4).

Figure 3 :
Figure 3: Rank plot of scores obtained on the QM9 benchmark, which consists of 12 different quantum property prediction tasks.

Figure 4 :
Figure 4: Visualization of the learned self-attention for the first 3 attention heads in the second layer of pretrained R-MAT (middle) and the first 4 attention heads in pretrained MAT (bottom), for a molecule from the ESOL dataset. The top of the figure visualizes the molecule and its adjacency and distance matrices. The self-attention pattern in MAT is dominated by the adjacency and distance matrices, while R-MAT appears capable of learning more complex attention patterns.

Figure 7 :
Figure 7: Visualization of the learned self-attention for all attention heads in the second layer of pretrained R-MAT (left) and all attention heads in pretrained MAT (right), for a molecule from the ESOL dataset. The top of the figure visualizes the molecule and its adjacency and distance matrices. The self-attention pattern in MAT is dominated by the adjacency and distance matrices, while R-MAT appears capable of learning more complex attention patterns.
Figure 9: Fine-tuning scores obtained by R-MAT pretrained with a different number of pretraining epochs.
The rank plot for these experiments is in Appendix D.1.

Table 3 :
Ablations of Relative Molecule Self-Attention; other ablations are included in the Appendix.

Table 6 :
Featurization used to embed atoms in R-MAT.

Table 7 :
Hyperparameter ranges for the Relative Molecule Attention Transformer large grid search.

Table 8 :
Mean absolute error on QM9, a benchmark including various quantum prediction tasks. Results are cited from the literature.