Llamol: a dynamic multi-conditional generative transformer for de novo molecular design

Abstract
Generative models have demonstrated substantial promise in Natural Language Processing (NLP) and have found application in designing molecules, as seen in Generative Pretrained Transformer (GPT) models. In our efforts to develop such a tool for exploring the organic chemical space in search of potentially electro-active compounds, we present Llamol, a single novel generative transformer model based on the Llama 2 architecture, which was trained on a 12.5M superset of organic compounds drawn from diverse public sources. To allow for maximum flexibility in usage and robustness in view of potentially incomplete data, we introduce Stochastic Context Learning (SCL) as a new training procedure. We demonstrate that the resulting model adeptly handles single- and multi-conditional organic molecule generation with up to four conditions, although more are possible. The model generates valid molecular structures in SMILES notation while flexibly incorporating up to three numerical properties and/or one token sequence into the generative process, as requested. The generated compounds are very satisfactory in all scenarios tested. In detail, we showcase the model's capability to utilize token sequences for conditioning, either individually or in combination with numerical properties, making Llamol a potent tool for de novo molecule design that is easily expandable with new properties.

Scientific contribution
We developed a novel generative transformer model, Llamol, based on the Llama 2 architecture, which was trained on a diverse set of 12.5 M organic compounds. It introduces Stochastic Context Learning (SCL) as a new training procedure, allowing for flexible and robust generation of valid organic molecules under multiple conditions that can be combined in various ways, making it a potent tool for de novo molecular design.


Introduction
In fields like energy storage materials or medicinal chemistry, substances are key to technological advancement and progress: the success of these applications hinges on the specific properties of the materials. However, the discovery and development of new materials often face practical and/or fundamental obstacles: on the practical side, unavailability of compounds or precursors, high production costs, and the need for extensive trials; on the fundamental side, limited data and/or experience, as well as biased expectations of designers and developers. Generative models, a powerful category in machine learning, have the potential to address both of these issues simultaneously, as they can help focus our efforts a priori on only the most likely candidates.
Many architectures for the creation of novel data points were developed in recent years, most notably Recurrent Neural Networks (RNN) [1], Generative Adversarial Networks (GAN) [2], Variational Autoencoders (VAE) [3] and Transformers [4]. The transformer architecture, especially, has revolutionized the fields of Natural Language Processing (NLP) [5] and other domains like computer vision [6]. The introduction of the Generative Pretrained Transformer (GPT) architecture led to significant advancements in generative natural language applications. Generative models have also been applied in the fields of medicine and material science to create new molecules with predefined features, a process known as conditional generation [7,8]. This application can significantly accelerate the discovery of new candidate molecules. Although current generative models may not provide the optimal solution, they can greatly reduce the size of the chemical space that needs to be evaluated. Current estimates for the size of the chemical space containing drug-like molecules range from $10^{23}$ to $10^{60}$ [9]. Many approaches have successfully used VAEs [10,11], GANs [12], or RNNs [13]. More recently, however, transformer models, specifically the GPT models [8,14], have emerged as the new state of the art in this domain, especially in the field of conditional molecular generation [15,16]. A good summary of available models can be found in the survey by Du et al. [17].
Bagal et al. [8] presented the MolGPT architecture, from which a family of models, each one tailored to a specific task, could be derived. Inspired by their work, we set out to develop a single model that can handle many tasks simultaneously to support the search for low-cost, high-energy-density alternatives for energy storage materials in flow batteries. The model itself should not require complex training data; thus, it operates on SMILES [18], a minimalist molecular representation that allows us to draw a large amount of data from numerous sources, together with target properties that are easy to provide and can be verified directly, which serve as conditions (primarily to facilitate the development process of the model).
In this paper, we present a new, dynamic training approach termed "Stochastic Context Learning" (SCL) to train a single model for conditional generation, capable of generating molecules as SMILES while respecting a variable number of conditions. Our training dataset consists of approx. 13 million organic molecules, which is a superset of several public datasets (see Section 3.1). On this, we train a GPT-style transformer model, specifically a model based on Llama 2 [19], to generate new compounds based on one or more conditions (target properties). To achieve this, we assign a learnable embedding to each property value. This ensures that the model perceives not only the numerical value, but also the associated label.
To be able to assess the model's performance directly, we chose three easily determined numerical properties: SAScore [20] (reflecting production cost), and logP and molecular weight (contributing to energy density), along with another optional condition: a user-defined core structure that has to be integrated into the final molecule. The latter is given as a SMILES string, which is a continuous sequence of tokens, hereafter referred to as a 'token sequence'.
In the following sections, we detail the architecture, training data and process along with the results obtained for unconditional, single, and multi-conditional molecule generation.

Architecture
The architecture we utilized, as depicted in Figure 1, is a modified version of the Llama 2 architecture [19] as obtained from GitHub (https://github.com/karpathy/llama2.c). The hyperparameters, which we determined in previous experiments, are listed in Table 1.
Our model consists of approximately 15 million parameters and is composed of eight decoder blocks. Each decoder block includes a masked multi-head self-attention layer, followed by a Feed Forward Network (FFN) that employs the SwiGLU [21] activation function. While the original Llama 2 architecture utilized Grouped-Query Attention (GQA) [22], we opted for the full multi-head attention mechanism given the comparatively smaller size of our model.
The masked multi-head self-attention layer [4], defined by Equation 1, takes an embedded input sequence $X \in \mathbb{R}^{L \times d_{emb}}$ of length $L$, where each element represents an embedding vector with dimension $d_{emb}$. Through the attention mechanism, each head learns to attend to a different part of the sequence, resulting in an attention matrix $\mathrm{head}_i \in \mathbb{R}^{L \times d_v}$. We utilize dot-product self-attention, which produces three matrices: $Q_i$ and $K_i$ with dimensions $L \times d_k$, and $V_i$ with dimensions $L \times d_v$. These matrices are generated by applying linear transformations to the input sequence $X$ for each attention head $i$, using the weight matrices $W_i^Q, W_i^K \in \mathbb{R}^{d_{emb} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{emb} \times d_v}$.
In our specific case, we set $d_k$ and $d_v$ to be equal to $d_{emb}/n_{heads}$, resulting in $d_k = d_v = 384/8 = 48$. To keep the autoregressive property of our model, we mask out the upper right triangle by using the mask matrix $M \in \mathbb{R}^{L \times L}$ shown in Equation 4. Then, these attention matrices are concatenated with each other along the $d_v$-dimension. Afterward, the resulting concatenated matrix is further transformed using another learnable weight matrix.
The Llama 2 architecture employs several changes compared to the standard decoder architecture [4]. Firstly, we use rotary positional embeddings (RoPE) [23] to encode absolute and relative positional information directly into the attention matrix. Secondly, instead of applying layer normalization [24] after the self-attention and feed-forward layers, we employ RMSNorm [25] as a more efficient pre-normalization step. A feed-forward layer is described by Equation 5, where $W_1, W_3 \in \mathbb{R}^{d_{emb} \times d_{ffn}}$ and $W_2 \in \mathbb{R}^{d_{ffn} \times d_{emb}}$ are learned weight matrices and $\odot$ represents the elementwise product of two vectors. After each feed-forward layer we employ a dropout layer [26] with the probability given in Table 1.
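To make the masked attention and the gated feed-forward layer more concrete, the following is a minimal pure-Python sketch for a single head, not the authors' implementation. It assumes the standard formulation of [4], $\mathrm{head}_i = \mathrm{softmax}(Q_i K_i^T / \sqrt{d_k} + M) V_i$, with $M_{jk} = 0$ for $k \le j$ and $-\infty$ otherwise, and the SwiGLU form used in Llama-style models, $(\mathrm{SiLU}(x W_1) \odot x W_3) W_2$. All function names are hypothetical, and RoPE, RMSNorm and dropout are left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive zero weight."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(length):
    """M[j][k] = 0 for k <= j and -inf in the upper right triangle (k > j)."""
    return [[0.0 if k <= j else -math.inf for k in range(length)]
            for j in range(length)]

def attention_weights(x, w_q, w_k):
    """Masked, scaled dot-product attention weights of one head (L x L)."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # K^T
    scores = matmul(q, k_t)                       # Q K^T
    mask = causal_mask(len(x))
    return [softmax([s / math.sqrt(d_k) + m for s, m in zip(s_row, m_row)])
            for s_row, m_row in zip(scores, mask)]

def attention_head(x, w_q, w_k, w_v):
    """Output of one head (L x d_v): attention weights applied to V."""
    return matmul(attention_weights(x, w_q, w_k), matmul(x, w_v))

def silu(v):
    return v / (1.0 + math.exp(-v))

def swiglu_ffn(x, w1, w2, w3):
    """Gated feed-forward layer: (SiLU(x W1) * (x W3)) W2, elementwise gate."""
    gate, up = matmul(x, w1), matmul(x, w3)
    hidden = [[silu(g) * u for g, u in zip(g_row, u_row)]
              for g_row, u_row in zip(gate, up)]
    return matmul(hidden, w2)
```

Because of the mask, the first position can only attend to itself, so its output equals the first row of $V_i$; this is the property that keeps generation autoregressive.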
The concat function, which we use below, is defined as follows:
$$\mathrm{Concat}(A, B) = \begin{pmatrix} A \\ B \end{pmatrix}$$
Here, $A$ and $B$ are matrices of shape $a \times e$ and $b \times e$, respectively. The result of the concat function is a matrix of shape $(a + b) \times e$, where the rows of $A$ are stacked on top of the rows of $B$.
Furthermore, we made significant alterations to the context ingestion process. The input to our model is a sequence $X$ of shape $L \times d_{emb}$, which can be divided into two parts: $X = \mathrm{Concat}(C, S)$. The first part, $C \in \mathbb{R}^{c \times d_{emb}}$, later also referred to as the "context", represents the given conditions and can be expressed as $C = \mathrm{Concat}((t_1, t_2, \ldots, t_n)^T, t_{ts}) \in \mathbb{R}^{c \times d_{emb}}$. The embedded vectors $t_i \in \mathbb{R}^{d_{emb}}$, $i \in \{1, \ldots, n\}$, represent the $n$ numerical conditions provided to the model; in our case, $n = 3$. In turn, $t_{ts} \in \mathbb{R}^{k \times d_{emb}}$ is a matrix of $k$ embedded tokens that is concatenated with the numerical conditions. The second part, $S \in \mathbb{R}^{s \times d_{emb}}$, describes the molecule itself as a SMILES. Here, $c$ is the length of the complete context and $s$ is the length of the given SMILES; neither is fixed. Typically, a token sequence includes multiple tokens, which translates to multiple embeddings in the context, while a numerical condition is represented by one embedding. In our case, the order was the following: first the embedded numerical conditions, then the token sequence embeddings, and lastly the SMILES itself. During the training process, we learn SMILES embeddings by learning an embedding vector for each token.
In order to facilitate controlled property generation of molecules, we prepend the sequence with conditions, such as numerical values or a token sequence. While the number of conditions theoretically has no limit, we limit the contexts to three numerical values and one token sequence for the purposes of this paper. Each numerical value is assigned a type identifier, and a separate linear layer is used to transform it into the embedding dimension. The transformed values are then combined with the learned type encoding specific to each numerical property. In our implementation, we assigned a fixed type number to each property and mapped it to a learnable vector, which serves as the type encoding.
To provide positional information, we applied RoPE to every part of the context and sequence. Although adding positional information to numerical values is not necessary, we chose to include it for the sake of simplicity in implementation; this did not negatively impact the model's performance.
Due to the type identifiers, this approach enables the model to differentiate between various conditions in a straightforward, yet effective manner. Consequently, we are free to mix or even omit conditions within the sequence. This property plays a crucial role in our training procedure, as in combination with the SCL method it allows the model to adapt dynamically and process all possible combinations of context.
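As an illustration of how typed numerical conditions could be turned into context rows, here is a small sketch with toy dimensions and random, untrained weights. It assumes that "combined" means elementwise addition of the projected value and the type encoding; the property names, `D_EMB = 8`, and all function names are hypothetical and not taken from the paper's code.

```python
import random

D_EMB = 8                                         # toy size; the paper uses 384
PROPERTIES = ["logp", "sascore", "mol_weight"]    # hypothetical property names

_rng = random.Random(0)

def _rand_vec(n):
    return [_rng.uniform(-0.1, 0.1) for _ in range(n)]

# One linear layer per property. The input is a scalar, so the layer reduces
# to a weight vector and a bias vector of length D_EMB.
linear = {name: (_rand_vec(D_EMB), _rand_vec(D_EMB)) for name in PROPERTIES}

# One type encoding per property, looked up by its fixed type number.
type_embedding = [_rand_vec(D_EMB) for _ in PROPERTIES]

def embed_condition(name, value):
    """Project a scalar into the embedding space and add its type encoding."""
    weight, bias = linear[name]
    type_vec = type_embedding[PROPERTIES.index(name)]
    return [w * value + b + t for w, b, t in zip(weight, bias, type_vec)]

def build_context(conditions):
    """Build the numerical context rows; omitted properties are skipped."""
    return [embed_condition(name, conditions[name])
            for name in PROPERTIES if name in conditions]
```

Calling `build_context({"logp": 2.0, "mol_weight": 3.5})` yields two rows of length `D_EMB`. Since each row carries its own type encoding, leaving out the SAScore does not make the remaining rows ambiguous.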
The degree of creativity of the model's output can be controlled by the so-called temperature parameter, a positive real number $t \in \mathbb{R}^+$: the output log probabilities are divided by this value before sampling. A temperature of $t = 1$ does not alter the model's output, whereas a lower temperature sharpens the output distribution, thus making it more deterministic. Conversely, a temperature greater than one leads to a higher level of variability.

Training

Dataset
The model was trained on a dataset of molecules, which was compiled from several public sources to create a large and diverse population. The resulting dataset, which we call OrganiX13, includes SMILES strings of mostly organic and/or druglike molecules taken from the sources listed in Table 2.
Subsequently, we used RDKit to provide the numerical values for some quick-to-compute surrogate properties to investigate the training behavior and to enable the direct verification of the generated results. The properties chosen were the logP, SAScore, and molecular weight, as those properties also have an impact on the achievable energy density or cost of an electro-active material in the chosen aqueous flow battery application. In detail:
1. LogP is defined as the logarithm of the partition coefficient, which denotes the hydrophobic or lipophilic nature of a molecule. A positive logP value suggests that the molecule prefers non-polar solvents, whereas a negative value indicates that the molecule is soluble in water, a desirable property for aqueous flow battery systems that correlates with energy density.
2. Molecular weight can be used as a proxy for the size of a molecule. Again, to attain high energy densities, we would like to have control over the maximum size of the compounds generated. To ensure numerical similarity with the intervals of other properties, the molecular weights were divided by 100.
3. SAScore: The Synthetic Accessibility Score (SAScore) [20] estimates the ease or difficulty of creating a compound [39]. Based on a frequency analysis of chemical moieties in the PubChem database, it assigns a score ranging from one (easy) to ten (difficult), which is supposed to reflect to some extent the cost of production.
The resulting dataset encompasses many SMILES strings that cover a broad range of about 12 units in logP, a range in SAScore from around 1 to 6, as well as a similar range in scaled molecular weights. This served as a basis for the subsequent training.

Procedure
Initially, we convert the SMILES representation into a sequence of tokens using a tokenizer. We used the BERT tokenizer [40] in DeepChem [41], which employs a fixed vocabulary size of 591 tokens. It splits the SMILES at the character level, except for values enclosed in square brackets, which are treated as a single token.
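The splitting rule described above can be sketched with a single regular expression. This is a simplified illustration of the stated rule only (bracketed atoms are kept whole, everything else is split per character); it is not DeepChem's tokenizer and does not map tokens to the 591-entry vocabulary. The names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re

# A bracketed expression such as [C@@H] or [Na+] is one token;
# every other character becomes its own token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|.")

def tokenize(smiles):
    """Split a SMILES string into tokens following the rule stated in the text."""
    return TOKEN_PATTERN.findall(smiles)
```

For example, `tokenize("C[C@@H](N)C(=O)O")` keeps `[C@@H]` as a single token and splits the remaining characters individually.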
These tokens are then passed through a separate lookup table, which maps them to a $d_{emb}$-dimensional embedding space. Prior to feeding the token embeddings into the decoder model, a context is added at the beginning.
For each numerical property, we projected the values into the embedding space using their respective linear layer and then combined them with the embedded type identifier. These identifiers allow the model to distinguish the significance of each numerical value, enabling easy manipulation of their positions or exclusion. Since the properties remain constant throughout the training process, we calculate them in advance.
If we use a token sequence as context, we perform these calculations dynamically in each batch during the training, which allows the token sequences to vary in size and content. During a training step, a token sequence represents a contiguous subsequence of the current tokenized SMILES. We start by randomly selecting a starting index from zero up to the current SMILES length, followed by determining a random ending index greater than the starting index but smaller than the current SMILES length.
In our case, we limited the context token sequence to a maximum length of 50 tokens to avoid memory issues, which sufficed for our purposes. This sequence is then embedded using the same embedding layer as the input sequence and combined with an embedding specific to the token sequence, sharing the shape of the input embedding table. Additionally, a learned label embedding is added to these combined token sequence embeddings to indicate their relatedness.
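The sampling of a context token sequence during training might look as follows. This sketch makes two assumptions that the text does not settle: the ending index is treated as inclusive, and the 50-token limit is enforced by truncating the sampled span. The function name and its parameters are hypothetical.

```python
import random

MAX_CONTEXT_TOKENS = 50   # limit stated in the text

def sample_token_sequence(tokens, rng=random, max_len=MAX_CONTEXT_TOKENS):
    """Pick a random contiguous subsequence of a tokenized SMILES."""
    n = len(tokens)
    if n < 2:
        # Too short to choose distinct start and end indices.
        return list(tokens)
    start = rng.randrange(0, n - 1)        # 0 .. n-2
    end = rng.randrange(start + 1, n)      # start+1 .. n-1, inclusive end index
    end = min(end, start + max_len - 1)    # assumed: truncate to the length limit
    return tokens[start:end + 1]
```

Because the span is drawn anew for every training step, the same molecule can contribute many different token sequence conditions over the course of training.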

Stochastic Context Learning (SCL)
Given an input sequence $X \in \mathbb{R}^{L \times d_{emb}}$ of length $L$, where each element is represented by a $d_{emb}$-dimensional vector, we divide it into two parts: $X = \mathrm{Concat}(C, S)$. Our algorithm focuses on modifying the context part $C$. We represent this part as a combination of two parts. The first is $C_{num} = (t_1, t_2, \ldots, t_n)^T \in \mathbb{R}^{n \times d_{emb}}$, where $n$ represents the maximum number of numerical conditions used in the training process (in this case, $n = 3$). The second is the token sequence $C_{ts} \in \mathbb{R}^{k \times d_{emb}}$, where $k$ is the length of the token sequence, such that $C = \mathrm{Concat}(C_{num}, C_{ts})$. The length $k$ is not fixed and can change for each input sequence $X$.
To begin, we set a deletion probability $p_{del}$ to 15% during training. Each row in the $C_{num}$ matrix is selected for deletion with a probability of $p_{del}$. If a row is selected, we remove it from the $C_{num}$ matrix and consequently from the input sequence $X$, which then has the shape $(L - i) \times d_{emb}$, where $0 \le i \le n$ is the number of deleted numerical conditions. Similarly, the same probability is used to decide whether the token sequence should remain in the context for the current sequence. In this case, $p_{del}$ determines whether the entire token sequence is removed, not just one row. Occasionally, all conditions in $C_{num}$ and $C_{ts}$ may be eliminated. In such instances, the sequence becomes unconditioned.
For batched input sequences $X_{batch}$ with shape $\mathbb{R}^{B \times L_{max} \times d_{emb}}$, the process works similarly. We iterate over each of the $n$ numerical conditions and sample whether it should be deleted with a probability of $p_{del}$. If a condition is selected for deletion, we remove the corresponding row from all entries in the batch of size $B$. A description of the batched algorithm is given in Algorithm 1. We assume that every molecule in the batch has all $n$ numerical properties. If a molecule only had a portion of the properties, we would simply pad the missing values. In our case, there was no need for padding, as all molecules had all the numerical properties. The batch is created out of $B$ sequences $X$, each of which can have a different length $L$ due to the variance in length of the token sequence condition and of the SMILES itself. To batch those together, we take the maximum sequence length $L_{max}$ over all sequences that should be packed into the batch and pad the shorter SMILES by appending pad tokens up to the length $L_{max}$.
Thus, throughout the training process, the model has to handle different combinations of the provided conditions, which allows it to learn unconditional, single-condition, and multi-condition generation in one go. Thanks to the type embeddings we add to every context element, we can change the number of properties that are provided to the model and still have the model distinguish which properties are provided.

Algorithm 1 (batched SCL), final steps: remove all entries of the token sequence along the L-axis from all samples in X (delete token sequence condition); return X (modified input sequence).
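The batched deletion procedure described in this section can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: the context is kept as plain Python lists rather than a padded tensor, the deletion decisions are drawn once per batch as the text describes, and the function name `stochastic_context` is hypothetical.

```python
import random

def stochastic_context(batch_num, batch_ts, p_del=0.15, rng=random):
    """Randomly drop conditions for a whole batch.

    batch_num: one entry per sample, each a list of n numerical-condition rows.
    batch_ts:  one entry per sample, each a list of token-sequence rows
               (lengths may differ between samples).
    Returns one context (list of rows) per sample.
    """
    n = len(batch_num[0])
    # Each numerical condition is deleted with probability p_del,
    # and the decision applies to every sample in the batch.
    keep = [i for i in range(n) if rng.random() >= p_del]
    # The token sequence is kept or removed as a whole.
    keep_ts = rng.random() >= p_del

    contexts = []
    for num_rows, ts_rows in zip(batch_num, batch_ts):
        context = [num_rows[i] for i in keep]
        if keep_ts:
            context.extend(ts_rows)
        contexts.append(context)
    return contexts
```

With `p_del=0.15`, all four conditions survive in roughly half of the batches (0.85 to the fourth power is about 0.52), while a fully unconditioned batch occurs with probability 0.15 to the fourth power, about 0.05%.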

Loss
The model is trained to predict the next token by calculating the cross-entropy loss between the actual next token and the predicted probability for that token. Note that this loss is only calculated for the SMILES part of the given sequence; the prepended context is not considered in the loss. Since we only train with the autoregressive loss, the context does not have to be evaluated while training, making our approach very flexible to various conditions. The gradients of this loss are backpropagated through the model, and the parameters are updated with the Adam optimizer [42]. The cross-entropy loss is defined as follows (Equation 6):
$$\mathcal{L}(\hat{y}, y) = -\frac{1}{N} \sum_{n=1}^{N} \log \frac{\exp(\hat{y}_{n,y_n})}{\sum_{c} \exp(\hat{y}_{n,c})}$$
In this expression, $N$ is the batch size, and $y \in \{0, 1, 2, \ldots, d_{voc}\}^N$ and $\hat{y} \in \mathbb{R}^{N \times d_{voc}}$ correspond to the target tokens and the predicted log probabilities, respectively. The sum in the denominator runs over all tokens $c$ of the vocabulary. The loss is the mean over the negative logarithms of the normalized predicted probabilities of the next token. Here, $\hat{y}_{n,y_n}$ specifically refers to the predicted log probability assigned to the correct target token $y_n$ for the $n$-th sample in the batch.
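A minimal sketch of such a masked cross-entropy, assuming the model outputs one vector of unnormalized scores per position and that a boolean mask marks the positions belonging to the SMILES part, could read as follows. The function name and the mask argument are hypothetical.

```python
import math

def masked_cross_entropy(scores, targets, loss_mask):
    """Mean negative log-likelihood over the positions selected by loss_mask.

    scores:    one list of unnormalized scores per position (vocabulary-sized).
    targets:   index of the correct next token per position.
    loss_mask: True for SMILES positions, False for context positions.
    """
    total, count = 0.0, 0
    for row, target, use in zip(scores, targets, loss_mask):
        if not use:
            continue                      # context positions do not contribute
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_norm - row[target]   # -log softmax(row)[target]
        count += 1
    return total / count
```

For a uniform prediction over a vocabulary of four tokens, the loss equals $\log 4$, independent of what the model outputs at the masked context positions.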
The model was trained on a single Nvidia A100 GPU for two days and used about 35 GB VRAM while training. A constant learning rate of $\alpha = 10^{-4}$, with $\beta_1 = 0.9$ and $\beta_2 = 0.95$, was used for the Adam optimizer. The dataset was randomly partitioned into two parts, a training set comprising 90% and a testing set comprising 10% of the data. The model was trained using a batch size of 256 with gradient accumulation over 4 batches. Each sequence for the model starts with a "start of SMILES" token ([CLS]) and ends with an "end of SMILES" token ([SEP]). Shorter SMILES strings were padded with a "pad" token ([PAD]) to match the length of the longest SMILES in that batch. The same padding process was applied to the token sequence in the context.
New SMILES are then generated sequentially by starting with a "[CLS]" token and predicting the next tokens iteratively. The generation ends when the model predicts the "[SEP]" token or a specified token limit is reached.
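The generation loop, together with the temperature scaling introduced in the Architecture section, can be sketched as below. The model is represented by a hypothetical callable `next_scores` that maps the tokens generated so far to one score per vocabulary entry; the default token limit of 256 is an arbitrary illustrative value, and the conditioning context is omitted for brevity.

```python
import math
import random

def temperature_probs(scores, temperature):
    """Divide the scores by the temperature and normalize with a softmax."""
    scaled = [v / temperature for v in scores]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_scores, cls_id, sep_id, temperature=0.8,
             max_tokens=256, rng=random):
    """Sample tokens one by one until [SEP] appears or the limit is reached."""
    tokens = [cls_id]
    while len(tokens) < max_tokens:
        probs = temperature_probs(next_scores(tokens), temperature)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(token)
        if token == sep_id:
            break
    return tokens
```

A temperature below one raises the probability of the most likely token relative to the unscaled distribution, which is why a value such as 0.8 yields output that follows the learned distribution closely while still allowing some variation.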

Results and Discussion
After the training, we used the model in different scenarios to generate new SMILES, e.g., without any constraints or with one or more constraints (including numerical and/or structural targets), while keeping the temperature parameter constant at 0.8. This value ensures a close but not too strict coupling to the underlying probability distributions, which proved helpful in our experiments.
The metrics used to measure the performance of the model for a batch of generated compounds are the mean absolute deviation (MAD) between requested and obtained numerical values, and the percentages of novelty, uniqueness, and validity of the generated molecular structures. The reference dataset for these comparisons is a randomly chosen subset of 2.5 M samples from the training dataset, since evaluation against the full dataset proved too costly.
In more detail, these metrics are defined as follows:
1. Novelty: The percentage of newly generated molecules not present in the reference dataset. We use this to ensure that the model is not memorizing the training data but is instead inventing new compounds. We measure this by comparing the generated SMILES with the SMILES in the dataset. Please note that this is just a string comparison and not a test of the molecular graphs for isomorphism, i.e., synonymous SMILES of the same molecule are not detected as redundant molecular structures by this procedure.
2. Uniqueness: The ability of the model to generate unique molecules. We measure the percentage of unique molecules generated in a batch of 1k and 10k molecules under specific conditions. Again, identical molecules with synonymous SMILES remain undetected.
3. Validity: The ratio of the number of properly parsed SMILES (by RDKit [43]) to the total number of generated SMILES in a batch.
4. Mean absolute deviation (MAD): Defined as
$$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i|$$
For each of the $n$ generated SMILES strings, the target value of the respective property is denoted as $y_i$, while $x_i$ represents the 'true', i.e., actually calculated, property value. The model should minimize this quantity without being explicitly trained on it, which would indicate that the model incorporates the provided context into the generative process. This metric is also used to enable comparisons to other models, e.g. [8].
We specifically compare our model to MolGPT [8] as it is the most similar in terms of architecture and choice of conditions.
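The four metrics can be written down in a few lines. The sketch below follows the definitions given above and uses plain string comparison for novelty and uniqueness; the validity check is passed in as a callable so that any SMILES parser (the paper uses RDKit) can be plugged in. All function names are hypothetical.

```python
def novelty(generated, reference):
    """Percentage of generated SMILES strings not found in the reference set."""
    reference_set = set(reference)
    return 100.0 * sum(1 for s in generated if s not in reference_set) / len(generated)

def uniqueness(generated):
    """Percentage of distinct SMILES strings within the generated batch."""
    return 100.0 * len(set(generated)) / len(generated)

def validity(generated, is_valid):
    """Percentage of generated SMILES accepted by the supplied parser check."""
    return 100.0 * sum(1 for s in generated if is_valid(s)) / len(generated)

def mean_absolute_deviation(targets, actual):
    """Mean of |y_i - x_i| over requested (y) and calculated (x) property values."""
    return sum(abs(y - x) for y, x in zip(targets, actual)) / len(targets)
```

As noted in the definitions, two different SMILES strings of the same molecule count as two distinct entries here; canonicalizing the strings first would be needed to detect such synonyms.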

Unconditional Generation
Without applying any conditions, we generated 20k SMILES and calculated the corresponding properties logP, SAScore, and molecular weight using RDKit. The resulting frequency distributions are very similar to the distributions obtained from a representative sample of training molecules, see Figures 2a, 2b, and 2c. This indicates that the model has indeed learned the inherent distribution of the training dataset, without being specifically trained for unconditional generation.
The generated SMILES also achieve performance very comparable to MolGPT in terms of uniqueness and validity, as shown in Table 3 (rows 1 and 8). Surprisingly, our degree of novelty is significantly higher than in the MolGPT case. We suspect that this is mostly due to our larger dataset, which makes it much more unlikely for the model to reproduce a specific entry.

Single Condition
In this experiment, we also assessed the model's ability to handle single-condition generation over wide ranges of target values.
For each target value of the intervals listed in Table 3, rows 2-7, the procedure involved generating a sample of 10,000 molecules. As before, we determined the true property values of the generated molecules using RDKit and compared them with the requested target values.
Despite the low probability of the model being trained solely on one property, it performs well in this task, as demonstrated in Figures 3a, 3b, and 3c. The model achieves low MAD values across the entire span of the respective target properties (rows 2, 4, and 6), although the MAD values obtained for the in-distribution series are, as expected, significantly lower (rows 3, 5, and 7). In fact, predicting logP values to an accuracy of 0.5 logP units (root mean square deviation) is commonly considered a satisfactory result [44].
Still, the out-of-sample performance is acceptable. Although the scatter increases, the general trend is well retained, with the only exception of the very high (>7) SAScores. In this case (row 4 in Table 3), we also observe a concomitant drop in the percentage of validity of the generated molecules (81 vs 99.7 %). Upon manual inspection, we find that the model generated not only compounds with highly bridged ring systems and/or accumulations of stereogenic centers, which are supposedly very demanding to synthesize, but also rare and unstable atomic environments such as neighboring diradicals and/or carbenes, which presumably prevent the proper parsing of structures.
In comparison to MolGPT, our model achieves a slightly lower MAD in the single-condition case without being specifically trained on that task, while simultaneously scoring the same uniqueness and validity.

Multiple Conditions
For each pairwise combination of target properties, we generated 1k SMILES, see Figures 4, 5 and 6. The graph labels are in the same order as given in the captions of the respective figures. In general, the generated molecules' actual properties center closely around the desired values. Although all chosen values were well within the highly populated areas of the underlying distributions, some combinations turned out to be hard to satisfy, resulting in a more pronounced scatter.
Figure 4 shows the distribution of calculated logP values and SAScores. This pair works well for lower logP values, but for higher ones the variance along the SAScore axis rises significantly. This seems to indicate that in this case, the logP values have a slight priority in the generative process compared to SAScore. There are some outliers, but most of the generated molecules fulfill both conditions. We suspect that this could be an effect of the shortage of training data in that region, thus leading to more inaccurate results.
Next, we compare the combination of logP and the molecular weight values, as shown in Figure 5. Apparently, the molecular weight takes priority in the generation, as it displays a much smaller variance compared to the logP. However, the logP is still met accurately despite being under very strict size constraints. This comes as no surprise, due to the ease with which the molecular weight can be determined by counting the contributions of each atom, as opposed to the more extensive considerations demanded by logP values.
Lastly, Figure 6 displays the combination of SAScore and molecular weight. Similar to the logP and molecular weight comparison, molecular weight still dominates the generative process. In comparison, the model cannot uphold the SAScore in all cases. This is especially evident in the case where the molecular weight is set to a low value of 1.5, which results in a high SAScore variance. Apparently, the model struggles to incorporate a sufficient number of challenging motifs into a small molecule, due to the limited size and range of available elements. In contrast, when the weight is set to a higher value of 3.5, i.e. a larger molecule, we obtain a much lower variance.
Finally, in Figure 7 we visualize the generated molecules that take into account all three properties. As is evident from the disjoint point clouds in the graph, the model learned to consider all three conditions and generate matching molecules. The labels in the graph should be read in the order of logP, SAScore and then molecular weight.
In Table 4 we compare the performance of our model to MolGPT in multi-conditional generation where applicable. Each row in the table represents an experiment, with the columns representing the properties used as conditions. If a condition was not utilized, the cell was left empty. Our model has on-par performance compared to MolGPT in the case of logP + SAScore, while simultaneously being able to handle other condition combinations effectively.

Token Sequence Incorporation
A very common task in material design is to create analogs from a given starting molecule and add/modify structural features to customize the physical properties. For this reason, the model also accepts, as an additional condition, a SMILES string representing the desired molecular moiety that should be integrated as a building block in the newly generated structure. To measure the performance of our model for this task, we used the following criterion:
• Substructure Matches (SM): The substructure match measures the percentage of generated molecules that explicitly include the target moiety. As a first step, we convert the target structure into a SMARTS [45] pattern, which is essentially a regular expression to match specific atoms or substructures within a molecular structure. To make the criterion a little less strict, all information about bond orders is removed, and only the connectivity itself is retained. Therefore, the pattern tolerates modifications in the details of the electronic structure (e.g. localized double bonds versus aromatic bonds), while still maintaining the overall topology. With this property, we measure how often the target structure is retained during the generative process.
For this experiment, at a constant temperature of 0.8, batches of 1000 molecules were generated for various context token sequences and evaluated using the mentioned metric.Table 6 lists different organic target structures (as the context token sequence in SMILES form) and the results obtained a) without applying any other conditions (columns: uniqueness at 1k / SM), and b) with another additional numerical condition (columns: LogP/SAScore/molecular weight at different target values each).
Overall, the model seems to perform very well, as we can recover the target structures at least once in most of the newly generated SMILES, except for one particular example, Thiophene, see below.However, especially when given a larger target structure like for example Morphine, we observe that the generated structures become very repetitive.
By chance, we found that the success rate in generating structures containing the building block thiophene is very dependent on the explicit formulation of the SMILES string (rows 2 + 3 in Table 6).Thiophene is a common heteroaromatic substructure in the training dataset.It can be expressed in lowercase letters, which represent aromatic ring systems, or in uppercase letters, indicating localized double bonds.The training data mostly use the aromatic notation (in various synonyms, depending on where the listing of ring atoms starts).Overall, we found about 144k SMILES strings that contain at least one thiophene substructure.Table 5 shows the synonyms, their frequencies in the training data, and the retrieval rates (for a batch size of 100 SMILES).
The recovery rate for this target moiety varies significantly, ranging from 10% to 70% for various aromatic synonyms. However, it exhibits only a loose correlation with the observed relative frequency of each synonym in the training data.
In contrast, a SMILES in kekulized notation (i.e., with localized double bonds, a rarity in the training data) achieves a retrieval rate of 90% for the specified token sequence. Notably, this sequence is expressed in aromatic lowercase notation in the generated SMILES. We suspect that the discrepancies in performance trace back to the internal perception of the target context token sequence, owing to the rather complicated rules defining aromaticity and the relative position of the tell-tale sulfur atom as a main feature of thiophene. The explicit specification of double bonds, by contrast, seems to allow a more reliable (re)construction of the overall structure, despite its almost complete absence from the training set.
The really useful application for customizing given structures is the simultaneous application of one or more additional criteria.

Thus, we study combinations of token sequence conditions together with single numerical conditions (see Table 6). Each combination was tested on 1000 generated molecules, with the numerical values uniformly sampled from the range specified in the table header for each property.
In most cases, we observed a decrease in the number of substructure matches compared to the previous run without numerical conditions. This is likely because the model has to handle two possibly competing conditions simultaneously. The MAD values for logP and SAScore were also notably higher than when generating without a token sequence, but remain within acceptable limits. It is worth noting that conflicting conditions, such as Ibuprofen combined with negative logP values, led to some significant outliers. Conversely, when the conditions aligned well, the MAD values were consistent with the previous results. More details on the relationship between Ibuprofen and logP can be found in Appendix 5.1. We also observed that the MAD of the SAScore in the case of Morphine is significantly higher than in the other examples. This is mostly because Morphine has a SAScore of about 5.2, whereas we requested lower values. In this case, the model prioritizes the token sequence over the SAScore, which leads to the higher MAD.
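The effect of a few outliers on the MAD can be made concrete with a short sketch. We assume here that MAD denotes the mean absolute deviation between requested and obtained property values; the numbers below are invented purely for illustration and are not results from our experiments.

```python
from statistics import mean
from typing import Sequence


def mean_absolute_deviation(requested: Sequence[float], actual: Sequence[float]) -> float:
    """Average of |requested - actual| over all generated molecules."""
    if len(requested) != len(actual) or len(requested) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(abs(r - a) for r, a in zip(requested, actual))


# Illustrative values only: ten molecules requested at logP = -1.0.
requested = [-1.0] * 10
well_behaved = [-0.9] * 10            # every molecule misses by 0.1
with_outlier = [-0.9] * 9 + [3.0]     # one molecule stays near a high logP

mad_clean = mean_absolute_deviation(requested, well_behaved)    # about 0.1
mad_outlier = mean_absolute_deviation(requested, with_outlier)  # about 0.49
print(round(mad_clean, 2), round(mad_outlier, 2))
```

A single molecule that retains the fragment but misses the requested value by 4.0 raises the MAD almost fivefold in this example, even though nine out of ten molecules are on target.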
Apparently, the token sequence condition takes precedence in most cases over the criteria logP and SAScore, as evidenced by the elevated MAD scores. The molecular weight, in contrast, seems to be prioritized over the token sequence, as evidenced by the very low MAD scores, particularly for larger molecules such as Morphine.
We also conducted experiments in which multiple token sequences were tested under two numerical conditions simultaneously. The results of these experiments can be found in Table 7. Each row in the table represents a specific experiment, with the columns representing the properties used as conditions. If a condition was not used, the cell was left empty.
The model consistently performs well under various conditions, as shown by the low MAD values. However, when conditions are overly restrictive in combination with the token sequence, it can lead to higher MAD values. This is because the model prioritizes certain properties over others.
For instance, consider Paracetamol, where both logP and molecular weight conditions are applied. Because the molecular weight constrains the size of the molecule, decreasing the logP value significantly becomes challenging. In this case, the model prioritizes the molecular weight condition. We suspect this is because molecular weight is easier to validate and has more pronounced limitations compared to logP.
Nevertheless, the model effectively satisfies all three constraints in most cases, as evidenced by a high percentage of substructure matches and low MAD values for the properties in Table 7. Notably, when generating molecules with three properties, some MAD values are even lower than those observed in two-property generation.This could be attributed to the model being trained on a larger number of three-property batches, resulting in improved performance.
In general, all four conditions are respected during the generative process and make significant contributions to the resulting molecules.

Conclusion
Our aim was to provide a tool for exploring the chemical space relevant to a given application, in our case the subspace of organic, potentially electro-active compounds. We therefore adapted existing work and approaches to our needs and devised a new training variant that yields a single model that is very flexible in use and was trained on a data set of substantial size.
In detail, we
1. developed a GPT-style Transformer based on the Llama 2 architecture, showing strong performance in both single- and multi-conditional generation, comparable to or slightly surpassing existing models despite not being task-specific.
2. compiled and utilized a training dataset comprising 13 million organic molecules sourced from various origins, enhancing the model's ability to generate a variety of molecular structures.
3. implemented a new training method we call Stochastic Context Learning (SCL), enabling our model to handle various combinations of conditions efficiently for multi-conditional generation using a single model.
We were able to show that the training process was successful and that the achieved accuracy is very satisfactory. The model generalizes quite well, as target values requested outside the well-sampled areas still tend to fall in the desired ranges.
The whole setup is very generic and easily adaptable to other applications. The latter motivates the number and choice of properties used as conditions for narrowing down the search space. In fact, for the model to be more useful in the search for energy-storage materials, we intend to provide a more meaningful, yet expensive property in the future, such as the enthalpy of reaction.
Looking ahead, this research opens up exciting possibilities for further advancements in generative models and their applications in chemistry and related fields. Our modified architecture, combined with the SCL approach, holds great potential for generating novel and diverse organic molecules with precise control over desired properties.
In theory, a single model can learn a wide range of conditions and combinations by utilizing this approach during training. Therefore, we chose the SAScore (reflecting a material's production cost), molecular size and logP (contributing to the energy density), as well as a desirable molecular core structure as optional target conditions. As an added benefit, a single model also comes at a reduced training cost. This method enables a more flexible and scalable training process, as it does not require every property to be available for all samples.
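The idea behind SCL can be conveyed by a rough sketch of how a context might be drawn for each training batch. This is not the Batched SCL algorithm as implemented for Llamol: the property keys, the probability `p_fragment`, and the rule of dropping a property for the whole batch when any sample lacks it are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Optional

Sample = Dict[str, Optional[object]]

# Hypothetical keys for the three numerical conditions.
NUMERICAL = ("logp", "sascore", "mol_weight")


def sample_context(batch: List[Sample], rng: random.Random,
                   p_fragment: float = 0.5) -> List[Dict[str, object]]:
    """Pick a random subset of conditions and apply it to the whole batch."""
    # Assumption: a property is only usable if every sample in the batch has it.
    available = [name for name in NUMERICAL
                 if all(s.get(name) is not None for s in batch)]
    k = rng.randint(0, len(available))   # 0 means unconditional generation
    chosen = rng.sample(available, k)    # random subset in random order
    # Assumption: the token sequence is included with probability p_fragment.
    use_fragment = (rng.random() < p_fragment
                    and all(s.get("fragment") for s in batch))
    contexts = []
    for s in batch:
        ctx = {name: s[name] for name in chosen}
        if use_fragment:
            ctx["fragment"] = s["fragment"]
        contexts.append(ctx)
    return contexts


# Illustrative batch; the second property set is incomplete for the first sample.
batch = [
    {"logp": 1.2, "sascore": 2.5, "mol_weight": None, "fragment": "c1ccsc1"},
    {"logp": 0.3, "sascore": 3.1, "mol_weight": 180.2, "fragment": "c1ccsc1"},
]
contexts = sample_context(batch, random.Random(0))
```

Because the subset of conditions changes from batch to batch, one model sees every combination during training, and samples with missing properties can still contribute through the remaining conditions.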

Outlook
For future work, we intend to focus more on curating the dataset, so as to avoid the very concentrated distributions currently found for all properties. We hope that reducing redundant molecules will help the model generalize better while also reducing training time. Generally, we assume that the model could perform even better with more training data, as it seems to be underfitted even with our large dataset.
Furthermore, we also intend to expand the number of properties that are given to the model, as there are more useful conditions for practical applications, such as the HOMO-LUMO gap.

Appendix 5.1 A
In this chapter, we visualize the errors of generated molecules using the molecular fragment condition with a single numerical condition.
In the case of Ibuprofen, with a naturally very positive logP of about 3.0, it is very difficult for the model to significantly reduce the logP to the desired negative values while also keeping the fragment intact. This leads to an overall higher MAD, due to a small number of large outliers that increase the mean by a significant margin. This can be seen in Figure 8.
We also conducted some experiments on special combinations of different conditions, as these also show the limitations of the model, either due to the incompatibility of these conditions or the lack of training data in those regions.
We tested the combination of a low molecular weight (100) and a high SAScore (7), which can be seen in Figure 9. The generated molecules exhibit both characteristics: they are hard to synthesize owing to the high number of connected, bridged, annelated or spiro rings, the ring strain associated with the high degree of interconnected rings, and/or open-shell centers (radicals and/or carbenes), while the molecular weight remains small. In this scenario, the model also uses more uncommon elements to satisfy both conditions.

C
In this section, a sample of the generated molecules is visualized for each property. Figure 10 showcases examples generated with logP as the condition, ranging from negative to positive values. Figure 11 shows the change over different SAScores. Lastly, Figure 12 shows how the generated molecules grow larger with rising molecular weight.

Figure 2: Distribution of properties as obtained from a 2.5M sample of training molecules in comparison with the distributions from 20k unconditionally generated molecules.

Figure 3: Requested (x-axis) versus actual values (y-axis) for the diverse target properties: a) logP, b) SAScore, c) molecular weight. For each target value, a batch of 10k SMILES was generated; MAD is averaged over the entire range.

Figure 9: Special case: generated molecules with low molecular weight and high SAScore as conditioning.

Figure 10: A sample of the generated molecules with logP as conditioning.

Figure 11: A sample of the generated molecules with SAScore as conditioning.

Figure 12: A sample of the generated molecules with the molecular weight as conditioning.

Table 1
Algorithm 1: Batched SCL algorithm. Require: input sequence $X \in \mathbb{R}^{B \times L \times d_{\mathrm{emb}}}$, where $n$ is the maximum number of numerical conditions, $B$ the batch size, $L$ the sequence size, and $d_{\mathrm{emb}}$ the embedding size.

Table 3: Comparison of metrics at a temperature of 0.8 for 10k generated molecules. All metrics are evaluated on these 10k molecules, except uniqueness at 1k.

Table 4: Comparison of multiple property conditions for 10k generated molecules with other models.

Table 5: Thiophene (as a context token sequence) analysis for 100 generated molecules.

Table 6: Comparison of metrics on 1000 generated molecules for each context token sequence.

Table 7: Comparison of multiple property conditions for 1000 generated molecules using 4 example token sequences.