Skip to main content

Llamol: a dynamic multi-conditional generative transformer for de novo molecular design


Generative models have demonstrated substantial promise in Natural Language Processing (NLP) and have found application in designing molecules, as seen in General Pretrained Transformer (GPT) models. In our efforts to develop such a tool for exploring the organic chemical space in search of potentially electro-active compounds, we present Llamol, a single novel generative transformer model based on the Llama 2 architecture, which was trained on a 12.5M superset of organic compounds drawn from diverse public sources. To allow for a maximum flexibility in usage and robustness in view of potentially incomplete data, we introduce Stochastic Context Learning (SCL) as a new training procedure. We demonstrate that the resulting model adeptly handles single- and multi-conditional organic molecule generation with up to four conditions, yet more are possible. The model generates valid molecular structures in SMILES notation while flexibly incorporating three numerical and/or one token sequence into the generative process, just as requested. The generated compounds are very satisfactory in all scenarios tested. In detail, we showcase the model’s capability to utilize token sequences for conditioning, either individually or in combination with numerical properties, making Llamol a potent tool for de novo molecule design, easily expandable with new properties.

Scientific contribution

We developed a novel generative transformer model, Llamol, based on the Llama 2 architecture that was trained on a diverse set of 12.5 M organic compounds. It introduces Stochastic Context Learning (SCL) as a new training procedure, allowing for flexible and robust generation of valid organic molecules with up to multiple conditions that can be combined in various ways, making it a potent tool for de novo molecular design.


In fields like energy storage materials or medicinal chemistry, substances are key to technological advancement and progress: the success of these applications hinges on the specific properties of the materials. However, the processes of discovery and development of new materials often face practical and/or principal obstacles, such as unavailability of compounds or precursors, high production costs, and the need for extensive trials on the practical side, or limited data and/or experience, as well as biased expectations of designers and developers on the other hand. Generative models, a powerful category in machine learning, have the potential to address both of these issues simultaneously, as they can help focus our efforts a priori only on the most likely candidates.

Many architectures related to creation of novel data points were developed in recent years, most notably Recurrent Neural Networks (RNN) [1], Generative Adversarial Networks (GAN) [2], Variational Autoencoders (VAE) [3] and Transformers [4]. The transformer architecture, especially, has revolutionized the fields of Natural Language Processing (NLP) [5] and other domains like computer vision [6]. The introduction of the General Pretrained Transformer (GPT) architecture led to significant advancements in generative natural language applications. Generative models have also been applied in the fields of medicine and material science to create new molecules with predefined features, a process known as conditional generation [7, 8]. This application can significantly accelerate the discovery of new candidate molecules. Although current generative models may not provide the optimal solution, they can greatly reduce the size of the chemical space that needs to be evaluated. Current estimates for the size of the chemical space containing drug-like molecules range from \(10^{23}\) to \(10^{60}\) [9]. Many approaches have successfully used VAEs [10,11,12], GANs [13], RNNs [14, 15] or Reinforcement Learning [16]. However, more recently, transformer models, specifically the GPT models [8, 17], have emerged as the new state-of-the-art in this domain, especially, in the field of conditional molecular generation [18,19,20]. A good summary of available models can be found in the survey from Du et. al. [21].

Bagal et al. [8] presented the MolGPT architecture from which a family of models, each one tailored to a specific task, could be derived. Inspired by their work, we set out to develop a solitary model that can handle many tasks simultaneously to support the search for low-cost, high-energy-density alternatives for energy storage materials in flow batteries. The model itself should not require complex training data; thus, it operates on SMILES [22] – a minimalist molecular representation that allows us to draw a mass of data from numerous sources – and easy to provide and directly to verify target properties that serve as conditions (primarily to facilitate the development process of the model).Footnote 1

In this paper, we present a new, dynamic training approach termed “Stochastic Context Learning” (SCL) to train a single model for conditional generation, capable of generating molecules as SMILES while respecting a variable number of conditions. Our training dataset consists of approx. 12.5 million organic molecules, which is a superset of several public datasets (see Sect. 3.1). On this, we train a GPT-style transformer model, specifically a model based on Llama 2 [23], to generate new compounds based on one or more conditions/target properties. To achieve this, we assign a learnable embedding to each property value. This ensures that the model perceives not only the numerical value, but also the associated label.

To be able to assess the model’s performance directly, we chose three easily determined numerical properties: SAScore [24] (reflecting production cost), logP, and molecular weight (contributing to energy density), along with another optional condition: a user-defined core structure that has to be integrated into the final molecule. The latter is given as a SMILES string, which is a continuous sequence of tokens, hereafter referred to as a ’token sequence’.Footnote 2

In the following sections, we detail the architecture, training data and process along with the results obtained for unconditional, single, and multi-conditional molecule generation.


The architecture we utilized, as depicted in Fig. 1, is a modified version of the Llama2 architecture [23] as obtained from GitHub ( The hyperparameters can be found in Table 1, which we determined from previous experiments.

Our model consists of approximately 15 million parameters and is composed of eight decoder blocks. Each decoder block includes a masked multi-head self-attention layer, followed by a Feed Forward Network (FFN) that employs the SwiGLU [25] activation function. While the original Llama 2 architecture utilized Grouped-Query Attention (GQA) [26], we opted for the full multi-head attention mechanism given the comparatively smaller size of our model.

The masked multi-head self-attention layer [4], defined by Eq. 1, takes an embedded input sequence \(X \in {\mathbb {R}}^{L \times d_{emb}}\) of length L, where each element represents an embedding vector with dimension \(d_{emb}\). Through the attention mechanism, each head learns to attend to a different part of the sequence, resulting in an attention matrix \(\text {head}_i \in {\mathbb {R}}^{L \times d_v}\). We utilize dot-product self-attention, which produces three matrices: \(Q_i\) and \(K_i\) with dimensions \(L \times d_k\), and \(V_i\) with dimensions \(L \times d_v\). These matrices are generated by applying linear transformations using weight matrices \(W_Q^i\), \(W_K^i\), and \(W_V^i\), each with dimensions of \(d_{emb} \times d_k\) and \(d_{emb} \times d_v\), respectively, to the input sequence X for each attention head i.

In our specific case, we set \(d_k\) and \(d_v\) to be equal to \(d_{emb} / n_{heads}\), resulting in \(d_k = d_v = 384 / 8 = 48\). To keep the autoregressive property for our model, we mask out the upper right triangle by using the mask matrix \(M \in {\mathbb {R}}^{L\times L}\) shown in Eq. 4. Then, these attention matrices are concatenated with each other along the \(d_v\)-dimension. Afterward, the resulting concatenated matrix is further transformed using another learnable weight matrix \(W_O \in {\mathbb {R}}^{h \cdot d_v \times d_{emb}}\).

$$\begin{aligned}&\text {MMHA}(X) = \text {Concat}(\text {head}_{1}, \text {head}_{2}, \ldots , \text {head}_{h}) \cdot W_O \end{aligned}$$
$$\begin{aligned}&\text {head}_i = \text {MaskedAttention}(X \cdot W_Q^i, X \cdot W_K^i, X \cdot W_V^i) \end{aligned}$$
$$\begin{aligned}&\text {{Masked\,Attention}}(Q, K, V) = \text {{softmax}} \left( \frac{{QK^T}}{\sqrt{d_k}} + M \right) \cdot V \end{aligned}$$
$$\begin{aligned}&M = \overbrace{\begin{pmatrix} 0 &{} -\infty &{} -\infty &{} \ldots &{}-\infty \\ 0 &{} 0 &{} -\infty &{}\ldots &{}-\infty \\ \ldots &{}\ldots &{} \ldots &{} \ldots &{} -\infty \\ 0 &{} 0 &{} 0 &{} \ldots &{}0 \end{pmatrix}}^{L} \end{aligned}$$
$$\begin{aligned}&\text {FFN}(X) = \text {SwiGLU}(X,W_1,W_3) \cdot W_2 \end{aligned}$$

The Llama 2 architecture employs several changes compared to the standard decoder architecture [4]. Firstly, we use rotary positional embeddings (RoPe) [27] to encode absolute and relative positional information directly into the attention matrix. Secondly, instead of applying layer normalization [28] after the self-attention and feed-forward layers, we employ RMSNorm [29] as a more efficient pre-normalization step. A feed-forward layer is described by Eq. 5, where \(W_1, W_3 \in {\mathbb {R}}^{d_{emb} \times d_{ffn}}\) and \(W_2 \in {\mathbb {R}}^{d_{ffn} \times d_{emb}}\) are learned weight matrices and \(\odot\) represents the elementwise product of two vectors. After each feed-forward layer, we employ a dropout-layer [30] with the probability given in Table 1. The concatenate function, stacks the matrices row-wise \(\text {Concat}(A, B): {\mathbb {R}}^{a \times e} \times {\mathbb {R}}^{b \times e} \rightarrow {\mathbb {R}}^{(a+b \times e)}\).

Furthermore, we made significant alterations to the context ingestion process. The input to our model is a sequence X of shape \(L \times d_{emb}\), which can be divided into two parts: \(X = \text {Concat}(C,S)\). The first part, \(C \in {\mathbb {R}}^{c \times d_{emb}}\), also later referred to as the “context”, represents the given conditions and can be expressed as \(C = Concat((t_{1}, t_{2}, \ldots , t_{n} )^T, t_{ts}) \in {\mathbb {R}}^{c \times d_{emb}}\). The embedded vectors \(t_{i} \in {\mathbb {R}}^{d_{emb}} \ \forall i \in \{1,\ldots ,n\}\) represent the n numerical conditions, which are provided to the model, in our case \(n=3\). On the other hand, \(t_{ts} \in {\mathbb {R}}^{k \times d_{emb}}\) is a matrix of k embedded tokens, that is concatenated with the numerical conditions. The second part, \(S \in {\mathbb {R}}^{s \times d_{emb}}\), just describes the molecule, as a SMILES, itself. c is just the length of the complete context and s is the length of the given SMILES, both are not fixed in length.

Typically, a token sequence includes multiple tokens, which translates to multiple embeddings in the context, while a numerical condition is represented by one embedding. In our case, the order was the following: First are the embedded numerical conditions, then the token sequence embeddings, and lastly the SMILES itself. During the training process, we learn SMILES embeddings by learning an embedding vector for each token.

In order to facilitate controlled property generation of molecules, we prepend the sequence with conditions, such as numerical values or a token sequence. In principle, the number of conditions has no limit as it is part of the input sequence of the transformer model, although empirical evaluation is needed to determine practical limits and scalability. Initial experiments with ten conditions showed promising results, suggesting the model’s potential to handle a larger number of conditions. For the purposes of this paper, we limit the contexts to three numerical values and one token sequence. Each numerical value is assigned a type identifier, and a separate linear layer is used to transform them into the embedding dimension. The transformed values are then combined with the learned type encoding specific to each numerical property. In our implementation, we assigned a fixed type number to each property and mapped it to a learnable vector, which serves as the type encoding.

To provide positional information, we applied RoPe to every part of the context and sequence. Although adding positional information to numerical values is not necessary, we chose to include it for the sake of simplicity in implementation, without negatively impacting the model’s performance.

Due to the type identifiers, this approach enables the model to differentiate between various conditions in a straightforward, yet effective manner. Consequently, we are free to mix or even omit conditions within the sequence. This property plays a crucial role in our training procedure, as in combination with the SCL method it allows the model to adapt dynamically and process all possible combinations of context.

The degree of creativity of the model’s output can be controlled by the so-called temperature parameter, which is defined as a positive real number \(t \in {\mathbb {R}}_{+}\), by dividing the output log probabilities by the said value. A temperature of \(t = 1\) does not alter the model’s output, whereas a lower temperature sharpens the output distribution, thus making it more deterministic. Conversely, a temperature greater than one leads to a higher level of variability.

Fig. 1
figure 1

The Llamol architecture visualized

Table 1 Hyperparameters used for the Llamol model



The model was trained on a dataset of molecules, which was compiled from several public sources to create a large and diverse population. The resulting dataset, we call OrganiX13, includes SMILES strings of mostly organic and/or drug-like molecules taken from the sources listed in Table 2.

Table 2 Datasets used in the combined dataset

All SMILES were canonicalized via RDKit, while keeping the stereochemistry intact and duplicates were, via string-based comparison, removed. Entries that could not be parsed properly by RDKit were also removed. Likewise, all molecules exceeding a limit of 256 tokens or ionic structures (salts) were rejected. After this preprocessing, the final dataset contains approximately 12.5 million SMILES.

Subsequently, we used RDKit to provide the numerical values for some quick-to-compute surrogate properties to investigate the training behavior and to enable the direct verification of the generated results. The properties chosen were the logP, SAScore, and molecular weight, as those properties also have an impact on the achievable energy density or cost of an electro-active material in the chosen aqueous flow battery application. In detail,

  1. 1.

    LogP is defined as the logarithm of the partition coefficient, which denotes the hydrophobic or lipophilic nature of a molecule. A positive logP value suggests that the molecule prefers non-polar solvents, whereas a negative value indicates that the molecule is soluble in water, a desirable property for aqueous flow battery systems, which correlates to energy density.

  2. 2.

    Molecular weight can be used as a proxy for its size. Again, to attain high energy densities, we would like to have control over the maximum size of the compounds generated. To ensure numerical similarity with the intervals of other properties, the molecular weights were divided by 100.

  3. 3.

    SAScore: The Synthetic Accessibility Score (SAScore) [24] estimates the ease or difficulty of creating a compound. Based on a frequency analysis of chemical moieties in the PubChem database, it assigns a score ranging from zero (easy) to ten (difficult), which is supposed to reflect to some extent the cost of production.

The resulting dataset encompasses many SMILES strings that cover a broad range of about 12 units in logP, a range in SAScore from around 1 to 6, as well as a similar range in scaled molecular weights. This served as a basis for the subsequent training.


Initially, we convert the SMILES representation into a sequence of tokens using a tokenizer. We used the BERT-tokenizer [45] in DeepChem [46], which employs a fixed vocabulary size of 591 tokens. It splits the SMILES at the character level, except for values enclosed in square brackets, which are treated as a single token.

These tokens are then passed through a separate lookup table, which maps them to a \(d_{emb}\)-dimensional embedding space. Prior to feeding the token embeddings into the decoder model, a context is prepended.

The numerical properties are processed as described in section 2.

If we use a token sequence as context, we perform these calculations dynamically in each batch during the training, allowing them to have varying token sequence sizes and content. During a training step, a token sequence represents a contiguous subsequence of the current tokenized SMILES. We start by randomly selecting a starting index from zero up to the current SMILES length, followed by determining a random ending index greater than the starting index but smaller than the current SMILES length.

In our case, we limited the context token to a maximum sequence length of 50 to avoid memory issues, which sufficed for our purposes. This sequence is then embedded using the same embedding layer as the input sequence and combined with an embedding specific to the token sequence, sharing the shape of the input embedding table. Additionally, a learned label embedding is added to these combined token sequence embeddings to indicate their relatedness.

Stochastic context learning (SCL)

Given an input sequence \(X \in {\mathbb {R}}^{L \times d_{emb}}\) of length L, where each element is represented by a \(d_{emb}\)-dimensional vector, we divide it into two parts: \(X = \text {Concat}(C,S)\). Our algorithm focuses on modifying the context part C. We represent this part as a combination of two parts. The first is \(C_{num} = (t_{1}, t_{2}, \ldots , t_{n} )^T \in {\mathbb {R}}^{n \times d_{emb}}\), where n represents the maximum number of numerical conditions used in the training process (in this case, \(n=3\)). The second is the token sequence \(C_{ts} \in {\mathbb {R}}^{k \times d_{emb}}\), where k is the length of the token sequence, such that \(C = Concat(C_{num}, C_{ts})\). The length k is not specific and can change for each input sequence X.

To begin, we set a deletion probability \(p_{del}\) to \(15\%\) during training. For each row in the \(C_{num}\) matrix, we check if it should be deleted with a probability of \(p_{del}\). If it meets the criteria, we remove the row from the \(C_{num}\) matrix and consequently from the input sequence X, which then would be of shape \((L-i) \times d_{emb}\), where \(0 \le i \le n\) is the number of deleted numerical conditions. Similarly, the same probability is used to control if the token sequence should remain in the context for the current sequence. In this case, the \(p_{del}\) probability says if the entire token sequence should be removed, not just one row. Occasionally, there may be a situation where all conditions in \(C_{num}\) and \(C_{ts}\) are eliminated. In such instances, the sequence becomes unconditioned.

For batched input sequences \(X_{batch}\) with shape \({\mathbb {R}}^{B \times L_{max} \times d_{emb}}\), the process works similarly. We iterate over each of the n numerical conditions and sample if it should be deleted with a probability of \(p_{del}\). If a condition is selected for deletion, we remove the corresponding row from all entries in the batch of size B. A description for the batched algorithm is given in the Algorithm 1. We assume that every molecule in the batch has all n numerical properties. If a molecule only has a portion of the properties, we would simply pad the missing values. In our case, there was no need for padding, as all molecules had all the numerical properties. The batch is created out of B number of sequences X, each of those could have a different length L due to the variance in length in the token sequence condition and also the SMILES itself. To batch those together, we take the maximum sequence length \(L_{max}\) for all sequences that should be packed into the batch and pad the shorter SMILES by appending a pad-token to the length of \(L_{max}\).

Thus, throughout the training process, the model has to handle different combinations of the provided conditions, which allows the model to learn unconditionally, single conditions, and also multiple conditions in one go. Thanks to the type of embeddings we add to every context element, we can change the number of properties that are provided to the model and still have the model distinguish which properties are provided.

Algorithm 1
figure a

Batched SCL algorithm


The model is trained to predict the next token by calculating the cross-entropy loss between the actual next token and the predicted probability for that token. Note that this loss is only calculated for the SMILES part of the given sequence, the prepended context is not considered in the loss. Since we only train with the autoregressive loss, the context does not have to be evaluated while training, making our approach very flexible to various conditions. This loss is then backpropagated through the model using the AdamW optimizer [47]. The cross-entropy loss is defined as follows (Eq. 6):

$$\begin{aligned} \text {CrossEntropyLoss}(y, {\hat{y}}) = -\frac{1}{N}\sum _{n=1}^{N}\log \left( \frac{\exp ({\hat{y}}_{n, y_n})}{\sum _{i=1}^{d_{\text {voc}}}\exp ({\hat{y}}_{n, i})}\right) \end{aligned}$$

In this expression, N is the batch size, where \(y \in \{0, 1, 2, \ldots , d_{\text {voc}}\}^{N}\) and \({\hat{y}} \in {\mathbb {R}}^{N \times d_{voc}}\) correspond to the target tokens and the predicted log probabilities, respectively. The mean over the negative logarithms for the normalized predicted probabilities of the next token is calculated. Here, \({\hat{y}}_{n, y_n}\) specifically refers to the predicted log probability assigned to the correct target token \(y_n\) for the n-th sample in the batch.

The model was trained on a single Nvidia A100 GPU for two days and used about 35 GB VRAM while training. A constant learning rate of \(\alpha = 10^{-4}\), with \(\beta _1 = 0.9\) and \(\beta _2 = 0.95\) was used for the AdamW optimizer. The dataset was randomly partitioned into two parts, a training set and a testing set. The training dataset consisted of 90%, while the testing dataset comprised 10% of the data. The model was trained using a batch size of 256 with gradient accumulation steps of 4 batches. Each sequence for the model starts with a “start of SMILES”-token ([CLS]) and ends with an “end of SMILES”-token ([SEP]). Shorter SMILES strings were padded with a “pad”-token ([PAD]) to match the length of the longest SMILES in that batch. The same padding process was applied to the token sequence in the context.

New SMILES are then sequentially generated by first starting with a “[CLS]”-token and then predicting the next tokens iteratively. The generation ends, when the model predicts the “[SEP]”-token or a specified token limit is reached.

Results and discussion

After the training, we used the model in different scenarios to generate new SMILES, e.g., without any constraints or with one or more constraints (including numerical and/or structural targets), while keeping the temperature parameter constant at \(temperature=0.8\). This value ensures a close but not too strict coupling to the underlying probability distributions, which proved helpful in our experiments.

The metric used to measure the performance of the models for a batch of generated compounds is the mean absolute deviation between requested and obtained numerical values, in addition to the percentage of novelty, uniqueness, and validity of the molecular structures generated. These are evaluated on the complete training dataset.

In more detail, these metrics are defined as follows:

  1. 1.

    Novelty: is defined as the percentage of newly generated molecules not present in the reference dataset. We use this, to ensure that the model is not memorizing the training data, but instead is inventing new compounds. We measure this by comparing the generated SMILES with the SMILES in the dataset. Please note: this is not equivalent to testing the molecular graphs for isomerism, i.e., alternative synonyms are not detected as redundant molecular structures by this procedure, but rather just a string comparison.

  2. 2.

    Uniqueness: The uniqueness is the ability of the model to generate unique molecules. We measure the percentage of unique molecules generated in a batch of 1k and 10k molecules under specific conditions.

    Again, identical molecules with synonymous SMILES remain undetected.

  3. 3.

    Validity: The ratio of validity is determined by the number of properly parsed SMILES (by RDKit [48]) versus the total number of generated SMILES in a batch.

  4. 4.

    Mean average deviation: Is defined as the following:

    $$\begin{aligned} MAD&= \frac{1}{n} \sum _{i=1}^{n}{ |x_i - y_i|} \end{aligned}$$

    For each of the n generated SMILES strings, the target value of the respective property is denoted as \(y_i\), while \(x_i\) represents the ’true’, i.e. actually calculated property value. The model should minimize this quantity without being explicitly trained on it, which would indicate that the model incorporates the provided context into the generative process. This metric is also used to enable comparisons to other models [8, 49].

We compare our model to others of similar architecture and choice of conditions. The findings can be seen in Table 3.

Unconditional generation

Without applying any conditions, we generated 20k SMILES and calculated the corresponding properties logP, SAScore, and molecular weight using RDKit. The resulting frequencies of distribution are very similar to the distributions obtained from a representative sample of training molecules, see Fig. 2a–c.

This indicates that the model has indeed learned the inherent distribution of the training dataset, without specifically training the model unconditionally.

The generated SMILES also achieve very comparable performance in terms of uniqueness, and validity to the other models as shown in Table 3. Our degree of novelty is inline or slightly better than that of models trained on the MOSES [50] dataset (1.9 Mio. molecules), but falls behind on the GuacaMol (1.59 Mio. molecules ) [51]. We argue that this is mostly due to the significant difference in size of our dataset compared to the datasets mentioned above. Since our dataset covers more of the chemical space, this makes it more likely for the model to generate molecules present in the dataset.

Fig. 2
figure 2

Distribution of properties as obtained from the training dataset in comparison with the distributions from 20k unconditionally generated molecules

Single Condition

In this experiment, we also assessed the model’s ability to handle single-condition generation over wide ranges of target values.

For each target value of the intervals listed in Table 3, rows 2 – 7, the procedure involved generating a sample of 10,000 molecules. As before, we determined the true property values of the generated molecules using RDKit and compared those to the targeted values. For each property, we ran two scenarios: The first one covered a broad interval of values that encompassed both in-distribution, as well as out-of-distribution values. The second run focused solely on the performance of a select few in-distribution target values. The interval notation [a, b; c] signifies that the values were uniformly sampled from a discrete distribution of values, including a and b, with a step size of c. More precisely, [-2, 7; 1] denotes the set of values \(\{-2, -1, 0,\ldots , 6, 7\}\). The notation \(\{ a_1, a_2, \ldots \}\) indicates that the values were uniformly sampled from that specific set of values.

Despite the low probability of the model being trained solely on one property, it performs well in this task, as demonstrated in Figs. 3a–c. The model achieves low MAD values across the entire span of the respective target properties (rows 2, 4, and 6), although the MAD values obtained for the in-distribution series are generally and significantly lower (rows 3, 5, and 7). In fact, predicting logP values to an accuracy of 0.5 logP units (root mean square deviation) is commonly considered a satisfactory result [52].

Still, the out-of-sample performance is acceptable. Although the scatter increases, the general trend is well retained, with the only exception of the very high (>7) SAScores. In this case (row 4 in Table 3), we observe also a concomitant drop in the percentage of validity of the generated molecules (82.5 vs 99 %). Upon manual inspection, we find that not only compounds with highly bridged ring systems and/or accumulations of stereogenic centers were generated that are supposedly very demanding to synthesize but also rare and unstable atomic environments such as neighboring diradicals and/or carbenes that presumably prevent the proper parsing of structures.

In comparison to the other models in Table 3 our model archives a slightly lower MAD in the single condition case, without being specifically trained on that task, while simultaneously scoring the uniqueness and validity on par with the other models. Yet again, our level of novelty is significantly lower than that from MolGPT. As before, we suspect that this is due to the larger size of our training dataset and the restrictions on the generative process imposed by the constraints given by the context.

Fig. 3
figure 3

Requested (x-axis) versus actual values (y-axis) for the diverse target properties: a) LogP, b) SAScore, c) Molecular weight. For each target value, a batch of 10k SMILES was generated; MAD is averaged over the entire range

Table 3 Table for comparing metrics for the three metrics at a temperature of 0.8 for 10k generated molecules. All metrics were evaluated with these 10k molecules, except uniqueness at 1k

Multiple conditions

For each pairwise combination of target properties, we generated 10k SMILES, see Figs. 4, 5 and 6. The graph labels are in the same order as given in the captions of the respective figures.

In general, the generated molecules’ properties center closely around the desired values. Although all chosen values were well within the highly populated areas of the underlying distributions, some combinations turned out to be hard to satisfy, resulting in a more pronounced scatter.

Figure 4 shows the distribution of calculated logP values and SAScores. This pair works well for lower logP values, but for higher ones the variance in the SAScore axis rises significantly. This seems to indicate that in this case, the logP values have a slight priority in the generative process compared to SAScore. There are some outliers, but most of the generated molecules fulfill both conditions. We suspect that this could be an effect of the shortage of training data in that region, thus leading to more inaccurate results.

Next we compare the combination of logP and the molecular weight values, as shown in Fig. 5. Apparently, the molecular weight takes priority in the generation, as it displays a much smaller variance compared to the logP. However, the logP is still met accurately despite being under very strict size constraints. This comes as no surprise, due to the ease with which the molecular weight can be determined by counting the contributions of each atom, as opposed to the more extensive considerations demanded by logP values.

Lastly, Fig. 6 displays the combination of SAScore and molecular weight. Similar to the logP and molecular weight comparison, molecular weight still dominates the generative process. In comparison, the model can not uphold the SAScore in all cases. This is especially evident in the case, where the molecular weight is set to a low value of 1.5 which results in a high SAScore variance. Apparently, the model struggles to incorporate a sufficient number of challenging motifs into a small molecule, due to the limited size and range of available elements In contrast, when the weight is set to a higher value of 3.5, i.e. a larger molecule, we obtain a much lower variance.

Finally, in Fig. 7 we visualize the generated molecules that take into account all three properties. As is evident by the disjoint point clouds in the graph, the model learned to consider all three conditions and generate matching molecules. The labels in the legend should be read in the order of logP, SAScore, and molecular weight.

In Table 4 we compare the performance of our model to other models in multi-conditional generation where applicable. Each row in the table represents an experiment, with the columns representing the properties used as conditions. If a condition was not utilized, the cell was left empty. Our model slightly outperforms the others, in the case of logP + SAScore, in terms of MAD, uniqueness and validity, while simultaneously being able to handle other condition combinations effectively. Again, the novelty falls behind the other models.

Table 4 Table for comparing multiple property conditions for 10k generated molecules to other models
Fig. 4
figure 4

LogP + SAScore

Fig. 5
figure 5

LogP + Molecular weight

Fig. 6
figure 6

SAScore + Molecular weight

Fig. 7
figure 7

LogP + SAScore + Molecular weight

Token sequence incorporation

A very common question in material design is to create analogs from a given starting molecule and add/modify structural features to customize the physical properties. For this reason, the model also accepts a SMILES string representing the desired molecular moiety that should be integrated as a building block in the newly generated structure as an additional condition. The requested SMILES string is converted to the canonical form used throughout, before it is tokenized. To measure the performance of our model for this task, we used the following criteria:

  • Substructure Matches (SM): The substructure match measures the percentage of generated molecules that explicitly include the target moiety. As a first step, we convert the target structure into a SMARTS [55] pattern, which is essentially a regular expression to match specific atoms or substructures within a molecular structure. To make the criterion a little less strict, all information about bond orders is removed, only the connectivity itself is retained. Therefore, the pattern tolerates modifications in the details of the electronic structure (e.g. localized double bonds versus aromatic bonds), while still maintaining the overall topology. With this property, we measure how often the target structure is retained during the generative process.

For this experiment, at a constant temperature of 0.8, batches of 1k molecules were generated for various context token sequences and evaluated using the mentioned metric. Table 5 lists different organic target structures (as the context token sequence in SMILES form) and the results obtained a) without applying any other conditions (columns: uniqueness at 1k / SM), and b) with another additional numerical condition (columns: LogP/SAScore/molecular weight at different target values each).

Overall, the model seems to perform very well, as we can recover the target structures at least once in most of the newly generated SMILES. However, especially when given a larger target structure such as Morphine, we observe that the generated structures become very repetitive.

Token sequence with a single numerical condition

The really useful application for customizing given structures is the simultaneous application of one or more additional criteria.

Thus, we study combinations of token sequence conditions together with single numerical conditions, see Table 5. Each combination was tested on 1000 generated molecules, with the numerical values uniformly sampled from the range specified in the table header for each property.

Generally, we observe that smaller token sequences, which naturally occur as building blocks in the training data (entries 1 - 5), are readily integrated into a variety of generated compounds. In contrast, larger compounds (entries 6 - 10) that emerge as independent, self-contained units within the dataset exhibit a significantly lower rate of uniqueness.

In most cases, we observed a decrease in the number of substructure matches for the molecules tested as compared to the previous run without numerical conditions. This is likely due to the model having to handle two possibly competing conditions simultaneously. The MAD values for logP and SAScore were also notably higher compared to generating without a token sequence but remain within acceptable limits. It is worth noting that when the two conditions conflicted, such as with Ibuprofen and negative logP values, this led to the presence of some significant outliers. Conversely, when the conditions aligned well, the errors were consistent with the previous results. More details on the graph for the Ibuprofen and logP relationship can be found in Appendix 5.1. We also observed that the MAD of the SAScore in the case of Morphine is significantly higher than in the other examples. This is mostly due to Morphine having a SAScore of about 5.2, and we requested lower values. In this case, the model prioritizes the token sequence in comparison to the SAScore, which leads to the higher MAD.

Surprisingly, the token sequence condition takes precedence in most cases over the criteria logP and SAScore, as evidenced by the elevated MAD scores. Yet again, the molecular weight seems to be prioritized over the token sequence, as evidenced by the very low MAD scores, particularly for larger molecules such as Morphine.

Token sequence with multiple numerical conditions

We also conducted experiments where multiple token sequences were tested under two conditions simultaneously. The results of these experiments can be found in Table 6. Each row in the table represents a specific experiment, with the columns representing the properties used as conditions. If a condition was not utilized, the cell was left empty.

The model consistently performs well under various conditions, as shown by the low MAD values. However, when conditions are overly restrictive in combination with the token sequence, it can lead to higher MAD values or lower rates of substructure matches. This is because the model prioritizes certain properties over others.

For instance, consider Paracetamol, where both logP and molecular weight conditions are applied. Due to the constraining effect of molecular weight on the molecule’s size, decreasing the logP value significantly becomes challenging. In this case, the model prioritizes the molecular weight condition. We suspect this is because molecular weight is easier to validate and has more pronounced limitations compared to logP.

Nevertheless, the model effectively satisfies all three constraints in most cases, as evidenced by a high percentage of substructure matches and low MAD values for the properties in Table 6. Notably, when generating molecules with three properties, some MAD values are even lower than those observed in two-property generation. This could be attributed to the model being trained on a larger number of three-property batches, resulting in improved performance.

In general, all four conditions are respected during the generative process and make significant contributions to the resulting molecules.


Our aim was to provide a tool for exploring the relevant chemical spaces for a given application, in our case the subspace of organic, potentially electro-active compounds. We therefore adapted existing work and approaches to our needs and came up with a new training variant that allows for a solitary model very flexible in use, which was also trained on a data set of substantial size.

In detail, we

  1. 1.

    developed a GPT-style Transformer based on the Llama 2 architecture, showcasing strong performance in both single and multi-conditioned generation, comparable to or slightly surpassing existing models, despite not being task-specific.

  2. 2.

    compiled and utilized a training dataset comprising 12.5 million organic molecules sourced from various origins, enhancing the model’s ability to generate a variety of molecular structures.

  3. 3.

    implemented a new training method we call Stochastic Context Learning (SCL), enabling our model to handle various combinations of conditions efficiently for multi-conditional generation using a single model.

We were able to show that the training process was successful and the achieved accuracy very satisfactory. The model generalizes quite well, as target values requested outside the well-sampled areas still tend to fall in the desired ranges. At present, this provides us with promising chemical sub-spaces to screen for electro-active materials, e.g. by feeding the generated SMILES into a trained model for predicting the enthalpy of reaction (as a prerequisite for calculating the redox potential [56]).

The whole setup is very generic and easily adaptable to other applications. The latter motivates the number and choice of properties used as conditions for narrowing down the search space. In fact, for the model to be more useful in the search for energy-storage materials, in future we intend to provide a more meaningful, yet expensive property, such as the enthalpy of reaction.

Looking ahead, this research opens up exciting possibilities for further advancements in generative models and their applications in chemistry and related fields. Our modified architecture, combined with the SCL approach, holds great potential for generating novel and diverse organic molecules with precise control over desired properties.

In theory, a single model can learn a wide range of conditions and combinations by utilizing this approach during training. Therefore, we chose the SAScore (reflecting a materials’ production cost), molecular size and logP (contributing to the energy density), as well as a desirable molecular core structure as optional target conditions. As an added benefit, a single model also comes at a reduced training cost. This method enables a more flexible and scalable training process, as it does not require every property to be available for all samples.

Outlook For future work, we intend to focus more on curating the dataset, as to not have these very concentrated distributions for all properties. We hope that by reducing redundant molecules, the model would generalize better, while also reducing training time in the process. Generally, we assume that the model could perform even better with more training data, as it seems to be underfitted even with our large dataset.

Furthermore, we also intend to expand the number of properties that are given to the model, as there are more useful conditions for practical applications, such as the HOMO-LUMO gap.

Table 5 Table for comparing metrics on 1000 generated molecules for each context token sequence
Table 6 Table for comparing multiple property conditions for 1000 generated molecules using example token sequences

Data availability

The model and training data is available at our Github:


  1. A condition, here, is a desired molecular property that we want to provide to the model. Based on this condition, the model should generate new molecules that satisfy the requested value.

  2. A token sequence can represent either a complete molecule or a molecular fragment, which may not necessarily be valid independently. However, a token sequence should become part of a valid molecule when incorporated into the generative process.


  1. Sherstinsky A (2020) Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena 404:132306.

  2. Goodfellow IJ et al (2014) Generative adversarial networks.

  3. Kingma DP, Welling M (2022) Auto-encoding variational bayes.

  4. Vaswani A, et al (2017) Attention is all you need.

  5. Brown TB, et al (2020) Language models are few-shot learners.

  6. Dosovitskiy A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale.

  7. Urbina F, Lowden CT, Culberson JC, Ekins S (2022) MegaSyn: integrating generative molecular design, automated analog designer, and synthetic viability prediction. ACS Omega 7:18699–18713

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Bagal V, Aggarwal R, Vinod PK, Priyakumar UD (2021) MolGPT: molecular generation using a transformer-decoder model. J Chem Inf Modeling 62:2064–2076.

    Article  CAS  Google Scholar 

  9. Polishchuk PG, Madzhidov TI, Varnek A (2013) Estimation of the size of drug-like chemical space based on gdb-17 data. J Comput-aided Mol Design 27:675–679

    Article  CAS  Google Scholar 

  10. Richards RJ, Groener AM (2022) Conditional \(\beta\)-vae for de novo molecular generation.

  11. Lim J, Ryu S, Kim JW, Kim WY (2018) Molecular generative model based on conditional variational autoencoder for de novo molecular design. J Cheminf

    Article  Google Scholar 

  12. Lee M, Min K (2022) Mgcvae: multi-objective inverse design via molecular graph conditional variational autoencoder. J Chem Inf Modeling 62:2943–2950.

    Article  CAS  Google Scholar 

  13. Cao ND, Kipf T (2022) Molgan: An implicit generative model for small molecular graphs.

  14. Grisoni F, Moret M, Lingwood R, Schneider G (2020) Bidirectional molecule generation with recurrent neural networks. J Chem Inf Modeling 60:1175–1183.

    Article  CAS  Google Scholar 

  15. Kotsias P-C et al (2020) Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nat Mach Intell 2:254–265.

    Article  Google Scholar 

  16. S V, S S, et al (2022) Multi-objective goal-directed optimization of de novo stable organic radicals for aqueous redox flow batteries. Nat Mach Intell 4:720–730.

  17. Chen Y et al (2023) Molecular language models: RNNs or transformer? Brief Functional Genom 22:392–400.

    Article  Google Scholar 

  18. Wang J et al (2021) Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning. Nat Mach Intell 3:914–922.

    Article  Google Scholar 

  19. Wang Y, Zhao H, Sciabola S, Wang W (2023) cMolGPT: a conditional generative pre-trained transformer for target-specific de novo molecular generation. Molecules 28:4430.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Kim H, Na J, Lee WB (2021) Generative chemical transformer: neural machine learning of molecular geometric structures from chemical language via attention. J Chem Inf Modeling 61:5804–5814. (PMID: 34855384)

    Article  CAS  Google Scholar 

  21. Du Y, Fu T, Sun J, Liu S (2022) Molgensurvey: a systematic survey in machine learning models for molecule design.

  22. Weininger D (1988) Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31–36.

    Article  CAS  Google Scholar 

  23. Touvron H, et al (2023) Llama 2: open foundation and fine-tuned chat models.

  24. Ertl P, Schuffenhauer A (2009) Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminf

    Article  Google Scholar 

  25. Shazeer N (2020) Glu variants improve transformer.

  26. Ainslie J, et al (2023) Gqa: Training generalized multi-query transformer models from multi-head checkpoints.

  27. Su J, et al (2022) Roformer: Enhanced transformer with rotary position embedding.

  28. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization.

  29. Zhang B, Sennrich R (2019) Root mean square layer normalization.

  30. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958

    Google Scholar 

  31. Sterling T, Irwin JJ (2015) ZINC 15 - ligand discovery for everyone. J Chem Inf Modeling 55:2324–2337.

    Article  CAS  Google Scholar 

  32. Ramakrishnan R, Dral PO, Rupp M, von Lilienfeld OA (2014) Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1.

  33. Ruddigkeit L, van Deursen R, Blum LC, Reymond J-L (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inf Modeling 52:2864-2875.

  34. Gómez-Bombarelli R et al (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Sci 4:268-276.

  35. Sorkun E, Zhang Q, Khetan A, Sorkun MC, Er S (2022) RedDB, a computational database of electroactive molecules for aqueous redox flow batteries. Sci Data 9.

  36. John PCS et al (2019) Message-passing neural networks for high-throughput polymer screening. J Chem Phys 150:234111.

  37. Nakata M, Shimazaki T, Hashimoto M, Maeda T (2020) PubChemQC PM6: data sets of 221 million molecules with optimized molecular geometries and electronic properties. J Chem Inf Modeling 60:5891–5899

    Article  CAS  Google Scholar 

  38. Nakata M, Shimazaki T (2017) PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. J Chem Inf Model 57:1300–1308

    Article  CAS  PubMed  Google Scholar 

  39. Hachmann J et al (2011) The Harvard clean energy project: Large-scale computational screening and design of organic photovoltaics on the world community grid. The Journal of Physical Chemistry Letters 2:2241-2251.

  40. Duvenaud D, et al (2015) Convolutional networks on graphs for learning molecular fingerprints.

  41. Zdrazil B et al (2023) The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res 52:D1180–D1192.

    Article  PubMed Central  Google Scholar 

  42. Blackshaw J et al (2009) CHEMBL database release 31.

  43. Davies M et al (2015) Chembl web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res 43:W612–W620.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Jupp S et al (2014) The ebi rdf platform: linked open data for the life sciences. Bioinformatics 30:1338–1339.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Schwaller P et al (2019) Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Central Sci 5:1572–1583.

    Article  CAS  Google Scholar 

  46. Ramsundar B, et al (2019) Deep Learning for the Life Sciences O’Reilly Media.

  47. Loshchilov I, Hutter F (2019) Decoupled weight decay regularization.

  48. Landrum G, et al (2020) rdkit/rdkit: 2020_03_1 (q1 2020) release.

  49. Lu H, Wei Z, Wang X, Zhang K, Liu H (2023) Graphgpt: A graph enhanced generative pretrained transformer for conditioned molecular generation. Int J Mol Sci 24.

  50. Polykovskiy D et al (2020) Molecular sets (moses): a benchmarking platform for molecular generation models. Front Pharmacol.

    Article  PubMed  PubMed Central  Google Scholar 

  51. Brown N, Fiscato M, Segler MH, Vaucher AC (2019) Guacamol: benchmarking models for de novo molecular design. J Chem Inf Modeling 59:1096–1108.

    Article  CAS  Google Scholar 

  52. Işık M et al (2020) Assessing the accuracy of octanol-water partition coefficient predictions in the sampl6 part ii log p challenge. J Comput-aided Mol Design 34:335–370

    Article  Google Scholar 

  53. Haroon S, CA H, AS J, (2023) Generative pre-trained transformer (gpt) based model with relative attention for de novo drug design. Comput Biol Chem 106:107911.

  54. Monteiro NR et al (2023) Fsm-ddtr: end-to-end feedback strategy for multi-objective de novo drug design using transformers. Comput Biol Med 164:107285.

    Article  PubMed  Google Scholar 

  55. Daylight Theory: SMARTS - A language for describing molecular patterns – [Accessed 03-11-2023]

  56. Barker J, Berg L-S, Hamaekers J, Maass A (2021) Rapid prescreening of organic compounds for redox flow batteries: a graph convolutional network for predicting reaction enthalpies from smiles. Batteries Supercaps 4:1482–1490

    Article  CAS  Google Scholar 

Download references


Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations



N.D. wrote the main manuscript. A.M. helped in writing and refining parts of the manuscript and gave input on the chemical application of the model. J.H. supervised the paper and helped with the mathematical background. All authors reviewed and refined the manuscript.

Corresponding author

Correspondence to Niklas Dobberstein.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




In this chapter, we visualize the errors of generated molecules using the molecular fragment condition with a single numerical condition.

In the case of Ibuprofen with a naturally very positive logP of about 3.0, it is very difficult for the model to significantly reduce the logP to the desired negative values, while also keeping the fragment intact. This leads to an overall higher MAD, due to a small sample of large outliers that increased the mean by a significant margin. This can be seen in the Fig. 8.

Fig. 8
figure 8

Ibuprofen logP Graph - Generated vs Target


We also conducted some experiments on special combinations of different conditions, as these also show the limitations of the model, either due to the incompatibility of these conditions or the lack of training data in those regions.

We tested the combination of a low molecular weight (100) and a high SAScore (7), which can be seen in the Fig. 9. The generated molecules have both characteristics by being hard to produce due to the high number of connected, bridged, annealed or spiro-rings and ring strains associated with the high degree of interconnected rings and/or open-shell centers (radicals and/or carbenes), while keeping the molecular weight small. In this scenario, it also uses more uncommon elements to fit into both conditions.

Fig. 9
figure 9

Special Case: Generated molecules with low molecular weight and high SAScore as conditioning


In this section are a sample of the generated molecules for each property visualized. Figure 10 showcases examples that are generated with logP as a property from negative to positive values. Furthermore, the Fig. 11 show the change over different SAScores. Lastly, the Fig. 12 shows how the generated molecules get larger with a rising molecular weight.

Fig. 10
figure 10

A sample of the generated molecules with logP as conditioning

Fig. 11
figure 11

A sample of the generated molecules with SAScore as conditioning

Fig. 12
figure 12

A sample of the generated molecules with the molecular weight as conditioning

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Dobberstein, N., Maass, A. & Hamaekers, J. Llamol: a dynamic multi-conditional generative transformer for de novo molecular design. J Cheminform 16, 73 (2024).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: