Skip to main content

Molecular generation by Fast Assembly of (Deep)SMILES fragments



In recent years, in silico molecular design is regaining interest. To generate on a computer molecules with optimized properties, scoring functions can be coupled with a molecular generator to design novel molecules with a desired property profile.


In this article, a simple method is described to generate only valid molecules at high frequency (\(>300,000\) molecule/s using a single CPU core), given a molecular training set. The proposed method generates diverse SMILES (or DeepSMILES) encoded molecules while also showing some propensity at training set distribution matching. When working with DeepSMILES, the method reaches peak performance (\(>340,000\) molecule/s) because it relies almost exclusively on string operations. The “Fast Assembly of SMILES Fragments” software is released as open-source at Experiments regarding speed, training set distribution matching, molecular diversity and benchmark against several other methods are also shown.


In recent years, there has been a surge of methods developed for in silico molecular generation. Mostly using deep neural networks [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], but not only [19,20,21,22,23,24]. Some authors use much simpler methods and the present contribution falls into this category. Notably, Polischuk [19, 25] uses molecular fragments and generates only valid molecules, while allowing some control [25] over the molecular diversity, novelty and synthetic complexity of the generated molecules. Kwon et al. [20] use direct cross-over and mutation operators over SMILES strings, combined with Conformational Space Annealing [26]. Their method does not require a training set but can generate invalid SMILES. Yoshikawa et al. [27] use a population-based grammatical evolution approach (ChemGE). While their method is fast and inherently parallel, it requires an initial population of molecules and can generate invalid SMILES. Nigam et al. [21] generate molecules by Gibbs sampling of SELFIES [28]. Their approach generates only valid molecules and does not require a training set. However, it requires translating molecules to/from SELFIES [28] (a recently developed linear encoding of molecular graphs). Brown et al. [23], Jensen et al. [22] and Leguy et al. [24] use a genetic algorithm over molecular graphs. Jensen’s method [22] doesn’t require task-specific model training and generates only valid molecules. Leguy et al. [24] use an evolutionary algorithm sequentially building a molecular graph using seven mutation operators. Their method also does not require a training set and generates only valid molecules.

The hereby proposed method works directly at the SMILES level. It generates only valid SMILES and thus valence-correct molecules. Simplified Molecular Input Line Entry System (SMILES [29]) is a molecular file format specifying a linear encoding of molecular graphs. SMILES are a compact way to store molecules on computers. The format is supported by all chemoinformatics toolkits and hence widespread. For rather small molecules, SMILES are human-readable. For the remaining of this article, it is only necessary to know that SMILES are strings made of balanced parentheses (indicating possibly nested branches of a linearized tree data structure), bracket atoms (an atom between brackets carries special propertiesFootnote 1), digits indicating ring opening and closures on the molecular graph plus other characters listing atoms and the bonds between them. For more details, see the Open SMILES specification [30] or the seminal paper [29]. On the other hand, DeepSMILES [31] is a recently proposed variant of SMILES, designed to ease in-silico molecular generation by making it harder to generate syntactically invalid SMILES (also one of the goals of SELFIES [21]). DeepSMILES allows two options: (i) avoiding branch opening parentheses and/or (ii) avoiding ring opening numbers. In this study, only the DeepSMILES flavor without ring opening numbers is considered.

Fig. 1
figure 1

Folic acid (left) and its Bemis-Murcko scaffold (right)

In the experiments, to quantify molecular diversity in a dataset and the molecules generated from it, the count of unique Bemis-Murcko scaffolds [32] (Fig. 1) is monitored. In the remaining of this article, the datasets used, the method itself as well as computational experiments are presented and discussed.


The GuacaMol de Novo molecular design benchmark

GuacaMol [33,34,35,36] consists in a benchmark suite for molecular generators. The benchmark tasks measure the fidelity of models to reproduce the property distribution of a training set made of ChEMBL [37] 24 moleculesFootnote 2, the ability to generate novel molecules, the exploration and exploitation of the chemical space and several optimization tasks. In this study, since no molecular optimization was performed, only the molecular generation task was used. GuacaMol allows to compare performance against a few methods with published results, using a variety of metrics. All metrics are normalized; zero being the worst score and one the best.

GuacaMol metrics: Validity [33]: measures if the generated molecules are realistic (e.g. SMILES is valid according to RDKit [38]). Uniqueness [33]: assesses whether a model generates unique molecules (i.e. few or no duplicate canonical SMILES). Novelty [33]: assesses whether a model generates molecules which are not present in the training set. Fréchet ChemNet Distance [39]: measures how close the distributions of generated molecules are to training set ones. Kullback-Leibler (KL) divergence [40]: measures how well a probability distribution approximates another one. For this benchmark, the probability distributions of several physicochemical descriptors [33] are compared. This metric also captures molecular diversity (given a physicochemical property distribution, the generated molecules should be as diverse as training set ones).

Some methods with published GuacaMol results [33]: Random sampler [33]: a baseline model only random sampling the training set. SMILES LSTM [3, 33]: a Long-Short-Term Memory (LSTM) neural network that predicts the next character for partial SMILES strings. Graph MCTS [22, 33]: Jensen’s Graph-based Monte Carlo Tree Search molecular generator. AAE: an Adversarial AutoEncoder [41, 42]. ORGAN: Objective-Reinforced Generative Adversarial Network [41,42,43]. This deep-learning model architecture combines a generator and a discriminator network to generate molecules. VAE: a Variational AutoEncoder [41, 42]. This deep learning model learns a representation of molecules as latent vectors in a continuous space. The network architecture includes an encoder network that converts SMILES strings to latent vectors, and a decoder performing the reverse operation.


In the experiments, three datasets were used, in addition to the dataset internal to the GuacaMol benchmark. For the molecular generation speed benchmark, a random sample of one million molecules [44] from the GDB-13 [45] was used as the training set, so that results can be compared to the published results of Arús-Pous et al. [8]. For the molecular diversity and training set distribution matching experiments, two more datasets were used. To represent drug-like molecules, a bootstrap sample of 100,000 molecules was drawn from ChEMBL-28 [37]. To represent natural products, a bootstrap sample of 20,000 molecules was drawn from the Traditional Chinese Medicine Database at Taiwan [46] (this database is much smaller than ChEMBL).

The hereby proposed method is parameterized by a molecular fragmentation scheme and an atom typing scheme. Any molecular fragmentation scheme can be used, as long as it doesn’t cut rings (e.g. BRICS [47] or RECAP [48]). Many atom typing schemes could be used.

Fragmenting molecules In the experiments, the following ad-hoc fragmentation scheme was used to identify then select some cleavable bonds. Only single bonds between heavy atoms not in rings can be selected. Furthermore, the bond must not be connected to a stereo center nor involved in cis-trans isomerism (an attempt at preserving the stereochemistry of fragments, if present). Cleaved bonds are chosen randomly without replacement from this list, in order to obtain n fragments. By default, a fragment molecular weight (MW) of 150Da is used and the recommended number of fragments for molecule m is given by:

$$\begin{aligned} num\_frags(m) = round\left( \frac{MW(m)}{150}\right) \end{aligned}$$

This fragment weight parameter is just used as a hint to decide in how many fragments the given molecule must be cut (it is not strictly enforced). Since the fragmentation process is controlled by a random seed, if one requires more fragments from a given dataset, doing several passes with different seeds generates more fragments. As in Arús-Pous [9], randomized SMILES are used (instead of canonical ones) so that the same molecule does not always result in the same set of fragments (e.g. which fragment is a prefix of the SMILES, is the fragment written from left-to-right or the opposite). Duplicate fragments are not removed, because the fact that a fragment is found multiple times correlates with the natural occurring frequency of this fragment in a dataset. Also, removing duplicates might require a canonicalization step and would only be required (to reduce memory usage) if one is fragmenting a truly huge dataset.

Typing atoms In experiments and as in previous work [49], the following atom typing scheme inspired from atom pairs [50] was used: an atom \(a_i\) in molecule m is identified by the tuple

$$\begin{aligned} type(a_i) = (\pi , e, h, f) \end{aligned}$$

where \(\pi \) is the number of pi electrons, e the chemical element symbol, h the number of bonded heavy atoms and f the formal charge. Stereochemistry is ignored.

Extended bond typing While the bond order (BO) is a natural bond typing scheme, to better preserve some of the structure of a molecular training setFootnote 3, it is useful to extend bond typing to make it more precise. Assuming \(b_j\) is a bond in molecule m between atoms \(a_i\) and \(a_{i+1}\):

$$\begin{aligned} type(b_j) = (type(a_i), BO(b_j), type(a_{i+1})) \end{aligned}$$

Note that if \(type(a_i) \ne type(a_{i+1})\), the bond is directed. In fact, several published methods use some kind of extended bond typing scheme (eMolFrag [51], CReM [19, 25]).

Tagging cleaved bonds Once some bonds have been selected for cleavage, this can be encoded into a valid SMILES string where unspecified atoms (the ‘*’ character in SMILES notation) are introduced and isotope numbers are used to encode the atom types of the atoms surrounding a cleaved bond. For example, tagging a cleaved bond in the SMILES for ethanol (’OCC’) would give: ‘O[2*][1*]CC’. Where isotope numbers one and two are indexes into an atom type dictionary (a mapping from integer to atom type). In the following, if a fragment is a prefix of a SMILES string, it is called a “seed fragment”. All other fragments are called “branch fragments”.

Fig. 2
figure 2

Two fragmentations of folic acid (vitamin B\(_9\)). Under each mixture of fragments, the (not canonical but randomized) SMILES and corresponding DeepSMILES-without-ring-openings are shown. Tagged cleaved bonds (in black) are the wildcard atom pairs [x*][y*] with the isotope numbers x and y encoding atom types directly to the left and right of the cleaved bond. Colors indicate the seed fragment (in green), the first branch fragment (in orange) and the second branch fragment (in purple)

Fig. 3
figure 3

First line: SMILES string with tagged cleaved bonds. Second line: SMILES branch nesting depth of each character from the first line (0: not in a branch; 1: inside a branch; 2: inside a branch inside a branch, etc). Third line: S characters mark the “seed fragment”. T (resp. U) characters mark the first (resp. second) tagged cleaved bond. B marks the only heavy atom of the first “branch fragment”. C characters mark the atoms and bonds of the second “branch fragment”. The second “branch fragment” starts at SMILES branch nesting depth 2. This fragment cannot continue once the nesting depth becomes lower. The corresponding fragmented molecule can be seen on Fig. 2 right

Fragmenting molecules with tagged cleaved bonds A SMILES string with tagged cleaved bonds can be transformed into a mixture of fragments. This is not strictly necessary, as long as the molecular fragment assembler can recognize tagged cleaved bonds. However, this is useful to visualize the fragments with a 2D molecular viewer. The SMILES ‘O[2*][1*]CC’ becomes ‘O[2*][1*].[2*][1*]CC’. When extracting a fragment from a SMILES with tagged cleaved bonds, care must be taken to not extend the fragment past (or shorter than) its SMILES branch nesting depth. i.e. the nesting of parentheses pairs inside the SMILES string must be taken into account (Fig. 3).

Indexing molecules with tagged cleaved bonds For fast molecular generation, it is necessary to index all the fragmented molecules first. All seed fragments are stored into an array. Each branch fragment is stored into a hash-table of arrays where the prefix tagged cleaved bond is used as a key and the remaining of the fragment is appended to the array of all values for this key. For a fragment with several cleaved bonds, the first cleaved bond appearing in the fragment’s SMILES (when reading it from left to right) will determine under which cleaved bond type this fragment is registered. Our method uses randomized SMILES rather than canonical ones, in order to avoid any bias that could be introduced by the canonicalization procedure.

Generating molecules In essence, some kind of Markov sampling of the previously created data structure is performed, until the generated string has no tagged cleaved bond left. The algorithm is: (i) a seed fragment is uniformly drawn from the array of seed fragments. (ii) as long as the string under construction contains tagged cleaved bonds, each of them is deleted and replaced by a uniformly drawn fragment from the array of fragments with the tagged cleaved bond as the key. When generating the right flavor of DeepSMILES, such a fragment assembly algorithm can be performed using almost exclusively string operations. However, when generating SMILES, an extra step renumbering ring opening and closure numbers is required, to avoid number collisions between rings from different fragments.

Software implementation Typing atoms and bonds, selecting and tagging cleaved bonds is done by a Python script using RDKit [38]. For performance and correctness reasons [52], indexing fragments and fragment assembly is done by an OCaml [53] program reading SMILES with tagged cleaved bonds. The program is named FASMIFRA for “Fast Assembly of SMILES Fragments”.


To assess the model training speed and molecular generation frequency of the proposed method, the 1M GDB-13 molecules sample [44] from Arús-Pous [8] was used and 2B molecules were generated (same protocol). The training set was fragmented (with a fragment molecular weight decreased from 150 to 50Da, because GDB-13 molecules are quite small) , then molecules were generated from those fragments (Fig. 4). Tests were run using a single core of an Intel Core i7 CPU @ 2.7GHz, with 12 cores and 16GB or RAM, under Ubuntu Linux 20.04 LTS.

Fig. 4
figure 4

Model training (left) and sampling speed (right). RNN: Recurrent Neural Network (numbers cited from literature [8]); Frag (train): molecular fragmentation (Python/RDKit); Tag: tagging cleaved bond (proposed method, Python/RDKit); Frag (sampling): assembly of molecular fragments using molecular graph operations (Python/RDKit); Smi: fast assembly of SMILES fragments (proposed method, OCaml). Dsmi: fast assembly of DeepSMILES fragments (proposed method, OCaml); Tag is the model training prerequisite of Smi and Dsmi sampling. All methods use a single CPU, except RNN which uses four CPUs and one GPU

To assess diversity of the generated molecules, as well as training set distribution matching, a sample of 100k molecules from ChEMBL-28 and a sample of 20k molecules from TCM@Taiwan were used. After molecular fragmentation of each set, the same number of molecules was generated (100k and 20k). The number of unique Bemis-Murcko scaffolds in each training and generated set is reported, along with the number at the intersection of those sets (Table 1).

Table 1 Molecular diversity assessed via the number of unique Bemis-Murcko scaffolds at the intersection between datasets

To assess if the method is capable of training set distribution matching, those training and generated sets were projected into an eight dimensional space, where dimensions are quite unrelated (molecular weight, calculated LogP, #aromatic rings, topological polar surface area, #rotatable bonds, synthetic accessibility score [54], hydrogen bond acceptors and hydrogen bond donors). Then, the overlap between the training set and the corresponding generated set histogram was quantified using the Jaccard index (equation (1)). Let X and Y be two histograms with the same number of bins (n).

$$\begin{aligned} J(X, Y) = \frac{\sum _{1}^{n}\min (X_i,Y_i)}{\sum _{1}^{n}\max (X_i,Y_i)} \end{aligned}$$

A Jaccard index of zero means no overlap between two histograms, while one means perfect similarity (Fig. 5).

Fig. 5
figure 5

Distribution reproduction experiments’ histograms. Traditional Chinese Medicine database at Taiwan: training set in blue (TCMtrain), generated set in cyan (TCMgene). ChEMBL training set in red (CBLtrain), generated set in orange (CBLgene). The Jaccard index between the training and generated set histograms is shown in the legend (T = x)

Table 2 Comparison of several molecular generators in the GuacaMol [33] distribution learning benchmark
Fig. 6
figure 6

100 FASMIFRA-generated molecules during the GuacaMol benchmark (ChEMBL 24 training set)

Results on the GuacaMol molecular generation benchmark can be seen in Table 2 and Fig. 6.


Model training frequency (Fig. 4). RNN [8] is the slowest method, with \(\sim \)30 molecule/s (70 epochs; 8 min/epoch; 1M training set molecules). Plus, RNN used four CPUs and one GPU, so the per CPU processing frequency is much lower. Molecular fragmentation (2755 molecule/s) or cleaved bond tagging (proposed method; 2479 molecule/s); both written in Python using RDKit and running on a single CPU core have a more reasonable, and comparable, processing frequency.

Model sampling frequency (Fig. 4). Assembly of molecular fragments in Python using RDKit is the slowest method here (\(\sim \)1685 molecule/s). Editing a molecular graph (RWMol class in RDKit) is not so efficient. RNN is reasonably fast upon model sampling (\(\sim \)9167 molecule/s); although requiring four CPUs and one GPU. On the other hand, the proposed method of fast assembly of SMILES fragments reaches high sampling frequencies (\(\sim \)300599 SMILES-encoded molecule/s; \(\sim \)348636 for DeepSMILES).

GuacaMol benchmark (Table 2) In this benchmark, while FASMIFRA is one of the simplest methods (probably just after the Random sampler baseline model), a balanced performance profile is observed. As expected, FASMIFRA generates only valid molecules (Validity = 1.0). The only metric in which FASMIFRA is not great is Novelty (0.7); meaning that sometimes generated molecules are part of the training set. But, FASMIFRA being a fragment-based method, this was expected. Especially, extended bond typing constrains which fragment can be connected to which, effectively limiting the number of combinations which can be obtained from a fragment library. On the other hand, a negative control experiment was performed, where extended bond typing was turned off, which showed that while doing this improves the Novelty metric (from 0.702 to 0.947), this is at the detriment of training set distribution matching (KL_divergence is decreased from 0.959 to 0.855; FCD is decreased from 0.814 to 0.397). This negative control experiment shows that FASMIFRA is not a random molecular generator. Other methods which perform very well in the GuacaMol benchmark are the Random sampler baseline, but it cannot generate new molecules (Novelty = 0.0). The SMILES LSTM and the VAE are also very balanced and show good performance across all metrics. Compared to FASMIFRA, the Graph MCTS is lacking on the KL_divergence (0.522) and FCD metrics (0.015). The AAE is lacking on the FCD metric (0.529). The ORGAN is lacking on the Validity (0.379), KL_divergence (0.267) and FCD metrics (0.0). However, and to their defense, some of these methods might perform molecular optimization while FASMIFRA cannot (it would need to be coupled to a genetic algorithm or simulated annealer).

Molecular diversity (Table 1 and Novelty line in Table 2). In terms of Bemis-Murcko scaffolds, the proposed method generates a significant fraction of new scaffolds (74% in the generated set for ChEMBL; 68% in the generated set for TCM@Taiwan). With both datasets, generating molecules resulted in an increase of the number of unique Bemis-Murcko scaffolds in the generated set, compared to the training set (Table 1). As expected, the ChEMBL dataset (drug-like molecules) and the TCM@Taiwan dataset (natural products) are very different (Table 1 and Fig. 5). For example, both training sets share less than 1000 scaffolds.

Training set distribution matching (Fig. 5 and KL_divergence and FCD lines in Table 2). The method shows some propensity at training set distribution matching. For example, the minimum, median and maximum Jaccard indexes between histograms are (0.71, 0.805 and 0.88) for ChEMBL and (0.77, 0.845 and 0.93) for TCM@Taiwan.

On the positive side, this method is conceptually simple. Fragmenting molecules is reasonably fast (left of Fig. 4). Indexing fragments and generating molecules is extremely fast (right of Fig. 4). Only syntactically valid (Deep)SMILES encoded molecules are generated (Validity line in Table 2).

On the negative side, this method does require a training set. See the introduction for methods which don’t. Also, the method doesn’t create new rings. However, a medicinal chemistry technique is readily applicable in order to reasonably alter the generated rings [55]. If the training set contains molecules which cannot be fragmented (e.g. only made of fused rings), such molecules only contribute one seed fragment, without any attachable branch fragment (i.e. they might be copied as is from the training set to the generated set upon molecular generation). The method can generate duplicate molecules. Canonicalizing the produced SMILES would allow to detect and eventually filter those out.


In this article, a simple method to generate molecules from molecular fragments was described. The method can work with any molecular fragmentation scheme, as long as rings are not opened/broken. Several experiments were presented, evaluating model training speed, molecular generation frequency, molecular diversity and training set distribution matching. The proposed method can be used as-is in genetic algorithm or simulated annealing fragment-based molecular generators. Our prototype software implementation is released under the GPL license. The technique may also be useful for dataset augmentation and demonstrates that DeepSMILES proposed useful simplifications to the SMILES syntax. We are not working on it and it might be difficult, but an interesting extension might be to support molecular fragmentation schemes which happen to open/break rings.

Availability of data and materials

The FASMIFRA software, TCM@Taiwan and ChEMBL training samples are on github: (accessed 2021-07-28). The GDB-13 1M training sample can be downloaded from the GDB website [44, 45].


  1. Like explicit hydrogens, a formal charge or a specific isotope number.

  2. I.e. only molecules which have been synthesized and wet-lab tested.

  3. Also, to prevent molecular generation from creating new/never-seen bonds and so to favor the synthesizability of the generated molecules.



Adverserial AutoEncoder


Deep Neural Network


Fast Assembly of SMILES Fragments


Generated DataBase


Long Short-Term Memory


Monte Carlo Tree Search


Objective-Reinforced Generative Adversarial Network


Recurrent Neural Network


Self-referencing embedded strings [28]


Simplified Molecular Input Line Entry System [29]


Variational AutoEncoder


  1. Grisoni F, Moret M, Lingwood R, Schneider G (2020) Bidirectional molecule generation with recurrent neural networks. J Chem Inf Model. 60(3):1175–1183.

    Article  CAS  PubMed  Google Scholar 

  2. Gómez-Bombarelli R, Wei JN, Duvenaud D, Hernández-Lobato JM, Sánchez-Lengeling B, Sheberla D, Aguilera-Iparraguirre J, Hirzel TD, Adams RP, Aspuru-Guzik A (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent Sci. 4(2):268–276.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Segler MHS, Kogej T, Tyrchan C, Waller MP (2018) Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent Sci. 4(1):120–131.

    Article  CAS  PubMed  Google Scholar 

  4. Neil D, Segler M, Guasch L, Ahmed M, Plumbley D, Sellwood M, Brown N (2018) Exploring deep recurrent models with reinforcement learning for molecule design. ICLR

  5. Popova M, Isayev O, Tropsha A (2018) Deep reinforcement learning for de novo drug design. Adv Sci.

    Article  Google Scholar 

  6. ...Zhavoronkov A, Ivanenkov YA, Aliper A, Veselov MS, Aladinskiy VA, Aladinskaya AV, Terentiev VA, Polykovskiy DA, Kuznetsov MD, Asadulaev A, Volkov Y, Zholus A, Shayakhmetov RR, Zhebrak A, Minaeva LI, Zagribelnyy BA, Lee LH, Soll R, Madge D, Xing L, Guo T, Aspuru-Guzik A (2019) Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat Biotechnol. 37(9):1038–1040.

    Article  CAS  PubMed  Google Scholar 

  7. Sattarov B, Baskin II, Horvath D, Marcou G, Bjerrum EJ, Varnek A (2019) De novo molecular design by combining deep autoencoder recurrent neural networks with generative topographic mapping. J Chem Inf Model. 59(3):1182–1196.

    Article  CAS  PubMed  Google Scholar 

  8. Arús-Pous J, Blaschke T, Ulander S, Reymond J-L, Chen H, Engkvist O (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminf. 11(1):20.

    Article  Google Scholar 

  9. Arús-Pous J, Johansson SV, Prykhodko O, Bjerrum EJ, Tyrchan C, Reymond J-L, Chen H, Engkvist O (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminf. 11(1):71.

    Article  Google Scholar 

  10. Blaschke T, Arús-Pous J, Chen H, Margreitter C, Tyrchan C, Engkvist O, Papadopoulos K, Patronov A (2020) Reinvent 2.0: an ai tool for de novo drug design. J Chem Inf Model. 60(12):5918–5922.

    Article  CAS  PubMed  Google Scholar 

  11. Prykhodko O, Johansson SV, Kotsias P-C, Arús-Pous J, Bjerrum EJ, Engkvist O, Chen H (2019) A de novo molecular generation method using latent vector based generative adversarial network. J Cheminf. 11(1):74.

    Article  Google Scholar 

  12. Arús-Pous J, Patronov A, Bjerrum EJ, Tyrchan C, Reymond J-L, Chen H, Engkvist O (2020) SMILES-based deep generative scaffold decorator for de-novo drug design. J Cheminf. 12(1):38.

    Article  CAS  Google Scholar 

  13. Yang X, Zhang J, Yoshizoe K, Terayama K, Tsuda K (2017) ChemTS: an efficient Python library for de novo molecular generation. Sci Technol Adv Mater. 18(1):972–976.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Sumita M, Yang X, Ishihara S, Tamura R, Tsuda K (2018) Hunting for organic molecules with artificial intelligence: molecules optimized for desired excitation energies. ACS Central Sci 4(9):1126–1133.

    Article  CAS  Google Scholar 

  15. Merk D, Friedrich L, Grisoni F, Schneider G (2018) De novo design of bioactive small molecules by artificial intelligence. Mol Inf. 37(1–2):1700153.

    Article  CAS  Google Scholar 

  16. Gupta A, Müller AT, Huisman BJH, Fuchs JA, Schneider P, Schneider G (2018) Generative recurrent networks for de novo drug design. Mol Inf. 37(1–2):1700111.

    Article  CAS  Google Scholar 

  17. Blaschke T, Olivecrona M, Engkvist O, Bajorath J, Chen H (2018) Application of generative autoencoder in de novo molecular design. Mol Inf. 37(1–2):1700123.

    Article  CAS  Google Scholar 

  18. Liu X, Ye K, van Vlijmen H.W.T, IJzerman A.P, van Westen G.J.P (2019) An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine A2A receptor. J Cheminf 11(1):35.

    Article  CAS  Google Scholar 

  19. Polishchuk P (2020) CReM: chemically reasonable mutations framework for structure generation. J Cheminf. 12(1):28.

    Article  Google Scholar 

  20. Kwon Y, Lee J (2021) MolFinder: an evolutionary algorithm for the global optimization of molecular properties and the extensive exploration of chemical space using SMILES. J Cheminf. 13(1):24.

    Article  CAS  Google Scholar 

  21. Nigam A, Pollice R, Krenn M, Gomes GdP, Aspuru-Guzik A (2021) Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES. Chem Sci. 12:7079–7090.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Jensen JH (2019) A graph-based genetic algorithm and generative model/Monte Carlo Tree search for the exploration of chemical space. Chem Sci. 10(12):3567–3572.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Brown N, McKay B, Gilardoni F, Gasteiger J (2004) A graph-based genetic algorithm and its application to the multiobjective evolution of median molecules. J Chem Inf Comp Sci. 44(3):1079–1087.

    Article  CAS  Google Scholar 

  24. Leguy J, Cauchy T, Glavatskikh M, Duval B, Da Mota B (2020) EvoMol: a flexible and interpretable evolutionary algorithm for unbiased de novo molecular generation. J Cheminf. 12(1):55.

    Article  CAS  Google Scholar 

  25. Polishchuk P (2020) Control of synthetic feasibility of compounds generated with CReM. J Chem Inf Model. 60(12):6074–6080.

    Article  CAS  PubMed  Google Scholar 

  26. Joung I, Kim JY, Gross SP, Joo K, Lee J (2018) Conformational space annealing explained: a general optimization algorithm, with diverse applications. Comput Phys Commun. 223:28–33.

    Article  CAS  Google Scholar 

  27. Yoshikawa N, Terayama K, Sumita M, Homma T, Oono K, Tsuda K (2018) Population-based de novo molecule generation. Using grammatical evolution. Chem Lett. 47(11):1431–1434.

    Article  CAS  Google Scholar 

  28. Krenn M, Häse F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach Learn Sci Technol. 1(4):045024.

    Article  Google Scholar 

  29. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 28(1):31–36.

    Article  CAS  Google Scholar 

  30. James CA, et al OpenSMILES specification. Accessed 7 July 2021

  31. O’Boyle N, Dalke A (2018) DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. ChemRxiv.

    Article  Google Scholar 

  32. Bemis GW, Murcko MA (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem. 39(15):2887–2893.

    Article  CAS  PubMed  Google Scholar 

  33. Brown N, Fiscato M, Segler MHS, Vaucher AC (2019) GuacaMol: benchmarking models for de novo molecular design. J Chem Inf Model. 59(3):1096–1108.

    Article  CAS  PubMed  Google Scholar 

  34. Nathan B GuacaMol leaderbord. Accessed 23 Aug 2021

  35. Nathan B GuacaMol github. Accessed 23 Aug 202

  36. Nathan B GuacaMol baselines github. Accessed 23 Aug 202

  37. Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR (2016) The ChEMBL database in 2017. Nucleic Acids Res. 45(D1):945–954.

    Article  CAS  Google Scholar 

  38. Landrum G RDKit: Open-Source Cheminformatics. Accessed 8 July 2021

  39. Preuer K, Renz P, Unterthiner T, Hochreiter S, Klambauer G (2018) Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J Chem Inf Model. 58(9):1736–1741.

    Article  CAS  PubMed  Google Scholar 

  40. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat. 22(1):79–86

    Article  Google Scholar 

  41. Polykovskiy D, et al. MOSES. Accessed 23 Aug 2021

  42. Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, Golovanov S, Tatanov O, Belyaev S, Kurbanov R, Artamonov A, Aladinskiy V, Veselov M, Kadurin A, Johansson S, Chen H, Nikolenko S, Aspuru-Guzik A, Zhavoronkov A (2020) Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front Pharmacol. 11:1931.

    Article  CAS  Google Scholar 

  43. Guimaraes GL, Sanchez-Lengeling B, Outeiral C, Farias PLC, Aspuru-Guzik A (2018) Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models, 1–7. arXiv:1705.10843

  44. Arús-Pous J GDB13 1M sample. Accessed 8 July 2021

  45. Blum LC, Reymond J-L (2009) 970 million druglike small molecules for virtual screening in the chemical universe database gdb-13. JACS 131(25):8732–8733.

    Article  CAS  Google Scholar 

  46. Chen CY-C (2011) TCM Database@Taiwan: the world’s largest traditional Chinese medicine database for drug screening in silico. PLoS ONE. 6(1):1–5.

    Article  CAS  Google Scholar 

  47. Degen J, Wegscheid-Gerlach C, Zaliani A, Rarey M (2008) On the art of compiling and using “Drug-Like” chemical fragment spaces. ChemMedChem 3(10):1503–1507.

    Article  CAS  PubMed  Google Scholar 

  48. Lewell XQ, Judd DB, Watson SP, Hann MM (1998) RECAP - Retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J Chem Inf Comput Sci. 38(3):511–522.

    Article  CAS  PubMed  Google Scholar 

  49. Berenger F, Yamanishi Y (2020) Ranking molecules with vanishing kernels and a single parameter: active applicability domain included. J Chem Inf Model. 60(9):4376–4387.

    Article  CAS  PubMed  Google Scholar 

  50. Carhart RE, Smith DH, Venkataraghavan R (1985) Atom Pairs as molecular features in structure-activity studies: definition and applications. J Chem Inf Comput Sci. 25(2):64–73.

    Article  CAS  Google Scholar 

  51. Liu T, Naderi M, Alvin C, Mukhopadhyay S, Brylinski M (2017) Break down in order to Build up: decomposing small molecules for fragment-based drug design with eMolFrag. J Chem Inf Model. 57(4):627–631.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Berenger F, Zhang KYJ, Yamanishi Y (2019) Chemoinformatics and structural bioinformatics in ocaml. J Cheminf. 11(1):10.

    Article  Google Scholar 

  53. Leroy X, Doligez D, Frisch A, Garrigue J, Rémy D, Vouillon J (2021) The OCaml System Release 4.12 - Documentation and User’s Manual. INRIA, Paris, France

  54. Ertl P, Schuffenhauer A (2009) Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminf. 1(1):8.

    Article  CAS  Google Scholar 

  55. Pennington LD, Aquila BM, Choi Y, Valiulin RA, Muegge I (2020) Positional analogue scanning: an effective strategy for multiparameter optimization in drug design. J Med Chem. 63(17):8956–8976.

    Article  CAS  PubMed  Google Scholar 

Download references


FB acknowledges the use of ChemAxon JChem 20.13 for 2D depiction of molecules (; accessed 2021-07-08).


This work was supported by the Japan Agency for Medical Research and Development: AMED [JP20nk0101111].

Author information

Authors and Affiliations



FB designed and implemented the method, ran experiments, prepared figures and tables and wrote the initial manuscript. All authors have analyzed the results and participated in writing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Francois Berenger.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing financial interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Berenger, F., Tsuda, K. Molecular generation by Fast Assembly of (Deep)SMILES fragments. J Cheminform 13, 88 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: