MAYGEN: an open-source chemical structure generator for constitutional isomers based on the orderly generation principle

Yirik, Mehmet Aziz; Sorokina, Maria; Steinbeck, Christoph

doi:10.1186/s13321-021-00529-9

Software
Open access
Published: 03 July 2021

MAYGEN: an open-source chemical structure generator for constitutional isomers based on the orderly generation principle

Journal of Cheminformatics volume 13, Article number: 48 (2021) Cite this article

6627 Accesses
9 Citations
11 Altmetric
Metrics details

Abstract

The generation of constitutional isomer chemical spaces has been a subject of cheminformatics since the early 1960s, with applications in structure elucidation and elsewhere. In order to perform such a generation efficiently, exhaustively and isomorphism-free, the structure generator needs to ensure the building of canonical graphs already during the generation step and not by subsequent filtering. Here we present MAYGEN, an open-source, pure-Java development of a constitutional isomer molecular generator. The principles of MAYGEN’s architecture and algorithm are outlined and the software is benchmarked in single-threaded mode against the state-of-the-art, but closed-source solution MOLGEN, as well as against the best open-source solution PMG. Based on the benchmarking, MAYGEN is on average 47 times faster than PMG and on average three times slower than MOLGEN in performance.

Introduction

Unconstrained isomer generation has received attention over the past decades as a means to assess the theoretically existing chemical space and as a hypothesis generator. Recently, the works of Jean-Louis Reymond and coworkers for the creation of the GDB-11 [1], GDB-13 [2] and GDB-17 [3] databases, enumerating all possible molecules with 11, 13, and 17 non-hydrogen atoms, respectively, in the molecular formula, have laid out the motivations for unconstrained isomer generation and the exploitation of its results in sufficient detail. Such molecular generation methods can be used as hypothesis generators in areas such as computer-assisted structure elucidation, but also to answer broader questions such as the exact size of a chemical space. Structure generators that produce constitutional isomers take a molecular formula as input, e.g., $\text {C}_{10}\text {H}_{16}\text {O}$, and enumerate or output all possible chemical structures that can be built with the given set of atoms in the molecular formula. The history of chemical graph generators reaches back to the 1960s and started with the DENDRAL project [4]. Their structure generator, CONGEN [5], was based on the substructures building blocks and dealt well with the overlapping substructures. Another structure generator substructure building blocks based was Assemble [6]. Chemical graph generators are based on mathematical theorems, especially the application of algorithmic group theory [7] and combinatorial algorithms [8]. MASS was a tool for the mathematical analysis of molecular structures and constructes molecules by generating their adjacency matrices [9] and SMOG [10] was the successor of MASS. Adjacency matrices include the edge multiplicity information for each atom pair in molecules.

Despite the long history of research on the theoretical and practical generation of chemical graphs, the number of publicly available algorithms and software for this purpose is still limited. The available generators [11] are ASSEMBLE [6], COCON [12], DENDRAL [4], LSD [13], MOLGEN [14], OMG [15], PMG [16], SENECA [17] and SMOG [10]. These generators and more details are described in [11]. For several decades, the closed-source, commercial structure generator MOLGEN, developed in C at the University of Bayreuth, marked the state of the art in terms of speed and completeness. Recognising the need for an open-source structure generator, Peironcely and colleagues [15] developed the Open Molecule Generator (OMG). OMG, however, is orders of magnitude slower than MOLGEN. Following OMG, a parallelized structure generator, PMG, was developed based on the OMG algorithm. The 452,458 isomers of $\text {C}_{10}\text {H}_{16}\text {O}$, for instance, are generated in only 3 s by MOLGEN 5.0, whereas MAYGEN 1.4 and PMG take 10 and 45 s, respectively. For more benchmarks, see “Results” section of the present manuscript.

In this work, we present the development of an open-source structure generator MAYGEN, a pure-Java constitutional isomer generator based on the principle of orderly generation described by Grund et al. [18]. We benchmark our method against the fastest available open-source solution PMG as well as against the closed-source, de facto gold standard MOLGEN. On average, MAYGEN is 47 times faster than PMG and three times slower than MOLGEN. In an old Arabic saying, “may” refers to a drop of water, and we hope that MAYGEN will be a good contribution to the field and trigger a surge in the development of improved and faster versions eventually rivalling the best closed-source solutions and thereby serving the scientific community. The complete MAYGEN code, as well as precompiled binaries, are available on GitHub.

Methods

MAYGEN 1.4 generates constitutional isomers of a given molecular formula with an orderly graph generation algorithm from the field of algorithmic group theory. The principles are described in detail in [18]. We summarize them as following. A graph with p nodes, ${1,2,3, \ldots , p}$ has its symmetry group $S_{p}$. This symmetry group includes all the permutations of these nodes. However, for the case of coloured graphs, the nodes need to be partitioned (Eq. 1), in other words, nodes are grouped based on their colours, degrees and edges.

$$\begin{aligned} \lambda := (\lambda _{1}, \lambda _{2}, \ldots ) \; with \; \sum _{i}\lambda _{i} = n_{i} \end{aligned}$$

(1)

A molecule can be represented as a coloured graph. For 4 isomers of $\text {C}_{8}\text {O}_{2}\text {H}_{16}$ (Fig. 1), all atoms are coloured by their element types.

The atoms of $\text {C}_{8}\text {O}_{2}\text {H}_{16}$ can be partitioned in three groups as following: $\lambda ={2, 8, 16}$. For the case of this node partition, the symmetry group of 26 nodes, $S_{26}$, cannot be used since the nodes are coloured. In this case, a special type of symmetry group is applied, consisting of Young subgroups, that are the symmetry groups built based on the initial node partition (Eqs. 2 and 3).

$$\begin{aligned} n= & {} \bigcup \limits _{i}n_{i}^{\lambda } \, where \, n_{i}^{\lambda } = \left\{ \sum _{j=1}^{i-1} \lambda _{j}+1, \ldots , \sum _{j=1}^{i}\lambda _{j} \right\} \end{aligned}$$

(2)

$$\begin{aligned} S_{\lambda }:= & {} \left\{ \pi \in S_{n} | \forall i : \pi (n_{i}^{\lambda }) = n_{i}^{\lambda } \right\} \subseteq S_{n} \end{aligned}$$

(3)

In Eq. (2), these two summations give the minimum and maximum entries of the integer range. For the partition $\lambda ={2, 8, 16}$, its integer sets are:

$\{1, 2\} \cup \{3, 4, 5, 6, 7, 8, 9, 10\} \cup \{11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26\}$

This symmetry group $S_{\lambda }$ is the direct product of Young subgroups permuting each atom type within its partition. In the case of $\text {C}_{8}\text {O}_{2}\text {H}_{16}$, the symmetry group of $S_{\lambda }$ is $S_{\{1, 2\}}* S_{\{3, 4, 5, 6, 7, 8, 9, 10\}}*S_{\{11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26\}}$. The permutations of these symmetry groups only permute each element type within their groups, such as oxygens, carbons and hydrogens. The Young subgroups are then used for the construction of molecules’ automorphism groups (Eq. 4). These atom partitions and symmetry groups are the core part of the MAYGEN canonical test.

$$\begin{aligned} Aut(A) := \left\{ \pi \in S_{n} | A\pi = A \right\} \subseteq S_{p} \end{aligned}$$

(4)

MAYGEN’s construction of candidate structures consists of three distinct recursive tasks. First, the hydrogens are distributed to the heavy (i.e. non-hydrogen) atoms of the molecular formula. Then, the structures are generated in a block-wise manner, and finally, the canonical test avoids the generation of duplicate structures in an efficient and dynamic manner.

Molecular formula check and hydrogen distribution

Graph existence check

Before calling the generator functions, there is a preliminary test for input molecular formulae. From graph theory, a degree list d can represent a graph with p nodes if the sum of all node degrees is equal or greater than $2*(p-1)$ and if the sum is an even number (Eq. 5) [18].

$$\begin{aligned} d = (d_{1},d_{2}, \ldots ,d{p}) \, \sum _{i=1}^{p} d_{i} \, is \, even \, and \, \sum _{i=1}^{p} d_{i} \ge 2*(p-1) \end{aligned}$$

(5)

A graph with p nodes should consist of at least $(p-1)$ edges. Since an edge is connected with two nodes in a graph, the sum of its node degrees should be equal to or greater than $2*(p-1)$.

Hydrogen distribution

For a given molecular formula, MAYGEN processes the hydrogens first and distributes them to all the other atoms in all possible ways since a hydrogen atom has a valence of 1 and can always have only one neighbour. The hydrogen distribution function takes two inputs, the atom partition and the number of hydrogens. The hydrogens are distributed in ascending order within each partition in order to avoid duplicates.

After the hydrogen distribution, the initial degrees and the initial partition are updated for each hydrogen distribution. For example, the non-hydrogen atoms from the molecular formula $\text {C}_{6}\text {H}_{6}$ have the initial respective degrees as [4, 4, 4, 4, 4, 4] and the initial partition {6}. There are 7 possible hydrogen distributions (Fig. 2) to these carbon atoms. After the hydrogen distribution step, the new lists of node degrees and partitions are used for the structure generation process. With the pre-hydrogen distribution, MAYGEN deals with a $6\times 6$ matrix instead of a $12\times 12$ matrix. The matrix size also has an impact on the canonical test since this test depends directly on the rows’ permutations. The hydrogen distribution code is available in the hydrogenDistributor Java class.

Construction of candidate structures

Once the molecular formula satisfies the graph existence criteria, the hydrogen distribution is performed to build a list of degrees. MAYGEN then starts the actual construction of candidate structures for each degree. The structures are represented by adjacency matrices in which each entry represents the edge multiplicity between the atom pairs. These matrices are built in a block-wise manner. The algorithm is based on the node degrees that correspond to the atom valences. The initial partition of the atoms, based on their element symbols, defines the blocks of the matrix (Fig. 3).

With p being the number of atoms in the molecular formula without the hydrogens, an empty $p\times p$ matrix A is built. This matrix is filled in descending order starting with the maximal capacities and this is performed for each atom. The maximal capacity of an atom is calculated by decrementing its valence. For example, the valence of carbon is 4 and its maximal capacity is 3. Due to the diagonal symmetry of such matrices, only the upper triangular part needs to be filled. A canonical test, as described below, is performed once a block is filled. In a matrix, a block is defined as a number of rows and their transposes (i.e. columns). For example, a block between two indices 1 and 4 means the first 4 rows and the first 4 columns of the matrix. It needs to be noted that the canonical tests are performed without waiting for the whole matrix to be filled, which increases MAYGEN’s efficiency. This is the early boundary condition of the block-wise generation and avoids the construction of duplicate molecular structures. When the whole matrix is filled, it is written into the output SDF file, if such an option is selected at the beginning of the process. The algorithm then modifies the same input matrix A until there are no more possible changes. This is called the “build-and-forget method” [18]. The overall algorithm structure is explained in Algorithm 1 [18] and illustrated in Fig. 4.

Keeping the example of $\text {C}_{6}\text {O}_{2}\text {H}_{6}$, the initial valence vector is $v = [4,4,4,4,4,4,2,2,1,1,1,1,1,1]$, where the valences of each carbon atom are listed first, then the valences of each oxygen atom, and lastly the valences of all 6 hydrogen atoms. To optimize the process, the hydrogens are avoided in the further construction of the matrices by the hydrogen distribution step. Thus, the initial partition is $\lambda =\{6,2\}$ and the corresponding matrix is a $8\times 8$ matrix (built on 6 carbons and 2 oxygens).

Canonical test

The canonical test is the crucial part of the MAYGEN algorithm. In block-wise orderly structure generation, the early canonical testing avoids the construction of many duplicates. Overall, the purpose of the canonical test is the detection of the maximal matrix with respect to the given initial node partition.

$$\begin{aligned} A \ge A\pi \quad \forall \pi \in S_{\lambda } \end{aligned}$$

(6)

In the naive version of the canonical test, the matrix A is permuted for all the permutations of $S_{\pi }$ and its maximality is checked (Eq. 6). In the permuted matrices, $A_{\pi }$, their rows and entries are permuted. The original matrix A is compared with all the permuted matrices. Two matrices are compared row by row in a lexicographical order (Eq. 7).

$$\begin{aligned} \begin{aligned} A> A' : \Longleftrightarrow (a_{1,1},\ldots , a_{1,p},a_{2,1},\ldots , a_{2,p},a_{p,1},\ldots , a_{p,p} )\\ > (a'_{1,1},\ldots , a'_{1,p},a'_{2,1},\ldots , a'_{2,p},a'_{p,1},\ldots , a'_{p,p}) \end{aligned} \end{aligned}$$

(7)

In the block-wise orderly generation, only the rows within the blocks are compared.

Cycle transpositions

In the canonical test, the size of the symmetry group affects the run time of the algorithm. The initial partition is updated for each row during the test. Starting with the initial partition, with each row, the partitions are refined. The refinement process (Eq. 8) is explained below:

$$\begin{aligned} \lambda ^{(i)}= {\left\{ \begin{array}{ll} \underbrace{(1, ..., 1,}_\text {i-1}1,\lambda _{i}^{(i-1)} -1, \lambda _{i+1}^{(i-1)} ,...) &{} \text {if} \quad \lambda _i^{i-1} > 1\\ \underbrace{(1,...,1,}_\text {i-1}1,\lambda _{i+1}^{(i-1)} &{} \text {if} \quad \lambda _i^{(i-1)} = 1 \end{array}\right. } \end{aligned}$$

(8)

For $\text {C}_{3}\text {O}_{2}\text {H}_{4}$, the initial partition without hydrogens is {3,2}. Thus the partition list for all the rows are:

$$\begin{aligned} \begin{aligned} \lambda ^0&= \{3,2\}\\ \lambda ^1&= \{1,2,2\}\\ \lambda ^2&= \{1,1,1,2\}\\ \lambda ^3&= \{1,1,1,2\}\\ \lambda ^4&= \{1,1,1,1\} \end{aligned} \end{aligned}$$

These partition lists are used for the construction of the symmetry groups. By comparing the indices of two consecutive partitions, the cycle transpositions of symmetry groups are calculated. For partitions $\lambda ^{(i-1)}$ and $\lambda ^{(i)}$, the number of cycles is the $i$th entry in the former partition $\lambda _{i}^{(i-1)}$ (Eq. 9).

$$\begin{aligned} S_{\lambda ^{i-1}} = \cup _{j=i}^{\lambda _{i}^{i-1}}(i,j)S_{\lambda ^{i}},i=1,\ldots ,p-1 \end{aligned}$$

(9)

For example, the initial partition is {3, 2} and the refined partition for the first row is {1,2,2}. Here the number of cycle transpositions is 3 since the first entry of the former partition is 3. The cycle transpositions are (1,1), (1,2) and (1,3). These cycles are calculated row by row for all the partitions. The symmetry group of the molecule is calculated by the multiplication of all these cycles. The list of the partitions and their cycles are listed below:

$$\begin{aligned} \lambda ^0= & {} \{3,2\} \quad \lambda ^1 = \{1,2,2\} \quad Cycles: (1,1), (1,2), (1,3) \\ \lambda ^1= & {} \{1,2,2\} \quad \lambda ^2 = \{1,1,1,2\} \quad Cycles: (2,2), (2,3) \\ \lambda ^2= & {} \{1,1,1,2\} \quad \lambda ^3 = \{1,1,1,2\} \quad Cycles: (3,3) \\ \lambda ^3= & {} \{1,1,1,2\} \quad \lambda ^4 = \{1,1,1,1\} \quad Cycles: (4,4), (4,5) \end{aligned}$$

Calculation of automorphisms

In the canonical test, for a candidate matrix the corresponding automorphisms are calculated row by row. For the $i$th row of a matrix, the cycle transpositions $\varsigma _{i,j}$ are calculated based on the partitions $\lambda ^{(i-1)}$ and $\lambda ^{(i)}$. These cycle transpositions are used in the automorphisms search. All these cycles are multiplied in DFS manner with all the former automorphisms $\tau$ of the graph. This updated list of permutations are used in the canonical test of the matrix. For a graph with p nodes, its list of automorphisms until the $i$th row is:

$$\begin{aligned} F^{(i)} = \{\tau \in F^{(i-1)} | \tau * \varsigma _{(i,j)} \} \quad i<j<\lambda _{i}^{i-1} \end{aligned}$$

(10)

After the multiplication with all its cycles (Eq. 10), this updated list of automorphisms is used in the maximality check. If an automorphism is detected, that permutation is added to the automorphisms list, $F^{i}$. Thus, the automorphisms list is updated for each row until the row is in maximal form with respect to its partitions.

Maximality check

For the maximality test of the $i$th row of a matrix, the row is compared with each permutation action in the automorphisms list. For each permutation, the original matrix A is permuted. Then, the $i$th rows of the original matrix and the permuted one are compared. These two rows are compared based on the $i$th atom partition. For an initial matrix A, as shown in Fig. 5a, with its partition $\lambda ^{(0)} = \{5\}$ and the refined partition $\lambda ^{(0)'} = \{1,4\}$, there are 5 cycle transpositions. One of these cycles is (1,2). To perform the maximality test, its first and second rows are compared (Fig. 5a).

In this example, the permutation (1,2) is an automorphism of the first row since it maps the row to itself in the adjacency matrix. Then this permutation is added to the list of automorphisms. However, in the case where a mapping with a cycle does not map the row to itself, a canonical permutation is needed. For an initial matrix B (Fig. 5b) with its initial partition $\lambda ^{(0)} = \{5\}$, its refined partition is $\lambda ^{(0)'} = \{1,4\}$, and there are 5 cycle transpositions for these partition. One of them is (1,2). To perform the maximality test, its first and second rows are compared (Fig. 5b).

Different from example A, in matrix B the first and second row are not identical after the cycle transpositions, and a canonical permutation is therefore needed. The canonical permutations are searched within the Young subgroups built with respect to the refined partition. In this example, the refined partition is $\lambda ^{(0)'} = \{1,4\}$. Thus, the symmetry group is $S_{\{1\}}*S_{\{2,3,4,5\}}$. For the canonical permutation search, only the permutations of the sets {1} and {2,3,4,5} are required. For the rows of matrix B, the canonical permutation is then (3,5), as depicted in Fig. 5c. Thus, (1,2)(3,5) is the automorphism of the first row and added to the automorphisms list for further testings.

In general, there are three criteria for updating the automorphisms list and for the maximality check:

In the canonical test, if the row is canonical after testing all the permutations, the partition $\lambda ^{(i+1)}$ is built based on the $i$th row’s entries. After filling the entries of the $i$th row, i.e., adding bonds to the $i$th atom, the atom neighbourhoods are changed. Therefore the partition $\lambda ^{(i+1)}$ is defined based on the partition $\lambda ^{(i)}$ and the $i$th row entries. For matrix A and its refined partition $\lambda ^{(0)'}=\{1,4\}$, its partition first is updated with respect to the first row entries:

$$\begin{aligned}&\text {Refined partition }\lambda ^{(0)'}=\{1,4\} \rightarrow A[1]=[0 | {{2}} ,{{1, 1}}, {{0}}] \\&\quad \rightarrow \text {Updated partition }\lambda ^{(1)}=\{1,{{1}},{{2}},{{1}}\} \end{aligned}$$

The canonical test continues until the rows are in maximal form in lexicographic order. The automorphisms and partition lists are updated row by row.

Learning from canonical test

In case a molecule cannot pass the canonical test, there is still something to learn from the test. In the row by row comparison of the canonical test, when a row does not pass the test, the entry making it non-canonical is detected. As explained in Algorithm 1, if a block is not canonical, MAYGEN updates the matrix starting with its last entry in the block. However, with the help of the non-canonical matrix, the algorithm starts modifying the matrix from the entry making the matrix non-canonical. For a matrix C with its partition $\lambda ^{(0)}=\{5\}$ and the refined partition $\lambda ^{(0)'}=\{1,4\}$, there are 5 cycle transpositions. One of these cycles is (1,3). To perform the maximality test, its first and third rows are compared as shown in Fig. 6.

The permutation $\pi =(2,4)(3,5) \in S_{\{1\}}*S_{\{2,3,4,5\}}$ makes the third row bigger than the first row. Here the first entry making the row non-canonical is C[3, 4] in the matrix. Then the matrix construction continues with the indices [3, 4]. Using the learning from the canonical test, all the other non-canonical matrices are skipped.

Connectivity test

The connectivity test of a graph is performed based on the neighbourhoods of all its nodes. The connectivity test starts with enumerating the nodes and setting this as the initial graph enumeration. The enumeration list is updated while checking the neighbour lists node by node. After detecting neighbours of a node, the labelling of the tested node and its neighbours from the graph enumeration list are stored. The minimum value of this set is given as the smallest index of the neighbourhood. This smallest index value is used for updating the list of graph enumeration. The test is terminated once all the nodes have the same label or all the nodes are re-labelled. For example, the connectivity test is performed for an isomer of $\text {C}_{6}\text {H}_{6}$ represented by the adjacency matrix A (Fig. 7a) with its initial node enumeration (labels) {1, 2, 3, 4, 5, 6} (Table 1).

Table 1 The connectivity test for an isomer of $\text {C}_{6}\text {H}_{6}$ represented by matrix A (Fig. 7a)

Full size table

The matrix A (Fig. 7a) is connected since the smallest node label for each tested node is 1 and its last node enumeration list includes only 1s. Thus there is only one component whose smallest index is 1 (Fig. 7b). For a disconnected chemical graph represented by the adjacency matrix B (Fig. 8a) with its initial node enumeration (labels) {1, 2, 3, 4, 5, 6}.

The matrix B represents a disconnected isomer of $\text {C}_{6}\text {H}_{6}$. This molecule has two components (Fig. 8b) with the indices $\varsigma _1= \{1,2,5\}$ and $\varsigma _2= \{3,4,6\}$. The first component $\varsigma _1$ is the first component with respect to its atom labelling. Here, components are compared with respect to their maximum index.

Learning from connectivity test

Similar to “Learning from canonical test”, there is still something to learn from the connectivity test if a molecule is not connected. In MAYGEN, the connectivity test is performed when a canonical matrix is complete. If a molecule is not connected, it is not stored in the output file and its first component needs to be detected. For example, the matrix B with Table 2, its first component is $\varsigma _1= \{1,2,5\}$. The maximum index of the first component identifies where the graph gets disconnected.

Table 2 The connectivity test for an isomer of $\text {C}_{6}\text {H}_{6}$ represented by matrix B (Fig. 8a)

Full size table

In Algorithm 1, when a matrix is complete and stored in the output file, the generation process continues with the backward function. Here, the last index of the matrix is used as the input. However, with the “learning from connectivity test”, the algorithm continues with the last entry of the first component. For example, in matrix B, the first component is $\varsigma _1= \{1,2,5\}$ and the maximum index is 5. Thus, the graph gets disconnected after the last entry of the fifth row, B[5, 6] entry of the matrix B. All the other modifications on the matrix between its last entry [6, 6] and [5, 6] build only disconnected graphs. That is why the matrix modification process continues with the last entry of the first component. Learning from the connectivity test reduces the construction of disconnected graphs.

Results

MAYGEN is written purely in Java and hosted on GitHub (see section Availability). The full source code, as well as pre-compiled binaries, are available for download. The code can be executed as follows:

Generates molecular structures for a given molecular formula. The input is a molecular formula string, e.g. ‘C₂OH₄’. Besides this formula, the directory is needed to be specified for the output file.

In order to generate constitutional isomers, the user needs to pass a molecular formula with the -f option:

Alternatively, users who either want to contribute to the development or use the latest source code can clone the GitHub repository and build the MAYGEN binary using the Maven build environment.

For the purpose of this publication, MAYGEN was tested with randomly selected molecular formulae. The run times of MAYGEN, MOLGEN and PMG are compared in Table 4. The computational experiments were performed in single-threaded mode and without storing structures in an output file. PMG was tested against OMG and confirmed that even in single-threaded mode, PMG is faster. We used the latest version of Molgen, V 5, to be able to benchmark against larger numbers of molecular formulae. Molgen 3.5, which is faster than Molgen 5, is only available as a Windows GUI application and, to the best of our knowledge, cannot be run in batch mode. Furthermore, we do not own Windows license of Molgen 3.5. We did, however, manually run 10 formulae of the test version of Molgen 3.5 against the test version of Molgen 5 on the same Windows machine (Table 3).

Table 3 The number of structures and the run times are listed for MOLGEN 3.5 and MOLGEN 5.0 with a randomly selected 10 molecular formulae

Full size table

The comparison showed that Molgen 3.5 is about four times faster than Molgen 5 on average for these 10 tests. Different from MOLGEN 5.0 [14], PMG generates structures for additional valences of sulfur (S), phosphorus (P) and nitrogen (N) and therefore more molecules than MOLGEN or MAYGEN [15]. MOLGEN 5.0 uses the default lowest valences for N(3), S(2), and P(3), unless a user defines the higher valences. For all the results given in Table 4, MAYGEN generated the same number of structures as Molgen 5.0. Molgen has an aromaticity filter that filters out resonance structures of substituted aromatic molecules. This filter was deactivated with the -noaromaticity flag to achieve comparability. Since halogens are not defined in PMG, it does not generate structures with molecular formulae including Cl, F, Br, I.

Table 4 The number of structures and the run times are listed for MOLGEN 5.0, MAYGEN and PMG with a diverse set of molecular formulae

Full size table

For most structures containing all allowed elements, MOLGEN was slightly faster than MAYGEN and much faster than PMG (Figs. 9, 10); for carbohydrates and those containing additional oxygen, MAYGEN’s execution speed was comparable to that of MOLGEN. Since PMG does not generate structures for formulae with halogens, “N/A” is added to the result table. “> 24 h” is added to the result for the formulae for which PMG took longer than a day. These results are visualized with spaces in the plots (Figs. 9, 10).

Limitations

MAYGEN is currently restricted to generate molecules with the lowest valence states of nitrogen, phosphorus and sulfur, and all testing and benchmarking was done under this boundary condition. This is no principle restriction—the algorithm will work with any given valence state—but the workflow logic of MAYGEN needs to be adapted to compute structures for higher valences of these elements.

Future work

Being implemented in pure Java and with its code completely open, MAYGEN can be easily extended with additional functionalities and algorithmic improvements. The code availability through GitHub invites the scientific community to contribute to the further developments of MAYGEN. Obvious future work includes performance enhancements and the parallelization of the algorithm. Future implementations of MAYGEN will be parallelised. The lowest hanging fruit will be exploiting the built-in parallelism in the Java VM using multiple available cores. Here, trivial parallelism can be used by computing the isomers of different hydrogen distributions simultaneously. With 8 cores in the CPU of the senior author’s laptop and 18 cores in individual CPUs on our local compute cluster, significant speed gains can be achieved through this simple measure. The examples in our results Table 4 have between 2 and 74 hydrogen partitions, which yields plenty of space for further speed gains. Trivial parallelism can be pushed further by recent cloud orchestration schemes where containers can be seamlessly launched in large clouds, for example using the Google Container Engine. Here, the number of parallel computations X can be matched to fit the number of partitions precisely, leading to an approximate speed gain of X, ignoring the container provisioning and result collection. More elaborate non-trivial parallelisation schemes will be needed to push the boundary of computing with more heavy atoms in each molecular formula beyond the current 15–20 atom limit. The exponential explosion of the number of isomers in this region, will only allow for very moderate advances though. We also aim to integrate MAYGEN into the Chemistry Development Kit (CDK) [19] in the near future which will enable an easy integration of the molecular structure generator in other software programmatically. Furthermore, it is desirable that MAYGEN can use substructures in its input as building blocks, in order to include them as badlists or goodlists into the generation and therefore reduce the number of candidate structures to generate. This will enable its use in systems for computer-assisted structure elucidation (CASE) whose aim is to elucidate chemical structures from NMR and mass spectral data.

Conclusion

In this manuscript we presented MAYGEN, an open-source constitutional isomer generator completely written in Java. MAYGEN generates constitutional isomer spaces exhaustively and avoids isomorphic structures during the generation using the principles of orderly canonical graph generation. We presented extensive testing of MAYGEN against two alternative solutions: MAYGEN outperforms the current best open source structure generator PMG by orders of magnitude, on average 47 times faster, and is only marginally slower, on average three times, than the fastest current state-of-the-art software MOLGEN. We expect MAYGEN to be a starting point for further developments in the area of chemical structure generation by the open source, open science community.

Data availability

Project name: MAYGEN, Project home page: https://github.com/MehmetAzizYirik/MAYGEN, Operating system(s): Platform independent, Programming language: Java, License: MIT.

References

Fink T, Reymond J-L (2007) Virtual exploration of the chemical universe up to 11 atoms of c, n, o, f: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery. J Chem Inf Model 47(2):342–353
Article CAS Google Scholar
Blum LC, Reymond J-L (2009) 970 million druglike small molecules for virtual screening in the chemical universe database GSB-13. J Am Chem Soc 131(25):8732–8733
Article CAS Google Scholar
Ruddigkeit L, Van Deursen R, Blum LC, Reymond J-L (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inf Model 52(11):2864–2875
Article CAS Google Scholar
Sutherland G (1967) Dendral—a computer program for generating and filtering chemical structures. Technical report, Stanford Univ Calif Dept of Computer Science
Bruccoleri RE, Karplus M (1987) Prediction of the folding of short polypeptide segments by uniform conformational sampling. Biopolym Original Res Biomol 26(1):137–168
CAS Google Scholar
Badertscher M, Korytko A, Schulz K-P, Madison M, Munk ME, Portmann P, Junghans M, Fontana P, Pretsch E (2000) Assemble 2.0: a structure generator. Chemom Intell Lab Syst 51(1):73–79
Article CAS Google Scholar
Holt DF, Eick B, O’Brien EA (2005) Handbook of computational group theory. CRC Press, Boca Raton
Book Google Scholar
Kreher DL, Stinson DR (2020) Combinatorial algorithms: generation, enumeration, and search. CRC Press, Boca Raton
Book Google Scholar
Serov V, Elyashberg ME, Gribov L (1976) Mathematical synthesis and analysis of molecular structures. J Mol Struct 31(2):381–397
Article CAS Google Scholar
Molchanova MS, Shcherbukhin VV, Zefirov NS (1996) Computer generation of molecular structures by the SMOG program. J Chem Inf Comput Sci 36(4):888–899
Article CAS Google Scholar
Yirik MA, Steinbeck C (2021) Chemical graph generators. PLoS Comput Biol 17(1):1008504
Article Google Scholar
Junker J (2011) Theoretical NMR correlations based structure discussion. J Cheminform 3(1):1–4
Article Google Scholar
Nuzillard J-M, Georges M (1991) Logic for structure determination. Tetrahedron 47(22):3655–3664
Article CAS Google Scholar
Gugisch R, Kerber A, Kohnert A, Laue R, Meringer M, Rücker C, Wassermann A (2015) Chapter 6—molgen 5.0, a molecular structure generator. In: Basak SC, Restrepo G, Villaveces JL (eds) Advances in mathematical chemistry and applications, Chap. 6, vol 1. Bentham Science Publishers, Charjah, pp 113–138
Peironcely JE, Rojas-Chertó M, Fichera D, Reijmers T, Coulier L, Faulon J-L, Hankemeier T (2012) OMG: open molecule generator. J Cheminform 4(1):1–13
Article Google Scholar
Jaghoori MM, Jongmans S-ST, De Boer F, Peironcely J, Faulon J-L, Reijmers T, Hankemeier T (2013) PMG: multi-core metabolite identification. Electron Notes Theor Comput Sci 299:53–60
Article Google Scholar
Steinbeck C (2001) Seneca: a platform-independent, distributed, and parallel system for computer-assisted structure elucidation in organic chemistry. J Chem Inf Comput Sci 41(6):1500–1507
Article CAS Google Scholar
Grund R, Müller R (1995) Konstruktion Molekularer Graphen Mit Gegebenen Hybridisierungen und Überlappungsfreien Fragmenten. Lehrstuhl II für Mathematik, Bayreuth
Google Scholar
Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluskal T, Rojas-Chertó M, Spjuth O et al (2017) The chemistry development kit (CDK) v2. 0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform 9(1):1–19
Article Google Scholar

Download references

Acknowledgements

We wish to acknowledge the helpful discussion with Dr. Roland Grund about the principles of orderly generation and the help from Valentyn Kolesnikov for the performance tuning. We also would like to thank Gulnihal Gul Mamat for designing the logo. The computational experiments were performed in single-threaded mode and on resources of Friedrich Schiller University Jena supported in part by DFG grants INST 275/334-1 FUGG and INST 275/363-1 FUGG.

Funding

Open Access funding enabled and organized by Projekt DEAL. MAY and CS acknowledge funding by the Carl-Zeiss-Foundation. MS was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project-ID 239748522, SFB 1127 ChemBioSys.

Author information

Authors and Affiliations

Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller University, Lessing Strasse 8, 07743, Jena, Germany
Mehmet Aziz Yirik, Maria Sorokina & Christoph Steinbeck

Authors

Mehmet Aziz Yirik
View author publications
You can also search for this author in PubMed Google Scholar
Maria Sorokina
View author publications
You can also search for this author in PubMed Google Scholar
Christoph Steinbeck
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

MAY developed MAYGEN, performed the evaluation and testing and made the figures. MS contributed with advice, code-review and made the figures. CS conceived the project and guided the development. All authors wrote the manuscript. All authors read and approved the manuscript.

Corresponding author

Correspondence to Christoph Steinbeck.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Yirik, M.A., Sorokina, M. & Steinbeck, C. MAYGEN: an open-source chemical structure generator for constitutional isomers based on the orderly generation principle. J Cheminform 13, 48 (2021). https://doi.org/10.1186/s13321-021-00529-9

Download citation

Received: 28 April 2021
Accepted: 22 June 2021
Published: 03 July 2021
DOI: https://doi.org/10.1186/s13321-021-00529-9

MAYGEN: an open-source chemical structure generator for constitutional isomers based on the orderly generation principle

Abstract

Introduction

Methods