MAYGEN: an open-source chemical structure generator for constitutional isomers based on the orderly generation principle

The generation of constitutional isomer chemical spaces has been a subject of cheminformatics since the early 1960s, with applications in structure elucidation and elsewhere. In order to perform such a generation efficiently, exhaustively and isomorphism-free, the structure generator needs to ensure the building of canonical graphs already during the generation step and not by subsequent filtering. Here we present MAYGEN, an open-source, pure-Java development of a constitutional isomer molecular generator. The principles of MAYGEN’s architecture and algorithm are outlined and the software is benchmarked in single-threaded mode against the state-of-the-art, but closed-source solution MOLGEN, as well as against the best open-source solution PMG. Based on the benchmarking, MAYGEN is on average 47 times faster than PMG and on average three times slower than MOLGEN in performance.


Introduction
Unconstrained isomer generation has received attention over the past decades as a means to assess the theoretically existing chemical space and as a hypothesis generator. Recently, the works of Jean-Louis Reymond and coworkers for the creation of the GDB-11 [1], GDB-13 [2] and GDB-17 [3] databases, enumerating all possible molecules with 11, 13, and 17 non-hydrogen atoms, respectively, in the molecular formula, have laid out the motivations for unconstrained isomer generation and the exploitation of its results in sufficient detail. Such molecular generation methods can be used as hypothesis generators in areas such as computer-assisted structure elucidation, but also to answer broader questions such as the exact size of a chemical space. Structure generators that produce constitutional isomers take a molecular formula as input, e.g., C 10 H 16 O , and enumerate or output all possible chemical structures that can be built with the given set of atoms in the molecular formula. The history of chemical graph generators reaches back to the 1960s and started with the DENDRAL project [4]. Their structure generator, CONGEN [5], was based on the substructures building blocks and dealt well with the overlapping substructures. Another structure generator substructure building blocks based was Assemble [6]. Chemical graph generators are based on mathematical theorems, especially the application of algorithmic group theory [7] and combinatorial algorithms [8]. MASS was a tool for the mathematical analysis of molecular structures and constructes molecules by generating their adjacency matrices [9] and SMOG [10] was the successor of MASS.
purpose is still limited. The available generators [11] are ASSEMBLE [6], COCON [12], DENDRAL [4], LSD [13], MOLGEN [14], OMG [15], PMG [16], SENECA [17] and SMOG [10]. These generators and more details are described in [11]. For several decades, the closed-source, commercial structure generator MOLGEN, developed in C at the University of Bayreuth, marked the state of the art in terms of speed and completeness. Recognising the need for an open-source structure generator, Peironcely and colleagues [15] developed the Open Molecule Generator (OMG). OMG, however, is orders of magnitude slower than MOLGEN. Following OMG, a parallelized structure generator, PMG, was developed based on the OMG algorithm. The 452,458 isomers of C 10 H 16 O , for instance, are generated in only 3 s by MOLGEN 5.0, whereas MAYGEN 1.4 and PMG take 10 and 45 s, respectively. For more benchmarks, see "Results" section of the present manuscript.
In this work, we present the development of an opensource structure generator MAYGEN, a pure-Java constitutional isomer generator based on the principle of orderly generation described by Grund et al. [18]. We benchmark our method against the fastest available open-source solution PMG as well as against the closedsource, de facto gold standard MOLGEN. On average, MAYGEN is 47 times faster than PMG and three times slower than MOLGEN. In an old Arabic saying, "may" refers to a drop of water, and we hope that MAYGEN will be a good contribution to the field and trigger a surge in the development of improved and faster versions eventually rivalling the best closed-source solutions and thereby serving the scientific community. The complete MAY-GEN code, as well as precompiled binaries, are available on GitHub. MAYGEN 1.4 generates constitutional isomers of a given molecular formula with an orderly graph generation algorithm from the field of algorithmic group theory. The principles are described in detail in [18]. We summarize them as following. A graph with p nodes, 1, 2, 3, . . . , p has its symmetry group S p . This symmetry group includes all the permutations of these nodes. However, for the case of coloured graphs, the nodes need to be partitioned (Eq. 1), in other words, nodes are grouped based on their colours, degrees and edges.

Methods
A molecule can be represented as a coloured graph. For 4 isomers of C 8 O 2 H 16 (Fig. 1), all atoms are coloured by their element types.
The atoms of C 8 O 2 H 16 can be partitioned in three groups as following: = 2, 8, 16 . For the case of this node partition, the symmetry group of 26 nodes, S 26 , cannot be used since the nodes are coloured. In this case, a special type of symmetry group is applied, consisting of Young subgroups, that are the symmetry groups built based on the initial node partition (Eqs. 2 and 3).
MAYGEN's construction of candidate structures consists of three distinct recursive tasks. First, the hydrogens are distributed to the heavy (i.e. non-hydrogen) atoms of the molecular formula. Then, the structures are generated in a block-wise manner, and finally, the canonical test avoids the generation of duplicate structures in an efficient and dynamic manner.

Molecular formula check and hydrogen distribution Graph existence check
Before calling the generator functions, there is a preliminary test for input molecular formulae. From graph theory, a degree list d can represent a graph with p nodes if the sum of all node degrees is equal or greater than 2 * (p − 1) and if the sum is an even number (Eq. 5) [18].
A graph with p nodes should consist of at least (p − 1) edges. Since an edge is connected with two nodes in a graph, the sum of its node degrees should be equal to or greater than 2 * (p − 1).

Hydrogen distribution
For a given molecular formula, MAYGEN processes the hydrogens first and distributes them to all the other atoms in all possible ways since a hydrogen atom has a valence of 1 and can always have only one neighbour. The hydrogen distribution function takes two inputs, the atom partition and the number of hydrogens. The hydrogens are distributed in ascending order within each partition in order to avoid duplicates.
After the hydrogen distribution, the initial degrees and the initial partition are updated for each hydrogen distribution. For example, the non-hydrogen atoms from the molecular formula C 6 H 6 have the initial respective degrees as [4,4,4,4,4,4] and the initial partition {6}. There are 7 possible hydrogen distributions (Fig. 2) to these carbon atoms. After the hydrogen distribution step, the new lists of node degrees and partitions are used for the structure generation process. With the pre-hydrogen distribution, MAYGEN deals with a 6 × 6 matrix instead of a 12 × 12 matrix. The matrix size also has an impact on the canonical test since this test depends directly on the rows' permutations. The hydrogen distribution code is available in the hydrogenDistributor Java class.

Construction of candidate structures
Once the molecular formula satisfies the graph existence criteria, the hydrogen distribution is performed to build a list of degrees. MAYGEN then starts the actual construction of candidate structures for each degree. The structures are represented by adjacency matrices in which each entry represents the edge multiplicity between the atom pairs. These matrices are built in a block-wise manner. The algorithm is based on the node degrees that correspond to the atom valences. The initial partition of the atoms, based on their element symbols, defines the blocks of the matrix (Fig. 3).
With p being the number of atoms in the molecular formula without the hydrogens, an empty p × p matrix A is built. This matrix is filled in descending order starting Fig. 2 Illustration of the hydrogen distribution of C6H6 (in yellow) and its effect on the assigned atom valency (in blue) and on the atom partition (in red)

Fig. 3
Block-wise representation of a matrix. Here, the matrix is split into parts based on the initial node partition with p entries. Image adapted from Grund et al. [18] with the maximal capacities and this is performed for each atom. The maximal capacity of an atom is calculated by decrementing its valence. For example, the valence of carbon is 4 and its maximal capacity is 3. Due to the diagonal symmetry of such matrices, only the upper triangular part needs to be filled. A canonical test, as described below, is performed once a block is filled. In a matrix, a block is defined as a number of rows and their transposes (i.e. columns). For example, a block between two indices 1 and 4 means the first 4 rows and the first 4 columns of the matrix. It needs to be noted that the canonical tests are performed without waiting for the whole matrix to be filled, which increases MAYGEN's efficiency. This is the early boundary condition of the block-wise generation and avoids the construction of duplicate molecular structures. When the whole matrix is filled, it is written into the output SDF file, if such an option is selected at the beginning of the process. The algorithm then modifies the same input matrix A until there are no more possible changes. This is called the "build-and-forget method" [18]. The overall algorithm structure is explained in Algorithm 1 [18] and illustrated in Fig. 4.
Keeping the example of C 6 O 2 H 6 , the initial valence vector is v = [4, 4, 4, 4, 4, 4, 2, 2, 1, 1, 1, 1, 1, 1] , where the valences of each carbon atom are listed first, then the valences of each oxygen atom, and lastly the valences of all 6 hydrogen atoms. To optimize the process, the hydrogens are avoided in the further construction of the matrices by the hydrogen distribution step. Thus, the initial partition is = {6, 2} and the corresponding matrix is a 8 × 8 matrix (built on 6 carbons and 2 oxygens).

Algorithm 1: MAYGEN algorithm
Input: Molecular formula with p non-hydrogen atoms Output: SDF file with molecular structures Step 1: Perform hydrogen distribution Step 2: First the block index i is set, i = 1; go to step 4.
Step 3: if i = 0 then the procedure stops else go to step 5 Step 4: Maximum filling Fill the strip A(i) in lexicographic order depending on the valences. if no more fillings exist then 1 set i = (i − 1) 2 go to step 3 else go to step 6 Step 5: Smaller filling Fill the strip A(i) in a reverse lexicographic order depending on the valences. if no more fillings exist then 1 set i = (i − 1) 2 go to step 3 else go to step 6 Step 6: (c) go to step 4 else go to step 5

Canonical test
The canonical test is the crucial part of the MAYGEN algorithm. In block-wise orderly structure generation, the early canonical testing avoids the construction of many duplicates. Overall, the purpose of the canonical test is the detection of the maximal matrix with respect to the given initial node partition. In the naive version of the canonical test, the matrix A is permuted for all the permutations of S π and its maximality is checked (Eq. 6). In the permuted matrices, A π , their rows and entries are permuted. The original matrix A is compared with all the permuted matrices. Two matrices are compared row by row in a lexicographical order (Eq. 7).
In the block-wise orderly generation, only the rows within the blocks are compared.

Cycle transpositions
In the canonical test, the size of the symmetry group affects the run time of the algorithm. The initial partition is updated for each row during the test. Starting with the initial partition, with each row, the partitions are refined. The refinement process (Eq. 8) is explained below: For C 3 O 2 H 4 , the initial partition without hydrogens is {3,2}. Thus the partition list for all the rows are: These partition lists are used for the construction of the symmetry groups. By comparing the indices of two consecutive partitions, the cycle transpositions of symmetry groups are calculated. For partitions (i−1) and (i) , the number of cycles is the i th entry in the former partition For example, the initial partition is {3, 2} and the refined partition for the first row is {1,2,2}. Here the number of cycle transpositions is 3 since the first entry of the former partition is 3. The cycle transpositions are (1,1), (1,2) and (1,3). These cycles are calculated row by row for all the partitions. The symmetry group of the molecule is calculated by the multiplication of all these cycles. The list of the partitions and their cycles are listed below:

Calculation of automorphisms
In the canonical test, for a candidate matrix the corresponding automorphisms are calculated row by row. For the i th row of a matrix, the cycle transpositions ς i,j are calculated based on the partitions (i−1) and (i) . These cycle transpositions are used in the automorphisms search. All these cycles are multiplied in DFS manner with all the former automorphisms τ of the graph. This updated list of permutations are used in the canonical test of the matrix. For a graph with p nodes, its list of automorphisms until the i th row is: After the multiplication with all its cycles (Eq. 10), this updated list of automorphisms is used in the maximality check. If an automorphism is detected, that permutation is added to the automorphisms list, F i . Thus, the automorphisms list is updated for each row until the row is in maximal form with respect to its partitions.

Maximality check
For the maximality test of the i th row of a matrix, the row is compared with each permutation action in the automorphisms list. For each permutation, the original matrix A is permuted. Then, the i th rows of the original matrix and the permuted one are compared. These two rows are compared based on the i th atom partition. For an initial matrix A, as shown in Fig. 5a, with its partition (0) = {5} and the refined partition (0) ′ = {1, 4} , there are 5 cycle transpositions. One of these cycles is (1,2). To perform the maximality test, its first and second rows are compared (Fig. 5a).
In this example, the permutation (1,2) is an automorphism of the first row since it maps the row to itself in the adjacency matrix. Then this permutation is added to the list of automorphisms. However, in the case where a mapping with a cycle does not map the row to itself, a canonical permutation is needed. For an initial matrix B (Fig. 5b) with its initial partition (0) = {5} , its refined partition is (0) ′ = {1, 4} , and there are 5 cycle transpositions for these partition. One of them is (1,2). To perform the maximality test, its first and second rows are compared (Fig. 5b).
Different from example A, in matrix B the first and second row are not identical after the cycle transpositions, and a canonical permutation is therefore needed. The canonical permutations are searched within the Young subgroups built with respect to the refined partition. In this example, the refined partition is In general, there are three criteria for updating the automorphisms list and for the maximality check: Fig. 5 Maximality check. a A matrix A is permuted with a cycle transposition. The first and the second rows are identical after the permutation action. b A matrix B is permuted with a cycle transposition. The first and the second rows are not identical. c The canonical permutation of matrix B is given 1 Procedure: Updating the automorphism list and maximality check 1 If the row i is maximal and equal to the permuted row, the permutation is added to the automorphism list; 2 If the row i is maximal but not equal to the permuted row, an automorphism is searched in its Young subgroup (a) If there is such an automorphism, the permutation is added to the automorphisms list; (b) Else, the automorphism is ignored and not added to the list. 3 If the original row i is smaller than the permuted matrix, the tested candidate molecule is not canonical. The canonical test is then terminated.
In the canonical test, if the row is canonical after testing all the permutations, the partition (i+1) is built based on the i th row's entries. After filling the entries of the i th row, i.e., adding bonds to the i th atom, the atom neighbourhoods are changed. Therefore the partition (i+1) is defined based on the partition (i) and the i th row entries. For matrix A and its refined partition (0) ′ = {1, 4} , its partition first is updated with respect to the first row entries: The canonical test continues until the rows are in maximal form in lexicographic order. The automorphisms and partition lists are updated row by row.

Learning from canonical test
In case a molecule cannot pass the canonical test, there is still something to learn from the test. In the row by row comparison of the canonical test, when a row does not pass the test, the entry making it non-canonical is detected. As explained in Algorithm 1, if a block is not canonical, MAYGEN updates the matrix starting with its last entry in the block. However, with the help of the non-canonical matrix, the algorithm starts modifying the matrix from the entry making the matrix non-canonical. For a matrix C with its partition (0) = {5} and the refined partition (0) ′ = {1, 4} , there are 5 cycle transpositions. One of these cycles is (1,3). To perform the maximality test, its first and third rows are compared as shown in Fig. 6. The permutation π = (2, 4)(3, 5) ∈ S {1} * S {2,3,4,5} makes the third row bigger than the first row. Here the first entry making the row non-canonical is C [3,4] in the matrix. Then the matrix construction continues with the indices [3,4]. Using the learning from the canonical test, all the other non-canonical matrices are skipped.

Connectivity test
The connectivity test of a graph is performed based on the neighbourhoods of all its nodes. The connectivity test starts with enumerating the nodes and setting this as the initial graph enumeration. The enumeration list is updated while checking the neighbour lists node by node. After detecting neighbours of a node, the labelling of the tested node and its neighbours from the graph enumeration list are stored. The minimum value of this set is given as the smallest index of the neighbourhood. This smallest index value is used for updating the list of graph enumeration. The test is terminated once all the nodes have the same label or all the nodes are re-labelled. For example, the connectivity test is performed for an isomer of C 6 H 6 represented by the adjacency matrix A (Fig. 7a) with its initial node enumeration (labels) {1, 2, 3, 4, 5, 6} ( Table 1).
The matrix A (Fig. 7a) is connected since the smallest node label for each tested node is 1 and its last node enumeration list includes only 1s. Thus there is only one component whose smallest index is 1 (Fig. 7b). For a disconnected chemical graph represented by the adjacency matrix B (Fig. 8a) with its initial node enumeration (labels) {1, 2, 3, 4, 5, 6}.  Table 1 The connectivity test for an isomer of C 6 H 6 represented by matrix A (Fig. 7a)  The matrix B represents a disconnected isomer of C 6 H 6 . This molecule has two components ( Fig. 8b) with the indices ς 1 = {1, 2, 5} and ς 2 = {3, 4, 6} . The first component ς 1 is the first component with respect to its atom labelling. Here, components are compared with respect to their maximum index.

Learning from connectivity test
Similar to "Learning from canonical test", there is still something to learn from the connectivity test if a molecule is not connected. In MAYGEN, the connectivity test is performed when a canonical matrix is complete. If a molecule is not connected, it is not stored in the output file and its first component needs to be detected. For example, the matrix B with Table 2, its first component is ς 1 = {1, 2, 5} . The maximum index of the first component identifies where the graph gets disconnected.
In Algorithm 1, when a matrix is complete and stored in the output file, the generation process continues with the backward function. Here, the last index of the matrix is used as the input. However, with the "learning from connectivity test", the algorithm continues with the last entry of the first component. For example, in matrix B, the first component is ς 1 = {1, 2, 5} and the maximum index is 5. Thus, the graph gets disconnected after the last entry of the fifth row, B [5,6] entry of the matrix B. All the other modifications on the matrix between its last entry [6,6] and [5,6] build only disconnected graphs. That is why the matrix modification process continues with the last entry of the first component. Learning from the connectivity test reduces the construction of disconnected graphs.

Results
MAYGEN is written purely in Java and hosted on GitHub (see section Availability). The full source code, as well as pre-compiled binaries, are available for download. The code can be executed as follows: Fig. 8 a The adjacency matrix of an isomer of C 6 H 6 . b A disconnected molecule with formula C 6 H 6 Table 2 The connectivity test for an isomer of C 6 H 6 represented by matrix B (Fig. 8a)  Generates molecular structures for a given molecular formula. The input is a molecular formula string, e.g. 'C 2 OH 4 ' . Besides this formula, the directory is needed to be specified for the output file.  In order to generate constitutional isomers, the user needs to pass a molecular formula with the -f option:

Node index Neighbors Former label Minimum label
> java −j a r MAYGEN. j a r −f C10H16 MAYGEN i s g e n e r a t i n g i s o m e r s of C10H16 . . . The number of st r u c t u r e s is : 24938 Time : 1 . 5 9 0 s econds Alternatively, users who either want to contribute to the development or use the latest source code can clone the GitHub repository and build the MAYGEN binary using the Maven build environment.
For the purpose of this publication, MAYGEN was tested with randomly selected molecular formulae. The run times of MAYGEN, MOLGEN and PMG are compared in Table 4. The computational experiments were performed in single-threaded mode and without storing structures in an output file. PMG was tested against OMG and confirmed that even in single-threaded mode, PMG is faster. We used the latest version of Molgen, V 5, to be able to benchmark against larger numbers of molecular formulae. Molgen 3.5, which is faster than Molgen 5, is only available as a Windows GUI application and, to the best of our knowledge, cannot be run in batch mode. Furthermore, we do not own Windows license of Molgen 3.5. We did, however, manually run 10 formulae of the test version of Molgen 3.5 against the test version of Molgen 5 on the same Windows machine ( Table 3).
The comparison showed that Molgen 3.5 is about four times faster than Molgen 5 on average for these 10 tests. Different from MOLGEN 5.0 [14], PMG generates structures for additional valences of sulfur (S), phosphorus (P) and nitrogen (N) and therefore more molecules than MOLGEN or MAYGEN [15]. MOLGEN 5.0 uses the default lowest valences for N(3), S(2), and P(3), unless a user defines the higher valences. For all the results given in Table 4, MAYGEN generated the same number of structures as Molgen 5.0. Molgen has an aromaticity filter that filters out resonance structures of substituted aromatic molecules. This filter was deactivated with the -noaromaticity flag to achieve comparability. Since halogens are not defined in PMG, it does not generate structures with molecular formulae including Cl, F, Br, I. For most structures containing all allowed elements, MOLGEN was slightly faster than MAYGEN and much faster than PMG (Figs. 9, 10); for carbohydrates and those containing additional oxygen, MAYGEN's execution speed was comparable to that of MOLGEN. Since PMG does not generate structures for formulae with halogens, "N/A" is added to the result table. "> 24 h" is added to the result for the formulae for which PMG took longer than a day. These results are visualized with spaces in the plots (Figs. 9, 10).

Limitations
MAYGEN is currently restricted to generate molecules with the lowest valence states of nitrogen, phosphorus and sulfur, and all testing and benchmarking was done under this boundary condition. This is no principle restriction-the algorithm will work with any given valence state-but the workflow logic of MAYGEN needs to be adapted to compute structures for higher valences of these elements.

Future work
Being implemented in pure Java and with its code completely open, MAYGEN can be easily extended with additional functionalities and algorithmic improvements. The code availability through GitHub invites the scientific community to contribute to the further developments of MAYGEN. Obvious future work includes performance enhancements and the parallelization of the algorithm. Future implementations of MAYGEN will be parallelised. The lowest hanging fruit will be exploiting the builtin parallelism in the Java VM using multiple available cores. Here, trivial parallelism can be used by computing the isomers of different hydrogen distributions simultaneously. With 8 cores in the CPU of the senior author's laptop and 18 cores in individual CPUs on our local compute cluster, significant speed gains can be achieved through this simple measure. The examples in our results Table 4 have between 2 and 74 hydrogen partitions, which yields plenty of space for further speed gains. Trivial parallelism can be pushed further by recent cloud orchestration schemes where containers can be seamlessly launched in large clouds, for example using the Google Container Engine. Here, the number of parallel computations X can be matched to fit the number of partitions precisely, leading to an approximate speed gain of X, ignoring the container provisioning and result collection. More elaborate non-trivial parallelisation schemes will be needed to push the boundary of computing with more heavy atoms in each molecular formula beyond the current 15-20 atom limit. The exponential explosion of the number of isomers in this region, will only allow for very moderate advances though. We also aim to integrate MAYGEN into the Chemistry Development Kit (CDK) [19] in the near future which will enable an easy integration of the molecular structure generator in other software programmatically. Furthermore, it is desirable that MAYGEN can use substructures in its input as building blocks, in order to include them as badlists or goodlists into the generation and therefore reduce the number of candidate structures to generate. This will enable its use in systems for computer-assisted structure elucidation (CASE) whose aim is to elucidate chemical structures from NMR and mass spectral data.

Conclusion
In this manuscript we presented MAYGEN, an opensource constitutional isomer generator completely written in Java. MAYGEN generates constitutional isomer Fig. 9 Times for structure generation runs with MOLGEN 5.0, MAYGEN and PMG for molecular formulae containing all allowed elements (carbon, hydrogen, oxygen, nitrogen, phosphorus, sulfur and halogens. The total run times (s) are plotted. For a fairer comparison, Fig. 10 shows the per-molecule run times Fig. 10 Times for structure generation runs with MOLGEN 5.0, MAYGEN and PMG for molecular formulae containing all allowed elements (carbon, hydrogen, oxygen, nitrogen, phosphorus, sulfur and halogens. Since PMG generates additional structures with higher oxidation states for N, S and P the run times (ms) for the construction of per molecule are plotted spaces exhaustively and avoids isomorphic structures during the generation using the principles of orderly canonical graph generation. We presented extensive testing of MAYGEN against two alternative solutions: MAY-GEN outperforms the current best open source structure generator PMG by orders of magnitude, on average 47 times faster, and is only marginally slower, on average three times, than the fastest current state-of-the-art software MOLGEN. We expect MAYGEN to be a starting point for further developments in the area of chemical structure generation by the open source, open science community.