This section consists of five parts. First, it provides an overview of the basic principles underlying (molecular) graph theory. Second, it describes the enumeration algorithm for constitutional isomers developed by Grund. Third, it briefly explains the concept of canonical SMILES strings. Fourth, it describes the procedure developed here for the enumeration of stereoisomers. And fifth, it illustrates the implementation and practical features of the program *enu*.

### Graph theory

The isomer enumerator relies on graph theory. This discipline goes back to the first half of the 18th century when the Swiss mathematician Leonhard Euler published his famous article on the *Problem of the Königsberg Bridges* [22]. Since then, graph theory has become increasingly relevant, with applications in fields such as social sciences, economics, electrical and industrial engineering, as well as all branches of the natural sciences, namely physics, chemistry, and biology [23].

#### Molecular graphs

A molecular graph is a connected labeled multigraph, i.e., a graph in which there exists a path from each node to every other node, the nodes are labeled, and there can be multiple edges between two nodes. The vertices represent atoms and the edges account for covalent bonds between the atoms [24]. The graph describes the topology of a molecule, but does not provide any information on its geometry.

A molecular graph can be described by the combination of a *label vector* \(\pmb \alpha\), a *valence vector* \(\pmb \delta\), a *partition vector* \(\pmb \lambda\), and a symmetric *adjacency matrix* \(\pmb A\in {\mathbb {N}}_0^{+ \, N\times N}\) [9]. An example is provided in Fig. 1 and the terminology is explained in more detail in Additional file 1: Sec. S1.1. The combination of the label vector (atom-type names) and partition vector (number of atoms of a given type) provides the molecular formula. The valence vector contains the fixed valences of the atom types listed in the label vector. A matrix element \(A_{i,j}\) of \(\pmb A\) describes the order of the bond possibly connecting the atom at position *i* to the atom at position *j* in the atom vector (or is set to zero in the absence of a bond).

For a given choice of \(\pmb \alpha\), \(\pmb \delta\), and \(\pmb \lambda\) (i.e., of a molecular formula and of atom-type valences), the specification of an adjacency matrix \(\pmb A\) (i.e., of a covalent connectivity between the atoms) defines a unique labeled molecular graph. However, since the atoms of a common type in a molecule are physically indistinguishable, two labeled graphs that are directly related by a permutation in the indices of these atoms actually describe the same molecule (merely with a different atom numbering). In other words, for a given choice of \(\pmb \alpha\), \(\pmb \delta\), and \(\pmb \lambda\), the same molecule can generally be represented by many different adjacency matrices \(\pmb A\). This observation is fundamentally important to the problem of isomer enumeration. It is known as (molecular) graph isomorphism, and explained in more detail in Additional file 1: Sec. S1.2.

*Adjacency Matrix Canonicity* In order to have a unique representation of a molecular topology in the form of a labeled multigraph, a *lexicographical ordering* can be used as canonicity criterion for the adjacency matrix. An adjacency matrix \(\pmb A\) is lexicographically larger than an adjacency matrix \(\pmb A'\) (noted \(\pmb A> \pmb A'\)) provided that [9]

$$\begin{aligned} \exists i_0, j_0: \, \left( A_{i,j} = A'_{i,j} \, \, \, \forall \, (i,j) < (i_0, j_0)\right) \, \, \, \wedge \, \, \, \left( A_{i_0,j_0} > A'_{i_0, j_0}\right) \, , \end{aligned}$$

(1)

with the definition [9]

$$\begin{aligned} (i,j)< (k,l) \, \, \, \Leftrightarrow \, \, \, (i< k) \, \, \vee \, \, (i=k \, \wedge \, j < l) \, . \end{aligned}$$

(2)

In plain words, when the two matrices are read row-by-row from the top left to the bottom right, the first difference encountered determines the lexicographical ordering. The canonical adjacency matrix used to represent a molecule is then defined as the lexicographically largest among all possible adjacency matrices, which in turn defines a canonical labeling of the atoms in the molecular graph. Note that for a unique representation of molecules, the canonicity criterion for \(\pmb A\) must be accompanied by a canonicity criterion for the ordering of the atom types in the vector \(\pmb \alpha\). More details on canonicity can be found in Additional file 1: Sec. S1.2.

### Enumeration of constitutional isomers

The following sections present the algorithm that is used in *enu*. It is based on the PhD thesis of R. Grund [9], which proposes a solution to both the problem of finding all possible adjacency matrices as well as testing these matrices for canonicity.

Enumerating all the unique constitutional isomers of a given molecular formula amounts to finding all the canonical adjacency matrices associated with this formula. This could be achieved in a brute-force way by exhaustively enumerating all possible adjacency matrices compatible with the given molecular formula and the fixed valences of the different atom types, and filtering out those that are not canonical. The algorithm implemented in *enu* is based on this principle, but relies on an effective pruning mechanism that drastically limits the number of adjacency matrices to be generated and tested for canonicity, leading to far superior performance compared with a brute-force approach. A comparison between the optimized code and a brute-force approach is provided in Additional file 1: Sec. S1.7, see Table S1 and Figure S5.

#### Orderly enumeration

The first task to be performed is the systematic construction of possible adjacency matrices for a given choice of the vectors \(\pmb \alpha\), \(\pmb \delta\), and \(\pmb \lambda\). The algorithm of Grund [9] proceeds by creating these matrices in lexicographically decreasing order (Fig. 2). Note that this principle of orderly generation was proposed earlier by Read and Faradzev [25,26,27]. Since the adjacency matrix is symmetric and has only zeros along its diagonal, the algorithm only needs to find valid entries for the upper triangle of the matrix. Starting with an empty adjacency matrix, it proceeds through the matrix from the top left element to the bottom right one (with the line number as a primary index and, within each line, the column number as a secondary index), and fills it using two main subroutines. This filling order is particularly convenient, as it matches that in which the elements are checked to determine if a matrix is lexicographically larger or smaller than another one. In the *forward step*, the current matrix position is incremented and the maximum possible entry for the new position is determined based on the specified atom valences and the bonds already listed in the matrix. If a compatible value is found, the matrix element at the current position is filled, and another forward step is called. Otherwise, the matrix is not amenable to completion and a *backward step* is performed. The current matrix position is decremented and it is checked whether the matrix element at the new position can be decreased by one. If this is possible, the algorithm continues with a forward step. Otherwise it continues with another backward step. The two routines are outlined in the Additional file 1: Sec. S1.3.

#### Connectivity test

While the filling algorithm generates all possible adjacency matrices compatible with the valences of the different atoms in decreasing lexicographical order, it does not guarantee that these adjacency matrices describe a connected graph [9]. Therefore, a potential adjacency matrix has to be tested for connectivity to ensure that it is a viable isomer of the given molecular formula (rather than a collection of two or more molecules). The connectivity test implemented in *enu* is an adapted depth-first search [28] of the graph, as described in Additional file 1: Sec. S1.4. The algorithm uses a last-in-first-out stack to go through the connected parts of the graph, and marks the vertices it encounters as visited. If all vertices have been visited once the stack is empty, the graph is connected.

#### Canonicity test

The most important (and most difficult) part of the enumeration process is the testing of the generated adjacency matrices for canonicity, i.e., assessing whether there is no isomorphic adjacency matrix that is lexicographically larger. The routines to perform this canonicity test are described in Additional file 1: Sec. S1.5.

#### Special treatment of hydrogen atoms

Hydrogen atoms typically represent the most numerous atoms in a molecule. Therefore, it is advantageous to rely on a special treatment for this atom type during the enumeration process [9]. To that end, following the suggestion of Grund [9], the hydrogen atoms are associated to the heavy atom bearing them in a pre-processing step, thereby producing united-atoms with accordingly decreased valences. These are used in the filling algorithm. In this way, the hydrogen atoms are “implicit” during the orderly enumeration, resulting in a significant speed gain. This is described in more detail in Additional file 1: Sec. S1.6.

### Canonical SMILES

The *enu* program enumerates isomers based on lexicographically canonical adjacency matrices. However, for convenience, it reports the generated isomers as canonical SMILES strings [17]. The generation of a unique SMILES representation requires the specification of a canonical atom ordering [20, 21]. Over the past decades, various algorithms have been developed to achieve such a canonicalization [20, 21, 29,30,31,32]. While the matrix canonicity criterion of *enu* already leads to a canonical ordering of the atoms, this order is not necessarily suitable to create elegant (i.e., easily readable) SMILES strings. Thus, once a new canonical adjacency matrix is found, a different canonicalization algorithm is used as a post-processing step in *enu* to create the corresponding canonical SMILES strings. It is based on a combination of the schemes proposed by Weininger *et al.* in 1989 [20] and by Schneider *et al.* in 2015 [21]. The applied algorithm is explained in more detail in Additional file 1: Sec. S2. Note that during both the enumeration of the adjacency matrices and the generation of canonical SMILES strings, the hydrogen atoms are treated implicitly.

### Enumeration of stereoisomers

When going from a two- to a three-dimensional representation of a molecule, constitutionally identical molecules can have a different spatial arrangement of their atoms, leading to *stereoisomerism* [24]. In *enu*, the enumeration of the stereoisomers associated with a given constitutional isomer is performed as a postprocessing step to the constitutional-isomer generation. Two kinds of stereoisomerism are considered: (i) chirality is considered for tetravalent atoms that have four singly-bonded neighbours, and (ii) cis/trans stereoisomerism is considered for double bonds connecting two tetravalent atoms that have two singly-bonded neighbors (in addition to the doubly-bonded one) and that are not part of a cycle. Currently, handling of the stereochemistry for possible centers of valence higher than four is not implemented. In addition, double bonds within cycles or cummulene systems are at present not considered in the search of cis/trans stereocenters. Note that the points discussed in the following sections are general observations about stereocenters, and that similar ideas were also developed independently earlier [33].

In practice, the tetravalent atoms are typically carbon atoms. The neighbors (substituents) can differ either in their constitution, i.e., different atom types or connectivities, or in their spatial arrangements, i.e., different stereo configurations. Stereocenters with neighbors differing in their constitutions are called *true stereocenters*, while stereocenters with neighbors differing only in their stereo configurations are called *para stereocenters* [33]. The considered types of stereocenters (i.e., tetrahedral and cis/trans) as well as the distinction between true and para stereocenters is illustrated in Fig. 3.

Here, stereoisomers will refer to the isomers of a given constitutional isomer corresponding to different spatial arrangements around tetrahedral centers and double bonds. The entire collection of all stereoisomers for all constitutional isomers of a given formula will be referred to as the spatial isomers of the molecule. The *enu* program thus enumerates all constitutional and spatial isomers of a chemical formula.

The procedure to enumerate stereoisomers consists of two steps. In a first step, all unique true stereoisomers are determined. In a second step, the para stereoisomers are generated. This division is necessary because the para stereocenters may be active or not depending on the stereo configuration of the other stereocenters in the molecule. The enumerator uses canonical SMILES strings to represent the enumerated stereoisomers. These strings describe the *local* stereo configuration of the stereocenters in the molecule, which implies that the specified configuration depends on the order in which the atoms appear in the string [17].

#### Finding the true stereoisomers of a molecule

The problem of finding all true stereoisomers of a molecule involves two sequential tasks: (i) identifying the true stereocenters, and (ii) generating all unique true stereoisomers that arise from these stereocenters.

##### Recognizing true stereocenters

The procedure to detect true *tetrahedral* stereocenters is shown in Fig. 4. The process consists of checking all atoms with four neighbors (or three neighbors and one implicitly connected hydrogen). If the four first-neighbor atoms are all different, a true tetrahedral stereocenter is directly detected. If at least two first neighbors are identical and have valence one, the atom cannot be a true stereocenter. If two or more first neighbors are identical but have a valence larger than one, the algorithm relies on the automorphism group. The automorphism group \(Aut({\textbf{A}})\) is generated as a by-product of the canonicity test in the enumeration algorithm (see Additional file 1: Sec. S1.5.5). It contains the atom index permutations which leave the adjacency matrix \(\pmb A\) unchanged. If there is at least one permutation \(\pi \in Aut({\textbf{A}})\) which leaves the potential stereocenter unchanged (i.e., does not swap it with another atom) but swaps two of its first neighbors, the atom is not a true stereocenter. If such a permutation does not exist, the potential stereocenter is a true tetrahedral stereocenter.

A cis/trans stereocenter is defined as a pair of atoms connected by a double bond. The term (cis/trans) *half-stereocenter* will be used here for these two atoms. The procedure to identify true cis/trans half-stereocenters is analogous to the one for tetrahedral stereocenters. It is outlined in Fig. 5. Here, one considers all the atoms with three neighbors (or two neighbors and one implicitly connected hydrogen atom) that are connected by exactly one double bond to another atom. If the two singly-bonded first-neighbor atoms are identical and have valence one, the considered atom is not a true half-stereocenter. If the two singly-bonded first neighbors are different or if they are identical but there is no permutation \(\pi \in Aut({\textbf{A}})\) that leaves the considered atom identical while swapping the two singly-bonded first neighbors, the considered atom is a potential half-stereocenter. All potential half-stereocenters which are connected by a double bond to another potential half-stereocenter are true half-stereocenters.

#### Generating all unique true stereoisomers

Once the \(n_{tet}\) true tetrahedral stereocenters and the \(n_{ct}\) true cis/trans stereocenters of a molecule have been identified, it is straightforward to enumerate the SMILES strings corresponding to the associated \(2^{n_{tet} + n_{ct}}\) true stereoisomers. A binary *configuration vector* of size \(n_{tet} + n_{ct}\) is used to this purpose. For the tetrahedral centers, a 0 specifies a clockwise direction (encoded by a ‘@@’ in the SMILES string), and a 1 specifies a counter-clockwise direction (encoded by a ‘@’ in the SMILES string). For double bonds, a 0 specifies a trans configuration (encoded either by ‘/’ and ‘/’ or by ‘\’ and ‘\’ in the SMILES string) and a 1 specifies a cis configuration (encoded either by ‘/’ and ‘\’ or by ‘\’ and ‘/’ in the SMILES string). The vector is initially filled with zeros, and binary counting is then used to find all possible configurations of the true stereocenters. However, not all stereoisomers constructed using this approach are unique, as can be seen by considering the examples in Fig. 6.

To make sure that only unique stereoisomers are reported, the automorphism group \(Aut(\pmb A)\) is used to filter the results of the binary counting procedure. Here, the convention is used that the smallest equivalent stereochemical configuration vector is reported as the canonical one. For each configuration vector, it has to be determined whether the current stereoisomer is equivalent to one of the previously generated stereoisomers, i.e., one with a lexicographically smaller configuration vector. For all permutations \(\pi \in Aut(\pmb A)\), the true stereocenters are checked. By definition, two stereocenters can only be swapped by a permutation of the automorphism group if they are configurationally indistinguishable. If such a swap occurs in a given permutation, the corresponding encoding of 0 or 1 in the configuration vector has to be changed accordingly. If the resulting configuration vector is smaller than the original one, the stereoisomer has already been encountered, and is not counted again. When using a notation that specifies absolute stereo configurations, the step of adapting the encoding in the configuration vector is trivial, as the configurations of the two swapped stereocenters are simply swapped as well (Fig. 7). However, since SMILES strings only specify the stereo configuration locally, the step of adapting the encoding in the configuration vector is not as trivial.

The local stereo configuration depends on the order in which the first neighbors of a stereocenter appear in the SMILES string, as can be seen in Fig. 8. If a tetrahedral stereocenter is connected to four neighboring atoms (or three atoms and one implicit hydrogen), there are \(4!=24\) possible orders (or \(3!=6\) in the case of a center with an implicit hydrogen). When a permutation \(\pi \in Aut(\pmb A)\) is applied to a stereoisomer, it is possible that a stereocenter is permuted with a different stereocenter and the configuration vector has to be adapted accordingly. For this, the order of the neighboring atoms of the original stereocenter in the SMILES string has to be compared to the corresponding order after applying the permutation. If the stereocenter \(a_i\) with neighbors \(\pmb n_i = [n_{i,1},n_{i,2},n_{i,3},n_{i,4}]\) (order in the SMILES string) and configuration \(c_i \in \lbrace 0,1\rbrace\) is permuted with the stereocenter \(a_j\) with neighbors \(\pmb n_j = [n_{j,1}, n_{j,2}, n_{j,3}, n_{j,4}]\) and configuration \(c_j\in \lbrace 0,1\rbrace\), one determines for each of the neighboring atoms in \(\pmb n_i\) the neighboring atoms in \(\pmb n_j\) it is permuted with. If the new ordering is an even permutation of the old ordering, the configuration \(c_i^{perm}\) of stereocenter \(a_i\) in the new configuration vector will be encoded with the old configuration of \(a_j\), i.e., \(c_i^{perm} = c_j\). Conversely, if the new ordering is an odd permutation, the configuration \(c_i^{perm}\) of \(a_i\) will be encoded with the opposite of the old configuration of \(a_j\), i.e., \(c_i^{perm} = \lnot c_j\). Once the permuted configuration vector is generated, it can be decided whether the current stereoisomer is equivalent to a stereoisomer with a smaller configuration vector. If it is the case, the stereoisomer is not counted again. An example how non-unique stereoisomers are detected is shown in Fig. 9.

Unlike a tetrahedral stereocenter, a cis/trans half-stereocenter is only connected to three neighboring atoms (or two atoms and one implicit hydrogen). The configuration of a cis/trans stereocenter is determined by the value of 0 or 1 in the stereochemical configuration vector, as well as by which of the two singly-bonded neighbors are considered for the directionality. Here, the convention is used that the directionality is always specified considering the two singly-bonded neighbors that are encountered first in the SMILES string. If two atoms \(a_{i,1}\) and \(a_{i,2}\) forming a cis/trans stereocenter \(s_i\) with configuration \(c_i\in \lbrace 0, 1\rbrace\) (where the singly-bonded neighbors considered for the directionality are the atoms \(n_{i,1}\) and \(n_{i,2}\)) are swapped with two other atoms \(a_{j,1}\) and \(a_{j,2}\) forming a cis/trans stereocenter \(s_j\) with configuration \(c_j\) (where the singly-bonded neighbors considered for the directionality are the atoms \(n_{j,1}\) and \(n_{j,2}\)), the new configuration of \(s_i\) is determined as follows. If \(n_{i,1}\) is swapped with \(n_{j,1}\) and \(n_{i,2}\) is swapped with \(n_{j,2}\), this corresponds to an even permutation, and the configuration of \(s_i\) in the new configuration vector will be equal to the configuration of \(s_j\), i.e., \(c_i^{perm} = c_j\). If \(n_{i,1}\) is not swapped with \(n_{j,1}\), but \(n_{i,2}\) is swapped with \(n_{j,2}\) (or vice versa) this means that, due to the ordering of the atoms in the SMILES string, a different pair of singly-bonded first neighbors is considered for the directionality of \(s_i\) than for the directionality of \(s_j\), and the configuration of \(s_i\) in the new vector becomes the opposite of the configuration of \(s_j\), i.e., \(c_i^{perm} = \lnot c_j\). If neither \(n_{i,1}\) is swapped with \(n_{j,1}\) nor \(n_{i,2}\) with \(n_{j,2}\), it means that the opposite pair of first neighbors is considered for the directionality of the configuration of \(s_i\) than it was for \(s_j\), and the configuration of \(s_i\) in the new configuration vector is the same as the configuration of \(s_j\) in the old one, i.e., \(c_i^{perm} = c_j\).

#### Recognizing para stereocenters

Once the true stereocenters have been found and all unique stereoisomers stemming from these centers have been enumerated, the next step is to complete the list of stereoisomers by adding the para stereoisomers. In the following, the term *potential para stereocenter* is used to denote a center that can possibly be a para stereocenter. Whether this possibility is realized depends on the actual configuration of the true stereocenters (and of the other potential para stereocenters) in the molecule.

An atom is a potential tetrahedral para stereocenter if it was omitted from the list of true stereocenters in the third test of the algorithm in Fig. 4. This corresponds to the situation where at least one permutation \(\pi \in Aut(\pmb A)\) leaves the center unchanged while swapping two of its first neighbors. This indicates that the substituents of the center starting with these first neighbors are *constitutionally* identical, but may potentially be *stereochemically* distinct if they encompass true or other para stereocenters in specific configurations.

The same logic applies to potential cis/trans para half-stereocenters. If an atom was discarded in the fourth or fifth test of the algorithm in Fig. 5 it might be a potential cis/trans para half-stereocenter. This is the case if the atom has two identical singly-bonded first neighbors which are swapped by at least one permutation \(\pi \in Aut(\pmb A)\), indicating that the two neighboring substituents are *constitutionally* identical, but may be *stereochemically* distinct. The actual para cis/trans stereocenter is defined as a pair of half-stereocenters, so it still has to be checked whether a potential cis/trans para half-stereocenter is connected by a double bond to another potential para cis/trans half-stereocenter.

An additional criterion that can be used to reduce the number of potential para stereocenters to be tested is that a para stereocenter lies “in the middle” of at least one pair (or, possibly, multiple pairs) of configurationally symmetrical true stereocenters. Here, symmetrical means that there is at least one permutation \(\pi \in Aut(\pmb A)\) which swaps the two true stereocenters. If two symmetrical true stereocenters are not part of a cycle, a simple shortest-path algorithm can be used to determine which atom lies in the middle of the two. If an atom is part of one or more cycles, there is no simple shortest-path algorithm to check all potential paths between two true stereocenters, and this filtering cannot be employed. This is illustrated in Fig. 10. Note also that within a cycle, there can also exist para stereocenters that do not depend on true stereocenters, but only on other para stereocenters, e.g *cis*-1,4-dimethylcyclohexane and *trans*-1,4-dimethylcyclohexane. Such para stereocenters are currently not yet considered in *enu*.

To summarize, the potential para stereocenters that are retained consist of the ones that were discarded as true stereocenters and lie either in a cycle or in the middle of the shortest path between at least one pair of symmetrical true stereocenters. If there are no potential para stereocenters in a molecule, the list of stereoisomers is already complete after considering the true stereoisomers. Otherwise, the list is processed anew to generate the para stereoisomers.

#### Generating all unique para stereoisomers

The process to generate all unique para stereoisomers has to be performed for each true stereoisomer. Whether a potential para stereocenter actually *is* a stereocenter can only be determined once the stereo configuration in a true stereoisomer is specified.

The algorithm to determine all unique para stereoisomers of a molecule goes as follows. For each true stereoisomer, the automorphism group \(Aut^{true}(\pmb A)\subseteq Aut(\pmb A)\) is considered, which contains the subset of permutations in the automorphism group that only swap true stereocenters if they have the same absolute stereo configuration. Next, a para configuration vector is created and binary counting is used to find all possible para stereoisomers. For each of these, it first has to be determined which of the potential para stereocenters are actually stereocenters in the current configuration. Then it has to be determined whether this stereoisomer has already been found before, i.e., with a smaller para configuration vector.

In order to determine which potential para stereocenters are actually stereocenters, the automorphism group \(Aut^{para}(\pmb A)\subseteq Aut^{true}(\pmb A)\) is considered, which contains the permutations of \(Aut^{true}(\pmb A)\) that only swap para stereocenters if they have the same absolute stereo configuration in the current para stereoisomer (i.e., the true stereoisomer with the para stereo configuration specified by the current para configuration vector). All potential para stereocenters for which there exists a permutation \(\pi \in Aut^{para}(\pmb A)\) that leaves the current para stereocenter identical but swaps at least two of its immediate neighbors are not stereocenters in the current para stereoisomer. This is indicated by setting the corresponding entry in the para configuration vector to -1 (technically making the vector ternary instead of binary). These steps are repeated as long as at least one para stereocenter is determined to be “not active” in the current configuration.

In order to check whether para stereocenters have the same absolute configuration, the same logic can be used as for true stereocenters. Two symmetrical tetrahedral para stereocenters have the same absolute configuration if (i) the order of the first neighbors of the first stereocenter in the SMILES string is an even permutation of the order of the first neighbors of the second stereocenter in the string and they have the same encoding in the para configuration vector, or (ii) the order of the first neighbors is an odd permutation and they have the opposite encoding in the para configuration vector. Similarly, two symmetrical cis/trans para stereocenters have the same absolute configuration if the two singly-bonded neighbors (and the connected substituents) encountered first in the SMILES string (i) are both the same for the two stereocenters and the stereocenters have the same encoding in the para configuration vector, (ii) are both not the same for the two stereocenters and the stereocenters have the same encoding in the para configuration vector, or (iii) if only one of the singly-bonded neighbors of the first stereocenter is the same as the singly-bonded neighbors of the second stereocenter and the stereocenters have the opposite encoding in the para configuration vector.

The process to eliminate equivalent but smaller para configurations is the same as for true stereocenters. For the lexicographical comparison, the values of -1 in the para configuration vector are treated as 0. Fig. 11 shows the stereoisomers of a small molecule that has four true stereocenters and three potential para stereocenters.

#### Creating SMILES strings for stereoisomers

Once the list of stereoisomers of a constitutional isomer is available as a list of configuration vectors for the true and para stereocenters, this information can be included into the corresponding SMILES strings. For the tetrahedral stereocenters, the encoding is straightforward. In the SMILES string, the element symbol of the stereocenter is enclosed by rectangular brackets, and either ‘@@’ or ‘@’ is added if the corresponding element in the configuration vector has the value 0 or 1, respectively. Additionally, if the stereocenter is connected to an implicit hydrogen atom, a ‘H’ is added before the closing rectangular bracket.

The handling of cis/trans stereocenters is slightly more complicated. In the general case, one simply adds ‘/’ before the half-stereocenter that is visited first in the SMILES string, and ‘/’ or ‘\’ before the first visited singly-bonded neighbor of the second half-stereocenter if the corresponding element in the configuration vector has the value 0 or 1, respectively. However, the situation is more complicated if a half-stereocenter is also the singly-bonded neighbor of another half-stereocenter (Fig. 12). A simple way to handle this situation is to go through the cis/trans stereocenters in the order in which the half-stereocenters are visited in the SMILES string, checking if the first half-stereocenter already contains an encoding, and then adapting the encoding of the first visited singly-bonded neighbor of the second half-stereocenter accordingly (i.e., cis or trans).

In the XML output, the number of tetrahedral and cis/trans stereocenters is reported for each stereoisomer. Additionally, for molecules with at least one tetrahedral stereocenter, the enantiomer of each stereoisomer is reported (if one exists).

### Implementation details

The functionalities described in the previous sections are implemented in a C++ program called the *isomer enumerator*, in short *enu*. The current version of the program can be found on GitHub at https://github.com/csms-ethz/CombiFF. This repository also contains input and output files, as well as other programs related to the CombiFF scheme [15, 16]. Running enu -help will print an overview of the input options to the standard output. A user manual for *enu* is also provided in https://github.com/csms-ethz/CombiFF/blob/main/doc/enu.pdf.

The following sections provide an overview over some of the functionalities implemented into the enumerator.

#### Specifying a molecular formula

There is some flexibility to define how many times an element type should occur in the molecular formula. For each element type, this number can be given as a single integer (e.g. H5), as a list of integers (e.g. H[0,2,4,5]), or as a range (e.g. H[0-5]).

#### Implicit hydrogen atoms

As described previously, hydrogen atoms are treated implicitly (i.e., distributed among the other atom types before the enumeration algorithm starts). In order to specify the types of molecules of interest more distinctly, the implicit hydrogen atoms can also be specified directly in the chemical formula. For example, {CH1}1{CH2}2{OH1}3 will be translated to the formula \(\hbox {C}_{3}\hbox {H}_{8}\hbox {O}_{3}\) with the restriction that it includes one carbon atom bonded to exactly one hydrogen atom, two carbon atoms bonded each to exactly two hydrogen atoms, and three oxygen atoms bonded to exactly one hydrogen atom. This will generate the two constitutional isomers OCC(O)CO and OCCC(O)O (no stereoisomers), whereas there exist 36 constitutional and spatial isomers for the unrestricted formula \(\hbox {C}_{3}\hbox {H}_{8}\hbox {O}_{3}\).

#### Filtering for properties

Currently, the user may restrict the following molecular properties: The maximum bond order, the number of unsaturations (summing one for double bonds and cycles, and two for triple bonds), the total number of bonds (each bond counted once, irrespective if single or multiple), the number of single bonds, the number of double bonds, the number of triple bonds, the number of quadruple bonds, as well as the number of cycles in the molecule. These restrictions can be set either as an integer, as a list of integers, or as a range of integers.

The filtering for the number of unsaturations is performed before the enumeration starts, by calculating the number of unsaturations for a molecular formula using Eq. S16 (see Additional file 1). The filtering for the other properties is done whenever a new isomer is found. If the restrictions are not met, the isomer is not reported.

For example, enumerating all straight-chain alkane isomers \(\hbox {C}_{\hbox {n}}\hbox {H}_{2n+2}\) from \(\hbox {C}_{1}\hbox {H}_{4}\) to \(\hbox {C}_{20}\hbox {H}_{42}\) could be achieved with the formula specification C[1-20]H[4-42] and the restriction that the number of unsaturations should be zero.

#### Aromaticity

Currently, there is only a very basic implementation to recognize aromatic rings of size six using a substructure search for alternating single and double bonds. This procedure is able to recognize structures like benzene and pyridine, which is sufficient to eliminate duplicate isomers where the ordering of the single and double bonds is different. However, this functionality is still very limited. In the future, it will be extended to recognize aromatic rings during the enumeration procedure. In the meantime, it is possible to post-process the *enu* output using a suitable cheminformatics library such as the *RDKit* [34] to recognize aromaticity for more complicated cases. Thanks to the convenient XML output format and the SMILES notation, such a post-processing is easy to implement.

#### Visualization

To visualize the output of *enu*, a small Python script is provided in the GitHub repository (Fig. 13). It uses the Python3 [35] *xml.etree.ElementTree* module to parse the XML list of constitutional and spatial (if present) isomers, the Python library *pdfrw* [36] to concatenate PDFs, as well as the *RDKit* [34] cheminformatics library to create the visualizations (for an example, see Fig. 13).

#### Family definitions

The most straightforward way to use *enu* is *via* the command line by specifying a chemical formula (potentially including atoms with implicit hydrogens and filtering criteria, as described in the previous sections). However, one can also define a so-called *family* which offers more flexibility, i.e., the use of element aliases, of a filtering for substructures, and of pseudoatoms. A brief overview is given in the next paragraphs.

##### Element aliases

In order to provide more flexibility in the definition of the chemical formulas for which the isomers are to be enumerated, it is possible to define so called *element aliases*. An element alias has a name and contains a set of element types. For example, an element alias for the halogen element types could be called “Hal” and contain the four types “Br”, “Cl”, “F” and “I”. When two or more of the same element alias occur in a chemical formula, AND, XOR, and OR logic can be used to specify if they should be of the same element type, of a different element type, or whether both is allowed. For example, the notation Hal3 (AND) specifies that the halogen atoms have to be the same type (*e.g.*, Br3). \( {\hat{\,}}\) Hal1\( {\hat{\,}}\) Hal2 (XOR) specifies that the first halogen atom has to be of a different type than the two other halogen atoms (e.g., Br1Cl2). Finally, Hal1Hal1Hal1 (OR) specifies that any combination of three halogen atoms is allowed (e.g*.*, Br1Cl1F1, Br3, or Br1Cl2). The formula to enumerate all straight-chain haloalkanes with ten carbon atoms and two halogen atoms of the same type can then be expressed as C10H20Hal2 and is equivalent to enumerating the four chemical formulae \(\hbox {C}_{10}\hbox {H}_{20}\hbox {Br}_2\), \(\hbox {C}_{10}\hbox {H}_{20}\hbox {Cl}_{2}\), \(\hbox {C}_{10}\hbox {H}_{20}\hbox {F}_2\) and \(\hbox {C}_{10}\hbox {H}_{20}\hbox {I}_2\).

##### Filtering for substructures

It is also possible to filter for the occurrence of substructures. The number of occurrences can be specified by a single integer, a list of numbers, or a range of numbers. By setting the occurrence to zero, one may also prevent the occurrence of a substructure. The implemented substructure search algorithm is the Ullmann algorithm [37]. As the enumerator is aimed for relatively small molecules, the performance of the Ullmann algorithm is not a bottleneck for the overall runtime. If this became an issue in the future, it could be replaced by a more modern algorithm such as VF2 [38, 39].

Here, a substructure is defined by a name, a list of atoms, and an adjacency matrix stack (i.e., the upper triangle of an adjacency matrix, written row-wise as a one-dimensional vector). Each of the atoms can be either an element type, an element type with a number of implicit hydrogens, an element alias, or a wildcard. If there is more than one element alias of the same type, it is also possible to use the AND, XOR and OR logic to specify if they should be of the same element types, of a different element type, or whether both are allowed. When element aliases are used and multiple occurrences of the substructure are requested, it is also possible to specify how the element types should occur across substructures, also using AND, XOR and OR logic.

When multiple substructures are required, the implemented convention is that there can be a maximum overlap of one atom between the matched substructures. For example, in the molecule CCCC, the substructure CC is found three times, but the substructure CCC is only found once.

##### Pseudoatoms

Applying substructure matching in the form of a post-processing step as described in the previous section can become very inefficient if a chemical formula is given with many potential isomers, where only a small subset of them contain the desired substructure(s).

Consider, for example, the formula C[2-10]H[2-18]O4, where the number of unsaturations in the molecule is set to two. In total, there exist more than 560 million constitutional and spatial isomers. However, if one requires exactly two occurrences of the substructure COC(=O)H, only 1484 of these isomers contain the desired substructures. The enumeration of these isomers takes about 12 minutes on a laptop with an i7-8565U CPU. A major part of the computation time is thus wasted on constitutional isomers which are not reported. Note that there is no time wasted on enumerating redundant stereoisomers, as the stereoisomers are only generated for the constitutional isomers that are compatible with the given restrictions.

The number of redundant constitutional isomers can be reduced by using implicit hydrogens specifying the formula C[0-8]{CH1}2H[2-18]{OH0}4, such that isomers containing e.g. an oxygen–hydrogen bond are not generated during the enumeration. With this more specific formula, the enumeration time reduces to about 1 min. However, for the larger example C[4-10]H[4-16]O6, where the number of unsaturations in the molecule is set to three and it is required that there are exactly three occurrences of the substructure COC(=O)H, even with this trick of using the formula C[0-7]{CH1}3H[4-16]{OH0}6, the enumeration of the the 1328 constitutional and spatial isomers takes about 50 minutes.

To solve this problem in a more general fashion, so-called *pseudoatoms* are introduced. A pseudoatom is a molecular substructure which only contains one atom that is not fully bonded, i.e., can be connected to the atoms of the rest of the molecule. A pseudoatom is defined by a name, a list of atoms and an adjacency matrix stack. The pseudoatom behaves like a normal atom during the enumeration process. The valence of the pseudoatom corresponds to the valence of the not-fully-bonded atom minus the number of bonds that this atom forms with the other atoms within the pseudoatom. Whenever a new isomer is found, the pseudoatom is explicited in terms of its atom content, i.e., the adjacency matrix is extended and canonicalized, the SMILES string is generated, and the stereoisomers are listed (if requested). Using a pseudoatom OC(=O)H, it takes less than a second to enumerate the 1328 constitutional and spatial isomers of C[1-7]H[4-16]’OC(=O)H’3 with three occurrences of the substructure COC(=O)H, i.e., much less than the above 50 min.

A caveat of this approach is that the canonicity test does not recognize if a pseudoatom can also be constructed using the atoms available in the rest of the molecule. When this is possible, there will be duplicate isomers in the enumeration list. However, based on to the canonical SMILES strings these duplicates can easily be removed by the user in a post-processing step. Thanks to the convenient XML output format and the SMILES notation, such a post-processing is easy to implement if required.

#### Output

The enumerated isomers are written to an output file following an XML format. The corresponding *Document Type Definition* (DTD) can be found at https://github.com/csms-ethz/CombiFF/blob/main/use/output_files/isomer_enumeration.dtd. In addition to the generated output file, the program prints the current molecular formula as well as the current number of isomers during the enumeration process to the standard output (updated after every 100th detected (stereo)isomer). It is also possible to count the number of (stereo)isomers without generating the list of SMILES strings as an XML output file, by using the argument -count_only.