Smiles2Monomers: a link between chemical and biological structures for polymers

Background The monomeric composition of polymers is powerful for structure comparison and synthetic biology, among other applications. Many databases give access to the atomic structure of compounds, but the monomeric structure of polymers is often lacking. We have designed a smart algorithm, implemented in the tool Smiles2Monomers (s2m), to infer efficiently and accurately the monomeric structure of a polymer from its chemical structure. Results Our strategy is divided into two steps: first, monomers are mapped onto the atomic structure by an efficient subgraph-isomorphism algorithm; second, the best tiling is computed so that non-overlapping monomers cover the whole structure of the target polymer. The mapping is based on a Markovian index built by a dynamic programming algorithm. The index enables s2m to quickly search for all the given monomers in a target polymer. Then, a greedy algorithm combines the mapped monomers into a consistent monomeric structure. Finally, a local branch-and-cut algorithm refines the structure. We tested this method on two manually annotated databases of polymers and reconstructed the structures de novo with a sensitivity over 90 %. The average computation time per polymer is 2 s. Conclusion s2m automatically creates de novo monomeric annotations for polymers, efficiently both in computation time and in sensitivity. s2m allowed us to detect annotation errors in the tested databases and to easily derive the correct structures. Thus, s2m could be integrated into the curation process of databases of small compounds to verify the current entries and accelerate the annotation of new polymers. The full method can be downloaded or accessed via a website for peptide-like polymers at http://bioinfo.lifl.fr/norine/smiles2monomers.jsp. Electronic supplementary material The online version of this article (doi:10.1186/s13321-015-0111-5) contains supplementary material, which is available to authorized users.

Our goal is to determine, from the set of chains of a given pattern, the one that minimizes the search time. The search time depends first on the frequency of the sub-patterns in the target polymer and second on the number of edges (more generally, the arity) that must be explored to find a given pattern. Indeed, the more frequent a sub-pattern is, the more time is spent finding all its occurrences. Once a sub-pattern is found, the remaining edges starting from a specific node of this sub-pattern must be explored to continue the construction of the pattern, according to a given growing chain.
For example, in Figure 1, the node a has an arity of 3, so the 3 edges starting from a must be explored to find the next node required by the growing chain. In the case of the chain c α, the sub-pattern a−b is constructed first. Then, two remaining edges must be explored from b to find d and construct the given pattern a−b−d.
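The extension step described above can be sketched as follows. This is a minimal illustration, not the s2m implementation: the target graph, its labels, and the `extend` helper are all hypothetical.

```python
def extend(occurrences, adj, labels, from_idx, wanted):
    """Extend every occurrence of the current sub-pattern by exploring the
    remaining edges of its from_idx-th matched node, keeping only the
    neighbours that carry the label required by the growing chain."""
    extended = []
    for occ in occurrences:
        for nb in adj[occ[from_idx]]:        # explore the remaining arity
            if nb not in occ and labels[nb] == wanted:
                extended.append(occ + [nb])
    return extended

# Toy target graph: u1(a) - u2(b) - u3(d), with an extra branch u2 - u4(c).
adj = {"u1": ["u2"], "u2": ["u1", "u3", "u4"], "u3": ["u2"], "u4": ["u2"]}
labels = {"u1": "a", "u2": "b", "u3": "d", "u4": "c"}

ab = [["u1", "u2"]]                          # occurrences of the sub-pattern a-b
abd = extend(ab, adj, labels, from_idx=1, wanted="d")
print(abd)                                   # [['u1', 'u2', 'u3']]
```

Only the neighbour labelled d survives the extension, which mirrors how the two remaining edges of b are explored in the example.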

Greedy algorithm
The greedy algorithm chooses the most selective node at each step of the chain construction. First, the frequency of each node label is computed by counting every type of label in the learning database. The most selective label, i.e. the one with the lowest number of occurrences, is the starting point for the chains of patterns containing this label. Then, its most selective neighbor is recursively added to the growing chain. The goal is to minimize, at every step, the number of intermediate solutions.
For example, in Figure 1, the most selective node label is a, with a frequency of 2/18, so the greedy algorithm first searches for a and then recursively determines its most selective neighbor. In this case, only one of the four chains generated from the studied pattern starts with a: c α = a → a−b → a−b−d. So the greedy algorithm outputs the chain c α as the most selective one.
But this chain selection is greedy: the selectivity is computed for each node independently of its links to the other nodes. It therefore does not guarantee that the resulting chain is the most selective one once the previous steps of the chain construction are taken into account.
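A minimal sketch of this greedy selection, assuming the pattern is given as an adjacency map and that label frequencies come from the learning database. The frequencies for b and d are made up; only a: 2 matches Figure 1.

```python
def greedy_chain(pattern_adj, freq):
    """Build a search chain by starting from the most selective (rarest)
    label and repeatedly appending the most selective neighbour of the
    chain constructed so far."""
    start = min(pattern_adj, key=lambda v: freq[v])    # rarest label first
    chain = [start]
    frontier = set(pattern_adj[start])
    while frontier:
        nxt = min(frontier, key=lambda v: freq[v])     # most selective neighbour
        chain.append(nxt)
        frontier |= set(pattern_adj[nxt])
        frontier -= set(chain)
    return chain

# Pattern a-b-d; frequency of a taken from Figure 1, b and d hypothetical.
pattern = {"a": ["b"], "b": ["a", "d"], "d": ["b"]}
freq = {"a": 2, "b": 6, "d": 4}
print(greedy_chain(pattern, freq))    # ['a', 'b', 'd'], i.e. the chain c_alpha
```

Each step looks only at the local frequencies, which is exactly why the result can be a local minimum.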

Markovian chains
Markovian chains select the succession of sub-patterns that builds the most selective chain, i.e. the chain that minimizes the global search time. First, the time needed to search each possible chain constructed from a given pattern is estimated. Then, the chain requiring the minimal time is selected.
For a searched pattern p k of size k and a given chain c, the full estimated time T(c) spent to search for the succession of growing patterns p 1 , p 2 , . . . , p k is given by:

$$T(c) = \sum_{i=1}^{k} T(p_{i-1} \to p_i),$$

where T(p i−1 → p i ) is the time spent extending the p i−1 pattern into the p i pattern. Note that p 0 is the empty pattern, which is not searched. The initial time T(p 0 → p 1 ) spent searching for p 1 in the learning database of polymers is proportional to N, where N is the total number of nodes in the learning database.
By recursion, once the search for p i−1 yields n pi−1 occurrences (which can, in some cases, be exponentially larger than N), the next search for p i yields n pi occurrences and takes time:

$$T(p_{i-1} \to p_i) = n_{p_{i-1}} \cdot a_{p_{i-1} \to p_i} \cdot t,$$

where a pi−1→pi is the remaining arity of the node from which we wish to extend p i−1 into p i , and t is the time needed to compare two labels. Since t can be bounded by a constant, it is dropped from the formulas below. We introduce a filtering ability f defined by

$$f_{p_{i-1} \to p_i} = \frac{n_{p_i}}{n_{p_{i-1}}}$$

and, when cumulated,

$$n_{p_i} = N \cdot \prod_{j=1}^{i} f_{p_{j-1} \to p_j}.$$

Notice that f pj−1→pj is naturally bounded by the arity of the node extended during the construction of p j from p j−1 . Now we can introduce f into the calculation of T(c):

$$T(c) = N \cdot \left(1 + \sum_{i=2}^{k} a_{p_{i-1} \to p_i} \prod_{j=1}^{i-1} f_{p_{j-1} \to p_j}\right).$$
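The estimated search time of a chain can be accumulated step by step under these definitions. The sketch below drops the constant t; the database size and n_a come from Figure 1, but the arities and filtering abilities are made-up values for illustration.

```python
def estimated_time(N, n_p1, extensions):
    """Estimated search time of a chain, with the label-comparison time t
    dropped as a constant. `extensions` lists, for each extension
    p_{i-1} -> p_i (i >= 2), the remaining arity explored and the
    filtering ability f = n_{p_i} / n_{p_{i-1}}."""
    T = float(N)          # initial scan of the N database nodes to find p_1
    n = float(n_p1)       # current number of occurrences n_{p_{i-1}}
    for arity, f in extensions:
        T += arity * n    # each occurrence explores `arity` remaining edges
        n *= f            # occurrences that survive the extension
    return T

# Chain a -> a-b -> a-b-d on an 18-node database with n_a = 2 occurrences;
# the arities and filtering abilities are hypothetical.
print(estimated_time(18, 2, [(3, 0.5), (2, 1.0)]))   # 26.0
```

The smaller the filtering abilities early in the chain, the fewer occurrences remain for the expensive later extensions, which is what the Markovian selection exploits.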
For example, in Figure 1, the estimated computation time T(c α ) of the previously mentioned chain c α can be computed with the formula above. Note that c α is a possible chain to search for a−b−d, but not the best one. In practice, we select the most selective chain for each pattern by applying the following formula:

$$T(p_k) = \min_{c \,:\, p_0 \rightsquigarrow p_k} T(c).$$

For example, in Figure 1, the most selective chain for the pattern a−b−d is c δ (with T(c δ ) = 14/9) because the extension of the sub-pattern b−d into the pattern a−b−d has a smaller estimated search time than the extension of the sub-pattern a−b, and the extension of the sub-pattern d has a smaller estimated search time than the extension of the sub-pattern b.
Since the set of chains c = p 0 ⇝ p k used to search a pattern p k (or, more generally, any set of patterns) can be represented by a Directed Acyclic Graph (DAG), the dynamic programming computation of $T(p_k) = \min_{c \,:\, p_0 \rightsquigarrow p_k} T(c)$ is easily established. This has two consequences: (a) only one optimal chain per pattern in the DAG must be kept, and the full set of optimal chains for all patterns can itself be represented very compactly by a tree; (b) the p k pattern(s) and the optimal chain(s) to find them efficiently can be computed by a classical memory-efficient approach that derives the p i patterns from the already established optimal chains for p i−1 (for all i ∈ [1..k]), removing all non-optimal chains along the way.
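The dynamic programme over the DAG of sub-patterns can be sketched as follows. The sub-pattern names, edge costs, and database size are hypothetical; each edge cost stands for the estimated time of one extension (occurrence count times remaining arity).

```python
from math import inf

def optimal_chains(topo, edges, N):
    """topo: sub-patterns in topological order (size-1 patterns first).
    edges[p]: list of (q, cost) pairs, cost being the estimated time of
    the extension p -> q. Returns the minimal time T and the best
    predecessor of every sub-pattern, keeping one optimal chain each."""
    T, pred = {}, {}
    for p in topo:
        if p not in T:            # size-1 pattern: found by scanning all N nodes
            T[p], pred[p] = N, None
        for q, cost in edges.get(p, []):
            if T[p] + cost < T.get(q, inf):
                T[q], pred[q] = T[p] + cost, p
    return T, pred

def chain_to(p, pred):
    """Reconstruct the optimal chain ending at pattern p."""
    chain = []
    while p is not None:
        chain.append(p)
        p = pred[p]
    return chain[::-1]

# DAG of the sub-patterns of a-b-d (costs are made up for the example).
edges = {
    "a": [("ab", 6)], "b": [("ab", 12), ("bd", 12)], "d": [("bd", 4)],
    "ab": [("abd", 4)], "bd": [("abd", 3)],
}
T, pred = optimal_chains(["a", "b", "d", "ab", "bd", "abd"], edges, N=18)
print(T["abd"], chain_to("abd", pred))   # 25 ['d', 'bd', 'abd']
```

With these illustrative costs, the optimal chain starts from d rather than a, so the greedy choice of the rarest label would have been beaten, as in the c δ example above.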

Hybrid algorithm
We defined two possible algorithms to choose, for any sub-pattern, the chain that minimizes its search time. On the one hand, the greedy algorithm rapidly computes (in polynomial time) a good chain for each residue of the database, but can fall into local minima. On the other hand, the Markovian model computes the best chains more slowly (in exponential time) because it finds global minima, which guarantees optimality for several critical residues. To take advantage of both models, we implemented a hybrid solution that applies the Markovian model to the first m extensions of the chain and finishes the next extensions with the greedy model. Applying the Markovian model only to the first m steps avoids critical exponential numbers of intermediate solutions.
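A rough sketch of this hybrid strategy, using the product of label frequencies as a crude stand-in for the Markovian time estimate; the pattern, frequencies, and scoring are all illustrative, not the s2m implementation.

```python
def hybrid_chain(pattern_adj, freq, m):
    """Enumerate every connected ordering of the first m nodes (exact,
    Markovian-style part), keep the most selective prefix, then finish
    the chain greedily with the most selective frontier neighbour."""
    prefixes = []

    def grow(chain, frontier):
        if len(chain) == min(m, len(pattern_adj)):
            prefixes.append((chain, frontier))
            return
        for v in sorted(frontier - set(chain)):
            grow(chain + [v], frontier | set(pattern_adj[v]))

    for v in pattern_adj:
        grow([v], set(pattern_adj[v]))

    def score(prefix):
        # Product of label frequencies: a crude proxy for the estimated
        # search time of the prefix (smaller = more selective).
        p = 1
        for v in prefix:
            p *= freq[v]
        return p

    chain, frontier = min(prefixes, key=lambda cf: score(cf[0]))
    frontier = frontier - set(chain)
    while frontier:                       # greedy tail, as in the greedy model
        nxt = min(frontier, key=lambda v: freq[v])
        chain.append(nxt)
        frontier |= set(pattern_adj[nxt])
        frontier -= set(chain)
    return chain

pattern = {"a": ["b"], "b": ["a", "d"], "d": ["b"]}
freq = {"a": 2, "b": 6, "d": 4}
print(hybrid_chain(pattern, freq, m=2))   # ['a', 'b', 'd']
```

Only the exhaustive prefix enumeration grows exponentially with m, which is why keeping m small bounds the number of intermediate solutions.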

Results
We tested the performance of our algorithm on the Norine dataset, measuring only the search time for all the monomers on each polymer of the database. We compared the search time for a random order, for the greedy algorithm (Markovian with k = 1) and for several values of k in our hybrid algorithm. We obtained an average execution time per polymer of 7.8 ms for a random order and 7 ms for the greedy algorithm, and we reach a plateau of 6.1 ms at k = 3. So the greedy algorithm reduces the search time by 10% compared to a random order, and the hybrid algorithm with k = 3 reduces it by 20%, also compared to a random order. The isomorphism algorithm used for SMARTS matching in the CDK applies an ordering only to the very first atom searched, which is better than the random order but worse than the greedy algorithm. In comparison with the CDK SMARTS algorithm, we improved the isomorphism time by 10% up to 20%. The larger the number of monomers in the database, the more efficient the search is compared to the loading and preprocessing overheads (IO + SMILES parsing).

Figure 1: Index creation using the Markovian method. This figure represents the full process of building a Markovian index. The gray rectangle contains an example of a learning database and, in blue, the pattern that we want to index.
(1) Counts of the sub-patterns in the learning database: computation of the number of occurrences of each sub-pattern of a−b−d in the learning database.
(2) Estimation of the sub-pattern filtering: the number of occurrences of each sub-pattern divided by the count of the previous sub-pattern.
(3) Selection of the best chain for the pattern: computation of the expected computation time for each sub-pattern and selection of the fastest chain at each step. The selected path in the Directed Acyclic Graph is highlighted in green.