Skip to main content
  • Research article
  • Open access
  • Published:

Flexible heuristic algorithm for automatic molecule fragmentation: application to the UNIFAC group contribution model

Abstract

A priori calculation of thermophysical properties and predictive thermodynamic models can be very helpful for developing new industrial processes. Group contribution methods link the target property to contributions based on chemical groups or other molecular subunits of a given molecule. However, the fragmentation of the molecule into its subunits is usually done manually impeding the fast testing and development of new group contribution methods based on large databases of molecules. The aim of this work is to develop strategies to overcome the challenges that arise when attempting to fragment molecules automatically while keeping the definition of the groups as simple as possible. Furthermore, these strategies are implemented in two fragmentation algorithms. The first algorithm finds only one solution while the second algorithm finds all possible fragmentations. Both algorithms are tested to fragment a database of 20,000+ molecules for use with the group contribution model Universal Quasichemical Functional Group Activity Coefficients (UNIFAC). Comparison of the results with a reference database shows that both algorithms are capable of successfully fragmenting all the molecules automatically. Furthermore, when applying them on a larger database it is shown, that the newly developed algorithms are capable of fragmenting structures previously thought not possible to fragment.

Introduction

Cheminformatics is a growing field due to the increasing computational capabilities and improvements in the accuracy achieved by its predictions. The chemical space is vast and the number of molecules available to produce with new and, in some cases even automated synthetizing routes increases. However, before investing resources into synthetizing and characterizing molecules, a predictive approach for its properties would help narrow down the possible candidates. In addition, for the application of thermodynamic models or a priori calculation of thermophysical properties, predictive methods can be helpful and in some cases even necessary. These methods, which relate properties to the molecule structures are usually named QSPR methods (Quantitative Structure Property Relationship). One subgroup of these models is the group contribution method. The idea behind this method is to divide the value of a property of the complete molecule into its contributions based on the chemical groups or other molecular subunit. Group contribution models have been successfully applied to a wide variety of properties including density [1, 2], critical properties [3,4,5], enthalpy of vaporization [6], normal boiling points [7, 8], water–octanol partition coefficients [9,10,11], infinite dilution activity coefficients [12] and many more. Also, from Gibbs excess energy models [13,14,15] and equations of states [16,17,18,19] they provide an approach that allows widening their application range to molecules composed of the same chemical groups relatively easily.

However, in the development and application of these models a manual mapping of the groups has to be performed in most cases. This can hinder the fast development and testing of possible different group combinations, especially for larger number of molecules.

Jochelson [20], in 1968, already described a simple automatic routine for substructure counting. Most of research since [21,22,23,24,25,26,27,28] is focused more on describing algorithms for substructure search, ring perception and aromaticity perception. In a recent paper Ertl [29] proposed a new algorithm for automatic chemical group definition based on a large database. Fortunately, most of the current cheminformatic toolkits already include search and perception features, allowing to create new advanced fragmentation algorithms focusing on other problems.

One of the free tools offered online for structure analysis is Checkmol [28, 30]. It is an open-source program for finding a defined set of functional groups within a molecular structure. However, it checks its existence without counting the occurrence. Przemieniecki [31] developed an implementation of UNIFAC with automatic group fragmentation by means of a non-standardized way of specifying the fragmentation scheme. Some other free webpage services that allow a complete automatic fragmentation of molecules also exist, including the ones from the companies DDBST GmbH [32] and Xemistry GmbH [33]. In the first case, fragmentation is limited to the schemes supported by the webpage. In the second case, it is possible to provide own fragmentation rules allowing for fragmentation using different schemes. However, the terms of use only allow for a manual use of the website and without the ability to use the results in commercial applications. Furthermore, knowing how the algorithm works would allow to debug, find errors and improve it.

Tools that implement group contribution models like Octopus [34], thermo [35] or UManSysProp [36] would largely benefit from an improved flexible automated fragmentation algorithm based on standardized ways to define the fragmentation scheme that can handle complex molecules.

The goal of this work is to provide flexible algorithms that only need a simple fragmentation scheme based on the SMARTS language [37] which is easy to use for the rapid development and testing of group contribution methods on larger datasets.

Challenges of automatic fragmentation

Several challenges like non-unique group assignment, incomplete group assignment and the composition of the fragmentation scheme itself can arise when developing an automatic fragmentation algorithm. These will be discussed in more detail in this section. The examples described are based on the fragmentation scheme from Table 1.

Table 1 Fragmentation scheme developed in this work for the published UNIFAC groups and the respective pattern described used for sorting

Non-unique group assignment

For the assignment of the groups several solutions might be possible. The order in which the different groups are searched has an influence. For example, an ACOH group (hydroxyl bound to an aromatic carbon atom) can be recognized as such or fragmented into an aromatic carbon (AC) and a hydroxyl (OH) group. Furthermore, depending on the order in which the non-overlapping fragmentation is performed on the molecule structure, different results might be attained. For example, if a molecule is fragmented starting from left to right (Fig. 1a), the result obtained can be different from the one obtained if the molecule is fragmented from right to left (Fig. 1b).

Fig. 1
figure 1

Example of a molecule with different functional groups where non-unique group assignment is possible. The groups identified are marked by the dotted line. Depending on where the algorithm starts to assign the groups, the result of the fragmentation is different. If the molecule is fragmented starting from left to right, the result might be the one shown in a, while if it is fragmented from right to left, the result might be as shown in b. SMILES: C[NH]C(=O)OC

In these cases, the algorithm must either deliver the correct fragmentation as a first solution or find all solutions and then specify how to choose the correct one.

Incomplete group assignment

This case occurs when it is not possible to assign one or more atoms to a specific group. In some cases, the order of the groups searched can also lead to this situation. For example, in Fig. 2 if the AC groups (aromatic carbon) are searched first, the remaining chlorine atom cannot be assigned to any other functional group from the fragmentation scheme. In other cases, there will be molecules with atoms or functional groups that are just not defined in the fragmentation scheme. However, in most cases where the fragmentation is possible, this issue can be avoided if the algorithm specifies the order in which the functional groups are searched.

Fig. 2
figure 2

Example of a molecule with different functional groups where incomplete group assignment is possible. The groups identified are marked by the dotted line. The chlorine atom cannot be assigned to a group from the fragmentation scheme. SMILES: c1c(Cl)c([OH])ccc1

The fragmentation scheme

Defining the fragmentation scheme is decidedly important for the accuracy of the algorithm. If the groups defined were targeting very specific functional groups or avoiding overlapping with other groups, this would minimize the non-unique or incomplete group assignments. A lot of time and testing can be invested in developing highly specific patterns for any given group contribution method such as those already done for UNIFAC by Salmina et al. [38]. However, if the algorithm includes a way to prioritize the groups from the fragmentation scheme, in most cases the groups do not have to be highly specific thus allowing to focus more time on developing different fragmentation schemes instead of refining one specific scheme.

Strategies to overcome the challenges

To overcome the challenges described in the section “Challenges of automatic fragmentation”, three features were implemented in this work:

Heuristic group prioritization

The patterns of the fragmentation scheme are sorted based on a set of heuristically determined descriptors. These descriptors can be, for example, the number of atoms describing the pattern, the number of bonds available or the number of double bonds.

Parent–child group prioritization

The complete fragmentation scheme is analyzed to find patterns that are contained within others. E.g. CH2 is contained in CONHCH2. Whenever searching for a specific pattern, if the group has such a parent pattern, the parent pattern is searched first. After that, the child pattern is searched.

Adjacent group search

To avoid incomplete group assignments, whenever a part of the structure is already fragmented, the subsequent matches have to be adjacent to the groups already found.

The algorithms

There are two types of algorithms that are possible to fragment molecules. The first type of algorithm (simple fragmentation) searches for one possible solution and accepts the first one found. The second type of algorithm (complete fragmentation) tries to find all possible solutions to fragment the molecule. To achieve this, a full tree search on the complete structure over the entire fragmentation scheme has to be performed. Since more than one solution is inherently possible, a way should to be provided to prioritize the determined solutions and select one.

Simple fragmentation

In the simple fragmentation algorithm, only one solution is searched. The patterns are sorted based on automatically calculated descriptors. In this work, the following set of 8 heuristically chosen descriptors were used to sort the patterns in descending order:

  1. 1.

    When the pattern has zero bonds: First, the patterns without bonds, then patterns with bonds are sorted.

  2. 2.

    When the pattern is simple: consisting of one atom with valence one or one atom with valence one connected to a carbon atom. First, the simple patterns, then the others are sorted.

  3. 3.

    Number of atoms defining the group: this number includes the atoms actually matched by the pattern as well as the ones defining the vicinity in case of recursive SMARTS.

  4. 4.

    When the number of available bonds is one: first, the patterns with one bond, then patterns with more bonds are sorted.

  5. 5.

    Number of atoms in the pattern that are neither hydrogen nor carbon.

  6. 6.

    When the pattern includes atoms in a ring: first the patterns that describe a partial ring (aliphatic or aromatic), then the other patterns are sorted.

  7. 7.

    Number of triple bonds.

  8. 8.

    Number of double bonds.

As a first step, the algorithm performs a quick search for the different groups in the fragmentation scheme applying the heuristic group prioritization and the parent–child group prioritization as described above. The search goes sequentially through the sorted fragmentation scheme, adding groups that are found and do not overlap with groups that were already found. In case it successfully finds a valid fragmentation, this is taken as the solution.

In case no solution is found after trying all fragmentation patterns, the area around the unassigned atoms is cleared of adjacent groups and the search is repeated applying all three features described above, i.e. searching only for non-overlapping groups that are contiguous to the groups already found. The clearing and searching might be repeated several times if no solution is found after the first iteration. In each subsequent iteration, a larger portion of the molecule connected to the unassigned atoms is cleared. If a valid fragmentation is found, this is taken as the solution. Figure 3 shows a flow-diagram-like schematic representation of the algorithm.

Fig. 3
figure 3

Schematic representation of the simple fragmentation algorithm

Complete fragmentation

With the complete fragmentation algorithm, all possible solutions are searched. While the simple fragmentation algorithm might take milliseconds to find the fragmentation, the complete fragmentation algorithm might take minutes or even hours due to the vast space of possible combinations. Its search time increases exponentially with increasing molecule size. However, in contrast to the simple fragmentation, it allows to find all fragmentations and therefore its success in finding a solution is not dependent on the order of the searched patterns.

This algorithm was implemented as a recursive algorithm that performs a complete tree search of all possible combinations of fragmentation. To reduce the fragmentation space that needs to be searched, the algorithm keeps track of the solutions already found and of the group combinations that lead to an incomplete fragmentation. If several solutions were found in the end, the solutions were sorted by the number of different patterns and the first solution was taken as the determined fragmentation. This way, patterns with larger groups are prioritized over smaller patterns. Figure 4 shows a flow-diagram-like schematic representation of the algorithm.

Fig. 4
figure 4

Schematic representation of the complete fragmentation algorithm

Computational details

In this work, the RDKit [39] python module was used to implement the algorithm. It supports the Simplified Molecular Input Line Entry System (SMILES) [40] and the SMiles ARbitrary Target Specification (SMARTS) [37] languages for specifying the molecular structures and the functional group patterns respectively. The SMARTS language is used as it provides a standardized, rich featured, easily learnable and wide spread approach to describe the molecular patterns.

To implement the parent–child group prioritization as described in “Parent–child group prioritization” section, it is necessary to test whether one pattern is contained within another. RDKit already works well when testing for most of the parent–child relationships. However, in some cases where the explicit amount of hydrogen atoms is important, the results are incorrect. For example, RDKit matches ‘[CH3][OH]’ as being contained in ‘[CH3][O;H0]’. Because of this, in this work, after a positive match the explicit amount of hydrogen atoms is tested to avoid false positives.

The research group of Computational Molecular Design at the University of Hamburg offers an online tool called SMARTSviewer [41, 42] that makes developing SMARTS patterns easier. This tool was used in the development process of the fragmentation scheme. The same group is also developing new algorithms to find the relationships between SMARTS patterns. In future, these developments might help improve the capabilities of cheminformatics modules such as RDKit to discern whether a pattern is contained within another.

The open source thermodynamics python module thermo [35] includes a large database of structures including single molecules and mixtures. After excluding salts and radicals, this comprises of a total set of 62,380 structures in the form of SMILES. For a subset of structures of this large database, fragmentations are available for use with the UNIFAC model. These structures were automatically fragmented using the service provided on the DDBST GmbH webpage [32]. This work first compares the results of the newly developed fragmentation algorithms with this reference database and then checks whether the new algorithms can fragment more structures than previously thought.

For some SMILES that include heavy versions of hydrogen, e.g. deuterium, these were replaced by normal hydrogen atoms. That makes 28,678 available SMILES with their corresponding UNIFAC fragmentation in the reference database.

For the sake of making the implementation of the algorithm easier in another group contribution model, the functions and the reference databases are made available as separate files in Additional files 1, 2, 3, 4 and on GitHub [43].

Results and discussion

The fragmentation scheme for UNIFAC developed in this work can be found in Table 1. A version of the sorted fragmentation scheme according to the description in “Simple fragmentation” section can be found in Additional file 5.

The focus of this work is to develop a fragmentation algorithm that is as independent as possible from the chosen fragmentation scheme to allow for a faster development of new group contribution methods. For this reason, the SMARTS for each pattern were kept as simple as possible. The few patterns that were made more specific to match the results better from the literature database have been underlined. However, the overall majority of the SMARTS are as simple as they can be.

The fragmentation results are summarized in Table 2. Since the order of patterns searched can have an influence on the end result, both cases are differentiated in the table.

Table 2 Results of the fragmentation with both algorithms on the reference database

It can be observed that the simple fragmentation algorithm with the sorted patterns is able to fragment all but the molecule shown in Fig. 5. This is because there is no group in the fragmentation scheme matching the structure. The algorithm was able to fragment the molecules for every structure for which it should have been possible. This is a very encouraging result. Based on a set of general descriptors, by sorting the patterns automatically as much as 98.7% of the fragmented molecules match the fragmentation found by the algorithm from the reference database. Most of the remaining 1.3% of the fragmentations from the reference database can be explained by a different aromaticity perception. In the RDKit, a chemical bond is either described as being aromatic or being a single/double bond as opposed to the assignments done in the reference database where in some cases no distinction is made.

Fig. 5
figure 5

Only molecule that was not possible to fragment. SMILES: C1=CN=CC#C1

For the simple fragmentation algorithm, as expected, the sorting of the patterns plays a major role on the success of finding any solution at all and it is especially important to find the same solution as the reference database.

To evaluate the complete fragmentation algorithm only the molecules with 20 or less heavy atoms were included from the reference database. This was done because for very large molecules the algorithm takes hours to find all solutions.

Table 2 shows that since this algorithm searches for all possible fragmentations the amount of fragmented molecules is independent on whether the patterns are sorted or not. However, the results show that the sorting of the patterns has an influence on whether the chosen solution at the end is equal to the solution of the reference database.

This is because the order in which the different patterns is searched for defines the order of the found solutions from which the first one is selected.

The complete fragmentation algorithm could be refined further to sort the determined solutions at the end in a more elaborate way, for example, based on the descriptors of the patterns. However, this is out of the scope of this work.

Lastly, the algorithms were applied to the large database of structures included in thermo [35] to find out if the new algorithms are capable of fragmenting molecules that were not in the reference database. In this case, first the simple fragmentation algorithm was applied with the sorted patterns. If no solution was found with the simple fragmentation algorithm, the complete fragmentation algorithm was applied if the structure was smaller than 20 heavy atoms.

With this combined fragmentation algorithm, in total 33,560 structures were fragmented successfully. This number is 17% larger than the 28,677 fragmented structures in the reference database. This shows that the newly developed algorithms are capable of fragmenting more structures than the algorithm used in the reference database.

Conclusions

Several challenges exist when attempting to fragment molecules into a set of predefined functional groups or molecular subunits. The strategies developed and implemented for the two algorithms in this work, show that it is possible to automate group fragmentation based on computed descriptors for the patterns in the fragmentation scheme. Both algorithms are capable of fragmenting every molecule of a reference database of structures into their respective UNIFAC groups. Furthermore, the algorithms are capable of fragmenting molecules that could not be fragmented by the algorithm of the reference database. The advancements of this work permit to accelerate the development of new group contribution models by allowing to test different fragmentations schemes on large databases of molecules much faster than with manual fragmentation, which is the existing standard for most group contribution models. It is a step forward in the direction of completely automated QSPR methods and maybe even completely automated group contribution development.

Availability of data and materials

The datasets and the source code supporting the conclusions of this article are available in the GitHub repository, https://github.com/simonmb/fragmentation_algorithm_paper. Furthermore, these are also provided in Additional information to allow complete reproducibility of the work.

References

  1. Evangelista NS, do Carmo FR, de Santiago-Aguiar RS, de Sant’Ana HB (2014) Development of a new group contribution method based on GCVOL model for the estimation of pure ionic liquid density over a wide range of temperature and pressure. Ind Eng Chem Res 53(22):9506–9512

    Article  CAS  Google Scholar 

  2. Shen C, Li C, Li X, Lu Y, Muhammad Y (2011) Estimation of densities of ionic liquids using Patel-Teja equation of state and critical properties determined from group contribution method. Chem Eng Sci 66(12):2690–2698

    Article  CAS  Google Scholar 

  3. Varamesh A, Hemmati-Sarapardeh A, Moraveji MK, Mohammadi AH (2017) Generalized models for predicting the critical properties of pure chemical compounds. J Mol Liq 1(240):777–793

    Article  Google Scholar 

  4. do Carmo FR, Evangelista NS, Fernandes FAN, de Sant’Ana HB (2015) Evaluation of optimal methods for critical properties and acentric factor of biodiesel compounds with their application on Soave–Redlich–Kwong and Peng-Robinson equations of state. J Chem Eng Data 60(11):3358–3381

    Article  Google Scholar 

  5. Carande WH, Kazakov A, Muzny C, Frenkel M (2015) Quantitative structure–property relationship predictions of critical properties and acentric factors for pure compounds. J Chem Eng Data 60(5):1377–1387

    Article  CAS  Google Scholar 

  6. Abdi S, Movagharnejad K, Ghasemitabar H (2018) Estimation of the enthalpy of vaporization at normal boiling temperature of organic compounds by a new group contribution method. Fluid Phase Equilib 15(473):166–174

    Article  Google Scholar 

  7. Sola D, Ferri A, Banchero M, Manna L, Sicardi S (2008) QSPR prediction of N-boiling point and critical properties of organic compounds and comparison with a group-contribution method. Fluid Phase Equilib 263(1):33–42

    Article  CAS  Google Scholar 

  8. Wang Q, Ma P, Wang C, Xia S (2009) Position group contribution method for predicting the normal boiling point of organic compounds. Chin J Chem Eng 17(2):254–258

    Article  CAS  Google Scholar 

  9. Goudarzi N, Goodarzi M (2010) QSPR study of partition coefficient (Ko/w) of some organic compounds using radial basic function-partial least square (RBF-PLS). J Braz Chem Soc 21(9):1776–1783

    Article  CAS  Google Scholar 

  10. Livingstone DJ, Ford MG, Huuskonen JJ, Salt DW (2001) Simultaneous prediction of aqueous solubility and octanol/water partition coefficient based on descriptors derived from molecular structure. J Comput Aided Mol Des 15(8):741–752

    Article  CAS  Google Scholar 

  11. Derawi SO, Kontogeorgis GM, Stenby EH (2001) Application of group contribution models to the calculation of the octanol–water partition coefficient. Ind Eng Chem Res 40(1):434–443

    Article  CAS  Google Scholar 

  12. He J, Zhong C (2003) A QSPR study of infinite dilution activity coefficients of organic compounds in aqueous solutions. Fluid Phase Equilib 205(2):303–316

    Article  CAS  Google Scholar 

  13. Constantinescu D, Gmehling J (2016) Further development of modified UNIFAC (Dortmund): revision and extension 6. J Chem Eng Data 61(8):2738–2748

    Article  CAS  Google Scholar 

  14. Achard C, Dussap CG, Gros JB (1994) Representation of vapour–liquid equilibria in water–alcohol–electrolyte mixtures with a modified UNIFAC group-contribution method. Fluid Phase Equilib 98:71–89

    Article  CAS  Google Scholar 

  15. Ganbavale G, Zuend A, Marcolli C, Peter T (2015) Improved AIOMFAC model parameterisation of the temperature dependence of activity coefficients for aqueous organic mixtures. Atmos Chem Phys 15(1):447–493

    Article  Google Scholar 

  16. Emami FS, Vahid A, Elliott JR, Feyzi F (2008) Group contribution prediction of vapor pressure with statistical associating fluid theory, perturbed-chain statistical associating fluid theory, and Elliott–Suresh–Donohue equations of state. Ind Eng Chem Res 47(21):8401–8411

    Article  CAS  Google Scholar 

  17. Lubarsky H, Polishuk I, NguyenHuynh D (2016) The group contribution method (GC) versus the critical point-based approach (CP): predicting thermodynamic properties of weakly- and non-associated oxygenated compounds by GC-PPC-SAFT and CP-PC-SAFT. J Supercrit Fluids 1(110):11–21

    Article  Google Scholar 

  18. Burgess WA, Tapriyal D, Gamwo IK, Wu Y, McHugh MA, Enick RM (2014) New group-contribution parameters for the calculation of PC-SAFT parameters for use at pressures to 276 MPa and temperatures to 533 K. Ind Eng Chem Res 53(6):2520–2528

    Article  CAS  Google Scholar 

  19. Baghbanbashi M, Pazuki G (2016) Application of SAFT-VR equation of state for prediction of thermophysical properties of sugar solutions. J Food Process Eng 39(6):601–609

    Article  CAS  Google Scholar 

  20. Jochelson N, Mohr CM, Reid RC (1968) The automation of structural group contribution methods in the estimation of physical properties. J Chem Doc 8(2):113–122

    Article  CAS  Google Scholar 

  21. Adams JT, So EMT (1985) Automation of group-contribution techniques for estimation of thermophysical properties. Comput Chem Eng 9(3):269–284

    Article  CAS  Google Scholar 

  22. Dengler A, Ugi I (1991) A central atom based algorithm and computer program for substructure search. Comput Chem 15(2):103–107

    Article  CAS  Google Scholar 

  23. Ozawa K, Yasuda T, Fujita S (1997) Substructure search with tree-structured data. J Chem Inf Comput Sci 37(4):688–695

    Article  CAS  Google Scholar 

  24. Raymond JW, Rogers TN (1999) Molecular structure disassembly program (MOSDAP): a chemical information model to automate structure-based physical property estimation. J Chem Inf Comput Sci 39(3):463–474

    Article  CAS  Google Scholar 

  25. Joback KG (2001) Knowledge bases for computerized physical property estimation. Fluid Phase Equilib 185(1):45–52

    Article  CAS  Google Scholar 

  26. Rowley RJ, Oscarson JL, Rowley RL, Wilding WV (2001) Development of an automated SMILES pattern matching program to facilitate the prediction of thermophysical properties by group contribution methods. J Chem Eng Data 46(5):1110–1113

    Article  CAS  Google Scholar 

  27. Feldman HJ, Dumontier M, Ling S, Haider N, Hogue CWV (2005) CO: a chemical ontology for identification of functional groups and semantic comparison of small molecules. FEBS Lett 579(21):4685–4691

    Article  CAS  Google Scholar 

  28. Haider N (2010) Functionality pattern matching as an efficient complementary structure/reaction search tool: an open-source approach. Molecules 15(8):5079–5092

    Article  CAS  Google Scholar 

  29. Ertl P (2017) An algorithm to identify functional groups in organic molecules. J Cheminform 7(9):36

    Article  Google Scholar 

  30. Checkmol—functional group analyzer. http://merian.pch.univie.ac.at/~nhaider/fga.php. Accessed 25 Dec 2018

  31. Przemieniecki J. Implementation of UNIFAC model. 2018. https://github.com/JacekPrzemieniecki/UNIFAC. Accessed 21 Dec 2018

  32. Group assignment. http://www.ddbst.de/unifacga.html. Accessed 21 Dec 2018

  33. Ihlenfeldt W-D. Online structure fragmentation. https://xemistry.com/fragment/. Accessed 21 Dec 2018

  34. Sombra Evangelista N, do Carmo FR. Octopus: a tool for estimating physical and chemical properties of organic molecular compounds. https://github.com/thegibbsproject/octopus. Accessed 21 Dec 2018

  35. Bell C. DB: thermo: thermodynamics, phase equilibria, transport properties and chemical database component of chemical engineering design library (ChEDL). 2018. https://github.com/CalebBell/thermo. Accessed 6 July 2018

  36. Topping D, Barley M, Bane M, Higham NJ, Aumont B, Dingle N et al (2016) UManSysProp V1.0: An online and open-source facility for molecular property prediction and atmospheric aerosol calculations. Geosci Model Dev 9(2):899–914

    Article  Google Scholar 

  37. SMARTS language. http://www.daylight.com/dayhtml_tutorials/languages/smarts/. Accessed 12 May 2019

  38. Salmina ES, Haider N, Tetko IV (2015) Extended functional groups (EFG): an efficient set for chemical characterization and structure-activity relationship studies of chemical compounds. Molecules 21(1):1

    Article  Google Scholar 

  39. RDKit: open source cheminformatics. RDKit: open source cheminformatics. http://www.rdkit.org. Accessed 28 Jan 2019

  40. SMILES language. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html. Accessed 12 May 2019

  41. Schomburg K, Ehrlich H-C, Stierand K, Rarey M (2010) From structure diagrams to visual chemical patterns. J Chem Inf Model 50(9):1529–1535

    Article  CAS  Google Scholar 

  42. SMARTSviewer. SMARTSviewer—visualize chemical patterns. https://smartsview.zbh.uni-hamburg.de/. Accessed 28 Jan 2019

  43. Simonmb/fragmentation_algorithm. GitHub. https://github.com/simonmb/fragmentation_algorithm. Accessed 12 May 2019

  44. DDBST GmbH. published parameters UNIFAC. http://www.ddbst.de/published-parameters-unifac.html. Accessed 31 May 2019

Download references

Acknowledgements

The author thanks Caleb Bell for the fruitful discussions on the matter and for allowing to use his database to test the fragmentation algorithms. The author would also like to thank the DDBSTDortmund Data Bank Software & Separation Technology GmbH for allowing to access and publish the fragmentation results from their online fragmentation tool.

Funding

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Projektnummer 392323616 and the Hamburg University of Technology (TUHH) in the funding programme *Open Access Publishing*.

Author information

Authors and Affiliations

Authors

Contributions

The complete work was done by the only corresponding author. The author read and approved the final manuscript.

Corresponding author

Correspondence to Simon Müller.

Ethics declarations

Competing interests

The author declares no competing financial interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1.

 Reference database of structures with fragmentations by the DDBST online fragmentation tool.

Additional file 2.

Large database of structures without fragmentations by another method used to test the capability of the algorithms on more molecules.

Additional file 3.

Code to reproduce results from the paper.

Additional file 4.

 Class encapsulating both algorithms for use in new applications.

Additional file 5: Table S1.

Sorted fragmentation scheme developed in this work for the published UNIFAC groups and the respective pattern used for sorting.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Müller, S. Flexible heuristic algorithm for automatic molecule fragmentation: application to the UNIFAC group contribution model. J Cheminform 11, 57 (2019). https://doi.org/10.1186/s13321-019-0382-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13321-019-0382-3

Keywords