Classifying natural products from plants, fungi or bacteria using the COCONUT database and machine learning
Journal of Cheminformatics volume 13, Article number: 82 (2021)
Natural products (NPs) represent one of the most important resources for discovering new drugs. Here we asked whether NP origin can be assigned from their molecular structure in a subset of 60,171 NPs in the recently reported Collection of Open Natural Products (COCONUT) database assigned to plants, fungi, or bacteria. Visualizing this subset in an interactive tree-map (TMAP) calculated using MAP4 (MinHashed atom pair fingerprint) clustered NPs according to their assigned origin (https://tm.gdb.tools/map4/coconut_tmap/), and a support vector machine (SVM) trained with MAP4 correctly assigned the origin for 94% of plant, 89% of fungal, and 89% of bacterial NPs in this subset. An online tool based on an SVM trained with the entire subset correctly assigned the origin of further NPs with similar performance (https://np-svm-map4.gdb.tools/). Origin information might be useful when searching for biosynthetic genes of NPs isolated from plants but produced by endophytic microorganisms.
Due to the importance of natural products (NPs) in drug discovery [1, 2], there is considerable interest in describing and understanding their structural diversity, particularly by exploiting NP databases [3] using in silico methods such as machine learning (ML) [4]. Computational approaches have been reported to distinguish between NPs and non-NPs [5,6,7,8,9], between terrestrial and marine NPs [10], and to classify NP structural types [11, 12] and visualize their chemical space [13].
In our own approach to this problem [14], we recently analyzed NPAtlas, an open-access database listing 25,523 NPs of bacterial or fungal origin [15], by computing the MAP4 fingerprint (MinHashed Atom-Pair fingerprint up to four bonds) [16] of each NP and creating a TMAP (tree-map) [17] of the resulting high-dimensional dataset. In this analysis, NPs of bacterial or fungal origin formed separate clusters. This separation effect was confirmed by showing that a support vector machine (SVM) trained with the MAP4 of NPAtlas was able to distinguish bacterial from fungal origin. The classifier also assigned a bacterial origin to a recently reported NP isolated from the marine sponge Phakellia fusca, in line with the fact that many NPs from this sponge originate from endosymbiotic actinobacteria [18, 19].
The possibility of assigning the origin of NPs from their structure was intriguing because most NPs are secondary metabolites produced by biosynthetic gene clusters [20], which are sometimes transferred between different organisms [21]. Such horizontal gene transfer may reflect adaptive relationships between host organisms such as plants and sponges and endosymbiotic bacteria or fungi [22]. Among the many endophytic NPs [23, 24], striking examples include the cancer drug paclitaxel, a plant NP also produced by endophytic fungi of the yew tree [25, 26], and maytansine, used in antibody-drug conjugates for cancer therapy and produced by endophytic bacteria in plants [27]. Due to the very widespread occurrence of endophytic bacteria and fungi in plants, we asked whether our MAP4 analysis might be able to distinguish plant NPs from bacterial and fungal NPs. To test this hypothesis, we considered the recently reported COCONUT database, an open-access database currently offering the most extensive coverage and including plant NPs [28].
Results and discussion
Chemical space analysis of plant and microbial NPs from the COCONUT database
COCONUT collects over 400 thousand NPs from 52 different databases, 135 thousand of which are annotated with a taxonomical origin. For our analysis, we considered the 68 thousand entries annotated with a source organism that were also associated with a publication. We focused on those annotated as originating from plants (50%), fungi (23%), or bacteria (16%), leaving out smaller subsets of NPs originating from animals (2%), from Homo sapiens (2.5%), of marine origin (1.5%), or lacking one of the previous taxonomical annotations (5%). The selected subset of 60,171 NPs contained 33,772 plant NPs, 15,648 fungal NPs and 10,751 bacterial NPs.
The subset ranged in molecular weight from MW = 81 Da for 1,2-dihydropyridine, a plant NP [29], to MW = 2901 Da for lacticin 481, a bacterial peptide [30]. Plant NPs dominated the intermediate molecular weight range (200 < MW < 800), while fungal NPs were most abundant in the low molecular weight range (MW ≤ 200) and bacterial NPs in the high MW range (MW ≥ 800). The three series had rather similar distributions of the fraction of sp3 carbon atoms (Fsp3), which measures the degree of saturation. However, the estimated octanol:water partition coefficient AlogP indicated that highly polar NPs were almost absent from fungal NPs. Furthermore, plant NPs had overall higher percentages of glycosides, while peptides were almost absent from plant NPs and most abundant in bacterial NPs (Table 1).
To gain closer insight into structural features, we calculated the MAP4 fingerprint for each of the 60,171 selected NPs. MAP4 encoding combines the characteristics of substructure fingerprints, which are well suited to small molecules, and of atom pair fingerprints, which are preferable for larger structures, and it has been proven suitable for both [16]. It consists of listing all pairs of circular substructures of radius 1 and 2 as SMILES, separated by their topological distance in bonds, and MinHashing the resulting set of SMILES pairs to a defined dimensionality (1024 in the present analysis). We then represented the MAP4 annotated NP dataset using the dimensionality reduction method TMAP. This method is suitable for very large high-dimensional datasets and performs better than t-SNE or UMAP in preserving local and global relationships in the data [17]. To create a TMAP, the algorithm computes an approximate nearest neighbor graph by locality sensitive hashing (LSH), cuts edges to obtain the minimum spanning tree of this graph, and creates an optimized 2D representation of the minimum spanning tree, in which each node represents a molecule connected to its approximate nearest neighbors. This tree is then displayed with the interactive visualization tool Faerun [31]. Faerun shows each node as a sphere that can be color-coded according to various properties and uses Smilesdrawer to depict molecular structures. The TMAP of our NP subset is available interactively at https://tm.gdb.tools/map4/coconut_tmap/.
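The MinHashing step described above can be illustrated with a minimal Python sketch. It is not the MAP4 implementation: the SHA-1 based hash family, the function names `seeded_hash`, `minhash` and `estimated_jaccard`, and the example "substructure|distance|substructure" strings are all assumptions chosen only to show how a set of strings is reduced to a fixed-length vector whose position-wise agreement estimates set similarity.

```python
import hashlib
from typing import Iterable, List, Sequence


def seeded_hash(shingle: str, seed: int) -> int:
    """64-bit hash of a shingle under one member of an illustrative hash family."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(shingles: Iterable[str], dimensions: int = 1024) -> List[int]:
    """Keep, for each seeded hash function, the minimum hash over the set."""
    unique = set(shingles)
    if not unique:
        raise ValueError("at least one shingle is required")
    return [min(seeded_hash(s, seed) for s in unique) for seed in range(dimensions)]


def estimated_jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions at which two MinHash vectors agree."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical shingles, for illustration only (not produced by MAP4).
mol_a = {"c(c)O|3|C(C)=O", "c(c)c|1|c(c)O", "C(C)=O|2|c(c)c"}
mol_b = {"c(c)O|3|C(C)=O", "c(c)c|1|c(c)O", "C(C)N|2|c(c)c"}
fp_a = minhash(mol_a)
fp_b = minhash(mol_b)
print(len(fp_a), estimated_jaccard(fp_a, fp_b))
```

The two example sets share two of four distinct shingles, so the printed estimate should fall near their Jaccard similarity of 0.5, with some sampling noise from the finite number of hash functions.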
The TMAP of our NP subset color-coded by MW showed that most high MW compounds appeared in two groups: the first one (at right on the TMAP) contained peptides and related macrocycles, and the second one (at middle/lower left on the TMAP) corresponded to glycosylated triterpenoids (Fig. 1a). Color-coding by Fsp3 showed that the TMAP separated high Fsp3 molecules (left half of the TMAP), comprising many terpenes, steroids, and glycosides, from low Fsp3 molecules (right half of the TMAP) featuring many polyphenols and related polyaromatic molecules (Fig. 1b). Furthermore, color-coding by the calculated octanol:water partition coefficient AlogP, which estimates polarity, showed several islands of highly polar NPs (low AlogP, magenta) corresponding mostly to nucleosides and glycosylated polyphenols (upper part of the TMAP), glycosylated triterpenoids (lower left on the TMAP) and peptides (middle right on the TMAP), as well as a few groups of apolar NPs (high AlogP, red), corresponding primarily to lipids, terpenes, and steroids (Fig. 1c).
Color-coding by the annotated origin showed that NPs from plants, fungi, or bacteria formed many well-defined clusters spread across the entire TMAP (Fig. 1d). On the one hand, this separation illustrated how NP origin corresponded to differences in molecular structure that were well perceived by the MAP4 fingerprint used to generate the map. On the other hand, the taxonomical origin color code also showed that each subset contained diverse structural types. While there was no correlation of origin with properties such as MW, Fsp3, or AlogP, most glycosides were associated with plants, and most peptides were of bacterial or fungal origin, in line with Table 1 (Fig. 1e). These relationships were also clearly visible when color-coding the TMAP by six selected prioritized categories summarizing important characteristics of natural products (Fig. 1f).
Statistical modeling of NP origin using support vector machines (SVM)
The separation of NPs from plants, fungi, or bacteria in the TMAP above clearly showed that our MAP4 fingerprint distinguished between NPs of plant, bacterial or fungal origin. To further investigate this separation, we trained an SVM classifier using the MAP4 similarity matrix of half of the COCONUT subset and used the other half to evaluate it. Indeed, the obtained MAP4 SVM correctly predicted the origin of 94% of plant NPs, 89% of fungal NPs, and 89% of bacterial NPs, resulting in a balanced accuracy of 0.897, an MCC (Matthews correlation coefficient) of 0.890, and an F1 score of 0.920 (see Methods for a detailed explanation of the metrics used).
To better identify the role of the MAP4 molecular encoding in the reported successful prediction, we compared the performance of the MAP4 SVM with that of an SVM trained using ECFP4 (extended connectivity fingerprint of radius 2, ECFP4 SVM) and of an SVM trained using the RDKit atom pair fingerprint (AP SVM). We chose ECFP4 and the RDKit AP as widely used and available examples of substructure fingerprints and atom pair fingerprints. As a baseline model, we also included an SVM trained with a set of 11 calculated physico-chemical properties, namely MW, Fsp3, HBD (hydrogen bond donor) count, HBA (hydrogen bond acceptor) count, AlogP, the number of carbons, oxygens, and nitrogens, the total number of atoms, number of bonds, and TPSA (topological polar surface area) (properties SVM). The selected 60 thousand COCONUT entries were divided into five subsets, and each model was trained and evaluated five times, each time using one subset as the test set and the other four as the training set (an 80-20 training-test split). Then the mean balanced accuracy, MCC, and F1 score of the five evaluations were calculated.
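The five-subset evaluation scheme can be sketched in plain Python as follows. The function name `k_fold_splits`, the random shuffling, and the fixed seed are assumptions; the text does not state whether the subsets were drawn at random or stratified by origin.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(
    n_items: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with each fold used once as test set."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random, unstratified folds
    folds = [indices[i::k] for i in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [i for j, fold in enumerate(folds) if j != test_fold for i in fold]
        yield train, test


def mean(values: List[float]) -> float:
    """Average of the per-fold scores, as reported for each metric."""
    return sum(values) / len(values)


splits = list(k_fold_splits(100))
for train, test in splits:
    print(len(train), len(test))
```

With 100 items and five folds, every split holds out 20 items for testing and trains on the remaining 80, and each item is tested exactly once.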
The results of this evaluation are presented in Table 2 and Fig. 2. Remarkably, all four SVMs performed reasonably well. The good performance of the property-based SVM reflected the fact that relatively large NP families with characteristic properties are essentially all from the same origin. For example, almost all large peptides or cyclic peptides are assigned to bacteria, while most glycosylated triterpenoids and polyphenols are assigned to plants. Nevertheless, there was a significant performance increase with the ECFP4 SVM and the MAP4 SVM, showing that correct origin assignment works better if specific substructures are considered. Among the four SVMs evaluated, our MAP4 SVM performed best, with significantly higher values compared to the ECFP4 SVM, probably because the MAP4 fingerprint encodes a more precise representation of the molecular structures than ECFP4. Indeed, MAP4 considers pairs of local substructures and the topological distance between them, while ECFP4 only encodes the presence of local substructures.
Using the MAP4 SVM to assign the origin of NPs
The SVM evaluation above showed that the MAP4 analysis of NP molecular structure identified features distinguishing between NPs assigned to plants, fungi, and bacteria. Assuming that most of the assigned origins were correct among the 60,171 NPs used for training, one may use an SVM to tentatively assign the origin of further NPs as originating from plants, fungi, or bacteria. To best exploit the information in the COCONUT database, we trained a MAP4 SVM using the entire set of 60 thousand COCONUT NPs assigned to plants, fungi, or bacteria. We used the resulting classifier to build an online tool that takes any molecular structure as input (drawn or pasted as SMILES) and returns the assigned origin and the corresponding percentages from the SVM classifier. This tool is freely accessible online at https://np-svm-map4.gdb.tools/.
The online tool performed quite well in assigning the origin of newly published NPs that were not present in COCONUT. Among 20 recently reported NPs from plants, fungi, or bacteria, 17 were correctly assigned to their origin, while only three were misassigned (Table 3; Fig. 3). In detail, the fungal epicospirocin 1 [33], penicimeroterpenoid A [34], beetleane A [35], funiculolide D [36], and fusoxypenes A [37], the bacterial vertirhodin A [38], bosamycin A [39], and dumulmycin [40], and the plant fortuneicyclidin A [41], meloyunnanine A [42], hyperfol B [43], pegaharmol A [44], hunzeylanine A [45], mucroniferal A [46], perovsfolin A [47], horienoid A [48], and erythrivarine J [49] were correctly classified. On the other hand, the fungal rhizolutin [50] and myxadazoles A [51] and the bacterial marinoterpin A [52] were misclassified. Note that in these cases, the percentage values for the assigned class were lower than for the correct predictions.
As an additional test of our online tool, we investigated the predicted origin of the 3364 NPs (Additional file 1) in COCONUT reported with an origin and a publication for which the organism name was reported (e.g. Brachystemma calycinum) but not the corresponding taxonomical annotation as plant, fungi, or bacteria. Checking individual predictions showed that the predicted origin was in many cases correct, in line with our performance evaluation. For example, the 49 NPs with Euphorbia as a source, many of which were peracetylated polycyclic terpene alcohols, as well as the 45 NPs with Radula as a source, which were polyphenols and terpenes, were all correctly assigned to a plant origin.
In several cases, the SVM prediction conflicted with the taxonomy of the reported source organism. For example, the indole alkaloids cephalinones A-D and cephalandoles A-C, isolated from the orchid Cephalanceropsis gracilis [53] and whose structures were partly revised by total synthesis [54], were all assigned to bacteria by our SVM. In fact, these NPs might stem from an endophytic bacterium considering that endophytic microorganisms produce several related indole alkaloids [55]. Our SVM also reassigned the cancer drug maytansine from an annotated plant origin in the training set to a predicted bacterial origin, in line with its endophytic origin [27]. On the other hand, our classifier also assigned a bacterial origin to two cyclic peptides (CNP0085258 and CNP0085259) [56] and a cyclotide (CNP0085363) [57] isolated from plants. Although these plants indeed contain endophytic bacteria, the plant origin of such peptides is well established [58, 59], and the SVM assignment to bacteria reflects the fact that the majority of cyclic peptides and cyclotides in the COCONUT set used for training the SVM were assigned to bacteria, compared to only a handful of cyclotides of plant origin.
While the classifier may point to the possible endophytic origin of NPs isolated from plants, its use on NPs from other sources is problematic. For instance, among the 1,035 marine NPs from COCONUT with an annotated origin, 639 were assigned to plants by our SVM. This prediction must be mostly wrong considering that most marine organisms such as algae, corals, and sponges are not plants. For example, the 44 NPs from the soft coral Sinularia, or the macrocyclic terpene lactone lobophytolide A (CNP0275045) stemming from the soft coral Lobophytum cristagalli [60, 61], were all incorrectly assigned to plants. However, the remaining 231 fungal and 165 bacterial predictions might be partly correct considering that many marine organisms carry endosymbionts. For example, our classifier assigned a bacterial origin to echinosulfonic acid B (CNP0318329), a brominated bis-indole NP isolated from the marine sponge Echinodictyum gorgonoides. In this case, other authors have reported the isolation of a bacterial strain from the same sponge as a probable source of its biological activities.
In summary, we visualized the chemical space covered by a subset of 60 thousand NPs from the COCONUT database with an assigned origin and publication using a TMAP calculated on the basis of MAP4 as molecular fingerprint, which is available at https://tm.gdb.tools/map4/coconut_tmap/. Analyzing this TMAP revealed that NPs of plant, fungal or bacterial origin form well-separated groups. We then trained an SVM classifier with the MAP4 fingerprint to assign the origin of NPs and found that it performed excellently and significantly better than classifiers trained with ECFP4, RDKit AP, or physico-chemical descriptors.
To help assign NP origin, we then trained a MAP4 SVM classifier using the entire set of 60 thousand NPs. This tool is available online at https://np-svm-map4.gdb.tools/ and returns an origin prediction for any molecular structure drawn or pasted as SMILES. We found that this classifier correctly predicts the origin of plant, bacterial or fungal NPs not included in the 60 thousand COCONUT set used for training, as exemplified by the correct prediction of 17 out of 20 newly published NPs. Broader testing of the classifier with further NPs from COCONUT showed limitations for NPs not of plant or microbial origin, such as marine NPs, but it also led to interesting use cases suggesting that the tool might help to assign NP origin. This concerns in particular NPs isolated from plants but which might in fact be produced by endophytic microorganisms. Such information could be essential when searching for the corresponding biosynthetic genes.
The COCONUT database was downloaded. Only the 135,091 (out of 400,837) entries having a taxonomical annotation were selected. The selected subset was further filtered down to the 67,730 entries having an annotation not shorter than ten characters in the DOI field. Then, the taxonomy field was split by commas and matched against the words “plant”/“plants”, “fungi”/“aspergillus”, “bacteria”/“bacillus”/“bacta” to select the NPs with an annotated plant, fungal, or bacterial origin, respectively. Entries matching multiple origins were assigned with the following priority: human > animal > bacteria > fungi > plant > marine. The process led to the selection of 33,772 plant NPs, 15,648 fungal NPs, and 10,751 bacterial NPs with annotated DOI, for a total of 60,171 structures. The number of carbons, oxygens, and nitrogens, the total number of atoms, the number of bonds, and TPSA were extracted from the COCONUT annotations. MW, Fsp3, HBD and HBA counts, and AlogP were calculated using RDKit. The presence or absence of a peptide or glycoside moiety was evaluated using the Daylight SMILES arbitrary target specification (SMARTS) language. SMARTS were used with RDKit to identify COCONUT entries containing a dipeptide substructure, defined as “[NX3,NX4+][CH1,CH2][CX3](=[OX1])[NX3,NX4+][CH1,CH2][CX3](=[OX1])[O,N]”, or containing a glycoside, defined as a cyclic N- or O-acetal substructure with the SMARTS “[CR][OR][CHR]([OR0,NR0])[CR]”. Substructures were used only for recognizing and labeling peptidic and glycosylated NPs, and they were not removed.
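The DOI filter and the keyword-based origin assignment can be sketched as below. This is an illustration under stated assumptions, not the authors' script: the function names are hypothetical, matching is assumed to be a case-insensitive exact match on each comma-separated token, and the keywords for the human, animal and marine categories are guesses, since the text only lists the plant, fungal and bacterial keywords.

```python
from typing import Dict, List, Optional, Set

# Keywords for plant, fungi and bacteria come from the text;
# those for human, animal and marine are assumptions for illustration.
KEYWORDS: Dict[str, Set[str]] = {
    "human": {"homo sapiens"},
    "animal": {"animal", "animals"},
    "bacteria": {"bacteria", "bacillus", "bacta"},
    "fungi": {"fungi", "aspergillus"},
    "plant": {"plant", "plants"},
    "marine": {"marine"},
}
# Priority order stated in the text: human > animal > bacteria > fungi > plant > marine.
PRIORITY: List[str] = ["human", "animal", "bacteria", "fungi", "plant", "marine"]


def has_publication(doi_field: str) -> bool:
    """Keep entries whose DOI annotation is at least ten characters long."""
    return len(doi_field) >= 10


def assign_origin(taxonomy_field: str) -> Optional[str]:
    """Return the highest-priority origin matched by any comma-separated token."""
    tokens = {token.strip().lower() for token in taxonomy_field.split(",")}
    for origin in PRIORITY:
        if tokens & KEYWORDS[origin]:
            return origin
    return None


print(assign_origin("Plants, Bacteria"))
print(assign_origin("Aspergillus"))
print(has_publication("10.1021/np030397v"))
```

An entry annotated with both a plant and a bacterial keyword is resolved to bacteria by the priority order, and an entry with no recognized keyword is left unassigned.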
The 1024-dimensional MinHashed atom pair fingerprint of radius 2 (MAP4) was calculated using the open-source code of MAP4.
The indices generated by the MinHash procedure of the MAP4 calculation were used to create a locality-sensitive hashing (LSH) forest of 32 trees. Then, for each structure, the 20 approximate nearest neighbors (NNs) in the MAP4 feature space were extracted from the LSH forest, and the tree layout was calculated. The LSH forest and the minimum spanning tree layout were calculated using the TMAP open-source code. Finally, Faerun was used to display the obtained layout interactively.
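The neighbor-graph and spanning-tree steps can be illustrated with a small stand-in written in plain Python. It does not use the TMAP library: exact brute-force neighbor search replaces the LSH forest, Kruskal's algorithm with a union-find replaces TMAP's own tree construction, and no 2D layout is computed. All function names are hypothetical, and the toy fingerprints have four positions instead of 1024.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[float, int, int]  # (distance, node_i, node_j) with node_i < node_j


def minhash_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """One minus the fraction of positions at which two fingerprints agree."""
    return 1.0 - sum(x == y for x, y in zip(a, b)) / len(a)


def knn_edges(fingerprints: Sequence[Sequence[int]], k: int) -> List[Edge]:
    """Exact k-nearest-neighbor graph (the text uses 20 approximate neighbors)."""
    edges = set()
    for i, fp in enumerate(fingerprints):
        ranked = sorted(
            (minhash_distance(fp, other), j)
            for j, other in enumerate(fingerprints)
            if j != i
        )
        for dist, j in ranked[:k]:
            edges.add((dist, min(i, j), max(i, j)))
    return sorted(edges)


def minimum_spanning_forest(n_nodes: int, edges: Sequence[Edge]) -> List[Edge]:
    """Kruskal's algorithm: add the shortest edges that do not close a cycle."""
    parent = list(range(n_nodes))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for dist, i, j in sorted(edges):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((dist, i, j))
    return tree


toy = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 5], [9, 9, 9, 9]]
tree = minimum_spanning_forest(len(toy), knn_edges(toy, k=2))
print(tree)
```

If the neighbor graph is disconnected, the result is a forest rather than a single tree, which is why the helper is named accordingly.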
MAP4 SVM implementation
The COCONUT subset entries used to generate the TMAP were assigned to the training or test set with a 50% random split. The SVM was trained using the MAP4 fingerprints of the training set. It utilized a custom kernel that calculates the similarity matrix between two sets of MAP4 fingerprints, where the similarity of fingerprint a and fingerprint b is calculated by (1) counting the elements with the same value and the same index across a and b, and (2) dividing the obtained value by the number of elements of fingerprint a. The class weights were inversely proportional to the class frequency, and the hyperparameter C was optimized using fivefold cross-validation. During the hyperparameter optimization, 20% of the training set was left out as a validation set, and the balanced accuracy of the validation set was maximized. The hyperparameter C was optimized among the values 0.1, 1, 10, 100, and 1000, resulting in C = 1. Because SVMs are intrinsically binary classifiers, the three-class classifier was implemented using scikit-learn with the “one versus rest” strategy, in which one classifier per class is trained in the background and each class is fitted against all the other classes. All hyperparameters not mentioned here were kept at their default values. Platt scaling was used to obtain probabilistic prediction values. After the evaluation process, a second version of the MAP4 SVM classifier was trained using both the training and test sets to learn from all 60 thousand curated data points.
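The custom kernel described in this paragraph can be written down directly. The following plain-Python sketch uses hypothetical function names and toy four-element fingerprints; it illustrates the similarity definition only and is not the authors' implementation.

```python
from typing import List, Sequence


def map4_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Count positions with equal values, divided by the length of fingerprint a."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def kernel_matrix(
    rows: Sequence[Sequence[int]], cols: Sequence[Sequence[int]]
) -> List[List[float]]:
    """Similarity of every fingerprint in `rows` to every fingerprint in `cols`."""
    return [[map4_similarity(a, b) for b in cols] for a in rows]


train = [[1, 2, 3, 4], [1, 2, 0, 0], [7, 7, 7, 7]]
gram = kernel_matrix(train, train)
for row in gram:
    print(row)
```

Because all fingerprints have the same length, the matrix computed on one set against itself is symmetric with ones on the diagonal. A precomputed matrix of this kind is the sort of input that kernel-based SVM implementations can accept in place of raw feature vectors.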
MAP4, ECFP4, RDKit AP, and properties SVMs comparison
The MAP4, ECFP4, and the RDKit AP fingerprints and a set of 11 properties (MW, Fsp3, HBD and HBA count, AlogP, number of carbons, oxygens, and nitrogens, total number of atoms, number of bonds, and TPSA) were used to train four different SVM classifiers in a fivefold cross-validation. For all classifiers, the class weights were inversely proportional to the class frequency, and the hyperparameters were optimized using 10% of the available data (Table 4). For the properties SVM, the 11 values were scaled to zero mean and unit variance.
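Two preprocessing steps named here, class weighting and property scaling, can be made concrete with a short sketch. The weighting formula, total / (number of classes × class count), is the heuristic scikit-learn applies for `class_weight="balanced"`; that the authors used exactly this setting is an assumption. The helper names and the example property values are hypothetical, while the class counts are those of the COCONUT subset.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Sequence


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weights inversely proportional to class frequency."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def standardize(values: Sequence[float]) -> List[float]:
    """Scale one property column to zero mean and unit (population) variance."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Class counts of the selected COCONUT subset, as given in the text.
coconut_counts = {"plant": 33772, "fungi": 15648, "bacteria": 10751}
weights = balanced_class_weights(coconut_counts)
print(weights)

# Hypothetical molecular weights, used only to show the scaling.
scaled_mw = standardize([100.0, 300.0, 500.0])
print(scaled_mw)
```

With these weights, the least frequent class (bacteria) receives the largest weight, so each class contributes equally to the total weighted sample count.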
Classifiers evaluation metrics
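The F1 score is defined as the harmonic mean of precision and recall:
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
where TP stands for true positives, TN for true negatives, FP for false positives, and FN for false negatives predicted by the classifier.
The balanced accuracy is defined as:
$$\mathrm{balanced\ accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
The Matthews correlation coefficient (MCC) is a correlation between the observed and the predicted class and it is defined as:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$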
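As a worked illustration, the three metrics can be computed from the four confusion-matrix counts using their standard binary definitions. The function names and the example counts are hypothetical. For the three-class problem treated here, these formulas apply to each class in a one-versus-rest sense, and how the per-class values were aggregated into the reported scores is not specified in this sketch.

```python
from math import sqrt


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in terms of counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of sensitivity (recall of positives) and specificity (recall of negatives)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Correlation between observed and predicted binary labels."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical counts: 10 positives (8 found), 10 negatives (9 rejected).
tp, tn, fp, fn = 8, 9, 1, 2
print(f1_score(tp, fp, fn))
print(balanced_accuracy(tp, tn, fp, fn))
print(matthews_corrcoef(tp, tn, fp, fn))
```

For these counts the sensitivity is 0.8 and the specificity is 0.9, so the balanced accuracy is their mean, while F1 ignores the true negatives entirely and MCC uses all four counts.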
Online MAP4 SVM
The MAP4 SVM classifier trained with the whole 60 thousand COCONUT subset is available at https://np-svm-map4.gdb.tools/. The query molecule can be provided as a drawn structure or pasted SMILES in the JSME editor. The given query is canonicalized, chirality information is removed with RDKit, and the MAP4 fingerprint is calculated. To obtain probabilistic prediction values for each class, we use Platt scaling.
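Platt scaling maps an SVM decision value f to a probability through a fitted sigmoid, P = 1 / (1 + exp(A·f + B)). The sketch below shows only this mapping. The parameter values A = -2 and B = 0 are hypothetical; in practice A and B are fitted on held-out decision values, and the way per-class values are combined into the percentages shown by the online tool is not covered here.

```python
from math import exp


def platt_probability(decision_value: float, a: float, b: float) -> float:
    """Sigmoid 1 / (1 + exp(a * f + b)), evaluated in a numerically stable way."""
    z = a * decision_value + b
    if z >= 0:
        e = exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + exp(z))


# Hypothetical parameters; a negative slope makes larger margins more probable.
A, B = -2.0, 0.0
for f in (-3.0, 0.0, 3.0):
    print(f, platt_probability(f, A, B))
```

A decision value of zero, which lies on the separating hyperplane, maps to a probability of 0.5 when B is zero, and the probability rises monotonically with the decision value for a negative A.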
The code used for the presented work is publicly available at https://github.com/reymond-group/Coconut-TMAP-SVM.
COCONUT: Collection of Open Natural Products
HBA: Hydrogen bond acceptor
HBD: Hydrogen bond donor
LSH: Locality sensitive hashing
MAP4: MinHashed atom pair fingerprint
MCC: Matthews correlation coefficient
SMARTS: SMILES arbitrary target specification
SMILES: Simplified molecular-input line-entry system
SVM: Support vector machine
TPSA: Topological polar surface area
Dias DA, Urban S, Roessner U (2012) A historical overview of natural products in drug discovery. Metabolites 2:303–336. https://doi.org/10.3390/metabo2020303
Newman DJ, Cragg GM (2020) Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J Nat Prod 83:770–803. https://doi.org/10.1021/acs.jnatprod.9b01285
Chen Y, de Bruyn Kops C, Kirchmair J (2017) Data resources for the computer-guided discovery of bioactive natural products. J Chem Inf Model 57:2099–2111. https://doi.org/10.1021/acs.jcim.7b00341
Chen Y, Kirchmair J (2020) Cheminformatics in natural product-based drug discovery. Mol Inf. https://doi.org/10.1002/minf.202000171
Ertl P, Roggo S, Schuffenhauer A (2008) Natural product-likeness score and its application for prioritization of compound libraries. J Chem Inf Model 48:68–74. https://doi.org/10.1021/ci700286x
Zaid H, Raiyn J, Nasser A et al (2010) Physicochemical properties of natural based products versus synthetic chemicals. Open Nutraceut J 3:194–202
Yu MJ (2011) Natural product-like virtual libraries: recursive atom-based enumeration. J Chem Inf Model 51:541–557. https://doi.org/10.1021/ci1002087
Vanii Jayaseelan K, Moreno P, Truszkowski A et al (2012) Natural product-likeness score revisited: an open-source, open-data implementation. BMC Bioinform 13:106. https://doi.org/10.1186/1471-2105-13-106
Chen Y, Stork C, Hirte S, Kirchmair J (2019) NP-scout: machine learning approach for the quantification and visualization of the natural product-likeness of small molecules. Biomolecules 9:43. https://doi.org/10.3390/biom9020043
Pereira F (2021) Machine learning methods to predict the terrestrial and marine origin of natural products. Mol Inf. https://doi.org/10.1002/minf.202060034
Djoumbou Feunang Y, Eisner R, Knox C et al (2016) ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J Cheminform 8:61. https://doi.org/10.1186/s13321-016-0174-y
Kim H, Wang M, Leber C et al (2020) NPClassifier: a deep neural network-based structural classification tool for natural products. https://doi.org/10.26434/chemrxiv.12885494.v1
Zabolotna Y, Ertl P, Horvath D et al (2021) NP navigator: a new look at the natural product chemical space. Mol Inf. https://doi.org/10.1002/minf.202100068
Capecchi A, Reymond J-L (2020) Assigning the origin of microbial natural products by chemical space map and machine learning. Biomolecules 10:1385. https://doi.org/10.3390/biom10101385
van Santen JA, Jacob G, Singh AL et al (2019) The natural products atlas: an open access knowledge base for microbial natural products discovery. ACS Cent Sci 5:1824–1833. https://doi.org/10.1021/acscentsci.9b00806
Capecchi A, Probst D, Reymond J-L (2020) One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminform 12:43. https://doi.org/10.1186/s13321-020-00445-4
Probst D, Reymond J-L (2020) Visualization of very large high-dimensional data sets as minimum spanning trees. J Cheminform 12:12. https://doi.org/10.1186/s13321-020-0416-x
Wu Y, Liao H, Liu L-Y et al (2020) Phakefustatins A–C: kynurenine-bearing cycloheptapeptides as RXRα modulators from the marine sponge Phakellia fusca. Org Lett. https://doi.org/10.1021/acs.orglett.0c01586
Han M, Liu F, Zhang F et al (2012) Bacterial and archaeal symbionts in the South China Sea sponge Phakellia fusca: community structure, relative abundance, and ammonia-oxidizing populations. Mar Biotechnol N Y N 14:701–713. https://doi.org/10.1007/s10126-012-9436-5
Meunier L, Tocquin P, Cornet L et al (2020) Palantir: a springboard for the analysis of secondary metabolite gene clusters in large-scale genome mining projects. Bioinformatics 36:4345–4347. https://doi.org/10.1093/bioinformatics/btaa517
Villa TG, Viñas M (2019) Horizontal gene transfer: breaking borders between living kingdoms. Springer International Publishing, Cham
Hardoim PR, van Overbeek LS, Berg G et al (2015) The hidden world within plants: ecological and evolutionary considerations for defining functioning of microbial endophytes. Microbiol Mol Biol Rev MMBR 79:293–320. https://doi.org/10.1128/MMBR.00050-14
Strobel G, Daisy B, Castillo U, Harper J (2004) Natural products from endophytic microorganisms. J Nat Prod 67:257–268. https://doi.org/10.1021/np030397v
Ye K, Ai H-L, Liu J-K (2021) Identification and bioactivities of secondary metabolites derived from endophytic fungi isolated from ethnomedicinal plants of Tujia in Hubei province: a review. Nat Prod Bioprospecting 11:185–205. https://doi.org/10.1007/s13659-020-00295-5
Howat S, Park B, Oh IS et al (2014) Paclitaxel: biosynthesis, production and future prospects. N Biotechnol 31:242–245. https://doi.org/10.1016/j.nbt.2014.02.010
Shankar Naik B (2019) Developments in taxol production through endophytic fungal biotechnology: a review. Orient Pharm Exp Med 19:1–13. https://doi.org/10.1007/s13596-018-0352-8
Kusari S, Lamshöft M, Kusari P et al (2014) Endophytes are hidden producers of maytansine in Putterlickia roots. J Nat Prod 77:2577–2584. https://doi.org/10.1021/np500219a
Sorokina M, Merseburger P, Rajan K et al (2021) COCONUT online: collection of open natural products database. J Cheminform 13:2. https://doi.org/10.1186/s13321-020-00478-9
Heim WG, Sykes KA, Hildreth SB et al (2007) Cloning and characterization of a Nicotiana tabacum methylputrescine oxidase transcript. Phytochemistry 68:454–463. https://doi.org/10.1016/j.phytochem.2006.11.003
Hooven HW van den, Lagerwerf FM, Heerma W et al (1996) The structure of the lantibiotic lacticin 481 produced by Lactococcus lactis: location of the thioether bridges. FEBS Lett 391:317–322. https://doi.org/10.1016/0014-5793(96)00771-5
Probst D, Reymond J-L, Wren J (2018) FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web. Bioinformatics 34:1433–1435. https://doi.org/10.1093/bioinformatics/btx760
Zhu G, Hou C, Yuan W et al (2020) Molecular networking assisted discovery and biosynthesis elucidation of the antimicrobial spiroketals epicospirocins. Chem Commun. https://doi.org/10.1039/D0CC03990J
Cheng X, Liang X, Zheng Z-H et al (2020) Penicimeroterpenoids A–C, Meroterpenoids with rearrangement skeletons from the marine-derived fungus Penicillium sp. SCSIO 41512. Org Lett. https://doi.org/10.1021/acs.orglett.0c02160
Cao P-R, Zheng Y-L, Zhao Y-Q et al (2021) Beetleane A and Epicoane A: two carbon skeletons produced by Epicoccum nigrum. Org Lett. https://doi.org/10.1021/acs.orglett.1c00731
Yan D, Matsuda Y (2021) Genome mining-driven discovery of 5-methylorsellinate-derived meroterpenoids from Aspergillus funiculosus. Org Lett. https://doi.org/10.1021/acs.orglett.1c00951
Jiang L, Zhang X, Sato Y et al (2021) Genome-based discovery of enantiomeric pentacyclic sesterterpenes catalyzed by fungal bifunctional terpene synthases. Org Lett 23:4645–4650. https://doi.org/10.1021/acs.orglett.1c01361
Sun J, Zhao G, O’Connor RD et al (2021) Vertirhodins A–F, C-linked pyrrolidine-iminosugar-containing pyranonaphthoquinones from Streptomyces sp. B15-008. Org Lett 23:682–686. https://doi.org/10.1021/acs.orglett.0c03825
Xu ZF, Bo ST, Wang MJ et al (2020) Discovery and biosynthesis of bosamycin from Streptomyces sp. 120454. Chem Sci. https://doi.org/10.1039/D0SC03469J
An JS, Shin B, Kim TH et al (2021) Dumulmycin, an antitubercular bicyclic macrolide from a riverine sediment-derived Streptomyces sp. Org Lett 23:3359–3363. https://doi.org/10.1021/acs.orglett.1c00847
Zhu L, Zhu D-R, Zhou W-X et al (2021) Fortuneicyclidins A and B, pyrrolizidine alkaloids with a 7-azatetracyclo[188.8.131.52.02,8]tridecane core, from Cephalotaxus fortunei. Org Lett 23:2807–2810. https://doi.org/10.1021/acs.orglett.1c00738
Wu J, Zhao S-M, Shi B-B et al (2020) Cage-monoterpenoid quinoline alkaloids with neurite growth promoting effects from the fruits of Melodinus yunnanensis. Org Lett 22:7676–7680. https://doi.org/10.1021/acs.orglett.0c02871
Lou H, Yi P, Hu Z et al (2020) Polycyclic polyprenylated acylphloroglucinols with acetylcholinesterase inhibitory activities from Hypericum perforatum. Fitoterapia 143:104550. https://doi.org/10.1016/j.fitote.2020.104550
Li S-G, Wang Y-T, Zhang Q et al (2020) Pegaharmols A–B, axially chiral β-carboline-quinazoline dimers from the roots of Peganum harmala. Org Lett 22:7522–7525. https://doi.org/10.1021/acs.orglett.0c02709
Zhang J, Yuan M-F, Li S-T et al (2020) Hunzeylanines A–E, five bisindole alkaloids tethered with a methylene group from the roots of Hunteria zeylanica. J Org Chem 85:10884–10890. https://doi.org/10.1021/acs.joc.0c01448
Zhang J, Shi L-Y, Yin X et al (2020) Discovery of novel potential plant growth regulators from Corydalis mucronifera. Fitoterapia 147:104776. https://doi.org/10.1016/j.fitote.2020.104776
Tanaka N, Niwa K, Kajihara S et al (2020) C28 terpenoids from lamiaceous plant Perovskia scrophulariifolia: their structures and anti-neuroinflammatory activity. Org Lett 22:7667–7670. https://doi.org/10.1021/acs.orglett.0c02855
Fan Y-Y, Gan L-S, Chen S-X et al (2021) Horienoids A and B, two heterocoupled sesquiterpenoid dimers from Hedyosmum orientale. J Org Chem. https://doi.org/10.1021/acs.joc.1c00307
Tang Y-T, Wu J, Yu Y et al (2021) Colored dimeric alkaloids from the barks of Erythrina variegata and their neuroprotective effects. J Org Chem. https://doi.org/10.1021/acs.joc.1c01489
Kwon Y, Shin J, Nam K et al (2020) Rhizolutin, a novel 7/10/6-tricyclic dilactone, dissociates misfolded protein aggregates and reduces apoptosis/inflammation associated with Alzheimer’s disease. Angew Chem Int Ed. https://doi.org/10.1002/anie.202009294
Li Y, Zhuo L, Li X et al (2021) Myxadazoles, myxobacterium-derived isoxazole–benzimidazole hybrids with cardiovascular activities. Angew Chem Int Ed 60:21679–21684. https://doi.org/10.1002/anie.202106275
Kim MC, Winter JM, Asolkar RN et al (2021) Marinoterpins A–C: rare linear merosesterterpenoids from marine-derived actinomycete bacteria of the family Streptomycetaceae. J Org Chem. https://doi.org/10.1021/acs.joc.1c00262
Wu P-L, Hsu Y-L, Jao C-W (2006) Indole alkaloids from Cephalanceropsis gracilis. J Nat Prod 69:1467–1470. https://doi.org/10.1021/np060395l
Mason JJ, Bergman J, Janosik T (2008) Synthetic studies of cephalandole alkaloids and the revised structure of cephalandole A. J Nat Prod 71:1447–1450. https://doi.org/10.1021/np800334j
Ishikura M, Yamada K (2009) Simple indole alkaloids and those with a nonrearranged monoterpenoid unit. Nat Prod Rep 26:803–852. https://doi.org/10.1039/B820693G
Zhao J, Zhou L-L, Li X et al (2011) Bioactive compounds from the aerial parts of Brachystemma calycinum and structural revision of an octacyclopeptide. J Nat Prod 74:1392–1400. https://doi.org/10.1021/np200048u
Yeshak MY, Burman R, Asres K, Göransson U (2011) Cyclotides from an extreme habitat: characterization of cyclic peptides from Viola abyssinica of the Ethiopian highlands. J Nat Prod 74:727–731. https://doi.org/10.1021/np100790f
Srivastava S, Dashora K, Ameta KL et al (2021) Cysteine-rich antimicrobial peptides from plants: the future of antimicrobial therapy. Phytother Res 35:256–277. https://doi.org/10.1002/ptr.6823
dos Santos-Silva CA, Zupin L, Oliveira-Lima M et al (2020) Plant antimicrobial peptides: state of the art, in silico prediction and perspectives in the omics era. Bioinf Biol Insights 14:1177932220952739. https://doi.org/10.1177/1177932220952739
Tursch B, Braekman JC, Daloze D et al (1974) Chemical studies of marine invertebrates. X. Lobophytolide, a new cembranolide diterpene from the soft coral Lobophytum cristagalli (Coelenterata, Octocorallia, Alcyonacea). Tetrahedron Lett 15:3769–3772. https://doi.org/10.1016/S0040-4039(01)92004-0
Blunt JW, Copp BR, Munro MHG et al (2010) Marine natural products. Nat Prod Rep 27:165–237. https://doi.org/10.1039/B906091J
Ovenden SPB, Capon RJ (1999) Echinosulfonic acids A–C and echinosulfone A: novel bromoindole sulfonic acids and a sulfone from a Southern Australian marine sponge, Echinodictyum. J Nat Prod 62:1246–1249. https://doi.org/10.1021/np9901027
Dhinakaran DI, Prasad DRD, Gohila R, Lipton P (2012) Screening of marine sponge-associated bacteria from Echinodictyum gorgonoides and its bioactivity. Afr J Biotechnol 11:15469–15476. https://doi.org/10.4314/ajb.v11i88
RDKit: Open-source cheminformatics. https://www.rdkit.org/. Accessed 20 Sept 2021
Daylight. https://www.daylight.com/. Accessed 17 July 2020
Bawa M, Condie T, Ganesan P (2005) LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th international conference on World Wide Web. Association for Computing Machinery, Chiba, Japan, pp 651–660
Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers. MIT Press, Cambridge, pp 61–74
Ralaivola L, Swamidass SJ, Saigo H, Baldi P (2005) Graph kernels for chemical informatics. Neural Netw 18:1093–1110. https://doi.org/10.1016/j.neunet.2005.07.009
Vert JP, Tsuda K, Schölkopf B (2004) A primer on kernel methods. Kernel methods in computational biology. Biologische Kybernetik, Cambridge, pp 35–70
The authors thank Prof. Dr. Olivier Potterat, University of Basel, for critical reading and helpful discussions.
This work was supported by a grant from the Vice-Rectorate Development of the University of Bern to A. C., by the Swiss National Science Foundation Grant no. 200020_178998, and by the European Research Council Grant no. 885076.
The authors declare no conflict of interest.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Capecchi, A., Reymond, JL. Classifying natural products from plants, fungi or bacteria using the COCONUT database and machine learning. J Cheminform 13, 82 (2021). https://doi.org/10.1186/s13321-021-00559-3