SYBA: Bayesian estimation of synthetic accessibility of organic compounds

Voršilák, Milan; Kolář, Michal; Čmelo, Ivan; Svozil, Daniel

doi:10.1186/s13321-020-00439-2

Research article
Open access
Published: 20 May 2020

SYBA: Bayesian estimation of synthetic accessibility of organic compounds

Journal of Cheminformatics volume 12, Article number: 35 (2020) Cite this article

8907 Accesses
60 Citations
6 Altmetric
Metrics details

Abstract

SYBA (SYnthetic Bayesian Accessibility) is a fragment-based method for the rapid classification of organic compounds as easy- (ES) or hard-to-synthesize (HS). It is based on a Bernoulli naïve Bayes classifier that is used to assign SYBA score contributions to individual fragments based on their frequencies in the database of ES and HS molecules. SYBA was trained on ES molecules available in the ZINC15 database and on HS molecules generated by the Nonpher methodology. SYBA was compared with a random forest, that was utilized as a baseline method, as well as with other two methods for synthetic accessibility assessment: SAScore and SCScore. When used with their suggested thresholds, SYBA improves over random forest classification, albeit marginally, and outperforms SAScore and SCScore. However, upon the optimization of SAScore threshold (that changes from 6.0 to – 4.5), SAScore yields similar results as SYBA. Because SYBA is based merely on fragment contributions, it can be used for the analysis of the contribution of individual molecular parts to compound synthetic accessibility. SYBA is publicly available at https://github.com/lich-uct/syba under the GNU General Public License.

Background

Chemical space available for the generation of new molecules is huge [1,2,3,4], making the synthesis and testing of all possible compounds impractical. Therefore chemists, both experimental and computational, developed tools and approaches for the exploration of chemical space with the aim to identify new compounds with desirable physico-chemical, biological and pharmacological properties [5,6,7,8,9,10,11,12]. A major in silico method for chemical space exploration is de novo molecular design in which new virtual molecules are assembled from scratch [13,14,15,16,17,18]. An essential requirement for de novo designed compounds is their synthetic accessibility. Synthetic accessibility is commonly incorporated into de novo design programs by employing chemical strategies that guide an assembly process. For example, the connections between certain atom types can be disallowed [19], established chemical reactions can be used to connect individual molecular building blocks [20, 21] or the retrosynthetic rules can be directly incorporated into the assembly process [22, 23].

The latest development in de novo molecular design are molecular generators based on deep learning approaches [24,25,26]. These typically construct new molecules not by assembling the building blocks, but by producing chemically feasible SMILES strings [27,28,29,30,31,32]. The generators are able to produce millions of virtual compounds, synthetic accessibility of which has to be quickly and efficiently assessed. Quick synthetic accessibility assessment can be based [33] on molecule’s complexity that is typically calculated [34,35,36,37] from the number of atoms, bonds, rings, and/or hard-to-synthesize motifs, such as chiral centers or uncommon ring fusions. However, the definition of molecular complexity is ambiguous and context dependent [38, 39]. The structural complexity is not equivalent to the synthetic one as complexity-based metrics do not incorporate any information about starting materials and tend to remove molecules that can be synthesized from already existing complex precursors [40, 41]. A better way of synthetic accessibility assessment is to use the complexity of the synthetic route [42]. Based on this principle, SCScore, a data-driven metric designed to describe real syntheses, was developed recently [43]. SCScore is based on the idea that reaction products are synthetically more complex than reactants. To quantify this, a deep feed-forward neural network, that assigns a synthetic complexity score between 1 and 5, was trained on 22 million reactant-product pairs from the Reaxys database [44]. Using the hinge loss objective function, that supports the separation between scores in each reactant–product pair, the model learns synthetic complexity score that correlates with the number of reaction steps, but does not rely on the availability of reaction database or organic chemist ranking.

SAScore [45], another popular and rapid method for synthetic accessibility assessment, is based on the analysis of ECFP4 [46] fragments obtained from one million compounds randomly selected from the PubChem database [47]. The main idea of SAScore is that when a molecular fragment occurs often in the PubChem database, it contributes to the synthetic accessibility of a molecule more than a less frequently occurring fragment. Each fragment is assigned a numerical score, frequent fragments have positive scores and less frequent fragments have negative scores. In addition to the fragment score, SAScore consists of a complexity penalty and symmetry bonus. These terms penalize nonstandard structural motives such as macrocycles, stereo centers, spiro and bridge atom, but reward the symmetry of a structure. SAScore acquires values between 1 (easy to make) and 10 (very difficult to make), where 6.0 is suggested by the authors [45] as a threshold to distinguish between easy- and hard-to-synthesize compounds. SAScore is a popular high-throughput measure and proved to be a very useful tool in many cheminformatics applications [27, 48,49,50].

In the present work, we further expand on main concepts of SAScore construction. We developed SYBA (SYnthetic Bayesian Accessibility), a rapid fragment-based score derived using Bayesian probabilistic modeling. Fragment contributions to SYBA are calculated not only from fragments present in synthetically accessible molecules, but also from fragments appearing in hard-to-synthesize molecules.

Methods

SYBA score derivation

SYBA is a Bernoulli naïve Bayes classifier based on the frequency of molecular fragments that are present in the database of easy-to-synthesize (ES) and hard-to-synthesize (HS) molecules and on the assumption of the independence of molecular fragments. Though such assumption is bold, it was shown to provide surprisingly good results in many cheminformatics studies [51,52,53,54,55].

Each compound is represented by a binary fingerprint $\varvec{F} = \left[ {f_{1} ,f_{2} , \ldots ,f_{M} } \right]$ of length M where $f_{i}$ indicates the presence ($f_{i} = 1$) or absence ($f_{i} = 0$) of the specific fragment i in the compound. SYBA uses this fingerprint to assign the molecule to a class $C \in \left\langle {{\text{ES}},{\text{HS}}} \right\rangle$. The calculation is based on the Bayes theorem

$$p\left( {C |\varvec{F}} \right) = \frac{{p\left( {\varvec{F}|C} \right) p\left( C \right)}}{{p\left( \varvec{F} \right)}},$$

(1)

where $p\left( {C |\varvec{F}} \right)$ is the posterior probability that a compound with a certain set of molecular fragments F belongs to the class $C$. The likelihood $p(\varvec{F}|C)$ is the conditional probability that a compound from the class $C$ contains a set of molecular fragments F. The marginal probabilities $p\left( \varvec{F} \right)$ and $p\left( C \right)$ express our belief to observe a set of molecular fragments F and the molecule that belongs to the class $C$.

The SYBA score is defined as the logarithm of the ratio of the posterior probabilities that the molecule belongs to the ES and HS classes,

$${\text{SYBA}}\left( \varvec{F} \right) = \ln \left( {\frac{{p\left( {{\text{ES|}}\varvec{F}} \right)}}{{p\left( {{\text{HS|}}\varvec{F}} \right)}}} \right) .$$

(2)

Using Eq. 1, the SYBA score can be expressed as

$${\text{SYBA}}\left( \varvec{F} \right) = \ln \left( {\frac{{p\left( {\text{ES}} \right)}}{{p\left( {\text{HS}} \right)}}} \right) + \ln \left( {\frac{{p\left( {\varvec{F} | {\text{ES}}} \right)}}{{p\left( {\varvec{F} | {\text{HS}}} \right)}}} \right) .$$

(3)

In the data set SYBA was derived from (further referred to as the training data set S), ES and HS compounds are represented evenly, the priors $p\left( {\text{ES}} \right)$ and $p\left( {\text{HS}} \right)$ are thus equal and the term $\ln \left( {\frac{{p\left( {\text{ES}} \right)}}{{p\left( {\text{HS}} \right)}}} \right)$ becomes zero:

$${\text{SYBA}}\left( \varvec{F} \right) = \ln \left( {\frac{{p\left( {\varvec{F} | {\text{ES}}} \right)}}{{p\left( {\varvec{F} | {\text{HS}}} \right)}}} \right) .$$

(4)

Assuming the independence of molecular fragments, the conditional probability $p\left( {\varvec{F} |C} \right)$ factorizes to $p\left( {\varvec{F} |C} \right) = \mathop \prod \limits_{i = 1}^{M} p(f_{i} |C)$ and the SYBA score simplifies to

$${\text{SYBA}}\left( \varvec{F} \right) = \sum\nolimits_{i = 1}^{M} {s_{i} \left( {f_{i} } \right)}$$

(5)

where $s_{i} \left( {f_{i} } \right)$ is the score contribution from the fragment i (SYBA fragment score) given as

$$s_{i} \left( {f_{i} } \right) = \ln \left( {\frac{{p\left( {f_{i} | {\text{ES}}} \right)}}{{p\left( {f_{i} | {\text{HS}}} \right)}}} \right).$$

(6)

Considering that $p\left( {f_{i} | {\text{ES}}} \right) = 1 - p\left( {f_{i} | {\text{HS}}} \right)$, the fragment scores $s_{i} \left( {f_{i} } \right)$ in Eq. 6 represent logits and can be expressed using the fragment frequencies in the training data set S as

$$s_{i} \left( {f_{i} } \right) = { \ln }\frac{{N_{\text{HS}} + 2}}{{N_{\text{ES}} + 2}} + f_{i} {\text{ln}}\frac{{\left( {n_{{{\text{ES}},i}} + 1} \right)}}{{\left( {n_{{{\text{HS}},i}} + 1} \right)}} + (1 - f_{i} ) { \ln }\frac{{\left( {N_{\text{ES}} - n_{{{\text{ES}},i}} + 1} \right)}}{{\left( {N_{\text{HS}} - n_{{{\text{HS}},i}} + 1} \right)}},$$

(7)

where $N_{\text{HS}}$ is the number of HS and $N_{\text{ES}}$ the number of ES molecules in the training data set S, $n_{{{\text{HS}},i}}$ is the number of HS molecules in the training data set S that contain the fragment i, and $n_{{{\text{ES}},i}}$ is the number of ES molecules in the training data set S that contain the fragment i. See Additional file 2 for a detailed derivation. Positive $s_{i} \left( {f_{i} } \right)$ means that the presence/absence of the fragment i is more probable in ES than in HS class and vice versa. Positive SYBA means that the compound belongs more likely to the ES class, while negative SYBA means that the compound belongs more likely to the HS class. The higher the absolute value of SYBA, the more evidence for the class membership is present in the molecule.

Training set construction

The training data set S consists of two subsets: S₊ contains ES structures and S_- contains HS structures (Fig. 1, Additional file 1). While ES molecules can be readily obtained, for example, from the ZINC database of purchasable compounds [56, 57], no equivalent database of HS molecules exists. However, HS molecules can be designed by Nonpher [58], a method based on a molecular morphing approach [59]. In Nonpher, a starting molecule is gradually transformed into a more complex compound using small structural perturbations, such as the addition or removal of an atom or a bond. To prevent the creation of overly complex structures, four complexity indices (Bertz [34], Whitlock [35], BC [36] and SMCM [37]) are monitored and once their respective thresholds (Additional file 2: Table S1) are exceeded, Nonpher is stopped.

Using Nonpher, 693 353 HS molecules were generated and they form the S_- data set. The S₊ data set, containing ES compounds, is formed by the same number of molecules randomly chosen (excluding natural products) from the ZINC15 database [57] so that their distribution of the number of heavy atoms is the same as in the S_- data sets. Every S₊ and S_- molecule was fragmented using the Morgan fingerprint function in the RDKit toolkit [60]. Fragments with the radius of 4 and smaller, corresponding to radial ECFP8 [46] fragments, were used. This type of fragments consists of a central atom and atoms distant from the central atom up to four bonds. Besides ECFP8 fragments, the number of stereocenters was also included into SYBA as the molecules with more stereocenters are typically more difficult to synthesize. The stereo score is based, similarly to the fragment score, on the analysis of the number of stereocenters in the training set S. To obtain the stereo score, molecules were divided into 6 bins differing by the number of stereocenters (0, 1, 2, 3, 4 and 5+) and individual score contributions were calculated from Eq. 7.

Test set construction

SYBA performance could have been assessed using a test set created in a similar way as the training set S, i.e. using HS compounds generated by Nonpher. However, such test set would be clearly biased towards chemical space covered by Nonpher. Therefore, two test sets were constructed in a conceptually different manner. First test set, further denoted as T_MC, was manually curated from the literature, second test set, referred to as T_CP, was computationally picked from the ZINC15 [57] and GDB17 databases [61].

HS compounds in T_MC (denoted as T_MC−) were obtained by the analysis [58] of 296 published compounds assessed by experienced medicinal chemists [41, 45, 62, 63]. Based on original chemists’ scores, the final T_MC- data set of 40 HS compounds was assembled. A complementary T_MC+ data set consists of 40 ES compounds selected from the ZINC15 database [57] in such a way that the distribution of the number of their heavy atoms is the same as in the T_MC- data set. Because small T_MC size may bias the results, 30 different T_MC data set instances were generated using the same 40 T_MC- compounds, but different 40 T_MC+ compounds (Additional file 3).

HS compounds in the T_CP test set (Additional file 4), denoted as T_CP-, were obtained by the analysis of the publicly available subset of 50 M molecules from the GDB-17 database [61]. Only molecules exceeding thresholds (Additional file 2: Table S1) of all monitored complexity indices (Bertz [34], Whitlock [35], BC [36] and SMCM [37]) were considered to be HS. In total, 3 581 molecules form the T_CP- data set. A complementary T_CP+ data set consists of the same number of compounds randomly selected from the ZINC15 database [57] that follow the same size distribution as HS compounds and that, in addition, do not exceed any of the aforedescribed complexity indices. Data sets used in the present work are summarized in Fig. 1.

Performance evaluation

The performance of classification models studied in the present work was assessed by four different metrics: the classification accuracy (Acc), sensitivity (SN), specificity (SP) and area under the ROC curve (AUC). Acc gives the percentage of correctly classified samples regardless of their class.

$${\text{Accuracy }}\left( {\text{Acc}} \right) = \frac{TP + TN}{TP + TN + FN + FP}$$

(8)

where true positives (TP) are ES compounds predicted by a model to be ES, true negatives (TN) are HS compounds predicted to be HS, false positives (FP) are HS compounds predicted to be ES and false negatives (FN) are ES compounds predicted to be HS. The accuracy can also be evaluated for positive and negative classes independently leading to SN and SP. SN is the percentage of correctly predicted positive class compounds, while the percentage of correctly predicted negative class compounds is known as SP.

$${\text{Sensitivity }}\left( {\text{SN}} \right) = \frac{TP}{TP + FN}$$

(9)

$${\text{Specificity }}\left( {\text{SP}} \right) = \frac{TN}{TN + FP}$$

(10)

SN and SP can be combined in the receiver operating characteristic (ROC) curve that is the graphical representation of the trade-off between true positive rate (given as SN) and false positive rate (given as 1 − SP) over all possible thresholds (Fig. 2). The area under the ROC curve (AUC) is the quantitative measure of the performance of a classifier and is equal to the probability that a classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. A random classifier has AUC of 0.5, while AUC for a perfect classifier is equal to 1.

Random forest classification, SAScore and SCScore

Because of its wide adoption in various cheminformatics applications [58, 64,65,66], the random forest (RF) classifier with compounds encoded by 1024-bits long Morgan fingerprint with radius 2 was used as a baseline method with which SYBA, SAScore [45] and SCScore [43] were compared. The RF classifier was implemented in Scikit-learn [67]. Two RF hyperparameters were optimized: the number of trees (50, 100, 300 and 500) and the maximum number of features considered when looking for the best split (10% out of 1024 = 102, 25% = 256, 50% = 512, 75% = 768, 100% = 1024, $\sqrt {1024} = 32$ and $\log_{2} \left( {1024} \right) = 10$). For each pair of hyperparameters, RF model was trained using the training set S and the prediction accuracy was evaluated on the test set T_CP (Additional file 2: Table S2, Figures S2–S8). The setting used in this work (100 trees and 32 features) represents the best trade-off between computational efficiency and prediction accuracy [64]. RF was trained using the training set S (Fig. 1). SAScore was calculated by the RDKit toolkit [60]. SCScore code was downloaded from the public GitHub repository [68].

Classification thresholds

In SYBA, more positive value means a higher probability that the compound is ES and more negative value indicates a higher probability that the compound is HS (Eq. 4). The threshold value of zero is used to distinguish between ES and HS compounds. For SAScore, the recommended value of 6.0 [45] was used as a threshold. In RF, the final prediction is based on a number of decision trees that predict either of classes. Here, 0.5 is used as a threshold, i.e., if more decision trees predict ES than HS class, a compound is classified as ES and vice versa. For SCScore [43], no threshold was suggested by the authors. In such case, the threshold can be identified by the analysis of the ROC curve. A frequently used measure that enables the selection of an optimal threshold is the Youden index (YI) [69, 70]. YI is defined as

$$YI = { \hbox{max} }\left( {SN + SP - 1} \right)$$

(11)

and ranges between 0 and 1 (Fig. 2). The optimal threshold value is selected by maximizing YI, i.e., by maximizing the sum of SN and SP.

Statistical comparison of model performance

The performance of studied classification models was compared using non-parametric Cochran’s Q test [71], an omnibus test for testing for differences between three or more machine learning models. In the case of the statistically significant result of Cochran’s Q test, differing pairs of classification models were identified by McNemar’s post hoc paired test [72] with Benjamini–Hochberg false discovery rate adjustment [73]. McNemar’s test checks if the distribution of disagreements between two methods is imbalanced. The statistical significance for all tests in the present work was assessed at the significance level α = 0.05.

Results and discussion

Chemical space covered by SYBA data sets

The examples of training set compounds are given in Additional file 2: Figures S9–S12. In total, 3 439 074 ECFP8 fragments were obtained for ES compounds and 23 447 524 fragments for HS compounds. 458 040 fragments are common for both S + and S- subsets. 55.0% of S+ fragments and 91.7% of S− fragments are present only once in the whole data set S (singletons). Typical ES and HS fragments are shown in Figs. 3 and 4, fragments with very low SYBA in Additional file 2: Figure S13 and compounds containing these fragments in Additional file 2: Figure S14.

Though the number of fragments in HS compounds is much larger than in ES compounds, chemical space is equally covered by both HS and ES molecules and there is no bias towards HS compounds (Fig. 5).

The visualization of chemical space (Fig. 6) covered by S, T_CP and T_MC compounds shows that test set compounds lie within chemical space of training compounds. The examples of T_CP and T_MC compounds are given in Additional file 2: Figures S15–S21.

SYBA also enables the visualization and interpretation of fragment score contributions. Each SYBA fragment score can be projected to the corresponding fragment root atom and this projection can be used to analyze which fragments contribute unfavorably to molecule synthetic accessibility (Fig. 7).

Classifier performance on manually curated test set

The differences in classification of T_MC (Fig. 1) compounds using SYBA, SAScore and RF with default thresholds are statistically significant. The Cochran’s Q test p value is 2 × 10⁻⁵ for the T_MC test set with the smallest SYBA AUC. The results of the corresponding McNemar’s paired tests [72] are summarized in Table 1. While RF and SYBA do not differ significantly, both RF and SYBA yield significantly better results than SAScore.

Table 1 The results of McNemar’s two-sided paired tests for the T_MC test set with the smallest SYBA AUC (AUC = 0.830)

Full size table

The quality measures of the classification of the compounds in the manually curated T_MC test set, averaged over 30 T_MC instances, are summarized in Table 2. The corresponding confusion matrices are reported in Additional file 2: Panel S1 and Panel S2 and ROC curves are shown in Fig. 8.

Table 2 The performance of classification models for the manually curated T_MC test set

Full size table

In terms of Acc, the best performing model is SYBA followed by RF and SAScore. While SYBA and RF sensitivity and specificity are well balanced, SAScore shows high sensitivity (SN = 0.934, i.e., on average 93.4% of ES compounds are predicted as ES), while its specificity (i.e., the ability to correctly classify HS compounds) is rather low (SP = 0.300). The observed high sensitivity of SAScore is not surprising as only 0.2% of ZINC structures have SAScore greater than 6.0 and out of these, only lower units were selected into the T_MC+ set.

For optimized thresholds, the differences between SYBA, SAScore and RF are again statistically significant (Cochran’s Q test p-value is 2 × 10⁻¹⁴ for the T_MC test set with the smallest SYBA AUC). However, McNemar’s paired test (Table 3) identifies significant differences only between SCScore and other methods meaning that SAScore results improve significantly upon threshold optimization.

Table 3 The results of McNemar’s two-sided paired tests for the T_MC test set with the smallest SYBA AUC (AUC = 0.830)

Full size table

In terms of performance measures (Table 2), the most accurate classifiers are SYBA, RF and SAScore followed afar by SCScore. Notable is the improvement of SAScore SP by 0.619 compared to the default threshold. The increase in SAScore SP comes, however, at the cost of SN that decreases by 0.135. The worst performing model, SCScore, is only slightly better (AUC = 0.528) than a random model. However, because T_MC+ and T_MC− data sets consist of only 40 compounds each, the results must be interpreted with caution as small changes in confusion matrices lead to relatively large changes in reported metrics.

Classifier performance on computationally picked test set

Even stronger evidence of the differences between the models is provided by the classification of compounds in the large computationally picked T_CP test set (Fig. 1). Using the default thresholds, the differences between the classifiers are statistically significant (Cochran’s Q test p-value < 10⁻¹⁶) and all classifiers differ significantly (Table 4).

Table 4 The results of McNemar’s two-sided paired tests for the T_CP test set

Full size table

When used with their default thresholds, both SYBA and RF are more accurate than SAScore (Table 5, Additional file 2: Panel S3). Low observed SAScore accuracy (Acc = 0.665) is caused by its low specificity when almost 70% of HS compounds are predicted to be ES (Table 5).

Table 5 The performance of classification models for the computationally picked T_CP test set

Full size table

Cochran’s Q test identifies significant differences (p-value < 10⁻¹⁶) also if classifiers are used with the optimized thresholds. While statistically significant differences were detected within RF/SAScore and RF/SYBA pairs (Table 6), the effect size is still rather small (Table 5). However, the difference between SCScore and all other methods is statistically significant and large (Tables 5 and 6). No statistically significant difference was observed between SYBA and SAScore meaning that these two methods yield comparable results.

Table 6 The results of McNemar’s two-sided paired tests for the T_CP test set

Full size table

Compared to its default threshold of 6.0, SAScore specificity increases by 0.662 (Table 5) when the threshold is shifted to the optimal value of 4.5 (Fig. 10). At this threshold, SAScore is on par with SYBA and RF methods (Table 5). However, SYBA retains its high performance over much broader range of threshold values than SAScore (Additional file 2: Figure S1).

High performance of SYBA, RF and SAScore is also evident from their AUC that is close to one (Fig. 9, Table 5). On the other hand, SCScore fails to distinguish between ES and HS compounds as can be deduced from its ROC curve (Fig. 9). In its optimal threshold of 3.1, SCScore predicts a majority of T_CP compounds as HS (Fig. 10, Additional file 2: Panel S3).

In addition to the T_MC and T_CP test sets, the performance of SAScore and SCScore was also assessed using the training set S, as this data set was not used for their parametrization. Classification results are shown in Table 7 and Fig. 11, confusion matrices are available in Additional file 2: Panel S4.

Table 7 The performance of classification models for the training set S

Full size table

In agreement with previous experiments on the T_MC and T_CP test sets, SAScore is able to distinguish between ES and HS compounds more accurately than SCScore. However, to achieve the best performance, SAScore classification threshold must be shifted from its default value of 6.0 to the optimal value of 4.4. At this threshold, SAScore is both highly sensitive and specific, while SCScore is, using its optimal threshold of 3.7, sensitive and specific only moderately.

The observed poor performance of SCScore in all data sets may follow from the fact that SCScore differs conceptually from other methods tested in the present work. In SCScore, the problem of predicting synthetic complexity is reformulated as the analysis of reactions consisting of reactant-product pairs and SCScore correlates with a number of reaction steps. It means that, contrary to SYBA and SAScore, SCScore captures synthetic feasibility of compounds, not structural complexity. In SCScore derivation, each molecule is analyzed as a whole in the context of all molecules and reactions as they appear in the Reaxys database [44]. Thus, SCScore is biased [43] by the types of reactants and products in the Reaxys database. Therefore, we hypothesize that the unsatisfactory results of SCScore are caused by the fact that compounds in our data sets come from chemical subspace insufficiently covered by Reaxys compounds.

Conclusions

In the present work, SYBA method for the classification of organic compounds as easy- and hard-to-synthesize is described. SYBA is an additive fragment-based approach meaning that the compound is decomposed into individual substructure fragments, each fragment is assigned its respective SYBA fragment score and these are summed to obtain the final SYBA score. The fragment scores were derived by the Bayesian analysis of the frequency of ECFP8 fragments occurring in the database of ES compounds, that were randomly chosen from the ZINC15 database [57], and HS compounds, that were generated using the Nonpher approach [58]. Because SYBA was derived from ECFP8 fragments that utilize only molecular connectivity and no 3D information, the influence of stereochemistry on synthetic accessibility is not accounted for. However, apart from ECFP8 fragments, the number of stereocenters is included in SYBA calculation and the compounds with many stereocenters are penalized. If the SYBA score is positive, the compound is considered to be ES and vice versa. While SYBA score can theoretically assume values between plus and minus infinity, a majority of compounds will have SYBA score between − 100 and +100 in real applications. It must be stressed here that the absolute value of the SYBA score is the measure of the confidence of the prediction and not of the degree of the synthetic accessibility.

SYBA was compared with other two recent classification methods, SAScore [45] and SCScore [43]. As a baseline for the comparison, RF/ECFP4 classifier was used due to its wide adoption in many cheminformatics applications. All methods were assessed using accuracy, sensitivity, specificity and area under the ROC curve. While SYBA and RF provide similar performance, we recommend to use SYBA due to its smaller complexity, lower computational demands and more straightforward analysis of the individual fragment contributions. SYBA outperforms SAScore when this is used with the threshold of 6.0 proposed by the authors [45]. However, if the SAScore threshold is changed to the value of ~ 4.5, the accuracy of SYBA and SAScore becomes comparable. Therefore, to reduce the number of false positives in workflows that utilize SAScore, we recommend to decrease the SAScore threshold to ~ 4.5. SYBA, RF and SAScore substantially outperform SCScore. The observed weak performance of SCScore can be, in our opinion, attributed to the fact that our test set compounds come from a part of chemical space that is insufficiently covered by Reaxys compounds used to derive SCScore. The SYBA fragment scores can be mapped [74] onto a molecule and used for the analysis of the contribution of its individual substructures to the overall synthetic accessibility.

SYBA can be used to quickly rank large molecular data sets that originate, for example, from de novo molecular design. However, SYBA is conceptually based on the notion that a compound can be categorized as easy- and hard-to-synthesize. As the synthetic accessibility is a vaguely defined term, SYBA’s simplifying approach, though accurate enough, cannot compete with more sophisticated synthetic path-reconstruction methods that enable the incorporation of other factors such as the availability of starting materials, reaction yields or a price aspect. In the end, the definitive assessment of synthetic accessibility is in the hands of experienced organic chemists.

Availability of data and materials

The code used to train and benchmark SYBA model is available from https://github.com/lich-uct/syba repository. Nonpher is available from https://github.com/lich-uct/nonpher.

Abbreviations

Acc:: Accuracy
AUC:: Area under the ROC curve
ES:: Easy-to-synthesize
HS:: Hard-to-synthesize
MW:: Molecular weight
RF:: Random forest
ROC:: Receiver operating characteristic
S:: Training set
SN:: Sensitivity
SP:: Specificity
SVD:: Singular value decomposition
SYBA:: SYnthetic Bayesian Accessibility
T_CP :: Computationally picked test set
T_MC :: Manually curated test set
YI:: Youden index

References

Bohacek RS, McMartin C, Guida WC (1996) The art and practice of structure-based drug design: a molecular modeling perspective. Med Res Rev 16(1):3–50
Article CAS PubMed Google Scholar
Polishchuk PG, Madzhidov TI, Varnek A (2013) Estimation of the size of drug-like chemical space based on GDB-17 data. J Comput Aided Mol Des 27(8):675–679
Article CAS PubMed Google Scholar
Ertl P (2003) Cheminformatics analysis of organic substituents: identification of the most common substituents, calculation of substituent properties, and automatic identification of drug-like bioisosteric groups. J Chem Inf Comput Sci 43(2):374–380
Article CAS PubMed Google Scholar
Reymond JL, van Deursen R, Blum LC, Ruddigkeit L (2010) Chemical space as a source for new drugs. Medchemcomm 1(1):30–38
Article CAS Google Scholar
Llanos EJ, Leal W, Luu DH, Jost J, Stadler PF, Restrepo G (2019) Exploration of the chemical space and its three historical regimes. Proc Natl Acad Sci U S A 116(26):12660–12665
Article CAS PubMed PubMed Central Google Scholar
Karlov DS, Sosnin S, Tetko IV, Fedorov MV (2019) Chemical space exploration guided by deep neural networks. Rsc Advances 9(9):5151–5157
Article CAS PubMed PubMed Central Google Scholar
Gromski PS, Henson AB, Granda JM, Cronin L (2019) How to explore chemical space using algorithms and automation. Nat Rev Chem 3(2):119–128
Article Google Scholar
Walters WP (2019) Virtual chemical libraries. J Med Chem 62(3):1116–1124
Article CAS PubMed Google Scholar
Franzini RM, Neri D, Scheuermann J (2014) DNA-encoded chemical libraries: advancing beyond conventional small-molecule libraries. Acc Chem Res 47(4):1247–1255
Article CAS PubMed Google Scholar
Lopez-Vallejo F, Caulfield T, Martinez-Mayorga K, Giulianotti MA, Nefzi A, Houghten RA, Medina-Franco JL (2011) Integrating virtual screening and combinatorial chemistry for accelerated drug discovery. Comb Chem High Throughput Screen 14(6):475–487
Article CAS PubMed Google Scholar
Hoffmann T, Gastreich M (2019) The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discov Today 24(5):1148–1156
Article CAS PubMed Google Scholar
van Hilten N, Chevillard F, Kolb P (2019) Virtual compound libraries in computer-assisted drug discovery. J Chem Inf Model 59(2):644–651
Article PubMed CAS Google Scholar
Schneider G, Fechner U (2005) Computer-based de novo design of drug-like molecules. Nat Rev Drug Discov 4(8):649–663
Article CAS PubMed Google Scholar
Loving K, Alberts I, Sherman W (2010) Computational approaches for fragment-based and de novo design. Curr Top Med Chem 10(1):14–32
Article CAS PubMed Google Scholar
Medina-Franco JL, Martinez-Mayorga K, Meurice N (2014) Balancing novelty with confined chemical space in modern drug discovery. Expert Opin Drug Discov 9(2):151–165
Article CAS PubMed Google Scholar
Schneider P, Schneider G (2016) De Novo design at the edge of Chaos. J Med Chem 59(9):4077–4086
Article CAS PubMed Google Scholar
Kutchukian PS, Shakhnovich EI (2010) De novo design: balancing novelty and confined chemical space. Expert Opin Drug Discov 5(8):789–812
Article CAS PubMed Google Scholar
Hartenfeller M, Schneider G (2011) De novo drug design. Methods Mol Biol 672:299–323
Article CAS PubMed Google Scholar
Hartenfeller M, Proschak E, Schuller A, Schneider G (2008) Concept of combinatorial de novo design of drug-like molecules by particle swarm optimization. Chem Biol Drug Des 72(1):16–26
Article CAS PubMed Google Scholar
Vinkers HM, de Jonge MR, Daeyaert FF, Heeres J, Koymans LM, van Lenthe JH, Lewi PJ, Timmerman H, Van Aken K, Janssen PA (2003) SYNOPSIS: SYNthesize and OPtimize System in Silico. J Med Chem 46(13):2765–2773
Article CAS PubMed Google Scholar
Hartenfeller M, Zettl H, Walter M, Rupp M, Reisen F, Proschak E, Weggen S, Stark H, Schneider G (2012) DOGS: reaction-driven de novo design of bioactive compounds. PLoS Comput Biol 8(2):e1002380
Article CAS PubMed PubMed Central Google Scholar
Schneider G, Lee ML, Stahl M, Schneider P (2000) De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks. J Comput Aided Mol Des 14(5):487–494
Article CAS PubMed Google Scholar
Fechner U, Schneider G (2006) Flux (1): a virtual synthesis scheme for fragment-based de novo design. J Chem Inf Model 46(2):699–707
Article CAS PubMed Google Scholar
Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T (2018) The rise of deep learning in drug discovery. Drug Discov Today 23(6):1241–1250
Article PubMed Google Scholar
Hessler G, Baringhaus KH (2018) Artificial intelligence in drug design. Molecules. 23(10):2520. https://doi.org/10.3390/molecules23102520
Article CAS PubMed Central Google Scholar
Xu Y, Lin K, Wang S, Wang L, Cai C, Song C, Lai L, Pei J (2019) Deep learning for molecular generation. Future Med Chem 11(6):567–597
Article CAS PubMed Google Scholar
Popova M, Isayev O, Tropsha A (2018) Deep reinforcement learning for de novo drug design. Sci Adv 4(7):7885
Article CAS Google Scholar
Olivecrona M, Blaschke T, Engkvist O, Chen H (2017) Molecular de-novo design through deep reinforcement learning. J Cheminform 9(1):48
Article PubMed PubMed Central Google Scholar
Blaschke T, Olivecrona M, Engkvist O, Bajorath J, Chen H (2018) Application of generative autoencoder in de novo molecular design. Mol Inform. 37(1–2):1700123
Article CAS Google Scholar
Segler MHS, Kogej T, Tyrchan C, Waller MP (2018) Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent Sci 4(1):120–131
Article CAS PubMed Google Scholar
Gupta A, Müller AT, Huisman BJ, Fuchs JA, Schneider P, Schneider G (2018) Generative recurrent networks for de novo drug design. Mol Inform. 37(1–2):1700111
Article CAS Google Scholar
Merk D, Friedrich L, Grisoni F, Schneider G (2018) De novo design of bioactive small molecules by artificial intelligence. Mol Inform. 37(1–2):1700153
Article PubMed Central CAS Google Scholar
Mendez-Lucio O, Medina-Franco JL (2017) The many roles of molecular complexity in drug discovery. Drug Discov Today 22(1):120–126
Article CAS PubMed Google Scholar
Bertz SH (1981) The first general index of molecular complexity. J Am Chem Soc 103(12):3599–3601
Article CAS Google Scholar
Whitlock HW (1998) On the structure of total synthesis of complex natural products. J Organic Chem 63(22):7982–7989
Article CAS Google Scholar
Barone R, Chanon M (2001) A new and simple approach to chemical complexity application to the synthesis of natural products. J Chem Inf Comp Sci 41(2):269–272
Article CAS Google Scholar
Allu TK, Oprea TI (2005) Rapid evaluation of synthetic and molecular complexity for in silico chemistry. J Chem Inf Model 45(5):1237–1243
Article CAS PubMed Google Scholar
Selzer P, Roth HJ, Ertl P, Schuffenhauer A (2005) Complex molecules: do they add value? Curr Opin Chem Biol 9(3):310–316
Article CAS PubMed Google Scholar
Sheridan RP, Zorn N, Sherer EC, Campeau LC, Chang CZ, Cumming J, Maddess ML, Nantermet PG, Sinz CJ, O’Shea PD (2014) Modeling a crowdsourced definition of molecular complexity. J Chem Inf Model 54(6):1604–1616
Article CAS PubMed Google Scholar
Gillet VJ, Myatt G, Zsoldos Z, Johnson AP (1995) SPROUT, HIPPO and CAESA: tools for de novo structure generation and estimation of synthetic accessibility. Perspect Drug Discov Des 3:34–50
Article CAS Google Scholar
Huang Q, Li L-L, Yang S-Y (2011) RASA: a rapid retrosynthesis-based scoring method for the assessment of synthetic accessibility of drug-like molecules. J Chem Inf Model 51(10):2768–2777
Article CAS PubMed Google Scholar
Li J, Eastgate MD (2015) Current complexity: a tool for assessing the complexity of organic molecules. Org Biomol Chem 13(26):7164–7176
Article CAS PubMed Google Scholar
Coley CW, Rogers L, Green WH, Jensen KF (2018) SCScore: synthetic complexity learned from a reaction corpus. J Chem Inf Model 58(2):252–261
Article CAS PubMed Google Scholar
Reaxys. https://www.reaxys.com. Accessed 24 January 2020
Ertl P, Schuffenhauer A (2009) Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminf 1:1–11
Article CAS Google Scholar
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754
Article CAS PubMed Google Scholar
Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B et al (2019) PubChem 2019 update: improved access to chemical data. Nucleic Acids Res 47(D1):D1102–D1109
Article PubMed Google Scholar
Besnard J, Ruda GF, Setola V, Abecassis K, Rodriguiz RM, Huang XP, Norval S, Sassano MF, Shin AI, Webster LA et al (2012) Automated design of ligands to polypharmacological profiles. Nature 492(7428):215–220
Article CAS PubMed Google Scholar
Yang X, Zhang J, Yoshizoe K, Terayama K, Tsuda K (2017) ChemTS: an efficient python library for de novo molecular generation. Sci Technol Adv Mater 18(1):972–976
Article CAS PubMed PubMed Central Google Scholar
Chevillard F, Kolb P (2015) SCUBIDOO: a large yet screenable and easily searchable database of computationally created chemical compounds optimized toward high likelihood of synthetic tractability. J Chem Inf Model 55(9):1824–1835
Article CAS PubMed Google Scholar
Clark AM, Dole K, Coulon-Spektor A, McNutt A, Grass G, Freundlich JS, Reynolds RC, Ekins S (2015) Open source Bayesian Models. 1. Application to ADME/Tox and drug discovery datasets. J Chem Inf Model 55(6):1231–1245
Article CAS PubMed PubMed Central Google Scholar
Xia X, Maliski EG, Gallant P, Rogers D (2004) Classification of kinase inhibitors using a Bayesian model. J Med Chem 47(18):4463–4470
Article CAS PubMed Google Scholar
Bender A (2011) Bayesian methods in virtual screening and chemical biology. Methods Mol Biol 672:175–196
Article CAS PubMed Google Scholar
Vogt M, Bajorath J (2007) Introduction of an information-theoretic method to predict recovery rates of active compounds for Bayesian in silico screening: theory and screening trials. J Chem Inf Model 47(2):337–341
Article CAS PubMed Google Scholar
Koutsoukas A, Lowe R, Kalantarmotamedi Y, Mussa HY, Klaffke W, Mitchell JB, Glen RC, Bender A (2013) In silico target predictions: defining a benchmarking data set and comparison of performance of the multiclass Naive Bayes and Parzen-Rosenblatt window. J Chem Inf Model 53(8):1957–1966
Article CAS PubMed Google Scholar
Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG (2012) ZINC: a free tool to discover chemistry for biology. J Chem Inf Model 52(7):1757–1768
Article CAS PubMed PubMed Central Google Scholar
Sterling T, Irwin JJ (2015) ZINC 15—ligand discovery for everyone. J Chem Inf Model 55(11):2324–2337
Article CAS PubMed PubMed Central Google Scholar
Voršilák M, Svozil D (2017) Nonpher: computational method for design of hard-to-synthesize structures. J Cheminf 9(1):1–20
Article CAS Google Scholar
Hoksza D, Skoda P, Vorsilak M, Svozil D (2014) Molpher: a software framework for systematic chemical space exploration. J Cheminf 6:1–13
Article CAS Google Scholar
RDKit: open-source cheminformatics. http://www.rdkit.org. Accessed 24 January 2020
Ruddigkeit L, van Deursen R, Blum LC, Reymond J-L (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inf Model 52(11):2864–2875
Article CAS PubMed Google Scholar
Boda K, Seidel T, Gasteiger J (2007) Structure and reaction based evaluation of synthetic accessibility. J Comput-Aided Mol Des 21(6):311–325
Article CAS PubMed Google Scholar
Fukunishi Y, Kurosawa T, Mikami Y, Nakamura H (2014) Prediction of synthetic accessibility based on commercially available compound databases. J Chem Inf Model 54(12):3259–3267
Article CAS PubMed Google Scholar
Sheridan RP (2013) Using random forest to model the domain applicability of another random forest model. J Chem Inf Model 53(11):2837–2850
Article CAS PubMed Google Scholar
Kensert A, Alvarsson J, Norinder U, Spjuth O (2018) Evaluating parameters for ligand-based modeling with random forest on sparse data sets. J Cheminform 10(1):49
Article CAS PubMed PubMed Central Google Scholar
Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 43(6):1947–1958
Article CAS PubMed Google Scholar
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine Learning in Python. J Mach Learn Res 12:2825–2830
Google Scholar
SCScore GitHub. https://github.com/connorcoley/scscore. Accessed 24 January 2020
Youden WJ (1950) Index for rating diagnostic tests. Cancer 3(1):32–35
Article CAS PubMed Google Scholar
Fluss R, Faraggi D, Reiser B (2005) Estimation of the Youden Index and its associated cutoff point. Biom J 47(4):458–472
Article PubMed Google Scholar
Looney SW (1988) A statistical technique for comparing the accuracies of several classifiers. Pattern Recogn Lett 8(1):5–9
Article Google Scholar
Westfall PH, Troendle JF, Pennello G (2010) Multiple McNemar tests. Biometrics 66(4):1185–1191
Article PubMed PubMed Central Google Scholar
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate—a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57(1):289–300
Google Scholar
Riniker S, Landrum GA (2013) Similarity maps—a visualization strategy for molecular fingerprints and machine-learning methods. J Cheminform 5(1):43
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Funding

This work was supported from the Ministry of Education of the Czech Republic (RVO 68378050-KAV-NPUI and LM2018130). IČ research was further supported by a specific university research (MSMT No. 20/2015). MK was supported by the project “Center for Tumor Ecology-Research of the Cancer Microenvironment Supporting Cancer Growth and Spread” (Reg. No. CZ.02.1.01/0.0/0.0/16_019/0000785) within the Operational Programme Research, Development and Education.

Author information

Authors and Affiliations

CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague 6, Czech Republic
Milan Voršilák, Ivan Čmelo & Daniel Svozil
CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of theCzech Academy of Sciences, Vídeňská 1083, 142 20, Prague 4, Czech Republic
Milan Voršilák & Daniel Svozil
Laboratory of Genomics and Bioinformatics, Institute of Molecular Genetics, Czech of the Academy of Sciences, Vídeňská 1083, 142 20, Prague 4, Czech Republic
Michal Kolář
Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague 6, Czech Republic
Michal Kolář

Authors

Milan Voršilák
View author publications
You can also search for this author in PubMed Google Scholar
Michal Kolář
View author publications
You can also search for this author in PubMed Google Scholar
Ivan Čmelo
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Svozil
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

MV, MK and DS conceptualized the problem. MV was responsible for method development, implementation and validation. MV also maintains the SYBA GitHub repository. MK derived SYBA and IČ took part in method testing. DS supervised the study, performed statistical analyses and prepared the manuscript with the active participation of MV, MK and IČ. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Daniel Svozil.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Training set S. It consists of 693 353 ES compounds selected from the ZINC15 database and of 693 353 ES compounds generated by Nonpher.

Additional file 2.

This supporting document contains the detailed description of the SYBA score derivation, threshold values of complexity indices, RF hyperparameter optimization results, the dependence of the accuracy of the classification of TCP compounds on SYBA and SAScore thresholds, examples of correctly predicted and mispredicted S, TMC and TCP compounds, fragments with very low SYBA contributions and HS compounds containing these fragments, and confusion matrices of the classification of TMC, TCP and S data sets.

Additional file 3.

Manually curated test set (TMC). It consists of 40 HS compounds manually selected from scientific papers and of 30 ES sets, each of them contains 40 compounds selected from the ZINC15 database.

Additional file 4.

Computationally picked test set (TCP). It consists of 3 581 HS compounds that were obtained from the GDB-17 database complemented by the same number of compounds randomly selected from the ZINC15 database.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Voršilák, M., Kolář, M., Čmelo, I. et al. SYBA: Bayesian estimation of synthetic accessibility of organic compounds. J Cheminform 12, 35 (2020). https://doi.org/10.1186/s13321-020-00439-2

Download citation

Received: 31 January 2020
Accepted: 09 May 2020
Published: 20 May 2020
DOI: https://doi.org/10.1186/s13321-020-00439-2

SYBA: Bayesian estimation of synthetic accessibility of organic compounds

Abstract

Background