SYBA: Bayesian estimation of synthetic accessibility of organic compounds

SYBA (SYnthetic Bayesian Accessibility) is a fragment-based method for the rapid classification of organic compounds as easy- (ES) or hard-to-synthesize (HS). It is based on a Bernoulli naïve Bayes classifier that is used to assign SYBA score contributions to individual fragments based on their frequencies in the database of ES and HS molecules. SYBA was trained on ES molecules available in the ZINC15 database and on HS molecules generated by the Nonpher methodology. SYBA was compared with a random forest, that was utilized as a baseline method, as well as with other two methods for synthetic accessibility assessment: SAScore and SCScore. When used with their suggested thresholds, SYBA improves over random forest classification, albeit marginally, and outperforms SAScore and SCScore. However, upon the optimization of SAScore threshold (that changes from 6.0 to – 4.5), SAScore yields similar results as SYBA. Because SYBA is based merely on fragment contributions, it can be used for the analysis of the contribution of individual molecular parts to compound synthetic accessibility. SYBA is publicly available at https://github.com/lich-uct/syba under the GNU General Public License.


Background
Chemical space available for the generation of new molecules is huge [1][2][3][4], making the synthesis and testing of all possible compounds impractical. Therefore chemists, both experimental and computational, developed tools and approaches for the exploration of chemical space with the aim to identify new compounds with desirable physico-chemical, biological and pharmacological properties [5][6][7][8][9][10][11][12]. A major in silico method for chemical space exploration is de novo molecular design in which new virtual molecules are assembled from scratch [13][14][15][16][17][18]. An essential requirement for de novo designed compounds is their synthetic accessibility. Synthetic accessibility is commonly incorporated into de novo design programs by employing chemical strategies that guide an assembly process. For example, the connections between certain atom types can be disallowed [19], established chemical reactions can be used to connect individual molecular building blocks [20,21] or the retrosynthetic rules can be directly incorporated into the assembly process [22,23].
The latest development in de novo molecular design are molecular generators based on deep learning approaches [24][25][26]. These typically construct new molecules not by assembling the building blocks, but by producing chemically feasible SMILES strings [27][28][29][30][31][32]. The generators are able to produce millions of virtual compounds, synthetic accessibility of which has to be quickly and efficiently assessed. Quick synthetic accessibility assessment can be based [33] on molecule's complexity that is typically calculated [34][35][36][37] from the number of atoms, bonds, rings, and/or hard-to-synthesize motifs, such as chiral centers or uncommon ring fusions. However, the definition of molecular complexity is ambiguous and context dependent [38,39]. The structural complexity is not equivalent to the synthetic one as complexity-based metrics do not incorporate any information about starting materials and tend to remove molecules that can be synthesized from already existing complex precursors [40,41]. A better way of synthetic accessibility assessment is to use the complexity of the synthetic route [42]. Based on this principle, SCScore, a data-driven metric designed to describe real syntheses, was developed recently [43]. SCScore is based on the idea that reaction products are synthetically more complex than reactants. To quantify this, a deep feed-forward neural network, that assigns a synthetic complexity score between 1 and 5, was trained on 22 million reactant-product pairs from the Reaxys database [44]. Using the hinge loss objective function, that supports the separation between scores in each reactant-product pair, the model learns synthetic complexity score that correlates with the number of reaction steps, but does not rely on the availability of reaction database or organic chemist ranking.
SAScore [45], another popular and rapid method for synthetic accessibility assessment, is based on the analysis of ECFP4 [46] fragments obtained from one million compounds randomly selected from the PubChem database [47]. The main idea of SAScore is that when a molecular fragment occurs often in the PubChem database, it contributes to the synthetic accessibility of a molecule more than a less frequently occurring fragment. Each fragment is assigned a numerical score, frequent fragments have positive scores and less frequent fragments have negative scores. In addition to the fragment score, SAScore consists of a complexity penalty and symmetry bonus. These terms penalize nonstandard structural motives such as macrocycles, stereo centers, spiro and bridge atom, but reward the symmetry of a structure. SAScore acquires values between 1 (easy to make) and 10 (very difficult to make), where 6.0 is suggested by the authors [45] as a threshold to distinguish between easyand hard-to-synthesize compounds. SAScore is a popular high-throughput measure and proved to be a very useful tool in many cheminformatics applications [27,[48][49][50].
In the present work, we further expand on main concepts of SAScore construction. We developed SYBA (SYnthetic Bayesian Accessibility), a rapid fragmentbased score derived using Bayesian probabilistic modeling. Fragment contributions to SYBA are calculated not only from fragments present in synthetically accessible molecules, but also from fragments appearing in hard-tosynthesize molecules.

SYBA score derivation
SYBA is a Bernoulli naïve Bayes classifier based on the frequency of molecular fragments that are present in the database of easy-to-synthesize (ES) and hard-to-synthesize (HS) molecules and on the assumption of the independence of molecular fragments. Though such assumption is bold, it was shown to provide surprisingly good results in many cheminformatics studies [51][52][53][54][55].
Each compound is represented by a binary finger- or absence ( f i = 0 ) of the specific fragment i in the compound. SYBA uses this fingerprint to assign the molecule to a class C ∈ �ES, HS� . The calculation is based on the Bayes theorem where p(C|F ) is the posterior probability that a compound with a certain set of molecular fragments F belongs to the class C . The likelihood p(F |C) is the conditional probability that a compound from the class C contains a set of molecular fragments F. The marginal probabilities p(F ) and p(C) express our belief to observe a set of molecular fragments F and the molecule that belongs to the class C.
The SYBA score is defined as the logarithm of the ratio of the posterior probabilities that the molecule belongs to the ES and HS classes, Using Eq. 1, the SYBA score can be expressed as In the data set SYBA was derived from (further referred to as the training data set S), ES and HS compounds are represented evenly, the priors p(ES) and p(HS) are thus equal and the term ln p(ES) p(HS) becomes zero: Assuming the independence of molecular fragments, the conditional probability p(F |C) factorizes to p(f i |C) and the SYBA score simplifies to where s i f i is the score contribution from the fragment i (SYBA fragment score) given as Considering that p f i |ES = 1 − p f i |HS , the fragment scores s i f i in Eq. 6 represent logits and can be expressed using the fragment frequencies in the training data set S as where N HS is the number of HS and N ES the number of ES molecules in the training data set S, n HS,i is the number of HS molecules in the training data set S that contain the fragment i, and n ES,i is the number of ES molecules in the training data set S that contain the fragment i. See Additional file 2 for a detailed derivation. Positive s i f i means that the presence/absence of the fragment i is more probable in ES than in HS class and vice versa. Positive SYBA means that the compound belongs more likely to the ES class, while negative SYBA means that the compound belongs more likely to the HS class. The higher the absolute value of SYBA, the more evidence for the class membership is present in the molecule.

Training set construction
The training data set S consists of two subsets: S + contains ES structures and Scontains HS structures (Fig. 1, Additional file 1). While ES molecules can be readily obtained, for example, from the ZINC database of purchasable compounds [56,57], no equivalent database (7) of HS molecules exists. However, HS molecules can be designed by Nonpher [58], a method based on a molecular morphing approach [59]. In Nonpher, a starting molecule is gradually transformed into a more complex compound using small structural perturbations, such as the addition or removal of an atom or a bond. To prevent the creation of overly complex structures, four complexity indices (Bertz [34], Whitlock [35], BC [36] and SMCM [37]) are monitored and once their respective thresholds (Additional file 2: Table S1) are exceeded, Nonpher is stopped.
Using Nonpher, 693 353 HS molecules were generated and they form the Sdata set. The S + data set, containing ES compounds, is formed by the same number of molecules randomly chosen (excluding natural products) from the ZINC15 database [57] so that their distribution of the number of heavy atoms is the same as in the Sdata sets. Every S + and Smolecule was fragmented using the Morgan fingerprint function in the RDKit toolkit [60]. Fragments with the radius of 4 and smaller, corresponding to radial ECFP8 [46] fragments, were used. This type of fragments consists of a central atom and atoms distant from the central atom up to four bonds. Besides ECFP8 fragments, the number of stereocenters was also included into SYBA as the molecules with more stereocenters are typically more difficult to synthesize. The stereo score is based, similarly to the fragment score, on the analysis of the number of stereocenters in the training set S. To obtain the stereo score, molecules were divided into 6 bins differing by the number of stereocenters (0, 1, 2, 3, Fig. 1 Data set summary. Training set was used to derive SYBA scores, as well as to train a random forest classifier. Training set consists of 693 353 molecules randomly selected from the ZINC15 database [57] that are considered to be ES (S + data set) and of the same number of HS molecules generated by Nonpher [58] (S − data set). Two test sets were used to compare the performance of SYBA, a random forest, SAScore [45] and SCScore [43]. Manually curated test set (T MC ) contains 40 compounds (T MC-data set) considered to be HS by experienced medicinal chemists [58] supplemented by 40 ES compounds randomly selected from the ZINC15 database (T MC+ data set). 30 T MC data set instances differing in T MC+ compounds were constructed. Computationally picked test set (T CP ) consists of 3 581 HS compounds that were obtained from the GDB-17 database [61] (T CP-data set) complemented by the same number of compounds randomly selected from the ZINC15 database (T CP+ data set) 4 and 5+) and individual score contributions were calculated from Eq. 7.

Test set construction
SYBA performance could have been assessed using a test set created in a similar way as the training set S, i.e. using HS compounds generated by Nonpher. However, such test set would be clearly biased towards chemical space covered by Nonpher. Therefore, two test sets were constructed in a conceptually different manner. First test set, further denoted as T MC , was manually curated from the literature, second test set, referred to as T CP , was computationally picked from the ZINC15 [57] and GDB17 databases [61].
HS compounds in T MC (denoted as T MC− ) were obtained by the analysis [58] of 296 published compounds assessed by experienced medicinal chemists [41,45,62,63]. Based on original chemists' scores, the final T MC-data set of 40 HS compounds was assembled. A complementary T MC+ data set consists of 40 ES compounds selected from the ZINC15 database [57] in such a way that the distribution of the number of their heavy atoms is the same as in the T MC-data set. Because small T MC size may bias the results, 30 different T MC data set instances were generated using the same 40 T MC-compounds, but different 40 T MC+ compounds (Additional file 3).
HS compounds in the T CP test set (Additional file 4), denoted as T CP-, were obtained by the analysis of the publicly available subset of 50 M molecules from the GDB-17 database [61]. Only molecules exceeding thresholds (Additional file 2: Table S1) of all monitored complexity indices (Bertz [34], Whitlock [35], BC [36] and SMCM [37]) were considered to be HS. In total, 3 581 molecules form the T CP-data set. A complementary T CP+ data set consists of the same number of compounds randomly selected from the ZINC15 database [57] that follow the same size distribution as HS compounds and that, in addition, do not exceed any of the aforedescribed complexity indices. Data sets used in the present work are summarized in Fig. 1.

Performance evaluation
The performance of classification models studied in the present work was assessed by four different metrics: the classification accuracy (Acc), sensitivity (SN), specificity (SP) and area under the ROC curve (AUC ). Acc gives the percentage of correctly classified samples regardless of their class.
where true positives (TP) are ES compounds predicted by a model to be ES, true negatives (TN) are HS compounds predicted to be HS, false positives (FP) are HS compounds predicted to be ES and false negatives (FN) are ES compounds predicted to be HS. The accuracy can also be evaluated for positive and negative classes independently leading to SN and SP. SN is the percentage of correctly predicted positive class compounds, while the percentage of correctly predicted negative class compounds is known as SP.
SN and SP can be combined in the receiver operating characteristic (ROC) curve that is the graphical representation of the trade-off between true positive rate (given as SN) and false positive rate (given as 1 − SP) over all possible thresholds (Fig. 2). The area under the ROC curve (AUC ) is the quantitative measure of the performance of a classifier and is equal to the probability that a classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. A random classifier has AUC of 0.5, while AUC for a perfect classifier is equal to 1.

Random forest classification, SAScore and SCScore
Because of its wide adoption in various cheminformatics applications [58,[64][65][66], the random forest (RF) classifier with compounds encoded by 1024-bits long Morgan fingerprint with radius 2 was used as a baseline method with which SYBA, SAScore [45] and SCScore [43] were compared. The RF classifier was implemented in Scikit-learn [67]. Two RF hyperparameters were optimized: the number of trees (50, 100, 300 and 500) and the maximum number of features considered when looking for the best split (10% out of 1024 = 102, 25% = 256, 50% = 512, 75% = 768, 100% = 1024, √ 1024 = 32 and log 2 (1024) = 10 ). For each pair of hyperparameters, RF model was trained using the training set S and the prediction accuracy was evaluated on the test set T CP (Additional file 2: Table S2, Figures S2-S8). The setting used in this work (100 trees and 32 features) represents the best trade-off between computational efficiency and prediction accuracy [64]. RF was trained using the training set S (Fig. 1). SAScore was calculated by the RDKit toolkit [60]. SCScore code was downloaded from the public GitHub repository [68].

Classification thresholds
In SYBA, more positive value means a higher probability that the compound is ES and more negative value indicates a higher probability that the compound is HS (Eq. 4). The threshold value of zero is used to distinguish between ES and HS compounds. For SAScore, the recommended value of 6.0 [45] was used as a threshold. In RF, the final prediction is based on a number of decision trees that predict either of classes. Here, 0.5 is used as a threshold, i.e., if more decision trees predict ES than HS class, a compound is classified as ES and vice versa. For SCScore [43], no threshold was suggested by the authors. In such case, the threshold can be identified by the analysis of the ROC curve. A frequently used measure that enables the selection of an optimal threshold is the Youden index (YI) [69,70]. YI is defined as and ranges between 0 and 1 (Fig. 2). The optimal threshold value is selected by maximizing YI, i.e., by maximizing the sum of SN and SP.

Statistical comparison of model performance
The performance of studied classification models was compared using non-parametric Cochran's Q test [71], an omnibus test for testing for differences between three or more machine learning models. In the case of the statistically significant result of Cochran's Q test, differing pairs of classification models were identified by McNemar's post hoc paired test [72] with Benjamini-Hochberg false discovery rate adjustment [73]. McNemar's test checks if the distribution of disagreements between two methods is imbalanced. The statistical significance for all tests in the present work was assessed at the significance level α = 0.05.

Chemical space covered by SYBA data sets
The examples of training set compounds are given in Additional file 2: Figures S9-S12. In total, 3 439 074 ECFP8 fragments were obtained for ES compounds and 23 447 524 fragments for HS compounds. 458 040 fragments are common for both S + and S-subsets. 55.0% of S+ fragments and 91.7% of S− fragments are present only once in the whole data set S (singletons). Typical ES and HS fragments are shown in Figs. 3 and 4, fragments with very low SYBA in Additional file 2: Figure S13 and  Nine fragments that are most frequent in the S + data set and, at the same time, least frequent in the Sdata set. s i is SYBA fragment score. Blue circles represent each fragment central atom, yellow circles represent aromatic atoms. Fragment images were generated by the RDKit function DrawMorganEnvs() compounds containing these fragments in Additional file 2: Figure S14.
Though the number of fragments in HS compounds is much larger than in ES compounds, chemical space is equally covered by both HS and ES molecules and there is no bias towards HS compounds (Fig. 5).
The visualization of chemical space (Fig. 6) covered by S, T CP and T MC compounds shows that test set compounds lie within chemical space of training compounds. The examples of T CP and T MC compounds are given in Additional file 2: Figures S15-S21.
SYBA also enables the visualization and interpretation of fragment score contributions. Each SYBA fragment score can be projected to the corresponding fragment root atom and this projection can be used to analyze which fragments contribute unfavorably to molecule synthetic accessibility (Fig. 7).

Classifier performance on manually curated test set
The differences in classification of T MC (Fig. 1) compounds using SYBA, SAScore and RF with default thresholds are statistically significant. The Cochran's Q test p value is 2 × 10 −5 for the T MC test set with the smallest SYBA AUC. The results of the corresponding McNemar's paired tests [72] are summarized in Table 1. While RF and SYBA do not differ significantly, both RF and SYBA yield significantly better results than SAScore. The quality measures of the classification of the compounds in the manually curated T MC test set, averaged over 30 T MC instances, are summarized in Table 2. The corresponding confusion matrices are reported in Additional file 2: Panel S1 and Panel S2 and ROC curves are shown in Fig. 8.
In terms of Acc, the best performing model is SYBA followed by RF and SAScore. While SYBA and RF sensitivity Fig. 6 Chemical space coverage by training set S and test sets T CP and T MC . T MC data set consists of 40 HS compounds and 1200 ES compounds, from S and T CP data sets random samples of 1240 compounds were generated. Each compound was encoded by 1024 bits long ECFP4 fingerprint. The dimensionality of the input space was reduced by SVD to 500 components that explain 88% of the variance in the data Fig. 7 SYBA fragment score visualization. Fragment score is projected on the fragment root atom and the whole molecule is visualized as a similarity map [74]. The more frequent the fragment is in the S + data set compared to the Sdata set, the greener is its central atom. Similarly, the more frequent the fragment is in the Sdata set compared to the S + data set, the redder is its central atom. This visualization enables to analyze the contributions of the individual parts of the molecule to its synthetic accessibility. In the HS molecule, the quaternary carbon is most problematic. Another substructure decreasing compound synthetic accessibility is a fused cyclopropane ring as can be observed both in ES and HS compounds and specificity are well balanced, SAScore shows high sensitivity (SN = 0.934, i.e., on average 93.4% of ES compounds are predicted as ES), while its specificity (i.e., the ability to correctly classify HS compounds) is rather low (SP = 0.300). The observed high sensitivity of SAScore is not surprising as only 0.2% of ZINC structures have SAScore greater than 6.0 and out of these, only lower units were selected into the T MC+ set.
For optimized thresholds, the differences between SYBA, SAScore and RF are again statistically significant (Cochran's Q test p-value is 2 × 10 −14 for the T MC test set with the smallest SYBA AUC). However, McNemar's paired test (Table 3) identifies significant differences only between SCScore and other methods meaning that SAScore results improve significantly upon threshold optimization.
In terms of performance measures (Table 2), the most accurate classifiers are SYBA, RF and SAScore followed afar by SCScore. Notable is the improvement of SAScore SP by 0.619 compared to the default threshold. The increase in SAScore SP comes, however, at the cost of SN that decreases by 0.135. The worst performing model, SCScore, is only slightly better (AUC = 0.528) than a random model. However, because T MC+ and T MC− data sets consist of only 40 compounds each, the results must be interpreted with caution as small changes in confusion matrices lead to relatively large changes in reported metrics.

Table 1 The results of McNemar's two-sided paired tests for the T MC test set with the smallest SYBA AUC (AUC = 0.830)
The default threshold values were used (0.0 for SYBA, 0.5 for RF, and 6.0 for SAScore). The p-values were adjusted using Benjamini-Hochberg method

Classifier performance on computationally picked test set
Even stronger evidence of the differences between the models is provided by the classification of compounds in the large computationally picked T CP test set (Fig. 1).
Using the default thresholds, the differences between the classifiers are statistically significant (Cochran's Q test p-value < 10 −16 ) and all classifiers differ significantly (Table 4). When used with their default thresholds, both SYBA and RF are more accurate than SAScore (Table 5, Additional file 2: Panel S3). Low observed SAScore accuracy (Acc = 0.665) is caused by its low specificity when almost 70% of HS compounds are predicted to be ES (Table 5).
Cochran's Q test identifies significant differences (p-value < 10 −16 ) also if classifiers are used with the optimized thresholds. While statistically significant differences were detected within RF/SAScore and RF/ SYBA pairs (Table 6), the effect size is still rather small (Table 5). However, the difference between SCScore and all other methods is statistically significant and large (Tables 5 and 6). No statistically significant difference was observed between SYBA and SAScore meaning that these two methods yield comparable results.
Compared to its default threshold of 6.0, SAScore specificity increases by 0.662 (Table 5) when the threshold is shifted to the optimal value of 4.5 (Fig. 10). At this threshold, SAScore is on par with SYBA and RF methods (Table 5). However, SYBA retains its high performance over much broader range of threshold values than SAScore (Additional file 2: Figure S1).
High performance of SYBA, RF and SAScore is also evident from their AUC that is close to one (Fig. 9, Table 5). On the other hand, SCScore fails to distinguish between ES and HS compounds as can be deduced from its ROC curve (Fig. 9). In its optimal threshold of 3.1, SCScore predicts a majority of T CP compounds as HS (Fig. 10, Additional file 2: Panel S3).
In addition to the T MC and T CP test sets, the performance of SAScore and SCScore was also assessed using the training set S, as this data set was not used for their parametrization. Classification results are shown in Table 7 and Fig. 11, confusion matrices are available in Additional file 2: Panel S4.
In agreement with previous experiments on the T MC and T CP test sets, SAScore is able to distinguish between ES and HS compounds more accurately than SCScore. However, to achieve the best performance, SAScore classification threshold must be shifted from its default value of 6.0 to the optimal value of 4.4. At this threshold, SAScore is both highly sensitive and specific, while SCScore    is, using its optimal threshold of 3.7, sensitive and specific only moderately. The observed poor performance of SCScore in all data sets may follow from the fact that SCScore differs conceptually from other methods tested in the present work. In SCScore, the problem of predicting synthetic complexity is reformulated as the analysis of reactions consisting of reactant-product pairs and SCScore correlates with a number of reaction steps. It means that, contrary to SYBA and SAScore, SCScore captures synthetic feasibility of compounds, not structural complexity. In SCScore derivation, each molecule is analyzed as a whole in the context of all molecules and reactions as they appear in the Reaxys database [44]. Thus, SCScore is biased [43] by the types of reactants and products in the Reaxys database. Therefore, we hypothesize that the unsatisfactory results of SCScore are caused by the fact that compounds in our data sets come from chemical subspace insufficiently covered by Reaxys compounds.

Conclusions
In the present work, SYBA method for the classification of organic compounds as easy-and hard-to-synthesize is described. SYBA is an additive fragment-based approach meaning that the compound is decomposed into individual substructure fragments, each fragment is assigned its respective SYBA fragment score and these are summed to obtain the final SYBA score. The fragment scores were derived by the Bayesian analysis of the frequency of ECFP8 fragments occurring in the database of ES compounds, that were randomly chosen from the ZINC15 database [57], and HS compounds, that were generated using the Nonpher approach [58]. Because SYBA was derived from ECFP8 fragments that utilize only molecular connectivity and no 3D information, the influence of stereochemistry on synthetic accessibility is not accounted for. However, apart from ECFP8 fragments, the number of stereocenters is included in SYBA calculation and the compounds with many stereocenters are penalized. If the SYBA score is positive, the compound is considered to be ES and vice versa. While SYBA score can theoretically assume values between plus and minus infinity, a majority of compounds will have SYBA score between − 100 and +100 in real applications. It must be stressed here that the absolute value of the SYBA score is the measure of the confidence of the prediction and not of the degree of the synthetic accessibility.
SYBA was compared with other two recent classification methods, SAScore [45] and SCScore [43]. As a baseline for the comparison, RF/ECFP4 classifier was used due to its wide adoption in many cheminformatics applications. All methods were assessed using accuracy, sensitivity, specificity and area under the ROC curve. While SYBA and RF provide similar performance, we recommend to use SYBA due to its smaller complexity, lower computational demands and more straightforward analysis of the individual fragment contributions. SYBA outperforms SAScore when this is used with the threshold of 6.0 proposed by the authors [45]. However, if the SAScore threshold is changed to the value of ~ 4.5, the accuracy of SYBA and SAScore becomes comparable. Therefore, to reduce the number of false positives in workflows that utilize SAScore, we recommend to decrease the SAScore threshold to ~ 4.5. SYBA, RF and SAScore substantially outperform SCScore. The observed weak performance of SCScore can be, in our opinion, attributed to the fact that our test set compounds come from a part of chemical space that is insufficiently covered by Reaxys compounds used to derive SCScore. The SYBA fragment scores can be mapped [74] onto a molecule and used for the analysis of the contribution of its individual substructures to the overall synthetic accessibility.
SYBA can be used to quickly rank large molecular data sets that originate, for example, from de novo molecular design. However, SYBA is conceptually based on the notion that a compound can be categorized as easy-and hard-to-synthesize. As the synthetic accessibility is a vaguely defined term, SYBA's simplifying approach, though accurate enough, cannot compete with more sophisticated synthetic path-reconstruction methods that enable the incorporation of other factors such as the availability of starting materials, reaction yields or a price aspect. In the end, the definitive assessment of synthetic accessibility is in the hands of experienced organic chemists.