For this proof-of-concept study aiming at generating the V1M library, we constrained the program so that each enumerated macrolide scaffold had a total of twelve SM units (i.e., to create 14-member ring macrolide scaffolds such as Erythromycin). Each SM unit was selected from the nine unique CSM types (see inputs for V1M in Table 1 and Fig. 6). SM001, SM002, SM006, SM013 were allowed 3, 4, 2 and 2 times per macrolide scaffold respectively, and the remaining CSM types once. These values were chosen by an approximate weighting based on the frequency of CSM types found in each of the eighteen bioactive macrolide (BM) scaffolds (see Additional file 2: Table S1). In other words, we allowed frequently occurring structural motifs such as SM001 and SM002 to be repeated more often than the others. We had the program skip 100 k permutations after each output structure. An additional ester was added, and V1M was set up to contain exactly 1 million macrolide scaffolds (Table 1).
The number of all possible macrolide scaffolds according to the input parameters used for V1M was approximately 872 billion compounds, since each macrolide scaffold used twelve SMs taken from a total of sixteen available CSMs in the selection pool (i.e., 3 repeats of SM001, 4 repeats of SM002, 2 repeats of SM006, 2 repeats SM013, 1 repeat of SM003, SM004, SM005, SM008, SM009). Herein, we generated a total of 1 million compounds by skipping 100 k possible compounds after each selected macrolide scaffold. Considering that all the examined compounds met the required structural features and were chosen to be in V1M, we only covered the first 100 billion compounds, i.e., a small fraction (11.47%) of the entire possible macrolide scaffold population based on the user-defined constraints. It took a standard desktop PC (Intel® Core™ i7 CPU, 8 GB RAM) approximately seven hours to generate V1M. Then we analyzed it in terms of:
-
Structural diversity We reported on the distribution, composition of CSM types, their occurrence(s) per macrolide scaffold, along with the distribution of heteroatoms and heavyatoms observed in V1M. We computed the same properties for simplified structures of eighteen well-known BM drugs (Additional file 2: Figure S2) and compared their results to those of V1M.
-
Chemical diversity We studied the distributions of molecular descriptors (molecular weight—MW, hydrophobicity—SlogP, topological polar surface area—TPSA, hydrogen bond acceptor—HBA, hydrogen bond donor—HBD, rotatable bond—NRB) which are commonly used to assess drug likeness, bioavailability, and oral absorption [38,39,40,41,42,43]. Additionally, we computed the same descriptors for eighteen BM scaffolds, and conducted a comparative analysis to further emphasize the chemical diversity of V1M.
-
Correlation Analysis we conducted a pairwise correlation analysis among all computed descriptors (MW, SlogP, TPSA, HBA, HBD, NRB, heteroatoms, heavyatoms).
-
Structural similarity with eighteen BM scaffolds we assessed the chemical similarity of our 1M macrolide scaffolds from V1M with respect to the eighteen BM scaffolds by conducting fingerprint analysis (MACCS).
Regarding the comparative analyses with the eighteen BMs, substituted ester and amino functional groups protruding from the ring cyclic frameworks were replaced with alcohol and amine groups respectively, and sugar blocks were removed. This preprocessing step allowed us to directly compare the molecular properties of the scaffolds of the well-known BMs to those generated in V1M. Simplified structures of the eighteen BMs are provided in Fig. 2. For example, the structure of Erythromycin was modified as shown in Additional file 2: Figure S3.
Structural diversity of V1M
We analyzed the structural diversity of V1M by studying the distribution of CSM types, each type’s composition, the occurrence of each CSM type per macrolide scaffold, counts of heteroatoms and heavyatoms. The distribution of CSM types and their composition in V1M were presented in Fig. 8a. The blue bars represent the number of macrolide scaffolds where respective CSM types are observed, and the orange bars represents the total composition of each CSM type in the entire library. All nine CSM types, in general, were highly represented (blue, Fig. 8a); SM001, SM002, SM003, SM008 and SM013 were observed in all 1M macrolide scaffolds, and SM006, SM005, SM009, SM004 were observed in 875 k, 800 k, 660 k, 602 k macrolide scaffolds respectively. V1M employed all nine unique CSM types among a selection pool of sixteen CSMs; therefore, the overall diversity of SMs in V1M is significantly high. Among the eighteen BM scaffolds, we observed a somewhat similar distribution of CSM types (blue, Fig. 8b). All nine CSM types were present in eighteen BM scaffolds. SM001 and SM002, like in V1M, were involved in all eighteen BM scaffolds; SM003, SM005, SM006, SM008 and SM013 were found in ten, six, six, seven and seven bioactive macrolide scaffolds. Interestingly, SM002 was observed twice more frequently than SM001 in these eighteen BM scaffolds. We could probably posit that the methyl structural motif (SM002) helps maintain/impose critical conformational constraints for the macrolides (compared to the SM001). The direct comparison of CSM type distribution (in percentages) among the scaffolds of the eighteen BMs and V1M was shown in Additional file 2: Figure S4A.
We then studied the total composition of CSM types in V1M (orange, Fig. 8a). Since we generated a million macrolide scaffolds each containing 12 CSMs, the total number of CSMs used in the library was 12 million. Among all 12 million CSMs, a comparatively large portion in V1M were occupied by SM002 (methyl, 2.9 million, 24%) and SM001 (methylene, 2.2 million, 18%). The high occurrence for these SMs was obviously fueled by the high number of repetitions we initially allowed for this enumeration (SM002 was allowed four times, and SM001 three). The SM013 alkene was allowed two times and was comprised in 1.58 M (13%). Some CSMs containing oxygen such as SM004, SM005 and SM009 were comprised in relatively small portions (from 600 k to 800 k, 6%). The remaining SMs (SM003 containing an ethyl, SM006 containing an α-hydroxy methyl, and SM008 with a methyl carboxaldehyde) were comprised in ~ 1–1.2 M (9%) of the entire SM population. Regarding the eighteen BM scaffolds, a highly similar CSM type composition was observed (orange, Fig. 8b). Among a total of 163 CSMs found in eighteen BM scaffolds, SM002 (60, 36.8%) and SM001 (32, 19.6%) accounted for fairly large portions, as in V1M. The remaining seven CSM types among the BM scaffolds maintained a similar composition (4–8%) each. The direct comparison of CSM type composition (in percentages) among the eighteen BM scaffolds and V1M (Additional file 2: Figure S4B) emphasized a remarkably similar pattern, which was contributed by the carefully weighted inputs for SM type frequencies in generating V1M.
Regarding their occurrence per macrolide scaffold, we limited the occurrences of five CSM types: SM003, SM004, SM005, SM008, SM009, to only one per scaffold. Thus, it is not surprising to observe that their recurrences per macrolide scaffold in V1M were only one (Fig. 8c). Despite having different distributions for each repetition (which was not controlled during the enumeration process), V1M contained SM001, SM002, SM006 and SM013 up to their maximally allowed repetitions per macrolide scaffold (Fig. 8c). It demonstrates that the various user constraints are fully respected in the macrolide scaffold structures generated by the PKS Enumerator software. In comparison to the eighteen BM scaffolds (Fig. 8d), the frequency distributions of the CSMs in V1M appear more balanced or controlled (Fig. 8c); in other words, they form a slightly bell-shaped pattern. On the other hand, there is no recognizable pattern among the different occurrences of CSMs among the eighteen BM scaffolds.
The number of O-heteroatoms, which is the only type of heteroatom in V1M solely based on the selected CSMs, ranged from 4 to 8 (Fig. 9f). The highest populations with approx. 444 k and 350 k macrolide scaffolds had six and seven heteroatoms, respectively, and 144 k macrocycles contained five O-heteroatoms. The diversity of heteroatoms in the library can be easily controlled by using RSMs that could introduce alternative heteroatoms other than oxygen, such as nitrogen, sulfur, phosphorus, boron, etc. However, we specifically chose commonly found CSMs derived from the eighteen BM scaffolds and excluded RSMs that are normally enriched with different heavy atoms and/or functional groups. In comparison to the eighteen BM scaffolds, which contained 7 to 9 heteroatoms, V1M delivered relatively lower numbers of heteroatoms. The number of heavy atoms in V1M followed a slightly left-skewed distribution ranging from 27 to 32 (Additional file 2: Figure S5B). Most of the library (858 k macrolide scaffolds) had 29 to 31 heavy atoms. A higher diversity in the number of heavy atoms can be delivered by adjusting the ring size or structural motifs with different lengths and/or functional groups. All BM scaffolds were composed of 27, 29, 30, 31 and 34 heavyatoms per ring, and seventeen BM scaffolds were observed within the heavyatom distribution of V1M.
Chemical diversity of V1M
We analyzed the chemical diversity section of V1M library in terms of six molecular properties: molecular weight (MW), hydrophobicity (SlogP), topological polar surface area (TPSA), hydrogen bond acceptors (HBA), hydrogen bond donors (HBD), and rotatable bonds (NRB). These specific molecular properties were selected because they are commonly used to assess oral absorption, cell permeability, bioavailability, and drug likeness [38,39,40,41,42,43] of small molecules.
The molecular weight of V1M followed a slightly left-skewed distribution. It ranged from 378.5 to 456.6 g mol−1 with an average (meanMW) equal to 422.6 ± 16.7 g mol−1 (Fig. 9a). The highest population with approximately 316 k macrolide scaffolds (31.6%) fell between 420 and 435 g mol−1. Remarkably, 828 k (83%) of V1M population fell within the narrow range of MW from 405 to 450 g mol−1, along with seventeen out of the eighteen BM scaffolds which ranged from 384.22 to 482.25 g mol−1 with an average of 424.7 ± 23.7 g mol−1. Regarding Lipinski’s rule of 5 [39], V1M and eighteen BM scaffolds abided by Lipinski’s molecular weight since they all had MW less than 500. However, the original structures of eighteen BMs were simplified by removing sugars and bulky functional groups for a direct comparative study with V1M. Even after the removal of commonly occurring sugar groups which would amount to approx. 320 g mol−1 and bulky functional groups, eighteen BM scaffolds were found to have MWs very close to the MW threshold. Therefore, based on the eighteen bioactive macrolide scaffolds we studied, MW limit of 500 g mol−1 may not be quite relevant for macrolides.
Then, we analyzed the distribution of the predicted hydrophobicity as assessed by the fragment-based octanol/water coefficient partition (SlogP) for all generated macrolide scaffolds in V1M. SlogP had a bell-shaped symmetric distribution with values ranging from 1.19 to 5.57, and an average of 3.29 ± 0.77 (Fig. 9b). The highest population with approximately 244 k macrolide scaffolds (24.48%) fell in the SlogP range between 3 and 3.5. A very small percentage (0.48%) of V1M had SlogP > 5, thereby exceeding Lipinski’s rule regarding hydrophobicity. On the other hand, SlogP of eighteen BM scaffolds abided by Lipinski’s rule; they ranged from 0.17 to 2.69 with an average of 1.17 ± 0.65. All eighteen BM scaffolds were observed either below or within the low spectrum of SlogP distribution in V1M, suggesting that low/moderate hydrophobicity is preferred for potent antibiotic macrolide drugs.
Topological Polar Surface Area (TPSA) appeared to follow a left-skewed distribution. It ranged from 52.6 to 130.4 Å2 (Fig. 9c) with an average of 99.5 ± 15.6 Å2. Approximately 302 k (30.2%) fell in the most populated region between 105 and 120 Å2, and 99.5% of the library had TPSA between 60 and 135, in which 11 BM scaffolds (61.1%) were observed as well. TPSA of the eighteen BM scaffolds ranging from 113 to 153 Å2 were observed in the high end and beyond the maximum range of V1M. Since macrolides are known for their ability to bind difficult target proteins with bland surfaces and large binding pockets [9], higher TPSAs observed for these eighteen BMs could be highly relevant to their potent bioactivities. Overall, both V1M had thirteen BM scaffolds had TPSAs compliant with Veber’s rules [38], TPSA ≤ 140 Å2 beyond which five BM scaffolds were observed.
The number of hydrogen bond acceptors (HBA) in V1M followed a slightly right-skewed distribution (Fig. 9d). V1M ranged from 4 to 8 HBAs with an average of 6.31 ± 0.8. Approximately 938 k macrolide scaffolds (93.8%) of V1M was observed to have HBAs from 5 to 7, and a majority 444 k (44.4%) macrolide scaffolds had 6 HBA. However, higher range of HBA (7 to 9) was observed for all BM scaffolds, suggesting a higher count of HBAs could be relevant to the potent bioactivities of macrolides. Both V1M and eighteen BM scaffolds were compliant with Lipinski’s rule regarding HBA (HBAs ≤ 10). Meanwhile, the numbers of hydrogen bond donors in V1M ranged from 0 to 3, with 497 k (49.7%) macrolide scaffolds have 2 HBDs (Fig. 9e). BM scaffolds covered a wide range of HBDs from 1 to 6, with 11 BM scaffolds within HBD distribution of V1M. Sixteen BM scaffolds and V1M had HBD values compliant with Lipinski’s rule (HBD ≤ 5). Two BM scaffolds had HBDs of 6. The number of rotatable bonds in each macrolide scaffold ranged from three to four (Additional file 2: Figure S5A). A large majority, approximately 800 k (80%), of V1M had rotatable bonds of four. All BM scaffolds covered a range of rotatable bonds from 0 to 6 with an average of 2.33 ± 1.68. Seven BM scaffolds had NRB of 1, and only five BM scaffolds fell within the same distribution of V1M. Both our V1M library and 18 BM scaffolds were compliant with Veber’s rule of NRB (NRB ≤ 10).
Since we have been referencing Lipinski’s rule of 5 and Veber’s rules for our macrolide scaffolds, we also conducted a short study to assess whether the eighteen chosen BMs with reduced structures (Additional file 2: Figure S2) followed these rules as well. Additional file 2: Figure S6 shows a summary of molecular properties and filters such as MW, SlogP, TPSA, HBA, HBD, NRB that are normally used to determine cell permeability, bioavailability and drug likeness [38, 40, 41], along with color-coded information on whether the molecular property values fall within or outside Lipinski’s and Veber’s region.
Thirteen out of eighteen BM scaffolds displayed molecular properties well within Lipinski’s [39] and Veber’s rules[38] while the rest slightly deviate in TPSA and HBD (Additional file 2: Figure S6). All BM scaffolds display MW ≤ , SlogP ≤ 5, HBA ≤ 11 and NRB ≤ 10. Five deviating BM scaffolds still showed values not far from the Lipinski’s and Veber’s marginal values (HBD = 5, TPSA = 140). Dirithromycin (HBD = 6, TPSA = 153.47) and Roxithromycin (HBD = 6, TPSA = 151.3) showed the highest deviations from the Lipinski’s border values of HBD while the rest three BM scaffolds (Cethromycin, Erythromycin and Flurithromycin) slightly deviated from Veber’s TPSA of 140 by a range of 4.52–7.15. This data suggested that, as expected, not all BM scaffolds abided by Lipinski’s or Veber’s rules [39], but the majority of BM scaffolds still fell within Lipinski’s region of drug likeness and bioavailability. One should underscore again that our analysis was conducted using the reduced representation of the BMs (i.e., only the macrolide scaffolds) to enable the direct comparison with the macrolide scaffolds generated by the PKS Enumerator. One should also underline that estimating the drug likeness of macrolides is highly complex; therefore, rules derived from small aliphatic molecules are likely to fail.
Correlation analysis among computed descriptors of V1M
To better understand the relationships among the chemical descriptors, we analyzed the Pearson correlation coefficients among all molecular properties (MW, SlogP, TPSA, HBA, HBD, heteroatoms and heavyatoms) computed for V1M. The heatmap is reported in Additional file 2: Figure S7. Several interesting patterns emerged during this pairwise correlation analysis among descriptors. TPSA was observed to hold multiple strong positive relationships with other molecular descriptors except for SlogP and rotatable bonds. TPSA established strong positive correlations with HBA (r = 0.97), HBD (r = 0.88), and heteroatoms (r = 0.97). Predictably, introducing heteroatoms, especially polar atoms such as oxygen or nitrogen or fluorine, would increase polar surface areas and promote hydrogen bonding as well. The chemical descriptors related to polarity and hydrogen bonding (TPSA, HBA, HBD and heteroatoms) all had strong positive correlations among each other; and some more than the others. For example, HBA established a perfect positive correlation with heteroatoms. Oxygen was the only heteroatom introduced in CSMs (SM004, SM005, SM006, SM008 and SM009) used for our study, thus it is the major source affecting important chemical properties which are TPSA, HBA and HBD. Introducing polar functional groups by carefully designing new SM types, selecting and specifying the occurrences of SM types per macrolide scaffold could have a significant impact on the associated chemical properties.
MW had an unsurprisingly strong positive correlation with heavyatoms (r = 0.99). MW also showed relatively strong positive correlations with HBA (r = 0.71), HBD (r = 0.67), TPSA (r = 0.66). It was likely because the CSM types containing oxygen had relatively larger functional groups in comparison to the rest in our study (e.g. SM005, SM006, SM008), thereby resulting in a high correlation between MW and other descriptors: TPSA, HBA and HBD. This correlation can be enhanced by allowing higher number of CSMs with polar atoms in the selection pool or increasing their repeats per macrolide scaffold. In general, it can be seen in V1M that several molecular properties such as MW, TPSA, HBA, HBD, heavyatoms and heteroatoms had moderate or strong positive correlations with one another.
Comprehensibly, SlogP (hydrophobicity) established multiple strong negative relationships with other molecular descriptors: TPSA (r = − 0.94), HBA (r = − 0.93), HBD (r = − 0.77), and heteroatoms (r = − 0.93). Since most polar compounds are known to interact with water, lower hydrophobicity would be observed for macrolides possessing larger polar surface areas or functional groups with potential HBAs and HBDs. SlogP didn’t show good correlations with MW (r = − 0.44) and NRB (r = 0.04). NRB did not report any important correlations with the rest of the molecular descriptors, and the distribution of NRB within V1M (Additional file 2: Figure S5A) was too low to form any significant correlations with other descriptors.
Analysis of chemical fingerprints for V1M
Chemical fingerprints were computed for the scaffolds of both V1M and the eighteen BMs. We used 2D MACCS (RDKit implementation of the MACCS keys [36]) via the RDKit fingerprint node in Knime. Tanimoto similarity coefficients were computed via the CDK toolkit in Knime for the fingerprints of V1M against those of eighteen BM scaffolds which were used as reference compounds. For each of 1 M macrolide scaffolds in V1M, only the maximum Tanimoto score achieved with any of the eighteen BM scaffolds was reported, i.e., maximum aggregation method. For example, a macrolide scaffold would afford various Tanimoto scores with all the eighteen BM scaffolds, among which it afforded the highest Tanimoto score with Clarithromycin. So, for that macrolide scaffold, only Clarithromycin and the associated Tanimoto score was reported. We then analyzed the distribution of Tanimoto scores obtained for all 1 M macrolide scaffolds. Tanimoto scores range from 0 to 1, with 1 being the highest similarity score between two compounds and 0 the lowest. It should be noted that the Tanimoto score between Spiramycin and Rokitamycin is 1 (Additional file 2: Figure S8), which would explain why macrolide scaffolds in V1M obtained the same Tanimoto scores with both. It should also be noted that MACCS method does not account for chirality since they are 2D fingerprints based; thus, there is a clear limitation in determining chemical similarity for compounds with different stereocenters.
V1M had a slightly left-skewed distribution of MACCS Tanimoto scores ranging from 0.63 to 1.0, with an average of 0.84 ± 0.04 (Fig. 10a). The macrolide scaffolds in V1M identified seven among the eighteen BM scaffolds as most chemically similar: Clarithromycin, Midecamycin, Rokitamycin, Spiramycin, Tylosin, MiocaV1Mmycin and Erythromycin (Fig. 10b).
The count of macrolide scaffolds with highest Tanimoto scores associated with these seven BM scaffolds is reported in Fig. 10b. A large portion of V1M, 297 k (30%) and 301 k macrolide scaffolds (30%), was associated with the highest MACCS-based chemical similarity with Miocamycin and Clarithromycin respectively among all other BM scaffolds (Fig. 10b). Approximately one quarter of V1M, 216 k (22%) macrolide scaffolds, identified Midecamycin, and 149 k macrolide scaffolds (15%) identified Erythromycin as the most chemically similar BM scaffolds. Only 32 k macrolide scaffolds (3%) identified Rokitamycin/Spiramycin, and 3.9 k macrolide scaffolds (0.39%) Tylosin as the highest chemically similar BM scaffold.
The boxplot analysis in Fig. 10c showed the distributions of Tanimoto scores against these seven BM scaffolds. This allowed us to determine the level of chemical similarity between individual BM scaffolds identified as most chemically similar, and their closest macrolide scaffold analogues in V1M. Clarithromycin (0.853 ± 0.03), Midecamycin (0.846 ± 0.03) and Rokitamycin/Spiramycin (0.849 ± 0.03) showed similar distributions with a median Tanimoto score of approximatively 0.85 (Fig. 10c), indicating that macrolide scaffolds in V1M afforded an equivalent level of chemical similarity with these BM scaffolds. Miocamycin (0.826 ± 0.05) and Erythromycin (0.831 ± 0.04) covered relatively wider, but somewhat similar Tanimoto distributions with a median Tanimoto score of approximately 0.83 (Fig. 10c). Overall, their closest analogues from V1M showed an equivalent level of chemical similarity with these BM scaffolds; except for Tylosin which had relatively lower Tanimoto scores ranging from 0.66 to 0.84 with an average of 0.76 ± 0.03. Among the macrolide scaffolds in V1M that had highest fingerprint similarity with Miocamycin, six example structures along with their Tanimoto scores were provided in Fig. 11. Using Tanimoto score of 0.75 as a cutoff value for good similarity measurement, 987 k macrolide scaffolds in V1M achieved high chemical similarity with the known BM scaffolds based on MACCS fingerprint.