Robust optimization of SVM hyperparameters in the classification of bioactive compounds

Background Support Vector Machine has become one of the most popular machine learning tools used in virtual screening campaigns aimed at finding new drug candidates. Although it can be extremely effective in finding new potentially active compounds, its application requires the optimization of the hyperparameters with which the assessment is being run, particularly the C and γ values. The optimization requirement, in turn, establishes the need to develop fast and effective approaches to the optimization procedure, providing the best predictive power of the constructed model.
Results In this study, we investigated the Bayesian and random search optimization of Support Vector Machine hyperparameters for classifying bioactive compounds. The effectiveness of these strategies was compared with the most popular optimization procedures: grid search and heuristic choice. We demonstrated that Bayesian optimization not only provides better, more efficient classification but is also much faster: the number of iterations it required for reaching optimal predictive performance was the lowest of all the tested optimization methods. Moreover, for the Bayesian approach, the choice of parameters in subsequent iterations is directed and justified; therefore, the results obtained by using it are constantly improved and the range of hyperparameters tested provides the best overall performance of Support Vector Machine. Additionally, we showed that a random search optimization of hyperparameters leads to significantly better performance than grid search and heuristic-based approaches.
Conclusions The Bayesian approach to the optimization of Support Vector Machine parameters was demonstrated to outperform other optimization methods for tasks concerned with the bioactivity assessment of chemical compounds. This strategy not only provides a higher accuracy of classification, but is also much faster and more directed than other approaches for optimization. It appears that, despite its simplicity, the random search optimization strategy should be used as a second choice if the application of the Bayesian approach is not feasible.
Graphical abstract The improvement of classification accuracy obtained after the application of the Bayesian approach to the optimization of Support Vector Machine parameters.
Electronic supplementary material The online version of this article (doi:10.1186/s13321-015-0088-0) contains supplementary material, which is available to authorized users.


Background
The application of computational methods at various stages of drug design and development has become a vital part of the process. As the methods developed become constantly more effective, the focus of attention shifts from optimizing their performance to minimizing their requirements for computational resources. The attainment of both effectiveness and the desired speed has been responsible for the recent extreme popularity of machine learning (ML) methods in computer-aided drug design (CADD) approaches. Machine learning methods are mostly used for virtual screening (VS) tasks, in which they are supposed to identify potentially active compounds in large databases of chemical structures. One of the most widely used ML methods in CADD is the Support Vector Machine (SVM). Although it can provide very high VS performance, its application requires the optimization of the parameters used during the training process, which has been shown to be crucial for obtaining accurate predictions. To date, various approaches have been developed to make SVM faster and more effective. In cheminformatics applications, the most popular optimization strategies are grid search [1,2] and heuristic choice [3,4]. Depending on the problem, they are able to provide high classification accuracy; for example, Wang et al. obtained 86% accuracy in the classification of hERG potassium channel inhibitors using a heuristic choice of the SVM parameters [4]. On the other hand, Hamman et al. [1] were able to evaluate cytochrome P450 activities with 66-83% accuracy using grid search to optimize the SVM parameters. The need for optimizing SVM parameters is undeniable, as classification efficiency can change dramatically across parameter values. The high computational cost of a systematic search over a predefined set of parameter values motivates the development of new optimization algorithms.
In recent years, Bayesian optimization [5,6] (including Gaussian processes [7]) and random search-based selection [8] have become more popular [9,10]. As these approaches have not yet been explored in the field of cheminformatics, we analyze their impact on classification accuracy and, more importantly, on the speed and ease of use that they lend to the optimization of SVM hyperparameters in the search for bioactive compounds.

Hyperparameters optimization
In the classical ML approach to a classification problem, we are given a training set $\{(x_i, y_i)\}_{i=1}^{N}$ (with $x_i$ representing the features of a sample, in our case a fingerprint, and $y_i$ being the class assignment) and we try to build a predictive model based on these data using a training algorithm that sets the parameters $w$ (for example, the weight of each fingerprint element) for fixed hyperparameters $\lambda$ (for example, the type of SVM kernel, the regularization strength $C$ or the width of the RBF kernel $\gamma$). In other words, given an objective $S$ that must be maximized, we are supposed to solve the following problem:
$$\max_{w} S(w \mid \lambda).$$
While this problem is often easily solvable (for example, in SVM, $S$ is a concave function, and thus, we can find the maximum by using a simple steepest ascent algorithm), in general, it is very hard to find an optimal $\lambda$. This difficulty stems from the very complex shape of the $S$ function once we treat $\lambda$ as its argument, which results in the joint optimization of the model parameters ($w$) and the set of hyperparameters ($\lambda$):
$$\max_{\lambda} \max_{w} S(w \mid \lambda).$$
A basic method for solving this problem is a grid search-based approach, which simply samples the set of possible $\lambda$ values in a regular manner. For example, we choose the parameter $C$ for an SVM in a geometric progression, obtaining the values $\lambda_1, \ldots, \lambda_k$ and returning the best solution among the subproblems:
$$\max_{i \in \{1, \ldots, k\}} \max_{w} S(w \mid \lambda_i).$$
While such an approach guarantees finding the global optimum for $k \to \infty$, it might be extremely computationally expensive, as we need to train $k$ classifiers, each of which can take hours. Instead, we can try to solve the optimization problem directly by performing an adaptive process that on one hand tries to maximize the objective function and on the other hand samples the space of possible $\lambda$ intelligently in order to minimize the number of classifier trainings. The main idea behind Bayesian optimization for such a problem is to use all of the information gathered in previous iterations for performing the next step. It is apparent that grid search-based methods violate this assumption, as they do not use any knowledge coming from the results of models trained with other $\lambda$ values.
We can consider this problem as the process of finding the maximum of $f(\lambda)$, defined as
$$f(\lambda) = \max_{w} S(w \mid \lambda).$$
Unfortunately, $f$ is an unknown function and we cannot compute its gradient, Hessian, or any other characteristics that could guide the optimization process. The only action we can perform is to obtain a value for $f$ at a given point. However, doing so is very expensive (because it requires training a classifier); thus, we need a fast (with respect to evaluating the function), derivative-free optimization technique to solve this problem.
For the task under consideration, $S$ is the accuracy of the resulting SVM model with the RBF kernel, and $\lambda = \{C, \gamma\}$ is the set of two hyperparameters that we must fit to optimize the SVM performance to predict the bioactivity of compounds, which (loosely speaking) is measured by $f$.
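The adaptive idea described above (every previous evaluation informs the choice of the next hyperparameters) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Gaussian process surrogate with a squared-exponential kernel of unit length scale, an expected-improvement acquisition rule, a random candidate set, and a synthetic `toy_accuracy` surface standing in for the expensive cross-validated SVM accuracy. All of these names and settings are hypothetical choices made for illustration.

```python
import math
import random


def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two points (log10 C, log10 gamma)."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2.0 * length ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def expected_improvement(mean, std, best):
    """Expected improvement over `best` for a maximization problem."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf


def bayes_opt(f, bounds, n_iter=12, n_init=3, n_cand=100, noise=1e-4, seed=0):
    """Maximize f over a box; returns (best_point, best_value, history)."""
    rng = random.Random(seed)

    def draw():
        return tuple(rng.uniform(lo, hi) for lo, hi in bounds)

    X = [draw() for _ in range(n_init)]          # random initial design
    y = [f(x) for x in X]
    while len(X) < n_iter:
        mu = sum(y) / len(y)                     # centre targets for a zero-mean GP
        yc = [v - mu for v in y]
        K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
             for i, a in enumerate(X)]
        alpha = solve(K, yc)                     # K^{-1} y, reused for every candidate

        def score(c):
            k = [rbf(a, c) for a in X]
            mean = sum(ki * ai for ki, ai in zip(k, alpha))
            var = 1.0 - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
            return expected_improvement(mean, math.sqrt(max(var, 1e-12)), max(yc))

        x_next = max((draw() for _ in range(n_cand)), key=score)
        X.append(x_next)                         # only one expensive evaluation per step
        y.append(f(x_next))
    i_best = max(range(len(y)), key=y.__getitem__)
    return X[i_best], y[i_best], list(zip(X, y))


def toy_accuracy(point):
    """Synthetic stand-in for cross-validated SVM accuracy (NOT real data)."""
    log_c, log_gamma = point
    return 0.9 - 0.01 * ((log_c - 1.0) ** 2 + (log_gamma + 3.0) ** 2)


# Hypothetical search box in log10 scale: C in [1e-2, 1e5], gamma in [1e-9, 1e3].
best_x, best_y, history = bayes_opt(toy_accuracy, bounds=[(-2.0, 5.0), (-9.0, 3.0)])
```

In a real experiment `toy_accuracy` would be replaced by a function that trains an SVM with the given C and γ and returns its cross-validated accuracy; the surrogate keeps the number of such trainings small because candidates are ranked on the cheap model rather than on the classifier itself.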

Results and discussion
Six SVM optimization approaches were evaluated in the classification experiments of compounds possessing activity towards 21 protein targets, represented by six different fingerprints (Table 1).

Classification effectiveness analysis
A global analysis of the classification efficiency revealed that Bayesian optimization clearly outperformed the other methods of optimizing SVM parameters (Fig. 1). Across the individual target/fingerprint combinations, the Bayesian approach provided the highest classification accuracy in 80 experiments, a much greater number than for the other strategies (22 for the runner-up, grid search). On the other hand, SVMlight and libSVM were the least effective ways of using SVM; they did not provide the highest accuracy values for any of the target/fingerprint combinations. This result is an obvious consequence of the fact that SVMlight and libSVM are just basic heuristics, whose results cannot be expected to match those of any hyperparameter optimization technique. Interestingly, libSVM achieved much better results than SVMlight even though its heuristic is much simpler.
The relationships between the various methods tested were preserved when the results were analyzed with respect to the various fingerprints (Fig. 2): Bayesian optimization always provided the highest classification accuracy for the largest number of targets (13-14 for each of the fingerprints analyzed), whereas the 'global' second-place method, grid search, was outperformed by 'small grid'-cv for two fingerprints: MACCSFP and PubchemFP. The runners-up (grid search or 'small grid'-cv, depending on the fingerprint) provided the best predictive power of the model for 3 proteins on average. The ineffectiveness of the SVMlight and libSVM strategies was already indicated in the 'global' analysis, and with respect to the various fingerprints, there was no protein for which those SVM optimization methods provided the highest classification accuracy.
The situation becomes more complex when separate targets are taken into account (Fig. 3). Bayesian optimization provided the best results for all considered representations for some proteins (CDK2, H1, ABL); however, in a few cases, other approaches for tuning SVM parameters outperformed the Bayesian method (5-HT6: random and grid search; beta1AR: 'small grid'-cv and grid search; beta3AR: grid search; HIVi: grid search, 'small grid'-cv and random search; MAP kinase ERK2: 'small grid'-cv and random search). These results show that a more careful approximation of model accuracy is required for some proteins. Because we are interested in maximizing the accuracy on a naive test set, we approximate this set by performing internal cross-validation for each method. This is a well-known technique in ML; however, it might not be reliable for small datasets. Beta1AR, beta3AR, and HIVi are very small datasets in our comparison; thus, it seems probable that the poor results of the Bayesian approach (a poor approximation of the S value) were caused by the high internal variance in the dataset rather than by the Bayesian approach actually being worse than the grid search method.
Because grid search was the second-place method in the majority of the analyses (the global analysis as well as the fingerprint- and target-based comparisons), a direct comparison of the number of highest accuracies obtained with Bayesian optimization and the grid search approach was performed (Table 2). The sums of the numbers of wins are not equal across the fingerprint-based or target-based comparisons because draws were also considered.
The comparison of the number of 'wins' for Bayesian optimization over the grid search indicates the superiority of the former approach. In the 'global' analysis, the Bayesian optimization strategy gave a higher accuracy for approximately a 3-fold higher number of experiments than the grid search. For the fingerprint-based analysis, the ratio of Bayesian/grid search wins was similar, with the best ratio (in favor of Bayesian optimization) obtained for MACCSFP (18:4) and the worst (15:7 and 15:6) for EstateFP and SubFP, respectively. When target-based comparisons were considered, Bayesian optimization outperformed the grid search approach in all cases for some targets (i.e., CDK2, H1, ABL); for others, there was only 1 case in which the grid search strategy won (i.e., 5-HT2A, 5-HT2C, M1, ERK2, AChE, A1, alpha2AR, CB1, D4, H3, IR); still others were draws (i.e., 5-HT7, beta1AR); and in two cases the grid search provided top accuracies (beta3AR, HIVi).

Examination of optimization steps in time
A time course study of the accuracy values was also conducted. Figure 4 shows analyses for 5-HT2A as an example; the results for the remaining targets are in the Additional files section (Additional file 1). We demonstrated not only that Bayesian optimization required the smallest number of iterations to achieve optimal performance (for all representations the number of iterations was less than 20), but also that in the majority of cases, SVM optimized using a Bayesian approach achieved better performance than all of the other optimization methods. SVMlight and libSVM were not iteratively optimized; therefore, the accuracy/number of iterations function is constant for these approaches. In general, the Bayesian and random search approaches reached their optimum very quickly (in fewer than 20 iterations), whereas the grid search method required many more iterations before the SVM reached optimal performance: 57 iterations (the lowest number) were required for EstateFP, and 138 (the highest number) for MACCSFP. Figure 4 also shows the rate of improvement of the accuracy after the application of a particular optimization approach, which depended on the representation of the compounds: for EstateFP, the improvement was from 0.8 (grid search) up to 0.82 (Bayesian), but relative to libSVM and SVMlight the improvement was considerably higher; for these two strategies, the classification accuracy was equal to 0.68. A similar result was obtained for ExtFP: the rate of improvement for the Bayesian optimization strategy compared with the random search approach was approximately 0.02 (from 0.88 to 0.90), but it was higher for the other optimization methods: 0.03 for grid search, 0.05 for libSVM and 0.22 for SVMlight. The pattern was similar for KlekFP and MACCSFP, with differences occurring only in the performance of libSVM.
However, for PubchemFP and SubFP, grid search optimization provided the same predictive power for SVM as Bayesian optimization; for SubFP, there was a range of iterations (117-142) in which grid search provided slightly better SVM performance (by about 2%) than the Bayesian approach.
In order to provide a comprehensive, global analysis of the changes in accuracy with an increasing number of iterations, the areas under the curves (AUC) presented in Fig. 4 (and the other curves that are placed in the Additional files section) were calculated. An example analysis for a selected target/fingerprint pair (5-HT2A, ExtFP) is presented in Table 3; the remaining analyses are in the Additional files section (Additional file 2). The global average AUC and the average AUC for particular fingerprints and targets are presented in Tables 4 and 5. The tables also include the final (for the selected target/fingerprint) and averaged (for the rest of the cases) final accuracy values obtained for a given strategy; the highest AUC/accuracy values for the particular case considered are marked with an asterisk. In general, the AUC of a curve indicates the strength of the trained model at any randomly chosen iteration. In other words, the AUC measures how quickly a given strategy converges to a strong model.
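The AUC idea can be made concrete with a small sketch. The exact procedure used for the tables is not stated in this section, so the version below is an assumption: the curve is the best accuracy found so far at each iteration, the area is computed with the trapezoidal rule, and it is divided by the number of intervals so that it reads as an average accuracy. The two accuracy sequences are made-up illustrative values, not results from the study.

```python
def best_so_far(accuracies):
    """Running maximum: accuracy of the best model found up to each iteration."""
    curve, best = [], float("-inf")
    for acc in accuracies:
        best = max(best, acc)
        curve.append(best)
    return curve


def normalized_auc(accuracies):
    """Trapezoidal area under the best-so-far curve, divided by the number of intervals."""
    curve = best_so_far(accuracies)
    if len(curve) < 2:
        return curve[0]
    area = sum((a + b) / 2.0 for a, b in zip(curve, curve[1:]))
    return area / (len(curve) - 1)


# Hypothetical curves: a slow starter that ends high, and a fast starter that ends lower.
slow_but_high = [0.60, 0.60, 0.70, 0.80, 0.90]
fast_but_lower = [0.85, 0.85, 0.88, 0.88, 0.88]

auc_slow = normalized_auc(slow_but_high)
auc_fast = normalized_auc(fast_but_lower)
```

With these made-up numbers the slow starter has the higher final accuracy but the lower AUC, which is the kind of situation in which a method can rank second on final accuracy and yet be beaten on AUC.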
The analysis of the results obtained for the example target/fingerprint pair (5-HT2A, ExtFP; Table 3) shows that both the highest AUC and the highest final optimal accuracy were obtained with the Bayesian strategy for SVM optimization. A similar observation was made for the global and fingerprint-based analyses; Bayesian optimization provided the best average AUC and average optimal accuracy for all fingerprints, as well as the best global average values of these parameters. Interestingly, although grid search was the second-place method for optimal accuracy, it was actually random search that outperformed this method in terms of AUC, which can be explained by an analysis of the respective curves.
Although the grid search method provided higher final accuracy values, these occurred relatively 'late' (after a series of iterations), whereas high accuracies were obtained almost immediately for random search (Figs. 4, 5). Similarly, the average AUC and optimal accuracy values calculated for various targets were highest for Bayesian optimization in the great majority of cases. HIVi and ERK2 were the only targets for which the averaged AUC obtained with the Bayesian optimization strategy was outperformed by other optimization methods. On the other hand, the group of targets for which the average optimal accuracy values were the highest for methods other than Bayesian optimization was a bit more extensive (i.e., M1, ERK2, A1, beta3AR, HIVi, IR). However, for most of these targets, the difference between the best average accuracy and that obtained with Bayesian optimization was approximately 3% (although for beta3AR, for example, this difference approached 10%, from 0.879 to 0.972). On the other hand, an improvement of several percentage points was also observed when the average AUC and optimal accuracy obtained with the Bayesian strategy were compared with the strategy that provided the 'second-best' accuracy value in the ranking. The number of iterations required to achieve optimal SVM performance was also analyzed in detail (Fig. 5; Additional file 3). The most striking observation was that all curves corresponding to the Bayesian optimization results were both shifted towards higher accuracy values and much 'shorter', meaning that a significantly lower number of iterations was necessary in total to reach optimal SVM performance.
Two relevant points arise from a comparison of Bayesian optimization with the grid search method (which sometimes outperformed Bayesian optimization): obtaining optimal accuracy with the grid search method required many more calculations, and even when grid search yielded higher accuracy values than Bayesian optimization, the difference between the two was approximately 1-2%. This result indicates that even when Bayesian optimization 'lost', the results provided by this strategy were still very good and, taking into account the calculation speed, it can also be applied successfully in experiments for which it was not indicated to be the best approach. A very interesting observation arising from Fig. 5 is that random search reached the optimal classification effectiveness (as measured by accuracy) in the least [...]
The results were also analyzed regarding the changes in accuracy when additional steps were applied. A panel of example results is shown in Fig. 6 for the cannabinoid CB1/SubFP combination (the remaining targets are in Additional file 4). The black dots show the sets of parameters tested in the particular approach, and the black squares represent the set of parameters selected as optimal. This chart shows the advantage of Bayesian optimization in terms of how it operates and the sequence of parameters it selects. The set of tested parameters is fixed for grid search optimization, whereas in the case of random search, it is based on random selection. On the other hand, the selection of parameters for Bayesian optimization is more directed, which also affects the effectiveness of the classification. For grid search, only a small fraction of the parameters tested provided satisfactory predictive power of the model (only approximately 35% of the predictions resulted in an accuracy exceeding 0.7).
Surprisingly, a relatively high classification efficiency was obtained with the random search approach: 60% of the sets of parameters tested provided predictions with an accuracy over 0.7. However, investigation of the Bayesian optimization approach to parameter selection revealed that the choice of parameters tested was justified, and hence, the results obtained with their use were significantly better than those obtained with the other approaches: 75% of the predictions had an accuracy over 0.7.
We conclude that there are three SVM hyperparameter selection approaches worth using for predicting the activity of compounds:
• the libSVM heuristic (when only one set of hyperparameters is needed),
• random search (when we need a strong model quickly, using fewer than a few dozen iterations),
• a Bayesian approach (when we want the strongest model and can wait a bit longer).
The SVMlight heuristic as well as the traditional grid search approach have been shown to be significantly worse in terms of both the resulting model accuracy and the time needed to construct such a model.

Conclusions
The paper presents the strengths of Bayesian optimization applied to fitting SVM hyperparameters in cheminformatics tasks. Because the importance and necessity of the SVM optimization procedure is undeniable, various approaches to this task have been developed so far. However, the most popular approaches to SVM optimization are not always very effective, in terms of both the predictive power of the models obtained and the computational requirements. This study demonstrated that Bayesian optimization not only provides better classification accuracy than the other optimization approaches tested but is also much faster and more directed: in the majority of cases, the number of iterations required to achieve optimal performance was the lowest of all the methods tested, and the set of parameters tested provided the best predictions on average. Interestingly, if good classification results are to be obtained quickly (using a low number of iterations and without complex algorithms), the random search method (in which hyperparameters are randomly selected from a predefined range) leads to very good performance of the SVM for predicting the activity of compounds and can thus be used when the Bayesian optimization approach is not feasible. Consequently, we can formulate the following rule of thumb for tuning SVM hyperparameters for the classification of bioactive compounds:
1. If you have no resources for performing hyperparameter optimization, use C = 1, γ = 1/d (as defined in libSVM).
2. If you have limited resources (up to 20 learning procedures) or limited access to complex optimization software, use a random search for C and γ with the distribution defined in the "Methods" section.
3. If you have resources for 20 or more training runs and access to Bayesian optimization software, use Bayesian optimization of C and γ.
In general, there is no scenario in which one should use a grid search approach (it is always preferable to use random search or a Bayesian method) or the SVMlight heuristic (it is always better to use libSVM) in tasks connected with the assessment of compound bioactivity.
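The rule of thumb above can be written down as a small decision helper. This is only an illustrative sketch: the function name, its arguments and the returned dictionary are hypothetical, d is taken to be the number of features (as in the libSVM default γ = 1/d), "no resources" is interpreted as a budget of at most one training run, and the overlap at exactly 20 runs is resolved in favor of Bayesian optimization when suitable software is available.

```python
def choose_strategy(n_features, budget, has_bayes_software=False):
    """Pick a hyperparameter strategy from the training budget (number of SVM fits)."""
    if budget <= 1:
        # Rule 1: no optimization at all, use the libSVM defaults C = 1, gamma = 1/d.
        return {"strategy": "libSVM default", "C": 1.0, "gamma": 1.0 / n_features}
    if budget < 20 or not has_bayes_software:
        # Rule 2: small budget or no Bayesian optimization software, use random search.
        return {"strategy": "random search", "n_iter": budget}
    # Rule 3: 20 or more training runs and Bayesian optimization software available.
    return {"strategy": "Bayesian optimization", "n_iter": budget}


# Hypothetical example: a 166-bit fingerprint and three different budgets.
default_choice = choose_strategy(n_features=166, budget=1)
small_budget = choose_strategy(n_features=166, budget=15, has_bayes_software=True)
large_budget = choose_strategy(n_features=166, budget=50, has_bayes_software=True)
```

Grid search and the SVMlight heuristic do not appear as outcomes, mirroring the conclusion that neither is recommended for this task.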

Methods
The objective of the iterative global optimization of a function $f : \mathcal{L} \to \mathbb{R}$ is to find a sequence of points $\lambda_1, \ldots, \lambda_n \in \mathcal{L}$ that converges to the optimal $\hat{\lambda}$, $f(\hat{\lambda}) = \sup_{\lambda \in \mathcal{L}} f(\lambda)$. A good algorithm should find a solution at least over some family of functions $\mathcal{F}$, not necessarily containing $f$.
The above-mentioned issue can be viewed as a sequential decision-making problem [37] in which at time step $i$ a decision $\alpha_i(\lambda_{1:i-1}, \tilde{f}_{1:i-1})$ is made based on all previous points, where $\tilde{f}_i = f(\lambda_i) + \varepsilon_i$. In other words, we have access to approximations of the $f$ values from previous steps. For simplicity, assume that $\varepsilon_i = 0$ ($f$ is deterministic); however, in general, all methods considered can be used in a stochastic scenario (for example, when randomized cross-validation is used as the underlying method for evaluating $f$).
The goal is to find $\alpha$ which minimizes $\delta_n(\alpha) = f(\hat{\lambda}) - f(\lambda_n)$, meaning that we are interested in
$$\alpha^{*} = \arg\min_{\alpha} \delta_n(\alpha),$$
which could be efficiently solved if $f$ were known.

Approximation of generalization capabilities
In general, we are interested in how well our predictive model behaves on a naive test set. In other words, we are assuming that our data are a finite iid (independent and identically distributed) sample from some underlying joint distribution $\mu$ over samples (compounds) and their binary labels (biological activity):
$$T = \{(x_i, y_i)\}_{i=1}^{N} \sim \mu(\mathcal{X} \times \{-1, +1\}),$$
where $\mathcal{X}$ represents a feature space of compounds under investigation. We want to maximize the expected accuracy over all possible compounds from $\mu$, in other words
$$f(\lambda) = \mathbb{E}_{(x, y) \sim \mu}\big[\mathbb{1}[\mathrm{SVM}(x \mid T, \lambda) = y]\big] = \int_{\mathcal{X} \times \{-1, +1\}} \mathbb{1}[\mathrm{SVM}(x \mid T, \lambda) = y] \, d\mu.$$
Clearly, we cannot integrate over an unknown probability distribution, but we can approximate this value using internal cross-validation. In other words, we are using a stochastic approximation
$$\tilde{f}(\lambda) = CV_T\big(\mathrm{SVM}(X_{test} \mid T_{train}, \lambda),\, Y_{test}\big),$$
where $CV_T(p, y)$ is the mean accuracy of the predictions $p$ as compared to the true labels $y$ over splits of set $T$ into $T_{train}$ and $T_{test}$ (composed of $X_{test}$ data and corresponding labels $Y_{test}$). Thus we can assume [38] that
$$\tilde{f}(\lambda) = f(\lambda) + \varepsilon,$$
where $\varepsilon$ is a random noise variable (resulting from the approximation error and the stochastic nature of cross-validation).
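The cross-validated estimate of accuracy can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the protocol used in the study: the number of folds, the shuffling seed and the fold construction are arbitrary choices, and a nearest-centroid classifier (`fit_nearest_centroid`, a hypothetical helper) stands in for the SVM so that the example runs with the standard library only. The toy data are synthetic.

```python
import random


def cv_accuracy(X, y, fit, k=5, seed=0):
    """Mean held-out accuracy over k folds; fit(X_train, y_train) returns a predict function."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # k disjoint test folds
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k


def fit_nearest_centroid(X_train, y_train):
    """Toy stand-in for SVM training: classify by the closest class centroid."""
    centroids = {}
    for label in set(y_train):
        points = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(points) for col in zip(*points)]

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    return predict


# Synthetic, well-separated data: 10 inactive (-1) and 10 active (+1) "compounds".
X_toy = [[float(i)] for i in range(10)] + [[100.0 + i] for i in range(10)]
y_toy = [-1] * 10 + [1] * 10
estimate = cv_accuracy(X_toy, y_toy, fit_nearest_centroid, k=5)
```

To evaluate a given pair (C, γ), `fit` would be replaced by a function that trains an SVM with those hyperparameters; the returned mean accuracy then plays the role of the noisy estimate of f described above.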

Random optimization
First, let us define a random optimization technique as a strategy $\alpha^{R}(\lambda_{1:i-1}, \tilde{f}_{1:i-1}) = \alpha^{R}_{i} = \lambda_i \sim P(\mathcal{L})$, for some probability distribution $P(\mathcal{L})$ over the hyperparameters. In other words, in each iteration, we sample $\lambda_i$ from $P(\mathcal{L})$, ignoring all previous samples and their results. Finally, we return the maximum of the values obtained.
It is easily seen that a random search, under the assumption that $\forall_{\lambda \in \mathcal{L}}\; P(\alpha^{R}_{i} = \lambda) > 0$, has the property described in (1). A random search will converge to the optimum [39], provided that every set of parameters can be generated when taking a new sample from our decision-making process. In practice, it is only necessary that $P(f(\alpha^{R}_{i}) = f(\hat{\lambda})) > 0$. Similarly, if one uses a grid search approach that discretizes $\mathcal{L}$, then given enough iterations and the assumption that $f$ is continuous, one will converge to the optimal solution. It is important to note that the speed of such convergence can be extremely low.
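A short worked calculation shows why this condition matters in practice. If each independent draw lands in a set of acceptable hyperparameters with probability p, then at least one of n draws lands there with probability 1 - (1 - p)^n. The value p = 0.05 below is a hypothetical illustration, not a quantity measured in the study, and the function names are introduced here only for the example.

```python
import math


def prob_at_least_one_hit(p, n):
    """Probability that at least one of n independent draws lands in a region of mass p."""
    return 1.0 - (1.0 - p) ** n


def draws_needed(p, confidence):
    """Smallest n such that the probability of at least one hit reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


# Hypothetical case: the acceptable region covers 5% of the search distribution.
hit_after_60 = prob_at_least_one_hit(0.05, 60)
n_for_95 = draws_needed(0.05, 0.95)
```

The same formula also shows the slow side of the guarantee: when p is tiny, for example when only one exact optimum counts, the number of draws required grows roughly like 1/p.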
The only thing missing is the selection of $P(\mathcal{L})$. Following many empirical studies showing that meaningful changes in the SVM results as a function of its hyperparameters can be expressed in the log-scale of these parameters, we use
$$P(\lambda = (C, \gamma)) = \frac{\log_{10} C - \log_{10} C_{min}}{\log_{10} C_{max} - \log_{10} C_{min}} \cdot \frac{\log_{10} \gamma - \log_{10} \gamma_{min}}{\log_{10} \gamma_{max} - \log_{10} \gamma_{min}},$$
where we are interested in $\mathcal{L} = [C_{min}, C_{max}] \times [\gamma_{min}, \gamma_{max}]$. In other words, we are using log-uniform distributions independently over $C$ and $\gamma$.
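Sampling from independent log-uniform distributions amounts to drawing the exponent uniformly and raising 10 to it. The following minimal sketch shows this; the ranges used for C and γ are hypothetical placeholders (the actual bounds are not given in this section), and the function names are introduced only for illustration.

```python
import math
import random


def sample_log_uniform(lo, hi, rng):
    """Draw a value whose log10 is uniform on [log10(lo), log10(hi)]."""
    return 10.0 ** rng.uniform(math.log10(lo), math.log10(hi))


def sample_hyperparameters(c_range, gamma_range, rng):
    """Draw (C, gamma) independently, each log-uniform over its own range."""
    return sample_log_uniform(*c_range, rng), sample_log_uniform(*gamma_range, rng)


# Hypothetical search ranges, chosen only for illustration.
C_RANGE = (1e-2, 1e5)
GAMMA_RANGE = (1e-9, 1e3)

rng = random.Random(42)
draws = [sample_hyperparameters(C_RANGE, GAMMA_RANGE, rng) for _ in range(1000)]
```

A random search then simply trains one model per draw and keeps the draw with the highest estimated accuracy. Sampling the exponent rather than the raw value gives every order of magnitude the same chance of being tried.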

Grid search
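For grid search optimization, we select $\lambda$ in a similar manner to the random search approach, uniformly in the log-scale of the parameters; given $M_p$ choices of parameter $p$,
$$\lambda_{ij} = (C_i, \gamma_j) = \left(10^{\log_{10} C_{min} + (i-1)\frac{\log_{10} C_{max} - \log_{10} C_{min}}{M_C - 1}},\; 10^{\log_{10} \gamma_{min} + (j-1)\frac{\log_{10} \gamma_{max} - \log_{10} \gamma_{min}}{M_\gamma - 1}}\right).$$
We put a linear order on $\lambda_{ij}$ by raveling the resulting matrix by column, which is the most common practice in most ML libraries. It is worth noting that one could achieve better scores by altering this ordering to a random permutation; however, in practice, such an alteration is rarely performed.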
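The grid construction and its ordering can be sketched as follows. The code assumes a matrix whose rows are indexed by C and whose columns are indexed by γ, so that raveling by column makes C vary fastest. The ranges and grid sizes are hypothetical, and indices start at 0 in the code where the formula uses i - 1 and j - 1.

```python
import math


def log_grid(lo, hi, m):
    """m values (m >= 2) equally spaced in log10 scale between lo and hi, inclusive."""
    step = (math.log10(hi) - math.log10(lo)) / (m - 1)
    return [10.0 ** (math.log10(lo) + i * step) for i in range(m)]


def ravel_by_column(c_values, gamma_values):
    """Order the (C_i, gamma_j) matrix column by column: gamma is the outer loop, C the inner."""
    return [(c, g) for g in gamma_values for c in c_values]


# Hypothetical ranges and grid sizes, for illustration only.
c_values = log_grid(1e-2, 1e2, 5)
gamma_values = log_grid(1e-3, 1e-1, 3)
evaluation_order = ravel_by_column(c_values, gamma_values)
```

With this ordering, the whole range of C is swept for the smallest γ before any other γ is tried, which is one way to see why the accuracy of a grid search may improve only late in the run.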

Bayesian optimization
If the exact form of $f$ were known (for example, if $f$ were convex and its derivative known), the optimization procedure would be much simpler. Unfortunately, $f$ is a black-box function with a very complex structure, which is expensive even to evaluate. However, some simplifying assumptions about $f$ might make the problem solvable. Assume that $f$ can be represented as a sample from a probability distribution over a family of functions, $f \sim P(f)$, $f \in \mathcal{F}$.
We can now express the expectation over the loss function $\delta_n$:
$$\mathbb{E}_{P(f)}[\delta_n(\alpha)] = \int_{\mathcal{F}} \delta_n(\alpha)\, dP(f).$$
Given that in step $n$ the values of $\lambda_i, \tilde{f}_i$ for $i = 1, \ldots, n-1$ are already known and using the Bayes rule, we can write:
$$P(f \mid \lambda_{1:n-1}, \tilde{f}_{1:n-1}) \propto P(\tilde{f}_{1:n-1} \mid f, \lambda_{1:n-1})\, P(f),$$
thus
$$\alpha^{*} = \arg\min_{\alpha} \int_{\mathcal{F}} \delta_n(\alpha)\, dP(f \mid \lambda_{1:n-1}, \tilde{f}_{1:n-1}).$$
This is a very basic equation for general Bayesian optimization techniques. Given additional assumptions about the prior distribution $P(f)$, very efficient solutions for the entire process can be provided. In the case considered here, a very common approach exploiting features of Gaussian processes is employed; thus, we assume