Practical guidelines for the use of gradient boosting for molecular property prediction

Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure–activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention owing to its performance in data science competitions, virtual screening campaigns and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-023-00743-7.

QSAR aims to link molecular structure (numerically encoded as so-called molecular descriptors) [15][16][17] with experimentally measurable properties. For this application, decision tree ensembles are among the most widely used machine learning methods thanks to their excellent performance, their ability to rank features by importance and their ability to scale to large datasets [18,19], alongside other popular frameworks such as support vector machines (SVM) [20,21].
Among decision tree ensembles, gradient boosting machines (GBM) have seen a strong surge in popularity in recent years, driven by excellent results in data science competitions and state-of-the-art performance in modelling tabular data [22,23]. GBM iteratively aggregates predictive models so that each one compensates for the errors of the previous step, thus yielding a high-performance ensemble.
However, several implementations of the GBM algorithm exist, each with unique modifications to the original formulation and employing different decision tree structures [23], such as XGBoost [31], LightGBM [32] and CatBoost [33]. While the importance of these differences has been recognised in other fields [23], these algorithms are used interchangeably in cheminformatics, and to our knowledge their respective advantages are not well documented. Thus, there is an urgent need for rigorous benchmarking of these different implementations for QSAR applications. This is further warranted by the uniqueness of cheminformatics datasets compared to other typical tabular datasets, such as those in finance and real estate price prediction [22,23]. For example, datasets in this field tend to have a much higher number of features, are often extremely imbalanced [34] and might contain false positives or false negatives [35].
The aim of this paper is to provide the first set of practical guidelines for the use of gradient boosting in QSAR applications, such as toxicology and drug discovery, by answering the following questions:
1. Which gradient boosting implementation performs best for QSAR?
2. Which package scales best to large datasets, such as high-throughput screens (HTS)?
3. Do they produce similar feature importance rankings, or do they highlight different molecular features?
4. Is it possible to identify the most important hyperparameters to optimize for these algorithms, to further accelerate the development and deployment of these methods for QSAR?
To answer these questions, we carried out a large-scale benchmark of these three implementations on 16 classification and regression datasets with 94 different endpoints commonly considered for virtual screening, covering a wide range of dataset sizes and class-imbalance ratios. To ensure the robustness of our results, we extensively optimized each algorithm according to the guidelines set out by the respective package authors and by recent studies, constructing 157,590 individual QSAR models.

Methods
GBM is an ensemble algorithm, which aggregates several decision trees into a single, more performant predictor. Decision trees are machine-learning models that learn a flowchart-like structure of hierarchical binary decisions [36]. The terminal nodes of the graph are generally called leaves and are used to assign sample predictions [36]. To explain how GBM constructs the decision tree ensemble, we first present the original formulation of the algorithm [37], followed by a systematic analysis of the changes introduced by XGBoost, LightGBM and CatBoost.

Gradient boosting
Given an input matrix X and a vector Y of molecular properties (e.g., biological activity), the gradient boosting algorithm approximates the underlying function mapping the molecular descriptors x_i to the biological activity y_i with a function F(x), constructed in an additive manner:

F(x) = Σ_{m=1}^{M} σ F_m(x)

where σ is the learning rate, a constant regularization parameter limiting the influence of a given predictor within the ensemble, and F_m(x) is the m-th tree. Given a loss function L(y_i, p_i), such as the binary cross-entropy, which measures the quality of the predictions p_i with respect to the real readouts y_i, after the first iteration each new tree F_m is learned by minimizing the following objective:

F_m = argmin_F Σ_i L( −∂L(y_i, p_i)/∂p_i, P_{m,i} )    (1)

where the derivative of the loss with respect to the ensemble output, evaluated at the previous iteration, represents the prediction residuals of F(x), and P_m are the predictions at the current iteration. As such, each new decision tree is constructed to compensate for the prediction errors of the model at the previous iteration, essentially conducting gradient descent in function space instead of parameter space.
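This additive scheme can be made concrete with a didactic sketch: for squared-error regression the negative gradient reduces to the residual y − F(x), so each round fits a new weak learner to the current residuals. Decision stumps stand in for full trees here; this is an illustration of the general scheme, not the implementation of any of the benchmarked libraries.

```python
# Minimal gradient boosting for 1-D regression with squared-error loss.
# The negative gradient of 1/2 (y - p)^2 is the residual y - p, so each
# stump is fitted to the residuals of the current ensemble.
def fit_stump(xs, resid):
    # Best single-threshold split minimizing squared error on the residuals.
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, resid) if x <= t]
        right = [r for x, r in zip(xs, resid) if x > t]
        if not left or not right:
            continue
        lmu, rmu = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmu) ** 2 for r in left) + sum((r - rmu) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmu, rmu)
    _, t, lmu, rmu = best
    return lambda x, t=t, lmu=lmu, rmu=rmu: lmu if x <= t else rmu

def gradient_boost(xs, ys, n_trees=50, lr=0.1):
    preds = [0.0] * len(xs)
    trees = []
    for _ in range(n_trees):
        resid = [y - p for y, p in zip(ys, preds)]  # negative gradient
        tree = fit_stump(xs, resid)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
        trees.append(tree)
    return lambda x: sum(lr * tree(x) for tree in trees)
```

With enough rounds, the learning-rate-damped updates drive the training residuals toward zero, which is exactly the behavior the shrinkage parameter σ regulates in the formulation above.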
The original formulation of GBM is the one employed by the popular machine learning package Scikit-learn [38]. Unfortunately, this implementation lacks many of the regularization and optimization methods implemented by XGBoost, CatBoost and LightGBM, and cannot be parallelized across multiple CPU cores. For this reason, we did not include the Scikit-learn version of GBM in the benchmarking study.

XGBoost
XGBoost introduces a regularized learning objective [31]. At a given iteration m, instead of being computed from the loss function L(y_i, p_i) alone, the residuals are calculated from the following regularized objective:

Obj_m = Σ_i L(y_i, p_i) + γ T_m + (λ/2) ||w_m||²    (2)

where γ and λ are regularization hyperparameters, T_m is the number of leaves in the m-th tree and ||w_m||² is the squared L2 norm of its leaf weights. Thanks to this modification, XGBoost learns simpler trees with smoother weights, which leads to better generalization [31]. Additionally, XGBoost employs Newton descent instead of gradient descent to optimize its trees, which leads to faster convergence [39]. Finally, XGBoost also introduced a new feature split finding algorithm to speed up training [31].
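Under this regularized objective, the optimal weight of a leaf and the gain of a candidate split have closed forms in terms of the per-sample first and second derivatives of the loss (g_i and h_i), as derived in the XGBoost paper [31]. The sketch below writes them out in plain Python as an illustration:

```python
# Closed-form quantities from XGBoost's regularized objective [31]:
# a leaf's optimal weight is w* = -G / (H + lambda), with G = sum of
# gradients and H = sum of hessians over the samples in the leaf, and a
# split's gain compares child scores G^2 / (H + lambda) against the
# parent's, minus the per-leaf penalty gamma.
def leaf_weight(grads, hess, reg_lambda):
    G, H = sum(grads), sum(hess)
    return -G / (H + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    def score(g, h):
        G, H = sum(g), sum(h)
        return G * G / (H + reg_lambda)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma
```

Note how λ shrinks the leaf weights toward zero and γ sets a minimum gain below which a split is not worth adding a leaf, which is precisely why XGBoost's trees end up simpler and smoother.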

LightGBM
This implementation also adopts many of the solutions proposed by XGBoost to improve performance, such as the regularized learning objective and Newton descent. However, LightGBM introduces three new strategies to make training more efficient: a histogram-based split finding method, Exclusive Feature Bundling (EFB) and Gradient-based One-Side Sampling (GOSS) [32]. EFB employs heuristics to find groups of mutually exclusive features and merges them together, thus reducing the dimensionality of the dataset, while GOSS relies on gradients to sample at each iteration the most important dataset instances without changing the training set distribution. Each of these algorithms simplifies different aspects of the original minimization objective, thus speeding up training with negligible loss in accuracy. Furthermore, LightGBM employs a different tree growth strategy compared to XGBoost. In most cases, trees are generated in a "breadth-first" fashion, where every time a new split is found, all other splits at the same level are considered before increasing the depth of the tree further. This yields tree structures that have the same depth across all branches. In contrast, LightGBM grows trees in a "depth-first" fashion (Fig. 1), where the algorithm splits nodes exclusively according to the largest performance gain [40]. This procedure leads to asymmetric trees, where certain branches might be very deep while others remain shallow. This approach tends to converge faster, but might be susceptible to overfitting on small datasets [32].
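The contrast can be made concrete with a toy sketch (plain Python, not LightGBM's actual histogram-based code): leaf-wise growth always splits the leaf with the largest squared-error gain, regardless of depth, which is what produces asymmetric trees.

```python
# Toy leaf-wise ("depth-first") tree growth on (x, y) points: at every
# step, the leaf whose best split yields the largest variance reduction
# is split first, so branches over high-variance regions grow deeper
# than the rest of the tree.
def sse(ys):
    if not ys:
        return 0.0
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys)

def best_split(points):
    pts = sorted(points)                       # sort by descriptor value
    parent = sse([y for _, y in pts])
    best = None
    for i in range(1, len(pts)):
        left, right = pts[:i], pts[i:]
        gain = parent - sse([y for _, y in left]) - sse([y for _, y in right])
        if best is None or gain > best[0]:
            best = (gain, left, right)
    return best

def grow_leaf_wise(points, num_leaves):
    leaves = [(list(points), 0)]               # (samples in leaf, depth)
    while len(leaves) < num_leaves:
        candidates = []
        for idx, (pts, depth) in enumerate(leaves):
            if len(pts) >= 2:
                gain, left, right = best_split(pts)
                candidates.append((gain, idx, left, right, depth))
        if not candidates:
            break
        _, idx, left, right, depth = max(candidates)
        leaves[idx:idx + 1] = [(left, depth + 1), (right, depth + 1)]
    return leaves
```

Running this on data whose target varies strongly in one region yields leaves at different depths, whereas a level-wise grower would split every leaf at one depth before descending further.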

Catboost
There are three main features that distinguish CatBoost from LightGBM and XGBoost. First, it provides a novel Target Statistics (TS) algorithm to handle categorical variables, which leads to more robust performance on unseen data by addressing the issue of target leakage during training [33]. However, categorical inputs are rarely found among molecular descriptors [41], so this aspect is of limited relevance for cheminformatics applications. Second, it introduced ordered boosting, a variation of gradient boosting where each model is trained on a different partition of the training dataset, tackling the prediction shift that arises from fitting trees on gradients obtained from samples already used during training. In principle, this approach reduces the risk of overfitting, especially on small datasets [33]. Third, CatBoost employs "oblivious decision trees", where the same variable and threshold are used to generate splits at a given depth level (Fig. 1) [33,42]. This enforced symmetry acts as regularization, constraining the expressiveness of the tree models, and can be leveraged to provide uncertainty estimates on predictions, similarly to Gaussian Process models [43]. Finally, the authors of this library have researched the theoretical properties of gradient boosting extensively and proposed several new features, such as Langevin gradient descent [44] and sample importance analysis [45], which are only available in the CatBoost package [42].

Datasets
To provide a robust evaluation framework for our benchmark analysis, we evaluated XGBoost, LightGBM and CatBoost on 16 classification and regression datasets from three well-established repositories: MoleculeNet [27], MolData [1] and the ChEMBL benchmarking study from Cortés-Ciriano et al. [46] (Table 1). From the first, we included Tox21, MUV, HIV, ClinTox, BBBP, BACE and SIDER. From the second, we chose the Phosphatase, NTPase, Oxidoreductase and Fungal datasets. From the third, we selected HERG, Acetylcholinesterase, COX-2, erbB1 and JAK-2. We retrieved the MoleculeNet datasets from a recent benchmarking study [16], while we referred to the original publications for the MolData repository and the ChEMBL datasets [1,46]. This selection entails approximately 1.4 million unique compounds and 94 endpoints covering a wide variety of protein families and biological responses, ensuring that our findings are broadly applicable to cheminformatics applications. Our selection covers an extensive range of compounds per endpoint (from 2000 to 330,000) and of imbalance ratios between compounds classified as either 'positive' or 'negative' (from 1:2 to 1:500), reflecting the diversity of datasets typically encountered in cheminformatics (Table 1).

Performance metrics
For each classification dataset, we evaluated the Receiver Operating Characteristic Area Under the Curve (ROC-AUC) and the Precision-Recall Area Under the Curve (PR-AUC). This selection is consistent with the figures of merit used in the literature for these datasets and ensures that the results are not skewed by high imbalance ratios [1,16,27,47,48]. For the regression datasets, we evaluated the Root Mean Squared Error (RMSE). To assess whether differences in performance are statistically significant, we used the two-tailed Mann-Whitney test with Bonferroni correction [49].
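Both figures of merit can be computed from score ranks alone. The sketch below is a minimal pure-Python version: ROC-AUC as the probability that a random positive is scored above a random negative (ties counted as 1/2), and average precision as a simple PR-AUC estimator (an assumption on our part; published benchmarks may use a different interpolation scheme):

```python
def roc_auc(y_true, scores):
    # Probability that a randomly drawn positive outranks a randomly
    # drawn negative; ties contribute 1/2.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    # Mean of the precision values at the rank of each positive sample.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / max(hits, 1)
```

Unlike accuracy, both metrics are insensitive to the decision threshold, and PR-AUC in particular degrades sharply when a model ranks minority-class actives poorly, which is why it is preferred for highly imbalanced screens.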

Molecular descriptors
We featurized all compounds using Extended-Connectivity Fingerprints (ECFP) with a radius of 2 and a bit size of 1024 [50]. To ensure that bit collisions are not a factor in any of our findings, we investigated the change in vector sparsity when using larger bit sizes. Given that the number of unique fragments remains approximately constant for all datasets when increasing the bit size (Additional file 1: Table S1), we can exclude that bit collisions play a role in the benchmarks of this study.
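The mechanism behind this check can be illustrated without RDKit: hashed fingerprints fold each fragment identifier into a fixed-length bit vector modulo its size, so distinct fragments can collide on the same bit. Counting occupied bits at increasing sizes shows when collisions stop mattering (the fragment hashes below are synthetic stand-ins, not Morgan fragment identifiers):

```python
# Synthetic illustration of fingerprint bit collisions: fold ~600 random
# 32-bit "fragment hashes" into bit vectors of increasing size and count
# how many distinct bits are set. If the count saturates below the number
# of fragments, collisions are present at that bit size.
import random

random.seed(0)
fragment_hashes = [random.getrandbits(32) for _ in range(600)]

def occupied_bits(hashes, n_bits):
    return len({h % n_bits for h in hashes})

for n_bits in (1024, 2048, 4096):
    print(n_bits, occupied_bits(fragment_hashes, n_bits))
```

With power-of-two sizes the occupied-bit count is monotone in the vector length; if it plateaus when doubling the size, as observed for the datasets in this study, further enlarging the fingerprint buys nothing.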

Performance analysis
We used three different optimization and evaluation protocols, depending on whether the dataset is from MoleculeNet, MolData or ChEMBL. This keeps our analysis consistent with prior studies in the scientific literature and accounts for the fact that the datasets from MolData are several orders of magnitude larger than those in the MoleculeNet repository or from Cortés-Ciriano et al. [46]. For the MoleculeNet datasets, we replicated a previously proposed procedure [16], whereby for each endpoint each classifier is optimized with Hyperopt [51] for 100 iterations using an extensive hyperparameter grid, determined according to existing guidelines and benchmarks [22,39,40,42]. The full hyperparameter grid is available in the Supporting Information. Each optimization iteration measured the average PR-AUC for a given hyperparameter setting across three random train-test splits with an 80:20 ratio. Then, the model was run with the optimal hyperparameters for 50 independent evaluations with random splits, using the same ratio between training and test set. After each run, the ROC-AUC and PR-AUC were measured on the test set, as well as the training time. Finally, for a given dataset, the performance metrics and training times were averaged across replicates and across endpoints.
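The split-generation part of this protocol can be sketched in a few lines (model fitting and PR-AUC scoring omitted; the function name and seed handling are our own, not from the study's code):

```python
# Sketch of the repeated random-split protocol used for the MoleculeNet
# datasets: each Hyperopt iteration would average the PR-AUC over the
# models trained on these independent 80:20 splits.
import random

def random_splits(n_samples, n_repeats=3, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        cut = int(n_samples * (1 - test_frac))
        splits.append((idx[:cut], idx[cut:]))  # (train indices, test indices)
    return splits
```

Averaging the objective over several splits inside the optimization loop reduces the risk of Hyperopt overfitting its hyperparameter choice to one fortunate partition of a small dataset.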
For the MolData benchmarks, we used the scaffold splits provided by Arshadi and coworkers during optimization and evaluation [1]. For each endpoint, each classifier was optimized for 100 iterations using Hyperopt [51] with the same grid as above. Each iteration measured the PR-AUC obtained by the classifier with a given hyperparameter setting on the validation set. Then, the model was run with the optimal hyperparameters for five independent evaluations with different random seeds, measuring the ROC-AUC and PR-AUC on the test set as well as the training time. As above, the results were reported as averages across replicates and endpoints for a given dataset.
For the regression datasets from Cortés-Ciriano et al. [46], we adopted the procedure employed in the original publication. In short, each dataset was split into training, validation and test sets with a 70:15:15 ratio using random splits. We then performed hyperparameter tuning via Hyperopt, optimizing the RMSE on the validation split for 100 iterations, using the same grid as above.
Finally, we repeated training on the training split and evaluation of the RMSE on the test set for 50 iterations. As such, the final RMSE values are reported as averages across replicates for each dataset.

Feature ranking analysis
One of the advantages of GBM is that it provides information on feature importance, which can be used to indicate what drives the model predictions and, in certain cases, to achieve model explainability [52]. We used Shapley values [19,53] to assess which molecular features are the most important according to each GBM predictor. Shapley values quantify the importance of each feature ('feature attribution' [37]) by evaluating the change in a model's predictions across all possible permutations [19,52]. To obtain feature rankings for each dataset, we collected the Shapley values from each model with optimal hyperparameters during the evaluation procedure. Then, we averaged them across independent runs and dataset endpoints, obtaining one ranked list of variables per dataset for each model.
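What "evaluating the change across all possible permutations" means can be shown with a brute-force toy computation (libraries such as SHAP use fast tree-specific algorithms instead; this sketch is purely didactic):

```python
# Exact Shapley values by enumerating all feature orderings: a feature's
# attribution is its average marginal effect on the prediction when its
# value is revealed, averaged over every possible reveal order.
from itertools import permutations
from math import factorial

def shapley_values(model, x, baseline):
    n = len(x)
    phi = [0.0] * n
    for perm in permutations(range(n)):
        z = list(baseline)
        prev = model(z)
        for j in perm:
            z[j] = x[j]            # reveal feature j
            cur = model(z)
            phi[j] += cur - prev   # marginal contribution of feature j
            prev = cur
    return [p / factorial(n) for p in phi]
```

A useful sanity property, which also holds for the tree-based estimators, is efficiency: the attributions sum exactly to the difference between the prediction for x and the prediction for the baseline.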
To compare the variable rankings between pairs of GBM implementations, we employed the following formula:

Overlap% = (1 − (V_sk − k) / k) × 100, with k = 20    (4)

where k is the cut-off for the number of most important variables to consider (set to k = 20 in the present study) and V_sk is the number of unique variables when considering both importance rankings together. Intuitively, this metric measures the agreement of the two rankings, irrespective of the specific ordering, among the top 20 most important variables. For example, a score of 50 indicates that two GBM models have 10 molecular features in common when looking at their respective top 20 most important variables, regardless of whether these 10 features received the same rank in both lists. This score therefore shows whether the use of different gradient boosting algorithms would highlight the same features as most important, without being influenced by the ranking of less informative variables. However, it should be kept in mind that for many molecular representations, such as hashed fingerprints, translating feature importance rankings into chemical insights is not a trivial task [54]. Finally, to evaluate the influence of converging to different hyperparameter configurations, regardless of algorithmic differences between the gradient boosting implementations, we also evaluated the feature ranking overlap between two independent LightGBM optimization runs. The analysis was limited to LightGBM because of the computational costs involved and because considering one GBM is sufficient to evaluate the variability in feature ranking overlap induced by the stochasticity of the hyperparameter optimization process.
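The top-k overlap score is straightforward to implement; the helper below is our own illustration, not code from the study:

```python
# Percentage of shared variables among the top-k entries of two feature
# importance rankings; order within the top-k is deliberately ignored.
def ranking_overlap(rank_a, rank_b, k=20):
    top_a, top_b = set(rank_a[:k]), set(rank_b[:k])
    v_sk = len(top_a | top_b)  # unique variables across both top-k lists
    return (1 - (v_sk - k) / k) * 100
```

With identical top-20 lists the union has 20 elements and the score is 100; with disjoint lists the union has 40 elements and the score is 0; with 10 shared features the score is 50, matching the worked example above.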

Hyperparameter analysis
To evaluate the influence of each hyperparameter on the optimization process, we employed Functional ANOVA (fANOVA) [55]. To acquire a sufficient collection of hyperparameter combinations, we optimized LightGBM with Hyperopt for 500 iterations on each endpoint, using the same hyperparameter grid and evaluation criteria as above. Because of the high computational cost of this analysis, we limited our study to one GBM implementation and exclusively to classification datasets. Then, after pruning the worst 150 iterations, we processed the resulting parameter-performance pairs with fANOVA, yielding individual hyperparameter importance scores and their first-order interactions. By limiting the analysis to well-performing configurations, we ensured that the importance estimates reflect each parameter's role in reaching the optimum, rather than in causing large oscillations in performance [55]. We excluded the SIDER and Fungal datasets from this analysis, since they were reserved as test sets to evaluate whether selecting hyperparameters according to their fANOVA importance scores generalizes to unseen datasets. Furthermore, to assess the influence of molecular descriptors on the optimal hyperparameters, we also repeated this procedure using the MACCS keys [56] and an assortment of 207 physico-chemical descriptors from RDKit as featurization options. The complete list of descriptors is available in the Supporting Information (Additional file 1: Table S2).

Predictive performance
Overall, XGBoost achieves the best performance on most of the datasets (Fig. 2a, b and Additional file 1: Figure S1), with statistically significant differences in most cases. Interestingly, there seems to be a correlation between the improvement provided by XGBoost over the alternatives and dataset size. For smaller classification datasets (e.g., BACE, BBBP and ClinTox), CatBoost performs worse, with LightGBM able to match or outperform XGBoost. This aspect seemingly contradicts the concerns of overfitting due to its depth-first tree structure reported elsewhere [40]. For medium-sized datasets (e.g., Tox21, MUV and HIV, ranging from approximately 7000 to 40,000 compounds), CatBoost tends to perform better than LightGBM, and it outperforms XGBoost on the Tox21 dataset. Finally, for large datasets (the NTPase, Phosphatase and Oxidoreductase datasets, with more than 300,000 molecules per endpoint), XGBoost outperforms both LightGBM and CatBoost. When considering all datasets, XGBoost provides roughly a 5% improvement on average over LightGBM and CatBoost in terms of ROC-AUC and PR-AUC.
Regarding the regression datasets, LightGBM tends to achieve worse RMSE scores, while XGBoost ranks as the best-performing algorithm on most benchmarks (Fig. 2b). CatBoost is generally able to match the performance of XGBoost, although the differences are statistically significant.
When considering the training times across all datasets (Fig. 2c), a similar dependence on dataset size can be observed. LightGBM is the fastest algorithm on all benchmarks, owing to its focus on reducing computational load. CatBoost is the slowest algorithm for small and medium-sized datasets, while XGBoost requires significantly more time than both alternatives to train on larger datasets. While the absolute difference in training time for a single model is not particularly large (i.e., 5 versus 140 s on a CPU with 32 cores), it can significantly impact hyperparameter optimization procedures, where the model needs to be retrained many times. Furthermore, this difference will also grow significantly if fewer cores are available for training.
In summary, XGBoost provides the best predictive performance for cheminformatics of all the gradient boosting implementations, at the cost of training speed on larger datasets. LightGBM and CatBoost have comparable performance, but the former provides substantial benefits in terms of training time over the other algorithms.

Feature ranking comparison
We observed remarkable variability between the importance rankings of the different implementations, especially when comparing them to the overlap scores of two independent optimization and training runs of the same GBM algorithm (Fig. 3). For MUV, for example, there is only approximately 20% overlap for any implementation pair, while for other datasets the agreement reaches up to 90%. The variability across implementations could be due to the use of different tree structures, as well as to convergence to different hyperparameter optima. For example, tuning the minimum split gain can lead to the selection of different splits, which in turn would yield different variable importance scores. This would explain the results obtained when comparing two runs of the same GBM algorithm across all datasets, since even in that scenario the variable overlap scores are distributed between 70 and 90% (Fig. 3). Another possible explanation for this pattern is that the algorithms highlight similar molecular fragments, but those fragments are mapped to different bits in the ECFP representation, thus producing semantically similar rankings despite not focusing on the same variables. To investigate this hypothesis, we calculated the top 20 ranked fragments for all GBM algorithms for the BACE dataset and manually inspected them (Additional file 1: Figure S2). When comparing the most important fragments between pairs of GBM predictors, each model had approximately ten unique substructures with no analogues in the other rankings. As such, it seems that each implementation indeed generates semantically distinct explanations for a given dataset, highlighting potential differences in the learned structure-activity relationships.
The main takeaway from this analysis is that using gradient boosting to evaluate which molecular features or fragments are the most influential is a non-trivial task, given the low agreement between different implementations of the same algorithm. Expert knowledge must always be employed to evaluate each fingerprint bit or molecular descriptor and to assess whether the explanations provided by the model are reasonable. Finally, averaging the Shapley scores over different hyperparameter optima or across different gradient boosting implementations might yield better estimates of feature importance.

Hyperparameter importance
After calculating the hyperparameter importance across datasets for LightGBM, we evaluated their distribution on different endpoints (Fig. 4). The analysis was limited to one GBM implementation due to the high number of optimization iterations required per endpoint. We focused our analysis on the following hyperparameters:
• "colsample_bytree": fraction of features to sample at the beginning of the construction of a given tree. Tuning it helps with regularization of the ensemble.
• "learning_rate": regulates how much each tree affects the overall performance of the ensemble, or in other words how many boosting rounds are required to converge. Large learning rates help with underfitting; small learning rates can help with regularization.
• "max_depth": defines the maximum depth for constructing individual trees. Large values help with underfitting; small values can help with regularization.
• "min_child_samples": minimum number of samples for a given leaf node. Affects tree construction and can help with regularization.
• "min_child_weight": minimal sum of Hessians for a given leaf node. Affects tree construction and can help with regularization.
• "min_split_gain": minimal decrease in loss required to further split a node. Affects tree construction and can help with regularization.
• "neg_subsample": fraction of majority-class samples to use for bagging when constructing a given tree. Helps with class imbalance and regularization.
• "num_leaves": maximum number of leaves a given tree can have. Similar to max_depth, but provides more fine-grained control over the shape of the tree, since LightGBM uses depth-first trees.
• "reg_alpha": L1-norm regularization coefficient of the leaf weights.
• "reg_lambda": L2-norm regularization coefficient of the leaf weights.
• "scale_pos_weight": scaling coefficient for the minority class when computing the cross-entropy loss. Large values can offset class imbalance.
• "subsample_freq": determines how often bagging is performed when training the ensemble. If set to k, bagging is performed every k trees.
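For orientation, these names map directly onto LightGBM keyword arguments. The dictionary below is illustrative only: the values are common starting points, not the tuned optima from this study, and "neg_subsample" (LightGBM's native negative-class bagging option) is omitted.

```python
# Illustrative LightGBM parameter dictionary covering the hyperparameters
# discussed above; values are generic starting points, not study results.
params = {
    "objective": "binary",
    "learning_rate": 0.05,
    "num_leaves": 64,
    "max_depth": -1,           # -1 = unlimited; num_leaves constrains growth
    "min_child_samples": 20,
    "min_child_weight": 1e-3,
    "min_split_gain": 0.0,
    "colsample_bytree": 0.8,
    "subsample_freq": 1,
    "reg_alpha": 0.0,
    "reg_lambda": 1.0,
    "scale_pos_weight": 10.0,  # e.g., for roughly 1:10 class imbalance
}
# model = lightgbm.LGBMClassifier(**params)  # requires the lightgbm package
```

Note the interplay flagged in the list above: num_leaves, rather than max_depth, is the primary capacity control for LightGBM's depth-first trees, and scale_pos_weight only matters for imbalanced classification losses.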
Fig. 3 Box-plot distribution of overlap scores across all datasets for each gradient boosting implementation pair. The length of the box denotes the interquartile range, the diamond indicates the mean and the horizontal line defines the median. The comparison between two independent optimization runs using the same algorithm was limited to LightGBM due to its computational cost.

Generally speaking, the importance of the individual hyperparameters in the optimization process varies greatly across datasets. Furthermore, first-order interactions between parameters play a more significant role in reaching the global optimum than tuning them in isolation, as highlighted by their larger importance scores. This is consistent with the strong correlations between parameters and their non-linear effects on model behavior [39,40,42], which make Bayesian hyperparameter optimization necessary in the first place [51].
Looking at individual contributions (Fig. 4a), it is possible to identify highly influential hyperparameters, such as the learning rate and the minimum split gain, as well as less relevant ones, such as tree-wise feature sampling. However, all importance score distributions are remarkably skewed, highlighting that each contribution can vary strongly across datasets. When looking at the ten most influential pairwise interactions (Fig. 4b), most involve the learning rate and the scaling coefficient for the contribution of the minority class to the global loss, highlighting the importance of tuning the weighted cross-entropy when dealing with imbalanced classification. While some of these findings are consistent with the optimization guidelines from the literature, such as tuning the learning rate and the minimum split gain, others appear to contradict them. For example, while stochastic sampling of instances and features is believed to be an effective regularization technique for gradient boosting [31], in this analysis tuning it did not appear influential for converging to the optimal parameter configuration.
To evaluate the robustness of our importance estimates, we optimized LightGBM again on all datasets, tuning only the most influential parameters according to the fANOVA analysis. To do so, we selected only the parameters that appeared at least once among the ten most important interaction terms, yielding a grid of 7 hyperparameters instead of 12 (available in the Supporting Information). To test whether this reduced selection leads to faster convergence of the optimization process, we used 30 iterations instead of 100. As a negative control, we also evaluated the performance achieved by optimizing all hyperparameters for the same number of iterations. Finally, we expressed the ROC-AUC and PR-AUC values achieved in these benchmarks as a fraction of the performance of the optimization process with all parameters and 100 iterations. This evaluation scheme allows us to assess how closely a short optimization restricted to the most important hyperparameters approximates the original large-scale optimization procedure.
As shown in Fig. 5, given the same number of iterations, using only the best parameters for the optimization process leads to consistent performance gains compared to tuning all hyperparameters. This indicates that the scores from fANOVA accurately reflect the importance of tuning a given hyperparameter for reaching the optimum. Interestingly, in some cases the reduced hyperparameter grid is able to outperform the results obtained by tuning all hyperparameters for 100 iterations, such as for the NTPase dataset in terms of PR-AUC and ROC-AUC (Fig. 5 and Additional file 1: Figure S2).
However, when evaluating the effectiveness of adjusting only the most important parameters on holdout datasets, the performance improvements are inconsistent. This indicates that hyperparameter importance scores obtained from one set of endpoints do not generalize to external endpoints (Additional file 1: Figure S1). Therefore, which parameters to tune must be decided on a case-by-case basis. A similar pattern is also observed when evaluating the influence of changing the molecular representation used to construct the QSAR model, indicating that the parameter importance scores are highly feature-specific (Fig. 5 and Additional file 1: Figure S2).
In conclusion, optimization analysis tools such as fANOVA can be useful to further improve gradient boosting in cases where QSAR models need to be retrained periodically as new data are collected, for example in ADME prediction toolkits [3,57]. However, the importance estimates provided by fANOVA do not generalize to unseen endpoints or different molecular representations, and limiting the optimization process to a handful of parameters can degrade the performance of the classifier by up to 20%. Therefore, our recommendation is to tune all possible parameters when training gradient boosting models for QSAR, if the computational time to do so is not prohibitive. If optimizing all parameters is too costly, adjusting the learning rate, the weight of the minority class and the minimum gain to split will likely lead to the best results on a limited computational budget.

Conclusions
This work investigated the differences between popular gradient boosting implementations in the context of cheminformatics, to guide future QSAR modelling projects. Specifically, our analysis focused on predictive performance and training time, as well as on feature ranking consistency among methods. Furthermore, we investigated which hyperparameters are the most important to tune for gradient boosting machines to reach better performance faster. To achieve these goals, we evaluated 16 different datasets, encompassing approximately 1.4 million unique compounds with a diverse selection of dataset sizes and imbalance ratios.
XGBoost generally outperformed all alternatives in terms of predictive performance by approximately 5%, at the cost of longer training times for larger datasets (e.g., above 100,000 compounds). LightGBM and CatBoost achieved similar performance, but the former required significantly less training time than the other implementations. The improvement is especially pronounced for datasets with more than 100,000 compounds, where LightGBM could be trained approximately 100 times faster than XGBoost and 50 times faster than CatBoost. In terms of feature importance, each implementation tends to rank molecular features differently. This not only indicates that each approach might learn slightly different structure–activity relationships, but also that caution must be exercised when using these tools to assess which fragments or properties are relevant for the biological response being modelled. In this context, expert knowledge is key to critically evaluate whether these explanations could be due to chance correlation. Finally, our hyperparameter importance analysis highlights significant variability between datasets in how much a given parameter affects convergence to the optimum. As such, our indication is to tune as many parameters as possible when optimizing gradient boosting models. If the computational budget is limited, our recommendation is to focus on the learning rate, the minimum split gain and, if the dataset is imbalanced, the weight of the minority class.
In conclusion, our study provides a set of practical guidelines for the use of gradient boosting for molecular property prediction. Given the rising popularity of this algorithm for virtual screening and QSAR, we believe our study will provide useful advice on its optimization, use cases and limitations, thus benefitting the cheminformatics community as a whole.

Fig. 1
Fig. 1 Different tree structures and split indexes (shown inside each node) generated by XGBoost, LightGBM and CatBoost. XGBoost adopts a "breadth-first" search, maintaining constant tree depth across branches. LightGBM uses a "depth-first" criterion, yielding asymmetric trees. CatBoost relies on oblivious trees, where at a given depth the same split is used across all branches, as indicated by the constant split indexes

Fig. 2
Fig. 2 Performance comparison of all gradient boosting implementations in terms of a PR-AUC, b RMSE and c training time. All calculations were performed on an AMD Ryzen Threadripper 3970X CPU. Statistical tests are carried out with respect to XGBoost. Error bars represent the standard deviation (N = 50 for MoleculeNet datasets, N = 5 for MolData datasets), while the asterisks denote whether the difference is significant (*: α < 0.05, **: α < 0.01, with Bonferroni correction)

Fig. 4
Fig. 4 Violin plot distribution of the importance scores across all endpoints for the Tox21, MUV, HIV, BBBP, BACE, ClinTox, Phosphatase, NTPase and Oxidoreductase datasets. a The distribution of individual contributions for each hyperparameter, denoted by a numerical identifier. b The score variation of pairwise interactions. Each interaction is defined by the combination of two numeric identifiers for conciseness

Fig. 5
Fig. 5 LightGBM PR-AUC comparison between hyperparameter tuning restricted to the optimal grid obtained from fANOVA and tuning all hyperparameters. a Performance on the datasets used for the fANOVA analysis. b Performance on the holdout datasets and with different molecular representations. Each approach was optimized for 30 iterations. The performance is reported relative to the results obtained by tuning all parameters for 100 iterations. Error bars represent the standard deviation (N = 50 for MoleculeNet datasets, N = 5 for MolData datasets), while the asterisks denote whether the difference is significant (*: α < 0.05, **: α < 0.01, with Bonferroni correction)

Table 1
Datasets employed in this study. For datasets with multiple endpoints, we report the range (minimum to maximum) of compounds per endpoint and of imbalance ratios