
Comparison and improvement of the predictability and interpretability with ensemble learning models in QSPR applications

Abstract

Ensemble learning improves machine learning results by combining several models, yielding better predictive performance than any single model. It also benefits and accelerates research in quantitative structure–activity relationship (QSAR) and quantitative structure–property relationship (QSPR). With the growing number of ensemble learning models such as random forest, the effectiveness of QSAR/QSPR will be limited by the machine's inability to interpret its predictions for researchers. In fact, many implementations of ensemble learning models can quantify the overall contribution of each feature. For example, feature importance allows us to assess the relative importance of features and to interpret the predictions. However, different ensemble learning methods or implementations may lead to different feature selections for interpretation. In this paper, we compared the predictability and interpretability of four typical, well-established ensemble learning models (random forest, extremely randomized trees, adaptive boosting, and gradient boosting) for regression and binary classification modeling tasks. Blending methods were then built to summarize the four ensemble learning methods. Blending led to better performance and a unified interpretation by summarizing the individual predictions from the different learning models. The important features of the two case studies, which provide valuable information about compound properties, are discussed in detail in this report. QSPR modeling with interpretable machine learning techniques can move chemical design forward, helping researchers work more efficiently, confirm hypotheses, and establish knowledge for better results.

Introduction

Machine learning has led to an explosion of applications, and researchers have developed new capabilities of machine learning for a wide variety of tasks. There is growing interest in applications of machine-learning techniques in quantitative structure–activity relationship (QSAR) and quantitative structure–property relationship (QSPR) modeling research [1]. QSARs/QSPRs are models where characteristics of molecules are correlated with their experimental behaviors using various mathematical regression and classification algorithms [2].

One of the issues with QSAR/QSPR models is that they are difficult to interpret in a chemically meaningful manner. The effectiveness of QSPR will be limited by the machine's inability to explain its decisions to researchers. While simple models such as linear regression [3] and decision trees [4] are easy to interpret, the most powerful high-accuracy algorithms, such as neural networks [5] and support vector machines [6], are uninterpretable models: they produce predictions that are not designed to be, and cannot easily be, interpreted. Interpretable machine-learning approaches are essential for researchers to understand, trust, and effectively manage models. Approaches for interpreting a model help extract the information that justifies its predictions. Moreover, the interpretation of a machine-learning model should be simple enough to be human-understandable. Thus, interpretable approaches should combine interpretable models with interpretable descriptors, where interpretable descriptors are those with clear structural or chemical meaning.

Traditional interpretable models such as linear regression or decision trees often perform unfavorably, and we cannot squeeze much more accuracy out of any single such model. New tools are being developed to create better interpretable models [7, 8]. One potential way to increase accuracy without losing too much interpretability is to combine the predictions of a number of traditional interpretable models; this is called "ensemble learning." Random forest (RF) is one example of a decision tree (DT) based ensemble learning model [9]. RF is typically treated as an uninterpretable model because of its complicated algorithm design. In fact, the output of a random forest is the majority vote of a large number of independent decision trees, each of which is naturally interpretable. One efficient way to gain insight into an RF model is to compute feature importance [10]. In RF, it is not very hard to gauge the influence of individual features in a single tree, and the global feature importance of RF can be quantified by the total decrease in node impurity averaged over all trees of the ensemble.

In QSAR/QSPR models, feature importance can be used to estimate the importance of single descriptors or group of descriptors representing different chemical properties to explain the relationships [11, 12]. Guha and Jurs [13] demonstrated that the RF model selected important descriptors similar to the multilinear regression and partial least square regression models. Polishchuk et al. [14] modeled the toxicity of 664 compounds toward tetrahymena pyriformis with RF and determined the importance of hydrophobic factors for toxicity variation. Marchese Robinson et al. [15] investigated different interpretation strategies for RF, linear SVM, and PLS models on several benchmark datasets, and the predictions were interpreted in a chemically and biologically meaningful way.

A benefit of using DT-based ensemble learning models such as RF is that they automatically provide estimates of feature importance from a trained predictive model. Generally, feature importance is a score that indicates how useful a feature was in the construction of the DTs within the model. However, the importance calculations depend strongly on the ensemble method (such as bagging [16] or boosting [17]). For example, the learning in RF is done in parallel using "bagging," and each tree is built from a random selection of features. On the other hand, the learning in "boosting" is done serially; boosting tends to choose highly correlated features and use them in several trees. Different ensemble learning models may therefore lead to different prediction results and different feature selections for interpretation, so the generalization of predictability and interpretability may be limited when using the feature importance provided by one specific DT-based ensemble learning model.

To address this problem of differing importance calculations, we ensemble the predictions of different DT-based ensemble learning models. The predictions from different models can simply be averaged, weighted, or combined in more mathematically sophisticated ways, such as combinatorial QSAR modeling [18]. A super-learner (or meta-learner) is one of the best options for combining model predictions in a more rigorous way. The super-learner is a specific implementation of stacked generalization developed by Wolpert [19]. Stacked generalization uses a combiner model to decide the weights of the constituent predictions. "Blending" is very close to stacked generalization and can successfully improve prediction accuracy [20], but it is slightly more straightforward and carries less risk of information leakage. In blending, the combining mechanism is that predictions from the different models are used as training data for the super-learner (blender) to approximate the same target value. The blender figures out the combining mechanism without affecting the interpretability of each individual constituent model. Blending therefore enhances understanding of a dataset, and familiarity with it, by combining interpretable models.

In this paper, we propose a method for interpretable models based on "blending" to overcome the shortcomings of DT-based ensemble learning models. We used different blending methods to combine four DT-based ensemble learning models for regression and binary classification modeling tasks. To validate the proposed method, we conducted two proof-of-concept case studies and confirmed that our method yields preferable property predictions and interpretability. For the regression task, a QSPR model was built to study the fluorescence emission wavelengths of 413 fluorescent dyes in different solvent conditions. For the classification task, we used an organic compound database with 3786 records to predict liquid crystal behavior. We compared the predictive performances and important features of the blending method and the four DT-based ensemble models in both case studies. The important features used in these QSPR models, which may give valuable information about the properties, are discussed in detail in this report. This study may lead to a better understanding of DT-based ensemble learning models and provide a meaningful route to improving predictability and interpretability.

Method

Dataset

For the two proof-of-concept studies, two different datasets were used: compounds for fluorescence dyes and liquid crystals.

1. Fluorescence dataset

A large set of maximum experimental fluorescence emission wavelengths (λem) for 413 dyes was collected from a database [21] and from several fluorescence studies [22,23,24]. The fluorescence dataset included a large variety of chromophore derivatives, listed in Additional file 1: Table S1, such as cyanine, xanthene, coumarin, pyrene, naphthalene, and anthracene. In total, 473 samples for the 413 dyes in different solvent conditions were used in this study. The dataset was randomly divided into two subsets within each chromophore derivative class: a training dataset of 392 samples and a test dataset of 81 samples. The training set was used to build the QSPR models, and the test set was used to evaluate their predictability.

2. Liquid crystal dataset

A liquid crystal dataset was taken from the LiqCryst database [25]. The dataset consisted of 3786 rod-like aromatic compounds with a variety of mesogen types and wing substituents. There were 2780 liquid crystal (LC) compounds and 1006 compounds for which LC behavior was not observed (NLC). For model development, the dataset was randomly divided into a training set and a test set in a 3:1 ratio. The training dataset consisted of 2060 LCs and 779 NLCs. The test dataset, which included 720 LCs and 227 NLCs, was used to test the developed models and to evaluate their generalization ability in classification.

Model

Decision tree and ensemble learning

The decision tree (DT) is a popular method, exemplified by classification and regression trees (CART) [4], with many advantages over other simple models. A DT is very fast to train and requires practically no data preparation such as normalization or feature selection. Furthermore, a DT is simple to understand and interpret through its flowchart-like structure. However, DTs have extremely low bias because they maximally overfit the training data. DTs are therefore known to be unstable: small variations in the training data can result in different trees and different predictions. To address these shortcomings, ensemble learning algorithms have been proposed.

Ensemble learning algorithms are designed to improve the stability and accuracy of machine learning in both classification and regression, since an ensemble can be more accurate and robust than a single classifier or regressor [26,27,28]. Ensemble learning attempts to create a strong classifier/regressor from a number of weak classifiers/regressors. Fast and simple algorithms such as DTs are commonly used as the weak learners. Several strategies are used in ensemble learning, such as bagging and boosting.

Bagging (which stands for bootstrap aggregation) is a parallel ensemble method that aims to decrease variance [16]. Bagging produces several training subsets from the original dataset by random sampling with replacement, and each model is built independently. Bagging combines multiple models with high variance but low bias to obtain better predictions. RF is one of the most popular and most powerful applications of bagging.

Boosting is a sequential ensemble method that aims to decrease bias rather than variance [17]. It is an iterative two-step approach: boosting first uses a subset of the original data to build a model and then adds new models to reduce the errors of the previous ones. Unlike bagging, every new subset emphasizes the elements that were likely to be misclassified by previous models. The boosting prediction combines those models through a particular cost function. Boosting combines multiple models with low variance but high bias to obtain better predictions. AdaBoost is one of the best-known algorithms for boosting the performance of decision trees.
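The contrast between the two strategies can be sketched with scikit-learn's generic bagging and AdaBoost wrappers around decision trees. The synthetic dataset and all settings below are illustrative only, not those used in this study:

```python
# Sketch: bagging vs. boosting over decision-tree weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Bagging: deep trees trained in parallel on bootstrap resamples
# (high-variance, low-bias learners; aggregation reduces variance).
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50, random_state=0).fit(X, y)

# Boosting: shallow trees ("stumps") trained sequentially, reweighting
# hard samples (low-variance, high-bias learners; boosting reduces bias).
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=0).fit(X, y)

print(bag.score(X, y), boost.score(X, y))
```

The same ensemble size gives two very different mechanisms: bagging averages independent trees, while each boosted stump depends on its predecessors.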

Random forests (RF)

Random forest (RF) is a versatile ensemble learning model, developed by Breiman [9], that uses fully grown single DTs for both classification and regression tasks. Two types of randomness, bootstrap sampling and random selection of input variables, are used in the algorithm to make sure that the individual DTs grown in the forest are dissimilar and uncorrelated with each other. At each node of a DT, the optimal split is derived from the reduction in impurity, as in CART. Growing a forest of trees with randomness leads to better predictions than a single DT and helps make the model robust to noise in the dataset.

Extremely randomized trees (ExtraTrees)

The ExtraTrees method is another ensemble learning model related to bagging, introduced by Geurts et al. [29]. ExtraTrees uses random variable selection as in an ordinary RF, but by default each tree is trained on the whole learning sample rather than a bootstrap replicate, and the cut-point at each node is chosen at random instead of being optimized during training. Consequently, ExtraTrees is faster than RF to train. Geurts et al. established that ExtraTrees leads to a further decrease in overall variance, and in their comparisons ExtraTrees performed equal to or better than RF.

Adaptive boosting (AdaBoost)

AdaBoost was the first successful boosting algorithm, developed by Freund and Schapire [17]. AdaBoost creates a number of weak learners by adaptively adjusting the weights of the training samples. After training a weak learner, AdaBoost increases the weights of the misclassified samples so that these samples make up a larger part of the next weak learner's training set. The predictions of AdaBoost are then made by a weighted majority vote of the weak learners' predictions. AdaBoost can thereby generate increasing diversity to improve performance.

Gradient boosting (GBM)

Gradient boosting is another boosting algorithm, similar to AdaBoost. The idea of gradient boosting as optimization of a suitable cost function originated with Breiman [30]. This idea was further developed by Friedman [31, 32] into gradient boosting machines (GBM). Like AdaBoost, GBM works by sequentially adding weak learners to an ensemble. However, instead of tweaking the instance weights at every iteration as AdaBoost does, GBM fits each new weak learner to the residual errors made by the previous weak learners. In other words, GBM treats boosting as a numerical optimization problem, minimizing the loss of the ensemble by gradient descent. GBM has driven the development of boosting algorithms in many areas of machine learning.
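The residual-fitting idea can be illustrated with a toy squared-error example; the synthetic data, tree depth, and learning rate below are illustrative, not the paper's implementation:

```python
# Toy illustration of gradient boosting with squared-error loss:
# each new shallow tree is fitted to the current ensemble's residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

pred = np.zeros_like(y)
for _ in range(50):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)  # fit residuals
    pred += 0.1 * tree.predict(X)                               # learning rate 0.1

print("training MSE:", np.mean((y - pred) ** 2))
```

Each iteration moves the ensemble prediction a small step (the learning rate) toward the remaining error, which is exactly a gradient-descent step for the squared loss.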

Feature importance

One efficient way to gain insight into the above DT-based ensemble models is to compute feature importance, since importance scores for each attribute are relatively straightforward to retrieve. Feature importance can be calculated for a single DT from the amount by which each attribute improves the performance measure at its splits. The performance measure is an "impurity" criterion, such as variance for regression trees and the Gini coefficient or entropy for classification trees. Generally, DT-based ensemble models provide a score that indicates how useful or valuable each feature was in the construction of the DTs within the model: the more an attribute is used to make key decisions in the DTs, the higher its relative importance score. In other words, features with high importance scores drive the main splits in the DTs. The absolute value of an importance score matters less than the relative values, which we can use to determine the most relevant features for a task. The basic use of feature importance is to create a "feature ranking" among the features, from high to low, for each model.
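Retrieving and ranking impurity-based importances from a fitted scikit-learn ensemble might look like the following sketch; the synthetic dataset and descriptor names are hypothetical:

```python
# Sketch: impurity-based feature importances from a DT-based ensemble.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
names = [f"desc_{i}" for i in range(X.shape[1])]  # hypothetical descriptor names

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Scores are normalized to sum to 1; only the relative values matter.
ranking = sorted(zip(names, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```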

Blending

Ensemble learning is a procedure designed to increase predictive performance by combining the predictions of multiple machine learning models. There is a variety of ensemble methods, from simple ones like voting or averaging the predictions to building complex learning models that use the predictions as features. Stacked generalization is a way of combining the predictions of multiple base models that was demonstrated for classification [19] and has also been used for regression [33]. Stacked generalization applies a higher-level learning algorithm (the so-called "meta-learner" or "super-learner") to out-of-fold predictions for the training data to discover the best way to combine the outputs of the base models. Blending is very close to stacked generalization; instead of creating out-of-fold predictions, blending is more straightforward and carries less risk of information leakage.

There are two kinds of models in blending: several base models (level-0 models) and one blender (level-1 model), as shown in Fig. 1. When using blending for predictions, the training data are first fed into the level-0 models, and each level-0 model calculates a prediction value. These values are fed into the level-1 model, which combines them and computes the final prediction. In other words, the inputs to the level-1 model are the outputs of the level-0 models. The blender (level-1 model) thus decides how much weight to give each level-0 model and summarizes the information from the level-0 models. The feature importance of blending is the weighted summation over the level-0 models:

$$FI_{blend} = \mathop \sum \limits_{i = 1}^{n} w_{i} FI_{i}$$

where FIblend is the feature importance of the proposed method, n is the number of level-0 models, wi is the weight of each level-0 model, and FIi is the feature importance of each level-0 model. To ensure that the predictability and interpretability of each level-0 model are preserved, we used the lowest possible number of individual constituent models and then used simple voting, linear combinations, or DT-based ensemble learning as the level-1 model. In this work, we used three different blenders to combine the results of the level-0 models. To compare the performance of the blenders, we defined three blending methods as follows:

Fig. 1

Concept of blending

  • Uniform blending: use simple voting in classification and average in regression as blenders. The weights (wi) of each model were the same.

  • Linear blending: use a linear model as a blender.

    We used logistic regression in a classification task and multiple linear regression in a regression task. The weights (wi) of each model were determined by regression coefficients.

  • Any blending: use a non-linear model as a blender.

    We used GBM with a simple structure (n_estimators = 10) as the level-1 model because of the serial dependence on the level-0 models. The weights (wi) of each model were determined by the feature importance of GBM.
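A minimal sketch of this blending scheme, in which level-0 predictions become level-1 training features, could look like the following; the synthetic data and default hyper-parameters are placeholders, not the settings used in this work:

```python
# Minimal blending sketch (regression): four DT-based level-0 models,
# one linear level-1 blender whose coefficients act as the weights w_i.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

level0 = [RandomForestRegressor(random_state=0),
          ExtraTreesRegressor(random_state=0),
          AdaBoostRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0)]

# Level-0 predictions become the level-1 training features.
preds = np.column_stack([m.fit(X, y).predict(X) for m in level0])

# Linear blending: the regression coefficients play the role of w_i.
blender = LinearRegression().fit(preds, y)
print("level-0 weights:", blender.coef_)
```

Swapping the `LinearRegression` blender for simple averaging or a small GBM gives the uniform and "any" blending variants, respectively.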

Descriptor selection

Many studies in various QSAR/QSPR research fields have noted that model interpretation can be achieved using interpretable descriptors [34, 35]. Therefore, we used different interpretable descriptor sets for the two case studies.

Dragon 7 software [36], Gaussian 09 software [37], and the RDKit package [38] were used for descriptor calculation. For the fluorescence dataset, 2143 Dragon 7 molecular descriptors carrying 0-dimensional to 2-dimensional molecular information were calculated, together with 25 quantum chemical (QC) descriptors computed with Gaussian 09. The geometries of the molecules were optimized with the B3LYP density functional method [39] and the 6–31G* basis set, followed by frequency calculations. Dragon 7 descriptors gave better predictions than RDKit descriptors; the comparison is in Additional file 1: Table S2. 3-dimensional Dragon 7 molecular descriptors were not considered since they are hard to interpret. Our previous work showed that QC descriptors can successfully improve prediction and interpretability [40] and provide more specific physical meaning. We included the solvent species in model construction, since the solvent effect plays a critical role in the fluorescence process [41].

For the liquid crystal dataset, RDKit descriptors were applied to improve the accuracy of LC behavior prediction, because Dragon 7 descriptors did not improve predictability. Since our previous work noted that structural descriptors gave good LC predictions [42], we calculated only interpretable structural descriptors in RDKit, such as atom counts and numbers of fragments. The comparison of Dragon 7 and RDKit descriptors is listed in Additional file 1. Furthermore, to improve interpretability we separated the structural template of an LC into mesogen and wings for descriptor calculation, as shown in Fig. 2. In total, 250 interpretable descriptors were calculated with RDKit: 84 descriptors from the raw structure, 72 from the mesogen, 46 from wing 1, and 48 from wing 2. The details of descriptor selection are provided in Additional file 1.

Fig. 2

Structure template of rod-like LC

Evaluation of model performance

The performance of the regression models was evaluated by the coefficient of determination (R2) and the root mean square error (RMSE).

$${\text{RMSE}} = \sqrt {\frac{1}{n}\mathop \sum \limits_{j = 1}^{n} \left( {y_{j} - \widehat{y}_{j} } \right)^{2} }$$

where \(y_{j}\) is the observed value for the jth observation, \(\widehat{y}_{j}\) is the predicted value and n is the number of samples.
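The RMSE above is straightforward to compute directly; a small NumPy sketch (the example values are illustrative wavelengths in nm, not data from this study):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse([500, 520, 540], [505, 515, 545]))  # → 5.0
```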

The performance of the classification models was analyzed on the basis of the classification results obtained for the prediction set. The performance metrics used are defined as follows:

Accuracy \(Acc = \frac{a + d}{a + b + c + d}\)

Precision \(\Pr = \frac{a}{a + c}\)

Recall \(r = \frac{a}{a + b}\)

F1 score \(F1 = \frac{2}{{\frac{1}{\Pr } + \frac{1}{r}}} = \frac{2a}{2a + b + c}\)

Matthews correlation coefficient (MCC)

$${\text{MCC}} = \frac{a \times d - b \times c}{{\sqrt {\left( {a + b} \right)\left( {a + c} \right)\left( {b + d} \right)\left( {c + d} \right)} }}$$

where a is true positive, b is false negative, c is false positive, and d is true negative (Table 1).
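The five metrics follow directly from the four confusion counts; a plain-Python sketch (the example counts are illustrative, not the paper's confusion table):

```python
import math

def metrics(a, b, c, d):
    """Classification metrics from the confusion counts defined above:
    a = true positives, b = false negatives, c = false positives, d = true negatives."""
    acc = (a + d) / (a + b + c + d)
    pr = a / (a + c)
    r = a / (a + b)
    f1 = 2 * a / (2 * a + b + c)  # harmonic mean of precision and recall
    mcc = (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))
    return acc, pr, r, f1, mcc

# Illustrative counts only.
print(metrics(700, 20, 50, 177))
```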

Table 1 Confusion table

Accuracy (Acc) gives the percentage of LCs and NLCs correctly classified, while precision (Pr) gives the percentage of correctly classified LCs among all compounds classified as LCs. Recall (r) represents the probability of identifying compounds that exhibit LC phases. The F1 score (F1) can be interpreted as a weighted average of precision and recall; it takes both false positives and false negatives into account. The Matthews correlation coefficient (MCC) is a correlation coefficient between the observed and predicted binary classifications that handles imbalanced data. It is not as intuitively easy to understand as accuracy, but F1 and MCC are usually more useful than accuracy, especially for imbalanced class distributions.

Software and implementation

The four DT-based ensemble learning models are freely available in Python. RF, ExtraTrees, AdaBoost, and GBM were constructed using the Scikit-learn package in Python [43]. All of these models can compute feature importance automatically for every feature after training. All descriptors in this study were calculated by Dragon 7 and RDKit. Statistical analyses were conducted using Python scripts.

Results and discussion

Case study 1: fluorescence dataset

Performance of DT-based ensemble models

To obtain the DT-based ensemble learning models, the hyper-parameters were determined by minimizing the root mean squared error (RMSE) of fivefold cross-validation using a randomized search. The overall performances of the four DT-based ensemble learning approaches for the fluorescence wavelength (λem) are presented in Table 2. Figures showing the predicted λem versus experimental λem for the training and test datasets are in Additional file 1: Figs. S3–S6.
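One way such a search might be set up in scikit-learn, scoring candidates by cross-validated RMSE; the parameter grid and synthetic data are illustrative, not the settings used in this study:

```python
# Sketch: randomized hyper-parameter search with fivefold CV, scored by RMSE.
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [None, 5, 10],
                         "max_features": ["sqrt", 0.5, 1.0]},
    n_iter=5,
    cv=5,
    scoring="neg_root_mean_squared_error",  # maximized, hence negated RMSE
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```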

Table 2 The results of four different DT-based ensemble learning methods

The performances of the four DT-based ensemble learning models were similar, in agreement with our previous finding that ensemble learning is more suitable for fluorescence prediction than single models [40]. The best-fitted model was ExtraTrees, with \(R^{2}\) = 0.991 and RMSE = 11.15 nm for the training dataset, which also demonstrated good predictability for the external test dataset (\(R_{pred}^{2}\) = 0.908, RMSE = 33.71 nm). In machine learning, the "No Free Lunch" theorem states that no single model performs best across all problems, and no model can typically capture the full complexity of a problem. Although all models demonstrated nearly the same results, each model provides only a rough representation of the problem to solve, and different models are likely to give different predictions. Therefore, interpreting the models may help us understand the factors behind their predictions.

Interpretation of DT-based ensemble models

The feature importance provided by DT-based ensemble learning models can help identify the input descriptors most relevant to the regression problem. Since the random state may influence the feature importance of ensemble learning models, every model was fitted repeatedly with ten different random states. The basic use of feature importance is to create a "feature ranking" from high to low for each model; comparing the raw importance values across different models is meaningless. We therefore discuss the top 10 important descriptors, among the 2169 descriptors, in the four models that were highly related to the fluorescence wavelengths.
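Averaging the importances over repeated fits with different random states, and then ranking, might be sketched as follows; the synthetic data and descriptor names are hypothetical:

```python
# Sketch: stabilize impurity-based importances by averaging over
# repeated fits with different random states, then rank.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=150, n_features=6, random_state=0)
names = [f"desc_{i}" for i in range(X.shape[1])]  # hypothetical names

runs = [RandomForestRegressor(n_estimators=50, random_state=s)
        .fit(X, y).feature_importances_ for s in range(10)]
mean_fi = np.mean(runs, axis=0)

ranking = sorted(zip(names, mean_fi), key=lambda t: t[1], reverse=True)
print(ranking[:3])  # top-3 descriptors by averaged importance
```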

Table 3 shows the ten most important descriptors and their feature importance from the four DT-based ensemble learning models for the fluorescence wavelength. The feature importance of the HOMO–LUMO gap is far greater than that of the other descriptors, which all have relatively low importance scores. In our view, the models made the main prediction decision based on the HOMO–LUMO gap, because the emission process is the energy relaxation from LUMO to HOMO, and used the remaining descriptors for minor corrections.

Table 3 Top 10 important descriptors selected by four DT-based ensemble learning models

In RF, the feature importance of the HOMO–LUMO gap was three times larger than that of AP(xx); the HOMO–LUMO gap dominated the main predictions. RF used topological descriptors for minor corrections to the final predictions, and their importance scores were much smaller than the HOMO–LUMO gap's, indicating less effect on predictions. In ExtraTrees, F01[C-N] was as important as AP(xx) because of the large number of cyanine dyes in the training dataset. Unlike in RF, the HOMO–LUMO gap did not have extremely high importance relative to the other descriptors, since ExtraTrees does not optimize the cut-points of its DTs. ExtraTrees used more structural descriptors (F01[C-N], F01[C-C], and C-004), which also resulted from the cyanine dyes. In AdaBoost, more QC descriptors were selected to build the DT weak learners, since the depth of the DTs in AdaBoost is only five. Similar to ExtraTrees, structural descriptors also had high importance scores in AdaBoost. In GBM, apart from the HOMO–LUMO gap and the solvent, the remaining descriptors had small importance scores, as in RF. The descriptors selected by GBM were noticeably different from those of the other three models; interestingly, GBM was the only model that focused on the solvent effect.

Clearly, the four DT-based ensemble learning models used different descriptors and prediction mechanisms to approach fluorescence prediction. There is no absolute answer as to which model is correct for prediction and interpretation, because fluorescence is a complex phenomenon, and it is hard to interpret the relationship between descriptors and properties when the information from four different models is so scattered. One of the best ways to improve the predictions and extract information from the prediction models is to combine or summarize the different models. We therefore used blending to summarize the information from the four DT-based ensemble learning models, and we discuss the result in the next section.

Performance of blending models

The prediction results of the four DT-based ensemble learning models (level-0 models) were fed into the blending model (level-1 model). The three blending methods are listed as follows:

  • Uniform blending: average the prediction results and feature importance (no meta-learning).

  • Linear blending: use multiple linear regression to summarize information (linear meta-learning).

  • Any blending: use GBM regression to summarize information (non-linear meta-learning).

The results of the blending models are shown in Table 4. Summarizing the level-0 models with blending successfully improved the performance of fluorescence prediction. Even simple summarization of the level-0 models, as in uniform blending and linear blending, can efficiently balance the pros and cons of the level-0 models; the only difference between the two is that the level-0 models have equal weights in uniform blending but different weights (RF: 10%, ExtraTrees: 56%, AdaBoost: 20%, GBM: 14%) in linear blending. Among the three blending methods, any blending, which develops a nonlinear relationship between the level-0 models and the fluorescence wavelength, exhibited the best prediction and a large improvement over the level-0 models. To obtain better results, the hyper-parameters of the level-1 GBM (n_estimators = 10, max_depth = 8, learning_rate = 0.1) were optimized by grid search with fivefold cross-validation. The \(R^{2}\) of the training dataset was 0.996, and the \(R^{2}\) of the test dataset was 0.931; any blending had an RMSE of 7.84 nm for the training dataset and 29.11 nm for the test dataset. Figure 3 shows the experimental versus calculated values of λem by any blending. The improvement from any blending was larger than that from uniform blending and linear blending. This reveals that none of the level-0 models had the best answer, but the ensemble of level-0 models can lead to better predictions.

Table 4 The results of three different blending methods
Fig. 3

Experimental values versus calculated values of λem by any blending

Performance standards such as R2 and RMSE play crucial roles in determining the success or failure of model training and performance improvement efforts. However, the differences between the models were not significant: blending can provide only small improvements, since DT-based ensemble learning models are already powerful algorithms, and because of the complexity of the fluorescence mechanisms, none of the models is perfect. Thus, we not only compared the performance standards but also examined the differences in the predictions of the test samples. Figure 4 shows the experimental λem values versus the calculated λem values of the test set using RF, ExtraTrees, AdaBoost, GBM, and any blending. In Fig. 4, we highlight four areas with large prediction errors in some models; chemical structure examples are shown in Figs. 5 and 6.

Fig. 4

Performance comparison of a RF, ExtraTrees and any blending, b AdaBoost, GBM and any blending

Fig. 5

Chemical structures in area A and area B for which no prediction model could provide accurate predictions

Fig. 6

Chemical structures in area C and area D for which specific models could not provide accurate predictions

Every model failed to predict the λem of the structures in area A and area B because the four ensemble learning models could not successfully capture the underlying fluorescence phenomena. In area A, any blending slightly reduced the prediction error by exploiting the accurate predictions of ExtraTrees. The two structures in area B had long fluorescence wavelengths due to the solvent effect, but the lack of samples measured in different solvents caused the worst predictions.

Any blending performed best in area C and area D among all ensemble learning models, with better R2 and RMSE. In area C, the significant weakness of RF, ExtraTrees and AdaBoost was the prediction of tetracarboxylic dianhydride structures; the correct predictions of any blending relied solely on those of GBM. On the other hand, three dyes in area D were hard for RF and GBM to predict, and any blending provided better predictions based on ExtraTrees and AdaBoost. In summary, the four DT-based ensemble models had their own pros and cons for specific dyes. Therefore, the improvement provided by any blending lies not only in the performance metrics but also in the summarization of different models.

Interpretation of blending models

Table 5 shows the ten most important descriptors and their feature importance for the three blending methods. Unlike the feature importance results of the level-0 models, the three blending methods selected similar descriptors, although the score and ranking of each descriptor differed slightly. Despite these small differences, the summarized information led to a unified explanation that was easier to interpret.

Table 5 Top 10 important descriptors selected by three blending methods

We discuss the top ten important descriptors in the any blending model because of its high predictive accuracy for fluorescence wavelength. The ten important descriptors contained three QC descriptors. The HOMO–LUMO gap and the Van der Waals surface area (P_VSA_ppp_L) can be related to the absorption process of the compounds; for example, dyes with a large conjugation area, such as cyanine dyes, have a large Van der Waals surface area. SpMax2_Bh(i), a topological descriptor based on ionization potential, may be related to ionization in solvents, which can change the spectral characteristics of a dye [44]. The fluorescence wavelength has strong solvent effects, known as "solvatochromism", arising from the change in polarity of the ground and excited states of a chromophore with solvent polarity [41]. The high importance of the solvent species, polarizability (AP(xx), EP(xx)) and dipole-moment-correlated topology descriptors (Chi1_EA(dm), Chi0_EA(dm)) supports the solvatochromism phenomenon. The structural descriptors F01[C-N] and F02[C-N] reflect the structural features of cyanine dyes, which made up a large proportion of the training dataset.

Case study 2: liquid crystal dataset

Performance of DT-based ensemble models

The optimized hyper-parameters, listed in Additional file 1, were determined based on the accuracy of fivefold cross-validation using a randomized search. As in the regression task, the classification results for the liquid crystal dataset were similar across the different modeling approaches. Among the four DT-based ensemble learning models in Table 6, RF had the best LC prediction performance, with the highest F1 and MCC. AdaBoost had a similar F1 but a smaller MCC due to a poorer balance of precision and recall. However, the differences in predictability among the four models were limited: both low-bias, high-variance unpruned DTs with a bagging strategy (RF and ExtraTrees) and high-bias shallow DTs with a boosting strategy (AdaBoost and GBM) reached the same goal of predicting LC properties. Nevertheless, different DT-based ensemble learning models provided different predictions for the same compounds: in the test dataset, there were 102 compounds for which the four models did not offer consistent predictions. Figure 7 illustrates some examples of these compounds with inconsistent prediction results. We compare the individual predictions of the different models later, and use feature importance to gain insight into how the four DT-based models make their predictions.
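The F1 and MCC reported in Table 6 can be recovered directly from a confusion table. A short sketch with an illustrative confusion matrix (the counts are made up, not the paper's):

```python
from math import sqrt

# Hypothetical confusion-table counts: TP, FP, FN, TN
tp, fp, fn, tn = 420, 40, 30, 110

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# MCC balances all four cells of the confusion table, which is why a less
# balanced precision/recall lowers MCC even when F1 stays similar.
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(f1, 3), round(mcc, 3))  # -> 0.923 0.683
```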

Table 6 Performance metrics values and corresponding confusion tables for four different classifiers
Fig. 7

Prediction comparison of three LC compounds (LC-A, LC-B and LC-C) and three NLC compounds (NLC-A, NLC-B and NLC-C)

Interpretation of DT-based ensemble models

Table 7 lists the five most important descriptors and their feature importance from the four DT-based ensemble learning models for LC prediction; every model was repeated ten times with different random states. Detailed bar charts of feature importance containing the 20 most important descriptors are given in Additional file 1. Unlike the fluorescence results, where one descriptor had an extremely high importance score, the importance scores of the top five descriptors in LC prediction decreased gradually. In other words, the descriptor ranking for LC prediction is more meaningful than for fluorescence prediction because the model did not rely on a single descriptor. Thus, we selected the top five important descriptors to discuss the differences between the four models. RF and ExtraTrees chose similar descriptors. Two wing descriptors, wing1_NumRotatableBonds and fr_unbrch_alkane, were within the top six in the RF and ExtraTrees models, as shown in Additional file 1: Figs. S9, S10. HeavyAtomCount and NumRotatableBonds had almost the same importance in both models.
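The repeated-fit averaging of feature importance can be sketched as follows. The ten repetitions follow the text; the data, descriptor names, and model settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the LC descriptor matrix and labels
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
feature_names = [f"desc_{i}" for i in range(X.shape[1])]  # hypothetical names

# Repeat the fit with different random states and average the importances,
# smoothing out the randomness of individual tree constructions.
runs = [RandomForestClassifier(n_estimators=100, random_state=s).fit(X, y)
        .feature_importances_ for s in range(10)]
mean_imp = np.mean(runs, axis=0)

# Rank descriptors by averaged importance (top five, as in Table 7)
top5 = sorted(zip(feature_names, mean_imp), key=lambda t: -t[1])[:5]
for name, score in top5:
    print(f"{name}: {score:.3f}")
```

Averaging over random states makes the ranking reproducible, which matters when the gap between adjacent descriptors is small, as it is here.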

Table 7 Top five important descriptors of LC selected by four DT-based ensemble learning models

On the other hand, the boosting models had different feature importance results even though their performance matched that of the bagging models. GBM used more mesogen descriptors for prediction than AdaBoost, whereas RF and ExtraTrees used more wing descriptors. The importance scores of AdaBoost and GBM are relatively larger than those of RF and ExtraTrees because the shallow DTs in AdaBoost and GBM picked up fewer features during training than the full-grown DTs in RF and ExtraTrees. Interestingly, the important descriptors selected by GBM were largely different from those of the other models: in GBM, HeavyAtomCount, mesogen_HeavyAtomCount, and NumRotatableBonds were the dominant factors in LC prediction. Consequently, the four models still followed slightly different prediction scenarios, and both prediction and interpretation can be improved by summarizing the four models using blending. The relationships between descriptors and LC properties are discussed in the next section.

Performance of blending models

In LC prediction, three different blending methods are listed as follows:

  • Uniform blending: vote the prediction results and average feature importance (no meta-learning).

  • Linear blending: use logistic regression to summarize information (linear meta-learning).

  • Any blending: use GBM classification to summarize information (non-linear meta-learning).
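The three meta-learning strategies above can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the level-1 GBM hyper-parameters follow the text, while the level-0 settings and fold counts are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

level0 = [RandomForestClassifier(random_state=0),
          ExtraTreesClassifier(random_state=0),
          AdaBoostClassifier(random_state=0),
          GradientBoostingClassifier(random_state=0)]

# Out-of-fold level-0 predictions serve as the level-1 inputs
Z_tr = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in level0])
for m in level0:
    m.fit(X_tr, y_tr)
Z_te = np.column_stack([m.predict(X_te) for m in level0])

# Uniform blending: majority vote, no meta-learning (ties go to the LC class)
vote = (Z_te.mean(axis=1) >= 0.5).astype(int)

# Linear blending: logistic regression as a linear meta-learner
linear = LogisticRegression().fit(Z_tr, y_tr).predict(Z_te)

# Any blending: GBM as a non-linear meta-learner
# (n_estimators=10, max_depth=4 follow the text)
meta = GradientBoostingClassifier(n_estimators=10, max_depth=4, random_state=0)
any_blend = meta.fit(Z_tr, y_tr).predict(Z_te)
print((vote == y_te).mean(), (linear == y_te).mean(), (any_blend == y_te).mean())
```

The tie-breaking rule for the four-model vote is our choice; with an even number of level-0 models some convention is unavoidable.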

The blending results obtained for LC prediction are presented in Table 8. The three blending methods gave only slight improvements compared with the four DT-based ensemble learning models (level-0 models). The confusion tables reveal that linear blending and any blending increased the number of correct NLC predictions. Any blending, performed with 10 estimators and max_depth = 4 as hyper-parameters to model the relationship between the level-0 models and the LC property, demonstrated a high accuracy of 88.8%, an F1 score of 92.7% and an MCC of 68.6%. Although it is hard to conclude from the performance metrics alone that blending brought significant improvements, the advantage of blending is that it summarizes the different predictions of the level-0 models. Since it is risky to rely on one specific model for predicting complex chemical phenomena, we further compared the prediction results of the four DT-based ensemble learning models and any blending for three LC structures and three NLC structures, shown in Fig. 7. These examples reveal that different models have pros and cons for the same prediction, and blending is one way to summarize the different prediction results and improve generalizability. Surprisingly, any blending sometimes produced a correct prediction when only one level-0 model was correct, such as for LC-A and NLC-A in Fig. 7, and it resolved the inconsistency of level-0 predictions whenever two or more level-0 models were correct. Although the blending methods did not provide a significant improvement in prediction, the summarization of level-0 model information is the crucial application of blending. To further understand the details and effects of blending, it is necessary to compare the feature importance of descriptors extracted by the different blending methods.

Table 8 Performance metrics values and corresponding confusion tables for three different blending methods

Interpretation of blending models

We selected the top five important descriptors and their feature importance from the three blending methods in Table 9. The bar charts of feature importance are given in Additional file 1. First, we compared the descriptor selections of the different blending methods; then we discuss why those descriptors were important for LC prediction. Uniform blending and linear blending had almost the same accuracy and F1, and they used similar descriptors for prediction (Additional file 1: Figs. S13, S14). Because of the equal weights of the level-0 models in uniform blending, AdaBoost and GBM, which have larger importance scores, led to higher importance of the mesogen descriptors. In linear blending, on the other hand, the weights of the level-0 models were derived from the coefficients of the logistic regression (RF: 11%, ExtraTrees: 47%, AdaBoost: 11%, GBM: 31%). The linear blending result was therefore mainly based on the ExtraTrees model, so the wing descriptors were more important than the mesogen descriptors.

Table 9 Top five important descriptors selected by three blending methods

Any blending selected the same top five descriptors as uniform blending, according to weights (RF: 49%, ExtraTrees: 6%, AdaBoost: 8%, GBM: 37%) derived from the feature importance of the level-1 GBM model. The high weight of GBM led to higher importance of the mesogen descriptors. There are two possible reasons why any blending improved the prediction performance: first, the high weight of RF, which was the best of the four level-0 models; second, the ensemble learning of the level-1 GBM model, which can capture the complicated relationship between the level-0 predictions and LC behavior.
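Deriving blended feature importance as a weighted average of the level-0 importances can be sketched as follows. The weights use the values quoted in the text (RF 49%, ExtraTrees 6%, AdaBoost 8%, GBM 37%); the per-model importance vectors and descriptor subset are illustrative assumptions.

```python
import numpy as np

descriptors = ["HeavyAtomCount", "mesogen_HeavyAtomCount", "NumRotatableBonds"]

# Hypothetical level-0 feature importances (rows: RF, ExtraTrees, AdaBoost, GBM)
level0_imp = np.array([[0.30, 0.10, 0.25],
                       [0.20, 0.05, 0.30],
                       [0.40, 0.20, 0.10],
                       [0.25, 0.35, 0.20]])

# Level-0 weights taken from the level-1 GBM feature importances in the text
weights = np.array([0.49, 0.06, 0.08, 0.37])

# Blended importance: weighted average across the level-0 models
blended = weights @ level0_imp
for name, score in sorted(zip(descriptors, blended), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

With this scheme, a model with a high level-1 weight (here GBM) pulls the blended ranking toward its own preferred descriptors, which is exactly why any blending emphasized mesogen descriptors.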

Lastly, the summarized information led to a unified interpretation of the relationships between descriptors and LC properties. Two descriptors calculated from the raw structure, HeavyAtomCount and NumRotatableBonds, were selected by almost every model. HeavyAtomCount, the number of non-hydrogen atoms, can be regarded as the size of a compound, and NumRotatableBonds reflects alkyl chains. In fact, LCs typically consist of long alkyl groups, and the molecular length should be at least 1.3 nm [45]. The remaining important descriptors concerned the sizes of wing1, wing2, and the mesogen. In our structure separation, wing1 was defined as the longer chain and wing2 as the shorter one. A long hydrocarbon wing1, such as an alkyl chain, is able to stabilize molecular orientations and thereby form liquid crystal phases [46]. Some wing2 fragments are small polar groups, such as halogens, nitrile groups, and nitro groups, which also generate the intermolecular forces that stabilize orientations [46]. The importance score of mesogen size was high because an extended, structurally rigid mesogen, such as linearly extended benzene rings, seems to be the main criterion for LC behavior. The descriptor fr_unbrch_alkane, which represents the fraction of unbranched alkanes, was still important in any blending, as shown in Fig. 8. Branched alkanes in the wing groups may disrupt molecular packing and destabilize liquid crystals. Therefore, both the size and the branching of the wings are important for LC prediction.

Fig. 8

Bar chart of top 10 important descriptors selected by any blending

Conclusion

The present study demonstrated that "blending" can boost the predictability and interpretability of traditional trustworthy models. The blending methods were compared with four different DT-based ensemble learning methods (RF, ExtraTrees, AdaBoost, and GBM) on regression and classification tasks. For the regression task on the fluorescence dataset, blending with the QC descriptors produced a model of good predictability and interpretability, in good agreement with the experimental facts of fluorescence wavelengths. For the classification of liquid crystal behavior, blending also exhibited better predictive performance and provided insight into liquid crystal behavior. The increased accuracy of the blending method indicates that the relevant phenomena in the data were well modeled. Although the DT-based ensemble learning models were powerful enough to predict properties accurately, different models provided inconsistent predictions for the same compounds; blending resolved this inconsistency whenever two or more level-0 models made correct predictions. A comparison of the feature importance of the DT-based ensemble learning models and of blending showed that blending led to better performance and a unified interpretation of the trained model by summarizing the individual predictions. The QSPR approach is a promising tool that provides quick and cost-effective prediction of the properties of target compounds. Trust in the QSPR approach can be further enhanced by interpretation, when blending complements each trustworthy model in ways that conform to human knowledge and expectations.

Availability of data and materials

The datasets analyzed during the current study are available in the Fluorescence.org repository [http://www.fluorophores.tugraz.at/] and in the commercial database LiqCryst [http://liqcryst.chemie.uni-hamburg.de/en/program.php]. Additional file 1 is the supplementary material containing detailed descriptor calculations, hyper-parameters of the models, and feature importance outputs. Additional file 2 contains the molecular structures of the fluorescence dataset as an SDF file. Additional files 3 and 4 are the training and test sets of the fluorescence dataset with the descriptors that were used. Additional file 5 is the code used to build the models.

References

  1. Mitchell JBO (2014) Machine learning methods in chemoinformatics. Wiley Interdiscip Rev Comput Mol Sci 4:468–481
  2. Katritzky AR, Lobanov VS, Karelson M (1995) QSPR: the correlation and quantitative prediction of chemical and physical properties from structure. Chem Soc Rev 24:279–287
  3. Hansch C, Maloney PP, Fujita T, Muir RM (1962) Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature 194:178
  4. Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Boca Raton
  5. Goh ATC (1995) Back-propagation neural networks for modeling complex systems. Artif Intell Eng 9:143–151. https://doi.org/10.1016/0954-1810(94)00011-S
  6. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297. https://doi.org/10.1007/BF00994018
  7. Kim B, Khanna R, Koyejo OO (2016) Examples are not enough, learn to criticize! Criticism for interpretability. In: Advances in neural information processing systems, pp 2280–2288
  8. Lakkaraju H, Bach SH, Leskovec J (2016) Interpretable decision sets: a joint framework for description and prediction. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1675–1684
  9. Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
  10. Strobl C, Boulesteix A-L, Kneib T et al (2008) Conditional variable importance for random forests. BMC Bioinform 9:307
  11. Svetnik V, Liaw A, Tong C, Wang T (2004) Application of Breiman's random forest to modeling structure–activity relationships of pharmaceutical molecules. In: Roli F, Kittler J, Windeatt T (eds) Multiple classifier systems: 5th international workshop, MCS 2004, Cagliari, Italy, June 9–11, 2004. Springer, Berlin, pp 334–343
  12. Teixeira AL, Leal JP, Falcao AO (2013) Random forests for feature selection in QSPR models—an application for predicting standard enthalpy of formation of hydrocarbons. J Cheminform 5:9
  13. Guha R, Jurs PC (2004) Development of linear, ensemble, and nonlinear models for the prediction and interpretation of the biological activity of a set of PDGFR inhibitors. J Chem Inf Comput Sci 44:2179–2189. https://doi.org/10.1021/ci049849f
  14. Polishchuk PG, Muratov EN, Artemenko AG et al (2009) Application of random forest approach to QSAR prediction of aquatic toxicity. J Chem Inf Model 49:2481–2488. https://doi.org/10.1021/ci900203n
  15. Marchese Robinson RL, Palczewska A, Palczewski J, Kidley N (2017) Comparison of the predictive performance and interpretability of random forest and linear models on benchmark data sets. J Chem Inf Model 57:1773–1792
  16. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140. https://doi.org/10.1007/BF00058655
  17. Freund Y, Schapire R, Abe N (1999) A short introduction to boosting. J Jpn Soc Artif Intell 14:1612
  18. Zhu H, Tropsha A, Fourches D et al (2008) Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis. J Chem Inf Model 48:766–784
  19. Wolpert DH (1992) Stacked generalization. Neural Netw 5:241–259
  20. Bennett J, Lanning S et al (2007) The Netflix prize. In: Proceedings of KDD cup and workshop, p 35
  21. fluorophores.org. http://www.fluorophores.tugraz.at/. Accessed 1 May 2007
  22. Weber G, Farris FJ (1979) Synthesis and spectral properties of a hydrophobic fluorescent probe: 6-propionyl-2-(dimethylamino)naphthalene. Biochemistry 18:3075–3078. https://doi.org/10.1021/bi00581a025
  23. Kucherak OA, Didier P, Mély Y, Klymchenko AS (2010) Fluorene analogues of prodan with superior fluorescence brightness and solvatochromism. J Phys Chem Lett 1:616–620. https://doi.org/10.1021/jz9003685
  24. Lu Z, Lord SJ, Wang H et al (2006) Long-wavelength analogue of PRODAN: synthesis and properties of anthradan, a fluorophore with a 2,6-donor–acceptor anthracene structure. J Org Chem 71:9651–9657. https://doi.org/10.1021/jo0616660
  25. Vill V (2005) LiqCryst 4.6 database. LCI, Fujitsu
  26. Opitz D, Maclin R (1999) Popular ensemble methods: an empirical study. J Artif Intell Res 11:169–198
  27. Polikar R (2006) Ensemble based systems in decision making. IEEE Circuits Syst Mag 6:21–45
  28. Rokach L (2010) Ensemble-based classifiers. Artif Intell Rev 33:1–39
  29. Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42. https://doi.org/10.1007/s10994-006-6226-1
  30. Breiman L (1997) Arcing the edge
  31. Friedman JH (2016) Greedy function approximation: a gradient boosting machine. https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
  32. Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38:367–378
  33. Breiman L (1996) Stacked regressions. Mach Learn 24:49–64
  34. Muratov EN, Artemenko AG, Varlamova EV et al (2010) Per aspera ad astra: application of simplex QSAR approach in antiviral research. Future Med Chem 2:1205–1226
  35. Raccuglia P, Elbert KC, Adler PDF et al (2016) Machine-learning-assisted materials discovery using failed experiments. Nature 533:73
  36. Kode-Chemoinformatics (2016) Dragon version 7.0.4
  37. Frisch MJ, Trucks GW, Schlegel HB et al (2016) Gaussian 09 Revision A.02
  38. RDKit. http://rdkit.org/. Accessed 1 Apr 2017
  39. Becke AD (1993) A new mixing of Hartree–Fock and local density-functional theories. J Chem Phys 98:1372–1377. https://doi.org/10.1063/1.464304
  40. Chen C-H, Tanaka K, Funatsu K (2018) Random forest approach to QSPR study of fluorescence properties combining quantum chemical descriptors and solvent conditions. J Fluoresc 28:695–706
  41. Marini A, Muñoz-Losa A, Biancardi A, Mennucci B (2010) What is solvatochromism? J Phys Chem B 114:17128–17135. https://doi.org/10.1021/jp1097487
  42. Chen C-H, Tanaka K, Funatsu K (2019) Random forest model with combined features: a practical approach to predict liquid-crystalline property. Mol Inform 38:1800095
  43. Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
  44. Sheppard SE, Newsome PT (1942) The effect of solvents on the absorption spectra of dyes. II. Some dyes other than cyanines. J Am Chem Soc 64:2937–2946
  45. Gray GW (1962) Molecular structure and the properties of liquid crystals. Academic Press, Cambridge
  46. Priestly E (2012) Introduction to liquid crystals. Springer Science & Business Media, Berlin


Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Affiliations

Authors

Contributions

C-HC designed the computational framework, analyzed and interpreted the dataset and wrote the manuscript. KT and MK were in charge of overall direction and planning and contributed to the interpretation of the results. KF supervised the project. All authors discussed the results and contributed to the final manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Kimito Funatsu.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

The supplementary material containing detailed descriptor calculations, hyper-parameters of the models, and feature importance outputs.

Additional file 2.

The SDF file containing molecule structures of the fluorescence dataset.

Additional file 3.

The training set of fluorescence with descriptors.

Additional file 4.

The test set of fluorescence with descriptors.

Additional file 5.

The source code used to build the prediction models.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Chen, C., Tanaka, K., Kotera, M. et al. Comparison and improvement of the predictability and interpretability with ensemble learning models in QSPR applications. J Cheminform 12, 19 (2020). https://doi.org/10.1186/s13321-020-0417-9


Keywords

  • QSPR
  • Quantitative structure–property
  • Fluorescence
  • Liquid crystal
  • Ensemble learning
  • Blending
  • Decision tree
  • Random forest
  • Extremely randomized trees