First we shall describe the different machine learning models used to predict the RTs, and then we shall present our Bayesian approach to project the RTs to a given Chromatographic Method (CM).
Prediction of retention times with machine learning
Several state of the art machine learning regressors were tested for predicting the RTs using three different sets of features: fingerprints only, descriptors only and fingerprints+descriptors. Parameter search was used for tuning all models with the exception of CatBoost-based regressors (see "Gradient boosting" Section for the rationale). We have also created an ensemble with all the trained models to attempt to further improve Retention Time (RT) prediction [33]. Some of the choices for the regressors can be understood by the need of having diversity in their predictions to increase the chances of the ensemble improving their individual performances (see "Blending" Section).
Preprocessing of descriptors and fingerprints
Descriptors were first standardized and imputed using median imputation when alvaDesc was not able to generate some descriptor. If imputation was needed, a missing indicator was added, enabling the regressors to account for missingness despite the imputation. Features with 0 variance were removed. Highly correlated features were also eliminated (correlation \(> 0.9\)); this conservative threshold was not tuned since all tested regressors are robust against collinearity. The main benefit of removing correlated features is memory saving.
The only preprocessing applied to the fingerprint features was removing those with low variance. Treating each feature X as a binary Bernouilli random variable, the variance threshold was selected using \(\text {Var}[X]=p (1-p)\), were p is a parameter to be tuned (see "Bayesian hyperparameter search" Section) which is usually set to a high value (typically \(>0.9\)).
Taking inspiration from [24], an additional binary feature was added to each molecule representation indicating whether the molecule is retained or not. Since in a real world application this information would not be available, this feature must also be predicted. To that end, we trained a eXtreme Gradient Boosting (XGBoost) classifier. As suggested by Fig. 3, a molecule was considered non-retained if its Retention Time (RT) was smaller than 5 minutes. The XGBoost classifier was tuned using the same procedure described in "Bayesian hyperparameter search" Section for the regressors, although the metric to be maximized in this case was the F1 score. Preliminary results suggested that using fingerprints, descriptors, and fingerprints+descriptors yielded similar results, so we used only fingerprints as features for speed.
Gradient boosting
Gradient Boosting Machines (GBMs) have already been considered in state of the art methods for Retention Time (RT) prediction [24]. In this work, several GBMs were tested, using slightly different approaches for the hyperparameter search. In addition to the interest of comparing several GBMs, the use of different combinations of GBMs and tuning options was partially motivated by the need of having diversity in the predictions for building a good ensemble (see "Blending" Section). Specifically, we tested:
-
XGBoost [34]: it is probably the most-commonly used GBM, and it was employed for Retention Time (RT) prediction in [24]. Bayesian search was applied on different regressors fed using fingerprints, descriptors and fingerprints+descriptors. Among the tuned parameters, the most relevant ones include the number of boosting rounds, the maximum depth of the trees, subsampling parameters (either by column, by tree or by level), regularization parameters (such as \(L_1\) and \(L_2\) regularization) and parameters controlling the conservativeness of the algorithm (usually referred as \(\gamma\) and minimum child weight).
-
Gradient Boosting Machine (lightGBM) [35]: it is a well-known alternative to XGBoost, with optimizations for speed and memory usage (hence, its name). Furthermore, stepwise optimization methods particularly designed for lightGBM can be used. This avoids the need for Bayesian search, further reducing tuning times by exploiting heuristics. The hyperparameters were selected in the following order: \(L_1\) regularization, \(L_2\) regularization, maximum number of leaves, proportion of randomly selected features on each tree, bagging fraction, bagging frequency (it controls the number of iterations between bagging) and the minimum number of samples in the leaves.
-
CatBoost [36]: an interesting question regarding the non-retained molecules is if their inclusion as training data improves the performance of the regressors. To investigate this question, the performance of a regressor when trained with different weights for the retained and non-retained molecules can be evaluated. Different weights for both types of molecules can also help with the unbalance between retained and non-retained molecules. Since the ratio of non-retained to retained molecules is approximately 1/40 in the SMRT dataset, the weight of the retained molecules was set to 1, whereas the weight of the non-retained molecules was varied between \(10^{-6}\) (effectively ignoring them) and 80 (hence the global influence of the non-retained molecules is approximately twice the influence of the retained ones). However, using the same approach as with previous regressors would require a full Bayesian search for each weight of the non-retained molecules. Instead of tuning parameters for each weight, we looked for a regressor able to provide good performance with its default parameters. CatBoost was selected for this reason [36]. Note that CatBoost regressors not only permit studying the influence of the non-retained molecules in the predictions, but they also provide a useful context that may enable the meta-regressor of the ensemble to distinguish between retained and non-retained molecules (see "Blending" Section).
Deep neural network
Together with GBMs, DNNs usually achieve the best results in machine learning competitions [37, 38]. DNNs were used for Retention Time (RT) prediction in [21], where a DNN with 4 layers and regularization was proposed. Regularization is key for achieving good generalization, since even a small shallow neural network can overfit the SMRT dataset in a few epochs. Driven by this observation, we used a DNN with just 3 layers, regularized using large dropout rates. The sizes of the hidden layers, the dropout rates and the non-linear activations were determined using Bayesian hyperparameter search.
To improve the generalization ability of the DNN and to accelerate its training, we used cosine annealing warm restarts [39]. The number of restarts and the length of the cosine annealing were also subject to hyperparameter search. After the training with warm restarts, we employed Stochastic Weight Averaging (SWA) using a constant learning rate schedule. With this setting, SWA just consists of training the DNN for a few extra epochs (whose optimal value is to be determined during hyperparameter search), and then averaging the weights of the DNN along the trajectory followed during optimization. In [40], the authors suggest that SWA leads to wider minima, which is hypothesized to result in better generalization than conventionally trained DNN.
Finally, quantile transformation was applied to RTs before fitting. The method transforms RTs to follow a standardized normal distribution. This may facilitate learning since the last layer does not need to learn large weights to match the untransformed RTs.
Kernel methods
Support Vector Machines (SVMs) have already been considered for a wide variety of applications related to metabolites, including elution order prediction [19]. Although we tested SVMs with both descriptors and fingerprints, performance of both regressors was poor. As an alternative to this classic kernel method, we considered Deep Kernel Learning (DKL) [41]. DKL can be interpreted as a DNN whose last layer has been substituted by a Gaussian Process (GP). This permits leveraging both the ability of deep learning for extracting relevant features from the raw-inputs, and the non-parametric flexibility of GPs. The combination of the DNN and the GP kernel can also be viewed as a new flexible kernel which can be used as a drop-in replacement for standard kernels. DKLs were tested using fingerprints, descriptors and fingerprints+descriptors.
Following the observations from "Deep neural network" Section, we employed a highly regularized DNN. Besides dropout, we also considered batch-normalization [42], not only because of its regularization capabilities, but also because it keeps activations from the network within a predictable range. This eases the use of kernel interpolation (specifically, KISS-GP [43]) to approximate the GP kernel, which enables fast computations. Quantile transformation was also applied to the RTs.
DKL was trained using early stopping, and the learning rate was tuned during parameter search. Similar to the DNNs from "Deep neural network" Section, the specific architecture and regularization were subject to parameter search. Learning rate scheduling was used, reducing the learning rate when validation loss was stacked in a plateau. The patience argument before decreasing the learning rate was also tuned. Finally, three kernels were considered during hyperparameter tuning: the squared exponential kernel, the linear kernel and a spectral mixture kernel with four components [44]. A full list of the mathematical expressions of the kernels used in this paper can be found in Section S3 of Additional file 1.
Blending
We tested if the combination of the different regressors could improve their individual predictions. We used blending [45] to build a meta-regressor which learns to combine the predictions of the so-called base-regressors. Blending is a popular alternative to stacked generalization (or stacking) [46] which has lower computational demand and it is simpler, resulting in less likelihood of information leakage. With large datasets like SMRT, blending and stacking usually yield similar results. Hence, since the meta-regressor is also subject to parameter tuning, blending was used for faster training.
To train a meta-regressor with blending, a holdout set is created using a small subset of the training set. In our experiments, we used a 80-20% split. The base-regressors are trained on the 80% of the data, and their predictions for the holdout dataset are stored. The meta-regressor then learns to combine the predictions of the base-regressors using the predictions on the holdout dataset. Note that an instance on the original training data is only used just once for training, either on the base-regressors or in the meta-regressor, avoiding information leakage. This procedure for training the meta-regressor is also outlined in Fig. S1 of Additional file 1.
A random forest was used as meta-regressor, tuning its main parameters through Bayesian optimization. The parameters tuned were the number of trees, the maximum depth of each tree, the maximum number of features considered at each split, the minimum number of samples before considering a split and the minimum number of samples at a leaf.
Bayesian hyperparameter search
Most regressors with the exception of lightGBMs (tuned using iterative search for speed tuning) and CatBoosters (not tuned due to its good default values) were tuned using Bayesian hyperparameter search. The p parameter controlling the thresholding of binary features was also optimized (see "Preprocessing of descriptors and fingerprints" Section). The parameters were tested following the predictions of a TPE algorithm [27]. The TPE algorithm works by suggesting the parameters that maximize the expected improvement in the score being maximized, which in this paper was the negative of the MEDian Absolute Error (MEDAE). This permits balancing exploration versus explotation, obtaining a set of hyperparameters with good performance in fewer iterations than other approaches like grid search. In our experiments each model performed 50 iterations of the Bayesian search.
Regarding the optimization of the blended regressor, it should be noted that it proceeds greedily. That is, base-regressors are tuned individually, and the predictions of the best performing parameters are then used to create the training set for the meta-regressor. Finally, the parameters of the latter are optimized. It may be argued that this approach is suboptimal, since the base-regressors cannot be tuned to complement each other. However, jointly optimizing all base-regressors and the meta-regressor is difficult due to the dimensionality of the search space. Furthermore, this approach would not permit drawing conclusions from the performance of the base-regressors, which is part of the objectives of this work.
Validation procedure
To avoid data-leakage when reporting the performance of the different models, nested stratified cross-validation was used. Nested cross-validation guarantees that different data is used to tune model parameters and to evaluate its performance by means of outer and inner cross-validation loops [47]. In the outer loop, train/test splits are generated, which are then used for averaging the test scores over several data splits. In the inner loop, the train set is further split in train/validation subsets. The best parameters are selected by minimizing the MEDAE on the validation splits. We used 5-folds and 7-folds stratified cross-validations in the outer and inner loops, respectively. To ensure that the distribution of RTs is representative of the population in all folds, stratification was performed by separating the target variable (RTs) into 6 different bins. The validation procedure is also summarized in Fig. S1 of Additional file 1.
The Bayesian hyperparameter search ("Bayesian hyperparameter search" Section) and the validation procedure described in this section approximately required 2.5 months of computational time in a computer with an AMD Ryzen Threadripper 2970WX with 24 cores at 1.85 Gz, and a NVIDIA GeForce RTX 2080 GPU.
Projection between chromatographic methods
Machine learning models trained on a given Retention Time (RT) dataset (SMRT in this work) cannot be directly used to predict experimental RTs from other Chromatographic Method (CM)s due to the variability of the experimental setups. To exploit the knowledge of a predictive model trained on the SMRT, a second model projecting the predicted RTs to the specific Chromatographic Method (CM) used in an experiment is needed.
Given a specific Chromatographic Method (CM), the projection function can be learned if some of the experimental metabolites have been identified, and therefore both their experimental and predicted RTs are known (step 3 in Fig. 2). For the workflow in Fig. 2 to be practical, it is desirable that the projection function can be learned from a small dataset (tens of molecules) so that the researcher has to identify just a few molecules. In practice, this would probably be accomplished by adding pure metabolite standards to the sample. The more standards that need to be used, the more time and money will be required. Hence the interest in minimizing their number.
Bayesian methods are particularly well suited to solve classification/regression problems when data is scarce. This is due to their ability to incorporate prior knowledge about the problem. If the prior provides useful inductive biases for the task at hand, only a few samples may be needed to learn a proper solution to the problem [48]. Hence, under the Bayesian paradigm, the issue of learning from few data becomes how to specify a suitable prior for the problem.
Meta-learning has recently arose as a possible solution for acquiring useful prior knowledge. In meta-learning, knowledge is gained by solving a set of tasks (meta-tasks), which is then exploited to solve a closely-related but different task (target-task). In the Bayesian setting, meta-tasks are used to learn a useful prior distribution, which is then used as starting point to solve the target-task. This is done by incorporating new evidence provided by the target-task into the prior, which results in the so-called posterior distribution.
Hence, we propose the use of meta-learning to solve the problem of learning from few samples. The outline of the approach is shown in Fig. 4. We shall consider that the set of meta-tasks \(\mathcal {M}\) consists of m datasets \(\mathcal {M}=\{\mathcal {D}_i\}_{i=1}^m\), each corresponding to a different Chromatographic Method (CM). Each dataset \(\mathcal {D}_i\) contains predicted RTs \(\mathbf {x}^i\), as well as the experimental RTs obtained with a specific Chromatographic Method (CM), \(\varvec{y}^i\). Hence, \(\mathcal {D}_i=\{\mathbf {x}^i, \varvec{y}^i\}\) is a single meta-task and we would like to map \(\varvec{x}^i\) to \(\varvec{y}^i\) using a smooth function \(f(\cdot )\). The predicted RTs are obtained by using the best predictive model from "Prediction of retention times with machine learning" Section. In our problem, meta-tasks are used to learn a prior distribution p(f) over the functions \(f(\cdot )\) translating predicted RTs to experimental RTs of different CMs. Let us consider that we have gathered experimental RTs using the CMs A, B and C. During meta-learning, the projection functions
$$\begin{aligned} f_A(\varvec{x}^A) \approx \varvec{y}^A, f_B(\varvec{x}^B) \approx \varvec{y}^B, \text { and } f_C(\varvec{x}^C) \approx \varvec{y}^C, \end{aligned}$$
will be constructed. These functions should be considered as samples drawn from the same distribution p(f). The aim of meta-learning is to learn a plausible prior p(f) that explains all observed samples \(f_A(\cdot ), f_B(\cdot )\) and \(f_C(\cdot )\).
In addition to the the meta-tasks we have the target-task \(\widetilde{\mathcal {D}}=\{\varvec{\tilde{x}}, \varvec{\tilde{y}}, \varvec{\tilde{x}^*}, \varvec{\tilde{y}^*}\}\). Again, a single target-task is comprised of data from a single Chromatographic Method (CM). Intuitively, the target training points \(\{\varvec{\tilde{x}}, \varvec{\tilde{y}}\}\) represent molecules whose identity is known (step 3 in Fig. 2) whereas the target test points \(\{\varvec{\tilde{x}^*},\varvec{\tilde{y}^*}\}\) are molecules whose identity is to be discovered (step 6 in Fig. 2). We assume that the number of annotated molecules is small (indeed, this is the main difference between a meta-task and a target-task). Note that, when solving the target-task, the prior distribution p(f) learned during meta-learning is updated with the new evidence \(\{\varvec{\tilde{x}}, \varvec{\tilde{y}}\}\), which should enable the prediction/ranking of \(\{\varvec{\tilde{x}^*},\varvec{\tilde{y}^*}\}\).
GPs are particularly suited as projection model: they represent a distribution over functions, they can perform regression on smalls amount of data, and they can incorporate prior knowledge using the Bayesian framework. Hence, we shall consider:
$$\begin{aligned}&f(\cdot ) \sim \mathcal {GP}\left( m_{\varvec{\theta }_m}(\cdot ), k_{\varvec{\theta }_k}(\cdot , \cdot )\right) \text { or equivalently}\\&p_{\varvec{\theta }}(f)= \mathcal {GP}\left( f \mid m_{\varvec{\theta }_m}, k_{\varvec{\theta }_k}\right) ,\qquad \varvec{\theta }=[\varvec{\theta }_m, \varvec{\theta }_k] \end{aligned}$$
where the mean and kernel functions of the GP are parametrized with \(\varvec{\theta }_m\) and \(\varvec{\theta }_k\), respectively. Hence, the whole prior is parametrized with \(\varvec{\theta }=[\varvec{\theta }_m, \varvec{\theta }_k]\). These parameters are learned by minimizing the negative Leave One Out (LOO) log predictive probability on the meta-tasks (see Algorithm 1, where \(\varvec{y}^i_{-j}\) means all targets but the j-th item). The use of the LOO-based loss instead of the usual log marginal loss is based on the observation that cross-validation procedures (such as LOO) should be more robust against possible model misspecifications [49, Section 4.8]. Once the parameters have been learned, they can be used to specify a prior that is expected to generalize well on the target-task. Indeed, to avoid overfitting, \(\varvec{\theta }\) is not optimized while solving the target-tasks. The only parameter estimated with target-task data is the variance of the residuals. This is done by maximizing type II maximum likelihood during 250 epochs with an Adam optimizer with learning rate set to 0.01.
It is worth noting that the proposed meta-learning method fits well the Retention Time (RT)-based filtering workflow shown in Fig. 2. In this scenario, a query from a researcher corresponds to a target-task, which exploits the information provided by a previously meta-learned prior. Furthermore, although computing new predictions with the projection function scales quadratically with the training data, it will typically train on tens of RTs. Hence, it would only take tenths of a second to map experimental RTs to predicted RTs. Next sections present the data preprocessing and the parameter (\(m_{\varvec{\theta }_m}\), \(k_{\varvec{\theta }_k}\)) selection process used to devise our projection method.
Experimental setup and data preprocessing
In [21] the non-retained molecules were ignored for validating the projection method, and we adopt the same methodology here. The rationale for this is that in an experiment is easy to know if a molecule has been retained or not, and a researcher would not use a RTs database to try to annotate non-retained metabolites.
To avoid data-leakage during validation, we ensured that the meta-tasks data, target training data, and target test data did not overlap. To that end, we used a leave-one-Chromatographic Method (CM)-out approach. That is, data from a specific Chromatographic Method (CM) could only be used as either part of the meta-tasks or as the target-task. Hence, when using a Chromatographic Method (CM) as target-task, meta-learning was used on the remainder of CMs. Following [21], the following CMs were used to create the target-tasks: FEM long (342 molecules), FEM orbitrap plasma (133), LIFE old (148) and RIKEN (271). The number of molecules in the remainder of CMs is 2418.
For a specific target-Chromatographic Method (CM) (one of the four above mentioned), and after meta-learning on the meta-tasks, the target training data is created by subsampling the Chromatographic Method (CM) data (and the remainder of RTs are used as target test data). To obtain a good projection, researchers are expected to add standards spanning the whole range of the experimental RTs. To mimic this behaviour, stratified sampling was used. Sampling was repeated 10 times for each number of training points to obtain error estimates. To study the robustness of the projection method when only a few metabolites are known, the number of training points was varied between 10 and 50; 50 was the number of molecules used in [21].
All GP models share the same preprocessing steps despite their mean and kernel functions. RTs are transformed to log space using
$$\begin{aligned} \bar{\varvec{x}} = \log (1 + \varvec{x}) \qquad \text { and } \bar{\varvec{y}} = \log (1 + \varvec{y}), \end{aligned}$$
where \(\varvec{x}\) and \(\varvec{y}\) may belong to any meta-task \(\mathcal {D}_i\) or any target-task \(\widetilde{\mathcal {D}}\). The motivation for using this transformation is twofold. On one hand, RTs take only positive values. Without any transformation, the model has to learn this restriction on its own, which may be difficult in the scarce data scenario. By using the transformed RTs \(\bar{\varvec{y}}\), the model learns to predict a target without any restriction. Then, the inverse transformation maps back \(\bar{\varvec{y}}\) to the positive interval, forcing positiveness in the projected RTs. On the other hand, by also applying the transformation to \(\varvec{x}\), the non-linear relationship between \(\varvec{x}\) and \(\varvec{y}\) linearizes, which could enable the use of simpler kernels.
After the log-transformation, and since the software used to implement GPs is geared towards using inputs normalized to [0, 1] and outcomes normalized to \([-1, 1]\), data is further scaled using robust statistics. We used
$$\begin{aligned} \bar{\bar{x}} = \left( \frac{\bar{x} - \text {median}(\bar{x})}{0.741\cdot \text {IQR}(\bar{x})} + 3\right) / 6 \qquad \text { and } \qquad \bar{\bar{y}} = \left( \frac{\bar{y} - \text {median}(\bar{y})}{0.741\cdot \text {IQR}(\bar{y})}\right) / 3, \end{aligned}$$
where \(\text {IQR}\) denotes the interquartile range. The constant 0.741 is used because, for normal populations, the standard deviation fulfills \(\sigma \approx 0.741 \cdot \text {IQR}\). Hence, under the normality assumption, 99.7% of the transformed \(\bar{\bar{x}}\) will be on the [0, 1] range and 99.7% of the transformed \(\bar{\bar{y}}\) will be on the \([-1, 1]\) range. Despite the different last preprocessing step for predicted (\(\varvec{x}\)) and experimental (\(\varvec{y}\)) RTs, both transformations are learned using the predicted RTs. Thus, there is no Chromatographic Method (CM)-dependent scaling.
Comparison of meta-learned GP models
We evaluated the performance of the different meta-learned GPs models arising from various choices of their two parameters:
-
Mean function \(m_{\varvec{\theta }_m}(\cdot )\): a typical parametrization of a GP when no prior information is available about the mean is \(f(\cdot ) \sim \mathcal {GP}\left( 0, k_{\varvec{\theta }_k}(\cdot , \cdot )\right)\); that is, \(m_{\varvec{\theta }_m}(\cdot )=0\). The underlying assumption is that all relevant prior information can be incorporated into the kernel parameters \(\varvec{\theta }_k\). However, [50] shows that learning a mean function \(m_{\varvec{\theta }_m}(\cdot )\) (either alone or combined with kernel learning) can outperform kernel learning alone. We tested this in our problem by studying GPs with a constant mean function, and GPs with a mean function parametrized with a neural network. In our experiments, we used a neural network with two hidden layers with 128 units and leaky-ReLU activations.
-
Kernel function \(k_{\varvec{\theta }_k}(\cdot , \cdot )\): different kernels result in different properties of the projection function. We compared the commonly used kernels and combinations of them. Specifically, we tested the squared exponential kernel, Matérn kernels with \(\nu =1.5\) and \(\nu =2.5\), the polynomial kernel of degree 4, a linear combination of two squared exponential kernels, a linear combination of a linear kernel and a squared exponential kernel, and a linear combination of a squared exponential kernel and a polynomial kernel of degree 4. A full list of the mathematical expressions for these kernels can be found in Section S1 of Additional file 1.
The experimental setup described in "Experimental setup and data preprocessing" Section is used. We focused on the performance of the models in the low-data regime using just 10 training data points. For a single target-task \(\widetilde{\mathcal {D}}=\{\varvec{\tilde{x}}, \varvec{\tilde{y}}, \varvec{\tilde{x}^*}, \varvec{\tilde{y}^*}\}\), the predictive marginal log-likelihood
$$\begin{aligned} \mathcal {L}_{\mathcal {D}} = p(\varvec{\tilde{y}^*} \mid \mathcal {D}, \varvec{\tilde{x}}, \varvec{\tilde{y}}, \varvec{\tilde{x}^*}) = \int p \left( \varvec{\tilde{y}^*} \mid f(\varvec{\tilde{x}^*}) \right) p\left( f(\varvec{\tilde{x}^*}) \mid \mathcal {D}, \varvec{\tilde{x}}, \varvec{\tilde{y}}, \varvec{\tilde{x}}^*\right) d f. \end{aligned}$$
(1)
was used as metric of the model’s performance.
To obtain a single metric while taking into account the possible differences in the scales of the marginal log-likelihoods for the different CMs, each \(\mathcal {L}_{\mathcal {D}}\) was compared with the marginal log-likelihood of a reference model \(\mathcal {L}_{\mathcal {D}}^{\text {ref}}\): \(\Delta \mathcal {L}_{\mathcal {D}} = \mathcal {L}_{\mathcal {D}}-\mathcal {L}_{\mathcal {D}}^{\text {ref}}\). We used as reference model a GP with constant mean and squared exponential kernel trained on the target-task without meta-learning. This model was trained by optimizing type II maximum likelihood during 500 epochs using an Adam optimizer with learning rate set to 0.01. The final metric \(\Delta \mathcal {L}_{\text {avg}}\) was obtained by averaging across the four test-tasks and repetitions. Values \(\Delta \mathcal {L}_{\text {avg}} > 0\) correspond with models that perform better (in average) than the reference one. Note that this not only permits the comparison of different meta-learned GP-models, but it also assesses the usefulness of the meta-learning approach.
Additional experiments studying the influence of the number of meta-tasks in the performance of the GP were also carried out. They are discussed in Section S3 of the Additional file 1.
Predictive performance of the projection function
We compared the best GP-model from "Comparison of meta-learned GP models" Section with monotonic Generalized Additive Models (GAMs) [8], robust polynomial regression [21] and piecewise polynomial regression [24]. Unfortunately, it is not possible to compute the predictive marginal likelihood for all these models. Hence, we evaluated the performance of the models attending to both their predictive accuracy as well as their ability to generate proper prediction intervals. To test the predictive accuracy of the meta-learning approach we computed the median relative error, Mean Absolute Error (MAE) and MEDAE for the target test set. To test the prediction intervals we used the interval score [51]
$$\begin{aligned} S(\varvec{\ell }, \varvec{u}, \varvec{\tilde{y}^*})&= \frac{1}{\text {len}(\varvec{\tilde{y}^*})}\sum _{i=1}^{\text {len}(\varvec{\tilde{y}^*})} S(\ell _i,u_i,\tilde{y}^*_i) \qquad \text { with }\nonumber \\ S(\ell _i,u_i,\tilde{y}^*_i)&= (u_i-\ell _i)+\frac{2}{\alpha }(\ell _i-\tilde{y}^*_i)\mathbbm {1}(\tilde{y}^*_i<\ell _i)+\frac{2}{\alpha }(\tilde{y}^*_i-u_i)\mathbbm {1}(\tilde{y}^*_i>u_i), \end{aligned}$$
(2)
where \(\varvec{\ell }\) and \(\varvec{u}\) are the lower and upper ends of the prediction interval generated for the test target points \(\varvec{\tilde{y}^*}\), \(\mathbbm {1}\) is the indicator function, and \(\alpha\) is the coverage that the models are aiming for. We used \(\alpha =0.95\) in all experiments. Equation (2) can be understood by noting that a proper prediction interval should reach a tradeoff between being as small as possible (\(l_i\) should be close to \(u_i\)) and covering the observed values (\(l_i \le \tilde{y}^*_i \le u_i\)). The first term of Equation (2) just measures the length of the interval, while the second and third terms penalize having observed values outside the prediction interval (moreover, the further apart an observation is from the interval, the larger the penalty).
Note that the interval score has the same units as the RTs in \(\varvec{\tilde{y}^*}\). To obtain an adimensional metric and facilitate the comparison of different target CMs, we define the scaled interval score as \(S(\varvec{\ell },\varvec{u},\varvec{\tilde{y}^*}) / \text {median}\left( [\varvec{y^*}, \varvec{\tilde{y}^*}]\right) .\)
Ranking annotations based on the projection function
We have tested the ability of the projection method to rank and filter candidate annotations in metabolomic experiments based on mass search and RT predictions. The test implements a similar workflow to that described in Fig. 2. We collected the Retention Time (RT) predictions of the best-performing model from "Prediction of retention times with machine learning" Section for the 6,823 molecules with KEGG number in the Human Metabolome DataBase (HMDB) [52]. This simulates the database used to rank candidate annotations in Fig. 2. We used the leave-one-Chromatographic Method (CM)-out approach described in "Experimental setup and data preprocessing" Section for meta-training and target-tasks evaluation. After learning a projection function on the target training set, we simulated queries against the HMDB database to annotate the molecules on the target test set. For each molecule in the target test set, an accurate mass search (10 ppm mass error, the same as [21]) was performed to retrieve all compatible molecules from HMDB. To mimic real experimental conditions, we simulated experimental errors in the mass measurement by adding random noise to the mass of the unknown molecule. The random noise had a normal distribution with zero mean and a standard deviation of 10/3 ppm so that \(99.7\%\) of the errors were between \([-10 \text { ppm}, 10 \text { ppm}]\). Random noise below \(-10\) ppm or above 10 ppm was truncated to guarantee that the mass search always returned the correct molecule as a candidate (note that the mass search is based on the noisy mass and not the real one). Then, the molecules were ranked using Retention Time (RT) information according to a z-score computed as
$$\begin{aligned} z=\frac{\mid \tilde{y}^* - \mu (\tilde{x}^*)\mid }{\sigma (\tilde{x}^*)}, \end{aligned}$$
where \(\mu (\tilde{x}^*)\) and \(\sigma (\tilde{x}^*)\) represent the GP’s mean and standard deviation for the predicted-experimental Retention Time (RT) pair \((\tilde{x}^*, \tilde{y}^*)\). The intuition for the usage of the z-score as ranking metric is to take into account not only the agreement between the real experimental Retention Time (RT) and the projected value, but also the uncertainty in the projection. We focused on mass queries returning more than three candidates and computed the percentage of results where the true molecule was ranked among the top three candidates after z-scoring. To facilitate the interpretation of the results, a baseline performance for metabolite annotation when using only mass information was also computed. In this case, if several candidates with the same mass were returned, ties were randomly broken.