 Research article
 Open Access
Probabilistic metabolite annotation using retention time prediction and metalearned projections
Journal of Cheminformatics volume 14, Article number: 33 (2022)
Abstract
Retention time information is used for metabolite annotation in metabolomic experiments. However, its usefulness is hindered by the limited availability of experimental retention time data in metabolomic databases, and by the lack of reproducibility between different chromatographic methods. Accurate prediction of retention time for a given chromatographic method would be a valuable support for metabolite annotation. We have trained state-of-the-art machine learning regressors using the 80,038 experimental retention times from the METLIN Small Molecule Retention Time (SMRT) dataset. The models included deep neural networks, deep kernel learning, several gradient boosting models, and a blending approach. 5,666 molecular descriptors and 2,214 fingerprints (MACCS166, Extended Connectivity, and Path fingerprints) were generated with the alvaDesc software. The models were trained using only the descriptors, only the fingerprints, and both types of features simultaneously. Bayesian hyperparameter search was used for parameter tuning. To avoid data leakage when reporting the performance metrics, nested cross-validation was employed. The best results were obtained by a heavily regularized deep neural network trained with cosine annealing warm restarts and stochastic weight averaging, achieving mean and median absolute errors of \(39.2 \pm 1.2\; s\) and \(17.2 \pm 0.9\;s\), respectively. To the best of our knowledge, these are the most accurate predictions published to date on the SMRT dataset. To project retention times between chromatographic methods, a novel Bayesian meta-learning approach that can learn from just a few molecules is proposed. By applying this projection between the deep neural network retention time predictions and a given chromatographic method, our approach can be integrated into a metabolite annotation workflow to obtain z-scores for the candidate annotations.
To this end, it is enough that as few as 10 molecules of a given experiment have been identified (probably by using pure metabolite standards). The use of z-scores makes it possible to consider the uncertainty of the projection when ranking candidates, and not only its accuracy. In this scenario, our results show that in 68% of the cases the correct molecule was among the top three candidates filtered by mass and ranked according to z-scores. This shows the usefulness of this information to support metabolite annotation. Python code is available on GitHub at https://github.com/constantinogarcia/cmmrt.
Introduction
Metabolite annotation remains the main bottleneck in untargeted metabolomics [1, 2], with the vast majority of metabolites being left unidentified [3]. Beyond the molecule's mass, other molecular properties such as Retention Time (RT), collision cross section, or the fragmentation spectrum can be very valuable during the metabolite annotation process [4, 5]. The most common approach to annotating metabolites is to query a metabolomics database for compounds whose mass is compatible with the experimental masses. Often this query returns multiple annotation candidates for the same mass. Next, the researcher tries to discard the candidate annotations, or to score them according to plausibility, using other molecular properties [1].
Liquid Chromatography-Mass Spectrometry (LC-MS) remains the most common platform used in untargeted metabolomics. In addition to the m/z ratio, it provides information about the RT, the time at which metabolites elute from the chromatographic column. By using hyphenated setups (MS/MS), the fragmentation spectra of the molecules may be obtained [6]. These spectra are very useful for ruling out candidate annotations, and they are necessary to achieve the highest confidence levels of the Metabolomics Society in metabolite annotation (levels 0 and 1 [7]). However, obtaining them requires hyphenated setups, which are more expensive and complex. Even when this type of instrumentation is used, the fragmentation spectrum is not always available for every molecule of interest due to instrumentation limitations or time constraints for the analysis. Also, sometimes the amount of sample available is not sufficient for MS/MS analysis. Hence, especially in pilot untargeted studies where unambiguous identification is not crucial (that is, where confidence level 2 of the Metabolomics Society is enough), the fragmentation spectra are often not available, and the annotation has to be done with just m/z ratios and RTs.
Obtaining molecular properties experimentally (such as the RT or the fragmentation spectrum) requires the analysis of pure standards, which is a long, tedious and expensive process. Therefore, metabolomic databases often lack this information, especially for new metabolites that are still being discovered. Furthermore, the variability of the experimental setups means that different values are often obtained for these features in different setups [6, 8]. The reliable prediction of these features from the structures of molecules using machine learning techniques is therefore a compelling alternative to their experimental generation [9,10,11,12].
Computational prediction of RT has been shown to be useful for molecule annotation in proteomics [13, 14] and lipidomics [15, 16]. However, until recently the prediction of the RT of small molecules remained a challenge due to the small size (usually a few hundred molecules) of the publicly available RT datasets [17]. This size prevented the training of machine learning models capable of accurately predicting the RT of the large variety of small molecules involved in a typical metabolomic study. Efforts in this direction were therefore limited to predicting the RT of specific types of small molecules [14, 16, 18], or the elution order of the molecules [19, 20]. This situation changed recently with the publication of more than 80,000 experimental RTs collected through reversed-phase LC-MS in the METLIN Small Molecule Retention Time (SMRT) dataset [21], which has renewed interest in the RT prediction of small molecules [17, 22,23,24,25,26].
In this paper we have tested the performance of several state-of-the-art machine learning models for the task of RT prediction using the SMRT dataset. In the evaluation presented in [21], the molecules that were not retained by the column were excluded. In our evaluation, both retained and non-retained molecules are considered. Non-retained molecules are typically ignored in metabolomics experiments. However, the ultimate goal of the machine learning model would be the computational prediction of the RTs of a set of molecules present in a metabolomics database based on their chemical structures, to confirm or discard candidate metabolite annotations. In this scenario, it is unknown in advance whether a molecule of the database is going to be retained or not, and therefore it is desirable to predict the RTs of the non-retained molecules as accurately as possible.
Hyperparameter search for the models was performed with the Tree-structured Parzen Estimator (TPE) algorithm [27], and nested cross-validation was used in the evaluation. The best model was a Deep Neural Network (DNN) trained using molecular fingerprints, which improved on the performance of the best previous RT prediction models [24, 26].
Having a machine learning model capable of accurately predicting the RT would enable filtering out annotations with similar mass but different RTs. However, note that a model trained on the SMRT dataset can only accurately predict RTs for a Chromatographic Method (CM) identical to the one employed to collect this data. Since laboratories usually customize the CM for the needs of each experiment, a SMRT-based model cannot be directly applied to experimental data from other laboratories, or even to other experiments conducted in the same laboratory. However, if CMs are similar, elution order is largely preserved [8], which enables the construction of a projection function that maps RTs in one CM to RTs in another CM. Figure 1 illustrates both the dependency of the RTs on the CM and the conservation of the elution order. To build such a projection function, a set of molecules whose RTs are known in both CMs is needed.
Figure 2 shows a possible workflow to exploit the RT predictions of a machine learning model and a projection method during the metabolite annotation process. Although not explicitly shown in the figure, we assume that RTs are used in conjunction with the m/z ratio. In the center of Fig. 2 there is a large database containing molecule identities and their main chemical properties, including RTs. The RTs stored in the database are computed using the predictive model trained on the SMRT dataset (step 1). The creation of a database (step 2) avoids running complex predictive models in real time. Also, note that this database may include molecules not observed in the SMRT dataset. To use this database, a researcher provides the experimental RTs (as measured in his/her CM) of a few molecules whose identity is known (step 3). These molecules will typically be pure metabolite standards added to the sample. The molecule identities are then used to retrieve the corresponding predicted RTs from the database, which are subsequently used to create pairs of experimental-predicted RTs (step 3). A projection function mapping predicted RTs to experimental RTs is learned from these pairs (step 4). The researcher then provides the experimental m/z (not shown in the figure to avoid clutter) of the molecules he/she is trying to identify. The m/z ratios are used as a first filter to obtain candidate annotations from the database. The predicted RTs of the filtered molecules are then projected to experimental RTs to create a "projected database" (step 5). Note that the projected database is much smaller than the original one due to the m/z filtering, which makes the projection computationally efficient. The researcher finally uses the experimental RTs to query the projected database (step 6). The results retrieved from it enable scoring candidates with similar m/z but different RTs (step 7).
To make this workflow practical, it is desirable that the projection function can be learned from a very small dataset, so that the researcher only has to identify a small set of molecules. To that end, this work proposes a Bayesian meta-learning approach to project the predicted RTs to a specific CM based on just a few identified molecules. This approach has the advantage of being able to generalize from a small training set while providing confidence intervals for the RT projections between CMs, and not only a point estimate. We demonstrate the ability of the proposed projection method to learn from few samples by testing not only its predictive accuracy, but also its ability to rank the correct metabolite identity among the top three candidates based on their RTs.
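The mass filtering and RT-based ranking of the workflow can be sketched as follows. This is only an illustration under stated assumptions: the projection is assumed to return a predictive mean and standard deviation per candidate, the z-score is taken as the absolute difference between the experimental RT and the projected mean divided by the projected standard deviation, and the `Candidate` class, the `rank_candidates` function, the ppm tolerance and all numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str        # hypothetical identifier of the candidate molecule
    mass: float      # mass of the candidate stored in the database
    rt_mean: float   # projected RT in seconds (predictive mean)
    rt_std: float    # projected RT in seconds (predictive standard deviation)


def rank_candidates(candidates, exp_mass, exp_rt, ppm_tol=10.0):
    """Filter candidates by mass, then rank them by absolute RT z-score."""
    tol = exp_mass * ppm_tol * 1e-6
    in_window = [c for c in candidates if abs(c.mass - exp_mass) <= tol]
    scored = [(abs(exp_rt - c.rt_mean) / c.rt_std, c.name) for c in in_window]
    return sorted(scored)  # smallest z-score (most plausible) first


# Made-up candidates: D is removed by the mass filter, the rest are ranked.
candidates = [
    Candidate("A", 180.0634, 300.0, 20.0),
    Candidate("B", 180.0634, 340.0, 60.0),
    Candidate("C", 180.0640, 420.0, 15.0),
    Candidate("D", 250.0000, 330.0, 10.0),
]
ranking = rank_candidates(candidates, exp_mass=180.0634, exp_rt=335.0)
```

Because the score divides by the predictive standard deviation, a candidate with an uncertain projection is penalized less for the same absolute deviation than a candidate with a confident projection.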
Materials
The METLIN Small Molecule Retention Time (SMRT) dataset consists of the experimental retention times of 80,038 small molecules from the METLIN library [21]. All RTs were obtained using reversed-phase chromatography with high-performance liquid chromatography-mass spectrometry (HPLC-MS). The dataset has a wide variety of small molecules analysed under the same conditions, including metabolites, natural products and drug-like small molecules. It also includes non-retained molecules; these are compounds that are not retained in the column and elute before the gradient starts, typically within the first minute. Hence, RTs of the non-retained molecules are considerably smaller than RTs of the retained molecules. Although some authors ignore the non-retained molecules when validating machine learning models, the whole dataset was used for both training and validating the regressors of this paper. The rationale for this is that these machine learning models are going to be used to predict RTs of metabolites in a database (see Fig. 2). Then, these predictions could be used to filter and rank experimental data. If a regressor is trained without non-retained RTs, it will only be able to predict retained RTs, even for a non-retained metabolite A in the database. If in an experiment there is an unidentified metabolite B with a similar m/z and whose RT is close to A's (wrongly) predicted RT, the system will propose metabolite A as a candidate annotation for B. Hence the interest in training the regressors with both retained and non-retained molecules.
The SMRT dataset has been made public, including the PubChem numbers and SDF files representing the chemical structures of the molecules [28], together with their experimental RT information. In this work, these chemical structures were used to obtain a wide variety of features describing relevant properties of the molecules. These features were computed using alvaDesc [29, 30] and include both fingerprints and molecular descriptors. Specifically, alvaDesc permits the computation of MACCS166 fingerprints, Extended Connectivity Fingerprints (ECFP) [31] and Path Fingerprints (PFP), making a total of 2,214 fingerprints. Additionally, the 5,666 molecular descriptors supported by alvaDesc were also generated; the complete list can be seen in [32]. All the descriptors and fingerprints obtained with alvaDesc were used to feed the regressors.
Following [21], we also used the PredRet database [8] for validating the projections from predicted to experimental RTs. PredRet is a database of experimental RTs from different chromatographic systems, commonly used for building and testing projection models between pairs of CMs.
Methods
First we shall describe the different machine learning models used to predict the RTs, and then we shall present our Bayesian approach to project the RTs to a given Chromatographic Method (CM).
Prediction of retention times with machine learning
Several state-of-the-art machine learning regressors were tested for predicting the RTs using three different sets of features: fingerprints only, descriptors only and fingerprints+descriptors. Parameter search was used for tuning all models with the exception of the CatBoost-based regressors (see "Gradient boosting" Section for the rationale). We have also created an ensemble with all the trained models to attempt to further improve RT prediction [33]. Some of the choices for the regressors are motivated by the need for diversity in their predictions, which increases the chances of the ensemble improving on their individual performances (see "Blending" Section).
Preprocessing of descriptors and fingerprints
Descriptors were first standardized and imputed using median imputation when alvaDesc was not able to generate some descriptor. If imputation was needed, a missing indicator was added, enabling the regressors to account for missingness despite the imputation. Features with 0 variance were removed. Highly correlated features were also eliminated (correlation \(> 0.9\)); this conservative threshold was not tuned since all tested regressors are robust against collinearity. The main benefit of removing correlated features is memory saving.
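A minimal, dependency-free sketch of the per-descriptor steps described above is given next. It is not the authors' implementation: the function name `preprocess_descriptor` is hypothetical, missing values are represented as `None`, the scaling statistics are assumed to be computed on the observed values only, and the correlation filter is omitted for brevity.

```python
import statistics


def preprocess_descriptor(column):
    """Standardize and median-impute one descriptor column.

    Returns None for a zero-variance column (the feature is dropped).
    Otherwise returns (imputed_column, indicator), where indicator is a
    0/1 missingness flag column, or None if nothing was missing.
    """
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    std = statistics.pstdev(observed)
    if std == 0:
        return None  # zero-variance feature: removed
    # Standardize the observed values, keeping the gaps in place.
    scaled = [None if v is None else (v - mean) / std for v in column]
    # Fill the gaps with the median of the standardized observed values.
    median = statistics.median(v for v in scaled if v is not None)
    imputed = [median if v is None else v for v in scaled]
    has_missing = len(observed) < len(column)
    indicator = [int(v is None) for v in column] if has_missing else None
    return imputed, indicator


imputed, indicator = preprocess_descriptor([1.0, None, 3.0, 5.0])
```

The indicator column lets a downstream regressor treat an imputed value differently from a measured one, which is the purpose of the missing indicator mentioned in the text.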
The only preprocessing applied to the fingerprint features was removing those with low variance. Treating each feature X as a binary Bernoulli random variable, the variance threshold was selected using \(\text {Var}[X]=p (1-p)\), where p is a parameter to be tuned (see "Bayesian hyperparameter search" Section) which is usually set to a high value (typically \(>0.9\)).
Taking inspiration from [24], an additional binary feature was added to each molecule representation indicating whether the molecule is retained or not. Since in a real-world application this information would not be available, this feature must also be predicted. To that end, we trained an eXtreme Gradient Boosting (XGBoost) classifier. As suggested by Fig. 3, a molecule was considered non-retained if its RT was smaller than 5 minutes. The XGBoost classifier was tuned using the same procedure described in "Bayesian hyperparameter search" Section for the regressors, although the metric to be maximized in this case was the F1 score. Preliminary results suggested that using fingerprints, descriptors, and fingerprints+descriptors yielded similar results, so we used only fingerprints as features for speed.
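The variance rule for binary fingerprints can be illustrated with a short sketch. The function name `low_variance_mask` is hypothetical, and keeping only features whose variance strictly exceeds the threshold is an assumption; the text does not state how ties are handled.

```python
def low_variance_mask(columns, p=0.9):
    """Return one boolean per binary column: True if the column is kept."""
    threshold = p * (1 - p)  # variance of a Bernoulli variable with parameter p
    keep = []
    for col in columns:
        freq = sum(col) / len(col)            # fraction of ones in the column
        keep.append(freq * (1 - freq) > threshold)
    return keep


# Three toy fingerprint bits over 20 molecules.
rare_bit = [1] + [0] * 19          # set in 5% of the molecules
balanced_bit = [1, 0] * 10         # set in 50% of the molecules
constant_bit = [1] * 20            # set in every molecule
mask = low_variance_mask([rare_bit, balanced_bit, constant_bit], p=0.9)
```

With p = 0.9 the threshold is 0.09, so a bit that takes the same value in more than roughly 90% of the molecules is discarded.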
Gradient boosting
Gradient Boosting Machines (GBMs) have already been considered in state-of-the-art methods for RT prediction [24]. In this work, several GBMs were tested, using slightly different approaches for the hyperparameter search. In addition to the interest of comparing several GBMs, the use of different combinations of GBMs and tuning options was partially motivated by the need for diversity in the predictions to build a good ensemble (see "Blending" Section). Specifically, we tested:

XGBoost [34]: it is probably the most commonly used GBM, and it was employed for RT prediction in [24]. Bayesian search was applied to different regressors fed with fingerprints, descriptors and fingerprints+descriptors. Among the tuned parameters, the most relevant ones include the number of boosting rounds, the maximum depth of the trees, subsampling parameters (either by column, by tree or by level), regularization parameters (such as \(L_1\) and \(L_2\) regularization) and parameters controlling the conservativeness of the algorithm (usually referred to as \(\gamma\) and minimum child weight).

Light Gradient Boosting Machine (lightGBM) [35]: it is a well-known alternative to XGBoost, with optimizations for speed and memory usage (hence its name). Furthermore, stepwise optimization methods particularly designed for lightGBM can be used. This avoids the need for Bayesian search, further reducing tuning times by exploiting heuristics. The hyperparameters were selected in the following order: \(L_1\) regularization, \(L_2\) regularization, maximum number of leaves, proportion of randomly selected features on each tree, bagging fraction, bagging frequency (which controls the number of iterations between bagging) and the minimum number of samples in the leaves.

CatBoost [36]: an interesting question regarding the non-retained molecules is whether including them as training data improves the performance of the regressors. To investigate this question, the performance of a regressor trained with different weights for the retained and non-retained molecules can be evaluated. Different weights for both types of molecules can also help with the imbalance between retained and non-retained molecules. Since the ratio of non-retained to retained molecules is approximately 1/40 in the SMRT dataset, the weight of the retained molecules was set to 1, whereas the weight of the non-retained molecules was varied between \(10^{-6}\) (effectively ignoring them) and 80 (so that the global influence of the non-retained molecules is approximately twice that of the retained ones). However, using the same approach as with the previous regressors would require a full Bayesian search for each weight of the non-retained molecules. Instead of tuning parameters for each weight, we looked for a regressor able to provide good performance with its default parameters. CatBoost was selected for this reason [36]. Note that the CatBoost regressors not only permit studying the influence of the non-retained molecules on the predictions, but also provide useful context that may enable the meta-regressor of the ensemble to distinguish between retained and non-retained molecules (see "Blending" Section).
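The arithmetic behind the weight range used for the CatBoost regressors can be checked with a few lines. The helper names `sample_weights` and `relative_influence` are hypothetical, and the counts below only encode the approximate 1/40 ratio quoted in the text, not the actual class sizes of the SMRT dataset.

```python
def sample_weights(is_retained, w_non_retained, w_retained=1.0):
    """Per-molecule training weights from a list of retained flags."""
    return [w_retained if r else w_non_retained for r in is_retained]


def relative_influence(n_retained, n_non_retained, w_non_retained, w_retained=1.0):
    """Total weight of the non-retained molecules relative to the retained ones."""
    return (n_non_retained * w_non_retained) / (n_retained * w_retained)


# One non-retained molecule for every 40 retained ones (approximate ratio).
upper = relative_influence(n_retained=40, n_non_retained=1, w_non_retained=80)
lower = relative_influence(n_retained=40, n_non_retained=1, w_non_retained=1e-6)
weights = sample_weights([True, True, False], w_non_retained=80)
```

A weight of 40 would balance both groups, 80 gives the non-retained molecules about twice the total weight of the retained ones, and \(10^{-6}\) makes their contribution negligible.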
Deep neural network
Together with GBMs, DNNs usually achieve the best results in machine learning competitions [37, 38]. DNNs were used for RT prediction in [21], where a DNN with 4 layers and regularization was proposed. Regularization is key for achieving good generalization, since even a small shallow neural network can overfit the SMRT dataset in a few epochs. Driven by this observation, we used a DNN with just 3 layers, regularized using large dropout rates. The sizes of the hidden layers, the dropout rates and the nonlinear activations were determined using Bayesian hyperparameter search.
To improve the generalization ability of the DNN and to accelerate its training, we used cosine annealing warm restarts [39]. The number of restarts and the length of the cosine annealing were also subject to hyperparameter search. After the training with warm restarts, we employed Stochastic Weight Averaging (SWA) using a constant learning rate schedule. With this setting, SWA just consists of training the DNN for a few extra epochs (whose optimal number is determined during hyperparameter search), and then averaging the weights of the DNN along the trajectory followed during optimization. In [40], the authors suggest that SWA leads to wider minima, which are hypothesized to result in better generalization than that of conventionally trained DNNs.
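The two training ingredients can be sketched without a deep learning framework. The learning rate function follows the usual cosine annealing formula with restarts and assumes cycles of equal length, which the text does not specify; the `RunningAverage` class shows the weight averaging step of SWA on plain lists of numbers. Both names are hypothetical and the snippet does not train any network.

```python
import math


def cosine_warm_restart_lr(epoch, cycle_len, lr_max, lr_min=0.0):
    """Learning rate at `epoch` for cosine annealing with equal-length cycles."""
    t_cur = epoch % cycle_len  # position inside the current cycle
    cos_term = 1 + math.cos(math.pi * t_cur / cycle_len)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


class RunningAverage:
    """Running mean of weight vectors collected along the SWA trajectory."""

    def __init__(self):
        self.n = 0
        self.mean = None

    def update(self, weights):
        self.n += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.n for m, w in zip(self.mean, weights)]


# Two cycles of 10 epochs: the rate decays within a cycle and jumps back up.
schedule = [cosine_warm_restart_lr(e, cycle_len=10, lr_max=0.1) for e in range(20)]

# SWA phase: average the weights visited during the extra epochs.
swa = RunningAverage()
for visited_weights in ([1.0, 2.0], [3.0, 4.0]):
    swa.update(visited_weights)
```

In a real training loop the averaged weights would replace the final weights of the network once the extra constant-rate epochs are finished.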
Finally, quantile transformation was applied to RTs before fitting. The method transforms RTs to follow a standardized normal distribution. This may facilitate learning since the last layer does not need to learn large weights to match the untransformed RTs.
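A rank-based version of this transformation can be sketched with the standard library alone. The function name `quantile_to_normal` is hypothetical, ties are not treated specially, and the inverse mapping needed to convert predictions back to seconds is not shown.

```python
from statistics import NormalDist


def quantile_to_normal(values):
    """Map values to standard-normal scores through their empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()  # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # The mid-rank quantile (rank + 0.5) / n stays strictly inside (0, 1).
        scores[i] = normal.inv_cdf((rank + 0.5) / n)
    return scores


rts = [30.0, 700.0, 45.0, 1200.0, 650.0]  # made-up RTs in seconds
scores = quantile_to_normal(rts)
```

The transformed targets preserve the ordering of the RTs but are confined to a few units around zero, regardless of the original scale in seconds.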
Kernel methods
Support Vector Machines (SVMs) have already been considered for a wide variety of applications related to metabolites, including elution order prediction [19]. Although we tested SVMs with both descriptors and fingerprints, the performance of both regressors was poor. As an alternative to this classic kernel method, we considered Deep Kernel Learning (DKL) [41]. DKL can be interpreted as a DNN whose last layer has been substituted by a Gaussian Process (GP). This permits leveraging both the ability of deep learning to extract relevant features from the raw inputs, and the nonparametric flexibility of GPs. The combination of the DNN and the GP kernel can also be viewed as a new flexible kernel which can be used as a drop-in replacement for standard kernels. DKLs were tested using fingerprints, descriptors and fingerprints+descriptors.
Following the observations from "Deep neural network" Section, we employed a highly regularized DNN. Besides dropout, we also considered batch normalization [42], not only because of its regularization capabilities, but also because it keeps the activations of the network within a predictable range. This eases the use of kernel interpolation (specifically, KISS-GP [43]) to approximate the GP kernel, which enables fast computations. Quantile transformation was also applied to the RTs.
DKL was trained using early stopping, and the learning rate was tuned during parameter search. Similar to the DNNs from "Deep neural network" Section, the specific architecture and regularization were subject to parameter search. Learning rate scheduling was used, reducing the learning rate when the validation loss was stuck on a plateau. The patience argument before decreasing the learning rate was also tuned. Finally, three kernels were considered during hyperparameter tuning: the squared exponential kernel, the linear kernel and a spectral mixture kernel with four components [44]. A full list of the mathematical expressions of the kernels used in this paper can be found in Section S3 of Additional file 1.
Blending
We tested whether the combination of the different regressors could improve on their individual predictions. We used blending [45] to build a meta-regressor which learns to combine the predictions of the so-called base-regressors. Blending is a popular alternative to stacked generalization (or stacking) [46] that is simpler and has a lower computational demand, and is therefore less prone to information leakage. With large datasets like SMRT, blending and stacking usually yield similar results. Hence, since the meta-regressor is also subject to parameter tuning, blending was used for faster training.
To train a meta-regressor with blending, a holdout set is created using a small subset of the training set. In our experiments, we used an 80-20% split. The base-regressors are trained on 80% of the data, and their predictions for the holdout dataset are stored. The meta-regressor then learns to combine the predictions of the base-regressors using the predictions on the holdout dataset. Note that an instance of the original training data is used only once for training, either in the base-regressors or in the meta-regressor, which avoids information leakage. This procedure for training the meta-regressor is also outlined in Fig. S1 of Additional file 1.
A random forest was used as meta-regressor, tuning its main parameters through Bayesian optimization. The parameters tuned were the number of trees, the maximum depth of each tree, the maximum number of features considered at each split, the minimum number of samples before considering a split and the minimum number of samples at a leaf.
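The data flow of blending can be sketched with toy models. Everything in the snippet is a stand-in chosen to keep it self-contained: the two base-regressors are a mean predictor and a straight line, and the meta-regressor is a single convex weight chosen on the holdout set, whereas the paper uses the regressors of the previous sections and a random forest. All function names are hypothetical.

```python
import random
import statistics


def blending_split(n_samples, holdout_fraction=0.2, seed=0):
    """Split indices into a base-regressor set and a disjoint holdout set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(n_samples * holdout_fraction))
    return indices[n_holdout:], indices[:n_holdout]


def fit_mean(xs, ys):
    """Toy base-regressor 1: always predicts the training mean."""
    mean = statistics.fmean(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Toy base-regressor 2: least-squares straight line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return lambda x: my + (sxy / sxx) * (x - mx)


def fit_meta(base_preds, ys, grid=21):
    """Toy meta-regressor: weight w of the first base prediction (1 - w for
    the second), chosen by minimizing the mean absolute error on the holdout."""
    def mae(w):
        return statistics.fmean(abs(w * a + (1 - w) * b - y)
                                for (a, b), y in zip(base_preds, ys))
    return min((i / (grid - 1) for i in range(grid)), key=mae)


xs = [float(i) for i in range(50)]
ys = [2.0 * x + 5.0 for x in xs]
base_idx, holdout_idx = blending_split(len(xs))
# Base-regressors see only the 80% split.
models = [fit([xs[i] for i in base_idx], [ys[i] for i in base_idx])
          for fit in (fit_mean, fit_line)]
# The meta-regressor sees only base predictions on the 20% holdout.
holdout_preds = [(models[0](xs[i]), models[1](xs[i])) for i in holdout_idx]
w = fit_meta(holdout_preds, [ys[i] for i in holdout_idx])
```

The point of the sketch is the separation of the two index sets: no molecule contributes to both the base-regressors and the meta-regressor.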
Bayesian hyperparameter search
Most regressors, with the exception of the lightGBM models (tuned using iterative stepwise search for speed) and the CatBoost regressors (not tuned because of their good default values), were tuned using Bayesian hyperparameter search. The p parameter controlling the thresholding of binary features was also optimized (see "Preprocessing of descriptors and fingerprints" Section). The parameters were tested following the suggestions of a TPE algorithm [27]. The TPE algorithm works by suggesting the parameters that maximize the expected improvement in the score being maximized, which in this paper was the negative of the MEDian Absolute Error (MEDAE). This permits balancing exploration versus exploitation, obtaining a set of hyperparameters with good performance in fewer iterations than other approaches like grid search. In our experiments, 50 iterations of the Bayesian search were performed for each model.
Regarding the optimization of the blended regressor, it should be noted that it proceeds greedily. That is, the base-regressors are tuned individually, and the predictions obtained with the best performing parameters are then used to create the training set for the meta-regressor. Finally, the parameters of the latter are optimized. It may be argued that this approach is suboptimal, since the base-regressors cannot be tuned to complement each other. However, jointly optimizing all base-regressors and the meta-regressor is difficult due to the dimensionality of the search space. Furthermore, joint optimization would not permit drawing conclusions from the performance of the individual base-regressors, which is one of the objectives of this work.
Validation procedure
To avoid data leakage when reporting the performance of the different models, nested stratified cross-validation was used. Nested cross-validation guarantees that different data is used to tune the model parameters and to evaluate its performance by means of outer and inner cross-validation loops [47]. In the outer loop, train/test splits are generated, which are then used for averaging the test scores over several data splits. In the inner loop, the train set is further split into train/validation subsets. The best parameters are selected by minimizing the MEDAE on the validation splits. We used 5-fold and 7-fold stratified cross-validation in the outer and inner loops, respectively. To ensure that the distribution of RTs is representative of the population in all folds, stratification was performed by separating the target variable (RTs) into 6 different bins. The validation procedure is also summarized in Fig. S1 of Additional file 1.
The Bayesian hyperparameter search ("Bayesian hyperparameter search" Section) and the validation procedure described in this section required approximately 2.5 months of computational time on a computer with an AMD Ryzen Threadripper 2970WX with 24 cores at 1.85 GHz and an NVIDIA GeForce RTX 2080 GPU.
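The index bookkeeping of the nested stratified scheme can be sketched as follows. The binning rule is an assumption: the text states that the RTs were separated into 6 bins but not how, so equal-frequency bins are used here. The function names are hypothetical, and model fitting and scoring are left out.

```python
import random


def rt_bins(rts, n_bins=6):
    """Assign each RT to one of `n_bins` equal-frequency bins (by rank)."""
    order = sorted(range(len(rts)), key=lambda i: rts[i])
    bins = [0] * len(rts)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(rts)
    return bins


def stratified_folds(indices, bins, n_folds, seed=0):
    """Deal the indices of each bin round-robin into `n_folds` folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for b in sorted(set(bins[i] for i in indices)):
        members = [i for i in indices if bins[i] == b]
        rng.shuffle(members)
        for k, i in enumerate(members):
            folds[k % n_folds].append(i)
    return folds


def nested_cv(rts, n_outer=5, n_inner=7):
    """Yield (outer_test, inner_folds); inner folds use outer-train data only."""
    bins = rt_bins(rts)
    outer = stratified_folds(list(range(len(rts))), bins, n_outer)
    for k, test in enumerate(outer):
        train = [i for j, fold in enumerate(outer) if j != k for i in fold]
        yield test, stratified_folds(train, bins, n_inner, seed=k + 1)


rts = [float(i) for i in range(210)]  # made-up RTs
splits = list(nested_cv(rts))
```

Hyperparameters would be selected using only the inner folds, and the outer test indices would be touched once, to report the final score of each outer split.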
Projection between chromatographic methods
Machine learning models trained on a given RT dataset (SMRT in this work) cannot be directly used to predict experimental RTs from other CMs due to the variability of the experimental setups. To exploit the knowledge of a predictive model trained on the SMRT dataset, a second model projecting the predicted RTs to the specific CM used in an experiment is needed.
Given a specific CM, the projection function can be learned if some of the experimental metabolites have been identified, and therefore both their experimental and predicted RTs are known (step 3 in Fig. 2). For the workflow in Fig. 2 to be practical, it is desirable that the projection function can be learned from a small dataset (tens of molecules) so that the researcher has to identify just a few molecules. In practice, this would probably be accomplished by adding pure metabolite standards to the sample. The more standards that need to be used, the more time and money will be required. Hence the interest in minimizing their number.
Bayesian methods are particularly well suited to solving classification and regression problems when data is scarce. This is due to their ability to incorporate prior knowledge about the problem. If the prior provides useful inductive biases for the task at hand, only a few samples may be needed to learn a proper solution to the problem [48]. Hence, under the Bayesian paradigm, the issue of learning from few data becomes how to specify a suitable prior for the problem.
Meta-learning has recently arisen as a possible solution for acquiring useful prior knowledge. In meta-learning, knowledge is gained by solving a set of tasks (meta-tasks), and is then exploited to solve a closely related but different task (target-task). In the Bayesian setting, meta-tasks are used to learn a useful prior distribution, which is then used as the starting point to solve the target-task. This is done by incorporating the new evidence provided by the target-task into the prior, which results in the so-called posterior distribution.
Hence, we propose the use of meta-learning to solve the problem of learning from few samples. The outline of the approach is shown in Fig. 4. We shall consider that the set of meta-tasks \(\mathcal {M}\) consists of m datasets \(\mathcal {M}=\{\mathcal {D}_i\}_{i=1}^m\), each corresponding to a different CM. Each dataset \(\mathcal {D}_i\) contains predicted RTs \(\varvec{x}^i\), as well as the experimental RTs obtained with a specific CM, \(\varvec{y}^i\). Hence, \(\mathcal {D}_i=\{\varvec{x}^i, \varvec{y}^i\}\) is a single meta-task and we would like to map \(\varvec{x}^i\) to \(\varvec{y}^i\) using a smooth function \(f(\cdot )\). The predicted RTs are obtained by using the best predictive model from "Prediction of retention times with machine learning" Section. In our problem, meta-tasks are used to learn a prior distribution p(f) over the functions \(f(\cdot )\) translating predicted RTs to experimental RTs of different CMs. Let us consider that we have gathered experimental RTs using the CMs A, B and C. During meta-learning, the projection functions \(f_A(\cdot )\), \(f_B(\cdot )\) and \(f_C(\cdot )\), which map the predicted RTs to the experimental RTs of CMs A, B and C, respectively, will be constructed. These functions should be considered as samples drawn from the same distribution p(f). The aim of meta-learning is to learn a plausible prior p(f) that explains all observed samples \(f_A(\cdot ), f_B(\cdot )\) and \(f_C(\cdot )\).
In addition to the meta-tasks, we have the target-task \(\widetilde{\mathcal {D}}=\{\varvec{\tilde{x}}, \varvec{\tilde{y}}, \varvec{\tilde{x}^*}, \varvec{\tilde{y}^*}\}\). Again, a single target-task comprises data from a single CM. Intuitively, the target training points \(\{\varvec{\tilde{x}}, \varvec{\tilde{y}}\}\) represent molecules whose identity is known (step 3 in Fig. 2), whereas the target test points \(\{\varvec{\tilde{x}^*},\varvec{\tilde{y}^*}\}\) are molecules whose identity is to be discovered (step 6 in Fig. 2). We assume that the number of annotated molecules is small (indeed, this is the main difference between a meta-task and a target-task). Note that, when solving the target-task, the prior distribution p(f) learned during meta-learning is updated with the new evidence \(\{\varvec{\tilde{x}}, \varvec{\tilde{y}}\}\), which should enable the prediction/ranking of \(\{\varvec{\tilde{x}^*},\varvec{\tilde{y}^*}\}\).
GPs are particularly suited as projection models: they represent a distribution over functions, they can perform regression on small amounts of data, and they can incorporate prior knowledge using the Bayesian framework. Hence, we shall consider:
\(f(\cdot ) \sim \mathcal {GP}\left( m_{\varvec{\theta }_m}(\cdot ), k_{\varvec{\theta }_k}(\cdot , \cdot )\right)\)
where the mean and kernel functions of the GP are parametrized with \(\varvec{\theta }_m\) and \(\varvec{\theta }_k\), respectively. Hence, the whole prior is parametrized with \(\varvec{\theta }=[\varvec{\theta }_m, \varvec{\theta }_k]\). These parameters are learned by minimizing the negative Leave-One-Out (LOO) log predictive probability on the meta-tasks (see Algorithm 1, where \(\varvec{y}^i_{-j}\) means all targets but the jth item). The use of the LOO-based loss instead of the usual log marginal loss is based on the observation that cross-validation procedures (such as LOO) should be more robust against possible model misspecifications [49, Section 4.8]. Once the parameters have been learned, they can be used to specify a prior that is expected to generalize well on the target-task. Indeed, to avoid overfitting, \(\varvec{\theta }\) is not optimized while solving the target-tasks. The only parameter estimated with target-task data is the variance of the residuals. This is done by type II maximum likelihood estimation, run for 250 epochs with an Adam optimizer with a learning rate of 0.01.
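The LOO-based loss described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation in the accompanying repository: it assumes a constant mean, a squared exponential kernel and a fixed noise variance, and all function and parameter names (`sq_exp_kernel`, `loo_log_predictive`, `meta_loss`, `lengthscale`, `noise`) are hypothetical. It relies on the standard closed-form LOO expressions for GP regression, which avoid refitting the model for every held-out point.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def loo_log_predictive(x, y, mean=0.0, noise=0.1, **kernel_args):
    """Sum over j of log p(y_j | x, y_{-j}) for a GP with constant mean.

    `noise` is the noise variance (assumed fixed here). The closed form uses
    the inverse of the noisy kernel matrix instead of n separate fits.
    """
    K = sq_exp_kernel(x, x, **kernel_args) + noise * np.eye(len(x))
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ (y - mean)
    diag = np.diag(K_inv)
    loo_mean = y - alpha / diag  # predictive mean of each held-out target
    loo_var = 1.0 / diag         # predictive variance of each held-out target
    log_p = -0.5 * np.log(2 * np.pi * loo_var) - 0.5 * (y - loo_mean) ** 2 / loo_var
    return log_p.sum()

def meta_loss(meta_tasks, **params):
    """Negative LOO log predictive probability summed over all meta-tasks."""
    return -sum(loo_log_predictive(x, y, **params) for x, y in meta_tasks)
```

In a meta-learning loop, `meta_loss` would be minimized with respect to the shared prior parameters (here `mean`, `lengthscale` and `variance`) using a gradient-based optimizer; an automatic differentiation framework would replace NumPy for that purpose.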
It is worth noting that the proposed meta-learning method fits well the RT-based filtering workflow shown in Fig. 2. In this scenario, a query from a researcher corresponds to a target-task, which exploits the information provided by a previously meta-learned prior. Furthermore, although computing new predictions with the projection function scales quadratically with the training data, it will typically train on tens of RTs. Hence, it would only take tenths of a second to map experimental RTs to predicted RTs. The next sections present the data preprocessing and the parameter (\(m_{\varvec{\theta }_m}\), \(k_{\varvec{\theta }_k}\)) selection process used to devise our projection method.
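To make the target-task step concrete, the following sketch computes the GP posterior for a handful of annotated molecules while keeping the prior fixed. It is an illustration under stated assumptions rather than the authors' code: the polynomial kernel form \((x x' + c)^4\), the `offset` value, and the names `poly_kernel` and `gp_posterior` are hypothetical, and the noise variance is passed in as a number instead of being estimated by type II maximum likelihood.

```python
import numpy as np

def poly_kernel(a, b, degree=4, offset=1.0):
    """Polynomial kernel (a * b + offset) ** degree for 1-D inputs (assumed form)."""
    return (a[:, None] * b[None, :] + offset) ** degree

def gp_posterior(x_train, y_train, x_test, mean_const=0.0, noise_var=0.01):
    """Posterior predictive mean and standard deviation with a fixed GP prior.

    Only `noise_var` would be tuned on the target-task; the mean constant and
    the kernel parameters are assumed to come from meta-learning.
    """
    K = poly_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = poly_kernel(x_test, x_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - mean_const))
    mu = mean_const + K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(poly_kernel(x_test, x_test)) - np.sum(v**2, axis=0) + noise_var
    return mu, np.sqrt(var)
```

With only tens of training points, the Cholesky factorization is applied to a very small matrix, which is consistent with the sub-second run times mentioned above.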
Experimental setup and data preprocessing
In [21] the non-retained molecules were ignored for validating the projection method, and we adopt the same methodology here. The rationale for this is that in an experiment it is easy to know if a molecule has been retained or not, and a researcher would not use an RT database to try to annotate non-retained metabolites.
To avoid data leakage during validation, we ensured that the meta-tasks data, target training data, and target test data did not overlap. To that end, we used a leave-one-CM-out approach. That is, data from a specific CM could only be used as either part of the meta-tasks or as the target-task. Hence, when using a CM as target-task, meta-learning was used on the remainder of CMs. Following [21], the following CMs were used to create the target-tasks: FEM long (342 molecules), FEM orbitrap plasma (133), LIFE old (148) and RIKEN (271). The number of molecules in the remainder of CMs is 2418.
For a specific target CM (one of the four mentioned above), and after meta-learning on the meta-tasks, the target training data is created by subsampling the CM data (and the remainder of RTs are used as target test data). To obtain a good projection, researchers are expected to add standards spanning the whole range of the experimental RTs. To mimic this behaviour, stratified sampling was used. Sampling was repeated 10 times for each number of training points to obtain error estimates. To study the robustness of the projection method when only a few metabolites are known, the number of training points was varied between 10 and 50; 50 was the number of molecules used in [21].
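The stratified subsampling can be illustrated with a short sketch. The exact stratification scheme is not detailed in the text, so the choice below (one molecule drawn at random from each of `n_train` equal-frequency RT strata) is an assumption, and `stratified_subsample` is a hypothetical helper name.

```python
import numpy as np

def stratified_subsample(rts, n_train, rng):
    """Split indices into target training and target test sets.

    The RTs are sorted and cut into n_train equal-frequency strata; one
    training molecule is drawn from each stratum so that the training set
    spans the whole RT range. Everything else becomes test data.
    """
    order = np.argsort(rts)
    strata = np.array_split(order, n_train)
    train_idx = np.array([rng.choice(s) for s in strata])
    test_idx = np.setdiff1d(np.arange(len(rts)), train_idx)
    return train_idx, test_idx
```

Repeating the call with different random generators (for instance 10 times per number of training points) yields the repeated splits from which error estimates can be computed.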
All GP models share the same preprocessing steps regardless of their mean and kernel functions. RTs are first transformed to log space, yielding \(\bar{\varvec{x}}\) and \(\bar{\varvec{y}}\), where \(\varvec{x}\) and \(\varvec{y}\) may belong to any meta-task \(\mathcal {D}_i\) or any target-task \(\widetilde{\mathcal {D}}\). The motivation for using this transformation is twofold. On one hand, RTs take only positive values. Without any transformation, the model has to learn this restriction on its own, which may be difficult in the scarce data scenario. By using the transformed RTs \(\bar{\varvec{y}}\), the model learns to predict a target without any restriction. Then, the inverse transformation maps \(\bar{\varvec{y}}\) back to the positive interval, forcing positiveness in the projected RTs. On the other hand, by also applying the transformation to \(\varvec{x}\), the nonlinear relationship between \(\varvec{x}\) and \(\varvec{y}\) linearizes, which could enable the use of simpler kernels.
After the log-transformation, and since the software used to implement GPs is geared towards using inputs normalized to [0, 1] and outcomes normalized to \([-1, 1]\), data is further scaled using robust statistics based on the interquartile range (\(\text {IQR}\)). The constant 0.741 is used in the scaling because, for normal populations, the standard deviation fulfills \(\sigma \approx 0.741 \cdot \text {IQR}\). Hence, under the normality assumption, 99.7% of the transformed \(\bar{\bar{x}}\) will be on the [0, 1] range and 99.7% of the transformed \(\bar{\bar{y}}\) will be on the \([-1, 1]\) range. Although the last preprocessing step differs for predicted (\(\varvec{x}\)) and experimental (\(\varvec{y}\)) RTs, both transformations are learned using the predicted RTs. Thus, there is no CM-dependent scaling.
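One preprocessing pipeline consistent with this description is sketched below. Several details are assumptions rather than facts taken from the text: the natural logarithm as the log transform, the median as the robust location estimate, and the specific factors that map three standard deviations to the edges of the target ranges. The class name `RobustLogScaler` is hypothetical.

```python
import numpy as np

class RobustLogScaler:
    """Log transform followed by robust scaling learned on predicted RTs only."""

    def fit(self, x_pred):
        log_x = np.log(x_pred)
        self.center = np.median(log_x)            # assumed location estimate
        q75, q25 = np.percentile(log_x, [75, 25])
        # 0.741 * IQR approximates sigma; 3 sigma covers ~99.7% of a normal
        self.scale = 3 * 0.741 * (q75 - q25)
        return self

    def transform_x(self, x):
        """Predicted RTs: approximately within [0, 1]."""
        return (np.log(x) - self.center) / (2 * self.scale) + 0.5

    def transform_y(self, y):
        """Experimental RTs: approximately within [-1, 1]."""
        return (np.log(y) - self.center) / self.scale

    def inverse_transform_y(self, y_scaled):
        """Back to seconds; the exponential guarantees positive RTs."""
        return np.exp(y_scaled * self.scale + self.center)
```

Because the location and scale are learned from the predicted RTs only, the same fitted scaler can be reused for every CM.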
Comparison of meta-learned GP models
We evaluated the performance of the different meta-learned GP models arising from various choices of their two parameters:

Mean function \(m_{\varvec{\theta }_m}(\cdot )\): a typical parametrization of a GP when no prior information is available about the mean is \(f(\cdot ) \sim \mathcal {GP}\left( 0, k_{\varvec{\theta }_k}(\cdot , \cdot )\right)\); that is, \(m_{\varvec{\theta }_m}(\cdot )=0\). The underlying assumption is that all relevant prior information can be incorporated into the kernel parameters \(\varvec{\theta }_k\). However, [50] shows that learning a mean function \(m_{\varvec{\theta }_m}(\cdot )\) (either alone or combined with kernel learning) can outperform kernel learning alone. We tested this in our problem by studying GPs with a constant mean function, and GPs with a mean function parametrized with a neural network. In our experiments, we used a neural network with two hidden layers with 128 units and leaky ReLU activations.

Kernel function \(k_{\varvec{\theta }_k}(\cdot , \cdot )\): different kernels result in different properties of the projection function. We compared the commonly used kernels and combinations of them. Specifically, we tested the squared exponential kernel, Matérn kernels with \(\nu =1.5\) and \(\nu =2.5\), the polynomial kernel of degree 4, a linear combination of two squared exponential kernels, a linear combination of a linear kernel and a squared exponential kernel, and a linear combination of a squared exponential kernel and a polynomial kernel of degree 4. A full list of the mathematical expressions for these kernels can be found in Section S1 of Additional file 1.
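For orientation, the kernel families listed above can be written in a few lines of NumPy. These are the usual textbook forms for one-dimensional inputs, given here as an assumption; the exact parametrizations used in the experiments are those of Section S1 of Additional file 1, and the function names are hypothetical.

```python
import numpy as np

def _dist(a, b):
    """Pairwise absolute distances between 1-D arrays a and b."""
    return np.abs(a[:, None] - b[None, :])

def squared_exponential(a, b, ell=1.0):
    return np.exp(-0.5 * (_dist(a, b) / ell) ** 2)

def matern(a, b, nu=1.5, ell=1.0):
    r = _dist(a, b) / ell
    if nu == 1.5:
        s = np.sqrt(3) * r
        return (1 + s) * np.exp(-s)
    if nu == 2.5:
        s = np.sqrt(5) * r
        return (1 + s + s**2 / 3) * np.exp(-s)
    raise ValueError("only nu=1.5 and nu=2.5 are sketched here")

def polynomial(a, b, degree=4, offset=1.0):
    return (a[:, None] * b[None, :] + offset) ** degree

def linear(a, b):
    return a[:, None] * b[None, :]

def weighted_sum(k1, k2, w1=1.0, w2=1.0):
    """Linear combination of two kernels with non-negative weights."""
    return lambda a, b: w1 * k1(a, b) + w2 * k2(a, b)
```

For example, `weighted_sum(linear, squared_exponential)` gives the combination of a linear kernel and a squared exponential kernel, with weights that would be learned together with the other kernel parameters.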
The experimental setup described in "Experimental setup and data preprocessing" Section is used. We focused on the performance of the models in the low-data regime using just 10 training data points. For a single target-task \(\widetilde{\mathcal {D}}=\{\varvec{\tilde{x}}, \varvec{\tilde{y}}, \varvec{\tilde{x}^*}, \varvec{\tilde{y}^*}\}\), the predictive marginal log-likelihood
\(\mathcal {L}_{\mathcal {D}} = \log p\left( \varvec{\tilde{y}^*} \mid \varvec{\tilde{x}^*}, \varvec{\tilde{x}}, \varvec{\tilde{y}}\right)\)
was used as the metric of the model's performance.
To obtain a single metric while taking into account the possible differences in the scales of the marginal log-likelihoods for the different CMs, each \(\mathcal {L}_{\mathcal {D}}\) was compared with the marginal log-likelihood of a reference model \(\mathcal {L}_{\mathcal {D}}^{\text {ref}}\): \(\Delta \mathcal {L}_{\mathcal {D}} = \mathcal {L}_{\mathcal {D}} - \mathcal {L}_{\mathcal {D}}^{\text {ref}}\). We used as reference model a GP with constant mean and squared exponential kernel trained on the target-task without meta-learning. This model was trained by type II maximum likelihood estimation, run for 500 epochs using an Adam optimizer with a learning rate of 0.01. The final metric \(\Delta \mathcal {L}_{\text {avg}}\) was obtained by averaging across the four test-tasks and repetitions. Values \(\Delta \mathcal {L}_{\text {avg}} > 0\) correspond to models that perform better (on average) than the reference one. Note that this not only permits the comparison of different meta-learned GP models, but it also assesses the usefulness of the meta-learning approach.
Additional experiments studying the influence of the number of meta-tasks on the performance of the GP were also carried out. They are discussed in Section S3 of Additional file 1.
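The comparison against the reference model can be sketched as follows. The sketch assumes that the predictive distribution is summarized by a Gaussian mean and standard deviation per test point and that the log-likelihood is the sum of the per-point log densities; both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def predictive_log_likelihood(y_test, mu, sigma):
    """Sum of Gaussian log densities of the test targets (per-point marginals)."""
    y_test, mu, sigma = (np.asarray(a, dtype=float) for a in (y_test, mu, sigma))
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * ((y_test - mu) / sigma) ** 2)

def delta_l_avg(model_ll, reference_ll):
    """Average log-likelihood gain over the reference across tasks and repetitions."""
    return float(np.mean(np.asarray(model_ll) - np.asarray(reference_ll)))
```

Here `model_ll` and `reference_ll` would hold one log-likelihood per combination of target CM and repetition, computed on the same target test points for both models.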
Predictive performance of the projection function
We compared the best GP model from "Comparison of meta-learned GP models" Section with monotonic Generalized Additive Models (GAMs) [8], robust polynomial regression [21] and piecewise polynomial regression [24]. Unfortunately, it is not possible to compute the predictive marginal likelihood for all these models. Hence, we evaluated the performance of the models according to both their predictive accuracy and their ability to generate proper prediction intervals. To test the predictive accuracy of the meta-learning approach we computed the median relative error, Mean Absolute Error (MAE) and MEDAE for the target test set. To test the prediction intervals we used the interval score [51], \(S(\varvec{\ell },\varvec{u},\varvec{\tilde{y}^*})\), given by Equation (2), where \(\varvec{\ell }\) and \(\varvec{u}\) are the lower and upper ends of the prediction interval generated for the test target points \(\varvec{\tilde{y}^*}\), \(\mathbbm {1}\) is the indicator function, and \(\alpha\) is the coverage that the models are aiming for. We used \(\alpha =0.95\) in all experiments. Equation (2) can be understood by noting that a proper prediction interval should reach a trade-off between being as small as possible (\(l_i\) should be close to \(u_i\)) and covering the observed values (\(l_i \le \tilde{y}^*_i \le u_i\)). The first term of Equation (2) just measures the length of the interval, while the second and third terms penalize having observed values outside the prediction interval (moreover, the further apart an observation is from the interval, the larger the penalty).
Note that the interval score has the same units as the RTs in \(\varvec{\tilde{y}^*}\). To obtain a dimensionless metric and facilitate the comparison of different target CMs, we define the scaled interval score as \(S(\varvec{\ell },\varvec{u},\varvec{\tilde{y}^*}) / \text {median}\left( [\varvec{y^*}, \varvec{\tilde{y}^*}]\right) .\)
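The interval score and its scaled version can be sketched as follows. The sketch follows the usual interval score form with three terms (interval length plus penalties for observations below and above the interval). Two details are assumptions: the penalty factor \(2/(1-\alpha)\) for a target coverage \(\alpha\), and the averaging over test points. The function names are hypothetical.

```python
import numpy as np

def interval_score(lower, upper, y, coverage=0.95):
    """Mean interval score for prediction intervals [lower, upper].

    Lower is better: narrow intervals are rewarded and observations that
    fall outside are penalized proportionally to their distance.
    """
    lower, upper, y = (np.asarray(a, dtype=float) for a in (lower, upper, y))
    penalty = 2.0 / (1.0 - coverage)            # assumed penalty factor
    width = upper - lower
    below = penalty * (lower - y) * (y < lower)
    above = penalty * (y - upper) * (y > upper)
    return float(np.mean(width + below + above))

def scaled_interval_score(lower, upper, y, reference_rts, coverage=0.95):
    """Interval score divided by the median RT, giving a dimensionless value."""
    return interval_score(lower, upper, y, coverage) / float(np.median(reference_rts))
```

For instance, with intervals of length 2 s for two test points, one of which lies 1 s above its interval, the penalty term dominates the score, which illustrates why overconfident intervals are scored poorly.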
Ranking annotations based on the projection function
We have tested the ability of the projection method to rank and filter candidate annotations in metabolomic experiments based on mass search and RT predictions. The test implements a similar workflow to that described in Fig. 2. We collected the RT predictions of the best-performing model from "Prediction of retention times with machine learning" Section for the 6,823 molecules with KEGG number in the Human Metabolome DataBase (HMDB) [52]. This simulates the database used to rank candidate annotations in Fig. 2. We used the leave-one-CM-out approach described in "Experimental setup and data preprocessing" Section for meta-training and target-task evaluation. After learning a projection function on the target training set, we simulated queries against the HMDB database to annotate the molecules in the target test set. For each molecule in the target test set, an accurate mass search (10 ppm mass error, the same as [21]) was performed to retrieve all compatible molecules from HMDB. To mimic real experimental conditions, we simulated experimental errors in the mass measurement by adding random noise to the mass of the unknown molecule. The random noise had a normal distribution with zero mean and a standard deviation of 10/3 ppm so that \(99.7\%\) of the errors were between \([-10 \text { ppm}, 10 \text { ppm}]\). Random noise below \(-10\) ppm or above 10 ppm was truncated to guarantee that the mass search always returned the correct molecule as a candidate (note that the mass search is based on the noisy mass and not the real one). Then, the molecules were ranked using RT information according to a z-score computed as
\(z = \frac{\tilde{y}^* - \mu (\tilde{x}^*)}{\sigma (\tilde{x}^*)}\)
where \(\mu (\tilde{x}^*)\) and \(\sigma (\tilde{x}^*)\) represent the GP's mean and standard deviation for the predicted-experimental RT pair \((\tilde{x}^*, \tilde{y}^*)\). The intuition for the usage of the z-score as ranking metric is to take into account not only the agreement between the real experimental RT and the projected value, but also the uncertainty in the projection. We focused on mass queries returning more than three candidates and computed the percentage of results where the true molecule was ranked among the top three candidates after z-scoring. To facilitate the interpretation of the results, a baseline performance for metabolite annotation when using only mass information was also computed. In this case, if several candidates with the same mass were returned, ties were randomly broken.
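The simulated query can be sketched in three small steps: perturbing the true mass, searching the database within a ppm tolerance, and ranking the candidates by their z-scores. This is a hedged illustration and not the code used in the experiments: clipping is used as one possible reading of "truncated", the ppm error is measured relative to each candidate's mass, candidates are ranked by the absolute value of the z-score, and all function names are hypothetical.

```python
import numpy as np

def noisy_mass(mass, rng, tol_ppm=10.0):
    """Add Gaussian error (sd = tol_ppm / 3), clipped to +/- tol_ppm."""
    err_ppm = np.clip(rng.normal(0.0, tol_ppm / 3), -tol_ppm, tol_ppm)
    return mass * (1 + err_ppm * 1e-6)

def mass_search(query_mass, db_masses, tol_ppm=10.0):
    """Indices of database molecules whose mass is within tol_ppm of the query."""
    ppm = np.abs(db_masses - query_mass) / db_masses * 1e6
    return np.flatnonzero(ppm <= tol_ppm)

def rank_by_zscore(y_obs, mu, sigma):
    """Order candidates from most to least compatible with the observed RT.

    mu and sigma are the projected RT mean and standard deviation of each
    candidate; y_obs is the experimental RT of the unknown molecule.
    """
    z = np.abs(y_obs - mu) / sigma
    return np.argsort(z)
```

A query then consists of calling `mass_search` with the noisy mass, projecting the predicted RT of every returned candidate with the GP, and checking whether the true molecule is among the first three positions returned by `rank_by_zscore`.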
Results
Retention time prediction with machine learning
MAE results for all tested regressors are summarized in Fig. 5. The MEDAE results are qualitatively similar to the MAE ones, and can be found in Fig. S3 of Additional file 1. Both MAE and MEDAE are also reported in Tables 1, 2 and 3. Figure 5 shows that the DNN models outperform the other models, with the exception of the blender, which has similar results. Specifically, the DNN trained with fingerprints achieves a MEDAE of \(17.2 \pm 0.9\;s\) and a MAE of \(39.2 \pm 1.2\; s\) when considering all molecules, and a MEDAE of \(17.2 \pm 0.9\;s\) and a MAE of \(34.0 \pm 0.9\; s\) when considering retained molecules only. To the best of our knowledge, the previous top-performing models achieved a MAE of \(45.6 \pm 0.4\;s\) for all molecules [24], and \(39.87\;s\) when only using retained molecules [26].
Regarding the other models, they can be sorted from lower to higher errors as follows: DKL, XGBoost, and the lightGBM and CatBoost algorithms, which have similar MAE. It is worth noting that DKL obtains similar results to those reported in [24] (\(45.6 \pm 2.4\) and \(40.8\pm 2.4\;s\) using fingerprints for all molecules and retained molecules, respectively) and [26] (\(39.87\;s\) for retained molecules only).
The differences in the regressors' performance originate from the prediction of the RTs for the retained molecules since the MAE for non-retained molecules is quite similar for all models.
Computing the projection between chromatographic methods
Comparison of meta-learned GP models
Figure 6 shows the averaged differences in predictive marginal log-likelihood \(\Delta \mathcal {L}_{\text {avg}}\) for different combinations of mean and kernel functions. Since the median of all models is \(>0\), meta-learning provides some advantage with respect to directly fitting a GP to the target-task data. Using a flexible mean function parametrized by a DNN does not seem to offer any advantage compared to the simpler constant mean. Regarding the influence of kernels, although there is no clear winner, the combination of a squared exponential kernel and a linear kernel (which has the largest median \(\Delta \mathcal {L}_{\text {avg}}\)), and the polynomial kernel of degree 4 (which has the lowest variability) stand out. Since having a low variability is particularly important in the context of training from few points, we shall use a GP parametrized with a constant mean and a polynomial kernel of degree 4.
Performance of the projection function
Figures 7 and 8 show the MAE and scaled interval scores for the projections to four CMs from the PredRet database when using different models. MEDAE results show a similar behaviour to those obtained with MAE, and hence are shown in Fig. S4 of Additional file 1. Table 4 shows these three metrics and the median relative error (in %) for the meta-learned GP. Regarding the accuracy of the model (MAE and MEDAE), all methods perform similarly. However, GPs consistently rank among the two best results for most combinations of CM and number of training points. Figure 7 is particularly revealing since all methods but GPs show some large fluctuation (note the large error bars) for 10 or 20 training points, which suggests that they are more sensitive to the presence of outliers.
Regarding the scaled interval scores, piecewise linear regression and meta-learned GPs show a better overall performance than the other methods, especially when compared to GAMs. Meta-learned GPs have the best performance in three of four CMs, while piecewise linear regression performs better in one of four.
An illustrative example of the projection function built using just 10 training points is shown in Fig. 9.
Ranking candidate annotations with the projection function
Table 5 shows the percentage of the results where, using the RT projection function, the true molecule was ranked among the top three candidates for those queries with more than three candidates. A comparison with the baseline values when only mass information is used (shown between parentheses in Table 5) reveals that the use of RT information always improves ranking accuracy. Reported results in [21] for 50 training points were 66.7%, 67.9%, 69.7% and 71.9% for FEM long, FEM orbitrap plasma, LIFE old and RIKEN, respectively. Considering the standard error, when using 50 annotated molecules our DNN+meta-learning approach outperforms [21] in the FEM long system, while it has lower performance in the LIFE old system. The DNN+meta-learning approach reaches a global mean of \(70\%\) for 50 annotated molecules. The number of training points affects both the ranking accuracy and its variability (standard errors). When using as few as 10 training points, the global performance decreases to \(68\%\).
Discussion and conclusions
In this paper we have trained several state-of-the-art machine learning regressors to predict the RT of small molecules using the 80,038 experimental RTs from the SMRT dataset. The regressors included DNNs, DKL, XGBoost, lightGBM, CatBoost, and a blending approach. The models were trained using only molecular descriptors, only fingerprints, and both types of features simultaneously. Descriptors and fingerprints were generated with the alvaDesc software. Furthermore, we have proposed a meta-learning approach to learn projection functions between different CMs from a few training points.
Retention time prediction
Deep learning models, regardless of the input features used for training, clearly outperform the other models. When using fingerprints, the DNN achieves a MAE of \(39.2 \pm 1.2\; s\) when considering all molecules, and a MAE of \(34.0 \pm 0.9\; s\) when considering retained molecules only; the previous top-performing models achieved a MAE of \(45.6 \pm 0.4\;s\) [24] on all molecules, and \(39.87\;s\) when only using retained molecules [26]. This suggests that DNNs are better suited for RT prediction than other models. Note that the DKL models, which should also exploit the benefits of DNNs, achieve similar results to previously top-performing models, although they do not reach the performance of the DNN. This may imply that the recent techniques intended to improve the generalization capabilities of DNNs (e.g. warm restarts and SWA) were key for their performance.
Although meta-models are expected to improve the base-regressors' performance, the blender built using all regressors has similar performance to that obtained by DNNs (see Fig. 5). To achieve an improvement, the base-estimators of the blender should have similar performance and be as diverse as possible, providing complementary information to be exploited by the meta-regressor. In our blender, the predictions of the meta-regressor are mostly influenced by the DNNs, since they have the best performance. The fact that the blender cannot improve the predictions of the DNNs implies that their predictions are almost the same. Indeed, the predictions of the three different DNNs are highly correlated (e.g., the correlation between the fingerprints' DNN and the descriptors' DNN is \(0.972 \pm 0.003\)). Since the fingerprints' DNN has similar performance and can be trained much faster, we can conclude that the use of blending has not provided any value for RT prediction.
Figure 5 shows that models that did not employ Bayesian search (lightGBM and CatBoost) perform worse, which suggests the usefulness of this procedure. These were also the models that benefited from using both descriptors and fingerprints; in the other models using both types of features together had a performance similar to using only the descriptors. In the literature there are both works reporting that fingerprints outperform molecular descriptors (e.g., [21]) and works claiming just the opposite (e.g., [24]). Our results slightly favor the usage of fingerprints, although it cannot be ruled out that the best type of feature depends on the machine learning regressor used.
Regarding the experiments where the weights of the non-retained molecules were varied within the CatBoost regressor, Fig. 5 shows that increasing the importance of these molecules (large weights) yields worse MAE results for the retained molecules. As expected, large weights yield some improvement in the performance of the non-retained molecules (see Fig. 5). However, the large values of MAE for the non-retained molecules indicate that the regressors are not able to reliably distinguish non-retained molecules from retained ones. This also explains why the usage of different weighted CatBoost regressors did not have the expected impact on the blender: it was expected that the blender would match the performance of the best regressor for non-retained molecules. However, this has not been observed, probably because the regressors fail to identify non-retained molecules and they tend to predict RTs as if the molecule was retained, even if it is not. This can be confirmed by inspecting the performance of the classifier trained to predict if a molecule will be retained or not (see "Preprocessing of descriptors and fingerprints" Section). Although the classifier has large specificity (\(0.9953 \pm 0.0005\)), precision and recall are low (\(0.74 \pm 0.03\) and \(0.512 \pm 0.016\), respectively), which highlights the difficulty in properly identifying non-retained molecules.
Meta-learning-based projections
The experiments suggest that the method to project the predicted RTs to a specific CM is able to provide proper projections using as few as 10 or 20 training points. In this range of training points, the meta-learned GP shows similar or slightly better MAE and MEDAE than other state-of-the-art methods (Fig. 7). Regarding the prediction intervals, it has the best performance in three of the four CMs (Fig. 8).
Being able to train the projection model from a few training points is key for real-world applications, since it avoids the need to identify a large number of molecules. Note that in this small-data regime, the predictions are mainly driven by the prior learned during meta-learning. This can be seen by looking at the upper confidence interval for the FEM long CM in Fig. 9, which seems larger than needed. With more training points, the GP is flexible enough to reduce uncertainty around the training points, adjusting to the actual dispersion of the CM, as shown by the trend for the FEM long system in Fig. 8. Remarkably, and although the scaled interval scores tend to decrease with the number of training points, they are quite stable for the other systems. The ability of GPs to generate credible prediction intervals for the projections can be used to obtain probabilistic scores for the putative annotations, as shown in "Ranking candidate annotations with the projection function" Section.
Table 5 shows that, when using 50 training points, our projection method ranks the correct identity among the top three candidates in \(70\%\) of the cases, at a level similar to that of other projection methods [21]. When decreasing the number of training points to just 10 samples, the percentage is \(68\%\), while with 30 it is \(69\%\). This shows that meta-learning enables the creation of projection functions from just a few known metabolites. However, Table 5 also reveals large standard errors, which suggest that the projection functions are highly dependent on the training inputs.
An accurate predictive model and a projection function that can be learned from as few as 10 identified metabolites permit building a tool to support metabolite annotation following the scheme presented in Fig. 2. We intend to integrate such a tool into CEU Mass Mediator [5], a metabolite annotation platform that has 332,665 metabolites in its database, of which approximately 250,000 have no RT information in the SMRT dataset. When RTs are available, it will only be necessary to use the projection function to map the experimental RTs of the database to the RT of the CM of a given experiment. When no RT is available in the database, it will also be necessary to predict it using the best model achieved in this work (the DNN trained with fingerprints). The user of CEU Mass Mediator will only need to provide (1) the experimental RTs of the known molecules, whose identity should also be specified (by indicating their PubChem ID, InChI Key or similar), and (2) both the m/z and experimental RTs of the molecules to be annotated. This information can be uploaded to the tool's web page using text format. CEU Mass Mediator will then return the annotations compatible with the experimental data, ranked according to their z-scores.
Note that the use of the CEU Mass Mediator database avoids running the DNN in real time. On the other hand, the projection method is highly efficient thanks to the meta-learning approach: the learning of the GP prior parameters is accomplished in an offline task, and the target-task that has to be executed online to compute the posterior distributions runs in just tenths of seconds. Hence, both the predictive model and the meta-learned projection function can be integrated into the workflow of Fig. 2 with negligible computational overhead. Furthermore, that workflow could be combined with an in silico MS/MS-based annotation approach when MS/MS data is available. In that scenario, the top candidates predicted by the model could be fed into tools that match them to the experimental MS/MS data, followed by a re-ranking based on both RT and MS/MS predictions.
Availability of data and materials
The software supporting the conclusions of this article is available in the constantinogarcia/cmmrt GitHub repository (https://github.com/constantinogarcia/cmmrt). The software is distributed as a Python 3 package (platform independent) and also provides a Makefile for running the most important actions (i.e., installing dependencies, training and validating regressors, and training and validating projections). Furthermore, the fingerprints and descriptors generated with alvaDesc are automatically downloaded when running the software.
References
Chaleckis R, Meister I, Zhang P, Wheelock CE (2019) Challenges, progress and promises of metabolite annotation for LCMSbased metabolomics. Curr Opin Biotechnol 55:44â€“50
Bach E, Rogers S, Williamson J, Rousu J (2021) Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification. Bioinformatics 37(12):1724â€“1731
Aksenov AA, da Silva R, Knight R, Lopes NP, Dorrestein PC (2017) Global chemical analysis of biology by mass spectrometry. Nat Rev Chem 1(7):1â€“20
Guillevic M, Guillevic A, Vollmer MK, Schlauri P, Hill M, Emmenegger L, Reimann S (2021) Automated fragment formula annotation for electron ionisation, high resolution mass spectrometry: application to atmospheric measurements of halocarbons. J Cheminformatics 13(1):1â€“27
GildelaFuente A, Godzien J, Saugar S, GarciaCarmona R, Badran H, Wishart DS, Barbas C, Otero A (2018) Ceu mass mediator 3.0: a metabolite annotation tool. J Proteome Res 18(2):797â€“802
Kind T, Tsugawa H, Cajka T, Ma Y, Lai Z, Mehta SS, Wohlgemuth G, Barupal DK, Showalter MR, Arita M et al (2018) Identification of small molecules using accurate mass MS/MS search. Mass Spectrom Rev 37(4):513â€“532
BlaĹľenoviÄ‡ I, Kind T, Ji J, Fiehn O (2018) Software tools and approaches for compound identification of LCMS/MS data in metabolomics. Metabolites 8(2):31
Stanstrup J, Neumann S, VrhovĹˇek U (2015) Predret: prediction of retention time by direct mapping between multiple chromatographic systems. Anal Chem 87(18):9421â€“9428
Pawellek R, Krmar J, Leistner A, DjajiÄ‡ N, OtaĹˇeviÄ‡ B, ProtiÄ‡ A, Holzgrabe U (2021) Charged aerosol detector response modeling for fatty acids based on experimental settings and molecular features: a machine learning approach. J Cheminformatics 13(1):1â€“14
Collins CR, Gordon GJ, Von Lilienfeld OA, Yaron DJ (2018) Constant size descriptors for accurate machine learning models of molecular properties. J Chem Phys 148(24):241718
Li X, Fourches D (2020) Inductive transfer learning for molecular activity prediction: Nextgen qsar models with molpmofit. J Cheminformatics 12(1):1â€“15
DjoumbouFeunang Y, Fiamoncini J, GildelaFuente A, Greiner R, Manach C, Wishart DS (2019) Biotransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification. J Cheminformatics 11(1):1â€“25
Moruz L, KĂ¤ll L (2017) Peptide retention time prediction. Mass Spectrom Rev 36(5):615â€“623
Ma C, Ren Y, Yang J, Ren Z, Yang H, Liu S (2018) Improved peptide retention time prediction in liquid chromatography through deep learning. Anal Chem 90(18):10881â€“10888
Aicheler F, Li J, Hoene M, Lehmann R, Xu G, Kohlbacher O (2015) Retention time prediction improves identification in nontargeted lipidomics approaches. Anal Chem 87(15):7698â€“7704
Tsugawa H, Ikeda K, Tanaka W, Senoo Y, Arita M, Arita M (2017) Comprehensive identification of sphingolipid species by in silico retention time and tandem mass spectral library. J Cheminformatics 9(1):1â€“12
Witting M, BĂ¶cker S (2020) Current status of retention time prediction in metabolite identification. J Sep Sci 43(9â€“10):1746â€“1754
Maboudi Afkham H, Qiu X, The M, KĂ¤ll L (2017) Uncertainty estimation of predictions of peptidesâ€™ chromatographic retention times in shotgun proteomics. Bioinformatics 33(4):508â€“513
Bach E, Szedmak S, Brouard C, BĂ¶cker S, Rousu J (2018) Liquidchromatography retention order prediction for metabolite identification. Bioinformatics 34(17):875â€“883
Liu JJ, Alipuly A, BÄ…czek T, Wong MW, Ĺ˝uvela P (2019) Quantitative structureretention relationships with nonlinear programming for prediction of chromatographic elution order. Int J Mol Sci 20(14):3443
DomingoAlmenara X, Guijas C, Billings E, MontenegroBurke JR, Uritboonthai W, Aisporna AE, Chen E, Benton HP, Siuzdak G (2019) The metlin small molecule dataset for machine learningbased retention time prediction. Nat Commun 10(1):1â€“9
Bouwmeester R, Martens L, Degroeve S (2019) Comprehensive and empirical evaluation of machine learning algorithms for small molecule lc retention time prediction. Anal Chem 91(5):3694â€“3703
Naylor BC, Catrow JL, Maschek JA, Cox JE (2020) Qsrr automator: a tool for automating retention time prediction in lipidomics and metabolomics. Metabolites 10(6):237
Osipenko S, Bashkirova I, Sosnin S, Kovaleva O, Fedorov M, Nikolaev E, Kostyukevich Y (2020) Machine learning to predict retention time of small molecules in nano-HPLC. Anal Bioanal Chem 412(28):7767–7776
Bouwmeester R, Martens L, Degroeve S (2020) Generalized calibration across liquid chromatography setups for generic prediction of small-molecule retention times. Anal Chem 92(9):6571–6578
Yang Q, Ji H, Lu H, Zhang Z (2021) Prediction of liquid chromatographic retention time with graph neural networks to assist in small molecule identification. Anal Chem 93(4):2200–2206
Ozaki Y, Tanigaki Y, Watanabe S, Onishi M (2020) Multiobjective tree-structured Parzen estimator for computationally expensive optimization problems. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pp. 533–541
Dalby A, Nourse JG, Hounshell WD, Gushurst AK, Grier DL, Leland BA, Laufer J (1992) Description of several chemical structure file formats used by computer programs developed at molecular design limited. J Chem Inf Comput Sci 32(3):244–255
Mauri A (2020) AlvaDesc: A tool to calculate and analyze molecular descriptors and fingerprints. Ecotoxicological QSARs. Springer, New York, pp 801–820
Alvascience: AlvaDesc (software for Molecular Descriptors Calculation). https://www.alvascience.com. Accessed 30 May 2022
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754
Alvascience (2021) alvaDesc Molecular Descriptors. https://www.alvascience.com/alvadescdescriptors/. Accessed 15 Jun 2021
Dong X, Yu Z, Cao W, Shi Y, Ma Q (2020) A survey on ensemble learning. Front Comp Sci 14(2):241–258
Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY (2017) LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 30:3146–3154
Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A (2018) CatBoost: Unbiased boosting with categorical features. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18, pp. 6639–6649. Curran Associates Inc., Red Hook, NY, USA
Schifferer B, Titericz G, Deotte C, Henkel C, Onodera K, Liu J, Tunguz B, Oldridge E, Moreira De Souza Pereira G, Erdem A (2020) GPU accelerated feature engineering and training for recommender systems. Proc Recomm Syst Challen 2020:16–23
Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AW, Bridgland A et al (2020) Improved protein structure prediction using potentials from deep learning. Nature 577(7792):706–710
Loshchilov I, Hutter F (2017) SGDR: stochastic gradient descent with warm restarts. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, Conference Track Proceedings
Izmailov P, Wilson A, Podoprikhin D, Vetrov D, Garipov T (2018) Averaging weights leads to wider optima and better generalization. In: 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pp. 876–885
Wilson AG, Hu Z, Salakhutdinov R, Xing EP (2016) Deep kernel learning. In: Artificial intelligence and statistics, PMLR, pp. 370–378
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, PMLR, pp. 448–456
Wilson A, Nickisch H (2015) Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In: International Conference on Machine Learning, PMLR, pp. 1775–1784
Wilson A, Adams R (2013) Gaussian process kernels for pattern discovery and extrapolation. In: International Conference on Machine Learning, PMLR, pp. 1067–1075
Töscher A, Jahrer M, Bell RM (2009) The BigChaos solution to the Netflix grand prize. Netflix prize documentation. https://www.researchgate.net/publication/223460749_The_BigChaos_Solution_to_the_Netflix_Grand_Prize
Wolpert DH (1992) Stacked generalization. Neural Netw 5(2):241–259
Stone M (1974) Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Series B Stat Methodol 36(2):111–133
Qin Y, Zhang W, Zhao C, Wang Z, Zhu X, Shi J, Qi G, Lei Z (2021) Prior-knowledge and attention based meta-learning for few-shot learning. Knowl Based Syst 213:106609
Wahba G (1990) Spline models for observational data. SIAM, Philadelphia
Fortuin V, Rätsch G (2019) Deep mean functions for meta-learning in Gaussian processes. arXiv preprint arXiv:1901.08098
Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 102(477):359–378
Wishart DS, Feunang YD, Marcu A, Guo AC, Liang K, Vázquez-Fresno R, Sajed T, Johnson D, Li C, Karu N et al (2018) HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res 46(D1):608–617
Funding
This work was supported by the Ministry of Science, Innovation and Universities of Spain (MICINN) and FEDER funds (Ref. RTI2018-095166-B-I00).
Author information
Contributions
CAG contributed to the conceptualization of work, software development, literature review, critical discussion, and paper writing. AGDLF contributed to the software development, critical discussion, and paper reviewing. CB contributed to the conceptualization of work, critical discussion, and paper reviewing. AO contributed to the conceptualization of work, critical discussion, and paper writing. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Additional details on the experiments. Section S1 provides a brief description of the kernels tested with the DKL methods and the meta-learned GPs. Section S2 illustrates the training and validation procedures for all Retention Time (RT) regressors. Section S3 describes an additional experiment studying the influence of the number of meta-tasks on the performance of meta-learned GPs. Section S3 shows the performance of the machine learning models used to predict RTs using the MEDAE metric. Section S4 shows the performance of different projection methods using the MEDAE metric.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
García, C.A., Gil-de-la-Fuente, A., Barbas, C. et al. Probabilistic metabolite annotation using retention time prediction and meta-learned projections. J Cheminform 14, 33 (2022). https://doi.org/10.1186/s13321-022-00613-8
DOI: https://doi.org/10.1186/s13321-022-00613-8
Keywords
 Metabolomics
 Retention time
 Machine learning
 Bayesian methods
 Deep learning