A large set of MS/MS spectra was retrieved from GNPS [16] and subsequently curated and cleaned using matchms [18] (see “Methods”). The resulting training data set contains chemical structure annotations for 109,734 spectra, in the form of SMILES [19] and/or InChI [20], which allowed us to create molecular fingerprints to quantify structural similarities of spectral pairs. The dataset contains 15,062 different molecules (disregarding stereoisomerism—as represented by InChIKeys unique in the first 14 characters).

We randomly took out 500 of the 15,062 InChIKeys to form a validation set and again 500 to form a test set (see also “Methods”). The validation set (3597 spectra of 500 unique InChIKeys) is used to monitor the model training process and explore the key hyperparameters while the test set (3601 spectra of 500 unique InChIKeys) is used for a final unbiased evaluation of our model. Drawing pairs of spectra from the training set (102,536 spectra of 14,062 unique molecules), we trained a Siamese neural network to predict Tanimoto scores as depicted in Fig. 1.

A key challenge when training a neural network to predict Tanimoto scores is that the total set of possible spectrum pairs shows a highly unbalanced distribution in structural similarities, with most pairs displaying low structural similarity. Our procedure for drawing spectral pairs compensates for the unbalanced nature of the data, by selecting pairs with probabilities that are weighted according to their structural similarity, as described in detail in the Methods section. We further applied L1, L2 and dropout regularization, as well as data augmentation techniques to ensure the generalization of the model to unseen data.

### MS2DeepScore predicts Tanimoto scores with high accuracy

In real-world applications of the model, it is acceptable that there is a small error in estimation of the structural similarity between two spectra, while outliers with large errors should be avoided. Therefore, the root mean squared error (RMSE) was used as an overall evaluation metric, since it penalizes large errors on individual samples. In addition, the model should ideally perform well across the full range of possible pair similarities, which for the here used Tanimoto and Dice scores lies between 0 and 1. However, the datasets are highly unbalanced in that respect, since most spectrum pairs have low Tanimoto scores (Fig. 3A). We hence decided to inspect the model accuracy not as a global average since that would strongly bias the outcome to the performance on low Tanimoto pairs. Instead, we split all possible spectral pairs into 10 equally spaced Tanimoto score bins. In Fig. 2A we display the distributions of the predicted Tanimoto scores for each bin, which reveals that the individual distributions show a high overlap with the correct Tanimoto scores. As expected, the prediction is not perfect. The distributions show long tails of predictions that differ from the true structural similarities. Looking at root mean squared errors (RMSE) across all Tanimoto score bins, it can be noted that MS2DeepScore generally performs very well and can predict Tanimoto scores between 0.1 and 0.9 with a RMSE between 0.13 and 0.2. Accuracy is lower for the highest and lowest Tanimoto scores, which may partly be attributed to the regression to the mean effect (the training loss makes it unattractive to approach the upper and lower score limit). Another reason for the false high Tanimoto score predictions can arise from spectra that differ too much from the training data the model has seen (see also the later described effect of assessing the model uncertainty). The highest Tanimoto scores show a relatively long tail indicating more frequent wrong predictions (Fig. 2A). Predictions are also slightly more spread out for Tanimoto scores around 0.6–0.8, a range with relatively few occurrences and hence less training data in the dataset. Other underlying reasons cannot be ruled out at this point, such as poorer correlation between fragmentation information and actual structural similarity scores. Taken together, MS2DeepScore is highly reliable in separating high, mid, and low structural similarity pairs (see also Additional file 1: Figure S1), but it might be more error prone when it comes to smaller nuances.

### Accurate prediction of different structural similarity measures

Molecular fingerprints come in many different types and flavors which typically do not work equally well for all compound classes [7, 8, 21], and there is no general consensus on which molecular fingerprint to use (although MAP4 was recently said to “rule them all” [21]). 2D fingerprints can be charaterized as belonging to one of four classes [7, 22], two of which are represented in the present work. Those were circular fingerprints (we used Morgan-2 and Morgan-3, the RDKit [23] pendants of ECFP-4 and ECFP-6) as well as topological fingerprints (we used Daylight fingerprints in RDKit). All fingerprints were computed with 2048 bits. Then there are also different metrics to compute the similarity between two molecular fingerprints [24]. Arguably, Tanimoto (= Jaccard) is the most common way to compare molecular fingerprints, other metrics such as Dice scores are occasionally used. To show that our approach generalizes to a variety of structural similarity scores, we trained and tested the MS2DeepScore model for the three different fingerprints (Morgan-2, Morgan-3 and Daylight) as well as 2 different similarity measures (Tanimoto and Dice). Overall, MS2DeepScore can make accurate predictions for all tested structural similarity measures (Fig. 3A). In addition, we observed that the precise distributions of all occurring structural similarity scores vary considerably when different structural similarity measures are used (Fig. 3B). Structural similarity measures by Tanimoto scores from circular fingerprints (Morgan-2 and 3) have a much higher tendency towards low scores when compared to Tanimoto scores from RDKit Daylight fingerprints. This can partly be adjusted by switching from Tanimoto to other metrics such as using a Dice score. Overall, we found that MS2DeepScore performs slightly better when predicting structural similarity scores with a less skewed distribution. This is to be expected since such scores display far more instances of moderate to high scores in the entire training dataset, for instance Tanimoto scores on daylight fingerprints result in 10–100 times more pairs across scores in the range from 0.2 to 0.9 when compared to Tanimoto scores on Morgan-3 fingerprints (Fig. 3A). Due to its less skewed distribution, we decided to mainly use Tanimoto scores on RDKit Daylight fingerprints (2048bits) for evaluating stuctural similarity (unless noted otherwise). It is important to note, though, that linear correlation coefficients between predicted and actual scores were generally high for the pairs with the highest structural similarities (see Additional file 1: Table S1).

### Detecting chemically related pairs: comparison to common mass spectral similarity measures

Due to the low number of available correctly or reliably annotated mass spectra, many analysis pipelines must rely on mass spectral similarity measures. A classical way to compare MS/MS mass spectra is to quantify the fraction of shared peaks as done by using variations of cosine-based similarity scores. They come in many types and flavors [6, 25], but typically rely on multiplying intensities of matching peaks. Those pairs of matching peaks between two spectra are usually computed as so-called assignment problems based on set m/z tolerances. One particular variant of such scores is the ‘modified cosine score’ which also allows matching peaks that were shifted by the difference in precursor m/z [26]. Those measures tend to work well for very similar spectra, i.e. with many identical peaks. We recently introduced Spec2Vec, an unsupervised machine learning approach for computing spectrum similarities based on learned relationships between peaks across large training datasets [6]. Spec2Vec based similarity scores were observed to correlate more strongly than classical cosine-like scores with structural similarities between the underlying compounds. An additional advantage is its fast computation, which allows to compare spectra against very large libraries. While trained on spectral data, Spec2Vec used an unsupervised method, meaning that it was trained on non-annotated data and did not make use of the structural information.

With MS2DeepScore, we now make use of the structural information that we have for a large fraction of the training data. Unlike Spec2Vec, which is trained to learn relationships between peaks from peak co-occurrences, and unlike modified Cosine, which computes the maximum overlap of matching peaks, MS2DeepScores is specifically trained to predict structural similarity scores. The ability of those different scores to identify chemical relatedness can thus not simply be compared by measuring their ability to predict Tanimoto score. In practice, however, all such scores are all used to identify chemically closely related compounds. Modified Cosine and Spec2Vec scores are for instance used to generate molecular networks in GNPS [4, 6]. We therefore tested the scores’ ability to detect chemically related compounds by counting identified chemically related pairs within the test set (3601 spectra). Since “chemically related” is a hard to define concept, we simply operated with a fixed Tanimoto score threshold of 0.6 above which we call two compounds related. We then computed the precision and recall (see “Methods”) for finding structurally related compounds for all spectrum pairs above a threshold for the spectral similarity score which could be either MS2DeepScore, Spec2Vec, or modified Cosine (Fig. 4). This reveals that MS2DeepScore clearly outperforms both classical measures (two forms of the modified Cosine) as well as the unsupervised spectral similarity measure Spec2Vec, with respect to identifying high Tanimoto pairs, which can also be seen in the overall distribution of scores (Additional file 1: Figure S1). This makes MS2DeepScore a very promising approach for searching analogues in large datasets.

### Combining different mass spectra for the same compound decreases Tanimoto score prediction error further

In many applications, such as library matching or analogue searching, datasets will frequently contain multiple mass spectra for a given compound. This is also the case for the data retrieved from GNPS (see “Methods”). The test set, for instance, contains 3601 spectra of 500 unique compounds (ignoring stereoisomerism). We hence tested whether structural similarity score predictions can be improved by taking the median of the scores calculated for different pairs of spectra corresponding to the same compound pairs. This can indeed be seen on the test set (Fig. 5, compare red and dark blue lines).

The improvement in accuracy becomes even more pronounced when removing potential outliers based on the interquartile range (IQR) of all predictions for the same pair of molecules. In particular high and low Tanimoto score predictions become notably more reliable, even at comparably high threshold IQRs (Fig. 5A, e.g. threshold < 0.2 which corresponds to 93% of all scores). Given the considerable improvement of the structural similarity prediction, we expect that this use of multiple predictions for mass spectra of the same compound pair can be applied successfully in practice, e.g. for library matching or analogue search, or when measuring multiple spectra of the same compound at various collision energies.

### Using Monte-Carlo dropout ensemble models to estimate prediction uncertainty

Using ensembles of multiple machine learning models is a frequently used technique for improving machine learning results, but also to assess the model uncertainty (also referred to as epistemic uncertainty). Ensembles can be built in many ways, but one particularly efficient ensemble learning technique for neural networks is Monte-Carlo dropout [17]. It makes use of the dropout layers in a network to randomly silence a fraction of the nodes for each inference step. Traditionally, dropout is only activated during model training, but in Monte-Carlo dropout it stays in place when making actual predictions, which can be interpreted as a random sampling technique across a virtually unlimited set of model variations. Since the neural network architecture used for MS2DeepScore includes dropout layers (Fig. 1), it is straightforward to do such ensemble learning (Fig. 6).

For a given spectrum, we compute *n* different embeddings, each from a slightly different version of the base neural network where 20% of its nodes are silenced (dropout rate = 0.2). For a pair of spectra this results in n*n Tanimoto score predictions from which an ensemble score as well as a dispersion measure to assess the prediction certainty can be calculated. To be less sensitive to outliers, we chose to take the median score, rather than the mean score. The prediction uncertainty is measured by the interquartile range (25–75%) which is more suited than the median absolute deviation for non-symmetric distributions [27]. This is also very accessible computationally since only *n* embeddings need to be generated per spectrum to obtain a total of n × n independent Tanimoto score predictions. Inference will hence only take 10 × longer for an ensemble of 100 predictions.

We tested the resulting uncertainty estimate on all possible pairs within the 3601 spectra of our test set. Taking the median of an ensemble of 100 scores already results in an overall drop in prediction error across nearly all Tanimoto score bins (Fig. 7B, red vs. blue line). We then filtered out scores, according to increasingly stringent interquartile range (IQR) thresholds. Over the entire dataset, this approach leads to a large decrease in prediction error (Fig. 7A) but comes at the cost of a lower retrieval rate which we here define as the fraction of total scores for which the IQR is below the set threshold. For instance, all predictions within IQR < 0.025—which will discard about 75% of the scores—will result in a drop of the average RMSE from about 0.17 to about 0.11 (Fig. 7A). It is important to note, though, that this average gain in precision is not distributed equally across the full range of Tanimoto scores. The RMSE drops most significantly in the low (< 0.4) and high (> 0.8) Tanimoto score range (Fig. 7B), while the error slightly increases in the mid score range (0.5–0.7).

### Embedding based mass spectral clustering

Unlike recent approaches to predict molecular fingerprints using deep learning [11, 12], we have chosen to train a Siamese neural network [15] to directly predict Tanimoto similarities. A key feature of our neural network design is the creation of abstract embedding vectors for each input spectrum (Fig. 1). This has two main benefits. First, it allows to scale similarity calculations much more efficiently to very large numbers of spectrum pair calculations by separating the mass spectrum embedding creation step from the actual similarity score calculation. The embedding creation includes the mass spectrum binning as well as the inference step with the ‘base’ neural network (Fig. 1) and is computationally far more expensive than the actual score calculation. Because embedding creation only needs to happen once for each spectrum instead of for each pair, this vastly reduces computational cost. As an example: predicting all possible similarity scores between the 3601 spectra in the test set (6,485,401 unique pairs) took 5–10 min on an Intel i7-8550U CPU. The second reason for choosing this network architecture design is that such embeddings can have additional value beyond the Tanimoto score prediction. Even though they do not directly correspond to any conventional molecular fingerprint, they are trained to support a prediction of a fingerprint-based similarity score, and therefore we hypothesize that they will contain features that reflect chemical properties. Embeddings also facilitate rapid large-scale comparisons beyond simple pair-wise similarity computations. To illustrate possible future use-cases, we ran t-SNE [28] as implemented in scikit-learn [29] on the 200-dimensional embeddings of the test set (3601 spectra). This algorithm provides x,y-coordinates for every spectrum in the test set which we plotted and colored according to the 14 chemical superclasses provided by ClassyFire [30] (Fig. 8a). Molecules of the same chemical class tend to cluster together in the resulting t-SNE plot, confirming that the MS2DeepScore embeddings represent chemically meaningful molecular features. Figures 8b, c show that this conclusion also holds on a more detailed level, by zooming into a small region of the t-SNE plot (Fig. 8b) and coloring according to ClassyFire subclasses (Fig. 8c). To put those plots in perspective, side-by-side comparisons with t-SNE plots based on modified cosine scores are provided in Additional file 1: Figure S11.

### MS2DeepScore python library

MS2DeepScore is available as an easily installable Python library running on Python 3.7 and 3.8. Source code and installation instructions can be found on GitHub (https://github.com/matchms/ms2deepscore). The presented results were obtained using version 0.2.0. MS2DeepScore is integrated in matchms, a recently developed Python library for mass spectra import, handling and comparisons [18].